Archives: February 2010

New interest in MARC-HASH / JSON

February 26, 2010 at 12:29 am | Category: Uncategorized

EDIT: This is historical — the recommended serialization for marc in json is now Ross Singer’s marc-in-json. The marc-in-json serialization has implementations in the core marc libraries for Ruby and PHP, and add-ons for Perl and Java. C’mon, Python people!

For reasons I’m still not entirely clear on (I wasn’t there), the Code4Lib 2010 conference this week inspired renewed interest in a JSON-based format for MARC data.

When I initially looked at MARC-HASH almost a year ago, I was mostly looking for something that wasn’t such a pain in the butt to work with, something that would marshall into multiple formats easily and would simply and easily round-trip.

Now, though, a lot of us are looking for a MARC format that (a) doesn’t suffer from the length limitations of binary MARC, but (b) is less painful (both in code and processing time) than MARC-XML, and it’s worth re-visiting.

For at least a few folks, un-marshaling time is a factor, since no matter what you’re doing, processing XML is slower than most other options. Much of that expense is due to functionality and data-safety features that just aren’t a big win with a brain-dead format like MARC, so it’s worth looking at alternatives.

What is MARC-HASH?

At some point, we’ll want a real spec, but right now it’s just this:

  # A record is a four-pair hash, as follows. UTF-8 is mandatory.
  {
    "type" : "marc-hash",
    "version" : [1, 0],
    "leader" : "…leader string … ",
    "fields" : [array, of, fields]
  }

  # A field is an array of either 2 or 4 elements
  [tag, value] # a control field
  [tag, ind1, ind2, [array, of, subfields]] # a data field

  # A subfield is an array of two elements
  [code, value]

So, a short example:

  {
    "type" : "marc-hash",
    "version" : [1, 0],

    "leader" : "leader string",
    "fields" : [
      ["001", "001 value"],
      ["002", "002 value"],
      ["010", " ", " ",
        [
          ["a", "68009499"]
        ]
      ],
      ["035", " ", " ",
        [
          ["a", "(RLIN)MIUG0000733-B"]
        ]
      ],
      ["035", " ", " ",
        [
          ["a", "(CaOTULAS)159818014"]
        ]
      ],
      ["245", "1", "0",
        [
          ["a", "Capitalism, primitive and modern;"],
          ["b", "some aspects of Tolai economic growth"],
          ["c", "[by] T. Scarlett Epstein."]
        ]
      ]
    ]
  }
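To make the shape concrete, here is a minimal sketch in plain Ruby using only the standard json library. It builds a cut-down version of the record above, round-trips it through JSON, and pulls out a subfield. The helper names control_field? and subfield_values are made up for illustration; they are not part of ruby-marc or any other library.

```ruby
require 'json'

# A cut-down version of the example record, as plain Ruby data.
record = {
  "type"    => "marc-hash",
  "version" => [1, 0],
  "leader"  => "leader string",
  "fields"  => [
    ["001", "001 value"],
    ["245", "1", "0",
      [
        ["a", "Capitalism, primitive and modern;"],
        ["b", "some aspects of Tolai economic growth"]
      ]
    ]
  ]
}

# Hypothetical helper: control fields have 2 elements, data fields have 4.
def control_field?(field)
  field.length == 2
end

# Hypothetical helper: every value for a tag/subfield code, in record order.
def subfield_values(record, tag, code)
  record["fields"]
    .select { |f| f[0] == tag && !control_field?(f) }
    .flat_map { |f| f[3].select { |c, _v| c == code }.map { |_c, v| v } }
end

json       = JSON.generate(record)  # marshal
round_trip = JSON.parse(json)       # unmarshal

puts round_trip == record
puts subfield_values(round_trip, "245", "a").inspect
```

Because everything is arrays and strings (plus the one outer hash), the parsed structure compares equal to the original, and field and subfield order survive the trip.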

How's the speed?

I think it's important to separate the format marc-hash from the eventual marshaling format -- partly because someone might actually want to use something other than JSON as a final format, but mostly because it forces implementations to separate hash-creation from json-creation and makes it easy to swap in different json libraries when a faster one comes along.

Having said that, in real life people are mostly concerned about JSON. So, let's look at JSON performance.

The MARC-Binary and MARC-XML files are normal files, as you'd expect. The JSON file is "Newline-Delimited JSON" -- a single JSON record on each line.
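For the curious, a newline-delimited file needs very little code. The following is a rough sketch using only Ruby's standard library; write_ndj and each_ndj_record are names I made up for illustration, and a StringIO stands in for a real file.

```ruby
require 'json'
require 'stringio'

# Write one JSON document per line. JSON.generate escapes any newlines
# inside string values, so each record stays on a single line.
def write_ndj(io, records)
  records.each { |rec| io.puts(JSON.generate(rec)) }
end

# Yield one parsed record per non-blank line, so a large file can be
# streamed without loading the whole thing into memory.
def each_ndj_record(io)
  return enum_for(:each_ndj_record, io) unless block_given?
  io.each_line do |line|
    line = line.strip
    next if line.empty?
    yield JSON.parse(line)
  end
end

records = [
  { "type" => "marc-hash", "version" => [1, 0], "leader" => "leader one",
    "fields" => [["001", "rec1"]] },
  { "type" => "marc-hash", "version" => [1, 0], "leader" => "leader two",
    "fields" => [["001", "line one\nline two"]] }
]

io = StringIO.new
write_ndj(io, records)
io.rewind
read_back = each_ndj_record(io).to_a
puts read_back == records
```

Swapping in a faster JSON library only touches the two calls to JSON.generate and JSON.parse, which is the point of keeping the hash format separate from the serialization.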

The benchmark code looks like this:

  # Unmarshal
  x.report("MARC Binary") do
    reader = MARC::Reader.new('test.mrc')
    reader.each do |r|
      title = r['245']['a']
    end
  end

  # Marshal
  x.report("MARC Binary") do
    reader = MARC::Reader.new('test.mrc')
    writer = MARC::Writer.new('benchout.mrc')
    reader.each do |r|
      writer.write(r)
    end
    writer.close
  end

Under MRI, I used the nokogiri XML parser and the yajl JSON gem. Under JRuby, it was the jstax XML parser and the json-jruby JSON gem.

The test file is a set of 18,831 records I've been using for all my benchmarking of late. It's nothing special; just a nice size.

Marshalling Speed (read from binary marc, dump to given format)

Times are in seconds on my MacBook laptop, using ruby-marc.

  Format        Ruby 1.8.7   Ruby 1.9   JRuby 1.4   JRuby 1.4 --1.9
  XML           393          443        188         356
  MARC Binary   36           23         23          25
  JSON/NDJ      31           19         25          ERROR

Unmarshalling speed (from pre-created file)

Again, times are in seconds

  Format        Ruby 1.8.7   Ruby 1.9   JRuby 1.4   JRuby 1.4 --1.9
  XML           113          89         75          89
  MARC Binary   29           16         16          19
  JSON/NDJ      17           9          13          16

And so...

I'm not sure what else to say. The format is totally brain-dead. It round-trips. It's fast enough. It has no length limitations. We define it as requiring UTF-8. The NDJ format is easy to understand and allows streaming of large document collections.

If folks are interested in implementing this across other libraries, that'd be great. Any thoughts?

11 Responses to “New interest in MARC-HASH / JSON”

  1. What’s up with the ERROR on marshalling to json under jruby? Nothing there that shouldn’t work under jruby, I wouldn’t think? I’m confident that the actual performance metrics aren’t going to be different enough under jruby to affect the “win” conclusion, but we would of course want to make sure that ruby could do the serialization under jruby!

  2. [...] fact, I know of a couple people who had this idea of marc-json, but Bill Dueber did a little proto- mini- spec for a standard way to do marc in json, so different people writing tools can do it can be [...]

  3. Adding on to the proto-mini-spec, we should be clear that a ‘blank’ indicator is represented as an ASCII space, yes?

    And likewise for the MARC “fill” character in fixed fields — which in marc8 is just ASCII 7C, the “|” char, so I guess should still be represented as a |, it just probably deserves its own mention since it’s so weird.

  4. Naomi Dushay says:

    by “hash” do you mean “array”? Because order matters in Marc, but ruby hashes do not guarantee order, right?

  5. Bill says:

    Heh. Yeah, the major element in the hash is a big ol’ array of fields, composed of arrays of subfields. My original take on MARC-Hash was mostly as a hash, and resulted in my first real, painful schooling in how batshit-insane MARC is.

  6. GregPendlebury says:

    Bill, I’d be interested in hearing if anyone has put up a central site for this in terms of schema definition and/or collaboration. This is an area I think could gain some traction here, but going off on our own without a public schema doesn’t seem productive.

  7. [...] Should support output in Marc21, MarcXML, or Bill Dueber’s Marc in Json proto-spec. http://robotlibrarian.billdueber.com/new-interest-in-marc-hash-json/ [...]

  8. File_MARC 0.6.0 – now offering two tasty flavours of MARC-as-JSON output…

    I’ve just released the PHP PEAR library File_MARC 0.6.0. This release brings two JSON serialization output methods for MARC to the table: toJSONHash() returns JSON that adheres to Bill Dueber’s proposal for the array-oriented MARC-HASH JSON format at…

  9. [...] far we have two well-publicized suggestions: one by Bill Dueber, at the University of Michigan; and one by Andrew Houghton, who works at OCLC Research. They are quite different and each have [...]

  10. [...] have been a few proposals for a MARC data structure that can easily be serialized to JSON (I had my own, in fact), but the stuff Ross has done with marc-in-json is preferable in being (a) not a ton [...]

NOTE 2: It turns out that I did find a minor bug in the system, but that in general LCCN normalization is working correctly. I just happened to hit a weirdness with a bad LCCN and a little bug in the parser on their end. Which is getting fixed. So…good news all around, and huge kudos to Xiaoming Liu for his quick response!
**NOTE** It strikes me that I haven’t seen a case where bad data results from sending a valid LCCN. The only verified problem is one of false negatives. Send a valid lccn, you’ll get back either good data or nothing (and the “nothing” might be in error). So, still a big problem, but not as THESKYISFALLING as I imply below.

A long time ago, Jonathan Rochkind noted that the OCLC doesn’t correctly normalize their LCCNs.

Well, it’s not fixed.

I could really, really use the xlccn service right about now — a great web service they provide that, much like xisbn and xissn and the other xXXXX (heh!) services, purports to allow you to put in an lccn and get data back on the item you’re interested in.

Except they “normalize” their LCCNs in a way that is not only incorrect, but causes namespace collisions. As near as I can tell, they throw out any leading non-digits and only keep up to the next non-digit.

The xLCCN service will silently provide no data or incorrect data for many LCCN requests!

An example:

  • (F) Full LCCN is “sn 83011407”
  • (D) First set of digits is “83011407”. This is what I think the OCLC is indexing.
  • (N) Correct normalization is “sn83011407”

The problem, of course, is that (D) “83011407” is itself a valid LCCN.

  • (F) is associated with OCLC# 47212967
  • (D) is associated with OCLC# 12505148. That’s not the same record.
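To show the difference in code, here is a rough Ruby sketch of the two approaches. normalize_lccn follows the Library of Congress normalization rules as I understand them (drop blanks, drop everything from the first slash on, and zero-pad the part after a hyphen to six digits). first_digit_run models my guess at what OCLC is doing; it is an assumption, not their actual code. Both function names are made up for illustration.

```ruby
# Sketch of LC-style LCCN normalization (my reading of the rules).
# Returns nil when the part after a hyphen isn't 1-6 digits.
def normalize_lccn(raw)
  s = raw.gsub(/\s+/, '')        # 1. remove all blanks
  s = s.sub(%r{/.*\z}, '')       # 2. remove the first slash and everything after it
  prefix, suffix = s.split('-', 2)
  return s if suffix.nil?        # no hyphen: nothing more to do
  return nil unless suffix =~ /\A\d{1,6}\z/
  prefix + suffix.rjust(6, '0')  # 3. drop the hyphen, left-pad the suffix to 6 digits
end

# My guess at the broken behavior: keep only the first run of digits.
def first_digit_run(raw)
  raw[/\d+/]
end

full = "sn 83011407"
puts normalize_lccn(full)   # (N) keeps the "sn" prefix
puts first_digit_run(full)  # (D) collides with the unrelated LCCN 83011407
```

The first call prints sn83011407 and the second prints 83011407, which is exactly the collision described above.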

So, how do the OCLC services respond?

  • (F) Worldcat search finds correct (probably just doing a string match); xid finds nothing
  • (D) Worldcat finds both correct and incorrect records. The xLCCN service finds only the incorrect record, OCLC# 12505148.
  • (N) Neither worldcat nor xid finds anything for the correctly normalized version.

So, what am I supposed to do? Only use the service on LCCNs where the original and normalized versions are the same and include only digits? Frustrating.

One Response to “OCLC still not (NO! They are!) normalizing their LCCNs”

  1. Alice Sneary says:

    Thanks for sharing your frustration, Bill, and I’m glad you’ve called attention to it. I also see your posting to the OCLC Developer Network listserv, and we’re looking into things right now to get back with you.

[Note: in this post I'm just going to focus on the "get stuff into Solr" part. My normal focus -- MARC data -- will make an appearance in the next post when I talk about using this in addition to / instead of solrmarc.]

Working with Solr

I love me the Solr. I love everything about it except that the best way to interact with it is via Java. I don’t so much love me the java.

So…taking Erik Hatcher’s lead and advice, as I will do whenever he offers either, I wrote some code to work within JRuby to deal with Solr.

Getting the code

I’ve added the gems to gemcutter, if you want to play along at home:

  • jruby_producer_consumer (github, rdoc.info) Ruby syntax for threaded operations under jruby
  • jruby_streaming_update_solr_server (github, rdoc.info) Ruby syntax on top of the Java class of the same name
  • marc4j4r (github, rdoc.info) Ruby syntax on top of the marc4j java library.

WARNING: None of these gems have a 1.0 version tag on them, and that means that the API may change a titch in the future. Also, the fact that they’re released as gems means that it’s easy to release gems, not that I’m not an idiot.

The basics: Using SolrInputDocument and StreamingUpdateSolrServer

OK, with the disclaimer out of the way, let’s look at some code.

  require 'rubygems'
  require 'jruby_streaming_update_solr_server'

  solrurl = 'http://your.solr.server:port/solr'
  sussqueuesize = 24 # how many items to buffer on their way to solr
  sussthreads = 1    # how many threads to use to send stuff to solr

  suss = StreamingUpdateSolrServer.new(solrurl, sussqueuesize, sussthreads)

  # Let's add a simple document via a hash: A title, three authors, and a year

  h = {
    :title => "Never been deader",
    :author => ['Bill', 'Mike', 'Molly'],
    :year => 2003
  }
  suss << h
  suss.commit

  # YEA! You just added a document to solr and committed it.
  # Have a cookie!

  # We can also use a document object to do the same thing

  doc = SolrInputDocument.new
  # Add the title
  doc << ['title', 'Never been deader']

  # Add the first author
  doc << [:author, 'Bill']

  # Add more. Re-used keys mean you're adding additional values
  # Note values can be scalars or arrays

  doc << [:author, ['Mike', 'Molly']]

  # Add the wrong year using [] syntax
  doc[:year] = 2001

  # Oops! fix it. []= overwrites existing value(s)

  doc[:year] = 2003

  # Finally, we can merge a hash (or anything else that responds to
  # 'each_pair' with key-value pairs) into an existing doc.
  # The parentheses are required; without them Ruby parses { ... } as a block.

  doc.merge!({'author' => 'Ringo Starrre', 'publisher' => 'Vainity Books'})

  # Add it

  suss << doc

  # Commit and optimize if you'd like

  suss.commit
  suss.optimize # if you want

Nothing really fancy in there — just a few things worth noting:

  • A suss object will take a hash (again, anything that responds to #each_pair) or a SolrInputDocument
  • You can use either strings or symbols to represent Solr field names
  • Values can be either a single value, or an array of multiple values

And there are three ways to get data into a doc:

  • Via << [field, value(s)] (additive)
  • Via doc.merge! hash (additive)
  • Via doc[field] = value (replaces)
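If the additive-versus-replace distinction is fuzzy, here is a toy stand-in in plain Ruby that models the semantics described above. ToyDoc is entirely hypothetical: it is not the gem's SolrInputDocument and doesn't talk to Solr. It only illustrates how the three operations are meant to behave.

```ruby
# Toy model of the three ways to put values into a document.
# Field names may be strings or symbols; values may be scalars or arrays.
class ToyDoc
  def initialize
    @fields = Hash.new { |h, k| h[k] = [] }
  end

  # doc << [field, value(s)]  (additive)
  def <<(pair)
    field, values = pair
    @fields[field.to_s].concat(Array(values))
    self
  end

  # doc.merge!(hash)  (additive, one << per key)
  def merge!(hash)
    hash.each_pair { |field, values| self << [field, values] }
    self
  end

  # doc[field] = value(s)  (replaces whatever was there)
  def []=(field, values)
    @fields[field.to_s] = Array(values)
  end

  # Read back the values for a field (empty array if unset).
  def [](field)
    @fields.fetch(field.to_s, [])
  end
end

doc = ToyDoc.new
doc << ['title', 'Never been deader']
doc << [:author, 'Bill']
doc << [:author, ['Mike', 'Molly']]
doc[:year] = 2001
doc[:year] = 2003
doc.merge!({ 'author' => 'Ringo' })

puts doc[:author].inspect
puts doc[:year].inspect
```

After this runs, the author field holds all four names while the year field holds only 2003.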

Adding Threads

I also went down the garden path of threading things. There are an awful lot of operations that are not thread-safe (e.g., reading a line from a file) but once you’ve got a bunch of records to work with, turning them into Solr documents is usually thread-safe.

My model is that there’s a producer (usually the method #each) from an underlying data object. A thread takes whatever that method yields and sticks the values into a Java BlockingQueue awaiting consumption. You then use ProducerConsumer#threaded_each (or ProducerConsumer#threaded_each_with_index) to pull items out of the queue and do something useful with them.

I extracted stuff into a library (jruby_producer_consumer) for your viewing pleasure.
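If you want to see the model without installing anything, here is a minimal sketch in plain Ruby. It uses the standard library's SizedQueue as a stand-in for the Java BlockingQueue, and the method name threaded_each_sketch is made up; this is an illustration of the idea, not the gem's implementation.

```ruby
require 'thread'

# One producer thread walks the underlying object and fills a bounded queue;
# several consumer threads pop items and hand them to the caller's block.
def threaded_each_sketch(eachable, queue_size: 5, consumers: 3, each_method: :each)
  queue = SizedQueue.new(queue_size) # push blocks when full, pop blocks when empty
  done  = Object.new                 # sentinel meaning "no more items"

  producer = Thread.new do
    eachable.public_send(each_method) { |item| queue << item }
    consumers.times { queue << done } # one sentinel per consumer
  end

  workers = Array.new(consumers) do
    Thread.new do
      loop do
        item = queue.pop
        break if item.equal?(done)
        yield item
      end
    end
  end

  producer.join
  workers.each(&:join)
end

# Usage: double the numbers 1..10 on three consumer threads.
results = Queue.new # thread-safe collector
threaded_each_sketch(1..10, queue_size: 3) { |x| results << x * 2 }

doubled = []
doubled << results.pop until results.empty?
puts doubled.sort.inspect
```

Items may be consumed out of order, which is why the usage example sorts before printing. Under MRI the threads won't buy you much CPU parallelism, but the structure is the same.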

CONFUSION ALERT: It’s perhaps unfortunate that the object you send to ProducerConsumer.new(obj) must implement #each and that the ProducerConsumer method #threaded_each calls that underlying #each…well there’s a lot of #each‘s floating around. Keep them straight.

So…let’s look at some code to work with consumer threads.

  # Start off the same as before
  require 'rubygems'
  require 'jruby_streaming_update_solr_server'
  require 'jruby_producer_consumer'
  require 'marc4j4r'

  solrurl = 'http://your.solr.server:port/solr'
  sussqueuesize = 24 # how many items to buffer on their way to solr
  sussthreads = 2    # how many threads to use to send stuff to solr

  suss = StreamingUpdateSolrServer.new(solrurl, sussqueuesize, sussthreads)

  # I'll go ahead and use a MARC file as my example, but won't talk about the
  # MARC parts of it. All you need to know is that the reader object
  # implements #each

  reader = MARC4J4R.reader('test.xml', :marcxml)

  # Get a producer/consumer object with the reader at its base, using
  # the default method #each to get stuff out of it, and with the assumption
  # that we only need to keep the default 5 items in memory at a time to
  # keep up with consumption

  pc = ProducerConsumer.new(reader)

  # Get three threads to actually consume the things, turn them into solr
  # documents, and send them to solr (potentially out of order)

  numconsumerthreads = 3
  pc.threaded_each(numconsumerthreads).each do |r|
    suss << turn_marc_record_into_a_hash_or_solrdoc(r)
  end
  suss.commit

Again, not a lot happening here.

  • The “producer” is always one thread, because so little is thread-safe at the ‘each’ level. In this case, there’s a single thread pulling data out of the file and turning it into MARC records, which are added to the internal BlockingQueue. I buffer 5 of these at a pop (the default) so the consumer threads don’t starve. I presume that producing items is cheaper than consuming them, or else this library won’t help you much.
  • ProducerConsumer#threaded_each calls the #each method of the underlying object. You can substitute anything that yields, though, as in this example where I call #each_line instead of the default #each:
      queuesize = 5
      pc = ProducerConsumer.new(File.new('myfile.txt'), queuesize, :each_line)
  • Keep track of your threads. In the MARC example above, there is one thread getting MARC records and putting them into the PC buffer (no way to change that), three threads consuming those records and sticking them into the suss object, and another two pulling stuff out of the suss object and sending things to Solr. And, of course, there’s other stuff running on the computer, too. Experiment and figure out what works best for your hardware.
  • See the docs for how to mess with what goes into a ProducerConsumer object. It’s entirely possible to use, say, #each_slice. There’s also a convenience method #threaded_each_with_index, but it does not call the underlying #each_with_index; it produces its own index as things are read.

Feedback not only welcome but necessary!

I’ve done a lot of messing around with Ruby in the last 10 days or so, but I’m still basically converting from Perl in my head. Any comments, bug reports, or whatnot are definitely welcome!

Comments are closed.

Yea! My first gem ever released!

[YUCK! It was a disaster in a few ways! Don't look at this! It's hideous! There's a new jruby_producer_consumer gem on gemcutter that is slightly different from this in that it works. Ignore the stuff below.]

[In working on a threaded JRuby-based MARC-to-Solr project, I realized that my threading stuff was...ugly. And I didn't really understand it. So I dug in today and wrote this.]

I’ve just pushed to Gemcutter my first gem — a JRuby-only producer/consumer class that works with anything that provides #each called jruby_producer_consumer.

It’s JRuby-only because it uses (a) a blocking queue implementation that’s native Java, and (b) threading, which isn’t a huge win under regular Ruby.

There’s no testing there because I’m not sure how to test threaded stuff :-(

It is, I hope, easy to use:

  require 'rubygems'
  require 'jruby_producer_consumer'

  # Create a ProducerConsumer. Arguments are anything that implements #each
  # and the size for the underlying queue. For the former, I'll just use a Range object.

  eachable = 1..10
  queuesize = 3

  pc = ProducerConsumer.new(eachable, queuesize)

  # Just a method to show what happens
  def sample(consumerid, x)
    puts "Consumer #{consumerid}: consuming #{x}"
    sleep 1 # otherwise this'll finish before I can create multiple consumers
  end

  # Create three consumers. You can pass any number of args to
  # #consumer, and must pass a block whose arguments are the
  # object returned by eachable#each and those args back.

  ['A', 'B', 'C'].each do |consumerid|
    pc.consumer(consumerid) do |x, consumerid|
      sample(consumerid, x)
    end
  end

  # OUTPUT
  # Consumer A: consuming 1
  # Consumer B: consuming 2
  # Consumer C: consuming 3
  # Consumer A: consuming 4
  # Consumer B: consuming 5
  # Consumer C: consuming 6
  # Consumer B: consuming 7
  # Consumer A: consuming 8
  # Consumer C: consuming 9
  # Consumer B: consuming 10

Comments are closed.