EDIT: This is historical — the recommended serialization for marc in json is now Ross Singer’s marc-in-json. The marc-in-json serialization has implementations in the core marc libraries for Ruby and PHP, and add-ons for Perl and Java. C’mon, Python people!
For reasons I’m still not entirely clear on (I wasn’t there), the Code4Lib 2010 conference this week inspired renewed interest in a JSON-based format for MARC data.
When I initially looked at MARC-HASH almost a year ago, I was mostly looking for something that wasn’t such a pain in the butt to work with, something that would marshal into multiple formats easily and would simply and easily round-trip.
Now, though, a lot of us are looking for a MARC format that (a) doesn’t suffer from the length limitations of binary MARC, but (b) is less painful (both in code and processing time) than MARC-XML, and it’s worth re-visiting.
For at least a few folks, un-marshaling time is a factor, since no matter what you’re doing, processing XML is slower than most other options. Much of that expense is due to functionality and data-safety features that just aren’t a big win with a brain-dead format like MARC, so it’s worth looking at alternatives.
What is MARC-HASH?
At some point, we’ll want a real spec, but right now it’s just this:
```
// A record is a four-pair hash, as follows. UTF-8 is mandatory.
{
  "type"    : "marc-hash",
  "version" : [1, 0],
  "leader"  : "...leader string...",
  "fields"  : [array, of, fields]
}

// A field is an array of either 2 or 4 elements
[tag, value]                               // a control field
[tag, ind1, ind2, [array, of, subfields]]

// A subfield is an array of two elements
[code, value]
```
So, a short example:
{ "type" : "marc-hash", "version" : [1, 0], "leader" : "leader string" "fields" : [ ["001", "001 value"] ["002", "002 value"] ["010", " ", " ", [ ["a", "68009499"] ] ], ["035", " ", " ", [ ["a", "(RLIN)MIUG0000733-B"] ], ], ["035", " ", " ", [ ["a", "(CaOTULAS)159818014"] ], ], ["245", "1", "0", [ ["a", "Capitalism, primitive and modern;"], ["b", "some aspects of Tolai economic growth" ], ["c", "[by] T. Scarlett Epstein."] ] ] ] }
How’s the speed?
I think it’s important to separate the format marc-hash from the eventual marshaling format — partly because someone might actually want to use something other than JSON as a final format, but mostly because it forces implementations to separate hash-creation from json-creation and makes it easy to swap in different json libraries when a faster one comes along.
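As a sketch of what that separation buys you (assuming `record` is a ruby-marc `MARC::Record` and `record_to_marchash` is the helper sketched above; the real code may differ):

```ruby
require 'yajl'

hash = record_to_marchash(record)  # format logic only; no JSON anywhere in sight
json = Yajl::Encoder.encode(hash)  # swap in JSON.generate(hash), etc., without touching the format code
```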
Having said that, in real life people are mostly concerned about JSON. So, let’s look at JSON performance.
The MARC-Binary and MARC-XML files are normal files, as you’d expect. The JSON file is “Newline-Delimited JSON” — a single JSON record on each line.
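Here’s a sketch of what reading and writing NDJ might look like, again assuming the hypothetical `record_to_marchash` helper from above (this is not the actual benchmark code):

```ruby
require 'marc'
require 'yajl'

# Write: one complete JSON record per line
File.open('test.ndj', 'w') do |out|
  MARC::Reader.new('test.mrc').each do |r|
    out.puts Yajl::Encoder.encode(record_to_marchash(r))
  end
end

# Read: stream line by line, never holding the whole file in memory
File.open('test.ndj') do |f|
  f.each_line do |line|
    h = Yajl::Parser.parse(line)
    # work with h['leader'], h['fields'], etc.
  end
end
```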
The benchmark code looks like this:
```ruby
# Unmarshal
x.report("MARC Binary") do
  reader = MARC::Reader.new('test.mrc')
  reader.each do |r|
    title = r['245']['a']
  end
end

# Marshal
x.report("MARC Binary") do
  reader = MARC::Reader.new('test.mrc')
  writer = MARC::Writer.new('benchout.mrc')
  reader.each do |r|
    writer.write(r)
  end
  writer.close
end
```
Under MRI, I used the nokogiri XML parser and the yajl JSON gem. Under JRuby, it was the jstax XML parser and the json-jruby JSON gem.
The test file is a set of 18,831 records I’ve been using for all my benchmarking of late. It’s nothing special; just a nice size.
Marshalling speed (read from binary MARC, dump to the given format)
Times are in seconds on my MacBook laptop, using ruby-marc.
Format | Ruby 1.8.7 | Ruby 1.9 | JRuby 1.4 | JRuby 1.4 (--1.9) |
---|---|---|---|---|
XML | 393 | 443 | 188 | 356 |
MARC Binary | 36 | 23 | 23 | 25 |
JSON/NDJ | 31 | 19 | 25 | ERROR |
Unmarshalling speed (from pre-created file)
Again, times are in seconds
Format | Ruby 1.8.7 | Ruby 1.9 | JRuby 1.4 | JRuby 1.4 (--1.9) |
---|---|---|---|---|
XML | 113 | 89 | 75 | 89 |
MARC Binary | 29 | 16 | 16 | 19 |
JSON/NDJ | 17 | 9 | 13 | 16 |
And so…
I’m not sure what else to say. The format is totally brain-dead. It round-trips. It’s fast enough. It has no length limitations. UTF-8 is mandatory by definition. The NDJ format is easy to understand and allows streaming of large document collections.
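For the round-trip claim, the inverse direction is just as short. A sketch, using plain ruby-marc constructors and a hypothetical helper name to mirror the earlier one:

```ruby
require 'marc'

# Sketch only: rebuild a ruby-marc MARC::Record from a marc-hash structure.
def marchash_to_record(h)
  rec = MARC::Record.new
  rec.leader = h['leader']
  h['fields'].each do |f|
    if f.length == 2
      rec.append MARC::ControlField.new(f[0], f[1])                # [tag, value]
    else
      subs = f[3].map { |code, value| MARC::Subfield.new(code, value) }
      rec.append MARC::DataField.new(f[0], f[1], f[2], *subs)      # [tag, ind1, ind2, subfields]
    end
  end
  rec
end
```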
If folks are interested in implementing this across other libraries, that’d be great. Any thoughts?