I am Mr. Rourke...er...Bill Dueber, your host.

New interest in MARC-HASH / JSON

EDIT: This is historical – the recommended serialization for MARC in JSON is now Ross Singer’s marc-in-json. The marc-in-json serialization has implementations in the core MARC libraries for Ruby and PHP, and add-ons for Perl and Java. C’mon, Python people!
For reasons I’m still not entirely clear on (I wasn’t there), the Code4Lib 2010 conference this week inspired renewed interest in a JSON-based format for MARC data.

When I initially looked at MARC-HASH almost a year ago, I was mostly looking for something that wasn’t such a pain in the butt to work with, something that would marshal into multiple formats easily and round-trip simply and cleanly.

Now, though, a lot of us are looking for a MARC format that (a) doesn’t suffer from the length limitations of binary MARC, but (b) is less painful (both in code and processing time) than MARC-XML, and it’s worth revisiting.

For at least a few folks, un-marshaling time is a factor, since no matter what you’re doing, processing XML is slower than most other options. Much of that expense is due to functionality and data-safety features that just aren’t a big win with a brain-dead format like MARC, so it’s worth looking at alternatives.

What is MARC-HASH?

At some point, we’ll want a real spec, but right now it’s just this:

  // A record is a four-pair hash, as follows. UTF-8 is mandatory.
  {
    "type"    : "marc-hash",
    "version" : [1, 0],
    "leader"  : "...leader string...",
    "fields"  : [array, of, fields]
  }

  // A field is an array of either 2 or 4 elements
  [tag, value]                               // a control field
  [tag, ind1, ind2, [array, of, subfields]]  // a data field

  // A subfield is an array of two elements
  [code, value]

So, a short example:

    {
      "type"    : "marc-hash",
      "version" : [1, 0],
      "leader"  : "leader string",
      "fields"  : [
        ["001", "001 value"],
        ["002", "002 value"],
        ["010", " ", " ", [
          ["a", "68009499"]
        ]],
        ["035", " ", " ", [
          ["a", "(RLIN)MIUG0000733-B"]
        ]],
        ["035", " ", " ", [
          ["a", "(CaOTULAS)159818014"]
        ]],
        ["245", "1", "0", [
          ["a", "Capitalism, primitive and modern;"],
          ["b", "some aspects of Tolai economic growth"],
          ["c", "[by] T. Scarlett Epstein."]
        ]]
      ]
    }
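Because every value in the structure is just a string, number, or array, a marc-hash record survives a JSON round-trip byte-for-byte. A minimal sketch in Ruby, using an abbreviated version of the record above (leader and field values are placeholders):

```ruby
require 'json'

# An abbreviated marc-hash record, built by hand
record = {
  'type'    => 'marc-hash',
  'version' => [1, 0],
  'leader'  => 'leader string',
  'fields'  => [
    ['001', '001 value'],                             # control field
    ['245', '1', '0', [                               # data field
      ['a', 'Capitalism, primitive and modern;'],
      ['c', '[by] T. Scarlett Epstein.']
    ]]
  ]
}

# Dump to JSON and parse it back; the structure comes through unchanged
json          = JSON.generate(record)
round_tripped = JSON.parse(json)
puts round_tripped == record   # => true
```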

How's the speed?

I think it's important to separate the format marc-hash from the eventual marshaling format -- partly because someone might actually want to use something other than JSON as a final format, but mostly because it forces implementations to separate hash-creation from JSON-creation and makes it easy to swap in different JSON libraries when a faster one comes along. Having said that, in real life people are mostly concerned about JSON. So, let's look at JSON performance.

The MARC-Binary and MARC-XML files are normal files, as you'd expect. The JSON file is "Newline-Delimited JSON" -- a single JSON record on each line. The benchmark code looks like this:

~~~ruby
# Unmarshal
x.report("MARC Binary") do
  reader = MARC::Reader.new('test.mrc')
  reader.each do |r|
    title = r['245']['a']
  end
end

# Marshal
x.report("MARC Binary") do
  reader = MARC::Reader.new('test.mrc')
  writer = MARC::Writer.new('benchout.mrc')
  reader.each do |r|
    writer.write(r)
  end
  writer.close
end
~~~

Under MRI, I used the nokogiri XML parser and the yajl JSON gem. Under JRuby, it was the jstax XML parser and the json-jruby JSON gem. The test file is a set of 18,831 records I've been using for all my benchmarking of late. It's nothing special; just a nice size.
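For the JSON/NDJ leg, the unmarshal loop is just line-by-line JSON parsing. A sketch of what that leg might look like -- the file name and generated records here are made up stand-ins for the real 18,831-record test file:

```ruby
require 'json'
require 'benchmark'

# Generate a small NDJ file to read back (stand-in for the real test data)
File.open('bench.ndj', 'w') do |f|
  1000.times do |i|
    rec = { 'type' => 'marc-hash', 'version' => [1, 0],
            'leader' => 'leader string',
            'fields' => [['245', '1', '0', [['a', "Title #{i}"]]]] }
    f.puts JSON.generate(rec)   # one JSON record per line
  end
end

Benchmark.bm do |x|
  x.report('JSON/NDJ') do
    File.foreach('bench.ndj') do |line|
      r = JSON.parse(line)
      # Mirror the 245$a lookup from the binary benchmark
      f245  = r['fields'].find { |f| f[0] == '245' }
      title = f245 && f245[3].assoc('a')[1]
    end
  end
end
```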

Marshalling Speed (read from binary marc, dump to given format)

Times are in seconds on my Macbook laptop, using ruby-marc.

| Format      | Ruby 1.8.7 | Ruby 1.9 | JRuby 1.4 | JRuby 1.4 --1.9 |
|-------------|------------|----------|-----------|-----------------|
| XML         | 393        | 443      | 188       | 356             |
| MARC Binary | 36         | 23       | 23        | 25              |
| JSON/NDJ    | 31         | 19       | 25        | ERROR           |

Unmarshalling speed (from pre-created file)

Again, times are in seconds.

| Format      | Ruby 1.8.7 | Ruby 1.9 | JRuby 1.4 | JRuby 1.4 --1.9 |
|-------------|------------|----------|-----------|-----------------|
| XML         | 113        | 89       | 75        | 89              |
| MARC Binary | 29         | 16       | 16        | 19              |
| JSON/NDJ    | 17         | 9        | 13        | 16              |

And so...

I’m not sure what else to say. The format is totally brain-dead. It round-trips. It’s fast enough. It has no length limitations. We define it to be UTF-8, full stop. The NDJ format is easy to understand and allows streaming of large document collections.
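That streaming property falls out of NDJ for free: one record per line means you never hold the whole collection in memory. A quick sketch (file name and record contents are made up for illustration):

```ruby
require 'json'

records = [
  { 'type' => 'marc-hash', 'version' => [1, 0],
    'leader' => 'leader string', 'fields' => [['001', 'rec1']] },
  { 'type' => 'marc-hash', 'version' => [1, 0],
    'leader' => 'leader string', 'fields' => [['001', 'rec2']] }
]

# Writing NDJ: one JSON record per line -- that's the whole format
File.open('records.ndj', 'w') do |f|
  records.each { |r| f.puts JSON.generate(r) }
end

# Streaming read: only one record is ever in memory at a time
File.foreach('records.ndj') do |line|
  rec = JSON.parse(line)
  puts rec['fields'][0][1]   # prints the 001 value of each record
end
```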

If folks are interested in implementing this across other libraries, that’d be great. Any thoughts?

