EDIT: This is historical -- the recommended serialization for MARC in JSON is now Ross Singer's marc-in-json. The marc-in-json serialization has implementations in the core MARC libraries for Ruby and PHP, and add-ons for Perl and Java. C'mon, Python people!

For reasons I’m still not entirely clear on (I wasn’t there), the Code4Lib 2010 conference this week inspired renewed interest in a JSON-based format for MARC data.

When I initially looked at MARC-HASH almost a year ago, I was mostly looking for something that wasn’t such a pain in the butt to work with, something that would marshal into multiple formats easily and would simply and easily round-trip.

Now, though, a lot of us are looking for a MARC format that (a) doesn’t suffer from the length limitations of binary MARC, but (b) is less painful (both in code and processing time) than MARC-XML, and it’s worth re-visiting.

For at least a few folks, unmarshaling time is a factor, since no matter what you’re doing, processing XML is slower than most other options. Much of that expense is due to functionality and data-safety features that just aren’t a big win with a brain-dead format like MARC, so it’s worth looking at alternatives.

What is MARC-HASH?

At some point, we’ll want a real spec, but right now it’s just this:

  // A record is a four-pair hash, as follows. UTF-8 is mandatory.
  {
    "type" : "marc-hash",
    "version" : [1, 0],
    "leader" : "...leader string...",
    "fields" : [array, of, fields]
  }

  // A field is an array of either 2 or 4 elements
  [tag, value] // a control field
  [tag, ind1, ind2, [array, of subfields]]

  // A subfield is an array of two elements

  [code, value]

So, a short example:

  {
    "type" : "marc-hash",
    "version" : [1, 0],
    "leader" : "leader string",
    "fields" : [
      ["001", "001 value"],
      ["002", "002 value"],
      ["010", " ", " ",
        [
          ["a", "68009499"]
        ]
      ],
      ["035", " ", " ",
        [
          ["a", "(RLIN)MIUG0000733-B"]
        ]
      ],
      ["035", " ", " ",
        [
          ["a", "(CaOTULAS)159818014"]
        ]
      ],
      ["245", "1", "0",
        [
          ["a", "Capitalism, primitive and modern;"],
          ["b", "some aspects of Tolai economic growth"],
          ["c", "[by] T. Scarlett Epstein."]
        ]
      ]
    ]
  }
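Because the whole thing is just arrays and hashes, pulling data out needs nothing but plain indexing. Here's a minimal Ruby sketch; the `subfield_value` helper is hypothetical (it's not part of ruby-marc or any spec), just an illustration of walking the structure:

```ruby
require 'json'

# Hypothetical helper: return the first value for a given subfield code
# in a marc-hash record, or the raw value for a control field.
def subfield_value(rec, tag, code)
  rec["fields"].each do |f|
    next unless f[0] == tag
    return f[1] if f.size == 2            # control field: [tag, value]
    f[3].each { |sf| return sf[1] if sf[0] == code }  # [code, value] pairs
  end
  nil
end

rec = JSON.parse(<<~JSON)
  { "type": "marc-hash", "version": [1, 0], "leader": "leader string",
    "fields": [
      ["001", "001 value"],
      ["245", "1", "0", [["a", "Capitalism, primitive and modern;"]]]
    ] }
JSON

subfield_value(rec, "245", "a")  # => "Capitalism, primitive and modern;"
```

No schema, no namespaces, no event-driven parser: you get the record as plain data and index into it.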

How's the speed?

I think it’s important to separate the format marc-hash from the eventual marshaling format – partly because someone might actually want to use something other than JSON as a final format, but mostly because it forces implementations to separate hash-creation from JSON-creation and makes it easy to swap in a different JSON library when a faster one comes along.

Having said that, in real life people are mostly concerned about JSON. So, let’s look at JSON performance.

The MARC binary and MARC-XML files are normal files, as you’d expect. The JSON file is “Newline-Delimited JSON” – a single JSON record on each line.
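Reading and writing NDJ needs nothing beyond the stdlib; here's a sketch of what it might look like (the filename and the stub records are made up for illustration):

```ruby
require 'json'

# Two stub marc-hash records, just enough structure for the demo
records = [
  { "type" => "marc-hash", "version" => [1, 0],
    "leader" => "leader string",
    "fields" => [["001", "001 value"]] },
  { "type" => "marc-hash", "version" => [1, 0],
    "leader" => "another leader",
    "fields" => [["001", "another 001"]] }
]

# Write: one JSON object per line (JSON.generate never emits raw newlines,
# so each record is guaranteed to stay on a single line)
File.open("records.ndj", "w") do |f|
  records.each { |rec| f.puts JSON.generate(rec) }
end

# Read: stream line by line -- no need to hold the whole file in memory
read_back = File.foreach("records.ndj").map { |line| JSON.parse(line) }
```

That streaming property is the whole point: you can process a multi-million-record file one line at a time.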

The benchmark code looks like this:

  require 'benchmark'
  require 'marc'  # ruby-marc

  Benchmark.bm do |x|
    # Unmarshal: read every record and pull out a field
    x.report("MARC Binary") do
      reader = MARC::Reader.new('test.mrc')
      reader.each do |r|
        title = r['245']['a']
      end
    end

    # Marshal: read every record and write it back out
    x.report("MARC Binary") do
      reader = MARC::Reader.new('test.mrc')
      writer = MARC::Writer.new('benchout.mrc')
      reader.each do |r|
        writer.write(r)
      end
      writer.close
    end
  end

Under MRI, I used the nokogiri XML parser and the yajl JSON gem. Under JRuby, it was the jstax XML parser and the json-jruby JSON gem.

The test file is a set of 18,831 records I’ve been using for all my benchmarking of late. It’s nothing special; just a nice size.

Marshaling speed (read from binary MARC, dump to given format)

Times are in seconds on my MacBook laptop, using ruby-marc.

Format        Ruby 1.8.7   Ruby 1.9   JRuby 1.4   JRuby 1.4 --1.9
XML               393         443        188            356
MARC Binary        36          23         23             25
JSON/NDJ           31          19         25           ERROR

Unmarshaling speed (from a pre-created file)

Again, times are in seconds

Format        Ruby 1.8.7   Ruby 1.9   JRuby 1.4   JRuby 1.4 --1.9
XML               113          89         75             89
MARC Binary        29          16         16             19
JSON/NDJ           17           9         13             16

And so...

I’m not sure what else to say. The format is totally brain-dead. It round-trips. It’s fast enough. It has no length limitations. It’s defined to be UTF-8, always. The NDJ format is easy to understand and allows streaming of large document collections.
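The round-trip claim is easy to sanity-check at the JSON layer with nothing but the stdlib (the record below is a made-up fragment, not from the test file):

```ruby
require 'json'

rec = {
  "type" => "marc-hash", "version" => [1, 0],
  "leader" => "leader string",
  "fields" => [
    ["001", "001 value"],
    ["245", "1", "0", [["a", "Capitalism, primitive and modern;"]]]
  ]
}

# Everything in the format is a string, an array, or a small integer,
# so dumping to JSON and parsing it back loses nothing
json = JSON.generate(rec)
JSON.parse(json) == rec  # => true
```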

If folks are interested in implementing this across other libraries, that’d be great. Any thoughts?