New interest in MARC-HASH / JSON

February 26, 2010
EDIT: This is historical — the recommended serialization for marc in json is now Ross Singer’s marc-in-json. The marc-in-json serialization has implementations in the core marc libraries for Ruby and PHP, and add-ons for Perl and Java. C’mon, Python people!

For reasons I’m still not entirely clear on (I wasn’t there), the Code4Lib 2010 conference this week inspired renewed interest in a JSON-based format for MARC data.

When I initially looked at MARC-HASH almost a year ago, I was mostly looking for something that wasn’t such a pain in the butt to work with, something that would marshal into multiple formats easily and would round-trip cleanly.

Now, though, a lot of us are looking for a MARC format that (a) doesn’t suffer from the length limitations of binary MARC, but (b) is less painful (both in code and processing time) than MARC-XML, so it’s worth revisiting.

For at least a few folks, un-marshaling time is a factor, since no matter what you’re doing, processing XML is slower than most other options. Much of that expense is due to functionality and data-safety features that just aren’t a big win with a brain-dead format like MARC, so it’s worth looking at alternatives.

What is MARC-HASH?

At some point, we’ll want a real spec, but right now it’s just this:

  # A record is a four-pair hash, as follows. UTF-8 is mandatory.
  {
    "type" : "marc-hash",
    "version" : [1, 0],
    "leader" : "…leader string … ",
    "fields" : [array, of, fields]
  }

  # A field is an array of either 2 or 4 elements
  [tag, value] # a control field
  [tag, ind1, ind2, [array, of, subfields]] # a data field

  # A subfield is an array of two elements
  [code, value]
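
Until there is a real spec, the rules above can be expressed as a small structural check. This is only a sketch: `valid_marc_hash?` is a hypothetical helper name, not part of ruby-marc, and it assumes tags, indicators, codes and values are all plain strings once the record has been parsed into Ruby hashes and arrays.

```ruby
# Hypothetical helper (not part of ruby-marc): returns true when +rec+
# follows the informal MARC-HASH layout described above.
# Assumption: tags, indicators, codes and values are all Strings.
def valid_marc_hash?(rec)
  return false unless rec.is_a?(Hash)
  return false unless rec["type"] == "marc-hash"
  return false unless rec["version"].is_a?(Array) && rec["version"].size == 2
  return false unless rec["leader"].is_a?(String)
  return false unless rec["fields"].is_a?(Array)

  rec["fields"].all? do |field|
    next false unless field.is_a?(Array)

    case field.size
    when 2 # control field: [tag, value]
      field.all? { |x| x.is_a?(String) }
    when 4 # data field: [tag, ind1, ind2, [[code, value], ...]]
      tag, ind1, ind2, subfields = field
      [tag, ind1, ind2].all? { |x| x.is_a?(String) } &&
        subfields.is_a?(Array) &&
        subfields.all? do |sf|
          sf.is_a?(Array) && sf.size == 2 && sf.all? { |x| x.is_a?(String) }
        end
    else
      false
    end
  end
end
```

It checks shape only; it does not look at leader length, tag syntax, or anything else a full spec would presumably pin down.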

So, a short example:

  {
    "type" : "marc-hash",
    "version" : [1, 0],

    "leader" : "leader string",
    "fields" : [
      ["001", "001 value"],
      ["002", "002 value"],
      ["010", " ", " ",
        [
          ["a", "68009499"]
        ]
      ],
      ["035", " ", " ",
        [
          ["a", "(RLIN)MIUG0000733-B"]
        ]
      ],
      ["035", " ", " ",
        [
          ["a", "(CaOTULAS)159818014"]
        ]
      ],
      ["245", "1", "0",
        [
          ["a", "Capitalism, primitive and modern;"],
          ["b", "some aspects of Tolai economic growth"],
          ["c", "[by] T. Scarlett Epstein."]
        ]
      ]
    ]
  }
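
To give a feel for working with a record like this, here is a minimal sketch that parses a trimmed-down version of the example with Ruby's standard `json` library and pulls out a subfield. `first_subfield` is a hypothetical helper written for this illustration; it is not a ruby-marc method.

```ruby
require 'json'

# Hypothetical helper: returns the value of the first subfield +code+ in the
# first data field tagged +tag+ that has one, or nil if none is found.
def first_subfield(rec, tag, code)
  rec["fields"].each do |field|
    # Data fields have four elements; control fields (two elements) are skipped.
    next unless field[0] == tag && field.size == 4

    field[3].each do |sf_code, value|
      return value if sf_code == code
    end
  end
  nil
end

# A shortened copy of the example record, as a JSON string.
json = <<~JSON
  {
    "type" : "marc-hash",
    "version" : [1, 0],
    "leader" : "leader string",
    "fields" : [
      ["001", "001 value"],
      ["245", "1", "0",
        [
          ["a", "Capitalism, primitive and modern;"],
          ["b", "some aspects of Tolai economic growth"]
        ]
      ]
    ]
  }
JSON

rec = JSON.parse(json)
puts first_subfield(rec, "245", "a") # prints: Capitalism, primitive and modern;
```

Because fields and subfields are arrays rather than hash keys, their order and any repeats (the two 035 fields in the full example, say) survive the trip through JSON.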

How's the speed?

I think it's important to separate the format marc-hash from the eventual marshaling format -- partly because someone might actually want to use something other than JSON as a final format, but mostly because it forces implementations to separate hash-creation from json-creation and makes it easy to swap in different json libraries when a faster one comes along.
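
A rough sketch of that separation follows. The `Struct` classes are stand-ins so the example runs without the marc gem; their accessor names (`leader`, `fields`, `tag`, `value`, `indicator1`, `indicator2`, `subfields`, `code`) are assumed to mirror what a MARC record object exposes. `to_marc_hash` and `serialize` are hypothetical names chosen for this illustration.

```ruby
require 'json'

# Stand-in record classes (assumptions, used only so this sketch is runnable).
ControlField = Struct.new(:tag, :value)
Subfield     = Struct.new(:code, :value)
DataField    = Struct.new(:tag, :indicator1, :indicator2, :subfields)
Record       = Struct.new(:leader, :fields)

# Step 1: build the plain hash-and-arrays structure. No JSON involved here.
def to_marc_hash(record)
  fields = record.fields.map do |f|
    if f.respond_to?(:subfields)
      # data field: [tag, ind1, ind2, [[code, value], ...]]
      [f.tag, f.indicator1, f.indicator2,
       f.subfields.map { |sf| [sf.code, sf.value] }]
    else
      # control field: [tag, value]
      [f.tag, f.value]
    end
  end

  { "type" => "marc-hash", "version" => [1, 0],
    "leader" => record.leader, "fields" => fields }
end

# Step 2: serialize. The encoder is any callable that turns a Hash into a
# String, so a different JSON library can be swapped in without touching step 1.
def serialize(record, encoder = JSON.method(:generate))
  encoder.call(to_marc_hash(record))
end

record = Record.new("leader string", [
  ControlField.new("001", "001 value"),
  DataField.new("245", "1", "0", [Subfield.new("a", "Capitalism, primitive and modern;")])
])

puts serialize(record)
```

Passing a different callable as the second argument is all it takes to try another encoder, or another output format entirely.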

Having said that, in real life people are mostly concerned about JSON. So, let's look at JSON performance.

The MARC-Binary and MARC-XML files are normal files, as you'd expect. The JSON file is "Newline-Delimited JSON" -- a single JSON record on each line.
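
Reading and writing that newline-delimited layout takes very little code. The sketch below uses only Ruby's standard `json` and `stringio` libraries; `write_ndj` and `each_ndj` are hypothetical helper names, and the records are assumed to already be MARC-HASH structures.

```ruby
require 'json'
require 'stringio'

# Write each record as one compact JSON document per line.
# JSON.generate escapes newlines inside strings, so a record never spans lines.
def write_ndj(io, records)
  records.each { |rec| io.puts(JSON.generate(rec)) }
end

# Yield one parsed record per non-blank line, so a large file can be
# streamed without loading every record into memory at once.
def each_ndj(io)
  return enum_for(:each_ndj, io) unless block_given?

  io.each_line do |line|
    line = line.strip
    next if line.empty?

    yield JSON.parse(line)
  end
end

records = [
  { "type" => "marc-hash", "version" => [1, 0], "leader" => "leader one",
    "fields" => [["001", "first"]] },
  { "type" => "marc-hash", "version" => [1, 0], "leader" => "leader two",
    "fields" => [["001", "second"]] }
]

io = StringIO.new
write_ndj(io, records)
io.rewind
each_ndj(io) { |rec| puts rec["fields"].first.last }
```

The same two methods work on a `File` opened for writing or reading, since they only rely on `puts` and `each_line`.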

The benchmark code looks like this:

  # Unmarshal
  x.report("MARC Binary") do
    reader = MARC::Reader.new('test.mrc')
    reader.each do |r|
      title = r['245']['a']
    end
  end

  # Marshal
  x.report("MARC Binary") do
    reader = MARC::Reader.new('test.mrc')
    writer = MARC::Writer.new('benchout.mrc')
    reader.each do |r|
      writer.write(r)
    end
    writer.close
  end

Under MRI, I used the nokogiri XML parser and the yajl JSON gem. Under JRuby, it was the jstax XML parser and the json-jruby JSON gem.

The test file is a set of 18,831 records I've been using for all my benchmarking of late. It's nothing special; just a nice size.

Marshalling speed (read from binary MARC, dump to given format)

Times are in seconds on my MacBook laptop, using ruby-marc.

  Format        Ruby 1.8.7   Ruby 1.9   JRuby 1.4   JRuby 1.4 --1.9
  XML                  393        443         188               356
  MARC Binary           36         23          23                25
  JSON / NDJ            31         19          25             ERROR

Unmarshalling speed (from pre-created file)

Again, times are in seconds.

  Format        Ruby 1.8.7   Ruby 1.9   JRuby 1.4   JRuby 1.4 --1.9
  XML                  113         89          75                89
  MARC Binary           29         16          16                19
  JSON / NDJ            17          9          13                16

And so...

I'm not sure what else to say. The format is totally brain-dead. It round-trips. It's fast enough. It has no length limitations. It's defined to be UTF-8. The NDJ format is easy to understand and allows streaming of large document collections.

If folks are interested in implementing this across other libraries, that'd be great. Any thoughts?

Comments:6

  1. Jonathan Rochkind
    February 26, 2010 at 12:54 pm

    What’s up with the ERROR on marshalling to json under jruby? Nothing there that shouldn’t work under jruby, I wouldn’t think? I’m confident that the actual performance metrics aren’t going to be different enough under jruby to affect the “win” conclusion, but we would of course want to make sure that ruby could do the serialization under jruby!

  2. Jonathan Rochkind
    March 3, 2010 at 4:15 pm

    Adding on to the proto-mini-spec, we should be clear that a ‘blank’ indicator is represented as an ASCII space, yes?

    And likewise for the MARC “fill” character in fixed fields — which in marc8 is just ASCII 7C, the “|” char, so I guess should still be represented as a |, it just probably deserves its own mention since it’s so weird.

  3. Naomi Dushay
    March 4, 2010 at 6:57 pm

    by “hash” do you mean “array”? Because order matters in Marc, but ruby hashes do not guarantee order, right?

  4. Bill
    March 4, 2010 at 7:26 pm

    Heh. Yeah, the major element in the hash is a big ol’ array of fields, composed of arrays of subfields. My original take on MARC-Hash was mostly as a hash, and resulted in my first real, painful schooling in how batshit-insane MARC is.

  5. GregPendlebury
    March 7, 2010 at 7:10 pm

    Bill, I’d be interested in hearing if anyone has put up a central site for this in terms of schema definition and/or collaboration. This is an area I think could gain some traction here, but going off on our own without a public schema doesn’t seem productive.

Trackbacks:5

Listed below are links to weblogs that reference New interest in MARC-HASH / JSON

pingback from marc-json « Bibliographic Wilderness March 3, 2010

[...] fact, I know of a couple people who had this idea of marc-json, but Bill Dueber did a little proto- mini- spec for a standard way to do marc in json, so different people writing tools can do it can be [...]

pingback from spec for a better ILS marc exporter « Bibliographic Wilderness April 6, 2010

[...] Should support output in Marc21, MarcXML, or Bill Dueber’s Marc in Json proto-spec. http://robotlibrarian.billdueber.com/new-interest-in-marc-hash-json/ [...]

trackback from Coffee|Code : Dan Scott August 15, 2010

File_MARC 0.6.0 – now offering two tasty flavours of MARC-as-JSON output…

I’ve just released the PHP PEAR library File_MARC 0.6.0. This release brings two JSON serialization output methods for MARC to the table: toJSONHash() returns JSON that adheres to Bill Dueber’s proposal for the array-oriented MARC-HASH JSON format at…

pingback from Dilettante's Ball » Blog Archive » For your consideration: yet another MARC-in-JSON proposal pt. 1 September 2, 2010

[...] far we have two well-publicized suggestions: one by Bill Dueber, at the University of Michigan; and one by Andrew Houghton, who works at OCLC Research. They are quite different and each have [...]

pingback from Size/speed of various MARC serializations using ruby-marc » Robot Librarian September 29, 2010

[...] have been a few proposals for a MARC data structure that can easily be serialized to JSON (I had my own, in fact), but the stuff Ross has done with marc-in-json is preferable in being (a) not a ton [...]