EDIT: This is historical — the recommended serialization for marc in json is now Ross Singer’s marc-in-json. The marc-in-json serialization has implementations in the core marc libraries for Ruby and PHP, and add-ons for Perl and Java. C’mon, Python people!
For reasons I’m still not entirely clear on (I wasn’t there), the Code4Lib 2010 conference this week inspired renewed interest in a JSON-based format for MARC data.
When I initially looked at MARC-HASH almost a year ago, I was mostly looking for something that wasn’t such a pain in the butt to work with, something that would marshal into multiple formats easily and would simply and easily round-trip.
Now, though, a lot of us are looking for a MARC format that (a) doesn’t suffer from the length limitations of binary MARC, but (b) is less painful (both in code and processing time) than MARC-XML, and it’s worth re-visiting.
For at least a few folks, un-marshaling time is a factor, since no matter what you’re doing, processing XML is slower than most other options. Much of that expense is due to functionality and data-safety features that just aren’t a big win with a brain-dead format like MARC, so it’s worth looking at alternatives.
What is MARC-HASH?
At some point, we’ll want a real spec, but right now it’s just this:
```
// A record is a four-pair hash, as follows. UTF-8 is mandatory.
{
  "type"    : "marc-hash",
  "version" : [1, 0],
  "leader"  : "...leader string...",
  "fields"  : [array, of, fields]
}

// A field is an array of either 2 or 4 elements
[tag, value]                               // a control field
[tag, ind1, ind2, [array, of, subfields]]

// A subfield is an array of two elements
[code, value]
```
So, a short example:
{ "type" : "marc-hash", "version" : [1, 0], "leader" : "leader string" "fields" : [ ["001", "001 value"] ["002", "002 value"] ["010", " ", " ", [ ["a", "68009499"] ] ], ["035", " ", " ", [ ["a", "(RLIN)MIUG0000733-B"] ], ], ["035", " ", " ", [ ["a", "(CaOTULAS)159818014"] ], ], ["245", "1", "0", [ ["a", "Capitalism, primitive and modern;"], ["b", "some aspects of Tolai economic growth" ], ["c", "[by] T. Scarlett Epstein."] ] ] ] }
How’s the speed?
I think it’s important to separate the format marc-hash from the eventual marshaling format — partly because someone might actually want to use something other than JSON as a final format, but mostly because it forces implementations to separate hash-creation from json-creation and makes it easy to swap in different json libraries when a faster one comes along.
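As a sketch of what that separation buys you (assuming `record` is a ruby-marc `MARC::Record` and `record_to_marchash` is the helper sketched above; the real code may differ):

```ruby
require 'yajl'

hash = record_to_marchash(record)  # format logic only; no JSON anywhere in sight
json = Yajl::Encoder.encode(hash)  # swap in JSON.generate(hash), etc., without touching the format code
```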
Having said that, in real life people are mostly concerned about JSON. So, let’s look at JSON performance.
The MARC-Binary and MARC-XML files are normal files, as you’d expect. The JSON file is “Newline-Delimited JSON” — a single JSON record on each line.
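Here’s a sketch of what reading and writing NDJ might look like, again assuming the hypothetical `record_to_marchash` helper from above (this is not the actual benchmark code):

```ruby
require 'marc'
require 'yajl'

# Write: one complete JSON record per line
File.open('test.ndj', 'w') do |out|
  MARC::Reader.new('test.mrc').each do |r|
    out.puts Yajl::Encoder.encode(record_to_marchash(r))
  end
end

# Read: stream line by line, never holding the whole file in memory
File.open('test.ndj') do |f|
  f.each_line do |line|
    h = Yajl::Parser.parse(line)
    # work with h['leader'], h['fields'], etc.
  end
end
```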
The benchmark code looks like this:
```ruby
# Unmarshal
x.report("MARC Binary") do
  reader = MARC::Reader.new('test.mrc')
  reader.each do |r|
    title = r['245']['a']
  end
end

# Marshal
x.report("MARC Binary") do
  reader = MARC::Reader.new('test.mrc')
  writer = MARC::Writer.new('benchout.mrc')
  reader.each do |r|
    writer.write(r)
  end
  writer.close
end
```
Under MRI, I used the nokogiri XML parser and the yajl JSON gem. Under JRuby, it was the jstax XML parser and the json-jruby JSON gem.
The test file is a set of 18,831 records I’ve been using for all my benchmarking of late. It’s nothing special; just a nice size.
Marshalling speed (read from binary MARC, dump to the given format)
Times are in seconds on my MacBook laptop, using ruby-marc.
Format | Ruby 1.8.7 | Ruby 1.9 | JRuby 1.4 | JRuby 1.4 (--1.9) |
---|---|---|---|---|
XML | 393 | 443 | 188 | 356 |
MARC Binary | 36 | 23 | 23 | 25 |
JSON/NDJ | 31 | 19 | 25 | ERROR |
Unmarshalling speed (from pre-created file)
Again, times are in seconds
Format | Ruby 1.8.7 | Ruby 1.9 | JRuby 1.4 | JRuby 1.4 (--1.9) |
---|---|---|---|---|
XML | 113 | 89 | 75 | 89 |
MARC Binary | 29 | 16 | 16 | 19 |
JSON/NDJ | 17 | 9 | 13 | 16 |
And so…
I’m not sure what else to say. The format is totally brain-dead. It round-trips. It’s fast enough. It has no length limitations. UTF-8 is mandatory by definition. The NDJ format is easy to understand and allows streaming of large document collections.
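For the round-trip claim, the inverse direction is just as short. A sketch, using plain ruby-marc constructors and a hypothetical helper name to mirror the earlier one:

```ruby
require 'marc'

# Sketch only: rebuild a ruby-marc MARC::Record from a marc-hash structure.
def marchash_to_record(h)
  rec = MARC::Record.new
  rec.leader = h['leader']
  h['fields'].each do |f|
    if f.length == 2
      rec.append MARC::ControlField.new(f[0], f[1])                # [tag, value]
    else
      subs = f[3].map { |code, value| MARC::Subfield.new(code, value) }
      rec.append MARC::DataField.new(f[0], f[1], f[2], *subs)      # [tag, ind1, ind2, subfields]
    end
  end
  rec
end
```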
If folks are interested in implementing this across other libraries, that’d be great. Any thoughts?