New interest in MARC-HASH / JSON
February 26, 2010 at 12:29 am
EDIT: This is historical — the recommended serialization for marc in json is now Ross Singer’s marc-in-json. The marc-in-json serialization has implementations in the core marc libraries for Ruby and PHP, and add-ons for Perl and Java. C’mon, Python people!
For reasons I’m still not entirely clear on (I wasn’t there), the Code4Lib 2010 conference this week inspired renewed interest in a JSON-based format for MARC data.
When I first looked at MARC-HASH almost a year ago, I mostly wanted something that wasn’t such a pain in the butt to work with: something that would marshal into multiple formats easily and round-trip without fuss.
Now, though, a lot of us are looking for a MARC format that (a) doesn’t suffer from the length limitations of binary MARC, but (b) is less painful (both in code and processing time) than MARC-XML, so it’s worth revisiting.
For at least a few folks, un-marshaling time is a factor, since no matter what you’re doing, processing XML is slower than most other options. Much of that expense is due to functionality and data-safety features that just aren’t a big win with a brain-dead format like MARC, so it’s worth looking at alternatives.
What is MARC-HASH?
At some point, we’ll want a real spec, but right now it’s just this:
- # A record is a four-pair hash, as follows. UTF-8 is mandatory.
- {
- "type" : "marc-hash",
- "version" : [1, 0],
- "leader" : "…leader string … ",
- "fields" : [array, of, fields]
- }
- # A field is an array of either 2 or 4 elements
- [tag, value] # a control field
- [tag, ind1, ind2, [array, of, subfields]] # a data field
- # A subfield is an array of two elements
- [code, value]
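As a rough illustration of those rules, here is a minimal structural check in plain Ruby. The method names (`valid_marc_hash?`, `valid_field?`, `valid_subfield?`) are made up for this sketch, and the "indicators are exactly one character" test is an assumption drawn from the example rather than something the proto-spec states.

```ruby
# Sketch only: structural check for a MARC-HASH record held as plain Ruby data.
# All method names here are hypothetical, not part of ruby-marc or any library.

# A subfield is [code, value]
def valid_subfield?(subfield)
  subfield.is_a?(Array) && subfield.size == 2 && subfield.all? { |x| x.is_a?(String) }
end

# A field is [tag, value] (control) or [tag, ind1, ind2, subfields] (data)
def valid_field?(field)
  return false unless field.is_a?(Array)

  case field.size
  when 2
    field.all? { |x| x.is_a?(String) }
  when 4
    tag, ind1, ind2, subfields = field
    tag.is_a?(String) &&
      # Assumption: an indicator is a single character, blank being " "
      [ind1, ind2].all? { |ind| ind.is_a?(String) && ind.size == 1 } &&
      subfields.is_a?(Array) && subfields.all? { |sf| valid_subfield?(sf) }
  else
    false
  end
end

# A record is the four-pair hash described above
def valid_marc_hash?(record)
  record.is_a?(Hash) &&
    record['type'] == 'marc-hash' &&
    record['version'].is_a?(Array) &&
    record['leader'].is_a?(String) &&
    record['fields'].is_a?(Array) &&
    record['fields'].all? { |f| valid_field?(f) }
end

record = {
  'type' => 'marc-hash', 'version' => [1, 0], 'leader' => 'leader string',
  'fields' => [['001', 'abc'], ['245', '1', '0', [['a', 'A title']]]]
}
puts valid_marc_hash?(record)
```

A field with three elements, or a subfield that isn't a two-element pair, fails the check; nothing here looks at whether the leader or tags are meaningful MARC.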
So, a short example:
- {
- "type" : "marc-hash",
- "version" : [1, 0],
- "leader" : "leader string",
- "fields" : [
- ["001", "001 value"],
- ["002", "002 value"],
- ["010", " ", " ",
- [
- ["a", "68009499"]
- ]
- ],
- ["035", " ", " ",
- [
- ["a", "(RLIN)MIUG0000733-B"]
- ]
- ],
- ["035", " ", " ",
- [
- ["a", "(CaOTULAS)159818014"]
- ]
- ],
- ["245", "1", "0",
- [
- ["a", "Capitalism, primitive and modern;"],
- ["b", "some aspects of Tolai economic growth"],
- ["c", "[by] T. Scarlett Epstein."]
- ]
- ]
- ]
- }
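To show what working with a record like this might look like, here is a small sketch that parses a trimmed-down copy of the example with Ruby's standard `json` library and pulls out subfield values. `EXAMPLE_JSON` and `subfield_values` are names invented for this illustration.

```ruby
require 'json'

# A trimmed copy of the example record above, as valid JSON.
EXAMPLE_JSON = <<~JSON
  {
    "type": "marc-hash",
    "version": [1, 0],
    "leader": "leader string",
    "fields": [
      ["001", "001 value"],
      ["035", " ", " ", [["a", "(RLIN)MIUG0000733-B"]]],
      ["035", " ", " ", [["a", "(CaOTULAS)159818014"]]],
      ["245", "1", "0", [
        ["a", "Capitalism, primitive and modern;"],
        ["b", "some aspects of Tolai economic growth"]
      ]]
    ]
  }
JSON

# Hypothetical helper: every value for a tag/subfield-code pair, in record order.
def subfield_values(record, tag, code)
  record['fields']
    .select { |field| field.size == 4 && field[0] == tag } # data fields only
    .flat_map { |field| field[3].select { |c, _| c == code }.map { |_, v| v } }
end

record = JSON.parse(EXAMPLE_JSON)
puts subfield_values(record, '245', 'a').first
puts subfield_values(record, '035', 'a').inspect
```

Because fields are a plain array, the repeated 035 comes back as two values in their original order.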
How's the speed?
I think it's important to separate the format marc-hash from the eventual marshaling format -- partly because someone might actually want to use something other than JSON as a final format, but mostly because it forces implementations to separate hash-creation from json-creation and makes it easy to swap in different json libraries when a faster one comes along.
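A minimal sketch of that separation, under some assumptions: `build_marc_hash`, `SERIALIZERS` and `round_trips?` are hypothetical names, and Ruby's built-in `Marshal` stands in for "some other final format". The point is only that the structure is built once and the serializer is a swappable piece behind it.

```ruby
require 'json'

# Step 1: build the plain data structure. No serialization happens here.
# (Hypothetical helper; it takes a leader string and an array of fields.)
def build_marc_hash(leader, fields)
  { 'type' => 'marc-hash', 'version' => [1, 0], 'leader' => leader, 'fields' => fields }
end

# Step 2: serializers are interchangeable dump/load pairs over that structure.
# Swapping in a different JSON library would mean replacing only the :json entry.
SERIALIZERS = {
  json:    { dump: ->(h) { JSON.generate(h) }, load: ->(s) { JSON.parse(s) } },
  marshal: { dump: ->(h) { Marshal.dump(h) },  load: ->(s) { Marshal.load(s) } }
}.freeze

# True if dumping and re-loading gives back an equal structure.
def round_trips?(record, format)
  codec = SERIALIZERS.fetch(format)
  codec[:load].call(codec[:dump].call(record)) == record
end

record = build_marc_hash('leader string',
                         [['001', 'abc'], ['245', '1', '0', [['a', 'A title']]]])
SERIALIZERS.each_key { |format| puts "#{format}: #{round_trips?(record, format)}" }
```

Using string keys throughout matters for the JSON case, since `JSON.parse` returns string keys by default.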
Having said that, in real life people are mostly concerned about JSON. So, let's look at JSON performance.
The MARC-Binary and MARC-XML files are normal files, as you'd expect. The JSON file is "Newline-Delimited JSON" -- a single JSON record on each line.
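For anyone who hasn't run into newline-delimited JSON before, here is a small sketch of writing and reading it with the standard `json` library. `write_ndj` and `each_ndj_record` are names invented for this example, and a `StringIO` stands in for a real file.

```ruby
require 'json'
require 'stringio'

# Write one JSON document per line. JSON.generate emits no raw newlines
# (a newline inside a string value is escaped), so each record stays on one line.
def write_ndj(io, records)
  records.each { |record| io.puts(JSON.generate(record)) }
end

# Yield records one at a time, so a large file never has to fit in memory.
def each_ndj_record(io)
  return enum_for(:each_ndj_record, io) unless block_given?

  io.each_line do |line|
    line = line.strip
    next if line.empty? # tolerate blank lines
    yield JSON.parse(line)
  end
end

records = [
  { 'type' => 'marc-hash', 'version' => [1, 0], 'leader' => 'leader one',
    'fields' => [['001', 'rec1']] },
  { 'type' => 'marc-hash', 'version' => [1, 0], 'leader' => 'leader two',
    'fields' => [['001', 'rec2'], ['500', ' ', ' ', [['a', "two\nlines"]]]] }
]

io = StringIO.new
write_ndj(io, records)
io.rewind
each_ndj_record(io) { |record| puts record['fields'].first.inspect }
```

Reading line by line is what makes streaming possible: there is no enclosing array to parse before the first record is available.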
The benchmark code looks like this:
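require 'benchmark'
require 'marc'
# x is the report object from Benchmark.bm (the wrapper is assumed; the original snippet omitted it)
Benchmark.bm(12) do |x|
  # Unmarshal: read every record and pull out the title
  x.report("MARC Binary") do
    reader = MARC::Reader.new('test.mrc')
    reader.each do |r|
      title = r['245']['a']
    end
  end
  # Marshal: read every record and write it back out
  x.report("MARC Binary") do
    reader = MARC::Reader.new('test.mrc')
    writer = MARC::Writer.new('benchout.mrc')
    reader.each do |r|
      writer.write(r)
    end
    writer.close
  end
end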
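The JSON case isn't shown above, so here is a hedged sketch of what an equivalent unmarshal benchmark could look like. It is not the code behind the numbers in this post: it uses the standard `json` library rather than yajl or json-jruby, and it runs over 1,000 synthetic records (`SAMPLE_RECORD`, `SAMPLE_NDJ` and `read_titles` are invented names) instead of the real test file.

```ruby
require 'benchmark'
require 'json'
require 'stringio'

# Synthetic stand-in data, NOT the 18,831-record test file.
SAMPLE_RECORD = {
  'type' => 'marc-hash', 'version' => [1, 0], 'leader' => 'leader string',
  'fields' => [['001', 'abc'], ['245', '1', '0', [['a', 'A title']]]]
}.freeze
SAMPLE_NDJ = Array.new(1_000) { JSON.generate(SAMPLE_RECORD) }.join("\n")

# Hypothetical helper: parse each line and pull out 245$a, mirroring the
# "read every record and grab the title" loop used for binary MARC.
def read_titles(ndj_string)
  StringIO.new(ndj_string).each_line.map do |line|
    record = JSON.parse(line)
    field245 = record['fields'].find { |field| field[0] == '245' }
    field245[3].find { |code, _| code == 'a' }[1]
  end
end

Benchmark.bm(12) do |x|
  x.report('JSON/NDJ') { read_titles(SAMPLE_NDJ) }
end
```

Timings from a sketch like this say nothing about the tables below; it only shows the shape of the loop being measured.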
Under MRI, I used the nokogiri XML parser and the yajl JSON gem. Under JRuby, it was the jstax XML parser and the json-jruby JSON gem.
The test file is a set of 18,831 records I've been using for all my benchmarking of late. It's nothing special; just a nice size.
Marshalling Speed (read from binary marc, dump to given format)
Times are in seconds on my MacBook laptop, using ruby-marc.
| Format | Ruby 1.8.7 | Ruby 1.9 | JRuby 1.4 | JRuby 1.4 --1.9 |
|---|---|---|---|---|
| XML | 393 | 443 | 188 | 356 |
| MARC Binary | 36 | 23 | 23 | 25 |
| JSON/ NDJ | 31 | 19 | 25 | ERROR |
Unmarshalling Speed (from pre-created file)
Again, times are in seconds.
| Format | Ruby 1.8.7 | Ruby 1.9 | JRuby 1.4 | JRuby 1.4 --1.9 |
|---|---|---|---|---|
| XML | 113 | 89 | 75 | 89 |
| MARC Binary | 29 | 16 | 16 | 19 |
| JSON/ NDJ | 17 | 9 | 13 | 16 |
And so...
I'm not sure what else to say. The format is totally brain-dead. It round-trips. It's fast enough. It has no length limitations. We define it as UTF-8 only. The NDJ format is easy to understand and allows streaming of large document collections.
If folks are interested in implementing this across other libraries, that'd be great. Any thoughts?
What’s up with the ERROR on marshalling to json under jruby? Nothing there that shouldn’t work under jruby, I wouldn’t think? I’m confident that the actual performance metrics aren’t going to be different enough under jruby to affect the “win” conclusion, but we would of course want to make sure that ruby could do the serialization under jruby!
[...] fact, I know of a couple people who had this idea of marc-json, but Bill Dueber did a little proto- mini- spec for a standard way to do marc in json, so different people writing tools can do it can be [...]
Adding on to the proto-mini-spec, we should be clear that a ‘blank’ indicator is represented as an ASCII space, yes?
And likewise for the MARC “fill” character in fixed fields, which in marc8 is just ASCII 7C, the “|” char, so I guess it should still be represented as a |; it just probably deserves its own mention since it’s so weird.
By “hash” do you mean “array”? Because order matters in MARC, but Ruby hashes do not guarantee order, right?
Heh. Yeah, the major element in the hash is a big ol’ array of fields, composed of arrays of subfields. My original take on MARC-Hash was mostly as a hash, and resulted in my first real, painful schooling in how batshit-insane MARC is.
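To make the array-versus-hash point concrete, here is a tiny sketch (`index_by_tag` and `EXAMPLE_FIELDS` are made-up names). Ruby hashes have kept insertion order since 1.9, so ordering alone could be handled, but a hash keyed by tag still can't hold a repeated field.

```ruby
# Two 035 fields and a 245, in MARC-HASH field form.
EXAMPLE_FIELDS = [
  ['035', ' ', ' ', [['a', '(RLIN)MIUG0000733-B']]],
  ['035', ' ', ' ', [['a', '(CaOTULAS)159818014']]],
  ['245', '1', '0', [['a', 'Capitalism, primitive and modern;']]]
].freeze

# Hypothetical helper: the "obvious" layout, a hash keyed by tag.
# A later field with the same tag silently overwrites the earlier one.
def index_by_tag(fields)
  fields.each_with_object({}) { |field, by_tag| by_tag[field[0]] = field }
end

by_tag = index_by_tag(EXAMPLE_FIELDS)

puts EXAMPLE_FIELDS.count { |field| field[0] == '035' } # the array keeps both 035s
puts by_tag.size                                        # the hash has one entry per tag
puts by_tag['035'][3][0][1]                             # only the last 035 survives
```

Working around that (arrays of fields as hash values, say) is possible, but it ends up more awkward than just keeping one ordered array of fields.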
Bill, I’d be interested in hearing if anyone has put up a central site for this in terms of schema definition and/or collaboration. This is an area I think could gain some traction here, but going off on our own without a public schema doesn’t seem productive.
[...] Should support output in Marc21, MarcXML, or Bill Dueber’s Marc in Json proto-spec. http://robotlibrarian.billdueber.com/new-interest-in-marc-hash-json/ [...]
http://www.loc.gov/pictures/item/2008660390/marc?fo=json&at=marc
File_MARC 0.6.0 – now offering two tasty flavours of MARC-as-JSON output…
I’ve just released the PHP PEAR library File_MARC 0.6.0. This release brings two JSON serialization output methods for MARC to the table: toJSONHash() returns JSON that adheres to Bill Dueber’s proposal for the array-oriented MARC-HASH JSON format at…
[...] far we have two well-publicized suggestions: one by Bill Dueber, at the University of Michigan; and one by Andrew Houghton, who works at OCLC Research. They are quite different and each have [...]
[...] have been a few proposals for a MARC data structure that can easily be serialized to JSON (I had my own, in fact), but the stuff Ross has done with marc-in-json is preferable in being (a) not a ton [...]