Ross Singer recently updated ruby-marc to include a
#to_hash method that creates a data structure that is (a) round-trippable without any data loss, and (b) amenable to serializing to JSON. He’s calling it marc-in-json (even though the serialization is up to the programmer, it’s expected most of us will use JSON), and I think it’s the way to go in terms of JSON-able MARC data.
I wanted to take a quick look at the space/speed tradeoffs of using various means to serialize MARC records in the marc-in-json format compared to using binary MARC-21.
Binary MARC-21 is “broken” in that a lot of us have records that are so long (more than 99,999 bytes) it’s impossible to create a valid marc binary record. The standard alternative, MARC-XML, has huge filesizes (roughly 3 times as large) and runs a lot more slowly in every benchmark I’ve ever run. For ruby-marc, the penalty for using XML is further exaggerated because the serializer is based on REXML and is super-slow.
There have been a few proposals for a MARC data structure that can easily be serialized to JSON (I had my own, in fact), but the stuff Ross has done with marc-in-json is preferable in being (a) not a ton bigger in terms of file size, and (b) much easier to query from a NoSQL database using something like JSONPath or JSONQuery.
What I’m testing
For this test, I used:
- marc21 binary This is the stock serialization / deserialization provided by ruby-marc.
- YAJL for JSON YAJL is a very fast C-based JSON library. Here we’re using the Ruby bindings and calling
Yajl::Encoder.encode(r.to_hash)to serialize and
- Msgpack The Msgpack project is explicitly designed to be “binary JSON” – smaller, faster, etc – at the expense of human readability/editabilty . Again, this used the ruby bindings.
The benchmark and its results
I’m interested in how long it takes to serialize and deserialize a single record. My primary use-case is sticking a single record into Solr, and then pulling the string representation of that record out and turning it back into MARC.
It’s entirely possible that trying to deal with a whole set of MARC records – as a JSON array of marc-in-json objects, or as a set of newline-delimited JSON (or perhaps LDJSON or Msgpack objects – would yield different results. The former is especially interesting, since to parse a large JSON array one needs to use a streaming parser, which will almost certainly have a different profile in both processing and memory use.
The ambitious can see the full source code of the benchmark.
Note that the following represent only the performance of ruby-marc and the particular serializers used. Other platforms or other libraries will certainly give different results!
Total of 18880 records run 20 times (377,600 serialize/deserialize cycles per method) on my Mac OSX desktop; comparisons are to MARC21-Binary.
SERIALIZING MARC Binary 357.02 s (100%) YAJL 312.65 s ( 88%) Msgpack 266.26 s ( 75%)
DESERIALIZING MARC Binary 648.91 s (100%) YAJL 507.64 s ( 78%) Msgpack 459.73 s ( 71%)
SERIALIZE + DESERIALIZE MARC Binary 1005.93 s (100%) YAJL 820.29 s ( 82%) Msgpack 725.99 s ( 72%)
SIZE MARC Binary 31.15 MBytes (100%) Msgpack 42.00 MBytes (135%) JSON 55.99 MBytes (180%) XML 93.42 MBytes (300%)
Analysis, such as it is
Obviously, there are size/speed tradeoffs. Nothing is as small as binary MARC21, but both YAJL and Msgpack are faster – significantly so for deserialization, which happens to be where I want the speed for my uses.
At 80% larger, the JSON serialization is quite a big bigger, but it’s a hell of a lot smaller than MARC-XML and suffers none of the limitations of binary MARC.
For a closed system (i.e., you’re not worried about anyone else being able to read your data) such as a Blacklight installation, I’d be tempted to move to using JSON sooner rather than later.