A few years ago, I benchmarked various methods of serializing/deserialzing MARC data using the ruby-marc gem. Given that I’m planning on starting fresh with my catalog setup, I thought I’d take a moment to revisit them.
The biggest changes since that time have been (a) the continued speed improvements in JRuby, (b) the introduction of the Oj json parser for MRI ruby, and (c) wider availability of msgpack code in the wild.
I also wondered what would happen if I tried ruby’s Marshal
serialization; maybe it would be faster because I wouldn’t have to "manually" create a MARC::Record
object from a hash?
File sizes
File size isn’t as important as it once was, but still matters to some of us working with ginormous amounts of data:
This is the file size to hold the 18,881 records used for the benchmark.
Serialization | Size on disk (MB) | Size vs. marc21 | Gzipped size on disk (MB) | Gzipped size vs marc21 |
---|---|---|---|---|
marc21 | 31 | 100% | 9.0 | 100% |
msgpack | 42 | 135% | 8.2 | 91% |
json (ndj) | 56 | 180% | 8.1 | 90% |
marshal | 69 | 223% | 9.4 | 104% |
marc-xml | 93 | 300% | 9.2 | 102% |
It’s interesting, if not super-useful, to note that the file sizes differ by a factor of three uncompressed, but hardly at all when compressed. I was surprised at how well the binary formats (msgpack, marshal, and marc21) compressed.
Serialization / Deserialization time
I took a file of about 19k MARC records and tested the serialization/deserialization time, as follows:
- marc21 uses MARC::Reader and MARC::Writer from the ruby marc distribution
- json Uses
MARC::Record#to_hash
to produce a marc-in-json hash, serializes with the stock JSON library, and the writes to a file with one record per line (sometimes known as newline-delimited JSON, or NDJ). Deserialization reverses the process - json (oj) is the same, except using the Oj json library under MRI.
- msgpack uses the msgpack or msgpack-jruby gems to serialize/deserialize msgpack objects to/from a file stream.
- marshal uses the core ruby Marshal class to serialize/deserialize to a file stream.
In all cases, deserialization means to pull each record in turn from a file on disk and turn it into a MARC::Record
object; serialization means to take a set of pre-created MARC::Record
objects, serialize them, and push them into a file.
All times are in “real time” seconds as reported by Benchmark, averaged across two runs on my desktop machine:
- MRI ruby 2.1.2p95 (2014-05-08 revision 45877) [x86_64-darwin13.0
- JRuby jruby 1.7.15 (1.9.3p392) 2014-09-03 82b5cc3 on Java HotSpot(TM) 64-Bit Server VM 1.8.0-b132 +indy +jit [darwin-x86_64]
The benchmark code is up on a gist if you’d like to look at or modify it.
MRI Ruby
Method | Deserialize (s) | Serialize (s) | Total round trip (s) | Total time vs. marc21 |
---|---|---|---|---|
marc21 | 11.5 | 6.7 | 18.2 | 100% |
json | 9.5 | 5.9 | 15.4 | 85% |
json (oj) | 6.4 | 3.5 | 9.9 | 54% |
msgpack | 5.8 | 3.4 | 9.2 | 51% |
marshal | 7.0 | 8.7 | 15.7 | 86% |
JRuby
Method | Deserialize (s) | Serialize (s) | Total round trip (s) | Total time vs. marc21 |
---|---|---|---|---|
marc21 | 6.9 | 4.2 | 11.1 | 100% |
json | 4.1 | 2.5 | 6.6 | 59% |
json (oj) | N/A | N/A | N/A | N/A |
msgpack | 4.0 | 3.9 | 7.9 | 71% |
marshal | 5.3 | 8.1 | 13.4 | 121% |
Conclusions
- JRuby is faster than MRI on this these tasks, at least once it’s warmed up.
- JSON, with a decent library, is either the fastest (JRuby) or really close (MRI)
- Marshal is slow and big.
I come away with this thinking the same thing I did last time. I’m going to use compressed ndj in files, and (possibly compressed) JSON over the wire. The speed is great, tool support is outstanding, and having something human-readable is a big bonus.