Still another look at MARC parsing in ruby and jruby

I’ve been looking at making a jruby-based solr indexer for MARC documents, and started off wanting to make sure I could determine if anything I did would be faster than our existing (solrmarc-based) setup.

Assertion: The upper bound on how fast I can process records and send them to Solr can be approximated by looking at how fast I can parse (and do nothing else to) marc records from a file.
Assertion: If I can’t write a system that’s faster than what we have now, it’s probably not worth my time even though being able to fall back to ruby instead of java would be nice.
The Big Question: Is the MARC parsing process fast enough that it seems I might be able to write a system that runs faster than the solrmarc setup I have now?
The Answer (see below): Yes, if I use marc4j.

On our ridiculously-awesome hardware, right now we’re doing about 300 records/second for short files and 250 records/second for a full (6.5 million record) index, giving us a 7-8 hour reindex.

I’ll just post the results without a lot of commentary. I warmed stuff up in all cases, and ran on my desktop (so I could compare to MRI ruby, which isn’t installed on the server) and on the server where we usually run these things.

The machines are my desktop OSX machine and the beefy linux server where we usually do this stuff
The platforms are jruby 1.4 –server and MRI ruby 1.87
The libraries are marc4j and ruby-marc 0.3.3
The parsers are
- The standard binary parsers all around
- A home-grown AlephSequential format reader for the ‘seq’ type. AlephSequential is a MARC representation that uses one line for each field. We use it because it doesn’t have length limitations and, not surprisingly, Aleph can spit it out pretty quickly compared to MARC-XML.
- Whatever marc4j uses internally for MARC-XML
- ruby-marc’s ‘jstax’ xml parser under jruby (which I wrote and apparently needs some love, see below)
- ruby-marc’s ‘libxml’ xml parser under MRI ruby
Seconds is the average of two rounds, with measurements taken after a warmup run in each case.

The test files were 18,881 records in marc-xml, marc-binary, and AlephSequential formats.

 MACHINE  PLATFORM LIBRARY     PARSER    SECONDS    REC/SECOND desktop  jruby    marc4j      binary      4.06       4650 desktop  jruby    marc4j      xml         5.55       3401 desktop  jruby    ruby-marc   binary     17.35       1088 desktop  jruby    ruby-marc   jstax      80.11        236  desktop  ruby     ruby-marc   binary     33.54        562 desktop  ruby     ruby-marc   libxml     46.87        402  server   jruby    marc4j      binary      2.29       8245 server   jruby    marc4j      xml         3.36       5619 server   jruby    marc4j      AlephSeq    3.68       5130 server   jruby    ruby-marc   binary      9.93       1901 server   jruby    ruby-marc   jstax      44.56        424

The quick takeaways, with all the obvious caveats:

jruby with ruby-marc is twice as fast at binary and twice as slow at xml compared with MRI
marc4j is four times as fast for binary and about an order of magnitutde faster for xml compared with ruby-marc.
The server is fast.

We know from previous experience that libxml is the fastest of the current MRI-based marc-xml readers and that jstax is the best of the current jruby-based marc-xml readers. And, finally, we know that many of us can’t use marc-binary format because our records are too big.

If I’m gonna use jruby (which I think I am due to wanting to use the StreamingUpdateSolrServer) I’m gonna need to use marc4j and just wrap it up in some nicer syntax.