Benchmarking MARC record parsing in Ruby

[Note: since I started writing this, I found out Bess & Co. store MARC-XML. That makes a difference, since XML in Ruby can be really, really slow]

[UPADTE It turns out they don't use MARC-XML. They use MARC-Binary just like the rest of us. Oops. ]

[UP-UPDATE Well, no, they do use MARC-XML. I'm not afraid to constantly change my story. This is why I'm the best investigative reporter in the business]

The other day on the blacklight mailing list, Bess Sadler wrote

Yes, we do still include the full marc record, but the rule of thumb we’re currently using is that anything that needs to display in the index view (the search results) needs to be broken out into a separate display field, because retrieving and parsing marc records for every item in a list of search results is too much of a performance hit.

This surprised me a fair bit, because in our implementation of VuFind (which uses PHP, versus Ruby for Blacklight) I do just that — grab the MARC out of Solr, parse it, and pull stuff like full titles and such out of it.

As it turns out, I’d been screwing around with calling marc4j from jruby, anyway, so I threw that into the mix, and here’s what I found.

What the benchmark tries to measure

The focus is on measuring time to parse MARC records as returned in a field from Solr in MARC-binary.

I got 40 sets of 50 records each (2000 records) from our Solr instance in ruby format and extracted the binary MARC strings. This resulted in an array of 40 sets of 50 strings, each of which is a valid MARC record.

Fifty records seems largish to me — we only display 20 at a time — but thought I’d swing for the fences.

I’m testing along three(ish) dimensions:

  • jruby vs mri
  • marc4j vs ruby-marc (only on jruby, obviously)
  • parsing each string individually, or globbing them all together and treating it as if it’s a multi-record file

[Note that MRI is using Net::HTTP to get the data; I presume Curl would be faster still. It's already faster than jruby]

The following data show the average time to parse out each set of 50 records and extract the first 245 (title) field from each one, along with the totals for doing all 2000 records.

Method                           User       Total      Real      

jruby Get/Eval data              0.134750   0.134750 (  0.134850)
jruby Get/Eval data (2000)       5.390000   5.390000 (  5.394000)

MRI Get/Eval data                0.008500   0.012750 (  0.115942)
MRI Get/Eval data (2000)         0.340000   0.510000 (  4.637677)    

jruby-marc4j-oneAtATime          0.056075   0.056075  (0.056125)
jruby-marc4j-multistring         0.027925   0.027925  (0.028000)

jruby-marc-oneAtATime            0.066625   0.066625  (0.066650)
jruby-marc-multistring           0.034300   0.034300  (0.034325)

mri-marc-oneAtATime              0.084500   0.085250  (0.086597)
mri-marc-multistring             0.085000   0.085750  (0.086026)

jruby-marc4j-oneAtATime (2000)   2.243000   2.243000  (2.244999)
jruby-marc-oneAtATime (2000)     2.665001   2.665001  (2.666000)
mri-marc-oneAtATime (2000)       3.380000   3.410000  (3.463888)


jruby-marc4j-multistring (2000)  1.117001   1.117001  (1.120001)
jruby-marc-multistring (2000)    1.371999   1.371999  (1.372999)
mri-marc-multistring (2000)      3.400000   3.430000  (3.441052)

So…the worst-case scenario is taking an average 0.085 second to get the first title field out of each one of 50 binary MARC records once we’ve got them.

Now, I’m sure all my records came out of the cache, so my query time wasn’t very long. But we still end up with a maximum of roughly 0.2 seconds plus the time to actually do the query to end up with a set of 50 marc records.

We can see from looking at the totals that it looks like MRI’s bottleneck is the actual parsing, whereas constructing the input streams is expensive under jruby (at least the way I’m doing it), resulting in a benefit of concatenating them all together into one longish string before parsing.

Marc4j is faster (20%ish), but not enough faster to be worth the effort in my mind. Keep in mind that I have no idea how fast Marc4j is when running under pure java, without all the jruby overhead.

Bottom line, though: that seems fast enough to me.

I’ll try to benchmark with XML later on today or tomorrow.

2 Comments

  1. Posted September 17, 2009 at 4:30 pm | Permalink

    Thanks Bill, incredibly helpful!

    Can you explain what the “Get/Eval Data” benchmarks are? I don’t really understand, but they seem quite a bit longer, trying to figure out what implications they have. (Cause part of what some of us are trying to figure out is not just what tool is fastest at parsing MARC, but whether, in a certain context, we should avoid parsing MARC altogether).

    Processing one at a time vs putting them all together in a multi-record string doesn’t seem to make a difference in MRI but does in jruby, interesting.

  2. Posted September 18, 2009 at 9:29 am | Permalink

    My C++ MPI code could parse about the scriblio set (~7M records) in about 27 seconds on two dual-core macs (gzipped binary marc, 1.85GHz imac & 2.0GHz macbook pro, both first models). This test was generating bigram tag/subcode counts.

    Performance scaled fairly linearly; with different IO->CPU ratios, it gets faster not to read compressed.

    Marc4j has a huge performance leak in the character set conversion routines. This would probably have been fixed by now.

    BTW, has anyone ever written a marc8 to unicode converter without screwing up that one character that maps to nothing? I think it’s there just to make sure they didn’t end up with some part of the spec being accidentally consistent :)

    The MARC leader is essentially worthless- for the LC data there’s less than a bytes worth of entropy in it.

    MARC-XML is a performance killer. Use it to exchange data, or for archiving, but not for real time work.

Post a Comment

Your email is never published nor shared. Required fields are marked *

*
*