
Tag: benchmark

Ruby MARC serialization/deserialization revisited

A few years ago, I benchmarked various methods of serializing/deserializing MARC data using the ruby-marc gem. Given that I’m planning on starting fresh with my catalog setup, I thought I’d take a moment to revisit them. The biggest changes since that time have been (a) the continued speed improvements in JRuby, (b) the introduction of the Oj JSON parser for MRI ruby, and (c) wider availability of msgpack code in the wild. I also wondered what would happen if I tried ruby’s Marshal serialization; maybe it would be faster because I wouldn’t have to "manually" create a MARC::Record object from…


Even better, even simpler multithreading with JRuby

[Yes, another post about ruby code; I’ll get back to library stuff soon.] Quite a while ago, I released a little gem called threach (for “threaded #each”). It allows you to easily process a block with multiple threads.

    # Process a CSV file with three threads
    File.open('data.csv').threach(3, :each_line) {|line| send_to_db(line)}

Nice, right? The problem is that I could never figure out a way to deal with a break or an Exception raised inside the block. The core problem is that once a thread trying to push/pop from a ruby SizedQueue is blocking, there’s no way (I could find) to tell…
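To make the idea concrete, here is a minimal sketch of a "threaded each" built on a SizedQueue, using only the standard library. It is not threach's actual code: the method name `threaded_each`, its parameters, and the sentinel approach are all assumptions made for illustration. It also deliberately ignores the hard part described above: if the block raises or breaks, the producer can end up blocked on a full queue.

```ruby
# Hypothetical helper, NOT the threach implementation.
# Feeds items from `enum` to `thread_count` worker threads via a bounded queue.
def threaded_each(enum, thread_count, queue_size: 10)
  queue = SizedQueue.new(queue_size)
  done  = Object.new # unique sentinel telling a worker to stop

  workers = Array.new(thread_count) do
    Thread.new do
      # Pop until this worker sees the sentinel.
      until (item = queue.pop).equal?(done)
        yield item
      end
    end
  end

  # Producer: blocks whenever the queue is full.
  enum.each { |item| queue.push(item) }

  # One sentinel per worker, then wait for all of them to finish.
  thread_count.times { queue.push(done) }
  workers.each(&:join)
end

# Usage: double each number using three threads.
results = Queue.new
threaded_each(1..20, 3) { |n| results << n * 2 }
puts results.size
```

The bounded queue keeps the producer from reading far ahead of the workers, which is also exactly where the blocking problem comes from once a worker dies.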


A short ruby diversion: cost of flow control under Ruby

A couple of days ago I decided to finally get back to working on threach to try to deal with problems it had: essentially, it didn’t deal well with non-local exits due to calls to break or even something as simple as a NoMethodError. [BTW, I think I managed it. As near as I can tell, threach version 0.4 won’t deadlock anymore.] Along the way, while trying to figure out how threads affect the behavior of different non-local exits, I noticed that in some cases there was still work being done by one or more threads long after there was an…


Size/speed of various MARC serializations using ruby-marc

Ross Singer recently updated ruby-marc to include a #to_hash method that creates a data structure that is (a) round-trippable without any data loss, and (b) amenable to serializing to JSON. He’s calling it marc-in-json (even though the serialization is up to the programmer, it’s expected most of us will use JSON), and I think it’s the way to go in terms of JSON-able MARC data. I wanted to take a quick look at the space/speed tradeoffs of using various means to serialize MARC records in the marc-in-json format compared to using binary MARC-21. Why bother? Binary MARC-21 is “broken” in…
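As a rough illustration of the kind of comparison involved, the sketch below serializes a hash in the marc-in-json shape with both JSON and Marshal from the standard library and checks that each round-trips. The record is made up for the example; with ruby-marc the hash would come from #to_hash instead. This shows only the mechanics and is not a benchmark.

```ruby
require 'json'

# A made-up record in the marc-in-json shape:
# a leader string plus an array of single-key field hashes.
record = {
  'leader' => '01142cam  2200301 a 4500',
  'fields' => [
    { '001' => 'ocm12345678' },
    { '245' => {
      'ind1' => '1',
      'ind2' => '0',
      'subfields' => [
        { 'a' => 'An example title /' },
        { 'c' => 'A. N. Author.' }
      ]
    } }
  ]
}

# Serialize the same structure two ways.
json_bytes    = JSON.generate(record)
marshal_bytes = Marshal.dump(record)

puts "JSON:    #{json_bytes.bytesize} bytes"
puts "Marshal: #{marshal_bytes.bytesize} bytes"

# Both should give back an equal hash (all keys and values are strings).
json_ok    = JSON.parse(json_bytes) == record
marshal_ok = Marshal.load(marshal_bytes) == record
puts "round-trip ok: #{json_ok && marshal_ok}"
```

For real timings you would run this over a large file of records rather than a single hash.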


Pushing MARC to Solr; processing times and threading and such

[This is in response to a thread on the blacklight mailing list about getting MARC data into Solr.] What’s the question? The question came up, “How much time do we spend processing the MARC vs trying to push it into Solr?”. Bob Haschart found that even with a pretty damn complicated processing stage, pushing the data to Solr still took, at best, at least as long as the processing stage. I’m interested because I’ve been struggling to write a solrmarc-like system that runs under JRuby. Architecturally, the big difference between my stuff and solrmarc is that I use the…
