Archives: March 2010

[Edit 2011-July-1: I've written a JRuby-specific version of threach, called jruby_threach, that takes advantage of better underlying Java libraries. It's a much better option if you're running JRuby.]

Lately on the #code4lib IRC channel, several of us have been knocking around different versions (in several programming languages) of programs to read in a ginormous file and do some processing on each line. I noted some speedups related to multi-threading, and someone (maybe rsinger?) said, basically, that to bother with threading for a one-off simple program was a waste.

Well, it turns out I’ve been trying to figure out how to deal with threading in jruby anyway. And I think I have a pretty elegant solution — a generic “threaded each” I’m calling threach.

    enumerable_object.threach(number_of_threads, :which_iterator) do |i|
      do_something_threadsafe(i)
    end

Some examples

    # You like #each? You'll love…err..probably like #threach
    load 'threach.rb'

    # Process with 2 threads. It assumes you want 'each'
    # as your iterator.
    (1..10).threach(2) {|i| puts i.to_s}

    # You can also specify the iterator
    File.open('mybigfile') do |f|
      f.threach(2, :each_line) do |line|
        processLine(line)
      end
    end

    # threach does not care what the arity of your block is
    # as long as it matches the iterator you ask for

    ('A'..'Z').threach(3, :each_with_index) do |letter, index|
      puts "#{index}: #{letter}"
    end

    # Or with a hash
    h = {'a' => 1, 'b'=>2, 'c'=>3}
    h.threach(2) do |letter, i|
      puts "#{i}: #{letter}"
    end

threach.rb adds to the Enumerable module to provide a threaded version of whatever enumerator you throw at it (each by default).

How does it work?

How about I just put the source here. It’s short.

    require 'thread'
    module Enumerable

      def threach(threads=0, iterator=:each, &blk)
        if threads == 0
          # Just call the iterator itself
          self.send(iterator, &blk)
        else
          bq = SizedQueue.new(threads * 4)
          consumers = []
          threads.times do |i|
            consumers << Thread.new do
              until (a = bq.pop) === :end_of_data
                blk.call(*a)
              end
            end
          end

          # The producer
          count = 0
          self.send(iterator) do |*x|
            bq.push x
            count += 1
          end
          # Now end it
          threads.times do
            bq << :end_of_data
          end
          # Do the join
          consumers.each {|t| t.join}
        end
      end
    end

That’s it. If threads=0, just use the iterator itself. If not:

  • Create a SizedQueue. It is thread-safe by definition and acts as the glue between the consumers and the main-thread producer.
  • Start a set of consumer threads that basically just pull an item out of the queue and then run the given block on it. Bail when you see the end_of_data token. These consumer threads all immediately block because there’s nothing in the SizedQueue yet.
  • Populate the SizedQueue. When you run out of stuff to add, push on an end_of_data token for each consumer thread.
  • Call join on each consumer thread so the main program sticks around until all of them have finished.
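
The blocking behaviour those steps rely on can be seen in isolation. This is a minimal sketch using only Ruby's standard SizedQueue; the capacity of 2 and the item names are arbitrary choices for illustration and are not taken from threach. It uses the non-blocking forms of push and pop, which raise ThreadError at exactly the points where the normal forms would wait.

```ruby
require 'thread'

q = SizedQueue.new(2)   # room for two items; further pushes would block
q.push :a
q.push :b

# A non-blocking push (second argument true) raises ThreadError when the
# queue is full. A normal push would simply wait here, like the producer.
full = begin
  q.push(:c, true)
  false
rescue ThreadError
  true
end

# Popping frees a slot, so the next push goes straight through.
first = q.pop
q.push :c

# Drain the queue. A non-blocking pop on an empty queue raises ThreadError;
# a normal pop would wait, which is what the idle consumer threads do.
drained = [q.pop, q.pop]
empty = begin
  q.pop(true)
  false
rescue ThreadError
  true
end

puts "full? #{full}, first: #{first}, drained: #{drained.inspect}, empty? #{empty}"
```

In threach the queue is sized at four slots per thread, so the producer can stay a little ahead of the consumers without reading the whole input into memory.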

Why use it?

Well, if you’re using stock ruby — you probably shouldn’t. It’ll just slow things down. But if you’re using a ruby implementation that has real threads, like JRuby, this will give you relatively painless multi-threading.

You can always do something like:

    if defined? JRUBY_VERSION
      numthreads = 3
    else
      numthreads = 0
    end

    my_enumerable.threach(numthreads) {|i|}

Note the “relatively” up there. The block you pass still has to be thread-safe, and there are many data structures you’ll encounter that are not thread-safe. Scalars, arrays, and hashes are, though, under JRuby, and that’ll get you pretty far.
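
When the block has to update something shared, one conservative option is to wrap the update in a Mutex. The sketch below is an illustration, not part of threach: it uses four plain threads and an ordinary Queue as stand-ins for the consumers, and the word list and variable names are made up for the example.

```ruby
require 'thread'

counts = Hash.new(0)   # shared state touched by every worker
lock   = Mutex.new
words  = %w[marc solr jruby marc solr marc]

# Load the work, followed by one stop token per worker.
work = Queue.new
words.each { |w| work << w }
4.times { work << :done }

workers = 4.times.map do
  Thread.new do
    until (w = work.pop) == :done
      # counts[w] += 1 is a read followed by a write, so only one
      # thread at a time is allowed to perform it.
      lock.synchronize { counts[w] += 1 }
    end
  end
end
workers.each(&:join)

puts counts.inspect
```

The same lock.synchronize call can go inside a block passed to threach. The cost is that threads queue up on the lock, so it pays to keep the guarded section as small as possible.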

7 Responses to “Why bother with threading in jruby? Because it’s easy.”

  1. Nice. You wrote that one? Ruby’s pretty sweet, huh?

  2. What’s the purpose of using a SizedQueue instead of an ordinary Queue? What if the producer produces so much faster than the consumers consume, that the threads*4 size is exhausted, what happens? Does the producer just block waiting for there to be room to enqueue?

  3. Bill says:

    The assumption is that the producer is faster than the consumer (otherwise, why bother to have multiple consumers). A regular Queue (not sized) would grow without bound based on the speed difference between consumption and production. We don’t, for example, want 10K lines in memory while we’re waiting for consumers to turn them into MARC objects or whatnot.

    A SizedQueue will block on both enqueue (if it’s full) and dequeue (if there’s nothing in it), so it’s exactly what we need for this kind of thing.

  4. David says:

    Nice.

    Just call the iterator itself

    self.send(iterator) do |*args| blk.call *args end

    could be

    self.send(iterator, &blk)

  5. Bill says:

    Thanks — I’m obviously still translating from Perl in my head :-). Changed.

  6. For what it’s worth, there’s a gem called “peach” (for “parallel each”) that basically does this same thing. It actually shook loose a few bugs in our Enumerable logic, where the iteration structures were not thread-safe (that’s long since been fixed).

    Nice example either way. JRuby + threads can really kick some ass :)

  7. Bill says:

    Yeah, I saw peach (which, among other things, left me scrambling for a different name), but was dissatisfied with its monkey-patching of only Array. My use cases mostly involve pulling stuff out of a file, so I wanted to hit up Enumerable directly.

    Is there a list somewhere of what’s thread-safe in JRuby?

[This is in response to a thread on the blacklight mailing list about getting MARC data into Solr.]

What’s the question?

The question came up, “How much time do we spend processing the MARC vs trying to push it into Solr?”. Bob Haschart found that even with a pretty damn complicated processing stage, pushing the data to solr was still, at best, taking at least as long as the processing stage.

I’m interested because I’ve been struggling to write a solrmarc-like system that runs under JRuby. Architecturally, the big difference between my stuff and solrmarc is that I use the StreamingUpdateSolrServer (on Erik Hatcher’s suggestion). So I thought I’d check how things break down for me.

Here are my numbers running under JRuby (using MARC4J as the marc implementation) with the Solr StreamingUpdateSolrServer. Obviously, there are a lot of differences between this and solrmarc, but I’m hoping that while it’s not comparing apples to apples, it’s at least comparing apples to some sort of processed cheese-like product.

What work is being done on what?

The data set is a file of 18,881 MARC records in marc-binary format. It’s probably not big enough to get a great idea of how things will run over the long (many millions of records) haul, but it’ll do for this rough-cut stuff.

I break my processing down into five categories:

  • Read the records into marc4j objects and do nothing. This is a baseline of sorts.
  • The “normal” fields are anything that you could do with SolrMarc without a custom routine; the actual processing is done in JRuby.
  • Custom fields are generated with JRuby code, but these are things that in solrmarc would require a custom routine.
  • The big “allfields” field is text from tags 100 through 900.
  • The “to_xml” routine is just calling the underlying marc4j XML output and stuffing it into a string.

The schema used is our normal UMICH schema except for High Level Browse (which appears in our catalog as “Academic Discipline”). The code for that is written in Java, and I just call it from JRuby when I’m using it. I excluded it because it’s incredibly expensive, both at startup time (when it loads a giant database of call-number ranges and associated categories) and for processing: there’s a lot of call-number normalization, long-string comparisons, some modified binary searches, etc. etc. etc. It’s expensive. Trust me.

The Solr server itself is on a different, incredibly-beefy machine, and is emptied out before each invocation that involves actually pushing data to it (with a delete-by-query :).

How fast were things on my desktop?

  • 18,881 records in marc-binary format
  • Times are in seconds, run on my desktop
  • Remember, you can’t compare these numbers to Bob’s because we’re doing different things to different data.

Total seconds   Description
     19         Just read the records with marc4j and do nothing.
     85         Read and do 35 “normal” fields (no custom)
    104         Read, 35 normal, 15 custom fields
    110         Read, normal, custom, allfields
    129         Read, normal, custom, allfields, to_xml
    136         Read, normal, custom, allfields, to_xml, 2-threaded SUSS, commit every 5K docs
    142         Read, normal, custom, allfields, to_xml, 1-threaded SUSS, commit every 5K docs
    124         Read, normal, custom, allfields, to_xml, 1-threaded SUSS, commit every 5K docs, 2 threads doing processing

We can also break the same numbers down as:

Seconds   Description
   19     read the records and do nothing
   66     process the 35 normal fields
   19     process the 15 custom fields
    6     generate the “allfields” field
   19     generate the XML (yowza!)
    7     send to solr with two threads
   13     send to solr with one thread
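
The per-stage figures are just the differences between successive cumulative totals in the first table. As a quick check of that arithmetic, here is a small sketch; the hash keys are shorthand labels of my own, and the numbers are copied from the tables.

```ruby
# Cumulative wall-clock seconds, in the order the stages were added.
cumulative = {
  'read'      => 19,
  'normal'    => 85,
  'custom'    => 104,
  'allfields' => 110,
  'to_xml'    => 129
}

# Each stage's cost is its total minus the total of the run before it.
stage_seconds = {}
previous = 0
cumulative.each do |stage, total|
  stage_seconds[stage] = total - previous
  previous = total
end

# Both Solr runs build on the 129-second "everything but Solr" total.
solr_two_threads = 136 - 129
solr_one_thread  = 142 - 129

stage_seconds.each { |stage, secs| puts format('%3d  %s', secs, stage) }
puts format('%3d  %s', solr_two_threads, 'solr, two threads')
puts format('%3d  %s', solr_one_thread, 'solr, one thread')
```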

Or like this:

Seconds   Description
  129     do all the reading and processing
   13     send to solr with one thread

Why does solr processing seem so much faster for me?

There are a lot of reasons why my submit-to-solr might seem like less of a burden. The ones I can think of off the top of my head are:

  • SUSS is just faster than whatever solrmarc does.
  • My processing stage is so much slower than solrmarc’s (due to algorithms or jruby-vs-java, I don’t know) that the “push to solr” portion of it gets swallowed up by the slowness of the overall code.
  • The Solr server is so much faster than my desktop that my poor little desktop can’t send it data fast enough to work it.

For my setup, obviously adding a processing thread is a lot more beneficial than adding a SUSS thread. My desktop doesn’t have that many threads lying around (adding a third processing thread actually slowed things down), so I moved the code to a beefier machine to see what happened.

Trying the same thing on a beefy machine

This is the exact same code and data, but on a beefy machine (16 cores, gobs of memory).

Seconds   SUSS threads   Processing threads
   70          1                 1            (was 142 seconds on the desktop)
   47          1                 2
   39          1                 3
   35          1                 4
   68          2                 1
   48          2                 2
   38          2                 3
   34          2                 4

So, on my hardware anyway, there’s a sweet spot with one SUSS thread and three processing threads. YMMV, of course.
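
To put the table in terms of speedup, the sketch below divides the single-threaded baseline (70 seconds) by each of the other timings. The numbers are copied from the table; the hash layout and variable names are just one way to hold them.

```ruby
# Wall-clock seconds keyed by [suss_threads, processing_threads].
seconds = {
  [1, 1] => 70, [1, 2] => 47, [1, 3] => 39, [1, 4] => 35,
  [2, 1] => 68, [2, 2] => 48, [2, 3] => 38, [2, 4] => 34
}

# Speedup relative to one SUSS thread and one processing thread.
baseline = seconds[[1, 1]].to_f
speedup = seconds.transform_values { |s| (baseline / s).round(2) }

speedup.each do |(suss, processing), factor|
  puts "#{suss} SUSS, #{processing} processing: #{factor}x"
end
```

Read this way, each extra processing thread helps less than the one before it, and a second SUSS thread changes the result by a second or two at most.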

What have we learned?

I’m not sure, to be honest. It’s logistically difficult for me to do the same process in solrmarc because I’d have to rebuild everything without the HLB stuff. I guess what I’ve learned is that if I’m going to continue working on my code, the places to focus my attention are threading (obviously) and MARC-XML generation.

4 Responses to “Pushing MARC to Solr; processing times and threading and such”

  1. What’s HLB?

    Both ruby-marc and marc4j will generate marc-xml, but do you mean optimizing speed of it? (Don’t forget marc-json possibilities! heh).

    Not sure if you’re still happy with marc4j or might prefer ruby-marc, I realized one thing missing from the ruby stack (if you didn’t want to use marc4j) (as far as I know) is the marc8-utf8 conversion stuff, and heuristic guess detection of marc records that aren’t really the encoding they claim to be.

  2. Oh, I see, performance with toXML.

    What i wonder/worry about, is if the added time for toXML isn’t actually the serialization to xml, but simply that if you’re pushing a larger stored field to solr, that’s going to slow things down.

    We still need to store our marc either way, of course. The UWisconsin approach of storing marc in an rdbms instead of a solr stored field may or may not speed up indexing, since it’s still gonna take time to store it.

  3. Hey, I should read more carefully before I post, but instead I’ll just multi-post.

    I see the serialization to XML itself is non-trivial too.

    json!

  4. Bruce says:

    What’s HLB?

    Both ruby-marc and marc4j will generate marc-xml, but do you mean optimizing speed of it? (Don’t forget marc-json possibilities! heh).

    Not sure if you’re still happy with marc4j or might prefer ruby-marc, I realized one thing missing from the ruby stack (if you didn’t want to use marc4j) (as far as I know) is the marc8-utf8 conversion stuff, and heuristic guess detection of marc records that aren’t really the encoding they claim to be.

ruby-marc with pluggable readers

March 2, 2010 at 1:55 pm. Category: Uncategorized

I’ve been messing with easier ways of adding parsers to ruby-marc’s MARC::Reader object. The idea is that you can do this:

    require 'marc'
    require 'my_marc_stuff'

    mbreader = MARC::Reader.new('test.mrc') # => Stock marc binary reader
    mbreader = MARC::Reader.new('test.mrc', :readertype=>:marcstrict) # => ditto

    MARC::Reader.register_parser(My::MARC::Parser, :marcstrict)
    mbreader = MARC::Reader.new('test.mrc') # => Uses My::MARC::Parser now

    xmlreader = MARC::Reader.new('test.xml', :readertype=>:marcxml)

    # …and maybe further on down the road

    asreader = MARC::Reader.new('test.seq', :readertype=>:alephsequential)
    mjreader = MARC::Reader.new('test.json', :readertype=>:marchashjson)

A parser need only implement #each and a module-level method #decode_from_string.

Read all about it on the github page.
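
To make the two-method contract concrete, here is a toy sketch of the shape such a parser might take. Everything in it is hypothetical: the Toy::LineParser name, the constructor that takes an IO, and the made-up "tag=value|tag=value" line format are assumptions for illustration, and it does not call ruby-marc or produce real MARC records.

```ruby
require 'stringio'

# Hypothetical toy parser: one record per line, fields written as
# "tag=value" and separated by "|". Not a real MARC serialization.
module Toy
  class LineParser
    include Enumerable

    # Assumption: the parser is handed something that responds to each_line.
    def initialize(io)
      @io = io
    end

    # Class-level method: turn one serialized record into a Hash of
    # tag => [values], keeping repeated tags in order.
    def self.decode_from_string(str)
      str.chomp.split('|').each_with_object({}) do |field, record|
        tag, value = field.split('=', 2)
        (record[tag] ||= []) << value
      end
    end

    # Yield each decoded record in turn.
    def each
      return enum_for(:each) unless block_given?
      @io.each_line { |line| yield self.class.decode_from_string(line) }
    end
  end
end

data = StringIO.new("245=Moby Dick|650=Whales|650=Sea stories\n245=Emma\n")
records = Toy::LineParser.new(data).to_a
puts records.inspect
```

Because #each is all that Enumerable needs, a parser shaped like this picks up map, select and friends for free.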

3 Responses to “ruby-marc with pluggable readers”

  1. adam says:

    Bill, How is the performance as compared to other languages?

    - adam

  2. Bill says:

    Adam — not sure what you’re asking. Ruby vs. Perl? MARC-HASH-JSON vs. MARC-HASH-YAML?

  3. adam says:

    I was thinking ruby vs. perl vs. java