Why bother with threading in jruby? Because it’s easy.

March 11, 2010

[Edit 2011-July-1: I've since written a JRuby-specific version of threach, called jruby_threach, that takes advantage of better underlying Java libraries. It is a much better option if you're running JRuby.]

Lately on the #code4lib IRC channel, several of us have been knocking around different versions (in several programming languages) of programs to read in a ginormous file and do some processing on each line. I noted some speedups related to multi-threading, and someone (maybe rsinger?) said, basically, that to bother with threading for a one-off simple program was a waste.

Well, it turns out I’ve been trying to figure out how to deal with threading in jruby anyway. And I think I have a pretty elegant solution — a generic “threaded each” I’m calling threach.

  enumerable_object.threach(number_of_threads, :which_iterator) do |i|
    do_something_threadsafe(i)
  end

Some examples

  # You like #each? You'll love…err..probably like #threach
  load 'threach.rb'

  # Process with 2 threads. It assumes you want 'each'
  # as your iterator.
  (1..10).threach(2) {|i| puts i.to_s}

  # You can also specify the iterator
  File.open('mybigfile') do |f|
    f.threach(2, :each_line) do |line|
      processLine(line)
    end
  end

  # threach does not care what the arity of your block is
  # as long as it matches the iterator you ask for

  ('A'..'Z').threach(3, :each_with_index) do |letter, index|
    puts "#{index}: #{letter}"
  end

  # Or with a hash
  h = {'a' => 1, 'b'=>2, 'c'=>3}
  h.threach(2) do |letter, i|
    puts "#{i}: #{letter}"
  end

threach.rb adds to the Enumerable module to provide a threaded version of whatever enumerator you throw at it (each by default).

How does it work?

How about I just put the source here. It’s short.

  require 'thread'
  module Enumerable

    def threach(threads=0, iterator=:each, &blk)
      if threads == 0
        # Just call the iterator itself
        self.send(iterator, &blk)
      else
        bq = SizedQueue.new(threads * 4)
        consumers = []
        threads.times do |i|
          consumers << Thread.new do
            until (a = bq.pop) === :end_of_data
              blk.call(*a)
            end
          end
        end

        # The producer
        count = 0
        self.send(iterator) do |*x|
          bq.push x
          count += 1
        end
        # Now end it
        threads.times do
          bq << :end_of_data
        end
        # Do the join
        consumers.each {|t| t.join}
      end
    end
  end

That’s it. If threads=0, just use the iterator itself. If not:

  • Create a SizedQueue. It is thread-safe by definition and acts as the glue between the consumers and the main-thread producer.
  • Start a set of consumer threads that basically just pull an item out of the queue and then run the given block on it. Bail when you see the end_of_data token. These consumer threads all immediately block because there’s nothing in the SizedQueue yet.
  • Populate the SizedQueue. When you run out of stuff to add, push on an end_of_data token for each consumer thread.
  • Call join on each consumer thread so the main thread waits until all of them have finished before threach returns.
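
To see the back-pressure that the SizedQueue provides, here is a minimal standalone sketch. It is not part of threach.rb: the variable names, the tiny queue limit, and the sleep are made up for illustration, and the roles are flipped so that the producer is the extra thread and the main thread is the single consumer. The producer pushes much faster than the consumer pops, yet the queue never holds more than its limit:

```ruby
require 'thread'

queue    = SizedQueue.new(2)  # illustrative limit: at most two items waiting
consumed = []
max_seen = 0

# Producer: push blocks whenever the queue is already full.
producer = Thread.new do
  5.times { |i| queue.push(i) }
  queue.push(:end_of_data)
end

# Consumer (here, the main thread): pop blocks whenever the queue is empty.
until (item = queue.pop) == :end_of_data
  max_seen = [max_seen, queue.size].max  # how full did the queue get?
  consumed << item
  sleep 0.01                             # pretend the work is slow
end

producer.join
```

With an ordinary Queue in place of the SizedQueue, the producer would finish immediately and everything it produced would sit in memory until consumed.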

Why use it?

Well, if you’re using stock ruby, you probably shouldn’t: its threads don’t run Ruby code in parallel, so this will just slow things down. But if you’re using a ruby implementation that has real threads, like JRuby, this will give you relatively painless multi-threading.

You can always do something like:

  if defined? JRUBY_VERSION
    numthreads = 3
  else
    numthreads = 0
  end

  my_enumerable.threach(numthreads) {|i|}

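Note the “relatively” up there. The block you pass still has to be thread-safe, and there are many data structures you’ll encounter that are not thread-safe. Under JRuby, reading scalars, arrays, and hashes from several threads is fine, and that’ll get you pretty far, but as far as I know concurrent writes to a shared array or hash are not guaranteed to be safe without a lock.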
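
One common way to keep a block thread-safe when it updates shared state is to guard the update with a Mutex. The sketch below is only an illustration under that assumption: it uses plain Thread objects so that it runs on its own, and the word list and variable names are invented. The same lock.synchronize call could go inside a block handed to threach.

```ruby
require 'thread'

counts = Hash.new(0)  # shared state that every thread updates
lock   = Mutex.new

words = %w[a b a c b a]  # made-up input

# One thread per pair of words; each increment happens under the lock,
# because "counts[w] += 1" is a read followed by a write.
workers = words.each_slice(2).map do |slice|
  Thread.new do
    slice.each do |w|
      lock.synchronize { counts[w] += 1 }
    end
  end
end

workers.each(&:join)
```

Holding the lock only around the shared update keeps the expensive per-item work outside the critical section, so the threads still run mostly in parallel.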

Comments:9

  1. Jonathan Rochkind
    March 11, 2010 at 11:35 pm

    Nice. You wrote that one? Ruby’s pretty sweet, huh?

  2. Jonathan Rochkind
    March 11, 2010 at 11:41 pm

    What’s the purpose of using a SizedQueue instead of an ordinary Queue? What if the producer produces so much faster than the consumers consume, that the threads*4 size is exhausted, what happens? Does the producer just block waiting for there to be room to enqueue?

  3. Bill
    March 12, 2010 at 12:04 am

    The assumption is that the producer is faster than the consumer (otherwise, why bother to have multiple consumers). A regular Queue (not sized) would grow without bound based on the speed difference between consumption and production. We don’t, for example, want 10K lines in memory while we’re waiting for consumers to turn them into MARC objects or whatnot.

    A SizedQueue will block on both enqueue (if it’s full) and dequeue (if there’s nothing in it), so it’s exactly what we need for this kind of thing.

  4. David
    March 18, 2010 at 4:48 pm

    Nice.

    Just call the iterator itself

    self.send(iterator) do |*args| blk.call *args end

    could be

    self.send(iterator, &blk)

  5. Bill
    March 18, 2010 at 11:36 pm

    Thanks — I’m obviously still translating from Perl in my head :-). Changed.

  6. Charles Nutter
    April 28, 2010 at 7:34 pm

    For what it’s worth, there’s a gem called “peach” (for “parallel each”) that basically does this same thing. It actually shook loose a few bugs in our Enumerable logic, where the iteration structures were not thread-safe (that’s long since been fixed).

    Nice example either way. JRuby + threads can really kick some ass :)

  7. Bill
    April 28, 2010 at 9:03 pm

    Yeah, I saw peach (which, among other things, left me scrambling for a different name), but was dissatisfied with its monkey-patching of only Array. My use cases mostly involve pulling stuff out of a file, so I wanted to hit up Enumerable directly.

    Is there a list somewhere of what’s thread-safe in JRuby?

  8. Ara T. Howard (@drawohara)
    May 29, 2012 at 9:48 am

    gem install threadify

    pull requests for jruby welcome.

  9. Bill
    May 29, 2012 at 11:10 am

    I remember seeing threadify at one point. My biggest issue with it is that it uses a non-blocking queue that can grow without bound. My use case involves a producer that’s (a) a lot faster than the consumers, and (b) pulls roughly ten million objects during a run. Anything not based on a size-limited queue doesn’t work for me, and that adds extra complexity that threach tries (mostly unsuccessfully) to address.
