Why bother with threading in jruby? Because it’s easy.

[Edit 2011-July-1: I’ve written a JRuby-specific version of threach, called jruby_threach, that takes advantage of better underlying Java libraries. It’s a much better option if you’re running JRuby.]

Lately on the #code4lib IRC channel, several of us have been knocking around different versions (in several programming languages) of programs to read in a ginormous file and do some processing on each line. I noted some speedups related to multi-threading, and someone (maybe rsinger?) said, basically, that to bother with threading for a one-off simple program was a waste.

Well, it turns out I’ve been trying to figure out how to deal with threading in jruby anyway. And I think I have a pretty elegant solution — a generic “threaded each” I’m calling threach.

  enumerable_object.threach(number_of_threads, :which_iterator) do |i|
    do_something_threadsafe(i)
  end

Some examples

  # You like #each? You'll love...err..probably like #threach
  load 'threach.rb'

  # Process with 2 threads. It assumes you want 'each'
  # as your iterator.
  (1..10).threach(2) {|i| puts i.to_s}

  # You can also specify the iterator
  File.open('mybigfile') do |f|
    f.threach(2, :each_line) do |line|
      processLine(line)
    end
  end

  # threach does not care what the arity of your block is
  # as long as it matches the iterator you ask for
  ('A'..'Z').threach(3, :each_with_index) do |letter, index|
    puts "#{index}: #{letter}"
  end

  # Or with a hash
  h = {'a' => 1, 'b'=>2, 'c'=>3}
  h.threach(2) do |letter, i|
    puts "#{i}: #{letter}"
  end

threach.rb adds to the Enumerable module to provide a threaded version of whatever enumerator you throw at it (each by default).

How does it work?

How about I just put the source here. It’s short.

  require 'thread'

  module Enumerable

    def threach(threads=0, iterator=:each, &blk)
      if threads == 0
        # Just call the iterator itself
        self.send(iterator, &blk)
      else
        # Bounded queue: the producer blocks once it is full
        bq = SizedQueue.new(threads * 4)
        consumers = []
        threads.times do |i|
          consumers << Thread.new do
            # Run the block on each item until the end marker shows up
            until (a = bq.pop) === :end_of_data
              blk.call(*a)
            end
          end
        end

        # The producer
        count = 0
        self.send(iterator) do |*x|
          bq.push x
          count += 1
        end
        # Now end it
        threads.times do
          bq << :end_of_data
        end
        # Do the join
        consumers.each {|t| t.join}
      end
    end
  end

That’s it. If threads=0, just use the iterator itself. If not:

  • Create a SizedQueue. It is thread-safe by definition and acts as the glue between the consumers and the main-thread producer.
  • Start a set of consumer threads that basically just pull an item out of the queue and then run the given block on it. Bail when you see the end_of_data token. These consumer threads all immediately block because there’s nothing in the SizedQueue yet.
  • Populate the SizedQueue. When you run out of stuff to add, push on an end_of_data token for each consumer thread.
  • Call join on each consumer thread so the main program waits until all of them have finished.
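The same four steps can be sketched outside of Enumerable, which makes the moving parts a little easier to see. This is only an illustration of the pattern, not part of threach itself: the process method, the results queue, and the sample data are all made up for the example.

```ruby
require 'thread'

# Hypothetical stand-in for real per-item work.
def process(item)
  item * 2
end

items   = (1..20).to_a
threads = 3
queue   = SizedQueue.new(threads * 4)  # producer blocks once 12 items are waiting
results = Queue.new                    # thread-safe place to collect output

# Consumers: each one pops items until it sees its own end marker.
consumers = Array.new(threads) do
  Thread.new do
    until (item = queue.pop) == :end_of_data
      results << process(item)
    end
  end
end

# Producer: the main thread feeds the queue, then sends one marker per consumer.
items.each { |item| queue.push(item) }
threads.times { queue.push(:end_of_data) }

# Wait for every consumer to finish before reading the results.
consumers.each(&:join)

doubled = []
doubled << results.pop until results.empty?
puts doubled.sort.inspect
```

The results come back in whatever order the threads finish, which is why the sketch sorts them before printing.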

Why use it?

Well, if you’re using stock ruby, you probably shouldn’t. Its threads don’t run ruby code in parallel, so for CPU-bound work the queueing overhead will just slow things down. But if you’re using a ruby implementation that has real threads, like JRuby, this will give you relatively painless multi-threading.

You can always do something like:

  if defined? JRUBY_VERSION
    numthreads = 3
  else
    numthreads = 0
  end

  my_enumerable.threach(numthreads) {|i| ...}

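Note the “relatively” up there. The block you pass still has to be thread-safe, and there are many data structures you’ll encounter that are not thread-safe. Reading scalars, arrays, and hashes from several threads is fine under JRuby, though, and that’ll get you pretty far. If your block modifies an array or hash that the threads share, guard that update with a Mutex.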
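If the block needs to update state shared by all the threads, one conservative approach is to wrap just that update in a Mutex. Here is a minimal sketch of that idea. The word list and the counts hash are invented for the example, and a compact copy of threach is included only so the snippet runs on its own.

```ruby
require 'thread'

# Compact copy of threach so this sketch is self-contained.
module Enumerable
  def threach(threads = 0, iterator = :each, &blk)
    return send(iterator, &blk) if threads == 0
    bq = SizedQueue.new(threads * 4)
    consumers = Array.new(threads) do
      Thread.new do
        until (a = bq.pop) === :end_of_data
          blk.call(*a)
        end
      end
    end
    send(iterator) { |*x| bq.push(x) }
    threads.times { bq << :end_of_data }
    consumers.each(&:join)
  end
end

counts = Hash.new(0)   # shared by every consumer thread
lock   = Mutex.new     # guards all writes to counts

%w[apple avocado banana blueberry cherry].threach(2) do |word|
  key = word[0]                          # thread-local work, no lock needed
  lock.synchronize { counts[key] += 1 }  # only the shared update is serialized
end

puts counts.inspect
```

Keeping the locked section that small means the threads still do most of their work in parallel and only queue up for the brief moment they touch the shared hash.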