Why bother with threading in jruby? Because it’s easy.
[Edit 2011-July-1: I've written a jruby_specific threach that takes advantage of better underlying java libraries called jruby_threach that is a much better option if you're running jruby]
Lately on the #code4lib IRC channel, several of us have been knocking around different versions (in several programming languages) of programs to read in a ginormous file and do some processing on each line. I noted some speedups related to multi-threading, and someone (maybe rsinger?) said, basically, that to bother with threading for a one-off simple program was a waste.
Well, it turns out I’ve been trying to figure out how to deal with threading in jruby anyway. And I think I have a pretty elegant solution — a generic “threaded each” I’m calling threach.
- enumerable_object.threach(number_of_threads, :which_iterator) do |i|
- do_something_threadsafe(i)
- end
Some examples
- # You like #each? You'll love…err..probably like #threach
- load 'threach.rb'
- # Process with 2 threads. It assumes you want 'each'
- # as your iterator.
- (1..10).threach(2) {|i| puts i.to_s}
- # You can also specify the iterator
- File.open('mybigfile') do |f|
- f.threach(2, :each_line) do |line|
- processLine(line)
- end
- end
- # threach does not care what the arity of your block is
- # as long as it matches the iterator you ask for
- ('A'..'Z').threach(3, :each_with_index) do |letter, index|
- puts "#{index}: #{letter}"
- end
- # Or with a hash
- h = {'a' => 1, 'b'=>2, 'c'=>3}
- h.threach(2) do |letter, i|
- puts "#{i}: #{letter}"
- end
threach.rb adds to the Enumerable module to provide a threaded
version of whatever enumerator you throw at it (each by default).
How does it work?
How about I just put the source here. It’s short.
- require 'thread'
- module Enumerable
- def threach(threads=0, iterator=:each, &blk)
- if threads == 0
- # Just call the iterator itself
- self.send(iterator, &blk)
- else
- bq = SizedQueue.new(threads * 4)
- consumers = []
- threads.times do |i|
- consumers << Thread.new do
- until (a = bq.pop) === :end_of_data
- blk.call(*a)
- end
- end
- end
- # The producer
- count = 0
- self.send(iterator) do |*x|
- bq.push x
- count += 1
- end
- # Now end it
- threads.times do
- bq << :end_of_data
- end
- # Do the join
- consumers.each {|t| t.join}
- end
- end
- end
That’s it. If threads=0, just use the iterator itself. If not:
- Create a SizedQueue. It is thread-safe by definition and acts as the glue between the consumers and the main-thread producer.
- Start a set of consumer threads that basically just pull an item out of the queue and then run the given block on it. Bail when you see the
end_of_datatoken. These consumer threads all immediately block because there’s nothing in the SizedQueue yet. - Populate the SizedQueue. When you run out of stuff to add, push on an
end_of_datatoken for each consumer thread. - Call
joinon the threads to keep the main program around when one of them exits.
Why use it?
Well, if you’re using stock ruby — you probably shouldn’t. It’ll just slow things down. But if you’re using a ruby implementation that has real threads, like JRuby, this will give you relatively painless multi-threading.
You can always do something like:
- if defined? JRUBY_VERSION
- numthreads = 3
- else
- numthreads = 0
- end
- my_enumerable.threach(numthreads) {|i| …}
Note the “relatively” up there. The block you pass still has to be thread-safe, and there are many data structures you’ll encounter that are not thread-safe. Scalars, arrays, and hashes are, though, under JRuby, and that’ll get you pretty far.
Tags:
Comments:9
Leave my own