[Edit 2011-July-1: I've written a JRuby-specific version of threach, called jruby_threach, that takes advantage of better underlying Java libraries. It's a much better option if you're running JRuby.]

Lately on the #code4lib IRC channel, several of us have been knocking around different versions (in several programming languages) of programs to read in a ginormous file and do some processing on each line. I noted some speedups related to multi-threading, and someone (maybe rsinger?) said, basically, that to bother with threading for a one-off simple program was a waste.

Well, it turns out I’ve been trying to figure out how to deal with threading in jruby anyway. And I think I have a pretty elegant solution — a generic “threaded each” I’m calling threach.

    enumerable_object.threach(number_of_threads, :which_iterator) do |i|
      do_something_threadsafe(i)
    end

Some examples

    # You like #each? You'll love…err..probably like #threach
    load 'threach.rb'

    # Process with 2 threads. It assumes you want 'each'
    # as your iterator.
    (1..10).threach(2) {|i| puts i.to_s}

    # You can also specify the iterator
    File.open('mybigfile') do |f|
      f.threach(2, :each_line) do |line|
        processLine(line)
      end
    end

    # threach does not care what the arity of your block is
    # as long as it matches the iterator you ask for

    ('A'..'Z').threach(3, :each_with_index) do |letter, index|
      puts "#{index}: #{letter}"
    end

    # Or with a hash
    h = {'a' => 1, 'b'=>2, 'c'=>3}
    h.threach(2) do |letter, i|
      puts "#{i}: #{letter}"
    end

threach.rb adds to the Enumerable module to provide a threaded version of whatever enumerator you throw at it (each by default).

How does it work?

How about I just put the source here. It’s short.

    require 'thread'
    module Enumerable

      def threach(threads=0, iterator=:each, &blk)
        if threads == 0
          # Just call the iterator itself
          self.send(iterator, &blk)
        else
          bq = SizedQueue.new(threads * 4)
          consumers = []
          threads.times do |i|
            consumers << Thread.new do
              until (a = bq.pop) === :end_of_data
                blk.call(*a)
              end
            end
          end

          # The producer
          count = 0
          self.send(iterator) do |*x|
            bq.push x
            count += 1
          end
          # Now end it
          threads.times do
            bq << :end_of_data
          end
          # Do the join
          consumers.each {|t| t.join}
        end
      end
    end

That’s it. If threads=0, just use the iterator itself. If not:

  • Create a SizedQueue. It is thread-safe by definition and acts as the glue between the consumers and the main-thread producer.
  • Start a set of consumer threads that basically just pull an item out of the queue and then run the given block on it. Bail when you see the end_of_data token. These consumer threads all immediately block because there’s nothing in the SizedQueue yet.
  • Populate the SizedQueue. When you run out of stuff to add, push on an end_of_data token for each consumer thread.
  • Call join on each consumer thread so the main program waits until all of them have finished.
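
The same four steps can be seen in isolation in a minimal, stdlib-only sketch. The method name `bounded_map_sketch` and the extra results queue are assumptions added for illustration; they are not part of threach, which simply runs your block and returns nothing useful.

```ruby
require 'thread'

# Hypothetical helper (not part of threach): run `work` on every item
# using `nthreads` consumers fed through a bounded queue.
def bounded_map_sketch(items, nthreads, &work)
  queue   = SizedQueue.new(nthreads * 4)  # the producer blocks when this is full
  results = Queue.new                     # thread-safe place to collect output

  consumers = Array.new(nthreads) do
    Thread.new do
      # pop blocks until something is available; stop at the sentinel
      until (item = queue.pop) == :end_of_data
        results << work.call(item)
      end
    end
  end

  items.each { |item| queue.push(item) }        # the producer (main thread)
  nthreads.times { queue.push(:end_of_data) }   # one sentinel per consumer
  consumers.each(&:join)                        # wait for every consumer

  out = []
  out << results.pop until results.empty?
  out
end

squares = bounded_map_sketch(1..10, 3) { |i| i * i }
puts squares.sort.inspect
```

The results come back in whatever order the consumers finish, which is why the example sorts them before printing.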

Why use it?

Well, if you’re using stock ruby — you probably shouldn’t. It’ll just slow things down. But if you’re using a ruby implementation that has real threads, like JRuby, this will give you relatively painless multi-threading.

You can always do something like:

    if defined? JRUBY_VERSION
      numthreads = 3
    else
      numthreads = 0
    end

    my_enumerable.threach(numthreads) {|i|}

Note the “relatively” up there. The block you pass still has to be thread-safe, and there are many data structures you’ll encounter that are not thread-safe. Scalars, arrays, and hashes are, though, under JRuby, and that’ll get you pretty far.
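
As a cautious illustration of what "thread-safe block" can mean in practice, here is a small sketch in which several threads update one shared tally. A compound update such as `tally[key] += 1` is a read followed by a write, so the sketch wraps it in a Mutex. The method name `tally_lengths_sketch` is made up for this example, and it uses plain threads rather than threach so it runs on its own.

```ruby
require 'thread'

# Hypothetical example: several threads update one shared hash.
# The Mutex makes each read-modify-write step atomic.
def tally_lengths_sketch(words, nthreads)
  tally = Hash.new(0)
  lock  = Mutex.new
  queue = Queue.new
  words.each { |w| queue << w }
  nthreads.times { queue << :done }   # one stop marker per thread

  threads = Array.new(nthreads) do
    Thread.new do
      until (word = queue.pop) == :done
        lock.synchronize { tally[word.length] += 1 }
      end
    end
  end
  threads.each(&:join)
  tally
end

p tally_lengths_sketch(%w[marc solr jruby threach each], 3)
```

Keeping the synchronized section this small means the threads only wait on each other for the update itself, not for the real work done on each item.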

7 Responses to “Why bother with threading in jruby? Because it’s easy.”

  1. Nice. You wrote that one? Ruby’s pretty sweet, huh?

  2. What’s the purpose of using a SizedQueue instead of an ordinary Queue? What if the producer produces so much faster than the consumers consume, that the threads*4 size is exhausted, what happens? Does the producer just block waiting for there to be room to enqueue?

  3. Bill says:

    The assumption is that the producer is faster than the consumer (otherwise, why bother to have multiple consumers). A regular Queue (not sized) would grow without bound based on the speed difference between consumption and production. We don’t, for example, want 10K lines in memory while we’re waiting for consumers to turn them into MARC objects or whatnot.

    A SizedQueue will block on both enqueue (if it’s full) and dequeue (if there’s nothing in it), so it’s exactly what we need for this kind of thing.

  4. David says:

    Nice.

    Just call the iterator itself

    self.send(iterator) do |*args| blk.call *args end

    could be

    self.send(iterator, &blk)

  5. Bill says:

    Thanks — I’m obviously still translating from Perl in my head :-). Changed.

  6. For what it’s worth, there’s a gem called “peach” (for “parallel each”) that basically does this same thing. It actually shook loose a few bugs in our Enumerable logic, where the iteration structures were not thread-safe (that’s long since been fixed).

    Nice example either way. JRuby + threads can really kick some ass :)

  7. Bill says:

    Yeah, I saw peach (which, among other things, left me scrambling for a different name), but was dissatisfied with its monkey-patching of only Array. My use cases mostly involve pulling stuff out of a file, so I wanted to hit up Enumerable directly.

    Is there a list somewhere of what’s thread-safe in JRuby?

[This is in response to a thread on the blacklight mailing list about getting MARC data into Solr.]

What’s the question?

The question came up, “How much time do we spend processing the MARC vs trying to push it into Solr?”. Bob Haschart found that even with a pretty damn complicated processing stage, pushing the data to solr was still, at best, taking at least as long as the processing stage.

I’m interested because I’ve been struggling to write a solrmarc-like system that runs under JRuby. Architecturally, the big difference between my stuff and solrmarc is that I use the StreamingUpdateSolrServer (on Erik Hatcher’s suggestion). So I thought I’d check how things break down for me.

Here are my numbers running under JRuby (using MARC4J as the marc implementation) with the Solr StreamingUpdateSolrServer. Obviously, there are a lot of differences between this and solrmarc, but I’m hoping that while it’s not comparing apples to apples, it’s at least comparing apples to some sort of processed cheese-like product.

What work is being done on what?

The data set is a file of 18,881 MARC records in marc-binary format. It’s probably not big enough to get a great idea of how things will run over the long (many millions of records) haul, but it’ll do for this rough-cut stuff.

I break my processing down into five categories:

  • Read the records into marc4j objects and do nothing. This is a baseline of sorts.
  • The “normal” fields are anything that you could do with SolrMarc without a custom routine; the actual processing is done in JRuby.
  • Custom fields are generated with JRuby code, but these are things that in solrmarc would require a custom routine.
  • The big “allfields” field is text from tags 100 through 900.
  • The “to_xml” routine is just calling the underlying marc4j XML output and stuffing it into a string.

The schema used is our normal UMICH schema except for High Level Browse (which appears in our catalog as “Academic Discipline”). The code for that is written in Java, and I just call it from JRuby when I’m using it. I excluded it because it’s incredibly expensive, both at startup time (when it loads a giant database of call-number ranges and associated categories) and for processing: there’s a lot of call-number normalization, long-string comparisons, some modified binary searches, etc. etc. etc. It’s expensive. Trust me.

The Solr server itself is on a different, incredibly-beefy machine, and is emptied out before each invocation that involves actually pushing data to it (with a delete-by-query :).

How fast were things on my desktop?

  • 18,881 records in marc-binary format
  • Times are in seconds, run on my desktop
  • Remember, you can’t compare these numbers to Bob’s because we’re doing different things to different data.

    Total seconds  Description
    19             Just read the records with marc4j and do nothing.
    85             Read and do 35 “normal” fields (no custom)
    104            Read, 35 normal, 15 custom fields
    110            Read, normal, custom, allfields
    129            Read, normal, custom, allfields, to_xml
    136            Read, normal, custom, allfields, to_xml, 2-threaded SUSS, commit every 5K docs
    142            Read, normal, custom, allfields, to_xml, 1-threaded SUSS, commit every 5K docs
    124            Read, normal, custom, allfields, to_xml, 1-threaded SUSS, commit every 5K docs, 2 threads doing processing

We can also break the same numbers down as:

    Seconds  Description
    19       read the records and do nothing
    66       process the 35 normal fields
    19       process the 15 custom fields
    6        generate the “allfields” field
    19       generate the XML (yowza!)
    7        send to solr with two threads
    13       send to solr with one thread
Or like this:

    Seconds  Description
    129      do all the reading and processing
    13       send to solr with one thread
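
The per-stage numbers are just differences between consecutive cumulative totals, which a few lines of Ruby can reproduce. The constant and method names here are made up for the sketch; the figures are the ones from the tables.

```ruby
# Cumulative wall-clock totals (seconds) from the first table.
CUMULATIVE = [
  ['read only',           19],
  ['+ 35 normal fields',  85],
  ['+ 15 custom fields', 104],
  ['+ allfields',        110],
  ['+ to_xml',           129]
].freeze

# Each stage's cost is its total minus the previous total.
def stage_costs(rows)
  previous = 0
  rows.map do |label, total|
    cost = total - previous
    previous = total
    [label, cost]
  end
end

stage_costs(CUMULATIVE).each { |label, cost| puts format('%3d  %s', cost, label) }

# Sending to Solr is measured against the 129-second run that does
# everything except talk to Solr.
puts format('%3d  send to solr, 2 SUSS threads', 136 - 129)
puts format('%3d  send to solr, 1 SUSS thread', 142 - 129)
```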

Why does solr processing seem so much faster for me?

There are a lot of reasons why my submit-to-solr might seem like less of a burden. The ones I can think of off the top of my head are:

  • SUSS is just faster than whatever solrmarc does.
  • My processing stage is so much slower than solrmarc’s (due to algorithms or jruby-vs-java, I don’t know) that the “push to solr” portion of it gets swallowed up by the slowness of the overall code.
  • The Solr server is so much faster than my desktop that my poor little desktop can’t send it data fast enough to work it.

For my setup, obviously adding a processing thread is a lot more beneficial than adding a SUSS thread. My desktop doesn’t have that many threads lying around (adding a third processing thread actually slowed things down), so I moved the code to a beefier machine to see what happened.

Trying the same thing on a beefy machine

This is the exact same code and data, but on a beefy machine (16 cores, gobs of memory).

    Seconds  SUSS threads  Processing threads
    70       1             1   (was 142 seconds on the desktop)
    47       1             2
    39       1             3
    35       1             4
    68       2             1
    48       2             2
    38       2             3
    34       2             4

So, on my hardware anyway, there’s a sweet spot with one suss thread and three processing threads. YMMV, of course.
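
One way to read that table is as speedup relative to the one-SUSS-thread, one-processing-thread run. The sketch below only does that arithmetic on the numbers above; the names `BEEFY_TIMES` and `speedup` are invented for the example.

```ruby
# Seconds on the 16-core machine, keyed by [suss_threads, processing_threads].
BEEFY_TIMES = {
  [1, 1] => 70, [1, 2] => 47, [1, 3] => 39, [1, 4] => 35,
  [2, 1] => 68, [2, 2] => 48, [2, 3] => 38, [2, 4] => 34
}.freeze

# Speedup relative to the baseline run (1 SUSS thread, 1 processing thread).
def speedup(times, key)
  (times.fetch([1, 1]).to_f / times.fetch(key)).round(2)
end

BEEFY_TIMES.each_key do |key|
  suss, processing = key
  puts format('%d SUSS / %d processing: %.2fx', suss, processing, speedup(BEEFY_TIMES, key))
end
```

Seen this way, each extra processing thread buys less than the one before it (47 to 39 seconds saves 8, 39 to 35 saves 4), while the second SUSS thread barely moves the numbers.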

What have we learned?

I’m not sure, to be honest. It’s logistically difficult for me to do the same process in solrmarc because I’d have to rebuild everything without the HLB stuff. I guess for me, what I’ve learned is that if I’m going to continue working on my code, the places to focus my attention are threading (obviously) and MARC-XML generation.

4 Responses to “Pushing MARC to Solr; processing times and threading and such”

  1. What’s HLB?

    Both ruby-marc and marc4j will generate marc-xml, but do you mean optimizing speed of it? (Don’t forget marc-json possibilities! heh).

    Not sure if you’re still happy with marc4j or might prefer ruby-marc, I realized one thing missing from the ruby stack (if you didn’t want to use marc4j) (as far as I know) is the marc8-utf8 conversion stuff, and heuristic guess detection of marc records that aren’t really the encoding they claim to be.

  2. Oh, I see, performance with toXML.

    What i wonder/worry about, is if the added time for toXML isn’t actually the serialization to xml, but simply that if you’re pushing a larger stored field to solr, that’s going to slow things down.

    We still need to store our marc either way, of course. The UWisconsin approach of storing marc in an rdbms instead of a solr stored field may or may not speed up indexing, since it’s still gonna take time to store it.

  3. Hey, I should read more carefully before I post, but instead I’ll just multi-post.

    I see the serialization to XML itself is non-trivial too.

    json!

ruby-marc with pluggable readers

March 2, 2010 at 1:55 pm

I’ve been messing with easier ways of adding parsers to ruby-marc’s MARC::Reader object. The idea is that you can do this:

    require 'marc'
    require 'my_marc_stuff'

    mbreader = MARC::Reader.new('test.mrc') # => Stock marc binary reader
    mbreader = MARC::Reader.new('test.mrc', :readertype=>:marcstrict) # => ditto

    MARC::Reader.register_parser(My::MARC::Parser, :marcstrict)
    mbreader = MARC::Reader.new('test.mrc') # => Uses My::MARC::Parser now

    xmlreader = MARC::Reader.new('test.xml', :readertype=>:marcxml)

    # …and maybe further on down the road

    asreader = MARC::Reader.new('test.seq', :readertype=>:alephsequential)
    mjreader = MARC::Reader.new('test.json', :readertype=>:marchashjson)

A parser need only implement #each and a module-level method #decode_from_string.
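
To make the registration idea concrete, here is a toy, self-contained sketch of a reader that looks up its parser by type. Everything in it (`SketchReader`, `LineParser`, the `:lines` type) is hypothetical and written for this illustration; it is not ruby-marc's actual implementation, only one plausible shape for it.

```ruby
# Hypothetical stand-in for the registration idea; NOT ruby-marc's code.
class SketchReader
  include Enumerable

  @parsers = {}
  class << self
    attr_reader :parsers

    # Associate a parser class with a reader type, replacing any earlier one.
    def register_parser(parser_class, type)
      parsers[type] = parser_class
    end
  end

  def initialize(source, opts = {})
    type = opts.fetch(:readertype, :marcstrict)
    parser_class = self.class.parsers.fetch(type) do
      raise ArgumentError, "no parser registered for #{type.inspect}"
    end
    @parser = parser_class.new(source)
  end

  # Delegate iteration to whichever parser was chosen.
  def each(&blk)
    @parser.each(&blk)
  end
end

# A toy parser that treats each line of a string as one "record".
class LineParser
  def self.decode_from_string(str)
    str.strip
  end

  def initialize(source)
    @source = source
  end

  def each
    @source.each_line { |line| yield self.class.decode_from_string(line) }
  end
end

SketchReader.register_parser(LineParser, :lines)
reader = SketchReader.new("rec one\nrec two\n", :readertype => :lines)
reader.each { |r| puts r }
```

Because the reader only ever calls `#each` on the parser, swapping in a different parser for an existing type is a one-line `register_parser` call.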

Read all about it on the github page.

3 Responses to “ruby-marc with pluggable readers”

  1. adam says:

    Bill, How is the performance as compared to other languages?

    • adam
  2. Bill says:

    Adam — not sure what you’re asking. Ruby vs. Perl? MARC-HASH-JSON vs. MARC-HASH-YAML?

  3. adam says:

    I was thinking ruby vs. perl vs. java

New interest in MARC-HASH / JSON

February 26, 2010 at 12:29 am

EDIT: This is historical — the recommended serialization for marc in json is now Ross Singer’s marc-in-json. The marc-in-json serialization has implementations in the core marc libraries for Ruby and PHP, and add-ons for Perl and Java. C’mon, Python people!

For reasons I’m still not entirely clear on (I wasn’t there), the Code4Lib 2010 conference this week inspired renewed interest in a JSON-based format for MARC data.

When I initially looked at MARC-HASH almost a year ago, I was mostly looking for something that wasn’t such a pain in the butt to work with, something that would marshall into multiple formats easily and would simply and easily round-trip.

Now, though, a lot of us are looking for a MARC format that (a) doesn’t suffer from the length limitations of binary MARC, but (b) is less painful (both in code and processing time) than MARC-XML, and it’s worth re-visiting.

For at least a few folks, un-marshaling time is a factor, since no matter what you’re doing, processing XML is slower than most other options. Much of that expense is due to functionality and data-safety features that just aren’t a big win with a brain-dead format like MARC, so it’s worth looking at alternatives.

What is MARC-HASH?

At some point, we’ll want a real spec, but right now it’s just this:

    # A record is a four-pair hash, as follows. UTF-8 is mandatory.
    {
      "type" : "marc-hash",
      "version" : [1, 0],
      "leader" : "…leader string … ",
      "fields" : [array, of, fields]
    }

    # A field is an array of either 2 or 4 elements
    [tag, value] # a control field
    [tag, ind1, ind2, [array, of, subfields]] # a data field

    # A subfield is an array of two elements
    [code, value]

So, a short example:

    {
      "type" : "marc-hash",
      "version" : [1, 0],

      "leader" : "leader string",
      "fields" : [
        ["001", "001 value"],
        ["002", "002 value"],
        ["010", " ", " ",
          [
            ["a", "68009499"]
          ]
        ],
        ["035", " ", " ",
          [
            ["a", "(RLIN)MIUG0000733-B"]
          ]
        ],
        ["035", " ", " ",
          [
            ["a", "(CaOTULAS)159818014"]
          ]
        ],
        ["245", "1", "0",
          [
            ["a", "Capitalism, primitive and modern;"],
            ["b", "some aspects of Tolai economic growth" ],
            ["c", "[by] T. Scarlett Epstein."]
          ]
        ]
      ]
    }
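
Since the structure is nothing but hashes, arrays, and strings, working with it needs no MARC library at all. The sketch below builds a trimmed-down record, pulls out a subfield, and checks that it survives a trip through JSON. The helper name `subfield_values` is an assumption made for the example, not part of any spec or library.

```ruby
require 'json'

# A trimmed-down record in the marc-hash layout.
RECORD = {
  'type'    => 'marc-hash',
  'version' => [1, 0],
  'leader'  => 'leader string',
  'fields'  => [
    ['001', '001 value'],
    ['245', '1', '0', [['a', 'Capitalism, primitive and modern;'],
                       ['b', 'some aspects of Tolai economic growth']]]
  ]
}.freeze

# Hypothetical helper: all values for a tag/subfield code, in record order.
def subfield_values(record, tag, code)
  record['fields']
    .select { |f| f[0] == tag && f.length == 4 }  # data fields only
    .flat_map { |f| f[3] }                        # their [code, value] pairs
    .select { |c, _v| c == code }
    .map { |_c, v| v }
end

puts subfield_values(RECORD, '245', 'a').first

# Round trip: generate JSON, parse it back, compare with the original.
round_tripped = JSON.parse(JSON.generate(RECORD))
puts round_tripped == RECORD
```

Fields and subfields stay in arrays, so their order is preserved through the round trip.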

How's the speed?

I think it's important to separate the format marc-hash from the eventual marshaling format -- partly because someone might actually want to use something other than JSON as a final format, but mostly because it forces implementations to separate hash-creation from json-creation and makes it easy to swap in different json libraries when a faster one comes along.

Having said that, in real life people are mostly concerned about JSON. So, let's look at JSON performance.

The MARC-Binary and MARC-XML files are normal files, as you'd expect. The JSON file is "Newline-Delimited JSON" -- a single JSON record on each line.
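
For readers who haven't met newline-delimited JSON, a minimal sketch of writing and reading it follows. The method names `write_ndj` and `each_ndj` are made up for the example, and a StringIO stands in for a real file. It relies on the fact that `JSON.generate` escapes newlines inside strings, so each record stays on one line.

```ruby
require 'json'
require 'stringio'

# Write one JSON document per line.
def write_ndj(io, records)
  records.each { |rec| io.puts(JSON.generate(rec)) }
end

# Read records back one line at a time, so only one record
# needs to be in memory at once.
def each_ndj(io)
  return enum_for(:each_ndj, io) unless block_given?
  io.each_line do |line|
    line = line.strip
    next if line.empty?
    yield JSON.parse(line)
  end
end

records = [
  { 'type' => 'marc-hash', 'version' => [1, 0], 'leader' => 'first leader',  'fields' => [['001', 'rec1']] },
  { 'type' => 'marc-hash', 'version' => [1, 0], 'leader' => 'second leader', 'fields' => [['001', 'rec2']] }
]

buffer = StringIO.new
write_ndj(buffer, records)
buffer.rewind
each_ndj(buffer) { |rec| puts rec['leader'] }
```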

The benchmark code looks like this:

  # Unmarshal
  x.report("MARC Binary") do
    reader = MARC::Reader.new('test.mrc')
    reader.each do |r|
      title = r['245']['a']
    end
  end

  # Marshal
  x.report("MARC Binary") do
    reader = MARC::Reader.new('test.mrc')
    writer = MARC::Writer.new('benchout.mrc')
    reader.each do |r|
      writer.write(r)
    end
    writer.close
  end

Under MRI, I used the nokogiri XML parser and the yajl JSON gem. Under JRuby, it was the jstax XML parser and the json-jruby JSON gem.

The test file is a set of 18,831 records I've been using for all my benchmarking of late. It's nothing special; just a nice size.

Marshalling Speed (read from binary marc, dump to given format)

Times are in seconds on my Macbook laptop, using ruby-marc.

    Format       Ruby 1.8.7  Ruby 1.9  JRuby 1.4  JRuby 1.4 --1.9
    XML          393         443       188        356
    MARC Binary  36          23        23         25
    JSON/NDJ     31          19        25         ERROR

Unmarshalling speed (from pre-created file)

Again, times are in seconds

    Format       Ruby 1.8.7  Ruby 1.9  JRuby 1.4  JRuby 1.4 --1.9
    XML          113         89        75         89
    MARC Binary  29          16        16         19
    JSON/NDJ     17          9         13         16

And so...

I'm not sure what else to say. The format is totally brain-dead. It round-trips. It's fast enough. It has no length limitations. We define it to have to be UTF-8. The NDJ format is easy to understand and allows streaming of large document collections.

If folks are interested in implementing this across other libraries, that'd be great. Any thoughts?

11 Responses to “New interest in MARC-HASH / JSON”

  1. What’s up with the ERROR on marshalling to json under jruby? Nothing there that shouldn’t work under jruby, I wouldn’t think? I’m confident that the actual performance metrics aren’t going to be different enough under jruby to effect the “win” conclusion, but we would of course want to make sure that ruby could do the serialization under jruby!

  2. [...] fact, I know of a couple people who had this idea of marc-json, but Bill Dueber did a little proto- mini- spec for a standard way to do marc in json, so different people writing tools can do it can be [...]

  3. Adding on to the proto-mini-spec, we should be clear that a ‘blank’ indicator is represented as a ascii space, yes?

    And likewise for the MARC “fill” character in fixed fields — which in marc8 is just ascii 7C, the “|” char, so I guess should still be represented as a |, it just probably deserves it’s own mention since it’s so weird.

  4. Naomi Dushay says:

    by “hash” do you mean “array”? Because order matters in Marc, but ruby hashes do not guarantee order, right?

  5. Bill says:

    Heh. Yeah, the major element in the hash is a big ol’ array of fields, composed of arrays of subfields. My original take on MARC-Hash was mostly as a hash, and resulted in my first real, painful schooling in how batshit-insane MARC is.

  6. GregPendlebury says:

    Bill, I’d be interested in hearing if anyone has put up a central site for this in terms of schema definition and/or collaboration. This is an area I think could gain some traction here, but going off on our own without a public schema doesn’t seem productive.

  7. [...] Should support output in Marc21, MarcXML, or Bill Dueber’s Marc in Json proto-spec. http://robotlibrarian.billdueber.com/new-interest-in-marc-hash-json/ [...]

  8. File_MARC 0.6.0 – now offering two tasty flavours of MARC-as-JSON output…

    I’ve just released the PHP PEAR library File_MARC 0.6.0. This release brings two JSON serialization output methods for MARC to the table: toJSONHash() returns JSON that adheres to Bill Dueber’s proposal for the array-oriented MARC-HASH JSON format at…

  9. [...] far we have two well-publicized suggestions: one by Bill Dueber, at the University of Michigan; and one by Andrew Houghton, who works at OCLC Research. They are quite different and each have [...]

  10. [...] have been a few proposals for a MARC data structure that can easily be serialized to JSON (I had my own, in fact), but the stuff Ross has done with marc-in-json is preferable in being (a) not a ton [...]

NOTE 2: It turns out that I did find a minor bug in the system, but that in general LCCN normalization is working correctly. I just happened to hit a weirdness with a bad LCCN and a little bug in the parser on their end. Which is getting fixed. So…good news all around, and huge kudos to Xiaoming Liu for his quick response!
**NOTE** It strikes me that I haven’t seen a case where bad data results from sending a valid LCCN. The only verified problem is one of false negatives. Send a valid lccn, you’ll get back either good data or nothing (and the “nothing” might be in error). So, still a big problem, but not as THESKYISFALLING as I imply below.

A long time ago, Jonathan Rochkind noted that the OCLC doesn’t correctly normalize their LCCNs.

Well, it’s not fixed.

I could really, really use the xlccn service right about now — a great web service they provide that, much like xisbn and xissn and the other xXXXX (heh!) services, purports to allow you to put in an lccn and get data back on the item you’re interested in.

Except they “normalize” their LCCNs in a way that is not only incorrect, but causes namespace collisions. As near as I can tell, they throw out any leading non-digits and only keep up to the next non-digit.

The xLCCN service will silently provide no data or incorrect data for many LCCN requests!

An example:

  • (F) Full LCCN is “sn 83011407”
  • (D) First set of digits is “83011407”. This is what I think the OCLC is indexing.
  • (N) Correct normalization is “sn83011407”

The problem, of course, is that (D) “83011407” is itself a valid LCCN.

  • (F) is associated with OCLC# 47212967
  • (D) is associated with OCLC# 12505148. That’s not the same record.

So, how do the OCLC services respond?

  • (F) Worldcat search finds correct (probably just doing a string match); xid finds nothing
  • (D) Worldcat finds both correct and incorrect records. The xLCCN service finds only the incorrect record, OCLC# 12505148.
  • (N) Neither worldcat nor xid finds anything for the correctly normalized version.

So, what am I supposed to do? Only use the service on LCCNs where the original and normalized versions are the same and include only digits? Frustrating.
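
To make the collision concrete, here is a sketch of both behaviours. `normalize_lccn` follows the Library of Congress normalization rules as generally described (remove blanks, drop a slash and everything after it, remove the hyphen and zero-pad what follows it to six digits). `first_digit_run` is only an imitation of the guess made above about what OCLC indexes; it is an assumption, not OCLC's actual code.

```ruby
# LC-style normalization: a sketch of the published rules, not a validator.
def normalize_lccn(raw)
  lccn = raw.gsub(/\s+/, '')        # 1. remove all blanks
  lccn = lccn.sub(%r{/.*\z}, '')    # 2. drop a slash and everything after it
  prefix, hyphen, serial = lccn.partition('-')
  return lccn if hyphen.empty?      # no hyphen: nothing more to do
  prefix + serial.rjust(6, '0')     # 3. drop the hyphen, zero-pad to 6 digits
end

# Imitation of the suspected behaviour: skip leading non-digits and
# keep only the first run of digits. (Assumption, for illustration.)
def first_digit_run(raw)
  raw[/\d+/]
end

['sn 83011407', '83011407'].each do |lccn|
  puts format('%-12s normalized: %-11s first digits: %s',
              lccn, normalize_lccn(lccn), first_digit_run(lccn))
end
```

The two inputs normalize to different strings but share the same first run of digits, which is exactly the namespace collision described in the example.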

One Response to “OCLC still not (NO! They are!) normalizing their LCCNs”

  1. Alice Sneary says:

    Thanks for sharing your frustration, Bill, and I’m glad you’ve called attention to it. I also see your posting to the OCLC Developer Network listserv, and we’re looking into things right now to get back with you.

[Note: in this post I'm just going to focus on the "get stuff into Solr" part. My normal focus -- MARC data -- will make an appearance in the next post when I talk about using this in addition to / instead of solrmarc.]

Working with Solr

I love me the Solr. I love everything about it except that the best way to interact with it is via Java. I don’t so much love me the java.

So…taking Erik Hatcher’s lead and advice, as I will do whenever he offers either, I wrote some code to work within JRuby to deal with Solr.

Getting the code

I’ve added the gems to gemcutter, if you want to play along at home:

  • jruby_producer_consumer (github, rdoc.info) Ruby syntax for threaded operations under jruby
  • jruby_streaming_update_solr_server (github, rdoc.info) Ruby syntax on top of the Java class of the same name
  • marc4j4r (github, rdoc.info) Ruby syntax on top of the marc4j java library.

WARNING: None of these gems have a 1.0 version tag on them, and that means that the API may change a titch in the future. Also, the fact that they’re released as gems means that it’s easy to release gems, not that I’m not an idiot.

The basics: Using SolrInputDocument and StreamingUpdateSolrServer

OK, with the disclaimer out of the way, let’s look at some code.

    require 'rubygems'
    require 'jruby_streaming_update_solr_server'

    solrurl = 'http://your.solr.server:port/solr'
    sussqueuesize = 24 # how many items to buffer on their way to solr
    sussthreads = 1   # how many threads to use to send stuff to solr

    suss = StreamingUpdateSolrServer.new(solrurl,sussqueuesize,sussthreads)

    # Let's add a simple document via a hash: A title, three authors, and a year

    h = {
      :title => "Never been deader",
      :author => ['Bill', 'Mike', 'Molly'],
      :year => 2003
    }
    suss << h
    suss.commit

    # YEA! You just added a document to solr and committed it.
    # Have a cookie!

    # We can also use a document object to do the same thing

    doc = SolrInputDocument.new
    # Add the title
    doc << ['title', 'Never been deader']

    # Add the first author
    doc << [:author, 'Bill']

    # Add more. Re-used keys mean you're adding additional values
    # Note values can be scalars or arrays

    doc << [:author, ['Mike', 'Molly']]

    # Add the wrong year using [] syntax
    doc[:year] = 2001

    # Oops! fix it. []= overwrites existing value(s)

    doc[:year] = 2003

    # Finally, we can merge a hash (or anything else that responds to
    # 'each_pair' with key-value pairs) into an existing doc.
    # The parentheses are required; without them Ruby parses the braces as a block.

    doc.merge!({'author' => 'Ringo Starrre', 'publisher'=>'Vainity Books'})

    # Add it

    suss << doc

    # Commit and optimize if you'd like

    suss.commit
    suss.optimize # if you want

Nothing really fancy in there — just a few things worth noting:

  • A suss object will take a hash (again, anything that responds to #each_pair) or a SolrInputDocument
  • You can use either strings or symbols to represent Solr field names
  • Values can be either a single value, or an array of multiple values

And there are three ways to get data into a doc:

  • Via << [field, value(s)] (additive)
  • Via doc.merge! hash (additive)
  • Via doc[field] = value (replaces)
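
The difference between the additive and replacing forms is easy to see in a toy model. The class below is a hypothetical pure-Ruby stand-in written for this illustration; it is not the gem's SolrInputDocument and talks to nothing, it only mimics the three behaviours listed above.

```ruby
# Hypothetical stand-in, NOT the gem's SolrInputDocument.
class SketchDoc
  def initialize
    @fields = Hash.new { |h, k| h[k] = [] }
  end

  # doc << [field, value_or_values] adds to whatever is already there.
  def <<(pair)
    field, values = pair
    @fields[field.to_s].concat(Array(values))
    self
  end

  # doc[field] = value throws away existing values first.
  def []=(field, values)
    @fields[field.to_s] = Array(values)
  end

  def [](field)
    @fields[field.to_s]
  end

  # merge! is additive, one key-value pair at a time.
  def merge!(pairs)
    pairs.each_pair { |field, values| self << [field, values] }
    self
  end
end

doc = SketchDoc.new
doc << [:author, 'Bill']
doc << ['author', ['Mike', 'Molly']]   # strings and symbols name the same field
doc[:year] = 2001
doc[:year] = 2003                      # replaces 2001
doc.merge!('author' => 'Ringo')        # adds a fourth author
p doc[:author]
p doc[:year]
```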

Adding Threads

I also went down the garden path of threading things. There are an awful lot of operations that are not threadsafe (e.g., reading a line from a file) but once you’ve got a bunch of records to work with, turning them into Solr documents is usually thread-safe.

My model is that there’s a producer (usually the method #each) from an underlying data object. A thread takes whatever that method yields and sticks the values into a Java BlockingQueue awaiting consumption. You then use ProducerConsumer#threaded_each (or ProducerConsumer#threaded_each_with_index) to pull items out of the queue and do something useful with them.

I extracted stuff into a library (jruby_producer_consumer) for your viewing pleasure.

CONFUSION ALERT: It’s perhaps unfortunate that the object you send to ProducerConsumer.new(obj) must implement #each and that the ProducerConsumer method #threaded_each calls that underlying #each…well there’s a lot of #each‘s floating around. Keep them straight.

So…let’s look at some code to work with consumer threads.

    # Start off the same as before
    require 'rubygems'
    require 'jruby_streaming_update_solr_server'
    require 'jruby_producer_consumer'
    require 'marc4j4r'

    solrurl = 'http://your.solr.server:port/solr'
    sussqueuesize = 24 # how many items to buffer on their way to solr
    sussthreads = 2   # how many threads to use to send stuff to solr

    suss = StreamingUpdateSolrServer.new(solrurl,sussqueuesize,sussthreads)

    # I'll go ahead and use a MARC file as my example, but won't talk about the
    # MARC parts of it. All you need to know is that the reader object
    # implements #each

    reader = MARC4J4R.reader('test.xml', :marcxml)

    # Get a producer/consumer object with the reader at its base, using
    # the default method #each to get stuff out of it, and with the assumption
    # that we only need to keep the default 5 items in memory at a time to
    # keep up with consumption

    pc = ProducerConsumer.new(reader)

    # Get three threads to actually consume the things, turn them into solr
    # documents, and send them to solr (potentially out of order)

    numconsumerthreads = 3
    pc.threaded_each(numconsumerthreads).each do |r|
      suss << turn_marc_record_into_a_hash_or_solrdoc(r)
    end
    suss.commit

Again, not a lot happening here.

  • The “producer” is always one thread, because so little is thread-safe at the ‘each’ level. In this case, there’s a single thread pulling data out of the file and turning it into MARC records, which are added to the internal BlockingQueue. I buffer 5 of these at a pop (the default) so the consumer threads don’t starve. I presume that producing items is cheaper than consuming them, or else this library won’t help you much.
  • ProducerConsumer#threaded_each calls the #each method of the underlying object. You can substitute anything that yields, though, as in this example where I call #each_line instead of the default #each:

        queuesize = 5
        pc = ProducerConsumer.new(File.new('myfile.txt'), queuesize, :each_line)

  • Keep track of your threads. In this last example, there is one thread getting MARC records and putting them into the PC buffer (no way to change that), three threads consuming those records and sticking them into the suss object, and another two pulling stuff out of the suss object and sending things to Solr. And, of course, there’s other stuff running on the computer, too. Experiment and figure out what works best for your hardware.
  • See the docs for how to mess with what goes into a ProducerConsumer object. It’s entirely possible to use, say, #each_slice. There’s also a convenience method #threaded_each_with_index, but it does not call the underlying #each_with_index, it produces its own index as things are read.
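
The model in these notes (one producer thread, a small bounded buffer, several consumers, and an index assigned as items are read) can be sketched in plain Ruby. This is not the gem's code: it uses the stdlib SizedQueue where the gem uses a Java BlockingQueue, and the method name `sketch_threaded_each_with_index` is invented for the illustration.

```ruby
require 'thread'

# Hypothetical pure-Ruby sketch of the producer/consumer model.
def sketch_threaded_each_with_index(source, nconsumers, queuesize = 5, iterator = :each)
  queue = SizedQueue.new(queuesize)   # bounded buffer between the two sides

  # Exactly one producer: it alone touches the underlying iterator.
  producer = Thread.new do
    index = 0
    source.send(iterator) do |item|
      queue.push([item, index])       # index reflects read order
      index += 1
    end
    nconsumers.times { queue.push(:done) }
  end

  # Several consumers: each pulls [item, index] pairs until told to stop.
  consumers = Array.new(nconsumers) do
    Thread.new do
      until (pair = queue.pop) == :done
        yield(*pair)
      end
    end
  end

  (consumers + [producer]).each(&:join)
end

seen = Queue.new
sketch_threaded_each_with_index(%w[a b c d], 2) { |item, i| seen << "#{i}:#{item}" }
lines = []
lines << seen.pop until seen.empty?
puts lines.sort
```

The consumers may finish items out of order, but each item carries the index it was given when it was read, so the original order can be recovered afterwards if it matters.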

Feedback not only welcome but necessary!

I’ve done a lot of messing around with Ruby in the last 10 days or so, but I’m still basically converting from Perl in my head. Any comments, bug reports, or whatnot are definitely welcome!

Yea! My first gem ever released!

[YUCK! It was a disaster in a few ways! Don't look at this! It's hideous! There's a new jruby_producer_consumer gem on gemcutter that is slightly different from this in that it works. Ignore the stuff below.]

[In working on a threaded JRuby-based MARC-to-Solr project, I realized that my threading stuff was...ugly. And I didn't really understand it. So I dug in today and wrote this.]

I’ve just pushed my first gem to Gemcutter: jruby_producer_consumer, a JRuby-only producer/consumer class that works with anything that provides #each.

It’s JRuby-only because it uses (a) a blocking queue implementation that’s native Java, and (b) threading, which isn’t a huge win under regular Ruby.

There’s no testing there because I’m not sure how to test threaded stuff :-(

It is, I hope, easy to use:

    require 'rubygems'
    require 'jruby_producer_consumer'

    # Create a ProducerConsumer. Arguments are anything that implements #each
    # and the size for the underlying queue. For the former, I'll just use a Range object.

    eachable = 1..10
    queuesize = 3

    pc = ProducerConsumer.new(eachable, queuesize)

    # Just a method to show what happens
    def sample(consumerid, x)
      puts "Consumer #{consumerid}: consuming #{x}"
      sleep 1 # otherwise this'll finish before I can create multiple consumers
    end

    # Create three consumers. You can pass any number of args to
    # #consumer, and must pass a block whose arguments are the
    # object returned by eachable#each and those args back.

    ['A', 'B', 'C'].each do |consumerid|
      pc.consumer(consumerid) do |x, consumerid|
        sample(consumerid, x)
      end
    end

    # OUTPUT
    # Consumer A: consuming 1
    # Consumer B: consuming 2
    # Consumer C: consuming 3
    # Consumer A: consuming 4
    # Consumer B: consuming 5
    # Consumer C: consuming 6
    # Consumer B: consuming 7
    # Consumer A: consuming 8
    # Consumer C: consuming 9
    # Consumer B: consuming 10


I’ve been looking at making a jruby-based solr indexer for MARC documents, and started off wanting to make sure I could determine if anything I did would be faster than our existing (solrmarc-based) setup.

Assertion: The upper bound on how fast I can process records and send them to Solr can be approximated by looking at how fast I can parse (and do nothing else to) marc records from a file.

Assertion: If I can’t write a system that’s faster than what we have now, it’s probably not worth my time even though being able to fall back to ruby instead of java would be nice.

The Big Question: Is the MARC parsing process fast enough that it seems I might be able to write a system that runs faster than the solrmarc setup I have now?

The Answer (see below): Yes, if I use marc4j.

On our ridiculously-awesome hardware, right now we’re doing about 300 records/second for short files and 250 records/second for a full (6.5 million record) index, giving us a 7-8 hour reindex.

I’ll just post the results without a lot of commentary. I warmed stuff up in all cases, and ran on my desktop (so I could compare to MRI ruby, which isn’t installed on the server) and on the server where we usually run these things.

  • The machines are my desktop OSX machine and the beefy linux server where we usually do this stuff
  • The platforms are jruby 1.4 --server and MRI ruby 1.8.7
  • The libraries are marc4j and ruby-marc 0.3.3
  • The parsers are
    • The standard binary parsers all around
    • A home-grown AlephSequential format reader for the ‘seq’ type. AlephSequential is a MARC representation that uses one line for each field. We use it because it doesn’t have length limitations and, not surprisingly, Aleph can spit it out pretty quickly compared to MARC-XML.
    • Whatever marc4j uses internally for MARC-XML
    • ruby-marc’s ‘jstax’ xml parser under jruby (which I wrote and apparently needs some love, see below)
    • ruby-marc’s ‘libxml’ xml parser under MRI ruby
  • Seconds is the average of two rounds, with measurements taken after a warmup run in each case.

The test files were 18,881 records in marc-xml, marc-binary, and AlephSequential formats.

MACHINE  PLATFORM  LIBRARY    PARSER    SECONDS  REC/SECOND
desktop  jruby     marc4j     binary       4.06        4650
desktop  jruby     marc4j     xml          5.55        3401
desktop  jruby     ruby-marc  binary      17.35        1088
desktop  jruby     ruby-marc  jstax       80.11         236

desktop  ruby      ruby-marc  binary      33.54         562
desktop  ruby      ruby-marc  libxml      46.87         402

server   jruby     marc4j     binary       2.29        8245
server   jruby     marc4j     xml          3.36        5619
server   jruby     marc4j     AlephSeq     3.68        5130
server   jruby     ruby-marc  binary       9.93        1901
server   jruby     ruby-marc  jstax       44.56         424
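
For anyone who wants to check the arithmetic, the REC/SECOND column is just the 18,881 test records divided by the averaged seconds, and the same division gives a rough reindex time for the full 6.5 million records. A small sketch (the method and constant names are mine, purely for illustration):

```ruby
RECORDS_IN_TEST_FILE = 18_881      # size of each test file, from the post
FULL_INDEX_RECORDS   = 6_500_000   # size of the full index, from the post

# Records parsed per second for a run that took `seconds`.
def records_per_second(seconds, records = RECORDS_IN_TEST_FILE)
  records / seconds.to_f
end

# Hours needed to push the whole index through at `rate` records/second.
def hours_for_full_index(rate, records = FULL_INDEX_RECORDS)
  records / rate.to_f / 3600
end

server_marc4j_binary = records_per_second(2.29)
puts format('server, marc4j, binary: %.0f rec/s', server_marc4j_binary)
puts format('current setup at 250 rec/s: %.1f hours', hours_for_full_index(250))
puts format('parse-only time at the marc4j rate: %.1f minutes',
            hours_for_full_index(server_marc4j_binary) * 60)
```

At 250 records/second the full index works out to a bit over 7 hours, which lines up with the 7-8 hour reindex mentioned earlier. Parsing alone at the server's marc4j binary rate would take a little over 13 minutes, so parsing is nowhere near the bottleneck. Because the seconds are rounded averages, recomputed rates can differ from the table by a record or so.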

The quick takeaways, with all the obvious caveats:

  • jruby with ruby-marc is twice as fast at binary and twice as slow at xml compared with MRI
  • marc4j is four times as fast for binary and about an order of magnitude faster for xml compared with ruby-marc.
  • The server is fast.

We know from previous experience that libxml is the fastest of the current MRI-based marc-xml readers and that jstax is the best of the current jruby-based marc-xml readers. And, finally, we know that many of us can’t use marc-binary format because our records are too big.

If I’m gonna use jruby (which I think I am due to wanting to use the StreamingUpdateSolrServer) I’m gonna need to use marc4j and just wrap it up in some nicer syntax.

5 Responses to “Still another look at MARC parsing in ruby and jruby”

  1. These kinds of numbers always make my head swim a bit.

    The original question you had, if I understand right, was if you could parse MARC at 250/300 records per second. But your findings aren’t expressed in records per second. Is it possible to say how they add up in records per second, so we know which methods are fast enough to meet your original criteria and which aren’t?

    (You must have a really fast machine to get 250/300 records per second from SolrMarc. I only get 100 records per second on my machine.)

  2. Oh wait, I see it now! Nevermind.

  3. So one thing about marc-ruby, not in parsing but in analyzing/processing, is every time you ask for a tag (say, ‘245’), it’s got to iterate through every field in the record, and match each one to see if it’s a 245. If you’re doing a lot of ‘mapping’ to a lot of records…. I’ve wondered for a while if there’s a bottleneck there, and how much difference it would make to build a hash ‘index’ of tags. Almost all ‘mapping’ operations begin by looking up tag numbers, and the majority pretty much end there too.

    But I’ve never gotten around to trying to profile it.

    Curious how marc4j is implemented in terms of access to marc record fields by tag.

    Since your tests reveal that even ruby-marc is faster under jruby than mri, there’s something just about the environment that is speeding things up. But I wonder if there are optimizations (perhaps simple ones) that could be made to ruby-marc to make it catch up with marc4j.

  4. The XML performance in ruby-marc is almost certainly due to whatever XML parsing library it uses. Do you know what library that might be?

  5. Bill says:

    The XML parsing code for ruby-marc in jruby is a stax-based implementation I wrote from a position of intense ignorance. Nabbing the xml-parser out of the marc4j code is probably a next step when I get time. For the moment I’ve just re-opened the various marc4j classes to add syntactic sugar where necessary.

    BTW, can I tell you how much jruby rocks????

MAJOR CHANGE

So, initially, this post listed that the way to separate multiple simultaneous requests was with a nice, URL-like slash (/) character.

Then, I remembered that LCCNs can have embedded slashes, e.g., 65063380//r85.

So, we’re back to using pipe (|) characters to separate multiple calls — the examples below have been updated to reflect this.

Introduction

I’ve put up a beta version of the HathiTrust Volumes API previously discussed on this blog and via email.

Currently, I’ve only got json output, although there is space in there for other output formats as necessary.

What exactly is this?

Given an identifier or set of identifiers, this API returns a set of matched records and a sorted list of the items available in the HathiTrust.

Useful, for example, if you want to display HathiTrust holdings alongside your own in your OPAC.

Simple, single-value call

Given the URL:

http://catalog.hathitrust.org/api/volumes/oclc/15420548.json

You’ll get the following back:

    {
        "records":
        {
            "000791709":
            {
                "recordURL":"http://catalog.hathitrust.org/Record/000791709",
                "titles":
                [
                    "\"Zhong gong dang shi\" fu dao /",
                    "\u300a\u4e2d\u5171\u515a\u53f2\u300b\u8f85\u5bfc /"
                ],
                "isbns": [],
                "issns": [],
                "oclcs": ["15420548"],
                "lccns": []
            }
        },
        "items":
        [
            {
                "orig":"University of Michigan",
                "fromRecord":"000791709",
                "htid":"mdp.39015058510069",
                "itemURL":"http://hdl.handle.net/2027/mdp.39015058510069",
                "rightsCode":"ic",
                "lastUpdate":"00000000",
                "enumcron":false
            }
        ]
    }

Note that the ‘records’ are keyed on the local umid, also available in the ‘fromRecord’ field of each item.

The generic short form is:

http://catalog.hathitrust.org/api/volumes/(idtype)/id.(outputtype)

Right now the valid idtypes are:

  • issn (will be normalized to just digits, no leading zeros)
  • isbn (will be normalized to an ISBN-13)
  • oclc (will be normalized to all digits, no leading zeros)
  • lccn (will be normalized as recommended)
  • htid (HathiTrust item id, seen above as “mdp.39015058510069”)
  • umid (the University of Michigan record ID, seen above in the “fromRecord” field of an item)

Currently the only valid outputtype is ‘json’.
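
As a rough illustration of the short form, here is a sketch of a Ruby client. The helper names (short_volumes_url, fetch_volumes) and the constants are invented for this example, and the fetch only works if the beta endpoint is reachable and still answers in the shape shown above.

```ruby
require 'json'
require 'net/http'
require 'uri'

HT_VOLUMES_BASE = 'http://catalog.hathitrust.org/api/volumes'
VALID_IDTYPES   = %w[issn isbn oclc lccn htid umid].freeze

# Build the short-form URL: /api/volumes/(idtype)/id.(outputtype)
def short_volumes_url(idtype, id, outputtype = 'json')
  unless VALID_IDTYPES.include?(idtype.to_s)
    raise ArgumentError, "unknown idtype: #{idtype}"
  end
  "#{HT_VOLUMES_BASE}/#{idtype}/#{id}.#{outputtype}"
end

# Fetch and parse one short-form response (makes a network call).
def fetch_volumes(idtype, id)
  body = Net::HTTP.get(URI(short_volumes_url(idtype, id)))
  JSON.parse(body)
end

puts short_volumes_url(:oclc, '15420548')

# Uncomment to hit the live service:
# data = fetch_volumes(:oclc, '15420548')
# data['items'].each { |item| puts item['itemURL'] }
```

Normalization of the identifiers happens on the server side, so the sketch passes them through untouched.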

More complex, multi-valued call

The full API URL looks like this:

http://catalog.hathitrust.org/api/volumes/json/id:1;oclc:45678;lccn:70628581|id:2;isbn:1591581613

This is a request for data on two separate items, identified on the calling end as simply ‘1’ and ‘2’ (id:1 and id:2). The first item is searched for using both an oclc number and an lccn; the second supplies only an isbn.

Note that

  • The output format (json) has moved to appear right after the ‘/volumes/’
  • There’s an arbitrary ‘id’ field. This will be used to index the return values, so use something meaningful on your end.
  • keys and values are separated by colons. Key-Value pairs are separated by semi-colons.
  • Separate requests are separated by ‘|’ in the URL, allowing you to request data for an arbitrary number of items with a single call.
  • Return values are
  • Matches follow the “#3” option on the old post, the “Must match if present” option — basically, if you supply an identifier and a record has one of those identifiers, they must match.
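
Putting those rules together, a full-form URL can be assembled mechanically. This is only a sketch: full_volumes_url is a made-up helper name, it uses the pipe separator from the MAJOR CHANGE note at the top of this post, and it relies on Ruby 1.9+ hashes keeping insertion order.

```ruby
# Hypothetical helper: each request is a hash holding an arbitrary 'id'
# plus one or more identifiers to search on.
def full_volumes_url(requests, outputtype = 'json')
  specs = requests.map do |request|
    # key:value pairs, joined by semi-colons
    request.map { |key, value| "#{key}:#{value}" }.join(';')
  end
  # separate requests are joined by pipes
  "http://catalog.hathitrust.org/api/volumes/#{outputtype}/#{specs.join('|')}"
end

url = full_volumes_url([
  { 'id' => 1, 'oclc' => '45678', 'lccn' => '70628581' },
  { 'id' => 2, 'isbn' => '1591581613' }
])
puts url
```

Some HTTP clients are picky about a raw pipe in a URL, so you may need to percent-encode it as %7C before sending the request.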

So, in the example, the first request has both an oclc number and an lccn. Matches are as follows:

  • If a record has an oclc number but no lccn, its oclc number must match the passed oclc number.
  • If a record has an lccn but no oclc number, its lccn must match the passed lccn value.
  • If a record has both an lccn and an oclc number, both its identifiers must match the passed values.
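
In code, the “must match if present” rule might look something like the sketch below. This is my paraphrase of the three cases, not the server's implementation: the method name is invented, identifiers are assumed to be normalized already, and I am assuming that a record sharing no identifier type with the request does not match.

```ruby
# record and query both map an identifier type to an array of
# already-normalized values, e.g. { 'oclc' => ['45678'] }.
def must_match_if_present?(record, query)
  # Only identifier types that the caller passed AND the record has.
  shared = query.keys.select { |type| !Array(record[type]).empty? }
  return false if shared.empty? # assumption: nothing in common, no match

  # Every shared type must agree on at least one value.
  shared.all? do |type|
    (Array(record[type]) & Array(query[type])).any?
  end
end

query = { 'oclc' => ['45678'], 'lccn' => ['70628581'] }

puts must_match_if_present?({ 'oclc' => ['45678'] }, query)
puts must_match_if_present?({ 'oclc' => ['45678'], 'lccn' => ['99999999'] }, query)
```

The first call is a record with only an oclc number that matches; the second has both identifiers but a different lccn, so it is rejected.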

The returned structure is keyed on the arbitrary id passed in the search string (if not present, the whole search string will be used instead):

    {
        "1":
        {
            "records":
            {
                "001474331":
                {
                    "recordURL":"http://catalog.hathitrust.org/Record/001474331",
                    "titles":
                    ["Some aspects of seventeenth-century medicine &amp; science; papers read at a Clark Library seminar, October 12, 1968"],
                    "isbns": [],
                    "issns": [],
                    "oclcs": ["00045678"],
                    "lccns": ["70628581 //r86"]
                }
            },
            "items":
            [{
                    "orig":"University of Michigan",
                    "fromRecord":"001474331",
                    "htid":"mdp.39015004074095",
                    "itemURL":"http://hdl.handle.net/2027/mdp.39015004074095",
                    "rightsCode":"ic",
                    "lastUpdate":"20090713",
                    "enumcron":false
                }]
        },
        "2":
        {
            "records":
            {
                "004370624":
                {
                    "recordURL":"http://catalog.hathitrust.org/Record/004370624",
                    "titles":
                    ["ARBA in-depth. Philosophy and religion /"],
                    "isbns":
                    ["1591581613"],
                    "issns": [],
                    "oclcs": ["53462174"],
                    "lccns": ["2003065945"]
                }
            },
            "items":
            [{
                    "orig":"University of Michigan",
                    "fromRecord":"004370624",
                    "htid":"mdp.39015058261911",
                    "itemURL":"http://hdl.handle.net/2027/mdp.39015058261911",
                    "rightsCode":"ic",
                    "lastUpdate":"20090907",
                    "enumcron":false
             }]
        }
    }

Enumeration / Chronology

An effort is made to return items in “enumcron order” — hopefully, with earlier volumes showing up before later volumes. The full enumcron is listed in the items if you need to try something different.

JSONP Support

JSONP output is supported — just throw a ‘&callback=blahblahblah’ on the end of the URL you call and you’ll get a function definition back.

Some examples:

http://catalog.hathitrust.org/api/volumes/oclc/15420548.json&callback=myfunc

http://catalog.hathitrust.org/api/volumes/json/id:1;oclc:45678;lccn:70628581|id:2;isbn:1591581613&callback=myfunc

2 Responses to “Beta version of the HathiTrust Volumes API available”

  1. Looks good!

    Can you document how we determine the access level for an item? (Full text, search only, or… is that it, or do you have some with no access?). That’s the “rightscode”? Can you document the vocabulary used there?

  2. Stephanie Collett says:

    Hi Jonathan, you can find that information in the attributes subsection of the database layout section of the rights database document.

    http://www.hathitrust.org/rights_database#DatabaseLayout

Running Blacklight under JRuby

November 17, 2009 at 11:35 pm
Category: Uncategorized

I decided to see if I could get Blacklight working under JRuby, starting with running the test suite and working my way up from there.

There was much pain. Much, much pain. Exacerbated by my almost complete lack of knowledge about what I was doing.

This is the procedure I eventually arrived at — if there are places where I made trouble for myself, please let me know!

[And does anyone know how to get jruby's nokogiri to link to a different libxml and stop with the crappy libxml2-version error message every time I run it under OSX???]

Download jruby

Go to jruby.org and download a binary distribution. Extract the tar.gz (or zip or whatever)

I’ll put mine in ~/jruby. Or, at least that’s what I’ll tell you.

tar xzf jruby-1.4.tar.gz

To avoid confusion, let’s make jrake an alias for rake and add the jruby bin directory to the path

cd ~/jruby/bin
ln -s rake jrake
export PATH=`pwd`:$PATH

Download Blacklight

git clone git://github.com/projectblacklight/blacklight.git

Again, we’ll say that I put this in ~/blacklight/

Muck with Blacklight dependencies

Edit the file init.rb to comment out references to libxml and ruby-xslt, as well as nokogiri. My understanding is that the first two are used, at this point, only for the EAD stuff. Both rely on libxml2 which is a C-extension and hence unavailable to JRuby.

Nokogiri gets pulled in during other installs and for some reason jrake will complain later on that it’s got a wrong version or something. So, we’ll just work without that particular net for now.

#### File ~/blacklight/init.rb
# config.gem 'libxml-ruby', :lib=>'libxml', :version=>'1.1.3'
# config.gem 'ruby-xslt', :lib=>'xml/xslt', :version=>'0.9.6'
# config.gem 'nokogiri', :version=>'1.3.3'

Do some initial installs

jgem install -v=2.3.4 rails 
jgem install activerecord-jdbc-adapter jdbc-sqlite3 \
             activerecord-jdbcsqlite3-adapter ActiveRecord-JDBC
jgem install rcov -s http://gemcutter.org --no-rdoc --no-ri
jrake
jrake gems:install

Edit the config/database.yml file

…to change the adapter to jdbcsqlite3 for development and testing.

Edit the databases.rake file

This one was harder to track down. The default rake task has hard-coded database names in the .rake file — jdbcsqlite3 isn’t included. I keep seeing things saying, “Oh, yeah, that’s been fixed…” but, well, it wasn’t for me. I had to do it by hand.

edit ~/jruby/lib/ruby/gems/1.8/gems/rails-2.3.4/lib/tasks/databases.rake

You need to find everywhere there’s a

when "sqlite", "sqlite3" # or when /^sqlite/ in one case

…and change it to

when "sqlite", "sqlite3", "jdbcsqlite3"

Repeat for other databases you want to use (e.g., mysql). For the moment, since I’m only worried about running jrake spec, that’s all I’m gonna do.
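
To see why the edit matters, here is a tiny stand-in for that kind of adapter dispatch. It is not the actual Rails source; the method and constant names are made up, and it only shows how a hard-coded list lets "jdbcsqlite3" fall through.

```ruby
STOCK_SQLITE_ADAPTERS   = %w[sqlite sqlite3].freeze
PATCHED_SQLITE_ADAPTERS = %w[sqlite sqlite3 jdbcsqlite3].freeze

# Hypothetical stand-in for a `case config['adapter']` in a rake task:
# returns true only if the adapter lands in the sqlite branch.
def sqlite_branch?(adapter, known_adapters)
  case adapter
  when *known_adapters then true
  else false # the real task would complain that the adapter isn't supported
  end
end

puts sqlite_branch?('jdbcsqlite3', STOCK_SQLITE_ADAPTERS)
puts sqlite_branch?('jdbcsqlite3', PATCHED_SQLITE_ADAPTERS)
```

With the stock list the JDBC adapter name is not recognized. Adding it to each `when` (or matching on a pattern such as /sqlite/) is what sends it into the sqlite branch.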

Try again

jrake
  Missing these required gems:
   mislav-hanna  = 0.1.11

OK. Not sure why that didn’t come in before. Go ahead and add it.

jgem install  mislav-hanna

Migrate the databases

jrake

The databases should migrate, and then it’ll poop out because Solr didn’t start.

Fire up solr

Since we’re running jruby, accessing the shell doesn’t work. You’ll have to fire up your test solr instance by hand.

cd ~/blacklight/jetty
java -Djetty.port=8888 -jar start.jar 2>log.jetty

Try it again!

cd ~/blacklight
jrake spec

   ................................................................
   ................................................................
   ....F............................................................
   1)
   'ApplicationHelper Export EndNote should render the correct 
   EndNote text file' FAILED
   expected: "%0 Format\n%E Greer, Lowell. \n%E Lubin, Steven. \n%E Chase, Stephanie, \n%E Brahms, Johannes, \n%E Beethoven, Ludwig van, \n%E Krufft, Nikolaus von, \n%T Music for horn \n%I Harmonia Mundi USA, \n%C [United States] : \n%D p2001. \n",
  got: "%0 Format\n%C [United States] : \n%D p2001. \n%E Greer, Lowell. \n%E Lubin, Steven. \n%E Chase, Stephanie, \n%E Brahms, Johannes, \n%E Beethoven, Ludwig van, \n%E Krufft, Nikolaus von, \n%I Harmonia Mundi USA, \n%T Music for horn \n" (using ##)
./spec/helpers/application_helper_spec.rb:128:

Finished in 15.519 seconds
193 examples, 1 failure

I can live with that for the moment. Anyone know why that spec fails?
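
One thing that does stand out: the two strings appear to hold exactly the same lines, just in a different order. A quick check, assuming the strings are copied faithfully from the spec output above:

```ruby
expected = "%0 Format\n%E Greer, Lowell. \n%E Lubin, Steven. \n%E Chase, Stephanie, \n" \
           "%E Brahms, Johannes, \n%E Beethoven, Ludwig van, \n%E Krufft, Nikolaus von, \n" \
           "%T Music for horn \n%I Harmonia Mundi USA, \n%C [United States] : \n%D p2001. \n"

got = "%0 Format\n%C [United States] : \n%D p2001. \n%E Greer, Lowell. \n%E Lubin, Steven. \n" \
      "%E Chase, Stephanie, \n%E Brahms, Johannes, \n%E Beethoven, Ludwig van, \n" \
      "%E Krufft, Nikolaus von, \n%I Harmonia Mundi USA, \n%T Music for horn \n"

same_text  = (expected == got)
same_lines = (expected.lines.sort == got.lines.sort)

puts "identical strings: #{same_text}"
puts "same lines, ignoring order: #{same_lines}"
```

If that holds, one unverified guess is that the helper builds its output by iterating over a Hash, whose ordering isn't guaranteed under Ruby 1.8 and can differ between MRI and jruby.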

Great! How about the features?

jrake features
  (much output)

  59 scenarios (59 passed)
  434 steps (434 passed)
  0m51.186s

And so…

…it appears that, at least on the surface, jruby is a viable platform for Blacklight so long as I don’t actually need any of the libxml stuff. In the next couple days I’ll try and actually get it all up and running and see if I can break it.

One Response to “Running Blacklight under JRuby”

  1. Mark Thomas says:

    Did you ever go any further with Blacklight? Are there other resources for Blacklight other than the API docs?