Pushing MARC to Solr; processing times and threading and such

[This is in response to a thread on the blacklight mailing list about getting MARC data into Solr.]

What’s the question?

The question came up, “How much time do we spend processing the MARC vs trying to push it into Solr?”. Bob Haschart found that even with a pretty damn complicated processing stage, pushing the data to solr was still, at best, taking at least as long as the processing stage.

I’m interested because I’ve been struggling to write a solrmarc-like system that runs under JRuby. Architecturally, the big difference between my stuff and solrmarc is that I use the StreamingUpdateSolrServer (on Erik Hatcher’s suggestion). So I thought I’d check how things break down for me.

Here are my numbers running under JRuby (using MARC4J as the marc implementation) with the Solr StreamingUpdateSolrServer. Obviously, there are a lot of differences between this and solrmarc, but I’m hoping that while it’s not comparing apples to apples, it’s at least comparing apples to some sort of processed cheese-like product.

What work is being done on what?

The data set is a file of 18,881 MARC records in marc-binary format. It’s probably not big enough to get a great idea of how things will run over the long (many millions of records) haul, but it’ll do for this rough-cut stuff.

I break my processing down into five categories:

  • Read the records into marc4j objects and do nothing. This is a baseline of sorts.
  • The “normal” fields are anything that you could do with SolrMarc without a custom routine; the actual processing is done in JRuby.
  • Custom fields are generated with JRuby code, but these are things that in solrmarc would require a custom routine.
  • The big “allfields” field is text from tags 100 through 900.
  • The “to_xml” routine is just calling the underlying marc4j XML output and stuffing it into a string.

The schema used is our normal UMICH schema except for High Level Browse (which appears in our catalog as “Academic Discipline”). The code for that is written in Java, and I just call it from JRuby when I’m using it. I excluded it because it’s incredibly expensive, both at startup time (when it loads a giant database of call-number ranges and associated categories) and for processing: there’s a lot of call-number normalization, long-string comparisons, some modified binary searches, etc. etc. etc. It’s expensive. Trust me.

The Solr server itself is on a different, incredibly-beefy machine, and is emptied out before each invocation that involves actually pushing data to it (with a delete-by-query :).

How fast were things on my desktop?

  • 18,881 records in marc-binary format
  • Times are in seconds, run on my desktop
  • Remember, you can’t compare these numbers to Bob’s because we’re doing different things to different data.
Total Seconds Description
19 Just read the records with marc4j and do nothing.
85 Read and do 35 “normal” fields (no custom)
104 Read, 35 normal, 15 custom fields
110 Read, normal, custom, allfields
129 Read, normal, custom, allfields, to_xml
136 Read, normal, custom, allfields, to_xml, 2-threaded SUSS, commit every 5K docs
142 Read, normal, custom, allfields, to_xml, 1-threaded SUSS, commit every 5k docs
124 Read, normal, custom, allfields, to_xml, 1-threaded SUSS, commit every 5k docs, 2 threads doing processing

We can also break the same numbers down as:

Seconds Description
19 read the records and do nothing
66 process the 35 normal fields
19 process the 15 custom fields
6 generate the “allfields” field
19 generate the XML (yowza!)
7 send to solr with two threads
13 send to solr with one thread
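
The per-stage numbers are just the differences between consecutive rows of the cumulative table. Here is a minimal sketch of that arithmetic; the method name and the short labels are made up for illustration and are not part of the indexing code:

```ruby
# Turn cumulative wall-clock totals into per-stage costs by subtracting
# each row's predecessor. Rows are [label, cumulative_seconds] pairs.
def stage_costs(rows)
  previous = 0
  rows.map do |label, total|
    cost = total - previous
    previous = total
    [label, cost]
  end
end

# Cumulative totals from the first table, in the order they were measured.
cumulative = [
  ["read only",         19],
  ["35 normal fields",  85],
  ["15 custom fields", 104],
  ["allfields",        110],
  ["to_xml",           129]
]

stage_costs(cumulative).each do |label, cost|
  puts format("%3d  %s", cost, label)
end

# Sending to Solr is whatever is left once all processing (129 s) is done.
puts format("%3d  send to solr, two SUSS threads", 136 - 129)
puts format("%3d  send to solr, one SUSS thread", 142 - 129)
```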

Or like this:

Seconds Description
129 do all the reading and processing
13 send to solr with one thread

Why does solr processing seem so much faster for me?

There are a lot of reasons why my submit-to-solr might seem like less of a burden. The ones I can think of off the top of my head are:

  • SUSS is just faster than whatever solrmarc does.
  • My processing stage is so much slower than solrmarc’s (due to algorithms or jruby-vs-java, I don’t know) that the “push to solr” portion of it gets swallowed up by the slowness of the overall code.
  • The Solr server is so much faster than my desktop that my poor little desktop can’t send it data fast enough to work it.

For my setup, obviously adding a processing thread is a lot more beneficial than adding a SUSS thread. My desktop doesn’t have that many threads lying around (adding a third processing thread actually slowed things down), so I moved the code to a beefier machine to see what happened.

Trying the same thing on a beefy machine

This is the exact same code and data, but on a beefy machine (16 cores, gobs of memory).

Seconds   SUSS threads   Processing threads
70        1              1   (was 142 seconds on the desktop)
47        1              2
39        1              3
35        1              4
68        2              1
48        2              2
38        2              3
34        2              4

So, on my hardware anyway, there’s a sweet spot with one suss thread and three processing threads. YMMV, of course.
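
To make the diminishing returns easier to see, here is a small sketch that converts the timings above into speedup factors relative to the one-SUSS-thread, one-processing-thread run. The method name is hypothetical; only the numbers come from the table:

```ruby
# Speedup of each configuration relative to a chosen baseline run.
# Keys are [suss_threads, processing_threads]; values are seconds.
def speedups(timings, baseline_key)
  baseline = timings.fetch(baseline_key).to_f
  timings.map { |key, seconds| [key, (baseline / seconds).round(2)] }.to_h
end

timings = {
  [1, 1] => 70, [1, 2] => 47, [1, 3] => 39, [1, 4] => 35,
  [2, 1] => 68, [2, 2] => 48, [2, 3] => 38, [2, 4] => 34
}

speedups(timings, [1, 1]).each do |(suss, processing), factor|
  puts "#{suss} SUSS / #{processing} processing: #{factor}x"
end
```

Going from one to two processing threads buys the most; the fourth thread and the second SUSS thread each change the total by only a few seconds.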

What have we learned?

I’m not sure, to be honest. It’s logistically difficult for me to do the same process in solrmarc because I’d have to rebuild everything without the HLB stuff. I guess for me, what I’ve learned is that if I’m going to continue working on my code, the places to focus my attention are threading (obviously) and MARC-XML generation.

ruby-marc with pluggable readers

I’ve been messing with easier ways of adding parsers to ruby-marc’s MARC::Reader object. The idea is that you can do this:

  require 'marc'
  require 'my_marc_stuff'

  mbreader = MARC::Reader.new('test.mrc') # => Stock marc binary reader
  mbreader = MARC::Reader.new('test.mrc', :readertype=>:marcstrict) # => ditto

  MARC::Reader.register_parser(My::MARC::Parser, :marcstrict)
  mbreader = MARC::Reader.new('test.mrc') # => Uses My::MARC::Parser now

  xmlreader = MARC::Reader.new('test.xml', :readertype=>:marcxml)

  # …and maybe further on down the road

  asreader = MARC::Reader.new('test.seq', :readertype=>:alephsequential)
  mjreader = MARC::Reader.new('test.json', :readertype=>:marchashjson)

A parser need only implement #each and a module-level method #decode_from_string.
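
To give a feel for how small that contract is, here is a self-contained sketch of a parser with exactly those two methods, plus a toy registry. Everything in it is hypothetical: `LineParser`, `ToyReader`, and the pipe-delimited "record" format are stand-ins for illustration, not ruby-marc code, and the real registration hook may look different.

```ruby
require 'stringio'

# A toy parser: one record per line, values separated by "|".
class LineParser
  include Enumerable

  # Module-level decoder: turn one serialized record into a Ruby object.
  def self.decode_from_string(str)
    str.chomp.split('|')
  end

  def initialize(io)
    @io = io
  end

  # Yield one decoded record per line of input.
  def each
    return enum_for(:each) unless block_given?
    @io.each_line { |line| yield self.class.decode_from_string(line) }
  end
end

# Toy registry standing in for the proposed register_parser mechanism.
module ToyReader
  PARSERS = {}

  def self.register_parser(klass, type)
    PARSERS[type] = klass
  end

  def self.open(io, type)
    PARSERS.fetch(type).new(io)
  end
end

ToyReader.register_parser(LineParser, :lines)
reader = ToyReader.open(StringIO.new("001|abc\n245|A title\n"), :lines)
reader.each { |record| p record }
```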

Read all about it on the github page.

New interest in MARC-HASH / JSON

For reasons I’m still not entirely clear on (I wasn’t there), the Code4Lib 2010 conference this week inspired renewed interest in a JSON-based format for MARC data.

When I initially looked at MARC-HASH almost a year ago, I was mostly looking for something that wasn’t such a pain in the butt to work with, something that would marshall into multiple formats easily and would simply and easily round-trip.

Now, though, a lot of us are looking for a MARC format that (a) doesn’t suffer from the length limitations of binary MARC, but (b) is less painful (both in code and processing time) than MARC-XML, and it’s worth re-visiting.

For at least a few folks, un-marshaling time is a factor, since no matter what you’re doing, processing XML is slower than most other options. Much of that expense is due to functionality and data-safety features that just aren’t a big win with a brain-dead format like MARC, so it’s worth looking at alternatives.

What is MARC-HASH?

At some point, we’ll want a real spec, but right now it’s just this:

  # A record is a four-pair hash, as follows. UTF-8 is mandatory.
  {
    "type" : "marc-hash",
    "version" : [1, 0],
    "leader" : "…leader string … ",
    "fields" : [array, of, fields]
  }

  # A field is an array of either 2 or 4 elements
  [tag, value] # a control field
  [tag, ind1, ind2, [array, of, subfields]] # a data field

  # A subfield is an array of two elements
  [code, value]

So, a short example:

  {
    "type" : "marc-hash",
    "version" : [1, 0],

    "leader" : "leader string",
    "fields" : [
      ["001", "001 value"],
      ["002", "002 value"],
      ["010", " ", " ",
        [
          ["a", "68009499"]
        ]
      ],
      ["035", " ", " ",
        [
          ["a", "(RLIN)MIUG0000733-B"]
        ]
      ],
      ["035", " ", " ",
        [
          ["a", "(CaOTULAS)159818014"]
        ]
      ],
      ["245", "1", "0",
        [
          ["a", "Capitalism, primitive and modern;"],
          ["b", "some aspects of Tolai economic growth"],
          ["c", "[by] T. Scarlett Epstein."]
        ]
      ]
    ]
  }
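
Because the structure is nothing but hashes, arrays, and strings, working with it needs no special library. The sketch below uses only Ruby's bundled `json` library; the helper names (`control_field?`, `subfield_values`) are invented for this example and are not part of any MARC-HASH implementation.

```ruby
require 'json'

# A control field has 2 elements; a data field has 4.
def control_field?(field)
  field.length == 2
end

# All values for a given tag/subfield code, in record order.
def subfield_values(record, tag, code)
  record["fields"]
    .select { |field| field[0] == tag && !control_field?(field) }
    .flat_map { |field| field[3].select { |c, _| c == code }.map { |_, v| v } }
end

record = {
  "type"    => "marc-hash",
  "version" => [1, 0],
  "leader"  => "leader string",
  "fields"  => [
    ["001", "001 value"],
    ["035", " ", " ", [["a", "(RLIN)MIUG0000733-B"]]],
    ["035", " ", " ", [["a", "(CaOTULAS)159818014"]]],
    ["245", "1", "0", [["a", "Capitalism, primitive and modern;"],
                       ["b", "some aspects of Tolai economic growth"]]]
  ]
}

# Round-trip through JSON and confirm nothing changed.
round_tripped = JSON.parse(JSON.generate(record))
puts round_tripped == record
puts subfield_values(round_tripped, "245", "a").first
puts subfield_values(round_tripped, "035", "a").inspect
```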

How's the speed?

I think it's important to separate the format marc-hash from the eventual marshaling format -- partly because someone might actually want to use something other than JSON as a final format, but mostly because it forces implementations to separate hash-creation from json-creation and makes it easy to swap in different json libraries when a faster one comes along.

Having said that, in real life people are mostly concerned about JSON. So, let's look at JSON performance.

The MARC-Binary and MARC-XML files are normal files, as you'd expect. The JSON file is "Newline-Delimited JSON" -- a single JSON record on each line.
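
For anyone who hasn't met newline-delimited JSON, here is a minimal sketch of writing and streaming it with Ruby's bundled `json` library. The method names `write_ndj` and `each_ndj` are invented for this example. It works because a JSON generator escapes newlines inside strings, so each record is guaranteed to fit on one physical line.

```ruby
require 'json'
require 'stringio'

# Write each record as a single line of JSON.
def write_ndj(io, records)
  records.each { |record| io.puts(JSON.generate(record)) }
end

# Stream records back one line at a time; nothing forces the whole
# collection into memory at once.
def each_ndj(io)
  io.each_line do |line|
    line = line.strip
    next if line.empty?
    yield JSON.parse(line)
  end
end

records = [
  { "type" => "marc-hash", "version" => [1, 0], "leader" => "first",  "fields" => [["001", "a"]] },
  { "type" => "marc-hash", "version" => [1, 0], "leader" => "second", "fields" => [["001", "b"]] }
]

io = StringIO.new
write_ndj(io, records)
io.rewind
each_ndj(io) { |record| puts record["leader"] }
```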

The benchmark code looks like this:

  # Unmarshal
  x.report("MARC Binary") do
    reader = MARC::Reader.new('test.mrc')
    reader.each do |r|
      title = r['245']['a']
    end
  end

  # Marshal
  x.report("MARC Binary") do
    reader = MARC::Reader.new('test.mrc')
    writer = MARC::Writer.new('benchout.mrc')
    reader.each do |r|
      writer.write(r)
    end
    writer.close
  end
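
The `x` in those snippets is presumably the report object that Ruby's standard `Benchmark` module yields. As a rough sketch of the surrounding harness, here is a runnable version with stand-in workloads (`fake_unmarshal` and `fake_marshal` are placeholders so it runs without ruby-marc or the test files):

```ruby
require 'benchmark'

# Placeholder workloads; swap in real reader/writer loops to measure them.
def fake_unmarshal(n)
  n.times.map { |i| i.to_s }.length
end

def fake_marshal(n)
  n.times.map { |i| i.to_s }.join("\n").length
end

# Benchmark.bm yields a report object; each x.report times one block and
# prints user, system, and wall-clock seconds under the given label.
def run_benchmarks(n = 10_000)
  Benchmark.bm(12) do |x|
    x.report("Unmarshal") { fake_unmarshal(n) }
    x.report("Marshal")   { fake_marshal(n) }
  end
end

run_benchmarks
```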

Under MRI, I used the nokogiri XML parser and the yajl JSON gem. Under JRuby, it was the jstax XML parser and the json-jruby JSON gem.

The test file is a set of 18,831 records I've been using for all my benchmarking of late. It's nothing special; just a nice size.

Marshalling Speed (read from binary marc, dump to given format)

Times are in seconds on my Macbook laptop, using ruby-marc.

Format        Ruby 1.8.7   Ruby 1.9   JRuby 1.4   JRuby 1.4 --1.9
XML           393          443        188         356
MARC Binary   36           23         23          25
JSON/NDJ      31           19         25          ERROR

Unmarshalling speed (from pre-created file)

Again, times are in seconds

Format        Ruby 1.8.7   Ruby 1.9   JRuby 1.4   JRuby 1.4 --1.9
XML           113          89         75          89
MARC Binary   29           16         16          19
JSON/NDJ      17           9          13          16

And so...

I'm not sure what else to say. The format is totally brain-dead. It round-trips. It's fast enough. It has no length limitations. We define it to have to be UTF-8. The NDJ format is easy to understand and allows streaming of large document collections.

If folks are interested in implementing this across other libraries, that'd be great. Any thoughts?

OCLC still not (NO! They are!) normalizing their LCCNs

NOTE 2: It turns out that I did find a minor bug in the system, but that in general LCCN normalization is working correctly. I just happened to hit a weirdness with a bad LCCN and a little bug in the parser on their end. Which is getting fixed. So…good news all around, and huge kudos to Xiaoming Liu for his quick response!
NOTE: It strikes me that I haven’t seen a case where bad data results from sending a valid LCCN. The only verified problem is one of false negatives. Send a valid lccn, you’ll get back either good data or nothing (and the “nothing” might be in error). So, still a big problem, but not as THESKYISFALLING as I imply below.

A long time ago, Jonathan Rochkind noted that the OCLC doesn’t correctly normalize their LCCNs.

Well, it’s not fixed.

I could really, really use the xlccn service right about now — a great web service they provide that, much like xisbn and xissn and the other xXXXX (heh!) services, purports to allow you to put in an lccn and get data back on the item you’re interested in.

Except they “normalize” their LCCNs in a way that is not only incorrect, but causes namespace collisions. As near as I can tell, they throw out any leading non-digits and only keep up to the next non-digit.

The xLCCN service will silently provide no data or incorrect data for many LCCN requests!

An example:

  • (F) Full LCCN is “sn 83011407”
  • (D) First set of digits is “83011407”. This is what I think the OCLC is indexing.
  • (N) Correct normalization is “sn83011407”

The problem, of course, is that (D) “83011407” is itself a valid LCCN.

  • (F) is associated with OCLC# 47212967
  • (D) is associated with OCLC# 12505148. That’s not the same record.
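
The collision is easy to reproduce. The sketch below contrasts the Library of Congress's published normalization recipe (drop blanks, drop a slash and everything after it, remove a hyphen and zero-pad what follows it to six digits) with the "first run of digits" behavior guessed at above. Both method names are invented for this example, and `first_digit_run` models only that guess, not anything OCLC has documented.

```ruby
# LCCN normalization following the Library of Congress recipe.
def normalize_lccn(raw)
  s = raw.gsub(/\s+/, '')        # 1. remove all blanks
  s = s.sub(%r{/.*\z}, '')       # 2. drop a slash and everything after it
  if s.include?('-')             # 3. remove hyphen, left-pad the tail to 6 digits
    left, right = s.split('-', 2)
    s = left + right.rjust(6, '0')
  end
  s
end

# The suspected (incorrect) behavior: skip leading non-digits and keep
# only the first run of digits.
def first_digit_run(raw)
  raw[/\d+/]
end

full = "sn 83011407"
puts normalize_lccn(full)   # the prefix survives
puts first_digit_run(full)  # the prefix is lost, colliding with a different LCCN
```

Any prefixed LCCN whose digits also happen to form a valid unprefixed LCCN will collide the same way.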

So, how do the OCLC services respond?

  • (F) Worldcat search finds correct (probably just doing a string match); xid finds nothing
  • (D) Worldcat finds both correct and incorrect records. The xLCCN service finds only the incorrect record, OCLC# 12505148.
  • (N) Neither worldcat nor xid finds anything for the correctly normalized version.

So, what am I supposed to do? Only use the service on LCCNs where the original and normalized versions are the same and include only digits? Frustrating.

Indexing data into Solr via JRuby (with threads!)

[Note: in this post I'm just going to focus on the "get stuff into Solr" part. My normal focus -- MARC data -- will make an appearance in the next post when I talk about using this in addition to / instead of solrmarc.]

Working with Solr

I love me the Solr. I love everything about it except that the best way to interact with it is via Java. I don’t so much love me the java.

So…taking Erik Hatcher’s lead and advice, as I will do whenever he offers either, I wrote some code to work within JRuby to deal with Solr.

Getting the code

I’ve added the gems to gemcutter, if you want to play along at home:

  • jruby_producer_consumer (github, rdoc.info) Ruby syntax for threaded operations under jruby
  • jruby_streaming_update_solr_server (github, rdoc.info) Ruby syntax on top of the Java class of the same name
  • marc4j4r (github, rdoc.info) Ruby syntax on top of the marc4j java library.

WARNING: None of these gems have a 1.0 version tag on them, and that means that the API may change a titch in the future. Also, the fact that they’re released as gems means that it’s easy to release gems, not that I’m not an idiot.

The basics: Using SolrInputDocument and StreamingUpdateSolrServer

OK, with the disclaimer out of the way, let’s look at some code.

  require 'rubygems'
  require 'jruby_streaming_update_solr_server'

  solrurl = 'http://your.solr.server:port/solr'
  sussqueuesize = 24 # how many items to buffer on their way to solr
  sussthreads = 1    # how many threads to use to send stuff to solr

  suss = StreamingUpdateSolrServer.new(solrurl,sussqueuesize,sussthreads)

  # Let's add a simple document via a hash: A title, three authors, and a year

  h = {
    :title => "Never been deader",
    :author => ['Bill', 'Mike', 'Molly'],
    :year => 2003
  }
  suss << h
  suss.commit

  # YEA! You just added a document to solr and committed it.
  # Have a cookie!

  # We can also use a document object to do the same thing

  doc = SolrInputDocument.new
  # Add the title
  doc << ['title', 'Never been deader']

  # Add the first author
  doc << [:author, 'Bill']

  # Add more. Re-used keys mean you're adding additional values
  # Note values can be scalars or arrays

  doc << [:author, ['Mike', 'Molly']]

  # Add the wrong year using [] syntax
  doc[:year] = 2001

  # Oops! fix it. []= overwrites existing value(s)

  doc[:year] = 2003

  # Finally, we can merge a hash (or anything else that responds to
  # 'each_pair' with key-value pairs) into an existing doc.
  # The parentheses matter: without them Ruby parses the braces as a block.

  doc.merge!({'author' => 'Ringo Starrre', 'publisher'=>'Vainity Books'})

  # Add it

  suss << doc

  # Commit and optimize if you'd like

  suss.commit
  suss.optimize # if you want

Nothing really fancy in there — just a few things worth noting:

  • An suss object will take a hash (again, anything that responds to #each_pair) or a SolrInputDocument
  • You can use either strings or symbols to represent Solr field names
  • Values can be either a single value, or an array of multiple values

And there are three ways to get data into a doc:

  • Via << [field, value(s)] (additive)
  • Via doc.merge! hash (additive)
  • Via doc[field] = value (replaces)
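
If the additive-versus-replacing distinction is hard to picture, this plain-Ruby toy mimics the three behaviors described above. `ToyDoc` is purely an illustration written for this note; it is not how the gem or SolrInputDocument is implemented, and it runs without JRuby or Solr.

```ruby
# A toy document that stores every field as an array of values.
class ToyDoc
  def initialize
    @fields = Hash.new { |hash, key| hash[key] = [] }
  end

  # Additive: doc << [field, value_or_values]
  def <<(pair)
    field, values = pair
    @fields[field.to_s].concat(Array(values))
    self
  end

  # Replacing: doc[field] = value_or_values
  def []=(field, values)
    @fields[field.to_s] = Array(values)
  end

  def [](field)
    @fields[field.to_s]
  end

  # Additive: merge anything that responds to #each_pair
  def merge!(pairs)
    pairs.each_pair { |field, values| self << [field, values] }
    self
  end
end

doc = ToyDoc.new
doc << [:author, 'Bill']
doc << ['author', ['Mike', 'Molly']]   # strings and symbols name the same field
doc[:year] = 2001
doc[:year] = 2003                      # replaces 2001
doc.merge!('author' => 'Ringo')        # appends to the existing authors

p doc[:author]
p doc[:year]
```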

Adding Threads

I also went down the garden path of threading things. There are an awful lot of operations that are not threadsafe (e.g., reading a line from a file) but once you’ve got a bunch of records to work with, turning them into Solr documents is usually thread-safe.

My model is that there’s a producer (usually the method #each) from an underlying data object. A thread takes whatever that method yields and sticks the values into a java BlockingQueue awaiting consumption. You then use ProducerConsumer#threaded_each (or ProducerConsumer#threaded_each_with_index) to pull items out of the queue and do something useful with them.

I extracted stuff into a library (jruby_producer_consumer) for your viewing pleasure.

CONFUSION ALERT: It’s perhaps unfortunate that the object you send to ProducerConsumer.new(obj) must implement #each and that the ProducerConsumer method #threaded_each calls that underlying #each…well there’s a lot of #each’s floating around. Keep them straight.

So…let’s look at some code to work with consumer threads.

  # Start off the same as before
  require 'rubygems'
  require 'jruby_streaming_update_solr_server'
  require 'jruby_producer_consumer'
  require 'marc4j4r'

  solrurl = 'http://your.solr.server:port/solr'
  sussqueuesize = 24 # how many items to buffer on their way to solr
  sussthreads = 2    # how many threads to use to send stuff to solr

  suss = StreamingUpdateSolrServer.new(solrurl,sussqueuesize,sussthreads)

  # I'll go ahead and use a MARC file as my example, but won't talk about the
  # MARC parts of it. All you need to know is that the reader object
  # implements #each

  reader = MARC4J4R.reader('test.xml', :marcxml)

  # Get a producer/consumer object with the reader at its base, using
  # the default method #each to get stuff out of it, and with the assumption
  # that we only need to keep the default 5 items in memory at a time to
  # keep up with consumption

  pc = ProducerConsumer.new(reader)

  # Get three threads to actually consume the things, turn them into solr
  # documents, and send them to solr (potentially out of order)

  numconsumerthreads = 3
  pc.threaded_each(numconsumerthreads).each do |r|
    suss << turn_marc_record_into_a_hash_or_solrdoc(r)
  end
  suss.commit

Again, not a lot happening here.

  • The “producer” is always one thread, because so little is thread-safe at the ‘each’ level. In this case, there’s a single thread pulling data out of the file and turning it into MARC records, which are added to the internal BlockingQueue. I buffer 5 of these at a pop (the default) so the consumer threads don’t starve. I presume that producing items is cheaper than consuming them, or else this library won’t help you much.
  • ProducerConsumer#threaded_each calls the #each method of the underlying object. You can substitute anything that yields, though, as in this example where I call #each_line instead of the default #each
    queuesize = 5
    pc = ProducerConsumer.new(File.new('myfile.txt'), queuesize, :each_line)
  • Keep track of your threads. In this last example, there is one thread getting MARC records and putting them into the PC buffer (no way to change that), three threads consuming those records and sticking them into the suss object, and another two pulling stuff out of the suss object and sending things to Solr. And, of course, there’s other stuff running on the computer, too. Experiment and figure out what works best for your hardware.
  • See the docs for how to mess with what goes into a ProducerConsumer object. It’s entirely possible to use, say, #each_slice. There’s also a convenience method #threaded_each_with_index, but it does not call the underlying #each_with_index; it produces its own index as things are read.

Feedback not only welcome but necessary!

I’ve done a lot of messing around with Ruby in the last 10 days or so, but I’m still basically converting from Perl in my head. Any comments, bug reports, or whatnot are definitely welcome!

jruby_producer_consumer dead-simple producer/consumer for JRuby

Yea! My first gem ever released!

[YUCK! It was a disaster in a few ways! Don't look at this! It's hideous! There's a new jruby_producer_consumer gem on gemcutter that is slightly different from this in that it works. Ignore the stuff below.]

[In working on a threaded JRuby-based MARC-to-Solr project, I realized that my threading stuff was...ugly. And I didn't really understand it. So I dug in today and wrote this.]

I’ve just pushed to Gemcutter my first gem — a JRuby-only producer/consumer class that works with anything that provides #each called jruby_producer_consumer.

It’s JRuby-only because it uses (a) a blocking queue implementation that’s native Java, and (b) threading, which isn’t a huge win under regular Ruby.

There’s no testing there because I’m not sure how to test threaded stuff :-(

It is, I hope, easy to use:

  require 'rubygems'
  require 'jruby_producer_consumer'

  # Create a ProducerConsumer. Arguments are anything that implements #each
  # and the size for the underlying queue. For the former, I'll just use a Range object.

  eachable = 1..10
  queuesize = 3

  pc = ProducerConsumer.new(eachable, queuesize)

  # Just a method to show what happens
  def sample (consumerid, x)
    puts "Consumer #{consumerid}: consuming #{x}"
    sleep 1 # otherwise this'll finish before I can create multiple consumers
  end

  # Create three consumers. You can pass any number of args to
  # #consumer, and must pass a block whose arguments are the
  # object returned by eachable#each and those args back.

  ['A', 'B', 'C'].each do |consumerid|
    pc.consumer(consumerid) do |x, consumerid|
      sample(consumerid, x)
    end
  end

  # OUTPUT
  # Consumer A: consuming 1
  # Consumer B: consuming 2
  # Consumer C: consuming 3
  # Consumer A: consuming 4
  # Consumer B: consuming 5
  # Consumer C: consuming 6
  # Consumer B: consuming 7
  # Consumer A: consuming 8
  # Consumer C: consuming 9
  # Consumer B: consuming 10

Still another look at MARC parsing in ruby and jruby

I’ve been looking at making a jruby-based solr indexer for MARC documents, and started off wanting to make sure I could determine if anything I did would be faster than our existing (solrmarc-based) setup.

Assertion: The upper bound on how fast I can process records and send them to Solr can be approximated by looking at how fast I can parse (and do nothing else to) marc records from a file.

Assertion: If I can’t write a system that’s faster than what we have now, it’s probably not worth my time even though being able to fall back to ruby instead of java would be nice.

The Big Question: Is the MARC parsing process fast enough that it seems I might be able to write a system that runs faster than the solrmarc setup I have now?

The Answer (see below): Yes, if I use marc4j.

On our ridiculously-awesome hardware, right now we’re doing about 300 records/second for short files and 250 records/second for a full (6.5 million record) index, giving us a 7-8 hour reindex.
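
As a quick sanity check on those figures, the reindex time is just record count divided by throughput. A tiny sketch (the method name is made up for this example):

```ruby
# Hours needed to index a given number of records at a steady rate.
def reindex_hours(record_count, records_per_second)
  record_count / records_per_second.to_f / 3600
end

full_index = 6_500_000
puts format("%.1f hours at 250 records/second", reindex_hours(full_index, 250))
puts format("%.1f hours at 300 records/second", reindex_hours(full_index, 300))
```

At the 250 records/second seen on full runs this comes out a little over seven hours, which lines up with the 7-8 hour reindex quoted above.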

I’ll just post the results without a lot of commentary. I warmed stuff up in all cases, and ran on my desktop (so I could compare to MRI ruby, which isn’t installed on the server) and on the server where we usually run these things.

  • The machines are my desktop OSX machine and the beefy linux server where we usually do this stuff
  • The platforms are jruby 1.4 --server and MRI ruby 1.8.7
  • The libraries are marc4j and ruby-marc 0.3.3
  • The parsers are
    • The standard binary parsers all around
    • A home-grown AlephSequential format reader for the ‘seq’ type. AlephSequential is a MARC representation that uses one line for each field. We use it because it doesn’t have length limitations and, not surprisingly, Aleph can spit it out pretty quickly compared to MARC-XML.
    • Whatever marc4j uses internally for MARC-XML
    • ruby-marc’s ‘jstax’ xml parser under jruby (which I wrote and apparently needs some love, see below)
    • ruby-marc’s ‘libxml’ xml parser under MRI ruby
  • Seconds is the average of two rounds, with measurements taken after a warmup run in each case.

The test files were 18,881 records in marc-xml, marc-binary, and AlephSequential formats.

MACHINE   PLATFORM   LIBRARY     PARSER     SECONDS   REC/SECOND
desktop   jruby      marc4j      binary        4.06         4650
desktop   jruby      marc4j      xml           5.55         3401
desktop   jruby      ruby-marc   binary       17.35         1088
desktop   jruby      ruby-marc   jstax        80.11          236

desktop   ruby       ruby-marc   binary       33.54          562
desktop   ruby       ruby-marc   libxml       46.87          402

server    jruby      marc4j      binary        2.29         8245
server    jruby      marc4j      xml           3.36         5619
server    jruby      marc4j      AlephSeq      3.68         5130
server    jruby      ruby-marc   binary        9.93         1901
server    jruby      ruby-marc   jstax        44.56          424

The quick takeaways, with all the obvious caveats:

  • jruby with ruby-marc is twice as fast at binary and twice as slow at xml compared with MRI
  • marc4j is four times as fast for binary and about an order of magnitude faster for xml compared with ruby-marc.
  • The server is fast.
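
The ratios behind those takeaways can be recomputed directly from the desktop rows of the table. A small sketch (method names invented here; the inputs are the table's seconds and the 18,881-record file size):

```ruby
# Throughput for a run over the 18,881-record test file.
def records_per_second(seconds, records = 18_881)
  (records / seconds).round
end

# How many times longer the slower run took.
def ratio(slower_seconds, faster_seconds)
  (slower_seconds / faster_seconds).round(1)
end

puts records_per_second(4.06)  # desktop, jruby, marc4j, binary
puts ratio(33.54, 17.35)       # ruby-marc binary: MRI vs jruby
puts ratio(80.11, 46.87)       # ruby-marc xml: jruby jstax vs MRI libxml
puts ratio(17.35, 4.06)        # jruby binary: ruby-marc vs marc4j
puts ratio(80.11, 5.55)        # jruby xml: ruby-marc jstax vs marc4j
```

"Twice as slow at xml" is a slight round-up (the measured ratio is closer to 1.7), but the overall picture holds.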

We know from previous experience that libxml is the fastest of the current MRI-based marc-xml readers and that jstax is the best of the current jruby-based marc-xml readers. And, finally, we know that many of us can’t use marc-binary format because our records are too big.

If I’m gonna use jruby (which I think I am due to wanting to use the StreamingUpdateSolrServer) I’m gonna need to use marc4j and just wrap it up in some nicer syntax.

Beta version of the HathiTrust Volumes API available

MAJOR CHANGE

So, initially, this post listed that the way to separate multiple simultaneous requests was with a nice, URL-like slash (/) character.

Then, I remembered that LCCNs can have embedded slashes, e.g., 65063380//r85.

So, we’re back to using pipe (|) characters to separate multiple calls — the examples below have been updated to reflect this.

Introduction

I’ve put up a beta version of the HathiTrust Volumes API previously discussed on this blog and via email.

Currently, I’ve only got json output, although there is space in there for other output formats as necessary.

What exactly is this?

Given: an identifier or set of identifiers, this API will Return: a set of matched records and a sorted list of the items available in the HathiTrust.

Useful, for example, if you want to display HathiTrust holdings alongside your own in your OPAC.

Simple, single-value call

Given the URL:

http://catalog.hathitrust.org/api/volumes/oclc/15420548.json

You’ll get the following back:

  {
      "records":
      {
          "000791709":
          {
              "recordURL":"http://catalog.hathitrust.org/Record/000791709",
              "titles":
              [
                  "\"Zhong gong dang shi\" fu dao /",
                  "\u300a\u4e2d\u5171\u515a\u53f2\u300b\u8f85\u5bfc /"
              ],
              "isbns": [],
              "issns": [],
              "oclcs": ["15420548"],
              "lccns": []
          }
      },
      "items":
      [
          {
              "orig":"University of Michigan",
              "fromRecord":"000791709",
              "htid":"mdp.39015058510069",
              "itemURL":"http://hdl.handle.net/2027/mdp.39015058510069",
              "rightsCode":"ic",
              "lastUpdate":"00000000",
              "enumcron":false
          }
      ]
  }

Note that the ‘records’ are keyed on the local umid, also available in the ‘fromRecord’ field of each item.
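
To show how the two halves of the response fit together, here is a sketch that joins items back to their records through ‘fromRecord’. It parses a trimmed copy of the response above with Ruby's bundled `json` library rather than calling the live service; the method names and the shape of the summary hash are invented for this example.

```ruby
require 'json'

# Group the flat item list by the record each item came from.
def items_by_record(response)
  response.fetch("items").group_by { |item| item.fetch("fromRecord") }
end

# One summary per record: its id, first title, and HathiTrust item ids.
def holdings_summary(response)
  grouped = items_by_record(response)
  response.fetch("records").map do |umid, record|
    htids = grouped.fetch(umid, []).map { |item| item["htid"] }
    { "umid" => umid, "title" => record["titles"].first, "htids" => htids }
  end
end

# Trimmed from the sample response; the single-quoted heredoc keeps the
# backslashes exactly as they appear in the JSON.
body = <<~'JSON'
  {
    "records": {
      "000791709": {
        "recordURL": "http://catalog.hathitrust.org/Record/000791709",
        "titles": ["\"Zhong gong dang shi\" fu dao /"],
        "oclcs": ["15420548"]
      }
    },
    "items": [
      { "fromRecord": "000791709", "htid": "mdp.39015058510069", "rightsCode": "ic" }
    ]
  }
JSON

holdings_summary(JSON.parse(body)).each do |summary|
  puts "#{summary['umid']}: #{summary['title']} (#{summary['htids'].join(', ')})"
end
```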

The generic short form is:

http://catalog.hathitrust.org/api/volumes/(idtype)/id.(outputtype)

Right now the valid idtypes are:

  • issn (will be normalized to just digits, no leading zeros)
  • isbn (will be normalized to an ISBN-13)
  • oclc (will be normalized to all digits, no leading zeros)
  • lccn (will be normalized as recommended)
  • htid (HathiTrust item id, seen above as “mdp.39015058510069”)
  • umid (the University of Michigan record ID, seen above in the “fromRecord” field of an item)

Currently the only valid outputtype is ‘json’.

More complex, multi-valued call

The full API URL looks like this:

http://catalog.hathitrust.org/api/volumes/json/id:1;oclc:45678;lccn:70628581|id:2;isbn:1591581613

This is a request for data on two separate items, identified on the calling end as simply ‘1’ and ‘2’ (id:1 and id:2). The first item is searched for using both an oclc number and an lccn; the second supplies only an isbn.
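
A sketch of building both URL forms from Ruby data may make the syntax easier to remember. The helper names are made up for this example, and it does no URL-escaping; some HTTP clients may want the pipe percent-encoded, which is not covered here.

```ruby
# Base path taken from the examples in this post.
def volumes_base
  'http://catalog.hathitrust.org/api/volumes'
end

# Short form: .../volumes/(idtype)/id.(outputtype)
def single_url(idtype, id, outputtype = 'json')
  "#{volumes_base}/#{idtype}/#{id}.#{outputtype}"
end

# Full form: key:value pairs joined by ';', separate requests joined by '|'.
def multi_url(requests, outputtype = 'json')
  specs = requests.map do |request|
    request.map { |key, value| "#{key}:#{value}" }.join(';')
  end
  "#{volumes_base}/#{outputtype}/#{specs.join('|')}"
end

puts single_url('oclc', '15420548')
puts multi_url([
  { 'id' => '1', 'oclc' => '45678', 'lccn' => '70628581' },
  { 'id' => '2', 'isbn' => '1591581613' }
])
```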

Note that

  • The output format (json) has moved to appear right after the ‘/volumes/’
  • There’s an arbitrary ‘id’ field. This will be used to index the return values, so use something meaningful on your end.
  • keys and values are separated by colons. Key-Value pairs are separated by semi-colons.
  • Separate requests are separated by ‘|’ in the URL, allowing you to request data for an arbitrary number of items with a single call.
  • Return values are
  • Matches follow the “#3” option on the old post, the “Must match if present” option — basically, if you supply an identifier and a record has one of those identifiers, they must match.

So, in the example, the first request has both an oclc number and an lccn. Matches are as follows:

  • If a record has an oclc number but no lccn, its oclc number must match the passed oclc number.
  • If a record has an lccn but no oclc number, its lccn must match the passed lccn value.
  • If a record has both an lccn and an oclc number, both its identifiers must match the passed values.
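
Those three rules can be written as a single predicate. This is a sketch of the rule as described, not the service's code. It assumes identifiers on both sides are already normalized, that a record stores its identifiers in plural-keyed arrays as in the JSON output ("oclcs", "lccns"), and (an assumption of this example) that a record sharing no identifier type with the query does not count as a match.

```ruby
# "Must match if present": for every identifier type that is both supplied
# in the query and present on the record, the supplied value must be among
# the record's values.
def must_match_if_present?(record, query)
  compared = query.select { |type, _| !Array(record["#{type}s"]).empty? }
  return false if compared.empty? # assumption: nothing in common means no match
  compared.all? { |type, value| record["#{type}s"].include?(value) }
end

query = { 'oclc' => '45678', 'lccn' => '70628581' }

oclc_only = { 'oclcs' => ['45678'], 'lccns' => [] }
both_ok   = { 'oclcs' => ['45678'], 'lccns' => ['70628581'] }
lccn_off  = { 'oclcs' => ['45678'], 'lccns' => ['99999999'] }

puts must_match_if_present?(oclc_only, query)
puts must_match_if_present?(both_ok, query)
puts must_match_if_present?(lccn_off, query)
```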

The returned structure is keyed on the arbitrary id passed in the search string (if not present, the whole search string will be used instead):

{
    "1": {
        "records": {
            "001474331": {
                "recordURL": "http://catalog.hathitrust.org/Record/001474331",
                "titles": ["Some aspects of seventeenth-century medicine &amp; science; papers read at a Clark Library seminar, October 12, 1968"],
                "isbns": [],
                "issns": [],
                "oclcs": ["00045678"],
                "lccns": ["70628581 //r86"]
            }
        },
        "items": [{
            "orig": "University of Michigan",
            "fromRecord": "001474331",
            "htid": "mdp.39015004074095",
            "itemURL": "http://hdl.handle.net/2027/mdp.39015004074095",
            "rightsCode": "ic",
            "lastUpdate": "20090713",
            "enumcron": false
        }]
    },
    "2": {
        "records": {
            "004370624": {
                "recordURL": "http://catalog.hathitrust.org/Record/004370624",
                "titles": ["ARBA in-depth. Philosophy and religion /"],
                "isbns": ["1591581613"],
                "issns": [],
                "oclcs": ["53462174"],
                "lccns": ["2003065945"]
            }
        },
        "items": [{
            "orig": "University of Michigan",
            "fromRecord": "004370624",
            "htid": "mdp.39015058261911",
            "itemURL": "http://hdl.handle.net/2027/mdp.39015058261911",
            "rightsCode": "ic",
            "lastUpdate": "20090907",
            "enumcron": false
        }]
    }
}
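
As a rough sketch of how a client might walk that structure, here is some Ruby that pairs each item with the record it came from via "fromRecord". The helper name and the trimmed sample are made up for illustration; only the field names come from the response above.

```ruby
require "json"

# Trimmed copy of the response shown above (one request, one record, one item).
response = <<~JSON
  {
    "1": {
      "records": {
        "001474331": {
          "recordURL": "http://catalog.hathitrust.org/Record/001474331",
          "oclcs": ["00045678"],
          "lccns": ["70628581 //r86"]
        }
      },
      "items": [
        {
          "orig": "University of Michigan",
          "fromRecord": "001474331",
          "htid": "mdp.39015004074095",
          "itemURL": "http://hdl.handle.net/2027/mdp.39015004074095",
          "rightsCode": "ic",
          "lastUpdate": "20090713",
          "enumcron": false
        }
      ]
    }
  }
JSON

# Hypothetical helper: flatten the response into one row per item,
# looking up each item's record through its "fromRecord" key.
def item_rows(parsed)
  parsed.flat_map do |request_id, result|
    result["items"].map do |item|
      record = result["records"][item["fromRecord"]]
      { request: request_id, htid: item["htid"], rights: item["rightsCode"],
        item_url: item["itemURL"], record_url: record["recordURL"] }
    end
  end
end

item_rows(JSON.parse(response)).each do |row|
  puts "#{row[:request]}: #{row[:htid]} (#{row[:rights]}) #{row[:item_url]}"
end
```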

Enumeration / Chronology

An effort is made to return items in “enumcron order” — hopefully, with earlier volumes showing up before later volumes. The full enumcron is listed in the items if you need to try something different.

JSONP Support

JSONP output is supported: just throw a ‘&callback=blahblahblah’ on the end of the URL you call and you’ll get the JSON back wrapped in a call to that function.

Some examples:

http://catalog.hathitrust.org/api/volumes/oclc/15420548.json&callback=myfunc

http://catalog.hathitrust.org/api/volumes/json/id:1;oclc:45678;lccn:70628581/id:2;isbn:1591581613&callback=myfunc
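
If you are building those multi-request URLs in code, something like the following Ruby sketch would do it. The helper name is made up; the base URL, the ‘;’ and ‘/’ separators, and the ‘&callback=’ suffix are copied from the examples above, and no escaping of values is attempted.

```ruby
BASE = "http://catalog.hathitrust.org/api/volumes/json"

# Hypothetical helper. Each request is a hash of key => value pairs such as
# { "id" => 1, "oclc" => "45678" }; pairs are joined with ';' and separate
# requests with '/'. An optional callback name is appended for JSONP.
def multi_request_url(requests, callback = nil)
  path = requests.map { |req| req.map { |k, v| "#{k}:#{v}" }.join(";") }.join("/")
  url = "#{BASE}/#{path}"
  callback ? "#{url}&callback=#{callback}" : url
end

puts multi_request_url(
  [{ "id" => 1, "oclc" => "45678", "lccn" => "70628581" },
   { "id" => 2, "isbn" => "1591581613" }],
  "myfunc"
)
```

Called this way it reproduces the second example URL above.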

Running Blacklight under JRuby

I decided to see if I could get Blacklight working under JRuby, starting with running the test suite and working my way up from there.

There was much pain. Much, much pain. Exacerbated by my almost complete lack of knowledge about what I was doing.

This is the procedure I eventually arrived at — if there are places where I made trouble for myself, please let me know!

[And does anyone know how to get jruby's nokogiri to link to a different libxml and stop with the crappy libxml2-version error message every time I run it under OSX???]

Download jruby

Go to jruby.org and download a binary distribution. Extract the tar.gz (or zip or whatever).

I’ll put mine in ~/jruby. Or, at least that’s what I’ll tell you.

tar xzf jruby-1.4.tar.gz

To avoid confusion, let’s make jrake an alias for rake and add the jruby bin directory to the path

cd ~/jruby/bin
ln -s rake jrake
export PATH=`pwd`:$PATH

Download Blacklight

git clone git://github.com/projectblacklight/blacklight.git

Again, we’ll say that I put this in ~/blacklight/

Muck with Blacklight dependencies

Edit the file init.rb to comment out references to libxml and ruby-xslt, as well as nokogiri. My understanding is that the first two are used, at this point, only for the EAD stuff. Both rely on libxml2 which is a C-extension and hence unavailable to JRuby.

Nokogiri gets pulled in during other installs and for some reason jrake will complain later on that it’s got a wrong version or something. So, we’ll just work without that particular net for now.

#### File ~/blacklight/init.rb
# config.gem 'libxml-ruby', :lib=>'libxml', :version=>'1.1.3'
# config.gem 'ruby-xslt', :lib=>'xml/xslt', :version=>'0.9.6'
# config.gem 'nokogiri', :version=>'1.3.3'

Do some initial installs

jgem install -v=2.3.4 rails
jgem install activerecord-jdbc-adapter jdbc-sqlite3 \
             activerecord-jdbcsqlite3-adapter ActiveRecord-JDBC
jgem install rcov -s http://gemcutter.org --no-rdoc --no-ri
jrake
jrake gems:install

Edit the config/database.yml file

…to change the adapter to jdbcsqlite3 for development and testing.

Edit the databases.rake file

This one was harder to track down. The default rake task has hard-coded database names in the .rake file — jdbcsqlite3 isn’t included. I keep seeing things saying, “Oh, yeah, that’s been fixed…” but, well, it wasn’t for me. I had to do it by hand.

edit ~/jruby/lib/ruby/gems/1.8/gems/rails-2.3.4/lib/tasks/databases.rake

You need to find everywhere there’s a

when "sqlite", "sqlite3" # or when /^sqlite/ in one case

…and change it to

when "sqlite", "sqlite3", "jdbcsqlite3"

Repeat for other databases you want to use (e.g., mysql). For the moment, since I’m only worried about running jrake spec, that’s all I’m gonna do.
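
If you would rather not make that edit by hand, the substitution can be sketched in Ruby. This is an untested-against-Rails illustration: the helper name is made up, it only handles the literal `when "sqlite", "sqlite3"` form, and the one `when /^sqlite/` clause still needs a hand edit.

```ruby
# Hypothetical helper: add "jdbcsqlite3" to every literal
#   when "sqlite", "sqlite3"
# clause in the given source text. The negative lookahead keeps it from
# patching a clause that has already been changed.
def add_jdbc_sqlite(source)
  source.gsub(/when "sqlite", "sqlite3"(?!, "jdbcsqlite3")/,
              'when "sqlite", "sqlite3", "jdbcsqlite3"')
end

# A made-up fragment in the shape of the clauses found in databases.rake.
sample = <<~RAKE
  case config['adapter']
  when "mysql"
    puts "mysql"
  when "sqlite", "sqlite3"
    puts "sqlite"
  end
RAKE

puts add_jdbc_sqlite(sample)
```

To apply it to the real file you would read databases.rake into a string, run it through the helper, and write the result back, ideally after making a backup copy.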

Try again

jrake
  Missing these required gems:
   mislav-hanna  = 0.1.11

OK. Not sure why that didn’t come in before. Go ahead and add it.

jgem install mislav-hanna

Migrate the databases

jrake

The databases should migrate, and then it’ll poop out because Solr didn’t start.

Fire up solr

Since we’re running jruby, accessing the shell doesn’t work. You’ll have to fire up your test solr instance by hand.

cd ~/blacklight/jetty
java -Djetty.port=8888 -jar start.jar 2>log.jetty

Try it again!

cd ~/blacklight
jrake spec

   ................................................................
   ................................................................
   ....F............................................................
   1)
   'ApplicationHelper Export EndNote should render the correct 
   EndNote text file' FAILED
   expected: "%0 Format\n%E Greer, Lowell. \n%E Lubin, Steven. \n%E Chase, Stephanie, \n%E Brahms, Johannes, \n%E Beethoven, Ludwig van, \n%E Krufft, Nikolaus von, \n%T Music for horn \n%I Harmonia Mundi USA, \n%C [United States] : \n%D p2001. \n",
        got: "%0 Format\n%C [United States] : \n%D p2001. \n%E Greer, Lowell. \n%E Lubin, Steven. \n%E Chase, Stephanie, \n%E Brahms, Johannes, \n%E Beethoven, Ludwig van, \n%E Krufft, Nikolaus von, \n%I Harmonia Mundi USA, \n%T Music for horn \n" (using ==)
./spec/helpers/application_helper_spec.rb:128:

Finished in 15.519 seconds
193 examples, 1 failure

I can live with that for the moment. Anyone know why that spec fails?
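
One thing the output itself does show: the expected and actual strings seem to contain the same lines, just in a different order. The Ruby below checks that, using the two strings copied from the failure above (the helper name is made up). Why the order differs is a separate question; one untested guess is that the helper iterates over a Hash, whose iteration order Ruby 1.8 does not guarantee.

```ruby
# Hypothetical check: do two multi-line strings hold the same lines,
# regardless of the order those lines come in?
def same_lines_ignoring_order?(a, b)
  a.split("\n").sort == b.split("\n").sort
end

expected = "%0 Format\n%E Greer, Lowell. \n%E Lubin, Steven. \n%E Chase, Stephanie, \n%E Brahms, Johannes, \n%E Beethoven, Ludwig van, \n%E Krufft, Nikolaus von, \n%T Music for horn \n%I Harmonia Mundi USA, \n%C [United States] : \n%D p2001. \n"
got = "%0 Format\n%C [United States] : \n%D p2001. \n%E Greer, Lowell. \n%E Lubin, Steven. \n%E Chase, Stephanie, \n%E Brahms, Johannes, \n%E Beethoven, Ludwig van, \n%E Krufft, Nikolaus von, \n%I Harmonia Mundi USA, \n%T Music for horn \n"

puts "identical strings:          #{expected == got}"
puts "same lines, ignoring order: #{same_lines_ignoring_order?(expected, got)}"
```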

Great! How about the features?

jrake features
  (much output)

  59 scenarios (59 passed)
  434 steps (434 passed)
  0m51.186s

And so…

…it appears that, at least on the surface, jruby is a viable platform for Blacklight so long as I don’t actually need any of the libxml stuff. In the next couple days I’ll try and actually get it all up and running and see if I can break it.

Setting up your OPAC for Zotero support using unAPI

unAPI is a very simple protocol to let a machine know what other formats a document is available in. Zotero is a bibliographic management tool (like EndNote or RefWorks) that operates as a Firefox plugin. And it speaks unAPI.

Let’s get them to play nice with each other!

How’s it all work?

  1. Zotero looks for a well-constructed <link> tag in the head of the page
  2. It checks the document on the other side of that link to see what formats are offered, and picks one to use. No, you can’t decide which one it uses. It picks.
  3. Zotero then looks for IDs in the body of the page
  4. If both are found and everything seems kosher, Zotero will offer the option to import some or all of the records.

What you’ll need

  1. An OPAC whose output you can futz with
  2. Access to an individual record’s ID in that output
  3. A URL based on the ID that gives an RIS representation of the records
  4. A screwdriver. Made with decent — but not too expensive — vodka and fresh orange juice.

Yes. I’m cheating.

I have all those things already. Hence, this is easy for me. If you had to, say, write some sort of weird redirection script because IDs are not first-class citizens in your OPAC’s URL scheme, or write an RIS export tool by hand, well, this will take you a bit longer.

The process

1. Build an unAPI target script

You need a script that’ll do three things:

  1. With no arguments, return a list of available formats in general
  2. With one argument, id=<ID>, return a list of formats available for that item. This will likely be exactly the same as #1.
  3. With two arguments, id=<ID> & format=<FORMAT>, return the record identified by <ID> in format <FORMAT>

Mine looks like this:

<?php
// id is of the form urn:bibnum:000000000
$id = isset($_REQUEST['id']) ? $_REQUEST['id'] : false;

// Format, at this point, had better be 'ris'
$format = isset($_REQUEST['format']) ? $_REQUEST['format'] : false;

// Got neither? Return the general list
if (!($id || $format)) {
  header('Content-type: application/xml');
  echo '<?xml version="1.0" encoding="UTF-8"?>
<formats>
  <format name="ris"
          type="application/x-Research-Info-Systems"
          docs="http://www.refman.com/support/risformat_intro.asp"/>
</formats>
';
  exit;
}

// Got just the id? Return formats for that ID
if ($id && !$format) {
  header('Content-type: application/xml');
  // Escape the id, since it is echoed back inside an XML attribute
  echo '<?xml version="1.0" encoding="UTF-8"?>
<formats id="' . htmlspecialchars($id, ENT_QUOTES) . '">
  <format name="ris"
          type="application/x-Research-Info-Systems"
          docs="http://www.refman.com/support/risformat_intro.asp"/>
</formats>
';
  exit;
}

// Otherwise...

// Parse out the actual numeric part of the id from the urn:<typeOfNumber> prefix.
// Bail out with a 404 if the id is missing or isn't in the form we expect.
if (!$id || !preg_match('/^urn:bibnum:(.*)$/', $id, $match)) {
  header('HTTP/1.1 404 Not Found');
  exit;
}
$actualID = $match[1];

// Again: format had better be 'ris' because that's all I'm supporting at this point.
header('Location: /Search/SearchExport?id=' . urlencode($actualID) . '&method=' . urlencode($format), true, 302);
exit;

You can see that a <format> is just a name, a mime-type, and an optional reference to documentation on the type.

I take advantage of my existing RIS export process in the redirect, at the bottom. I also built in the possibility that other types of numbers could come in — I’m hard-coding ‘bibnum’ for the moment, but could allow, say, “oclc” or “isbn” or whatnot, too.
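
Allowing more than one kind of number mostly comes down to parsing the type out of the urn as well as the value. Here is a small Ruby sketch of that idea; the method name and the list of accepted types are assumptions for illustration (only ‘bibnum’ is actually handled by the script above).

```ruby
# Identifier types this sketch accepts. Only "bibnum" is supported by the
# script above; "oclc" and "isbn" are the hypothetical extras.
KNOWN_TYPES = %w[bibnum oclc isbn].freeze

# Returns [type, value], or nil when the id is not a urn we recognize.
def parse_unapi_id(id)
  match = /\Aurn:([a-z]+):(.+)\z/.match(id.to_s)
  return nil unless match && KNOWN_TYPES.include?(match[1])
  [match[1], match[2]]
end

p parse_unapi_id("urn:bibnum:000000002")
p parse_unapi_id("urn:oclc:45678")
p parse_unapi_id("bogus")
```

A nil result is the case where the target script would want to answer with an error instead of redirecting.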

2. Tell your OPAC where the script lives

You’ll need a line in the <head> section of all your pages that might have an ID on them:

<link rel="unapi-server" type="application/xml" title="unAPI" href="/unapi">

Everything should be left alone except for the actual href.

3. Add your IDs to the HTML

In the HTML of your page, you can add one or more tags of the form:

<abbr class="unapi-id" title="urn:bibnum:000000002"></abbr>

(where the title of the <abbr> conforms to what you’re expecting in your script).

You can put stuff inside the <abbr> but you need not. On a single-record page, you should have (I would think) only one of these things. On a search results page, you may decide to not have any, or you may decide to have one for each search result.
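
For a quick sanity check that a page carries both pieces (the <link> from step 2 and the <abbr> tags from step 3), a rough Ruby sketch follows. It is not how Zotero itself does the lookup: the helper name is made up, it uses plain regular expressions instead of an HTML parser, and it assumes the class attribute comes before the title attribute, as in the example tag above.

```ruby
# Hypothetical checker: pull the unAPI server href and the unapi-id values
# out of a page. Good enough for a smoke test of your own templates only.
def unapi_markers(html)
  link_tag = html[/<link\b[^>]*\brel=["']unapi-server["'][^>]*>/i]
  server = link_tag && link_tag[/\bhref=["']([^"']+)["']/i, 1]
  ids = html.scan(/<abbr\b[^>]*\bclass=["']unapi-id["'][^>]*\btitle=["']([^"']+)["']/i).flatten
  { server: server, ids: ids }
end

page = <<~HTML
  <html>
    <head>
      <link rel="unapi-server" type="application/xml" title="unAPI" href="/unapi">
    </head>
    <body>
      <abbr class="unapi-id" title="urn:bibnum:000000002"></abbr>
    </body>
  </html>
HTML

p unapi_markers(page)
```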

4. Final step

Drink your screwdriver.

Where can I see it?

Well…here’s the thing.

You can take a look at my test instance, http://dueberb.vufind.lib.umich.edu/, and play there. You cannot see it in production, because there’s a little problem.

Our old OPAC — now dubbed mirlyn-classic — had a custom translator written for it. And it worked fine, and that was great.

But now we’ve got this new software running at mirlyn.lib.umich.edu, and Zotero keeps on using the old translator no matter what you do. The only way to override it is to actually fire up sqlite3 and remove the conflicting entry from the zotero translators table. And then never update that table again.

I’ve asked around about getting it fixed (changing the target URL for the old translator to point at mirlyn-classic) but it’s Friday, and no one is around. Hopefully soon.