Archives: November 2009

Running Blacklight under JRuby

November 17, 2009 at 11:35 pm. Category: Uncategorized

I decided to see if I could get Blacklight working under JRuby, starting with running the test suite and working my way up from there.

There was much pain. Much, much pain. Exacerbated by my almost complete lack of knowledge about what I was doing.

This is the procedure I eventually arrived at — if there are places where I made trouble for myself, please let me know!

[And does anyone know how to get jruby's nokogiri to link to a different libxml and stop with the crappy libxml2-version error message every time I run it under OSX???]

Download jruby

Go to jruby.org and download a binary distribution. Extract the tar.gz (or zip or whatever)

I’ll put mine in ~/jruby. Or, at least that’s what I’ll tell you.

tar xzf jruby-1.4.tar.gz

To avoid confusion, let’s make jrake an alias for rake and add the jruby bin directory to the path

cd ~/jruby/bin
ln -s rake jrake
export PATH=`pwd`:$PATH

Download Blacklight

git clone git://github.com/projectblacklight/blacklight.git

Again, we’ll say that I put this in ~/blacklight/

Muck with Blacklight dependencies

Edit the file init.rb to comment out references to libxml and ruby-xslt, as well as nokogiri. My understanding is that the first two are used, at this point, only for the EAD stuff. Both are C extensions built on libxml2, and C extensions are unavailable to JRuby.

Nokogiri gets pulled in during other installs and for some reason jrake will complain later on that it’s got a wrong version or something. So, we’ll just work without that particular net for now.

#### File ~/blacklight/init.rb
# config.gem 'libxml-ruby', :lib=>'libxml', :version=>'1.1.3'
# config.gem 'ruby-xslt', :lib=>'xml/xslt', :version=>'0.9.6'
# config.gem 'nokogiri', :version=>'1.3.3'

Do some initial installs

jgem install -v=2.3.4 rails
jgem install activerecord-jdbc-adapter jdbc-sqlite3 \
             activerecord-jdbcsqlite3-adapter ActiveRecord-JDBC
jgem install rcov -s http://gemcutter.org --no-rdoc --no-ri
jrake
jrake gems:install

Edit the config/database.yml file

…to change the adapter to jdbcsqlite3 for development and testing.

Edit the databases.rake file

This one was harder to track down. The default rake task has hard-coded database names in the .rake file — jdbcsqlite3 isn’t included. I keep seeing things saying, “Oh, yeah, that’s been fixed…” but, well, it wasn’t for me. I had to do it by hand.

edit ~/jruby/lib/ruby/gems/1.8/gems/rails-2.3.4/lib/tasks/databases.rake

You need to find everywhere there’s a

when "sqlite", "sqlite3" # or when /^sqlite/ in one case

…and change it to

when "sqlite", "sqlite3", "jdbcsqlite3"

Repeat for other databases you want to use (e.g., mysql). For the moment, since I’m only worried about running jrake spec, that’s all I’m gonna do.
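For anyone who would rather not list every JDBC variant by hand, here is a minimal sketch of the idea in plain Ruby: strip an optional "jdbc" prefix before the case statement. This is not the Rails code itself; the method name database_family and the symbols it returns are hypothetical, made up for illustration.

```ruby
# Hypothetical helper, not part of Rails: map an adapter name from
# database.yml onto a database family, treating "jdbcXXX" like "XXX".
def database_family(adapter)
  case adapter.to_s.sub(/\Ajdbc/, '')
  when 'mysql'             then :mysql
  when 'postgresql'        then :postgresql
  when 'sqlite', 'sqlite3' then :sqlite
  else :unknown
  end
end

puts database_family('jdbcsqlite3')
puts database_family('sqlite3')
```

Both calls land in the same branch, which is all the hand edit to databases.rake is doing.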

Try again

jrake
  Missing these required gems:
   mislav-hanna  = 0.1.11

OK. Not sure why that didn’t come in before. Go ahead and add it.

jgem install mislav-hanna

Migrate the databases

jrake

The databases should migrate, and then it’ll poop out because Solr didn’t start.

Fire up solr

Since we’re running jruby, accessing the shell doesn’t work. You’ll have to fire up your test solr instance by hand.

cd ~/blacklight/jetty
java -Djetty.port=8888 -jar start.jar 2>log.jetty

Try it again!

cd ~/blacklight
jrake spec

   ................................................................
   ................................................................
   ....F............................................................
   1)
   'ApplicationHelper Export EndNote should render the correct 
   EndNote text file' FAILED
   expected: "%0 Format\n%E Greer, Lowell. \n%E Lubin, Steven. \n%E Chase, Stephanie, \n%E Brahms, Johannes, \n%E Beethoven, Ludwig van, \n%E Krufft, Nikolaus von, \n%T Music for horn \n%I Harmonia Mundi USA, \n%C [United States] : \n%D p2001. \n",
  got: "%0 Format\n%C [United States] : \n%D p2001. \n%E Greer, Lowell. \n%E Lubin, Steven. \n%E Chase, Stephanie, \n%E Brahms, Johannes, \n%E Beethoven, Ludwig van, \n%E Krufft, Nikolaus von, \n%I Harmonia Mundi USA, \n%T Music for horn \n" (using ##)
./spec/helpers/application_helper_spec.rb:128:

Finished in 15.519 seconds
193 examples, 1 failure

I can live with that for the moment. Anyone know why that spec fails?
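One thing the failure output does show: the expected and actual strings contain the same lines, just in a different order. The sketch below checks that with the two strings copied from the output above. The helper name same_lines? is hypothetical, and whether line order matters to an EndNote import is an assumption I haven't checked.

```ruby
expected = "%0 Format\n%E Greer, Lowell. \n%E Lubin, Steven. \n%E Chase, Stephanie, \n%E Brahms, Johannes, \n%E Beethoven, Ludwig van, \n%E Krufft, Nikolaus von, \n%T Music for horn \n%I Harmonia Mundi USA, \n%C [United States] : \n%D p2001. \n"
got = "%0 Format\n%C [United States] : \n%D p2001. \n%E Greer, Lowell. \n%E Lubin, Steven. \n%E Chase, Stephanie, \n%E Brahms, Johannes, \n%E Beethoven, Ludwig van, \n%E Krufft, Nikolaus von, \n%I Harmonia Mundi USA, \n%T Music for horn \n"

# Hypothetical helper: true when both strings hold the same lines,
# regardless of the order the lines appear in.
def same_lines?(a, b)
  a.lines.sort == b.lines.sort
end

puts expected == got             # false: the order differs
puts same_lines?(expected, got)  # true: same lines on both sides
```

If order turns out not to matter, comparing sorted lines in the spec would make it pass on both interpreters; if it does matter, the helper being tested needs to emit its fields in a fixed order.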

Great! How about the features?

jrake features
  (much output)

  59 scenarios (59 passed)
  434 steps (434 passed)
  0m51.186s

And so…

…it appears that, at least on the surface, jruby is a viable platform for Blacklight so long as I don’t actually need any of the libxml stuff. In the next couple days I’ll try and actually get it all up and running and see if I can break it.

One Response to “Running Blacklight under JRuby”

  1. Mark Thomas says:

    Did you ever go any further with Blacklight? Are there other resources for Blacklight other than the API docs?


Setting up your OPAC for Zotero support using unAPI

unAPI is a very simple protocol to let a machine know what other formats a document is available in. Zotero is a bibliographic management tool (like Endnote or Refworks) that operates as a Firefox plugin. And it speaks unAPI.

Let’s get them to play nice with each other!

How’s it all work?

  1. Zotero looks for a well-constructed <link> tag in the head of the page
  2. It checks the document on the other side of that link to see what formats are offered, and picks one to use. No, you can’t decide which one it uses. It picks.
  3. Zotero then looks for IDs in the body of the page
  4. If both are found and everything seems kosher, Zotero will offer the option to import some or all of the records.

What you’ll need

  1. An OPAC whose output you can futz with
  2. Access to an individual record’s ID in that output
  3. A URL based on the ID that gives an RIS representation of the records
  4. A screwdriver. Made with decent — but not too expensive — vodka and fresh orange juice.

Yes. I’m cheating.

I have all those things already. Hence, this is easy for me. If you had to, say, write some sort of weird redirection script because IDs are not first-class citizens in your OPAC’s URL scheme, or write an RIS export tool by hand, well, this will take you a bit longer.

The process

1. Build an unAPI target script

You need a script that’ll do three things:

  1. With no arguments, return a list of available formats in general
  2. With one argument, id=<ID>, return a list of formats available for that item. This will likely be exactly the same as #1.
  3. With two arguments, id=<ID> & format=<FORMAT>, return the record identified by <ID> in format <FORMAT>

Mine looks like this:

<?php
// id is of the form urn:bibnum:000000000
$id = isset($_REQUEST['id']) ? $_REQUEST['id'] : false;

// Format, at this point, had better be 'ris'
$format = isset($_REQUEST['format']) ? $_REQUEST['format'] : false;

// Got neither? Return the general list
if (!($id || $format)) {
  header('Content-type: application/xml');
  echo '<?xml version="1.0" encoding="UTF-8"?>
<formats>
  <format name="ris"
          type="application/x-Research-Info-Systems"
          docs="http://www.refman.com/support/risformat_intro.asp"/>
</formats>
';
  exit;
}

// Got just the id? Return formats for that ID
if ($id && !$format) {
  header('Content-type: application/xml');
  // Escape the id so a stray quote or angle bracket can't break the XML
  echo '<?xml version="1.0" encoding="UTF-8"?>
<formats id="' . htmlspecialchars($id, ENT_QUOTES) . '">
  <format name="ris"
          type="application/x-Research-Info-Systems"
          docs="http://www.refman.com/support/risformat_intro.asp"/>
</formats>
';
  exit;
}

// Otherwise…

// Parse out the actual numeric part of the id from the urn:<typeOfNumber> prefix
if (!preg_match('/^urn:bibnum:(\d+)$/', $id, $match)) {
  // Missing or unrecognized id: there is nothing to redirect to
  header('HTTP/1.1 404 Not Found');
  exit;
}
$actualID = $match[1];

// Again: format had better be 'ris' because that's all I'm supporting at this point.
header('Location: /Search/SearchExport?id=' . $actualID . '&method=' . urlencode($format), true, 302);
exit;

You can see that a <format> is just a name, a mime-type, and an optional reference to documentation on the type.

I take advantage of my existing RIS export process in the redirect, at the bottom. I also built in the possibility that other types of numbers could come in — I’m hard-coding ‘bibnum’ for the moment, but could allow, say, “oclc” or “isbn” or whatnot, too.
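If your OPAC isn't PHP, the same three-way dispatch can be sketched as a plain Ruby function that takes the request parameters and returns a status, headers, and body. This is a rough sketch, not a drop-in script: the function name unapi_response is hypothetical, the /Search/SearchExport path is specific to my setup, and the 404 and 406 responses for bad ids and unsupported formats are my own assumption about sensible behavior.

```ruby
require 'cgi'

RIS_FORMAT = '<format name="ris" type="application/x-Research-Info-Systems" ' \
             'docs="http://www.refman.com/support/risformat_intro.asp"/>'

# Hypothetical dispatcher: returns [status, headers, body] for the params.
def unapi_response(params)
  id = params['id']
  format = params['format']

  if format.nil?
    # No format asked for: list the formats, tagged with the id if given
    id_attr = id ? " id=\"#{CGI.escapeHTML(id)}\"" : ''
    body = [
      '<?xml version="1.0" encoding="UTF-8"?>',
      "<formats#{id_attr}>",
      "  #{RIS_FORMAT}",
      '</formats>'
    ].join("\n")
    return [200, { 'Content-Type' => 'application/xml' }, body]
  end

  # Pull the numeric part out of urn:bibnum:<digits>
  match = /\Aurn:bibnum:(\d+)\z/.match(id.to_s)
  return [404, {}, ''] unless match            # assumption: unknown id
  return [406, {}, ''] unless format == 'ris'  # assumption: unknown format

  [302, { 'Location' => "/Search/SearchExport?id=#{match[1]}&method=#{format}" }, '']
end

status, headers, = unapi_response('id' => 'urn:bibnum:000000002', 'format' => 'ris')
puts status
puts headers['Location']
```

Keeping the logic in a function that returns plain values makes it easy to wire into whatever web framework you have, and easy to test without a server.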

2. Tell your OPAC where the script lives

You’ll need a line in the <head> section of all your pages that might have an ID on them:

<link rel="unapi-server" type="application/xml" title="unAPI" href="/unapi">

Everything should be left alone except for the actual href.

3. Add your IDs to the HTML

In the HTML of your page, you can add one or more tags of the form:

<abbr class="unapi-id" title="urn:bibnum:000000002"></abbr>

(where the title of the <abbr> conforms to what you’re expecting in your script).

You can put stuff inside the <abbr> but you need not. On a single-record page, you should have (I would think) only one of these things. On a search results page, you may decide to not have any, or you may decide to have one for each search result.
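If your templates are generated by code, a tiny helper keeps the tag consistent with what the target script expects. A minimal sketch in Ruby; the helper name unapi_abbr is hypothetical, and the default "bibnum" type simply mirrors the urn:bibnum: prefix used in my script.

```ruby
require 'cgi'

# Hypothetical helper: build the unAPI marker tag for one record.
def unapi_abbr(number, type = 'bibnum')
  urn = "urn:#{type}:#{number}"
  # Escape the URN so an odd identifier can't break the attribute
  '<abbr class="unapi-id" title="' + CGI.escapeHTML(urn) + '"></abbr>'
end

puts unapi_abbr('000000002')
```

On a search results page you would call it once per result.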

4. Final step

Drink your screwdriver.

Where can I see it?

Well…here’s the thing.

You can take a look at my test instance, http://dueberb.vufind.lib.umich.edu/ and play there. You cannot see it in production, because there’s a little problem.

Our old OPAC — now dubbed mirlyn-classic — had a custom translator written for it. And it worked fine, and that was great.

But now we’ve got this new software running at mirlyn.lib.umich.edu, and Zotero keeps on using the old translator no matter what you do. The only way to override it is to actually fire up sqlite3 and remove the conflicting entry from the zotero translators table. And then never update that table again.

I’ve asked around about getting it fixed (changing the target URL for the old translator to point at mirlyn-classic) but it’s Friday, and no one is around. Hopefully soon.

2 Responses to “Setting up your OPAC for Zotero support using unAPI”

  1. David says:

    This looks pretty cool – I’d like to try something similar with my Voyager system, but I’m currently let down on step 3.

    Is your RIS production from Aleph or VUFind? And is it off-the-shelf, or something you wrote?

  2. Bill says:

    My RIS output is a home-grown vufind extension, built using data from the solr index. So…not a lotta help there, although I can certainly post my config files and accompanying code if folks are interested in a quick-and-dirty way to do it.


Thinking through a simple API for HathiTrust item metadata

EDITS:

  • Added “recordURL” per Tod’s request
  • Made a record’s title field an array and call it titles, to allow for vernacular entries
  • Changed item’s ingest to lastUpdate to accurately note what the actual date reflects. This gets updated every time either the item or the record to which it’s attached gets changed.
  • Fixed a couple typos, including one where I substituted an ampersand for a pipe in the multi-get example (thanks again, Tod).
  • Added a better explanation of option #4

Introduction and History

Ages ago, I wrote a simple(ish) little cgi program to get basic item-level data out of what is now Mirlyn Classic, our OPAC. Soon enough, I was asked to modify it so people could get HathiTrust data from the underlying Aleph system, check viewability of the associated items, etc.

It works…kinda…but reflects what I now look back on as blissful ignorance. It doesn’t deal at all with serials, and doesn’t deal correctly with duplicated records or cases where multiple records have the same (supposedly-unique) identifiers.

We need something better. And I’m hoping comments on this post will result in something better.

Scope

Given standard identifiers for a known item, return basic item-level metadata for volumes deposited in the HathiTrust.

I want to keep this simple. There will likely be other APIs for other, more complex (or specialized) tasks, linked data for folks who dig that sort of thing, and so on. The goal here is to make something that’s fast and can help people inline data about HT into their own OPAC or similar system.

I’d also like to get this thing in place, at least the basics, in the next two weeks. Anything longer is self-indulgent.

Data returned

At the moment, I’m only planning on offering JSON out, unless someone really, really needs something else. Speak up if you’re an edge case.

Proposed basic return structure

…complete with JSON-illegal comments embedded

  {
    "records":
      {
        "003384758": // The HathiTrust record id of a matched record
          {
            "recordURL" : "http://catalog.hathitrust.org/Record/003384758",
            "titles":  ["Full, space-joined 245s"],
            "isbns" : ["123456789X"], // any/all ISBNs on this record
            "issns" : [], // any/all ISSNs on this record
            "oclcs" : [], // any/all OCLC numbers
            "lccns"  : ["68001537"] // any/all LCCNs
          },
          … // any more records that were matched
      },
    "items" :
      [
        {
          "fromRecord" : "003384758",
          "htid": "mdp.39015054407062",
          "itemURL": "http://hdl.handle.net/2027/mdp.39015054407062",
          "rights": "ic",
          "orig": "University of Michigan", // supplying institution
          "lastUpdate" : "20090807", // date of ingest into HathiTrust or last change
          "enumcron" : "An enumeration/chronology, if available" // OPTIONAL
        }
      ]
  }

A quick walk through the proposed return structure

Obviously, there are two sets of items: a list of records that matched the query, and a list representing the union of all items on those matched records.

records

For most purposes, you won’t care so much about the record-level data unless you’re trying to do your own error-checking (possible) or want to link to the catalog record-level page (more likely).

[I'm actually very open to just plain leaving it out.]

The format is a hash keyed on the HathiTrust record ID, which can currently be turned into a URL such as http://catalog.hathitrust.org/Record/003384758. Elements are:

  • recordURL: The URL to the human-readable record view in the catalog
  • titles: an array of the full 245 fields, each with its subfields joined by spaces. Always present, usually with one item, sometimes more than one (vernacular entries), almost never with zero.
  • isbns: An array of all the ISBNs associated with the record. Always present; an empty array if none.
  • issns, oclcs, lccns: Same as ISBNs, but for the appropriate data.

Note that at this time, LCCNs are taken from the 010, so the LCCN array will either be empty or have one item. I left it as an array just for consistency.

items

This is an array of items, taken from all the matched records and ordered (as best I can) based on their enumcron. If no enumcron is present, order is undefined.

  • fromRecord: The HathiTrust record ID, as used as a key in the hash of records (explained above).
  • htid: The HathiTrust ID for the item.
  • itemURL: The URL to the page-turner (or search box, for search-only items) for this item. It’s currently just appended to the prefix “http://hdl.handle.net/2027/”, but I thought I’d include it in case the preferred URL algorithm changes at some point.
  • rights: The rights code for this item, as explained at http://www.hathitrust.org/hathifiles_metadata.
  • orig: The institution that supplied the item for digitization.
  • lastUpdate: The date of the last time this item or its containing record was touched, either because of ingest by the HathiTrust system or later editing, as YYYYMMDD. May be 00000000 if unknown.
  • enumcron: (OPTIONAL) The enumeration/chronology (e.g., “v. 3 1997” or somesuch). Again, optional. Leave out the key? Provide an empty string? Provide a false?

A word about enumcron

The enumcron string is fickle, and very local. The algorithm I’m using to sort them basically consists of taking all the numbers in the enumcron strings and zero-padding them to 8 digits, then sorting. It works pretty well, but isn’t perfect. I’m incredibly resistant to trying to do anything fancier, simply because I want it to be fast and because trying to deal with all possible enumcron formats is a sisyphean task.
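Here is a minimal sketch of that sort in Ruby, under one reading of the description: pull out every run of digits, left-pad each to 8 digits, and sort on the result. The method names are hypothetical, and treating a missing or false enumcron as an empty key (so it sorts first) is my assumption.

```ruby
# Build a sortable key from the numbers in an enumcron string.
# "v. 3 1997" becomes "00000003 00001997".
def enumcron_sort_key(enumcron)
  enumcron.to_s.scan(/\d+/).map { |n| n.rjust(8, '0') }.join(' ')
end

# Sort item hashes (as in the "items" array) by their enumcron key.
def sort_by_enumcron(items)
  items.sort_by { |item| enumcron_sort_key(item['enumcron']) }
end

items = [
  { 'enumcron' => 'v. 10 1999' },
  { 'enumcron' => 'v. 2 1997' },
  { 'enumcron' => 'v. 3 1997' }
]
sort_by_enumcron(items).each { |item| puts item['enumcron'] }
```

The padding is what puts "v. 10" after "v. 2"; a plain string sort would put it first. It will still get things wrong for enumcrons that spell out months or use ranges, which is the imperfection mentioned above.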

An actual record

Here’s the simplest possible case: a single matched record with a single item

  {
    "records":
      {
        "000366004":
          {
            "recordURL" : "http://catalog.hathitrust.org/Record/000366004",
            "titles": ["The Sneetches, and other stories. Written and illustrated by Dr. Seuss."],
            "isbns": [],
            "issns": [],
            "oclcs": ["00470409"],
            "lccns": ["68001537"]
          }
      },
    "items": [
      {
        "fromRecord": "000366004",
        "htid": "mdp.39015079651611",
        "itemURL": "http://hdl.handle.net/2027/mdp.39015079651611",
        "rights": "ic",
        "lastUpdate": "20091004",
        "orig": "University of Michigan",
        "enumcron": false
      }
    ]
  }

We can see that (despite my expectation) we don’t happen to have an ISBN for this record. The item originally came from Michigan and was either ingested or last updated on October 4th, 2009. The HathiTrust catalog page for this record is http://catalog.hathitrust.org/Record/000366004 (derived from the record ID) and the item is In Copyright (ic), so the itemURL goes to a page that allows only search.
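To give a feel for how a consumer might use this, here is a small Ruby sketch that parses a trimmed copy of the response and groups the flat items array by record. The function name items_by_record and the "(no enumcron)" label are hypothetical, and since the API is still only a proposal, the field names could change.

```ruby
require 'json'

# Group the flat "items" array by the record each item hangs off.
def items_by_record(response)
  response.fetch('items', []).group_by { |item| item['fromRecord'] }
end

# A trimmed copy of the example response (only the fields used here).
response = JSON.parse(<<~EOS)
  {
    "records": {
      "000366004": {
        "recordURL": "http://catalog.hathitrust.org/Record/000366004",
        "titles": ["The Sneetches, and other stories. Written and illustrated by Dr. Seuss."]
      }
    },
    "items": [
      {
        "fromRecord": "000366004",
        "htid": "mdp.39015079651611",
        "itemURL": "http://hdl.handle.net/2027/mdp.39015079651611",
        "enumcron": false
      }
    ]
  }
EOS

items_by_record(response).each do |record_id, items|
  record = response['records'][record_id]
  puts "#{record['titles'].first} (#{record['recordURL']})"
  items.each do |item|
    puts "  #{item['enumcron'] || '(no enumcron)'}: #{item['itemURL']}"
  end
end
```

The fromRecord key is what ties each item back to its record, so a serial with many volumes would show up as one title with a list of item links under it.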

Making the request

I’ll take care of normalizing data on the way in (mostly done by the Solr backend): strip leading zeros off the OCLC number, normalize the LCCN as per this page at the Library of Congress, strip anything funny-looking from the ISBN and ISSN, and (probably) convert all ISBNs into ISBN13s.
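A rough sketch of two of those normalizations in Ruby, for anyone who wants to do the same on their side before comparing numbers. The method names are hypothetical and this is not the Solr-side code; LCCN normalization is left out because its rules are longer. The ISBN-13 conversion uses the standard 978 prefix and the alternating 1/3 weighting for the check digit.

```ruby
# Strip non-digits and leading zeros from an OCLC number.
def normalize_oclc(oclc)
  oclc.to_s.gsub(/\D/, '').sub(/\A0+/, '')
end

# Strip anything funny-looking (hyphens, spaces) from an ISBN or ISSN.
def clean_standard_number(number)
  number.to_s.upcase.gsub(/[^0-9X]/, '')
end

# Convert an ISBN-10 to ISBN-13; pass ISBN-13s through, nil otherwise.
def isbn13(isbn)
  digits = clean_standard_number(isbn)
  return digits if digits.length == 13
  return nil unless digits.length == 10

  core = '978' + digits[0, 9] # drop the old check digit
  sum = core.chars.each_with_index.map { |d, i| d.to_i * (i.even? ? 1 : 3) }.sum
  core + ((10 - sum % 10) % 10).to_s
end

puts normalize_oclc('00470409')
puts isbn13('0835221792')
```

With this in place, "00470409" and "470409" compare equal, and ISBN-10 and ISBN-13 forms of the same number collapse to one value.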

I’m anticipating three formats for a request (note: they don’t work yet. There’s no code):

Single-identifier option

http://catalog.hathitrust.org/api/volumes/oclc/00470409.json

http://catalog.hathitrust.org/api/volumes/lccn/68001537.json


http://catalog.hathitrust.org/api/volumes/issn/1051290x.json


http://catalog.hathitrust.org/api/volumes/isbn/0835221792.json

Simple and unambiguous; returns the proposed return structure as described above (and presumably amended before actual implementation). Again, any normalization that needs to be done will be done on my end, so “00470409” and “470409” are considered the same OCLC number.

Multiple-identifier, multi-request option

http://catalog.hathitrust.org/api/volumes?yourID1=oclc:00470409|lccn:68001537&yourID2=oclc:67890987|isbn:987652348X

In this format, you can see that (a) you can provide multiple pieces of metadata for a record, separated by pipe characters (|), and (b) you can provide metadata sets for multiple records at once, keyed on whatever arbitrary ID you want to use.

The return format would look like this:

  {
    "yourID1" : <proposed return structure>,
    "yourID2" : <proposed return structure>,
      …
  }
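Building that kind of URL from your own data is mostly string joining. A minimal Ruby sketch, assuming the request format stays as proposed (remember there is no code behind it yet); the method name multi_request_url is hypothetical, and the pipes and colons are left unencoded to match the example above.

```ruby
VOLUMES_BASE = 'http://catalog.hathitrust.org/api/volumes'

# requests maps your own ID to a hash of identifier type => value.
def multi_request_url(requests)
  query = requests.map do |your_id, identifiers|
    # One "type:value" pair per identifier, joined with pipes
    "#{your_id}=" + identifiers.map { |type, value| "#{type}:#{value}" }.join('|')
  end
  "#{VOLUMES_BASE}?#{query.join('&')}"
end

puts multi_request_url(
  'yourID1' => { 'oclc' => '00470409', 'lccn' => '68001537' },
  'yourID2' => { 'oclc' => '67890987', 'isbn' => '987652348X' }
)
```

The keys on the way in are the same keys you would read the results back out under.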

What to do when the provided metadata don’t agree?

It’s entirely possible to provide an OCLC number and an LCCN that, in fact, refer to two different records. It’s also possible that we have two records in the system that should be merged, but haven’t been.

Some possible algorithms:

  1. Require that all sent numbers match: If you send an OCLC, an ISBN, and an LCCN, any returned record must have all three, and all must match. That seems too strict.
  2. Return any records that match any sent numbers: I could do a boolean-OR, so any record that matches any of the numbers you send gets returned. The risk of returning too much data seems too great.
  3. Return any records that don’t mismatch any sent numbers: The same as the first option, but null matches anything. So, if you sent an LCCN, and if the record has an LCCN, they must match. If you sent an OCLC number and if the record has an OCLC number, it, too, must match, etc. Basically, every piece of metadata, if provided, must match.
  4. Order the number types and only match the best available. We provide an ordered list of types: OCLC, LCCN, ISBN, and finally ISSN. If you provide an OCLC number and there is a record with that OCLC number, return it and ignore everything else. If you didn’t provide an OCLC number (or if you did but we didn’t get any matches), move on to the LCCN and try again, as shown below.

    # The algorithm for #4 (records_that_match stands in for the actual lookup)
    def best_available_match(provided_search)
      %w[oclc lccn isbn issn].each do |type|
        number = provided_search[type]
        next unless number                    # move on unless a number was provided
        records = records_that_match(type, number)
        return records unless records.empty?  # if we found some, return them
        # otherwise, we move on to the next type
      end
      []                                      # nothing matched on any type
    end
    

So, for #4, if you provide an OCLC number and we find a match or matches, stop looking and return them. If we don’t find an OCLC match but you also provided an LCCN, look for records that match the LCCN, and if found return them. Repeat with ISBN and ISSN.

Understanding #3 vs. #4

Suppose the following are true:

  • You provide an OCLC number O and an LCCN L
  • I have a record r1 with OCLC number O and no LCCN at all
  • I have a record r2 with LCCN L and no OCLC number at all.

Under option #3, both records would be returned. They both fulfill the criteria that they match all the supplied identifiers in all fields for which they have values. In other words, r1 has a positive match on OCLC (O == O) and a null-matches-everything match on LCCN (L == no data).

Under option #4, only r1 is returned. We first look for all records that match on the OCLC number provided, find exactly one, and return it. We never even bother to look for records that match on LCCN only.
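The r1/r2 scenario can be played out in a few lines of Ruby. This is a sketch of the two matching rules, not the eventual implementation: the method names, the in-memory record hashes, and the requirement that option #3 have at least one positive match (so a record with no identifiers at all isn't returned for everything) are all my assumptions.

```ruby
TYPES = %w[oclc lccn isbn issn].freeze

# Option #3: keep records that don't mismatch any sent number.
# A type the record has no values for counts as "matches anything".
def option3(records, sent)
  records.select do |rec|
    shared = sent.keys.reject { |type| Array(rec[type]).empty? }
    shared.any? && shared.all? { |type| rec[type].include?(sent[type]) }
  end
end

# Option #4: try the types in order and return the first non-empty match.
def option4(records, sent)
  TYPES.each do |type|
    next unless sent[type]
    found = records.select { |rec| Array(rec[type]).include?(sent[type]) }
    return found unless found.empty?
  end
  []
end

records = [
  { 'id' => 'r1', 'oclc' => ['O'], 'lccn' => [] },
  { 'id' => 'r2', 'oclc' => [],    'lccn' => ['L'] }
]
sent = { 'oclc' => 'O', 'lccn' => 'L' }

puts option3(records, sent).map { |r| r['id'] }.inspect  # ["r1", "r2"]
puts option4(records, sent).map { |r| r['id'] }.inspect  # ["r1"]
```

Option #4 does a handful of simple lookups and stops early, which is the speed argument; option #3 has to look at every candidate record for every type sent.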

Let’s pick one and see how it works in the real world

I’m leaning toward #4, but I’m open to #3 as well, or any other variant that can be computed quickly and easily on this end. We’re talking about some pretty weird edge cases when we start going down this road, and I don’t want to sacrifice ease of use and ease of computation any more than we have to.

Please comment!

You can comment here, or send email directly to me. I’ll follow up this post periodically with more thoughts and synopses of what I’ve heard.

8 Responses to “Thinking through a simple API for HathiTrust item metadata”

  1. For your last question, I would pick #2.

    Also, despite the fact that I asked for it, you’re right that that records stuff is confusing. I’m confused about the difference between http://hdl.handle.net/2027/mdp.39015079651611 and http://catalog.hathitrust.org/Record/003384758

    I guess in part because HT is still working out. There doesn’t seem to be any reason to ever send the user to that /Record url, since it doesn’t even point to the item and provide access to searching or full text if available!

    I still tend to err on the side of including extra info, cause someone might need it, and when they do you’re unlikely to have time to go back and add it. On the other hand, maybe this is too confusing.

  2. Oh, I mean #3, not #2. Null should not prevent a mis-match.

    But if I’m sending LCCN=a&ISBN=b, probably because I think those both refer to the same ‘thing’, and you have multiple records that refer to that ‘thing’ — I want to see them all. I don’t want you to just pick an arbitrary one and hide all the rest.

    I mean, if I just sent an ISBN, and you had two records with that ISBN (quite possible), you’d give me both of them, right, not just pick one you think is ‘best’?

  3. Bill says:

    Jonathan: http://hdl.handle.net/2027/mdp.39015079651611 goes to the page-turner for that particular bound volume. http://catalog.hathitrust.org/Record/003384758 shows the metadata for the record onto which that item hangs. A single serial record will have many items.

  4. Tod Olson says:

    Regarding the “records” data, keep it in. It feels like the sort of data element you don’t miss until it’s not there. At the very least, that data make it very easy for the JSON consumer to tell whether there were multiple records returned, without having to grovel all of the items and sift out the record IDs.

    Perhaps the “records” data should also provide a “recordURL” It would be analogous to “itemURL” in that the API would be responsible for formatting the URL to a record. Then every consumer of this information would not have to know the “how to link to a record” convention, just as they do not need to know how to construct the handle URL for an item.

    One bit of clarification: in the text under “Making the request,” I read that to mean that these two URLs return identical information: http://catalog.hathitrust.org/api/volumes/oclc/00470409.json http://catalog.hathitrust.org/api/volumes/oclc/470409.json

    In the multivolume request example, did you mean “…yourID2=oclc:67890987|isbn:987652348X”, with a “|” rather than a “&”?

    On the final question about when the records don’t match, I’m leaning toward #4, the best matches. I’m a little concerned about cases where some important number changes, like when OCLC records merge. (or an ISBN is corrected or whatever.) So if I send OCLC, LCCN, and ISBN and there’s no matching OCLC number, would the service then fall back to LCCN? Or would a miss on the OCLC mean the whole request fails? In any case, experience with the new API will tell us whether the matching needs to be tweaked.

  5. Definitely prefer #3 to #4.

    A middle ground is that you can rank them internally, and put the one you think is ‘best’ first. So the client can easily just take the first one and ignore the others. But I as a client am going to sometimes want to see all matches, not have the ones your algorithm considered ‘not as good’ hidden from me.

  6. Ah, but wait, maybe I am misunderstanding things. The pipe/ampersand confusion made me realize I don’t understand that.

    What’s the difference between asking for?

    ?yourID1=oclc:00470409|lccn:68001537&yourID2=oclc:67890987|isbn:987652348X

    Or asking for:

    ?yourId1=oclc:67890987&yourId2=lccn:68001537&yourId3=oclc:67890987&yourId4=isbn:987652348X

    What does the pipe grouping do for you? Maybe this is related to the solution #3 vs #4 thing, cause maybe I can get what I want by constructing my request the right way even if you do #4. But the pipe grouping thing kinda seems like unnecessary complexity to me.

  7. Stephanie Collett says:

    I prefer #3 to #4 as well. Putting the items in best-match order would make short work for simple clients. However, we also plan to use the API to spot check the metadata for our submissions. Multiple matches would alert us to metadata issues (on either end) that would fall silent in algorithm #4.

    I’d also like to propose a new feature if it would be simple to implement. I’d like to be able to look up records by Hathi Trust ID.

    I’m building a simple web client for looking up detailed item information. The target audience is internal staff working on the bibliographic issues for our Hathi Trust submissions. I’d like users to be able to query by Hathi ID along with the other identifiers. If they query by Hathi ID, it would be nice to at least show the title, and possibly tie the item to other information by using the returned identifiers in the record like the OCLC number.

  8. [...] put up a beta version of the HathiTrust Volumes API previously discussed on this blog and via [...]
