Setting up your OPAC for Zotero support using unAPI

unAPI is a very simple protocol to let a machine know what other formats a document is available in. Zotero is a bibliographic management tool (like Endnote or Refworks) that operates as a Firefox plugin. And it speaks unAPI.

Let’s get them to play nice with each other!

How’s it all work?

  1. Zotero looks for a well-constructed <link> tag in the head of the page
  2. It checks the document on the other side of that link to see what formats are offered, and picks one to use. No, you can’t decide which one it uses. It picks.
  3. Zotero then looks for IDs in the body of the page
  4. If both are found and everything seems kosher, Zotero will offer the option to import some or all of the records.
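
To make steps 1 and 3 concrete, here is a rough Python sketch of the discovery pass: find the unapi-server link, then collect the IDs. It is only an illustration of the protocol as described above, not how Zotero itself is written, and the class and function names are made up for the example.

```python
from html.parser import HTMLParser


class UnapiScanner(HTMLParser):
    """Collect the unAPI server link and any unapi-id identifiers on a page."""

    def __init__(self):
        super().__init__()
        self.server = None
        self.ids = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "link" and attrs.get("rel") == "unapi-server":
            # Step 1: the well-constructed <link> tag
            self.server = attrs.get("href")
        elif tag == "abbr" and "unapi-id" in (attrs.get("class") or "").split():
            # Step 3: IDs carried in the title attribute
            self.ids.append(attrs.get("title"))


def scan(html):
    """Return (server_href, list_of_ids) for one HTML page."""
    scanner = UnapiScanner()
    scanner.feed(html)
    return scanner.server, scanner.ids


PAGE = """<html><head>
<link rel="unapi-server" type="application/xml" title="unAPI" href="/unapi">
</head><body>
<abbr class="unapi-id" title="urn:bibnum:000000002"></abbr>
</body></html>"""

print(scan(PAGE))
```

If either half comes back empty (no server, or no IDs), a client has nothing to offer for import.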

What you’ll need

  1. An OPAC whose output you can futz with
  2. Access to an individual record’s ID in that output
  3. A URL based on the ID that gives an RIS representation of the records
  4. A screwdriver. Made with decent — but not too expensive — vodka and fresh orange juice.

Yes. I’m cheating.

I have all those things already. Hence, this is easy for me. If you had to, say, write some sort of weird redirection script because IDs are not first-class citizens in your OPAC’s URL scheme, or write an RIS export tool by hand, well, this will take you a bit longer.

The process

1. Build an unAPI target script

You need a script that’ll do three things:

  1. With no arguments, return a list of available formats in general
  2. With one argument, id=<ID>, return a list of formats available for that item. This will likely be exactly the same as #1.
  3. With two arguments, id=<ID> & format=<FORMAT>, return the record identified by <ID> in format <FORMAT>

Mine looks like this:

  <?php
  // id is of the form urn:bibnum:000000000
  $id = isset($_REQUEST['id']) ? $_REQUEST['id'] : false;

  // Format, at this point, had better be 'ris'
  $format = isset($_REQUEST['format']) ? $_REQUEST['format'] : false;

  // Got neither? Return the general list
  if (!($id || $format)) {
    header('Content-type: application/xml');
    echo '<?xml version="1.0" encoding="UTF-8"?>
  <formats>
    <format name="ris"
            type="application/x-Research-Info-Systems"
            docs="http://www.refman.com/support/risformat_intro.asp"/>
  </formats>
  ';
    exit;
  }

  // Got just the id? Return formats for that ID
  if ($id && !$format) {
    header('Content-type: application/xml');
    // Escape the id so a stray quote or angle bracket can't break the XML
    echo '<?xml version="1.0" encoding="UTF-8"?>
  <formats id="' . htmlspecialchars($id) . '">
    <format name="ris"
            type="application/x-Research-Info-Systems"
            docs="http://www.refman.com/support/risformat_intro.asp"/>
  </formats>
  ';
    exit;
  }

  // Otherwise…

  // Parse out the actual numeric part of the id from the urn:<typeOfNumber> prefix
  if (!preg_match('/^urn:bibnum:(.+)$/', $id, $match)) {
    header('HTTP/1.1 404 Not Found'); // missing or malformed id
    exit;
  }
  $actualID = $match[1];

  // Again: format had better be 'ris' because that's all I'm supporting at this point.
  header('Location: /Search/SearchExport?id=' . urlencode($actualID) . '&method=' . urlencode($format), true, 302);
  exit;

You can see that a <format> is just a name, a mime-type, and an optional reference to documentation on the type.

I take advantage of my existing RIS export process in the redirect, at the bottom. I also built in the possibility that other types of numbers could come in — I’m hard-coding ‘bibnum’ for the moment, but could allow, say, “oclc” or “isbn” or whatnot, too.
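
If I ever do open it up to other number types, the parsing step might look something like this Python sketch. The set of allowed types is hypothetical (only bibnum exists in my script today), and the function name is invented for the example.

```python
import re

# Hypothetical list of identifier types; the script above only handles "bibnum".
KNOWN_TYPES = {"bibnum", "oclc", "isbn"}


def parse_unapi_id(raw):
    """Split 'urn:<type>:<value>' into (type, value); return None if malformed."""
    match = re.fullmatch(r"urn:([a-z]+):(.+)", raw or "")
    if not match or match.group(1) not in KNOWN_TYPES:
        return None
    return match.group(1), match.group(2)


print(parse_unapi_id("urn:bibnum:000000002"))
print(parse_unapi_id("urn:oclc:470409"))
print(parse_unapi_id("not-a-urn"))
```

The caller would then pick the lookup (or redirect target) based on the returned type, and answer with an error when it gets None back.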

2. Tell your OPAC where the script lives

You’ll need a line in the <head> section of all your pages that might have an ID on them:

<link rel="unapi-server" type="application/xml" title="unAPI" href="/unapi">

Everything should be left alone except for the actual href.

3. Add your IDs to the HTML

In the HTML of your page, you can add one or more tags of the form:

<abbr class="unapi-id" title="urn:bibnum:000000002"></abbr>

(where the title of the <abbr> conforms to what you’re expecting in your script).

You can put stuff inside the <abbr> but you need not. On a single-record page, you should have (I would think) only one of these things. On a search results page, you may decide to not have any, or you may decide to have one for each search result.

4. Final step

Drink your screwdriver.

Where can I see it?

Well…here’s the thing.

You can take a look at my test instance, http://dueberb.vufind.lib.umich.edu/ and play there. You cannot see it in production, because there’s a little problem.

Our old OPAC — now dubbed mirlyn-classic — had a custom translator written for it. And it worked fine, and that was great.

But now we’ve got this new software running at mirlyn.lib.umich.edu, and Zotero keeps on using the old translator no matter what you do. The only way to override it is to actually fire up sqlite3 and remove the conflicting entry from the zotero translators table. And then never update that table again.

I’ve asked around about getting it fixed (changing the target URL for the old translator to point at mirlyn-classic) but it’s Friday, and no one is around. Hopefully soon.

Thinking through a simple API for HathiTrust item metadata

EDITS:

  • Added “recordURL” per Tod’s request
  • Made a record’s title field an array and call it titles, to allow for vernacular entries
  • Changed item’s ingest to lastUpdate to accurately note what the actual date reflects. This gets updated every time either the item or the record to which it’s attached gets changed.
  • Fixed a couple typos, including one where I substituted an ampersand for a pipe in the multi-get example (thanks again, Tod).
  • Added a better explanation of option #4

Introduction and History

Ages ago, I wrote a simple(ish) little cgi program to get basic item-level data out of what is now Mirlyn Classic, our OPAC. Soon enough, I was asked to modify it so people could get HathiTrust data from the underlying Aleph system, check viewability of the associated items, etc.

It works…kinda…but reflects what I now look back on as blissful ignorance. It doesn’t deal at all with serials, and doesn’t deal correctly with duplicated records or cases where multiple records have the same (supposedly-unique) identifiers.

We need something better. And I’m hoping comments on this post will result in something better.

Scope

Given standard identifiers for a known item, return basic item-level metadata for volumes deposited in the HathiTrust.

I want to keep this simple. There will likely be other APIs for other, more complex (or specialized) tasks, linked data for folks who dig that sort of thing, and so on. The goal here is to make something that’s fast and can help people inline data about HT into their own OPAC or similar system.

I’d also like to get this thing in place, at least the basics, in the next two weeks. Anything longer is self-indulgent.

Data returned

At the moment, I’m only planning on offering JSON out, unless someone really, really needs something else. Speak up if you’re an edge case.

Proposed basic return structure

…complete with JSON-illegal comments embedded

  {
    "records":
      {
        "003384758": // The HathiTrust record id of a matched record
          {
            "recordURL" : "http://catalog.hathitrust.org/Record/003384758",
            "titles":  ["Full, space-joined 245s"],
            "isbns" : ["123456789X"], // any/all ISBNs on this record
            "issns" : [], // any/all ISSNs on this record
            "oclcs" : [], // any/all OCLC numbers
            "lccns"  : ["68001537"] // any/all LCCNs
          },
          … // any more records that were matched
      },
    "items" :
      [
        {
          "fromRecord" : "003384758",
          "htid": "mdp.39015054407062",
          "itemURL": "http://hdl.handle.net/2027/mdp.39015054407062",
          "rights": "ic",
          "orig": "University of Michigan", // supplying institution
          "lastUpdate" : "20090807", // date of ingest into HathiTrust or last change
          "enumcron" : "An enumeration/chronology, if available" // OPTIONAL
        }
      ]
  }

A quick walk through the proposed return structure

Obviously, there are two sets of items: a list of records that matched the query, and a list representing the union of all items on those matched records.

records

For most purposes, you won’t care much about the record-level data unless you’re trying to do your own error-checking (possible) or want to link to the catalog record-level page (more likely).

[I'm actually very open to just plain leaving it out.]

The format is a hash keyed on the HathiTrust record ID, which can currently be turned into a URL such as http://catalog.hathitrust.org/Record/003384758. Elements are:

  • recordURL: The URL to the human-readable record view in the catalog
  • titles: an array of all the full 245, space-separated subfields. Always present, usually with one item, sometimes more than one (vernacular entries), almost never with zero.
  • isbns: An array of all the ISBNs associated with the record. Always present; an empty array if none.
  • issns, oclcs, lccns: Same as ISBNs, but for the appropriate data.

Note that at this time, LCCNs are taken from the 010, so the LCCN array will either be empty or have one item. I left it as an array just for consistency.

items

This is an array of items, taken from all the matched records and ordered (as best I can) based on their enumcron. If no enumcron is present, order is undefined.

  • fromRecord: The HathiTrust record ID, as used as a key in the hash of records (explained above).
  • htid: The HathiTrust ID for the item.
  • itemURL: The URL to the page-turner (or search box, for search-only items) for this item. It’s currently just appended to the prefix “http://hdl.handle.net/2027/”, but I thought I’d include it in case the preferred URL algorithm changes at some point.
  • rights: The rights code for this item, as explained at http://www.hathitrust.org/hathifiles_metadata.
  • orig: The institution that supplied the item for digitization.
  • lastUpdate: The date of the last time this item or its containing record was touched, either because of ingest by the HathiTrust system or later editing, as YYYYMMDD. May be 00000000 if unknown.
  • enumcron: (OPTIONAL) The enumeration/chronology (e.g., “v. 3 1997” or somesuch). Again: optional. Leave out the key? Provide an empty string? Provide a false?

A word about enumcron

The enumcron string is fickle, and very local. The algorithm I’m using to sort them basically consists of taking all the numbers in the enumcron strings and zero-padding them to 8 digits, then sorting. It works pretty well, but isn’t perfect. I’m incredibly resistant to trying to do anything fancier, simply because I want it to be fast and because trying to deal with all possible enumcron formats is a sisyphean task.
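
Roughly, the sort key works like the Python sketch below. This is a simplified illustration rather than the production code: it assumes the key is built only from the digit runs, in the order they appear, and the function name is made up.

```python
import re


def enumcron_sort_key(enumcron):
    """Zero-pad every run of digits to 8 places and join them into one key."""
    numbers = re.findall(r"\d+", enumcron or "")
    return "".join(n.zfill(8) for n in numbers)


volumes = ["v. 10 1999", "v. 3 1997", "v. 2 1996", ""]
print(sorted(volumes, key=enumcron_sort_key))
```

Without the padding, "v. 10" would sort ahead of "v. 2" as a plain string. In this sketch an empty enumcron produces an empty key and so lands first, which is one arbitrary way of handling the "order is undefined" case.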

An actual record

Here’s the simplest possible case: a single matched record with a single item

  {
    "records":
      {
        "000366004":
          {
            "recordURL" : "http://catalog.hathitrust.org/Record/000366004",
            "titles": ["The Sneetches, and other stories. Written and illustrated by Dr. Seuss."],
            "isbns": [],
            "issns": [],
            "oclcs": ["00470409"],
            "lccns": ["68001537"]
          }
      },
    "items": [
      {
        "fromRecord": "000366004",
        "htid": "mdp.39015079651611",
        "itemURL": "http://hdl.handle.net/2027/mdp.39015079651611",
        "rights": "ic",
        "lastUpdate": "20091004",
        "orig": "University of Michigan",
        "enumcron": false
      }
    ]
  }

We can see that (despite my expectation) we don’t happen to have an ISBN for this item. The item originally came from Michigan, and was either ingested or last updated on October 4th, 2009. The HathiTrust catalog page for this item is http://catalog.hathitrust.org/Record/000366004 (derived from the record ID) and it is In Copyright (ic), so the itemURL goes to a page that allows only search.
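
For the "inline HT links into your own OPAC" use case, a client would mostly walk the items and look up each one's record. Here is a small Python sketch under the assumption that the response uses the key names from the proposed structure; the sample response is trimmed (shortened title) and the function name is invented.

```python
import json

# Trimmed-down copy of the example response, using the proposed key names.
RESPONSE = """{
  "records": {
    "000366004": {
      "recordURL": "http://catalog.hathitrust.org/Record/000366004",
      "titles": ["The Sneetches, and other stories."],
      "isbns": [], "issns": [], "oclcs": ["00470409"], "lccns": ["68001537"]
    }
  },
  "items": [
    {
      "fromRecord": "000366004",
      "htid": "mdp.39015079651611",
      "itemURL": "http://hdl.handle.net/2027/mdp.39015079651611",
      "rights": "ic",
      "orig": "University of Michigan",
      "lastUpdate": "20091004",
      "enumcron": false
    }
  ]
}"""


def item_links(response_text):
    """Return (title, itemURL, rights) for every item, joined to its record."""
    data = json.loads(response_text)
    links = []
    for item in data["items"]:
        record = data["records"][item["fromRecord"]]
        title = record["titles"][0] if record["titles"] else "(untitled)"
        links.append((title, item["itemURL"], item["rights"]))
    return links


for title, url, rights in item_links(RESPONSE):
    print(rights, url, title)
```

The fromRecord key is what ties the two halves of the structure together, so a client never has to guess which record an item belongs to.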

Making the request

I’ll take care of normalizing data on the way in (mostly done by the Solr backend): strip leading zeros off the OCLC number, normalize the LCCN as per this page at the Library of Congress, strip anything funny-looking from the ISBN and ISSN, and (probably) convert all ISBNs into ISBN13s.
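
To give a feel for what that normalization amounts to, here is a Python sketch of three of the steps. In the real system most of this is meant to live in the Solr backend, so treat these functions (and their names) as illustration only; LCCN normalization is left out because it follows the Library of Congress rules rather than anything I can summarize in a few lines. The ISBN-10 to ISBN-13 conversion uses the standard 978 prefix and check-digit calculation.

```python
import re


def normalize_oclc(raw):
    """Keep the digits and drop leading zeros."""
    return re.sub(r"\D", "", raw).lstrip("0")


def clean_isbn(raw):
    """Keep only digits and a possible X check character, uppercased."""
    return re.sub(r"[^0-9X]", "", raw.upper())


def isbn10_to_isbn13(isbn10):
    """Convert a cleaned 10-character ISBN to ISBN-13 with the 978 prefix."""
    if len(isbn10) != 10:
        raise ValueError("expected a 10-character ISBN")
    body = "978" + isbn10[:9]  # drop the old check digit
    # ISBN-13 check digit: weights alternate 1, 3, 1, 3, ...
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(body))
    return body + str((10 - total % 10) % 10)


print(normalize_oclc("00470409"))
print(isbn10_to_isbn13(clean_isbn("0-306-40615-2")))
```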

I’m anticipating three formats for a request (note: they don’t work yet. There’s no code):

Single-identifier option

http://catalog.hathitrust.org/api/volumes/oclc/00470409.json

http://catalog.hathitrust.org/api/volumes/lccn/68001537.json

http://catalog.hathitrust.org/api/volumes/issn/1051290x.json

http://catalog.hathitrust.org/api/volumes/isbn/0835221792.json

Simple and unambiguous; returns the proposed return structure as described above (and presumably amended before actual implementation). Again, any normalization that needs to be done will be done on my end, so “00470409” and “470409” are considered the same OCLC number.

Multiple-identifier, multi-request option

http://catalog.hathitrust.org/api/volumes?yourID1=oclc:00470409|lccn:68001537&yourID2=oclc:67890987|isbn:987652348X

In this format, you can see that (a) you can provide multiple pieces of metadata for a record, separated by pipe characters (|), and (b) you can provide metadata sets for multiple records at once, keyed on whatever arbitrary ID you want to use.

The return format would look like this:

  {
    "yourID1" : <proposed return structure>,
    "yourID2" : <proposed return structure>,
      …
  }
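
On the receiving end, pulling that query string apart is straightforward. The Python sketch below shows one way to do it; nothing is implemented on the server yet, so the function name and the shape of the parsed result are just assumptions for illustration.

```python
from urllib.parse import urlsplit, parse_qs


def parse_multi_request(url):
    """Map each caller-chosen key to a {type: value} dict of identifiers."""
    query = parse_qs(urlsplit(url).query)
    requests = {}
    for key, values in query.items():
        identifiers = {}
        for pair in values[0].split("|"):  # pipes separate the identifiers
            id_type, _, value = pair.partition(":")
            identifiers[id_type] = value
        requests[key] = identifiers
    return requests


url = ("http://catalog.hathitrust.org/api/volumes"
       "?yourID1=oclc:00470409|lccn:68001537&yourID2=oclc:67890987|isbn:987652348X")
print(parse_multi_request(url))
```

Each parsed set of identifiers would then be matched independently, and its result stored under the same arbitrary key in the response.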

What to do when the provided metadata don’t agree?

It’s entirely possible to provide an OCLC number and an LCCN that, in fact, refer to two different records. It’s also possible that we have two records in the system that should be merged, but haven’t been.

Some possible algorithms:

  1. Require that all sent numbers match: If you send an OCLC, an ISBN, and an LCCN, any returned record must have all three, and all must match. That seems too strict.
  2. Return any records that match any sent numbers: I could do a boolean-OR, so any record that matches any of the numbers you send gets returned. The risk of returning too much data seems too great.
  3. Return any records that don’t mismatch any sent numbers: The same as the first option, but null matches anything. So, if you sent an LCCN, and if the record has an LCCN, they must match. If you sent an OCLC number and if the record has an OCLC number, it, too, must match, etc. Basically, every piece of metadata, if present on both sides, must match.
  4. Order the number types and only match the best available. We provide an ordered list of types: OCLC, LCCN, ISBN, and finally ISSN. If you provide an OCLC number and there is a record with that OCLC number, return it and ignore everything else. If you didn’t provide an OCLC number (or if you did but we didn’t get any matches), move on to the LCCN and try again, as shown below.

    // The algorithm for #4
    foreach type in (OCLC, LCCN, ISBN, ISSN) {
      next unless (providedSearch[type]); // move on unless a number was provided
      records = recordsThatMatch(type, providedSearch[type]);
      if (records.size > 0) { // if we found some, return them
        return records;
      }
      // else, we move to the next type
    }

So, for #4, if you provide an OCLC number and we find a match or matches, stop looking and return them. If we don’t find an OCLC match but you also provided an LCCN, look for records that match the LCCN, and if found return them. Repeat with ISBN and ISSN.

Understanding #3 vs. #4

Suppose the following are true:

  • You provide an OCLC number O and an LCCN L
  • I have a record r1 with OCLC number O and no LCCN at all
  • I have a record r2 with LCCN L and no OCLC number at all.

Under option #3, both records would be returned. They both fulfill the criteria that they match all the supplied identifiers in all fields for which they have values. In other words, r1 has a positive match on OCLC (O == O) and a null-matches-everything match on LCCN (L == no data).

Under option #4, only r1 is returned. We first look for all records that match on the OCLC number provided, find exactly one, and return it. We never even bother to look for records that match on LCCN only.
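
Here is the same r1/r2 scenario as a small Python sketch, with both options written as functions over an in-memory list of records. The data layout and function names are invented for the example. One assumption to flag: my option #3 also requires at least one positive match, so that a record with no identifiers at all doesn't "match" every query.

```python
TYPE_ORDER = ["oclc", "lccn", "isbn", "issn"]


def option3(records, search):
    """Keep records that match at least one sent number and contradict none."""
    hits = []
    for rec in records:
        # Only compare types where the record actually has values (null matches anything)
        compared = [(rec[t], v) for t, v in search.items() if rec.get(t)]
        if compared and all(value in have for have, value in compared):
            hits.append(rec["id"])
    return hits


def option4(records, search):
    """Try each type in priority order; return the first non-empty set of matches."""
    for id_type in TYPE_ORDER:
        if id_type not in search:
            continue  # no number of this type was provided
        hits = [rec["id"] for rec in records if search[id_type] in rec.get(id_type, [])]
        if hits:
            return hits
    return []


records = [
    {"id": "r1", "oclc": ["O"], "lccn": []},
    {"id": "r2", "oclc": [], "lccn": ["L"]},
]
search = {"oclc": "O", "lccn": "L"}

print(option3(records, search))
print(option4(records, search))
```

Run against the two records above, option3 returns both r1 and r2 while option4 stops at r1.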

Let’s pick one and see how it works in the real world

I’m leaning toward #4, but I’m open to #3 as well, or any other variant that can be computed quickly and easily on this end. We’re talking about some pretty weird edge cases when we start going down this road, and I don’t want to sacrifice ease of use and ease of computation any more than we have to.

Please comment!

You can comment here, or send email directly to me. I’ll follow up this post periodically with more thoughts and synopses of what I’ve heard.

Adding LibXML and Java STAX support to ruby-marc with pluggable XML parsers

JRuby is my ruby platform of choice, mostly because I think its deployment options in my work environment are simpler (perhaps technically and certainly politically), but also because I have high, high hopes to use lots of super-optimized native java libraries. The CPAN is what keeps me tethered to Perl, and whether or not you like Java-the-language, boy, are there a lot of high-quality libraries out there.

Since I’ve been messing around with MARC-XML parsing of late, and since Ross Singer added pluggable xml-parser awesomeness to the ruby-marc project, I thought I’d see what I could do with native Java methods when parsing MARC-XML.

And just for kicks, I threw in the old code that I wrote before that uses LibXML.

Why do this at all?

Because…er…there’s an obvious work-situation where I need to squeeze every last drop of speed out of…ruby…which we don’t use…er…

Because. Because I wanted to screw around with the technologies. Because I wanted to learn about calling java native stuff. Because I already wrote the libxml stuff. Because it feels silly to run on the JVM and not use JVM-native code to deal with XML, given that standard java projects make it seem like Java is a giant XML processor with a language wrapped around it.

What exactly did I do?

For the LibXML stuff, I copied my own code. For the java stax (javax.xml.stream.XMLStreamReader, created via javax.xml.stream.XMLInputFactory) parser, I stole just about everything from Ross’s nokogiri code and put it into its own module, and then slimmed down the nokogiri module and the stax module to only include their differences.

The patch is at the ruby-marc rubyforge site if you want to play along at home.

Other than using the stax or libxml parser, everything else is the same — MARC::Record objects and their components are created exactly as they are with the other parsers. It might be “fun” (for some twisted definition of “fun”) to wrap the MARC::Record interface around marc4j at some point, but right now all that’s changed is the parsing.

Do they work?

Yes. Thanks for asking. At least all the tests pass when I type ‘rake’.

How fast is it?

As always, the numbers are iffy. These were done on my desktop, with other stuff going on. I didn’t bother to benchmark rexml because we know how slow that is.

The test file is a nightly dump intended to go into our VuFind install. It was born as binary marc, and changed to marc-xml using yaz-marcdump, which is so fast that I thought maybe something had gone wrong. Holy cow, is yaz-marcdump fast.

The resulting XML is 219MB and contains 46,242 records.

The test was to open it up, loop through the records, and pull the 245 out of each. Each segment looks something like this:

  reader = MARC::XMLReader.new(filename, :parser=>'jstax')
  reader.each do |record|
    title = record[245]
  end

Times are in seconds. I ran each one five times, with the exception of jrexml, during which I got bored. And the perl code, for which I just wanted to get a ballpark to compare.

MRI 1.8.7
    libxml     104    (103, 103, 106, 104, 103)
    nokogiri   301    (304, 300, 301, 301, 300)

JRuby trunk
    jrexml     547    (539, 554)
    jstax      203    (201, 208, 201, 201, 204)

Perl 5.10 w/MARC::File::XML
    perl       340    (340)

So…faster, right?

Pretty much, yeah.

Under (MRI) ruby, Ross found that nokogiri was 3.5x faster than rexml, and my noodling-around at home showed the same speedup. Using that as a baseline, we get the following speed comparison table using the libxml time normalized to 1.00.

In case that wasn’t clear: lower numbers are better.

libxml:   1.00
jstax:    1.95
nokogiri: 2.89
jrexml:   5.26
rexml:    10.11 (estimated; 3.5x nokogiri's time)
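
If you want to redo the arithmetic, the normalization is just each parser's average time divided by the libxml average. A throwaway Python sketch, using the rounded averages from the timing table (the function name is made up):

```python
# Average wall-clock seconds per parser, from the timing runs above.
times = {"libxml": 104, "jstax": 203, "nokogiri": 301, "jrexml": 547}


def normalized(times, baseline="libxml"):
    """Express every time as a multiple of the baseline parser's time."""
    base = times[baseline]
    return {name: round(t / base, 2) for name, t in times.items()}


for name, ratio in normalized(times).items():
    print(f"{name}: {ratio:.2f}")
```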

What does it all mean?

It means that adding pluggable parsers was freakin’ brilliant.

It means that a guy like me — with no real expertise in any of the applicable technologies — can do a passable job at integrating a java library into JRuby.

And it means that if I (a) can get folks around here to use Ruby, and (b) can get them to use MARC-XML instead of binary MARC (which we can’t use anyway because of the record-length limitations), I can be sure that any bottlenecks aren’t going to be the result of those choices.

An exercise in Solr and DataImportHandler: HathiTrust data

Many of the folks who read this blog (hi, both of you! Mom, say hello to Dad!) are aware, at least tangentially, of the HathiTrust. Currently hosted by us at the University of Michigan, the most public interface to its data is a VuFind installation you can access at catalog.hathitrust.org (or, for you smart-phone types, at m.catalog.hathitrust.org). Once you do a metadata search, you get links into the actual page images or a chance to search the fulltext of the selected item (depending on its copyright status).

It’s awesome. Seriously. Even in the absence of fulltext, being able to search within an item can be incredibly useful. Give it a shot if you haven’t.

You don’t always need an OPAC

But there are plenty of folks who don’t want or need a full-blown interface into all the metadata. They’ve already got one of those. What they’re interested in, mostly, is figuring out how to easily put links in their own OPAC (or whatnot or whoseits) to the HathiTrust if page images or searching are available. See, for example, a typical record from Tod Olson’s stuff at U-Chicago: he sniffs for HathiTrust and Google Books availability via embedded javascript.

To this end, the HathiTrust folks provide a set of simple, tab-delimited files — a full extract on the first of every month, and nightly updates every …er…night.

You can see from the description of the file that it’s very simple. Tab-delimited fields of the HathiTrust ID, rights information, and all the golden-oldie standard identifiers, some of which (ISSNs, ISBNs, etc.) are further comma-delimited in cases where multiple values are available and a field repeats. And a title and enumcron (description of an individual volume, e.g., “Sept 2007, vol. 33, issue 4”), so you have something useful to display if you need to, and that’s 98% of what most folks want.

The smart way to do it: RDBMS

If you want to query this data quickly and easily, the obvious thing to do is to dump it into a database. One main table for the non-repeated values, and either a few key=>value tables (or, if you’re lazy, a single key => type/value) for the repeated ISBNs/ISSNs/whatnot. A quick mod-perl script to set up some data normalization going in and out and persist the prepared SQL queries and you’re set.
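
For the record, the lazy single key => type/value layout would look something like this. It is a Python/sqlite3 sketch rather than the mod-perl setup described above, and the table names, column names, and helper functions are all made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE items (
    htid     TEXT PRIMARY KEY,   -- HathiTrust ID, the obvious foreign key
    rights   TEXT,
    title    TEXT,
    enumcron TEXT
);
CREATE TABLE identifiers (       -- repeated values, one row per identifier
    htid  TEXT REFERENCES items(htid),
    type  TEXT,                  -- 'isbn', 'issn', 'oclc', 'lccn'
    value TEXT
);
CREATE INDEX idx_identifiers ON identifiers(type, value);
""")


def add_item(htid, rights, title, enumcron, ids):
    """Insert one item plus its (type, value) identifier pairs."""
    conn.execute("INSERT INTO items VALUES (?, ?, ?, ?)", (htid, rights, title, enumcron))
    conn.executemany("INSERT INTO identifiers VALUES (?, ?, ?)",
                     [(htid, id_type, value) for id_type, value in ids])


def find_by(id_type, value):
    """Return (htid, rights) for every item carrying the given identifier."""
    rows = conn.execute(
        "SELECT i.htid, i.rights FROM items i "
        "JOIN identifiers d ON d.htid = i.htid "
        "WHERE d.type = ? AND d.value = ?", (id_type, value))
    return rows.fetchall()


add_item("mdp.39015079651611", "ic", "The Sneetches, and other stories.", "",
         [("oclc", "470409"), ("lccn", "68001537")])
print(find_by("oclc", "470409"))
```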

It’s hard to make an argument against using a database for these data. I mean, c’mon. We’ve got a well-defined structure. An obvious foreign-key. No full-text searching needed. This is practically designed for a good old-fashioned RDBMS. Plus, I’ve done this approximately a zillion times before, so I’m good and fast at it. Case closed.

How I’m gonna do it

Screw that. What I really wanted to do was start messing around with the DataImportHandler (DIH) in Solr.

I can make a weak argument for including the data in a Solr instance. To wit, it’ll certainly be fast enough for anything I’m gonna throw at it, and (more important to me) it’s easy to set up datastore-level indexing and querying filters with built-in facilities and/or custom code. This allows me to build clients that call it without having to worry about manipulating the input much, if at all.

The list of simple DIH examples is…well, I never really found any good ones, although I’m sure they’re out there. The documentation isn’t bad, but it’s not full of complete examples, and almost all of them have to do with the potential complexities of sucking data out of a database, which is what most people want to do. Not me, I’ve got flat files to work with.

Luckily, you can fire up an “interactive” DIH session where, at the very least, you can try to import a few rows of data and see if things are puking. I didn’t find the error reports particularly helpful all the time, but it’s about a zillion times better than nothing, I can tell you that much.

The game plan

We’ll start with the assumption that I’ve already managed to load a full dump from some date (run with me here; I’ll explain how to do it later). Then what we want to do is the following:

  1. Every night, download the nightly additions/changes file and gunzip it.
  2. Hit the DIH handle to import all files that (a) have a filename of the right format, and (b) have a created date after the last time the DIH handle was run.

And that’s it. Get the new stuff, have DIH figure out what’s new, and import it.

The first part is easy enough to do with perl/python/ruby/whatever. I’ll leave it as an exercise for all you diligent students.
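
For the less diligent, here is roughly what that first part could look like in Python, using only the standard library. The URL and file names in the commented-out usage are placeholders, not the real location or naming scheme of the nightly files.

```python
import gzip
import shutil
import urllib.request


def download(url, dest_path):
    """Fetch url and write the raw bytes to dest_path."""
    with urllib.request.urlopen(url) as response, open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out)


def gunzip(src_path, dest_path):
    """Decompress a .gz file into dest_path."""
    with gzip.open(src_path, "rb") as src, open(dest_path, "wb") as out:
        shutil.copyfileobj(src, out)


# Hypothetical nightly run; swap in the real URL and a date-stamped file name.
# download("https://example.org/hathi_upd_20100101.txt.gz", "hathi_upd_20100101.txt.gz")
# gunzip("hathi_upd_20100101.txt.gz", "hathi_upd_20100101.txt")
```

Run it from cron before poking the DIH handler, and keep the unzipped file names in whatever pattern the handler is configured to look for.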

Setting up solrconfig.xml

This is the easy part. Set up the handler, give it a semi-meaningful name, and call out to a config file.

  <requestHandler name="/hathiimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
      <lst name="defaults">
        <str name="config">hathi-data-config.xml</str>
      </lst>
  </requestHandler>

Define some useful data types in schema.xml

I left pretty much all of the boilerplate in schema.xml and just added a few types to deal with identifiers.

  • lowercase: return a single token that’s been lowercased. Don’t muck with it otherwise.
  • genericID: trim it, lowercase it, ditch everything that’s not a number or a letter, and return as a single token.
  • numeric: Ditch everything but the first string of digits, and then ditch any leading zeros. Useful when you know it’s gotta be an integer.
  • stdnum: Find the first set of digits (optionally followed by an ‘X’ and potentially interspersed with dashes or dots), strip off the leading zeros, and return it. Good to extract an ISBN from a string like “(alt) 123-45-678X electronic only”.
  • lccnnormalizer: Custom code to normalize an LCCN as per this page at the LoC.

    <types>
      <!-- lowercases the entire field value, keeping it as a single token. -->
      <fieldType name="lowercase" class="solr.TextField" positionIncrementGap="100">
        <analyzer>
          <tokenizer class="solr.KeywordTokenizerFactory"/>
          <filter class="solr.LowerCaseFilterFactory" />
        </analyzer>
      </fieldType>

      <!-- Full string, stripped of \W and lowercased -->
      <fieldType name="genericID" class="solr.TextField" sortMissingLast="true" omitNorms="true">
        <analyzer>
          <tokenizer class="solr.KeywordTokenizerFactory"/>
          <filter class="solr.LowerCaseFilterFactory"/>
          <filter class="solr.TrimFilterFactory"/>
          <filter class="solr.PatternReplaceFilterFactory"
               pattern="[^\p{L}\p{N}]" replacement="" replace="all"
          />
        </analyzer>
      </fieldType>

      <!-- standard number normalizer: extract sequence of digits, strip leading zeroes -->
      <fieldType name="numeric" class="solr.TextField" sortMissingLast="true" omitNorms="true">
        <analyzer>
          <tokenizer class="solr.KeywordTokenizerFactory"/>
          <filter class="solr.LowerCaseFilterFactory"/>
          <filter class="solr.TrimFilterFactory"/>
          <filter class="solr.PatternReplaceFilterFactory"
               pattern="[^0-9]*([0-9]+)[^0-9]*" replacement="$1"
          />
          <filter class="solr.PatternReplaceFilterFactory"
               pattern="^0*(.*)" replacement="$1"
          />
        </analyzer>
      </fieldType>

      <!-- Simple type to normalize isbn/issn. Just get first string of digits followed by an optional 'x' -->
      <fieldType name="stdnum" class="solr.TextField" sortMissingLast="true" omitNorms="true">
        <analyzer>
          <tokenizer class="solr.KeywordTokenizerFactory"/>
          <filter class="solr.LowerCaseFilterFactory"/>
          <filter class="solr.TrimFilterFactory"/>
          <filter class="solr.PatternReplaceFilterFactory"
               pattern="^[\s0\-\.]*([\d\.\-]+x?).*$" replacement="$1"
          />
          <filter class="solr.PatternReplaceFilterFactory"
               pattern="[\-\.]" replacement="" replace="all"
          />
        </analyzer>
      </fieldType>

      <!-- LCCN normalization on both index and query -->
      <fieldType name="lccnnormalizer" class="solr.TextField" omitNorms="true">
        <analyzer>
          <tokenizer class="solr.KeywordTokenizerFactory"/>
          <filter class="solr.LowerCaseFilterFactory"/>
          <filter class="solr.TrimFilterFactory"/>
          <filter class="edu.umich.lib.solr.analysis.LCCNNormalizerFilterFactory"/>
        </analyzer>
      </fieldType>

      <!-- since fields of this type are by default not stored or indexed,
           any data added to them will be ignored outright. -->
      <fieldtype name="ignored" stored="false" indexed="false" multiValued="true" class="solr.StrField" />

    </types>
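
If you want to poke at these patterns without reindexing anything, here is a rough Python sketch that mirrors three of the types. It is an approximation, not what Solr actually runs: it assumes Python's `re.sub` (which replaces every match) behaves like the filter's default replace-all mode, and it uses `str.isalnum` as a stand-in for `[\p{L}\p{N}]`. The function names are invented.

```python
import re


def generic_id(value: str) -> str:
    """Approximate 'genericID': lowercase, trim, keep only letters and digits."""
    token = value.lower().strip()
    return "".join(ch for ch in token if ch.isalnum())


def numeric(value: str) -> str:
    """Approximate 'numeric': keep the digits, then drop leading zeros."""
    token = value.lower().strip()
    token = re.sub(r"[^0-9]*([0-9]+)[^0-9]*", r"\1", token)
    return re.sub(r"^0*(.*)", r"\1", token)


def stdnum(value: str) -> str:
    """Approximate 'stdnum': leading number (plus optional x), minus dashes/dots."""
    token = value.lower().strip()
    token = re.sub(r"^[\s0\-\.]*([\d\.\-]+x?).*$", r"\1", token)
    return re.sub(r"[\-\.]", "", token)


print(numeric("(OCoLC)00012345"))
print(stdnum("0123-45-678X electronic only"))
print(stdnum("(alt) 123-45-678X electronic only"))
```

One thing this makes visible: because the first stdnum pattern is anchored with `^`, it only fires when the number sits at the start of the value (after any whitespace, zeros, dashes, or dots). In this Python approximation a value with a prefix such as "(alt)" passes through the first filter untouched and only loses its dashes, so it is worth checking how your own data behaves in Solr's analysis page before relying on it.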

Add field definitions to schema.xml

This is pretty straightforward: just set it up.

    <field name="htid"         type="genericID"      indexed="true"  stored="true"  multiValued="true"/>
    <field name="bibnum"       type="genericID"      indexed="true"  stored="true"/>

    <field name="access"       type="lowercase"      indexed="true"  stored="true"/>

    <field name="rights"       type="lowercase"      indexed="true"  stored="true"/>

    <field name="source"       type="lowercase"      indexed="true"  stored="true"/>
    <field name="sourceid"     type="genericID"      indexed="true"  stored="true"/>

    <field name="lccn"         type="lccnnormalizer" indexed="true"  stored="true"  multiValued="true"/>
    <field name="oclc"         type="numeric"        indexed="true"  stored="true"  multiValued="true"/>
    <field name="isbn"         type="stdnum"         indexed="true"  stored="true"  multiValued="true"/>
    <field name="issn"         type="stdnum"         indexed="true"  stored="true"  multiValued="true"/>

    <field name="title"        type="text"           indexed="true"  stored="true"/>
    <field name="imprint"      type="text"           indexed="true"  stored="true"/>
    <field name="enumcron"     type="text"           indexed="true"  stored="true"/>

    <!-- Ignore the multivalued, comma-delimited source strings -->

    <field name="rawLine"  type="ignored" indexed="false" stored="false"/>
    <field name="issns"    type="ignored" indexed="false" stored="false"/>
    <field name="isbns"    type="ignored" indexed="false" stored="false"/>
    <field name="oclcs"    type="ignored" indexed="false" stored="false"/>
    <field name="lccns"    type="ignored" indexed="false" stored="false"/>

hathi-data-config.xml — define how DIH is going to work.

This, of course, is the meat of the heart of the center of the matter.

I’m going to make use of four DIH technologies:

  • FileDataSource: In DIH, you declare a data source from which you’ll be sucking the raw data for manipulation and massaging. I’m just using a file, so this is for me. You can, as you might expect, pull in from a URL or (as mentioned) a database via JDBC.
  • FileListEntityProcessor: Given a directory and a set of criteria for a file, this will return a list of filenames that match those criteria. The criteria we’ll be using are (a) a regexp the filename must match, and (b) a creation date after the last time we ran the process.
  • LineEntityProcessor: Once you’ve got a data source, you need to stream it in somehow. There are Processors for XML and other formats, but this one just pulls in lines one at a time. The documentation all talks about LineEntityProcessor basically only being useful for pulling in, say, a list of filenames, but since my data is all line-by-line, this is what I’m using as my primary record-fetcher. It populates a single field called rawLine for later processing.
  • RegexTransformer: Allows you to take a field pulled from the datasource (or already derived from previous processing) and do regexp substitutions, group extraction, or splitting.

SO…I’m going to:

  1. Set up a FileDataSource to read from files
  2. Use FileListEntityProcessor to get a list of files that match my criteria
  3. Run each through LineEntityProcessor to generate a bunch of rawLines.
  4. Use the RegexTransformer multiple times to extract the data from the line.

[If you never went to look at it, this might be a good time to check out the description of the tab-delimited metadata files.]

    <dataConfig>
      <dataSource name="fds" encoding="UTF-8" type="FileDataSource" />
      <document>
        <!-- Get a list of files from the last time the handler ran -->
        <entity name="hathifile"
                processor="FileListEntityProcessor"
                newerThan="${dataimporter.last_index_time}"
                fileName="^hathi_upd_.*\.txt$"
                rootEntity="false"
                baseDir="/Users/dueberb/Documents/devel/hathi"
        >

          <entity name="hathiline"
                  processor="LineEntityProcessor"
                  url="${hathifile.fileAbsolutePath}"
                  rootEntity="true"
                  dataSource="fds"
                  transformer="RegexTransformer"
          >

            <!-- Big ugly regexp to get all the tab-delimited fields -->
            <field column="rawLine"
                   regex="^(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)$"
                   groupNames="htid,access,rights,bibnum,enumcron,source,sourceid,oclcs,isbns,issns,lccns,title,imprint"
            />

            <!-- Split the multi-values on comma -->

            <field column="oclc" splitBy="," sourceColName="oclcs" />
            <field column="issn" splitBy="," sourceColName="issns" />
            <field column="isbn" splitBy="," sourceColName="isbns" />
            <field column="lccn" splitBy="," sourceColName="lccns" />
          </entity> <!-- end of hathiline -->

        </entity> <!-- end of hathifile -->
      </document>
    </dataConfig>
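
To make the two transformer steps concrete, here is a small Python sketch of what happens to one line: carve it into the thirteen named columns, then split the comma-delimited columns into lists. It is only an illustration of the idea, not how DIH does it internally. It splits on tabs rather than running the big regex, it rejects lines with the wrong number of columns, and it turns an empty multi-value column into an empty list; all three are choices made for the sketch.

```python
FIELD_NAMES = [
    "htid", "access", "rights", "bibnum", "enumcron", "source", "sourceid",
    "oclcs", "isbns", "issns", "lccns", "title", "imprint",
]

# source column -> multi-valued field it feeds (same pairs as the splitBy lines)
MULTI_VALUED = {"oclcs": "oclc", "isbns": "isbn", "issns": "issn", "lccns": "lccn"}


def parse_line(raw_line: str) -> dict:
    """Turn one tab-delimited line into a dict of field name -> value(s)."""
    values = raw_line.rstrip("\n").split("\t")
    if len(values) != len(FIELD_NAMES):
        raise ValueError(f"expected {len(FIELD_NAMES)} columns, got {len(values)}")
    record = dict(zip(FIELD_NAMES, values))
    for source, target in MULTI_VALUED.items():
        raw = record.pop(source)
        record[target] = raw.split(",") if raw else []
    return record
```

Feeding it a made-up line with "12345,67890" in the eighth column gives back `record["oclc"] == ["12345", "67890"]`, which is what the multiValued oclc field then receives.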

And…it doesn’t work.

It almost works. The problem is that my attempt to use the variable ${dataimporter.last_index_time} is busted. There’s a ticket to fix it and a patch already provided, so it’s only a matter of time before it’s not an issue.

For the moment, though, we’ll change that line to:

    <entity name="hathifile"
            processor="FileListEntityProcessor"
            newerThan="'NOW/DAY'"
            fileName="^hathi_upd_.*\.txt$"
            rootEntity="false"
            baseDir="/Users/dueberb/Documents/devel/hathi"
    >

That says to basically take everything created since midnight and use it. If you have cron scripts set up to run this every day, you’ll have no problems.
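
If you want to see which files that rule would pick up before letting DIH loose, a rough Python equivalent looks like the following. It is a sketch under assumptions: it compares against each file's modification time in local time, which may not be exactly what DIH consults, and `files_since_midnight` is a made-up name.

```python
import re
from datetime import datetime, time
from pathlib import Path

NIGHTLY = re.compile(r"^hathi_upd_.*\.txt$")


def files_since_midnight(base_dir, now=None):
    """List nightly files in base_dir modified since the most recent midnight."""
    now = now or datetime.now()
    midnight = datetime.combine(now.date(), time.min).timestamp()
    return sorted(
        path for path in Path(base_dir).iterdir()
        if path.is_file()
        and NIGHTLY.search(path.name)
        and path.stat().st_mtime >= midnight
    )
```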

Dealing with a full extract

You’ll only have to do this once, of course, but it has to be done. Basically, reproduce the DIH handler with a different name, pulling in the data from a full extract (you could, e.g., just change the filename parameter to accept /^hathi_full_.*\.txt$/). Maybe call it hathifullimport instead of hathiimport.

Fire her up!

Once you’re ready to go, just hit the right URL:

http://solrmachine:port/solr/hathifullimport?command=full-import&clean=true

http://solrmachine:port/solr/hathiimport?command=full-import&clean=false

The first one will get the initial big, full file; the second will pull in all the nightlies you’ve downloaded, gunzipped, and put in the right place (provided, of course, they’re dated after the last midnight, or they’ve fixed DIH to allow the last_index_time syntax).
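
For a cron job, the same two URLs can be built in a few lines of Python. This is a sketch: the host name and port in the usage line are placeholders, and `import_url` is an invented helper.

```python
from urllib.parse import urlencode


def import_url(host: str, port: int, handler: str, clean: bool) -> str:
    """Build the DIH full-import URL for the given handler."""
    query = urlencode({"command": "full-import", "clean": str(clean).lower()})
    return f"http://{host}:{port}/solr/{handler}?{query}"


# To actually kick off the import from a script (needs a running Solr):
#   import urllib.request
#   urllib.request.urlopen(import_url("solrmachine", 8983, "hathiimport", False))
```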

Next steps?

Beer or wine. Take your pick.

After that, though, it’d be a matter of actually writing the download scripts and setting up cron jobs. And, of course, putting a front-end on it if you want, or massaging the data as they come out to return a nice JSON format for your consumers. That sort of thing.

So, wait…is this really worth doing?

Maybe. Probably not. It was worth it to me to start thinking about DIH and how I can use it. And it might be worth it to you, if you want to play around with these data in the ways that solr makes easy.

But, like so many things, it’s less worth doing than it was worth writing up. I learned a lot.

Dead-easy (but extreme) AJAX logging in our VuFind install

One of the advantages of having complete control over the OPAC is that I can change things pretty easily. The downside of that is that we need to know what to change.

Many of you that work in libraries may have noticed that data are not necessarily the primary tool in decision-making. Or, say, even a part of the process. Or even thought about hard. Or even considered.

For many decisions I see going on in the library world, the primary motivator is the anecdote. In fact, to be honest, the primary driver is the faculty anecdote. Those cliched three curmudgeonly old faculty members invariably have huge influence over systems and interfaces that will be used by 40K undergraduates. The tiny percentage of weirdos that actually talk to reference librarians end up wielding enormous power compared to the untold masses that don’t.

Enter the dragon…er…log.

So…I’m logging everything. EVERYTHING. Everything I can think of, anyway, and that doesn’t slow things down too far.

I’ve got a simple database table set up with the following columns:

  • incrementing integer (solely for innodb’s efficiency needs)
  • sessionid
  • action
  • data1, data2, and data3 (all of these are action-dependent)
  • logweekday, logdate, logtime (instead of a single timestamp for easy and efficient queries)

And that’s it. I’ve had it running (initially with only a few actions) for two weeks and have on the order of 300,000 rows in it at this point. Obviously, at some point I’ll have a better idea of which data I actually care about and things will get slimmed down a little bit. But for now, it’s fun having that all around.
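
To give a feel for the shape of that table, here is a sketch using Python's built-in sqlite3 as a stand-in for the real InnoDB table. The table name, column types, and helper names are guesses made for illustration; only the column names come from the list above.

```python
import sqlite3
from datetime import datetime

SCHEMA = """
CREATE TABLE activity_log (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    sessionid  TEXT NOT NULL,
    action     TEXT NOT NULL,
    data1      TEXT,
    data2      TEXT,
    data3      TEXT,
    logweekday INTEGER,
    logdate    TEXT,
    logtime    TEXT
)
"""


def make_log_db(path=":memory:"):
    """Create the log table in a fresh SQLite database."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def log_event(conn, sessionid, action, data1="", data2="", data3="", now=None):
    """Insert one event, splitting the timestamp into weekday, date, and time."""
    now = now or datetime.now()
    conn.execute(
        "INSERT INTO activity_log "
        "(sessionid, action, data1, data2, data3, logweekday, logdate, logtime) "
        "VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        (sessionid, action, data1, data2, data3, now.isoweekday(),
         now.date().isoformat(), now.time().isoformat(timespec="seconds")),
    )


def events_per_action(conn):
    """Return (action, count) pairs, most frequent first."""
    return conn.execute(
        "SELECT action, COUNT(*) FROM activity_log "
        "GROUP BY action ORDER BY COUNT(*) DESC, action"
    ).fetchall()
```

Keeping the date and the time in separate columns means a question like "everything from last Tuesday" is a plain equality test on logdate.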

Common log events include an action, and usually at least one other piece of data. Stuff like:

  • start-a-new-session with IP Address
  • simple-search with search-index, searchstring
  • choose-a-facet with facet-index, facet-value, position-on-list
  • view-a-full-record with recordID, search-result-number
  • click on an electronic resource link to proquest/google/hathitrust/whatever

…etc. I track adding and removing things from the selected items set and the user’s favorites, exporting to email or refworks or whatnot, logging in and out, clicking on the author’s name or a LCSH subject in the full record view, picking a “similar item” from the eponymous list, clicking on the spelling suggestion and the prev/next buttons, etc. Currently I’m logging 78 events.

[Note: by "search result number" I mean the enumeration of that record in that specific search set. So, the top result is #1. The first result on page two is #21]
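
In other words, with 20 results per page the number is just an offset calculation. A tiny sketch (the function name is made up):

```python
def search_result_number(page: int, position_on_page: int, page_size: int = 20) -> int:
    """Position of a record within the whole result set, counting from 1."""
    return (page - 1) * page_size + position_on_page
```

So `search_result_number(2, 1)` is 21, matching the example in the note.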

What do I think I’m gonna learn?

I’m not exactly sure. Of course, I can get all the basics — how much traffic and what people are searching for — but there’s the possibility of other stuff. Things like:

  • Do people actually use the [prev/next, facets, facets-below-the-fold, items past the third page, etc]?
  • Seriously, is anyone using the boolean searches and wildcards on the basic search page? Are any of them using IP addresses from outside the staff subnet? If not, can I please please please start using DisMax???
  • What facets are most popular? Do people hit the little “more” button to expand the list of facet values from 6 to 30?
  • What’s the average search result number of a record chosen for a full-record view for each search index (perhaps an indicator of how well the relevancy-ranking is working)?
  • Looking at all the full-record displays, what are the patterns for those records (e.g., break down by callnumber prefix, or by our “Academic Discipline” subject)?

I’ve got a lot to learn about stats, and user tracking, and clickpath analysis, but dammit I’ll have data and I’m not afraid to use it!

[Er...them. Not "it". Data are a "them." Always feels weird to me to refer to data in the plural, but I'm forcing myself to do so these days.]

What’s the server implementation?

I already mentioned the database. I’ve got a little module called ActivityLog that does three and a half things:

  1. Get the session id from the session
  2. Get logging information from the GET/POST or passed in as parameters
  3. Modify the parameters if need be (e.g., pull domain name out of an external URL). This is the half a thing.
  4. Stuff it into the database with appropriate timestamps.

And that’s it.
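
The real module lives in the VuFind PHP code, but the steps are simple enough to sketch in Python. Everything named here is hypothetical: the function names, the "outlink" action in the usage note, and the rule that only data1 is checked for an external URL.

```python
from datetime import datetime
from urllib.parse import urlparse


def external_domain(url: str) -> str:
    """Reduce an external URL to its host name (the 'half a thing')."""
    return urlparse(url).netloc.lower()


def build_log_row(session_id, action, data1="", data2="", data3="", now=None):
    """Assemble one row for the log table from the request parameters."""
    now = now or datetime.now()
    if data1.startswith(("http://", "https://")):
        data1 = external_domain(data1)
    return {
        "sessionid": session_id,
        "action": action,
        "data1": data1,
        "data2": data2,
        "data3": data3,
        "logweekday": now.isoweekday(),
        "logdate": now.date().isoformat(),
        "logtime": now.time().isoformat(timespec="seconds"),
    }
```

Calling `build_log_row("abc", "outlink", "http://books.google.com/books?id=xyz")` stores just "books.google.com" in data1, which is all you need to count clicks per vendor.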

What’s the client need to do?

I start off with the following rules:

  1. I want to be able to log damn near everything
  2. I can’t degrade the user experience in a meaningful way just for logging
  3. I want to log outgoing links, too.
  4. I must must must have pretty, bookmarkable URLs.

Truth be told, some of the “client” stuff can be (and is) done on the server. When someone is, say, sending a record or set of records to RefWorks, the server knows everything it needs to know and I can just take care of logging as part of the regular request fulfillment.

But some stuff (like the search result number, say) is best taken care of from the browser. Easy enough, for the most part, esp. with form submissions and such.

The potentially-non-obvious part comes in with rule #4: I want pretty URLs. That means that the full display of record 123456789 is always going to be at /Record/123456789 no matter what the user clicked on. Ditto with adding/removing facets and such: the URL contains the resulting search, not the resulting search plus which facet was removed or added.

But — see #1 — I want to log damn near everything.

My solution (and I know lots of people are doing this; this isn’t rocket science) is to fire off an AJAX post for the click events that I’m interested in, sending log data off to my server and then not waiting for a return. Just send the data and follow the link as if nothing had intervened. It degrades gracefully (although the rest of my VuFind doesn’t, so that doesn’t matter much) and it’s dead-easy to implement.

The actual javascript implementation

I long ago switched our VuFind stuff over to use jQuery, just because I like it and know it.

First thing is to use the templates to modify the links to have a particular class (logit) and a well-structured ref (pipe-delimited values).

So, a link from the title of a work on the search-results page to the individual record will look like this:

<a ref="srrecview|{$record.id}||{$recordCounter}" href="/Record/{$record.id}" class="title logit">{$title}</a>

The ref attribute tells us that we’re going to log the type of event (record view from the search results), the ID of the record, a null in the data2 column, and the search result number.
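
The pipe-delimited convention is easy to read back on any side of the wire. As an illustration only (in Python rather than the site's own code, with an invented function name), splitting a ref value into the four posted parameters looks like this:

```python
def parse_ref(ref: str) -> dict:
    """Split 'action|data1|data2|data3' into the four logging parameters."""
    parts = ref.split("|")
    parts += [""] * (4 - len(parts))  # missing trailing values become empty
    return {"lc": parts[0], "lv1": parts[1], "lv2": parts[2], "lv3": parts[3]}
```

For the link above with record 123456789 at position 21, `parse_ref("srrecview|123456789||21")` gives lc "srrecview", lv1 "123456789", an empty lv2, and lv3 "21".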

Then there’s javascript to make all the magic happen:

    function logit(a, args) {
      a = jQuery(a);

      // Allow the caller to pass in args, or get them from the ref attribute
      if (!args) {
        args = a.attr('ref').split('|');
      }

      jQuery.post(
        url_to_the_logging_method,
        {
          'lc' : args[0],
          'lv1': args[1] || '',
          'lv2': args[2] || '',
          'lv3': args[3] || ''
        }
      );
    }

    jQuery(document).ready(function() {
      jQuery('a.logit').live('click', function(e) {
        logit(this);
      });
    });

The logit function just does a brain-dead post of the data in the ref attribute. We then bind that function to all anchors with the appropriate class, and we’re done.

[Note the use of the jQuery live event -- this makes sure the event will be bound to stuff that comes in via AJAX after page load. Our links to Google Books, for example, come in like this.]

Since I’m not returning false from the logit function, the default action (actually follow the link) will fire — without even waiting for the AJAX call to come back. Delay to the user is, hopefully, unnoticeable.

Final words

This isn’t all that smart. I should be doing more data-integrity stuff than I am, and of course someone could spoof my numbers if they wanted. But someone could spoof my stats just by hitting my normal catalog pages programmatically, too, so there’s no more risk involved, and I do log IPs.

And, of course, I get my pretty URLs, and most users (i.e., those not running firebug) will never notice anything.

I don’t know that this would work for everyone, but so far it’s working pretty well for us. I’ll let you know if that continues in a post in a few weeks.

The sad truths about journal bundle prices

[Notes taken during a talk today, Ted Bergstrom: "Some Economics of Saying Nix To Big Deals and the Terrible Fix". My own thoughts are interspersed throughout; please don't automatically ascribe everything to Dr. Bergstrom.

Check out his stuff at Ted Bergstrom's home page.]

Journals are a weird market — libraries buy as agents of professors, using someone else’s money, in deals of enormous complexity and uncertain value from companies that basically have a monopoly.

Similar to a few other situations: doctors prescribe drugs for patients using insurance money. Professors assign textbooks to students whose parents (in general) buy them. In all these cases, the supplier is (or is nearly) a monopoly operation.

Median price per article of for-profit journals is 3-4 times the median prices for non-profit journals. When you look at price per citation it gets even worse (because the “best” — or at least most cited — journals tend to be non-profit).

Marginal cost of supplying print journals is about a penny a page. The marginal cost of supplying electronic access is nearly zero, of course, and shelf space and multiple copies become a thing of the past.

SO…enter the Big Deal.

Academic Press and then Elsevier figured out how to price-discriminate: calculate each library’s current expenditure on paper journals, multiply by 1+x (for x about 0.15) and provide access to all their journals electronically, plus whatever paper you used to buy. Elsevier had a 5-year contract, during which they promised not to increase the price more than 7% per year.

This is a great deal for Elsevier, because they know what a library is already paying and the marginal cost of providing electronic access is essentially zero to them. Huge success — lots of libraries jumped on board, and then so did the other publishers.

Bundling deters entry into the market

Libraries who bought the first Big Deal had their payments increase 7%/year (note that about half of UC’s serials budget goes to Elsevier), but their own budget increases about 3.5%/year. So, they’re in constant cancellation mode, but you can’t cancel only a portion of the electronic access, so it’s exempt. Small-time journal publishers are the only thing left to cut.
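
A little compound-growth arithmetic shows why that squeeze gets worse every year. The sketch below assumes the bundle price rises the full 7% each year, the budget rises 3.5%, and the bundle starts at half the serials budget (the UC figure); those are simplifications for illustration, not numbers from any actual contract.

```python
def bundle_share(years: int, start_share: float = 0.5,
                 bundle_growth: float = 0.07, budget_growth: float = 0.035) -> float:
    """Fraction of the serials budget eaten by the bundle after some years."""
    return start_share * ((1 + bundle_growth) / (1 + budget_growth)) ** years


for n in (0, 5, 10):
    print(n, round(bundle_share(n), 3))
```

Under those assumptions the bundle goes from half the budget to roughly 59% after five years and about 70% after ten, with everything else competing for what is left.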

We have to learn to say “no”

Plus…it’s an incredibly popular product. Faculty love online access, as do students. So negotiating a new contract is tough to do, because libraries (as always) are unwilling to walk away.

The theory of bargaining suggests that the library needs to know what will happen if the Big Deal bargain breaks down — what happens if we walk?

Problem: we (the libraries) don’t know what the deal is worth to us OR what it’s worth to the publisher. Valuation of the big bundles is a ridiculously complex problem.

What happens if we cut it off?

  • Library owns access rights to back issues of journals previously subscribed to.
  • Pay-per-view access required only for recent volumes

So…we need to calculate the number of pay-per-view accesses, which will obviously increase as time goes on (and more stuff falls under this model) and would go down if we were to charge consumers (faculty and students) some percentage of the cost.

Big problem: we can’t make that calculation. We have no good way of knowing what percentage of article downloads are for current journals, and the publishers don’t release that data.

Even if we had the data, though, the likely outcome after a certain amount of time, as more stuff falls into the pay-per-view window, is a new Big Deal.

What about the Big Deals themselves?

Hard to know, because there are NDAs sprinkled around like snow in Minnesota. But it turns out that these deals are FOIA-able! Yea! Elsevier actually tried to sue Ted’s group in the state of Washington, but Paul Courant and others helped to win the day, and they haven’t been sued since.

Publishers want to give the impression that the renewals of the Big Deals are basically formulaic, but in real life there are significant differences from institution to institution.

What’s the “Economic Solution”

Let users pay for what they use. If users paid their own money (about $35/article at the for-profit institutions), users would modify their own behavior AND authors would stop submitting to the expensive-to-access journals because they want their stuff to be read.

How much money are they making?

Reported profits of Elsevier and Springer are about 30% of sales. That’s a huge margin in the regular world, but their costs are tiny. Where does it all go?

Basically, it goes to lobbyists, lawyers, and executive salaries.

The Optical Society of America (physics, not eyesight) is a non-profit organization that publishes journals at about 1/3 the cost per page of Elsevier, but makes 40% profit on sales. They, of course, plow it back into physics journals, being a non-profit and all.

What’s the economic model of a journal?

  • Publishers have fixed costs (editing, harassing referees, typesetting, technology, etc.). I (Bill) think of this as the “First reader” cost — the cost to get one reader to be able to read it.
  • The marginal cost of adding more users is essentially zero.
  • The “efficient” option is to either allow user access at zero cost, with various institutions subsidizing the fixed costs, or just don’t publish the journal.

What can a single library do?

Not much. Faculty will scream, and one library acting alone will have essentially no effect on anything.

An interim strategy

Drop Big Deals to overpriced journals. Maintain subscription and free access to journals priced near the average cost, and subsidize (at less than 100%) pay-per-view access to the overpriced journals.

How big are the differences in what people pay?

Just as an aside, almost, he tells us that while UMich and Illinois pay Elsevier about $2.25M for the “Freedom Collection”, Wisconsin pays about $1.2M for the exact same collection. Whoops!

He’s getting contracts via FOIA and analyzing the differences. I imagine there’ll be publications forthcoming.

More Ruby MARC Benchmarks: Adding in MARC-XML

It turns out that UVA’s reluctance to use the raw MARC data on the search results screen is driven more by processing time than parsing time. Even if they were to start with a fully-parsed MARC object, they’re doing enough screwing around with that data that the bottleneck on their end appears to be all the regex and string processing, not the parsing. Their specs for what gets displayed are complex enough that they want to do the work up-front.

But I remain interested, at least partially because of the reason UVA is using MARC-XML: they have MARC records too big for binary MARC format to handle. We do, too, and we’ve just been talking about what to do with them. So I’m thinking that

First, I spent some time dusting off my first attempt at ruby programming: modifying ruby-marc to use libxml if it’s available. It’s not super-well tested, but I’m pretty sure it works. And the speed increases are … well, see below.

Anyone who wants to mess with my attempt at libxml-enabled ruby-marc is welcome to do so. This is a very forgiving parser — it trusts that whatever ended up in the XML should, in fact, have been there. If you say ‘XXE’ is a control field, well, I’ll treat it as a control field.

But back to the data. A few points are obvious:

  • XML with REXML is dead-slow on both platforms (at least an order of magnitude slower)
  • XML with LibXML is competitive with binary MARC (within 20% or so)
  • Even with REXML, though, time to create MARC records out of the 50 input strings is less than a second, which might be ok depending on your application.

Full results

As with last time, the total numbers below show how long it took to process all 40 sets of 50 records. The unadorned numbers are the average time it took to process a set of 50 records.
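
So each "total" is simply 40 times the matching average (0.143550 × 40 = 5.742, for example). The benchmarks themselves were run in Ruby; purely to show the bookkeeping, here is a Python sketch of timing a list of batches and reporting both numbers. `time_batches` and the `work` callback are stand-ins for whatever parsing you want to measure.

```python
import time


def time_batches(batches, work):
    """Run work(batch) on each batch; return (average seconds, total seconds)."""
    per_batch = []
    for batch in batches:
        start = time.perf_counter()
        work(batch)
        per_batch.append(time.perf_counter() - start)
    total = sum(per_batch)
    return total / len(per_batch), total
```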

Call up solr with a null search, get 2000 records back in batches of 50 with wt=ruby, eval it, and stick it into arrays

jruby-Get/Eval data              0.143550
mri-Get/Eval data                0.106550

jruby-Get/Eval data (total)      5.742000
mri-Get/Eval data (total)        4.262017

Turn raw strings into MARC::Record objects from MARC-Binary strings, joining all the returned MARC together first

jruby-marc4j-multistring         0.026575
jruby-marc-multistring           0.037175
mri-marc-multistring             0.073396

jruby-marc4j-multistring (total) 1.063000
jruby-marc-multistring (total)   1.487000
mri-marc-multistring (total)     2.935842

Turn raw strings into MARC::Record objects from MARC-XML

mri-marc-LibXML                  0.091332
jruby-marc-REXML                 0.799500
mri-marc-REXML                   0.948549

mri-marc-LibXML (total)          3.653276
jruby-marc-REXML (total)        31.980000
mri-marc-REXML (total)          37.941975

Conclusions

I’m not sure exactly where this leaves me, other than knowing that marc-xml is probably a viable alternative if you can use libxml. Getting a version of that code which uses native Java XML libraries when run under jruby might be a useful exercise.

Benchmarking MARC record parsing in Ruby

[Note: since I started writing this, I found out Bess & Co. store MARC-XML. That makes a difference, since XML in Ruby can be really, really slow]

[UPDATE It turns out they don't use MARC-XML. They use MARC-Binary just like the rest of us. Oops.]

[UP-UPDATE Well, no, they do use MARC-XML. I'm not afraid to constantly change my story. This is why I'm the best investigative reporter in the business]

The other day on the blacklight mailing list, Bess Sadler wrote

Yes, we do still include the full marc record, but the rule of thumb we’re currently using is that anything that needs to display in the index view (the search results) needs to be broken out into a separate display field, because retrieving and parsing marc records for every item in a list of search results is too much of a performance hit.

This surprised me a fair bit, because in our implementation of VuFind (which uses PHP, versus Ruby for Blacklight) I do just that — grab the MARC out of Solr, parse it, and pull stuff like full titles and such out of it.

As it turns out, I’d been screwing around with calling marc4j from jruby, anyway, so I threw that into the mix, and here’s what I found.

What the benchmark tries to measure

The focus is on measuring time to parse MARC records as returned in a field from Solr in MARC-binary.

I got 40 sets of 50 records each (2000 records) from our Solr instance in ruby format and extracted the binary MARC strings. This resulted in an array of 40 sets of 50 strings, each of which is a valid MARC record.

Fifty records seems largish to me (we only display 20 at a time), but I thought I’d swing for the fences.

I’m testing along three(ish) dimensions:

  • jruby vs mri
  • marc4j vs ruby-marc (only on jruby, obviously)
  • parsing each string individually, or globbing them all together and treating it as if it’s a multi-record file

[Note that MRI is using Net::HTTP to get the data; I presume Curl would be faster still. It's already faster than jruby]

The following data show the average time to parse out each set of 50 records and extract the first 245 (title) field from each one, along with the totals for doing all 2000 records.

Method                           User       Total      Real      

jruby Get/Eval data              0.134750   0.134750 (  0.134850)
jruby Get/Eval data (2000)       5.390000   5.390000 (  5.394000)

MRI Get/Eval data                0.008500   0.012750 (  0.115942)
MRI Get/Eval data (2000)         0.340000   0.510000 (  4.637677)    

jruby-marc4j-oneAtATime          0.056075   0.056075  (0.056125)
jruby-marc4j-multistring         0.027925   0.027925  (0.028000)

jruby-marc-oneAtATime            0.066625   0.066625  (0.066650)
jruby-marc-multistring           0.034300   0.034300  (0.034325)

mri-marc-oneAtATime              0.084500   0.085250  (0.086597)
mri-marc-multistring             0.085000   0.085750  (0.086026)

jruby-marc4j-oneAtATime (2000)   2.243000   2.243000  (2.244999)
jruby-marc-oneAtATime (2000)     2.665001   2.665001  (2.666000)
mri-marc-oneAtATime (2000)       3.380000   3.410000  (3.463888)


jruby-marc4j-multistring (2000)  1.117001   1.117001  (1.120001)
jruby-marc-multistring (2000)    1.371999   1.371999  (1.372999)
mri-marc-multistring (2000)      3.400000   3.430000  (3.441052)

So…the worst-case scenario is taking an average 0.085 second to get the first title field out of each one of 50 binary MARC records once we’ve got them.

Now, I’m sure all my records came out of the cache, so my query time wasn’t very long. But we still end up with a maximum of roughly 0.2 seconds plus the time to actually do the query to end up with a set of 50 marc records.

We can see from looking at the totals that it looks like MRI’s bottleneck is the actual parsing, whereas constructing the input streams is expensive under jruby (at least the way I’m doing it), resulting in a benefit of concatenating them all together into one longish string before parsing.

Marc4j is faster (20%ish), but not enough faster to be worth the effort in my mind. Keep in mind that I have no idea how fast Marc4j is when running under pure java, without all the jruby overhead.

Bottom line, though: that seems fast enough to me.

I’ll try to benchmark with XML later on today or tomorrow.

Building a solr text filter for normalizing data

[Kind of part of a continuing series on our VUFind implementation; more of a sidebar, really.]

In my last post I made the case that you should put as much data normalization into Solr as possible. The built-in text filters will get you a long, long way, but sometimes you want to have specialized code, and then you need to build your own filter.

Huge Disclaimer: I’m putting this up not because I’m the best person to do so, but because it doesn’t look as if anyone else has. I don’t know what I’m doing. I don’t know why the code I’m showing below is the way it is, and if anyone would like to make it better, that’d be great. This is basically just a lot of pattern-matching on my part.

[A second disclaimer: I haven't actually built this into Solr yet, although I've done some simple testing on the ISBN-13 checksum code. I'll remove this disclaimer when I get a chance to actually index some data with it.]

The Setup: An ISBN-10 to ISBN-13 converter

Last time, I said I didn’t know why I hadn’t put together an ISBN longifier yet. So let’s walk through it.

This is a lot easier than most things in that I’m assuming we’re going to be getting exactly one token to work with (via the KeywordTokenizer) and can just work on it with impunity.

If you’d like to follow along, get the solr source via svn on a machine with java and ant. And junit, I think.

Where to put stuff

Of all the black magic associated with doing this, figuring out how to actually make it build is the part that’s probably easiest for Java-heads and the most confusing to the rest of us. Anyone attempting this sort of thing should probably get a good grounding in how Solr is set up and how its build system works before doing anything else.

Me? I cheated.

I basically just copied the directory structure of another project in the config directory in the solr root (looks like maybe it was velocity), did some tiny modifications to the build.xml file to change the name of the project, renamed the ‘.pom’ file and edited it in the obvious ways, and followed the copied directory structure to figure out where to put my files.

And then it worked. And I didn’t ask any questions, and metaphorically just backed away slowly with a nonchalant look on my face. Of course, if you know what you’re doing with java and ant, I’m sure there are better ways.

For the record, the directory in solr/config/umichnormalizers (where I put this stuff) would look something like this by the end of this project:

./target/
./build.xml

./src/main/java/edu/umich/lib/normalizers/ISBNLongifier.java
./src/test/java/edu/umich/lib/normalizers/ISBNLongifier.java

./src/main/java/edu/umich/lib/solr/analysis/ISBNLongifierFilter.java
./src/main/java/edu/umich/lib/solr/analysis/ISBNLongifierFilterFactory.java

You then just run ant in your config directory to generate a .jar file that can be put in solrmarc’s lib directory or (I think) jetty’s lib directory. You can also just run ant dist at the solr root level to get a .war file with your stuff embedded.

The converter

First, you just need some basic code to actually do the conversion. I’m sure this is hideously inefficient, but probably not as inefficient as the actual filter I’ll be producing in a minute.

We take in a string. If it looks like it might have a 10-digit ISBN in it (possibly with dashes or periods as delimiters), extract it, do the conversion to an ISBN-13, and return that as a 13-character string (i.e., no dashes or whatnot).

Note that I’m not working hard to determine if it’s an ISBN; this isn’t designed to try to pull an ISBN from random text. The hope is that by the time you get this far you’ve already got a pretty good idea that you’ve got an ISBN on your hands. I’m also not checking to see if the incoming ISBN is valid in any way; that’s left as an exercise for the diligent reader.

  package edu.umich.lib.normalizers;

  import java.util.regex.*;

  public class ISBNLongifier {

    // Dashes and dots are acceptable delimiters. Should we add spaces??
    private static final String ISBNDelimiterPattern = "[\\-\\.]";

    // Look for a string of nine digits followed by another digit or an X
    private static final Pattern ISBNPattern = Pattern.compile("^.*?(\\d{9})[\\dXx].*$");

    // True if the string (minus delimiters) contains something shaped like an ISBN-10
    public static boolean matches(String isbn) {
      isbn = isbn.replaceAll(ISBNDelimiterPattern, "");
      return ISBNPattern.matcher(isbn).matches();
    }

    public static String longify(String isbn) {
      isbn = isbn.replaceAll(ISBNDelimiterPattern, "");
      Matcher m = ISBNPattern.matcher(isbn);
      if (!m.matches()) {
        throw new IllegalArgumentException(isbn + ": Not an ISBN");
      }

      // Prefix the first nine digits with 978; the old check digit is dropped
      String longisbn = "978" + m.group(1);

      // ISBN-13 weights alternate 1, 3, 1, 3, ... across the first twelve digits
      int sum = 0;
      for (int i = 0; i < 12; i++) {
        int digit = Character.digit(longisbn.charAt(i), 10);
        sum += digit + (2 * digit * (i % 2));
      }

      // The check digit brings the sum up to a multiple of ten; a "10" becomes 0
      int check = (10 - (sum % 10)) % 10;
      return longisbn + check;
    }
  }
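If you want to be the diligent reader, here is a rough sketch of an ISBN-10 validity check, along with the ISBN-13 check digit worked out on its own so the arithmetic is easy to see. The class and method names (IsbnChecks, isValidIsbn10, isbn13CheckDigit) are made up for illustration and aren't part of the code above; unlike the longifier, the validity check assumes the string is nothing but the ISBN plus delimiters.

```java
/** Hypothetical helper, not part of the longifier above. */
final class IsbnChecks {

  /** True if the input is a valid ISBN-10: weights 10 down to 1, sum divisible by 11. */
  static boolean isValidIsbn10(String raw) {
    String isbn = raw.replaceAll("[\\-\\.]", "");
    if (!isbn.matches("\\d{9}[\\dXx]")) {
      return false;
    }
    int sum = 0;
    for (int i = 0; i < 10; i++) {
      char c = isbn.charAt(i);
      // An X is only allowed in the last position and counts as 10
      int value = (c == 'X' || c == 'x') ? 10 : Character.digit(c, 10);
      sum += value * (10 - i);
    }
    return sum % 11 == 0;
  }

  /** ISBN-13 check digit for a string of exactly twelve digits (weights 1, 3, 1, 3, ...). */
  static int isbn13CheckDigit(String first12) {
    int sum = 0;
    for (int i = 0; i < 12; i++) {
      int digit = Character.digit(first12.charAt(i), 10);
      sum += (i % 2 == 0) ? digit : 3 * digit;
    }
    return (10 - (sum % 10)) % 10;
  }

  public static void main(String[] args) {
    System.out.println(isValidIsbn10("0-306-40615-2"));     // true
    System.out.println(isbn13CheckDigit("978030640615"));   // 7
  }
}
```

For 0-306-40615-2 the weighted sum of the twelve digits 978030640615 is 93, so the check digit is 10 - 3 = 7 and the long form is 9780306406157.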

The Factory Object

Next is a boilerplate factory object. The only change will be the package you put it in, and the last method’s name and return value.

  package edu.umich.lib.solr.analysis;

  import java.util.Map;
  import org.apache.solr.analysis.BaseTokenFilterFactory;
  import org.apache.lucene.analysis.TokenStream;

  public class ISBNLongifierFilterFactory extends BaseTokenFilterFactory {
    // Whatever attributes were set on the <filter> element in schema.xml
    Map<String,String> args;

    public Map<String,String> getArgs()
    {
      return args;
    }
    public void init(Map<String,String> args)
    {
      this.args = args;
    }
    // The only part that changes from filter to filter
    public ISBNLongifierFilter create(TokenStream input)
    {
      return new ISBNLongifierFilter(input);
    }
  }

The actual filter

And, finally, the filter class. You’ll notice that I’m catching any illegal argument error and just returning the input unchanged. So anything that comes through that isn’t an ISBN just gets passed along.

  package edu.umich.lib.solr.analysis;

  import edu.umich.lib.normalizers.ISBNLongifier;
  import org.apache.lucene.analysis.Token;
  import org.apache.lucene.analysis.TokenFilter;
  import org.apache.lucene.analysis.TokenStream;
  import java.io.IOException;

  public final class ISBNLongifierFilter extends TokenFilter {

    public ISBNLongifierFilter(TokenStream in) {
      super(in);
    }

    public Token next() throws IOException {
      return normalize(this.input.next());
    }

    public Token next(Token result) throws IOException {
      // Hand the reusable token down the chain instead of ignoring it
      return normalize(this.input.next(result));
    }

    public Token normalize(Token t) {
      if (null == t || null == t.termBuffer() || t.termLength() == 0) {
        return t;
      }
      // The buffer can be longer than the term, so only read termLength() chars
      String val = new String(t.termBuffer(), 0, t.termLength());
      try {
        t.setTermBuffer(ISBNLongifier.longify(val));
      } catch (IllegalArgumentException e) {
        // Not an ISBN: pass it through unchanged
      }
      return t;
    }
  }
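One detail worth knowing when you pull text out of a token: Lucene's term buffer can be longer than the term itself, and termLength() tells you how much of it is real. Here's a small sketch of the same read-convert-or-pass-through step using a plain char array in place of a Token, so it runs without Lucene on the classpath. TermBufferSketch and its convert method are stand-ins I made up for illustration; convert is stricter than the longifier above (it assumes the term is exactly ten ISBN-ish characters).

```java
/** Hypothetical stand-in for the filter's normalize step; no Lucene required. */
final class TermBufferSketch {

  /** Stand-in converter (an assumption): accepts exactly nine digits plus a digit or X. */
  static String convert(String term) {
    if (!term.matches("\\d{9}[\\dXx]")) {
      throw new IllegalArgumentException(term + ": Not an ISBN");
    }
    String first12 = "978" + term.substring(0, 9);
    int sum = 0;
    for (int i = 0; i < 12; i++) {
      int digit = Character.digit(first12.charAt(i), 10);
      sum += (i % 2 == 0) ? digit : 3 * digit;
    }
    return first12 + ((10 - (sum % 10)) % 10);
  }

  /** Reads only the first `length` chars of the buffer; returns the term as-is on failure. */
  static String normalize(char[] buffer, int length) {
    String term = new String(buffer, 0, length);
    try {
      return convert(term);
    } catch (IllegalArgumentException e) {
      return term; // not an ISBN: pass it through unchanged
    }
  }

  public static void main(String[] args) {
    char[] buffer = new char[16];               // oversized, like a reused term buffer
    "0306406152".getChars(0, 10, buffer, 0);
    System.out.println(normalize(buffer, 10));  // 9780306406157

    char[] other = "not-an-isbn".toCharArray();
    System.out.println(normalize(other, other.length)); // not-an-isbn
  }
}
```

Reading the whole 16-character buffer instead of the first 10 characters would drag six NUL characters into the string, which is the sort of thing that works by accident until it doesn't.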

How to use it

Assuming you’ve managed to get it built into Solr and then deployed, just define it as a type in your schema.xml:

  <fieldType name="isbnlongifier" class="solr.TextField" omitNorms="true">
    <analyzer>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="edu.umich.lib.solr.analysis.ISBNLongifierFilterFactory"/>
    </analyzer>
  </fieldType>

  <!-- and later... -->

  <field name="isbn" type="isbnlongifier" indexed="true" stored="false" multiValued="true"/>

Conclusion

There it is. The rocket science is all hidden behind the import statements. My understanding is that converting the token value to and from Strings makes things horribly inefficient, but I’m pretty sure I’ve got bigger bottlenecks to tackle before worrying about this.