jruby_producer_consumer: dead-simple producer/consumer for JRuby

Yea! My first gem ever released!

[In working on a threaded JRuby-based MARC-to-Solr project, I realized that my threading stuff was...ugly. And I didn't really understand it. So I dug in today and wrote this.]

I’ve just pushed my first gem to Gemcutter: jruby_producer_consumer, a JRuby-only producer/consumer class that works with anything that provides #each.

It’s JRuby-only because it uses (a) a blocking queue implementation that’s native Java, and (b) threading, which isn’t a huge win under regular Ruby.

There’s no testing there because I’m not sure how to test threaded stuff :-(

It is, I hope, easy to use:

    require 'rubygems'
    require 'jruby_producer_consumer'

    # Create a ProducerConsumer. Arguments are anything that implements #each
    # and the size for the underlying queue. For the former, I'll just use a Range object.

    eachable = 1..10
    queuesize = 3

    pc = ProducerConsumer.new(eachable, queuesize)

    # Just a method to show what happens
    def sample(consumerid, x)
      puts "Consumer #{consumerid}: consuming #{x}"
      sleep 1 # otherwise this'll finish before I can create multiple consumers
    end

    # Create three consumers. You can pass any number of args to
    # #consumer, and must pass a block whose arguments are the
    # object returned by eachable#each and those args back.

    ['A', 'B', 'C'].each do |consumerid|
      pc.consumer(consumerid) do |x, consumerid|
        sample(consumerid, x)
      end
    end

    # OUTPUT
    # Consumer A: consuming 1
    # Consumer B: consuming 2
    # Consumer C: consuming 3
    # Consumer A: consuming 4
    # Consumer B: consuming 5
    # Consumer C: consuming 6
    # Consumer B: consuming 7
    # Consumer A: consuming 8
    # Consumer C: consuming 9
    # Consumer B: consuming 10
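
For readers without JRuby at hand, the same producer/consumer pattern can be sketched with the SizedQueue from Ruby's standard library. This is not the gem's implementation (the gem wraps a native Java blocking queue); the method name run_producer_consumer and the sentinel approach are assumptions made for illustration only.

```ruby
require 'thread'

# Hypothetical stand-in for the pattern, using only Ruby's standard library.
def run_producer_consumer(eachable, queue_size, consumer_count, &work)
  queue   = SizedQueue.new(queue_size) # push blocks while the queue is full
  done    = Object.new                 # sentinel that tells a consumer to stop
  results = Queue.new                  # thread-safe collector of [consumer, result]

  producer = Thread.new do
    eachable.each { |item| queue.push(item) }
    consumer_count.times { queue.push(done) } # one sentinel per consumer
  end

  consumers = (1..consumer_count).map do |id|
    Thread.new do
      # pop blocks while the queue is empty
      while (item = queue.pop) != done
        results.push([id, work.call(item)])
      end
    end
  end

  ([producer] + consumers).each(&:join)
  Array.new(results.size) { results.pop }
end

consumed = run_producer_consumer(1..10, 3, 3) { |x| x * 2 }
puts consumed.map(&:last).sort.inspect
```

Which consumer handles which item varies from run to run, so one way to test code like this is to assert on the sorted set of results rather than on their order.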

Still another look at MARC parsing in ruby and jruby

I’ve been looking at making a jruby-based solr indexer for MARC documents, and started off wanting to make sure I could determine if anything I did would be faster than our existing (solrmarc-based) setup.

Assertion: The upper bound on how fast I can process records and send them to Solr can be approximated by looking at how fast I can parse (and do nothing else to) marc records from a file.

Assertion: If I can’t write a system that’s faster than what we have now, it’s probably not worth my time even though being able to fall back to ruby instead of java would be nice.

The Big Question: Is the MARC parsing process fast enough that it seems I might be able to write a system that runs faster than the solrmarc setup I have now?

The Answer (see below): Yes, if I use marc4j.

On our ridiculously-awesome hardware, right now we’re doing about 300 records/second for short files and 250 records/second for a full (6.5 million record) index, giving us a 7-8 hour reindex.

I’ll just post the results without a lot of commentary. I warmed stuff up in all cases, and ran on my desktop (so I could compare to MRI ruby, which isn’t installed on the server) and on the server where we usually run these things.

  • The machines are my desktop OSX machine and the beefy linux server where we usually do this stuff
  • The platforms are jruby 1.4 --server and MRI ruby 1.8.7
  • The libraries are marc4j and ruby-marc 0.3.3
  • The parsers are
    • The standard binary parsers all around
    • A home-grown AlephSequential format reader for the ‘seq’ type. AlephSequential is a MARC representation that uses one line for each field. We use it because it doesn’t have length limitations and, not surprisingly, Aleph can spit it out pretty quickly compared to MARC-XML.
    • Whatever marc4j uses internally for MARC-XML
    • ruby-marc’s ‘jstax’ xml parser under jruby (which I wrote and apparently needs some love, see below)
    • ruby-marc’s ‘libxml’ xml parser under MRI ruby
  • Seconds is the average of two rounds, with measurements taken after a warmup run in each case.

The test files were 18,881 records in marc-xml, marc-binary, and AlephSequential formats.

MACHINE  PLATFORM  LIBRARY    PARSER    SECONDS  REC/SECOND
desktop  jruby     marc4j     binary       4.06        4650
desktop  jruby     marc4j     xml          5.55        3401
desktop  jruby     ruby-marc  binary      17.35        1088
desktop  jruby     ruby-marc  jstax       80.11         236

desktop  ruby      ruby-marc  binary      33.54         562
desktop  ruby      ruby-marc  libxml      46.87         402

server   jruby     marc4j     binary       2.29        8245
server   jruby     marc4j     xml          3.36        5619
server   jruby     marc4j     AlephSeq     3.68        5130
server   jruby     ruby-marc  binary       9.93        1901
server   jruby     ruby-marc  jstax       44.56         424

The quick takeaways, with all the obvious caveats:

  • jruby with ruby-marc is twice as fast at binary and twice as slow at xml compared with MRI
  • marc4j is four times as fast for binary and about an order of magnitude faster for xml compared with ruby-marc.
  • The server is fast.
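
The ratios behind those takeaways can be checked directly from the desktop timings. The snippet below is only a sketch of that arithmetic; the SECONDS table and the speedup helper are names made up here, and the numbers are copied from the table above.

```ruby
# Seconds per run on the desktop, copied from the results table.
SECONDS = {
  %w[jruby marc4j binary]    => 4.06,
  %w[jruby marc4j xml]       => 5.55,
  %w[jruby ruby-marc binary] => 17.35,
  %w[jruby ruby-marc jstax]  => 80.11,
  %w[ruby ruby-marc binary]  => 33.54,
  %w[ruby ruby-marc libxml]  => 46.87
}.freeze

# How many times longer the first run took than the second.
def speedup(slower, faster)
  (SECONDS.fetch(slower) / SECONDS.fetch(faster)).round(1)
end

puts speedup(%w[ruby ruby-marc binary], %w[jruby ruby-marc binary])  # 1.9: jruby vs MRI, binary
puts speedup(%w[jruby ruby-marc jstax], %w[ruby ruby-marc libxml])   # 1.7: MRI vs jruby, xml
puts speedup(%w[jruby ruby-marc binary], %w[jruby marc4j binary])    # 4.3: marc4j vs ruby-marc, binary
puts speedup(%w[jruby ruby-marc jstax], %w[jruby marc4j xml])        # 14.4: marc4j vs ruby-marc, xml
```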

We know from previous experience that libxml is the fastest of the current MRI-based marc-xml readers and that jstax is the best of the current jruby-based marc-xml readers. And, finally, we know that many of us can’t use marc-binary format because our records are too big.

If I’m gonna use jruby (which I think I am due to wanting to use the StreamingUpdateSolrServer) I’m gonna need to use marc4j and just wrap it up in some nicer syntax.

Beta version of the HathiTrust Volumes API available

MAJOR CHANGE

So, initially, this post listed that the way to separate multiple simultaneous requests was with a nice, URL-like slash (/) character.

Then, I remembered that LCCNs can have embedded slashes, e.g., 65063380//r85.

So, we’re back to using pipe (|) characters to separate multiple calls — the examples below have been updated to reflect this.

Introduction

I’ve put up a beta version of the HathiTrust Volumes API previously discussed on this blog and via email.

Currently, I’ve only got json output, although there is space in there for other output formats as necessary.

What exactly is this?

Given an identifier or set of identifiers, this API will return a set of matched records and a sorted list of the items available in the HathiTrust.

Useful, for example, if you want to display HathiTrust holdings alongside your own in your OPAC.

Simple, single-value call

Given the URL:

http://catalog.hathitrust.org/api/volumes/oclc/15420548.json

You’ll get the following back:

    {
        "records":
        {
            "000791709":
            {
                "recordURL":"http://catalog.hathitrust.org/Record/000791709",
                "titles":
                [
                    "\"Zhong gong dang shi\" fu dao /",
                    "\u300a\u4e2d\u5171\u515a\u53f2\u300b\u8f85\u5bfc /"
                ],
                "isbns": [],
                "issns": [],
                "oclcs": ["15420548"],
                "lccns": []
            }
        },
        "items":
        [
            {
                "orig":"University of Michigan",
                "fromRecord":"000791709",
                "htid":"mdp.39015058510069",
                "itemURL":"http://hdl.handle.net/2027/mdp.39015058510069",
                "rightsCode":"ic",
                "lastUpdate":"00000000",
                "enumcron":false
            }
        ]
    }

Note that the ‘records’ are keyed on the local umid, also available in the ‘fromRecord’ field of each item.
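
As a rough sketch of how a client might use that keying, the Ruby below joins each item to its record through ‘fromRecord’. It works on a trimmed copy of the response above rather than calling the live service, and the names SAMPLE_RESPONSE and items_with_records are invented for this example.

```ruby
require 'json'

# Trimmed copy of the single-value response shown above.
SAMPLE_RESPONSE = <<~'JSON'
  {
    "records": {
      "000791709": {
        "recordURL": "http://catalog.hathitrust.org/Record/000791709",
        "titles": ["\"Zhong gong dang shi\" fu dao /"],
        "isbns": [], "issns": [], "oclcs": ["15420548"], "lccns": []
      }
    },
    "items": [
      {
        "orig": "University of Michigan",
        "fromRecord": "000791709",
        "htid": "mdp.39015058510069",
        "itemURL": "http://hdl.handle.net/2027/mdp.39015058510069",
        "rightsCode": "ic",
        "lastUpdate": "00000000",
        "enumcron": false
      }
    ]
  }
JSON

# Pair every item with the record it came from.
def items_with_records(json_text)
  data = JSON.parse(json_text)
  data['items'].map do |item|
    record = data['records'].fetch(item['fromRecord'])
    {
      'htid'       => item['htid'],
      'itemURL'    => item['itemURL'],
      'rightsCode' => item['rightsCode'],
      'recordURL'  => record['recordURL'],
      'title'      => record['titles'].first
    }
  end
end

items_with_records(SAMPLE_RESPONSE).each do |row|
  puts "#{row['title']} (#{row['rightsCode']}): #{row['itemURL']}"
end
```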

The generic short form is:

http://catalog.hathitrust.org/api/volumes/(idtype)/id.(outputtype)

Right now the valid idtypes are:

  • issn (will be normalized to just digits, no leading zeros)
  • isbn (will be normalized to an ISBN-13)
  • oclc (will be normalized to all digits, no leading zeros)
  • lccn (will be normalized as recommended)
  • htid (HathiTrust item id, seen above as “mdp.39015058510069”)
  • umid (the University of Michigan record ID, seen above in the “fromRecord” field of an item)

Currently the only valid outputtype is ‘json’.

More complex, multi-valued call

The full API URL looks like this:

http://catalog.hathitrust.org/api/volumes/json/id:1;oclc:45678;lccn:70628581|id:2;isbn:1591581613

This is a request for data on two separate items, identified on the calling end as simply ‘1’ and ‘2’ (id:1 and id:2). The first item is searched for using both an oclc number and an lccn; the second supplies only an isbn.

Note that

  • The output format (json) has moved to appear right after the ‘/volumes/’
  • There’s an arbitrary ‘id’ field. This will be used to index the return values, so use something meaningful on your end.
  • keys and values are separated by colons. Key-Value pairs are separated by semi-colons.
  • Separate requests are separated by ‘|’ in the URL, allowing you to request data for an arbitrary number of items with a single call.
  • Return values are keyed on the ‘id’ you supply, as shown in the returned structure below.
  • Matches follow the “#3” option on the old post, the “Must match if present” option: basically, if you supply an identifier and a record has one of those identifiers, they must match.
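
A small sketch of assembling such a URL from a hash of requests follows. The helper name volumes_url and the VOLUMES_BASE constant are made up for illustration; the separators (colon, semi-colon, pipe) follow the full API URL shown above. Some HTTP clients may insist on percent-encoding the pipe, which this sketch does not do.

```ruby
VOLUMES_BASE = 'http://catalog.hathitrust.org/api/volumes'

# requests maps your own arbitrary id to a hash of { idtype => value }.
def volumes_url(requests, format = 'json')
  specs = requests.map do |caller_id, identifiers|
    pairs = [['id', caller_id]] + identifiers.to_a
    pairs.map { |key, value| "#{key}:#{value}" }.join(';') # key:value;key:value
  end
  "#{VOLUMES_BASE}/#{format}/#{specs.join('|')}"          # requests joined by pipes
end

puts volumes_url({
  '1' => { 'oclc' => '45678', 'lccn' => '70628581' },
  '2' => { 'isbn' => '1591581613' }
})
# prints http://catalog.hathitrust.org/api/volumes/json/id:1;oclc:45678;lccn:70628581|id:2;isbn:1591581613
```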

So, in the example, the first request has both an oclc number and an lccn. Matches are as follows:

  • If a record has an oclc number but no lccn, its oclc number must match the passed oclc number.
  • If a record has an lccn but no oclc number, its lccn must match the passed lccn value.
  • If a record has both an lccn and an oclc number, both its identifiers must match the passed values.

The returned structure is keyed on the arbitrary id passed in the search string (if not present, the whole search string will be used instead):

    {
        "1":
        {
            "records":
            {
                "001474331":
                {
                    "recordURL":"http://catalog.hathitrust.org/Record/001474331",
                    "titles":
                    ["Some aspects of seventeenth-century medicine & science; papers read at a Clark Library seminar, October 12, 1968"],
                    "isbns": [],
                    "issns": [],
                    "oclcs": ["00045678"],
                    "lccns": ["70628581 //r86"]
                }
            },
            "items":
            [{
                    "orig":"University of Michigan",
                    "fromRecord":"001474331",
                    "htid":"mdp.39015004074095",
                    "itemURL":"http://hdl.handle.net/2027/mdp.39015004074095",
                    "rightsCode":"ic",
                    "lastUpdate":"20090713",
                    "enumcron":false
                }]
        },
        "2":
        {
            "records":
            {
                "004370624":
                {
                    "recordURL":"http://catalog.hathitrust.org/Record/004370624",
                    "titles":
                    ["ARBA in-depth. Philosophy and religion /"],
                    "isbns":
                    ["1591581613"],
                    "issns": [],
                    "oclcs": ["53462174"],
                    "lccns": ["2003065945"]
                }
            },
            "items":
            [{
                    "orig":"University of Michigan",
                    "fromRecord":"004370624",
                    "htid":"mdp.39015058261911",
                    "itemURL":"http://hdl.handle.net/2027/mdp.39015058261911",
                    "rightsCode":"ic",
                    "lastUpdate":"20090907",
                    "enumcron":false
             }]
        }
    }

Enumeration / Chronology

An effort is made to return items in “enumcron order” — hopefully, with earlier volumes showing up before later volumes. The full enumcron is listed in the items if you need to try something different.

JSONP Support

JSONP output is supported — just throw a ‘&callback=blahblahblah’ on the end of the URL you call and you’ll get a function definition back.

Some examples:

http://catalog.hathitrust.org/api/volumes/oclc/15420548.json&callback=myfunc

http://catalog.hathitrust.org/api/volumes/json/id:1;oclc:45678;lccn:70628581|id:2;isbn:1591581613&callback=myfunc

Running Blacklight under JRuby

I decided to see if I could get Blacklight working under JRuby, starting with running the test suite and working my way up from there.

There was much pain. Much, much pain. Exacerbated by my almost complete lack of knowledge about what I was doing.

This is the procedure I eventually arrived at — if there are places where I made trouble for myself, please let me know!

[And does anyone know how to get jruby's nokogiri to link to a different libxml and stop with the crappy libxml2-version error message every time I run it under OSX???]

Download jruby

Go to jruby.org and download a binary distribution. Extract the tar.gz (or zip or whatever).

I’ll put mine in ~/jruby. Or, at least that’s what I’ll tell you.

tar xzf jruby-1.4.tar.gz

To avoid confusion, let’s make jrake an alias for rake and add the jruby bin directory to the path

cd ~/jruby/bin
ln -s rake jrake
export PATH=`pwd`:$PATH

Download Blacklight

git clone git://github.com/projectblacklight/blacklight.git

Again, we’ll say that I put this in ~/blacklight/

Muck with Blacklight dependencies

Edit the file init.rb to comment out references to libxml and ruby-xslt, as well as nokogiri. My understanding is that the first two are used, at this point, only for the EAD stuff. Both rely on libxml2 which is a C-extension and hence unavailable to JRuby.

Nokogiri gets pulled in during other installs and for some reason jrake will complain later on that it’s got a wrong version or something. So, we’ll just work without that particular net for now.

#### File ~/blacklight/init.rb
# config.gem 'libxml-ruby', :lib=>'libxml', :version=>'1.1.3'
# config.gem 'ruby-xslt', :lib=>'xml/xslt', :version=>'0.9.6'
# config.gem 'nokogiri', :version=>'1.3.3'

Do some initial installs

jgem install -v=2.3.4 rails
jgem install activerecord-jdbc-adapter jdbc-sqlite3 \
             activerecord-jdbcsqlite3-adapter ActiveRecord-JDBC
jgem install rcov -s http://gemcutter.org --no-rdoc --no-ri
jrake
jrake gems:install

Edit the config/database.yml file

…to change the adapter to jdbcsqlite3 for development and testing.

Edit the databases.rake file

This one was harder to track down. The default rake task has hard-coded database names in the .rake file — jdbcsqlite3 isn’t included. I keep seeing things saying, “Oh, yeah, that’s been fixed…” but, well, it wasn’t for me. I had to do it by hand.

edit ~/jruby/lib/ruby/gems/1.8/gems/rails-2.3.4/lib/tasks/databases.rake

You need to find everywhere there’s a

when "sqlite", "sqlite3" # or when /^sqlite/ in one case

…and change it to

when "sqlite", "sqlite3", "jdbcsqlite3"

Repeat for other databases you want to use (e.g., mysql). For the moment, since I’m only worried about running jrake spec, that’s all I’m gonna do.

Try again

jrake
  Missing these required gems:
   mislav-hanna  = 0.1.11

OK. Not sure why that didn’t come in before. Go ahead and add it.

jgem install mislav-hanna

Migrate the databases

jrake

The databases should migrate, and then it’ll poop out because Solr didn’t start.

Fire up solr

Since we’re running jruby, accessing the shell doesn’t work. You’ll have to fire up your test solr instance by hand.

cd ~/blacklight/jetty
java -Djetty.port=8888 -jar start.jar 2>log.jetty

Try it again!

cd ~/blacklight
jrake spec

   ................................................................
   ................................................................
   ....F............................................................
   1)
   'ApplicationHelper Export EndNote should render the correct 
   EndNote text file' FAILED
   expected: "%0 Format\n%E Greer, Lowell. \n%E Lubin, Steven. \n%E Chase, Stephanie, \n%E Brahms, Johannes, \n%E Beethoven, Ludwig van, \n%E Krufft, Nikolaus von, \n%T Music for horn \n%I Harmonia Mundi USA, \n%C [United States] : \n%D p2001. \n",
  got: "%0 Format\n%C [United States] : \n%D p2001. \n%E Greer, Lowell. \n%E Lubin, Steven. \n%E Chase, Stephanie, \n%E Brahms, Johannes, \n%E Beethoven, Ludwig van, \n%E Krufft, Nikolaus von, \n%I Harmonia Mundi USA, \n%T Music for horn \n" (using ##)
./spec/helpers/application_helper_spec.rb:128:

Finished in 15.519 seconds
193 examples, 1 failure

I can live with that for the moment. Anyone know why that spec fails?

Great! How about the features?

jrake features
  (much output)

  59 scenarios (59 passed)
  434 steps (434 passed)
  0m51.186s

And so…

…it appears that, at least on the surface, jruby is a viable platform for Blacklight so long as I don’t actually need any of the libxml stuff. In the next couple days I’ll try and actually get it all up and running and see if I can break it.

Setting up your OPAC for Zotero support using unAPI

unAPI is a very simple protocol to let a machine know what other formats a document is available in. Zotero is a bibliographic management tool (like Endnote or Refworks) that operates as a Firefox plugin. And it speaks unAPI.

Let’s get them to play nice with each other!

How’s it all work?

  1. Zotero looks for a well-constructed <link> tag in the head of the page
  2. It checks the document on the other side of that link to see what formats are offered, and picks one to use. No, you can’t decide which one it uses. It picks.
  3. Zotero then looks for IDs in the body of the page
  4. If both are found and everything seems kosher, Zotero will offer the option to import some or all of the records.

What you’ll need

  1. An OPAC whose output you can futz with
  2. Access to an individual record’s ID in that output
  3. A URL based on the ID that gives an RIS representation of the records
  4. A screwdriver. Made with decent — but not too expensive — vodka and fresh orange juice.

Yes. I’m cheating.

I have all those things already. Hence, this is easy for me. If you had to, say, write some sort of weird redirection script because IDs are not first-class citizens in your OPAC’s URL scheme, or write an RIS export tool by hand, well, this will take you a bit longer.

The process

1. Build an unAPI target script

You need a script that’ll do three things:

  1. With no arguments, return a list of available formats in general
  2. With one argument, id=<ID>, return a list of formats available for that item. This will likely be exactly the same as #1.
  3. With two arguments, id=<ID> & format=<FORMAT>, return the record identified by <ID> in format <FORMAT>

Mine looks like this:

    <?php
    // id is of the form urn:bibnum:000000000
    $id = isset($_REQUEST['id']) ? $_REQUEST['id'] : false;

    // Format, at this point, had better be 'ris'
    $format = isset($_REQUEST['format']) ? $_REQUEST['format'] : false;

    // Got neither? Return the general list
    if (!($id || $format)) {
      header('Content-type: application/xml');
      echo '<?xml version="1.0" encoding="UTF-8"?>
    <formats>
      <format name="ris"
              type="application/x-Research-Info-Systems"
              docs="http://www.refman.com/support/risformat_intro.asp"/>
    </formats>
    ';
      exit;
    }

    // Got just the id? Return formats for that ID
    if ($id && !$format) {
      header('Content-type: application/xml');
      // Escape the id before echoing it back inside an XML attribute
      echo '<?xml version="1.0" encoding="UTF-8"?>
    <formats id="' . htmlspecialchars($id, ENT_QUOTES) . '">
      <format name="ris"
              type="application/x-Research-Info-Systems"
              docs="http://www.refman.com/support/risformat_intro.asp"/>
    </formats>
    ';
      exit;
    }

    // Otherwise…

    // Parse out the actual numeric part of the id from the urn:<typeOfNumber> prefix.
    // Bail out if the id is missing or doesn't look like urn:bibnum:<something>.
    if (!preg_match('/^urn:bibnum:(.*)$/', (string) $id, $match)) {
      header('HTTP/1.1 400 Bad Request');
      exit;
    }
    $actualID = urlencode($match[1]);
    $method = urlencode($format);

    // Again: format had better be 'ris' because that's all I'm supporting at this point.
    header("Location: /Search/SearchExport?id=$actualID&method=$method", true, 302);
    exit;

You can see that a <format> is just a name, a mime-type, and an optional reference to documentation on the type.

I take advantage of my existing RIS export process in the redirect, at the bottom. I also built in the possibility that other types of numbers could come in — I’m hard-coding ‘bibnum’ for the moment, but could allow, say, “oclc” or “isbn” or whatnot, too.
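
To illustrate the idea of accepting other number types, here is a small sketch (in Ruby rather than PHP, purely as an illustration) of splitting an id of the form urn:<type>:<value> and accepting only known types. The helper name parse_unapi_id and the KNOWN_ID_TYPES list are hypothetical; nothing like this exists in the script above.

```ruby
# Hypothetical list of identifier types the target script would accept.
KNOWN_ID_TYPES = %w[bibnum oclc isbn].freeze

# Split "urn:<type>:<value>" into its parts; nil for anything unexpected.
def parse_unapi_id(id)
  match = /\Aurn:([a-z]+):(.+)\z/.match(id.to_s)
  return nil unless match && KNOWN_ID_TYPES.include?(match[1])

  { type: match[1], value: match[2] }
end

parsed = parse_unapi_id('urn:bibnum:000000002')
puts "#{parsed[:type]} -> #{parsed[:value]}"
puts parse_unapi_id('not-a-urn').inspect
```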

2. Tell your OPAC where the script lives

You’ll need a line in the <head> section of all your pages that might have an ID on them:

<link rel="unapi-server" type="application/xml" title="unAPI" href="/unapi">

Everything should be left alone except for the actual href.

3. Add your IDs to the HTML

In the HTML of your page, you can add one or more tags of the form:

<abbr class="unapi-id" title="urn:bibnum:000000002"></abbr>

(where the title of the <abbr> conforms to what you’re expecting in your script).

You can put stuff inside the <abbr> but you need not. On a single-record page, you should have (I would think) only one of these things. On a search results page, you may decide to not have any, or you may decide to have one for each search result.

4. Final step

Drink your screwdriver.

Where can I see it?

Well…here’s the thing.

You can take a look at my test instance, http://dueberb.vufind.lib.umich.edu/ and play there. You cannot see it in production, because there’s a little problem.

Our old OPAC — now dubbed mirlyn-classic — had a custom translator written for it. And it worked fine, and that was great.

But now we’ve got this new software running at mirlyn.lib.umich.edu, and Zotero keeps on using the old translator no matter what you do. The only way to override it is to actually fire up sqlite3 and remove the conflicting entry from the zotero translators table. And then never update that table again.

I’ve asked around about getting it fixed (changing the target URL for the old translator to point at mirlyn-classic) but it’s Friday, and no one is around. Hopefully soon.

Thinking through a simple API for HathiTrust item metadata

EDITS:

  • Added “recordURL” per Tod’s request
  • Made a record’s title field an array and call it titles, to allow for vernacular entries
  • Changed item’s ingest to lastUpdate to accurately note what the actual date reflects. This gets updated every time either the item or the record to which it’s attached gets changed.
  • Fixed a couple typos, including one where I substituted an ampersand for a pipe in the multi-get example (thanks again, Tod).
  • Added a better explanation of option #4

Introduction and History

Ages ago, I wrote a simple(ish) little cgi program to get basic item-level data out of what is now Mirlyn Classic, our OPAC. Soon enough, I was asked to modify it so people could get HathiTrust data from the underlying Aleph system, check viewability of the associated items, etc.

It works…kinda…but reflects what I now look back on as blissful ignorance. It doesn’t deal at all with serials, and doesn’t deal correctly with duplicated records or cases where multiple records have the same (supposedly-unique) identifiers.

We need something better. And I’m hoping comments on this post will result in something better.

Scope

Given standard identifiers for a known item, return basic item-level metadata for volumes deposited in the HathiTrust.

I want to keep this simple. There will likely be other APIs for other, more complex (or specialized) tasks, linked data for folks who dig that sort of thing, and so on. The goal here is to make something that’s fast and can help people inline data about HT into their own OPAC or similar system.

I’d also like to get this thing in place, at least the basics, in the next two weeks. Anything longer is self-indulgent.

Data returned

At the moment, I’m only planning on offering JSON out, unless someone really, really needs something else. Speak up if you’re an edge case.

Proposed basic return structure

…complete with JSON-illegal comments embedded

    {
      "records":
        {
          "003384758": // The HathiTrust record id of a matched record
            {
              "recordURL" : "http://catalog.hathitrust.org/Record/003384758",
              "titles":  ["Full, space-joined 245s"],
              "isbns" : ["123456789X"], // any/all ISBNs on this record
              "issns" : [], // any/all ISSNs on this record
              "oclcs" : [], // any/all OCLC numbers
              "lccns"  : ["68001537"] // any/all LCCNs
            },
            … // any more records that were matched
        },
      "items" :
        [
          {
            "fromRecord" : "003384758",
            "htid": "mdp.39015054407062",
            "itemURL": "http://hdl.handle.net/2027/mdp.39015054407062",
            "rights": "ic",
            "orig": "University of Michigan", // supplying institution
            "lastUpdate" : "20090807", // date of ingest into HathiTrust or last change
            "enumcron" : "An enumeration/chronology, if available" // OPTIONAL
          }
        ]
    }

A quick walk through the proposed return structure

Obviously, there are two sets of items: a list of records that matched the query, and a list representing the union of all items on those matched records.

records

For most purposes, you won’t care much about the record-level data unless you’re trying to do your own error-checking (possible) or want to link to the catalog record-level page (more likely).

[I'm actually very open to just plain leaving it out.]

The format is a hash keyed on the HathiTrust record ID, which can currently be turned into a URL such as http://catalog.hathitrust.org/Record/003384758. Elements are:

  • recordURL: The URL to the human-readable record view in the catalog
  • titles: an array of the full 245 fields, each with its subfields joined by spaces. Always present, usually with one item, sometimes more than one (vernacular entries), almost never with zero.
  • isbns: An array of all the ISBNs associated with the record. Always present; an empty array if none.
  • issns, oclcs, lccns: Same as ISBNs, but for the appropriate data.

Note that at this time, LCCNs are taken from the 010, so the LCCN array will either be empty or have one item. I left it as an array just for consistency.

items

This is an array of items, taken from all the matched records and ordered (as best I can) based on their enumcron. If no enumcron is present, order is undefined.

  • fromRecord: The HathiTrust record ID, as used as a key in the hash of records (explained above).
  • htid: The HathiTrust ID for the item.
  • itemURL: The URL to the page-turner (or search box, for search-only items) for this item. It’s currently just appended to the prefix “http://hdl.handle.net/2027/”, but I thought I’d include it in case the preferred URL algorithm changes at some point.
  • rights: The rights code for this item, as explained at http://www.hathitrust.org/hathifiles_metadata.
  • orig: The institution that supplied the item for digitization.
  • lastUpdate: The date of the last time this item or its containing record was touched, either because of ingest by the HathiTrust system or later editing, as YYYYMMDD. May be 00000000 if unknown.
  • enumcron: (OPTIONAL) The enumeration/chronology (e.g., “v. 3 1997” or somesuch). Again, optional. Leave out the key? Provide an empty string? Provide a false?

A word about enumcron

The enumcron string is fickle, and very local. The algorithm I’m using to sort them basically consists of taking all the numbers in the enumcron strings and zero-padding them to 8 digits, then sorting. It works pretty well, but isn’t perfect. I’m incredibly resistant to trying to do anything fancier, simply because I want it to be fast and because trying to deal with all possible enumcron formats is a sisyphean task.
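
One reading of that sorting approach, as a minimal Ruby sketch: pull every run of digits out of the enumcron, zero-pad each to 8 digits, and sort on the result. The method names are invented here, and dropping the non-digit text entirely is an assumption; the production code may well differ.

```ruby
# Build a sort key from every run of digits, zero-padded to 8 places.
def enumcron_sort_key(enumcron)
  enumcron.to_s.scan(/\d+/).map { |number| number.rjust(8, '0') }.join(' ')
end

# Items without any digits in their enumcron get an empty key and sort first.
def sort_by_enumcron(items)
  items.sort_by { |item| enumcron_sort_key(item['enumcron']) }
end

items = [
  { 'htid' => 'c', 'enumcron' => 'v. 10 1999' },
  { 'htid' => 'b', 'enumcron' => 'v. 2 1991' },
  { 'htid' => 'a', 'enumcron' => 'v. 1 1990' }
]

sort_by_enumcron(items).each { |item| puts item['enumcron'] }
```

A plain string sort would put “v. 10” ahead of “v. 2”; the padding is what keeps volume 2 before volume 10.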

An actual record

Here’s the simplest possible case: a single matched record with a single item

    {
      "records":
        {
          "000366004":
            {
              "recordURL" : "http://catalog.hathitrust.org/Record/000366004",
              "titles": ["The Sneetches, and other stories. Written and illustrated by Dr. Seuss."],
              "isbns": [],
              "issns": [],
              "oclcs": ["00470409"],
              "lccns": ["68001537"]
            }
        },
      "items": [
        {
          "fromRecord": "000366004",
          "htid": "mdp.39015079651611",
          "itemURL": "http://hdl.handle.net/2027/mdp.39015079651611",
          "rightscode": "ic",
          "lastUpdate": "20091004",
          "orig": "University of Michigan",
          "enumcron": false
        }
      ]
    }

We can see that (despite my expectation) we don’t happen to have an ISBN for this item. The item originally came from Michigan and was either ingested or last updated on October 4th, 2009 (lastUpdate is 20091004). The HathiTrust catalog page for this item is http://catalog.hathitrust.org/Record/000366004 (derived from the record ID) and it is In Copyright (ic), so the itemURL goes to a page that allows only search.

Making the request

I’ll take care of normalizing data on the way in (mostly done by the Solr backend): strip leading zeros off the OCLC number, normalize the LCCN as per this page at the Library of Congress, strip anything funny-looking from the ISBN and ISSN, and (probably) convert all ISBNs into ISBN13s.
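
Two of those normalizations are easy to sketch. The Ruby below is only an illustration of the idea (the real work is mostly done by the Solr backend): normalize_oclc strips non-digits and leading zeros, and isbn10_to_isbn13 applies the standard 978 prefix and ISBN-13 check digit. Both method names are made up, and the ISBN helper does not validate the incoming ISBN-10 check digit.

```ruby
# Keep only digits, then drop leading zeros: "00470409" becomes "470409".
def normalize_oclc(raw)
  raw.to_s.gsub(/\D/, '').sub(/\A0+/, '')
end

# Convert an ISBN-10 to ISBN-13; pass a 13-character value through unchanged.
def isbn10_to_isbn13(raw)
  isbn = raw.to_s.upcase.gsub(/[^0-9X]/, '') # strip hyphens, spaces, etc.
  return isbn if isbn.length == 13
  return nil unless isbn.length == 10

  core = "978#{isbn[0, 9]}"                  # drop the old check digit
  # ISBN-13 check digit: weights alternate 1, 3 across the 12 digits.
  sum = core.chars.each_with_index.map { |ch, i| ch.to_i * (i.even? ? 1 : 3) }.sum
  "#{core}#{(10 - sum % 10) % 10}"
end

puts normalize_oclc('00470409')       # 470409
puts isbn10_to_isbn13('0835221792')   # 9780835221795
```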

I’m anticipating three formats for a request (note: they don’t work yet. There’s no code):

Single-identifier option

http://catalog.hathitrust.org/api/volumes/oclc/00470409.json

http://catalog.hathitrust.org/api/volumes/lccn/68001537.json

http://catalog.hathitrust.org/api/volumes/issn/1051290x.json

http://catalog.hathitrust.org/api/volumes/isbn/0835221792.json

Simple and unambiguous; returns the proposed return structure as described above (and presumably amended before actual implementation). Again, any normalization that needs to be done will be done on my end, so “00470409” and “470409” are considered the same OCLC number.

Multiple-identifier, multi-request option

http://catalog.hathitrust.org/api/volumes?yourID1=oclc:00470409|lccn:68001537&yourID2=oclc:67890987|isbn:987652348X

In this format, you can see that (a) you can provide multiple pieces of metadata for a record, separated by pipe characters (|), and (b) you can provide metadata sets for multiple records at once, keyed on whatever arbitrary ID you want to use.

The return format would look like this:

  1.   {
  2.     "yourID1" : <proposed return structure>,
  3.     "yourID2" : <proposed return structure>,
  4.       …
  5.   }

What to do when the provided metadata don’t agree?

It’s entirely possible to provide an OCLC number and an LCCN that, in fact, refer to two different records. It’s also possible that we have two records in the system that should be merged, but haven’t been.

Some possible algorithms:

  1. Require that all sent numbers match: If you send an OCLC, and ISBN, and an LCCN, any returned record must have all three, and all must match. That seems too strict.
  2. Return any records that match any sent numbers: I could do a boolean-OR, so any record that matches any of the numbers you send gets returned. The risk of returning too much data seems too great.
  3. Return any records that don’t mismatch any sent numbers: The same as the first option, but null matches anything. So, if you sent an LCCN, and if the record has an LCCN, they must match. If you sent an OCLC number and if the record has an OCLC number, it, too, must match, etc.. Basically, every piece of metadata, if provided, must match.
  4. Order the number types and only match the best available. We provide an ordered list of types: OCLC, LCCN, ISBN, and finally ISSN. If you provide an OCLC number and there is a record with that OCLC number, return it and ignore everything else. If you didn’t provide an OCLC number (or if you did but we didn’t get any matches), move on to the LCCN and try again, as shown below.

    # The algorithm for #4 (records_that_match stands in for the real lookup)
    def best_available_match(provided_search)
      %w[oclc lccn isbn issn].each do |type|
        next unless provided_search[type]       # move on unless a number was provided
        records = records_that_match(type, provided_search[type])
        return records unless records.empty?    # if we found some, return
        # else, we move on to the next type
      end
      []                                        # nothing matched on any type
    end

So, for #4, if you provide an OCLC number and we find a match or matches, stop looking and return them. If we don’t find an OCLC match but you also provided an LCCN, look for records that match the LCCN, and if found return them. Repeat with ISBN and ISSN.

Understanding #3 vs. #4

Suppose the following are true:

  • You provide an OCLC number O and an LCCN L
  • I have a record r1 with OCLC number O and no LCCN at all
  • I have a record r2 with LCCN L and no OCLC number at all.

Under option #3, both records would be returned. They both fulfill the criterion that they match all the supplied identifiers in all fields for which they have values. In other words, r1 has a positive match on OCLC (O == O) and a null-matches-everything match on LCCN (L == no data). Likewise, r2 has a positive match on LCCN and a null-matches-everything match on OCLC.

Under option #4, only r1 is returned. We first look for all records that match on the OCLC number provided, find exactly one, and return it. We never even bother to look for records that match on LCCN only.
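
Here is the same scenario as a small Ruby sketch, with records as plain hashes and made-up identifier values. One assumption of mine that the description above doesn't spell out: for option #3 I also require at least one positive match, so that a record with no identifiers at all doesn't match everything.

```ruby
TYPES = %w[oclc lccn isbn issn].freeze

# Option #3: a record is returned if it never mismatches a sent number
# (a missing value matches anything). Assumption: it must also match at
# least one sent number positively.
def option3(records, sent)
  records.select do |rec|
    no_mismatch = sent.all? { |type, num| rec[type].nil? || rec[type] == num }
    some_match  = sent.any? { |type, num| rec[type] == num }
    no_mismatch && some_match
  end
end

# Option #4: walk the types in priority order and return the first non-empty match set.
def option4(records, sent)
  TYPES.each do |type|
    next unless sent[type]
    found = records.select { |rec| rec[type] == sent[type] }
    return found unless found.empty?
  end
  []
end

r1 = { 'id' => 'r1', 'oclc' => '470409' }     # OCLC number O, no LCCN
r2 = { 'id' => 'r2', 'lccn' => '68001537' }   # LCCN L, no OCLC number
sent = { 'oclc' => '470409', 'lccn' => '68001537' }

p option3([r1, r2], sent).map { |r| r['id'] }  # => ["r1", "r2"]
p option4([r1, r2], sent).map { |r| r['id'] }  # => ["r1"]
```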

Let’s pick one and see how it works in the real world

I’m leaning toward #4, but I’m open to #3 as well, or any other variant that can be computed quickly and easily on this end. We’re talking about some pretty weird edge cases when we start going down this road, and I don’t want to sacrifice ease of use and ease of computation any more than we have to.

Please comment!

You can comment here, or send email directly to me. I’ll follow up this post periodically with more thoughts and synopses of what I’ve heard.

Adding LibXML and Java STAX support to ruby-marc with pluggable XML parsers

JRuby is my ruby platform of choice, mostly because I think its deployment options in my work environment are simpler (perhaps technically and certainly politically), but also because I have high, high hopes to use lots of super-optimized native java libraries. The CPAN is what keeps me tethered to Perl, and whether or not you like Java-the-language, boy, are there a lot of high-quality libraries out there.

Since I’ve been messing around with MARC-XML parsing of late, and since Ross Singer added pluggable xml-parser awesomeness to the ruby-marc project, I thought I’d see what I could do with native Java methods when parsing MARC-XML.

And just for kicks, I threw in the old code that I wrote before that uses LibXML.

Why do this at all?

Because…er…there’s an obvious work-situation where I need to squeeze every last drop of speed out of…ruby…which we don’t use…er…

Because. Because I wanted to screw around with the technologies. Because I wanted to learn about calling java native stuff. Because I already wrote the libxml stuff. Because it feels silly to run on the JVM and not use JVM-native code to deal with XML, given that standard java projects make it seem like Java is a giant XML processor with a language wrapped around it.

What exactly did I do?

For the LibXML stuff, I copied my own code. For the java stax parser (javax.xml.stream.XMLStreamReader, created via javax.xml.stream.XMLInputFactory), I stole just about everything from Ross’s nokogiri code and put it into its own module, and then slimmed down the nokogiri module and the stax module to only include their differences.

The patch is at the ruby-marc rubyforge site if you want to play along at home.

Other than using the stax or libxml parser, everything else is the same — MARC::Record objects and their components are created exactly as they are with the other parsers. It might be “fun” (for some twisted definition of “fun”) to wrap the MARC::Record interface around marc4j at some point, but right now all that’s changed is the parsing.

Do they work?

Yes. Thanks for asking. At least all the tests pass when I type ‘rake’.

How fast is it?

As always, the numbers are iffy. These were done on my desktop, with other stuff going on. I didn’t bother to benchmark rexml because we know how slow that is.

The test file is a nightly dump intended to go into our VuFind install. It was born as binary marc, and changed to marc-xml using yaz-marcdump, which is so fast that I thought maybe something had gone wrong. Holy cow, is yaz-marcdump fast.

The resulting XML is 219MB and contains 46,242 records.

The test was to open it up, loop through the records, and pull the 245 out of each. Each segment looks something like this:

    reader = MARC::XMLReader.new(filename, :parser=>'jstax')
    reader.each do |record|
      title = record['245']   # ruby-marc tags are strings
    end

Times are in seconds. I ran each one five times, with the exception of jrexml, during which I got bored. And the perl code, for which I just wanted to get a ballpark to compare.

MRI 1.8.7
    libxml     104    (103, 103, 106, 104, 103)
    nokogiri   301    (304, 300, 301, 301, 300)

JRuby trunk
    jrexml     547    (539, 554)
    jstax      203    (201, 208, 201, 201, 204)

Perl 5.10 w/MARC::File::XML
    perl       340    (340)
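
For anyone who wants to reproduce this kind of timing, a minimal harness might look like the following. This is a sketch, not the script I actually ran: `time_runs` and `mean` are names made up here, and the timed block is a throwaway stand-in so the example runs without ruby-marc installed.

```ruby
require 'benchmark'

# Run the given block `runs` times and return the wall-clock seconds for each run.
def time_runs(runs)
  Array.new(runs) { Benchmark.realtime { yield } }
end

# Arithmetic mean of a list of timings.
def mean(values)
  values.sum / values.size.to_f
end

# With ruby-marc installed, the timed block would be the reader loop, e.g.:
#   times = time_runs(5) do
#     MARC::XMLReader.new(filename, :parser => 'jstax').each { |record| record['245'] }
#   end
times = time_runs(3) { 10_000.times { |i| i.to_s } }   # stand-in workload

puts format('%.3f s mean over %d runs', mean(times), times.size)
```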

So…faster, right?

Pretty much, yeah.

Under (MRI) ruby, Ross found that nokogiri was 3.5x faster than rexml, and my noodling-around at home showed the same speedup. Using that as a baseline, we get the following speed comparison table using the libxml time normalized to 1.00.

In case that wasn’t clear: lower numbers are better.

libxml:   1.00
jstax:    1.95
nokogiri: 2.89
jrexml:   5.26
rexml:    10.11 (estimated; 3.5x nokogiri's time)

What does it all mean?

It means that adding pluggable parsers was freakin’ brilliant.

It means that a guy like me — with no real expertise in any of the applicable technologies — can do a passable job at integrating a java library into JRuby.

And it means that if I (a) can get folks around here to use Ruby, and (b) can get them to use MARC-XML instead of binary MARC (which we can’t use anyway because of the record-length limitations), I can be sure that any bottlenecks aren’t going to be the result of those choices.

An exercise in Solr and DataImportHandler: HathiTrust data

Many of the folks who read this blog (hi, both of you! Mom, say hello to Dad!) are aware, at least tangentially, of the HathiTrust. Currently hosted by us at the University of Michigan, the most public interface to its data is a VuFind installation you can access at catalog.hathitrust.org (or, for you smart-phone types, at m.catalog.hathitrust.org). Once you do a metadata search, you get links into the actual page images or a chance to search the fulltext of the selected item (depending on its copyright status).

It’s awesome. Seriously. Even in the absence of fulltext, being able to search within an item can be incredibly useful. Give it a shot if you haven’t.

You don’t always need an OPAC

But there are plenty of folks who don’t want or need a full-blown interface into all the metadata. They’ve already got one of those. What they’re interested in, mostly, is figuring out how to easily put links in their own OPAC (or whatnot or whoseits) to the HathiTrust if page images or searching are available. See, for example, a typical record from Tod Olson’s stuff at U-Chicago; he sniffs for HathiTrust and Google Books availability via embedded javascript.

To this end, the HathiTrust folks provide a set of simple, tab-delimited files — a full extract on the first of every month, and nightly updates every …er…night.

You can see from the description of the file that it’s very simple. Tab-delimited fields of the HathiTrust ID, rights information, and all the golden-oldie standard identifiers, some of which (ISSNs, ISBNs, etc.) are further comma-delimited in cases where multiple values are available and a field repeats. And a title and enumcron (description of an individual volume, e.g., “Sept 2007, vol. 33, issue 4”), so you have something useful to display if you need to, and that’s 98% of what most folks want.

The smart way to do it: RDBMS

If you want to query this data quickly and easily, the obvious thing to do is to dump it into a database. One main table for the non-repeated values, and either a few key=>value tables (or, if you’re lazy, a single key => type/value) for the repeated ISBNs/ISSNs/whatnot. A quick mod-perl script to set up some data normalization going in and out and persist the prepared SQL queries and you’re set.

It’s hard to make an argument against using a database for these data. I mean, c’mon. We’ve got a well-defined structure. An obvious foreign-key. No full-text searching needed. This is practically designed for a good old-fashioned RDBMS. Plus, I’ve done this approximately a zillion times before, so I’m good and fast at it. Case closed.

How I’m gonna do it

Screw that. What I really wanted to do was start messing around with the DataImportHandler (DIH) in Solr.

I can make a weak argument for including the data in a Solr instance. To wit, it’ll certainly be fast enough for anything I’m gonna throw at it, and (more important to me) it’s easy to set up datastore-level indexing and querying filters with built-in facilities and/or custom code. This allows me to build clients that call it without having to worry about manipulating the input much, if at all.

The list of simple DIH examples is…well, I never really found any good ones, although I’m sure they’re out there. The documentation isn’t bad, but it’s not full of complete examples, and almost all of them have to do with the potential complexities of sucking data out of a database, which is what most people want to do. Not me, I’ve got flat files to work with.

Luckily, you can fire up an “interactive” DIH session where, at the very least, you can try to import a few rows of data and see if things are puking. I didn’t find the error reports particularly helpful all the time, but it’s about a zillion times better than nothing, I can tell you that much.

The game plan

We’ll start with the assumption that I’ve already managed to load a full dump from some date (run with me here; I’ll explain how to do it later). Then what we want to do is the following:

  1. Every night, download the nightly additions/changes file and gunzip it.
  2. Hit the DIH handler to import all files that (a) have a filename of the right format, and (b) have a created date after the last time the DIH handler was run.

And that’s it. Get the new stuff, have DIH figure out what’s new, and import it.

The first part is easy enough to do with perl/python/ruby/whatever. I’ll leave it as an exercise for all you diligent students.
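
For the less diligent, the gunzip half might look something like this in Ruby. It's a sketch under assumptions: the `gunzip` helper is my own naming, and the download step is left as a comment because the real URL isn't given here (`NIGHTLY_URL` and the dated filename are placeholders).

```ruby
require 'zlib'

# Gunzip src_path next to itself (dropping the .gz suffix) and return the new path.
def gunzip(src_path)
  dest_path = src_path.sub(/\.gz\z/, '')
  Zlib::GzipReader.open(src_path) do |gz|
    File.open(dest_path, 'wb') { |out| IO.copy_stream(gz, out) }
  end
  dest_path
end

# Fetching would come first; NIGHTLY_URL is a placeholder, not a real address:
#   require 'open-uri'
#   IO.copy_stream(URI.open(NIGHTLY_URL), 'hathi_upd_20100101.txt.gz')
#   gunzip('hathi_upd_20100101.txt.gz')   # leaves hathi_upd_20100101.txt for DIH to find
```

The only thing that matters to DIH is that the unzipped file lands in the watched directory with a name of the expected format.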

Setting up solrconfig.xml

This is the easy part. Set up the handler, give it a semi-meaningful name, and call out to a config file.

    <requestHandler name="/hathiimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
      <lst name="defaults">
        <str name="config">hathi-data-config.xml</str>
      </lst>
    </requestHandler>

Define some useful data types in schema.xml

I left pretty much all of the boilerplate in schema.xml and just added a few types to deal with identifiers.

  • lowercase: return a single token that’s been lowercased. Don’t muck with it otherwise.
  • genericID: trim it, lowercase it, ditch everything that’s not a number or a letter, and return as a single token.
  • numeric: Ditch everything but the first string of digits, and then ditch any leading zeros. Useful when you know it’s gotta be an integer.
  • stdnum: Find the first set of digits (optionally followed by an ‘X’ and potentially interspersed with dashes or dots), strip off the leading zeros, and return it. Good to extract an ISBN from a string like “(alt) 123-45-678X electronic only”.
  • lccnnormalizer: Custom code to normalize an LCCN as per this page at the LoC.

    <types>
      <!-- lowercases the entire field value, keeping it as a single token. -->
      <fieldType name="lowercase" class="solr.TextField" positionIncrementGap="100">
        <analyzer>
          <tokenizer class="solr.KeywordTokenizerFactory"/>
          <filter class="solr.LowerCaseFilterFactory" />
        </analyzer>
      </fieldType>

      <!-- Full string, stripped of \W and lowercased -->
      <fieldType name="genericID" class="solr.TextField" sortMissingLast="true" omitNorms="true">
        <analyzer>
          <tokenizer class="solr.KeywordTokenizerFactory"/>
          <filter class="solr.LowerCaseFilterFactory"/>
          <filter class="solr.TrimFilterFactory"/>
          <filter class="solr.PatternReplaceFilterFactory"
                  pattern="[^\p{L}\p{N}]" replacement="" replace="all"
          />
        </analyzer>
      </fieldType>

      <!-- standard number normalizer: extract sequence of digits, strip leading zeroes -->
      <fieldType name="numeric" class="solr.TextField" sortMissingLast="true" omitNorms="true">
        <analyzer>
          <tokenizer class="solr.KeywordTokenizerFactory"/>
          <filter class="solr.LowerCaseFilterFactory"/>
          <filter class="solr.TrimFilterFactory"/>
          <filter class="solr.PatternReplaceFilterFactory"
                  pattern="[^0-9]*([0-9]+)[^0-9]*" replacement="$1"
          />
          <filter class="solr.PatternReplaceFilterFactory"
                  pattern="^0*(.*)" replacement="$1"
          />
        </analyzer>
      </fieldType>

      <!-- Simple type to normalize isbn/issn. Just get first string of digits followed by an optional 'x' -->
      <fieldType name="stdnum" class="solr.TextField" sortMissingLast="true" omitNorms="true">
        <analyzer>
          <tokenizer class="solr.KeywordTokenizerFactory"/>
          <filter class="solr.LowerCaseFilterFactory"/>
          <filter class="solr.TrimFilterFactory"/>
          <filter class="solr.PatternReplaceFilterFactory"
                  pattern="^[\s0\-\.]*([\d\.\-]+x?).*$" replacement="$1"
          />
          <filter class="solr.PatternReplaceFilterFactory"
                  pattern="[\-\.]" replacement="" replace="all"
          />
        </analyzer>
      </fieldType>

      <!-- LCCN normalization on both index and query -->
      <fieldType name="lccnnormalizer" class="solr.TextField" omitNorms="true">
        <analyzer>
          <tokenizer class="solr.KeywordTokenizerFactory"/>
          <filter class="solr.LowerCaseFilterFactory"/>
          <filter class="solr.TrimFilterFactory"/>
          <filter class="edu.umich.lib.solr.analysis.LCCNNormalizerFilterFactory"/>
        </analyzer>
      </fieldType>

      <!-- since fields of this type are by default not stored or indexed,
           any data added to them will be ignored outright. -->
      <fieldtype name="ignored" stored="false" indexed="false" multiValued="true" class="solr.StrField" />

    </types>
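
If you want to poke at what these analyzers do without firing up Solr, the filter chains for three of them can be approximated in a few lines of Ruby. These are stand-ins I wrote by transliterating the patterns above (lowercase, trim, then the pattern replaces); they are not Solr code and may differ from Solr in edge cases. The method names and the sample HathiTrust ID are made up.

```ruby
# genericID: lowercase, trim, drop everything that is not a letter or a number.
def generic_id(value)
  value.downcase.strip.gsub(/[^\p{L}\p{N}]/, '')
end

# numeric: keep the digits, then strip leading zeros.
def numeric(value)
  digits = value.downcase.strip.gsub(/[^0-9]*([0-9]+)[^0-9]*/, '\1')
  digits.sub(/\A0*(.*)\z/, '\1')
end

# stdnum: skip leading zeros/dashes/dots, keep digits (with dashes or dots) and an
# optional trailing 'x', then remove the dashes and dots.
def stdnum(value)
  core = value.downcase.strip.sub(/\A[\s0\-.]*([\d.\-]+x?).*\z/, '\1')
  core.gsub(/[\-.]/, '')
end

p numeric('00470409')            # => "470409"
p stdnum('1051-290X')            # => "1051290x"
p stdnum('0-8352-2179-2')        # => "835221792"
p generic_id(' MDP.39015-0001 ') # => "mdp390150001"
```

Because the same analyzer runs at both index and query time, “00470409” and “470409” end up as the same token, which is exactly the normalization promised to API callers.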

Add field definitions to schema.xml

This is pretty straight-forward: just set it up.

    <field name="htid"         type="genericID"      indexed="true"  stored="true"  multiValued="true"/>
    <field name="bibnum"       type="genericID"      indexed="true"  stored="true"/>

    <field name="access"       type="lowercase"      indexed="true"  stored="true"/>

    <field name="rights"       type="lowercase"      indexed="true"  stored="true"/>

    <field name="source"       type="lowercase"      indexed="true"  stored="true"/>
    <field name="sourceid"     type="genericID"      indexed="true"  stored="true"/>

    <field name="lccn"         type="lccnnormalizer" indexed="true"  stored="true"  multiValued="true"/>
    <field name="oclc"         type="numeric"        indexed="true"  stored="true"  multiValued="true"/>
    <field name="isbn"         type="stdnum"         indexed="true"  stored="true"  multiValued="true"/>
    <field name="issn"         type="stdnum"         indexed="true"  stored="true"  multiValued="true"/>

    <field name="title"        type="text"           indexed="true"  stored="true"/>
    <field name="imprint"      type="text"           indexed="true"  stored="true"/>
    <field name="enumcron"     type="text"           indexed="true"  stored="true"/>

    <!-- Ignore the multivalued, comma-delimited source strings -->

    <field name="rawLine"  type="ignored" indexed="false" stored="false"/>
    <field name="issns"    type="ignored" indexed="false" stored="false"/>
    <field name="isbns"    type="ignored" indexed="false" stored="false"/>
    <field name="oclcs"    type="ignored" indexed="false" stored="false"/>
    <field name="lccns"    type="ignored" indexed="false" stored="false"/>

hathi-data-config.xml — define how DIH is going to work.

This, of course, is the meat of the heart of the center of the matter.

I’m going to make use of four DIH technologies:

  • FileDataSource: In DIH, you declare a data source from which you’ll be sucking the raw data for manipulation and massaging. I’m just using a file, so this is for me. You can, as you might expect, pull in from a URL or (as mentioned) a database via JDBC.
  • FileListEntityProcessor: Given a directory and a set of criteria for a file, this will return a list of filenames that match those criteria. The criteria we’ll be using are (a) a regexp the filename must match, and (b) a creation date after the last time we ran the process.
  • LineEntityProcessor: Once you’ve got a data source, you need to stream it in somehow. There are Processors for XML and other formats, but this one just pulls in lines one at a time. The documentation all talks about LineEntityProcessor basically only being useful for pulling in, say, a list of filenames, but since my data is all line-by-line, this is what I’m using as my primary record-fetcher. It populates a single field called rawLine for later processing.
  • RegexTransformer: Allows you to take a field pulled from the datasource (or already derived from previous processing) and do regexp substitutions, group extraction, or splitting.

SO…I’m going to:

  1. Set up a FileDataSource to read from files
  2. Use FileListEntityProcessor to get a list of files that match my criteria
  3. Run each through LineEntityProcessor to generate a bunch of rawLines.
  4. Use the RegexTransformer multiple times to extract the data from the line.

[If you never went to look at it, this might be a good time to check out the description of the tab-delimited metadata files.]

    <dataConfig>
      <dataSource name="fds" encoding="UTF-8" type="FileDataSource" />
      <document>
        <!-- Get a list of files from the last time the handler ran -->
        <entity name="hathifile"
                processor="FileListEntityProcessor"
                newerThan="${dataimporter.last_index_time}"
                fileName="^hathi_upd_.*\.txt$"
                rootEntity="false"
                baseDir="/Users/dueberb/Documents/devel/hathi"
        >

          <entity name="hathiline"
                  processor="LineEntityProcessor"
                  url="${hathifile.fileAbsolutePath}"
                  rootEntity="true"
                  dataSource="fds"
                  transformer="RegexTransformer"
          >

            <!-- Big ugly regexp to get all the tab-delimited fields -->
            <field column="rawLine"
                   regex="^(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)$"
                   groupNames="htid,access,rights,bibnum,enumcron,source,sourceid,oclcs,isbns,issns,lccns,title,imprint"
            />

            <!-- Split the multi-values on comma -->

            <field column="oclc" splitBy="," sourceColName="oclcs" />
            <field column="issn" splitBy="," sourceColName="issns" />
            <field column="isbn" splitBy="," sourceColName="isbns" />
            <field column="lccn" splitBy="," sourceColName="lccns" />
          </entity> <!-- end of hathiline -->

        </entity> <!-- end of hathifile -->
      </document>
    </dataConfig>
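
If the regexp-plus-groupNames business is hard to picture, this Ruby sketch does roughly what the two RegexTransformer steps do to a single rawLine: carve it into thirteen named fields, then split the multi-valued ones on commas. It's an illustration only, not how DIH is implemented; `parse_raw_line` is a made-up name and the sample line is invented data.

```ruby
GROUP_NAMES = %w[htid access rights bibnum enumcron source sourceid
                 oclcs isbns issns lccns title imprint].freeze

# Same shape as the "big ugly regexp": thirteen (.*) groups separated by tabs.
LINE_PATTERN = Regexp.new('\A' + Array.new(GROUP_NAMES.size, '(.*)').join('\t') + '\z')

# Target field => comma-delimited source column, as in the splitBy lines.
MULTI_VALUED = { 'oclc' => 'oclcs', 'issn' => 'issns',
                 'isbn' => 'isbns', 'lccn' => 'lccns' }.freeze

def parse_raw_line(raw_line)
  match = LINE_PATTERN.match(raw_line.chomp)
  return nil unless match                       # wrong number of fields
  doc = GROUP_NAMES.zip(match.captures).to_h
  MULTI_VALUED.each { |field, source| doc[field] = doc[source].split(',') }
  doc
end

# Invented sample line with thirteen tab-delimited fields (empty ISSN column).
sample = ['mdp.0001', 'allow', 'pd', '123', 'v.1', 'MIU', '456',
          '470409', '0835221792,083522180X', '', '68001537', 'A title', '1968'].join("\t")

doc = parse_raw_line(sample)
p doc['isbn']   # => ["0835221792", "083522180X"]
p doc['issn']   # => []
```

In the Solr setup, the intermediate columns (oclcs, isbns, and friends) are thrown away by the "ignored" field type; here they simply stay in the hash.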

And…it doesn’t work.

It almost works. The problem is that my attempt to use the variable ${dataimporter.last_index_time} is busted. There’s a ticket to fix it and a patch already provided, so it’s only a matter of time before it’s not an issue.

For the moment, though, we’ll change that line to:

    <entity name="hathifile"
            processor="FileListEntityProcessor"
            newerThan="'NOW/DAY'"
            fileName="^hathi_upd_.*\.txt$"
            rootEntity="false"
            baseDir="/Users/dueberb/Documents/devel/hathi"
    >

That says to basically take everything created since midnight and use it. If you have cron scripts set up to run this every day, you’ll have no problems.

Dealing with a full extract

You’ll only have to do this once, of course, but it has to be done. Basically, reproduce the DIH handler with a different name, pulling in the data from a full extract (you could, e.g., just change the filename parameter to accept /^hathi_full_.*\.txt$/). Maybe call it hathifullimport instead of hathiimport.

Fire her up!

Once you’re ready to go, just hit the right URL:

http://solrmachine:port/solr/hathifullimport?command=full-import&clean=true

http://solrmachine:port/solr/hathiimport?command=full-import&clean=false

The first one will get the initial big, full file; the second will pull in all the nightlies you’ve downloaded, gunzipped, and put in the right place (provided, of course, they’re dated after the last midnight, or they’ve fixed DIH to allow the last_index_time syntax).
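
If you'd rather kick off the import from a script than a browser (handy for the cron job), building the URL in Ruby is a couple of lines. A sketch only: `dih_import_uri` is a name I made up, and the host and port below are placeholders for wherever your Solr lives.

```ruby
require 'uri'

# Build the DIH URL for a given handler name; host and port are placeholders.
def dih_import_uri(host, port, handler, clean:)
  query = URI.encode_www_form(command: 'full-import', clean: clean)
  URI::HTTP.build(host: host, port: port, path: "/solr/#{handler}", query: query)
end

nightly = dih_import_uri('localhost', 8983, 'hathiimport', clean: false)
puts nightly

# To actually trigger the import:
#   require 'net/http'
#   Net::HTTP.get_response(nightly)
```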

Next steps?

Beer or wine. Take your pick.

After that, though, it’d be a matter of actually writing the download scripts and setting up cron jobs. And, of course, putting a front-end on it if you want, or massaging the data as they come out to return a nice JSON format for your consumers. That sort of thing.

So, wait…is this really worth doing?

Maybe. Probably not. It was worth it to me to start thinking about DIH and how I can use it. And it might be worth it to you, if you want to play around with these data in the ways that solr makes easy.

But, like so many things, it’s less worth doing than it was worth writing up. I learned a lot.

Dead-easy (but extreme) AJAX logging in our VuFind install

One of the advantages of having complete control over the OPAC is that I can change things pretty easily. The downside of that is that we need to know what to change.

Many of you that work in libraries may have noticed that data are not necessarily the primary tool in decision-making. Or, say, even a part of the process. Or even thought about hard. Or even considered.

For many decisions I see going on in the library world, the primary motivator is the anecdote. In fact, to be honest, the primary driver is the faculty anecdote. Those cliched three curmudgeonly old faculty members invariably have huge influence over systems and interfaces that will be used by 40K undergraduates. The tiny percentage of weirdos that actually talk to reference librarians end up wielding enormous power compared to the untold masses that don’t.

Enter the dragon…er…log.

So…I’m logging everything. EVERYTHING. Everything I can think of, anyway, and that doesn’t slow things down too far.

I’ve got a simple database table set up with the following columns:

  • incrementing integer (solely for innodb’s efficiency needs)
  • sessionid
  • action
  • data1, data2, and data3 (all of these are action-dependent)
  • logweekday, logdate, logtime (instead of a single timestamp for easy and efficient queries)

And that’s it. I’ve had it running (initially with only a few actions) for two weeks and have on the order of 300,000 rows in it at this point. Obviously, at some point I’ll have a better idea of which data I actually care about and things will get slimmed down a little bit. But for now, it’s fun having that all around.

Common log events include an action, and usually at least one other piece of data. Stuff like:

  • start-a-new-session with IP Address
  • simple-search with search-index, searchstring
  • choose-a-facet with facet-index, facet-value, position-on-list
  • view-a-full-record with recordID, search-result-number
  • click on an electronic resource link to proquest/google/hathitrust/whatever

…etc. I track adding and removing things from the selected items set and the user’s favorites, exporting to email or refworks or whatnot, logging in and out, clicking on the author’s name or a LCSH subject in the full record view, picking a “similar item” from the eponymous list, clicking on the spelling suggestion and the prev/next buttons, etc. Currently I’m logging 78 events.

[Note: by "search result number" I mean the enumeration of that record in that specific search set. So, the top result is #1. The first result on page two is #21]

What do I think I’m gonna learn?

I’m not exactly sure. Of course, I can get all the basics — how much traffic and what people are searching for — but there’s the possibility of other stuff. Things like:

  • Do people actually use the [prev/next, facets, facets-below-the-fold, items past the third page, etc]?
  • Seriously, is anyone using the boolean searches and wildcards on the basic search page? Are any of them using IP addresses from outside the staff subnet? If not, can I please please please start using DisMax???
  • What facets are most popular? Do people hit the little “more” button to expand the list of facet values from 6 to 30?
  • What’s the average search result number of a record chosen for a full-record view for each search index (perhaps an indicator of how well the relevancy-ranking is working?)
  • Looking at all the full-record displays, what are the patterns for those records (e.g., break down by callnumber prefix, or by our “Academic Discipline” subject)

I’ve got a lot to learn about stats, and user tracking, and clickpath analysis, but dammit I’ll have data and I’m not afraid to use it!

[Er...them. Not "it". Data are a "them." Always feels weird to me to refer to data in the plural, but I'm forcing myself to do so these days.]

What’s the server implementation?

I already mentioned the database. I’ve got a little module called ActivityLog that does three and a half things:

  1. Get the session id from the session
  2. Get logging information from the GET/POST or passed in as parameters
  3. Modify the parameters if need be (e.g., pull domain name out of an external URL). This is the half a thing.
  4. Stuff it into the database with appropriate timestamps.

And that’s it.
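
In rough terms, those three and a half things look like the sketch below. This is not the real ActivityLog module (which lives inside our VuFind install); it's a Ruby illustration with made-up names, and the exact column formats (abbreviated weekday, ISO date, 24-hour time) are my assumptions. The 'outlink' action name is invented; 'srrecview' is a real one.

```ruby
require 'uri'

# One row of the log table, minus the auto-incrementing integer.
LogRow = Struct.new(:sessionid, :action, :data1, :data2, :data3,
                    :logweekday, :logdate, :logtime)

# The "half a thing": reduce an external URL to its host name; leave other values alone.
def normalize(value)
  return value unless value.to_s.start_with?('http://', 'https://')
  URI.parse(value).host
rescue URI::InvalidURIError
  value
end

# Build a row from the session id, the action, and up to three data values.
def build_log_row(session_id, action, data = [], now = Time.now)
  d1, d2, d3 = data.map { |v| normalize(v) }
  LogRow.new(session_id, action, d1, d2, d3,
             now.strftime('%a'), now.strftime('%Y-%m-%d'), now.strftime('%H:%M:%S'))
end

row = build_log_row('abc123', 'outlink', ['http://www.proquest.com/some/path'])
p row.data1   # => "www.proquest.com"
```

Splitting the timestamp into three columns at insert time is what makes "everything on Tuesdays" or "everything between 2 and 3 a.m." a plain equality or range query later.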

What’s the client need to do?

I start off with the following rules:

  1. I want to be able to log damn near everything
  2. I can’t degrade the user experience in a meaningful way just for logging
  3. I want to log outgoing links, too.
  4. I must must must have pretty, bookmarkable URLs.

Truth be told, some of the “client” stuff can be (and is) done on the server. When someone is, say, sending a record or set of records to RefWorks, the server knows everything it needs to know and I can just take care of logging as part of the regular request fulfillment.

But some stuff (the search result number, say) is best taken care of from the browser. Easy enough, for the most part, esp. with form submissions and such.

The potentially-non-obvious part comes in with rule #4: I want pretty URLs. That means that the full display of record 123456789 is always going to be at /Record/123456789 no matter what the user clicked on. Ditto with adding/removing facets and such: the URL contains the resulting search, not the resulting search plus which facet was removed or added.

But — see #1 — I want to log damn near everything.

My solution (and I know lots of people are doing this; this isn’t rocket science) is to fire off an AJAX POST for the click events that I’m interested in, sending log data off to my server and then not waiting for a return. Just send the data and follow the link as if nothing had intervened. It degrades gracefully (although the rest of my VuFind doesn’t, so that doesn’t matter much) and it’s dead-easy to implement.

The actual javascript implementation

I long ago switched our VuFind stuff over to use jQuery, just because I like it and know it.

First thing is to use the templates to modify the links to have a particular class (logit) and a well-structured ref (pipe-delimited values).

So, a link from the title of a work on the search-results page to the individual record will look like this:

<a ref="srrecview|{$record.id}||{$recordCounter}" href="/Record/{$record.id}" class="title logit">{$title}</a>

The ref attribute tells us that we’re going to log the type of event (record view from the search results), the ID of the record, a null in the data2 column, and the search result number.
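
If it helps to see the mapping spelled out, here's a tiny sketch of how that pipe-delimited string breaks down into the four fields. The parseRef name is mine, just for illustration; the field names match the POST parameters used in the script further down.

```javascript
// Split a pipe-delimited ref value into the four logging fields.
// Missing trailing values default to the empty string.
function parseRef(ref) {
  const [lc, lv1 = '', lv2 = '', lv3 = ''] = ref.split('|');
  return { lc, lv1, lv2, lv3 };
}

// Record 123456789, clicked as the 7th hit on the search-results page.
console.log(parseRef('srrecview|123456789||7'));
// { lc: 'srrecview', lv1: '123456789', lv2: '', lv3: '7' }
```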

Then there’s javascript to make all the magic happen:

  // Send one log event to the server. Nothing waits on the response.
  function logit(a, args) {
    a = jQuery(a);

    // Allow the caller to pass in args, or get them from the ref attribute
    if (!args) {
      args = a.attr('ref').split('|');
    }

    jQuery.post(
      url_to_the_logging_method, // placeholder for the logging URL
      {
        'lc' : args[0],          // event type
        'lv1': args[1] || '',
        'lv2': args[2] || '',
        'lv3': args[3] || ''
      }
    );
  }

  jQuery(document).ready(function() {
    // .live() was removed in jQuery 1.9; on newer versions use
    // jQuery(document).on('click', 'a.logit', ...) instead.
    jQuery('a.logit').live('click', function(e) {
      logit(this);
    });
  });

The logit function just does a brain-dead post of the data in the ref attribute. We then bind that function to all anchors with the appropriate class, and we’re done.

[Note the use of the jQuery live event -- this makes sure the event will be bound to stuff that comes in via AJAX after page load. Our links to Google Books, for example, come in like this.]

Since the click handler doesn’t return false, the default action (actually follow the link) will fire without even waiting for the AJAX call to come back. Delay to the user is, hopefully, unnoticeable.
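
For what it's worth, newer browsers have navigator.sendBeacon, which is built for this send-it-and-leave pattern. That's not what my code uses; the following is just a sketch of the same idea with that API swapped in. The /activitylog endpoint and the sendLog name are made up, and the second argument exists only so the function can be exercised outside a browser.

```javascript
// Hypothetical endpoint; substitute whatever your logging URL is.
const LOG_URL = '/activitylog';

// Queue a log event without blocking navigation.
// `send` is optional and defaults to the browser's beacon API when present.
function sendLog(fields, send) {
  const body = new URLSearchParams(fields); // form-encoded, like jQuery.post
  if (!send && typeof navigator !== 'undefined' &&
      typeof navigator.sendBeacon === 'function') {
    send = (url, data) => navigator.sendBeacon(url, data);
  }
  if (!send) {
    return false; // no way to send here; caller may fall back to a POST
  }
  return send(LOG_URL, body);
}

// In a click handler this would look something like:
//   sendLog({ lc: 'srrecview', lv1: '123456789', lv2: '', lv3: '7' });
```

sendBeacon returns true when the browser has accepted the data for delivery, which is about all a fire-and-forget logger can ask for.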

Final words

This isn’t all that smart. I should be doing more data-integrity stuff than I am, and of course someone could spoof my numbers if they wanted. But someone could spoof my stats just by hitting my normal catalog pages programmatically, too, so there’s no more risk involved, and I do log IPs.

And, of course, I get my pretty URLs, and most users (i.e., those not running Firebug) will never notice anything.

I don’t know that this would work for everyone, but so far it’s working pretty well for us. I’ll let you know if that continues in a post in a few weeks.