More Ruby MARC Benchmarks: Adding in MARC-XML

It turns out that UVA’s reluctance to use the raw MARC data on the search results screen is driven more by processing time than parsing time. Even if they were to start with a fully-parsed MARC object, they’re doing enough screwing around with that data that the bottleneck on their end appears to be all the regex and string processing, not the parsing. Their specs for what gets displayed are complex enough that they want to do the work up-front.

But I remain interested, at least partially because of the reason UVA is using MARC-XML: they have MARC records too big for binary MARC format to handle. We do, too, and we’ve just been talking about what to do with them. So I’m thinking that

First, I spent some time dusting off my first attempt at ruby programming: modifying ruby-marc to use libxml if it’s available. It’s not super-well tested, but I’m pretty sure it works. And the speed increases are … well, see below.

Anyone who wants to mess with my attempt at libxml-enabled ruby-marc is welcome to do so. This is a very forgiving parser — it trusts that whatever ended up in the XML should, in fact, have been there. If you say ‘XXE’ is a control field, well, I’ll treat it as a control field.

But back to the data. A few points are obvious:

  • XML with REXML is dead-slow on both platforms (at least an order of magnitude slower)
  • XML with LibXML is competitive with binary MARC (within 20% or so)
  • Even with REXML, though, time to create MARC records out of the 50 input strings is less than a second, which might be ok depending on your application.

Full results

As with last time, the total numbers below show how long it took to process all 40 sets of 50 records. The unadorned numbers are the average time it took to process a set of 50 records.

Call up solr with a null search, get 2000 records back in batches of 50 with wt=ruby, eval it, and stick it into arrays

jruby-Get/Eval data              0.143550
mri-Get/Eval data                0.106550

jruby-Get/Eval data (total)      5.742000
mri-Get/Eval data (total)        4.262017

Turn raw strings into MARC::Record objects from MARC-Binary strings, joining all the returned MARC together first

jruby-marc4j-multistring         0.026575
jruby-marc-multistring           0.037175
mri-marc-multistring             0.073396

jruby-marc4j-multistring (total) 1.063000
jruby-marc-multistring (total)   1.487000
mri-marc-multistring (total)     2.935842

Turn raw strings into MARC::Record objects from MARC-XML

mri-marc-LibXML                  0.091332
jruby-marc-REXML                 0.799500
mri-marc-REXML                   0.948549

mri-marc-LibXML (total)          3.653276
jruby-marc-REXML (total)        31.980000
mri-marc-REXML (total)          37.941975

Conclusions

I’m not sure exactly where this leaves me, other than knowing that marc-xml is probably a viable alternative if you can use libxml. Getting a version of that code which uses native Java XML libraries when run under jruby might be a useful exercise.

Benchmarking MARC record parsing in Ruby

[Note: since I started writing this, I found out Bess & Co. store MARC-XML. That makes a difference, since XML in Ruby can be really, really slow]

[UPDATE It turns out they don't use MARC-XML. They use MARC-Binary just like the rest of us. Oops.]

[UP-UPDATE Well, no, they do use MARC-XML. I'm not afraid to constantly change my story. This is why I'm the best investigative reporter in the business]

The other day on the blacklight mailing list, Bess Sadler wrote

Yes, we do still include the full marc record, but the rule of thumb we’re currently using is that anything that needs to display in the index view (the search results) needs to be broken out into a separate display field, because retrieving and parsing marc records for every item in a list of search results is too much of a performance hit.

This surprised me a fair bit, because in our implementation of VuFind (which uses PHP, versus Ruby for Blacklight) I do just that — grab the MARC out of Solr, parse it, and pull stuff like full titles and such out of it.

As it turns out, I’d been screwing around with calling marc4j from jruby, anyway, so I threw that into the mix, and here’s what I found.

What the benchmark tries to measure

The focus is on measuring time to parse MARC records as returned in a field from Solr in MARC-binary.

I got 40 sets of 50 records each (2000 records) from our Solr instance in ruby format and extracted the binary MARC strings. This resulted in an array of 40 sets of 50 strings, each of which is a valid MARC record.

Fifty records seems largish to me (we only display 20 at a time), but I thought I’d swing for the fences.

I’m testing along three(ish) dimensions:

  • jruby vs mri
  • marc4j vs ruby-marc (only on jruby, obviously)
  • parsing each string individually, or globbing them all together and treating it as if it’s a multi-record file

[Note that MRI is using Net::HTTP to get the data; I presume Curl would be faster still. It's already faster than jruby]
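For anyone who wants to run something similar, here’s a minimal sketch of the timing harness, not the actual benchmark script. It times each set of strings both ways (one at a time, and joined into a single string) and reports the per-set average and the total. The names time_sets and summarize and the two parser lambdas are made up for illustration; in real use the lambdas would call ruby-marc or marc4j, while here they just split on the MARC record terminator so the sketch runs on its own.

```ruby
require 'benchmark'

# Average per set and total across all sets, in seconds.
def summarize(times)
  total = times.sum
  { average: total / times.size, total: total }
end

# Time a parser over sets of record strings: one string at a time versus
# one joined string per set. parse_one and parse_many are hypothetical
# callables standing in for whatever MARC library is under test.
def time_sets(sets, parse_one:, parse_many:)
  one_at_a_time = sets.map do |set|
    Benchmark.realtime { set.each { |raw| parse_one.call(raw) } }
  end
  multistring = sets.map do |set|
    Benchmark.realtime { parse_many.call(set.join) }
  end
  {
    'oneAtATime'  => summarize(one_at_a_time),
    'multistring' => summarize(multistring)
  }
end

# Toy stand-ins: "records" are strings ending in the MARC record
# terminator (0x1D), and "parsing" just takes them apart again.
RECORD_TERMINATOR = "\x1D"
parse_one  = ->(raw)  { raw.chomp(RECORD_TERMINATOR) }
parse_many = ->(blob) { blob.split(RECORD_TERMINATOR) }

# 40 sets of 50 fake records, mirroring the shape of the real test data.
sets = Array.new(40) { Array.new(50) { |i| "record #{i}#{RECORD_TERMINATOR}" } }

time_sets(sets, parse_one: parse_one, parse_many: parse_many).each do |name, stats|
  puts format('%-12s %.6f (total %.6f)', name, stats[:average], stats[:total])
end
```

Benchmark.realtime only gives wall-clock time; the user/total columns in the table would need Benchmark.measure instead.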

The following data show the average time to parse out each set of 50 records and extract the first 245 (title) field from each one, along with the totals for doing all 2000 records.

Method                           User       Total      Real

jruby Get/Eval data              0.134750   0.134750  (0.134850)
jruby Get/Eval data (2000)       5.390000   5.390000  (5.394000)

MRI Get/Eval data                0.008500   0.012750  (0.115942)
MRI Get/Eval data (2000)         0.340000   0.510000  (4.637677)

jruby-marc4j-oneAtATime          0.056075   0.056075  (0.056125)
jruby-marc4j-multistring         0.027925   0.027925  (0.028000)

jruby-marc-oneAtATime            0.066625   0.066625  (0.066650)
jruby-marc-multistring           0.034300   0.034300  (0.034325)

mri-marc-oneAtATime              0.084500   0.085250  (0.086597)
mri-marc-multistring             0.085000   0.085750  (0.086026)

jruby-marc4j-oneAtATime (2000)   2.243000   2.243000  (2.244999)
jruby-marc-oneAtATime (2000)     2.665001   2.665001  (2.666000)
mri-marc-oneAtATime (2000)       3.380000   3.410000  (3.463888)

jruby-marc4j-multistring (2000)  1.117001   1.117001  (1.120001)
jruby-marc-multistring (2000)    1.371999   1.371999  (1.372999)
mri-marc-multistring (2000)      3.400000   3.430000  (3.441052)

So…the worst-case scenario is taking an average of 0.085 seconds to get the first title field out of each one of 50 binary MARC records once we’ve got them.

Now, I’m sure all my records came out of the cache, so my query time wasn’t very long. But we still end up with a maximum of roughly 0.2 seconds plus the time to actually do the query to end up with a set of 50 marc records.

We can see from looking at the totals that it looks like MRI’s bottleneck is the actual parsing, whereas constructing the input streams is expensive under jruby (at least the way I’m doing it), resulting in a benefit of concatenating them all together into one longish string before parsing.

Marc4j is faster (20%ish), but not enough faster to be worth the effort in my mind. Keep in mind that I have no idea how fast Marc4j is when running under pure java, without all the jruby overhead.

Bottom line, though: that seems fast enough to me.

I’ll try to benchmark with XML later on today or tomorrow.

Building a solr text filter for normalizing data

[Kind of part of a continuing series on our VUFind implementation; more of a sidebar, really.]

In my last post I made the case that you should put as much data normalization into Solr as possible. The built-in text filters will get you a long, long way, but sometimes you want to have specialized code, and then you need to build your own filter.

Huge Disclaimer: I’m putting this up not because I’m the best person to do so, but because it doesn’t look as if anyone else has. I don’t know what I’m doing. I don’t know why the code I’m showing below is the way it is, and if anyone would like to make it better, that’d be great. This is basically just a lot of pattern-matching on my part.

[A second disclaimer: I haven't actually built this into Solr yet, although I've done some simple testing on the ISBN-13 checksum code. I'll remove this disclaimer when I get a chance to actually index some data with it.]

The Setup: An ISBN-10 to ISBN-13 converter

Last time, I said I didn’t know why I hadn’t put together an ISBN longifier yet. So let’s walk through it.

This is a lot easier than most things in that I’m assuming we’re going to be getting exactly one token to work with (via the KeywordTokenizer) and can just work on it with impunity.

If you’d like to follow along, get the solr source via svn on a machine with java and ant. And junit, I think.

Where to put stuff

Of all the black magic associated with doing this, figuring out how to actually make it build is the part that’s probably easiest for Java-heads and the most confusing to the rest of us. Anyone attempting this sort of thing should probably get a good grounding in how Solr is set up and how its build system works before doing anything else.

Me? I cheated.

I basically just copied the directory structure of another project in the config directory in the solr root (looks like maybe it was velocity), did some tiny modifications to the build.xml file to change the name of the project, renamed the ‘.pom’ file and edited it in the obvious ways, and followed the copied directory structure to figure out where to put my files.

And then it worked. And I didn’t ask any questions, and metaphorically just backed away slowly with a nonchalant look on my face. Of course, if you know what you’re doing with java and ant, I’m sure there are better ways.

For the record, the directory in solr/config/umichnormalizers (where I put this stuff) would look something like this by the end of this project:

./target/
./build.xml
./src/main/java/edu/umich/lib/normalizers/ISBNLongifier.java
./src/test/java/edu/umich/lib/normalizers/ISBNLongifier.java
./src/main/java/edu/umich/lib/solr/analysis/ISBNLongifierFilter.java
./src/main/java/edu/umich/lib/solr/analysis/ISBNLongifierFilterFactory.java

You then just run ant in your config directory to generate a .jar file that can be put in solrmarc’s lib directory or (I think) jetty’s lib directory. You can also just run ant dist at the solr root level to get a .war file with your stuff embedded.

The converter

First, you just need some basic code to actually do the conversion. I’m sure this is hideously inefficient, but probably not as inefficient as the actual filter I’ll be producing in a minute.

We take in a string. If it looks like it might have a 10-digit ISBN in it (possibly with dashes or periods as delimiters), extract it, do the conversion to an ISBN-13, and return that as a 13-character string (i.e., no dashes or whatnot).

Note that I’m not working hard to determine if it’s an ISBN — this isn’t designed to try to pull an ISBN from random text. The hope is that by the time you get this far you’ve already got a pretty good idea that you’ve got an ISBN on your hands. I’m also not checking to see if the incoming ISBN is valid in any way; that’s left as an exercise for the diligent reader.

  package edu.umich.lib.normalizers;

  import java.util.regex.Matcher;
  import java.util.regex.Pattern;

  public class ISBNLongifier {

    // Dashes and dots are acceptable delimiters. Should we add spaces??
    private static final String ISBNDelimiterPattern = "[\\-\\.]";

    // Look for a string of nine digits followed by another digit or an X
    private static final Pattern ISBNPattern = Pattern.compile("^.*?(\\d{9})[\\dXx].*$");

    // True if the input contains something shaped like an ISBN-10
    public static boolean matches(String isbn) {
      isbn = isbn.replaceAll(ISBNDelimiterPattern, "");
      return ISBNPattern.matcher(isbn).matches();
    }

    // Returns the 13-digit form; throws IllegalArgumentException if no ISBN-10 is found
    public static String longify(String isbn) {
      isbn = isbn.replaceAll(ISBNDelimiterPattern, "");
      Matcher m = ISBNPattern.matcher(isbn);
      if (!m.matches()) {
        throw new IllegalArgumentException(isbn + ": Not an ISBN");
      }

      // Drop the old check digit and prepend the 978 prefix
      String longisbn = "978" + m.group(1);

      // ISBN-13 weights alternate 1, 3, 1, 3, ... across the first twelve digits
      int sum = 0;
      for (int i = 0; i < 12; i++) {
        int digit = Character.digit(longisbn.charAt(i), 10);
        sum += digit + (2 * digit * (i % 2));
      }

      // The check digit brings the weighted sum up to the next multiple of ten
      int check = (10 - (sum % 10)) % 10;
      return longisbn + check;
    }
  }
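As a sanity check on the checksum arithmetic, here’s the same conversion sketched in Ruby, which is handy for poking at from irb. This is not part of the Solr filter and the method name longify_isbn is made up; it just mirrors the logic described above: strip the delimiters, keep the first nine digits, prepend 978, and compute the new check digit with alternating weights of 1 and 3.

```ruby
# Convert an ISBN-10 to a 13-digit ISBN string; dashes and dots are ignored.
# Like the Java version, this does not validate the incoming check digit.
def longify_isbn(isbn)
  cleaned = isbn.gsub(/[-.]/, '')
  match = cleaned.match(/(\d{9})[\dXx]/)
  raise ArgumentError, "#{isbn}: Not an ISBN" unless match

  digits = ('978' + match[1]).chars.map(&:to_i)

  # Weights alternate 1, 3, 1, 3, ... across the first twelve digits.
  sum = digits.each_with_index.sum { |digit, i| digit * (i.even? ? 1 : 3) }

  # The check digit brings the weighted sum up to a multiple of ten.
  check = (10 - sum % 10) % 10
  digits.join + check.to_s
end

puts longify_isbn('0-306-40615-2')  # prints 9780306406157
```

For that example the weighted sum of 978030640615 is 93, so the check digit is 7.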

The Factory Object

Next is a boilerplate factory object. The only change will be the package you put it in, and the last method’s name and return value.

  package edu.umich.lib.solr.analysis;
  import java.util.Map;
  import org.apache.solr.analysis.BaseTokenFilterFactory;
  import org.apache.lucene.analysis.TokenStream;

  public class ISBNLongifierFilterFactory extends BaseTokenFilterFactory {
    Map<String,String> args;

    public Map<String,String> getArgs()
    {
      return args;
    }
    public void init(Map<String,String> args)
    {
      this.args = args;
    }
    public ISBNLongifierFilter create(TokenStream input)
    {
      return new ISBNLongifierFilter(input);
    }
  }

The actual filter

And, finally, the filter class. You’ll notice that I’m catching any illegal argument error and just returning the input unchanged. So anything that comes through that isn’t an ISBN just gets passed along.

  package edu.umich.lib.solr.analysis;

  import edu.umich.lib.normalizers.ISBNLongifier;
  import org.apache.lucene.analysis.Token;
  import org.apache.lucene.analysis.TokenFilter;
  import org.apache.lucene.analysis.TokenStream;
  import java.io.IOException;

  public final class ISBNLongifierFilter extends TokenFilter {

    public ISBNLongifierFilter(TokenStream in) {
      super(in);
    }

    public Token next() throws IOException {
      return normalize(this.input.next());
    }

    public Token next(Token result) throws IOException {
      // Hand the reusable token down the chain instead of ignoring it
      return normalize(this.input.next(result));
    }

    public Token normalize(Token t) {
      if (null == t || null == t.termBuffer() || t.termLength() == 0) {
        return t;
      }
      // termBuffer() can be longer than the term itself, so only read termLength() chars
      String val = new String(t.termBuffer(), 0, t.termLength());
      try {
        t.setTermBuffer(ISBNLongifier.longify(val));
      } catch (IllegalArgumentException e) {
        // Not an ISBN: pass it through unchanged
      }
      return t;
    }
  }

How to use it

Assuming you’ve managed to get it built into Solr and then deployed, just define it as a type in your schema.xml:

  <fieldType name="isbnlongifier" class="solr.TextField"  omitNorms="true">
    <analyzer>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="edu.umich.lib.solr.analysis.ISBNLongifierFilterFactory"/>
    </analyzer>
  </fieldType>

  <!-- and later... -->

  <field name="isbn" type="isbnlongifier" indexed="true" stored="false" multiValued="true"/>

Conclusion

There it is. The rocket science is all hidden behind the import statements. My understanding is that casting the token value to/from Strings makes things horribly inefficient, but I’m pretty sure I’ve got bigger bottlenecks to tackle before worrying about this.

Easy Solr types for library data

[Yet another bit in a series about our Vufind installation]

While I’m no longer shocked at the terrible state of our data every single day, I’m still shocked pretty often. We figured out pretty quickly that anything we could do to normalize data as it went into the Solr index (and, in fact, as queries were produced) would be a huge win.

There’s a continuum of attitudes about how much “business logic” belongs in the database layer of any application. Some folks (including super-high throughput sites, but mostly people who have never used anything but MySQL) tend to put no logic into the database. I’ve always edged over the middle to the other side of that debate, preferring to let the database do type-checking and conversions and track foreign keys and the like.

Solr, while not a traditional RDBMS, offers this type of functionality in its text filters. One can pipe data through a few standard filters, or write a custom one in Java if need be. The nice part is that it applies at index and query time. One obvious application, which I somehow haven’t bothered to write yet, is to convert all ISBNs to 13-character new-style ISBNs upon both index and query. That way, you don’t care if your original records had the short or long form; all the data gets converted no matter how it comes in.

Our standard text field, for example, is similar to the one in the default schema.xml, running text through the following filters:

  • UnicodeNormalization to normalize unicode composition and (optionally) remove diacritics
  • StopFilter to ignore stopwords in a separate file
  • WordDelimiter to do intelligent word delineation
  • LowerCase to…you know…lowercase everything
  • EnglishPorter to do stemming
  • RemoveDuplicates to do what it says

And because it happens on index and on query, everything works out.

We’re running Solr basically from trunk — whenever we need to change something, I pull down a fresh svn copy, put in our local changes to make sure it all works, and then deploy — so I have access to stuff slated for Solr 1.4, including most importantly Trie fields and the PatternReplaceFilterFactory.

The stdnum type

One of the first things we defined was a “stdnum” type, to deal with supposedly-unique identifiers, possibly with embedded dashes and dots and leading/trailing nonsense. Here’s a variant.

  <fieldType name="stdnum" class="solr.TextField" sortMissingLast="true" omitNorms="true" >
    <analyzer>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.TrimFilterFactory"/>
      <filter class="solr.PatternReplaceFilterFactory"
           pattern="^[\D]*([\d\-\.]+x?).*$" replacement="$1"
      />
      <filter class="solr.PatternReplaceFilterFactory"
           pattern="[^\dx]" replacement=""  replace="all"
      />
      <filter class="solr.PatternReplaceFilterFactory"
           pattern="^0+" replacement=""  replace="all"
      />
    </analyzer>
  </fieldType>

Let’s walk through it. It could probably be done in one go, but solr is not our bottleneck at this point…

  • We start by defining it as a TextField because it’s the only type that can take filters.
  • We then declare that instead of the standard tokenizer, we’re using the KeywordTokenizer. Confusingly, the KeywordTokenizer doesn’t tokenize in the traditional sense — it just returns the whole input as a single token.
  • Lowercase it.
  • Trim spaces off both ends
  • Skip any leading non-digits, find a string of numbers, dashes, and dots, with optional x at the end, and skip everything after it.
  • Remove anything left that isn’t a digit or an ‘x’.
  • Remove leading zeros, if you’ve got ‘em.

The net effect is a trimmed string that has only digits (with an optional trailing ‘x’) and removes any leading zeros.
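If you want to see what the chain does to a particular value without reindexing, the steps above can be mimicked outside Solr. This is only an approximation in Ruby, assuming Ruby’s regex engine behaves like Java’s for these simple patterns; the method name stdnum just echoes the field type.

```ruby
# Mimic the stdnum analyzer chain on a single keyword token.
def stdnum(value)
  token = value.downcase.strip                        # LowerCase + Trim
  token = token.sub(/^\D*([\d\-\.]+x?).*$/, '\1')     # keep the first run of digits/dashes/dots (+ optional x)
  token = token.gsub(/[^\dx]/, '')                    # drop anything that isn't a digit or an x
  token.sub(/^0+/, '')                                # strip leading zeros
end

puts stdnum('ISSN 0028-0836')             # prints 280836
puts stdnum(' 0-306-40615-2 ')            # prints 306406152
puts stdnum('020530902X')                 # prints 20530902x
puts stdnum('ISSN2: 1234567X (online)')   # prints 2
```

The last line shows the failure mode: the first run of digits wins, so a digit in the leading label is what ends up in the index.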

We use this “stdnum” field for ISBNs and ISSNs (and I think OCLC numbers) and it should work for any messy numerics you might have lying around. If you wanted to, you could change the regexp to enforce a minimum string of digits so it doesn’t get confused by any leading nonsense, e.g., “ISSN2: 1234567X (online)”. But if your data are that bad, you may have bigger problems to worry about.

textProper type

We define a textProper type that is exactly the same as the default text type, but without the stemming and synonyms. In the presence of stemming, exact matches and stemmed matches count the same toward relevancy (e.g. row and rowing). We had plenty of examples where exact results were getting overridden by the stemmed results, and this is confusing.

So for most of our important fields, we index them as both text and textProper so we can apply different weights to searches against them.

By the way, don’t forget to make sure your authors are in a textProper type; you don’t want stemming on author names!
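To make the “different weights” part concrete, here’s a small sketch of building a boosted field list in the field^boost form that Solr’s dismax qf parameter takes. Whether you use dismax at all is an assumption on my part here, and the field names and weights are invented for illustration, not our production values.

```ruby
# Build a dismax-style "qf" value from a field => boost mapping.
# Field names and boosts are hypothetical examples.
def qf_param(weights)
  weights.map { |field, boost| "#{field}^#{boost}" }.join(' ')
end

# Weight the unstemmed (textProper) copy of the title above the stemmed one,
# so exact matches outrank stemmed matches.
puts qf_param('titleProper' => 10, 'title' => 1)  # prints titleProper^10 title^1
```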

exactmatcher type

The name exactmatcher is a red herring, of course. It’s not an exact matcher. It just strips out all the delimiters so we can pretend it’s an exact match.

  <fieldType name="exactmatcher" class="solr.TextField" omitNorms="true">
    <analyzer>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="schema.UnicodeNormalizationFilterFactory" version="icu4j" composed="false" remove_diacritics="true" remove_modifiers="true" fold="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.TrimFilterFactory"/>
      <filter class="solr.PatternReplaceFilterFactory"
           pattern="[^\p{L}\p{N}]" replacement=""  replace="all"
      />
    </analyzer>
  </fieldType>

That’s it. Lowercase it, normalize the unicode, and pull out everything that’s not a (unicode) letter or number.

Note that we’re still using KeywordTokenizerFactory — we’re getting exactly one token out of this thing. That means that the query input either matches (as one string) or it doesn’t.

Here’s how we use it:

  • Control numbers: Our controlnums (old ids, that sort of thing), report numbers, sdr numbers (related to HathiTrust), the HathiTrust ID
  • Callnumbers: I also try to normalize LC, but this helps people find everything else
  • Titles: in addition to a regular tokenized title, we index the 245a (as title_a) and the 245ab (as title_ab). If someone types in an exact match for either of them, we shoot the relevancy through the roof (more so for the title_ab than the title_a, obviously). This makes known item searching a little less painful.

wildcard searching

One downside of using all these filters is that Solr ignores filters when doing wildcard searches. There is a patch floating around that will use an analyzing query parser for wildcard searches, but I haven’t had time to fiddle around with it.

One thing you can do is to do the exact same normalization in your calling code and then throw a ‘*’ on the end of it. The data are in the index, after all — you just have to do the filtering yourself. For example, for a cheap and easy “Title starts with” search, you can do the same normalization in PHP or Ruby or whatever as we do in the Solr exactmatcher type, drop a ‘*’ on the end of it, and query against the exactmatcher version of your title. Voila.
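Here’s roughly what that looks like in Ruby. It only approximates the exactmatcher chain: I’m assuming that NFD decomposition plus dropping combining marks is close enough to what the unicode normalization filter does, which may not hold for every character. The method names are made up; title_ab is the exactmatcher title field mentioned above.

```ruby
# Approximate the exactmatcher chain client-side: decompose, drop combining
# marks (diacritics), lowercase, then strip everything that is not a
# Unicode letter or number.
def exactmatcher(text)
  text.unicode_normalize(:nfd)
      .gsub(/\p{Mn}/, '')
      .downcase
      .gsub(/[^\p{L}\p{N}]/, '')
end

# Build a "starts with" wildcard query against an exactmatcher field.
def starts_with_query(field, text)
  "#{field}:#{exactmatcher(text)}*"
end

puts starts_with_query('title_ab', 'The Hobbit, or')  # prints title_ab:thehobbitor*
```

Since only letters and numbers survive the normalization, there’s nothing left in the term that needs escaping for the query parser. An input with no letters or numbers at all collapses to a bare ‘*’, so you’d want to check for that before sending the query.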

Custom filters

Regular expressions can get you ridiculously far, but for a couple cases it’d be nice to have custom code running. I’ve already mentioned that we should be upcasting all our ISBNs to the 13-character variant. The other two areas where I do this are to normalize LCCNs and to badly normalize LC CallNumbers. I’ll talk about both soon.

Going with and “forking” VUFind

Note: This is the second in a series I’m doing about our VUFind installation, Mirlyn. Here I talk about how we got to where we are. Next I’ll start looking at specific technologies, how we solved various problems, and generally more nerd-centered stuff.

When the University Library decided to go down the path of an open-source, solr-based OPAC, there were (and are, I guess) two big players: VUFind and Blacklight.

I wasn’t involved in the decision, but it must have seemed like a no-brainer. VUFind was in production (at Villanova), seemed to be building a community of similar institutions around it (e.g., Stanford), and was based on a technology stack we had some experience with (PHP). Blacklight seemed to be just getting off to a fitful start, and its Ruby stack was at that time an iffy proposition (this was before any sort of major adoption of Passenger or JRuby).

As I write this, things have flipped around a little. Andrew Nagy, the principal architect of VUFind, left Villanova for Serials Solutions and VUFind stopped being his primary focus. The Blacklight community decided to go with a major reorganization of the code to make it easier to deploy, which resulted in a flurry of refactoring and improvements and folks generally thinking things through really well. Stanford just flipped the switch from their VUFind to a Blacklight installation, and as I pointed out, the Ruby deployment options are more stable and less resource-hungry than they were back then. If the decision were being made today, it would be a much more complex analysis.

But anyway, the decision was made, and Tim Prettyman and I were tapped to do most of the hardcore nerd work to make it suitable for our environment.

Right away, I found things that would need some pretty major revision. The user model was based on a local database of logins (we use cosign), even moderately-long search strings would crash the thing, cookies were being used instead of sessions and hitting the 4K limit, search specifications were hardcoded in the PHP, and lots of the UI elements didn’t actually have working code behind them (RSS feeds, endnote export, spellcheck, etc).

So, I dug in and started learning PHP and Smarty and refactoring/rewriting/rearchitecting the crap out of it. One of the first things I did was to extract the search specification — the mapping of, say, a ‘title’ search to a weighted search of six or seven actual Solr fields — into a yaml file so we could mess around with it more easily than modifying the giant case-statement in the PHP code. I built a patch against the then-current revision, filed it as a bug, and sent email to the list.

And nothing happened. That patch is still sitting there, in fact. Maybe I’m the only one that thinks it’s useful. But in any case, there was no discussion of it, no one rejected it. It just sat. Sits. Whatever.

I could have asked for write access to the repository, but I didn’t. I saw a few other patches get submitted and met with yawns all around, and started looking more closely at the list and saw pretty much no one doing anything with the then-current code base, and frankly kind of gave up. The folks that I knew were working actively on implementing VUFind — us, Australia and Alan Rykhus at MNPals — were all working from very different code bases, which made our ability to share code very limited. Any sort of official work on VUFind seemed to have slowed to a near standstill (based on svn checkins), and almost no one else seemed interested in submitting patches. After a while, we stopped, too.

So, we didn’t really fork VUFind. We just rewrote much of it and stopped trying to generate interest in our changes. The right thing to do would have been to either grab the bull by the horns, or do an actual fork of the project. But we didn’t feel as if we had time to shepherd a project of this size, and after many, many (many) discussions, decided to just do our thing. I assume that’s what everyone else has done, too, since I see plenty of differences in how things work at the different sites.

As it stands, the wiki shows a good handful of libraries live with VUFind, and a bunch more marked as being in “beta.” I don’t know if what we’re running Mirlyn on is still enough VUFind to be called VUFind. Probably. The basic structure is the same, the search syntax as exposed in the URL is the same. The plumbing underneath is changed in a lot of ways, and I like to think the flow of control makes a little more sense now.

In real life, of course, it doesn’t matter where you draw the line. Our code is far enough removed from the svn repository now that we’re essentially going it alone.

That doesn’t bother me.

The reality is that we’ve taken control of the UI and learned what we need to know about using Solr with our data. If I need to change the backend — to Blacklight, to a newer VUFind, to anything — my users need not ever know, other than to notice that things are a little bit better. If we end up moving to a release-quality version of VUFind, there’s almost nothing I can’t reuse if it makes sense.

We’ve also learned a lot. Solr, obviously, and how to write text filters for it and push it around just a little bit. Solrmarc, too. But we’ve also taken a hard look at data normalization in ways we haven’t before, and decided how we’re going to output to Refworks, and to email, what kinds of searches we want to offer, where we have collisions in ID namespaces (OCLC & ISSN, I’m looking at you).

We’ve discovered issues and problems with our data we’d have never seen otherwise, and started up whole sets of conversations about OPAC issues that used to languish for lack of a reification for reference. The ability to actually (try to) implement the collective intelligence of the library and embody it in a public-facing system is a rush compared to fighting with the ILS.

The system has tons of problems still, starting with underlying templates that will make you a little sick if you do a “view source” and going right through my call number search not working for some edge cases. But that stuff will get cleaned up as we get a little downtime from adding new features, and there are elements of the new backend code that could be useful to others once I clean them up and remove local dependencies.

I’m not sure when, if ever, we’ll start thinking of ourselves as part of the “VUFind community” again. The heavy intellectual lifting about how to organize what is essentially a front-end for Solr doesn’t seem to be happening on the VUFind list. And to be honest, I’m not sure it should be. Solr is the real engine. Solrmarc is, for us right now, an important piece. Data normalization, translation, workaround for crappy data, and the basic information theory of a faceted search system are all independent of the particular middleware you’re using to grab Solr results and throw them up on the screen.

So, what we have is good for us, for now, and we’re continuing to learn how to move forward. And I’ve been able to get bug reports and say, “Thanks, Fixed” fifteen minutes later and get warm fuzzy feelings that don’t usually accompany, “Thanks. I’ll put a request in at Ex Libris’ online ticket system”.

Next time: using and abusing Solr for data normalization.

Sending unicode email headers in PHP

I’m probably the last guy on earth to know this, but I’m recording it here just in case. I’m sending record titles in the subject line of emails, and of course they may be unicode. The body takes care of itself, but you need to explicitly encode a header like “Subject.”

  require_once 'Mail.php';  // PEAR Mail

  $headers['To'] = $to;
  $headers['From'] = $from;
  $headers['Content-Type'] = "text/plain; charset=utf-8";
  $headers['Content-Transfer-Encoding'] = "8bit";
  // Non-ASCII header values have to go out as RFC 2047 encoded-words
  $b64subject = "=?UTF-8?B?" . base64_encode($subject) . "?=";
  $headers['Subject'] = $b64subject;

  $mail =& Mail::factory('sendmail', array('host' => $host,
                                           'port' => $port));
  $retval = $mail->send($to, $headers, $body);
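For the Ruby folks, the same header encoding is a one-liner. This is just a sketch of the encoded-word format, not a full mailer; the method name encode_subject is made up.

```ruby
# Wrap a UTF-8 subject in an RFC 2047 "B" (base64) encoded-word.
def encode_subject(subject)
  # pack('m0') is base64 with no line breaks
  "=?UTF-8?B?#{[subject.encode('UTF-8')].pack('m0')}?="
end

puts encode_subject('Hello')  # prints =?UTF-8?B?SGVsbG8=?=
```

One caveat that applies to the PHP above as well: RFC 2047 caps a single encoded-word at 75 characters, so a long title strictly ought to be split across several encoded-words. Neither version does that.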

Rolling out UMich’s “VUFind”: Introduction and New Features

For the last few months, I’ve been working on rolling out a ridiculously-modified version of Vufind, which we just launched as our primary OPAC, Mirlyn, with a slightly-different version powering catalog.hathitrust.org, a temporary metadata search on the HathiTrust data until the OCLC takes it over at some undetermined date.

(Yeah, the HathiTrust site is a lot better looking.)

[Our Aleph-based catalog lives on at mirlyn-classic -- I'll be interested to see how the traffic on the two differs as time goes on.]

I’m going to spend a few posts talking about how and why we essentially forked vufind, what sorts of modifications I made, and what technologies I hope to extract from our implementation that may be useful to the wider library community. And, I’m sure, a lot about why I hate Solr, why I love love love Solr, why I hate PHP, and why I love…er…no, I still hate PHP.

Credit where it’s due

And… a little credit where it’s due. I did a lot, but I didn’t do it all. I probably didn’t even do most of it. Half the effort, including all the heavy Aleph lifting (from getting the MARC out with all the filters and expansions we needed, to pulling holdings in real time, to grabbing a patron’s current checked-out items and holds, to fighting the inevitably-scarring battle with ILL), was done by Tim Prettyman. Suzanne Chapman lent her expertise to make it a lot less ugly and more usable than it once was (you can see her talents more strongly expressed at the HathiTrust catalog). And a whole horde of librarians were tapped by my boss, Jon Rothman, to try to figure out how to deal with the MARC data and facets and everything else that required a much deeper understanding of our data than I possess.

Non-stock user-facing features

In the next post, I’ll start with a look at how and why we changed the backend and what I’d do differently if I were starting from scratch. But right now, a quick list of the user-facing stuff that you might find interesting.

  • Email and export searches and search results, as opposed to just individual records.
  • Working endnote and refworks export.
  • Multi-select on the advanced search (e.g., pick two languages to get English OR German).
  • Publication date-range searching (with date-added-to-catalog searching coming soon).
  • A “sticky” institution selection, so each campus can choose to default to searching just their own stuff. We sniff IPs to set a default, too.
  • A “call number starts with” search based on semantics for LC searches (e.g., searching on CA11 won’t find CA1105), with call number range searching in testing now.
  • Contracted holdings for long lists of serials (see, e.g., Nature).
  • [Coming soon] Selecting records to a temporary set, which can be manipulated en masse (sent to Refworks, etc.). I’ll be hooking this up to mTagger, our home-grown bookmarking and tagging tool, later on.

Of course, I also broke some things. I haven’t added back in Search History, but will do so when I’ve got a couple hours. “Search Within” will make a comeback soon, too, but there are usability issues to contend with. And …for the love of god, don’t do a “View Source.” It’s the ugliest HTML underpinnings I’ve been associated with since 1993 or so.

All in all, though, it’s not bad work, and I’m glad to be able to offer it to our patrons.

Sending MARC(ish) data to Refworks

Refworks has some okish documentation about how to deal with its callback import procedure, but I thought I’d put down how I’m doing it for our vufind install (mirlyn2-beta.lib.umich.edu) in case other folks are interested.

The basic procedure is:

  • Send your user to a specific refworks URL along with a callback URL that can enumerate the record(s) you want to import in a supported form
  • Your user logs in (if need be) and gets to her RefWorks page
  • RefWorks calls up your system and requests the record(s)
  • The import happens, and your user does whatever she wants to do with them

Of course, there are lots of issues with doing this well (quick! Is this MARC record for a book? An edited book? Is it a journal, or a serial of some other sort? Who’s the actual author/editor?), but doing it at all isn’t so bad.

The URL to send them to

This is the “Export this record” URL on my system:

http://www.refworks.com.proxy.lib.umich.edu/express/expressimport.asp?
vendor=[your system]&
filter=MARC+Format&
database=All+MARC+Formats&
encoding=65001&
url=[your callback URL]

Note that the vendor variable should be a unique string (made up by you) for your system, not a larger entity (like the whole library or the institution).
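
For what it's worth, here's a minimal sketch of putting that URL together in PHP so that the callback URL gets properly escaped. The function name refworksExportUrl and the example.org callback are made up for illustration; the parameter names and values are the ones listed above.

```php
<?php
// Hypothetical helper: assemble the RefWorks import link from the
// parameters described in this post. $base is the expressimport.asp URL
// you send users to; $vendor is your own made-up system identifier.
function refworksExportUrl($base, $vendor, $callbackUrl)
{
    $params = array(
        'vendor'   => $vendor,
        'filter'   => 'MARC Format',
        'database' => 'All MARC Formats',
        'encoding' => '65001',
        'url'      => $callbackUrl,
    );
    // http_build_query() url-encodes each value (spaces become '+'),
    // so the callback URL's own '?' and '&' can't leak into the outer query.
    return $base . '?' . http_build_query($params, '', '&');
}

// Usage, with a placeholder callback URL:
$link = refworksExportUrl(
    'http://www.refworks.com.proxy.lib.umich.edu/express/expressimport.asp',
    'my-catalog',
    'http://example.org/Record/12345/Export?style=refworks'
);
```

The escaping matters because the callback URL usually has a query string of its own. Left raw, its parameters would be read as parameters to RefWorks instead of staying part of the url value.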

The “MARC Format” filter we’re using is not a filter for real MARC. It’s a MARC-like delimited format (see an example from my catalog).

Basically, you have three types of lines (but really, look at the example, ’cause it’ll make everything a lot clearer):

LEADER

  LEADER [one space] [leader text]

Control Field

  [three-digit control tag] [four spaces] [data text]

Data Field

  [three-digit data tag] [one space] [ind1] [ind2] [one space] [value of subfield a] [other subfield constructs]

…where [other subfield constructs] look like

  [pipe character][subfield code][subfield value]

Notice that (a) there’s no leading ‘|a’ before the subfield a value, and (b) there are no spaces between the pipe, the subfield code, and the subfield value for the non-code-a subfields.

Some easy PHP code to produce such a format is as follows. Note that I’m sending it as text (because it’s not MARC) and UTF-8. If you’ve got MARC-8, you’ll have to convert it before sending.

  $m = $this->marcRecord;
  header('Content-type: text/plain; charset=UTF-8');

  echo 'LEADER ', $m->getLeader(), "\n";

  foreach ($m->getFields() as $tag => $val) {
    echo $tag;
    if ($val instanceof File_MARC_Control_Field) {
      // Control field: tag, four spaces, data
      echo '    ', $val->getData(), "\n";
    } else {
      // Data field: tag, space, both indicators, space, then subfields
      echo ' ', $val->getIndicator(1), $val->getIndicator(2), ' ';
      $subs = array();
      foreach ($val->getSubfields() as $code => $subdata) {
        // Subfield a goes out bare; every other subfield gets |code
        $line = '';
        if ($code !== 'a') {
          $line = '|' . $code;
        }
        $subs[] = $line . $subdata->getData();
      }
      echo implode(' ', $subs), "\n";
    }
  }

MARC-HASH: The saga continues (now with even less structure)

After a medium-sized discussion on #code4lib, we’ve collectively decided that…well, ok, no one really cares all that much, but a few people weighed in.

The new format is: A list of arrays. If it’s got two elements, it’s a control field; if it’s got four, it’s a data field.

SO….it’s like this now.

  {
    "type" : "marc-hash",
    "version" : [1, 0],

    "leader" : "leader string",
    "fields" : [
      ["001", "001 value"],
      ["002", "002 value"],
      ["010", " ", " ",
        [
          ["a", "68009499"]
        ]
      ],
      ["035", " ", " ",
        [
          ["a", "(RLIN)MIUG0000733-B"]
        ]
      ],
      ["035", " ", " ",
        [
          ["a", "(CaOTULAS)159818014"]
        ]
      ],
      ["245", "1", "0",
        [
          ["a", "Capitalism, primitive and modern;"],
          ["b", "some aspects of Tolai economic growth"],
          ["c", "[by] T. Scarlett Epstein."]
        ]
      ]
    ]
  }
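
To show how little a consumer has to do with this, here's a sketch in PHP of walking the fields and telling the two kinds apart purely by element count. Nothing here is an official marc-hash library: the function name describeMarcHashFields and the output format are invented for the example, and it assumes the record was read with json_decode($json, true).

```php
<?php
// Hypothetical consumer: classify each marc-hash field by its length.
// Two elements   => control field [tag, value]
// Four elements  => data field    [tag, ind1, ind2, [[code, value], ...]]
function describeMarcHashFields(array $record)
{
    $lines = array();
    foreach ($record['fields'] as $field) {
        if (count($field) === 2) {
            list($tag, $value) = $field;
            $lines[] = $tag . ' (control): ' . $value;
        } elseif (count($field) === 4) {
            list($tag, $ind1, $ind2, $subfields) = $field;
            $parts = array();
            foreach ($subfields as $sub) {
                list($code, $value) = $sub;
                $parts[] = '|' . $code . ' ' . $value;
            }
            $lines[] = $tag . ' [' . $ind1 . $ind2 . '] (data): '
                     . implode(' ', $parts);
        } else {
            throw new UnexpectedValueException('Not a marc-hash field');
        }
    }
    return $lines;
}

// Usage with a cut-down version of the record above:
$json = '{"type":"marc-hash","version":[1,0],"leader":"leader string",'
      . '"fields":[["001","001 value"],'
      . '["245","1","0",[["a","Capitalism, primitive and modern;"]]]]}';
$lines = describeMarcHashFields(json_decode($json, true));
```

Because fields is a list and not a hash keyed by tag, repeated tags (like the two 035s) and field order both survive the round trip.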

MARC-HASH control field, now with less structure

Why do I ever, ever think that MARC might not rely on order? I don’t know.

In any case, control fields will now be just an array of duples:

  control: [
    ['001', 'value of the 001'],
    ['006', 'value of the 006'],
    ['006', 'another 006']
  ]