Archives: August 2009

[Kind of part of a continuing series on our VUFind implementation; more of a sidebar, really.]

In my last post I made the case that you should put as much data normalization into Solr as possible. The built-in text filters will get you a long, long way, but sometimes you want to have specialized code, and then you need to build your own filter.

Huge Disclaimer: I’m putting this up not because I’m the best person to do so, but because it doesn’t look as if anyone else has. I don’t know what I’m doing. I don’t know why the code I’m showing below is the way it is, and if anyone would like to make it better, that’d be great. This is basically just a lot of pattern-matching on my part.

[A second disclaimer: I haven't actually built this into Solr yet, although I've done some simple testing on the ISBN-13 checksum code. I'll remove this disclaimer when I get a chance to actually index some data with it.]

The Setup: An ISBN-10 to ISBN-13 converter

Last time, I said I didn’t know why I hadn’t put together an ISBN longifier yet. So let’s walk through it.

This is a lot easier than most things in that I’m assuming we’re going to be getting exactly one token to work with (via the KeywordTokenizer) and can just work on it with impunity.

If you’d like to follow along, get the solr source via svn on a machine with java and ant. And junit, I think.

Where to put stuff

Of all the black magic associated with doing this, figuring out how to actually make it build is the part that’s probably easiest for Java-heads and the most confusing to the rest of us. Anyone attempting this sort of thing should probably get a good grounding in how Solr is set up and how its build system works before doing anything else.

Me? I cheated.

I basically just copied the directory structure of another project in the config directory in the solr root (looks like maybe it was velocity), did some tiny modifications to the build.xml file to change the name of the project, renamed the ‘.pom’ file and edited it in the obvious ways, and followed the copied directory structure to figure out where to put my files.

And then it worked. And I didn’t ask any questions, and metaphorically just backed away slowly with a nonchalant look on my face. Of course, if you know what you’re doing with java and ant, I’m sure there are better ways.

For the record, the directory in solr/config/umichnormalizers (where I put this stuff) would look something like this by the end of this project:

./target/
./build.xml
./src/main/java/edu/umich/lib/normalizers/ISBNLongifier.java
./src/test/java/edu/umich/lib/normalizers/ISBNLongifier.java
./src/main/java/edu/umich/lib/solr/analysis/ISBNLongifierFilter.java
./src/main/java/edu/umich/lib/solr/analysis/ISBNLongifierFilterFactory.java

You then just run ant in your config directory to generate a .jar file that can be put in solrmarc’s lib directory or (I think) jetty’s lib directory. You can also just run ant dist at the solr root level to get a .war file with your stuff embedded.

The converter

First, you just need some basic code to actually do the conversion. I’m sure this is hideously inefficient, but probably not as inefficient as the actual filter I’ll be producing in a minute.

We take in a string. If it looks like it might have a 10-digit ISBN in it (possibly with dashes or periods as delimiters), extract it, do the conversion to an ISBN-13, and return that as a 13-character string (i.e., no dashes or whatnot).

Note that I’m not working hard to determine if it’s an ISBN; this isn’t designed to try to pull an ISBN from random text. The hope is that by the time you get this far you’ve already got a pretty good idea that you’ve got an ISBN on your hands. I’m also not checking to see if the incoming ISBN is valid in any way; that’s left as an exercise for the diligent reader.
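For anyone who does want to take up that exercise, here is a minimal sketch of an ISBN-10 validity check. It is not part of the converter in this post: the class name `Isbn10Validator` and its method are made up for illustration, and it assumes dashes and dots have already been stripped. The rule is the standard ISBN-10 one: weight the ten characters 10 down to 1 (a trailing X counts as 10), and the total has to be divisible by 11.

```java
// Hypothetical helper for illustration; not part of the original converter.
final class Isbn10Validator {

    // Returns true if the ten characters form a valid ISBN-10.
    // Assumes delimiters (dashes, dots) have already been removed.
    static boolean isValid(String isbn10) {
        if (isbn10 == null || !isbn10.matches("\\d{9}[\\dXx]")) {
            return false;
        }
        int sum = 0;
        for (int i = 0; i < 10; i++) {
            char c = isbn10.charAt(i);
            // A trailing X stands for the value 10
            int value = (c == 'X' || c == 'x') ? 10 : Character.digit(c, 10);
            // Weights run 10, 9, 8, ... 1 from left to right
            sum += value * (10 - i);
        }
        return sum % 11 == 0;
    }

    public static void main(String[] args) {
        System.out.println(isValid("0306406152")); // true
        System.out.println(isValid("0306406153")); // false
    }
}
```

Whether a bad check digit should block the conversion or just get logged is a judgment call; with data as messy as ours, logging is probably the kinder option.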

    package edu.umich.lib.normalizers;

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class ISBNLongifier {

      // Dashes and dots are acceptable delimiters. Should we add spaces??
      private static final String ISBNDelimiterPattern = "[\\-\\.]";

      // Look for a string of nine digits followed by another digit or an X
      private static final Pattern ISBNPattern = Pattern.compile("^.*?(\\d{9})[\\dXx].*$");

      // True if the input, minus delimiters, contains something shaped like an ISBN-10
      public static boolean matches(String isbn) {
        isbn = isbn.replaceAll(ISBNDelimiterPattern, "");
        Matcher m = ISBNPattern.matcher(isbn);
        return m.matches();
      }

      public static String longify(String isbn) {
        isbn = isbn.replaceAll(ISBNDelimiterPattern, "");
        Matcher m = ISBNPattern.matcher(isbn);
        if (!m.matches()) {
          throw new IllegalArgumentException(isbn + ": Not an ISBN");
        }

        // Drop the old check digit and stick the 978 prefix on the front
        String longisbn = "978" + m.group(1);

        // ISBN-13 weights alternate 1, 3, 1, 3, ... across the first twelve digits
        int sum = 0;
        for (int i = 0; i < 12; i++) {
          int digit = Character.digit(longisbn.charAt(i), 10);
          sum += digit * (i % 2 == 0 ? 1 : 3);
        }

        // The check digit is the distance to the next multiple of ten
        // (or 0 if the sum is already a multiple of ten)
        int check = (10 - (sum % 10)) % 10;
        return longisbn + check;
      }
    }
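The checksum arithmetic is easy to verify by hand. Take the ISBN-10 0306406152: drop its check digit, prepend 978, and you have 978030640615. Weighting those twelve digits 1, 3, 1, 3, ... gives 9 + 21 + 8 + 0 + 3 + 0 + 6 + 12 + 0 + 18 + 1 + 15 = 93, and the distance from 93 to the next multiple of ten is 7, so the ISBN-13 is 9780306406157. Below is a standalone sketch of just that step, so it can be run without Solr anywhere in sight; the class name `Isbn13CheckDigit` is made up for illustration.

```java
// Standalone sketch for checking the arithmetic; the class name is hypothetical.
final class Isbn13CheckDigit {

    // Computes the ISBN-13 check digit for the first twelve digits.
    static int compute(String first12) {
        if (first12 == null || !first12.matches("\\d{12}")) {
            throw new IllegalArgumentException(first12 + ": need exactly twelve digits");
        }
        int sum = 0;
        for (int i = 0; i < 12; i++) {
            int digit = Character.digit(first12.charAt(i), 10);
            // Weights alternate 1, 3, 1, 3, ... from the left
            sum += digit * (i % 2 == 0 ? 1 : 3);
        }
        // Distance to the next multiple of ten; 0 when already on one
        return (10 - (sum % 10)) % 10;
    }

    public static void main(String[] args) {
        // ISBN-10 0306406152 becomes "978" + its first nine digits + a new check digit
        System.out.println("978030640615" + compute("978030640615")); // 9780306406157
    }
}
```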

The Factory Object

Next is a boilerplate factory object. The only change will be the package you put it in, and the last method’s name and return value.

    package edu.umich.lib.solr.analysis;

    import java.util.Map;
    import org.apache.solr.analysis.BaseTokenFilterFactory;
    import org.apache.lucene.analysis.TokenStream;

    public class ISBNLongifierFilterFactory extends BaseTokenFilterFactory {
      Map<String,String> args;

      public Map<String,String> getArgs()
      {
        return args;
      }
      public void init(Map<String,String> args)
      {
        this.args = args;
      }
      public ISBNLongifierFilter create(TokenStream input)
      {
        return new ISBNLongifierFilter(input);
      }
    }

The actual filter

And, finally, the filter class. You’ll notice that I’m catching any illegal argument error and just returning the input unchanged. So anything that comes through that isn’t an ISBN just gets passed along.

    package edu.umich.lib.solr.analysis;

    import edu.umich.lib.normalizers.ISBNLongifier;
    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import java.io.IOException;

    public final class ISBNLongifierFilter extends TokenFilter {

      public ISBNLongifierFilter(TokenStream in) {
        super(in);
      }

      public Token next() throws IOException {
        return normalize(this.input.next());
      }

      public Token next(Token result) throws IOException {
        // Hand the reusable token down the chain instead of ignoring it
        return normalize(this.input.next(result));
      }

      public Token normalize(Token t) {
        if (null == t || null == t.termBuffer() || t.termLength() == 0) {
          return t;
        }
        // termBuffer() can be longer than the term itself, so read only termLength() chars
        String val = new String(t.termBuffer(), 0, t.termLength());
        try {
          t.setTermBuffer(ISBNLongifier.longify(val));
          return t;
        } catch (IllegalArgumentException e) {
          // Not an ISBN: pass it through unchanged
          return t;
        }
      }
    }

How to use it

Assuming you’ve managed to get it built into Solr and then deployed, just define it as a type in your schema.xml:

    <fieldType name="isbnlongifier" class="solr.TextField" omitNorms="true">
      <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="edu.umich.lib.solr.analysis.ISBNLongifierFilterFactory"/>
      </analyzer>
    </fieldType>

    <!-- and later... -->

    <field name="isbn" type="isbnlongifier" indexed="true" stored="false" multiValued="true"/>

Conclusion

There it is. The rocket science is all hidden behind the import statements. My understanding is that casting the token value to/from Strings makes things horribly inefficient, but I’m pretty sure I’ve got bigger bottlenecks to tackle before worrying about this.

3 Responses to “Building a solr text filter for normalizing data”

  1. [...] easy to set up datastore-level indexing and querying filters with built-in facilities and/or custom code. This allows me to build clients that call it without having to worry about manipulating the input [...]

  2. This is very helpful, thanks.

    Do you know, with the filter attached to the field as you show in your example, will it be used for both indexing and querying? Ideally one would want to not only normalize input on indexing, but also normalize input in a query, so, in this example for instance, someone can enter a 10 digit ISBN and still match the equivalent 13 digit ISBN.

  3. Bill says:

    Any analyzer that isn’t specifically marked as query or index does both — so this will modify input on index and on query.


Easy Solr types for library data

August 19, 2009 at 4:19 pm. Category: Uncategorized

[Yet another bit in a series about our Vufind installation]

While I’m no longer shocked at the terrible state of our data every single day, I’m still shocked pretty often. We figured out pretty quickly that anything we could do to normalize data as it went into the Solr index (and, in fact, as queries were produced) would be a huge win.

There’s a continuum of attitudes about how much “business logic” belongs in the database layer of any application. Some folks (including super-high throughput sites, but mostly people who have never used anything but MySQL) tend to put no logic into the database. I’ve always edged over the middle to the other side of that debate, preferring to let the database do type-checking and conversions and track foreign keys and the like.

Solr, while not a traditional RDBMS, offers this type of functionality in its text filters. One can pipe data through a few standard filters, or write a custom one in Java if need be. The nice part is that it applies at index and query time. One obvious application, which I somehow haven’t bothered to write yet, is to convert all ISBNs to 13-character new-style ISBNs upon both index and query. That way, you don’t care if your original records had the short or long form; all the data gets converted no matter how it comes in.

Our standard text field, for example, is similar to the one in the default schema.xml, running text through the following filters:

  • UnicodeNormalization to normalize unicode composition and (optionally) remove diacritics
  • StopFilter to ignore stopwords in a separate file
  • WordDelimiter to do intelligent word delineation
  • LowerCase to…you know…lowercase everything
  • EnglishPorter to do stemming
  • RemoveDuplicates to do what it says

And because it happens on index and on query, everything works out.

We’re running Solr basically from trunk — whenever we need to change something, I pull down a fresh svn copy, put in our local changes to make sure it all works, and then deploy — so I have access to stuff slated for Solr 1.4, including most importantly Trie fields and the PatternReplaceFilterFactory.

The stdnum type

One of the first things we defined was a “stdnum” type, to deal with supposedly-unique identifiers, possibly with embedded dashes and dots and leading/trailing nonsense. Here’s a variant.

    <fieldType name="stdnum" class="solr.TextField" sortMissingLast="true" omitNorms="true" >
      <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.TrimFilterFactory"/>
        <filter class="solr.PatternReplaceFilterFactory"
             pattern="^[\D]*([\d\-\.]+x?).*$" replacement="$1"
        />
        <filter class="solr.PatternReplaceFilterFactory"
             pattern="[^\dx]" replacement=""  replace="all"
        />
        <filter class="solr.PatternReplaceFilterFactory"
             pattern="^0+" replacement=""  replace="all"
        />
      </analyzer>
    </fieldType>

Let’s walk through it. It could probably be done in one go, but solr is not our bottleneck at this point…

  • We start by defining it as a TextField because it’s the only type that can take filters.
  • We then declare that instead of the standard tokenizer, we’re using the KeywordTokenizer. Confusingly, the KeywordTokenizer doesn’t tokenize in the traditional sense — it just returns the whole input as a single token.
  • Lowercase it.
  • Trim spaces off both ends
  • Skip any leading non-digits, find a string of numbers, dashes, and dots, with optional x at the end, and skip everything after it.
  • Remove anything left that isn’t a digit or an ‘x’.
  • Remove leading zeros, if you’ve got ‘em.

The net effect is a trimmed string that has only digits (with an optional trailing ‘x’) and removes any leading zeros.
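To see what the chain does to real input, it can help to replay it outside Solr. The sketch below mimics the same steps with plain `java.util.regex`; it is an approximation for experimenting, not how Solr runs the filters. The class name `StdnumSketch` and the sample inputs are made up, and lowercasing with `Locale.ROOT` is an assumption on my part.

```java
import java.util.Locale;

// Hypothetical stand-in for the stdnum analyzer chain, for experimenting only.
final class StdnumSketch {

    static String normalize(String raw) {
        // LowerCaseFilter + TrimFilter
        String s = raw.toLowerCase(Locale.ROOT).trim();
        // Skip leading non-digits, keep the run of digits/dashes/dots (plus optional x)
        s = s.replaceAll("^[\\D]*([\\d\\-\\.]+x?).*$", "$1");
        // Throw away anything that isn't a digit or an x
        s = s.replaceAll("[^\\dx]", "");
        // Strip leading zeros
        s = s.replaceAll("^0+", "");
        return s;
    }

    public static void main(String[] args) {
        System.out.println(normalize(" ISBN: 0-306-40615-2 (pbk.)")); // 306406152
        System.out.println(normalize("ISSN 1234-567X"));              // 1234567x
        System.out.println(normalize("ISSN2: 1234567X (online)"));    // 2
    }
}
```

Two things stand out. The leading zero of an ISBN disappears, which is harmless only because the same thing happens to the query. And a stray digit in the leading junk (the "2" in "ISSN2") is taken as the whole number.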

We use this “stdnum” field for ISBNs and ISSNs (and I think OCLC numbers) and it should work for any messy numerics you might have lying around. If you wanted to, you could change the regexp to enforce a minimum string of digits so it doesn’t get confused by any leading nonsense, e.g., “ISSN2: 1234567X (online)”. But if your data are that bad, you may have bigger problems to worry about.

textProper type

We define a textProper type that is exactly the same as the default text type, but without the stemming and synonyms. In the presence of stemming, exact matches and stemmed matches count the same toward relevancy (e.g. row and rowing). We had plenty of examples where exact results were getting overridden by the stemmed results, and this is confusing.

So for most of our important fields, we index them as both text and textProper so we can apply different weights to searches against them.

By the way, don’t forget to make sure your authors are in a textProper type; you don’t want stemming on author names!

exactmatcher type

The name exactmatcher is a red herring, of course. It’s not an exact matcher. It just strips out all the delimiters so we can pretend it’s an exact match.

    <fieldType name="exactmatcher" class="solr.TextField" omitNorms="true">
      <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="schema.UnicodeNormalizationFilterFactory" version="icu4j" composed="false" remove_diacritics="true" remove_modifiers="true" fold="true"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.TrimFilterFactory"/>
        <filter class="solr.PatternReplaceFilterFactory"
             pattern="[^\p{L}\p{N}]" replacement=""  replace="all"
        />
      </analyzer>
    </fieldType>

That’s it. Lowercase it, normalize the unicode, and pull out everything that’s not a (unicode) letter or number.

Note that we’re still using KeywordTokenizerFactory — we’re getting exactly one token out of this thing. That means that the query input either matches (as one string) or it doesn’t.

Here’s how we use it:

  • Control numbers: Our controlnums (old ids, that sort of thing), report numbers, sdr numbers (related to HathiTrust), the HathiTrust ID
  • Callnumbers: I also try to normalize LC, but this helps people find everything else
  • Titles: in addition to a regular tokenized title, we index the 245a (as title_a) and the 245ab (as title_ab). If someone types in an exact match for either of them, we shoot the relevancy through the roof (more so for the title_ab than the title_a, obviously). This makes known item searching a little less painful.
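Here is a rough sketch of how a client might put those title fields together into one boosted query. `title_a` and `title_ab` come from the list above; the tokenized field name `title`, the boost values, and the class itself are placeholders I made up, not our actual search specification. The input is sent as a quoted phrase so the query parser hands the whole string to the exactmatcher analyzer as a single token.

```java
// Hypothetical query builder; field boosts are invented for illustration.
final class TitleQuerySketch {

    // Wrap the input in double quotes, escaping backslashes and embedded quotes
    static String quote(String s) {
        return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
    }

    // Exact hits on title_ab outrank exact hits on title_a,
    // which in turn outrank ordinary tokenized title matches
    static String build(String userInput) {
        String phrase = quote(userInput.trim());
        return "title_ab:" + phrase + "^1000"
             + " OR title_a:" + phrase + "^500"
             + " OR title:" + phrase;
    }

    public static void main(String[] args) {
        System.out.println(build("Nature"));
        // title_ab:"Nature"^1000 OR title_a:"Nature"^500 OR title:"Nature"
    }
}
```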

wildcard searching

One downside of using all these filters is that Solr ignores filters when doing wildcard searches. There is a patch floating around that uses an analyzing query parser for wildcard searches, but I haven’t had time to fiddle around with it.

One thing you can do is to do the exact same normalization in your calling code and then throw a ‘*’ on the end of it. The data are in the index, after all — you just have to do the filtering yourself. For example, for a cheap and easy “Title starts with” search, you can do the same normalization in PHP or Ruby or whatever as we do in the Solr exactmatcher type, drop a ‘*’ on the end of it, and query against the exactmatcher version of your title. Voila.
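As a sketch of that client-side trick, here it is in Java rather than PHP or Ruby. It only approximates the exactmatcher chain: decomposing and dropping combining marks stands in for the UnicodeNormalization filter, whose folding options are not reproduced here. The class name and the field name `title_exact` are placeholders.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical client-side approximation of the exactmatcher analyzer.
final class ExactMatcherClient {

    static String normalize(String input) {
        // Decompose, then drop combining marks (roughly: remove diacritics)
        String decomposed = Normalizer.normalize(input, Normalizer.Form.NFD);
        String noMarks = decomposed.replaceAll("\\p{M}", "");
        // Lowercase and trim, as the Solr chain does
        String lowered = noMarks.toLowerCase(Locale.ROOT).trim();
        // Keep only (unicode) letters and numbers
        return lowered.replaceAll("[^\\p{L}\\p{N}]", "");
    }

    // Builds a "starts with" wildcard query against an exactmatcher field
    static String startsWith(String field, String input) {
        return field + ":" + normalize(input) + "*";
    }

    public static void main(String[] args) {
        System.out.println(startsWith("title_exact", "Les Mis\u00e9rables: a novel"));
        // title_exact:lesmiserablesanovel*
    }
}
```

If the input normalizes to an empty string the result is a bare `*`, so a real client would want to check for that before sending the query.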

Custom filters

Regular expressions can get you ridiculously far, but for a couple cases it’d be nice to have custom code running. I’ve already mentioned that we should be upcasting all our ISBNs to the 13-character variant. The other two areas where I do this are to normalize LCCNs and to badly normalize LC CallNumbers. I’ll talk about both soon.

5 Responses to “Easy Solr types for library data”

  1. Andy says:

    Bill, thanks for posting this whole series of articles. You’re pulling together useful, real-world examples which can be hard to find.

  2. robcaSSon says:

    indeed, very good stuff….we’re already doing a few of the same things with our solr indexing, but great to have someone post these.

  3. [...] (more important to me) it’s easy to set up datastore-level indexing and querying filters with built-in facilities and/or custom code. This allows me to build clients that call it without having to worry about [...]

  4. Vladimir says:

    Ballo Bill, Thank you for exiting tutorial. We have already a solution for a extended wildcard search. You can download the new parser here: http://markmail.org/message/6dsipdkir5vscb3o

    Sincerely, Vladimir

  5. I’m just getting into using Solr with (non-MARC) library data. Thanks for posting these field types! So useful, and much more sophisticated than my first clumsy attempts at custom field types. :)


Going with and “forking” VUFind

August 19, 2009 at 12:09 am. Category: Uncategorized

Note: This is the second in a series I’m doing about our VUFind installation, Mirlyn. Here I talk about how we got to where we are. Next I’ll start looking at specific technologies, how we solved various problems, and generally more nerd-centered stuff.

When the University Library decided to go down the path of an open-source, solr-based OPAC, there were (and are, I guess) two big players: VUFind and Blacklight.

I wasn’t involved in the decision, but it must have seemed like a no-brainer. VUFind was in production (at Villanova), seemed to be building a community of similar institutions around it (e.g., Stanford), and was based on a technology stack we had some experience with (PHP). Blacklight seemed to be just getting off to a fitful start, and its Ruby stack was at that time an iffy proposition (this was before any sort of major adoption of Passenger or JRuby).

As I write this, things have flipped around a little. Andrew Nagy, the principal architect of VUFind, left Villanova for Serial Solutions and VUFind stopped being his primary focus. The Blacklight community decided to go with a major reorganization of the code to make it easier to deploy, which resulted in a flurry of refactoring and improvements and folks generally thinking things through really well. Stanford just flipped the switch from their VUFind to a Blacklight installation, and as I pointed out, the Ruby deployment options are more stable and less resource-hungry than they were back then. If the decision were being made today, it would be a much more complex analysis.

But anyway, the decision was made, and Tim Prettyman and I were tapped to do most of the hardcore nerd work to make it suitable for our environment.

Right away, I found things that would need some pretty major revision. The user model was based on a local database of logins (we use cosign), even moderately-long search strings would crash the thing, cookies were being used instead of sessions and hitting the 4K limit, search specifications were hardcoded in the PHP, and lots of the UI elements didn’t actually have working code behind them (RSS feeds, endnote export, spellcheck, etc).

So, I dug in and started learning PHP and Smarty and refactoring/rewriting/rearchitecting the crap out of it. One of the first things I did was to extract the search specification — the mapping of, say, a ‘title’ search to a weighted search of six or seven actual Solr fields — into a yaml file so we could mess around with it more easily than modifying the giant case-statement in the PHP code. I built a patch against the then-current revision, filed it as a bug, and sent email to the list.

And nothing happened. That patch is still sitting there, in fact. Maybe I’m the only one that thinks it’s useful. But in any case, there was no discussion of it, no one rejected it. It just sat. Sits. Whatever.

I could have asked for write access to the repository, but I didn’t. I saw a few other patches get submitted and met with yawns all around, and started looking more closely at the list and saw pretty much no one doing anything with the then-current code base, and frankly kind of gave up. The folks that I knew were working actively on implementing VUFind — us, Australia and Alan Rykhus at MNPals — were all working from very different code bases, which made our ability to share code very limited. Any sort of official work on VUFind seemed to have slowed to a near standstill (based on svn checkins), and almost no one else seemed interested in submitting patches. After a while, we stopped, too.

So, we didn’t really fork VUFind. We just rewrote much of it and stopped trying to generate interest in our changes. The right thing to do would have been to either grab the bull by the horns, or do an actual fork of the project. But we didn’t feel as if we had time to shepherd a project of this size, and after many, many (many) discussions, decided to just do our thing. I assume that’s what everyone else has done, too, since I see plenty of differences in how things work at the different sites.

As it stands, the wiki shows a good handful of libraries live with VUFind, and a bunch more marked as being in “beta.” I don’t know if what we’re running Mirlyn on is still enough VUFind to be called VUFind. Probably. The basic structure is the same, the search syntax as exposed in the URL is the same. The plumbing underneath is changed in a lot of ways, and I like to think the flow of control makes a little more sense now.

In real life, of course, it doesn’t matter where you draw the line. Our code is far enough removed from the svn repository now that we’re essentially going it alone.

That doesn’t bother me.

The reality is that we’ve taken control of the UI and learned what we need to know about using Solr with our data. If I need to change the backend — to Blacklight, to a newer VUFind, to anything — my users need not ever know, other than to notice that things are a little bit better. If we end up moving to a release-quality version of VUFind, there’s almost nothing I can’t reuse if it makes sense.

We’ve also learned a lot. Solr, obviously, and how to write text filters for it and push it around just a little bit. Solrmarc, too. But we’ve also taken a hard look at data normalization in ways we haven’t before, and decided how we’re going to output to Refworks, and to email, what kinds of searches we want to offer, where we have collisions in ID namespaces (OCLC & ISSN, I’m looking at you).

We’ve discovered issues and problems with our data we’d have never seen otherwise, and started up whole sets of conversations about OPAC issues that used to languish for lack of a reification for reference. The ability to actually (try to) implement the collective intelligence of the library and embody it in a public-facing system is a rush compared to fighting with the ILS.

The system has tons of problems still, starting with underlying templates that will make you a little sick if you do a “view source” and going right through my call number search not working for some edge cases. But that stuff will get cleaned up as we get a little downtime from adding new features, and there are elements of the new backend code that could be useful to others once I clean them up and remove local dependencies.

I’m not sure when, if ever, we’ll start thinking of ourselves as part of the “VUFind community” again. The heavy intellectual lifting about how to organize what is essentially a front-end for Solr doesn’t seem to be happening on the VUFind list. And to be honest, I’m not sure it should be. Solr is the real engine. Solrmarc is, for us right now, an important piece. Data normalization, translation, workarounds for crappy data, and the basic information theory of a faceted search system are all independent of the particular middleware you’re using to grab Solr results and throw them up on the screen.

So, what we have is good for us, for now, and we’re continuing to learn how to move forward. And I’ve been able to get bug reports and say, “Thanks, Fixed” fifteen minutes later and get warm fuzzy feelings that don’t usually accompany, “Thanks. I’ll put a request in at Ex Libris’ online ticket system”.

Next time: using and abusing Solr for data normalization.

3 Responses to “Going with and “forking” VUFind”

  1. till says:

    I think we have taken the same road with our “Suchkiste” project based on VuFind. VuFind was a convenient user interface for our Solr index that we could deploy quickly to have some kind of prototype interface to show our ideas (the Solr XML interface is not that sexy in public demonstrations :-). And I think we experienced similar disappointment with the state of the VuFind community and so decided to do our own thing as well. Today I think, that is wrong. It just doesn’t make sense, that we all redundantly fix the same issues in our VuFind based projects (and there is still a lot to fix). And I think, it is a good idea to commit new features and improvements back to the main trunk to ensure their sustainability. You are right, that did not work in the past. But on the other hand: We can’t complain about a missing community around VuFind. We are the ones that form (or not) that community. Of course you need to put some efforts into engagement in a community, but I think that pays off by what you may get back. And I feel, just at the moment there is a chance to make VuFind a real community project. But that depends on us individuals.

  2. I enjoy your updates on “vuFork” (lol). Kudos to you and your colleagues for proceeding on this!

    I really like your very quotable statement:

    “The ability to actually (try to) implement the collective intelligence of the library and embody it in a public-facing system is a rush compared to fighting with the ILS.”

    Sums up things nicely..

  3. [...] Open Source Software, VuFind has several forks and this seems to be the community choice, see Going With and Forking VuFind for more details .This is not the first fork of Koha – there is already Koha Plus at [...]


Sending unicode email headers in PHP

August 17, 2009 at 3:22 pm. Category: Uncategorized

I’m probably the last guy on earth to know this, but I’m recording it here just in case. I’m sending record titles in the subject line of emails, and of course they may be unicode. The body takes care of itself, but you need to explicitly encode a header like “Subject.”

    require_once 'Mail.php';  // PEAR Mail

    $headers['To'] = $to;
    $headers['From'] = $from;
    // The body can go out as raw UTF-8...
    $headers['Content-Type'] = "text/plain; charset=utf-8";
    $headers['Content-Transfer-Encoding'] = "8bit";
    // ...but a header has to be wrapped as an encoded word: =?charset?B?base64?=
    $b64subject = "=?UTF-8?B?" . base64_encode($subject) . "?=";
    $headers['Subject'] = $b64subject;

    $mail =& Mail::factory('sendmail', array('host' => $host,
                                             'port' => $port));
    $retval = $mail->send($to, $headers, $body);
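The wrapper around the subject is the "encoded word" form from RFC 2047: `=?charset?B?base64-of-the-bytes?=`. For comparison, here is a minimal Java sketch of the same encoding step. The class name is made up, and it skips one thing the RFC asks for: an encoded word is limited to 75 characters, so a long title would need to be split across several of them.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper showing the RFC 2047 "B" encoding of a header value.
final class EncodedSubject {

    // Wraps the UTF-8 bytes of the subject as =?UTF-8?B?...?=
    // Note: does not split long subjects into multiple encoded words.
    static String encode(String subject) {
        byte[] utf8 = subject.getBytes(StandardCharsets.UTF_8);
        return "=?UTF-8?B?" + Base64.getEncoder().encodeToString(utf8) + "?=";
    }

    public static void main(String[] args) {
        System.out.println(encode("Caf\u00e9")); // =?UTF-8?B?Q2Fmw6k=?=
    }
}
```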


For the last few months, I’ve been working on rolling out a ridiculously modified version of Vufind, which we just launched as our primary OPAC, Mirlyn, with a slightly-different version powering catalog.hathitrust.org, a temporary metadata search on the HathiTrust data until the OCLC takes it over at some undetermined date.

(Yeah, the HathiTrust site is a lot better looking.)

[Our Aleph-based catalog lives on at mirlyn-classic. I'll be interested to see how the traffic on the two differs as time goes on.]

I’m going to spend a few posts talking about how and why we essentially forked vufind, what sorts of modifications I made, and what technologies I hope to extract from our implementation that may be useful to the wider library community. And, I’m sure, a lot about why I hate Solr, why I love love love Solr, why I hate PHP, and why I love…er…no, I still hate PHP.

Credit where it’s due

And… a little credit where it’s due. I did a lot, but I didn’t do it all. I probably didn’t even do most of it. Half the effort, including all the heavy Aleph lifting (from getting the MARC out with all the filters and expansions we needed, to pulling holdings in real time, to grabbing a patron’s current checked-out items and holds, to fighting the inevitably-scarring battle with ILL), was done by Tim Prettyman. Suzanne Chapman lent her expertise to make it a lot less ugly and more usable than it once was (you can see her talents more strongly expressed at the HathiTrust catalog). And a whole horde of librarians were tapped by my boss, Jon Rothman, to try to figure out how to deal with the MARC data and facets and everything else that required a much deeper understanding of our data than I possess.

Non-stock user-facing features

In the next post, I’ll start with a look at how and why we changed the backend and what I’d do differently if I were starting from scratch. But right now, a quick list of the user-facing stuff that you might find interesting.

  • Email and export searches and search results, as opposed to just individual records.
  • Working endnote and refworks export.
  • Multi-select on the advanced search (e.g., pick two languages to get English OR German).
  • Publication date-range searching (with date-added-to-catalog searching coming soon).
  • A “sticky” institution selection, so each campus can choose to default to searching just their own stuff. We sniff IPs to set a default, too.
  • A “call number starts with” search based on semantics for LC searches (e.g., searching on CA11 won’t find CA1105), with call number range searching in testing now.
  • Contracted holdings for long lists of serials (see, e.g., Nature).
  • [Coming soon] Selecting records to a temporary set, which can be manipulated en masse (sent to Refworks, etc.). I’ll be hooking this up to mTagger, our home-grown bookmarking and tagging tool, later on.

Of course, I also broke some things. I haven’t added back in Search History, but will do so when I’ve got a couple hours. “Search Within” will make a comeback soon, too, but there are usability issues to contend with. And …for the love of god, don’t do a “View Source.” It’s the ugliest HTML underpinnings I’ve been associated with since 1993 or so.

All in all, though, it’s not bad work, and I’m glad to be able to offer it to our patrons.

2 Responses to “Rolling out UMich’s “VUFind”: Introduction and New Features”

  1. Dean says:

    Hi Bill, Great to see this post. Would you mind elaborating in a future post how you made the exact title matching work? ie. Nature, Science, Cell?

    Thanks!

  2. Bill says:

    Sure thing. If anyone else has stuff they’d like to hear about sooner rather than later, drop a comment here or email me.

Leave a Reply