Archives: March 2009

A plea: use Solr to normalize your data

March 30, 2009 at 4:22 pm | Category: Uncategorized

[Only, of course, if you're using Solr. Otherwise, that'd be dumb.]

We’ve been working on Mirlyn2-Beta, our installation of VuFind, for some time now (don’t let the fancy-pants name scare you off), and the further we get into it, the more obvious it is that I want to move as much data normalization into Solr itself as possible.

Arguments about how much business logic to move into the database layer, in the form of foreign-key requirements, cascading inserts and deletes, stored procedures, etc. are as old as the features themselves. Solid arguments for and against are made on all sides, and like all things, there’s a happy middle ground for most people. [1]

But Solr provides an incredibly compelling use case because it allows for data transformation at both index and query time via the use of custom analyzers (or a standard analyzer with text filters applied). We’re starting to migrate our schema to use more and more of these things, and I even went so far as to create a custom text filter for LCCNs after being inspired by Jonathan Rochkind.

The incentive is easy to see: client diversity. Let a thousand interfaces bloom, if you can give them all access to the same underlying Solr instance. And, seriously, how many times are you going to write that regexp to semi-normalize ISBNs and ISSNs, huh? Enough already.

If you’re using a Solr nightly (and, really, you should be — faceting is so much faster than the official 1.3 release) you have access to regexp-based filters as well, which makes stuff like this really, really easy:

    <!-- Simple type to normalize isbn/issn/other standard numbers -->
    <fieldType name="stdnum" class="solr.TextField" sortMissingLast="true" omitNorms="true">
      <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.TrimFilterFactory"/>
        <filter class="solr.PatternReplaceFilterFactory"
             pattern="^0*([\d\-\.]+[xX]?).*$" replacement="$1"
        />
        <filter class="solr.PatternReplaceFilterFactory"
             pattern="[\-\.]" replacement=""  replace="all"
        />
      </analyzer>
    </fieldType>

Here, we use the KeywordTokenizerFactory which, not so intuitively, emits the entire input as a single token. We then lowercase it and pull off any leading and trailing spaces (Trim).

For those of you that don’t read regexp, we then match anything that looks like:

  1. Any number of leading zeros
  2. …followed by any number of digits, dashes, or periods and an optional ‘X’
  3. …followed by…well, we don’t care. Anything else.

…and throw away all but the stuff in #2. Then take that and throw away all the dashes and dots, and you’re left with a string of digits (plus a trailing ‘x’, if there was one).

The beauty is that it happens both at index time and at query time, so if your user types in “ 123-45-6-X ” it will be normalized to 123456x, and then checked against your index.
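If you'd like to poke at the behavior without reindexing anything, here is a rough Python sketch of the same chain. It is not what Solr runs internally, just an approximation that reuses the two regular expressions from the field type; the function name `normalize_stdnum` and the sample record are made up for illustration, and Java and Python regexps may differ in edge cases.

```python
import re

# The two patterns from the stdnum field type's PatternReplaceFilterFactory entries.
KEEP_NUMBER = re.compile(r"^0*([\d\-\.]+[xX]?).*$")
SEPARATORS = re.compile(r"[\-\.]")


def normalize_stdnum(raw: str) -> str:
    """Approximate the stdnum analyzer chain (hypothetical helper, not a Solr API)."""
    # KeywordTokenizer: the whole input is treated as a single token.
    token = raw.lower().strip()              # LowerCaseFilter, then TrimFilter
    token = KEEP_NUMBER.sub(r"\1", token)    # drop leading zeros and anything after the number
    return SEPARATORS.sub("", token)         # drop dashes and periods


# Toy stand-in for an index: the key is normalized when the record is stored...
index = {normalize_stdnum("0-306-40615-2"): "hypothetical record"}

# ...and the user's input goes through the same function at lookup time.
print(index.get(normalize_stdnum(" 0306406152 ")))
print(normalize_stdnum(" 123-45-6-X "))  # 123456x
```

Because the stored value and the typed value pass through the same function, differences in spacing, hyphenation, and case never reach the comparison, which is the whole point of putting the rule in one place.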

This is simple stuff, and probably doesn’t deserve the virtual ink I’m providing for it, but VuFind out of the box doesn’t do any of this sort of thing (likely because “the box” existed before it was super-easy to do this), and we all should be doing it.

  1. “Most,” in this case, excluding the old-time MySQL fanboys who took it as gospel that all data validation and manipulation belongs in the application layer, because their “database” didn’t do any of it. February 30th in a date field, anyone?

One Response to “A plea: use Solr to normalize your data”

  1. Nice! Can you provide your lccn normalization routines too?

Enough with the freakin’ LC Call Number normalization!

OK. I’m done with it, and this time I mean it.

I’ve updated and improved the LC call number normalization code, documented the algorithm, and put it all into Google Code. In the next couple of weeks, I’ll be turning it into a Solr text filter so we can do some decent sorting on call-number search results.

2 Responses to “Enough with the freakin’ LC Call Number normalization!”

  1. Thanks for sticking it up on the web. I suspect Blacklight will want that at some point.

  2. Naomi Dushay says:

    I wrote a bunch of LC parsing (and Dewey, too!) to get to shelving keys. It’s in the CallNumUtils of the solrmarc project.