Easy Solr types for library data

August 19, 2009 at 4:19 pm | Category: Uncategorized

[Yet another bit in a series about our Vufind installation]

While I’m no longer shocked at the terrible state of our data every single day, I’m still shocked pretty often. We figured out pretty quickly that anything we could do to normalize data as it went into the Solr index (and, in fact, as queries were produced) would be a huge win.

There’s a continuum of attitudes about how much “business logic” belongs in the database layer of any application. Some folks — including super-high throughput sites, but mostly people who have never used anything but MySQL — tend to put no logic into the database. I’ve always edged over the middle to the other side of that debate, preferring to let the database do type-checking and conversions and track foreign keys and the like.

Solr, while not a traditional RDBMS, offers this type of functionality in its text filters. One can pipe data through a few standard filters, or write a custom one in Java if need be. The nice part is that it applies at both index and query time. One obvious application, which I somehow haven’t bothered to write yet, is to convert all ISBNs to 13-character new-style ISBNs upon both index and query. That way, you don’t care if your original records had the short or long form; all the data gets converted no matter how it comes in.
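A minimal sketch of that ISBN conversion, in Python rather than as a Solr filter. The function name `to_isbn13` is made up for this example, and it assumes the input is a single ISBN with optional dashes or spaces; it does not validate the ISBN-10 check digit, it just drops it and computes the ISBN-13 one.

```python
import re

def to_isbn13(raw):
    """Return the 13-digit form of an ISBN, or None if it doesn't look like one."""
    # Keep only digits and the 'X' that can appear as an ISBN-10 check digit.
    chars = re.sub(r"[^0-9Xx]", "", raw).upper()
    if len(chars) == 13 and chars.isdigit():
        return chars
    if len(chars) == 10 and chars[:9].isdigit():
        # Prefix with 978, drop the old check digit, compute the new one.
        body = "978" + chars[:9]
        total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(body))
        return body + str((10 - total % 10) % 10)
    return None

print(to_isbn13("0-306-40615-2"))   # 9780306406157
print(to_isbn13("9780306406157"))   # 9780306406157
```

Run on both the indexing side and the query side, this gives the "don't care which form came in" behavior described above.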

Our standard text field, for example, is similar to the one in the default schema.xml, running text through the following filters:

  • UnicodeNormalization to normalize unicode composition and (optionally) remove diacritics
  • StopFilter to ignore stopwords in a separate file
  • WordDelimiter to do intelligent word delineation
  • LowerCase to…you know…lowercase everything
  • EnglishPorter to do stemming
  • RemoveDuplicates to do what it says

And because it happens on index and on query, everything works out.

We’re running Solr basically from trunk — whenever we need to change something, I pull down a fresh svn copy, put in our local changes to make sure it all works, and then deploy — so I have access to stuff slated for Solr 1.4, including most importantly Trie fields and the PatternReplaceFilterFactory.

The stdnum type

One of the first things we defined was a “stdnum” type, to deal with supposedly-unique identifiers, possibly with embedded dashes and dots and leading/trailing nonsense. Here’s a variant.

  <fieldType name="stdnum" class="solr.TextField" sortMissingLast="true" omitNorms="true" >
    <analyzer>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.TrimFilterFactory"/>
      <filter class="solr.PatternReplaceFilterFactory"
           pattern="^[\D]*([\d\-\.]+x?).*$" replacement="$1"
      />
      <filter class="solr.PatternReplaceFilterFactory"
           pattern="[^\dx]" replacement=""  replace="all"
      />
      <filter class="solr.PatternReplaceFilterFactory"
           pattern="^0+" replacement=""  replace="all"
      />
    </analyzer>
  </fieldType>

Let’s walk through it. It could probably be done in one go, but solr is not our bottleneck at this point…

  • We start by defining it as a TextField because it’s the only type that can take filters.
  • We then declare that instead of the standard tokenizer, we’re using the KeywordTokenizer. Confusingly, the KeywordTokenizer doesn’t tokenize in the traditional sense — it just returns the whole input as a single token.
  • Lowercase it.
  • Trim spaces off both ends
  • Skip any leading non-digits, find a string of numbers, dashes, and dots, with optional x at the end, and skip everything after it.
  • Remove anything left that isn’t a digit or an ‘x’.
  • Remove leading zeros, if you’ve got ‘em.

The net effect is a trimmed string that has only digits (with an optional trailing ‘x’) and removes any leading zeros.
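To see what the chain does to real strings, here is a rough Python emulation of the same steps. It is only a sketch for experimenting outside Solr: the function name `stdnum` is borrowed from the field type, and `re.ASCII` is used so that `\d` means 0-9 only, as it does in Java regular expressions by default.

```python
import re

def stdnum(value):
    """Approximate the stdnum analyzer chain on a single value."""
    s = value.lower().strip()                      # LowerCase + Trim
    # Skip leading non-digits, keep digits/dashes/dots plus optional x, drop the rest.
    s = re.sub(r"^[\D]*([\d\-\.]+x?).*$", r"\1", s, flags=re.ASCII)
    s = re.sub(r"[^\dx]", "", s, flags=re.ASCII)   # keep only digits and x
    s = re.sub(r"^0+", "", s)                      # drop leading zeros
    return s

print(stdnum(" ISBN: 0-306-40615-2 (pbk.) "))  # 306406152
print(stdnum("1234-567X"))                     # 1234567x
```

Because the same chain runs on the query, a user typing the dashed form with its leading zero still matches the stored value.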

We use this “stdnum” field for ISBNs and ISSNs (and I think OCLC numbers) and it should work for any messy numerics you might have lying around. If you wanted to, you could change the regexp to enforce a minimum string of digits so it doesn’t get confused by any leading nonsense, e.g, “ISSN2: 1234567X (online)”. But if your data are that bad, you may have bigger problems to worry about.
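One way that minimum-length variant might look, again emulated in Python. The threshold of seven characters and the name `stdnum_min` are assumptions for illustration, not what we run; note that the run length counts dashes and dots as well as digits. The first pattern uses only syntax that Java regular expressions also support, so it could in principle be dropped into the first PatternReplaceFilterFactory.

```python
import re

# Lazily skip anything until a run of at least 7 digits/dashes/dots is found.
MIN_RUN_PATTERN = r"^.*?([\d\-.]{7,}x?).*$"

def stdnum_min(value):
    """stdnum-style normalization that ignores short leading digit runs."""
    s = value.lower().strip()
    s = re.sub(MIN_RUN_PATTERN, r"\1", s, flags=re.ASCII)
    s = re.sub(r"[^\dx]", "", s, flags=re.ASCII)
    return re.sub(r"^0+", "", s)

print(stdnum_min("ISSN2: 1234567X (online)"))  # 1234567x
print(stdnum_min("0-306-40615-2"))             # 306406152
```

With the original pattern, the "2" in "ISSN2" is taken as the number and everything after it is thrown away; here it is skipped because it is too short.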

textProper type

We define a textProper type that is exactly the same as the default text type, but without the stemming and synonyms. In the presence of stemming, exact matches and stemmed matches count the same toward relevancy (e.g. row and rowing). We had plenty of examples where exact results were getting overridden by the stemmed results, and this is confusing.

So for most of our important fields, we index them as both text and textProper so we can apply different weights to searches against them.
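As a rough illustration of the weighting, this sketch builds dismax-style request parameters where the unstemmed field counts for more than the stemmed one. The field names `title` and `title_proper` and the boost values are hypothetical, not our actual configuration; `qf` is the standard dismax parameter for listing fields with `^boost` weights.

```python
from urllib.parse import urlencode

def dismax_params(user_query, weights):
    """Build a query string that searches several fields with different boosts."""
    qf = " ".join(f"{field}^{boost}" for field, boost in weights.items())
    return urlencode({"defType": "dismax", "q": user_query, "qf": qf})

# Exact (unstemmed) hits on title_proper outweigh stemmed hits on title.
params = dismax_params("rowing", {"title_proper": 10, "title": 2})
print(params)  # defType=dismax&q=rowing&qf=title_proper%5E10+title%5E2
```

A record containing "rowing" matches both fields and gets both contributions; a record containing only "row" matches just the stemmed field and ranks lower.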

By the way, don’t forget to make sure your authors are in a textProper type; you don’t want stemming on author names!

exactmatcher type

The name exactmatcher is a red herring, of course. It’s not an exact matcher. It just strips out all the delimiters so we can pretend it’s an exact match.

  <fieldType name="exactmatcher" class="solr.TextField" omitNorms="true">
    <analyzer>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="schema.UnicodeNormalizationFilterFactory" version="icu4j" composed="false" remove_diacritics="true" remove_modifiers="true" fold="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.TrimFilterFactory"/>
      <filter class="solr.PatternReplaceFilterFactory"
           pattern="[^\p{L}\p{N}]" replacement=""  replace="all"
      />
    </analyzer>
  </fieldType>

That’s it. Lowercase it, normalize the unicode, and pull out everything that’s not a (unicode) letter or number.

Note that we’re still using KeywordTokenizerFactory — we’re getting exactly one token out of this thing. That means that the query input either matches (as one string) or it doesn’t.

Here’s how we use it:

  • Control numbers: Our controlnums (old ids, that sort of thing), report numbers, sdr numbers (related to HathiTrust), the HathiTrust ID
  • Callnumbers: I also try to normalize LC, but this helps people find everything else
  • Titles: in addition to a regular tokenized title, we index the 245a (as title_a) and the 245ab (as title_ab). If someone types in an exact match for either of them, we shoot the relevancy through the roof (more so for the title_ab than the title_a, obviously). This makes known item searching a little less painful.

wildcard searching

One downside of using all these filters is that Solr ignores filters when doing wildcard searches. There is a patch floating around that will use an analyzing query parser for wildcard searches, but I haven’t had time to fiddle around with it.

One thing you can do is to do the exact same normalization in your calling code and then throw a ‘*’ on the end of it. The data are in the index, after all — you just have to do the filtering yourself. For example, for a cheap and easy “Title starts with” search, you can do the same normalization in PHP or Ruby or whatever as we do in the Solr exactmatcher type, drop a ‘*’ on the end of it, and query against the exactmatcher version of your title. Voila.
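A sketch of that client-side normalization in Python. It approximates the exactmatcher chain (decompose, drop diacritics, lowercase, keep only Unicode letters and numbers) but is not guaranteed to agree with the Solr filters on every character, so treat it as a starting point. The function names and the field name `title_exact` are made up for the example.

```python
import unicodedata

def exactmatch_key(text):
    """Approximate the exactmatcher analyzer: letters and numbers only, lowercased."""
    decomposed = unicodedata.normalize("NFKD", text).lower().strip()
    # Categories starting with L (letter) or N (number) survive; combining marks
    # (category M), punctuation, and spaces are dropped.
    return "".join(ch for ch in decomposed if unicodedata.category(ch)[0] in ("L", "N"))

def starts_with_query(field, user_text):
    """Build a prefix query against a field indexed with the exactmatcher type."""
    key = exactmatch_key(user_text)
    return f"{field}:{key}*" if key else None

print(exactmatch_key("Épître à l'été"))                       # epitrealete
print(starts_with_query("title_exact", "Capitalism, prim"))   # title_exact:capitalismprim*
```

Since the key contains only letters and digits, there is nothing left in it that the query parser would treat as special syntax.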

Custom filters

Regular expressions can get you ridiculously far, but for a couple cases it’d be nice to have custom code running. I’ve already mentioned that we should be upcasting all our ISBNs to the 13-character variant. The other two areas where I do this are to normalize LCCNs and to badly normalize LC CallNumbers. I’ll talk about both soon.

5 Responses to “Easy Solr types for library data”

  1. Andy says:

    Bill, thanks for posting this whole series of articles. You’re pulling together useful, real-world examples which can be hard to find.

  2. robcaSSon says:

    indeed, very good stuff….we’re already doing a few of the same things with our solr indexing, but great to have someone post these.

  3. [...] (more important to me) it’s easy to set up datastore-level indexing and querying filters with built-in facilities and/or custom code. This allows me to build clients that call it without having to worry about [...]

  4. Vladimir says:

    Ballo Bill, Thank you for exiting tutorial. We have already a solution for a extended wildcard search. You can download the new parser here: http://markmail.org/message/6dsipdkir5vscb3o

    Sincerely, Vladimir

  5. I’m just getting into using Solr with (non-MARC) library data. Thanks for posting these field types! So useful, and much more sophisticated than my first clumsy attempts at custom field types. :)

Going with and “forking” VUFind

August 19, 2009 at 12:09 am | Category: Uncategorized

Note: This is the second in a series I’m doing about our VUFind installation, Mirlyn. Here I talk about how we got to where we are. Next I’ll start looking at specific technologies, how we solved various problems, and generally more nerd-centered stuff.

When the University Library decided to go down the path of an open-source, solr-based OPAC, there were (and are, I guess) two big players: VUFind and Blacklight.

I wasn’t involved in the decision, but it must have seemed like a no-brainer. VUFind was in production (at Villanova), seemed to be building a community of similar institutions around it (e.g., Stanford), and was based on a technology stack we had some experience with (PHP). Blacklight seemed to be just getting off to a fitful start, and its Ruby stack was at that time an iffy proposition (this was before any sort of major adoption of Passenger or JRuby).

As I write this, things have flipped around a little. Andrew Nagy, the principal architect of VUFind, left Villanova for Serials Solutions and VUFind stopped being his primary focus. The Blacklight community decided to go with a major reorganization of the code to make it easier to deploy, which resulted in a flurry of refactoring and improvements and folks generally thinking things through really well. Stanford just flipped the switch from their VUFind to a Blacklight installation, and as I pointed out, the Ruby deployment options are more stable and less resource-hungry than they were back then. If the decision were being made today, it would be a much more complex analysis.

But anyway, the decision was made, and Tim Prettyman and I were tapped to do most of the hardcore nerd work to make it suitable for our environment.

Right away, I found things that would need some pretty major revision. The user model was based on a local database of logins (we use cosign), even moderately-long search strings would crash the thing, cookies were being used instead of sessions and hitting the 4K limit, search specifications were hardcoded in the PHP, and lots of the UI elements didn’t actually have working code behind them (RSS feeds, endnote export, spellcheck, etc).

So, I dug in and started learning PHP and Smarty and refactoring/rewriting/rearchitecting the crap out of it. One of the first things I did was to extract the search specification — the mapping of, say, a ‘title’ search to a weighted search of six or seven actual Solr fields — into a yaml file so we could mess around with it more easily than modifying the giant case-statement in the PHP code. I built a patch against the then-current revision, filed it as a bug, and sent email to the list.

And nothing happened. That patch is still sitting there, in fact. Maybe I’m the only one that thinks it’s useful. But in any case, there was no discussion of it, no one rejected it. It just sat. Sits. Whatever.

I could have asked for write access to the repository, but I didn’t. I saw a few other patches get submitted and met with yawns all around, and started looking more closely at the list and saw pretty much no one doing anything with the then-current code base, and frankly kind of gave up. The folks that I knew were working actively on implementing VUFind — us, Australia and Alan Rykhus at MNPals — were all working from very different code bases, which made our ability to share code very limited. Any sort of official work on VUFind seemed to have slowed to a near standstill (based on svn checkins), and almost no one else seemed interested in submitting patches. After a while, we stopped, too.

So, we didn’t really fork VUFind. We just rewrote much of it and stopped trying to generate interest in our changes. The right thing to do would have been to either grab the bull by the horns, or do an actual fork of the project. But we didn’t feel as if we had time to shepherd a project of this size, and after many, many (many) discussions, decided to just do our thing. I assume that’s what everyone else has done, too, since I see plenty of differences in how things work at the different sites.

As it stands, the wiki shows a good handful of libraries live with VUFind, and a bunch more marked as being in “beta.” I don’t know if what we’re running Mirlyn on is still enough VUFind to be called VUFind. Probably. The basic structure is the same, the search syntax as exposed in the URL is the same. The plumbing underneath is changed in a lot of ways, and I like to think the flow of control makes a little more sense now.

In real life, of course, it doesn’t matter where you draw the line. Our code is far enough removed from the svn repository now that we’re essentially going it alone.

That doesn’t bother me.

The reality is that we’ve taken control of the UI and learned what we need to know about using Solr with our data. If I need to change the backend — to Blacklight, to a newer VUFind, to anything — my users need not ever know, other than to notice that things are a little bit better. If we end up moving to a release-quality version of VUFind, there’s almost nothing I can’t reuse if it makes sense.

We’ve also learned a lot. Solr, obviously, and how to write text filters for it and push it around just a little bit. Solrmarc, too. But we’ve also taken a hard look at data normalization in ways we haven’t before, and decided how we’re going to output to Refworks, and to email, what kinds of searches we want to offer, where we have collisions in ID namespaces (OCLC & ISSN, I’m looking at you).

We’ve discovered issues and problems with our data we’d have never seen otherwise, and started up whole sets of conversations about OPAC issues that used to languish for lack of a reification for reference. The ability to actually (try to) implement the collective intelligence of the library and embody it in a public-facing system is a rush compared to fighting with the ILS.

The system has tons of problems still, starting with underlying templates that will make you a little sick if you do a “view source” and going right through my call number search not working for some edge cases. But that stuff will get cleaned up as we get a little downtime from adding new features, and there are elements of the new backend code that could be useful to others once I clean them up and remove local dependencies.

I’m not sure when, if ever, we’ll start thinking of ourselves as part of the “VUFind community” again. The heavy intellectual lifting about how to organize what is essentially a front-end for Solr doesn’t seem to be happening on the VUFind list. And to be honest, I’m not sure it should be. Solr is the real engine. Solrmarc is, for us right now, an important piece. Data normalization, translation, workaround for crappy data, and the basic information theory of a faceted search system are all independent of the particular middleware you’re using to grab Solr results and throw them up on the screen.

So, what we have is good for us, for now, and we’re continuing to learn how to move forward. And I’ve been able to get bug reports and say, “Thanks, Fixed” fifteen minutes later and get warm fuzzy feelings that don’t usually accompany, “Thanks. I’ll put a request in at Ex Libris’ online ticket system”.

Next time: using and abusing Solr for data normalization.

3 Responses to “Going with and “forking” VUFind”

  1. till says:

    I think we have taken the same road with our “Suchkiste” project based on VuFind. VuFind was a convenient user interface for our Solr index that we could deploy quickly to have some kind of prototype interface to show our ideas (the Solr XML interface is not that sexy in public demonstrations :-). And I think we experienced similar disappointment with the state of the VuFind community and so decided to do our own thing as well. Today I think, that is wrong. It just doesn’t make sense, that we all redundantly fix the same issues in our VuFind based projects (and there is still a lot to fix). And I think, it is a good idea to commit new features and improvements back to the main trunk to ensure their sustainability. You are right, that did not work in the past. But on the other hand: We can’t complain about a missing community around VuFind. We are the ones that form (or not) that community. Of course you need to put some efforts into engagement in a community, but I think that pays off by what you may get back. And I feel, just at the moment there is a chance to make VuFind a real community project. But that depends on us individuals.

  2. I enjoy your updates on “vuFork” (lol). Kudos to you and your colleagues for proceeding on this!

    I really like your very quotable statement:

    “The ability to actually (try to) implement the collective intelligence of the library and embody it in a public-facing system is a rush compared to fighting with the ILS.”

    Sums up things nicely..

  3. [...] Open Source Software, VuFind has several forks and this seems to be the community choice, see Going With and Forking VuFind for more details .This is not the first fork of Koha – there is already Koha Plus at [...]

Sending unicode email headers in PHP

August 17, 2009 at 3:22 pm | Category: Uncategorized

I’m probably the last guy on earth to know this, but I’m recording it here just in case. I’m sending record titles in the subject line of emails, and of course they may be unicode. The body takes care of itself, but you need to explicitly encode a header like “Subject.”

    require_once 'Mail.php';  // PEAR Mail

    $headers['To'] = $to;
    $headers['From'] = $from;
    $headers['Content-Type'] = "text/plain; charset=utf-8";
    $headers['Content-Transfer-Encoding'] = "8bit";
    // RFC 2047 encoded-word: =?charset?B?base64-encoded-text?=
    $b64subject = "=?UTF-8?B?" . base64_encode($subject) . "?=";
    $headers['Subject'] = $b64subject;

    $mail = Mail::factory('sendmail', array('host' => $host,
                                            'port' => $port));
    $retval = $mail->send($to, $headers, $body);
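For comparison, the same encoded-word construction sketched in Python. The function name `encode_subject` is made up for this example; the standard library's `decode_header` is used only to check that the result decodes back to the original text. One caveat worth knowing: RFC 2047 limits each encoded word to 75 characters, so a very long title should be split into several encoded words rather than sent as one.

```python
import base64
from email.header import decode_header

def encode_subject(subject):
    """Wrap a Unicode subject as a single RFC 2047 base64 encoded-word."""
    b64 = base64.b64encode(subject.encode("utf-8")).decode("ascii")
    return "=?UTF-8?B?" + b64 + "?="

encoded = encode_subject("Épître à l'été")
print(encoded)

# Round-trip check: decode the header and compare with the original.
raw, charset = decode_header(encoded)[0]
print(raw.decode(charset))
```

In Python itself, `email.header.Header(subject, "utf-8").encode()` produces the same kind of header and takes care of splitting long values.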

Rolling out UMich’s “VUFind”: Introduction and New Features

For the last few months, I’ve been working on rolling out a ridiculously-modified version of Vufind, which we just launched as our primary OPAC, Mirlyn, with a slightly-different version powering catalog.hathitrust.org, a temporary metadata search on the HathiTrust data until the OCLC takes it over at some undetermined date.

(Yeah, the HathiTrust site is a lot better looking.)

[Our Aleph-based catalog lives on at mirlyn-classic -- I'll be interested to see how the traffic on the two differs as time goes on.]

I’m going to spend a few posts talking about how and why we essentially forked vufind, what sorts of modifications I made, and what technologies I hope to extract from our implementation that may be useful to the wider library community. And, I’m sure, a lot about why I hate Solr, why I love love love Solr, why I hate PHP, and why I love…er…no, I still hate PHP.

Credit where it’s due

And… a little credit where it’s due. I did a lot, but I didn’t do it all. I probably didn’t even do most of it. Half the effort, including all the heavy Aleph lifting — from getting the MARC out with all the filters and expansions we needed, to pulling holdings in real time, to grabbing a patron’s current checked-out items and holds, to fighting the inevitably-scarring battle with ILL — was done by Tim Prettyman. Suzanne Chapman lent her expertise to make it a lot less ugly and more usable than it once was (you can see her talents more strongly expressed at the HathiTrust catalog). And a whole horde of librarians were tapped by my boss, Jon Rothman, to try to figure out how to deal with the MARC data and facets and everything else that required a much deeper understanding of our data than I possess.

Non-stock user-facing features

In the next post, I’ll start with a look at how and why we changed the backend and what I’d do differently if I were starting from scratch. But right now, a quick list of the user-facing stuff that you might find interesting.

  • Email and export searches and search results, as opposed to just individual records.
  • Working endnote and refworks export.
  • Multi-select on the advanced search (e.g., pick two languages to get English OR German).
  • Publication date-range searching (with date-added-to-catalog searching coming soon).
  • A “sticky” institution selection, so each campus can choose to default to searching just their own stuff. We sniff IPs to set a default, too.
  • A “call number starts with” search based on semantics for LC searches (e.g., searching on CA11 won’t find CA1105), with call number range searching in testing now.
  • Contracted holdings for long lists of serials (see, e.g., Nature).
  • [Coming soon] Selecting records to a temporary set, which can be manipulated en masse (sent to Refworks, etc.). I’ll be hooking this up to mTagger, our home-grown bookmarking and tagging tool, later on.

Of course, I also broke some things. I haven’t added back in Search History, but will do so when I’ve got a couple hours. “Search Within” will make a comeback soon, too, but there are usability issues to contend with. And …for the love of god, don’t do a “View Source.” It’s the ugliest HTML underpinnings I’ve been associated with since 1993 or so.

All in all, though, it’s not bad work, and I’m glad to be able to offer it to our patrons.

2 Responses to “Rolling out UMich’s “VUFind”: Introduction and New Features”

  1. Dean says:

    Hi Bill, Great to see this post. Would you mind elaborating in a future post how you made the exact title matching work? ie. Nature, Science, Cell?

    Thanks!

  2. Bill says:

    Sure thing. If anyone else has stuff they’d like to hear about sooner rather than later, drop a comment here or email me.

Sending MARC(ish) data to Refworks

May 11, 2009 at 10:48 am | Category: Uncategorized

Refworks has some okish documentation about how to deal with its callback import procedure, but I thought I’d put down how I’m doing it for our vufind install (mirlyn2-beta.lib.umich.edu) in case other folks are interested.

The basic procedure is:

  • Send your user to a specific refworks URL along with a callback URL that can enumerate the record(s) you want to import in a supported form
  • Your user logs in (if need be) and gets to her RefWorks page
  • RefWorks calls up your system and requests the record(s)
  • The import happens, and your user does whatever she wants to do with them

Of course, there are lots of issues with doing this well (quick! Is this MARC record for a book? An edited book? Is it a journal, or a serial of some other sort? Who’s the actual author/editor?), but doing it at all isn’t so bad.

The URL to send them to

This is the “Export this record” URL on my system:

http://www.refworks.com.proxy.lib.umich.edu/express/expressimport.asp?
vendor=[your system]&
filter=MARC+Format&
database=All+MARC+Formats&
encoding=65001
&url=[your callback URL]

Note that the vendor variable should be a unique string (made up by you) for your system, not a larger entity (like the whole library or the institution).
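A small sketch of assembling that URL with proper escaping, which matters because the callback URL is itself a URL with its own query string. The function name, the vendor string, and the callback address are all made up for this example; the parameter names and fixed values are the ones listed above.

```python
from urllib.parse import urlencode

def refworks_export_url(base_url, vendor, callback_url):
    """Build the RefWorks import URL, escaping every parameter value."""
    params = {
        "vendor": vendor,
        "filter": "MARC Format",          # the MARC-like text format, not real MARC
        "database": "All MARC Formats",
        "encoding": "65001",              # Windows code page number for UTF-8
        "url": callback_url,
    }
    return base_url + "?" + urlencode(params)

url = refworks_export_url(
    "http://www.refworks.com.proxy.lib.umich.edu/express/expressimport.asp",
    "Example Catalog",                              # hypothetical vendor string
    "https://catalog.example.edu/export?id=123",    # hypothetical callback URL
)
print(url)
```

Without the escaping, the `?` and `=` in the callback URL would be read as part of the RefWorks query string rather than as one value.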

The “MARC Format” filter we’re using is not a filter for real MARC. It’s a MARC-like delimited format (see an example from my catalog).

Basically, you have three types of lines (but really, look at the example, ’cause it’ll make everything a lot clearer):

LEADER

  LEADER [one space] [leader text]

Control Field

  [three-digit control tag] [four spaces] [data text]

Data Field

  [three-digit data tag] [one space] [ind1] [ind2] [one space] [value of subfield a] [other subfield constructs]

…where [other subfield constructs] look like

  [pipe character][subfield code][subfield value]

Notice that (a) there’s no leading ‘|a’ before the subfield a value, and (b) there are no spaces between the pipe, the subfield code, and the subfield value for the non-code-a subfields.

Some easy PHP code to produce such a format is as follows. Note that I’m sending it as text (because it’s not MARC) and UTF-8. If you’ve got MARC-8, you’ll have to convert it before sending.

      $m = $this->marcRecord;   // a File_MARC record object
      header('Content-type: text/plain; charset=UTF-8');

      echo 'LEADER ', $m->getLeader(), "\n";

      foreach ($m->getFields() as $tag => $val) {
        echo $tag;
        if ($val instanceof File_MARC_Control_Field) {
          // Control field: tag, four spaces, data
          echo '    ', $val->getData(), "\n";
        } else {
          // Data field: tag, space, both indicators, space, subfields
          echo ' ', $val->getIndicator(1), $val->getIndicator(2), ' ';
          $subs = array();
          foreach ($val->getSubfields() as $code => $subdata) {
            // Subfield a is written bare; any other subfield gets "|<code>"
            $line = '';
            if ($code !== 'a') {
              $line = '|' . $code;
            }
            $subs[] = $line . $subdata->getData();
          }
          echo implode(' ', $subs), "\n";
        }
      }
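Since the linked example may not be at hand, here is the same line format sketched in Python with plain lists instead of File_MARC objects, so the output is easy to see. The function name and the sample leader and 001 values are made up; the 245 is a real title. Like the PHP above, it joins subfields with a single space.

```python
def marc_text_lines(leader, fields):
    """Render a record in the MARC-like text format RefWorks expects."""
    lines = ["LEADER " + leader]
    for field in fields:
        if len(field) == 2:
            # Control field: tag, four spaces, data
            tag, data = field
            lines.append(tag + "    " + data)
        else:
            # Data field: tag, space, indicators, space, subfields
            tag, ind1, ind2, subfields = field
            parts = [
                (value if code == "a" else "|" + code + value)
                for code, value in subfields
            ]
            lines.append(f"{tag} {ind1}{ind2} " + " ".join(parts))
    return lines

record_fields = [
    ["001", "000012345"],  # made-up control number
    ["245", "1", "0", [
        ["a", "Capitalism, primitive and modern;"],
        ["b", "some aspects of Tolai economic growth"],
        ["c", "[by] T. Scarlett Epstein."],
    ]],
]
for line in marc_text_lines("00000nam a2200000 a 4500", record_fields):
    print(line)
```

The 245 comes out as `245 10 Capitalism, primitive and modern; |bsome aspects of Tolai economic growth |c[by] T. Scarlett Epstein.`, with no leading `|a` and no space after the pipe or the subfield code.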

3 Responses to “Sending MARC(ish) data to Refworks”

  1. Dan Scott says:

    Bill – thanks so much for this! A working example makes all the difference for other people following in your wake. (Once the waves finish washing over my head, I hope to implement RefWorks export in Evergreen…)

  2. [...] Sending citations to RefWorks can be done with a callback. Essentially, you add a link to RefWorks’ import function page and send it your credentials, as well as a callback URL that RefWorks uses to grab the record from your ILS…in a RefWorks-supported format. The problem is that RefWorks doesn’t accept MODS, MARC, or even MARCXML. They say they accept MARC, but it’s actually what I call “MARC text” (it is described very well by Bill Dueber). [...]

  3. Ali says:

    Hi Bill,

    Good stuff. I will see if I can do similar for YorkU Vufind Instance.

    Cheers,

    Ali

MARC-HASH: The saga continues (now with even less structure)

After a medium-sized discussion on #code4lib, we’ve collectively decided that…well, ok, no one really cares all that much, but a few people weighed in.

The new format is: A list of arrays. If it’s got two elements, it’s a control field; if it’s got four, it’s a data field.

SO….it’s like this now.

  {
    "type" : "marc-hash",
    "version" : [1, 0],

    "leader" : "leader string",
    "fields" : [
      ["001", "001 value"],
      ["002", "002 value"],
      ["010", " ", " ",
        [
          ["a", "68009499"]
        ]
      ],
      ["035", " ", " ",
        [
          ["a", "(RLIN)MIUG0000733-B"]
        ]
      ],
      ["035", " ", " ",
        [
          ["a", "(CaOTULAS)159818014"]
        ]
      ],
      ["245", "1", "0",
        [
          ["a", "Capitalism, primitive and modern;"],
          ["b", "some aspects of Tolai economic growth" ],
          ["c", "[by] T. Scarlett Epstein."]
        ]
      ]
    ]
  }
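Reading this version takes nothing beyond a JSON parser and a length check. The sketch below assumes the record is valid JSON in the shape shown above; the function name `split_fields` and the shortened sample record are made up for the example.

```python
import json

def split_fields(record):
    """Separate control fields (2 elements) from data fields (4 elements), keeping order."""
    control, data = [], []
    for field in record["fields"]:
        if len(field) == 2:
            control.append(field)
        elif len(field) == 4:
            data.append(field)
        else:
            raise ValueError(f"unexpected field shape: {field!r}")
    return control, data

record = json.loads("""
{
  "type": "marc-hash",
  "version": [1, 0],
  "leader": "leader string",
  "fields": [
    ["001", "001 value"],
    ["010", " ", " ", [["a", "68009499"]]],
    ["245", "1", "0", [["a", "Capitalism, primitive and modern;"]]]
  ]
}
""")
control, data = split_fields(record)
print([f[0] for f in control])  # ['001']
print([f[0] for f in data])     # ['010', '245']
```

Because everything lives in one ordered list, the original field order survives a round trip even when control and data fields are interleaved.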

One Response to “MARC-HASH: The saga continues (now with even less structure)”

  1. [...] I initially looked at MARC-HASH almost a year ago, I was mostly looking for something that wasn’t such a pain in the butt to work with, [...]

Why do I ever, ever think that MARC might not rely on order? I don’t know.

In any case, control fields will now be just an array of duples:

  control: [
    ['001', 'value of the 001'],
    ['006', 'value of the 006'],
    ['006', 'another 006']
  ]

MARC-Hash: a proposed format for JSON/YAML/Whatever-compatible MARC records

In my first shot at MARC-in-JSON, which I appropriately (and prematurely) named MARC-JSON, I made a point of losing round-tripability (to and from MARC) in order to end up with a nice, easy-to-work-with data structure based mostly on hashes. “Who really cares what order the subfields come in?” I asked myself.

Well, of course, it turns out some people do. Some even care about the order of the tags. “Only in the 500s…usually” I was told today. All my lovely dreams of using easy-to-access hashes up in so much smoke.

So…I’m suggesting we try something a little simpler. Something so brain-dead, in fact, that I’m loath to put it down because it’s pretty much the obvious way to do it. To wit:

  {
    "type" : "marc-hash",
    "version" : [1, 0],

    "leader" : "leader string",
    "control" : [
      ["001", ["all", "001", "values"]],
      ["002", ["all", "002", "values"]]
    ],
    "data" : [
      ["010", " ", " ",
        [
          ["a", "68009499"]
        ]
      ],
      ["035", " ", " ",
        [
          ["a", "(RLIN)MIUG0000733-B"]
        ]
      ],
      ["035", " ", " ",
        [
          ["a", "(CaOTULAS)159818014"]
        ]
      ],
      ["245", "1", "0",
        [
          ["a", "Capitalism, primitive and modern;"],
          ["b", "some aspects of Tolai economic growth" ],
          ["c", "[by] T. Scarlett Epstein."]
        ]
      ]
    ]
  }

Stupid MARC allows all the stupid fields to stupid repeat and be out of stupid order and such, so it’s just a lot of arrays. Easily round-tripable.

Why bother? Excellent question, and one that’s a little harder to answer now that the data structure requires so much looping to find anything (the first time, anyway). I guess it’s still a lot easier than working with raw MARC (or, I would claim, MARC-XML), requires no special libraries in any language that supports strings, hashes, and arrays, and can be manipulated with basic language constructs.

A few things worth noting about the assumptions in my mind:

  • By definition, it’s always UTF-8. The leader should be changed to note this on the sending end, but it’s not required.
  • We include both a type “marc-hash”, and a version with major/minor numbers.
  • Everything is a string.
  • Alpha characters in indicators/tags are all lowercased.
  • A control field is a duple: tag and array of values.
  • A data field has four values:
    1. The tag
    2. Indicator one
    3. Indicator two
    4. An array of duples: subfield and its value
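Those assumptions are simple enough to check mechanically. The following is only a sketch of such a check, written against the structure described in this post (separate `control` and `data` arrays); the function name `marc_hash_problems` and the wording of the messages are made up.

```python
def marc_hash_problems(record):
    """Return a list of ways a record breaks the marc-hash assumptions (empty if none)."""
    problems = []
    if record.get("type") != "marc-hash":
        problems.append("type is not 'marc-hash'")
    if not isinstance(record.get("leader"), str):
        problems.append("leader is not a string")

    def is_str(x):
        return isinstance(x, str)

    for field in record.get("control", []):
        # A control field is a duple: tag and array of string values.
        ok = (isinstance(field, list) and len(field) == 2 and is_str(field[0])
              and isinstance(field[1], list) and all(map(is_str, field[1])))
        if not ok:
            problems.append(f"bad control field: {field!r}")

    for field in record.get("data", []):
        # A data field is tag, ind1, ind2, then an array of [code, value] duples.
        ok = (isinstance(field, list) and len(field) == 4
              and all(map(is_str, field[:3])) and isinstance(field[3], list)
              and all(isinstance(sf, list) and len(sf) == 2 and all(map(is_str, sf))
                      for sf in field[3]))
        if not ok:
            problems.append(f"bad data field: {field!r}")
        elif any(part != part.lower() for part in field[:3]):
            problems.append(f"uppercase in tag or indicator: {field[:3]!r}")
    return problems

good = {"type": "marc-hash", "version": [1, 0], "leader": "leader string",
        "control": [["001", ["12345"]]],
        "data": [["245", "1", "0", [["a", "Capitalism, primitive and modern;"]]]]}
print(marc_hash_problems(good))  # []
```

Note that it deliberately skips the `version` key, whose major/minor numbers are the one place where the example records do not use strings.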

A simple transformation to make it a little more queryable

Let’s say you don’t give a damn about tags that appear out of order, because that’s just a crime against nature, anyway. And you really don’t care what order the subtags appear in most of the time, ’cause really, who does?

A simple run-through (pseudocode ahead):

  my marchash = getTheMarcHash();
  my kindamarc;
  kindamarc{leader} = marchash{leader};

  # Map the control fields by tag => array-of-values
  foreach cfield (marchash{control}) {
    kindamarc{control}{cfield[0]} ||= [];
    kindamarc{control}{cfield[0]}.push(cfield[1]);
  }

  foreach d (marchash{data}) {
    (tag, ind1, ind2) = (d[0], d[1], d[2]);

    # build up a hash of subfield => array-of-values for this tag
    newd = {};
    foreach subfield (d[3]) {
      (stag, sval) = subfield;
      newd{stag} ||= [];
      newd{stag}.push(sval);
    }

    # Store the subfield hash in a few places so it's easy to find.
    foreach i1 ('*', ind1) {
      foreach i2 ('*', ind2) {
        kindamarc{data}{tag}{i1}{i2} ||= [];
        kindamarc{data}{tag}{i1}{i2}.push(newd);
      }
    }
  }

Control fields are stored as arrays of values associated with the tag. Data fields are built up as a hash of subfield to array-of-values pairs, and then stored both based on the indicator given and the wildcard indicator ‘*’.

Basically, this will allow things like this:

  $leader = $kindamarc{leader};
  $first001 = $kindamarc{control}{"001"}[0];

  # Find 856s where indicator 2 is '1'
  @mystuff = @{ $kindamarc{data}{856}{'*'}{1} || [] };

It’s easy to see how we could store the index from the original array to make it easy to find the original order, too.
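As a rough sketch of that idea, here is the same transformation in Python, with each data field's position in the original array stored next to its subfields. The function name `make_queryable`, the `pos` and `subfields` keys, and the sample record are all hypothetical names of mine, not part of the format.

```python
def make_queryable(marchash):
    """Index a marc-hash record by tag and indicators, keeping original order."""
    kindamarc = {"leader": marchash["leader"], "control": {}, "data": {}}

    # Control fields: tag => list of values
    for tag, value in marchash["control"]:
        kindamarc["control"].setdefault(tag, []).append(value)

    # Data fields: tag => ind1 => ind2 => list of entries ('*' is the wildcard)
    for pos, (tag, ind1, ind2, subfields) in enumerate(marchash["data"]):
        subs = {}
        for code, value in subfields:
            subs.setdefault(code, []).append(value)
        # pos is the field's index in the original data array
        entry = {"pos": pos, "subfields": subs}
        by_tag = kindamarc["data"].setdefault(tag, {})
        for i1 in ("*", ind1):
            for i2 in ("*", ind2):
                by_tag.setdefault(i1, {}).setdefault(i2, []).append(entry)
    return kindamarc


# Made-up sample record for illustration only.
record = {
    "leader": "00000nam a2200000 a 4500",
    "control": [["001", "000123456"]],
    "data": [
        ["856", "4", "1", [["u", "http://example.org/b"]]],
        ["245", "1", "0", [["a", "A title"]]],
        ["856", "4", "1", [["u", "http://example.org/a"], ["z", "Note"]]],
    ],
}

km = make_queryable(record)
hits = km["data"]["856"]["*"]["1"]      # 856s where indicator 2 is '1'
print([h["pos"] for h in hits])         # [0, 2]
```

Sorting any list of hits by `pos` puts the fields back in the order they had in the original record.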

For many, I’m sure, the prospect of dealing with something like this is more daunting than just learning to use MARC-XML or using existing libraries to deal with straight MARC. But there seems to be a set of folks out there for whom this might be useful, so I’m throwing it out there.

2 Responses to “MARC-Hash: a proposed format for JSON/YAML/Whatever-compatible MARC records”

  1. This is basically just MARC-XML translated to JSON, yes?

    Do you really find it easier to work with than MARC-XML? I guess if you’re doing your work in javascript, maybe.

  2. Bill says:

    Meh. I agree — and hence the disclaimers — that it’s nothing special. But “nothing special that everyone agrees on” is better than no agreement at all, and for some people XML processing is a barrier they don’t want to have to deal with. For all its seeming ubiquity, lots of folks never have to deal with XML in their programming.

    I guess my thing is, “IF we’re going to serialize MARC as JSON/YAML/Whatnot, let’s all do it the same way.”

A plea: use Solr to normalize your data

March 30, 2009 at 4:22 pm | Category: Uncategorized

[Only, of course, if you're using Solr. Otherwise, that'd be dumb.]

We’ve been working on Mirlyn2-Beta, our installation of VuFind, for some time now (don’t let the fancy-pants name scare you off), and the further we get into it, the more obvious it is that I want to move as much data normalization into Solr itself as possible.

Arguments about how much business logic to move into the database layer, in the form of foreign-key requirements, cascading inserts and deletes, stored procedures, etc. are as old as the features themselves. Solid arguments for and against are made on all sides, and like all things, there’s a happy middle ground for most people. 1

But Solr provides an incredibly compelling use case because it allows for data transformation at both index and query time via the use of custom analyzers (or a standard analyzer with text filters applied). We’re starting to migrate our schema to use more and more of these things, and I even went so far as to create a custom text filter for LCCNs after being inspired by Jonathan Rochkind.

The incentive is easy to see: client diversity. Let a thousand interfaces bloom, if you can give them all access to the same underlying Solr instance. And, seriously, how many times are you going to write that regexp to semi-normalize ISBNs and ISSNs, huh? Enough already.

If you’re using a Solr nightly (and, really, you should be — faceting is so much faster than the official 1.3 release) you have access to regexp-based filters as well, which makes stuff like this really, really easy:

  <!-- Simple type to normalize isbn/issn/other standard numbers -->
  <fieldType name="stdnum" class="solr.TextField" sortMissingLast="true" omitNorms="true">
    <analyzer>
      <tokenizer class="solr.KeywordTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.TrimFilterFactory"/>
      <filter class="solr.PatternReplaceFilterFactory"
           pattern="^0*([\d\-\.]+[xX]?).*$" replacement="$1"
      />
      <filter class="solr.PatternReplaceFilterFactory"
           pattern="[\-\.]" replacement="" replace="all"
      />
    </analyzer>
  </fieldType>

Here, we use the KeywordTokenizerFactory which, not so intuitively, produces a single token from the whole input. Then we lowercase it and pull off any leading and trailing spaces (Trim).

For those of you that don’t read regexp, we then match anything that looks like:

  1. Any number of leading zeros
  2. …followed by any number of digits, dashes, or periods and an optional ‘X’
  3. …followed by…well, we don’t care. Anything else.

…and throw away all but the stuff in #2. Then take that and throw away all the dashes and dots, and you’re left with a string of digits, possibly ending in an ‘x’.

The beauty is that it happens both while the index is being made and during query time, so if your user types in “ 123-45-6-X ” it will be normalized to 123456x, and then checked against your index.
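If you want to try the chain outside Solr, here is a rough Python approximation of the four steps (lowercase, trim, keep the number-ish part, drop dashes and periods). The function name `normalize_stdnum` is mine, and Python's `re` module and `str.strip()` are only stand-ins for the Java regex engine and the Solr filters, so treat it as a sketch rather than an exact replica of what Solr does.

```python
import re

# Same patterns as the two PatternReplaceFilterFactory steps;
# re.ASCII keeps \d limited to 0-9.
KEEP_NUMBER = re.compile(r"^0*([\d\-.]+[xX]?).*$", re.ASCII)
PUNCTUATION = re.compile(r"[\-.]")


def normalize_stdnum(raw):
    """Approximate the stdnum analyzer chain for a single input value."""
    token = raw.lower().strip()             # LowerCase, then Trim
    token = KEEP_NUMBER.sub(r"\1", token)   # drop leading zeros and trailing junk
    return PUNCTUATION.sub("", token)       # drop dashes and periods


print(normalize_stdnum(" 123-45-6-X "))     # 123456x
```

Because leading zeros are stripped on both sides, a value like "0-306-40615-2" is stored and queried as "306406152", which is fine as long as everything goes through the same field type.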

This is simple stuff, and probably doesn’t deserve the virtual ink I’m providing for it, but Vufind out of the box doesn’t do any of this sort of thing (likely because “the box” existed before it was super-easy to do this), and we all should be doing it.

  1. “Most,” in this case, excluding the old-time MySQL fanboys who took it as gospel that all data validation and manipulation belongs in the application layer, because their “database” didn’t do any of it. February 30th in a date field, anyone?

One Response to “A plea: use Solr to normalize your data”

  1. Nice! Can you provide your lccn normalization routines too?

OK. I’m done with it, and this time I mean it.

I’ve updated and improved the lc normalization code, documented the algorithm, and put it all into Google Code. In the next couple weeks, I’ll be turning it into a Solr text filter so we can do some decent sorting on call-number search results.

2 Responses to “Enough with the freakin’ LC Call Number normalization!”

  1. Thanks for sticking it up on the web. I suspect Blacklight will want that at some point.

  2. Naomi Dushay says:

    I wrote a bunch of LC parsing (and dewey, too!) to get to shelving keys. It’s in the CallNumUtils of the solrmarc project.