Robot Librarian

Reintroducing Traject: Traject 2.0

Bill Dueber — Thu, 19 Feb 2015 00:00:00 +0000

Traject 2.0.0 released! Now runs under MRI/RBX!

traject is an ETL (extract/transform/load) system written in ruby with a special
view towards extracting fields from MARC data and writing them out into Solr. [Jonathan Rochkind](http://bibwild.wordpress.com) and I wrote this primarily out of
frustration using other tools in this space (e.g., Solrmarc, or my own precursor to traject, marc2solr).

Note: Catmandu is another, perl-based system I don’t have any direct experience with.

traject had its first release almost a year and a half ago (at least based on the date of my
post introducing it), and I’ve used it literally every day since then indexing data for the
University of Michigan and HathiTrust library catalogs.

How does it work?

traject is packaged as a gem, and ships with a command-line program (traject) that reads in configuration files or switches and the name of the file to operate on and transforms the incoming records as specified.

The "configuration" is actually just ruby code, with some macros included to make it simple to do the common operations (e.g., get the ISBN) and possible to do …well, anything you can do with ruby.

require 'traject/macros/marc21_semantics'
extend  Traject::Macros::Marc21Semantics
require 'library_stdnums' # just a regular ruby gem

to_field "id", extract_marc("001", :first => true)
to_field 'marcxml_record', serialized_marc(:format=>:xml)
to_field "allfields", extract_all_marc_values(:from=>'100', :to=>'999')

to_field 'oclc', oclcnum('035a:035z')
to_field 'isbn', extract_marc('020a') do |rec, acc|
  acc.map!{|x| StdNum::ISBN.normalize(x)}
end

to_field 'title', extract_marc_filing_version('245abdefghknp', :include_original => true)
to_field 'vtitle', extract_marc('245abdefghknp', :alternate_script=>:only, :trim_punctuation => true, :first=>true)
to_field "publisher", extract_marc('260b:264|*1|:533c')
to_field "edition", extract_marc('250a')

Questions about `traject`

How do I get started?
The best way is likely to look at the heavily-documented sample project we provide, followed by checking out the traject documentation. And, of course, just ask me for help.

Do you need to use JRuby?
Not anymore. As of version 2.0.0, traject runs under MRI ("regular") ruby, although without
all the speed-enhancing true threading that JRuby offers.

How fast is it?
Apples-to-apples comparison is difficult. The stock Blacklight indexing scheme is reportedly about as fast as Solrmarc when just using single-threaded MRI (JRuby would presumably speed things up). I run a hideously complex indexing scheme using JRuby and a few threads and can average over 900 records/second during a longish run (e.g., I can index all
eleven-million bib records before lunch). For me, it’s fast enough.

What kinds of data can I throw at it?
In theory, anything you want — it’s pretty easy to write a Traject reader. Out of the box or
with existing gems, though, we support a few kinds of MARC:

MARC (binary), via either ruby-marc or marc4j (the latter requiring JRuby)
MARC-XML (again, via either ruby-marc or marc4j)
marc-in-json in the form of newline-delimited json (a text file with one MARC-in-JSON record per line)
Alephsequential, a human-readable serialization put out by the Aleph ILS from Ex Libris.
Direct import from the Horizon ILS

How does it make it easier to deal with MARC?
MARC is the bread-and-butter of what traject is currently used for. traject ships with macros for reading through MARC records and transforming the often-weird data within them. Some of these can:

extract data from fields based on tag, indicators, and subfield values
trim punctuation from extracted data
translate MARC codes into human-readable languages, countries, etc.
correctly deal with "filing characters" (e.g., leading articles like "a" or "the")
find field data repeated in other languages ("vernacular" data, usually in 880 fields)
find OCLC numbers, with their myriad of prefixes
…and many others. And, if you know a little ruby, it’s not hard to write your own.

What are the records transformed into?
Given its history focused on indexing data into Solr, the basic result of a
traject transformation of a record is a hash (map) of arrays (e.g., key1=>[val1, val2,...] — each key/fieldname is mapped to one or more values). This is easily transformed into
something that you can send to Solr or write to a file. If you need to produce more complex hierarchical data, traject may not be the right tool for you.
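To make the "hash of arrays" shape concrete, here is a minimal plain-Ruby sketch. It does not use traject itself; the sample document, the `to_ndj_line` and `to_tab_line` helpers, and the `|` joiner are all made up for illustration.

```ruby
require 'json'

# A hypothetical output document: every field name maps to an array of
# values, even when there is only one value. The values are invented.
DOC = {
  'id'    => ['000123456'],
  'title' => ['An example title'],
  'isbn'  => ['isbn-one', 'isbn-two']
}.freeze

# One JSON object per line (newline-delimited JSON).
def to_ndj_line(doc)
  JSON.generate(doc)
end

# One tab-delimited row; repeated values within a field are joined with "|".
# A field that is missing from the document becomes an empty cell.
def to_tab_line(doc, fields, joiner = '|')
  fields.map { |f| Array(doc[f]).join(joiner) }.join("\t")
end

puts to_ndj_line(DOC)
puts to_tab_line(DOC, %w[id title isbn])
```

Because every value is already an array, a flat writer only has to decide how to join repeated values; nothing in this shape can express nesting, which is why hierarchical output is a poor fit.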

What kind of output can it produce?
Obviously, the resulting documents can be sent to Solr, via Traject::SolrJsonWriter. Additionally, we ship Traject
with writers that produce other formats.

Traject::DebugWriter produces a human-readable file with one field and its values per line.
Traject::JsonWriter produces newline-delimited JSON, one valid JSON record per line.
Traject::YamlWriter writes a yaml file that contains multiple documents, good for both further processing and human inspection.
Traject::DelimitedWriter, by default, writes a tab-delimited file suitable for further processing or import into Excel.
Traject::CSVWriter produces comma-separated value files, as you’d expect.

A 2.0 release?

So, what’s changed enough to warrant a 2.0 release?

No longer requires JRuby
The first release of traject only ran under JRuby, based on its need to use the
solrj java library to efficiently index things into Solr. More modern
versions of Solr (since version 3.2) allow indexing documents via HTTP with JSON;
doing so not only works under any ruby implementation, but in my tests the JSON indexer goes about 20% faster than the old solrj-based indexer.

(Tab-)Delimited and CSV Writers:
How often are systems librarians asked to do things like "find all the records with publisher string XXX, and give me a list of them with the title, isbn, author, and date of publication"? For me, the answer is "often" and traject now makes it easy to output something that your user can inspect as text or import into Excel for further processing.

Cross-platform threading
For most applications of traject to date, the bottleneck is the transformation process of turning a MARC record into a Solr document. Under JRuby, you can throw as many cores as you have available at that transformation to speed up the indexing process. Even under MRI, which can’t run multiple threads on ruby code at the same time, we can use a second thread to talk to Solr so indexing on the server doesn’t slow down processing of MARC records.
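The "second thread to talk to Solr" idea can be sketched in plain Ruby as a producer/consumer pair. This is only an illustration of the pattern, not traject's code: `transform`, `index_with_writer_thread`, the batch size, and the `sink` lambda (standing in for an HTTP POST to Solr) are all assumptions.

```ruby
require 'thread'

# Stand-in for the expensive MARC-to-document transformation.
def transform(record)
  { 'id' => [record[:id]], 'title' => [record[:title].strip] }
end

# Transforms records on the calling thread and hands finished documents to a
# single writer thread through a bounded queue. `sink` is anything that
# responds to #call with a batch of documents.
def index_with_writer_thread(records, sink, batch_size: 2, queue_size: 10)
  queue = SizedQueue.new(queue_size)

  writer = Thread.new do
    batch = []
    while (doc = queue.pop) != :done
      batch << doc
      if batch.size >= batch_size
        sink.call(batch)
        batch = []
      end
    end
    sink.call(batch) unless batch.empty? # flush the final partial batch
  end

  records.each { |rec| queue.push(transform(rec)) }
  queue.push(:done) # sentinel telling the writer to finish
  writer.join
end

sent = []
index_with_writer_thread(
  [{ id: '1', title: ' One ' }, { id: '2', title: 'Two' }, { id: '3', title: 'Three' }],
  ->(batch) { sent << batch }
)
puts sent.map(&:size).inspect
```

Under MRI the two threads never run Ruby code at the same instant, but while the writer is blocked waiting on the network the transforming thread is free to keep working, which is where the gain comes from.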

So…give it a whirl!

You can find traject and its related gems on Github. Besides traject itself and the associated reader/writers, there’s a heavily-documented sample project to get you started.

I’m heavily invested in traject, and am more than willing to assist folks as they start using it, so don’t be afraid to contact me (via email or twitter) if you want a little advice or a helping hand.

How good/bad is MARC data? The case of place-of-publication

Bill Dueber — Mon, 10 Nov 2014 11:09:00 +0000

I complain a lot about the MARC format, the way people put data in MARC records, the actual data themselves I find in MARC records, the inexplicably complex syntax for identifiers and, ironically, attempts to replace MARC with something else.

One nice little beacon of hope was when I found that only roughly 0.26% of the ISBNs in the UMich catalog have invalid checksums. That’s not bad at all, and it’s worth digging into other things about which I might be likely to complain before I make a fool of myself.

[Note: there will be some complaining at the end. I promise.]

One of my recent charges was to try to put in place a better place-of-publication filter in the catalog. Place of Publication is most formally dictated by the (poorly-named, since it includes states/provinces) Country of Publication code in the 008 fixed field. This one-, two- or three-letter code is then translated into a place name via a mapping provided by the LoC. Like most important pieces of data, the place of publication can appear in a few different places in a valid MARC record (because the searching is half the fun!), but we decided to just stick with the 008 for the catalog search.

Of course, the name of a place of publication may have changed since the actual publication. Historically speaking, borders have been remarkably consistent over the last half of a century or so, but there are still changes (fall of the Soviet Union), splits (the former Czechoslovakia) and merges (Germany).

Focus on validity

So, there are roughly a bazillion ways one could try to slice and dice the data to figure out what the most accurate textual representation of a place name should be for a given record. More cut-and-dry is a simple question: how many of the 008s have a valid (current or obsolete) place-of-publication code in them?

I ran an analysis of all the 008s in all the records in the University of Michigan catalog, which helpfully includes all the HathiTrust holdings as well, so we’re getting a nice cross-section of institutional records.

Here’s what I found, in round numbers

	Total	Pct. of Total
All records	12M	100%
Invalid 008	1900	0.15%
Valid code	11.6M	96.6%
Unknown place-of-pub	381k	3.1%
Invalid code	27k	0.2%

[“Unknown place-of-pub” includes both records with no data in the 008 and those with the code ‘ x’ which explicitly indicates “Unknown”]
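The kind of bucketing behind a table like this can be sketched as follows. This is not the script that produced the numbers above. It assumes the place code lives in positions 15-17 of a 40-character 008, treats `xx` as the "unknown" code, and uses a tiny made-up subset of the code list; `classify_place_code` and `sample_008` are hypothetical helpers.

```ruby
# Tiny illustrative subset of current and obsolete codes; a real run would
# load the full list published by the Library of Congress.
VALID_CODES   = %w[miu nyu enk gw ai am cs ur].freeze
UNKNOWN_CODES = %w[xx].freeze

# Buckets one 008 value as :invalid_008, :unknown, :valid, or :invalid_code.
# "Invalid 008" here just means missing or not 40 characters long, which is
# an assumption about what counts as invalid.
def classify_place_code(field008)
  return :invalid_008 unless field008.is_a?(String) && field008.length == 40

  code = field008[15, 3].strip
  return :unknown if code.empty? || UNKNOWN_CODES.include?(code)

  VALID_CODES.include?(code) ? :valid : :invalid_code
end

# Builds a fake 40-character 008 with the given place code, for testing only.
def sample_008(code)
  ('0' * 15) + code.ljust(3) + ('0' * 22)
end

%w[miu xx zzz].each do |code|
  puts "#{code.inspect} => #{classify_place_code(sample_008(code))}"
end
```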

Results: pretty good!

Given much of the data I’ve worked with over the years, this strikes me as stunningly good. Of course, in the case of a place as big as UMich, that means we’ve still got about 408k items about which we have no good place-of-publication information, but as a percentage, it’s small enough that I’m happy to live with it.

I was, admittedly, a little put out by the fact that we have records in which the 008 fixed field — which is pretty important, as these things go — was just plain invalid (including 176 just plain missing). You’d think that the ILS software would reject things like that, but, as in almost all cases when you think the ILS should do something smart, you’d be wrong.

And now, the complaints

Of course, all we know is that the codes are (or were) valid — not whether or not they’re accurate.

There are two obvious problems:

Some rocket scientists at some point decided that the code ‘ai’, which had been used to represent Anguilla, should now be used to represent the Republic of Armenia. As if that weren’t enough to make you slam your head into a brick wall, the change is based on the date of cataloging, not the date of publication, so there’s no way for me to know which country is supposed to be indicated. It looks like this was to try to keep the first two letters of codes from the old Soviet Union the same once it fell apart, but c’mon, people! (Note that Anguilla is now ‘am’, because of the …ummmm….“m” in its…er…nevermind.) We don’t have many records with that code, but this is the sort of blatant disregard for simple data integrity that drives me crazy.
A presumably different set of rocket scientists (once NASA downsized, those folks were everywhere) at various points in time and at various locations decided that the place of publication on a reproduction (say, a microfilm) should be the place the reproduction was created. So, a microfilm of The New York Times that happened to be created in Ann Arbor, MI is coded as ‘miu’, for Michigan.

The latter, of course, is designed to serve those people studying where microfilms were created at the expense of people who want to, you know, find things actually published in a particular location. I’m sure all three of the people in the country who want to know the former are forever grateful.

Ruby MARC serialization/deserialization revisited

Bill Dueber — Thu, 09 Oct 2014 12:58:00 +0000

A few years ago, I benchmarked various methods of serializing/deserializing MARC data using the ruby-marc gem. Given that I’m planning on starting fresh with my catalog setup, I thought I’d take a moment to revisit them.

The biggest changes since that time have been (a) the continued speed improvements in JRuby, (b) the introduction of the Oj json parser for MRI ruby, and (c) wider availability of msgpack code in the wild.

I also wondered what would happen if I tried ruby’s Marshal serialization; maybe it would be faster because I wouldn’t have to "manually" create a MARC::Record object from a hash?

File sizes

File size isn’t as important as it once was, but still matters to some of us working with ginormous amounts of data:

This is the file size to hold the 18,881 records used for the benchmark.

Serialization	Size on disk (MB)	Size vs. marc21	Gzipped size on disk (MB)	Gzipped size vs marc21
marc21	31	100%	9.0	100%
msgpack	42	135%	8.2	91%
json (ndj)	56	180%	8.1	90%
marshal	69	223%	9.4	104%
marc-xml	93	300%	9.2	102%

It’s interesting, if not super-useful, to note that the file sizes differ by a factor of three uncompressed, but hardly at all when compressed. I was surprised at how well the binary formats (msgpack, marshal, and marc21) compressed.

Serialization / Deserialization time

I took a file of about 19k MARC records and tested the serialization/deserialization time, as follows:

marc21 uses MARC::Reader and MARC::Writer from the ruby marc distribution
json uses MARC::Record#to_hash to produce a marc-in-json hash, serializes with the stock JSON library, and then writes to a file with one record per line (sometimes known as newline-delimited JSON, or NDJ). Deserialization reverses the process.
json (oj) is the same, except using the Oj json library under MRI.
msgpack uses the msgpack or msgpack-jruby gems to serialize/deserialize msgpack objects to/from a file stream.
marshal uses the core ruby Marshal class to serialize/deserialize to a file stream.

In all cases, deserialization means to pull each record in turn from a file on disk and turn it into a MARC::Record object; serialization means to take a set of pre-created MARC::Record objects, serialize them, and push them into a file.

All times are in “real time” seconds as reported by Benchmark, averaged across two runs on my desktop machine:

MRI ruby 2.1.2p95 (2014-05-08 revision 45877) [x86_64-darwin13.0]
JRuby jruby 1.7.15 (1.9.3p392) 2014-09-03 82b5cc3 on Java HotSpot(TM) 64-Bit Server VM 1.8.0-b132 +indy +jit [darwin-x86_64]

The benchmark code is up on a gist if you’d like to look at or modify it.
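For readers who just want the shape of such a comparison, here is a minimal sketch using only the standard library. It is not the gist's code: it round-trips plain hashes loosely shaped like marc-in-json rather than MARC::Record objects, it works in memory instead of on disk, and `sample_records`, `json_round_trip`, `marshal_round_trip`, and `compare_round_trips` are hypothetical names.

```ruby
require 'benchmark'
require 'json'

# Hypothetical stand-in records, loosely shaped like marc-in-json hashes.
def sample_records(count)
  Array.new(count) do |i|
    {
      'leader' => '00000nam a2200000 a 4500',
      'fields' => [
        { '001' => format('%09d', i) },
        { '245' => { 'ind1' => '1', 'ind2' => '0',
                     'subfields' => [{ 'a' => "Title number #{i}" }] } }
      ]
    }
  end
end

# Serialize to newline-delimited JSON, then parse each line back.
def json_round_trip(records)
  ndj = records.map { |r| JSON.generate(r) }.join("\n")
  ndj.each_line.map { |line| JSON.parse(line) }
end

# Serialize each record with Marshal, then load each one back.
def marshal_round_trip(records)
  records.map { |r| Marshal.dump(r) }.map { |blob| Marshal.load(blob) }
end

# Wall-clock ("real") seconds for each round trip.
def compare_round_trips(records)
  {
    json:    Benchmark.realtime { json_round_trip(records) },
    marshal: Benchmark.realtime { marshal_round_trip(records) }
  }
end

compare_round_trips(sample_records(1_000)).each do |name, secs|
  puts format('%-8s %.4fs', name, secs)
end
```

Timings from a toy like this will not match the tables below, since building a real MARC::Record from the parsed data is a large part of the cost being measured there.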

MRI Ruby

Method	Deserialize (s)	Serialize (s)	Total round trip (s)	Total time vs. marc21
marc21	11.5	6.7	18.2	100%
json	9.5	5.9	15.4	85%
json (oj)	6.4	3.5	9.9	54%
msgpack	5.8	3.4	9.2	51%
marshal	7.0	8.7	15.7	86%

JRuby

Method	Deserialize (s)	Serialize (s)	Total round trip (s)	Total time vs. marc21
marc21	6.9	4.2	11.1	100%
json	4.1	2.5	6.6	59%
json (oj)	N/A	N/A	N/A	N/A
msgpack	4.0	3.9	7.9	71%
marshal	5.3	8.1	13.4	121%

Conclusions

JRuby is faster than MRI on these tasks, at least once it’s warmed up.
JSON, with a decent library, is either the fastest (JRuby) or really close (MRI)
Marshal is slow and big.

I come away from this thinking the same thing I did last time. I’m going to use compressed ndj in files, and (possibly compressed) JSON over the wire. The speed is great, tool support is outstanding, and having something human-readable is a big bonus.

“Schemaless” solr with dynamicField and copyField

Bill Dueber — Thu, 02 Oct 2014 00:00:00 +0000

[Holy Kamoly, it’s been a long time since I blogged!]

Recent versions of solr have the option to run in what they call "schemaless mode", wherein fields that aren’t recognized are actually added, automatically, to the schema as real named fields.

I find this intriguing, but it’s not what I’m after right now.

The problem I’m in the first stages of addressing is that my schema.xml is a huge mess: very little consistency, no naming conventions dictating what’s stored/indexed, etc. It grew "organically" (which is what I say when I mean I’ve been lazy and sloppy) and needs a full-on reorganization.

The way people tend to address this is with strict naming conventions (possibly using dynamicField ) and judicious use of copyField directives. The Project Hydra folks have a nice, straightforward system for how they set up dynamic fields.

Indexed XOR Stored?

The more I thought about it, the more I wondered whether it might be useful to have a strict separation of stored and indexed fields. Indexed fields would be named with an appropriate suffix, so you know how they’ve been analyzed. And stored fields would have pleasant, human-readable names to make them easy to deal with for consuming applications.

What I think I’d like is a system where:

All stored fields have ‘bare’ names (e.g., ‘title’, not ‘title_t’ or ‘title_s’)
All indexed fields are typed according to their name (so I know ‘title_t’ is an indexed field of type "text")
Separation of stored and indexed fields — a field is either stored or indexed, but not both.
A "schemaless" setup, where I don’t need to define all (any of?) my fields in my schema and reboot solr when I make a change.

To be clear: I’m not sure this is a great way to go as of yet. But I figured out what I think is a good way to do it, should it turn out to be worthwhile.

Part 1: Dynamic Fields

Solr allows one to define dynamic fields — a field whose type is determined by a glob-match on its name. Instead of explicitly naming your field in your schema, you can do something like:

…to indicate that any unrecognized field whose name ends in _is will be treated as an indexed, stored integer.

Dynamic Field definitions are processed in order of declaration; first one wins. That allows you to define a “default” as the very last dynamicField that matches anything (e.g., *). The schema.xml that ships with Solr suggests that you can use this functionality to just ignore unrecognized fields.

But that gives me an idea.

Part 2: Copy Fields

The copyField directive allows you to index the same text into multiple fields (presumably with different analysis chains). Index data into one field, it automatically gets copied into another.

In this case, even though I only send a title, the indexed field title_l will automatically be created and available for me to search against. Nice.

Part 3: Copy Field with globs

But it gets better. You can have globs (*) in your copyField source or destination attributes.

So that’s nice. But what if you have globs in both the source and the destination? The docs say:

The copyField command can use a wildcard (*) character in the dest parameter only if the source parameter contains one as well. copyField uses the matching glob from the source field for the dest field name into which the source content is copied.

Hmmmmmm….

Part 4: Putting it all together

Once I read that, I thought, “Huh. I’m hungry.”

But after lunch, I thought, “Maybe I can do something with this.”

Here’s what I came up with.

Let’s walk through that.

First, there are two dynamicField definitions. The first is a no-op: unstored, unindexed. We use it only for copying. The second is a standard indexed (but not stored) text field.

Then come the copyFields, where we match on the suffixes of the field types. Finally, we have our default: a stored, unindexed string. (Note that when Solr stores a value, it stores whatever you put into it, not the value after analysis — same as a String does anyway).

Suppose I index an undeclared field called title_t_s:

title_t_s matches the first dynamicField declaration. This specific field is ignored (no indexing, no storing), but the text sent to it remains available for further processing by the copyFields.
The first copyField matches, and copies the text into a newly generated field formed by what matched the * in the source field, followed by _t. That’s title, so we get title_t.
The newly-minted title_t field is also unrecognized, but it matches the second dynamicField and is thus assigned to be an indexed text field.
Meanwhile, the second copyField also matches our original title_t_s. It uses what matched against the * in the source (title, again) to create a new field just called title.
Now we have a new field called title not matching any declared field, so it runs down the list of dynamicField definitions until it hits our stopgap at the end: a stored, nonindexed string.

Yeah, like that wasn’t confusing.

The result is what’s important, though. What we end up with field-wise is:

title_t_s disappearing into the ether. It’s just gone.
title_t, an indexed text field
title, a stored string.

Now I can run searches against title_t, but my document will have a nice stored string in it just called title.
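If the walk-through is hard to follow, a small simulation may help. The sketch below is plain Ruby that mimics the matching rules as this post describes them (first matching dynamicField wins, copyField reuses whatever the * matched). It is not how Solr is implemented, and the rule tables and type names are reconstructed from the prose above, so treat them as assumptions.

```ruby
DynamicField = Struct.new(:pattern, :type, :indexed, :stored)
CopyField    = Struct.new(:source, :dest)

# Rules reconstructed from the walk-through; the type names are made up.
DYNAMIC_FIELDS = [
  DynamicField.new('*_t_s', 'ignored', false, false), # no-op, used only for copying
  DynamicField.new('*_t',   'text',    true,  false), # indexed, not stored
  DynamicField.new('*',     'string',  false, true)   # default: stored, not indexed
].freeze

COPY_FIELDS = [
  CopyField.new('*_t_s', '*_t'),
  CopyField.new('*_t_s', '*')
].freeze

# Returns whatever the single * matched, or nil when the name does not match.
def glob_capture(pattern, name)
  regex = Regexp.new('\A' + Regexp.escape(pattern).sub('\*', '(.*)') + '\z')
  match = regex.match(name)
  match && match[1]
end

# First matching dynamicField wins, as described in Part 1.
def resolve_type(name)
  DYNAMIC_FIELDS.find { |df| glob_capture(df.pattern, name) }
end

# Fields that actually end up indexed and/or stored for one incoming name.
def resulting_fields(name)
  copies = COPY_FIELDS.map do |cf|
    captured = glob_capture(cf.source, name)
    cf.dest.sub('*') { captured } if captured
  end
  targets = [name] + copies.compact

  targets.each_with_object({}) do |field, result|
    rule = resolve_type(field)
    next unless rule.indexed || rule.stored
    result[field] = { type: rule.type, indexed: rule.indexed, stored: rule.stored }
  end
end

puts resulting_fields('title_t_s').inspect
```

Running it for `title_t_s` yields exactly the two survivors listed above: an indexed `title_t` and a stored `title`.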

Why this is probably a bad idea.

Depending on how crazy you want to get options-wise (multi-valued or not, termVectors or not, etc.) you can get a combinatorial explosion on the number of dynamicField/copyField sets you need to generate. But that’s not the real problem.

The real problem is that you don’t have any intrinsic documentation of what your index looks like. None. You can’t even look at the indexing code, because it’ll look like you’re sending a document with a field called title_t_s and that field is nowhere to be found.

So, like I said: interesting, but by no means the obvious way to go. Still, I’m sure I’ll have some variant of this in my schema when it comes time for me to reboot the library catalog.

Help me test yet another LC Callnumber parser

Bill Dueber — Thu, 30 Jan 2014 00:00:00 +0000

Those who have followed this blog and my code for a while know that I have a long, slightly sad, and borderline abusive relationship with Library of Congress call numbers.

They’re a freakin’ nightmare. They just are.

But, based on the premise that Sisyphus was a quitter, I took another stab at it, this time writing a real (PEG-) parser instead of trying to futz with extended regular expressions.

The results, so far, aren’t too bad.

The gem is called lc_callnumber, but more importantly, I’ve put together a little heroku app to let you play with it, and then correct any incorrect parses (or tell me that it worked correctly) to build up a test suite.

So…Please try to break my LC Callnumber parser!

[Code for the app itself is on github; pull requests for both the app and the gem joyously received]

New blog front- and back-end

Bill Dueber — Tue, 17 Dec 2013 00:00:00 +0000

A while back, Dreamhost had some problems and my blog and assorted other websites I help keep track of went down.

For more than two weeks.

Now, I understand that crap happens. And I understand that sometimes lots of things happen at once. But fundamentally, their infrastructure is such that they could lose everything on a machine and be unable to get it back for more than two weeks. I’m not a mathematician, but that’s not “five-nine” service.

So, I decided to start hunting around for another provider. And then I got distracted by the idea that maybe having my blog in WordPress was more trouble than it was worth. There’s something to be said for simplicity, especially since all I really wanted to do is throw up posts written in markdown with code samples.

I got a few pointers toward using middleman, a pre-processor that takes in almost anything and produces regular css/html. Between that and Disqus for the comments, well, this just seems easier. And now that I’ve put in the effort, it’ll be easier to actually get blog posts up and, most importantly, move it over when I find a new hosting provider.

Feel free to tell me how ugly it is and suggest improvements. I have the design skills of a one-eyed poodle.

Announcing “traject” indexing software

Bill Dueber — Mon, 14 Oct 2013 00:00:00 +0000

[Over the next few days I’ll be writing a series of posts that highlight a new indexing solution by Jonathan Rochkind and myself called traject that we’re using to index MARC data into Solr. This is the introduction.]

Wow. Six months since I posted here. What have I been doing?

Well, mostly parenting, but in the last few weeks I was lucky enough to get on board with a project started by Jonathan Rochkind for a new JRuby-based tool optimized for indexing MARC data into solr. You know, kinda like solrmarc, but JRuby.

What’s it look like?

I encourage you to take a look at a little sample setup I put together for instructional purposes. It’s based on the HathiTrust catalog indexing scheme and shows off about 85% of what traject can do. Clone it and go through the README and the two indexing files to get a taste of how things are put together.

Real quickly, though, here’s a sample configuration file to pull the ID, title, and authors (if any) out of a file of MARC records and send them to a file as JSON objects, one record per line (i.e., newline-delimited JSON):

# we'll pretend this file is called 'sample.rb'
require 'traject'
require 'traject/marc_reader'
require 'traject/json_writer'

# It's just ruby, so I can have comments!
# Here we set up which reader/writer to use and so on
settings do
  provide "reader_class_name", "Traject::MarcReader"
  provide "writer_class_name", "Traject::JsonWriter"
  provide "output_file", "basics.ndj"
  provide 'processing_thread_pool', 3
end

# It's *still* just ruby, so I can declare a variable!
idfield = '001'

# ...and then use it to find the ID
to_field "id", extract_marc(idfield, :first => true)

# Now the other data
to_field "title", extract_marc('245')
to_field "author", extract_marc('100abcd:110abcd:111abc')

# You'd run this as:
#    traject -c sample.rb myfile.mrc

That’s simplistic, of course, but it should drive home the point that we strove to make sure traject makes the easy stuff easy. For a more complex example, look at the heavily-annotated index.rb file in the sample project.

Why use (or move to) traject?

First off, you can and should look at the announcement and/or the README for a longer answer, but I’ll tell you why I use traject in one word:

Flexibility.

After a year or so of struggling with solrmarc (often due to my lack of Java-fu), and then even more years after that using my own, home-grown marc2solr, the things I most wanted were the ability to decouple the various components from each other, rely on code instead of configuration, and basically just know that I can up the complexity of my code without paying an enormous price.

I’m fast with Ruby. And the architecture of traject allows me to easily build and test my transformations in isolation, with tools I’m good with, with debugging output that’s easy to read or process by machine or inspection.

What does it have out of the box?

One advantage traject has that my previous system didn’t is, well, years of struggling with my previous system. I’ve learned a lot about what I need, what needs to be easy, and how I want to think about indexing.

The nature of traject is that “a reader” sends “a record” to “an indexer” which produces a key=>value hash and sends that to “a writer.” Obviously, this is a pretty abstract setup; it’s not hard to see how it could be used for all sorts of transformations (e.g., I’m already thinking about a simple gem that would provide macros to index CSV or tab-delimited files into Solr. Or maybe going to/from a database).

But Jonathan and I are, mostly, stuck dealing with MARC data and Solr. So here’s what we get:

Readers: MARC readers for MARC21 binary and MARC-XML based on both ruby-marc and marc4j (the latter allowing you to deal with encoding transformations and the like). An NDJ reader (for one marc-in-json structure per line in a file; that’s what we use for HathiTrust). And we’ve already got a couple gems for people with other needs: traject_alephsequential_reader for those that need to deal with AlephSequential, and Jonathan’s new horizon reader for efficiently pulling records right out of your Horizon ILS, if you happen to run one.

Transforming Macros: A traject indexing step is just a well-formed ruby block (or lambda), which makes writing macros ridiculously easy. Traject ships with most of what you’d commonly need to deal with MARC: extracting data based on tag/subfield/indicators (or substring of a fixed field), dealing with non-filing characters, automatically dealing with 880 linked fields. Mucking with publication dates. Dealing with languages, formats, etc. And, of course, doing it all with multiple threads, because who wants to see all those lovely cores go to waste?

Writers: Of course, you can write to solr, using the excellent solrj java library. And you can do it in multiple threads, to keep things fast. But there’s also the DebugWriter to spit stuff out in a human-readable format, and the JsonWriter mentioned above to spit stuff out in a machine-readable format. And building your own writer is literally just a couple methods.
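The reader, indexer, and writer split described above can be imitated in a few lines of plain Ruby. This is a toy to show the shape, not traject's implementation: `TinyIndexer`, `extract_key`, and `run_pipeline` are invented names, the "records" are plain hashes rather than MARC, and the only thing borrowed from traject is the idea that a step is a lambda or block receiving a record and an accumulator array.

```ruby
require 'json'
require 'stringio'

# A "macro" is just a method that returns a lambda taking (record, accumulator).
def extract_key(key)
  lambda { |record, acc| acc.concat(Array(record[key])) }
end

# Collects indexing steps and turns one record into a hash of arrays.
class TinyIndexer
  def initialize
    @steps = []
  end

  # Registers a field; the optional block can post-process the accumulator.
  def to_field(name, macro, &block)
    @steps << [name, macro, block]
  end

  def map_record(record)
    @steps.each_with_object({}) do |(name, macro, block), doc|
      acc = []
      macro.call(record, acc)
      block&.call(record, acc)
      (doc[name] ||= []).concat(acc) unless acc.empty?
    end
  end
end

# reader: anything with #each; writer: anything with #puts.
def run_pipeline(reader, indexer, writer)
  reader.each { |record| writer.puts(JSON.generate(indexer.map_record(record))) }
end

indexer = TinyIndexer.new
indexer.to_field('id', extract_key(:id))
indexer.to_field('title', extract_key(:title)) { |_rec, acc| acc.map!(&:strip) }

out = StringIO.new
run_pipeline([{ id: '1', title: '  A title ' }], indexer, out)
puts out.string
```

Because the three pieces only agree on "a record goes in, a hash of arrays comes out", swapping the array for a file reader or the StringIO for a network client does not touch the indexing rules.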

How do I get a taste?

Like I said, clone and play with the sample project. And ask me questions, either here or via email. After years of being the only person running my indexing software, I’m anxious to try to build up a community around traject.

Come work at the University of Michigan

Bill Dueber — Thu, 18 Apr 2013 00:00:00 +0000

The Library has three UX positions available right now — interface designer, interface developer, and a web content strategist.

Come join me at what is easily the best place I’ve ever worked! Full details are over at Suz’s blog.

Please: don’t return your books

Bill Dueber — Tue, 12 Feb 2013 00:00:00 +0000

So, I’m at code4lib 2013 right now, where side conversations and informal exchanges tend to be the most interesting part.

Last night I had a conversation with the inimitable Michael B. Klein, and after complaining about faculty members that keep books out for decades at a time, we ended up asking a simple question:

How much more shelving would we need if everyone returned their books?

Assuming we could get them all checked in and such, well, where would we put them?

I’m looking at this in the simplest, most conservative way possible:

Assume they’re all paperbacks, so we don’t worry about how thick a cover is (cover width = 0)
Assume items for which we don’t have page count information are “average”

Starting data

What’s my current situation at Michigan?

Total bibs: about 10M (but that includes a bunch of HathiTrust items and other electronic-only items that could never be checked out)
Total items checked out right now: 162,080

The first problem I run into is that I don’t know how many pages are in a given book. Well, in theory I can look in MARC field 300$a, and it will tell me.

Finding the number of pages in a book

I went through a recent dump of all our records and pulled out page counts from the 300 (those that matched the regular expression $$a\d+\s+[pP].).

Problem solved, right? Well, kind of

3,085,433 total bibs with page count data (about 30%)
40,872 checked out items with page count data (about 25%)

OK, so I don’t have data for everything. Plus, some of those are multi-volume works that list the total page count, even though only a single volume may be checked out.

We’ll have to drop down into statistics:

Average number of pages in a checked-out item: 270
Median number of pages in a checked-out item: 244

The median is lower, so we’ll go with that. Being conservative, remember?
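For the curious, the extraction and the two summary statistics can be sketched like this. The regular expression is a guess at the intent of the one quoted above (digits, whitespace, then "p." or "P."), and `page_count`, `mean`, and `median` are hypothetical helpers, not the script that produced these numbers.

```ruby
# Assumed shape of a usable 300$a value: digits, whitespace, then "p." or "P."
PAGE_COUNT = /(\d+)\s+[pP]\./

# Returns the page count as an Integer, or nil when the pattern is absent.
def page_count(subfield_a)
  match = PAGE_COUNT.match(subfield_a.to_s)
  match && match[1].to_i
end

def mean(values)
  return nil if values.empty?
  values.sum / values.length.to_f
end

def median(values)
  return nil if values.empty?
  sorted = values.sort
  mid = sorted.length / 2
  sorted.length.odd? ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0
end

# Made-up sample values; one has no page count and one is an outlier.
samples = ['xii, 244 p. : ill. ; 23 cm.', '1 v. (unpaged)', '96 p.', '1200 p. ; 30 cm.']
counts = samples.map { |s| page_count(s) }.compact
puts "mean=#{mean(counts)} median=#{median(counts)}"
```

A single very long item drags the mean up far more than the median, which is one reason the median is the safer choice for a conservative estimate.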

Bringing it all together

Obviously we need to make a lot of assumptions.

All paperbacks (== no space allowance for covers)
244 pages per item (the median of checked out items for which we have data)
Pages = 244 * 162,080 = 39,547,520 pages

So…what’s the damage?

But how to do the calculation?

It turns out that if you simply google book spine width calculator, a few come up.

I picked one and input 39,547,520 pages and assumed 50lb paper (the lightest paper in the tool).

Total width: 77,241.25 inches, or 6437 feet, or 1.22 miles
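The arithmetic is easy to replay. One caveat: the calculator's paper thickness is not stated anywhere, so the 512 pages per inch below is inferred by dividing the page total by the reported width (39,547,520 / 77,241.25 = 512); treat it as an assumption, along with the `shelf_estimate` helper name.

```ruby
ITEMS_CHECKED_OUT = 162_080
MEDIAN_PAGES      = 244
# Inferred from the reported result, not taken from the calculator itself.
PAGES_PER_INCH    = 512.0

# Returns total pages plus the spine width in inches, feet, and miles.
def shelf_estimate(items, pages_per_item, pages_per_inch)
  pages  = items * pages_per_item
  inches = pages / pages_per_inch
  feet   = inches / 12
  { pages: pages, inches: inches, feet: feet, miles: feet / 5280 }
end

est = shelf_estimate(ITEMS_CHECKED_OUT, MEDIAN_PAGES, PAGES_PER_INCH)
puts format('%d pages, %.2f inches, %.0f feet, %.2f miles',
            est[:pages], est[:inches], est[:feet], est[:miles])
```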

1.22 miles???

Well, we had a lot of assumptions, but most of them were pretty conservative. And I have no idea if the book spine calculator is at all accurate.

But…it’s gonna be a big number no matter what. Add in that many of them are hardcover, and this seems like a pretty good guess at a lower end.

What is this good for again?

Oh, nothing at all. Just a little fun while I’m at code4lib.

Next steps

Well, the best next step would be to walk away. This is a huge waste of time.

But…we could look in the 020s for a hint of whether it’s hardcover or paperback (which is really hard). And maybe try to figure out if multiple volumes of a multi-volume work are all checked out and take that into account.

But really: this is enough for me. Whether Michael wants to pursue it further on his own, well, that’s up to him.

Boosting on Exactish (anchored) phrase matching in Solr: (SST #4)

Bill Dueber — Mon, 19 Mar 2012 00:00:00 +0000

[Check out the introduction to the Stupid Solr Tricks series if you’re just joining us.]

Exact matching in Solr is easy. Use the default string type: all it does is, essentially, exact phrase matching. string is a great type for faceted values, where the only way we expect to search the index is via text pulled from the index itself. Query the index to get a value: use that value to re-query the index. Simple and self-contained.

But much of the time, we don’t want exact matching. We want exactish matching. You know, where things are exactly the same except. Except for case, or punctuation, or how much whitespace is between tokens. Maybe do some unicode folding, or stemming.

Essentially, we want to reward users (via high relevancy) for getting really close. If someone types in a full title, but misses a colon, well, let’s go ahead and assume they want that particular item.

Exactish matching vs phrase matching

Phrase matching in Solr does a great job, but fails those of us generating super-complex queries where we want to provide awesome service for those users doing known-item queries. If someone puts in the exact(ish) title, or the exact(ish) subject, well, those items should float to the top.

Solr’s default phrase matching (via, say, the pf param in dismax or just putting your query in quotes) doesn’t differentiate between a phrase that matches the whole target string and only part of that target string. For this, we’ll need a decent text fieldtype and a way to “anchor” the search to both ends of the target string.

Our goals

We’re shooting for:

A useful text type that we can use all over the place
A phrase match against that field that will match any portion of the target text. Solr already does this — that’s a normal Solr phrase search.
A “fully anchored” text type that will only phrase match if the query string exactishly-matches the whole field. We’ll phrase-search on this field and boost it way up.
And, what the heck, a left-anchored version that will exactish match a phrase only at the start of a field. We’ll boost this one up a bit less.

Follow along at home

Go ahead and clone the github repo I’ve been using if you haven’t already and let’s dig in.

  cd solr_stupid_tricks
  git pull origin master
  git fetch --all
  git checkout SST4
  java -jar start.jar &

There are some additions to the schema.xml file; let’s take a look!

Step 1: get a decent text type

The recent nightly build of Solr 3.x we’re using has a great tokenizer in ICUTokenizerFactory, which does “the right thing” across a whole host of languages.

Let’s take it bit by bit:

Obviously, start with the ICUTokenizer with a large positionIncrementGap so we can do some of the tricks we talked about last time
Next, we get one-stop shopping with the ICUFoldingFilterFactory. It provides all of the following:
- NFKC normalization (precomposing),
- Unicode case folding (i.e., lowercasing)
- search term folding (removing accents, etc).
Push in synonyms if you have any
Uncomment the WordDelimiterFilterFactory if you want to. I’m going to try to avoid it, since it messes with the number of tokens midstream and I worry about the effect on dismax and its mm parameter, as explained so excellently by Jonathan Rochkind.
Dealing with CJK (Chinese, Japanese, Korean) is hard. The CJK filters process those languages and provide overlapping bigrams so searching isn’t (I’m told) quite as painful. (I really, really recommend the above link for a great overview by Tom Burton-West).

Step 2: Set up parallel text types that anchor phrase matches to one or both ends

We’re going to use something new: a charFilter. This differs from a normal filter in that it affects the input string before tokenization.

Here’s the trick. We’re going to add anchoring text (I chose just ‘AAAA’ at the front and ‘ZZZZ’ at the end) to the normal text type, just by adding a simple charfilter.

Note that this charFilter actually adds two new tokens (‘AAAA’ and ‘ZZZZ’) to your token stream on both index and query. How does this help us?

Let’s look at indexing Mister Blue Sky in a normal text field. A normal solr phrase query q="Blue Sky" will match on that value, because the query phrase is fully contained in the indexed phrase.

But what happens if we index into a text_lr field?

Indexing Mister Blue Sky becomes aaaa mister blue sky zzzz
Search terms blue sky become aaaa blue sky zzzz
Phrase searching will then compare the two transformed values using normal Solr rules, find that the latter is not fully contained in the former as a phrase, and give up.
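
If it helps to see the idea outside of Solr, here’s a toy ruby simulation. It is only a sketch: the tokenizing is a crude lowercase-and-split, the method names are invented, and real Solr analysis and phrase matching do a great deal more. The point is just how the added tokens change what counts as a contained phrase.

```ruby
# Toy model of anchored phrase matching. Not Solr code; names are made up.
def tokens(text, left: false, right: false)
  toks = text.downcase.scan(/[[:alnum:]]+/)
  toks.unshift('aaaa') if left    # left anchor token
  toks.push('zzzz') if right      # right anchor token
  toks
end

# True when the query tokens appear as one contiguous run in the indexed tokens.
def phrase_match?(indexed, query)
  return false if query.empty?
  indexed.each_cons(query.length).any? { |window| window == query }
end

plain = phrase_match?(tokens('Mister Blue Sky'), tokens('blue sky'))
both  = phrase_match?(tokens('Mister Blue Sky', left: true, right: true),
                      tokens('blue sky', left: true, right: true))
puts "plain text field: #{plain}"
puts "fully anchored:   #{both}"
```

Passing only left: true gives the left-anchored behavior: the query has to line up with the start of the indexed value, but can stop anywhere.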

Be careful, though. That ‘aaaa’ and ‘zzzz’ are there just as if you’d typed them in. Thus every indexed value has the tokens ‘aaaa’ and ‘zzzz’, and every query will, in effect, include a query for ‘aaaa’ or ‘zzzz’ (depending on your mm settings).

That means that any non-phrase query will match every field that uses this fieldtype, and it will also mess with token counts with respect to your mm parameter. For those reasons, only ever use anchored fieldtypes for phrase queries when you want exactish matches.

By adding only one of ‘AAAA’ or ‘ZZZZ’, we can have left-anchored and right-anchored searches as well. See the schema.xml for these definitions.

Try it out!

Let’s take a small set of new documents:

 [
   {
     "id": "1",
     "title": "The Monkees: Pleasant Valley Never"
   },
   {
     "id": "2",
     "title": "The Monkees"
   },
   {
     "id": "3",
     "title": "Meet the Monkees"
   },
   {
     "id": "4",
     "title": "Corportate boy bands through the ages"
   }
 ]

We have copyFields set up to copy the title field to both a fully-anchored field (title_exact) and a left-anchored field (title_l).

If you’re following at home, clear out your solr and index them:

 cd exampledocs
 ./reset_and_index_json.sh exactish.json

We’ll now run three dismax queries, all of which use the search terms the monkees. Watch what happens to the score as we change things.

First, qf=title, pf=title^2. This will match the three Monkees documents, and then boost all of them because they all contain the phrase “the monkees” in the title.
Second, qf=title, pf=title_exact^10 title^2. This will match the same three Monkees documents, and then give a huge boost to the one with the exact match.
Finally, qf=title, pf=title_exact^10 title_l^5 title^2. There you’ll see the score for the exact title match go way up (relatively speaking, of course), and document 1 go up quite a bit (because it begins with the phrase “The Monkees”).
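
Written as ruby hashes, the three queries might look something like the sketch below. The qf and pf values are the ones listed above; the rest (defType, fl, the variable names) are my assumptions, and the actual exactish_query.rb in the repo may be laid out differently.

```ruby
# Sketch only: parameters shared by all three dismax queries.
base = {
  'defType' => 'dismax',
  'q'       => 'the monkees',
  'qf'      => 'title',
  'fl'      => 'score,*'
}

# Each query adds one more phrase field, with the biggest boost on the tightest match.
queries = [
  base.merge('pf' => 'title^2'),
  base.merge('pf' => 'title_exact^10 title^2'),
  base.merge('pf' => 'title_exact^10 title_l^5 title^2')
]

queries.each { |query| puts query['pf'] }
```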

You can run all three queries as:

 cd ruby
 ruby browse.rb exactish_query.rb
 # or ruby browse.rb exactish_query.rb json|xml|csv to get a different output type

[BTW, browse.rb will now take an array of queries to run in a single file.]

Tah Dah! You’ve successfully boosted the exactish match, and the left-anchored exactish match. Your known-item-searchers will thank you.

You may want to take a look at exactish_query.rb to see what’s going on.

To sum up

Your schema.xml now contains a decent text type and three variants for anchoring phrase searches left, right, and full (exactish)
The anchored text fields should NOT NOT NOT be searched against by anything other than a single phrase (which means they’re very useful in the pf param of a dismax search). A non-phrase search will trivially match every single document, so, you know, avoid that.
You now have a set of tools (field types, copyField directives, phrase search) that can be used to provide higher boosts to exactish matches and left-anchored exactish phrase matches.

Requiring/Preferring searches that don’t span multiple values (SST #3)

Bill Dueber — Fri, 09 Mar 2012 00:00:00 +0000

[Check out the introduction to the Stupid Solr Tricks series if you’re just joining us.]

Solr and multiValued fields

Here’s another thing you need to understand about Solr: it doesn’t really have fields that can take multiple values.

But Bill, you’re saying, sure it does. I mean, hell, it even has a ‘multiValued’ parameter.

First off: watch your language.

Second off: are you sure?

Let’s do a quick test. Look at the following documents

// exampledocs/names.json 
[
  {
    "id":1,
    "title":"The Monkees",
    "name_text":[
      "Peter Tork",
      "Mike Nesmith",
      "Micky Dolenz",
      "Davy Thomas Jones"
    ]
  },
  {
    "id":2,
    "title":"Heros of the Wild West",
    "name_text":[
      "Buck Jones",
      "Davy Crockett"
    ]
  }
]

Question: what do you get when you run this query against those two documents?

# ruby/names_query.rb  
{   
  'fl' => 'score, *',   'defType' => 'dismax',   'wt' => 'csv',   
  'qf' => 'name_text',  
  'q' => 'davy jones'   # Poor guy just died. So young. So short.
}

See how I threw the wt=csv in there? Check out all the query response formats if you’re interested, but really all you’ll use is standard (XML), json, or csv unless you’re rolling your own in some way.

I’ve updated ruby/browse.rb to allow a second argument of the type of output you want. You can now do ruby browse.rb jsonfile [json|csv|standard|xml]

Following along at home?

If so, let’s go ahead and index these documents and run the query.

java -jar start.jar
cd exampledocs 
./reset_and_index_json.sh names.json  
cd ../ruby  
ruby browse.rb names_query.rb

Here are the scores that I get:

score	id	title	name_text
0.42039964	2	Heros of the Wild West	[Buck Jones, Davy Crockett]
0.26274976	1	The Monkees	[Peter Tork,Mike Nesmith,Micky Dolenz,Davy Thomas Jones]

Check out that last column. The query was davy jones. Document #1 contains a name that has both those terms, but document #2 (which has both terms, but in different names) gets a higher score.

The relevance ranking seems…wrong

While it looks like we added four separate names to the name_text field in our first document, Solr doesn’t see it that way. Solr treats those four poor Monkees as if they had one long name.

Then it finds all the documents that match the query (both of our documents match) and figures out which is a better match by assigning a score.

In this case, while both documents have both query terms, the field in the second document is shorter. Which means that, essentially, a higher percentage of the terms in the field value match the given query terms. In Solr’s mind, that makes it a better match, and the shorter document shows up first.

Solr doesn’t automatically give more weight to the recently-dead Monkee because internally it doesn’t care that you’re thinking of those values as four separate names. It just concatenates them together and indexes them.

This is not, for most people, expected behavior.

Phrase slop

Part of what’s going on here is that we haven’t told Solr that it should care how close together the terms are.

One way to do that is to use a phrase query by throwing quotes around the terms

# Put double-quotes around it to make it a phrase query   
'q' => '"Davy Jones"'

…but that won’t find anything, because Davy and Jones aren’t right next to each other in our document.

Solr does allow a phrase query to be sloppy, though — basically saying that instead of being right next to each other, the terms need to be within a certain number of tokens of each other.

For that, we’ll tell solr to search against certain fields (pf) treating the query as a phrase, and allow a little slop (ps) as well.

#ruby/names_sloppy_query.rb   
{    
    'fl' => 'score, *',    
    'defType' => 'dismax',    
    'wt' => 'csv',     
    'q' => 'davy jones',     
    'qf' => 'name_text',    
    'pf' => 'name_text^10', # search this field as a phrase     
    'ps' => '4' # allow 'phrase' to mean 'within 4 tokens of each other'   
    }

That gets us something more expected.

   id,title,name_text,score   
   1,The Monkees,Peter Tork,Mike Nesmith,Micky Dolenz,Davy Thomas Jones,0.2806283  
   2,Heros of the Wild West,Buck Jones,Davy Crockett,0.029652705

Enter `positionIncrementGap`

OK. Now that we have the concept of slop, one of those mystery fieldtype parameters makes sense: positionIncrementGap. Basically, a positionIncrementGap of 1000 means “When computing slop, pretend there are 1000 tokens between the entries in a multiValued field.”

A sloppy phrase search, then, will only find (and thus boost) the phrase if (a) the tokens are in the same entry for a multiValued field, and (b) your slop value is less than your positionIncrementGap.

All you have to do is use the pf and ps parameters and you’re set.

Note that this should be telling you two things:

Always use the same positionIncrementGap for your multiValued fields
Make it a number much larger than the maximum number of tokens you expect to ever have in a field.

Note that a large positionIncrementGap doesn’t actually put 1000 tokens in there — a large value doesn’t affect processing time or your index size or anything.
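
Here’s a toy ruby model of what the gap does to token positions. It’s a sketch under some loud assumptions: whitespace tokenizing, invented method names, and a crude two-term closeness test that ignores the extra cost Lucene charges for out-of-order terms. It is not how Solr computes sloppy phrase matches internally.

```ruby
# Assign a position to every token, adding `gap` between the values
# of a multiValued field (toy model, not Solr code).
def positions(values, gap)
  index = Hash.new { |hash, key| hash[key] = [] }
  pos = 0
  values.each_with_index do |value, i|
    pos += gap if i > 0
    value.downcase.split.each do |token|
      index[token] << pos
      pos += 1
    end
  end
  index
end

# Crude check: are the two terms within `slop` tokens of each other?
def near?(index, term_a, term_b, slop)
  index.fetch(term_a, []).product(index.fetch(term_b, [])).any? do |a, b|
    (a - b).abs - 1 <= slop
  end
end

monkees = ['Peter Tork', 'Mike Nesmith', 'Micky Dolenz', 'Davy Thomas Jones']
west    = ['Buck Jones', 'Davy Crockett']

[0, 1000].each do |gap|
  puts "gap=#{gap} monkees=#{near?(positions(monkees, gap), 'davy', 'jones', 4)} " \
       "west=#{near?(positions(west, gap), 'davy', 'jones', 4)}"
end
```

With no gap, Buck Jones and Davy Crockett look like neighbors. With a gap of 1000 they are a thousand positions apart, so only the name that really contains both terms counts as close.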

But I’m already using the `pf` parameter!

Slop is great when you want it. But I don’t always want to use slop. Slop of 4 makes the phrase Sex in the City be treated exactly the same as In the Sex City. If someone puts in an exact title, I want to reward them for that query by floating the exact match to the top, and slop prevents me from doing so.

[Foreshadowing: We’ll talk about exact-ish matches in a few days.]

OK, so we can’t just appropriate the pf/ps parameters and push the slop value up all the time; that cripples our ability to create the query boost structure we want.

Query slop

So, dismax (and its cousin edismax) have an analogous parameter that affects only phrases within the normal query: qs.

qs is a dismax param that affects query slop — how much slop to allow in phrases within the query, much like the ps param.

The query

# A three-token query   
'q' => 'Bill "The Weasel" Dueber'

…has three tokens, the second of which (The Weasel) is a phrase. It’s that phrase token that is affected by query slop.

OK. So it affects only the phrases in the normal query. But…suppose we just force the entire query to be one big phrase? That’ll get us somewhere!

We just need to do the following:

Create a boost query that uses the same fields as the regular query
…but treats all the query terms as one big phrase
…and give it a query slop of one less than the positionIncrementGap in our field type definition (in my case, 999)

Package it up

OK, so here’s what we’re going to do. You can just take this basic idea and build it into your own queries in your application code. Try it. You might like it. Play around with what fields are affected, how much weight to give it, etc.

But heck, we’ve gone this far. Let’s encode it into the Solr configuration file solrconfig.xml itself as a custom request handler.

We’re going to extend our edismaxplus requestHandler from last time, but we’ll add an extra boost query that reflects this new “prefer documents where all the tokens appear in the same ‘line’ of a multiValued field” attitude.


   
  
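<requestHandler name="edismaxplus" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="rows">10</str>
    <str name="fl">*,score</str>
    <str name="echoParams">explicit</str>
    <str name="q">
      _query_:"{!edismax qf=$fields mm=$mymm v=$qwords bq=$boostForAll}"
    </str>
    <str name="mymm">0%</str>
    <str name="qwordsphrase">JunkThatWillNEverShowUpInAMillionFreakinYears</str>
    <str name="boostForAll">
      _query_:"{!edismax qf=$fields
                mm='100%'
                v=$qwords }"^5 OR
      _query_:"{!dismax  qf=$fields
                mm='100%'
                v=$qwordsphrase
                qs='999'}"^5
    </str>
  </lst>
</requestHandler>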

We now do a few new things:

(Line 15) Add a second clause to the boost query that use the same fields provided for the regular query (note the boolean OR between the two localparams queries that comprise this boost query)
(Line 17) Ask for another user-provided value: qwordsphrase, which your application-level stuff should set to the list of all the regular query terms, but as a single phrase. Basically, strip out all the double-quotes, then put the whole thing in double quotes. In ruby: qwordsphrase = '"' + qwords.gsub(/"/, '') + '"'
(Line 10) Provide a default value for the new qwordsphrase that won’t ever show up in a real query (empty string won’t work; I tried it and it throws an error). So, if the application doesn’t provide qwordsphrase, no harm is done — the search regresses to what we had last time.
(Line 18) Use a qs (query slop) of 999 in the new boost clause acting against qwordsphrase. That value is one less than the positionIncrementGap of 1000, making sure that we don’t cross multiValue boundaries.
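
On the application side, building qwordsphrase and the request URL could be sketched like this. The helper names are hypothetical, and the base URL assumes the local example Solr and the edismaxplus handler used throughout this series.

```ruby
require 'uri'

# Hypothetical helper: turn the raw query terms into one big phrase by
# dropping any double-quotes and wrapping the whole thing in double-quotes.
def phraseify(qwords)
  '"' + qwords.gsub('"', '').strip + '"'
end

# Hypothetical helper: build the request URL (base URL is an assumption).
def edismaxplus_url(qwords, fields, base = 'http://localhost:8983/solr/edismaxplus/')
  params = {
    'qwords'       => qwords,
    'fields'       => fields,
    'qwordsphrase' => phraseify(qwords)
  }
  "#{base}?#{URI.encode_www_form(params)}"
end

puts phraseify('Bill "The Weasel" Dueber')
puts edismaxplus_url('davy jones', 'name_text')
```

Letting URI.encode_www_form do the escaping means the double-quotes and spaces in the phrase don’t have to be escaped by hand.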

Note: If you wanted to, you could make this a filter query (fq) instead of a boost query to only allow documents that meet this criterion.

Let’s try it out!

Once again, if you did a git pull origin master you’ve got this up and running already — the updated requestHandler source is already in solr/conf/solrconfig.xml.

We first construct the query just like we did last week, without the qwordsphrase argument:

http://localhost:8983/solr/edismaxplus/?qwords=davy jones&fields=name_text

You’ll see Davy Crockett and friend appear as the first item.

But when you add the phraseified query, you’ll see the boost we’ve been talking about this whole post and get something more expected.

http://localhost:8983/solr/edismaxplus/?qwords=davy jones&fields=name_text&qwordsphrase="Davy Jones"

The Monkees are again on top! Party like it’s 1967!

Where it breaks down

If you actually have a phrase as one of your query terms, it will no longer be treated as a phrase during the boost because we’re getting rid of all the double-quotes.

And, of course, if you’ve got gobs of full-text and include your fulltext field, setting query slop to 999 isn’t just a cute trick, it’s a cute trick that will melt your servers to slag and still not do what you want it to do.

What have we learned?

…ish?

Solr doesn’t really separate multiple values from each other in a multiValued field
Phrase slop (ps) and query slop (qs) can be used to allow phrase to mean a bunch of tokens within X spots of each other
I’m A Believer is the best damn song Neil Diamond ever wrote.

Using localparams in Solr (or, how to boost records that contain all terms) (SST #2)

Bill Dueber — Tue, 06 Mar 2012 00:00:00 +0000

[Note: this isn’t so much a Stupid Solr Trick as a Thing You Should Probably Know; consider it required reading for the next SST. If you’re just joining us, check out the introduction to the Stupid Solr Tricks series]

What the heck is a localparams query?

A garden-variety Solr query URL looks something like this:

  http://localhost:8983/solr/select?
     defType=dismax
     &qf=name^2 place^1
     &q=Dueber

Which is fine, as far as it goes. But it’s easy to run into the limits of the standard query plugins (e.g., Dismax).

Say, for example, you want something like this:

  title:Constructivism AND author:Dueber

And furthermore, you have multiple underlying fields (title1, title2, title3, author1, author2).

The naïve approach would be to just do this:

  defType=dismax
  &qf=title1 title2 title3 author1 author2
  &q=Constructivism Dueber

But you can’t construct a dismax query with the boolean AND. You can with edismax, but even then you’ve got no way of telling (e)dismax that Constructivism must be found in the title fields, and Dueber must be found in the author fields. Dismax doesn’t do that.

Solution: Build a query of queries

The solution is to build a query made up of fully-encapsulated sub-queries. A localparams query has two forms (note that, of course, you’d need to URL-Escape the values):

  _query_:"{!dismax qf='field^2 otherfield^4'}my search terms"

or

  _query_:"{!dismax qf='field^2 otherfield^4' v=$q1}"
  &q1=my search terms

I far prefer the second form (which uses a second URL parameter q1 instead of sticking the search right in there), because I don’t have to worry about escaping double-quotes in the query terms (as you would if there’s a phrase as part of the query).

Once you’ve got these things, you can combine them with booleans.

  q=_query_:"{!dismax qf='title1 title2 title3' v=$q1}" AND
    _query_:"{!dismax qf='author1 author2' v=$q2}"
  &q1=Constructivism
  &q2=Dueber

[Note: be careful with solr booleans!!!]

You can add any local parameters you need (for dismax, stuff like mm, qs, pf, and ps) and you can use any query parser you want by changing what comes after the bang (e.g., {!lucene ...} or {!edismax...}).

In this way, you can build up arbitrarily complex queries using any available query parsers in combination with each other. Very powerful.
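
As a rough illustration, here’s how the title/author query above could be assembled in ruby. The subquery helper is something I made up for this sketch, and the base URL assumes a local example Solr; the field names are the ones from the example.

```ruby
require 'uri'

# Hypothetical helper: one fully-encapsulated localparams sub-query whose
# search terms live in a separate URL parameter (value_param).
def subquery(parser, qf, value_param)
  %(_query_:"{!#{parser} qf='#{qf}' v=$#{value_param}}")
end

params = {
  'q'  => [
    subquery('dismax', 'title1 title2 title3', 'q1'),
    subquery('dismax', 'author1 author2', 'q2')
  ].join(' AND '),
  'q1' => 'Constructivism',
  'q2' => 'Dueber'
}

# URI.encode_www_form takes care of the URL-escaping.
url = "http://localhost:8983/solr/select?#{URI.encode_www_form(params)}"
puts params['q']
puts url
```

Because the user’s terms only ever appear in q1 and q2, a phrase with double-quotes in it never has to be escaped inside the localparams string.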

An example: boost records that contain all terms

Just about everything in a localparams query can be pulled out in the way I pulled out the search terms above. Here’s a fairly-complex example (which, let’s be honest, would be a lot more complex if you were trying to inline and escape everything).

Scenario: We want to do a logical-OR search (mm=0%), but want to make sure we boost documents that contain all the search terms. This is necessary because sometimes a very long document with all the terms will have a lower score than a very short document with most of the terms.

Having short document with a few keywords show up before long documents with all the keywords will drive your librarians CraZy!!! So it’s tempting to just leave it alone. But let’s fix it anyway.

The gist of it is as follows:

Query against title and author
Use an mm of 0% (logical OR) for the main query
Use a pf to boost on a phrase in those same two fields (just common sense)
Set up a boost query (bq) to boost the score if all the search terms are present

To accomplish this, we’re going to have two localparams queries: one to be the main query, and another that we’re going to use as the boost query. This works in much the same way as our previous “AND-together two localparams queries” did.

[Presenting the URL parameters as a ruby hash to make it easier to read]

  {
    'q'   => '_query_:"{!dismax qf=$f1 mm=$mm1 pf=$f1 bq=$bq1 v=$q1}"',
    'mm1' => '0%',
    'f1'  => 'author^3 title^1',
    'q1'  => 'Dueber Constructivism',
    'bq1' => '_query_:"{!dismax qf=$f1 mm=\'100%\' v=$q1 }"^5',
    'fl'  => 'score,*'
  }

What’s nice about this is that I’m reusing the search terms (for the main query and the boost query) and field list (for the query field and the phrase fields) so I don’t have to repeat them.

Try along at home

First off, if you don’t have a browser that does nice XML and JSON formatting, well, get one. I use Chrome with JSONView and XMLTree, but I’m sure there are equivalents for Firefox. They’ll make your life easier.

By now you know the drill:

  cd solr_stupid_tricks
  git pull origin master
  git fetch origin master
  git checkout SST2 # I've started tagging the repo for these posts
  # ignore warning about "detached HEAD"
  java -jar start.jar &

We’ll want to empty out the index and put in some documents to work with. I’m presuming you have curl installed. If not…well, you’re on your own.

  cd exampledocs
  ./reset_and_index_json.sh localparams.json

You might want to take a look at the localparams.json file, which contains a set of documents in the new JSON update structure. The full Solr JSON Update structure allows repeated keys. Apparently, so does the JSON RFC:

2.2. Objects

An object structure is represented as a pair of curly brackets surrounding zero or more name/value pairs (or members). A name is a string. A single colon comes after each name, separating the name from the value. A single comma separates a value from a following name. The names within an object SHOULD be unique. (emphasis mine)

“SHOULD”. Not “MUST”. I don’t care if it’s legal. It still weirds me out.

Once you’ve got solr running in the background, you can go ahead and try our query!

If you’re really lazy, just click the link
If you’re slightly less lazy, and you’ve got ruby installed, take a look in the new ruby directory. You can run ruby browse.rb localparams_query.rb to run the query and have it automatically open up in your browser.
If you’re ambitious, you might want to actually mess with the localparams_query.rb file so you can try things out.

As a longish side note, we’ll probably use browse.rb in the future of this series as well, so you might want to go ahead and get ruby installed if you haven’t already. RVM is the easiest route if you’re on linux/OSX. You can also just install JRuby, seeing as how you’re running java anyway (just make sure to use 1.9 mode by calling stuff as jruby --1.9 myscript.rb or setting the environment variable export JRUBY_OPTS=--1.9).

Special Stupid Solr Trick: Make a special query handler for a complex query

OK, so I said I wouldn’t have a real SST in this episode, but it’s so damn long at this point I figure I’ve lost everyone except Rochkind (Hey, Jonathan!), so let’s throw one in.

The Solr configuration file solrconfig.xml is where you can configure custom search handlers. In such a custom handler, you can specify defaults (which, by default, can be overridden by passed-in parameters, although you can control that, too) — this is commonly used to, say, put in a q.alt or a filter query that will always be applied.

But we can use it to put in our special query defaults that boosts when a document contains all the terms:

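<requestHandler name="edismaxplus" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="rows">10</str>
    <str name="fl">*,score</str>
    <str name="echoParams">explicit</str>
    <str name="q">
      _query_:"{!edismax qf=$fields
                         mm=$mymm
                         v=$qwords
                         bq=$boostForAll}"
    </str>
    <str name="mymm">0%</str>
    <str name="boostForAll">
      _query_:"{!edismax qf=$fields
                         mm='100%'
                         v=$qwords }"^5
    </str>
  </lst>
</requestHandler>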

If you look closely, you’ll see that everything you need is defined in this requestHandler in the solrconfig.xml file, except for $fields and $qwords. You could also override mymm by passing in an argument with that name, if the default ‘0%’ isn’t to your liking.

If you’ve been following along at home, this requestHandler is already in the solrconfig.xml file that you’re running right now. Go ahead and try it! Let’s search for the terms ‘dueber’ and ‘penn’ and see if the correct record floats to the top.

http://localhost:8983/solr/edismaxplus/?qwords=dueber penn&fields=author title

Nifty, huh?

Next time we’ll use a local params query to get around something about dismax that drives me crazy: preventing (or penalizing) matches that go across a field’s multiple values.

Solr Field Type for numeric(ish) IDs (SST #1)

Bill Dueber — Thu, 01 Mar 2012 00:00:00 +0000

[For the introduction to this series, take a quick gander at the introduction]

Like everyone else in the library world, I’ve got a bunch of well-defined, well-controlled standard identifiers I need to keep track of and allow searching on.

You know, well-vetted stuff like this:

1234-5678
123-4567-890
12-34-567-X
0012-0045
ISBN13: 1234567890123
ISSN: 1234567X (1998-99)
ISSN (1998-99): 1234567X
1234567890 (hdk. 22 pgs)
9
Behind the 3rd floor desk
Henry VIII

[Note: some of these may be a titch exaggerated]

How does your system deal with these on index? How about on query?

Here’s an idea of how to use a custom solr fieldtype to do the heavy lifting.

What we’re shooting for

I’d like to be able to send in a text string as follows:

The input can contain other text besides the id
The ID starts with a digit and consists solely of digits and (optional) dashes, then ends with a digit and possibly a trailing ‘X’ or ‘x’ so we can deal with ISBN/ISSN
The ID has to be at least N characters long (for this example, I’m using N=8); this helps us avoid other text that might trivially look like an ID but isn’t.
Only the ID itself is indexed
If no valid ID is identified, nothing is indexed

The numericID field, suitable for ISBN/ISSN/OCLC/etc.

Let’s take a look at the end product and then walk through it.

Things we’ll be learning about today

NOTE: I really, really recommend taking a look at Scaling Lucene and Solr by the good folks over at Lucid Imagination for great, short explanations of omitNorms, term frequencies, etc.

Since this is the first post, I’ll go over some stuff that’s probably a little too basic for any audience that’s likely to show up here, but what the heck.

KeywordTokenizer
PatternReplaceFilterFactory
LowerCaseFilterFactory
LengthFilterFactory

Step 1: “Tokenize” to a single token

The job of a tokenizer is to decide how to split your input into individual tokens (often “words”), which are then munged by any filters you’re applying.

For the case of an ID, we don’t want to tokenize. At least at this juncture, I’m not trying to extract multiple valid IDs out of a single string; I’m just trying to determine if there’s a valid ID in there somewhere and throwing everything else away.

In other words, I’m going to treat the input as a single token, and then munge the bejeebers out of it in order to get what I want.

In the Solr world, that leads us to the confusingly-named KeywordTokenizer.

What we have now: exactly what we started with

Step 2: Find the first thing that looks like an ID and mark it

I primarily work in Ruby and Perl, which means the dramatic abuse of regular expressions is just part of my daily life.

Line 5 is our first use of a regexp in the filter chain via PatternReplaceFilterFactory.

The idea is to:

Find something that looks like a match
If found, get rid of everything else, and throw a ‘***’ onto the beginning so later on I can tell if I matched or not.

The second step is a little…odd…but necessary because I need a way to know if I found a candidate ID or not. If I did, well, there will be three asterisks on the front of the string from here on out. If not, there won’t.

This is a little confusing as these things go, so I’ll break it down.

Line 6: the match:

Skip any amount of stuff we don’t care about (.*?)
Match a number (\p{N}) (that’s unicode regexp syntax, if you haven’t seen it)
Match a string of at least 6 numbers and dashes
Close with an optional X or x [Xx]?
…and any trailing bits until the end of the string.

So…[number][six numbers/dashes][optional X]

At minimum, that’s seven digits/dashes.

Line 7: replacement

Replace the whole string (note how I anchored the match with ^ and $?) with whatever was matched inside the parentheses (represented here by $1) after prepending a set of three asterisks ‘***’

What we have now: If we found a candidate ID, we have that string prepended by ‘***’. Otherwise, we have exactly what we started with.
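To see the tagging step in isolation, here is a rough Ruby equivalent. The regex is reconstructed from the description above (skip, digit, six or more digits/dashes, optional X, trailing bits), so treat it as an approximation of what the schema contains; `tag_candidate` is a made-up name.

```ruby
# Reconstructed from the prose description; the schema's exact pattern may differ.
CANDIDATE = /\A.*?(\p{N}[\p{N}\-]{6,}[Xx]?).*\z/

# Replace the whole string with '***' plus the captured candidate,
# or return it unchanged when nothing looks like an ID.
def tag_candidate(token)
  token.sub(CANDIDATE, '***\1')
end

puts tag_candidate('ISSN: 1234567X (1998-99)')  # ***1234567X
puts tag_candidate('Henry VIII')                # Henry VIII
```

Because the pattern is anchored at both ends, a successful match always replaces the entire input, which is what makes the ‘***’ prefix a reliable marker for the later steps.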

Step 3: If we didn’t find a match, throw it all away

Line 9 shows an attempt to match on any string that starts with an asterisk (which we’re pretty sure we won’t see because that’s illegal lucene wildcard syntax). If we have a string that doesn’t start with an asterisk, then throw the whole damn thing away because we don’t have a candidate ID anyway.

[There’s a strong argument to be made that using an asterisk as the tagging character is a bad choice. Anyone have suggestions?]

What we have now: Either a candidate ID string preceded by ‘***’ or the empty string.

Step 4: Ditch the ‘***’ used to mark a candidate ID

Lines 10-11

Find the ‘***’ and throw it away.

What we have now: The raw candidate ID string or the empty string.

Step 5: Lowercase it

Line 12.

By ‘it’ I mean “any X that might be trailing the ID”; we should have thrown everything else away by now. (Note: could have done this with a PatternReplace as well, obviously; not sure why I’d choose one over the other.)

What we have now: The raw candidate ID string with its optional trailing ‘X’ lowercased, or the empty string

Step 6: Get rid of everything that’s not a number or an ‘x’

Lines 13-15

Ditch any dashes that are remaining. I’m doing it like this instead of just ditching the dashes because I’ll likely modify this at some point to allow, e.g., periods between numbers, or maybe spaces. This is safer.

Note the extra parameter (replace="all"), indicating that I want to replace all occurrences. This hasn’t been an issue until now because I’ve been careful to match the entire string by anchoring the pattern at the beginning (‘^’) and end (‘$’).

What we have now: A string of numbers possibly followed by an ‘x’, or the empty string.

Step 7: Make sure what we have is a reasonable length

Line 16

Now that we’ve gotten rid of the dashes, we need to make sure we have enough digits left to make a valid identifier.

If we didn’t match originally, it quickly got reduced to the empty string, and that will disappear here due to having length 0.

It’s also possible that our initial match was, say, ‘1----3-----6---7’, which will at this point have been reduced to just ‘1367’, too short for our taste.

In this version, I allow strings of any length between 7 (old OCLC number) and 14 (barcode).

What we have now: A string consisting purely of 7-14 characters, the last of which may be an ‘x’, or nothing at all (i.e., nothing will get indexed).

Step 8: Remove leading 0s

My ILS (Aleph) loves to zero-pad all its local identifiers. I’d rather get rid of them.

What we have now: What we had before, but with no leading zeros

Let’s try it!

If you’re following along at home, get the latest version of the schema and try it!

    cd solr_stupid_tricks
    git pull origin master
    java -jar start.jar

…and then:

Go to the analysis page at http://localhost:8983/solr/admin/analysis.jsp?highlight=on
Set the first line of the form to use Field: type and input numericID
Check the “verbose output” checkbox under Field value: index
Put in a test value and see what the analyzer gives you!

For those of you not following along at home, here are the examples from waaaaaay at the top of this post:

1234-5678 => 12345678
123-4567-890 => 1234567890
12-34-567-X => 1234567x
0012-0045 => 120045
ISBN13: 1234567890123 => 1234567890123
ISSN: 1234567X (1998-99) => 1234567x
ISSN (1998-99): 1234567X => 199899
1234567890 (hdk. 22 pgs) => 1234567890
9 => [nothing]
Behind the 3rd floor desk => [nothing]
Henry VIII => [nothing]

So…not too bad. We did miss one, mistaking a year range for a numeric ID, but if your data are that bad, there’s only so much we can do.
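If you don't have Solr handy, the eight steps can be approximated in plain Ruby. This is a sketch of the chain as described in the prose, not the schema itself: the regexes are reconstructions and `numeric_id` is a made-up name. The length bounds are parameters because, with the minimum of 7 given in Step 7, the six-digit 199899 from the year-range example would be dropped rather than indexed, so the real schema may well use a different lower bound.

```ruby
# Approximates the numericID analysis chain; returns nil when nothing is indexed.
def numeric_id(input, min: 7, max: 14)
  token = input.dup                                                 # 1. one single token
  token = token.sub(/\A.*?(\p{N}[\p{N}\-]{6,}[Xx]?).*\z/, '***\1')  # 2. tag a candidate
  token = token.sub(/\A[^*].*\z/, '')                               # 3. untagged? discard
  token = token.sub(/\A\*{3}/, '')                                  # 4. drop the tag
  token = token.downcase                                            # 5. X -> x
  token = token.gsub(/[^\p{N}x]/, '')                               # 6. digits and x only
  return nil unless token.length.between?(min, max)                 # 7. length check
  token.sub(/\A0+/, '')                                             # 8. strip leading zeros
end

p numeric_id('1234-5678')                 # "12345678"
p numeric_id('0012-0045')                 # "120045"
p numeric_id('12-34-567-X')               # "1234567x"
p numeric_id('Behind the 3rd floor desk') # nil
```

Note that the length check runs before the zero-stripping, which is why 0012-0045 survives even though only six characters are left at the end.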

Conclusions

Obviously, this is the tip of the iceberg with this sort of thing. And it can still be confused.

But it does follow our goal of having the exact same behavior on index and query, moving the logic to solr, and being pretty flexible.

Perfect? No. Useful? Yes.

Stupid Solr tricks: Introduction (SST #0)

Bill Dueber — Wed, 29 Feb 2012 00:00:00 +0000

Completed parts of the series:

Those of you who read this blog regularly (Hi Mom!) know that while we do a lot of stuff at the University of Michigan Library, our bread-and-butter these days are projects that center around Solr.

Right now, my production Solr is running an ancient nightly of version 1.4 (i.e., before 1.4 was even officially released), and reflects how much I didn’t know when we first started down this path. My primary responsibility is for Mirlyn, our catalog, but there’s plenty of smart people doing smart things around here, and I’d like to be one of them.

Solr has since advanced to 3.x (with version 4 on the horizon), and during that time I’ve learned a lot more about Solr and how to push it around. More importantly, I’ve learned a lot more about our data, the vagaries in the MARC/AACR2 that I process and how awful so much of it really is.

So…starting today I’m going to be doing some on-the-blog experiments with a new version of Solr, reflecting some of the problems I’ve run into and ways I think we can get more out of Solr.

Premise 1: put all the logic you possibly can into Solr

Much of what I’ll be doing is looking at new field type definitions that are appropriate (in my mind, anyway) for library data. Some of this stuff (e.g., normalizing ISBNs) would be a lot easier to do in your indexing code.

But then you’d have to do it again in your application to munge whatever is entered in the search box. And maybe it won’t be the same every time. Or maybe you don’t want to write a freakin’ parser to try to find anything that might look like an ISBN and mess with it.

I take it as gospel that you should put all your logic into the solr field analysis chain, so the exact same thing is happening on index and on query. That way, even if it’s wrong, at least it’ll be wrong in the exact same way and your users will find the stuff they’re looking for.

Premise 2: Doing it crappily is better than not doing it at all.

Look, the right way to do much of this stuff is by hacking on Solr itself, building custom field analyzers or filters or tokenizers that mess with the token chain and…

Wait. I already lost myself, and probably you, too. At some point, I’m going to do an actual sample custom filter for the new Solr codebase (the stuff I did once before is out-of-date); the example will be LCCN normalization and you’ll be able to follow along with me on this blog.

But in the meantime, we can do a lot of fairly ambitious stuff just by using and abusing the out-of-the-box stuff: pattern replacement filters, the existing tokenizers, etc. It might be ugly, and not very fast, but if I start getting the 200 hits a second that mean this is a bottleneck for me, I’ll be happy to deal with it then.

Premise 3: It’s always better to put something out there so smart people can tell you how to do it right

One of the disappointments in my life right now is that there isn’t more formal and informal discussion about what people are doing/trying. I’m sure it’s out there, but some of it is buried in a sea of application-level crap, and much of it is ignored by the people that really understand the data.

With luck, I’ll get comments from folks who really know their stuff and can tell me, in excruciating detail, exactly how I don’t. Please: correct me. I might not be the brightest guy in the room, but I know enough to try to outsource my thinking.

Follow along at home!

Option 1: Build your own current-trunk Solr

If you want to follow along at home, you’ll need a copy of the current source (not the 3.5 stable, since I use things like the ICUTokenizer coming in 3.6 / 4.0), which you can find and build from the Solr site.

Option 2: Just use what I’m using

Alternately, if you’re lazy (and who isn’t??), I’ve provided a github repo of the standard solr “example” directory you can nab and run on your own java-equipped machine.

Warning: the git repo is currently 60MB or so.

    git clone https://billdueber@github.com/billdueber/solr_stupid_tricks.git
    cd solr_stupid_tricks
    java -jar start.jar

…and then head to your local Solr Admin page on port 8983 to check things out. We’ll be spending most of our time in the analysis tab.

I’ll get the first post in the series up later today, and then every few days as I think of more things to talk about. I hope you’ll join me!

Another short personal note

Bill Dueber — Mon, 27 Feb 2012 00:00:00 +0000

The baby spent all last week in the hospital. Nothing life-threatening (so long as he was in the hospital and could get O2 when needed); it was just annoying.

So….here’s to a week-long hospital stay being able to be merely “annoying”. A tip of the hat to steady employment, generous sick/vacation policies, flexible co-workers, excellent insurance, and having a world-class hospital in town. This could have been a much, much worse week than it was.

Solr and boolean operators

Bill Dueber — Thu, 01 Dec 2011 00:00:00 +0000

[Summary: ALWAYS ALWAYS ALWAYS USE PARENTHESES TO GROUP BOOLEANS IN SOLR!!!]

What does Solr do, given the following query?

  a OR b AND c

I’ll give you three guesses, but you’ll get the first two wrong and won’t have any idea how to generate a third, so don’t spend too much time on it.

Boolean algebra and operator precedence

Anyone who’s had even a passing introduction to boolean algebra knows that it specifies a strict order to how the operators are bound: NOT before AND before OR. So, one might expect the following grouping:

   a OR (b AND c)

That’s guess one. It’s not how Solr does it.

Left to right?

Some naive students, and at least one programming language (Smalltalk), do a simple left-to-right evaluation. So you might go with:

   (a OR b) AND c

Nope. Wrong again.

So what’s left???

Excellent question. I don’t know the code well enough to know what’s going on underneath, but here’s what we get under the lucene query parser.

     (b AND c)

That’s right. The first term is thrown away. (More correctly, the first term is deemed “optional”.)

Do you let your users put AND/OR/NOT in their queries?

Hopefully, they don’t know any boolean algebra. If they do, hopefully they use parentheses, or you parse it out for them. And if not, well, they’re gonna be pretty damn confused.

It gets weirder

I populated a fresh solr (3.5) index with all possible subsets of the strings “curly”, “larry”, “moe”, and “shemp” (not Joe. Don’t talk to me about Joe). There are 15 of them, from the one-item ‘curly’ to all four at once.

I wrote a script to run a set of queries against the index under both lucene and edismax to see what I would get. In all cases the default lucene operator is ‘AND’ and the edismax mm parameter is set to 100% (equivalent to “all required”).

         Lucene                    EDismax
   -------------------------------------------------------
   1. curly AND larry
         curly larry               curly larry
         curly larry moe           curly larry moe
         curly larry shemp         curly larry shemp
         curly larry moe shemp     curly larry moe shemp

   2. curly AND larry OR moe
         curly                     curly larry
         curly larry               curly larry moe
         curly moe                 curly larry shemp
         curly shemp               curly larry moe shemp
         curly larry moe
         curly larry shemp
         curly moe shemp
         curly larry moe shemp

   3. curly OR larry AND moe
         larry moe                 larry moe
         curly larry moe           curly larry moe
         larry moe shemp           larry moe shemp
         curly larry moe shemp     curly larry moe shemp

   4. curly AND larry OR moe AND shemp
         curly moe shemp           curly larry moe shemp
         curly larry moe shemp

   5. moe AND shemp OR curly AND larry
         curly larry moe           curly larry moe shemp
         curly larry moe shemp

Query 1 is as expected. Query 2 apparently reduces to just ‘curly’ under the lucene parser and ‘curly AND larry’ under edismax (and query 3 similarly reduces to the two AND’d words). Queries 4 and 5 are…well, you can look at the debugQuery output to see what it gets, but not why. And then tell me how to explain it to a user.
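One model that reproduces the table, offered as a guess at the behavior rather than a reading of the Lucene source: the parser never builds a tree at all. It walks the terms left to right and just flags each one as required or optional, with AND making both neighbors required and (when the default operator is AND) OR making both neighbors optional. The sketch below implements that rule in Ruby; `flag_terms` and `hits` are made-up names, and the assumption that edismax behaves like the default-OR case is mine.

```ruby
STOOGES = %w[curly larry moe shemp]
# All 15 non-empty subsets, standing in for the indexed documents
DOCS = (1..STOOGES.size).flat_map { |n| STOOGES.combination(n).to_a }

# Flag each term :must or :should in a single left-to-right pass.
def flag_terms(query, default_op)
  flagged = []
  conj = nil
  query.split.each do |tok|
    if tok == 'AND' || tok == 'OR'
      conj = tok
      next
    end
    prev = flagged.last
    if prev
      prev[1] = :must   if conj == 'AND'                       # AND: previous term required
      prev[1] = :should if conj == 'OR' && default_op == :and  # OR: previous term optional
    end
    required = default_op == :and ? conj != 'OR' : conj == 'AND'
    flagged << [tok, required ? :must : :should]
    conj = nil
  end
  flagged
end

# A document matches if it has every :must term
# (or, when there are none, at least one of the terms).
def hits(query, default_op)
  flagged = flag_terms(query, default_op)
  musts   = flagged.select { |_, occur| occur == :must }.map(&:first)
  terms   = flagged.map(&:first)
  DOCS.select do |doc|
    musts.empty? ? terms.any? { |t| doc.include?(t) } : musts.all? { |t| doc.include?(t) }
  end
end

p flag_terms('a OR b AND c', :and)
puts hits('curly AND larry OR moe AND shemp', :and).map { |d| d.join(' ') }
```

With `:and` as the default operator, this gives the Lucene column for all five queries; with `:or` it gives the EDismax column. That doesn't make the behavior any easier to explain to a user, but it does make it predictable.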

Where does this leave us?

The good news is that both lucene and edismax behave predictably when you use parentheses for grouping. So do that.

I’m generally not one to complain about open-source software, at least partially because I don’t have the chops to do anything about it most of the time, but I don’t understand how this could seem OK to anyone. There are a couple lucene Jira tickets (Lucene-167 and Lucene-1823) and a 2005 mailing list thread denouncing the current behavior, but it persists.

Until the Solr/Lucene powers that be decide to tackle this, the rest of us will either have to write pre-parsers to make sure users get something sensible, or cripple our applications to disallow unrestricted boolean queries.

A short personal note

Bill Dueber — Tue, 11 Oct 2011 00:00:00 +0000

We had another baby.

Shai Brown Dueber was born last Monday, the 3rd, at a very moderate 7lbs 7.2oz (his brothers were 9lbs and 9.5lbs). Mother, baby, and older brothers are all doing well. Father is freakin’ tired.

Even better, even simpler multithreading with JRuby

Bill Dueber — Fri, 01 Jul 2011 00:00:00 +0000

[Yes, another post about ruby code; I’ll get back to library stuff soon.]

Quite a while ago, I released a little gem called threach (for “threaded #each”). It allows you to easily process a block with multiple threads.

    # Process a CSV file with three threads
    File.open('data.csv').threach(3, :each_line) {|line| send_to_db(line)}

Nice, right?

The problem is that I could never figure out a way to deal with a break or an Exception raised inside the block. The core problem is that once a thread trying to push/pop from a ruby SizedQueue is blocking, there’s no way (I could find) to tell it to wake up and see if there’s an error from another thread floating around that needs to be addressed.

So, I got into a pattern of running my code with each for a while, debugging, and eventually doing the production run under threach. Which is just dumb. Then I’d try to re-write threach to deal with this stuff using a different approach (mutexes, lightweight events), quickly (or not so quickly) fail, give up, and start again.

So…let’s not worry about MRI for the moment. I run all my big jobs under JRuby these days anyway, and there I can take advantage of Java’s blocking queues that have timeouts. When a queue operation times out, I can check to see if there’s been a break or an exception thrown in the meantime and behave appropriately.
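The shape of that idea can be sketched in plain Ruby. This is not the gem's code: `each_threaded` is a made-up name, and instead of Java's timed queue operations it uses non-blocking push/pop plus a short sleep as a stand-in, so it also runs under MRI. The point is only that no thread ever blocks forever on the queue, so each one gets regular chances to notice that another thread has failed.

```ruby
# Hypothetical stand-in for threach: run the block over enum with nthreads
# workers; stop everything and re-raise if any worker raises.
def each_threaded(enum, nthreads, poll: 0.01, &block)
  queue = SizedQueue.new(nthreads * 2)
  done  = false  # set once the producer has queued everything
  error = nil    # first exception raised by any worker

  workers = Array.new(nthreads) do
    Thread.new do
      until error
        begin
          item = queue.pop(true)          # non-blocking pop
        rescue ThreadError                # queue was empty: our "timeout"
          break if done && queue.empty?
          sleep poll
          next
        end
        begin
          block.call(item)
        rescue StandardError => e
          error ||= e                     # remember it; everyone else will notice
        end
      end
    end
  end

  enum.each do |item|
    begin
      queue.push(item, true)              # non-blocking push
    rescue ThreadError                    # queue was full
      break if error
      sleep poll
      retry
    end
    break if error
  end
  done = true
  workers.each(&:join)
  raise error if error
end

begin
  each_threaded(1..1000, 2) { |i| raise ArgumentError, "bad item #{i}" if i == 5 }
rescue ArgumentError => e
  puts "stopped early: #{e.message}"
end
```

Polling with a sleep is cruder than a real timed `poll`/`offer`, and this sketch doesn't attempt to handle `break` from inside the block at all, which is part of what the Java queues make easier.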

The result is the gem jruby_threach. It works just like threach, except that, you know, it actually works the way I’d like it to.

    require 'jruby_threach'
    File.open('data.csv').threach(3, :each_line) {|line| send_to_db(line)}

Looks familiar, doesn’t it.

But you can also break out of the loop.

    myarray.threach(2) do |item|
      break if item_indicates_to_break(item)
      if item == :really_bad_value
        raise RuntimeError, "Something's really wrong", nil
      end
      process_item(item)
    end

Any exceptions that are rescued within the block are handled internally and don’t cause processing to stop. Any that are not handled within the block are noticed by threach, cause the processing to stop, and are then re-raised so you can deal with them outside of threach.

    reader = SpecializedFileReader.new(filename)

    begin
      reader.threach(2) do |item|
        process_item(item)
      end
    rescue SpecializedFileReaderError
      # deal with the fact that the reader failed
    rescue Exception
      # deal with the problem processing the item
    end

Dealing with the underlying Java data structures makes life a lot easier. To the point that I added an enhancement — threading production as well.

    # Use two threads to read lines from files, and another three threads
    # to process the data that comes out of those files.
    Dir.glob("*.csv").map{|f| File.open(f)}.mthreach(2,3) do |item|
      send_item_to_database(item)
    end

mthreach basically allows you to treat an array of Enumerables as a single logical entity, multithreading both the producer and consumer sides of the operation. There aren’t a whole lot of obvious use cases, but it can certainly come in handy.

You can also access the underlying class that aggregates multiple enumerables directly.

    require 'jruby_threach'
    me = Threach::MultiEnum.new(
      [enum1, enum2, enum3], # enumerables
      threads,               # How many threads to use to
      :each_with_index,      # the iterator to call on the enumerables
      size                   # size of the under-the-hood queue
    )

    # Note that like threach, calling #each against a MultiEnum actually
    # calls the iterator you sent in (in this case, #each_with_index)
    me.each {|item| process_item(item)}

Using SQLite3 from JRuby without ActiveRecord

Bill Dueber — Thu, 26 May 2011 00:00:00 +0000

I spent way too long asking my friend, The Internet, how to get a normal DBI connection to SQLite3 using JRuby. Apparently, everyone except me is using ActiveRecord and/or Rails and doesn’t want to just connect to the database.

But I do. Here’s how.

First, get the gems:

    gem install dbi
    gem install dbd-jdbc
    gem install jdbc-sqlite3

Then you’re ready to load it up into DBI.

    require 'rubygems' # if you're using 1.8 still
    require 'java'
    require 'dbi'
    require 'dbd/jdbc'
    require 'jdbc/sqlite3'

    databasefile = 'test.db'
    dbh = DBI.connect(
      "DBI:jdbc:sqlite:#{databasefile}",  # connection string
      '',                                 # no username for sqlite3
      '',                                 # no password for sqlite3
      'driver' => 'org.sqlite.JDBC')      # need to set the driver

    # That's it. Everything below here is stock DBI

    dbh.do "create table squares (i integer, isquared integer)"

    ins = dbh.prepare("insert into squares values (?, ?)")
    (1..20).each do |i|
      ins.execute(i, i*i)
    end

How good is our relevancy ranking?

Bill Dueber — Wed, 25 May 2011 00:00:00 +0000

For those of us that spend our days trying to tweak Mirlyn to make it better, one of the most important — and, in many ways, most opaque — questions is, “How good is our relevancy ranking?”

Research from the UMich Library’s Usability Group (pdf; 600k) points to the importance of relevancy ranking for both known-item searches and discovery, but mapping search terms to the “best” results involves crawling deep inside the searcher’s head to know what she’s looking for.

So, what can we do?

Record interaction as a way of showing interest

One possibility is to look at those records that are somehow “touched” by a user in such a way that we can log it. If a user bothers to interact with an individual record, we’ll assume the record is interesting to her in the context of the current search.

There are three links associated with an individual record that a user can click on from the search results:

(62% of all record interactions) The title
(28%) An external link (HathiTrust, Google Books, or one of our vendors)
(10%) The “see holdings” link for those items that have multiple holdings

Our first issue arises quickly: only about a quarter of Mirlyn sessions contain any of these actions. For a full 75% of sessions, we have no data about which records users are paying attention to. They get a call number — or determine they have a failed search — and move on.

Where on the page do users interact with items?

We don’t know how users that interact with items differ from those that don’t. But for those that do, more than half of all record interactions are with the first record.

Here are the numbers for the first five records:

First record: 54%
Second record: 12%
Third record: 6%
Fourth record: 3.7%
Fifth record: 2.5%

More than 75% of all record interactions are with the first four items on the first page of results.

What does it all mean?

Frustratingly, we don’t know. Several possibilities are obvious:

we’re doing a good job with relevancy ranking
people do mostly known-item searches
people don’t bother looking past the first few results
excellent general search engines (e.g., Google) have trained people to believe that the first result is always worth a closer look.

The interactions between these (and unknown other) factors are likely complex.

In the meantime, though, to the extent these data can be extended to the general case (not at all obvious), we’re not doing too bad of a job.

Ruby gem library_stdnums goes to version 1.0

Bill Dueber — Fri, 06 May 2011 00:00:00 +0000

I just released another (this time pretty good) version of my gem for normalizing/validating library standard numbers, library_stdnums (github source / docs).

The short version of the functions available:

ISBN: get checkdigit, validate, convert isbn10 to/from isbn13, normalize (to 13-digit)
ISSN: get checkdigit, validate, normalize
LCCN: validate, normalize

Validation of LCCNs doesn’t involve a checkdigit; I basically just normalize whatever is sent in and then see if the result is syntactically valid.
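For anyone who hasn't met the checkdigit arithmetic before, here is a plain-Ruby illustration of the standard ISBN-10 and ISBN-13 calculations. It is not the library_stdnums code and the method names are made up; it only shows what “get checkdigit”, “validate”, and “convert” boil down to for an already-cleaned string of digits.

```ruby
# Check digit for the first nine digits of an ISBN-10 ('X' stands for 10)
def isbn10_checkdigit(first9)
  sum = first9.chars.each_with_index.map { |d, i| d.to_i * (10 - i) }.sum
  check = (11 - sum % 11) % 11
  check == 10 ? 'X' : check.to_s
end

# Check digit for the first twelve digits of an ISBN-13 (weights 1,3,1,3,...)
def isbn13_checkdigit(first12)
  sum = first12.chars.each_with_index.map { |d, i| d.to_i * (i.even? ? 1 : 3) }.sum
  ((10 - sum % 10) % 10).to_s
end

# Syntactically an ISBN-10, and the last character matches the computed check
def valid_isbn10?(isbn)
  isbn.match?(/\A\d{9}[\dX]\z/) && isbn10_checkdigit(isbn[0, 9]) == isbn[9]
end

# Prefix with 978, drop the old check digit, compute the new one
def isbn10_to_13(isbn10)
  body = '978' + isbn10[0, 9]
  body + isbn13_checkdigit(body)
end

puts isbn10_checkdigit('030640615')  # 2
puts isbn10_to_13('0306406152')      # 9780306406157
puts valid_isbn10?('0306406153')     # false
```

Real-world input needs cleaning first (dashes, lowercase x, trailing notes), which is exactly the part a normalizer exists to handle.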

My plan in my Copious Free Time is to do a Java version of these as well and then stick them into a new-style Solr v.3 filter so I (and, by extension, you, if you’re interested) can have Solr do normalization during both index and search time.

A short ruby diversion: cost of flow control under Ruby

Bill Dueber — Tue, 03 May 2011 00:00:00 +0000

A couple days ago I decided to finally get back to working on threach to try to deal with problems it had — essentially, it didn’t deal well with non-local exits due to calls to break or even something simple like a NoMethodError.

[BTW, I think I managed it. As near as I can tell, threach version 0.4 won’t deadlock anymore]

Along the way, while trying to figure out how threads affect the behavior of different non-local exits, I noticed that in some cases there was still work being done by one or more threads long after there was an exception raised.

I re-discovered something that a lot of people already know: raise/rescue under MRI is slow, and under JRuby can be unbearably slow. How slow?

Let’s look at four simple blocks that exercise four different block exit strategies: break, catch and throw, raise with the normal single (or zero) arguments, as well as the three-argument version of raise.

Simple break

    range.each do |i|
      break
    end

Catch/Throw

    catch(:benchmarking) do
      range.each do |i|
        throw(:benchmarking)
      end
    end

Raise (1 arg)

    begin
      range.each do |i|
        raise StandardError
      end
    rescue
      # do nothing
    end

Raise (3 args)

    begin
      range.each do |i|
        raise StandardError, :hi, nil
      end
    rescue
      # do nothing
    end

In each case, we immediately exit the block without doing any work; the idea is to measure how long it takes to break out for each case.

So….let’s run them each 100K times and see what happens, shall we? Times are in seconds, averaged over two runs.

                     Ruby 1.8   Ruby 1.9   JRuby   JRuby --1.9
    break            0.12       0.07       0.29    0.21
    catch/throw      0.35       0.28       0.64    0.48
    raise (1 arg)    1.78       2.10       26.60   22.06
    raise (3 arg)    1.85       2.13       0.45    0.45
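Timings like these could be collected with Ruby's standard Benchmark module. The harness below is a sketch, not the script behind the table: the range, the variable names, and the single run (rather than an average of two) are all assumptions, and the numbers on a current Ruby will look nothing like the 2011 figures.

```ruby
require 'benchmark'

ITERATIONS = 100_000
range = (1..10)  # any non-empty range will do; every block exits on the first item

strategies = {
  'break'          => -> { range.each { |_i| break } },
  'catch/throw'    => -> { catch(:benchmarking) { range.each { |_i| throw(:benchmarking) } } },
  'raise (1 arg)'  => -> {
    begin
      range.each { |_i| raise StandardError }
    rescue StandardError
      # do nothing
    end
  },
  'raise (3 args)' => -> {
    begin
      # third argument (the backtrace) is nil, as in the post
      range.each { |_i| raise StandardError, 'hi', nil }
    rescue StandardError
      # do nothing
    end
  }
}

# Wall-clock seconds for ITERATIONS calls of each strategy
timings = strategies.transform_values do |strategy|
  Benchmark.realtime { ITERATIONS.times { strategy.call } }
end

timings.each { |name, secs| puts format('%-15s %.2f', name, secs) }
```

Running it under different interpreters is the interesting part; within a single interpreter the lambda call overhead is the same for all four rows, so the differences between rows are still meaningful.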

The first thing to note is that this is 100K iterations. Three of the strategies are fast enough that you’d have to work really, really hard to notice them.

In terms of speed, raise (3 args), catch/throw, and break are fast enough that you shouldn’t bother worrying about them (although you should choose the method that makes your code easy to understand).

The second thing to note is Holy Camoli! JRuby is slow there!

This Jira ticket tells the tale: The creation of the backtrace is very, very expensive for JRuby. That nil at the end of the raise (3 args) call suppresses the creation of that backtrace, so the speed is fine.

Three things worth saying here:

If you’re using raise/rescue for flow control, you’re already doing it wrong. Reserve exceptions for, well, exceptional conditions that are only going to be raised once or twice, not all the time.
If you’re writing code that, for some ungodly reason, is planning on raising a crapload of exceptions, use the three-arg version. I’m looking at you, gem authors.
If you’re writing your code without worrying about how it will work under multiple threads, well, please don’t do that. Everyone has multi-core systems these days, and it’s silly to not be able to use them. Plus, counting on Matz to never move to a VM with real threads is a big gamble.

ISBN parenthetical notes: Bad MARC data #1

Bill Dueber — Tue, 12 Apr 2011 00:00:00 +0000

Yesterday, I gave a brief overview of why free text is hard to deal with.

Today, I’m turning my attention to a concrete example that drives me absolutely batshit crazy: taking a perfectly good unique-id field (in this case, the ISBN in the 020) and appending stuff onto the end of it.

The point is not to mock anything. Mocking will, however, be included for free.

What’s supposed to be in the 020?

Well, for starters, an ISBN (10 or 13 digit, we’re not picky).

Let’s not worry, for the moment, about the actual ISBN and whether it’s valid or not.

Wait, no, let’s go ahead and worry about it. It’s an easy enough script to write, although it takes a while to run.

    8,630,794  Total records
    3,220,666  Total 020a's
        6,498  020a's that don't obviously contain an ISBN
        8,407  that look like an ISBN but fail checksum test:

    ... so 0.26% of the ISBNs have invalid checksums

So, not bad at all, especially considering some of those are known to be bad, but are transcribed dutifully from the actual (mis-)printed book.

A lot of the malformed data (anything from which I can’t seem to extract something that looks like an ISBN) is pricing data, and most of it appears in system numbers that are close enough to each other that I presume it was just a bad batch.

What goes after the ISBN in the 020?

I’m no cataloger, of course, but it looks to me like the answer is “Something about how the book is bound together, or the publisher, unless you want to put something else there, and then, really, go ahead, because it’s not like anyone is ever going to want to parse this out, all we need to do is print cards with it for god’s sake.”

No, I kid, I kid! The actual rules are in Library of Congress Rule Interpretation 1.8, which reads, in part:

For a hardbound resource, there is no attempt to use a consistent term other than to use one that conveys the condition intelligibly.

I think it’s important to read that a second time, because it succinctly conveys the culture in which these rules were devised.

Don’t worry about consistency, because your only reader is human.
Defer to the cataloger.
Being complete is more important than being consistent.
Base your notes on your subjective view of the actual, physical item you’re presumed to be holding in your hands.

Interestingly (to me, anyway), it looks like the OCLC once had a (now deprecated) $$b subfield for binding information. Apparently it didn’t catch on.

What did I find?

So, let’s pretend I’d like to be able to differentiate between paperback and hardbound books. Probably useful, yes?

I went ahead and took all parenthetical notes from any subfield of the 020, split them on colon (’cause that seems to be the way they roll) and did some basic normalization:

Eliminate numbers (so ‘vol. 1’ and ‘vol. 2’ count as only one pattern)
Lowercase everything
Turn runs of spaces into a single space
Trim leading/trailing spaces
Remove any trailing punctuation
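In Ruby, those steps might look something like the sketch below. The method names and sample values are invented for illustration, and a couple of details are guesses: the order of the steps, and whether numbers are deleted outright (as here) or swapped for a placeholder.

```ruby
# Pull every "(...)" note out of an 020 value and split each one on colons
def parenthetical_notes(value)
  value.scan(/\(([^)]*)\)/).flatten.flat_map { |note| note.split(':') }
end

def normalize_note(note)
  note.gsub(/\d+/, '')            # eliminate numbers
      .downcase                   # lowercase everything
      .gsub(/\s+/, ' ')           # runs of spaces become a single space
      .strip                      # trim leading/trailing spaces
      .sub(/[[:punct:]]+\z/, '')  # remove any trailing punctuation
end

# Count normalized notes across many values, most common first
def note_counts(values)
  counts = Hash.new(0)
  values.each do |value|
    parenthetical_notes(value).each do |note|
      normalized = normalize_note(note)
      counts[normalized] += 1 unless normalized.empty?
    end
  end
  counts.sort_by { |_, n| -n }
end

# Made-up sample values, not real catalog data
samples = ['0306406152 (pbk. : alk. paper)', '9780306406157 (Pbk.)', '080442957X (v. 1 : hd.bd.)']
note_counts(samples).each { |note, n| puts "#{n} #{note}" }
```

Only trailing punctuation is removed, which is why internal periods survive in entries like ‘alk. paper’ and ‘hd.bd’.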

I found 1,506,729 parenthetical remarks in the 020 subfields of our catalog.

The top twenty most common entries using those normalizations are:

402537 pbk
387406 alk. paper
99260 v # (e.g., “v. 1”, “v. 22”, etc.)
82918 cloth
51125 hbk
42036 electronic bk
41360 acid-free paper
38792 hardcover
28913 set
20358 hardback
19160 ebook
16264 paper
15269 u.s
12770 hd.bd
11793 print
10625 lib. bdg
10520 hc
8772 est
7767 pb
7639 hard

The kicker? These are the top twenty of 13,374 unique parenthetical strings found in the 020 field. Many of them are publishers, or cities, or whatnot, but an awful lot of them are variations on “hardcover” and “paperback.”

For example, a quick search for anything that might be “hard” (regexp: /h[ar]{0,2}d/) got me started on a list. Here’s just the 90 examples from that list that start with ‘h’:

hard | hard adhesive | hard back | hard bd | hard book | hard bound | hard bound book | hard boundhard case | hard casehard copy | hard copy | hard copy set | hard cov | hard cover | hard covers | hard sewn | hard signed | hard-backhard-backcased | hard-bound | hard-cover | hard-cover acid-free | hardb | hard\cover | hardbach | hardback | hardback book | hardback cover | hardbackcased | hardbd | hardbk | hardbond | hardbook | hardboubd | hardbound | hardboundhardboundtion | hardc | hardcase | hardcopy | hardcopy publication | hardcov | hardcov er | hardcovcer | hardcove | hardcover | hardcover-alk. paper | hardcovercloth | hardcoverflexibound | hardcoverhardcoverwith cd | hardcoverr | hardcovers | hardcoversame | hardcoversame as above | hardcoverset | hardcovertion | hardcver | hardcvoer | hardcvr | harddback | harde | hardocover | hardover | hardpack | hardpaper | hardvocer | hardware | hd | hd bd | hd. bd | hd. bd. in slip case | hd. bd.in sl.cs | hd. bk | hd. cover | hd.bd | hd.bd. in box | hdb | hdbd | hdbk | hdbkb | hdbkhdbk | hdbnd | hdc | hdcvr | hdk | hdp | hdpk | hradback | hradcover | hrd | hrdbk | hrdcver | hrdcvr

And that’s after eliminating things like places of publication, strings like “with…”, “plus…”, “alk. paper”, etc.
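If you want to try the same fishing expedition, here is a minimal sketch of the filter. The regexp is the one quoted above; the list of notes is a tiny hand-picked sample, and the variable names are mine.

```ruby
# The "might be hard" pattern from the text: an h, up to two of a/r, then a d.
HARDISH = /h[ar]{0,2}d/

# Hand-picked sample of already-normalized notes (not the real data set).
notes = ['hardcover', 'hd.bd', 'hrdcvr', 'hradback', 'pbk', 'cloth',
         'lib. bdg', 'orchard press']

candidates    = notes.grep(HARDISH)
starts_with_h = candidates.select { |n| n.start_with?('h') }

puts candidates.inspect
puts starts_with_h.inspect
```

The pattern is deliberately loose, so it also drags in things like "orchard press" (and "hardware" in the real list). It gets you a list to start weeding, nothing more.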

“Yeah, but you have to understand that historically…”

Stop hiding behind that.

I understand that at one point in time it probably made sense (to someone at least) to do it this way. I can deal with that.

What I can’t accept is that as I type this there’s a cataloger doing this in this way. Today. April 2011. Some, what? maybe thirty years since computer-based OPACs became prevalent?

These sorts of problems were recognized ages ago and should have been dealt with. Add a subfield. Invent a controlled vocabulary. Don’t worry about the legacy data; it’s always going to suck.

But why are we still producing sucky data???

To sum up

The point is that there’s a better way to do this stuff. Lots and lots of better ways, in fact. Time I spend dealing with crappy data is time I don’t spend making relevancy ranking better, or building a better command language search option for my librarians, or working on ways to get a decent “more like this”.

The need is both dire and urgent; the latter because sooner or later we’re going to have to go to a “two state solution” with traditional MARC21 for many of our records and whatever comes next (RDA?) for the newer stuff. And every day we wait, that first category grows, and the growth rate keeps increasing.

And then there’s serials. Don’t talk to me about serials.

Why programmers hate free text in MARC records

Bill Dueber — Mon, 11 Apr 2011 00:00:00 +0000

One of the frustrating things about dealing with MARC (nee AACR2) data is how much nonsense is stored in free text when a unique identifier in a well-defined place would have done a much better job.

A lot of people seem to not understand why.

This post, then, is for all the catalogers out there who constantly answer my questions with, “Well, it depends” and don’t understand why that’s a problem.

Description vs Findability

I’m surprised — and a little dismayed — by how often I talk to people in the library world who don’t understand the difference between description and findability. AACR2 is clearly designed for description; once you’ve found a record, it does a pretty good job telling a human being what she’s looking at. With respect to a person who’s already got a copy of the record in her (virtual) hand, strings of text and reasonable abbreviations are…well, often good enough, let’s say.

But much of AACR2 is a giant mountain of fail when it comes to supporting findability — the ability for a machine to slice and dice the data in ways that can be mapped onto searches and transformations. What those of us on the business end of the computer need are well-defined values stuck into well-defined places that represent well-defined relationships.

Free text stuck on the end of a field fails all three of those criteria.

Machine Reasoning vs. Machine Parsing

When many people look at something like RDF, their first reaction is, “Great Googally Moogally! Just tell me the language! I don’t want to follow a chain of reasoning that’s seventeen steps long just to figure out the damn thing is in English!!!”

Of course you don’t. And you don’t have to. Someone — hopefully someone smarter than me — needs to write a program to do it. And we can.

Following all that logic — deriving relationships, figuring out eventual values, determining how to convert between various forms — is what I’ll call (for simplicity’s sake) machine reasoning. And machine reasoning — for the purposes of this discussion, anyway — is a solved problem. I’m not saying it’s not hard, and I’m not saying it might not take gobs of hardware resources. But we, the collective of humanity, know how to do it.

On the other hand, machine parsing — looking at all that free text that is sprinkled throughout our records and trying to turn it into something that is susceptible to machine reasoning — is vehemently not a solved problem. Even if you ignore all the misspellings, we’re still stuck with one-off abbreviations, lack of ordering, gobs of “local practice,” and iffy punctuation.

And, come to think of it, you can’t ignore the misspellings, either.

The point is this: good data trumps everything else. If there’s good, solid, well-defined data in computable places, we can (given some time) do damn near anything with it. If there’s human-entered, free-text, parenthetical-remark-type data, we’re pretty much stuck.

Examples?

Jonathan Rochkind just did a great post looking at LC call numbers, and how, well, they might be in a few different places, and may or may not be valid LC call numbers, and so on and on and on and on.

And my next post (hopefully tomorrow) will be an analysis of the first freetext in MARC I ever tried to deal with — the parenthetical remarks in the 020 (ISBN) field. If that doesn’t keep you up all night, well, I don’t know what will.

Corrected Code4Lib slides are up

Bill Dueber — Tue, 15 Feb 2011 00:00:00 +0000

…at the same URL.

I was, to put it mildly, incredibly excited about code4lib this year because, for once, I thought I had something to say. And I did have something to say. And I said it. But it was wrong.

I presented a bunch of statistics drawn from nearly a year of Mirlyn logs. The most outlandish of my assertions, and the one that eventually turned out to be the most incorrect, was that some 45% of all our user sessions consist of only one action: a search.

Unfortunately, I’d missed a whole swath of things I should have excluded. I’d remembered robots and stuff coming in from our link resolver and so on. I hadn’t counted on having to fight my own stupidity.

In short: catalog.hathitrust.org and mirlyn.lib.umich.edu share a common code base, as well as a Solr backend. I was correctly excluding all the HathiTrust stuff from my stats except for simple searches. What I ended up with was a whole lotta sessions with nothing in them but that search. Luckily, I noticed waaaay too many people coming in via the HathiTrust site (which I know doesn’t have a link to Mirlyn) and did more digging.

The slides have been updated with correct numbers. Luckily, even though the adjustment was pretty extreme, I don’t think many of my conclusions are invalidated, especially given corroborating evidence from an extensive survey conducted by our usability team (PDF). They conclude, among other things, that known-item searching is prevalent and relevancy ranking is important across task boundaries.

The basic stats from the powerpoint, for those who don’t want to read all my notes:

17% of all sessions have one action: a search
In only 28% of all sessions does the user see the Record View
75% of all logged actions that target an individual record (see the full record view, look at extended holdings, etc.) happen with a record in the top 6 search results
7% of sessions involve a user adding a facet
2% of sessions involve a user exporting records

[RETRACTED] Code4Lib 2011 Lightning Talk Slides

Bill Dueber — Wed, 09 Feb 2011 00:00:00 +0000

DANGER!

I was trying to re-verify my numbers and found a glaring and hugely important mistake. I’ll make a new post with the details, but basically I was counting about 180k sessions (out of only 735k) that I should have been ignoring. Please ignore my basic stats until further notice.

See the new numbers and corrected slides for more accurate data.

I did a little Lightning Talk at Code4Lib 2011 and cleaned up (and heavily annotated) my slides for anyone interested in them.

The focus was on some basic stats about usage of our OPAC, Mirlyn, in calendar 2010.

I’ll be doing some posts and/or more rigorous writing on this stuff soon, but wanted to get these up in a timely fashion.

Bill Dueber Lightning Talk Slides from Code{4}Lib 2011. (powerpoint .ppt file, 1.2MB)

Four things I hate about Ruby

Bill Dueber — Thu, 13 Jan 2011 00:00:00 +0000

Don’t get me wrong. I use ruby as my default language when possible. I love JRuby in a way that’s illegal in most states.

But there are…issues. There are with any language and the associated environment. These are the ones that bug the crap out of me.

Ruby is slow. Let’s get this one out of the way right away. Ruby (at least the MRI 1.8.x implementation) is, for many things, slow. Sometimes not much slower. Sometimes (e.g., numerics) a hell of a lot slower.

Now, there’s nothing necessarily wrong with that. For what I do, MRI Ruby is usually fast enough, and JRuby is pretty much always fast enough. But the community response (“Buy more hardware! Programming time is more expensive than CPUs anyway! These are not the droids you’re looking for!”), esp. surrounding Rails, is simply annoying. If The Powers That Be want to make a decision to not worry about improving the performance of the language, well, that’s fine then. But to pretend — or even insist — that it’s not at all an issue, well, that’s just disingenuous.

Version nonsense. Yes, yes, I understand the historical process that produced a version 1.9.x that’s not backwards compatible with 1.8.x. But it’s dumb. Gems don’t often seem much better (including, hypocritically, my own). Versioning — meaning assigning version numbers in such a way that the underlying semantics are transparent — doesn’t seem to be something The Ruby World ™ is all that interested in.
No relationship between gem names and the modules they contain. This drives me freakin’ crazy. The Perl community does a great job with this. One module per file, one file per module, filenames follow the module names. I know exactly what to put after use in Perl. In Ruby, what comes after require is anybody’s guess.
Lack of thread-safety. Look, I get it that the MRI doesn’t have real threads. And so maybe there’s not a huge incentive on the part of the core folks to make things thread-safe in general. But at least one language construct — autoload — is just plain broken under real threads, with seemingly little interest in getting it fixed.

Does anyone use those prev/next/back-to-search links?

Bill Dueber — Wed, 03 Nov 2010 00:00:00 +0000

There’s a common problem among developers of websites that paginate, including OPACs: how do you provide a single item view that can have links that go back to the search (or to the prev/next item) without making your URLs look ugly?

The fundamental problem is that as soon as your user opens up a couple searches in separate tabs, your session data can’t keep track of which search she wants to “go back to” unless you put some random crap in the URL, which none of us want to do.

But let’s take three giant steps backwards before we throw a ton of resources at this problem, and ask, “Does anyone use those links”?

Data from Mirlyn, the University of Michigan OPAC

Here’s the data since February of 2010 for Mirlyn, our library OPAC.

Action	Count	Pct. of Basic Search count
Basic search (baseline)	1,446,881	100%
Previous record	1,347	0.09%
Next record	8,394	0.58%
Back to search	9,568	0.66%

For what it’s worth, I looked at these numbers by percentage of sessions as well, and the numbers come up a little higher — about 0.8% of all sessions included at least one click of the “Back to Search” button.

Given these numbers, I’m pretty sure I wouldn’t put a whole lot of effort into it. In general, next/prev record navigation only makes sense when you have a really, really small number of hits, anyway.

So…why not just disappear the links? I know people will complain, but hopefully our days of doing an enormous amount of work for …well, some tiny but vocal minority…are past.

Size/speed of various MARC serializations using ruby-marc

Bill Dueber — Wed, 29 Sep 2010 00:00:00 +0000

Ross Singer recently updated ruby-marc to include a #to_hash method that creates a data structure that is (a) round-trippable without any data loss, and (b) amenable to serializing to JSON. He’s calling it marc-in-json (even though the serialization is up to the programmer, it’s expected most of us will use JSON), and I think it’s the way to go in terms of JSON-able MARC data.

I wanted to take a quick look at the space/speed tradeoffs of using various means to serialize MARC records in the marc-in-json format compared to using binary MARC-21.

Why bother?

Binary MARC-21 is “broken” in that a lot of us have records that are so long (more than 99,999 bytes) it’s impossible to create a valid marc binary record. The standard alternative, MARC-XML, has huge filesizes (roughly 3 times as large) and runs a lot more slowly in every benchmark I’ve ever run. For ruby-marc, the penalty for using XML is further exaggerated because the serializer is based on REXML and is super-slow.

There have been a few proposals for a MARC data structure that can easily be serialized to JSON (I had my own, in fact), but the stuff Ross has done with marc-in-json is preferable in being (a) not a ton bigger in terms of file size, and (b) much easier to query from a NoSQL database using something like JSONPath or JSONQuery.

What I’m testing

For this test, I used:

marc21 binary This is the stock serialization / deserialization provided by ruby-marc.
YAJL for JSON YAJL is a very fast C-based JSON library. Here we’re using the Ruby bindings and calling Yajl::Encoder.encode(r.to_hash) to serialize and MARC::Record.new_from_hash(Yajl::Parser.parse(JSON)) to deserialize.
Msgpack The Msgpack project is explicitly designed to be “binary JSON” — smaller, faster, etc — at the expense of human readability/editability. Again, this used the ruby bindings.
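To make the round trip concrete without installing anything, here is a sketch that uses Ruby’s bundled json library as a stand-in for YAJL, and a hand-built hash in the marc-in-json shape in place of a real MARC::Record#to_hash result. The record content is invented; in real code you would hand the parsed hash back to MARC::Record.new_from_hash as described above.

```ruby
require 'json'

# A hand-built hash in the marc-in-json shape: a leader plus an ordered
# list of fields. Control fields map tag => value; data fields map
# tag => indicators and an ordered list of subfields. Content is made up.
record_hash = {
  'leader' => '00000nam a2200000 a 4500',
  'fields' => [
    { '001' => 'demo0001' },
    { '020' => { 'ind1' => ' ', 'ind2' => ' ',
                 'subfields' => [{ 'a' => '0306406152 (pbk. : alk. paper)' }] } },
    { '245' => { 'ind1' => '1', 'ind2' => '0',
                 'subfields' => [{ 'a' => 'An example title /' },
                                 { 'c' => 'by nobody in particular.' }] } }
  ]
}

serialized = JSON.generate(record_hash)  # stand-in for Yajl::Encoder.encode
restored   = JSON.parse(serialized)      # stand-in for Yajl::Parser.parse

puts "#{serialized.bytesize} bytes"
puts "round trip ok: #{restored == record_hash}"
```

Because fields and subfields are arrays rather than hash keys, order and repeated tags survive the trip, which is what makes the format round-trippable.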

The benchmark and its results

I’m interested in how long it takes to serialize and deserialize a single record. My primary use-case is sticking a single record into Solr, and then pulling the string representation of that record out and turning it back into MARC.

It’s entirely possible that trying to deal with a whole set of MARC records — as a JSON array of marc-in-json objects, or as a set of newline-delimited JSON (or perhaps LDJSON or Msgpack) objects — would yield different results. The former is especially interesting, since to parse a large JSON array one needs to use a streaming parser, which will almost certainly have a different profile in both processing and memory use.

The ambitious can see the full source code of the benchmark.
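For the less ambitious, the general shape of such a benchmark looks something like the following. This is a stripped-down sketch and not the benchmark I ran: it times only the bundled json library, uses one tiny made-up record, and the cycle count is arbitrary.

```ruby
require 'benchmark'
require 'json'

# One tiny made-up record in the marc-in-json shape.
record_hash = { 'leader' => '00000nam a2200000 a 4500',
                'fields' => [{ '001' => 'demo0001' }] }
cycles = 1_000  # arbitrary; the real run used far more

# Time serialization and deserialization separately.
serialized = nil
ser_time = Benchmark.realtime do
  cycles.times { serialized = JSON.generate(record_hash) }
end

restored = nil
deser_time = Benchmark.realtime do
  cycles.times { restored = JSON.parse(serialized) }
end

printf("serialize:   %.4f s\n", ser_time)
printf("deserialize: %.4f s\n", deser_time)
```

Timing the two directions separately matters here, since deserialization is the half I care most about.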

Note that the following represent only the performance of ruby-marc and the particular serializers used. Other platforms or other libraries will certainly give different results!

Total of 18880 records run 20 times (377,600 serialize/deserialize cycles per method) on my Mac OSX desktop; comparisons are to MARC21-Binary.

SERIALIZING
  MARC Binary     357.02 s  (100%)
  YAJL            312.65 s  ( 88%)
  Msgpack         266.26 s  ( 75%)

DESERIALIZING
  MARC Binary     648.91 s  (100%)
  YAJL            507.64 s  ( 78%)
  Msgpack         459.73 s  ( 71%)

SERIALIZE + DESERIALIZE
  MARC Binary    1005.93 s  (100%)
  YAJL            820.29 s  ( 82%)
  Msgpack         725.99 s  ( 72%)

SIZE
  MARC Binary     31.15 MBytes  (100%)
  Msgpack         42.00 MBytes  (135%)
  JSON            55.99 MBytes  (180%)
  XML             93.42 MBytes  (300%)

Analysis, such as it is

Obviously, there are size/speed tradeoffs. Nothing is as small as binary MARC21, but both YAJL and Msgpack are faster — significantly so for deserialization, which happens to be where I want the speed for my uses.

At 80% larger, the JSON serialization is quite a bit bigger, but it’s a hell of a lot smaller than MARC-XML and suffers none of the limitations of binary MARC.

For a closed system (i.e., you’re not worried about anyone else being able to read your data) such as a Blacklight installation, I’d be tempted to move to using JSON sooner rather than later.

VuFind Midwest gathering

Bill Dueber — Thu, 16 Sep 2010 00:00:00 +0000

A couple weeks ago, representatives from UMich (that’d be me), Purdue, Notre Dame, UChicago, and our hosts at Western Michigan got together in lovely Kalamazoo to talk about our VuFind implementations.

Eric Lease Morgan already wrote up his notes about the meeting, and I encourage you to go there for more info, but I’ll add my two cents here.

So, in light of that meeting, here’s what I’m thinking about VuFind of late:

None of us are running VuFind 1.0 as released with full catalog data. Eric has a special purpose portal running the current code over an aggregated special collection and hasn’t done much to the underlying PHP. The rest of us were running heavily modified versions of RC1. An issue we had in common was that the changes from RC1 to RC2 to 1.0 release were so significant, including some complete architectural change (some based on the stuff I’ve done with mirlyn) that the effort required to get up with 1.0 would be no less significant than the effort to switch wholesale to something else (e.g., Blacklight).
A point that I made that was echoed by others is that we need to remember that these new discovery systems are all just thin wrappers over Solr. They basically have two jobs: to get a query and format it in a way that Solr can handle, and then to take the Solr results and display them. There’s some sugar on top of that (exporting, tagging, etc) but that’s really it. The heavy lifting is all done by your indexer (Solrmarc for most, although watch this space for my announcement of my JRuby-based stuff today) and Solr itself. It’s not a hard problem, although it is occasionally a messy one.
VuFind has, in my mind, fundamental architectural issues mostly based on the inability to easily separate local code from core code. A re-architecture to base everything on subclasses of the core code would help, but at some point you start to run up against fundamental limitations of PHP and Smarty to do things cleanly. Without the ability to update core code and know it won’t affect your local code, there’s no good way to keep on track with the trunk of the code and do upgrades; for the same reason, it’s almost impossible to send changes back to trunk.
Coupled tightly to the architectural issues is the lack of tests. The code is potentially very brittle; there’s no good way to know if you’re breaking anything until you notice it’s broken. It’s not at all clear how to write good tests for the code, because there’s a lot of inter-dependencies.
The second big problem is one of community; to wit, there isn’t much of one. There are some active players, and there’s what seems like a great conference going on right now, so this may change. But, especially because of the technical difficulties in contributing local changes back, VuFind could use a benevolent dictator, someone who has organizing and administrating VuFind be a part of his/her job. The last bit is important.

All of these are surmountable issues. The reason they’re at the top of my head, of course, is that the Blacklight community has, in many ways, already taken care of most of them.

If I were starting from scratch tomorrow, we’d already decided to do something locally, and I could convince my systems people to run a ruby implementation (I like JRuby myself), I’d go with Blacklight. If we were already looking at something like Summon, I’d take a hard, hard look at the build-vs-buy numbers. Summon and Primo both give you APIs to program an interface against, and boy, it might be worth the effort to do so and leave everything else alone.

Simple Ruby gem for dealing with ISBN/ISSN/LCCN

Bill Dueber — Mon, 13 Sep 2010 00:00:00 +0000

I needed some code to deal with ISBN10->ISBN13 conversion, so I put in a few other functions and wrapped it all up in a gem called library_stdnums.

It’s only 100 lines of code or so and some specs, but I put it out there in case others want to use it or add to it. Pull requests at the github repo are welcome.

Functionality is all as module functions, as follows:

ISBN

char = StdNum::ISBN.checkdigit(ten-or-thirteen-digit-isbn)
boolean = StdNum::ISBN.valid?(ten-or-thirteen-digit-isbn)
thirteenDigitISBN = StdNum::ISBN.convert_to_13(ten-or-thirteen-digit-isbn)
tenDigitISBN = StdNum::ISBN.convert_to_10(ten-or-thirteen-digit-isbn)

ISSN

char = StdNum::ISSN.checkdigit(issn)
boolean = StdNum::ISSN.valid?(issn)

LCCN

normalizedLCCN = StdNum::LCCN.normalize(lccn)

Again, there’s nothing special here — just letting folks know it’s out there.
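To give a flavor of what’s involved, here is a stand-alone sketch of the ISBN10->ISBN13 arithmetic using the standard check-digit rules. It is not the gem’s source: the IsbnSketch module and its method names are invented for this example so that it runs without installing anything.

```ruby
# Hypothetical stand-alone module; NOT the library_stdnums implementation.
module IsbnSketch
  module_function

  # ISBN-13 check digit for the first 12 digits (weights 1,3,1,3,...).
  def checkdigit13(first12)
    sum = first12.chars.each_with_index
                 .map { |c, i| c.to_i * (i.even? ? 1 : 3) }.sum
    ((10 - sum % 10) % 10).to_s
  end

  # ISBN-10 check digit for the first 9 digits (weights 10 down to 2).
  def checkdigit10(first9)
    sum = first9.chars.each_with_index
                .map { |c, i| c.to_i * (10 - i) }.sum
    check = (11 - sum % 11) % 11
    check == 10 ? 'X' : check.to_s
  end

  # Drop hyphens and spaces, prefix 978, recompute the check digit.
  # Returns nil if the input doesn't have exactly ten characters left.
  def convert_to_13(isbn10)
    digits = isbn10.delete('^0-9Xx')
    return nil unless digits.length == 10
    body = '978' + digits[0, 9]
    body + checkdigit13(body)
  end
end

puts IsbnSketch.convert_to_13('0-306-40615-2')
```

The old check digit is thrown away during conversion, because the ISBN-13 check digit is computed with different weights over the new, 978-prefixed number.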

Solr: Forcing items with all query terms to the top of a Solr search

Bill Dueber — Wed, 18 Aug 2010 00:00:00 +0000

[Note: I’ve since made a better explanation of, and solution for, this problem.]

Here at UMich, we’re apparently in the minority in that we have Mirlyn, our catalog discovery interface (a very hacked version of VuFind), set up to find records that match only a subset of the query terms.

Put more succinctly: everyone else seems to join all terms with ‘AND’, whereas we do a DisMax variant on ‘OR’.

Now, I’m actually quite proud of how our searching behaves. Reference desk anecdotes and our statistics all point to the idea that people tend to find what they’re looking for. I invite you to try our current configuration out — and, of course, let me know if something feels off to you. We have control of our OPAC now, and can actually fix things.

The “problem”: DisMax is weird

The DisMax algorithm is complex. Even if you ignore the fact that we weight some fields (title, author) much higher than others, a fundamental feature of DisMax is that it basically gives ranking based on the question, “What percentage of the words in the document match one of our query terms”?

Most of the time, that’s exactly what you want. In general, items that have all the keywords, and more of them, appear at the top of the search results.

But sometimes you can have just, say, two of your three search terms appearing like a rash all across a relatively short record, and it’ll pop to the top, appearing ahead of records that actually contain all three search terms. Or maybe three of four search terms appear in both title and author (highly-weighted fields) and the same thing happens.

And, yeah, it really happens.

An actual, real-life example

Searching for the three terms information AND architecture AND usability, explicitly requiring all three, gives 12 results.

The equivalent DisMax search (where only two of three need to be found) nets about 4300 results. Which is great — we’re casting a much wider net, with some pretty common words. That doesn’t matter so long as the most relevant results float to the top.

The kicker? The first time an item in the first set appears in the second is at record number 62. Our user is more than three pages in before she even sees a record that contains all three terms.

Again, most of the time, our current algorithm does really, really well in my opinion. But noticing this led to talk about artificially pushing all the “all terms are present” items to the top.

Pushing records that contain all the terms to the top

So, I wanted to:

Push records with all search terms to the top, but
…don’t otherwise change their scores. i.e., don’t otherwise re-order them in any way, ’cause I’m already happy with my ordering.

It turns out to be harder than I initially thought. I fought with my code for a whole day, then asked for help, and help was provided.

So, with special thanks to Jan Høydahl for his solution, we get this, in Ruby pseudocode:

andedTerms = allMyTerms.join(' AND ')
bf = 'map(query($qq),0,0,0,100000.0)'  # Add this value to the ranking score
qq = "allFields:(#{andedTerms})"       # Use this as the query

# add bf and qq to your solr query

The qq is easy enough — it basically says that to get any relevancy score at all, the record must have all the terms in the allFields Solr field.

For the map, we want to say: “If the record matches all the terms, give it an extra 100K points. If not, don’t.”

The map takes 5 arguments:

An initial value. In this case, we’re getting the relevancy ranking score based on the qq query. Basically, items that don’t have all the terms will have a score of zero; items that do have all three terms will have something bigger than zero.
The beginning of range to compare to. In this case, 0.
The end of the range. Another zero, so basically, we’ll be seeing if our initial value is between 0 and 0, e.g., if it’s exactly 0.
The value to return if the initial value fits in the range — zero. So, if the record doesn’t have all the terms, return a 0.
The value to return if the initial value falls outside the given range. 100K — a random very-large number I picked.
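If the five arguments are hard to keep straight, it may help to see the same logic as plain Ruby. The solr_map method below is just a model of the behavior described above, not anything Solr exposes, and the params hash only shows where bf and qq would go; how you actually send them depends on your Solr client.

```ruby
# Plain-Ruby model of map(x, min, max, target, default):
# return target when x falls within [min, max], otherwise default.
def solr_map(x, min, max, target, default)
  x.between?(min, max) ? target : default
end

# A record missing a term scores 0 on the qq query, so no boost.
no_boost = solr_map(0.0, 0, 0, 0, 100_000.0)

# A record with every term scores above 0, so it gets the 100K boost.
boosted = solr_map(2.37, 0, 0, 0, 100_000.0)

# Where the two values from the text would go in a request (sketch only).
terms  = %w[information architecture usability]
params = {
  'qq' => "allFields:(#{terms.join(' AND ')})",
  'bf' => 'map(query($qq),0,0,0,100000.0)'
}

puts no_boost
puts boosted
puts params['qq']
```

Since every all-terms record gets the same constant added, their order relative to each other stays whatever it was before.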

And…?

I just pushed this to our beta site, and folks are still looking at it, but so far, it looks awesome. I’ll do a little update post if/when it goes into production. And if it doesn’t, I’ll say why.

Why RDA is doomed to failure

Bill Dueber — Fri, 23 Apr 2010 00:00:00 +0000

[Note: edited for clarity thanks to rsinger’s comment, below]

Doomed, I say! DOOOOOOOOOOMMMMMMMED!

My reasoning is simple: RDA will fail because it’s not “better enough.”

Now, those of you who know me might be saying to yourselves, “Waitjustaminute. Bill doesn’t know anything at all about cataloging, or semantic representations, or the relative merits of various encapsulations of bibliographic metadata. I mean, sure, he knows a lot about…err….hmmm…well, in any case, he’s definitely talking out of his ass on this one.”

First off, thanks for having such a long-winded internal monologue about me; it’s good to be thought of.

And, of course, you’re right on all counts. I don’t know what I’m talking about in any of those realms.

And yet I’m still willing to make a strong statement?

Yes. I am. Here’s why.

[Oh, and if you’re convinced I’m wrong — please say so. I’d love to be wrong about this.]

First, an assertion

The purpose of any bibliographic metadata is to facilitate three things:

Description/Identification. If you know what you want, does the metadata give you enough information to determine if the described item is what you want? Alternately, if you’re holding an item (or an alternate metadata representation of it), can you find the record that describes it?
Machine finding. Can a machine, given a good-enough query, find a work via a search of the metadata?
Machine grouping. Given the metadata, can a machine help a person find items “like this one”?

Take issue with one or more of those statements. I don’t care. The point I’m really trying to make is that any standard that doesn’t put unmediated machine reasoning at the forefront of what the metadata needs to support is living in a deep, deep hole.

Computer cycles are pretty cheap, and programmers are pretty smart. We can figure out how to do useful things with virtually any data, but only if we can reliably get at those data.

Getting 75% of the way there

Three-fourths of the problem can be addressed with one simple concept.

A solid equality relationship.

By this I mean that “=” had better damn well mean “equal,” as opposed to “probably the same, but there might be other representations, too.” If I want to say “A = B” (where A and B are authors, or works, or subjects, or anything that can be nailed down) there’d better be no false positives and no false negatives. Ever. MARC’s use of “hopefully-unique strings” is ridiculously insufficient in the modern era.

RDA does pretty well with this, with URIs for appropriate concepts, so that’s good.

What’s wrong with it?

Well, it’s gonna cost money to access the spec, for starters. That’s just dumb.

But it’s also not flexible/extensible enough. It’s true that I’m not a cataloger. I do have an MS in computer science, though, and there is stuff in the various versions of the RDA spec which leads me to believe that the committee desperately, desperately needed some hardcore geeks on it. Computer science has basically done nothing but develop methods for abstraction and composition for decades, and that isn’t reflected enough here.

Language such as, “If it is determined that a mechanism for providing a direct link between a note and the instance of the element to which it relates is required,…” worries me. if? IF????? That’s not a spec. That’s a guideline. Nail it down, for god’s sake. When is it appropriate or inappropriate? How do you add links to multiple (but not all) instances of the element?

The spec also seems to describe at least half a dozen kinds of titles. One of these is “Abbreviated title.” Do we really want an abbreviated title? No. We want a title with an “abbreviated” modifier, so we can use that same modifier for, say, a corporate name or publisher or anything else. [Note: see rsinger’s comment below, indicating this was a piss-poor example on my part.]

Well, sure, but it’s still better than the AACR2!

[This section updated to disambiguate my use of ‘MARC’ when I really meant ‘AACR2 as commonly talked about in terms of MARC tags’]

Of course it is. It’s just not better enough!

We’re not just talking about writing a spec. We’re talking about replacing every single tool in the library toolchain, from the ILS to editing software to OPACs to scripts that keep it all put together. We’ll be asking programmers to learn new skills and new ways of thinking, vendors to produce functional software for untested data formats, and catalogers to essentially take their whole brain out of their heads and get a new one.

But that, frankly, is the easy part. The entire culture of the library is built around AACR2 concepts and MARC data structures. The thought processes, nomenclature — everything sometimes feels as if it’s built around three-digit tags. The majority of the (crucial!) specialized vocabulary librarians, and experts and specialists, use to communicate with each other is directly or indirectly tied to MARC.

So, yeah, RDA is a hellofa lot better than AACR2/MARC. But in my view, it’s not better enough to justify all the pain. Switching is incredibly, astoundingly expensive both in terms of cost and in terms of the devaluation of institutional knowledge. We can’t do it every few years. We need to be damn sure we’re getting it right.

Data structures and Serializations

Bill Dueber — Tue, 20 Apr 2010 00:00:00 +0000

Jonathan Rochkind, in response to a long (and, IMHO, mostly ridiculous) thread on NGC4Lib, has been exploring the boundaries between a data model and its expression/serialization ( see here, here, and here ) and I thought I’d jump in.

What this post is not

There’s a lot to be said about a good domain model for bibliographic data. I’m so not the guy to say it. I know there are arguments for and against various aspects of the AACR2 and RDA and FRBR, and I’m unable to go into them.

What I am comfortable saying is this:

Anyone advocating or dismissing a data model based on the data structure or serialization most-often associated with that model is missing the goddamn point.

Data serializations

…are boring. They’re unimportant at the data modeling stage, and only barely important when thinking about data structures. For any given data structure there are lots of ways you can serialize it. A standard programming-language hash can be represented in a zillion ways, for example: yaml, json, various programming languages, .ini files, etc. Even MARC has two standard serializations (binary and xml) with several more actually in use (Aleph Sequential, for example).

So, let me repeat again, serializations are boring and not worth talking about until you’ve got everything else nailed down. Any format you can round-trip your data structure to/from is fine.

Serializations are measured from “less pain” to “more pain”, but all have the exact same expressiveness. Data structures, on the other hand, do not.

A hierarchy of data structures

Think about the following data structures:

An ordered list
key-value pairs
A hierarchy (e.g., an XML document)
An undirected graph
A directed graph
A labeled, directed multigraph (e.g., a set of RDF Triples)

You don’t have to think very hard to see that any of these can be viewed as a restricted version of the data structure below it in the list. An ordered list (array) is just a set of key-value pairs where the keys represent each item’s sequence. A set of key-value pairs is a very, very flat hierarchy. A hierarchy is an undirected graph without cycles. An undirected graph is a directed graph where you’re careful to make links both ways. And a directed graph can easily be represented as a set of RDF triples (where you may, for example, only have one label for your relationships: “links to”).
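
A minimal Ruby sketch of those reductions, one step at a time. The sample data, the variable names, and the single "links to" label are all made up for illustration; nothing here comes from a library.

```ruby
# Each structure re-expressed in the next, more expressive one.
# All data and names here are illustrative.

# An ordered list is key-value pairs whose keys are positions.
list  = %w[title author year]
pairs = list.each_with_index.map { |item, i| [i, item] }.to_h

# Key-value pairs are a very flat hierarchy: one root, one level of children.
tree = {
  name: 'root',
  children: pairs.map { |k, v| { name: k, value: v, children: [] } }
}

# A hierarchy is an undirected graph without cycles:
# one edge per parent/child link.
undirected = tree[:children].map { |child| ['root', child[:name]] }

# An undirected graph is a directed graph with every link made both ways.
directed = undirected.flat_map { |a, b| [[a, b], [b, a]] }

# A directed graph is a set of triples with a single label.
triples = directed.map { |from, to| [from, 'links to', to] }

triples.each { |t| puts t.inspect }
```

Going the other way (triples back into a bare array, say) is where the out-of-band bookkeeping starts to pile up.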

[Note that I didn’t say any of these would be efficient implementations!]

The reverse is not true — or, at least, not without an incredible amount of “out of band” information in another layer somewhere.

The structures at the end of the list have more expressiveness. You can just plain model more things in them (give-or-take the out-of-band stuff, composition, etc) per unit of screwing around. I’m not going to try to model my set of key=value pairs in an array. I could do it, but it would take so much of my attention that the data modeling would suffer.

Don’t handicap yourself

Don’t start with the data structure.

DON’T START WITH THE DATA STRUCTURE!

GET THAT MOTHER-FREAKIN’ DATA STRUCTURE OFF MY MOTHER-FREAKIN’ PLANE!

Seriously. Don’t be stupid. If all you’ve got is a hammer, everything starts to look like a thumb.

If you start off with a restrictive data structure before you even fully define the domain you’re trying to model, you may hose yourself. You may end up making stupid decisions based on the toolchain you’re imagining in your head.

Domain modeling is ridiculously hard for any domain worth modeling. If you start with a handicap (a restrictive data structure) it’s going to be even harder.

No one would think of trying to model bibliographic data using only arrays. That’s premature optimization on an epic scale.

The appeal of RDF Triples

Even if you ignore all the semantics and rules that make RDF Triples a value-added instance of a labeled, directed multigraph, the appeal (to me, anyway) is that any semantic model based on RDF Triples has enormous expressive power at its disposal.

Does it turn out that after you’ve fully satisfied the necessary model for the domain, the semantics you need can actually be accomplished with something lower down in the list? Awesome. Go with it. You’ll get great implementations with good real-life computing characteristics. A database can often usefully be thought of as an implementation of an undirected graph with typed nodes (and, perhaps, some typed links, if you use the column name in the calling table as a “type” of sorts, and add some out-of-band knowledge). And lord knows RDBMS’s have great performance characteristics.

But don’t start there. Start with the domain. Model it. Figure out what you need to describe and derive. Then pick the most appropriate data structure.

The nightmare that is MARC

MARC-the-data-structure (not to be confused with a serialization of that data structure, on the one hand, or with the AACR2 on the other) can incompletely (but usefully, I think) be described as:

A set of key-value pairs
…that have a defined order
…where keys can be repeated
…and values are strings
…and keys are a concatenation of tag/ind1/ind2/code

Control fields are especially restricted (ind1, ind2, and code are all ‘null’). There’s been some bullshit attempts at links (e.g., the 880 fields) but really, this is it.
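
As a rough illustration of that description, here is a record flattened into ordered, repeatable key/value pairs. The key layout (tag, ind1, ind2, code) follows the list above; the MarcKey struct, the values_for helper, and the sample values are assumptions made for this sketch.

```ruby
# MARC-the-data-structure as ordered, repeatable key/value pairs.
# MarcKey, values_for, and the sample record are illustrative only.
MarcKey = Struct.new(:tag, :ind1, :ind2, :code)

record = [
  # control field: ind1, ind2, and code are all nil
  [MarcKey.new('001', nil, nil, nil), '001 value'],
  [MarcKey.new('035', ' ', ' ', 'a'), '(RLIN)MIUG0000733-B'],
  # the same key again: repeats are allowed, and order matters
  [MarcKey.new('035', ' ', ' ', 'a'), '(CaOTULAS)159818014'],
  [MarcKey.new('245', '1', '0', 'a'), 'Capitalism, primitive and modern;']
]

# Because keys repeat, a lookup returns a list of values, in record order.
def values_for(record, tag, code)
  record.select { |key, _| key.tag == tag && key.code == code }
        .map { |_, value| value }
end

p values_for(record, '035', 'a')

# A plain Hash cannot hold this: the repeated key collapses to one entry.
p record.to_h.size
```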

It doesn’t give us much to work with. It’s restricted. And, sadly, so is our thinking.

Putting the cart before the horse

As Jonathan (and zillions of others) rightly point out, a huge problem in the library world is that there are generations (plural) of working librarians who, because of years of practice, find it incredibly hard to think about bibliographic data as modeled outside the constraints inherent in the MARC data structure. It’s a handicap. It’s an anchor around our necks.

MARC-the-data-model (nee AACR2) is not inherently bad because it’s built on an impoverished data structure. It’s bad because it does a shitty job at modeling the bibliographic data space. If we could produce a good model in a crappy data structure like that, well, that’d be awesome because it would indicate that things are simple.

Things, of course, aren’t simple. They’re hard.

So, if you want to complain about MARC or RDA or FRBR, figure out what it’s trying to model and talk about the fidelity of the model with respect to the problem space. But don’t conflate data models, data structures, and serializations.

Oh, and don’t say “PIN Number” or “ATM Machine.” That drives me crazy, too.

Stupid catalog tricks: Subject Headings and the Long Tail

Bill Dueber — Tue, 13 Apr 2010 00:00:00 +0000

Library of Congress Subject Headings (LCSH) in particular.

I’ve always been down on LCSH because I don’t understand them. They kinda look like a hierarchy, but they’re not really. Things get modifiers. Geography is inline and …weird.

And, of course, in our faceting catalog when you click on a linked LCSH to do an automatic search, you often get nothing but the record you started from. Which is super-annoying.

So, just for kicks, I ran some numbers.

The process

I extracted all the field 650, indicator2=”0″ from our catalog, threw away the subfield 6’s, and threw away any trailing punctuation in any of the subfields. I called the concatenation of what was left a unique LCSH.

Then I printed them out and put them all onto index cards, using tick-marks to indicate…

No, of course not. I used sort, uniq -c, and wc -l. Here’s what I found.

Counts of LCSH

…in round numbers.

In our catalog, there are:

8.50M subject headings (using the definition above)
1.87M unique subject headings
…66% of which (1.23M) appear exactly once

We only have to go out to 30K subjects to account for half of all subject entries. The top 1000 most-used subjects account for 14.5% of all 8.5M subject entries.
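
The same tallies can be sketched in Ruby instead of sort and uniq -c. This assumes the headings have already been extracted and normalized to one string per occurrence; the sample array is invented and tiny, so its numbers say nothing about the real catalog.

```ruby
# Tally headings the way `sort | uniq -c` would, then summarize.
# `headings` is invented sample data, one string per occurrence.
headings = [
  '$$aEconomics', '$$aEconomics', '$$aEconomics', '$$aEconomics',
  '$$aPhilosophy', '$$aPhilosophy',
  '$$aPiano music',
  '$$aSocialism$$zFrance'
]

counts = Hash.new(0)
headings.each { |h| counts[h] += 1 }

total      = headings.size                      # all subject entries
unique     = counts.size                        # distinct headings
singletons = counts.count { |_, n| n == 1 }     # headings used exactly once

# How many of the most-used headings does it take to cover
# at least half of all entries?
running = 0
needed  = 0
counts.values.sort.reverse.each do |n|
  break if running * 2 >= total
  running += n
  needed  += 1
end

puts "#{total} entries, #{unique} unique, #{singletons} used once"
puts "top #{needed} heading(s) cover half of all entries"
```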

The top ten subjects by count are:

6029 $$aSermons, American
6131 $$aPhilosophy
7224 $$aFeature films
7591 $$aPiano music
7968 $$aSocialism
8796 $$aEconomics
9185 $$aCommunism
12440 $$aSermons, English$$y17th century
13539 $$aBills, Private$$zUnited States
58823 $$aEconomics$$xHistory$$vSources

From a record’s point of view

Our catalog has:

7M records
4.4M records with at least one subject (as defined above)
2.4M records with more than one subject
2.0M records with exactly one subject
2.6M records with zero subjects

The records with the most subject headings tend to be collections of stuff (theses, photos, etc). Our local standout is the Dept. of Medicine and Surgery (University of Michigan) theses, 1851-1878 with 208 subject entries. 14 records have at least 30 subject entries.

What it means

Gee, lady, I don’t know.

One way to look at it: suppose you’re considering defining subjects in this way, and making them “hot” in the catalog interface. For our data, 2/3 of records would have either no subjects or a subject that found only the record you’re at. So…think again.

In real life, we index lots of possible subject fields, and we additionally index the $$a as well as the whole string, so ours are a little bit more useful. A little.

Why bother with threading in jruby? Because it’s easy.

Bill Dueber — Fri, 12 Mar 2010 00:00:00 +0000

[Edit 2011-July-1: I’ve written a JRuby-specific threach, called jruby_threach, that takes advantage of better underlying Java libraries and is a much better option if you’re running JRuby]

Lately on the #code4lib IRC channel, several of us have been knocking around different versions (in several programming languages) of programs to read in a ginormous file and do some processing on each line. I noted some speedups related to multi-threading, and someone (maybe rsinger?) said, basically, that to bother with threading for a one-off simple program was a waste.

Well, it turns out I’ve been trying to figure out how to deal with threading in jruby anyway. And I think I have a pretty elegant solution — a generic “threaded each” I’m calling threach.

   enumerable_object.threach(number_of_threads, :which_iterator) do |i|
     do_something_threadsafe(i)
   end

Some examples

   # You like #each? You'll love...err..probably like #threach
   load 'threach.rb'

   # Process with 2 threads. It assumes you want 'each'
   # as your iterator.
   (1..10).threach(2) {|i| puts i.to_s}

   # You can also specify the iterator
   File.open('mybigfile') do |f|
     f.threach(2, :each_line) do |line|
       processLine(line)
     end
   end

   # threach does not care what the arity of your block is
   # as long as it matches the iterator you ask for

   ('A'..'Z').threach(3, :each_with_index) do |letter, index|
     puts "#{index}: #{letter}"
   end

   # Or with a hash
   h = {'a' => 1, 'b'=>2, 'c'=>3}
   h.threach(2) do |letter, i|
     puts "#{i}: #{letter}"
   end

threach.rb adds to the Enumerable module to provide a threaded version of whatever enumerator you throw at it (each by default).

How does it work?

How about I just put the source here. It’s short.

   require 'thread'
   module Enumerable

     def threach(threads=0, iterator=:each, &blk)
       if threads == 0
         # Just call the iterator itself
         self.send(iterator, &blk)
       else
         bq = SizedQueue.new(threads * 4)
         consumers = []
         threads.times do |i|
           consumers << Thread.new do
             until (a = bq.pop) === :end_of_data
               blk.call(*a)
             end
           end
         end

         # The producer
         count = 0
         self.send(iterator) do |*x|
           bq.push x
           count += 1
         end
         # Now end it
         threads.times do
           bq << :end_of_data
         end
         # Do the join
         consumers.each {|t| t.join}
       end
     end
   end

That’s it. If threads=0, just use the iterator itself. If not:

Create a SizedQueue. It is thread-safe by definition and acts as the glue between the consumers and the main-thread producer.
Start a set of consumer threads that basically just pull an item out of the queue and then run the given block on it. Bail when you see the end_of_data token. These consumer threads all immediately block because there’s nothing in the SizedQueue yet.
Populate the SizedQueue. When you run out of stuff to add, push on an end_of_data token for each consumer thread.
Call join on the threads so the main program sticks around until every one of them has exited.

Why use it?

Well, if you’re using stock ruby — you probably shouldn’t. It’ll just slow things down. But if you’re using a ruby implementation that has real threads, like JRuby, this will give you relatively painless multi-threading.

You can always do something like:

   if defined? JRUBY_VERSION
     numthreads = 3
   else
     numthreads = 0
   end

   my_enumerable.threach(numthreads) {|i| ...}

Note the “relatively” up there. The block you pass still has to be thread-safe, and there are many data structures you’ll encounter that are not thread-safe. Scalars, arrays, and hashes are, though, under JRuby, and that’ll get you pretty far.
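
To make the thread-safety caveat concrete, here is a small sketch of the usual fix when a block does a read-modify-write on shared state: wrap that step in a Mutex. It uses plain Ruby threads rather than threach itself so it runs on its own; the word list and the counter are invented for the example.

```ruby
require 'thread'

# Shared state touched from several threads should be guarded.
# Plain threads are used here so the sketch runs without threach.
words  = %w[apple banana cherry apple banana apple]
counts = Hash.new(0)
lock   = Mutex.new

queue = Queue.new
words.each { |w| queue << w }
3.times { queue << :end_of_data }   # one stop token per worker

workers = 3.times.map do
  Thread.new do
    until (w = queue.pop) == :end_of_data
      # counts[w] += 1 is a read followed by a write: do it under the lock
      lock.synchronize { counts[w] += 1 }
    end
  end
end
workers.each(&:join)

p counts
```

The same synchronize call would go inside the block handed to threach.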

Pushing MARC to Solr; processing times and threading and such

Bill Dueber — Thu, 04 Mar 2010 00:00:00 +0000

[This is in response to a thread on the blacklight mailing list about getting MARC data into Solr.]

What’s the question?

The question came up, “How much time do we spend processing the MARC vs trying to push it into Solr?”. Bob Haschart found that even with a pretty damn complicated processing stage, pushing the data to solr was still, at best, taking at least as long as the processing stage.

I’m interested because I’ve been struggling to write a solrmarc-like system that runs under JRuby. Architecturally, the big difference between my stuff and solrmarc is that I use the StreamingUpdateSolrServer (on Erik Hatcher’s suggestion). So I thought I’d check how things break down for me.

Here are my numbers running under JRuby (using MARC4J as the marc implementation) with the Solr StreamingUpdateSolrServer. Obviously, there are a lot of differences between this and solrmarc, but I’m hoping that while it’s not comparing apples to apples, it’s at least comparing apples to some sort of processed cheese-like product.

What work is being done on what?

The data set is a file of 18,881 MARC records in marc-binary format. It’s probably not big enough to get a great idea of how things will run over the long (many millions of records) haul, but it’ll do for this rough-cut stuff.

I break my processing down into five categories:

Read the records into marc4j objects and do nothing. This is a baseline of sorts.
The “normal” fields are anything that you could do with SolrMarc without a custom routine; the actual processing is done in JRuby.
Custom fields are generated with JRuby code, but these are things that in solrmarc would require a custom routine.
The big “allfields” field is text from tags 100 through 900.
The “to_xml” routine is just calling the underlying marc4j XML output and stuffing it into a string.

The schema used is our normal UMICH schema except for High Level Browse (which appears in our catalog as “Academic Discipline”). The code for that is written in Java, and I just call it from JRuby when I’m using it. I excluded it because it’s incredibly expensive, both at startup time (when it loads a giant database of call-number ranges and associated categories) and for processing — there’s a lot of call-number normalization, long-string comparisons, some modified binary searches, etc. etc. etc. It’s expensive. Trust me.

The Solr server itself is on a different, incredibly-beefy machine, and is emptied out before each invocation that involves actually pushing data to it (with a delete-by-query :).

How fast were things on my desktop?

18,881 records in marc-binary format
Times are in seconds, run on my desktop
Remember, you can’t compare these numbers to Bob’s because we’re doing different things to different data.

Total Seconds	Description
19	Just read the records with marc4j and do nothing.
85	Read and do 35 “normal” fields (no custom)
104	Read, 35 normal, 15 custom fields
110	Read, normal, custom, allfields
129	Read, normal, custom, allfields, to_xml
136	Read, normal, custom, allfields, to_xml, 2-threaded SUSS, commit every 5K docs
142	Read, normal, custom, allfields, to_xml, 1-threaded SUSS, commit every 5k docs
124	Read, normal, custom, allfields, to_xml, 1-threaded SUSS, commit every 5k docs, 2 threads doing processing

We can also break the same numbers down as:

Seconds	Description
19	read the records and do nothing
66	process the 35 normal fields
19	process the 15 custom fields
6	generate the “allfields” field
19	generate the XML (yowza!)
7	send to solr with two threads
13	send to solr with one thread
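
Each row in the breakdown above is the difference between two neighboring rows of the cumulative table. A quick sketch of that arithmetic, with the totals copied from the first table (the hash keys are just short labels chosen for this example):

```ruby
# Cumulative totals (seconds) from the first table, in order.
cumulative = {
  'read'      => 19,
  'normal'    => 85,
  'custom'    => 104,
  'allfields' => 110,
  'to_xml'    => 129
}

# Cost of each stage = its cumulative total minus the one before it.
previous = 0
stage_cost = cumulative.map do |stage, total|
  cost = total - previous
  previous = total
  [stage, cost]
end.to_h

# Sending to Solr is measured against the 129-second run
# that does everything except talk to Solr.
solr_two_threads = 136 - cumulative['to_xml']
solr_one_thread  = 142 - cumulative['to_xml']

stage_cost.each { |stage, cost| puts "#{stage}: #{cost}s" }
puts "solr, two threads: #{solr_two_threads}s"
puts "solr, one thread: #{solr_one_thread}s"
```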

Or like this:

Seconds	Description
129	do all the reading and processing
13	send to solr with one thread

Why does solr processing seem so much faster for me?

There are a lot of reasons why my submit-to-solr might seem like less of a burden. The ones I can think of off the top of my head are:

SUSS is just faster than whatever solrmarc does.
My processing stage is so much slower than solrmarc’s (due to algorithms or jruby-vs-java, I don’t know) that the “push to solr” portion of it gets swallowed up by the slowness of the overall code.
The Solr server is so much faster than my desktop that my poor little desktop can’t send it data fast enough to work it.

For my setup, obviously adding a processing thread is a lot more beneficial than adding a SUSS thread. My desktop doesn’t have that many threads lying around (adding a third processing thread actually slowed things down), so I moved the code to a beefier machine to see what happened.

Trying the same thing on a beefy machine

This is the exact same code and data, but on a beefy machine (16 cores, gobs of memory).

time	SUSS Threads	Processing Threads
70	1	1 (was 142 seconds on the desktop)
47	1	2
39	1	3
35	1	4
68	2	1
48	2	2
38	2	3
34	2	4

So, on my hardware anyway, there’s a sweet spot with one suss thread and three processing threads. YMMV, of course.

What have we learned?

I’m not sure, to be honest. It’s logistically difficult for me to do the same process in solrmarc because I’d have to rebuild everything without the HLB stuff. I guess for me, what I’ve learned is that if I’m going to continue working on my code, the places to focus my attention are threading (obviously) and MARC-XML generation.

ruby-marc with pluggable readers

Bill Dueber — Tue, 02 Mar 2010 00:00:00 +0000

I’ve been messing with easier ways of adding parsers to ruby-marc’s MARC::Reader object. The idea is that you can do this:

   require 'marc'
   require 'my_marc_stuff'

   mbreader = MARC::Reader.new('test.mrc') # => Stock marc binary reader
   mbreader = MARC::Reader.new('test.mrc', :readertype=>:marcstrict) # => ditto

   MARC::Reader.register_parser(My::MARC::Parser, :marcstrict)
   mbreader = MARC::Reader.new('test.mrc') # => Uses My::MARC::Parser now

   xmlreader = MARC::Reader.new('test.xml', :readertype=>:marcxml)

   # ...and maybe further on down the road

   asreader = MARC::Reader.new('test.seq', :readertype=>:alephsequential)
   mjreader = MARC::Reader.new('test.json', :readertype=>:marchashjson)

A parser need only implement #each and a module-level method #decode_from_string.
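
To show the shape of the idea without depending on ruby-marc, here is a toy registry. Everything in it (ToyReader, LineParser, the :lines type, the pipe-delimited "format") is hypothetical. It is not the ruby-marc API, just a sketch of how a reader could dispatch to whichever parser is registered for a type.

```ruby
require 'stringio'

# A toy reader that delegates to a registered parser class.
# Hypothetical names throughout; this is not ruby-marc.
class ToyReader
  include Enumerable
  @parsers = {}

  class << self
    attr_reader :parsers

    def register_parser(klass, type)
      parsers[type] = klass
    end
  end

  def initialize(io, opts = {})
    type = opts.fetch(:readertype, :lines)
    parser_class = self.class.parsers.fetch(type) do
      raise ArgumentError, "no parser registered for #{type}"
    end
    @parser = parser_class.new(io)
  end

  def each(&blk)
    @parser.each(&blk)
  end
end

# A parser only needs #each plus a module-level decode_from_string.
class LineParser
  def self.decode_from_string(str)
    str.chomp.split('|')
  end

  def initialize(io)
    @io = io
  end

  def each
    @io.each_line { |line| yield self.class.decode_from_string(line) }
  end
end

ToyReader.register_parser(LineParser, :lines)
records = ToyReader.new(StringIO.new("001|abc\n245|A title\n")).to_a
p records
```

Registering a different class under the same type swaps the implementation without touching any calling code, which is the whole point of the proposal.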

Read all about it on the github page.

New interest in MARC-HASH / JSON

Bill Dueber — Fri, 26 Feb 2010 00:00:00 +0000

EDIT: This is historical — the recommended serialization for marc in json is now Ross Singer’s marc-in-json. The marc-in-json serialization has implementations in the core marc libraries for Ruby and PHP, and add-ons for Perl and Java. C’mon, Python people!

For reasons I’m still not entirely clear on (I wasn’t there), the Code4Lib 2010 conference this week inspired renewed interest in a JSON-based format for MARC data.

When I initially looked at MARC-HASH almost a year ago, I was mostly looking for something that wasn’t such a pain in the butt to work with, something that would marshall into multiple formats easily and would simply and easily round-trip.

Now, though, a lot of us are looking for a MARC format that (a) doesn’t suffer from the length limitations of binary MARC, but (b) is less painful (both in code and processing time) than MARC-XML, and it’s worth re-visiting.

For at least a few folks, un-marshaling time is a factor, since no matter what you’re doing, processing XML is slower than most other options. Much of that expense is due to functionality and data-safety features that just aren’t a big win with a brain-dead format like MARC, so it’s worth looking at alternatives.

What is MARC-HASH?

At some point, we’ll want a real spec, but right now it’s just this:

   // A record is a four-pair hash, as follows. UTF-8 is mandatory.
   {
     "type" : "marc-hash",
     "version" : [1, 0],
     "leader" : "...leader string ... ",
     "fields" : [array, of, fields]
   }

   // A field is an array of either 2 or 4 elements
   [tag, value] // a control field
   [tag, ind1, ind2, [array, of, subfields]]

   // A subfield is an array of two elements
   [code, value]

So, a short example:

  {
    "type" : "marc-hash",
    "version" : [1, 0],
    "leader" : "leader string",
    "fields" : [
      ["001", "001 value"],
      ["002", "002 value"],
      ["010", " ", " ",
        [
          ["a", "68009499"]
        ]
      ],
      ["035", " ", " ",
        [
          ["a", "(RLIN)MIUG0000733-B"]
        ]
      ],
      ["035", " ", " ",
        [
          ["a", "(CaOTULAS)159818014"]
        ]
      ],
      ["245", "1", "0",
        [
          ["a", "Capitalism, primitive and modern;"],
          ["b", "some aspects of Tolai economic growth"],
          ["c", "[by] T. Scarlett Epstein."]
        ]
      ]
    ]
  }
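
A quick way to check the round-trip claim is to push a record through a JSON library and compare. This sketch uses Ruby's standard json library and a cut-down version of the record above; telling control fields from data fields by length (two elements versus four) follows the field description given earlier.

```ruby
require 'json'

# A cut-down marc-hash record as a plain Ruby structure.
record = {
  'type'    => 'marc-hash',
  'version' => [1, 0],
  'leader'  => 'leader string',
  'fields'  => [
    ['001', '001 value'],
    ['010', ' ', ' ', [['a', '68009499']]],
    ['245', '1', '0', [
      ['a', 'Capitalism, primitive and modern;'],
      ['b', 'some aspects of Tolai economic growth']
    ]]
  ]
}

# Marshal to JSON and back again.
json     = JSON.generate(record)
restored = JSON.parse(json)

# Control fields have 2 elements; data fields have 4.
control, data = restored['fields'].partition { |f| f.size == 2 }

# Pull every subfield 'a' value out of the data fields.
a_values = data.flat_map do |_tag, _ind1, _ind2, subfields|
  subfields.select { |code, _| code == 'a' }.map { |_, value| value }
end

puts "round-trips cleanly: #{restored == record}"
puts "control tags: #{control.map(&:first).inspect}"
puts "subfield a values: #{a_values.inspect}"
```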

How’s the speed?

I think it’s important to separate the format marc-hash from the eventual marshaling format — partly because someone might actually want to use something other than JSON as a final format, but mostly because it forces implementations to separate hash-creation from json-creation and makes it easy to swap in different json libraries when a faster one comes along.

Having said that, in real life people are mostly concerned about JSON. So, let’s look at JSON performance.

The MARC-Binary and MARC-XML files are normal files, as you’d expect. The JSON file is “Newline-Delimited JSON” — a single JSON record on each line.
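
Newline-delimited JSON is simple enough to sketch in a few lines. This uses the standard json library and a StringIO standing in for a real file; the two tiny records are placeholders, not real data.

```ruby
require 'json'
require 'stringio'

# Two placeholder marc-hash records.
records = [
  { 'type' => 'marc-hash', 'version' => [1, 0],
    'leader' => 'leader one', 'fields' => [['001', 'rec1']] },
  { 'type' => 'marc-hash', 'version' => [1, 0],
    'leader' => 'leader two', 'fields' => [['001', 'rec2']] }
]

# Write: one complete JSON document per line.
io = StringIO.new
records.each { |r| io.puts JSON.generate(r) }

# Read: stream line by line, so only one record is in memory at a time.
io.rewind
ids = []
io.each_line do |line|
  rec = JSON.parse(line)
  ids << rec['fields'].first.last   # value of the first (001) field
end

p ids
```

Because each line stands alone, a reader can start processing before the file has finished arriving, and a bad line costs one record rather than the whole file.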

The benchmark code looks like this:

  # Unmarshal   x.report("MARC Binary") do     reader = MARC::Reader.new('test.mrc')     reader.each do |r|       title = r['245']['a']     end   end    # Marshal   x.report("MARC Binary") do     reader = MARC::Reader.new('test.mrc')     writer = MARC::Writer.new('benchout.mrc')     reader.each do |r|       writer.write(r)     end     writer.close   end

Under MRI, I used the nokogiri XML parser and the yajl JSON gem. Under JRuby, it was the jstax XML parser and the json-jruby JSON gem.

The test file is a set of 18,831 records I’ve been using for all my benchmarking of late. It’s nothing special; just a nice size.

Marshalling Speed (read from binary marc, dump to given format)

Times are in seconds on my Macbook laptop, using ruby-marc.

Format	Ruby 1.8.7	Ruby 1.9	JRuby 1.4	JRuby 1.4 --1.9
XML	393	443	188	356
MARC Binary	36	23	23	25
JSON/ NDJ	31	19	25	ERROR

Unmarshalling speed (from pre-created file)

Again, times are in seconds

Format	Ruby 1.8.7	Ruby 1.9	JRuby 1.4	JRuby 1.4 --1.9
XML	113	89	75	89
MARC Binary	29	16	16	19
JSON/ NDJ	17	9	13	16

And so…

I’m not sure what else to say. The format is totally brain-dead. It round-trips. It’s fast enough. It has no length limitations. We define it to have to be UTF-8. The NDJ format is easy to understand and allows streaming of large document collections.

If folks are interested in implementing this across other libraries, that’d be great. Any thoughts?

OCLC still not (NO! They are!) normalizing their LCCNs

Bill Dueber — Thu, 18 Feb 2010 00:00:00 +0000

NOTE 2: It turns out that I did find a minor bug in the system, but that in general LCCN normalization is working correctly. I just happened to hit a weirdness with a bad LCCN and a little bug in the parser on their end. Which is getting fixed. So…good news all around, and huge kudos to Xiaoming Liu for his quick response!

NOTE: It strikes me that I haven’t seen a case where bad data results from sending a valid LCCN. The only verified problem is one of false negatives. Send a valid lccn, you’ll get back either good data or nothing (and the “nothing” might be in error). So, still a big problem, but not as THESKYISFALLING as I imply below.

A long time ago, Jonathan Rochkind noted that the OCLC doesn’t correctly normalize their LCCNs.

Well, it’s not fixed.

I could really, really use the xlccn service right about now — a great web service they provide that, much like xisbn and xissn and the other xXXXX (heh!) services, purports to allow you to put in an lccn and get data back on the item you’re interested in.

Except they “normalize” their LCCNs in a way that is not only incorrect, but causes namespace collisions. As near as I can tell, they throw out any leading non-digits and only keep up to the next non-digit.

The xLCCN service will silently provide no data or incorrect data for many LCCN requests!

An example:

(F) Full LCCN is “sn 83011407”
(D) First set of digits is “83011407”. This is what I think the OCLC is indexing.
(N) Correct normalization is “sn83011407”

The problem, of course, is that (D) “83011407” is itself a valid LCCN.

(F) is associated with OCLC# 47212967
(D) is associated with OCLC# 12505148. That’s not the same record.
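
For reference, here is a sketch of the normalization following the Library of Congress rules as I understand them (strip blanks; drop a forward slash and everything after it; if there is a hyphen, remove it and left-pad what follows to six digits), next to a guess at what the service appears to do. The second method is only a reading of the behavior described above, not anything OCLC has published, and both method names are made up.

```ruby
# Normalize an LCCN (a sketch of the Library of Congress rules).
def normalize_lccn(raw)
  lccn = raw.delete(' ')            # 1. remove all blanks
  lccn = lccn.sub(%r{/.*\z}, '')    # 2. drop a slash and everything after it
  prefix, hyphen, serial = lccn.partition('-')
  return lccn if hyphen.empty?      # no hyphen: nothing more to do
  prefix + serial.rjust(6, '0')     # 3. remove hyphen, zero-pad serial to 6
end

# What the service *seems* to index: skip leading non-digits, then keep
# digits up to the next non-digit. This is a guess, not documented behavior.
def suspected_index_key(raw)
  raw[/\d+/]
end

puts normalize_lccn('sn 83011407')
puts normalize_lccn('85-2')
puts suspected_index_key('sn 83011407')
puts suspected_index_key('83011407')
```

The last two calls return the same key for two different LCCNs, which is the collision between (F) and (D).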

So, how do the OCLC services respond?

(F) Worldcat search finds correct (probably just doing a string match); xid finds nothing
(D) Worldcat finds both correct and incorrect records. The xLCCN service finds only the incorrect record, OCLC# 12505148.
(N) Neither worldcat nor xid finds anything for the correctly normalized version.

So, what am I supposed to do? Only use the service on LCCNs where the original and normalized versions are the same and include only digits? Frustrating.

Indexing data into Solr via JRuby (with threads!)

Bill Dueber — Tue, 16 Feb 2010 00:00:00 +0000

[Note: in this post I’m just going to focus on the “get stuff into Solr” part. My normal focus — MARC data — will make an appearance in the next post when I talk about using this in addition to / instead of solrmarc.]

Working with Solr

I love me the Solr. I love everything about it except that the best way to interact with it is via Java. I don’t so much love me the java.

So…taking Erik Hatcher’s lead and advice, as I will do whenever he offers either, I wrote some code to work within JRuby to deal with Solr.

Getting the code

I’ve added the gems to gemcutter, if you want to play along at home:

jruby_producer_consumer (github, rdoc.info) Ruby syntax for threaded operations under jruby
jruby_streaming_update_solr_server (github, rdoc.info) Ruby syntax on top of the Java class of the same name
marc4j4r (github, rdoc.info) Ruby syntax on top of the marc4j java library.

WARNING: None of these gems have a 1.0 version tag on them, and that means that the API may change a titch in the future. Also, the fact that they’re released as gems means that it’s easy to release gems, not that I’m not an idiot.

The basics: Using SolrInputDocument and StreamingUpdateSolrServer

OK, with the disclaimer out of the way, let’s look at some code.

   require 'rubygems'
   require 'jruby_streaming_update_solr_server'

   solrurl = 'http://your.solr.server:port/solr'
   sussqueuesize = 24 # how many items to buffer on their way to solr
   sussthreads = 1   # how many threads to use to send stuff to solr

   suss = StreamingUpdateSolrServer.new(solrurl,sussqueuesize,sussthreads)

   # Let's add a simple document via a hash: A title, three authors, and a year

   h = {
     :title => "Never been deader",
     :author => ['Bill', 'Mike', 'Molly'],
     :year => 2003
   }
   suss << h
   suss.commit

   # YEA! You just added a document to solr and committed it.
   # Have a cookie!

   # We can also use a document object to do the same thing

   doc = SolrInputDocument.new
   # Add the title
   doc << ['title', 'Never been deader']

   # Add the first author
   doc << [:author, 'Bill']

   # Add more. Re-used keys mean you're adding additional values
   # Note values can be scalars or arrays

   doc << [:author, ['Mike', 'Molly']]

   # Add the wrong year using [] syntax
   doc[:year] = 2001

   # Oops! fix it. []= overwrites existing value(s)

   doc[:year] = 2003

   # Finally, we can merge a hash (or anything else that responds to
   # 'each_pair' with key-value pairs) into an existing doc.
   # The parentheses matter: without them ruby parses the braces as a block.

   doc.merge!({'author' => 'Ringo Starrre', 'publisher'=>'Vainity Books'})

   # Add it

   suss << doc

   # Commit and optimize if you'd like

   suss.commit
   suss.optimize # if you want

Nothing really fancy in there — just a few things worth noting:

A suss object will take a hash (again, anything that responds to #each_pair) or a SolrInputDocument
You can use either strings or symbols to represent Solr field names
Values can be either a single value, or an array of multiple values

And there are three ways to get data into a doc:

Via << [field, value(s)] (additive)
Via doc.merge! hash (additive)
Via doc[field] = value (replaces)
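
The additive-versus-replace distinction is easy to see with a stand-in. ToyDoc below is a plain-Ruby class written only for this illustration. It is not SolrInputDocument and does not talk to Solr; it just mimics the three behaviors listed above so they can be run anywhere. The publisher value is a placeholder.

```ruby
# ToyDoc is a plain-Ruby stand-in for illustration; it is NOT
# SolrInputDocument. It only mimics the semantics described above.
class ToyDoc
  def initialize
    @fields = Hash.new { |h, k| h[k] = [] }
  end

  # doc << [field, value_or_values]  (additive)
  def <<(pair)
    field, values = pair
    @fields[field.to_s].concat(Array(values))
    self
  end

  # doc.merge!(hash)  (additive, one << per pair)
  def merge!(hash)
    hash.each_pair { |field, values| self << [field, values] }
    self
  end

  # doc[field] = value  (replaces whatever was there)
  def []=(field, values)
    @fields[field.to_s] = Array(values)
  end

  # Strings and symbols name the same field.
  def [](field)
    @fields[field.to_s]
  end
end

doc = ToyDoc.new
doc << [:author, 'Bill']
doc << ['author', %w[Mike Molly]]   # adds to the existing authors
doc[:year] = 2001
doc[:year] = 2003                   # overwrites 2001
doc.merge!('publisher' => 'Example Press')

p doc[:author]
p doc[:year]
```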

Adding Threads

I also went down the garden path of threading things. There are an awful lot of operations that are not threadsafe (e.g., reading a line from a file) but once you’ve got a bunch of records to work with, turning them into Solr documents is usually thread-safe.

My model is that there’s a producer (usually the method #each) from an underlying data object. A thread takes whatever that method yields and sticks the values into a java BlockingQueue awaiting consumption. You then use ProducerConsumer#threaded_each (or ProducerConsumer#threaded_each_with_index) to pull items out of the queue and do something useful with them.

I extracted stuff into a library (jruby_producer_consumer) for your viewing pleasure.

CONFUSION ALERT: It’s perhaps unfortunate that the object you send to ProducerConsumer.new(obj) must implement #each and that the ProducerConsumer method #threaded_each calls that underlying #each…well there’s a lot of #each‘s floating around. Keep them straight.

So…let’s look at some code to work with consumer threads.

   # Start off the same as before
   require 'rubygems'
   require 'jruby_streaming_update_solr_server'
   require 'jruby_producer_consumer'
   require 'marc4j4r'

   solrurl = 'http://your.solr.server:port/solr'
   sussqueuesize = 24 # how many items to buffer on their way to solr
   sussthreads = 2   # how many threads to use to send stuff to solr

   suss = StreamingUpdateSolrServer.new(solrurl,sussqueuesize,sussthreads)

   # I'll go ahead and use a MARC file as my example, but won't talk about the
   # MARC parts of it. All you need to know is that the reader object
   # implements #each

   reader = MARC4J4R.reader('test.xml', :marcxml)

   # Get a producer/consumer object with the reader at its base, using
   # the default method #each to get stuff out of it, and with the assumption
   # that we only need to keep the default 5 items in memory at a time to
   # keep up with consumption

   pc = ProducerConsumer.new(reader)

   # Get three threads to actually consume the things, turn them into solr
   # documents, and send them to solr (potentially out of order)

   numconsumerthreads = 3
   pc.threaded_each(numconsumerthreads).each do |r|
     suss << turn_marc_record_into_a_hash_or_solrdoc(r)
   end
   suss.commit

Again, not a lot happening here.

The “producer” is always one thread, because so little is thread-safe at the ‘each’ level. In this case, there’s a single thread pulling data out of the file and turning it into MARC records, which are added to the internal BlockingQueue. I buffer 5 of these at a pop (the default) so the consumer threads don’t starve. I presume that producing items is cheaper than consuming them, or else this library won’t help you much.
ProducerConsumer#threaded_each calls the #each method of the underlying object. You can substitute anything that yields, though, as in this example where I call #each_line instead of the default #each

   queuesize = 5
   pc = ProducerConsumer.new(File.new('myfile.txt'), queuesize, :each_line)

Keep track of your threads. In this last example, there is one thread getting MARC records and putting them into the PC buffer (no way to change that), three threads consuming those records and sticking them into the suss object, and another two pulling stuff out of the suss object and sending things to Solr. And, of course, there’s other stuff running on the computer, too. Experiment and figure out what works best for your hardware.
See the docs for how to mess with what goes into a ProducerConsumer object. It’s entirely possible to use, say, #each_slice. There’s also a convenience method #threaded_each_with_index, but it does not call the underlying #each_with_index, it produces its own index as things are read.
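
For readers who want to see the moving parts, here is a minimal sketch of the same one-producer, several-consumers arrangement. It is not the gem's code: it assumes plain Ruby threads and the standard library's SizedQueue in place of the Java BlockingQueue, and the method name threaded_each_sketch and its keyword arguments are made up for illustration.

```ruby
require 'thread' # no-op on modern Rubies, needed for Queue/SizedQueue on old ones

# A sketch only, not the jruby_producer_consumer implementation.
# One producer thread fills a bounded queue; several consumer threads drain it.
def threaded_each_sketch(eachable, consumers: 3, queue_size: 5, via: :each, &block)
  queue = SizedQueue.new(queue_size) # the producer blocks when the buffer is full
  done  = Object.new                 # sentinel that tells a consumer to stop

  producer = Thread.new do
    eachable.public_send(via) { |item| queue << item }
    consumers.times { queue << done } # one sentinel per consumer
  end

  workers = Array.new(consumers) do
    Thread.new do
      loop do
        item = queue.pop
        break if item.equal?(done)
        block.call(item)
      end
    end
  end

  (workers << producer).each(&:join)
end

# Usage: three consumers doubling the numbers 1..10 (work finishes out of order).
results = Queue.new
threaded_each_sketch(1..10, consumers: 3, queue_size: 3) { |x| results << x * 2 }
doubled = []
doubled << results.pop until results.empty?
puts doubled.sort.inspect
```

The bounded queue is what keeps memory flat: the single producer stalls whenever the consumers fall behind, which is the same role the 5-item buffer plays above.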

Feedback not only welcome but necessary!

I’ve done a lot of messing around with Ruby in the last 10 days or so, but I’m still basically converting from Perl in my head. Any comments, bug reports, or whatnot are definitely welcome!

jruby_producer_consumer dead-simple producer/consumer for JRuby

Bill Dueber — Fri, 05 Feb 2010 00:00:00 +0000

Yea! My first gem ever released!

[YUCK! It was a disaster in a few ways! Don’t look at this! It’s hideous! There’s a new jruby_producer_consumer gem on gemcutter that is slightly different from this in that it works. Ignore the stuff below.]

[In working on a threaded JRuby-based MARC-to-Solr project, I realized that my threading stuff was…ugly. And I didn’t really understand it. So I dug in today and wrote this.]

I’ve just pushed my first gem to Gemcutter: jruby_producer_consumer, a JRuby-only producer/consumer class that works with anything that provides #each.

It’s JRuby-only because it uses (a) a blocking queue implementation that’s native Java, and (b) threading, which isn’t a huge win under regular Ruby.

There’s no testing there because I’m not sure how to test threaded stuff.

It is, I hope, easy to use:

    require 'rubygems'
    require 'jruby_producer_consumer'

    # Create a ProducerConsumer. Arguments are anything that implements #each
    # and the size for the underlying queue. For the former, I'll just use a Range object.

    eachable = 1..10
    queuesize = 3

    pc = ProducerConsumer.new(eachable, queuesize)

    # Just a method to show what happens
    def sample(consumerid, x)
      puts "Consumer #{consumerid}: consuming #{x}"
      sleep 1 # otherwise this'll finish before I can create multiple consumers
    end

    # Create three consumers. You can pass any number of args to
    # #consumer, and must pass a block whose arguments are the
    # object returned by eachable#each and those args back.

    ['A', 'B', 'C'].each do |consumerid|
      pc.consumer(consumerid) do |x, consumerid|
        sample(consumerid, x)
      end
    end

    # OUTPUT
    # Consumer A: consuming 1
    # Consumer B: consuming 2
    # Consumer C: consuming 3
    # Consumer A: consuming 4
    # Consumer B: consuming 5
    # Consumer C: consuming 6
    # Consumer B: consuming 7
    # Consumer A: consuming 8
    # Consumer C: consuming 9
    # Consumer B: consuming 10

Still another look at MARC parsing in ruby and jruby

Bill Dueber — Fri, 29 Jan 2010 00:00:00 +0000

I’ve been looking at making a jruby-based solr indexer for MARC documents, and started off wanting to make sure I could determine if anything I did would be faster than our existing (solrmarc-based) setup.

Assertion: The upper bound on how fast I can process records and send them to Solr can be approximated by looking at how fast I can parse (and do nothing else to) marc records from a file.
Assertion: If I can’t write a system that’s faster than what we have now, it’s probably not worth my time even though being able to fall back to ruby instead of java would be nice.
The Big Question: Is the MARC parsing process fast enough that it seems I might be able to write a system that runs faster than the solrmarc setup I have now?
The Answer (see below): Yes, if I use marc4j.

On our ridiculously-awesome hardware, right now we’re doing about 300 records/second for short files and 250 records/second for a full (6.5 million record) index, giving us a 7-8 hour reindex.

I’ll just post the results without a lot of commentary. I warmed stuff up in all cases, and ran on my desktop (so I could compare to MRI ruby, which isn’t installed on the server) and on the server where we usually run these things.

The machines are my desktop OSX machine and the beefy linux server where we usually do this stuff
The platforms are jruby 1.4 --server and MRI ruby 1.8.7
The libraries are marc4j and ruby-marc 0.3.3
The parsers are
- The standard binary parsers all around
- A home-grown AlephSequential format reader for the ‘seq’ type. AlephSequential is a MARC representation that uses one line for each field. We use it because it doesn’t have length limitations and, not surprisingly, Aleph can spit it out pretty quickly compared to MARC-XML.
- Whatever marc4j uses internally for MARC-XML
- ruby-marc’s ‘jstax’ xml parser under jruby (which I wrote and apparently needs some love, see below)
- ruby-marc’s ‘libxml’ xml parser under MRI ruby
Seconds is the average of two rounds, with measurements taken after a warmup run in each case.

The test files were 18,881 records in marc-xml, marc-binary, and AlephSequential formats.

    MACHINE  PLATFORM  LIBRARY    PARSER    SECONDS  REC/SECOND
    desktop  jruby     marc4j     binary       4.06        4650
    desktop  jruby     marc4j     xml          5.55        3401
    desktop  jruby     ruby-marc  binary      17.35        1088
    desktop  jruby     ruby-marc  jstax       80.11         236

    desktop  ruby      ruby-marc  binary      33.54         562
    desktop  ruby      ruby-marc  libxml      46.87         402

    server   jruby     marc4j     binary       2.29        8245
    server   jruby     marc4j     xml          3.36        5619
    server   jruby     marc4j     AlephSeq     3.68        5130
    server   jruby     ruby-marc  binary       9.93        1901
    server   jruby     ruby-marc  jstax       44.56         424
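
The measurement method described above (one warm-up run, then the average of two timed rounds, with REC/SECOND being the record count divided by that average) can be sketched as follows. This is not the script used for the numbers in the table; average_seconds and records_per_second are hypothetical helper names, and the workload in the usage lines is a stand-in for actually parsing a file.

```ruby
require 'benchmark'

# Hypothetical helper: one untimed warm-up run, then the mean of `rounds` timed runs.
def average_seconds(rounds: 2)
  yield # warm-up, not timed
  timings = Array.new(rounds) { Benchmark.realtime { yield } }
  timings.sum / rounds
end

# REC/SECOND as used in the table: records divided by average seconds, rounded.
def records_per_second(record_count, seconds)
  (record_count / seconds).round
end

# Stand-in workload; a real run would parse the 18,881-record test file here.
seconds = average_seconds { 18_881.times { |i| i.to_s } }
puts format('%.4f seconds, %d records/second', seconds, records_per_second(18_881, seconds))

# Recomputing two REC/SECOND cells from their SECONDS values:
puts records_per_second(18_881, 4.06)
puts records_per_second(18_881, 2.29)
```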

The quick takeaways, with all the obvious caveats:

jruby with ruby-marc is roughly twice as fast at binary and nearly twice as slow at xml compared with MRI
marc4j is about four times as fast for binary and about an order of magnitude faster for xml compared with ruby-marc.
The server is fast.

We know from previous experience that libxml is the fastest of the current MRI-based marc-xml readers and that jstax is the best of the current jruby-based marc-xml readers. And, finally, we know that many of us can’t use marc-binary format because our records are too big.

If I’m gonna use jruby (which I think I am due to wanting to use the StreamingUpdateSolrServer) I’m gonna need to use marc4j and just wrap it up in some nicer syntax.

Beta version of the HathiTrust Volumes API available

Bill Dueber — Tue, 15 Dec 2009 00:00:00 +0000

MAJOR CHANGE

So, initially, this post listed that the way to separate multiple simultaneous requests was with a nice, URL-like slash (/) character.

Then, I remembered that LCCNs can have embedded slashes, e.g., 65063380//r85.

So, we’re back to using pipe (|) characters to separate multiple calls — the examples below have been updated to reflect this.

Introduction

I’ve put up a beta version of the HathiTrust Volumes API previously discussed on this blog and via email.

Currently, I’ve only got json output, although there is space in there for other output formats as necessary.

What exactly is this?

Given: an identifier or set of identifiers, this API will Return: a set of matched records and a sorted list of the items available in the HathiTrust.

Useful, for example, if you want to display HathiTrust holdings alongside your own in your OPAC.

Simple, single-value call

Given the URL:

http://catalog.hathitrust.org/api/volumes/oclc/15420548.json

You’ll get the following back:

    {
      "records": {
        "000791709": {
          "recordURL": "http://catalog.hathitrust.org/Record/000791709",
          "titles": [
            "\"Zhong gong dang shi\" fu dao /",
            "\u300a\u4e2d\u5171\u515a\u53f2\u300b\u8f85\u5bfc /"
          ],
          "isbns": [],
          "issns": [],
          "oclcs": ["15420548"],
          "lccns": []
        }
      },
      "items": [
        {
          "orig": "University of Michigan",
          "fromRecord": "000791709",
          "htid": "mdp.39015058510069",
          "itemURL": "http://hdl.handle.net/2027/mdp.39015058510069",
          "rightsCode": "ic",
          "lastUpdate": "00000000",
          "enumcron": false
        }
      ]
    }

Note that the ‘records’ are keyed on the local umid, also available in the ‘fromRecord’ field of each item.
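
To make the records/items relationship concrete, here is a small sketch that parses a trimmed copy of the response above and looks up each item's record through its ‘fromRecord’ key. It only uses Ruby's standard json library; the output line format is just something I made up for the example.

```ruby
require 'json'

# A trimmed copy of the response shown above (not fetched live).
body = <<~'JSON'
  {
    "records": {
      "000791709": {
        "recordURL": "http://catalog.hathitrust.org/Record/000791709",
        "titles": ["\"Zhong gong dang shi\" fu dao /"],
        "oclcs": ["15420548"]
      }
    },
    "items": [
      {
        "orig": "University of Michigan",
        "fromRecord": "000791709",
        "htid": "mdp.39015058510069",
        "itemURL": "http://hdl.handle.net/2027/mdp.39015058510069",
        "rightsCode": "ic"
      }
    ]
  }
JSON

response = JSON.parse(body)

# Each item points back at its record via 'fromRecord'.
lines = response['items'].map do |item|
  record = response['records'].fetch(item['fromRecord'])
  "#{item['htid']} (#{item['rightsCode']}): #{record['titles'].first}"
end
puts lines
```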

The generic short form is:

http://catalog.hathitrust.org/api/volumes/(idtype)/id.(outputtype)

Right now the valid idtypes are:

issn (will be normalized to just digits, no leading zeros)
isbn (will be normalized to an ISBN-13)
oclc (will be normalized to all digits, no leading zeros)
lccn (will be normalized as recommended)
htid (HathiTrust item id, seen above as “mdp.39015058510069”)
umid (the University of Michigan record ID, seen above in the “fromRecord” field of an item)

Currently the only valid outputtype is ‘json’.
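
As a sketch of the short form, here is one way to build those URLs on the calling side. The helper name volumes_url and the constant names are hypothetical; the base URL and the list of idtypes are the ones given above. Nothing here talks to the network.

```ruby
VOLUMES_BASE = 'http://catalog.hathitrust.org/api/volumes'
ID_TYPES     = %w[issn isbn oclc lccn htid umid].freeze

# Build a single-identifier URL of the form .../volumes/(idtype)/id.(outputtype)
def volumes_url(idtype, id, outputtype = 'json')
  raise ArgumentError, "unknown idtype: #{idtype}" unless ID_TYPES.include?(idtype.to_s)
  "#{VOLUMES_BASE}/#{idtype}/#{id}.#{outputtype}"
end

puts volumes_url(:oclc, '15420548')
```

Normalization happens on the server, so the sketch passes the identifier through untouched.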

More complex, multi-valued call

The full API URL is [this]

http://catalog.hathitrust.org/api/volumes/json/id:1;oclc:45678;lccn:70628581|id:2;isbn:1591581613

This is a request for data on two separate items, identified on the calling end as simply ‘1’ and ‘2’ (id:1 and id:2). The first item is searched for using both an oclc number and an lccn; the second supplies only an isbn.

Note that

The output format (json) has moved to appear right after the ‘/volumes/’
There’s an arbitrary ‘id’ field. This will be used to index the return values, so use something meaningful on your end.
Keys and values are separated by colons. Key-value pairs are separated by semi-colons.
Separate requests are separated by ‘|’ in the URL, allowing you to request data for an arbitrary number of items with a single call.
Return values are
Matches follow the “#3” option on the old post, the “Must match if present” option — basically, if you supply an identifier and a record has one of those identifiers, they must match.
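
The multi-valued URL format can be sketched as a tiny builder: colons between keys and values, semi-colons between pairs, pipes between requests. The helper name multi_volumes_url is hypothetical, and the sketch does no URL-escaping, on the assumption that the identifiers are plain like the ones in the example.

```ruby
MULTI_BASE = 'http://catalog.hathitrust.org/api/volumes'

# requests is an array of hashes, one hash per item you are asking about.
def multi_volumes_url(requests, outputtype = 'json')
  specs = requests.map do |request|
    request.map { |key, value| "#{key}:#{value}" }.join(';')
  end
  "#{MULTI_BASE}/#{outputtype}/#{specs.join('|')}"
end

url = multi_volumes_url([
  { id: 1, oclc: '45678', lccn: '70628581' },
  { id: 2, isbn: '1591581613' }
])
puts url
```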

So, in the example, the first request has both an oclc number and an lccn. Matches are as follows:

If a record has an oclc number but no lccn, its oclc number must match the passed oclc number.
If a record has an lccn but no oclc number, its lccn must match the passed lccn value.
If a record has both an lccn and an oclc number, both its identifiers must match the passed values.

The returned structure is keyed on the arbitrary id passed in the search string (if not present, the whole search string will be used instead):

    {
      "1": {
        "records": {
          "001474331": {
            "recordURL": "http://catalog.hathitrust.org/Record/001474331",
            "titles": ["Some aspects of seventeenth-century medicine & science; papers read at a Clark Library seminar, October 12, 1968"],
            "isbns": [],
            "issns": [],
            "oclcs": ["00045678"],
            "lccns": ["70628581 //r86"]
          }
        },
        "items": [
          {
            "orig": "University of Michigan",
            "fromRecord": "001474331",
            "htid": "mdp.39015004074095",
            "itemURL": "http://hdl.handle.net/2027/mdp.39015004074095",
            "rightsCode": "ic",
            "lastUpdate": "20090713",
            "enumcron": false
          }
        ]
      },
      "2": {
        "records": {
          "004370624": {
            "recordURL": "http://catalog.hathitrust.org/Record/004370624",
            "titles": ["ARBA in-depth. Philosophy and religion /"],
            "isbns": ["1591581613"],
            "issns": [],
            "oclcs": ["53462174"],
            "lccns": ["2003065945"]
          }
        },
        "items": [
          {
            "orig": "University of Michigan",
            "fromRecord": "004370624",
            "htid": "mdp.39015058261911",
            "itemURL": "http://hdl.handle.net/2027/mdp.39015058261911",
            "rightsCode": "ic",
            "lastUpdate": "20090907",
            "enumcron": false
          }
        ]
      }
    }

Enumeration / Chronology

An effort is made to return items in “enumcron order” — hopefully, with earlier volumes showing up before later volumes.

The full enumcron is listed in the items if you need to try something different.

JSONP Support

JSONP output is supported — just throw a ‘&callback=blahblahblah’ on the end of the URL you call and you’ll get a function definition back.

Some examples:

http://catalog.hathitrust.org/api/volumes/oclc/15420548.json&callback=myfunc

http://catalog.hathitrust.org/api/volumes/json/id:1;oclc:45678;lccn:70628581|id:2;isbn:1591581613&callback=myfunc

Running Blacklight under JRuby

Bill Dueber — Wed, 18 Nov 2009 00:00:00 +0000

I decided to see if I could get Blacklight working under JRuby, starting with running the test suite and working my way up from there.

There was much pain. Much, much pain. Exacerbated by my almost complete lack of knowledge about what I was doing.

This is the procedure I eventually arrived at — if there are places where I made trouble for myself, please let me know!

[And does anyone know how to get jruby’s nokogiri to link to a different libxml and stop with the crappy libxml2-version error message every time I run it under OSX???]

Download jruby

Go to jruby.org and download a binary distribution. Extract the tar.gz (or zip or whatever).

I’ll put mine in ~/jruby. Or, at least that’s what I’ll tell you.

tar xzf jruby-1.4.tar.gz

To avoid confusion, let’s make jrake an alias for rake and add the jruby bin directory to the path

    cd ~/jruby/bin
    ln -s rake jrake
    export PATH=`pwd`:$PATH

Download Blacklight

git clone git://github.com/projectblacklight/blacklight.git

Again, we’ll say that I put this in ~/blacklight/

Muck with Blacklight dependencies

Edit the file init.rb to comment out references to libxml and ruby-xslt, as well as nokogiri. My understanding is that the first two are used, at this point, only for the EAD stuff. Both rely on libxml2 which is a C-extension and hence unavailable to JRuby.

Nokogiri gets pulled in during other installs and for some reason jrake will complain later on that it’s got a wrong version or something. So, we’ll just work without that particular net for now.

    #### File ~/blacklight/init.rb
    # config.gem 'libxml-ruby', :lib=>'libxml', :version=>'1.1.3'
    # config.gem 'ruby-xslt', :lib=>'xml/xslt', :version=>'0.9.6'
    # config.gem 'nokogiri', :version=>'1.3.3'

Do some initial installs

    jgem install -v=2.3.4 rails
    jgem install activerecord-jdbc-adapter jdbc-sqlite3 \
                 activerecord-jdbcsqlite3-adapter ActiveRecord-JDBC
    jgem install rcov -s http://gemcutter.org --no-rdoc --no-ri
    jrake
    jrake gems:install

Edit the config/database.yml file

…to change the adapter to jdbcsqlite3 for development and testing.

Edit the databases.rake file

This one was harder to track down. The default rake task has hard-coded database names in the .rake file — jdbcsqlite3 isn’t included. I keep seeing things saying, “Oh, yeah, that’s been fixed…” but, well, it wasn’t for me. I had to do it by hand.

edit ~/jruby/lib/ruby/gems/1.8/gems/rails-2.3.4/lib/tasks/databases.rake

You need to find everywhere there’s a

when "sqlite", "sqlite3" # or when /^sqlite/ in one case

…and change it to

when "sqlite", "sqlite3", "jdbcsqlite3"

Repeat for other databases you want to use (e.g., mysql). For the moment, since I’m only worried about running jrake spec, that’s all I’m gonna do.
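
If it helps to see why the stock task falls over, here is a stand-alone sketch of the adapter check before and after the edit. These are not the actual contents of databases.rake; the method names are made up, and only the `when` clauses mirror the change described above.

```ruby
# Mirrors the stock clause: "jdbcsqlite3" falls through to the else branch.
def sqlite_family_before?(adapter)
  case adapter
  when "sqlite", "sqlite3" then true
  else false
  end
end

# Mirrors the edited clause with the JDBC adapter name added.
def sqlite_family_after?(adapter)
  case adapter
  when "sqlite", "sqlite3", "jdbcsqlite3" then true
  else false
  end
end

# The regex variant has the same problem: it is anchored at the start of the string.
def anchored_regex_match?(adapter)
  !(adapter =~ /^sqlite/).nil?
end

%w[sqlite3 jdbcsqlite3].each do |adapter|
  puts "#{adapter}: before=#{sqlite_family_before?(adapter)} " \
       "after=#{sqlite_family_after?(adapter)} regex=#{anchored_regex_match?(adapter)}"
end
```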

Try again

    jrake
      Missing these required gems:
        mislav-hanna  = 0.1.11

OK. Not sure why that didn’t come in before. Go ahead and add it.

jgem install  mislav-hanna

Migrate the databases

jrake

The databases should migrate, and then it’ll poop out because Solr didn’t start.

Fire up solr

Since we’re running jruby, accessing the shell doesn’t work. You’ll have to fire up your test solr instance by hand.

    cd ~/blacklight/jetty
    java -Djetty.port=8888 -jar start.jar 2>log.jetty

Try it again!

    cd ~/blacklight
    jrake spec

    ................................................................
    ................................................................
    ....F............................................................

    1)
    'ApplicationHelper Export EndNote should render the correct
    EndNote text file' FAILED
    expected: "%0 Format\n%E Greer, Lowell. \n%E Lubin, Steven. \n%E Chase, Stephanie, \n%E Brahms, Johannes, \n%E Beethoven, Ludwig van, \n%E Krufft, Nikolaus von, \n%T Music for horn \n%I Harmonia Mundi USA, \n%C [United States] : \n%D p2001. \n",
         got: "%0 Format\n%C [United States] : \n%D p2001. \n%E Greer, Lowell. \n%E Lubin, Steven. \n%E Chase, Stephanie, \n%E Brahms, Johannes, \n%E Beethoven, Ludwig van, \n%E Krufft, Nikolaus von, \n%I Harmonia Mundi USA, \n%T Music for horn \n" (using ##)
    ./spec/helpers/application_helper_spec.rb:128:

    Finished in 15.519 seconds

    193 examples, 1 failure

I can live with that for the moment. Anyone know why that spec fails?

Great! How about the features?

    jrake features
      (much output)

    59 scenarios (59 passed)
    434 steps (434 passed)
    0m51.186s

And so…

…it appears that, at least on the surface, jruby is a viable platform for Blacklight so long as I don’t actually need any of the libxml stuff. In the next couple days I’ll try and actually get it all up and running and see if I can break it.

Setting up your OPAC for Zotero support using unAPI

Bill Dueber — Fri, 06 Nov 2009 00:00:00 +0000

unAPI is a very simple protocol to let a machine know what other formats a document is available in. Zotero is a bibliographic management tool (like Endnote or Refworks) that operates as a Firefox plugin. And it speaks unAPI.

Let’s get them to play nice with each other!

How’s it all work?

Zotero looks for a well-constructed <link> tag in the <head> of the page
It checks the document on the other side of that link to see what formats are offered, and picks one to use. No, you can’t decide which one it uses. It picks.
Zotero then looks for IDs in the body of the page
If both are found and everything seems kosher, Zotero will offer the option to import some or all of the records.

What you’ll need

An OPAC whose output you can futz with
Access to an individual record’s ID in that output
A URL based on the ID that gives an RIS representation of the records
A screwdriver. Made with decent — but not too expensive — vodka and fresh orange juice.

Yes. I’m cheating.

I have all those things already. Hence, this is easy for me. If you had to, say, write some sort of weird redirection script because IDs are not first-class citizens in your OPAC’s URL scheme, or write an RIS export tool by hand, well, this will take you a bit longer.

The process

1. Build an unAPI target script

You need a script that’ll do three things:

With no arguments, return a list of available formats in general
With one argument, id=, return a list of formats available for that item. This will likely be exactly the same as #1.
With two arguments, id= & format=, return the record identified by the id in the requested format

Mine looks like this:

    // id is of the form urn:bibnum:000000000
    $id = isset($_REQUEST['id'])? $_REQUEST['id'] : false;

    // Format, at this point, had better be 'ris'
    $format = isset($_REQUEST['format'])? $_REQUEST['format'] : false;

    // Got neither? Return the general list
    if (!($id || $format)) {
      header('Content-type: application/xml');
      echo '                      ';
      exit;
    }

    // Got just the id? Return formats for that ID
    if ($id && !$format) {
      header('Content-type: application/xml');
      echo '                      ';
      exit;
    }

    // Otherwise...

    // Parse out the actual numeric part of the id from the urn: prefix
    preg_match('/^urn:bibnum:(.*)$/', $id, $match);
    $actualID = $match[1];

    // Again: format had better be 'ris' because that's all I'm supporting at this point.
    header("Location: /Search/SearchExport?id=$actualID&method=$format", true, 302);

You can see that a <format> is just a name, a mime-type, and an optional reference to documentation on the type.

I take advantage of my existing RIS export process in the redirect, at the bottom. I also built in the possibility that other types of numbers could come in — I’m hard-coding ‘bibnum’ for the moment, but could allow, say, “oclc” or “isbn” or whatnot, too.
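
For anyone not working in PHP, the three behaviors of the target script can be sketched in Ruby as a plain function that returns a status, headers, and body. This is not my production script. The <formats>/<format> XML shape and the RIS mime type are my reading of the unAPI spec and common practice, so treat them as assumptions; the function name unapi_response is hypothetical, the 404 for an unparseable id is an addition of the sketch, and the redirect target is the same /Search/SearchExport URL used above.

```ruby
require 'cgi'

XML_HEADERS = { 'Content-Type' => 'application/xml' }.freeze
# Assumed format entry: RIS under its commonly used mime type.
RIS_FORMAT  = '<format name="ris" type="application/x-research-info-systems"/>'.freeze

# params is a hash of request parameters, e.g. { 'id' => 'urn:bibnum:000123' }.
def unapi_response(params)
  id     = params['id']
  format = params['format']

  if id.nil? && format.nil?
    # No arguments: the general list of formats.
    [200, XML_HEADERS, "<formats>#{RIS_FORMAT}</formats>"]
  elsif format.nil?
    # Just an id: the formats available for that item (here, the same list).
    [200, XML_HEADERS, %(<formats id="#{CGI.escapeHTML(id)}">#{RIS_FORMAT}</formats>)]
  else
    # Both: strip the urn: prefix and redirect to the existing export URL.
    bibnum = id.to_s[/\Aurn:bibnum:(.*)\z/, 1]
    return [404, {}, ''] if bibnum.nil? # the sketch's choice, not in the PHP above

    location = "/Search/SearchExport?id=#{CGI.escape(bibnum)}&method=#{CGI.escape(format)}"
    [302, { 'Location' => location }, '']
  end
end

status, headers, _body = unapi_response('id' => 'urn:bibnum:000123', 'format' => 'ris')
puts status
puts headers['Location']
```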

2. Tell your OPAC where the script lives

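You’ll need a <link> line in the <head> section of all your pages that might have an ID on them. Per the unAPI spec, it has this form:

    <link rel="unapi-server" type="application/xml" title="unAPI" href="(URL of your unAPI script)" />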

Everything should be left alone except for the actual href.

3. Add your IDs to the HTML

In the HTML of your page, you can add one or more tags of the form:

    <abbr class="unapi-id" title="urn:bibnum:000000000"></abbr>

(where the title of the <abbr> conforms to what you’re expecting in your script).

You can put stuff inside the <abbr> but you need not. On a single-record page, you should have (I would think) only one of these things. On a search results page, you may decide to not have any, or you may decide to have one for each search result.

4. Final step

Drink your screwdriver.

Where can I see it?

Well…here’s the thing.

You can take a look at my test instance, http://dueberb.vufind.lib.umich.edu/ and play there. You cannot see it in production, because there’s a little problem.

Our old OPAC — now dubbed mirlyn-classic — had a custom translator written for it. And it worked fine, and that was great.

But now we’ve got this new software running at mirlyn.lib.umich.edu, and Zotero keeps on using the old translator no matter what you do. The only way to override it is to actually fire up sqlite3 and remove the conflicting entry from the zotero translators table. And then never update that table again.

I’ve asked around about getting it fixed (changing the target URL for the old translator to point at mirlyn-classic) but it’s Friday, and no one is around. Hopefully soon.

Thinking through a simple API for HathiTrust item metadata

Bill Dueber — Tue, 03 Nov 2009 00:00:00 +0000

EDITS:

Added “recordURL” per Tod’s request
Made a record’s title field an array and call it titles, to allow for vernacular entries
Changed item’s ingest to lastUpdate to accurately note what the actual date reflects. This gets updated every time either the item or the record to which it’s attached gets changed.
Fixed a couple typos, including one where I substituted an ampersand for a pipe in the multi-get example (thanks again, Tod).
Added a better explanation of option #4

Introduction and History

Ages ago, I wrote a simple(ish) little cgi program to get basic item-level data out of what is now Mirlyn Classic, our OPAC. Soon enough, I was asked to modify it so people could get HathiTrust data from the underlying Aleph system, check viewability of the associated items, etc.

It works…kinda…but reflects what I now look back on as blissful ignorance. It doesn’t deal at all with serials, and doesn’t deal correctly with duplicated records or cases where multiple records have the same (supposedly-unique) identifiers.

We need something better. And I’m hoping comments on this post will result in something better.

Scope

Given standard identifiers for a known item, return basic item-level metadata for volumes deposited in the HathiTrust.

I want to keep this simple. There will likely be other APIs for other, more complex (or specialized) tasks, linked data for folks who dig that sort of thing, and so on. The goal here is to make something that’s fast and can help people inline data about HT into their own OPAC or similar system.

I’d also like to get this thing in place, at least the basics, in the next two weeks. Anything longer is self-indulgent.

Data returned

At the moment, I’m only planning on offering JSON out, unless someone really, really needs something else. Speak up if you’re an edge case.

Proposed basic return structure

…complete with JSON-illegal comments embedded

    {
      "records":
        {
          "003384758": // The HathiTrust record id of a matched record
            {
              "recordURL" : "http://catalog.hathitrust.org/Record/003384758",
              "titles":  ["Full, space-joined 245s"],
              "isbns" : ["123456789X"], // any/all ISBNs on this record
              "issns" : [], // any/all ISSNs on this record
              "oclcs" : [], // any/all OCLC numbers
              "lccns"  : ["68001537"] // any/all LCCNs
            },
            ... // any more records that were matched
        },
      "items" :
        [
          {
            "fromRecord" : "003384758",
            "htid": "mdp.39015054407062",
            "itemURL": "http://hdl.handle.net/2027/mdp.39015054407062",
            "rights": "ic",
            "orig": "University of Michigan", // supplying institution
            "lastUpdate" : "20090807", // date of ingest into HathiTrust or last change
            "enumcron" : "An enumeration/chronology, if available" // OPTIONAL
          }
        ]
    }

A quick walk through the proposed return structure

Obviously, there are two sets of items: a list of records that matched the query, and a list representing the union of all items on those matched records.

records

For most purposes, people won’t care so much about the record-level data unless you’re trying to do your own error-checking (possible) or want to link to the catalog record-level page (more likely).

[I’m actually very open to just plain leaving it out.]

The format is a hash keyed on the HathiTrust record ID, which can currently be turned into a URL such as http://catalog.hathitrust.org/Record/003384758. Elements are:

recordURL: The URL to the human-readable record view in the catalog
titles: an array of all the full 245, space-separated subfields. Always present, usually with one item, sometimes more than one (vernacular entries), almost never with zero.
isbns: An array of all the ISBNs associated with the record. Always present; an empty array if none.
issns, oclcs, lccns: Same as ISBNs, but for the appropriate data.

Note that at this time, LCCNs are taken from the 010, so the LCCN array will either be empty or have one item. I left it as an array just for consistency.

items

This is an array of items, taken from all the matched records and ordered (as best I can) based on their enumcron. If no enumcron is present, order is undefined.

fromRecord: The HathiTrust record ID, as used as a key in the hash of records (explained above).
htid: The HathiTrust ID for the item.
itemURL: The URL to the page-turner (or search box, for search-only items) for this item. It’s currently just appended to the prefix “http://hdl.handle.net/2027/“, but I thought I’d include it in case the preferred URL algorithm changes at some point.
rights: The rights code for this item, as explained at http://www.hathitrust.org/hathifiles_metadata.
orig: The institution that supplied the item for digitization.
lastUpdate: The date of the last time this item or its containing record was touched, either because of ingest by the HathiTrust system or later editing, as YYYYMMDD. May be 00000000 if unknown.
enumcron: (OPTIONAL) The enumeration/chronology (e.g., “v. 3 1997” or somesuch). Again: optional. Leave out the key? Provide an empty string? Provide a false?

A word about enumcron

The enumcron string is fickle, and very local. The algorithm I’m using to sort them basically consists of taking all the numbers in the enumcron strings and zero-padding them to 8 digits, then sorting. It works pretty well, but isn’t perfect. I’m incredibly resistant to trying to do anything fancier, simply because I want it to be fast and because trying to deal with all possible enumcron formats is a sisyphean task.
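
A minimal sketch of that sorting idea, in Ruby. It is not the production code, and it makes one assumption the description above leaves open: the sort key is the list of numbers found in the enumcron, each zero-padded to 8 digits, with everything that isn't a digit ignored. The helper names and the htid values are made up for the example.

```ruby
# Assumption: the key is every run of digits in the string, padded to 8 digits.
# A missing or false enumcron yields an empty key and so sorts first.
def enumcron_key(enumcron)
  enumcron.to_s.scan(/\d+/).map { |number| number.rjust(8, '0') }
end

def sort_by_enumcron(items)
  items.sort_by { |item| enumcron_key(item['enumcron']) }
end

items = [
  { 'htid' => 'example.3', 'enumcron' => 'v. 10 1999' },
  { 'htid' => 'example.1', 'enumcron' => 'v. 2 1991' },
  { 'htid' => 'example.2', 'enumcron' => 'v. 3 1997' }
]
puts sort_by_enumcron(items).map { |item| item['enumcron'] }
```

A plain string sort would put “v. 10” ahead of “v. 2”; the padding is what fixes that.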

An actual record

Here’s the simplest possible case: a single matched record with a single item

    {
      "records":
        {
          "000366004":
            {
              "recordURL" : "http://catalog.hathitrust.org/Record/000366004",
              "titles": ["The Sneetches, and other stories. Written and illustrated by Dr. Seuss."],
              "isbns": [],
              "issns": [],
              "oclcs": ["00470409"],
              "lccns": ["68001537"]
            }
        },
      "items": [
        {
          "fromRecord": "000366004",
          "htid": "mdp.39015079651611",
          "itemURL": "http://hdl.handle.net/2027/mdp.39015079651611",
          "rights": "ic",
          "lastUpdate": "20091004",
          "orig": "University of Michigan",
          "enumcron": false
        }
      ]
    }

We can see that (despite my expectation) we don’t happen to have an ISBN for this item. The item originally came from Michigan, either ingested or last updated on October 4th, 2009. The HathiTrust catalog page for this item is http://catalog.hathitrust.org/Record/000366004 (derived from the record ID) and it is In Copyright (ic), so the itemURL goes to a page that allows only search.

Making the request

I’ll take care of normalizing data on the way in (mostly done by the Solr backend): strip leading zeros off the OCLC number, normalize the LCCN as per this page at the Library of Congress, strip anything funny-looking from the ISBN and ISSN, and (probably) convert all ISBNs into ISBN13s.
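
Two of those normalizations are easy to sketch in Ruby. This is an illustration, not the Solr-side code: the function names are hypothetical, the ISBN conversion uses the standard 978 prefix and ISBN-13 check digit, and neither function validates its input beyond length.

```ruby
# OCLC numbers: keep the digits, drop leading zeros.
def normalize_oclc(raw)
  raw.to_s.gsub(/\D/, '').sub(/\A0+/, '')
end

# ISBNs: strip anything funny-looking, then convert a 10-digit ISBN to ISBN-13.
def isbn13(raw)
  isbn = raw.to_s.upcase.gsub(/[^0-9X]/, '')
  return isbn if isbn.length == 13
  raise ArgumentError, "not an ISBN: #{raw.inspect}" unless isbn.length == 10

  core = '978' + isbn[0, 9] # drop the old check digit, add the prefix
  # ISBN-13 check digit: weights alternate 1, 3, 1, 3, ...
  sum = core.chars.each_with_index.sum { |digit, i| digit.to_i * (i.even? ? 1 : 3) }
  core + ((10 - sum % 10) % 10).to_s
end

puts normalize_oclc('00470409')
puts isbn13('0835221792')
```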

I’m anticipating three formats for a request (note: they don’t work yet. There’s no code):

Single-identifier option

    http://catalog.hathitrust.org/api/volumes/oclc/00470409.json
    http://catalog.hathitrust.org/api/volumes/lccn/68001537.json
    http://catalog.hathitrust.org/api/volumes/issn/1051290x.json
    http://catalog.hathitrust.org/api/volumes/isbn/0835221792.json

Simple and unambiguous; returns the proposed return structure as described above (and presumably amended before actual implementation). Again, any normalization that needs to be done will be done on my end, so “00470409” and “470409” are considered the same OCLC number.

Multiple-identifier, multi-request option

http://catalog.hathitrust.org/api/volumes?yourID1=oclc:00470409|lccn:68001537&yourID2=oclc:67890987|isbn:987652348X

In this format, you can see that (a) you can provide multiple pieces of metadata for a record, separated by pipe characters (|), and (b) you can provide metadata sets for multiple records at once, keyed on whatever arbitrary ID you want to use.

The return format would look like this:

    {
      "yourID1" : (return structure as described above),
      "yourID2" : (return structure as described above),
      ...
    }

What to do when the provided metadata don’t agree?

It’s entirely possible to provide an OCLC number and an LCCN that, in fact, refer to two different records. It’s also possible that we have two records in the system that should be merged, but haven’t been.

Some possible algorithms:

Require that all sent numbers match: If you send an OCLC number, an ISBN, and an LCCN, any returned record must have all three, and all must match. That seems too strict.
Return any records that match any sent numbers: I could do a boolean-OR, so any record that matches any of the numbers you send gets returned. The risk of returning too much data seems too great.
Return any records that don’t mismatch any sent numbers: The same as the first option, but null matches anything. So, if you sent an LCCN, and if the record has an LCCN, they must match. If you sent an OCLC number and if the record has an OCLC number, it, too, must match, etc. Basically, every piece of metadata, if present on both sides, must match.
Order the number types and only match the best available. We provide an ordered list of type: OCLC, LCCN, ISBN, and finally ISSN. If you provide an OCLC number and there is a record with that OCLC number, return it and ignore everything else. If you didn’t provide an OCLC number (or if you did but we didn’t get any matches), move on to the LCCN and try again, as shown below.

    // The algorithm for #4
    foreach type in (OCLC, LCCN, ISBN, ISSN) {
      next unless (providedSearch[type]);   // move on unless a number was provided
      records = recordsThatMatch(type, providedSearch[type]);
      if (records.size > 0) {               // if we found some, return
        return records;
      }
      // else, we move to the next type
    }

So, for #4, if you provide an OCLC number and we find a match or matches, stop looking and return them. If we don’t find an OCLC match but you also provided an LCCN, look for records that match the LCCN, and if found return them. Repeat with ISBN and ISSN.

Understanding #3 vs. #4

Suppose the following are true:

You provide an OCLC number O and an LCCN L
I have a record r1 with OCLC number O and no LCCN at all
I have a record r2 with LCCN L and no OCLC number at all.

Under option #3, both records would be returned. They both fulfill the criteria that they match all the supplied identifiers in all fields for which they have values. In other words, r1 has a positive match on OCLC (O == O) and a null-matches-everything match on LCCN (L == no data).

Under option #4, only r1 is returned. We first look for all records that match on the OCLC number provided, find exactly one, and return it. We never even bother to look for records that match on LCCN only.
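
To make the difference concrete, here is a minimal Ruby sketch of options #3 and #4 run against the r1/r2 example. It assumes each record carries at most one value per identifier type, and that option #3 needs at least one positive match on top of "no mismatches"; both are simplifications of mine. The Record struct and the method names are invented for illustration and are not part of any real service.

```ruby
# Preference order used by option #4.
TYPE_ORDER = [:oclc, :lccn, :isbn, :issn]

# Hypothetical record: a name plus a hash of identifier type => value.
Record = Struct.new(:name, :ids)

# Option #3: keep records that agree with the query on every identifier
# type they actually have (missing values match anything), as long as
# there is at least one positive match.
def option3(records, query)
  records.select do |rec|
    shared = query.keys.select { |type| rec.ids.key?(type) }
    shared.any? && shared.all? { |type| rec.ids[type] == query[type] }
  end
end

# Option #4: walk the types in order and return the first non-empty match set.
def option4(records, query)
  TYPE_ORDER.each do |type|
    next unless query[type]
    hits = records.select { |rec| rec.ids[type] == query[type] }
    return hits unless hits.empty?
  end
  []
end

r1 = Record.new('r1', { oclc: 'O' })  # OCLC number O, no LCCN
r2 = Record.new('r2', { lccn: 'L' })  # LCCN L, no OCLC number
query = { oclc: 'O', lccn: 'L' }

three = option3([r1, r2], query).map(&:name)  # => ["r1", "r2"]
four  = option4([r1, r2], query).map(&:name)  # => ["r1"]
```

A real implementation would also have to cope with records that hold several ISBNs or OCLC numbers, which this sketch ignores.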

Let’s pick one and see how it works in the real world

I’m leaning toward #4, but I’m open to #3 as well, or any other variant that can be computed quickly and easily on this end. We’re talking about some pretty weird edge cases when we start going down this road, and I don’t want to sacrifice ease of use and ease of computation any more than we have to.

Please comment!

You can comment here, or send email directly to me. I’ll follow up this post periodically with more thoughts and synopses of what I’ve heard.

Adding LibXML and Java STAX support to ruby-marc with pluggable XML parsers

Bill Dueber — Wed, 07 Oct 2009 00:00:00 +0000

JRuby is my ruby platform of choice, mostly because I think its deployment options in my work environment are simpler (perhaps technically and certainly politically), but also because I have high, high hopes to use lots of super-optimized native java libraries. The CPAN is what keeps me tethered to Perl, and whether or not you like Java-the-language, boy, are there a lot of high-quality libraries out there.

Since I’ve been messing around with MARC-XML parsing of late, and since Ross Singer added pluggable xml-parser awesomeness to the ruby-marc project, I thought I’d see what I could do with native Java methods when parsing MARC-XML.

And just for kicks, I threw in the old code that I wrote before that uses LibXML.

Why do this at all?

Because…er…there’s an obvious work-situation where I need to squeeze every last drop of speed out of…ruby…which we don’t use…er…

Because. Because I wanted to screw around with the technologies. Because I wanted to learn about calling java native stuff. Because I already wrote the libxml stuff. Because it feels silly to run on the JVM and not use JVM-native code to deal with XML, given that standard java projects make it seem like Java is a giant XML processor with a language wrapped around it.

What exactly did I do?

For the LibXML stuff, I copied my own code. For the java stax (javax.xml.stream.XMLStreamReader, obtained from an XMLInputFactory) parser, I stole just about everything from Ross’s nokogiri code and put it into its own module, and then slimmed down the nokogiri module and the stax module to only include their differences.

The patch is at the ruby-marc rubyforge site if you want to play along at home.

Other than using the stax or libxml parser, everything else is the same — MARC::Record objects and their components are created exactly as they are with the other parsers. It might be “fun” (for some twisted definition of “fun”) to wrap the MARC::Record interface around marc4j at some point, but right now all that’s changed is the parsing.

Do they work?

Yes. Thanks for asking. At least all the tests pass when I type ‘rake’.

How fast is it?

As always, the numbers are iffy. These were done on my desktop, with other stuff going on. I didn’t bother to benchmark rexml because we know how slow that is.

The test file is a nightly dump intended to go into our VuFind install. It was born as binary marc, and changed to marc-xml using yaz-marcdump, which is so fast that I thought maybe something had gone wrong. Holy cow, is yaz-marcdump fast.

The resulting XML is 219MB and contains 46,242 records.

The test was to open it up, loop through the records, and pull the 245 out of each. Each segment looks something like this:

    reader = MARC::XMLReader.new(filename, :parser => 'jstax')
    reader.each do |record|
      title = record['245']  # ruby-marc tags are strings
    end

Times are in seconds. I ran each one five times, with the exception of jrexml, during which I got bored. And the perl code, for which I just wanted to get a ballpark to compare.

    MRI 1.8.7
        libxml     104    (103, 103, 106, 104, 103)
        nokogiri   301    (304, 300, 301, 301, 300)
    JRuby trunk
        jrexml     547    (539, 554)
        jstax      203    (201, 208, 201, 201, 204)
    Perl 5.10 w/MARC::File::XML
        perl       340    (340)

So…faster, right?

Pretty much, yeah.

Under (MRI) ruby, Ross found that nokogiri was 3.5x faster than rexml, and my noodling-around at home showed the same speedup. Using that as a baseline, we get the following speed comparison table using the libxml time normalized to 1.00.

In case that wasn’t clear: lower numbers are better.

    libxml:    1.00
    jstax:     1.95
    nokogiri:  2.89
    jrexml:    5.26
    rexml:    10.11 (estimated; 3.5x nokogiri's time)
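
For anyone who wants to redo the arithmetic, here is a small Ruby sketch of the normalization: divide each mean time by the libxml mean. The times are the ones from the timing table; the rexml entry is only the 3.5x estimate, not a measurement. Because this divides the unrounded times, the last digit can differ slightly from a figure derived from already-rounded ratios.

```ruby
# Mean times in seconds, copied from the timing table.
times = {
  'libxml'   => 104.0,
  'jstax'    => 203.0,
  'nokogiri' => 301.0,
  'jrexml'   => 547.0
}

# rexml was not benchmarked; estimate it as 3.5x the nokogiri time.
times['rexml (estimated)'] = 3.5 * times['nokogiri']

# Normalize everything to the libxml time.
baseline = times['libxml']
ratios = times.map { |name, secs| [name, (secs / baseline).round(2)] }.to_h

ratios.sort_by { |_, ratio| ratio }.each do |name, ratio|
  puts format('%-18s %5.2f', name, ratio)
end
```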

What does it all mean?

It means that adding pluggable parsers was freakin’ brilliant.

It means that a guy like me — with no real expertise in any of the applicable technologies — can do a passable job at integrating a java library into JRuby.

And it means that if I (a) can get folks around here to use Ruby, and (b) can get them to use MARC-XML instead of binary MARC (which we can’t use anyway because of the record-length limitations), I can be sure that any bottlenecks aren’t going to be the result of those choices.

An exercise in Solr and DataImportHandler: HathiTrust data

Bill Dueber — Mon, 28 Sep 2009 00:00:00 +0000

Many of the folks who read this blog (hi, both of you! Mom, say hello to Dad!) are aware, at least tangentially, of the HathiTrust. Currently hosted by us at the University of Michigan, the most public interface to its data is a VuFind installation you can access at catalog.hathitrust.org (or, for you smart-phone types, at m.catalog.hathitrust.org). Once you do a metadata search, you get links into the actual page images or a chance to search the fulltext of the selected item (depending on its copyright status).

It’s awesome. Seriously. Even in the absence of fulltext, being able to search within an item can be incredibly useful. Give it a shot if you haven’t.

You don’t always need an OPAC

But there are plenty of folks who don’t want or need a full-blown interface into all the metadata. They’ve already got one of those. What they’re interested in, mostly, is figuring out how to easily put links in their own OPAC (or whatnot or whoseits) to the HathiTrust if page images or searching are available. See, for example, a typical record from Tod Olson’s stuff at U-Chicago; he sniffs for HathiTrust and Google Books availability via embedded javascript.

To this end, the HathiTrust folks provide a set of simple, tab-delimited files — a full extract on the first of every month, and nightly updates every …er…night.

You can see from the description of the file that it’s very simple. Tab-delimited fields of the HathiTrust ID, rights information, and all the golden-oldie standard identifiers, some of which (ISSNs, ISBNs, etc.) are further comma-delimited in cases where multiple values are available and a field repeats. And a title and enumcron (description of an individual volume, e.g., “Sept 2007, vol. 33, issue 4”), so you have something useful to display if you need to, and that’s 98% of what most folks want.
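
To make that layout concrete, here is a rough Ruby sketch of pulling one line apart. The column order, the column names, and the choice of which columns are comma-delimited are all assumptions made for the example (as is the sample line itself); check the real file description before relying on any of it.

```ruby
# Hypothetical column order; the real file description is the authority.
COLUMNS = [:htid, :rights, :oclc, :isbn, :issn, :lccn, :title, :enumcron]

# Columns assumed to hold comma-delimited repeated values.
MULTI_VALUED = [:oclc, :isbn, :issn, :lccn]

# Turn one tab-delimited line into a hash, splitting repeated identifiers
# into arrays.
def parse_hathi_line(line)
  values = line.chomp.split("\t", -1)  # -1 keeps trailing empty fields
  row = COLUMNS.zip(values).to_h
  MULTI_VALUED.each do |col|
    row[col] = row[col].to_s.split(',').map(&:strip).reject(&:empty?)
  end
  row
end

# A made-up line, not real HathiTrust data.
sample = "xyz.0001\tpd\t456\t0123456789,9780123456786\t\t\tA Title\tv. 1\n"
row = parse_hathi_line(sample)
```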

The smart way to do it: RDBMS

If you want to query this data quickly and easily, the obvious thing to do is to dump it into a database. One main table for the non-repeated values, and either a few key=>value tables (or, if you’re lazy, a single key => type/value) for the repeated ISBNs/ISSNs/whatnot. A quick mod-perl script to set up some data normalization going in and out and persist the prepared SQL queries and you’re set.

It’s hard to make an argument against using a database for these data. I mean, c’mon. We’ve got a well-defined structure. An obvious foreign-key. No full-text searching needed. This is practically designed for a good old-fashioned RDBMS. Plus, I’ve done this approximately a zillion times before, so I’m good and fast at it. Case closed.

How I’m gonna do it

Screw that. What I really wanted to do was start messing around with the DataImportHandler (DIH) in Solr.

I can make a weak argument for including the data in a Solr instance. To wit, it’ll certainly be fast enough for anything I’m gonna throw at it, and (more important to me) it’s easy to set up datastore-level indexing and querying filters with built-in facilities and/or custom code. This allows me to build clients that call it without having to worry about manipulating the input much, if at all.

The list of simple DIH examples is…well, I never really found any good ones, although I’m sure they’re out there. The documentation isn’t bad, but it’s not full of complete examples, and almost all of them have to do with the potential complexities of sucking data out of a database, which is what most people want to do. Not me, I’ve got flat files to work with.

Luckily, you can fire up an “interactive” DIH session where, at the very least, you can try to import a few rows of data and see if things are puking. I didn’t find the error reports particularly helpful all the time, but it’s about a zillion times better than nothing, I can tell you that much.

The game plan

We’ll start with the assumption that I’ve already managed to load a full dump from some date (run with me here; I’ll explain how to do it later). Then what we want to do is the following:

Every night, download the nightly additions/changes file and gunzip it.
Hit the DIH handler to import all files that (a) have a filename of the right format, and (b) have a created date after the last time the DIH handler was run.

And that’s it. Get the new stuff, have DIH figure out what’s new, and import it.

The first part is easy enough to do with perl/python/ruby/whatever. I’ll leave it as an exercise for all you diligent students.

Setting up solrconfig.xml

This is the easy part. Set up the handler, give it a semi-meaningful name, and call out to a config file.

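    <requestHandler name="/hathiimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
      <lst name="defaults">
        <str name="config">hathi-data-config.xml</str>
      </lst>
    </requestHandler>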

Define some useful data types in `schema.xml`

I left pretty much all of the boilerplate in schema.xml and just added a few types to deal with identifiers.

lowercase: return a single token that’s been lowercased. Don’t muck with it otherwise.
genericID: trim it, lowercase it, ditch everything that’s not a number or a letter, and return as a single token.
numeric: Ditch everything but the first string of digits, and then ditch any leading zeros. Useful when you know it’s gotta be an integer.
stdnum: Find the first set of digits (optionally followed by an ‘X’ and potentially interspersed with dashes or dots), strip off the leading zeros, and return it. Good to extract an ISBN from a string like “(alt) 123-45-678X electronic only”.
lccnnormalizer: Custom code to normalize an LCCN as per this page at the LoC.
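
If it helps to see what those types are supposed to do to a value, here is a Ruby approximation of the first four. This is only an illustration of the described behavior; the real work happens in Solr analyzer chains defined in schema.xml, and the method names below are mine. The LCCN normalizer is left out because it follows the LoC rules rather than a one-line pattern.

```ruby
# lowercase: a single lowercased token, otherwise untouched.
def lowercase(str)
  str.downcase
end

# genericID: trim, lowercase, drop everything that isn't a letter or digit.
def generic_id(str)
  str.strip.downcase.gsub(/[^a-z0-9]/, '')
end

# numeric: the first run of digits, minus leading zeros (nil if no digits).
def numeric(str)
  m = str.match(/\d+/)
  m && m[0].sub(/\A0+(?=\d)/, '')
end

# stdnum: the first run of digits (dashes and dots allowed inside, optional
# trailing X), with the punctuation and leading zeros removed.
def stdnum(str)
  m = str.match(/\d[\d.\-]*[Xx]?/)
  return nil unless m
  m[0].delete('.-').upcase.sub(/\A0+(?=.)/, '')
end

stdnum('(alt) 123-45-678X electronic only')  # => "12345678X"
```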

Add field definitions to `schema.xml`

This is pretty straight-forward: just set it up.

hathi-data-config.xml — define how DIH is going to work.

This, of course, is the meat of the heart of the center of the matter.

I’m going to make use of four DIH technologies:

FileDataSource: In DIH, you declare a data source from which you’ll be sucking the raw data for manipulation and massaging. I’m just using a file, so this is for me. You can, as you might expect, pull in from a URL or (as mentioned) a database via JDBC.
FileListEntityProcessor: Given a directory and a set of criteria for a file, this will return a list of filenames that match those criteria. The criteria we’ll be using are (a) a regexp the filename must match, and (b) a creation date after the last time we ran the process.
LineEntityProcessor: Once you’ve got a data source, you need to stream it in somehow. There are Processors for XML and other formats, but this one just pulls in lines one at a time. The documentation all talks about LineEntityProcessor basically only being useful for pulling in, say, a list of filenames, but since my data is all line-by-line, this is what I’m using as my primary record-fetcher. It populates a single field called rawLine for later processing.
RegexTransformer: Allows you to take a field pulled from the datasource (or already derived from previous processing) and do regexp substitutions, group extraction, or splitting.
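
The file-selection step is the one that's easiest to picture outside of Solr. Here is a Ruby sketch of the same two criteria: the filename has to match a regexp, and the file has to be newer than the last run. It is not how FileListEntityProcessor is implemented; it uses modification time as a stand-in for creation date, and the directory in the usage comment is a placeholder.

```ruby
# Return the files in dir whose name matches pattern and whose
# modification time is later than `since`, sorted by path.
def files_to_import(dir, pattern, since)
  Dir.children(dir)
     .select { |name| name =~ pattern }
     .map    { |name| File.join(dir, name) }
     .select { |path| File.file?(path) && File.mtime(path) > since }
     .sort
end

# Usage (hypothetical directory), picking up anything from the last day:
#   files_to_import('/path/to/hathi', /^hathi_full_.*\.txt$/, Time.now - 24 * 60 * 60)
```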

SO…I’m going to:

Set up a FileDataSource to read from files
Use FileListEntityProcessor to get a list of files that match my criteria
Run each through LineEntityProcessor to generate a bunch of rawLines.
Use the RegexTransformer multiple times to extract the data from the line.

[If you never went to look at it, this might be a good time to check out the description of the tab-delimited metadata files.]

And…it doesn’t work.

It almost works. The problem is that my attempt to use the variable ${dataimporter.last_index_time} is busted. There’s a ticket to fix it and a patch already provided, so it’s only a matter of time before it’s not an issue.

For the moment, though, we’ll change that line to:

That says to basically take everything created since midnight and use it. If you have cron scripts set up to run this every day, you’ll have no problems.

Dealing with a full extract

You’ll only have to do this once, of course, but it has to be done. Basically, reproduce the DIH handler with a different name, pulling in the data from a full extract (you could, e.g., just change the filename parameter to accept /^hathi_full_.*\.txt$/). Maybe call it hathifullimport instead of hathiimport.

Fire her up!

Once you’re ready to go, just hit the right URL:

    http://solrmachine:port/solr/hathifullimport?command=full-import&clean=true
    http://solrmachine:port/solr/hathiimport?command=full-import&clean=false

The first one will get the initial big, full file; the second will pull in all the nightlies you’ve downloaded, gunzipped, and put in the right place (provided, of course, they’re dated after the last midnight, or they’ve fixed DIH to allow the last_index_time syntax).
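
For the cron side of things, a minimal Ruby sketch of the two mechanical pieces might look like the following: gunzip a downloaded file, and build the URL to poke. The host, port, and file paths are placeholders, and the download itself is left out because the source location isn't covered here.

```ruby
require 'uri'
require 'zlib'

# Build the DIH URL for a given handler name. Host and port are placeholders.
def dih_url(host, port, handler, clean)
  query = URI.encode_www_form('command' => 'full-import', 'clean' => clean.to_s)
  URI::HTTP.build(host: host, port: port, path: "/solr/#{handler}", query: query)
end

# Gunzip src into dest and return dest.
def gunzip(src, dest)
  Zlib::GzipReader.open(src) do |gz|
    File.open(dest, 'wb') { |out| IO.copy_stream(gz, out) }
  end
  dest
end

# To actually trigger the nightly import (needs a running Solr):
#   require 'net/http'
#   Net::HTTP.get_response(dih_url('localhost', 8983, 'hathiimport', false))
```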

Next steps?

Beer or wine. Take your pick.

After that, though, it’d be a matter of actually writing the download scripts and setting up cron jobs. And, of course, putting a front-end on it if you want, or massaging the data as they come out to return a nice JSON format for your consumers. That sort of thing.

So, wait…is this really worth doing?

Maybe. Probably not. It was worth it to me to start thinking about DIH and how I can use it. And it might be worth it to you, if you want to play around with these data in the ways that solr makes easy.

But, like so many things, it’s less worth doing than it was worth writing up. I learned a lot.

Dead-easy (but extreme) AJAX logging in our VuFind install

Bill Dueber — Fri, 25 Sep 2009 00:00:00 +0000

One of the advantages of having complete control over the OPAC is that I can change things pretty easily. The downside of that is that we need to know what to change.

Many of you that work in libraries may have noticed that data are not necessarily the primary tool in decision-making. Or, say, even a part of the process. Or even thought about hard. Or even considered.

For many decisions I see going on in the library world, the primary motivator is the anecdote. In fact, to be honest, the primary driver is the faculty anecdote. Those cliched three curmudgeonly old faculty members invariably have huge influence over systems and interfaces that will be used by 40K undergraduates. The tiny percentage of weirdos that actually talk to reference librarians end up wielding enormous power compared to the untold masses that don’t.

Enter the dragon…er…log.

So…I’m logging everything. EVERYTHING. Everything I can think of, anyway, and that doesn’t slow things down too far.

I’ve got a simple database table set up with the following columns:

incrementing integer (solely for innodb’s efficiency needs)
sessionid
action
data1, data2, and data3 (all of these are action-dependent)
logweekday, logdate, logtime (instead of a single timestamp for easy and efficient queries)

And that’s it. I’ve had it running (initially with only a few actions) for two weeks and have on the order of 300,000 rows in it at this point. Obviously, at some point I’ll have a better idea of which data I actually care about and things will get slimmed down a little bit. But for now, it’s fun having that all around.

Common log events include an action, and usually at least one other piece of data. Stuff like:

start-a-new-session with IP Address
simple-search with search-index, searchstring
choose-a-facet with facet-index, facet-value, position-on-list
view-a-full-record with recordID, search-result-number
click on an electronic resource link to proquest/google/hathitrust/whatever

…etc. I track adding and removing things from the selected items set and the user’s favorites, exporting to email or refworks or whatnot, logging in and out, clicking on the author’s name or a LCSH subject in the full record view, picking a “similar item” from the eponymous list, clicking on the spelling suggestion and the prev/next buttons, etc. Currently I’m logging 78 events.

[Note: by “search result number” I mean the enumeration of that record in that specific search set. So, the top result is #1. The first result on page two is #21]
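
In code, that numbering is just page arithmetic. A tiny Ruby sketch, assuming 20 results per page as in the example above (the method name is made up):

```ruby
PAGE_SIZE = 20  # results per page, as in the example above

# Position of a record within the whole result set, counting from 1.
def search_result_number(page, position_on_page, page_size = PAGE_SIZE)
  (page - 1) * page_size + position_on_page
end

search_result_number(1, 1)  # => 1
search_result_number(2, 1)  # => 21
```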

What do I think I’m gonna learn?

I’m not exactly sure. Of course, I can get all the basics — how much traffic and what people are searching for — but there’s the possibility of other stuff. Things like:

Do people actually use the [prev/next, facets, facets-below-the-fold, items past the third page, etc]?
Seriously, is anyone using the boolean searches and wildcards on the basic search page? Are any of them using IP addresses from outside the staff subnet? If not, can I please please please start using DisMax???
What facets are most popular? Do people hit the little “more” button to expand the list of facet values from 6 to 30?
What’s the average search result number of a record chosen for a full-record view for each search index (perhaps an indicator of how well the relevancy-ranking is working?)
Looking at all the full-record displays, what are the patterns for those records (e.g., break down by callnumber prefix, or by our “Academic Discipline” subject)

I’ve got a lot to learn about stats, and user tracking, and clickpath analysis, but dammit I’ll have data and I’m not afraid to use it!

[Er…them. Not “it”. Data are a “them.” Always feels weird to me to refer to data in the plural, but I’m forcing myself to do so these days.]

What’s the server implementation?

I already mentioned the database. I’ve got a little module called ActivityLog that does three and a half things:

Get the session id from the session
Get logging information from the GET/POST or passed in as parameters
Modify the parameters if need be (e.g., pull domain name out of an external URL). This is the half a thing.
Stuff it into the database with appropriate timestamps.

And that’s it.
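
As a rough picture of those steps, here is a Ruby sketch that turns a session id and a set of request parameters into a row for the table described earlier. It is not the actual ActivityLog module. The parameter names (lc, lv1, lv2, lv3) are the ones the client-side javascript posts; the date and time formats, and the decision to reduce any URL in data1 to its domain, are assumptions for the example. The database insert is left out.

```ruby
require 'uri'

# If value parses as a URL with a host, return just the host;
# otherwise return the value unchanged (as a string).
def domain_or_value(value)
  host = URI.parse(value.to_s).host
  host || value.to_s
rescue URI::InvalidURIError
  value.to_s
end

# Build the row to be stored, with the timestamp split into three columns.
def build_log_row(session_id, params, now = Time.now)
  {
    sessionid:  session_id,
    action:     params['lc'],
    data1:      domain_or_value(params['lv1']),
    data2:      params['lv2'].to_s,
    data3:      params['lv3'].to_s,
    logweekday: now.strftime('%A'),
    logdate:    now.strftime('%Y-%m-%d'),
    logtime:    now.strftime('%H:%M:%S')
  }
end
```

Splitting the timestamp up front is what makes "everything on a Friday" or "everything between 2 and 3 AM" a simple equality or range query later.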

What’s the client need to do?

I start off with the following rules:

I want to be able to log damn near everything
I can’t degrade the user experience in a meaningful way just for logging
I want to log outgoing links, too.
I must must must have pretty, bookmarkable URLs.

Truth be told, some of the “client” stuff can be (and is) done on the server. When someone is, say, sending a record or set of records to RefWorks, the server knows everything it needs to know and I can just take care of logging as part of the regular request fulfillment.

But some stuff (like the search result number, say) is best taken care of from the browser. Easy enough, for the most part, esp. with form submissions and such.

The potentially-non-obvious part comes in with rule #4: I want pretty URLs. That means that the full display of record 123456789 is always going to be at /Record/123456789 no matter what the user clicked on. Ditto with adding/removing facets and such: the URL contains the resulting search, not the resulting search plus which facet was removed or added.

But — see #1 — I want to log damn near everything.

My solution (and I know lots of people are doing this; this isn’t rocket science) is to fire off an AJAX post for the click events that I’m interested in, sending log data off to my server and then not waiting for a return. Just send the data and follow the link as if nothing had intervened. It degrades gracefully (although the rest of my VuFind doesn’t, so that doesn’t matter much) and it’s dead-easy to implement.

The actual javascript implementation

I long ago switched our VuFind stuff over to use jQuery, just because I like it and know it.

First thing is to use the templates to modify the links to have a particular class (logit) and a well-structured ref (pipe-delimited values).

So, a link from the title of a work on the search-results page to the individual record will look like this:

{$title}

The ref attribute tells us that we’re going to log the type of event (record view from the search results), the ID of the record, a null in the data2 column, and the search result number.

Then there’s javascript to make all the magic happen:

    function logit(a, args) {
      a = jQuery(a);

      // Allow the caller to pass in args, or get them from the ref attribute
      if (!args) {
        args = a.attr('ref').split('|');
      }

      jQuery.post(
        url_to_the_logging_method,
        {
          'lc' : args[0],
          'lv1': args[1] || '',
          'lv2': args[2] || '',
          'lv3': args[3] || ''
        }
      );
    }

    jQuery(document).ready(function() {
      jQuery('a.logit').live('click', function(e) {
        logit(this);
      });
    });

The logit function just does a brain-dead post of the data in the ref attribute. We then bind that function to all anchors with the appropriate class, and we’re done.

[Note the use of the jQuery live event — this makes sure the event will be bound to stuff that comes in via AJAX after page load. Our links to Google Books, for example, come in like this.]

Since I’m not returning false from the logit function, the default action (actually follow the link) will fire — without even waiting for the AJAX call to come back. Delay to the user is, hopefully, unnoticeable.

Final words

This isn’t all that smart. I should be doing more data-integrity stuff than I am, and of course someone could spoof my numbers if they wanted. But someone could spoof my stats just by hitting my normal catalog pages programmatically, too, so there’s no more risk involved, and I do log IPs.

And, of course, I get my pretty URLs, and most users (i.e., those not running firebug) will never notice anything.

I don’t know that this would work for everyone, but so far it’s working pretty well for us. I’ll let you know if that continues in a post in a few weeks.

The sad truths about journal bundle prices

Bill Dueber — Wed, 23 Sep 2009 00:00:00 +0000

[Notes taken during a talk today, Ted Bergstrom: “Some Economics of Saying Nix To Big Deals and the Terrible Fix”. My own thoughts are interspersed throughout; please don’t automatically ascribe everything to Dr. Bergstrom.

Check out his stuff at Ted Bergstrom’s home page.]

Journals are a weird market — libraries buy as agents of professors, using someone else’s money, in deals of enormous complexity and uncertain value from companies that basically have a monopoly.

Similar to a few other situations: doctors prescribe drugs for patients using insurance money. Professors assign textbooks to students whose parents (in general) buy them. In all these cases, the supplier is (or is nearly) a monopoly operation.

Median price per article of for-profit journals is 3-4 times the median prices for non-profit journals. When you look at price per citation it gets even worse (because the “best” — or at least most cited — journals tend to be non-profit).

Marginal cost of supplying print journals is about a penny a page. The marginal cost of supplying electronic access is nearly zero, of course and shelf space and multiple-copies become a thing of the past.

SO…enter the Big Deal.

Academic Press and then Elsevier figured out how to price-discriminate: calculate each library’s current expenditure on paper journals, multiply by 1+x (for x about 0.15) and provide access to all their journals electronically, plus whatever paper you used to buy. Elsevier had a 5-year contract, during which they promised not to increase the price more than 7% per year.

This is a great deal for Elsevier, because they know what a library is already paying and the marginal cost of providing electronic access is essentially zero to them. Huge success — lots of libraries jumped on board, and then so did the other publishers.

Bundling deters entry into the market

Libraries who bought the first Big Deal had their payments increase 7%/year (note that about half of UC’s serials budget goes to Elsevier), but their own budget increases about 3.5%/year. So, they’re in constant cancellation mode, but you can’t cancel only a portion of the electronic access, so it’s exempt. Small-time journal publishers are the only thing left to cut.
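
To see why that squeeze only gets worse, here is a back-of-the-envelope Ruby sketch of the compounding. It assumes the bundle starts at half the serials budget (the UC figure), that the price really does rise the full 7% every year, and that both rates stay constant; those are simplifications for illustration, not numbers from the talk beyond the ones quoted above.

```ruby
# Share of the budget eaten by a bundle whose price grows faster than the
# budget, after a given number of years of constant growth.
def bundle_share(initial_share, price_growth, budget_growth, years)
  initial_share * ((1 + price_growth) / (1 + budget_growth))**years
end

(0..10).step(5) do |years|
  share = bundle_share(0.5, 0.07, 0.035, years)
  puts format('after %2d years: %.0f%% of the budget', years, share * 100)
end
```

Since the bundle can't be partially cancelled, everything it gains has to come out of the remaining, shrinking slice.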

We have to learn to say “no”

Plus…it’s an incredibly popular product. Faculty love online access, as do students. So negotiating a new contract is tough to do, because libraries (as always) are unwilling to walk away.

The theory of bargaining suggests that the library needs to know what will happen if the Big Deal bargain breaks down: what happens if we walk?

Problem: we (the libraries) don’t know what the deal is worth to us OR what it’s worth to the publisher. Valuation of the big bundles is a ridiculously complex problem.

What happens if we cut it off?

Library owns access rights to back issues of journals previously subscribed to.
Pay-per-view access required only for recent volumes

So…we need to calculate the number of pay-per-view accesses, which will obviously increase as time goes on (and more stuff falls under this model) and would go down if we were to charge consumers (faculty and students) some percentage of the cost.

Big problem: we can’t make that calculation. We have no good way of knowing what percentage of article downloads are for current journals, and the publishers don’t release that data.

Even if we had the data, though, the likely outcome after a certain amount of time, as more stuff falls into the pay-per-view window, is a new Big Deal.

What about the Big Deals themselves?

Hard to know, because there are NDAs sprinkled around like snow in Minnesota. But it turns out that these deals are FOIA-able! Yea! Elsevier actually tried to sue Ted’s group in the state of Washington, but Paul Courant and others helped to win the day, and they haven’t been sued since.

Publishers want to give the impression that the renewals of the Big Deals are basically formulaic, but in real life there are significant differences from institution to institution.

What’s the “Economic Solution”

Let users pay for what they use. If users paid their own money (about $35/article at the for-profit institutions), users will modify their own behavior AND authors will stop submitting to the expensive-to-access journals because they want their stuff to be read.

How much money are they making?

Reported profits of Elsevier and Springer are about 30% of sales. That’s a huge margin in the regular world, but their costs are tiny? Where does it all go?

Basically, it goes to lobbyists, lawyers, and executive salaries.

The Optical Society of America (physics, not eyesight) is a non-profit organization that publishes journals at about 1/3 the cost per page of Elsevier, but makes 40% profit on sales. They, of course, plow it back into physics journals, being a non-profit and all.

What’s the economic model of a journal?

Publishers have fixed costs (editing, harassing referees, typesetting, technology, etc.). I (Bill) think of this as the “First reader” cost — the cost to get one reader to be able to read it.
The marginal cost of adding more users is essentially zero.
The “efficient” option is to either allow user access at zero cost, with various institutions subsidizing the fixed costs, or just don’t publish the journal.

What can a single library do?

Not much. Faculty will scream, and one library acting alone will have essentially no effect on anything.

An interim strategy

Drop Big Deals to overpriced journals. Maintain subscription and free access to journals priced near the average cost, and subsidize (at less than 100%) pay-per-view access to the overpriced journals

How big are the differences in what people pay?

Just as an aside, almost, he tells us that while UMich and Illinois pay Elsevier about $2.25M for the “Freedom Collection”, Wisconsin pays about $1.2M for the exact same collection. Whoops!

He’s getting contracts via FOIA and analyzing the differences. I imagine there’ll be publications forthcoming.

More Ruby MARC Benchmarks: Adding in MARC-XML

Bill Dueber — Fri, 18 Sep 2009 00:00:00 +0000

It turns out that UVA’s reluctance to use the raw MARC data on the search results screen is driven more by processing time than parsing time. Even if they were to start with a fully-parsed MARC object, they’re doing enough screwing around with that data that the bottleneck on their end appears to be all the regex and string processing, not the parsing. Their specs for what gets displayed are complex enough that they want to do the work up-front.

But I remain interested, at least partially because of the reason UVA is using MARC-XML: they have MARC records too big for binary MARC format to handle. We do, too, and we’ve just been talking about what to do with them. So I’m thinking that

First, I spent some time dusting off my first attempt at ruby programming: modifying ruby-marc to use libxml if it’s available. It’s not super-well tested, but I’m pretty sure it works. And the speed increases are … well, see below.

Anyone who wants to mess with my attempt at libxml-enabled ruby-marc is welcome to do so. This is a very forgiving parser — it trusts that whatever ended up in the XML should, in fact, have been there. If you say ‘XXE’ is a control field, well, I’ll treat it as a control field.

But back to the data. A few points are obvious:

XML with REXML is dead-slow on both platforms (at least an order of magnitude slower)
XML with LibXML is competitive with binary MARC (within 20% or so)
Even with REXML, though, time to create MARC records out of the 50 input strings is less than a second, which might be ok depending on your application.

Full results

As with last time, the total numbers below show how long it took to process all 40 sets of 50 records. The unadorned numbers are the average time it took to process a set of 50 records.
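
For reference, a harness that produces both kinds of number (average per set, and total across all sets) can be sketched in a few lines of Ruby. This is not the script used for the results here; the data and the "parse" step are stand-ins so the example runs without Solr or ruby-marc, and the method name is made up.

```ruby
require 'benchmark'

# Time the given block once per set; return [average_per_set, total].
def time_sets(sets)
  total = sets.sum do |set|
    Benchmark.realtime { yield set }
  end
  [total / sets.size, total]
end

# Stand-in data: 40 sets of 50 strings. Real runs would use MARC strings
# pulled from Solr.
sets = Array.new(40) { Array.new(50) { 'x' * 100 } }

# Stand-in "parse" step. With ruby-marc one would presumably call
# something like MARC::Record.new_from_marc(str) here instead.
average, total = time_sets(sets) { |set| set.each { |str| str.reverse } }

puts format('%-28s %f', 'stand-in parse', average)
puts format('%-28s %f', 'stand-in parse (total)', total)
```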

Call up solr with a null search, get 2000 records back in batches of 50 with wt=ruby, eval it, and stick it into arrays

    jruby-Get/Eval data              0.143550
    mri-Get/Eval data                0.106550

    jruby-Get/Eval data (total)      5.742000
    mri-Get/Eval data (total)        4.262017

Turn raw strings into MARC::Record objects from MARC-Binary strings, joining all the returned MARC together first

    jruby-marc4j-multistring         0.026575
    jruby-marc-multistring           0.037175
    mri-marc-multistring             0.073396

    jruby-marc4j-multistring (total) 1.063000
    jruby-marc-multistring (total)   1.487000
    mri-marc-multistring (total)     2.935842

Turn raw strings into MARC::Record objects from MARC-XML

    mri-marc-LibXML                  0.091332
    jruby-marc-REXML                 0.799500
    mri-marc-REXML                   0.948549

    mri-marc-LibXML (total)          3.653276
    jruby-marc-REXML (total)        31.980000
    mri-marc-REXML (total)          37.941975

Conclusions

I’m not sure exactly where this leaves me, other than knowing that marc-xml is probably a viable alternative if you can use libxml. Getting a version of that code which uses native Java XML libraries when run under jruby might be a useful exercise.

Benchmarking MARC record parsing in Ruby

Bill Dueber — Thu, 17 Sep 2009 00:00:00 +0000

[Note: since I started writing this, I found out Bess & Co. store MARC-XML. That makes a difference, since XML in Ruby can be really, really slow]

[UPDATE It turns out they don’t use MARC-XML. They use MARC-Binary just like the rest of us. Oops.]

[UP-UPDATE Well, no, they do use MARC-XML. I’m not afraid to constantly change my story. This is why I’m the best investigative reporter in the business]

The other day on the blacklight mailing list, Bess Sadler wrote

Yes, we do still include the full marc record, but the rule of thumb we’re currently using is that anything that needs to display in the index view (the search results) needs to be broken out into a separate display field, because retrieving and parsing marc records for every item in a list of search results is too much of a performance hit.

This surprised me a fair bit, because in our implementation of VuFind (which uses PHP, versus Ruby for Blacklight) I do just that — grab the MARC out of Solr, parse it, and pull stuff like full titles and such out of it.

As it turns out, I’d been screwing around with calling marc4j from jruby, anyway, so I threw that into the mix, and here’s what I found.

What the benchmark tries to measure

The focus is on measuring time to parse MARC records as returned in a field from Solr in MARC-binary.

I got 40 sets of 50 records each (2000 records) from our Solr instance in ruby format and extracted the binary MARC strings. This resulted in an array of 40 sets of 50 strings, each of which is a valid MARC record.

Fifty records seems largish to me (we only display 20 at a time), but I thought I’d swing for the fences.

I’m testing along three(ish) dimensions:

jruby vs mri
marc4j vs ruby-marc (only on jruby, obviously)
parsing each string individually, or globbing them all together and treating it as if it’s a multi-record file

[Note that MRI is using Net::HTTP to get the data; I presume Curl would be faster still. It’s already faster than jruby]
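A rough sketch of the kind of timing harness this implies, using Ruby's standard Benchmark module. It is not the script behind the numbers in this post: `time_sets` is a made-up helper, and the "parsing" here is a stand-in (upcasing dummy strings) so the example runs without any MARC library installed.

```ruby
require 'benchmark'

# Time a block once per set, then report the per-set average and the total.
# `sets` is an array of batches (in the post: 40 sets of 50 strings each).
def time_sets(sets)
  times = sets.map { |set| Benchmark.realtime { yield set } }
  total = times.sum
  { average: total / times.size, total: total }
end

# Stand-in workload: 40 sets of 50 dummy strings, "parsed" by upcasing them.
sets = Array.new(40) { Array.new(50) { |i| "record #{i}" } }
result = time_sets(sets) { |set| set.map(&:upcase) }

printf("average per set %.6f   total %.6f\n", result[:average], result[:total])
```

Swapping the block for a real parse (one record at a time, or the whole set joined into one string) is what turns this into the comparison described above.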

The following data show the average time to parse out each set of 50 records and extract the first 245 (title) field from each one, along with the totals for doing all 2000 records.

Method                           User       Total      Real
jruby Get/Eval data              0.134750   0.134750 (  0.134850)
jruby Get/Eval data (2000)       5.390000   5.390000 (  5.394000)

MRI Get/Eval data                0.008500   0.012750 (  0.115942)
MRI Get/Eval data (2000)         0.340000   0.510000 (  4.637677)

jruby-marc4j-oneAtATime          0.056075   0.056075  (0.056125)
jruby-marc4j-multistring         0.027925   0.027925  (0.028000)

jruby-marc-oneAtATime            0.066625   0.066625  (0.066650)
jruby-marc-multistring           0.034300   0.034300  (0.034325)

mri-marc-oneAtATime              0.084500   0.085250  (0.086597)
mri-marc-multistring             0.085000   0.085750  (0.086026)

jruby-marc4j-oneAtATime (2000)   2.243000   2.243000  (2.244999)
jruby-marc-oneAtATime (2000)     2.665001   2.665001  (2.666000)
mri-marc-oneAtATime (2000)       3.380000   3.410000  (3.463888)

jruby-marc4j-multistring (2000)  1.117001   1.117001  (1.120001)
jruby-marc-multistring (2000)    1.371999   1.371999  (1.372999)
mri-marc-multistring (2000)      3.400000   3.430000  (3.441052)

So…the worst-case scenario is an average of about 0.085 seconds to get the first title field out of each of the 50 binary MARC records in a set once we’ve got them.

Now, I’m sure all my records came out of the cache, so my query time wasn’t very long. But we still end up with a maximum of roughly 0.2 seconds plus the time to actually do the query to end up with a set of 50 marc records.

We can see from looking at the totals that it looks like MRI’s bottleneck is the actual parsing, whereas constructing the input streams is expensive under jruby (at least the way I’m doing it), resulting in a benefit of concatenating them all together into one longish string before parsing.

Marc4j is faster (20%ish), but not enough faster to be worth the effort in my mind. Keep in mind that I have no idea how fast Marc4j is when running under pure java, without all the jruby overhead.

Bottom line, though: that seems fast enough to me.

I’ll try to benchmark with XML later on today or tomorrow.

Building a solr text filter for normalizing data

Bill Dueber — Thu, 20 Aug 2009 00:00:00 +0000

[Kind of part of a continuing series on our VUFind implementation; more of a sidebar, really.]

In my last post I made the case that you should put as much data normalization into Solr as possible. The built-in text filters will get you a long, long way, but sometimes you want to have specialized code, and then you need to build your own filter.

Huge Disclaimer: I’m putting this up not because I’m the best person to do so, but because it doesn’t look as if anyone else has. I don’t know what I’m doing. I don’t know why the code I’m showing below is the way it is, and if anyone would like to make it better, that’d be great. This is basically just a lot of pattern-matching on my part.

[A second disclaimer: I haven’t actually built this into Solr yet, although I’ve done some simple testing on the ISBN-13 checksum code. I’ll remove this disclaimer when I get a chance to actually index some data with it.]

The Setup: An ISBN-10 to ISBN-13 converter

Last time, I said I didn’t know why I hadn’t put together an ISBN longifier yet. So let’s walk through it.

This is a lot easier than most things in that I’m assuming we’re going to be getting exactly one token to work with (via the KeywordTokenizer) and can just work on it with impunity.

If you’d like to follow along, get the solr source via svn on a machine with java and ant. And junit, I think.

Where to put stuff

Of all the black magic associated with doing this, figuring out how to actually make it build is the part that’s probably easiest for Java-heads and the most confusing to the rest of us. Anyone attempting this sort of thing should probably get a good grounding in how Solr is set up and how its build system works before doing anything else.

Me? I cheated.

I basically just copied the directory structure of another project in the config directory in the solr root (looks like maybe it was velocity), did some tiny modifications to the build.xml file to change the name of the project, renamed the ‘.pom’ file and edited it in the obvious ways, and followed the copied directory structure to figure out where to put my files.

And then it worked. And I didn’t ask any questions, and metaphorically just backed away slowly with a nonchalant look on my face. Of course, if you know what you’re doing with java and ant, I’m sure there are better ways.

For the record, the directory in solr/config/umichnormalizers (where I put this stuff) would look something like this by the end of this project:

./target/
./build.xml

./src/main/java/edu/umich/lib/normalizers/ISBNLongifier.java
./src/test/java/edu/umich/lib/normalizers/ISBNLongifier.java

./src/main/java/edu/umich/lib/solr/analysis/ISBNLongifierFilter.java
./src/main/java/edu/umich/lib/solr/analysis/ISBNLongifierFilterFactory.java

You then just run ant in your config directory to generate a .jar file that can be put in solrmarc’s lib directory or (I think) jetty’s lib directory. You can also just run ant dist at the solr root level to get a .war file with your stuff embedded.

The converter

First, you just need some basic code to actually do the conversion. I’m sure this is hideously inefficient, but probably not as inefficient as the actual filter I’ll be producing in a minute.

We take in a string. If it looks like it might have a 10-digit ISBN in it (possibly with dashes or periods as delimiters), extract it, do the conversion to an ISBN-13, and return that as a 13-character string (e.g., no dashes or whatnot).

Note that I’m not working hard to determine if it’s an ISBN; this isn’t designed to try to pull an ISBN from random text. The hope is that by the time you get this far you’ve already got a pretty good idea that you’ve got an ISBN on your hands. I’m also not checking to see if the incoming ISBN is valid in any way; that’s left as an exercise for the diligent reader.

package edu.umich.lib.normalizers;

import java.util.regex.*;

public class ISBNLongifier {

  // Dashes and dots are acceptable delimiters. Should we add spaces??
  private static String ISBNDelimiterPattern = "[\\-\\.]";

  // Look for a string of nine digits followed by another digit or an X
  private static Pattern ISBNPattern = Pattern.compile("^.*?(\\d{9})[\\dXx].*$");

  public static Boolean matches(String isbn) {
    isbn = isbn.replaceAll(ISBNDelimiterPattern, "");
    Matcher m = ISBNPattern.matcher(isbn);
    return m.matches();
  }

  public static String longify(String isbn) {
    isbn = isbn.replaceAll(ISBNDelimiterPattern, "");
    Matcher m = ISBNPattern.matcher(isbn);
    if (!m.matches()) {
      throw new IllegalArgumentException(isbn + ": Not an ISBN");
    }

    // Keep the first nine digits (the old check digit is dropped) and prefix with 978
    String longisbn = "978" + m.group(1);
    int[] digits = new int[12];
    for (int i = 0; i < 12; i++) {
      digits[i] = Integer.parseInt(longisbn.substring(i, i + 1));
    }

    // ISBN-13 weights alternate 1, 3, 1, 3, ...
    Integer sum = 0;
    for (int i = 0; i < 12; i++) {
      sum = sum + digits[i] + (2 * digits[i] * (i % 2));
    }

    // Get the smallest multiple of ten > sum
    Integer top = sum + (10 - (sum % 10));
    Integer check = top - sum;
    if (check == 10) {
      return longisbn + "0";
    } else {
      return longisbn + check.toString();
    }
  }
}
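As a quick cross-check of the check-digit arithmetic, here is the same conversion sketched in Ruby. The method name `isbn10_to_13` is made up for this example, and like the Java above it makes no attempt to validate the incoming ISBN-10 check digit.

```ruby
# Convert an ISBN-10 (dashes and dots allowed) to a 13-digit ISBN string.
# Raises ArgumentError when no run of nine digits plus a check character is found.
def isbn10_to_13(isbn)
  cleaned = isbn.gsub(/[-.]/, '')
  m = /\A.*?(\d{9})[\dXx]/.match(cleaned)
  raise ArgumentError, "#{isbn}: Not an ISBN" unless m

  body = '978' + m[1] # the old ISBN-10 check digit is dropped
  # ISBN-13 weights alternate 1, 3, 1, 3, ... across the first twelve digits
  sum = body.chars.each_with_index.map { |ch, i| ch.to_i * (i.even? ? 1 : 3) }.sum
  body + ((10 - sum % 10) % 10).to_s
end

puts isbn10_to_13('0-306-40615-2') # => 9780306406157
```

The final `% 10` does the same job as the `check == 10` branch in the Java version: when the weighted sum is already a multiple of ten, the check digit is 0.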

The Factory Object

Next is a boilerplate factory object. The only change will be the package you put it in, and the last method’s name and return value.

package edu.umich.lib.solr.analysis;

import java.util.Map;
import org.apache.solr.analysis.BaseTokenFilterFactory;
import org.apache.lucene.analysis.TokenStream;

public class ISBNLongifierFilterFactory extends BaseTokenFilterFactory {
  Map args;

  public Map getArgs() {
    return args;
  }

  public void init(Map args) {
    this.args = args;
  }

  public ISBNLongifierFilter create(TokenStream input) {
    return new ISBNLongifierFilter(input);
  }
}

The actual filter

And, finally, the filter class. You’ll notice that I’m catching any illegal argument error and just returning the input unchanged. So anything that comes through that isn’t an ISBN just gets passed along.

package edu.umich.lib.solr.analysis;

import edu.umich.lib.normalizers.ISBNLongifier;
import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import java.util.regex.*;
import java.io.IOException;

public final class ISBNLongifierFilter extends org.apache.lucene.analysis.TokenFilter {

  public ISBNLongifierFilter(TokenStream in) {
    super(in);
  }

  public Token next() throws IOException {
    return normalize(this.input.next());
  }

  public Token next(Token result) throws IOException {
    return normalize(this.input.next());
  }

  public Token normalize(Token t) {
    if (null == t || null == t.termBuffer() || t.termLength() == 0) {
      return t;
    }
    // The term buffer can be longer than the term itself, so read only termLength() chars
    String val = new String(t.termBuffer(), 0, t.termLength());
    try {
      t.setTermBuffer(ISBNLongifier.longify(val));
      return t;
    } catch (IllegalArgumentException e) {
      // pass it through unchanged
      return t;
    }
  }
}

How to use it

Assuming you’ve managed to get it built into Solr and then deployed, just define it as a type in your schema.xml:

                                  # and later...

Conclusion

There it is. The rocket science is all hidden behind the import statements. My understanding is that casting the token value to/from Strings makes things horribly inefficient, but I’m pretty sure I’ve got bigger bottlenecks to tackle before worrying about this.

Going with and “forking” VUFind

Bill Dueber — Wed, 19 Aug 2009 00:00:00 +0000

Note: This is the second in a series I’m doing about our VUFind installation, Mirlyn. Here I talk about how we got to where we are. Next I’ll start looking at specific technologies, how we solved various problems, and generally more nerd-centered stuff.

When the University Library decided to go down the path of an open-source, solr-based OPAC, there were (and are, I guess) two big players: VUFind and Blacklight.

I wasn’t involved in the decision, but it must have seemed like a no-brainer. VUFind was in production (at Villanova), seemed to be building a community of similar institutions around it (e.g., Stanford), and was based on a technology stack we had some experience with (PHP). Blacklight seemed to be just getting off to a fitful start, and its Ruby stack was at that time an iffy proposition (this was before any sort of major adoption of Passenger or JRuby).

As I write this, things have flipped around a little. Andrew Nagy, the principal architect of VUFind, left Villanova for Serial Solutions and VUFind stopped being his primary focus. The Blacklight community decided to go with a major reorganization of the code to make it easier to deploy, which resulted in a flurry of refactoring and improvements and folks generally thinking things through really well. Stanford just flipped the switch from their VUFind to a Blacklight installation, and as I pointed out, the Ruby deployment options are more stable and less resource-hungry than they were back then. If the decision were being made today, it would be a much more complex analysis.

But anyway, the decision was made, and Tim Prettyman and I were tapped to do most of the hardcore nerd work to make it suitable for our environment.

Right away, I found things that would need some pretty major revision. The user model was based on a local database of logins (we use cosign), even moderately-long search strings would crash the thing, cookies were being used instead of sessions and hitting the 4K limit, search specifications were hardcoded in the PHP, and lots of the UI elements didn’t actually have working code behind them (RSS feeds, endnote export, spellcheck, etc).

So, I dug in and started learning PHP and Smarty and refactoring/rewriting/rearchitecting the crap out of it. One of the first things I did was to extract the search specification — the mapping of, say, a ‘title’ search to a weighted search of six or seven actual Solr fields — into a yaml file so we could mess around with it more easily than modifying the giant case-statement in the PHP code. I built a patch against the then-current revision, filed it as a bug, and sent email to the list.

And nothing happened. That patch is still sitting there, in fact. Maybe I’m the only one that thinks it’s useful. But in any case, there was no discussion of it, no one rejected it. It just sat. Sits. Whatever.

I could have asked for write access to the repository, but I didn’t. I saw a few other patches get submitted and met with yawns all around, and started looking more closely at the list and saw pretty much no one doing anything with the then-current code base, and frankly kind of gave up. The folks that I knew were working actively on implementing VUFind — us, Australia and Alan Rykhus at MNPals — were all working from very different code bases, which made our ability to share code very limited. Any sort of official work on VUFind seemed to have slowed to a near standstill (based on svn checkins), and almost no one else seemed interested in submitting patches. After a while, we stopped, too.

So, we didn’t really fork VUFind. We just rewrote much of it and stopped trying to generate interest in our changes. The right thing to do would have been to either grab the bull by the horns, or do an actual fork of the project. But we didn’t feel as if we had time to shepherd a project of this size, and after many, many (many) discussions, decided to just do our thing. I assume that’s what everyone else has done, too, since I see plenty of differences in how things work at the different sites.

As it stands, the wiki shows a good handful of libraries live with VUFind, and a bunch more marked as being in “beta.” I don’t know if what we’re running Mirlyn on is still enough VUFind to be called VUFind. Probably. The basic structure is the same, the search syntax as exposed in the URL is the same. The plumbing underneath is changed in a lot of ways, and I like to think the flow of control makes a little more sense now.

In real life, of course, it doesn’t matter where you draw the line. Our code is far enough removed from the svn repository now that we’re essentially going it alone.

That doesn’t bother me.

The reality is that we’ve taken control of the UI and learned what we need to know about using Solr with our data. If I need to change the backend — to Blacklight, to a newer VUFind, to anything — my users need not ever know, other than to notice that things are a little bit better. If we end up moving to a release-quality version of VUFind, there’s almost nothing I can’t reuse if it makes sense.

We’ve also learned a lot. Solr, obviously, and how to write text filters for it and push it around just a little bit. Solrmarc, too. But we’ve also taken a hard look at data normalization in ways we haven’t before, and decided how we’re going to output to Refworks, and to email, what kinds of searches we want to offer, where we have collisions in ID namespaces (OCLC & ISSN, I’m looking at you).

We’ve discovered issues and problems with our data we’d have never seen otherwise, and started up whole sets of conversations about OPAC issues that used to languish for lack of a reification for reference. The ability to actually (try to) implement the collective intelligence of the library and embody it in a public-facing system is a rush compared to fighting with the ILS.

The system has tons of problems still, starting with underlying templates that will make you a little sick if you do a “view source” and going right through my call number search not working for some edge cases. But that stuff will get cleaned up as we get a little downtime from adding new features, and there are elements of the new backend code that could be useful to others once I clean them up and remove local dependencies.

I’m not sure when, if ever, we’ll start thinking of ourselves as part of the “VUFind community” again. The heavy intellectual lifting about how to organize what is essentially a front-end for Solr doesn’t seem to be happening on the VUFind list. And to be honest, I’m not sure it should be. Solr is the real engine. Solrmarc is, for us right now, an important piece. Data normalization, translation, workaround for crappy data, and the basic information theory of a faceted search system are all independent of the particular middleware you’re using to grab Solr results and throw them up on the screen.

So, what we have is good for us, for now, and we’re continuing to learn how to move forward. And I’ve been able to get bug reports and say, “Thanks, Fixed” fifteen minutes later and get warm fuzzy feelings that don’t usually accompany, “Thanks. I’ll put a request in at Ex Libris’ online ticket system”.

Next time: using and abusing Solr for data normalization.

Easy Solr types for library data

Bill Dueber — Wed, 19 Aug 2009 00:00:00 +0000

[Yet another bit in a series about our Vufind installation]

While I’m no longer shocked at the terrible state of our data every single day, I’m still shocked pretty often. We figured out pretty quickly that anything we could do to normalize data as it went into the Solr index (and, in fact, as queries were produced) would be a huge win.

There’s a continuum of attitudes about how much “business logic” belongs in the database layer of any application. Some folks (including super-high throughput sites, but mostly people who have never used anything but MySQL) tend to put no logic into the database. I’ve always edged over the middle to the other side of that debate, preferring to let the database do type-checking and conversions and track foreign keys and the like.

Solr, while not a traditional RDBMS, offers this type of functionality in its text filters. One can pipe data through a few standard filters, or write a custom one in Java if need be. The nice part is that it applies at index and query time. One obvious application, which I somehow haven’t bothered to write yet, is to convert all ISBNs to 13-character new-style ISBNs upon both index and query. That way, you don’t care if your original records had the short or long form; all the data gets converted no matter how it comes in.

Our standard text field is similar to the default schema.xml, for example, running text through the following filters:

UnicodeNormalization to normalize unicode composition and (optionally) remove diacritics
StopFilter to ignore stopwords in a separate file
WordDelimiter to do intelligent word delineation
LowerCase to…you know…lowercase everything
EnglishPorter to do stemming
RemoveDuplicates to do what it says

And because it happens on index and on query, everything works out.

We’re running Solr basically from trunk — whenever we need to change something, I pull down a fresh svn copy, put in our local changes to make sure it all works, and then deploy — so I have access to stuff slated for Solr 1.4, including most importantly Trie fields and the PatternReplaceFilterFactory.

The stdnum type

One of the first things we defined was a “stdnum” type, to deal with supposedly-unique identifiers, possibly with embedded dashes and dots and leading/trailing nonsense. Here’s a variant.

Let’s walk through it. It could probably be done in one go, but solr is not our bottleneck at this point…

We start by defining it as a TextField because it’s the only type that can take filters.
We then declare that instead of the standard tokenizer, we’re using the KeywordTokenizer. Confusingly, the KeywordTokenizer doesn’t tokenize in the traditional sense — it just returns the whole input as a single token.
Lowercase it.
Trim spaces off both ends
Skip any leading non-digits, find a string of numbers, dashes, and dots, with optional x at the end, and skip everything after it.
Remove anything left that isn’t a digit or an ‘x’.
Remove leading zeros, if you’ve got ’em.

The net effect is a trimmed string that has only digits (with an optional trailing ‘x’) and removes any leading zeros.
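The same chain can be approximated outside Solr, which is handy for checking what a given messy value will turn into. This is a sketch only: `stdnum_normalize` is a made-up name, and the regular expressions are my reading of the steps listed above, not a copy of the patterns in the actual schema.

```ruby
# Mimic the stdnum filter chain on a single string (one token in, one token out).
def stdnum_normalize(raw)
  s = raw.downcase.strip
  # Skip leading non-digits, keep the first run of digits/dashes/dots plus an
  # optional trailing x, and drop everything after it.
  s = s.sub(/\A[^\d]*([\d\-.]+x?).*\z/m, '\1')
  s = s.gsub(/[^\dx]/, '') # remove anything left that isn't a digit or an x
  s.sub(/\A0+/, '')        # remove leading zeros
end

puts stdnum_normalize(' ISSN: 0028-0836 (print)') # => 280836
puts stdnum_normalize('1234-567X')                # => 1234567x
puts stdnum_normalize('ISSN2: 1234567X (online)') # => 2
```

The last call shows the failure mode where a digit in the leading label gets picked up as the number: the "2" in "ISSN2" is the first run of digits, so it wins.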

We use this “stdnum” field for ISBNs and ISSNs (and I think OCLC numbers) and it should work for any messy numerics you might have lying around. If you wanted to, you could change the regexp to enforce a minimum string of digits so it doesn’t get confused by any leading nonsense, e.g., “ISSN2: 1234567X (online)”. But if your data are that bad, you may have bigger problems to worry about.

textProper type

We define a textProper type that is exactly the same as the default text type, but without the stemming and synonyms. In the presence of stemming, exact matches and stemmed matches count the same toward relevancy (e.g. row and rowing). We had plenty of examples where exact results were getting overridden by the stemmed results, and this is confusing.

So for most of our important fields, we index them as both text and textProper so we can apply different weights to searches against them.

By the way, don’t forget to make sure your authors are in a textProper type; you don’t want stemming on author names!
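To make the "different weights" idea concrete, here is a small sketch that builds a `qf` value in the `field^boost` form used by Solr's dismax query parser. Whether you weight things through dismax or some other way is up to you; the field names `title_proper` and `title` and the boost numbers are invented for illustration.

```ruby
# Build a dismax-style qf string ("field^boost field^boost ...") from a hash.
def qf_string(weights)
  weights.map { |field, boost| "#{field}^#{boost}" }.join(' ')
end

# Hypothetical fields and boosts: the unstemmed copy counts for more,
# so exact matches outrank stemmed ones.
puts qf_string('title_proper' => 10, 'title' => 2)
# => title_proper^10 title^2
```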

exactmatcher type

The name exactmatcher is a red herring, of course. It’s not an exact matcher. It just strips out all the delimiters so we can pretend it’s an exact match.

That’s it. Lowercase it, normalize the unicode, and pull out everything that’s not a (unicode) letter or number.

Note that we’re still using KeywordTokenizerFactory — we’re getting exactly one token out of this thing. That means that the query input either matches (as one string) or it doesn’t.

Here’s how we use it:

Control numbers: Our controlnums (old ids, that sort of thing), report numbers, sdr numbers (related to HathiTrust), the HathiTrust ID
Callnumbers: I also try to normalize LC, but this helps people find everything else
Titles: in addition to a regular tokenized title, we index the 245a (as title_a) and the 245ab (as title_ab). If someone types in an exact match for either of them, we shoot the relevancy through the roof (more so for the title_ab than the title_a, obviously). This makes known item searching a little less painful.

wildcard searching

One downside of using all these filters is that Solr ignores filters when doing wildcard searches. There is a patch floating around that will use an analyzing query parser for wildcard searches, but I haven’t had time to fiddle around with it.

One thing you can do is to do the exact same normalization in your calling code and then throw a ‘*’ on the end of it. The data are in the index, after all — you just have to do the filtering yourself. For example, for a cheap and easy “Title starts with” search, you can do the same normalization in PHP or Ruby or whatever as we do in the Solr exactmatcher type, drop a ‘*’ on the end of it, and query against the exactmatcher version of your title. Voila.
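A minimal Ruby sketch of that trick, assuming the exactmatcher chain amounts to "normalize the unicode, lowercase, drop everything that isn't a letter or number." The method names and the `title_exactmatcher` field name are hypothetical, and NFC is a guess at the normalization form; use whatever your Solr filter is actually configured to do.

```ruby
# Approximate the exactmatcher chain in calling code.
def exactmatcher_key(text)
  text.unicode_normalize(:nfc).downcase.gsub(/[^\p{L}\p{N}]/, '')
end

# Build a cheap "title starts with" query against the exactmatcher field.
# Returns nil when nothing searchable is left after normalization.
def title_starts_with_query(user_input, field = 'title_exactmatcher')
  key = exactmatcher_key(user_input)
  return nil if key.empty?
  "#{field}:#{key}*"
end

puts title_starts_with_query('Moby-Dick; or, The Whale')
# => title_exactmatcher:mobydickorthewhale*
```

Because the key is letters and digits only, none of the punctuation the user typed makes it into the query string.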

Custom filters

Regular expressions can get you ridiculously far, but for a couple of cases it’d be nice to have custom code running. I’ve already mentioned that we should be upcasting all our ISBNs to the 13-character variant. The other two areas where I do this are to normalize LCCNs and to badly normalize LC CallNumbers. I’ll talk about both soon.

Sending unicode email headers in PHP

Bill Dueber — Mon, 17 Aug 2009 00:00:00 +0000

I’m probably the last guy on earth to know this, but I’m recording it here just in case. I’m sending record titles in the subject line of emails, and of course they may be unicode. The body takes care of itself, but you need to explicitly encode a header like “Subject.”

$headers['To'] = $to;
$headers['From'] = $from;
$headers['Content-Type'] = "text/plain; charset=utf-8";
$headers['Content-Transfer-Encoding'] = "8bit";
$b64subject = "=?UTF-8?B?" . base64_encode($subject) . "?=";
$headers['Subject'] = $b64subject;

$mail =& Mail::factory('sendmail', array('host' => $host,
                                         'port' => $port));
$retval = $mail->send($to, $headers, $body);
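The `=?UTF-8?B?...?=` wrapper is the RFC 2047 "encoded-word" form, and it is the same in any language. For comparison, a Ruby sketch of just the header-encoding step (`encode_subject` is a made-up name; nothing here sends mail):

```ruby
# Wrap a string as a Base64 RFC 2047 encoded-word for use in a mail header.
def encode_subject(subject)
  b64 = [subject.encode('UTF-8')].pack('m0') # 'm0' = Base64 with no line breaks
  "=?UTF-8?B?#{b64}?="
end

puts encode_subject('Capitalism, primitive and modern')
```

One caveat: RFC 2047 caps each encoded-word at 75 characters, so a long record title would strictly need to be split across several encoded-words. This sketch, like the PHP above, doesn't do that.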

Rolling out UMich’s “VUFind”: Introduction and New Features

Bill Dueber — Fri, 14 Aug 2009 00:00:00 +0000

For the last few months, I’ve been working on rolling out a ridiculously-modified version of Vufind, which we just launched as our primary OPAC, Mirlyn, with a slightly-different version powering catalog.hathitrust.org, a temporary metadata search on the HathiTrust data until the OCLC takes it over at some undetermined date.

(Yeah, the HathiTrust site is a lot better looking.)

[Our Aleph-based catalog lives on at mirlyn-classic; I’ll be interested to see how the traffic on the two differs as time goes on.]

I’m going to spend a few posts talking about how and why we essentially forked vufind, what sorts of modifications I made, and what technologies I hope to extract from our implementation that may be useful to the wider library community. And, I’m sure, a lot about why I hate Solr, why I love love love Solr, why I hate PHP, and why I love…er…no, I still hate PHP.

Credit where it’s due

And… a little credit where it’s due. I did a lot, but I didn’t do it all. I probably didn’t even do most of it. Half the effort, including all the heavy Aleph lifting (from getting the MARC out with all the filters and expansions we needed, to pulling holdings in real time, to grabbing a patron’s current checked-out items and holds, to fighting the inevitably-scarring battle with ILL), was done by Tim Prettyman. Suzanne Chapman lent her expertise to make it a lot less ugly and more usable than it once was (you can see her talents more strongly expressed at the HathiTrust catalog). And a whole horde of librarians were tapped by my boss, Jon Rothman, to try to figure out how to deal with the MARC data and facets and everything else that required a much deeper understanding of our data than I possess.

Non-stock user-facing features

In the next post, I’ll start with a look at how and why we changed the backend and what I’d do differently if I were starting from scratch. But right now, a quick list of the user-facing stuff that you might find interesting.

Email and export searches and search results, as opposed to just individual records.
Working endnote and refworks export.
Multi-select on the advanced search (e.g., pick two languages to get English OR German).
Publication date-range searching (with date-added-to-catalog searching coming soon).
A “sticky” institution selection, so each campus can choose to default to searching just their own stuff. We sniff IPs to set a default, too.
A “call number starts with” search based on semantics for LC searches (e.g., searching on CA11 won’t find CA1105), with call number range searching in testing now.
Contracted holdings for long lists of serials (see, e.g., Nature).
[Coming soon] Selecting records to a temporary set, which can be manipulated en masse (sent to Refworks, etc.). I’ll be hooking this up to mTagger, our home-grown bookmarking and tagging tool, later on.

Of course, I also broke some things. I haven’t added back in Search History, but will do so when I’ve got a couple hours. “Search Within” will make a comeback soon, too, but there are usability issues to contend with. And …for the love of god, don’t do a “View Source.” It’s the ugliest HTML underpinnings I’ve been associated with since 1993 or so.

All in all, though, it’s not bad work, and I’m glad to be able to offer it to our patrons.

Sending MARC(ish) data to Refworks

Bill Dueber — Mon, 11 May 2009 00:00:00 +0000

Refworks has some okish documentation about how to deal with its callback import procedure, but I thought I’d put down how I’m doing it for our vufind install (mirlyn2-beta.lib.umich.edu) in case other folks are interested.

The basic procedure is:

Send your user to a specific refworks URL along with a callback URL that can enumerate the record(s) you want to import in a supported form
Your user logs in (if need be) and gets to her RefWorks page
RefWorks calls up your system and requests the record(s)
The import happens, and your user does whatever she wants to do with them

Of course, there are lots of issues with doing this well (quick! Is this MARC record for a book? An edited book? Is it a journal, or a serial of some other sort? Who’s the actual author/editor?), but doing it at all isn’t so bad.

The URL to send them to

This is the “Export this record” URL on my system:

http://www.refworks.com.proxy.lib.umich.edu/express/expressimport.asp? vendor=[your system]& filter=MARC+Format& database=All+MARC+Formats& encoding=65001 &url=[your callback URL]

Note that the vendor variable should be a unique string (made up by you) for your system, not a larger entity (like the whole library or the institution).
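Since the callback URL is itself a URL with its own query string, it needs to be encoded before it goes into the `url` parameter. A small Ruby sketch of assembling the export link (the method name is made up, the vendor string and callback URL in the example are placeholders, and the base URL is simply the one shown above):

```ruby
require 'uri'

# Assemble the RefWorks export URL; encode_www_form handles the escaping,
# turning spaces into '+' and escaping the callback URL's own ?, &, = and /.
def refworks_export_url(vendor, callback_url,
                        base: 'http://www.refworks.com.proxy.lib.umich.edu/express/expressimport.asp')
  params = {
    'vendor'   => vendor,
    'filter'   => 'MARC Format',
    'database' => 'All MARC Formats',
    'encoding' => '65001',
    'url'      => callback_url
  }
  "#{base}?#{URI.encode_www_form(params)}"
end

puts refworks_export_url('My Catalog', 'http://example.org/export?id=1')
```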

The “MARC Format” filter we’re using is not a filter for real MARC. It’s a MARC-like delimited format (see an example from my catalog).

Basically, you have three types of lines (but really, look at the example, ’cause it’ll make everything a lot clearer):

LEADER : LEADER [one space] [leader text]

Control Field : [three-digit control tag] [four spaces] [data text]

Data Field : [three-digit data tag] [one space] [ind1] [ind2] [one space] [value of subfield a] [other subfield constructs]

…where [other subfield constructs] look like

  [pipe character][subfield code][subfield value]

Notice that (a) there’s no leading ‘|a’ before the subfield a value, and (b) there are no spaces between the pipe, the subfield code, and the subfield value for the non-code-a subfields.

Some easy PHP code to produce such a format is as follows. Note that I’m sending it as text (because it’s not MARC) and UTF-8. If you’ve got MARC-8, you’ll have to convert it before sending.

    $m = $this->marcRecord;
    header('Content-type: text/plain; charset=UTF-8');

    echo 'LEADER ', $m->getLeader(), "\n";

    foreach ($m->getFields() as $tag => $val) {
      echo $tag;
      if ($val instanceof File_MARC_Control_FIELD) {
        echo '    ', $val->getData(), "\n";
      } else {
        echo ' ', $val->getIndicator(1),  $val->getIndicator(2), ' ';
        $subs = array();
        foreach ($val->getSubFields() as $code=>$subdata) {
          $line = '';
          if ($code != 'a') {
            $line = '|' . $code;
          }
          $subs[] = $line . $subdata->getData();
        }
        echo implode(' ', $subs), "\n";
      }
    }
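For anyone not working in PHP, here is a minimal Python sketch of the same three line types. The input shape (tuples for control and data fields) is an assumption made up for the example and is not tied to any MARC library.

```python
def marc_like_lines(leader, fields):
    """Render the MARC-like delimited format.

    fields is a list of tuples (a made-up input shape for this sketch):
      (tag, data)                              for control fields
      (tag, ind1, ind2, [(code, value), ...])  for data fields
    """
    lines = ["LEADER " + leader]
    for field in fields:
        if len(field) == 2:
            tag, data = field
            lines.append(tag + "    " + data)  # tag, four spaces, data
        else:
            tag, ind1, ind2, subfields = field
            parts = []
            for code, value in subfields:
                # Subfield a is written bare; the rest get a pipe and their code
                prefix = "" if code == "a" else "|" + code
                parts.append(prefix + value)
            lines.append(tag + " " + ind1 + ind2 + " " + " ".join(parts))
    return "\n".join(lines) + "\n"


print(marc_like_lines("00000nam", [
    ("001", "12345"),
    ("245", "1", "0", [("a", "Title"), ("b", "sub"), ("c", "by me")]),
]))
```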

MARC-HASH: The saga continues (now with even less structure)

Bill Dueber — Wed, 15 Apr 2009 00:00:00 +0000

After a medium-sized discussion on #code4lib, we’ve collectively decided that…well, ok, no one really cares all that much, but a few people weighed in.

The new format is: A list of arrays. If it’s got two elements, it’s a control field; if it’s got four, it’s a data field.

SO….it’s like this now.

 {
   "type" : "marc-hash",
   "version" : [1, 0],

   "leader" : "leader string",
   "fields" : [
     ["001", "001 value"],
     ["002", "002 value"],
     ["010", " ", " ",
       [
         ["a", "68009499"]
       ]
     ],
     ["035", " ", " ",
       [
         ["a", "(RLIN)MIUG0000733-B"]
       ]
     ],
     ["035", " ", " ",
       [
         ["a", "(CaOTULAS)159818014"]
       ]
     ],
     ["245", "1", "0",
       [
         ["a", "Capitalism, primitive and modern;"],
         ["b", "some aspects of Tolai economic growth"],
         ["c", "[by] T. Scarlett Epstein."]
       ]
     ]
   ]
 }
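Telling the two kinds of field apart is then just a length check. A small Python sketch, using a cut-down version of the record above (the function name is made up for the example):

```python
import json

record = json.loads("""
{
  "type": "marc-hash",
  "version": [1, 0],
  "leader": "leader string",
  "fields": [
    ["001", "001 value"],
    ["245", "1", "0", [["a", "Capitalism, primitive and modern;"],
                       ["b", "some aspects of Tolai economic growth"]]]
  ]
}
""")


def split_fields(rec):
    """Separate control fields (two elements) from data fields (four elements)."""
    control, data = [], []
    for field in rec["fields"]:
        if len(field) == 2:
            control.append(field)
        elif len(field) == 4:
            data.append(field)
        else:
            raise ValueError("unexpected field length: %r" % (field,))
    return control, data


control, data = split_fields(record)
```

Because everything stays in one ordered list, the original field order survives the round trip.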

MARC-HASH control field, now with less structure

Bill Dueber — Wed, 15 Apr 2009 00:00:00 +0000

Why do I ever, ever think that MARC might not rely on order? I don’t know.

In any case, control fields will now be just an array of duples:

 control: [
   ['001', 'value of the 001'],
   ['006', 'value of the 006'],
   ['006', 'another 006']
 ]

MARC-Hash: a proposed format for JSON/YAML/Whatever-compatible MARC records

Bill Dueber — Mon, 13 Apr 2009 00:00:00 +0000

In my first shot at MARC-in-JSON, which I appropriately (and prematurely) named MARC-JSON, I made a point of losing round-tripability (to and from MARC) in order to end up with a nice, easy-to-work-with data structure based mostly on hashes. “Who really cares what order the subfields come in?” I asked myself.

Well, of course, it turns out some people do. Some even care about the order of the tags. “Only in the 500s…usually” I was told today. All my lovely dreams of using easy-to-access hashes went up in so much smoke.

So…I’m suggesting we try something a little simpler. Something so brain-dead, in fact, that I’m loath to put it down because it’s pretty much the obvious way to do it. To wit:

 {
   "type" : "marc-hash",
   "version" : [1, 0],

   "leader" : "leader string",
   "control" : [
     ["001", ["all", "001", "values"]],
     ["002", ["all", "002", "values"]]
   ],
   "data" : [
     ["010", " ", " ",
       [
         ["a", "68009499"]
       ]
     ],
     ["035", " ", " ",
       [
         ["a", "(RLIN)MIUG0000733-B"]
       ]
     ],
     ["035", " ", " ",
       [
         ["a", "(CaOTULAS)159818014"]
       ]
     ],
     ["245", "1", "0",
       [
         ["a", "Capitalism, primitive and modern;"],
         ["b", "some aspects of Tolai economic growth"],
         ["c", "[by] T. Scarlett Epstein."]
       ]
     ]
   ]
 }

Stupid MARC allows all the stupid fields to stupid repeat and be out of stupid order and such, so it’s just a lot of arrays. Easily round-tripable.

Why bother? Excellent question, and one that’s a little harder to answer now that the data structure requires so much looping to find anything (the first time, anyway). I guess it’s still a lot easier than working with raw MARC (or, I would claim, MARC-XML), requires no special libraries in any language that supports strings, hashes, and arrays, and can be manipulated with basic language constructs.

A few things worth noting about the assumptions in my mind:

By definition, it’s always UTF-8. The leader should be changed to note this on the sending end, but it’s not required.
We include both a type “marc-hash”, and a version with major/minor numbers.
Everything is a string.
Alpha characters in indicators/tags are all lowercased.
A control field is a duple: tag and array of values.
A data field has four values:
- The tag
- Indicator one
- Indicator two
- An array of duples: subfield and its value

A simple transformation to make it a little more queryable

Let’s say you don’t give a damn about tags that appear out of order, because that’s just a crime against nature, anyway. And you really don’t care what order the subtags appear in most of the time, ’cause really, who does?

A simple run-through (pseudocode ahead):

   my marchash = getTheMarcHash();
   my kindamarc;
   kindamarc{leader} = marchash{leader};

   # Map the control fields by tag => array-of-values
   foreach cfield (marchash{control}) {
     kindamarc{control}{cfield[0]} ||= [];
     kindamarc{control}{cfield[0]}.push(cfield[1]);
   }

   foreach d (marchash{data}) {
     (tag, ind1, ind2) = (d[0], d[1], d[2]);

     # build up a hash of subfield => array-of-values for this tag
     newd = {};
     foreach subfield (d[3]) {
       (stag, sval) = subfield;
       newd{stag} ||= [];
       newd{stag}.push(sval);
     }

     # Store the subfield hash in a few places so it's easy to find.
     foreach i1 ('*', ind1) {
       foreach i2 ('*', ind2) {
         kindamarc{data}{tag}{i1}{i2} ||= [];
         kindamarc{data}{tag}{i1}{i2}.push(newd);
       }
     }
   }

Control fields are stored as arrays of values associated with the tag. Data fields are built up as a hash of subfield to array-of-values pairs, and then stored both based on the indicator given and the wildcard indicator ‘*’.

Basically, this will allow things like this:

   $leader = $kindamarc{leader};
   $first001 = $kindamarc{control}{"001"}[0];

   # Find 856s where indicator 2 is '1'
   @mystuff = $kindamarc{data}{856}{'*'}{1};

It’s easy to see how we could store the index from the original array to make it easy to find the original order, too.
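For the curious, here is one way the whole transformation might look in runnable form. This is a Python sketch under a few assumptions: it takes the control/data layout proposed in this post, it keeps an array of values per subfield code, and the function name, the "subfields" and "position" keys are made up for the example. The "position" key is the stored original index mentioned above.

```python
from collections import defaultdict


def make_queryable(marchash):
    """Index a marc-hash record (control/data layout) for easy lookups."""
    control = defaultdict(list)
    for tag, values in marchash["control"]:
        control[tag].extend(values)  # a control field carries an array of values

    # data[tag][ind1][ind2] -> list of entries, one per matching field
    data = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for position, (tag, ind1, ind2, subfields) in enumerate(marchash["data"]):
        by_code = defaultdict(list)
        for code, value in subfields:
            by_code[code].append(value)  # repeated subfields are kept, in order
        entry = {"subfields": dict(by_code), "position": position}
        # File the entry under the real indicators and under the '*' wildcard
        for i1 in {"*", ind1}:
            for i2 in {"*", ind2}:
                data[tag][i1][i2].append(entry)

    return {"leader": marchash["leader"], "control": control, "data": data}


record = {
    "leader": "leader string",
    "control": [["001", ["abc"]]],
    "data": [
        ["856", "4", "1", [["u", "http://example.org/a"]]],
        ["856", "4", "0", [["u", "http://example.org/b"]]],
    ],
}
kindamarc = make_queryable(record)
first001 = kindamarc["control"]["001"][0]
# 856s where indicator 2 is '1'
hits = kindamarc["data"]["856"]["*"]["1"]
```

Sorting any list of hits by "position" gets you back to the original field order.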

For many, I’m sure, the prospect of dealing with something like this is more daunting than just learning to use MARC-XML or using existing libraries to deal with straight MARC. But there seems to be a set of folks out there for whom this might be useful, so I’m throwing it out there.

A plea: use Solr to normalize your data

Bill Dueber — Mon, 30 Mar 2009 00:00:00 +0000

[Only, of course, if you’re using Solr. Otherwise, that’d be dumb.]

We’ve been working on Mirlyn2-Beta, our installation of VuFind for some time now (don’t let the fancy-pants name scare you off), and the further we get into it, the more obvious it is that I want to move as much data normalization into Solr itself as possible.

Arguments about how much business logic to move into the database layer, in the form of foreign-key requirements, cascading inserts and deletes, stored procedures, etc. are as old as the features themselves. Solid arguments for and against are made on all sides, and like all things, there’s a happy middle ground for most people. [1. “Most,” in this case, excluding the old-time MySQL fanboys who took it as gospel that all data validation and manipulation belongs in the application layer, because their “database” didn’t do any of it. February 30th in a date field, anyone?]

But Solr provides an incredibly compelling use case because it allows for data transformation at both index and query time via the use of custom analyzers (or a standard analyzer with text filters applied). We’re starting to migrate our schema to use more and more of these things, and I even went so far as to create a custom text filter for LCCNs after being inspired by Jonathan Rochkind.

The incentive is easy to see: client diversity. Let a thousand interfaces bloom, if you can give them all access to the same underlying Solr instance. And, seriously, how many times are you going to write that regexp to semi-normalize ISBNs and ISSNs, huh? Enough already.

If you’re using a Solr nightly (and, really, you should be — faceting is so much faster than the official 1.3 release) you have access to regexp-based filters as well, which makes stuff like this really, really easy:

Here, we use the KeywordTokenizerFactory which, not so intuitively, produces a single token from the input. Then we lowercase it and pull off any leading and trailing spaces (Trim).

For those of you that don’t read regexp, we then match anything that looks like:

1. Any number of leading zeros
2. …followed by any number of digits, dashes, or periods and an optional ‘X’
3. …followed by…well, we don’t care. Anything else.

…and throw away all but the stuff in #2. Then take that and throw away all the dashes and dots, and you’re left with a string of numbers.

The beauty is that it happens both while the index is being made and during query time, so if your user types in “ 123-45-6-X ” it will be normalized to 123456x, and then checked against your index.
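To make the chain concrete outside of Solr, here is a rough Python rendering of the steps just described. The regular expression is reconstructed from the three-part description above, so treat it as an assumption and not as the exact filter configuration; the function name is made up.

```python
import re


def normalize_stdnum(raw):
    """Lowercase, trim, keep the number-ish middle part, then drop dashes and dots."""
    s = raw.lower().strip()
    # Leading zeros, then digits/dashes/periods with an optional x, then anything else;
    # keep only the middle group
    s = re.sub(r"^0*([\d.\-]+x?).*$", r"\1", s)
    # Throw away all the dashes and dots
    return re.sub(r"[.\-]", "", s)


print(normalize_stdnum(" 123-45-6-X "))  # prints 123456x
```

Input that doesn't look like a number at all passes through the regexp untouched, which seems like the least surprising behavior for a sketch like this.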

This is simple stuff, and probably doesn’t deserve the virtual ink I’m providing for it, but Vufind out of the box doesn’t do any of this sort of thing (likely because “the box” existed before it was super-easy to do this), and we all should be doing it.

Enough with the freakin’ LC Call Number normalization!

Bill Dueber — Wed, 18 Mar 2009 00:00:00 +0000

OK. I’m done with it, and this time I mean it.

I’ve updated and improved the lc normalization code, documented the algorithm, and put it all into Google Code. In the next couple weeks, I’ll be turning it into a Solr text filter so we can do some decent sorting on call-number search results.

Ask, and you shall receive, and it shall be AWESOME!

Bill Dueber — Thu, 12 Feb 2009 00:00:00 +0000

The good folks at ticTocs heard the call for open data, and they responded…exactly as I asked them to. Which makes me think I should have asked for a pony, too, but I’m still very, very happy!

Anyone can now download a simple tab-delimited text file describing all the journal table of contents RSS files they’ve assembled, for use however anyone wants.

The data include issns and eissns (where available), the title of the journal, and of course the URL of the RSS/Atom/Whatever feed.

The feeds themselves are all over the map — it’s whatever the publisher decides to provide, which might include abstract/volume/number/doi, or might just be the title of the article. But regardless, they represent data that are useful to our patrons and are now available in a format that’s easy to exploit.

So…go to it?

TicTocs: Give us a file! Pretty pretty pretty please!

Bill Dueber — Mon, 02 Feb 2009 00:00:00 +0000

For those who haven’t heard, ticTOCs is a service that provides web-based access to a database of Journal RSS/Atom Table of Contents feeds. Awesome.

In their blog at News from TicTocs, a post titled I want to be completely honest with you about ticTOCs notes that:

As for the API – yes, we’ve been asked this several times, and the answer is that it is currently being written and should be available very soon.

That’s great, but writing in a comment on that post (after logging in with a very, very old OpenID — I used to have a blog named Opachyderm, a name which I thought was insufferably clever), I noted that we don’t need an API right away.

What we need is a text file.

Simple. Tab-delimited. TicTocID,Title,URL,issn,eissn. Update it every night.

That’s all we need.

We can do the rest. Put it in the OPAC. Stick it on our SFX pages. Not screw around with Javascript/AJAX calls when the data we need are (relatively) static and (absolutely) simple.
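To show how little "the rest" needs to be, here is a minimal Python sketch that turns such a file into an ISSN lookup. The column order is assumed from the wish list above (a real file may differ), and the names are made up for the example.

```python
import csv
import io

# Column order assumed from the wish list above; a real file may differ.
COLUMNS = ["tictoc_id", "title", "url", "issn", "eissn"]


def feeds_by_issn(tsv_text):
    """Map each ISSN and eISSN to the row describing its journal's feed."""
    lookup = {}
    reader = csv.DictReader(
        io.StringIO(tsv_text),
        fieldnames=COLUMNS,
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # titles may contain quote characters
    )
    for row in reader:
        for key in (row["issn"], row["eissn"]):
            if key:  # skip blank or missing identifiers
                lookup[key] = row
    return lookup


sample = "1\tJournal of Examples\thttp://example.org/feed.rss\t1234-5678\t\n"
feeds = feeds_by_issn(sample)
print(feeds["1234-5678"]["url"])
```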

Someone needed to put a web interface on those data, and the one provided at ticTocs is really nice. I’m glad it’s there.

And I can’t tell you how much I applaud the JISC for starting this project and getting vendors on board. That’s always the hard part — participation and standardization. They’re doing it, and I couldn’t be happier.

But these data are incredibly valuable, and their value is currently limited because they’re boxed up.

Spreading these data far and wide is good for scholarship, and I can’t imagine the case that could be made showing it’s better for JISC to keep them at a single endpoint.

The knee-jerk reaction is always, I know, to keep things behind a wall, even if it’s a short wall. “Things will get out of sync if people have their own copies.” Or, “We’ll provide whatever access you need, as fast as you need it, honest.” Or, “We’re going to be providing value-added services on top of the data.”

It’s all true. Things will get out-of-sync — but that’s going to happen whether you encourage people to not cache results or not. And I don’t doubt for a moment that the API provided will be great. And of course you’ll be in a position where you can provide value-added services.

But so can the rest of us.

I’ve run into this myself. I fear…well, let’s be honest. I fear providing a service, having the data stripmined, and then having no one appreciate the front-end I put on it. I do this job for the fame, not the fortune. Obviously.

But I’ll never provide services as fast as me plus three hundred other geeks, all responding to different situations and servicing different patrons.

So…provide an API. Start simple: a single call named getCurrentTextFile. Or maybe add getCurrentTextFileGzipped. It’s only ten-thousand lines of text, probably less than 75k gzipped up. I promise to call it every night about 3am local time so I’m up-to-date.

So….pretty please? With sugar on top? My catalog is waiting. So is my SFX install. And our list of ejournals. And our subject guides. And lots of pages on our website. And our pre-packaged OPML files to offer students and professors. And a thousand yet-to-be-devised services as well.

Pretty pretty pretty please???

Five rules to make your open source more open

Bill Dueber — Sun, 25 Jan 2009 00:00:00 +0000

[I’ve noticed that a sure way to get people to look at stuff (as measured by, say, digg) is to include a number. So I did. Five. ]

Over at Bibliographic Wilderness, Jonathan Rochkind has a great followup to an ongoing discussion on the Blacklight list called How to build shared open source in which he tackles some of the differences between open-sourcing your code (a legal and distribution issue) and actually making it so someone else can usefully contribute to your code.

The project I’m spending most of my time on right now, VUFind, is a great piece of functional code but, in many ways, a nightmare in terms of trying to contribute code and abstract out local functionality. This isn’t meant as a slam on the main contributor(s) to VUFind — Andrew, especially, seems to be an almost frighteningly-productive coder — but my experiences trying to customize the code to our local situation have given me a lot of time to think about how I wish things had been architected.

So, here I give some general rules and some specifics as to What I Wish I Had To Work With.

1. Abstraction

General rule: Abstract things out as much as makes sense

Specific rule: Abstract the living crap out of your authentication scheme.

Look, pretty much everyone with anything worth protecting already has an auth/authZ infrastructure in place. Sometimes an extensive, perhaps multi-institutional infrastructure. One that isn’t going to be bypassed without, say, getting fired.

So if you’re going to require people to log in, make sure you make that process as abstract as you possibly can, both in algorithm and in code. Have a singleton class that’s easily subclassed to represent your user, and call it exclusively. Make sure that your URIs are easily separated into those that require auth and those that don’t, for simple use of mod_rewrite or whatnot to redirect to authentication. Make sure it’s easy to hook into (or work around) AJAX links that might require authentication that has expired.

And for the love of god, don’t stuff username/password information into a cookie if you’re doing web work. Use a session and session key. Any auth scheme that I can spoof is no auth scheme at all, because I’m an idiot and not even trying hard.

2. Configuration files

General rule: use config files for anything local

Specific rule: Use a configuration file format that can represent complex data

That’s right, I’m looking at you, .ini and .properties files.

Use something like YAML, or XML, or even straight programming-language code (i.e., a file with a PHP hash or a perl hashref or whatnot) that can actually represent, in a logical way, the complexities of the stuff you need to configure. And then, again, have a singleton class that will read that data and expose it in a useful and safe way.

And include a semantics checker if you can manage to write one. It’ll save everyone a load of trouble.

Huge bonus points if your configuration singleton class can read from multiple files, overriding previous (default) definitions with subsequent (local) ones.
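The multiple-file trick is less work than it sounds. Here is a minimal Python sketch, assuming JSON files only because the parser ships with the language (the same merge works for YAML or anything else that loads into nested hashes); the function names are made up.

```python
import json


def deep_merge(base, override):
    """Return a new dict: override wins, nested dicts are merged key by key."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def load_config(*paths):
    """Read files in order; later (local) files override earlier (default) ones."""
    config = {}
    for path in paths:
        with open(path, encoding="utf-8") as handle:
            config = deep_merge(config, json.load(handle))
    return config


defaults = {"solr": {"host": "localhost", "port": 8983}, "auth": "none"}
local = {"solr": {"host": "search.example.edu"}}
print(deep_merge(defaults, local))
```

A local file then only has to mention the handful of settings that differ from the shipped defaults.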

3. Hide subapplications

General rule: Don’t force your user to intimately understand every piece of every library/application you include

Specific rule: Generate configuration information for sublibraries/applications

This might be a little specific to the project I’m working on now, which uses Solr as a backend, but I think it applies more generally.

If you’re using a non-brain-dead configuration file format, and if you can assume reasonable defaults, then generate configuration files for your user. A low-level extreme of this is the traditional unix autoconf, which essentially allows you to install software without knowing a damn thing about your own system. Which is useful to those of us that don’t.

In VUFind, there are three files — a .properties file that specifies how to map MARC data into field names, Solr’s schema.xml that describes the structure and behavior of those same fields, and an XSLT stylesheet that pretties the data as it comes out of Solr to make it easier to work with. As you might expect, the overlap in data is about 80% across the three of them, and it would be a bazillion times easier to have a single file that generated all three.

OK. Maybe not a bazillion, because if it was that easy, I’d have taken a couple hours to write the code to do it already. Let’s say just a zillion times easier.

The caveat to this is that you need to either make sure your config file specification is complete enough to encompass everything all the other files might need to know (bad), or that the other config files can import subsections that override your defaults (good).

4. Testing

General rule: practice test-first (TDD or BDD) development

Specific rule: write your code in such a way that it’s testable

Look, we all know we should spend the first three weeks writing eight thousand tests to describe every corner of the code. And we all have bosses that will ask, every morning about 10:30am, “So, what do you have that you can show me?”

Not everyone is going to be able to write tests first. That’s not right, it’s not smart, but it’s the way the world works. But at least put in the hooks so someone else can come along and write tests.

Writing tests is one of the easiest ways that a newbie can come along to a project and instantly contribute in a meaningful way. But if you’re constantly calling global variables, depending on live database connections and not providing a way to mock them up, or throwing fatal errors if every subsystem isn’t present no matter the context, then it’s going to be hard to write tests. So hard, in fact, that not only will you not do it, but neither will anyone else.
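The "hook" can be as small as passing the backend in as an argument and not reaching for a global. A Python sketch, with every class and function name invented for the example:

```python
class FakeIndex:
    """Stand-in for a live search backend (a hypothetical interface)."""

    def __init__(self, docs):
        self.docs = docs

    def find(self, record_id):
        return self.docs.get(record_id)


def record_title(index, record_id):
    """Look up a title through whatever index object the caller hands in."""
    doc = index.find(record_id)
    if doc is None:
        raise KeyError("no such record: %s" % record_id)
    return doc["title"]


# A test needs no live connection, just a fake with the same find() method
fake = FakeIndex({"1": {"title": "Capitalism, primitive and modern"}})
assert record_title(fake, "1") == "Capitalism, primitive and modern"
```

Production code passes the real connection; a newcomer writing tests passes the fake.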

5. Error handling

General rule: provide a sane, hierarchical set of error classes and hooks to catch them as necessary

Specific rule: THROW SOME GODDAMN ERRORS!!!!

Don’t be an idiot. Things will fail. In the absence of Design by Contract or somesuch, errors will happen. Throw them. Catch them. But at least throw them, instead of letting your code die six hundred lines later with a “Cannot cast null value to string” when you finally get around to trying to print something out.
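A minimal sketch of what such a hierarchy might look like, in Python and with every class name made up: one root class for the application, a branch per subsystem, and an error raised at the point where the problem is first noticed.

```python
class AppError(Exception):
    """Root of the application's error hierarchy (all names here are invented)."""


class ConfigError(AppError):
    """Something is wrong with the configuration."""


class RecordError(AppError):
    """Something is wrong with a bibliographic record."""


class MissingFieldError(RecordError):
    def __init__(self, record_id, field):
        super().__init__("record %s has no %s" % (record_id, field))
        self.record_id = record_id
        self.field = field


def display_title(record):
    # Fail here, where the problem is found, not later when something prints None
    title = record.get("title")
    if title is None:
        raise MissingFieldError(record.get("id", "?"), "title")
    return title


try:
    display_title({"id": "42"})
except RecordError as err:  # callers can catch as broadly or narrowly as they like
    print("skipping:", err)
```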

And then I finally shut the hell up

Bill Dueber — Mon, 08 Dec 2008 00:00:00 +0000

I had a great — great! I tell you — 30 second conversation with Ken Varnum (of RSS4Lib fame) that went something like this (much paraphrasing, obviously):

B: You’re gonna have to fix that interface. The standard header won’t work.
K: Well, no, we’re going to leave it as it is.
B: It’s not gonna work.
K: We’ve decided to make it all consistent.
B: OK, you can keep saying that, but I’m really, really smart and I say users are going to be confused.
K: We’ve done user testing. They weren’t confused. And here’s our plan to see if they are confused once we go live.

And then I finally shut the hell up. While I’m never crazy about being just plain wrong, it was so so SO refreshing to have someone say, “Well, actually, we’re making this decision based on data and not just pulling answers out of our pants like so many flying monkeys.”

Where, oh where in the library is the dedication to making actual data-based decisions? Besides Ken’s office, I mean?

Normalizing LoC Call Numbers for sorting

Bill Dueber — Thu, 13 Nov 2008 00:00:00 +0000

Updated: I missed a ‘?’ in the original code that pushed a single cutter into the second-cutter position. Fixed below.

Crap. Update 2: Initial letters can be three characters long. Regexp and output changed.

LoC Call numbers tend to be a mess, and I’ve been working this morning trying to normalize them for easy string comparison.

The perl function below takes a call number (with some level of sloppiness) and returns a string suitable for comparisons with other strings returned by the function. It outputs stuff like this:

 E                          E 0000.0000  0000  0000
 E 184 .A1 G78              E 0184.0000A 1000G 7800
 E184.A2 G78 1967           E 0184.0000A 2000G 7800 1967
 E184.A2 G78 1970           E 0184.0000A 2000G 7800 1970
 EA                         EA0000.0000  0000  0000
 EA 10                      EA0010.0000  0000  0000
 EA 10 1970                 EA0010.0000  0000  0000 1970
 EA10 B7                    EA0010.0000B 7000  0000
 EA 10.B7.G8                EA0010.0000B 7000G 8000
 EA10.5                     EA0010.5000  0000  0000

The code, in perl, follows:

 sub normalizeLC {
   my $lc = uc(shift);
   $lc =~ /^
           \s*
           ([A-Z]{1,3})  # alpha
           \s*
           (         # optional numbers
             \d+
             (?: \s*\.\s*\d+)?  # ...with optional decimal point
           )?
           \s*
           (?:               # optional cutter
             \.? \s*
             ([A-Z]+)      # cutter letter
             \s*
             (\d+)?        # cutter numbers
           )?
           (?:               # optional second cutter
             \.? \s*
             ([A-Z]+)      # cutter letter
             \s*
             (\d+)?        # cutter numbers
           )?
           \s*
           (.*?)            # everything else
           \s*$
         /x;
   my ($alpha, $num, $c1alpha, $c1num, $c2alpha, $c2num, $extra) = ($1, $2, $3, $4, $5, $6, $7);
   $c1num .= 0 x (4 - length($c1num)); # Pad out to four decimal places
   $c2num .= 0 x (4 - length($c2num)); # ditto
   $extra = ' ' . $extra if ($extra);
   return sprintf("%-3s%09.4f%-2s%4s%-2s%4s%s", $alpha, $num, $c1alpha, $c1num, $c2alpha, $c2num, $extra);
 }
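For those who don't speak perl, here is a Python port of the function as written, including its three-character field for the initial letters. Two things are my own assumptions and not in the perl: it returns None when the input doesn't look like a call number at all, and it squeezes whitespace out of the class number before converting it.

```python
import re

LC_PATTERN = re.compile(r"""
    ^\s*
    ([A-Z]{1,3})                   # initial letters
    \s*
    (\d+(?:\s*\.\s*\d+)?)?         # optional class number, optional decimal part
    \s*
    (?:\.?\s*([A-Z]+)\s*(\d+)?)?   # optional first cutter: letters, then digits
    (?:\.?\s*([A-Z]+)\s*(\d+)?)?   # optional second cutter
    \s*
    (.*?)                          # everything else (year, volume, ...)
    \s*$
""", re.VERBOSE)


def normalize_lc(call_number):
    """Return a string that sorts the way the call number should shelve."""
    match = LC_PATTERN.match(call_number.upper())
    if match is None:
        return None  # assumption: the perl version has no explicit failure case
    alpha, num, c1alpha, c1num, c2alpha, c2num, extra = match.groups(default="")
    number = float(re.sub(r"\s+", "", num)) if num else 0.0
    c1num = c1num.ljust(4, "0")  # pad cutter digits out to four places
    c2num = c2num.ljust(4, "0")
    extra = " " + extra if extra else ""
    return "%-3s%09.4f%-2s%4s%-2s%4s%s" % (
        alpha, number, c1alpha, c1num, c2alpha, c2num, extra)


calls = ["E184.A2 G78 1970", "EA10.5", "E 184 .A1 G78", "E184.A2 G78 1967"]
for call in sorted(calls, key=normalize_lc):
    print(call)
```

Sorting on the normalized string is the whole point: plain string comparison then puts E184.A2 G78 1967 ahead of E184.A2 G78 1970, and both ahead of anything under EA.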

How to rig an election

Bill Dueber — Mon, 03 Nov 2008 00:00:00 +0000

No matter where I’ve gone today and for the past few days, I keep running into people (on both sides) who are sure that if Their Guy Doesn’t Win, it’s going to be because of dirty tactics.

I’m not an expert in this stuff. Not by a long shot. But I thought it would be fun to work out, for my own benefit, types of election fraud and what to really worry about.

Note that how you might interpret all of this really depends on what you consider the greater evil: a voteÂ cast that shouldn’t have been, or a vote suppressed that shouldn’t have been. I lean toward the latter.

[More specific disclaimer: I’m a bed-wetting liberal.]

In each case I’ll define what I’m talking about, what class it goes into, how hard it is to do once, and ratio of people-in-the-conspiracy to votes affected.

As examples, voter non-registration (telling someone you’re registering them and not doing it) is easy to do at all (Difficulty: Easy) and one person can screw over a few tens of others, depending on how dedicated you are (Ratio: medium). Re-programming a voting machine is very hard, but has the potential to mess with hundreds and hundreds of votes (hence the high ratio).

Obviously, all this below is (a) pulled out of my ass, and (b) depends on the size of the electorate. A local race where only 300 people will be voting can be turned by any method at all. I’m looking mostly at national races, where lots of people vote so a few fraudulent votes aren’t likely to be problematic.

Voter Non-Registration

Definition: Tell people you’re registering them to vote, but don’t
Class: voter suppression
Difficulty: easy
Ratio: medium
Effect: small-medium
Notes: To do this up right, you need lots of people out there registering folks and then throwing the registrations away or a centralized system where one or two people can collect the data and then throw it away. The former involves a pretty big conspiracy; the latter leaves a lot of people who could testify that they turned in registrations that disappeared. Most voters are registered and stay that way. Voter non-registration as a suppression tactic is most useful against those that traditionally don’t vote, or those that have never voted or recently moved (e.g., college kids). Would seem to favor Republicans.

Fraudulent registration of voters

Definition: Register people who don’t exist, are dead, or aren’t planning on voting themselves.
Class: voting fraud
Difficulty: ridiculously easy
Ratio: medium-ish
Effect: zero — nothing happens until a fraudulent vote is entered
Notes: This is what the ACORN dustup is about. Not speaking one way or the other about ACORN, organizations that register voters have a tough time in that (a) they’re required by law to pass along all registrations (see above) even if they know they’re fakes, and (b) organizations willing to pay people to register voters tend to be most interested in finding individuals traditionally underserved in that area — the homeless, the poor, the illiterate, etc. That means that the data you’re going to get tends to be less than great. Note the zero effect: nothing to screw with the election happens until someone actually casts a fraudulent vote. Which leads us to…

Fraudulent voting

Definition: Casting a vote you shouldn’t be allowed to cast, usually while pretending to be someone else.
Class: voting fraud
Difficulty: pretty hard (note: requires voter registration fraud)
Ratio: very small
Effect: pretty small
Notes: While some areas of the country are famous for voting fraud (I’m looking at you, Chicago), actually walking into a place and voting as someone else takes some serious balls. And with long lines expected this year, any one person, no matter how dedicated, isn’t going to be able to vote that often. A different version of this is filling out people’s absentee voting slips “for them” and has been around forever; this is a tactic I tend to associate with “machine” politics that tend to favor Democrats.

After Hour Ballot stuffing

Definition: Placing votes “after hours” as done by an individual with no (human or technical) oversight, or by a conspiracy of people who are supposed to be overlooking each other.
Class: voting fraud
Difficulty: medium
Ratio: large
Effect: medium-large
Notes: Again, first this requires some sort of voter registration fraud if you’re going to do it in any serious numbers. Then you need ridiculously lax oversight of the balloting / counting process, which is not that hard to find, unfortunately.

“Losing” votes

Definition: Have ballot boxes take a detour to your basement or the dump
Class: voting fraud
Difficulty: pretty hard
Ratio: very large
Effect: very large
Notes: Ah, a classic. As voters, we tend to group geographically — Ann Arbor, for example, is almost devoid of Republicans. So, you let everyone vote, and then “disappear” the ballot boxes from Ann Arbor, while doing your best to make sure the Democrats don’t do the same thing in predominantly Republican areas. This is only slightly easier than after hour ballot stuffing, but still hard. Payoff is huge, though.

Voter misdirection

Definition: Give people bad information about when/how to vote
Class: voter suppression
Difficulty: easy
Ratio: large
Effect: Depends on how good you are, doesn’t it?
Notes: We’re finally heading into the gray areas — things that, depending on how you do them, likely aren’t actually illegal. That makes them easier to do, because you don’t have to worry about members of your conspiracy squealing. We’re seeing a lot of this already this election, most notably in robo-calls telling people that they should vote on Wednesday, or that their polling place has changed, or whatnot. Mostly anonymous, very difficult to trace, and can be pretty effective if your database of friendly/unfriendly voters is good.

Voter de-registration

Definition: Sue to get whole classes of people removed from the rolls
Class: voter suppression
Difficulty: pretty hard
Ratio: Gigundous
Effect: Very large
Notes: This has been all over the news, and for good reason. Anything that causes someone to have to cast a provisional ballot makes it a pain in the ass for that ballot to get counted. For lots of folks (esp. “working class” people who punch a clock), taking a couple hours off to go down to the courthouse and prove you’re who you say you are is a non-starter. This is why everyone wants their people to vote early if they can — it avoids anything that might screw with the voting process, like this or challenges.

Profiled voter challenges

Definition: Challenge the votes of people that don’t look like you
Class: voter suppression
Difficulty: pretty easy
Ratio: large
Effect: depends on how good the poll workers are and the state laws
Notes: This is easy. You post someone at a polling station, and challenge anyone who looks like “the other guys” (because Black and Hispanic voters tend to go Democrat; identifying Republicans on sight might be harder). Some states make this hard; others allow anyone at all to challenge anyone else and force them to use a provisional ballot.

Break voting

Definition: Make voting so difficult or slow that people give up and go home
Class: voter suppression
Difficulty: varies
Ratio: large
Effect: large
Notes: If I’m at a place with three mechanical voting machines and I stick a bunch of gum in one of them, rendering it useless, I’ve just made it a hell of a lot harder for people to vote. Less extreme examples would be challenging everyone who walks through the door, or having a poll worker who takes for freakin’ ever (on purpose, I mean; nothing in general against our dedicated poll workers– whose average age is 72, I heard).

Cause bad weather

I’m not sure how to go about this, but the little men in my head say flooding is a great way to keep people from the polls.

Wanted: a better proxy server

Bill Dueber — Thu, 02 Oct 2008 00:00:00 +0000

We in the library world have a problem. We spend a zillion-with-a-Z dollars subscribing to online databases, purchases which presume our ability to make sure only authorized people can look at them. The alternative is to be in breach of contract law, which I’ve been assured is something we’d like to avoid.

The problem I see is this: The limitations of our proxy server software restrict how we can write contracts with our vendors.

The standard approach is to define two types of access:

By IP address. The person is sitting in front of the right computer (or has hooked up to the right wireless network) and is assumed to be “OK” based on either the location of the computer (e.g., in the library building) or through the nature of the auth/authZ built into the computer’s login procedure. We tell our vendors, “Hey,” (all vendor-library conversations start with ‘Hey’) “here’s a list of IP addresses that you should allow and associate with us.”
By authenticating with a central mechanism and then sending everything through a rewriting proxy server, thus allowing us to tell the vendor, “Hey. Anything coming through our proxy server is OK. Honest.”

The venerable EZProxy (now owned by OCLC) has been the solution of choice for libraries for a long time. It does what it does very well.

But I want more. Much more. More more more.

The current model assumes there’s exactly one question: Is this person authorized as a UM-Ann Arbor user?

But that’s a pretty crude question. Suppose the Business or Law school wants to buy access to stuff for only their students (news flash: they already do)? Or we want to subscribe to a journal but, because it’s so esoteric, restrict access to a couple departments to save money. Or recognize when an Ann Arbor faculty member is sitting at a public computer on a different campus but still allow her to get full rights as an Ann Arbor faculty member instead of appearing to be Joe-Random-Dearborn student, a group which has significantly less access to online journals.

Why can’t people with roles on multiple campuses get the best of all worlds, getting the least restrictive access possible to a given title based on all their student/staff/faculty affiliations?

Why can’t we negotiate access to given titles (or even articles???) in lieu of course packets (or online reserves), restricting access to only those enrolled in the class?
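To make the wish concrete, here is a minimal sketch of the question I'd like a proxy to be able to ask: "do any of this person's affiliations get them in?" Everything in it is an assumption for illustration: the affiliation strings, the resource names, and the `LICENSES` table are made up, and no existing proxy server or directory API is being described.

```ruby
require 'set'

# Hypothetical license terms: each resource lists the affiliations it admits.
LICENSES = {
  'business-db'      => Set['aa:business-student', 'aa:business-faculty'],
  'esoteric-journal' => Set['aa:classics-faculty', 'aa:classics-student'],
  'general-journal'  => Set['aa:faculty', 'aa:student', 'aa:staff']
}

# A user is allowed in if ANY of their affiliations appears in the license,
# no matter which campus or computer they happen to be sitting at.
def allowed?(affiliations, resource)
  permitted = LICENSES.fetch(resource, Set.new)
  affiliations.any? { |aff| permitted.include?(aff) }
end

# An Ann Arbor faculty member at a public computer on another campus.
prof = ['dearborn:public-terminal', 'aa:faculty']
puts allowed?(prof, 'general-journal')  # least restrictive affiliation wins
puts allowed?(prof, 'business-db')      # not covered by any of her roles
```

The point of the sketch is only that the decision is driven by the person's full set of roles rather than by an IP address.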

Here at UMich, we’re just starting to get an Enterprise Directory online where we’ll actually be able to ask some of these questions. But until we get a proxy server that’s smart enough to do something with all the information, it’ll just sit there and taunt me.

This isn’t an idle question. We already have databases that the Business School subscribes to alone that can only be accessed when you’re physically in the B-School at one of the approved-IP-address computers. That’s freakin’ ridiculous.

Of course, this all presumes that all-or-nothing contracts aren’t the best way to go, but shouldn’t we at least have the option?

Planet Code4Lib in a snapshot

Bill Dueber — Mon, 07 Jul 2008 00:00:00 +0000

Inspired by the Inquiring Librarian, I just used Wordle to create a “tagcloud” of the current Planet Code4Lib feed.

What kills me is the tiny little “Library” in the lower left-hand corner.

[Image: Wordle: Planet Code4lib, http://wordle.net/gallery/wrdl/55861/Planet_Code4lib]

Intuition-based librarianship?

Bill Dueber — Wed, 02 Jul 2008 00:00:00 +0000

Not long after I started working in the library, I heard someone talking about “Evidence Based Librarianship.” Like the good little kind-of-a-librarian I’d become, I looked it up and found this article which states that:

EBL employs the best available evidence based upon library science research to arrive at sound decisions about solving practical problems in librarianship.

My immediate response was, of course, What the $#!&% is everyone else doing?

The sad truth, of course, is that in general folks working in libraries do not use the “best evidence” based on “library science research” because, like many of the practitioners I met when I was in the education world, they (a) don’t know most of the research and data, and (b) are convinced that their users are so magical, so special, so utterly unique, that there’s no point in looking to the research and that they’re better off just going with their guts.

That’s an over-simplification, of course. But I have found, across a bunch of situations, that practicing librarians tend to think:

their time is much better spent directly helping patrons than reading about research regarding how to help patrons,
“data” (defined incredibly loosely) derived from reference desk interviews are sufficient to make decisions
“I know my patrons better than anyone”

The logical conclusions of this are that:

Most library research is essentially being thrown down a dark hole because the people that could most benefit from it don’t read it
We’re assuming that the 99.999% of users who never talk to a librarian (many of whom, in fact, never enter the library building) have the exact same needs and perspective as those who engage in reference interviews
Librarians, as a group, confuse casual and/or episodic interaction with self-selected patrons with actual social-science research.

And the over-simplified solutions:

Make reading a job requirement, for real! Make librarians responsible for keeping up with the literature: “responsible” in a “prove to your direct manager that you spent two hours reading and writing this week” kind of way.

Librarians as a group, I think, want to use the research. But not so much that they’re willing to let Curmudgeonly Old Faculty Member #2 hang tight for a few hours while they brush up.

Use the data you already have! Your systems — your ILS, any reference desk software, your proxy server, your web server — all collect data. Warehouse the data. Mine the data. Provide both colorful graphical interfaces and ugly powerful analysis functions for the data. Figure out how to do something with the freakin’ data!

Most (all?) libraries have gobs of data that are pulled out once a year for ACRL statistics. And even if they’re looked at by someone, they’re certainly not easily available to everyone.

Push access to the data and associated visualization tools as far down the stack as you can. At least people will know what kinds of questions can be answered.

Don’t pretend to do research — do real research! Do real social science research — something that certainly doesn’t have a front-seat in library schools as near as I can tell. Find some MS students in Sociology or Anthropology who are looking for a project and ask them to find something out, with real honest-to-god case study methodology, text analysis, data analysis — the whole nine yards. Better yet — hire someone to do it, and for god’s sake don’t put down that they must have an MLS.

Times are tight all around, of course — no one has enough time, enough money, enough resources. But that’s exactly why now is the time to focus on existing research (it’s free — someone already did it) and data (it’s free — your systems are already collecting it) — to find out what’s being used, what’s being ignored, how to market your under-utilized resources and which populations need some outreach.

Going with your gut might seem to work, but maybe that’s only because you’re not actually using any solid criteria to evaluate what you’re doing now.

The friend of my enemy’s friend’s enemy’s…err…

Bill Dueber — Thu, 15 May 2008 00:00:00 +0000

Move over, Axis of Evil! Our 43rd president, George W. Bush (and you gotta know that his dad hangs on to that ‘H.’ with two white-knuckled hands) is now in search of “the surest way to defeat the enemies of hatred.” Of course, we’re the best of friends with hatred here at Robot Librarian, so we should be safe.

Google Doctype — open documentation, open code

Bill Dueber — Thu, 15 May 2008 00:00:00 +0000

Because you can never have too many open encyclopedia-type-thingies, Google has launched Google Doctype, a “Google-sponsored open encyclopedia and reference library for developers of web applications. By web developers, for web developers.” It’s set up to use an open license (Creative Commons Attribution 3.0 Unported License) and, unlike other similar resources, is explicitly set up to include code for testing and browser-compatibility tables generated by running that code against different browsers. Simple, direct… what’s not to like?

JSON, JSON everywhere

Bill Dueber — Tue, 13 May 2008 00:00:00 +0000

Via Ajaxian, just saw an announcement for Persevere, a network-centric, JSON-based generic storage engine. It features:

A REST-based interface over regular old HTTP
JSON as the native data going in and out, including circular references and such
Search interface based around JSONPath
RPC interface based on JSON-RPC
Seemingly buzzword compliant across the board

I’ve been thinking about these sorts of servers a lot lately (CouchDB and StrokeDB are two others) in the context of the “not-the-catalog” data we track here at the library.

For some stuff, clearly we need the power and speed of a real database. That power and speed isn’t free, though — you have to set up the tables, map relationships, build an interface on top of it, etc. While it’s not rocket science by any stretch of the imagination, it’s a lot of screwing around and involves a few levels of security and has a friendly red sign on the door that reads “Programmers only, please.”

For other data, though, a structured or semi-structured data store based on a plain text format like JSON would be great. Since everything is a URL, we can handle security at the HTTP-auth/authz level. Library hours, lists of databases we subscribe to, staff directory data — these are data that could, if we wanted, be moved into a generic store like this.
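As a rough sketch of what "library hours in a generic JSON store" might look like, here is a round trip using nothing but Ruby's standard `json` library. The record layout and field names are assumptions made up for illustration, and the HTTP part is left out entirely; nothing here uses Persevere's (or any other server's) actual API.

```ruby
require 'json'

# Hypothetical library-hours record; the field names are invented.
hours = {
  'library' => 'Example Library',
  'hours'   => { 'mon' => '8am-2am', 'sat' => '10am-6pm' }
}

doc  = JSON.generate(hours)  # the text we'd send to a URL in the store
back = JSON.parse(doc)       # what any client would get when it asks for it

puts doc
puts back['hours']['sat']
```

No tables, no relationship mapping, no interface layer: the document is the data, and anything that can speak HTTP and parse JSON can use it.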

The exciting stuff comes when you stop thinking about traditional database applications and think more in terms of having a data storage endpoint that pretty much anyone with a modicum of knowledge and authorization could throw stuff into. Want to build your own tagging system? Your own “My Shelf?” How about a comment form that straddles the edge between “email me the results” and “ask someone to hook me up to a database”? Or a javascript library that automatically takes survey submissions and sticks them into a system like this?

This is the flip-side of my last post. We’re not talking about hard-core, multiply-linked, core-business metadata. For that, we need ridiculously smart people figuring out how to best leverage the, say, 8 million MARC records we’ve got lying around. But for other stuff…this seems really, really cool.

Psst. We’re not printing cards anymore

Bill Dueber — Mon, 12 May 2008 00:00:00 +0000

[From a series I’m calling, “Things About The Library I Think Are Stoooopid”, part one of about a zillion.]

I’m going to wallow in a little bit of hyperbole here, but only a little.

The problem

Suppose, just for a moment, that you’re a computer programmer working anytime in the last twenty years, and someone wants you to set up a data structure to deal with a timeless issue — how to keep track of who’s on which committees in a library.

If you’re a computer person

Easy enough. First off, what’s a committee?

Committee

Committee name (string)
Committee inception date (date)
Chair (person)
Members (set of people)

How about a person?

Person

Last name (string)
First name (string)
Email address (email)

Okeedokee. That looks ok so far, but we’ve got problems.

First off, everyone knows that committee names change. And everyone also knows that last names can change, preferred first names can change, email addresses change, etc. We need some sort of unique identifier to represent the abstract ideal of a particular committee or a specific individual. Let’s be lazy and just throw in an integer ID that we’ll be careful not to reuse, ever, for any reason.

So, we’ll throw that in, and make sure our references are to these unique IDs, not names or whatnot.

That gives us this.

Committee

cID (unique integer)
Committee name (string)
Committee inception date (date)
Chair (pID)
Members (set of pIDs)

How about a person?

Person

pID (unique integer)
Last name (string)
First name (string)
Email address (email)

And the mapping, of course.

Committee-Person Mapping

pID (unique integer pointing into the Person table)
cID (unique integer pointing into the Committee table)
dateTermStarted (date)
dateTermEnds (date)

If this seems simple, well, it is. Like I said, the theory is almost forty years old, and common implementations of databases at least twenty. We have well-defined unique keys, special types for dates and email addresses so we can do some sanity checking and order things and so forth, and a very, very simple mapping of people to committees where we keep track of start and end dates just to be complete.

Most importantly, you know what’s not here? There’s nothing about how to print it out, or what format I’m going to store it in. Those are afterthoughts. They don’t matter. Any well-specified data model can be machine-translated into pretty much anything you need.
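For the curious, the model above can be sketched in a few lines of Ruby. This is a minimal illustration and not a real schema: the struct names, the sample people and committees, and the `committees_for` helper are all hypothetical, and the "Members" set is represented only through the mapping table.

```ruby
require 'date'

# One struct per table; references are always IDs, never names.
Person     = Struct.new(:pid, :last_name, :first_name, :email)
Committee  = Struct.new(:cid, :name, :inception_date, :chair_pid)
Membership = Struct.new(:pid, :cid, :term_started, :term_ends)

people = [
  Person.new(1, 'Smith', 'John', 'jsmith@example.edu'),
  Person.new(2, 'Smith', 'John', 'john.smith@example.edu') # a different John Smith
]
committees = [
  Committee.new(10, 'Web Committee',  Date.new(2005, 9, 1),  1),
  Committee.new(11, 'Space Planning', Date.new(2007, 1, 15), 2)
]
memberships = [
  Membership.new(1, 10, Date.new(2007, 9, 1), Date.new(2008, 8, 31)),
  Membership.new(1, 11, Date.new(2007, 9, 1), Date.new(2008, 8, 31)),
  Membership.new(2, 11, Date.new(2008, 1, 1), Date.new(2008, 12, 31))
]

# Which committees does a given person belong to? Just follow the IDs.
def committees_for(pid, committees, memberships)
  cids = memberships.select { |m| m.pid == pid }.map(&:cid)
  committees.select { |c| cids.include?(c.cid) }.map(&:name)
end

before = committees_for(1, committees, memberships)

# A name change touches exactly one record and breaks nothing.
people[0].last_name = 'Jones'
after = committees_for(1, committees, memberships)

puts before.inspect
puts after.inspect
```

Note that nothing in there says how a committee roster gets printed; that can be generated from the structs whenever someone needs it.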

If you’re writing a library spec

As near as I can tell, the “library” way to write this would be as follows:

Committee

[Let “hus” stand for “hopefully unique string created by ridiculously complex algorithm”]

Committee name (hus)
Committee inception (string masquerading as a date in any of several formats)
Chair (hus)
Members
- person1 (hus) $$b email address (string) $$c start date (date-like string) $$d end date (date-like string)
- person2 (hus) $$b email address (string) $$c start date (date-like string) $$d end date (date-like string)

Ummmmm…strings. Nothing but strings. Short strings, long strings, fat strings, tall strings. Strings with dollar signs. Strings that look like dates. Strings that contain other strings. And, just for luck, a little bit of hierarchy, where “hierarchy” means “two levels.”

If someone’s name changes, well, good luck trying to find all the occurrences and fixing them all (and making sure you don’t get the wrong John Smith). Good luck parsing out all the dates, which rely not on machine syntax checking but on a whole set of data-enterers trying to follow some sort of rule without making any mistakes. And good, good luck getting a list of which committees a specific person belongs to.
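Here is a small sketch of what consuming that kind of record looks like in practice. The two sample entries and the `parse_entry` helper are made up for illustration, using the `$$`-delimited layout from the mock spec above; the second entry shows the sort of date a human data-enterer might reasonably type.

```ruby
require 'date'

# Hypothetical member entries in the "$$"-delimited style sketched above.
entries = [
  'Smith, John $$b jsmith@example.edu $$c 2007-09-01 $$d 2008-08-31',
  'Smith, John $$b jsmith@example.edu $$c Sept. 2007 $$d [2008?]'
]

def parse_entry(entry)
  # Everything before the first "$$" is the name; the rest are subfields
  # whose first character is the subfield code.
  name, *subfields = entry.split(/\s*\$\$/)
  fields = subfields.to_h { |sf| [sf[0], sf[1..].strip] }

  start = begin
    Date.iso8601(fields['c'].to_s)
  rescue ArgumentError
    nil # not machine-readable; a human has to look at it
  end

  { name: name.strip, email: fields['b'], start: start }
end

parsed = entries.map { |e| parse_entry(e) }
parsed.each { |p| puts p.inspect }
```

Both entries parse into the same name and email, so there is no way to tell whether they describe one John Smith or two, and the second start date simply comes back empty.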

Why I bring it up

One of the most eye-opening talks I heard at code4lib 2008 was a keynote by Karen Coyle on RDA and its ongoing specification. You can view the slides or watch the presentation if you’d like.

In it, she makes the point that, when push comes to shove, AACR2 and RDA both ended up being tremendously focused on producing text strings.

Whaaaaa??

Was there no one on the RDA committee who had experience with anything even approaching modern data theory?

Of course there was. But the giant weight of history is crushing library data modeling like a skinless grape under a dump truck.

Look, I understand that this is not a simple data modeling problem. I understand that there’s a whole set of issues, including a demand (a specious one, I think) that the cataloged data accurately reflect the actual text in a real, physical object that’s sitting in front of you. I’m not so naive as to think this is an easy task.

But anyone who, in the 21st century, approaches the large-scale creation of data without first and foremost worrying about machine-parsability, consistent data types with machine-checkable syntax (and even some semantics) and one-to-one mappings between unique objects (an author, an editor, a publishing house, a work) and something that uniquely identifies that object in any reification is….well, I don’t know what they’re smoking.

We’re not printing cards anymore, people.

If something is only understandable if a human is reading it, it’s not understandable by any modern definition.
Punctuation doesn’t belong in the description of an object. Ever. Punctuation is a rendering issue. If you’re using punctuation, or well-formed strings, instead of descriptive attributes, you’re doing it wrong.
Just because you know your data doesn’t mean you know how to model it. Get outside help from the smartest people you can find.

Whew! That felt good!

OK. Rant off.

UPenn library has video “commercials”

Bill Dueber — Wed, 07 May 2008 00:00:00 +0000

The University of Pennsylvania Library has a set of video commercials touting their products — some of which are musicals! Worth a look-see.

Robot Librarian

Reintroducing Traject: Traject 2.0

How does it work?

Questions about traject

A 2.0 release?

So…give it a whirl!

How good/bad is MARC data? The case of place-of-publication

Focus on validity

Results: pretty good!

And now, the complaints

Ruby MARC serialization/deserialization revisited

File sizes

Serialization / Deserialization time

MRI Ruby

JRuby

Conclusions

“Schemaless” solr with dynamicField and copyField

Indexed XOR Stored?

Part 1: Dynamic Fields

Part 2: Copy Fields

Part 3: Copy Field with globs

Part 4: Putting it all together

Why this is probably a bad idea.

Help me test yet another LC Callnumber parser

New blog front- and back-end

Announcing “traject” indexing software

What’s it look like?

Why use (or move to) traject?

What does it have out of the box?

How do I get a taste?

Come work at the University of Michigan

Please: don’t return your books

Starting data

Finding the number of pages in a book

Bringing it all together

So…what’s the damage?

1.22 miles???

What is this good for again?

Next steps

Boosting on Exactish (anchored) phrase matching in Solr: (SST #4)

Exactish matching vs phrase matching

Our goals

Follow along at home

Step 1: get a decent text type

Step 2: Set up parallel text types that anchor phrase matches to one or both ends

Try it out!

To sum up

Requiring/Preferring searches that don’t span multiple values (SST #3)

Solr and multiValued fields

Following along at home?

The relevance ranking seems…wrong

Phrase slop

Enter positionIncrementGap

But I’m already using the pf parameter!

Query slop

Package it up

Let’s try it out!

Where it breaks down

What have we learned?

Using localparams in Solr (or, how to boost records that contain all terms) (SST #2)

What the heck is a localparams query?

Solution: Build a query of queries

An example: boost records that contain all terms

Try along at home

Special Stupid Solr Trick: Make a special query handler for a complex query

Try along at home

Solr Field Type for numeric(ish) IDs (SST #1)

What we’re shooting for

The numericID field, suitable for ISBN/ISSN/OCLC/etc.

Things we’ll be learning about today

Step 1: “Tokenize” to a single token

Step 2: Find the first thing that looks like an ID and mark it

Step 3: If we didn’t find a match, throw it all away

Step 4: Ditch the ‘***’ used to mark a candidate ID

Step 5: Lowercase it

Step 6: Get rid of everything that’s not a number or an ‘x’

Step 7: Make sure what we have is a reasonable length

Step 8: Remove leading 0s
