Uncategorized – Robot Librarian

Reintroducing Traject: Traject 2.0

Bill Dueber — Thu, 19 Feb 2015 00:00:00 +0000

Traject 2.0.0 released! Now runs under MRI/RBX!

traject is an ETL (extract/transform/load) system written in ruby with a special
view towards extracting fields from MARC data and writing it out into Solr. Jonathan Rochkind (http://bibwild.wordpress.com) and I wrote this primarily out of
frustration using other tools in this space (e.g., Solrmarc, or my own precursor to traject, marc2solr).

Note: Catmandu is another, perl-based system I don’t have any direct experience with.

traject had its first release almost a year and a half ago (at least based on the date of my
post introducing it), and I’ve used it literally every day since then indexing data for the
University of Michigan and HathiTrust library catalogs.

How does it work?

traject is packaged as a gem, and ships with a command-line program (traject) that reads in configuration files or switches and the name of the file to operate on and transforms the incoming records as specified.

The "configuration" is actually just ruby code, with some macros included to make it simple to do the common operations (e.g., get the ISBN) and possible to do …well, anything you can do with ruby.

require 'traject/macros/marc21_semantics'
extend  Traject::Macros::Marc21Semantics
require 'library_stdnums' # just a regular ruby gem

to_field "id", extract_marc("001", :first => true)
to_field 'marcxml_record', serialized_marc(:format=>:xml)
to_field "allfields", extract_all_marc_values(:from=>'100', :to=>'999')

to_field 'oclc', oclcnum('035a:035z')
to_field 'isbn', extract_marc('020a') do |rec, acc|
  acc.map!{|x| StdNum::ISBN.normalize(x)}
end

to_field 'title', extract_marc_filing_version('245abdefghknp', :include_original => true)
to_field 'vtitle', extract_marc('245abdefghknp', :alternate_script=>:only, :trim_punctuation => true, :first=>true)
to_field "publisher", extract_marc('260b:264|*1|:533c')
to_field "edition", extract_marc('250a')

Questions about `traject`

How do I get started?
The best way is likely to look at the heavily-documented sample project we provide, followed by checking out the traject documentation. And, of course, just ask me for help.

Do you need to use JRuby?
Not anymore. As of version 2.0.0, traject runs under MRI ("regular") ruby, although without
all the speed-enhancing true threading that JRuby offers.

How fast is it?
An apples-to-apples comparison is difficult. The stock Blacklight indexing scheme is reportedly about as fast as Solrmarc when just using single-threaded MRI (JRuby would presumably speed things up). I run a hideously complex indexing scheme using JRuby and a few threads and can average over 900 records/second during a longish run (e.g., I can index all
eleven-million bib records before lunch). For me, it’s fast enough.

What kinds of data can I throw at it?
In theory, anything you want — it’s pretty easy to write a Traject reader. Out of the box or
with existing gems, though, we support a few kinds of MARC:

MARC (binary), via either ruby-marc or marc4j (the latter requiring JRuby)
MARC-XML (again, via either ruby-marc or marc4j)
marc-in-json in the form of newline-delimited json (a text file with one MARC-in-JSON record per line)
Alephsequential, a human-readable serialization put out by the Aleph ILS from Ex Libris.
Direct import from the Horizon ILS

How does it make it easier to deal with MARC?
MARC is the bread-and-butter of what traject is currently used for. traject ships with macros for reading through MARC records and transforming the often-weird data within them. Some of these can:

extract data from fields based on tag, indicators, and subfield values
trim punctuation from extracted data
translate MARC codes into human-readable languages, countries, etc.
correctly deal with "filing characters" (e.g., leading articles like "a" or "the")
find field data repeated in other languages ("vernacular" data, usually in 880 fields)
find OCLC numbers, with their myriad of prefixes
…and many others. And, if you know a little ruby, it’s not hard to write your own.

What are the records transformed into?
Given its history focused on indexing data into Solr, the basic result of a
traject transformation of a record is a hash (map) of arrays (e.g., key1=>[val1, val2,...] — each key/fieldname is mapped to one or more values). This is easily transformed into
something that you can send to Solr or write to a file. If you need to produce more complex hierarchical data, traject may not be the right tool for you.
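To make that shape concrete, here is a minimal sketch in plain ruby (no traject involved) of a hash of arrays and two easy things you can do with it. The field names and values are made up for illustration.

```ruby
require 'json'

# A made-up output document: every field name maps to an array of values,
# even when there is only one value.
doc = {
  'id'     => ['000123456'],
  'title'  => ['The adventures of Tom Sawyer'],
  'author' => ['Twain, Mark', 'Clemens, Samuel'],
}

# One JSON object per record is all you need for newline-delimited JSON.
ndj_line = JSON.generate(doc)
puts ndj_line

# Flattening is just as easy, e.g. keeping only the first value per field.
first_values = doc.transform_values(&:first)
puts first_values.inspect
```

Anything that needs nesting deeper than "field name to list of values" does not fit this shape, which is the limitation mentioned above.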

What kind of output can it produce?
Obviously, the resulting documents can be sent to Solr, via Traject::SolrJsonWriter. Additionally, we ship Traject
with writers that produce other formats.

Traject::DebugWriter produces a human-readable file with one field and its values per line.
Traject::JsonWriter produces newline-delimited JSON, one valid JSON record per line.
Traject::YamlWriter writes a yaml file that contains multiple documents, good for both further processing and human inspection.
Traject::DelimitedWriter, by default, writes a tab-delimited file suitable for further processing or import into Excel.
Traject::CSVWriter produces comma-separated value files, as you’d expect.

A 2.0 release?

So, what’s changed enough to warrant a 2.0 release?

No longer requires JRuby
The first release of traject only ran under JRuby, based on its need to use the
solrj java library to efficiently index things into Solr. More modern
versions of Solr (since version 3.2) allow indexing documents via HTTP with JSON;
doing so not only works under any ruby implementation, but in my tests the JSON indexer goes about 20% faster than the old solrj-based indexer.

(Tab-)Delimited and CSV Writers:
How often are systems librarians asked to do things like "find all the records with publisher string XXX, and give me a list of them with the title, isbn, author, and date of publication"? For me, the answer is "often" and traject now makes it easy to output something that your user can inspect as text or import into Excel for further processing.
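For a feel of what such a report involves, here is a rough sketch in plain ruby. The delimited_line helper and the sample documents are hypothetical; this is not how the traject writers are implemented, just the general idea of flattening a hash of arrays into one line per record.

```ruby
# Hypothetical helper (not part of traject): turn one hash-of-arrays
# document into a single delimited line for the given fields. Multiple
# values within a field are joined with an "internal" delimiter.
def delimited_line(doc, fields, delimiter: "\t", internal: '|')
  fields.map { |f| Array(doc[f]).join(internal) }.join(delimiter)
end

docs = [
  { 'title' => ['Book one'], 'isbn' => ['9780306406157'], 'author' => ['Doe, Jane'] },
  { 'title' => ['Book two'], 'isbn' => [], 'author' => ['Roe, Rick', 'Poe, Pat'] },
]

fields = %w[title isbn author]
puts fields.join("\t")                           # header row
docs.each { |d| puts delimited_line(d, fields) } # one row per record
```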

Cross-platform threading
For most applications of traject to date, the bottleneck is the transformation process of turning a MARC record into a Solr document. Under JRuby, you can throw as many cores as you have available at that transformation to speed up the indexing process. Even under MRI, which can’t run multiple threads on ruby code at the same time, we can use a second thread to talk to Solr so indexing on the server doesn’t slow down processing of MARC records.
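The second-thread idea can be sketched with nothing but ruby's standard library. This is an illustration of the pattern, not traject's code: the record data, the "transform", and the stand-in for talking to Solr are all made up.

```ruby
require 'thread'

# A bounded queue hands finished documents from the transformation loop
# to a single writer thread, so slow I/O does not stall the transformation.
queue   = SizedQueue.new(100)
written = []

writer = Thread.new do
  # pop blocks until something is available; :done is our stop signal.
  while (doc = queue.pop) != :done
    written << doc # stand-in for an HTTP POST to Solr
  end
end

records = (1..5).map { |i| { 'id' => i.to_s, 'title' => "Record #{i}" } }

records.each do |rec|
  # Stand-in for the expensive MARC-to-document transformation.
  doc = { 'id' => [rec['id']], 'title' => [rec['title'].upcase] }
  queue.push(doc)
end

queue.push(:done)
writer.join
puts "wrote #{written.size} documents"
```

Even under MRI, a thread that is blocked waiting on the network lets the main thread keep running ruby code, which is why this helps.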

So…give it a whirl!

You can find traject and its related gems on Github. Besides traject itself and the associated reader/writers, there’s a heavily-documented sample project to get you started.

I’m heavily invested in traject, and am more than willing to assist folks as they start using it, so don’t be afraid to contact me (via email or twitter) if you want a little advice or a helping hand.

How good/bad is MARC data? The case of place-of-publication

Bill Dueber — Mon, 10 Nov 2014 11:09:00 +0000

I complain a lot about the MARC format, the way people put data in MARC records, the actual data themselves I find in MARC records, the inexplicably complex syntax for identifiers and, ironically, attempts to replace MARC with something else.

One nice little beacon of hope was when I found that only roughly 0.26% of the ISBNs in the UMich catalog have invalid checksums. That’s not bad at all, and it’s worth digging into other things about which I might be likely to complain before I make a fool of myself.

[Note: there will be some complaining at the end. I promise.]

One of my recent charges was to try to put in place a better place-of-publication filter in the catalog. Place of Publication is most formally dictated by the (poorly-named, since it includes states/provinces) Country of Publication code in the 008 fixed field. This one-, two- or three-letter code is then translated into a place name via a mapping provided by the LoC. Like most important pieces of data, the place of publication can appear in a few different places in a valid MARC record (because the searching is half the fun!), but we decided to just stick with the 008 for the catalog search.

Of course, the name of a place of publication may have changed since the actual publication. Historically speaking, borders have been remarkably consistent over the last half of a century or so, but there are still changes (fall of the Soviet Union), splits (the former Czechoslovakia) and merges (Germany).

Focus on validity

So, there are roughly a bazillion ways one could try to slice and dice the data to figure out what the most accurate textual representation of a place name should be for a given record. More cut-and-dry is a simple question: how many of the 008s have a valid (current or obsolete) place-of-publication code in them?

I ran an analysis of all the 008s in all the records in the University of Michigan catalog, which helpfully includes all the HathiTrust holdings as well, so we’re getting a nice cross-section of institutional records.

Here’s what I found, in round numbers

	Total	Pct. of Total
All records	12M	100%
Invalid 008	1900	0.15%
Valid code	11.6M	96.6%
Unknown place-of-pub	381k	3.1%
Invalid code	27k	0.2%

[“Unknown place-of-pub” includes both records with no data in the 008 and those with the code ‘ x’, which explicitly indicates “Unknown”]
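The bucketing behind that table can be sketched roughly like this. It is not the code I actually ran: VALID_CODES is a tiny illustrative sample rather than the full LoC list, fake_008 is a made-up helper for building test data, and a real version would also count the explicit "unknown" code as unknown.

```ruby
# Sketch: classify a record by the place-of-publication code in its 008.
VALID_CODES = %w[miu nyu gw ai].freeze # illustrative sample only

def classify_008(field_008)
  # A well-formed 008 is exactly 40 characters long.
  return :invalid_008 unless field_008.is_a?(String) && field_008.length == 40
  code = field_008[15, 3].strip # character positions 15-17 hold the code
  return :unknown if code.empty?
  VALID_CODES.include?(code) ? :valid_code : :invalid_code
end

# Hypothetical helper: build a plausible 40-character 008 around a code.
def fake_008(place)
  '850101' + 's' + '1985' + (' ' * 4) + place.ljust(3)[0, 3] +
    (' ' * 17) + 'eng' + ' d'
end

%w[miu gw zzz].each { |c| puts "#{c.inspect}: #{classify_008(fake_008(c))}" }
puts "blank: #{classify_008(fake_008(''))}"
puts "short: #{classify_008('850101s1985')}"
```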

Results: pretty good!

Given much of the data I’ve worked with over the years, this strikes me as stunningly good. Of course, in the case of a place as big as UMich, that means we’ve still got about 408k items about which we have no good place-of-publication information, but as a percentage, it’s small enough that I’m happy to live with it.

I was, admittedly, a little put out by the fact that we have records in which the 008 fixed field — which is pretty important, as these things go — was just plain invalid (including 176 just plain missing). You’d think that the ILS software would reject things like that, but, as in almost all cases when you think the ILS should do something smart, you’d be wrong.

And now, the complaints

Of course, all we know is that the codes are (or were) valid — not whether or not they’re accurate.

There are two obvious problems:

Some rocket scientists at some point decided that the code ‘ai’, which had been used to represent Anguilla, should now be used to represent the Republic of Armenia. As if that weren’t enough to make you slam your head into a brick wall, the change is based on the date of cataloging, not the date of publication, so there’s no way for me to know which country is supposed to be indicated. It looks like this was to try to keep the first two letters of codes from the old Soviet Union the same once it fell apart, but c’mon, people! (Note that Anguilla is now ‘am’, because of the …ummmm….”m” in its…er…nevermind.) We don’t have many records with that code, but this is the sort of blatant disregard for simple data integrity that drives me crazy.
A presumably different set of rocket scientists (once NASA downsized, those folks were everywhere) at various points in time and at various locations decided that the place of publication on a reproduction (say, a microfilm) should be the place the reproduction was created. So, a microfilm of The New York Times that happened to be created in Ann Arbor, MI is coded as ‘miu’, for Michigan.

The latter, of course, is designed to serve those people studying where microfilms were created at the expense of people who want to, you know, find things actually published in a particular location. I’m sure all three of the people in the country who want to know the former are forever grateful.

Ruby MARC serialization/deserialization revisited

Bill Dueber — Thu, 09 Oct 2014 12:58:00 +0000

A few years ago, I benchmarked various methods of serializing/deserializing MARC data using the ruby-marc gem. Given that I’m planning on starting fresh with my catalog setup, I thought I’d take a moment to revisit them.

The biggest changes since that time have been (a) the continued speed improvements in JRuby, (b) the introduction of the Oj json parser for MRI ruby, and (c) wider availability of msgpack code in the wild.

I also wondered what would happen if I tried ruby’s Marshal serialization; maybe it would be faster because I wouldn’t have to "manually" create a MARC::Record object from a hash?

File sizes

File size isn’t as important as it once was, but still matters to some of us working with ginormous amounts of data:

This is the file size to hold the 18,881 records used for the benchmark.

Serialization	Size on disk (MB)	Size vs. marc21	Gzipped size on disk (MB)	Gzipped size vs marc21
marc21	31	100%	9.0	100%
msgpack	42	135%	8.2	91%
json (ndj)	56	180%	8.1	90%
marshal	69	223%	9.4	104%
marc-xml	93	300%	9.2	102%

It’s interesting, if not super-useful, to note that the file sizes differ by a factor of three uncompressed, but hardly at all when compressed. I was surprised at how well the binary formats (msgpack, marshal, and marc21) compressed.

Serialization / Deserialization time

I took a file of about 19k MARC records and tested the serialization/deserialization time, as follows:

marc21 uses MARC::Reader and MARC::Writer from the ruby marc distribution
json uses MARC::Record#to_hash to produce a marc-in-json hash, serializes with the stock JSON library, and then writes to a file with one record per line (sometimes known as newline-delimited JSON, or NDJ). Deserialization reverses the process.
json (oj) is the same, except using the Oj json library under MRI.
msgpack uses the msgpack or msgpack-jruby gems to serialize/deserialize msgpack objects to/from a file stream.
marshal uses the core ruby Marshal class to serialize/deserialize to a file stream.

In all cases, deserialization means to pull each record in turn from a file on disk and turn it into a MARC::Record object; serialization means to take a set of pre-created MARC::Record objects, serialize them, and push them into a file.

All times are in “real time” seconds as reported by Benchmark, averaged across two runs on my desktop machine:

MRI ruby 2.1.2p95 (2014-05-08 revision 45877) [x86_64-darwin13.0]
JRuby jruby 1.7.15 (1.9.3p392) 2014-09-03 82b5cc3 on Java HotSpot(TM) 64-Bit Server VM 1.8.0-b132 +indy +jit [darwin-x86_64]

The benchmark code is up on a gist if you’d like to look at or modify it.
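For the general shape of such a comparison, here is a stripped-down sketch. It is not the code from the gist: it uses plain made-up hashes instead of MARC::Record objects, works in memory rather than on files, and only covers the stock JSON library and Marshal, so the timings it prints say nothing about the numbers below.

```ruby
require 'benchmark'
require 'json'

# Made-up stand-ins for records: plain hashes, not MARC::Record objects.
records = (1..2_000).map do |i|
  { 'id' => i.to_s, 'fields' => [{ '245' => "Title number #{i}" }] }
end

json_out = marshal_out = json_back = marshal_back = nil

Benchmark.bm(20) do |bm|
  # Newline-delimited JSON: one record per line.
  bm.report('json serialize') do
    json_out = records.map { |r| JSON.generate(r) }.join("\n")
  end
  bm.report('json deserialize') do
    json_back = json_out.each_line.map { |line| JSON.parse(line) }
  end
  bm.report('marshal serialize')   { marshal_out = Marshal.dump(records) }
  bm.report('marshal deserialize') { marshal_back = Marshal.load(marshal_out) }
end
```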

MRI Ruby

Method	Deserialize (s)	Serialize (s)	Total round trip (s)	Total time vs. marc21
marc21	11.5	6.7	18.2	100%
json	9.5	5.9	15.4	85%
json (oj)	6.4	3.5	9.9	54%
msgpack	5.8	3.4	9.2	51%
marshal	7.0	8.7	15.7	86%

JRuby

Method	Deserialize (s)	Serialize (s)	Total round trip (s)	Total time vs. marc21
marc21	6.9	4.2	11.1	100%
json	4.1	2.5	6.6	59%
json (oj)	N/A	N/A	N/A	N/A
msgpack	4.0	3.9	7.9	71%
marshal	5.3	8.1	13.4	121%

Conclusions

JRuby is faster than MRI on these tasks, at least once it’s warmed up.
JSON, with a decent library, is either the fastest (JRuby) or really close (MRI)
Marshal is slow and big.

I come away with this thinking the same thing I did last time. I’m going to use compressed ndj in files, and (possibly compressed) JSON over the wire. The speed is great, tool support is outstanding, and having something human-readable is a big bonus.

“Schemaless” solr with dynamicField and copyField

Bill Dueber — Thu, 02 Oct 2014 00:00:00 +0000

[Holy Kamoly, it’s been a long time since I blogged!]

Recent versions of solr have the option to run in what they call "schemaless mode", wherein fields that aren’t recognized are actually added, automatically, to the schema as real named fields.

I find this intriguing, but it’s not what I’m after right now.

The problem I’m in the first stages of addressing is that my schema.xml is a huge mess: very little consistency, no naming conventions dictating what’s stored/indexed, etc. It grew "organically" (which is what I say when I mean I’ve been lazy and sloppy) and needs a full-on reorganization.

The way people tend to address this is with strict naming conventions (possibly using dynamicField ) and judicious use of copyField directives. The Project Hydra folks have a nice, straightforward system for how they set up dynamic fields.

Indexed XOR Stored?

The more I thought about it, the more I wondered whether it might be useful to have a strict separation of stored and indexed fields. Indexed fields would be named with an appropriate suffix, so you know how they’ve been analyzed. And stored fields would have pleasant, human-readable names to make them easy to deal with for consuming applications.

What I think I’d like is a system where:

All stored fields have ‘bare’ names (e.g., ‘title’, not ‘title_t’ or ‘title_s’)
All indexed fields are typed according to their name (so I know ‘title_t’ is an indexed field of type "text")
Separation of stored and indexed fields — a field is either stored or indexed, but not both.
A "schemaless" setup, where I don’t need to define all (any of?) my fields in my schema and reboot solr when I make a change.

To be clear: I’m not sure this is a great way to go as of yet. But I figured out what I think is a good way to do it, should it turn out to be worthwhile.

Part 1: Dynamic Fields

Solr allows one to define dynamic fields — a field whose type is determined by a glob-match on its name. Instead of explicitly naming your field in your schema, you can do something like:

…to indicate that any unrecognized field whose name ends in _is will be treated as an indexed, stored integer.

Dynamic Field definitions are processed in order of declaration; first one wins. That allows you to define a “default” as the very last dynamicField that matches anything (e.g., *). The schema.xml that ships with Solr suggests that you can use this functionality to just ignore unrecognized fields.

But that gives me an idea.

Part 2: Copy Fields

The copyField directive allows you to index the same text into multiple fields (presumably with different analysis chains). Index data into one field, it automatically gets copied into another.

In this case, even though I only send a title, the indexed field title_l will automatically be created and available for me to search against. Nice.

Part 3: Copy Field with globs

But it gets better. You can have globs (*) in your copyField source or destination attributes.

So that’s nice. But what if you have globs in both the source and the destination? The docs say:

The copyField command can use a wildcard (*) character in the dest parameter only if the source parameter contains one as well. copyField uses the matching glob from the source field for the dest field name into which the source content is copied.

Hmmmmmm….

Part 4: Putting it all together

Once I read that, I thought, “Huh. I’m hungry.”

But after lunch, I thought, “Maybe I can do something with this.”

Here’s what I came up with.

Let’s walk through that.

First, there are two dynamicField definitions. The first is a no-op: unstored, unindexed. We use it only for copying. The second is a standard indexed (but not stored) text field.

Then come the copyFields, where we match on the suffixes of the field types. Finally, we have our default: a stored, unindexed string. (Note that when Solr stores a value, it stores whatever you put into it, not the value after analysis — same as a String does anyway).

Suppose I index an undeclared field called title_t_s:

title_t_s matches the first dynamicField declaration. This specific field is ignored (no indexing, no storing), but the text sent to it remains available for further processing by the copyFields.
The first copyField matches, and copies the text into a newly-generated field formed by what matched the * in the source field, followed by _t. That’s title, so we get title_t.
The newly-minted title_t field is also unrecognized, but it matches the second dynamicField and is thus assigned to be an indexed text field.
Meanwhile, the second copyField also matches our original title_t_s. It uses what matched against the * in the source (title, again) to create a new field just called title.
Now we have a new field called title not matching any declared field, so it runs down the list of dynamicField definitions until it hits our stopgap at the end: a stored, nonindexed string.

Yeah, like that wasn’t confusing.

The result is what’s important, though. What we end up with field-wise is:

title_t_s disappearing into the ether. It’s just gone.
title_t, an indexed text field
title, a stored string.

Now I can run searches against title_t, but my document will have a nice stored string in it just called title.
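If the walkthrough is easier to follow as code, here is a toy model of it in ruby. It mimics the behaviour described above (first matching pattern wins, copies are made from the original field only) and is in no way Solr's actual implementation; the constant and method names are made up.

```ruby
# Patterns are written as suffixes, so '_t' stands for the glob "*_t"
# and '' stands for the catch-all "*".
DYNAMIC_FIELDS = [
  ['_t_s', { indexed: false, stored: false }],                 # copy-only no-op
  ['_t',   { indexed: true,  stored: false, type: 'text' }],
  ['',     { indexed: false, stored: true,  type: 'string' }], # the default
].freeze

COPY_FIELDS = [
  ['_t_s', '_t'], # source "*_t_s" -> dest "*_t"
  ['_t_s', ''],   # source "*_t_s" -> dest "*"
].freeze

# First matching pattern wins, as described in the post.
def field_options(name)
  DYNAMIC_FIELDS.find { |suffix, _opts| name.end_with?(suffix) }.last
end

# Returns only the fields that actually end up indexed or stored.
def resulting_fields(name)
  targets = [name]
  COPY_FIELDS.each do |src, dest|
    next unless name.end_with?(src)
    stem = name[0, name.length - src.length] # whatever the * matched
    targets << stem + dest
  end
  targets.each_with_object({}) do |field, out|
    opts = field_options(field)
    out[field] = opts if opts[:indexed] || opts[:stored]
  end
end

p resulting_fields('title_t_s')
```

Running it on title_t_s leaves exactly two fields standing, title_t and title, which matches the list above.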

Why this is probably a bad idea.

Depending on how crazy you want to get options-wise (multi-valued or not, termVectors or not, etc.) you can get a combinatorial explosion on the number of dynamicField/copyField sets you need to generate. But that’s not the real problem.

The real problem is that you don’t have any intrinsic documentation of what your index looks like. None. You can’t even look at the indexing code, because it’ll look like you’re sending a document with a field called title_t_s and that field is nowhere to be found.

So, like I said: interesting, but by no means the obvious way to go. Still, I’m sure I’ll have some variant of this in my schema when it comes time for me to reboot the library catalog.

Help me test yet another LC Callnumber parser

Bill Dueber — Thu, 30 Jan 2014 00:00:00 +0000

Those who have followed this blog and my code for a while know that I have a long, slightly sad, and borderline abusive relationship with Library of Congress call numbers.

They’re a freakin’ nightmare. They just are.

But, based on the premise that Sisyphus was a quitter, I took another stab at it, this time writing a real (PEG-) parser instead of trying to futz with extended regular expressions.

The results, so far, aren’t too bad.

The gem is called lc_callnumber, but more importantly, I’ve put together a little heroku app to let you play with it, and then correct any incorrect parses (or tell me that it worked correctly) to build up a test suite.

So…Please try to break my LC Callnumber parser!

[Code for the app itself is on github; pull requests for both the app and the gem joyously received]

New blog front- and back-end

Bill Dueber — Tue, 17 Dec 2013 00:00:00 +0000

A while back, Dreamhost had some problems and my blog and assorted other websites I help keep track of went down.

For more than two weeks.

Now, I understand that crap happens. And I understand that sometimes lots of things happen at once. But fundamentally, their infrastructure is such that they could lose everything on a machine and be unable to get it back for more than two weeks. I’m not a mathematician, but that’s not “five-nine” service.

So, I decided to start hunting around for another provider. And then I got distracted by the idea that maybe having my blog in WordPress was more trouble than it was worth. There’s something to be said for simplicity, especially since all I really wanted to do is throw up posts written in markdown with code samples.

I got a few pointers toward using middleman, a pre-processor that takes in almost anything and produces regular css/html. Between that and Disqus for the comments, well, this just seems easier. And now that I’ve put in the effort, it’ll be easier to actually get blog posts up and, most importantly, to move it over when I find a new hosting provider.

Feel free to tell me how ugly it is and suggest improvements. I have the design skills of a one-eyed poodle.

Announcing “traject” indexing software

Bill Dueber — Mon, 14 Oct 2013 00:00:00 +0000

[Over the next few days I’ll be writing a series of posts that highlight a new indexing solution by Jonathan Rochkind and myself called traject that we’re using to index MARC data into Solr. This is the introduction.]

Wow. Six months since I posted here. What have I been doing?

Well, mostly parenting, but in the last few weeks I was lucky enough to get on board with a project started by Jonathan Rochkind for a new JRuby-based tool optimized for indexing MARC data into solr. You know, kinda like solrmarc, but JRuby.

What’s it look like?

I encourage you to take a look at a little sample setup I put together for instructional purposes. It’s based on the HathiTrust catalog indexing scheme and shows off about 85% of what traject can do. Clone it and go through the README and the two indexing files to get a taste of how things are put together.

Real quickly, though, here’s a sample configuration file to pull out the ID, title, and authors (if any) out of a file of MARC records and send them to a file as JSON object, one record per line (i.e., newline-delimited JSON)

# we'll pretend this file is called 'sample.rb'
require 'traject'
require 'traject/marc_reader'
require 'traject/json_writer'

# It's just ruby, so I can have comments!
# Here we set up which reader/writer to use and so on
settings do
  provide "reader_class_name", "Traject::MarcReader"
  provide "writer_class_name", "Traject::JsonWriter"
  provide "output_file", "basics.ndj"
  provide 'processing_thread_pool', 3
end

# It's *still* just ruby, so I can declare a variable!
idfield = '001'

# ...and then use it to find the ID
to_field "id", extract_marc(idfield, :first => true)

# Now the other data
to_field "title", extract_marc('245')
to_field "author", extract_marc('100abcd:110abcd:111abc')

# You'd run this as:
#    traject -c sample.rb myfile.mrc

That’s simplistic, of course, but it should drive home the point that we strove to make sure traject makes the easy stuff easy. For a more complex example, look at the heavily-annotated index.rb file in the sample project.

Why use (or move to) traject?

First off, you can and should look at the announcement and/or the README for a longer answer, but I’ll tell you why I use traject in one word:

Flexibility.

After a year or so of struggling with solrmarc (often due to my lack of Java-fu), and then even more years after that using my own, home-grown marc2solr, the things I most wanted were the ability to decouple the various components from each other, rely on code instead of configuration, and basically just know that I can up the complexity of my code without paying an enormous price.

I’m fast with Ruby. And the architecture of traject allows me to easily build and test my transformations in isolation, with tools I’m good with, with debugging output that’s easy to read or process by machine or inspection.

What does it have out of the box?

One advantage traject has that my previous system didn’t is, well, years of struggling with my previous system. I’ve learned a lot about what I need, what needs to be easy, and how I want to think about indexing.

The nature of traject is that “a reader” sends “a record” to “an indexer” which produces a key=>value hash and sends that to “a writer.” Obviously, this is a pretty abstract setup; it’s not hard to see how it could be used for all sorts of transformations (e.g., I’m already thinking about a simple gem that would provide macros to index CSV or tab-delimited files into Solr. Or maybe going to/from a database).
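Stripped of everything MARC- and Solr-specific, that reader/indexer/writer shape looks something like the following sketch. None of these names are traject's; the records, steps, and writer are all made up to show the data flow.

```ruby
# The "reader": anything that can hand out records one at a time.
reader = [
  { id: 'a1', title: 'First record.' },
  { id: 'b2', title: 'Second record.' },
].each

# The "indexer": a list of [field name, extraction lambda] steps.
steps = [
  ['id',    ->(rec) { [rec[:id]] }],
  ['title', ->(rec) { [rec[:title].sub(/\.\z/, '')] }], # trim trailing period
]

index = lambda do |rec|
  steps.each_with_object({}) { |(field, fn), doc| doc[field] = fn.call(rec) }
end

# The "writer": here it just collects documents in an array.
output = []
writer = ->(doc) { output << doc }

reader.each { |rec| writer.call(index.call(rec)) }
p output
```

Swap in a different reader or writer and the middle part does not need to know or care.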

But Jonathan and I are, mostly, stuck dealing with MARC data and Solr. So here’s what we get:

Readers: MARC readers for MARC21 binary and MARC-XML based on both ruby-marc and marc4j (the latter allowing you to deal with encoding transformations and the like). An NDJ reader (for one marc-in-json structure per line in a file; that’s what we use for HathiTrust). And we’ve already got a couple gems for people with other needs: traject_alephsequential_reader for those that need to deal with AlephSequential, and Jonathan’s new horizon reader for efficiently pulling records right out of your Horizon ILS, if you happen to run one.

Transforming Macros: A traject indexing step is just a well-formed ruby block (or lambda), which makes writing macros ridiculously easy. Traject ships with most of what you’d commonly need to deal with MARC: extracting data based on tag/subfield/indicators (or substring of a fixed field), dealing with non-filing characters, automatically dealing with 880 linked fields. Mucking with publication dates. Dealing with languages, formats, etc. And, of course, doing it all with multiple threads, because who wants to see all those lovely cores go to waste?

Writers: Of course, you can write to solr, using the excellent solrj java library. And you can do it in multiple threads, to keep things fast. But there’s also the DebugWriter to spit stuff out in a human-readable format, and the JsonWriter mentioned above to spit stuff out in a machine-readable format. And building your own writer is literally just a couple methods.

How do I get a taste?

Like I said, clone and play with the sample project. And ask me questions, either here or via email. After years of being the only person running my indexing software, I’m anxious to try to build up a community around traject.

Come work at the University of Michigan

Bill Dueber — Thu, 18 Apr 2013 00:00:00 +0000

The Library has three UX positions available right now — interface designer, interface developer, and a web content strategist.

Come join me at what is easily the best place I’ve ever worked! Full details are over at Suz’s blog.

Please: don’t return your books

Bill Dueber — Tue, 12 Feb 2013 00:00:00 +0000

So, I’m at code4lib 2013 right now, where side conversations and informal exchanges tend to be the most interesting part.

Last night I had a conversation with the inimitable Michael B. Klein, and after complaining about faculty members that keep books out for decades at a time, we ended up asking a simple question:

How much more shelving would we need if everyone returned their books?

Assuming we could get them all checked in and such, well, where would we put them?

I’m looking at this in the simplest, most conservative way possible:

Assume they’re all paperbacks, so we don’t worry about how thick a cover is (cover width = 0)
Assume items for which we don’t have page count information are “average”

Starting data

What’s my current situation at Michigan?

Total bibs: about 10M (but that includes a bunch of HathiTrust items and other electronic-only items that could never be checked out)
Total items checked out right now: 162,080

The first problem I run into is that I don’t know how many pages are in a given book. Well, in theory I can look in MARC field 300$a, and it will tell me.

Finding the number of pages in a book

I went through a recent dump of all our records and pulled out page counts from the 300 (those that matched the regular expression $$a\d+\s+[pP].).
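In ruby terms, the extraction looks roughly like this. The pattern below is my reading of the one above applied to the contents of the $a subfield (leading digits, whitespace, then "p." or "P."), and page_count is a made-up helper name, so treat it as a sketch rather than the exact code I ran.

```ruby
# Matches extents like "244 p." and captures the digits. Anything
# fancier (roman-numeral prefixes, volume counts) is simply skipped.
PAGE_COUNT = /\A\s*(\d+)\s+[pP]\./

# Returns the page count as an Integer, or nil when there is no match.
def page_count(extent)
  m = PAGE_COUNT.match(extent.to_s)
  m && m[1].to_i
end

['244 p. ;', '310 P. :', 'xii, 244 p.', '3 v.'].each do |extent|
  puts "#{extent.inspect} => #{page_count(extent).inspect}"
end
```

Being this strict is part of why only some records yield a page count.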

Problem solved, right? Well, kind of

3,085,433 total bibs with page count data (about 30%)
40,872 checked out items with page count data (about 25%)

OK, so I don’t have data for everything. Plus, some of those are multi-volume works that list the total page count, even though only a single volume may be checked out.

We’ll have to drop down into statistics:

Average number of pages in a checked-out item: 270
Median number of pages in a checked-out item: 244

The median is lower, so we’ll go with that. Being conservative, remember?

Bringing it all together

Obviously we need to make a lot of assumptions.

All paperbacks (== no space allowance for covers)
244 pages per item (the median of checked out items for which we have data)
Pages = 244 * 162,080 = 39,547,520 pages

So…what’s the damage?

But how to do the calculation?

It turns out that if you simply google “book spine width calculator”, a few come up.

I picked one and input 39,547,520 pages and assumed 50lb paper (the lightest paper in the tool).

Total width: 77,241.25 inches, or 6437 feet, or 1.22 miles
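For anyone who wants to check the arithmetic, here it is in ruby. The 512 pages-per-inch figure is not something the calculator told me; it is simply what the totals above imply for 50lb paper, so treat it as an inferred assumption.

```ruby
items_out      = 162_080
pages_per_item = 244   # median page count of checked-out items
pages_per_inch = 512.0 # inferred from the totals above, not a stated value

total_pages = items_out * pages_per_item
inches      = total_pages / pages_per_inch
feet        = inches / 12
miles       = feet / 5280

puts format('%d pages = %.2f in = %.0f ft = %.2f mi',
            total_pages, inches, feet, miles)
```

Plug in a different paper weight or the mean page count instead of the median and the number only goes up.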

1.22 miles???

Well, we had a lot of assumptions, but most of them were pretty conservative. And I have no idea if the book spine calculator is at all accurate.

But…it’s gonna be a big number no matter what. Add in that many of them are hardcover, and this seems like a pretty good guess at a lower end.

What is this good for again?

Oh, nothing at all. Just a little fun while I’m at code4lib.

Next steps

Well, the best next step would be to walk away. This is a huge waste of time.

But…we could look in the 020s for a hint of whether it’s hardcover or paperback (which is really hard). And maybe try to figure out if multiple volumes of a multi-volume work are all checked out and take that into account.

But really: this is enough for me. Whether Michael wants to pursue it further on his own, well, that’s up to him.

Requiring/Preferring searches that don’t span multiple values (SST #3)

Bill Dueber — Fri, 09 Mar 2012 00:00:00 +0000

[Check out the introduction to the Stupid Solr Tricks series if you’re just joining us.]

Solr and multiValued fields

Here’s another thing you need to understand about Solr: it doesn’t really have fields that can take multiple values.

But Bill, you’re saying, sure it does. I mean, hell, it even has a ‘multiValued’ parameter.

First off: watch your language.

Second off: are you sure?

Let’s do a quick test. Look at the following documents

// exampledocs/names.json 
[
  {
    "id":1,
    "title":"The Monkees",
    "name_text":[
      "Peter Tork",
      "Mike Nesmith",
      "Micky Dolenz",
      "Davy Thomas Jones"
    ]
  },
  {
    "id":2,
    "title":"Heros of the Wild West",
    "name_text":[
      "Buck Jones",
      "Davy Crockett"
    ]
  }
]

Question: what do you get when you run this query against those two documents?

# ruby/names_query.rb  
{   
  'fl' => 'score, *',   'defType' => 'dismax',   'wt' => 'csv',   
  'qf' => 'name_text',  
  'q' => 'davy jones'   # Poor guy just died. So young. So short.
}

See how I threw the wt=csv in there? Check out all the query response formats if you’re interested, but really all you’ll use is standard (XML), json, or csv unless you’re rolling your own in some way.

I’ve updated ruby/browse.rb to allow a second argument of the type of output you want. You can now do ruby browse.rb jsonfile [json|csv|standard|xml]

Following along at home?

If so, let’s go ahead and index these documents and run the query.

java -jar start.jar
cd exampledocs 
./reset_and_index_json.sh names.json  
cd ../ruby  
ruby browse.rb names_query.rb

Here’s the scores that I get:

score	id	title	name_text
0.42039964	2	Heros of the Wild West	[Buck Jones, Davy Crockett]
0.26274976	1	The Monkees	[Peter Tork,Mike Nesmith,Micky Dolenz,Davy Thomas Jones]

Check out that last column. The query was davy jones. Document #1 contains a name that has both those terms, but document #2 (which has both terms, but in different names) gets a higher score.

The relevance ranking seems…wrong

While it looks like we added four separate names to the name_text field in our first document, Solr doesn’t see it that way. Solr treats those four poor Monkees as if they had one long name.

Then it finds all the documents that match the query (both of our documents match) and figures out which is a better match by assigning a score.

In this case, while both documents have both query terms, the field in the second document is shorter. Which means that, essentially, a higher percentage of the terms in the field value match the given query terms. In Solr’s mind, that makes it a better match, and the shorter document shows up first.

Solr doesn’t automatically give more weight to the recently-dead Monkee because internally it doesn’t care that you’re thinking of those values as four separate names. It just concatenates them together and indexes them.

This is not, for most people, expected behavior.

Phrase slop

Part of what’s going on here is that we haven’t told Solr that it should care how close together the terms are.

One way to do that is to use a phrase query by throwing quotes around the terms

# Put double-quotes around it to make it a phrase query   
'q' => '"Davy Jones"'

…but that won’t find anything, because Davy and Jones aren’t right next to each other in our document.

Solr does allow a phrase query to be sloppy, though — basically saying that instead of being right next to each other, the terms need to be within a certain number of tokens of each other.

For that, we’ll tell solr to search against certain fields (pf) treating the query as a phrase, and allow a little slop (ps) as well.

#ruby/names_sloppy_query.rb   
{    
    'fl' => 'score, *',    
    'defType' => 'dismax',    
    'wt' => 'csv',     
    'q' => 'davy jones',     
    'qf' => 'name_text',    
    'pf' => 'name_text^10', # search this field as a phrase     
    'ps' => '4' # allow 'phrase' to mean 'within 4 tokens of each other'   
    }

That gets us something more expected.

   id,title,name_text,score   
   1,The Monkees,Peter Tork,Mike Nesmith,Micky Dolenz,Davy Thomas Jones,0.2806283  
   2,Heros of the Wild West,Buck Jones,Davy Crockett,0.029652705

Enter `positionIncrementGap`

OK. Now that we have the concept of slop, one of those mystery fieldtype parameters makes sense: positionIncrementGap. Basically, a positionIncrementGap of 1000 means “When computing slop, pretend there are 1000 tokens between the entries in a multiValued field.”

A sloppy phrase search, then, will only find (and thus boost) the phrase if (a) the tokens are in the same entry for a multiValued field, and (b) your slop value is less than your positionIncrementGap.

All you have to do is use the pf and ps parameters and you’re set.

Note that this should be telling you two things:

Always use the same positionIncrementGap for your multiValued fields
Make it a number much larger than the maximum number of tokens you expect to ever have in a field.

Note that a large positionIncrementGap doesn’t actually put 1000 tokens in there — a large value doesn’t affect processing time or your index size or anything.
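To make the gap idea concrete, here is a toy model in ruby. It is a deliberate simplification and not Lucene's actual sloppy-phrase algorithm: tokens get consecutive positions, each new entry skips ahead by the gap, and two terms "match" if the number of positions between them is no more than the slop (ignoring order). The function names are made up for this sketch.

```ruby
# Assign a position to every token, skipping `gap` positions between entries.
def token_positions(values, gap)
  positions = Hash.new { |hash, token| hash[token] = [] }
  pos = 0
  values.each do |value|
    value.downcase.split.each do |token|
      positions[token] << pos
      pos += 1
    end
    pos += gap # pretend there are `gap` tokens between entries
  end
  positions
end

# Toy check: are the two terms within `slop` positions of each other?
def within_slop?(values, term_a, term_b, slop:, gap:)
  positions = token_positions(values, gap)
  positions[term_a].product(positions[term_b]).any? do |a, b|
    (a - b).abs - 1 <= slop
  end
end

monkees   = ['Peter Tork', 'Mike Nesmith', 'Micky Dolenz', 'Davy Thomas Jones']
wild_west = ['Buck Jones', 'Davy Crockett']

within_slop?(monkees,   'davy', 'jones', slop: 4, gap: 1000) # same entry, so it matches
within_slop?(wild_west, 'davy', 'jones', slop: 4, gap: 1000) # different entries, no match
within_slop?(wild_west, 'davy', 'jones', slop: 4, gap: 0)    # without a gap, the names bleed together
```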

But I’m already using the `pf` parameter!

Slop is great when you want it. But I don’t always want to use slop. Slop of 4 makes the phrase Sex in the City be treated exactly the same as In the Sex City. If someone puts in an exact title, I want to reward them for that query by floating the exact match to the top, and slop prevents me from doing so.

[Foreshadowing: We’ll talk about exact-ish matches in a few days.]

OK, so we can’t just appropriate the pf/ps parameters and push the slop value up all the time — that cripples our ability to create the query boost structure we want.

Query slop

So, dismax (and its cousin edismax) have an analogous parameter that affects only phrases within the normal query: qs.

qs is a dismax param that affects query slop — how much slop to allow in phrases within the query, much like the ps param.

The query

# A three-token query   
'q' => 'Bill "The Weasel" Dueber'

…has three tokens, the second of which (The Weasel) is a phrase. It’s that phrase token that is affected by query slop.

OK. So it affects only the phrases in the normal query. But…suppose we just force the entire query to be one big phrase? That’ll get us somewhere!

We just need to do the following:

Create a boost query that uses the same fields as the regular query
…but treats all the query terms as one big phrase
…and give it a query slop of one less than the positionIncrementGap in our field type definition (in my case, 999)

Package it up

OK, so here’s what we’re going to do. You can just take this basic idea and build it into your own queries in your application code. Try it. You might like it. Play around with what fields are affected, how much weight to give it, etc.

But heck, we’ve gone this far. Let’s encode it into the Solr configuration file solrconfig.xml itself as a custom request handler.

We’re going to extend our edismaxplus requestHandler from last time, but we’ll add an extra boost query that reflects this new “prefer documents where all the tokens appear in the same ‘line’ of a multiValued field” attitude.


   
  
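<requestHandler name="edismaxplus" class="solr.SearchHandler">
  <lst name="defaults">
    <int name="rows">10</int>
    <str name="fl">*,score</str>
    <str name="echoParams">explicit</str>
    <str name="q">
      _query_:{!edismax qf=$fields mm=$mymm v=$qwords bq=$boostForAll}
    </str>
    <str name="mymm">0%</str>
    <str name="qwordsphrase">JunkThatWillNEverShowUpInAMillionFreakinYears</str>
    <str name="boostForAll">
      _query_:{!edismax qf=$fields mm='100%' v=$qwords }^5 OR
      _query_:{!dismax  qf=$fields mm='100%' v=$qwordsphrase qs='999'}^5
    </str>
  </lst>
</requestHandler>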

We now do a few new things:

Add a second clause to the boost query that uses the same fields provided for the regular query (note the boolean OR between the two localparams queries that comprise this boost query)
Ask for another user-provided value: qwordsphrase, which your application-level stuff should set to the list of all the regular query terms, but as a single phrase. Basically, strip out all the double-quotes, then put the whole thing in double quotes. In ruby: qwordsphrase = '"' + qwords.gsub(/"/, '') + '"'
Provide a default value for the new qwordsphrase that won’t ever show up in a real query (empty string won’t work; I tried it and it throws an error). So, if the application doesn’t provide qwordsphrase, no harm is done — the search regresses to what we had last time.
Use a qs (query slop) of 999 in the new boost clause acting against qwordsphrase. That value is one less than the positionIncrementGap of 1000, making sure that we don’t cross multiValue boundaries.
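On the application side, the phrase-ifying step plus building the request might look something like the sketch below. The helper names (`phraseify`, `edismaxplus_url`) are hypothetical, and the base URL simply assumes the local example Solr used throughout this series.

```ruby
require 'uri'

# Strip any double-quotes the user typed, then wrap the whole query in quotes
def phraseify(qwords)
  '"' + qwords.delete('"') + '"'
end

# Build a URL for the edismaxplus handler with all three user-supplied params
def edismaxplus_url(qwords, fields, base: 'http://localhost:8983/solr/edismaxplus/')
  params = {
    'qwords'       => qwords,
    'fields'       => fields,
    'qwordsphrase' => phraseify(qwords)
  }
  "#{base}?#{URI.encode_www_form(params)}"
end

puts phraseify('Bill "The Weasel" Dueber')
puts edismaxplus_url('davy jones', 'name_text')
```

Letting URI.encode_www_form do the escaping means you don't have to think about what spaces and double-quotes turn into on the wire.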

Note: If you wanted to, you could make this a filter query (fq) instead of a boost query to only allow documents that meet this criterion.

Let’s try it out!

Once again, if you did a git pull origin master you’ve got this up and running already — the updated requestHandler source is already in solr/conf/solrconfig.xml.

We first construct the query just like we did last week, without the qwordsphrase argument:

http://localhost:8983/solr/edismaxplus/?qwords=davy jones&fields=name_text

You’ll see Davy Crockett and friend appear as the first item.

But when you add the phraseified query, you’ll see the boost we’ve been talking about this whole post and get something more expected.

http://localhost:8983/solr/edismaxplus/?qwords=davy jones&fields=name_text&qwordsphrase="Davy Jones"

The Monkees are again on top! Party like it’s 1967!

Where it breaks down

If you actually have a phrase as one of your query terms, it will no longer be treated as a phrase during the boost because we’re getting rid of all the double-quotes.

And, of course, if you’ve got gobs of full-text and include your fulltext field, setting query slop to 999 isn’t just a cute trick, it’s a cute trick that will melt your servers to slag and still not do what you want it to do.

What have we learned?

…ish?

Solr doesn’t really separate multiple values from each other in a multiValued field
Phrase slop (ps) and query slop (qs) can be used to allow “phrase” to mean “a bunch of tokens within X spots of each other”
I’m A Believer is the best damn song Neil Diamond ever wrote.

Solr and boolean operators

Bill Dueber — Thu, 01 Dec 2011 00:00:00 +0000

[Summary: ALWAYS ALWAYS ALWAYS USE PARENTHESES TO GROUP BOOLEANS IN SOLR!!!]

What does Solr do, given the following query?

  a OR b AND c

I’ll give you three guesses, but you’ll get the first two wrong and won’t have any idea how to generate a third, so don’t spend too much time on it.

Boolean algebra and operator precedence

Anyone who’s had even a passing introduction to boolean algebra knows that it specifies a strict order to how the operators are bound: NOT before AND before OR. So, one might expect the following grouping:

   a OR (b AND c)

That’s guess one. It’s not how Solr does it.

Left to right?

Some naive students, and at least one programming language (Smalltalk), do a simple left-to-right evaluation. So you might go with:

   (a OR b) AND c

Nope. Wrong again.

So what’s left???

Excellent question. I don’t know the code well enough to know what’s going on underneath, but here’s what we get under the lucene query parser.

     (b AND c)

That’s right. The first term is thrown away. (More correctly, the first term is deemed “optional”.)

Do you let your users put AND/OR/NOT in their queries?

Hopefully, they don’t know any boolean algebra. If they do, hopefully they use parentheses, or you parse it out for them. And if not, well, they’re gonna be pretty damn confused.

It gets weirder

I populated a fresh solr (3.5) index with all possible subsets of the strings “curly”, “larry”, “moe”, and “shemp” (not Joe. Don’t talk to me about Joe). There are 15 of them, from the one-item ‘curly’ to all four at once.
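Generating those test documents is a one-liner with Array#combination. This is a sketch of the idea rather than the script I used; `non_empty_subsets` is a name invented here.

```ruby
STOOGES = %w[curly larry moe shemp].freeze

# Every subset of size 1, 2, 3, and 4, in that order
def non_empty_subsets(items)
  (1..items.size).flat_map { |n| items.combination(n).to_a }
end

subsets = non_empty_subsets(STOOGES)
subsets.each { |names| puts names.join(' ') }
puts subsets.size
```

Four items give 2^4 - 1 non-empty subsets, which is where the 15 comes from.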

I wrote a script to run a set of queries against the index under both lucene and edismax to see what I would get. In all cases the default lucene operator is ‘AND’ and the edismax mm parameter is set to 100% (equivalent to “all required”).

         Lucene                    EDismax
-------------------------------------------------------
1. curly AND larry
         curly larry               curly larry
         curly larry moe           curly larry moe
         curly larry shemp         curly larry shemp
         curly larry moe shemp     curly larry moe shemp

2. curly AND larry OR moe
         curly                     curly larry
         curly larry               curly larry moe
         curly moe                 curly larry shemp
         curly shemp               curly larry moe shemp
         curly larry moe
         curly larry shemp
         curly moe shemp
         curly larry moe shemp

3. curly OR larry AND moe
         larry moe                 larry moe
         curly larry moe           curly larry moe
         larry moe shemp           larry moe shemp
         curly larry moe shemp     curly larry moe shemp

4. curly AND larry OR moe AND shemp
         curly moe shemp           curly larry moe shemp
         curly larry moe shemp

5. moe AND shemp OR curly AND larry
         curly larry moe           curly larry moe shemp
         curly larry moe shemp

Query 1 is as expected. Query 2 apparently reduces to just ‘curly’ under the lucene parser and ‘curly AND larry’ under edismax (and query 3 similarly reduces to the two AND’d words). Queries 4 and 5 are…well, you can look at the debugQuery output to see what it gets, but not why. And then tell me how to explain it to a user.

Where does this leave us?

The good news is that both lucene and edismax behave predictably when you use parentheses for grouping. So do that.

I’m generally not one to complain about open-source software, at least partially because I don’t have the chops to do anything about it most of the time, but I don’t understand how this could seem OK to anyone. There are a couple lucene Jira tickets (Lucene-167 and Lucene-1823) and a 2005 mailing list thread denouncing the current behavior, but it persists.

Until the Solr/Lucene powers that be decide to tackle this, the rest of us will either have to write pre-parsers to make sure users get something sensible, or cripple our applications to disallow unrestricted boolean queries.
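As a taste of what the most naive possible pre-parser looks like, here is a sketch that groups ANDs before ORs by adding explicit parentheses. It is an illustration under heavy assumptions: it ignores NOT, quoted phrases, and any parentheses the user already typed, so it is nowhere near production-ready. The `group_ands` name is made up.

```ruby
# Split on OR, then wrap any chunk that contains an AND in parentheses,
# so that AND binds tighter than OR the way boolean algebra says it should.
def group_ands(query)
  chunks = query.strip.split(/\s+OR\s+/)
  return query.strip if chunks.size == 1 # no OR, nothing to disambiguate

  chunks.map { |chunk| chunk =~ /\s+AND\s+/ ? "(#{chunk})" : chunk }.join(' OR ')
end

puts group_ands('a OR b AND c')
puts group_ands('curly AND larry OR moe AND shemp')
```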

Even better, even simpler multithreading with JRuby

Bill Dueber — Fri, 01 Jul 2011 00:00:00 +0000

[Yes, another post about ruby code; I’ll get back to library stuff soon.]

Quite a while ago, I released a little gem called threach (for “threaded #each”). It allows you to easily process a block with multiple threads.

# Process a CSV file with three threads
File.open('data.csv').threach(3, :each_line) {|line| send_to_db(line)}

Nice, right?

The problem is that I could never figure out a way to deal with a break or an Exception raised inside the block. The core problem is that once a thread trying to push/pop from a ruby SizedQueue is blocking, there’s no way (I could find) to tell it to wake up and see if there’s an error from another thread floating around that needs to be addressed.

So, I got into a pattern of running my code with each for a while, debugging, and eventually doing the production run under threach. Which is just dumb. Then I’d try to re-write threach to deal with this stuff using different approaches (mutexes, lightweight events), quickly (or not so quickly) fail, give up, and start again.

So…let’s not worry about MRI for the moment. I run all my big jobs under JRuby these days anyway, and there I can take advantage of Java’s blocking queues that have timeouts. When a queue operation times out, I can check to see if there’s been a break or an exception thrown in the meantime and behave appropriately.
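The underlying idea (never block forever, and check for trouble every time you come up for air) can be approximated in plain ruby. To be clear, this is not how jruby_threach is implemented: it sidesteps blocking entirely by pre-loading the queue and using a non-blocking pop, and `each_in_threads` is a name invented for this sketch.

```ruby
require 'thread'

# Run the block against each item using `thread_count` worker threads.
# If any worker raises, the others stop at their next check and the
# first exception is re-raised in the calling thread.
def each_in_threads(items, thread_count)
  queue = Queue.new
  items.each { |item| queue << item }
  failure = nil
  lock = Mutex.new

  workers = Array.new(thread_count) do
    Thread.new do
      loop do
        break if lock.synchronize { failure } # someone else blew up
        begin
          item = queue.pop(true) # non-blocking; raises ThreadError when empty
        rescue ThreadError
          break # nothing left to do
        end
        begin
          yield item
        rescue Exception => e
          lock.synchronize { failure ||= e } # remember only the first failure
          break
        end
      end
    end
  end

  workers.each(&:join)
  raise failure if failure
end

each_in_threads([1, 2, 3, 4], 2) { |n| puts n * 10 }
```

A real producer/consumer setup still needs a queue operation that can time out, which is exactly what the Java queues provide.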

The result is the gem jruby_threach. It works just like threach, except that, you know, it actually works the way I’d like it to.

require 'jruby_threach'
File.open('data.csv').threach(3, :each_line) {|line| send_to_db(line)}

Looks familiar, doesn’t it.

But you can also break out of the loop.

myarray.threach(2) do |item|
  break if item_indicates_to_break(item)
  if item == :really_bad_value
    raise RuntimeError, "Something's really wrong", nil
  end
  process_item(item)
end

Any exceptions that are rescued within the block are handled internally and don’t cause processing to stop. Any that are not handled within the block are noticed by threach, cause the processing to stop, and are then re-raised so you can deal with them outside of threach.

reader = SpecializedFileReader.new(filename)

begin
  reader.threach(2) do |item|
    process_item(item)
  end
rescue SpecializedFileReaderError
  # deal with the fact that the reader failed
rescue Exception
  # deal with the problem processing the item
end

Dealing with the underlying Java data structures makes life a lot easier. To the point that I added an enhancement — threading production as well.

# Use two threads to read lines from files, and another three threads
# to process the data that comes out of those files.
Dir.glob("*.csv").map{|f| File.open(f)}.mthreach(2,3) do |item|
  send_item_to_database(item)
end

mthreach basically allows you to treat an array of Enumerables as a single logical entity, multithreading both the producer and consumer sides of the operation. There aren’t a whole lot of obvious use cases, but it can certainly come in handy.

You can also access the underlying class that aggregates multiple enumerables directly.

require 'jruby_threach'

me = Threach::MultiEnum.new(
  [enum1, enum2, enum3], # enumerables
  threads,               # How many threads to use to
  :each_with_index,      # the iterator to call on the enumerables
  size                   # size of the under-the-hood queue
)

# Note that like threach, calling #each against a MultiEnum actually
# calls the iterator you sent in (in this case, #each_with_index)
me.each {|item| process_item(item)}

How good is our relevancy ranking?

Bill Dueber — Wed, 25 May 2011 00:00:00 +0000

For those of us that spend our days trying to tweak Mirlyn to make it better, one of the most important — and, in many ways, most opaque — questions is, “How good is our relevancy ranking?”

Research from the UMich Library’s Usability Group (pdf; 600k) points to the importance of relevancy ranking for both known-item searches and discovery, but mapping search terms to the “best” results involves crawling deep inside the searcher’s head to know what she’s looking for.

So, what can we do?

Record interaction as a way of showing interest

One possibility is to look at those records that are somehow “touched” by a user in such a way that we can log it. If a user bothers to interact with an individual record, we’ll assume the record is interesting to her in the context of the current search.

There are three links associated with an individual record that a user can click on from the search results:

(62% of all record interactions) The title
(28%) An external link (HathiTrust, Google Books, or one of our vendors)
(10%) The “see holdings” link for those items that have multiple holdings

Our first issue arises quickly: only about a quarter of Mirlyn sessions contain any of these actions. For a full 75% of sessions, we have no data about which records users are paying attention to. They get a call number — or determine they have a failed search — and move on.

Where on the page do users interact with items?

We don’t know how users that interact with items differ from those that don’t. But for those that do, more than half of all record interactions are with the first record.

Here are the numbers for the first five records:

First record: 54%
Second record: 12%
Third record: 6%
Fourth record: 3.7%
Fifth record: 2.5%

More than 75% of all record interactions are with the first four items on the first page of results.

What does it all mean?

Frustratingly, we don’t know. Several possibilities are obvious:

we’re doing a good job with relevancy ranking
people do mostly known-item searches
people don’t bother looking past the first few results
excellent general search engines (e.g., Google) have trained people to believe that the first result is always worth a closer look.

The interactions between these (and unknown other) factors are likely complex.

In the meantime, though, to the extent these data can be extended to the general case (not at all obvious), we’re not doing too bad of a job.

A short ruby diversion: cost of flow control under Ruby

Bill Dueber — Tue, 03 May 2011 00:00:00 +0000

A couple days ago I decided to finally get back to working on threach to try to deal with problems it had — essentially, it didn’t deal well with non-local exits due to calls to break or even something simple like a NoMethodError.

[BTW, I think I managed it. As near as I can tell, threach version 0.4 won’t deadlock anymore]

Along the way, while trying to figure out how threads affect the behavior of different non-local exits, I noticed that in some cases there was still work being done by one or more threads long after there was an exception raised.

I re-discovered something that a lot of people already know: raise/rescue under MRI is slow, and under JRuby can be unbearably slow. How slow?

Let’s look at four simple blocks that exercise four different block exit strategies: break, catch and throw, raise with the normal single (or zero) arguments, as well as the three-argument version of raise.

Simple break

range.each do |i|
  break
end

Catch/Throw

catch(:benchmarking) do
  range.each do |i|
    throw(:benchmarking)
  end
end

Raise (1 arg)

begin
  range.each do |i|
    raise StandardError
  end
rescue
  # do nothing
end

Raise (3 args)

begin
  range.each do |i|
    raise StandardError, :hi, nil
  end
rescue
  # do nothing
end

In each case, we immediately exit the block without doing any work; the idea is to measure how long it takes to break out for each case.

So….let’s run them each 100K times and see what happens, shall we? Times are in seconds, averaged over two runs.

	Ruby 1.8	Ruby 1.9	JRuby	JRuby --1.9
break	0.12	0.07	0.29	0.21
catch/throw	0.35	0.28	0.64	0.48
raise (1 arg)	1.78	2.10	26.60	22.06
raise (3 args)	1.85	2.13	0.45	0.45
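If you want to try this on your own ruby, a harness along these lines should do it. It is a reconstruction using the standard Benchmark module, not the exact script behind the numbers above, and `time_exit_strategies` is a name made up here. Expect very different numbers on a modern VM.

```ruby
require 'benchmark'

# Time each of the four exit strategies over `iterations` runs.
# Returns a hash of strategy name => elapsed seconds.
def time_exit_strategies(iterations = 100_000)
  range = (1..10)
  strategies = {
    'break' => -> { range.each { |_i| break } },
    'catch/throw' => -> {
      catch(:benchmarking) { range.each { |_i| throw(:benchmarking) } }
    },
    'raise (1 arg)' => -> {
      begin
        range.each { |_i| raise StandardError }
      rescue StandardError
        # do nothing
      end
    },
    'raise (3 args)' => -> {
      begin
        # the trailing nil asks ruby not to build a backtrace
        range.each { |_i| raise StandardError, :hi, nil }
      rescue StandardError
        # do nothing
      end
    }
  }

  strategies.transform_values do |strategy|
    Benchmark.realtime { iterations.times { strategy.call } }
  end
end

time_exit_strategies.each do |name, seconds|
  puts format('%-15s %.2f', name, seconds)
end
```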

The first thing to note is that this is 100K iterations. Three of the strategies are fast enough that you’d have to work really, really hard to notice them.

In terms of speed, raise (3 args), catch/throw, and break are fast enough that you shouldn’t bother worrying about them (although you should choose the method that makes your code easy to understand).

The second thing to note is Holy Camoli! JRuby is slow there!

This Jira ticket tells the tale: The creation of the backtrace is very, very expensive for JRuby. That nil at the end of the raise (3 args) call suppresses the creation of that backtrace, so the speed is fine.

Three things worth saying here:

If you’re using raise/rescue for flow control, you’re already doing it wrong. Reserve exceptions for, well, exceptional conditions that are only going to be raised once or twice, not all the time.
If you’re writing code that, for some ungodly reason, is planning on raising a crapload of exceptions, use the three-arg version. I’m looking at you, gem authors.
If you’re writing your code without worrying about how it will work under multiple threads, well, please don’t do that. Everyone has multi-core systems these days, and it’s silly to not be able to use them. Plus, counting on Matz to never move to a VM with real threads is a big gamble.

ISBN parenthetical notes: Bad MARC data #1

Bill Dueber — Tue, 12 Apr 2011 00:00:00 +0000

Yesterday, I gave a brief overview of why free text is hard to deal with.

Today, I’m turning my attention to a concrete example that drives me absolutely batshit crazy: taking a perfectly good unique-id field (in this case, the ISBN in the 020) and appending stuff onto the end of it.

The point is not to mock anything. Mocking will, however, be included for free.

What’s supposed to be in the 020?

Well, for starters, an ISBN (10 or 13 digit, we’re not picky).

Let’s not worry, for the moment, about the actual ISBN and whether it’s valid or not.

Wait, no, let’s go ahead and worry about it. It’s an easy enough script to write, although it takes a while to run.

8,630,794  Total records
3,220,666  Total 020a's
    6,498  020a's that don't obviously contain an ISBN
    8,407  that look like an ISBN but fail checksum test

... so 0.26% of the ISBNs have invalid checksums
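The checksum test itself is short. Here is a sketch using the standard ISBN-10 (weighted sum mod 11, with X standing for 10) and ISBN-13 (alternating weights 1 and 3, mod 10) rules. It assumes the ISBN is the first whitespace-delimited chunk of the 020$a; the `valid_isbn?` name is just for illustration and this isn't the script I ran.

```ruby
# Returns true when the leading token of an 020$a looks like an ISBN
# and passes the checksum for its length.
def valid_isbn?(raw)
  isbn = raw.to_s.strip.split(/\s+/).first.to_s.delete('-').upcase

  case isbn
  when /\A\d{9}[\dX]\z/
    # ISBN-10: weights run 10 down to 1; a final X counts as 10
    sum = isbn.chars.each_with_index.sum do |ch, i|
      (ch == 'X' ? 10 : ch.to_i) * (10 - i)
    end
    (sum % 11).zero?
  when /\A\d{13}\z/
    # ISBN-13: weights alternate 1, 3, 1, 3, ...
    sum = isbn.chars.each_with_index.sum { |ch, i| ch.to_i * (i.even? ? 1 : 3) }
    (sum % 10).zero?
  else
    false # doesn't even look like an ISBN (pricing data, etc.)
  end
end

valid_isbn?('0-306-40615-2 (pbk.)') # good ISBN-10 with a parenthetical note
valid_isbn?('9780306406157')        # the same book as an ISBN-13
valid_isbn?('0306406153')           # bad check digit
```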

So, not bad at all, especially considering some of those are known to be bad, but are transcribed dutifully from the actual (mis-)printed book.

A lot of the malformed data (anything from which I can’t seem to extract something that looks like an ISBN) is pricing data, and most of it appears in system numbers that are close enough to each other that I presume it was just a bad batch.

What goes after the ISBN in the 020?

I’m no cataloger, of course, but it looks to me like the answer is “Something about how the book is bound together, or the publisher, unless you want to put something else there, and then, really, go ahead, because it’s not like anyone is ever going to want to parse this out, all we need to do is print cards with it for god’s sake.”

No, I kid, I kid! The actual rules are in Library of Congress Rule Interpretation 1.8, which reads, in part:

For a hardbound resource, there is no attempt to use a consistent term other than to use one that conveys the condition intelligibly.

I think it’s important to read that a second time, because it succinctly conveys the culture in which these rules were devised.

Don’t worry about consistency, because your only reader is human.
Defer to the cataloger.
Being complete is more important than being consistent.
Base your notes on your subjective view of the actual, physical item you’re presumed to be holding in your hands.

Interestingly (to me, anyway), it looks like the OCLC once had a (now deprecated) $$b subfield for binding information. Apparently it didn’t catch on.

What did I find?

So, let’s pretend I’d like to be able to differentiate between paperback and hardbound books. Probably useful, yes?

I went ahead and took all parenthetical notes from any subfield of the 020, split them on colon (’cause that seems to be the way they roll) and did some basic normalization:

Eliminate numbers (so ‘vol. 1’ and ‘vol. 2’ count as only one pattern)
Lowercase everything
Turn runs of spaces into a single space
Trim leading/trailing spaces
Remove any trailing punctuation
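Taken literally, those steps come out to something like the ruby below. It is a sketch, not my actual script: it assumes the 020 subfield value arrives as a plain string, and `normalize_notes` is a made-up name.

```ruby
# Pull out every parenthetical, split on colons, and normalize each piece.
def normalize_notes(subfield_value)
  notes = subfield_value.to_s.scan(/\(([^)]*)\)/).flatten
  pieces = notes.flat_map { |note| note.split(':') }

  normalized = pieces.map do |piece|
    piece.gsub(/\d+/, '')             # eliminate numbers
         .downcase                    # lowercase everything
         .gsub(/\s+/, ' ')            # runs of spaces become one space
         .strip                       # trim leading/trailing spaces
         .sub(/[[:punct:]]+\z/, '')   # remove trailing punctuation
         .strip
  end

  normalized.reject(&:empty?)
end

normalize_notes('0306406152 (pbk. : alk. paper)')
normalize_notes('0306406152 (Lib. Bdg.)')
```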

I found 1,506,729 parenthetical remarks in the 020 subfields of our catalog.

The top twenty most common entries using those normalizations are:

402537 pbk
387406 alk. paper
99260 v # (e.g., “v. 1”, “v. 22”, etc.)
82918 cloth
51125 hbk
42036 electronic bk
41360 acid-free paper
38792 hardcover
28913 set
20358 hardback
19160 ebook
16264 paper
15269 u.s
12770 hd.bd
11793 print
10625 lib. bdg
10520 hc
8772 est
7767 pb
7639 hard

The kicker? These are the top twenty of 13,374 unique parenthetical strings found in the 020 field. Many of them are publishers, or cities, or whatnot, but an awful lot of them are variations on “hardcover” and “paperback.”

For example, a quick search for anything that might be “hard” (regexp: /h[ar]{0,2}d/) got me started on a list. Here’s just the 90 examples from that list that start with ‘h’:

hard | hard adhesive | hard back | hard bd | hard book | hard bound | hard bound book | hard boundhard case | hard casehard copy | hard copy | hard copy set | hard cov | hard cover | hard covers | hard sewn | hard signed | hard-backhard-backcased | hard-bound | hard-cover | hard-cover acid-free | hardb | hard\cover | hardbach | hardback | hardback book | hardback cover | hardbackcased | hardbd | hardbk | hardbond | hardbook | hardboubd | hardbound | hardboundhardboundtion | hardc | hardcase | hardcopy | hardcopy publication | hardcov | hardcov er | hardcovcer | hardcove | hardcover | hardcover-alk. paper | hardcovercloth | hardcoverflexibound | hardcoverhardcoverwith cd | hardcoverr | hardcovers | hardcoversame | hardcoversame as above | hardcoverset | hardcovertion | hardcver | hardcvoer | hardcvr | harddback | harde | hardocover | hardover | hardpack | hardpaper | hardvocer | hardware | hd | hd bd | hd. bd | hd. bd. in slip case | hd. bd.in sl.cs | hd. bk | hd. cover | hd.bd | hd.bd. in box | hdb | hdbd | hdbk | hdbkb | hdbkhdbk | hdbnd | hdc | hdcvr | hdk | hdp | hdpk | hradback | hradcover | hrd | hrdbk | hrdcver | hrdcvr

And that’s after eliminating things like places of publication, strings like “with…”, “plus…”, “alk. paper”, etc.

“Yeah, but you have to understand that historically…”

Stop hiding behind that.

I understand that at one point in time it probably made sense (to someone at least) to do it this way. I can deal with that.

What I can’t accept is that as I type this there’s a cataloger doing this in this way. Today. April 2011. Some, what? maybe thirty years since computer-based OPACs became prevalent?

These sorts of problems were recognized ages ago and should have been dealt with. Add a subfield. Invent a controlled vocabulary. Don’t worry about the legacy data; it’s always going to suck.

But why are we still producing sucky data???

To sum up

The point is that there’s a better way to do this stuff. Lots and lots of better ways, in fact. Time I spend dealing with crappy data is time I don’t spend making relevancy ranking better, or building a better command language search option for my librarians, or working on ways to get a decent “more like this”.

The need is both dire and urgent; the latter because sooner or later we’re going to have to go to a “two state solution” with traditional MARC21 for many of our records and whatever comes next (RDA?) for the newer stuff. And every day we wait, that first category grows, and the growth rate keeps increasing.

And then there’s serials. Don’t talk to me about serials.

Why programmers hate free text in MARC records

Bill Dueber — Mon, 11 Apr 2011 00:00:00 +0000

One of the frustrating things about dealing with MARC (nee AACR2) data is how much nonsense is stored in free text when a unique identifier in a well-defined place would have done a much better job.

A lot of people seem to not understand why.

This post, then, is for all the catalogers out there who constantly answer my questions with, “Well, it depends” and don’t understand why that’s a problem.

Description vs Findability

I’m surprised — and a little dismayed — by how often I talk to people in the library world who don’t understand the difference between description and findability. AACR2 is clearly designed for description; once you’ve found a record, it does a pretty good job telling a human being what she’s looking at. With respect to a person who’s already got a copy of the record in her (virtual) hand, strings of text and reasonable abbreviations are…well, often good enough, let’s say.

But much of AACR2 is a giant mountain of fail when it comes to supporting findability — the ability for a machine to slice and dice the data in ways that can be mapped onto searches and transformations. What those of us on the business end of the computer need are well-defined values stuck into well-defined places that represent well-defined relationships.

Free text stuck on the end of a field fails all three of those criteria.

Machine Reasoning vs. Machine Parsing

When many people look at something like RDF, their first reaction is, “Great Googally Moogally! Just tell me the language! I don’t want to follow a chain of reasoning that’s seventeen steps long just to figure out the damn thing is in English!!!”

Of course you don’t. And you don’t have to. Someone — hopefully someone smarter than me — needs to write a program to do it. And we can.

Following all that logic — deriving relationships, figuring out eventual values, determining how to convert between various forms — is what I’ll call (for simplicity’s sake) machine reasoning. And machine reasoning — for the purposes of this discussion, anyway — is a solved problem. I’m not saying it’s not hard, and I’m not saying it might not take gobs of hardware resources. But we, the collective of humanity, know how to do it.

On the other hand, machine parsing — looking at all that free text that is sprinkled throughout our records and trying to turn it into something that is susceptible to machine reasoning — is vehemently not a solved problem. Even if you ignore all the misspellings, we’re still stuck with one-off abbreviations, lack of ordering, gobs of “local practice,” and iffy punctuation.

And, come to think of it, you can’t ignore the misspellings, either.

The point is this: good data trumps everything else. If there’s good, solid, well-defined data in computable places, we can (given some time) do damn near anything with it. If there’s human-entered, free-text, parenthetical-remark-type data, we’re pretty much stuck.

Examples?

Jonathan Rochkind just did a great post looking at LC call numbers, and how, well, they might be in a few different places, and may or may not be valid LC call numbers, and so on and on and on and on.

And my next post (hopefully tomorrow) will be an analysis of the first freetext in MARC I ever tried to deal with — the parenthetical remarks in the 020 (ISBN) field. If that doesn’t keep you up all night, well, I don’t know what will.

Does anyone use those prev/next/back-to-search links?

Bill Dueber — Wed, 03 Nov 2010 00:00:00 +0000

There’s a common problem among developers of websites that paginate, including OPACs: how do you provide a single item view that can have links that go back to the search (or to the prev/next item) without making your URLs look ugly?

The fundamental problem is that as soon as your user opens up a couple searches in separate tabs, your session data can’t keep track of which search she wants to “go back to” unless you put some random crap in the URL, which none of us want to do.

But let’s take three giant steps backwards before we throw a ton of resources at this problem, and ask, “Does anyone use those links”?

Data from Mirlyn, the University of Michigan OPAC

Here’s the data since February of 2010 for Mirlyn, our library OPAC.

Action	Count	Pct. of Basic Search count
Basic search (baseline)	1,446,881	100%
Previous record	1,347	0.09%
Next record	8,394	0.58%
Back to search	9,568	0.66%

For what it’s worth, I looked at these numbers by percentage of sessions as well, and the numbers come up a little higher: about 0.8% of all sessions included at least one click of the “Back to Search” button.

Given these numbers, I’m pretty sure I wouldn’t put a whole lot of effort into it. In general, next/prev record navigation only makes sense when you have a really, really small number of hits, anyway.

So…why not just disappear the links? I know people will complain, but hopefully our days of doing an enormous amount of work for …well, some tiny but vocal minority…are past.

Size/speed of various MARC serializations using ruby-marc

Bill Dueber — Wed, 29 Sep 2010 00:00:00 +0000

Ross Singer recently updated ruby-marc to include a #to_hash method that creates a data structure that is (a) round-trippable without any data loss, and (b) amenable to serializing to JSON. He’s calling it marc-in-json (even though the serialization is up to the programmer, it’s expected most of us will use JSON), and I think it’s the way to go in terms of JSON-able MARC data.

I wanted to take a quick look at the space/speed tradeoffs of using various means to serialize MARC records in the marc-in-json format compared to using binary MARC-21.

Why bother?

Binary MARC-21 is “broken” in that a lot of us have records that are so long (more than 99,999 bytes) it’s impossible to create a valid marc binary record. The standard alternative, MARC-XML, has huge filesizes (roughly 3 times as large) and runs a lot more slowly in every benchmark I’ve ever run. For ruby-marc, the penalty for using XML is further exaggerated because the serializer is based on REXML and is super-slow.

There have been a few proposals for a MARC data structure that can easily be serialized to JSON (I had my own, in fact), but the stuff Ross has done with marc-in-json is preferable in being (a) not a ton bigger in terms of file size, and (b) much easier to query from a NoSQL database using something like JSONPath or JSONQuery.

What I’m testing

For this test, I used:

marc21 binary This is the stock serialization / deserialization provided by ruby-marc.
YAJL for JSON YAJL is a very fast C-based JSON library. Here we’re using the Ruby bindings and calling Yajl::Encoder.encode(r.to_hash) to serialize and MARC::Record.new_from_hash(Yajl::Parser.parse(JSON)) to deserialize.
Msgpack The Msgpack project is explicitly designed to be “binary JSON” (smaller, faster, etc.) at the expense of human readability/editability. Again, this used the ruby bindings.

The benchmark and its results

I’m interested in how long it takes to serialize and deserialize a single record. My primary use-case is sticking a single record into Solr, and then pulling the string representation of that record out and turning it back into MARC.
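A minimal sketch of what one serialize/deserialize cycle looks like, under some loud assumptions: it uses Ruby's standard `json` library rather than YAJL or Msgpack, the record is a tiny hand-written hash in the marc-in-json shape rather than real data, and `json_round_trip` is a hypothetical helper name. It is not the benchmark I actually ran; it only shows the shape of the measurement.

```ruby
require 'json'
require 'benchmark'

# A tiny hand-written record in the marc-in-json shape (made-up data).
SAMPLE_RECORD = {
  'leader' => '00000nam a2200000 a 4500',
  'fields' => [
    { '001' => 'demo0001' },
    { '245' => { 'ind1' => '1', 'ind2' => '0',
                 'subfields' => [{ 'a' => 'An example title' }] } }
  ]
}.freeze

# Serialize one record to a JSON string and parse it straight back.
def json_round_trip(record_hash)
  JSON.parse(JSON.generate(record_hash))
end

# Time a batch of round trips, the same way the real benchmark times
# serialize + deserialize cycles per method.
CYCLES = 10_000
elapsed = Benchmark.realtime do
  CYCLES.times { json_round_trip(SAMPLE_RECORD) }
end
puts format('%d JSON round trips: %.3f s', CYCLES, elapsed)
```

Swapping in another serializer means replacing only the body of `json_round_trip`, which keeps the comparison honest.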

It’s entirely possible that trying to deal with a whole set of MARC records, whether as a JSON array of marc-in-json objects or as a set of newline-delimited JSON (or perhaps LDJSON or Msgpack) objects, would yield different results. The former is especially interesting, since to parse a large JSON array one needs to use a streaming parser, which will almost certainly have a different profile in both processing and memory use.

The ambitious can see the full source code of the benchmark.

Note that the following represent only the performance of ruby-marc and the particular serializers used. Other platforms or other libraries will certainly give different results!

Total of 18880 records run 20 times (377,600 serialize/deserialize cycles per method) on my Mac OSX desktop; comparisons are to MARC21-Binary.

SERIALIZING
Format	Time	Pct. of MARC Binary
MARC Binary	357.02 s	100%
YAJL	312.65 s	88%
Msgpack	266.26 s	75%

DESERIALIZING
Format	Time	Pct. of MARC Binary
MARC Binary	648.91 s	100%
YAJL	507.64 s	78%
Msgpack	459.73 s	71%

SERIALIZE + DESERIALIZE
Format	Time	Pct. of MARC Binary
MARC Binary	1005.93 s	100%
YAJL	820.29 s	82%
Msgpack	725.99 s	72%

SIZE
Format	Size	Pct. of MARC Binary
MARC Binary	31.15 MBytes	100%
Msgpack	42.00 MBytes	135%
JSON	55.99 MBytes	180%
XML	93.42 MBytes	300%

Analysis, such as it is

Obviously, there are size/speed tradeoffs. Nothing is as small as binary MARC21, but both YAJL and Msgpack are faster — significantly so for deserialization, which happens to be where I want the speed for my uses.

At 80% larger, the JSON serialization is quite a bit bigger, but it’s a hell of a lot smaller than MARC-XML and suffers none of the limitations of binary MARC.

For a closed system (i.e., you’re not worried about anyone else being able to read your data) such as a Blacklight installation, I’d be tempted to move to using JSON sooner rather than later.

Simple Ruby gem for dealing with ISBN/ISSN/LCCN

Bill Dueber — Mon, 13 Sep 2010 00:00:00 +0000

I needed some code to deal with ISBN10->ISBN13 conversion, so I put in a few other functions and wrapped it all up in a gem called library_stdnums.

It’s only 100 lines of code or so and some specs, but I put it out there in case others want to use it or add to it. Pull requests at the github repo are welcome.

Functionality is all as module functions, as follows:

ISBN

char = StdNum::ISBN.checkdigit(ten-or-thirteen-digit-isbn)
boolean = StdNum::ISBN.valid?(ten-or-thirteen-digit-isbn)
thirteenDigitISBN = StdNum::ISBN.convert_to_13(ten-or-thirteen-digit-isbn)
tenDigitISBN = StdNum::ISBN.convert_to_10(ten-or-thirteen-digit-isbn)
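For anyone curious what the ISBN10->ISBN13 conversion involves, here is a self-contained sketch of the standard arithmetic. It is not the gem's source; `isbn13_checkdigit` and `isbn10_to_13` are hypothetical names for illustration, and the sketch does not verify the incoming ISBN-10 check digit.

```ruby
# ISBN-13 check digit for the first 12 digits: weights alternate 1, 3, 1, 3...
def isbn13_checkdigit(first_twelve)
  sum = first_twelve.chars.each_with_index.sum do |ch, i|
    ch.to_i * (i.even? ? 1 : 3)
  end
  ((10 - sum % 10) % 10).to_s
end

# Convert a 10-digit ISBN to its 13-digit form. Returns nil when the input
# doesn't look like an ISBN-10 once hyphens and spaces are removed.
def isbn10_to_13(isbn10)
  clean = isbn10.delete('- ').upcase
  return nil unless clean =~ /\A\d{9}[\dX]\z/

  core = '978' + clean[0, 9] # drop the old check digit, add the 978 prefix
  core + isbn13_checkdigit(core)
end

puts isbn10_to_13('0-306-40615-2')
```

The old check digit is simply thrown away: it was computed mod 11 over ten digits, so it has no meaning in the 13-digit form.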

ISSN

char = StdNum::ISSN.checkdigit(issn)
boolean = StdNum::ISSN.valid?(issn)

LCCN

normalizedLCCN = StdNum::LCCN.normalize(lccn)

Again, there’s nothing special here — just letting folks know it’s out there.

Solr: Forcing items with all query terms to the top of a Solr search

Bill Dueber — Wed, 18 Aug 2010 00:00:00 +0000

[Note: I’ve since made a better explanation of, and solution for, this problem.]

Here at UMich, we’re apparently in the minority in that we have Mirlyn, our catalog discovery interface (a very hacked version of VuFind), set up to find records that match only a subset of the query terms.

Put more succinctly: everyone else seems to join all terms with ‘AND’, whereas we do a DisMax variant on ‘OR’.

Now, I’m actually quite proud of how our searching behaves. Reference desk anecdotes and our statistics all point to the idea that people tend to find what they’re looking for. I invite you to try our current configuration out — and, of course, let me know if something feels off to you. We have control of our OPAC now, and can actually fix things.

The “problem”: DisMax is weird

The DisMax algorithm is complex. Even if you ignore the fact that we weight some fields (title, author) much higher than others, a fundamental feature of DisMax is that it basically gives ranking based on the question, “What percentage of the words in the document match one of our query terms”?

Most of the time, that’s exactly what you want. In general, items that have all the keywords, and more of them, appear at the top of the search results.

But sometimes you can have just, say, two of your three search terms appearing like a rash all across a relatively short record, and it’ll pop to the top, appearing ahead of records that actually contain all three search terms. Or maybe three of four search terms appear in both title and author (highly-weighted fields) and the same thing happens.

And, yeah, it really happens.

An actual, real-life example

Searching for the three terms information AND architecture AND usability, explicitly requiring all three, gives 12 results.

The equivalent DisMax search (where only two of three need to be found) nets about 4300 results. Which is great — we’re casting a much wider net, with some pretty common words. That doesn’t matter so long as the most relevant results float to the top.

The kicker? The first time an item in the first set appears in the second is at record number 62. Our user is more than three pages in before she even sees a record that contains all three terms.

Again, most of the time, our current algorithm does really, really well in my opinion. But noticing this led to talk about artificially pushing all the “all terms are present” items to the top.

Pushing records that contain all the terms to the top

So, I wanted to:

Push records with all search terms to the top, but
…don’t otherwise change their scores. i.e., don’t otherwise re-order them in any way, ’cause I’m already happy with my ordering.

It turns out to be harder than I initially thought. I fought with my code for a whole day, then asked for help, and help was provided.

So, with special thanks to Jan Høydahl for his solution, we get this, in Ruby pseudocode:

andedTerms = allMyTerms.join(' AND ')
bf = 'map(query($qq),0,0,0,100000.0)' # Add this value to the ranking score
qq = "allFields:(#{andedTerms})" # Use this as the query

add bf and qq to your solr query

The qq is easy enough — it basically says that to get any relevancy score at all, the record must have all the terms in the allFields Solr field.

For the map, we want to say: “If the record matches all the terms, give it an extra 100K points. If not, don’t.”

The map takes 5 arguments:

An initial value. In this case, we’re getting the relevancy ranking score based on the qq query. Basically, items that don’t have all the terms will have a score of zero; items that do have all three terms will have something bigger than zero.
The beginning of range to compare to. In this case, 0.
The end of the range. Another zero, so basically, we’ll be seeing if our initial value is between 0 and 0, i.e., if it’s exactly 0.
The value to return if the initial value fits in the range — zero. So, if the record doesn’t have all the terms, return a 0.
The value to return if the initial value falls outside the given range. 100K — a random very-large number I picked.
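The five arguments are easier to see in code. This is a plain-Ruby stand-in that mimics the behavior described above, plus a helper that builds the two extra request parameters; `solr_map` and `all_terms_boost_params` are hypothetical names, and nothing here talks to Solr.

```ruby
# Stand-in for the five-argument map: return `target` when the value lies
# within [min, max], otherwise return `default`.
def solr_map(value, min, max, target, default)
  value >= min && value <= max ? target : default
end

# Build the qq and bf parameters from the user's search terms.
# The field name and the boost are the values used in this post.
def all_terms_boost_params(terms, field: 'allFields', boost: 100_000.0)
  {
    'qq' => "#{field}:(#{terms.join(' AND ')})",
    'bf' => "map(query($qq),0,0,0,#{boost})"
  }
end

# A record missing a term scores 0 on qq, so it gets no boost.
puts solr_map(0.0, 0, 0, 0, 100_000.0)
# A record with every term scores above 0 on qq, so it gets the boost.
puts solr_map(2.7, 0, 0, 0, 100_000.0)

puts all_terms_boost_params(%w[information architecture usability])
```

Because every all-terms record gets the same constant added, their order relative to each other is untouched, which is exactly the second requirement.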

And…?

I just pushed this to our beta site, and folks are still looking at it, but so far, it looks awesome. I’ll do a little update post if/when it goes into production. And if it doesn’t, I’ll say why.

Why RDA is doomed to failure

Bill Dueber — Fri, 23 Apr 2010 00:00:00 +0000

[Note: edited for clarity thanks to rsinger’s comment, below]

Doomed, I say! DOOOOOOOOOOMMMMMMMED!

My reasoning is simple: RDA will fail because it’s not “better enough.”

Now, those of you who know me might be saying to yourselves, “Waitjustaminute. Bill doesn’t know anything at all about cataloging, or semantic representations, or the relative merits of various encapsulations of bibliographic metadata. I mean, sure, he knows a lot about…err….hmmm…well, in any case, he’s definitely talking out of his ass on this one.”

First off, thanks for having such a long-winded internal monologue about me; it’s good to be thought of.

And, of course, you’re right on all counts. I don’t know what I’m talking about in any of those realms.

And yet I’m still willing to make a strong statement?

Yes. I am. Here’s why.

[Oh, and if you’re convinced I’m wrong — please say so. I’d love to be wrong about this.]

First, an assertion

The purpose of any bibliographic metadata is to facilitate three things:

Description/Identification. If you know what you want, does the metadata give you enough information to determine if the described item is what you want? Alternately, if you’re holding an item (or an alternate metadata representation of it), can you find the record that describes it?
Machine finding. Can a machine, given a good-enough query, find a work via a search of the metadata?
Machine grouping. Given the metadata, can a machine help a person find items “like this one”?

Take issue with one or more of those statements. I don’t care. The point I’m really trying to make is that any standard that doesn’t put unmediated machine reasoning at the forefront of what the metadata needs to support is living in a deep, deep hole.

Computer cycles are pretty cheap, and programmers are pretty smart. We can figure out how to do useful things with virtually any data, but only if we can reliably get at those data.

Getting 75% of the way there

Three-fourths of the problem can be addressed with one simple concept.

A solid equality relationship.

By this I mean that “=” had better damn well mean “equal,” as opposed to “probably the same, but there might be other representations, too.” If I want to say “A = B” (where A and B are authors, or works, or subjects, or anything that can be nailed down) there’d better be no false positives and no false negatives. Ever. MARC’s use of “hopefully-unique strings” is ridiculously insufficient in the modern era.

RDA does pretty well with this, with URIs for appropriate concepts, so that’s good.

What’s wrong with it?

Well, it’s gonna cost money to access the spec, for starters. That’s just dumb.

But it’s also not flexible/extensible enough. It’s true that I’m not a cataloger. I do have an MS in computer science, though, and there is stuff in the various versions of the RDA spec which leads me to believe that the committee desperately, desperately needed some hardcore geeks on it. Computer science has basically done nothing but develop methods for abstraction and composition for decades, and that isn’t reflected enough here.

Language such as, “If it is determined that a mechanism for providing a direct link between a note and the instance of the element to which it relates is required,…” worries me. if? IF????? That’s not a spec. That’s a guideline. Nail it down, for god’s sake. When is it appropriate or inappropriate? How do you add links to multiple (but not all) instances of the element?

The spec also seems to describe at least half a dozen kinds of titles. One of these is “Abbreviated title.” Do we really want an abbreviated title? No. We want a title with an “abbreviated” modifier, so we can use that same modifier for, say, a corporate name or publisher or anything else. [Note: see rsinger’s comment below, indicating this was a piss-poor example on my part.]

Well, sure, but it’s still better than the AACR2!

[This section updated to disambiguate my use of ‘MARC’ when I really meant ‘AACR2 as commonly talked about in terms of MARC tags’]

Of course it is. It’s just not better enough!

We’re not just talking about writing a spec. We’re talking about replacing every single tool in the library toolchain, from the ILS to editing software to OPACs to scripts that keep it all put together. We’ll be asking programmers to learn new skills and new ways of thinking, vendors to produce functional software for untested data formats, and catalogers to essentially take their whole brain out of their heads and get a new one.

But that, frankly, is the easy part. The entire culture of the library is built around AACR2 concepts and MARC data structures. The thought processes, the nomenclature, everything sometimes feels as if it’s built around three-digit tags. The majority of the (crucial!) specialized vocabulary librarians, and experts and specialists, use to communicate with each other is directly or indirectly tied to MARC.

So, yeah, RDA is a hellofa lot better than AACR2/MARC. But in my view, it’s not better enough to justify all the pain. Switching is incredibly, astoundingly expensive both in terms of cost and in terms of the devaluation of institutional knowledge. We can’t do it every few years. We need to be damn sure we’re getting it right.

Data structures and Serializations

Bill Dueber — Tue, 20 Apr 2010 00:00:00 +0000

Jonathan Rochkind, in response to a long (and, IMHO, mostly ridiculous) thread on NGC4Lib, has been exploring the boundaries between a data model and its expression/serialization (see here, here, and here) and I thought I’d jump in.

What this post is not

There’s a lot to be said about a good domain model for bibliographic data. I’m so not the guy to say it. I know there are arguments for and against various aspects of the AACR2 and RDA and FRBR, and I’m unable to go into them.

What I am comfortable saying is this:

Anyone advocating or dismissing a data model based on the data structure or serialization most-often associated with that model is missing the goddamn point.

Data serializations

…are boring. They’re unimportant at the data modeling stage, and only barely important when thinking about data structures. For any given data structure there are lots of ways you can serialize it. A standard programming-language hash can be represented in a zillion ways, for example: yaml, json, various programming languages, .ini files, etc. Even MARC has two standard serializations (binary and xml) with several more actually in use (Aleph Sequential, for example).

So, let me repeat again, serializations are boring and not worth talking about until you’ve got everything else nailed down. Any format you can round-trip your data structure to/from is fine.

Serializations are measured from “less pain” to “more pain”, but all have the exact same expressiveness. Data structures, on the other hand, do not.

A hierarchy of data structures

Think about the following data structures:

An ordered list
key-value pairs
A hierarchy (e.g., an XML document)
An undirected graph
A directed graph
A labeled, directed multigraph (e.g., a set of RDF Triples)

You don’t have to think very hard to see that any of these can be viewed as a restricted version of the data structures that follow it in the list. An ordered list (array) is just a set of key-value pairs where the keys represent each item’s sequence. A set of key-value pairs is a very, very flat hierarchy. A hierarchy is an undirected graph without cycles. An undirected graph is a directed graph where you’re careful to make links both ways. And a directed graph can easily be represented as a set of RDF triples (where you may, for example, only have one label for your relationships: “links to”).

[Note that I didn’t say any of these would be efficient implementations!]
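Three of those restatements, sketched in Ruby just to make the direction of the hierarchy concrete. The method names are made up for illustration, and as noted, none of this is meant to be an efficient representation.

```ruby
# An ordered list viewed as key-value pairs: the keys are the positions.
def list_to_pairs(list)
  list.each_with_index.map { |item, i| [i, item] }.to_h
end

# An undirected graph viewed as a directed one: store every link both ways.
def undirected_to_directed(edges)
  edges.flat_map { |a, b| [[a, b], [b, a]] }.uniq
end

# A directed graph viewed as triples with a single label, "links to".
def directed_to_triples(edges)
  edges.map { |from, to| [from, 'links to', to] }
end

p list_to_pairs(%w[first second third])
p undirected_to_directed([%w[x y]])
p directed_to_triples([%w[x y], %w[y z]])
```

Each conversion goes toward the more expressive structure without needing any extra information. Going the other way (say, from arbitrary triples back to a flat list) is where the out-of-band knowledge comes in.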

The reverse is not true — or, at least, not without an incredible amount of “out of band” information in another layer somewhere.

The structures at the end of the list have more expressiveness. You can just plain model more things in them (give-or-take the out-of-band stuff, composition, etc) per unit of screwing around. I’m not going to try to model my set of key=value pairs in an array. I could do it, but it would take so much of my attention that the data modeling would suffer.

Don’t handicap yourself

Don’t start with the data structure.

DON’T START WITH THE DATA STRUCTURE!

GET THAT MOTHER-FREAKIN’ DATA STRUCTURE OFF MY MOTHER-FREAKIN’ PLANE!

Seriously. Don’t be stupid. If all you’ve got is a hammer, everything starts to look like a thumb.

If you start off with a restrictive data structure before you even fully define the domain you’re trying to model, you may hose yourself. You may end up making stupid decisions based on the toolchain you’re imagining in your head.

Domain modeling is ridiculously hard for any domain worth modeling. If you start with a handicap (a restrictive data structure) it’s going to be even harder.

No one would think of trying to model bibliographic data using only arrays. That’s premature optimization on an epic scale.

The appeal of RDF Triples

Even if you ignore all the semantics and rules that make RDF Triples a value-added instance of a labeled, directed multigraph, the appeal (to me, anyway) is that any semantic model based on RDF Triples has enormous expressive power at its disposal.

Does it turn out that after you’ve fully satisfied the necessary model for the domain, the semantics you need can actually be accomplished with something earlier in the list? Awesome. Go with it. You’ll get great implementations with good real-life computing characteristics. A database can often usefully be thought of as an implementation of an undirected graph with typed nodes (and, perhaps, some typed links, if you use the column name in the calling table as a “type” of sorts, and add some out-of-band knowledge). And lord knows RDBMS’s have great performance characteristics.

But don’t start there. Start with the domain. Model it. Figure out what you need to describe and derive. Then pick the most appropriate data structure.

The nightmare that is MARC

MARC-the-data-structure (not to be confused with a serialization of that data structure, on the one hand, or with the AACR2 on the other) can incompletely (but usefully, I think) be described as:

A set of key-value pairs
…that have a defined order
…where keys can be repeated
…and values are strings
…and keys are a concatenation of tag/ind1/ind2/code

Control fields are especially restricted (ind1, ind2, and code are all ‘null’). There’s been some bullshit attempts at links (e.g., the 880 fields) but really, this is it.
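To make that description concrete, here is a rough Ruby sketch of MARC-the-data-structure as I've just described it. The `MarcPair` struct, the `values_for` helper, and the sample values are all hypothetical; this is not how ruby-marc or marc4j model a record.

```ruby
# One entry in the ordered list: the key is tag/ind1/ind2/code,
# and the value is always a string.
MarcPair = Struct.new(:tag, :ind1, :ind2, :code, :value) do
  def key
    [tag, ind1, ind2, code]
  end
end

# Order matters and keys may repeat, so an array (not a hash) holds the pairs.
RECORD = [
  MarcPair.new('001', nil, nil, nil, 'demo0001'), # control field: all nulls
  MarcPair.new('245', '1', '0', 'a', 'An example title'),
  MarcPair.new('650', ' ', '0', 'a', 'Economics'),
  MarcPair.new('650', ' ', '0', 'a', 'Philosophy') # same key, repeated
].freeze

# Every value stored under one key, in record order.
def values_for(record, tag, ind1, ind2, code)
  record.select { |pair| pair.key == [tag, ind1, ind2, code] }.map(&:value)
end

p values_for(RECORD, '650', ' ', '0', 'a')
```

Notice what isn't there: no nesting, no typed values, and no way for one pair to point at another.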

It doesn’t give us much to work with. It’s restricted. And, sadly, so is our thinking.

Putting the cart before the horse

As Jonathan (and zillions of others) rightly point out, a huge problem in the library world is that there are generations (plural) of working librarians who, because of years of practice, find it incredibly hard to think about bibliographic data as modeled outside the constraints inherent in the MARC data structure. It’s a handicap. It’s an anchor around our necks.

MARC-the-data-model (nee AACR2) is not inherently bad because it’s built on an impoverished data structure. It’s bad because it does a shitty job at modeling the bibliographic data space. If we could produce a good model in a crappy data structure like that, well, that’d be awesome because it would indicate that things are simple.

Things, of course, aren’t simple. They’re hard.

So, if you want to complain about MARC or RDA or FRBR, figure out what it’s trying to model and talk about the fidelity of the model with respect to the problem space. But don’t conflate data models, data structures, and serializations.

Oh, and don’t say “PIN Number” or “ATM Machine.” That drives me crazy, too.

Stupid catalog tricks: Subject Headings and the Long Tail

Bill Dueber — Tue, 13 Apr 2010 00:00:00 +0000

Library of Congress Subject Headings (LCSH) in particular.

I’ve always been down on LCSH because I don’t understand them. They kinda look like a hierarchy, but they’re not really. Things get modifiers. Geography is inline and …weird.

And, of course, in our faceting catalog when you click on a linked LCSH to do an automatic search, you often get nothing but the record you started from. Which is super-annoying.

So, just for kicks, I ran some numbers.

The process

I extracted all the 650 fields with indicator2=“0” from our catalog, threw away the subfield 6’s, and threw away any trailing punctuation in any of the subfields. I called the concatenation of what was left a unique LCSH.

Then I printed them out and put them all onto index cards, using tick-marks to indicate…

No, of course not. I used sort, uniq -c, and wc -l. Here’s what I found.
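For the shell-averse, roughly the same counting can be sketched in Ruby. Everything here is an assumption for illustration: the input shape (an array of [code, value] subfield pairs per 650 field), the sample data, the exact set of characters treated as trailing punctuation, and the helper names.

```ruby
# Made-up input: one array of [code, value] subfield pairs per 650 field.
SAMPLE_FIELDS = [
  [['a', 'Economics'], ['x', 'History.']],
  [['a', 'Economics'], ['x', 'History'], ['6', '880-01']],
  [['a', 'Piano music.']]
].freeze

# Drop subfield 6, strip trailing punctuation, and join what is left.
def heading_key(subfields)
  subfields
    .reject { |code, _| code == '6' }
    .map { |code, value| '$$' + code + value.sub(/[\s.,;:\/]+\z/, '') }
    .join
end

# Rough equivalent of `sort | uniq -c`: heading => number of occurrences.
def heading_counts(fields)
  fields.each_with_object(Hash.new(0)) do |subfields, counts|
    counts[heading_key(subfields)] += 1
  end
end

counts = heading_counts(SAMPLE_FIELDS)
puts "total headings:  #{counts.values.sum}"
puts "unique headings: #{counts.size}"
puts "appearing once:  #{counts.values.count(1)}"
```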

Counts of LCSH

…in round numbers.

In our catalog, there are:

8.50M subject headings (using the definition above)
1.87M unique subject headings
…66% of which (1.23M) appear exactly once

We only have to go out to 30K subjects to account for half of all subject entries. The top 1000 most-used subjects account for 14.5% of all 8.5M subject entries.

The top ten subjects by count are:

6029 $$aSermons, American
6131 $$aPhilosophy
7224 $$aFeature films
7591 $$aPiano music
7968 $$aSocialism
8796 $$aEconomics
9185 $$aCommunism
12440 $$aSermons, English$$y17th century
13539 $$aBills, Private$$zUnited States
58823 $$aEconomics$$xHistory$$vSources

From a record’s point of view

Our catalog has:

7M records
4.4M records with at least one subject (as defined above)
2.4M records with more than one subject
2.0M records with exactly one subject
2.6M records with zero subjects

The records with the most subject headings tend to be collections of stuff (theses, photos, etc). Our local standout is the Dept. of Medicine and Surgery (University of Michigan) theses, 1851-1878 with 208 subject entries. 14 records have at least 30 subject entries.

What it means

Gee, lady, I don’t know.

One way to look at it: suppose you’re considering defining subjects in this way, and making them “hot” in the catalog interface. For our data, 2/3 of records would have either no subjects or a subject that found only the record you’re at. So…think again.

In real life, we index lots of possible subject fields, and we additionally index the $$a as well as the whole string, so ours are a little bit more useful. A little.

Why bother with threading in jruby? Because it’s easy.

Bill Dueber — Fri, 12 Mar 2010 00:00:00 +0000

[Edit 2011-July-1: I’ve written a jruby-specific version of threach, called jruby_threach, that takes advantage of better underlying java libraries and is a much better option if you’re running jruby]

Lately on the #code4lib IRC channel, several of us have been knocking around different versions (in several programming languages) of programs to read in a ginormous file and do some processing on each line. I noted some speedups related to multi-threading, and someone (maybe rsinger?) said, basically, that to bother with threading for a one-off simple program was a waste.

Well, it turns out I’ve been trying to figure out how to deal with threading in jruby anyway. And I think I have a pretty elegant solution — a generic “threaded each” I’m calling threach.

   enumerable_object.threach(number_of_threads, :which_iterator) do |i|
     do_something_threadsafe(i)
   end

Some examples

   # You like #each? You'll love...err..probably like #threach
   load 'threach.rb'

   # Process with 2 threads. It assumes you want 'each'
   # as your iterator.
   (1..10).threach(2) {|i| puts i.to_s}

   # You can also specify the iterator
   File.open('mybigfile') do |f|
     f.threach(2, :each_line) do |line|
       processLine(line)
     end
   end

   # threach does not care what the arity of your block is
   # as long as it matches the iterator you ask for

   ('A'..'Z').threach(3, :each_with_index) do |letter, index|
     puts "#{index}: #{letter}"
   end

   # Or with a hash
   h = {'a' => 1, 'b'=>2, 'c'=>3}
   h.threach(2) do |letter, i|
     puts "#{i}: #{letter}"
   end

threach.rb adds to the Enumerable module to provide a threaded version of whatever enumerator you throw at it (each by default).

How does it work?

How about I just put the source here. It’s short.

   require 'thread'
   module Enumerable

     def threach(threads=0, iterator=:each, &blk)
       if threads == 0
         # Just call the iterator itself
         self.send(iterator, &blk)
       else
         bq = SizedQueue.new(threads * 4)
         consumers = []
         threads.times do |i|
           consumers << Thread.new do
             until (a = bq.pop) === :end_of_data
               blk.call(*a)
             end
           end
         end

         # The producer
         count = 0
         self.send(iterator) do |*x|
           bq.push x
           count += 1
         end
         # Now end it
         threads.times do
           bq << :end_of_data
         end
         # Do the join
         consumers.each {|t| t.join}
       end
     end
   end

That’s it. If threads=0, just use the iterator itself. If not:

Create a SizedQueue. It is thread-safe by definition and acts as the glue between the consumers and the main-thread producer.
Start a set of consumer threads that basically just pull an item out of the queue and then run the given block on it. Bail when you see the end_of_data token. These consumer threads all immediately block because there’s nothing in the SizedQueue yet.
Populate the SizedQueue. When you run out of stuff to add, push on an end_of_data token for each consumer thread.
Call join on the threads so the main program stays around until all of them have finished.

Why use it?

Well, if you’re using stock ruby — you probably shouldn’t. It’ll just slow things down. But if you’re using a ruby implementation that has real threads, like JRuby, this will give you relatively painless multi-threading.

You can always do something like:

   if defined? JRUBY_VERSION
     numthreads = 3
   else
     numthreads = 0
   end

   my_enumerable.threach(numthreads) {|i| ...}

Note the “relatively” up there. The block you pass still has to be thread-safe, and there are many data structures you’ll encounter that are not thread-safe. Scalars, arrays, and hashes are, though, under JRuby, and that’ll get you pretty far.
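As an illustration of what "thread-safe" asks of your block, here is a small sketch that uses plain Ruby threads (not threach) and a Mutex around the one piece of shared state. `threaded_sum` is a hypothetical helper, and guarding a read-modify-write like `+=` with a lock is my own cautious habit rather than something JRuby is documented here to require.

```ruby
# Sum the numbers using several threads. Each thread works on its own
# slice, and only the update of the shared total is guarded by the Mutex.
def threaded_sum(numbers, thread_count)
  total = 0
  lock = Mutex.new
  slice_size = [(numbers.size / thread_count.to_f).ceil, 1].max

  workers = numbers.each_slice(slice_size).map do |slice|
    Thread.new do
      partial = slice.sum                   # thread-local work needs no lock
      lock.synchronize { total += partial } # shared state does
    end
  end
  workers.each(&:join) # wait for every worker before reading the total
  total
end

puts threaded_sum((1..100).to_a, 4)
```

The same pattern applies inside a threach block: do as much as you can with local variables, and lock only around whatever is shared.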

Pushing MARC to Solr; processing times and threading and such

Bill Dueber — Thu, 04 Mar 2010 00:00:00 +0000

[This is in response to a thread on the blacklight mailing list about getting MARC data into Solr.]

What’s the question?

The question came up, “How much time do we spend processing the MARC vs trying to push it into Solr?”. Bob Haschart found that even with a pretty damn complicated processing stage, pushing the data to solr was still, at best, taking at least as long as the processing stage.

I’m interested because I’ve been struggling to write a solrmarc-like system that runs under JRuby. Architecturally, the big difference between my stuff and solrmarc is that I use the StreamingUpdateSolrServer (on Erik Hatcher’s suggestion). So I thought I’d check how things break down for me.

Here are my numbers running under JRuby (using MARC4J as the marc implementation) with the Solr StreamingUpdateSolrServer. Obviously, there are a lot of differences between this and solrmarc, but I’m hoping that while it’s not comparing apples to apples, it’s at least comparing apples to some sort of processed cheese-like product.

What work is being done on what?

The data set is a file of 18,881 MARC records in marc-binary format. It’s probably not big enough to get a great idea of how things will run over the long (many millions of records) haul, but it’ll do for this rough-cut stuff.

I break my processing down into five categories:

Read the records into marc4j objects and do nothing. This is a baseline of sorts.
The “normal” fields are anything that you could do with SolrMarc without a custom routine; the actual processing is done in JRuby.
Custom fields are generated with JRuby code, but these are things that in solrmarc would require a custom routine.
The big “allfields” field is text from tags 100 through 900.
The “to_xml” routine is just calling the underlying marc4j XML output and stuffing it into a string.

The schema used is our normal UMICH schema except for High Level Browse (which appears in our catalog as “Academic Discipline”). The code for that is written in Java, and I just call it from JRuby when I’m using it. I excluded it because it’s incredibly expensive, both at startup time (when it loads a giant database of call-number ranges and associated categories) and for processing: there’s a lot of call-number normalization, long-string comparisons, some modified binary searches, etc. etc. etc. It’s expensive. Trust me.

The Solr server itself is on a different, incredibly-beefy machine, and is emptied out before each invocation that involves actually pushing data to it (with a delete-by-query :).

How fast were things on my desktop?

18,881 records in marc-binary format
Times are in seconds, run on my desktop
Remember, you can’t compare these numbers to Bob’s because we’re doing different things to different data.

Total Seconds	Description
19	Just read the records with marc4j and do nothing.
85	Read and do 35 “normal” fields (no custom)
104	Read, 35 normal, 15 custom fields
110	Read, normal, custom, allfields
129	Read, normal, custom, allfields, to_xml
136	Read, normal, custom, allfields, to_xml, 2-threaded SUSS, commit every 5K docs
142	Read, normal, custom, allfields, to_xml, 1-threaded SUSS, commit every 5k docs
124	Read, normal, custom, allfields, to_xml, 1-threaded SUSS, commit every 5k docs, 2 threads doing processing

We can also break the same numbers down as:

Seconds	Description
19	read the records and do nothing
66	process the 35 normal fields
19	process the 15 custom fields
6	generate the “allfields” field
19	generate the XML (yowza!)
7	send to solr with two threads
13	send to solr with one thread
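The per-stage numbers are just differences between consecutive rows of the cumulative table. A quick sketch of that arithmetic, with the labels shortened by me and the constant and method names invented for the example:

```ruby
# Cumulative totals from the first table, in the order the stages were added.
CUMULATIVE = [
  [19,  'read'],
  [85,  'normal fields'],
  [104, 'custom fields'],
  [110, 'allfields'],
  [129, 'to_xml']
].freeze

# Turn running totals into the cost of each individual stage.
def stage_costs(cumulative)
  previous = 0
  cumulative.map do |total, label|
    cost = total - previous
    previous = total
    [cost, label]
  end
end

stage_costs(CUMULATIVE).each { |cost, label| puts "#{cost}\t#{label}" }

# The Solr cost is whatever is left over once all processing (129s) is subtracted.
puts "#{136 - 129}\tsend to solr with two threads"
puts "#{142 - 129}\tsend to solr with one thread"
```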

Or like this:

Seconds	Description
129	do all the reading and processing
13	send to solr with one thread

Why does solr processing seem so much faster for me?

There are a lot of reasons why my submit-to-solr might seem like less of a burden. The ones I can think of off the top of my head are:

SUSS is just faster than whatever solrmarc does.
My processing stage is so much slower than solrmarc’s (due to algorithms or jruby-vs-java, I don’t know) that the “push to solr” portion of it gets swallowed up by the slowness of the overall code.
The Solr server is so much faster than my desktop that my poor little desktop can’t send it data fast enough to work it.

For my setup, obviously adding a processing thread is a lot more beneficial than adding a SUSS thread. My desktop doesn’t have that many threads lying around (adding a third processing thread actually slowed things down), so I moved the code to a beefier machine to see what happened.

Trying the same thing on a beefy machine

This is the exact same code and data, but on a beefy machine (16 cores, gobs of memory).

time	SUSS Threads	Processing Threads
70	1	1 (was 142 seconds on the desktop)
47	1	2
39	1	3
35	1	4
68	2	1
48	2	2
38	2	3
34	2	4

So, on my hardware anyway, there’s a sweet spot with one suss thread and three processing threads. YMMV, of course.

What have we learned?

I’m not sure, to be honest. It’s logistically difficult for me to do the same process in solrmarc because I’d have to rebuild everything without the HLB stuff. I guess for me, what I’ve learned is that if I’m going to continue working on my code, the places to focus my attention are threading (obviously) and MARC-XML generation.

ruby-marc with pluggable readers

Bill Dueber — Tue, 02 Mar 2010 00:00:00 +0000

I’ve been messing with easier ways of adding parsers to ruby-marc’s MARC::Reader object. The idea is that you can do this:

   require 'marc'
   require 'my_marc_stuff'

   mbreader = MARC::Reader.new('test.mrc') # => Stock marc binary reader
   mbreader = MARC::Reader.new('test.mrc', :readertype=>:marcstrict) # => ditto

   MARC::Reader.register_parser(My::MARC::Parser, :marcstrict)
   mbreader = MARC::Reader.new('test.mrc') # => Uses My::MARC::Parser now

   xmlreader = MARC::Reader.new('test.xml', :readertype=>:marcxml)

   # ...and maybe further on down the road

   asreader = MARC::Reader.new('test.seq', :readertype=>:alephsequential)
   mjreader = MARC::Reader.new('test.json', :readertype=>:marchashjson)

A parser need only implement #each and a module-level method #decode_from_string.
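To make that contract concrete, here is a toy version of the idea that runs without ruby-marc at all. ToyReader, LineParser, new_reader, and the tab-separated "record" format are all invented for this sketch; none of them are ruby-marc's actual API.

```ruby
require 'stringio'

# Hypothetical parser: one "record" per line, tag and value separated by a tab.
class LineParser
  include Enumerable

  # Module-level decoder: turn one serialized record into an object.
  def self.decode_from_string(str)
    tag, value = str.chomp.split("\t", 2)
    { tag: tag, value: value }
  end

  def initialize(io)
    @io = io
  end

  # The only instance method a parser has to provide.
  def each
    return enum_for(:each) unless block_given?
    @io.each_line { |line| yield self.class.decode_from_string(line) }
  end
end

# Hypothetical registry standing in for the reader class.
class ToyReader
  @parsers = {}

  class << self
    def register_parser(parser_class, readertype)
      @parsers[readertype] = parser_class
    end

    def new_reader(io, readertype:)
      @parsers.fetch(readertype).new(io)
    end
  end
end

ToyReader.register_parser(LineParser, :lines)
reader = ToyReader.new_reader(StringIO.new("001\tabc\n245\tA title\n"), readertype: :lines)
reader.each { |record| puts record.inspect }
```

The registry only needs to know a symbol and a class, which is what makes swapping in a stricter or faster parser a one-line change.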

Read all about it on the github page.

Setting up your OPAC for Zotero support using unAPI

Bill Dueber — Fri, 06 Nov 2009 00:00:00 +0000

unAPI is a very simple protocol to let a machine know what other formats a document is available in. Zotero is a bibliographic management tool (like Endnote or Refworks) that operates as a Firefox plugin. And it speaks unAPI.

Let’s get them to play nice with each other!

How’s it all work?

Zotero looks for a well-constructed <link> tag in the <head> of the page
It checks the document on the other side of that link to see what formats are offered, and picks one to use. No, you can’t decide which one it uses. It picks.
Zotero then looks for IDs in the body of the page
If both are found and everything seems kosher, Zotero will offer the option to import some or all of the records.

What you’ll need

An OPAC whose output you can futz with
Access to an individual record’s ID in that output
A URL based on the ID that gives an RIS representation of the records
A screwdriver. Made with decent — but not too expensive — vodka and fresh orange juice.

Yes. I’m cheating.

I have all those things already. Hence, this is easy for me. If you had to, say, write some sort of weird redirection script because IDs are not first-class citizens in your OPAC’s URL scheme, or write an RIS export tool by hand, well, this will take you a bit longer.

The process

1. Build an unAPI target script

You need a script that’ll do three things:

With no arguments, return a list of available formats in general
With one argument, id=, return a list of formats available for that item. This will likely be exactly the same as #1.
With two arguments, id= & format=, return the record identified by id in the requested format

Mine looks like this:

   // id is of the form urn:bibnum:000000000
   $id = isset($_REQUEST['id'])? $_REQUEST['id'] : false;

   // Format, at this point, had better be 'ris'
   $format = isset($_REQUEST['format'])? $_REQUEST['format'] : false;

   // Got neither? Return the general list
   if (!($id || $format)) {
     header('Content-type: application/xml');
     echo '                      ';
     exit;
   }

   // Got just the id? Return formats for that ID
   if ($id && !$format) {
     header('Content-type: application/xml');
     echo '                      ';
     exit;
   }

   // Otherwise...

   // Parse out the actual numeric part of the id from the urn: prefix
   preg_match('/^urn:bibnum:(.*)$/', $id, $match);
   $actualID = $match[1];

   // Again: format had better be 'ris' because that's all I'm supporting at this point.
   header("Location: /Search/SearchExport?id=$actualID&method=$format", true, 302);

You can see that a <format> is just a name, a mime-type, and an optional reference to documentation on the type.

I take advantage of my existing RIS export process in the redirect, at the bottom. I also built in the possibility that other types of numbers could come in — I’m hard-coding ‘bibnum’ for the moment, but could allow, say, “oclc” or “isbn” or whatnot, too.
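For anyone not running PHP, the same three-way dispatch can be sketched in ruby as a function from request parameters to a response. This is only an illustration: unapi_response and formats_xml are made-up names, the XML layout and the RIS mime-type are my reading of the unAPI spec rather than a copy of the script above, the 404 for a malformed id is my own addition, and the export path is simply the one from the redirect above.

```ruby
require 'uri'

# Assumed format list; the mime-type is the one commonly used for RIS.
FORMATS = [{ name: 'ris', type: 'application/x-research-info-systems' }].freeze

# Build the <formats> document, with an id attribute when one was given.
def formats_xml(id = nil)
  id_attr = id ? " id=#{id.encode(xml: :attr)}" : ''
  entries = FORMATS.map { |f| %(  <format name="#{f[:name]}" type="#{f[:type]}"/>) }
  (["<formats#{id_attr}>"] + entries + ['</formats>']).join("\n")
end

# params is a plain hash of query parameters, e.g. {'id' => ..., 'format' => ...}
def unapi_response(params)
  id = params['id']
  format = params['format']

  # Missing id and/or format: return the list of formats (per-id if we have one).
  unless id && format
    return { status: 200,
             headers: { 'Content-Type' => 'application/xml' },
             body: formats_xml(id) }
  end

  # Both present: strip the urn: prefix and redirect to the existing export.
  match = /\Aurn:bibnum:(.+)\z/.match(id)
  return { status: 404, headers: {}, body: '' } unless match

  query = URI.encode_www_form('id' => match[1], 'method' => format)
  { status: 302, headers: { 'Location' => "/Search/SearchExport?#{query}" }, body: '' }
end

puts unapi_response({})[:body]
puts unapi_response('id' => 'urn:bibnum:000123456', 'format' => 'ris')[:headers]['Location']
```

Returning a plain hash keeps the logic independent of whatever web framework ends up serving it.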

2. Tell your OPAC where the script lives

You’ll need a <link> line in the <head> section of all your pages that might have an ID on them:

Everything should be left alone except for the actual href.

3. Add your IDs to the HTML

In the HTML of your page, you can add one or more tags of the form:

(where the title of the <abbr> conforms to what you’re expecting in your script).

You can put stuff inside the <abbr> but you need not. On a single-record page, you should have (I would think) only one of these things. On a search results page, you may decide to not have any, or you may decide to have one for each search result.

4. Final step

Drink your screwdriver.

Where can I see it?

Well…here’s the thing.

You can take a look at my test instance, http://dueberb.vufind.lib.umich.edu/ and play there. You can not see it in production, because there’s a little problem.

Our old OPAC — now dubbed mirlyn-classic — had a custom translator written for it. And it worked fine, and that was great.

But now we’ve got this new software running at mirlyn.lib.umich.edu, and Zotero keeps on using the old translator no matter what you do. The only way to override it is to actually fire up sqlite3 and remove the conflicting entry from the zotero translators table. And then never update that table again.

I’ve asked around about getting it fixed (changing the target URL for the old translator to point at mirlyn-classic) but it’s Friday, and no one is around. Hopefully soon.

Benchmarking MARC record parsing in Ruby

Bill Dueber — Thu, 17 Sep 2009 00:00:00 +0000

[Note: since I started writing this, I found out Bess & Co. store MARC-XML. That makes a difference, since XML in Ruby can be really, really slow]

[UPDATE It turns out they don’t use MARC-XML. They use MARC-Binary just like the rest of us. Oops. ]

[UP-UPDATE Well, no, they do use MARC-XML. I’m not afraid to constantly change my story. This is why I’m the best investigative reporter in the business]

The other day on the blacklight mailing list, Bess Sadler wrote

Yes, we do still include the full marc record, but the rule of thumb we’re currently using is that anything that needs to display in the index view (the search results) needs to be broken out into a separate display field, because retrieving and parsing marc records for every item in a list of search results is too much of a performance hit.

This surprised me a fair bit, because in our implementation of VuFind (which uses PHP, versus Ruby for Blacklight) I do just that — grab the MARC out of Solr, parse it, and pull stuff like full titles and such out of it.

As it turns out, I’d been screwing around with calling marc4j from jruby, anyway, so I threw that into the mix, and here’s what I found.

What the benchmark tries to measure

The focus is on measuring time to parse MARC records as returned in a field from Solr in MARC-binary.

I got 40 sets of 50 records each (2000 records) from our Solr instance in ruby format and extracted the binary MARC strings. This resulted in an array of 40 sets of 50 strings, each of which is a valid MARC record.

Fifty records seems largish to me (we only display 20 at a time) but I thought I’d swing for the fences.

I’m testing along three(ish) dimensions:

jruby vs mri
marc4j vs ruby-marc (only on jruby, obviously)
parsing each string individually, or globbing them all together and treating it as if it’s a multi-record file

[Note that MRI is using Net::HTTP to get the data; I presume Curl would be faster still. It’s already faster than jruby]

The following data show the average time to parse out each set of 50 records and extract the first 245 (title) field from each one, along with the totals for doing all 2000 records.

Method                           User       Total      Real
jruby Get/Eval data              0.134750   0.134750  (0.134850)
jruby Get/Eval data (2000)       5.390000   5.390000  (5.394000)

MRI Get/Eval data                0.008500   0.012750  (0.115942)
MRI Get/Eval data (2000)         0.340000   0.510000  (4.637677)

jruby-marc4j-oneAtATime          0.056075   0.056075  (0.056125)
jruby-marc4j-multistring         0.027925   0.027925  (0.028000)

jruby-marc-oneAtATime            0.066625   0.066625  (0.066650)
jruby-marc-multistring           0.034300   0.034300  (0.034325)

mri-marc-oneAtATime              0.084500   0.085250  (0.086597)
mri-marc-multistring             0.085000   0.085750  (0.086026)

jruby-marc4j-oneAtATime (2000)   2.243000   2.243000  (2.244999)
jruby-marc-oneAtATime (2000)     2.665001   2.665001  (2.666000)
mri-marc-oneAtATime (2000)       3.380000   3.410000  (3.463888)

jruby-marc4j-multistring (2000)  1.117001   1.117001  (1.120001)
jruby-marc-multistring (2000)    1.371999   1.371999  (1.372999)
mri-marc-multistring (2000)      3.400000   3.430000  (3.441052)

So…the worst-case scenario is taking an average of 0.085 seconds to get the first title field out of each one of 50 binary MARC records once we’ve got them.

Now, I’m sure all my records came out of the cache, so my query time wasn’t very long. But we still end up with a maximum of roughly 0.2 seconds plus the time to actually do the query to end up with a set of 50 marc records.

We can see from looking at the totals that it looks like MRI’s bottleneck is the actual parsing, whereas constructing the input streams is expensive under jruby (at least the way I’m doing it), resulting in a benefit of concatenating them all together into one longish string before parsing.

Marc4j is faster (20%ish), but not enough faster to be worth the effort in my mind. Keep in mind that I have no idea how fast Marc4j is when running under pure java, without all the jruby overhead.

Bottom line, though: that seems fast enough to me.

I’ll try to benchmark with XML later on today or tomorrow.

Going with and “forking” VUFind

Bill Dueber — Wed, 19 Aug 2009 00:00:00 +0000

Note: This is the second in a series I’m doing about our VUFind installation, Mirlyn. Here I talk about how we got to where we are. Next I’ll start looking at specific technologies, how we solved various problems, and generally more nerd-centered stuff.

When the University Library decided to go down the path of an open-source, solr-based OPAC, there were (and are, I guess) two big players: VUFind and Blacklight.

I wasn’t involved in the decision, but it must have seemed like a no-brainer. VUFind was in production (at Villanova), seemed to be building a community of similar institutions around it (e.g., Stanford), and was based on a technology stack we had some experience with (PHP). Blacklight seemed to be just getting off to a fitful start, and its Ruby stack was at that time an iffy proposition (this was before any sort of major adoption of Passenger or JRuby).

As I write this, things have flipped around a little. Andrew Nagy, the principal architect of VUFind, left Villanova for Serial Solutions and VUFind stopped being his primary focus. The Blacklight community decided to go with a major reorganization of the code to make it easier to deploy, which resulted in a flurry of refactoring and improvements and folks generally thinking things through really well. Stanford just flipped the switch from their VUFind to a Blacklight installation, and as I pointed out, the Ruby deployment options are more stable and less resource-hungry than they were back then. If the decision were being made today, it would be a much more complex analysis.

But anyway, the decision was made, and Tim Prettyman and I were tapped to do most of the hardcore nerd work to make it suitable for our environment.

Right away, I found things that would need some pretty major revision. The user model was based on a local database of logins (we use cosign), even moderately-long search strings would crash the thing, cookies were being used instead of sessions and hitting the 4K limit, search specifications were hardcoded in the PHP, and lots of the UI elements didn’t actually have working code behind them (RSS feeds, endnote export, spellcheck, etc).

So, I dug in and started learning PHP and Smarty and refactoring/rewriting/rearchitecting the crap out of it. One of the first things I did was to extract the search specification — the mapping of, say, a ‘title’ search to a weighted search of six or seven actual Solr fields — into a yaml file so we could mess around with it more easily than modifying the giant case-statement in the PHP code. I built a patch against the then-current revision, filed it as a bug, and sent email to the list.
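To give a flavor of what a search specification in yaml looks like, here is a small sketch in ruby (not the PHP we actually run). The field names, the weights, and the solr_query helper are all invented for the example; the real file maps each search type to six or seven fields.

```ruby
require 'yaml'

# Hypothetical spec: search type => { solr field => weight }
SEARCH_SPEC_YAML = <<~SPEC
  title:
    title_exact: 20
    title: 10
    title_series: 3
  author:
    author_exact: 15
    author: 8
SPEC

SEARCH_SPECS = YAML.safe_load(SEARCH_SPEC_YAML).freeze

# Expand a user-level search into a weighted query across the real fields.
def solr_query(search_type, terms)
  fields = SEARCH_SPECS.fetch(search_type)
  phrase = terms.delete('"') # keep the sketch simple: drop embedded quotes
  fields.map { |field, weight| %(#{field}:"#{phrase}"^#{weight}) }.join(' OR ')
end

puts solr_query('title', 'moby dick')
```

Tuning relevance then means editing numbers in a text file instead of digging through a giant case statement.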

And nothing happened. That patch is still sitting there, in fact. Maybe I’m the only one that thinks it’s useful. But in any case, there was no discussion of it, no one rejected it. It just sat. Sits. Whatever.

I could have asked for write access to the repository, but I didn’t. I saw a few other patches get submitted and met with yawns all around, and started looking more closely at the list and saw pretty much no one doing anything with the then-current code base, and frankly kind of gave up. The folks that I knew were working actively on implementing VUFind — us, Australia and Alan Rykhus at MNPals — were all working from very different code bases, which made our ability to share code very limited. Any sort of official work on VUFind seemed to have slowed to a near standstill (based on svn checkins), and almost no one else seemed interested in submitting patches. After a while, we stopped, too.

So, we didn’t really fork VUFind. We just rewrote much of it and stopped trying to generate interest in our changes. The right thing to do would have been to either grab the bull by the horns, or do an actual fork of the project. But we didn’t feel as if we had time to shepherd a project of this size, and after many, many (many) discussions, decided to just do our thing. I assume that’s what everyone else has done, too, since I see plenty of differences in how things work at the different sites.

As it stands, the wiki shows a good handful of libraries live with VUFind, and a bunch more marked as being in “beta.” I don’t know if what we’re running Mirlyn on is still enough VUFind to be called VUFind. Probably. The basic structure is the same, the search syntax as exposed in the URL is the same. The plumbing underneath is changed in a lot of ways, and I like to think the flow of control makes a little more sense now.

In real life, of course, it doesn’t matter where you draw the line. Our code is far enough removed from the svn repository now that we’re essentially going it alone.

That doesn’t bother me.

The reality is that we’ve taken control of the UI and learned what we need to know about using Solr with our data. If I need to change the backend — to Blacklight, to a newer VUFind, to anything — my users need not ever know, other than to notice that things are a little bit better. If we end up moving to a release-quality version of VUFind, there’s almost nothing I can’t reuse if it makes sense.

We’ve also learned a lot. Solr, obviously, and how to write text filters for it and push it around just a little bit. Solrmarc, too. But we’ve also taken a hard look at data normalization in ways we haven’t before, and decided how we’re going to output to Refworks, and to email, what kinds of searches we want to offer, where we have collisions in ID namespaces (OCLC & ISSN, I’m looking at you).

We’ve discovered issues and problems with our data we’d have never seen otherwise, and started up whole sets of conversations about OPAC issues that used to languish for lack of a reification for reference. The ability to actually (try to) implement the collective intelligence of the library and embody it in a public-facing system is a rush compared to fighting with the ILS.

The system has tons of problems still, starting with underlying templates that will make you a little sick if you do a “view source” and going right through my call number search not working for some edge cases. But that stuff will get cleaned up as we get a little downtime from adding new features, and there are elements of the new backend code that could be useful to others once I clean them up and remove local dependencies.

I’m not sure when, if ever, we’ll start thinking of ourselves as part of the “VUFind community” again. The heavy intellectual lifting about how to organize what is essentially a front-end for Solr doesn’t seem to be happening on the VUFind list. And to be honest, I’m not sure it should be. Solr is the real engine. Solrmarc is, for us right now, an important piece. Data normalization, translation, workaround for crappy data, and the basic information theory of a faceted search system are all independent of the particular middleware you’re using to grab Solr results and throw them up on the screen.

So, what we have is good for us, for now, and we’re continuing to learn how to move forward. And I’ve been able to get bug reports and say, “Thanks, Fixed” fifteen minutes later and get warm fuzzy feelings that don’t usually accompany, “Thanks. I’ll put a request in at Ex Libris’ online ticket system”.

Next time: using and abusing Solr for data normalization.

MARC-HASH: The saga continues (now with even less structure)

Bill Dueber — Wed, 15 Apr 2009 00:00:00 +0000

After a medium-sized discussion on #code4lib, we’ve collectively decided that…well, ok, no one really cares all that much, but a few people weighed in.

The new format is: A list of arrays. If it’s got two elements, it’s a control field; if it’s got four, it’s a data field.

SO….it’s like this now.

 {
   "type" : "marc-hash",
   "version" : [1, 0],

   "leader" : "leader string",
   "fields" : [
     ["001", "001 value"],
     ["002", "002 value"],
     ["010", " ", " ",
       [
         ["a", "68009499"]
       ]
     ],
     ["035", " ", " ",
       [
         ["a", "(RLIN)MIUG0000733-B"]
       ]
     ],
     ["035", " ", " ",
       [
         ["a", "(CaOTULAS)159818014"]
       ]
     ],
     ["245", "1", "0",
       [
         ["a", "Capitalism, primitive and modern;"],
         ["b", "some aspects of Tolai economic growth" ],
         ["c", "[by] T. Scarlett Epstein."]
       ]
     ]
   ]
 }
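Reading this format needs nothing beyond a JSON parser and a length check. A minimal ruby sketch, where the helper names control_values and subfield_values are mine and the record is a trimmed copy of the one above:

```ruby
require 'json'

RECORD_JSON = <<~EOS
  {
    "type": "marc-hash",
    "version": [1, 0],
    "leader": "leader string",
    "fields": [
      ["001", "001 value"],
      ["035", " ", " ", [["a", "(RLIN)MIUG0000733-B"]]],
      ["035", " ", " ", [["a", "(CaOTULAS)159818014"]]],
      ["245", "1", "0", [["a", "Capitalism, primitive and modern;"],
                         ["c", "[by] T. Scarlett Epstein."]]]
    ]
  }
EOS

# Two elements means control field: [tag, value]
def control_values(record, tag)
  record.fetch('fields').select { |f| f.length == 2 && f[0] == tag }.map(&:last)
end

# Four elements means data field: [tag, ind1, ind2, [[code, value], ...]]
def subfield_values(record, tag, code)
  record.fetch('fields')
        .select { |f| f.length == 4 && f[0] == tag }
        .flat_map do |_tag, _ind1, _ind2, subfields|
          subfields.select { |c, _value| c == code }.map(&:last)
        end
end

record = JSON.parse(RECORD_JSON)
p control_values(record, '001')
p subfield_values(record, '035', 'a')
```

Field and subfield order both survive, since everything stays in arrays from start to finish.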

MARC-Hash: a proposed format for JSON/YAML/Whatever-compatible MARC records

Bill Dueber — Mon, 13 Apr 2009 00:00:00 +0000

In my first shot at MARC-in-JSON, which I appropriately (and prematurely) named MARC-JSON, I made a point of losing round-tripability (to and from MARC) in order to end up with a nice, easy-to-work-with data structure based mostly on hashes. “Who really cares what order the subfields come in?” I asked myself.

Well, of course, it turns out some people do. Some even care about the order of the tags. “Only in the 500s…usually” I was told today. All my lovely dreams of using easy-to-access hashes up in so much smoke.

So…I’m suggesting we try something a little simpler. Something so brain-dead, in fact, that I’m loath to put it down because it’s pretty much the obvious way to do it. To wit:

 {
   "type" : "marc-hash",
   "version" : [1, 0],

   "leader" : "leader string",
   "control" : [
     ["001", ["all", "001", "values"]],
     ["002", ["all", "002", "values"]]
   ],
   "data" : [
     ["010", " ", " ",
       [
         ["a", "68009499"]
       ]
     ],
     ["035", " ", " ",
       [
         ["a", "(RLIN)MIUG0000733-B"]
       ]
     ],
     ["035", " ", " ",
       [
         ["a", "(CaOTULAS)159818014"]
       ]
     ],
     ["245", "1", "0",
       [
         ["a", "Capitalism, primitive and modern;"],
         ["b", "some aspects of Tolai economic growth" ],
         ["c", "[by] T. Scarlett Epstein."]
       ]
     ]
   ]
 }

Stupid MARC allows all the stupid fields to stupid repeat and be out of stupid order and such, so it’s just a lot of arrays. Easily round-tripable.

Why bother? Excellent question, and one that’s a little harder to answer now that the data structure requires so much looping to find anything (the first time, anyway). I guess it’s still a lot easier than working with raw MARC (or, I would claim, MARC-XML), requires no special libraries in any language that supports strings, hashes, and arrays, and can be manipulated with basic language constructs.

A few things worth noting about the assumptions in my mind:

By definition, it’s always UTF-8. The leader should be changed to note this on the sending end, but it’s not required.
We include both a type “marc-hash”, and a version with major/minor numbers.
Everything is a string.
Alpha characters in indicators/tags are all lowercased.
A control field is a duple: tag and array of values.
A data field has four values:
- The tag
- Indicator one
- Indicator two
- An array of duples: subfield and its value

A simple transformation to make it a little more queryable

Let’s say you don’t give a damn about tags that appear out of order, because that’s just a crime against nature, anyway. And you really don’t care what order the subtags appear in most of the time, ’cause really, who does?

A simple run-through (pseudocode ahead):

   my marchash = getTheMarcHash();
   my kindamarc;
   kindamarc{leader} = marchash{leader};

   # Map the control fields by tag => array-of-values
   foreach cfield (marchash{control}) {
     kindamarc{control}{cfield[0]} ||= [];
     kindamarc{control}{cfield[0]}.concat(cfield[1]);
   }

   foreach d (marchash{data}) {
     (tag, ind1, ind2) = (d[0], d[1], d[2]);

     # build up a hash of subfield => array-of-values for this tag
     newd = {};
     foreach subfield (d[3]) {
       (stag, sval) = subfield;
       newd{stag} ||= [];
       newd{stag}.push(sval);
     }

     # Store the subfield hash in a few places so it's easy to find.
     foreach i1 ('*', ind1) {
       foreach i2 ('*', ind2) {
         kindamarc{data}{tag}{i1}{i2} ||= [];
         kindamarc{data}{tag}{i1}{i2}.push(newd);
       }
     }
   }

Control fields are stored as arrays of values associated with the tag. Data fields are built up as a hash of subfield to array-of-values pairs, and then stored both based on the indicator given and the wildcard indicator ‘*’.

Basically, this will allow things like this:

   $leader = $kindamarc{leader};
   $first001 = $kindamarc{control}{"001"}[0];

   # Find 856s where indicator 2 is '1'
   @mystuff = $kindamarc{data}{856}{'*'}{1};

It’s easy to see how we could store the index from the original array to make it easy to find the original order, too.
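For the curious, here is the same transformation as runnable ruby rather than pseudocode. It assumes the control/data layout proposed above; the method name queryable and the sample record are invented for the sketch.

```ruby
# Hypothetical sample in the proposed layout: control is [tag, [values]],
# data is [tag, ind1, ind2, [[code, value], ...]].
SAMPLE = {
  'leader' => 'leader string',
  'control' => [['001', ['000000123']]],
  'data' => [
    ['856', '4', '1', [['u', 'http://example.org/a']]],
    ['856', '4', '0', [['u', 'http://example.org/b'], ['z', 'Full text']]]
  ]
}.freeze

def queryable(marc_hash)
  kinda = { 'leader' => marc_hash['leader'], 'control' => {}, 'data' => {} }

  # Control fields: tag => flat array of values
  marc_hash.fetch('control', []).each do |tag, values|
    (kinda['control'][tag] ||= []).concat(values)
  end

  marc_hash.fetch('data', []).each do |tag, ind1, ind2, subfields|
    # Subfield code => array of values (codes can repeat)
    by_code = subfields.group_by(&:first).transform_values { |pairs| pairs.map(&:last) }

    # File the field under its real indicators and under the '*' wildcard.
    ['*', ind1].uniq.each do |i1|
      ['*', ind2].uniq.each do |i2|
        by_tag = (kinda['data'][tag] ||= {})
        ((by_tag[i1] ||= {})[i2] ||= []) << by_code
      end
    end
  end

  kinda
end

kinda = queryable(SAMPLE)
p kinda['control']['001'][0]
p kinda['data']['856']['*']['1'] # 856s where indicator 2 is '1'
```

Each data field lands in four buckets, so lookups are cheap at the cost of a little memory.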

For many, I’m sure, the prospect of dealing with something like this is more daunting than just learning to use MARC-XML or using existing libraries to deal with straight MARC. But there seems to be a set of folks out there for whom this might be useful, so I’m throwing it out there.

TicTocs: Give us a file! Pretty pretty pretty please!

Bill Dueber — Mon, 02 Feb 2009 00:00:00 +0000

For those who haven’t heard, ticTOCs is a service that provides web-based access to a database of Journal RSS/Atom Table of Contents feeds. Awesome.

In their blog at News from TicTocs, a post titled I want to be completely honest with you about ticTOCs notes that:

As for the API – yes, we’ve been asked this several times, and the answer is that it is currently being written and should be available very soon.

That’s great, but writing in a comment on that post (after logging in with a very, very old OpenID — I used to have a blog named Opachyderm, a name which I thought was insufferably clever), I noted that we don’t need an API right away.

What we need is a text file.

Simple. Tab-delimited. TicTocID,Title,URL,issn,eissn. Update it every night.

That’s all we need.

We can do the rest. Put it in the OPAC. Stick it on our SFX pages. Not screw around with Javascript/AJAX calls when the data we need are (relatively) static and (absolutely) simple.

Someone needed to put a web interface on those data, and the one provided at ticTocs is really nice. I’m glad it’s there.

And I can’t tell you how much I applaud the JISC for starting this project and getting vendors on board. That’s always the hard part — participation and standardization. They’re doing it, and I couldn’t be happier.

But these data are incredibly valuable, and their value is currently limited because they’re boxed up.

Spreading these data far and wide is good for scholarship, and I can’t imagine the case that could be made showing it’s better for JISC to keep them at a single endpoint.

The knee-jerk reaction is always, I know, to keep things behind a wall, even if it’s a short wall. “Things will get out of sync if people have their own copies.” Or, “We’ll provide whatever access you need, as fast as you need it, honest.” Or, “We’re going to be providing value-added services on top of the data.”

It’s all true. Things will get out-of-sync — but that’s going to happen whether you encourage people to not cache results or not. And I don’t doubt for a moment that the API provided will be great. And of course you’ll be in a position where you can provide value-added services.

But so can the rest of us.

I’ve run into this myself. I fear…well, let’s be honest. I fear providing a service, having the data stripmined, and then having no one appreciate the front-end I put on it. I do this job for the fame, not the fortune. Obviously.

But I’ll never provide services as fast as me plus three hundred other geeks, all responding to different situations and servicing different patrons.

So…provide an API. Start simple: a single call named getCurrentTextFile. Or maybe add getCurrentTextFileGzipped. It’s only ten-thousand lines of text, probably less than 75k gzipped up. I promise to call it every night about 3am local time so I’m up-to-date.

So….pretty please? With sugar on top? My catalog is waiting. So is my SFX install. And our list of ejournals. And our subject guides. And lots of pages on our website. And our pre-packaged OPML files to offer students and professors. And a thousand yet-to-be-devised services as well.

Pretty pretty pretty please???

Psst. We’re not printing cards anymore

Bill Dueber — Mon, 12 May 2008 00:00:00 +0000

[From a series I’m calling, “Things About The Library I Think Are Stoooopid”, part one of about a zillion.]

I’m going to wallow in a little bit of hyperbole here, but only a little.

The problem

Suppose, just for a moment, that you’re a computer programmer working anytime in the last twenty years, and someone wants you to set up a data structure to deal with a timeless issue — how to keep track of who’s on which committees in a library.

If you’re a computer person

Easy enough. First off, what’s a committee?

Committee

Committee name (string)
Committee inception date (date)
Chair (person)
Members (set of people)

How about a person?

Person

Last name (string)
First name (string)
Email address (email)

Okeedokee. That looks ok so far, but we’ve got problems.

First off, everyone knows that committee names change. And everyone also knows that last names can change, preferred first names can change, email addresses change, etc. We need some sort of unique identifier to represent the abstract ideal of a particular committee or a specific individual. Let’s be lazy and just throw in an integer ID that we’ll be careful not to reuse, ever, for any reason.

So, we’ll throw that in, and make sure our references are to these unique IDs, not names or whatnot.

That gives us this.

Committee

cID (unique integer)
Committee name (string)
Committee inception date (date)
Chair (pID)
Members (set of pIDs)

How about a person?

Person

pID (unique integer)
Last name (string)
First name (string)
Email address (email)

And the mapping, of course.

Committee-Person Mapping

pID (unique integer pointing into the Person table)
cID (unique integer pointing into the Committee table)
dateTermStarted (date)
dateTermEnds (date)

If this seems simple, well, it is. Like I said, the theory is almost forty years old, and common implementations of databases at least twenty. We have well-defined unique keys, special types for dates and email addresses so we can do some sanity checking and order things and so forth, and a very, very simple mapping of people to committees where we keep track of start and end dates just to be complete.
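
To make that concrete, here is a minimal sketch of the three tables as plain Ruby structs. Everything in it is an assumption for illustration: the struct and constant names, the sample people and committees, and the helper `committees_for` are all made up, and a real system would keep this in a database rather than in memory.

```ruby
require 'date'

# Hypothetical in-memory versions of the Person, Committee, and mapping tables.
Person     = Struct.new(:pid, :last_name, :first_name, :email)
Committee  = Struct.new(:cid, :name, :inception_date, :chair_pid)
Membership = Struct.new(:pid, :cid, :term_started, :term_ends)

# Two different people who happen to share a name; the pID keeps them apart.
PEOPLE = {
  1 => Person.new(1, 'Smith', 'John', 'jsmith@example.edu'),
  2 => Person.new(2, 'Smith', 'John', 'john.smith@example.edu')
}

COMMITTEES = {
  10 => Committee.new(10, 'Web Steering', Date.new(2005, 9, 1), 1),
  11 => Committee.new(11, 'Cataloging Policy', Date.new(1998, 1, 15), 2)
}

MEMBERSHIPS = [
  Membership.new(1, 10, Date.new(2007, 7, 1), Date.new(2009, 6, 30)),
  Membership.new(1, 11, Date.new(2008, 1, 1), Date.new(2008, 12, 31)),
  Membership.new(2, 11, Date.new(2006, 7, 1), Date.new(2008, 6, 30))
]

# Which committees does the person with this pID sit on, on a given day?
def committees_for(pid, on: Date.today)
  MEMBERSHIPS
    .select { |m| m.pid == pid && (m.term_started..m.term_ends).cover?(on) }
    .map { |m| COMMITTEES.fetch(m.cid) }
end

# A name change touches exactly one record, because every reference is by pID.
PEOPLE[1].last_name = 'Jones'

puts committees_for(1, on: Date.new(2008, 5, 12)).map(&:name).inspect
```

Because the dates are real `Date` objects, the "is this term current?" check is a range comparison and not a string-parsing exercise.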

Most importantly, you know what’s not here? There’s nothing about how to print it out, or what format I’m going to store it in. Those are afterthoughts. They don’t matter. Any well-specified data model can be machine-translated into pretty much anything you need.
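
As a rough illustration of that claim, the sketch below takes one stored record and produces two different outputs from it. The record, its field names, and the two `render_*` helpers are hypothetical; the only library calls are Ruby's standard `json` and `date`.

```ruby
require 'json'
require 'date'

# Hypothetical committee record; the field names are illustrative only.
COMMITTEE = {
  cid: 10,
  name: 'Web Steering',
  inception_date: Date.new(2005, 9, 1),
  chair: { pid: 1, last_name: 'Smith', first_name: 'John' }
}.freeze

# A human-readable line. All punctuation and date formatting is decided here,
# at rendering time, and none of it is stored in the record.
def render_card_line(c)
  chair = "#{c[:chair][:last_name]}, #{c[:chair][:first_name]}"
  "#{c[:name]} (est. #{c[:inception_date].strftime('%B %-d, %Y')}); chair: #{chair}."
end

# A machine-readable rendering of the same record, with an ISO 8601 date.
def render_json(c)
  JSON.generate(c.merge(inception_date: c[:inception_date].iso8601))
end

puts render_card_line(COMMITTEE)
puts render_json(COMMITTEE)
```

Adding a third output format means writing a third small function; the stored data doesn't change.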

If you’re writing a library spec

As near as I can tell, the “library” way to write this would be as follows:

Committee

[Let “hus” stand for “hopefully unique string created by ridiculously complex algorithm”]

Committee name (hus)
Committee inception (string masquerading as a date in any of several formats)
Chair (hus)
Members
- person1 (hus) $$b email address (string) $$c start date (date-like string) $$d end date (date-like string)
- person2 (hus) $$b email address (string) $$c start date (date-like string) $$d end date (date-like string)

Ummmmm…strings. Nothing but strings. Short strings, long strings, fat strings, tall strings. Strings with dollar signs. Strings that look like dates. Strings that contain other strings. And, just for luck, a little bit of hierarchy, where “hierarchy” means “two levels.”

If someone’s name changes, well, good luck trying to find all the occurrences and fixing them all (and making sure you don’t get the wrong John Smith). Good luck parsing out all the dates, which rely not on machine syntax checking but on a whole set of data-enterers trying to follow some sort of rule without making any mistakes. And good, good luck getting a list of which committees a specific person belongs to.
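
Here is a sketch of what consuming that format might look like. The sample line, the choice to file the leading chunk under `:a`, the list of date layouts, and both helper names are assumptions made up for this example; real data would need many more layouts and would still have gaps.

```ruby
require 'date'

# Hypothetical member line in the "$$" subfield style sketched above.
LINE = 'Smith, John $$b jsmith@example.edu $$c 7/1/2007 $$d June 30 2009'

# Split a member line into a hash keyed by subfield code. The leading chunk
# (before any "$$") is filed under :a by assumption.
def parse_member(line)
  head, *rest = line.split('$$')
  fields = { a: head.strip }
  rest.each do |chunk|
    code, value = chunk.split(' ', 2)
    fields[code.to_sym] = value.to_s.strip
  end
  fields
end

# Date layouts to try, in order. This list is a guess, which is the problem:
# "7/1/2007" is read as July 1 here, but it could just as well mean 7 January.
DATE_FORMATS = ['%Y-%m-%d', '%m/%d/%Y', '%B %d %Y'].freeze

# Return a Date if any layout matches, or nil if none does.
def guess_date(str)
  DATE_FORMATS.each do |fmt|
    begin
      return Date.strptime(str, fmt)
    rescue ArgumentError
      next
    end
  end
  nil
end

member = parse_member(LINE)
puts member.inspect
puts guess_date(member[:c]).inspect
puts guess_date('Summer 2009').inspect
```

Even when the parsing succeeds, all you get back is the string "Smith, John", with no way to tell which John Smith it is.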

Why I bring it up

One of the most eye-opening talks I heard at code4lib 2008 was a keynote by Karen Coyle on RDA and its ongoing specification. You can view the slides or watch the presentation if you’d like.

In it, she makes the point that, when push comes to shove, AACR2 and RDA both ended up being tremendously focused on producing text strings.

Whaaaaa??

Was there no one on the RDA committee that had experience with anything even approaching modern data theory?

Of course there was. But the giant weight of history is crushing library data modeling like a skinless grape under a dump truck.

Look, I understand that this is not a simple data modeling problem. I understand that there’s a whole set of issues, including a demand (a specious one, I think) that the cataloged data accurately reflect the actual text in a real, physical object that’s sitting in front of you. I’m not so naive as to think this is an easy task.

But anyone who, in the 21st century, approaches the large-scale creation of data without first and foremost worrying about machine-parsability, consistent data types with machine-checkable syntax (and even some semantics), and one-to-one mappings between unique objects (an author, an editor, a publishing house, a work) and something that uniquely identifies that object in any reification is… well, I don’t know what they’re smoking.

We’re not printing cards anymore, people.

- If something is only understandable if a human is reading it, it’s not understandable by any modern definition.
- Punctuation doesn’t belong in the description of an object. Ever. Punctuation is a rendering issue. If you’re using punctuation, or well-formed strings, instead of descriptive attributes, you’re doing it wrong.
- Just because you know your data doesn’t mean you know how to model it. Get outside help from the smartest people you can find.

Whew! That felt good!

OK. Rant off.

UPenn library has video “commercials”

Bill Dueber — Wed, 07 May 2008 00:00:00 +0000

The University of Pennsylvania Library has a set of video commercials touting their products — some of which are musicals! Worth a look-see.
