Author: Bill Dueber

Reintroducing Traject: Traject 2.0

Traject 2.0.0 released! Now runs under MRI/RBX! traject is an ETL (extract/transform/load) system written in ruby with a special view towards extracting fields from MARC data and writing it out into Solr. Jonathan Rochkind (http://bibwild.wordpress.com) and I wrote this primarily out of frustration using other tools in this space (e.g., Solrmarc, or my own precursor to traject, marc2solr). Note: Catmandu is another, perl-based system I don’t have any direct experience with. traject had its first release almost a year and a half ago (at least based on the date of my post introducing it), and I’ve used it literally…

How good/bad is MARC data? The case of place-of-publication

I complain a lot about the MARC format, the way people put data in MARC records, the actual data themselves I find in MARC records, the inexplicably complex syntax for identifiers and, ironically, attempts to replace MARC with something else. One nice little beacon of hope was when I found that only roughly 0.26% of the ISBNs in the UMich catalog have invalid checksums. That’s not bad at all, and it’s worth digging into other things about which I might be likely to complain before I make a fool of myself. [Note: there will be some complaining at the end.…
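
For readers who want to run the same kind of check on their own catalog, here is a minimal sketch of ISBN checksum validation. The helper names are made up for illustration and are not from the post; the arithmetic is the standard ISBN-10 (mod 11) and ISBN-13 (mod 10) check-digit rule.

```ruby
# Hypothetical helpers; the names are illustrative, not from the post.

# Drop hyphens and spaces, and upcase a trailing "x".
def normalize_isbn(raw)
  raw.to_s.delete('- ').upcase
end

# ISBN-10: weight the characters 10 down to 1; the total must be divisible
# by 11. "X" stands for 10 and is only legal in the last position.
def valid_isbn10?(raw)
  isbn = normalize_isbn(raw)
  return false unless isbn.match?(/\A\d{9}[\dX]\z/)

  sum = isbn.chars.each_with_index.sum do |ch, i|
    value = ch == 'X' ? 10 : ch.to_i
    value * (10 - i)
  end
  (sum % 11).zero?
end

# ISBN-13: weights alternate 1, 3, 1, 3...; the total must be divisible by 10.
def valid_isbn13?(raw)
  isbn = normalize_isbn(raw)
  return false unless isbn.match?(/\A\d{13}\z/)

  sum = isbn.chars.each_with_index.sum do |ch, i|
    ch.to_i * (i.even? ? 1 : 3)
  end
  (sum % 10).zero?
end
```

Counting how many values fail `valid_isbn10?` and `valid_isbn13?` across a dump of ISBN fields would give a percentage comparable to the one quoted above, assuming the values have already been pulled out of the records.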

Ruby MARC serialization/deserialization revisited

A few years ago, I benchmarked various methods of serializing/deserializing MARC data using the ruby-marc gem. Given that I’m planning on starting fresh with my catalog setup, I thought I’d take a moment to revisit them. The biggest changes since that time have been (a) the continued speed improvements in JRuby, (b) the introduction of the Oj json parser for MRI ruby, and (c) wider availability of msgpack code in the wild. I also wondered what would happen if I tried ruby’s Marshal serialization; maybe it would be faster because I wouldn’t have to "manually" create a MARC::Record object from…
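
A rough sketch of what such a round-trip comparison can look like, using only the standard library. Note the assumptions: the records are plain Hashes standing in for MARC::Record objects, the data is invented, and only Marshal and the stdlib JSON module are compared (Oj and msgpack are left out so the sketch has no gem dependencies).

```ruby
require 'benchmark'
require 'json'

# Stand-in for a MARC record as a plain Hash (hypothetical data, loosely
# shaped like marc-in-json); a real run would use MARC::Record objects.
def sample_record(i)
  {
    'leader' => '00000cam a2200000 a 4500',
    'fields' => [
      { '001' => "rec#{i}" },
      { '245' => { 'ind1' => '1', 'ind2' => '0',
                   'subfields' => [{ 'a' => "Title #{i}" }] } }
    ]
  }
end

records = Array.new(5_000) { |i| sample_record(i) }

# Each entry pairs a dump lambda with the matching load lambda.
SERIALIZERS = {
  'Marshal' => [->(r) { Marshal.dump(r) }, ->(s) { Marshal.load(s) }],
  'JSON'    => [->(r) { JSON.generate(r) }, ->(s) { JSON.parse(s) }]
}.freeze

# Serialize and immediately deserialize every record.
def round_trip(records, dumper, loader)
  records.map { |r| loader.call(dumper.call(r)) }
end

Benchmark.bm(8) do |bm|
  SERIALIZERS.each do |name, (dumper, loader)|
    bm.report(name) { round_trip(records, dumper, loader) }
  end
end
```

Timings from a toy like this say little about real MARC records; the point is only the shape of the comparison, with dump and load measured together.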

“Schemaless” solr with dynamicField and copyField

[Holy Kamoly, it’s been a long time since I blogged!] Recent versions of solr have the option to run in what they call "schemaless mode", wherein fields that aren’t recognized are actually added, automatically, to the schema as real named fields. I find this intriguing, but it’s not what I’m after right now. The problem I’m in the first stages of addressing is that my schema.xml is a huge mess: very little consistency, no naming conventions dictating what’s stored/indexed, etc. It grew "organically" (which is what I say when I mean I’ve been lazy and sloppy) and needs a full-on…

Help me test yet another LC Callnumber parser

Those who have followed this blog and my code for a while know that I have a long, slightly sad, and borderline abusive relationship with Library of Congress call numbers. They’re a freakin’ nightmare. They just are. But, based on the premise that Sisyphus was a quitter, I took another stab at it, this time writing a real (PEG-) parser instead of trying to futz with extended regular expressions. The results, so far, aren’t too bad. The gem is called lc_callnumber, but more importantly, I’ve put together a little heroku app to let you play with it, and then correct…

New blog front- and back-end

A while back, Dreamhost had some problems and my blog and assorted other websites I help keep track of went down. For more than two weeks. Now, I understand that crap happens. And I understand that sometimes lots of things happen at once. But fundamentally, their infrastructure is such that they could lose everything on a machine and be unable to get it back for more than two weeks. I’m not a mathematician, but that’s not “five-nines” service. So, I decided to start hunting around for another provider. And then I got distracted by the idea that maybe having my…

Announcing “traject” indexing software

[Over the next few days I’ll be writing a series of posts that highlight a new indexing solution by Jonathan Rochkind and myself called traject that we’re using to index MARC data into Solr. This is the introduction.] Wow. Six months since I posted here. What have I been doing? Well, mostly parenting, but in the last few weeks I was lucky enough to get on board with a project started by Jonathan Rochkind for a new JRuby-based tool optimized for indexing MARC data into solr. You know, kinda like solrmarc, but JRuby. What’s it look like? I encourage you…

Come work at the University of Michigan

The Library has three UX positions available right now — interface designer, interface developer, and a web content strategist. Come join me at what is easily the best place I’ve ever worked! Full details are over at Suz’s blog.

Please: don’t return your books

So, I’m at code4lib 2013 right now, where side conversations and informal exchanges tend to be the most interesting part. Last night I had a conversation with the inimitable Michael B. Klein, and after complaining about faculty members that keep books out for decades at a time, we ended up asking a simple question: How much more shelving would we need if everyone returned their books? Assuming we could get them all checked in and such, well, where would we put them? I’m looking at this in the simplest, most conservative way possible: Assume they’re all paperbacks, so we don’t…

Boosting on Exactish (anchored) phrase matching in Solr: (SST #4)

[Check out the introduction to the Stupid Solr Tricks series if you’re just joining us.] Exact matching in Solr is easy. Use the default string type: all it does is, essentially, exact phrase matching. string is a great type for faceted values, where the only way we expect to search the index is via text pulled from the index itself. Query the index to get a value: use that value to re-query the index. Simple and self-contained. But much of the time, we don’t want exact matching. We want exactish matching. You know, where things are exactly the same except. Except…

Requiring/Preferring searches that don’t span multiple values (SST #3)

[Check out the introduction to the Stupid Solr Tricks series if you’re just joining us.] Solr and multiValued fields Here’s another thing you need to understand about Solr: it doesn’t really have fields that can take multiple values. But Bill, you’re saying, sure it does. I mean, hell, it even has a ‘multiValued’ parameter. First off: watch your language. Second off: are you sure? Let’s do a quick test. Look at the following documents exampledocs/names.json [ { id: 1, title: The Monkees, name_text: [Peter Tork, Mike Nesmith, Micky Dolenz, Davy Thomas Jones] }, { id: 2, title: Heros of the Wild…

Using localparams in Solr (or, how to boost records that contain all terms) (SST #2)

[Note: this isn’t so much a Stupid Solr Trick as a Thing You Should Probably Know; consider it required reading for the next SST. If you’re just joining us, check out the introduction to the Stupid Solr Tricks series] What the heck is a localparams query? A garden-variety Solr query URL looks something like this: http://localhost:8983/solr/select? defType=dismax &qf=name^2 place^1 &q=Dueber Which is fine, as far as it goes. But it’s easy to run into the limits of the standard query plugins (e.g., Dismax). Say, for example, you want something like this: title:Constructivism AND author:Dueber And furthermore, you have multiple underlying…
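
As a small illustration, the garden-variety URL above can be assembled in Ruby, and the same dismax parameters can also be written as a localparams prefix on `q`. This is only a sketch: `solr_url` and `localparams` are hypothetical helper names, and the quoting is naive (it assumes parameter values contain no single quotes).

```ruby
require 'uri'

SOLR_SELECT = 'http://localhost:8983/solr/select' # endpoint from the example URL

# Build a query URL from a hash of Solr parameters, percent-encoding values.
def solr_url(params, base = SOLR_SELECT)
  "#{base}?#{URI.encode_www_form(params)}"
end

# Render a localparams prefix such as {!dismax qf='name^2 place^1'}.
# Naive quoting: assumes the values contain no single quotes.
def localparams(type, opts = {})
  pairs = opts.map { |k, v| "#{k}='#{v}'" }
  "{!#{[type, *pairs].join(' ')}}"
end

# The plain form: query-parser settings passed as separate URL parameters.
plain = solr_url('defType' => 'dismax', 'qf' => 'name^2 place^1', 'q' => 'Dueber')

# The localparams form: the same settings carried inside the q parameter.
local = solr_url('q' => localparams('dismax', 'qf' => 'name^2 place^1') + 'Dueber')

puts plain
puts local
```

Because the parser settings travel inside the query string itself in the second form, different parts of one request can in principle use different settings.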

Solr Field Type for numeric(ish) IDs (SST #1)

[For the introduction to this series, take a quick gander at the introduction] Like everyone else in the library world, I’ve got a bunch of well-defined, well-controlled standard identifiers I need to keep track of and allow searching on. You know, well-vetted stuff like this: 1234-5678 123-4567-890 12-34-567-X 0012-0045 ISBN13: 1234567890123 ISSN: 1234567X (1998-99) ISSN (1998-99): 1234567X 1234567890 (hdk. 22 pgs) 9 Behind the 3rd floor desk Henry VIII [Note: some of these may be a titch exaggerated] How does your system deal with these on index? How about on query? Here’s an idea of how to use a custom…
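
To make the problem concrete, here is one possible normalization written in plain Ruby rather than as a Solr field type. It is an assumption-laden sketch, not the approach from the post: it grabs the first standalone run of digits (hyphens allowed inside, optional trailing X), strips the hyphens, and would need to be applied identically at index and query time.

```ruby
# Hypothetical pattern: a run that starts with a digit, may contain digits and
# hyphens, may end in X, and is not glued to surrounding letters or digits
# (so the "13" in "ISBN13" and the "3" in "3rd" are skipped).
ID_PATTERN = /(?<![A-Za-z\d])\d[\d-]*[\dXx]?(?![A-Za-z\d])/

# Return the normalized first identifier-looking run, or nil if there is none.
def normalize_id(raw)
  match = raw.to_s[ID_PATTERN]
  match && match.delete('-').upcase
end

[
  '1234-5678',
  '12-34-567-X',
  'ISBN13: 1234567890123',
  'ISSN (1998-99): 1234567X',
  'Behind the 3rd floor desk',
  'Henry VIII'
].each do |raw|
  puts "#{raw.inspect} -> #{normalize_id(raw).inspect}"
end
```

The fourth sample shows the limits of "first run wins": the date range comes before the actual ISSN, so the sketch picks up the wrong digits. Real data this messy needs a deliberate decision about what to index, which is the question the post is asking.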

Stupid Solr tricks: Introduction (SST #0)

Completed parts of the series: A Solr Field Type for numeric(ish) IDs Using localparams in Solr (or, how to boost records that contain all terms) Requiring/Preferring searches that don’t span multiple values Boosting on Exactish (anchored) phrase matching Those of you who read this blog regularly (Hi Mom!) know that while we do a lot of stuff at the University of Michigan Library, our bread-and-butter these days are projects that center around Solr. Right now, my production Solr is running an ancient nightly of version 1.4 (i.e., before 1.4 was even officially released), and reflects how much I didn’t know…

Another short personal note

The baby spent all last week in the hospital. Nothing life-threatening (so long as he was in the hospital and could get O2 when needed); it was just annoying. So….here’s to a week-long hospital stay being able to be merely “annoying”. A tip of the hat to steady employment, generous sick/vacation policies, flexible co-workers, excellent insurance, and having a world-class hospital in town. This could have been a much, much worse week than it was.

Solr and boolean operators

[Summary: ALWAYS ALWAYS ALWAYS USE PARENTHESES TO GROUP BOOLEANS IN SOLR!!!] What does Solr do, given the following query? a OR b AND c I’ll give you three guesses, but you’ll get the first two wrong and won’t have any idea how to generate a third, so don’t spend too much time on it. Boolean algebra and operator precedence Anyone who’s had even a passing introduction to boolean algebra knows that it specifies a strict order to how the operators are bound: NOT before AND before OR. So, one might expect the following grouping: a OR (b AND c) That’s…
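
One way to follow the advice in the summary is to never write boolean operators by hand and to build queries through helpers that always add the parentheses. A minimal sketch; `solr_and` and `solr_or` are hypothetical names, and the clauses are assumed to be already-escaped query fragments.

```ruby
# Hypothetical helpers: every boolean group comes back wrapped in parentheses,
# so the grouping never depends on how the query parser binds AND and OR.
def solr_and(*clauses)
  "(#{clauses.join(' AND ')})"
end

def solr_or(*clauses)
  "(#{clauses.join(' OR ')})"
end

# The grouping one "might expect" from boolean algebra, spelled out explicitly.
query = solr_or('a', solr_and('b', 'c'))
puts query # prints (a OR (b AND c))
```

The extra outer parentheses are redundant but harmless, and they keep the helpers composable.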

A short personal note

We had another baby. 🙂 Shai Brown Dueber was born last Monday, the 3rd, at a very moderate 7lbs 7.2oz (his brothers were 9lbs and 9.5lbs). Mother, baby, and older brothers are all doing well. Father is freakin’ tired.    

Even better, even simpler multithreading with JRuby

[Yes, another post about ruby code; I’ll get back to library stuff soon.] Quite a while ago, I released a little gem called threach (for “threaded #each”). It allows you to easily process a block with multiple threads. # Process a CSV file with three threads File.open('data.csv').threach(3, :each_line) {|line| send_to_db(line)} Nice, right? The problem is that I could never figure out a way to deal with a break or an Exception raised inside the block. The core problem is that once a thread trying to push/pop from a ruby SizedQueue is blocking, there’s no way (I could find) to tell…
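
For readers who have not seen the pattern, a threaded each can be sketched with a SizedQueue feeding a fixed pool of worker threads. This is a hypothetical stand-in, not the gem's actual implementation, and it deliberately has the weakness described above: an exception or break inside the block is not handled.

```ruby
# Hypothetical sketch of a "threaded each": a producer pushes items onto a
# bounded queue and thread_count workers pop them and run the block.
def threaded_each(enum, thread_count, &block)
  queue = SizedQueue.new(thread_count * 2) # bounded, so the producer can't run far ahead
  done  = Object.new                       # unique sentinel telling a worker to stop

  workers = Array.new(thread_count) do
    Thread.new do
      # NOTE: an exception raised by the block kills this worker; if every
      # worker dies, the producer below blocks on push. Not handled here.
      until (item = queue.pop).equal?(done)
        block.call(item)
      end
    end
  end

  enum.each { |item| queue.push(item) }
  thread_count.times { queue.push(done) } # one sentinel per worker
  workers.each(&:join)
  enum
end

# Usage: square some numbers on three threads, collecting into a thread-safe Queue.
squares = Queue.new
threaded_each(1..10, 3) { |n| squares << n * n }
puts squares.size
```

Items are processed in no guaranteed order, so anything the block touches has to be thread-safe, as the Queue used for collecting results is here.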

Using SQLite3 from JRuby without ActiveRecord

I spent way too long asking my friend, The Internet, how to get a normal DBI connection to SQLite3 using JRuby. Apparently, everyone except me is using ActiveRecord and/or Rails and doesn’t want to just connect to the database. But I do. Here’s how. First, get the gems: gem install dbi gem install dbd-jdbc gem install jdbc-sqlite3 Then you’re ready to load it up into DBI. require 'rubygems' # if you’re using 1.8 still require 'java' require 'dbi' require 'dbd/jdbc' require 'jdbc/sqlite3' databasefile = 'test.db' dbh = DBI.connect( "DBI:jdbc:sqlite:#{databasefile}", # connection string '', # no username for sqlite3 '', #…

How good is our relevancy ranking?

For those of us who spend our days trying to tweak Mirlyn to make it better, one of the most important (and, in many ways, most opaque) questions is, “How good is our relevancy ranking?” Research from the UMich Library’s Usability Group (pdf; 600k) points to the importance of relevancy ranking for both known-item searches and discovery, but mapping search terms to the “best” results involves crawling deep inside the searcher’s head to know what she’s looking for. So, what can we do? Record interaction as a way of showing interest One possibility is to look at those…
