Traject 2.0.0 released! Now runs under MRI/RBX!
traject is an ETL (extract/transform/load) system written in ruby with a special view towards extracting fields from MARC data and writing them out into Solr. [Jonathan Rochkind](http://bibwild.wordpress.com) and I wrote this primarily out of frustration using other tools in this space (e.g., Solrmarc, or my own precursor to traject, marc2solr). Note: Catmandu is another, Perl-based system; I don't have any direct experience with it. traject had its first release almost a year and a half ago (at least based on the date of my post introducing it), and I’ve used it literally…
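For the curious, here's a sketch of what an index run looks like programmatically under plain MRI. The file names are placeholders, and the writer setting reflects my understanding that the pure-ruby SolrJsonWriter is what makes non-JRuby use possible in 2.0:

```ruby
require 'traject'

# Build an indexer with inline settings; SolrJsonWriter is the pure-ruby
# writer, so no JRuby/solrj is needed. Paths below are placeholders.
indexer = Traject::Indexer.new(
  'solr.url'          => 'http://localhost:8983/solr/catalog',
  'writer_class_name' => 'Traject::SolrJsonWriter'
)
indexer.load_config_file('traject_config.rb') # hypothetical config file
indexer.process(File.open('records.mrc'))     # hypothetical MARC file
```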
How good/bad is MARC data? The case of place-of-publication
I complain a lot about the MARC format, the way people put data in MARC records, the actual data I find in those records, the inexplicably complex syntax for identifiers, and, ironically, attempts to replace MARC with something else. One nice little beacon of hope was when I found that only roughly 0.26% of the ISBNs in the UMich catalog have invalid checksums. That’s not bad at all, and it’s worth digging into other things about which I might be likely to complain before I make a fool of myself. [Note: there will be some complaining at the end.…
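For context, the checksum test behind that 0.26% figure is purely mechanical. A minimal sketch of the ISBN-10 version (my illustration, not the actual audit code):

```ruby
# An ISBN-10 is valid when the weighted sum of its digits (weights 10
# down to 1, with 'X' standing for 10) is divisible by 11.
def valid_isbn10?(isbn)
  chars = isbn.to_s.delete('^0-9Xx').chars
  return false unless chars.size == 10
  sum = 0
  chars.each_with_index do |ch, i|
    sum += (ch.upcase == 'X' ? 10 : ch.to_i) * (10 - i)
  end
  (sum % 11).zero?
end

valid_isbn10?('0-306-40615-2') # => true
```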
Ruby MARC serialization/deserialization revisited
A few years ago, I benchmarked various methods of serializing/deserializing MARC data using the ruby-marc gem. Given that I’m planning on starting fresh with my catalog setup, I thought I’d take a moment to revisit them. The biggest changes since that time have been (a) the continued speed improvements in JRuby, (b) the introduction of the Oj json parser for MRI ruby, and (c) wider availability of msgpack code in the wild. I also wondered what would happen if I tried ruby’s Marshal serialization; maybe it would be faster because I wouldn’t have to "manually" create a MARC::Record object from…
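The shape of the comparison is easy to sketch, even if the real harness covers more formats (Oj, msgpack) than this; 'records.mrc' is a placeholder path:

```ruby
require 'marc'
require 'json'
require 'benchmark'

# Round-trip one record through Marshal and through the marc-in-json
# hash representation, timing each.
record = MARC::Reader.new('records.mrc').first

Benchmark.bm(8) do |bm|
  bm.report('marshal') { 10_000.times { Marshal.load(Marshal.dump(record)) } }
  bm.report('json') do
    10_000.times { MARC::Record.new_from_hash(JSON.parse(JSON.generate(record.to_hash))) }
  end
end
```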
“Schemaless” Solr with dynamicField and copyField
[Holy Kamoly, it’s been a long time since I blogged!] Recent versions of solr have the option to run in what they call "schemaless mode", wherein fields that aren’t recognized are actually added, automatically, to the schema as real named fields. I find this intriguing, but it’s not what I’m after right now. The problem I’m in the first stages of addressing is that my schema.xml is a huge mess — very little consistency, no naming conventions dictating what’s stored/indexed, etc. It grew "organically" (which is what I say when I mean I’ve been lazy and sloppy) and needs a full-on…
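For anyone who hasn't played with these pieces: dynamicField matches field names by pattern, and copyField routes values into other fields. A sketch of the convention-driven style (illustrative names, not the actual Mirlyn schema):

```xml
<!-- Suffix encodes treatment: *_t is tokenized text, *_s an exact
     string. copyField funnels the text fields into one catch-all. -->
<dynamicField name="*_t" type="text_general" indexed="true" stored="true"/>
<dynamicField name="*_s" type="string"       indexed="true" stored="true"/>
<copyField source="*_t" dest="allfields"/>
```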
Help me test yet another LC Callnumber parser
Those who have followed this blog and my code for a while know that I have a long, slightly sad, and borderline abusive relationship with Library of Congress call numbers. They’re a freakin’ nightmare. They just are. But, based on the premise that Sisyphus was a quitter, I took another stab at it, this time writing a real (PEG-) parser instead of trying to futz with extended regular expressions. The results, so far, aren’t too bad. The gem is called lc_callnumber, but more importantly, I’ve put together a little heroku app to let you play with it, and then correct…
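If you'd rather poke at it from code than from the heroku app, usage looks roughly like this; LCCallNumber.parse is my recollection of the gem's entry point, so check the README before trusting it:

```ruby
require 'lc_callnumber'

# Parse a call number and dump whatever structure came back.
cn = LCCallNumber.parse('QA 11.2 .R35 2014')
puts cn.inspect
```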
Announcing “traject” indexing software
[Over the next few days I’ll be writing a series of posts that highlight a new indexing solution by Jonathan Rochkind and myself called traject that we’re using to index MARC data into Solr. This is the introduction.] Wow. Six months since I posted here. What have I been doing? Well, mostly parenting, but in the last few weeks I was lucky enough to get on board with a project started by Jonathan Rochkind for a new JRuby-based tool optimized for indexing MARC data into solr. You know, kinda like solrmarc, but JRuby. What’s it look like? I encourage you…
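To give a flavor without stealing the thunder of the posts to come, here's a minimal sketch of a traject configuration; extract_marc and the settings block are traject's own mechanisms, while the field names and tags here are just illustrative:

```ruby
# traject_config.rb -- a minimal, illustrative configuration
settings do
  provide 'solr.url', 'http://localhost:8983/solr/catalog'
end

# Each to_field maps MARC data to a Solr field via a macro.
to_field 'id',    extract_marc('001', :first => true)
to_field 'title', extract_marc('245ab', :trim_punctuation => true)
```

You'd then kick off a run with something like `traject -c traject_config.rb records.mrc`.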
Please: don’t return your books
So, I’m at code4lib 2013 right now, where side conversations and informal exchanges tend to be the most interesting part. Last night I had a conversation with the inimitable Michael B. Klein, and after complaining about faculty members who keep books out for decades at a time, we ended up asking a simple question: How much more shelving would we need if everyone returned their books? Assuming we could get them all checked in and such, well, where would we put them? I’m looking at this in the simplest, most conservative way possible: Assume they’re all paperbacks, so we don’t…
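The back-of-envelope version of the question, with loudly made-up numbers (none of these figures come from the post):

```ruby
checked_out      = 500_000 # items currently on loan (hypothetical)
inches_per_item  = 1.0     # thin paperbacks, per the conservative assumption
inches_per_shelf = 35.0    # usable width of one shelf (assumed)

shelves = (checked_out * inches_per_item / inches_per_shelf).ceil
puts "About #{shelves} extra shelves" # => About 14286 extra shelves
```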
Requiring/Preferring searches that don’t span multiple values (SST #3)
[Check out the introduction to the Stupid Solr Tricks series if you’re just joining us.] Solr and multiValued fields. Here’s another thing you need to understand about Solr: it doesn’t really have fields that can take multiple values. “But Bill,” you’re saying, “sure it does. I mean, hell, it even has a ‘multiValued’ parameter.” First off: watch your language. Second off: are you sure? Let’s do a quick test. Look at the following documents: // exampledocs/names.json [ { "id":1, "title":"The Monkees", "name_text":[ "Peter Tork", "Mike Nesmith", "Micky Dolenz", "Davy Thomas Jones" ] }, { "id":2, "title":"Heroes of the Wild West", "name_text":[…
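Where this entry is headed, in one breath (my summary of the standard trick, not a quote from the post): Solr flattens multiple values into one token stream separated only by a positionIncrementGap, so make the gap big and a phrase query with smaller slop can't match across two values:

```xml
<!-- With a gap of 100 between values, name_text:"Tork Nesmith"~50
     can't match across two different names, while
     name_text:"Mike Nesmith"~50 still matches within one. -->
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>
```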
Solr and boolean operators
[Summary: ALWAYS ALWAYS ALWAYS USE PARENTHESES TO GROUP BOOLEANS IN SOLR!!!] What does Solr do, given the following query? a OR b AND c I’ll give you three guesses, but you’ll get the first two wrong and won’t have any idea how to generate a third, so don’t spend too much time on it. Boolean algebra and operator precedence. Anyone who’s had even a passing introduction to boolean algebra knows that it specifies a strict order to how the operators are bound: NOT before AND before OR. So, one might expect the following grouping: a OR (b AND c) That’s…
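The moral, in query form (if you want to see how Solr actually parsed your query, add debugQuery=true to the request):

```
a OR b AND c     <- ambiguous; don't trust your intuition about it
a OR (b AND c)   <- what a boolean-algebra reading would expect
(a OR b) AND c   <- say this explicitly if it's what you mean
```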
Even better, even simpler multithreading with JRuby
[Yes, another post about ruby code; I’ll get back to library stuff soon.] Quite a while ago, I released a little gem called threach (for “threaded #each”). It allows you to easily process a block with multiple threads. # Process a CSV file with three threads File.open('data.csv').threach(3, :each_line) { |line| send_to_db(line) } Nice, right? The problem is that I could never figure out a way to deal with a break or an Exception raised inside the block. The core problem is that once a thread trying to push/pop from a ruby SizedQueue is blocking, there’s no way (I could find) to tell…
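To make the failure mode concrete, here's a sketch of the producer/consumer pattern a gem like threach wraps (not threach's actual source; send_to_db is the hypothetical handler from the example above). The trouble is that a thread blocked inside q.push or q.pop can't cleanly be told to stop:

```ruby
require 'thread'

q = SizedQueue.new(10)

# N consumers pop lines until they see the :done marker.
workers = 3.times.map do
  Thread.new do
    while (line = q.pop) != :done
      send_to_db(line)
    end
  end
end

# The producer pushes work, then one :done per worker.
File.open('data.csv').each_line { |line| q.push(line) }
workers.size.times { q.push(:done) }
workers.each(&:join)
```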
How good is our relevancy ranking?
For those of us who spend our days trying to tweak Mirlyn to make it better, one of the most important — and, in many ways, most opaque — questions is, “How good is our relevancy ranking?” Research from the UMich Library’s Usability Group (pdf; 600k) points to the importance of relevancy ranking for both known-item searches and discovery, but mapping search terms to the “best” results involves crawling deep inside the searcher’s head to know what she’s looking for. So, what can we do? Record interaction as a way of showing interest. One possibility is to look at those…
A short ruby diversion: cost of flow control under Ruby
A couple days ago I decided to finally get back to working on threach to try to deal with the problems it had — essentially, it didn’t deal well with non-local exits due to calls to break or even something simple like a NoMethodError. [BTW, I think I managed it. As near as I can tell, threach version 0.4 won’t deadlock anymore.] Along the way, while trying to figure out how threads affect the behavior of different non-local exits, I noticed that in some cases there was still work being done by one or more threads long after there was an…
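The micro-benchmark shape I mean, sketched (not the post's actual harness or numbers); the relative costs of the exits are the interesting part:

```ruby
require 'benchmark'

n = 1_000_000
Benchmark.bm(12) do |bm|
  bm.report('plain loop')   { n.times { } }
  bm.report('catch/throw')  { n.times { catch(:done) { throw :done } } }
  bm.report('begin/rescue') { n.times { begin; raise 'x'; rescue; end } }
end
```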
ISBN parenthetical notes: Bad MARC data #1
Yesterday, I gave a brief overview of why free text is hard to deal with. Today, I’m turning my attention to a concrete example that drives me absolutely batshit crazy: taking a perfectly good unique-id field (in this case, the ISBN in the 020) and appending stuff onto the end of it. The point is not to mock anything. Mocking will, however, be included for free. What’s supposed to be in the 020? Well, for starters, an ISBN (10 or 13 digit, we’re not picky). Let’s not worry, for the moment, about the actual ISBN and whether it’s valid or…
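To make the pain concrete, here's a sketch of the defensive parsing this forces on every consumer of the field (my illustration, not code from the post):

```ruby
# Pull a plausible ISBN off the front of an 020$a that has a
# parenthetical note glued onto the end of it.
def isbn_from_020a(subfield_a)
  m = subfield_a.match(/\A\s*([0-9]{13}|[0-9]{9}[0-9Xx])/)
  m && m[1]
end

isbn_from_020a('0306406152 (pbk. : alk. paper)') # => "0306406152"
```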
Why programmers hate free text in MARC records
One of the frustrating things about dealing with MARC (née AACR2) data is how much nonsense is stored in free text when a unique identifier in a well-defined place would have done a much better job. A lot of people seem not to understand why. This post, then, is for all the catalogers out there who constantly answer my questions with, “Well, it depends” and don’t understand why that’s a problem. Description vs. findability. I’m surprised — and a little dismayed — by how often I talk to people in the library world who don’t understand the difference between description…
Does anyone use those prev/next/back-to-search links?
There’s a common problem among developers of websites that paginate, including OPACs: how do you provide a single item view that can have links that go back to the search (or to the prev/next item) without making your URLs look ugly? The fundamental problem is that as soon as your user opens up a couple searches in separate tabs, your session data can’t keep track of which search she wants to “go back to” unless you put some random crap in the URL, which none of us want to do. But let’s take three giant steps backwards before we throw…
Size/speed of various MARC serializations using ruby-marc
Ross Singer recently updated ruby-marc to include a #to_hash method that creates a data structure that is (a) round-trippable without any data loss, and (b) amenable to serializing to JSON. He’s calling it marc-in-json (even though the serialization is up to the programmer, it’s expected most of us will use JSON), and I think it’s the way to go in terms of JSON-able MARC data. I wanted to take a quick look at the space/speed tradeoffs of using various means to serialize MARC records in the marc-in-json format compared to using binary MARC-21. Why bother? Binary MARC-21 is “broken” in…
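The round-trip property in a few lines ('records.mrc' is a placeholder path); #to_hash and .new_from_hash are the ruby-marc methods in question:

```ruby
require 'marc'
require 'json'

# to_hash -> JSON -> hash -> Record, with nothing lost along the way.
record = MARC::Reader.new('records.mrc').first
json   = JSON.generate(record.to_hash)
again  = MARC::Record.new_from_hash(JSON.parse(json))
record.to_hash == again.to_hash # => true
```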
Simple Ruby gem for dealing with ISBN/ISSN/LCCN
I needed some code to deal with ISBN10->ISBN13 conversion, so I put in a few other functions and wrapped it all up in a gem called library_stdnums. It’s only 100 lines of code or so and some specs, but I put it out there in case others want to use it or add to it. Pull requests at the github repo are welcome. Functionality is all as module functions, as follows:

ISBN
  char = StdNum::ISBN.checkdigit(ten-or-thirteen-digit-isbn)
  boolean = StdNum::ISBN.valid?(ten-or-thirteen-digit-isbn)
  thirteenDigitISBN = StdNum::ISBN.convert_to_13(ten-or-thirteen-digit-isbn)
  tenDigitISBN = StdNum::ISBN.convert_to_10(ten-or-thirteen-digit-isbn)

ISSN
  char = StdNum::ISSN.checkdigit(issn)
  boolean = StdNum::ISSN.valid?(issn)

LCCN
  normalizedLCCN = StdNum::LCCN.normalize(lccn)

Again, there’s nothing special here…
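A quick usage sketch (the ISBN is the standard example number; the exact output format of convert_to_13 is from memory, so double-check it against the specs):

```ruby
require 'library_stdnums'

StdNum::ISBN.valid?('0-306-40615-2')     # => true
StdNum::ISBN.convert_to_13('0306406152') # => "9780306406157"
```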
Solr: Forcing items with all query terms to the top of a Solr search
[Note: I’ve since made a better explanation of, and solution for, this problem.] Here at UMich, we’re apparently in the minority in that we have Mirlyn, our catalog discovery interface (a very hacked version of VuFind), set up to find records that match only a subset of the query terms. Put more succinctly: everyone else seems to join all terms with ‘AND’, whereas we do a DisMax variant on ‘OR’. Now, I’m actually quite proud of how our searching behaves. Reference desk anecdotes and our statistics all point to the idea that people tend to find what they’re looking for.…
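For a taste of the kind of fix available (my illustration, not the better solution linked above): keep mm permissive so partial matches remain findable, but add a boost query that re-runs the user's terms requiring all of them, floating full matches to the top:

```
defType=edismax
q=civil war regiments
mm=1
bq={!edismax mm=100% v=$q}
```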