I just released another (this time pretty good) version of my gem for normalizing/validating library standard numbers, library_stdnums (github source / docs). The short version of the functions available:

ISBN: get checkdigit, validate, convert isbn10 to/from isbn13, normalize (to 13-digit)
ISSN: get checkdigit, validate, normalize
LCCN: validate, normalize

Validation of LCCNs doesn’t involve a checkdigit; I basically just normalize whatever is sent in and then see if the result is syntactically valid. My plan in my Copious Free Time is to do a Java version of these as well and then stick them into a new-style Solr v.3 filter so…
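Not the gem's internals, but for the curious, a minimal sketch of the check-digit math behind the ISBN functions listed above (the helper names here are mine, not the gem's):

    # Hypothetical helpers sketching the ISBN-13 math the gem wraps;
    # not library_stdnums' actual implementation.

    # Check digit for the first twelve digits of an ISBN-13:
    # alternating weights of 1 and 3, mod 10.
    def isbn13_checkdigit(digits12)
      sum = digits12.chars.each_with_index.inject(0) do |acc, (ch, i)|
        acc + ch.to_i * (i.even? ? 1 : 3)
      end
      ((10 - sum % 10) % 10).to_s
    end

    # ISBN-10 to ISBN-13: prefix "978" to the first nine digits,
    # then recompute the check digit.
    def isbn10_to_13(isbn10)
      body = '978' + isbn10.delete('^0-9Xx')[0, 9]
      body + isbn13_checkdigit(body)
    end

    isbn10_to_13('0-306-40615-2')  # => "9780306406157"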
A short ruby diversion: cost of flow control under Ruby
A couple days ago I decided to finally get back to working on threach to try to deal with problems it had — essentially, it didn’t deal well with non-local exits due to calls to break or even something simple like a NoMethodError. [BTW, I think I managed it. As near as I can tell, threach version 0.4 won’t deadlock anymore] Along the way, while trying to figure out how threads affect the behavior of different non-local exits, I noticed that in some cases there was still work being done by one or more threads long after there was an…
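That failure mode is easy to demonstrate by hand. Here's a minimal sketch (not threach's actual code) of the producer/consumer deadlock: once the only consumer dies mid-stream, the producer blocks forever on a full SizedQueue.

    require 'thread'

    queue = SizedQueue.new(2)

    # Consumer dies partway through, as if the block had hit a
    # break or a NoMethodError.
    Thread.new do
      loop do
        item = queue.pop
        raise NoMethodError, 'boom' if item == 3
      end
    end

    producer = Thread.new do
      (1..10).each { |i| queue.push(i) }  # blocks once the queue fills
    end

    sleep 1
    puts "producer still alive? #{producer.alive?}"  # => true: stuck forever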
ISBN parenthetical notes: Bad MARC data #1
Yesterday, I gave a brief overview of why free text is hard to deal with. Today, I’m turning my attention to a concrete example that drives me absolutely batshit crazy: taking a perfectly good unique-id field (in this case, the ISBN in the 020) and appending stuff onto the end of it. The point is not to mock anything. Mocking will, however, be included for free. What’s supposed to be in the 020? Well, for starters, an ISBN (10 or 13 digit, we’re not picky). Let’s not worry, for the moment, about the actual ISBN and whether it’s valid or…
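To make it concrete, here's a hedged sketch of the cleanup such fields force on everyone downstream: fishing a plausible ISBN out of an 020 $a with notes glued on. The field values are invented examples, and hyphenated ISBNs are ignored for brevity.

    # Pull the first thing that looks like an ISBN (13 digits, or
    # 9 digits plus a digit/X check character) out of free text.
    def extract_isbn(subfield_a)
      subfield_a[/\d{13}|\d{9}[\dXx]\b/]
    end

    extract_isbn('0306406152 (pbk. : alk. paper)')  # => "0306406152"
    extract_isbn('9780306406157 (v. 1)')            # => "9780306406157"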
Why programmers hate free text in MARC records
One of the frustrating things about dealing with MARC (nee AACR2) data is how much nonsense is stored in free text when a unique identifier in a well-defined place would have done a much better job. A lot of people seem to not understand why. This post, then, is for all the catalogers out there who constantly answer my questions with, “Well, it depends” and don’t understand why that’s a problem.

Description vs Findability

I’m surprised — and a little dismayed — by how often I talk to people in the library world who don’t understand the difference between description…
Corrected Code4Lib slides are up
…at the same URL. I was, to put it mildly, incredibly excited about code4lib this year because, for once, I thought I had something to say. And I did have something to say. And I said it. But it was wrong. I presented a bunch of statistics drawn from nearly a year of Mirlyn logs. The most outlandish of my assertions, and the one that eventually turned out to be the most incorrect, was that some 45% of all our user sessions consist of only one action: a search. Unfortunately, I’d missed a whole swath of things I should have…
[RETRACTED] Code4Lib 2011 Lightning Talk Slides
DANGER! I was trying to re-verify my numbers and found a glaring and hugely important mistake. I’ll make a new post with the details, but basically I was counting about 180k sessions (out of only 735k) that I should have been ignoring. Please ignore my basic stats until further notice. See the new numbers and corrected slides for more accurate data. I did a little Lightning Talk at Code4Lib 2011 and cleaned up (and heavily annotated) my slides for anyone interested in them. The focus was on some basic stats about usage of our OPAC, Mirlyn, in calendar 2010. I’ll…
Four things I hate about Ruby
Don’t get me wrong. I use ruby as my default language when possible. I love JRuby in a way that’s illegal in most states. But there are…issues. There are with any language and the associated environment. These are the ones that bug the crap out of me.

Ruby is slow

Let’s get this one out of the way right away. Ruby (at least the MRI 1.8.x implementation) is, for many things, slow. Sometimes not much slower. Sometimes (e.g., numerics) a hell of a lot slower. Now, there’s nothing necessarily wrong with that. For what I do, MRI Ruby is usually…
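For a sense of what "slow at numerics" means in practice, here's a toy benchmark of the sort people wave around (not from the post; absolute numbers will vary wildly by machine and Ruby implementation):

    require 'benchmark'

    # Sum of squares over five million iterations: the kind of tight
    # numeric loop where MRI 1.8 lags far behind JRuby or C.
    puts Benchmark.measure {
      sum = 0
      1.upto(5_000_000) { |i| sum += i * i }
    }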
Does anyone use those prev/next/back-to-search links?
There’s a common problem among developers of websites that paginate, including OPACs: how do you provide a single item view that can have links that go back to the search (or to the prev/next item) without making your URLs look ugly? The fundamental problem is that as soon as your user opens up a couple searches in separate tabs, your session data can’t keep track of which search she wants to “go back to” unless you put some random crap in the URL, which none of us want to do. But let’s take three giant steps backwards before we throw…
Size/speed of various MARC serializations using ruby-marc
Ross Singer recently updated ruby-marc to include a #to_hash method that creates a data structure that is (a) round-trippable without any data loss, and (b) amenable to serializing to JSON. He’s calling it marc-in-json (even though the serialization is up to the programmer, it’s expected most of us will use JSON), and I think it’s the way to go in terms of JSON-able MARC data. I wanted to take a quick look at the space/speed tradeoffs of using various means to serialize MARC records in the marc-in-json format compared to using binary MARC-21.

Why bother?

Binary MARC-21 is “broken” in…
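As a sketch of the round trip being measured (assuming a local test.mrc of binary records; #to_hash and MARC::Record.new_from_hash are the marc-in-json hooks in ruby-marc):

    require 'marc'
    require 'json'

    MARC::Reader.new('test.mrc').each do |record|
      json  = JSON.generate(record.to_hash)                 # marc-in-json out
      again = MARC::Record.new_from_hash(JSON.parse(json))  # and back in
      raise 'lossy round trip!' unless again.to_s == record.to_s
    end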
VuFind Midwest gathering
A couple weeks ago, representatives from UMich (that’d be me), Purdue, Notre Dame, UChicago, and our hosts at Western Michigan got together in lovely Kalamazoo to talk about our VuFind implementations. Eric Lease Morgan already wrote up his notes about the meeting, and I encourage you to go there for more info, but I’ll add my two cents here. So, in light of that meeting, here’s what I’m thinking about VuFind of late: None of us are running VuFind 1.0 as released with full catalog data. Eric has a special purpose portal running the current code over an aggregated special…
Simple Ruby gem for dealing with ISBN/ISSN/LCCN
I needed some code to deal with ISBN10->ISBN13 conversion, so I put in a few other functions and wrapped it all up in a gem called library_stdnums. It’s only 100 lines of code or so and some specs, but I put it out there in case others want to use it or add to it. Pull requests at the github repo are welcome. Functionality is all as module functions, as follows:

ISBN

    char = StdNum::ISBN.checkdigit(ten-or-thirteen-digit-isbn)
    boolean = StdNum::ISBN.valid?(ten-or-thirteen-digit-isbn)
    thirteenDigitISBN = StdNum::ISBN.convert_to_13(ten-or-thirteen-digit-isbn)
    tenDigitISBN = StdNum::ISBN.convert_to_10(ten-or-thirteen-digit-isbn)

ISSN

    char = StdNum::ISSN.checkdigit(issn)
    boolean = StdNum::ISSN.valid?(issn)

LCCN

    normalizedLCCN = StdNum::LCCN.normalize(lccn)

Again, there’s nothing special here…
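A quick usage sketch against the functions listed above (the example identifiers are mine, not from the post):

    require 'library_stdnums'

    StdNum::ISBN.valid?('0-306-40615-2')      # => true
    StdNum::ISBN.convert_to_13('0306406152')  # => "9780306406157"
    StdNum::ISSN.checkdigit('0378-5955')      # => "5"
    StdNum::LCCN.normalize('n  78-890351')    # normalized LCCN string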
Solr: Forcing items with all query terms to the top of a Solr search
[Note: I’ve since made a better explanation of, and solution for, this problem.] Here at UMich, we’re apparently in the minority in that we have Mirlyn, our catalog discovery interface (a very hacked version of VuFind), set up to find records that match only a subset of the query terms. Put more succinctly: everyone else seems to join all terms with ‘AND’, whereas we do a DisMax variant on ‘OR’. Now, I’m actually quite proud of how our searching behaves. Reference desk anecdotes and our statistics all point to the idea that people tend to find what they’re looking for.…
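The general trick, sketched here in plain Lucene query syntax (and not necessarily the better solution the note above points to), is to keep the permissive OR query for recall while OR-ing in a heavily boosted all-terms clause so full matches float to the top:

    # Hypothetical query construction: the boost factor and terms
    # are invented for illustration.
    terms = %w[civil war maps]
    q = "(#{terms.join(' OR ')}) OR (#{terms.join(' AND ')})^10"
    # => "(civil OR war OR maps) OR (civil AND war AND maps)^10"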
Why RDA is doomed to failure
[Note: edited for clarity thanks to rsinger’s comment, below] Doomed, I say! DOOOOOOOOOOMMMMMMMED! My reasoning is simple: RDA will fail because it’s not “better enough.” Now, those of you who know me might be saying to yourselves, “Waitjustaminute. Bill doesn’t know anything at all about cataloging, or semantic representations, or the relative merits of various encapsulations of bibliographic metadata. I mean, sure, he knows a lot about…err….hmmm…well, in any case, he’s definitely talking out of his ass on this one.” First off, thanks for having such a long-winded internal monologue about me; it’s good to be thought of. And, of…
Data structures and Serializations
Jonathan Rochkind, in response to a long (and, IMHO, mostly ridiculous) thread on NGC4Lib, has been exploring the boundaries between a data model and its expression/serialization (see here, here, and here) and I thought I’d jump in.

What this post is not

There’s a lot to be said about a good domain model for bibliographic data. I’m so not the guy to say it. I know there are arguments for and against various aspects of the AACR2 and RDA and FRBR, and I’m unable to go into them. What I am comfortable saying is this: Anyone advocating or…
Stupid catalog tricks: Subject Headings and the Long Tail
Library of Congress Subject Headings (LCSH) in particular. I’ve always been down on LCSH because I don’t understand them. They kinda look like a hierarchy, but they’re not really. Things get modifiers. Geography is inline and …weird. And, of course, in our faceting catalog when you click on a linked LCSH to do an automatic search, you often get nothing but the record you started from. Which is super-annoying. So, just for kicks, I ran some numbers.

The process

I extracted all the 650 fields with indicator2="0" from our catalog, threw away the subfield 6’s, and threw away any trailing punctuation…
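In ruby-marc terms, that extraction looks roughly like this (a sketch assuming a local file of binary MARC records, counting occurrences per heading):

    require 'marc'

    headings = Hash.new(0)
    MARC::Reader.new('catalog.mrc').each do |record|
      record.fields('650').each do |field|
        # Only LCSH headings: second indicator "0"
        next unless field.indicator2 == '0'
        # Drop $6 linkage subfields, join the rest
        text = field.subfields.reject { |sf| sf.code == '6' }
                    .map(&:value).join(' ')
        # Strip trailing punctuation before counting
        headings[text.sub(/[\s.,;:]+\z/, '')] += 1
      end
    end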
Why bother with threading in jruby? Because it’s easy.
[Edit 2011-July-1: I’ve written a jruby-specific threach, called jruby_threach, that takes advantage of better underlying Java libraries and is a much better option if you’re running jruby] Lately on the #code4lib IRC channel, several of us have been knocking around different versions (in several programming languages) of programs to read in a ginormous file and do some processing on each line. I noted some speedups related to multi-threading, and someone (maybe rsinger?) said, basically, that to bother with threading for a one-off simple program was a waste. Well, it turns out I’ve been trying to figure out how to deal…
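The payoff looks something like this (a sketch assuming threach's Enumerable mixin; the file name and the per-line work are invented):

    require 'threach'

    # Four worker threads each pull lines as they're read; under
    # JRuby these actually run in parallel.
    File.open('ginormous_file.txt') do |f|
      f.threach(4) do |line|
        puts line.split("\t").first  # stand-in for real per-line work
      end
    end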
Pushing MARC to Solr; processing times and threading and such
[This is in response to a thread on the blacklight mailing list about getting MARC data into Solr.]

What’s the question?

The question came up, “How much time do we spend processing the MARC vs trying to push it into Solr?”. Bob Haschart found that even with a pretty damn complicated processing stage, pushing the data to solr was still, at best, taking at least as long as the processing stage. I’m interested because I’ve been struggling to write a solrmarc-like system that runs under JRuby. Architecturally, the big difference between my stuff and solrmarc is that I use the…
ruby-marc with pluggable readers
I’ve been messing with easier ways of adding parsers to ruby-marc’s MARC::Reader object. The idea is that you can do this:

    require 'marc'
    require 'my_marc_stuff'

    mbreader = MARC::Reader.new('test.mrc')  # => Stock marc binary reader
    mbreader = MARC::Reader.new('test.mrc', :readertype => :marcstrict)  # => ditto

    MARC::Reader.register_parser(My::MARC::Parser, :marcstrict)
    mbreader = MARC::Reader.new('test.mrc')  # => Uses My::MARC::Parser now

    xmlreader = MARC::Reader.new('test.xml', :readertype => :marcxml)

    # ...and maybe further on down the road
    asreader = MARC::Reader.new('test.seq', :readertype => :alephsequential)
    mjreader = MARC::Reader.new('test.json', :readertype => :marchashjson)

A parser need only implement #each and a module-level method #decode_from_string. Read all about it on the github page.
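For example, a skeletal parser satisfying that contract might look like this (purely illustrative; the real plugin API may differ in its details):

    module My
      module MARC
        class Parser
          # Turn one raw record string into a MARC::Record (stubbed here).
          def self.decode_from_string(str)
            # ... parsing logic ...
          end

          def initialize(io)
            @io = io
          end

          # Yield a record for each raw record in the input.
          def each
            @io.each_line { |raw| yield self.class.decode_from_string(raw) }
          end
        end
      end
    end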
New interest in MARC-HASH / JSON
EDIT: This is historical — the recommended serialization for marc in json is now Ross Singer’s marc-in-json. The marc-in-json serialization has implementations in the core marc libraries for Ruby and PHP, and add-ons for Perl and Java. C’mon, Python people! For reasons I’m still not entirely clear on (I wasn’t there), the Code4Lib 2010 conference this week inspired renewed interest in a JSON-based format for MARC data. When I initially looked at MARC-HASH almost a year ago, I was mostly looking for something that wasn’t such a pain in the butt to work with, something that would marshal into multiple…
OCLC still not (NO! They are!) normalizing their LCCNs
NOTE 2: It turns out that I did find a minor bug in the system, but that in general LCCN normalization is working correctly. I just happened to hit a weirdness with a bad LCCN and a little bug in the parser on their end. Which is getting fixed. So…good news all around, and huge kudos to Xiaoming Liu for his quick response! NOTE: It strikes me that I haven't seen a case where bad data results from sending a valid LCCN. The only verified problem is one of false negatives. Send a valid LCCN, you'll get back either good…
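For reference, LC's documented normalization is simple enough to sketch (hedged; this mirrors what StdNum::LCCN.normalize does, give or take, and the function name here is mine):

    # LCCN normalization per LC's documentation: lowercase, strip
    # blanks, drop anything from a forward slash on, and zero-pad
    # the post-hyphen serial number to six digits.
    def normalize_lccn(lccn)
      n = lccn.downcase.gsub(/\s+/, '').sub(%r{/.*\z}, '')
      if n =~ /\A(.*?)(\d+)-(\d+)\z/
        "#{$1}#{$2}#{$3.rjust(6, '0')}"
      else
        n
      end
    end

    normalize_lccn('n  78-890351')  # => "n78890351"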