Author: Bill Dueber

Indexing data into Solr via JRuby (with threads!)

[Note: in this post I’m just going to focus on the “get stuff into Solr” part. My normal focus — MARC data — will make an appearance in the next post when I talk about using this in addition to / instead of solrmarc.] Working with Solr I love me the Solr. I love everything about it except that the best way to interact with it is via Java. I don’t so much love me the java. So…taking Erik Hatcher’s lead and advice, as I will do whenever he offers either, I wrote some code to work within JRuby to…

Comments closed

jruby_producer_consumer dead-simple producer/consumer for JRuby

Yea! My first gem ever released! [YUCK! It was a disaster in a few ways! Don’t look at this! It’s hideous! There’s a new jruby_producer_consumer gem on gemcutter that is slightly different from this in that it works. Ignore the stuff below.] [In working on a threaded JRuby-based MARC-to-Solr project, I realized that my threading stuff was…ugly. And I didn’t really understand it. So I dug in today and wrote this.] I’ve just pushed to Gemcutter my first gem: a JRuby-only producer/consumer class, called jruby_producer_consumer, that works with anything that provides #each. It’s JRuby-only because it uses (a) A…

Still another look at MARC parsing in ruby and jruby

I’ve been looking at making a jruby-based solr indexer for MARC documents, and started off wanting to make sure I could determine if anything I did would be faster than our existing (solrmarc-based) setup. Assertion: The upper bound on how fast I can process records and send them to Solr can be approximated by looking at how fast I can parse (and do nothing else to) marc records from a file. Assertion: If I can’t write a system that’s faster than what we have now, it’s probably not worth my time even though being able to fall back to ruby…

Beta version of the HathiTrust Volumes API available

MAJOR CHANGE So, initially, this post listed that the way to separate multiple simultaneous requests was with a nice, URL-like slash (/) character. Then, I remembered that LCCNs can have embedded slashes, e.g., 65063380//r85. So, we’re back to using pipe (|) characters to separate multiple calls — the examples below have been updated to reflect this. Introduction I’ve put up a beta version of the HathiTrust Volumes API previously discussed on this blog and via email. Currently, I’ve only got json output, although there is space in there for other output formats as necessary. What exactly is this? Given: an…

Running Blacklight under JRuby

I decided to see if I could get Blacklight working under JRuby, starting with running the test suite and working my way up from there. There was much pain. Much, much pain. Exacerbated by my almost complete lack of knowledge about what I was doing. This is the procedure I eventually arrived at — if there are places where I made trouble for myself, please let me know! [And does anyone know how to get jruby’s nokogiri to link to a different libxml and stop with the crappy libxml2-version error message every time I run it under OSX???] Download jruby…

Setting up your OPAC for Zotero support using unAPI

unAPI is a very simple protocol to let a machine know what other formats a document is available in. Zotero is a bibliographic management tool (like Endnote or Refworks) that operates as a Firefox plugin. And it speaks unAPI. Let’s get them to play nice with each other! How’s it all work? Zotero looks for a well-constructed <link> tag in the head of the page. It checks the document on the other side of that link to see what formats are offered, and picks one to use. No, you can’t decide which one it uses. It picks. Zotero then looks…

Thinking through a simple API for HathiTrust item metadata

EDITS: Added “recordURL” per Tod’s request. Made a record’s title field an array and called it titles, to allow for vernacular entries. Changed item’s ingest to lastUpdate to accurately note what the actual date reflects. This gets updated every time either the item or the record to which it’s attached gets changed. Fixed a couple typos, including one where I substituted an ampersand for a pipe in the multi-get example (thanks again, Tod). Added a better explanation of option #4. Introduction and History Ages ago, I wrote a simple(ish) little cgi program to get basic item-level data out of what…

Adding LibXML and Java STAX support to ruby-marc with pluggable XML parsers

JRuby is my ruby platform of choice, mostly because I think its deployment options in my work environment are simpler (perhaps technically and certainly politically), but also because I have high, high hopes to use lots of super-optimized native java libraries. The CPAN is what keeps me tethered to Perl, and whether or not you like Java-the-language, boy, are there a lot of high-quality libraries out there. Since I’ve been messing around with MARC-XML parsing of late, and since Ross Singer added pluggable xml-parser awesomeness to the ruby-marc project, I thought I’d see what I could do with native Java…

An exercise in Solr and DataImportHandler: HathiTrust data

Many of the folks who read this blog (hi, both of you! Mom, say hello to Dad!) are aware, at least tangentially, of the HathiTrust. Currently hosted by us at the University of Michigan, the most public interface to its data is a VuFind installation you can access at catalog.hathitrust.org (or, for you smart-phone types, at m.catalog.hathitrust.org). Once you do a metadata search, you get links into the actual page images or a chance to search the fulltext of the selected item (depending on its copyright status). It’s awesome. Seriously. Even in the absence of fulltext, being able to search…

Dead-easy (but extreme) AJAX logging in our VuFind install

One of the advantages of having complete control over the OPAC is that I can change things pretty easily. The downside of that is that we need to know what to change. Many of you that work in libraries may have noticed that data are not necessarily the primary tool in decision-making. Or, say, even a part of the process. Or even thought about hard. Or even considered. For many decisions I see going on in the library world, the primary motivator is the anecdote. In fact, to be honest, the primary driver is the faculty anecdote. Those cliched three curmudgeonly…

The sad truths about journal bundle prices

[Notes taken during a talk today, Ted Bergstrom: “Some Economics of Saying Nix To Big Deals and the Terrible Fix”. My own thoughts are interspersed throughout; please don’t automatically ascribe everything to Dr. Bergstrom. Check out his stuff at Ted Bergstrom’s home page.] Journals are a weird market — libraries buy as agents of professors, using someone else’s money, in deals of enormous complexity and uncertain value from companies that basically have a monopoly. Similar to a few other situations: doctors prescribe drugs for patients using insurance money. Professors assign textbooks to students whose parents (in general) buy them. In…

More Ruby MARC Benchmarks: Adding in MARC-XML

It turns out that UVA’s reluctance to use the raw MARC data on the search results screen is driven more by processing time than parsing time. Even if they were to start with a fully-parsed MARC object, they’re doing enough screwing around with that data that the bottleneck on their end appears to be all the regex and string processing, not the parsing. Their specs for what gets displayed are complex enough that they want to do the work up-front. But I remain interested, at least partially because of the reason UVA is using MARC-XML: they have MARC records too…

Benchmarking MARC record parsing in Ruby

[Note: since I started writing this, I found out Bess & Co. store MARC-XML. That makes a difference, since XML in Ruby can be really, really slow] [UPDATE It turns out they don’t use MARC-XML. They use MARC-Binary just like the rest of us. Oops. ] [UP-UPDATE Well, no, they do use MARC-XML. I’m not afraid to constantly change my story. This is why I’m the best investigative reporter in the business] The other day on the blacklight mailing list, Bess Sadler wrote Yes, we do still include the full marc record, but the rule of thumb we’re currently using…

Building a solr text filter for normalizing data

[Kind of part of a continuing series on our VUFind implementation; more of a sidebar, really.] In my last post I made the case that you should put as much data normalization into Solr as possible. The built-in text filters will get you a long, long way, but sometimes you want to have specialized code, and then you need to build your own filter. Huge Disclaimer: I’m putting this up not because I’m the best person to do so, but because it doesn’t look as if anyone else has. I don’t know what I’m doing. I don’t know why the…

Going with and “forking” VUFind

Note: This is the second in a series I’m doing about our VUFind installation, Mirlyn. Here I talk about how we got to where we are. Next I’ll start looking at specific technologies, how we solved various problems, and generally more nerd-centered stuff. When the University Library decided to go down the path of an open-source, solr-based OPAC, there were (and are, I guess) two big players: VUFind and Blacklight. I wasn’t involved in the decision, but it must have seemed like a no-brainer. VUFind was in production (at Villanova), seemed to be building a community of similar institutions around…

Easy Solr types for library data

[Yet another bit in a series about our Vufind installation] While I’m no longer shocked at the terrible state of our data every single day, I’m still shocked pretty often. We figured out pretty quickly that anything we could do to normalize data as it went into the Solr index (and, in fact, as queries were produced) would be a huge win. There’s a continuum of attitudes about how much “business logic” belongs in the database layer of any application. Some folks (including super-high throughput sites, but mostly people who have never used anything but MySQL) tend to…

Sending unicode email headers in PHP

I’m probably the last guy on earth to know this, but I’m recording it here just in case. I’m sending record titles in the subject line of emails, and of course they may contain Unicode. The body takes care of itself, but you need to explicitly encode a header like “Subject.”

    $headers['To'] = $to;
    $headers['From'] = $from;
    $headers['Content-Type'] = "text/plain; charset=utf-8";
    $headers['Content-Transfer-Encoding'] = "8bit";
    $b64subject = "=?UTF-8?B?" . base64_encode($subject) . "?=";
    $headers['Subject'] = $b64subject;
    $mail = Mail::factory('sendmail', array('host' => $host, 'port' => $port));
    $retval = $mail->send($to, $headers, $body);
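
The encoded-word trick in that snippet isn't PHP-specific. Here is a minimal Ruby sketch of the same idea, using only the standard library. The helper names `encode_subject` and `decode_subject` are made up for illustration and don't come from any mail library. One caveat: RFC 2047 caps each encoded-word at 75 characters, so a long title would have to be split across several encoded-words, and this sketch doesn't attempt that.

```ruby
# Hypothetical helpers illustrating RFC 2047 "B" (base64) encoded-words.
# These names are illustrative only; they are not from any mail library.

# Wrap a string as a single UTF-8 base64 encoded-word.
def encode_subject(subject)
  # pack("m0") produces base64 with no embedded line breaks.
  b64 = [subject.encode("UTF-8")].pack("m0")
  "=?UTF-8?B?#{b64}?="
end

# Reverse the encoding. Anything that is not a single UTF-8 "B"
# encoded-word is returned unchanged.
def decode_subject(header)
  match = header.match(/\A=\?UTF-8\?B\?([A-Za-z0-9+\/=]*)\?=\z/i)
  return header unless match

  match[1].unpack1("m0").force_encoding("UTF-8")
end

title = "Die Verwandlung: Erz\u00e4hlung"
encoded = encode_subject(title)
puts encoded
puts decode_subject(encoded) == title
```

The encoded value is plain ASCII, which is the whole point: the header can travel through mail systems that choke on raw 8-bit bytes.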

Rolling out UMich’s “VUFind”: Introduction and New Features

For the last few months, I’ve been working on rolling out a ridiculously modified version of Vufind, which we just launched as our primary OPAC, Mirlyn, with a slightly different version powering catalog.hathitrust.org, a temporary metadata search on the HathiTrust data until the OCLC takes it over at some undetermined date. (Yeah, the HathiTrust site is a lot better looking.) [Our Aleph-based catalog lives on at mirlyn-classic; I’ll be interested to see how the traffic on the two differs as time goes on.] I’m going to spend a few posts talking about how and why we essentially forked vufind, what sorts…

Sending MARC(ish) data to Refworks

Refworks has some okish documentation about how to deal with its callback import procedure, but I thought I’d put down how I’m doing it for our vufind install (mirlyn2-beta.lib.umich.edu) in case other folks are interested. The basic procedure is: (1) Send your user to a specific RefWorks URL along with a callback URL that can enumerate the record(s) you want to import in a supported form. (2) Your user logs in (if need be) and gets to her RefWorks page. (3) RefWorks calls up your system and requests the record(s). (4) The import happens, and your user does whatever she wants to do with them…

MARC-HASH: The saga continues (now with even less structure)

After a medium-sized discussion on #code4lib, we’ve collectively decided that…well, OK, no one really cares all that much, but a few people weighed in. The new format is: a list of arrays. If it’s got two elements, it’s a control field; if it’s got four, it’s a data field. So…it’s like this now. { "type" : "marc-hash", "version" : [1, 0], "leader" : "leader string", "fields" : [ ["001", "001 value"], ["002", "002 value"], ["010", " ", " ", [ ["a", "68009499"] ] ], ["035", " ", " ", [ ["a", "(RLIN)MIUG0000733-B"] ] ], ["035", " ", " ", […
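
The two-elements-versus-four rule is easy to act on in code. A minimal Ruby sketch, assuming a record has already been parsed from JSON into plain hashes and arrays; `describe_field` and the symbol keys it returns are hypothetical names picked for illustration, not part of ruby-marc or any marc-hash library.

```ruby
# Hypothetical sketch: classify one entry from a marc-hash "fields" list.
# A two-element array is a control field; a four-element array is a data field.
def describe_field(field)
  case field.length
  when 2
    tag, value = field
    { kind: :control, tag: tag, value: value }
  when 4
    tag, ind1, ind2, subfields = field
    { kind: :data, tag: tag, ind1: ind1, ind2: ind2, subfields: subfields }
  else
    raise ArgumentError, "unexpected field length #{field.length}"
  end
end

# A tiny record in the shape described above (values are placeholders).
record = {
  "type"    => "marc-hash",
  "version" => [1, 0],
  "leader"  => "leader string",
  "fields"  => [
    ["001", "001 value"],
    ["010", " ", " ", [["a", "68009499"]]]
  ]
}

record["fields"].each { |field| p describe_field(field) }
```

Dispatching on array length keeps the format free of per-field type markers, at the cost of having to reject any other length explicitly, as the `else` branch does here.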
