NOTE 2: It turns out that I did find a minor bug in the system, but in general LCCN normalization is working correctly. I just happened to hit a weirdness with a bad LCCN and a little bug in the parser on their end, which is getting fixed. So...good news all around, and huge kudos to Xiaoming Liu for his quick response!
**NOTE** It strikes me that I haven't seen a case where...
[Note: in this post I’m just going to focus on the “get stuff into Solr” part. My normal focus – MARC data – will make an appearance in the next post when I talk about using this in addition to / instead of solrmarc.]
Working with Solr
I love me the Solr. I love everything about it except that the best way to interact with it is via Java. I don’t so much love...
Yea! My first gem ever released!
[YUCK! It was a disaster in a few ways! Don't look at this! It's hideous! There's a new jruby_producer_consumer gem on gemcutter that is slightly different from this in that it works. Ignore the stuff below.]
[In working on a threaded JRuby-based MARC-to-Solr project, I realized that my threading stuff was…ugly. And I didn’t really understand it. So I dug in today and wrote this.]
I’ve been looking at making a JRuby-based Solr indexer for MARC documents, and started off wanting to make sure I could determine if anything I did would be faster than our existing (solrmarc-based) setup.
- Assertion: The upper bound on how fast I can process records and send them to Solr can be approximated by looking at how fast I can parse (and do nothing else to) MARC records from a file.
- Assertion: If I...
So, initially, this post listed that the way to separate multiple simultaneous requests was with a nice, URL-like slash (/) character.
Then, I remembered that LCCNs can have embedded slashes, e.g., 65063380//r85.
So, we’re back to using pipe (|) characters to separate multiple calls – the examples below have been updated to reflect this.
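To make the problem concrete, here is a minimal sketch (not the service's actual code) of why a slash fails as a separator and a pipe does not. The helper names `join_ids` and `split_ids` are made up for illustration, and the second identifier is just a placeholder value.

```python
# Hypothetical helpers, for illustration only: they are not part of any real API.
SEPARATOR = "|"


def join_ids(ids):
    """Combine several identifiers into one multi-request string."""
    return SEPARATOR.join(ids)


def split_ids(multi):
    """Split a multi-request string back into identifiers, skipping empty pieces."""
    return [part for part in multi.split(SEPARATOR) if part]


# The first value is the LCCN from the text; the second is a placeholder.
ids = ["65063380//r85", "2004042477"]

# With a pipe, the round trip gives back exactly what went in.
print(split_ids(join_ids(ids)))

# With a slash, the embedded "//" is indistinguishable from a separator,
# so the first LCCN is broken into pieces (including an empty one).
print("/".join(ids).split("/"))
```

The pipe is safe here only because it never appears inside an LCCN; any identifier type that could contain a pipe would need a different separator or some form of escaping.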
There was much pain. Much, much pain. Exacerbated by my almost complete lack of knowledge about what I was doing.
This is the procedure I eventually arrived at – if there are places where I made trouble for myself, please let me know!
[And does anyone know how...
unAPI is a very simple protocol to let a machine know what other formats a document is available in. Zotero is a bibliographic management tool (like EndNote or RefWorks) that operates as a Firefox plugin. And it speaks unAPI.
Let’s get them to play nice with each other!
How’s it all work?
- Zotero looks for a well-constructed <link> tag in the head of the page
- It checks the document on the...
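As a rough illustration of the first step only, here is a sketch of how a client might look for the unAPI `<link>` tag in a page's head. This is not Zotero's actual code: the class and function names are made up, and the sample page and its `example.org` URL are placeholders. The `rel="unapi-server"` value is the one the unAPI spec uses for autodiscovery.

```python
from html.parser import HTMLParser


class UnapiLinkFinder(HTMLParser):
    """Hypothetical helper: remembers the first unAPI server link inside <head>."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.server_url = None

    def handle_starttag(self, tag, attrs):
        if tag == "head":
            self.in_head = True
        elif tag == "link" and self.in_head and self.server_url is None:
            attr = dict(attrs)
            # rel can hold several space-separated values; attribute values may be None.
            rels = (attr.get("rel") or "").lower().split()
            if "unapi-server" in rels:
                self.server_url = attr.get("href")

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


def find_unapi_server(html):
    """Return the unAPI server URL advertised in the page head, or None."""
    finder = UnapiLinkFinder()
    finder.feed(html)
    return finder.server_url


# Placeholder page; the href is an assumption, not a real endpoint.
PAGE = """<html><head>
<title>Example record</title>
<link rel="unapi-server" type="application/xml" title="unAPI" href="http://example.org/unapi" />
</head><body></body></html>"""

print(find_unapi_server(PAGE))
```

A page without that tag in its head simply yields `None`, which is the signal to a client that there is no unAPI server to ask.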
- Added “recordURL” per Tod’s request
- Made a record’s title field an array and call it titles, to allow for vernacular entries
- Changed item’s ingest to lastUpdate to accurately note what the actual date reflects. This gets updated every time either the item or the record to which it’s attached gets changed.
- Fixed a couple typos, including one where I substituted an ampersand for a pipe in the multi-get example (thanks again, Tod).
- Added...
JRuby is my Ruby platform of choice, mostly because I think its deployment options in my work environment are simpler (perhaps technically and certainly politically), but also because I have high, high hopes to use lots of super-optimized native Java libraries. The CPAN is what keeps me tethered to Perl, and whether or not you like Java-the-language, boy, are there a lot of high-quality libraries out there.
Since I’ve been messing around with MARC-XML...
Many of the folks who read this blog (hi, both of you! Mom, say hello to Dad!) are aware, at least tangentially, of the HathiTrust. Currently hosted by us at the University of Michigan, the most public interface to its data is a VuFind installation you can access at catalog.hathitrust.org (or, for you smart-phone types, at m.catalog.hathitrust.org). Once you do a metadata search, you get links into the actual page...