Category: Uncategorized

Reintroducing Traject: Traject 2.0

Traject 2.0.0 released! Now runs under MRI/RBX! traject is an ETL (extract/transform/load) system written in ruby with a special view towards extracting fields from MARC data and writing them out into Solr. Jonathan Rochkind (http://bibwild.wordpress.com) and I wrote this primarily out of frustration using other tools in this space (e.g., Solrmarc, or my own precursor to traject, marc2solr). Note: Catmandu is another, perl-based system I don’t have any direct experience with. traject had its first release almost a year and a half ago (at least based on the date of my post introducing it), and I’ve used it literally…

Comments closed

Ruby MARC serialization/deserialization revisited

A few years ago, I benchmarked various methods of serializing/deserializing MARC data using the ruby-marc gem. Given that I’m planning on starting fresh with my catalog setup, I thought I’d take a moment to revisit them. The biggest changes since that time have been (a) the continued speed improvements in JRuby, (b) the introduction of the Oj json parser for MRI ruby, and (c) wider availability of msgpack code in the wild. I also wondered what would happen if I tried ruby’s Marshal serialization; maybe it would be faster because I wouldn’t have to "manually" create a MARC::Record object from…
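As a rough illustration of the kind of comparison described above, here is a minimal sketch that times Marshal against the standard-library JSON parser. It is not the benchmark from the post: the record is a made-up nested array/hash stand-in rather than a real MARC::Record, and the helper name `time_it` and the iteration counts are assumptions chosen only for the example.

```ruby
require "json"

# Stand-in for a parsed record: plain hashes, arrays and strings.
# This is NOT a real MARC::Record; it only has a similar shape.
record = {
  "leader" => "00000nam a2200000 a 4500",
  "fields" => [
    ["001", "12345"],
    ["245", "1", "0", [["a", "A title"], ["c", "An author"]]]
  ]
}
records = Array.new(1_000) { record }

# Serialize once, up front, so that only deserialization is timed.
marshaled = Marshal.dump(records)
jsonified = JSON.generate(records)

# Hypothetical helper: run the block `n` times and return elapsed seconds.
def time_it(n)
  start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  n.times { yield }
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - start
end

marshal_seconds = time_it(20) { Marshal.load(marshaled) }
json_seconds    = time_it(20) { JSON.parse(jsonified) }

puts format("Marshal.load: %.4fs", marshal_seconds)
puts format("JSON.parse:   %.4fs", json_seconds)
```

The timings will vary by interpreter and machine, and a real comparison would need to include the cost of rebuilding MARC::Record objects from the parsed data.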

“Schemaless” solr with dynamicField and copyField

[Holy Kamoly, it’s been a long time since I blogged!] Recent versions of solr have the option to run in what they call "schemaless mode", wherein fields that aren’t recognized are actually added, automatically, to the schema as real named fields. I find this intriguing, but it’s not what I’m after right now. The problem I’m in the first stages of addressing is that my schema.xml is a huge mess: very little consistency, no naming conventions dictating what’s stored/indexed, etc. It grew "organically" (which is what I say when I mean I’ve been lazy and sloppy) and needs a full-on…

Requiring/Preferring searches that don’t span multiple values (SST #3)

[Check out the introduction to the Stupid Solr Tricks series if you’re just joining us.] Solr and multiValued fields Here’s another thing you need to understand about Solr: it doesn’t really have fields that can take multiple values. But Bill, you’re saying, sure it does. I mean, hell, it even has a ’multiValued’ parameter. First off: watch your language. Second off: are you sure? Let’s do a quick test. Look at the following documents exampledocs/names.json [ { id: 1, title: The Monkees, name_text: [Peter Tork, Mike Nesmith, Micky Dolenz, Davy Thomas Jones] }, { id: 2, title: Heros of the Wild…

Setting up your OPAC for Zotero support using unAPI

unAPI is a very simple protocol to let a machine know what other formats a document is available in. Zotero is a bibliographic management tool (like Endnote or Refworks) that operates as a Firefox plugin. And it speaks unAPI. Let’s get them to play nice with each other! How’s it all work? Zotero looks for a well-constructed <link> tag in the head of the page. It checks the document on the other side of that link to see what formats are offered, and picks one to use. No, you can’t decide which one it uses. It picks. Zotero then looks…

Benchmarking MARC record parsing in Ruby

[Note: since I started writing this, I found out Bess & Co. store MARC-XML. That makes a difference, since XML in Ruby can be really, really slow] [UPDATE It turns out they don’t use MARC-XML. They use MARC-Binary just like the rest of us. Oops. ] [UP-UPDATE Well, no, they do use MARC-XML. I’m not afraid to constantly change my story. This is why I’m the best investigative reporter in the business] The other day on the blacklight mailing list, Bess Sadler wrote: Yes, we do still include the full marc record, but the rule of thumb we’re currently using…

Going with and “forking” VUFind

Note: This is the second in a series I’m doing about our VUFind installation, Mirlyn. Here I talk about how we got to where we are. Next I’ll start looking at specific technologies, how we solved various problems, and generally more nerd-centered stuff. When the University Library decided to go down the path of an open-source, solr-based OPAC, there were (and are, I guess) two big players: VUFind and Blacklight. I wasn’t involved in the decision, but it must have seemed like a no-brainer. VUFind was in production (at Villanova), seemed to be building a community of similar institutions around…

MARC-HASH: The saga continues (now with even less structure)

After a medium-sized discussion on #code4lib, we’ve collectively decided that…well, ok, no one really cares all that much, but a few people weighed in. The new format is: A list of arrays. If it’s got two elements, it’s a control field; if it’s got four, it’s a data field. SO….it’s like this now. { "type" : "marc-hash", "version" : [1, 0], "leader" : "leader string", "fields" : [ ["001", "001 value"], ["002", "002 value"], ["010", " ", " ", [ ["a", "68009499"] ] ], ["035", " ", " ", [ ["a", "(RLIN)MIUG0000733-B"] ] ], ["035", " ", " ", […
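The two-element / four-element rule is easy to check in code. The following is a minimal Ruby sketch, assuming only what the excerpt states: a two-element array is a control field (tag, value) and a four-element array is a data field (tag, two indicators, list of subfield pairs). The helper names `control_field?`, `data_field?` and `describe_field` are hypothetical and are not part of any marc-hash library.

```ruby
require "json"

# Hypothetical helpers: only the length rule itself comes from the post.
def control_field?(field)
  field.length == 2
end

def data_field?(field)
  field.length == 4
end

# Turn one field into a readable string, based purely on its length.
def describe_field(field)
  if control_field?(field)
    tag, value = field
    "#{tag} control: #{value}"
  elsif data_field?(field)
    tag, ind1, ind2, subfields = field
    subs = subfields.map { |code, value| "#{code}=#{value}" }.join(" ")
    "#{tag} data [#{ind1}#{ind2}]: #{subs}"
  else
    raise ArgumentError, "unexpected field length #{field.length}"
  end
end

# A small record using values that appear in the excerpt.
record = {
  "type"    => "marc-hash",
  "version" => [1, 0],
  "leader"  => "leader string",
  "fields"  => [
    ["001", "001 value"],
    ["010", " ", " ", [["a", "68009499"]]],
    ["035", " ", " ", [["a", "(RLIN)MIUG0000733-B"]]]
  ]
}

# Round-trip through JSON: arrays keep field and subfield order intact.
round_tripped = JSON.parse(JSON.generate(record))

descriptions = round_tripped["fields"].map { |field| describe_field(field) }
puts descriptions
```

Because every level that needs ordering is an array rather than a hash, field order and subfield order survive serialization in any JSON or YAML implementation.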

MARC-Hash: a proposed format for JSON/YAML/Whatever-compatible MARC records

In my first shot at MARC-in-JSON, which I appropriately (and prematurely) named MARC-JSON, I made a point of losing round-tripability (to and from MARC) in order to end up with a nice, easy-to-work-with data structure based mostly on hashes. “Who really cares what order the subfields come in?” I asked myself. Well, of course, it turns out some people do. Some even care about the order of the tags. “Only in the 500s…usually” I was told today. All my lovely dreams of using easy-to-access hashes went up in so much smoke. So…I’m suggesting we try something a little simpler. Something so…

TicTocs: Give us a file! Pretty pretty pretty please!

For those who haven’t heard, ticTOCs is a service that provides web-based access to a database of Journal RSS/Atom Table of Contents feeds. Awesome. In their blog at News from TicTocs, a post titled I want to be completely honest with you about ticTOCs notes that: As for the API – yes, we’ve been asked this several times, and the answer is that it is currently being written and should be available very soon. That’s great, but writing in a comment on that post (after logging in with a very, very old OpenID — I used to have a blog named…

UPenn library has video “commercials”

The University of Pennsylvania Library has a set of video commercials touting their products — some of which are musicals! Worth a look-see.
