My code

My GitHub page holds just about everything I’ve worked on, past and present, finished and incomplete, which makes it a bit of a mess.

The things I think are ready for prime time, though, are as follows. Everything is in Ruby unless otherwise noted.

Traject

traject (with Jonathan Rochkind) is an ETL (extract, transform, load) system with plenty of generic hooks, but designed around the idea of indexing MARC records into Solr. I use it for both the University of Michigan catalog (Mirlyn) and the HathiTrust catalog.

I’ve also put together a sample configuration, and I’ve made available the full, nasty configuration used for Mirlyn and the HT catalog.

We’re interested in getting other folks using traject so we can build up an ecosystem of macros and scripts; don’t be afraid to contact me if you’re interested.

Associated work:

  • traject-alephsequential-reader: A reader for the Ex Libris AlephSequential MARC serialization.
  • traject_umich_format: A traject macro that implements the University of Michigan’s process for determining the format of an item represented by a MARC record (e.g., book, serial, etc.).

MARC

Given how much MARC I have to work with, it was inevitable, perhaps, that I would end up writing libraries to deal with it.

  • ruby-marc (with many, many others): I’ve had a small part in improving and adapting this over the years, mostly focused on JRuby support and some speed improvements (because when you’ve got 10M bibs to index, speed counts!).
  • marc4j_extra_reader_writers: Before traject, and before I wrote my own indexing code, I used solrmarc and, along with it, marc4j. This implements a marc-in-json reader/writer and an alephsequential reader for marc4j (in Java). Note: My own proposed JSON serialization for MARC, marc-hash, should be considered dead: the community has gotten behind MARC-in-JSON.
  • ruby-marc-marc4j: A simple gem to translate from a marc4j record (pulled in via JRuby with the marc4j .jar file) into a ruby-marc ruby object.
  • MARC::File::MiJ: I also wrote an implementation of marc-in-json for the MARC::Record set of modules (in Perl).
  • marc_alephsequential: An alephsequential reader for ruby-marc.
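Since marc-in-json comes up several times above, here is a rough sketch of what a record looks like in that serialization, using only Ruby’s standard json library. The record is invented for illustration, and subfield_values is a hypothetical helper, not part of ruby-marc or any of the gems listed here.

```ruby
require "json"

# A hand-built, hypothetical record in the MARC-in-JSON shape:
# a 24-character leader plus an ordered array of single-key field hashes.
record = {
  "leader" => "00000nam a2200000 a 4500",
  "fields" => [
    # Control fields map the tag straight to a string.
    { "001" => "000012345" },
    # Data fields carry two indicators and an ordered list of subfields.
    { "245" => {
        "ind1" => "1",
        "ind2" => "0",
        "subfields" => [
          { "a" => "An example title /" },
          { "c" => "A. N. Author." }
        ]
      } }
  ]
}

# Hypothetical helper: collect the values of one subfield code
# across every data field with the given tag.
def subfield_values(record, tag, code)
  record["fields"]
    .select { |field| field.key?(tag) && field[tag].is_a?(Hash) }
    .flat_map { |field| field[tag]["subfields"] }
    .select { |subfield| subfield.key?(code) }
    .map { |subfield| subfield[code] }
end

# Serialize and parse again to show the structure survives a round trip.
json = JSON.generate(record)
round_tripped = JSON.parse(json)

puts subfield_values(round_tripped, "245", "a").inspect
```

Fields and subfields are arrays of one-key hashes rather than plain hashes because MARC allows repeated tags and repeated subfield codes, and their order matters.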

Other library data

  • library_stdnums: Ruby code to identify, validate, and normalize ISBNs, ISSNs, and LCCNs.
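To give a feel for what "validate" means here, the following is a minimal sketch of the standard ISBN-10 and ISBN-13 check-digit arithmetic. The method names are hypothetical and written for this page; they are not the library_stdnums API, which may well handle more edge cases.

```ruby
# Hypothetical helpers for illustration only; not the library_stdnums API.

# Drop hyphens and whitespace, and upcase a trailing "x" check digit.
def normalize_isbn(raw)
  raw.to_s.gsub(/[\s-]/, "").upcase
end

# ISBN-10: weight the digits 10 down to 1; the sum must be divisible by 11.
# A final "X" stands for the value 10.
def valid_isbn10?(raw)
  isbn = normalize_isbn(raw)
  return false unless isbn.match?(/\A\d{9}[\dX]\z/)

  sum = isbn.chars.each_with_index.map do |ch, i|
    digit = ch == "X" ? 10 : ch.to_i
    digit * (10 - i)
  end.sum
  (sum % 11).zero?
end

# ISBN-13: alternate weights of 1 and 3; the sum must be divisible by 10.
def valid_isbn13?(raw)
  isbn = normalize_isbn(raw)
  return false unless isbn.match?(/\A\d{13}\z/)

  sum = isbn.chars.each_with_index.map do |ch, i|
    ch.to_i * (i.even? ? 1 : 3)
  end.sum
  (sum % 10).zero?
end

puts valid_isbn10?("0-306-40615-2")
puts valid_isbn13?("978-0-306-40615-7")
```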

General

  • threach and jruby_threach: Add a #threach method to any Enumerable Ruby object, allowing you to run a block across many threads at once. These days I’d probably just use a thread pool from the concurrent-ruby project, but plenty of folks still use this in the wild.
  • match_map: An object with a hash-like interface that allows many-to-many relationships. Keys are strings or regexes; values can be single or multiple, and many keys can match in a single lookup. Useful when you need it, confusing when you don’t.
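The match_map idea is easier to see than to describe, so here is a small sketch of the concept. PatternMap is a hypothetical class written for this page; it is not the gem’s API or implementation, just an assumption about the simplest thing that behaves the way the description says.

```ruby
# Hypothetical class for illustration; not the match_map gem itself.
class PatternMap
  def initialize
    @pairs = [] # ordered list of [key, values]
  end

  # A key is a String (exact match) or a Regexp.
  # A value may be a single item or an array of items.
  def []=(key, value)
    @pairs << [key, Array(value)]
  end

  # Return the values of every key that matches, flattened and de-duplicated.
  def [](lookup)
    @pairs
      .select { |key, _| key.is_a?(Regexp) ? key.match?(lookup) : key == lookup }
      .flat_map { |_, values| values }
      .uniq
  end
end

map = PatternMap.new
map["book"]    = "Book"
map[/\Abook/]  = ["Print", "Book"]
map[/serial/i] = "Serial"

p map["book"]     # matches the string key and /\Abook/ => ["Book", "Print"]
p map["Serials"]  # matches only /serial/i             => ["Serial"]
p map["dvd"]      # matches nothing                    => []
```

Returning an empty array for a miss, rather than nil, is a choice made for this sketch so that callers can always iterate over the result.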