I complain a lot about the MARC format, the way people put data in MARC records, the actual data themselves I find in MARC records, the inexplicably complex syntax for identifiers and, ironically, attempts to replace MARC with something else. One nice little beacon of hope was when I found that only roughly 0.26% of the ISBNs in the UMich catalog have invalid checksums. That’s not bad at all, and it’s worth digging into other things about which I might be likely to complain before I make a fool of myself. [Note: there will be some complaining at the end.…
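For anyone who wants to see what "invalid checksum" means here, this is a minimal sketch of ISBN-10 and ISBN-13 check-digit validation in plain Ruby. The method names (`valid_isbn10?`, `valid_isbn13?`, `valid_isbn?`) are made up for illustration; this is not the code used to produce the figure above, and it doesn't touch ruby-marc.

```ruby
# Hypothetical helpers for illustration only; not part of ruby-marc or any gem.

# ISBN-10: weight the ten characters 10 down to 1; the sum must be divisible by 11.
# The final character may be "X", which stands for the value 10.
def valid_isbn10?(raw)
  chars = raw.gsub(/[\s-]/, "").upcase.chars
  return false unless chars.length == 10
  return false unless chars[0, 9].all? { |c| c.match?(/\d/) }
  return false unless chars[9].match?(/[\dX]/)

  weighted = chars.each_with_index.map do |c, i|
    (c == "X" ? 10 : c.to_i) * (10 - i)
  end
  (weighted.sum % 11).zero?
end

# ISBN-13: weight the digits alternately 1 and 3; the sum must be divisible by 10.
def valid_isbn13?(raw)
  digits = raw.gsub(/[\s-]/, "")
  return false unless digits.match?(/\A\d{13}\z/)

  weighted = digits.chars.each_with_index.map do |c, i|
    c.to_i * (i.even? ? 1 : 3)
  end
  (weighted.sum % 10).zero?
end

# Accept either form.
def valid_isbn?(raw)
  valid_isbn10?(raw) || valid_isbn13?(raw)
end
```

Hyphens and whitespace are stripped before checking, so `valid_isbn?("0-306-40615-2")` and `valid_isbn?("9780306406157")` are both treated as well-formed input. Real catalog data usually carries trailing qualifiers after the number, which this sketch makes no attempt to handle.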
Comments closed
Year: 2014
Ruby MARC serialization/deserialization revisited
A few years ago, I benchmarked various methods of serializing/deserializing MARC data using the ruby-marc gem. Given that I’m planning on starting fresh with my catalog setup, I thought I’d take a moment to revisit them. The biggest changes since that time have been (a) the continued speed improvements in JRuby, (b) the introduction of the Oj JSON parser for MRI ruby, and (c) wider availability of msgpack code in the wild. I also wondered what would happen if I tried ruby’s Marshal serialization; maybe it would be faster because I wouldn’t have to "manually" create a MARC::Record object from…
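To make the Marshal idea concrete, here is a small sketch of the difference between the two round trips, using only the standard library. `ToyRecord` is a hypothetical stand-in for a record object, not the real MARC::Record class, and the field layout is invented for the example; it says nothing about which approach is actually faster.

```ruby
require "json"

# ToyRecord is a hypothetical stand-in, NOT the ruby-marc MARC::Record class.
ToyRecord = Struct.new(:leader, :fields)

record = ToyRecord.new(
  "00000nam a2200000 a 4500",
  [["001", "12345"], ["245", "Example title"]]
)

# Marshal round trip: the object comes back as a ToyRecord with no extra work.
from_marshal = Marshal.load(Marshal.dump(record))

# JSON round trip: parsing yields a plain Hash, so the object is rebuilt by hand.
parsed = JSON.parse(JSON.generate(record.to_h))
from_json = ToyRecord.new(parsed["leader"], parsed["fields"])
```

Both `from_marshal` and `from_json` end up equal to `record`; the difference is the extra reconstruction step on the JSON side. Wrapping each round trip in the standard library's `Benchmark` module would be one way to compare them.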
“Schemaless” solr with dynamicField and copyField
[Holy Kamoly, it’s been a long time since I blogged!] Recent versions of solr have the option to run in what they call "schemaless mode", wherein fields that aren’t recognized are actually added, automatically, to the schema as real named fields. I find this intriguing, but it’s not what I’m after right now. The problem I’m in the first stages of addressing is that my schema.xml is a huge mess: very little consistency, no naming conventions dictating what’s stored/indexed, etc. It grew "organically" (which is what I say when I mean I’ve been lazy and sloppy) and needs a full-on…
Help me test yet another LC Callnumber parser
Those who have followed this blog and my code for a while know that I have a long, slightly sad, and borderline abusive relationship with Library of Congress call numbers. They’re a freakin’ nightmare. They just are. But, based on the premise that Sisyphus was a quitter, I took another stab at it, this time writing a real (PEG-) parser instead of trying to futz with extended regular expressions. The results, so far, aren’t too bad. The gem is called lc_callnumber, but more importantly, I’ve put together a little heroku app to let you play with it, and then correct…