[Note: edited for clarity thanks to rsinger’s comment, below] Doomed, I say! DOOOOOOOOOOMMMMMMMED! My reasoning is simple: RDA will fail because it’s not “better enough.” Now, those of you who know me might be saying to yourselves, “Waitjustaminute. Bill doesn’t know anything at all about cataloging, or semantic representations, or the relative merits of various encapsulations of bibliographic metadata. I mean, sure, he knows a lot about…err….hmmm…well, in any case, he’s definitely talking out of his ass on this one.” First off, thanks for having such a long-winded internal monologue about me; it’s good to be thought of. And, of…
Data structures and Serializations
Jonathan Rochkind, in response to a long (and, IMHO, mostly ridiculous) thread on NGC4Lib, has been exploring the boundaries between a data model and its expression/serialization (see here, here, and here) and I thought I’d jump in. What this post is not: There’s a lot to be said about a good domain model for bibliographic data. I’m so not the guy to say it. I know there are arguments for and against various aspects of AACR2, RDA, and FRBR, and I’m unable to go into them. What I am comfortable saying is this: Anyone advocating or…
Stupid catalog tricks: Subject Headings and the Long Tail
Library of Congress Subject Headings (LCSH) in particular. I’ve always been down on LCSH because I don’t understand them. They kinda look like a hierarchy, but they’re not really. Things get modifiers. Geography is inline and …weird. And, of course, in our faceted catalog, when you click on a linked LCSH to do an automatic search, you often get nothing but the record you started from. Which is super-annoying. So, just for kicks, I ran some numbers. The process: I extracted all the 650 fields with indicator2="0" from our catalog, threw away the subfield 6’s, and threw away any trailing punctuation…
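For the curious, the extraction step looks roughly like this with ruby-marc. It's a sketch, not the exact script: the filename, the output format, and the punctuation-stripping regex are all stand-ins.

require 'marc'

headings = Hash.new(0)

MARC::Reader.new('catalog.mrc').each do |record|
  record.fields('650').each do |field|
    next unless field.indicator2 == '0'
    # Drop the $6 linkage subfields, keep the rest in order
    text = field.subfields.reject { |sf| sf.code == '6' }.map(&:value).join(' ')
    # Throw away trailing punctuation
    text = text.sub(/[\s.,;:\/]+\z/, '')
    headings[text] += 1
  end
end

# How long is the long tail? Count headings by how often they're used.
tail = Hash.new(0)
headings.each_value { |uses| tail[uses] += 1 }
tail.sort.each { |uses, count| puts "#{count} headings used #{uses} time(s)" }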
Why bother with threading in jruby? Because it’s easy.
[Edit 2011-July-1: I’ve written a jruby-specific version of threach, called jruby_threach, that takes advantage of better underlying Java libraries and is a much better option if you’re running JRuby] Lately on the #code4lib IRC channel, several of us have been knocking around different versions (in several programming languages) of programs to read in a ginormous file and do some processing on each line. I noted some speedups related to multi-threading, and someone (maybe rsinger?) said, basically, that to bother with threading for a one-off simple program was a waste. Well, it turns out I’ve been trying to figure out how to deal…
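The pattern itself isn't much code. Here's a bare-bones sketch of the queue-and-workers approach that threach wraps up; the thread count, queue size, and filename are arbitrary, and the per-line "work" is a placeholder.

require 'thread'

WORKERS = 4
queue   = SizedQueue.new(100)   # bounded, so the reader can't race too far ahead

workers = WORKERS.times.map do
  Thread.new do
    while (line = queue.pop)     # nil is the "we're done" signal
      # ...do the real per-line work here...
      line.chomp.length
    end
  end
end

File.open('ginormous_file.txt') do |f|
  f.each_line { |line| queue.push(line) }
end

WORKERS.times { queue.push(nil) }   # one poison pill per worker
workers.each(&:join)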
Pushing MARC to Solr; processing times and threading and such
[This is in response to a thread on the blacklight mailing list about getting MARC data into Solr.] What’s the question? The question came up, “How much time do we spend processing the MARC vs trying to push it into Solr?”. Bob Haschart found that even with a pretty damn complicated processing stage, pushing the data to solr was still, at best, taking at least as long as the processing stage. I’m interested because I’ve been struggling to write a solrmarc-like system that runs under JRuby. Architecturally, the big difference between my stuff and solrmarc is that I use the…
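For the curious, the shape of the measurement is something like this. This isn't how solrmarc or my JRuby code is actually structured; to_solr_doc is a trivial stand-in for the real (much hairier) processing stage, and the Solr URL and filename are placeholders.

require 'marc'
require 'rsolr'
require 'benchmark'

# Trivial stand-in for the real processing stage
def to_solr_doc(rec)
  {
    :id      => (rec['001'] && rec['001'].value),
    :title_t => (rec['245'] && rec['245']['a'])
  }
end

solr = RSolr.connect(:url => 'http://localhost:8983/solr/catalog')
process_time = push_time = 0.0

MARC::Reader.new('records.mrc').each_slice(1000) do |batch|
  docs = nil
  process_time += Benchmark.realtime { docs = batch.map { |rec| to_solr_doc(rec) } }
  push_time    += Benchmark.realtime { solr.add(docs) }
end
solr.commit

puts "processing: #{process_time.round(1)}s, pushing: #{push_time.round(1)}s"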
ruby-marc with pluggable readers
I’ve been messing with easier ways of adding parsers to ruby-marc’s MARC::Reader object. The idea is that you can do this:

require 'marc'
require 'my_marc_stuff'

mbreader = MARC::Reader.new('test.mrc')                               # => Stock marc binary reader
mbreader = MARC::Reader.new('test.mrc', :readertype => :marcstrict)   # => ditto

MARC::Reader.register_parser(My::MARC::Parser, :marcstrict)
mbreader = MARC::Reader.new('test.mrc')                               # => Uses My::MARC::Parser now

xmlreader = MARC::Reader.new('test.xml', :readertype => :marcxml)

# …and maybe further on down the road
asreader = MARC::Reader.new('test.seq', :readertype => :alephsequential)
mjreader = MARC::Reader.new('test.json', :readertype => :marchashjson)

A parser need only implement #each and a module-level method #decode_from_string. Read all about it on the github page.
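If you're wondering what a pluggable parser has to look like, here's a toy one that satisfies that contract. The record-splitting logic is invented for illustration; the only things that matter are #each and the module-level #decode_from_string.

require 'marc'

module My
  module MARC
    class Parser
      include Enumerable

      def initialize(filename, options = {})
        @filename = filename
      end

      # Yield one MARC::Record per record in the file
      def each
        File.open(@filename, 'rb') do |f|
          f.each_line("\x1d") do |raw|          # 0x1d is the MARC record terminator
            yield self.class.decode_from_string(raw)
          end
        end
      end

      # Turn one raw record string into a MARC::Record
      def self.decode_from_string(raw)
        ::MARC::Reader.decode(raw)
      end
    end
  end
end

# Registered as in the example above, it would take over the :marcstrict slot:
# MARC::Reader.register_parser(My::MARC::Parser, :marcstrict)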
Setting up your OPAC for Zotero support using unAPI
unAPI is a very simple protocol to let a machine know what other formats a document is available in. Zotero is a bibliographic management tool (like Endnote or Refworks) that operates as a Firefox plugin. And it speaks unAPI. Let’s get them to play nice with each other! How’s it all work? Zotero looks for a well-constructed <link> tag in the head of the page. It checks the document on the other side of that link to see what formats are offered, and picks one to use. No, you can’t decide which one it uses. It picks. Zotero then looks…
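To give a flavor of the server side, here's a bare-bones sketch using Sinatra (not what our OPAC actually runs). The markup in the comment is roughly what Zotero goes looking for on the page, and MARC-XML is the only format this toy endpoint offers.

# The OPAC page itself needs a link tag in its head and a per-record id, roughly:
#   <link rel="unapi-server" type="application/xml" title="unAPI"
#         href="http://opac.example.edu/unapi" />
#   <abbr class="unapi-id" title="12345"></abbr>

require 'sinatra'

# Stand-in for a real lookup against the catalog
def marcxml_for(id)
  "<record><!-- MARCXML for record #{id} --></record>"
end

get '/unapi' do
  content_type 'application/xml'
  id, format = params[:id], params[:format]

  if id && format
    halt 406 unless format == 'marcxml'   # the only format this sketch offers
    marcxml_for(id)
  else
    # Advertise available formats (scoped to a record if an id was given)
    id_attr = id ? %( id="#{id}") : ''
    <<~XML
      <?xml version="1.0" encoding="UTF-8"?>
      <formats#{id_attr}>
        <format name="marcxml" type="application/xml" />
      </formats>
    XML
  end
end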
Benchmarking MARC record parsing in Ruby
[Note: since I started writing this, I found out Bess & Co. store MARC-XML. That makes a difference, since XML in Ruby can be really, really slow] [UPDATE: It turns out they don’t use MARC-XML. They use MARC-Binary just like the rest of us. Oops.] [UP-UPDATE: Well, no, they do use MARC-XML. I’m not afraid to constantly change my story. This is why I’m the best investigative reporter in the business] The other day on the blacklight mailing list, Bess Sadler wrote: Yes, we do still include the full marc record, but the rule of thumb we’re currently using…
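The benchmark itself is nothing fancy; it's basically this shape, with placeholder filenames holding the same records in both serializations.

require 'marc'
require 'benchmark'

Benchmark.bm(12) do |bm|
  bm.report('marc binary') do
    MARC::Reader.new('records.mrc').each { |rec| rec['245'] }
  end
  bm.report('marc-xml') do
    MARC::XMLReader.new('records.xml').each { |rec| rec['245'] }
  end
end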
Going with, and "forking", VUFind
Note: This is the second in a series I’m doing about our VUFind installation, Mirlyn. Here I talk about how we got to where we are. Next I’ll start looking at specific technologies, how we solved various problems, and generally more nerd-centered stuff. When the University Library decided to go down the path of an open-source, solr-based OPAC, there were (and are, I guess) two big players: VUFind and Blacklight. I wasn’t involved in the decision, but it must have seemed like a no-brainer. VUFind was in production (at Villanova), seemed to be building a community of similar institutions around…
MARC-HASH: The saga continues (now with even less structure)
After a medium-sized discussion on #code4lib, we’ve collectively decided that…well, ok, no one really cares all that much, but a few people weighed in. The new format is: a list of arrays. If it’s got two elements, it’s a control field; if it’s got four, it’s a data field. SO….it’s like this now.

{
  "type"    : "marc-hash",
  "version" : [1, 0],
  "leader"  : "leader string",
  "fields"  : [
    ["001", "001 value"],
    ["002", "002 value"],
    ["010", " ", " ", [ ["a", "68009499"] ] ],
    ["035", " ", " ", [ ["a", "(RLIN)MIUG0000733-B"] ] ],
    ["035", " ", " ", […
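Getting there from a ruby-marc record is only a few lines. This helper is just a sketch (the method name is mine), but the keys and the two-vs-four-element rule match the format above.

require 'marc'
require 'json'

def marc_hash(record)
  fields = record.fields.map do |f|
    if f.is_a?(MARC::ControlField)
      [f.tag, f.value]                                    # two elements: control field
    else
      [f.tag, f.indicator1, f.indicator2,                 # four elements: data field
       f.subfields.map { |sf| [sf.code, sf.value] }]
    end
  end

  {
    'type'    => 'marc-hash',
    'version' => [1, 0],
    'leader'  => record.leader,
    'fields'  => fields
  }
end

record = MARC::Reader.new('test.mrc').first
puts JSON.pretty_generate(marc_hash(record))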
MARC-Hash: a proposed format for JSON/YAML/Whatever-compatible MARC records
In my first shot at MARC-in-JSON, which I appropriately (and prematurely) named MARC-JSON, I made a point of losing round-trippability (to and from MARC) in order to end up with a nice, easy-to-work-with data structure based mostly on hashes. “Who really cares what order the subfields come in?” I asked myself. Well, of course, it turns out some people do. Some even care about the order of the tags. “Only in the 500s…usually,” I was told today. All my lovely dreams of using easy-to-access hashes went up in so much smoke. So…I’m suggesting we try something a little simpler. Something so…
TicTocs: Give us a file! Pretty pretty pretty please!
For those who haven’t heard, ticTOCs is a service that provides web-based access to a database of Journal RSS/Atom Table of Contents feeds. Awesome. In their blog at News from TicTocs, a post titled I want to be completely honest with you about ticTOCs notes that: As for the API – yes, we’ve been asked this several times, and the answer is that it is currently being written and should be available very soon. That’s great, but writing in a comment on that post (after logging in with a very, very old OpenID — I used to have a blog named…
Psst. We’re not printing cards anymore
[From a series I’m calling, “Things About The Library I Think Are Stoooopid”, part one of about a zillion.] I’m going to wallow in a little bit of hyperbole here, but only a little. The problem Suppose, just for a moment, that you’re a computer programmer working anytime in the last twenty years, and someone wants you to set up a data structure to deal with a timeless issue — how to keep track of who’s on which committees in a library. If you’re a computer person Easy enough. First off, what’s a committee? Committee Committee name (string) Committee inception…
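Something like the following, say. Everything past the committee name is my guess at the obvious fields (the excerpt cuts off), and the point is just that memberships with start and end dates fall straight out of the model.

require 'date'

Committee  = Struct.new(:name, :inception_date, :disbanded_date)
Person     = Struct.new(:name, :email)
Membership = Struct.new(:person, :committee, :role, :start_date, :end_date)

web  = Committee.new('Web Advisory Committee', Date.new(2005, 9, 1), nil)
jdoe = Person.new('Jane Doe', 'jdoe@example.edu')

memberships = [
  Membership.new(jdoe, web, 'chair', Date.new(2008, 7, 1), nil)
]

# "Who's on which committees right now?" is just a filter on end_date
memberships.select { |m| m.end_date.nil? }.each do |m|
  puts "#{m.person.name}: #{m.committee.name} (#{m.role})"
end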
UPenn library has video "commercials"
The University of Pennsylvania Library has a set of video commercials touting their products — some of which are musicals! Worth a look-see.