
Tag: MARC

How good/bad is MARC data? The case of place-of-publication

I complain a lot about the MARC format, the way people put data in MARC records, the actual data themselves I find in MARC records, the inexplicably complex syntax for identifiers and, ironically, attempts to replace MARC with something else. One nice little beacon of hope was when I found that only roughly 0.26% of the ISBNs in the UMich catalog have invalid checksums. That’s not bad at all, and it’s worth digging into other things about which I might be likely to complain before I make a fool of myself. [Note: there will be some complaining at the end.…
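The checksum validation mentioned above is easy to sketch. This is my own minimal version of the standard ISBN-10 (mod 11) and ISBN-13 (mod 10) check-digit rules in plain Ruby, not the code actually used for the UMich numbers:

```ruby
# Minimal ISBN checksum validators (pure Ruby, no gems).
# ISBN-10: weighted sum with weights 10..1 must be divisible by 11
# (a final "X" stands for the value 10).
def valid_isbn10?(isbn)
  digits = isbn.to_s.delete("-").upcase
  return false unless digits =~ /\A\d{9}[\dX]\z/
  sum = digits.chars.each_with_index.sum do |ch, i|
    (ch == "X" ? 10 : ch.to_i) * (10 - i)
  end
  (sum % 11).zero?
end

# ISBN-13: alternating weights 1,3,1,3,... must sum to a multiple of 10.
def valid_isbn13?(isbn)
  digits = isbn.to_s.delete("-")
  return false unless digits =~ /\A\d{13}\z/
  sum = digits.chars.each_with_index.sum do |ch, i|
    ch.to_i * (i.even? ? 1 : 3)
  end
  (sum % 10).zero?
end
```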


Ruby MARC serialization/deserialization revisited

A few years ago, I benchmarked various methods of serializing/deserializing MARC data using the ruby-marc gem. Given that I’m planning on starting fresh with my catalog setup, I thought I’d take a moment to revisit them. The biggest changes since that time have been (a) the continued speed improvements in JRuby, (b) the introduction of the Oj json parser for MRI ruby, and (c) wider availability of msgpack code in the wild. I also wondered what would happen if I tried ruby’s Marshal serialization; maybe it would be faster because I wouldn’t have to "manually" create a MARC::Record object from…
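The shape of such a benchmark is straightforward. A minimal sketch using only stdlib, timing Marshal against JSON round-trips on a marc-in-json-style hash (the record data here is invented, and this is not the original benchmark code):

```ruby
require "json"
require "benchmark"

# A marc-in-json-shaped hash standing in for a parsed MARC record
# (structure follows the marc-in-json convention; data is invented).
record = {
  "leader" => "00000nam a2200000 a 4500",
  "fields" => [
    { "001" => "000000001" },
    { "245" => { "ind1" => "1", "ind2" => "0",
                 "subfields" => [{ "a" => "A title" }] } }
  ]
}

n = 10_000
Benchmark.bm(8) do |bm|
  bm.report("marshal") { n.times { Marshal.load(Marshal.dump(record)) } }
  bm.report("json")    { n.times { JSON.parse(JSON.generate(record)) } }
end
```

Marshal skips the object-construction step the post alludes to, but ties you to Ruby (and to a Ruby version), which is part of the trade-off being measured.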


Announcing “traject” indexing software

[Over the next few days I’ll be writing a series of posts that highlight a new indexing solution by Jonathan Rochkind and myself called traject that we’re using to index MARC data into Solr. This is the introduction.] Wow. Six months since I posted here. What have I been doing? Well, mostly parenting, but in the last few weeks I was lucky enough to get on board with a project started by Jonathan Rochkind for a new JRuby-based tool optimized for indexing MARC data into Solr. You know, kinda like solrmarc, but JRuby. What’s it look like? I encourage you…
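For a taste of what a traject configuration looks like, here is a sketch based on traject’s documented DSL (`settings`/`provide`, `to_field`, `extract_marc`); the field names, Solr URL, and file name are invented for illustration:

```ruby
# traject_config.rb -- illustrative sketch, not a complete config
settings do
  provide "solr.url", "http://localhost:8983/solr/catalog"
end

# Map MARC fields to Solr fields with the extract_marc macro.
to_field "id",    extract_marc("001", first: true)
to_field "title", extract_marc("245ab")
```

You would then run it against a file of MARC records with something like `traject -c traject_config.rb records.mrc`.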


ISBN parenthetical notes: Bad MARC data #1

Yesterday, I gave a brief overview of why free text is hard to deal with. Today, I’m turning my attention to a concrete example that drives me absolutely batshit crazy: taking a perfectly good unique-id field (in this case, the ISBN in the 020) and appending stuff onto the end of it. The point is not to mock anything. Mocking will, however, be included for free. What’s supposed to be in the 020? Well, for starters, an ISBN (10 or 13 digit, we’re not picky). Let’s not worry, for the moment, about the actual ISBN and whether it’s valid or…
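As a sketch of the cleanup this kind of data forces on you: a hypothetical helper that pulls the leading ISBN out of an 020 $a value with free-text qualifiers appended (the function name and regex are mine, not from any standard library):

```ruby
# Hypothetical cleanup: extract the leading ISBN from an 020 $a value
# like "0306406152 (pbk.)" or "978-0-306-40615-7 (v. 1 : alk. paper)".
# Returns the bare ISBN with hyphens removed, or nil if none is found.
def extract_isbn(raw)
  m = raw.to_s.match(/\A\s*(97[89][\d-]{10,14}|[\d-]{9,12}[\dxX])/)
  m && m[1].delete("-")
end
```

Note this only rescues the identifier; whatever the parenthetical note was trying to say (binding, volume, paper stock) is simply lost, which is rather the point.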


Why programmers hate free text in MARC records

One of the frustrating things about dealing with MARC (nee AACR2) data is how much nonsense is stored in free text when a unique identifier in a well-defined place would have done a much better job. A lot of people seem to not understand why. This post, then, is for all the catalogers out there who constantly answer my questions with, “Well, it depends” and don’t understand why that’s a problem. Description vs Findability I’m surprised — and a little dismayed — by how often I talk to people in the library world who don’t understand the difference between description…


Size/speed of various MARC serializations using ruby-marc

Ross Singer recently updated ruby-marc to include a #to_hash method that creates a data structure that is (a) round-trippable without any data loss, and (b) amenable to serializing to JSON. He’s calling it marc-in-json (even though the serialization is up to the programmer, it’s expected most of us will use JSON), and I think it’s the way to go in terms of JSON-able MARC data. I wanted to take a quick look at the space/speed tradeoffs of using various means to serialize MARC records in the marc-in-json format compared to using binary MARC-21. Why bother? Binary MARC-21 is “broken” in…
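The space side of that trade-off is easy to poke at yourself. A stdlib-only sketch comparing serialized sizes of a marc-in-json-shaped hash (the record data is invented; this is not the post's actual test set):

```ruby
require "json"

# marc-in-json-shaped hash (structure per the marc-in-json convention;
# the bibliographic data itself is invented for illustration).
record = {
  "leader" => "00000nam a2200000 a 4500",
  "fields" => [
    { "001" => "000000001" },
    { "245" => { "ind1" => "1", "ind2" => "0",
                 "subfields" => [{ "a" => "A title :" },
                                 { "b" => "a subtitle" }] } }
  ]
}

json_bytes    = JSON.generate(record).bytesize
marshal_bytes = Marshal.dump(record).bytesize
puts "json:    #{json_bytes} bytes"
puts "marshal: #{marshal_bytes} bytes"
```

The round-trippability requirement is the important part: whatever the container, parsing it back must yield exactly the structure you started with.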


Why RDA is doomed to failure

[Note: edited for clarity thanks to rsinger’s comment, below] Doomed, I say! DOOOOOOOOOOMMMMMMMED! My reasoning is simple: RDA will fail because it’s not “better enough.” Now, those of you who know me might be saying to yourselves, “Waitjustaminute. Bill doesn’t know anything at all about cataloging, or semantic representations, or the relative merits of various encapsulations of bibliographic metadata. I mean, sure, he knows a lot about…err….hmmm…well, in any case, he’s definitely talking out of his ass on this one.” First off, thanks for having such a long-winded internal monologue about me; it’s good to be thought of. And, of…


Stupid catalog tricks: Subject Headings and the Long Tail

Library of Congress Subject Headings (LCSH) in particular. I’ve always been down on LCSH because I don’t understand them. They kinda look like a hierarchy, but they’re not really. Things get modifiers. Geography is inline and …weird. And, of course, in our faceting catalog when you click on a linked LCSH to do an automatic search, you often get nothing but the record you started from. Which is super-annoying. So, just for kicks, I ran some numbers. The process: I extracted all the 650 fields with indicator2="0" from our catalog, threw away the subfield 6s, and threw away any trailing punctuation…
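The tallying step described above can be sketched in a few lines of plain Ruby; the sample headings here are invented, and the punctuation-stripping regex is my approximation of the normalization described:

```ruby
# Normalize heading strings (strip trailing punctuation) and count them.
headings = [
  "Cats.",
  "Cats",
  "Dogs -- Behavior.",
  "Quantum chromodynamics."
]

counts = Hash.new(0)
headings.each do |h|
  normalized = h.sub(/[\s.,;:\/]+\z/, "")  # drop trailing punctuation
  counts[normalized] += 1
end

# Most frequent headings first -- the "head"; the long tail is everything
# with a count of 1.
counts.sort_by { |_, n| -n }.each { |h, n| puts "#{n}\t#{h}" }
```

Without the normalization, "Cats." and "Cats" count as two different headings, which inflates the tail.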


ruby-marc with pluggable readers

I’ve been messing with easier ways of adding parsers to ruby-marc’s MARC::Reader object. The idea is that you can do this:

    require 'marc'
    require 'my_marc_stuff'

    mbreader = MARC::Reader.new('test.mrc')  # => Stock marc binary reader
    mbreader = MARC::Reader.new('test.mrc', :readertype => :marcstrict)  # => ditto

    MARC::Reader.register_parser(My::MARC::Parser, :marcstrict)
    mbreader = MARC::Reader.new('test.mrc')  # => Uses My::MARC::Parser now

    xmlreader = MARC::Reader.new('test.xml', :readertype => :marcxml)

    # ...and maybe further on down the road
    asreader = MARC::Reader.new('test.seq',  :readertype => :alephsequential)
    mjreader = MARC::Reader.new('test.json', :readertype => :marchashjson)

A parser need only implement #each and a module-level method #decode_from_string. Read all about it on the github page.
