Tag: Data Cleanup

How good/bad is MARC data? The case of place-of-publication

I complain a lot about the MARC format, the way people put data in MARC records, the actual data themselves I find in MARC records, the inexplicably complex syntax for identifiers and, ironically, attempts to replace MARC with something else. One nice little beacon of hope was when I found that only roughly 0.26% of the ISBNs in the UMich catalog have invalid checksums. That’s not bad at all, and it’s worth digging into other things about which I might be likely to complain before I make a fool of myself. [Note: there will be some complaining at the end.…
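For readers who have not met ISBN check digits, here is a minimal sketch of the kind of test that sits behind a figure like that 0.26%. It uses the standard ISBN-10 (mod 11) and ISBN-13 (mod 10) rules. The function names are made up for illustration, and counting malformed strings as invalid is an assumption, not necessarily how the original number was produced.

```python
import re

# Shapes of a normalized ISBN: 10 characters (last may be X) or 13 digits.
_ISBN10 = re.compile(r"[0-9]{9}[0-9X]")
_ISBN13 = re.compile(r"[0-9]{13}")


def normalize(raw: str) -> str:
    """Drop hyphens and spaces, and upper-case a trailing 'x'."""
    return raw.replace("-", "").replace(" ", "").upper()


def isbn_checksum_ok(raw: str) -> bool:
    """Return True if raw looks like an ISBN-10 or ISBN-13 with a valid check digit."""
    isbn = normalize(raw)
    if _ISBN10.fullmatch(isbn):
        # Weights run 10, 9, ..., 1; 'X' stands for 10 in the last position.
        total = sum((10 - i) * (10 if ch == "X" else int(ch))
                    for i, ch in enumerate(isbn))
        return total % 11 == 0
    if _ISBN13.fullmatch(isbn):
        # Weights alternate 1, 3, 1, 3, ... across all 13 digits.
        total = sum(int(ch) * (3 if i % 2 else 1) for i, ch in enumerate(isbn))
        return total % 10 == 0
    # Assumption: anything that is not 10 or 13 characters counts as invalid.
    return False


def invalid_share(isbns) -> float:
    """Fraction of the given ISBN strings that fail the check."""
    isbns = list(isbns)
    if not isbns:
        return 0.0
    return sum(1 for i in isbns if not isbn_checksum_ok(i)) / len(isbns)
```

Running `invalid_share` over every ISBN pulled from a catalog gives a single number that can be compared against the figure above, as long as the input really is just the ISBN and nothing else.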

ISBN parenthetical notes: Bad MARC data #1

Yesterday, I gave a brief overview of why free text is hard to deal with. Today, I’m turning my attention to a concrete example that drives me absolutely batshit crazy: taking a perfectly good unique-id field (in this case, the ISBN in the 020) and appending stuff onto the end of it. The point is not to mock anything. Mocking will, however, be included for free. What’s supposed to be in the 020? Well, for starters, an ISBN (10- or 13-digit, we’re not picky). Let’s not worry, for the moment, about the actual ISBN and whether it’s valid or…
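To make the problem concrete, here is a rough sketch of what a program has to do before it can use such a field as an identifier: peel the leading ISBN-looking run off the front and treat everything after it as a free-text note. The function name, the regular expression, and the sample strings are all illustrative assumptions, not taken from any real record or from the post itself.

```python
import re

# A leading run that looks like an ISBN (digits and hyphens, optional final X),
# followed by whatever free text was appended to it.
_LEADING_ISBN = re.compile(r"^\s*([0-9][0-9-]{8,15}[0-9Xx])\s*(.*)$")


def split_isbn_field(text: str):
    """Split an ISBN field value into (isbn, trailing_note).

    Returns (None, stripped_text) when no ISBN-like prefix is found.
    The ISBN is returned without hyphens; its checksum is not verified here.
    """
    match = _LEADING_ISBN.match(text)
    if match is None:
        return None, text.strip()
    isbn = match.group(1).replace("-", "").upper()
    return isbn, match.group(2).strip()


# Made-up sample values, in the style of a parenthetical note after the number.
for value in ["0306406152 (pbk.)", "978-0-306-40615-7 (hardcover : alk. paper)", "pbk."]:
    print(split_isbn_field(value))
```

The fact that this needs a regular expression at all is the complaint: an identifier field should not require parsing.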

Why programmers hate free text in MARC records

One of the frustrating things about dealing with MARC (née AACR2) data is how much nonsense is stored in free text when a unique identifier in a well-defined place would have done a much better job. A lot of people seem not to understand why. This post, then, is for all the catalogers out there who constantly answer my questions with, “Well, it depends” and don’t understand why that’s a problem. Description vs Findability: I’m surprised, and a little dismayed, by how often I talk to people in the library world who don’t understand the difference between description…

Stupid catalog tricks: Subject Headings and the Long Tail

Library of Congress Subject Headings (LCSH) in particular. I’ve always been down on LCSH because I don’t understand them. They kinda look like a hierarchy, but they’re not really. Things get modifiers. Geography is inline and …weird. And, of course, in our faceting catalog when you click on a linked LCSH to do an automatic search, you often get nothing but the record you started from. Which is super-annoying. So, just for kicks, I ran some numbers. The process: I extracted all the field 650, indicator2="0" from our catalog, threw away the subfield 6’s, and threw away any trailing punctuation…
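The extraction steps named in that excerpt can be sketched roughly as follows. This is not the original script: the dict-based field representation, the " -- " separator, the exact set of punctuation stripped, and the idea of tallying headings with a counter are all assumptions made for illustration.

```python
from collections import Counter

# Assumption: these are the trailing characters worth throwing away.
TRAILING = " .,;:/"


def lcsh_heading(field):
    """Build a normalized heading string from one field, or None if it is not LCSH.

    A field is assumed to look like:
    {"tag": "650", "ind2": "0", "subfields": [("a", "Cats"), ("z", "Michigan.")]}
    """
    if field["tag"] != "650" or field["ind2"] != "0":
        return None
    # Throw away subfield 6 and strip trailing punctuation from each value.
    parts = [value.rstrip(TRAILING)
             for code, value in field["subfields"] if code != "6"]
    parts = [p for p in parts if p]
    return " -- ".join(parts) if parts else None


def heading_counts(fields):
    """Count how many times each normalized heading appears."""
    return Counter(h for h in map(lcsh_heading, fields) if h is not None)


def singleton_headings(counts):
    """Headings that occur exactly once, i.e. links that lead only back to one record."""
    return sorted(h for h, n in counts.items() if n == 1)
```

Note that stripping a trailing period this bluntly also clips legitimate abbreviations, so a real run would probably need a gentler rule.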

Psst. We’re not printing cards anymore

[From a series I’m calling, “Things About The Library I Think Are Stoooopid”, part one of about a zillion.] I’m going to wallow in a little bit of hyperbole here, but only a little. The problem: Suppose, just for a moment, that you’re a computer programmer working anytime in the last twenty years, and someone wants you to set up a data structure to deal with a timeless issue: how to keep track of who’s on which committees in a library. If you’re a computer person: Easy enough. First off, what’s a committee? Committee: Committee name (string), Committee inception…

