Why programmers hate free text in MARC records

April 11, 2011

One of the frustrating things about dealing with MARC (nee AACR2) data is how much nonsense is stored in free text when a unique identifier in a well-defined place would have done a much better job.

A lot of people don’t seem to understand why that’s frustrating.

This post, then, is for all the catalogers out there who constantly answer my questions with, “Well, it depends” and don’t understand why that’s a problem.

Description vs Findability

I’m surprised — and a little dismayed — by how often I talk to people in the library world who don’t understand the difference between description and findability. AACR2 is clearly designed for description; once you’ve found a record, it does a pretty good job telling a human being what she’s looking at. With respect to a person who’s already got a copy of the record in her (virtual) hand, strings of text and reasonable abbreviations are…well, often good enough, let’s say.

But much of AACR2 is a giant mountain of fail when it comes to supporting findability — the ability for a machine to slice and dice the data in ways that can be mapped onto searches and transformations. What those of us on the business end of the computer need are well-defined values stuck into well-defined places that represent well-defined relationships.

Free text stuck on the end of a field fails all three of those criteria.
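
To make those three criteria concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the `Binding` vocabulary, the `IsbnEntry` shape, and the placeholder ISBN values are hypothetical and not taken from any real catalog or MARC library.

```python
from dataclasses import dataclass
from enum import Enum


class Binding(Enum):
    """A hypothetical controlled vocabulary for binding."""
    HARDCOVER = "hardcover"
    PAPERBACK = "paperback"


@dataclass(frozen=True)
class IsbnEntry:
    """Hypothetical structured form: one defined value per defined slot."""
    isbn: str
    binding: Binding


# Made-up placeholder values, for illustration only.
structured = [
    IsbnEntry("0000000001", Binding.PAPERBACK),
    IsbnEntry("0000000002", Binding.HARDCOVER),
    IsbnEntry("0000000003", Binding.PAPERBACK),
]

# The same information as free text stuck on the end of the field.
free_text = [
    "0000000001 (pbk.)",
    "0000000002 (hard)",
    "0000000003 (paperback : alk. paper)",
]


def paperbacks(entries):
    """With defined values in defined places, the query is an equality test."""
    return [e.isbn for e in entries if e.binding is Binding.PAPERBACK]


def paperbacks_naive(strings):
    """With free text, we can only match the spellings we thought of."""
    return [s.split()[0] for s in strings if "(pbk.)" in s]
```

The structured query finds both paperbacks. The substring match finds only the one entry that happens to use the spelling it was written for.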

Machine Reasoning vs. Machine Parsing

When many people look at something like RDF, their first reaction is, “Great Googally Moogally! Just tell me the language! I don’t want to follow a chain of reasoning that’s seventeen steps long just to figure out the damn thing is in English!!!”

Of course you don’t. And you don’t have to. Someone — hopefully someone smarter than me — needs to write a program to do it. And we can.

Following all that logic — deriving relationships, figuring out eventual values, determining how to convert between various forms — is what I’ll call (for simplicity’s sake) machine reasoning. And machine reasoning — for the purposes of this discussion, anyway — is a solved problem. I’m not saying it’s not hard, and I’m not saying it might not take gobs of hardware resources. But we, the collective of humanity, know how to do it.
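
As a toy illustration of what following that chain looks like, here is a short Python sketch. It is not a real RDF reasoner, and the identifiers and predicate names (`book:1`, `isManifestationOf`, `hasLanguage`, and so on) are invented for the example. It only shows that once every hop is a well-defined value, walking the chain is mechanical.

```python
# Hypothetical (subject, predicate, object) triples; all names are made up.
TRIPLES = [
    ("book:1", "isManifestationOf", "expr:1"),
    ("expr:1", "isExpressionOf", "work:1"),
    ("expr:1", "hasLanguage", "lang:eng"),
    ("lang:eng", "label", "English"),
]


def follow(start, path, triples=TRIPLES):
    """Walk a chain of predicates from `start`; return None if a hop is missing."""
    node = start
    for predicate in path:
        matches = [o for (s, p, o) in triples if s == node and p == predicate]
        if not matches:
            return None
        # Take the first match; a real system would have to handle several.
        node = matches[0]
    return node


language = follow("book:1", ["isManifestationOf", "hasLanguage", "label"])
```

The person asking "what language is this in?" never sees the three hops. The program does the walking, and it can do so only because each link is an unambiguous identifier rather than a string someone typed.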

On the other hand, machine parsing — looking at all that free text that is sprinkled throughout our records and trying to turn it into something that is susceptible to machine reasoning — is vehemently not a solved problem. Even if you ignore all the misspellings, we’re still stuck with one-off abbreviations, lack of ordering, gobs of “local practice,” and iffy punctuation.

And, come to think of it, you can’t ignore the misspellings, either.
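
Here is a rough Python sketch of why parsing is the hard part. The lookup table, the `guess_binding` helper, and the sample strings are all hypothetical, made up to mimic the kinds of variation listed above rather than copied from real records.

```python
import re

# Hypothetical lookup table: every spelling someone has thought of so far.
KNOWN_BINDINGS = {
    "pbk": "paperback",
    "paperback": "paperback",
    "hard": "hardcover",
    "hardback": "hardcover",
    "hardcover": "hardcover",
}

# Matches the first parenthetical remark in a string.
PAREN = re.compile(r"\(([^)]*)\)")


def guess_binding(field_value):
    """Return a normalized binding, or None when the text defeats the table."""
    match = PAREN.search(field_value)
    if match is None:
        return None
    # Split on common separators and try each piece against the table.
    for piece in re.split(r"[:;,]", match.group(1)):
        key = piece.strip().rstrip(".").lower()
        if key in KNOWN_BINDINGS:
            return KNOWN_BINDINGS[key]
    return None


# Made-up examples of the variation described above.
samples = [
    "0000000001 (pbk.)",
    "0000000002 (alk. paper : hardback)",  # ordering differs
    "0000000003 (papperback)",             # misspelling
    "0000000004 pbk",                      # no parentheses at all
    "0000000005 (v. 1 pb)",                # one-off abbreviation
]

results = [guess_binding(s) for s in samples]
```

Two of the five samples come back with a binding; the other three return `None`. Each miss can be patched with another table entry or another regular expression, but the next batch of records will bring spellings the patches didn't anticipate.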

The point is this: good data trumps everything else. If there’s good, solid, well-defined data in computable places, we can (given some time) do damn near anything with it. If there’s human-entered, free-text, parenthetical-remark-type data, we’re pretty much stuck.

Examples?

Jonathan Rochkind just did a great post looking at LC call numbers, and how, well, they might be in a few different places, and may or may not be valid LC call numbers, and so on and on and on and on.

And my next post (hopefully tomorrow) will be an analysis of the first freetext in MARC I ever tried to deal with — the parenthetical remarks in the 020 (ISBN) field. If that doesn’t keep you up all night, well, I don’t know what will.

Tags: bad data

Comments:5

  1. Chris
    April 11, 2011 at 4:30 pm

    I don’t think they don’t understand — I mean, the catalogers I deal with around here are pretty clever. I think that YOUR concerns are not THEIR concerns and they are working within the constraints of a system that has encouraged, and in some ways even forced, then to get cute with parens and abbreviations because no one was going to change anything to support their need to differentiate.

    The typos are of course only human, so I’m whatever on those. The punctuation, though? I cannot complain about enough. ;-)

  2. Chris
    April 12, 2011 at 9:41 am

    You’ll note I demonstrated how okay I am with typos by inserting one. I did that on purpose, you know. No, really….

  3. Jeffrey Beall
    April 12, 2011 at 11:29 am

    Programmers hate MARC because it’s the cool thing to do.

    MARC records contain a lot more than the description. You don’t mention the subject metadata that most contain. This greatly enables retrieval (or findability). You do mention class numbers, which are also not part of the description and also help with retrieval.

    Also, there’s no reason to play findability and description off against each other. Each fills a different need.

  4. Cathy
    April 12, 2011 at 5:40 pm

    Just guessing here, since I’ve always been Collection Development/Acquisitions, but haven’t those MARC records been pulled in from a lot of different sources, over many years, from the hands of many different catalogers? I remember something about how the older a library’s records are (the cataloging records, I mean), the more hands have worked on them through the years, with a steady increase of mistakes or differences. Example: your search results featuring hard, hardback, etc., are possibly not even from the same library, originally. The fun of copy cataloging? Just a thought.

  5. Bill
    April 12, 2011 at 9:12 pm

    Jeffrey, that’s a….let’s say perhaps an “overly broad” statement, and I’m not sure what you hope to add to the discussion with it. Programmers don’t, of course, hate MARC because “it’s the cool thing to do.” MARC-as-data-format is outdated and has a lot of flaws that are well-known. MARC-as-AACR2 as I’m treating it here is easy to hate because it’s full of description that could easily be made susceptible to machine parsing, but isn’t, and hence is at least partially wasted effort in that people are doing work that could be useful along both dimensions but isn’t.

