Archives: April 2011

ISBN parenthetical notes: Bad MARC data #1

April 12, 2011 at 12:22 pm | Category: Uncategorized

Yesterday, I gave a brief overview of why free text is hard to deal with.

Today, I’m turning my attention to a concrete example that drives me absolutely batshit crazy: taking a perfectly good unique-id field (in this case, the ISBN in the 020) and appending stuff onto the end of it.

The point is not to mock anything. Mocking will, however, be included for free.

What’s supposed to be in the 020?

Well, for starters, an ISBN (10 or 13 digit, we’re not picky).

Let’s not worry, for the moment, about the actual ISBN and whether it’s valid or not.

Wait, no, let’s go ahead and worry about it. It’s an easy enough script to write, although it takes a while to run.

8,630,794  Total records
3,220,666  Total 020a's
    6,498  020a's that don't obviously contain an ISBN
    8,407  that look like an ISBN but fail checksum test:
... so 0.26% of the ISBNs have invalid checksums

So, not bad at all, especially considering some of those are known to be bad, but are transcribed dutifully from the actual (mis-)printed book.
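
For the curious, a checksum script along these lines might look like the sketch below. It is not the script that produced the numbers above: the regular expression, the function names, and the three result labels are assumptions made for illustration. The check-digit arithmetic is the standard ISBN-10 (mod 11) and ISBN-13 (mod 10) calculation.

```python
import re

# Assumption: an ISBN candidate is 13 digits, or 9 digits plus a digit or X,
# once hyphens inside the number have been removed.
ISBN_RE = re.compile(r"\b(\d{13}|\d{9}[\dXx])\b")


def extract_isbn(subfield_a):
    """Return the first ISBN-looking token in an 020$a value, or None."""
    # Drop hyphens that sit between digits so '0-306-40615-2' becomes one token.
    compact = re.sub(r"(?<=\d)-(?=[\dXx])", "", subfield_a)
    match = ISBN_RE.search(compact)
    return match.group(1).upper() if match else None


def valid_isbn10(isbn):
    """ISBN-10: weights 10 down to 1, 'X' counts as 10, sum divisible by 11."""
    total = 0
    for position, char in enumerate(isbn):
        value = 10 if char == "X" else int(char)
        total += (10 - position) * value
    return total % 11 == 0


def valid_isbn13(isbn):
    """ISBN-13: weights alternate 1, 3, 1, 3, ...; sum divisible by 10."""
    total = sum(int(c) * (3 if i % 2 else 1) for i, c in enumerate(isbn))
    return total % 10 == 0


def check_020a(subfield_a):
    """Classify one 020$a value as 'no-isbn', 'bad-checksum', or 'valid'."""
    isbn = extract_isbn(subfield_a)
    if isbn is None:
        return "no-isbn"
    ok = valid_isbn10(isbn) if len(isbn) == 10 else valid_isbn13(isbn)
    return "valid" if ok else "bad-checksum"
```

With this sketch, check_020a("0306406152 (pbk. : alk. paper)") comes back as "valid", while a price-only value such as "$12.95" comes back as "no-isbn".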

A lot of the malformed data (anything from which I can’t seem to extract something that looks like an ISBN) is pricing data, and most of it appears in system numbers that are close enough to each other that I presume it was just a bad batch.

What goes after the ISBN in the 020?

I’m no cataloger, of course, but it looks to me like the answer is “Something about how the book is bound together, or the publisher, unless you want to put something else there, and then, really, go ahead, because it’s not like anyone is ever going to want to parse this out, all we need to do is print cards with it for god’s sake.”

No, I kid, I kid! The actual rules are in Library of Congress Rule Interpretation 1.8, which reads, in part:

For a hardbound resource, there is no attempt to use a consistent term other than to use one that conveys the condition intelligibly.

I think it’s important to read that a second time, because it succinctly conveys the culture in which these rules were devised.

  • Don’t worry about consistency, because your only reader is human.
  • Defer to the cataloger.
  • Being complete is more important than being consistent.
  • Base your notes on your subjective view of the actual, physical item you’re presumed to be holding in your hands.

Interestingly (to me, anyway), it looks like the OCLC once had a (now deprecated) $$b subfield for binding information. Apparently it didn’t catch on.

What did I find?

So, let’s pretend I’d like to be able to differentiate between paperback and hardbound books. Probably useful, yes?

I went ahead and took all parenthetical notes from any subfield in the 020, split them on colon (’cause that seems to be the way they roll) and did some basic normalization:

  • Eliminate numbers (so ‘vol. 1’ and ‘vol. 2’ count as only one pattern)
  • Lowercase everything
  • Turn runs of spaces into a single space
  • Trim leading/trailing spaces
  • Remove any trailing punctuation
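
Those steps can be sketched in a few lines. This is a reconstruction under assumptions, not the code actually used: the function name is made up, "eliminate numbers" is read as deleting digit runs outright, and the set of characters treated as trailing punctuation is a guess.

```python
import re
from collections import Counter


def parenthetical_notes(subfield):
    """Yield normalized notes from the parenthesized parts of one 020 subfield."""
    for group in re.findall(r"\(([^)]*)\)", subfield):
        for note in group.split(":"):                 # split on colon
            note = re.sub(r"\d+", "", note)           # eliminate numbers
            note = note.lower()                       # lowercase everything
            note = re.sub(r"\s+", " ", note).strip()  # collapse and trim spaces
            note = note.rstrip(".,;:!?").strip()      # remove trailing punctuation
            if note:
                yield note


def top_notes(subfields, n=20):
    """Count normalized notes across many subfield values; return the n most common."""
    counts = Counter(note for sf in subfields for note in parenthetical_notes(sf))
    return counts.most_common(n)
```

Under these assumptions, "0306406152 (pbk. : alk. paper)" yields "pbk" and "alk. paper", and both "(v. 1)" and "(v. 22)" collapse to the same pattern.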

I found 1,506,729 parenthetical remarks in the 020 subfields of our catalog.

The top twenty most common entries using those normalizations are:

  1. 402537 pbk
  2. 387406 alk. paper
  3. 99260 v # (e.g., “v. 1”, “v. 22”, etc.)
  4. 82918 cloth
  5. 51125 hbk
  6. 42036 electronic bk
  7. 41360 acid-free paper
  8. 38792 hardcover
  9. 28913 set
  10. 20358 hardback
  11. 19160 ebook
  12. 16264 paper
  13. 15269 u.s
  14. 12770 hd.bd
  15. 11793 print
  16. 10625 lib. bdg
  17. 10520 hc
  18. 8772 est
  19. 7767 pb
  20. 7639 hard

The kicker? These are the top twenty of 13,374 unique parenthetical strings found in the 020 field. Many of them are publishers, or cities, or whatnot, but an awful lot of them are variations on “hardcover” and “paperback.”

For example, a quick search for anything that might be “hard” (regexp: /h[ar]{0,2}d/) got me started on a list. Here’s just the 90 examples from that list that start with ‘h’:

hard | hard adhesive | hard back | hard bd | hard book | hard bound | hard bound book | hard boundhard case | hard casehard copy | hard copy | hard copy set | hard cov | hard cover | hard covers | hard sewn | hard signed | hard-backhard-backcased | hard-bound | hard-cover | hard-cover acid-free | hardb | hard\cover | hardbach | hardback | hardback book | hardback cover | hardbackcased | hardbd | hardbk | hardbond | hardbook | hardboubd | hardbound | hardboundhardboundtion | hardc | hardcase | hardcopy | hardcopy publication | hardcov | hardcov er | hardcovcer | hardcove | hardcover | hardcover-alk. paper | hardcovercloth | hardcoverflexibound | hardcoverhardcoverwith cd | hardcoverr | hardcovers | hardcoversame | hardcoversame as above | hardcoverset | hardcovertion | hardcver | hardcvoer | hardcvr | harddback | harde | hardocover | hardover | hardpack | hardpaper | hardvocer | hardware | hd | hd bd | hd. bd | hd. bd. in slip case | hd. bd.in sl.cs | hd. bk | hd. cover | hd.bd | hd.bd. in box | hdb | hdbd | hdbk | hdbkb | hdbkhdbk | hdbnd | hdc | hdcvr | hdk | hdp | hdpk | hradback | hradcover | hrd | hrdbk | hrdcver | hrdcvr

And that’s after eliminating things like places of publication, strings like “with…”, “plus…”, “alk. paper”, etc.
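
To show how blunt that search is, here is a small sketch that applies the same pattern to a handful of normalized notes. The function name and the exclusion set are hypothetical, and nothing here is a real binding classifier; it only picks out candidates for a human to review.

```python
import re

# The pattern quoted above: 'h', then up to two of 'a'/'r', then 'd'.
HARD_RE = re.compile(r"h[ar]{0,2}d")

# Hypothetical exclusion list: strings that match but do not describe a binding.
NOT_BINDINGS = {"hardware"}


def looks_hardbound(note):
    """True if a normalized note matches the pattern and is not excluded."""
    return HARD_RE.search(note) is not None and note not in NOT_BINDINGS


samples = ["hardcover", "hd.bd", "hrdcvr", "hradback", "hardware", "pbk", "hbk", "hc"]
candidates = [note for note in samples if looks_hardbound(note)]
```

Note what the pattern does not catch: "hbk" and "hc", both in the top twenty and both presumably meaning hardcover, contain no "d" at all, so any real mapping would need a lookup table on top of the regex.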

“Yeah, but you have to understand that historically…”

Stop hiding behind that.

I understand that at one point in time it probably made sense (to someone at least) to do it this way. I can deal with that.

What I can’t accept is that as I type this there’s a cataloger doing this in this way. Today. April 2011. Some, what? maybe thirty years since computer-based OPACs became prevalent?

These sorts of problems were recognized ages ago and should have been dealt with. Add a subfield. Invent a controlled vocabulary. Don’t worry about the legacy data; it’s always going to suck.

But why are we still producing sucky data???

To sum up

The point is that there’s a better way to do this stuff. Lots and lots of better ways, in fact. Time I spend dealing with crappy data is time I don’t spend making relevancy ranking better, or building a better command language search option for my librarians, or working on ways to get a decent “more like this”.

The need is both dire and urgent; the latter because sooner or later we’re going to have to go to a “two state solution” with traditional MARC21 for many of our records and whatever comes next (RDA?) for the newer stuff. And every day we wait, that first category grows, and the growth rate keeps increasing.

And then there’s serials. Don’t talk to me about serials.

6 Responses to “ISBN parenthetical notes: Bad MARC data #1”

  1. Chris says:

    Okay, so, what do you suggest we actually DO about this?

  2. Jakob says:

    What we should do is first make clear what parts of cataloging produce useless junk (like Bill did) and second clean up and normalize the data. A typical reason for avoidable quality problems is a lack of feedback. If you do not instantly get feedback about ill-formed data when you start to create a record, you are unlikely to change your cataloging practice. So third we should create better cataloging clients. As long as cataloging rules are only written as rules instead of implemented as code that gives error messages, I doubt that we will get better data.

  3. Karen Coyle says:

    Bill, ISBN is one of the examples I use in my talks when I get to the point of “text versus data”. I have grabbed a couple of examples, but you’ve done a full-blown study, and I’m here to thank you for it! I will point people to this post for more info.

    Chris, as to what we do… given that any library system in existence today has algorithms to separate the ISBN from the rest of the subfield, we need to add a new subfield to MARC to hold the text and make that separation permanent. Actually, we needed to have done that 20 years ago, and I almost feel like now it’s too late to make the change worthwhile since it looks like we’re on the verge of moving beyond MARC anyway.

  4. Matthew Phillips says:

    Yet another example of where the UKMARC format was superior. In UKMARC the ISBN was in subfield “a”, qualifying remarks in subfield “c” and price in subfield “d”. There was even a subfield “b” with a code to indicate the type of ISBN (e.g. whether it was the ISBN for a set of volumes or an individual volume).

    Sadly UKMARC was abandoned about ten years ago because it was just so unsatisfactory trying to convert from USMARC to UKMARC. It’s always easier to convert metadata into a format which is less expressive, so we all moved to USMARC instead.

  5. [...] could look in the 020s for a hint of whether it’s hardcover or paperback (which is really hard. And maybe try to figure out if multiple volumes of a multi-volume work are all checked out and [...]

  6. [...] Bill Dueber differentiating between bindings [...]

Why programmers hate free text in MARC records

One of the frustrating things about dealing with MARC (nee AACR2) data is how much nonsense is stored in free text when a unique identifier in a well-defined place would have done a much better job.

A lot of people seem to not understand why.

This post, then, is for all the catalogers out there who constantly answer my questions with, “Well, it depends” and don’t understand why that’s a problem.

Description vs Findability

I’m surprised — and a little dismayed — by how often I talk to people in the library world who don’t understand the difference between description and findability. AACR2 is clearly designed for description; once you’ve found a record, it does a pretty good job telling a human being what she’s looking at. With respect to a person who’s already got a copy of the record in her (virtual) hand, strings of text and reasonable abbreviations are…well, often good enough, let’s say.

But much of AACR2 is a giant mountain of fail when it comes to supporting findability — the ability for a machine to slice and dice the data in ways that can be mapped onto searches and transformations. What those of us on the business end of the computer need are well-defined values stuck into well-defined places that represent well-defined relationships.

Free text stuck on the end of a field fails all three of those criteria.

Machine Reasoning vs. Machine Parsing

When many people look at something like RDF, their first reaction is, “Great Googally Moogally! Just tell me the language! I don’t want to follow a chain of reasoning that’s seventeen steps long just to figure out the damn thing is in English!!!”

Of course you don’t. And you don’t have to. Someone — hopefully someone smarter than me — needs to write a program to do it. And we can.

Following all that logic — deriving relationships, figuring out eventual values, determining how to convert between various forms — is what I’ll call (for simplicity’s sake) machine reasoning. And machine reasoning — for the purposes of this discussion, anyway — is a solved problem. I’m not saying it’s not hard, and I’m not saying it might not take gobs of hardware resources. But we, the collective of humanity, know how to do it.

On the other hand, machine parsing — looking at all that free text that is sprinkled throughout our records and trying to turn it into something that is susceptible to machine reasoning — is vehemently not a solved problem. Even if you ignore all the misspellings, we’re still stuck with one-off abbreviations, lack of ordering, gobs of “local practice,” and iffy punctuation.

And, come to think of it, you can’t ignore the misspellings, either.

The point is this: good data trumps everything else. If there’s good, solid, well-defined data in computable places, we can (given some time) do damn near anything with it. If there’s human-entered, free-text, parenthetical-remark-type data, we’re pretty much stuck.

Examples?

Jonathan Rochkind just did a great post looking at LC call numbers, and how, well, they might be in a few different places, and may or may not be valid LC call numbers, and so on and on and on and on.

And my next post (hopefully tomorrow) will be an analysis of the first freetext in MARC I ever tried to deal with — the parenthetical remarks in the 020 (ISBN) field. If that doesn’t keep you up all night, well, I don’t know what will.

6 Responses to “Why programmers hate free text in MARC records”

  1. Chris says:

    I don’t think they don’t understand — I mean, the catalogers I deal with around here are pretty clever. I think that YOUR concerns are not THEIR concerns and they are working within the constraints of a system that has encouraged, and in some ways even forced, then to get cute with parens and abbreviations because no one was going to change anything to support their need to differentiate.

    The typos are of course only human, so I’m whatever on those. The punctuation, though? I cannot complain about enough. ;-)

  2. Chris says:

    You’ll note I demonstrated how okay I am with typos by inserting one. I did that on purpose, you know. No, really….

  3. Programmers hate MARC because it’s the cool thing to do.

    MARC records contain a lot more than the description. You don’t mention the subject metadata that most contain. This greatly enables retrieval (or findability). You do mention class numbers, which are also not part of the description and also help with retrieval.

    Also, there’s no reason to play findability and description off against each other. Each fills a different need.

  4. Cathy says:

    Just guessing here, since I’ve always been Collection Development/Acquisitions, but haven’t those MARC records been pulled in from a lot of different sources, over many years, from the hands of many different catalogers? I remember something about how the older a library’s records are (the cataloging records, I mean), the more hands have worked on them through the years, with a steady increase of mistakes or differences. Example: your search results featuring hard, hardback, etc., are possibly not even from the same library, originally. The fun of copy cataloging? Just a thought.

  5. Bill says:

    Jeffery, that’s a….let’s say perhaps an “overly broad” statement, and I’m not sure what you hope to add to the discussion with it. Programmers don’t, of course, hate MARC because “it’s the cool thing to do.” MARC-as-data-format is outdated and has a lot of flaws that are well-known. MARC-as-AACR2 as I’m treating it here is easy to hate because it’s full of description that could easily be made susceptible to machine parsing, but isn’t, and hence is at least partially wasted effort in that people are doing work that could be useful along both dimensions but isn’t.
