Stupid catalog tricks: Subject Headings and the Long Tail

April 13, 2010

Library of Congress Subject Headings (LCSH) in particular.

I’ve always been down on LCSH because I don’t understand them. They kinda look like a hierarchy, but they’re not really. Things get modifiers. Geography is inline and …weird.

And, of course, in our faceting catalog when you click on a linked LCSH to do an automatic search, you often get nothing but the record you started from. Which is super-annoying.

So, just for kicks, I ran some numbers.

The process

I extracted every 650 field with indicator2=“0” from our catalog, threw away the subfield 6’s, and stripped any trailing punctuation from each of the remaining subfields. I called the concatenation of what was left a unique LCSH.

Then I printed them out and put them all onto index cards, using tick-marks to indicate…

No, of course not. I used sort, uniq -c, and wc -l. Here’s what I found.
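For the curious, the same counting can be sketched in Python instead of sort and uniq -c. This is a minimal sketch, not the script actually used: the (indicator2, subfields) input shape, the sample fields, and the exact set of trailing punctuation being stripped are all assumptions made for illustration.

```python
from collections import Counter

# Assumption: which trailing characters count as "punctuation" is a guess.
TRAILING_PUNCT = " .,;:/"

def heading_key(subfields):
    """Build one heading string from the (code, value) pairs of a 650 field."""
    parts = []
    for code, value in subfields:
        if code == "6":  # throw away subfield 6
            continue
        parts.append("$$" + code + value.rstrip(TRAILING_PUNCT))
    return "".join(parts)

# Hypothetical sample data: one (indicator2, subfields) tuple per 650 field.
fields = [
    ("0", [("a", "Philosophy.")]),
    ("0", [("a", "Philosophy")]),
    ("0", [("a", "Economics"), ("x", "History"), ("v", "Sources.")]),
    ("2", [("a", "Philosophy")]),  # indicator2 is not 0, so it is skipped
    ("0", [("6", "880-01"), ("a", "Piano music.")]),
]

# Equivalent of sort | uniq -c: heading string -> number of occurrences.
counts = Counter(heading_key(sf) for ind2, sf in fields if ind2 == "0")

total_entries = sum(counts.values())                    # all subject entries
unique_headings = len(counts)                           # distinct headings
singletons = sum(1 for n in counts.values() if n == 1)  # used exactly once

print(total_entries, unique_headings, singletons)
```

Note that stripping a trailing period this way would also clip headings that legitimately end in an abbreviation, so a real run would probably want a more careful rule.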

Counts of LCSH

…in round numbers.

In our catalog, there are:

  • 8.50M subject headings (using the definition above)
  • 1.87M unique subject headings
  • …66% of which (1.23M) appear exactly once

We only have to go out to the 30K most-used subjects to account for half of all subject entries. The top 1000 most-used subjects account for 14.5% of all 8.5M subject entries.
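Both of those numbers come from walking down the headings in order of use and keeping a running total. A rough sketch of that calculation follows; the function names and the toy counts are made up for illustration and are not our catalog's data.

```python
def headings_needed(counts, share):
    """Smallest number of most-used headings covering `share` of all entries."""
    total = sum(counts)
    running = 0
    for rank, n in enumerate(sorted(counts, reverse=True), start=1):
        running += n
        if running >= share * total:
            return rank
    return len(counts)

def top_n_share(counts, n):
    """Fraction of all entries accounted for by the n most-used headings."""
    ordered = sorted(counts, reverse=True)
    return sum(ordered[:n]) / sum(ordered)

# Hypothetical per-heading usage counts (they sum to 100).
toy = [30, 20, 10, 10, 10, 5, 5, 4, 3, 1, 1, 1]

print(headings_needed(toy, 0.5))  # how far out to go to cover half
print(top_n_share(toy, 1))        # share covered by the single top heading
```

Run against the real counts, headings_needed(counts, 0.5) would correspond to the 30K figure and top_n_share(counts, 1000) to the 14.5% figure.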

The top ten subjects by count, in ascending order, are:

  • 6029 $$aSermons, American
  • 6131 $$aPhilosophy
  • 7224 $$aFeature films
  • 7591 $$aPiano music
  • 7968 $$aSocialism
  • 8796 $$aEconomics
  • 9185 $$aCommunism
  • 12440 $$aSermons, English$$y17th century
  • 13539 $$aBills, Private$$zUnited States
  • 58823 $$aEconomics$$xHistory$$vSources

From a record’s point of view

Our catalog has:

  • 7M records
  • 4.4M records with at least one subject (as defined above)
  • 2.4M records with more than one subject
  • 2.0M records with exactly one subject
  • 2.6M records with zero subjects
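The per-record breakdown is just a matter of bucketing each record by how many headings it carries. Here is a small sketch, assuming the headings have already been grouped by record; the record ids and headings below are hypothetical.

```python
from collections import Counter

def bucket(n):
    """Name the bucket for a record carrying n subject headings."""
    if n == 0:
        return "zero"
    if n == 1:
        return "exactly one"
    return "more than one"

# Hypothetical data: record id -> list of heading strings.
subjects_per_record = {
    "r1": ["$$aPhilosophy"],
    "r2": ["$$aSocialism", "$$aCommunism"],
    "r3": [],
    "r4": ["$$aPiano music"],
    "r5": [],
}

tally = Counter(bucket(len(h)) for h in subjects_per_record.values())
at_least_one = tally["exactly one"] + tally["more than one"]

print(dict(tally), at_least_one)
```

As a sanity check on the numbers above: 2.4M + 2.0M gives the 4.4M records with at least one subject, and adding the 2.6M with none gives the 7M total.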

The records with the most subject headings tend to be collections of stuff (theses, photos, etc). Our local standout is the Dept. of Medicine and Surgery (University of Michigan) theses, 1851-1878 with 208 subject entries. 14 records have at least 30 subject entries.

What it means

Gee, lady, I don’t know.

One way to look at it: suppose you’re considering defining subjects in this way, and making them “hot” in the catalog interface. For our data, 2/3 of records would have either no subjects or a subject that found only the record you’re at. So…think again.
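One way to put a number on "dead end" records is sketched below. It assumes a particular reading of the sentence above: a record is a dead end if it has no headings at all, or if every heading it has appears on no other record. Counting records where any single heading is a dead end would be a different, larger number. The sample records are hypothetical.

```python
from collections import Counter

# Hypothetical data: record id -> list of heading strings.
records = {
    "r1": ["$$aPhilosophy"],
    "r2": ["$$aPhilosophy", "$$aSocialism$$zFrance"],
    "r3": ["$$aPiano music$$vScores"],
    "r4": [],
}

# Number of distinct records carrying each heading.
record_counts = Counter(h for hs in records.values() for h in set(hs))

def is_dead_end(headings):
    """True if no heading on the record leads to any other record.

    all() over an empty list is True, so records with no headings count too.
    """
    return all(record_counts[h] == 1 for h in headings)

dead_ends = sorted(rid for rid, hs in records.items() if is_dead_end(hs))
dead_end_share = len(dead_ends) / len(records)

print(dead_ends, dead_end_share)
```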

In real life, we index lots of possible subject fields, and we additionally index the $$a as well as the whole string, so ours are a little bit more useful. A little.
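To show why indexing the $$a on its own helps, here is a rough sketch of the idea. It is not our actual indexing code: the index_terms helper, the "$$"-delimited string format, and the sample records are assumptions for illustration.

```python
from collections import defaultdict

def index_terms(heading):
    """Return the whole heading string plus its leading $$a part, if any."""
    terms = {heading}
    for part in heading.split("$$"):
        if part.startswith("a"):
            terms.add("$$" + part)  # the $$a subfield on its own
            break
    return terms

# Hypothetical data: record id -> list of heading strings.
records = {
    "r1": ["$$aEconomics$$xHistory$$vSources"],
    "r2": ["$$aEconomics"],
    "r3": ["$$aSermons, English$$y17th century"],
}

# Inverted index: term -> set of record ids carrying it.
index = defaultdict(set)
for rid, headings in records.items():
    for heading in headings:
        for term in index_terms(heading):
            index[term].add(rid)

print(sorted(index["$$aEconomics"]))
print(sorted(index["$$aEconomics$$xHistory$$vSources"]))
```

In this toy data the full string finds only the record it came from, while the bare $$a term also pulls in the other Economics record, which is the "little bit more useful" part.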

Comments (4)

  1. Eric Lease Morgan
    April 13, 2010 at 11:25 am

    Cool! Now consider graphing the results of your counting.

    Count other things such as length of book, dates, authors, etc. When you get this far, compare the subject headings with the additional counts to see whether or not there are relationships. Are books of one type of subject generally longer than others? Was this subject heading assigned more often during specific years? Are there common authors within subject headings? Are the books in question available via full text? Can you get those full text books and determine whether or not the books were cataloged “correctly” by doing text mining against the full text? Are there sets of “better” words that could be used to describe the books?

    Fun with counting.

  2. Naomi Dushay
    April 13, 2010 at 2:53 pm

    Eric,

    I’m in the “what can we do with the data we have” camp more than the “how should our data be improved” camp, by and large. If we’re using subjects to promote discovery, having 7 million records with suboptimal, inconsistent data is not surprising. Even if there are patterns such as the ones you explored off the top of your head, retrospective cataloging seems unlikely. (whispering) In fact, perhaps human cataloging isn’t really scalable. That said, I will take LCSH over call numbers, but since we have both …

  3. Jonathan Rochkind
    April 13, 2010 at 7:53 pm

    I think this says you’ve GOT to get into the hierarchy in LCSH in order to make em useful. Yes, the hierarchy is weird. Yes, there are actually TWO OR THREE axes of hierarchy in LCSH. But it’s the data we’ve got, like Naomi says, and I think you’ve got to get into it to make em useful.

    So a “subject” according to your analysis is simply the pre-coordinated, for example,

    “Great Britain — Social conditions — 19th century.”

    If you just make that a link and show all things with exactly the same heading, will you get others? Well, in my catalog, you’ll actually get a few hundred, yeah. (And I TRIED to find a good example that wouldn’t do that! But maybe I didn’t try hard enough). But let’s pretend not.

    Okay, but how many will you get if you look for “Great Britain — Social conditions”, or “Great Britain — Social conditions — ANY”? A lot lot more.

    How about “Great Britain — Social conditions — 19th century — something else”? A lot more there too.

    These subject headings were designed for a card catalog world where they’d all be laid out alphabetically, so the “wildcard” strings I suggest would necessarily be right next to the original subject.

    Our challenge is to figure out how to present these things in a rational way in the online environment instead. But it’s definitely not only linking to things with exactly the same pre-coordinated subject heading — if that often gets you very few hits other than your origin record, it’s because that’s not what LCSH was designed for.

  4. John Mark Ockerbloom
    April 15, 2010 at 1:34 pm

    “Our challenge is to figure out how to present these things in a rational way in the online environment instead.”

    Well, here the online environment can work with you rather than against you, since you can make displays that show you the hierarchies radiating out in all kinds of directions.

    Ideally, for “Great Britain — Social conditions — 19th century” you’d see not just that subject and its books, but related subjects and their books as well, displayed in a way that makes it easy for you both to find books of interest and to shift your focus based on what you find.

    See this link for an example of how it works for “Great Britain — Social conditions — 19th century” in a collection of about 40,000 titles. The display shows 7 titles with that subject, and also shows a few others with more specialized subjects (social conditions in England during that time, or social conditions for women). And then it goes on to shift the focus outward a bit, looking at books on social conditions in England without the explicit time qualifier, and so on.