Normalizing LoC Call Numbers for sorting

Updated: I missed a ‘?’ in the original code that pushed a single cutter into the second-cutter position. Fixed below.

Crap. Update 2: Initial letters can be three characters long. Regexp and output changed.

LoC Call numbers tend to be a mess, and I’ve been working this morning trying to normalize them for easy string comparison.

The Perl function below takes a call number (tolerating some sloppiness in spacing and punctuation) and returns a space- and zero-padded string that can be compared, as a plain string, with other strings the function returns. It outputs stuff like this:

E                          E  0000.0000  0000  0000
E 184 .A1 G78              E  0184.0000A 1000G 7800
E184.A2 G78 1967           E  0184.0000A 2000G 7800 1967
E184.A2 G78 1970           E  0184.0000A 2000G 7800 1970
EA                         EA 0000.0000  0000  0000
EA 10                      EA 0010.0000  0000  0000
EA 10 1970                 EA 0010.0000  0000  0000 1970
EA10 B7                    EA 0010.0000B 7000  0000
EA 10.B7.G8                EA 0010.0000B 7000G 8000
EA10.5                     EA 0010.5000  0000  0000

The code, in Perl, follows:

sub normalizeLC {
  my $lc = uc(shift);
  # Bail out on a failed match; otherwise $1..$7 would still hold
  # the captures from an earlier successful match.
  return unless $lc =~ /^
          \s*
          ([A-Z]{1,3})  # alpha (one to three class letters)
          \s*
          (         # optional numbers
            \d+
            (?: \s*\.\s*\d+)?  # ...with optional decimal point
          )?
          \s*
          (?:               # optional first cutter
            \.? \s*
            ([A-Z]+)      # cutter letter
            \s*
            (\d+)?        # cutter numbers
          )?
          (?:               # optional second cutter
            \.? \s*
            ([A-Z]+)      # cutter letter
            \s*
            (\d+)?        # cutter numbers
          )?
          \s*
          (.*?)            # everything else
          \s*$
        /x;
  # Optional groups that did not match come back undef; treat them as empty.
  my ($alpha, $num, $c1alpha, $c1num, $c2alpha, $c2num, $extra) =
    map { defined $_ ? $_ : '' } ($1, $2, $3, $4, $5, $6, $7);
  $num =~ s/\s+//g;        # "184 . 5" becomes "184.5" so it parses as a number
  $num = 0 if $num eq '';  # no class number at all
  $c1num .= 0 x (4 - length($c1num)); # Pad out to four decimal places
  $c2num .= 0 x (4 - length($c2num)); # ditto
  $extra = ' ' . $extra if length($extra);
  return sprintf("%-3s%09.4f%-2s%4s%-2s%4s%s", $alpha, $num, $c1alpha, $c1num, $c2alpha, $c2num, $extra);
}
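
Here's a rough sketch of what the normalized strings buy you: sorting into shelf order and checking whether a call number falls between two endpoints, both with plain string comparison. To keep the snippet runnable on its own, the keys are written out by hand as the function returns them (with the three-character class field); in real use each one would come from normalizeLC(). The %key_for hash and the in_range helper are names made up for this example.

```perl
use strict;
use warnings;

# Raw call numbers mapped to their normalized keys. Written out by hand
# here (an assumption for the demo); normally each value would be the
# return value of normalizeLC() for that call number.
my %key_for = (
    'E 184 .A1 G78'    => 'E  0184.0000A 1000G 7800',
    'E184.A2 G78 1970' => 'E  0184.0000A 2000G 7800 1970',
    'E184.A2 G78 1967' => 'E  0184.0000A 2000G 7800 1967',
    'EA 10'            => 'EA 0010.0000  0000  0000',
    'E 99 .B7'         => 'E  0099.0000B 7000  0000',
);

# Comparing the keys as plain strings puts the call numbers in shelf order.
# Sorting the raw strings instead would put "E 99" after "E 184".
my @sorted = sort { $key_for{$a} cmp $key_for{$b} } keys %key_for;
print "$_\n" for @sorted;

# Hypothetical helper: true if a normalized key lies within an inclusive
# range whose endpoints are normalized keys too.
sub in_range {
    my ($key, $start, $end) = @_;
    return ($start le $key) && ($key le $end);
}

my $start = 'E  0184.0000  0000  0000';   # normalized "E 184"
my $end   = 'E  0185.0000  0000  0000';   # normalized "E 185"
for my $raw (@sorted) {
    my $where = in_range($key_for{$raw}, $start, $end) ? 'inside' : 'outside';
    print "$raw is $where E 184 to E 185\n";
}
```

For a long list you'd want to normalize each call number once and sort on the stored key rather than calling the function inside the sort block.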

3 Comments

  1. Emily Lynema
    Posted November 14, 2008 at 6:10 pm

    The first alphabetical characters can be 3 letters, not just 2. For example, KJA147 .M685 2007 (see http://www2.lib.ncsu.edu/catalog/record/NCSU2041714).

  2. Posted November 19, 2008 at 10:58 pm

    What are you working on? Is this a part of the “foundation for building an effective means to virtually browse the library’s collection”?

    I adapted it for JavaScript: http://www-personal.umich.edu/~dfulmer/api/lc.html

  3. Bill
    Posted November 20, 2008 at 12:11 am

    Emily — fixed. Thanks.

    David — I need normalized call numbers to check for inclusion in High Level Browse categories, which have call numbers as start- and end-points. Virtual browsing is another possible application, though — I’ll have to stick normalized call numbers into our VUfind installation.
