Normalizing LoC Call Numbers for sorting

November 13, 2008

Update: The original code was missing a ‘?’, which pushed a single cutter into the second-cutter position. Fixed below.

Crap. Update 2: Initial letters can be three characters long. Regexp and output changed.

LoC Call numbers tend to be a mess, and I’ve been working this morning trying to normalize them for easy string comparison.

The Perl function below takes a call number (tolerating some sloppiness in spacing and punctuation) and returns a fixed-width string that can be compared, as a plain string, with other strings returned by the function. It outputs stuff like this:

E                          E  0000.0000  0000  0000
E 184 .A1 G78              E  0184.0000A 1000G 7800
E184.A2 G78 1967           E  0184.0000A 2000G 7800 1967
E184.A2 G78 1970           E  0184.0000A 2000G 7800 1970
EA                         EA 0000.0000  0000  0000
EA 10                      EA 0010.0000  0000  0000
EA 10 1970                 EA 0010.0000  0000  0000 1970
EA10 B7                    EA 0010.0000B 7000  0000
EA 10.B7.G8                EA 0010.0000B 7000G 8000
EA10.5                     EA 0010.5000  0000  0000
The code, in Perl, follows:

sub normalizeLC {
  my $lc = uc(shift);
  # Return nothing if the input does not start with one to three letters
  return unless $lc =~ /^
          \s*
          ([A-Z]{1,3})  # alpha
          \s*
          (         # optional numbers
            \d+
            (?: \s*\.\s*\d+)?  # ...with optional decimal point
          )?
          \s*
          (?:               # optional first cutter
            \.? \s*
            ([A-Z]+)      # cutter letter
            \s*
            (\d+)?        # cutter numbers
          )?
          (?:               # optional second cutter
            \.? \s*
            ([A-Z]+)      # cutter letter
            \s*
            (\d+)?        # cutter numbers
          )?
          \s*
          (.*?)            # everything else
          \s*$
        /x;
  # Optional groups that did not match come back undef; turn them into ''
  my ($alpha, $num, $c1alpha, $c1num, $c2alpha, $c2num, $extra) =
    map { defined $_ ? $_ : '' } ($1, $2, $3, $4, $5, $6, $7);
  $num =~ s/\s+//g;             # "10 . 5" becomes "10.5"
  $num = 0 unless length $num;  # no class number at all
  for my $cutter ($c1num, $c2num) {
    # Cutter digits are decimals: pad out to four places on the right
    $cutter .= '0' x (4 - length $cutter) if length($cutter) < 4;
  }
  $extra = ' ' . $extra if length $extra;
  return sprintf("%-3s%09.4f%-2s%4s%-2s%4s%s", $alpha, $num, $c1alpha, $c1num, $c2alpha, $c2num, $extra);
}
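
As a rough illustration of how the normalized strings might be used for sorting, here is a minimal sketch. It carries a condensed copy of the normalizer so it runs on its own; the @shelf list is made-up sample data, and the decorate-sort-undecorate pattern is just one reasonable way to do it, not necessarily how it is used in production.

```perl
use strict;
use warnings;

# Condensed copy of the normalizer so this snippet is self-contained.
sub normalizeLC {
  my $lc = uc(shift);
  return unless $lc =~ /^\s*([A-Z]{1,3})\s*(\d+(?:\s*\.\s*\d+)?)?\s*
                        (?:\.?\s*([A-Z]+)\s*(\d+)?)?
                        (?:\.?\s*([A-Z]+)\s*(\d+)?)?
                        \s*(.*?)\s*$/x;
  my ($alpha, $num, $c1a, $c1n, $c2a, $c2n, $extra) =
    map { defined $_ ? $_ : '' } ($1, $2, $3, $4, $5, $6, $7);
  $num =~ s/\s+//g;
  $num = 0 unless length $num;
  for my $n ($c1n, $c2n) {
    $n .= '0' x (4 - length $n) if length($n) < 4;
  }
  $extra = " $extra" if length $extra;
  return sprintf("%-3s%09.4f%-2s%4s%-2s%4s%s",
                 $alpha, $num, $c1a, $c1n, $c2a, $c2n, $extra);
}

# Made-up sample data; every entry here is known to parse.
my @shelf = ('E184.A2 G78 1970', 'EA10.5', 'E 184 .A1 G78',
             'EA 10', 'E184.A2 G78 1967');

# Decorate each call number with its key, sort on the key with plain
# string comparison, then strip the key off again.
my @sorted = map  { $_->[1] }
             sort { $a->[0] cmp $b->[0] }
             map  { [ normalizeLC($_), $_ ] } @shelf;

print "$_\n" for @sorted;
```

Comparing the raw strings would put "E 184" before "E 50", because "1" sorts before "5"; the zero-padded keys compare 0050 against 0184 instead. Cutter digits are padded on the right rather than the left because they are read as decimals, so G78 (7800) files before G8 (8000).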

Tags: callnumbers code

Comments:3

  1. Emily Lynema
    November 14, 2008 at 6:10 pm

    The first alphabetical characters can be 3 letters, not just 2. For example, KJA147 .M685 2007 (see http://www2.lib.ncsu.edu/catalog/record/NCSU2041714).

  2. David Fulmer
    November 19, 2008 at 10:58 pm
  3. Bill
    November 20, 2008 at 12:11 am

    Emily — fixed. Thanks.

    David — I need normalized call numbers to check for inclusion in High Level Browse categories, which have call numbers as start- and end-points. Virtual browsing is another possible application, though — I’ll have to stick normalized call numbers into our VUfind installation.
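
The category check described in the last comment could be sketched roughly as below. The helper name in_category and the category boundaries are hypothetical, invented for illustration; the keys are written out as already-normalized strings in the format the function returns, and both end-points are assumed to be inclusive.

```perl
use strict;
use warnings;

# Hypothetical helper: true if a normalized key falls between a category's
# normalized start- and end-points (both inclusive, by assumption).
sub in_category {
  my ($key, $start, $end) = @_;
  return ($key ge $start && $key le $end) ? 1 : 0;
}

# Made-up category running from "E 184" through "EA 10".
my $start = 'E  0184.0000  0000  0000';
my $end   = 'EA 0010.0000  0000  0000';

my $inside  = 'E  0184.0000A 2000G 7800 1967';  # E184.A2 G78 1967
my $outside = 'EA 0010.5000  0000  0000';       # EA10.5

print in_category($inside,  $start, $end) ? "in\n" : "out\n";
print in_category($outside, $start, $end) ? "in\n" : "out\n";
```

Note that an end-point is compared as a complete key, so an end-point of "EA 10" does not cover "EA10 B7"; whether that matches the intended meaning of a category boundary is a judgment call.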
