Normalizing LoC Call Numbers for sorting

Updated: I missed a ‘?’ in the original code that pushed a single cutter into the second-cutter position. Fixed below.

Crap. Update 2: Initial letters can be three characters long. Regexp and output changed.

LoC Call numbers tend to be a mess, and I’ve been working this morning trying to normalize them for easy string comparison.

The Perl function below takes a call number (with some level of sloppiness) and returns a string suitable for comparisons with other strings returned by the function. It outputs stuff like this:

E                          E  0000.0000  0000  0000
E 184 .A1 G78              E  0184.0000A 1000G 7800
E184.A2 G78 1967           E  0184.0000A 2000G 7800 1967
E184.A2 G78 1970           E  0184.0000A 2000G 7800 1970
EA                         EA 0000.0000  0000  0000
EA 10                      EA 0010.0000  0000  0000
EA 10 1970                 EA 0010.0000  0000  0000 1970
EA10 B7                    EA 0010.0000B 7000  0000
EA 10.B7.G8                EA 0010.0000B 7000G 8000
EA10.5                     EA 0010.5000  0000  0000

The code, in Perl, follows:

sub normalizeLC {
  my $lc = uc(shift);
  $lc =~ /^
          \s*
          ([A-Z]{1,3})             # class letters (one to three)
          \s*
          (                        # optional class number
            \d+
            (?: \s*\.\s*\d+)?      # ...with optional decimal part
          )?
          \s*
          (?:                      # optional first cutter
            \.? \s*
            ([A-Z]+)               # cutter letter(s)
            \s*
            (\d+)?                 # cutter digits
          )?
          (?:                      # optional second cutter
            \.? \s*
            ([A-Z]+)               # cutter letter(s)
            \s*
            (\d+)?                 # cutter digits
          )?
          \s*
          (.*?)                    # everything else (e.g., a year)
          \s*$
        /x or return;              # no match: don't reuse stale captures

  my ($alpha, $num, $c1alpha, $c1num, $c2alpha, $c2num, $extra) = ($1, $2, $3, $4, $5, $6, $7);

  # Missing pieces come back undef; give them defaults so nothing warns.
  $num = 0 unless defined $num;
  $num =~ s/\s+//g;                # "10 . 5" becomes "10.5"
  for ($c1alpha, $c1num, $c2alpha, $c2num) { $_ = '' unless defined $_ }

  # Pad cutter digits on the right out to four places (they sort as decimals)
  for ($c1num, $c2num) { $_ .= '0' x (4 - length($_)) if length($_) < 4 }

  $extra = ' ' . $extra if length $extra;
  return sprintf("%-3s%09.4f%-2s%4s%-2s%4s%s", $alpha, $num, $c1alpha, $c1num, $c2alpha, $c2num, $extra);
}
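Here's a minimal sketch of putting the function to work. Because every field in the output is padded to a fixed width, a plain string comparison (cmp) on the normalized keys should sort the call numbers. The block repeats a condensed copy of normalizeLC so it runs on its own; sort_call_numbers and the sample shelf are made up for illustration.

```perl
use strict;
use warnings;

# Condensed copy of normalizeLC so this sketch is self-contained.
sub normalizeLC {
    my $lc = uc(shift);
    $lc =~ /^\s*([A-Z]{1,3})\s*(\d+(?:\s*\.\s*\d+)?)?\s*
            (?:\.?\s*([A-Z]+)\s*(\d+)?)?
            (?:\.?\s*([A-Z]+)\s*(\d+)?)?
            \s*(.*?)\s*$/x or return;
    my ($alpha, $num, $c1a, $c1n, $c2a, $c2n, $extra) = ($1, $2, $3, $4, $5, $6, $7);
    $num = 0 unless defined $num;
    $num =~ s/\s+//g;
    for ($c1a, $c1n, $c2a, $c2n) { $_ = '' unless defined $_ }
    for ($c1n, $c2n) { $_ .= '0' x (4 - length($_)) if length($_) < 4 }
    $extra = " $extra" if length $extra;
    return sprintf("%-3s%09.4f%-2s%4s%-2s%4s%s", $alpha, $num, $c1a, $c1n, $c2a, $c2n, $extra);
}

# Hypothetical helper: compute each key once, sort on the keys, and hand
# back the original call numbers (a Schwartzian transform).
sub sort_call_numbers {
    return map  { $_->[1] }
           sort { $a->[0] cmp $b->[0] }
           map  { my $key = normalizeLC($_); [ defined $key ? $key : '', $_ ] } @_;
}

my @shelf = ('EA10.5', 'E184.A2 G78 1970', 'EA 10', 'E 184 .A1 G78', 'E184.A2 G78 1967');
print "$_\n" for sort_call_numbers(@shelf);
```

Anything the regexp can't make sense of gets an empty key here and so sorts to the front; whether that's the right place for junk is a judgment call.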

How to rig an election

No matter where I’ve gone today and for the past few days, I keep running into people (on both sides) who are sure that if Their Guy Doesn’t Win, it’s going to be because of dirty tactics.

I’m not an expert in this stuff. Not by a long shot. But I thought it would be fun to work out, for my own benefit, types of election fraud and what to really worry about.

Note that how you might interpret all of this really depends on what you consider the greater evil: a vote cast that shouldn’t have been, or a vote suppressed that shouldn’t have been. I lean toward the latter.

[More specific disclaimer: I'm a bed-wetting liberal.]

In each case I’ll define what I’m talking about, what class it goes into, how hard it is to do once, and ratio of people-in-the-conspiracy to votes affected.

As examples, voter non-registration (telling someone you’re registering them and not doing it) is easy to do (Difficulty: easy), and one dedicated person can screw over a few tens of others (Ratio: medium). Re-programming a voting machine is very hard, but has the potential to mess with hundreds and hundreds of votes (hence the high ratio).

Obviously, all this below is (a) pulled out of my ass, and (b) depends on the size of the electorate. A local race where only 300 people will be voting can be turned by any method at all. I’m looking mostly at national races, where lots of people vote so a few fraudulent votes aren’t likely to be problematic.

Voter Non-Registration

  • Definition: Tell people you’re registering them to vote, but don’t
  • Class: voter suppression
  • Difficulty: easy
  • Ratio: medium
  • Effect: small-medium
  • Notes: To do this up right, you need lots of people out there registering folks and then throwing the registrations away, or a centralized system where one or two people can collect the data and then throw it away. The former involves a pretty big conspiracy; the latter leaves a lot of people who could testify that they turned in registrations that disappeared. Most voters are registered and stay that way. Voter non-registration as a suppression tactic is most useful against those who traditionally don’t vote, or those who have never voted or recently moved (e.g., college kids). Would seem to favor Republicans.

Fraudulent registration of voters

  • Definition: Register people who don’t exist, are dead, or aren’t planning on voting themselves.
  • Class: voting fraud
  • Difficulty: ridiculously easy
  • Ratio: medium-ish
  • Effect: zero — nothing happens until a fraudulent vote is entered
  • Notes: This is what the ACORN dustup is about. Not speaking one way or the other about ACORN, organizations that register voters have a tough time in that (a) they’re required by law to pass along all registrations (see above) even if they know they’re fakes, and (b) organizations willing to pay people to register voters tend to be most interested in finding individuals traditionally underserved in that area: the homeless, the poor, the illiterate, etc. That means that the data you’re going to get tends to be less than great. Note the zero effect: nothing to screw with the election happens until someone actually casts a fraudulent vote. Which leads us to…

Fraudulent voting

  • Definition: Casting a vote you shouldn’t be allowed to cast, usually while pretending to be someone else.
  • Class: voting fraud
  • Difficulty: pretty hard (note: requires voter registration fraud)
  • Ratio: very small
  • Effect: pretty small
  • Notes: While some areas of the country are famous for voting fraud (I’m looking at you, Chicago), actually walking into a place and voting as someone else takes some serious balls. And with long lines expected this year, any one person, no matter how dedicated, isn’t going to be able to vote that often. A different version of this, filling out people’s absentee voting slips “for them,” has been around forever; this is a tactic I tend to associate with “machine” politics that tend to favor Democrats.

After Hour Ballot stuffing

  • Definition: Placing votes “after hours” as done by an individual with no (human or technical) oversight, or by a conspiracy of people who are supposed to be overseeing each other.
  • Class: voting fraud
  • Difficulty: medium
  • Ratio: large
  • Effect: medium-large
  • Notes: Again, first this requires some sort of voter registration fraud if you’re going to do it in any serious numbers. Then you need ridiculously lax oversight of the balloting / counting process, which is not that hard to find, unfortunately.

“Losing” votes

  • Definition: Have ballot boxes take a detour to your basement or the dump
  • Class: voting fraud
  • Difficulty: pretty hard
  • Ratio: very large
  • Effect: very large
  • Notes: Ah, a classic. As voters, we tend to group geographically — Ann Arbor, for example, is almost devoid of Republicans. So, you let everyone vote, and then “disappear” the ballot boxes from Ann Arbor, while doing your best to make sure the Democrats don’t do the same thing in predominantly Republican areas. This is only slightly easier than after hour ballot stuffing, but still hard. Payoff is huge, though.

Voter misdirection

  • Definition: Give people bad information about when/how to vote
  • Class: voter suppression
  • Difficulty: easy
  • Ratio: large
  • Effect: Depends on how good you are, doesn’t it?
  • Notes: We’re finally heading into the gray areas — things that, depending on how you do them, likely aren’t actually illegal. That makes them easier to do, because you don’t have to worry about members of your conspiracy squealing. We’re seeing a lot of this already this election, most notably in robo-calls telling people that they should vote on Wednesday, or that their polling place has changed, or whatnot. Mostly anonymous, very difficult to trace, and can be pretty effective if your database of friendly/unfriendly voters is good.

Voter de-registration

  • Definition: Sue to get whole classes of people removed from the rolls
  • Class: voter suppression
  • Difficulty: pretty hard
  • Ratio: Gigundous
  • Effect: Very large
  • Notes: This has been all over the news, and for good reason. Anything that causes someone to have to cast a provisional ballot makes it a pain in the ass for that ballot to get counted. For lots of folks (esp. “working class” people who punch a clock), taking a couple hours off to go down to the courthouse and prove you’re who you say you are is a non-starter. This is why everyone wants their people to vote early if they can — it avoids anything that might screw with the voting process, like this or challenges.

Profiled voter challenges

  • Definition: Challenge the votes of people that don’t look like you
  • Class: voter suppression
  • Difficulty: pretty easy
  • Ratio: large
  • Effect: depends on how good the poll workers are and the state laws
  • Notes: This is easy. You post someone at a polling station, and challenge anyone who looks like “the other guys” (because Black and Hispanic voters tend to go Democrat; identifying Republicans on sight might be harder). Some states make this hard; others allow anyone at all to challenge anyone else and force them to use a provisional ballot.

Break voting

  • Definition: Make voting so difficult or slow that people give up and go home
  • Class: voter suppression
  • Difficulty: varies
  • Ratio: large
  • Effect: large
  • Notes: If I’m at a place with three mechanical voting machines and I stick a bunch of gum in one of them, rendering it useless, I’ve just made it a hell of a lot harder for people to vote. Less extreme examples would be challenging everyone who walks through the door, or having a poll worker who takes for freakin’ ever (on purpose, I mean; nothing in general against our dedicated poll workers– whose average age is 72, I heard).

Cause bad weather

I’m not sure how to go about this, but the little men in my head say flooding is a great way to keep people from the polls.

Wanted: a better proxy server

We in the library world have a problem. We spend a zillion-with-a-Z dollars subscribing to online databases, purchases which presume our ability to make sure only authorized people can look at them. The alternative is to be in breach of contract law, which I’ve been assured is something we’d like to avoid.

The problem I see is this: The limitations of our proxy server software restrict how we can write contracts with our vendors.

The standard approach is to define two types of access:

  1. By IP address. The person is sitting in front of the right computer (or has hooked up to the right wireless network) and is assumed to be “OK” based on either the location of the computer (e.g., in the library building) or through the nature of the auth/authZ built into the computer’s login procedure. We tell our vendors, “Hey,” (all vendor-library conversations start with ‘Hey’) “here’s a list of IP addresses that you should allow and associate with us.”
  2. By authenticating with a central mechanism and then sending everything through a rewriting proxy server, thus allowing us to tell the vendor, “Hey. Anything coming through our proxy server is OK. Honest.”

The venerable EZProxy (now owned by OCLC) has been the solution of choice for libraries for a long time. It does what it does very well.

But I want more. Much more. More more more.

The current model assumes there’s exactly one question: Is this person authorized as a UM-Ann Arbor user?

But that’s a pretty crude question. Suppose the Business or Law school wants to buy access to stuff for only their students (news flash: they already do)? Or we want to subscribe to a journal but, because it’s so esoteric, restrict access to a couple departments to save money. Or recognize when an Ann Arbor faculty member is sitting at a public computer on a different campus but still allow her to get full rights as an Ann Arbor faculty member instead of appearing to be Joe-Random-Dearborn student, a group which has significantly less access to online journals.

Why can’t people with roles on multiple campuses get the best of all worlds, getting the least restrictive access possible to a given title based on all their student/staff/faculty affiliations?
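One way to picture “least restrictive access across all affiliations” is as a set lookup: each resource lists the groups licensed to use it, and a person gets in if any one of their affiliations is on that list. The sketch below is purely illustrative. The group names, resource names, and the can_access function are all hypothetical, and none of this corresponds to anything EZProxy or a campus directory actually provides.

```perl
use strict;
use warnings;

# Hypothetical licensing table: resource => groups allowed to use it.
my %licensed_groups = (
    'general-journals'  => [ 'aa-student', 'aa-faculty', 'dearborn-student' ],
    'business-database' => [ 'aa-bschool' ],
    'esoteric-journal'  => [ 'aa-classics', 'aa-neareast' ],
);

# Returns 1 if ANY of the person's affiliations is licensed for the
# resource, 0 otherwise (including for resources we know nothing about).
sub can_access {
    my ($resource, @affiliations) = @_;
    my %allowed = map { ($_ => 1) } @{ $licensed_groups{$resource} || [] };
    my $hits = grep { $allowed{$_} } @affiliations;
    return $hits ? 1 : 0;
}

# A business school student who also holds a staff job on another campus
# keeps the business-database access no matter where they log in.
print can_access('business-database', 'dearborn-staff', 'aa-bschool'), "\n";  # 1
print can_access('business-database', 'dearborn-staff'), "\n";                # 0
```

The point of the sketch is that the decision depends on who the person is, not on which IP address they happen to be sitting behind.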

Why can’t we negotiate access to given titles (or even articles???) in lieu of course packets (or online reserves), restricting access to only those enrolled in the class?

Here at UMich, we’re just starting to get an Enterprise Directory online where we’ll actually be able to ask some of these questions. But until we get a proxy server that’s smart enough to do something with all the information, it’ll just sit there and taunt me.

This isn’t an idle question. We already have databases that the Business School subscribes to alone that can only be accessed when you’re physically in the B-School at one of the approved-IP-address computers. That’s freakin’ ridiculous.

Of course, this all presumes that all-or-nothing contracts aren’t the best way to go, but shouldn’t we at least have the option?

Planet Code4Lib in a snapshot

Inspired by the Inquiring Librarian, I just used Wordle to create a “tagcloud” of the current Planet Code4Lib feed.

What kills me is the tiny little “Library” in the lower left-hand corner.

Intuition-based librarianship?

Not long after I started working in the library, I heard someone talking about “Evidence Based Librarianship.” Like the good little kind-of-a-librarian I’d become, I looked it up and found this article which states that:

EBL employs the best available evidence based upon library science research to arrive at sound decisions about solving practical problems in librarianship.

My immediate response was, of course, What the $#!&% is everyone else doing?

The sad truth, of course, is that in general folks working in libraries do not use the “best evidence” based on “library science research” because, like many of the practitioners I met when I was in the education world, they (a) don’t know most of the research and data, and (b) are convinced that their users are so magical, so special, so utterly unique, that there’s no point in looking to the research and are better off just going with their guts.

That’s an over-simplification, of course. But I have found, across a bunch of situations, that practicing librarians tend to think:

  1. their time is much better spent directly helping patrons than reading about research regarding how to help patrons,
  2. “data” (defined incredibly loosely) derived from reference desk interviews are sufficient to make decisions
  3. “I know my patrons better than anyone”

The logical conclusions of this are that:

  1. Most library research is essentially being thrown down a dark hole because the people that could most benefit from it don’t read it
  2. We’re assuming that the 99.999% of users who never talk to a librarian (many of whom, in fact, never enter the library building) have the exact same needs and perspective as those who engage in reference interviews
  3. Librarians, as a group, confuse casual and/or episodic interaction with self-selected patrons with actual social-science research.

And the over-simplified solutions:

Make reading a job requirement, for real! Make librarians responsible for keeping up with the literature, where “responsible” means “prove to your direct manager that you spent two hours reading and writing this week.”

Librarians as a group, I think, want to use the research. But not so much that they’re willing to let Curmudgeonly Old Faculty Member #2 hang tight for a few hours while they brush up.

Use the data you already have! Your systems — your ILS, any reference desk software, your proxy server, your web server — all collect data. Warehouse the data. Mine the data. Provide both colorful graphical interfaces and ugly powerful analysis functions for the data. Figure out how to do something with the freakin’ data!

Most (all?) libraries have gobs of data that are pulled out once a year for ACRL statistics. And even if they’re looked at by someone, they’re certainly not easily available to everyone.

Push access to the data and associated visualization tools as far down the stack as you can. At least people will know what kinds of questions can be answered.

Don’t pretend to do research — do real research! Do real social science research — something that certainly doesn’t have a front-seat in library schools as near as I can tell. Find some MS students in Sociology or Anthropology who are looking for a project and ask them to find something out, with real honest-to-god case study methodology, text analysis, data analysis — the whole nine yards. Better yet — hire someone to do it, and for god’s sake don’t put down that they must have an MLS.

Times are tight all around, of course — no one has enough time, enough money, enough resources. But that’s exactly why now is the time to focus on existing research (it’s free — someone already did it) and data (it’s free — your systems are already collecting it) — to find out what’s being used, what’s being ignored, how to market your under-utilized resources and which populations need some outreach.

Going with your gut might seem to work, but maybe that’s only because you’re not actually using any solid criteria to evaluate what you’re doing now.

The friend of my enemy’s friend’s enemy’s…err…

Move over, Axis of Evil! Our 43rd president, George W. Bush (and you gotta know that his dad hangs on to that ‘H.’ with two white-knuckled hands) is now in search of “the surest way to defeat the enemies of hatred.” Of course, we’re the best of friends with hatred here at Robot Librarian, so we should be safe.

Google Doctype — open documentation, open code

Because you can never have too many open encyclopedia-type-thingies, Google has launched Google Doctype, a “Google-sponsored open encyclopedia and reference library for developers of web applications. By web developers, for web developers.” It’s set up to use an open license (Creative Commons Attribution 3.0 Unported License) and, unlike other similar resources, is explicitly set up to include code for testing and browser-compatibility tables generated by running that code against different browsers. Simple, direct… what’s not to like?

JSON, JSON everywhere

Via Ajaxian, just saw an announcement for Persevere, a network-centric, JSON-based generic storage engine. It features:

  • A REST-based interface over regular old HTTP
  • JSON as the native data going in and out, including circular references and such
  • Search interface based around JSONPath
  • RPC interface based on JSON-RPC
  • Seemingly buzzword compliant across the board

I’ve been thinking about these sorts of servers a lot lately (CouchDB and StrokeDB are two others) in the context of the “not-the-catalog” data we track here at the library.

For some stuff, clearly we need the power and speed of a real database. That power and speed isn’t free, though — you have to set up the tables, map relationships, build an interface on top of it, etc. While it’s not rocket science by any stretch of the imagination, it’s a lot of screwing around and involves a few levels of security and has a friendly red sign on the door that reads “Programmers only, please.”

For other data, though, a structured or semi-structured data store based on a plain text format like JSON would be great. Since everything is a URL, we can handle security at the HTTP-auth/authz level. Library hours, lists of databases we subscribe to, staff directory data — these are data that could, if we wanted, be moved into a generic store like this.
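To make that a bit more concrete, here's a minimal sketch of library hours as a JSON document, read with Perl's core JSON::PP module. The branch name, field names, and the hours_for helper are invented for illustration. A store like Persevere would presumably hand a document like this back over HTTP, which the sketch doesn't attempt to show.

```perl
use strict;
use warnings;
use JSON::PP;   # ships with Perl; exports decode_json and encode_json

# Hypothetical document, as it might come back from a JSON store.
my $json_text = <<'END_JSON';
{
  "branch": "Main Library",
  "hours": {
    "monday":   { "open": "08:00", "close": "22:00" },
    "saturday": { "open": "10:00", "close": "18:00" }
  }
}
END_JSON

my $doc = decode_json($json_text);

# Look up one day's hours; days missing from the document count as closed.
sub hours_for {
    my ($doc, $day) = @_;
    my $entry = $doc->{hours}{ lc $day } or return 'closed';
    return "$entry->{open}-$entry->{close}";
}

print "Monday: ", hours_for($doc, 'Monday'), "\n";
print "Sunday: ", hours_for($doc, 'Sunday'), "\n";
```

No tables, no schema, no red sign on the door: whoever maintains the hours edits one small document.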

The exciting stuff comes when you stop thinking about traditional database applications and think more in terms of having a data storage endpoint that pretty much anyone with a modicum of knowledge and authorization could throw stuff into. Want to build your own tagging system? Your own “My Shelf?” How about a comment form that straddles the edge between “email me the results” and “ask someone to hook me up to a database”? Or a javascript library that automatically takes survey submissions and sticks them into a system like this?

This is the flip-side of my last post. We’re not talking about hard-core, multiply-linked, core-business metadata. For that, we need ridiculously smart people figuring out how to best leverage the, say, 8 million MARC records we’ve got lying around. But for other stuff…this seems really, really cool.

Psst. We’re not printing cards anymore

[From a series I’m calling, “Things About The Library I Think Are Stoooopid”, part one of about a zillion.]

I’m going to wallow in a little bit of hyperbole here, but only a little.

The problem

Suppose, just for a moment, that you’re a computer programmer working anytime in the last twenty years, and someone wants you to set up a data structure to deal with a timeless issue — how to keep track of who’s on which committees in a library.

If you’re a computer person

Easy enough. First off, what’s a committee?

Committee

  • Committee name (string)
  • Committee inception date (date)
  • Chair (person)
  • Members (set of people)

How about a person?

Person

  • Last name (string)
  • First name (string)
  • Email address (email)

Okeedokee. That looks ok so far, but we’ve got problems.

First off, everyone knows that committee names change. And everyone also knows that last names can change, preferred first names can change, email addresses change, etc. We need some sort of unique identifier to represent the abstract ideal of a particular committee or a specific individual. Let’s be lazy and just throw in an integer ID that we’ll be careful not to reuse, ever, for any reason.

So, we’ll throw that in, and make sure our references are to these unique IDs, not names or whatnot.

That gives us this.

Committee

  • cID (unique integer)
  • Committee name (string)
  • Committee inception date (date)
  • Chair (pID)
  • Members (set of pIDs)

How about a person?

Person

  • pID (unique integer)
  • Last name (string)
  • First name (string)
  • Email address (email)

And the mapping, of course.

Committee-Person Mapping

  • pID (unique integer pointing into the Person table)
  • cID (unique integer pointing into the Committee table)
  • dateTermStarted (date)
  • dateTermEnds (date)

If this seems simple, well, it is. Like I said, the theory is almost forty years old, and common implementations of databases at least twenty. We have well-defined unique keys, special types for dates and email addresses so we can do some sanity checking and order things and so forth, and a very, very simple mapping of people to committees where we keep track of start and end dates just to be complete.
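As a small sketch of why the ID-based model pays off, here are the three tables as plain Perl data structures, with the “which committees is this person on?” question answered by walking the mapping. The names, IDs, and dates are invented, and committees_for is a hypothetical helper; a real system would keep all this in a database.

```perl
use strict;
use warnings;

# Person table, keyed by pID. Two different John Smiths stay distinct.
my %person = (
    1 => { last => 'Smith', first => 'John', email => 'jsmith@example.edu' },
    2 => { last => 'Smith', first => 'John', email => 'john.smith@example.edu' },
);

# Committee table, keyed by cID. The chair is a pID, not a name.
my %committee = (
    10 => { name => 'Web Advisory',   inception => '2006-09-01', chair => 1 },
    11 => { name => 'Space Planning', inception => '2007-01-15', chair => 2 },
);

# Committee-Person mapping: one row per term served.
my @membership = (
    { pID => 1, cID => 10, dateTermStarted => '2006-09-01', dateTermEnds => '2008-08-31' },
    { pID => 1, cID => 11, dateTermStarted => '2007-01-15', dateTermEnds => '2009-01-14' },
    { pID => 2, cID => 11, dateTermStarted => '2007-01-15', dateTermEnds => '2009-01-14' },
);

# Names of every committee a given person (by pID) has a row for.
sub committees_for {
    my ($pid) = @_;
    my @names = map  { my $c = $committee{ $_->{cID} }; $c->{name} }
                grep { $_->{pID} == $pid } @membership;
    return sort @names;
}

# A name change is one edit in one place; the memberships don't care.
$person{1}{last} = 'Jones';
print "$person{1}{first} $person{1}{last}: ", join(', ', committees_for(1)), "\n";
```

Nothing in the mapping mentions a name, so the rename can't miss an occurrence or hit the wrong John Smith.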

Most importantly, you know what’s not here? There’s nothing about how to print it out, or what format I’m going to store it in. Those are afterthoughts. They don’t matter. Any well-specified data model can be machine-translated into pretty much anything you need.

If you’re writing a library spec

As near as I can tell, the “library” way to write this would be as follows:

Committee

[Let “hus” stand for “hopefully unique string created by ridiculously complex algorithm”]

  • Committee name (hus)
  • Committee inception (string masquerading as a date in any of several formats)
  • Chair (hus)
  • Members
    • person1 (hus) $$b email address (string) $$c start date (date-like string) $$d end date (date-like string)
    • person2 (hus) $$b email address (string) $$c start date (date-like string) $$d end date (date-like string)

Ummmmm…strings. Nothing but strings. Short strings, long strings, fat strings, tall strings. Strings with dollar signs. Strings that look like dates. Strings that contain other strings. And, just for luck, a little bit of hierarchy, where “hierarchy” means “two levels.”

If someone’s name changes, well, good luck trying to find all the occurrences and fixing them all (and making sure you don’t get the wrong John Smith). Good luck parsing out all the dates, which rely not on machine syntax checking but on a whole set of data-enterers trying to follow some sort of rule without making any mistakes. And good, good luck getting a list of which committees a specific person belongs to.
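For contrast, here's roughly what a program has to do with the string-based version. This is a sketch under assumptions: the sample line is invented, parse_member is a hypothetical helper, and calling the leading name “a” is my own labeling for the part before the first $$ marker.

```perl
use strict;
use warnings;

# Split one "Members" line on its $$x markers. Everything that comes back
# is just a string; nothing checks that a "date" is really a date.
sub parse_member {
    my ($line) = @_;
    my ($name, @rest) = split /\s*\$\$([a-z])\s*/, $line;
    my %field = (a => $name, @rest);   # assumes every marker has a value
    return \%field;
}

my $member = parse_member('Smith, John $$b jsmith@example.edu $$c 9/1/2006 $$d Aug. 2008');
print "$_ => $member->{$_}\n" for sort keys %$member;
```

The split itself is easy enough. The trouble is what comes out: two “dates” in two different formats, and a name that may or may not match the same person's name on some other committee.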

Why I bring it up

One of the most eye-opening talks I heard at code4lib 2008 was a keynote by Karen Coyle on RDA and its ongoing specification. You can view the slides or watch the presentation if you’d like.

In it, she makes the point that, when push comes to shove, AACR2 and RDA both ended up being tremendously focused on producing text strings.

Whaaaaa??

Was there no one on the RDA committee that had experience with anything even approaching modern data theory?

Of course there was. But the giant weight of history is crushing library data modeling like a skinless grape under a dump truck.

Look, I understand that this is not a simple data modeling problem. I understand that there’s a whole set of issues, including a demand (a specious one, I think) that the cataloged data accurately reflect the actual text in a real, physical object that’s sitting in front of you. I’m not so naive as to think this is an easy task.

But anyone who, in the 21st century, approaches the large-scale creation of data without first and foremost worrying about machine-parsability, consistent data types with machine-checkable syntax (and even some semantics) and one-to-one mappings between unique objects (an author, an editor, a publishing house, a work) and something that uniquely identifies that object in any reification is….well, I don’t know what they’re smoking.

We’re not printing cards anymore, people.

  • If something is only understandable if a human is reading it, it’s not understandable by any modern definition.
  • Punctuation doesn’t belong in the description of an object. Ever. Punctuation is a rendering issue. If you’re using punctuation, or well-formed strings, instead of descriptive attributes, you’re doing it wrong.
  • Just because you know your data doesn’t mean you know how to model it. Get outside help from the smartest people you can find.

Whew! That felt good!

OK. Rant off.

UPenn library has video “commercials”

The University of Pennsylvania Library has a set of video commercials touting their products — some of which are musicals! Worth a look-see.