Tag: TALITAS

Wanted: a better proxy server

October 2, 2008 at 12:01 pm | Category: Uncategorized

We in the library world have a problem. We spend a zillion-with-a-Z dollars subscribing to online databases, purchases which presume our ability to make sure only authorized people can look at them. The alternative is to be in breach of contract law, which I’ve been assured is something we’d like to avoid.

The problem I see is this: The limitations of our proxy server software restrict how we can write contracts with our vendors.

The standard approach is to define two types of access:

  1. By IP address. The person is sitting in front of the right computer (or has hooked up to the right wireless network) and is assumed to be “OK” based on either the location of the computer (e.g., in the library building) or the authN/authZ built into the computer’s login procedure. We tell our vendors, “Hey,” (all vendor-library conversations start with ‘Hey’) “here’s a list of IP addresses that you should allow and associate with us.”
  2. By authenticating with a central mechanism and then sending everything through a rewriting proxy server, thus allowing us to tell the vendor, “Hey. Anything coming through our proxy server is OK. Honest.”
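As a rough illustration of those two access types, here is a minimal Python sketch. It is not how EZproxy or any vendor actually implements this; the network ranges (reserved documentation addresses), the `access_path` function, and the `logged_in` flag are all made up for the example.

```python
import ipaddress

# Hypothetical allowlist of the kind a library might send a vendor.
# These are reserved documentation ranges, not real campus networks.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/25"),
]


def access_path(client_ip: str, logged_in: bool) -> str:
    """Decide how (or whether) a request reaches the vendor."""
    addr = ipaddress.ip_address(client_ip)
    # Type 1: the machine itself is on an approved network.
    if any(addr in net for net in ALLOWED_NETWORKS):
        return "direct"
    # Type 2: the person authenticated centrally, so the request is
    # rewritten to come from the proxy server's (approved) address.
    if logged_in:
        return "via-proxy"
    return "denied"
```

Note that both branches answer the same single yes/no question; nothing in this model knows who the person is or what roles they hold.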

The venerable EZproxy (now owned by OCLC) has been the solution of choice for libraries for a long time. It does what it does very well.

But I want more. Much more. More more more.

The current model assumes there’s exactly one question: Is this person authorized as a UM-Ann Arbor user?

But that’s a pretty crude question. Suppose the Business or Law school wants to buy access to stuff for only their students (news flash: they already do)? Or we want to subscribe to a journal but, because it’s so esoteric, restrict access to a couple of departments to save money. Or recognize when an Ann Arbor faculty member is sitting at a public computer on a different campus but still allow her to get full rights as an Ann Arbor faculty member instead of appearing to be Joe-Random-Dearborn student, a group which has significantly less access to online journals.

Why can’t people with roles on multiple campuses get the best of all worlds, getting the least restrictive access possible to a given title based on all their student/staff/faculty affiliations?
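To make the “least restrictive access” idea concrete, here is a small Python sketch of the decision I wish a proxy server could make. Everything in it is hypothetical: the `Role` type, the `LICENSES` table, the resource names, and the numeric access levels (higher means more access) are invented for illustration and don’t correspond to any real product or contract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Role:
    campus: str
    status: str  # e.g. "student", "staff", "faculty"


# Hypothetical license table: resource -> {role: access level}.
# Higher numbers mean less restrictive access.
LICENSES = {
    "law-db": {Role("Ann Arbor", "faculty"): 2, Role("Flint", "student"): 1},
    "business-db": {Role("Flint", "student"): 2},
}


def best_role(resource, roles):
    """Pick whichever of the user's roles grants the most access."""
    grants = LICENSES.get(resource, {})
    candidates = [(grants[role], role) for role in roles if role in grants]
    if not candidates:
        return None  # none of this person's roles is licensed
    return max(candidates, key=lambda pair: pair[0])[1]
```

The point of the sketch is that the role is chosen per resource at request time, rather than once at login.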

Why can’t we negotiate access to given titles (or even articles???) in lieu of course packets (or online reserves), restricting access to only those enrolled in the class?

Here at UMich, we’re just starting to get an Enterprise Directory online where we’ll actually be able to ask some of these questions. But until we get a proxy server that’s smart enough to do something with all the information, it’ll just sit there and taunt me.

This isn’t an idle question. We already have databases that the Business School subscribes to alone that can only be accessed when you’re physically in the B-School at one of the approved-IP-address computers. That’s freakin’ ridiculous.

Of course, this all presumes that all-or-nothing contracts aren’t the best way to go, but shouldn’t we at least have the option?

3 Responses to “Wanted: a better proxy server”

  1. Doug Eriksen says:

    I’m not the final word in EZproxy experts, but I don’t think anything on your list is outside the capabilities of EZproxy. Allowing different groups of users different levels of access is definitely possible, and routing users to their databases through your own EZproxy even when they are on another campus is simply a matter of instructing them to use your proxied links instead of going straight to the database or following a link from the other institution. I don’t deny there are still improvements that could be made to EZproxy, but I think it might be both the best proxy solution out there, and the best-at-its-own-job of any piece of software that I have to work with at my library.

  2. Bill says:

    Doug — good information, but my point is that I want to choose what group to identify with a user in realtime. When I’m trying to get to specific database XXX it should treat me as Ann Arbor faculty, but when trying to get to YYY I should be identified as a Flint student because (a) it knows I have both roles, and (b) it knows which role will give me greater access to each individual database.

  3. [...] and I pick up the occasional blog post which opens a window onto this parallel universe (like this one about identifying users by role when authenticating via a proxy server [originally spotted on Planet Code4Lib], which echos debates ‘over here’ about [...]

Psst. We’re not printing cards anymore

[From a series I’m calling, “Things About The Library I Think Are Stoooopid”, part one of about a zillion.]

I’m going to wallow in a little bit of hyperbole here, but only a little.

The problem

Suppose, just for a moment, that you’re a computer programmer working anytime in the last twenty years, and someone wants you to set up a data structure to deal with a timeless issue — how to keep track of who’s on which committees in a library.

If you’re a computer person

Easy enough. First off, what’s a committee?

Committee

  • Committee name (string)
  • Committee inception date (date)
  • Chair (person)
  • Members (set of people)

How about a person?

Person

  • Last name (string)
  • First name (string)
  • Email address (email)

Okeedokee. That looks ok so far, but we’ve got problems.

First off, everyone knows that committee names change. And everyone also knows that last names can change, preferred first names can change, email addresses can change, etc. We need some sort of unique identifier to represent the abstract ideal of a particular committee or a specific individual. Let’s be lazy and just throw in an integer ID that we’ll be careful not to reuse, ever, for any reason.

So, we’ll throw that in, and make sure our references are to these unique IDs, not names or whatnot.

That gives us this.

Committee

  • cID (unique integer)
  • Committee name (string)
  • Committee inception date (date)
  • Chair (pID)
  • Members (set of pIDs)

And a person:

Person

  • pID (unique integer)
  • Last name (string)
  • First name (string)
  • Email address (email)

And the mapping, of course.

Committee-Person Mapping

  • pID (unique integer pointing into the Person table)
  • cID (unique integer pointing into the Committee table)
  • dateTermStarted (date)
  • dateTermEnds (date)

If this seems simple, well, it is. Like I said, the theory is almost forty years old, and common implementations of databases at least twenty. We have well-defined unique keys, special types for dates and email addresses so we can do some sanity checking and order things and so forth, and a very, very simple mapping of people to committees where we keep track of start and end dates just to be complete.
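For the skeptical, here is roughly what those three tables look like as a runnable sketch using Python’s built-in `sqlite3` module. The table and column spellings and the sample rows are my own assumptions. SQLite has no native date or email types, so dates are stored as ISO 8601 text and the email check is only a crude stand-in for real validation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # enforce the references below
conn.executescript("""
CREATE TABLE person (
    pID        INTEGER PRIMARY KEY AUTOINCREMENT,  -- IDs are never reused
    last_name  TEXT NOT NULL,
    first_name TEXT NOT NULL,
    email      TEXT NOT NULL CHECK (email LIKE '%@%')
);
CREATE TABLE committee (
    cID            INTEGER PRIMARY KEY AUTOINCREMENT,
    name           TEXT NOT NULL,
    inception_date TEXT NOT NULL,                  -- ISO 8601 date
    chair          INTEGER REFERENCES person(pID)
);
CREATE TABLE committee_person (
    pID             INTEGER NOT NULL REFERENCES person(pID),
    cID             INTEGER NOT NULL REFERENCES committee(cID),
    dateTermStarted TEXT NOT NULL,
    dateTermEnds    TEXT,
    PRIMARY KEY (pID, cID, dateTermStarted)
);
""")

# Made-up sample data: two different people who share a name.
conn.executemany("INSERT INTO person VALUES (?, ?, ?, ?)", [
    (1, "Smith", "John", "jsmith@example.edu"),
    (2, "Smith", "John", "john.smith@example.edu"),
])
conn.executemany("INSERT INTO committee VALUES (?, ?, ?, ?)", [
    (10, "Web Committee", "2005-01-15", 1),
    (11, "Space Planning", "2007-09-01", 2),
])
conn.executemany("INSERT INTO committee_person VALUES (?, ?, ?, ?)", [
    (1, 10, "2005-01-15", None),
    (1, 11, "2007-09-01", "2009-08-31"),
    (2, 11, "2007-09-01", "2009-08-31"),
])


def committees_for(pid):
    """List the committees a person belongs to, by unique ID."""
    rows = conn.execute(
        "SELECT c.name FROM committee AS c "
        "JOIN committee_person AS m ON m.cID = c.cID "
        "WHERE m.pID = ? ORDER BY c.name",
        (pid,),
    )
    return [name for (name,) in rows]
```

Because every reference goes through `pID` and `cID`, renaming a person is a one-row update, and the two John Smiths never get confused.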

Most importantly, you know what’s not here? There’s nothing about how to print it out, or what format I’m going to store it in. Those are afterthoughts. They don’t matter. Any well-specified data model can be machine-translated into pretty much anything you need.
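A tiny sketch of what I mean by “afterthoughts”: one typed record, two output formats, neither of which the model itself knows anything about. The `Membership` class and both rendering functions are illustrative assumptions, not part of any spec.

```python
import json
from dataclasses import dataclass, asdict
from datetime import date


@dataclass
class Membership:
    pID: int
    cID: int
    dateTermStarted: date
    dateTermEnds: date


def to_json(m: Membership) -> str:
    """Serialize for machines; dates become ISO 8601 strings."""
    record = asdict(m)
    record["dateTermStarted"] = m.dateTermStarted.isoformat()
    record["dateTermEnds"] = m.dateTermEnds.isoformat()
    return json.dumps(record, sort_keys=True)


def to_text(m: Membership) -> str:
    """Render for humans; the wording lives here, not in the data."""
    return (
        f"Person {m.pID} serves on committee {m.cID} "
        f"from {m.dateTermStarted.isoformat()} to {m.dateTermEnds.isoformat()}."
    )
```

Adding a third format (XML, CSV, a printed card) means adding another small function, not touching the data.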

If you’re writing a library spec

As near as I can tell, the “library” way to write this would be as follows:

Committee

[Let “hus” stand for “hopefully unique string created by ridiculously complex algorithm”]

  • Committee name (hus)
  • Committee inception (string masquerading as a date in any of several formats)
  • Chair (hus)
  • Members
    • person1 (hus) $$b email address (string) $$c start date (date-like string) $$d end date (date-like string)
    • person2 (hus) $$b email address (string) $$c start date (date-like string) $$d end date (date-like string)

Ummmmm…strings. Nothing but strings. Short strings, long strings, fat strings, tall strings. Strings with dollar signs. Strings that look like dates. Strings that contain other strings. And, just for luck, a little bit of hierarchy, where “hierarchy” means “two levels.”

If someone’s name changes, well, good luck trying to find all the occurrences and fixing them all (and making sure you don’t get the wrong John Smith). Good luck parsing out all the dates, which rely not on machine syntax checking but on a whole set of data-enterers trying to follow some sort of rule without making any mistakes. And good, good luck getting a list of which committees a specific person belongs to.
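Here is a sketch of the kind of code the string-based approach forces on whoever has to consume the data. The sample field, the `$$`-splitting logic, and the list of guessed date formats are all my own assumptions; the real-world list of formats would be longer and still incomplete.

```python
from datetime import datetime

# Guesses at what data-enterers might have typed. Never exhaustive.
DATE_FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%B %d, %Y"]


def parse_date(text):
    """Try each known format; give up with None if nothing fits."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None


def parse_member(field):
    """Split a '$$'-delimited member string into its pieces."""
    name, *subfields = field.split("$$")
    # First character of each subfield is its code (b, c, d).
    raw = {sub[:1]: sub[1:].strip() for sub in subfields if sub}
    return {
        "name": name.strip(),
        "email": raw.get("b"),
        "start": parse_date(raw.get("c", "")),
        "end": parse_date(raw.get("d", "")),
    }
```

Feed it something like `"Smith, John $$b jsmith@example.edu $$c 2007-09-01 $$d Sept. 2009"` and the start date parses while the end date silently comes back as `None`, because nobody anticipated “Sept. 2009”. And the name is still just a string that may or may not be the John Smith you wanted.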

Why I bring it up

One of the most eye-opening talks I heard at code4lib 2008 was a keynote by Karen Coyle on RDA and its ongoing specification. You can view the slides or watch the presentation if you’d like.

In it, she makes the point that, when push comes to shove, AACR2 and RDA both ended up being tremendously focused on producing text strings.

Whaaaaa??

Was there no one on the RDA committee who had experience with anything even approaching modern data theory?

Of course there was. But the giant weight of history is crushing library data modeling like a skinless grape under a dump truck.

Look, I understand that this is not a simple data modeling problem. I understand that there’s a whole set of issues, including a demand (a specious one, I think) that the cataloged data accurately reflect the actual text on a real, physical object that’s sitting in front of you. I’m not so naive as to think this is an easy task.

But anyone who, in the 21st century, approaches the large-scale creation of data without first and foremost worrying about machine-parsability, consistent data types with machine-checkable syntax (and even some semantics), and one-to-one mappings between unique objects (an author, an editor, a publishing house, a work) and something that uniquely identifies each object in any reification is... well, I don’t know what they’re smoking.

We’re not printing cards anymore, people.

  • If something is only understandable if a human is reading it, it’s not understandable by any modern definition.
  • Punctuation doesn’t belong in the description of an object. Ever. Punctuation is a rendering issue. If you’re using punctuation, or well-formed strings, instead of descriptive attributes, you’re doing it wrong.
  • Just because you know your data doesn’t mean you know how to model it. Get outside help from the smartest people you can find.
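To illustrate the punctuation point, here is a minimal sketch in which the record holds only descriptive attributes and each renderer supplies its own punctuation. The `Publication` fields, the sample values, and both output styles are invented for the example; the card-style layout is only loosely reminiscent of catalog card punctuation and makes no claim to follow any cataloging standard.

```python
from dataclasses import dataclass


@dataclass
class Publication:
    # Descriptive attributes only. No colons, dashes, or trailing periods.
    title: str
    subtitle: str
    place: str
    publisher: str
    year: int


def render_card(p: Publication) -> str:
    """Card-style display; the punctuation is added here, at render time."""
    return f"{p.title} : {p.subtitle}. -- {p.place} : {p.publisher}, {p.year}."


def render_citation(p: Publication) -> str:
    """A different display of the same data, with different punctuation."""
    return f"{p.title}: {p.subtitle} ({p.publisher}, {p.year})"
```

Both outputs come from the same record, and changing how either one looks never requires editing the data.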

Whew! That felt good!

OK. Rant off.

One Response to “Psst. We’re not printing cards anymore”

  1. [...] Psst. We’re not printing cards anymore [...]