Sending MARC(ish) data to Refworks

Refworks has some okish documentation about how to deal with its callback import procedure, but I thought I’d put down how I’m doing it for our vufind install (mirlyn2-beta.lib.umich.edu) in case other folks are interested.

The basic procedure is:

  • Send your user to a specific refworks URL along with a callback URL that can enumerate the record(s) you want to import in a supported form
  • Your user logs in (if need be) and gets to her RefWorks page
  • RefWorks calls up your system and requests the record(s)
  • The import happens, and your user does whatever she wants to do with the records

Of course, there are lots of issues with doing this well (quick! Is this MARC record for a book? An edited book? Is it a journal, or a serial of some other sort? Who’s the actual author/editor?), but doing it at all isn’t so bad.

The URL to send them to

This is the “Export this record” URL on my system:

http://www.refworks.com.proxy.lib.umich.edu/express/expressimport.asp?
vendor=[your system]&
filter=MARC+Format&
database=All+MARC+Formats&
encoding=65001&
url=[your callback URL]

Note that the vendor variable should be a unique string (made up by you) for your system, not a larger entity (like the whole library or the institution).
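Hand-assembling that query string is an easy place to get the escaping wrong, so here's a minimal Python sketch of building it. The vendor string and callback URL are made-up placeholders, and the base URL is just the proxied one from my system shown above; substitute your own.

```python
from urllib.parse import urlencode

# Base URL copied from the example above (it goes through our proxy; yours will differ).
REFWORKS_IMPORT = "http://www.refworks.com.proxy.lib.umich.edu/express/expressimport.asp"


def refworks_export_url(vendor, callback_url, base=REFWORKS_IMPORT):
    """Build the "Export this record" link.

    vendor and callback_url are whatever you choose for your system;
    the other three parameters are the fixed values listed above.
    """
    params = [
        ("vendor", vendor),
        ("filter", "MARC Format"),
        ("database", "All MARC Formats"),
        ("encoding", "65001"),
        ("url", callback_url),
    ]
    # urlencode turns spaces into '+' and percent-escapes the callback URL.
    return base + "?" + urlencode(params)


# Hypothetical vendor string and callback URL, for illustration only.
print(refworks_export_url("My Catalog", "http://catalog.example.edu/Record/123/refworks"))
```

The one thing worth noticing is that the callback URL has to be escaped as a parameter value, which is exactly the part that tends to get forgotten when the link is pasted together by hand.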

The “MARC Format” filter we’re using is not a filter for real MARC. It’s a MARC-like delimited format (see an example from my catalog).

Basically, you have three types of lines (but really, look at the example, ’cause it’ll make everything a lot clearer):

LEADER

  LEADER [one space] [leader text]

Control Field

  [three-digit control tag] [four spaces] [data text]

Data Field

  [three-digit data tag] [one space] [ind1] [ind2] [one space] [value of subfield a] [other subfield constructs]

…where [other subfield constructs] look like

  [pipe character][subfield code][subfield value]

Notice that (a) there’s no leading ‘|a’ before the subfield a value, and (b) there are no spaces between the pipe, the subfield code, and the subfield value for the non-code-a subfields.

Some easy PHP code to produce such a format is as follows. Note that I’m sending it as text (because it’s not MARC) and UTF-8. If you’ve got MARC-8, you’ll have to convert it before sending.

      $m = $this->marcRecord;
      // Plain text (it's not real MARC), and always UTF-8
      header('Content-type: text/plain; charset=UTF-8');

      echo 'LEADER ', $m->getLeader(), "\n";

      foreach ($m->getFields() as $tag => $val) {
        echo $tag;
        if ($val instanceof File_MARC_Control_Field) {
          // Control field: tag, four spaces, data
          echo '    ', $val->getData(), "\n";
        } else {
          // Data field: tag, space, both indicators, space, subfields
          echo ' ', $val->getIndicator(1), $val->getIndicator(2), ' ';
          $subs = array();
          foreach ($val->getSubfields() as $code => $subdata) {
            // Subfield a goes out bare; every other subfield gets a |code prefix.
            // Compare as strings so a numeric code like 0 isn't mistaken for 'a'.
            $line = '';
            if ((string) $code !== 'a') {
              $line = '|' . $code;
            }
            $subs[] = $line . $subdata->getData();
          }
          echo implode(' ', $subs), "\n";
        }
      }
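If you’re not on PHP and File_MARC, the same three line types are easy enough to produce from plain data. Here’s a rough Python sketch, assuming (my assumption, not anything RefWorks specifies) that you’ve already got each field as either a (tag, data) pair or a (tag, ind1, ind2, subfields) tuple. The sample leader and field values are invented.

```python
def refworks_lines(leader, fields):
    """Render a record in the MARC-like delimited format described above.

    fields is a list of either (tag, data) for control fields or
    (tag, ind1, ind2, [(code, value), ...]) for data fields.
    """
    lines = ["LEADER " + leader]
    for field in fields:
        if len(field) == 2:
            # Control field: tag, four spaces, data
            tag, data = field
            lines.append(f"{tag}    {data}")
        else:
            # Data field: tag, space, indicators, space, subfields
            tag, ind1, ind2, subfields = field
            parts = []
            for code, value in subfields:
                # Subfield a is bare; the rest are |code immediately followed by the value
                prefix = "" if code == "a" else "|" + code
                parts.append(prefix + value)
            lines.append(f"{tag} {ind1}{ind2} " + " ".join(parts))
    return "\n".join(lines) + "\n"


# Made-up sample record
sample = refworks_lines(
    "00000nam a2200000 a 4500",
    [
        ("001", "000000001"),
        ("245", "1", "0", [("a", "Capitalism, primitive and modern;"),
                           ("b", "some aspects of Tolai economic growth")]),
    ],
)
print(sample, end="")
```

Like the PHP version, this joins the subfields with a single space; send the result as text/plain in UTF-8.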

MARC-HASH: The saga continues (now with even less structure)

After a medium-sized discussion on #code4lib, we’ve collectively decided that…well, ok, no one really cares all that much, but a few people weighed in.

The new format is: A list of arrays. If it’s got two elements, it’s a control field; if it’s got four, it’s a data field.

SO….it’s like this now.

  {
    "type" : "marc-hash",
    "version" : [1, 0],

    "leader" : "leader string",
    "fields" : [
      ["001", "001 value"],
      ["002", "002 value"],
      ["010", " ", " ",
        [
          ["a", "68009499"]
        ]
      ],
      ["035", " ", " ",
        [
          ["a", "(RLIN)MIUG0000733-B"]
        ]
      ],
      ["035", " ", " ",
        [
          ["a", "(CaOTULAS)159818014"]
        ]
      ],
      ["245", "1", "0",
        [
          ["a", "Capitalism, primitive and modern;"],
          ["b", "some aspects of Tolai economic growth"],
          ["c", "[by] T. Scarlett Epstein."]
        ]
      ]
    ]
  }
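To show how little code it takes to consume this, here’s a small Python sketch that loads a record and tells control fields from data fields purely by length, as described above. The helper names (split_fields, subfield_values) are mine, not part of any spec.

```python
import json

SAMPLE = """
{
  "type": "marc-hash",
  "version": [1, 0],
  "leader": "leader string",
  "fields": [
    ["001", "001 value"],
    ["010", " ", " ", [["a", "68009499"]]],
    ["245", "1", "0", [["a", "Capitalism, primitive and modern;"],
                       ["b", "some aspects of Tolai economic growth"]]]
  ]
}
"""


def split_fields(record):
    """Two elements means control field, four means data field."""
    control, data = [], []
    for field in record["fields"]:
        if len(field) == 2:
            control.append(field)
        elif len(field) == 4:
            data.append(field)
        else:
            raise ValueError(f"unexpected field shape: {field!r}")
    return control, data


def subfield_values(record, tag, code):
    """All values of one subfield code for one tag, in record order."""
    _, data = split_fields(record)
    return [value
            for field_tag, _ind1, _ind2, subfields in data if field_tag == tag
            for sub_code, value in subfields if sub_code == code]


record = json.loads(SAMPLE)
control, data = split_fields(record)
print(len(control), len(data))
print(subfield_values(record, "245", "a"))
```

Nothing here needs a MARC library, which is the whole point; the cost is the loop every time you go looking for a tag.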

MARC-HASH control field, now with less structure

Why do I ever, ever think that MARC might not rely on order? I don’t know.

In any case, control fields will now be just an array of duples:

  control: [
    ['001', 'value of the 001'],
    ['006', 'value of the 006'],
    ['006', 'another 006']
  ]

MARC-Hash: a proposed format for JSON/YAML/Whatever-compatible MARC records

In my first shot at MARC-in-JSON, which I appropriately (and prematurely) named MARC-JSON, I made a point of losing round-tripability (to and from MARC) in order to end up with a nice, easy-to-work-with data structure based mostly on hashes. “Who really cares what order the subfields come in?” I asked myself.

Well, of course, it turns out some people do. Some even care about the order of the tags. “Only in the 500s…usually” I was told today. All my lovely dreams of using easy-to-access hashes went up in so much smoke.

So…I’m suggesting we try something a little simpler. Something so brain-dead, in fact, that I’m loath to put it down because it’s pretty much the obvious way to do it. To wit:

  {
    "type" : "marc-hash",
    "version" : [1, 0],

    "leader" : "leader string",
    "control" : [
      ["001", ["all", "001", "values"]],
      ["002", ["all", "002", "values"]]
    ],
    "data" : [
      ["010", " ", " ",
        [
          ["a", "68009499"]
        ]
      ],
      ["035", " ", " ",
        [
          ["a", "(RLIN)MIUG0000733-B"]
        ]
      ],
      ["035", " ", " ",
        [
          ["a", "(CaOTULAS)159818014"]
        ]
      ],
      ["245", "1", "0",
        [
          ["a", "Capitalism, primitive and modern;"],
          ["b", "some aspects of Tolai economic growth"],
          ["c", "[by] T. Scarlett Epstein."]
        ]
      ]
    ]
  }

Stupid MARC allows all the stupid fields to stupid repeat and be out of stupid order and such, so it’s just a lot of arrays. Easily round-tripable.

Why bother? Excellent question, and one that’s a little harder to answer now that the data structure requires so much looping to find anything (the first time, anyway). I guess it’s still a lot easier than working with raw MARC (or, I would claim, MARC-XML), requires no special libraries in any language that supports strings, hashes, and arrays, and can be manipulated with basic language constructs.

A few things worth noting about the assumptions in my mind:

  • By definition, it’s always UTF-8. The leader should be changed to note this on the sending end, but it’s not required.
  • We include both a type “marc-hash”, and a version with major/minor numbers.
  • Everything is a string.
  • Alpha characters in indicators/tags are all lowercased.
  • A control field is a duple: tag and array of values.
  • A data field has four values:
    1. The tag
    2. Indicator one
    3. Indicator two
    4. An array of duples: subfield and its value
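Those assumptions are simple enough to check mechanically. Here’s a hedged Python sketch of a checker for the control/data layout above; the function name and the wording of the messages are invented, and it treats the version numbers as the one exception to “everything is a string.”

```python
def check_marc_hash(record):
    """Return a list of problems; an empty list means the record looks fine."""
    problems = []
    if record.get("type") != "marc-hash":
        problems.append("type should be 'marc-hash'")
    version = record.get("version")
    if not (isinstance(version, list) and len(version) == 2):
        problems.append("version should be [major, minor]")
    if not isinstance(record.get("leader"), str):
        problems.append("leader should be a string")

    # A control field is a duple: tag and array of values.
    for entry in record.get("control", []):
        ok = (
            len(entry) == 2
            and isinstance(entry[0], str)
            and isinstance(entry[1], list)
            and all(isinstance(v, str) for v in entry[1])
        )
        if not ok:
            problems.append(f"bad control field: {entry!r}")

    # A data field has four values: tag, ind1, ind2, array of subfield duples.
    for entry in record.get("data", []):
        if len(entry) != 4:
            problems.append(f"data field needs four values: {entry!r}")
            continue
        tag, ind1, ind2, subfields = entry
        head = [tag, ind1, ind2]
        if not all(isinstance(s, str) for s in head):
            problems.append(f"tag and indicators must be strings: {entry!r}")
        elif any(s != s.lower() for s in head):
            problems.append(f"tag and indicators must be lowercase: {entry!r}")
        if not isinstance(subfields, list):
            problems.append(f"subfields must be an array: {entry!r}")
            continue
        for sub in subfields:
            if not (len(sub) == 2 and all(isinstance(s, str) for s in sub)):
                problems.append(f"bad subfield {sub!r} in {tag}")
    return problems


good = {
    "type": "marc-hash", "version": [1, 0], "leader": "leader string",
    "control": [["001", ["000000001"]]],
    "data": [["245", "1", "0", [["a", "Some title"]]]],
}
print(check_marc_hash(good))
```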

A simple transformation to make it a little more queryable

Let’s say you don’t give a damn about tags that appear out of order, because that’s just a crime against nature, anyway. And you really don’t care what order the subtags appear in most of the time, ’cause really, who does?

A simple run-through (pseudocode ahead):

  my marchash = getTheMarcHash();
  my kindamarc;
  kindamarc{leader} = marchash{leader};
  
  # Map the control fields by tag => array-of-values
  foreach cfield (marchash{control}) {
    kindamarc{control}{cfield[0]} ||= [];
    kindamarc{control}{cfield[0]}.push(cfield[1]);
  }
  
  foreach d (marchash{data}) {
    (tag, ind1, ind2) = (d[0], d[1], d[2]);
    
    # build up a hash of subfield => array-of-values for this tag
    newd = {};
    foreach subfield (d[3]) {
      (stag, sval) = subfield;
      newd{stag} ||= [];
      newd{stag}.push(sval);
    }
    
    # Store the subfield hash in a few places so it's easy to find.
    foreach i1 ('*', ind1) {
      foreach i2 ('*', ind2) {
        kindamarc{data}{tag}{i1}{i2} ||= [];
        kindamarc{data}{tag}{i1}{i2}.push(newd);
      }
    }
  }

Control fields are stored as arrays of values associated with the tag. Data fields are built up as a hash of subfield to array-of-values pairs, and then stored both based on the indicator given and the wildcard indicator ‘*’.

Basically, this will allow things like this:

    my $leader   = $kindamarc{leader};
    my $first001 = $kindamarc{control}{'001'}[0];

    # Find 856s where indicator 2 is '1'
    my @mystuff = @{ $kindamarc{data}{'856'}{'*'}{'1'} || [] };

It’s easy to see how we could store the index from the original array to make it easy to find the original order, too.
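For anyone who’d rather see that run, here’s a Python sketch of the same transformation, including the original-position bookkeeping just mentioned. It assumes the control/data layout proposed above (each control entry is a tag plus a list of values); the “position” and “subfields” key names are my own invention, and the sample record is made up.

```python
def to_kindamarc(marchash):
    """Re-key a marc-hash record so fields can be found by tag and indicators."""
    kinda = {"leader": marchash["leader"], "control": {}, "data": {}}

    # Control fields: tag => list of values
    for tag, values in marchash["control"]:
        kinda["control"].setdefault(tag, []).extend(values)

    for position, (tag, ind1, ind2, subfields) in enumerate(marchash["data"]):
        # Subfield code => list of values, so repeated subfields aren't lost
        subs = {}
        for code, value in subfields:
            subs.setdefault(code, []).append(value)
        entry = {"position": position, "subfields": subs}

        # File the same entry under the real indicators and the '*' wildcard
        for i1 in ("*", ind1):
            for i2 in ("*", ind2):
                by_ind1 = kinda["data"].setdefault(tag, {}).setdefault(i1, {})
                by_ind1.setdefault(i2, []).append(entry)
    return kinda


record = {
    "leader": "leader string",
    "control": [["001", ["000000001"]]],
    "data": [
        ["856", "4", "1", [["u", "http://example.org/a"]]],
        ["245", "1", "0", [["a", "Some title"]]],
        ["856", "4", "0", [["u", "http://example.org/b"]]],
    ],
}
kinda = to_kindamarc(record)

# 856s where indicator 2 is '1', whatever indicator 1 is
hits = kinda["data"]["856"]["*"]["1"]
print([hit["subfields"]["u"] for hit in hits])
```

Sorting any list of hits by “position” gets you back to the order the fields had in the original record.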

For many, I’m sure, the prospect of dealing with something like this is more daunting than just learning to use MARC-XML or using existing libraries to deal with straight MARC. But there seems to be a set of folks out there for whom this might be useful, so I’m throwing it out there.

A plea: use Solr to normalize your data

[Only, of course, if you're using Solr. Otherwise, that'd be dumb.]

We’ve been working on Mirlyn2-Beta, our installation of VuFind, for some time now (don’t let the fancy-pants name scare you off), and the further we get into it, the more obvious it is that I want to move as much data normalization into Solr itself as possible.

Arguments about how much business logic to move into the database layer, in the form of foreign-key requirements, cascading inserts and deletes, stored procedures, etc. are as old as the features themselves. Solid arguments for and against are made on all sides, and like all things, there’s a happy middle ground for most people. 1

But Solr provides an incredibly compelling use case because it allows for data transformation at both index and query time via the use of custom analyzers (or a standard analyzer with text filters applied). We’re starting to migrate our schema to use more and more of these things, and I even went so far as to create a custom text filter for LCCNs after being inspired by Jonathan Rochkind.

The incentive is easy to see: client diversity. Let a thousand interfaces bloom, if you can give them all access to the same underlying Solr instance. And, seriously, how many times are you going to write that regexp to semi-normalize ISBNs and ISSNs, huh? Enough already.

If you’re using a Solr nightly (and, really, you should be — faceting is so much faster than the official 1.3 release) you have access to regexp-based filters as well, which makes stuff like this really, really easy:

    <!-- Simple type to normalize isbn/issn/other standard numbers -->
    <fieldType name="stdnum" class="solr.TextField" sortMissingLast="true" omitNorms="true" >
      <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.TrimFilterFactory"/>
        <filter class="solr.PatternReplaceFilterFactory"
             pattern="^0*([\d\-\.]+[xX]?).*$" replacement="$1"
        />
        <filter class="solr.PatternReplaceFilterFactory"
             pattern="[\-\.]" replacement=""  replace="all"
        />
      </analyzer>
    </fieldType>

Here, we use the KeywordTokenizerFactory which, not so intuitively, produces a single token from the input. Then we lowercase it and pull off any leading and trailing spaces (Trim).

For those of you that don’t read regexp, we then match anything that looks like:

  1. Any number of leading zeros
  2. …followed by any number of digits, dashes, or periods and an optional ‘X’
  3. …followed by…well, we don’t care. Anything else.

…and throw away all but the stuff in #2. Then take that and throw away all the dashes and dots, and you’re left with a string of numbers.

The beauty is that it happens both while the index is being made and during query time, so if your user types in “ 123-45-6-X ” it will be normalized to 123456x, and then checked against your index.
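If you want to poke at the regexps without reloading a Solr core, the chain is easy to mimic. This Python sketch is only an approximation of what the analyzer does (lowercase, trim, then the two pattern replacements), not Solr itself, and the function name is made up.

```python
import re


def normalize_stdnum(raw):
    """Approximate the stdnum analyzer chain: lowercase, trim, two regexp replaces."""
    token = raw.lower().strip()
    # Drop leading zeros, keep the digits/dashes/dots plus an optional x, drop the rest
    token = re.sub(r"^0*([\d\-.]+[xX]?).*$", r"\1", token)
    # Then throw away the dashes and dots
    return re.sub(r"[\-.]", "", token)


print(normalize_stdnum(" 123-45-6-X "))        # 123456x
print(normalize_stdnum("0012-3456 (print)"))   # 123456
```

Note what the second example shows: leading zeros get stripped too. That’s fine as long as both the indexed value and the query go through the same chain, which is the point of doing it in Solr.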

This is simple stuff, and probably doesn’t deserve the virtual ink I’m providing for it, but VuFind out of the box doesn’t do any of this sort of thing (likely because “the box” existed before it was super-easy to do this), and we all should be doing it.

  1. “Most,” in this case, excluding the old-time MySQL fanboys who took it as gospel that all data validation and manipulation belongs in the application layer, because their “database” didn’t do any of it. February 30th in a date field, anyone?

Enough with the freakin’ LC Call Number normalization!

OK. I’m done with it, and this time I mean it.

I’ve updated and improved the lc normalization code, documented the algorithm, and put it all into Google Code. In the next couple weeks, I’ll be turning it into a Solr text filter so we can do some decent sorting on call-number search results.

Ask, and you shall receive, and it shall be AWESOME!

The good folks at ticTocs heard the call for open data, and they responded…exactly as I asked them to. Which makes me think I should have asked for a pony, too, but I’m still very, very happy!

Anyone can now download a simple tab-delimited text file describing all the journal table of contents RSS files they’ve assembled, for use however anyone wants.

The data include issns and eissns (where available), the title of the journal, and of course the URL of the RSS/Atom/Whatever feed.

The feeds themselves are all over the map: it’s whatever the publisher decides to provide, which might include abstract/volume/number/doi, or might just be the title of the article. But regardless, they represent data that are useful to our patrons and are now available in a format that’s easy to exploit.
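“Easy to exploit” means something like the following Python sketch, which indexes feeds by ISSN and eISSN so a catalog or link-resolver page can look one up. Big caveat: the column order (id, title, URL, issn, eissn) is my assumption, not the documented layout of the file, and the sample rows are invented.

```python
import csv
import io

# Invented sample rows; the real file's columns may be ordered differently.
SAMPLE = (
    "1\tJournal of Examples\thttp://feeds.example.org/joe.rss\t1234-5678\t8765-4321\n"
    "2\tAnnals of Placeholders\thttp://feeds.example.org/aop.xml\t1111-2222\t\n"
)


def feeds_by_issn(tsv_file):
    """Map every ISSN or eISSN to the feeds that carry it."""
    index = {}
    # QUOTE_NONE: titles may contain quote characters that mean nothing special here
    reader = csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 5:
            continue  # skip short or blank lines
        _feed_id, title, url, issn, eissn = row[:5]
        for number in (issn.strip(), eissn.strip()):
            if number:
                index.setdefault(number, []).append({"title": title, "url": url})
    return index


index = feeds_by_issn(io.StringIO(SAMPLE))
print(index["8765-4321"][0]["url"])
```

For a real run you’d open the downloaded file (in UTF-8, presumably) instead of the StringIO, and you’d probably want to normalize the ISSNs the same way your catalog does.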

So…go to it?

TicTocs: Give us a file! Pretty pretty pretty please!

For those who haven’t heard, ticTOCs is a service that provides web-based access to a database of Journal RSS/Atom Table of Contents feeds. Awesome.

In their blog at News from TicTocs, a post titled I want to be completely honest with you about ticTOCs notes that:

As for the API - yes, we’ve been asked this several times, and the answer is that it is currently being written and should be available very soon.

That’s great, but writing in a comment on that post (after logging in with a very, very old OpenID — I used to have a blog named Opachyderm, a name which I thought was insufferably clever), I noted that we don’t need an API right away.

What we need is a text file.

Simple. Tab-delimited. TicTocID,Title,URL,issn,eissn. Update it every night.

That’s all we need.

We can do the rest. Put it in the OPAC. Stick it on our SFX pages. Not screw around with Javascript/AJAX calls when the data we need are (relatively) static and (absolutely) simple.

Someone needed to put a web interface on those data, and the one provided at ticTocs is really nice. I’m glad it’s there.

And I can’t tell you how much I applaud the JISC for starting this project and getting vendors on board. That’s always the hard part — participation and standardization. They’re doing it, and I couldn’t be happier.

But these data are incredibly valuable,  and their value is currently limited because they’re boxed up.

Spreading these data far and wide is good for scholarship, and I can’t imagine the case that could be made showing it’s better for JISC to keep them at a single endpoint.

The knee-jerk reaction is always, I know, to keep things behind a wall, even if it’s a short wall. “Things will get out of sync if people have their own copies.” Or, “We’ll provide whatever access you need, as fast as you need it, honest.” Or, “We’re going to be providing value-added services on top of the data.”

It’s all true. Things will get out-of-sync — but that’s going to happen whether you encourage people to not cache results or not. And I don’t doubt for a moment that the API provided will be great. And of course you’ll be in a position where you can provide value-added services.

But so can the rest of us.

I’ve run into this myself. I fear…well, let’s be honest. I fear providing a service, having the data stripmined, and then having no one appreciate the front-end I put on it. I do this job for the fame, not the fortune. Obviously.

But I’ll never provide services as fast as me plus three hundred other geeks, all responding to different situations and servicing different patrons.

So…provide an API. Start simple: a single call named getCurrentTextFile. Or maybe add getCurrentTextFileGzipped. It’s only ten-thousand lines of text, probably less than 75k gzipped up. I promise to call it every night about 3am local time so I’m up-to-date.

So….pretty please? With sugar on top? My catalog is waiting. So is my SFX install. And our list of ejournals. And our subject guides. And lots of pages on our website. And our pre-packaged OPML files to offer students and professors. And a thousand yet-to-be-devised services as well.

Pretty pretty pretty please???

Five rules to make your open source more open

[I've noticed that a sure way to get people to look at stuff (as measured by, say, digg) is to include a number. So I did. Five. ]

Over at Bibliographic Wilderness, Jonathan Rochkind has a great followup to an ongoing discussion on the Blacklight list called How to build shared open source in which he tackles some of the differences between open-sourcing your code (a legal and distribution issue) and actually making it so someone else can usefully contribute to your code.

The project I’m spending most of my time on right now, VUFind, is a great piece of functional code but, in many ways, a nightmare in terms of trying to contribute code and abstract out local functionality. This isn’t meant as a slam on the main contributor(s) to VUFind (Andrew, especially, seems to be an almost frighteningly-productive coder), but my experiences trying to customize the code to our local situation have given me a lot of time to think about how I wish things had been architected.

So, here I give some general rules and some specifics as to What I Wish I Had To Work With.

1. Abstraction

General rule: Abstract things out as much as makes sense

Specific rule: Abstract the living crap out of your authentication scheme.

Look, pretty much everyone with anything worth protecting already has an auth/authZ infrastructure in place. Sometimes an extensive, perhaps multi-institutional infrastructure. One that isn’t going to be bypassed without, say, getting fired.

So if you’re going to require people to log in, make sure you make that process as abstract as you possibly can, both in algorithm and in code. Have a singleton class that’s easily subclassed to represent your user, and call it exclusively. Make sure that your URIs are easily separated into those that require auth and those that don’t, for simple use of mod_rewrite or whatnot to redirect to authentication. Make sure it’s easy to hook into (or work around) AJAX links that might require authentication that has expired.

And for the love of god, don’t stuff username/password information into a cookie if you’re doing web work. Use a session and session key. Any auth scheme that I can spoof is no auth scheme at all, because I’m an idiot and not even trying hard.

2. Configuration files

General rule: use config files for anything local

Specific rule: Use a configuration file format that can represent complex data

That’s right, I’m looking at you, .ini and .properties files.

Use something like YAML, or XML, or even straight programming-language code (i.e., a file with a PHP hash or a perl hashref or whatnot) that can actually represent, in a logical way, the complexities of the stuff you need to configure. And then, again, have a singleton class that will read that data and expose it in a useful and safe way.

And include a semantics checker if you can manage to write one.  It’ll save everyone a load of trouble.

Huge bonus points if your configuration singleton class can read from multiple files, overriding previous (default) definitions with subsequent (local) ones.
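The override behavior is less work than it sounds. Here’s a minimal Python sketch of layering a local config over defaults; it uses JSON text purely as a stand-in so the example needs nothing outside the standard library, and the setting names are hypothetical.

```python
import json


def merge_config(defaults, overrides):
    """Return defaults with overrides layered on top, merging nested sections."""
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides have a section here: merge it rather than replace it
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = value
    return merged


def load_config(*json_texts):
    """Later documents win over earlier ones."""
    config = {}
    for text in json_texts:
        config = merge_config(config, json.loads(text))
    return config


# Hypothetical settings, for illustration only
DEFAULTS = '{"solr": {"host": "localhost", "port": 8983}, "theme": "default"}'
LOCAL = '{"solr": {"host": "solr.example.edu"}}'

config = load_config(DEFAULTS, LOCAL)
print(config["solr"])
```

The local file only has to mention what’s different, which is what makes upgrades to the shipped defaults survivable.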

3. Hide subapplications

General rule: Don’t force your user to intimately understand every piece of every library/application you include

Specific rule: Generate configuration information for sublibraries/applications

This might be a little specific to the project I’m working on now, which uses Solr as a backend, but I think it applies more generally.

If you’re using a non-brain-dead configuration file format, and if you can assume reasonable defaults, then generate configuration files for your user. A low-level extreme of this is the traditional unix autoconf, which essentially allows you to install software without knowing a damn thing about your own system. Which is useful to those of us that don’t.

In VUFind, there are three files — a .properties file that specifies how to map MARC data into  field names,  Solr’s schema.xml that describes the structure and behavior of those same fields, and an XSLT stylesheet that pretties the data as it comes out of Solr to make it easier to work with. As you might expect, the overlap in data is about 80% across the three of them, and it would be a bazillion times easier to have a single file that generated all three.

OK. Maybe not a bazillion, because if it was that easy, I’d have taken a couple hours to write the code to do it already. Let’s say just a zillion times easier.

The caveat to this is that you need to either make sure your config file specification is complete enough to encompass everything all the other files might need to know (bad), or that the other config files can import subsections that override your defaults (good).

4. Testing

General rule: practice test-first (TDD or BDD) development

Specific rule: write your code in such a way that it’s testable

Look, we all know we should spend the first three weeks writing eight thousand tests to describe every corner of the code. And we all have bosses that will ask, every morning about 10:30am, “So, what do you have that you can show me?”

Not everyone is going to be able to write tests first. That’s not right, it’s not smart, but it’s the way the world works. But at least put in the hooks so someone else can come along and write tests.

Writing tests is one of the easiest ways that a newbie can come along to a project and instantly contribute in a meaningful way. But if you’re constantly calling global variables, depending on live database connections and not providing a way to mock them up, or throwing fatal errors if every subsystem isn’t present no matter the context, then it’s going to be hard to write tests.  So hard, in fact, that not only will you not do it, but neither will anyone else.
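By “hooks” I mean something as modest as this Python sketch: the class gets handed the thing that fetches rows instead of reaching for a global database connection, so a test can pass in a fake. All the names here are invented for illustration.

```python
class FormatCounter:
    """Counts records by format; where the rows come from is the caller's business."""

    def __init__(self, fetch_rows):
        # fetch_rows is any callable returning an iterable of dicts.
        # In production it would wrap a real query; in a test it's a stub.
        self._fetch_rows = fetch_rows

    def counts(self):
        totals = {}
        for row in self._fetch_rows():
            totals[row["format"]] = totals.get(row["format"], 0) + 1
        return totals


def fake_rows():
    """Stand-in for the database, so the test needs no live connection."""
    return [{"format": "book"}, {"format": "serial"}, {"format": "book"}]


counter = FormatCounter(fake_rows)
print(counter.counts())
```

Nothing about this is test-first, but it leaves the door open for whoever shows up later wanting to write the tests.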

5. Error handling

General rule: provide a sane, hierarchical set of error classes and hooks to catch them as necessary

Specific rule: THROW SOME GODDAMN ERRORS!!!!

Don’t be an idiot. Things will fail. In the absence of Design by Contract or somesuch, errors will happen. Throw them. Catch them. But at least throw them, instead of letting your code die six hundred lines later with a “Cannot cast null value to string” when you finally get around to trying to print something out.
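A “sane, hierarchical set of error classes” doesn’t have to be elaborate. Here’s a small Python sketch; the class and function names are hypothetical and not taken from VUFind or anything else.

```python
class CatalogError(Exception):
    """Base class: catch this to handle anything the application throws."""


class ConfigError(CatalogError):
    """Something in the configuration is missing or malformed."""


class RecordNotFound(CatalogError):
    """A record was asked for by id and doesn't exist."""

    def __init__(self, record_id):
        super().__init__(f"No record with id {record_id!r}")
        self.record_id = record_id


def title_of(records, record_id):
    record = records.get(record_id)
    if record is None:
        # Fail here, where the problem is, not six hundred lines later
        raise RecordNotFound(record_id)
    return record["title"]


records = {"001": {"title": "Capitalism, primitive and modern"}}

try:
    title_of(records, "999")
except CatalogError as err:
    # One handler covers the whole family; narrower handlers can go above it
    print(f"{type(err).__name__}: {err}")
```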

And then I finally shut the hell up

I had a great — great! I tell you — 30 second conversation with Ken Varnum (of RSS4Lib fame) that went something like this (much paraphrasing, obviously):

B: You’re gonna have to fix that interface. The standard header won’t work.
K: Well, no, we’re going to leave it as it is.
B: It’s not gonna work.
K: We’ve decided to make it all consistent.
B: OK, you can keep saying that, but I’m really, really smart and I say users are going to be confused.
K: We’ve done user testing. They weren’t confused. And here’s our plan to see if they are confused once we go live.

And then I finally shut the hell up. While I’m never crazy about being just plain wrong, it was so so SO refreshing to have someone say, “Well, actually, we’re making this decision based on data and not just pulling answers out of our pants like so many flying monkeys.”

Where, oh where in the library is the dedication to making actual data-based decisions? Besides Ken’s office, I mean?