Category: Uncategorized

For the last few months, I’ve been working on rolling out a ridiculously modified version of VuFind, which we just launched as our primary OPAC, Mirlyn, with a slightly different version powering catalog.hathitrust.org, a temporary metadata search on the HathiTrust data until OCLC takes it over at some undetermined date.

(Yeah, the HathiTrust site is a lot better looking.)

[Our Aleph-based catalog lives on at mirlyn-classic -- I'll be interested to see how the traffic on the two differs as time goes on.]

I’m going to spend a few posts talking about how and why we essentially forked vufind, what sorts of modifications I made, and what technologies I hope to extract from our implementation that may be useful to the wider library community. And, I’m sure, a lot about why I hate Solr, why I love love love Solr, why I hate PHP, and why I love…er…no, I still hate PHP.

Credit where it’s due

And… a little credit where it’s due. I did a lot, but I didn’t do it all. I probably didn’t even do most of it. Half the effort, including all the heavy Aleph lifting (from getting the MARC out with all the filters and expansions we needed, to pulling holdings in real time, to grabbing a patron’s current checked-out items and holds, to fighting the inevitably-scarring battle with ILL), was done by Tim Prettyman. Suzanne Chapman lent her expertise to make it a lot less ugly and more usable than it once was (you can see her talents more strongly expressed at the HathiTrust catalog). And a whole horde of librarians were tapped by my boss, Jon Rothman, to try to figure out how to deal with the MARC data and facets and everything else that required a much deeper understanding of our data than I possess.

Non-stock user-facing features

In the next post, I’ll start with a look at how and why we changed the backend and what I’d do differently if I were starting from scratch. But right now, a quick list of the user-facing stuff that you might find interesting.

  • Email and export searches and search results, as opposed to just individual records.
  • Working EndNote and RefWorks export.
  • Multi-select on the advanced search (e.g., pick two languages to get English OR German).
  • Publication date-range searching (with date-added-to-catalog searching coming soon).
  • A “sticky” institution selection, so each campus can choose to default to searching just their own stuff. We sniff IPs to set a default, too.
  • A “call number starts with” search based on semantics for LC searches (e.g., searching on CA11 won’t find CA1105), with call number range searching in testing now.
  • Contracted holdings for long lists of serials (see, e.g., Nature).
  • [Coming soon] Selecting records to a temporary set, which can be manipulated en masse (sent to Refworks, etc.). I’ll be hooking this up to mTagger, our home-grown bookmarking and tagging tool, later on.

Of course, I also broke some things. I haven’t added back in Search History, but will do so when I’ve got a couple hours. “Search Within” will make a comeback soon, too, but there are usability issues to contend with. And …for the love of god, don’t do a “View Source.” It’s the ugliest HTML underpinnings I’ve been associated with since 1993 or so.

All in all, though, it’s not bad work, and I’m glad to be able to offer it to our patrons.

2 Responses to “Rolling out UMich’s “VUFind”: Introduction and New Features”

  1. Dean says:

    Hi Bill, Great to see this post. Would you mind elaborating in a future post how you made the exact title matching work? ie. Nature, Science, Cell?

    Thanks!

  2. Bill says:

    Sure thing. If anyone else has stuff they’d like to hear about sooner rather than later, drop a comment here or email me.


Sending MARC(ish) data to Refworks

May 11, 2009 at 10:48 am | Category: Uncategorized

Refworks has some okish documentation about how to deal with its callback import procedure, but I thought I’d put down how I’m doing it for our vufind install (mirlyn2-beta.lib.umich.edu) in case other folks are interested.

The basic procedure is:

  • Send your user to a specific RefWorks URL along with a callback URL that can enumerate the record(s) you want to import in a supported form
  • Your user logs in (if need be) and gets to her RefWorks page
  • RefWorks calls up your system and requests the record(s)
  • The import happens, and your user does whatever she wants to do with them

Of course, there are lots of issues with doing this well (quick! Is this MARC record for a book? An edited book? Is it a journal, or a serial of some other sort? Who’s the actual author/editor?), but doing it at all isn’t so bad.

The URL to send them to

This is the “Export this record” URL on my system:

http://www.refworks.com.proxy.lib.umich.edu/express/expressimport.asp?
vendor=[your system]&
filter=MARC+Format&
database=All+MARC+Formats&
encoding=65001&
url=[your callback URL]

Note that the vendor variable should be a unique string (made up by you) for your system, not a larger entity (like the whole library or the institution).
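
If you'd rather build that link in code than by hand, here's a minimal Python sketch. The parameter names and values are the ones listed above; the function name, the vendor string, and the callback URL are made up for illustration, and the base URL is the UMich-proxied one from this post, so substitute whatever your own setup uses.

```python
from urllib.parse import urlencode

# Base URL exactly as given in the post (it goes through the UMich proxy);
# other sites would substitute their own.
REFWORKS_IMPORT = (
    "http://www.refworks.com.proxy.lib.umich.edu/express/expressimport.asp"
)


def refworks_export_url(vendor, callback_url, base=REFWORKS_IMPORT):
    """Build the 'Export this record' link (function name is hypothetical)."""
    params = {
        "vendor": vendor,                # unique string for *your* system
        "filter": "MARC Format",         # the MARC-like text filter
        "database": "All MARC Formats",
        "encoding": "65001",             # value used in the post's URL
        "url": callback_url,             # where RefWorks fetches the record(s)
    }
    # urlencode turns spaces into '+' and percent-encodes the callback URL
    return base + "?" + urlencode(params)


link = refworks_export_url("my-catalog", "http://example.org/export?id=1")
```

The point of going through urlencode is that the callback URL usually carries its own query string, which has to be escaped so it survives as a single parameter.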

The “MARC Format” filter we’re using is not a filter for real MARC. It’s a MARC-like delimited format (see an example from my catalog).

Basically, you have three types of lines (but really, look at the example, ’cause it’ll make everything a lot clearer):

LEADER

  LEADER [one space] [leader text]

Control Field

  [three-digit control tag] [four spaces] [data text]

Data Field

  [three-digit data tag] [one space] [ind1] [ind2] [one space] [value of subfield a] [other subfield constructs]

…where [other subfield constructs] look like

  [pipe character][subfield code][subfield value]

Notice that (a) there’s no leading ‘|a’ before the subfield a value, and (b) there are no spaces between the pipe, the subfield code, and the subfield value for the non-code-a subfields.

Some easy PHP code to produce such a format is as follows. Note that I’m sending it as text (because it’s not MARC) and UTF-8. If you’ve got MARC-8, you’ll have to convert it before sending.

  // Send as plain text (this isn't real MARC) in UTF-8
  $m = $this->marcRecord;
  header('Content-type: text/plain; charset=UTF-8');

  echo 'LEADER ', $m->getLeader(), "\n";

  foreach ($m->getFields() as $tag => $val) {
    echo $tag;
    if ($val instanceof File_MARC_Control_Field) {
      // Control field: tag, four spaces, data
      echo '    ', $val->getData(), "\n";
    } else {
      // Data field: tag, space, both indicators, space, subfields
      echo ' ', $val->getIndicator(1), $val->getIndicator(2), ' ';
      $subs = array();
      foreach ($val->getSubfields() as $code => $subdata) {
        // Subfield a gets no '|a' prefix; the rest get '|' plus their code
        $line = '';
        if ($code !== 'a') {
          $line = '|' . $code;
        }
        $subs[] = $line . $subdata->getData();
      }
      echo implode(' ', $subs), "\n";
    }
  }
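
For folks not working in PHP, here's roughly the same thing as a Python sketch. It doesn't use any MARC library; the record is just a leader string plus a list of tuples, which is a representation I'm assuming purely for illustration (two elements for a control field, four for a data field).

```python
def marc_text(leader, fields):
    """Render a record in the MARC-like text format described above.

    `fields` is a list of tuples (an assumed, illustrative shape):
      (tag, data)                      for control fields
      (tag, ind1, ind2, [(code, v)])   for data fields
    """
    lines = ["LEADER " + leader]
    for field in fields:
        if len(field) == 2:
            tag, data = field
            # Control field: tag, four spaces, data
            lines.append(tag + "    " + data)
        else:
            tag, ind1, ind2, subfields = field
            parts = []
            for code, value in subfields:
                # No '|a' before subfield a; '|' + code glued to the value otherwise
                prefix = "" if code == "a" else "|" + code
                parts.append(prefix + value)
            lines.append(tag + " " + ind1 + ind2 + " " + " ".join(parts))
    return "\n".join(lines) + "\n"


example = marc_text(
    "leader string",
    [
        ("001", "000123"),
        ("245", "1", "0", [("a", "Capitalism, primitive and modern;"),
                           ("b", "some aspects of Tolai economic growth")]),
    ],
)
```

Like the PHP, this joins subfields with a single space and leaves character-set conversion to the caller.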

3 Responses to “Sending MARC(ish) data to Refworks”

  1. Dan Scott says:

    Bill – thanks so much for this! A working example makes all the difference for other people following in your wake. (Once the waves finish washing over my head, I hope to implement RefWorks export in Evergreen…)

  2. [...] Sending citations to RefWorks can be done with a callback. Essentially, you add a link to RefWorks’ import function page and send it your credentials, as well as a callback URL that RefWorks uses to grab the record from your ILS…in a RefWorks-supported format. The problem is that RefWorks doesn’t accept MODS, MARC, or even MARCXML. They say they accept MARC, but it’s actually what I call “MARC text” (it is described very well by Bill Dueber). [...]

  3. Ali says:

    Hi Bill,

    Good stuff. I will see if I can do similar for YorkU Vufind Instance.

    Cheers,

    Ali


After a medium-sized discussion on #code4lib, we’ve collectively decided that…well, ok, no one really cares all that much, but a few people weighed in.

The new format is: A list of arrays. If it’s got two elements, it’s a control field; if it’s got four, it’s a data field.

SO….it’s like this now.

  {
    "type" : "marc-hash",
    "version" : [1, 0],

    "leader" : "leader string",
    "fields" : [
      ["001", "001 value"],
      ["002", "002 value"],
      ["010", " ", " ",
        [
          ["a", "68009499"]
        ]
      ],
      ["035", " ", " ",
        [
          ["a", "(RLIN)MIUG0000733-B"]
        ]
      ],
      ["035", " ", " ",
        [
          ["a", "(CaOTULAS)159818014"]
        ]
      ],
      ["245", "1", "0",
        [
          ["a", "Capitalism, primitive and modern;"],
          ["b", "some aspects of Tolai economic growth"],
          ["c", "[by] T. Scarlett Epstein."]
        ]
      ]
    ]
  }
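
To show how little code the "two elements or four" rule needs on the consuming side, here's a small Python sketch. The function name and the trimmed-down sample record are mine, not part of any spec.

```python
import json


def split_fields(record):
    """Separate a marc-hash 'fields' list into control and data fields by length."""
    control, data = [], []
    for field in record["fields"]:
        if len(field) == 2:        # [tag, value]
            control.append(field)
        elif len(field) == 4:      # [tag, ind1, ind2, [[code, value], ...]]
            data.append(field)
        else:
            raise ValueError(f"unexpected field shape: {field!r}")
    return control, data


# A cut-down version of the example record above
SAMPLE = json.loads("""
{
  "type": "marc-hash",
  "version": [1, 0],
  "leader": "leader string",
  "fields": [
    ["001", "001 value"],
    ["245", "1", "0", [["a", "Capitalism, primitive and modern;"],
                       ["b", "some aspects of Tolai economic growth"]]]
  ]
}
""")

control, data = split_fields(SAMPLE)
```

Because everything lives in one ordered list, the original field order is preserved for free; the split is only a convenience for the reader of the data.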

One Response to “MARC-HASH: The saga continues (now with even less structure)”

  1. [...] I initially looked at MARC-HASH almost a year ago, I was mostly looking for something that wasn’t such a pain in the butt to work with, [...]


Why do I ever, ever think that MARC might not rely on order? I don’t know.

In any case, control fields will now be just an array of duples:

  control: [
    ['001', 'value of the 001'],
    ['006', 'value of the 006'],
    ['006', 'another 006']
  ]


In my first shot at MARC-in-JSON, which I appropriately (and prematurely) named MARC-JSON, I made a point of losing round-tripability (to and from MARC) in order to end up with a nice, easy-to-work-with data structure based mostly on hashes. “Who really cares what order the subfields come in?” I asked myself.

Well, of course, it turns out some people do. Some even care about the order of the tags. “Only in the 500s…usually” I was told today. All my lovely dreams of using easy-to-access hashes went up in so much smoke.

So…I’m suggesting we try something a little simpler. Something so brain-dead, in fact, that I’m loath to put it down because it’s pretty much the obvious way to do it. To wit:

  {
    "type" : "marc-hash",
    "version" : [1, 0],

    "leader" : "leader string",
    "control" : [
      ["001", ["all", "001", "values"]],
      ["002", ["all", "002", "values"]]
    ],
    "data" : [
      ["010", " ", " ",
        [
          ["a", "68009499"]
        ]
      ],
      ["035", " ", " ",
        [
          ["a", "(RLIN)MIUG0000733-B"]
        ]
      ],
      ["035", " ", " ",
        [
          ["a", "(CaOTULAS)159818014"]
        ]
      ],
      ["245", "1", "0",
        [
          ["a", "Capitalism, primitive and modern;"],
          ["b", "some aspects of Tolai economic growth"],
          ["c", "[by] T. Scarlett Epstein."]
        ]
      ]
    ]
  }

Stupid MARC allows all the stupid fields to stupid repeat and be out of stupid order and such, so it’s just a lot of arrays. Easily round-tripable.

Why bother? Excellent question, and one that’s a little harder to answer now that the data structure requires so much looping to find anything (the first time, anyway). I guess it’s still a lot easier than working with raw MARC (or, I would claim, MARC-XML), requires no special libraries in any language that supports strings, hashes, and arrays, and can be manipulated with basic language constructs.

A few things worth noting about the assumptions in my mind:

  • By definition, it’s always UTF-8. The leader should be changed to note this on the sending end, but it’s not required.
  • We include both a type “marc-hash”, and a version with major/minor numbers.
  • Everything is a string.
  • Alpha characters in indicators/tags are all lowercased.
  • A control field is a duple: tag and array of values.
  • A data field has four values:
    1. The tag
    2. Indicator one
    3. Indicator two
    4. An array of duples: subfield and its value

A simple transformation to make it a little more queryable

Let’s say you don’t give a damn about tags that appear out of order, because that’s just a crime against nature, anyway. And you really don’t care what order the subtags appear in most of the time, ’cause really, who does?

A simple run-through (pseudocode ahead):

  my marchash = getTheMarcHash();
  my kindamarc;
  kindamarc{leader} = marchash{leader};

  # Map the control fields by tag => array-of-values
  foreach cfield (marchash{control}) {
    kindamarc{control}{cfield[0]} ||= [];
    kindamarc{control}{cfield[0]}.push(cfield[1]);
  }

  foreach d (marchash{data}) {
    (tag, ind1, ind2) = (d[0], d[1], d[2]);

    # build up a hash of subfield => array-of-values for this tag
    newd = {};
    foreach subfield (d[3]) {
      (stag, sval) = subfield;
      newd{stag} ||= [];
      newd{stag}.push(sval);
    }

    # Store the subfield hash in a few places so it's easy to find.
    foreach i1 ('*', ind1) {
      foreach i2 ('*', ind2) {
        kindamarc{data}{tag}{i1}{i2} ||= [];
        kindamarc{data}{tag}{i1}{i2}.push(newd);
      }
    }
  }

Control fields are stored as arrays of values associated with the tag. Data fields are built up as a hash of subfield to array-of-values pairs, and then stored both based on the indicator given and the wildcard indicator ‘*’.

Basically, this will allow things like this:

  $leader = $kindamarc{leader};
  $first001 = $kindamarc{control}{"001"}[0];

  # Find 856s where indicator 2 is '1'
  @mystuff = $kindamarc{data}{856}{'*'}{1};

It’s easy to see how we could store the index from the original array to make it easy to find the original order, too.
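
For the curious, here's what that transformation might look like as runnable Python, including the original-position bookkeeping. It's a sketch under my own assumptions: the function name and the "position"/"subfields" keys are invented for illustration, and since each control entry carries a list of values I extend the per-tag list rather than nesting lists.

```python
from collections import defaultdict


def make_queryable(marchash):
    """Turn the ordered marc-hash lists into lookup-friendly nested dicts."""
    control = defaultdict(list)
    for tag, values in marchash["control"]:
        control[tag].extend(values)      # each entry is [tag, [values...]]

    # data[tag][ind1][ind2] -> list of field entries
    data = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for position, (tag, ind1, ind2, subfields) in enumerate(marchash["data"]):
        subs = defaultdict(list)
        for code, value in subfields:
            subs[code].append(value)     # subfields can repeat
        entry = {"position": position, "subfields": dict(subs)}
        # File under the real indicators and under the '*' wildcard
        for i1 in ("*", ind1):
            for i2 in ("*", ind2):
                data[tag][i1][i2].append(entry)

    return {"leader": marchash["leader"], "control": dict(control), "data": data}


record = {
    "leader": "leader string",
    "control": [["001", ["abc123"]]],
    "data": [
        ["245", "1", "0", [["a", "A title"]]],
        ["856", "4", "1", [["u", "http://example.org/a"]]],
        ["856", "4", "0", [["u", "http://example.org/b"]]],
    ],
}
q = make_queryable(record)
hits = q["data"]["856"]["*"]["1"]   # 856s where indicator 2 is '1'
```

Sorting any result list by its "position" value gets you back to the original field order.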

For many, I’m sure, the prospect of dealing with something like this is more daunting than just learning to use MARC-XML or using existing libraries to deal with straight MARC. But there seems to be a set of folks out there for whom this might be useful, so I’m throwing it out there.

2 Responses to “MARC-Hash: a proposed format for JSON/YAML/Whatever-compatible MARC records”

  1. This is basically just MARC-XML translated to JSON, yes?

    Do you really find it easier to work with than MARC-XML? I guess if you’re doing your work in javascript, maybe.

  2. Bill says:

    Meh. I agree — and hence the disclaimers — that it’s nothing special. But “nothing special that everyone agrees on” is better than no agreement at all, and for some people XML processing is a barrier they don’t want to have to deal with. For all its seeming ubiquity, lots of folks never have to deal with XML in their programming.

    I guess my thing is, “IF we’re going to serialize MARC as JSON/YAML/Whatnot, let’s all do it the same way.”


A plea: use Solr to normalize your data

March 30, 2009 at 4:22 pm | Category: Uncategorized

[Only, of course, if you're using Solr. Otherwise, that'd be dumb.]

We’ve been working on Mirlyn2-Beta, our installation of VuFind for some time now (don’t let the fancy-pants name scare you off), and the further we get into it, the more obvious it is that I want to move as much data normalization into Solr itself as possible.

Arguments about how much business logic to move into the database layer, in the form of foreign-key requirements, cascading inserts and deletes, stored procedures, etc. are as old as the features themselves. Solid arguments for and against are made on all sides, and like all things, there’s a happy middle ground for most people. 1

But Solr provides an incredibly compelling use case because it allows for data transformation at both index and query time via the use of custom analyzers (or a standard analyzer with text filters applied). We’re starting to migrate our schema to use more and more of these things, and I even went so far as to create a custom text filter for LCCNs after being inspired by Jonathan Rochkind.

The incentive is easy to see: client diversity. Let a thousand interfaces bloom, if you can give them all access to the same underlying Solr instance. And, seriously, how many times are you going to write that regexp to semi-normalize ISBNs and ISSNs, huh? Enough already.

If you’re using a Solr nightly (and, really, you should be — faceting is so much faster than the official 1.3 release) you have access to regexp-based filters as well, which makes stuff like this really, really easy:

    <!-- Simple type to normalize isbn/issn/other standard numbers -->
    <fieldType name="stdnum" class="solr.TextField" sortMissingLast="true" omitNorms="true" >
      <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.TrimFilterFactory"/>
        <filter class="solr.PatternReplaceFilterFactory"
             pattern="^0*([\d\-\.]+[xX]?).*$" replacement="$1"
        />
        <filter class="solr.PatternReplaceFilterFactory"
             pattern="[\-\.]" replacement=""  replace="all"
        />
      </analyzer>
    </fieldType>

Here, we use the KeywordTokenizerFactory which, not so intuitively, produces a single token from the input. Then we lowercase it and pull off any leading and trailing spaces (Trim).

For those of you that don’t read regexp, we then match anything that looks like:

  1. Any number of leading zeros
  2. …followed by any number of digits, dashes, or periods and an optional ‘X’
  3. …followed by…well, we don’t care. Anything else.

…and throw away all but the stuff in #2. Then take that and throw away all the dashes and dots, and you’re left with a string of numbers.

The beauty is that it happens both while the index is being made and during query time, so if your user types in “ 123-45-6-X ” it will be normalized to 123456x, and then checked against your index.
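
If you want to sanity-check what that filter chain does without firing up Solr, the same steps can be mimicked in a few lines of Python. This is only an approximation for playing with inputs (Java and Python regexes differ in small ways), and the function name is mine.

```python
import re

# Same two patterns as the PatternReplaceFilterFactory entries above
KEEP_NUMBER = re.compile(r"^0*([\d\-.]+[xX]?).*$")
PUNCTUATION = re.compile(r"[\-.]")


def normalize_stdnum(raw):
    """Approximate the 'stdnum' analyzer: lowercase, trim, keep number, drop punctuation."""
    token = raw.lower().strip()              # LowerCaseFilter + TrimFilter
    token = KEEP_NUMBER.sub(r"\1", token)    # keep only the captured number part
    return PUNCTUATION.sub("", token)        # strip dashes and dots


normalized = normalize_stdnum(" 123-45-6-X ")   # "123456x"
```

Note that the leading-zero rule means an ISSN like 0378-5955 comes out as 3785955, on both the index side and the query side, so matches still line up.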

This is simple stuff, and probably doesn’t deserve the virtual ink I’m providing for it, but Vufind out of the box doesn’t do any of this sort of thing (likely because “the box” existed before it was super-easy to do this), and we all should be doing it.

  1. “Most,” in this case, excluding the old-time MySQL fanboys who took it as gospel that all data validation and manipulation belongs in the application layer, because their “database” didn’t do any of it. February 30th in a date field, anyone?

One Response to “A plea: use Solr to normalize your data”

  1. Nice! Can you provide your lccn normalization routines too?


OK. I’m done with it, and this time I mean it.

I’ve updated and improved the lc normalization code, documented the algorithm, and put it all into Google Code. In the next couple weeks, I’ll be turning it into a Solr text filter so we can do some decent sorting on call-number search results.

2 Responses to “Enough with the freakin’ LC Call Number normalization!”

  1. Thanks for sticking it up on the web. I suspect Blacklight will want that at some point.

  2. Naomi Dushay says:

    I wrote a bunch of LC parsing (and dewey, too!) to get to shelving keys. It’s in the CallNumUtils of the solrmarc project.


The good folks at ticTocs heard the call for open data, and they responded…exactly as I asked them to. Which makes me think I should have asked for a pony, too, but I’m still very, very happy!

Anyone can now download a simple tab-delimited text file describing all the journal table of contents RSS files they’ve assembled, for use however anyone wants.

The data include issns and eissns (where available), the title of the journal, and of course the URL of the RSS/Atom/Whatever feed.

The feeds themselves are all over the map: it’s whatever the publisher decides to provide, which might include abstract/volume/number/doi, or might just be the title of the article. But regardless, they represent data that are useful to our patrons and are now available in a format that’s easy to exploit.

So…go to it?

4 Responses to “Ask, and you shall receive, and it shall be AWESOME!”

  1. Paul S. says:

    Excellent work. I shall (go to it) on our own system.

  2. Bill,

    The shetland pony is on it’s way. Would it also be useful to include the ticTOCs URLs for TOC feeds in the text file? These URLs provide normalised (i.e. corrected, etc) content, which will display within ticTOCs. e.g. http://www.tictocs.ac.uk/?action=displayJournal&journalId=15711 would be the URL to the TOC of International Journal of Business and Emerging Markets. etc.

  3. Naomi Dushay says:

    Bill, This is freakin’ awesome. What a valuable thing for a library discovery service. Thanks for bringing it to my attention.

  4. Santy Chumbe says:

    Is the API of journaltocs (www.journaltocs.hw.ac.uk) the Shetland pony you were waiting for?


For those who haven’t heard, ticTOCs is a service that provides web-based access to a database of Journal RSS/Atom Table of Contents feeds. Awesome.

In their blog at News from TicTocs, a post titled I want to be completely honest with you about ticTOCs notes that:

As for the API – yes, we’ve been asked this several times, and the answer is that it is currently being written and should be available very soon.

That’s great, but writing in a comment on that post (after logging in with a very, very old OpenID — I used to have a blog named Opachyderm, a name which I thought was insufferably clever), I noted that we don’t need an API right away.

What we need is a text file.

Simple. Tab-delimited. TicTocID,Title,URL,issn,eissn. Update it every night.

That’s all we need.

We can do the rest. Put it in the OPAC. Stick it on our SFX pages. Not screw around with Javascript/AJAX calls when the data we need are (relatively) static and (absolutely) simple.
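
Just to show how little work "the rest" is, here's a Python sketch that reads such a file and indexes the feeds by ISSN. The column order is the one I asked for above, which is an assumption about what any real file would look like, and the two sample rows are invented placeholders rather than real ticTOCs data.

```python
import csv
import io

# Column order as requested above; a real file may differ.
COLUMNS = ["tictoc_id", "title", "url", "issn", "eissn"]


def index_by_issn(tsv_text):
    """Map every ISSN and eISSN in a tab-delimited feed list to its row."""
    feeds = {}
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t",
                        quoting=csv.QUOTE_NONE)   # titles may contain quotes
    for row in reader:
        if len(row) < len(COLUMNS):
            continue                              # skip short or blank lines
        record = dict(zip(COLUMNS, row))
        for key in (record["issn"], record["eissn"]):
            if key:
                feeds[key] = record
    return feeds


# Invented sample rows, for illustration only
sample = (
    "1\tJournal of Examples\thttp://example.org/feed.rss\t1234-5678\t8765-4321\n"
    "2\tAnother Journal\thttp://example.org/other.rss\t1111-2222\t\n"
)
feeds = index_by_issn(sample)
```

From there, a catalog record page or an SFX menu only needs a dictionary lookup on the record's ISSN to decide whether to show a table-of-contents link.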

Someone needed to put a web interface on those data, and the one provided at ticTocs is really nice. I’m glad it’s there.

And I can’t tell you how much I applaud the JISC for starting this project and getting vendors on board. That’s always the hard part — participation and standardization. They’re doing it, and I couldn’t be happier.

But these data are incredibly valuable,  and their value is currently limited because they’re boxed up.

Spreading these data far and wide is good for scholarship, and I can’t imagine the case that could be made showing it’s better for JISC to keep them at a single endpoint.

The knee-jerk reaction is always, I know, to keep things behind a wall, even if it’s a short wall. “Things will get out of sync if people have their own copies.” Or, “We’ll provide whatever access you need, as fast as you need it, honest.” Or, “We’re going to be providing value-added services on top of the data.”

It’s all true. Things will get out-of-sync — but that’s going to happen whether you encourage people to not cache results or not. And I don’t doubt for a moment that the API provided will be great. And of course you’ll be in a position where you can provide value-added services.

But so can the rest of us.

I’ve run into this myself. I fear…well, let’s be honest. I fear providing a service, having the data stripmined, and then having no one appreciate the front-end I put on it. I do this job for the fame, not the fortune. Obviously.

But I’ll never provide services as fast as me plus three hundred other geeks, all responding to different situations and servicing different patrons.

So…provide an API. Start simple: a single call named getCurrentTextFile. Or maybe add getCurrentTextFileGzipped. It’s only ten-thousand lines of text, probably less than 75k gzipped up. I promise to call it every night about 3am local time so I’m up-to-date.

So….pretty please? With sugar on top? My catalog is waiting. So is my SFX install. And our list of ejournals. And our subject guides. And lots of pages on our website. And our pre-packaged OPML files to offer students and professors. And a thousand yet-to-be-devised services as well.

Pretty pretty pretty please???

7 Responses to “TicTocs: Give us a file! Pretty pretty pretty please!”

  1. Hi Bill,

    If you are going to insert links from, say, an A-Z list of your journal subscriptions, which would be better – a) link to the journal RSS feed b) link which would display the latest TOC for each journal in at the ticTOCs website?

  2. Bill says:

    Both, of course :-)

    In many cases, for many of our users, a link to the ticTOCs site is less useful than a link directly to the home page of the journal, esp. if the RSS provided by the publisher is rather sparse. But I’d also like to inline ToC content in some cases. Or ferret out whatever I can about the user and proxy links (or don’t) through an appropriate proxy server. Or mash up filtered “superfeeds” for specific audiences (perhaps even an audience of one).

    But frankly — and I mean no disrespect, I really don’t — I’m not sure why the onus should be on me (and others) to make the case for opening the data. Can you articulate the case for keeping it closed?

  3. We’re not keeping it closed. Our techie has been ill, but is now working on it.

    Filtered superfeeds is an interesting idea. As is something I’m currently considering – Personalised journal current awareness visualisation with animated timeline of emerging research interests.

  4. Take a look at http://www.tictocs.ac.uk/text.php . Does this give you what you are looking for, for now? (When I download its output into a text file from IE7 the fields are delimited by a space rather than a tab which is pretty useless, but I’m assured that it really is tab-delimited and that whatever you are using is likely to see the tabs – tell us if he’s wrong!)

  5. [...] by Terry Bucknell on February 11, 2009 We’ve answered the call of developers like the Robot Librarian by providing a simple tab-delimited text file that contains all of the titles, ISSNs and feed URIs. [...]

  6. The file certainly is tab-delimited, just checked it via a command-line download.

    Great blog post, BTW. A very good reminder to get stuff out there with the lowest friction possible, and not to lie awake at night worrying about lost opportunities (or fame).

  7. [...] useful directory of RSS-format journal TOCs, available at: http://www.tictocs.ac.uk) -  after they were asked, and agreed to ‘open the box‘ on their data – by making it available in a format [...]


Five rules to make your open source more open

January 25, 2009 at 5:40 pm | Category: Uncategorized

[I've noticed that a sure way to get people to look at stuff (as measured by, say, digg) is to include a number. So I did. Five. ]

Over at Bibliographic Wilderness, Jonathan Rochkind has a great followup to an ongoing discussion on the Blacklight list called How to build shared open source in which he tackles some of the differences between open-sourcing your code (a legal and distribution issue) and actually making it so someone else can usefully contribute to your code.

The project I’m spending most of my time on right now, VUFind, is a great piece of functional code but, in many ways, a nightmare in terms of trying to contribute code and abstract out local functionality. This isn’t meant as a slam on the main contributor(s) to VUFind (Andrew, especially, seems to be an almost frighteningly-productive coder), but my experiences trying to customize the code to our local situation have given me a lot of time to think about how I wish things had been architected.

So, here I give some general rules and some specifics as to What I Wish I Had To Work With.

1. Abstraction

General rule: Abstract things out as much as makes sense

Specific rule: Abstract the living crap out of your authentication scheme.

Look, pretty much everyone with anything worth protecting already has an auth/authZ infrastructure in place. Sometimes an extensive, perhaps multi-institutional infrastructure. One that isn’t going to be bypassed without, say, getting fired.

So if you’re going to require people to log in, make sure you make that process as abstract as you possibly can, both in algorithm and in code. Have a singleton class that’s easily subclassed to represent your user, and call it exclusively. Make sure that your URIs are easily separated into those that require auth and those that don’t, for simple use of mod_rewrite or whatnot to redirect to authentication. Make sure it’s easy to hook into (or work around) AJAX links that might require authentication that has expired.

And for the love of god, don’t stuff username/password information into a cookie if you’re doing web work. Use a session and session key. Any auth scheme that I can spoof is no auth scheme at all, because I’m an idiot and not even trying hard.
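
To make that a bit more concrete, here's a small Python sketch of the shape I have in mind: one abstract class the application calls, a local subclass that does the real work, and an opaque random session key as the only thing that ever goes into the cookie. Every name here is hypothetical, and the toy password check stands in for whatever your institution's real auth system does.

```python
import secrets
from abc import ABC, abstractmethod


class Authenticator(ABC):
    """The single interface the rest of the application calls."""

    @abstractmethod
    def authenticate(self, username, password):
        """Return a user id on success, or None on failure."""


class StaticAuthenticator(Authenticator):
    """Toy backend for this sketch; a local subclass would call SSO, LDAP, etc."""

    def __init__(self, users):
        self._users = users

    def authenticate(self, username, password):
        expected = self._users.get(username)
        if expected is not None and secrets.compare_digest(
            expected.encode(), password.encode()
        ):
            return username
        return None


class SessionStore:
    """Server-side sessions: the cookie holds only an unguessable key."""

    def __init__(self):
        self._sessions = {}

    def start(self, user_id):
        key = secrets.token_urlsafe(32)
        self._sessions[key] = user_id
        return key

    def user_for(self, key):
        return self._sessions.get(key)


def login(auth, sessions, username, password):
    """Return a session key to set as the cookie value, or None."""
    user_id = auth.authenticate(username, password)
    return sessions.start(user_id) if user_id is not None else None
```

Swapping in a different institution's login then means writing one new Authenticator subclass, with no changes to the code that calls login.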

2. Configuration files

General rule: use config files for anything local

Specific rule: Use a configuration file format that can represent complex data

That’s right, I’m looking at you, .ini and .properties files.

Use something like YAML, or XML, or even straight programming-language code (i.e., a file with a PHP hash or a perl hashref or whatnot) that can actually represent, in a logical way, the complexities of the stuff you need to configure. And then, again, have a singleton class that will read that data and expose it in a useful and safe way.

And include a semantics checker if you can manage to write one.  It’ll save everyone a load of trouble.

Huge bonus points if your configuration singleton class can read from multiple files, overriding previous (default) definitions with subsequent (local) ones.
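
Here's a rough Python sketch of that "defaults first, local overrides second" idea. I'm using JSON text only because it ships with Python (YAML would work the same way with a third-party parser), and the function names and config keys are invented for the example.

```python
import json


def deep_merge(base, override):
    """Return a copy of base with override layered on top, recursing into dicts."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value      # scalars and lists are replaced wholesale
    return merged


def load_config(*json_texts):
    """Merge config documents in order; later ones override earlier ones."""
    config = {}
    for text in json_texts:
        config = deep_merge(config, json.loads(text))
    return config


defaults = '{"solr": {"host": "localhost", "port": 8983}, "facets": ["language"]}'
local = '{"solr": {"host": "solr.example.edu"}}'
config = load_config(defaults, local)
```

The local file only has to mention what differs, so upgrading the distributed defaults doesn't clobber anyone's site-specific settings.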

3. Hide subapplications

General rule: Don’t force your user to intimately understand every piece of every library/application you include

Specific rule: Generate configuration information for sublibraries/applications

This might be a little specific to the project I’m working on now, which uses Solr as a backend, but I think it applies more generally.

If you’re using a non-brain-dead configuration file format, and if you can assume reasonable defaults, then generate configuration files for your user. A low-level extreme of this is the traditional unix autoconf, which essentially allows you to install software without knowing a damn thing about your own system. Which is useful to those of us that don’t.

In VUFind, there are three files — a .properties file that specifies how to map MARC data into  field names,  Solr’s schema.xml that describes the structure and behavior of those same fields, and an XSLT stylesheet that pretties the data as it comes out of Solr to make it easier to work with. As you might expect, the overlap in data is about 80% across the three of them, and it would be a bazillion times easier to have a single file that generated all three.

OK. Maybe not a bazillion, because if it was that easy, I’d have taken a couple hours to write the code to do it already. Let’s say just a zillion times easier.

The caveat to this is that you need to either make sure your config file specification is complete enough to encompass everything all the other files might need to know (bad), or that the other config files can import subsections that override your defaults (good).

4. Testing

General rule: practice test-first (TDD or BDD) development

Specific rule: write your code in such a way that it’s testable

Look, we all know we should spend the first three weeks writing eight thousand tests to describe every corner of the code. And we all have bosses that will ask, every morning about 10:30am, “So, what do you have that you can show me?”

Not everyone is going to be able to write tests first. That’s not right, it’s not smart, but it’s the way the world works. But at least put in the hooks so someone else can come along and write tests.

Writing tests is one of the easiest ways that a newbie can come along to a project and instantly contribute in a meaningful way. But if you’re constantly calling global variables, depending on live database connections and not providing a way to mock them up, or throwing fatal errors if every subsystem isn’t present no matter the context, then it’s going to be hard to write tests.  So hard, in fact, that not only will you not do it, but neither will anyone else.
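
As a tiny illustration of "put in the hooks," here's a Python sketch where the code that needs holdings data is handed a function to fetch them instead of reaching for a global connection. The class, the status values, and the fake are all made up for the example.

```python
class HoldingsService:
    """Counts available copies; the data source is passed in, not hard-wired."""

    def __init__(self, fetch_holdings):
        # fetch_holdings: any callable taking a record id and returning
        # a list of dicts with a "status" key (an assumed shape)
        self._fetch = fetch_holdings

    def available_copies(self, record_id):
        holdings = self._fetch(record_id)
        return sum(1 for h in holdings if h.get("status") == "available")


def fake_fetch(record_id):
    """Stand-in for the live ILS call, usable in any test."""
    return [
        {"status": "available"},
        {"status": "checked out"},
        {"status": "available"},
    ]


service = HoldingsService(fake_fetch)
count = service.available_copies("any-record-id")
```

In production you'd pass in the function that talks to the real system; a newcomer writing tests only ever needs the fake.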

5. Error handling

General rule: provide a sane, hierarchical set of error classes and hooks to catch them as necessary

Specific rule: THROW SOME GODDAMN ERRORS!!!!

Don’t be an idiot. Things will fail. In the absence of Design by Contract or somesuch, errors will happen. Throw them. Catch them. But at least throw them, instead of letting your code die six hundred lines later with a “Cannot cast null value to string” when you finally get around to trying to print something out.
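
A minimal Python sketch of what I mean by a hierarchy, with invented class names: one root error for the application, a few subclasses by subsystem, and code that throws at the point where things actually went wrong.

```python
class CatalogError(Exception):
    """Root of the application's error hierarchy (hypothetical names throughout)."""


class ConfigError(CatalogError):
    """Something in the configuration is missing or malformed."""


class BackendError(CatalogError):
    """A backend (index, ILS, etc.) returned something unusable."""


class IndexUnavailable(BackendError):
    """The search index could not be reached at all."""


def first_title(doc):
    """Return the first title, failing loudly right here if there isn't one."""
    titles = doc.get("title")
    if not titles:
        raise BackendError(f"record {doc.get('id')!r} has no title field")
    return titles[0]


try:
    first_title({"id": "000123"})
except CatalogError as err:      # callers can catch broadly or narrowly
    caught = err
```

Because IndexUnavailable is also a BackendError and a CatalogError, a caller can handle just the case it knows how to recover from and let everything else bubble up.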

One Response to “Five rules to make your open source more open”

  1. Good stuff Bill. I’m thinking we should make a wiki page on code4lib wiki “design patterns for successful collaboratively shared library software”, start it with your post. Maybe if there’s some stuff in mine too not covered by yours, but I think you’ve hit what I had with more actionable specificity.
