• MARC-HASH: The saga continues (now with even less structure)

    After a medium-sized discussion on #code4lib, we’ve collectively decided that…well, ok, no one really cares all that much, but a few people weighed in.

    The new format is: A list of arrays. If it’s got two elements, it’s a control field; if it’s got four, it’s a data field.

    SO….it’s like this now.

     { "type" : "marc-hash", "version" : [<more>
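
    The two-or-four rule above can be sketched in a few lines of Python. This is only an illustration, not the reference implementation: the name classify_field is made up, and the layout of a four-element data field (tag, two indicators, then a list of subfield pairs) is an assumption that the excerpt doesn't spell out.

```python
# Sketch only: classify marc-hash fields by array length.
# Assumption: a four-element data field is [tag, ind1, ind2, subfields],
# with subfields as a list of [code, value] pairs.

def classify_field(field):
    """Return 'control' for a 2-element field, 'data' for a 4-element one."""
    if len(field) == 2:
        return "control"
    if len(field) == 4:
        return "data"
    raise ValueError(f"unexpected field length {len(field)}: {field!r}")

# Hypothetical sample fields, kept in record order.
fields = [
    ["001", "value of the 001"],
    ["245", "1", "0", [["a", "A title"], ["c", "An author"]]],
]

kinds = [classify_field(f) for f in fields]
print(kinds)  # ['control', 'data']
```

    Because the fields live in one flat list, tag order survives a round trip for free.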
  • MARC-HASH control field, now with less structure

    Why do I ever, ever think that MARC might not rely on order? I don’t know.

    In any case, control fields will now be just an array of duples:

     control: [ ['001', 'value of the 001'], ['006', 'value of the 006'], ['006', 'another 006'] ]
    ... <more>
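
    A small Python sketch of why pairs beat a hash here: a hash keyed on tag silently drops a repeated 006, while a list of duples keeps both, in order. The helper name values_for is hypothetical.

```python
# Sketch: a list of [tag, value] pairs versus a hash keyed by tag.
control = [
    ["001", "value of the 001"],
    ["006", "value of the 006"],
    ["006", "another 006"],
]

# A dict keyed by tag keeps only the last value for a repeated tag.
as_dict = dict(control)
print(len(as_dict))  # 2

def values_for(pairs, tag):
    """Return every value for `tag`, in record order."""
    return [value for t, value in pairs if t == tag]

print(values_for(control, "006"))  # ['value of the 006', 'another 006']
```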
  • MARC-Hash: a proposed format for JSON/YAML/Whatever-compatible MARC records

    In my first shot at MARC-in-JSON, which I appropriately (and prematurely) named MARC-JSON, I made a point of losing round-tripability (to and from MARC) in order to end up with a nice, easy-to-work-with data structure based mostly on hashes. “Who really cares what order the subfields come in?” I asked myself.

    Well, of course, it turns out some people do. Some even care about the order of the tags. “Only in the 500s…usually” I... <more>

  • A plea: use Solr to normalize your data

    [Only, of course, if you’re using Solr. Otherwise, that’d be dumb.]

    We’ve been working on Mirlyn2-Beta, our installation of VuFind, for some time now (don’t let the fancy-pants name scare you off), and the further we get into it, the more obvious it is that I want to move as much data normalization into Solr itself as possible.

    Arguments about how much business logic... <more>

  • Enough with the freakin' LC Call Number normalization!

    OK. I’m done with it, and this time I mean it.

    I’ve updated and improved the LC normalization code, documented the algorithm, and put it all into Google Code. In the next couple of weeks, I’ll be turning it into a Solr text filter so we can do some decent sorting on call-number search results.

  • Ask, and you shall receive, and it shall be AWESOME!

    The good folks at ticTOCs heard the call for open data, and they responded…exactly as I asked them to. Which makes me think I should have asked for a pony, too, but I’m still very, very happy!

    Anyone can now download a simple tab-delimited text file describing all the journal table of contents RSS files they’ve assembled, for use however anyone wants.

    The data include ISSNs and eISSNs (where available), the title of... <more>

  • TicTocs: Give us a file! Pretty pretty pretty please!

    For those who haven’t heard, ticTOCs is a service that provides web-based access to a database of Journal RSS/Atom Table of Contents feeds. Awesome.

    On their blog, News from TicTocs, a post titled “I want to be completely honest with you about ticTOCs” notes that:

    As for the API - yes, we’ve been asked this several times, and the... <more>
  • Five rules to make your open source more open

    [I’ve noticed that a sure way to get people to look at stuff (as measured by, say, digg) is to include a number. So I did. Five. ]

    Over at Bibliographic Wilderness, Jonathan Rochkind has a great follow-up to an ongoing discussion on the Blacklight list, called “How to build shared open source,” in which he tackles some of the differences between open-sourcing your code (a legal and distribution issue) and actually... <more>

  • And then I finally shut the hell up

    I had a great (great, I tell you!) 30-second conversation with Ken Varnum (of RSS4Lib fame) that went something like this (much paraphrasing, obviously):

    B: You're gonna have to fix that interface. The standard header won't work.
    K: Well, no, we're going to leave it as it is.
    B: It's not gonna work.
    K: We've decided to make it all consistent.
    B: OK, you... <more>
  • Normalizing LoC Call Numbers for sorting

    Updated: I missed a ‘?’ in the original code that pushed a single cutter into the second-cutter position. Fixed below.

    Crap. Update 2: Initial letters can be three characters long. Regexp and output changed.

    LoC call numbers tend to be a mess, and I’ve spent this morning trying to normalize them for easy string comparison.

    The Perl function below takes a call number (with some level of sloppiness) and returns a string suitable for... <more>
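
    For a rough feel of the idea, here is a minimal Python sketch. It is not the Perl function from the post: the regex, the padding widths, and the name normalize_lc are all assumptions made for illustration. It does follow the two updates above, allowing one to three initial letters and filling the first-cutter slot before the second.

```python
import re

# Sketch only: an assumed pattern for common LC call number shapes.
LC_PATTERN = re.compile(
    r"""^\s*
    (?P<letters>[A-Z]{1,3})            # class letters: one to three
    \s*
    (?P<digits>\d{1,4})                # whole part of the class number
    (?:\.(?P<decimal>\d+))?            # optional decimal part
    \s*
    (?:\.?\s*(?P<cutter1>[A-Z]\d+))?   # optional first cutter
    \s*
    (?:\.?\s*(?P<cutter2>[A-Z]\d+))?   # optional second cutter
    \s*
    (?P<rest>.*?)                      # anything left over (year, volume)
    \s*$""",
    re.VERBOSE,
)

def normalize_lc(call_number):
    """Return a string that sorts sensibly, or None if nothing matches."""
    match = LC_PATTERN.match(call_number.upper())
    if match is None:
        return None
    letters = match.group("letters").ljust(3)               # 'QA' -> 'QA '
    digits = match.group("digits").zfill(4)                 # '76' -> '0076'
    decimal = (match.group("decimal") or "").ljust(6, "0")  # '73' -> '730000'
    pieces = [f"{letters}{digits}.{decimal}"]
    for name in ("cutter1", "cutter2", "rest"):
        if match.group(name):
            pieces.append(match.group(name))
    return " ".join(pieces)

print(normalize_lc("QA76.73.P22 W35 1996"))  # QA 0076.730000 P22 W35 1996
print(sorted(["QA76.9", "QA9", "Q100"], key=normalize_lc))
# ['Q100', 'QA9', 'QA76.9']
```

    A plain string sort would put QA76.9 ahead of QA9; padding the class number is what fixes that.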