Category: Uncategorized

DANGER! I was trying to re-verify my numbers and found a glaring and hugely important mistake. I’ll make a new post with the details, but basically I was counting about 180k sessions (out of only 735k) that I should have been ignoring. Please ignore my basic stats until further notice. See the new numbers and corrected slides for more accurate data.

===============

I did a little Lightning Talk at Code4Lib 2011 and cleaned up (and heavily annotated) my slides for anyone interested in them.

The focus was on some basic stats about usage of our OPAC, Mirlyn, in calendar 2010.

I’ll be doing some posts and/or more rigorous writing on this stuff soon, but wanted to get these up in a timely fashion.

3 Responses to “[RETRACTED] Code4Lib 2011 Lightning Talk Slides”

  1. [...] This post was mentioned on Twitter by Dan Chudnov and Jennyann, Bill Dueber. Bill Dueber said: Slides from my lightning talk about OPAC stats are up. http://bit.ly/dMvgPb #c4l11 [...]

  2. Aaron Tay says:

    Really enjoyed your slides. Love the long presentation notes. most slides I can’t tell what the speaker is trying to say but for yours no such problems. Just added Google analytics to the catalogue so this is very timely, you have far better statistics but still good figures to compare.

    Just finished reading a year’s worth of your blog post. Really really nice blog, I can’t fully follow 100% of the details (I’m technically a reference librarian), but really enjoy most of the entries and the statistics of the catalogue you report.

  3. I personally (speaking only for myself) think this would make a good short Code4Lib Journal article, and encourage you to submit one.

Four things I hate about Ruby

January 13, 2011 at 4:17 pm | Category: Uncategorized

Don’t get me wrong. I use ruby as my default language when possible. I love JRuby in a way that’s illegal in most states.

But there are…issues. There are with any language and the associated environment. These are the ones that bug the crap out of me.

  • Ruby is slow. Let’s get this one out of the way right away. Ruby (at least the MRI 1.8.x implementation) is, for many things, slow. Sometimes not much slower. Sometimes (e.g., numerics) a hell of a lot slower.

Now, there’s nothing necessarily wrong with that. For what I do, MRI Ruby is usually fast enough, and JRuby is pretty much always fast enough. But the community response (“Buy more hardware! Programming time is more expensive than CPUs anyway! These are not the droids you’re looking for!”), esp. surrounding Rails, is simply annoying. If The Powers That Be want to make a decision to not worry about improving the performance of the language, well, that’s fine then. But to pretend — or even insist — that it’s not an issue at all, well, that’s just disingenuous.

  • Version nonsense. Yes, yes, I understand the historical process that produced a version 1.9.x that’s not backwards compatible with 1.8.x. But it’s dumb. Gems don’t often seem much better (including, hypocritically, my own). Versioning — meaning assigning version numbers in such a way that the underlying semantics are transparent — doesn’t seem to be something The Ruby World ™ is all that interested in.

  • No relationship between gem names and the modules they contain. This drives me freakin’ crazy. The Perl community does a great job with this. One module per file, one file per module, filenames follow the module names. I know exactly what to put after use in Perl. In Ruby, what comes after require is anybody’s guess.

  • Lack of thread-safety. Look, I get it that the MRI doesn’t have real threads. And so maybe there’s not a huge incentive on the part of the core folks to make things thread-safe in general. But at least one language construct — autoload — is just plain broken under real threads, with seemingly little interest in getting it fixed.

Comments are closed.

There’s a common problem among developers of websites that paginate, including OPACs: how do you provide a single item view that can have links that go back to the search (or to the prev/next item) without making your URLs look ugly?

The fundamental problem is that as soon as your user opens up a couple searches in separate tabs, your session data can’t keep track of which search she wants to “go back to” unless you put some random crap in the URL, which none of us want to do.

But let’s take three giant steps backwards before we throw a ton of resources at this problem, and ask, “Does anyone use those links?”

Data from Mirlyn, the University of Michigan OPAC

Here’s the data since February of 2010 for Mirlyn, our library OPAC.

 Action                        Count   Pct. of basic search count
 Basic search (baseline)   1,446,881   100%
 Previous record               1,347   0.09%
 Next record                   8,394   0.58%
 Back to search                9,568   0.66%
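
Each percentage is just the action's count divided by the basic-search count. A quick sketch of that arithmetic in Ruby, using the counts from the table (the constant and method names are mine, purely for illustration):

```ruby
# Counts copied from the table above; names are illustrative only.
BASIC_SEARCHES = 1_446_881

ACTION_COUNTS = {
  'Previous record' => 1_347,
  'Next record'     => 8_394,
  'Back to search'  => 9_568
}.freeze

# Express a count as a percentage of the basic-search baseline,
# rounded to two decimal places.
def pct_of_baseline(count, baseline = BASIC_SEARCHES)
  (100.0 * count / baseline).round(2)
end

ACTION_COUNTS.each do |action, count|
  puts format('%-16s %6d  %.2f%%', action, count, pct_of_baseline(count))
end
```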

For what it’s worth, I looked at these numbers by percentage of sessions as well, and the numbers come up a little higher — about 0.8% of all sessions included at least one click of the “Back to Search” button.

Given these numbers, I’m pretty sure I wouldn’t put a whole lot of effort into it. In general, next/prev record navigation only makes sense when you have a really, really small number of hits, anyway.

So…why not just disappear the links? I know people will complain, but hopefully our days of doing an enormous amount of work for …well, some tiny but vocal minority…are past.

One Response to “Does anyone use those prev/next/back-to-search links?”

  1. Peter Murray says:

    Hmmm, interesting point. I wonder, though, if there is a usability issue on the UI. There are occasionally vocal complaints about not being able to use the traditional OPAC for browsing, and removing these sorts of links would be a detriment to that. Before giving up entirely I might try some user experience testing with different ways to include the back-and-forward links (e.g. with cover images, with the title of the previous/next work as part of the link text, etc.).

    Admittedly, though, the UX of a few e-commerce sites I just tried do not include back/forward links from work description pages.

Ross Singer recently updated ruby-marc to include a #to_hash method that creates a data structure that is (a) round-trippable without any data loss, and (b) amenable to serializing to JSON. He’s calling it marc-in-json (even though the serialization is up to the programmer, it’s expected most of us will use JSON), and I think it’s the way to go in terms of JSON-able MARC data.

I wanted to take a quick look at the space/speed tradeoffs of using various means to serialize MARC records in the marc-in-json format compared to using binary MARC-21.

Why bother?

Binary MARC-21 is “broken” in that a lot of us have records that are so long (more than 99,999 bytes) it’s impossible to create a valid MARC binary record. The standard alternative, MARC-XML, has huge file sizes (roughly 3 times as large) and runs a lot more slowly in every benchmark I’ve ever run. For ruby-marc, the penalty for using XML is further exaggerated because the serializer is based on REXML and is super-slow.

There have been a few proposals for a MARC data structure that can easily be serialized to JSON (I had my own, in fact), but the stuff Ross has done with marc-in-json is preferable in being (a) not a ton bigger in terms of file size, and (b) much easier to query from a NoSQL database using something like JSONPath or JSONQuery.

What I’m testing

For this test, I used:

  • marc21 binary This is the stock serialization / deserialization provided by ruby-marc.
  • YAJL for JSON YAJL is a very fast C-based JSON library. Here we’re using the Ruby bindings and calling Yajl::Encoder.encode(r.to_hash) to serialize and MARC::Record.new_from_hash(Yajl::Parser.parse(JSON)) to deserialize.
  • Msgpack The Msgpack project is explicitly designed to be “binary JSON” — smaller, faster, etc. — at the expense of human readability/editability. Again, this used the Ruby bindings.
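
To make the round trip concrete without needing ruby-marc or YAJL installed, here is a minimal sketch that uses a hand-built hash in the marc-in-json shape and Ruby's standard `json` library as a stand-in for YAJL. The record content is made up for illustration; with ruby-marc the hash would come from `r.to_hash` rather than being typed in by hand.

```ruby
require 'json'

# A tiny, invented record in the marc-in-json shape: a leader string plus an
# array of single-key hashes, one per field. Control fields map tag => string;
# data fields map tag => indicators plus an array of single-key subfield hashes.
record_hash = {
  'leader' => '00000nam a2200000 a 4500',
  'fields' => [
    { '001' => 'demo0001' },
    { '245' => {
      'ind1' => '1',
      'ind2' => '0',
      'subfields' => [{ 'a' => 'An example title /' }, { 'c' => 'A. Author.' }]
    } }
  ]
}

# Serialize to a string (the thing you would store in, say, a Solr field) ...
json_string = JSON.generate(record_hash)

# ... and parse it back into plain Ruby hashes and arrays.
round_tripped = JSON.parse(json_string)

puts json_string.bytesize
puts round_tripped == record_hash
```

Because the structure is nothing but hashes, arrays, and strings, any JSON-like serializer can handle it, which is what makes swapping YAJL for Msgpack a fair comparison.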

The benchmark and its results

I’m interested in how long it takes to serialize and deserialize a single record. My primary use-case is sticking a single record into Solr, and then pulling the string representation of that record out and turning it back into MARC.

It’s entirely possible that trying to deal with a whole set of MARC records — as a JSON array of marc-in-json objects, or as a set of newline-delimited JSON or Msgpack objects — would yield different results. The former is especially interesting, since to parse a large JSON array one needs to use a streaming parser, which will almost certainly have a different profile in both processing and memory use.

The ambitious can see the full source code of the benchmark.
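
For readers who just want the shape of such a measurement, the following is a rough sketch of a timing loop, not the benchmark that produced the numbers below. It uses the standard `benchmark` and `json` libraries, and the records are small stand-in hashes rather than real MARC data; `time_round_trips` is a name I made up.

```ruby
require 'benchmark'
require 'json'

# Time `cycles` passes of serializing and then deserializing every record.
# `serialize` and `deserialize` can be any callables.
# Returns [seconds_serializing, seconds_deserializing].
def time_round_trips(records, cycles, serialize, deserialize)
  ser_time = 0.0
  deser_time = 0.0
  cycles.times do
    strings = nil
    ser_time += Benchmark.realtime { strings = records.map { |r| serialize.call(r) } }
    deser_time += Benchmark.realtime { strings.each { |s| deserialize.call(s) } }
  end
  [ser_time, deser_time]
end

# Stand-in data: small hashes instead of real MARC records.
records = Array.new(1_000) do |i|
  { 'leader' => 'x' * 24, 'fields' => [{ '001' => i.to_s }] }
end

ser, deser = time_round_trips(records, 5, JSON.method(:generate), JSON.method(:parse))
puts format('serialize %.3f s, deserialize %.3f s', ser, deser)
```

Timing serialization and deserialization separately matters here because the two can differ a lot, and only one of them may be on your hot path.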

Note that the following represent only the performance of ruby-marc and the particular serializers used. Other platforms or other libraries will certainly give different results!

Total of 18,880 records run 20 times (377,600 serialize/deserialize cycles per method) on my Mac OS X desktop; comparisons are to MARC21-Binary.

 SERIALIZING
   MARC Binary       357.02 s (100%)
   YAJL              312.65 s ( 88%)
   Msgpack           266.26 s ( 75%)

 DESERIALIZING
   MARC Binary       648.91 s (100%)
   YAJL              507.64 s ( 78%)
   Msgpack           459.73 s ( 71%)

 SERIALIZE + DESERIALIZE
   MARC Binary      1005.93 s (100%)
   YAJL              820.29 s ( 82%)
   Msgpack           725.99 s ( 72%)

 SIZE
   MARC Binary   31.15 MBytes (100%)
   Msgpack       42.00 MBytes (135%)
   JSON          55.99 MBytes (180%)
   XML           93.42 MBytes (300%)

Analysis, such as it is

Obviously, there are size/speed tradeoffs. Nothing is as small as binary MARC21, but both YAJL and Msgpack are faster — significantly so for deserialization, which happens to be where I want the speed for my uses.

At 80% larger, the JSON serialization is quite a bit bigger, but it’s a hell of a lot smaller than MARC-XML and suffers none of the limitations of binary MARC.

For a closed system (i.e., you’re not worried about anyone else being able to read your data) such as a Blacklight installation, I’d be tempted to move to using JSON sooner rather than later.

One Response to “Size/speed of various MARC serializations using ruby-marc”

  1. Andy says:

    Minor correction: in the “why bother” section, valid binary MARC-21 can be up to 99999 (not 9999) bytes.

    Interesting write-up, I especially appreciate the metrics-based posts you do – always good to have some facts! Thanks.

VuFind Midwest gathering

September 16, 2010 at 11:55 am | Category: Uncategorized

A couple weeks ago, representatives from UMich (that’d be me), Purdue, Notre Dame, UChicago, and our hosts at Western Michigan got together in lovely Kalamazoo to talk about our VuFind implementations.

Eric Lease Morgan already wrote up his notes about the meeting, and I encourage you to go there for more info, but I’ll add my two cents here.

So, in light of that meeting, here’s what I’m thinking about VuFind of late:

  • None of us are running VuFind 1.0 as released with full catalog data. Eric has a special-purpose portal running the current code over an aggregated special collection and hasn’t done much to the underlying PHP. The rest of us were running heavily modified versions of RC1. An issue we had in common was that the changes from RC1 to RC2 to the 1.0 release were so significant, including some complete architectural changes (some based on the stuff I’ve done with Mirlyn), that the effort required to catch up with 1.0 would be no less significant than the effort to switch wholesale to something else (e.g., Blacklight).

  • A point that I made that was echoed by others is that we need to remember that these new discovery systems are all just thin wrappers over Solr. They basically have two jobs: to get a query and format it in a way that Solr can handle, and then to take the Solr results and display them. There’s some sugar on top of that (exporting, tagging, etc) but that’s really it. The heavy lifting is all done by your indexer (Solrmarc for most, although watch this space for my announcement of my JRuby-based stuff today) and Solr itself. It’s not a hard problem, although it is occasionally a messy one.

  • VuFind has, in my mind, fundamental architectural issues mostly based on the inability to easily separate local code from core code. A re-architecture to base everything on subclasses of the core code would help, but at some point you start to run up against fundamental limitations of PHP and Smarty to do things cleanly. Without the ability to update core code and know it won’t affect your local code, there’s no good way to keep on track with the trunk of the code and do upgrades; for the same reason, it’s almost impossible to send changes back to trunk.

  • Coupled tightly to the architectural issues is the lack of tests. The code is potentially very brittle; there’s no good way to know if you’re breaking anything until you notice it’s broken. It’s not at all clear how to write good tests for the code, because there’s a lot of inter-dependencies.

  • The second big problem is one of community; to wit, there isn’t much of one. There are some active players, and there’s what seems like a great conference going on right now, so this may change. But — especially because of the technical difficulties in contributing local changes back — VuFind could use a benevolent dictator, someone who has organizing and administrating VuFind as a part of his/her job. The last bit is important.

All of these are surmountable issues. The reason they’re at the top of my head, of course, is that the Blacklight community has, in many ways, already taken care of most of them.

If I were starting from scratch tomorrow, we’d already decided to do something locally, and I could convince my systems people to run a ruby implementation (I like JRuby myself), I’d go with Blacklight. If we were already looking at something like Summon, I’d take a hard, hard look at the build-vs-buy numbers. Summon and Primo both give you APIs to program an interface against, and boy, it might be worth the effort to do so and leave everything else alone.

10 Responses to “VuFind Midwest gathering”

  1. The hard aspect of the problem is making something as flexible as libraries (reasonably IMO) require, that can be shared by multiple ‘customers’ with their own flexible configuration and customization, while still keeping a common codebase and being able to share updates.

    I think this actually IS hard. That Blacklight is kind of sort of able to accomplish it is a result of lots of hard work, and not entirely successful (yet).

    So one could say, forget that, let’s just have our own local homegrown wrapper on top of Solr. There are trade-offs to that, but it does certainly reduce the ‘hardness’ of the problem. Although I still think you’re left with a somewhat harder problem than you imply, with all the features we want in a modern library discovery system.

    Alternately, one could say, let’s pretty much drop the flexibility and customization. We’ll have some configuration for what Solr to point to, and we’ll have the simplest possible hooks for ILS-specific info and functionality (item status, request buttons, etc), but except for that, all ‘customers’ will pretty much be running the same thing, but for what they can manage to do in CSS alone. That would also simplify the problem a lot. (And is more or less the approach that most proprietary vendor software takes — when proprietary vendors have tried to make software as cleanly flexible as we’re trying to make BL, they haven’t generally succeeded very well. Which is again, in part, because it is not in fact an entirely easy problem. ).

  2. PS: I think just about all non-trivial open source projects need a benevolent dictator, until the community grows enough to have a benevolent junta instead. (But pretending you have a benevolent junta when most people on the junta don’t take their responsibilities seriously and figure some other junta member will take care of it, does not work).

    This is my impression of what’s almost a rule for distributed-collaboration, volunteer-done open source. There are some exceptions, but they will generally have some unusual characteristics that make them exceptions.

    If there isn’t an official benevolent dictator, and the open source project is seeming successful, there’s probably to some degree an unofficial one. Who is filling the role(s) can change over time, sometimes.

    One reason for this ‘law’ is that architecture matters. ESPECIALLY when you’re trying to create shareable “framework” style code, instead of just a custom fit application. And design/architecture by committee, especially a committee of people of varying levels of commitment, time, feeling of responsibility, skill, and perspective, doesn’t work out that well. You’ve got to have someone with a vision, whose vision makes sense (if it’s going to be successful), who feels an obligation/responsibility to apply that vision.

    One of the downsides of extreme focus on test-driven/agile/extreme programming, is that you can think that just cause that feature you added in an ad hoc way passes all the tests, that’s the only metric of evaluating the quality of your software. It isn’t, architecture matters. Which is starting to get talked about more, as a corrective to focusing too much on “get it done quick with a test”, bringing things back to balance. Here’s an example of the genre that might not be the best, but I happened to read this morning on reddit: http://www.infoq.com/news/2010/09/big-ball-of-mud

  3. And, I can’t stop talking, sorry. (I should turn this into my own blog post, but I know Bill doesn’t mind).

    Blacklight may be, at the moment, more successful at those things than VuFind, but it’s cause of a lot of sweat, and from the inside the battle isn’t over yet.

    And as far as Summon or Primo: If you ignore their interface entirely and just go with their APIs, then this isn’t going to be any easier than programming on top of Solr APIs. It’s going to be the same, in fact. And have the same issues we’re talking about here: go it alone homegrown, try to share something on top of those APIs, etc.

    The legit reason to me to go with summon or primo+primoCentral is because their indexes include publisher and aggregator supplied article-level data. That is something that is going to be VERY hard to do without vendor support. But just the APIs to query their indexes, with your own interface on top? What’s the “everything else” you think this will do for you, that you don’t get with your own interface on top of Solr already?

  4. David says:

    Interesting discussions. We’re about to start implementing Summon, and one of the things I like about it is its API. I’d started to think about having some other frontend (especially hearing that there is a connector already built for VUFind) but do wonder what the point of it would be.

    Summon has a reasonably nice (if sparse) interface and theoretically everything we want people to search is in there, so setting up VUFind or Blacklight as a frontend would seem to only give us more configuration choices in the page design, at the expense of some non-trivial setup and maintenance costs. Are there any other great benefits that I haven’t thought of?

  5. David says:

    Also, the whole benevolent dictator / junta thing I think is part of Koha’s problems at the moment.

    They have that set up for each release (with a release manager and other roles) but there’s no “steering committee” type organization to oversee the project. They have a really good community, but when one of the support companies goes rogue (spreading FUD, locking the community out of the website etc) there’s no coherent and official body to respond.

  6. This rings very true for me. A couple of thoughts: Separation of local and core code is a big problem we are currently working around. Additionally, it appears that VuFind has separated template code from other code but not presentation code from logic code. I’ve yet to be convinced of Smarty (why write a template interpreter/compiler in a language that is already very good for templating?) and would like to be able to just use PHP as my templating language.

  7. I tend to agree with you, David, about the unclear benefit of putting a custom interface on Summon.

    What people seem to be doing in VuFind is having a locally indexed content search, and in a separate tab a separate Summon search. As far as I know, VuFind won’t merge these results into one search; you get your local search, or your Summon search.

    Why would you want a local search, when all that local content is probably already in your Summon anyway? Well, having a local search gives you a LOT more control over relevancy ranking and faceting, and adding additional features like my date range timeline/histogram. Example: http://tinyurl.com/2b5wujr . You couldn’t do that in Summon. But you can’t do it in a Summon search just by wrapping it in VuFind either — all you can do is have a local search with features like that, and a separate Summon search. If you ARE going to have both, I suppose there are benefits to wrapping them in a consistent VuFind interface, with a single ‘saved records’ area etc (I’m not positive if VuFind does give you a single saved records area).

    But if you’re NOT planning on adding a local index search to Summon, I think you probably get minimal benefit wrapping Summon in VuFind.

    And if you ARE planning on adding a local index search to Summon, you’re giving the users two different searches, which is theoretically what you’re trying to get away from with Summon.

    But there are pros and cons to the Summon approach. You get all that indexed content, but you lose control of your index you can get with Solr/Blacklight/VuFind.

    Where I am, we’re currently choosing to focus on improving our local index search with Blacklight. But that does leave out the indexed article-level data in Summon (or PrimoCentral). Later we may try to combine them the way apparently some people are in VuFind, although it’s an unsatisfactory combination, I think.

  8. And also, when I’m talking about benevolent dictator/junta, I’m talking about developers. Who are actively working on writing and architecting code. I’m not talking about some non-developer administrative policy committee. To the extent successful distributed collaborative community open source projects have those (Apache does), they keep their hands off the code; such a policy committee is really a different thing meant to solve different problems, not quality-of-software problems. Maybe you need one, maybe you don’t, for those other problems, but I think when such a committee starts trying to direct software development, you get a bureaucratic nightmare, not the good software that I’m suggesting a benevolent (developer, architect) dictator may be required for.

  9. You guys should check out Villanova University’s implementation of VuFind and Summon https://library.villanova.edu/Find/Search/Home

    We currently use VuFind and are within a couple of weeks of implementing Summon.

  10. Sorry for being unclear in the last post. “We” being Stephen F. Austin State University.

Simple Ruby gem for dealing with ISBN/ISSN/LCCN

September 13, 2010 at 4:43 pm | Category: Uncategorized

I needed some code to deal with ISBN10->ISBN13 conversion, so I put in a few other functions and wrapped it all up in a gem called library_stdnums.

It’s only 100 lines of code or so and some specs, but I put it out there in case others want to use it or add to it. Pull requests at the github repo are welcome.

Functionality is all as module functions, as follows:

ISBN

  • char = StdNum::ISBN.checkdigit(ten-or-thirteen-digit-isbn)
  • boolean = StdNum::ISBN.valid?(ten-or-thirteen-digit-isbn)
  • thirteenDigitISBN = StdNum::ISBN.convert_to_13(ten-or-thirteen-digit-isbn)
  • tenDigitISBN = StdNum::ISBN.convert_to_10(ten-or-thirteen-digit-isbn)
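
To show what sits behind calls like these, here is a small self-contained sketch of the standard ISBN check-digit arithmetic and the ISBN-10 to ISBN-13 conversion. This is my own illustrative code, not the library_stdnums source: the `IsbnSketch` module and its method names are hypothetical, and unlike the gem's `checkdigit`, these helpers take the number without its final check character.

```ruby
# Illustrative re-implementation; not the library_stdnums source.
module IsbnSketch
  module_function

  # Strip hyphens, spaces, and anything else that is not a digit or X.
  def clean(isbn)
    isbn.to_s.upcase.gsub(/[^0-9X]/, '')
  end

  # Check character for the first 9 digits of an ISBN-10
  # (weights 10 down to 2, modulo 11; a value of 10 is written as X).
  def checkdigit_10(first9)
    sum = first9.chars.each_with_index.sum { |d, i| d.to_i * (10 - i) }
    check = (11 - sum % 11) % 11
    check == 10 ? 'X' : check.to_s
  end

  # Check digit for the first 12 digits of an ISBN-13
  # (weights alternate 1, 3, 1, 3, ...; modulo 10).
  def checkdigit_13(first12)
    sum = first12.chars.each_with_index.sum { |d, i| d.to_i * (i.even? ? 1 : 3) }
    ((10 - sum % 10) % 10).to_s
  end

  # True when the final character matches the computed check character.
  def valid?(isbn)
    s = clean(isbn)
    return s[9] == checkdigit_10(s[0, 9]) if s.match?(/\A\d{9}[\dX]\z/)
    return s[12] == checkdigit_13(s[0, 12]) if s.match?(/\A\d{13}\z/)

    false
  end

  # Prefix with 978, drop the old check character, and recompute.
  def convert_to_13(isbn)
    s = clean(isbn)
    return s if s.length == 13

    body = "978#{s[0, 9]}"
    body + checkdigit_13(body)
  end
end

puts IsbnSketch.convert_to_13('0-306-40615-2')
puts IsbnSketch.valid?('080442957X')
```

Note that the conversion cannot simply copy the old check character across: ISBN-10 and ISBN-13 use different weighting schemes, so it has to be recomputed.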

ISSN

  • char = StdNum::ISSN.checkdigit(issn)
  • boolean = StdNum::ISSN.valid?(issn)

LCCN

  • normalizedLCCN = StdNum::LCCN.normalize(lccn)
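
For anyone wondering what LCCN normalization involves, the sketch below follows my reading of the Library of Congress normalization rules: remove blanks, drop a forward slash and everything after it, then remove the hyphen and left-pad the serial part to six digits. It is an illustration only; `LccnSketch` is a hypothetical name and the gem's actual behavior may differ in edge cases.

```ruby
# Illustrative only; not the library_stdnums implementation.
module LccnSketch
  module_function

  # Returns the normalized LCCN string, or nil when the part after the
  # hyphen is not one to six digits.
  def normalize(raw)
    s = raw.to_s.gsub(/\s+/, '')        # 1. remove all blanks
    s = s.sub(%r{/.*\z}, '')            # 2. drop the first slash and the rest
    prefix, hyphen, serial = s.partition('-')
    return prefix if hyphen.empty?      # no hyphen: nothing more to do
    return nil unless serial.match?(/\A\d{1,6}\z/)

    prefix + serial.rjust(6, '0')       # 3. zero-pad the serial to six digits
  end
end

['n78-890351', '85-2 ', '75-425165//r75', ' 79139101 /AC/r932'].each do |lccn|
  puts "#{lccn.inspect} -> #{LccnSketch.normalize(lccn).inspect}"
end
```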

Again, there’s nothing special here — just letting folks know it’s out there.

Comments are closed.

Here at UMich, we’re apparently in the minority in that we have Mirlyn, our catalog discovery interface (a very hacked version of VuFind), set up to find records that match only a subset of the query terms.

Put more succinctly: everyone else seems to join all terms with ‘AND’, whereas we do a DisMax variant on ‘OR’.

Now, I’m actually quite proud of how our searching behaves. Reference desk anecdotes and our statistics all point to the idea that people tend to find what they’re looking for. I invite you to try our current configuration out — and, of course, let me know if something feels off to you. We have control of our OPAC now, and can actually fix things.

The “problem”: DisMax is weird

The DisMax algorithm is complex. Even if you ignore the fact that we weight some fields (title, author) much higher than others, a fundamental feature of DisMax is that it basically gives ranking based on the question, “What percentage of the words in the document match one of our query terms?”

Most of the time, that’s exactly what you want. In general, items that have all the keywords, and more of them, appear at the top of the search results.

But sometimes you can have just, say, two of your three search terms appearing like a rash all across a relatively short record, and it’ll pop to the top, appearing ahead of records that actually contain all three search terms. Or maybe three of four search terms appear in both title and author (highly-weighted fields) and the same thing happens.

And, yeah, it really happens.

An actual, real-life example

Searching for the three terms information AND architecture AND usability, explicitly requiring all three, gives 12 results.

The equivalent DisMax search (where only two of three need to be found) nets about 4300 results. Which is great — we’re casting a much wider net, with some pretty common words. That doesn’t matter so long as the most relevant results float to the top.

The kicker? The first time an item in the first set appears in the second is at record number 62. Our user is more than three pages in before she even sees a record that contains all three terms.

Again, most of the time, our current algorithm does really, really well in my opinion. But noticing this led to talk about artificially pushing all the “all terms are present” items to the top.

Pushing records that contain all the terms to the top

So, I wanted to:

  • Push records with all search terms to the top, but
  • …don’t otherwise change their scores. i.e., don’t otherwise re-order them in any way, ’cause I’m already happy with my ordering.

It turns out to be harder than I initially thought. I fought with my code for a whole day, then asked for help, and help was provided.

So, with special thanks to Jan Høydahl for his solution, we get this, in Ruby pseudocode:

all_my_terms = %w[information architecture usability]  # e.g., the terms from the example above
anded_terms = all_my_terms.join(' AND ')
bf = 'map(query($qq),0,0,0,100000.0)'  # Add this value to the ranking score (single quotes keep $qq literal)
qq = "allFields:(#{anded_terms})"      # The query that $qq refers to
# add bf and qq to your solr query

The qq is easy enough — it basically says that to get any relevancy score at all, the record must have all the terms in the allFields Solr field.

For the map, we want to say

If the record matches all the terms, give it an extra 100K points. If not, don’t.

The map takes 5 arguments:

  • An initial value. In this case, we’re getting the relevancy ranking score based on the qq query. Basically, items that don’t have all the terms will have a score of zero; items that do have all three terms will have something bigger than zero.
  • The beginning of range to compare to. In this case, 0.
  • The end of the range. Another zero, so basically, we’ll be seeing if our initial value is between 0 and 0, i.e., if it’s exactly 0.
  • The value to return if the initial value fits in the range — zero. So, if the record doesn’t have all the terms, return a 0.
  • The value to return if the initial value falls outside the given range. 100K — a random very-large number I picked.
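
The five arguments can be sketched in plain Ruby to check the logic without a running Solr. In the sketch below, `solr_map` is a stand-in I wrote to mimic the behavior described above for a single value, and `all_terms_boost_params` is a hypothetical helper that assembles the request parameters; setting `defType` to `dismax` is my assumption about the handler configuration, not something taken from our actual setup.

```ruby
require 'uri'

# Mimics the five-argument map(x, min, max, target, default) for one value:
# return `target` when x lies in [min, max], otherwise `default`.
def solr_map(value, min, max, target, default)
  value.between?(min, max) ? target : default
end

# Assemble the parameters described above. `qq` is an arbitrary parameter
# name that the boost function dereferences as $qq.
def all_terms_boost_params(user_query, terms)
  {
    'defType' => 'dismax', # assumption: the handler is not already DisMax
    'q' => user_query,
    'qq' => "allFields:(#{terms.join(' AND ')})",
    'bf' => 'map(query($qq),0,0,0,100000.0)'
  }
end

terms = %w[information architecture usability]
params = all_terms_boost_params(terms.join(' '), terms)
puts URI.encode_www_form(params)

puts solr_map(0.0, 0, 0, 0, 100_000.0) # qq score for a record missing a term
puts solr_map(1.7, 0, 0, 0, 100_000.0) # qq score for a record with every term
```

Because the bonus is a constant added to every all-terms record, those records keep their relative order among themselves, which is the second requirement from the list above.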

And…?

I just pushed this to our beta site, and folks are still looking at it, but so far, it looks awesome. I’ll do a little update post if/when it goes into production. And if it doesn’t, I’ll say why.

4 Responses to “Solr: Forcing items with all query terms to the top of a Solr search”

  1. Naomi Dushay says:

    So what sort of relevancy testing did you use to confirm the new way is better than the old way? Nice repeatable, automate-able tests, right?

  2. Naomi Dushay says:

    More questions:

    couldn’t you have set mm higher, or did you want to make sure you still got the more comprehensive result set?

    I’m also wondering if there are other ways to reduce the importance of the size-of-the-document.

    Have you also tweaked for unstemmed matches >> stemmed matches (cooking vs. cooked)

    and for proximity using the pf and ps boosts?

  3. Here’s my refinement, which avoids the need for the client to send the qq, everything is just computed based on existing ‘q’.

    map(query($all_terms),0,0,0,100000.0) {!dismax qf=text pf='' bf='' bq='' mm='100%' v=$q}

    Seems to work to do the same thing yours does. Of course, now this is indeed by default applied to every single query, including queries it doesn’t make any sense for (or may even error for?) like queries that weren’t originally dismax (like ‘advanced’ search).

    Not quite sure the best way to deal with this in blacklight, just playing around with different approaches to this functionality, seeing which will require the least code. :)

  4. Bah, that got rid of my brackets. let’s try again without brackets, but this is defaults in solrconfig.xml

    str name="bf" map(query($all_terms),0,0,0,100000.0) /str

    str name="all_terms" {!dismax qf=text pf='' bf='' bq='' mm='100%' v=$q} /str

Why RDA is doomed to failure

April 23, 2010 at 10:20 am | Category: Uncategorized

[Note: edited for clarity thanks to rsinger's comment, below]

Doomed, I say! DOOOOOOOOOOMMMMMMMED!

My reasoning is simple: RDA will fail because it’s not “better enough.”

Now, those of you who know me might be saying to yourselves, “Waitjustaminute. Bill doesn’t know anything at all about cataloging, or semantic representations, or the relative merits of various encapsulations of bibliographic metadata. I mean, sure, he knows a lot about…err….hmmm…well, in any case, he’s definitely talking out of his ass on this one.”

First off, thanks for having such a long-winded internal monologue about me; it’s good to be thought of.

And, of course, you’re right on all counts. I don’t know what I’m talking about in any of those realms.

And yet I’m still willing to make a strong statement?

Yes. I am. Here’s why.

[Oh, and if you're convinced I'm wrong -- please say so. I'd love to be wrong about this.]

First, an assertion

The purpose of any bibliographic metadata is to facilitate three things:

  • Description/Identification. If you know what you want, does the metadata give you enough information to determine if the described item is what you want? Alternately, if you’re holding an item (or an alternate metadata representation of it), can you find the record that describes it?
  • Machine finding. Can a machine, given a good-enough query, find a work via a search of the metadata?
  • Machine grouping. Given the metadata, can a machine help a person find items “like this one”?

Take issue with one or more of those statements. I don’t care. The point I’m really trying to make is that any standard that doesn’t put unmediated machine reasoning at the forefront of what the metadata needs to support is living in a deep, deep hole.

Computer cycles are pretty cheap, and programmers are pretty smart. We can figure out how to do useful things with virtually any data, but only if we can reliably get at those data.

Getting 75% of the way there

Three-fourths of the problem can be addressed with one simple concept.

A solid equality relationship.

By this I mean that “=” had better damn well mean “equal,” as opposed to “probably the same, but there might be other representations, too.” If I want to say “A = B” (where A and B are authors, or works, or subjects, or anything that can be nailed down) there had better be no false positives and no false negatives. Ever. MARC’s use of “hopefully-unique strings” is ridiculously insufficient in the modern era.

RDA does pretty well with this, with URIs for appropriate concepts, so that’s good.
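To make the difference concrete, here's a minimal sketch of string matching versus identifier matching. Everything in it is made up for illustration: the URIs use example.org as a placeholder, the labels are invented, and neither function comes from RDA or any real authority file.

```python
# Hypothetical authority entries. The URIs and labels are invented for
# illustration; example.org is a placeholder, not a real authority service.
twain_a = {"uri": "http://example.org/agent/1", "label": "Twain, Mark, 1835-1910"}
twain_b = {"uri": "http://example.org/agent/1", "label": "Twain, Mark, 1835-1910."}
smith_x = {"uri": "http://example.org/agent/2", "label": "Smith, John"}
smith_y = {"uri": "http://example.org/agent/3", "label": "Smith, John"}


def same_by_string(a, b):
    """The 'hopefully-unique string' approach: compare display strings."""
    return a["label"] == b["label"]


def same_by_identifier(a, b):
    """The identifier approach: '=' means the identifiers match, nothing else."""
    return a["uri"] == b["uri"]


print(same_by_string(twain_a, twain_b))      # False: false negative (stray period)
print(same_by_string(smith_x, smith_y))      # True: false positive (two people, one string)
print(same_by_identifier(twain_a, twain_b))  # True
print(same_by_identifier(smith_x, smith_y))  # False
```

The identifier version only works if whoever assigns the identifiers gets it right, of course, but at least the comparison itself stops guessing.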

What’s wrong with it?

Well, it’s gonna cost money to access the spec, for starters. That’s just dumb.

But it’s also not flexible/extensible enough. It’s true that I’m not a cataloger. I do have an MS in computer science, though, and there is stuff in the various versions of the RDA spec that leads me to believe that the committee desperately, desperately needed some hardcore geeks on it. Computer science has basically done nothing but develop methods for abstraction and composition for decades, and that isn’t reflected enough here.

Language such as, “If it is determined that a mechanism for providing a direct link between a note and the instance of the element to which it relates is required,…” worries me. If? IF????? That’s not a spec. That’s a guideline. Nail it down, for god’s sake. When is it appropriate or inappropriate? How do you add links to multiple (but not all) instances of the element?

The spec also seems to describe at least half a dozen kinds of titles. One of these is “Abbreviated title.” Do we really want an abbreviated title? No. We want a title with an “abbreviated” modifier, so we can use that same modifier for, say, a corporate name or publisher or anything else. [Note: see rsinger's comment below, indicating this was a piss-poor example on my part.]

Well, sure, but it’s still better than the AACR2!

[This section updated to disambiguate my use of 'MARC' when I really meant 'AACR2 as commonly talked about in terms of MARC tags']

Of course it is. It’s just not better enough!

We’re not just talking about writing a spec. We’re talking about replacing every single tool in the library toolchain, from the ILS to editing software to OPACs to scripts that keep it all put together. We’ll be asking programmers to learn new skills and new ways of thinking, vendors to produce functional software for untested data formats, and catalogers to essentially take their whole brain out of their heads and get a new one.

But that, frankly, is the easy part. The entire culture of the library is built around AACR2 concepts and MARC data structures. The thought processes, the nomenclature, everything sometimes feels as if it’s built around three-digit tags. The majority of the (crucial!) specialized vocabulary that librarians, and experts and specialists, use to communicate with each other is directly or indirectly tied to MARC.

So, yeah, RDA is a hellofa lot better than AACR2/MARC. But in my view, it’s not better enough to justify all the pain. Switching is incredibly, astoundingly expensive both in terms of cost and in terms of the devaluation of institutional knowledge. We can’t do it every few years. We need to be damn sure we’re getting it right.

7 Responses to “Why RDA is doomed to failure”

  1. Ross Singer says:

    Hmm, there’s a lot here and while I think some of this would be easier to talk about synchronously, you have to go with the forum you have, not the forum you want.

    First off, let me put it on the record that I don’t disagree with your thesis. I can’t say whether or not RDA will fail (or what that “failure” or “success” means, really) but its timidity in actually modeling the data leaves a lot to be desired.

    Now, on to your arguments… Equality (with regards to information) is always going to be subjective. Witness the agita that owl:sameAs is currently wreaking on the Linked Data universe (esp. the hardcore semantic web set). Machine based linking is always going to have error. Homonyms, mistaken assumptions, and human error are just going to have to be accounted for. Without a doubt RDA needs to drop the string matching qualities of the status quo in MARC/AACR2 in favor of real identifiers. Still, this isn’t going to solve the equality issue 100% because, honestly, a cataloger may not be 100% sure of what s/he is describing.

    Also, abbreviated titles are actual things. Like “JAMA”. I’m not sure the actual provenance of these titles, but they are distinct from the actual title (and generally considered important and used).

    My last point would be how you compare “RDA” and “MARC” in your last part. Really, you’re comparing RDA with AACR2 (esp. since the powers that be are trying to figure how RDA will be transmitted via MARC). The major issue is that RDA doesn’t distance itself nearly enough from AACR2 to be entirely worthwhile. Everything is still a literal and there is still a very “record-centric” mindset (even in the RDF schemas). This is most obvious when you see things like “titleOfTheWork” and “projectionOfCartographicContentExpression” instead of, I don’t know, just modeling the damned FRBR entities like they should.

    So, instead, we have a somewhat-major change in cataloging rules that will require a lot of time and energy and still provide no “real” relationships between resources and entities.

  2. Laura says:

    One minor quibble. RDA is intended to be a replacement for AACR2 — a descriptive standard, rather than MARC — a transmission standard. Granted MARC has evolved over the years to do both description and transmission in practice since there have been rules akin to application profiles in terms of how to enter data into a MARC record.

  3. Wally Grotophorst says:

    If you wonder whether this disconnect between computer science and library science (specifically cataloging) is real, stroll down your QA76 range of shelves sometime and marvel at the distribution of shelving locations for something like Oracle how-to books.

  4. In “Directions in Metadata” Karen Coyle notes that the current vendors have been reporting near ZERO feedback / customer demand for anything related to RDA. True, it’s still early – the spec hasn’t been formally released – but in a slow moving community, any change seems to need a lot of “ramp up” time, for both the library community and its vendors.

    Very too bad, since there’s a sense of urgency that’s missing in all of this discussion. I think the OSS community is going to shape up to be best positioned to respond to changes, but moving forward with some reasonable consensus from libraries is going to be the challenge. There still remains a gulf between the well-informed IT & catalogers vs. the laggards from the catalog card generation who don’t understand how our MARC/AACR2 standards present huge data issues that prevent us from moving forward.

  5. Karen Coyle says:

    If you look at the diagram called “Singapore Framework” on the Dublin Core site [1], it illustrates all of the necessary elements of a functioning, modern metadata scheme. The framework is based on RDF, but it could really be based on any other foundation technology. What we don’t seem to have learned in the library world is that the cataloging rules do not a metadata schema make. The rules are about how you make decisions, but you need to have defined data elements, vocabularies, and, above all, you need to have some sense of what functionality you wish your metadata to support. I feel like we go about it entirely backwards, first creating rules, then trying to fit it all into a data format.

    [1]http://dublincore.org/documents/singapore-framework/

  6. Irvin Flack says:

    Following up on Karen’s and Ross’s comments, I’m reminded about that joke about the guy looking for his lost keys under the street light — not because that’s where he dropped them but because that’s where he could see. Or, to throw in another metaphor: you visit a surgeon, you get an operation. Cataloguers are experts on the rules — so that’s what RDA at heart still is, a set of rules.

  7. Bruce says:

    If you wonder whether this disconnect between computer science and library science (specifically cataloging) is real, stroll down your QA76 range of shelves sometime and marvel at the distribution of shelving locations for something like Oracle how-to books.

Data structures and Serializations

April 20, 2010 at 4:56 pmCategory:Uncategorized

Jonathan Rochkind, in response to a long (and, IMHO, mostly ridiculous) thread on NGC4Lib, has been exploring the boundaries between a data model and its expression/serialization (see here, here, and here) and I thought I’d jump in.

What this post is not

There’s a lot to be said about a good domain model for bibliographic data. I’m so not the guy to say it. I know there are arguments for and against various aspects of the AACR2 and RDA and FRBR, and I’m unable to go into them.

What I am comfortable saying is this:

Anyone advocating or dismissing a data model based on the data structure or serialization most-often associated with that model is missing the goddamn point.

Data serializations

…are boring. They’re unimportant at the data modeling stage, and only barely important when thinking about data structures. For any given data structure there are lots of ways you can serialize it. A standard programming-language hash can be represented in a zillion ways, for example: yaml, json, various programming languages, .ini files, etc. Even MARC has two standard serializations (binary and xml) with several more actually in use (Aleph Sequential, for example).

So, let me repeat: serializations are boring and not worth talking about until you’ve got everything else nailed down. Any format you can round-trip your data structure to/from is fine.

Serializations are measured from “less pain” to “more pain”, but all have the exact same expressiveness. Data structures, on the other hand, do not.
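Here's a small sketch of what "round-trip" means in practice: one plain hash pushed through two of the serializations mentioned above (JSON and .ini) and pulled back out unchanged. The record contents and the helper names `via_json` and `via_ini` are mine, picked for illustration; all values are strings because .ini files can't hold anything else without extra conventions.

```python
import configparser
import io
import json

# A flat hash with string values; the contents are just sample data.
record = {"title": "Moby Dick", "author": "Melville, Herman", "year": "1851"}


def via_json(data):
    """Serialize to a JSON string and parse it back."""
    return json.loads(json.dumps(data))


def via_ini(data):
    """Serialize to .ini text (one section) and parse it back."""
    writer = configparser.ConfigParser()
    writer.optionxform = str          # keep keys exactly as given (no lowercasing)
    writer["record"] = data
    buffer = io.StringIO()
    writer.write(buffer)

    reader = configparser.ConfigParser()
    reader.optionxform = str
    reader.read_string(buffer.getvalue())
    return dict(reader["record"])


# Same structure comes back from both; only the bytes in the middle differ.
print(via_json(record) == record)
print(via_ini(record) == record)
```

The .ini version takes more fiddling than the JSON one. That's the "more pain" end of the scale, but the data that comes out is identical.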

A hierarchy of data structures

Think about the following data structures:

  • An ordered list
  • key-value pairs
  • A hierarchy (e.g., an XML document)
  • An undirected graph
  • A directed graph
  • A labeled, directed multigraph (e.g., a set of RDF Triples)

You don’t have to think very hard to see that each of these can be viewed as a restricted version of the structure that follows it in the list. An ordered list (array) is just a set of key-value pairs where the keys represent each item’s sequence. A set of key-value pairs is a very, very flat hierarchy. A hierarchy is an undirected graph without cycles. An undirected graph is a directed graph where you’re careful to make links both ways. And a directed graph can easily be represented as a set of RDF triples (where you may, for example, only have one label for your relationships: “links to”).

[Note that I didn't say any of these would be efficient implementations!]

The reverse is not true — or, at least, not without an incredible amount of “out of band” information in another layer somewhere.

The structures at the end of the list have more expressiveness. You can just plain model more things in them (give-or-take the out-of-band stuff, composition, etc) per unit of screwing around. I’m not going to try to model my set of key=value pairs in an array. I could do it, but it would take so much of my attention that the data modeling would suffer.
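The chain of "X is just a restricted Y" steps above can be sketched in a few lines. This is an illustration only, with function names and representations I made up (nested dicts for the hierarchy, sets of pairs for the graphs, plain tuples standing in for RDF triples), and, as noted, nobody should mistake any of it for an efficient implementation.

```python
def list_to_pairs(items):
    """Ordered list -> key-value pairs keyed by position."""
    return dict(enumerate(items))


def pairs_to_hierarchy(pairs):
    """Key-value pairs -> a very flat tree: one root, one leaf per pair."""
    return {"root": {f"{key}={value}": {} for key, value in pairs.items()}}


def hierarchy_to_undirected(tree):
    """Nested dict -> set of undirected parent/child edges (no cycles)."""
    edges = set()

    def walk(name, children):
        for child, grandchildren in children.items():
            edges.add(frozenset((name, child)))
            walk(child, grandchildren)

    for root, children in tree.items():
        walk(root, children)
    return edges


def undirected_to_directed(edges):
    """Undirected graph -> directed graph with every link made both ways."""
    directed = set()
    for edge in edges:
        a, b = tuple(edge)
        directed.add((a, b))
        directed.add((b, a))
    return directed


def directed_to_triples(directed, label="links to"):
    """Directed graph -> labeled triples using a single relationship label."""
    return {(source, label, target) for source, target in directed}


pairs = list_to_pairs(["a", "b", "c"])
tree = pairs_to_hierarchy(pairs)
undirected = hierarchy_to_undirected(tree)
directed = undirected_to_directed(undirected)
triples = directed_to_triples(directed)
print(len(pairs), len(undirected), len(directed), len(triples))
```

Each step is a one-way street without extra bookkeeping: nothing in the final set of triples says "these were originally positions in an array."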

Don’t handicap yourself

Don’t start with the data structure.

DON’T START WITH THE DATA STRUCTURE!

GET THAT MOTHER-FREAKIN’ DATA STRUCTURE OFF MY MOTHER-FREAKIN’ PLANE!

Seriously. Don’t be stupid. If all you’ve got is a hammer, everything starts to look like a thumb.

If you start off with a restrictive data structure before you even fully define the domain you’re trying to model, you may hose yourself. You may end up making stupid decisions based on the toolchain you’re imagining in your head.

Domain modeling is ridiculously hard for any domain worth modeling. If you start with a handicap (a restrictive data structure) it’s going to be even harder.

No one would think of trying to model bibliographic data using only arrays. That’s premature optimization on an epic scale.

The appeal of RDF Triples

Even if you ignore all the semantics and rules that make RDF Triples a value-added instance of a labeled, directed multigraph, the appeal (to me, anyway) is that any semantic model based on RDF Triples has enormous expressive power at its disposal.

Does it turn out that after you’ve fully satisfied the necessary model for the domain, the semantics you need can actually be accomplished with something earlier in the list? Awesome. Go with it. You’ll get great implementations with good real-life computing characteristics. A database can often usefully be thought of as an implementation of an undirected graph with typed nodes (and, perhaps, some typed links, if you use the column name in the calling table as a “type” of sorts, and add some out-of-band knowledge). And lord knows RDBMSs have great performance characteristics.

But don’t start there. Start with the domain. Model it. Figure out what you need to describe and derive. Then pick the most appropriate data structure.

The nightmare that is MARC

MARC-the-data-structure (not to be confused with a serialization of that data structure, on the one hand, or with the AACR2 on the other) can incompletely (but usefully, I think) be described as:

  • A set of key-value pairs
  • …that have a defined order
  • …where keys can be repeated
  • …and values are strings
  • …and keys are a concatenation of tag/ind1/ind2/code

Control fields are especially restricted (ind1, ind2, and code are all ‘null’). There’s been some bullshit attempts at links (e.g., the 880 fields) but really, this is it.
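As a rough sketch of that description (and only that description, which is admittedly incomplete), a record can be written as an ordered list of pairs whose keys are tag/ind1/ind2/code tuples. The `Key` type, the `values_for` helper, and the sample values are all invented for illustration; this is not how any real MARC library represents records.

```python
from typing import NamedTuple, Optional


class Key(NamedTuple):
    """A key in the flat view: the concatenation of tag/ind1/ind2/code."""
    tag: str
    ind1: Optional[str]
    ind2: Optional[str]
    code: Optional[str]


# An ordered list of (key, string value) pairs; keys may repeat.
# Sample values are made up. Control fields have no indicators or code.
record = [
    (Key("001", None, None, None), "12345"),
    (Key("245", "1", "0", "a"), "An example title"),
    (Key("650", " ", "0", "a"), "Economics"),
    (Key("650", " ", "0", "x"), "History"),
    (Key("650", " ", "0", "a"), "Philosophy"),
]


def values_for(record, tag, code=None):
    """Return values, in record order, matching a tag and optional code."""
    return [
        value
        for key, value in record
        if key.tag == tag and (code is None or key.code == code)
    ]


print(values_for(record, "650", "a"))
print(values_for(record, "001"))
```

Note that in this flat view the only thing tying "History" to "Economics" rather than to "Philosophy" is the order of the pairs.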

It doesn’t give us much to work with. It’s restricted. And, sadly, so is our thinking.

Putting the cart before the horse

As Jonathan (and zillions of others) rightly point out, a huge problem in the library world is that there are generations (plural) of working librarians who, because of years of practice, find it incredibly hard to think about bibliographic data as modeled outside the constraints inherent in the MARC data structure. It’s a handicap. It’s an anchor around our necks.

MARC-the-data-model (nee AACR2) is not inherently bad because it’s built on an impoverished data structure. It’s bad because it does a shitty job at modeling the bibliographic data space. If we could produce a good model in a crappy data structure like that, well, that’d be awesome because it would indicate that things are simple.

Things, of course, aren’t simple. They’re hard.

So, if you want to complain about MARC or RDA or FRBR, figure out what it’s trying to model and talk about the fidelity of the model with respect to the problem space. But don’t conflate data models, data structures, and serializations.

Oh, and don’t say “PIN Number” or “ATM Machine.” That drives me crazy, too.

5 Responses to “Data structures and Serializations”

  1. A brief exchange me and Bill had in IRC, which I think is further illuminating:

    (5:10:13 PM) jrochkind: BillDueber: I’d say the problem is that MARC is BOTH a “data model” AND a “data structure.” Even though was never designed as a data model, it has become one.

    (5:11:02 PM) BillDueber: jrochkind: Right. We long ago passed the point where the model drives the data structure. It’s [now] the other way around. [which is a bad thing]

  2. MJ Suhonos says:

    Bill, just to clarify my perspective on the issue, I fully agree with everything you’re saying above. In fact, your explanation is probably the clearest I’ve seen to date. And the thread is definitely ridiculous. :-)

  3. Hi,

    I’m not at all familiar with Domain Modelling – so far what I know about it comes from this blog post, plus a less useful Wikipedia article, plus a random white paper I googled up. (http://www.aptprocess.com/whitepapers/DomainModelling.pdf)

    My question is this: would modelling the domain for a library system consist of coming up with something like the set of behaviors that FRBR describes, and then building a data structure based on that?

    Thanks for an interesting post, anyway. Joe M.

  4. [...] librarian would put their metadata in a data format (or “content format” or “data structure“).  Some examples are binary or XML.  It is the carrier for the content, just like how a CD [...]

  5. Jakob says:

    Yes, FRBR is one example of a Domain Model – librarians can do this. But with FRBR they failed to define a serialization. Domain Models help you to talk about things with human beings. But to exchange data you need a serialization of the model. I agree that the model must come first, but when you stop there, you end up doing no data exchange but philosophy (which is nice too).

Library of Congress Subject Headings (LCSH) in particular.

I’ve always been down on LCSH because I don’t understand them. They kinda look like a hierarchy, but they’re not really. Things get modifiers. Geography is inline and …weird.

And, of course, in our faceting catalog when you click on a linked LCSH to do an automatic search, you often get nothing but the record you started from. Which is super-annoying.

So, just for kicks, I ran some numbers.

The process

I extracted all the 650 fields with indicator2=“0” from our catalog, threw away the subfield 6’s, and threw away any trailing punctuation in any of the subfields. I called the concatenation of what was left a unique LCSH.

Then I printed them out and put them all onto index cards, using tick-marks to indicate…

No, of course not. I used sort, uniq -c, and wc -l. Here’s what I found.
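For anyone who'd rather not do this in the shell, the process described above might look roughly like this in Python. It's a sketch under assumptions: the field layout (dicts holding a list of code/value subfields), the sample fields, and the exact set of characters treated as "trailing punctuation" are all mine, not taken from the actual extraction.

```python
from collections import Counter

# Assumption: what counts as "trailing punctuation" is a guess.
TRAILING = " .,;:/"


def heading_key(subfields):
    """Drop subfield 6, strip trailing punctuation, concatenate the rest."""
    parts = []
    for code, value in subfields:
        if code == "6":
            continue
        parts.append(f"$${code}{value.rstrip(TRAILING)}")
    return "".join(parts)


# Made-up sample fields standing in for a catalog dump.
fields = [
    {"tag": "650", "ind2": "0",
     "subfields": [("a", "Economics"), ("x", "History"), ("v", "Sources.")]},
    {"tag": "650", "ind2": "0",
     "subfields": [("6", "880-01"), ("a", "Economics"), ("x", "History"), ("v", "Sources")]},
    {"tag": "650", "ind2": "2",
     "subfields": [("a", "Economics")]},          # indicator2 is not 0: skipped
    {"tag": "650", "ind2": "0",
     "subfields": [("a", "Piano music.")]},
]

# Counter does the job of `sort | uniq -c`.
counts = Counter(
    heading_key(field["subfields"])
    for field in fields
    if field["tag"] == "650" and field["ind2"] == "0"
)

print(sum(counts.values()))  # total subject entries
print(len(counts))           # unique subject headings
```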

Counts of LCSH

…in round numbers.

In our catalog, there are:

  • 8.50M subject headings (using the definition above)
  • 1.87M unique subject headings
  • …66% of which (1.23M) appear exactly once

We only have to go out to 30K subjects to account for half of all subject entries. The top 1000 most-used subjects account for 14.5% of all 8.5M subject entries.

The top ten subjects by count are:

  • 6029 $$aSermons, American
  • 6131 $$aPhilosophy
  • 7224 $$aFeature films
  • 7591 $$aPiano music
  • 7968 $$aSocialism
  • 8796 $$aEconomics
  • 9185 $$aCommunism
  • 12440 $$aSermons, English$$y17th century
  • 13539 $$aBills, Private$$zUnited States
  • 58823 $$aEconomics$$xHistory$$vSources
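The long-tail numbers above (how many headings appear exactly once, how many headings it takes to cover half of all entries, what share the top N account for) all fall out of a heading-to-count table. Here's a hedged sketch; the function names and the tiny sample counts are invented, and it assumes you already have a `Counter` mapping each heading to how many times it was used.

```python
from collections import Counter


def long_tail_stats(counts):
    """Summarize a heading -> count table."""
    total = sum(counts.values())
    singletons = sum(1 for n in counts.values() if n == 1)

    # Walk headings from most to least used until half of all entries are covered.
    running = 0
    needed_for_half = 0
    for _heading, n in counts.most_common():
        if running * 2 >= total:
            break
        running += n
        needed_for_half += 1

    return {
        "total": total,
        "unique": len(counts),
        "singletons": singletons,
        "needed_for_half": needed_for_half,
    }


def top_share(counts, k):
    """Fraction of all entries accounted for by the k most-used headings."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(n for _heading, n in counts.most_common(k)) / total


# Made-up counts, far too small to say anything about a real catalog.
sample = Counter({
    "$$aEconomics": 5,
    "$$aPhilosophy": 2,
    "$$aSocialism": 1,
    "$$aCommunism": 1,
    "$$aPiano music": 1,
})

stats = long_tail_stats(sample)
print(stats)
print(top_share(sample, 2))
```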

From a record’s point of view

Our catalog has:

  • 7M records
  • 4.4M records with at least one subject (as defined above)
  • 2.4M records with more than one subject
  • 2.0M records with exactly one subject
  • 2.6M records with zero subjects

The records with the most subject headings tend to be collections of stuff (theses, photos, etc). Our local standout is the Dept. of Medicine and Surgery (University of Michigan) theses, 1851-1878 with 208 subject entries. 14 records have at least 30 subject entries.

What it means

Gee, lady, I don’t know.

One way to look at it: suppose you’re considering defining subjects in this way, and making them “hot” in the catalog interface. For our data, 2/3 of records would have either no subjects or a subject that found only the record you’re at. So…think again.

In real life, we index lots of possible subject fields, and we additionally index the $$a as well as the whole string, so ours are a little bit more useful. A little.

4 Responses to “Stupid catalog tricks: Subject Headings and the Long Tail”

  1. Cool! Now consider graphing the results of your counting.

    Count other things such as length of book, dates, authors, etc. When you get this far compare the subject headings with the additional counts to see whether or not there are relationships. Are books of one type of subject generally longer than others? Was this subject heading assigned more often during specific years? Are there common authors within subject headings? Are the books in question available via full text? Can you get those full text books and determine whether or not the books were cataloged “correctly” by doing text mining against the full text? Are there sets of “better” words that could be used to describe the books?

    Fun with counting.

  2. Naomi Dushay says:

    Eric,

    I’m in the “what can we do with the data we have” camp more than the “how should our data be improved” camp, by and large. If we’re using subjects to promote discovery; having 7 million records with suboptimal inconsistent data is not surprising. Even if there are patterns such as the ones you explored off the top of your head, retrospective cataloging seems unlikely. (whispering) In fact, perhaps human cataloging isn’t really scalable. That said, I will take LCSH over call numbers, but since we have both …

  3. I think this says you’ve GOT to get into the hierarchy in LCSH in order to make em useful. Yes, the hierarchy is weird. Yes, there are actually TWO OR THREE axes of hierarchy in LCSH. But it’s the data we’ve got, like Naomi says, and I think you’ve got to get into it to make em useful.

    So a “subject” according to your analysis is simply the pre-coordinated, for example,

    “Great Britain — Social conditions — 19th century.”

    If you just make that a link and show all things with exactly the same heading, will you get others? Well, in my catalog, you’ll actually get a few hundred, yeah. (And I TRIED to find a good example that wouldn’t do that! But maybe I didn’t try hard enough). But let’s pretend not.

    Okay, but how many will you get if you look for “Great Britain — Social conditions”, or “”Great Britain — Social conditions – ANY“? A lot lot more.

    How about “Great Britain — Social conditions — 19th century — something else“? A lot more there too.

    These subject headings were designed for a card catalog world where they’d all be laid out alphabetically, so the “wildcard” strings I suggest would necessarily be right next to the original subject.

    Our challenge is to figure out how to present these things in a rational way in the online environment instead. But it’s definitely not only linking to things with exactly the same pre-coordinated subject heading — if that often gets you very few hits other than your origin record, it’s because that’s not what LCSH was designed for.

  4. “Our challenge is to figure out how to present these things in a rational way in the online environment instead.”

    Well, here the online environment can work with you rather than against you, since you can make displays that show you the hierarchies radiating out in all kinds of directions.

    Ideally, for “Great Britain — Social conditions — 19th century” you’d see not just that subject and its books, but related subjects and their books as well, displayed in a way that makes it easy for you both to find books of interest and to shift your focus based on what you find.

    See this link for an example of how it works for “Great Britain — Social conditions — 19th century” in a collection of about 40,000 titles. The display shows 7 titles with that subject, and also shows a few others with more specialized subjects (social conditions in England during that time, or social conditions for women). And then it goes on to shift the focus outward a bit, looking at books on social conditions in England without the explicit time qualifier, and so on.