Search results for “MARC”

Please: don’t return your books

February 12, 2013 at 4:16 pm | Category: Uncategorized

So, I’m at code4lib 2013 right now, where side conversations and informal exchanges tend to be the most interesting part.

Last night I had a conversation with the inimitable Michael B. Klein, and after complaining about faculty members who keep books out for decades at a time, we ended up asking a simple question:

> How much more shelving would we need if everyone returned their books?

Assuming we could get them all checked in and such, well, where would we put them?

I’m looking at this in the simplest, most conservative way possible:

  • Assume they’re all paperbacks, so we don’t worry about how thick a cover is (cover width = 0)
  • Assume items for which we don’t have page count information are “average”

Starting data

What’s my current situation at Michigan?

  • Total bibs: about 10M (but that includes a bunch of HathiTrust items and other electronic-only items that could never be checked out)
  • Total items checked out right now: 162,080

The first problem I run into is that I don’t know how many pages are in a given book. Well, in theory I can look in MARC field 300$a, and it will tell me.

Finding the number of pages in a book

I went through a recent dump of all our records and pulled out page counts from the 300 (those that matched the regular expression $$a\d+\s+[pP].).

Problem solved, right? Well, kind of

  • 3,085,433 total bibs with page count data (about 30%)
  • 40,872 checked out items with page count data (about 25%)
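A minimal sketch of that kind of extraction, in Python. It assumes each 300 field has already been flattened to a string with `$$a`-style subfield markers; the sample strings and the helper name `page_count` are made up for illustration, and this is not the script that produced the numbers above.

```python
import re

# Approximation of the pattern described above:
# "$$a", digits, whitespace, then "p." or "P."
PAGES = re.compile(r"\$\$a(\d+)\s+[pP]\.")

def page_count(field_300):
    """Return the page count as an int, or None if the field doesn't match."""
    m = PAGES.search(field_300)
    return int(m.group(1)) if m else None

samples = [
    "$$a244 p. ;$$c23 cm.",     # simple case: matches
    "$$axii, 310 p. :$$bill.",  # roman-numeral preliminaries: no match
    "$$a3 v. ;$$c24 cm.",       # volumes instead of pages: no match
]
counts = [page_count(s) for s in samples]
print(counts)  # [244, None, None]
```

A pattern this strict skips anything with preliminary pages or volume counts, which is one plausible reason so few records yield a number.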

OK, so I don’t have data for everything. Plus, some of those are multi-volume works that list the total page count, even though only a single volume may be checked out.

We’ll have to drop down into statistics:

  • Average number of pages in a checked-out item: 270
  • Median number of pages in a checked-out item: 244

The median is lower, so we’ll go with that. Being conservative, remember?

Bringing it all together

Obviously we need to make a lot of assumptions.

  • All paperbacks (== no space allowance for covers)
  • 244 pages per item (the median of checked out items for which we have data)
  • Pages = 244 * 162,080 = 39,547,520 pages

So…what’s the damage?

But how to do the calculation?

It turns out that if you simply google “book spine width calculator,” a few come up.

I picked one and input 39,547,520 pages and assumed 50lb paper (the lightest paper in the tool).

Total width: 77,241.25 inches, or 6437 feet, or 1.22 miles
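The arithmetic is easy to re-check in a few lines of Python. Note that the 512 pages-per-inch figure is an assumption: the calculator never states it, and it is simply back-computed from the numbers above (39,547,520 pages coming out as 77,241.25 inches).

```python
ITEMS_OUT = 162_080     # items checked out right now
MEDIAN_PAGES = 244      # median pages per checked-out item
PAGES_PER_INCH = 512    # assumption: inferred from the calculator's output

total_pages = ITEMS_OUT * MEDIAN_PAGES
inches = total_pages / PAGES_PER_INCH
feet = inches / 12
miles = feet / 5280

print(total_pages)      # 39547520
print(inches)           # 77241.25
print(round(feet))      # 6437
print(round(miles, 2))  # 1.22
```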

1.22 miles???

Well, we had a lot of assumptions, but most of them were pretty conservative. And I have no idea if the book spine calculator is at all accurate.

But…it’s gonna be a big number no matter what. Add in that many of them are hardcover, and this seems like a pretty good guess at a lower end.

What is this good for again?

Oh, nothing at all. Just a little fun while I’m at code4lib.

Next steps

Well, the best next step would be to walk away. This is a huge waste of time.

But…we could look in the 020s for a hint of whether it’s hardcover or paperback (which is really hard). And maybe try to figure out if multiple volumes of a multi-volume work are all checked out and take that into account.

But really: this is enough for me. Whether Michael wants to pursue it further on his own, well, that’s up to him.

One Response to “Please: don’t return your books”

  1. I had a class in Library School in the 1990s on operations research in the library. There was some interesting stuff on adjusting loan periods so that some number of books was always checked out in order to make sure there was enough shelf space.

    A bunch of researchers applied Operations Research (queueing theory, Monte Carlo simulations, and more math stuff) to the relationship between library loan periods/policies and shelving needs (as well as user satisfaction, etc.).

    Here is an excerpt from some of Buckland’s research in the 1960s. It doesn’t actually talk about deliberately making the loan period longer in order to reduce the need for shelf space though. I’m pretty sure either his book or subsequent research dealt with this issue however. On the other hand, this research on users’ behavior from 1968 sounds relevant for today.

    http://people.ischool.berkeley.edu/~buckland/lancasterlru.pdf

    During 1967 and 1968 a series of measurements were undertaken which showed

    that library users could find the books they were looking for about 6 times out of 10;

    that the major cause of nonavailability was that the book was out on loan to someone else;

    that borrowed books tended to remain out for the full length of the loan period;

    that in practice a loan period was determined not by written policies but by when overdue fines began;

    that disappointed would-be borrowers did not often avail themselves of the procedures for recalling books back from loan;

    and that in-library book use tended to have a stable relationship to circulation in any given library (Hindle and Buckland, 1978).

    A Monte Carlo simulation was used to avoid the limitations of queuing theory. A flow chart of borrowing activities was programmed so that a computer could simulate the sequence of users seeking a single book, its repeatedly being borrowed and returned, and how often a copy was not available when sought. The simulation was flexible enough to show the effects of changes in the pattern and level of demand, in the length of the loan period, and/or of changing the number of copies of that book.

    For more details see Buckland’s book on this “Book availability and the library user”: http://mirlyn.lib.umich.edu/Record/000014253

    Tom

Completed parts of the series:

  1. A Solr Field Type for numeric(ish) IDs
  2. Using localparams in Solr (or, how to boost records that contain all terms)
  3. Requiring/Preferring searches that don’t span multiple values
  4. Boosting on Exactish (anchored) phrase matching

Those of you who read this blog regularly (Hi Mom!) know that while we do a lot of stuff at the University of Michigan Library, our bread-and-butter these days are projects that center around Solr.

Right now, my production Solr is running an ancient nightly of version 1.4 (i.e., before 1.4 was even officially released), and reflects how much I didn’t know when we first started down this path. My primary responsibility is for Mirlyn, our catalog, but there’s plenty of smart people doing smart things around here, and I’d like to be one of them.

Solr has since advanced to 3.x (with version 4 on the horizon), and during that time I’ve learned a lot more about Solr and how to push it around. More importantly, I’ve learned a lot more about our data, the vagaries in the MARC/AACR2 that I process and how awful so much of it really is.

So…starting today I’m going to be doing some on-the-blog experiments with a new version of Solr, reflecting some of the problems I’ve run into and ways I think we can get more out of Solr.

Premise 1: put all the logic you possibly can into Solr

Much of what I’ll be doing is looking at new field type definitions that are appropriate (in my mind, anyway) for library data. Some of this stuff (e.g., normalizing ISBNs) would be a lot easier to do in your indexing code.

But then you’d have to do it again in your application to munge whatever is entered in the search box. And maybe it won’t be the same every time. Or maybe you don’t want to write a freakin’ parser to try to find anything that might look like an ISBN and mess with it.

I take it as gospel that you should put all your logic into the solr field analysis chain, so the exact same thing is happening on index and on query. That way, even if it’s wrong, at least it’ll be wrong in the exact same way and your users will find the stuff they’re looking for.
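To make the idea concrete outside of Solr, here is a toy Python sketch. In Solr itself this would live in a field type’s analysis chain, not in application code; the function `normalize_isbn`, the dictionary standing in for an index, and the record id are all hypothetical.

```python
import re

def normalize_isbn(text):
    """Hypothetical stand-in for an analysis chain: keep only digits and X."""
    return re.sub(r"[^0-9X]", "", text.upper())

# "Index time": the stored key is whatever the normalizer produces.
index = {normalize_isbn("0-306-40615-2"): "record-1"}

# "Query time": user input goes through the very same function.
def lookup(user_input):
    return index.get(normalize_isbn(user_input))

print(lookup("0306406152"))     # record-1
print(lookup("0 306 40615 2"))  # record-1
```

Because both sides share one function, a flaw in the normalization is applied consistently on index and query, so the two sides still line up.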

Premise 2: Doing it crappily is better than not doing it at all.

Look, the right way to do much of this stuff is by hacking on Solr itself, building custom field analyzers or filters or tokenizers that mess with the token chain and…

Wait. I already lost myself, and probably you, too. At some point, I’m going to do an actual sample custom filter for the new Solr codebase (the stuff I did once before is out-of-date); the example will be LCCN normalization and you’ll be able to follow along with me on this blog.

But in the meantime, we can do a lot of fairly ambitious stuff just by using and abusing the out-of-the-box stuff: pattern replacement filters, the existing tokenizers, etc. It might be ugly, and not very fast, but if I start getting the 200 hits a second that mean this is a bottleneck for me, I’ll be happy to deal with it then.

Premise 3: It’s always better to put something out there so smart people can tell you how to do it right

One of the disappointments in my life right now is that there isn’t more formal and informal discussion about what people are doing/trying. I’m sure it’s out there, but some of it is buried in a sea of application-level crap, and much of it is ignored by the people that really understand the data.

With luck, I’ll get comments from folks who really know their stuff and can tell me, in excruciating detail, exactly how I don’t. Please: correct me. I might not be the brightest guy in the room, but I know enough to try to outsource my thinking.

Follow along at home!

Option 1: Build your own current-trunk Solr

If you want to follow along at home, you’ll need a copy of the current source (not the 3.5 stable, since I use things like the ICUTokenizer coming in 3.6 / 4.0), which you can find and build from the Solr site.

Option 2: Just use what I’m using

Alternately, if you’re lazy (and who isn’t??), I’ve provided a GitHub repo of the standard Solr “example” directory you can nab and run on your own Java-equipped machine.

Warning: the git repo is currently 60MB or so.

    git clone https://billdueber@github.com/billdueber/solr_stupid_tricks.git
    cd solr_stupid_tricks
    java -jar start.jar

…and then head to your local Solr Admin page on port 8983 to check things out. We’ll be spending most of our time in the analysis tab.

I’ll get the first post in the series up later today, and then every few days as I think of more things to talk about. I hope you’ll join me!

6 Responses to “Stupid Solr tricks: Introduction (SST #0)”

  1. Jon Gorman says:

    Very cool. Kudos for doing this.

  2. Joe Montibello says:

    This is good stuff. I’m just getting my feet wet with Solr and Blacklight, and I’m already learning from your first two posts!

    Thanks.

ISBN parenthetical notes: Bad MARC data #1

April 12, 2011 at 12:22 pm | Category: Uncategorized

Yesterday, I gave a brief overview of why free text is hard to deal with.

Today, I’m turning my attention to a concrete example that drives me absolutely batshit crazy: taking a perfectly good unique-id field (in this case, the ISBN in the 020) and appending stuff onto the end of it.

The point is not to mock anything. Mocking will, however, be included for free.

What’s supposed to be in the 020?

Well, for starters, an ISBN (10 or 13 digit, we’re not picky).

Let’s not worry, for the moment, about the actual ISBN and whether it’s valid or not.

Wait, no, let’s go ahead and worry about it. It’s an easy enough script to write, although it takes a while to run.

8,630,794  Total records
3,220,666  Total 020a's
    6,498  020a's that don't obviously contain an ISBN
    8,407  that look like an ISBN but fail checksum test:
... so 0.26% of the ISBNs have invalid checksums

So, not bad at all, especially considering some of those are known to be bad, but are transcribed dutifully from the actual (mis-)printed book.
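For reference, the checksum tests themselves are short. This is a sketch of the standard ISBN-10 and ISBN-13 check-digit rules in Python, not the script that produced the counts above; the helper names are made up, and the sample ISBNs are the usual textbook examples, not catalog data.

```python
import re

def isbn10_ok(isbn):
    """ISBN-10: weights 10..1, sum divisible by 11; a final 'X' counts as 10."""
    if not re.fullmatch(r"[0-9]{9}[0-9X]", isbn):
        return False
    values = [10 if c == "X" else int(c) for c in isbn]
    return sum(w * v for w, v in zip(range(10, 0, -1), values)) % 11 == 0

def isbn13_ok(isbn):
    """ISBN-13: weights alternate 1, 3, 1, 3, ...; sum divisible by 10."""
    if not re.fullmatch(r"[0-9]{13}", isbn):
        return False
    return sum(int(c) * (3 if i % 2 else 1) for i, c in enumerate(isbn)) % 10 == 0

def looks_valid(raw):
    """Strip spaces and hyphens, then try both checksum schemes."""
    candidate = re.sub(r"[\s-]", "", raw).upper()
    return isbn10_ok(candidate) or isbn13_ok(candidate)

print(looks_valid("0-306-40615-2"))    # True
print(looks_valid("0-306-40615-3"))    # False
print(looks_valid("9780306406157"))    # True
print(round(100 * 8407 / 3220666, 2))  # 0.26
```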

A lot of the malformed data (anything from which I can’t seem to extract something that looks like an ISBN) is pricing data, and most of it appears in system numbers that are close enough to each other that I presume it was just a bad batch.

What goes after the ISBN in the 020?

I’m no cataloger, of course, but it looks to me like the answer is “Something about how the book is bound together, or the publisher, unless you want to put something else there, and then, really, go ahead, because it’s not like anyone is ever going to want to parse this out, all we need to do is print cards with it for god’s sake.”

No, I kid, I kid! The actual rules are in Library of Congress Rule Interpretation 1.8, which reads, in part:

> For a hardbound resource, there is no attempt to use a consistent term other than to use one that conveys the condition intelligibly.

I think it’s important to read that a second time, because it succinctly conveys the culture in which these rules were devised.

  • Don’t worry about consistency, because your only reader is human.
  • Defer to the cataloger.
  • Being complete is more important than being consistent.
  • Base your notes on your subjective view of the actual, physical item you’re presumed to be holding in your hands.

Interestingly (to me, anyway), it looks like the OCLC once had a (now deprecated) $$b subfield for binding information. Apparently it didn’t catch on.

What did I find?

So, let’s pretend I’d like to be able to differentiate between paperback and hardbound books. Probably useful, yes?

I went ahead and took all parenthetical notes from any subfield in the 020, split them on colon (’cause that seems to be the way they roll) and did some basic normalization:

  • Eliminate numbers (so ‘vol. 1’ and ‘vol. 2’ count as only one pattern)
  • Lowercase everything
  • Turn runs of spaces into a single space
  • Trim leading/trailing spaces
  • Remove any trailing punctuation
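The normalization steps above can be sketched in Python roughly as follows. This is an illustration that follows the bullets literally, not the original script, so its output strings may differ slightly from the tallies reported here; the function name and sample 020 values are made up.

```python
import re
from collections import Counter

def normalize_notes(field_020):
    """Extract parenthetical notes from an 020 string, split on colons, normalize."""
    notes = []
    for paren in re.findall(r"\(([^)]*)\)", field_020):
        for part in paren.split(":"):
            part = re.sub(r"\d+", "", part)         # eliminate numbers
            part = part.lower()                     # lowercase everything
            part = re.sub(r"\s+", " ", part)        # runs of spaces -> one space
            part = part.strip()                     # trim leading/trailing spaces
            part = re.sub(r"[^\w\s]+$", "", part)   # remove trailing punctuation
            part = part.strip()
            if part:
                notes.append(part)
    return notes

samples = [
    "0306406152 (pbk. : alk. paper)",
    "9780306406157 (Pbk.)",
    "080442957X (Hd.Bd.)",
]
counts = Counter(note for s in samples for note in normalize_notes(s))
print(normalize_notes(samples[0]))  # ['pbk', 'alk. paper']
print(counts["pbk"])                # 2
```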

I found 1,506,729 parenthetical remarks in the 020 subfields of our catalog.

The top twenty most common entries using those normalizations are:

  1. 402537 pbk
  2. 387406 alk. paper
  3. 99260 v # (e.g., “v. 1”, “v. 22”, etc.)
  4. 82918 cloth
  5. 51125 hbk
  6. 42036 electronic bk
  7. 41360 acid-free paper
  8. 38792 hardcover
  9. 28913 set
  10. 20358 hardback
  11. 19160 ebook
  12. 16264 paper
  13. 15269 u.s
  14. 12770 hd.bd
  15. 11793 print
  16. 10625 lib. bdg
  17. 10520 hc
  18. 8772 est
  19. 7767 pb
  20. 7639 hard

The kicker? These are the top twenty of 13,374 unique parenthetical strings found in the 020 field. Many of them are publishers, or cities, or whatnot, but an awful lot of them are variations on “hardcover” and “paperback.”

For example, a quick search for anything that might be “hard” (regexp: /h[ar]{0,2}d/) got me started on a list. Here’s just the 90 examples from that list that start with ‘h’:

hard | hard adhesive | hard back | hard bd | hard book | hard bound | hard bound book | hard boundhard case | hard casehard copy | hard copy | hard copy set | hard cov | hard cover | hard covers | hard sewn | hard signed | hard-backhard-backcased | hard-bound | hard-cover | hard-cover acid-free | hardb | hard\cover | hardbach | hardback | hardback book | hardback cover | hardbackcased | hardbd | hardbk | hardbond | hardbook | hardboubd | hardbound | hardboundhardboundtion | hardc | hardcase | hardcopy | hardcopy publication | hardcov | hardcov er | hardcovcer | hardcove | hardcover | hardcover-alk. paper | hardcovercloth | hardcoverflexibound | hardcoverhardcoverwith cd | hardcoverr | hardcovers | hardcoversame | hardcoversame as above | hardcoverset | hardcovertion | hardcver | hardcvoer | hardcvr | harddback | harde | hardocover | hardover | hardpack | hardpaper | hardvocer | hardware | hd | hd bd | hd. bd | hd. bd. in slip case | hd. bd.in sl.cs | hd. bk | hd. cover | hd.bd | hd.bd. in box | hdb | hdbd | hdbk | hdbkb | hdbkhdbk | hdbnd | hdc | hdcvr | hdk | hdp | hdpk | hradback | hradcover | hrd | hrdbk | hrdcver | hrdcvr

And that’s after eliminating things like places of publication, strings like “with…”, “plus…”, “alk. paper”, etc.
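To show how loose that pattern is, here is a small Python sketch run over a handful of strings. Most are taken from the list above; “orchard park” is a made-up example of the kind of false positive that has to be weeded out afterwards.

```python
import re

# Same pattern as above: "h", up to two of "a"/"r" in any order, then "d".
HARDISH = re.compile(r"h[ar]{0,2}d")

notes = ["hardcover", "hd.bd", "hrdcvr", "hradback",
         "pbk", "alk. paper", "orchard park"]
matches = [n for n in notes if HARDISH.search(n)]
print(matches)
# ['hardcover', 'hd.bd', 'hrdcvr', 'hradback', 'orchard park']
```

It catches the abbreviations and the transposed-letter typos, at the cost of also catching anything that merely contains “hard” (which is how “hardware” ends up in the list).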

“Yeah, but you have to understand that historically…”

Stop hiding behind that.

I understand that at one point in time it probably made sense (to someone at least) to do it this way. I can deal with that.

What I can’t accept is that as I type this there’s a cataloger doing this in this way. Today. April 2011. Some, what? maybe thirty years since computer-based OPACs became prevalent?

These sorts of problems were recognized ages ago and should have been dealt with. Add a subfield. Invent a controlled vocabulary. Don’t worry about the legacy data; it’s always going to suck.

But why are we still producing sucky data???

To sum up

The point is that there’s a better way to do this stuff. Lots and lots of better ways, in fact. Time I spend dealing with crappy data is time I don’t spend making relevancy ranking better, or building a better command language search option for my librarians, or working on ways to get a decent “more like this”.

The need is both dire and urgent; the latter because sooner or later we’re going to have to go to a “two state solution” with traditional MARC21 for many of our records and whatever comes next (RDA?) for the newer stuff. And every day we wait, that first category grows, and the growth rate keeps increasing.

And then there’s serials. Don’t talk to me about serials.

6 Responses to “ISBN parenthetical notes: Bad MARC data #1”

  1. Chris says:

    Okay, so, what do you suggest we actually DO about this?

  2. Jakob says:

    What we should do is first make clear what parts of cataloging produce useless junk (like Bill did) and second clean up and normalize the data. A typical reason for avoidable quality problems is a lack of feedback. If you do not instantly get feedback about ill-formed data when you start to create a record, you are unlikely to change your cataloging practice. So third we should create better cataloging clients. As long as cataloging rules are only written as rules instead of implemented as code that gives error messages, I doubt that we get better data.

  3. Karen Coyle says:

    Bill, ISBN is one of the examples I use in my talks when I get to the point of “text versus data”. I have grabbed a couple of examples, but you’ve done a full-blown study, and I’m here to thank you for it! I will point people to this post for more info.

    Chris, as to what we do… given that any library system in existence today has algorithms to separate the ISBN from the rest of the subfield, we need to add a new subfield to MARC to hold the text and make that separation permanent. Actually, we needed to have done that 20 years ago, and I almost feel like now it’s too late to make the change worthwhile since it looks like we’re on the verge of moving beyond MARC anyway.

  4. Matthew Phillips says:

    Yet another example of where the UKMARC format was superior. In UKMARC the ISBN was in subfield “a”, qualifying remarks in subfield “c” and price in subfield “d”. There was even a subfield “b” with a code to indicate the type of ISBN (e.g. whether it was the ISBN for a set of volumes or an individual volume).

    Sadly UKMARC was abandoned about ten years ago because it was just so unsatisfactory trying to convert from USMARC to UKMARC. It’s always easier to convert metadata into a format which is less expressive, so we all moved to USMARC instead.

One of the frustrating things about dealing with MARC (nee AACR2) data is how much nonsense is stored in free text when a unique identifier in a well-defined place would have done a much better job.

A lot of people seem to not understand why.

This post, then, is for all the catalogers out there who constantly answer my questions with, “Well, it depends” and don’t understand why that’s a problem.

Description vs Findability

I’m surprised — and a little dismayed — by how often I talk to people in the library world who don’t understand the difference between description and findability. AACR2 is clearly designed for description; once you’ve found a record, it does a pretty good job telling a human being what she’s looking at. With respect to a person who’s already got a copy of the record in her (virtual) hand, strings of text and reasonable abbreviations are…well, often good enough, let’s say.

But much of AACR2 is a giant mountain of fail when it comes to supporting findability — the ability for a machine to slice and dice the data in ways that can be mapped onto searches and transformations. What those of us on the business end of the computer need are well-defined values stuck into well-defined places that represent well-defined relationships.

Free text stuck on the end of a field fails all three of those criteria.

Machine Reasoning vs. Machine Parsing

When many people look at something like RDF, their first reaction is, “Great Googally Moogally! Just tell me the language! I don’t want to follow a chain of reasoning that’s seventeen steps long just to figure out the damn thing is in English!!!”

Of course you don’t. And you don’t have to. Someone — hopefully someone smarter than me — needs to write a program to do it. And we can.

Following all that logic — deriving relationships, figuring out eventual values, determining how to convert between various forms — is what I’ll call (for simplicity’s sake) machine reasoning. And machine reasoning — for the purposes of this discussion, anyway — is a solved problem. I’m not saying it’s not hard, and I’m not saying it might not take gobs of hardware resources. But we, the collective of humanity, know how to do it.

On the other hand, machine parsing — looking at all that free text that is sprinkled throughout our records and trying to turn it into something that is susceptible to machine reasoning — is vehemently not a solved problem. Even if you ignore all the misspellings, we’re still stuck with one-off abbreviations, lack of ordering, gobs of “local practice,” and iffy punctuation.

And, come to think of it, you can’t ignore the misspellings, either.

The point is this: good data trumps everything else. If there’s good, solid, well-defined data in computable places, we can (given some time) do damn near anything with it. If there’s human-entered, free-text, parenthetical-remark-type data, we’re pretty much stuck.

Examples?

Jonathan Rochkind just did a great post looking at LC call numbers, and how, well, they might be in a few different places, and may or may not be valid LC call numbers, and so on and on and on and on.

And my next post (hopefully tomorrow) will be an analysis of the first freetext in MARC I ever tried to deal with — the parenthetical remarks in the 020 (ISBN) field. If that doesn’t keep you up all night, well, I don’t know what will.

6 Responses to “Why programmers hate free text in MARC records”

  1. Chris says:

    I don’t think they don’t understand — I mean, the catalogers I deal with around here are pretty clever. I think that YOUR concerns are not THEIR concerns and they are working within the constraints of a system that has encouraged, and in some ways even forced, then to get cute with parens and abbreviations because no one was going to change anything to support their need to differentiate.

    The typos are of course only human, so I’m whatever on those. The punctuation, though? I cannot complain about enough. ;-)

  2. Chris says:

    You’ll note I demonstrated how okay I am with typos by inserting one. I did that on purpose, you know. No, really….

  3. Programmers hate MARC because it’s the cool thing to do.

    MARC records contain a lot more than the description. You don’t mention the subject metadata that most contain. This greatly enables retrieval (or findability). You do mention class numbers, which are also not part of the description and also help with retrieval.

    Also, there’s no reason to play findability and description off against each other. Each fills a different need.

  4. Cathy says:

    Just guessing here, since I’ve always been Collection Development/Acquisitions, but haven’t those MARC records been pulled in from a lot of different sources, over many years, from the hands of many different catalogers? I remember something about the older a library’s records are (the cataloging records), I mean the more hands that have worked on them, through the years, a steady increase of mistakes or differences. Example: your search results featuring hard, hardback, etc., are possibly not even from the same library, originally. The fun of copy cataloging? Just a thought.

  5. Bill says:

    Jeffery, that’s a….let’s say perhaps an “overly broad” statement, and I’m not sure what you hope to add to the discussion with it. Programmers don’t, of course, hate MARC because “it’s the cool thing to do.” MARC-as-data-format is outdated and has a lot of flaws that are well-known. MARC-as-AACR2 as I’m treating it here is easy to hate because it’s full of description that could easily be made susceptible to machine parsing, but isn’t, and hence is at least partially wasted effort in that people are doing work that could be useful along both dimensions but isn’t.

Ross Singer recently updated ruby-marc to include a #to_hash method that creates a data structure that is (a) round-trippable without any data loss, and (b) amenable to serializing to JSON. He’s calling it marc-in-json (even though the serialization is up to the programmer, it’s expected most of us will use JSON), and I think it’s the way to go in terms of JSON-able MARC data.

I wanted to take a quick look at the space/speed tradeoffs of using various means to serialize MARC records in the marc-in-json format compared to using binary MARC-21.

Why bother?

Binary MARC-21 is “broken” in that a lot of us have records that are so long (more than 99999 bytes) that it’s impossible to create a valid binary MARC record. The standard alternative, MARC-XML, has huge filesizes (roughly 3 times as large) and runs a lot more slowly in every benchmark I’ve ever run. For ruby-marc, the penalty for using XML is further exaggerated because the serializer is based on REXML and is super-slow.

There have been a few proposals for a MARC data structure that can easily be serialized to JSON (I had my own, in fact), but the stuff Ross has done with marc-in-json is preferable in being (a) not a ton bigger in terms of file size, and (b) much easier to query from a NoSQL database using something like JSONPath or JSONQuery.

What I’m testing

For this test, I used:

  • marc21 binary This is the stock serialization / deserialization provided by ruby-marc.
  • YAJL for JSON YAJL is a very fast C-based JSON library. Here we’re using the Ruby bindings and calling Yajl::Encoder.encode(r.to_hash) to serialize and MARC::Record.new_from_hash(Yajl::Parser.parse(JSON)) to deserialize.
  • Msgpack The Msgpack project is explicitly designed to be “binary JSON” (smaller, faster, etc.) at the expense of human readability/editability. Again, this used the Ruby bindings.
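For readers who haven’t seen the marc-in-json shape, here is a rough Python sketch of a round trip using only the standard `json` module. It assumes the layout is a leader string plus an ordered list of single-key field objects; the leader, control number, and ISBN values are placeholders, and nothing here uses ruby-marc, YAJL, or Msgpack.

```python
import json

# Placeholder record in (what is assumed to be) the marc-in-json layout.
record = {
    "leader": "00000nam a2200000 a 4500",  # made-up leader
    "fields": [
        {"001": "0000001"},                # control field: tag -> string
        {"020": {"ind1": " ", "ind2": " ",
                 "subfields": [{"a": "0306406152 (pbk. : alk. paper)"}]}},
        {"020": {"ind1": " ", "ind2": " ",
                 "subfields": [{"a": "9780306406157 (hardcover)"}]}},
    ],
}

serialized = json.dumps(record)    # what you would store, e.g. in Solr
restored = json.loads(serialized)  # what you get back out

# Fields are a list, so repeated tags and their order survive the trip.
tags = [next(iter(field)) for field in restored["fields"]]
print(restored == record)  # True
print(tags)                # ['001', '020', '020']
```

Keeping fields and subfields as lists of one-key objects, instead of one big dictionary keyed by tag, is what lets repeated tags round-trip without loss.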

The benchmark and its results

I’m interested in how long it takes to serialize and deserialize a single record. My primary use-case is sticking a single record into Solr, and then pulling the string representation of that record out and turning it back into MARC.

It’s entirely possible that trying to deal with a whole set of MARC records — as a JSON array of marc-in-json objects, or as a set of newline-delimited JSON or Msgpack objects — would yield different results. The former is especially interesting, since to parse a large JSON array one needs to use a streaming parser, which will almost certainly have a different profile in both processing and memory use.

The ambitious can see the full source code of the benchmark.

Note that the following represent only the performance of ruby-marc and the particular serializers used. Other platforms or other libraries will certainly give different results!

Total of 18880 records run 20 times (377,600 serialize/deserialize cycles per method) on my Mac OSX desktop; comparisons are to MARC21-Binary.

 SERIALIZING
   MARC Binary       357.02 s (100%)
   YAJL              312.65 s ( 88%)
   Msgpack           266.26 s ( 75%)

 DESERIALIZING
   MARC Binary       648.91 s (100%)
   YAJL              507.64 s ( 78%)
   Msgpack           459.73 s ( 71%)

 SERIALIZE + DESERIALIZE
   MARC Binary      1005.93 s (100%)
   YAJL              820.29 s ( 82%)
   Msgpack           725.99 s ( 72%)

 SIZE
   MARC Binary   31.15 MBytes (100%)
   Msgpack       42.00 MBytes (135%)
   JSON          55.99 MBytes (180%)
   XML           93.42 MBytes (300%)
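The percentage columns are just each figure divided by the MARC Binary baseline. A few lines of Python reproduce them from the raw numbers in the table (only the size and combined-time rows are shown):

```python
sizes_mb = {"MARC Binary": 31.15, "Msgpack": 42.00, "JSON": 55.99, "XML": 93.42}
round_trip_s = {"MARC Binary": 1005.93, "YAJL": 820.29, "Msgpack": 725.99}

def relative(table, baseline="MARC Binary"):
    """Express every entry as a whole-number percentage of the baseline."""
    base = table[baseline]
    return {name: round(100 * value / base) for name, value in table.items()}

size_pct = relative(sizes_mb)
time_pct = relative(round_trip_s)
print(size_pct)  # {'MARC Binary': 100, 'Msgpack': 135, 'JSON': 180, 'XML': 300}
print(time_pct)  # {'MARC Binary': 100, 'YAJL': 82, 'Msgpack': 72}
```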

Analysis, such as it is

Obviously, there are size/speed tradeoffs. Nothing is as small as binary MARC21, but both YAJL and Msgpack are faster — significantly so for deserialization, which happens to be where I want the speed for my uses.

At 80% larger, the JSON serialization is quite a bit bigger, but it’s a hell of a lot smaller than MARC-XML and suffers none of the limitations of binary MARC.

For a closed system (i.e., you’re not worried about anyone else being able to read your data) such as a Blacklight installation, I’d be tempted to move to using JSON sooner rather than later.

One Response to “Size/speed of various MARC serializations using ruby-marc”

  1. Andy says:

    Minor correction: in the “why bother” section, valid binary MARC-21 can be up to 99999 (not 9999) bytes.

    Interesting write-up, I especially appreciate the metrics-based posts you do – always good to have some facts! Thanks.

VuFind Midwest gathering

September 16, 2010 at 11:55 am | Category: Uncategorized

A couple weeks ago, representatives from UMich (that’d be me), Purdue, Notre Dame, UChicago, and our hosts at Western Michigan got together in lovely Kalamazoo to talk about our VuFind implementations.

Eric Lease Morgan already wrote up his notes about the meeting, and I encourage you to go there for more info, but I’ll add my two cents here.

So, in light of that meeting, here’s what I’m thinking about VuFind of late:

  • None of us are running VuFind 1.0 as released with full catalog data. Eric has a special-purpose portal running the current code over an aggregated special collection and hasn’t done much to the underlying PHP. The rest of us were running heavily modified versions of RC1. An issue we had in common was that the changes from RC1 to RC2 to the 1.0 release were so significant, including some complete architectural changes (some based on the stuff I’ve done with Mirlyn), that the effort required to get up to date with 1.0 would be no less significant than the effort to switch wholesale to something else (e.g., Blacklight).

  • A point that I made that was echoed by others is that we need to remember that these new discovery systems are all just thin wrappers over Solr. They basically have two jobs: to get a query and format it in a way that Solr can handle, and then to take the Solr results and display them. There’s some sugar on top of that (exporting, tagging, etc.) but that’s really it. The heavy lifting is all done by your indexer (SolrMarc for most, although watch this space for my announcement of my JRuby-based stuff today) and Solr itself. It’s not a hard problem, although it is occasionally a messy one.

  • VuFind has, in my mind, fundamental architectural issues mostly based on the inability to easily separate local code from core code. A re-architecture to base everything on subclasses of the core code would help, but at some point you start to run up against fundamental limitations of PHP and Smarty to do things cleanly. Without the ability to update core code and know it won’t affect your local code, there’s no good way to keep on track with the trunk of the code and do upgrades; for the same reason, it’s almost impossible to send changes back to trunk.

  • Coupled tightly to the architectural issues is the lack of tests. The code is potentially very brittle; there’s no good way to know if you’re breaking anything until you notice it’s broken. It’s not at all clear how to write good tests for the code, because there are a lot of interdependencies.

  • The second big problem is one of community; to wit, there isn’t much of one. There are some active players, and there’s what seems like a great conference going on right now, so this may change. But, especially because of the technical difficulties in contributing local changes back, VuFind could use a benevolent dictator: someone who has organizing and administering VuFind as part of his/her job. The last bit is important.
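
Going back to the “thin wrappers over Solr” point in the list above, here’s a minimal sketch in Python of the two jobs: turn a search into a Solr request, then turn Solr’s response into something you can display. This isn’t code from VuFind or Blacklight. The core URL, the field names (title, author, format), and the function names are all made up for illustration; q, fq, rows, and wt are standard Solr select parameters. No network call is made, and the response is a hand-written stand-in shaped like Solr’s JSON output.

```python
from urllib.parse import urlencode

# Hypothetical Solr core URL; not any real installation.
SOLR_SELECT_URL = "http://localhost:8983/solr/catalog/select"


def build_solr_url(user_query, facet_filters=None, rows=10):
    """Job 1: turn what the user typed into a Solr select request."""
    params = [("q", user_query), ("rows", str(rows)), ("wt", "json")]
    # Each selected facet becomes a filter query (fq).
    for field, value in (facet_filters or {}).items():
        params.append(("fq", f'{field}:"{value}"'))
    return SOLR_SELECT_URL + "?" + urlencode(params)


def format_results(solr_response):
    """Job 2: turn Solr's response into lines a template could display."""
    lines = []
    for doc in solr_response["response"]["docs"]:
        title = doc.get("title", "[no title]")
        author = doc.get("author", "[no author]")
        lines.append(f"{title} / {author}")
    return lines


# A stand-in response shaped like Solr's JSON output (made-up record).
fake_response = {
    "response": {
        "numFound": 1,
        "docs": [{"id": "rec1", "title": "An Example Title", "author": "Doe, Jane"}],
    }
}

url = build_solr_url("example", {"format": "Book"})
display_lines = format_results(fake_response)
```

Everything interesting (relevancy, faceting, what “title” even means) was decided earlier by the indexer and the Solr schema, which is the point.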

All of these are surmountable issues. The reason they’re at the top of my head, of course, is that the Blacklight community has, in many ways, already taken care of most of them.

If I were starting from scratch tomorrow, we’d already decided to do something locally, and I could convince my systems people to run a ruby implementation (I like JRuby myself), I’d go with Blacklight. If we were already looking at something like Summon, I’d take a hard, hard look at the build-vs-buy numbers. Summon and Primo both give you APIs to program an interface against, and boy, it might be worth the effort to do so and leave everything else alone.

10 Responses to “VuFind Midwest gathering”

  1. The hard aspect of the problem is making something as flexible as libraries (reasonably IMO) require, that can be shared by multiple ‘customers’ with their own flexible configuration and customization, while still keeping a common codebase and being able to share updates.

    I think this actually IS hard. That Blacklight is kind of sort of able to accomplish it is a result of lots of hard work, and not entirely successful (yet).

    So one could say, forget that, let’s just have our own local homegrown wrapper on top of Solr. There are trade-offs to that, but it does certainly reduce the ‘hardness’ of the problem. Although I still think you’re left with a somewhat harder problem than you imply, with all the features we want in a modern library discovery system.

    Alternately, one could say, let’s pretty much drop the flexibility and customization. We’ll have some configuration for what Solr to point to, and we’ll have the simplest possible hooks for ILS-specific info and functionality (item status, request buttons, etc), but except for that, all ‘customers’ will pretty much be running the same thing, but for what they can manage to do in CSS alone. That would also simplify the problem a lot. (And is more or less the approach that most proprietary vendor software takes; when proprietary vendors have tried to make software as cleanly flexible as we’re trying to make BL, they haven’t generally succeeded very well. Which is again, in part, because it is not in fact an entirely easy problem.)

  2. PS: I think just about all non-trivial open source projects need a benevolent dictator, until the community grows enough to have a benevolent junta instead. (But pretending you have a benevolent junta when most people on the junta don’t take their responsibilities seriously and figure some other junta member will take care of it, does not work).

    This is my impression of what’s almost a rule of distributed-collaboration, volunteer-done open source. There are some exceptions, but they will generally have some unusual characteristics that make them exceptions.

    If there isn’t an official benevolent dictator, and the open source project seems successful, there’s probably to some degree an unofficial one. Who is filling the role(s) can change over time, sometimes.

    One reason for this ‘law’ is that architecture matters. ESPECIALLY when you’re trying to create shareable “framework” style code, instead of just a custom fit application. And design/architecture by committee, especially a committee of people of varying levels of commitment, time, feeling of responsibility, skill, and perspective, doesn’t work out that well. You’ve got to have someone with a vision, whose vision makes sense (if it’s going to be successful), who feels an obligation/responsibility to apply that vision.

    One of the downsides of extreme focus on test-driven/agile/extreme programming, is that you can think that just cause that feature you added in an ad hoc way passes all the tests, that’s the only metric of evaluating the quality of your software. It isn’t, architecture matters. Which is starting to get talked about more, as a corrective to focusing too much on “get it done quick with a test”, bringing things back to balance. Here’s an example of the genre that might not be the best, but I happened to read this morning on reddit: http://www.infoq.com/news/2010/09/big-ball-of-mud

  3. And, I can’t stop talking, sorry. (I should turn this into my own blog post, but I know Bill doesn’t mind).

    Blacklight may be, at the moment, more successful at those things than VuFind, but it’s cause of a lot of sweat, and from the inside the battle isn’t over yet.

    And as far as Summon or Primo: If you ignore their interface entirely and just go with their APIs, then this isn’t going to be any easier than programming on top of Solr APIs. It’s going to be the same, in fact. And have the same issues we’re talking about here, go it alone homegrown, try to share something on top of those APIs, etc.

    The legit reason to me to go with summon or primo+primoCentral is because their indexes include publisher and aggregator supplied article-level data. That is something that is going to be VERY hard to do without vendor support. But just the APIs to query their indexes, with your own interface on top? What’s the “everything else” you think this will do for you, that you don’t get with your own interface on top of Solr already?

  4. David says:

    Interesting discussions. We’re about to start implementing Summon, and one of the things I like about it is its API. I’d started to think about having some other frontend (especially hearing that there is a connector already built for VuFind) but do wonder what the point of it would be.

    Summon has a reasonably nice (if sparse) interface and theoretically everything we want people to search is in there, so setting up VuFind or Blacklight as a frontend would seem to only give us more configuration choices in the page design, at the expense of some non-trivial setup and maintenance costs. Are there any other great benefits that I haven’t thought of?

  5. David says:

    Also, the whole benevolent dictator / junta thing I think is part of Koha’s problems at the moment.

    They have that set up for each release (with a release manager and other roles) but there’s no “steering committee” type organization to oversee the project. They have a really good community, but when one of the support companies goes rogue (spreading FUD, locking the community out of the website etc) there’s no coherent and official body to respond.

  6. This rings very true for me. A couple of thoughts: separation of local and core code is a big problem we are currently working around. Additionally, it appears that VuFind has separated template code from other code but not presentation code from logic code. I’ve yet to be convinced of Smarty (why write a template interpreter/compiler in a language that is already very good for templating?) and would like to be able to just use PHP as my templating language.

  7. I tend to agree with you, David, about the unclear benefit of putting a custom interface on Summon.

    What people seem to be doing in VuFind is having a locally indexed content search, and in a separate tab a separate Summon search. As far as I know, VuFind won’t merge these results into one search; you get your local search, or your Summon search.

    Why would you want a local search, when all that local content is probably already in your Summon anyway? Well, having a local search gives you a LOT more control over relevancy ranking and faceting, and adding additional features like my date range timeline/histogram. Example: http://tinyurl.com/2b5wujr . You couldn’t do that in Summon. But you can’t do it in a Summon search just by wrapping it in VuFind either — all you can do is have a local search with features like that, and a separate Summon search. If you ARE going to have both, I suppose there are benefits to wrapping them in a consistent VuFind interface, with a single ‘saved records’ area etc (I’m not positive if VuFind does give you a single saved records area).

    But if you’re NOT planning on adding a local index search to Summon, I think you probably get minimal benefit wrapping Summon in VuFind.

    And if you ARE planning on adding a local index search to Summon, you’re giving the users two different searches, which is theoretically what you’re trying to get away from with Summon.

    But there are pros and cons to the Summon approach. You get all that indexed content, but you lose the control of your index that you get with Solr/Blacklight/VuFind.

    Where I am, we’re currently choosing to focus on improving our local index search with Blacklight. But that does leave out the indexed article-level data in Summon (or PrimoCentral). Later we may try to combine them the way apparently some people are in VuFind, although it’s an unsatisfactory combination, I think.

  8. And also, when I’m talking about benevolent dictator/junta, I’m talking about developers. Who are actively working on writing and architecting code. I’m not talking about some non-developer administrative policy committee; to the extent successful distributed collaborative community open source projects have those (Apache does), they keep their hands off the code. Such a policy committee is really a different thing meant to solve different problems, not quality-of-software problems. Maybe you need one, maybe you don’t, for those other problems, but I think when such a committee starts trying to direct software development, you get a bureaucratic nightmare, not the good software that I’m suggesting a benevolent (developer, architect) dictator may be required for.

  9. You guys should check out Villanova University’s implementation of VuFind and Summon https://library.villanova.edu/Find/Search/Home

    We currently use VuFind and are within a couple of weeks of implementing Summon.

  10. Sorry for being unclear in the last post. “We” being Stephen F. Austin State University.

Why RDA is doomed to failure

April 23, 2010 at 10:20 am Category: Uncategorized

[Note: edited for clarity thanks to rsinger's comment, below]

Doomed, I say! DOOOOOOOOOOMMMMMMMED!

My reasoning is simple: RDA will fail because it’s not “better enough.”

Now, those of you who know me might be saying to yourselves, “Waitjustaminute. Bill doesn’t know anything at all about cataloging, or semantic representations, or the relative merits of various encapsulations of bibliographic metadata. I mean, sure, he knows a lot about…err….hmmm…well, in any case, he’s definitely talking out of his ass on this one.”

First off, thanks for having such a long-winded internal monologue about me; it’s good to be thought of.

And, of course, you’re right on all counts. I don’t know what I’m talking about in any of those realms.

And yet I’m still willing to make a strong statement?

Yes. I am. Here’s why.

[Oh, and if you're convinced I'm wrong -- please say so. I'd love to be wrong about this.]

First, an assertion

The purpose of any bibliographic metadata is to facilitate three things:

  • Description/Identification. If you know what you want, does the metadata give you enough information to determine if the described item is what you want? Alternately, if you’re holding an item (or an alternate metadata representation of it), can you find the record that describes it?
  • Machine finding. Can a machine, given a good-enough query, find a work via a search of the metadata?
  • Machine grouping. Given the metadata, can a machine help a person find items “like this one”?

Take issue with one or more of those statements. I don’t care. The point I’m really trying to make is that any standard that doesn’t put unmediated machine reasoning at the forefront of what the metadata needs to support is living in a deep, deep hole.

Computer cycles are pretty cheap, and programmers are pretty smart. We can figure out how to do useful things with virtually any data, but only if we can reliably get at those data.

Getting 75% of the way there

Three-fourths of the problem can be addressed with one simple concept.

A solid equality relationship.

By this I mean that “=” had better damn well mean “equal,” as opposed to “probably the same, but there might be other representations, too.” If I want to say “A = B” (where A and B are authors, or works, or subjects, or anything that can be nailed down) there had better be no false positives and no false negatives. Ever. MARC’s use of “hopefully-unique strings” is ridiculously insufficient in the modern era.

RDA does pretty well with this, with URIs for appropriate concepts, so that’s good.
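
To see why strings fall down where identifiers don’t, here’s a small Python sketch. The headings and the example.org URIs are invented for illustration and aren’t taken from any real authority file or from the RDA vocabularies; the only point is the shape of the two failure modes.

```python
# Made-up records: each has a heading string and a (hypothetical) URI.
first = {"heading": "Smith, John", "uri": "http://example.org/person/1"}
# Same person, heading typed a different way.
variant = {"heading": "Smith, John, 1950-", "uri": "http://example.org/person/1"}
# Different person who happens to share the heading string.
namesake = {"heading": "Smith, John", "uri": "http://example.org/person/2"}


def same_by_string(a, b):
    """Equality as 'hopefully-unique strings': compare the headings."""
    return a["heading"] == b["heading"]


def same_by_uri(a, b):
    """Equality as identity: compare the identifiers."""
    return a["uri"] == b["uri"]


# False negative: one person, two strings.
false_negative = not same_by_string(first, variant) and same_by_uri(first, variant)

# False positive: two people, one string.
false_positive = same_by_string(first, namesake) and not same_by_uri(first, namesake)
```

Of course the URI comparison is only as good as whoever assigned the URIs, but at least “=” means one thing.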

What’s wrong with it?

Well, it’s gonna cost money to access the spec, for starters. That’s just dumb.

But it’s also not flexible/extensible enough. It’s true that I’m not a cataloger. I do have an MS in computer science, though, and there is stuff in the various versions of the RDA spec that leads me to believe that the committee desperately, desperately needed some hardcore geeks on it. Computer science has basically done nothing but develop methods for abstraction and composition for decades, and that isn’t reflected enough here.

Language such as, “If it is determined that a mechanism for providing a direct link between a note and the instance of the element to which it relates is required,…” worries me. if? IF????? That’s not a spec. That’s a guideline. Nail it down, for god’s sake. When is it appropriate or inappropriate? How do you add links to multiple (but not all) instances of the element?

The spec also seems to describe at least half a dozen kinds of titles. One of these is “Abbreviated title.” Do we really want an abbreviated title? No. We want a title with an “abbreviated” modifier, so we can use that same modifier for, say, a corporate name or publisher or anything else. [Note: see rsinger's comment below, indicating this was a piss-poor example on my part.]

Well, sure, but it’s still better than the AACR2!

[This section updated to disambiguate my use of 'MARC' when I really meant 'AACR2 as commonly talked about in terms of MARC tags']

Of course it is. It’s just not better enough!

We’re not just talking about writing a spec. We’re talking about replacing every single tool in the library toolchain, from the ILS to editing software to OPACs to scripts that keep it all put together. We’ll be asking programmers to learn new skills and new ways of thinking, vendors to produce functional software for untested data formats, and catalogers to essentially take their whole brain out of their heads and get a new one.

But that, frankly, is the easy part. The entire culture of the library is built around AACR2 concepts and MARC data structures. The thought processes, the nomenclature, everything sometimes feels as if it’s built around three-digit tags. The majority of the (crucial!) specialized vocabulary that librarians, as experts and specialists, use to communicate with each other is directly or indirectly tied to MARC.

So, yeah, RDA is a hellofa lot better than AACR2/MARC. But in my view, it’s not better enough to justify all the pain. Switching is incredibly, astoundingly expensive both in terms of cost and in terms of the devaluation of institutional knowledge. We can’t do it every few years. We need to be damn sure we’re getting it right.

7 Responses to “Why RDA is doomed to failure”

  1. Ross Singer says:

    Hmm, there’s a lot here and while I think some of this would be easier to talk about synchronously, you have to go with the forum you have, not the forum you want.

    First off, let me put it on the record that I don’t disagree with your thesis. I can’t say whether or not RDA will fail (or what that “failure” or “success” means, really) but its timidity in actually modeling the data leaves a lot to be desired.

    Now, on to your arguments… Equality (with regards to information) is always going to be subjective. Witness the agita that owl:sameAs is currently wreaking on the Linked Data universe (esp. the hardcore semantic web set). Machine-based linking is always going to have error. Homonyms, mistaken assumptions, and human error are just going to have to be accounted for. Without a doubt RDA needs to drop the string matching qualities of the status quo in MARC/AACR2 in favor of real identifiers. Still, this isn’t going to solve the equality issue 100% because, honestly, a cataloger may not be 100% sure of what s/he is describing.

    Also, abbreviated titles are actual things. Like “JAMA”. I’m not sure of the actual provenance of these titles, but they are distinct from the actual title (and generally considered important and used).

    My last point would be how you compare “RDA” and “MARC” in your last part. Really, you’re comparing RDA with AACR2 (esp. since the powers that be are trying to figure how RDA will be transmitted via MARC). The major issue is that RDA doesn’t distance itself nearly enough from AACR2 to be entirely worthwhile. Everything is still a literal and there is still a very “record-centric” mindset (even in the RDF schemas). This is most obvious when you see things like “titleOfTheWork” and “projectionOfCartographicContentExpression” instead of, I don’t know, just modeling the damned FRBR entities like they should.

    So, instead, we have a somewhat-major change in cataloging rules that will require a lot of time and energy and still provide no “real” relationships between resources and entities.

  2. Laura says:

    One minor quibble. RDA is intended to be a replacement for AACR2 — a descriptive standard, rather than MARC — a transmission standard. Granted MARC has evolved over the years to do both description and transmission in practice since there have been rules akin to application profiles in terms of how to enter data into a MARC record.

  3. Wally Grotophorst says:

    If you wonder whether this disconnect between computer science and library science (specifically cataloging) is real, stroll down your QA76 range of shelves sometime and marvel at the distribution of shelving locations for something like Oracle how-to books.

  4. In “Directions in Metadata” Karen Coyle notes that the current vendors have been reporting near ZERO feedback / customer demand for anything related to RDA. True, it’s still early – the spec hasn’t been formally released – but in a slow moving community, any change seems to need a lot of “ramp up” time, for both the library community and its vendors.

    Very too bad, since there’s a sense of urgency that’s missing in all of this discussion. I think the OSS community is going to shape up to be best positioned to respond to changes, but moving forward with some reasonable consensus from libraries is going to be the challenge. There still remains a gulf between the well-informed IT & catalogers vs. the laggards from the catalog card generation who don’t understand how our MARC/AACR2 standards present huge data issues that prevent us from moving forward.

  5. Karen Coyle says:

    If you look at the diagram called “Singapore Framework” on the Dublin Core site [1], it illustrates all of the necessary elements of a functioning, modern metadata scheme. The framework is based on RDF, but it could really be based on any other foundation technology. What we don’t seem to have learned in the library world is that the cataloging rules do not a metadata schema make. The rules are about how you make decisions, but you need to have defined data elements, vocabularies, and, above all, you need to have some sense of what functionality you wish your metadata to support. I feel like we go about it entirely backwards, first creating rules, then trying to fit it all into a data format.

    [1] http://dublincore.org/documents/singapore-framework/

  6. Irvin Flack says:

    Following up on Karen’s and Ross’s comments, I’m reminded about that joke about the guy looking for his lost keys under the street light — not because that’s where he dropped them but because that’s where he could see. Or, to throw in another metaphor: you visit a surgeon, you get an operation. Cataloguers are experts on the rules — so that’s what RDA at heart still is, a set of rules.

  7. Bruce says:

    If you wonder whether this disconnect between computer science and library science (specifically cataloging) is real, stroll down your QA76 range of shelves sometime and marvel at the distribution of shelving locations for something like Oracle how-to books.

Data structures and Serializations

April 20, 2010 at 4:56 pm Category: Uncategorized

Jonathan Rochkind, in response to a long (and, IMHO, mostly ridiculous) thread on NGC4Lib, has been exploring the boundaries between a data model and its expression/serialization (see here, here, and here) and I thought I’d jump in.

What this post is not

There’s a lot to be said about a good domain model for bibliographic data. I’m so not the guy to say it. I know there are arguments for and against various aspects of the AACR2 and RDA and FRBR, and I’m unable to go into them.

What I am comfortable saying is this:

Anyone advocating or dismissing a data model based on the data structure or serialization most-often associated with that model is missing the goddamn point.

Data serializations

…are boring. They’re unimportant at the data modeling stage, and only barely important when thinking about data structures. For any given data structure there are lots of ways you can serialize it. A standard programming-language hash can be represented in a zillion ways, for example: yaml, json, various programming languages, .ini files, etc. Even MARC has two standard serializations (binary and xml) with several more actually in use (Aleph Sequential, for example).

So, let me repeat again, serializations are boring and not worth talking about until you’ve got everything else nailed down. Any format you can round-trip your data structure to/from is fine.

Serializations are measured from “less pain” to “more pain”, but all have the exact same expressiveness. Data structures, on the other hand, do not.
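
Here’s a quick Python sketch of the “any format you can round-trip is fine” point: one flat hash of strings, pushed out to two different serializations (JSON and an .ini file) and read back. The record contents and the section name "record" are made up; the json and configparser modules are just what Python’s standard library happens to offer.

```python
import configparser
import io
import json

# One data structure: a flat hash of strings (made-up values).
record = {"title": "An Example Title", "author": "Doe, Jane", "year": "2010"}


def to_json(data):
    return json.dumps(data)


def from_json(text):
    return json.loads(text)


def to_ini(data):
    parser = configparser.ConfigParser()
    parser["record"] = data  # one section holding all the pairs
    buffer = io.StringIO()
    parser.write(buffer)
    return buffer.getvalue()


def from_ini(text):
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return dict(parser["record"])


# The two serialized forms look nothing alike...
json_text = to_json(record)
ini_text = to_ini(record)

# ...but both come back as the same data structure.
json_round_trip = from_json(json_text)
ini_round_trip = from_ini(ini_text)
```

The .ini version is already a bit more pain (everything has to be a string, keys get lowercased by default), which is about all there is to say about the difference.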

A hierarchy of data structures

Think about the following data structures:

  • An ordered list
  • key-value pairs
  • A hierarchy (e.g., an XML document)
  • An undirected graph
  • A directed graph
  • A labeled, directed multigraph (e.g., a set of RDF Triples)

You don’t have to think very hard to see that any of these can be viewed as a restricted version of the data structures below it. An ordered list (array) is just a set of key-value pairs where the keys represent each item’s sequence. A set of key-value pairs is a very, very flat hierarchy. A hierarchy is an undirected graph without cycles. An undirected graph is a directed graph where you’re careful to make links both ways. And a directed graph can easily be represented as a set of RDF triples (where you may, for example, only have one label for your relationships: “links to”).

[Note that I didn't say any of these would be efficient implementations!]

The reverse is not true — or, at least, not without an incredible amount of “out of band” information in another layer somewhere.
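
A rough Python sketch of walking down that list, with each structure re-expressed in the next, more general one. The function names and the tiny “book” example are invented, and, as noted above, nothing here is meant to be an efficient implementation.

```python
def list_to_pairs(items):
    """Ordered list -> key-value pairs keyed by position."""
    return dict(enumerate(items))


def tree_to_undirected(children_of):
    """Hierarchy (parent -> list of children) -> set of undirected edges."""
    return {
        frozenset((parent, child))
        for parent, children in children_of.items()
        for child in children
    }


def undirected_to_directed(edges):
    """Undirected graph -> directed graph with a link in each direction.

    Assumes no node is linked to itself.
    """
    directed = set()
    for edge in edges:
        a, b = tuple(edge)
        directed.add((a, b))
        directed.add((b, a))
    return directed


def directed_to_triples(directed, label="links to"):
    """Directed graph -> triples with a single relationship label."""
    return {(source, label, target) for source, target in directed}


pairs = list_to_pairs(["a", "b"])

tree = {"book": ["title", "author"]}
undirected = tree_to_undirected(tree)
directed = undirected_to_directed(undirected)
triples = directed_to_triples(directed)
```

Going the other way (say, recovering the tree from the triples) needs extra knowledge that isn’t in the triples themselves, such as which node was the root.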

The structures at the end of the list have more expressiveness. You can just plain model more things in them (give-or-take the out-of-band stuff, composition, etc) per unit of screwing around. I’m not going to try to model my set of key=value pairs in an array. I could do it, but it would take so much of my attention that the data modeling would suffer.

Don’t handicap yourself

Don’t start with the data structure.

DON’T START WITH THE DATA STRUCTURE!

GET THAT MOTHER-FREAKIN’ DATA STRUCTURE OFF MY MOTHER-FREAKIN’ PLANE!

Seriously. Don’t be stupid. If all you’ve got is a hammer, everything starts to look like a thumb.

If you start off with a restrictive data structure before you even fully define the domain you’re trying to model, you may hose yourself. You may end up making stupid decisions based on the toolchain you’re imagining in your head.

Domain modeling is ridiculously hard for any domain worth modeling. If you start with a handicap (a restrictive data structure) it’s going to be even harder.

No one would think of trying to model bibliographic data using only arrays. That’s premature optimization on an epic scale.

The appeal of RDF Triples

Even if you ignore all the semantics and rules that make RDF Triples a value-added instance of a labeled, directed multigraph, the appeal (to me, anyway) is that any semantic model based on RDF Triples has enormous expressive power at its disposal.

Does it turn out that after you’ve fully satisfied the necessary model for the domain, the semantics you need can actually be accomplished with something lower down in the list? Awesome. Go with it. You’ll get great implementations with good real-life computing characteristics. A database can often usefully be thought of as an implementation of an undirected graph with typed nodes (and, perhaps, some typed links, if you use the column name in the calling table as a “type” of sorts, and add some out-of-band knowledge). And lord knows RDBMS’s have great performance characteristics.

But don’t start there. Start with the domain. Model it. Figure out what you need to describe and derive. Then pick the most appropriate data structure.

The nightmare that is MARC

MARC-the-data-structure (not to be confused with a serialization of that data structure, on the one hand, or with the AACR2 on the other) can incompletely (but usefully, I think) be described as:

  • A set of key-value pairs
  • …that have a defined order
  • …where keys can be repeated
  • …and values are strings
  • …and keys are a concatenation of tag/ind1/ind2/code

Control fields are especially restricted (ind1, ind2, and code are all ‘null’). There’s been some bullshit attempts at links (e.g., the 880 fields) but really, this is it.
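
Taken literally, the description above fits in a few lines of Python. This is a sketch of the description, not of any real MARC library, and the record contents are made up. It’s also incomplete in the same way the description is: flattening to one pair per subfield loses track of which subfields belong to the same occurrence of a field.

```python
from collections import namedtuple

# The key is the concatenation described above: tag / ind1 / ind2 / code.
MarcKey = namedtuple("MarcKey", ["tag", "ind1", "ind2", "code"])

# An ordered list of (key, value) pairs; keys may repeat, values are strings.
record = [
    (MarcKey("001", None, None, None), "000000001"),  # control field
    (MarcKey("245", "1", "0", "a"), "An example title /"),
    (MarcKey("245", "1", "0", "c"), "Jane Doe."),
    (MarcKey("650", " ", "0", "a"), "Example subject"),
    (MarcKey("650", " ", "0", "a"), "Another example subject"),  # repeated key
]


def values_for(rec, tag, code=None):
    """All values, in record order, whose key matches the tag (and code, if given)."""
    return [
        value
        for key, value in rec
        if key.tag == tag and (code is None or key.code == code)
    ]


title_parts = values_for(record, "245")
subjects = values_for(record, "650", "a")
```

That’s more or less the whole toolbox: look things up by key, get strings back, in order.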

It doesn’t give us much to work with. It’s restricted. And, sadly, so is our thinking.

Putting the cart before the horse

As Jonathan (and zillions of others) rightly point out, a huge problem in the library world is that there are generations (plural) of working librarians who, because of years of practice, find it incredibly hard to think about bibliographic data as modeled outside the constraints inherent in the MARC data structure. It’s a handicap. It’s an anchor around our necks.

MARC-the-data-model (nee AACR2) is not inherently bad because it’s built on an impoverished data structure. It’s bad because it does a shitty job at modeling the bibliographic data space. If we could produce a good model in a crappy data structure like that, well, that’d be awesome because it would indicate that things are simple.

Things, of course, aren’t simple. They’re hard.

So, if you want to complain about MARC or RDA or FRBR, figure out what it’s trying to model and talk about the fidelity of the model with respect to the problem space. But don’t conflate data models, data structures, and serializations.

Oh, and don’t say “PIN Number” or “ATM Machine.” That drives me crazy, too.

5 Responses to “Data structures and Serializations”

  1. A brief exchange me and Bill had in IRC, which I think is further illuminating:

    (5:10:13 PM) jrochkind: BillDueber: I’d say the problem is that MARC is BOTH a “data model” AND a “data structure.” Even though was never designed as a data model, it has become one.

    (5:11:02 PM) BillDueber: jrochkind: Right. We long ago passed the point where the model drives the data structure. It’s [now] the other way around. [which is a bad thing]

  2. MJ Suhonos says:

    Bill, just to clarify my perspective on the issue, I fully agree with everything you’re saying above. In fact, your explanation is probably the clearest I’ve seen to date. And the thread is definitely ridiculous. :-)

  3. Hi,

    I’m not at all familiar with Domain Modelling – so far what I know about it comes from this blog post, plus a less useful Wikipedia article, plus a random white paper I googled up. (http://www.aptprocess.com/whitepapers/DomainModelling.pdf)

    My question is this: would modelling the domain for a library system consist of coming up with something like the set of behaviors that FRBR describes, and then building a data structure based on that?

    Thanks for an interesting post, anyway. Joe M.

  4. [...] librarian would put their metadata in a data format (or “content format” or “data structure“).  Some examples are binary or XML.  It is the carrier for the content, just like how a CD [...]

  5. Jakob says:

    Yes, FRBR is one example of a Domain Model – librarians can do this. But with FRBR they failed to define a serialization. Domain Models help you to talk about things with human beings. But to exchange data you need a serialization of the model. I agree that the model must come first, but when you stop there, you end up doing no data exchange but philosophy (which is nice too).

[This is in response to a thread on the blacklight mailing list about getting MARC data into Solr.]

What’s the question?

The question came up, “How much time do we spend processing the MARC vs trying to push it into Solr?” Bob Haschart found that even with a pretty damn complicated processing stage, pushing the data to Solr was still, at best, taking at least as long as the processing stage.

I’m interested because I’ve been struggling to write a solrmarc-like system that runs under JRuby. Architecturally, the big difference between my stuff and solrmarc is that I use the StreamingUpdateSolrServer (on Erik Hatcher’s suggestion). So I thought I’d check how things break down for me.

Here are my numbers running under JRuby (using MARC4J as the marc implementation) with the Solr StreamingUpdateSolrServer. Obviously, there are a lot of differences between this and solrmarc, but I’m hoping that while it’s not comparing apples to apples, it’s at least comparing apples to some sort of processed cheese-like product.

What work is being done on what?

The data set is a file of 18,881 MARC records in marc-binary format. It’s probably not big enough to get a great idea of how things will run over the long (many millions of records) haul, but it’ll do for this rough-cut stuff.

I break my processing down into five categories:

  • Read the records into marc4j objects and do nothing. This is a baseline of sorts.
  • The “normal” fields are anything that you could do with SolrMarc without a custom routine; the actual processing is done in JRuby.
  • Custom fields are generated with JRuby code, but these are things that in solrmarc would require a custom routine.
  • The big “allfields” field is text from tags 100 through 900.
  • The “to_xml” routine is just calling the underlying marc4j XML output and stuffing it into a string.

The schema used is our normal UMICH schema except for High Level Browse (which appears in our catalog as “Academic Discipline”). The code for that is written in Java, and I just call it from JRuby when I’m using it. I excluded it because it’s incredibly expensive, both at startup time (when it loads a giant database of call-number ranges and associated categories) and for processing: there’s a lot of call-number normalization, long-string comparisons, some modified binary searches, etc. etc. etc. It’s expensive. Trust me.

The Solr server itself is on a different, incredibly-beefy machine, and is emptied out before each invocation that involves actually pushing data to it (with a delete-by-query *:*).
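For reference, emptying a Solr index is usually done by posting a delete message whose query matches everything (`*:*`) to the update handler, followed by a commit. The sketch below only builds the request and does not send it. The URL is a placeholder and `delete_all_request` is a hypothetical helper, not something from the code being timed:

```ruby
require 'net/http'
require 'uri'

# Build (but do not send) a request that deletes every document in a Solr index.
# The URL passed in is a placeholder; point it at your own update handler.
def delete_all_request(update_url)
  uri = URI.parse(update_url)
  req = Net::HTTP::Post.new(uri.request_uri)
  req['Content-Type'] = 'text/xml'
  req.body = '<delete><query>*:*</query></delete>'
  [uri, req]
end

uri, req = delete_all_request('http://localhost:8983/solr/update')
# To actually run it (and then send a commit the same way):
# Net::HTTP.start(uri.host, uri.port) { |http| http.request(req) }
```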

How fast were things on my desktop?

  • 18,881 records in marc-binary format
  • Times are in seconds, run on my desktop
  • Remember, you can’t compare these numbers to Bob’s because we’re doing different things to different data.
Total Seconds Description
19 Just read the records with marc4j and do nothing.
85 Read and do 35 “normal” fields (no custom)
104 Read, 35 normal, 15 custom fields
110 Read, normal, custom, allfields
129 Read, normal, custom, allfields, to_xml
136 Read, normal, custom, allfields, to_xml, 2-threaded SUSS, commit every 5K docs
142 Read, normal, custom, allfields, to_xml, 1-threaded SUSS, commit every 5K docs
124 Read, normal, custom, allfields, to_xml, 1-threaded SUSS, commit every 5K docs, 2 threads doing processing

We can also break the same numbers down as:

Seconds Description
19 read the records and do nothing
66 process the 35 normal fields
19 process the 15 custom fields
6 generate the “allfields” field
19 generate the XML (yowza!)
7 send to solr with two threads
13 send to solr with one thread
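The per-stage numbers are just differences between consecutive cumulative totals. The arithmetic can be checked with a few lines of Ruby (the `stage_costs` helper is illustrative only):

```ruby
# Cumulative totals from the first table, in seconds.
CUMULATIVE = [
  ['read the records and do nothing', 19],
  ['process the 35 normal fields',    85],
  ['process the 15 custom fields',   104],
  ['generate the allfields field',   110],
  ['generate the XML',               129]
]

# Turn running totals into the cost of each individual stage.
def stage_costs(rows)
  previous = 0
  rows.map do |label, total|
    cost = total - previous
    previous = total
    [label, cost]
  end
end

costs = stage_costs(CUMULATIVE)

# Sending is measured against the 129-second "everything but Solr" run.
send_two_threads = 136 - 129
send_one_thread  = 142 - 129
```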

Or like this:

Seconds Description
129 do all the reading and processing
13 send to solr with one thread

Why does solr processing seem so much faster for me?

There are a lot of reasons why my submit-to-solr might seem like less of a burden. The ones I can think of off the top of my head are:

  • SUSS is just faster than whatever solrmarc does.
  • My processing stage is so much slower than solrmarc’s (due to algorithms or jruby-vs-java, I don’t know) that the “push to solr” portion of it gets swallowed up by the slowness of the overall code.
  • The Solr server is so much faster than my desktop that my poor little desktop can’t send it data fast enough to work it.

For my setup, obviously adding a processing thread is a lot more beneficial than adding a SUSS thread. My desktop doesn’t have that many threads lying around (adding a third processing thread actually slowed things down), so I moved the code to a beefier machine to see what happened.
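The general shape of "several processing threads feeding one sender" can be sketched with Ruby’s `Queue`. This is an illustration under assumptions, not the code that produced the timings: `run_pipeline`, `process`, and `send_doc` are hypothetical names, with the two callables standing in for field extraction and the Solr client. Note that the threads only run truly in parallel on an implementation like JRuby; on MRI the global lock limits CPU-bound speedup.

```ruby
require 'thread'

# Process records on several threads and hand results to a single sender thread.
# process and send_doc are hypothetical callables.
def run_pipeline(records, processing_threads:, process:, send_doc:)
  input  = Queue.new
  output = Queue.new
  records.each { |r| input << r }
  processing_threads.times { input << :done } # one stop marker per worker

  workers = Array.new(processing_threads) do
    Thread.new do
      while (rec = input.pop) != :done
        output << process.call(rec)
      end
    end
  end

  sender = Thread.new do
    while (doc = output.pop) != :done
      send_doc.call(doc)
    end
  end

  workers.each(&:join)  # all records processed
  output << :done       # then tell the sender to stop
  sender.join
end

sent = []
run_pipeline((1..10).to_a, processing_threads: 3,
             process: ->(n) { n * 2 }, send_doc: ->(doc) { sent << doc })
```

Documents may arrive at the sender out of order, which is normally fine for indexing.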

Trying the same thing on a beefy machine

This is the exact same code and data, but on a beefy machine (16 cores, gobs of memory).

Seconds   SUSS threads   Processing threads
70        1              1   (was 142 seconds on the desktop)
47        1              2
39        1              3
35        1              4
68        2              1
48        2              2
38        2              3
34        2              4

So, on my hardware anyway, there’s a sweet spot with one SUSS thread and three processing threads. YMMV, of course.
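To see how quickly the returns diminish, the table can be turned into speedups and per-thread gains. A small sketch, using only the numbers above (the helper names are made up for illustration):

```ruby
# Timings from the table above: [SUSS threads, processing threads] => seconds.
TIMINGS = {
  [1, 1] => 70, [1, 2] => 47, [1, 3] => 39, [1, 4] => 35,
  [2, 1] => 68, [2, 2] => 48, [2, 3] => 38, [2, 4] => 34
}

# Speedup relative to the one-SUSS, one-processing-thread run, rounded to two places.
def speedups(timings, baseline_key = [1, 1])
  base = timings.fetch(baseline_key).to_f
  timings.each_with_object({}) do |(key, secs), out|
    out[key] = (base / secs).round(2)
  end
end

# Seconds saved by each additional processing thread, for a given SUSS thread count.
def marginal_gains(timings, suss)
  (2..4).map { |p| timings[[suss, p - 1]] - timings[[suss, p]] }
end

gains_one_suss = marginal_gains(TIMINGS, 1)
```

With one SUSS thread, the second processing thread saves 23 seconds, the third 8, and the fourth only 4.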

What have we learned?

I’m not sure, to be honest. It’s logistically difficult for me to do the same process in solrmarc because I’d have to rebuild everything without the HLB stuff. I guess for me, what I’ve learned is that if I’m going to continue working on my code, the places to focus my attention are threading (obviously) and MARC-XML generation.

4 Responses to “Pushing MARC to Solr; processing times and threading and such”

  1. What’s HLB?

    Both ruby-marc and marc4j will generate marc-xml, but do you mean optimizing speed of it? (Don’t forget marc-json possibilities! heh).

    Not sure if you’re still happy with marc4j or might prefer ruby-marc, I realized one thing missing from the ruby stack (if you didn’t want to use marc4j) (as far as I know) is the marc8-utf8 conversion stuff, and heuristic guess detection of marc records that aren’t really the encoding they claim to be.

  2. Oh, I see, performance with toXML.

    What i wonder/worry about, is if the added time for toXML isn’t actually the serialization to xml, but simply that if you’re pushing a larger stored field to solr, that’s going to slow things down.

    We still need to store our marc either way, of course. The UWisconsin approach of storing marc in an rdbms instead of a solr stored field may or may not speed up indexing, since it’s still gonna take time to store it.

  3. Hey, I should read more carefully before I post, but instead I’ll just multi-post.

    I see the serialization to XML itself is non-trivial too.

    json!



ruby-marc with pluggable readers

March 2, 2010 at 1:55 pm Category: Uncategorized

I’ve been messing with easier ways of adding parsers to ruby-marc’s MARC::Reader object. The idea is that you can do this:

    require 'marc'
    require 'my_marc_stuff'

    mbreader = MARC::Reader.new('test.mrc') # => Stock marc binary reader
    mbreader = MARC::Reader.new('test.mrc', :readertype=>:marcstrict) # => ditto

    # Swap in a custom parser for the :marcstrict reader type
    MARC::Reader.register_parser(My::MARC::Parser, :marcstrict)
    mbreader = MARC::Reader.new('test.mrc') # => Uses My::MARC::Parser now

    xmlreader = MARC::Reader.new('test.xml', :readertype=>:marcxml)

    # …and maybe further on down the road

    asreader = MARC::Reader.new('test.seq', :readertype=>:alephsequential)
    mjreader = MARC::Reader.new('test.json', :readertype=>:marchashjson)

A parser need only implement #each and a module-level method #decode_from_string.
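To make the contract concrete, here is a toy, self-contained sketch of a registry plus a parser that implements `#each` and `decode_from_string`. It is not ruby-marc’s code: `ToyMARC`, `LineParser`, and the `:lines` reader type are all invented for illustration, and the real implementation may differ in every detail.

```ruby
# A toy registry, NOT ruby-marc's code: it only illustrates the shape of the idea.
module ToyMARC
  class Reader
    include Enumerable
    @parsers = {}

    class << self
      attr_reader :parsers

      # Associate a parser class with a reader type symbol.
      def register_parser(klass, type)
        @parsers[type] = klass
      end
    end

    def initialize(source, readertype: :marcstrict)
      klass = self.class.parsers.fetch(readertype)
      @parser = klass.new(source)
    end

    # Delegate iteration to whichever parser was registered.
    def each(&block)
      @parser.each(&block)
    end
  end

  # A hypothetical parser: one "record" per line of the source string.
  class LineParser
    def self.decode_from_string(str)
      { raw: str.strip }
    end

    def initialize(source)
      @source = source
    end

    def each
      @source.each_line { |line| yield self.class.decode_from_string(line) }
    end
  end
end

ToyMARC::Reader.register_parser(ToyMARC::LineParser, :lines)
reader = ToyMARC::Reader.new("rec one\nrec two\n", readertype: :lines)
```

Because the reader only delegates `#each`, any object with that method and a way to decode a single record from a string can be plugged in.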

Read all about it on the github page.

3 Responses to “ruby-marc with pluggable readers”

  1. adam says:

    Bill, How is the performance as compared to other languages?

    -adam
  2. Bill says:

    Adam — not sure what you’re asking. Ruby vs. Perl? MARC-HASH-JSON vs. MARC-HASH-YAML?

  3. adam says:

    I was thinking ruby vs. perl vs. java
