Archives: September 2010

Ross Singer recently updated ruby-marc to include a #to_hash method that creates a data structure that is (a) round-trippable without any data loss, and (b) amenable to serializing to JSON. He’s calling it marc-in-json (even though the serialization is up to the programmer, it’s expected most of us will use JSON), and I think it’s the way to go in terms of JSON-able MARC data.

I wanted to take a quick look at the space/speed tradeoffs of using various means to serialize MARC records in the marc-in-json format compared to using binary MARC-21.

Why bother?

Binary MARC-21 is “broken” in that a lot of us have records that are so long (more than 99,999 bytes, the most the leader’s five-digit length field can express) it’s impossible to create a valid MARC binary record. The standard alternative, MARC-XML, has huge filesizes (roughly 3 times as large) and runs a lot more slowly in every benchmark I’ve ever run. For ruby-marc, the penalty for using XML is further exaggerated because the serializer is based on REXML and is super-slow.

There have been a few proposals for a MARC data structure that can easily be serialized to JSON (I had my own, in fact), but the stuff Ross has done with marc-in-json is preferable in being (a) not a ton bigger in terms of file size, and (b) much easier to query from a NoSQL database using something like JSONPath or JSONQuery.

What I’m testing

For this test, I used:

  • marc21 binary: This is the stock serialization/deserialization provided by ruby-marc.
  • YAJL for JSON: YAJL is a very fast C-based JSON library. Here we’re using the Ruby bindings and calling Yajl::Encoder.encode(r.to_hash) to serialize and MARC::Record.new_from_hash(Yajl::Parser.parse(JSON)) to deserialize.
  • Msgpack: The Msgpack project is explicitly designed to be “binary JSON” (smaller, faster, etc.) at the expense of human readability/editability. Again, this used the Ruby bindings.
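To make the round trip concrete, here is a minimal sketch that uses only Ruby's standard json library as a stand-in for YAJL, so it runs without any gems. The record is hand-written in the marc-in-json shape with made-up values, and serialize_record / deserialize_record are hypothetical helper names, not part of ruby-marc.

```ruby
require 'json'

# Hand-written record in the marc-in-json shape: a leader string plus an
# ordered array of single-key field hashes. Control fields map tag => string;
# data fields map tag => indicators plus an ordered array of subfields.
# All values are made up for illustration.
EXAMPLE_RECORD = {
  'leader' => '00000nam a2200000 a 4500',
  'fields' => [
    { '001' => 'demo0001' },
    { '245' => {
      'ind1' => '1',
      'ind2' => '0',
      'subfields' => [
        { 'a' => 'An example title /' },
        { 'c' => 'A. N. Author.' }
      ]
    } }
  ]
}.freeze

# Stand-in for Yajl::Encoder.encode(record.to_hash).
def serialize_record(hash)
  JSON.generate(hash)
end

# Stand-in for Yajl::Parser.parse(json); with ruby-marc the resulting hash
# would then be handed to MARC::Record.new_from_hash.
def deserialize_record(json)
  JSON.parse(json)
end

json = serialize_record(EXAMPLE_RECORD)
puts json.bytesize
puts deserialize_record(json) == EXAMPLE_RECORD
```

Because fields and subfields live in arrays rather than in hash keys, their order and any repeated tags survive the trip through JSON.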

The benchmark and its results

I’m interested in how long it takes to serialize and deserialize a single record. My primary use-case is sticking a single record into Solr, and then pulling the string representation of that record out and turning it back into MARC.

It’s entirely possible that trying to deal with a whole set of MARC records — as a JSON array of marc-in-json objects, or as a set of newline-delimited JSON or Msgpack objects — would yield different results. The former is especially interesting, since to parse a large JSON array one needs to use a streaming parser, which will almost certainly have a different profile in both processing and memory use.
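For the newline-delimited variant, a rough sketch (not something I benchmarked) might look like the following. It uses the standard json library in place of YAJL, and write_ndjson and each_ndjson_record are hypothetical helper names. The point is that each line parses on its own, so no streaming parser is needed.

```ruby
require 'json'
require 'stringio'

# Write one JSON document per line. JSON.generate escapes newlines inside
# strings, so a raw line break is a safe record delimiter.
def write_ndjson(io, hashes)
  hashes.each { |hash| io.puts(JSON.generate(hash)) }
end

# Yield one parsed record at a time, so only a single record is in memory.
def each_ndjson_record(io)
  return enum_for(:each_ndjson_record, io) unless block_given?

  io.each_line do |line|
    line = line.strip
    next if line.empty?

    yield JSON.parse(line)
  end
end

# Made-up records in the marc-in-json shape.
records = [
  { 'leader' => '00000nam a2200000 a 4500', 'fields' => [{ '001' => 'demo0001' }] },
  { 'leader' => '00000nam a2200000 a 4500', 'fields' => [{ '001' => 'demo0002' }] }
]

buffer = StringIO.new
write_ndjson(buffer, records)
buffer.rewind
each_ndjson_record(buffer) { |rec| puts rec['fields'].first['001'] }
```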

The ambitious can see the full source code of the benchmark.

Note that the following represent only the performance of ruby-marc and the particular serializers used. Other platforms or other libraries will certainly give different results!

Total of 18,880 records run 20 times (377,600 serialize/deserialize cycles per method) on my Mac OS X desktop; comparisons are to MARC21-Binary.

 SERIALIZING
   MARC Binary       357.02 s (100%)
   YAJL              312.65 s ( 88%)
   Msgpack           266.26 s ( 75%)

 DESERIALIZING
   MARC Binary       648.91 s (100%)
   YAJL              507.64 s ( 78%)
   Msgpack           459.73 s ( 71%)

 SERIALIZE + DESERIALIZE
   MARC Binary      1005.93 s (100%)
   YAJL              820.29 s ( 82%)
   Msgpack           725.99 s ( 72%)

 SIZE
   MARC Binary   31.15 MBytes (100%)
   Msgpack       42.00 MBytes (135%)
   JSON          55.99 MBytes (180%)
   XML           93.42 MBytes (300%)
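For anyone who wants to reproduce the shape of these measurements without the original script, here is a rough sketch of such a timing loop. It is not the benchmark that produced the numbers above: it uses Marshal and the standard json library as stand-in codecs so it runs without gems, and time_codecs, percent_of and STAND_IN_CODECS are hypothetical names.

```ruby
require 'benchmark'
require 'json'

# Time serialization and deserialization separately for each codec.
# `codecs` maps a label to a pair of lambdas: [encode, decode].
def time_codecs(records, codecs, iterations)
  codecs.each_with_object({}) do |(name, (encode, decode)), results|
    encoded = records.map { |rec| encode.call(rec) }
    ser = Benchmark.realtime { iterations.times { records.each { |rec| encode.call(rec) } } }
    de  = Benchmark.realtime { iterations.times { encoded.each { |str| decode.call(str) } } }
    results[name] = { serialize: ser, deserialize: de, bytes: encoded.sum(&:bytesize) }
  end
end

# Express a measurement as a rounded percentage of a baseline.
def percent_of(value, baseline)
  (100.0 * value / baseline).round
end

# Stand-in codecs from the standard library (not the ones measured above).
STAND_IN_CODECS = {
  'Marshal' => [->(h) { Marshal.dump(h) }, ->(s) { Marshal.load(s) }],
  'JSON'    => [->(h) { JSON.generate(h) }, ->(s) { JSON.parse(s) }]
}.freeze

# Made-up records in the marc-in-json shape.
sample = Array.new(200) do |i|
  { 'leader' => '00000nam a2200000 a 4500', 'fields' => [{ '001' => "demo#{i}" }] }
end

results = time_codecs(sample, STAND_IN_CODECS, 20)
base = results['Marshal']
results.each do |name, r|
  puts format('%-8s serialize %3d%%  deserialize %3d%%  size %3d%%',
              name,
              percent_of(r[:serialize], base[:serialize]),
              percent_of(r[:deserialize], base[:deserialize]),
              percent_of(r[:bytes], base[:bytes]))
end
```

The percentages in the table are this kind of ratio against the binary MARC row; for instance, percent_of(312.65, 357.02) gives the 88 shown for YAJL serialization.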

Analysis, such as it is

Obviously, there are size/speed tradeoffs. Nothing is as small as binary MARC21, but both YAJL and Msgpack are faster — significantly so for deserialization, which happens to be where I want the speed for my uses.

At 80% larger, the JSON serialization is quite a bit bigger, but it’s a hell of a lot smaller than MARC-XML and suffers none of the limitations of binary MARC.

For a closed system (i.e., you’re not worried about anyone else being able to read your data) such as a Blacklight installation, I’d be tempted to move to using JSON sooner rather than later.

One Response to “Size/speed of various MARC serializations using ruby-marc”

  1. Andy says:

    Minor correction: in the “why bother” section, valid binary MARC-21 can be up to 99999 (not 9999) bytes.

    Interesting write-up, I especially appreciate the metrics-based posts you do – always good to have some facts! Thanks.


VuFind Midwest gathering

September 16, 2010 at 11:55 am | Category: Uncategorized

A couple weeks ago, representatives from UMich (that’d be me), Purdue, Notre Dame, UChicago, and our hosts at Western Michigan got together in lovely Kalamazoo to talk about our VuFind implementations.

Eric Lease Morgan already wrote up his notes about the meeting, and I encourage you to go there for more info, but I’ll add my two cents here.

So, in light of that meeting, here’s what I’m thinking about VuFind of late:

  • None of us are running VuFind 1.0 as released with full catalog data. Eric has a special purpose portal running the current code over an aggregated special collection and hasn’t done much to the underlying PHP. The rest of us were running heavily modified versions of RC1. An issue we had in common was that the changes from RC1 to RC2 to 1.0 release were so significant, including some complete architectural change (some based on the stuff I’ve done with mirlyn), that the effort required to get up with 1.0 would be no less significant than the effort to switch wholesale to something else (e.g., Blacklight).

  • A point that I made that was echoed by others is that we need to remember that these new discovery systems are all just thin wrappers over Solr. They basically have two jobs: to get a query and format it in a way that Solr can handle, and then to take the Solr results and display them. There’s some sugar on top of that (exporting, tagging, etc) but that’s really it. The heavy lifting is all done by your indexer (Solrmarc for most, although watch this space for my announcement of my JRuby-based stuff today) and Solr itself. It’s not a hard problem, although it is occasionally a messy one.

  • VuFind has, in my mind, fundamental architectural issues mostly based on the inability to easily separate local code from core code. A re-architecture to base everything on subclasses of the core code would help, but at some point you start to run up against fundamental limitations of PHP and Smarty to do things cleanly. Without the ability to update core code and know it won’t affect your local code, there’s no good way to keep on track with the trunk of the code and do upgrades; for the same reason, it’s almost impossible to send changes back to trunk.

  • Coupled tightly to the architectural issues is the lack of tests. The code is potentially very brittle; there’s no good way to know if you’re breaking anything until you notice it’s broken. It’s not at all clear how to write good tests for the code, because there’s a lot of inter-dependencies.

  • The second big problem is one of community; to wit, there isn’t much of one. There are some active players, and there’s what seems like a great conference going on right now, so this may change. But, especially because of the technical difficulties in contributing local changes back, VuFind could use a benevolent dictator: someone who has organizing and administrating VuFind as part of his/her job. The last bit is important.

All of these are surmountable issues. The reason they’re at the top of my head, of course, is that the Blacklight community has, in many ways, already taken care of most of them.

If I were starting from scratch tomorrow, we’d already decided to do something locally, and I could convince my systems people to run a ruby implementation (I like JRuby myself), I’d go with Blacklight. If we were already looking at something like Summon, I’d take a hard, hard look at the build-vs-buy numbers. Summon and Primo both give you APIs to program an interface against, and boy, it might be worth the effort to do so and leave everything else alone.
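As a footnote to the “thin wrappers over Solr” point above, here is a minimal sketch of the two jobs. It is not VuFind or Blacklight code. The host, port and handler path are placeholders, and the format, id and title field names are hypothetical since they depend on your Solr schema; q, fq, rows and wt are ordinary Solr query parameters.

```ruby
require 'uri'
require 'json'

# Job one: turn what the user typed (plus any selected facets) into a
# Solr request URL. Host, port and path are placeholders.
def solr_select_url(user_query, facet_filters = {}, rows = 10)
  params = [['q', user_query], ['rows', rows], ['wt', 'json']]
  facet_filters.each { |field, value| params << ['fq', %(#{field}:"#{value}")] }
  "http://localhost:8983/solr/select?#{URI.encode_www_form(params)}"
end

# Job two: turn Solr's JSON response into something displayable.
# The 'id' and 'title' field names are hypothetical.
def result_lines(response_json)
  docs = JSON.parse(response_json).fetch('response').fetch('docs')
  docs.map { |doc| "#{doc['id']}: #{doc['title']}" }
end

puts solr_select_url('moby dick', 'format' => 'Book')

canned_response = '{"response":{"numFound":1,"start":0,"docs":[{"id":"1","title":"Moby Dick"}]}}'
puts result_lines(canned_response)
```

Everything else (exporting, tagging, saved records) is layered on top of those two steps, which is why the indexer and the Solr configuration end up mattering more than the wrapper.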

10 Responses to “VuFind Midwest gathering”

  1. The hard aspect of the problem is making something as flexible as libraries (reasonably IMO) require, that can be shared by multiple ‘customers’ with their own flexible configuration and customization, while still keeping a common codebase and being able to share updates.

    I think this actually IS hard. That Blacklight is kind of sort of able to accomplish it is a result of lots of hard work, and not entirely successful (yet).

    So one could say, forget that, let’s just have our own local homegrown wrapper on top of Solr. There are trade-offs to that, but it does certainly reduce the ‘hardness’ of the problem. Although I still think you’re left with a somewhat harder problem than you imply, with all the features we want in a modern library discovery system.

    Alternately, one could say, let’s pretty much drop the flexibility and customization. We’ll have some configuration for what Solr to point to, and we’ll have the simplest possible hooks for ILS-specific info and functionality (item status, request buttons, etc), but except for that, all ‘customers’ will pretty much be running the same thing, but for what they can manage to do in CSS alone. That would also simplify the problem a lot. (And is more or less the approach that most proprietary vendor software takes — when proprietary vendors have tried to make software as cleanly flexible as we’re trying to make BL, they haven’t generally succeeded very well. Which is again, in part, because it is not in fact an entirely easy problem. ).

  2. PS: I think just about all non-trivial open source projects need a benevolent dictator, until the community grows enough to have a benevolent junta instead. (But pretending you have a benevolent junta when most people on the junta don’t take their responsibilities seriously and figure some other junta member will take care of it, does not work).

    This is my impression of what’s almost a distributed-collaboration volunteer-done open source rule. There are some exceptions, but they will generally have some unusual characteristics that make them exceptions.

    If there isn’t an official benevolent dictator, and the open source project is seeming successful, there’s probably to some degree an unofficial one. Who is filling the role(s) can change over time, sometimes.

    One reason for this ‘law’ is that architecture matters. ESPECIALLY when you’re trying to create shareable “framework” style code, instead of just a custom fit application. And design/architecture by committee, especially a committee of people of varying levels of commitment, time, feeling of responsibility, skill, and perspective, doesn’t work out that well. You’ve got to have someone with a vision, whose vision makes sense (if it’s going to be successful), who feels an obligation/responsibility to apply that vision.

    One of the downsides of extreme focus on test-driven/agile/extreme programming, is that you can think that just cause that feature you added in an ad hoc way passes all the tests, that’s the only metric of evaluating the quality of your software. It isn’t, architecture matters. Which is starting to get talked about more, as a corrective to focusing too much on “get it done quick with a test”, bringing things back to balance. Here’s an example of the genre that might not be the best, but I happened to read this morning on reddit: http://www.infoq.com/news/2010/09/big-ball-of-mud

  3. And, I can’t stop talking, sorry. (I should turn this into my own blog post, but I know Bill doesn’t mind).

    Blacklight may be, at the moment, more successful at those things than VuFind, but it’s cause of a lot of sweat, and from the inside the battle isn’t over yet.

    And as far as Summon or Primo: If you ignore their interface entirely and just go with their APIs, then this isn’t going to be any easier than programming on top of Solr APIs. It’s going to be the same, in fact. And have the same issues we’re talking about here, go it alone homegrown, try to share something on top of those APIs, etc.

    The legit reason to me to go with summon or primo+primoCentral is because their indexes include publisher and aggregator supplied article-level data. That is something that is going to be VERY hard to do without vendor support. But just the APIs to query their indexes, with your own interface on top? What’s the “everything else” you think this will do for you, that you don’t get with your own interface on top of Solr already?

  4. David says:

    Interesting discussions. We’re about to start implementing Summon, and one of the things I like about it is its API. I’d started to think about having some other frontend (especially hearing that there is a connector already built for VUFind) but do wonder what the point of it would be.

    Summon has a reasonably nice (if sparse) interface and theoretically everything we want people to search is in there, so setting up VUFind or Blacklight as a frontend would seem to only give us more configuration choices in the page design, at the expense of some non-trivial setup and maintenance costs. Are there any other great benefits that I haven’t thought of?

  5. David says:

    Also, the whole benevolent dictator / junta thing I think is part of Koha’s problems at the moment.

    They have that set up for each release (with a release manager and other roles) but there’s no “steering committee” type organization to oversee the project. They have a really good community, but when one of the support companies goes rogue (spreading FUD, locking the community out of the website, etc.) there’s no coherent and official body to respond.

  6. This rings very true for me. A couple of thoughts: Separation of local and core code is a big problem we are currently working around. Additionally, it appears that VuFind has separated template code from other code but not presentation code from logic code. I’ve yet to be convinced of Smarty (why write a template interpreter/compiler in a language that is already very good for templating?) and would like to be able to just use PHP as my templating language.

  7. I tend to agree with you, David, about the unclear benefit of putting a custom interface on Summon.

    What people seem to be doing in VuFind is having a locally indexed content search, and in a separate tab a separate Summon search. As far as I know, VuFind won’t merge these results into one search; you get your local search, or your Summon search.

    Why would you want a local search, when all that local content is probably already in your Summon anyway? Well, having a local search gives you a LOT more control over relevancy ranking and faceting, and adding additional features like my date range timeline/histogram. Example: http://tinyurl.com/2b5wujr . You couldn’t do that in Summon. But you can’t do it in a Summon search just by wrapping it in VuFind either — all you can do is have a local search with features like that, and a separate Summon search. If you ARE going to have both, I suppose there are benefits to wrapping them in a consistent VuFind interface, with a single ‘saved records’ area etc (I’m not positive if VuFind does give you a single saved records area).

    But if you’re NOT planning on adding a local index search to Summon, I think you probably get minimal benefit wrapping Summon in VuFind.

    And if you ARE planning on adding a local index search to Summon, you’re giving the users two different searches, which is theoretically what you’re trying to get away from with Summon.

    But there are pros and cons to the Summon approach. You get all that indexed content, but you lose the control over your index that you can get with Solr/Blacklight/VuFind.

    Where I am, we’re currently choosing to focus on improving our local index search with Blacklight. But that does leave out the indexed article-level data in Summon (or PrimoCentral). Later we may try to combine them the way apparently some people are in VuFind, although it’s an unsatisfactory combination, I think.

  8. And also, when I’m talking about benevolent dictator/junta, I’m talking about developers. Who are actively working on writing and architecting code. I’m not talking about some non-developer administrative policy committee. To the extent successful distributed collaborative community open source projects have those (Apache does), they keep their hands off the code; such a policy committee is really a different thing meant to solve different problems, not quality-of-software problems. Maybe you need one, maybe you don’t, for those other problems, but I think when such a committee starts trying to direct software development, you get a bureaucratic nightmare, not the good software that I’m suggesting a benevolent (developer, architect) dictator may be required for.

  9. You guys should check out Villanova University’s implementation of VuFind and Summon https://library.villanova.edu/Find/Search/Home

    We currently use VuFind and are within a couple of weeks of implementing Summon.

  10. Sorry for being unclear in the last post. “We” being Stephen F. Austin State University.


Simple Ruby gem for dealing with ISBN/ISSN/LCCN

September 13, 2010 at 4:43 pm | Category: Uncategorized

I needed some code to deal with ISBN10->ISBN13 conversion, so I put in a few other functions and wrapped it all up in a gem called library_stdnums.

It’s only 100 lines of code or so and some specs, but I put it out there in case others want to use it or add to it. Pull requests at the github repo are welcome.

Functionality is all as module functions, as follows:

ISBN

  • char = StdNum::ISBN.checkdigit(ten-or-thirteen-digit-isbn)
  • boolean = StdNum::ISBN.valid?(ten-or-thirteen-digit-isbn)
  • thirteenDigitISBN = StdNum::ISBN.convert_to_13(ten-or-thirteen-digit-isbn)
  • tenDigitISBN = StdNum::ISBN.convert_to_10(ten-or-thirteen-digit-isbn)
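For readers curious what the ISBN arithmetic involves, here is a self-contained sketch of the standard check-digit and ISBN-10 to ISBN-13 rules. It is not the gem's source; isbn_chars, isbn10_checkdigit, isbn13_checkdigit and isbn10_to_13 are hypothetical names chosen for illustration. With the gem installed, the equivalent calls would be the StdNum::ISBN functions listed above.

```ruby
# Keep only the characters that matter in an ISBN: digits and X.
def isbn_chars(isbn)
  isbn.to_s.upcase.gsub(/[^0-9X]/, '')
end

# ISBN-10 check digit: weights 10 down to 2 over the first nine digits, mod 11.
def isbn10_checkdigit(first_nine)
  sum = first_nine.chars.each_with_index.sum { |ch, i| ch.to_i * (10 - i) }
  check = (11 - sum % 11) % 11
  check == 10 ? 'X' : check.to_s
end

# ISBN-13 check digit: weights alternate 1, 3 over the first twelve digits, mod 10.
def isbn13_checkdigit(first_twelve)
  sum = first_twelve.chars.each_with_index.sum { |ch, i| ch.to_i * (i.even? ? 1 : 3) }
  ((10 - sum % 10) % 10).to_s
end

# ISBN-10 to ISBN-13: prefix 978, drop the old check digit, compute a new one.
# Returns nil when the input is not ten characters long after cleanup.
def isbn10_to_13(isbn)
  chars = isbn_chars(isbn)
  return nil unless chars.length == 10

  stem = "978#{chars[0, 9]}"
  stem + isbn13_checkdigit(stem)
end

puts isbn10_to_13('0-306-40615-2')
puts isbn10_checkdigit('030640615')
```

Note that the ISBN-10 check digit can be X (standing for 10) while the ISBN-13 check digit is always a plain digit, which is why the conversion has to recompute it rather than carry it over.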

ISSN

  • char = StdNum::ISSN.checkdigit(issn)
  • boolean = StdNum::ISSN.valid?(issn)
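The ISSN check digit follows the same mod-11 idea with weights 8 down to 2. Again this is a standalone sketch rather than the gem's code, and issn_checkdigit and issn_valid? are hypothetical top-level helpers.

```ruby
# ISSN check digit: weights 8 down to 2 over the first seven digits, mod 11.
# A result of 10 is written as X.
def issn_checkdigit(issn)
  digits = issn.to_s.upcase.gsub(/[^0-9X]/, '')[0, 7]
  sum = digits.chars.each_with_index.sum { |ch, i| ch.to_i * (8 - i) }
  check = (11 - sum % 11) % 11
  check == 10 ? 'X' : check.to_s
end

# An ISSN is valid when it has seven digits plus a check character that
# matches the computed check digit. Hyphens and spaces are ignored.
def issn_valid?(issn)
  chars = issn.to_s.upcase.gsub(/[^0-9X]/, '')
  chars.match?(/\A\d{7}[\dX]\z/) && issn_checkdigit(chars) == chars[-1]
end

puts issn_checkdigit('0378-5955')
puts issn_valid?('0378-5955')
puts issn_valid?('0378-5956')
```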

LCCN

  • normalizedLCCN = StdNum::LCCN.normalize(lccn)
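LCCN normalization is the least obvious of the three, so here is a sketch of the Library of Congress normalization steps as I understand them: strip blanks, drop any slash and what follows it, then remove the hyphen and zero-pad the serial part to six digits. Whether the gem follows exactly these steps is an assumption on my part for this sketch, and normalize_lccn is a hypothetical name.

```ruby
# Normalize an LCCN following the Library of Congress steps:
#   1. remove all blanks
#   2. remove a forward slash and everything to its right
#   3. remove the hyphen and left-pad what followed it with zeros to six digits
def normalize_lccn(lccn)
  s = lccn.to_s.gsub(/\s+/, '')
  s = s.sub(%r{/.*\z}, '')
  prefix, serial = s.split('-', 2)
  return s if serial.nil?

  prefix + serial.rjust(6, '0')
end

puts normalize_lccn('n78-89035')
puts normalize_lccn(' 79139101 /AC/r932')
puts normalize_lccn('85-2 ')
```

Input that has no hyphen is returned as-is after the first two steps, so already-normalized values pass through unchanged.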

Again, there’s nothing special here — just letting folks know it’s out there.
