Why RDA is doomed to failure

[Note: edited for clarity thanks to rsinger's comment, below]

Doomed, I say! DOOOOOOOOOOMMMMMMMED!

My reasoning is simple: RDA will fail because it’s not “better enough.”

Now, those of you who know me might be saying to yourselves, “Waitjustaminute. Bill doesn’t know anything at all about cataloging, or semantic representations, or the relative merits of various encapsulations of bibliographic metadata. I mean, sure, he knows a lot about…err….hmmm…well, in any case, he’s definitely talking out of his ass on this one.”

First off, thanks for having such a long-winded internal monologue about me; it’s good to be thought of.

And, of course, you’re right on all counts. I don’t know what I’m talking about in any of those realms.

And yet I’m still willing to make a strong statement?

Yes. I am. Here’s why.

[Oh, and if you're convinced I'm wrong -- please say so. I'd love to be wrong about this.]

First, an assertion

The purpose of any bibliographic metadata is to facilitate three things:

  • Description/Identification. If you know what you want, does the metadata give you enough information to determine if the described item is what you want? Alternately, if you’re holding an item (or an alternate metadata representation of it), can you find the record that describes it?
  • Machine finding. Can a machine, given a good-enough query, find a work via a search of the metadata?
  • Machine grouping. Given the metadata, can a machine help a person find items “like this one”?

Take issue with one or more of those statements. I don’t care. The point I’m really trying to make is that any standard that doesn’t put unmediated machine reasoning at the forefront of what the metadata needs to support is living in a deep, deep hole.

Computer cycles are pretty cheap, and programmers are pretty smart. We can figure out how to do useful things with virtually any data, but only if we can reliably get at those data.

Getting 75% of the way there

Three-fourths of the problem can be addressed with one simple concept.

A solid equality relationship.

By this I mean that “=” had better damn well mean “equal,” as opposed to “probably the same, but there might be other representations, too.” If I want to say “A = B” (where A and B are authors, or works, or subjects, or anything that can be nailed down) there had better be no false positives and no false negatives. Ever. MARC’s use of “hopefully-unique strings” is ridiculously insufficient in the modern era.

RDA does pretty well with this, with URIs for appropriate concepts, so that’s good.
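To make the difference concrete, here is a minimal Ruby sketch of the two kinds of equality. The headings and URIs are invented for illustration; nothing here comes from real MARC or RDA data.

```ruby
# Hypothetical author entries: a display heading plus an identifier.
Author = Struct.new(:heading, :uri)

# "Hopefully-unique string" equality: compare the headings.
def same_by_string?(x, y)
  x.heading == y.heading
end

# Identifier equality: compare the URIs.
def same_by_uri?(x, y)
  x.uri == y.uri
end

a = Author.new('Smith, John, 1950-', 'http://example.org/authors/1')
b = Author.new('Smith, John',        'http://example.org/authors/1') # same person, variant heading
c = Author.new('Smith, John',        'http://example.org/authors/2') # different person, same heading

puts same_by_string?(a, b) # false negative: same person, different strings
puts same_by_string?(b, c) # false positive: different people, same string
puts same_by_uri?(a, b)    # identifiers agree
puts same_by_uri?(b, c)    # identifiers differ
```

The string comparison gets both cases wrong, while the identifier comparison gets both right, provided the identifiers themselves are assigned correctly.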

What’s wrong with it?

Well, it’s gonna cost money to access the spec, for starters. That’s just dumb.

But it’s also not flexible/extensible enough. It’s true that I’m not a cataloger. I do have an MS in computer science, though, and there is stuff in the various versions of the RDA spec that leads me to believe that the committee desperately, desperately needed some hardcore geeks on it. Computer science has basically done nothing but develop methods for abstraction and composition for decades, and that isn’t reflected enough here.

Language such as, “If it is determined that a mechanism for providing a direct link between a note and the instance of the element to which it relates is required,…” worries me. if? IF????? That’s not a spec. That’s a guideline. Nail it down, for god’s sake. When is it appropriate or inappropriate? How do you add links to multiple (but not all) instances of the element?

The spec also seems to describe at least half a dozen kinds of titles. One of these is “Abbreviated title.” Do we really want an abbreviated title? No. We want a title with an “abbreviated” modifier, so we can use that same modifier for, say, a corporate name or publisher or anything else. [Note: see rsinger's comment below, indicating this was a piss-poor example on my part.]

Well, sure, but it’s still better than the AACR2!

[This section updated to disambiguate my use of 'MARC' when I really meant 'AACR2 as commonly talked about in terms of MARC tags']

Of course it is. It’s just not better enough!

We’re not just talking about writing a spec. We’re talking about replacing every single tool in the library toolchain, from the ILS to editing software to OPACs to scripts that keep it all put together. We’ll be asking programmers to learn new skills and new ways of thinking, vendors to produce functional software for untested data formats, and catalogers to essentially take their whole brain out of their heads and get a new one.

But that, frankly, is the easy part. The entire culture of the library is built around AACR2 concepts and MARC data structures. The thought processes, the nomenclature: everything sometimes feels as if it’s built around three-digit tags. Most of the (crucial!) specialized vocabulary that librarians, experts, and specialists use to communicate with each other is directly or indirectly tied to MARC.

So, yeah, RDA is a hellofa lot better than AACR2/MARC. But in my view, it’s not better enough to justify all the pain. Switching is incredibly, astoundingly expensive both in terms of cost and in terms of the devaluation of institutional knowledge. We can’t do it every few years. We need to be damn sure we’re getting it right.

Data structures and Serializations

Jonathan Rochkind, in response to a long (and, IMHO, mostly ridiculous) thread on NGC4Lib, has been exploring the boundaries between a data model and its expression/serialization ( see here, here, and here ) and I thought I’d jump in.

What this post is not

There’s a lot to be said about a good domain model for bibliographic data. I’m so not the guy to say it. I know there are arguments for and against various aspects of the AACR2 and RDA and FRBR, and I’m unable to go into them.

What I am comfortable saying is this:

Anyone advocating or dismissing a data model based on the data structure or serialization most-often associated with that model is missing the goddamn point.

Data serializations

…are boring. They’re unimportant at the data modeling stage, and only barely important when thinking about data structures. For any given data structure there are lots of ways you can serialize it. A standard programming-language hash can be represented in a zillion ways, for example: yaml, json, various programming languages, .ini files, etc. Even MARC has two standard serializations (binary and xml) with several more actually in use (Aleph Sequential, for example).

So, let me repeat again, serializations are boring and not worth talking about until you’ve got everything else nailed down. Any format you can round-trip your data structure to/from is fine.
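As a rough illustration of the round-trip test, the sketch below pushes one hash through two serializations using Ruby's standard json and yaml libraries. The sample data is made up, and string keys are used on purpose because JSON has no symbol type.

```ruby
require 'json'
require 'yaml'

# True if the data survives a trip through both JSON and YAML unchanged.
def round_trips?(data)
  via_json = JSON.parse(JSON.generate(data))
  via_yaml = YAML.load(YAML.dump(data))
  via_json == data && via_yaml == data
end

# Made-up sample data: one structure, two serializations.
record = {
  'title'   => 'Never been deader',
  'authors' => ['Bill', 'Mike', 'Molly'],
  'year'    => 2003
}

puts JSON.generate(record)
puts YAML.dump(record)
puts round_trips?(record)
```

Both outputs look different on disk, but they carry exactly the same structure, which is the sense in which the choice between them is a matter of pain rather than expressiveness.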

Serializations are measured from “less pain” to “more pain”, but all have the exact same expressiveness. Data structures, on the other hand, do not.

A hierarchy of data structures

Think about the following data structures:

  • An ordered list
  • key-value pairs
  • A hierarchy (e.g., an XML document)
  • An undirected graph
  • A directed graph
  • A labeled, directed multigraph (e.g., a set of RDF Triples)

You don’t have to think very hard to see that any of these can be viewed as a restricted version of the data structures below it in the list. An ordered list (array) is just a set of key-value pairs where the keys represent each item’s sequence. A set of key-value pairs is a very, very flat hierarchy. A hierarchy is an undirected graph without cycles. An undirected graph is a directed graph where you’re careful to make links both ways. And a directed graph can easily be represented as a set of RDF triples (where you may, for example, only have one label for your relationships: “links to”).

[Note that I didn't say any of these would be efficient implementations!]
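A few of those steps can be sketched in Ruby. The function names are hypothetical, and plain arrays of three elements stand in for real RDF triples.

```ruby
# An ordered list viewed as key-value pairs: the keys are the positions.
def list_to_pairs(list)
  list.each_with_index.map { |item, i| [i, item] }.to_h
end

# An undirected graph viewed as a directed one: every edge goes both ways.
def undirected_to_directed(edges)
  edges.flat_map { |a, b| [[a, b], [b, a]] }
end

# A directed graph viewed as triples with a single label.
def directed_to_triples(edges, label = 'links to')
  edges.map { |from, to| [from, label, to] }
end

p list_to_pairs(%w[a b c])
p undirected_to_directed([[:x, :y]])
p directed_to_triples([[:x, :y], [:y, :z]])
```

Each conversion is a one-liner in this direction. Going the other way (say, squeezing labeled triples into a flat list) needs extra conventions that live outside the structure itself.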

The reverse is not true — or, at least, not without an incredible amount of “out of band” information in another layer somewhere.

The structures at the end of the list have more expressiveness. You can just plain model more things in them (give-or-take the out-of-band stuff, composition, etc) per unit of screwing around. I’m not going to try to model my set of key=value pairs in an array. I could do it, but it would take so much of my attention that the data modeling would suffer.

Don’t handicap yourself

Don’t start with the data structure.

DON’T START WITH THE DATA STRUCTURE!

GET THAT MOTHER-FREAKIN’ DATA STRUCTURE OFF MY MOTHER-FREAKIN’ PLANE!

Seriously. Don’t be stupid. If all you’ve got is a hammer, everything starts to look like a thumb.

If you start off with a restrictive data structure before you even fully define the domain you’re trying to model, you may hose yourself. You may end up making stupid decisions based on the toolchain you’re imagining in your head.

Domain modeling is ridiculously hard for any domain worth modeling. If you start with a handicap (a restrictive data structure) it’s going to be even harder.

No one would think of trying to model bibliographic data using only arrays. That’s premature optimization on an epic scale.

The appeal of RDF Triples

Even if you ignore all the semantics and rules that make RDF Triples a value-added instance of a labeled, directed multigraph, the appeal (to me, anyway) is that any semantic model based on RDF Triples has enormous expressive power at its disposal.

Does it turn out that after you’ve fully satisfied the necessary model for the domain, the semantics you need can actually be accomplished with something lower down in the list? Awesome. Go with it. You’ll get great implementations with good real-life computing characteristics. A database can often usefully be thought of as an implementation of an undirected graph with typed nodes (and, perhaps, some typed links, if you use the column name in the calling table as a “type” of sorts, and add some out-of-band knowledge). And lord knows RDBMS’s have great performance characteristics.

But don’t start there. Start with the domain. Model it. Figure out what you need to describe and derive. Then pick the most appropriate data structure.

The nightmare that is MARC

MARC-the-data-structure (not to be confused with a serialization of that data structure, on the one hand, or with the AACR2 on the other) can incompletely (but usefully, I think) be described as:

  • A set of key-value pairs
  • …that have a defined order
  • …where keys can be repeated
  • …and values are strings
  • …and keys are a concatenation of tag/ind1/ind2/code

Control fields are especially restricted (ind1, ind2, and code are all ‘null’). There’s been some bullshit attempts at links (e.g., the 880 fields) but really, this is it.
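Taking that description literally, a record can be sketched in Ruby as an ordered array of [key, value] pairs. The layout and the helper name are assumptions made for illustration; the sample values are borrowed from the MARC-HASH example later in this collection.

```ruby
# MARC-the-data-structure as an ordered list of [key, value] pairs.
# Key = [tag, ind1, ind2, subfield code]; values are always strings.
FLAT_RECORD = [
  [['001', nil, nil, nil], '001 value'],            # control field: ind1, ind2, code are nil
  [['035', ' ', ' ', 'a'], '(RLIN)MIUG0000733-B'],
  [['035', ' ', ' ', 'a'], '(CaOTULAS)159818014'],  # the same key, repeated
  [['245', '1', '0', 'a'], 'Capitalism, primitive and modern;'],
  [['245', '1', '0', 'b'], 'some aspects of Tolai economic growth']
].freeze

# All values for a tag/code combination, in record order.
def values_for(record, tag, code)
  record.select { |(t, _ind1, _ind2, c), _value| t == tag && c == code }
        .map { |_key, value| value }
end

p values_for(FLAT_RECORD, '035', 'a')
p values_for(FLAT_RECORD, '245', 'b')
```

Note that this flat view cannot say which subfields belong to the same occurrence of a repeated field, which is one reason the description above is only an incomplete one.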

It doesn’t give us much to work with. It’s restricted. And, sadly, so is our thinking.

Putting the cart before the horse

As Jonathan (and zillions of others) rightly point out, a huge problem in the library world is that there are generations (plural) of working librarians who, because of years of practice, find it incredibly hard to think about bibliographic data as modeled outside the constraints inherent in the MARC data structure. It’s a handicap. It’s an anchor around our necks.

MARC-the-data-model (nee AACR2) is not inherently bad because it’s built on an impoverished data structure. It’s bad because it does a shitty job at modeling the bibliographic data space. If we could produce a good model in a crappy data structure like that, well, that’d be awesome because it would indicate that things are simple.

Things, of course, aren’t simple. They’re hard.

So, if you want to complain about MARC or RDA or FRBR, figure out what it’s trying to model and talk about the fidelity of the model with respect to the problem space. But don’t conflate data models, data structures, and serializations.

Oh, and don’t say “PIN Number” or “ATM Machine.” That drives me crazy, too.

Stupid catalog tricks: Subject Headings and the Long Tail

Library of Congress Subject Headings (LCSH) in particular.

I’ve always been down on LCSH because I don’t understand them. They kinda look like a hierarchy, but they’re not really. Things get modifiers. Geography is inline and …weird.

And, of course, in our faceting catalog when you click on a linked LCSH to do an automatic search, you often get nothing but the record you started from. Which is super-annoying.

So, just for kicks, I ran some numbers.

The process

I extracted every 650 field with indicator2=“0” from our catalog, threw away the subfield 6’s, and threw away any trailing punctuation in any of the subfields. I called the concatenation of what was left a unique LCSH.

Then I printed them out and put them all onto index cards, using tick-marks to indicate…

No, of course not. I used sort, uniq -c, and wc -l. Here’s what I found.
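For anyone who would rather stay in one language, the same counting can be sketched in Ruby. This is not the pipeline actually used above: it assumes each field arrives as an array of [code, value] subfield pairs, and the set of characters treated as “trailing punctuation” is a guess.

```ruby
# Build the heading string: drop subfield 6, strip trailing punctuation, concatenate.
# Which characters count as trailing punctuation is an assumption of this sketch.
def heading_key(subfields)
  subfields
    .reject { |code, _value| code == '6' }
    .map { |code, value| '$$' + code + value.sub(/[\s.,;:\/]+\z/, '') }
    .join
end

# Rough equivalent of `sort | uniq -c`: heading => number of occurrences.
def count_headings(fields)
  counts = Hash.new(0)
  fields.each { |subfields| counts[heading_key(subfields)] += 1 }
  counts
end

# Fraction of unique headings that appear exactly once.
def singleton_share(counts)
  return 0.0 if counts.empty?
  counts.values.count(1).to_f / counts.size
end

fields = [
  [['a', 'Economics'], ['x', 'History'], ['v', 'Sources.']],
  [['a', 'Economics'], ['x', 'History'], ['v', 'Sources']],
  [['a', 'Piano music.']]
]
counts = count_headings(fields)
counts.each { |heading, n| puts "#{n} #{heading}" }
puts singleton_share(counts)
```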

Counts of LCSH

…in round numbers.

In our catalog, there are:

  • 8.50M subject headings (using the definition above)
  • 1.87M unique subject headings
  • …66% of which (1.23M) appear exactly once

We only have to go out to 30K subjects to account for half of all subject entries. The top 1000 most-used subjects account for 14.5% of all 8.5M subject entries.

The top ten subjects by count are:

  • 6029 $$aSermons, American
  • 6131 $$aPhilosophy
  • 7224 $$aFeature films
  • 7591 $$aPiano music
  • 7968 $$aSocialism
  • 8796 $$aEconomics
  • 9185 $$aCommunism
  • 12440 $$aSermons, English$$y17th century
  • 13539 $$aBills, Private$$zUnited States
  • 58823 $$aEconomics$$xHistory$$vSources

From a record’s point of view

Our catalog has:

  • 7M records
  • 4.4M records with at least one subject (as defined above)
  • 2.4M records with more than one subject
  • 2.0M records with exactly one subject
  • 2.6M records with zero subjects

The records with the most subject headings tend to be collections of stuff (theses, photos, etc). Our local standout is the Dept. of Medicine and Surgery (University of Michigan) theses, 1851-1878 with 208 subject entries. 14 records have at least 30 subject entries.

What it means

Gee, lady, I don’t know.

One way to look at it: suppose you’re considering defining subjects in this way, and making them “hot” in the catalog interface. For our data, 2/3 of records would have either no subjects or a subject that found only the record you’re at. So…think again.

In real life, we index lots of possible subject fields, and we additionally index the $$a as well as the whole string, so ours are a little bit more useful. A little.

Why bother with threading in jruby? Because it’s easy.

Lately on the #code4lib IRC channel, several of us have been knocking around different versions (in several programming languages) of programs to read in a ginormous file and do some processing on each line. I noted some speedups related to multi-threading, and someone (maybe rsinger?) said, basically, that to bother with threading for a one-off simple program was a waste.

Well, it turns out I’ve been trying to figure out how to deal with threading in jruby anyway. And I think I have a pretty elegant solution — a generic “threaded each” I’m calling threach.

  enumerable_object.threach(number_of_threads, :which_iterator) do |i|
    do_something_threadsafe(i)
  end

Some examples

  # You like #each? You'll love…err..probably like #threach
  load 'threach.rb'

  # Process with 2 threads. It assumes you want 'each'
  # as your iterator.
  (1..10).threach(2) {|i| puts i.to_s}

  # You can also specify the iterator
  File.open('mybigfile') do |f|
    f.threach(2, :each_line) do |line|
      processLine(line)
    end
  end

  # threach does not care what the arity of your block is
  # as long as it matches the iterator you ask for

  ('A'..'Z').threach(3, :each_with_index) do |letter, index|
    puts "#{index}: #{letter}"
  end

  # Or with a hash
  h = {'a' => 1, 'b'=>2, 'c'=>3}
  h.threach(2) do |letter, i|
    puts "#{i}: #{letter}"
  end

threach.rb adds to the Enumerable module to provide a threaded version of whatever enumerator you throw at it (each by default).

How does it work?

How about I just put the source here. It’s short.

  require 'thread'
  module Enumerable

    def threach(threads=0, iterator=:each, &blk)
      if threads == 0
        # Just call the iterator itself
        self.send(iterator, &blk)
      else
        bq = SizedQueue.new(threads * 4)
        consumers = []
        threads.times do |i|
          consumers << Thread.new do
            until (a = bq.pop) === :end_of_data
              blk.call(*a)
            end
          end
        end

        # The producer
        count = 0
        self.send(iterator) do |*x|
          bq.push x
          count += 1
        end
        # Now end it
        threads.times do
          bq << :end_of_data
        end
        # Do the join
        consumers.each {|t| t.join}
      end
    end
  end

That’s it. If threads=0, just use the iterator itself. If not:

  • Create a SizedQueue. It is thread-safe by definition and acts as the glue between the consumers and the main-thread producer.
  • Start a set of consumer threads that basically just pull an item out of the queue and then run the given block on it. Bail when you see the end_of_data token. These consumer threads all immediately block because there’s nothing in the SizedQueue yet.
  • Populate the SizedQueue. When you run out of stuff to add, push on an end_of_data token for each consumer thread.
  • Call join on the threads so the main program waits until every consumer has finished.

Why use it?

Well, if you’re using stock ruby — you probably shouldn’t. It’ll just slow things down. But if you’re using a ruby implementation that has real threads, like JRuby, this will give you relatively painless multi-threading.

You can always do something like:

  if defined? JRUBY_VERSION
    numthreads = 3
  else
    numthreads = 0
  end

  my_enumerable.threach(numthreads) {|i|}

Note the “relatively” up there. The block you pass still has to be thread-safe, and there are many data structures you’ll encounter that are not thread-safe. Scalars, arrays, and hashes are, though, under JRuby, and that’ll get you pretty far.
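When the block does have to touch shared state, one common approach is to guard the shared update with a Mutex. The sketch below is only that, a sketch: it includes a condensed copy of threach so it runs on its own, and the threaded_sum helper and the locking strategy are assumptions, not something the original code prescribes.

```ruby
require 'thread'

# Condensed copy of threach so this snippet is self-contained.
module Enumerable
  def threach(threads = 0, iterator = :each, &blk)
    return send(iterator, &blk) if threads == 0
    queue = SizedQueue.new(threads * 4)
    consumers = Array.new(threads) do
      Thread.new do
        until (item = queue.pop) == :end_of_data
          blk.call(*item)
        end
      end
    end
    send(iterator) { |*x| queue.push(x) }      # producer runs in the main thread
    threads.times { queue << :end_of_data }    # one stop token per consumer
    consumers.each(&:join)
  end
end

# Hypothetical helper: sum of squares with the shared total behind a lock.
def threaded_sum(enum, threads)
  total = 0
  lock = Mutex.new
  enum.threach(threads) do |i|
    squared = i * i                         # per-item work needs no lock
    lock.synchronize { total += squared }   # the shared update does
  end
  total
end

puts threaded_sum(1..100, 3)
```

Keeping the expensive work outside the synchronize block matters: if everything happens under the lock, the threads just take turns and the speedup disappears.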

Pushing MARC to Solr; processing times and threading and such

[This is in response to a thread on the blacklight mailing list about getting MARC data into Solr.]

What’s the question?

The question came up, “How much time do we spend processing the MARC vs trying to push it into Solr?”. Bob Haschart found that even with a pretty damn complicated processing stage, pushing the data to solr was still, at best, taking at least as long as the processing stage.

I’m interested because I’ve been struggling to write a solrmarc-like system that runs under JRuby. Architecturally, the big difference between my stuff and solrmarc is that I use the StreamingUpdateSolrServer (on Erik Hatcher’s suggestion). So I thought I’d check how things break down for me.

Here are my numbers running under JRuby (using MARC4J as the marc implementation) with the Solr StreamingUpdateSolrServer. Obviously, there are a lot of differences between this and solrmarc, but I’m hoping that while it’s not comparing apples to apples, it’s at least comparing apples to some sort of processed cheese-like product.

What work is being done on what?

The data set is a file of 18,881 MARC records in marc-binary format. It’s probably not big enough to get a great idea of how things will run over the long (many millions of records) haul, but it’ll do for this rough-cut stuff.

I break my processing down into five categories:

  • Read the records into marc4j objects and do nothing. This is a baseline of sorts.
  • The “normal” fields are anything that you could do with SolrMarc without a custom routine; the actual processing is done in JRuby.
  • Custom fields are generated with JRuby code, but these are things that in solrmarc would require a custom routine.
  • The big “allfields” field is text from tags 100 through 900.
  • The “to_xml” routine is just calling the underlying marc4j XML output and stuffing it into a string.

The schema used is our normal UMICH schema except for High Level Browse (which appears in our catalog as “Academic Discipline”). The code for that is written in Java, and I just call it from JRuby when I’m using it. I excluded it because it’s incredibly expensive, both at startup time (when it loads a giant database of call-number ranges and associated categories) and for processing: there’s a lot of call-number normalization, long-string comparisons, some modified binary searches, etc. etc. etc. It’s expensive. Trust me.

The Solr server itself is on a different, incredibly-beefy machine, and is emptied out before each invocation that involves actually pushing data to it (with a delete-by-query :).

How fast were things on my desktop?

  • 18,881 records in marc-binary format
  • Times are in seconds, run on my desktop
  • Remember, you can’t compare these numbers to Bob’s because we’re doing different things to different data.

Total seconds  Description
19             Just read the records with marc4j and do nothing.
85             Read and do 35 “normal” fields (no custom)
104            Read, 35 normal, 15 custom fields
110            Read, normal, custom, allfields
129            Read, normal, custom, allfields, to_xml
136            Read, normal, custom, allfields, to_xml, 2-threaded SUSS, commit every 5K docs
142            Read, normal, custom, allfields, to_xml, 1-threaded SUSS, commit every 5K docs
124            Read, normal, custom, allfields, to_xml, 1-threaded SUSS, commit every 5K docs, 2 threads doing processing

We can also break the same numbers down as:

Seconds  Description
19       read the records and do nothing
66       process the 35 normal fields
19       process the 15 custom fields
6        generate the “allfields” field
19       generate the XML (yowza!)
7        send to solr with two threads
13       send to solr with one thread
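The per-stage figures are just differences between consecutive rows of the cumulative table. A small Ruby sketch of that arithmetic, using the numbers reported above (the constant and method names are made up):

```ruby
# Cumulative seconds from the first table, in the order the stages were added.
CUMULATIVE = {
  'read'          => 19,
  'normal fields' => 85,
  'custom fields' => 104,
  'allfields'     => 110,
  'to_xml'        => 129
}.freeze

# Cost of each stage = its cumulative total minus the previous total.
def stage_costs(cumulative)
  previous = 0
  cumulative.map do |stage, total|
    cost = total - previous
    previous = total
    [stage, cost]
  end.to_h
end

# Cost of sending to Solr = run including the send minus the processing-only run.
def send_cost(total_with_send, processing_total)
  total_with_send - processing_total
end

p stage_costs(CUMULATIVE)
puts send_cost(136, 129) # two SUSS threads
puts send_cost(142, 129) # one SUSS thread
```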

Or like this:

Seconds  Description
129      do all the reading and processing
13       send to solr with one thread

Why does solr processing seem so much faster for me?

There are a lot of reasons why my submit-to-solr might seem like less of a burden. The ones I can think of off the top of my head are:

  • SUSS is just faster than whatever solrmarc does.
  • My processing stage is so much slower than solrmarc’s (due to algorithms or jruby-vs-java, I don’t know) that the “push to solr” portion of it gets swallowed up by the slowness of the overall code.
  • The Solr server is so much faster than my desktop that my poor little desktop can’t send it data fast enough to work it.

For my setup, obviously adding a processing thread is a lot more beneficial than adding a SUSS thread. My desktop doesn’t have that many threads lying around (adding a third processing thread actually slowed things down), so I moved the code to a beefier machine to see what happened.

Trying the same thing on a beefy machine

This is the exact same code and data, but on a beefy machine (16 cores, gobs of memory).

Time (seconds)  SUSS threads  Processing threads
70              1             1  (was 142 seconds on the desktop)
47              1             2
39              1             3
35              1             4
68              2             1
48              2             2
38              2             3
34              2             4

So, on my hardware anyway, there’s a sweet spot with one suss thread and three processing threads. YMMV, of course.

What have we learned?

I’m not sure, to be honest. It’s logistically difficult for me to do the same process in solrmarc because I’d have to rebuild everything without the HLB stuff. I guess what I’ve learned is that if I’m going to continue working on my code, the places to focus my attention are threading (obviously) and MARC-XML generation.

ruby-marc with pluggable readers

I’ve been messing with easier ways of adding parsers to ruby-marc’s MARC::Reader object. The idea is that you can do this:

  require 'marc'
  require 'my_marc_stuff'

  mbreader = MARC::Reader.new('test.mrc') # => Stock marc binary reader
  mbreader = MARC::Reader.new('test.mrc', :readertype=>:marcstrict) # => ditto

  MARC::Reader.register_parser(My::MARC::Parser, :marcstrict)
  mbreader = MARC::Reader.new('test.mrc') # => Uses My::MARC::Parser now

  xmlreader = MARC::Reader.new('test.xml', :readertype=>:marcxml)

  # …and maybe further on down the road

  asreader = MARC::Reader.new('test.seq', :readertype=>:alephsequential)
  mjreader = MARC::Reader.new('test.json', :readertype=>:marchashjson)

A parser need only implement #each and a module-level method #decode_from_string.
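To show the general shape of such a registry, here is a toy version in plain Ruby. It is not ruby-marc's code: PluggableReader, LineParser, and the :lines reader type are all hypothetical, and the “records” are just lines of a string.

```ruby
# Hypothetical sketch of a pluggable-reader registry; not ruby-marc's implementation.
class PluggableReader
  include Enumerable

  @parsers = {}

  class << self
    # Associate a parser class with a reader type.
    def register_parser(parser_class, readertype)
      @parsers[readertype] = parser_class
    end

    def parser_for(readertype)
      @parsers.fetch(readertype) do
        raise ArgumentError, "No parser registered for #{readertype.inspect}"
      end
    end
  end

  def initialize(source, opts = {})
    readertype = opts.fetch(:readertype, :marcstrict)
    @parser = self.class.parser_for(readertype).new(source)
  end

  # Delegate iteration to whichever parser was chosen.
  def each(&blk)
    @parser.each(&blk)
  end
end

# A toy parser implementing the two required pieces: #each and decode_from_string.
class LineParser
  def initialize(source)
    @source = source
  end

  def each
    @source.each_line { |line| yield self.class.decode_from_string(line) }
  end

  def self.decode_from_string(str)
    str.chomp
  end
end

PluggableReader.register_parser(LineParser, :lines)
PluggableReader.new("first\nsecond\n", :readertype => :lines).each { |rec| puts rec }
```

Because the reader only ever calls #each on the parser, swapping in a different parser for the same reader type is a one-line registration.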

Read all about it on the github page.

New interest in MARC-HASH / JSON

For reasons I’m still not entirely clear on (I wasn’t there), the Code4Lib 2010 conference this week inspired renewed interest in a JSON-based format for MARC data.

When I initially looked at MARC-HASH almost a year ago, I was mostly looking for something that wasn’t such a pain in the butt to work with, something that would marshal into multiple formats easily and would simply and easily round-trip.

Now, though, a lot of us are looking for a MARC format that (a) doesn’t suffer from the length limitations of binary MARC, but (b) is less painful (both in code and processing time) than MARC-XML, and it’s worth re-visiting.

For at least a few folks, un-marshaling time is a factor, since no matter what you’re doing, processing XML is slower than most other options. Much of that expense is due to functionality and data-safety features that just aren’t a big win with a brain-dead format like MARC, so it’s worth looking at alternatives.

What is MARC-HASH?

At some point, we’ll want a real spec, but right now it’s just this:

  # A record is a four-pair hash, as follows. UTF-8 is mandatory.
  {
    "type" : "marc-hash",
    "version" : [1, 0],
    "leader" : "…leader string … ",
    "fields" : [array, of, fields]
  }

  # A field is an array of either 2 or 4 elements
  [tag, value] # a control field
  [tag, ind1, ind2, [array, of, subfields]] # a data field

  # A subfield is an array of two elements

  [code, value]

So, a short example:

  {
    "type" : "marc-hash",
    "version" : [1, 0],

    "leader" : "leader string",
    "fields" : [
      ["001", "001 value"],
      ["002", "002 value"],
      ["010", " ", " ",
        [
          ["a", "68009499"]
        ]
      ],
      ["035", " ", " ",
        [
          ["a", "(RLIN)MIUG0000733-B"]
        ]
      ],
      ["035", " ", " ",
        [
          ["a", "(CaOTULAS)159818014"]
        ]
      ],
      ["245", "1", "0",
        [
          ["a", "Capitalism, primitive and modern;"],
          ["b", "some aspects of Tolai economic growth" ],
          ["c", "[by] T. Scarlett Epstein."]
        ]
      ]
    ]
  }
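To get a feel for working with the format, here is a small Ruby sketch that holds a marc-hash record as a plain hash, pulls out a subfield, and round-trips it through JSON. The helper names are hypothetical and the record is a shortened version of the example above.

```ruby
require 'json'

# A marc-hash record as a plain Ruby hash (shortened from the example above).
MARC_HASH = {
  'type'    => 'marc-hash',
  'version' => [1, 0],
  'leader'  => 'leader string',
  'fields'  => [
    ['001', '001 value'],
    ['035', ' ', ' ', [['a', '(RLIN)MIUG0000733-B']]],
    ['245', '1', '0', [['a', 'Capitalism, primitive and modern;'],
                       ['b', 'some aspects of Tolai economic growth']]]
  ]
}.freeze

# Control fields have 2 elements; data fields have 4.
def control_field?(field)
  field.length == 2
end

# First value for the given tag and subfield code, or nil if there is none.
def subfield_value(record, tag, code)
  record['fields'].each do |field|
    next if control_field?(field) || field[0] != tag
    field[3].each { |c, v| return v if c == code }
  end
  nil
end

# True if the record survives JSON serialization unchanged.
def json_round_trips?(record)
  JSON.parse(JSON.generate(record)) == record
end

puts subfield_value(MARC_HASH, '245', 'a')
puts json_round_trips?(MARC_HASH)
```

Since the structure is nothing but strings, arrays, and one hash, any JSON library can read and write it without custom code.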

How's the speed?

I think it's important to separate the format marc-hash from the eventual marshaling format -- partly because someone might actually want to use something other than JSON as a final format, but mostly because it forces implementations to separate hash-creation from json-creation and makes it easy to swap in different json libraries when a faster one comes along.

Having said that, in real life people are mostly concerned about JSON. So, let's look at JSON performance.

The MARC-Binary and MARC-XML files are normal files, as you'd expect. The JSON file is "Newline-Delimited JSON" -- a single JSON record on each line.
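A minimal sketch of reading and writing newline-delimited JSON with Ruby's standard json library follows. The method names are made up, and a StringIO stands in for a real file.

```ruby
require 'json'
require 'stringio'

# Write one JSON document per line. JSON.generate escapes newlines inside
# strings, so each record stays on a single line.
def write_ndj(records, io)
  records.each { |rec| io.puts(JSON.generate(rec)) }
end

# Read lazily: one line in, one record out. Returns an Enumerator without a block.
def each_ndj(io)
  return enum_for(:each_ndj, io) unless block_given?
  io.each_line do |line|
    line = line.strip
    next if line.empty?
    yield JSON.parse(line)
  end
end

records = [
  { 'type' => 'marc-hash', 'leader' => 'first',  'fields' => [['001', 'a']] },
  { 'type' => 'marc-hash', 'leader' => 'second', 'fields' => [['001', 'b']] }
]

io = StringIO.new
write_ndj(records, io)
io.rewind
each_ndj(io) { |rec| puts rec['leader'] }
```

Because each line is a complete document, a reader never has to hold more than one record in memory, which is what makes the format stream-friendly.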

The benchmark code looks like this:

  # Unmarshal
  x.report("MARC Binary") do
    reader = MARC::Reader.new('test.mrc')
    reader.each do |r|
      title = r['245']['a']
    end
  end

  # Marshal
  x.report("MARC Binary") do
    reader = MARC::Reader.new('test.mrc')
    writer = MARC::Writer.new('benchout.mrc')
    reader.each do |r|
      writer.write(r)
    end
    writer.close
  end

Under MRI, I used the nokogiri XML parser and the yajl JSON gem. Under JRuby, it was the jstax XML parser and the json-jruby JSON gem.

The test file is a set of 18,831 records I've been using for all my benchmarking of late. It's nothing special; just a nice size.

Marshalling Speed (read from binary marc, dump to given format)

Times are in seconds on my Macbook laptop, using ruby-marc.

Format       Ruby 1.8.7  Ruby 1.9  JRuby 1.4  JRuby 1.4 --1.9
XML          393         443       188        356
MARC Binary  36          23        23         25
JSON/NDJ     31          19        25         ERROR

Unmarshalling speed (from pre-created file)

Again, times are in seconds

Format       Ruby 1.8.7  Ruby 1.9  JRuby 1.4  JRuby 1.4 --1.9
XML          113         89        75         89
MARC Binary  29          16        16         19
JSON/NDJ     17          9         13         16

And so...

I'm not sure what else to say. The format is totally brain-dead. It round-trips. It's fast enough. It has no length limitations. We define it to have to be UTF-8. The NDJ format is easy to understand and allows streaming of large document collections.

If folks are interested in implementing this across other libraries, that'd be great. Any thoughts?

OCLC still not (NO! They are!) normalizing their LCCNs

NOTE 2: It turns out that I did find a minor bug in the system, but that in general LCCN normalization is working correctly. I just happened to hit a weirdness with a bad LCCN and a little bug in the parser on their end. Which is getting fixed. So…good news all around, and huge kudos to Xiaoming Liu for his quick response!
**NOTE** It strikes me that I haven’t seen a case where bad data results from sending a valid LCCN. The only verified problem is one of false negatives. Send a valid lccn, you’ll get back either good data or nothing (and the “nothing” might be in error). So, still a big problem, but not as THESKYISFALLING as I imply below.

A long time ago, Jonathan Rochkind noted that the OCLC doesn’t correctly normalize their LCCNs.

Well, it’s not fixed.

I could really, really use the xlccn service right about now — a great web service they provide that, much like xisbn and xissn and the other xXXXX (heh!) services, purports to allow you to put in an lccn and get data back on the item you’re interested in.

Except they “normalize” their LCCNs in a way that is not only incorrect, but causes namespace collisions. As near as I can tell, they throw out any leading non-digits and only keep up to the next non-digit.

The xLCCN service will silently provide no data or incorrect data for many LCCN requests!

An example:

  • (F) Full LCCN is “sn 83011407”
  • (D) First set of digits is “83011407”. This is what I think the OCLC is indexing.
  • (N) Correct normalization is “sn83011407”

The problem, of course, is that (D) “83011407” is itself a valid LCCN.

  • (F) is associated with OCLC# 47212967
  • (D) is associated with OCLC# 12505148. That’s not the same record.
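The collision is easy to reproduce in a few lines of Ruby. The first method follows the Library of Congress normalization rules as I understand them (remove blanks, drop a slash and everything after it, remove a hyphen and zero-pad what follows to six digits). The second is only a guess at the suspected behaviour described above, not OCLC's actual code.

```ruby
# LCCN normalization following the Library of Congress rules as understood here.
def normalize_lccn(lccn)
  s = lccn.gsub(/\s+/, '')      # 1. remove all blanks
  s = s.sub(%r{/.*\z}, '')      # 2. drop a forward slash and everything after it
  if s.include?('-')            # 3. remove the hyphen, zero-pad the rest to six digits
    prefix, serial = s.split('-', 2)
    s = prefix + serial.rjust(6, '0')
  end
  s
end

# The suspected behaviour: skip leading non-digits, keep only the first run of digits.
def first_digits_only(lccn)
  lccn[/\d+/]
end

full = 'sn 83011407'
puts normalize_lccn(full)       # the prefix survives
puts first_digits_only(full)    # the prefix is lost
puts first_digits_only(full) == normalize_lccn('83011407') # two different LCCNs now collide
```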

So, how do the OCLC services respond?

  • (F) Worldcat search finds the correct record (probably just doing a string match); xid finds nothing
  • (D) Worldcat finds both correct and incorrect records. The xLCCN service finds only the incorrect record, OCLC# 12505148.
  • (N) Neither worldcat nor xid finds anything for the correctly normalized version.

So, what am I supposed to do? Only use the service on LCCNs where the original and normalized versions are the same and include only digits? Frustrating.

Indexing data into Solr via JRuby (with threads!)

[Note: in this post I'm just going to focus on the "get stuff into Solr" part. My normal focus -- MARC data -- will make an appearance in the next post when I talk about using this in addition to / instead of solrmarc.]

Working with Solr

I love me the Solr. I love everything about it except that the best way to interact with it is via Java. I don’t so much love me the java.

So…taking Erik Hatcher’s lead and advice, as I will do whenever he offers either, I wrote some code to work within JRuby to deal with Solr.

Getting the code

I’ve added the gems to gemcutter, if you want to play along at home:

  • jruby_producer_consumer (github, rdoc.info) Ruby syntax for threaded operations under jruby
  • jruby_streaming_update_solr_server (github, rdoc.info) Ruby syntax on top of the Java class of the same name
  • marc4j4r (github, rdoc.info) Ruby syntax on top of the marc4j java library.

WARNING: None of these gems have a 1.0 version tag on them, and that means that the API may change a titch in the future. Also, the fact that they’re released as gems means that it’s easy to release gems, not that I’m not an idiot.

The basics: Using SolrInputDocument and StreamingUpdateSolrServer

OK, with the disclaimer out of the way, let’s look at some code.

    require 'rubygems'
    require 'jruby_streaming_update_solr_server'

    solrurl = 'http://your.solr.server:port/solr'
    sussqueuesize = 24 # how many items to buffer on their way to solr
    sussthreads = 1    # how many threads to use to send stuff to solr

    suss = StreamingUpdateSolrServer.new(solrurl, sussqueuesize, sussthreads)

    # Let's add a simple document via a hash: A title, three authors, and a year

    h = {
      :title => "Never been deader",
      :author => ['Bill', 'Mike', 'Molly'],
      :year => 2003
    }
    suss << h
    suss.commit

    # YEA! You just added a document to solr and committed it.
    # Have a cookie!

    # We can also use a document object to do the same thing

    doc = SolrInputDocument.new
    # Add the title
    doc << ['title', 'Never been deader']

    # Add the first author
    doc << [:author, 'Bill']

    # Add more. Re-used keys mean you're adding additional values
    # Note values can be scalars or arrays

    doc << [:author, ['Mike', 'Molly']]

    # Add the wrong year using [] syntax
    doc[:year] = 2001

    # Oops! fix it. []= overwrites existing value(s)

    doc[:year] = 2003

    # Finally, we can merge a hash (or anything else that responds to
    # 'each_pair' with key-value pairs) into an existing doc.
    # The parentheses matter: without them Ruby parses the braces as a block.

    doc.merge!({'author' => 'Ringo Starrre', 'publisher' => 'Vainity Books'})

    # Add it

    suss << doc

    # Commit and optimize if you'd like

    suss.commit
    suss.optimize # if you want

Nothing really fancy in there — just a few things worth noting:

  • An suss object will take a hash (again, anything that responds to #each_pair) or a SolrInputDocument
  • You can use either strings or symbols to represent Solr field names
  • Values can be either a single value, or an array of multiple values

And there are three ways to get data into a doc:

  • Via << [field, value(s)] (additive)
  • Via doc.merge!(hash) (additive)
  • Via doc[field] = value (replaces)
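
If the additive-versus-replacing distinction is easier to see in isolation, here is a toy stand-in written in plain Ruby. To be clear, ToyDoc is a made-up class that only mimics the three behaviors listed above; it is not the gem's SolrInputDocument, and the choice to store every field as an array of values under a string key is my assumption for the sketch.

```ruby
# Toy illustration only: NOT the real SolrInputDocument.
class ToyDoc
  def initialize
    # Every field holds a list of values; field names are stored as strings
    # so that :author and 'author' refer to the same field.
    @fields = Hash.new { |hash, key| hash[key] = [] }
  end

  # doc << [field, value_or_values]   (additive)
  def <<(pair)
    field, values = pair
    @fields[field.to_s].concat(Array(values))
    self
  end

  # doc[field] = value_or_values      (replaces whatever was there)
  def []=(field, values)
    @fields[field.to_s] = Array(values)
  end

  def [](field)
    @fields[field.to_s]
  end

  # doc.merge!(hash)                  (additive, one << per pair)
  def merge!(pairs)
    pairs.each_pair { |field, values| self << [field, values] }
    self
  end
end

doc = ToyDoc.new
doc << [:author, 'Bill']
doc << ['author', ['Mike', 'Molly']]
doc[:year] = 2001
doc[:year] = 2003
doc.merge!({'author' => 'Ringo Starrre', 'publisher' => 'Vainity Books'})

p doc[:author]    # ["Bill", "Mike", "Molly", "Ringo Starrre"]
p doc[:year]      # [2003]
p doc[:publisher] # ["Vainity Books"]
```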

Adding Threads

I also went down the garden path of threading things. There are an awful lot of operations that are not thread-safe (e.g., reading a line from a file), but once you’ve got a bunch of records to work with, turning them into Solr documents is usually thread-safe.

My model is that there’s a producer (usually the method #each) from an underlying data object. A thread takes whatever that method yields and sticks the values into a Java BlockingQueue awaiting consumption. You then use ProducerConsumer#threaded_each (or ProducerConsumer#threaded_each_with_index) to pull items out of the queue and do something useful with them.

I extracted stuff into a library (jruby_producer_consumer) for your viewing pleasure.
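
For anyone who wants to see the shape of that model without installing anything, here is a minimal sketch in plain Ruby. It is not the gem's code: it swaps Ruby's own SizedQueue in for the Java BlockingQueue, the method name threaded_each_sketch and its keyword arguments are invented for the example, and shutting the consumers down with a sentinel object is just one way to do it.

```ruby
# One producer thread feeds a bounded queue; several consumer threads drain it.
# Hypothetical stand-in for the model described above, not the gem itself.
def threaded_each_sketch(eachable, consumers: 3, queue_size: 5, producer_method: :each, &block)
  queue = SizedQueue.new(queue_size) # push blocks when the queue is full
  stop  = Object.new                 # sentinel that tells a consumer to quit

  # Single producer: call the underlying #each (or whatever was asked for)
  # and push everything it yields onto the queue.
  producer = Thread.new do
    eachable.public_send(producer_method) { |item| queue.push(item) }
    consumers.times { queue.push(stop) } # one sentinel per consumer
  end

  # Consumers: pop items (potentially out of order) and hand them to the block.
  workers = Array.new(consumers) do
    Thread.new do
      loop do
        item = queue.pop
        break if item.equal?(stop)
        block.call(item)
      end
    end
  end

  producer.join
  workers.each(&:join)
end

# Usage: square the numbers 1..10 on three consumer threads.
results = Queue.new # thread-safe collector
threaded_each_sketch(1..10, consumers: 3) { |n| results << n * n }

squares = []
squares << results.pop until results.empty?
p squares.sort # [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
```

The results are sorted before printing because, as with the real thing, nothing guarantees the order in which the consumers finish.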

CONFUSION ALERT: It’s perhaps unfortunate that the object you send to ProducerConsumer.new(obj) must implement #each and that the ProducerConsumer method #threaded_each calls that underlying #each…well there’s a lot of #each’s floating around. Keep them straight.

So…let’s look at some code to work with consumer threads.

    # Start off the same as before
    require 'rubygems'
    require 'jruby_streaming_update_solr_server'
    require 'jruby_producer_consumer'
    require 'marc4j4r'

    solrurl = 'http://your.solr.server:port/solr'
    sussqueuesize = 24 # how many items to buffer on their way to solr
    sussthreads = 2    # how many threads to use to send stuff to solr

    suss = StreamingUpdateSolrServer.new(solrurl, sussqueuesize, sussthreads)

    # I'll go ahead and use a MARC file as my example, but won't talk about the
    # MARC parts of it. All you need to know is that the reader object
    # implements #each

    reader = MARC4J4R.reader('test.xml', :marcxml)

    # Get a producer/consumer object with the reader at its base, using
    # the default method #each to get stuff out of it, and with the assumption
    # that we only need to keep the default 5 items in memory at a time to
    # keep up with consumption

    pc = ProducerConsumer.new(reader)

    # Get three threads to actually consume the things, turn them into solr
    # documents, and send them to solr (potentially out of order)

    numconsumerthreads = 3
    pc.threaded_each(numconsumerthreads).each do |r|
      suss << turn_marc_record_into_a_hash_or_solrdoc(r)
    end
    suss.commit

Again, not a lot happening here.

  • The “producer” is always one thread, because so little is thread-safe at the ‘each’ level. In this case, there’s a single thread pulling data out of the file and turning it into MARC records, which are added to the internal BlockingQueue. I buffer 5 of these at a pop (the default) so the consumer threads don’t starve. I presume that producing items is cheaper than consuming them, or else this library won’t help you much.
  • ProducerConsumer#threaded_each calls the #each method of the underlying object. You can substitute anything that yields, though, as in this example where I call #each_line instead of the default #each:

        queuesize = 5
        pc = ProducerConsumer.new(File.new('myfile.txt'), queuesize, :each_line)

  • Keep track of your threads. In the MARC example above, there is one thread getting MARC records and putting them into the PC buffer (no way to change that), three threads consuming those records and sticking them into the suss object, and another two pulling stuff out of the suss object and sending things to Solr. And, of course, there’s other stuff running on the computer, too. Experiment and figure out what works best for your hardware.
  • See the docs for how to mess with what goes into a ProducerConsumer object. It’s entirely possible to use, say, #each_slice. There’s also a convenience method #threaded_each_with_index, but it does not call the underlying #each_with_index; it produces its own index as things are read.

Feedback not only welcome but necessary!

I’ve done a lot of messing around with Ruby in the last 10 days or so, but I’m still basically converting from Perl in my head. Any comments, bug reports, or whatnot are definitely welcome!

jruby_producer_consumer dead-simple producer/consumer for JRuby

Yea! My first gem ever released!

[YUCK! It was a disaster in a few ways! Don't look at this! It's hideous! There's a new jruby_producer_consumer gem on gemcutter that is slightly different from this in that it works. Ignore the stuff below.]

[In working on a threaded JRuby-based MARC-to-Solr project, I realized that my threading stuff was...ugly. And I didn't really understand it. So I dug in today and wrote this.]

I’ve just pushed my first gem to Gemcutter: jruby_producer_consumer, a JRuby-only producer/consumer class that works with anything that provides #each.

It’s JRuby-only because it uses (a) a blocking queue implementation that’s native Java, and (b) threading, which isn’t a huge win under regular Ruby.

There’s no testing there because I’m not sure how to test threaded stuff :-(

It is, I hope, easy to use:

    require 'rubygems'
    require 'jruby_producer_consumer'

    # Create a ProducerConsumer. Arguments are anything that implements #each
    # and the size for the underlying queue. For the former, I'll just use a Range object.

    eachable = 1..10
    queuesize = 3

    pc = ProducerConsumer.new(eachable, queuesize)

    # Just a method to show what happens
    def sample(consumerid, x)
      puts "Consumer #{consumerid}: consuming #{x}"
      sleep 1 # otherwise this'll finish before I can create multiple consumers
    end

    # Create three consumers. You can pass any number of args to
    # #consumer, and must pass a block whose arguments are the
    # object returned by eachable#each and those args back.

    ['A', 'B', 'C'].each do |consumerid|
      pc.consumer(consumerid) do |x, consumerid|
        sample(consumerid, x)
      end
    end

    # OUTPUT
    # Consumer A: consuming 1
    # Consumer B: consuming 2
    # Consumer C: consuming 3
    # Consumer A: consuming 4
    # Consumer B: consuming 5
    # Consumer C: consuming 6
    # Consumer B: consuming 7
    # Consumer A: consuming 8
    # Consumer C: consuming 9
    # Consumer B: consuming 10