Archives: September 2009

Many of the folks who read this blog (hi, both of you! Mom, say hello to Dad!) are aware, at least tangentially, of the HathiTrust. Currently hosted by us at the University of Michigan, the most public interface to its data is a VuFind installation you can access at catalog.hathitrust.org (or, for you smart-phone types, at m.catalog.hathitrust.org). Once you do a metadata search, you get links into the actual page images or a chance to search the fulltext of the selected item (depending on its copyright status).

It’s awesome. Seriously. Even in the absence of fulltext, being able to search within an item can be incredibly useful. Give it a shot if you haven’t.

You don’t always need an OPAC

But there are plenty of folks who don’t want or need a full-blown interface into all the metadata. They’ve already got one of those. What they’re interested in, mostly, is figuring out how to easily put links in their own OPAC (or whatnot or whoseits) to the HathiTrust if page images or searching are available. See, for example, a typical record from Tod Olson’s stuff at U-Chicago; he sniffs for HathiTrust and Google Books availability via embedded javascript.

To this end, the HathiTrust folks provide a set of simple, tab-delimited files — a full extract on the first of every month, and nightly updates every …er…night.

You can see from the description of the file that it’s very simple: tab-delimited fields of the HathiTrust ID, rights information, and all the golden-oldie standard identifiers, some of which (ISSNs, ISBNs, etc.) are further comma-delimited in cases where multiple values are available and a field repeats. There’s also a title and enumcron (description of an individual volume, e.g., “Sept 2007, vol. 33, issue 4”), so you have something useful to display if you need to, and that’s 98% of what most folks want.

The smart way to do it: RDBMS

If you want to query this data quickly and easily, the obvious thing to do is to dump it into a database. One main table for the non-repeated values, and either a few key=>value tables (or, if you’re lazy, a single key => type/value) for the repeated ISBNs/ISSNs/whatnot. A quick mod-perl script to set up some data normalization going in and out and persist the prepared SQL queries and you’re set.
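For the curious, here’s roughly what that layout could look like. This is a minimal sketch using Python’s built-in sqlite3 module rather than the mod-perl setup described above; the table and column names (hathi_item, hathi_identifier) are placeholders I made up for illustration, and it uses the lazy single key => type/value table.

```python
import sqlite3

# Hypothetical schema: one main table for the single-valued columns and one
# "lazy" type/value table for the repeated identifiers.
SCHEMA = """
CREATE TABLE hathi_item (
    htid     TEXT PRIMARY KEY,
    rights   TEXT,
    title    TEXT,
    enumcron TEXT
);
CREATE TABLE hathi_identifier (
    htid  TEXT NOT NULL REFERENCES hathi_item(htid),
    type  TEXT NOT NULL,   -- e.g. 'isbn', 'issn', 'oclc', 'lccn'
    value TEXT NOT NULL
);
CREATE INDEX idx_identifier ON hathi_identifier (type, value);
"""


def open_db(path=":memory:"):
    """Open (or create) the database and set up the two tables."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_item(conn, htid, rights, title, enumcron, identifiers):
    """Insert one item; identifiers is a list of (type, value) pairs."""
    conn.execute(
        "INSERT INTO hathi_item VALUES (?, ?, ?, ?)",
        (htid, rights, title, enumcron),
    )
    conn.executemany(
        "INSERT INTO hathi_identifier VALUES (?, ?, ?)",
        [(htid, id_type, value) for id_type, value in identifiers],
    )


def find_by_identifier(conn, id_type, value):
    """Return the main-table rows for every item carrying this identifier."""
    rows = conn.execute(
        "SELECT i.htid, i.rights, i.title, i.enumcron "
        "FROM hathi_item i JOIN hathi_identifier x ON x.htid = i.htid "
        "WHERE x.type = ? AND x.value = ? ORDER BY i.htid",
        (id_type, value),
    )
    return rows.fetchall()
```

Normalizing the identifiers before they go in (and before you query) is the part that does the real work; the tables themselves are about as boring as they look.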

It’s hard to make an argument against using a database for these data. I mean, c’mon. We’ve got a well-defined structure. An obvious foreign-key. No full-text searching needed. This is practically designed for a good old-fashioned RDBMS. Plus, I’ve done this approximately a zillion times before, so I’m good and fast at it. Case closed.

How I’m gonna do it

Screw that. What I really wanted to do was start messing around with the DataImportHandler (DIH) in Solr.

I can make a weak argument for including the data in a Solr instance. To wit, it’ll certainly be fast enough for anything I’m gonna throw at it, and (more important to me) it’s easy to set up datastore-level indexing and querying filters with built-in facilities and/or custom code. This allows me to build clients that call it without having to worry about manipulating the input much, if at all.

The list of simple DIH examples is…well, I never really found any good ones, although I’m sure they’re out there. The documentation isn’t bad, but it’s not full of complete examples, and almost all of them have to do with the potential complexities of sucking data out of a database, which is what most people want to do. Not me, I’ve got flat files to work with.

Luckily, you can fire up an “interactive” DIH session where, at the very least, you can try to import a few rows of data and see if things are puking. I didn’t find the error reports particularly helpful all the time, but it’s about a zillion times better than nothing, I can tell you that much.

The game plan

We’ll start with the assumption that I’ve already managed to load a full dump from some date (run with me here; I’ll explain how to do it later). Then what we want to do is the following:

  1. Every night, download the nightly additions/changes file and gunzip it.
  2. Hit the DIH handler to import all files that (a) have a filename of the right format, and (b) have a created date after the last time the DIH handler was run.

And that’s it. Get the new stuff, have DIH figure out what’s new, and import it.

The first part is easy enough to do with perl/python/ruby/whatever. I’ll leave it as an exercise for all you diligent students.
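If you’d rather not do the exercise, here’s one way the download-and-gunzip step might look in Python, using only the standard library. The URL you pass in is entirely up to you; I’m not assuming anything about where the nightly files actually live, and the function names are my own.

```python
import gzip
import os
import shutil
import urllib.request


def download(url, dest_path):
    """Fetch url and write the response body to dest_path."""
    with urllib.request.urlopen(url) as resp, open(dest_path, "wb") as out:
        shutil.copyfileobj(resp, out)


def gunzip(gz_path, out_path):
    """Decompress gz_path into out_path, streaming so big files are fine."""
    with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as out:
        shutil.copyfileobj(src, out)


def fetch_nightly(url, work_dir):
    """Download a .gz file into work_dir, unpack it, and return the unpacked path."""
    gz_path = os.path.join(work_dir, os.path.basename(url))
    out_path = gz_path[:-3] if gz_path.endswith(".gz") else gz_path + ".out"
    download(url, gz_path)
    gunzip(gz_path, out_path)
    os.remove(gz_path)  # keep only the unpacked file around for the import
    return out_path


# Hypothetical usage (placeholder URL and directory):
# fetch_nightly("http://example.org/files/hathi_upd_20090930.txt.gz", "/path/to/hathi")
```

Run something like that from cron a little while before the import is triggered and step one is done.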

Setting up solrconfig.xml

This is the easy part. Set up the handler, give it a semi-meaningful name, and call out to a config file.

  <requestHandler name="/hathiimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
      <lst name="defaults">
        <str name="config">hathi-data-config.xml</str>
      </lst>
  </requestHandler>

Define some useful data types in schema.xml

I left pretty much all of the boilerplate in schema.xml and just added a few types to deal with identifiers.

  • lowercase: return a single token that’s been lowercased. Don’t muck with it otherwise.
  • genericID: trim it, lowercase it, ditch everything that’s not a number or a letter, and return as a single token.
  • numeric: Ditch everything but the first string of digits, and then ditch any leading zeros. Useful when you know it’s gotta be an integer.
  • stdnum: Find the first set of digits (optionally followed by an ‘X’ and potentially interspersed with dashes or dots), strip off the leading zeros, and return it. Good to extract an ISBN from a string like “(alt) 123-45-678X electronic only”.
  • lccnnormalizer: Custom code to normalize an LCCN as per this page at the LoC.
  <types>
    <!-- lowercases the entire field value, keeping it as a single token. -->
    <fieldType name="lowercase" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory" />
      </analyzer>
    </fieldType>

    <!-- Full string, stripped of \W and lowercased -->
    <fieldType name="genericID" class="solr.TextField" sortMissingLast="true"  omitNorms="true">
      <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.TrimFilterFactory"/>
        <filter class="solr.PatternReplaceFilterFactory"
             pattern="[^\p{L}\p{N}]" replacement=""  replace="all"
        />
      </analyzer>
    </fieldType>

    <!-- standard number normalizer: extract sequence of digits, strip leading zeroes -->
    <fieldType name="numeric" class="solr.TextField" sortMissingLast="true" omitNorms="true" >
      <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.TrimFilterFactory"/>
        <filter class="solr.PatternReplaceFilterFactory"
             pattern="[^0-9]*([0-9]+)[^0-9]*" replacement="$1"
        />
        <filter class="solr.PatternReplaceFilterFactory"
             pattern="^0*(.*)" replacement="$1"
        />
      </analyzer>
    </fieldType>

    <!-- Simple type to normalize isbn/issn. Just get first string of digits followed by an optional 'x' -->
    <fieldType name="stdnum" class="solr.TextField" sortMissingLast="true" omitNorms="true" >
      <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.TrimFilterFactory"/>
        <filter class="solr.PatternReplaceFilterFactory"
             pattern="^[\s0\-\.]*([\d\.\-]+x?).*$" replacement="$1"
        />
        <filter class="solr.PatternReplaceFilterFactory"
             pattern="[\-\.]" replacement=""  replace="all"
        />
      </analyzer>
    </fieldType>

    <!-- LCCN normalization on both index and query -->
    <fieldType name="lccnnormalizer" class="solr.TextField"  omitNorms="true">
      <analyzer>
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.TrimFilterFactory"/>
        <filter class="edu.umich.lib.solr.analysis.LCCNNormalizerFilterFactory"/>
      </analyzer>
    </fieldType>

    <!-- since fields of this type are by default not stored or indexed,
         any data added to them will be ignored outright. -->
    <fieldtype name="ignored" stored="false" indexed="false" multiValued="true" class="solr.StrField" />

  </types>
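If you want to poke at what these analyzers do without firing up Solr, here’s a rough Python imitation of three of them. It’s only a sketch: the patterns are copied from the schema as written, Python’s regex engine stands in for Java’s, and isalnum() stands in for the \p{L}/\p{N} character classes, which is close but not guaranteed to be identical for every Unicode character. The sample values are made up.

```python
import re


def generic_id(value):
    """Imitates genericID: lowercase, trim, keep only letters and digits."""
    value = value.lower().strip()
    return "".join(ch for ch in value if ch.isalnum())


def numeric(value):
    """Imitates numeric: pull out the digits, then drop leading zeros."""
    value = value.lower().strip()
    value = re.sub(r"[^0-9]*([0-9]+)[^0-9]*", r"\1", value)
    return re.sub(r"^0*(.*)", r"\1", value)


def stdnum(value):
    """Imitates stdnum: leading digits/dashes/dots plus optional 'x', punctuation removed."""
    value = value.lower().strip()
    value = re.sub(r"^[\s0\-\.]*([\d\.\-]+x?).*$", r"\1", value)
    return re.sub(r"[\-\.]", "", value)


print(generic_id(" MDP.39015-012345 "))   # made-up identifier
print(numeric("(OCoLC)ocm00012345"))
print(stdnum("0123-4567-89X"))
```

One thing the imitation makes visible: the first stdnum pattern is anchored at the start of the value, so the examples here stick to values that begin with the number itself.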

Add field definitions to schema.xml

This is pretty straight-forward: just set it up.

  <field name="htid"        type="genericID"          indexed="true"  stored="true"  multiValued="true"/>
  <field name="bibnum"       type="genericID"         indexed="true"  stored="true"/>

  <field name="access"       type="lowercase"         indexed="true"  stored="true"/>

  <field name="rights"       type="lowercase"         indexed="true"  stored="true"/>

  <field name="source"       type="lowercase"         indexed="true"  stored="true"/>
  <field name="sourceid"     type="genericID"         indexed="true"  stored="true"/>

  <field name="lccn"         type="lccnnormalizer" indexed="true"  stored="true"  multiValued="true"/>
  <field name="oclc"         type="numeric"        indexed="true"  stored="true"  multiValued="true"/>
  <field name="isbn"         type="stdnum"         indexed="true"  stored="true"  multiValued="true"/>
  <field name="issn"         type="stdnum"         indexed="true"  stored="true"  multiValued="true"/>

  <field name="title"        type="text"         indexed="true" stored="true"/>
  <field name="imprint"      type="text"         indexed="true" stored="true"/>
  <field name="enumcron"     type="text"         indexed="true" stored="true"/>

  <!-- Ignore the multivalued, comma-delimited source strings -->

  <field name="rawLine"  type="ignored" indexed="false" stored="false"/>
  <field name="issns"  type="ignored" indexed="false" stored="false"/>
  <field name="isbns"  type="ignored" indexed="false" stored="false"/>
  <field name="oclcs"  type="ignored" indexed="false" stored="false"/>
  <field name="lccns"  type="ignored" indexed="false" stored="false"/>

hathi-data-config.xml — define how DIH is going to work.

This, of course, is the meat of the heart of the center of the matter.

I’m going to make use of four DIH technologies:

  • FileDataSource: In DIH, you declare a data source from which you’ll be sucking the raw data for manipulation and massaging. I’m just using a file, so this is for me. You can, as you might expect, pull in from a URL or (as mentioned) a database via JDBC.
  • FileListEntityProcessor: Given a directory and a set of criteria for a file, this will return a list of filenames that match those criteria. The criteria we’ll be using are (a) a regexp the filename must match, and (b) a creation date after the last time we ran the process.
  • LineEntityProcessor: Once you’ve got a data source, you need to stream it in somehow. There are Processors for XML and other formats, but this one just pulls in lines one at a time. The documentation all talks about LineEntityProcessor basically only being useful for pulling in, say, a list of filenames, but since my data is all line-by-line, this is what I’m using as my primary record-fetcher. It populates a single field called rawLine for later processing.
  • RegexTransformer: Allows you to take a field pulled from the datasource (or already derived from previous processing) and do regexp substitutions, group extraction, or splitting.

SO…I’m going to:

  1. Set up a FileDataSource to read from files
  2. Use FileListEntityProcessor to get a list of files that match my criteria
  3. Run each through LineEntityProcessor to generate a bunch of rawLines.
  4. Use the RegexTransformer multiple times to extract the data from the line.

[If you never went to look at it, this might be a good time to check out the description of the tab-delimited metadata files.]

  <dataConfig>
    <dataSource name="fds" encoding="UTF-8"  type="FileDataSource" />
    <document>
      <!-- Get a list of files from the last time the handler ran -->
      <entity name="hathifile"
              processor="FileListEntityProcessor"
              newerThan="${dataimporter.last_index_time}"
              fileName="^hathi_upd_.*\.txt$"
              rootEntity="false"
              baseDir="/Users/dueberb/Documents/devel/hathi"
      >

        <entity name="hathiline"
                processor="LineEntityProcessor"
                url="${hathifile.fileAbsolutePath}"
                rootEntity="true"
                dataSource="fds"
                transformer="RegexTransformer"
        >

          <!-- Big ugly regexp to get all the tab-delimited fields -->
          <field column="rawLine"
                 regex="^(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)\t(.*)$"
                 groupNames="htid,access,rights,bibnum,enumcron,source,sourceid,oclcs,isbns,issns,lccns,title,imprint"
          />

          <!-- Split the multi-values on comma -->

          <field column="oclc" splitBy="," sourceColName="oclcs" />
          <field column="issn" splitBy="," sourceColName="issns" />
          <field column="isbn" splitBy="," sourceColName="isbns" />
          <field column="lccn" splitBy="," sourceColName="lccns" />
        </entity> <!-- end of hathiline -->

      </entity> <!-- end of hathifile -->
    </document>
  </dataConfig>
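To make the regex-then-split step concrete, here’s a small Python sketch that does by hand what the two kinds of field definitions above ask RegexTransformer to do: match the thirteen tab-separated groups against rawLine, name them, and then split the four comma-delimited columns. The group names come straight from the config; the sample line in the usage note is invented, and returning an empty list for an empty column is my choice, not necessarily what DIH does.

```python
import re

GROUP_NAMES = (
    "htid,access,rights,bibnum,enumcron,source,sourceid,"
    "oclcs,isbns,issns,lccns,title,imprint"
).split(",")

# Same shape as the "big ugly regexp": one (.*) per column, joined by tabs.
RAW_LINE = re.compile(r"^" + r"\t".join(["(.*)"] * len(GROUP_NAMES)) + r"$")

# target field -> comma-delimited source column, as in the splitBy definitions
SPLIT_FIELDS = {"oclc": "oclcs", "issn": "issns", "isbn": "isbns", "lccn": "lccns"}


def parse_raw_line(raw_line):
    """Return a dict of fields for one line, or None if it doesn't have 13 columns."""
    match = RAW_LINE.match(raw_line.rstrip("\n"))
    if match is None:
        return None
    doc = dict(zip(GROUP_NAMES, match.groups()))
    for target, source in SPLIT_FIELDS.items():
        # Assumption: an empty source column yields no values at all.
        doc[target] = doc[source].split(",") if doc[source] else []
    return doc
```

Feed it a line like "test.001\tallow\tpd\t000123\tv.1\tMIU\t990001\t111,222\t\t1234-5678\t\tA Title\tAnn Arbor, 1900" and you get back a dict with oclc set to ['111', '222'] and an empty isbn list, which is handy for sanity-checking a file before pointing Solr at it.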

And…it doesn’t work.

It almost works. The problem is that my attempt to use the variable ${dataimporter.last_index_time} is busted. There’s a ticket to fix it and a patch already provided, so it’s only a matter of time before it’s not an issue.

For the moment, though, we’ll change that line to:

  <entity name="hathifile"
          processor="FileListEntityProcessor"
          newerThan="'NOW/DAY'"
          fileName="^hathi_upd_.*\.txt$"
          rootEntity="false"
          baseDir="/Users/dueberb/Documents/devel/hathi"
  >

That says to basically take everything created since midnight and use it. If you have cron scripts set up to run this every day, you’ll have no problems.
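If you want to check ahead of time which files a run like that would pick up, something along these lines should get you close. It’s a sketch under a couple of assumptions I haven’t verified against the DIH source: that the comparison is made against the file’s modification time, and that “midnight” means local midnight (Solr’s date math may well round in UTC instead).

```python
import datetime as dt
import os
import re

FILE_NAME = re.compile(r"^hathi_upd_.*\.txt$")  # same pattern as the fileName attribute


def files_since_midnight(base_dir, now=None):
    """List files in base_dir matching FILE_NAME that were modified since local midnight."""
    now = now or dt.datetime.now()
    midnight = dt.datetime.combine(now.date(), dt.time.min).timestamp()
    matches = []
    for name in sorted(os.listdir(base_dir)):
        path = os.path.join(base_dir, name)
        if not FILE_NAME.search(name) or not os.path.isfile(path):
            continue
        if os.path.getmtime(path) >= midnight:
            matches.append(path)
    return matches
```

Calling files_since_midnight on your baseDir right before the cron job fires is a cheap way to confirm the download step actually left something to import.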

Dealing with a full extract

You’ll only have to do this once, of course, but it has to be done. Basically, reproduce the DIH handler with a different name, pulling in the data from a full extract (you could, e.g., just change the filename parameter to accept /^hathi_full_.*\.txt$/). Maybe call it hathifullimport instead of hathiimport.

Fire her up!

Once you’re ready to go, just hit the right URL:

http://solrmachine:port/solr/hathifullimport?command=full-import&clean=true

http://solrmachine:port/solr/hathiimport?command=full-import&clean=false

The first one will get the initial big, full file; the second will pull in all the nightlies you’ve downloaded, gunzipped, and put in the right place (provided, of course, they’re dated after the last midnight, or they’ve fixed DIH to allow the last_index_time syntax).
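Since this is going to end up in a cron job anyway, here’s a small Python sketch for building and hitting those URLs. The host name and port in the usage note are placeholders, and the trigger function simply reports the HTTP status; it doesn’t try to interpret whatever DIH sends back.

```python
import urllib.request
from urllib.parse import urlencode


def import_url(host, port, handler, clean):
    """Build the full-import URL for a DIH handler registered under /solr/<handler>."""
    query = urlencode({"command": "full-import", "clean": "true" if clean else "false"})
    return "http://%s:%d/solr/%s?%s" % (host, port, handler, query)


def trigger(url):
    """Request the URL and return the HTTP status code."""
    with urllib.request.urlopen(url) as resp:
        return resp.status


# Hypothetical usage (placeholder host and port):
# trigger(import_url("solrmachine", 8983, "hathiimport", clean=False))
```

The one thing worth being paranoid about is the clean flag: clean=true on the nightly handler would wipe out the full extract you loaded first.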

Next steps?

Beer or wine. Take your pick.

After that, though, it’d be a matter of actually writing the download scripts and setting up cron jobs. And, of course, putting a front-end on it if you want, or massaging the data as they come out to return a nice JSON format for your consumers. That sort of thing.

So, wait…is this really worth doing?

Maybe. Probably not. It was worth it to me to start thinking about DIH and how I can use it. And it might be worth it to you, if you want to play around with these data in the ways that solr makes easy.

But, like so many things, it’s less worth doing than it was worth writing up. I learned a lot.

3 Responses to “An exercise in Solr and DataImportHandler: HathiTrust data”

  1. Ironic. I am indexing HathiTrust content with Solr as we speak. I got sets of MBook MARC data that I dumped into MyLibrary and then send off to Solr. Like Perl, there is more than one way to do it. dueber++

  2. Nice article! Thanks for the detailed writeup on DIH. I’m sure a lot of people will find it very useful. As for SOLR-1473, you’ll find it fixed in the nightly build for 10-01-2009.

  3. [...] read Robot Librarian’s exercises in using Solr for HathiTrust data and once again got interested in doing something myself with [...]

One of the advantages of having complete control over the OPAC is that I can change things pretty easily. The downside of that is that we need to know what to change.

Many of you that work in libraries may have noticed that data are not necessarily the primary tool in decision-making. Or, say, even a part of the process. Or even thought about hard. Or even considered.

For many decisions I see going on in the library world, the primary motivator is the anecdote. In fact, to be honest, the primary driver is the faculty anecdote. Those cliched three curmudgeonly old faculty members invariably have huge influence over systems and interfaces that will be used by 40K undergraduates. The tiny percentage of weirdos that actually talk to reference librarians end up wielding enormous power compared to the untold masses that don’t.

Enter the dragon…er…log.

So…I’m logging everything. EVERYTHING. Everything I can think of, anyway, and that doesn’t slow things down too far.

I’ve got a simple database table set up with the following columns:

  • incrementing integer (solely for innodb’s efficiency needs)
  • sessionid
  • action
  • data1, data2, and data3 (all of these are action-dependent)
  • logweekday, logdate, logtime (instead of a single timestamp for easy and efficient queries)

And that’s it. I’ve had it running (initially with only a few actions) for two weeks and have on the order of 300,000 rows in it at this point. Obviously, at some point I’ll have a better idea of which data I actually care about and things will get slimmed down a little bit. But for now, it’s fun having that all around.
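For anyone who wants to try the same thing, here’s a sketch of that table and an insert helper. It uses Python’s sqlite3 instead of the MySQL/InnoDB setup described above, so the column types are approximations, and the table name activity_log is my own placeholder.

```python
import sqlite3
from datetime import datetime

CREATE = """
CREATE TABLE activity_log (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    sessionid  TEXT NOT NULL,
    action     TEXT NOT NULL,
    data1      TEXT,      -- meaning depends on the action
    data2      TEXT,
    data3      TEXT,
    logweekday INTEGER,   -- 1 = Monday ... 7 = Sunday
    logdate    TEXT,      -- YYYY-MM-DD
    logtime    TEXT       -- HH:MM:SS
)
"""


def log_event(conn, sessionid, action, data1=None, data2=None, data3=None, now=None):
    """Insert one row, splitting the timestamp into weekday, date, and time."""
    now = now or datetime.now()
    conn.execute(
        "INSERT INTO activity_log "
        "(sessionid, action, data1, data2, data3, logweekday, logdate, logtime) "
        "VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        (sessionid, action, data1, data2, data3,
         now.isoweekday(), now.strftime("%Y-%m-%d"), now.strftime("%H:%M:%S")),
    )
```

Splitting the timestamp three ways means questions like “what happens on Mondays?” or “what happens between noon and one?” become plain equality or range checks on a single column.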

Common log events include an action, and usually at least one other piece of data. Stuff like:

  • start-a-new-session with IP Address
  • simple-search with search-index, searchstring
  • choose-a-facet with facet-index, facet-value, position-on-list
  • view-a-full-record with recordID, search-result-number
  • click on an electronic resource link to proquest/google/hathitrust/whatever

…etc. I track adding and removing things from the selected items set and the user’s favorites, exporting to email or refworks or whatnot, logging in and out, clicking on the author’s name or a LCSH subject in the full record view, picking a “similar item” from the eponymous list, clicking on the spelling suggestion and the prev/next buttons, etc. Currently I’m logging 78 events.

[Note: by "search result number" I mean the enumeration of that record in that specific search set. So, the top result is #1. The first result on page two is #21]

What do I think I’m gonna learn?

I’m not exactly sure. Of course, I can get all the basics — how much traffic and what people are searching for — but there’s the possibility of other stuff. Things like:

  • Do people actually use the [prev/next, facets, facets-below-the-fold, items past the third page, etc]?
  • Seriously, is anyone using the boolean searches and wildcards on the basic search page? Are any of them using IP addresses from outside the staff subnet? If not, can I please please please start using DisMax???
  • What facets are most popular? Do people hit the little “more” button to expand the list of facet values from 6 to 30?
  • What’s the average search result number of a record chosen for a full-record view for each search index (perhaps an indicator of how well the relevancy-ranking is working)?
  • Looking at all the full-record displays, what are the patterns for those records (e.g., break down by callnumber prefix, or by our “Academic Discipline” subject)?
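As a taste of the kind of question this makes answerable, here’s a sketch that computes the average search result number for full-record views and counts how many of them came from past the first page. It assumes a table shaped like the one described above (called activity_log here), an action code of srrecview with the result number in data3, and 20 results per page; the demo rows are invented, and breaking things down per search index would need more than this.

```python
import sqlite3


def demo_connection():
    """Build a throwaway in-memory table with a few made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE activity_log (sessionid TEXT, action TEXT, "
        "data1 TEXT, data2 TEXT, data3 TEXT)"
    )
    conn.executemany(
        "INSERT INTO activity_log VALUES (?, ?, ?, ?, ?)",
        [
            ("s1", "simple-search", "title", "moby dick", ""),
            ("s1", "srrecview", "000000001", "", "1"),
            ("s1", "srrecview", "000000002", "", "3"),
            ("s2", "srrecview", "000000003", "", "21"),
            ("s3", "srrecview", "000000001", "", "2"),
        ],
    )
    return conn


def result_position_stats(conn, page_size=20):
    """Summarize where in the result list people click through to a full record."""
    views, avg_pos, beyond = conn.execute(
        "SELECT COUNT(*), AVG(CAST(data3 AS INTEGER)), "
        "SUM(CASE WHEN CAST(data3 AS INTEGER) > ? THEN 1 ELSE 0 END) "
        "FROM activity_log WHERE action = 'srrecview' AND data3 <> ''",
        (page_size,),
    ).fetchone()
    return {"views": views, "average_position": avg_pos, "beyond_first_page": beyond or 0}


print(result_position_stats(demo_connection()))
```

A low average with almost nothing past page one would be at least weak evidence that relevancy ranking is doing its job (or that nobody ever clicks “next”).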

I’ve got a lot to learn about stats, and user tracking, and clickpath analysis, but dammit I’ll have data and I’m not afraid to use it!

[Er...them. Not "it". Data are a "them." Always feels weird to me to refer to data in the plural, but I'm forcing myself to do so these days.]

What’s the server implementation?

I already mentioned the database. I’ve got a little module called ActivityLog that does three and a half things:

  1. Get the session id from the session
  2. Get logging information from the GET/POST or passed in as parameters
  3. Modify the parameters if need be (e.g., pull domain name out of an external URL). This is the half a thing.
  4. Stuff it into the database with appropriate timestamps.

And that’s it.
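In case the “three and a half things” are easier to follow as code, here’s a Python sketch of the same steps, minus the actual database write. Everything here is illustrative: the real ActivityLog module isn’t shown in this post, the parameter names lc/lv1/lv2/lv3 are simply what the browser-side code sends, and “outlink” is a made-up action code for clicks on external links.

```python
from datetime import datetime
from urllib.parse import urlparse


def build_log_row(session_id, params, now=None):
    """Turn a session id plus request parameters into a tuple ready for insertion."""
    now = now or datetime.now()
    action = params["lc"]
    data = [params.get(key, "") for key in ("lv1", "lv2", "lv3")]

    # The "half a thing": for outgoing links, keep just the host name.
    if action == "outlink" and data[0]:
        data[0] = urlparse(data[0]).hostname or data[0]

    return (
        session_id,
        action,
        data[0],
        data[1],
        data[2],
        now.isoweekday(),            # logweekday
        now.strftime("%Y-%m-%d"),    # logdate
        now.strftime("%H:%M:%S"),    # logtime
    )
```

The returned tuple lines up with the columns listed earlier (sessionid, action, data1 through data3, then the three time columns), so the final step is a single parameterized INSERT.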

What’s the client need to do?

I start off with the following rules:

  1. I want to be able to log damn near everything
  2. I can’t degrade the user experience in a meaningful way just for logging
  3. I want to log outgoing links, too.
  4. I must must must have pretty, bookmarkable URLs.

Truth be told, some of the “client” stuff can be (and is) done on the server. When someone is, say, sending a record or set of records to RefWorks, the server knows everything it needs to know and I can just take care of logging as part of the regular request fulfillment.

But some stuff (like the search result number, say) is best taken care of from the browser. Easy enough, for the most part, esp. with form submissions and such.

The potentially-non-obvious part comes in with rule #4: I want pretty URLs. That means that the full display of record 123456789 is always going to be at /Record/123456789 no matter what the user clicked on. Ditto with adding/removing facets and such: the URL contains the resulting search, not the resulting search plus which facet was removed or added.

But — see #1 — I want to log damn near everything.

My solution (and I know lots of people are doing this; this isn’t rocket science) is to fire off an AJAX post for the click events that I’m interested in, sending log data off to my server and then not waiting for a return. Just send the data and follow the link as if nothing had intervened. It degrades gracefully (although the rest of my VuFind doesn’t, so that doesn’t matter much) and it’s dead-easy to implement.

The actual javascript implementation

I long ago switched our VuFind stuff over to use jQuery, just because I like it and know it.

First thing is to use the templates to modify the links to have a particular class (logit) and a well-structured ref (pipe-delimited values).

So, a link from the title of a work on the search-results page to the individual record will look like this:

<a ref="srrecview|{$record.id}||{$recordCounter}" href="/Record/{$record.id}" class="title logit">{$title}</a>

The ref attribute tells us that we’re going to log the type of event (record view from the search results), the ID of the record, a null in the data2 column, and the search result number.

Then there’s javascript to make all the magic happen:

  function logit(a, args) {
    a = jQuery(a);

    // Allow the caller to pass in args, or get them from the ref attribute
    if (!args) {
      args = a.attr('ref').split('|');
    }

    jQuery.post(
      url_to_the_logging_method,
      {
        'lc' : args[0],
        'lv1': args[1] || '',
        'lv2': args[2] || '',
        'lv3': args[3] || ''
      }
    );
  }

  jQuery(document).ready(function() {
    jQuery('a.logit').live('click', function(e) {
      logit(this);
    });
  });

The logit function just does a brain-dead post of the data in the ref attribute. We then bind that function to all anchors with the appropriate class, and we’re done.

[Note the use of the jQuery live event -- this makes sure the event will be bound to stuff that comes in via AJAX after page load. Our links to Google Books, for example, come in like this.]

Since I’m not returning false from the logit function, the default action (actually follow the link) will fire — without even waiting for the AJAX call to come back. Delay to the user is, hopefully, unnoticeable.

Final words

This isn’t all that smart. I should be doing more data-integrity stuff than I am, and of course someone could spoof my numbers if they wanted. But someone could spoof my stats just by hitting my normal catalog pages programmatically, too, so there’s no more risk involved, and I do log IPs.

And, of course, I get my pretty URLs, and most users (i.e., those not running firebug) will never notice anything.

I don’t know that this would work for everyone, but so far it’s working pretty well for us. I’ll let you know if that continues in a post in a few weeks.

2 Responses to “Dead-easy (but extreme) AJAX logging in our VuFind install”

  1. Hear, hear for reality-based decision-making! And the logs to support it.

  2. Have you considered user privacy at all? Definitely make sure those logs are super secure — remember how the yahoo (or was it MSN?) search logs with just sessionIDs ended up revealing which people were which session by the context? And maybe talk to your library to make sure they have a policy for what they’re going to do with this stuff if law enforcement asks for it.

    A pain (an underestimate, a HUGE pain), but probably important.

The sad truths about journal bundle prices

September 23, 2009 at 10:42 am. Category: Uncategorized

[Notes taken during a talk today, Ted Bergstrom: "Some Economics of Saying Nix To Big Deals and the Terrible Fix". My own thoughts are interspersed throughout; please don't automatically ascribe everything to Dr. Bergstrom.

Check out his stuff at Ted Bergstrom's home page.]

Journals are a weird market — libraries buy as agents of professors, using someone else’s money, in deals of enormous complexity and uncertain value from companies that basically have a monopoly.

Similar to a few other situations: doctors prescribe drugs for patients using insurance money. Professors assign textbooks to students whose parents (in general) buy them. In all these cases, the supplier is (or is nearly) a monopoly operation.

Median price per article of for-profit journals is 3-4 times the median prices for non-profit journals. When you look at price per citation it gets even worse (because the “best” — or at least most cited — journals tend to be non-profit).

Marginal cost of supplying print journals is about a penny a page. The marginal cost of supplying electronic access is nearly zero, of course, and shelf space and multiple copies become a thing of the past.

SO…enter the Big Deal.

Academic Press and then Elsevier figured out how to price-discriminate: calculate each library’s current expenditure on paper journals, multiply by 1+x (for x about 0.15) and provide access to all their journals electronically, plus whatever paper you used to buy. Elsevier had a 5-year contract, during which they promised not to increase the price more than 7% per year.

This is a great deal for Elsevier, because they know what a library is already paying and the marginal cost of providing electronic access is essentially zero to them. Huge success — lots of libraries jumped on board, and then so did the other publishers.

Bundling deters entry into the market

Libraries who bought the first Big Deal had their payments increase 7%/year (note that about half of UC’s serials budget goes to Elsevier), but their own budget increases about 3.5%/year. So, they’re in constant cancellation mode, but you can’t cancel only a portion of the electronic access, so it’s exempt. Small-time journal publishers are the only thing left to cut.
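To see how fast that squeeze bites, here’s a back-of-the-envelope sketch (mine, not Dr. Bergstrom’s). It assumes the bundle starts at half the serials budget, as in the UC example, and that the two growth rates hold steady at 7% and 3.5% a year.

```python
def bundle_share(initial_share, bundle_growth, budget_growth, years):
    """Fraction of the budget eaten by the bundle after the given number of years."""
    return initial_share * ((1 + bundle_growth) / (1 + budget_growth)) ** years


def shares_by_year(initial_share=0.5, bundle_growth=0.07, budget_growth=0.035, years=10):
    """Share of the budget going to the bundle for year 0 through the last year."""
    return [
        bundle_share(initial_share, bundle_growth, budget_growth, year)
        for year in range(years + 1)
    ]


for year, share in enumerate(shares_by_year()):
    print("year %2d: %.1f%% of the serials budget" % (year, share * 100))
```

Under those assumptions the bundle goes from 50% of the budget to roughly 59% by the end of a five-year contract, and every bit of that difference has to come out of everything else.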

We have to learn to say “no”

Plus…it’s an incredibly popular product. Faculty love online access, as do students. So negotiating a new contract is tough to do, because libraries (as always) are unwilling to walk away.

The theory of bargaining suggests that the library needs to know what will happen if the Big Deal bargain breaks down: what happens if we walk?

Problem: we (the libraries) don’t know what the deal is worth to us OR what it’s worth to the publisher. Valuation of the big bundles is a ridiculously complex problem.

What happens if we cut it off?

  • Library owns access rights to back issues of journals previously subscribed to.
  • Pay-per-view access required only for recent volumes

So…we need to calculate the number of pay-per-view accesses, which will obviously increase as time goes on (and more stuff falls under this model) and would go down if we were to charge consumers (faculty and students) some percentage of the cost.

Big problem: we can’t make that calculation. We have no good way of knowing what percentage of article downloads are for current journals, and the publishers don’t release that data.

Even if we had the data, though, the likely outcome after a certain amount of time, as more stuff falls into the pay-per-view window, is a new Big Deal.

What about the Big Deals themselves?

Hard to know, because there are NDAs sprinkled around like snow in Minnesota. But it turns out that these deals are FOIA-able! Yea! Elsevier actually tried to sue Ted’s group in the state of Washington, but Paul Courant and others helped to win the day, and they haven’t been sued since.

Publishers want to give the impression that the renewals of the Big Deals are basically formulaic, but in real life there are significant differences from institution to institution.

What’s the “Economic Solution”

Let users pay for what they use. If users paid their own money (about $35/article at the for-profit institutions), they would modify their own behavior AND authors would stop submitting to the expensive-to-access journals because they want their stuff to be read.

How much money are they making?

Reported profits of Elsevier and Springer are about 30% of sales. That’s a huge margin in the regular world, but their costs are tiny. Where does it all go?

Basically, it goes to lobbyists, lawyers, and executive salaries.

The Optical Society of America (physics, not eyesight) is a non-profit organization that publishes journals at about 1/3 the cost per page of Elsevier, but makes 40% profit on sales. They, of course, plow it back into physics journals, being a non-profit and all.

What’s the economic model of a journal?

  • Publishers have fixed costs (editing, harassing referees, typesetting, technology, etc.). I (Bill) think of this as the “First reader” cost — the cost to get one reader to be able to read it.
  • The marginal cost of adding more users is essentially zero.
  • The “efficient” option is to either allow user access at zero cost, with various institutions subsidizing the fixed costs, or just don’t publish the journal.

What can a single library do?

Not much. Faculty will scream, and one library acting alone will have essentially no effect on anything.

An interim strategy

Drop Big Deals to overpriced journals. Maintain subscription and free access to journals priced near the average cost, and subsidize (at less than 100%) pay-per-view access to the overpriced journals.

How big are the differences in what people pay?

Just as an aside, almost, he tells us that while UMich and Illinois pay Elsevier about $2.25M for the “Freedom Collection”, Wisconsin pays about $1.2M for the exact same collection. Whoops!

He’s getting contracts via FOIA and analyzing the differences. I imagine there’ll be publications forthcoming.

2 Responses to “The sad truths about journal bundle prices”

  1. Nettie Lagace says:

    Thanks for this blog post, Bill. This is a very interesting and important topic. Was this a talk just for U of M library staff? Ted Bergstrom’s son Carl also does great research.

  2. Bill says:

    I think the talk was open to all, but not particularly well-advertised. It’s interesting — he’s really advocating a user-pays system. Like all these things, though, it won’t work unless everyone does it at once.

More Ruby MARC Benchmarks: Adding in MARC-XML

September 18, 2009 at 11:08 am Category: Uncategorized

It turns out that UVA’s reluctance to use the raw MARC data on the search results screen is driven more by processing time than parsing time. Even if they were to start with a fully-parsed MARC object, they’re doing enough screwing around with that data that the bottleneck on their end appears to be all the regex and string processing, not the parsing. Their specs for what gets displayed are complex enough that they want to do the work up-front.

But I remain interested, at least partially because of the reason UVA is using MARC-XML: they have MARC records too big for binary MARC format to handle. We do, too, and we’ve just been talking about what to do with them. So I’m thinking that

First, I spent some time dusting off my first attempt at ruby programming: modifying ruby-marc to use libxml if it’s available. It’s not super-well tested, but I’m pretty sure it works. And the speed increases are … well, see below.

Anyone who wants to mess with my attempt at libxml-enabled ruby-marc is welcome to do so. This is a very forgiving parser — it trusts that whatever ended up in the XML should, in fact, have been there. If you say ‘XXE’ is a control field, well, I’ll treat it as a control field.
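The "use libxml if it's available" idea can be sketched roughly as follows. This is not the actual patch to ruby-marc; the helper names `first_available_library` and `choose_xml_backend` are made up for illustration, and the sketch assumes libxml-ruby is loaded with `require 'libxml'`.

```ruby
# Hypothetical helper: return the first library name that can be loaded,
# or nil if none of them can.
def first_available_library(candidates)
  candidates.find do |name|
    begin
      require name
      true
    rescue LoadError
      false
    end
  end
end

# Prefer libxml-ruby when it is installed; otherwise fall back to REXML.
XML_BACKEND = first_available_library(%w[libxml rexml/document])

# A user setting (the hypothetical `forced` argument) could override
# whatever was detected at load time.
def choose_xml_backend(forced = nil)
  forced || XML_BACKEND
end
```

Detecting the backend once at load time keeps the check out of the per-record parsing path, which is where the time goes.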

But back to the data. A few points are obvious:

  • XML with REXML is dead-slow on both platforms (at least an order of magnitude slower)
  • XML with LibXML is competitive with binary MARC (within 20% or so)
  • Even with REXML, though, time to create MARC records out of the 50 input strings is less than a second, which might be ok depending on your application.

Full results

As with last time, the total numbers below show how long it took to process all 40 sets of 50 records. The unadorned numbers are the average time it took to process a set of 50 records.

Call up solr with a null search, get 2000 records back in batches of 50 with wt=ruby, eval it, and stick it into arrays

jruby-Get/Eval data              0.143550
mri-Get/Eval data                0.106550

jruby-Get/Eval data (total)      5.742000
mri-Get/Eval data (total)        4.262017

Turn raw strings into MARC::Record objects from MARC-Binary strings, joining all the returned MARC together first

jruby-marc4j-multistring         0.026575
jruby-marc-multistring           0.037175
mri-marc-multistring             0.073396

jruby-marc4j-multistring (total) 1.063000
jruby-marc-multistring (total)   1.487000
mri-marc-multistring (total)     2.935842

Turn raw strings into MARC::Record objects from MARC-XML

mri-marc-LibXML                  0.091332
jruby-marc-REXML                 0.799500
mri-marc-REXML                   0.948549

mri-marc-LibXML (total)          3.653276
jruby-marc-REXML (total)        31.980000
mri-marc-REXML (total)          37.941975

Conclusions

I’m not sure exactly where this leaves me, other than knowing that marc-xml is probably a viable alternative if you can use libxml. Getting a version of that code which uses native Java XML libraries when run under jruby might be a useful exercise.

One Response to “More Ruby MARC Benchmarks: Adding in MARC-XML”

  1. This is awesomely helpful, thanks!

    It would be awesome if you actually committed to ruby-marc, to be able to use libxml as an option in the standard distro. It could default to using libxml if libxml was available, otherwise default to rexml, with a user setting that could force either one.

Benchmarking MARC record parsing in Ruby

September 17, 2009 at 3:19 pm Category: Uncategorized

[Note: since I started writing this, I found out Bess & Co. store MARC-XML. That makes a difference, since XML in Ruby can be really, really slow]

[UPDATE: It turns out they don't use MARC-XML. They use MARC-Binary just like the rest of us. Oops.]

[UP-UPDATE Well, no, they do use MARC-XML. I'm not afraid to constantly change my story. This is why I'm the best investigative reporter in the business]

The other day on the blacklight mailing list, Bess Sadler wrote:

Yes, we do still include the full marc record, but the rule of thumb we’re currently using is that anything that needs to display in the index view (the search results) needs to be broken out into a separate display field, because retrieving and parsing marc records for every item in a list of search results is too much of a performance hit.

This surprised me a fair bit, because in our implementation of VuFind (which uses PHP, versus Ruby for Blacklight) I do just that — grab the MARC out of Solr, parse it, and pull stuff like full titles and such out of it.

As it turns out, I’d been screwing around with calling marc4j from jruby, anyway, so I threw that into the mix, and here’s what I found.

What the benchmark tries to measure

The focus is on measuring time to parse MARC records as returned in a field from Solr in MARC-binary.

I got 40 sets of 50 records each (2000 records) from our Solr instance in ruby format and extracted the binary MARC strings. This resulted in an array of 40 sets of 50 strings, each of which is a valid MARC record.
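Gathering the data looks something like the sketch below. The base URL, the `fullrecord` field name, and the helper names are assumptions for illustration, not the exact script used here. The key point is that Solr's `wt=ruby` response is a Ruby hash literal, so `eval` turns it straight into a Hash.

```ruby
require 'net/http'
require 'uri'

# Fetch one page of a match-everything search as a wt=ruby response body.
# base_url and the field name 'fullrecord' are assumptions.
def fetch_page(base_url, start, rows = 50)
  uri = URI("#{base_url}/select?q=*:*&wt=ruby&fl=fullrecord&start=#{start}&rows=#{rows}")
  Net::HTTP.get(uri)
end

# eval the response and pull the MARC string out of each document.
# Only eval responses from a Solr server you trust.
def marc_strings(response_body, field = 'fullrecord')
  data = eval(response_body)
  data['response']['docs'].map { |doc| doc[field] }
end

# Build an array of `sets` arrays, each holding `rows` MARC strings.
def fetch_sets(base_url, sets = 40, rows = 50)
  (0...sets).map { |i| marc_strings(fetch_page(base_url, i * rows, rows)) }
end
```

Keeping the fetch and the eval in separate methods makes it possible to time the "Get/Eval data" step apart from the parsing.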

Fifty records seems largish to me (we only display 20 at a time), but I thought I’d swing for the fences.

I’m testing along three(ish) dimensions:

  • jruby vs mri
  • marc4j vs ruby-marc (only on jruby, obviously)
  • parsing each string individually, or globbing them all together and treating it as if it’s a multi-record file

[Note that MRI is using Net::HTTP to get the data; I presume Curl would be faster still. It's already faster than jruby]
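Per-set averages and totals of this kind can be collected with Ruby's standard Benchmark module. The harness below is a minimal sketch, not the script that produced the numbers in this post; `time_sets` is a made-up name and the workload in the usage lines is a stand-in for real MARC parsing.

```ruby
require 'benchmark'

# Run the given block once per set, timing each run, and return the
# summed timings along with the per-set average.
def time_sets(sets)
  total = sets.map { |set| Benchmark.measure { yield set } }.reduce(:+)
  { total: total, average: total / sets.size }
end

# Usage: 40 sets of 50 stand-in strings, with a trivial workload.
sets = Array.new(40) { Array.new(50) { 'stand-in record' } }
result = time_sets(sets) { |set| set.each { |s| s.upcase } }
puts result[:average].format('%u %t (%r)')
puts result[:total].format('%u %t (%r)')
```

The `%u`, `%t`, and `%r` format codes print user, total, and real time, matching the User/Total/Real column layout.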

The following data show the average time to parse out each set of 50 records and extract the first 245 (title) field from each one, along with the totals for doing all 2000 records.

Method                           User       Total      Real      

jruby Get/Eval data              0.134750   0.134750 (  0.134850)
jruby Get/Eval data (2000)       5.390000   5.390000 (  5.394000)

MRI Get/Eval data                0.008500   0.012750 (  0.115942)
MRI Get/Eval data (2000)         0.340000   0.510000 (  4.637677)    

jruby-marc4j-oneAtATime          0.056075   0.056075  (0.056125)
jruby-marc4j-multistring         0.027925   0.027925  (0.028000)

jruby-marc-oneAtATime            0.066625   0.066625  (0.066650)
jruby-marc-multistring           0.034300   0.034300  (0.034325)

mri-marc-oneAtATime              0.084500   0.085250  (0.086597)
mri-marc-multistring             0.085000   0.085750  (0.086026)

jruby-marc4j-oneAtATime (2000)   2.243000   2.243000  (2.244999)
jruby-marc-oneAtATime (2000)     2.665001   2.665001  (2.666000)
mri-marc-oneAtATime (2000)       3.380000   3.410000  (3.463888)


jruby-marc4j-multistring (2000)  1.117001   1.117001  (1.120001)
jruby-marc-multistring (2000)    1.371999   1.371999  (1.372999)
mri-marc-multistring (2000)      3.400000   3.430000  (3.441052)

So…the worst-case scenario is taking an average of 0.085 seconds to get the first title field out of each one of 50 binary MARC records once we’ve got them.

Now, I’m sure all my records came out of the cache, so my query time wasn’t very long. But we still end up with a maximum of roughly 0.2 seconds (roughly 0.12 to 0.13 to get and eval the data, plus 0.07 to 0.09 to parse it), plus the time to actually do the query, to end up with a set of 50 MARC records.

Looking at the totals, MRI’s bottleneck appears to be the actual parsing, whereas under jruby constructing the input streams is expensive (at least the way I’m doing it), so there’s a benefit to concatenating everything into one longish string before parsing.
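Concatenating works at all because every binary MARC record carries its own length in the first five bytes of its leader, so a reader can walk one long string record by record. The sketch below is hand-rolled for illustration (it is not how ruby-marc or marc4j are implemented, and `split_records` and `first_field` are made-up names); it splits a joined string and pulls the first field with a given tag out via the record's directory.

```ruby
FIELD_END = "\x1E".b   # field terminator

# Split a concatenated ("multistring") blob into single records, using the
# five-digit record length stored at the start of each leader.
def split_records(blob)
  blob = blob.b
  records = []
  offset = 0
  while offset < blob.bytesize
    length = blob[offset, 5].to_i
    break if length <= 0            # malformed leader: stop rather than loop forever
    records << blob[offset, length]
    offset += length
  end
  records
end

# Return the raw data of the first field with the given tag (indicators and
# subfield delimiters included), or nil if the record has no such field.
def first_field(record, tag)
  record = record.b
  base = record[12, 5].to_i             # leader bytes 12-16: base address of data
  dir_end = record.index(FIELD_END)     # directory ends at the first field terminator
  return nil unless dir_end
  (24...dir_end).step(12) do |i|        # 12-byte entries: tag(3), length(4), start(5)
    entry = record[i, 12]
    next unless entry[0, 3] == tag
    # The stored length includes the field terminator, so drop one byte.
    return record[base + entry[7, 5].to_i, entry[3, 4].to_i - 1]
  end
  nil
end
```

With these, `split_records(strings.join).map { |r| first_field(r, '245') }` would return the raw title field of each record, where `strings` is a hypothetical array of binary MARC strings.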

Marc4j is faster (20%ish), but not enough faster to be worth the effort in my mind. Keep in mind that I have no idea how fast Marc4j is when running under pure java, without all the jruby overhead.

Bottom line, though: that seems fast enough to me.

I’ll try to benchmark with XML later on today or tomorrow.

2 Responses to “Benchmarking MARC record parsing in Ruby”

  1. Thanks Bill, incredibly helpful!

    Can you explain what the “Get/Eval Data” benchmarks are? I don’t really understand, but they seem quite a bit longer, trying to figure out what implications they have. (Cause part of what some of us are trying to figure out is not just what tool is fastest at parsing MARC, but whether, in a certain context, we should avoid parsing MARC altogether).

    Processing one at a time vs putting them all together in a multi-record string doesn’t seem to make a difference in MRI but does in jruby, interesting.

  2. Simon Spero says:

    My C++ MPI code could parse the scriblio set (~7M records) in about 27 seconds on two dual-core macs (gzipped binary marc, 1.85GHz imac & 2.0GHz macbook pro, both first models). This test was generating bigram tag/subcode counts.

    Performance scaled fairly linearly; with different IO->CPU ratios, it gets faster not to read compressed.

    Marc4j has a huge performance leak in the character set conversion routines. This would probably have been fixed by now.

    BTW, has anyone ever written a marc8 to unicode converter without screwing up that one character that maps to nothing? I think it’s there just to make sure they didn’t end up with some part of the spec being accidentally consistent :)

    The MARC leader is essentially worthless: for the LC data there’s less than a byte’s worth of entropy in it.

    MARC-XML is a performance killer. Use it to exchange data, or for archiving, but not for real time work.