<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Robot Librarian</title>
	<atom:link href="http://robotlibrarian.billdueber.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://robotlibrarian.billdueber.com</link>
	<description>Disclaimer: I'm not actually a robot.</description>
	<lastBuildDate>Thu, 04 Mar 2010 17:05:49 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Pushing MARC to Solr; processing times and threading and such</title>
		<link>http://robotlibrarian.billdueber.com/pushing-marc-to-solr-processing-times-and-threading-and-such/</link>
		<comments>http://robotlibrarian.billdueber.com/pushing-marc-to-solr-processing-times-and-threading-and-such/#comments</comments>
		<pubDate>Thu, 04 Mar 2010 16:38:03 +0000</pubDate>
		<dc:creator>Bill</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://robotlibrarian.billdueber.com/?p=214</guid>
		<description><![CDATA[[This is in response to a thread on the blacklight mailing list about getting MARC data into Solr.]

What&#8217;s the question?

The question came up, &#8220;How much time do we spend processing the MARC vs trying to push it into Solr?&#8221;. Bob Haschart found that even with a pretty damn complicated processing stage, pushing the data to [...]]]></description>
			<content:encoded><![CDATA[<p>[This is in response to a <a href="http://groups.google.com/group/blacklight-development/browse_thread/thread/672b7269ada16a61?hl=en">thread on the blacklight mailing</a> list about getting <acronym title="MAchine Readable Cataloging">MARC</acronym> data into <acronym title="Solr isn't an acronym, silly!">Solr</acronym>.]</p>

<h2>What&#8217;s the question?</h2>

<p>The question came up, &#8220;How much time do we spend processing the <acronym title="MAchine Readable Cataloging">MARC</acronym> vs trying to push it into <acronym title="Solr isn't an acronym, silly!">Solr</acronym>?&#8221; Bob Haschart found that even with a pretty damn complicated processing stage, pushing the data to Solr
still took at least as long as the processing stage.</p>

<p>I&#8217;m interested because I&#8217;ve been struggling to write a solrmarc-like system that runs under JRuby. Architecturally, the big difference between my stuff and solrmarc is that I use the StreamingUpdateSolrServer (on Erik Hatcher&#8217;s suggestion). So I thought I&#8217;d check how things break down for me.</p>

<p>Here are my numbers running under JRuby (using MARC4J as the marc
implementation) with the <acronym title="Solr isn't an acronym, silly!">Solr</acronym> StreamingUpdateSolrServer. Obviously, there are
a lot of differences between this and solrmarc, but I&#8217;m hoping that while it&#8217;s
not comparing apples to apples, it&#8217;s at least comparing apples to some sort of
processed cheese-like product.</p>

<h2>What work is being done on what?</h2>

<p>The data set is a file of 18,881 <acronym title="MAchine Readable Cataloging">MARC</acronym> records in marc-binary format. It&#8217;s
probably not big enough to get a great idea of how things will run over the
long (many millions of records) haul, but it&#8217;ll do for this rough-cut stuff.</p>

<p>I break my processing down into five categories:</p>

<ul>
<li>Read the records into marc4j objects and do nothing. This is a baseline of sorts.</li>
<li>The &#8220;normal&#8221; fields are anything that you could do with SolrMarc without a
custom routine; the actual processing is done in JRuby. </li>
<li>Custom fields are generated with JRuby code, but these are things that in solrmarc would require a custom routine.</li>
<li>The big &#8220;allfields&#8221; field is text from tags 100 through 900.</li>
<li>The &#8220;to_xml&#8221; routine is just calling the underlying marc4j <acronym title="Extensible Markup Language">XML</acronym> output and stuffing it into a string.</li>
</ul>

<p>The schema used is our normal UMICH schema <em>except for</em> High Level Browse
(which appears in <a href="http://mirlyn.lib.umich.edu/">our catalog</a> as &#8220;Academic
Discipline&#8221;). The code for that is written in Java, and I just call it from
JRuby when I&#8217;m using it. I excluded it because it&#8217;s incredibly expensive, both at startup time (when it loads a giant database of call-number ranges and associated categories) and for processing &#8212; there&#8217;s a lot of call-number normalization, long-string comparisons, some modified binary searches, etc. etc. etc. It&#8217;s expensive. Trust me.</p>

<p>The <acronym title="Solr isn't an acronym, silly!">Solr</acronym> server itself is on a different, incredibly-beefy machine, and is
emptied out before each invocation that involves actually pushing data to it (with a delete-by-query <code>*:*</code>).</p>

<h2>How fast were things on my desktop?</h2>

<ul>
<li>18,881 records in marc-binary format</li>
<li>Times are in seconds, run on my desktop</li>
<li>Remember, you can&#8217;t compare these numbers to Bob&#8217;s because we&#8217;re doing
different things to different data. </li>
</ul>

<table>
<thead>
<tr>
  <th align="right">Total Seconds</th>
  <th>Description</th>
</tr>
</thead>
<tbody>
<tr>
  <td align="right">19</td>
  <td>Just read the records with marc4j and do nothing.</td>
</tr>
<tr>
  <td align="right">85</td>
  <td>Read and do 35 &#8220;normal&#8221; fields (no custom)</td>
</tr>
<tr>
  <td align="right">104</td>
  <td>Read, 35 normal, 15 custom fields</td>
</tr>
<tr>
  <td align="right">110</td>
  <td>Read, normal, custom, allfields</td>
</tr>
<tr>
  <td align="right">129</td>
  <td>Read, normal, custom, allfields, to_xml</td>
</tr>
<tr>
  <td align="right">136</td>
  <td>Read, normal, custom, allfields, to_xml, 2-threaded SUSS, commit every 5K docs</td>
</tr>
<tr>
  <td align="right">142</td>
  <td>Read, normal, custom, allfields, to_xml, 1-threaded SUSS, commit every 5k docs</td>
</tr>
<tr>
  <td align="right">124</td>
  <td>Read, normal, custom, allfields, to_xml, 1-threaded SUSS, commit every 5k docs, <strong>2 threads doing processing</strong></td>
</tr>
</tbody>
</table>

<p>We can also break the same numbers down as:</p>

<table>
<thead>
<tr>
  <th align="right">Seconds</th>
  <th>Description</th>
</tr>
</thead>
<tbody>
<tr>
  <td align="right">19</td>
  <td>read the records and do nothing</td>
</tr>
<tr>
  <td align="right">66</td>
  <td>process the 35 normal fields</td>
</tr>
<tr>
  <td align="right">19</td>
  <td>process the 15 custom fields</td>
</tr>
<tr>
  <td align="right">6</td>
  <td>generate the &#8220;allfields&#8221; field</td>
</tr>
<tr>
  <td align="right">19</td>
  <td>generate the <acronym title="Extensible Markup Language">XML</acronym> (yowza!)</td>
</tr>
<tr>
  <td align="right">7</td>
  <td>send to solr with two threads</td>
</tr>
<tr>
  <td align="right">13</td>
  <td>send to solr with one thread</td>
</tr>
</tbody>
</table>
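
<p>Each row in this second table is just the difference between successive totals in the first one. A minimal Ruby sketch of that arithmetic, using the numbers above (the variable names are made up for illustration and are not part of the indexing code):</p>

<pre>
# Cumulative wall-clock totals (seconds) from the first table, in order.
cumulative = [
  ['read',       19],
  ['normal',     85],
  ['custom',    104],
  ['allfields', 110],
  ['to_xml',    129]
]

# Cost of each stage = its total minus the total of the row before it.
previous = 0
stage_cost = cumulative.map do |name, total|
  cost = total - previous
  previous = total
  [name, cost]
end

stage_cost.each { |name, cost| puts "#{name}: #{cost}s" }

# Sending to Solr is whatever is left once all processing (129s) is subtracted.
send_two_threads = 136 - 129
send_one_thread  = 142 - 129
puts "send, 2 SUSS threads: #{send_two_threads}s"
puts "send, 1 SUSS thread: #{send_one_thread}s"
</pre>

<p>With one SUSS thread, that puts sending at 13 of 142 seconds, or roughly 9% of the run.</p>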

<p>Or like this:</p>

<table>
<thead>
<tr>
  <th align="right">Seconds</th>
  <th>Description</th>
</tr>
</thead>
<tbody>
<tr>
  <td align="right">129</td>
  <td>do all the reading and processing</td>
</tr>
<tr>
  <td align="right">13</td>
  <td>send to solr with one thread</td>
</tr>
</tbody>
</table>

<h2>Why does solr processing seem so much faster for me?</h2>

<p>There are a lot of reasons why my submit-to-solr might seem like less of a
burden. The ones I can think of off the top of my head are:</p>

<ul>
<li>SUSS is just faster than whatever solrmarc does. </li>
<li>My processing stage is so much slower than solrmarc&#8217;s (due to algorithms or JRuby-vs-Java, I don&#8217;t know) that the &#8220;push to solr&#8221; portion of it gets swallowed up by the slowness of the overall code.</li>
<li>The <acronym title="Solr isn't an acronym, silly!">Solr</acronym> server is so much faster than my desktop that my poor little
desktop can&#8217;t send it data fast enough to keep it busy.</li>
</ul>

<p><strong>For my setup, obviously adding a processing thread is a lot more beneficial
than adding a SUSS thread.</strong> My desktop doesn&#8217;t have that many threads lying around (adding a third processing thread actually slowed things down), so I moved the code to a beefier machine to see what happened.</p>
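
<p>For readers who want to picture the split between processing threads and a sending thread, here is a minimal sketch using only Ruby's standard <code>Thread</code> and <code>Queue</code>. It is not the code that produced the timings: <code>process</code>, the integer "records" and the <code>sent</code> array are all stand-ins, and a real indexer would hand each document to Solr instead of collecting it. Under JRuby, Ruby threads are native JVM threads, so the workers can actually run in parallel.</p>

<pre>
require 'thread'  # provides Queue on older Rubies; harmless on newer ones

PROCESSING_THREADS = 2          # hypothetical setting
records = (1..20).to_a          # stand-in for MARC records
work    = Queue.new             # records waiting to be processed
docs    = Queue.new             # finished documents waiting to be sent

records.each { |r| work.push(r) }
PROCESSING_THREADS.times { work.push(:done) }   # one stop marker per worker

# Stand-in for the expensive field extraction.
def process(record)
  { 'id' => record, 'title' => "Record #{record}" }
end

workers = Array.new(PROCESSING_THREADS) do
  Thread.new do
    while (record = work.pop) != :done
      docs.push(process(record))
    end
  end
end

# A single sender thread, playing the role of the one SUSS thread.
sent = []
sender = Thread.new do
  while (doc = docs.pop) != :done
    sent.push(doc)
  end
end

workers.each { |t| t.join }
docs.push(:done)                # all workers finished, so stop the sender
sender.join

puts "sent #{sent.size} documents"
</pre>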

<h2>Trying the same thing on a beefy machine</h2>

<p>This is the exact same code and data, but on a beefy machine (16 cores, gobs
of memory).</p>

<table>
<thead>
<tr>
  <th align="right">time</th>
  <th align="center">SUSS Threads</th>
  <th align="center">Processing Threads</th>
</tr>
</thead>
<tbody>
<tr>
  <td align="right">70</td>
  <td align="center">1</td>
  <td align="center">1     (was 142 seconds on the desktop)</td>
</tr>
<tr>
  <td align="right">47</td>
  <td align="center">1</td>
  <td align="center">2</td>
</tr>
<tr>
  <td align="right">39</td>
  <td align="center">1</td>
  <td align="center">3</td>
</tr>
<tr>
  <td align="right">35</td>
  <td align="center">1</td>
  <td align="center">4</td>
</tr>
<tr>
  <td align="right">68</td>
  <td align="center">2</td>
  <td align="center">1</td>
</tr>
<tr>
  <td align="right">48</td>
  <td align="center">2</td>
  <td align="center">2</td>
</tr>
<tr>
  <td align="right">38</td>
  <td align="center">2</td>
  <td align="center">3</td>
</tr>
<tr>
  <td align="right">34</td>
  <td align="center">2</td>
  <td align="center">4</td>
</tr>
</tbody>
</table>

<p>So, on my hardware anyway, there&#8217;s a sweet spot with one suss thread and
three processing threads. <acronym title="Your mileage may vary">YMMV</acronym>, of course.</p>

<h2>What have we learned?</h2>

<p>I&#8217;m not sure, to be honest. It&#8217;s logistically difficult for me to do the same
process in solrmarc because I&#8217;d have to rebuild everything without the HLB stuff. I guess for me, what I&#8217;ve learned is that if I&#8217;m going to continue working 
on my code, the places to focus my attention are threading (obviously) and <acronym title="MAchine Readable Cataloging">MARC</acronym>-<acronym title="Extensible Markup Language">XML</acronym> generation.</p>]]></content:encoded>
			<wfw:commentRss>http://robotlibrarian.billdueber.com/pushing-marc-to-solr-processing-times-and-threading-and-such/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>ruby-marc with pluggable readers</title>
		<link>http://robotlibrarian.billdueber.com/ruby-marc-with-pluggable-readers/</link>
		<comments>http://robotlibrarian.billdueber.com/ruby-marc-with-pluggable-readers/#comments</comments>
		<pubDate>Tue, 02 Mar 2010 17:55:43 +0000</pubDate>
		<dc:creator>Bill</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://robotlibrarian.billdueber.com/ruby-marc-with-pluggable-readers/</guid>
		<description><![CDATA[I&#8217;ve been messing with easier ways of adding parsers to ruby-marc&#8217;s MARC::Reader object. The idea is that you can do this:

&#160; require &#39;marc&#39;
&#160; require &#39;my_marc_stuff&#39;
&#160; 
&#160; mbreader = MARC::Reader.new&#40;&#39;test.mrc&#39;&#41; # =&#62; Stock marc binary reader
&#160; mbreader = MARC::Reader.new&#40;&#39;test.mrc&#39;, :readertype=&#62;:marcstrict&#41; # =&#62; ditto
&#160; 
&#160; MARC::Reader.register_parser&#40;My::MARC::Parser, :marcstrict&#41;
&#160; mbreader = MARC::Reader.new&#40;&#39;test.mrc&#39;&#41; # =&#62; Uses My::MARC::Parser now
&#160; 
&#160; xmlreader [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve been messing with easier ways of adding parsers to ruby-marc&#8217;s MARC::Reader object. The idea is that you can do this:</p>

<div class="geshi no ruby"><ol><li class="li1"><div class="de1">&nbsp; <span class="kw3">require</span> <span class="st0">&#39;marc&#39;</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="kw3">require</span> <span class="st0">&#39;my_marc_stuff&#39;</span></div></li>
<li class="li1"><div class="de1">&nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; mbreader = <span class="re2">MARC::Reader</span>.<span class="me1">new</span><span class="br0">&#40;</span><span class="st0">&#39;test.mrc&#39;</span><span class="br0">&#41;</span> <span class="co1"># =&gt; Stock marc binary reader</span></div></li>
<li class="li1"><div class="de1">&nbsp; mbreader = <span class="re2">MARC::Reader</span>.<span class="me1">new</span><span class="br0">&#40;</span><span class="st0">&#39;test.mrc&#39;</span>, <span class="re3">:readertype</span><span class="sy0">=&gt;</span>:marcstrict<span class="br0">&#41;</span> <span class="co1"># =&gt; ditto</span></div></li>
<li class="li1"><div class="de1">&nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; <span class="re2">MARC::Reader</span>.<span class="me1">register_parser</span><span class="br0">&#40;</span><span class="re2">My::MARC::Parser</span>, <span class="re3">:marcstrict</span><span class="br0">&#41;</span></div></li>
<li class="li1"><div class="de1">&nbsp; mbreader = <span class="re2">MARC::Reader</span>.<span class="me1">new</span><span class="br0">&#40;</span><span class="st0">&#39;test.mrc&#39;</span><span class="br0">&#41;</span> <span class="co1"># =&gt; Uses My::MARC::Parser now</span></div></li>
<li class="li1"><div class="de1">&nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; xmlreader = <span class="re2">MARC::Reader</span>.<span class="me1">new</span><span class="br0">&#40;</span><span class="st0">&#39;test.xml&#39;</span>, <span class="re3">:readertype</span><span class="sy0">=&gt;</span>:marcxml<span class="br0">&#41;</span></div></li>
<li class="li1"><div class="de1">&nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; <span class="co1"># &#8230;and maybe further on down the road</span></div></li>
<li class="li1"><div class="de1">&nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; asreader = <span class="re2">MARC::Reader</span>.<span class="me1">new</span><span class="br0">&#40;</span><span class="st0">&#39;test.seq&#39;</span>, <span class="re3">:readertype</span><span class="sy0">=&gt;</span>:alephsequential<span class="br0">&#41;</span></div></li>
<li class="li1"><div class="de1">&nbsp; mjreader = <span class="re2">MARC::Reader</span>.<span class="me1">new</span><span class="br0">&#40;</span><span class="st0">&#39;test.json&#39;</span>, <span class="re3">:readertype</span><span class="sy0">=&gt;</span>:marchashjson<span class="br0">&#41;</span></div></li></ol></div>

<p>A parser need only implement <code>#each</code> and a module-level method <code>#decode_from_string</code>.</p>
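
<p>To make the registration idea concrete, here is a self-contained toy version. It is <em>not</em> the ruby-marc implementation: <code>SketchReader</code>, <code>LineParser</code> and the <code>:lines</code> reader type are hypothetical names, and the "records" are just lines of text. It only shows how a reader might look up a registered parser by symbol and delegate <code>#each</code> to it.</p>

<pre>
# Hypothetical sketch of a reader with pluggable parsers.
class SketchReader
  include Enumerable

  @parsers = {}

  # Associate a parser class with a reader type symbol.
  def self.register_parser(parser_class, readertype)
    @parsers[readertype] = parser_class
  end

  def self.parser_for(readertype)
    @parsers.fetch(readertype) { raise ArgumentError, "No parser for #{readertype}" }
  end

  def initialize(source, opts = {})
    readertype = opts[:readertype] || :marcstrict
    @parser = self.class.parser_for(readertype).new(source)
  end

  # Delegate iteration to whichever parser was selected.
  def each
    @parser.each { |rec| yield rec }
  end
end

# A toy parser: one "record" per line. It provides the two methods the
# post asks for, #each and a module-level decode_from_string.
class LineParser
  def self.decode_from_string(str)
    str.strip
  end

  def initialize(source)
    @source = source
  end

  def each
    @source.each_line { |line| yield self.class.decode_from_string(line) }
  end
end

SketchReader.register_parser(LineParser, :lines)
reader = SketchReader.new("first\nsecond\n", :readertype => :lines)
reader.each { |rec| puts rec }
</pre>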

<p>Read all about it <a href="http://github.com/billdueber/ruby-marc-plugable-readers">on the github page</a>.</p>]]></content:encoded>
			<wfw:commentRss>http://robotlibrarian.billdueber.com/ruby-marc-with-pluggable-readers/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>New interest in MARC-HASH / JSON</title>
		<link>http://robotlibrarian.billdueber.com/new-interest-in-marc-hash-json/</link>
		<comments>http://robotlibrarian.billdueber.com/new-interest-in-marc-hash-json/#comments</comments>
		<pubDate>Fri, 26 Feb 2010 04:29:46 +0000</pubDate>
		<dc:creator>Bill</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://robotlibrarian.billdueber.com/?p=204</guid>
		<description><![CDATA[For reasons I&#8217;m still not entirely clear on (I wasn&#8217;t there), the Code4Lib 2010 conference this week inspired renewed interest in a JSON-based format for MARC data.

When I initially looked at MARC-HASH almost a year ago, I was mostly looking for something that wasn&#8217;t such a pain in the butt to work with, something that [...]]]></description>
			<content:encoded><![CDATA[<p>For reasons I&#8217;m still not entirely clear on (I wasn&#8217;t there), the Code4Lib 2010 conference this week inspired renewed interest in a JSON-based format for <acronym title="MAchine Readable Cataloging">MARC</acronym> data.</p>

<p>When I initially looked at <acronym title="MAchine Readable Cataloging">MARC</acronym>-HASH <a href="http://robotlibrarian.billdueber.com/marc-hash-the-saga-continues-now-with-even-less-structure/">almost a year ago</a>, I was mostly looking for something that wasn&#8217;t such a pain in the butt to work with, something that would marshal into multiple formats easily and would simply and easily round-trip.</p>

<p>Now, though, a lot of us are looking for a <acronym title="MAchine Readable Cataloging">MARC</acronym> format that (a) doesn&#8217;t suffer from the length limitations of binary <acronym title="MAchine Readable Cataloging">MARC</acronym>, but (b) is less painful (both in code and processing time) than <acronym title="MAchine Readable Cataloging">MARC</acronym>-<acronym title="Extensible Markup Language">XML</acronym>, and it&#8217;s worth re-visiting. </p>

<p>For at least a few folks, un-marshaling time is a factor, since no matter what you&#8217;re doing, processing <acronym title="Extensible Markup Language">XML</acronym> is slower than most other options. Much of that expense is due to functionality and data-safety features that just aren&#8217;t a big win with a brain-dead format like <acronym title="MAchine Readable Cataloging">MARC</acronym>, so it&#8217;s worth looking at alternatives.</p>

<h2>What is <acronym title="MAchine Readable Cataloging">MARC</acronym>-HASH?</h2>

<p>At some point, we&#8217;ll want a real spec, but right now it&#8217;s just this:
</p>

<div class="geshi no json"><ol><li class="li1"><div class="de1">&nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; # A record is a four-pair hash, as follows. UTF-8 is mandatory.</div></li>
<li class="li1"><div class="de1">&nbsp; {</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &quot;type&quot; : &quot;marc-hash&quot;,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &quot;version&quot; : [1, 0],</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &quot;leader&quot; : &quot;&#8230;leader string &#8230; &quot;,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &quot;fields&quot; : [array, of, fields]</div></li>
<li class="li1"><div class="de1">&nbsp; }</div></li>
<li class="li1"><div class="de1">&nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; # A field is an array of either 2 or 4 elements</div></li>
<li class="li1"><div class="de1">&nbsp; [tag, value] # a control field</div></li>
<li class="li1"><div class="de1">&nbsp; [tag, ind1, ind2, [array, of, subfields]] # a data field</div></li>
<li class="li1"><div class="de1">&nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; # A subfield is an array of two elements</div></li>
<li class="li1"><div class="de1">&nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; [code, value]</div></li></ol></div>

<p>So, a short example:</p>

<div class="geshi no json"><ol><li class="li1"><div class="de1">&nbsp; {</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &quot;type&quot; : &quot;marc-hash&quot;,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &quot;version&quot; : [1, 0],</div></li>
<li class="li1"><div class="de1">&nbsp;</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &quot;leader&quot; : &quot;leader string&quot;,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &quot;fields&quot; : [</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; [&quot;001&quot;, &quot;001 value&quot;],</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; [&quot;002&quot;, &quot;002 value&quot;],</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; [&quot;010&quot;, &quot; &quot;, &quot; &quot;,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; [</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; [&quot;a&quot;, &quot;68009499&quot;]</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; ]</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; ],</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; [&quot;035&quot;, &quot; &quot;, &quot; &quot;,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; [</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; [&quot;a&quot;, &quot;(RLIN)MIUG0000733-B&quot;]</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; ]</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; ],</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; [&quot;035&quot;, &quot; &quot;, &quot; &quot;,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; [</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; [&quot;a&quot;, &quot;(CaOTULAS)159818014&quot;]</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; ]</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; ],</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; [&quot;245&quot;, &quot;1&quot;, &quot;0&quot;,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; [</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; [&quot;a&quot;, &quot;Capitalism, primitive and modern;&quot;],</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; [&quot;b&quot;, &quot;some aspects of Tolai economic growth&quot; ],</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; [&quot;c&quot;, &quot;[by] T. Scarlett Epstein.&quot;]</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; ]</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; ]</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; ]</div></li>
<li class="li1"><div class="de1">&nbsp; }</div></li></ol></div>
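
<p>The same structure is easy to build and query as plain Ruby hashes and arrays. The sketch below uses only the standard <code>json</code> library; <code>subfield_value</code> is a hypothetical helper written for this illustration, not part of ruby-marc. It builds a cut-down version of the record above, looks up the 245$a, and checks that the record survives a trip through JSON unchanged.</p>

<pre>
require 'json'

record = {
  'type'    => 'marc-hash',
  'version' => [1, 0],
  'leader'  => 'leader string',
  'fields'  => [
    ['001', '001 value'],
    ['010', ' ', ' ', [['a', '68009499']]],
    ['245', '1', '0', [
      ['a', 'Capitalism, primitive and modern;'],
      ['b', 'some aspects of Tolai economic growth'],
      ['c', '[by] T. Scarlett Epstein.']
    ]]
  ]
}

# Look up the first value for a tag/subfield code. Control fields have two
# elements and data fields have four, so the size tells them apart.
def subfield_value(rec, tag, code)
  field = rec['fields'].find { |f| f[0] == tag and f.size == 4 }
  return nil unless field
  pair = field[3].find { |sf| sf[0] == code }
  pair ? pair[1] : nil
end

json      = JSON.generate(record)
roundtrip = JSON.parse(json)

puts subfield_value(roundtrip, '245', 'a')
puts roundtrip == record
</pre>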

<h2>How's the speed?</h2>

<p>I think it's important to separate the format marc-hash from the eventual marshaling format -- partly because someone might actually want to use something other than JSON as a final format, but mostly because it forces implementations to separate hash-creation from json-creation and makes it easy to swap in different json libraries when a faster one comes along.</p>

<p>Having said that, in real life people are mostly concerned about JSON. So,
let's look at JSON performance.</p>

<p>The <acronym title="MAchine Readable Cataloging">MARC</acronym>-Binary and <acronym title="MAchine Readable Cataloging">MARC</acronym>-<acronym title="Extensible Markup Language">XML</acronym> files are normal files, as you'd expect. The JSON file is 
<a href="http://trephine.org/t/index.php?title=Newline_delimited_JSON">"Newline-Delimited JSON"</a> -- a single JSON record on each line.</p>
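
<p>As a rough illustration of why the newline-delimited layout is convenient, here is a sketch that writes two tiny marc-hash records one per line and reads them back a line at a time. It uses an in-memory <code>StringIO</code> in place of a real file, and the records themselves are invented for the example.</p>

<pre>
require 'json'
require 'stringio'

records = [
  { 'type' => 'marc-hash', 'version' => [1, 0], 'leader' => 'leader one', 'fields' => [['001', 'rec1']] },
  { 'type' => 'marc-hash', 'version' => [1, 0], 'leader' => 'leader two', 'fields' => [['001', 'rec2']] }
]

# Write: one compact JSON document per line. JSON.generate escapes any
# newlines inside strings, so a record never spans lines.
out = StringIO.new
records.each { |rec| out.puts(JSON.generate(rec)) }

# Read: parse one line at a time, so only one record is in memory at once.
ids = []
StringIO.new(out.string).each_line do |line|
  rec = JSON.parse(line)
  ids.push(rec['fields'].first[1])
end

puts ids.inspect
</pre>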

<p>The benchmark code looks like this:</p>

<pre>
  # Unmarshal
  x.report("MARC Binary") do
    reader = MARC::Reader.new('test.mrc')
    reader.each do |r|
      title = r['245']['a']
    end
  end

  # Marshal
  x.report("MARC Binary") do
    reader = MARC::Reader.new('test.mrc')
    writer = MARC::Writer.new('benchout.mrc')
    reader.each do |r|
      writer.write(r)
    end
    writer.close
  end
</pre>

<p>Under MRI, I used the nokogiri <acronym title="Extensible Markup Language">XML</acronym> parser and the yajl JSON gem. Under JRuby, it was the jstax <acronym title="Extensible Markup Language">XML</acronym> parser and the json-jruby JSON gem.</p>

<p>The test file is a set of 18,831 records I've been using for all my benchmarking of late. It's nothing special; just a nice size.</p>

<h3>Marshalling Speed (read from binary marc, dump to given format)</h3>

<p>Times are in seconds on my Macbook laptop, using ruby-marc.</p>

<table class="grid">
    <tr>
      <th>Format</th>
      <th>Ruby 1.8.7</th>
      <th>Ruby 1.9</th>
      <th>JRuby 1.4</th>
      <th>JRuby 1.4 --1.9</th>
    </tr>
    <tr>
      <td><acronym title="Extensible Markup Language">XML</acronym></td>
      <td>393</td>
      <td>443</td>
      <td>188</td>
      <td>356</td>
    </tr>
    <tr>
      <td><acronym title="MAchine Readable Cataloging">MARC</acronym> Binary</td>
      <td>36</td>
      <td>23</td>
      <td>23</td>
      <td>25</td>
    </tr>
    <tr>
      <td>JSON/ NDJ</td>
      <td>31</td>
      <td>19</td>
      <td>25</td>
      <td>ERROR</td>
    </tr>
  </table>

<h3>Unmarshalling speed (from pre-created file)</h3>

<p>Again, times are in seconds.</p>

<table class="grid">
    <tr>
      <th>Format</th>
      <th>Ruby 1.8.7</th>
      <th>Ruby 1.9</th>
      <th>JRuby 1.4</th>
      <th>JRuby 1.4 --1.9</th>
    </tr>
    <tr>
      <td><acronym title="Extensible Markup Language">XML</acronym></td>
      <td>113</td>
      <td>89</td>
      <td>75</td>
      <td>89</td>
    </tr>
    <tr>
      <td><acronym title="MAchine Readable Cataloging">MARC</acronym> Binary</td>
      <td>29</td>
      <td>16</td>
      <td>16</td>
      <td>19</td>
    </tr>
    <tr>
      <td>JSON/ NDJ</td>
      <td>17</td>
      <td>9</td>
      <td>13</td>
      <td>16</td>
    </tr>
  </table>

<h2> And so...</h2>

<p>I'm not sure what else to say. The format is totally brain-dead. It round-trips. It's fast enough. It has no length limitations. We define it to require UTF-8. The NDJ format is easy to understand and allows streaming of large document collections.</p>

<p>If folks are interested in implementing this across other libraries, that'd be great. Any thoughts?</p>]]></content:encoded>
			<wfw:commentRss>http://robotlibrarian.billdueber.com/new-interest-in-marc-hash-json/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>OCLC still not (NO! They are!) normalizing their LCCNs</title>
		<link>http://robotlibrarian.billdueber.com/oclc-still-not-normalizing-their-lccns/</link>
		<comments>http://robotlibrarian.billdueber.com/oclc-still-not-normalizing-their-lccns/#comments</comments>
		<pubDate>Thu, 18 Feb 2010 14:58:33 +0000</pubDate>
		<dc:creator>Bill</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://robotlibrarian.billdueber.com/?p=198</guid>
		<description><![CDATA[NOTE 2: It turns out that I did find a minor bug in the system, but that in general LCCN normalization is working correctly. I just happened to hit a weirdness with a bad LCCN and a little bug in the parser on their end. Which is getting fixed. So&#8230;good news all around, and huge [...]]]></description>
			<content:encoded><![CDATA[<blockquote>NOTE 2: It turns out that I did find a minor bug in the system, but that in general LCCN normalization is working correctly. I just happened to hit a weirdness with a bad LCCN and a little bug in the parser on their end. Which is getting fixed. So&#8230;good news all around, and huge kudos to 
Xiaoming Liu for his quick response!
</blockquote>

<blockquote>**NOTE** It strikes me that I haven&#8217;t seen a case where bad data results from sending a valid LCCN. The only verified problem is one of false negatives. Send a valid lccn, you&#8217;ll get back either good data or nothing (and the &#8220;nothing&#8221; might be in error). So, still a big problem, but not as THESKYISFALLING as I imply below.</blockquote>

<hr />

<p>A <a href="http://bibwild.wordpress.com/2009/03/11/normalize-your-lccns/">long time ago</a>, Jonathan Rochkind noted that the <acronym title="Online Computer Library Center">OCLC</acronym> doesn&#8217;t correctly <a href="http://www.loc.gov/marc/lccn-namespace.html">normalize their LCCNs</a>.</p>

<p>Well, it&#8217;s not fixed.</p>

<p>I could really, <em>really</em> use the xlccn service right about now &#8212; a great web service they provide that, much like xisbn and xissn and the other xXXXX (heh!) services, purports to allow you to put in an lccn and get data back on the item you&#8217;re interested in.</p>

<p>Except they &#8220;normalize&#8221; their LCCNs in a way that is not only incorrect, but causes namespace collisions. As near as I can tell, they throw out any leading non-digits and only keep up to the next non-digit.</p>

<p><em>The xLCCN service will silently provide no data or <strong>incorrect data</strong> for many LCCN requests!</em></p>

<p>An example:</p>

<ul>
<li>(F) Full LCCN is &#8220;sn 83011407&#8221;</li>
<li>(D) First set of digits is &#8220;83011407&#8221;. This is what I think the <acronym title="Online Computer Library Center">OCLC</acronym> is indexing.</li>
<li>(N) Correct normalization is &#8220;sn83011407&#8221;</li>
</ul>

<p>The problem, of course, is that (D) &#8220;83011407&#8221; <em>is itself a valid LCCN</em>.</p>

<ul>
<li>(F) is associated with <acronym title="Online Computer Library Center">OCLC</acronym># 47212967</li>
<li>(D) is associated with <acronym title="Online Computer Library Center">OCLC</acronym># 12505148. That&#8217;s <em>not the same record</em>.</li>
</ul>
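
<p>To see the collision in code, here is a sketch of the two behaviours. <code>normalize_lccn</code> follows the rules on the LoC page linked above (strip blanks, drop a slash and everything after it, remove the hyphen and zero-pad what follows it to six digits). <code>first_digit_run</code> is only a guess at the behaviour described earlier in this post, not OCLC's actual code. Both method names are made up for this example.</p>

<pre>
# Normalization per the LoC lccn-namespace rules.
def normalize_lccn(raw)
  lccn = raw.gsub(/\s+/, '')        # 1. remove all blanks
  lccn = lccn.sub(%r{/.*\z}, '')    # 2. drop a slash and everything to its right
  prefix, hyphen, serial = lccn.partition('-')
  return lccn if hyphen.empty?      # no hyphen: nothing more to do
  prefix + serial.rjust(6, '0')     # 3. drop the hyphen, left-pad serial to 6 digits
end

# ASSUMPTION: the suspected behaviour, keeping only the first run of digits.
def first_digit_run(raw)
  raw[/\d+/]
end

puts normalize_lccn('sn 83011407')    # sn83011407
puts first_digit_run('sn 83011407')   # 83011407
puts normalize_lccn('85-2')           # 85000002
</pre>

<p>The second result is the problem: it is indistinguishable from the separate, valid LCCN &#8220;83011407&#8221;.</p>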

<p>So, how do the <acronym title="Online Computer Library Center">OCLC</acronym> services respond?</p>

<ul>
<li>(F) Worldcat search finds the correct record (probably just doing a string match); xid finds nothing.</li>
<li>(D) Worldcat finds both correct and incorrect records. The xLCCN service finds <em>only</em> the incorrect record, <acronym title="Online Computer Library Center">OCLC</acronym># 12505148.</li>
<li>(N) Neither worldcat nor xid finds anything for the correctly normalized version.</li>
</ul>

<p>So, what am I supposed to do? Only use the service on LCCNs where the original
and normalized versions are the same and include only digits? Frustrating.</p>]]></content:encoded>
			<wfw:commentRss>http://robotlibrarian.billdueber.com/oclc-still-not-normalizing-their-lccns/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Indexing data into Solr via JRuby (with threads!)</title>
		<link>http://robotlibrarian.billdueber.com/indexing-data-into-solr-via-jruby-with-threads/</link>
		<comments>http://robotlibrarian.billdueber.com/indexing-data-into-solr-via-jruby-with-threads/#comments</comments>
		<pubDate>Tue, 16 Feb 2010 19:43:52 +0000</pubDate>
		<dc:creator>Bill</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://robotlibrarian.billdueber.com/?p=190</guid>
		<description><![CDATA[[Note: in this post I'm just going to focus on the "get stuff into Solr" part. My normal focus -- MARC data -- will
make an appearance in the next post when I talk about using this in addition to / instead of solrmarc.]

Working with Solr

I love me the Solr. I love everything about it except [...]]]></description>
			<content:encoded><![CDATA[<p>[Note: in this post I'm just going to focus on the "get stuff into <acronym title="Solr isn't an acronym, silly!">Solr</acronym>" part. My normal focus -- <acronym title="MAchine Readable Cataloging">MARC</acronym> data -- will
make an appearance in the next post when I talk about using this in addition to / instead of solrmarc.]</p>

<h2>Working with <acronym title="Solr isn't an acronym, silly!">Solr</acronym></h2>

<p>I love me the <acronym title="Solr isn't an acronym, silly!">Solr</acronym>. I love everything about it except that the best way to interact with it is via Java. I don&#8217;t so much love me the java.</p>

<p>So&#8230;taking Erik Hatcher&#8217;s lead and advice, as I will do whenever he offers either, I wrote some code to work within JRuby to deal with <acronym title="Solr isn't an acronym, silly!">Solr</acronym>.</p>

<h2>Getting the code</h2>

<p>I&#8217;ve added the gems to gemcutter, if you want to play along at home:</p>

<ul>
<li><strong>jruby&#95;producer&#95;consumer</strong> (<a href="http://github.com/billdueber/jruby_producer_consumer">github</a>, <a href="http://rdoc.info/projects/billdueber/jruby_producer_consumer">rdoc.info</a>) Ruby syntax for threaded operations under jruby</li>
<li><strong>jruby&#95;streaming&#95;update&#95;solr&#95;server</strong> (<a href="http://github.com/billdueber/jruby_streaming_update_solr_server">github</a>, <a href="http://rdoc.info/projects/billdueber/jruby_streaming_update_solr_server">rdoc.info</a>) Ruby syntax on top of the Java class of the same name</li>
<li><strong>marc4j4r</strong> (<a href="http://github.com/billdueber/marc4j4r">github</a>, <a href="http://rdoc.info/projects/billdueber/marc4j4r">rdoc.info</a>) Ruby syntax on top of the marc4j java library.</li>
</ul>

<p><em>WARNING</em>: None of these gems have a 1.0 version tag on them, and that means that the <acronym title="Application Programming Interface">API</acronym> may change a titch in 
  the future. Also, the fact that they&#8217;re released as gems means that it&#8217;s easy to release gems, not that I&#8217;m not
  an idiot.</p>

<h2>The basics: Using SolrInputDocument and StreamingUpdateSolrServer</h2>

<p>OK, with the disclaimer out of the way, let&#8217;s look at some code.</p>

<div class="geshi no ruby"><ol><li class="li1"><div class="de1">&nbsp; <span class="kw3">require</span> <span class="st0">&#39;rubygems&#39;</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="kw3">require</span> <span class="st0">&#39;jruby_streaming_update_solr_server&#39;</span></div></li>
<li class="li1"><div class="de1">&nbsp;</div></li>
<li class="li1"><div class="de1">&nbsp; solrurl = <span class="st0">&#39;http://your.solr.server:port/solr&#39;</span></div></li>
<li class="li1"><div class="de1">&nbsp; sussqueuesize = <span class="nu0">24</span> <span class="co1"># how many items to buffer on their way to solr</span></div></li>
<li class="li1"><div class="de1">&nbsp; sussthreads = <span class="nu0">1</span> &nbsp; <span class="co1"># how many threads to use to send stuff to solr</span></div></li>
<li class="li1"><div class="de1">&nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; suss = StreamingUpdateSolrServer.<span class="me1">new</span><span class="br0">&#40;</span>solrurl,sussqueuesize,sussthreads<span class="br0">&#41;</span></div></li>
<li class="li1"><div class="de1">&nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; <span class="co1"># Let&#39;s add a simple document via a hash: A title, three authors, and a year</span></div></li>
<li class="li1"><div class="de1">&nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; h = <span class="br0">&#123;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; <span class="re3">:title</span> <span class="sy0">=&gt;</span> <span class="st0">&quot;Never been deader&quot;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; <span class="re3">:author</span> <span class="sy0">=&gt;</span> <span class="br0">&#91;</span><span class="st0">&#39;Bill&#39;</span>, <span class="st0">&#39;Mike&#39;</span>, <span class="st0">&#39;Molly&#39;</span><span class="br0">&#93;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; <span class="re3">:year</span> <span class="sy0">=&gt;</span> <span class="nu0">2003</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="br0">&#125;</span></div></li>
<li class="li1"><div class="de1">&nbsp; suss <span class="sy0">&lt;&lt;</span> h</div></li>
<li class="li1"><div class="de1">&nbsp; suss.<span class="me1">commit</span></div></li>
<li class="li1"><div class="de1">&nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; <span class="co1"># YEA! You just added a document to solr and committed it. </span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="co1"># Have a cookie!</span></div></li>
<li class="li1"><div class="de1">&nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; <span class="co1"># We can also use a document object to do the same thing</span></div></li>
<li class="li1"><div class="de1">&nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; doc = SolrInputDocument.<span class="me1">new</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="co1"># Add the title</span></div></li>
<li class="li1"><div class="de1">&nbsp; doc <span class="sy0">&lt;&lt;</span> <span class="br0">&#91;</span><span class="st0">&#39;title&#39;</span>, <span class="st0">&#39;Never been deader&#39;</span><span class="br0">&#93;</span></div></li>
<li class="li1"><div class="de1">&nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; <span class="co1"># Add the first author</span></div></li>
<li class="li1"><div class="de1">&nbsp; doc <span class="sy0">&lt;&lt;</span> <span class="br0">&#91;</span><span class="re3">:author</span>, <span class="st0">&#39;Bill&#39;</span><span class="br0">&#93;</span></div></li>
<li class="li1"><div class="de1">&nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; <span class="co1"># Add more. Re-used keys mean you&#39;re adding additional values</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="co1"># Note values can be scalars or arrays</span></div></li>
<li class="li1"><div class="de1">&nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; doc <span class="sy0">&lt;&lt;</span> <span class="br0">&#91;</span><span class="re3">:author</span>, <span class="br0">&#91;</span><span class="st0">&#39;Mike&#39;</span>, <span class="st0">&#39;Molly&#39;</span><span class="br0">&#93;</span><span class="br0">&#93;</span></div></li>
<li class="li1"><div class="de1">&nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; <span class="co1"># Add the wrong year using []= syntax</span></div></li>
<li class="li1"><div class="de1">&nbsp; doc<span class="br0">&#91;</span><span class="re3">:year</span><span class="br0">&#93;</span> = <span class="nu0">2001</span></div></li>
<li class="li1"><div class="de1">&nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; <span class="co1"># Oops! fix it. []= overwrites existing value(s)</span></div></li>
<li class="li1"><div class="de1">&nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; doc<span class="br0">&#91;</span><span class="re3">:year</span><span class="br0">&#93;</span> = <span class="nu0">2003</span></div></li>
<li class="li1"><div class="de1">&nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; <span class="co1"># Finally, we can merge a hash (or anything else that responds to </span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="co1"># &#39;each_pair&#39; with key-value pairs) into an existing doc</span></div></li>
<li class="li1"><div class="de1">&nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; doc.<span class="me1">merge</span>!<span class="br0">&#40;</span><span class="br0">&#123;</span><span class="st0">&#39;author&#39;</span> <span class="sy0">=&gt;</span> <span class="st0">&#39;Ringo Starr&#39;</span>, <span class="st0">&#39;publisher&#39;</span> <span class="sy0">=&gt;</span> <span class="st0">&#39;Vanity Books&#39;</span><span class="br0">&#125;</span><span class="br0">&#41;</span></div></li>
<li class="li1"><div class="de1">&nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; <span class="co1"># Add it</span></div></li>
<li class="li1"><div class="de1">&nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; suss <span class="sy0">&lt;&lt;</span> doc</div></li>
<li class="li1"><div class="de1">&nbsp;</div></li>
<li class="li1"><div class="de1">&nbsp; <span class="co1"># Commit and optimize if you&#39;d like</span></div></li>
<li class="li1"><div class="de1">&nbsp;</div></li>
<li class="li1"><div class="de1">&nbsp; suss.<span class="me1">commit</span></div></li>
<li class="li1"><div class="de1">&nbsp; suss.<span class="me1">optimize</span> <span class="co1"># if you want</span></div></li></ol></div>

<p>Nothing really fancy in there &#8212; just a few things worth noting:</p>

<ul>
<li>A <code>suss</code> object will take a hash (again, anything that responds to <code>#each_pair</code>) or a SolrInputDocument</li>
<li>You can use either strings or symbols to represent <acronym title="Solr isn\'t an acronym, silly!">Solr</acronym> field names</li>
<li>Values can be either a single value, or an array of multiple values</li>
</ul>

<p>And there are three ways to get data into a doc:</p>

<ul>
<li>Via <code>&lt;&lt; [field, value(s)]</code> (additive)</li>
<li>Via <code>doc.merge! hash</code> (additive)</li>
<li>Via <code>doc[field] = value</code> (replaces)</li>
</ul>
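
<p>To make the additive-versus-replace distinction concrete, here&#8217;s a rough plain-Ruby stand-in. <code>SketchDoc</code> is a made-up class for illustration only, not the gem&#8217;s SolrInputDocument; it just mimics the semantics listed above, so treat it as a sketch rather than how the gem is actually implemented.</p>

```ruby
# SketchDoc is a hypothetical stand-in, NOT the real SolrInputDocument.
# It only mimics the add/replace semantics described in the post.
class SketchDoc
  def initialize
    @fields = Hash.new { |hash, key| hash[key] = [] }
  end

  # doc << [field, value(s)] appends; symbols and strings name the same field
  def <<(pair)
    field, values = pair
    @fields[field.to_s].concat(Array(values))
    self
  end

  # doc[field] = value(s) throws away whatever was there before
  def []=(field, values)
    @fields[field.to_s] = Array(values)
  end

  def [](field)
    @fields[field.to_s]
  end

  # merge! is additive and takes anything that responds to #each_pair
  def merge!(pairs)
    pairs.each_pair { |field, values| self << [field, values] }
    self
  end
end

doc = SketchDoc.new
doc << [:author, 'Bill']
doc << ['author', ['Mike', 'Molly']]
doc[:year] = 2001
doc[:year] = 2003
doc.merge!({ 'author' => 'Ringo', 'publisher' => 'Vanity Books' })

p doc[:author]  # every author added so far, in insertion order
p doc[:year]    # only the corrected year survives
```

<p>Note the parentheses around the hash passed to <code>merge!</code>: without them, Ruby reads the opening brace as the start of a block.</p>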

<h2>Adding Threads</h2>

<p>I also went down the garden path of threading things. There are an awful lot
of operations that are not thread-safe (e.g., reading a line from a file) but
once you&#8217;ve got a bunch of records to work with, turning them into <acronym title="Solr isn\'t an acronym, silly!">Solr</acronym>
documents is usually thread-safe.</p>

<p>My model is that there&#8217;s a producer (usually the method <code>#each</code>) from an 
underlying data object. A thread takes whatever that method yields
and sticks the values into a java 
BlockingQueue awaiting consumption. You then use <code>ProducerConsumer#threaded_each</code>
(or <code>ProducerConsumer#threaded_each_with_index</code>) to pull items out of the queue and do something useful with them.</p>
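
<p>The model above can be sketched in plain Ruby, with the standard library&#8217;s <code>SizedQueue</code> standing in for the Java BlockingQueue. The method name <code>sketch_threaded_each</code> and its argument order are my own invention for this example, not the gem&#8217;s <acronym title="Application Programming Interface">API</acronym>, and under MRI the threads won&#8217;t buy you real parallelism; it only shows the shape of the thing.</p>

```ruby
require 'thread' # Queue/SizedQueue; already loaded on modern Rubies

# Sketch only: one producer thread fills a bounded queue from source#each
# (or another yielding method), and several consumer threads drain it.
def sketch_threaded_each(source, queuesize = 5, numthreads = 3, meth = :each, &block)
  queue = SizedQueue.new(queuesize) # push blocks when the buffer is full
  done  = Object.new                # sentinel telling a consumer to stop

  producer = Thread.new do
    source.send(meth) { |item| queue.push(item) }
    numthreads.times { queue.push(done) } # one stop signal per consumer
  end

  consumers = Array.new(numthreads) do
    Thread.new do
      loop do
        item = queue.pop # blocks until something is available
        break if item.equal?(done)
        block.call(item)
      end
    end
  end

  producer.join
  consumers.each(&:join)
end

results = Queue.new # thread-safe place to collect output
sketch_threaded_each(1..10) { |n| results << n * n }

squares = []
squares << results.pop until results.empty?
p squares.sort # sorted because consumers may finish out of order
```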

<p>I extracted stuff into a library (jruby&#95;producer&#95;consumer) for your viewing pleasure.</p>

<p><strong>CONFUSION ALERT</strong>: It&#8217;s perhaps unfortunate that the object you send to <code>ProducerConsumer.new(obj)</code> must implement <code>#each</code> and that the ProducerConsumer method <code>#threaded_each</code> calls that underlying <code>#each</code>&#8230;well
there&#8217;s a lot of <code>#each</code>&#8217;s floating around. Keep them straight.</p>

<p>So&#8230;let&#8217;s look at some code to work with consumer threads.</p>

<div class="geshi no ruby"><ol><li class="li1"><div class="de1">&nbsp; <span class="co1"># Start off the same as before</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="kw3">require</span> <span class="st0">&#39;rubygems&#39;</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="kw3">require</span> <span class="st0">&#39;jruby_streaming_update_solr_server&#39;</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="kw3">require</span> <span class="st0">&#39;jruby_producer_consumer&#39;</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="kw3">require</span> <span class="st0">&#39;marc4j4r&#39;</span></div></li>
<li class="li1"><div class="de1">&nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; solrurl = <span class="st0">&#39;http://your.solr.server:port/solr&#39;</span></div></li>
<li class="li1"><div class="de1">&nbsp; sussqueuesize = <span class="nu0">24</span> <span class="co1"># how many items to buffer on their way to solr</span></div></li>
<li class="li1"><div class="de1">&nbsp; sussthreads = <span class="nu0">2</span> &nbsp; <span class="co1"># how many threads to use to send stuff to solr</span></div></li>
<li class="li1"><div class="de1">&nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; suss = StreamingUpdateSolrServer.<span class="me1">new</span><span class="br0">&#40;</span>solrurl,sussqueuesize,sussthreads<span class="br0">&#41;</span></div></li>
<li class="li1"><div class="de1">&nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; <span class="co1"># I&#39;ll go ahead and use a <acronym title="MAchine Readable Cataloging">MARC</acronym> file as my example, but won&#39;t talk about the</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="co1"># <acronym title="MAchine Readable Cataloging">MARC</acronym> parts of it. All you need to know is that the reader object</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="co1"># implements #each</span></div></li>
<li class="li1"><div class="de1">&nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; reader = MARC4J4R.<span class="me1">reader</span><span class="br0">&#40;</span><span class="st0">&#39;test.xml&#39;</span>, <span class="re3">:marcxml</span><span class="br0">&#41;</span></div></li>
<li class="li1"><div class="de1">&nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; <span class="co1"># Get a producer/consumer object with the reader at its base, using</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="co1"># the default method #each to get stuff out of it, and with the assumption</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="co1"># that we only need to keep the default 5 items in memory at a time to </span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="co1"># keep up with consumption</span></div></li>
<li class="li1"><div class="de1">&nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; pc = ProducerConsumer.<span class="me1">new</span><span class="br0">&#40;</span>reader<span class="br0">&#41;</span></div></li>
<li class="li1"><div class="de1">&nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; <span class="co1"># Get three threads to actually consume the things, turn them into solr </span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="co1"># documents, and send them to solr (potentially out of order)</span></div></li>
<li class="li1"><div class="de1">&nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; numconsumerthreads = <span class="nu0">3</span></div></li>
<li class="li1"><div class="de1">&nbsp; pc.<span class="me1">threaded_each</span><span class="br0">&#40;</span>numconsumerthreads<span class="br0">&#41;</span>.<span class="me1">each</span> <span class="kw1">do</span> <span class="sy0">|</span>r<span class="sy0">|</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; suss <span class="sy0">&lt;&lt;</span> turn_marc_record_into_a_hash_or_solrdoc<span class="br0">&#40;</span>r<span class="br0">&#41;</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="kw1">end</span></div></li>
<li class="li1"><div class="de1">&nbsp; suss.<span class="me1">commit</span></div></li></ol></div>

<p>Again, not a lot happening here.</p>

<ul>
<li>The &#8220;producer&#8221; is always one thread, because so little is thread-safe at the &#8216;each&#8217; level. In this case, there&#8217;s a single thread pulling data out of the file and turning it into <acronym title="MAchine Readable Cataloging">MARC</acronym> records, which are added to the internal BlockingQueue. I buffer 5 of these at a pop (the default) so the consumer threads don&#8217;t starve. I presume that producing items is cheaper than consuming them, or else this library won&#8217;t help you much. </li>
<li><code>ProducerConsumer#threaded_each</code> calls the <code>#each</code> method of the underlying object. You can substitute anything that yields, though, as in this example where I call <code>#each_line</code> instead of the default <code>#each</code></li>
</ul>

<div class="geshi no ruby"><ol><li class="li1"><div class="de1">&nbsp; queuesize = <span class="nu0">5</span></div></li>
<li class="li1"><div class="de1">&nbsp; pc = ProducerConsumer.<span class="me1">new</span><span class="br0">&#40;</span><span class="kw4">File</span>.<span class="me1">new</span><span class="br0">&#40;</span><span class="st0">&#39;myfile.txt&#39;</span><span class="br0">&#41;</span>, queuesize, <span class="re3">:each_line</span><span class="br0">&#41;</span></div></li></ol></div>

<ul>
<li>Keep track of your threads. In the threaded example above, there is one thread getting <acronym title="MAchine Readable Cataloging">MARC</acronym> records and putting them into the PC buffer (no way to change that), three threads consuming those records and sticking them into the <code>suss</code> object, and another two pulling stuff <em>out</em> of the <code>suss</code> object and sending things to <acronym title="Solr isn\'t an acronym, silly!">Solr</acronym>. And, of course, there&#8217;s other stuff running on the computer, too. Experiment and figure out what works best for your hardware.</li>
<li>See the docs for how to mess with what goes into a ProducerConsumer object. It&#8217;s entirely possible to use, say, <code>#each_slice</code>. There&#8217;s also a convenience method <code>#threaded_each_with_index</code>, but it does <em>not</em> call the underlying <code>#each_with_index</code>, it produces its own index as things are read. </li>
</ul>
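
<p>Here&#8217;s a sketch of what &#8220;produces its own index as things are read&#8221; might look like, assuming the numbering happens on the producer side. Again, <code>sketch_threaded_each_with_index</code> is a hypothetical name and this is plain Ruby with <code>SizedQueue</code>, not the gem&#8217;s code. It also feeds in an <code>#each_slice</code> enumerator to show that anything that yields will do.</p>

```ruby
require 'thread' # Queue/SizedQueue; already loaded on modern Rubies

# Sketch only: the single producer thread numbers items in read order,
# so the index stays meaningful even when consumers finish out of order.
def sketch_threaded_each_with_index(source, numthreads = 3, queuesize = 5, &block)
  queue = SizedQueue.new(queuesize)
  done  = Object.new # sentinel telling a consumer to stop

  producer = Thread.new do
    index = 0
    source.each do |item|
      queue.push([item, index]) # index is assigned here, at read time
      index += 1
    end
    numthreads.times { queue.push(done) }
  end

  consumers = Array.new(numthreads) do
    Thread.new do
      loop do
        entry = queue.pop
        break if entry.equal?(done)
        block.call(*entry) # yields item, index
      end
    end
  end

  producer.join
  consumers.each(&:join)
end

seen = Queue.new
slices = %w[a b c d].each_slice(2) # an Enumerator, which responds to #each
sketch_threaded_each_with_index(slices) { |pair, i| seen << [i, pair] }

pairs = []
pairs << seen.pop until seen.empty?
p pairs.sort
```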

<h2>Feedback not only welcome but necessary!</h2>

<p>I&#8217;ve done a lot of messing around with Ruby in the last 10 days or so, but I&#8217;m still basically converting from <acronym title="Practical Extraction and Report Language">Perl</acronym> in my head. Any comments, bug reports, or whatnot are definitely welcome!</p>]]></content:encoded>
			<wfw:commentRss>http://robotlibrarian.billdueber.com/indexing-data-into-solr-via-jruby-with-threads/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>jruby_producer_consumer dead-simple producer/consumer for JRuby</title>
		<link>http://robotlibrarian.billdueber.com/jruby_producer_consumer-dead-simple-producerconsumer-for-jruby/</link>
		<comments>http://robotlibrarian.billdueber.com/jruby_producer_consumer-dead-simple-producerconsumer-for-jruby/#comments</comments>
		<pubDate>Fri, 05 Feb 2010 19:46:46 +0000</pubDate>
		<dc:creator>Bill</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://robotlibrarian.billdueber.com/?p=188</guid>
		<description><![CDATA[Yea! My first gem ever released!

[YUCK! It was a disaster in a few ways! Don't look at this! It's hideous! There's a new jruby_producer_consumer gem on gemcutter that is slightly different from this in that it works. Ignore the stuff below.]

[In working on a threaded JRuby-based MARC-to-Solr project, I realized that my threading stuff was...ugly. [...]]]></description>
			<content:encoded><![CDATA[<p>Yea! My first gem ever released!</p>

<p style="color: red">[YUCK! It was a disaster in a few ways! Don't look at this! It's hideous! There's a new <a href="http://rdoc.info/projects/billdueber/jruby_producer_consumer">jruby_producer_consumer</a> gem on gemcutter that is slightly different from this in that it works. Ignore the stuff below.]</p>

<p>[In working on a threaded JRuby-based <acronym title="MAchine Readable Cataloging">MARC</acronym>-to-<acronym title="Solr isn\'t an acronym, silly!">Solr</acronym> project, I realized that my threading stuff was...ugly. And
I didn't really understand it. So I dug in today and wrote this.]</p>

<p>I&#8217;ve just pushed my first gem to <a href="http://gemcutter.org/">Gemcutter</a>: <a href="http://gemcutter.org/gems/jruby_producer_consumer">jruby_producer_consumer</a>, a <a href="http://jruby.org/">JRuby</a>-only
producer/consumer class that works with anything that provides <em>#each</em>.</p>

<p>It&#8217;s JRuby-only because it uses (a) a blocking queue implementation that&#8217;s native Java, and (b) threading, which isn&#8217;t 
a huge win under regular Ruby.</p>

<p>There&#8217;s no testing there because I&#8217;m not sure how to test threaded stuff <img src='http://robotlibrarian.billdueber.com/wp-includes/images/smilies/icon_sad.gif' alt=':-(' class='wp-smiley' /> </p>

<p>It is, I hope, easy to use:</p>

<div class="geshi no ruby"><ol><li class="li1"><div class="de1">&nbsp; &nbsp;<span class="kw3">require</span> <span class="st0">&#39;rubygems&#39;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp;<span class="kw3">require</span> <span class="st0">&#39;jruby_producer_consumer&#39;</span></div></li>
<li class="li1"><div class="de1">&nbsp;</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp;<span class="co1"># Create a ProducerConsumer. Arguments are anything that implements #each</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp;<span class="co1"># and the size for the underlying queue. For the former, I&#39;ll just use a Range object.</span></div></li>
<li class="li1"><div class="de1">&nbsp;</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp;eachable = <span class="nu0">1</span>..<span class="nu0">10</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp;queuesize = <span class="nu0">3</span></div></li>
<li class="li1"><div class="de1">&nbsp;</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp;pc = ProducerConsumer.<span class="me1">new</span><span class="br0">&#40;</span>eachable, queuesize<span class="br0">&#41;</span></div></li>
<li class="li1"><div class="de1">&nbsp;</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp;<span class="co1"># Just a method to show what happens</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp;<span class="kw1">def</span> sample <span class="br0">&#40;</span>consumerid, x<span class="br0">&#41;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp;<span class="kw3">puts</span> <span class="st0">&quot;Consumer #{consumerid}: consuming #{x}&quot;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp;<span class="kw3">sleep</span> <span class="nu0">1</span> <span class="co1"># otherwise this&#39;ll finish before I can create multiple consumers</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp;<span class="kw1">end</span></div></li>
<li class="li1"><div class="de1">&nbsp;</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp;<span class="co1"># Create three consumers. You can pass any number of args to</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp;<span class="co1"># #consumer, and must pass a block whose arguments are the</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp;<span class="co1"># object returned by eachable#each and those args back.</span></div></li>
<li class="li1"><div class="de1">&nbsp;</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp;<span class="br0">&#91;</span><span class="st0">&#39;A&#39;</span>, <span class="st0">&#39;B&#39;</span>, <span class="st0">&#39;C&#39;</span><span class="br0">&#93;</span>.<span class="me1">each</span> <span class="kw1">do</span> <span class="sy0">|</span>consumerid<span class="sy0">|</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp;pc.<span class="me1">consumer</span><span class="br0">&#40;</span>consumerid<span class="br0">&#41;</span> <span class="kw1">do</span> <span class="sy0">|</span>x, consumerid<span class="sy0">|</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp;sample<span class="br0">&#40;</span>consumerid, x<span class="br0">&#41;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp;<span class="kw1">end</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp;<span class="kw1">end</span></div></li>
<li class="li1"><div class="de1">&nbsp;</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp;<span class="co1"># OUTPUT</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp;<span class="co1"># Consumer A: consuming 1</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp;<span class="co1"># Consumer B: consuming 2</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp;<span class="co1"># Consumer C: consuming 3</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp;<span class="co1"># Consumer A: consuming 4</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp;<span class="co1"># Consumer B: consuming 5</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp;<span class="co1"># Consumer C: consuming 6</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp;<span class="co1"># Consumer B: consuming 7</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp;<span class="co1"># Consumer A: consuming 8</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp;<span class="co1"># Consumer C: consuming 9</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp;<span class="co1"># Consumer B: consuming 10</span></div></li></ol></div>]]></content:encoded>
			<wfw:commentRss>http://robotlibrarian.billdueber.com/jruby_producer_consumer-dead-simple-producerconsumer-for-jruby/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Still another look at MARC parsing in ruby and jruby</title>
		<link>http://robotlibrarian.billdueber.com/still-another-look-at-marc-parsing-in-ruby-and-jruby/</link>
		<comments>http://robotlibrarian.billdueber.com/still-another-look-at-marc-parsing-in-ruby-and-jruby/#comments</comments>
		<pubDate>Fri, 29 Jan 2010 15:51:38 +0000</pubDate>
		<dc:creator>Bill</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://robotlibrarian.billdueber.com/still-another-look-at-marc-parsing-in-ruby-and-jruby/</guid>
		<description><![CDATA[I&#8217;ve been looking at making a jruby-based solr indexer for MARC documents, and started off wanting to make sure I could determine if anything I did would be faster than our existing (solrmarc-based) setup. 


Assertion: The upper bound on how fast I can process records and send them to Solr can be approximated by looking [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve been looking at making a jruby-based solr indexer for <acronym title="MAchine Readable Cataloging">MARC</acronym> documents, and started off wanting to make sure I could determine if anything I did would be faster than our existing (solrmarc-based) setup. </p>

<div style="padding-left: 3em">
<p><strong>Assertion</strong>: The upper bound on how fast I can process records and send them to <acronym title="Solr isn\'t an acronym, silly!">Solr</acronym> can be approximated by looking at how fast I can parse (and do nothing else to) marc records from a file.</p>
<p><strong>Assertion</strong>: If I can&#8217;t write a system that&#8217;s faster than what we have now, it&#8217;s probably not worth my time even though being able to fall back to ruby instead of java would be nice.</p>
<p><strong>The Big Question</strong>: Is the <acronym title="MAchine Readable Cataloging">MARC</acronym> parsing process fast enough that it seems I might be able to write a system that runs faster than the solrmarc setup I have now?</p>
<p><strong>The Answer (see below)</strong>: Yes, if I use marc4j.</p>
</div>

<p>On our ridiculously-awesome hardware, right now we&#8217;re doing about 300 records/second for short files and 250 records/second for a full (6.5 million record) index, giving us a 7-8 hour reindex.</p>
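
<p>A quick back-of-the-envelope check on that reindex time, using nothing but the numbers quoted above (the variable names are mine):</p>

```ruby
# Figures quoted in the post
total_records   = 6_500_000
full_index_rate = 250.0 # records/second over a full index

seconds = total_records / full_index_rate
hours   = seconds / 3600

puts format('%d seconds, or about %.1f hours', seconds, hours)
```

<p>That lands at the low end of the 7-8 hour range, which seems about right once you allow for commits and other overhead that the raw rate doesn&#8217;t capture.</p>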

<p>I&#8217;ll just post the results without a lot of commentary. I warmed stuff up in all cases, and ran on my desktop (so I could compare to MRI ruby, which isn&#8217;t installed on the server) and on the server where we usually run these things.</p>

<ul>
  <li><em>The machines</em> are my desktop OSX machine and the beefy linux server where we usually do this stuff</li>
  <li><em>The platforms</em> are jruby 1.4 --server and MRI ruby 1.8.7</li>
  <li><em>The libraries</em> are marc4j and ruby-marc 0.3.3</li>
  <li><em>The parsers</em> are 
    <ul>
      <li>The standard binary parsers all around</li>
      <li>A home-grown AlephSequential format reader for the &#8216;seq&#8217; type. AlephSequential is a <acronym title="MAchine Readable Cataloging">MARC</acronym> representation that uses one line for each field. We use it because it doesn&#8217;t have length limitations and, not surprisingly, Aleph can spit it out pretty quickly compared to <acronym title="MAchine Readable Cataloging">MARC</acronym>-<acronym title="Extensible Markup Language">XML</acronym>.</li>
      <li>Whatever marc4j uses internally for <acronym title="MAchine Readable Cataloging">MARC</acronym>-<acronym title="Extensible Markup Language">XML</acronym></li>
      <li>ruby-marc&#8217;s &#8216;jstax&#8217; xml parser under jruby (which I wrote and apparently needs some love, see below)</li>
      <li>ruby-marc&#8217;s &#8216;libxml&#8217; xml parser under MRI ruby</li>
    </ul>
  </li>
  <li><em>Seconds</em> is the average of two rounds, with measurements taken after a warmup run in each case.</li>
</ul>

<p>The test files were 18,881 records in marc-xml, marc-binary, and AlephSequential formats.</p>

<pre>
MACHINE  PLATFORM LIBRARY     PARSER    SECONDS    REC/SECOND
desktop  jruby    marc4j      binary      4.06       4650
desktop  jruby    marc4j      xml         5.55       3401
desktop  jruby    ruby-marc   binary     17.35       1088
desktop  jruby    ruby-marc   jstax      80.11        236

desktop  ruby     ruby-marc   binary     33.54        562
desktop  ruby     ruby-marc   libxml     46.87        402

server   jruby    marc4j      binary      2.29       8245
server   jruby    marc4j      xml         3.36       5619
server   jruby    marc4j      AlephSeq    3.68       5130
server   jruby    ruby-marc   binary      9.93       1901
server   jruby    ruby-marc   jstax      44.56        424
</pre>

<p>The quick takeaways, with all the obvious caveats:</p>

<ul>
    <li>On the desktop, jruby with ruby-marc is roughly twice as fast at binary and a bit under twice as slow at xml compared with MRI</li>
    <li>Under jruby, marc4j is about four times as fast for binary and about an order of magnitude faster for xml compared with ruby-marc.</li>
    <li>The server is fast. </li>
  </ul>
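
<p>If you want to check those ratios yourself, here&#8217;s a small sketch that recomputes them from the desktop timings in the table. The hash keys and the <code>times_faster</code> lambda are just names I made up for this example.</p>

```ruby
# Desktop timings (seconds for the 18,881-record file), copied from the table
seconds = {
  'jruby/marc4j/binary'    => 4.06,
  'jruby/marc4j/xml'       => 5.55,
  'jruby/ruby-marc/binary' => 17.35,
  'jruby/ruby-marc/jstax'  => 80.11,
  'mri/ruby-marc/binary'   => 33.54,
  'mri/ruby-marc/libxml'   => 46.87
}

# How many times faster the first run is than the second
times_faster = lambda do |fast, slow|
  (seconds.fetch(slow) / seconds.fetch(fast)).round(1)
end

jruby_vs_mri_binary  = times_faster.call('jruby/ruby-marc/binary', 'mri/ruby-marc/binary')
mri_vs_jruby_xml     = times_faster.call('mri/ruby-marc/libxml', 'jruby/ruby-marc/jstax')
marc4j_vs_rm_binary  = times_faster.call('jruby/marc4j/binary', 'jruby/ruby-marc/binary')
marc4j_vs_rm_xml     = times_faster.call('jruby/marc4j/xml', 'jruby/ruby-marc/jstax')

puts "jruby vs MRI, ruby-marc binary:   #{jruby_vs_mri_binary}x faster"
puts "MRI libxml vs jruby jstax:        #{mri_vs_jruby_xml}x faster"
puts "marc4j vs ruby-marc, binary:      #{marc4j_vs_rm_binary}x faster"
puts "marc4j vs ruby-marc, xml:         #{marc4j_vs_rm_xml}x faster"
```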

<p>We know from previous experience that libxml is the fastest of the current MRI-based marc-xml readers and that jstax is the best of the current jruby-based marc-xml readers. And, finally, we know that many of us can&#8217;t use marc-binary format because our records are too big. </p>

<p>If I&#8217;m gonna use jruby (which I think I am due to wanting to use the StreamingUpdateSolrServer) I&#8217;m gonna need to use marc4j and just wrap it up in some nicer syntax.</p>]]></content:encoded>
			<wfw:commentRss>http://robotlibrarian.billdueber.com/still-another-look-at-marc-parsing-in-ruby-and-jruby/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Beta version of the HathiTrust Volumes API available</title>
		<link>http://robotlibrarian.billdueber.com/beta-version-of-the-hathitrust-volumes-api-available/</link>
		<comments>http://robotlibrarian.billdueber.com/beta-version-of-the-hathitrust-volumes-api-available/#comments</comments>
		<pubDate>Tue, 15 Dec 2009 18:48:08 +0000</pubDate>
		<dc:creator>Bill</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://robotlibrarian.billdueber.com/?p=177</guid>
		<description><![CDATA[MAJOR CHANGE

So, initially, this post listed that the way to separate multiple simultaneous requests was with a nice, URL-like slash (/) character.

Then, I remembered that LCCNs can have embedded slashes, e.g., 65063380//r85.

So, we&#8217;re back to using pipe (&#124;) characters to separate multiple calls &#8212; the examples below have been updated to reflect this.

Introduction

I&#8217;ve put up [...]]]></description>
			<content:encoded><![CDATA[<h2><span style="color: red">MAJOR CHANGE</span></h2>

<p>So, initially, this post listed that the way to separate multiple simultaneous requests was with a nice, <acronym title="Uniform Resource Locator">URL</acronym>-like slash (/) character.</p>

<p>Then, I remembered that LCCNs can have embedded slashes, e.g., 65063380//r85.</p>

<p>So, we&#8217;re back to using pipe (|) characters to separate multiple calls &#8212; the examples below have been updated to reflect this.</p>

<h2>Introduction</h2>

<p>I&#8217;ve put up a beta version of the HathiTrust Volumes <acronym title="Application Programming Interface">API</acronym> <a href="http://robotlibrarian.billdueber.com/thinking-through-a-simple-api-for-hathitrust-item-metadata/">previously discussed</a> on this blog and via email.</p>

<p>Currently, I&#8217;ve only got json output, although there is space in there for other output formats as necessary.</p>

<h2>What exactly is this?</h2>

<p>Given: an identifier or set of identifiers, this <acronym title="Application Programming Interface">API</acronym> will
Return: a set of matched records and a sorted list of the items available in the HathiTrust.</p>

<p>Useful, for example, if you want to display HathiTrust holdings alongside your own in your <acronym title="Online Public Access Catalog">OPAC</acronym>.</p>

<h2>Simple, single-value call</h2>

<p>Given the URL:</p>

<p><a href="http://catalog.hathitrust.org/api/volumes/oclc/15420548.json">http://catalog.hathitrust.org/api/volumes/oclc/15420548.json</a></p>

<p>You&#8217;ll get the following back:</p>

<div class="geshi no javascript"><ol><li class="li1"><div class="de1">&nbsp; <span class="br0">&#123;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; <span class="st0">&quot;records&quot;</span>:</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; <span class="br0">&#123;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;000791709&quot;</span>:</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="br0">&#123;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;recordURL&quot;</span>:<span class="st0">&quot;http://catalog.hathitrust.org/Record/000791709&quot;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;titles&quot;</span>:</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="br0">&#91;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;<span class="es0">\&quot;</span>Zhong gong dang shi<span class="es0">\&quot;</span> fu dao /&quot;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;<span class="es0">\u</span>300a<span class="es0">\u</span>4e2d<span class="es0">\u</span>5171<span class="es0">\u</span>515a<span class="es0">\u</span>53f2<span class="es0">\u</span>300b<span class="es0">\u</span>8f85<span class="es0">\u</span>5bfc /&quot;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="br0">&#93;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;isbns&quot;</span>: <span class="br0">&#91;</span><span class="br0">&#93;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;issns&quot;</span>: <span class="br0">&#91;</span><span class="br0">&#93;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;oclcs&quot;</span>: <span class="br0">&#91;</span><span class="st0">&quot;15420548&quot;</span><span class="br0">&#93;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;lccns&quot;</span>: <span class="br0">&#91;</span><span class="br0">&#93;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="br0">&#125;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; <span class="br0">&#125;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; <span class="st0">&quot;items&quot;</span>:</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; <span class="br0">&#91;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="br0">&#123;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;orig&quot;</span>:<span class="st0">&quot;University of Michigan&quot;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;fromRecord&quot;</span>:<span class="st0">&quot;000791709&quot;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;htid&quot;</span>:<span class="st0">&quot;mdp.39015058510069&quot;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;itemURL&quot;</span>:<span class="st0">&quot;http://hdl.handle.net/2027/mdp.39015058510069&quot;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;rightsCode&quot;</span>:<span class="st0">&quot;ic&quot;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;lastUpdate&quot;</span>:<span class="st0">&quot;00000000&quot;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;enumcron&quot;</span>:<span class="kw2">false</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="br0">&#125;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; <span class="br0">&#93;</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="br0">&#125;</span></div></li></ol></div>

<p>Note that the &#8216;records&#8217; are keyed on the local umid, also available in the &#8216;fromRecord&#8217; field of each item.</p>

<p>The generic short form is:</p>

<pre><code>http://catalog.hathitrust.org/api/volumes/(idtype)/id.(outputtype)
</code></pre>

<p>Right now the valid idtypes are:</p>

<ul>
<li>issn (will be normalized to just digits, no leading zeros)</li>
<li>isbn (will be normalized to an <acronym title="International Standard Book Number">ISBN</acronym>-13)</li>
<li>oclc (will be normalized to all digits, no leading zeros)</li>
<li>lccn (will be normalized <a href="http://www.loc.gov/marc/lccn-namespace.html#syntax">as recommended</a>)</li>
<li>htid (HathiTrust item id, seen above as &#8220;mdp.39015058510069&#8221;)</li>
<li>umid (the University of Michigan record ID, seen above in the &#8220;fromRecord&#8221; field of an item)</li>
</ul>

<p>Currently the only valid outputtype is &#8216;json&#8217;.</p>
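
<p>As a rough sketch of the short form, here is one way to assemble such a URL in Python. The helper name <em>single_id_url</em> and the constants are made up for illustration; only the URL pattern and the list of idtypes come from this post.</p>

```python
# Hypothetical helper: only the URL pattern and the id types come from the post.
BASE_URL = "http://catalog.hathitrust.org/api/volumes"
VALID_ID_TYPES = {"issn", "isbn", "oclc", "lccn", "htid", "umid"}


def single_id_url(id_type, value, output_type="json"):
    """Build the short-form URL: .../volumes/(idtype)/id.(outputtype)."""
    if id_type not in VALID_ID_TYPES:
        raise ValueError("unknown id type: " + id_type)
    if output_type != "json":
        raise ValueError("json is the only output type currently offered")
    return "{}/{}/{}.{}".format(BASE_URL, id_type, value, output_type)


print(single_id_url("oclc", "15420548"))
```

<p>Called with the OCLC number from the example above, this produces the same URL shown at the top of this section.</p>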

<h2>More complex, multi-valued call</h2>

<p>The full <acronym title="Application Programming Interface">API</acronym> <acronym title="Uniform Resource Locator">URL</acronym> looks like this:</p>

<p><a href="http://catalog.hathitrust.org/api/volumes/json/id:1;oclc:45678;lccn:70628581|id:2;isbn:1591581613">http://catalog.hathitrust.org/api/volumes/json/id:1;oclc:45678;lccn:70628581|id:2;isbn:1591581613</a></p>

<p>This is a request for data on two separate items, identified on the calling end as simply &#8216;1&#8217; and &#8216;2&#8217; (id:1 and id:2). The first item
is searched for using both an oclc number and an lccn; the second supplies only an isbn.</p>

<p>Note that</p>

<ul>
<li>The output format (json) has moved to appear right after the &#8216;/volumes/&#8217;</li>
<li>There&#8217;s an arbitrary &#8216;id&#8217; field. This will be used to index the return values, so use something meaningful on your end.</li>
<li>keys and values are separated by colons. Key-Value pairs are separated by semi-colons.</li>
<li>Separate requests are separated by &#8216;|&#8217; in the <acronym title="Uniform Resource Locator">URL</acronym>, allowing you to request data for an arbitrary number of items with a single call.</li>
<li>Return values are keyed on the &#8216;id&#8217; you supply for each request.</li>
<li>Matches follow the &#8220;#3&#8221; option on <a href="http://robotlibrarian.billdueber.com/thinking-through-a-simple-api-for-hathitrust-item-metadata/">the old post</a>, the
&#8220;Must match if present&#8221; option &#8212; basically, for each type of identifier you supply, a record that has an identifier of that type must match the value you passed. </li>
</ul>
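
<p>The separators described above (colons, semi-colons, pipes) can be sketched in Python like so. The function name <em>multi_id_url</em> and the shape of its argument are assumptions for illustration; the resulting URL is the one from the example.</p>

```python
# Hypothetical helper: the separators and URL layout come from the post,
# the function name and argument shape do not.
BASE_URL = "http://catalog.hathitrust.org/api/volumes"


def multi_id_url(requests, output_type="json"):
    """requests: list of (caller_id, [(id_type, value), ...]) tuples."""
    specs = []
    for caller_id, identifiers in requests:
        # The arbitrary caller-side id goes first, then each key:value pair.
        pairs = ["id:{}".format(caller_id)]
        pairs.extend("{}:{}".format(k, v) for k, v in identifiers)
        specs.append(";".join(pairs))      # pairs are joined with semi-colons
    # Separate requests are joined with pipes.
    return "{}/{}/{}".format(BASE_URL, output_type, "|".join(specs))


url = multi_id_url([
    ("1", [("oclc", "45678"), ("lccn", "70628581")]),
    ("2", [("isbn", "1591581613")]),
])
print(url)
```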

<p>So, in the example, the first request has both an oclc number and an lccn. Matches are as follows:</p>

<ul>
<li>If a record has an oclc number but no lccn, its oclc number must match the passed oclc number.</li>
<li>If a record has an lccn but no oclc number, its lccn must match the passed lccn value.</li>
<li>If a record has both an lccn and an oclc number, both its identifiers must match the passed values.</li>
</ul>
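
<p>Those three rules can be written down as a small predicate. This is only a sketch of the &#8220;must match if present&#8221; idea, not the code the API runs: the dictionary shapes are invented, both sides are assumed to be normalized already, and rejecting a record that shares no identifier type with the query is my assumption.</p>

```python
def record_matches(query, record):
    """query: {id_type: value}; record: {id_type: [values]}.

    Hypothetical data shapes; assumes both sides are normalized the same way.
    """
    shared = 0
    for id_type, wanted in query.items():
        present = record.get(id_type, [])
        if not present:
            continue          # record lacks this identifier type: no constraint
        if wanted not in present:
            return False      # type present on both sides but values differ
        shared += 1
    # Assumption: at least one identifier has to actually match.
    return shared > 0


query = {"oclc": "45678", "lccn": "70628581"}
print(record_matches(query, {"oclc": ["45678"]}))
print(record_matches(query, {"oclc": ["45678"], "lccn": ["99999999"]}))
```

<p>The first call is the &#8220;oclc number but no lccn&#8221; case; the second is a record that has both identifiers but disagrees on the lccn, so it is rejected.</p>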

<p>The returned structure is keyed on the arbitrary id passed in the search string (if not present, the whole search string will be used instead):</p>

<div class="geshi no javascript"><ol><li class="li1"><div class="de1">&nbsp; <span class="br0">&#123;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; <span class="st0">&quot;1&quot;</span>:</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; <span class="br0">&#123;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;records&quot;</span>:</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="br0">&#123;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;001474331&quot;</span>:</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="br0">&#123;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;recordURL&quot;</span>:<span class="st0">&quot;http://catalog.hathitrust.org/Record/001474331&quot;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;titles&quot;</span>:</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="br0">&#91;</span><span class="st0">&quot;Some aspects of seventeenth-century medicine &amp;amp; science; papers read at a Clark Library seminar, October 12, 1968&quot;</span><span class="br0">&#93;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;isbns&quot;</span>: <span class="br0">&#91;</span><span class="br0">&#93;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;issns&quot;</span>: <span class="br0">&#91;</span><span class="br0">&#93;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;oclcs&quot;</span>: <span class="br0">&#91;</span><span class="st0">&quot;00045678&quot;</span><span class="br0">&#93;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;lccns&quot;</span>: <span class="br0">&#91;</span><span class="st0">&quot;70628581 //r86&quot;</span><span class="br0">&#93;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="br0">&#125;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="br0">&#125;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;items&quot;</span>:</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="br0">&#91;</span><span class="br0">&#123;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;orig&quot;</span>:<span class="st0">&quot;University of Michigan&quot;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;fromRecord&quot;</span>:<span class="st0">&quot;001474331&quot;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;htid&quot;</span>:<span class="st0">&quot;mdp.39015004074095&quot;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;itemURL&quot;</span>:<span class="st0">&quot;http://hdl.handle.net/2027/mdp.39015004074095&quot;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;rightsCode&quot;</span>:<span class="st0">&quot;ic&quot;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;lastUpdate&quot;</span>:<span class="st0">&quot;20090713&quot;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;enumcron&quot;</span>:<span class="kw2">false</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="br0">&#125;</span><span class="br0">&#93;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; <span class="br0">&#125;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; <span class="st0">&quot;2&quot;</span>:</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; <span class="br0">&#123;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;records&quot;</span>:</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="br0">&#123;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;004370624&quot;</span>:</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="br0">&#123;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;recordURL&quot;</span>:<span class="st0">&quot;http://catalog.hathitrust.org/Record/004370624&quot;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;titles&quot;</span>:</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="br0">&#91;</span><span class="st0">&quot;ARBA in-depth. Philosophy and religion /&quot;</span><span class="br0">&#93;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;isbns&quot;</span>:</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="br0">&#91;</span><span class="st0">&quot;1591581613&quot;</span><span class="br0">&#93;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;issns&quot;</span>: <span class="br0">&#91;</span><span class="br0">&#93;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;oclcs&quot;</span>: <span class="br0">&#91;</span><span class="st0">&quot;53462174&quot;</span><span class="br0">&#93;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;lccns&quot;</span>: <span class="br0">&#91;</span><span class="st0">&quot;2003065945&quot;</span><span class="br0">&#93;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="br0">&#125;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="br0">&#125;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;items&quot;</span>:</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="br0">&#91;</span><span class="br0">&#123;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;orig&quot;</span>:<span class="st0">&quot;University of Michigan&quot;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;fromRecord&quot;</span>:<span class="st0">&quot;004370624&quot;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;htid&quot;</span>:<span class="st0">&quot;mdp.39015058261911&quot;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;itemURL&quot;</span>:<span class="st0">&quot;http://hdl.handle.net/2027/mdp.39015058261911&quot;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;rightsCode&quot;</span>:<span class="st0">&quot;ic&quot;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;lastUpdate&quot;</span>:<span class="st0">&quot;20090907&quot;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;enumcron&quot;</span>:<span class="kw2">false</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;<span class="br0">&#125;</span><span class="br0">&#93;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; <span class="br0">&#125;</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="br0">&#125;</span></div></li></ol></div>

<h2>Enumeration / Chronology</h2>

<p>An effort is made to return items in &#8220;enumcron order&#8221; &#8212; hopefully, with earlier volumes showing up before later volumes.
The full enumcron is listed in the items if you need to try something different.</p>

<h2>JSONP Support</h2>

<p><a href="http://niryariv.wordpress.com/2009/05/05/jsonp-quickly/">JSONP</a> output is supported &#8212; just throw a &#8216;&amp;callback=blahblahblah&#8217; on the end of the <acronym title="Uniform Resource Locator">URL</acronym> you call and you&#8217;ll get the JSON back wrapped in a call to that function.</p>
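
<p>JSONP conventionally wraps the JSON payload in a call to the function you name. The exact text this API sends back isn&#8217;t shown here, so the wrapper shape in this sketch (callback name, parentheses, optional trailing semi-colon) is an assumption, as is the helper name <em>unwrap_jsonp</em>. It might be handy if you want to check a JSONP response outside a browser.</p>

```python
import json


def unwrap_jsonp(text, callback):
    """Strip an assumed 'callback( ... )' wrapper and parse the JSON inside."""
    body = text.strip()
    if body.endswith(";"):
        body = body[:-1].rstrip()   # tolerate an optional trailing semi-colon
    prefix = callback + "("
    if not (body.startswith(prefix) and body.endswith(")")):
        raise ValueError("not a JSONP response for " + callback)
    return json.loads(body[len(prefix):-1])


# Made-up sample payload, not real API output.
sample = 'myfunc({"records": {}, "items": []});'
print(unwrap_jsonp(sample, "myfunc"))
```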

<p>Some examples:</p>

<p><a href="http://catalog.hathitrust.org/api/volumes/oclc/15420548.json&amp;callback=myfunc">http://catalog.hathitrust.org/api/volumes/oclc/15420548.json&amp;callback=myfunc</a></p>

<p><a href="http://catalog.hathitrust.org/api/volumes/json/id:1;oclc:45678;lccn:70628581|id:2;isbn:1591581613&amp;callback=myfunc">http://catalog.hathitrust.org/api/volumes/json/id:1;oclc:45678;lccn:70628581|id:2;isbn:1591581613&amp;callback=myfunc</a></p>]]></content:encoded>
			<wfw:commentRss>http://robotlibrarian.billdueber.com/beta-version-of-the-hathitrust-volumes-api-available/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Running Blacklight under JRuby</title>
		<link>http://robotlibrarian.billdueber.com/running-blacklight-under-jruby/</link>
		<comments>http://robotlibrarian.billdueber.com/running-blacklight-under-jruby/#comments</comments>
		<pubDate>Wed, 18 Nov 2009 03:35:30 +0000</pubDate>
		<dc:creator>Bill</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://robotlibrarian.billdueber.com/?p=170</guid>
		<description><![CDATA[I decided to see if I could get Blacklight working under JRuby, starting with running the test suite and working my way up from there.

There was much pain. Much, much pain. Exacerbated by my almost complete
lack of knowledge about what I was doing.

This is the procedure I eventually arrived at &#8212; if there are places [...]]]></description>
			<content:encoded><![CDATA[<p>I decided to see if I could get <a href="http://projectblacklight.org/">Blacklight</a> working under <a href="http://jruby.org/">JRuby</a>, starting with running the test suite and working my way up from there.</p>

<p>There was much pain. Much, much pain. Exacerbated by my almost complete
lack of knowledge about what I was doing.</p>

<p>This is the procedure I eventually arrived at &#8212; if there are places where I made trouble for myself, please let me know!</p>

<p>[And does anyone know how to get jruby's nokogiri to link to a different
libxml and stop with the crappy libxml2-version error message every time I 
run it under OSX???]</p>

<h2>Download jruby</h2>

<p>Go to <a href="http://jruby.org/">jruby.org</a> and download a binary distribution. Extract the tar.gz (or zip or whatever).</p>

<p>I&#8217;ll put mine in ~/jruby. Or, at least that&#8217;s what I&#8217;ll tell you.</p>

<pre><code>tar xzf jruby-1.4.tar.gz
</code></pre>

<p>To avoid confusion, let&#8217;s make <em>jrake</em> an alias for <em>rake</em> and add the jruby bin directory to the path.</p>

<pre><code>cd ~/jruby/bin
ln -s rake jrake
export PATH=`pwd`:$PATH
</code></pre>

<h2>Download Blacklight</h2>

<pre><code>git clone git://github.com/projectblacklight/blacklight.git
</code></pre>

<p>Again, we&#8217;ll say that I put this in <em>~/blacklight/</em></p>

<h2>Muck with Blacklight dependencies</h2>

<p>Edit the file <em>init.rb</em> to comment out references to <em>libxml</em> and <em>ruby-xslt</em>, 
as well as <em>nokogiri</em>. My understanding is that the first two are used, at this point, only for the EAD stuff. Both rely on <em>libxml2</em> which is a C-extension and hence unavailable to JRuby.</p>

<p>Nokogiri gets pulled in during other installs and for some reason jrake will complain later on that it&#8217;s got a wrong version or something. So, we&#8217;ll just work without that particular net for now.</p>

<pre><code>#### File ~/blacklight/init.rb
# config.gem 'libxml-ruby', :lib=&gt;'libxml', :version=&gt;'1.1.3'
# config.gem 'ruby-xslt', :lib=&gt;'xml/xslt', :version=&gt;'0.9.6'
# config.gem 'nokogiri', :version=&gt;'1.3.3'
</code></pre>

<h2>Do some initial installs</h2>

<pre><code>jgem install -v=2.3.4 rails 
jgem install activerecord-jdbc-adapter jdbc-sqlite3 \
             activerecord-jdbcsqlite3-adapter ActiveRecord-JDBC 
jgem install rcov -s http://gemcutter.org --no-rdoc --no-ri
jrake
jrake gems:install
</code></pre>

<h2>Edit the <em>config/database.yml</em> file</h2>

<p>&#8230;to change the adapter to <em>jdbcsqlite3</em> for development and testing.</p>

<h2>Edit the <em>databases.rake</em> file</h2>

<p>This one was harder to track down. The default rake task has hard-coded database names in the .rake file &#8212; jdbcsqlite3 isn&#8217;t included. I keep seeing things saying, &#8220;Oh, yeah, that&#8217;s been fixed&#8230;&#8221; but, well, it wasn&#8217;t for me. I had to do it by hand.</p>

<pre><code>edit ~/jruby/lib/ruby/gems/1.8/gems/rails-2.3.4/lib/tasks/databases.rake
</code></pre>

<p>You need to find everywhere there&#8217;s a</p>

<pre><code>when "sqlite", "sqlite3" # or when /^sqlite/ in one case
</code></pre>

<p>&#8230;and change it to</p>

<pre><code>when "sqlite", "sqlite3", "jdbcsqlite3"
</code></pre>

<p>Repeat for other databases you want to use (e.g., mysql). For the moment, since I&#8217;m only worried about running <em>jrake spec</em>, that&#8217;s all I&#8217;m gonna do.</p>

<h2>Try again</h2>

<pre><code>jrake
  Missing these required gems:
   mislav-hanna  = 0.1.11
</code></pre>

<p>OK. Not sure why that didn&#8217;t come in before. Go ahead and add it.</p>

<pre><code>jgem install  mislav-hanna
</code></pre>

<h2>Migrate the databases</h2>

<pre><code>jrake
</code></pre>

<p>The databases should migrate, and then it&#8217;ll poop out because <acronym title="Solr isn\'t an acronym, silly!">Solr</acronym> didn&#8217;t start.</p>

<h2>Fire up solr</h2>

<p>Since we&#8217;re running jruby, shelling out from the rake task to start <acronym title="Solr isn\'t an acronym, silly!">Solr</acronym> doesn&#8217;t work. You&#8217;ll have to fire up your test solr instance by hand.</p>

<pre><code>cd ~/blacklight/jetty
java -Djetty.port=8888 -jar start.jar 2&gt;log.jetty
</code></pre>

<h2>Try it again!</h2>

<pre><code>cd ~/blacklight
jrake spec

   ................................................................
   ................................................................
   ....F............................................................
   1)
   'ApplicationHelper Export EndNote should render the correct 
   EndNote text file' FAILED
   expected: "%0 Format\n%E Greer, Lowell. \n%E Lubin, Steven. \n%E Chase, Stephanie, \n%E Brahms, Johannes, \n%E Beethoven, Ludwig van, \n%E Krufft, Nikolaus von, \n%T Music for horn \n%I Harmonia Mundi USA, \n%C [United States] : \n%D p2001. \n",
  got: "%0 Format\n%C [United States] : \n%D p2001. \n%E Greer, Lowell. \n%E Lubin, Steven. \n%E Chase, Stephanie, \n%E Brahms, Johannes, \n%E Beethoven, Ludwig van, \n%E Krufft, Nikolaus von, \n%I Harmonia Mundi USA, \n%T Music for horn \n" (using ##)
./spec/helpers/application_helper_spec.rb:128:

Finished in 15.519 seconds
193 examples, 1 failure
</code></pre>

<p>I can live with that for the moment. Anyone know why that spec fails?</p>
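
<p>One thing worth checking: the expected and actual strings seem to contain the same lines, just in a different order. Here is a quick Python sketch to confirm that, using the two strings copied from the failure above. The helper name <em>same_lines</em> is made up. The actual output comes out sorted by tag, which hints that the helper builds the text from a hash whose iteration order differs under JRuby, though that&#8217;s only a guess.</p>

```python
from collections import Counter

# Both strings are copied from the spec failure output above.
expected = (
    "%0 Format\n%E Greer, Lowell. \n%E Lubin, Steven. \n"
    "%E Chase, Stephanie, \n%E Brahms, Johannes, \n"
    "%E Beethoven, Ludwig van, \n%E Krufft, Nikolaus von, \n"
    "%T Music for horn \n%I Harmonia Mundi USA, \n"
    "%C [United States] : \n%D p2001. \n"
)
got = (
    "%0 Format\n%C [United States] : \n%D p2001. \n"
    "%E Greer, Lowell. \n%E Lubin, Steven. \n%E Chase, Stephanie, \n"
    "%E Brahms, Johannes, \n%E Beethoven, Ludwig van, \n"
    "%E Krufft, Nikolaus von, \n%I Harmonia Mundi USA, \n"
    "%T Music for horn \n"
)


def same_lines(a, b):
    """True if a and b hold the same lines, ignoring their order."""
    return Counter(a.splitlines()) == Counter(b.splitlines())


# Tags of every line after the leading "%0 Format" line.
tags = [line[:2] for line in got.splitlines()[1:]]

print("identical strings:", expected == got)
print("same lines, any order:", same_lines(expected, got))
print("actual output sorted by tag:", tags == sorted(tags))
```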

<h2>Great! How about the features?</h2>

<pre><code>jrake features
  (much output)

  59 scenarios (59 passed)
  434 steps (434 passed)
  0m51.186s
</code></pre>

<h2>And so&#8230;</h2>

<p>&#8230;it appears that, at least on the surface, jruby is a viable platform for Blacklight so long as I don&#8217;t actually need any of the <em>libxml</em> stuff. In the next couple days I&#8217;ll try and actually get it all up and running and see if I can break it.</p>]]></content:encoded>
			<wfw:commentRss>http://robotlibrarian.billdueber.com/running-blacklight-under-jruby/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Setting up your OPAC for Zotero support using unAPI</title>
		<link>http://robotlibrarian.billdueber.com/setting-up-your-opac-for-zotero-support-using-unapi/</link>
		<comments>http://robotlibrarian.billdueber.com/setting-up-your-opac-for-zotero-support-using-unapi/#comments</comments>
		<pubDate>Fri, 06 Nov 2009 20:43:05 +0000</pubDate>
		<dc:creator>Bill</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://robotlibrarian.billdueber.com/?p=167</guid>
		<description><![CDATA[unAPI is a very simple protocol to let a machine know what other formats a document is available in. Zotero is a bibliographic management tool (like Endnote or Refworks) that operates as a Firefox plugin. And it speaks unAPI.

Let&#8217;s get them to play nice with each other!

How&#8217;s it all work?


Zotero looks for a well-constructed &#60;link&#62; [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://unapi.info">unAPI</a> is a very simple protocol to let a machine know what other formats a document is available in. <a href="http://zotero.org">Zotero</a> is a bibliographic management tool (like EndNote or RefWorks) that operates as a Firefox plugin. And it speaks unAPI.</p>

<p>Let&#8217;s get them to play nice with each other!</p>

<h2>How&#8217;s it all work?</h2>

<ol>
<li>Zotero looks for a well-constructed &lt;link&gt; tag in the head of the page</li>
<li>It checks the document on the other side of that link to see what formats are offered, and picks one to use. No, you can&#8217;t decide which one it uses. <em>It</em> picks.</li>
<li>Zotero then looks for IDs in the body of the page</li>
<li>If both are found and everything seems kosher, Zotero will offer the option to import some or all of the records. </li>
</ol>

<h2>What you&#8217;ll need</h2>

<ol>
<li>An <acronym title="Online Public Access Catalog">OPAC</acronym> whose output you can futz with</li>
<li>Access to an individual record&#8217;s ID in that output</li>
<li>A <acronym title="Uniform Resource Locator">URL</acronym> based on the ID that gives an RIS representation of the records</li>
<li>A screwdriver. Made with decent &#8212; but not too expensive &#8212; vodka and fresh orange juice.</li>
</ol>

<h2>Yes. I&#8217;m cheating.</h2>

<p>I have all those things already. Hence, this is easy for me. If you had to, say, write some sort of weird redirection script because IDs are not first-class citizens in your <acronym title="Online Public Access Catalog">OPAC</acronym>&#8217;s <acronym title="Uniform Resource Locator">URL</acronym> scheme, or write an RIS export tool by hand, well, this will take you a bit longer.</p>

<h2>The process</h2>

<h3>1. Build an unAPI target script</h3>

<p>You need a script that&#8217;ll do three things:</p>

<ol>
<li>With no arguments, return a list of available formats <em>in general</em></li>
<li>With one argument, id=&lt;ID&gt;, return a list of formats available <em>for that item</em>. This will likely be exactly the same as #1.</li>
<li>With two arguments, id=&lt;ID&gt; &amp; format=&lt;FORMAT&gt;, return the record identified by &lt;ID&gt; in format &lt;FORMAT&gt;</li>
</ol>

<p>Mine looks like this:</p>

<div class="geshi no php"><ol><li class="li1"><div class="de1">&nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; <span class="co1">// id is of the form urn:bibnum:000000000</span></div></li>
<li class="li1"><div class="de1">&nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; <span class="re1">$id</span> <span class="sy0">=</span> <span class="kw3">isset</span><span class="br0">&#40;</span><span class="re1">$_REQUEST</span><span class="br0">&#91;</span><span class="st0">&#39;id&#39;</span><span class="br0">&#93;</span><span class="br0">&#41;</span>? <span class="re1">$_REQUEST</span><span class="br0">&#91;</span><span class="st0">&#39;id&#39;</span><span class="br0">&#93;</span> <span class="sy0">:</span> <span class="kw2">false</span><span class="sy0">;</span></div></li>
<li class="li1"><div class="de1">&nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; <span class="co1">// Format, at this point, had better be &#39;ris&#39;</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="re1">$format</span> <span class="sy0">=</span> <span class="kw3">isset</span><span class="br0">&#40;</span><span class="re1">$_REQUEST</span><span class="br0">&#91;</span><span class="st0">&#39;format&#39;</span><span class="br0">&#93;</span><span class="br0">&#41;</span>? <span class="re1">$_REQUEST</span><span class="br0">&#91;</span><span class="st0">&#39;format&#39;</span><span class="br0">&#93;</span> <span class="sy0">:</span> <span class="kw2">false</span><span class="sy0">;</span></div></li>
<li class="li1"><div class="de1">&nbsp;</div></li>
<li class="li1"><div class="de1">&nbsp; <span class="co1">// Got neither? Return the general list</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="kw1">if</span> <span class="br0">&#40;</span><span class="sy0">!</span><span class="br0">&#40;</span><span class="re1">$id</span> <span class="sy0">||</span> <span class="re1">$format</span><span class="br0">&#41;</span><span class="br0">&#41;</span> <span class="br0">&#123;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; <span class="kw3">header</span><span class="br0">&#40;</span><span class="st0">&#39;Content-type: application/xml&#39;</span><span class="br0">&#41;</span><span class="sy0">;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; <span class="kw3">echo</span> <span class="st0">&#39;&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?&gt;</span></div></li>
<li class="li1"><div class="de1"><span class="st0"> &nbsp; &nbsp;&lt;formats&gt;</span></div></li>
<li class="li1"><div class="de1"><span class="st0"> &nbsp; &nbsp; &nbsp;&lt;format name=&quot;ris&quot; </span></div></li>
<li class="li1"><div class="de1"><span class="st0"> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;type=&quot;application/x-Research-Info-Systems&quot; </span></div></li>
<li class="li1"><div class="de1"><span class="st0"> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;docs=&quot;http://www.refman.com/support/risformat_intro.asp&quot;/&gt;</span></div></li>
<li class="li1"><div class="de1"><span class="st0"> &nbsp; &nbsp;&lt;/formats&gt;</span></div></li>
<li class="li1"><div class="de1"><span class="st0"> &nbsp; &nbsp;&#39;</span><span class="sy0">;</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="kw3">exit</span><span class="sy0">;</span> &nbsp;</div></li>
<li class="li1"><div class="de1">&nbsp; <span class="br0">&#125;</span></div></li>
<li class="li1"><div class="de1">&nbsp;</div></li>
<li class="li1"><div class="de1">&nbsp;</div></li>
<li class="li1"><div class="de1">&nbsp; <span class="co1">// Got just the id? Return formats for that ID</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="kw1">if</span> <span class="br0">&#40;</span><span class="re1">$id</span> <span class="sy0">&amp;&amp;</span> <span class="sy0">!</span><span class="re1">$format</span><span class="br0">&#41;</span> <span class="br0">&#123;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; <span class="kw3">header</span><span class="br0">&#40;</span><span class="st0">&#39;Content-type: application/xml&#39;</span><span class="br0">&#41;</span><span class="sy0">;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; <span class="kw3">echo</span> <span class="st0">&#39;&lt;?xml version=&quot;1.0&quot; encoding=&quot;UTF-8&quot;?&gt;</span></div></li>
<li class="li1"><div class="de1"><span class="st0"> &nbsp; &nbsp;&lt;formats id=&quot;&#39;</span> <span class="sy0">.</span> <span class="re1">$id</span> <span class="sy0">.</span> <span class="st0">&#39;&quot;&gt;</span></div></li>
<li class="li1"><div class="de1"><span class="st0"> &nbsp; &nbsp; &nbsp;&lt;format name=&quot;ris&quot; </span></div></li>
<li class="li1"><div class="de1"><span class="st0"> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;type=&quot;application/x-Research-Info-Systems&quot; </span></div></li>
<li class="li1"><div class="de1"><span class="st0"> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;docs=&quot;http://www.refman.com/support/risformat_intro.asp&quot;/&gt;</span></div></li>
<li class="li1"><div class="de1"><span class="st0"> &nbsp; &nbsp;&lt;/formats&gt;</span></div></li>
<li class="li1"><div class="de1"><span class="st0"> &nbsp; &nbsp;&#39;</span><span class="sy0">;</span> &nbsp;</div></li>
<li class="li1"><div class="de1">&nbsp; <span class="kw3">exit</span><span class="sy0">;</span> &nbsp;</div></li>
<li class="li1"><div class="de1">&nbsp; <span class="br0">&#125;</span></div></li>
<li class="li1"><div class="de1">&nbsp;</div></li>
<li class="li1"><div class="de1">&nbsp;</div></li>
<li class="li1"><div class="de1">&nbsp; <span class="co1">// Otherwise&#8230;</span></div></li>
<li class="li1"><div class="de1">&nbsp;</div></li>
<li class="li1"><div class="de1">&nbsp; <span class="co1">// Parse out the actual numeric part of the id from the urn:&lt;typeOfNumber&gt; prefix</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="kw3">preg_match</span><span class="br0">&#40;</span><span class="st0">&#39;/^urn:bibnum:(.*)$/&#39;</span><span class="sy0">,</span> <span class="re1">$id</span><span class="sy0">,</span> <span class="re1">$match</span><span class="br0">&#41;</span><span class="sy0">;</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="re1">$actualID</span> <span class="sy0">=</span> <span class="re1">$match</span><span class="br0">&#91;</span><span class="nu0">1</span><span class="br0">&#93;</span><span class="sy0">;</span></div></li>
<li class="li1"><div class="de1">&nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; <span class="co1">// Again: format had better be &#39;ris&#39; because that&#39;s all I&#39;m supporting at this point.</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="kw3">header</span><span class="br0">&#40;</span><span class="st0">&quot;Location: /Search/SearchExport?id=$actualID&amp;method=$format&quot;</span><span class="sy0">,</span> <span class="kw2">true</span><span class="sy0">,</span> <span class="nu0">302</span><span class="br0">&#41;</span><span class="sy0">;</span></div></li></ol></div>

<p>You can see that a &lt;format&gt; is just a name, a mime-type, and an optional reference to documentation on the type.</p>

<p>I take advantage of my existing RIS export process in the redirect, at the bottom. I also built in the possibility that other types of numbers could come in &#8212; I&#8217;m hard-coding &#8216;bibnum&#8217; for the moment, but could allow, say, &#8220;oclc&#8221; or &#8220;isbn&#8221; or whatnot, too.</p>
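
<p>If you did want to accept other kinds of numbers, the parsing step might look something like the sketch below. It&#8217;s only a sketch: the function name <em>parse_unapi_id</em> and the list of allowed types are made up for illustration, and actually looking up a record by &#8220;oclc&#8221; or &#8220;isbn&#8221; is left out entirely.</p>

<pre><code>&lt;?php
// Hypothetical helper (not part of the script above): split an id of the
// form "urn:&lt;typeOfNumber&gt;:&lt;value&gt;" into its two parts.
// Returns array(type, value), or null if the id is not one we recognize.
function parse_unapi_id($id, $allowed = array('bibnum', 'oclc', 'isbn')) {
  if (!is_string($id)) {
    return null;
  }
  // Letters only for the type; digits, letters and hyphens for the value
  if (!preg_match('/^urn:([a-z]+):([0-9A-Za-z-]+)$/', $id, $match)) {
    return null;
  }
  // Refuse number types we have not been told to support
  if (!in_array($match[1], $allowed, true)) {
    return null;
  }
  return array($match[1], $match[2]);
}

$parsed = parse_unapi_id('urn:bibnum:000000002');
// $parsed is array('bibnum', '000000002')
</code></pre>

<p>Anything that comes back null never reaches the redirect, so the script could answer with a 404 instead of sending a half-built <acronym title="Uniform Resource Locator">URL</acronym> to the export code.</p>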

<h3>2. Tell your <acronym title="Online Public Access Catalog">OPAC</acronym> where the script lives</h3>

<p>You&#8217;ll need a line in the &lt;head&gt; section of all your pages that might have an ID on them:</p>

<pre><code>&lt;link rel="unapi-server" type="application/xml" title="unAPI" href="/unapi"&gt;
</code></pre>

<p>Everything should be left alone except for the actual <em>href</em>.</p>

<h3>3. Add your IDs to the <acronym title="HyperText Markup Language">HTML</acronym></h3>

<p>In the <acronym title="HyperText Markup Language">HTML</acronym> of your page, you can add one or more tags of the form:</p>

<pre><code>&lt;abbr class="unapi-id" title="urn:bibnum:000000002"&gt;&lt;/abbr&gt;
</code></pre>

<p>(where the <em>title</em> of the &lt;abbr&gt; conforms to what you&#8217;re expecting in your script).</p>

<p>You can put stuff inside the &lt;abbr&gt; but you need not. On a single-record page, you should have (I would think) only one of these things. On a search results page, you may decide to not have any, or you may decide to have one for each search result.</p>
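
<p>To see how the pieces fit together, here&#8217;s a rough sketch of the three requests a client can build from the &lt;link&gt; and an &lt;abbr&gt;, one for each branch of the script in step 1. The helper name <em>unapi_url</em> is invented for this example; the <em>id</em> and <em>format</em> parameter names and the /unapi path are the ones used above.</p>

<pre><code>&lt;?php
// Hypothetical helper: build the URL a client would request from the script.
// $server is the href from the &lt;link rel="unapi-server"&gt; tag;
// $id is the title of an &lt;abbr class="unapi-id"&gt;.
function unapi_url($server, $id = null, $format = null) {
  $params = array();
  if ($id !== null)     { $params['id'] = $id; }
  if ($format !== null) { $params['format'] = $format; }
  if (!$params) {
    return $server;   // no arguments: ask for the general list of formats
  }
  // http_build_query url-encodes the values for us
  return $server . '?' . http_build_query($params, '', '&amp;');
}

echo unapi_url('/unapi'), "\n";                                 // general list
echo unapi_url('/unapi', 'urn:bibnum:000000002'), "\n";         // formats for one id
echo unapi_url('/unapi', 'urn:bibnum:000000002', 'ris'), "\n";  // the record itself
</code></pre>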

<h3>4. Final step</h3>

<p>Drink your screwdriver.</p>

<h2>Where can I see it?</h2>

<p>Well&#8230;here&#8217;s the thing.</p>

<p>You can take a look at my test instance, <a href="http://dueberb.vufind.lib.umich.edu/">http://dueberb.vufind.lib.umich.edu/</a> and play there. You can <em>not</em> see it in production, because there&#8217;s a little problem.</p>

<p>Our old <acronym title="Online Public Access Catalog">OPAC</acronym> &#8212; now dubbed <a href="http://mirlyn-classic.lib.umich.edu">mirlyn-classic</a> &#8212; had a custom translator written for it. And it worked fine, and that was great.</p>

<p>But now we&#8217;ve got this new software running at mirlyn.lib.umich.edu, and Zotero keeps on using the old translator no matter what you do. The only way to override it is to actually fire up sqlite3 and remove the conflicting entry from the Zotero translators table. And then never update that table again.</p>

<p>I&#8217;ve asked around about getting it fixed (changing the target <acronym title="Uniform Resource Locator">URL</acronym> for the old translator to point at mirlyn-classic) but it&#8217;s Friday, and no one is around. Hopefully soon.</p>]]></content:encoded>
			<wfw:commentRss>http://robotlibrarian.billdueber.com/setting-up-your-opac-for-zotero-support-using-unapi/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>

