<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Robot Librarian &#187; Uncategorized</title>
	<atom:link href="http://robotlibrarian.billdueber.com/category/uncategorized/feed/" rel="self" type="application/rss+xml" />
	<link>http://robotlibrarian.billdueber.com</link>
	<description>Disclaimer: I'm not actually a robot.</description>
	<lastBuildDate>Fri, 23 Apr 2010 15:19:55 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Why RDA is doomed to failure</title>
		<link>http://robotlibrarian.billdueber.com/why-rda-is-doomed-to-failure/</link>
		<comments>http://robotlibrarian.billdueber.com/why-rda-is-doomed-to-failure/#comments</comments>
		<pubDate>Fri, 23 Apr 2010 14:20:51 +0000</pubDate>
		<dc:creator>Bill</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://robotlibrarian.billdueber.com/why-rda-is-doomed-to-failure/</guid>
		<description><![CDATA[[Note: edited for clarity thanks to rsinger's comment, below]

Doomed, I say! DOOOOOOOOOOMMMMMMMED!

My reasoning is simple: RDA will fail because it&#8217;s not &#8220;better enough.&#8221;

Now, those of you who know me might be saying to yourselves, &#8220;Waitjustaminute. Bill doesn&#8217;t know anything at all about cataloging, or semantic representations, or the relative merits of various encapsulations of bibliographic [...]]]></description>
			<content:encoded><![CDATA[<p>[Note: edited for clarity thanks to rsinger's comment, below]</p>

<p>Doomed, I say! DOOOOOOOOOOMMMMMMMED!</p>

<p>My reasoning is simple: <acronym title="Resource Description and Access">RDA</acronym> will fail because it&#8217;s not &#8220;better enough.&#8221;</p>

<p>Now, those of you who know me might be saying to yourselves, &#8220;Waitjustaminute. Bill doesn&#8217;t know anything at all about cataloging, or semantic representations, or the relative merits of various encapsulations of bibliographic metadata. I mean, sure, he knows a lot about&#8230;err&#8230;.hmmm&#8230;well, in any case, he&#8217;s <em>definitely</em> talking out of his ass on this one.&#8221;</p>

<p>First off, thanks for having such a long-winded internal monologue about me; it&#8217;s good to be thought of.</p>

<p>And, of course, you&#8217;re right on all counts. I don&#8217;t know what I&#8217;m talking about in any of those realms.</p>

<p>And yet I&#8217;m still willing to make a strong statement?</p>

<p>Yes. I am. Here&#8217;s why.</p>

<p>[Oh, and if you're convinced I'm wrong -- please say so. I'd <em>love</em> to be wrong about this.]</p>

<h2>First, an assertion</h2>

<p>The purpose of any bibliographic metadata is to facilitate three things:</p>

<ul>
<li><strong>Description/Identification</strong>. If you know what you want, does the metadata give you enough information to determine if the described item is what you want? Alternately, if you&#8217;re holding an item (or an alternate metadata representation of it), can you find the record that describes it? </li>
<li><strong>Machine finding</strong>. Can a machine, given a good-enough query, find a work via a search of the metadata?</li>
<li><strong>Machine grouping</strong>. Given the metadata, can a machine help a person find items &#8220;like this one&#8221;? </li>
</ul>

<p>Take issue with one or more of those statements. I don&#8217;t care. The point I&#8217;m really trying to make is that any standard that doesn&#8217;t put <em>unmediated machine reasoning</em> at the forefront of what the metadata needs to support is living in a deep, deep hole.</p>

<p>Computer cycles are pretty cheap, and programmers are pretty smart. We can figure out how to do useful things with virtually any data, but only if we can reliably get at those data.</p>

<h2>Getting 75% of the way there</h2>

<p>Three-fourths of the problem can be addressed with one simple concept.</p>

<p><strong>A solid equality relationship</strong>.</p>

<p>By this I mean that &#8220;=&#8221; had better damn well mean &#8220;equal,&#8221; as opposed to &#8220;probably the same, but there might be other representations, too.&#8221; If I want to say &#8220;A = B&#8221; (where A and B are authors, or works, or subjects, or anything that can be nailed down) there&#8217;d better be no false positives and no false negatives. Ever. <acronym title="MAchine Readable Cataloging">MARC</acronym>&#8217;s use of &#8220;hopefully-unique strings&#8221; is ridiculously insufficient in the modern era.</p>

<p><acronym title="Resource Description and Access">RDA</acronym> does pretty well with this, with URIs for appropriate concepts, so that&#8217;s good.</p>

<h2>What&#8217;s wrong with it?</h2>

<p>Well, it&#8217;s gonna cost money to access the spec, for starters. That&#8217;s just dumb.</p>

<p>But it&#8217;s also not flexible/extensible enough. It&#8217;s true that I&#8217;m not a
cataloger. I do have an <acronym title="Master of Science">MS</acronym> in computer science, though, and there is stuff in
the various versions of the <acronym title="Resource Description and Access">RDA</acronym> spec which leads me to believe that the
committee desperately, <strong>desperately</strong> needed some hardcore geeks on it.
Computer science has basically done nothing but develop methods for
abstraction and composition for <em>decades</em>, and that isn&#8217;t reflected enough
here.</p>

<p>Language such as, &#8220;If it is determined that a mechanism for providing a direct
link between a note and the instance of the element to which it relates is
required,&#8230;&#8221; worries me. <em>if</em>? <strong>IF</strong>????? That&#8217;s not a spec. That&#8217;s a guideline. Nail it down, for god&#8217;s sake. When is it appropriate or inappropriate? How do you add links to multiple (but not all) instances of the element?</p>

<p>The spec also seems to describe at least half a dozen kinds of titles. One of these is &#8220;Abbreviated title.&#8221; Do we really want an abbreviated title? No. We want a title with an &#8220;abbreviated&#8221; modifier, so we can use that same modifier for, say, a corporate name or publisher or anything else. [Note: see rsinger's comment below, indicating this was a piss-poor example on my part.]</p>

<h2>Well, sure, but it&#8217;s still better than the <acronym title="Anglo-American Cataloguing Rules">AACR2</acronym>!</h2>

<p>[This section updated to disambiguate my use of '<acronym title="MAchine Readable Cataloging">MARC</acronym>' when I really meant '<acronym title="Anglo-American Cataloguing Rules">AACR2</acronym> as commonly talked about in terms of <acronym title="MAchine Readable Cataloging">MARC</acronym> tags']</p>

<p>Of <em>course</em> it is. It&#8217;s just not <em>better enough</em>!</p>

<p>We&#8217;re not just talking about writing a spec. We&#8217;re talking about replacing <em>every single tool in the library toolchain</em>, from the ILS to editing software to OPACs to scripts that keep it all put together. We&#8217;ll be asking programmers to learn new skills and new ways of thinking, vendors to produce functional software for untested data formats, and catalogers to essentially take their whole brain out of their heads and get a new one.</p>

<p>But that, frankly, is the <em>easy</em> part. The entire culture of the library is built around <acronym title="Anglo-American Cataloguing Rules">AACR2</acronym> concepts and <acronym title="MAchine Readable Cataloging">MARC</acronym> data structures. The thought processes, nomenclature &#8212; <em>everything</em> sometimes feels as if it&#8217;s built around three-digit tags. The majority of the (crucial!) specialized vocabulary librarians, and experts and specialists, use to communicate with each other is directly or indirectly tied to <acronym title="MAchine Readable Cataloging">MARC</acronym>.</p>

<p>So, yeah, <acronym title="Resource Description and Access">RDA</acronym> is a hellofa lot better than <acronym title="Anglo-American Cataloguing Rules">AACR2</acronym>/<acronym title="MAchine Readable Cataloging">MARC</acronym>. But in my view, it&#8217;s not better <em>enough</em> to justify all the pain. Switching is incredibly, astoundingly expensive both in terms of cost and in terms of the devaluation of institutional knowledge. We can&#8217;t do it every few years. We need to be damn sure we&#8217;re getting it right.</p>]]></content:encoded>
			<wfw:commentRss>http://robotlibrarian.billdueber.com/why-rda-is-doomed-to-failure/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Data structures and Serializations</title>
		<link>http://robotlibrarian.billdueber.com/data-structures-and-serializations/</link>
		<comments>http://robotlibrarian.billdueber.com/data-structures-and-serializations/#comments</comments>
		<pubDate>Tue, 20 Apr 2010 20:56:52 +0000</pubDate>
		<dc:creator>Bill</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://robotlibrarian.billdueber.com/?p=235</guid>
		<description><![CDATA[Jonathan Rochkind, in response to a long (and, IMHO, mostly ridiculous) thread on NGC4Lib, has been exploring the boundaries between a data model and its expression/serialization (
see here, here, and here
) and I thought I&#8217;d jump in.

What this post is not

There&#8217;s a lot to be said about a good domain model for bibliographic data. I&#8217;m [...]]]></description>
			<content:encoded><![CDATA[<p>Jonathan Rochkind, in response to a long (and, <acronym title="In my humble opinion">IMHO</acronym>, mostly ridiculous) thread on NGC4Lib, has been exploring the boundaries between a data model and its expression/serialization (
see <a href="http://bibwild.wordpress.com/2010/04/19/of-marc-serialization-formats-and-element-schemata/">here</a>, <a href="http://bibwild.wordpress.com/2010/04/19/and-more-on-software-data-formats/">here</a>, and <a href="http://bibwild.wordpress.com/2010/04/20/serialization-vs-metadata-schemavocabulary/">here</a>
) and I thought I&#8217;d jump in.</p>

<h2>What this post is <em>not</em></h2>

<p>There&#8217;s a lot to be said about a good domain model for bibliographic data. I&#8217;m <em>so</em> not the guy to say it. I know there are arguments for and against various aspects of the <acronym title="Anglo-American Cataloguing Rules">AACR2</acronym> and <acronym title="Resource Description and Access">RDA</acronym> and <acronym title="Functional Requirements for Bibliographic Records">FRBR</acronym>, and I&#8217;m unable to go into them.</p>

<p>What I <em>am</em> comfortable saying is this:</p>

<blockquote>
  <p>Anyone advocating or dismissing a data model based on the  data structure or serialization most-often associated with that model <strong>is missing the goddamn point</strong>.</p>
</blockquote>

<h2>Data serializations</h2>

<p>&#8230;are boring. They&#8217;re unimportant at the data modeling stage, and only barely important when thinking about data structures. For any given data structure there are lots of ways you can serialize it. A standard programming-language hash can be represented in a zillion ways, for example: yaml, json, various programming languages, .ini files, etc. Even <acronym title="MAchine Readable Cataloging">MARC</acronym> has two standard serializations (binary and xml) with several more actually in use (Aleph Sequential, for example).</p>

<p>So, let me repeat again, serializations are boring and not worth talking about until you&#8217;ve got everything else nailed down. Any format you can round-trip your data structure to/from is fine.</p>

<p>Serializations are measured from &#8220;less pain&#8221; to &#8220;more pain&#8221;, but all have the exact same <em>expressiveness</em>. Data <em>structures</em>, on the other hand, do not.</p>

<h2>A hierarchy of data structures</h2>

<p>Think about the following data structures:</p>

<ul>
<li>An ordered list</li>
<li>key-value pairs</li>
<li>A hierarchy (e.g., an <acronym title="Extensible Markup Language">XML</acronym> document)</li>
<li>An undirected graph</li>
<li>A directed graph</li>
<li>A labeled, directed multigraph (e.g., a set of <acronym title="Resource Description Framework">RDF</acronym> Triples)</li>
</ul>

<p>You don&#8217;t have to think very hard to see that any of these can be viewed as a restricted version of the data structures below it in the list. An ordered list (array) is just a set of key-value pairs where the keys represent each item&#8217;s sequence. A set of key-value pairs is a very, very flat hierarchy. A hierarchy is an undirected graph without cycles. An undirected graph is a directed graph where you&#8217;re careful to make links both ways. And a directed graph can easily be represented as a set of <acronym title="Resource Description Framework">RDF</acronym> triples (where you may, for example, only have one label for your relationships: &#8220;links to&#8221;).</p>

<p>[<strong>Note that I didn't say any of these would be efficient implementations</strong>!]</p>
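<p>A minimal sketch of that &#8220;restricted version&#8221; idea, in Ruby: three of the simpler structures re-expressed as sets of subject/label/object triples. The helper names and sample data are made up for illustration, and plain arrays stand in for real <acronym title="Resource Description Framework">RDF</acronym>; none of this is meant as an efficient implementation.</p>

```ruby
# Hypothetical helpers: each one re-expresses a simpler structure as an
# array of [subject, label, object] triples. Plain arrays stand in for RDF.

# An ordered list: the label is each item's position.
def list_to_triples(name, items)
  items.each_with_index.map { |item, i| [name, i, item] }
end

# Key-value pairs: the label is the key.
def pairs_to_triples(name, hash)
  hash.map { |key, value| [name, key, value] }
end

# A hierarchy given as nested hashes: every parent :has_child each child.
def tree_to_triples(parent, children)
  children.flat_map do |child, grandchildren|
    [[parent, :has_child, child]] + tree_to_triples(child, grandchildren)
  end
end

list_triples = list_to_triples(:authors, ['Smith', 'Jones'])
pair_triples = pairs_to_triples(:book, { title: 'Dune', year: 1965 })
tree_triples = tree_to_triples(:root, { a: { b: {} }, c: {} })

puts list_triples.inspect
puts pair_triples.inspect
puts tree_triples.inspect
```

<p>Going the other way (say, recovering the nested hash from the triples) needs extra knowledge about what <code>:has_child</code> means, which is the &#8220;out of band&#8221; information at issue.</p>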

<p>The reverse is not true &#8212; or, at least, not without an incredible amount of &#8220;out of band&#8221; information in another layer somewhere.</p>

<p>The structures at the end of the list have more <em>expressiveness</em>. You can just plain model more things in them (give-or-take the out-of-band stuff, composition, etc) per unit of screwing around. I&#8217;m not going to try to model my set of key=value pairs in an array. I could <em>do</em> it, but it would take so much of my attention that the data modeling would suffer.</p>

<h2>Don&#8217;t handicap yourself</h2>

<p>Don&#8217;t start with the data structure.</p>

<p>DON&#8217;T START WITH THE DATA STRUCTURE!</p>

<p>GET THAT MOTHER-FREAKIN&#8217; DATA STRUCTURE OFF MY MOTHER-FREAKIN&#8217; PLANE!</p>

<p>Seriously. Don&#8217;t be stupid. If all you&#8217;ve got is a hammer, everything starts to look like a thumb.</p>

<p>If you start off with a restrictive data structure before you even fully define the domain you&#8217;re trying to model, you may hose yourself. You may end up making stupid decisions based on the toolchain you&#8217;re imagining in your head.</p>

<p>Domain modeling is <em>ridiculously hard</em> for any domain worth modeling. If you start with a handicap (a restrictive data structure) it&#8217;s going to be even harder.</p>

<p>No one would think of trying to model bibliographic data using only arrays. That&#8217;s premature optimization on an epic scale.</p>

<h2>The appeal of <acronym title="Resource Description Framework">RDF</acronym> Triples</h2>

<p>Even if you ignore all the semantics and rules that make <acronym title="Resource Description Framework">RDF</acronym> Triples a value-added instance of a labeled, directed multigraph, the appeal (to me, anyway) is that <em>any semantic model based on <acronym title="Resource Description Framework">RDF</acronym> Triples has enormous expressive power at its disposal</em>.</p>

<p>Does it turn out that after you&#8217;ve fully satisfied the necessary model for the domain, the semantics you need can actually be accomplished with something earlier in the list? Awesome. Go with it. You&#8217;ll get great implementations with good real-life computing characteristics. A database can often usefully be thought of as an implementation of an undirected graph with typed nodes (and, perhaps, some typed links, if you use the column name in the calling table as a &#8220;type&#8221; of sorts, and add some out-of-band knowledge). And lord knows <acronym title="Relational DataBase Management System">RDBMS</acronym>&#8217;s have great performance characteristics.</p>

<p>But don&#8217;t <em>start</em> there. Start with the domain. Model it. Figure out what you need to describe and derive. <em>Then</em> pick the most appropriate data structure.</p>

<h2>The nightmare that is <acronym title="MAchine Readable Cataloging">MARC</acronym></h2>

<p><acronym title="MAchine Readable Cataloging">MARC</acronym>-the-data-structure (not to be confused with a serialization of that data structure, on the one hand,  or with the <acronym title="Anglo-American Cataloguing Rules">AACR2</acronym> on the other) can incompletely (but usefully, I think) be described as:</p>

<ul>
<li>A set of key-value pairs</li>
<li>&#8230;that have a defined order</li>
<li>&#8230;where keys can be repeated</li>
<li>&#8230;and values are strings</li>
<li>&#8230;and keys are a concatenation of tag/ind1/ind2/code</li>
</ul>

<p>Control fields are especially restricted (ind1, ind2, and code are all &#8216;null&#8217;). There&#8217;s been some bullshit attempts at links (e.g., the 880 fields) but really, this is it.</p>
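<p>To make that description concrete, here is a rough Ruby sketch of a record as an ordered array of key-value pairs with repeatable keys and string values. The <code>Key</code> struct, the <code>values_for</code> helper, and every value in the sample record are invented for illustration; this is not how any real <acronym title="MAchine Readable Cataloging">MARC</acronym> library represents things.</p>

```ruby
# Hypothetical sketch of MARC-the-data-structure: an ordered array of
# [key, value] pairs where keys may repeat and every value is a string.
Key = Struct.new(:tag, :ind1, :ind2, :code)

# All values below are made up for illustration.
record = [
  [Key.new('001', nil, nil, nil), 'example-control-number'], # control field
  [Key.new('245', '1', '0', 'a'), 'An example title'],
  [Key.new('650', ' ', '0', 'a'), 'Economics'],
  [Key.new('650', ' ', '0', 'x'), 'History'],
  [Key.new('650', ' ', '0', 'a'), 'Philosophy']
]

# Pull out every value stored under a given tag and subfield code,
# preserving record order.
def values_for(record, tag, code)
  record.select { |key, _value| key.tag == tag and key.code == code }
        .map { |_key, value| value }
end

subjects = values_for(record, '650', 'a')
puts subjects.inspect
```

<p>Notice that nothing in the structure says which <code>x</code> belongs to which <code>a</code> apart from their order in the array.</p>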

<p>It doesn&#8217;t give us much to work with. It&#8217;s restricted. And, sadly, so is our thinking.</p>

<h2>Putting the cart before the horse</h2>

<p>As Jonathan (and zillions of others) rightly point out, a huge problem in the library world is that there are generations (plural) of working librarians who, because of years of practice, find it incredibly hard to think about bibliographic data as modeled outside the constraints inherent in the <acronym title="MAchine Readable Cataloging">MARC</acronym> <em>data structure</em>. It&#8217;s a handicap. It&#8217;s an anchor around our necks.</p>

<p><acronym title="MAchine Readable Cataloging">MARC</acronym>-the-data-<em>model</em> (nee <acronym title="Anglo-American Cataloguing Rules">AACR2</acronym>) is not inherently bad because it&#8217;s built on an impoverished data structure. It&#8217;s bad because it does a shitty job at modeling the bibliographic data space. If we could produce a good model in a crappy data structure like that, well, that&#8217;d be awesome because it would indicate that things are simple.</p>

<p>Things, of course, aren&#8217;t simple. They&#8217;re <em>hard</em>.</p>

<p>So, if you want to complain about <acronym title="MAchine Readable Cataloging">MARC</acronym> or <acronym title="Resource Description and Access">RDA</acronym> or <acronym title="Functional Requirements for Bibliographic Records">FRBR</acronym>, figure out what it&#8217;s trying to model and talk about the fidelity of the model with respect to the problem space. But don&#8217;t conflate data models, data structures, and serializations.</p>

<p>Oh, and don&#8217;t say &#8220;<acronym title="Personal Identification Number">PIN</acronym> Number&#8221; or &#8220;ATM Machine.&#8221; That drives me crazy, too.</p>]]></content:encoded>
			<wfw:commentRss>http://robotlibrarian.billdueber.com/data-structures-and-serializations/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Stupid catalog tricks: Subject Headings and the Long Tail</title>
		<link>http://robotlibrarian.billdueber.com/stupid-catalog-tricks-subject-headings-and-the-long-tail/</link>
		<comments>http://robotlibrarian.billdueber.com/stupid-catalog-tricks-subject-headings-and-the-long-tail/#comments</comments>
		<pubDate>Tue, 13 Apr 2010 14:31:12 +0000</pubDate>
		<dc:creator>Bill</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://robotlibrarian.billdueber.com/stupid-catalog-tricks-subject-headings-and-the-long-tail/</guid>
		<description><![CDATA[Library of Congress Subject Headings (LCSH) in particular.

I&#8217;ve always been down on LCSH because I don&#8217;t understand them. They kinda look like a hierarchy, but they&#8217;re not really. Things get modifiers. Geography is inline and &#8230;weird.

And, of course, in our faceting catalog when you click on a linked LCSH to do an automatic search, you [...]]]></description>
			<content:encoded><![CDATA[<p>Library of Congress Subject Headings (LCSH) in particular.</p>

<p>I&#8217;ve always been down on LCSH because I don&#8217;t understand them. They kinda look like a hierarchy, but they&#8217;re not really. Things get modifiers. Geography is inline and &#8230;weird.</p>

<p>And, of course, in our faceting catalog when you click on a linked LCSH to do an automatic search, you often get nothing but the record you started from. Which is super-annoying.</p>

<p>So, just for kicks, I ran some numbers.</p>

<h2>The process</h2>

<p>I extracted every 650 field with indicator2="0" from our catalog, threw away the subfield 6&#8217;s, and threw away any trailing punctuation in any of the subfields. I called the concatenation of what was left a unique LCSH.</p>

<p>Then I printed them out and put them all onto index cards, using tick-marks to indicate&#8230;</p>

<p>No, of course not. I used <code>sort</code>, <code>uniq -c</code>, and <code>wc -l</code>. Here&#8217;s what I found.</p>
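<p>For the curious, the same counting can be sketched in a few lines of Ruby instead of shell tools. This is not the pipeline I ran. The input shape (each 650 as an array of subfield code/value pairs), the sample headings, and the exact definition of &#8220;trailing punctuation&#8221; are all assumptions made for the example.</p>

```ruby
# Hypothetical input: each 650 field is an array of [subfield_code, value]
# pairs. The sample data below is invented for illustration.

# Drop subfield 6, strip trailing punctuation (assumed here to mean
# periods, commas, semicolons, colons, and whitespace), and concatenate.
def normalize(subfields)
  subfields
    .reject { |code, _value| code == '6' }
    .map { |code, value| '$$' + code + value.sub(/[.,;:\s]+\z/, '') }
    .join
end

# Tally how many times each normalized heading appears.
def heading_counts(fields)
  counts = Hash.new(0)
  fields.each { |subfields| counts[normalize(subfields)] += 1 }
  counts
end

fields = [
  [['a', 'Economics'], ['x', 'History'], ['v', 'Sources.']],
  [['a', 'Economics'], ['x', 'History'], ['v', 'Sources']],
  [['6', '880-01'], ['a', 'Philosophy.']]
]

counts = heading_counts(fields)
total = fields.size
unique = counts.size
singletons = counts.count { |_heading, n| n == 1 }

puts "#{total} headings, #{unique} unique, #{singletons} appearing exactly once"
```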

<h2>Counts of LCSH</h2>

<p>&#8230;in round numbers.</p>

<p>In our catalog, there are:</p>

<ul>
<li>8.50M subject headings (using the definition above)</li>
<li>1.87M unique subject headings</li>
<li>&#8230;66% of which (1.23M) appear exactly once</li>
</ul>

<p>We only have to go out to 30K subjects to account for half of all subject entries. The top 1000 most-used subjects account for 14.5% of all 8.5M subject entries.</p>

<p>The top ten subjects by count are:</p>

<ul>
<li>6029 $$aSermons, American</li>
<li>6131 $$aPhilosophy</li>
<li>7224 $$aFeature films</li>
<li>7591 $$aPiano music</li>
<li>7968 $$aSocialism</li>
<li>8796 $$aEconomics</li>
<li>9185 $$aCommunism</li>
<li>12440 $$aSermons, English$$y17th century</li>
<li>13539 $$aBills, Private$$zUnited States</li>
<li>58823 $$aEconomics$$xHistory$$vSources</li>
</ul>

<h2>From a record&#8217;s point of view</h2>

<p>Our catalog has:</p>

<ul>
<li>7M records</li>
<li>4.4M records with at least one subject (as defined above)</li>
<li>2.4M records with more than one subject</li>
<li>2.0M records with exactly one subject</li>
<li>2.6M records with zero subjects</li>
</ul>

<p>The records with the most subject headings tend to be collections of stuff (theses, photos, etc). Our local standout is the <a href="http://mirlyn.lib.umich.edu/Record/004078801">Dept. of Medicine and Surgery (University of Michigan) theses, 1851-1878</a> with 208 subject entries. 14 records have at least 30 subject entries.</p>

<h2>What it means</h2>

<p>Gee, lady, I don&#8217;t know.</p>

<p>One way to look at it: suppose you&#8217;re considering defining subjects in this way, and making them &#8220;hot&#8221; in the catalog interface. For our data, 2/3 of records would have either no subjects or a subject that found only the record you&#8217;re at. So&#8230;think again.</p>

<p>In real life, we index lots of possible subject fields, and we additionally index the $$a as well as the whole string, so ours are a little bit more useful. A little.</p>]]></content:encoded>
			<wfw:commentRss>http://robotlibrarian.billdueber.com/stupid-catalog-tricks-subject-headings-and-the-long-tail/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Why bother with threading in jruby? Because it&#8217;s easy.</title>
		<link>http://robotlibrarian.billdueber.com/why-bother-with-threading-in-jruby-because-its-easy/</link>
		<comments>http://robotlibrarian.billdueber.com/why-bother-with-threading-in-jruby-because-its-easy/#comments</comments>
		<pubDate>Fri, 12 Mar 2010 02:48:49 +0000</pubDate>
		<dc:creator>Bill</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://robotlibrarian.billdueber.com/why-bother-with-threading-in-jruby-because-its-easy/</guid>
		<description><![CDATA[Lately on the #code4lib IRC channel, several of us have been knocking around different versions (in several programming languages) of programs to read in a ginormous file and do some processing on each line. I noted some speedups related to multi-threading, and someone (maybe rsinger?) said, basically, that to bother with threading for a one-off [...]]]></description>
			<content:encoded><![CDATA[<p>Lately on the #code4lib <acronym title="Internet Relay Chat">IRC</acronym> channel, several of us have been knocking around different versions (in several programming languages) of programs to read in a ginormous file and do some processing on each line. I noted some speedups related to multi-threading, and someone (maybe <em>rsinger</em>?) said, basically, that to bother with threading for a one-off simple program was a waste.</p>

<p>Well, it turns out I&#8217;ve been trying to figure out how to deal with threading in jruby anyway. And I think I have a pretty elegant solution &#8212; a generic &#8220;threaded each&#8221; I&#8217;m calling <code>threach</code>.</p>

<div class="geshi no ruby"><ol><li class="li1"><div class="de1">&nbsp; enumerable_object.<span class="me1">threach</span><span class="br0">&#40;</span>number_of_threads, <span class="re3">:which_iterator</span><span class="br0">&#41;</span> <span class="kw1">do</span> <span class="sy0">|</span>i<span class="sy0">|</span> &nbsp; &nbsp;</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; do_something_threadsafe<span class="br0">&#40;</span>i<span class="br0">&#41;</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="kw1">end</span></div></li></ol></div>

<h2>Some examples</h2>

<div class="geshi no ruby"><ol><li class="li1"><div class="de1">&nbsp; <span class="co1"># You like #each? You&#39;ll love&#8230;err..probably like #threach</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="kw3">load</span> <span class="st0">&#39;threach.rb&#39;</span></div></li>
<li class="li1"><div class="de1">&nbsp;</div></li>
<li class="li1"><div class="de1">&nbsp; <span class="co1"># Process with 2 threads. It assumes you want &#39;each&#39;</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="co1"># as your iterator.</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="br0">&#40;</span><span class="nu0">1</span>..<span class="nu0">10</span><span class="br0">&#41;</span>.<span class="me1">threach</span><span class="br0">&#40;</span><span class="nu0">2</span><span class="br0">&#41;</span> <span class="br0">&#123;</span><span class="sy0">|</span>i<span class="sy0">|</span> <span class="kw3">puts</span> i.<span class="me1">to_s</span><span class="br0">&#125;</span> &nbsp;</div></li>
<li class="li1"><div class="de1">&nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; <span class="co1"># You can also specify the iterator</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="kw4">File</span>.<span class="kw3">open</span><span class="br0">&#40;</span><span class="st0">&#39;mybigfile&#39;</span><span class="br0">&#41;</span> <span class="kw1">do</span> <span class="sy0">|</span>f<span class="sy0">|</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; f.<span class="me1">threach</span><span class="br0">&#40;</span><span class="nu0">2</span>, <span class="re3">:each_line</span><span class="br0">&#41;</span> <span class="kw1">do</span> <span class="sy0">|</span>line<span class="sy0">|</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; processLine<span class="br0">&#40;</span>line<span class="br0">&#41;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; <span class="kw1">end</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="kw1">end</span></div></li>
<li class="li1"><div class="de1">&nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; <span class="co1"># threach does not care what the arity of your block is</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="co1"># as long as it matches the iterator you ask for</span></div></li>
<li class="li1"><div class="de1">&nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; <span class="br0">&#40;</span><span class="st0">&#39;A&#39;</span>..<span class="st0">&#39;Z&#39;</span><span class="br0">&#41;</span>.<span class="me1">threach</span><span class="br0">&#40;</span><span class="nu0">3</span>, <span class="re3">:each_with_index</span><span class="br0">&#41;</span> <span class="kw1">do</span> <span class="sy0">|</span>letter, index<span class="sy0">|</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; <span class="kw3">puts</span> <span class="st0">&quot;#{index}: #{letter}&quot;</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="kw1">end</span></div></li>
<li class="li1"><div class="de1">&nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; <span class="co1"># Or with a hash</span></div></li>
<li class="li1"><div class="de1">&nbsp; h = <span class="br0">&#123;</span><span class="st0">&#39;a&#39;</span> <span class="sy0">=&gt;</span> <span class="nu0">1</span>, <span class="st0">&#39;b&#39;</span><span class="sy0">=&gt;</span><span class="nu0">2</span>, <span class="st0">&#39;c&#39;</span><span class="sy0">=&gt;</span><span class="nu0">3</span><span class="br0">&#125;</span></div></li>
<li class="li1"><div class="de1">&nbsp; h.<span class="me1">threach</span><span class="br0">&#40;</span><span class="nu0">2</span><span class="br0">&#41;</span> <span class="kw1">do</span> <span class="sy0">|</span>letter, i<span class="sy0">|</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; <span class="kw3">puts</span> <span class="st0">&quot;#{i}: #{letter}&quot;</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="kw1">end</span></div></li></ol></div>

<p><em>threach.rb</em> adds to the Enumerable module to provide a threaded
version of whatever enumerator you throw at it (<code>each</code> by default).</p>

<h2>How does it work?</h2>

<p>How about I just put the source here. It&#8217;s short.</p>

<div class="geshi no ruby"><ol><li class="li1"><div class="de1">&nbsp; <span class="kw3">require</span> <span class="st0">&#39;thread&#39;</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="kw1">module</span> <span class="kw4">Enumerable</span></div></li>
<li class="li1"><div class="de1">&nbsp;</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; <span class="kw1">def</span> threach<span class="br0">&#40;</span>threads=<span class="nu0">0</span>, iterator=:each, <span class="sy0">&amp;</span>blk<span class="br0">&#41;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; <span class="kw1">if</span> threads == <span class="nu0">0</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; <span class="co1"># Just call the iterator itself</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw2">self</span>.<span class="me1">send</span><span class="br0">&#40;</span>iterator, <span class="sy0">&amp;</span>blk<span class="br0">&#41;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; <span class="kw1">else</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; bq = <span class="kw4">SizedQueue</span>.<span class="me1">new</span><span class="br0">&#40;</span>threads <span class="sy0">*</span> <span class="nu0">4</span><span class="br0">&#41;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; consumers = <span class="br0">&#91;</span><span class="br0">&#93;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; threads.<span class="me1">times</span> <span class="kw1">do</span> <span class="sy0">|</span>i<span class="sy0">|</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; consumers <span class="sy0">&lt;&lt;</span> <span class="kw4">Thread</span>.<span class="me1">new</span> <span class="kw1">do</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">until</span> <span class="br0">&#40;</span>a = bq.<span class="me1">pop</span><span class="br0">&#41;</span> === <span class="re3">:end_of_data</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; blk.<span class="me1">call</span><span class="br0">&#40;</span><span class="sy0">*</span>a<span class="br0">&#41;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">end</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">end</span> &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">end</span></div></li>
<li class="li1"><div class="de1">&nbsp;</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; <span class="co1"># The producer</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; count = <span class="nu0">0</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw2">self</span>.<span class="me1">send</span><span class="br0">&#40;</span>iterator<span class="br0">&#41;</span> <span class="kw1">do</span> <span class="sy0">|*</span>x<span class="sy0">|</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; bq.<span class="me1">push</span> x</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; count <span class="sy0">+</span>= <span class="nu0">1</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">end</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; <span class="co1"># Now end it</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; threads.<span class="me1">times</span> <span class="kw1">do</span> </div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; bq <span class="sy0">&lt;&lt;</span> <span class="re3">:end_of_data</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">end</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; <span class="co1"># Do the join</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; consumers.<span class="me1">each</span> <span class="br0">&#123;</span><span class="sy0">|</span>t<span class="sy0">|</span> t.<span class="me1">join</span><span class="br0">&#125;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; <span class="kw1">end</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; <span class="kw1">end</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="kw1">end</span></div></li></ol></div>

<p>That&#8217;s it. If threads=0, just use the iterator itself. If not:</p>

<ul>
<li>Create a SizedQueue. It is thread-safe by definition and acts as the glue between the consumers and the main-thread producer.</li>
<li>Start a set of consumer threads that basically just pull an item out of the queue and then run the given block on it. Bail when you see the <code>end_of_data</code> token. These consumer threads all immediately block because there&#8217;s nothing in the SizedQueue yet.</li>
<li>Populate the SizedQueue. When you run out of stuff to add, push on an <code>end_of_data</code> token for each consumer thread.</li>
<li>Call <code>join</code> on each consumer thread so the main program waits until all of them have finished.</li>
</ul>
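
<p>The steps above can also be sketched as a standalone method, which makes the hand-off easier to see. This is only an illustration and not part of threach: the helper name <code>squares_with_workers</code> and the <code>Mutex</code> guarding the results array are assumptions added for the example.</p>

```ruby
require 'thread'

# Hypothetical helper (not part of threach): square each item using a fixed
# pool of consumer threads fed through a bounded, thread-safe queue.
def squares_with_workers(items, workers = 2)
  queue   = SizedQueue.new(workers * 4)  # bounded hand-off between threads
  results = []
  lock    = Mutex.new                    # assumption: guard the shared array

  consumers = Array.new(workers) do
    Thread.new do
      # Each consumer blocks on pop until work (or the sentinel) arrives.
      until (item = queue.pop) == :end_of_data
        value = item * item
        lock.synchronize { results << value }
      end
    end
  end

  items.each { |item| queue.push(item) }      # the producer (main thread)
  workers.times { queue.push(:end_of_data) }  # one sentinel per consumer
  consumers.each(&:join)                      # wait for every consumer
  results.sort                                # completion order is not guaranteed
end

p squares_with_workers([1, 2, 3, 4, 5])
```

<p>The sort at the end is there because consumers finish in whatever order the scheduler picks, so the results arrive unordered.</p>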

<h2>Why use it?</h2>

<p>Well, if you&#8217;re using stock ruby &#8212; you probably shouldn&#8217;t. It&#8217;ll just slow things down. But if you&#8217;re using a ruby implementation that has real threads, like JRuby, this will give you relatively painless multi-threading.</p>

<p>You can always do something like:</p>

<div class="geshi no ruby"><ol><li class="li1"><div class="de1">&nbsp; <span class="kw1">if</span> <span class="kw1">defined</span>? JRUBY_VERSION</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; numthreads = <span class="nu0">3</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="kw1">else</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; numthreads = <span class="nu0">0</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="kw1">end</span></div></li>
<li class="li1"><div class="de1">&nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; my_enumerable.<span class="me1">threach</span><span class="br0">&#40;</span>numthreads<span class="br0">&#41;</span> <span class="br0">&#123;</span><span class="sy0">|</span>i<span class="sy0">|</span> &#8230;<span class="br0">&#125;</span></div></li></ol></div>

<p>Note the &#8220;relatively&#8221; up there. The block you pass still has to be thread-safe, and there are many data structures you&#8217;ll encounter that are <em>not</em> thread-safe. Scalars, arrays, and hashes are, though, under JRuby, and that&#8217;ll get you pretty far.</p>]]></content:encoded>
			<wfw:commentRss>http://robotlibrarian.billdueber.com/why-bother-with-threading-in-jruby-because-its-easy/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Pushing MARC to Solr; processing times and threading and such</title>
		<link>http://robotlibrarian.billdueber.com/pushing-marc-to-solr-processing-times-and-threading-and-such/</link>
		<comments>http://robotlibrarian.billdueber.com/pushing-marc-to-solr-processing-times-and-threading-and-such/#comments</comments>
		<pubDate>Thu, 04 Mar 2010 16:38:03 +0000</pubDate>
		<dc:creator>Bill</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://robotlibrarian.billdueber.com/?p=214</guid>
		<description><![CDATA[[This is in response to a thread on the blacklight mailing list about getting MARC data into Solr.]

What&#8217;s the question?

The question came up, &#8220;How much time do we spend processing the MARC vs trying to push it into Solr?&#8221;. Bob Haschart found that even with a pretty damn complicated processing stage, pushing the data to [...]]]></description>
			<content:encoded><![CDATA[<p>[This is in response to a <a href="http://groups.google.com/group/blacklight-development/browse_thread/thread/672b7269ada16a61?hl=en">thread on the blacklight mailing</a> list about getting <acronym title="MAchine Readable Cataloging">MARC</acronym> data into <acronym title="Solr isn\'t an acronym, silly!">Solr</acronym>.]</p>

<h2>What&#8217;s the question?</h2>

<p>The question came up, &#8220;How much time do we spend processing the <acronym title="MAchine Readable Cataloging">MARC</acronym> vs trying to push it into <acronym title="Solr isn\'t an acronym, silly!">Solr</acronym>?&#8221;. Bob Haschart found that even with a pretty damn complicated processing stage, pushing the data to solr was still, at best,
taking at least as long as the processing stage.</p>

<p>I&#8217;m interested because I&#8217;ve been struggling to write a solrmarc-like system that runs under JRuby. Architecturally, the big difference between my stuff and solrmarc is that I use the StreamingUpdateSolrServer (on Erik Hatcher&#8217;s suggestion). So I thought I&#8217;d check how things break down for me.</p>

<p>Here are my numbers running under JRuby (using MARC4J as the marc
implementation) with the <acronym title="Solr isn\'t an acronym, silly!">Solr</acronym> StreamingUpdateSolrServer. Obviously, there are
a lot of differences between this and solrmarc, but I&#8217;m hoping that while it&#8217;s
not comparing apples to apples, it&#8217;s at least comparing apples to some sort of
processed cheese-like product.</p>

<h2>What work is being done on what?</h2>

<p>The data set is a file of 18,881 <acronym title="MAchine Readable Cataloging">MARC</acronym> records in marc-binary format. It&#8217;s
probably not big enough to get a great idea of how things will run over the
long (many millions of records) haul, but it&#8217;ll do for this rough-cut stuff.</p>

<p>I break my processing down into five categories:</p>

<ul>
<li>Read the records into marc4j objects and do nothing. This is a baseline of sorts.</li>
<li>The &#8220;normal&#8221; fields are anything that you could do with SolrMarc without a
custom routine; the actual processing is done in JRuby. </li>
<li>Custom fields are generated with JRuby code, but these are things that in solrmarc would require a custom routine.</li>
<li>The big &#8220;allfields&#8221; field is text from tags 100 through 900.</li>
<li>The &#8220;to_xml&#8221; routine is just calling the underlying marc4j <acronym title="Extensible Markup Language">XML</acronym> output and stuffing it into a string.</li>
</ul>

<p>The schema used is our normal UMICH schema <em>except for</em> High Level Browse
(which appears in <a href="http://mirlyn.lib.umich.edu/">our catalog</a> as &#8220;Academic
Discipline&#8221;). The code for that is written in Java, and I just call it from
JRuby when I&#8217;m using it. I excluded it because it&#8217;s incredibly expensive, both at startup time (when it loads a giant database of call-number ranges and associated categories) and for processing &#8212; there&#8217;s a lot of call-number normalization, long-string comparisons, some modified binary searches, etc. etc. etc. It&#8217;s expensive. Trust me.</p>

<p>The <acronym title="Solr isn\'t an acronym, silly!">Solr</acronym> server itself is on a different, incredibly-beefy machine, and is
emptied out before each invocation that involves actually pushing data to it (with a delete-by-query <code>*:*</code>).</p>

<h2>How fast were things on my desktop?</h2>

<ul>
<li>18,881 records in marc-binary format</li>
<li>Times are in seconds, run on my desktop</li>
<li>Remember, you can&#8217;t compare these numbers to Bob&#8217;s because we&#8217;re doing
different things to different data. </li>
</ul>

<table>
<thead>
<tr>
  <th align="right">Total Seconds</th>
  <th>Description</th>
</tr>
</thead>
<tbody>
<tr>
  <td align="right">19</td>
  <td>Just read the records with marc4j and do nothing.</td>
</tr>
<tr>
  <td align="right">85</td>
  <td>Read and do 35 &#8220;normal&#8221; fields (no custom)</td>
</tr>
<tr>
  <td align="right">104</td>
  <td>Read, 35 normal, 15 custom fields</td>
</tr>
<tr>
  <td align="right">110</td>
  <td>Read, normal, custom, allfields</td>
</tr>
<tr>
  <td align="right">129</td>
  <td>Read, normal, custom, allfields, to_xml</td>
</tr>
<tr>
  <td align="right">136</td>
  <td>Read, normal, custom, allfields, to_xml, 2-threaded SUSS, commit every 5K docs</td>
</tr>
<tr>
  <td align="right">142</td>
  <td>Read, normal, custom, allfields, to_xml, 1-threaded SUSS, commit every 5k docs</td>
</tr>
<tr>
  <td align="right">124</td>
  <td>Read, normal, custom, allfields, to_xml, 1-threaded SUSS, commit every 5k docs, <strong>2 threads doing processing</strong></td>
</tr>
</tbody>
</table>

<p>We can also break the same numbers down as:</p>

<table>
<thead>
<tr>
  <th align="right">Seconds</th>
  <th>Description</th>
</tr>
</thead>
<tbody>
<tr>
  <td align="right">19</td>
  <td>read the records and do nothing</td>
</tr>
<tr>
  <td align="right">66</td>
  <td>process the 35 normal fields</td>
</tr>
<tr>
  <td align="right">19</td>
  <td>process the 15 custom fields</td>
</tr>
<tr>
  <td align="right">6</td>
  <td>generate the &#8220;allfields&#8221; field</td>
</tr>
<tr>
  <td align="right">19</td>
  <td>generate the <acronym title="Extensible Markup Language">XML</acronym> (yowza!)</td>
</tr>
<tr>
  <td align="right">7</td>
  <td>send to solr with two threads</td>
</tr>
<tr>
  <td align="right">13</td>
  <td>send to solr with one thread</td>
</tr>
</tbody>
</table>
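
<p>The per-stage numbers are just differences between successive rows of the first table. A quick sketch of that arithmetic (the labels and variable names are made up for the example; the totals are the ones reported above):</p>

```ruby
# Cumulative wall-clock seconds from the first table, in the order each
# stage was added to the run.
cumulative = [
  ["read only",   19],
  ["+ normal",    85],
  ["+ custom",    104],
  ["+ allfields", 110],
  ["+ to_xml",    129]
]

# Cost of a stage = this run's total minus the previous run's total.
stage_costs = cumulative.each_with_index.map do |(label, total), i|
  previous = i.zero? ? 0 : cumulative[i - 1][1]
  [label, total - previous]
end

# Sending to Solr is whatever is left over once all processing is subtracted.
processing_total = cumulative.last[1]
solr_costs = {
  "2-threaded SUSS" => 136 - processing_total,
  "1-threaded SUSS" => 142 - processing_total
}

stage_costs.each { |label, secs| puts format("%-16s %3d", label, secs) }
solr_costs.each  { |label, secs| puts format("%-16s %3d", label, secs) }
```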

<p>Or like this:</p>

<table>
<thead>
<tr>
  <th align="right">Seconds</th>
  <th>Description</th>
</tr>
</thead>
<tbody>
<tr>
  <td align="right">129</td>
  <td>do all the reading and processing</td>
</tr>
<tr>
  <td align="right">13</td>
  <td>send to solr with one thread</td>
</tr>
</tbody>
</table>

<h2>Why does solr processing seem so much faster for me?</h2>

<p>There are a lot of reasons why my submit-to-solr might seem like less of a
burden. The ones I can think of off the top of my head are:</p>

<ul>
<li>SUSS is just faster than whatever solrmarc does. </li>
<li>My processing stage is so much slower than solrmarc&#8217;s (due to algorithms or jruby-vs-java, I don&#8217;t know) that the &#8220;push to solr&#8221; portion of it gets swallowed up by the slowness of the overall code.</li>
<li>The <acronym title="Solr isn\'t an acronym, silly!">Solr</acronym> server is so much faster than my desktop that my poor little 
desktop can&#8217;t send it data fast enough to work it.</li>
</ul>

<p><strong>For my setup, obviously adding a processing thread is a lot more beneficial
than adding a SUSS thread.</strong> My desktop doesn&#8217;t have that many threads lying around (adding a third processing thread actually slowed things down), so I moved the code to a beefier machine to see what happened.</p>

<h2>Trying the same thing on a beefy machine</h2>

<p>This is the exact same code and data, but on a beefy machine (16 cores, gobs
of memory).</p>

<table>
<thead>
<tr>
  <th align="right">time</th>
  <th align="center">SUSS Threads</th>
  <th align="center">Processing Threads</th>
</tr>
</thead>
<tbody>
<tr>
  <td align="right">70</td>
  <td align="center">1</td>
  <td align="center">1     (was 142 seconds on the desktop)</td>
</tr>
<tr>
  <td align="right">47</td>
  <td align="center">1</td>
  <td align="center">2</td>
</tr>
<tr>
  <td align="right">39</td>
  <td align="center">1</td>
  <td align="center">3</td>
</tr>
<tr>
  <td align="right">35</td>
  <td align="center">1</td>
  <td align="center">4</td>
</tr>
<tr>
  <td align="right">68</td>
  <td align="center">2</td>
  <td align="center">1</td>
</tr>
<tr>
  <td align="right">48</td>
  <td align="center">2</td>
  <td align="center">2</td>
</tr>
<tr>
  <td align="right">38</td>
  <td align="center">2</td>
  <td align="center">3</td>
</tr>
<tr>
  <td align="right">34</td>
  <td align="center">2</td>
  <td align="center">4</td>
</tr>
</tbody>
</table>

<p>So, on my hardware anyway, there&#8217;s a sweet spot with one suss thread and
three processing threads. <acronym title="Your mileage may vary">YMMV</acronym>, of course.</p>
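
<p>To see how quickly the returns diminish, here is a small sketch that turns the one-SUSS-thread rows of the table into speedups and seconds saved per extra thread. The variable names are invented for the example; the timings are the ones in the table.</p>

```ruby
# Seconds from the table above for 1 SUSS thread, keyed by processing threads.
timings = { 1 => 70, 2 => 47, 3 => 39, 4 => 35 }

baseline = timings[1]
report = timings.map do |threads, seconds|
  # Seconds gained by adding this thread, relative to one thread fewer.
  saved   = threads == 1 ? 0 : timings[threads - 1] - seconds
  speedup = (baseline.to_f / seconds).round(2)
  [threads, seconds, speedup, saved]
end

report.each do |threads, seconds, speedup, saved|
  puts format("%d threads: %3ds  speedup x%.2f  saved %2ds over previous",
              threads, seconds, speedup, saved)
end
```

<p>Each additional thread saves less than the one before it (23, then 8, then 4 seconds), which is the pattern behind picking a sweet spot rather than just adding threads.</p>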

<h2>What have we learned?</h2>

<p>I&#8217;m not sure, to be honest. It&#8217;s logistically difficult for me to do the same
process in solrmarc because I&#8217;d have to rebuild everything without the HLB stuff. I guess for me, what I&#8217;ve learned is that if I&#8217;m going to continue working
on my code, the places to focus my attention are threading (obviously) and <acronym title="MAchine Readable Cataloging">MARC</acronym>-<acronym title="Extensible Markup Language">XML</acronym> generation.</p>]]></content:encoded>
			<wfw:commentRss>http://robotlibrarian.billdueber.com/pushing-marc-to-solr-processing-times-and-threading-and-such/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>ruby-marc with pluggable readers</title>
		<link>http://robotlibrarian.billdueber.com/ruby-marc-with-pluggable-readers/</link>
		<comments>http://robotlibrarian.billdueber.com/ruby-marc-with-pluggable-readers/#comments</comments>
		<pubDate>Tue, 02 Mar 2010 17:55:43 +0000</pubDate>
		<dc:creator>Bill</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://robotlibrarian.billdueber.com/ruby-marc-with-pluggable-readers/</guid>
		<description><![CDATA[I&#8217;ve been messing with easier ways of adding parsers to ruby-marc&#8217;s MARC::Reader object. The idea is that you can do this:

&#160; require &#39;marc&#39;
&#160; require &#39;my_marc_stuff&#39;
&#160; 
&#160; mbreader = MARC::Reader.new&#40;&#39;test.mrc&#39;&#41; # =&#62; Stock marc binary reader
&#160; mbreader = MARC::Reader.new&#40;&#39;test.mrc&#39;, :readertype=&#62;:marcstrict&#41; # =&#62; ditto
&#160; 
&#160; MARC::Reader.register_parser&#40;My::MARC::Parser, :marcstrict&#41;
&#160; mbreader = MARC::Reader.new&#40;&#39;test.mrc&#39;&#41; # =&#62; Uses My::MARC::Parser now
&#160; 
&#160; xmlreader [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve been messing with easier ways of adding parsers to ruby-marc&#8217;s MARC::Reader object. The idea is that you can do this:</p>

<div class="geshi no ruby"><ol><li class="li1"><div class="de1">&nbsp; <span class="kw3">require</span> <span class="st0">&#39;marc&#39;</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="kw3">require</span> <span class="st0">&#39;my_marc_stuff&#39;</span></div></li>
<li class="li1"><div class="de1">&nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; mbreader = <span class="re2">MARC::Reader</span>.<span class="me1">new</span><span class="br0">&#40;</span><span class="st0">&#39;test.mrc&#39;</span><span class="br0">&#41;</span> <span class="co1"># =&gt; Stock marc binary reader</span></div></li>
<li class="li1"><div class="de1">&nbsp; mbreader = <span class="re2">MARC::Reader</span>.<span class="me1">new</span><span class="br0">&#40;</span><span class="st0">&#39;test.mrc&#39;</span> <span class="re3">:readertype</span><span class="sy0">=&gt;</span>:marcstrict<span class="br0">&#41;</span> <span class="co1"># =&gt; ditto</span></div></li>
<li class="li1"><div class="de1">&nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; <span class="re2">MARC::Reader</span>.<span class="me1">register_parser</span><span class="br0">&#40;</span><span class="re2">My::MARC::Parser</span>, <span class="re3">:marcstrict</span><span class="br0">&#41;</span></div></li>
<li class="li1"><div class="de1">&nbsp; mbreader = <span class="re2">MARC::Reader</span>.<span class="me1">new</span><span class="br0">&#40;</span><span class="st0">&#39;test.mrc&#39;</span><span class="br0">&#41;</span> <span class="co1"># =&gt; Uses My::MARC::Parser now</span></div></li>
<li class="li1"><div class="de1">&nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; xmlreader = <span class="re2">MARC::Reader</span>.<span class="me1">new</span><span class="br0">&#40;</span><span class="st0">&#39;test.xml&#39;</span>, <span class="re3">:readertype</span><span class="sy0">=&gt;</span>:marcxml<span class="br0">&#41;</span></div></li>
<li class="li1"><div class="de1">&nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; <span class="co1"># &#8230;and maybe further on down the road</span></div></li>
<li class="li1"><div class="de1">&nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; asreader = <span class="re2">MARC::Reader</span>.<span class="me1">new</span><span class="br0">&#40;</span><span class="st0">&#39;test.seq&#39;</span>, <span class="re3">:readertype</span><span class="sy0">=&gt;</span>:alephsequential<span class="br0">&#41;</span></div></li>
<li class="li1"><div class="de1">&nbsp; mjreader = <span class="re2">MARC::Reader</span>.<span class="me1">new</span><span class="br0">&#40;</span><span class="st0">&#39;test.json&#39;</span>, <span class="re3">:readertype</span><span class="sy0">=&gt;</span>:marchashjson<span class="br0">&#41;</span></div></li></ol></div>

<p>A parser need only implement <code>#each</code> and a module-level method <code>#decode_from_string</code>.</p>
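
<p>To make the registration idea concrete, here is a toy sketch of how such a registry could work. It is not ruby-marc&#8217;s code or the code on the github page: <code>SketchMARC</code>, <code>LineParser</code> and the <code>:lines</code> reader type are all hypothetical names, and the &#8220;records&#8221; are just stripped lines of text.</p>

```ruby
# Hypothetical stand-in for the registration idea; NOT ruby-marc itself.
module SketchMARC
  class Reader
    include Enumerable

    @parsers = {}

    class << self
      attr_reader :parsers

      # Associate a parser class with a reader type symbol.
      def register_parser(parser_class, type)
        @parsers[type] = parser_class
      end
    end

    def initialize(source, opts = {})
      type = opts.fetch(:readertype, :marcstrict)
      parser_class = self.class.parsers.fetch(type) {
        raise ArgumentError, "no parser registered for #{type.inspect}"
      }
      @parser = parser_class.new(source)
    end

    # Delegate iteration to whichever parser was selected.
    def each(&blk)
      @parser.each(&blk)
    end
  end
end

# Toy parser: implements #each plus a class-level decode_from_string.
class LineParser
  def self.decode_from_string(str)
    str.strip
  end

  def initialize(lines)
    @lines = lines
  end

  def each
    @lines.each { |line| yield self.class.decode_from_string(line) }
  end
end

SketchMARC::Reader.register_parser(LineParser, :lines)
reader = SketchMARC::Reader.new(["rec one\n", "rec two\n"], :readertype => :lines)
p reader.to_a
```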

<p>Read all about it <a href="http://github.com/billdueber/ruby-marc-plugable-readers">on the github page</a>.</p>]]></content:encoded>
			<wfw:commentRss>http://robotlibrarian.billdueber.com/ruby-marc-with-pluggable-readers/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>New interest in MARC-HASH / JSON</title>
		<link>http://robotlibrarian.billdueber.com/new-interest-in-marc-hash-json/</link>
		<comments>http://robotlibrarian.billdueber.com/new-interest-in-marc-hash-json/#comments</comments>
		<pubDate>Fri, 26 Feb 2010 04:29:46 +0000</pubDate>
		<dc:creator>Bill</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://robotlibrarian.billdueber.com/?p=204</guid>
		<description><![CDATA[For reasons I&#8217;m still not entirely clear on (I wasn&#8217;t there), the Code4Lib 2010 conference this week inspired renewed interest in a JSON-based format for MARC data.

When I initially looked at MARC-HASH almost a year ago, I was mostly looking for something that wasn&#8217;t such a pain in the butt to work with, something that [...]]]></description>
			<content:encoded><![CDATA[<p>For reasons I&#8217;m still not entirely clear on (I wasn&#8217;t there), the Code4Lib 2010 conference this week inspired renewed interest in a JSON-based format for <acronym title="MAchine Readable Cataloging">MARC</acronym> data.</p>

<p>When I initially looked at <acronym title="MAchine Readable Cataloging">MARC</acronym>-HASH <a href="http://robotlibrarian.billdueber.com/marc-hash-the-saga-continues-now-with-even-less-structure/">almost a year ago</a>, I was mostly looking for something that wasn&#8217;t such a pain in the butt to work with, something that would marshal into multiple formats easily and would simply and easily round-trip.</p>

<p>Now, though, a lot of us are looking for a <acronym title="MAchine Readable Cataloging">MARC</acronym> format that (a) doesn&#8217;t suffer from the length limitations of binary <acronym title="MAchine Readable Cataloging">MARC</acronym>, but (b) is less painful (both in code and processing time) than <acronym title="MAchine Readable Cataloging">MARC</acronym>-<acronym title="Extensible Markup Language">XML</acronym>, and it&#8217;s worth re-visiting. </p>

<p>For at least a few folks, un-marshaling time is a factor, since no matter what you&#8217;re doing, processing <acronym title="Extensible Markup Language">XML</acronym> is slower than most other options. Much of that expense is due to functionality and data-safety features that just aren&#8217;t a big win with a brain-dead format like <acronym title="MAchine Readable Cataloging">MARC</acronym>, so it&#8217;s worth looking at alternatives.</p>

<h2>What is <acronym title="MAchine Readable Cataloging">MARC</acronym>-HASH?</h2>

<p>At some point, we&#8217;ll want a real spec, but right now it&#8217;s just this:
</p>

<div class="geshi no json"><ol><li class="li1"><div class="de1">&nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; # A record is a four-pair hash, as follows. UTF-8 is mandatory.</div></li>
<li class="li1"><div class="de1">&nbsp; {</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &quot;type&quot; : &quot;marc-hash&quot;</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &quot;version&quot; : [1, 0]</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &quot;leader&quot; : &quot;&#8230;leader string &#8230; &quot;</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &quot;fields&quot; : [array, of, fields]</div></li>
<li class="li1"><div class="de1">&nbsp; }</div></li>
<li class="li1"><div class="de1">&nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; # A field is an array of either 2 or 4 elements</div></li>
<li class="li1"><div class="de1">&nbsp; [tag, value] # a control field</div></li>
<li class="li1"><div class="de1">&nbsp; [tag, ind1, ind2, [array, of subfields]]</div></li>
<li class="li1"><div class="de1">&nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; # A subfield is an array of two elements</div></li>
<li class="li1"><div class="de1">&nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; [code, value]</div></li></ol></div>

<p>So, a short example:</p>

<div class="geshi no json"><ol><li class="li1"><div class="de1">&nbsp; {</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &quot;type&quot; : &quot;marc-hash&quot;,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &quot;version&quot; : [1, 0],</div></li>
<li class="li1"><div class="de1">&nbsp;</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &quot;leader&quot; : &quot;leader string&quot;</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &quot;fields&quot; : [</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp;[&quot;001&quot;, &quot;001 value&quot;]</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp;[&quot;002&quot;, &quot;002 value&quot;]</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp;[&quot;010&quot;, &quot; &quot;, &quot; &quot;,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; [</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; [&quot;a&quot;, &quot;68009499&quot;]</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; ]</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; ],</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; [&quot;035&quot;, &quot; &quot;, &quot; &quot;,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; [</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; [&quot;a&quot;, &quot;(RLIN)MIUG0000733-B&quot;]</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; ],</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; ],</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; [&quot;035&quot;, &quot; &quot;, &quot; &quot;,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; [</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; [&quot;a&quot;, &quot;(CaOTULAS)159818014&quot;]</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; ],</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; ],</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; [&quot;245&quot;, &quot;1&quot;, &quot;0&quot;,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; [</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; [&quot;a&quot;, &quot;Capitalism, primitive and modern;&quot;],</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; [&quot;b&quot;, &quot;some aspects of Tolai economic growth&quot; ],</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; [&quot;c&quot;, &quot;[by] T. Scarlett Epstein.&quot;]</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; ]</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; ]</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; ]</div></li>
<li class="li1"><div class="de1">&nbsp; }</div></li></ol></div>

<h2>How's the speed?</h2>

<p>I think it's important to separate the format marc-hash from the eventual marshaling format -- partly because someone might actually want to use something other than JSON as a final format, but mostly because it forces implementations to separate hash-creation from json-creation and makes it easy to swap in different json libraries when a faster one comes along.</p>

<p>Having said that, in real life people are mostly concerned about JSON. So,
let's look at JSON performance.</p>

<p>The <acronym title="MAchine Readable Cataloging">MARC</acronym>-Binary and <acronym title="MAchine Readable Cataloging">MARC</acronym>-<acronym title="Extensible Markup Language">XML</acronym> files are normal files, as you'd expect. The JSON file is 
<a href="http://trephine.org/t/index.php?title=Newline_delimited_JSON">"Newline-Delimited JSON"</a> -- a single JSON record on each line.</p>
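
<p>A rough sketch of what reading and writing newline-delimited JSON amounts to, using only Ruby's standard library. The two tiny records and the in-memory <code>StringIO</code> are stand-ins for real data and a real file:</p>

```ruby
require 'json'
require 'stringio'

# Two placeholder marc-hash records (contents invented for the example).
records = [
  { "type" => "marc-hash", "version" => [1, 0], "leader" => "leader one",
    "fields" => [["001", "rec1"]] },
  { "type" => "marc-hash", "version" => [1, 0], "leader" => "leader two",
    "fields" => [["001", "rec2"]] }
]

# Write: one compact JSON document per line. JSON.generate escapes any
# newlines inside strings, so each record stays on a single line.
io = StringIO.new
records.each { |rec| io.puts(JSON.generate(rec)) }

# Read: stream line by line, parsing one record at a time.
io.rewind
read_back = io.each_line.map { |line| JSON.parse(line) }

puts read_back.length
puts read_back == records
```

<p>Since each line is a complete document, a reader never has to hold more than one record in memory, which is what makes the format stream-friendly.</p>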

<p>The benchmark code looks like this:</p>

<pre>
  # Unmarshal
  x.report("MARC Binary") do
    reader = MARC::Reader.new('test.mrc')
    reader.each do |r|
      title = r['245']['a']
    end
  end

  # Marshal
  x.report("MARC Binary") do
    reader = MARC::Reader.new('test.mrc')
    writer = MARC::Writer.new('benchout.mrc')
    reader.each do |r|
      writer.write(r)
    end
    writer.close
  end
</pre>

<p>Under MRI, I used the nokogiri <acronym title="Extensible Markup Language">XML</acronym> parser and the yajl JSON gem. Under JRuby, it was the jstax <acronym title="Extensible Markup Language">XML</acronym> parser and the json-jruby JSON gem.</p>

<p>The test file is a set of 18,831 records I've been using for all my benchmarking of late. It's nothing special; just a nice size.</p>

<h3>Marshalling Speed (read from binary marc, dump to given format)</h3>

<p>Times are in seconds on my Macbook laptop, using ruby-marc.</p>

<table class="grid">
    <tr>
      <th>Format</th>
      <th>Ruby 1.8.7</th>
      <th>Ruby 1.9</th>
      <th>JRuby 1.4</th>
      <th>JRuby 1.4 --1.9</th>
    </tr>
    <tr>
      <td><acronym title="Extensible Markup Language">XML</acronym></td>
      <td>393</td>
      <td>443</td>
      <td>188</td>
      <td>356</td>
    </tr>
    <tr>
      <td><acronym title="MAchine Readable Cataloging">MARC</acronym> Binary</td>
      <td>36</td>
      <td>23</td>
      <td>23</td>
      <td>25</td>
    </tr>
    <tr>
      <td>JSON/ NDJ</td>
      <td>31</td>
      <td>19</td>
      <td>25</td>
      <td>ERROR</td>
    </tr>
  </table>

<h3>Unmarshalling speed (from pre-created file)</h3>

<p>Again, times are in seconds</p>

<table class="grid">
    <tr>
      <th>Format</th>
      <th>Ruby 1.8.7</th>
      <th>Ruby 1.9</th>
      <th>JRuby 1.4</th>
      <th>JRuby 1.4 --1.9</th>
    </tr>
    <tr>
      <td><acronym title="Extensible Markup Language">XML</acronym></td>
      <td>113</td>
      <td>89</td>
      <td>75</td>
      <td>89</td>
    </tr>
    <tr>
      <td><acronym title="MAchine Readable Cataloging">MARC</acronym> Binary</td>
      <td>29</td>
      <td>16</td>
      <td>16</td>
      <td>19</td>
    </tr>
    <tr>
      <td>JSON/ NDJ</td>
      <td>17</td>
      <td>9</td>
      <td>13</td>
      <td>16</td>
    </tr>
  </table>

<h2>And so...</h2>

<p>I'm not sure what else to say. The format is totally brain-dead. It round-trips. It's fast enough. It has no length limitations. We define it as always being UTF-8. The NDJ format is easy to understand and allows streaming of large document collections.</p>

<p>If folks are interested in implementing this across other libraries, that'd be great. Any thoughts?</p>]]></content:encoded>
			<wfw:commentRss>http://robotlibrarian.billdueber.com/new-interest-in-marc-hash-json/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
		<item>
		<title>OCLC still not (NO! They are!) normalizing their LCCNs</title>
		<link>http://robotlibrarian.billdueber.com/oclc-still-not-normalizing-their-lccns/</link>
		<comments>http://robotlibrarian.billdueber.com/oclc-still-not-normalizing-their-lccns/#comments</comments>
		<pubDate>Thu, 18 Feb 2010 14:58:33 +0000</pubDate>
		<dc:creator>Bill</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://robotlibrarian.billdueber.com/?p=198</guid>
		<description><![CDATA[NOTE 2: It turns out that I did find a minor bug in the system, but that in general LCCN normalization is working correctly. I just happened to hit a weirdness with a bad LCCN and a little bug in the parser on their end. Which is getting fixed. So&#8230;good news all around, and huge [...]]]></description>
			<content:encoded><![CDATA[<blockquote>NOTE 2: It turns out that I did find a minor bug in the system, but that in general LCCN normalization is working correctly. I just happened to hit a weirdness with a bad LCCN and a little bug in the parser on their end. Which is getting fixed. So&#8230;good news all around, and huge kudos to 
Xiaoming Liu for his quick response!
</blockquote>

<blockquote>**NOTE** It strikes me that I haven&#8217;t seen a case where bad data results from sending a valid LCCN. The only verified problem is one of false negatives. Send a valid lccn, you&#8217;ll get back either good data or nothing (and the &#8220;nothing&#8221; might be in error). So, still a big problem, but not as THESKYISFALLING as I imply below.</blockquote>

<hr />

<p>A <a href="http://bibwild.wordpress.com/2009/03/11/normalize-your-lccns/">long time ago</a>, Jonathan Rochkind noted that the <acronym title="Online Computer Library Center">OCLC</acronym> doesn&#8217;t correctly <a href="http://www.loc.gov/marc/lccn-namespace.html">normalize their LCCNs</a>.</p>

<p>Well, it&#8217;s not fixed.</p>

<p>I could really, <em>really</em> use the xlccn service right about now &#8212; a great web service they provide that, much like xisbn and xissn and the other xXXXX (heh!) services, purports to allow you to put in an lccn and get data back on the item you&#8217;re interested in.</p>

<p>Except they &#8220;normalize&#8221; their LCCNs in a way that is not only incorrect, but causes namespace collisions. As near as I can tell, they throw out any leading non-digits and only keep up to the next non-digit.</p>

<p><em>The xLCCN service will silently provide no data or <strong>incorrect data</strong> for many LCCN requests!</em></p>

<p>An example:</p>

<ul>
<li>(F) Full LCCN is &#8220;sn 83011407&#8221;</li>
<li>(D) First set of digits is &#8220;83011407&#8221;. This is what I think the <acronym title="Online Computer Library Center">OCLC</acronym> is indexing.</li>
<li>(N) Correct normalization is &#8220;sn83011407&#8221;</li>
</ul>

<p>The problem, of course, is that (D) &#8220;83011407&#8221; <em>is itself a valid LCCN</em>.</p>
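
<p>To make the collision concrete, here is a minimal Ruby sketch. <code>normalize_lccn</code> follows my reading of the Library of Congress normalization rules linked above (strip blanks, drop a slash and everything after it, remove the hyphen and zero-pad the serial to six digits). <code>first_digit_run</code> is only a guess at what the xLCCN service appears to do, based on the behavior described in this post; it is not anything the OCLC has documented. Both method names are made up for illustration.</p>

```ruby
# Sketch of LCCN normalization per the LoC rules (as I read them).
def normalize_lccn(raw)
  lccn = raw.gsub(/\s+/, '')       # 1. remove all blanks
  lccn = lccn.sub(%r{/.*\z}, '')   # 2. drop a forward slash and everything after it
  prefix, hyphen, serial = lccn.partition('-')
  return lccn if hyphen.empty?     # no hyphen: nothing more to do
  prefix + serial.rjust(6, '0')    # 3. remove the hyphen, left-pad the serial to 6 digits
end

# ASSUMPTION: a guess at the observed behavior, i.e. throw away leading
# non-digits and keep only the first run of digits.
def first_digit_run(raw)
  raw[/\d+/]
end

full = 'sn 83011407'
puts normalize_lccn(full)   # sn83011407
puts first_digit_run(full)  # 83011407
```

<p>The second result is exactly the string that is also a valid LCCN in its own right, which is how two different records end up under the same key.</p>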

<ul>
<li>(F) is associated with <acronym title="Online Computer Library Center">OCLC</acronym># 47212967</li>
<li>(D) is associated with <acronym title="Online Computer Library Center">OCLC</acronym># 12505148. That&#8217;s <em>not the same record</em>.</li>
</ul>

<p>So, how do the <acronym title="Online Computer Library Center">OCLC</acronym> services respond?</p>

<ul>
<li>(F) WorldCat search finds the correct record (probably just doing a string match); xid finds nothing.</li>
<li>(D) WorldCat finds both the correct and incorrect records. The xLCCN service finds <em>only</em> the incorrect record, <acronym title="Online Computer Library Center">OCLC</acronym># 12505148.</li>
<li>(N) Neither WorldCat nor xid finds anything for the correctly normalized version.</li>
</ul>

<p>So, what am I supposed to do? Only use the service on LCCNs where the original
and normalized versions are the same and include only digits? Frustrating.</p>]]></content:encoded>
			<wfw:commentRss>http://robotlibrarian.billdueber.com/oclc-still-not-normalizing-their-lccns/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Indexing data into Solr via JRuby (with threads!)</title>
		<link>http://robotlibrarian.billdueber.com/indexing-data-into-solr-via-jruby-with-threads/</link>
		<comments>http://robotlibrarian.billdueber.com/indexing-data-into-solr-via-jruby-with-threads/#comments</comments>
		<pubDate>Tue, 16 Feb 2010 19:43:52 +0000</pubDate>
		<dc:creator>Bill</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://robotlibrarian.billdueber.com/?p=190</guid>
		<description><![CDATA[[Note: in this post I'm just going to focus on the "get stuff into Solr" part. My normal focus -- MARC data -- will
make an appearance in the next post when I talk about using this in addition to / instead of solrmarc.]

Working with Solr

I love me the Solr. I love everything about it except [...]]]></description>
			<content:encoded><![CDATA[<p>[Note: in this post I'm just going to focus on the "get stuff into <acronym title="Solr isn't an acronym, silly!">Solr</acronym>" part. My normal focus -- <acronym title="MAchine Readable Cataloging">MARC</acronym> data -- will
make an appearance in the next post when I talk about using this in addition to / instead of solrmarc.]</p>

<h2>Working with <acronym title="Solr isn't an acronym, silly!">Solr</acronym></h2>

<p>I love me the <acronym title="Solr isn't an acronym, silly!">Solr</acronym>. I love everything about it except that the best way to interact with it is via Java. I don&#8217;t so much love me the Java.</p>

<p>So&#8230;taking Erik Hatcher&#8217;s lead and advice, as I will do whenever he offers either, I wrote some code to work within JRuby to deal with <acronym title="Solr isn't an acronym, silly!">Solr</acronym>.</p>

<h2>Getting the code</h2>

<p>I&#8217;ve added the gems to gemcutter, if you want to play along at home:</p>

<ul>
<li><strong>jruby&#95;producer&#95;consumer</strong> (<a href="http://github.com/billdueber/jruby_producer_consumer">github</a>, <a href="http://rdoc.info/projects/billdueber/jruby_producer_consumer">rdoc.info</a>) Ruby syntax for threaded operations under jruby</li>
<li><strong>jruby&#95;streaming&#95;update&#95;solr&#95;server</strong> (<a href="http://github.com/billdueber/jruby_streaming_update_solr_server">github</a>, <a href="http://rdoc.info/projects/billdueber/jruby_streaming_update_solr_server">rdoc.info</a>) Ruby syntax on top of the Java class of the same name</li>
<li><strong>marc4j4r</strong> (<a href="http://github.com/billdueber/marc4j4r">github</a>, <a href="http://rdoc.info/projects/billdueber/marc4j4r">rdoc.info</a>) Ruby syntax on top of the marc4j java library.</li>
</ul>

<p><em>WARNING</em>: None of these gems have a 1.0 version tag on them, and that means that the <acronym title="Application Programming Interface">API</acronym> may change a titch in 
  the future. Also, the fact that they&#8217;re released as gems means that it&#8217;s easy to release gems, not that I&#8217;m not
  an idiot.</p>

<h2>The basics: Using SolrInputDocument and StreamingUpdateSolrServer</h2>

<p>OK, with the disclaimer out of the way, let&#8217;s look at some code.</p>

<div class="geshi no ruby"><ol><li class="li1"><div class="de1">&nbsp; <span class="kw3">require</span> <span class="st0">&#39;rubygems&#39;</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="kw3">require</span> <span class="st0">&#39;jruby_streaming_update_solr_server&#39;</span></div></li>
<li class="li1"><div class="de1">&nbsp;</div></li>
<li class="li1"><div class="de1">&nbsp; solrurl = <span class="st0">&#39;http://your.solr.server:port/solr&#39;</span></div></li>
<li class="li1"><div class="de1">&nbsp; sussqueuesize = <span class="nu0">24</span> <span class="co1"># how many items to buffer on their way to solr</span></div></li>
<li class="li1"><div class="de1">&nbsp; sussthreads = <span class="nu0">1</span> &nbsp; <span class="co1"># how many threads to use to send stuff to solr</span></div></li>
<li class="li1"><div class="de1">&nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; suss = StreamingUpdateSolrServer.<span class="me1">new</span><span class="br0">&#40;</span>solrurl,sussqueuesize,sussthreads<span class="br0">&#41;</span></div></li>
<li class="li1"><div class="de1">&nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; <span class="co1"># Let&#39;s add a simple document via a hash: A title, three authors, and a year</span></div></li>
<li class="li1"><div class="de1">&nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; h = <span class="br0">&#123;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; <span class="re3">:title</span> <span class="sy0">=&gt;</span> <span class="st0">&quot;Never been deader&quot;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; <span class="re3">:author</span> <span class="sy0">=&gt;</span> <span class="br0">&#91;</span><span class="st0">&#39;Bill&#39;</span>, <span class="st0">&#39;Mike&#39;</span>, <span class="st0">&#39;Molly&#39;</span><span class="br0">&#93;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; <span class="re3">:year</span> <span class="sy0">=&gt;</span> <span class="nu0">2003</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="br0">&#125;</span></div></li>
<li class="li1"><div class="de1">&nbsp; suss <span class="sy0">&lt;&lt;</span> h</div></li>
<li class="li1"><div class="de1">&nbsp; suss.<span class="me1">commit</span></div></li>
<li class="li1"><div class="de1">&nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; <span class="co1"># YEA! You just added a document to solr and committed it. </span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="co1"># Have a cookie!</span></div></li>
<li class="li1"><div class="de1">&nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; <span class="co1"># We can also use a document object to do the same thing</span></div></li>
<li class="li1"><div class="de1">&nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; doc = SolrInputDocument.<span class="me1">new</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="co1"># Add the title</span></div></li>
<li class="li1"><div class="de1">&nbsp; doc <span class="sy0">&lt;&lt;</span> <span class="br0">&#91;</span><span class="st0">&#39;title&#39;</span>, <span class="st0">&#39;Never been deader&#39;</span><span class="br0">&#93;</span></div></li>
<li class="li1"><div class="de1">&nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; <span class="co1"># Add the first author</span></div></li>
<li class="li1"><div class="de1">&nbsp; doc <span class="sy0">&lt;&lt;</span> <span class="br0">&#91;</span><span class="re3">:author</span>, <span class="st0">&#39;Bill&#39;</span><span class="br0">&#93;</span></div></li>
<li class="li1"><div class="de1">&nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; <span class="co1"># Add more. Re-used keys mean you&#39;re adding additional values</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="co1"># Note values can be scalars or arrays</span></div></li>
<li class="li1"><div class="de1">&nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; doc <span class="sy0">&lt;&lt;</span> <span class="br0">&#91;</span><span class="re3">:author</span>, <span class="br0">&#91;</span><span class="st0">&#39;Mike&#39;</span>, <span class="st0">&#39;Molly&#39;</span><span class="br0">&#93;</span><span class="br0">&#93;</span></div></li>
<li class="li1"><div class="de1">&nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; <span class="co1"># Add the wrong year using [] syntax</span></div></li>
<li class="li1"><div class="de1">&nbsp; doc<span class="br0">&#91;</span><span class="re3">:year</span><span class="br0">&#93;</span> = <span class="nu0">2001</span></div></li>
<li class="li1"><div class="de1">&nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; <span class="co1"># Oops! fix it. []= overwrites existing value(s)</span></div></li>
<li class="li1"><div class="de1">&nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; doc<span class="br0">&#91;</span><span class="re3">:year</span><span class="br0">&#93;</span> = <span class="nu0">2003</span></div></li>
<li class="li1"><div class="de1">&nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; <span class="co1"># Finally, we can merge a hash (or anything else that responds to </span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="co1"># &#39;each_pair&#39; with key-value pairs) into an existing doc</span></div></li>
<li class="li1"><div class="de1">&nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; doc.<span class="me1">merge</span>!<span class="br0">&#40;</span><span class="br0">&#123;</span><span class="st0">&#39;author&#39;</span> <span class="sy0">=&gt;</span> <span class="st0">&#39;Ringo Starrre&#39;</span>, <span class="st0">&#39;publisher&#39;</span> <span class="sy0">=&gt;</span> <span class="st0">&#39;Vanity Books&#39;</span><span class="br0">&#125;</span><span class="br0">&#41;</span></div></li>
<li class="li1"><div class="de1">&nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; <span class="co1"># Add it</span></div></li>
<li class="li1"><div class="de1">&nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; suss <span class="sy0">&lt;&lt;</span> doc</div></li>
<li class="li1"><div class="de1">&nbsp;</div></li>
<li class="li1"><div class="de1">&nbsp; <span class="co1"># Commit and optimize if you&#39;d like</span></div></li>
<li class="li1"><div class="de1">&nbsp;</div></li>
<li class="li1"><div class="de1">&nbsp; suss.<span class="me1">commit</span></div></li>
<li class="li1"><div class="de1">&nbsp; suss.<span class="me1">optimize</span> <span class="co1"># if you want</span></div></li></ol></div>

<p>Nothing really fancy in there &#8212; just a few things worth noting:</p>

<ul>
<li>An suss object will take a hash (again, anything that responds to <code>#each_pair</code>) or a SolrInputDocument</li>
<li>You can use either strings or symbols to represent <acronym title="Solr isn't an acronym, silly!">Solr</acronym> field names</li>
<li>Values can be either a single value, or an array of multiple values</li>
</ul>

<p>And there are three ways to get data into a doc:</p>

<ul>
<li>Via <code>&lt;&lt; [field, value(s)]</code> (additive)</li>
<li>Via <code>doc.merge! hash</code> (additive)</li>
<li>Via <code>doc[field] = value</code> (replaces)</li>
</ul>
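
<p>To pin down what "additive" versus "replaces" means, here's a tiny plain-Ruby stand-in. <code>SketchDoc</code> is a hypothetical class made up for illustration. It is not the gem's SolrInputDocument wrapper; it only mimics the behavior described in the list above, on the assumption that every field holds a list of values.</p>

```ruby
# Hypothetical stand-in that mimics the three ways of adding data.
# NOT the real SolrInputDocument; just an illustration of the semantics.
class SketchDoc
  def initialize
    @fields = Hash.new { |hash, key| hash[key] = [] }
  end

  # doc << [field, value(s)] appends to whatever is already there
  def <<(pair)
    field, values = pair
    @fields[field.to_s].concat(Array(values))
    self
  end

  # merge! appends every key/value pair from anything with #each_pair
  def merge!(hash)
    hash.each_pair { |field, values| self << [field, values] }
    self
  end

  # doc[field] = value(s) throws away existing values first
  def []=(field, values)
    @fields[field.to_s] = Array(values)
  end

  # Strings and symbols name the same field
  def [](field)
    @fields[field.to_s]
  end
end

doc = SketchDoc.new
doc << [:author, 'Bill']
doc << ['author', ['Mike', 'Molly']]
doc[:year] = 2001
doc[:year] = 2003
doc.merge!('author' => 'Ringo', 'publisher' => 'Vanity Books')

p doc[:author] # ["Bill", "Mike", "Molly", "Ringo"]
p doc[:year]   # [2003]
```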

<h2>Adding Threads</h2>

<p>I also went down the garden path of threading things. There are an awful lot
of operations that are not thread-safe (e.g., reading a line from a file), but
once you&#8217;ve got a bunch of records to work with, turning them into <acronym title="Solr isn't an acronym, silly!">Solr</acronym>
documents is usually thread-safe.</p>

<p>My model is that there&#8217;s a producer (usually the method <code>#each</code>) from an 
underlying data object. A thread takes whatever that method yields
and sticks the values into a Java 
BlockingQueue awaiting consumption. You then use <code>ProducerConsumer#threaded_each</code>
(or <code>ProducerConsumer#threaded_each_with_index</code>) to pull items out of the queue and do something useful with them.</p>
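
<p>As a rough picture of that model, here is a sketch in plain Ruby. It is not the gem's code: it swaps the Java BlockingQueue for Ruby's built-in <code>SizedQueue</code>, and <code>threaded_each_sketch</code> is a hypothetical helper name. One producer thread walks <code>#each</code> and fills a bounded queue; several consumer threads drain it.</p>

```ruby
require 'thread' # no-op on modern Rubies; Queue and SizedQueue are built in

# Hypothetical helper, not part of jruby_producer_consumer.
def threaded_each_sketch(eachable, consumer_count, queue_size = 5)
  queue = SizedQueue.new(queue_size) # producer blocks when the buffer is full
  done = Object.new                  # sentinel telling a consumer to stop

  # Single producer: the only thread that touches the underlying #each
  producer = Thread.new do
    eachable.each { |item| queue.push(item) }
    consumer_count.times { queue.push(done) } # one sentinel per consumer
  end

  # Consumers: pull items off the queue and hand them to the caller's block
  consumers = Array.new(consumer_count) do
    Thread.new do
      until (item = queue.pop).equal?(done)
        yield item
      end
    end
  end

  producer.join
  consumers.each { |thread| thread.join }
end

squares = Queue.new # thread-safe collector; arrival order is not guaranteed
threaded_each_sketch(1..10, 3) { |n| squares.push(n * n) }
puts squares.size # 10
```

<p>Under regular Ruby this won't buy much parallelism; the point is only to show the shape of the thing.</p>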

<p>I extracted stuff into a library (jruby&#95;producer&#95;consumer) for your viewing pleasure.</p>

<p><strong>CONFUSION ALERT</strong>: It&#8217;s perhaps unfortunate that the object you send to <code>ProducerConsumer.new(obj)</code> must implement <code>#each</code> and that the ProducerConsumer method <code>#threaded_each</code> calls that underlying <code>#each</code>&#8230;well
there&#8217;s a lot of <code>#each</code>&#8217;s floating around. Keep them straight.</p>

<p>So&#8230;let&#8217;s look at some code to work with consumer threads.</p>

<div class="geshi no ruby"><ol><li class="li1"><div class="de1">&nbsp; <span class="co1"># Start off the same as before</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="kw3">require</span> <span class="st0">&#39;rubygems&#39;</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="kw3">require</span> <span class="st0">&#39;jruby_streaming_update_solr_server&#39;</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="kw3">require</span> <span class="st0">&#39;jruby_producer_consumer&#39;</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="kw3">require</span> <span class="st0">&#39;marc4j4r&#39;</span></div></li>
<li class="li1"><div class="de1">&nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; solrurl = <span class="st0">&#39;http://your.solr.server:port/solr&#39;</span></div></li>
<li class="li1"><div class="de1">&nbsp; sussqueuesize = <span class="nu0">24</span> <span class="co1"># how many items to buffer on their way to solr</span></div></li>
<li class="li1"><div class="de1">&nbsp; sussthreads = <span class="nu0">2</span> &nbsp; <span class="co1"># how many threads to use to send stuff to solr</span></div></li>
<li class="li1"><div class="de1">&nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; suss = StreamingUpdateSolrServer.<span class="me1">new</span><span class="br0">&#40;</span>solrurl,sussqueuesize,sussthreads<span class="br0">&#41;</span></div></li>
<li class="li1"><div class="de1">&nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; <span class="co1"># I&#39;ll go ahead and use a <acronym title="MAchine Readable Cataloging">MARC</acronym> file as my example, but won&#39;t talk about the</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="co1"># <acronym title="MAchine Readable Cataloging">MARC</acronym> parts of it. All you need to know is that the reader object</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="co1"># implements #each</span></div></li>
<li class="li1"><div class="de1">&nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; reader = MARC4J4R.<span class="me1">reader</span><span class="br0">&#40;</span><span class="st0">&#39;test.xml&#39;</span>, <span class="re3">:marcxml</span><span class="br0">&#41;</span></div></li>
<li class="li1"><div class="de1">&nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; <span class="co1"># Get a producer/consumer object with the reader at its base, using</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="co1"># the default method #each to get stuff out of it, and with the assumption</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="co1"># that we only need to keep the default 5 items in memory at a time to </span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="co1"># keep up with consumption</span></div></li>
<li class="li1"><div class="de1">&nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; pc = ProducerConsumer.<span class="me1">new</span><span class="br0">&#40;</span>reader<span class="br0">&#41;</span></div></li>
<li class="li1"><div class="de1">&nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; <span class="co1"># Get three threads to actually consume the things, turn them into solr </span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="co1"># documents, and send them to solr (potentially out of order)</span></div></li>
<li class="li1"><div class="de1">&nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; numconsumerthreads = <span class="nu0">3</span></div></li>
<li class="li1"><div class="de1">&nbsp; pc.<span class="me1">threaded_each</span><span class="br0">&#40;</span>numconsumerthreads<span class="br0">&#41;</span>.<span class="me1">each</span> <span class="kw1">do</span> <span class="sy0">|</span>r<span class="sy0">|</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; suss <span class="sy0">&lt;&lt;</span> turn_marc_record_into_a_hash_or_solrdoc<span class="br0">&#40;</span>r<span class="br0">&#41;</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="kw1">end</span></div></li>
<li class="li1"><div class="de1">&nbsp; suss.<span class="me1">commit</span></div></li></ol></div>

<p>Again, not a lot happening here.</p>

<ul>
<li>The &#8220;producer&#8221; is always one thread, because so little is thread-safe at the &#8216;each&#8217; level. In this case, there&#8217;s a single thread pulling data out of the file and turning it into <acronym title="MAchine Readable Cataloging">MARC</acronym> records, which are added to the internal BlockingQueue. I buffer 5 of these at a pop (the default) so the consumer threads don&#8217;t starve. I presume that producing items is cheaper than consuming them, or else this library won&#8217;t help you much. </li>
<li><code>ProducerConsumer#threaded_each</code> calls the <code>#each</code> method of the underlying object. You can substitute anything that yields, though, as in this example where I call <code>#each_line</code> instead of the default <code>#each</code>:</li>
</ul>

<div class="geshi no ruby"><ol><li class="li1"><div class="de1">&nbsp; queuesize = <span class="nu0">5</span></div></li>
<li class="li1"><div class="de1">&nbsp; pc = ProducerConsumer.<span class="me1">new</span><span class="br0">&#40;</span><span class="kw4">File</span>.<span class="me1">new</span><span class="br0">&#40;</span><span class="st0">&#39;myfile.txt&#39;</span><span class="br0">&#41;</span>, queuesize, <span class="re3">:each_line</span><span class="br0">&#41;</span></div></li></ol></div>

<ul>
<li>Keep track of your threads. In the <acronym title="MAchine Readable Cataloging">MARC</acronym> example above, there is one thread getting <acronym title="MAchine Readable Cataloging">MARC</acronym> records and putting them into the PC buffer (no way to change that), three threads consuming those records and sticking them into the <code>suss</code> object, and another two pulling stuff <em>out</em> of the <code>suss</code> object and sending things to <acronym title="Solr isn't an acronym, silly!">Solr</acronym>. And, of course, there&#8217;s other stuff running on the computer, too. Experiment and figure out what works best for your hardware.</li>
<li>See the docs for how to mess with what goes into a ProducerConsumer object. It&#8217;s entirely possible to use, say, <code>#each_slice</code>. There&#8217;s also a convenience method <code>#threaded_each_with_index</code>, but it does <em>not</em> call the underlying <code>#each_with_index</code>; it produces its own index as things are read. </li>
</ul>
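
<p>Here's one way those last two points could fit together, again as a plain-Ruby sketch and not the gem's implementation. <code>indexed_consume_sketch</code> is a hypothetical name. It assumes the iteration method is passed as a symbol and that the index is attached by the producer as each item is read, so it reflects read order even though consumers may finish out of order.</p>

```ruby
require 'stringio'

# Hypothetical helper, not part of jruby_producer_consumer.
def indexed_consume_sketch(source, method_name, consumer_count, queue_size = 5)
  queue = SizedQueue.new(queue_size)
  done = Object.new # sentinel telling a consumer to stop

  producer = Thread.new do
    index = 0
    # Call whatever yielding method was asked for (:each, :each_line, ...)
    source.public_send(method_name) do |item|
      queue.push([item, index]) # index assigned here, in read order
      index += 1
    end
    consumer_count.times { queue.push(done) }
  end

  consumers = Array.new(consumer_count) do
    Thread.new do
      until (pair = queue.pop).equal?(done)
        yield(*pair) # block receives the item and its index
      end
    end
  end

  producer.join
  consumers.each { |thread| thread.join }
end

lines = StringIO.new("alpha\nbeta\ngamma\n") # stands in for File.new('myfile.txt')
seen = Queue.new
indexed_consume_sketch(lines, :each_line, 2) { |line, i| seen.push([i, line.chomp]) }
puts seen.size # 3
```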

<h2>Feedback not only welcome but necessary!</h2>

<p>I&#8217;ve done a lot of messing around with Ruby in the last 10 days or so, but I&#8217;m still basically converting from <acronym title="Practical Extraction and Report Language">Perl</acronym> in my head. Any comments, bug reports, or whatnot are definitely welcome!</p>]]></content:encoded>
			<wfw:commentRss>http://robotlibrarian.billdueber.com/indexing-data-into-solr-via-jruby-with-threads/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>jruby_producer_consumer dead-simple producer/consumer for JRuby</title>
		<link>http://robotlibrarian.billdueber.com/jruby_producer_consumer-dead-simple-producerconsumer-for-jruby/</link>
		<comments>http://robotlibrarian.billdueber.com/jruby_producer_consumer-dead-simple-producerconsumer-for-jruby/#comments</comments>
		<pubDate>Fri, 05 Feb 2010 19:46:46 +0000</pubDate>
		<dc:creator>Bill</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://robotlibrarian.billdueber.com/?p=188</guid>
		<description><![CDATA[Yea! My first gem ever released!

[YUCK! It was a disaster in a few ways! Don't look at this! It's hideous! There's a new jruby_producer_consumer gem on gemcutter that is slightly different from this in that it works. Ignore the stuff below.]

[In working on a threaded JRuby-based MARC-to-Solr project, I realized that my threading stuff was...ugly. [...]]]></description>
			<content:encoded><![CDATA[<p>Yea! My first gem ever released!</p>

<p style="color: red">[YUCK! It was a disaster in a few ways! Don't look at this! It's hideous! There's a new <a href="http://rdoc.info/projects/billdueber/jruby_producer_consumer">jruby_producer_consumer</a> gem on gemcutter that is slightly different from this in that it works. Ignore the stuff below.]</p>

<p>[In working on a threaded JRuby-based <acronym title="MAchine Readable Cataloging">MARC</acronym>-to-<acronym title="Solr isn't an acronym, silly!">Solr</acronym> project, I realized that my threading stuff was...ugly. And
I didn't really understand it. So I dug in today and wrote this.]</p>

<p>I&#8217;ve just pushed my first gem to <a href="http://gemcutter.org/">Gemcutter</a>: <a href="http://gemcutter.org/gems/jruby_producer_consumer">jruby_producer_consumer</a>, a <a href="http://jruby.org/">JRuby</a>-only
producer/consumer class that works with anything that provides <em>#each</em>.</p>

<p>It&#8217;s JRuby-only because it uses (a) a blocking queue implementation that&#8217;s native Java, and (b) threading, which isn&#8217;t 
a huge win under regular Ruby.</p>

<p>There&#8217;s no testing there because I&#8217;m not sure how to test threaded stuff <img src='http://robotlibrarian.billdueber.com/wp-includes/images/smilies/icon_sad.gif' alt=':-(' class='wp-smiley' /> </p>

<p>It is, I hope, easy to use:</p>

<div class="geshi no ruby"><ol><li class="li1"><div class="de1">&nbsp; &nbsp;<span class="kw3">require</span> <span class="st0">&#39;rubygems&#39;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp;<span class="kw3">require</span> <span class="st0">&#39;jruby_producer_consumer&#39;</span></div></li>
<li class="li1"><div class="de1">&nbsp;</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp;<span class="co1"># Create a ProducerConsumer. Arguments are anything that implements #each</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp;<span class="co1"># and the size for the underlying queue. For the former, I&#39;ll just use a Range object.</span></div></li>
<li class="li1"><div class="de1">&nbsp;</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp;eachable = <span class="nu0">1</span>..<span class="nu0">10</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp;queuesize = <span class="nu0">3</span></div></li>
<li class="li1"><div class="de1">&nbsp;</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp;pc = ProducerConsumer.<span class="me1">new</span><span class="br0">&#40;</span>eachable, queuesize<span class="br0">&#41;</span></div></li>
<li class="li1"><div class="de1">&nbsp;</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp;<span class="co1"># Just a method to show what happens</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp;<span class="kw1">def</span> sample <span class="br0">&#40;</span>consumerid, x<span class="br0">&#41;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp;<span class="kw3">puts</span> <span class="st0">&quot;Consumer #{consumerid}: consuming #{x}&quot;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp;<span class="kw3">sleep</span> <span class="nu0">1</span> <span class="co1"># otherwise this&#39;ll finish before I can create multiple consumers</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp;<span class="kw1">end</span></div></li>
<li class="li1"><div class="de1">&nbsp;</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp;<span class="co1"># Create three consumers. You can pass any number of args to</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp;<span class="co1"># #consumer, and must pass a block whose arguments are the</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp;<span class="co1"># object returned by eachable#each and those args back.</span></div></li>
<li class="li1"><div class="de1">&nbsp;</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp;<span class="br0">&#91;</span><span class="st0">&#39;A&#39;</span>, <span class="st0">&#39;B&#39;</span>, <span class="st0">&#39;C&#39;</span><span class="br0">&#93;</span>.<span class="me1">each</span> <span class="kw1">do</span> <span class="sy0">|</span>consumerid<span class="sy0">|</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp;pc.<span class="me1">consumer</span><span class="br0">&#40;</span>consumerid<span class="br0">&#41;</span> <span class="kw1">do</span> <span class="sy0">|</span>x, consumerid<span class="sy0">|</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp;sample<span class="br0">&#40;</span>consumerid, x<span class="br0">&#41;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp;<span class="kw1">end</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp;<span class="kw1">end</span></div></li>
<li class="li1"><div class="de1">&nbsp;</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp;<span class="co1"># OUTPUT</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp;<span class="co1"># Consumer A: consuming 1</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp;<span class="co1"># Consumer B: consuming 2</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp;<span class="co1"># Consumer C: consuming 3</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp;<span class="co1"># Consumer A: consuming 4</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp;<span class="co1"># Consumer B: consuming 5</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp;<span class="co1"># Consumer C: consuming 6</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp;<span class="co1"># Consumer B: consuming 7</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp;<span class="co1"># Consumer A: consuming 8</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp;<span class="co1"># Consumer C: consuming 9</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp;<span class="co1"># Consumer B: consuming 10</span></div></li></ol></div>]]></content:encoded>
			<wfw:commentRss>http://robotlibrarian.billdueber.com/jruby_producer_consumer-dead-simple-producerconsumer-for-jruby/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

