<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Robot Librarian &#187; Bill</title>
	<atom:link href="http://robotlibrarian.billdueber.com/author/bill/feed/" rel="self" type="application/rss+xml" />
	<link>http://robotlibrarian.billdueber.com</link>
	<description>Disclaimer: I'm not actually a robot.</description>
	<lastBuildDate>Thu, 01 Dec 2011 16:37:53 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>Solr and boolean operators</title>
		<link>http://robotlibrarian.billdueber.com/solr-and-boolean-operators/</link>
		<comments>http://robotlibrarian.billdueber.com/solr-and-boolean-operators/#comments</comments>
		<pubDate>Thu, 01 Dec 2011 16:13:09 +0000</pubDate>
		<dc:creator>Bill</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://robotlibrarian.billdueber.com/?p=402</guid>
		<description><![CDATA[[Summary: ALWAYS ALWAYS ALWAYS USE PARENTHESES TO GROUP BOOLEANS IN SOLR!!!] What does Solr do, given the following query? a OR b AND c I&#8217;ll give you three guesses, but you&#8217;ll get the first two wrong and won&#8217;t have any idea how to generate a third, so don&#8217;t spend too much time on it. Boolean [...]]]></description>
			<content:encoded><![CDATA[<p>[Summary: <strong>ALWAYS ALWAYS ALWAYS USE PARENTHESES TO GROUP BOOLEANS IN SOLR!!!</strong>]</p>

<p>What does <acronym title="Solr isn't an acronym, silly!">Solr</acronym> do, given the following query?</p>

<p><pre>
  a OR b AND c
</pre></p>

<p>I&#8217;ll give you three guesses, but you&#8217;ll get the first two wrong
and won&#8217;t have any idea how to generate a third, so don&#8217;t spend too much time on it.</p>

<h2>Boolean algebra and operator precedence</h2>

<p>Anyone who&#8217;s had even a passing introduction to boolean algebra knows that it specifies a strict order in which the operators bind: NOT before AND before OR. So, one might expect the following grouping:</p>

<p><pre>
  a OR (b AND c)
</pre></p>

<p>That&#8217;s guess one. It&#8217;s not how <acronym title="Solr isn't an acronym, silly!">Solr</acronym> does it.</p>

<h2>Left to right?</h2>

<p>Some naive students, and at least one programming language (Smalltalk), do a simple left-to-right evaluation. So you might go with:</p>

<p><pre>
  (a OR b) AND c
</pre></p>

<p>Nope. Wrong again.</p>

<h2>So what&#8217;s left???</h2>

<p>Excellent question. I don&#8217;t know the code well enough to know what&#8217;s going on underneath, but here&#8217;s what we get under the lucene query parser.</p>

<p><pre>
    (b AND c)
</pre></p>

<p>That&#8217;s right. The first term is thrown away. (More correctly, the first term is deemed &#8220;optional&#8221;.)</p>

<h2>Do you let your users put AND/OR/NOT in their queries?</h2>

<p>Hopefully, they don&#8217;t know any boolean algebra. If they do, hopefully they use parentheses, or you parse it out for them. And if not, well, they&#8217;re gonna be pretty damn confused.</p>

<h2>It gets weirder</h2>

<p>I populated a fresh solr (3.5) index with all possible subsets of the strings &#8220;curly&#8221;, &#8220;larry&#8221;, &#8220;moe&#8221;, and &#8220;shemp&#8221; (not Joe. Don&#8217;t talk to me about Joe). There are 15 of them, from the one-item &#8216;curly&#8217; to all four at once.</p>

<p>I wrote a script to run a set of queries against the index under both lucene and edismax to see what I would get. In all cases the default lucene operator is &#8216;AND&#8217; and the edismax mm parameter is set to 100% (equivalent to &#8220;all required&#8221;).</p>
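<p>For the curious, here&#8217;s a rough sketch of the setup, not the actual script. It builds the 15 documents and the request URL for each parser. The Solr URL and the <code>query_url</code> helper are made up for illustration; <code>defType</code>, <code>q.op</code>, <code>mm</code>, and <code>debugQuery</code> are the usual Solr request parameters, but check them against your own version and config.</p>

<pre lang="ruby">
require 'uri'

stooges = %w[curly larry moe shemp]

# Every non-empty subset of the four names: 4 + 6 + 4 + 1 = 15 documents
docs = (1..stooges.length).flat_map { |n| stooges.combination(n).to_a }
puts docs.length   # 15

# Hypothetical location of a local Solr instance
SOLR_SELECT = 'http://localhost:8983/solr/select'

# Build the request URL for one query under one parser ('lucene' or 'edismax')
def query_url(q, parser)
  params = { 'q' => q, 'defType' => parser, 'debugQuery' => 'true' }
  params['q.op'] = 'AND'  if parser == 'lucene'    # default operator AND
  params['mm']   = '100%' if parser == 'edismax'   # all clauses required
  "#{SOLR_SELECT}?#{URI.encode_www_form(params)}"
end

puts query_url('curly AND larry OR moe', 'lucene')
puts query_url('curly AND larry OR moe', 'edismax')
</pre>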

<p><pre>
        Lucene                    EDismax
  --------------------------------------------------------

  1. curly AND larry
        curly larry               curly larry
        curly larry moe           curly larry moe
        curly larry shemp         curly larry shemp
        curly larry moe shemp     curly larry moe shemp

  2. curly AND larry OR moe
        curly                     curly larry
        curly larry               curly larry moe
        curly moe                 curly larry shemp
        curly shemp               curly larry moe shemp
        curly larry moe
        curly larry shemp
        curly moe shemp
        curly larry moe shemp

  3. curly OR larry AND moe
        larry moe                 larry moe
        curly larry moe           curly larry moe
        larry moe shemp           larry moe shemp
        curly larry moe shemp     curly larry moe shemp

  4. curly AND larry OR moe AND shemp
        curly moe shemp           curly larry moe shemp
        curly larry moe shemp

  5. moe AND shemp OR curly AND larry
        curly larry moe           curly larry moe shemp
        curly larry moe shemp
</pre></p>

<p>Query 1 is as expected. Query 2 apparently reduces to just &#8216;curly&#8217; under the lucene parser and &#8216;curly AND larry&#8217; under edismax (and query 3 similarly reduces to the two AND&#8217;d words). Queries 4 and 5 are&#8230;well, you can look at the debugQuery output to see what it gets, but not <strong>why</strong>. And then tell me how to explain it to a user.</p>

<h2>Where does this leave us?</h2>

<p>The good news is that both lucene and edismax behave predictably when you use parentheses for grouping. So do that.</p>
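<p>If your users won&#8217;t add the parentheses themselves, you can add them before the query reaches Solr. Here&#8217;s a minimal sketch that applies the textbook precedence (AND binds tighter than OR). The <code>parenthesize</code> name is mine, and it assumes a flat query: whitespace-separated terms, uppercase AND/OR only, no NOT, no phrases, no parentheses already present. A real pre-parser would need to handle all of those.</p>

<pre lang="ruby">
# Wrap each run of AND'd terms in parentheses so the grouping is explicit.
# Assumes a flat query with only uppercase AND / OR between single terms.
def parenthesize(query)
  or_groups = query.strip.split(/\s+OR\s+/).map do |group|
    terms = group.split(/\s+AND\s+/)
    terms.length > 1 ? "(#{terms.join(' AND ')})" : terms.first
  end
  or_groups.join(' OR ')
end

puts parenthesize('a OR b AND c')
# a OR (b AND c)
puts parenthesize('curly AND larry OR moe AND shemp')
# (curly AND larry) OR (moe AND shemp)
</pre>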

<p>I&#8217;m generally not one to complain about open-source software, at least partially because I don&#8217;t have the chops to do anything about it most of the time, but I don&#8217;t understand how this could seem OK to anyone. There are a couple lucene Jira tickets (<a href="https://issues.apache.org/jira/browse/LUCENE-167">Lucene-167</a> and <a href="https://issues.apache.org/jira/browse/LUCENE-1823">Lucene-1823</a>) and a <a href="http://www.mail-archive.com/java-user@lucene.apache.org/msg00008.html">2005 mailing list thread</a> denouncing the current behavior, but it persists.</p>

<p>Until the <acronym title="Solr isn't an acronym, silly!">Solr</acronym>/Lucene powers that be decide to tackle this, the rest of us will either have to write pre-parsers to make sure users get something sensible, or cripple our applications to disallow unrestricted boolean queries.</p>]]></content:encoded>
			<wfw:commentRss>http://robotlibrarian.billdueber.com/solr-and-boolean-operators/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>A short personal note</title>
		<link>http://robotlibrarian.billdueber.com/a-short-personal-note/</link>
		<comments>http://robotlibrarian.billdueber.com/a-short-personal-note/#comments</comments>
		<pubDate>Tue, 11 Oct 2011 14:08:29 +0000</pubDate>
		<dc:creator>Bill</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://robotlibrarian.billdueber.com/?p=377</guid>
		<description><![CDATA[We had another baby. :-) Shai Brown Dueber was born last Monday, the 3rd, at a very moderate 7lbs 7.2oz (his brothers were 9lbs and 9.5lbs). Mother, baby, and older brothers are all doing well. Father is freakin&#8217; tired. &#160; &#160;]]></description>
			<content:encoded><![CDATA[<p>We had another baby. :-)</p>

<p><a href="http://4.bp.blogspot.com/-vauzuyHW3og/To5hGeNC-9I/AAAAAAAAG-k/xEs3THGMvi0/s1600/IMG_6984.JPG"><img class="aligncenter" title="Shai Brown Dueber" src="http://4.bp.blogspot.com/-vauzuyHW3og/To5hGeNC-9I/AAAAAAAAG-k/xEs3THGMvi0/s320/IMG_6984.JPG" alt="Shai Brown Dueber" width="213" height="320" /></a></p>

<p>Shai Brown Dueber was born last Monday, the 3rd, at a very moderate 7lbs 7.2oz (his brothers were 9lbs and 9.5lbs). Mother, baby, and older brothers are all doing well. Father is freakin&#8217; tired.</p>

<p><a href="http://3.bp.blogspot.com/-aS_9RVJ_nNU/To5hGPqqN7I/AAAAAAAAG-c/sHqJbsHPYVM/s1600/IMG_6953.JPG"><img class="aligncenter" title="Ziv, Nadav, and Shai" src="http://3.bp.blogspot.com/-aS_9RVJ_nNU/To5hGPqqN7I/AAAAAAAAG-c/sHqJbsHPYVM/s320/IMG_6953.JPG" alt="Ziv, Nadav, and Shai" width="320" height="240" /></a></p>

<p>&nbsp;</p>

<p>&nbsp;</p>]]></content:encoded>
			<wfw:commentRss>http://robotlibrarian.billdueber.com/a-short-personal-note/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Even better, even simpler multithreading with JRuby</title>
		<link>http://robotlibrarian.billdueber.com/even-better-even-simpler-multithreading-with-jruby/</link>
		<comments>http://robotlibrarian.billdueber.com/even-better-even-simpler-multithreading-with-jruby/#comments</comments>
		<pubDate>Fri, 01 Jul 2011 16:22:55 +0000</pubDate>
		<dc:creator>Bill</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://robotlibrarian.billdueber.com/?p=360</guid>
		<description><![CDATA[[Yes, another post about ruby code; I'll get back to library stuff soon.] Quite a while ago, I released a little gem called threach (for &#8220;threaded #each&#8221;). It allows you to easily process a block with multiple threads. &#160; # Process a CSV file with three threads &#160; File.open&#40;&#39;data.csv&#39;&#41;.threach&#40;3, :each_line&#41; &#123;&#124;line&#124; send_to_db&#40;line&#41;&#125; Nice, right? The [...]]]></description>
			<content:encoded><![CDATA[<p>[Yes, another post about ruby code; I'll get back to library stuff soon.]</p>

<p>Quite a while ago, I released a little gem called <em><a href="https://rubygems.org/gems/threach">threach</a></em> (for &#8220;threaded #each&#8221;). It allows you to easily process a block with multiple threads.</p>

<div class="geshi no ruby"><ol><li class="li1"><div class="de1">&nbsp; <span class="co1"># Process a CSV file with three threads</span></div></li>
<li class="li1"><div class="de1">&nbsp; FIle.<span class="kw3">open</span><span class="br0">&#40;</span><span class="st0">&#39;data.csv&#39;</span><span class="br0">&#41;</span>.<span class="me1">threach</span><span class="br0">&#40;</span><span class="nu0">3</span>, <span class="re3">:each_line</span><span class="br0">&#41;</span> <span class="br0">&#123;</span><span class="sy0">|</span>line<span class="sy0">|</span> send_to_db<span class="br0">&#40;</span>line<span class="br0">&#41;</span><span class="br0">&#125;</span></div></li></ol></div>

<p>Nice, right?</p>

<p>The problem is that I could never figure out a way to deal with a <code>break</code> or an <code>Exception</code> raised inside the block. The core problem is that once a thread trying to push/pop from a ruby <code>SizedQueue</code> is blocking, there&#8217;s no way (I could find) to tell it to wake up and see if there&#8217;s an error from another thread floating around that needs to be addressed.</p>

<p>So, I got into a pattern of running my code with <code>each</code> for a while, debugging, and eventually doing the production run under <code>threach</code>. Which is just dumb. Then I&#8217;d try to re-write <code>threach</code> to deal with this stuff using a different approach (mutexes, lightweight events), quickly (or not so quickly) fail, give up, and start again.</p>

<p>So&#8230;let&#8217;s not worry about MRI for the moment. I run all my big jobs under JRuby these days anyway, and there I can take advantage of Java&#8217;s blocking queues that have timeouts. When a queue operation times out, I can check to see if there&#8217;s been a break or an exception thrown in the meantime and behave appropriately.</p>
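<p>The shape of that pattern, very roughly: poll with a timeout, and every time the poll comes back empty, check whether some other thread has flagged a problem. What follows is <em>not</em> the gem&#8217;s code. It&#8217;s a plain-Ruby illustration in which a hypothetical <code>TimedQueue</code> (built on <code>ConditionVariable#wait</code> with a timeout) stands in for the timed <code>poll</code> on a Java blocking queue, and <code>failed</code> is a made-up flag.</p>

<pre lang="ruby">
require 'thread'

# Hypothetical stand-in for a Java blocking queue with a timed poll
class TimedQueue
  def initialize
    @items = []
    @lock  = Mutex.new
    @ready = ConditionVariable.new
  end

  def push(item)
    @lock.synchronize do
      @items.push(item)
      @ready.signal
    end
  end

  # Next item, or nil if nothing shows up within `seconds`
  def poll(seconds)
    @lock.synchronize do
      @ready.wait(@lock, seconds) if @items.empty?
      @items.shift
    end
  end
end

queue     = TimedQueue.new
failed    = false   # a worker that hits an error would set this to true
processed = []

consumer = Thread.new do
  until failed
    item = queue.poll(0.1)
    next if item.nil?         # timed out: go around and re-check `failed`
    break if item == :done    # sentinel meaning "no more work"
    processed.push(item)
  end
end

[1, 2, 3].each { |i| queue.push(i) }
queue.push(:done)
consumer.join
p processed   # [1, 2, 3]
</pre>

<p>The point is only that the consumer never blocks forever, so it always gets a chance to notice that something went wrong elsewhere.</p>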

<p>The result is the gem <code><a href="https://rubygems.org/gems/jruby_threach">jruby_threach</a></code>. It works just like <code>threach</code>, except that, you know, it actually works the way I&#8217;d like it to.</p>

<div class="geshi no ruby"><ol><li class="li1"><div class="de1"><span class="kw3">require</span> <span class="st0">&#39;jruby_threach&#39;</span></div></li>
<li class="li1"><div class="de1">File.<span class="kw3">open</span><span class="br0">&#40;</span><span class="st0">&#39;data.csv&#39;</span><span class="br0">&#41;</span>.<span class="me1">threach</span><span class="br0">&#40;</span><span class="nu0">3</span>, <span class="re3">:each_line</span><span class="br0">&#41;</span> <span class="br0">&#123;</span><span class="sy0">|</span>line<span class="sy0">|</span> send_to_db<span class="br0">&#40;</span>line<span class="br0">&#41;</span><span class="br0">&#125;</span></div></li></ol></div>

<p>Looks familiar, doesn&#8217;t it.</p>

<p>But you can also break out of the loop.</p>

<div class="geshi no ruby"><ol><li class="li1"><div class="de1">myarray.<span class="me1">threach</span><span class="br0">&#40;</span><span class="nu0">2</span><span class="br0">&#41;</span> <span class="kw1">do</span> <span class="sy0">|</span>item<span class="sy0">|</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="kw1">break</span> <span class="kw1">if</span> item_indicates_to_break<span class="br0">&#40;</span>item<span class="br0">&#41;</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="kw1">if</span> item == <span class="re3">:really_bad_value</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; <span class="kw3">raise</span> <span class="kw4">RuntimeError</span>.<span class="me1">new</span>, <span class="st0">&quot;Something&#39;s really wrong&quot;</span>, <span class="kw2">nil</span> </div></li>
<li class="li1"><div class="de1">&nbsp; <span class="kw1">end</span></div></li>
<li class="li1"><div class="de1">&nbsp; process_item<span class="br0">&#40;</span>item<span class="br0">&#41;</span></div></li>
<li class="li1"><div class="de1"><span class="kw1">end</span></div></li></ol></div>

<p>Any exceptions that are <code>rescue</code>d within the block are handled internally and don&#8217;t cause processing to stop. Any that are <em>not</em> handled within the block are noticed by <code>threach</code>, cause the processing to stop, and are then re-raised so you can deal with them outside of <code>threach</code>.</p>

<div class="geshi no ruby"><ol><li class="li1"><div class="de1">&nbsp;</div></li>
<li class="li1"><div class="de1">reader = SpecializedFileReader.<span class="me1">new</span><span class="br0">&#40;</span>filename<span class="br0">&#41;</span></div></li>
<li class="li1"><div class="de1">&nbsp;</div></li>
<li class="li1"><div class="de1"><span class="kw1">begin</span></div></li>
<li class="li1"><div class="de1">&nbsp; reader.<span class="me1">threach</span><span class="br0">&#40;</span><span class="nu0">2</span><span class="br0">&#41;</span> <span class="kw1">do</span> <span class="sy0">|</span>item<span class="sy0">|</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; process_item<span class="br0">&#40;</span>item<span class="br0">&#41;</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="kw1">end</span></div></li>
<li class="li1"><div class="de1"><span class="kw1">rescue</span> SpecializedFileReaderError </div></li>
<li class="li1"><div class="de1">&nbsp; <span class="co1"># deal with the fact that the reader failed</span></div></li>
<li class="li1"><div class="de1"><span class="kw1">rescue</span> <span class="kw4">Exception</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="co1"># deal with the problem processing the item</span></div></li>
<li class="li1"><div class="de1"><span class="kw1">end</span></div></li></ol></div>

<p>Dealing with the underlying Java data structures makes life a lot easier. To the point that I added an enhancement &#8212; threading production as well.</p>

<div class="geshi no ruby"><ol><li class="li1"><div class="de1">&nbsp; <span class="co1"># Use two threads to read lines from files, and another three threads</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="co1"># to process the data that comes out of those files.</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="kw4">Dir</span>.<span class="me1">glob</span><span class="br0">&#40;</span><span class="st0">&quot;*.csv&quot;</span><span class="br0">&#41;</span>.<span class="me1">map</span><span class="br0">&#123;</span><span class="sy0">|</span>f<span class="sy0">|</span> <span class="kw4">File</span>.<span class="kw3">open</span><span class="br0">&#40;</span>f<span class="br0">&#41;</span><span class="br0">&#125;</span>.<span class="me1">mthreach</span><span class="br0">&#40;</span><span class="nu0">2</span>,<span class="nu0">3</span><span class="br0">&#41;</span> <span class="kw1">do</span> <span class="sy0">|</span>item<span class="sy0">|</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; send_item_to_datbase<span class="br0">&#40;</span>item<span class="br0">&#41;</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="kw1">end</span></div></li></ol></div>

<p><code>mthreach</code> basically allows you to treat an array of Enumerables as a single logical entity, multithreading both the producer and consumer sides of the operation. There aren&#8217;t a whole lot of obvious use cases, but it can certainly come in handy.</p>

<p>You can also access the underlying class that aggregates multiple enumerables directly.</p>

<div class="geshi no ruby"><ol><li class="li1"><div class="de1"><span class="kw3">require</span> <span class="st0">&#39;jruby_threach&#39;</span></div></li>
<li class="li1"><div class="de1">me = <span class="re2">Threach::MultiEnum</span>.<span class="me1">new</span><span class="br0">&#40;</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="br0">&#91;</span>enum1, enum2, enum3<span class="br0">&#93;</span>, <span class="co1"># enumerables</span></div></li>
<li class="li1"><div class="de1">&nbsp; threads, &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="co1"># How many threads to use to </span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="re3">:each_with_index</span>, &nbsp; &nbsp; &nbsp;<span class="co1"># the iterator to call on the enumerables</span></div></li>
<li class="li1"><div class="de1">&nbsp; size &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="co1"># size of the under-the-hood queue</span></div></li>
<li class="li1"><div class="de1"><span class="br0">&#41;</span></div></li>
<li class="li1"><div class="de1">&nbsp;</div></li>
<li class="li1"><div class="de1"><span class="co1"># Note that like threach, calling #each against a MultiEnum actually</span></div></li>
<li class="li1"><div class="de1"><span class="co1"># calls the iterator you sent in (in this case, #each_with_index)</span></div></li>
<li class="li1"><div class="de1">me.<span class="me1">each</span> <span class="br0">&#123;</span><span class="sy0">|</span>item<span class="sy0">|</span> process_item<span class="br0">&#40;</span>item<span class="br0">&#41;</span><span class="br0">&#125;</span></div></li></ol></div>]]></content:encoded>
			<wfw:commentRss>http://robotlibrarian.billdueber.com/even-better-even-simpler-multithreading-with-jruby/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Using SQLite3 from JRuby without ActiveRecord</title>
		<link>http://robotlibrarian.billdueber.com/using-sqlite3-from-jruby-without-activerecord-2/</link>
		<comments>http://robotlibrarian.billdueber.com/using-sqlite3-from-jruby-without-activerecord-2/#comments</comments>
		<pubDate>Thu, 26 May 2011 18:20:57 +0000</pubDate>
		<dc:creator>Bill</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://robotlibrarian.billdueber.com/?p=345</guid>
		<description><![CDATA[I spent way too long asking my friend, The Internet, how to get a normal DBI connection to SQLite3 using JRuby. Apparently, everyone except me is using ActiveRecord and/or Rails and doesn&#8217;t want to just connect to the database. But I do. Here&#8217;s how. First, get the gems: &#160; gem install dbi &#160; gem install [...]]]></description>
			<content:encoded><![CDATA[<p>I spent way too long asking my friend, The Internet, how to get a normal DBI connection to SQLite3 using JRuby. Apparently, everyone except me is <a href="http://jruby-extras.rubyforge.org/activerecord-jdbc-adapter/">using ActiveRecord and/or Rails</a> and doesn&#8217;t <em>want</em> to just connect to the database.</p>

<p>But I do. Here&#8217;s how.</p>

<p>First, get the gems:</p>

<div class="geshi no bash"><ol><li class="li1"><div class="de1">&nbsp; gem <span class="kw2">install</span> dbi</div></li>
<li class="li1"><div class="de1">&nbsp; gem <span class="kw2">install</span> dbd-jdbc</div></li>
<li class="li1"><div class="de1">&nbsp; gem <span class="kw2">install</span> jdbc-sqlite3</div></li></ol></div>

<p>Then you&#8217;re ready to load it up into DBI.</p>

<div class="geshi no ruby"><ol><li class="li1"><div class="de1"><span class="kw3">require</span> <span class="st0">&#39;rubygems&#39;</span> <span class="co1"># if you&#39;re using 1.8 still</span></div></li>
<li class="li1"><div class="de1"><span class="kw3">require</span> <span class="st0">&#39;java&#39;</span></div></li>
<li class="li1"><div class="de1"><span class="kw3">require</span> <span class="st0">&#39;dbi&#39;</span></div></li>
<li class="li1"><div class="de1"><span class="kw3">require</span> <span class="st0">&#39;dbd/jdbc&#39;</span></div></li>
<li class="li1"><div class="de1"><span class="kw3">require</span> <span class="st0">&#39;jdbc/sqlite3&#39;</span></div></li>
<li class="li1"><div class="de1">&nbsp;</div></li>
<li class="li1"><div class="de1">databasefile = <span class="st0">&#39;test.db&#39;</span></div></li>
<li class="li1"><div class="de1">dbh = DBI.<span class="me1">connect</span><span class="br0">&#40;</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="st0">&quot;DBI:jdbc:sqlite:#{databasefile}&quot;</span>, &nbsp;<span class="co1"># connection string</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="st0">&#39;&#39;</span>, &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="co1"># no username for sqlite3</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="st0">&#39;&#39;</span>, &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="co1"># no password for sqlite3</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="st0">&#39;driver&#39;</span> <span class="sy0">=&gt;</span> <span class="st0">&#39;org.sqlite.JDBC&#39;</span><span class="br0">&#41;</span> &nbsp; &nbsp; &nbsp;<span class="co1"># need to set the driver</span></div></li>
<li class="li1"><div class="de1">&nbsp;</div></li>
<li class="li1"><div class="de1"><span class="co1"># That&#39;s it. Everything below here is stock DBI</span></div></li>
<li class="li1"><div class="de1">&nbsp;</div></li>
<li class="li1"><div class="de1">dbh.<span class="kw1">do</span> <span class="st0">&quot;create table squares (i integer, isquared integer)&quot;</span></div></li>
<li class="li1"><div class="de1">&nbsp;</div></li>
<li class="li1"><div class="de1">ins = dbh.<span class="me1">prepare</span><span class="br0">&#40;</span><span class="st0">&quot;insert into squares values (?, ?)&quot;</span><span class="br0">&#41;</span></div></li>
<li class="li1"><div class="de1"><span class="br0">&#40;</span><span class="nu0">1</span>..<span class="nu0">20</span><span class="br0">&#41;</span>.<span class="me1">each</span> <span class="kw1">do</span> <span class="sy0">|</span>i<span class="sy0">|</span></div></li>
<li class="li1"><div class="de1">&nbsp; ins.<span class="me1">execute</span><span class="br0">&#40;</span>i, i<span class="sy0">*</span>i<span class="br0">&#41;</span></div></li>
<li class="li1"><div class="de1"><span class="kw1">end</span></div></li></ol></div>]]></content:encoded>
			<wfw:commentRss>http://robotlibrarian.billdueber.com/using-sqlite3-from-jruby-without-activerecord-2/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>How good is our relevancy ranking?</title>
		<link>http://robotlibrarian.billdueber.com/how-good-is-our-relevancy-ranking/</link>
		<comments>http://robotlibrarian.billdueber.com/how-good-is-our-relevancy-ranking/#comments</comments>
		<pubDate>Wed, 25 May 2011 18:53:58 +0000</pubDate>
		<dc:creator>Bill</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://robotlibrarian.billdueber.com/?p=333</guid>
		<description><![CDATA[For those of us that spend our days trying to tweak Mirlyn to make it better, one of the most important &#8212; and, in many ways, most opaque &#8212; questions is, &#8220;How good is our relevancy ranking?&#8221; Research from the UMich Library&#8217;s Usability Group (pdf; 600k) points to the importance of relevancy ranking  for both [...]]]></description>
			<content:encoded><![CDATA[<p>For those of us that spend our days trying to tweak <a title="Mirlyn -- The University of Michigan Library Catalog" href="http://mirlyn.lib.umich.edu/">Mirlyn</a> to make it better, one of the most important  &#8212; and, in many ways, most opaque &#8212; questions is, &#8220;How good is our relevancy ranking?&#8221;</p>

<p>Research from the UMich Library&#8217;s Usability Group (<a href="http://www.lib.umich.edu/files/services/usability/MirlynSearchSurvey_Feb2011.pdf">pdf; 600k</a>) points to the importance of relevancy ranking  for both known-item searches and discovery, but mapping search terms to the &#8220;best&#8221; results involves crawling deep inside the searcher&#8217;s head to know what she&#8217;s looking for.</p>

<p>So, what can we do?</p>

<p><strong>Record interaction as a way of showing interest</strong></p>

<p>One possibility is to look at those records that are somehow &#8220;touched&#8221; by a user in such a way that we can log it. If a user bothers to interact with an individual record, we&#8217;ll assume the record is interesting to her in the context of the current search.</p>

<p>There are three links associated with an individual record that a user can click on from the search results:</p>

<ul>
    <li>(62% of all record interactions) The title</li>
    <li>(28%) An external link (HathiTrust, Google Books, or one of our vendors)</li>
    <li>(10%) The &#8220;see holdings&#8221; link for those items that have multiple holdings</li>
</ul>

<p>Our first issue arises quickly: only about a quarter of Mirlyn sessions contain any of these actions. For a full 75% of sessions, we have no data about which records users are paying attention to. They get a call number &#8212; or determine they have a failed search &#8212;  and move on.</p>

<p><strong>Where on the page do users interact with items?</strong></p>

<p>We don&#8217;t know how users that interact with items differ from those that don&#8217;t. But for those that do, more than half of all record interactions are with the first record.</p>

<p>Here are the numbers for the first five records:</p>

<ul>
    <li>First record: 54%</li>
    <li>Second record: 12%</li>
    <li>Third record: 6%</li>
    <li>Fourth record: 3.7%</li>
    <li>Fifth record: 2.5%</li>
</ul>

<p>More than 75% of all record interactions are with the first four items on the first page of results.</p>

<p><strong>What does it all mean?</strong></p>

<p>Frustratingly, we don&#8217;t know. Several possibilities are obvious:</p>

<ul>
    <li>we&#8217;re doing a good job with relevancy ranking</li>
    <li>people do mostly known-item searches</li>
    <li>people don&#8217;t bother looking past the first few results</li>
    <li>excellent general search engines (e.g., Google) have trained people to believe that the first result is always worth a closer look.</li>
</ul>

<p>The interactions between these (and unknown other) factors are likely complex.</p>

<p>In the meantime, though, to the extent these data can be extended to the general case (not at all obvious), we&#8217;re not doing too bad of a job.</p>]]></content:encoded>
			<wfw:commentRss>http://robotlibrarian.billdueber.com/how-good-is-our-relevancy-ranking/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Ruby gem library_stdnums goes to version 1.0</title>
		<link>http://robotlibrarian.billdueber.com/ruby-gem-library_stdnums-goes-to-version-1-0/</link>
		<comments>http://robotlibrarian.billdueber.com/ruby-gem-library_stdnums-goes-to-version-1-0/#comments</comments>
		<pubDate>Fri, 06 May 2011 19:32:35 +0000</pubDate>
		<dc:creator>Bill</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://robotlibrarian.billdueber.com/ruby-gem-library_stdnums-goes-to-version-1-0/</guid>
		<description><![CDATA[I just released another (this time pretty good) version of my gem for normalizing/validating library standard numbers, library_stdnums (github source / docs). The short version of the functions available: ISBN: get checkdigit, validate, convert isbn10 to/from isbn13, normalize (to 13-digit) ISSN: get checkdigit, validate, normalize LCCN: validate, normalize Validation of LCCNs doesn&#8217;t involve a checkdigit; [...]]]></description>
			<content:encoded><![CDATA[<p>I just released another (this time pretty good) version of my gem for normalizing/validating library standard numbers, <code>library_stdnums</code> (<a href="https://github.com/billdueber/library_stdnums">github source</a> / <a href="http://rubydoc.info/github/billdueber/library_stdnums/master/frames">docs</a>).</p>

<p>The short version of the functions available:</p>

<ul>
<li><strong><acronym title="International Standard Book Number">ISBN</acronym></strong>: get checkdigit, validate, convert isbn10 to/from isbn13, normalize (to 13-digit)</li>
<li><strong><acronym title="International Standard Serial Number">ISSN</acronym></strong>: get checkdigit, validate, normalize</li>
<li><strong>LCCN</strong>: validate, normalize</li>
</ul>

<p>Validation of LCCNs doesn&#8217;t involve a checkdigit; I basically just normalize whatever is sent in and then see if the result is syntactically valid.</p>
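<p>To give a flavor of what &#8220;get checkdigit&#8221; and &#8220;convert&#8221; involve, here&#8217;s a standalone sketch of the standard ISBN-13 check digit calculation (alternating weights of 1 and 3 over the first twelve digits) and an ISBN-10 to ISBN-13 conversion built on it. This is not the gem&#8217;s code or its API; the method names are invented for the example.</p>

<pre lang="ruby">
# Check digit for an ISBN-13, given (at least) its first 12 digits.
# Hyphens and other non-digits are ignored.
def isbn13_checkdigit(first12)
  digits = first12.delete('^0-9').chars.first(12).map { |c| c.to_i }
  sum = digits.each_with_index.map { |d, i| i.even? ? d : d * 3 }.reduce(:+)
  ((10 - sum % 10) % 10).to_s
end

# Convert an ISBN-10 to ISBN-13: prefix 978, drop the old check digit,
# and compute a new one.
def isbn10_to_13(isbn10)
  core = '978' + isbn10.delete('^0-9Xx')[0, 9]
  core + isbn13_checkdigit(core)
end

puts isbn13_checkdigit('978-0-306-40615')   # 7
puts isbn10_to_13('0-306-40615-2')          # 9780306406157
</pre>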

<p>My plan in my Copious Free Time is to do a Java version of these as well and then stick them into a new-style <acronym title="Solr isn't an acronym, silly!">Solr</acronym> v.3 filter so I (and, by extension, you, if you&#8217;re interested) can have <acronym title="Solr isn't an acronym, silly!">Solr</acronym> do normalization during both index and search time.</p>]]></content:encoded>
			<wfw:commentRss>http://robotlibrarian.billdueber.com/ruby-gem-library_stdnums-goes-to-version-1-0/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>A short ruby diversion: cost of flow control under Ruby</title>
		<link>http://robotlibrarian.billdueber.com/a-short-ruby-diversion-cost-of-flow-control-under-ruby/</link>
		<comments>http://robotlibrarian.billdueber.com/a-short-ruby-diversion-cost-of-flow-control-under-ruby/#comments</comments>
		<pubDate>Tue, 03 May 2011 19:54:01 +0000</pubDate>
		<dc:creator>Bill</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://robotlibrarian.billdueber.com/?p=308</guid>
		<description><![CDATA[A couple days ago I decided to finally get back to working on threach to try to deal with problems it had &#8212; essentially, it didn&#8217;t deal well with non-local exits due to calls to break or even something simple like a NoMethodError. [BTW, I think I managed it. As near as I can tell, [...]]]></description>
			<content:encoded><![CDATA[<p>A couple days ago I decided to finally get back to working on <a href="https://github.com/billdueber/threach"><code>threach</code></a> to try to deal with problems it had &#8212; essentially, it didn&#8217;t deal well with non-local exits due to calls to <code>break</code> or even something simple like a <code>NoMethodError</code>.</p>

<p>[<acronym title="By The Way">BTW</acronym>, I think I managed it. As near as I can tell, <code>threach</code> version 0.4 won't deadlock anymore]</p>

<p>Along the way, while trying to figure out how threads affect the behavior of different non-local exits, I noticed that in some cases there was still work being done by one or more threads long after there was an exception raised.</p>

<p>I re-discovered something that a lot of people already know: <code>raise</code>/<code>rescue</code> under MRI is slow, and under JRuby can be <em>unbearably</em> slow. How slow?</p>

<p>Let&#8217;s look at four simple blocks that exercise four different block exit strategies: <code>break</code>, <code>catch</code> and <code>throw</code>, <code>raise</code> with the normal single (or zero) arguments, as well as the three-argument version of <code>raise</code>.</p>

<table class="data">
  <tr>
    <th>Simple break</th><th>Catch/Throw</th>
  </tr>
  <tr>
    <td>
      <pre lang="ruby">
range.each do |i|      
  break          
end              
      </pre>
    </td>
    <td>
      <pre lang="ruby">
catch(:benchmarking) do  
 range.each do |i|      
   throw(:benchmarking) 
 end                    
end
      </pre>
    </td>
  </tr>
  <tr>
    <th>Raise (1 arg)</th><th>Raise (3 args)</th>
  </tr>
  <tr>
    <td>
      <pre lang="ruby">
 begin                  
   range.each do |i|    
     raise StandardError
   end                  
 rescue                 
  # do nothing                
 end                          
      </pre>
    </td>
    <td>
      <pre lang="ruby">
begin                  
  range.each do |i|
    raise StandardError, :hi, nil
  end
rescue 
 # do nothing
end
      </pre>
    </td>
  </tr>
</table>

<p>In each case, we immediately exit the block without doing any work; the idea is to measure how long it takes to break out for each case.</p>

<p>So....let's run them each 100K times and see what happens, shall we? Times are in seconds, averaged over two runs.</p>
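<p>For anyone who wants to try this at home, here is a minimal sketch of how such a timing run might be set up, assuming Ruby's standard <code>Benchmark</code> module. It is not the script that produced the numbers below; the method names <code>strategies</code> and <code>time_strategies</code> and the ten-element range are made up for illustration.</p>

```ruby
require 'benchmark'

# Hypothetical harness: one lambda per exit strategy. Each one leaves the
# block on the first iteration without doing any real work.
def strategies
  {
    'break' => lambda { |range| range.each { |i| break } },
    'catch/throw' => lambda { |range|
      catch(:benchmarking) { range.each { |i| throw(:benchmarking) } }
    },
    'raise (1 arg)' => lambda { |range|
      begin
        range.each { |i| raise StandardError }
      rescue StandardError
        # do nothing
      end
    },
    'raise (3 args)' => lambda { |range|
      begin
        # The trailing nil is the explicit backtrace argument.
        range.each { |i| raise StandardError, 'hi', nil }
      rescue StandardError
        # do nothing
      end
    }
  }
end

# Run every strategy `iterations` times; returns { name => elapsed seconds }.
def time_strategies(iterations, range = (1..10))
  timings = {}
  strategies.each do |name, strategy|
    timings[name] = Benchmark.realtime { iterations.times { strategy.call(range) } }
  end
  timings
end

time_strategies(100_000).each do |name, seconds|
  puts format('%-15s %.2f', name, seconds)
end
```

<p>Run the same file under each interpreter you care about and compare; the absolute numbers will depend heavily on the Ruby version and the hardware.</p>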

<p><style type="text/css" media="screen">
  #t td {border: 1pt solid gray; border-collapse: collapse;}
  td.bad {color: red;}
</style></p>

<table class="data" id="t">
  <tr>
    <th></th><th>Ruby 1.8</th><th>Ruby 1.9</th><th>JRuby</th><th>JRuby --1.9</th>
  </tr>  
  <tr><th>break</th>        <td>0.12</td><td>0.07</td><td>0.29</td> <td>0.21</td></tr>
  <tr><th>catch/throw</th>  <td>0.35</td><td>0.28</td><td>0.64</td> <td>0.48</td></tr>
  <tr><th>raise (1 arg)</th><td>1.78</td><td>2.10</td><td class="bad">26.60</td><td class="bad">22.06</td></tr>
  <tr><th>raise (3 args)</th><td>1.85</td><td>2.13</td><td>0.45</td> <td>0.45</td></tr>
</table>

<p>The first thing to note is that this is 100K iterations. Three of the strategies are fast enough that you'd have to work really, really hard to notice them. In terms of speed, <code>raise (3 args)</code>, <code>catch/throw</code>, and <code>break</code> are fast enough that you shouldn't bother worrying about them (although you <em>should</em> choose the method that makes your code easy to understand).</p>

<p>The second thing to note is <em>Holy Camoli!</em> JRuby is slow there!</p>

<p><a href="http://jira.codehaus.org/browse/JRUBY-5534">This Jira ticket</a> tells the tale: The creation of the backtrace is very, very expensive for JRuby. That <code>nil</code> at the end of the <code>raise (3 args)</code> call suppresses the creation of that backtrace, so the speed is fine.</p>

<p>Three things worth saying here:</p>

<ul>
<li>If you're using <code>raise/rescue</code> for flow control, <em>you're already doing it wrong.</em> Reserve exceptions for, well, exceptional conditions that are only going to be raised once or twice, not all the time. </li>
<li>If you're writing code that, for some ungodly reason, is planning on raising a crapload of exceptions, use the three-arg version. I'm looking at you, gem authors. </li>
<li>If you're writing your code without worrying about how it will work under multiple threads, well, please don't do that. Everyone has multi-core systems these days, and it's silly to not be able to use them. Plus, counting on Matz to never move to a VM with real threads is a big gamble.</li>
</ul>]]></content:encoded>
			<wfw:commentRss>http://robotlibrarian.billdueber.com/a-short-ruby-diversion-cost-of-flow-control-under-ruby/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>ISBN parenthetical notes: Bad MARC data #1</title>
		<link>http://robotlibrarian.billdueber.com/isbn-parenthetical-notes-bad-marc-data-1/</link>
		<comments>http://robotlibrarian.billdueber.com/isbn-parenthetical-notes-bad-marc-data-1/#comments</comments>
		<pubDate>Tue, 12 Apr 2011 16:22:46 +0000</pubDate>
		<dc:creator>Bill</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://robotlibrarian.billdueber.com/?p=296</guid>
		<description><![CDATA[Yesterday, I gave a brief overview of why free text is hard to deal with. Today, I&#8217;m turning my attention to a concrete example that drives me absolutely batshit crazy: taking a perfectly good unique-id field (in this case, the ISBN in the 020) and appending stuff onto the end of it. The point is [...]]]></description>
			<content:encoded><![CDATA[<p>Yesterday, I gave a brief overview of <a href="http://robotlibrarian.billdueber.com/why-programmers-hate-free-text-in-marc-records/">why free text is hard to deal with</a>.</p>

<p>Today, I&#8217;m turning my attention to a concrete example that drives me absolutely batshit crazy: taking a perfectly good unique-id field (in this case, the <acronym title="International Standard Book Number">ISBN</acronym> in the 020) and appending stuff onto the end of it.</p>

<p>The point is not to mock anything. Mocking will, however, be included for free.</p>

<h2>What&#8217;s supposed to be in the 020?</h2>

<p>Well, for starters, an <acronym title="International Standard Book Number">ISBN</acronym> (10 or 13 digit, we&#8217;re not picky).</p>

<p>Let&#8217;s not worry, for the moment, about the actual <acronym title="International Standard Book Number">ISBN</acronym> and whether it&#8217;s valid or not.</p>

<p>Wait, no, let&#8217;s go ahead and worry about it. It&#8217;s an easy enough script to write, although it takes a while to run.</p>

<pre><code>8,630,794  Total records
3,220,666  Total 020a's
    6,498  020a's that don't obviously contain an ISBN
    8,407  that look like an ISBN but fail checksum test:
... so 0.26% of the ISBNs have invalid checksums
</code></pre>

<p>So, not bad at all, especially considering some of those are known to be bad, but are transcribed dutifully from the actual (mis-)printed book.</p>
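<p>The checksum test itself is simple arithmetic. Here is a rough sketch of what such a check might look like; the helper name <code>isbn_checksum_valid?</code> is made up for this example and is neither the script I ran nor the <code>library_stdnums</code> API. It assumes the ISBN is the first run of digits, hyphens, and X's in the string.</p>

```ruby
# Hypothetical helper: true if the first ISBN-looking token in `raw`
# has a valid ISBN-10 or ISBN-13 check digit.
def isbn_checksum_valid?(raw)
  # Grab the first run of digits/X/hyphens and drop the hyphens, so any
  # trailing parenthetical note is ignored.
  token = raw.to_s[/[\dXx-]+/].to_s.delete('-').upcase
  chars = token.chars

  case chars.length
  when 10
    # X (meaning 10) is only legal as the final check character.
    return false if chars[0, 9].include?('X')
    sum = chars.each_with_index.sum do |ch, i|
      value = ch == 'X' ? 10 : ch.to_i
      value * (10 - i) # weights run 10 down to 1
    end
    (sum % 11).zero?
  when 13
    return false if chars.include?('X')
    # Weights alternate 1, 3, 1, 3, ...
    sum = chars.each_with_index.sum { |ch, i| ch.to_i * (i.even? ? 1 : 3) }
    (sum % 10).zero?
  else
    false
  end
end

puts isbn_checksum_valid?('0-306-40615-2 (pbk. : alk. paper)')
puts isbn_checksum_valid?('0306406153')
```

<p>Anything that doesn't yield a 10- or 13-character token simply comes back as invalid, which lumps the "doesn't obviously contain an ISBN" and "fails the checksum" cases together; a real script would want to count those separately.</p>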

<p>A lot of the malformed data (anything from which I can&#8217;t seem to extract something that looks like an <acronym title="International Standard Book Number">ISBN</acronym>) is pricing data, and most of it appears in system numbers that are close enough to each other that I presume it was just a bad batch.</p>

<h2>What goes after the <acronym title="International Standard Book Number">ISBN</acronym> in the 020?</h2>

<p>I&#8217;m no cataloger, of course, but it looks to me like the answer is &#8220;Something about how the book is bound together, or the publisher, unless you want to put something else there, and then, really, go ahead, because it&#8217;s not like anyone is ever going to want to parse this out, all we need to do is print cards with it for god&#8217;s sake.&#8221;</p>

<p>No, I kid, I kid! The actual rules are in <a href="http://sites.google.com/site/opencatalogingrules/aacr2-chapter-1/1-8-standard-number-and-terms-of-availability-area">Library of Congress Rule Interpretation 1.8</a>, which reads, in part:</p>

<blockquote>
  <p>For a hardbound resource, there is no attempt to use a consistent term other than to use one that conveys the condition intelligibly.</p>
</blockquote>

<p>I think it&#8217;s important to read that a second time, because it succinctly conveys the culture in which these rules were devised.</p>

<ul>
<li>Don&#8217;t worry about consistency, because your only reader is human.</li>
<li>Defer to the cataloger.</li>
<li>Being complete is more important than being consistent.</li>
<li>Base your notes on your subjective view of the actual, physical item you&#8217;re presumed to be holding in your hands.</li>
</ul>

<p>Interestingly (to me, anyway), it looks like the <a href="http://www.oclc.org/bibformats/en/0xx/020.shtm"><acronym title="Online Computer Library Center">OCLC</acronym> once had a (now deprecated) <code>$$b</code> subfield for binding information</a>. Apparently it didn&#8217;t catch on.</p>

<h2>What did I find?</h2>

<p>So, let&#8217;s pretend I&#8217;d like to be able to differentiate between paperback and hardbound books. Probably useful, yes?</p>

<p>I went ahead and took all parenthetical notes from any subfield of the 020, split them on colons (&#8217;cause that seems to be the way they roll) and did some basic normalization:</p>

<ul>
<li>Eliminate numbers (so &#8216;vol. 1&#8217; and &#8216;vol. 2&#8217; count as only one pattern)</li>
<li>Lowercase everything</li>
<li>Turn runs of spaces into a single space</li>
<li>Trim leading/trailing spaces</li>
<li>Remove any trailing punctuation</li>
</ul>
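<p>Those steps might be sketched in Ruby roughly like this. The method names <code>normalized_notes</code> and <code>top_notes</code> are invented for the example, and it assumes each subfield value arrives as a plain string; it is an illustration of the steps listed above, not the code I actually ran.</p>

```ruby
# Hypothetical helper: pull the parenthetical notes out of one 020
# subfield value and normalize them as described in the list above.
def normalized_notes(value)
  parentheticals = value.to_s.scan(/\(([^)]*)\)/).flatten
  parts = parentheticals.flat_map { |note| note.split(':') }

  cleaned = parts.map do |part|
    part.gsub(/\d+/, '')                        # eliminate numbers
        .downcase                               # lowercase everything
        .gsub(/\s+/, ' ')                       # runs of spaces become one space
        .strip                                  # trim leading/trailing spaces
        .sub(/[[:punct:][:space:]]+\z/, '')     # remove trailing punctuation
  end

  cleaned.reject { |note| note.empty? }
end

# Hypothetical helper: count normalized notes across many values and
# return the most common ones as [note, count] pairs.
def top_notes(values, limit = 20)
  counts = Hash.new(0)
  values.each do |value|
    normalized_notes(value).each { |note| counts[note] += 1 }
  end
  counts.sort_by { |note, count| [-count, note] }.first(limit)
end

puts normalized_notes('0306406152 (pbk. : alk. paper)').inspect
```

<p>Splitting on the colon is what turns a single remark like "(pbk. : alk. paper)" into two separate entries in the counts.</p>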

<p>I found 1,506,729 parenthetical remarks in the 020 subfields of our catalog.</p>

<p>The top twenty most common entries using those normalizations are:</p>

<ol>
<li>402537 pbk</li>
<li>387406 alk. paper</li>
<li>99260 v  # <em>(e.g., &#8220;v. 1&#8221;, &#8220;v. 22&#8221;, etc.)</em></li>
<li>82918 cloth</li>
<li>51125 hbk</li>
<li>42036 electronic bk</li>
<li>41360 acid-free paper</li>
<li>38792 hardcover</li>
<li>28913 set</li>
<li>20358 hardback</li>
<li>19160 ebook</li>
<li>16264 paper</li>
<li>15269 u.s</li>
<li>12770 hd.bd</li>
<li>11793 print</li>
<li>10625 lib. bdg</li>
<li>10520 hc</li>
<li>8772 est</li>
<li>7767 pb</li>
<li>7639 hard</li>
</ol>

<p>The kicker? These are the top twenty of <em>13,374</em> unique parenthetical strings found in the 020 field. Many of them are publishers, or cities, or whatnot, but an awful lot of them are variations on &#8220;hardcover&#8221; and &#8220;paperback.&#8221;</p>

<p>For example, a quick search for anything that might be &#8220;hard&#8221; (regexp: /h[ar]{0,2}d/) got me started on a list. Here are just the 90 examples from that list that start with &#8216;h&#8217;:</p>

<blockquote>
  <p>hard | hard adhesive | hard back | hard bd | hard book | hard bound | hard bound book | hard boundhard case | hard casehard copy | hard copy | hard copy set | hard cov | hard cover | hard covers | hard sewn | hard signed | hard-backhard-backcased | hard-bound | hard-cover | hard-cover acid-free | hardb | hard\cover | hardbach | hardback | hardback book | hardback cover |  hardbackcased | hardbd | hardbk | hardbond | hardbook | hardboubd | hardbound | hardboundhardboundtion | hardc | hardcase | hardcopy | hardcopy publication | hardcov | hardcov er | hardcovcer | hardcove | hardcover | hardcover-alk. paper | hardcovercloth | hardcoverflexibound | hardcoverhardcoverwith cd | hardcoverr | hardcovers | hardcoversame | hardcoversame as above | hardcoverset | hardcovertion | hardcver | hardcvoer | hardcvr | harddback | harde | hardocover | hardover | hardpack | hardpaper | hardvocer | hardware | hd | hd bd | hd. bd | hd. bd. in slip case | hd. bd.in sl.cs | hd. bk | hd. cover | hd.bd | hd.bd. in box | hdb | hdbd | hdbk | hdbkb | hdbkhdbk | hdbnd | hdc | hdcvr | hdk | hdp | hdpk | hradback | hradcover | hrd | hrdbk | hrdcver | hrdcvr</p>
</blockquote>

<p>And that&#8217;s after eliminating things like places of publication, strings like  &#8220;with&#8230;&#8221;, &#8220;plus&#8230;&#8221;, &#8220;alk. paper&#8221;, etc.</p>
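<p>To see what that regular expression does and doesn't catch, here is a small sketch run against a handful of the normalized strings from this post. The method name <code>maybe_hard?</code> is made up for the example.</p>

```ruby
# The pattern quoted above: "h", then up to two of "a"/"r", then "d".
# It is unanchored, so it matches anywhere inside the string.
def maybe_hard?(note)
  note.match?(/h[ar]{0,2}d/)
end

notes = ['hard', 'hd.bd', 'hrdcvr', 'hradback', 'hardware',
         'hbk', 'hc', 'pbk', 'cloth']

candidates = notes.select { |note| maybe_hard?(note) }
puts candidates.inspect
```

<p>The pattern picks up "hard", "hd.bd", "hrdcvr", "hradback" and "hardware", but not "hbk" or "hc", so it really is only a starting point: the matches still need a human eye, and the abbreviations without a "d" have to be found some other way.</p>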

<h2>&#8220;Yeah, but you have to understand that historically&#8230;&#8221;</h2>

<p>Stop hiding behind that.</p>

<p>I understand that at one point in time it probably made sense (to someone at least) to do it this way. I can deal with that.</p>

<p>What I can&#8217;t accept is that <em>as I type this</em> there&#8217;s a cataloger doing this in this way. Today. April 2011. Some, what? maybe <em>thirty years</em> since computer-based OPACs became prevalent?</p>

<p>These sorts of problems were recognized <em>ages</em> ago and should have been dealt with. Add a subfield. Invent a controlled vocabulary. Don&#8217;t worry about the legacy data; it&#8217;s always going to suck.</p>

<p>But <em>why are we still producing sucky data???</em></p>

<h2>To sum up</h2>

<p>The point is that there&#8217;s a better way to do this stuff. Lots and lots of better ways, in fact. Time I spend dealing with crappy data is time I <em>don&#8217;t</em> spend making relevancy ranking better, or building a better command language search option for my librarians, or working on ways to get a decent &#8220;more like this&#8221;.</p>

<p>The need is both dire and urgent; the latter because sooner or later we&#8217;re going to have to go to a &#8220;two state solution&#8221; with traditional MARC21 for many of our records and whatever comes next (<acronym title="Resource Description and Access">RDA</acronym>?) for the newer stuff. And every day we wait, that first category grows, and the growth rate keeps increasing.</p>

<p>And then there&#8217;s serials. Don&#8217;t talk to me about serials.</p>]]></content:encoded>
			<wfw:commentRss>http://robotlibrarian.billdueber.com/isbn-parenthetical-notes-bad-marc-data-1/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Why programmers hate free text in MARC records</title>
		<link>http://robotlibrarian.billdueber.com/why-programmers-hate-free-text-in-marc-records/</link>
		<comments>http://robotlibrarian.billdueber.com/why-programmers-hate-free-text-in-marc-records/#comments</comments>
		<pubDate>Mon, 11 Apr 2011 19:40:16 +0000</pubDate>
		<dc:creator>Bill</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://robotlibrarian.billdueber.com/?p=291</guid>
		<description><![CDATA[One of the frustrating things about dealing with MARC (nee AACR2) data is how much nonsense is stored in free text when a unique identifier in a well-defined place would have done a much better job. A lot of people seem to not understand why. This post, then, is for all the catalogers out there [...]]]></description>
			<content:encoded><![CDATA[<p>One of the frustrating things about dealing with <acronym title="MAchine Readable Cataloging">MARC</acronym> (nee <acronym title="Anglo-American Cataloguing Rules">AACR2</acronym>) data is how much nonsense is stored in free text when a unique identifier in a well-defined place would have done a much better job.</p>

<p>A lot of people seem to not understand why.</p>

<p>This post, then, is for all the catalogers out there who constantly answer my questions with, &#8220;Well, it depends&#8221; and don&#8217;t understand why that&#8217;s a problem.</p>

<h2>Description vs Findability</h2>

<p>I&#8217;m surprised &#8212; and a little dismayed &#8212; by how often I talk to people in the library world who don&#8217;t understand the difference between <em>description</em> and <em>findability</em>. <acronym title="Anglo-American Cataloguing Rules">AACR2</acronym> is clearly designed for <em>description</em>; once you&#8217;ve found a record, it does a pretty good job telling a human being what she&#8217;s looking at. With respect to a person who&#8217;s already got a copy of the record in her (virtual) hand, strings of text and reasonable abbreviations are&#8230;well, often good enough, let&#8217;s say.</p>

<p>But much of <acronym title="Anglo-American Cataloguing Rules">AACR2</acronym> is a giant mountain of fail when it comes to supporting <em>findability</em> &#8212; the ability for a machine to slice and dice the data in ways that can be mapped onto searches and transformations. What those of us on the business end of the computer need are <em>well-defined values</em> stuck into <em>well-defined places</em> that represent <em>well-defined relationships</em>.</p>

<p>Free text stuck on the end of a field fails all three of those criteria.</p>

<h2>Machine Reasoning vs. Machine Parsing</h2>

<p>When many people look at something like <acronym title="Resource Description Framework">RDF</acronym>, their first reaction is, &#8220;<a href="http://www.youtube.com/watch?v=ffN9jcVcH_o">Great Googally Moogally</a>!  Just tell me the language! I don&#8217;t want to follow a chain of reasoning that&#8217;s seventeen steps long just to figure out the damn thing is in English!!!&#8221;</p>

<p>Of course you don&#8217;t. And you don&#8217;t have to. Someone &#8212; hopefully someone smarter than me &#8212; needs to write a program to do it. And we can.</p>

<p>Following all that logic &#8212; deriving relationships, figuring out eventual values, determining how to convert between various forms &#8212; is what I&#8217;ll call (for simplicity&#8217;s sake) <em>machine reasoning</em>. And machine reasoning &#8212; for the purposes of this discussion, anyway &#8212; is a <strong>solved problem</strong>. I&#8217;m not saying it&#8217;s not hard, and I&#8217;m not saying it might not take gobs of hardware resources. But we, the collective of humanity, know how to do it.</p>

<p>On the other hand, <em>machine parsing</em> &#8212; looking at all that free text that is sprinkled throughout our records and trying to turn it into something that is susceptible to machine reasoning &#8212; is vehemently <em>not</em> a solved problem. Even if you ignore all the misspellings, we&#8217;re still stuck with one-off abbreviations, lack of ordering, gobs of &#8220;local practice,&#8221; and iffy punctuation.</p>

<p>And, come to think of it, you can&#8217;t ignore the misspellings, either.</p>

<p>The point is this: <strong>good data trumps everything else</strong>. If there&#8217;s good, solid, well-defined data in computable places, we can (given some time) do damn near anything with it. If there&#8217;s human-entered, free-text, parenthetical-remark-type data, we&#8217;re pretty much stuck.</p>

<h2>Examples?</h2>

<p><a href="http://bibwild.wordpress.com/">Jonathan Rochkind</a> just did a <a href="http://bibwild.wordpress.com/2011/04/04/broad-categories-from-class-numbers/">great post</a> looking at <acronym title="Library of Congress">LC</acronym> call numbers, and how, well, they might be in a few different places, and may or may not be valid <acronym title="Library of Congress">LC</acronym> call numbers, and so on and on and on and on.</p>

<p>And my next post (hopefully tomorrow) will be an analysis of the first free text in <acronym title="MAchine Readable Cataloging">MARC</acronym> I ever tried to deal with &#8212; the parenthetical remarks in the 020 (<acronym title="International Standard Book Number">ISBN</acronym>) field. If that doesn&#8217;t keep you up all night, well, I don&#8217;t know what will.</p>]]></content:encoded>
			<wfw:commentRss>http://robotlibrarian.billdueber.com/why-programmers-hate-free-text-in-marc-records/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Corrected Code4Lib slides are up</title>
		<link>http://robotlibrarian.billdueber.com/corrected-code4lib-slides-are-up/</link>
		<comments>http://robotlibrarian.billdueber.com/corrected-code4lib-slides-are-up/#comments</comments>
		<pubDate>Tue, 15 Feb 2011 21:49:49 +0000</pubDate>
		<dc:creator>Bill</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://robotlibrarian.billdueber.com/?p=281</guid>
		<description><![CDATA[&#8230;at the same URL. I was, to put it mildly, incredibly excited about code4lib this year because, for once, I thought I had something to say. And I did have something to say. And I said it. But it was wrong. I presented a bunch of statistics drawn from nearly a year of Mirlyn logs. [...]]]></description>
			<content:encoded><![CDATA[<p>&#8230;at <a href="http://robotlibrarian.billdueber.com/wp-content/uploads/2011/02/dueber_lightning_c4l11.ppt">the same <acronym title="Uniform Resource Locator">URL</acronym></a>.</p>

<p>I was, to put it mildly, incredibly excited about <a href="http://code4lib.org/">code4lib</a> this year because, for once, I thought I had something to say. And I did have something to say. And I said it. But it was wrong.</p>

<p>I presented a bunch of statistics drawn from nearly a year of <a href="http://mirlyn.lib.umich.edu/">Mirlyn</a> logs. The most outlandish of my assertions, and the one that eventually turned out to be the most incorrect, was that some 45% of all our user sessions consist of only one action: a search.</p>

<p>Unfortunately, I&#8217;d missed a whole swath of things I should have excluded. I&#8217;d remembered robots and stuff coming in from our link resolver and so on. I hadn&#8217;t counted on having to fight my own stupidity.</p>

<p>In short: <a href="http://catalog.hathitrust.org/">catalog.hathitrust.org</a> and <a href="http://mirlyn.lib.umich.edu/">mirlyn.lib.umich.edu</a> share a common code base, as well as a <acronym title="Solr isn\'t an acronym, silly!">Solr</acronym> backend. I was correctly excluding all the HathiTrust stuff from my stats <em>except for simple searches</em>. What I ended up with was a whole lotta sessions with nothing in them but that search. Luckily, I noticed waaaay too many people coming in via the HathiTrust site (which I <em>know</em> doesn&#8217;t have a link to Mirlyn) and did more digging.</p>

<p>The slides have been updated with correct numbers. Luckily, even though the adjustment was pretty extreme, I don&#8217;t think many of my conclusions are invalidated, especially given <a href="http://www.lib.umich.edu/files/services/usability/MirlynSearchSurvey_Feb2011.pdf">corroborating evidence from an extensive survey conducted by our usability team</a> (<acronym title="Portable Document Format">PDF</acronym>). They conclude, among other things, that known-item searching is prevalent and relevancy ranking is important across task boundaries.</p>

<p>The basic stats from the powerpoint, for those who don&#8217;t want to read all my notes:</p>

<ul>
    <li>17% of all sessions have one action: a search</li>
    <li>In only 28% of all sessions does the user see the Record View</li>
    <li>75% of all logged actions that target an individual record (see the full record view, look at extended holdings, etc.) happen with a record in the top 6 search results</li>
    <li>7% of sessions involve a user adding a facet</li>
    <li>2% of sessions involve a user exporting records</li>
</ul>]]></content:encoded>
			<wfw:commentRss>http://robotlibrarian.billdueber.com/corrected-code4lib-slides-are-up/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
	</channel>
</rss>


