<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	>

<channel>
	<title>Robot Librarian</title>
	<atom:link href="http://robotlibrarian.billdueber.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://robotlibrarian.billdueber.com</link>
	<description>Disclaimer: I'm not actually a robot.</description>
	<pubDate>Mon, 11 May 2009 14:50:14 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.7</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Sending MARC(ish) data to Refworks</title>
		<link>http://robotlibrarian.billdueber.com/sending-marcish-data-to-refworks/</link>
		<comments>http://robotlibrarian.billdueber.com/sending-marcish-data-to-refworks/#comments</comments>
		<pubDate>Mon, 11 May 2009 14:48:58 +0000</pubDate>
		<dc:creator>Bill</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://robotlibrarian.billdueber.com/?p=71</guid>
		<description><![CDATA[Refworks has some okish documentation about how to deal with its callback import procedure, but I thought I&#8217;d put down how I&#8217;m doing it for our vufind install (mirlyn2-beta.lib.umich.edu) in case other folks are interested.

The basic procedure is:


    Send your user to a specific refworks URL along with a callback URL that [...]]]></description>
			<content:encoded><![CDATA[<p>Refworks has some <a href="http://www.refworks.com/DirectExport.htm">okish documentation</a> about how to deal with its callback import procedure, but I thought I&#8217;d put down how I&#8217;m doing it for our <a href="http://vufind.org/">vufind</a> install (<a href="http://mirlyn2-beta.lib.umich.edu/">mirlyn2-beta.lib.umich.edu</a>) in case other folks are interested.</p>

<p>The basic procedure is:</p>

<ul>
    <li>Send your user to a specific refworks <acronym title="Uniform Resource Locator">URL</acronym> along with a callback <acronym title="Uniform Resource Locator">URL</acronym> that can enumerate the record(s) you want to import in a supported form</li>
    <li>Your user logs in (if need be) and gets to her RefWorks page</li>
    <li>RefWorks calls up your system and requests the record(s)</li>
    <li>The import happens, and your user does whatever she wants to do with them</li>
</ul>

<p>Of course, there are lots of issues with doing this <em>well</em> (quick! Is this <acronym title="MAchine Readable Cataloging">MARC</acronym> record for a book? An edited book? Is it a journal, or a serial of some other sort? Who&#8217;s the actual author/editor?), but doing it <em>at all</em> isn&#8217;t so bad.</p>

<h3>The <acronym title="Uniform Resource Locator">URL</acronym> to send them to</h3>

<p>This is the &#8220;Export this record&#8221; <acronym title="Uniform Resource Locator">URL</acronym> on my system:</p>

<pre>http://www.refworks.com.proxy.lib.umich.edu/express/expressimport.asp?
vendor=[your system]&amp;
filter=<acronym title="MAchine Readable Cataloging">MARC</acronym>+Format&amp;
database=All+<acronym title="MAchine Readable Cataloging">MARC</acronym>+Formats&amp;
encoding=65001
&amp;url=[your callback <acronym title="Uniform Resource Locator">URL</acronym>]</pre>

<p>Note that the vendor variable should be a unique string (made up by you) for your <em>system</em>, not a larger entity (like the whole library or the institution).</p>
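
<p>Here&#8217;s a minimal Python sketch of building that link with each parameter URL-encoded. The un-proxied base URL, the function name, and the example callback address are assumptions made for illustration; swap in whatever your own setup uses.</p>

```python
from urllib.parse import urlencode

# Assumed un-proxied endpoint; the link in the post goes through a campus proxy host.
REFWORKS_IMPORT_URL = "http://www.refworks.com/express/expressimport.asp"


def refworks_export_url(vendor, callback_url, base_url=REFWORKS_IMPORT_URL):
    """Build the RefWorks import link with every parameter URL-encoded."""
    params = [
        ("vendor", vendor),                # unique string for your system
        ("filter", "MARC Format"),         # the MARC-like text filter
        ("database", "All MARC Formats"),
        ("encoding", "65001"),             # Windows code page number for UTF-8
        ("url", callback_url),             # where RefWorks fetches the record(s)
    ]
    return base_url + "?" + urlencode(params)


# Hypothetical callback URL, for illustration only.
link = refworks_export_url(
    "My Catalog",
    "http://catalog.example.edu/Record/000152772/Export?style=REF",
)
print(link)
```

<p>Encoding the callback <acronym title="Uniform Resource Locator">URL</acronym> matters because it usually carries its own query string, which would otherwise bleed into the RefWorks parameters.</p>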

<p>The &#8220;<acronym title="MAchine Readable Cataloging">MARC</acronym> Format&#8221; filter we&#8217;re using is <em>not</em> a filter for real <acronym title="MAchine Readable Cataloging">MARC</acronym>. It&#8217;s a <acronym title="MAchine Readable Cataloging">MARC</acronym>-like delimited format (see <a target="marcish" href="http://mirlyn2-beta.lib.umich.edu/Record/000152772/Export?style=REF">an example from my catalog</a>).</p>

<p>Basically, you have three types of lines (but really, look at the example, &#8217;cause it&#8217;ll make everything a lot clearer):</p>

<p><strong>LEADER</strong></p>

<pre>
  LEADER [one space] [leader text]
</pre>

<p><strong>Control Field</strong></p>

<pre>
  [three-digit control tag] [four spaces] [data text]
</pre>

<p><strong>Data Field</strong></p>

<pre>
  [three-digit data tag] [one space] [ind1] [ind2] [one space] [value of subfield a] [other subfield constructs]
</pre>

<p>&#8230;where [other subfield constructs] look like</p>

<pre>
  [pipe character][subfield code][subfield value]
</pre>

<p>Notice that (a) there&#8217;s no leading &#8216;|a&#8217; before the subfield a value, and (b) there are no spaces between the pipe, the subfield code, and the subfield value for the non-code-a subfields.</p>
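
<p>The three line shapes above can be sketched in a few lines of Python. The input structure (tuples of tag, indicators, and subfield pairs) and the function name are assumptions made for the example, not anything RefWorks defines.</p>

```python
def marcish_lines(leader, fields):
    """Render a record as the MARC-like text lines described above.

    `fields` is a list of tuples: (tag, data) for a control field, or
    (tag, ind1, ind2, subfields) for a data field, where subfields is a
    list of (code, value) pairs. This input shape is an assumption.
    """
    lines = ["LEADER " + leader]
    for field in fields:
        if len(field) == 2:
            tag, data = field
            lines.append(tag + "    " + data)  # tag, four spaces, data
        else:
            tag, ind1, ind2, subfields = field
            parts = []
            for code, value in subfields:
                # subfield a gets no "|a" prefix; the others get pipe + code
                parts.append(value if code == "a" else "|" + code + value)
            lines.append(tag + " " + ind1 + ind2 + " " + " ".join(parts))
    return lines


example = marcish_lines(
    "leader text",
    [
        ("001", "000152772"),
        ("245", "1", "0", [("a", "Capitalism, primitive and modern;"),
                           ("b", "some aspects of Tolai economic growth")]),
    ],
)
print("\n".join(example))
```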

<p>Some easy <acronym title="PHP: Hypertext Preprocessor">PHP</acronym> code to produce such a format is as follows. Note that I&#8217;m sending it as text (because it&#8217;s <em>not <acronym title="MAchine Readable Cataloging">MARC</acronym></em>) and as UTF-8. If you&#8217;ve got <acronym title="MAchine Readable Cataloging">MARC</acronym>-8, you&#8217;ll have to convert it before sending.</p>

<div class="geshi no php"><ol><li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; <span class="re1">$m</span> <span class="sy0">=</span> <span class="re1">$this</span><span class="sy0">-&gt;</span><span class="me1">marcRecord</span><span class="sy0">;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; <span class="kw3">header</span><span class="br0">&#40;</span><span class="st0">&#39;Content-type: text/plain; charset=UTF-8&#39;</span><span class="br0">&#41;</span><span class="sy0">;</span></div></li>
<li class="li1"><div class="de1">&nbsp;</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; <span class="kw3">echo</span> <span class="st0">&#39;LEADER &#39;</span><span class="sy0">,</span> <span class="re1">$m</span><span class="sy0">-&gt;</span><span class="me1">getLeader</span><span class="br0">&#40;</span><span class="br0">&#41;</span><span class="sy0">,</span> <span class="st0">&quot;<span class="es0">\n</span>&quot;</span><span class="sy0">;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; <span class="kw1">foreach</span> <span class="br0">&#40;</span><span class="re1">$m</span><span class="sy0">-&gt;</span><span class="me1">getFields</span><span class="br0">&#40;</span><span class="br0">&#41;</span> <span class="kw1">as</span> <span class="re1">$tag</span> <span class="sy0">=&gt;</span> <span class="re1">$val</span><span class="br0">&#41;</span> <span class="br0">&#123;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw3">echo</span> <span class="re1">$tag</span><span class="sy0">;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">if</span> <span class="br0">&#40;</span><span class="re1">$val</span> instanceof File_MARC_Control_FIELD<span class="br0">&#41;</span> <span class="br0">&#123;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="kw3">echo</span> <span class="st0">&#39; &nbsp; &nbsp;&#39;</span><span class="sy0">,</span> <span class="re1">$val</span><span class="sy0">-&gt;</span><span class="me1">getData</span><span class="br0">&#40;</span><span class="br0">&#41;</span><span class="sy0">,</span> <span class="st0">&quot;<span class="es0">\n</span>&quot;</span><span class="sy0">;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; <span class="br0">&#125;</span> <span class="kw1">else</span> <span class="br0">&#123;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="kw3">echo</span> <span class="st0">&#39; &#39;</span><span class="sy0">,</span> <span class="re1">$val</span><span class="sy0">-&gt;</span><span class="me1">getIndicator</span><span class="br0">&#40;</span><span class="nu0">1</span><span class="br0">&#41;</span><span class="sy0">,</span> &nbsp;<span class="re1">$val</span><span class="sy0">-&gt;</span><span class="me1">getIndicator</span><span class="br0">&#40;</span><span class="nu0">2</span><span class="br0">&#41;</span><span class="sy0">,</span> <span class="st0">&#39; &#39;</span><span class="sy0">;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="re1">$subs</span> <span class="sy0">=</span> <span class="kw3">array</span><span class="br0">&#40;</span><span class="br0">&#41;</span><span class="sy0">;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">foreach</span> <span class="br0">&#40;</span><span class="re1">$val</span><span class="sy0">-&gt;</span><span class="me1">getSubFields</span><span class="br0">&#40;</span><span class="br0">&#41;</span> <span class="kw1">as</span> <span class="re1">$code</span><span class="sy0">=&gt;</span><span class="re1">$subdata</span><span class="br0">&#41;</span> <span class="br0">&#123;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="re1">$line</span> <span class="sy0">=</span> <span class="st0">&#39;&#39;</span><span class="sy0">;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="kw1">if</span> <span class="br0">&#40;</span><span class="re1">$code</span> <span class="sy0">!=</span> <span class="st0">&#39;a&#39;</span><span class="br0">&#41;</span> <span class="br0">&#123;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="re1">$line</span> <span class="sy0">=</span> <span class="st0">&#39;|&#39;</span> <span class="sy0">.</span> <span class="re1">$code</span><span class="sy0">;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="br0">&#125;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="re1">$subs</span><span class="br0">&#91;</span><span class="br0">&#93;</span> <span class="sy0">=</span> <span class="re1">$line</span> <span class="sy0">.</span> <span class="re1">$subdata</span><span class="sy0">-&gt;</span><span class="me1">getData</span><span class="br0">&#40;</span><span class="br0">&#41;</span><span class="sy0">;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="br0">&#125;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="kw3">echo</span> <span class="kw3">implode</span><span class="br0">&#40;</span><span class="st0">&#39; &#39;</span><span class="sy0">,</span> <span class="re1">$subs</span><span class="br0">&#41;</span><span class="sy0">,</span> <span class="st0">&quot;<span class="es0">\n</span>&quot;</span><span class="sy0">;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; <span class="br0">&#125;</span> &nbsp; &nbsp; &nbsp; &nbsp;</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; <span class="br0">&#125;</span></div></li></ol></div>]]></content:encoded>
			<wfw:commentRss>http://robotlibrarian.billdueber.com/sending-marcish-data-to-refworks/feed/</wfw:commentRss>
		</item>
		<item>
		<title>MARC-HASH: The saga continues (now with even less structure)</title>
		<link>http://robotlibrarian.billdueber.com/marc-hash-the-saga-continues-now-with-even-less-structure/</link>
		<comments>http://robotlibrarian.billdueber.com/marc-hash-the-saga-continues-now-with-even-less-structure/#comments</comments>
		<pubDate>Wed, 15 Apr 2009 18:55:11 +0000</pubDate>
		<dc:creator>Bill</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://robotlibrarian.billdueber.com/marc-hash-the-saga-continues-now-with-even-less-structure/</guid>
		<description><![CDATA[After a medium-sized discussion on #code4lib, we&#8217;ve collectively decided that&#8230;well, ok, no one really cares all that much, but a few people weighed in.

The new format is: A list of arrays. If it&#8217;s got two elements, it&#8217;s a control field; if it&#8217;s got four, it&#8217;s a data field.

SO&#8230;.it&#8217;s like this now.

&#123;
&#160; &#34;type&#34; : &#34;marc-hash&#34;,
&#160; &#34;version&#34; [...]]]></description>
			<content:encoded><![CDATA[<p>After a medium-sized discussion on #code4lib, we&#8217;ve collectively decided that&#8230;well, ok, no one really cares all that much, but a few people weighed in.</p>

<p>The new format is: A list of arrays. If it&#8217;s got two elements, it&#8217;s a control field; if it&#8217;s got four, it&#8217;s a data field.</p>
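
<p>The length rule is simple enough to sketch in Python (the function name is made up for the example):</p>

```python
def field_kind(field):
    """Tell control fields from data fields purely by their length."""
    if len(field) == 2:
        return "control"   # [tag, value]
    if len(field) == 4:
        return "data"      # [tag, ind1, ind2, subfields]
    raise ValueError("expected 2 or 4 elements, got %d" % len(field))


fields = [
    ["001", "001 value"],
    ["010", " ", " ", [["a", "68009499"]]],
]
kinds = [field_kind(f) for f in fields]
print(kinds)
```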

<p>SO&#8230;.it&#8217;s like this now.</p>

<div class="geshi no javascript"><ol><li class="li1"><div class="de1"><span class="br0">&#123;</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="st0">&quot;type&quot;</span> : <span class="st0">&quot;marc-hash&quot;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; <span class="st0">&quot;version&quot;</span> : <span class="br0">&#91;</span><span class="nu0">1</span>, <span class="nu0">0</span><span class="br0">&#93;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; <span class="st0">&quot;leader&quot;</span> : <span class="st0">&quot;leader string&quot;</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="st0">&quot;fields&quot;</span> : <span class="br0">&#91;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp;<span class="br0">&#91;</span><span class="st0">&quot;001&quot;</span>, <span class="st0">&quot;001 value&quot;</span><span class="br0">&#93;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp;<span class="br0">&#91;</span><span class="st0">&quot;002&quot;</span>, <span class="st0">&quot;002 value&quot;</span><span class="br0">&#93;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp;<span class="br0">&#91;</span><span class="st0">&quot;010&quot;</span>, <span class="st0">&quot; &quot;</span>, <span class="st0">&quot; &quot;</span>, </div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; <span class="br0">&#91;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; <span class="br0">&#91;</span><span class="st0">&quot;a&quot;</span>, <span class="st0">&quot;68009499&quot;</span><span class="br0">&#93;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; <span class="br0">&#93;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; <span class="br0">&#93;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; <span class="br0">&#91;</span><span class="st0">&quot;035&quot;</span>, <span class="st0">&quot; &quot;</span>, <span class="st0">&quot; &quot;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; <span class="br0">&#91;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; <span class="br0">&#91;</span><span class="st0">&quot;a&quot;</span>, <span class="st0">&quot;(RLIN)MIUG0000733-B&quot;</span><span class="br0">&#93;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; <span class="br0">&#93;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; <span class="br0">&#93;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; <span class="br0">&#91;</span><span class="st0">&quot;035&quot;</span>, <span class="st0">&quot; &quot;</span>, <span class="st0">&quot; &quot;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; <span class="br0">&#91;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; <span class="br0">&#91;</span><span class="st0">&quot;a&quot;</span>, <span class="st0">&quot;(CaOTULAS)159818014&quot;</span><span class="br0">&#93;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; <span class="br0">&#93;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; <span class="br0">&#93;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; <span class="br0">&#91;</span><span class="st0">&quot;245&quot;</span>, <span class="st0">&quot;1&quot;</span>, <span class="st0">&quot;0&quot;</span>, </div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; <span class="br0">&#91;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; <span class="br0">&#91;</span><span class="st0">&quot;a&quot;</span>, <span class="st0">&quot;Capitalism, primitive and modern;&quot;</span><span class="br0">&#93;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; <span class="br0">&#91;</span><span class="st0">&quot;b&quot;</span>, <span class="st0">&quot;some aspects of Tolai economic growth&quot;</span> <span class="br0">&#93;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; <span class="br0">&#91;</span><span class="st0">&quot;c&quot;</span>, <span class="st0">&quot;[by] T. Scarlett Epstein.&quot;</span><span class="br0">&#93;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; <span class="br0">&#93;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; <span class="br0">&#93;</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="br0">&#93;</span></div></li>
<li class="li1"><div class="de1"><span class="br0">&#125;</span></div></li></ol></div>]]></content:encoded>
			<wfw:commentRss>http://robotlibrarian.billdueber.com/marc-hash-the-saga-continues-now-with-even-less-structure/feed/</wfw:commentRss>
		</item>
		<item>
		<title>MARC-HASH control field, now with less structure</title>
		<link>http://robotlibrarian.billdueber.com/marc-hash-control-field-now-with-less-structure/</link>
		<comments>http://robotlibrarian.billdueber.com/marc-hash-control-field-now-with-less-structure/#comments</comments>
		<pubDate>Wed, 15 Apr 2009 15:23:44 +0000</pubDate>
		<dc:creator>Bill</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://robotlibrarian.billdueber.com/marc-hash-control-field-now-with-less-structure/</guid>
		<description><![CDATA[Why do I ever, ever think that MARC might not rely on order? I don&#8217;t know.

In any case, control fields will now be just an array of duples:

control: &#91;
&#160; &#91;&#39;001&#39;, &#39;value of the 001&#39;&#93;,
&#160; &#91;&#39;006&#39;, &#39;value of the 006&#39;&#93;,
&#160; &#91;&#39;006&#39;, &#39;another 006&#39;&#93;
&#93;]]></description>
			<content:encoded><![CDATA[<p>Why do I ever, ever think that <acronym title="MAchine Readable Cataloging">MARC</acronym> might not rely on order? I don&#8217;t know.</p>

<p>In any case, control fields will now be just an array of duples:</p>

<div class="geshi no javascript"><ol><li class="li1"><div class="de1">control: <span class="br0">&#91;</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="br0">&#91;</span><span class="st0">&#39;001&#39;</span>, <span class="st0">&#39;value of the 001&#39;</span><span class="br0">&#93;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; <span class="br0">&#91;</span><span class="st0">&#39;006&#39;</span>, <span class="st0">&#39;value of the 006&#39;</span><span class="br0">&#93;</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="br0">&#91;</span><span class="st0">&#39;006&#39;</span>, <span class="st0">&#39;another 006&#39;</span><span class="br0">&#93;</span></div></li>
<li class="li1"><div class="de1"><span class="br0">&#125;</span></div></li></ol></div>]]></content:encoded>
			<wfw:commentRss>http://robotlibrarian.billdueber.com/marc-hash-control-field-now-with-less-structure/feed/</wfw:commentRss>
		</item>
		<item>
		<title>MARC-Hash: a proposed format for JSON/YAML/Whatever-compatible MARC records</title>
		<link>http://robotlibrarian.billdueber.com/marc-hash-a-proposed-format-for-jsonyamlwhatever-compatible-marc-records/</link>
		<comments>http://robotlibrarian.billdueber.com/marc-hash-a-proposed-format-for-jsonyamlwhatever-compatible-marc-records/#comments</comments>
		<pubDate>Mon, 13 Apr 2009 19:28:00 +0000</pubDate>
		<dc:creator>Bill</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://robotlibrarian.billdueber.com/marc-hash-a-proposed-format-for-jsonyamlwhatever-compatible-marc-records/</guid>
		<description><![CDATA[In my first shot at MARC-in-JSON, which I appropriately (and prematurely) named MARC-JSON, I made a point of losing round-tripability (to and from MARC) in order to end up with a nice, easy-to-work-with data structure based mostly on hashes. &#8220;Who really cares what order the subfields come in?&#8221; I asked myself.

Well, of course, it turns [...]]]></description>
			<content:encoded><![CDATA[<p>In my first shot at <acronym title="MAchine Readable Cataloging">MARC</acronym>-in-JSON, which I appropriately (and prematurely) named <a href="http://code.google.com/p/marc-json/"><acronym title="MAchine Readable Cataloging">MARC</acronym>-JSON</a>, I made a point of losing round-tripability (to and from <acronym title="MAchine Readable Cataloging">MARC</acronym>) in order to end up with a nice, easy-to-work-with data structure based mostly on hashes. &#8220;Who really cares what order the subfields come in?&#8221; I asked myself.</p>

<p>Well, of course, it turns out some people do. Some even care about the order of the tags. &#8220;Only in the 500s&#8230;usually&#8221; I was told today. All my lovely dreams of using easy-to-access hashes went up in so much smoke.</p>

<p>So&#8230;I&#8217;m suggesting we try something a little simpler. Something so brain-dead, in fact, that I&#8217;m loath to put it down because it&#8217;s pretty much the obvious way to do it. To wit:</p>

<div class="geshi no javascript"><ol><li class="li1"><div class="de1"><span class="br0">&#123;</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="st0">&quot;type&quot;</span> : <span class="st0">&quot;marc-hash&quot;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; <span class="st0">&quot;version&quot;</span> : <span class="br0">&#91;</span><span class="nu0">1</span>, <span class="nu0">0</span><span class="br0">&#93;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; <span class="st0">&quot;leader&quot;</span> : <span class="st0">&quot;leader string&quot;</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="st0">&quot;control&quot;</span> : <span class="br0">&#91;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp;<span class="br0">&#91;</span><span class="st0">&quot;001&quot;</span>, <span class="br0">&#91;</span><span class="st0">&quot;all&quot;</span>, <span class="st0">&quot;001&quot;</span>, <span class="st0">&quot;values&quot;</span><span class="br0">&#93;</span><span class="br0">&#93;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp;<span class="br0">&#91;</span><span class="st0">&quot;002&quot;</span>, <span class="br0">&#91;</span><span class="st0">&quot;all&quot;</span>, <span class="st0">&quot;002&quot;</span>, <span class="st0">&quot;values&quot;</span><span class="br0">&#93;</span><span class="br0">&#93;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; <span class="br0">&#93;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; <span class="st0">&quot;data&quot;</span> : <span class="br0">&#91;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; <span class="br0">&#91;</span><span class="st0">&quot;010&quot;</span>, <span class="st0">&quot; &quot;</span>, <span class="st0">&quot; &quot;</span>, </div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; <span class="br0">&#91;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; <span class="br0">&#91;</span><span class="st0">&quot;a&quot;</span>, <span class="st0">&quot;68009499&quot;</span><span class="br0">&#93;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; <span class="br0">&#93;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; <span class="br0">&#93;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; <span class="br0">&#91;</span><span class="st0">&quot;035&quot;</span>, <span class="st0">&quot; &quot;</span>, <span class="st0">&quot; &quot;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; <span class="br0">&#91;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; <span class="br0">&#91;</span><span class="st0">&quot;a&quot;</span>, <span class="st0">&quot;(RLIN)MIUG0000733-B&quot;</span><span class="br0">&#93;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; <span class="br0">&#93;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; <span class="br0">&#93;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; <span class="br0">&#91;</span><span class="st0">&quot;035&quot;</span>, <span class="st0">&quot; &quot;</span>, <span class="st0">&quot; &quot;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; <span class="br0">&#91;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; <span class="br0">&#91;</span><span class="st0">&quot;a&quot;</span>, <span class="st0">&quot;(CaOTULAS)159818014&quot;</span><span class="br0">&#93;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; <span class="br0">&#93;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; <span class="br0">&#93;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; <span class="br0">&#91;</span><span class="st0">&quot;245&quot;</span>, <span class="st0">&quot;1&quot;</span>, <span class="st0">&quot;0&quot;</span>, </div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; <span class="br0">&#91;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; <span class="br0">&#91;</span><span class="st0">&quot;a&quot;</span>, <span class="st0">&quot;Capitalism, primitive and modern;&quot;</span><span class="br0">&#93;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; <span class="br0">&#91;</span><span class="st0">&quot;b&quot;</span>, <span class="st0">&quot;some aspects of Tolai economic growth&quot;</span> <span class="br0">&#93;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; <span class="br0">&#91;</span><span class="st0">&quot;c&quot;</span>, <span class="st0">&quot;[by] T. Scarlett Epstein.&quot;</span><span class="br0">&#93;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; <span class="br0">&#93;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; <span class="br0">&#93;</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="br0">&#93;</span></div></li>
<li class="li1"><div class="de1"><span class="br0">&#125;</span></div></li></ol></div>

<p>Stupid <acronym title="MAchine Readable Cataloging">MARC</acronym> allows all the stupid fields to stupid repeat and be out of stupid order and such, so it&#8217;s just a lot of arrays. Easily round-tripable.</p>

<p>Why bother? Excellent question, and one that&#8217;s a little harder to answer now that the data structure requires so much looping to find anything (the first time, anyway). I guess it&#8217;s still a lot easier than working with raw <acronym title="MAchine Readable Cataloging">MARC</acronym> (or, I would claim, <acronym title="MAchine Readable Cataloging">MARC</acronym>-<acronym title="Extensible Markup Language">XML</acronym>), requires no special libraries in any language that supports strings, hashes, and arrays, and can be manipulated with basic language constructs.</p>
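
<p>To make the &#8220;no special libraries&#8221; claim concrete, here&#8217;s a small Python sketch that parses a record with the standard json module and pulls out subfield values with plain loops. The sample record and the helper name are invented for illustration.</p>

```python
import json

# A tiny marc-hash record in the control/data shape shown above (sample values).
record = json.loads("""
{
  "type": "marc-hash",
  "version": [1, 0],
  "leader": "leader string",
  "control": [["001", ["000152772"]]],
  "data": [
    ["245", "1", "0", [["a", "Capitalism, primitive and modern;"],
                        ["b", "some aspects of Tolai economic growth"]]]
  ]
}
""")


def subfield_values(record, tag, code):
    """Collect every value of subfield `code` in fields tagged `tag`, in order."""
    return [
        value
        for field_tag, _ind1, _ind2, subfields in record["data"]
        if field_tag == tag
        for sub_code, value in subfields
        if sub_code == code
    ]


titles = subfield_values(record, "245", "a")
print(titles)
```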

<p>A few things worth noting about the assumptions in my mind:</p>

<ul>
  <li>By definition, it&#8217;s always UTF-8. The leader should be changed to note this on the sending end, but it&#8217;s not required.</li>
  <li>We include both a type &#8220;marc-hash&#8221;, and a version with major/minor numbers.</li>
  <li>Everything is a string.</li>
  <li>Alpha characters in indicators/tags are all lowercased.</li>
  <li>A control field is a duple: tag and array of values.</li>
  <li>A data field has four values:
    <ol>
      <li>The tag</li>
      <li>Indicator one</li>
      <li>Indicator two</li>
      <li>An array of duples: subfield and its value</li>
    </ol></li>
</ul>
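
<p>Those assumptions are easy to check mechanically. A rough Python sketch, with a made-up function name and made-up error messages, might look like this:</p>

```python
def check_marc_hash(record):
    """Return a list of problems found; an empty list means the record looks fine."""
    problems = []
    if record.get("type") != "marc-hash":
        problems.append("type is not 'marc-hash'")
    version = record.get("version")
    if not (isinstance(version, list) and len(version) == 2):
        problems.append("version is not a [major, minor] pair")
    if not isinstance(record.get("leader"), str):
        problems.append("leader is not a string")

    # A control field is a duple: tag and array of values.
    for field in record.get("control", []):
        if len(field) != 2 or not isinstance(field[1], list):
            problems.append("bad control field: %r" % (field,))

    # A data field has four values: tag, two indicators, array of duples.
    for field in record.get("data", []):
        if len(field) != 4 or not isinstance(field[3], list):
            problems.append("bad data field: %r" % (field,))
            continue
        tag, ind1, ind2, subfields = field
        if not all(isinstance(x, str) for x in (tag, ind1, ind2)):
            problems.append("non-string tag or indicator: %r" % (field[:3],))
        elif (tag + ind1 + ind2) != (tag + ind1 + ind2).lower():
            problems.append("uppercase tag or indicator in %r" % (tag,))
        for sub in subfields:
            if len(sub) != 2 or not all(isinstance(x, str) for x in sub):
                problems.append("bad subfield in %s: %r" % (tag, sub))
    return problems


good = {
    "type": "marc-hash", "version": [1, 0], "leader": "leader string",
    "control": [["001", ["000152772"]]],
    "data": [["010", " ", " ", [["a", "68009499"]]]],
}
print(check_marc_hash(good))
```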

<h2>A simple transformation to make it a little more queryable</h2>

<p>Let&#8217;s say you don&#8217;t give a damn about tags that appear out of order, because that&#8217;s just a crime against nature, anyway. And you really don&#8217;t care what order the subtags appear in most of the time, &#8217;cause really, who does?</p>

<p>A simple run-through (pseudocode ahead):</p>

<pre>
  my marchash = getTheMarcHash();
  my kindamarc;
  kindamarc{leader} = marchash{leader};
  
  # Map the control fields by tag => array-of-values
  foreach cfield (marchash{control}) {
    kindamarc{control}{cfield[0]} ||= [];
    # cfield[1] is itself an array of values, so append each one
    foreach val (cfield[1]) {
      kindamarc{control}{cfield[0]}.push(val);
    }
  }
  
  foreach d (marchash{data}) {
    (tag, ind1, ind2) = (d[0], d[1], d[2]);
    
    # build up a hash of subfield => array-of-values for this tag
    newd = {};
    foreach subfield (d[3]) {
      (stag, sval) = subfield;
      newd{stag} ||= [];
      newd{stag}.push(sval);
    }
    
    # Store the subfield hash in a few places so it's easy to find.
    foreach i1 ('*', ind1) {
      foreach i2 ('*', ind2) {
        kindamarc{data}{tag}{i1}{i2} ||= [];
        kindamarc{data}{tag}{i1}{i2}.push(newd);
      }
    }
  }
</pre>

<p>Control fields are stored as arrays of values associated with the tag. Data fields are built up as a hash of subfield to array-of-values pairs, and then stored both based on the indicator given and the wildcard indicator &#8216;*&#8217;.</p>
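
<p>For anyone who&#8217;d rather run something than read pseudocode, here&#8217;s one way the same transformation might look in Python. It&#8217;s a sketch under the assumptions described above (control values arrive as arrays, subfield values are collected into arrays); the function name is invented.</p>

```python
def to_kindamarc(marchash):
    """Reshape a marc-hash record into nested dicts for quick lookups."""
    kindamarc = {"leader": marchash["leader"], "control": {}, "data": {}}

    # Control fields: tag -> list of values.
    for tag, values in marchash["control"]:
        kindamarc["control"].setdefault(tag, []).extend(values)

    for tag, ind1, ind2, subfields in marchash["data"]:
        # Subfield code -> list of values, since subfields can repeat.
        newd = {}
        for code, value in subfields:
            newd.setdefault(code, []).append(value)

        # File the field under its real indicators and under the '*' wildcard.
        # Using a set avoids filing it twice if an indicator is literally '*'.
        for i1 in {"*", ind1}:
            for i2 in {"*", ind2}:
                by_ind1 = kindamarc["data"].setdefault(tag, {}).setdefault(i1, {})
                by_ind1.setdefault(i2, []).append(newd)
    return kindamarc


# Sample record with invented values.
sample = {
    "leader": "leader string",
    "control": [["001", ["abc"]]],
    "data": [
        ["856", "4", "1", [["u", "http://example.org/a"]]],
        ["856", "4", "0", [["u", "http://example.org/b"]]],
    ],
}
kindamarc = to_kindamarc(sample)
print(kindamarc["data"]["856"]["*"]["1"])
```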

<p>Basically, this will allow things like this:</p>

<div class="geshi no perl"><ol><li class="li1"><div class="de1">&nbsp; <span class="re0">$leader</span> = <span class="re0">$kindamarc</span><span class="br0">&#123;</span>leader<span class="br0">&#125;</span>;</div></li>
<li class="li1"><div class="de1">&nbsp; <span class="re0">$first001</span> = <span class="re0">$kindamarc</span><span class="br0">&#123;</span>control<span class="br0">&#125;</span><span class="br0">&#123;</span><span class="st0">&quot;001&quot;</span><span class="br0">&#125;</span><span class="br0">&#91;</span><span class="nu0">0</span><span class="br0">&#93;</span>;</div></li>
<li class="li1"><div class="de1">&nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; <span class="co1"># Find 856s where indicator 2 is &#39;1&#39;</span></div></li>
<li class="li1"><div class="de1">&nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; <span class="re0">@mystuff</span> = <span class="re0">$kindamarc</span><span class="br0">&#123;</span>data<span class="br0">&#125;</span><span class="br0">&#123;</span><span class="nu0">856</span><span class="br0">&#125;</span><span class="br0">&#123;</span><span class="st0">&#39;*&#39;</span><span class="br0">&#125;</span><span class="br0">&#123;</span><span class="nu0">1</span><span class="br0">&#125;</span>;</div></li></ol></div>

<p>It&#8217;s easy to see how we could store the index from the original array to make it easy to find the original order, too.</p>
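
<p>A minimal Python sketch of that idea: group the data fields by tag, remember where each one sat in the original array, and rebuild the original order on demand. The dict layout and function names are assumptions made for the example.</p>

```python
def index_fields(marchash):
    """Group data fields by tag, keeping each field's original position."""
    by_tag = {}
    for position, (tag, ind1, ind2, subfields) in enumerate(marchash["data"]):
        by_tag.setdefault(tag, []).append(
            {"position": position, "tag": tag,
             "ind1": ind1, "ind2": ind2, "subfields": subfields}
        )
    return by_tag


def original_order(by_tag):
    """Rebuild the marc-hash "data" list in its original order."""
    entries = [entry for group in by_tag.values() for entry in group]
    entries.sort(key=lambda entry: entry["position"])
    return [[e["tag"], e["ind1"], e["ind2"], e["subfields"]] for e in entries]


# Sample fields deliberately out of tag order (invented values).
data = [
    ["500", " ", " ", [["a", "a note that comes first"]]],
    ["245", "1", "0", [["a", "Title"]]],
    ["500", " ", " ", [["a", "another note"]]],
]
by_tag = index_fields({"data": data})
print([entry["position"] for entry in by_tag["500"]])
```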

<p>For many, I&#8217;m sure, the prospect of dealing with something like this is more daunting than just learning to use <acronym title="MAchine Readable Cataloging">MARC</acronym>-<acronym title="Extensible Markup Language">XML</acronym> or using existing libraries to deal with straight <acronym title="MAchine Readable Cataloging">MARC</acronym>. But there seems to be a set of folks out there for whom this might be useful, so I&#8217;m throwing it out there.</p>]]></content:encoded>
			<wfw:commentRss>http://robotlibrarian.billdueber.com/marc-hash-a-proposed-format-for-jsonyamlwhatever-compatible-marc-records/feed/</wfw:commentRss>
		</item>
		<item>
		<title>A plea: use Solr to normalize your data</title>
		<link>http://robotlibrarian.billdueber.com/a-plea-use-solr-to-normalize-your-data/</link>
		<comments>http://robotlibrarian.billdueber.com/a-plea-use-solr-to-normalize-your-data/#comments</comments>
		<pubDate>Mon, 30 Mar 2009 20:22:47 +0000</pubDate>
		<dc:creator>Bill</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://robotlibrarian.billdueber.com/?p=56</guid>
		<description><![CDATA[[Only, of course, if you're using Solr. Otherwise, that'd be dumb.]

We&#8217;ve been working on Mirlyn2-Beta, our installation of VuFind for some time now (don&#8217;t let the fancy-pants name scare you off), and the further we get into it, the more obvious it is that I want to move as much data normalization into Solr itself [...]]]></description>
			<content:encoded><![CDATA[<p>[Only, of course, if you're using Solr. Otherwise, that'd be dumb.]</p>

<p>We&#8217;ve been working on <a title="Mirlyn2-Beta Library Catalog at the University of Michigan University Library" href="http://mirlyn2-beta.lib.umich.edu/">Mirlyn2-Beta</a>, our installation of VuFind, for some time now (don&#8217;t let the fancy-pants name scare you off), and the further we get into it, the more obvious it is that I want to move as much data normalization into Solr itself as possible.</p>

<p>Arguments about how much business logic to move into the database layer, in the form of foreign-key requirements, cascading inserts and deletes, stored procedures, etc. are as old as the features themselves. Solid arguments for and against are made on all sides, and like all things, there&#8217;s a happy middle ground for most people. <sup class='footnote'><a href='#fn-56-1' id='fnref-56-1'>1</a></sup></p>

<p>But Solr provides an incredibly compelling use case because it allows for data transformation at both index and query time via the use of custom analyzers (or a standard analyzer with text filters applied). We&#8217;re starting to migrate our schema to use more and more of these things, and I even went so far as to create a custom text filter for LCCNs after being <a href="http://bibwild.wordpress.com/2009/03/11/normalize-your-lccns/">inspired by Jonathan Rochkind.</a></p>

<p>The incentive is easy to see: client diversity. Let a thousand interfaces bloom, if you can give them all access to the same underlying Solr instance. And, seriously, how many times are you going to write that regexp to semi-normalize ISBNs and ISSNs, huh? Enough already.</p>

<p>If you&#8217;re using a Solr nightly (and, really, you should be &#8212; faceting is <em>so</em> much faster than the official 1.3 release) you have access to regexp-based filters as well, which makes stuff like this really, really easy:</p>

<div class="geshi no enc_xml"><ol><li class="li1"><div class="de1">&nbsp; &nbsp;&lt;!-- Simple type to normalize isbn/issn/other standard numbers --&gt;</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &lt;fieldType name=&quot;stdnum&quot; class=&quot;solr.TextField&quot; sortMissingLast=&quot;true&quot; omitNorms=&quot;true&quot; &gt;</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &lt;analyzer&gt;</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &lt;tokenizer class=&quot;solr.KeywordTokenizerFactory&quot;/&gt; </div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &lt;filter class=&quot;solr.LowerCaseFilterFactory&quot;/&gt;</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &lt;filter class=&quot;solr.TrimFilterFactory&quot;/&gt;</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &lt;filter class=&quot;solr.PatternReplaceFilterFactory&quot;</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;pattern=&quot;^0*([\d\-\.]+[xX]?).*$&quot; replacement=&quot;$1&quot; </div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; /&gt;</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &lt;filter class=&quot;solr.PatternReplaceFilterFactory&quot;</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;pattern=&quot;[\-\.]&quot; replacement=&quot;&quot; &nbsp;replace=&quot;all&quot;</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; /&gt;</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &lt;/analyzer&gt;</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &lt;/fieldType&gt;</div></li></ol></div>

<p>Here, we use the <em>KeywordTokenizerFactory</em> which, not so intuitively, produces a single token from the input. Then we lowercase it and pull off any leading and trailing spaces (Trim).</p>

<p>For those of you that don&#8217;t read regexp, we then match anything that looks like:</p>

<ol>
<li>Any number of leading zeros</li>
<li>&#8230;followed by any number of digits, dashes, or periods and an optional &#8216;X&#8217;</li>
<li>&#8230;followed by&#8230;well, we don&#8217;t care. Anything else.</li>
</ol>

<p>&#8230;and throw away all but the stuff in #2. Then take <em>that</em> and throw away all the dashes and dots, and you&#8217;re left with a string of digits, possibly ending in an x.</p>

<p>The beauty is that it happens both while the index is being made <em>and</em> during query time, so if your user types in &#8220; 123-45-6-X  &#8221; it will be normalized to 123456x, and then checked against your index.</p>
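
<p>If you want to check what a given input turns into without firing up Solr, the same chain can be approximated in a few lines of Python. This is only a sketch that mirrors the filters above (lowercase, trim, then the two pattern replacements); the function name <em>normalize_stdnum</em> is made up, and Solr&#8217;s Java regexps may differ from Python&#8217;s in corner cases.</p>

```python
import re

# Same two patterns as the fieldType above
KEEP_NUMBER = re.compile(r"^0*([\d\-.]+[xX]?).*$")
DASHES_AND_DOTS = re.compile(r"[\-.]")


def normalize_stdnum(raw):
    """Approximate the stdnum analyzer chain on a single input string."""
    token = raw.lower().strip()            # LowerCaseFilter, then TrimFilter
    token = KEEP_NUMBER.sub(r"\1", token)  # keep only the number-ish part
    return DASHES_AND_DOTS.sub("", token)  # drop every dash and dot


print(normalize_stdnum(" 123-45-6-X  "))  # prints 123456x
```

<p>Input that doesn&#8217;t look like a number at all fails the first pattern, so it passes through with only the dashes and dots removed.</p>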

<p>This is simple stuff, and probably doesn&#8217;t deserve the virtual ink I&#8217;m providing for it, but VuFind out of the box doesn&#8217;t do any of this sort of thing (likely because &#8220;the box&#8221; existed before it was super-easy to do this), and we all should be doing it.</p>

<div class='footnotes'><div class='footnotedivider'></div><ol><li id='fn-56-1'>&#8220;Most,&#8221; in this case, excluding the old-time MySQL fanboys who took it as gospel that all data validation and manipulation belongs in the application layer, because their &#8220;database&#8221; didn&#8217;t do any of it. February 30th in a date field, anyone? <span class='footnotereverse'><a href='#fnref-56-1'>&#8617;</a></span></li></ol></div>]]></content:encoded>
			<wfw:commentRss>http://robotlibrarian.billdueber.com/a-plea-use-solr-to-normalize-your-data/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Enough with the freakin&#8217; LC Call Number normalization!</title>
		<link>http://robotlibrarian.billdueber.com/enough-with-the-freakin-lc-call-number-normalization/</link>
		<comments>http://robotlibrarian.billdueber.com/enough-with-the-freakin-lc-call-number-normalization/#comments</comments>
		<pubDate>Wed, 18 Mar 2009 15:29:35 +0000</pubDate>
		<dc:creator>Bill</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://robotlibrarian.billdueber.com/?p=54</guid>
		<description><![CDATA[OK. I&#8217;m done with it, and this time I mean it.

I&#8217;ve updated and improved the lc normalization code, documented the algorithm, and put it all into Google Code. In the next couple weeks, I&#8217;ll be turning it into a Solr text filter so we can do some decent sorting on call-number search results.]]></description>
			<content:encoded><![CDATA[<p>OK. I&#8217;m done with it, and this time I mean it.</p>

<p>I&#8217;ve updated and improved the lc normalization code, documented the algorithm, and put it all into <a title="Google Code Repository for LC Normalization algorithm and code" href="http://code.google.com/p/library-callnumber-lc/">Google Code</a>. In the next couple weeks, I&#8217;ll be turning it into a Solr text filter so we can do some decent sorting on call-number search results.</p>]]></content:encoded>
			<wfw:commentRss>http://robotlibrarian.billdueber.com/enough-with-the-freakin-lc-call-number-normalization/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Ask, and you shall receive, and it shall be AWESOME!</title>
		<link>http://robotlibrarian.billdueber.com/ask-and-you-shall-receive-and-it-shall-be-awesome/</link>
		<comments>http://robotlibrarian.billdueber.com/ask-and-you-shall-receive-and-it-shall-be-awesome/#comments</comments>
		<pubDate>Thu, 12 Feb 2009 22:00:57 +0000</pubDate>
		<dc:creator>Bill</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://robotlibrarian.billdueber.com/?p=51</guid>
		<description><![CDATA[The good folks at ticTocs heard the call for open data, and they responded&#8230;exactly as I asked them to. Which makes me think I should have asked for a pony, too, but I&#8217;m still very, very happy!

Anyone can now download a simple tab-delimited text file describing all the journal table of contents RSS files they&#8217;ve [...]]]></description>
			<content:encoded><![CDATA[<p>The good folks at <a href="http://www.tictocs.ac.uk">ticTOCs</a> heard the call for open data, and they responded&#8230;exactly as I asked them to. Which makes me think I should have asked for a pony, too, but I&#8217;m still very, very happy!</p>

<p>Anyone can now download a <a href="http://www.tictocs.ac.uk/text.php">simple tab-delimited text file</a> describing all the journal table of contents <acronym title="Really Simple Syndication">RSS</acronym> files they&#8217;ve assembled, for use however anyone wants.</p>

<p>The data include issns and eissns (where available), the title of the journal, and of course the <acronym title="Uniform Resource Locator">URL</acronym> of the <acronym title="Really Simple Syndication">RSS</acronym>/Atom/Whatever feed.</p>

<p>The feeds themselves are all over the map: it&#8217;s whatever the publisher decides to provide, which might include abstract/volume/number/doi, or might just be the title of the article. But regardless, they represent data that are useful to our patrons and are now available in a format that&#8217;s easy to exploit.</p>

<p>So&#8230;go to it?</p>]]></content:encoded>
			<wfw:commentRss>http://robotlibrarian.billdueber.com/ask-and-you-shall-receive-and-it-shall-be-awesome/feed/</wfw:commentRss>
		</item>
		<item>
		<title>TicTocs: Give us a file! Pretty pretty pretty please!</title>
		<link>http://robotlibrarian.billdueber.com/tictocs-give-us-a-file-pretty-pretty-pretty-please/</link>
		<comments>http://robotlibrarian.billdueber.com/tictocs-give-us-a-file-pretty-pretty-pretty-please/#comments</comments>
		<pubDate>Mon, 02 Feb 2009 16:00:24 +0000</pubDate>
		<dc:creator>Bill</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://robotlibrarian.billdueber.com/?p=45</guid>
		<description><![CDATA[For those who haven&#8217;t heard, ticTOCs is a service that provides web-based access to a database of Journal RSS/Atom Table of Contents feeds. Awesome.

In their blog at News from TicTocs, a post titled I want to be completely honest with you about ticTOCs notes that:

As for the API - yes, we’ve been asked this several times, [...]]]></description>
			<content:encoded><![CDATA[<p>For those who haven&#8217;t heard, <a href="http://www.tictocs.ac.uk/">ticTOCs</a> is a service that provides web-based access to a database of Journal <acronym title="Really Simple Syndication">RSS</acronym>/Atom Table of Contents feeds. Awesome.</p>

<p>In their blog at <a href="http://tictocsnews.wordpress.com/">News from TicTocs</a>, a post titled <a title="Permanent Link to I want to be completely honest with you about ticTOCs" rel="bookmark" href="http://tictocsnews.wordpress.com/2009/01/27/i-want-to-be-completely-honest-with-you-about-tictocs/">I want to be completely honest with you about ticTOCs</a> notes that:</p>

<blockquote>As for the <acronym title="Application Programming Interface">API</acronym> - yes, we’ve been asked this several times, and the answer is that it is currently being written and should be available very soon.</blockquote>

<p>That&#8217;s great, but writing in a comment on that post (after logging in with a very, very old OpenID &#8212; I used to have a blog named <em>Opachyderm</em>, a name which I thought was insufferably clever), I noted that we don&#8217;t need an <acronym title="Application Programming Interface">API</acronym> right away.</p>

<p>What we need is a text file.</p>

<p>Simple. Tab-delimited. TicTocID,Title,<acronym title="Uniform Resource Locator">URL</acronym>,issn,eissn. Update it every night.</p>

<p>That&#8217;s all we need.</p>

<p>We can do the rest. Put it in the <acronym title="Online Public Access Catalog">OPAC</acronym>. Stick it on our SFX pages. Not screw around with Javascript/<acronym title="Asynchronous JavaScript and XML">AJAX</acronym> calls when the data we need are (relatively) static and (absolutely) simple.</p>
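
<p>To give a sense of how little &#8220;the rest&#8221; takes, here&#8217;s a Python sketch that turns such a file into an ISSN lookup. The column order is an assumption taken from the wish list above (the file doesn&#8217;t exist yet), and <em>index_by_issn</em> is a made-up name.</p>

```python
import csv
import io

# Assumed column order: TicTocID, Title, URL, issn, eissn
COLUMNS = ["tictoc_id", "title", "url", "issn", "eissn"]


def index_by_issn(tsv_text):
    """Map every non-empty ISSN or eISSN to its journal title and feed URL."""
    feeds = {}
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != len(COLUMNS):
            continue  # skip blank or malformed lines
        record = dict(zip(COLUMNS, row))
        for key in ("issn", "eissn"):
            number = record[key].strip()
            if number:
                feeds[number] = {"title": record["title"],
                                 "url": record["url"]}
    return feeds
```

<p>Rebuild the dictionary once a night from the fresh file, and every catalog or SFX page can look up a feed by ISSN without calling anyone&#8217;s server.</p>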

<p>Someone needed to put a web interface on those data, and the one provided at ticTocs is really nice. I&#8217;m glad it&#8217;s there.</p>

<p>And I can&#8217;t tell you how much I applaud the JISC for starting this project and getting vendors on board. That&#8217;s always the hard part &#8212; participation and standardization. They&#8217;re doing it, and I couldn&#8217;t be happier.</p>

<p>But these data are incredibly valuable,  and their value is currently limited because they&#8217;re boxed up.</p>

<p>Spreading these data far and wide is good for scholarship, and I can&#8217;t imagine the case that could be made showing it&#8217;s better for JISC to keep them at a single endpoint.</p>

<p>The knee-jerk reaction is always, I know, to keep things behind a wall, even if it&#8217;s a short wall. &#8220;Things will get out of sync if people have their own copies.&#8221; Or, &#8220;We&#8217;ll provide whatever access you need, as fast as you need it, honest.&#8221; Or, &#8220;We&#8217;re going to be providing value-added services on top of the data.&#8221;</p>

<p>It&#8217;s all true. Things will get out of sync, but that&#8217;s going to happen whether or not you discourage people from caching results. And I don&#8217;t doubt for a moment that the <acronym title="Application Programming Interface">API</acronym> provided will be great. And of course you&#8217;ll be in a position where you can provide value-added services.</p>

<p>But so can the rest of us.</p>

<p>I&#8217;ve run into this myself. I fear&#8230;well, let&#8217;s be honest. I fear providing a service, having the data stripmined, and then having no one appreciate the front-end I put on it. I do this job for the fame, not the fortune. Obviously.</p>

<p>But I&#8217;ll never provide services as fast as me plus three hundred other geeks, all responding to different situations and servicing different patrons.</p>

<p>So&#8230;provide an <acronym title="Application Programming Interface">API</acronym>. Start simple: a single call named <em>getCurrentTextFile</em>. Or maybe add <em>getCurrentTextFileGzipped</em>. It&#8217;s only ten-thousand lines of text, probably less than 75k gzipped up. I promise to call it every night about 3am local time so I&#8217;m up-to-date.</p>

<p>So&#8230;pretty please? With sugar on top? My catalog is waiting. So is my SFX install. And our list of ejournals. And our subject guides. And lots of pages on our website. And our pre-packaged <acronym title="Outline Processor Markup Language">OPML</acronym> files to offer students and professors. And a thousand yet-to-be-devised services as well.</p>

<p>Pretty pretty pretty please???</p>]]></content:encoded>
			<wfw:commentRss>http://robotlibrarian.billdueber.com/tictocs-give-us-a-file-pretty-pretty-pretty-please/feed/</wfw:commentRss>
		</item>
		<item>
		<title>Five rules to make your open source more open</title>
		<link>http://robotlibrarian.billdueber.com/five-rules-to-make-your-open-source-more-open/</link>
		<comments>http://robotlibrarian.billdueber.com/five-rules-to-make-your-open-source-more-open/#comments</comments>
		<pubDate>Sun, 25 Jan 2009 21:40:00 +0000</pubDate>
		<dc:creator>Bill</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://robotlibrarian.billdueber.com/?p=38</guid>
		<description><![CDATA[[I've noticed that a sure way to get people to look at stuff (as measured by, say, digg) is to include a number. So I did. Five. ]

Over at Bibliographic Wilderness, Jonathan Rochkind has a great followup to an ongoing discussion on the Blacklight list called How to build shared open source in which he [...]]]></description>
			<content:encoded><![CDATA[<p>[I've noticed that a sure way to get people to look at stuff (as measured by, say, <a href="http://digg.com/">digg</a>) is to include a number. So I did. Five. ]</p>

<p>Over at <a href="http://bibwild.wordpress.com/">Bibliographic Wilderness</a>, Jonathan Rochkind has a great followup to an ongoing discussion on the Blacklight list called <em><a href="http://bibwild.wordpress.com/2009/01/06/how-to-build-shared-open-source/">How to build shared open source</a></em> in which he tackles some of the differences between open-sourcing your code (a legal and distribution issue) and actually making it so someone else can usefully contribute to your code.</p>

<p>The project I&#8217;m spending most of my time on right now, <a href="http://vufind.org/">VuFind</a>, is a great piece of functional code but, in many ways, a nightmare in terms of trying to contribute code and abstract out local functionality. This isn&#8217;t meant as a slam on the main contributor(s) to VuFind (Andrew, especially, seems to be an almost frighteningly productive coder), but my experiences trying to customize the code to our local situation have given me a lot of time to think about how I <em>wish</em> things had been architected.</p>

<p>So, here I give some general rules and some specifics as to What I Wish I Had To Work With.</p>

<h3>1. Abstraction</h3>

<p><strong>General rule:</strong> Abstract things out as much as makes sense</p>

<p><strong>Specific rule:</strong> Abstract <em>the living crap</em> out of your authentication scheme.</p>

<p>Look, pretty much everyone with anything worth protecting already has an auth/authZ infrastructure in place. Sometimes an extensive, perhaps multi-institutional infrastructure. One that isn&#8217;t going to be bypassed without, say, getting fired.</p>

<p>So if you&#8217;re going to require people to log in, make sure you make that process as abstract as you possibly can, both in algorithm and in code. Have a singleton class that&#8217;s easily subclassed to represent your user, and call it exclusively. Make sure that your URIs are easily separated into those that require auth and those that don&#8217;t, for simple use of mod_rewrite or whatnot to redirect to authentication. Make sure it&#8217;s easy to hook into (or work around) <acronym title="Asynchronous JavaScript and XML">AJAX</acronym> links that might require authentication that has expired.</p>
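
<p>As a rough illustration of what I mean, here&#8217;s a Python sketch. It isn&#8217;t a strict singleton; instead there&#8217;s a single function, <em>current_user</em>, that is the only way the application gets a user, plus one hook for swapping in a local subclass. All the names are made up, and the request is assumed to be a plain dictionary of server variables.</p>

```python
class User:
    """Base class: the rest of the application only ever talks to this API."""

    def __init__(self, request):
        self.request = request  # assumed to be a dict of server variables

    def username(self):
        """Return the logged-in username, or None for an anonymous visitor."""
        raise NotImplementedError

    def is_authenticated(self):
        return self.username() is not None


class RemoteUser(User):
    """Trust a username already set by the web server's own auth layer."""

    def username(self):
        return self.request.get("REMOTE_USER") or None


_user_class = RemoteUser  # the one place a local install plugs in its own class


def set_user_class(cls):
    """Swap in a site-specific User subclass."""
    global _user_class
    _user_class = cls


def current_user(request):
    """The only way application code should obtain a user object."""
    return _user_class(request)
```

<p>A site with its own single sign-on writes one subclass, calls <em>set_user_class</em> once at startup, and never touches the rest of the code.</p>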

<p>And for the love of god, don&#8217;t stuff username/password information into a cookie if you&#8217;re doing web work. Use a session and session key. Any auth scheme that <em>I</em> can spoof is no auth scheme at all, because I&#8217;m an idiot and not even trying hard.</p>

<h3>2. Configuration files</h3>

<p><strong>General rule:</strong> use config files for anything local</p>

<p><strong>Specific rule:</strong> Use a configuration file format that can represent complex data</p>

<p>That&#8217;s right, I&#8217;m looking at you, <em>.ini</em> and <em>.properties</em> files.</p>

<p>Use something like YAML, or <acronym title="Extensible Markup Language">XML</acronym>, or even straight programming-language code (i.e., a file with a <acronym title="PHP: Hypertext Preprocessor">PHP</acronym> hash or a perl hashref or whatnot) that can actually represent, in a logical way, the complexities of the stuff you need to configure. And then, again, have a singleton class that will read that data and expose it in a useful and safe way.</p>

<p>And include a semantics checker if you can manage to write one.  It&#8217;ll save everyone a load of trouble.</p>

<p><em>Huge bonus points</em> if your configuration singleton class can read from multiple files, overriding previous (default) definitions with subsequent (local) ones.</p>
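
<p>Here&#8217;s a minimal Python sketch of that layering, under a few assumptions: JSON stands in for YAML so that only the standard library is needed, the class and method names are made up, and nested sections are merged key by key so a local file only has to mention what it changes.</p>

```python
import json


def deep_merge(base, override):
    """Return a new dict where override wins, merging nested dicts."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


class Config:
    """One shared settings store for the whole application."""

    _settings = {}

    @classmethod
    def load(cls, *paths):
        """Read files in order; later (local) files override earlier ones."""
        settings = {}
        for path in paths:
            with open(path, encoding="utf-8") as handle:
                settings = deep_merge(settings, json.load(handle))
        cls._settings = settings

    @classmethod
    def get(cls, dotted_key, default=None):
        """Look up a value such as 'solr.host' without raising on a miss."""
        node = cls._settings
        for part in dotted_key.split("."):
            if not isinstance(node, dict) or part not in node:
                return default
            node = node[part]
        return node
```

<p>Something like <em>Config.load("defaults.json", "local.json")</em> at startup would then give every other piece of code the same merged view (both file names are hypothetical).</p>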

<h3>3. Hide subapplications</h3>

<p><strong>General rule:</strong> Don&#8217;t force your user to intimately understand every piece of every library/application you include</p>

<p><strong>Specific rule:</strong> Generate configuration information for sublibraries/applications</p>

<p>This might be a little specific to the project I&#8217;m working on now, which uses <a href="http://lucene.apache.org/solr/">Solr</a> as a backend, but I think it applies more generally.</p>

<p><em>If</em> you&#8217;re using a non-brain-dead configuration file format, and <em>if</em> you can assume reasonable defaults, <em>then</em> generate configuration files for your user. A low-level extreme of this is the traditional unix <em>autoconf</em>, which essentially allows you to install software without knowing a damn thing about your own system. Which is useful to those of us that don&#8217;t.</p>

<p>In VuFind, there are three files: a .properties file that specifies how to map <acronym title="MAchine Readable Cataloging">MARC</acronym> data into field names, Solr&#8217;s <em>schema.xml</em> that describes the structure and behavior of those same fields, and an <acronym title="eXtensible Stylesheet Language Transformations">XSLT</acronym> stylesheet that pretties the data as it comes out of Solr to make it easier to work with. As you might expect, the overlap in data is about 80% across the three of them, and it would be a bazillion times easier to have a single file that generated all three.</p>

<p>OK. Maybe not a <em>ba</em>zillion, because if it was that easy, I&#8217;d have taken a couple hours to write the code to do it already. Let&#8217;s say just a <em>zillion</em> times easier.</p>

<p>The caveat to this is that you need to either make sure your config file specification is complete enough to encompass everything all the other files might need to know (bad), or that the other config files can import subsections that override your defaults (good).</p>

<h3>4. Testing</h3>

<p><strong>General rule:</strong> practice test-first (TDD or BDD) development</p>

<p><strong>Specific rule:</strong> write your code in such a way that it&#8217;s testable</p>

<p>Look, we all know we should spend the first three weeks writing eight thousand tests to describe every corner of the code. And we all have bosses that will ask, every morning about 10:30am, &#8220;So, what do you have that you can show me?&#8221;</p>

<p>Not everyone is going to be able to write tests first. That&#8217;s not right, it&#8217;s not smart, but it&#8217;s the way the world works. But at least <em>put in the hooks</em> so someone else can come along and write tests.</p>

<p>Writing tests is one of the easiest ways that a newbie can come along to a project and instantly contribute in a meaningful way. But if you&#8217;re constantly calling global variables, depending on live database connections and not providing a way to mock them up, or throwing fatal errors if every subsystem isn&#8217;t present no matter the context, then it&#8217;s going to be hard to write tests.  So hard, in fact, that not only will you not do it, but neither will anyone else.</p>
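
<p>The simplest hook I know of is to hand a class its connection rather than letting it grab a global one. A small Python sketch, with every name invented for the example and the &#8220;index&#8221; assumed to be anything with a <em>query</em> method:</p>

```python
class CatalogSearch:
    """Takes its index connection as an argument instead of using a global."""

    def __init__(self, index):
        self.index = index  # anything with a query(text) method

    def titles_for(self, text):
        return [doc["title"] for doc in self.index.query(text)]


class FakeIndex:
    """Stand-in for a live connection, for use in tests."""

    def __init__(self, docs):
        self.docs = docs
        self.queries = []  # remembers what was asked, for assertions

    def query(self, text):
        self.queries.append(text)
        return [doc for doc in self.docs
                if text.lower() in doc["title"].lower()]


# A test needs no running server at all:
fake = FakeIndex([{"title": "Moby Dick"}, {"title": "Walden"}])
search = CatalogSearch(fake)
assert search.titles_for("moby") == ["Moby Dick"]
```

<p>In production you pass in the real connection; the class under test never knows the difference.</p>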

<h3>5. Error handling</h3>

<p><strong>General rule:</strong> provide a sane, hierarchical set of error classes and hooks to catch them as necessary</p>

<p><strong>Specific rule:</strong> THROW SOME GODDAMN ERRORS!!!!</p>
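
<p>Taken together, the two rules might look something like this in Python. The class names and the <em>fetch_title</em> helper are invented for the example; the point is one root class that callers can catch, and a raise at the spot where things go wrong.</p>

```python
class CatalogError(Exception):
    """Root of the hierarchy: catch this to catch anything the app raises."""


class ConfigError(CatalogError):
    """A setting is missing or malformed."""


class BackendError(CatalogError):
    """A subsystem such as the search index failed or returned nonsense."""


class RecordNotFound(BackendError):
    """The backend answered, but had no such record."""


def fetch_title(records, record_id):
    """Fail where the problem is, not when the value is finally printed."""
    record = records.get(record_id)
    if record is None:
        raise RecordNotFound(f"no record with id {record_id!r}")
    if "title" not in record:
        raise BackendError(f"record {record_id!r} has no title field")
    return record["title"]
```

<p>A caller that only cares about missing records catches <em>RecordNotFound</em>; a top-level handler catches <em>CatalogError</em> and logs the rest.</p>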

<p>Don&#8217;t be an idiot. Things will fail. In the absence of Design by Contract or somesuch, errors will happen. Throw them. Catch them. But at least throw them, instead of letting your code die six hundred lines later with a &#8220;Cannot cast null value to string&#8221; when you finally get around to trying to print something out.</p>]]></content:encoded>
			<wfw:commentRss>http://robotlibrarian.billdueber.com/five-rules-to-make-your-open-source-more-open/feed/</wfw:commentRss>
		</item>
		<item>
		<title>And then I finally shut the hell up</title>
		<link>http://robotlibrarian.billdueber.com/and-then-i-finally-shut-the-hell-up/</link>
		<comments>http://robotlibrarian.billdueber.com/and-then-i-finally-shut-the-hell-up/#comments</comments>
		<pubDate>Mon, 08 Dec 2008 14:37:51 +0000</pubDate>
		<dc:creator>Bill</dc:creator>
		
		<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://robotlibrarian.billdueber.com/?p=32</guid>
		<description><![CDATA[I had a great &#8212; great! I tell you &#8212; 30 second conversation with Ken Varnum (of RSS4Lib fame) that went something like this (much paraphrasing, obviously):

B: You&#8217;re gonna have to fix that interface. The standard header won&#8217;t work.
K: Well, no, we&#8217;re going to leave it as it is.
B: It&#8217;s not gonna work.
K: We&#8217;ve decided to [...]]]></description>
			<content:encoded><![CDATA[<p>I had a great &#8212; great! I tell you &#8212; 30 second conversation with Ken Varnum (of <a title="Ken Varnum's RSS4Lib Blog" href="http://www.rss4lib.com/">RSS4Lib</a> fame) that went something like this (much paraphrasing, obviously):</p>

<blockquote>B: You&#8217;re gonna have to fix that interface. The standard header won&#8217;t work.<br/>
K: Well, no, we&#8217;re going to leave it as it is.<br/>
B: It&#8217;s not gonna work.<br/>
K: We&#8217;ve decided to make it all consistent.<br/>
B: OK, you can keep saying that, but I&#8217;m really, really smart and I say users are going to be confused.<br/>
K: We&#8217;ve done user testing. They weren&#8217;t confused. And here&#8217;s our plan to see if they are confused once we go live.
</blockquote>

<p>And then I finally shut the hell up. While I&#8217;m never crazy about being just plain wrong, it was so so SO refreshing to have someone say, &#8220;Well, actually, we&#8217;re making this decision based on data and not just pulling answers out of our pants like so many flying monkeys.&#8221;</p>

<p>Where, oh where in the library is the dedication to making actual data-based decisions? Besides Ken&#8217;s office, I mean?</p>]]></content:encoded>
			<wfw:commentRss>http://robotlibrarian.billdueber.com/and-then-i-finally-shut-the-hell-up/feed/</wfw:commentRss>
		</item>
	</channel>
</rss>

