Robot Librarian 2014-01-30T00:00:00Z Bill Dueber
Help me test yet another LC Callnumber parser 2014-01-30T00:00:00Z 2014-09-09T15:10:58-04:00 Bill Dueber
<p>Those who have followed this blog and my code for a while know that I have a <a href="">long</a>, slightly <a href="">sad</a>, and borderline <a href="">abusive</a> relationship with Library of Congress call numbers. </p>
<p>They're a freakin' nightmare. They just are.</p>
<p>But, based on the premise that Sisyphus was a quitter, I took another stab at it, this time writing a real (PEG) parser instead of trying to futz with extended regular expressions.</p>
<p>The results, so far, aren't too bad. </p>
<p>The gem is called <a href="">lc_callnumber</a>, but more importantly, I've put together a little Heroku app that lets you play with it and then correct any incorrect parses (or tell me that it worked correctly) to build up a test suite.</p>
<p>So…<a href="">Please try to break my LC Callnumber parser</a>! </p>
<p>[Code for the app itself is <a href="">on GitHub</a>; pull requests for both the app and the gem joyously received]</p>
New blog front- and back-end 2013-12-17T00:00:00Z 2014-01-31T10:24:53-05:00 Bill Dueber
<p>A while back, <a href="">Dreamhost</a> had some problems and my blog and assorted other websites I help keep track of went down.</p>
<p>For more than two weeks.</p>
<p>Now, I understand that crap happens. And I understand that sometimes lots of things happen at once. But fundamentally, their infrastructure is such that they could lose everything on a machine and be unable to get it back for more than two weeks. I'm not a mathematician, but that's not "five-nines" service. </p>
<p>So, I decided to start hunting around for another provider. And then I got distracted by the idea that maybe having my blog in <a href="">WordPress</a> was more trouble than it was worth.
There's something to be said for simplicity, especially since all I really wanted to do was throw up posts written in Markdown with code samples.</p>
<p>I got a few pointers toward using <a href="">middleman</a>, a pre-processor that takes in almost anything and produces regular CSS/HTML. Between that and <a href="">Disqus</a> for the comments, well, this just seems easier. And now that I've put in the effort, it'll be easier to actually get blog posts up and, most importantly, to move everything over when I find a new hosting provider. </p>
<p>Feel free to tell me how ugly it is and suggest improvements. I have the design skills of a one-eyed poodle. </p>
Announcing "traject" indexing software 2013-10-14T00:00:00Z 2014-01-31T10:24:53-05:00 Bill Dueber
<p>[Over the next few days I'll be writing a series of posts that highlight a new indexing solution by Jonathan Rochkind and me called <code>traject</code> that we're using to index MARC data into Solr. This is the introduction.]</p>
<p>Wow. Six months since I posted here. What have I been doing?</p>
<p>Well, mostly parenting, but in the last few weeks I was lucky enough to get on board with a project started by <a href="">Jonathan Rochkind</a> for a new JRuby-based tool optimized for indexing MARC data into Solr. You know, kinda like solrmarc, but JRuby.</p>
<h3 id="whats-it-look-like">What's it look like?</h3>
<p>I encourage you to take a look at a <a href="">little sample setup</a> I put together for instructional purposes. It's based on the HathiTrust catalog indexing scheme and shows off about 85% of what <code>traject</code> can do.
Clone it and go through the README and the two indexing files to get a taste of how things are put together.</p>
<p>Real quickly, though, here's a sample configuration file that pulls the ID, title, and authors (if any) out of a file of MARC records and sends them to a file as JSON objects, one record per line (i.e., newline-delimited JSON):</p>
<div><div class="CodeRay"> <div class="code"><pre><span class="comment"># we'll pretend this file is called 'sample.rb'</span>
require <span class="string"><span class="delimiter">'</span><span class="content">traject</span><span class="delimiter">'</span></span>
require <span class="string"><span class="delimiter">'</span><span class="content">traject/marc_reader</span><span class="delimiter">'</span></span>
require <span class="string"><span class="delimiter">'</span><span class="content">traject/json_writer</span><span class="delimiter">'</span></span>

<span class="comment"># It's just ruby, so I can have comments!</span>
<span class="comment"># Here we set up which reader/writer to use and so on</span>
settings <span class="keyword">do</span>
  provide <span class="string"><span class="delimiter">&quot;</span><span class="content">reader_class_name</span><span class="delimiter">&quot;</span></span>, <span class="string"><span class="delimiter">&quot;</span><span class="content">Traject::MarcReader</span><span class="delimiter">&quot;</span></span>
  provide <span class="string"><span class="delimiter">&quot;</span><span class="content">writer_class_name</span><span class="delimiter">&quot;</span></span>, <span class="string"><span class="delimiter">&quot;</span><span class="content">Traject::JsonWriter</span><span class="delimiter">&quot;</span></span>
  provide <span class="string"><span class="delimiter">&quot;</span><span class="content">output_file</span><span class="delimiter">&quot;</span></span>, <span class="string"><span class="delimiter">&quot;</span><span class="content">basics.ndj</span><span class="delimiter">&quot;</span></span>
  provide <span class="string"><span class="delimiter">'</span><span class="content">processing_thread_pool</span><span class="delimiter">'</span></span>, <span class="integer">3</span>
<span class="keyword">end</span>

<span class="comment"># It's *still* just ruby, so I can declare a variable!</span>
idfield = <span class="string"><span class="delimiter">'</span><span class="content">001</span><span class="delimiter">'</span></span>

<span class="comment"># ...and then use it to find the ID</span>
to_field <span class="string"><span class="delimiter">&quot;</span><span class="content">id</span><span class="delimiter">&quot;</span></span>, extract_marc(idfield, <span class="symbol">:first</span> =&gt; <span class="predefined-constant">true</span>)

<span class="comment"># Now the other data</span>
to_field <span class="string"><span class="delimiter">&quot;</span><span class="content">title</span><span class="delimiter">&quot;</span></span>, extract_marc(<span class="string"><span class="delimiter">'</span><span class="content">245</span><span class="delimiter">'</span></span>)
to_field <span class="string"><span class="delimiter">&quot;</span><span class="content">author</span><span class="delimiter">&quot;</span></span>, extract_marc(<span class="string"><span class="delimiter">'</span><span class="content">100abcd:110abcd:111abc</span><span class="delimiter">'</span></span>)

<span class="comment"># You'd run this as:</span>
<span class="comment"># traject -c sample.rb myfile.mrc</span>
</pre></div> </div> </div>
<p>That's simplistic, of course, but it should drive home the point that we strove to make sure traject <em>makes the easy stuff easy</em>.
For a more complex example, look at <a href="">the heavily-annotated index.rb file</a> in the sample project.</p>
<h3 id="why-use-or-move-to-traject">Why use (or move to) traject?</h3>
<p>First off, you can and should look at <a href="">the announcement</a> and/or <a href="">the README</a> for a longer answer, but I'll tell you why <em>I</em> use <code>traject</code> in one word: </p>
<p>Flexibility.</p>
<p>After a year or so of struggling with <a href="">solrmarc</a> (often due to my lack of Java-fu), and then even more years after that using my own, home-grown <a href="">marc2solr</a>, the things I most wanted were the ability to decouple the various components from each other, to rely on code instead of configuration, and basically just to know that I can up the complexity of my code without paying an enormous price.</p>
<p>I'm fast with Ruby. And the architecture of <code>traject</code> allows me to easily build and test my transformations in isolation, with tools I'm good with, and with debugging output that's easy to read by eye or process by machine.</p>
<h3 id="what-does-it-have-out-of-the-box">What does it have out of the box?</h3>
<p>One advantage <code>traject</code> has that my previous system didn't is, well, years of struggling with my previous system. I've learned a lot about what I need, what needs to be easy, and how I want to think about indexing.</p>
<p>The nature of <code>traject</code> is that "a reader" sends "a record" to "an indexer," which produces a key=&gt;value hash and sends <em>that</em> to "a writer." Obviously, this is a pretty abstract setup; it's not hard to see how it could be used for all sorts of transformations (e.g., I'm already thinking about a simple gem that would provide macros to index CSV or tab-delimited files into Solr. Or maybe going to/from a database).</p>
<p>But Jonathan and I are, mostly, stuck dealing with MARC data and Solr.
So here's what we get:</p>
<p><strong>Readers</strong>: MARC readers for MARC21 binary and MARC-XML based on both ruby-marc and marc4j (the latter allowing you to deal with encoding transformations and the like). An NDJ reader (for one marc-in-json structure per line in a file – that's what we use for HathiTrust). And we've already got a couple of gems for people with other needs: <a href="">traject_alephsequential_reader</a> for those who need to deal with AlephSequential, and Jonathan's new <a href="">horizon reader</a> for efficiently pulling records right out of your Horizon ILS, if you happen to run one.</p>
<p><strong>Transforming Macros</strong>: A traject indexing step is just a well-formed ruby block (or lambda), which makes writing macros ridiculously easy. Traject ships with most of what you'd commonly need to deal with MARC: extracting data based on tag/subfield/indicators (or a substring of a fixed field), dealing with non-filing characters, automatically dealing with 880 linked fields. Mucking with publication dates. Dealing with languages, formats, etc. And, of course, doing it all with multiple threads, because who wants to see all those lovely cores go to waste?</p>
<p><strong>Writers</strong>: Of course, you can write to Solr, using the excellent <code>solrj</code> Java library. And you can do it in multiple threads, to keep things fast. But there's also the <code>DebugWriter</code> to spit stuff out in a human-readable format, and the <code>JsonWriter</code> mentioned above to spit stuff out in a <em>machine</em>-readable format. And building your own writer is literally just a couple of methods. </p>
<h3 id="how-do-i-get-a-taste">How do I get a taste?</h3>
<p>Like I said, clone and play with <a href="">the sample project</a>. And ask me questions, either here or via email.
After years of being the only person running my indexing software, I'm anxious to try to build up a community around <code>traject</code>.</p>
Come work at the University of Michigan 2013-04-18T00:00:00Z 2014-01-31T10:24:53-05:00 Bill Dueber
<p>The Library has three UX positions available right now – an interface designer, an interface developer, and a web content strategist.</p>
<p>Come join me at what is easily the best place I've ever worked! <a href="">Full details are over at Suz's blog</a>.</p>
Please: don't return your books 2013-02-12T00:00:00Z 2014-01-31T10:24:53-05:00 Bill Dueber
<p>So, I'm at <a href="">code4lib 2013</a> right now, where side conversations and informal exchanges tend to be the most interesting part.</p>
<p>Last night I had a conversation with the inimitable <a href="">Michael B. Klein</a>, and after complaining about faculty members who keep books out for <em>decades</em> at a time, we ended up asking a simple question:</p>
<blockquote> <p>How much more shelving would we need if everyone returned their books?</p> </blockquote>
<p>Assuming we could get them all checked in and such, well, where would we put them?</p>
<p>I'm looking at this in the simplest, most conservative way possible:</p>
<ul> <li>Assume they're all paperbacks, so we don't worry about how thick a cover is (cover thickness = 0)</li> <li>Assume items for which we don't have page count information are "average"</li> </ul>
<h3 id="starting-data">Starting data</h3>
<p>What's my current situation at Michigan?</p>
<ul> <li>Total bibs: about 10M (but that includes a bunch of HathiTrust items and other electronic-only items that could never be checked out)</li> <li>Total items checked out right now: 162,080</li> </ul>
<p>The first problem I run into is that I don't know how many pages are in a given book.
Well, in theory I can look in MARC field 300$a, and it will tell me.</p>
<h3 id="finding-the-number-of-pages-in-a-book">Finding the number of pages in a book</h3>
<p>I went through a recent dump of all our records and pulled out page counts from the 300 (those that matched the regular expression <code>$$a\d+\s+[pP].</code>).</p>
<p>Problem solved, right? Well, kind of.</p>
<ul> <li>3,085,433 total bibs with page count data (about 30%)</li> <li>40,872 checked out items with page count data (about 25%)</li> </ul>
<p>OK, so I don't have data for everything. Plus, some of those are multi-volume works that list the total page count, even though only a single volume may be checked out.</p>
<p>We'll have to drop down into statistics:</p>
<ul> <li>Average number of pages in a checked-out item: 270</li> <li>Median number of pages in a checked-out item: 244</li> </ul>
<p>The median is lower, so we'll go with that. Being conservative, remember?</p>
<h3 id="bringing-it-all-together">Bringing it all together</h3>
<p>Obviously we need to make a lot of assumptions.</p>
<ul> <li>All paperbacks (== no space allowance for covers)</li> <li>244 pages per item (the median of checked out items for which we have data)</li> <li>Pages = 244 * 162,080 = 39,547,520 pages</li> </ul>
<h3 id="sowhats-the-damage">So…what's the damage?</h3>
<p>But how to do the calculation?</p>
<p>It turns out that if you simply google <a href=";amp;amp;hl=en&amp;amp;amp;safe=off&amp;amp;amp;tbo=d&amp;amp;amp;noj=1&amp;amp;amp;site=webhp&amp;amp;amp;source=hp&amp;amp;amp;q=book+spine+width+calculator&amp;amp;amp;oq=book+spine+widt">book spine width calculator</a>, a few come up.</p>
<p>I picked one, input 39,547,520 pages, and assumed 50lb paper (the lightest paper in the tool).</p>
<p><strong>Total width: 77,241.25 inches, or 6437 feet, or 1.22 miles</strong></p>
<h3 id="miles">1.22 miles???</h3>
<p>Well, we had a lot of assumptions, but most of them were pretty conservative.
And I have no idea if the book spine calculator is at all accurate.</p>
<p>But…it's gonna be a big number no matter what. Add in that many of them are hardcover, and this seems like a pretty good guess at a lower bound.</p>
<h3 id="what-is-this-good-for-again">What is this good for again?</h3>
<p>Oh, nothing at all. Just a little fun while I'm at code4lib.</p>
<h3 id="next-steps">Next steps</h3>
<p>Well, the best next step would be to walk away. This is a huge waste of time.</p>
<p>But…we could look in the 020s for a hint of whether it's hardcover or paperback (which is <a href="">really hard</a>). And maybe try to figure out if multiple volumes of a multi-volume work are all checked out and take that into account.</p>
<p>But really: this is enough for me. Whether Michael wants to pursue it further on his own, well, that's up to him.</p>
Ruby sidebar: Using rvm on the shebang (#!) line in a script 2012-05-04T00:00:00Z 2014-01-31T10:24:53-05:00 Bill Dueber
<p>Just throwing this up here because I didn't find it elsewhere.</p>
<p>I want to run ruby scripts from the command line or in a cronjob, and I do <em>not</em> want to always have to type "ruby scriptname".</p>
<p><em>But</em>, I use <a href="">rvm</a>.
I want to run a particular ruby, maybe identified by an alias, maybe with a specific gemset.</p>
<p>It turns out you can use the <code>env</code> program with <code>rvm do</code> to accomplish this.</p>
<div><div class="CodeRay"> <div class="code"><pre><span class="comment">#!/usr/bin/env rvm 1.9 do ruby</span>

require <span class="string"><span class="delimiter">'</span><span class="content">mygem</span><span class="delimiter">'</span></span>
o = <span class="constant">MyGem</span>.new
<span class="comment"># blah blah blah</span>
</pre></div> </div> </div>
<p>In this example, <code>1.9</code> is the name of the ruby (actually, an rvm alias) I want to use, and it could just as easily specify a gemset as well (e.g., 1.9@mygems).</p>
<p>If you're running in cron, don't forget you need to load the environment variables first. Here I use the bash <code>.</code> command to source my <code>.bashrc</code>.</p>
<pre><code>54 9-16 * * 1-5 . /Users/dueberb/.bashrc; /Users/dueberb/bin/exercise
</code></pre>
<p>Nothing fancy, but worth knowing.</p>
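<p>If you want to double-check which ruby a script like this actually ends up running under, a tiny check script along these lines may help. It's only a sketch: the <code>interpreter_report</code> helper name is made up for illustration, and the shebang line just reuses the <code>1.9</code> alias from the example above, so swap in whatever alias or gemset you actually use.</p>

```ruby
#!/usr/bin/env rvm 1.9 do ruby
# Hypothetical check script: report which interpreter this file is running under.
# The "1.9" alias on the shebang line is an assumption carried over from the post.
require 'rbconfig'

# Collect a few facts about the current ruby (helper name is illustrative only)
def interpreter_report
  {
    'version'  => RUBY_VERSION,    # version of the ruby that is executing this file
    'path'     => RbConfig.ruby,   # full path to the running ruby binary
    'gem_home' => ENV['GEM_HOME']  # rvm sets this per ruby/gemset; nil if unset
  }
end

# Print one "key: value" line per fact
interpreter_report.each do |key, value|
  puts "#{key}: #{value.inspect}"
end
```

<p>Make it executable, run it once from your shell and once from cron, and compare the two reports; if they differ, cron probably isn't loading your rvm environment.</p>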