<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Robot Librarian</title>
	<atom:link href="http://robotlibrarian.billdueber.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://robotlibrarian.billdueber.com</link>
	<description>Disclaimer: I'm not actually a robot.</description>
	<lastBuildDate>Thu, 18 Apr 2013 01:34:39 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.5.1</generator>
		<item>
		<title>Come work at the University of Michigan</title>
		<link>http://robotlibrarian.billdueber.com/come-work-at-the-university-of-michigan/</link>
		<comments>http://robotlibrarian.billdueber.com/come-work-at-the-university-of-michigan/#comments</comments>
		<pubDate>Thu, 18 Apr 2013 01:34:39 +0000</pubDate>
		<dc:creator>Bill</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://robotlibrarian.billdueber.com/?p=569</guid>
		<description><![CDATA[The Library has three UX positions available right now &#8212; interface designer, interface developer, and a web content strategist. Come join me at what is easily the best place I&#8217;ve ever worked! Full details are over at Suz&#8217;s blog.]]></description>
				<content:encoded><![CDATA[<p>The Library has three UX positions available right now &#8212; interface designer, interface developer, and a web content strategist.</p>

<p>Come join me at what is easily the best place I&#8217;ve ever worked! <a href="http://userslib.com/2013/04/16/ux-and-web-systems-job-postings-at-the-university-of-michigan-library/">Full details are over at Suz&#8217;s blog</a>.</p>

]]></content:encoded>
			<wfw:commentRss>http://robotlibrarian.billdueber.com/come-work-at-the-university-of-michigan/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Please: don&#8217;t return your books</title>
		<link>http://robotlibrarian.billdueber.com/please-dont-return-your-books/</link>
		<comments>http://robotlibrarian.billdueber.com/please-dont-return-your-books/#comments</comments>
		<pubDate>Tue, 12 Feb 2013 20:16:50 +0000</pubDate>
		<dc:creator>Bill</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://robotlibrarian.billdueber.com/?p=559</guid>
		<description><![CDATA[So, I&#8217;m at code4lib 2013 right now, where side conversations and informal exchanges tend to be the most interesting part. Last night I had a conversation with the inimitable Michael B. Klein, and after complaining about faculty members that keep books out for decades at a time, we ended up asking a simple question: &#62; [...]]]></description>
				<content:encoded><![CDATA[<p>So, I&#8217;m at <a href="http://code4lib.org/conference/2013">code4lib 2013</a> right now, where side conversations and informal exchanges tend to be the most interesting part.</p>

<p>Last night I had a conversation with the inimitable <a href="https://twitter.com/mbklein">Michael B. Klein</a>, and after complaining about faculty members that keep books out for <em>decades</em> at a time, we ended up asking a simple question:</p>

<p>&gt; How much more shelving would we need if everyone returned their books?</p>

<p>Assuming we could get them all checked in and such, well, where would we put them?</p>

<p>I&#8217;m looking at this in the simplest, most conservative way possible:</p>

<ul>
<li>Assume they&#8217;re all paperbacks, so we don&#8217;t worry about how thick a cover is (cover width = 0)</li>
<li>Assume items for which we don&#8217;t have page count information are &#8220;average&#8221;</li>
</ul>

<h2>Starting data</h2>

<p>What&#8217;s my current situation at Michigan?</p>

<ul>
<li>Total bibs: about 10M (but that includes a bunch of HathiTrust items and other electronic-only items that could never be checked out)</li>
<li>Total items checked out right now: 162,080</li>
</ul>

<p>The first problem I run into is that I don&#8217;t know how many pages are in a given book. Well, in theory I can look in <acronym title="MAchine Readable Cataloging">MARC</acronym> field 300$a, and it will tell me.</p>

<h2>Finding the number of pages in a book</h2>

<p>I went through a recent dump of all our records and pulled out page counts from the 300 (those that matched the regular expression $$a\d+\s+[pP]&#46;).</p>
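
<p>A minimal Ruby sketch of that kind of extraction, assuming the 300$a value has already been pulled out of the record as a plain string. The <code>page_count</code> helper and the sample strings are hypothetical, and the pattern is only an approximation of the one described above:</p>

```ruby
# Hypothetical helper: pull a leading page count out of a MARC 300$a string.
# Approximates the pattern described above: digits, whitespace, then "p." or "P."
PAGE_COUNT = /\A(\d+)\s+[pP]\./

def page_count(subfield_a)
  match = PAGE_COUNT.match(subfield_a.to_s.strip)
  match ? match[1].to_i : nil
end

puts page_count("244 p. : ill. ; 23 cm.").inspect  # 244
puts page_count("xii, 244 p.").inspect             # nil (does not start with digits)
puts page_count("1 online resource").inspect       # nil (no "p." after the number)
```

<p>Anything that doesn&#8217;t start with a plain number followed by &#8220;p.&#8221; comes back as <code>nil</code>, which is one reason so many records end up without usable page counts.</p>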

<p>Problem solved, right? Well, kind of:</p>

<ul>
<li>3,085,433 total bibs with page count data (about 30%)</li>
<li>40,872 checked out items with page count data (about 25%)</li>
</ul>

<p>OK, so I don&#8217;t have data for everything. Plus, some of those are multi-volume works that list the total page count, even though only a single volume may be checked out.</p>

<p>We&#8217;ll have to drop down into statistics:</p>

<ul>
<li>Average number of pages in a checked-out item: 270</li>
<li>Median number of pages in a checked-out item: 244</li>
</ul>

<p>The median is lower, so we&#8217;ll go with that. Being conservative, remember?</p>
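
<p>To see why the median can sit below the average, here&#8217;s a small Ruby sketch. The sample page counts are made up for illustration and are not drawn from our data:</p>

```ruby
# Plain mean and median; one very long book drags the mean up but not the median.
def mean(values)
  values.sum.to_f / values.size
end

def median(values)
  sorted = values.sort
  mid = sorted.size / 2
  sorted.size.odd? ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0
end

sample = [120, 200, 244, 260, 900]  # hypothetical page counts
puts mean(sample)    # 344.8
puts median(sample)  # 244
```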

<h2>Bringing it all together</h2>

<p>Obviously we need to make a lot of assumptions.</p>

<ul>
<li>All paperbacks (== no space allowance for covers)</li>
<li>244 pages per item (the median of checked out items for which we have data)</li>
<li>Pages = 244 * 162,080 = 39,547,520 pages</li>
</ul>

<h2>So&#8230;what&#8217;s the damage?</h2>

<p>But how to do the calculation?</p>

<p>It turns out that if you simply google <a href="https://www.google.com/search?num=30&amp;hl=en&amp;safe=off&amp;tbo=d&amp;noj=1&amp;site=webhp&amp;source=hp&amp;q=book+spine+width+calculator&amp;oq=book+spine+widt">book spine width calculator</a>, a few come up.</p>

<p>I picked one and input 39,547,520 pages and assumed 50lb paper (the lightest paper in the tool).</p>

<p><strong>Total width: 77,241.25 inches, or 6437 feet, or 1.22 miles</strong></p>
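
<p>The arithmetic can be rechecked with a few lines of Ruby. The 512 pages-per-inch figure here is an assumption back-calculated from the totals above (39,547,520 pages over 77,241.25 inches), not something taken from the calculator itself, and <code>shelf_estimate</code> is a hypothetical helper:</p>

```ruby
# Hypothetical helper: turn item and page counts into shelf length.
def shelf_estimate(items_out, pages_per_item, pages_per_inch)
  pages  = items_out * pages_per_item
  inches = pages / pages_per_inch.to_f
  feet   = inches / 12
  { pages: pages, inches: inches, feet: feet, miles: feet / 5280 }
end

estimate = shelf_estimate(162_080, 244, 512)  # 512 pages/inch is back-calculated
puts estimate[:pages]            # 39547520
puts estimate[:inches]           # 77241.25
puts estimate[:feet].round       # 6437
puts estimate[:miles].round(2)   # 1.22
```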

<h2>1.22 miles???</h2>

<p>Well, we had a lot of assumptions, but most of them were pretty conservative. And I have no idea if the book spine calculator is at all accurate.</p>

<p>But&#8230;it&#8217;s gonna be a big number no matter what. Add in that many of them are hardcover, and this seems like a pretty good guess at a lower end.</p>

<h2>What is this good for again?</h2>

<p>Oh, nothing at all. Just a little fun while I&#8217;m at code4lib.</p>

<h2>Next steps</h2>

<p>Well, the best next step would be to walk away. This is a huge waste of time.</p>

<p>But&#8230;we could look in the 020s for a hint of whether it&#8217;s hardcover or paperback (which is <a href="http://robotlibrarian.billdueber.com/isbn-parenthetical-notes-bad-marc-data-1/">really hard</a>). And maybe try to figure out if multiple volumes of a multi-volume work are all checked out and take that into account.</p>

<p>But really: this is enough for me. Whether Michael wants to pursue it further on his own, well, that&#8217;s up to him.</p>]]></content:encoded>
			<wfw:commentRss>http://robotlibrarian.billdueber.com/please-dont-return-your-books/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Ruby sidebar: Using rvm on the shebang (#!) line in a script</title>
		<link>http://robotlibrarian.billdueber.com/using-rvm-on-the-shebang-line-in-a-script/</link>
		<comments>http://robotlibrarian.billdueber.com/using-rvm-on-the-shebang-line-in-a-script/#comments</comments>
		<pubDate>Fri, 04 May 2012 14:58:01 +0000</pubDate>
		<dc:creator>Bill</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://robotlibrarian.billdueber.com/?p=537</guid>
		<description><![CDATA[Just throwing this up here because I didn&#8217;t find it elsewhere. I want to run ruby scripts from the command line or in a cronjob, and I do not want to always have to type &#8220;ruby scriptname&#8221;. But, I use rvm. I want to run a particular ruby, maybe identified by an alias, maybe with [...]]]></description>
				<content:encoded><![CDATA[<p>Just throwing this up here because I didn&#8217;t find it elsewhere.</p>

<p>I want to run ruby scripts from the command line or in a cronjob, and I do <em>not</em> want to always have to type &#8220;ruby scriptname&#8221;.</p>

<p><em>But</em>, I use <a href="https://rvm.io/">rvm</a>. I want to run a particular ruby, maybe identified by an alias, maybe with a specific gemset.</p>

<p>It turns out you can use the <code>env</code> program with <code>rvm do</code> to accomplish this.</p>

<div class="geshi no ruby"><ol><li class="li1"><div class="de1">&nbsp; &nbsp; <span class="co1">#!/usr/bin/env &nbsp;rvm 1.9 do ruby</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; <span class="kw3">require</span> <span class="st0">&#39;mygem&#39;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; o = MyGem.<span class="me1">new</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; <span class="co1"># blah blah blah</span></div></li></ol></div>

<p>In this example, <code>1.9</code> is the name of the ruby (actually, an rvm alias) I want to use, and it could just as easily specify a gemset as well (e.g., 1.9@mygems).</p>
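
<p>One way to check that the shebang line picked up the ruby you expected is to have the script report on itself. This is just a sketch; <code>ruby_report</code> is a hypothetical helper name, and whether <code>GEM_HOME</code> is set depends on your rvm setup:</p>

```ruby
require 'rbconfig'

# Hypothetical helper: describe the interpreter and gem environment in use.
def ruby_report
  [
    "ruby:     #{RbConfig.ruby}",
    "version:  #{RUBY_VERSION}",
    "GEM_HOME: #{ENV.fetch('GEM_HOME', '(not set)')}"
  ].join("\n")
end

puts ruby_report
```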

<p>If you&#8217;re running in cron, don&#8217;t forget you need to load the environment variables first. Here I use the bash <code>.</code> command to source my <code>.bashrc</code>.</p>

<p><pre>
  54 9-16 * * 1-5 . /Users/dueberb/.bashrc; /Users/dueberb/bin/exercise
</pre></p>

<p>Nothing fancy, but worth knowing.</p>]]></content:encoded>
			<wfw:commentRss>http://robotlibrarian.billdueber.com/using-rvm-on-the-shebang-line-in-a-script/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Boosting on Exactish (anchored) phrase matching in Solr: (SST #4)</title>
		<link>http://robotlibrarian.billdueber.com/boosting-on-exactish-anchored-phrase-matching-in-solr-sst-4/</link>
		<comments>http://robotlibrarian.billdueber.com/boosting-on-exactish-anchored-phrase-matching-in-solr-sst-4/#comments</comments>
		<pubDate>Mon, 19 Mar 2012 19:11:19 +0000</pubDate>
		<dc:creator>Bill</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[exact]]></category>
		<category><![CDATA[fieldtype]]></category>
		<category><![CDATA[solr]]></category>
		<category><![CDATA[Stupid Solr Tricks]]></category>

		<guid isPermaLink="false">http://robotlibrarian.billdueber.com/?p=517</guid>
		<description><![CDATA[Check out introduction to the Stupid Solr Tricks series if you&#8217;re just joining us.] Exact matching in Solr is easy. Use the default string type: all it does is, essentially, exact phrase matching. string is a great type for faceted values, where the only way we expect to search the index is via text pulled [...]]]></description>
				<content:encoded><![CDATA[<blockquote>
  <p>[Check out the <a href="http://robotlibrarian.billdueber.com/stupid-solr-tricks-introduction/">introduction to the Stupid <acronym title="Solr isn\'t an acronym, silly!">Solr</acronym> Tricks series</a> if you&#8217;re just joining us.]</p>
</blockquote>

<p><em>Exact</em> matching in <acronym title="Solr isn\'t an acronym, silly!">Solr</acronym> is easy. Use the default <em>string</em> type: all it does is, essentially, exact phrase matching. <em>string</em> is a great type for faceted values, where the only way we expect to search the index is via text pulled from the index itself. Query the index to get a value: use that value to re-query the index. Simple and self-contained.</p>

<p>But much of the time, we don&#8217;t want exact matching. We want <em>exactish</em> matching. You know, where things are exactly the same <em>except</em>. Except for case, or punctuation, or how much whitespace is between tokens. Maybe do some unicode folding, or stemming.</p>

<p>Essentially, we want to reward users (via high relevancy) for getting <em>really close</em>. If someone types in a full title, but misses a colon, well, let&#8217;s go  ahead and assume they want that particular item.</p>

<h2><em>Exactish</em> matching vs phrase matching</h2>

<p>Phrase matching in <acronym title="Solr isn\'t an acronym, silly!">Solr</acronym> does a great job, but fails those of us generating super-complex queries where we want to provide awesome service for those users doing <em>known-item queries</em>. If someone puts in the exact(ish) title, or the exact(ish) subject, well, those items should float to the top.</p>

<p><acronym title="Solr isn\'t an acronym, silly!">Solr</acronym>&#8217;s default phrase matching (via, say, the <code>pf</code> param in dismax or just putting your query in quotes) doesn&#8217;t differentiate between a phrase that matches the whole target string and only part of that target string. For this, we&#8217;ll need a decent text <code>fieldtype</code> and a way to &#8220;anchor&#8221; the search to both ends of the target string.</p>

<h2>Our goals</h2>

<p>We&#8217;re shooting for:</p>

<ul>
<li>A useful text type that we can use all over the place</li>
<li>A phrase match against that field that will match any portion of the target text. <acronym title="Solr isn\'t an acronym, silly!">Solr</acronym> already does this &#8212; that&#8217;s a normal <acronym title="Solr isn\'t an acronym, silly!">Solr</acronym> phrase search.</li>
<li>A &#8220;fully anchored&#8221; text type that will only phrase match if the query string exactishly-matches the whole field. We&#8217;ll phrase-search on this field and boost it way up.</li>
<li>And, what the heck, a left-anchored version that will exactish match a phrase only at the start of a field. We&#8217;ll boost this one up a bit less.</li>
</ul>

<h2>Follow along at home</h2>

<p>Go ahead and clone the <a href="https://billdueber@github.com/billdueber/solr_stupid_tricks">github repo</a> I&#8217;ve been using  if you haven&#8217;t already and let&#8217;s dig in.</p>

<div class="geshi no bash"><ol><li class="li1"><div class="de1"><span class="kw3">cd</span> solr_stupid_tricks</div></li>
<li class="li1"><div class="de1">git pull origin master</div></li>
<li class="li1"><div class="de1">git fetch --all</div></li>
<li class="li1"><div class="de1">git checkout SST4</div></li>
<li class="li1"><div class="de1">java -jar start.jar <span class="sy0">&amp;</span></div></li></ol></div>

<p>There are some additions to the <code>schema.xml</code> file; let&#8217;s take a look!</p>

<h2>Step 1: get a decent text type</h2>

<p>The recent nightly build of <acronym title="Solr isn\'t an acronym, silly!">Solr</acronym> 3.x we&#8217;re using has a great tokenizer in ICUTokenizerFactory, which does &#8220;the right thing&#8221; across a whole host of languages.</p>

<div class="geshi no xml"><ol><li class="li1"><div class="de1"><span class="sc3"><span class="re1">&lt;fieldtype</span> <span class="re0">name</span>=<span class="st0">&quot;text&quot;</span> <span class="re0">class</span>=<span class="st0">&quot;solr.TextField&quot;</span> <span class="re0">positionIncrementGap</span>=<span class="st0">&quot;1000&quot;</span><span class="re2">&gt;</span></span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="sc3"><span class="re1">&lt;analyzer<span class="re2">&gt;</span></span></span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; <span class="sc3"><span class="re1">&lt;tokenizer</span> <span class="re0">class</span>=<span class="st0">&quot;solr.ICUTokenizerFactory&quot;</span><span class="re2">/&gt;</span></span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; <span class="sc3"><span class="re1">&lt;filter</span> <span class="re0">class</span>=<span class="st0">&quot;solr.ICUFoldingFilterFactory&quot;</span><span class="re2">/&gt;</span></span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; <span class="sc3"><span class="re1">&lt;filter</span> <span class="re0">class</span>=<span class="st0">&quot;solr.SynonymFilterFactory&quot;</span> </div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="re0">synonyms</span>=<span class="st0">&quot;syn.txt&quot;</span> <span class="re0">ignoreCase</span>=<span class="st0">&quot;true&quot;</span> <span class="re0">expand</span>=<span class="st0">&quot;false&quot;</span><span class="re2">/&gt;</span></span></div></li>
<li class="li1"><div class="de1"><span class="sc3"><span class="coMULTI">&lt;!-- &lt;filter class=&quot;solr.WordDelimiterFilterFactory&quot; </span></div></li>
<li class="li1"><div class="de1"><span class="coMULTI">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;generateWordParts=&quot;1&quot; generateNumberParts=&quot;1&quot; </span></div></li>
<li class="li1"><div class="de1"><span class="coMULTI">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;catenateWords=&quot;1&quot; catenateNumbers=&quot;1&quot; catenateAll=&quot;0&quot;/&gt;</span></div></li>
<li class="li1"><div class="de1"></span> --&gt;</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; <span class="sc3"><span class="re1">&lt;filter</span> <span class="re0">class</span>=<span class="st0">&quot;solr.CJKWidthFilterFactory&quot;</span><span class="re2">/&gt;</span></span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; <span class="sc3"><span class="re1">&lt;filter</span> <span class="re0">class</span>=<span class="st0">&quot;solr.CJKBigramFilterFactory&quot;</span><span class="re2">/&gt;</span></span> </div></li>
<li class="li1"><div class="de1">&nbsp; <span class="sc3"><span class="re1">&lt;/analyzer<span class="re2">&gt;</span></span></span></div></li>
<li class="li1"><div class="de1"><span class="sc3"><span class="re1">&lt;/fieldtype<span class="re2">&gt;</span></span></span></div></li></ol></div>

<p>Let&#8217;s take it bit by bit:</p>

<ul>
<li>Obviously, start with the <code>ICUTokenizer</code> with a large positionIncrementGap so we can do some of the tricks we talked about <a href="http://robotlibrarian.billdueber.com/requiringpreferring-searches-that-dont-span-multiple-values-sst-3/">last time</a></li>
<li>Next, we get one-stop shopping with the <code>ICUFoldingFilterFactory</code>. It provides all of the following:

<ul>
<li>NFKC normalization (precomposing), </li>
<li>Unicode case folding (i.e., lowercasing)</li>
<li>search term folding (removing accents, etc).</li>
</ul></li>
<li>Push in synonyms if you have any</li>
<li>Uncomment the <code>WordDelimiterFilterFactory</code> if you want to. I&#8217;m going to try to avoid it, since it messes with the number of tokens midstream and I worry about the effect on dismax and its <code>mm</code> parameter as <a href="http://bibwild.wordpress.com/2011/06/15/more-dismax-gotchas-varying-field-analysis-and-mm/">explained so excellently by Jonathan Rochkind</a></li>
<li><a href="http://www.hathitrust.org/blogs/large-scale-search/multilingual-issues-part-1-word-segmentation">Dealing with CJK (Chinese, Japanese, Korean) is hard</a>. The CJK filters process those languages and provide overlapping bigrams so searching isn&#8217;t (I&#8217;m told) quite as painful. (I really, really recommend the above link for a great overview by Tom Burton-West).</li>
</ul>
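
<p>To get a feel for what the folding steps do, here&#8217;s a rough Ruby stand-in. It is only an approximation built from Ruby&#8217;s own Unicode support; the real <code>ICUFoldingFilterFactory</code> handles far more cases, and <code>fold</code> is a hypothetical helper name:</p>

```ruby
# Rough stand-in for normalization + accent removal + case folding.
def fold(text)
  decomposed = text.unicode_normalize(:nfkd)     # split accents off base letters
  stripped   = decomposed.gsub(/\p{Mn}/, '')     # drop the combining marks
  stripped.unicode_normalize(:nfkc).downcase     # recompose, then lowercase
end

puts fold("Caf\u00e9")     # cafe
puts fold("M\u00d6TLEY")   # motley
```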

<h2>Step 2: Set up parallel text types that anchor phrase matches to one or both ends</h2>

<p>We&#8217;re going to use something new: a <code>charFilter</code>. This differs from a normal filter in that it affects the input string before tokenization.</p>

<p>Here&#8217;s the trick. We&#8217;re going to add anchoring text (I chose just &#8216;AAAA&#8217; at the front and &#8216;ZZZZ&#8217; at the end) to the normal text type, just by adding a simple charfilter.</p>

<div class="geshi no xml"><ol><li class="li1"><div class="de1"><span class="sc3"><span class="re1">&lt;fieldtype</span> <span class="re0">name</span>=<span class="st0">&quot;text_lr&quot;</span> <span class="re0">class</span>=<span class="st0">&quot;solr.TextField&quot;</span> <span class="re0">positionIncrementGap</span>=<span class="st0">&quot;1000&quot;</span><span class="re2">&gt;</span></span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="sc3"><span class="re1">&lt;analyzer<span class="re2">&gt;</span></span></span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; <span class="sc3"><span class="re1">&lt;charFilter</span> <span class="re0">class</span>=<span class="st0">&quot;solr.PatternReplaceCharFilterFactory&quot;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; <span class="re0">pattern</span>=<span class="st0">&quot;^(.*)$&quot;</span> <span class="re0">replacement</span>=<span class="st0">&quot;AAAA $1 ZZZZ&quot;</span> <span class="re2">/&gt;</span></span> &nbsp; &nbsp; &nbsp; </div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; <span class="sc3"><span class="re1">&lt;tokenizer</span> <span class="re0">class</span>=<span class="st0">&quot;solr.ICUTokenizerFactory&quot;</span><span class="re2">/&gt;</span></span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; <span class="sc3"><span class="re1">&lt;filter</span> <span class="re0">class</span>=<span class="st0">&quot;solr.ICUFoldingFilterFactory&quot;</span><span class="re2">/&gt;</span></span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; <span class="sc3"><span class="re1">&lt;filter</span> <span class="re0">class</span>=<span class="st0">&quot;solr.SynonymFilterFactory&quot;</span> </div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="re0">synonyms</span>=<span class="st0">&quot;syn.txt&quot;</span> </div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="re0">ignoreCase</span>=<span class="st0">&quot;true&quot;</span> <span class="re0">expand</span>=<span class="st0">&quot;false&quot;</span><span class="re2">/&gt;</span></span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; <span class="sc3"><span class="re1">&lt;filter</span> <span class="re0">class</span>=<span class="st0">&quot;solr.CJKWidthFilterFactory&quot;</span><span class="re2">/&gt;</span></span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; <span class="sc3"><span class="re1">&lt;filter</span> <span class="re0">class</span>=<span class="st0">&quot;solr.CJKBigramFilterFactory&quot;</span><span class="re2">/&gt;</span></span> </div></li>
<li class="li1"><div class="de1">&nbsp; <span class="sc3"><span class="re1">&lt;/analyzer<span class="re2">&gt;</span></span></span></div></li>
<li class="li1"><div class="de1"><span class="sc3"><span class="re1">&lt;/fieldtype<span class="re2">&gt;</span></span></span></div></li></ol></div>

<p>Note that this charFilter actually adds two new tokens (&#8216;AAAA&#8217; and &#8216;ZZZZ&#8217;) to your token stream on both index and query. How does this help us?</p>

<p>Let&#8217;s look at indexing <code>Mister Blue Sky</code> in a normal text field. A normal solr phrase query <code>q="Blue Sky"</code> will match on that value, because the query phrase is fully contained in the indexed phrase.</p>

<p>But what happens if we index into a <code>text_lr</code> field?</p>

<ul>
<li>Indexing <code>Mister Blue Sky</code> becomes <code>aaaa mister blue sky zzzz</code></li>
<li>Search terms <code>blue sky</code> becomes <code>aaaa blue sky zzzz</code></li>
<li>Phrase searching will then compare the two transformed values using normal <acronym title="Solr isn\'t an acronym, silly!">Solr</acronym> rules, <em>find that the latter is not fully contained in the former as a phrase</em>, and give up.</li>
</ul>
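
<p>The comparison in that list can be mimicked in a few lines of Ruby. This is a toy model, not how Lucene scores phrases: tokens are lowercase alphanumeric runs, and a phrase &#8220;matches&#8221; when the query tokens appear as a contiguous run in the indexed tokens. All the helper names are hypothetical:</p>

```ruby
def tokens(text)
  text.downcase.scan(/[[:alnum:]]+/)
end

# Same tokens, with the anchor tokens added at both ends.
def anchored(text)
  ['aaaa'] + tokens(text) + ['zzzz']
end

# True when query appears as a contiguous run inside indexed.
def phrase_match?(indexed, query)
  return false if query.empty?
  indexed.each_cons(query.length).any? { |window| window == query }
end

p phrase_match?(tokens("Mister Blue Sky"), tokens("blue sky"))             # true
p phrase_match?(anchored("Mister Blue Sky"), anchored("blue sky"))         # false
p phrase_match?(anchored("Mister Blue Sky"), anchored("mister blue sky"))  # true
```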

<p>Be careful, though. That &#8216;aaaa&#8217; and &#8216;zzzz&#8217; are there just as if you&#8217;d typed them in. Thus every indexed value has the tokens &#8216;aaaa&#8217; and &#8216;zzzz&#8217;, and every query will, in effect, include a query for &#8216;aaaa&#8217; or &#8216;zzzz&#8217; (depending on your <code>mm</code> settings).</p>

<p>That means that <strong>any non-phrase query will match every field that uses this fieldtype</strong>, and it will also mess with token counts with respect to your <code>mm</code> parameter. For those reasons, <em>only ever use anchored fieldtypes for phrase queries when you want exactish matches</em>.</p>
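
<p>Here&#8217;s a toy Ruby illustration of that warning, assuming OR-style term matching (a document matches if it shares any term with the query). The helpers are hypothetical and the tokenizing is deliberately crude:</p>

```ruby
require 'set'

ANCHORS = %w[aaaa zzzz].freeze

# Terms as an anchored field would see them: anchors plus the real words.
def anchored_terms(text)
  (ANCHORS + text.downcase.scan(/[[:alnum:]]+/)).to_set
end

# OR-style term query: any shared term counts as a match.
def term_match?(doc_text, query_text)
  anchored_terms(doc_text).intersect?(anchored_terms(query_text))
end

titles = ["The Monkees", "Meet the Monkees", "Mister Blue Sky"]
p titles.map { |title| term_match?(title, "zebra") }  # [true, true, true]
```

<p>None of those titles contain &#8220;zebra&#8221;; they all match purely because the anchors are shared.</p>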

<p>By adding only one of &#8216;AAAA&#8217; or &#8216;ZZZZ&#8217;, we can have left-anchored and right-anchored searches as well. See <a href="https://github.com/billdueber/solr_stupid_tricks/blob/SST4/solr/conf/schema.xml">the schema.xml</a> for these definitions.</p>

<h2>Try it out!</h2>

<p>Let&#8217;s take a small set of new documents:</p>

<div class="geshi no ruby"><ol><li class="li1"><div class="de1"><span class="br0">&#91;</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="br0">&#123;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; <span class="st0">&quot;id&quot;</span>: <span class="st0">&quot;1&quot;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; <span class="st0">&quot;title&quot;</span>: <span class="st0">&quot;The Monkees: Pleasant Valley Never&quot;</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="br0">&#125;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; <span class="br0">&#123;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; <span class="st0">&quot;id&quot;</span>: <span class="st0">&quot;2&quot;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; <span class="st0">&quot;title&quot;</span>: <span class="st0">&quot;The Monkees&quot;</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="br0">&#125;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; <span class="br0">&#123;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; <span class="st0">&quot;id&quot;</span>: <span class="st0">&quot;3&quot;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; <span class="st0">&quot;title&quot;</span>: <span class="st0">&quot;Meet the Monkees&quot;</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="br0">&#125;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; <span class="br0">&#123;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; <span class="st0">&quot;id&quot;</span>: <span class="st0">&quot;4&quot;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; <span class="st0">&quot;title&quot;</span>: <span class="st0">&quot;Corportate boy bands through the ages&quot;</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="br0">&#125;</span></div></li>
<li class="li1"><div class="de1"><span class="br0">&#93;</span></div></li></ol></div>

<p>We have copyFields set up to copy the title field to both a fully-anchored field (<code>title_exact</code>) and a left-anchored field (<code>title_l</code>).</p>

<div class="geshi no xml"><ol><li class="li1"><div class="de1">&nbsp; <span class="sc3"><span class="re1">&lt;copyField</span> <span class="re0">source</span>=<span class="st0">&quot;title&quot;</span> <span class="re0">dest</span>=<span class="st0">&quot;title_exact&quot;</span><span class="re2">/&gt;</span></span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="sc3"><span class="re1">&lt;copyField</span> <span class="re0">source</span>=<span class="st0">&quot;title&quot;</span> <span class="re0">dest</span>=<span class="st0">&quot;title_l&quot;</span><span class="re2">/&gt;</span></span></div></li></ol></div>

<p>If you&#8217;re following at home, clear out your solr and index them:</p>

<div class="geshi no bash"><ol><li class="li1"><div class="de1"><span class="kw3">cd</span> exampledocs</div></li>
<li class="li1"><div class="de1">&nbsp;.<span class="sy0">/</span>reset_and_index_json.<span class="kw2">sh</span> exactish.json</div></li></ol></div>

<p>We&#8217;ll now run three dismax queries, all of which use the search terms <code>the monkees</code>. Watch what happens to the score as we change things.</p>

<ul>
<li>First, <code>qf=title, pf=title^2</code>. This will match the three Monkees documents, and then boost <em>all</em> of them because they all contain the phrase &#8220;the monkees&#8221; in the title.</li>
<li>Second, <code>qf=title, pf=title_exact^10 title^2</code>. These will match the Monkees documents, and then give a huge boost to the one with the exact match.</li>
<li>Finally, <code>qf=title, pf=title_exact^10 title_l^5 title^2</code>. There you&#8217;ll see the score for the exact title match go way up (relatively speaking, of course), and document 1 go up quite a bit (because it begins with the phrase &#8220;The Monkees&#8221;).</li>
</ul>
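
<p>For reference, here&#8217;s a sketch of how those three parameter sets could be turned into request URLs by hand. The Solr URL, the <code>fl</code> list, and the <code>dismax_url</code> helper are assumptions for illustration; this is not what <code>browse.rb</code> does internally, and it only builds the URLs without sending anything:</p>

```ruby
require 'uri'

SOLR_SELECT = 'http://localhost:8983/solr/select'  # assumed local Solr URL

PHRASE_BOOSTS = [
  'title^2',
  'title_exact^10 title^2',
  'title_exact^10 title_l^5 title^2'
].freeze

# Hypothetical helper: build a dismax request URL for a given pf value.
def dismax_url(pf)
  params = {
    'defType' => 'dismax',
    'q'       => 'the monkees',
    'qf'      => 'title',
    'pf'      => pf,
    'fl'      => 'id,title,score'
  }
  "#{SOLR_SELECT}?#{URI.encode_www_form(params)}"
end

PHRASE_BOOSTS.each { |pf| puts dismax_url(pf) }
```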

<p>You can run all three queries as:</p>

<div class="geshi no bash"><ol><li class="li1"><div class="de1"><span class="kw3">cd</span> ruby</div></li>
<li class="li1"><div class="de1">ruby browse.rb exactish_query.rb </div></li>
<li class="li1"><div class="de1"><span class="co0"># or ruby browse.rb exactish_query.rb json|xml|csv to get different output type</span></div></li></ol></div>

<p>[<acronym title="By The Way">BTW</acronym>, <code>browse.rb</code> will now take an array of queries to run in a single file.]</p>

<p>Tah Dah! You&#8217;ve successfully boosted the exactish match, and the left-anchored exactish match. Your known-item-searchers will thank you.</p>

<p>You may want to take a look at <code>exactish_query.rb</code> to see what&#8217;s going on.</p>

<h2>To sum up</h2>

<ul>
<li>Your <code>schema.xml</code> now contains a decent text type and three variants for anchoring phrase searches left, right, and full (exactish)</li>
<li>The anchored text fields should NOT NOT NOT be searched against by anything other than a single phrase (which means they&#8217;re very useful in the <code>pf</code> param of a dismax search). A non-phrase search will trivially match <em>every single document</em>, so, you know, avoid that.</li>
<li>You now have a set of tools (field types, copyField directives, phrase search) that can be used to provide higher boosts to exactish matches and left-anchored exactish phrase matches.</li>
</ul>]]></content:encoded>
			<wfw:commentRss>http://robotlibrarian.billdueber.com/boosting-on-exactish-anchored-phrase-matching-in-solr-sst-4/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Requiring/Preferring searches that don&#8217;t span multiple values (SST #3)</title>
		<link>http://robotlibrarian.billdueber.com/requiringpreferring-searches-that-dont-span-multiple-values-sst-3/</link>
		<comments>http://robotlibrarian.billdueber.com/requiringpreferring-searches-that-dont-span-multiple-values-sst-3/#comments</comments>
		<pubDate>Fri, 09 Mar 2012 05:02:54 +0000</pubDate>
		<dc:creator>Bill</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[phrase slop]]></category>
		<category><![CDATA[solr]]></category>
		<category><![CDATA[Stupid Solr Tricks]]></category>

		<guid isPermaLink="false">http://robotlibrarian.billdueber.com/?p=480</guid>
		<description><![CDATA[Check out introduction to the Stupid Solr Tricks series if you&#8217;re just joining us.] Solr and multiValued fields Here&#8217;s another thing you need to understand about Solr: it doesn&#8217;t really have fields that can take multiple values. &#8220;But Bill,&#8221; you&#8217;re saying, &#8220;sure it does. I mean, hell, it even has a &#8216;multiValued&#8217; parameter.&#8221; First off: [...]]]></description>
				<content:encoded><![CDATA[<blockquote>
  <p>[Check out the <a href="http://robotlibrarian.billdueber.com/stupid-solr-tricks-introduction/">introduction to the Stupid <acronym title="Solr isn\'t an acronym, silly!">Solr</acronym> Tricks series</a> if you&#8217;re just joining us.]</p>
</blockquote>

<h2><acronym title="Solr isn\'t an acronym, silly!">Solr</acronym> and multiValued fields</h2>

<p>Here&#8217;s another thing you need to understand about Solr: it doesn&#8217;t really have fields that can take multiple values.</p>

<p>&#8220;But Bill,&#8221; you&#8217;re saying, &#8220;sure it does. I mean, hell, it even has a &#8216;multiValued&#8217; parameter.&#8221;</p>

<p>First off: watch your language.</p>

<p>Second off: are you <em>sure</em>?</p>

<p>Let&#8217;s do a quick test. Look at the following documents:</p>

<div class="geshi no javascript"><div class="head">exampledocs/names.json</div><ol><li class="li1"><div class="de1"><span class="br0">&#91;</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="br0">&#123;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; <span class="st0">&quot;id&quot;</span>: <span class="st0">&quot;1&quot;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; <span class="st0">&quot;title&quot;</span>: <span class="st0">&quot;The Monkees&quot;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; <span class="st0">&quot;name_text&quot;</span>: <span class="br0">&#91;</span><span class="st0">&quot;Peter Tork&quot;</span>, <span class="st0">&quot;Mike Nesmith&quot;</span>, </div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; <span class="st0">&quot;Micky Dolenz&quot;</span>, <span class="st0">&quot;Davy Thomas Jones&quot;</span><span class="br0">&#93;</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="br0">&#125;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; <span class="br0">&#123;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; <span class="st0">&quot;id&quot;</span>: <span class="st0">&quot;2&quot;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; <span class="st0">&quot;title&quot;</span>: <span class="st0">&quot;Heros of the Wild West&quot;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; <span class="st0">&quot;name_text&quot;</span>: <span class="br0">&#91;</span><span class="st0">&quot;Buck Jones&quot;</span>, <span class="st0">&quot;Davy Crockett&quot;</span><span class="br0">&#93;</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="br0">&#125;</span></div></li>
<li class="li1"><div class="de1"><span class="br0">&#93;</span></div></li></ol></div>

<p>Question: what do you get when you run this query against those two documents?</p>

<div class="geshi no ruby"><div class="head">ruby/names_query.rb</div><ol><li class="li1"><div class="de1"><span class="br0">&#123;</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="st0">&#39;fl&#39;</span> <span class="sy0">=&gt;</span> <span class="st0">&#39;score, *&#39;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; <span class="st0">&#39;defType&#39;</span> <span class="sy0">=&gt;</span> <span class="st0">&#39;dismax&#39;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; <span class="st0">&#39;wt&#39;</span> <span class="sy0">=&gt;</span> <span class="st0">&#39;csv&#39;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; <span class="st0">&#39;qf&#39;</span> <span class="sy0">=&gt;</span> <span class="st0">&#39;name_text&#39;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; <span class="st0">&#39;q&#39;</span> <span class="sy0">=&gt;</span> <span class="st0">&#39;davy jones&#39;</span> &nbsp; <span class="co1"># Poor guy just died. So young. So short. </span></div></li>
<li class="li1"><div class="de1"><span class="br0">&#125;</span></div></li></ol></div>

<p>See how I threw the <em>wt=csv</em> in there? Check out <a href="http://lucene.apache.org/solr/api/org/apache/solr/response/QueryResponseWriter.html">all the query response formats</a> if you&#8217;re interested, but really all you&#8217;ll use is <code>standard</code> (<acronym title="Extensible Markup Language">XML</acronym>), <code>json</code>, or <code>csv</code> unless you&#8217;re rolling your own in some way.</p>

<p>I&#8217;ve updated <code>ruby/browse.rb</code> to allow a second argument of the type of output you want. You can now do <code>ruby browse.rb jsonfile [json|csv|standard|xml]</code></p>

<h2>Following along at home?</h2>

<p>If so, let&#8217;s go ahead and index these documents and run the query.</p>

<div class="geshi no bash"><div class="head">Play along at home</div><ol><li class="li1"><div class="de1"><span class="kw3">cd</span> solr_stupid_tricks</div></li>
<li class="li1"><div class="de1">git pull origin master</div></li>
<li class="li1"><div class="de1">git fetch --all</div></li>
<li class="li1"><div class="de1">git checkout SST3 <span class="co0"># I&#39;ve started tagging the repo for these posts</span></div></li>
<li class="li1"><div class="de1"><span class="co0"># ignore warning about &quot;detached HEAD&quot;</span></div></li>
<li class="li1"><div class="de1">java -jar start.jar <span class="sy0">&amp;</span></div></li>
<li class="li1"><div class="de1"><span class="kw3">cd</span> exampledocs</div></li>
<li class="li1"><div class="de1">&nbsp;.<span class="sy0">/</span>reset_and_index_json.<span class="kw2">sh</span> names.json</div></li>
<li class="li1"><div class="de1">&nbsp;<span class="kw3">cd</span> ..<span class="sy0">/</span>ruby</div></li>
<li class="li1"><div class="de1">&nbsp;ruby browse.rb names_query.rb</div></li></ol></div>

<p>Here are the scores that I get:</p>

<div class="geshi no csv"><div class="head">Return from <acronym title="Solr isn\'t an acronym, silly!">Solr</acronym></div><ol><li class="li1"><div class="de1">&nbsp; id,title,name_text,score</div></li>
<li class="li1"><div class="de1">&nbsp; 2,Heros of the Wild West,&quot;Buck Jones,Davy Crockett&quot;,0.42039964</div></li>
<li class="li1"><div class="de1">&nbsp; 1,The Monkees,&quot;Peter Tork,Mike Nesmith,Micky Dolenz,Davy Thomas Jones&quot;,0.26274976</div></li></ol></div>

<p>Check out that last column. The query was <em>davy jones</em>. Document #1 contains a name that has both those terms, but document #2 (which has both terms, but in different names) gets a higher score.</p>

<h2>The relevance ranking seems&#8230;wrong</h2>

<p>While it <em>looks</em> like we added four separate names to the <code>name_text</code> field in our first document, <acronym title="Solr isn\'t an acronym, silly!">Solr</acronym> doesn&#8217;t see it that way. <acronym title="Solr isn\'t an acronym, silly!">Solr</acronym> treats those four poor Monkees as if they had one long name.</p>

<p>Then it finds all the documents that match the query (both of our documents match) and figures out which is a better match by assigning a score.</p>

<p>In this case, while both documents have both query terms, the field in the second document is <em>shorter</em>. Which means that, essentially, a <em>higher percentage of the terms in the field value match the given query terms</em>. In <acronym title="Solr isn\'t an acronym, silly!">Solr</acronym>&#8217;s mind, that makes it a better match, and the shorter document shows up first.</p>
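
<p>To make that &#8220;percentage of matching terms&#8221; idea concrete, here&#8217;s a rough sketch in ruby. It is <em>not</em> Solr&#8217;s actual scoring formula, and the <code>match_fraction</code> helper is made up for illustration. It just flattens each multiValued field into one list of tokens, the way the index effectively sees it, and works out what fraction of those tokens match the query.</p>

```ruby
# Toy illustration only: this is NOT how Solr/Lucene computes a score.
# match_fraction is a hypothetical helper, not part of any Solr client.
def match_fraction(values, query)
  tokens = values.join(' ').downcase.split   # all values run together
  wanted = query.downcase.split
  tokens.count { |t| wanted.include?(t) }.to_f / tokens.size
end

monkees   = ['Peter Tork', 'Mike Nesmith', 'Micky Dolenz', 'Davy Thomas Jones']
wild_west = ['Buck Jones', 'Davy Crockett']

puts match_fraction(monkees, 'davy jones')    # 2 of 9 tokens match
puts match_fraction(wild_west, 'davy jones')  # 2 of 4 tokens match
```

<p>Two matches out of four tokens beats two out of nine, which is roughly why the shorter field wins here.</p>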

<p><acronym title="Solr isn\'t an acronym, silly!">Solr</acronym> doesn&#8217;t automatically give more weight to the recently-dead Monkee because internally it doesn&#8217;t care that you&#8217;re thinking of those values as four separate names. It just concatenates them together and indexes them.</p>

<p>This is <strong>not</strong>, for most people, expected behavior.</p>

<h2>Phrase slop</h2>

<p>Part of what&#8217;s going on here is that we haven&#8217;t told <acronym title="Solr isn\'t an acronym, silly!">Solr</acronym> that it should care how close together the terms are.</p>

<p>One way to do that is to use a phrase query by throwing quotes around the terms:</p>

<div class="geshi no ruby"><div class="head">Put double-quotes around it to make it a phrase query</div><ol><li class="li1"><div class="de1">&nbsp; <span class="st0">&quot;q&quot;</span> <span class="sy0">=&gt;</span> <span class="st0">&#39;&quot;Davy Jones&quot;&#39;</span></div></li></ol></div>

<p>&#8230;but that won&#8217;t find anything, because <em>Davy</em> and <em>Jones</em> aren&#8217;t right next to each other in our document.</p>

<p><acronym title="Solr isn\'t an acronym, silly!">Solr</acronym> does allow a phrase query to be &#8220;sloppy&#8221;, though &#8212; basically saying that instead of being right next to each other, the terms need to be within a certain number of tokens of each other.</p>

<p>For that, we&#8217;ll tell Solr to search against certain fields (<code>pf</code>) treating the query as a phrase, and allow a little slop (<code>ps</code>) as well.</p>

<div class="geshi no ruby"><div class="head">ruby/names_sloppy_query.rb</div><ol><li class="li1"><div class="de1">&nbsp; <span class="br0">&#123;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; <span class="st0">&#39;fl&#39;</span> <span class="sy0">=&gt;</span> <span class="st0">&#39;score, *&#39;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; <span class="st0">&#39;defType&#39;</span> <span class="sy0">=&gt;</span> <span class="st0">&#39;dismax&#39;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; <span class="st0">&#39;wt&#39;</span> <span class="sy0">=&gt;</span> <span class="st0">&#39;csv&#39;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; <span class="st0">&#39;q&#39;</span> <span class="sy0">=&gt;</span> <span class="st0">&#39;davy jones&#39;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; <span class="st0">&#39;qf&#39;</span> <span class="sy0">=&gt;</span> <span class="st0">&#39;name_text&#39;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; <span class="st0">&#39;pf&#39;</span> <span class="sy0">=&gt;</span> <span class="st0">&#39;name_text^10&#39;</span>, <span class="co1"># search this field as a phrase</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; <span class="st0">&#39;ps&#39;</span> <span class="sy0">=&gt;</span> <span class="st0">&#39;4&#39;</span> <span class="co1"># allow &#39;phrase&#39; to mean &#39;within 4 tokens of each other&#39;</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="br0">&#125;</span></div></li></ol></div>

<p>That gets us something more expected.</p>

<div class="geshi no csv"><ol><li class="li1"><div class="de1">&nbsp; id,title,name_text,score</div></li>
<li class="li1"><div class="de1">&nbsp; 1,The Monkees,&quot;Peter Tork,Mike Nesmith,Micky Dolenz,Davy Thomas Jones&quot;,0.2806283</div></li>
<li class="li1"><div class="de1">&nbsp; 2,Heros of the Wild West,&quot;Buck Jones,Davy Crockett&quot;,0.029652705</div></li></ol></div>

<h2>Enter <code>positionIncrementGap</code></h2>

<p>OK. Now that we have the concept of &#8220;slop&#8221;, one of those mystery <code>fieldtype</code> parameters makes sense: <code>positionIncrementGap</code>. Basically, a <code>positionIncrementGap</code> of 1000 means <em>When computing slop, pretend there are 1000 tokens between the entries in a multiValued field</em>.</p>

<p>A sloppy phrase search, then, will only find (and thus boost) the phrase if (a) the tokens are in the same entry for a multiValued field, and (b) your slop value is less than your <code>positionIncrementGap</code>.</p>

<p>All you have to do is use the <code>pf</code> and <code>ps</code> parameters and you&#8217;re set.</p>

<p>Note that this should be telling you two things:</p>

<ul>
<li><strong>Always use the same positionIncrementGap for your multiValued fields</strong></li>
<li><strong>Make it a number much larger than the maximum number of tokens you expect to ever have in a field.</strong></li>
</ul>

<p>Note that a large <code>positionIncrementGap</code> doesn&#8217;t <em>actually</em> put 1000 tokens in there &#8212; a large value doesn&#8217;t affect processing time or your index size or anything.</p>
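
<p>If it helps, here&#8217;s a toy model of the idea in ruby. Everything in it is an assumption made for illustration: <code>token_positions</code> and <code>near?</code> are invented helpers, and the slop check is a simplification (real Lucene slop works more like an edit distance and charges extra for out-of-order terms). The point is only to show how a gap between values keeps a sloppy phrase from matching across them.</p>

```ruby
# Toy model, NOT Lucene's real implementation.
# Give every token a position; add `gap` extra positions between values.
def token_positions(values, gap)
  positions = Hash.new { |hash, key| hash[key] = [] }
  pos = 0
  values.each_with_index do |value, i|
    pos += gap if i > 0                 # the pretend tokens between values
    value.downcase.split.each do |token|
      positions[token] << pos
      pos += 1
    end
  end
  positions
end

# Simplified slop: at most `slop` positions sit between the two terms,
# in either order.
def near?(positions, a, b, slop)
  positions[a].product(positions[b]).any? { |pa, pb| (pa - pb).abs - 1 <= slop }
end

monkees   = ['Peter Tork', 'Mike Nesmith', 'Micky Dolenz', 'Davy Thomas Jones']
wild_west = ['Buck Jones', 'Davy Crockett']

puts near?(token_positions(wild_west, 0),    'davy', 'jones', 4)    # => true  (no gap: values bleed together)
puts near?(token_positions(wild_west, 1000), 'davy', 'jones', 999)  # => false (gap keeps the values apart)
puts near?(token_positions(monkees, 1000),   'davy', 'jones', 4)    # => true  (same value)
```

<p>In this model a slop of 999 is the largest value that still can&#8217;t jump a gap of 1000.</p>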

<h2>But I&#8217;m already using the <code>pf</code> parameter!</h2>

<p>Slop is great when you want it. But I don&#8217;t always want to use slop. Slop of 4 makes the phrase &#8220;<em>Sex in the City</em>&#8221; be treated exactly the same as &#8220;<em>In the Sex City</em>&#8221;. If someone puts in an exact title, I want to reward them for that query by floating the exact match to the top, and slop prevents me from doing so.</p>

<p>[<em>Foreshadowing</em>: We'll talk about exact-ish matches in a few days.]</p>

<p>OK, so we can&#8217;t just appropriate the <code>pf</code>/<code>ps</code> parameters and push the slop value up all the time. That cripples our ability to create the query boost structure we want.</p>

<h2>Query slop</h2>

<p>So, dismax (and its cousin edismax) has an analogous parameter that affects only <em>phrases within</em> the normal query: <code>qs</code>.</p>

<p><code>qs</code> is a dismax param that affects <em>query slop</em> &#8212; how much slop to allow in phrases within the query, much like the <code>ps</code> param.</p>

<p>The query</p>

<div class="geshi no ruby"><div class="head">A three-token query</div><ol><li class="li1"><div class="de1">&nbsp; <span class="st0">&#39;q&#39;</span> <span class="sy0">=&gt;</span> <span class="st0">&#39;Bill &quot;The Weasel&quot; Dueber&#39;</span></div></li></ol></div>

<p>&#8230;has three tokens, the second of which (<em>&#8220;The Weasel&#8221;</em>) is a phrase. It&#8217;s that phrase token that is affected by query slop.</p>
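
<p>A quick way to see the &#8220;three tokens&#8221; claim is to split the query the same way by hand. This one-liner is only an approximation for illustration (it is not Solr&#8217;s query parser): it treats anything inside double-quotes as one token and everything else as whitespace-separated words.</p>

```ruby
# Rough approximation of how the query breaks into tokens.
# NOT Solr's parser; just quoted phrases first, then bare words.
q = 'Bill "The Weasel" Dueber'
tokens = q.scan(/"[^"]*"|\S+/)

tokens.each { |t| puts t }
puts tokens.size   # => 3
```

<p>Only the quoted token in the middle is a phrase, so that&#8217;s the only one <code>qs</code> has anything to say about.</p>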

<p>OK. So it affects only the phrases in the normal query. But&#8230;suppose we just force the <em>entire</em> query to be one big phrase? That&#8217;ll get us somewhere!</p>

<p>We just need to do the following:</p>

<ul>
<li>Create a boost query that uses the same fields as the regular query</li>
<li>&#8230;but treats all the query terms as one big phrase</li>
<li>&#8230;and give it a query slop of one less than the <code>positionIncrementGap</code> in our field type definition (in my case, 999)</li>
</ul>

<h2>Package it up</h2>

<p>OK, so here&#8217;s what we&#8217;re going to do. You can just take this basic idea and build it into your own queries in your application code. Try it. You might like it. Play around with what fields are affected, how much weight to give it, etc.</p>

<p>But heck, we&#8217;ve gone this far. Let&#8217;s encode it into the <acronym title="Solr isn\'t an acronym, silly!">Solr</acronym> configuration file <code>solrconfig.xml</code> itself as a custom request handler.</p>

<p>We&#8217;re going to extend our <code>edismaxplus</code> requestHandler from <a href="http://robotlibrarian.billdueber.com/using-localparams-in-solr-sst-2/">last time</a>, but we&#8217;ll add an extra boost query that reflects this new &#8220;prefer documents where all the tokens appear in the same &#8216;line&#8217; of a multiValued field&#8221; attitude.</p>

<div class="geshi xml"><div class="head">solr/conf/solrconfig.xml</div><ol><li class="li1"><div class="de1">&nbsp; <span class="sc3"><span class="re1">&lt;requestHandler</span> <span class="re0">name</span>=<span class="st0">&quot;/edismaxplus&quot;</span> <span class="re0">class</span>=<span class="st0">&quot;solr.SearchHandler&quot;</span><span class="re2">&gt;</span></span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; <span class="sc3"><span class="re1">&lt;lst</span> <span class="re0">name</span>=<span class="st0">&quot;defaults&quot;</span><span class="re2">&gt;</span></span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; <span class="sc3"><span class="re1">&lt;str</span> <span class="re0">name</span>=<span class="st0">&quot;rows&quot;</span><span class="re2">&gt;</span></span>10<span class="sc3"><span class="re1">&lt;/str<span class="re2">&gt;</span></span></span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; <span class="sc3"><span class="re1">&lt;str</span> <span class="re0">name</span>=<span class="st0">&quot;fl&quot;</span><span class="re2">&gt;</span></span>*,score<span class="sc3"><span class="re1">&lt;/str<span class="re2">&gt;</span></span></span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; <span class="sc3"><span class="re1">&lt;str</span> <span class="re0">name</span>=<span class="st0">&quot;echoParams&quot;</span><span class="re2">&gt;</span></span>explicit<span class="sc3"><span class="re1">&lt;/str<span class="re2">&gt;</span></span></span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; <span class="sc3"><span class="re1">&lt;str</span> <span class="re0">name</span>=<span class="st0">&quot;q&quot;</span><span class="re2">&gt;</span></span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; _query_:&quot;{!edismax qf=$fields mm=$mymm </div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; v=$qwords bq=$boostForAll}&quot;<span class="sc3"><span class="re1">&lt;/str<span class="re2">&gt;</span></span></span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; <span class="sc3"><span class="re1">&lt;str</span> <span class="re0">name</span>=<span class="st0">&#39;mymm&#39;</span><span class="re2">&gt;</span></span>0%<span class="sc3"><span class="re1">&lt;/str<span class="re2">&gt;</span></span></span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; <span class="sc3"><span class="re1">&lt;str</span> <span class="re0">name</span>=<span class="st0">&quot;qwordsphrase&quot;</span><span class="re2">&gt;</span></span>&quot;JunkThatWillNEverShowUpInAMillionFreakinYears&quot;<span class="sc3"><span class="re1">&lt;/str<span class="re2">&gt;</span></span></span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; <span class="sc3"><span class="re1">&lt;str</span> <span class="re0">name</span>=<span class="st0">&#39;boostForAll&#39;</span><span class="re2">&gt;</span></span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; _query_:&quot;{!edismax qf=$fields </div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;mm=&#39;100%&#39; </div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;v=$qwords }&quot;^5 OR</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; _query_:&quot;{!dismax &nbsp;qf=$fields </div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;mm=&#39;100%&#39; </div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;v=$qwordsphrase </div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;qs=&#39;999&#39;}&quot;^5</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; <span class="sc3"><span class="re1">&lt;/str<span class="re2">&gt;</span></span></span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; <span class="sc3"><span class="re1">&lt;/lst<span class="re2">&gt;</span></span></span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="sc3"><span class="re1">&lt;/requestHandler<span class="re2">&gt;</span></span></span></div></li></ol></div>

<p>We now do a few new things:</p>

<ul>
<li>(<em>Line 15</em>) Add a second clause to the boost query that uses the same fields provided for the regular query (note the boolean OR between the two localparams queries that comprise this boost query)</li>
<li>(<em>Line 17</em>) Ask for another user-provided value: <code>qwordsphrase</code>, which your application-level stuff should set to all the regular query terms, but as a single phrase. Basically, strip out all the double-quotes, then put the whole thing in double quotes. In ruby: <code>qwordsphrase = '"' + qwords.gsub(/"/, '') + '"'</code> </li>
<li>(<em>Line 10</em>) Provide a default value for the new <code>qwordsphrase</code> that won&#8217;t ever show up in a real query (empty string won&#8217;t work; I tried it and it throws an error). So, if the application doesn&#8217;t provide <code>qwordsphrase</code>, no harm is done &#8212; the search regresses to what we had last time.</li>
<li>(<em>Line 18</em>) Use a <code>qs</code> (query slop) of 999 in the new boost clause acting against <code>qwordsphrase</code>. That value is one less than the <code>positionIncrementGap</code> of 1000, making sure that we don&#8217;t cross multiValue boundaries. </li>
</ul>
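
<p>On the application side, that might look something like the sketch below. The <code>phraseify</code> and <code>edismaxplus_url</code> helpers are names I&#8217;m making up here, and the base <acronym title="Uniform Resource Locator">URL</acronym> assumes the example Solr running on localhost with the <code>/edismaxplus</code> handler defined above.</p>

```ruby
require 'uri'

# Hypothetical helper: strip any double-quotes, then wrap the whole
# query in double-quotes so it is treated as one big phrase.
def phraseify(qwords)
  '"' + qwords.delete('"') + '"'
end

# Hypothetical helper: build a request URL for the /edismaxplus handler.
# The base URL is an assumption (the stock example server on localhost).
def edismaxplus_url(qwords, fields, base = 'http://localhost:8983/solr/edismaxplus/')
  params = {
    'qwords'       => qwords,
    'fields'       => fields,
    'qwordsphrase' => phraseify(qwords)
  }
  "#{base}?#{URI.encode_www_form(params)}"
end

puts phraseify('Bill "The Weasel" Dueber')
puts edismaxplus_url('davy jones', 'name_text')
```

<p>Leave <code>qwordsphrase</code> out of the params and the handler falls back to its junk default, so nothing gets the extra boost.</p>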

<p><strong>Note</strong>: If you wanted to, you could make this a filter query (<code>fq</code>) instead of a boost query to <em>only</em> allow documents that meet this criterion.</p>

<h2>Let&#8217;s try it out!</h2>

<p>Once again, if you did a <code>git pull origin master</code> you&#8217;ve got this up and running already &#8212; the updated requestHandler source is already in <code>solr/conf/solrconfig.xml</code>.</p>

<p>We first construct the query just like we did last week, without the <code>qwordsphrase</code> argument:</p>

<p><a href="http://localhost:8983/solr/edismaxplus/?qwords=davy%20jones&amp;fields=name_text">http://localhost:8983/solr/edismaxplus/?qwords=davy jones&amp;fields=name_text</a></p>

<p>You&#8217;ll see Davy Crockett and friend appear as the first item.</p>

<p>But when you add the phraseified query, you&#8217;ll see the boost we&#8217;ve been talking about this whole post and get something more expected.</p>

<p><a href="http://localhost:8983/solr/edismaxplus/?qwords=davy%20jones&amp;fields=name_text&amp;qwordsphrase=%22Davy%20Jones%22">http://localhost:8983/solr/edismaxplus/?qwords=davy jones&amp;fields=name_text&amp;qwordsphrase=&#8221;Davy Jones&#8221;</a></p>

<p>The Monkees are again on top! Party like it&#8217;s 1967!</p>

<h2>Where it breaks down</h2>

<p>If you actually have a phrase as one of your query terms, it will no longer be treated as a phrase during the boost because we&#8217;re getting rid of all the double-quotes.</p>

<p>And, of course, if you&#8217;ve got gobs of full-text and include your fulltext field, setting  query slop to 999 isn&#8217;t just a cute trick, it&#8217;s a cute trick that will melt your servers to slag and still not do what you want it to do.</p>

<h2>What have we learned?</h2>

<ul>
<li><acronym title="Solr isn\'t an acronym, silly!">Solr</acronym> doesn&#8217;t really separate multiple values from each other in a <code>multiValued</code> field</li>
<li>Phrase slop (<code>ps</code>) and query slop (<code>qs</code>) can be used to allow &#8220;phrase&#8221; to mean &#8220;a bunch of tokens within X spots of each other&#8221;</li>
<li><em>I&#8217;m A Believer</em> is the best song Neil Diamond ever wrote.</li>
</ul>]]></content:encoded>
			<wfw:commentRss>http://robotlibrarian.billdueber.com/requiringpreferring-searches-that-dont-span-multiple-values-sst-3/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Using localparams in Solr (or, how to boost records that contain all terms) (SST #2)</title>
		<link>http://robotlibrarian.billdueber.com/using-localparams-in-solr-sst-2/</link>
		<comments>http://robotlibrarian.billdueber.com/using-localparams-in-solr-sst-2/#comments</comments>
		<pubDate>Tue, 06 Mar 2012 21:57:29 +0000</pubDate>
		<dc:creator>Bill</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[solr]]></category>
		<category><![CDATA[Stupid Solr Tricks]]></category>

		<guid isPermaLink="false">http://robotlibrarian.billdueber.com/?p=432</guid>
		<description><![CDATA[[Note: this isn't so much a Stupid Solr Trick as a Thing You Should Probably Know; consider it required reading for the next SST. If you're just joining us, check out the introduction to the Stupid Solr Tricks series] What the heck is a localparams query? A garden-variety Solr query URL looks something like this: [...]]]></description>
				<content:encoded><![CDATA[<p>[Note: this isn't so much a <em>Stupid <acronym title="Solr isn\'t an acronym, silly!">Solr</acronym> Trick</em> as a <em>Thing You Should Probably Know</em>; consider it required reading for the next SST. If you're just joining us, check out the <a href="http://robotlibrarian.billdueber.com/stupid-solr-tricks-introduction/">introduction to the Stupid <acronym title="Solr isn\'t an acronym, silly!">Solr</acronym> Tricks series</a>]</p>

<h2>What the heck is a localparams query?</h2>

<p>A garden-variety <acronym title="Solr isn\'t an acronym, silly!">Solr</acronym> query <acronym title="Uniform Resource Locator">URL</acronym> looks something like this:
<pre>  http://localhost:8983/solr/select?
    defType=dismax
    &amp;qf=name^2 place^1
    &amp;q=Dueber</pre>
Which is fine, as far as it goes. But it&#8217;s easy to run into the limits of the standard query plugins (e.g., Dismax).</p>

<p>Say, for example, you want something like this:
<pre>  title:Constructivism AND author:Dueber</pre>
And furthermore, you have multiple underlying fields (title1, title2, title3, author1, author2).</p>

<p>The naïve approach would be to just do this:
<pre>  defType=dismax
  &amp;qf=title1 title2 title3 author1 author2
  &amp;q=Constructivism Dueber</pre>
But you can&#8217;t construct a dismax query with the boolean AND. You can with edismax, but even then you&#8217;ve got no way of telling (e)dismax that <em>Constructivism</em> must be found in the title fields, and <em>Dueber</em> must be found in the author fields. Dismax doesn&#8217;t do that.</p>

<h2>Solution: Build a query of queries</h2>

<p>The solution is to build a query made up of fully-encapsulated sub-queries. A localparams query has two forms (note that, of course, you&#8217;d need to <acronym title="Uniform Resource Locator">URL</acronym>-Escape the values):
<pre>  &#95;query&#95;:"{!dismax qf='field^2 otherfield^4'}my search terms"
or
  &#95;query&#95;:"{!dismax qf='field^2 otherfield^4' v=$q1}"&amp;q1=my search terms</pre>
I far prefer the second form (which uses a second <acronym title="Uniform Resource Locator">URL</acronym> parameter <code>q1</code> instead of sticking the search right in there), because I don&#8217;t have to worry about escaping double-quotes in the query terms (as you would if there&#8217;s a phrase as part of the query).</p>

<p>Once you&#8217;ve got these things, you can combine them with booleans.
<pre>    q=&#95;query&#95;:"{!dismax qf='title1 title2 title3' v=$q1}" AND
      &#95;query&#95;:"{!dismax qf='author1 author2' v=$q2}"
  &amp;q1=Constructivism
  &amp;q2=Dueber</pre>
[Note: <strong><a href="http://robotlibrarian.billdueber.com/solr-and-boolean-operators/">be careful with solr booleans!!!</a></strong>]</p>

<p>You can add any <em>local parameters</em> you need (for dismax, stuff like mm, qs, pf, and ps) and you can use any query parser you want by changing what comes after the bang (e.g., <code>{!lucene ...}</code> or <code>{!edismax...}</code>).</p>

<p>In this way, you can build up arbitrarily complex queries using any available query parsers in combination with each other. Very powerful.</p>
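
<p>If you&#8217;re assembling these in application code, a tiny helper keeps the quoting straight. This is just a sketch: <code>subquery</code> is a made-up name, and the <code>/solr/select</code> <acronym title="Uniform Resource Locator">URL</acronym> assumes the stock example server on localhost.</p>

```ruby
require 'uri'

# Hypothetical helper: build one localparams sub-query whose search
# terms live in a separate URL parameter (the v=$var form).
def subquery(parser, qf, var)
  %(_query_:"{!#{parser} qf='#{qf}' v=$#{var}}")
end

params = {
  'q'  => [
    subquery('dismax', 'title1 title2 title3', 'q1'),
    subquery('dismax', 'author1 author2', 'q2')
  ].join(' AND '),
  'q1' => 'Constructivism',
  'q2' => 'Dueber'
}

# encode_www_form takes care of the URL-escaping
url = 'http://localhost:8983/solr/select?' + URI.encode_www_form(params)

puts params['q']
puts url
```

<p>Because the search terms ride along in <code>q1</code> and <code>q2</code>, the helper never has to escape double-quotes inside them.</p>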

<h2>An example: boost records that contain all terms</h2>

<p>Just about everything in a localparams query can be pulled out in the way I pulled out the search terms above. Here&#8217;s a fairly-complex example (which, let&#8217;s be honest, would be a lot more complex if you were trying to inline and escape everything).</p>

<p><strong>Scenario</strong>: We want to do a logical-OR search (mm=0%), but want to make sure we boost documents that contain all the search terms. This is necessary because sometimes a very long document with all the terms will have a lower score than a very short document with most of the terms.</p>

<p>Having short documents with a few keywords show up before long documents with all the keywords will drive your librarians <em>CraZy</em>!!! So it&#8217;s tempting to just leave it alone. But let&#8217;s fix it anyway.</p>

<p>The gist of it is as follows:</p>

<ul>
<li>Query against title and author</li>
<li>Use an mm of 0% (logical OR) for the main query</li>
<li>Use a pf to boost on a phrase in those same two fields (just common sense)</li>
<li>Set up a boost query (bq) to boost the score if <em>all</em> the search terms are present</li>
</ul>

<p>To accomplish this, we&#8217;re going to have two localparams queries: one to be the main query, and another that we&#8217;re going to use as the boost query. This works in much the same way as our previous &#8220;AND-together two localparams queries&#8221; did.</p>

<p>[Presenting the <acronym title="Uniform Resource Locator">URL</acronym> parameters as a ruby hash to make it easier to read]</p>

<div class="geshi no ruby"><ol><li class="li1"><div class="de1">&nbsp; <span class="br0">&#123;</span></div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; <span class="st0">&#39;q&#39;</span> <span class="sy0">=&gt;</span> <span class="st0">&#39;_query_:&quot;{!dismax qf=$f1 mm=$mm1 pf=$f1 bq=$bq1 v=$q1}&quot;&#39;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; <span class="st0">&#39;mm1&#39;</span> <span class="sy0">=&gt;</span> <span class="st0">&#39;0%&#39;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; <span class="st0">&#39;f1&#39;</span> <span class="sy0">=&gt;</span> <span class="st0">&#39;author^3 title^1&#39;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; <span class="st0">&#39;q1&#39;</span> <span class="sy0">=&gt;</span> <span class="st0">&#39;Dueber Constructivism&#39;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; <span class="st0">&#39;bq1&#39;</span> <span class="sy0">=&gt;</span> <span class="st0">&#39;_query_:&quot;{!dismax qf=$f1 mm=<span class="es0">\&#39;</span>100%<span class="es0">\&#39;</span> v=$q1 }&quot;^5&#39;</span>,</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; <span class="st0">&#39;fl&#39;</span> <span class="sy0">=&gt;</span> <span class="st0">&#39;score,*&#39;</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="br0">&#125;</span></div></li></ol></div>

<p>What&#8217;s nice about this is that I&#8217;m reusing the search terms (for the main query and the boost query) and field list (for the query field and the phrase fields) so I don&#8217;t have to repeat them.</p>

<h2>Try along at home</h2>

<p>First off, if you don&#8217;t have a browser that does nice <acronym title="Extensible Markup Language">XML</acronym> and JSON formatting, well, get one. I use Chrome with <a href="https://chrome.google.com/webstore/detail/chklaanhfefbnpoihckbnefhakgolnmc">JSONView</a> and <a href="https://chrome.google.com/webstore/detail/gbammbheopgpmaagmckhpjbfgdfkpadb">XMLTree</a>, but I&#8217;m sure there are equivalents for Firefox. They&#8217;ll make your life easier.</p>

<p>By now you know the drill:</p>

<div class="geshi no bash"><ol><li class="li1"><div class="de1">&nbsp; <span class="kw3">cd</span> solr_stupid_tricks</div></li>
<li class="li1"><div class="de1">&nbsp; git pull origin master</div></li>
<li class="li1"><div class="de1">&nbsp; git fetch origin master</div></li>
<li class="li1"><div class="de1">&nbsp; git checkout SST2 <span class="co0"># I&#39;ve started tagging the repo for these posts</span></div></li>
<li class="li1"><div class="de1">&nbsp; <span class="co0"># ignore warning about &quot;detached HEAD&quot;</span></div></li>
<li class="li1"><div class="de1">&nbsp; java -jar start.jar <span class="sy0">&amp;</span></div></li></ol></div>

<p>We&#8217;ll want to empty out the index and put in some documents to work with. I&#8217;m presuming you have <code>curl</code> installed. If not&#8230;well, you&#8217;re on your own.</p>

<div class="geshi no bash"><ol><li class="li1"><div class="de1">&nbsp; <span class="kw3">cd</span> exampledocs</div></li>
<li class="li1"><div class="de1">&nbsp; .<span class="sy0">/</span>reset_and_index_json localparams.json</div></li></ol></div>

<p>You might want to take a look at the <code>localparams.json</code> file, which contains a set of documents in the new JSON update structure. The <a href="http://wiki.apache.org/solr/UpdateJSON">full <acronym title="Solr isn\'t an acronym, silly!">Solr</acronym> JSON Update structure</a> allows repeated keys. Apparently, so does the <a href="http://www.ietf.org/rfc/rfc4627">JSON RFC</a>:</p>

<blockquote><p>2.2. Objects<br />
An object structure is represented as a pair of curly brackets
surrounding zero or more name/value pairs (or members). A name is a
string. A single colon comes after each name, separating the name
from the value. A single comma separates a value from a following
name. <strong>The names within an object SHOULD be unique.</strong> <em>(emphasis mine)</em></p></blockquote>

<p>&#8220;SHOULD&#8221;. Not &#8220;MUST&#8221;. I don&#8217;t care if it&#8217;s legal. It still weirds me out.</p>

<p>Once you&#8217;ve got solr running in the background, you can go ahead and try our query!</p>

<ul>
<li>If you&#8217;re really lazy, just <a href="http://localhost:8983/solr/select/?q=&#95;query&#95;%3A%22%7B%21dismax+qf%3D%24f1+mm%3D%24mm1+pf%3D%24f1+bq%3D%24bq1+v%3D%24q1%7D%22&amp;mm1=0%25&amp;f1=author%5E3+title%5E1&amp;q1=Dueber+Constructivism&amp;bq1=&#95;query&#95;%3A%22%7B%21dismax+qf%3D%24f1+mm%3D%27100%25%27+v%3D%24q1+%7D%22%5E5&amp;fl=score%2C%2A">click the link</a></li>
<li>If you&#8217;re slightly less lazy, and you&#8217;ve got ruby installed, take a look in the new ruby directory. You can run <code>ruby browse.rb localparams_query.rb</code> to run the query and have it automatically open up in your browser.</li>
<li>If you&#8217;re ambitious, you might want to actually mess with the <code>localparams_query.rb</code> file so you can try things out.</li>
</ul>
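<p>If that link looks like line noise, here is a minimal Ruby sketch of how it can be put together. It is not the contents of <code>localparams_query.rb</code>; it only assumes the parameter names visible in the link (<code>f1</code>, <code>q1</code>, <code>mm1</code>, <code>bq1</code>) and uses the standard library to do the escaping.</p>

```ruby
require 'uri'

# Named pieces that the local-params query refers to with $name.
# The values mirror the link above; nothing here is specific to browse.rb.
params = {
  'q'   => '_query_:"{!dismax qf=$f1 mm=$mm1 pf=$f1 bq=$bq1 v=$q1}"',
  'mm1' => '0%',
  'f1'  => 'author^3 title^1',
  'q1'  => 'Dueber Constructivism',
  'bq1' => '_query_:"{!dismax qf=$f1 mm=\'100%\' v=$q1 }"^5',
  'fl'  => 'score,*'
}

# encode_www_form percent-encodes each value and joins the pairs.
query_string = URI.encode_www_form(params)
url = "http://localhost:8983/solr/select/?#{query_string}"

puts url
```

<p>Keeping the readable values in a hash and escaping them in one place makes it a lot easier to change the search terms or the field weights without hand-editing percent signs.</p>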

<p>As a longish side note, we&#8217;ll probably use <code>browse.rb</code> in the future of this series as well, so you might want to go ahead and get ruby installed if you haven&#8217;t already. <a href="https://rvm.beginrescueend.com/">RVM</a> is the easiest route if you&#8217;re on linux/OSX. You can also just install <a href="http://jruby.org/">JRuby</a>, seeing as how you&#8217;re running java anyway (just make sure to use 1.9 mode by calling stuff as <code>jruby --1.9 myscript.rb</code> or setting the environment variable <code>export JRUBY_OPTS=--1.9</code>).</p>

<h2>Special Stupid <acronym title="Solr isn\'t an acronym, silly!">Solr</acronym> Trick: Make a special query handler for a complex query</h2>

<p>OK, so I said I wouldn&#8217;t have a real SST in this episode, but it&#8217;s so damn long at this point I figure I&#8217;ve lost everyone except Rochkind (Hey, Jonathan!), so let&#8217;s throw one in.</p>

<p>The <acronym title="Solr isn\'t an acronym, silly!">Solr</acronym> configuration file <code>solrconfig.xml</code> is where you can configure custom search handlers. In such a custom handler, you can specify defaults (which, by default, can be overridden by passed-in parameters, although you can control that, too) &#8212; this is commonly used to, say, put in a <code>q.alt</code> or a filter query that will always be applied.</p>

<p>But we can also use it to hold the defaults for our special query, the one that boosts documents containing all the terms:</p>

<div class="geshi no xml"><ol><li class="li1"><div class="de1">&lt;requestHandler name=&quot;edismaxplus&quot; class=&quot;solr.SearchHandler&quot;&gt;</div></li>
<li class="li1"><div class="de1">&nbsp; &lt;lst name=&quot;defaults&quot;&gt;</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &lt;int name=&quot;rows&quot;&gt;10&lt;/int&gt;</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &lt;str name=&quot;fl&quot;&gt;*,score&lt;/str&gt;</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &lt;str name=&quot;echoParams&quot;&gt;explicit&lt;/str&gt;</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &lt;str name=&quot;q&quot;&gt;</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; _query_:&quot;{!edismax qf=$fields</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;mm=$mymm</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;v=$qwords</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;bq=$boostForAll}&quot;</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &lt;/str&gt;</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &lt;str name=&quot;mymm&quot;&gt;0%&lt;/str&gt;</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &lt;str name=&quot;boostForAll&quot;&gt;</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; _query_:&quot;{!edismax qf=$fields</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;mm=&#39;100%&#39;</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp; &nbsp;v=$qwords }&quot;^5</div></li>
<li class="li1"><div class="de1">&nbsp; &nbsp; &lt;/str&gt;</div></li>
<li class="li1"><div class="de1">&nbsp; &lt;/lst&gt;</div></li>
<li class="li1"><div class="de1">&lt;/requestHandler&gt;</div></li></ol></div>

<p>If you look closely, you&#8217;ll see that everything you need is defined in this requestHandler in the <code>solrconfig.xml</code> file, except for <code>$fields</code> and <code>$qwords</code>. You could also override <code>mymm</code> by passing in an argument with that name, if the default &#8216;0%&#8217; isn&#8217;t to your liking.</p>

<p>If you&#8217;ve been following along at home, this requestHandler is already in the <code>solrconfig.xml</code> file that you&#8217;re running right now. Go ahead and try it! Let&#8217;s search for the terms &#8216;dueber&#8217; and &#8216;penn&#8217; and see if the correct record floats to the top.</p>

<p><a href="http://localhost:8983/solr/edismaxplus/?qwords=dueber%20penn&amp;fields=author%20title">http://localhost:8983/solr/edismaxplus/?qwords=dueber penn&amp;fields=author title</a></p>

<p>Nifty, huh?</p>
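<p>To poke at the handler from a script, something like the following should do. It is only a sketch: <code>edismaxplus_url</code> is a name I made up for illustration, and the <code>50%</code> override is an arbitrary value, not a recommendation.</p>

```ruby
require 'uri'

# Only the two parameters the handler leaves undefined are required;
# mymm is optional and replaces the handler's default of 0%.
def edismaxplus_url(qwords, fields, mymm = nil)
  params = { 'qwords' => qwords, 'fields' => fields }
  params['mymm'] = mymm if mymm
  "http://localhost:8983/solr/edismaxplus/?#{URI.encode_www_form(params)}"
end

puts edismaxplus_url('dueber penn', 'author title')
puts edismaxplus_url('dueber penn', 'author title', '50%')
```

<p>The spaces come out as <code>+</code> rather than <code>%20</code>, which means the same thing in a query string.</p>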

<p>Next time we&#8217;ll use a local params query to get around something about dismax that drives me crazy: preventing (or penalizing) matches that go across a field&#8217;s multiple values.</p>]]></content:encoded>
			<wfw:commentRss>http://robotlibrarian.billdueber.com/using-localparams-in-solr-sst-2/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Solr Field Type for numeric(ish) IDs (SST #1)</title>
		<link>http://robotlibrarian.billdueber.com/solr-field-type-for-numericish-ids/</link>
		<comments>http://robotlibrarian.billdueber.com/solr-field-type-for-numericish-ids/#comments</comments>
		<pubDate>Thu, 01 Mar 2012 04:01:21 +0000</pubDate>
		<dc:creator>Bill</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[solr]]></category>
		<category><![CDATA[Stupid Solr Tricks]]></category>

		<guid isPermaLink="false">http://robotlibrarian.billdueber.com/?p=420</guid>
		<description><![CDATA[[For the introduction to this series, take a quick gander at the introduction] Like everyone else in the library world, I&#8217;ve got a bunch of well-defined, well-controlled standard identifiers I need to keep track of and allow searching on. You know, well-vetted stuff like this: 1234-5678 123-4567-890 12-34-567-X 0012-0045 ISBN13: 1234567890123 ISSN: 1234567X (1998-99) ISSN [...]]]></description>
				<content:encoded><![CDATA[<p>[For the introduction to this series, take a quick gander at <a href="http://robotlibrarian.billdueber.com/stupid-solr-tricks-introduction/">the introduction</a>]</p>

<p>Like everyone else in the library world, I&#8217;ve got a bunch of well-defined, well-controlled 
standard identifiers I need to keep track of and allow searching on.</p>

<p>You know, well-vetted stuff like this:</p>

<ul>
<li>1234-5678</li>
<li>123-4567-890</li>
<li>12-34-567-X</li>
<li>0012-0045</li>
<li>ISBN13: 1234567890123</li>
<li>ISSN: 1234567X (1998-99)</li>
<li><acronym title="International Standard Serial Number">ISSN</acronym> (1998-99): 1234567X</li>
<li>1234567890 (hdk. 22 pgs)</li>
<li>9</li>
<li>Behind the 3rd floor desk</li>
<li>Henry VIII</li>
</ul>

<p>[Note: some of these may be a titch exaggerated]</p>

<p>How does your system deal with these on index? How about on query?</p>

<p>Here&#8217;s an idea of how to use a custom solr fieldtype to do the heavy lifting.</p>

<h2>What we&#8217;re shooting for</h2>

<p>I&#8217;d like to be able to send in a text string as follows:</p>

<ul>
<li>The input can contain other text besides the id</li>
<li>The ID starts with a digit, continues with digits and (optional) dashes, and ends with a digit and possibly a trailing &#8216;X&#8217; or &#8216;x&#8217; so we can deal with <acronym title="International Standard Book Number">ISBN</acronym>/<acronym title="International Standard Serial Number">ISSN</acronym></li>
<li>The ID has to be at least N characters long (for this example, I&#8217;m using N=8); this helps us avoid other text that might trivially look like an ID but isn&#8217;t.</li>
<li>Only the ID itself is indexed</li>
<li>If no valid ID is identified, nothing is indexed</li>
</ul>

<h2>The numericID field, suitable for <acronym title="International Standard Book Number">ISBN</acronym>/<acronym title="International Standard Serial Number">ISSN</acronym>/<acronym title="Online Computer Library Center">OCLC</acronym>/etc.</h2>

<p>Let&#8217;s take a look at the end product and then walk through it.</p>

<script src="https://gist.github.com/1947347.js"></script><noscript><pre><code class="language-xml xml">&lt;fieldtype name=&quot;numericID&quot; class=&quot;solr.TextField&quot; 
           positionIncrementGap=&quot;1000&quot; omitNorms=&quot;true&quot;&gt;
&lt;analyzer&gt;
  &lt;tokenizer class=&quot;solr.KeywordTokenizerFactory&quot;/&gt;
    &lt;filter class=&quot;solr.PatternReplaceFilterFactory&quot;
              pattern=&quot;^.*?(\p{N}[\p{N}\-\.]{6,}[xX]?).*$&quot; 
              replacement=&quot;***$1&quot; /&gt;
    &lt;filter class=&quot;solr.PatternReplaceFilterFactory&quot;
              pattern=&quot;^[^\*].*$&quot; replacement=&quot;&quot; /&gt;
    &lt;filter class=&quot;solr.PatternReplaceFilterFactory&quot;
              pattern=&quot;^\*\*\*&quot; replacement=&quot;&quot; /&gt;
    &lt;filter class=&quot;solr.LowerCaseFilterFactory&quot;/&gt;
    &lt;filter class=&quot;solr.PatternReplaceFilterFactory&quot;
              pattern=&quot;[^\p{N}x]&quot; replacement=&quot;&quot; 
              replace=&quot;all&quot; /&gt;
    &lt;filter class=&quot;solr.LengthFilterFactory&quot; min=&quot;8&quot; max=&quot;14&quot; /&gt;
    &lt;filter class=&quot;solr.PatternReplaceFilterFactory&quot;
              pattern=&quot;^0*&quot; replacement=&quot;&quot;
    /&gt;
  &lt;/analyzer&gt;
&lt;/fieldtype&gt;</code></pre></noscript>

<h2>Things we&#8217;ll be learning about today</h2>

<p><strong>NOTE: I really, really recommend taking a look at <a href="http://www.lucidimagination.com/content/scaling-lucene-and-solr">Scaling Lucene and <acronym title="Solr isn\'t an acronym, silly!">Solr</acronym></a> by the good folks over at <a href="http://www.lucidimagination.com/">Lucid Imagination</a> for great, short explanations of <em>omitNorms</em>, term frequencies, etc.</strong></p>

<p>Since this is the first post, I&#8217;ll go over some stuff that&#8217;s probably a little 
too basic for any audience that&#8217;s likely to show up here, but what the heck.</p>

<ul>
<li>KeywordTokenizer</li>
<li>PatternReplaceFilterFactory</li>
<li>LowerCaseFilterFactory</li>
<li>LengthFilterFactory</li>
</ul>

<h3>Step 1: &#8220;Tokenize&#8221; to a single token</h3>

<p>The job of a <em>tokenizer</em> is to decide how to split your input into individual tokens (often &#8220;words&#8221;), which are then munged by any filters you&#8217;re applying.</p>

<p>For the case of an ID, <em>we don&#8217;t want to tokenize</em>. At least at this juncture, I&#8217;m not trying to extract multiple valid IDs out of a single string; I&#8217;m just trying to determine if there&#8217;s a valid ID in there somewhere and throwing everything else away.</p>

<p>In other words, I&#8217;m going to treat the input as a <em>single token</em>, and then munge the bejeebers out of it in order to get what I want.</p>

<p>In the <acronym title="Solr isn\'t an acronym, silly!">Solr</acronym> world, that leads us to the confusingly-named <a href="http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.KeywordTokenizerFactory">KeywordTokenizer</a>.</p>

<p><strong>What we have now</strong>: exactly what we started with</p>

<h3>Step 2: Find the first thing that looks like an ID and mark it</h3>

<p>I primarily work in Ruby and <acronym title="Practical Extraction and Report Language">Perl</acronym>, which means the dramatic abuse of regular expressions is just part of my daily life.</p>

<p>Line 5 is our first use of a regexp in the filter chain via <a href="http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.PatternReplaceFilterFactory">PatternReplaceFilterFactory</a>.</p>

<p>The idea is to:</p>

<ol>
<li>Find something that looks like a match</li>
<li>If found, get rid of everything else, and throw a &#8216;***&#8217; onto the beginning so later on I can tell if I matched or not.</li>
</ol>

<p>The second step is a little&#8230;odd&#8230;but necessary because I need a way to know if I found a candidate ID or not. If I did, well, there will be three asterisks on the front of the string from here on out. If not, there won&#8217;t.</p>

<p>This is a little confusing as these things go, so I&#8217;ll break it down.</p>

<p>Line 6: the match:</p>

<ul>
<li>Skip any amount of stuff we don&#8217;t care about (.*?)</li>
<li>Match a number (\p{N}) (that&#8217;s unicode regexp syntax, if you haven&#8217;t seen it)</li>
<li>Match a string of at least 6 numbers, dashes, or periods ([\p{N}\-\.]{6,})</li>
<li>Close with an optional X or x [Xx]?</li>
<li>&#8230;and any trailing bits until the end of the string.</li>
</ul>

<p>So&#8230;<em>[number][six numbers/dashes][optional X]</em></p>

<p>At minimum, that&#8217;s seven digits/dashes.</p>

<p>Line 7: replacement</p>

<ul>
<li>Replace the whole string (note how I anchored the match with ^ and $?) with whatever was matched inside the parentheses (represented here by $1) after prepending a set of three asterisks &#8216;***&#8217;</li>
</ul>

<p><strong>What we have now</strong>: If we found a candidate ID, we have that string prepended by &#8216;***&#8217;. Otherwise, we have exactly what we started with.</p>
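<p>If you&#8217;d rather poke at the regexp outside of Solr, here is a rough Ruby equivalent of this one filter. It is a sketch, not what Solr runs: Solr uses Java regular expressions, and <code>tag_candidate</code> is just a name I picked for illustration.</p>

```ruby
# Ruby's ^ and $ match at line boundaries, so \A and \z stand in for the
# whole-string anchors used in the Solr pattern.
ID_PATTERN = /\A.*?(\p{N}[\p{N}\-.]{6,}[xX]?).*\z/

# Returns the candidate ID tagged with '***', or the input unchanged.
def tag_candidate(input)
  input.sub(ID_PATTERN, '***\1')
end

puts tag_candidate('ISSN: 1234567X (1998-99)')  # ***1234567X
puts tag_candidate('123-4567-890')              # ***123-4567-890
puts tag_candidate('Henry VIII')                # Henry VIII
```

<p>The lazy <code>.*?</code> at the front is what makes the <em>first</em> candidate win when the input contains more than one.</p>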

<h3>Step 3: If we didn&#8217;t find a match, throw it all away</h3>

<p>Line 9 matches any string that <em>doesn&#8217;t</em> start with an asterisk. (We&#8217;re pretty sure raw input won&#8217;t start with one, because a leading asterisk is illegal lucene wildcard syntax.) If the string doesn&#8217;t start with an asterisk, we never tagged a candidate ID, so throw the whole damn thing away.</p>

<p>[There's a strong argument to be made that using an asterisk as the tagging character is a bad choice. Anyone have suggestions?]</p>

<p><strong>What we have now</strong>: Either a candidate ID string preceded by &#8216;***&#8217; or the empty string.</p>

<h3>Step 4: Ditch the &#8216;***&#8217; used to mark a candidate ID</h3>

<p>Lines 10-11</p>

<p>Find the &#8216;***&#8217; and throw it away.</p>

<p><strong>What we have now</strong>: The raw candidate ID string or the empty string.</p>

<h3>Step 5: Lowercase it</h3>

<p>Line 12.</p>

<p>By &#8216;it&#8217; I mean &#8220;any X that might be trailing the ID&#8221;; we should have thrown everything else away by now. (Note: could have done this with a PatternReplace as well, obviously; not sure why I&#8217;d choose one over the other).</p>

<p><strong>What we have now</strong>: The raw candidate ID string with its optional trailing &#8216;X&#8217; lowercased, or the empty string</p>

<h3>Step 6: Get rid of everything that&#8217;s not a number or an &#8216;x&#8217;</h3>

<p>Lines 13-15</p>

<p>Ditch any dashes that are remaining. I&#8217;m doing it like this instead of just ditching the dashes because I&#8217;ll likely modify this at some point to allow, e.g., periods between numbers, or maybe spaces. This is safer.</p>

<p>Note the extra parameter (<code>replace=&quot;all&quot;</code>), indicating that I want to replace all occurrences. This hasn&#8217;t been an issue until now because I&#8217;ve been careful to match the entire string by anchoring the pattern at the beginning (&#8216;^&#8217;) and end (&#8216;$&#8217;).</p>

<p><strong>What we have now</strong>: A string of numbers possibly followed by an &#8216;x&#8217;, or the empty string.</p>
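<p>The difference between replacing the first match and replacing all of them is easy to see in Ruby, where <code>sub</code> and <code>gsub</code> play those two roles. This is only an analogy for the filter setting, using a made-up candidate string.</p>

```ruby
candidate = '12-34-567-x'

# Replacing only the first match leaves most of the dashes in place ...
first_only = candidate.sub(/[^\p{N}x]/, '')

# ... while replacing every match strips them all, which is what the
# replace="all" setting asks the filter to do.
all_matches = candidate.gsub(/[^\p{N}x]/, '')

puts first_only   # 1234-567-x
puts all_matches  # 1234567x
```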

<h3>Step 7: Make sure what we have is a reasonable length</h3>

<p>Line 16</p>

<p>Now that we&#8217;ve gotten rid of the dashes, we need to make sure we have enough digits left to make a valid identifier.</p>

<p>If we didn&#8217;t match originally, it quickly got reduced to the empty string, and that will disappear here due to having length 0.</p>

<p>It&#8217;s also possible that our initial match was, say, <code>1----3-----6---7</code>, which will at this point have been reduced to just <code>1367</code> &#8212; too short for our taste.</p>

<p>In this version, I allow strings of any length between 8 and 14 (barcode), per the <code>min</code> and <code>max</code> on the LengthFilterFactory above. Drop <code>min</code> to 7 if you need to catch old <acronym title="Online Computer Library Center">OCLC</acronym> numbers.</p>

<p><strong>What we have now</strong>: A string consisting purely of 8-14 characters, the last of which may be an &#8216;x&#8217;, or nothing at all (i.e., nothing will get indexed).</p>

<h3>Step 8: Remove leading 0s</h3>

<p>My ILS (Aleph) loves to zero-pad all its local identifiers. I&#8217;d rather get rid of them.</p>

<p><strong>What we have now</strong>: What we had before, but with no leading zeros</p>
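<p>Strung together, the eight steps look roughly like this in Ruby. Treat it as an approximation for experimenting at the command line, assuming the <code>min=8</code>/<code>max=14</code> settings shown in the config above; the real work is done by Solr&#8217;s Java filters, and <code>numeric_id</code> is a name invented for this sketch.</p>

```ruby
# Each step mirrors one filter in the numericID analyzer, in order.
def numeric_id(input)
  # Step 1: the whole input is a single token.
  token = input.dup
  # Step 2: tag the first candidate ID with '***'.
  token = token.sub(/\A.*?(\p{N}[\p{N}\-.]{6,}[xX]?).*\z/, '***\1')
  # Step 3: anything that was not tagged gets emptied out.
  token = token.sub(/\A[^*].*\z/, '')
  # Step 4: drop the tag itself.
  token = token.sub(/\A\*\*\*/, '')
  # Step 5: lowercase any trailing X.
  token = token.downcase
  # Step 6: keep only digits and 'x'.
  token = token.gsub(/[^\p{N}x]/, '')
  # Step 7: wrong-length tokens (including the empty string) are dropped.
  return nil unless token.length.between?(8, 14)
  # Step 8: strip leading zeros.
  token.sub(/\A0*/, '')
end

puts numeric_id('0012-0045')              # 120045
puts numeric_id('12-34-567-X')            # 1234567x
puts numeric_id('Henry VIII').inspect     # nil
```

<p>Returning <code>nil</code> stands in for &#8220;no token at all&#8221;, which is what the length filter gives you when it removes the token.</p>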

<h2>Let&#8217;s try it!</h2>

<p>If you&#8217;re following along at home, get the latest version of the schema and try it!</p>

<div class="geshi no bash"><ol><li class="li1"><div class="de1">&nbsp; <span class="kw3">cd</span> solr_stupid_tricks</div></li>
<li class="li1"><div class="de1">&nbsp; git pull origin master</div></li>
<li class="li1"><div class="de1">&nbsp; java -jar start.jar</div></li></ol></div>

<p>&#8230;and then:</p>

<ul>
<li>Go to the analysis page at <a href="http://localhost:8983/solr/admin/analysis.jsp?highlight=on">http://localhost:8983/solr/admin/analysis.jsp?highlight=on</a></li>
<li>Set the first line of the form to use Field: <strong>type</strong> and input <em>numericID</em></li>
<li>Check the &#8220;verbose output&#8221; checkbox under <em>Field value: index</em></li>
<li>Put in a test value and see what the analyzer gives you!</li>
</ul>

<p>For those of you <em>not</em> following along at home, here are the examples from waaaaaay at the top of this post:</p>

<ul>
<li>1234-5678 => 12345678</li>
<li>123-4567-890 => 1234567890</li>
<li>12-34-567-X => 1234567x</li>
<li>0012-0045 => 120045</li>
<li>ISBN13: 1234567890123 => 1234567890123</li>
<li>ISSN: 1234567X (1998-99) => 1234567x</li>
<li>ISSN (1998-99): 1234567X => <strong>199899</strong> </li>
<li>1234567890 (hdk. 22 pgs) => 1234567890</li>
<li>9 => [nothing]</li>
<li>Behind the 3rd floor desk => [nothing]</li>
<li>Henry VIII => [nothing]</li>
</ul>

<p>So&#8230;not too bad. We did miss one, mistaking a year range for a numeric ID, but if your data are that bad, there&#8217;s only so much we can do.</p>

<h2>Conclusions</h2>

<p>Obviously, this is the tip of the iceberg with this sort of thing. And it can still be confused.</p>

<p>But it <em>does</em> follow our goal of having the exact same behavior on index and query, moving the logic to solr, and being pretty flexible.</p>

<p>Perfect? No. Useful? Yes.</p>]]></content:encoded>
			<wfw:commentRss>http://robotlibrarian.billdueber.com/solr-field-type-for-numericish-ids/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Stupid Solr tricks: Introduction (SST #0)</title>
		<link>http://robotlibrarian.billdueber.com/stupid-solr-tricks-introduction/</link>
		<comments>http://robotlibrarian.billdueber.com/stupid-solr-tricks-introduction/#comments</comments>
		<pubDate>Wed, 29 Feb 2012 18:20:24 +0000</pubDate>
		<dc:creator>Bill</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[solr]]></category>
		<category><![CDATA[Stupid Solr Tricks]]></category>

		<guid isPermaLink="false">http://robotlibrarian.billdueber.com/?p=414</guid>
		<description><![CDATA[Completed parts of the series: A Solr Field Type for numeric(ish) IDs Using localparams in Solr (or, how to boost records that contain all terms) Requiring/Preferring searches that don&#8217;t span multiple values Boosting on Exactish (anchored) phrase matching Those of you who read this blog regularly (Hi Mom!) know that while we do a lot [...]]]></description>
				<content:encoded><![CDATA[<p><strong>Completed parts of the series:</strong></p>

<ol>
<li><a href="http://robotlibrarian.billdueber.com/solr-field-type-for-numericish-ids/">A <acronym title="Solr isn\'t an acronym, silly!">Solr</acronym> Field Type for numeric(ish) IDs</a></li>
<li><a href="http://robotlibrarian.billdueber.com/using-localparams-in-solr-sst-2/">Using localparams in <acronym title="Solr isn\'t an acronym, silly!">Solr</acronym> (or, how to boost records that contain all terms)</a></li>
<li><a href="http://robotlibrarian.billdueber.com/requiringpreferring-searches-that-dont-span-multiple-values-sst-3/">Requiring/Preferring searches that don&#8217;t span multiple values</a></li>
<li><a href="http://robotlibrarian.billdueber.com/boosting-on-exactish-anchored-phrase-matching-in-solr-sst-4/">Boosting on Exactish (anchored) phrase matching</a></li>
</ol>

<p>Those of you who read this blog regularly (Hi Mom!) know that while we do a lot of stuff at the <a href="http://lib.umich.edu">University of Michigan Library</a>, our bread-and-butter these days are projects that center around <a href="http://lucene.apache.org/solr/"><acronym title="Solr isn\'t an acronym, silly!">Solr</acronym></a>.</p>

<p>Right now, my production <acronym title="Solr isn\'t an acronym, silly!">Solr</acronym> is running an ancient nightly of version 1.4 (i.e., before 1.4 was even officially released), and reflects how much I didn&#8217;t know when we first started down this path. My primary responsibility is for <a href="http://mirlyn.lib.umich.edu">Mirlyn</a>, our catalog, but there&#8217;s plenty of smart people doing smart things around here, and I&#8217;d like to be one of them.</p>

<p><acronym title="Solr isn\'t an acronym, silly!">Solr</acronym> has since advanced to 3.x (with version 4 on the horizon), and during that time I&#8217;ve learned a lot more about <acronym title="Solr isn\'t an acronym, silly!">Solr</acronym> and how to push it around. More importantly, I&#8217;ve learned a <em>lot</em> more about our data, the vagaries in the <acronym title="MAchine Readable Cataloging">MARC</acronym>/<acronym title="Anglo-American Cataloguing Rules">AACR2</acronym> that I process and how awful so much of it really is.</p>

<p>So&#8230;starting today I&#8217;m going to be doing some on-the-blog experiments with a new version of <acronym title="Solr isn\'t an acronym, silly!">Solr</acronym>, reflecting some of the problems I&#8217;ve run into and ways I think we can get more out of <acronym title="Solr isn\'t an acronym, silly!">Solr</acronym>.</p>

<h2>Premise 1: put all the logic you possibly can into <acronym title="Solr isn\'t an acronym, silly!">Solr</acronym></h2>

<p>Much of what I&#8217;ll be doing is looking at new field type definitions that are appropriate (in my mind, anyway) for library data. Some of this stuff (e.g., normalizing ISBNs) would be a <em>lot</em> easier to do in your indexing code.</p>

<p>But then you&#8217;d have to do it again in your application to munge whatever is entered in the search box. And maybe it won&#8217;t be the same every time. Or maybe you don&#8217;t want to write a freakin&#8217; parser to try to find anything that might look like an <acronym title="International Standard Book Number">ISBN</acronym> and mess with it.</p>

<p>I take it as gospel that you should put all your logic into the solr field analysis chain, so the exact same thing is happening on index and on query. That way, even if it&#8217;s <em>wrong</em>, at least it&#8217;ll be wrong in the exact same way and your users will find the stuff they&#8217;re looking for.</p>

<h2>Premise 2: Doing it crappily is better than not doing it at all.</h2>

<p>Look, the <em>right</em> way to do much of this stuff is by hacking on <acronym title="Solr isn\'t an acronym, silly!">Solr</acronym> itself, building custom field analyzers or filters or tokenizers that mess with the token chain and&#8230;</p>

<p>Wait. I already lost myself, and probably you, too. At some point, I&#8217;m going to do an actual sample custom filter for the new <acronym title="Solr isn\'t an acronym, silly!">Solr</acronym> codebase (the stuff I did once before is out-of-date); the example will be LCCN normalization and you&#8217;ll be able to follow along with me on this blog.</p>

<p>But in the meantime, we can do a lot of fairly ambitious stuff just by using and abusing the out-of-the-box stuff: pattern replacement filters, the existing tokenizers, etc. It might be ugly, and not very fast, but if I start getting the 200 hits a second that mean this is a bottleneck for me, I&#8217;ll be happy to deal with it then.</p>

<h2>Premise 3: It&#8217;s always better to put <em>something</em> out there so smart people can tell you how to do it <em>right</em></h2>

<p>One of the disappointments in my life right now is that there isn&#8217;t more formal and informal discussion about what people are doing/trying. I&#8217;m sure it&#8217;s out there, but some of it is buried in a sea of application-level crap, and much of it is ignored by the people that really understand the data.</p>

<p>With luck, I&#8217;ll get comments from folks who really know their stuff and can tell me, in excruciating detail, exactly how I <em>don&#8217;t</em>. Please: correct me. I might not be the brightest guy in the room, but I know enough to try to outsource my thinking.</p>

<h2>Follow along at home!</h2>

<h3>Option 1: Build your own current-trunk <acronym title="Solr isn\'t an acronym, silly!">Solr</acronym></h3>

<p>If you want to follow along at home, you&#8217;ll need a copy of the current source (not the 3.5 stable, since I use things like the ICUTokenizer coming in 3.6 / 4.0), which you can find and build from the <acronym title="Solr isn\'t an acronym, silly!">Solr</acronym> site.</p>

<h3>Option 2: Just use what I&#8217;m using</h3>

<p>Alternately, if you&#8217;re lazy (and who isn&#8217;t??), I&#8217;ve provided a <a href="https://billdueber@github.com/billdueber/solr_stupid_tricks">github repo</a> of the standard solr &#8220;example&#8221; directory you can nab and run on your own java-equipped machine.</p>

<p><strong>Warning: the git repo is currently 60MB or so</strong>.</p>

<div class="geshi no bash"><div class="head">git clone https://billdueber@github.com/billdueber/solr_stupid_tricks.git</div><ol><li class="li1"><div class="de1">&nbsp; <span class="kw3">cd</span> solr_stupid_tricks</div></li>
<li class="li1"><div class="de1">&nbsp; java -jar start.jar</div></li></ol></div>

<p>&#8230;and then head to your local <a href="http://localhost:8983/solr/admin/"><acronym title="Solr isn\'t an acronym, silly!">Solr</acronym> Admin page</a> on port 8983 to check things out. We&#8217;ll be spending most of our time in <a href="http://localhost:8983/solr/admin/analysis.jsp">the analysis tab</a>.</p>

<p>I&#8217;ll get the first post in the series up later today, and then every few days as I think of more things to talk about. I hope you&#8217;ll join me!</p>]]></content:encoded>
			<wfw:commentRss>http://robotlibrarian.billdueber.com/stupid-solr-tricks-introduction/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Another short personal note</title>
		<link>http://robotlibrarian.billdueber.com/another-short-personal-note/</link>
		<comments>http://robotlibrarian.billdueber.com/another-short-personal-note/#comments</comments>
		<pubDate>Mon, 27 Feb 2012 20:50:51 +0000</pubDate>
		<dc:creator>Bill</dc:creator>
				<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[personal]]></category>

		<guid isPermaLink="false">http://robotlibrarian.billdueber.com/?p=380</guid>
		<description><![CDATA[The baby spent all last week in the hospital. Nothing life-threatening (so long as he was in the hospital and could get O2 when needed); it was just annoying. So&#8230;.here&#8217;s to a week-long hospital stay being able to be merely &#8220;annoying&#8221;. A tip of the hat to steady employment, generous sick/vacation policies, flexible co-workers, excellent [...]]]></description>
				<content:encoded><![CDATA[<p>The baby spent all last week in the hospital. Nothing life-threatening (so long as he was in the hospital and could get O2 when needed); it was just annoying.</p>

<p>So&#8230;.here&#8217;s to a week-long hospital stay being able to be merely &#8220;annoying&#8221;. A tip of the hat to steady employment, generous sick/vacation policies, flexible co-workers, excellent insurance, and having a world-class hospital in town. This could have been a much, much worse week than it was.</p>]]></content:encoded>
			<wfw:commentRss>http://robotlibrarian.billdueber.com/another-short-personal-note/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Solr and boolean operators</title>
		<link>http://robotlibrarian.billdueber.com/solr-and-boolean-operators/</link>
		<comments>http://robotlibrarian.billdueber.com/solr-and-boolean-operators/#comments</comments>
		<pubDate>Thu, 01 Dec 2011 16:13:09 +0000</pubDate>
		<dc:creator>Bill</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://robotlibrarian.billdueber.com/?p=402</guid>
		<description><![CDATA[[Summary: ALWAYS ALWAYS ALWAYS USE PARENTHESES TO GROUP BOOLEANS IN SOLR!!!] What does Solr do, given the following query? a OR b AND c I&#8217;ll give you three guesses, but you&#8217;ll get the first two wrong and won&#8217;t have any idea how to generate a third, so don&#8217;t spend too much time on it. Boolean [...]]]></description>
				<content:encoded><![CDATA[<p>[Summary: <strong>ALWAYS ALWAYS ALWAYS USE PARENTHESES TO GROUP BOOLEANS IN SOLR!!!</strong>]</p>

<p>What does <acronym title="Solr isn\'t an acronym, silly!">Solr</acronym> do, given the following query?</p>

<p><pre>
  a OR b AND c
</pre></p>

<p>I&#8217;ll give you three guesses, but you&#8217;ll get the first two wrong
and won&#8217;t have any idea how to generate a third, so don&#8217;t spend too much time on it.</p>

<h2>Boolean algebra and operator precedence</h2>

<p>Anyone who&#8217;s had even a passing introduction to boolean algebra knows that it specifies a strict order to how the operators are bound: NOT before AND before OR. So, one might expect the following grouping:</p>

<p><pre>
  a OR (b AND c)
</pre></p>

<p>That&#8217;s guess one. It&#8217;s not how <acronym title="Solr isn\'t an acronym, silly!">Solr</acronym> does it.</p>

<h2>Left to right?</h2>

<p>Some naive students, and at least one programming language (Smalltalk), do a simple left-to-right evaluation. So you might go with:</p>

<p><pre>
  (a OR b) AND c
</pre></p>

<p>Nope. Wrong again.</p>

<h2>So what&#8217;s left???</h2>

<p>Excellent question. I don&#8217;t know the code well enough to know what&#8217;s going on underneath, but here&#8217;s what we get under the lucene query parser.</p>

<p><pre>
    (b AND c)
</pre></p>

<p>That&#8217;s right. The first term is thrown away. (More correctly, the first term is deemed &#8220;optional&#8221;.)</p>

<h2>Do you let your users put AND/OR/NOT in their queries?</h2>

<p>Hopefully, they don&#8217;t know any boolean algebra. If they do, hopefully they use parentheses, or you parse it out for them. And if not, well, they&#8217;re gonna be pretty damn confused.</p>

<h2>It gets weirder</h2>

<p>I populated a fresh solr (3.5) index with all possible subsets of the strings &#8220;curly&#8221;, &#8220;larry&#8221;, &#8220;moe&#8221;, and &#8220;shemp&#8221; (not Joe. Don&#8217;t talk to me about Joe). There are 15 of them, from the one-item &#8216;curly&#8217; to all four at once.</p>
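<p>For the curious, generating those 15 documents takes a couple of lines of Ruby. This is a sketch of the idea only; it isn&#8217;t the script I used to load the index, and the ordering is whatever <code>combination</code> happens to produce.</p>

```ruby
STOOGES = %w[curly larry moe shemp]

# Every non-empty subset, smallest first: 4 + 6 + 4 + 1 = 15 documents.
docs = (1..STOOGES.size).flat_map { |n| STOOGES.combination(n).to_a }

docs.each_with_index do |names, i|
  puts "#{i + 1}: #{names.join(' ')}"
end
```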

<p>I wrote a script to run a set of queries against the index under both lucene and edismax to see what I would get. In all cases the default lucene operator is &#8216;AND&#8217; and the edismax mm parameter is set to 100% (equivalent to &#8220;all required&#8221;).</p>

<p><pre>
    Lucene                    EDismax
  ------------------------------------------------------

  1. curly AND larry
    curly larry               curly larry
    curly larry moe           curly larry moe
    curly larry shemp         curly larry shemp
    curly larry moe shemp     curly larry moe shemp

  2. curly AND larry OR moe
    curly                     curly larry
    curly larry               curly larry moe
    curly moe                 curly larry shemp
    curly shemp               curly larry moe shemp
    curly larry moe
    curly larry shemp
    curly moe shemp
    curly larry moe shemp

  3. curly OR larry AND moe
    larry moe                 larry moe
    curly larry moe           curly larry moe
    larry moe shemp           larry moe shemp
    curly larry moe shemp     curly larry moe shemp

  4. curly AND larry OR moe AND shemp
    curly moe shemp           curly larry moe shemp
    curly larry moe shemp

  5. moe AND shemp OR curly AND larry
    curly larry moe           curly larry moe shemp
    curly larry moe shemp
</pre></p>

<p>Query 1 is as expected. Query 2 apparently reduces to just &#8216;curly&#8217; under the lucene parser and &#8216;curly AND larry&#8217; under edismax (and query 3 similarly reduces to the two AND&#8217;d words). Queries 4 and 5 are&#8230;well, you can look at the debugQuery output to see what it gets, but not <strong>why</strong>. And then tell me how to explain it to a user.</p>
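
<p>The Lucene column, at least, is consistent with a simple rule: the operators don&#8217;t group anything, they only flip their neighbouring terms between &#8220;required&#8221; and &#8220;optional&#8221;. The Python sketch below is a simplified model of that idea, assuming a default operator of AND. It is not lifted from the parser source, the function names are made up for illustration, and it makes no attempt at NOT, parentheses, or edismax.</p>

```python
from itertools import combinations

STOOGES = ["curly", "larry", "moe", "shemp"]


def all_docs():
    """Every non-empty subset of the four names, as a set of words."""
    return [set(c) for n in range(1, 5) for c in combinations(STOOGES, n)]


def occurs(query):
    """Label each term MUST or SHOULD, assuming a default operator of AND.

    Assumed rules (a simplification, not the parser's source):
      * a term is MUST unless it is introduced by OR;
      * AND forces the term before it to MUST;
      * OR relaxes the term before it to SHOULD.
    """
    clauses = []   # list of [term, occur] pairs, in query order
    conj = None    # operator seen since the previous term
    for token in query.split():
        if token in ("AND", "OR"):
            conj = token
            continue
        if clauses and conj == "AND":
            clauses[-1][1] = "MUST"
        if clauses and conj == "OR":
            clauses[-1][1] = "SHOULD"
        clauses.append([token, "SHOULD" if conj == "OR" else "MUST"])
        conj = None
    return [tuple(clause) for clause in clauses]


def matches(query):
    """Return the documents that match query under the model above."""
    labelled = occurs(query)
    must = {term for term, occur in labelled if occur == "MUST"}
    should = {term for term, occur in labelled if occur == "SHOULD"}
    hits = []
    for doc in all_docs():
        if must:
            ok = must.issubset(doc)              # SHOULD terms do not filter here
        else:
            ok = bool(should.intersection(doc))  # no MUST terms: any SHOULD will do
        if ok:
            hits.append(" ".join(sorted(doc)))
    return hits


if __name__ == "__main__":
    for q in ("curly AND larry OR moe", "curly AND larry OR moe AND shemp"):
        print(q, occurs(q), matches(q))
```

<p>Under this model query 4 comes out as curly required, larry optional, moe and shemp required, which gives the two documents in the Lucene column. That describes the behavior; it still doesn&#8217;t make it any easier to explain to a user.</p>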

<h2>Where does this leave us?</h2>

<p>The good news is that both lucene and edismax behave predictably when you use parentheses for grouping. So do that.</p>
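
<p>If users can&#8217;t be relied on to type the parentheses themselves, a small pre-processing step can add them. The Python sketch below is deliberately minimal and rests on a few assumptions: terms are whitespace-separated, the only operators are AND and OR, and anything involving NOT, quotes, or existing parentheses is rejected rather than guessed at. The function name is hypothetical.</p>

```python
def add_parens(query):
    """Parenthesise AND-joined runs so that AND binds tighter than OR."""
    tokens = query.split()
    for token in tokens:
        if token == "NOT" or any(ch in token for ch in '()"'):
            raise ValueError("unsupported syntax: %r" % token)

    # Split the token stream into groups separated by OR.
    groups = [[]]
    for token in tokens:
        if token == "OR":
            groups.append([])
        else:
            groups[-1].append(token)
    if any(not group for group in groups):
        raise ValueError("OR is missing an operand: %r" % query)

    # A query with no OR needs no extra grouping.
    if len(groups) == 1:
        return " ".join(groups[0])

    parts = []
    for group in groups:
        text = " ".join(group)
        parts.append(text if len(group) == 1 else "(" + text + ")")
    return " OR ".join(parts)


if __name__ == "__main__":
    print(add_parens("curly OR larry AND moe"))
    print(add_parens("curly AND larry OR moe AND shemp"))
```

<p>With that, &#8216;curly OR larry AND moe&#8217; goes to Solr as &#8216;curly OR (larry AND moe)&#8217;, which is what guess one expected in the first place.</p>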

<p>I&#8217;m generally not one to complain about open-source software, at least partially because I don&#8217;t have the chops to do anything about it most of the time, but I don&#8217;t understand how this could seem OK to anyone. There are a couple of Lucene Jira tickets (<a href="https://issues.apache.org/jira/browse/LUCENE-167">LUCENE-167</a> and <a href="https://issues.apache.org/jira/browse/LUCENE-1823">LUCENE-1823</a>) and a <a href="http://www.mail-archive.com/java-user@lucene.apache.org/msg00008.html">2005 mailing list thread</a> denouncing the current behavior, but it persists.</p>

<p>Until the <acronym title="Solr isn't an acronym, silly!">Solr</acronym>/Lucene powers that be decide to tackle this, the rest of us will either have to write pre-parsers to make sure users get something sensible, or cripple our applications to disallow unrestricted boolean queries.</p>]]></content:encoded>
			<wfw:commentRss>http://robotlibrarian.billdueber.com/solr-and-boolean-operators/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
	</channel>
</rss>

