Come work at the University of Michigan

April 17, 2013 at 9:34 pm

The Library has three UX positions available right now — interface designer, interface developer, and a web content strategist.

Come join me at what is easily the best place I’ve ever worked! Full details are over at Suz’s blog.

 

 


Please: don’t return your books

February 12, 2013 at 4:16 pm

So, I’m at code4lib 2013 right now, where side conversations and informal exchanges tend to be the most interesting part.

Last night I had a conversation with the inimitable Michael B. Klein, and after complaining about faculty members who keep books out for decades at a time, we ended up asking a simple question:

> How much more shelving would we need if everyone returned their books?

Assuming we could get them all checked in and such, well, where would we put them?

I’m looking at this in the simplest, most conservative way possible:

  • Assume they’re all paperbacks, so we don’t worry about how thick a cover is (cover width = 0)
  • Assume items for which we don’t have page count information are “average”

Starting data

What’s my current situation at Michigan?

  • Total bibs: about 10M (but that includes a bunch of HathiTrust items and other electronic-only items that could never be checked out)
  • Total items checked out right now: 162,080

The first problem I run into is that I don’t know how many pages are in a given book. Well, in theory I can look in MARC field 300$a, and it will tell me.

Finding the number of pages in a book

I went through a recent dump of all our records and pulled out page counts from the 300 (those that matched the regular expression $$a\d+\s+[pP].).
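A minimal sketch of that extraction, assuming the 300$a values have already been pulled out of the records as plain strings. The `page_count` helper and the sample values are hypothetical, and the pattern is one reading of the expression above (digits, whitespace, then `p.`), not necessarily the exact one used.

```ruby
# Hypothetical helper: pull a page count out of a MARC 300$a string.
# Returns nil when the value has no "NNN p." statement.
PAGE_PATTERN = /(\d+)\s+[pP]\./

def page_count(subfield_a)
  match = PAGE_PATTERN.match(subfield_a)
  match && match[1].to_i
end

# Made-up sample values, for illustration only.
samples = ['244 p. ;', 'xii, 310 p. :', '1 online resource', '3 v. ;']
samples.each { |s| puts "#{s.inspect} => #{page_count(s).inspect}" }
```

Anything that doesn't match (electronic resources, volume-only statements) simply comes back as nil, which is why so many records end up without a count.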

Problem solved, right? Well, kind of

  • 3,085,433 total bibs with page count data (about 30%)
  • 40,872 checked out items with page count data (about 25%)

OK, so I don’t have data for everything. Plus, some of those are multi-volume works that list the total page count, even though only a single volume may be checked out.

We’ll have to drop down into statistics:

  • Average number of pages in a checked-out item: 270
  • Median number of pages in a checked-out item: 244

The median is lower, so we’ll go with that. Being conservative, remember?

Bringing it all together

Obviously we need to make a lot of assumptions.

  • All paperbacks (== no space allowance for covers)
  • 244 pages per item (the median of checked out items for which we have data)
  • Pages = 244 * 162,080 = 39,547,520 pages

So…what’s the damage?

But how to do the calculation?

It turns out that if you simply google “book spine width calculator”, a few come up.

I picked one and input 39,547,520 pages and assumed 50lb paper (the lightest paper in the tool).

Total width: 77,241.25 inches, or 6437 feet, or 1.22 miles
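The arithmetic is easy to re-check. This sketch uses only the figures quoted in this post; the pages-per-inch value is inferred by working backwards from the calculator's output, so treat it as an assumption about that tool rather than a documented constant.

```ruby
MEDIAN_PAGES = 244
CHECKED_OUT  = 162_080
TOTAL_INCHES = 77_241.25 # width reported by the spine calculator

total_pages    = MEDIAN_PAGES * CHECKED_OUT
feet           = TOTAL_INCHES / 12
miles          = feet / 5280
pages_per_inch = total_pages / TOTAL_INCHES # inferred, not documented

puts "pages: #{total_pages}"
puts "feet: #{feet.round}, miles: #{miles.round(2)}"
puts "implied pages per inch: #{pages_per_inch.round(1)}"
```

Swapping in a different pages-per-inch figure (heavier paper, or an allowance for covers) is then a one-line change.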

1.22 miles???

Well, we had a lot of assumptions, but most of them were pretty conservative. And I have no idea if the book spine calculator is at all accurate.

But…it’s gonna be a big number no matter what. Add in that many of them are hardcover, and this seems like a pretty good guess at a lower end.

What is this good for again?

Oh, nothing at all. Just a little fun while I’m at code4lib.

Next steps

Well, the best next step would be to walk away. This is a huge waste of time.

But…we could look in the 020s for a hint of whether it’s hardcover or paperback (which is really hard). And maybe try to figure out if multiple volumes of a multi-volume work are all checked out and take that into account.

But really: this is enough for me. Whether Michael wants to pursue it further on his own, well, that’s up to him.

One Response to “Please: don’t return your books”

  1. I had a class in Library School in the 1990′s on operations research in the library. There was some interesting stuff on adjusting loan periods so that some number of books was always checked out in order to make sure there was enough shelf space.

    A bunch of researchers applied Operations Research (queueing theory/monte carlo simulations, and more math stuff) to the relationship between library loan periods/policies and shelving needs (as well as user satisfaction etc)

    Here is an excerpt from some of Buckland’s research in the 1960′s. It doesn’t actually talk about deliberately making the loan period longer in order to reduce the need for shelf space though. I’m pretty sure either his book or subsequent research dealt with this issue however. On the other hand, this research on user’s behavior from 1968 sounds relevant for today.

    http://people.ischool.berkeley.edu/~buckland/lancasterlru.pdf

    During 1967 and 1968 a series of measurements were undertaken which showed

    that library users could find the books they were looking for about 6 times out of 10;

    that the major cause of nonavailability was that the book was out on loan to someone else;

    that borrowed books tended to remain out for the full length of the loan period;

    that in practice a loan period was determined not by written policies but by when overdue fines began;

    that disappointed would-be borrowers did not often avail themselves of the procedures for recalling books back from loan;

    and that in-library book use tended to have a stable relationship to circulation in any given library (Hindle and Buckland, 1978).

    A Monte Carlo simulation was used to avoid the limitations of queuing theory. A flow chart of borrowing activities was programmed so that a computer could simulate the sequence of users seeking a single book, its repeatedly being borrowed and returned, and how often a copy was not available when sought. The simulation was flexible enough to show the effects of changes in the pattern and level of demand, in the length of the loan period, and/or of changing the number of copies of that book.

    For more details see Buckland’s book on this “Book availability and the library user”: http://mirlyn.lib.umich.edu/Record/000014253

    Tom


Just throwing this up here because I didn’t find it elsewhere.

I want to run ruby scripts from the command line or in a cronjob, and I do not want to always have to type “ruby scriptname”.

But, I use rvm. I want to run a particular ruby, maybe identified by an alias, maybe with a specific gemset.

It turns out you can use the env program with rvm do to accomplish this.

    #!/usr/bin/env rvm 1.9 do ruby

    require 'mygem'
    o = MyGem.new
    # blah blah blah

In this example, 1.9 is the name of the ruby (actually, an rvm alias) I want to use, and it could just as easily specify a gemset as well (e.g., 1.9@mygems).

If you’re running in cron, don’t forget you need to load the environment variables first. Here I use the bash . command to source my .bashrc.

  54 9-16 * * 1-5 . /Users/dueberb/.bashrc; /Users/dueberb/bin/exercise

Nothing fancy, but worth knowing.

3 Responses to “Ruby sidebar: Using rvm on the shebang (#!) line in a script”

  1. If you’re running in a cronjob anyway, you could also just

       54 9-16 * * 1-5 . /Users/dueberb/.bashrc; rvm 1.9 do ruby /Users/dueberb/bin/exercise
    

    I’ve been experimenting with rbenv instead or rvm lately, especially on production servers. I’m liking it, it’s simpler, easier to understand, things go wrong less and i know how to fix em when they do. But i’m not sure if it would have the shebang line trick feature like that.

  2. (ha, good job with markdown in your comments, it looks like! What software do you use for your blog?)

    • mark
    • down
    • list
  3. Sands Fish says:

    Jonathan, I wish I had found out about rbenv just days earlier so I wouldn’t have done all kinds of unholy things to my pristine new Macbook’s environment to get rvm to install (like installing alternate GCC compilers into the XCode setup. cry)

    Thought looking at it, it doesn’t manage gemsets, which for me is a huge value, since most of my issues when running different Rails apps on the same machine end up being problems with various gems that are present or not.


[Check out the introduction to the Stupid Solr Tricks series if you’re just joining us.]

Exact matching in Solr is easy. Use the default string type: all it does is, essentially, exact phrase matching. string is a great type for faceted values, where the only way we expect to search the index is via text pulled from the index itself. Query the index to get a value: use that value to re-query the index. Simple and self-contained.

But much of the time, we don’t want exact matching. We want exactish matching. You know, where things are exactly the same except. Except for case, or punctuation, or how much whitespace is between tokens. Maybe do some unicode folding, or stemming.

Essentially, we want to reward users (via high relevancy) for getting really close. If someone types in a full title, but misses a colon, well, let’s go ahead and assume they want that particular item.

Exactish matching vs phrase matching

Phrase matching in Solr does a great job, but fails those of us generating super-complex queries where we want to provide awesome service for those users doing known-item queries. If someone puts in the exact(ish) title, or the exact(ish) subject, well, those items should float to the top.

Solr’s default phrase matching (via, say, the pf param in dismax or just putting your query in quotes) doesn’t differentiate between a phrase that matches the whole target string and only part of that target string. For this, we’ll need a decent text fieldtype and a way to “anchor” the search to both ends of the target string.

Our goals

We’re shooting for:

  • A useful text type that we can use all over the place
  • A phrase match against that field that will match any portion of the target text. Solr already does this — that’s a normal Solr phrase search.
  • A “fully anchored” text type that will only phrase match if the query string exactishly-matches the whole field. We’ll phrase-search on this field and boost it way up.
  • And, what the heck, a left-anchored version that will exactish match a phrase only at the start of a field. We’ll boost this one up a bit less.

Follow along at home

Go ahead and clone the github repo I’ve been using if you haven’t already and let’s dig in.

    cd solr_stupid_tricks
    git pull origin master
    git fetch --all
    git checkout SST4
    java -jar start.jar &

There are some additions to the schema.xml file; let’s take a look!

Step 1: get a decent text type

The recent nightly of Solr 3.x we’re using has a great tokenizer in ICUTokenizerFactory, which does “the right thing” across a whole host of languages.

    <fieldtype name="text" class="solr.TextField" positionIncrementGap="1000">
      <analyzer>
        <tokenizer class="solr.ICUTokenizerFactory"/>
          <filter class="solr.ICUFoldingFilterFactory"/>
          <filter class="solr.SynonymFilterFactory"
                  synonyms="syn.txt" ignoreCase="true" expand="false"/>
    <!-- <filter class="solr.WordDelimiterFilterFactory"
                 generateWordParts="1" generateNumberParts="1"
                 catenateWords="1" catenateNumbers="1" catenateAll="0"/>
    -->
          <filter class="solr.CJKWidthFilterFactory"/>
          <filter class="solr.CJKBigramFilterFactory"/>
      </analyzer>
    </fieldtype>

Let’s take it bit by bit:

  • Obviously, start with the ICUTokenizer with a large positionIncrementGap so we can do some of the tricks we talked about last time
  • Next, we get one-stop shopping with the ICUFoldingFilterFactory. It provides all of the following:
    • NFKC normalization (precomposing),
    • Unicode case folding (i.e., lowercasing)
    • search term folding (removing accents, etc).
  • Push in synonyms if you have any
  • Uncomment the WordDelimiterFilterFactory if you want to. I’m going to try to avoid it, since it messes with the number of tokens midstream and I worry about the effect on dismax and its mm parameter as explained so excellently by Jonathan Rochkind
  • Dealing with CJK (Chinese, Japanese, Korean) is hard. The CJK filters process those languages and provide overlapping bigrams so searching isn’t (I’m told) quite as painful. (I really, really recommend the above link for a great overview by Tom Burton-West).

Step 2: Set up parallel text types that anchor phrase matches to one or both ends

We’re going to use something new: a charFilter. This differs from a normal filter in that it affects the input string before tokenization.

Here’s the trick. We’re going to add anchoring text (I chose just ‘AAAA’ at the front and ‘ZZZZ’ at the end) to the normal text type, just by adding a simple charfilter.

    <fieldtype name="text_lr" class="solr.TextField" positionIncrementGap="1000">
      <analyzer>
        <charFilter class="solr.PatternReplaceCharFilterFactory"
          pattern="^(.*)$" replacement="AAAA $1 ZZZZ" />
        <tokenizer class="solr.ICUTokenizerFactory"/>
          <filter class="solr.ICUFoldingFilterFactory"/>
          <filter class="solr.SynonymFilterFactory"
                  synonyms="syn.txt"
                  ignoreCase="true" expand="false"/>
          <filter class="solr.CJKWidthFilterFactory"/>
          <filter class="solr.CJKBigramFilterFactory"/>
      </analyzer>
    </fieldtype>

Note that this charFilter actually adds two new tokens (‘AAAA’ and ‘ZZZZ’) to your token stream on both index and query. How does this help us?

Let’s look at indexing Mister Blue Sky in a normal text field. A normal solr phrase query q="Blue Sky" will match on that value, because the query phrase is fully contained in the indexed phrase.

But what happens if we index into a text_lr field?

  • Indexing Mister Blue Sky becomes aaaa mister blue sky zzzz
  • Search terms blue sky becomes aaaa blue sky zzzz
  • Phrase searching will then compare the two transformed values using normal Solr rules, find that the latter is not fully contained in the former as a phrase, and give up.
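To make the anchoring idea concrete, here is a toy model in Ruby. It is only a sketch: the `tokens`, `anchored_tokens`, and `phrase_match?` helpers are hypothetical, the tokenizer is a crude lowercase-and-split stand-in for the ICU chain, and "phrase match" is modeled as a contiguous run of tokens rather than Solr's real scoring.

```ruby
# Crude stand-in for the analysis chain: lowercase, keep alphanumeric runs.
def tokens(text)
  text.downcase.scan(/[[:alnum:]]+/)
end

# Mimic the charFilter by adding sentinel tokens at one or both ends.
def anchored_tokens(text, left: true, right: true)
  toks = tokens(text)
  toks.unshift('aaaa') if left
  toks.push('zzzz') if right
  toks
end

# A phrase matches when the query tokens appear as one contiguous run.
def phrase_match?(indexed, query)
  return false if query.empty?
  indexed.each_cons(query.length).any? { |window| window == query }
end

title = 'Mister Blue Sky'
puts phrase_match?(tokens(title), tokens('Blue Sky'))                        # plain field
puts phrase_match?(anchored_tokens(title), anchored_tokens('Blue Sky'))      # fully anchored
puts phrase_match?(anchored_tokens(title), anchored_tokens('mister blue sky')) # exactish
```

Passing `right: false` on both sides gives the left-anchored behavior: `Mister Blue` then matches at the start of the field, while `Blue Sky` still does not.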

Be careful, though. That ‘aaaa’ and ‘zzzz’ are there just as if you’d typed them in. Thus every indexed value has the tokens ‘aaaa’ and ‘zzzz’, and every query will, in effect, include a query for ‘aaaa’ or ‘zzzz’ (depending on your mm settings).

That means that any non-phrase query will match every field that uses this fieldtype, and it will also mess with token counts with respect to your mm parameter. For those reasons, only ever use anchored fieldtypes for phrase queries when you want exactish matches.

By adding only one of ‘AAAA’ or ‘ZZZZ’, we can have left-anchored and right-anchored searches as well. See the schema.xml for these definitions.

Try it out!

Let’s take a small set of new documents:

    [
      {
        "id": "1",
        "title": "The Monkees: Pleasant Valley Never"
      },
      {
        "id": "2",
        "title": "The Monkees"
      },
      {
        "id": "3",
        "title": "Meet the Monkees"
      },
      {
        "id": "4",
        "title": "Corportate boy bands through the ages"
      }
    ]

We have copyFields set up to copy the title field to both a fully-anchored field (title_exact) and a left-anchored field (title_l).

    <copyField source="title" dest="title_exact"/>
    <copyField source="title" dest="title_l"/>

If you’re following at home, clear out your solr and index them:

    cd exampledocs
    ./reset_and_index_json.sh exactish.json

We’ll now run three dismax queries, all of which use the search terms the monkees. Watch what happens to the score as we change things.

  • First, qf=title, pf=title^2. This will match the three Monkees documents, and then boost all of them because they all contain the phrase “the monkees” in the title.
  • Second, qf=title, pf=title_exact^10 title^2. These will match the Monkees documents, and then give a huge boost to the one with the exact match.
  • Finally, qf=title, pf=title_exact^10 title_l^5 title^2. There you’ll see the score for the exact title match go way up (relatively speaking, of course), and document 1 go up quite a bit (because it begins with the phrase “The Monkees”).
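Written out as Ruby hashes in the style used elsewhere in this series, the three queries might look something like the following. This is a sketch built from the parameters listed above, not the actual contents of exactish_query.rb; the `base` and `queries` names are made up for illustration.

```ruby
# Parameters shared by all three queries.
base = {
  'defType' => 'dismax',
  'fl'      => 'score, *',
  'q'       => 'the monkees',
  'qf'      => 'title'
}

# Only the phrase-boost fields (pf) change from one query to the next.
queries = [
  base.merge('pf' => 'title^2'),
  base.merge('pf' => 'title_exact^10 title^2'),
  base.merge('pf' => 'title_exact^10 title_l^5 title^2')
]

queries.each { |query| puts query['pf'] }
```

Keeping qf fixed means the set of matching documents stays the same across all three; only the boosts, and therefore the ordering and scores, move.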

You can run all three queries as:

    cd ruby
    ruby browse.rb exactish_query.rb
    # or ruby browse.rb exactish_query.rb json|xml|csv to get different output type

[BTW, browse.rb will now take an array of queries to run in a single file.]

Tah Dah! You’ve successfully boosted the exactish match, and the left-anchored exactish match. Your known-item-searchers will thank you.

You may want to take a look at exactish_query.rb to see what’s going on.

To sum up

  • Your schema.xml now contains a decent text type and three variants for anchoring phrase searches left, right, and full (exactish)
  • The anchored text fields should NOT NOT NOT be searched against by anything other than a single phrase (which means they’re very useful in the pf param of a dismax search). A non-phrase search will trivially match every single document, so, you know, avoid that.
  • You now have a set of tools (field types, copyField directives, phrase search) that can be used to provide higher boosts to exactish matches and left-anchored exactish phrase matches.

6 Responses to “Boosting on Exactish (anchored) phrase matching in Solr: (SST #4)”

  1. Incredibly helpful, thanks.

  2. [...] Boosting on Exactish (anchored) phrase matching [...]

  3. ntucker says:

    Re: “For those reasons, only ever use anchored fieldtypes for phrase queries when you want exactish matches.”

    Are you suggesting that the application code analyze the user input and only search against the anchored fields if it contains phrases?

  4. Bill says:

    No, sorry I was unclear. I’m saying you should munge the users’ input to be a phrase query (i.e, remove all the double-quotes and the wrap the whole thing in double-quotes) or, more trivially, only ever include the anchored fields in a pf dismax parameter which does the work for you.

  5. ntucker says:

    I’ve been pondering this technique and there’s something that’s been nagging me about it which may just be a schema problem on a more fundamental level. I’ve used the what I thought was a fairly standard “catch-all” text field copyField definition: . ‘text’ is also my default query field. However, if I were to use this ‘anchored text’ technique, wouldn’t I need to somehow prevent those from also being copied into my ‘text’ field? Is this catch-all field a bad idea? Is there a less problematic way to specify it?

  6. Aparna says:

    We are trying to get the results which starts with the given phrase. But even if we give great boost to title_l field the document 1 is not coming up. How do you actually get results that starts with the keyword on top?


[Check out the introduction to the Stupid Solr Tricks series if you’re just joining us.]

Solr and multiValued fields

Here’s another thing you need to understand about Solr: it doesn’t really have fields that can take multiple values.

“But Bill,” you’re saying, “sure it does. I mean, hell, it even has a ‘multiValued’ parameter.”

First off: watch your language.

Second off: are you sure?

Let’s do a quick test. Look at the following documents

exampledocs/names.json
    [
      {
        "id": "1",
        "title": "The Monkees",
        "name_text": ["Peter Tork", "Mike Nesmith",
                      "Micky Dolenz", "Davy Thomas Jones"]
      },
      {
        "id": "2",
        "title": "Heros of the Wild West",
        "name_text": ["Buck Jones", "Davy Crockett"]
      }
    ]

Question: what do you get when you run this query against those two documents?

ruby/names_query.rb
    {
      'fl' => 'score, *',
      'defType' => 'dismax',
      'wt' => 'csv',
      'qf' => 'name_text',
      'q' => 'davy jones'   # Poor guy just died. So young. So short.
    }

See how I threw the wt=csv in there? Check out all the query response formats if you’re interested, but really all you’ll use is standard (XML), json, or csv unless you’re rolling your own in some way.

I’ve updated ruby/browse.rb to allow a second argument of the type of output you want. You can now do ruby browse.rb jsonfile [json|csv|standard|xml]

Following along at home?

If so, let’s go ahead and index these documents and run the query.

Play along at home
    cd solr_stupid_tricks
    git pull origin master
    git fetch --all
    git checkout SST3 # I've started tagging the repo for these posts
    # ignore warning about "detached HEAD"
    java -jar start.jar &
    cd exampledocs
    ./reset_and_index_json.sh names.json
    cd ../ruby
    ruby browse.rb names_query.rb

Here’s the scores that I get:

Return from Solr
    id,title,name_text,score
    2,Heros of the Wild West,"Buck Jones,Davy Crockett",0.42039964
    1,The Monkees,"Peter Tork,Mike Nesmith,Micky Dolenz,Davy Thomas Jones",0.26274976

Check out that last column. The query was davy jones. Document #1 contains a name that has both those terms, but document #2 (which has both terms, but in different names) gets a higher score.

The relevance ranking seems…wrong

While it looks like we added four separate names to the name_text field in our first document, Solr doesn’t see it that way. Solr treats those four poor Monkees as if they had one long name.

Then it finds all the documents that match the query (both of our documents match) and figures out which is a better match by assigning a score.

In this case, while both documents have both query terms, the field in the second document is shorter. Which means that, essentially, a higher percentage of the terms in the field value match the given query terms. In Solr’s mind, that makes it a better match, and the shorter document shows up first.
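A rough sketch of the length effect, assuming the classic Lucene similarity, where a field's length normalization is about 1 / sqrt(number of terms). The `token_count` and `length_norm` helpers are hypothetical, whitespace splitting stands in for the real tokenizer, and real scores also fold in term frequency, idf, and a lossy encoding of the norm.

```ruby
# Count tokens the way the index effectively sees them:
# all values of the field run together as one long string.
def token_count(values)
  values.join(' ').split.length
end

# Classic length normalization: shorter fields get a larger factor.
def length_norm(num_terms)
  1.0 / Math.sqrt(num_terms)
end

monkees = ['Peter Tork', 'Mike Nesmith', 'Micky Dolenz', 'Davy Thomas Jones']
heroes  = ['Buck Jones', 'Davy Crockett']

[monkees, heroes].each do |names|
  count = token_count(names)
  puts "#{count} tokens => norm #{length_norm(count).round(3)}"
end
```

With the same two query terms matching in both documents, the four-token field ends up with the larger factor, which is the ordering seen in the scores above.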

Solr doesn’t automatically give more weight to the recently-dead Monkee because internally it doesn’t care that you’re thinking of those values as four separate names. It just concatenates them together and indexes them.

This is not, for most people, expected behavior.

Phrase slop

Part of what’s going on here is that we haven’t told Solr that it should care how close together the terms are.

One way to do that is to use a phrase query by throwing quotes around the terms

Put double-quotes around it to make it a phrase query
    "q" => '"Davy Jones"'

…but that won’t find anything, because Davy and Jones aren’t right next to each other in our document.

Solr does allow a phrase query to be “sloppy”, though — basically saying that instead of being right next to each other, the terms need to be within a certain number of tokens of each other.

For that, we’ll tell solr to search against certain fields (pf) treating the query as a phrase, and allow a little slop (ps) as well.

ruby/names_sloppy_query.rb
    {
      'fl' => 'score, *',
      'defType' => 'dismax',
      'wt' => 'csv',
      'q' => 'davy jones',
      'qf' => 'name_text',
      'pf' => 'name_text^10', # search this field as a phrase
      'ps' => '4' # allow 'phrase' to mean 'within 4 tokens of each other'
    }

That gets us something more expected.

    id,title,name_text,score
    1,The Monkees,"Peter Tork,Mike Nesmith,Micky Dolenz,Davy Thomas Jones",0.2806283
    2,Heros of the Wild West,"Buck Jones,Davy Crockett",0.029652705

Enter positionIncrementGap

OK. Now that we have the concept of “slop”, one of those mystery fieldtype parameters makes sense: positionIncrementGap. Basically, a positionIncrementGap of 1000 means “when computing slop, pretend there are 1000 tokens between the entries in a multiValued field.”

A sloppy phrase search, then, will only find (and thus boost) the phrase if (a) the tokens are in the same entry for a multiValued field, and (b) your slop value is less than your positionIncrementGap.
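Here is a toy model of how the gap interacts with slop. It is a simplification under stated assumptions: the `positions` and `near?` helpers are hypothetical, tokens are split on whitespace, and "slop" is modeled as the number of positions between two terms, ignoring the extra cost Lucene charges for out-of-order terms.

```ruby
# Assign a position to every token, leaving `gap` empty positions
# between consecutive values of a multiValued field.
def positions(values, gap)
  pos = Hash.new { |hash, key| hash[key] = [] }
  offset = 0
  values.each do |value|
    toks = value.downcase.split
    toks.each_with_index { |tok, i| pos[tok] << (offset + i) }
    offset += toks.length + gap
  end
  pos
end

# True when some occurrence of a and b has at most `slop` positions between them.
def near?(pos, a, b, slop)
  pos.fetch(a, []).product(pos.fetch(b, [])).any? do |pa, pb|
    (pb - pa).abs - 1 <= slop
  end
end

monkees = ['Peter Tork', 'Mike Nesmith', 'Micky Dolenz', 'Davy Thomas Jones']
heroes  = ['Buck Jones', 'Davy Crockett']

puts near?(positions(monkees, 1000), 'davy', 'jones', 4)   # same value
puts near?(positions(heroes, 1000), 'davy', 'jones', 4)    # different values
puts near?(positions(heroes, 0), 'davy', 'jones', 4)       # no gap at all
```

In this model even a slop of 999 can't bridge a gap of 1000, which is the property the rest of this post leans on.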

All you have to do is use the pf and ps parameters and you’re set.

Note that this should be telling you two things:

  • Always use the same positionIncrementGap for your multiValued fields
  • Make it a number much larger than the maximum number of tokens you expect to ever have in a field.

Note that a large positionIncrementGap doesn’t actually put 1000 tokens in there — a large value doesn’t affect processing time or your index size or anything.

But I’m already using the pf parameter!

Slop is great when you want it. But I don’t always want to use slop. Slop of 4 makes the phrase “Sex in the City” be treated exactly the same as “In the Sex City“. If someone puts in an exact title, I want to reward them for that query by floating the exact match to the top, and slop prevents me from doing so.

[Foreshadowing: We'll talk about exact-ish matches in a few days.]

OK, so we can’t just appropriate the pf/ps parameters and push the slop value up all the time — that cripples our ability to create the query boost structure we want.

Query slop

So, dismax (and its cousin edismax) has an analogous parameter that affects only phrases within the normal query: qs.

qs is a dismax param that affects query slop — how much slop to allow in phrases within the query, much like the ps param.

The query

A three-token query
    'q' => 'Bill "The Weasel" Dueber'

…has three tokens, the second of which (“The Weasel”) is a phrase. It’s that phrase token that is affected by query slop.

OK. So it affects only the phrases in the normal query. But…suppose we just force the entire query to be one big phrase? That’ll get us somewhere!

We just need to do the following:

  • Create a boost query that uses the same fields as the regular query
  • …but treats all the query terms as one big phrase
  • …and give it a query slop of one less than the positionIncrementGap in our field type definition (in my case, 999)

Package it up

OK, so here’s what we’re going to do. You can just take this basic idea and build it into your own queries in your application code. Try it. You might like it. Play around with what fields are affected, how much weight to give it, etc.

But heck, we’ve gone this far. Let’s encode it into the Solr configuration file solrconfig.xml itself as a custom request handler.

We’re going to extend our edismaxplus requestHandler from last time, but we’ll add an extra boost query that reflects this new “prefer documents where all the tokens appear in the same ‘line’ of a multiValued field” attitude.

solr/conf/solrconfig.xml
  1.   <requestHandler name="/edismaxplus" class="solr.SearchHandler">
  2.     <lst name="defaults">
  3.       <str name="rows">10</str>
  4.       <str name="fl">*,score</str>
  5.       <str name="echoParams">explicit</str>
  6.       <str name="q">
  7.         _query_:"{!edismax qf=$fields mm=$mymm
  8.                             v=$qwords bq=$boostForAll}"</str>
  9.       <str name='mymm'>0%</str>
  10.       <str name="qwordsphrase">"JunkThatWillNEverShowUpInAMillionFreakinYears"</str>
  11.       <str name='boostForAll'>
  12.         _query_:"{!edismax qf=$fields
  13.                            mm='100%'
  14.                            v=$qwords }"^5 OR
  15.         _query_:"{!dismax  qf=$fields
  16.                            mm='100%'
  17.                            v=$qwordsphrase
  18.                            qs='999'}"^5
  19.       </str>
  20.     </lst>
  21.   </requestHandler>

We now do a few new things:

  • (Line 15) Add a second clause to the boost query that uses the same fields provided for the regular query (note the boolean OR between the two localparams queries that comprise this boost query)
  • (Line 17) Ask for another user-provided value: qwordsphrase, which your application-level code should set to all the regular query terms, but as a single phrase. Basically, strip out all the double-quotes, then put the whole thing in double quotes. In ruby: qwordsphrase = '"' + qwords.gsub(/"/, '') + '"'
  • (Line 10) Provide a default value for the new qwordsphrase that won’t ever show up in a real query (empty string won’t work; I tried it and it throws an error). So, if the application doesn’t provide qwordsphrase, no harm is done — the search regresses to what we had last time.
  • (Line 18) Use a qs (query slop) of 999 in the new boost clause acting against qwordsphrase. That value is one less than the positionIncrementGap of 1000, making sure that we don’t cross multiValue boundaries.
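On the application side, building the request might look roughly like this. It is a sketch, not code from the repo: the `phraseify` helper is hypothetical, and the URL simply reuses the /edismaxplus handler and the qwords, fields, and qwordsphrase parameters described above.

```ruby
require 'uri'

# Hypothetical helper: drop any double-quotes the user typed,
# then wrap the whole query in one pair of double-quotes.
def phraseify(qwords)
  '"' + qwords.delete('"') + '"'
end

qwords = 'davy jones'
params = {
  'qwords'       => qwords,
  'fields'       => 'name_text',
  'qwordsphrase' => phraseify(qwords)
}

# encode_www_form takes care of escaping spaces and quotes.
url = 'http://localhost:8983/solr/edismaxplus/?' + URI.encode_www_form(params)
puts url
```

Leaving qwordsphrase out of `params` falls back to the handler's junk default, so the extra boost clause just never matches.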

Note: If you wanted to, you could make this a filter query (fq) instead of a boost query to only allow documents that meet this criterion.

Let’s try it out!

Once again, if you did a git pull origin master you’ve got this up and running already — the updated requestHandler source is already in solr/conf/solrconfig.xml.

We first construct the query just like we did last week, without the qwordsphrase argument:

http://localhost:8983/solr/edismaxplus/?qwords=davy jones&fields=name_text

You’ll see Davy Crockett and friend appear as the first item.

But when you add the phraseified query, you’ll see the boost we’ve been talking about this whole post and get something more expected.

http://localhost:8983/solr/edismaxplus/?qwords=davy jones&fields=name_text&qwordsphrase="Davy Jones"

The Monkees are again on top! Party like it’s 1967!

Where it breaks down

If you actually have a phrase as one of your query terms, it will no longer be treated as a phrase during the boost because we’re getting rid of all the double-quotes.

And, of course, if you’ve got gobs of full-text and include your fulltext field, setting query slop to 999 isn’t just a cute trick, it’s a cute trick that will melt your servers to slag and still not do what you want it to do.

What have we learned?

  • Solr doesn’t really separate multiple values from each other in a multiValued field
  • Phrase slop (ps) and query slop (qs) can be used to allow “phrase” to mean “a bunch of tokens within X spots of each other”
  • I’m A Believer is the best song Neil Diamond ever wrote.

3 Responses to “Requiring/Preferring searches that don’t span multiple values (SST #3)”

  1. [...] Requiring/Preferring searches that don’t span multiple values [...]

  2. Very clever, nice.

    I’m missing why your ‘boostForAll’ does an ‘OR’ of TWO nested queries… one dismax and one edismax? Was that explained somewhere, I’m just missing it?

    Oh wait, okay, the first is from ‘last time’, and now you OR in another one…. I’m gonna have to think on this.


[Note: this isn't so much a Stupid Solr Trick as a Thing You Should Probably Know; consider it required reading for the next SST. If you're just joining us, check out the introduction to the Stupid Solr Tricks series]

What the heck is a localparams query?

A garden-variety Solr query URL looks something like this:

  http://localhost:8983/solr/select?
    defType=dismax
    &qf=name^2 place^1
    &q=Dueber
Which is fine, as far as it goes. But it’s easy to run into the limits of the standard query plugins (e.g., Dismax).

Say, for example, you want something like this:

  title:Constructivism AND author:Dueber
And furthermore, you have multiple underlying fields (title1, title2, title3, author1, author2).

The naïve approach would be to just do this:

  defType=dismax
  &qf=title1 title2 title3 author1 author2
  &q=Constructivism Dueber
But you can’t construct a dismax query with the boolean AND. You can with edismax, but even then you’ve got no way of telling (e)dismax that Constructivism must be found in the title fields, and Dueber must be found in the author fields. Dismax doesn’t do that.

Solution: Build a query of queries

The solution is to build a query made up of fully-encapsulated sub-queries. A localparams query has two forms (note that, of course, you’d need to URL-Escape the values):

  _query_:"{!dismax qf='field^2 otherfield^4'}my search terms"
or
  _query_:"{!dismax qf='field^2 otherfield^4' v=$q1}"&q1=my search terms
I far prefer the second form (which uses a second URL parameter q1 instead of sticking the search right in there), because I don’t have to worry about escaping double-quotes in the query terms (as you would if there’s a phrase as part of the query).

Once you’ve got these things, you can combine them with booleans.

    q=_query_:"{!dismax qf='title1 title2 title3' v=$q1}" AND
      _query_:"{!dismax qf='author1 author2' v=$q2}"
  &q1=Constructivism
  &q2=Dueber
[Note: be careful with solr booleans!!!]
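
Here's a minimal sketch of assembling that request in Ruby, using the second (v=$q1) form. The helper name fielded_and_query and the localhost URL are only for illustration, and the field names are the made-up ones from above. Because the search terms travel in their own parameters, a quoted phrase needs nothing beyond ordinary URL encoding.

```ruby
require 'uri'

# Build the parameters for "title terms AND author terms", keeping the
# user's terms out of the q string by referencing them as $q1 and $q2.
# (fielded_and_query is a hypothetical helper, not part of any library.)
def fielded_and_query(title_terms, author_terms)
  {
    'q'  => %q(_query_:"{!dismax qf='title1 title2 title3' v=$q1}" AND ) +
            %q(_query_:"{!dismax qf='author1 author2' v=$q2}"),
    'q1' => title_terms,
    'q2' => author_terms
  }
end

# A phrase with double quotes goes in untouched; only URL encoding is applied.
params = fielded_and_query('"radical constructivism"', 'Dueber')
url = "http://localhost:8983/solr/select?#{URI.encode_www_form(params)}"
puts url
```

The q string never changes from request to request; only q1 and q2 do, which also makes the query easy to log and compare.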

You can add any local parameters you need (for dismax, stuff like mm, qs, pf, and ps) and you can use any query parser you want by changing what comes after the bang (e.g., {!lucene ...} or {!edismax...}).

In this way, you can build up arbitrarily complex queries using any available query parsers in combination with each other. Very powerful.

An example: boost records that contain all terms

Just about everything in a localparams query can be pulled out in the way I pulled out the search terms above. Here’s a fairly-complex example (which, let’s be honest, would be a lot more complex if you were trying to inline and escape everything).

Scenario: We want to do a logical-OR search (mm=0%), but want to make sure we boost documents that contain all the search terms. This is necessary because sometimes a very long document with all the terms will have a lower score than a very short document with most of the terms.

Having a short document with a few keywords show up before long documents with all the keywords will drive your librarians CraZy!!! So it’s tempting to just leave it alone. But let’s fix it anyway.

The gist of it is as follows:

  • Query against title and author
  • Use an mm of 0% (logical OR) for the main query
  • Use a pf to boost on a phrase in those same two fields (just common sense)
  • Set up a boost query (bq) to boost the score if all the search terms are present

To accomplish this, we’re going to have two localparams queries: one to be the main query, and another that we’re going to use as the boost query. This works in much the same way as our previous “AND-together two localparams queries” did.

[Presenting the URL parameters as a ruby hash to make it easier to read]

  {
    'q'   => '_query_:"{!dismax qf=$f1 mm=$mm1 pf=$f1 bq=$bq1 v=$q1}"',
    'mm1' => '0%',
    'f1'  => 'author^3 title^1',
    'q1'  => 'Dueber Constructivism',
    'bq1' => '_query_:"{!dismax qf=$f1 mm=\'100%\' v=$q1 }"^5',
    'fl'  => 'score,*'
  }

What’s nice about this is that I’m reusing the search terms (for the main query and the boost query) and field list (for the query field and the phrase fields) so I don’t have to repeat them.

Try along at home

First off, if you don’t have a browser that does nice XML and JSON formatting, well, get one. I use Chrome with JSONView and XMLTree, but I’m sure there are equivalents for Firefox. They’ll make your life easier.

By now you know the drill:

  cd solr_stupid_tricks
  git pull origin master
  git fetch origin master
  git checkout SST2 # I've started tagging the repo for these posts
  # ignore warning about "detached HEAD"
  java -jar start.jar &

We’ll want to empty out the index and put in some documents to work with. I’m presuming you have curl installed. If not…well, you’re on your own.

  cd exampledocs
  ./reset_and_index_json localparams.json

You might want to take a look at the localparams.json file, which contains a set of documents in the new JSON update structure. The full Solr JSON Update structure allows repeated keys. Apparently, so does the JSON RFC:

> 2.2. Objects
>
> An object structure is represented as a pair of curly brackets
> surrounding zero or more name/value pairs (or members). A name is a
> string. A single colon comes after each name, separating the name
> from the value. A single comma separates a value from a following
> name. The names within an object SHOULD be unique. (emphasis mine)

“SHOULD”. Not “MUST”. I don’t care if it’s legal. It still weirds me out.

Once you’ve got solr running in the background, you can go ahead and try our query!

  • If you’re really lazy, just click the link
  • If you’re slightly less lazy, and you’ve got ruby installed, take a look in the new ruby directory. You can run ruby browse.rb localparams_query.rb to run the query and have it automatically open up in your browser.
  • If you’re ambitious, you might want to actually mess with the localparams_query.rb file so you can try things out.

As a longish side note, we’ll probably use browse.rb in the future of this series as well, so you might want to go ahead and get ruby installed if you haven’t already. RVM is the easiest route if you’re on linux/OSX. You can also just install JRuby, seeing as how you’re running java anyway (just make sure to use 1.9 mode by calling stuff as jruby --1.9 myscript.rb or setting the environment variable export JRUBY_OPTS=--1.9).

Special Stupid Solr Trick: Make a special query handler for a complex query

OK, so I said I wouldn’t have a real SST in this episode, but it’s so damn long at this point I figure I’ve lost everyone except Rochkind (Hey, Jonathan!), so let’s throw one in.

The Solr configuration file solrconfig.xml is where you can configure custom search handlers. In such a custom handler, you can specify defaults (which, by default, can be overridden by passed-in parameters, although you can control that, too) — this is commonly used to, say, put in a q.alt or a filter query that will always be applied.

But we can use it to put in our special query defaults that boost when a document contains all the terms:

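  <lst name="defaults">
    <int name="rows">10</int>
    <str name="fl">*,score</str>
    <str name="echoParams">explicit</str>
    <str name="q">
      _query_:"{!edismax qf=$fields
                         mm=$mymm
                         v=$qwords
                         bq=$boostForAll}"
    </str>
    <str name="mymm">0%</str>
    <str name="boostForAll">
      _query_:"{!edismax qf=$fields
                         mm='100%'
                         v=$qwords }"^5
    </str>
  </lst>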

If you look closely, you’ll see that everything you need is defined in this requestHandler in the solrconfig.xml file, except for $fields and $qwords. You could also override mymm by passing in an argument with that name, if the default ’0%’ isn’t to your liking.

If you’ve been following along at home, this requestHandler is already in the solrconfig.xml file that you’re running right now. Go ahead and try it! Let’s search for the terms ‘dueber’ and ‘penn’ and see if the correct record floats to the top.

http://localhost:8983/solr/edismaxplus/?qwords=dueber penn&fields=author title

Nifty, huh?

Next time we’ll use a local params query to get around something about dismax that drives me crazy: preventing (or penalizing) matches that go across a field’s multiple values.

3 Responses to “Using localparams in Solr (or, how to boost records that contain all terms) (SST #2)”

  1. Ah, I see.

    The unexplained mystery is why you’d need to do a special boost of the same query with mm 100%. One would think that ordinary Solr relevance ranking would make things where all terms match rank higher than things where just some terms match, with a less than 100% mm.

    Now, I’ve noticed times it doesn’t too. But I’ve never understood why.

    This is necessary because sometimes a very long document with all the terms will have a lower score than a very short document with most of the terms.

    Hmm, okay, have to ponder on this one too. Ah, it’s starting to sink in.



[For the introduction to this series, take a quick gander at the introduction]

Like everyone else in the library world, I’ve got a bunch of well-defined, well-controlled standard identifiers I need to keep track of and allow searching on.

You know, well-vetted stuff like this:

  • 1234-5678
  • 123-4567-890
  • 12-34-567-X
  • 0012-0045
  • ISBN13: 1234567890123
  • ISSN: 1234567X (1998-99)
  • ISSN (1998-99): 1234567X
  • 1234567890 (hdk. 22 pgs)
  • 9
  • Behind the 3rd floor desk
  • Henry VIII

[Note: some of these may be a titch exaggerated]

How does your system deal with these on index? How about on query?

Here’s an idea of how to use a custom solr fieldtype to do the heavy lifting.

What we’re shooting for

I’d like to be able to send in a text string as follows:

  • The input can contain other text besides the id
  • The ID starts with a digit and consists solely of digits and (optional) dashes, then ends with a digit and possibly a trailing ‘X’ or ‘x’ so we can deal with ISBN/ISSN
  • The ID has to be at least N characters long (for this example, I’m using N=8); this helps us avoid other text that might trivially look like an ID but isn’t.
  • Only the ID itself is indexed
  • If no valid ID is identified, nothing is indexed

The numericID field, suitable for ISBN/ISSN/OCLC/etc.

Let’s take a look at the end product and then walk through it.

Things we’ll be learning about today

NOTE: I really, really recommend taking a look at Scaling Lucene and Solr by the good folks over at Lucid Imagination for great, short explanations of omitNorms, term frequencies, etc.

Since this is the first post, I’ll go over some stuff that’s probably a little too basic for any audience that’s likely to show up here, but what the heck.

  • KeywordTokenizer
  • PatternReplaceFilterFactory
  • LowerCaseFilterFactory
  • LengthFilterFactory

Step 1: “Tokenize” to a single token

The job of a tokenizer is to decide how to split your input into individual tokens (often “words”), which are then munged by any filters you’re applying.

For the case of an ID, we don’t want to tokenize. At least at this juncture, I’m not trying to extract multiple valid IDs out of a single string; I’m just trying to determine if there’s a valid ID in there somewhere and throwing everything else away.

In other words, I’m going to treat the input as a single token, and then munge the bejeebers out of it in order to get what I want.

In the Solr world, that leads us to the confusingly-named KeywordTokenizer.

What we have now: exactly what we started with

Step 2: Find the first thing that looks like an ID and mark it

I primarily work in Ruby and Perl, which means the dramatic abuse of regular expressions is just part of my daily life.

Line 5 is our first use of a regexp in the filter chain via PatternReplaceFilterFactory.

The idea is to:

  1. Find something that looks like a match
  2. If found, get rid of everything else, and throw a ‘***’ onto the beginning so later on I can tell if I matched or not.

The second step is a little…odd…but necessary because I need a way to know if I found a candidate ID or not. If I did, well, there will be three asterisks on the front of the string from here on out. If not, there won’t.

This is a little confusing as these things go, so I’ll break it down.

Line 6: the match:

  • Skip any amount of stuff we don’t care about (.*?)
  • Match a number (\p{N}) (that’s unicode regexp syntax, if you haven’t seen it)
  • Match a string of at least 6 numbers and dashes
  • Close with an optional X or x [Xx]?
  • …and any trailing bits until the end of the string.

So…[number][six numbers/dashes][optional X]

At minimum, that’s seven digits/dashes.

Line 7: replacement

  • Replace the whole string (note how I anchored the match with ^ and $?) with whatever was matched inside the parentheses (represented here by $1) after prepending a set of three asterisks ‘***’

What we have now: If we found a candidate ID, we have that string prepended by ‘***’. Otherwise, we have exactly what we started with.

Step 3: If we didn’t find a match, throw it all away

Line 9 shows an attempt to match on any string that starts with an asterisk (which we’re pretty sure we won’t see because that’s illegal lucene wildcard syntax). If we have a string that doesn’t start with an asterisk, then throw the whole damn thing away because we don’t have a candidate ID anyway.

[There's a strong argument to be made that using an asterisk as the tagging character is a bad choice. Anyone have suggestions?]

What we have now: Either a candidate ID string preceded by ‘***’ or the empty string.

Step 4: Ditch the ‘***’ used to mark a candidate ID

Lines 10-11

Find the ‘***’ and throw it away.

What we have now: The raw candidate ID string or the empty string.

Step 5: Lowercase it

Line 12.

By ‘it’ I mean “any X that might be trailing the ID”; we should have thrown everything else away by now. (Note: could have done this with a PatternReplace as well, obviously; not sure why I’d choose one over the other).

What we have now: The raw candidate ID string with its optional trailing ‘X’ lowercased, or the empty string

Step 6: Get rid of everything that’s not a number or an ‘x’

Lines 13-15

Ditch any dashes that are remaining. I’m doing it like this instead of just ditching the dashes because I’ll likely modify this at some point to allow, e.g., periods between numbers, or maybe spaces. This is safer.

Note the extra parameter (replace=”all”), indicating that I want to replace all occurrences. This hasn’t been an issue until now because I’ve been careful to match the entire string by anchoring the pattern at the beginning (‘^’) and end (‘$’).

What we have now: A string of numbers possibly followed by an ‘x’, or the empty string.

Step 7: Make sure what we have is a reasonable length

Line 16

Now that we’ve gotten rid of the dashes, we need to make sure we have enough digits left to make a valid identifier.

If we didn’t match originally, it quickly got reduced to the empty string, and that will disappear here due to having length 0.

It’s also possible that our initial match was, say, ‘1---3---6---7’, which will at this point have been reduced to just ‘1367’, which is too short for our taste.

In this version, I allow strings of any length between 7 (old OCLC number) and 14 (barcode).

What we have now: A string consisting purely of 7-14 characters, the last of which may be an ‘x’, or nothing at all (i.e., nothing will get indexed).

Step 8: Remove leading 0s

My ILS (Aleph) loves to zero-pad all its local identifiers. I’d rather get rid of them.

What we have now: What we had before, but with no leading zeros

Let’s try it!

If you’re following along at home, get the latest version of the schema and try it!

  cd solr_stupid_tricks
  git pull origin master
  java -jar start.jar

…and then:

For those of you not following along at home, here are the examples from waaaaaay at the top of this post:

  • 1234-5678 => 12345678
  • 123-4567-890 => 1234567890
  • 12-34-567-X => 1234567x
  • 0012-0045 => 120045
  • ISBN13: 1234567890123 => 1234567890123
  • ISSN: 1234567X (1998-99) => 1234567x
  • ISSN (1998-99): 1234567X => 199899
  • 1234567890 (hdk. 22 pgs) => 1234567890
  • 9 => [nothing]
  • Behind the 3rd floor desk => [nothing]
  • Henry VIII => [nothing]

So…not too bad. We did miss one, mistaking a year range for a numeric ID, but if your data are that bad, there’s only so much we can do.
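
For anyone who wants to poke at the logic without firing up Solr, the eight steps can be approximated in plain Ruby. This is a sketch, not the actual field type: the regular expression is reconstructed from the description in Step 2, the 7 to 14 length window is the one stated in Step 7, and numeric_id is a hypothetical name. With those assumptions the year-range input comes back empty rather than as 199899, so the real field type evidently differs from this sketch in at least one detail.

```ruby
# Approximate the numericID analysis chain, one step per line.
# Returns the normalized ID, or nil when nothing would be indexed.
def numeric_id(input, min_len: 7, max_len: 14)
  token = input.dup                                  # Step 1: the whole input is one token
  # Step 2: keep the first candidate ID and tag it with '***' (pattern is an assumption)
  token = token.sub(/\A.*?(\p{N}[\p{N}\-]{6,}[Xx]?).*\z/, '***\1')
  token = '' unless token.start_with?('*')           # Step 3: no tag means no candidate
  token = token.sub(/\A\*{3}/, '')                   # Step 4: drop the tag
  token = token.downcase                             # Step 5: lowercase a trailing X
  token = token.gsub(/[^\p{N}x]/, '')                # Step 6: keep only digits and 'x'
  return nil unless token.length.between?(min_len, max_len)  # Step 7: length check
  token.sub(/\A0+/, '')                              # Step 8: strip leading zeros
end

puts numeric_id('0012-0045')      # 120045
puts numeric_id('12-34-567-X')    # 1234567x
p    numeric_id('Henry VIII')     # nil
```

Changing min_len and max_len is the quickest way to see how sensitive the results are to the length window.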

Conclusions

Obviously, this is the tip of the iceberg with this sort of thing. And it can still be confused.

But it does follow our goal of having the exact same behavior on index and query, moving the logic to solr, and being pretty flexible.

Perfect? No. Useful? Yes.

4 Responses to “Solr Field Type for numeric(ish) IDs (SST #1)”

  1. Do you have to do something to deal with the fact that while your analyzer specifies keyword tokenizer (no split on whitespace in tokenization phase) at both index time and query time, in reality at query time if you’re using lucene or dismax query parsers, they’ll sort of “pre-tokenize” on whitespace before even sending it to the analyzer?

    Does that end up causing a problem?

  2. This has an interesting problem. For tokens with no numbers, it creates a zero-length string. This matched seven records in my database that (incorrectly) had a zero-length string as the ISBN. Normal text searches would match these fields, which meant we never had zero hits, instead we showed the same seven irrelevant books.

    I added a LengthFilter to turn zero-length tokens into no token at all, which fixed the problem. Then I simplified the regex since I only care about ISBNs and EANs.

    <!-- Remove anything that isn’t a digit or an 'x'. -->
    <filter class="solr.PatternReplaceFilterFactory"
        pattern="[^\dx]" replacement="" replace="all"/>
    <!-- ISBNs and EANs are either 10 or 13 characters long. -->
    <filter class="solr.LengthFilterFactory" min="10" max="13"/>
    

    Thanks!

  3. Bill says:

    Note: I’ve talked to Walter, and his guess is that he left off the existing LengthFilterFactory when he dropped the remove-leading-zeros regexp (which others may want to do if you’re dealing only with stuff where leading zeros are useful and don’t want spurious collisions between, say, ISBN and ISSN). The numericID fieldtype in the post will not ever give zero-length strings.

    His comment also drives home the point that you should restrict this as much as possible — know your data so you know how to set the min/max length, whether to remove leading zeros, etc.


Completed parts of the series:

  1. A Solr Field Type for numeric(ish) IDs
  2. Using localparams in Solr (or, how to boost records that contain all terms)
  3. Requiring/Preferring searches that don’t span multiple values
  4. Boosting on Exactish (anchored) phrase matching

Those of you who read this blog regularly (Hi Mom!) know that while we do a lot of stuff at the University of Michigan Library, our bread-and-butter these days are projects that center around Solr.

Right now, my production Solr is running an ancient nightly of version 1.4 (i.e., before 1.4 was even officially released), and reflects how much I didn’t know when we first started down this path. My primary responsibility is for Mirlyn, our catalog, but there’s plenty of smart people doing smart things around here, and I’d like to be one of them.

Solr has since advanced to 3.x (with version 4 on the horizon), and during that time I’ve learned a lot more about Solr and how to push it around. More importantly, I’ve learned a lot more about our data, the vagaries in the MARC/AACR2 that I process and how awful so much of it really is.

So…starting today I’m going to be doing some on-the-blog experiments with a new version of Solr, reflecting some of the problems I’ve run into and ways I think we can get more out of Solr.

Premise 1: put all the logic you possibly can into Solr

Much of what I’ll be doing is looking at new field type definitions that are appropriate (in my mind, anyway) for library data. Some of this stuff (e.g., normalizing ISBNs) would be a lot easier to do in your indexing code.

But then you’d have to do it again in your application to munge whatever is entered in the search box. And maybe it won’t be the same every time. Or maybe you don’t want to write a freakin’ parser to try to find anything that might look like an ISBN and mess with it.

I take it as gospel that you should put all your logic into the solr field analysis chain, so the exact same thing is happening on index and on query. That way, even if it’s wrong, at least it’ll be wrong in the exact same way and your users will find the stuff they’re looking for.

Premise 2: Doing it crappily is better than not doing it at all.

Look, the right way to do much of this stuff is by hacking on Solr itself, building custom field analyzers or filters or tokenizers that mess with the token chain and…

Wait. I already lost myself, and probably you, too. At some point, I’m going to do an actual sample custom filter for the new Solr codebase (the stuff I did once before is out-of-date); the example will be LCCN normalization and you’ll be able to follow along with me on this blog.

But in the meantime, we can do a lot of fairly ambitious stuff just by using and abusing the out-of-the-box stuff: pattern replacement filters, the existing tokenizers, etc. It might be ugly, and not very fast, but if I start getting the 200 hits a second that mean this is a bottleneck for me, I’ll be happy to deal with it then.

Premise 3: It’s always better to put something out there so smart people can tell you how to do it right

One of the disappointments in my life right now is that there isn’t more formal and informal discussion about what people are doing/trying. I’m sure it’s out there, but some of it is buried in a sea of application-level crap, and much of it is ignored by the people that really understand the data.

With luck, I’ll get comments from folks who really know their stuff and can tell me, in excruciating detail, exactly how I don’t. Please: correct me. I might not be the brightest guy in the room, but I know enough to try to outsource my thinking.

Follow along at home!

Option 1: Build your own current-trunk Solr

If you want to follow along at home, you’ll need a copy of the current source (not the 3.5 stable, since I use things like the ICUTokenizer coming in 3.6 / 4.0), which you can find and build from the Solr site.

Option 2: Just use what I’m using

Alternately, if you’re lazy (and who isn’t??), I’ve provided a github repo of the standard solr “example” directory you can nab and run on your own java-equipped machine.

Warning: the git repo is currently 60MB or so.

  git clone https://billdueber@github.com/billdueber/solr_stupid_tricks.git
  cd solr_stupid_tricks
  java -jar start.jar

…and then head to your local Solr Admin page on port 8983 to check things out. We’ll be spending most of our time in the analysis tab.

I’ll get the first post in the series up later today, and then every few days as I think of more things to talk about. I hope you’ll join me!

6 Responses to “Stupid Solr tricks: Introduction (SST #0)”

  1. Jon Gorman says:

    Very cool. Kudos for doing this.

  2. Joe Montibello says:

    This is good stuff. I’m just getting my feet wet with Solr and Blacklight, and I’m already learning from your first two posts!

    Thanks.



Another short personal note


February 27, 2012 at 4:50 pm Category: Uncategorized

The baby spent all last week in the hospital. Nothing life-threatening (so long as he was in the hospital and could get O2 when needed); it was just annoying.

So….here’s to a week-long hospital stay being able to be merely “annoying”. A tip of the hat to steady employment, generous sick/vacation policies, flexible co-workers, excellent insurance, and having a world-class hospital in town. This could have been a much, much worse week than it was.


Solr and boolean operators

December 1, 2011 at 12:13 pm Category: Uncategorized

[Summary: ALWAYS ALWAYS ALWAYS USE PARENTHESES TO GROUP BOOLEANS IN SOLR!!!]

What does Solr do, given the following query?

  a OR b AND c

I’ll give you three guesses, but you’ll get the first two wrong and won’t have any idea how to generate a third, so don’t spend too much time on it.

Boolean algebra and operator precedence

Anyone who’s had even a passing introduction to boolean algebra knows that it specifies a strict order to how the operators are bound: NOT before AND before OR. So, one might expect the following grouping:

  a OR (b AND c)

That’s guess one. It’s not how Solr does it.

Left to right?

Some naive students, and at least one programming language (Smalltalk), do a simple left-to-right evaluation. So you might go with:

  (a OR b) AND c

Nope. Wrong again.

So what’s left???

Excellent question. I don’t know the code well enough to know what’s going on underneath, but here’s what we get under the lucene query parser.

    (b AND c)

That’s right. The first term is thrown away. (More correctly, the first term is deemed “optional”.)

Do you let your users put AND/OR/NOT in their queries?

Hopefully, they don’t know any boolean algebra. If they do, hopefully they use parentheses, or you parse it out for them. And if not, well, they’re gonna be pretty damn confused.

It gets weirder

I populated a fresh solr (3.5) index with all possible subsets of the strings “curly”, “larry”, “moe”, and “shemp” (not Joe. Don’t talk to me about Joe). There are 15 of them, from the one-item ‘curly’ to all four at once.
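
Generating that document set takes only a few lines of Ruby. This is a sketch of the idea rather than the script actually used; non_empty_subsets is a name invented for the example.

```ruby
STOOGES = %w[curly larry moe shemp].freeze

# Every non-empty subset of the given names, smallest subsets first.
def non_empty_subsets(items)
  (1..items.length).flat_map { |size| items.combination(size).to_a }
end

# One "document" per subset, with the names joined by spaces.
docs = non_empty_subsets(STOOGES).map { |names| names.join(' ') }
puts docs.length   # 15, i.e. 2**4 - 1
puts docs
```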

I wrote a script to run a set of queries against the index under both lucene and edismax to see what I would get. In all cases the default lucene operator is ‘AND’ and the edismax mm parameter is set to 100% (equivalent to “all required”).

     Query                             Lucene                  EDismax
  ----------------------------------------------------------------------------------
  1. curly AND larry                   curly larry             curly larry
                                       curly larry moe         curly larry moe
                                       curly larry shemp       curly larry shemp
                                       curly larry moe shemp   curly larry moe shemp

  2. curly AND larry OR moe            curly                   curly larry
                                       curly larry             curly larry moe
                                       curly moe               curly larry shemp
                                       curly shemp             curly larry moe shemp
                                       curly larry moe
                                       curly larry shemp
                                       curly moe shemp
                                       curly larry moe shemp

  3. curly OR larry AND moe            larry moe               larry moe
                                       curly larry moe         curly larry moe
                                       larry moe shemp         larry moe shemp
                                       curly larry moe shemp   curly larry moe shemp

  4. curly AND larry OR moe AND shemp  curly moe shemp         curly larry moe shemp
                                       curly larry moe shemp

  5. moe AND shemp OR curly AND larry  curly larry moe         curly larry moe shemp
                                       curly larry moe shemp

Query 1 is as expected. Query 2 apparently reduces to just ‘curly’ under the lucene parser and ‘curly AND larry’ under edismax (and query 3 similarly reduces to the two AND’d words). Queries 4 and 5 are…well, you can look at the debugQuery output to see what it gets, but not why. And then tell me how to explain it to a user.

Where does this leave us?

The good news is that both lucene and edismax behave predictably when you use parentheses for grouping. So do that.

I’m generally not one to complain about open-source software, at least partially because I don’t have the chops to do anything about it most of the time, but I don’t understand how this could seem OK to anyone. There are a couple lucene Jira tickets (Lucene-167 and Lucene-1823) and a 2005 mailing list thread denouncing the current behavior, but it persists.

Until the Solr/Lucene powers that be decide to tackle this, the rest of us will either have to write pre-parsers to make sure users get something sensible, or cripple our applications to disallow unrestricted boolean queries.
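
As a rough idea of what the simplest possible pre-parser might look like, here's a Ruby sketch that adds the parentheses for you, using the textbook rule that AND binds tighter than OR. It is deliberately naive and rests on several assumptions: operators are uppercase AND/OR, there is no NOT, no quoted phrases, and no parentheses in the input. The name group_booleans is made up for the example.

```ruby
# Make textbook precedence explicit: split on OR first, then wrap each
# AND-joined run of terms in parentheses before handing the query to Solr.
def group_booleans(query)
  or_clauses = query.strip.split(/\s+OR\s+/)
  grouped = or_clauses.map do |clause|
    terms = clause.split(/\s+AND\s+/)
    terms.length > 1 ? "(#{terms.join(' AND ')})" : clause
  end
  grouped.join(' OR ')
end

puts group_booleans('a OR b AND c')                      # a OR (b AND c)
puts group_booleans('curly AND larry OR moe AND shemp')  # (curly AND larry) OR (moe AND shemp)
```

Anything real would need a proper tokenizer, but even this much gives both parsers a query they handle predictably.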

3 Responses to “Solr and boolean operators”

  1. So the way I handle part of this at present is not actually passing these user-entered boolean queries straight to any solr query parser (not lucene, not edismax either; curious if edismax has the same or similar idiosyncrasies, I predict it will).

    Instead, I actually parse all user queries in my own application, and then construct Solr queries (using a ‘lucene’ type query, with nested dismax type queries) that I know do what is needed for the user query to be interpreted with more typical boolean logic.

    Of course, it’s possible I’ve gotten it wrong too, but theoretically my parser/interpreter can then be fixed to what I want.

    Now, it might make even MORE sense to do this as an actual custom Solr query parser, but I lack the Solr/java comfort/skills to do that, so I’d rather do it in ruby. I think it would probably in some ways be better to do this in Java as a Solr plug-in query parser, rather than at the application layer, but oh well.

    Here’s the specs from my code that show how various user-entered queries are translated to Solr. Note they also take care of various types of “pure negative” queries that lucene and dismax query parsers can’t handle ‘right’ either (I think edismax fixes some but not all of these ‘pure negative’ cases). Hopefully the spec file is somewhat readable, despite (or because of!) the helper functions I put in to test generated solr query params against templates.

    https://github.com/projectblacklight/blacklight_advanced_search/blob/master/spec/parsing_nesting/to_solr_spec.rb

  2. PS: Sorry, I see you covered edismax too, thanks!

    PPS: One answer I’ve gotten from solr-ites is basically “Well, yeah, AND/OR aren’t really boolean operators to these query parsers, they are just indication of lucene optional/required clauses.”

    To which my answer is: “Okay, but then why use syntax that looks like boolean algebra, and that makes it very hard to predict exactly how they are translated to optional/required clauses, to the novice. Why not use a different syntax that actually indicates what it’s doing?” Perhaps the answer is “Well, because users are used to AND/OR”, but I think it’s no service to give users a syntax they are used to but with semantics that they are NOT used to!

    However, it’s possible that if you keep in mind “translating to lucene optional/mandatory clauses”, it will be the right mental model to figure out why the query parsers are doing what they’re doing. Although I still can’t figure it out, especially for some of the especially weird cases where terms are dropped altogether.

  3. Bruce Rosen says:

    Many thanks for this posting. I just stumbled upon Solr’s boolean precedence peculiarity yesterday (not sure why I hadn’t seen it earlier). It was precisely your initial question: a OR b AND c. Search results always returned the b AND c part. If I tried b OR a AND c on the identical docs, I always got the a AND c docs. Using parentheses, (a OR b) AND c, did the right thing. So, I was very glad to see your posting, and will take the punchline (always use parentheses) to heart … I’ve got some code to go back and fix!
