[Check out the introduction to the Stupid Solr Tricks series if you’re just joining us.]
Solr and multiValued fields
Here’s another thing you need to understand about Solr: it doesn’t really have fields that can take multiple values.
“But Bill,” you’re saying, “sure it does. I mean, hell, it even has a ‘multiValued’ parameter.”
First off: watch your language.
Second off: are you sure?
Let’s do a quick test. Look at the following documents:
exampledocs/names.json
[
{
"id": "1",
"title": "The Monkees",
"name_text": ["Peter Tork", "Mike Nesmith",
"Micky Dolenz", "Davy Thomas Jones"]
},
{
"id": "2",
"title": "Heros of the Wild West",
"name_text": ["Buck Jones", "Davy Crockett"]
}
]
Question: what do you get when you run this query against those two documents?
ruby/names_query.rb
{
'fl' => 'score, *',
'defType' => 'dismax',
'wt' => 'csv',
'qf' => 'name_text',
'q' => 'davy jones' # Poor guy just died. So young. So short.
}
See how I threw the wt=csv in there? Check out all the query response formats if you’re interested, but really all you’ll use is standard (XML), json, or csv unless you’re rolling your own in some way.
I’ve updated ruby/browse.rb to allow a second argument of the type of output you want. You can now do ruby browse.rb jsonfile [json|csv|standard|xml]
Following along at home?
If so, let’s go ahead and index these documents and run the query.
Play along at home
cd solr_stupid_tricks
git pull origin master
git fetch --all
git checkout SST3 # I've started tagging the repo for these posts
# ignore warning about "detached HEAD"
java -jar start.jar &
cd exampledocs
./reset_and_index_json.sh names.json
cd ../ruby
ruby browse.rb names_query.rb
Here are the scores that I get:
id,title,name_text,score
2,Heros of the Wild West,"Buck Jones,Davy Crockett",0.42039964
1,The Monkees,"Peter Tork,Mike Nesmith,Micky Dolenz,Davy Thomas Jones",0.26274976
Check out that last column. The query was davy jones. Document #1 contains a name that has both those terms, but document #2 (which has both terms, but in different names) gets a higher score.
The relevance ranking seems…wrong
While it looks like we added four separate names to the name_text field in our first document, Solr doesn’t see it that way. Solr treats those four poor Monkees as if they had one long name.
Then it finds all the documents that match the query (both of our documents match) and figures out which is a better match by assigning a score.
In this case, while both documents have both query terms, the field in the second document is shorter. Which means that, essentially, a higher percentage of the terms in the field value match the given query terms. In Solr’s mind, that makes it a better match, and the shorter document shows up first.
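To make the length effect concrete, here is a rough sketch in Ruby. It assumes the classic Lucene TF-IDF similarity, where a field's length norm is 1/sqrt(number of tokens), and it uses plain whitespace splitting as a stand-in for the real analyzer. The helper names are made up for illustration; they aren't part of Solr or of the repo.

```ruby
# Rough model of the length norm, ASSUMING classic Lucene TF-IDF similarity:
#   norm = 1 / sqrt(number of tokens in the field)
# token_count and length_norm are illustrative names, not Solr APIs.

MONKEES = ['Peter Tork', 'Mike Nesmith', 'Micky Dolenz', 'Davy Thomas Jones']
HEROS   = ['Buck Jones', 'Davy Crockett']

# Every value of the multiValued field is counted as part of one long run
# of tokens; whitespace splitting stands in for the real analyzer.
def token_count(values)
  values.sum { |value| value.split.size }
end

# Shorter fields get a larger norm, and so a larger score for the same match.
def length_norm(values)
  1.0 / Math.sqrt(token_count(values))
end

puts "Monkees: #{token_count(MONKEES)} tokens, norm #{length_norm(MONKEES).round(3)}"
puts "Heros:   #{token_count(HEROS)} tokens, norm #{length_norm(HEROS).round(3)}"
```

Nine tokens against four: the shorter field gets the bigger norm, which is the direction of the score gap above. Don't expect the real scores to follow this ratio exactly, since Solr stores norms in a lossy compressed form.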
Solr doesn’t automatically give more weight to the recently-dead Monkee because internally it doesn’t care that you’re thinking of those values as four separate names. It just concatenates them together and indexes them.
This is not, for most people, expected behavior.
Phrase slop
Part of what’s going on here is that we haven’t told Solr that it should care how close together the terms are.
One way to do that is to use a phrase query by throwing quotes around the terms:
Put double-quotes around it to make it a phrase query
"q" => '"Davy Jones"'
…but that won’t find anything, because Davy and Jones aren’t right next to each other in our document.
Solr does allow a phrase query to be “sloppy”, though — basically saying that instead of being right next to each other, the terms need to be within a certain number of tokens of each other.
For that, we’ll tell Solr to search against certain fields (pf) treating the query as a phrase, and allow a little slop (ps) as well.
ruby/names_sloppy_query.rb
{
'fl' => 'score, *',
'defType' => 'dismax',
'wt' => 'csv',
'q' => 'davy jones',
'qf' => 'name_text',
'pf' => 'name_text^10', # search this field as a phrase
'ps' => '4' # allow 'phrase' to mean 'within 4 tokens of each other'
}
That gets us something more expected.
id,title,name_text,score
1,The Monkees,"Peter Tork,Mike Nesmith,Micky Dolenz,Davy Thomas Jones",0.2806283
2,Heros of the Wild West,"Buck Jones,Davy Crockett",0.029652705
Enter positionIncrementGap
OK. Now that we have the concept of “slop”, one of those mystery fieldtype parameters makes sense: positionIncrementGap. Basically, a positionIncrementGap of 1000 means: when computing slop, pretend there are 1000 tokens between the entries in a multiValued field.
A sloppy phrase search, then, will only find (and thus boost) the phrase if (a) the tokens are in the same entry for a multiValued field, and (b) your slop value is less than your positionIncrementGap.
All you have to do is use the pf and ps parameters and you’re set.
Note that this should be telling you two things:
- Always use the same positionIncrementGap for your multiValued fields
- Make it a number much larger than the maximum number of tokens you expect to ever have in a field.
Note that a large positionIncrementGap doesn’t actually put 1000 tokens in there — a large value doesn’t affect processing time or your index size or anything.
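Here's a toy model of that idea in Ruby, taking the "pretend there are N tokens between entries" description at face value. It is a simplification under stated assumptions: whitespace tokenizing, and "slop" treated as the number of positions separating two terms in either order. Lucene's real slop arithmetic is a bit more involved, and none of these function names exist in Solr.

```ruby
# Toy model of positionIncrementGap. ASSUMPTIONS: whitespace tokenizing, and
# "slop" approximated as the number of positions between two terms,
# regardless of order. All names here are hypothetical.

# Give every token a position, skipping `gap` positions between values.
def token_positions(values, gap)
  positions = []
  next_pos = 0
  values.each do |value|
    value.downcase.split.each do |token|
      positions << [token, next_pos]
      next_pos += 1
    end
    next_pos += gap # nothing is stored here; later positions just jump ahead
  end
  positions
end

# Fewest positions separating any occurrence of term a from any of term b.
# Returns nil if either term is missing.
def tokens_between(positions, a, b)
  as = positions.select { |token, _| token == a }.map(&:last)
  bs = positions.select { |token, _| token == b }.map(&:last)
  as.product(bs).map { |x, y| (x - y).abs - 1 }.min
end

def sloppy_match?(values, a, b, slop:, gap:)
  distance = tokens_between(token_positions(values, gap), a, b)
  !distance.nil? && distance <= slop
end

monkees = ['Peter Tork', 'Mike Nesmith', 'Micky Dolenz', 'Davy Thomas Jones']
heros   = ['Buck Jones', 'Davy Crockett']

puts sloppy_match?(monkees, 'davy', 'jones', slop: 4, gap: 1000) # same entry
puts sloppy_match?(heros,   'davy', 'jones', slop: 4, gap: 1000) # different entries
puts sloppy_match?(heros,   'davy', 'jones', slop: 4, gap: 0)    # no gap at all
```

With a gap of 1000, only the Monkees document counts as a sloppy match. Drop the gap to 0 and "Buck Jones, Davy Crockett" matches too, which is exactly the cross-entry leak the gap is there to prevent.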
But I’m already using the pf parameter!
Slop is great when you want it. But I don’t always want to use slop. Slop of 4 makes the phrase “Sex in the City” be treated exactly the same as “In the Sex City”. If someone puts in an exact title, I want to reward them for that query by floating the exact match to the top, and slop prevents me from doing so.
[Foreshadowing: We'll talk about exact-ish matches in a few days.]
OK, so we can’t just appropriate the pf/ps parameters and push the slop value up all the time: that cripples our ability to create the query boost structure we want.
Query slop
So, dismax (and its cousin edismax) have an analogous parameter that affects only phrases within the normal query: qs.
qs is a dismax param that affects query slop — how much slop to allow in phrases within the query, much like the ps param.
The query
A three-token query
'q' => 'Bill "The Weasel" Dueber'
…has three tokens, the second of which (“The Weasel”) is a phrase. It’s that phrase token that is affected by query slop.
OK. So it affects only the phrases in the normal query. But…suppose we just force the entire query to be one big phrase? That’ll get us somewhere!
We just need to do the following:
- Create a boost query that uses the same fields as the regular query
- …but treats all the query terms as one big phrase
- …and give it a query slop of one less than the positionIncrementGap in our field type definition (in my case, 999)
Package it up
OK, so here’s what we’re going to do. You can just take this basic idea and build it into your own queries in your application code. Try it. You might like it. Play around with what fields are affected, how much weight to give it, etc.
But heck, we’ve gone this far. Let’s encode it into the Solr configuration file solrconfig.xml itself as a custom request handler.
We’re going to extend our edismaxplus requestHandler from last time, but we’ll add an extra boost query that reflects this new “prefer documents where all the tokens appear in the same ‘line’ of a multiValued field” attitude.
solr/conf/solrconfig.xml
<requestHandler name="/edismaxplus" class="solr.SearchHandler">
<lst name="defaults">
<str name="rows">10</str>
<str name="fl">*,score</str>
<str name="echoParams">explicit</str>
<str name="q">
_query_:"{!edismax qf=$fields mm=$mymm
v=$qwords bq=$boostForAll}"</str>
<str name='mymm'>0%</str>
<str name="qwordsphrase">"JunkThatWillNEverShowUpInAMillionFreakinYears"</str>
<str name='boostForAll'>
_query_:"{!edismax qf=$fields
mm='100%'
v=$qwords }"^5 OR
_query_:"{!dismax qf=$fields
mm='100%'
v=$qwordsphrase
qs='999'}"^5
</str>
</lst>
</requestHandler>
We now do a few new things:
- (Line 15) Add a second clause to the boost query that uses the same fields provided for the regular query (note the boolean OR between the two localparams queries that comprise this boost query)
- (Line 17) Ask for another user-provided value: qwordsphrase, which your application-level stuff should set to the set of all the regular query terms, but as a single phrase. Basically, strip out all the double-quotes, then put the whole thing in double quotes. In ruby: qwordsphrase = '"' + qwords.gsub(/"/, '') + '"'
- (Line 10) Provide a default value for the new qwordsphrase that won’t ever show up in a real query (empty string won’t work; I tried it and it throws an error). So, if the application doesn’t provide qwordsphrase, no harm is done: the search regresses to what we had last time.
- (Line 18) Use a qs (query slop) of 999 in the new boost clause acting against qwordsphrase. That value is one less than the positionIncrementGap of 1000, making sure that we don’t cross multiValue boundaries.
Note: If you wanted to, you could make this a filter query (fq) instead of a boost query to only allow documents that meet this criterion.
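The "strip the quotes, then wrap the whole thing in quotes" step for qwordsphrase might look something like this as a small helper. The method name phraseify is my own invention, and returning nil for an empty query is a design assumption: the idea is that the application then leaves the parameter off and lets the handler's junk default kick in.

```ruby
# Turn the user's query words into one big phrase for qwordsphrase.
# `phraseify` is a hypothetical helper name, not part of Solr or the repo.
def phraseify(qwords)
  # Drop any double quotes the user typed, then tidy up the whitespace.
  stripped = qwords.delete('"').strip.squeeze(' ')
  # Nothing left? Return nil so the caller can omit the parameter entirely
  # (an empty phrase makes the handler throw an error).
  return nil if stripped.empty?

  %("#{stripped}")
end

puts phraseify('davy jones')
puts phraseify('Bill "The Weasel" Dueber')
```

Note that the inner phrase "The Weasel" loses its quotes in the second call; only the outer pair survives.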
Let’s try it out!
Once again, if you did a git pull origin master you’ve got this up and running already — the updated requestHandler source is already in solr/conf/solrconfig.xml.
We first construct the query just like we did last week, without the qwordsphrase argument:
http://localhost:8983/solr/edismaxplus/?qwords=davy jones&fields=name_text
You’ll see Davy Crockett and friend appear as the first item.
But when you add the phraseified query, you’ll see the boost we’ve been talking about this whole post and get something more expected.
http://localhost:8983/solr/edismaxplus/?qwords=davy jones&fields=name_text&qwordsphrase="Davy Jones"
The Monkees are again on top! Party like it’s 1967!
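If you're building these URLs in code rather than pasting them into a browser, the spaces and double quotes need to be URL-encoded. A minimal sketch using Ruby's standard uri library follows; the method name and keyword arguments are hypothetical, and the base URL is just the local example server from this post.

```ruby
require 'uri'

# The local example server used in this post.
BASE = 'http://localhost:8983/solr/edismaxplus/'

# Build a request URL for the edismaxplus handler.
# `edismaxplus_url` is an illustrative name, not part of the repo.
def edismaxplus_url(qwords:, fields:, qwordsphrase: nil, base: BASE)
  params = { 'qwords' => qwords, 'fields' => fields }
  # Leave qwordsphrase off when there isn't one, so the handler's default applies.
  params['qwordsphrase'] = qwordsphrase if qwordsphrase
  # encode_www_form turns spaces into + and double quotes into %22.
  "#{base}?#{URI.encode_www_form(params)}"
end

puts edismaxplus_url(qwords: 'davy jones', fields: 'name_text')
puts edismaxplus_url(qwords: 'davy jones', fields: 'name_text',
                     qwordsphrase: '"Davy Jones"')
```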
Where it breaks down
If you actually have a phrase as one of your query terms, it will no longer be treated as a phrase during the boost because we’re getting rid of all the double-quotes.
And, of course, if you’ve got gobs of full-text and include your fulltext field, setting query slop to 999 isn’t just a cute trick, it’s a cute trick that will melt your servers to slag and still not do what you want it to do.
What have we learned?
- Solr doesn’t really separate multiple values from each other in a multiValued field
- Phrase slop (ps) and query slop (qs) can be used to allow “phrase” to mean “a bunch of tokens within X spots of each other”
- I’m A Believer is the best song Neil Diamond ever wrote.
I had a class in Library School in the 1990s on operations research in the library. There was some interesting stuff on adjusting loan periods so that some number of books was always checked out in order to make sure there was enough shelf space.
A bunch of researchers applied Operations Research (queueing theory, Monte Carlo simulations, and more math stuff) to the relationship between library loan periods/policies and shelving needs (as well as user satisfaction etc).
Here is an excerpt from some of Buckland’s research in the 1960s. It doesn’t actually talk about deliberately making the loan period longer in order to reduce the need for shelf space though. I’m pretty sure either his book or subsequent research dealt with this issue however. On the other hand, this research on users’ behavior from 1968 sounds relevant for today.
http://people.ischool.berkeley.edu/~buckland/lancasterlru.pdf
During 1967 and 1968 a series of measurements were undertaken which showed
that library users could find the books they were looking for about 6 times out of 10;
that the major cause of nonavailability was that the book was out on loan to someone else;
that borrowed books tended to remain out for the full length of the loan period;
that in practice a loan period was determined not by written policies but by when overdue fines began;
that disappointed would-be borrowers did not often avail themselves of the procedures for recalling books back from loan;
and that in-library book use tended to have a stable relationship to circulation in any given library (Hindle and Buckland, 1978).
A Monte Carlo simulation was used to avoid the limitations of queuing theory. A flow chart of borrowing activities was programmed so that a computer could simulate the sequence of users seeking a single book, its repeatedly being borrowed and returned, and how often a copy was not available when sought. The simulation was flexible enough to show the effects of changes in the pattern and level of demand, in the length of the loan period, and/or of changing the number of copies of that book.
For more details see Buckland’s book on this “Book availability and the library user”: http://mirlyn.lib.umich.edu/Record/000014253
Tom