Archives: March 2012

[Check out the introduction to the Stupid Solr Tricks series if you’re just joining us.]

Exact matching in Solr is easy. Use the default string type: all it does is, essentially, exact phrase matching. string is a great type for faceted values, where the only way we expect to search the index is via text pulled from the index itself. Query the index to get a value: use that value to re-query the index. Simple and self-contained.

But much of the time, we don’t want exact matching. We want exactish matching. You know, where things are exactly the same except. Except for case, or punctuation, or how much whitespace is between tokens. Maybe do some unicode folding, or stemming.

Essentially, we want to reward users (via high relevancy) for getting really close. If someone types in a full title, but misses a colon, well, let’s go ahead and assume they want that particular item.
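
To make “exactish” a little more concrete, here is a rough Ruby sketch of the kind of normalization we’re after. It’s only an illustration: the helper name exactish_key is made up for this example, and in practice the work happens inside Solr’s analysis chain rather than in application code.

```ruby
# Hypothetical helper (not part of Solr): reduce a string to a form where
# case, punctuation, and extra whitespace no longer matter.
def exactish_key(text)
  text.downcase
      .gsub(/[[:punct:]]/, ' ') # treat punctuation as a separator
      .split                    # split on any run of whitespace
      .join(' ')
end

typed  = 'the monkees pleasant valley never'
actual = 'The Monkees:  Pleasant Valley Never'

# Both strings reduce to the same key, so we'd call this an exactish match.
puts exactish_key(typed) == exactish_key(actual)
```

Two strings that differ only by case, a colon, or stray spaces end up with the same key, which is the behavior we want the anchored field types to reward.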

Exactish matching vs phrase matching

Phrase matching in Solr does a great job, but fails those of us generating super-complex queries where we want to provide awesome service for those users doing known-item queries. If someone puts in the exact(ish) title, or the exact(ish) subject, well, those items should float to the top.

Solr’s default phrase matching (via, say, the pf param in dismax or just putting your query in quotes) doesn’t differentiate between a phrase that matches the whole target string and only part of that target string. For this, we’ll need a decent text fieldtype and a way to “anchor” the search to both ends of the target string.

Our goals

We’re shooting for:

  • A useful text type that we can use all over the place
  • A phrase match against that field that will match any portion of the target text. Solr already does this — that’s a normal Solr phrase search.
  • A “fully anchored” text type that will only phrase match if the query string exactishly-matches the whole field. We’ll phrase-search on this field and boost it way up.
  • And, what the heck, a left-anchored version that will exactish match a phrase only at the start of a field. We’ll boost this one up a bit less.

Follow along at home

Go ahead and clone the github repo I’ve been using if you haven’t already and let’s dig in.

  cd solr_stupid_tricks
  git pull origin master
  git fetch --all
  git checkout SST4
  java -jar start.jar &

There are some additions to the schema.xml file; let’s take a look!

Step 1: get a decent text type

The recent nightly of Solr 3.x we’re using has a great tokenizer in ICUTokenizerFactory, which does “the right thing” across a whole host of languages.

  <fieldtype name="text" class="solr.TextField" positionIncrementGap="1000">
    <analyzer>
      <tokenizer class="solr.ICUTokenizerFactory"/>
      <filter class="solr.ICUFoldingFilterFactory"/>
      <filter class="solr.SynonymFilterFactory"
              synonyms="syn.txt" ignoreCase="true" expand="false"/>
      <!-- <filter class="solr.WordDelimiterFilterFactory"
                   generateWordParts="1" generateNumberParts="1"
                   catenateWords="1" catenateNumbers="1" catenateAll="0"/>
      -->
      <filter class="solr.CJKWidthFilterFactory"/>
      <filter class="solr.CJKBigramFilterFactory"/>
    </analyzer>
  </fieldtype>

Let’s take it bit by bit:

  • Obviously, start with the ICUTokenizer with a large positionIncrementGap so we can do some of the tricks we talked about last time
  • Next, we get one-stop shopping with the ICUFoldingFilterFactory. It provides all of the following:
    • NFKC normalization (precomposing),
    • Unicode case folding (i.e., lowercasing)
    • search term folding (removing accents, etc).
  • Push in synonyms if you have any
  • Uncomment the WordDelimiterFilterFactory if you want to. I’m going to try to avoid it, since it messes with the number of tokens midstream and I worry about the effect on dismax and its mm parameter as explained so excellently by Jonathan Rochkind
  • Dealing with CJK (Chinese, Japanese, Korean) is hard. The CJK filters process those languages and provide overlapping bigrams so searching isn’t (I’m told) quite as painful. (I really, really recommend the above link for a great overview by Tom Burton-West).

Step 2: Set up parallel text types that anchor phrase matches to one or both ends

We’re going to use something new: a charFilter. This differs from a normal filter in that it affects the input string before tokenization.

Here’s the trick. We’re going to add anchoring text (I chose just ‘AAAA’ at the front and ‘ZZZZ’ at the end) to the normal text type, just by adding a simple charFilter.

  <fieldtype name="text_lr" class="solr.TextField" positionIncrementGap="1000">
    <analyzer>
      <charFilter class="solr.PatternReplaceCharFilterFactory"
                  pattern="^(.*)$" replacement="AAAA $1 ZZZZ" />
      <tokenizer class="solr.ICUTokenizerFactory"/>
      <filter class="solr.ICUFoldingFilterFactory"/>
      <filter class="solr.SynonymFilterFactory"
              synonyms="syn.txt"
              ignoreCase="true" expand="false"/>
      <filter class="solr.CJKWidthFilterFactory"/>
      <filter class="solr.CJKBigramFilterFactory"/>
    </analyzer>
  </fieldtype>

Note that this charFilter actually adds two new tokens (‘AAAA’ and ‘ZZZZ’) to your token stream on both index and query. How does this help us?

Let’s look at indexing Mister Blue Sky in a normal text field. A normal solr phrase query q="Blue Sky" will match on that value, because the query phrase is fully contained in the indexed phrase.

But what happens if we index into a text_lr field?

  • Indexing Mister Blue Sky becomes aaaa mister blue sky zzzz
  • Search terms blue sky becomes aaaa blue sky zzzz
  • Phrase searching will then compare the two transformed values using normal Solr rules, find that the latter is not fully contained in the former as a phrase, and give up.
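
If you’d like to convince yourself without firing up Solr, here’s a small Ruby simulation of the idea. It is a sketch under simplifying assumptions: the tokenizing below is a crude lowercase-and-split stand-in for the real ICU analysis chain, and the function names are invented for this example.

```ruby
# Crude stand-in for the plain "text" analysis: lowercase and split on
# anything that isn't a letter or digit.
def plain_tokens(text)
  text.downcase.scan(/[[:alnum:]]+/)
end

# Crude stand-in for "text_lr": same thing, but with the anchors added first,
# the way the charFilter adds them before tokenization.
def anchored_tokens(text)
  plain_tokens("AAAA #{text} ZZZZ")
end

# A slop-free phrase match: the query tokens must appear as one contiguous run.
def phrase_match?(indexed, query)
  indexed.each_cons(query.length).any? { |window| window == query }
end

title = 'Mister Blue Sky'

# Plain field: a partial phrase matches.
puts phrase_match?(plain_tokens(title), plain_tokens('Blue Sky'))
# Anchored field: the partial phrase no longer matches...
puts phrase_match?(anchored_tokens(title), anchored_tokens('Blue Sky'))
# ...but the whole title, give or take case, still does.
puts phrase_match?(anchored_tokens(title), anchored_tokens('mister blue sky'))
```

The anchors turn “is this phrase somewhere in the field?” into “is this phrase the whole field?”, which is exactly the distinction plain phrase matching can’t make.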

Be careful, though. That ‘aaaa’ and ‘zzzz’ are there just as if you’d typed them in. Thus every indexed value has the tokens ‘aaaa’ and ‘zzzz’, and every query will, in effect, include a query for ‘aaaa’ or ‘zzzz’ (depending on your mm settings).

That means that any non-phrase query will match every field that uses this fieldtype, and it will also mess with token counts with respect to your mm parameter. For those reasons, only ever use anchored fieldtypes for phrase queries when you want exactish matches.

By adding only one of ‘AAAA’ or ‘ZZZZ’, we can have left-anchored and right-anchored searches as well. See the schema.xml for these definitions.

Try it out!

Let’s take a small set of new documents:

  [
    {
      "id": "1",
      "title": "The Monkees: Pleasant Valley Never"
    },
    {
      "id": "2",
      "title": "The Monkees"
    },
    {
      "id": "3",
      "title": "Meet the Monkees"
    },
    {
      "id": "4",
      "title": "Corportate boy bands through the ages"
    }
  ]

We have copyFields set up to copy the title field to both a fully-anchored field (title_exact) and a left-anchored field (title_l).

  <copyField source="title" dest="title_exact"/>
  <copyField source="title" dest="title_l"/>

If you’re following at home, clear out your solr and index them:

  cd exampledocs
  ./reset_and_index_json.sh exactish.json

We’ll now run three dismax queries, all of which use the search terms the monkees. Watch what happens to the score as we change things.

  • First, qf=title, pf=title^2. This will match the three Monkees documents, and then boost all of them because they all contain the phrase “the monkees” in the title.
  • Second, qf=title, pf=title_exact^10 title^2. These will match the Monkees documents, and then give a huge boost to the one with the exact match.
  • Finally, qf=title, pf=title_exact^10 title_l^5 title^2. There you’ll see the score for the exact title match go way up (relatively speaking, of course), and document 1 go up quite a bit (because it begins with the phrase “The Monkees”).
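
For reference, the three parameter sets above could be written as Ruby hashes in the same style as the other query files in this series. This is a sketch of that shape only, with a made-up function name; the actual exactish_query.rb in the repo may well look different.

```ruby
# Assumed shape only: builds the three dismax parameter sets described above.
# The function name and the 'fl' value are illustrative choices.
def exactish_param_sets(user_query)
  base = {
    'defType' => 'dismax',
    'fl'      => 'score, *',
    'q'       => user_query,
    'qf'      => 'title'
  }
  phrase_fields = [
    'title^2',                          # plain phrase boost
    'title_exact^10 title^2',           # plus the fully anchored field
    'title_exact^10 title_l^5 title^2'  # plus the left-anchored field
  ]
  phrase_fields.map { |pf| base.merge('pf' => pf) }
end

exactish_param_sets('the monkees').each { |params| puts params['pf'] }
```

Note that the anchored fields only ever appear in pf, never in qf, so they are only ever searched as phrases.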

You can run all three queries as:

  cd ruby
  ruby browse.rb exactish_query.rb
  # or ruby browse.rb exactish_query.rb json|xml|csv to get different output type

[BTW, browse.rb will now take an array of queries to run in a single file.]

Tah Dah! You’ve successfully boosted the exactish match, and the left-anchored exactish match. Your known-item-searchers will thank you.

You may want to take a look at exactish_query.rb to see what’s going on.

To sum up

  • Your schema.xml now contains a decent text type and three variants for anchoring phrase searches left, right, and full (exactish)
  • The anchored text fields should NOT NOT NOT be searched against by anything other than a single phrase (which means they’re very useful in the pf param of a dismax search). A non-phrase search will trivially match every single document, so, you know, avoid that.
  • You now have a set of tools (field types, copyField directives, phrase search) that can be used to provide higher boosts to exactish matches and left-anchored exactish phrase matches.

6 Responses to “Boosting on Exactish (anchored) phrase matching in Solr: (SST #4)”

  1. Incredibly helpful, thanks.

  2. [...] Boosting on Exactish (anchored) phrase matching [...]

  3. ntucker says:

    Re: “For those reasons, only ever use anchored fieldtypes for phrase queries when you want exactish matches.”

    Are you suggesting that the application code analyze the user input and only search against the anchored fields if it contains phrases?

  4. Bill says:

    No, sorry I was unclear. I’m saying you should munge the users’ input to be a phrase query (i.e., remove all the double-quotes and then wrap the whole thing in double-quotes) or, more trivially, only ever include the anchored fields in a pf dismax parameter, which does the work for you.

  5. ntucker says:

    I’ve been pondering this technique and there’s something that’s been nagging me about it which may just be a schema problem on a more fundamental level. I’ve used the what I thought was a fairly standard “catch-all” text field copyField definition: . ‘text’ is also my default query field. However, if I were to use this ‘anchored text’ technique, wouldn’t I need to somehow prevent those from also being copied into my ‘text’ field? Is this catch-all field a bad idea? Is there a less problematic way to specify it?

  6. Aparna says:

    We are trying to get the results which starts with the given phrase. But even if we give great boost to title_l field the document 1 is not coming up. How do you actually get results that starts with the keyword on top?


[Check out the introduction to the Stupid Solr Tricks series if you’re just joining us.]

Solr and multiValued fields

Here’s another thing you need to understand about Solr: it doesn’t really have fields that can take multiple values.

“But Bill,” you’re saying, “sure it does. I mean, hell, it even has a ‘multiValued’ parameter.”

First off: watch your language.

Second off: are you sure?

Let’s do a quick test. Look at the following documents:

exampledocs/names.json
  [
    {
      "id": "1",
      "title": "The Monkees",
      "name_text": ["Peter Tork", "Mike Nesmith",
                    "Micky Dolenz", "Davy Thomas Jones"]
    },
    {
      "id": "2",
      "title": "Heros of the Wild West",
      "name_text": ["Buck Jones", "Davy Crockett"]
    }
  ]

Question: what do you get when you run this query against those two documents?

ruby/names_query.rb
  {
    'fl' => 'score, *',
    'defType' => 'dismax',
    'wt' => 'csv',
    'qf' => 'name_text',
    'q' => 'davy jones'   # Poor guy just died. So young. So short.
  }

See how I threw the wt=csv in there? Check out all the query response formats if you’re interested, but really all you’ll use is standard (XML), json, or csv unless you’re rolling your own in some way.

I’ve updated ruby/browse.rb to allow a second argument of the type of output you want. You can now do ruby browse.rb jsonfile [json|csv|standard|xml]

Following along at home?

If so, let’s go ahead and index these documents and run the query.

Play along at home
  cd solr_stupid_tricks
  git pull origin master
  git fetch --all
  git checkout SST3 # I've started tagging the repo for these posts
  # ignore warning about "detached HEAD"
  java -jar start.jar &
  cd exampledocs
  ./reset_and_index_json.sh names.json
  cd ../ruby
  ruby browse.rb names_query.rb

Here are the scores that I get:

Return from Solr
  id,title,name_text,score
  2,Heros of the Wild West,"Buck Jones,Davy Crockett",0.42039964
  1,The Monkees,"Peter Tork,Mike Nesmith,Micky Dolenz,Davy Thomas Jones",0.26274976

Check out that last column. The query was davy jones. Document #1 contains a name that has both those terms, but document #2 (which has both terms, but in different names) gets a higher score.

The relevance ranking seems…wrong

While it looks like we added four separate names to the name_text field in our first document, Solr doesn’t see it that way. Solr treats those four poor Monkees as if they had one long name.

Then it finds all the documents that match the query (both of our documents match) and figures out which is a better match by assigning a score.

In this case, while both documents have both query terms, the field in the second document is shorter. Which means that, essentially, a higher percentage of the terms in the field value match the given query terms. In Solr’s mind, that makes it a better match, and the shorter document shows up first.
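
To put a rough number on “shorter is better”: in the classic Lucene similarity that Solr 3.x uses by default, the length factor for a field is 1/sqrt(number of terms in the field). The Ruby below is a back-of-the-envelope sketch of just that one factor. It splits on whitespace instead of running the real analyzer, and it ignores every other part of the score, as well as the lossy way norms are stored in the index.

```ruby
# Length factor only: 1 / sqrt(total number of terms across all values).
# Whitespace splitting is a stand-in for real tokenization.
def length_norm(values)
  term_count = values.sum { |value| value.split.length }
  1.0 / Math.sqrt(term_count)
end

monkees = ['Peter Tork', 'Mike Nesmith', 'Micky Dolenz', 'Davy Thomas Jones']
heroes  = ['Buck Jones', 'Davy Crockett']

puts length_norm(monkees) # 9 terms in the field
puts length_norm(heroes)  # 4 terms in the field
```

Four names count as one nine-term field, so the Monkees document starts with a smaller length factor than the two-name, four-term field, even though only the Monkees document has both query terms in a single name.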

Solr doesn’t automatically give more weight to the recently-dead Monkee because internally it doesn’t care that you’re thinking of those values as four separate names. It just concatenates them together and indexes them.

This is not, for most people, expected behavior.

Phrase slop

Part of what’s going on here is that we haven’t told Solr that it should care how close together the terms are.

One way to do that is to use a phrase query by throwing quotes around the terms

Put double-quotes around it to make it a phrase query
  "q" => '"Davy Jones"'

…but that won’t find anything, because Davy and Jones aren’t right next to each other in our document.

Solr does allow a phrase query to be “sloppy”, though — basically saying that instead of being right next to each other, the terms need to be within a certain number of tokens of each other.

For that, we’ll tell solr to search against certain fields (pf) treating the query as a phrase, and allow a little slop (ps) as well.

ruby/names_sloppy_query.rb
  {
    'fl' => 'score, *',
    'defType' => 'dismax',
    'wt' => 'csv',
    'q' => 'davy jones',
    'qf' => 'name_text',
    'pf' => 'name_text^10', # search this field as a phrase
    'ps' => '4' # allow 'phrase' to mean 'within 4 tokens of each other'
  }

That gets us something more expected.

  id,title,name_text,score
  1,The Monkees,"Peter Tork,Mike Nesmith,Micky Dolenz,Davy Thomas Jones",0.2806283
  2,Heros of the Wild West,"Buck Jones,Davy Crockett",0.029652705

Enter positionIncrementGap

OK. Now that we have the concept of “slop”, one of those mystery fieldtype parameters makes sense: positionIncrementGap. Basically, a positionIncrementGap of 1000 means “when computing slop, pretend there are 1000 tokens between the entries in a multiValued field.”

A sloppy phrase search, then, will only find (and thus boost) the phrase if (a) the tokens are in the same entry for a multiValued field, and (b) your slop value is less than your positionIncrementGap.
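
Here’s a toy Ruby model of how the gap and the slop interact. It is deliberately simplified: positions are handed out by a plain counter, tokens come from whitespace splitting, and the slop check is just “how many positions apart are the two terms”, which is cruder than Lucene’s real sloppy-phrase logic. The function names are made up for this sketch.

```ruby
# Hand out token positions across the values of a multiValued field,
# skipping `gap` extra positions between one value and the next.
def positions(values, gap)
  pos = 0
  table = Hash.new { |hash, key| hash[key] = [] }
  values.each do |value|
    value.downcase.split.each do |token|
      table[token] << pos
      pos += 1
    end
    pos += gap
  end
  table
end

# Simplified two-term slop check: the number of positions between the
# terms (in either order) must not exceed the slop.
def within_slop?(table, first, second, slop)
  table[first].product(table[second]).any? { |a, b| (a - b).abs - 1 <= slop }
end

heroes  = ['Buck Jones', 'Davy Crockett']
monkees = ['Peter Tork', 'Mike Nesmith', 'Micky Dolenz', 'Davy Thomas Jones']

puts within_slop?(positions(heroes, 0), 'davy', 'jones', 4)      # no gap: values bleed together
puts within_slop?(positions(heroes, 1000), 'davy', 'jones', 4)   # big gap: values stay apart
puts within_slop?(positions(monkees, 1000), 'davy', 'jones', 4)  # same value: still found
```

With no gap, “Jones” and “Davy” in the second document sit right next to each other even though they belong to different names. With a gap of 1000, only terms inside the same value can ever be within a sane slop of each other.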

All you have to do is use the pf and ps parameters and you’re set.

Note that this should be telling you two things:

  • Always use the same positionIncrementGap for your multiValued fields
  • Make it a number much larger than the maximum number of tokens you expect to ever have in a field.

Note that a large positionIncrementGap doesn’t actually put 1000 tokens in there — a large value doesn’t affect processing time or your index size or anything.

But I’m already using the pf parameter!

Slop is great when you want it. But I don’t always want to use slop. Slop of 4 makes the phrase “Sex in the City” be treated exactly the same as “In the Sex City”. If someone puts in an exact title, I want to reward them for that query by floating the exact match to the top, and slop prevents me from doing so.

[Foreshadowing: We'll talk about exact-ish matches in a few days.]

OK, so we can’t just appropriate the pf/ps parameters and push the slop value up all the time — that cripples our ability to create the query boost structure we want.

Query slop

So, dismax (and its cousin edismax) have an analogous parameter that affects only phrases within the normal query: qs.

qs is a dismax param that affects query slop — how much slop to allow in phrases within the query, much like the ps param.

The query

A three-token query
  'q' => 'Bill "The Weasel" Dueber'

…has three tokens, the second of which (“The Weasel”) is a phrase. It’s that phrase token that is affected by query slop.

OK. So it affects only the phrases in the normal query. But…suppose we just force the entire query to be one big phrase? That’ll get us somewhere!

We just need to do the following:

  • Create a boost query that uses the same fields as the regular query
  • …but treats all the query terms as one big phrase
  • …and give it a query slop of one less than the positionIncrementGap in our field type definition (in my case, 999)

Package it up

OK, so here’s what we’re going to do. You can just take this basic idea and build it into your own queries in your application code. Try it. You might like it. Play around with what fields are affected, how much weight to give it, etc.

But heck, we’ve gone this far. Let’s encode it into the Solr configuration file solrconfig.xml itself as a custom request handler.

We’re going to extend our edismaxplus requestHandler from last time, but we’ll add an extra boost query that reflects this new “prefer documents where all the tokens appear in the same ‘line’ of a multiValued field” attitude.

solr/conf/solrconfig.xml
  1.   <requestHandler name="/edismaxplus" class="solr.SearchHandler">
  2.     <lst name="defaults">
  3.       <str name="rows">10</str>
  4.       <str name="fl">*,score</str>
  5.       <str name="echoParams">explicit</str>
  6.       <str name="q">
  7.         _query_:"{!edismax qf=$fields mm=$mymm
  8.                             v=$qwords bq=$boostForAll}"</str>
  9.       <str name='mymm'>0%</str>
  10.       <str name="qwordsphrase">"JunkThatWillNEverShowUpInAMillionFreakinYears"</str>
  11.       <str name='boostForAll'>
  12.         _query_:"{!edismax qf=$fields
  13.                            mm='100%'
  14.                            v=$qwords }"^5 OR
  15.         _query_:"{!dismax  qf=$fields
  16.                            mm='100%'
  17.                            v=$qwordsphrase
  18.                            qs='999'}"^5
  19.       </str>
  20.     </lst>
  21.   </requestHandler>

We now do a few new things:

  • (Line 15) Add a second clause to the boost query that uses the same fields provided for the regular query (note the boolean OR between the two localparams queries that comprise this boost query)
  • (Line 17) Ask for another user-provided value: qwordsphrase, which your application-level stuff should set to the set of all the regular query terms, but as a single phrase. Basically, strip out all the double-quotes, then put the whole thing in double quotes. In ruby: qwordsphrase = '"' + qwords.gsub(/"/, '') + '"'
  • (Line 10) Provide a default value for the new qwordsphrase that won’t ever show up in a real query (empty string won’t work; I tried it and it throws an error). So, if the application doesn’t provide qwordsphrase, no harm is done — the search regresses to what we had last time.
  • (Line 18) Use a qs (query slop) of 999 in the new boost clause acting against qwordsphrase. That value is one less than the positionIncrementGap of 1000, making sure that we don’t cross multiValue boundaries.
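
On the application side, the “phraseify” step is tiny. Here’s one way it might look in Ruby; the helper name phraseify and the way the params hash is assembled are just illustrative choices, not something the request handler requires.

```ruby
# Hypothetical helper: turn the user's raw query into one big phrase by
# removing every double-quote and wrapping the result in a single pair.
def phraseify(qwords)
  '"' + qwords.delete('"') + '"'
end

qwords = 'Bill "The Weasel" Dueber'

params = {
  'qwords'       => qwords,            # the query as the user typed it
  'qwordsphrase' => phraseify(qwords), # the same terms as a single phrase
  'fields'       => 'name_text'
}

puts params['qwordsphrase']
```

The original quoting is deliberately thrown away for the boost clause: the inner phrase “The Weasel” just becomes two more tokens in the one big phrase.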

Note: If you wanted to, you could make this a filter query (fq) instead of a boost query to only allow documents that meet this criterion.

Let’s try it out!

Once again, if you did a git pull origin master you’ve got this up and running already — the updated requestHandler source is already in solr/conf/solrconfig.xml.

We first construct the query just like we did last week, without the qwordsphrase argument:

http://localhost:8983/solr/edismaxplus/?qwords=davy jones&fields=name_text

You’ll see Davy Crockett and friend appear as the first item.

But when you add the phraseified query, you’ll see the boost we’ve been talking about this whole post and get something more expected.

http://localhost:8983/solr/edismaxplus/?qwords=davy jones&fields=name_text&qwordsphrase="Davy Jones"

The Monkees are again on top! Party like it’s 1967!
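
The URLs above are written out unescaped so they’re readable. If you’re building them in code, something like the following Ruby sketch takes care of the escaping for the spaces and double-quotes. It assumes the local Solr address used throughout this post; the function name is made up.

```ruby
require 'uri'

# Hypothetical helper: build a properly escaped URL for the edismaxplus
# handler from a hash of parameters.
def edismaxplus_url(params)
  'http://localhost:8983/solr/edismaxplus/?' + URI.encode_www_form(params)
end

url = edismaxplus_url(
  'qwords'       => 'davy jones',
  'fields'       => 'name_text',
  'qwordsphrase' => '"davy jones"'
)

puts url
```

URI.encode_www_form turns the spaces into + and the double-quotes into %22, so the phrase survives the trip to Solr intact.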

Where it breaks down

If you actually have a phrase as one of your query terms, it will no longer be treated as a phrase during the boost because we’re getting rid of all the double-quotes.

And, of course, if you’ve got gobs of full-text and include your fulltext field, setting query slop to 999 isn’t just a cute trick, it’s a cute trick that will melt your servers to slag and still not do what you want it to do.

What have we learned?

  • Solr doesn’t really separate multiple values from each other in a multiValued field
  • Phrase slop (ps) and query slop (qs) can be used to allow “phrase” to mean “a bunch of tokens within X spots of each other”
  • I’m A Believer is the best song Neil Diamond ever wrote.

3 Responses to “Requiring/Preferring searches that don’t span multiple values (SST #3)”

  1. [...] Requiring/Preferring searches that don’t span multiple values [...]

  2. Very clever, nice.

    I’m missing why your ‘boostForAll’ does an ‘OR’ of TWO nested queries… one dismax and one edismax? Was that explained somewhere, I’m just missing it?

    Oh wait, okay, the first is from ‘last time’, and now you OR in another one…. I’m gonna have to think on this.


[Note: this isn't so much a Stupid Solr Trick as a Thing You Should Probably Know; consider it required reading for the next SST. If you're just joining us, check out the introduction to the Stupid Solr Tricks series]

What the heck is a localparams query?

A garden-variety Solr query URL looks something like this:

  http://localhost:8983/solr/select?
    defType=dismax
    &qf=name^2 place^1
    &q=Dueber

Which is fine, as far as it goes. But it’s easy to run into the limits of the standard query plugins (e.g., Dismax).

Say, for example, you want something like this:

  title:Constructivism AND author:Dueber

And furthermore, you have multiple underlying fields (title1, title2, title3, author1, author2).

The naïve approach would be to just do this:

  defType=dismax
  &qf=title1 title2 title3 author1 author2
  &q=Constructivism Dueber

But you can’t construct a dismax query with the boolean AND. You can with edismax, but even then you’ve got no way of telling (e)dismax that Constructivism must be found in the title fields, and Dueber must be found in the author fields. Dismax doesn’t do that.

Solution: Build a query of queries

The solution is to build a query made up of fully-encapsulated sub-queries. A localparams query has two forms (note that, of course, you’d need to URL-Escape the values):

  _query_:"{!dismax qf='field^2 otherfield^4'}my search terms"

or

  _query_:"{!dismax qf='field^2 otherfield^4' v=$q1}"&q1=my search terms

I far prefer the second form (which uses a second URL parameter q1 instead of sticking the search right in there), because I don’t have to worry about escaping double-quotes in the query terms (as you would if there’s a phrase as part of the query).

Once you’ve got these things, you can combine them with booleans.

    q=_query_:"{!dismax qf='title1 title2 title3' v=$q1}" AND
      _query_:"{!dismax qf='author1 author2' v=$q2}"
  &q1=Constructivism
  &q2=Dueber

[Note: be careful with solr booleans!!!]

You can add any local parameters you need (for dismax, stuff like mm, qs, pf, and ps) and you can use any query parser you want by changing what comes after the bang (e.g., {!lucene ...} or {!edismax...}).

In this way, you can build up arbitrarily complex queries using any available query parsers in combination with each other. Very powerful.
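
If you’re assembling these in application code, a tiny helper keeps the quoting straight. The Ruby below is a sketch: the subquery helper is invented for this example, and the /solr/select address is the same one used at the top of this post.

```ruby
require 'uri'

# Hypothetical helper: build one sub-query that reads its terms from another
# URL parameter (the v=$name form), so the terms never need inline escaping.
def subquery(parser, fields, value_param)
  %(_query_:"{!#{parser} qf='#{fields}' v=$#{value_param}}")
end

params = {
  'q'  => [subquery('dismax', 'title1 title2 title3', 'q1'),
           subquery('dismax', 'author1 author2', 'q2')].join(' AND '),
  'q1' => 'Constructivism',
  'q2' => 'Dueber'
}

puts params['q']
puts 'http://localhost:8983/solr/select?' + URI.encode_www_form(params)
```

Because the user’s terms live in q1 and q2, they can contain double-quotes of their own without breaking the quoting of the outer query.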

An example: boost records that contain all terms

Just about everything in a localparams query can be pulled out in the way I pulled out the search terms above. Here’s a fairly-complex example (which, let’s be honest, would be a lot more complex if you were trying to inline and escape everything).

Scenario: We want to do a logical-OR search (mm=0%), but want to make sure we boost documents that contain all the search terms. This is necessary because sometimes a very long document with all the terms will have a lower score than a very short document with most of the terms.

Having short documents with a few keywords show up before long documents with all the keywords will drive your librarians CraZy!!! So it’s tempting to just leave it alone. But let’s fix it anyway.

The gist of it is as follows:

  • Query against title and author
  • Use an mm of 0% (logical OR) for the main query
  • Use a pf to boost on a phrase in those same two fields (just common sense)
  • Set up a boost query (bq) to boost the score if all the search terms are present

To accomplish this, we’re going to have two localparams queries: one to be the main query, and another that we’re going to use as the boost query. This works in much the same way as our previous “AND-together two localparams queries” did.

[Presenting the URL parameters as a ruby hash to make it easier to read]

  {
    'q'   => '_query_:"{!dismax qf=$f1 mm=$mm1 pf=$f1 bq=$bq1 v=$q1}"',
    'mm1' => '0%',
    'f1'  => 'author^3 title^1',
    'q1'  => 'Dueber Constructivism',
    'bq1' => '_query_:"{!dismax qf=$f1 mm=\'100%\' v=$q1 }"^5',
    'fl'  => 'score,*'
  }

What’s nice about this is that I’m reusing the search terms (for the main query and the boost query) and field list (for the query field and the phrase fields) so I don’t have to repeat them.
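
If the $name references feel a bit magical, this toy Ruby function mimics the substitution. It is an illustration only: Solr resolves the references itself, and nothing like this is needed in your application.

```ruby
# Illustration only: replace each $name with the value of the request
# parameter of that name, roughly what happens to the local params above.
def dereference(template, params)
  template.gsub(/\$(\w+)/) { params.fetch(Regexp.last_match(1)) }
end

params = {
  'f1'  => 'author^3 title^1',
  'mm1' => '0%',
  'q1'  => 'Dueber Constructivism'
}

puts dereference('qf=$f1 mm=$mm1 pf=$f1 v=$q1', params)
```

The field list is defined once in f1 and shows up as both qf and pf, which is the reuse described above.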

Try along at home

First off, if you don’t have a browser that does nice XML and JSON formatting, well, get one. I use Chrome with JSONView and XMLTree, but I’m sure there are equivalents for Firefox. They’ll make your life easier.

By now you know the drill:

  cd solr_stupid_tricks
  git pull origin master
  git fetch origin master
  git checkout SST2 # I've started tagging the repo for these posts
  # ignore warning about "detached HEAD"
  java -jar start.jar &

We’ll want to empty out the index and put in some documents to work with. I’m presuming you have curl installed. If not…well, you’re on your own.

  cd exampledocs
  ./reset_and_index_json localparams.json

You might want to take a look at the localparams.json file, which contains a set of documents in the new JSON update structure. The full Solr JSON Update structure allows repeated keys. Apparently, so does the JSON RFC:

> 2.2. Objects
>
> An object structure is represented as a pair of curly brackets
> surrounding zero or more name/value pairs (or members). A name is a
> string. A single colon comes after each name, separating the name
> from the value. A single comma separates a value from a following
> name. The names within an object SHOULD be unique.

(emphasis mine)

“SHOULD”. Not “MUST”. I don’t care if it’s legal. It still weirds me out.

Once you’ve got solr running in the background, you can go ahead and try our query!

  • If you’re really lazy, just click the link
  • If you’re slightly less lazy, and you’ve got ruby installed, take a look in the new ruby directory. You can run ruby browse.rb localparams_query.rb to run the query and have it automatically open up in your browser.
  • If you’re ambitious, you might want to actually mess with the localparams_query.rb file so you can try things out.

As a longish side note, we’ll probably use browse.rb in the future of this series as well, so you might want to go ahead and get ruby installed if you don’t already. RVM is the easiest route if you’re on linux/OSX. You can also just install JRuby, seeing as how you’re running java anyway (just make sure to use 1.9 mode by calling stuff as jruby --1.9 myscript.rb or setting the environment variable export JRUBY_OPTS=--1.9).

Special Stupid Solr Trick: Make a special query handler for a complex query

OK, so I said I wouldn’t have a real SST in this episode, but it’s so damn long at this point I figure I’ve lost everyone except Rochkind (Hey, Jonathan!), so let’s throw one in.

The Solr configuration file solrconfig.xml is where you can configure custom search handlers. In such a custom handler, you can specify defaults (which, by default, can be overridden by passed-in parameters, although you can control that, too) — this is commonly used to, say, put in a q.alt or a filter query that will always be applied.

But we can use it to put in our special query defaults that boost when a document contains all the terms:

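  <requestHandler name="/edismaxplus" class="solr.SearchHandler">
    <lst name="defaults">
      <str name="rows">10</str>
      <str name="fl">*,score</str>
      <str name="echoParams">explicit</str>
      <str name="q">
        _query_:"{!edismax qf=$fields
                           mm=$mymm
                           v=$qwords
                           bq=$boostForAll}"
      </str>
      <str name="mymm">0%</str>
      <str name="boostForAll">
        _query_:"{!edismax qf=$fields
                           mm='100%'
                           v=$qwords }"^5
      </str>
    </lst>
  </requestHandler>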
If you look closely, you’ll see that everything you need is defined in this requestHandler in the solrconfig.xml file, except for $fields and $qwords. You could also override mymm by passing in an argument with that name, if the default ’0%’ isn’t to your liking.

If you’ve been following along at home, this requestHandler is already in the solrconfig.xml file that you’re running right now. Go ahead and try it! Let’s search for the terms ‘dueber’ and ‘penn’ and see if the correct record floats to the top.

http://localhost:8983/solr/edismaxplus/?qwords=dueber penn&fields=author title

Nifty, huh?

Next time we’ll use a local params query to get around something about dismax that drives me crazy: preventing (or penalizing) matches that go across a field’s multiple values.

3 Responses to “Using localparams in Solr (or, how to boost records that contain all terms) (SST #2)”

  1. Ah, I see.

    The unexplained mystery is why you’d need to do a special boost of the same query with mm 100%. One would think that ordinary Solr relevance ranking would make things where all terms match rank higher than things where just some terms match, with a less than 100% mm.

    Now, I’ve noticed times it doesn’t too. But I’ve never understood why.

    This is necessary because sometimes a very long document with all the terms will have a lower score than a very short document with most of the terms.

    Hmm, okay, have to ponder on this one too. Ah, it’s starting to sink in.


[For the introduction to this series, take a quick gander at the introduction]

Like everyone else in the library world, I’ve got a bunch of well-defined, well-controlled standard identifiers I need to keep track of and allow searching on.

You know, well-vetted stuff like this:

  • 1234-5678
  • 123-4567-890
  • 12-34-567-X
  • 0012-0045
  • ISBN13: 1234567890123
  • ISSN: 1234567X (1998-99)
  • ISSN (1998-99): 1234567X
  • 1234567890 (hdk. 22 pgs)
  • 9
  • Behind the 3rd floor desk
  • Henry VIII

[Note: some of these may be a titch exaggerated]

How does your system deal with these on index? How about on query?

Here’s an idea of how to use a custom solr fieldtype to do the heavy lifting.

What we’re shooting for

I’d like to be able to send in a text string as follows:

  • The input can contain other text besides the id
  • The ID starts with a digit and consists solely of digits and (optional) dashes, then ends with a digit and possibly a trailing ‘X’ or ‘x’ so we can deal with ISBN/ISSN
  • The ID has to be at least N characters long (for this example, I’m using N=7); this helps us avoid other text that might trivially look like an ID but isn’t.
  • Only the ID itself is indexed
  • If no valid ID is identified, nothing is indexed

The numericID field, suitable for ISBN/ISSN/OCLC/etc.

Let’s take a look at the end product and then walk through it.

Things we’ll be learning about today

NOTE: I really, really recommend taking a look at Scaling Lucene and Solr by the good folks over at Lucid Imagination for great, short explanations of omitNorms, term frequencies, etc.

Since this is the first post, I’ll go over some stuff that’s probably a little too basic for any audience that’s likely to show up here, but what the heck.

  • KeywordTokenizer
  • PatternReplaceFilterFactory
  • LowerCaseFilterFactory
  • LengthFilterFactory

Step 1: “Tokenize” to a single token

The job of a tokenizer is to decide how to split your input into individual tokens (often “words”), which are then munged by any filters you’re applying.

For the case of an ID, we don’t want to tokenize. At least at this juncture, I’m not trying to extract multiple valid IDs out of a single string; I’m just trying to determine if there’s a valid ID in there somewhere and throwing everything else away.

In other words, I’m going to treat the input as a single token, and then munge the bejeebers out of it in order to get what I want.

In the Solr world, that leads us to the confusingly-named KeywordTokenizer.

What we have now: exactly what we started with
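
For orientation, here is a minimal sketch of the skeleton such a field type might have. The name numericID comes from the heading above; everything else is an assumption, not the original listing, and the filters from the later steps would sit inside the analyzer in order.

```xml
<!-- Sketch only: a bare-bones field type built around KeywordTokenizer. -->
<fieldType name="numericID" class="solr.TextField">
  <analyzer>
    <!-- Pass the entire input through as one single token -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- Filters from the following steps would go here, in order -->
  </analyzer>
</fieldType>
```

Because there is only one analyzer element (no type="index" or type="query"), the same chain applies at both index and query time.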

Step 2: Find the first thing that looks like an ID and mark it

I primarily work in Ruby and Perl, which means the dramatic abuse of regular expressions is just part of my daily life.

Line 5 is our first use of a regexp in the filter chain via PatternReplaceFilterFactory.

The idea is to:

  1. Find something that looks like a match
  2. If found, get rid of everything else, and throw a ‘***’ onto the beginning so later on I can tell if I matched or not.

The second step is a little…odd…but necessary because I need a way to know if I found a candidate ID or not. If I did, well, there will be three asterisks on the front of the string from here on out. If not, there won’t.

This is a little confusing as these things go, so I’ll break it down.

Line 6: the match:

  • Skip any amount of stuff we don’t care about (.*?)
  • Match a number (\p{N}) (that’s unicode regexp syntax, if you haven’t seen it)
  • Match a string of at least 6 numbers and dashes
  • Close with an optional X or x [Xx]?
  • …and any trailing bits until the end of the string.

So…[number][six or more numbers/dashes][optional X]

At minimum, that’s seven digits/dashes.

Line 7: replacement

  • Replace the whole string (note how I anchored the match with ^ and $?) with whatever was matched inside the parentheses (represented here by $1) after prepending a set of three asterisks ‘***’

What we have now: If we found a candidate ID, we have that string prepended by ‘***’. Otherwise, we have exactly what we started with.
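
A filter along these lines would do the marking described in this step. The pattern is reconstructed from the bullets above, so treat it as an approximation under that assumption rather than the exact regexp from the listing.

```xml
<!-- Sketch: keep the first ID-looking run and tag it with a leading '***'.
     Pattern reconstructed from the prose; an assumption, not the original. -->
<filter class="solr.PatternReplaceFilterFactory"
    pattern="^.*?(\p{N}[\p{N}\-]{6,}[Xx]?).*$"
    replacement="***$1" replace="first"/>
```

The lazy `.*?` at the front is what makes this pick the first candidate in the string rather than the last.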

Step 3: If we didn’t find a match, throw it all away

Line 9 checks whether the string starts with an asterisk (which we’re pretty sure we won’t see in real input, because a leading asterisk is illegal Lucene wildcard syntax). If the string doesn’t start with an asterisk, throw the whole damn thing away, because we don’t have a candidate ID anyway.

[There's a strong argument to be made that using an asterisk as the tagging character is a bad choice. Anyone have suggestions?]

What we have now: Either a candidate ID string preceded by ‘***’ or the empty string.
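
One way to express this, assuming the ‘***’ tag from the previous step (the pattern here is a guess at the approach, not the original line):

```xml
<!-- Sketch: anything that does NOT start with an asterisk was never tagged,
     so replace the whole string with nothing. -->
<filter class="solr.PatternReplaceFilterFactory"
    pattern="^[^\*].*$"
    replacement="" replace="first"/>
```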

Step 4: Ditch the ‘***’ used to mark a candidate ID

Lines 10-11

Find the ‘***’ and throw it away.

What we have now: The raw candidate ID string or the empty string.
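
A sketch of the un-tagging filter, again with an assumed pattern:

```xml
<!-- Sketch: strip the three-asterisk marker from the front of the token. -->
<filter class="solr.PatternReplaceFilterFactory"
    pattern="^\*\*\*"
    replacement="" replace="first"/>
```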

Step 5: Lowercase it

Line 12.

By ‘it’ I mean “any X that might be trailing the ID”; we should have thrown everything else away by now. (Note: could have done this with a PatternReplace as well, obviously; not sure why I’d choose one over the other).

What we have now: The raw candidate ID string with its optional trailing ‘X’ lowercased, or the empty string
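
In schema terms this step is a single stock filter with no parameters:

```xml
<!-- Lowercase the token, which at this point only affects a trailing 'X'. -->
<filter class="solr.LowerCaseFilterFactory"/>
```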

Step 6: Get rid of everything that’s not a number or an ‘x’

Lines 13-15

Ditch any dashes that are remaining. I’m doing it like this instead of just ditching the dashes because I’ll likely modify this at some point to allow, e.g., periods between numbers, or maybe spaces. This is safer.

Note the extra parameter (replace="all"), indicating that I want to replace all occurrences. This hasn’t been an issue until now because I’ve been careful to match the entire string by anchoring the pattern at the beginning (‘^’) and end (‘$’).

What we have now: A string of numbers possibly followed by an ‘x’, or the empty string.
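
A sketch of this cleanup filter. The character class is an assumption based on the step’s title; it is a whitelist, so it would keep working if other separators were allowed upstream.

```xml
<!-- Sketch: drop every character that is not a digit or a lowercase 'x'. -->
<filter class="solr.PatternReplaceFilterFactory"
    pattern="[^\p{N}x]"
    replacement="" replace="all"/>
```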

Step 7: Make sure what we have is a reasonable length

Line 16

Now that we’ve gotten rid of the dashes, we need to make sure we have enough digits left to make a valid identifier.

If we didn’t match originally, it quickly got reduced to the empty string, and that will disappear here due to having length 0.

It’s also possible that our initial match was, say, ‘1----3-----6---7’, which will at this point have been reduced to just ‘1367’, which is too short for our taste.

In this version, I allow strings of any length between 7 (old OCLC number) and 14 (barcode).

What we have now: A string consisting purely of 7-14 characters, the last of which may be an ‘x’, or nothing at all (e.g., nothing will get indexed).
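
Using the 7 and 14 limits mentioned above, the length check might be written as:

```xml
<!-- Sketch: discard tokens shorter than 7 or longer than 14 characters.
     A zero-length token is dropped here too. -->
<filter class="solr.LengthFilterFactory" min="7" max="14"/>
```

Tune min and max to your own identifiers; the tighter the range, the fewer false positives.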

Step 8: Remove leading 0s

My ILS (Aleph) loves to zero-pad all its local identifiers. I’d rather get rid of them.

What we have now: What we had before, but with no leading zeros
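
A sketch of a zero-stripping filter (the pattern is an assumption, not the original line):

```xml
<!-- Sketch: remove any run of zeros at the very start of the token. -->
<filter class="solr.PatternReplaceFilterFactory"
    pattern="^0+"
    replacement="" replace="first"/>
```

Note that because this runs after the length check, the final token can end up shorter than 7 characters, as the ‘0012-0045’ example in the results below shows.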

Let’s try it!

If you’re following along at home, get the latest version of the schema and try it!

    cd solr_stupid_tricks
    git pull origin master
    java -jar start.jar

…and then:

For those of you not following along at home, here are the examples from waaaaaay at the top of this post:

  • 1234-5678 => 12345678
  • 123-4567-890 => 1234567890
  • 12-34-567-X => 1234567x
  • 0012-0045 => 120045
  • ISBN13: 1234567890123 => 1234567890123
  • ISSN: 1234567X (1998-99) => 1234567x
  • ISSN (1998-99): 1234567X => 199899
  • 1234567890 (hdk. 22 pgs) => 1234567890
  • 9 => [nothing]
  • Behind the 3rd floor desk => [nothing]
  • Henry VIII => [nothing]

So…not too bad. We did miss one, mistaking a year range for a numeric ID, but if your data are that bad, there’s only so much we can do.

Conclusions

Obviously, this is the tip of the iceberg with this sort of thing. And it can still be confused.

But it does follow our goal of having the exact same behavior on index and query, moving the logic to solr, and being pretty flexible.

Perfect? No. Useful? Yes.

4 Responses to “Solr Field Type for numeric(ish) IDs (SST #1)”

  1. Do you have to do something to deal with the fact that while your analyzer specifies keyword tokenizer (no split on whitespace in tokenization phase) at both index time and query time, in reality at query time if you’re using lucene or dismax query parsers, they’ll sort of “pre-tokenize” on whitespace before even sending it to the analyzer?

    Does that end up causing a problem?

  2. This has an interesting problem. For tokens with no numbers, it creates a zero-length string. This matched seven records in my database that (incorrectly) had a zero-length string as the ISBN. Normal text searches would match these fields, which meant we never had zero hits, instead we showed the same seven irrelevant books.

    I added a LengthFilter to turn zero-length tokens into no token at all, which fixed the problem. Then I simplified the regex since I only care about ISBNs and EANs.

    <!-- Remove anything that isn’t a digit or an 'x'. -->
    <filter class="solr.PatternReplaceFilterFactory"
        pattern="[^\dx]" replacement="" replace="all"/>
    <!-- ISBNs and EANs are either 10 or 13 characters long. -->
    <filter class="solr.LengthFilterFactory" min="10" max="13"/>
    

    Thanks!

  3. Bill says:

    Note: I’ve talked to Walter, and his guess is that he left off the existing LengthFilterFactory when he dropped the remove-leading-zeros regexp (which others may want to do if you’re dealing only with stuff where leading zeros are useful and don’t want spurious collisions between, say, ISBN and ISSN). The numericID fieldtype in the post will not ever give zero-length strings.

    His comment also drives home the point that you should restrict this as much as possible — know your data so you know how to set the min/max length, whether to remove leading zeros, etc.
