Solr: Forcing items with all query terms to the top of a Solr search

August 18, 2010

[Note: I've since made a better explanation of, and solution for, this problem.]

Here at UMich, we’re apparently in the minority in that we have Mirlyn, our catalog discovery interface (a very hacked version of VuFind), set up to find records that match only a subset of the query terms.

Put more succinctly: everyone else seems to join all terms with ‘AND’, whereas we do a DisMax variant on ‘OR’.

Now, I’m actually quite proud of how our searching behaves. Reference desk anecdotes and our statistics all point to the idea that people tend to find what they’re looking for. I invite you to try our current configuration out — and, of course, let me know if something feels off to you. We have control of our OPAC now, and can actually fix things.

The “problem”: DisMax is weird

The DisMax algorithm is complex. Even if you ignore the fact that we weight some fields (title, author) much higher than others, a fundamental feature of DisMax is that it basically ranks based on the question, “What percentage of the words in the document match one of our query terms?”

Most of the time, that’s exactly what you want. In general, items that have all the keywords, and more of them, appear at the top of the search results.

But sometimes you can have just, say, two of your three search terms appearing like a rash all across a relatively short record, and it’ll pop to the top, appearing ahead of records that actually contain all three search terms. Or maybe three of four search terms appear in both title and author (highly-weighted fields) and the same thing happens.
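To make that failure mode concrete, here is a toy sketch in Ruby. It is emphatically not Solr's real scoring (which involves term frequency, document frequency, field weights, and more); it only models the simplified "what fraction of the record's words match a query term?" intuition described above. The `toy_score` method and both records are invented for illustration.

```ruby
# Toy model only: NOT Solr's actual DisMax scoring. It scores a record by the
# fraction of its words that match one of the query terms.
def toy_score(record_text, query_terms)
  words = record_text.downcase.scan(/\w+/)
  return 0.0 if words.empty?

  hits = words.count { |w| query_terms.include?(w) }
  hits.to_f / words.size
end

query = %w[information architecture usability]

# Hypothetical records, invented for illustration.
short_record = 'Information architecture: information architecture'
full_record  = 'Usability and information architecture for large web sites: a practical guide'

short_score = toy_score(short_record, query)
full_score  = toy_score(full_record, query)

puts format('short record (2 of 3 terms): %.2f', short_score)
puts format('full record  (3 of 3 terms): %.2f', full_score)
```

Under this toy measure the short record, which never mentions usability, outscores the longer record that contains all three terms.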

And, yeah, it really happens.

An actual, real-life example

Searching for the three terms information AND architecture AND usability, explicitly requiring all three, gives 12 results.

The equivalent DisMax search (where only two of three need to be found) nets about 4300 results. Which is great — we’re casting a much wider net, with some pretty common words. That doesn’t matter so long as the most relevant results float to the top.

The kicker? The first time an item in the first set appears in the second is at record number 62. Our user is more than three pages in before she even sees a record that contains all three terms.
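A check like this is easy to repeat once you have both result lists in hand. The following is a minimal Ruby sketch, assuming you have already fetched the ordered record ids from the OR-style search and the ids from the AND search; the method name and the sample ids are made up for illustration.

```ruby
require 'set'

# Return the 1-based position of the first result that also appears in the
# all-terms result set, or nil if no such result exists.
def first_all_terms_rank(or_result_ids, and_result_ids)
  wanted = and_result_ids.to_set
  index = or_result_ids.index { |id| wanted.include?(id) }
  index && index + 1
end

# Hypothetical record ids, invented for illustration.
or_results  = %w[rec9 rec4 rec7 rec2 rec5]
and_results = %w[rec2 rec5]

rank = first_all_terms_rank(or_results, and_results)
puts "First all-terms record appears at position #{rank}"
```

Running the same check before and after a ranking change gives a rough, repeatable number to compare.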

Again, most of the time, our current algorithm does really, really well in my opinion. But noticing this led to talk about artificially pushing all the “all terms are present” items to the top.

Pushing records that contain all the terms to the top

So, I wanted to:

  • Push records with all search terms to the top, but
  • …don’t otherwise change their scores, i.e., don’t otherwise re-order them in any way, ’cause I’m already happy with my ordering.
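Taken together, those two goals describe what is usually called a stable partition: all-terms records first, with the existing order preserved inside each group. Here is a small Ruby sketch of the desired outcome on a list that is already in memory. It only illustrates the goal; it says nothing about how to get Solr to do it. The `has_all_terms` flag and the sample records are hypothetical.

```ruby
# Move results that contain every query term to the front, keeping the
# existing relative order inside each group (a stable partition).
def all_terms_first(results)
  with_all, without_all = results.partition { |r| r[:has_all_terms] }
  with_all + without_all
end

# Hypothetical results, listed in the order the existing ranking produced.
results = [
  { id: 'a', has_all_terms: false },
  { id: 'b', has_all_terms: true },
  { id: 'c', has_all_terms: false },
  { id: 'd', has_all_terms: true }
]

reordered = all_terms_first(results)
puts reordered.map { |r| r[:id] }.join(', ')
```

Here 'b' and 'd' move ahead of 'a' and 'c', but 'b' still precedes 'd' and 'a' still precedes 'c'.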

It turns out to be harder than I initially thought. I fought with my code for a whole day, then asked for help, and help was provided.

So, with special thanks to Jan Høydahl for his solution, we get this, in Ruby pseudocode:

    andedTerms = allMyTerms.join(' AND ')
    bf = 'map(query($qq),0,0,0,100000.0)'    # Add this value to the ranking score
    qq = "allFields:(#{andedTerms})"         # Use this as the qq query

    # Add bf and qq to your Solr query

The qq is easy enough — it basically says that to get any relevancy score at all, the record must have all the terms in the allFields Solr field.
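For anyone who wants to see the two parameters end up in an actual request, here is a minimal Ruby sketch. It assumes a DisMax handler selected with `defType=dismax` and the user's terms in `q`; the helper name `all_terms_boost_params`, its keyword arguments, and the sample query are made up for illustration, and terms containing Solr query syntax characters would need escaping, which this sketch does not do.

```ruby
require 'uri'

# Build the extra Solr parameters that boost records containing every term.
# The field name defaults to allFields, the field used in this post.
def all_terms_boost_params(terms, field: 'allFields', boost: 100_000.0)
  anded_terms = terms.join(' AND ')
  {
    'qq' => "#{field}:(#{anded_terms})",      # matches only all-terms records
    'bf' => "map(query($qq),0,0,0,#{boost})"  # boost when qq scores above 0
  }
end

user_query = 'information architecture usability' # hypothetical user input

# Assumption: the main search is a DisMax query on the same terms.
base_params = { 'defType' => 'dismax', 'q' => user_query }
params = base_params.merge(all_terms_boost_params(user_query.split))

# Form-encoded string suitable for appending to a Solr select URL.
puts URI.encode_www_form(params)
```

The main `q` is left alone, so the result set stays as wide as before; only the scores of all-terms records change.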

For the map, we want to say: “If the record matches all the terms, give it an extra 100K points. If not, don’t.”

The map takes 5 arguments:

  • An initial value. In this case, we’re getting the relevancy ranking score based on the qq query. Basically, items that don’t have all the terms will have a score of zero; items that do have all three terms will have something bigger than zero.
  • The beginning of range to compare to. In this case, 0.
  • The end of the range. Another zero, so basically, we’ll be seeing if our initial value is between 0 and 0, i.e., if it’s exactly 0.
  • The value to return if the initial value fits in the range: zero. So, if the record doesn’t have all the terms, return a 0.
  • The value to return if the initial value falls outside the given range: 100K, an arbitrary very large number I picked.
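The five arguments above can be imitated in plain Ruby to see the effect on a ranking. This is only a sketch: `solr_style_map` is a hand-written stand-in for the Solr function, the scores are invented, and the final score is modeled as a simple sum of the base score and the boost.

```ruby
# Plain-Ruby imitation of the five-argument map described above: values
# inside [min, max] become `target`, everything else becomes `default`.
def solr_style_map(value, min, max, target, default)
  value.between?(min, max) ? target : default
end

# Hypothetical records: `base` is the normal relevancy score, `qq` is the
# score from the all-terms query (0.0 when a term is missing).
docs = [
  { id: 'two of three terms, short record', base: 9.1, qq: 0.0 },
  { id: 'all three terms',                  base: 4.2, qq: 1.7 },
  { id: 'all three terms, weaker match',    base: 2.5, qq: 0.9 }
]

boosted = docs.map do |d|
  d.merge(final: d[:base] + solr_style_map(d[:qq], 0, 0, 0, 100_000.0))
end
ranked = boosted.sort_by { |d| -d[:final] }

ranked.each { |d| puts format('%-34s %.1f', d[:id], d[:final]) }
```

Both all-terms records jump ahead of the short record, and they keep their original order relative to each other because each got the same flat boost. That only holds as long as the boost is larger than any base score.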

And…?

I just pushed this to our beta site, and folks are still looking at it, but so far, it looks awesome. I’ll do a little update post if/when it goes into production. And if it doesn’t, I’ll say why.

Comments:4

  1. Naomi Dushay
    August 29, 2010 at 10:31 pm

    So what sort of relevancy testing did you use to confirm the new way is better than the old way? Nice repeatable, automate-able tests, right?

  2. Naomi Dushay
    August 29, 2010 at 10:36 pm

    More questions:

    couldn’t you have set mm higher, or did you want to make sure you still got the more comprehensive result set?

    I’m also wondering if there are other ways to reduce the importance of the size-of-the-document.

    Have you also tweaked for unstemmed matches >> stemmed matches (cooking vs. cooked)

    and for proximity using the pf and ps boosts?

  3. Jonathan Rochkind
    August 31, 2010 at 6:07 pm

    Here’s my refinement, which avoids the need for the client to send the qq, everything is just computed based on existing ‘q’.

    map(query($all_terms),0,0,0,100000.0) {!dismax qf=text pf='' bf='' bq='' mm='100%' v=$q}

    Seems to work to do the same thing yours does. Of course, now this is indeed by default applied to every single query, including queries it doesn’t make any sense for (or may even error for?) like queries that weren’t originally dismax (like ‘advanced’ search).

    Not quite sure the best way to deal with this in blacklight, just playing around with different approaches to this functionality, seeing which will require the least code. :)

  4. Jonathan Rochkind
    August 31, 2010 at 6:08 pm

    Bah, that got rid of my brackets. let’s try again without brackets, but this is defaults in solrconfig.xml

    str name="bf" map(query($all_terms),0,0,0,100000.0) /str

    str name="all_terms" {!dismax qf=text pf='' bf='' bq='' mm='100%' v=$q} /str
