Archives: December 2011

Solr and boolean operators

December 1, 2011 at 12:13 pmCategory:Uncategorized

[Summary: ALWAYS ALWAYS ALWAYS USE PARENTHESES TO GROUP BOOLEANS IN SOLR!!!]

What does Solr do, given the following query?

  a OR b AND c

I’ll give you three guesses, but you’ll get the first two wrong and won’t have any idea how to generate a third, so don’t spend too much time on it.

Boolean algebra and operator precedence

Anyone who’s had even a passing introduction to boolean alegebra knows that it specifies a strict order to how the operators are bound: NOT before AND before OR. So, one might expect the following grouping:

  a OR (b AND c)

That’s guess one. It’s not how Solr does it.

Left to right?

Some naive students, and at least one programming language (Smalltalk), do a simple left-to-right evaluation. So you might go with:

  (a OR b) AND c

Nope. Wrong again.

So what’s left???

Excellent question. I don’t know the code well enough to know what’s going on underneath, but here’s what we get under the lucene query parser.

    (b AND c)

That’s right. The first term is thrown away.(More correctly, the first term is deemed “optional”).

Do you let your users put AND/OR/NOT in their queries?

Hopefully, they don’t know any boolean algebra. If they do, hopefully they use parentheses, or you parse it out for them. And if not, well, they’re gonna be pretty damn confused.

It gets weirder

I populated a fresh solr (3.5) index with all possible subsets of the strings “curly”, “larry”, “moe”, and “shemp” (not Joe. Don’t talk to me about Joe). There are 15 of them, from the one-item ‘curly’ to all four at once.

I wrote a script to run a set of queries against the index under both lucene and edismax to see what I would get. In all cases the default lucene operator is ‘AND’ and the edismax mm parameter is set to 100% (equivalent to “all required”).

        Lucene                    EDismax
--------------------------------------------------------

  1. curly AND larry curly larry curly larry
    curly larry moe curly larry moe
    curly larry shemp curly larry shemp
    curly larry moe shemp curly larry moe shemp

  2. curly AND larry OR moe curly curly larry
    curly larry curly larry moe
    curly moe curly larry shemp
    curly shemp curly larry moe shemp
    curly larry moe
    curly larry shemp
    curly moe shemp
    curly larry moe shemp

  3. curly OR larry AND moe larry moe larry moe
    curly larry moe curly larry moe
    larry moe shemp larry moe shemp
    curly larry moe shemp curly larry moe shemp

  4. curly AND larry OR moe AND shemp curly moe shemp curly larry moe shemp
    curly larry moe shemp

  5. moe AND shemp OR curly AND larry curly larry moe curly larry moe shemp
    curly larry moe shemp

Query 1 is as expected. Query 2 apparently reduces to just ‘curly’ under the lucene parser and ‘curly AND larry’ under edismax (and query 3 similarly reduces to the two AND’d words). Queries 4 and 5 are…well, you can look at the debugQuery output to see what it gets, but not why. And then tell me how to explain it to a user.

Where does this leave us?

The good news is that both lucene and edismax behave predictably when you use parentheses for grouping. So do that.

I’m generally not one to complain about open-source software, at least partially because I don’t have the chops to do anything about it most of the time, but I don’t understand how this could seem OK to anyone. There are a couple lucene Jira tickets (Lucene-167 and Lucene-1823) and a 2005 mailing list thread denouncing the current behavior, but it persists.

Until the Solr/Lucene powers that be decide to tackle this, the rest of us will either have to write pre-parsers to make sure users get something sensible, or cripple our applications to disallow unrestricted boolean queries.

3 Responses to “Solr and boolean operators”

  1. So the way I handle part of this at present is not actually passing these user-entered boolean queries straight to any solr query parser (not lucene, not edismax either; curious if edismax has the same or similar idiosyncracies, I predict it will).

    Instead, I actually parse all user queries in my own application, and then construct Solr queries (using a ‘lucene’ type query, with nested dismax type queries) that I know do what is needed for the user query to be interpreted with more typical boolean logic.

    Of course, it’s possible I’ve gotten it wrong too, but theoretically my parser/interpreter can then be fixed to what I want.

    Now, it might make even MORE sense to do this as an actual custom Solr query parser, but I lack the Solr/java comfort/skills to do that, so I’d rather do it in ruby. I think it would probably in some ways be better to do this in Java as a Solr plug-in query parser, rather than at the application layer, but oh well.

    Here’s the specs from my code that show how various user-entered queries are translated to Solr. Note they also take care of various types of “pure negative” queries that lucene and dismax query parsers can’t handle ‘right’ either (I think edismax fixes some but not all of these ‘pure negative’ cases). Hopefully the spec file is somewhat readable, despite (or because of!) the helper functions I put in to test generated solr query params against templates.

    https://github.com/projectblacklight/blacklight_advanced_search/blob/master/spec/parsing_nesting/to_solr_spec.rb

  2. PS: Sorry, I see you covered edismax too, thanks!

    PPS: One answer I’ve gotten from solr-ites is basically “Well, yeah, AND/OR aren’t really boolean operators to these query parsers, they are just indication of lucene optional/required clauses.”

    To which my answer is: “Okay, but then why use syntax that looks like boolean algebra, and that makes it very hard to predict exactly how they are translated to optional/required clauses, to the novice. Why not use a different syntax that actually indicates what it’s doing?” Perhaps the answer is “Well, becuase users are used to AND/OR”, but I think it’s no service to give users a syntax they are used to but with semantics that they are NOT used to!

    However, it’s possible that if you keep in mind “translating to lucene optional/mandatory clauses”, it will be the right mental model to figure out why the query parsers are doing what they’re doing. Although I still can’t figure it out, especially for some of the especially weird cases where terms are dropped altogether.

  3. Bruce Rosen says:

    Many thanks for this posting. I just stumbled upon Solr’s boolean precedence peculiarity yesterday (not sure why I hadn’t seen it earlier). It was precisely your initial question: a OR b AND c. Search results always returned the b AND c part. If I tried b OR a AND c on the identical docs, I always got the a AND c docs. Using parentheses, (a OR b) AND c, did the right thing. So, I was very glad to see your posting, and will take the punchline (always use parentheses) to heart … I’ve got some code to go back and fix!

Leave a Reply