Mailing List Archive

Semantics and performance regarding min number of the optional BooleanClauses
Hi all,

My question concerns the method setMinimumNumberShouldMatch in BooleaQuery
class.

Lets assume that we have 3 queries (optional clauses), namely A, B, C and
we build a BooleanQuery specifying that at least 2 should match.

In terms of semantics what I understand so far is that

(A B C)~2 is equivalent to ((+A +B) (+A +C) (+B +C)).

In other words a single BooleaQuery with a min should match parameter could
be rewritten as pure disjunctive BooleanQuery comprised from 3 sub-queries.

In terms of performance it seems that the two queries present different
behavior so the minMatch property is not only syntactic sugar and
apparently there is no rewriting between the two.

Coming from the SQL world it is a bit hard for me to justify the addition
of a new operator that looks like syntactic sugar and at the same time is
more performant than the more primitive equivalents. I looked a bit in [1]
to understand motivation for adding this API but without much success.

Summing up everything to three questions:
1. Did I get right the semantics of this extra property or there are things
that I am missing?
(If my understanding is correct)
2. What's the reason for introducing the minMatch property in the first
place? (Avoid creating huge queries?)
3. Should the performance of the two queries shown above differ?

Thanks in advance!

Best,
Stamatis

[1] https://issues.apache.org/jira/browse/LUCENE-395
Re: Semantics and performance regarding min number of the optional BooleanClauses [ In reply to ]
Hi Stamatis,

One thing that you missed regarding semantics is scoring. While (A B C)~2
and ((+A +B) (+A +C) (+B +C)) would match the same documents, they would
produce different scores.

Moreover, many users come to this query because it is exactly what they
need: matching k out of n clauses. In the example you gave it's pretty
simple because there are only 3 clauses, but try to see what the generated
query looks like when matching 3 out of 5 clauses, it's already very
complex.

It would be nice if we could rewrite the expanded form into the variant
that sets a minimum number of matching should clauses, which should be more
efficient. My worry is that it would be quite expensive to do, maybe to the
point that it would more hurt than help on average. I'd be very happy to be
proven wrong though, if we can cheaply rewrite the expanded form, this
would be a good addition.

On Mon, Mar 30, 2020 at 6:06 PM Stamatis Zampetakis <zabetak@gmail.com>
wrote:

> Hi all,
>
> My question concerns the method setMinimumNumberShouldMatch in BooleaQuery
> class.
>
> Lets assume that we have 3 queries (optional clauses), namely A, B, C and
> we build a BooleanQuery specifying that at least 2 should match.
>
> In terms of semantics what I understand so far is that
>
> (A B C)~2 is equivalent to ((+A +B) (+A +C) (+B +C)).
>
> In other words a single BooleaQuery with a min should match parameter could
> be rewritten as pure disjunctive BooleanQuery comprised from 3 sub-queries.
>
> In terms of performance it seems that the two queries present different
> behavior so the minMatch property is not only syntactic sugar and
> apparently there is no rewriting between the two.
>
> Coming from the SQL world it is a bit hard for me to justify the addition
> of a new operator that looks like syntactic sugar and at the same time is
> more performant than the more primitive equivalents. I looked a bit in [1]
> to understand motivation for adding this API but without much success.
>
> Summing up everything to three questions:
> 1. Did I get right the semantics of this extra property or there are things
> that I am missing?
> (If my understanding is correct)
> 2. What's the reason for introducing the minMatch property in the first
> place? (Avoid creating huge queries?)
> 3. Should the performance of the two queries shown above differ?
>
> Thanks in advance!
>
> Best,
> Stamatis
>
> [1] https://issues.apache.org/jira/browse/LUCENE-395
>


--
Adrien
Re: Semantics and performance regarding min number of the optional BooleanClauses [ In reply to ]
Thanks Adrien, indeed I missed the difference in the score since in my
context I do not need them most of the time.

I will try to check if there is something that can be done with respect to
the rewriting and if something promising comes up I will let you know.

Best,
Stamatis


On Mon, Mar 30, 2020, 7:20 PM Adrien Grand <jpountz@gmail.com> wrote:

> Hi Stamatis,
>
> One thing that you missed regarding semantics is scoring. While (A B C)~2
> and ((+A +B) (+A +C) (+B +C)) would match the same documents, they would
> produce different scores.
>
> Moreover, many users come to this query because it is exactly what they
> need: matching k out of n clauses. In the example you gave it's pretty
> simple because there are only 3 clauses, but try to see what the generated
> query looks like when matching 3 out of 5 clauses, it's already very
> complex.
>
> It would be nice if we could rewrite the expanded form into the variant
> that sets a minimum number of matching should clauses, which should be more
> efficient. My worry is that it would be quite expensive to do, maybe to the
> point that it would more hurt than help on average. I'd be very happy to be
> proven wrong though, if we can cheaply rewrite the expanded form, this
> would be a good addition.
>
> On Mon, Mar 30, 2020 at 6:06 PM Stamatis Zampetakis <zabetak@gmail.com>
> wrote:
>
> > Hi all,
> >
> > My question concerns the method setMinimumNumberShouldMatch in
> BooleaQuery
> > class.
> >
> > Lets assume that we have 3 queries (optional clauses), namely A, B, C and
> > we build a BooleanQuery specifying that at least 2 should match.
> >
> > In terms of semantics what I understand so far is that
> >
> > (A B C)~2 is equivalent to ((+A +B) (+A +C) (+B +C)).
> >
> > In other words a single BooleaQuery with a min should match parameter
> could
> > be rewritten as pure disjunctive BooleanQuery comprised from 3
> sub-queries.
> >
> > In terms of performance it seems that the two queries present different
> > behavior so the minMatch property is not only syntactic sugar and
> > apparently there is no rewriting between the two.
> >
> > Coming from the SQL world it is a bit hard for me to justify the addition
> > of a new operator that looks like syntactic sugar and at the same time is
> > more performant than the more primitive equivalents. I looked a bit in
> [1]
> > to understand motivation for adding this API but without much success.
> >
> > Summing up everything to three questions:
> > 1. Did I get right the semantics of this extra property or there are
> things
> > that I am missing?
> > (If my understanding is correct)
> > 2. What's the reason for introducing the minMatch property in the first
> > place? (Avoid creating huge queries?)
> > 3. Should the performance of the two queries shown above differ?
> >
> > Thanks in advance!
> >
> > Best,
> > Stamatis
> >
> > [1] https://issues.apache.org/jira/browse/LUCENE-395
> >
>
>
> --
> Adrien
>