Mailing List Archive

Query Optimization
Hi there,

I try to optimize my query, but I think I have come to a point where I have
to extend the functionality of lucene.

I have a set of pretty simple queries, let's call them <A>, <B>, <C>, ...
Now I try to create a query which matches, if AT LEAST 2 of this simple
queries match. I started with a query like the following one (build all the
possible pairs and combine them with 'OR'):

(<A> AND <B>) OR (<A> AND <C>) OR (<B> and <C>) OR ...

My first question:
Is it safe to create the simple Query objects (most of the time TermQuery
objects) and then use these objects multiple times in a BooleanQuery object?

Example:

Query a = new TermQuery("name", "John");
Query b = new TermQuery("name", "James");
Query c = new TermQuery("name", "Jack");

Query q1 = new BooleanQuery();
and.add(a,true,false);
and.add(b,true,false);

Query q2 = new BooleanQuery();
and.add(a,true,false);
and.add(c,true,false);

Query q3 = new BooleanQuery();
and.add(b,true,false);
and.add(c,true,false);

Query q = new BooleanQuery();
q.add(a,false,false);
q.add(b,false,false);
q.add(c,false,false);




Back to the optimization problem:
I was able to optimize the above like this:

(<A> AND (<B> OR <C> OR ...)) OR (<B> AND (<C> OR ...)) OR (<C> AND (...))
OR ...

But this is still to slow and there must be some potential to improve this:
If <A> matches, but the subquery (<B> OR <C> OR ...) doesn't match,
evaluation could stop. But the query above will continue. I could change the
query again to something like the following:

(<A> AND (<B> OR <C> OR ...)) OR NOT(<A>) AND ( .... )

But this would get pretty complex.

Now I would like to create a subclass of Query (or MultiTermQuery?),
something like a MatchCountQuery where I can add my simple Query objects:

MatchCountQuery query = new MatchCountQuery();
query.setMinCount(2); // at least 2 matches
query.setMaxCount(-1); // no limit (I don't realy need this)
query.add(<A>);
query.add(<B>);
query.add(<C>);
query.add(<D>);

Has anyone ever done something like this? Or is there a simpler solution?


Thank you for any kind of help in this case.

Stephan
Re: Query Optimization [ In reply to ]
Stephan,


On Friday 06 May 2005 20:29, Stephan Markwalder wrote:
> Hi there,
>
> I try to optimize my query, but I think I have come to a point where I have
> to extend the functionality of lucene.
>
> I have a set of pretty simple queries, let's call them <A>, <B>, <C>, ...
> Now I try to create a query which matches, if AT LEAST 2 of this simple
> queries match. I started with a query like the following one (build all the

In the development (java) version, there is a DisjunctionSumScorer that can be
given a minimum number of simple queries to match (default 1), which is what
you need here. It works by counting the matching simple queries
while evaluting the query as an OR like query.

However, this is supported neither in the query parser nor in the
main BooleanQuery class, so some extension is needed in case you
need to parse this and/or you need a coordination factor in the score.
The coordination factor can be added by subclassing, parsing
would need an extension of the syntax, which may not be straightforward.

> possible pairs and combine them with 'OR'):
>
> (<A> AND <B>) OR (<A> AND <C>) OR (<B> and <C>) OR ...
>
> My first question:
> Is it safe to create the simple Query objects (most of the time TermQuery
> objects) and then use these objects multiple times in a BooleanQuery object?

It should be safe to do that, but I've never tried it myself.
With the DisjunctionSumScorer, these forms are not needed.

> Example:
>
> Query a = new TermQuery("name", "John");
> Query b = new TermQuery("name", "James");
> Query c = new TermQuery("name", "Jack");
>
> Query q1 = new BooleanQuery();
> and.add(a,true,false);
> and.add(b,true,false);
>
> Query q2 = new BooleanQuery();
> and.add(a,true,false);
> and.add(c,true,false);
>
> Query q3 = new BooleanQuery();
> and.add(b,true,false);
> and.add(c,true,false);
>
> Query q = new BooleanQuery();
> q.add(a,false,false);
> q.add(b,false,false);
> q.add(c,false,false);
>
>
>
>
> Back to the optimization problem:
> I was able to optimize the above like this:
>
> (<A> AND (<B> OR <C> OR ...)) OR (<B> AND (<C> OR ...)) OR (<C> AND (...))
> OR ...
>
> But this is still to slow and there must be some potential to improve this:
> If <A> matches, but the subquery (<B> OR <C> OR ...) doesn't match,
> evaluation could stop. But the query above will continue. I could change the
> query again to something like the following:
>
> (<A> AND (<B> OR <C> OR ...)) OR NOT(<A>) AND ( .... )
>
> But this would get pretty complex.

Counting the matching simple queries while evaluating is less complex.

> Now I would like to create a subclass of Query (or MultiTermQuery?),
> something like a MatchCountQuery where I can add my simple Query objects:
>
> MatchCountQuery query = new MatchCountQuery();
> query.setMinCount(2); // at least 2 matches
> query.setMaxCount(-1); // no limit (I don't realy need this)
> query.add(<A>);
> query.add(<B>);
> query.add(<C>);
> query.add(<D>);
>
> Has anyone ever done something like this? Or is there a simpler solution?

Yes, the DisjunctionSumScorer above.
In case you have further questions, please use
java-dev@lucene.apache.org


Regards,
Paul Elschot