Mailing List Archive

OR'ed boolean queries
Hello

I don't fully understand how PrefixQuery, WildcardQuery, RangeQuery and FuzzyQuery expand into a series of OR'ed boolean queries.

For example, I have an index with 200,000 records. Each record has two metadata fields, NAMEFILE and AGENCY. If I run the search
NAMEFILE:ef*
I get a TooManyClauses error, but if I run the search
AGENCY:ef*
I get the results correctly, without any error.

Both fields hold 200,000 values, but the AGENCY field has only about 30 different values, whereas in the NAMEFILE field each record has a unique value.

Both fields have been indexed with Field.Text.

The same happens with RangeQuery. For example:

The user selects PAGE > 0. Internally this is translated to PAGE:{0000000000 TO 2147483647} (2147483647 is Integer.MAX_VALUE).
This returns 130,000 records with value > 0 without a TooManyClauses error, but with other numeric fields I get a TooManyClauses error.
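
[Editor's sketch] The range case can be pictured with a small stand-alone model. It is not Lucene code: the class RangeExpansionSketch and its methods are hypothetical names, and it assumes, as the thread describes, that a range query becomes one OR'ed clause per distinct term between the two bounds. It also shows why the values are zero-padded: terms compare as strings, so padding makes string order match numeric order.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.SortedSet;
import java.util.TreeSet;

// Hypothetical model, not Lucene code: a field is reduced to its sorted set of distinct terms.
class RangeExpansionSketch {

    // Zero-pad to 10 digits so that string order matches numeric order (non-negative values only).
    static String pad(int value) {
        return String.format(Locale.ROOT, "%010d", value);
    }

    // Collapse the per-document values of a field into its distinct, sorted terms.
    static SortedSet<String> distinctTerms(List<Integer> valuesPerDocument) {
        SortedSet<String> terms = new TreeSet<>();
        for (int value : valuesPerDocument) {
            terms.add(pad(value));
        }
        return terms;
    }

    // Clauses needed for an exclusive range {lower TO upper}:
    // one per distinct term strictly between the two bounds.
    static int clauseCount(SortedSet<String> terms, String lower, String upper) {
        int count = 0;
        // subSet is lower-inclusive and upper-exclusive, so skip the lower bound by hand.
        for (String term : terms.subSet(lower, upper)) {
            if (!term.equals(lower)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        // Six documents, but only three distinct PAGE values.
        SortedSet<String> pageTerms = distinctTerms(Arrays.asList(0, 1, 1, 2, 2, 2));
        System.out.println(clauseCount(pageTerms, pad(0), pad(Integer.MAX_VALUE)));
    }
}
```

Under this model a PAGE field with few distinct page numbers stays far below the limit however many records match, while a numeric field with more than 1024 distinct values inside the range would exceed it.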

The maxClauseCount property is at its default value (1024).

Could anybody tell me how this works?



Thanks in advance


Mari Luz Elola
Re: OR'ed boolean queries
The problem is that you have a lot of NAMEFILEs that start with "ef".
"A lot" means "more than 1024":
http://lucene.apache.org/java/docs/api/org/apache/lucene/search/BooleanQuery.html#getMaxClauseCount()

You could change it with this:
http://lucene.apache.org/java/docs/api/org/apache/lucene/search/BooleanQuery.html#setMaxClauseCount(int)

Otis


Re: OR'ed boolean queries
But the AGENCY field has a lot of "ef" values too, more than 1024.
What is the difference between NAMEFILE and AGENCY? Why do I get the
maxClause error with NAMEFILE and not with AGENCY?
If I change maxClauseCount to a large value, I get an OutOfMemoryError.


----- Original Message -----
From: "Otis Gospodnetic" <otis_gospodnetic@yahoo.com>
To: <general@lucene.apache.org>
Sent: Thursday, July 21, 2005 7:20 PM
Subject: Re: OR'ed boolean queries


> The problem is that you name a lot of NAMEFILEs that start with "ef".
> "A lot" means "more than 1024":
> http://lucene.apache.org/java/docs/api/org/apache/lucene/search/BooleanQuery.html#getMaxClauseCount()
>
> You could change it with this:
> http://lucene.apache.org/java/docs/api/org/apache/lucene/search/BooleanQuery.html#setMaxClauseCount(int)
>
> Otis
>
>
Re: OR'ed boolean queries
: But the AGENCY field has a lot of "ef" values too, more than 1024.
: What is the difference between NAMEFILE and AGENCY? Why do I get the
: maxClause error with NAMEFILE and not with AGENCY?

The issue is not the number of documents that have a value with that
prefix in that field. The issue is the number of unique values that have
that prefix in that field, regardless of the number of documents that
use each value, because each unique value becomes one clause of the
expanded BooleanQuery.
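
[Editor's sketch] The distinction between documents and unique values can be shown with a small stand-alone model. It is not Lucene code: the class PrefixExpansionSketch, its methods, and the generated term values are hypothetical, and IllegalStateException only stands in for the TooManyClauses error. The index is scaled down to 5,000 documents to keep the example quick.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.SortedSet;
import java.util.TreeSet;

// Simplified model of prefix expansion; not Lucene code.
class PrefixExpansionSketch {

    // Default limit mentioned in the thread.
    static final int MAX_CLAUSE_COUNT = 1024;

    // Distinct terms of a field where `documents` documents share
    // `distinctValues` different values, all of them starting with "ef".
    static SortedSet<String> fieldTerms(int documents, int distinctValues) {
        SortedSet<String> terms = new TreeSet<>();
        for (int doc = 0; doc < documents; doc++) {
            terms.add(String.format(Locale.ROOT, "ef%06d", doc % distinctValues));
        }
        return terms;
    }

    // One clause per distinct term that starts with the prefix.
    // IllegalStateException is a stand-in for Lucene's TooManyClauses.
    static List<String> expand(SortedSet<String> terms, String prefix, int maxClauses) {
        List<String> clauses = new ArrayList<>();
        for (String term : terms.tailSet(prefix)) {
            if (!term.startsWith(prefix)) {
                break; // terms are sorted, so the block of matching terms has ended
            }
            if (clauses.size() == maxClauses) {
                throw new IllegalStateException("more than " + maxClauses + " clauses");
            }
            clauses.add(term);
        }
        return clauses;
    }

    public static void main(String[] args) {
        // AGENCY-like field: many documents, about 30 distinct values.
        System.out.println(expand(fieldTerms(5000, 30), "ef", MAX_CLAUSE_COUNT).size());
        // NAMEFILE-like field: every document has its own value.
        try {
            expand(fieldTerms(5000, 5000), "ef", MAX_CLAUSE_COUNT);
        } catch (IllegalStateException e) {
            System.out.println("NAMEFILE-like field: " + e.getMessage());
        }
    }
}
```

Both fields have the same number of matching documents in this model; only the field with one value per document runs past the clause limit.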

: If I change maxClauseCount to a large value, I get an OutOfMemoryError.

You can either increase your memory footprint, or you can abandon the use
of prefix query in this situation. There are a lot of other options for
achieving similar results -- using a custom filter, making more fields
that contain only the first few characters of the field for the purpose of
doing short prefix queries ... etc.
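
[Editor's sketch] The "extra field with the first few characters" option might look roughly like the model below. It is not Lucene code: the class PrefixFieldSketch, the prefix length of 2, and the sample file names are assumptions made for illustration. The point is that a short prefix search such as ef* turns into a single exact-term lookup on the extra field, so nothing has to be expanded into clauses.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of an additional "prefix" field; not Lucene code.
class PrefixFieldSketch {

    // Assumed length of the stored prefix.
    static final int PREFIX_LENGTH = 2;

    // Models the extra field: each short prefix maps to the ids of the
    // documents whose NAMEFILE value starts with it.
    private final Map<String, List<Integer>> prefixField = new HashMap<>();

    // At indexing time, store the first PREFIX_LENGTH characters as their own term.
    void addDocument(int docId, String nameFile) {
        if (nameFile.length() < PREFIX_LENGTH) {
            return; // too short to have a prefix of the chosen length
        }
        String key = nameFile.substring(0, PREFIX_LENGTH);
        prefixField.computeIfAbsent(key, k -> new ArrayList<>()).add(docId);
    }

    // At search time, a prefix of exactly PREFIX_LENGTH characters is one exact lookup.
    List<Integer> searchByPrefix(String prefix) {
        if (prefix.length() != PREFIX_LENGTH) {
            throw new IllegalArgumentException("prefix must have length " + PREFIX_LENGTH);
        }
        return prefixField.getOrDefault(prefix, Collections.emptyList());
    }

    public static void main(String[] args) {
        PrefixFieldSketch index = new PrefixFieldSketch();
        index.addDocument(1, "efe001.jpg");
        index.addDocument(2, "ap0042.jpg");
        index.addDocument(3, "efe002.jpg");
        System.out.println(index.searchByPrefix("ef"));
    }
}
```

The trade-off in this sketch is that only prefixes of the chosen length are served this way; longer or shorter prefixes would need further fields or a different approach such as the custom filter mentioned above.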

In my opinion, understanding the way PrefixQuery (and RangeQuery) expand
to BooleanQueries, and why it can cause TooManyClauses exceptions is the
second most important thing people using Lucene need to understand (after
Analyzers). Take the time to read up on it in the wiki, mailing list
archives, and LIA -- it's worth it.





-Hoss