Mailing List Archive

Potential bug
Hi,

I think this is a potential bug.

This time I set totalHitsThreshold to 10, and totalHits is reported as
1655, but I get only 10 results in total.

I think this suggests that there might be a bug in the
TopScoreDocCollector algorithm.


Best regards



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: Potential bug [ In reply to ]
Hi Baris,

totalHitsThreshold is actually a minimum threshold, not a maximum threshold.

The problem is that Lucene cannot directly identify the top matching
documents for a given query. The strategy it adopts is to start collecting
hits naively in doc ID order and to progressively raise the bar for the
minimum score that a hit needs in order to be competitive, so that it can
skip non-competitive documents. So it's expected that Lucene still collects
100s or 1000s of hits, even though the collector is configured to only
compute the top 10 hits.
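
To make the "minimum threshold" behaviour concrete, here is a toy model in plain Java. It is only a sketch under stated assumptions and not Lucene's TopScoreDocCollector: the class TopKThresholdSketch, its Outcome type and the search method are hypothetical names, scores are handed over as a plain array in doc ID order, and the model assumes every non-competitive document can be skipped as soon as the threshold is reached (real skipping is coarser, so real counts tend to be higher).

```java
import java.util.PriorityQueue;
import java.util.Random;

// Toy model only; this is NOT Lucene's TopScoreDocCollector.
// Assumption: once the threshold is reached, every non-competitive document
// can be skipped. Real skipping is coarser, so real counts are usually higher.
final class TopKThresholdSketch {

    /** Outcome of one simulated search (hypothetical type for this sketch). */
    static final class Outcome {
        final int hitsCounted;     // documents that were visited and counted
        final boolean lowerBound;  // true if at least one document was skipped
        final float[] topScores;   // retained scores, best first

        Outcome(int hitsCounted, boolean lowerBound, float[] topScores) {
            this.hitsCounted = hitsCounted;
            this.lowerBound = lowerBound;
            this.topScores = topScores;
        }
    }

    static Outcome search(float[] scoresInDocIdOrder, int numHits, int totalHitsThreshold) {
        if (numHits < 1) {
            throw new IllegalArgumentException("numHits must be at least 1");
        }
        // Min-heap: the head is the weakest hit currently in the top numHits.
        PriorityQueue<Float> heap = new PriorityQueue<>();
        int counted = 0;
        boolean skipped = false;
        for (float score : scoresInDocIdOrder) {
            boolean heapFull = heap.size() == numHits;
            boolean competitive = !heapFull || score > heap.peek();
            // Skipping is only allowed after totalHitsThreshold hits were counted.
            if (counted >= totalHitsThreshold && !competitive) {
                skipped = true;
                continue;
            }
            counted++;
            if (competitive) {
                if (heapFull) {
                    heap.poll(); // evict the weakest retained hit
                }
                heap.add(score);
            }
        }
        float[] top = new float[heap.size()];
        for (int i = top.length - 1; i >= 0; i--) {
            top[i] = heap.poll(); // poll() yields the smallest score first
        }
        return new Outcome(counted, skipped, top);
    }

    public static void main(String[] args) {
        Random random = new Random(42);
        float[] scores = new float[100_000];
        for (int i = 0; i < scores.length; i++) {
            scores[i] = random.nextFloat();
        }
        Outcome outcome = search(scores, 10, 10);
        System.out.println("results returned: " + outcome.topScores.length);
        System.out.println("hits counted:     " + outcome.hitsCounted
                + (outcome.lowerBound ? " (lower bound)" : " (exact)"));
    }
}
```

In this model the number of returned results is capped by numHits, while the threshold only guarantees that at least that many hits are counted before any skipping starts. A count well above the threshold, such as the 1655 reported above, is therefore consistent with the design: it should be read as a lower bound on the number of matches, which Lucene 8.x signals through the relation field of TopDocs.totalHits.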

On Wed, Jun 9, 2021 at 7:07 PM <baris.kazar@oracle.com> wrote:

> i set this time totalHitsThreshold to 10 and i get totalhits reported as
> 1655 but i get 10 results in total.

--
Adrien
Re: Potential bug [ In reply to ]
Thanks Adrien, but the difference is too large.

I think the algorithm needs to be revised.

What if the user needs to limit the search process?
That leaves no control.

There should be a way to speed up Lucene then, if this is not possible,
since some simple queries take half a second, which is too long.

Best regards


On 6/9/21 1:13 PM, Adrien Grand wrote:
> Hi Baris,
>
> totalhitsThreshold is actually a minimum threshold, not a maximum threshold.
Re: Potential bug [ In reply to ]
Hi Baris,

> what if the user needs to limit the search process?

What do you mean by 'limit'?

> there should be a way to speedup lucene then if this is not possible,
> since for some simple queries it takes half a second which is too long.

What do you mean by a 'simple' query? There might be multiple reasons behind the slowness of a query that are unrelated to the search itself (for example, if you retrieve many documents and extract the content of many fields for each document). Would you like to tell us a bit more about your use case?

Regards,
Diego

From: java-user@lucene.apache.org At: 06/09/21 18:18:01 To: java-user@lucene.apache.org
Cc: baris.kazar@oracle.com
Subject: Re: Potential bug

Re: Potential bug [ In reply to ]
I have only two fields: one is a string and the other is a number (stored as
a string). I guess you can't go simpler than this.

I retrieve the hits, and my major bottleneck is the Lucene fuzzy search.

I take each word from the string, which is usually at most around 10 words,
and I build a fuzzy boolean query out of them.

A simple query is something like this 10-word query.

By limit I mean that I want to stop the Lucene search at around 20 hits; I
don't want thousands of hits.


Best regards






On 6/9/21 1:25 PM, Diego Ceccarelli (BLOOMBERG/ LONDON) wrote:

> Hi Baris,
>
>> what if the user needs to limit the search process?
> What do you mean by 'limit'?
Re: Potential bug [ In reply to ]
How many documents do you have in the index?
And can you show an example query?


From: java-user@lucene.apache.org At: 06/09/21 18:33:25 To: java-user@lucene.apache.org, baris.kazar@oracle.com
Subject: Re: Potential bug

Re: Potential bug [ In reply to ]
I can't reveal those details, I am very sorry, but it is more than 1 million.

Let me say that I have a lot of code that processes results from Lucene,
but the bottleneck is the Lucene fuzzy search.

Best regards


On 6/9/21 1:53 PM, Diego Ceccarelli (BLOOMBERG/ LONDON) wrote:
> How many documents do you have in the index?
> and can you show an example of query?
Re: Potential bug [ In reply to ]
I have never used fuzzy search, but from the documentation it seems very expensive, and if you do it on 10 terms and 1M documents it seems very, very expensive.

Are you using the default 'fuzziness' parameter (0.5)? It might end up exploring a lot of documents. Did you try to play with that parameter?

Have you tried to see how the performance changes if you do not use fuzzy (just to see whether it is fuzzy that introduces the slowdown)?
Or what happens to performance if you do fuzzy with 1, 2, or 5 terms instead of 10?
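
To give a feeling for why each fuzzy term can be costly, here is a small plain-Java sketch of what "within N edits" means. It is an illustration only: FuzzyExpansionSketch, editDistance, expand and the five-word vocabulary are made up for this example, and it assumes fuzziness is expressed as a maximum number of edits (as with the maxEdits argument of Lucene's FuzzyQuery). Lucene does not scan terms by brute force like this.

```java
import java.util.ArrayList;
import java.util.List;

// Illustration only: brute-force edit distance over a tiny, made-up vocabulary.
// Lucene does not scan terms like this; the point is what "within N edits" means.
final class FuzzyExpansionSketch {

    /** Classic Levenshtein distance (insertions, deletions, substitutions). */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j chars of b takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i chars of a into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int substitution = prev[j - 1] + (a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1);
                int insertion = curr[j - 1] + 1;
                int deletion = prev[j] + 1;
                curr[j] = Math.min(substitution, Math.min(insertion, deletion));
            }
            int[] swap = prev;
            prev = curr;
            curr = swap;
        }
        return prev[b.length()];
    }

    /** Returns the vocabulary terms that a fuzzy match on word would accept. */
    static List<String> expand(String word, List<String> vocabulary, int maxEdits) {
        List<String> matches = new ArrayList<>();
        for (String term : vocabulary) {
            if (editDistance(word, term) <= maxEdits) {
                matches.add(term);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> vocabulary = List.of("lucene", "lucent", "lucerne", "licence", "solr");
        for (int maxEdits = 0; maxEdits <= 2; maxEdits++) {
            System.out.println(maxEdits + " edits: " + expand("lucene", vocabulary, maxEdits));
        }
    }
}
```

Each extra allowed edit widens the set of index terms a single query word expands to, and a 10-word fuzzy boolean query pays that cost once per word. That is one reason to compare timings with a lower edit limit or with fewer terms.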


From: java-user@lucene.apache.org At: 06/09/21 18:56:31 To: java-user@lucene.apache.org, baris.kazar@oracle.com
Subject: Re: Potential bug

Re: Potential bug [ In reply to ]
Yes, I did those, and I believe I am at the best level of performance now.
It is not bad at all, but I want to make it much better.

I see something like a linear drop in timings when I go to a lower number
of words, but let me do that quick study again.

Fuzzy search is always expensive, but it seems to suit my needs best.


Thanks Diego for these great questions; I have already explored them. But
thanks again.

Best regards


On 6/9/21 2:04 PM, Diego Ceccarelli (BLOOMBERG/ LONDON) wrote:
> I have never used fuzzy search but from the documentation it seems very expensive, and if you do it on 10 terms and 1M documents it seems very very very expensive.
Re: Potential bug [ In reply to ]
Hi Bazir,
This feels like an XY problem [1].
Can you describe your original user requirement?
Most of the time, at the cost of indexing time/space, you can get quicker
query times.
Also, you should identify where you are spending most of your time: in the
matching phase (identifying candidates from the corpus of documents) or in
the ranking phase (scoring them by relevance)?

TopScoreDocCollector is quite a solid class; there's a ton to study,
analyze and experiment with before raising the alarm of a bug :)

Also, I didn't understand this:
"what if the user needs to limit the search process?"
Can you elaborate?

Cheers



[1] https://xyproblem.info
--------------------------
Alessandro Benedetti
Apache Lucene/Solr Committer
Director, R&D Software Engineer, Search Consultant

www.sease.io


On Wed, 9 Jun 2021 at 19:08, <baris.kazar@oracle.com> wrote:

> Yes, i did those and i believe i am at the best level of performance now
> and it is not bad at all but i want to make it much better.
>
> i see like a linear drop in timings when i go lower number of words but
> let me do that quick study again.
>
> Fuzzy search is always expensive but that seems to suit best to my needs.
>
>
> Thanks Diego for these great questions and i already explored them. But
> thanks again.
>
> Best regards
>
>
> On 6/9/21 2:04 PM, Diego Ceccarelli (BLOOMBERG/ LONDON) wrote:
> > I have never used fuzzy search but from the documentation it seems very
> expensive, and if you do it on 10 terms and 1M documents it seems very very
> very expensive.
> >
> > Are you using the default 'fuzzyness' parameter? (0.5) - It might end up
> exploring a lot of documents, did you try to play with that parameter?
> >
> > Have you tried to see how the performance change if you do not use fuzzy
> (just to see if is fuzzy the introduce the slow down)?
> > Or what happens to performance if you do fuzzy with 1, 2, 5 terms
> instead of 10?
> >
> >
> > From: java-user@lucene.apache.org At: 06/09/21 18:56:31To:
> java-user@lucene.apache.org, baris.kazar@oracle.com
> > Subject: Re: Potential bug
> >
> > i cant reveal those details i am very sorry. but it is more than 1
> million.
> >
> > let me tell that i have a lot of code that processes results from lucene
> > but the bottle neck is lucene fuzzy search.
> >
> > Best regards
> >
> >
> > On 6/9/21 1:53 PM, Diego Ceccarelli (BLOOMBERG/ LONDON) wrote:
> >> How many documents do you have in the index?
> >> and can you show an example of query?
> >>
> >>
> >> From: java-user@lucene.apache.org At: 06/09/21 18:33:25To:
> > java-user@lucene.apache.org, baris.kazar@oracle.com
> >> Subject: Re: Potential bug
> >>
> >> i have only two fields one string the other is a number (stored as
> >> string), i guess you cant go simpler than this.
> >>
> >> i retreieve the hits and my major bottleneck is lucene fuzzy search.
> >>
> >>
> >> i take each word from the string which is usually around at most 10
> words
> >>
> >> i build a fuzzy boolean query out of them.
> >>
> >>
> >> simple query is like this 10 word query.
> >>
> >>
> >> limit means i want to stop lucene search around 20 hits i dont want
> >> thousands of hits.
> >>
> >>
> >> Best regards
> >>
> >>
> >> On 6/9/21 1:25 PM, Diego Ceccarelli (BLOOMBERG/ LONDON) wrote:
> >>
> >>> Hi Baris,
> >>>
> >>>> what if the user needs to limit the search process?
> >>> What do you mean by 'limit'?
> >>>
> >>>> there should be a way to speedup lucene then if this is not possible,
> >>>> since for some simple queries it takes half a second which is too
> long.
> >>> What do you mean by 'simple' query? there might be multiple reasons
> behind
> >> slowness of a query that are unrelated to the search (for example, if
> you
> >> retrieve many documents and for each document you are extracting the
> content
> > of
> >> many fields) - would you like to tell us a bit more about your use case?
> >>> Regards,
> >>> Diego
> >>>
> >>> From: java-user@lucene.apache.org At: 06/09/21 18:18:01To:
> >> java-user@lucene.apache.org
> >>> Cc: baris.kazar@oracle.com
> >>> Subject: Re: Potential bug
> >>>
> >>> Thanks Adrien, but the differences is too far apart.
> >>>
> >>> I think the algorithm needs to be revised.
> >>>
> >>>
> >>> what if the user needs to limit the search process?
> >>>
> >>> that leaves no control.
> >>>
> >>> there should be a way to speedup lucene then if this is not possible,
> >>>
> >>> since for some simple queries it takes half a second which is too long.
> >>>
> >>> Best regards
> >>>
> >>>
> >>> On 6/9/21 1:13 PM, Adrien Grand wrote:
> >>>> Hi Baris,
> >>>>
> >>>> totalhitsThreshold is actually a minimum threshold, not a maximum
> threshold.
> >>>>
> >>>> The problem is that Lucene cannot directly identify the top matching
> >>>> documents for a given query. The strategy it adopts is to start
> collecting
> >>>> hits naively in doc ID order and to progressively raise the bar about
> the
> >>>> minimum score that is required for a hit to be competitive in order
> to skip
> >>>> non-competitive documents. So it's expected that Lucene still
> collects 100s
> >>>> or 1000s of hits, even though the collector is configured to only
> compute
> >>>> the top 10 hits.
> >>>>
> >>>> On Wed, Jun 9, 2021 at 7:07 PM <baris.kazar@oracle.com> wrote:
> >>>>
> >>>>> Hi,-
> >>>>>
> >>>>> i think this is a potential bug
> >>>>>
> >>>>>
> >>>>> i set this time totalHitsThreshold to 10 and i get totalhits
> reported as
> >>>>> 1655 but i get 10 results in total.
> >>>>>
> >>>>> I think this suggests that there might be a bug with
> >>>>> TopScoreDocCollector algorithm.
> >>>>>
> >>>>>
> >>>>> Best regards
> >>>>>
> >>>>>
> >>>>>
Re: Potential bug [ In reply to ]
Let's start with writing my name correctly.

Then we can talk.

Best regards


On 6/11/21 11:57 AM, Alessandro Benedetti wrote:
> Hi Bazir,
> this feels like an X Y problem [1].
> Can you express what is your original user requirement?
> Most of the time, at the cost of indexing time/space you may get quicker
> query times.
> Also, you should identify where are you wasting most of your time, in the
> matching phase (identifying candidates from the corpus of documents) or in
> the ranking phase (scoring them by relevance)?
>
> TopScoreDocCollector is quite a solid class, there's a ton to study,
> analyze and experiment before raising the alarm of a bug :)
>
> Also didn't understand this :
> "what if the user needs to limit the search process?"
> Can you elaborate?
>
> Cheers
>
>
>
> [1] https://xyproblem.info
> --------------------------
> Alessandro Benedetti
> Apache Lucene/Solr Committer
> Director, R&D Software Engineer, Search Consultant
>
> http://www.sease.io
>
>
Re: Potential bug [ In reply to ]
I expect the answers from this list to be more professional, please.

You don't have to answer on this list if you intend to insult.

Best regards


Re: Potential bug [ In reply to ]
Let me guide you to a professional answer to the email below:


Hi Baris,

Since you mentioned that you did a full performance study of your
application and still believe that

the bottleneck is the fuzzy search API from Lucene, it would be best to
time the application for:

* the matching phase (identifying candidates from the corpus of documents)
* the ranking phase (scoring them by relevance)

Maybe this will help speed things up further.

Also, what do you mean by "what if the user needs to limit the search
process"? Can you elaborate?

Cheers



My answer would be:

I can't access the Lucene code, so how can I time these two cases, please?

By that sentence I mean that once I see the hits are good, I would like
to limit the number of hits.



This is more like a professional conversation, please. Thanks.

Best regards
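
On the question of how to time the two phases without touching Lucene's internals: one common approach is to run the same query twice, once so that it only counts matches and once so that it also scores them, and then compare the wall-clock times. The sketch below shows the idea on a toy in-memory "index". The names PhaseTimingDemo, countMatches, matchAndScore and timeNanos are hypothetical and made up for this illustration; they are not Lucene APIs. In Lucene itself the match-only run might correspond to IndexSearcher.count(query) and the scoring run to a regular top-k search, but that mapping is an assumption, not something stated in this thread.

```java
import java.util.function.IntPredicate;
import java.util.function.IntToDoubleFunction;

final class PhaseTimingDemo {

    /** Match-only pass: visits every doc ID and counts matches without scoring. */
    static int countMatches(int maxDoc, IntPredicate matches) {
        int count = 0;
        for (int doc = 0; doc < maxDoc; doc++) {
            if (matches.test(doc)) {
                count++;
            }
        }
        return count;
    }

    /** Match-and-score pass: also computes a score per match and keeps the best one. */
    static double matchAndScore(int maxDoc, IntPredicate matches, IntToDoubleFunction score) {
        double best = Double.NEGATIVE_INFINITY;
        for (int doc = 0; doc < maxDoc; doc++) {
            if (matches.test(doc)) {
                best = Math.max(best, score.applyAsDouble(doc));
            }
        }
        return best;
    }

    /** Wall-clock time of a task in nanoseconds. */
    static long timeNanos(Runnable task) {
        long start = System.nanoTime();
        task.run();
        return System.nanoTime() - start;
    }

    public static void main(String[] args) {
        int maxDoc = 1_000_000;
        IntPredicate matches = doc -> doc % 3 == 0;          // stand-in for "document matches"
        IntToDoubleFunction score = doc -> Math.log1p(doc);  // stand-in for a relevance score

        long matchOnly = timeNanos(() -> countMatches(maxDoc, matches));
        long matchAndRank = timeNanos(() -> matchAndScore(maxDoc, matches, score));

        System.out.println("match only:    " + matchOnly / 1_000 + " us");
        System.out.println("match + score: " + matchAndRank / 1_000 + " us");
    }
}
```

A single run like this is noisy (JIT warm-up, caches), so the measurement would need to be repeated several times before the difference between the two numbers means anything.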


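On limiting the number of hits: the behaviour reported at the start of the thread (10 results returned, a much larger total hit count) can be illustrated with a toy top-k collector. This is a simplified model of the explanation given earlier in the thread, not Lucene's code: TopKDemo, Result and search are hypothetical names, and the per-document skipping is an idealisation. As far as I understand, real Lucene skips in coarser units and only for query types that support it, so it usually counts more hits than this loop does.

```java
import java.util.PriorityQueue;
import java.util.Random;

final class TopKDemo {

    /** Outcome of one toy search. */
    static final class Result {
        final int countedHits;    // hits the collector actually visited and counted
        final float[] topScores;  // at most k best scores, highest first

        Result(int countedHits, float[] topScores) {
            this.countedHits = countedHits;
            this.topScores = topScores;
        }
    }

    /**
     * Walks matching documents in doc ID order. Hits are always counted until
     * totalHitsThreshold is reached; after that, a hit that cannot beat the
     * weakest of the current top k is skipped and no longer counted.
     */
    static Result search(float[] scoresInDocIdOrder, int k, int totalHitsThreshold) {
        if (k < 1) {
            throw new IllegalArgumentException("k must be at least 1");
        }
        PriorityQueue<Float> heap = new PriorityQueue<>(); // min-heap: weakest kept hit on top
        int counted = 0;
        for (float score : scoresInDocIdOrder) {
            boolean barIsSet = counted >= totalHitsThreshold && heap.size() == k;
            if (barIsSet && score <= heap.peek()) {
                continue; // non-competitive: skipped, not counted
            }
            counted++;
            if (heap.size() < k) {
                heap.add(score);
            } else if (score > heap.peek()) {
                heap.poll();
                heap.add(score);
            }
        }
        // Drain the min-heap from the back so the result is sorted best-first.
        float[] top = new float[heap.size()];
        for (int i = top.length - 1; i >= 0; i--) {
            top[i] = heap.poll();
        }
        return new Result(counted, top);
    }

    public static void main(String[] args) {
        float[] scores = new float[1655];
        Random random = new Random(42);
        for (int i = 0; i < scores.length; i++) {
            scores[i] = random.nextFloat();
        }
        Result result = search(scores, 10, 10);
        System.out.println("returned: " + result.topScores.length
                + ", counted hits: " + result.countedHits);
    }
}
```

In this model the threshold is a floor on how many hits get counted, while k alone decides how many results come back. Stopping outright after the first 20 matches would be a different trade-off: it returns the first 20 matching documents in doc ID order, which are not necessarily the 20 best-scoring ones.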
Re: Potential bug [ In reply to ]
Hi Baris,
first of all, apologies for having misspelled your name; it was definitely
not meant as an insult.
Secondly, your tone is not acceptable on this mailing list (or anywhere
else).
You must remember that we committers operate on a volunteer basis,
contributing code and helping people in our free time, driven purely
by passion.
Respect is fundamental; we are not here to be treated aggressively.

Regards

--------------------------
Alessandro Benedetti
Apache Lucene/Solr Committer
Director, R&D Software Engineer, Search Consultant

www.sease.io


On Fri, 11 Jun 2021 at 17:10, <baris.kazar@oracle.com> wrote:

> Let me guide to a professional answer to the below email:
>
>
> Hi Baris,
>
> Since You mentioned You did all the performance study on your
> application and still believe that
>
> the bottleneck is the fuzzy search api from Lucene, it would be best to
> time the application for:
>
> * matching phase (identifying candidates from the corpus of documents)
> * or in the ranking phase (scoring them by relevance)?
>
> Maybe this will help speedup further.
>
> Also, what do You mean by "what is the user needs to to limit te search
> process" ? can you elaborate?
>
> Cheers
>
>
>
> My answer would be :
>
> i cant access the Lucene code so how can time these two cases please?
>
> i mean by that sentence that when i see the hits are good i would like
> to limit the number of hits.
>
>
>
> this is more like a professional conversation please. Thanks.
>
> Best regards
>
>
> On 6/11/21 11:57 AM, Alessandro Benedetti wrote:
> > Hi Bazir,
> > this feels like an X Y problem [1 <
> https://urldefense.com/v3/__https://xyproblem.info__;!!GqivPVa7Brio!IrgovQa8yo6rznUAykFBDcTgg_ixlPdRqBgWx6UAfWeZTlJ99CVYsv69Tq2Yo0eBzg$
> >].
> > Can you express what is your original user requirement?
> > Most of the time, at the cost of indexing time/space you may get quicker
> > query times.
> > Also, you should identify where are you wasting most of your time, in the
> > matching phase (identifying candidates from the corpus of documents) or
> in
> > the ranking phase (scoring them by relevance)?
> >
> > TopScoreDocCollector is quite a solid class, there's a ton to study,
> > analyze and experiment before raising the alarm of a bug :)
> >
> > Also didn't understand this :
> > "what if the user needs to limit the search process?"
> > Can you elaborate?
> >
> > Cheers
> >
> >
> >
> > [1]
> https://urldefense.com/v3/__https://xyproblem.info__;!!GqivPVa7Brio!IrgovQa8yo6rznUAykFBDcTgg_ixlPdRqBgWx6UAfWeZTlJ99CVYsv69Tq2Yo0eBzg$
> > --------------------------
> > Alessandro Benedetti
> > Apache Lucene/Solr Committer
> > Director, R&D Software Engineer, Search Consultant
> >
> >
> https://urldefense.com/v3/__http://www.sease.io__;!!GqivPVa7Brio!IrgovQa8yo6rznUAykFBDcTgg_ixlPdRqBgWx6UAfWeZTlJ99CVYsv69Tq07hrsXPw$
> >
> >
> > On Wed, 9 Jun 2021 at 19:08, <baris.kazar@oracle.com> wrote:
> >
> >> Yes, i did those and i believe i am at the best level of performance now
> >> and it is not bad at all but i want to make it much better.
> >>
> >> i see like a linear drop in timings when i go lower number of words but
> >> let me do that quick study again.
> >>
> >> Fuzzy search is always expensive but that seems to suit best to my
> needs.
> >>
> >>
> >> Thanks Diego for these great questions and i already explored them. But
> >> thanks again.
> >>
> >> Best regards
> >>
> >>
> >> On 6/9/21 2:04 PM, Diego Ceccarelli (BLOOMBERG/ LONDON) wrote:
> >>> I have never used fuzzy search but from the documentation it seems very
> >> expensive, and if you do it on 10 terms and 1M documents it seems very
> very
> >> very expensive.
> >>> Are you using the default 'fuzzyness' parameter? (0.5) - It might end
> up
> >> exploring a lot of documents, did you try to play with that parameter?
> >>> Have you tried to see how the performance change if you do not use
> fuzzy
> >> (just to see if is fuzzy the introduce the slow down)?
> >>> Or what happens to performance if you do fuzzy with 1, 2, 5 terms
> >> instead of 10?
> >>>
> >>> From: java-user@lucene.apache.org At: 06/09/21 18:56:31To:
> >> java-user@lucene.apache.org, baris.kazar@oracle.com
> >>> Subject: Re: Potential bug
> >>>
> >>> i cant reveal those details i am very sorry. but it is more than 1
> >> million.
> >>> let me tell that i have a lot of code that processes results from
> lucene
> >>> but the bottle neck is lucene fuzzy search.
> >>>
> >>> Best regards
> >>>
> >>>
> >>> On 6/9/21 1:53 PM, Diego Ceccarelli (BLOOMBERG/ LONDON) wrote:
> >>>> How many documents do you have in the index?
> >>>> and can you show an example of query?
> >>>>
> >>>>
> >>>> From: java-user@lucene.apache.org At: 06/09/21 18:33:25To:
> >>> java-user@lucene.apache.org, baris.kazar@oracle.com
> >>>> Subject: Re: Potential bug
> >>>>
> >>>> i have only two fields one string the other is a number (stored as
> >>>> string), i guess you cant go simpler than this.
> >>>>
> >>>> i retreieve the hits and my major bottleneck is lucene fuzzy search.
> >>>>
> >>>>
> >>>> i take each word from the string which is usually around at most 10
> >> words
> >>>> i build a fuzzy boolean query out of them.
> >>>>
> >>>>
> >>>> simple query is like this 10 word query.
> >>>>
> >>>>
> >>>> limit means i want to stop lucene search around 20 hits i dont want
> >>>> thousands of hits.
> >>>>
> >>>>
> >>>> Best regards
> >>>>
> >>>>
> >>>> On 6/9/21 1:25 PM, Diego Ceccarelli (BLOOMBERG/ LONDON) wrote:
> >>>>
> >>>>> Hi Baris,
> >>>>>
> >>>>>> what if the user needs to limit the search process?
> >>>>> What do you mean by 'limit'?
> >>>>>
> >>>>>> there should be a way to speedup lucene then if this is not
> possible,
> >>>>>> since for some simple queries it takes half a second which is too
> >> long.
> >>>>> What do you mean by 'simple' query? there might be multiple reasons
> >> behind
> >>>> slowness of a query that are unrelated to the search (for example, if
> >> you
> >>>> retrieve many documents and for each document you are extracting the
> >> content
> >>> of
> >>>> many fields) - would you like to tell us a bit more about your use
> case?
> >>>>> Regards,
> >>>>> Diego
> >>>>>
> >>>>> From: java-user@lucene.apache.org At: 06/09/21 18:18:01To:
> >>>> java-user@lucene.apache.org
> >>>>> Cc: baris.kazar@oracle.com
> >>>>> Subject: Re: Potential bug
> >>>>>
> >>>>> Thanks Adrien, but the differences is too far apart.
> >>>>>
> >>>>> I think the algorithm needs to be revised.
> >>>>>
> >>>>>
> >>>>> what if the user needs to limit the search process?
> >>>>>
> >>>>> that leaves no control.
> >>>>>
> >>>>> there should be a way to speedup lucene then if this is not possible,
> >>>>>
> >>>>> since for some simple queries it takes half a second which is too
> long.
> >>>>>
> >>>>> Best regards
> >>>>>
> >>>>>
> >>>>> On 6/9/21 1:13 PM, Adrien Grand wrote:
> >>>>>> Hi Baris,
> >>>>>>
> >>>>>> totalhitsThreshold is actually a minimum threshold, not a maximum threshold.
> >>>>>>
> >>>>>> The problem is that Lucene cannot directly identify the top matching
> >>>>>> documents for a given query. The strategy it adopts is to start collecting
> >>>>>> hits naively in doc ID order and to progressively raise the bar about the
> >>>>>> minimum score that is required for a hit to be competitive in order to skip
> >>>>>> non-competitive documents. So it's expected that Lucene still collects 100s
> >>>>>> or 1000s of hits, even though the collector is configured to only compute
> >>>>>> the top 10 hits.
> >>>>>>
> >>>>>> On Wed, Jun 9, 2021 at 7:07 PM <baris.kazar@oracle.com> wrote:
> >>>>>>
> >>>>>>> Hi,-
> >>>>>>>
> >>>>>>> i think this is a potential bug
> >>>>>>>
> >>>>>>>
> >>>>>>> i set this time totalHitsThreshold to 10 and i get totalhits reported as
> >>>>>>> 1655 but i get 10 results in total.
> >>>>>>>
> >>>>>>> I think this suggests that there might be a bug with
> >>>>>>> TopScoreDocCollector algorithm.
> >>>>>>>
> >>>>>>>
> >>>>>>> Best regards
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
>
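
A note on the explanation quoted above: the gap between the 10 returned
results and the 1655 reported hits can be illustrated with a small,
self-contained sketch. This is not Lucene's implementation. The class
TopKSketch, its search method and the per-block maximum score are
hypothetical stand-ins, and the sketch assumes only what Adrien describes:
hits are collected in doc ID order, the count stays exact until
totalHitsThreshold is reached, and after that whole blocks whose best score
cannot beat the current k-th best score are skipped.

```java
import java.util.Arrays;
import java.util.PriorityQueue;

/** Simplified model of top-k collection with a "minimum" hit-count threshold. Not Lucene code. */
final class TopKSketch {

    /** Outcome of one simulated search. */
    static final class Result {
        final int collected;     // hits the collector actually visited
        final float[] topScores; // the k best scores, lowest first
        Result(int collected, float[] topScores) {
            this.collected = collected;
            this.topScores = topScores;
        }
    }

    /**
     * scores[i] is the score of the i-th matching document, in doc ID order.
     * blockSize is a hypothetical grouping of documents for which a maximum
     * score is known up front (here it is simply computed by scanning).
     */
    static Result search(float[] scores, int k, int totalHitsThreshold, int blockSize) {
        if (k <= 0 || blockSize <= 0) {
            throw new IllegalArgumentException("k and blockSize must be positive");
        }
        PriorityQueue<Float> heap = new PriorityQueue<>(); // min-heap: head is the k-th best score
        int collected = 0;
        for (int start = 0; start < scores.length; start += blockSize) {
            int end = Math.min(start + blockSize, scores.length);
            float blockMax = Float.NEGATIVE_INFINITY;
            for (int i = start; i < end; i++) {
                blockMax = Math.max(blockMax, scores[i]);
            }
            // Skipping is only allowed once the threshold has been counted and k hits are held.
            boolean maySkip = collected >= totalHitsThreshold && heap.size() == k;
            if (maySkip && blockMax <= heap.peek()) {
                continue; // nothing in this block can be competitive
            }
            for (int i = start; i < end; i++) {
                collected++;
                if (heap.size() < k) {
                    heap.add(scores[i]);
                } else if (scores[i] > heap.peek()) {
                    heap.poll();
                    heap.add(scores[i]);
                }
            }
        }
        float[] top = new float[heap.size()];
        for (int i = 0; i < top.length; i++) {
            top[i] = heap.poll(); // ascending order
        }
        return new Result(collected, top);
    }

    public static void main(String[] args) {
        float[] scores = new float[1000];
        Arrays.fill(scores, 1.0f);
        for (int i = 0; i < 10; i++) {
            scores[i] = 5.0f; // ten strong matches at the start of the index
        }
        System.out.println(search(scores, 10, 10, 100).collected);
        System.out.println(search(scores, 10, 1000, 100).collected);
    }
}
```

With ten high-scoring documents in the first block of 100 and 900
low-scoring ones after it, the sketch collects 100 hits at threshold 10 and
all 1000 at threshold 1000, and returns the same top 10 in both cases. The
threshold bounds the counting from below, not the work from above.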
Re: Potential bug [ In reply to ]
Baris, you called out an insult from Alessandro and your replies suggest
anger, but I couldn't see an insult from Alessandro actually.

+1 to Alessandro's call to make the tone softer on this discussion.

On Mon, Jun 14, 2021 at 11:28 AM Alessandro Benedetti <a.benedetti@sease.io>
wrote:

> Hi Baris,
> first of all apologies for having misspelled your name, definitely, it was
> not meant as an insult.
> Secondly, your tone is not acceptable on this mailing list (or anywhere
> else).
> You must remember that we, committers, are operating on a volunteering
> basis, contributing code and helping people in our free time purely driven
> by passion.
> Respect is fundamental, we are not here to be treated aggressively.
>
> Regards
>
> --------------------------
> Alessandro Benedetti
> Apache Lucene/Solr Committer
> Director, R&D Software Engineer, Search Consultant
>
> www.sease.io


--
Adrien
Re: Potential bug [ In reply to ]
+1 to Adrien.

Let's keep the tone neutral.

On Mon, 14 Jun 2021, 16:00 Adrien Grand, <jpountz@gmail.com> wrote:

> Baris, you called out an insult from Alessandro and your replies suggest
> anger, but I couldn't see an insult from Alessandro actually.
>
> +1 to Alessandro's call to make the tone softer on this discussion.
>
> --
> Adrien
>
Re: Potential bug [ In reply to ]
Dear Folks,
I have a lot of experience in performance tuning and parallel processing: 17+7 years. So when you say "you don't know what you are asking for", that does not sound good at all; besides, I was clear about what I was asking.

Alessandro, I appreciate the apology, and I would like to apologize if I hurt anyone's feelings; I never mean to hurt anybody's feelings. I still think I was not aggressive, but I need to re-explain
what was wrong with the email:

I was not trying to be aggressive with my responses.
I have been writing on this list for a long time and have never received an email like yours.

I revised your email for this list because, with my expertise, I don't think I should get a comment like the X Y problem example.

Moreover, code can have bugs, and "raising the alarm" is not a good choice of words here. I am not here to find problems with Lucene; we are all here to use Lucene and make it better.

I appreciate the work the committers are doing as volunteers; there is no doubt about that. Lucene 8.y.z is much better thanks to your work. Kudos on that success.

What I am looking for here is that we keep the tone neutral. Yes, respect is fundamental;
that is what I have been saying in my last emails.

Would you please look at my revised email?
I think the email should have been composed
that way.

I would like to focus on my question, please.
I hope we keep the tone neutral and professional.
Thanks for understanding.

Best regards
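
On the question itself (stopping the search at around 20 hits), the
trade-off can be shown with a minimal sketch in plain Java rather than
Lucene code. EarlyStopSketch, firstN and topN are hypothetical names used
only for illustration. Stopping after the first n matches is cheap but
returns whichever documents come first in doc ID order. Returning the n
best matches requires visiting, or provably skipping, every match. Lucene
does provide CollectionTerminatedException for ending collection of a
segment early from a custom collector, which corresponds roughly to the
firstN case here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

/** Contrasts "stop after n hits" with "return the n best hits". Not Lucene code. */
final class EarlyStopSketch {

    /** Stops after the first n matches in doc ID order; cheap, but ignores scores. */
    static List<Integer> firstN(float[] scores, int n) {
        List<Integer> docs = new ArrayList<>();
        for (int doc = 0; doc < scores.length && docs.size() < n; doc++) {
            docs.add(doc);
        }
        return docs;
    }

    /** Looks at every match and keeps the n highest-scoring doc IDs, best first. */
    static List<Integer> topN(float[] scores, int n) {
        return IntStream.range(0, scores.length)
                .boxed()
                .sorted((a, b) -> Float.compare(scores[b], scores[a])) // descending by score
                .limit(n)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        float[] scores = {0.1f, 0.9f, 0.2f, 0.8f, 0.3f}; // indexed by doc ID
        System.out.println("first 2: " + firstN(scores, 2));
        System.out.println("top 2:   " + topN(scores, 2));
    }
}
```

For the five sample scores, firstN returns docs 0 and 1 while topN returns
docs 1 and 3, so a hard stop at n hits is only acceptable when any n
matches will do.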
________________________________
From: Atri Sharma <atri@apache.org>
Sent: Monday, June 14, 2021 8:46 AM
To: java-user@lucene.apache.org
Cc: Baris Kazar
Subject: Re: Potential bug

+1 to Adrien.

Let's keep the tone neutral.

On Mon, 14 Jun 2021, 16:00 Adrien Grand, <jpountz@gmail.com> wrote:
Baris, you called out an insult from Alessandro and your replies suggest
anger, but I couldn't see an insult from Alessandro actually.

+1 to Alessandro's call to make the tone softer on this discussion.

On Mon, Jun 14, 2021 at 11:28 AM Alessandro Benedetti <a.benedetti@sease.io>
wrote:

> Hi Baris,
> first of all apologies for having misspelled your name, definitely, it was
> not meant as an insult.
> Secondly, your tone is not acceptable on this mailing list (or anywhere
> else).
> You must remember that we, committers, are operating on a volunteering
> basis, contributing code and helping people in our free time purely driven
> by passion.
> Respect is fundamental, we are not here to be treated aggressively.
>
> Regards
>
> --------------------------
> Alessandro Benedetti
> Apache Lucene/Solr Committer
> Director, R&D Software Engineer, Search Consultant
>
> www.sease.io
>
>
> On Fri, 11 Jun 2021 at 17:10, <baris.kazar@oracle.com> wrote:
>
> > Let me guide to a professional answer to the below email:
> >
> >
> > Hi Baris,
> >
> > Since You mentioned You did all the performance study on your
> > application and still believe that
> >
> > the bottleneck is the fuzzy search api from Lucene, it would be best to
> > time the application for:
> >
> > * matching phase (identifying candidates from the corpus of documents)
> > * or in the ranking phase (scoring them by relevance)?
> >
> > Maybe this will help speedup further.
> >
> > Also, what do You mean by "what is the user needs to to limit te search
> > process" ? can you elaborate?
> >
> > Cheers
> >
> >
> >
> > My answer would be :
> >
> > i cant access the Lucene code so how can time these two cases please?
> >
> > i mean by that sentence that when i see the hits are good i would like
> > to limit the number of hits.
> >
> >
> >
> > this is more like a professional conversation please. Thanks.
> >
> > Best regards
> >
> >
> > On 6/11/21 11:57 AM, Alessandro Benedetti wrote:
> > > Hi Bazir,
> > > this feels like an X Y problem [1].
> > > Can you express what is your original user requirement?
> > > Most of the time, at the cost of indexing time/space you may get quicker
> > > query times.
> > > Also, you should identify where are you wasting most of your time, in the
> > > matching phase (identifying candidates from the corpus of documents) or in
> > > the ranking phase (scoring them by relevance)?
> > >
> > > TopScoreDocCollector is quite a solid class, there's a ton to study,
> > > analyze and experiment before raising the alarm of a bug :)
> > >
> > > Also didn't understand this :
> > > "what if the user needs to limit the search process?"
> > > Can you elaborate?
> > >
> > > Cheers
> > >
> > > [1] https://xyproblem.info
> > > --------------------------
> > > Alessandro Benedetti
> > > Apache Lucene/Solr Committer
> > > Director, R&D Software Engineer, Search Consultant
> > >
> > > http://www.sease.io
> > >
> > >
> > > On Wed, 9 Jun 2021 at 19:08, <baris.kazar@oracle.com> wrote:
> > >
> > >> Yes, i did those and i believe i am at the best level of performance now
> > >> and it is not bad at all but i want to make it much better.
> > >>
> > >> i see like a linear drop in timings when i go lower number of words but
> > >> let me do that quick study again.
> > >>
> > >> Fuzzy search is always expensive but that seems to suit best to my needs.
> > >>
> > >>
> > >> Thanks Diego for these great questions and i already explored them. But
> > >> thanks again.
> > >>
> > >> Best regards
> > >>
> > >>
> > >> On 6/9/21 2:04 PM, Diego Ceccarelli (BLOOMBERG/ LONDON) wrote:
> > >>> I have never used fuzzy search but from the documentation it seems very
> > >>> expensive, and if you do it on 10 terms and 1M documents it seems very very
> > >>> very expensive.
> > >>> Are you using the default 'fuzzyness' parameter? (0.5) - It might end up
> > >>> exploring a lot of documents, did you try to play with that parameter?
> > >>> Have you tried to see how the performance change if you do not use fuzzy
> > >>> (just to see if is fuzzy the introduce the slow down)?
> > >>> Or what happens to performance if you do fuzzy with 1, 2, 5 terms
> > >>> instead of 10?
> > >>>
> > >>> From: java-user@lucene.apache.org At: 06/09/21 18:56:31 To:
> > >>> java-user@lucene.apache.org, baris.kazar@oracle.com
> > >>> Subject: Re: Potential bug
> > >>>
> > >>> i cant reveal those details i am very sorry. but it is more than 1 million.
> > >>> let me tell that i have a lot of code that processes results from lucene
> > >>> but the bottle neck is lucene fuzzy search.
> > >>>
> > >>> Best regards
> > >>>
> > >>>
> > >>> On 6/9/21 1:53 PM, Diego Ceccarelli (BLOOMBERG/ LONDON) wrote:
> > >>>> How many documents do you have in the index?
> > >>>> and can you show an example of query?
> > >>>>
> > >>>>
> > >>>> From: java-user@lucene.apache.org At: 06/09/21 18:33:25 To:
> > >>>> java-user@lucene.apache.org, baris.kazar@oracle.com
> > >>>> Subject: Re: Potential bug
> > >>>>
> > >>>> i have only two fields one string the other is a number (stored as
> > >>>> string), i guess you cant go simpler than this.
> > >>>>
> > >>>> i retreieve the hits and my major bottleneck is lucene fuzzy search.
> > >>>>
> > >>>>
> > >>>> i take each word from the string which is usually around at most 10 words
> > >>>> i build a fuzzy boolean query out of them.
> > >>>>
> > >>>>
> > >>>> simple query is like this 10 word query.
> > >>>>
> > >>>>
> > >>>> limit means i want to stop lucene search around 20 hits i dont want
> > >>>> thousands of hits.
> > >>>>
> > >>>>
> > >>>> Best regards
> > >>>>
> > >>>>
> > >>>> On 6/9/21 1:25 PM, Diego Ceccarelli (BLOOMBERG/ LONDON) wrote:
> > >>>>
> > >>>>> Hi Baris,
> > >>>>>
> > >>>>>> what if the user needs to limit the search process?
> > >>>>> What do you mean by 'limit'?
> > >>>>>
> > >>>>>> there should be a way to speedup lucene then if this is not possible,
> > >>>>>> since for some simple queries it takes half a second which is too long.
> > >>>>> What do you mean by a 'simple' query? There might be multiple reasons
> > >>>>> behind the slowness of a query that are unrelated to the search (for
> > >>>>> example, if you retrieve many documents and for each document you are
> > >>>>> extracting the content of many fields). Would you like to tell us a bit
> > >>>>> more about your use case?
> > >>>>> Regards,
> > >>>>> Diego
> > >>>>>
> > >>>>> From: java-user@lucene.apache.org At: 06/09/21 18:18:01 To:
> > >>>>> java-user@lucene.apache.org
> > >>>>> Cc: baris.kazar@oracle.com
> > >>>>> Subject: Re: Potential bug
> > >>>>>
> > >>>>> Thanks Adrien, but the difference is too large.
> > >>>>>
> > >>>>> I think the algorithm needs to be revised.
> > >>>>>
> > >>>>>
> > >>>>> What if the user needs to limit the search process? That leaves no
> > >>>>> control.
> > >>>>>
> > >>>>> If this is not possible, there should be a way to speed up Lucene,
> > >>>>> since some simple queries take half a second, which is too long.
> > >>>>>
> > >>>>> Best regards
> > >>>>>
> > >>>>>
> > >>>>> On 6/9/21 1:13 PM, Adrien Grand wrote:
> > >>>>>> Hi Baris,
> > >>>>>>
> > >>>>>> totalHitsThreshold is actually a minimum threshold, not a maximum
> > >>>>>> threshold.
> > >>>>>>
> > >>>>>> The problem is that Lucene cannot directly identify the top matching
> > >>>>>> documents for a given query. The strategy it adopts is to start
> > >>>>>> collecting hits naively in doc ID order and to progressively raise the
> > >>>>>> bar about the minimum score that is required for a hit to be
> > >>>>>> competitive in order to skip non-competitive documents. So it's
> > >>>>>> expected that Lucene still collects 100s or 1000s of hits, even though
> > >>>>>> the collector is configured to only compute the top 10 hits.
> > >>>>>>
> > >>>>>> On Wed, Jun 9, 2021 at 7:07 PM <baris.kazar@oracle.com> wrote:
> > >>>>>>
> > >>>>>>> Hi,-
> > >>>>>>>
> > >>>>>>> i think this is a potential bug
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> i set this time totalHitsThreshold to 10 and i get totalhits reported
> > >>>>>>> as 1655 but i get 10 results in total.
> > >>>>>>>
> > >>>>>>> I think this suggests that there might be a bug with
> > >>>>>>> TopScoreDocCollector algorithm.
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> Best regards


--
Adrien
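
To illustrate the explanation quoted above, here is a minimal, self-contained sketch of why a top-10 collection can report far more than 10 hits. It is a toy model and not Lucene's code: the class TopKSketch, the collect and ramp methods, and the synthetic scores are all made up for this illustration, and real Lucene skips whole blocks of documents rather than single hits, so its counts will differ.

```java
import java.util.PriorityQueue;

final class TopKSketch {

    /**
     * Toy collector (an assumption-laden model, not Lucene's implementation).
     * Visits scores in doc ID order, keeps the k best, and counts hits exactly
     * only until totalHitsThreshold is reached. After that, hits that cannot
     * enter the top k are skipped and never counted.
     * Returns {countedHits, returnedHits}.
     */
    static int[] collect(float[] scoresInDocIdOrder, int k, int totalHitsThreshold) {
        if (k <= 0) {
            throw new IllegalArgumentException("k must be positive");
        }
        // Min-heap: the head is the weakest hit currently kept.
        PriorityQueue<Float> topK = new PriorityQueue<>();
        int counted = 0;
        for (float score : scoresInDocIdOrder) {
            boolean thresholdReached = counted >= totalHitsThreshold;
            boolean queueFull = topK.size() == k;
            if (thresholdReached && queueFull && score <= topK.peek()) {
                continue; // non-competitive: skipped, so it is not counted
            }
            counted++;
            if (!queueFull) {
                topK.add(score);
            } else if (score > topK.peek()) {
                topK.poll();     // evict the weakest hit, which raises the bar
                topK.add(score);
            }
        }
        return new int[] {counted, topK.size()};
    }

    /** Synthetic scores 1..n, either rising or falling with the doc ID. */
    static float[] ramp(int n, boolean ascending) {
        float[] scores = new float[n];
        for (int i = 0; i < n; i++) {
            scores[i] = ascending ? i + 1 : n - i;
        }
        return scores;
    }

    public static void main(String[] args) {
        int[] bestFirst = collect(ramp(2000, false), 10, 10);
        int[] bestLast = collect(ramp(2000, true), 10, 10);
        System.out.println("best first: counted=" + bestFirst[0] + ", returned=" + bestFirst[1]);
        System.out.println("best last:  counted=" + bestLast[0] + ", returned=" + bestLast[1]);
    }
}
```

In this model both runs return 10 hits, but the run where the best hits come last has to count all 2000 of them, because every new hit is still competitive. The threshold only says how many hits must be counted exactly before skipping is allowed; it does not cap the count.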
Re: Potential bug [ In reply to ]
i was clear on what i wanted to do with Lucene experiments in this thread.
(last part of first paragraph below)

Best regards
________________________________
From: Baris Kazar <baris.kazar@oracle.com>
Sent: Monday, June 14, 2021 10:28:47 AM
To: Atri Sharma <atri@apache.org>; java-user@lucene.apache.org <java-user@lucene.apache.org>; a.benedetti@sease.io <a.benedetti@sease.io>; Baris Kazar <baris.kazar@oracle.com>
Subject: Re: Potential bug

Dear Folks,-
I have a lot of experience in performance tuning and parallel processing: 17+7 years. So when you say "you don't know what you are asking for", that does not sound good at all; besides, I was clear on that.

Alessandro, I appreciate the apology, and I would like to apologize if I hurt anyone's feelings; I never mean to hurt anybody's feelings. I still think I was not aggressive, but I need to explain again
what was wrong with the email:

I was not trying to be aggressive with my responses.
I have been writing in this forum for a long time and have never received an email like yours.

I revised your email for this list because, with my expertise, I don't think I should get a comment like the X Y problem example.

Moreover, code can have bugs, and "raising the alarm" is not a good choice of words here. I am not here to find problems with Lucene; we are all here to use Lucene and make it better.

I appreciate the work the committers are doing as volunteers; there is no doubt about that. Lucene 8.y.z is much better thanks to your work. Kudos for that success.

Keeping the tone neutral is what I am looking for here. Yes, respect is fundamental;
that is what I have been saying in my last emails.

Would you please look at my revised email?
I think the email should have been composed
that way.

I would like to focus on my question please.
I hope we keep the tone neutral and professional.
Thanks for understanding.

Best regards
________________________________
From: Atri Sharma <atri@apache.org>
Sent: Monday, June 14, 2021 8:46 AM
To: java-user@lucene.apache.org
Cc: Baris Kazar
Subject: Re: Potential bug

+1 to Adrien.

Let's keep the tone neutral.

On Mon, 14 Jun 2021, 16:00 Adrien Grand, <jpountz@gmail.com> wrote:
Baris, you called out an insult from Alessandro and your replies suggest
anger, but I couldn't see an insult from Alessandro actually.

+1 to Alessandro's call to make the tone softer on this discussion.

On Mon, Jun 14, 2021 at 11:28 AM Alessandro Benedetti <a.benedetti@sease.io>
wrote:

> Hi Baris,
> first of all apologies for having misspelled your name, definitely, it was
> not meant as an insult.
> Secondly, your tone is not acceptable on this mailing list (or anywhere
> else).
> You must remember that we, committers, are operating on a volunteering
> basis, contributing code and helping people in our free time purely driven
> by passion.
> Respect is fundamental, we are not here to be treated aggressively.
>
> Regards
>
> --------------------------
> Alessandro Benedetti
> Apache Lucene/Solr Committer
> Director, R&D Software Engineer, Search Consultant
>
> www.sease.io
>
>
> On Fri, 11 Jun 2021 at 17:10, <baris.kazar@oracle.com> wrote:
>
> > Let me show what a professional answer to the email below would look like:
> >
> >
> > Hi Baris,
> >
> > Since you mentioned that you did a full performance study of your
> > application and still believe that the bottleneck is the fuzzy search
> > API from Lucene, it would be best to time the application separately for:
> >
> > * the matching phase (identifying candidates from the corpus of documents)
> > * the ranking phase (scoring them by relevance)
> >
> > Maybe this will help speed things up further.
> >
> > Also, what do you mean by "what if the user needs to limit the search
> > process"? Can you elaborate?
> >
> > Cheers
> >
> >
> >
> > My answer would be:
> >
> > I can't access the Lucene code, so how can I time these two cases, please?
> >
> > By that sentence I mean that when I see that the hits are good, I would
> > like to limit the number of hits.
> >
> >
> >
> > This is more like a professional conversation, please. Thanks.
> >
> > Best regards
> >
> >
> > On 6/11/21 11:57 AM, Alessandro Benedetti wrote:
> > > Hi Bazir,
> > > this feels like an X Y problem [1].
> > > Can you express what is your original user requirement?
> > > Most of the time, at the cost of indexing time/space you may get quicker
> > > query times.
> > > Also, you should identify where are you wasting most of your time, in the
> > > matching phase (identifying candidates from the corpus of documents) or in
> > > the ranking phase (scoring them by relevance)?
> > >
> > > TopScoreDocCollector is quite a solid class, there's a ton to study,
> > > analyze and experiment before raising the alarm of a bug :)
> > >
> > > Also didn't understand this :
> > > "what if the user needs to limit the search process?"
> > > Can you elaborate?
> > >
> > > Cheers
> > >
> > >
> > >
> > > [1] https://xyproblem.info
> > > --------------------------
> > > Alessandro Benedetti
> > > Apache Lucene/Solr Committer
> > > Director, R&D Software Engineer, Search Consultant
> > >
> > > http://www.sease.io
> > >
> > >
> > > On Wed, 9 Jun 2021 at 19:08, <baris.kazar@oracle.com> wrote:
> > >
> > >> Yes, I did those, and I believe I am at the best level of performance now.
> > >> It is not bad at all, but I want to make it much better.
> > >>
> > >> I see something like a linear drop in timings when I go to a lower number
> > >> of words, but let me do that quick study again.
> > >>
> > >> Fuzzy search is always expensive, but it seems to suit my needs best.
> > >>
> > >>
> > >> Thanks Diego for these great questions; I have already explored them. But
> > >> thanks again.
> > >>
> > >> Best regards
> > >>
> > >>
> > >> On 6/9/21 2:04 PM, Diego Ceccarelli (BLOOMBERG/ LONDON) wrote:
> > >>> I have never used fuzzy search, but from the documentation it seems very
> > >>> expensive, and if you do it on 10 terms and 1M documents it seems very,
> > >>> very, very expensive.
> > >>> Are you using the default 'fuzzyness' parameter (0.5)? It might end up
> > >>> exploring a lot of documents. Did you try to play with that parameter?
> > >>> Have you tried to see how the performance changes if you do not use fuzzy
> > >>> (just to see if it is fuzzy that introduces the slowdown)?
> > >>> Or what happens to performance if you do fuzzy with 1, 2, 5 terms instead
> > >>> of 10?
> > >>>


--
Adrien
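
The comparisons suggested in the quoted messages (fuzzy with 1, 2, 5 or 10 terms, fuzzy versus non-fuzzy) need repeatable timings. Below is a minimal sketch of a timing harness that could be used for that. Everything in it is hypothetical: TimingSketch, medianNanos and fakeSearch are names invented for this illustration, and fakeSearch is only a stand-in to be replaced with your own search call.

```java
import java.util.Arrays;

final class TimingSketch {

    /**
     * Runs the task warmup times without measuring, then runs times with
     * measuring, and returns the median wall time in nanoseconds (the upper
     * median when runs is even).
     */
    static long medianNanos(Runnable task, int warmup, int runs) {
        if (runs <= 0) {
            throw new IllegalArgumentException("runs must be positive");
        }
        for (int i = 0; i < warmup; i++) {
            task.run(); // unmeasured runs give the JIT a chance to settle
        }
        long[] samples = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            samples[i] = System.nanoTime() - start;
        }
        Arrays.sort(samples);
        return samples[runs / 2];
    }

    /** Hypothetical stand-in for a real search; replace with your own query code. */
    static long fakeSearch(int terms) {
        long acc = 0;
        for (int t = 0; t < terms; t++) {
            for (int i = 0; i < 100_000; i++) {
                acc += i ^ t;
            }
        }
        return acc;
    }

    public static void main(String[] args) {
        int[] termCounts = {1, 2, 5, 10};
        for (int terms : termCounts) {
            long nanos = medianNanos(() -> fakeSearch(terms), 5, 21);
            System.out.println(terms + " terms: " + (nanos / 1_000) + " us");
        }
    }
}
```

Using the median rather than the mean keeps a single slow run (for example a garbage collection pause) from dominating the result. To separate search time from result processing, the same helper can be pointed at the search call alone and then at the search plus document retrieval.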