Mailing List Archive

An interesting case
Hi,-

I use the IndexSearcher.search API with two parameters, a Query and an int
number n (which I set to 20).

However, when I look at the TopDocs object returned by this call, I see
thousands of hits in totalHits. Is this number inaccurate, or is Lucene
actually doing the search work for that many results?

When I iterate over the scoreDocs array of the same result, I get n hits
(i.e., 20 hits).


I am trying to find out why org.apache.lucene.search.TopDocs.totalHits
reports a larger number of collected results than the number of results
actually returned. I see on the order of a couple of thousand vs. 20.


Best regards




---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: An interesting case
https://stackoverflow.com/questions/50368313/relation-between-topdocs-totalhits-and-parameter-n-of-indexsearcher-search

It looks like someone else had this problem, too.

Any suggestions please?

Best regards


On 6/8/21 1:36 AM, baris.kazar@oracle.com wrote:
Re: An interesting case
When you call IndexSearcher#search(Query query, int n), there are two cases:
- either your query matches n hits or more, and the TopDocs object will
have a ScoreDoc[] array that contains the n best-scoring hits sorted by
descending score,
- or your query matches fewer than n hits, and then the TopDocs object will
have all matches in the ScoreDoc[] array, sorted by descending score.

In both cases, TopDocs#totalHits gives information about the total number
of matches of the query. On older versions of Lucene (<7.0) this is an
integer that is always accurate, while on more recent versions of Lucene
(>= 8.0) it is a lower bound of the total number of matches. It typically
returns the number of documents that were actually collected, though this
is an implementation detail that might change in the future.

If you want to count the number of matches of a Query precisely, you can
use IndexSearcher#count.
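
As a rough illustration of why totalHits and the length of the ScoreDoc[]
array can differ, here is a toy model in plain Java. It is not Lucene code:
the class TopNSketch, its Result type and the collect method are
hypothetical names made up for this sketch, and the scores are arbitrary.
It only shows the general idea that a top-n collector can count every match
it visits while keeping just the n best ones in a small heap.

```java
import java.util.PriorityQueue;

/** Toy model of top-n collection; this is NOT Lucene code. */
class TopNSketch {

    /** Hypothetical result type, loosely shaped like TopDocs. */
    static final class Result {
        final long totalHits;    // every match that was visited
        final float[] topScores; // at most n scores, best first

        Result(long totalHits, float[] topScores) {
            this.totalHits = totalHits;
            this.topScores = topScores;
        }
    }

    /** Visits every matching score, counts it, and keeps only the n best. */
    static Result collect(float[] matchingScores, int n) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        // Min-heap: the weakest hit that is currently kept sits on top.
        PriorityQueue<Float> heap = new PriorityQueue<>();
        long totalHits = 0;
        for (float score : matchingScores) {
            totalHits++; // every visited match is counted
            if (heap.size() < n) {
                heap.add(score);
            } else if (score > heap.peek()) {
                heap.poll(); // evict the weakest kept hit
                heap.add(score);
            }
        }
        // Drain the heap in ascending order, filling the array from the back.
        float[] top = new float[heap.size()];
        for (int i = top.length - 1; i >= 0; i--) {
            top[i] = heap.poll();
        }
        return new Result(totalHits, top);
    }

    public static void main(String[] args) {
        float[] scores = new float[3000];
        for (int i = 0; i < scores.length; i++) {
            scores[i] = (i * 37 % 3000) / 3000f; // arbitrary made-up scores
        }
        Result r = collect(scores, 20);
        // prints "3000 matches, 20 kept"
        System.out.println(r.totalHits + " matches, " + r.topScores.length + " kept");
    }
}
```

In this model the heap never grows beyond n entries, so the large count by
itself does not mean that thousands of results were kept in memory.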

On Tue, Jun 8, 2021 at 7:51 AM <baris.kazar@oracle.com> wrote:

--
Adrien
Re: An interesting case
My worry is actually about Lucene's performance.

If Lucene collects thousands of hits instead of just n (with n much smaller than a couple of thousand), then this creates a performance issue.

The ScoreDoc array is OK, as I mentioned, i.e., it has size n.
I will check the count API.

Best regards
________________________________
From: Adrien Grand <jpountz@gmail.com>
Sent: Tuesday, June 8, 2021 2:46 AM
To: Lucene Users Mailing List
Cc: Baris Kazar
Subject: Re: An interesting case

Re: An interesting case
I am currently happy with Lucene's performance, but I want to understand
it and speed it up further by limiting the results concretely. So I still
do not know why totalHits and scoreDocs report different numbers of hits.


Best regards


On 6/8/21 2:52 AM, Baris Kazar wrote:
Re: An interesting case
If you don't need any information about the total hit count, you could
create a TopScoreDocCollector that has the same value for numHits
and totalHitsThreshold. This way Lucene will spend as little energy as
possible computing the number of matches of the query.

On Tue, Jun 8, 2021 at 6:28 PM <baris.kazar@oracle.com> wrote:


--
Adrien
Re: An interesting case
Adrien, my concern is not actually the number mismatch; as I mentioned,
it is the performance.


Seeing those numbers mismatch, it seems that Lucene is still doing the
same amount of work to get results, no matter how many results you ask
for in the IndexSearcher search API.


I thought I was clear on that.


Lucene should not spend any energy on the count, as scoreDocs already
has that.

But seeing such a high number in totalHits worries me, as I explained above.


Best regards


On 6/8/21 1:12 PM, Adrien Grand wrote:
Re: An interesting case
OK, I think you meant something else here.

You are not referring to the total-hits calculation or the mismatch,
right?


So, to make Lucene do the minimum amount of work to reach the matched
docs, TopScoreDocCollector should be used, right?


Let me check this class.

Thanks


On 6/8/21 1:16 PM, baris.kazar@oracle.com wrote:
Re: An interesting case
Yes, for instance if you care about the top 10 hits only, you could call
TopScoreDocCollector.create(10, null, 10). By default, IndexSearcher is
configured to count at least 1,000 hits, and creates its top docs collector
with TopScoreDocCollector.create(10, null, 1000).
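
To illustrate what the last argument changes, here is a toy model in plain
Java of a hit-count threshold. It is a sketch under simplifying
assumptions, not how Lucene is implemented: ThresholdSketch, Result and
collect are hypothetical names, the scores are made up, and "skipping" is
modeled as simply not counting a hit that cannot enter the top n once the
threshold has been reached. The point is only that a lower threshold lets
the collector stop counting earlier, at the price of a count that is a
lower bound rather than an exact number.

```java
import java.util.PriorityQueue;

/** Toy model of a hit-count threshold; this is NOT Lucene's implementation. */
class ThresholdSketch {

    /** Hypothetical result type for this sketch. */
    static final class Result {
        final long countedHits; // hits that were visited and counted
        final boolean exact;    // false: countedHits is only a lower bound
        final int kept;         // number of hits kept in the top n

        Result(long countedHits, boolean exact, int kept) {
            this.countedHits = countedHits;
            this.exact = exact;
            this.kept = kept;
        }
    }

    static Result collect(float[] matchingScores, int n, int totalHitsThreshold) {
        if (n <= 0) {
            throw new IllegalArgumentException("n must be positive");
        }
        // Min-heap: the weakest hit that is currently kept sits on top.
        PriorityQueue<Float> heap = new PriorityQueue<>();
        long counted = 0;
        boolean exact = true;
        for (float score : matchingScores) {
            boolean heapFull = heap.size() == n;
            if (heapFull && counted >= totalHitsThreshold && score <= heap.peek()) {
                // Past the threshold, a hit that cannot enter the top n is
                // skipped without being counted, so the count is no longer exact.
                exact = false;
                continue;
            }
            counted++;
            if (!heapFull) {
                heap.add(score);
            } else if (score > heap.peek()) {
                heap.poll(); // evict the weakest kept hit
                heap.add(score);
            }
        }
        return new Result(counted, exact, heap.size());
    }

    public static void main(String[] args) {
        float[] scores = {0.9f, 0.8f, 0.1f, 0.2f, 0.95f, 0.3f};
        Result low = collect(scores, 2, 2);     // threshold equal to n
        Result high = collect(scores, 2, 1000); // threshold above the match count
        // prints "3 counted, exact=false"
        System.out.println(low.countedHits + " counted, exact=" + low.exact);
        // prints "6 counted, exact=true"
        System.out.println(high.countedHits + " counted, exact=" + high.exact);
    }
}
```

Both calls keep the same two best hits; only the amount of counting differs.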

On Tue, Jun 8, 2021 at 7:19 PM <baris.kazar@oracle.com> wrote:

--
Adrien
Re: An interesting case
Yes, I sometimes see 4000+ and sometimes 3000+ hits in totalHits.

So TopScoreDocCollector is working underneath the IndexSearcher.search API,
right?

In other words, TopScoreDocCollector will save time, right?

Thanks


On 6/8/21 1:27 PM, Adrien Grand wrote:
> Yes, for instance if you care about the top 10 hits only, you could
> call TopScoreDocsCollector.create(10, null, 10). By default,
> IndexSearcher is configured to count at least 1,000 hits, and creates
> its top docs collector with TopScoreDocsCollector.create(10, null, 1000).
Re: An interesting case [ In reply to ]
May I please make a suggestion again?

The Lucene Javadocs need to be enhanced.

They should give more information, explain the parameters and, more
importantly, explain why these two approaches (TopScoreDocCollector vs.
IndexSearcher.search) differ in performance.


Thanks


On 6/8/21 2:07 PM, baris.kazar@oracle.com wrote:
>
> Yes, I sometimes see 4000+ and sometimes 3000+ hits in totalHits.
>
> So TopScoreDocCollector is working underneath the IndexSearcher.search
> API, right?
>
> In other words, using TopScoreDocCollector directly will save time, right?
>
> Thanks
Re: An interesting case [ In reply to ]
OK, I think I fully understand now, thanks.


https://stackoverflow.com/questions/15589186/lucene-4-pagination

This post was really good, and I would like similar text to appear in
the Javadocs, please, as it would help everyone.


"I agree with the solution explained by Jaimie. But I want to point out
another aspect you have to be aware of and which is helping to
understand the general mechanism of a search engine.

With the TopDocCollector you can define how much hits you want to be
collected matching your search query, before the result is sorted by
score or other sort criterias.

See the following example:

    collector = TopScoreDocCollector.create(9999, true);
    searcher.search(parser.parse("Clone Warrior"), collector);
    // get first page
    topDocs = collector.topDocs(0, 10);
    int resultSize=topDocs.scoreDocs.length; // 10 or less
    int totalHits=topDocs.totalHits; // 9999 or less

We tell Lucene here to collect a maximum of 9999 documents containing
the search phrase 'Clone Warrior'. This means, if the index contains
more than 9999 documents containing this search phrase, the collector
will stop after it is filled up with 9999 hits!

This means, that as greater you choose the MAX_RESULTS as better
become your search result. But this is only relevant if you expect a
large number of hits. On the other side if you search for "luke
skywalker" and you will expect only one hit, than the MAX_RESULTS can
also be set to 1.

So changing the MAX_RESULTS can influence the returned scoreDocs as
the sorting will be performed on the collected hits. It is practically
to set MAX_RESULTS to a size which is large enough so that the human
user can not argue to miss a specific document. This concept is totally
contrary to the behavior of a SQL database, which does always consider
the complete data pool.

But lucene also supports another mechanism. You can, instead of
defining the MAX_RESULTS for the collector, alternatively define the
amount of time you want to wait for the resultset. So for example you
can define that you always want to stop the collector after 300ms. This
is a good approach to protect your application for performance issues.
But if you want to make sure that you count all relevant documents than
you have to set the parameter for MAX_RESULTS or the maximum wait time
to a endless value."


Thanks to Ralph, who posted this at the above Stack Overflow link, and
thanks to Adrien.

I want to limit the number 9999 above to maybe 100.

Best regards
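
Limiting the collector to 100 retained hits also limits how deep paging can go. The slicing done by collector.topDocs(start, howMany) in the quoted snippet can be modelled in plain Java as below. PagingSketch and page are hypothetical names, and the array of doc ids merely stands in for the collector's ranked hits; this is a sketch of the behaviour, not Lucene code.

```java
import java.util.Arrays;

public class PagingSketch {

    /**
     * Returns one page of a ranked hit list.
     * ranked holds the retained doc ids, best first; its length plays the
     * role of the collector's numHits.
     */
    public static int[] page(int[] ranked, int start, int howMany) {
        // An offset outside the retained hits yields an empty page.
        if (start < 0 || start >= ranked.length || howMany <= 0) {
            return new int[0];
        }
        // Clip the page so that it never runs past the retained hits.
        int length = Math.min(ranked.length - start, howMany);
        return Arrays.copyOfRange(ranked, start, start + length);
    }
}
```

With 100 retained hits and a page size of 10, offsets 0 through 90 each return a full page, while an offset of 100 returns an empty one. Anything beyond the 100th hit would need a larger numHits.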



On 6/8/21 4:19 PM, baris.kazar@oracle.com wrote:
>
> May I please make a suggestion again?
>
> The Lucene Javadocs need to be enhanced.
>
> They should give more information, explain the parameters and, more
> importantly, explain why these two approaches (TopScoreDocCollector vs.
> IndexSearcher.search) differ in performance.
>
>
> Thanks