Mailing List Archive: Should segment order impact hit count estimation?

Should segment order impact hit count estimation?

zhai7631 at gmail

May 20, 2021, 11:28 PM

Post #1 of 7 (518 views)

Hi folks

For the past few weeks I've been working with Mike McCandless to use the
recent introduced IndexRearranger to replace the old way of guarantee
deterministic index -- using a single index thread and a LogDocMergePolicy.

In the progress we found out that with two concurrently built but
rearranged indexes, the estimation hit count will show a small difference.
I've carefully checked the index and found they're almost the same but the
segment order is different (index 1 might be segment 1,2,3,4,5 while index
2 might be segment 2,1,3,5,4 where nth segment contains exactly the same
documents and sorted using the same criteria). So I suspected the segment
order impacted the hit count estimation and to confirm that I turned off
the concurrency of rearranger so that it will always create segments in
order. The result proved my theory that the segment order was impacting the
hit count estimation.

Later on I did some investigation and found in TopScoreDocCollector we do
have logic of updating the global minScore so I guess that's where makes
the difference. Mike and I both feel a little weird that segment order will
affect the hit count estimation, so just want to
1. See whether there's any chance we could improve the API or documentation
2. Seek some advice on how should we tackle the problem, obviously we don't
want rearranger to execute on only 1 thread (since we use it for speed!),
currently what we're considering is to relax the check for hit count
estimation, but maybe there's a better way?

Best
Patrick

Re: Should segment order impact hit count estimation? [ In reply to ]

jpountz at gmail

May 21, 2021, 2:24 AM

Post #2 of 7 (518 views)

Hi Patrick,

Why do you feel weird about the fact that segment order impacts the hit
count estimation? It feels ok to me, especially as segment order has deeper
implications, e.g. you could get different top hits given that Lucene uses
the global doc ID as a tie breaker for documents that produce the same
score.

Could IndexRearranger write segments in deterministic order in the segments
file to improve reproducibility? Or could your application configure leaf
order explicitly? (LUCENE-9507)

Le ven. 21 mai 2021 à 08:29, Patrick Zhai <zhai7631@gmail.com> a écrit :

> Hi folks
>
> For the past few weeks I've been working with Mike McCandless to use the
> recent introduced IndexRearranger to replace the old way of guarantee
> deterministic index -- using a single index thread and a LogDocMergePolicy.
>
> In the progress we found out that with two concurrently built but
> rearranged indexes, the estimation hit count will show a small difference.
> I've carefully checked the index and found they're almost the same but the
> segment order is different (index 1 might be segment 1,2,3,4,5 while index
> 2 might be segment 2,1,3,5,4 where nth segment contains exactly the same
> documents and sorted using the same criteria). So I suspected the segment
> order impacted the hit count estimation and to confirm that I turned off
> the concurrency of rearranger so that it will always create segments in
> order. The result proved my theory that the segment order was impacting the
> hit count estimation.
>
> Later on I did some investigation and found in TopScoreDocCollector we do
> have logic of updating the global minScore so I guess that's where makes
> the difference. Mike and I both feel a little weird that segment order will
> affect the hit count estimation, so just want to
> 1. See whether there's any chance we could improve the API or documentation
> 2. Seek some advice on how should we tackle the problem, obviously we
> don't want rearranger to execute on only 1 thread (since we use it for
> speed!), currently what we're considering is to relax the check for hit
> count estimation, but maybe there's a better way?
>
> Best
> Patrick
>

Re: Should segment order impact hit count estimation? [ In reply to ]

gsmiller at gmail

May 21, 2021, 11:21 AM

Post #3 of 7 (518 views)

Hi Patrick-

Interesting finds and deep dive!

Just to confirm, in the cases of differing hit counts that you've
observed, these are TotalHits with a Relation of
"GREATER_THAN_OR_EQUAL_TO" right? Never cases where the Relation is
"EQUAL_TO"? I ask because that would change my opinion of
whether-or-not this is concerning. If we're proving a hit count with
"equal to" semantics, it should be correct and shouldn't change based
on segment ordering, etc. But, on the other hand, if we're just
providing a floor (i.e., "there are at least this many hits"), then
I'm not particularly concerned by this (assuming that floor is
actually correct in call cases and there aren't fewer hits).

Cheers,
-Greg

On Fri, May 21, 2021 at 2:25 AM Adrien Grand <jpountz@gmail.com> wrote:
>
> Hi Patrick,
>
>
> Why do you feel weird about the fact that segment order impacts the hit count estimation? It feels ok to me, especially as segment order has deeper implications, e.g. you could get different top hits given that Lucene uses the global doc ID as a tie breaker for documents that produce the same score.
>
> Could IndexRearranger write segments in deterministic order in the segments file to improve reproducibility? Or could your application configure leaf order explicitly? (LUCENE-9507)
>
> Le ven. 21 mai 2021 à 08:29, Patrick Zhai <zhai7631@gmail.com> a écrit :
>>
>> Hi folks
>>
>> For the past few weeks I've been working with Mike McCandless to use the recent introduced IndexRearranger to replace the old way of guarantee deterministic index -- using a single index thread and a LogDocMergePolicy.
>>
>> In the progress we found out that with two concurrently built but rearranged indexes, the estimation hit count will show a small difference. I've carefully checked the index and found they're almost the same but the segment order is different (index 1 might be segment 1,2,3,4,5 while index 2 might be segment 2,1,3,5,4 where nth segment contains exactly the same documents and sorted using the same criteria). So I suspected the segment order impacted the hit count estimation and to confirm that I turned off the concurrency of rearranger so that it will always create segments in order. The result proved my theory that the segment order was impacting the hit count estimation.
>>
>> Later on I did some investigation and found in TopScoreDocCollector we do have logic of updating the global minScore so I guess that's where makes the difference. Mike and I both feel a little weird that segment order will affect the hit count estimation, so just want to
>> 1. See whether there's any chance we could improve the API or documentation
>> 2. Seek some advice on how should we tackle the problem, obviously we don't want rearranger to execute on only 1 thread (since we use it for speed!), currently what we're considering is to relax the check for hit count estimation, but maybe there's a better way?
>>
>> Best
>> Patrick

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: Should segment order impact hit count estimation? [ In reply to ]

msokolov at gmail

May 23, 2021, 5:24 AM

Post #4 of 7 (512 views)

Agree with Adrien and Greg, approximate hit counts may vary. That global
docid tiebreaking may affect the actual results is an interesting
observation too, thanks Adrien. It seems to me the best way forward is to
relax the check. Depending on how the index is constructed the difference
could be quite large. Eg if the index is sorted and all the best documents
are in one segment, we could end up with quite different counts; they could
get many multiples different if you collect single threaded.

On Fri, May 21, 2021, 2:22 PM Greg Miller <gsmiller@gmail.com> wrote:

> Hi Patrick-
>
> Interesting finds and deep dive!
>
> Just to confirm, in the cases of differing hit counts that you've
> observed, these are TotalHits with a Relation of
> "GREATER_THAN_OR_EQUAL_TO" right? Never cases where the Relation is
> "EQUAL_TO"? I ask because that would change my opinion of
> whether-or-not this is concerning. If we're proving a hit count with
> "equal to" semantics, it should be correct and shouldn't change based
> on segment ordering, etc. But, on the other hand, if we're just
> providing a floor (i.e., "there are at least this many hits"), then
> I'm not particularly concerned by this (assuming that floor is
> actually correct in call cases and there aren't fewer hits).
>
> Cheers,
> -Greg
>
> On Fri, May 21, 2021 at 2:25 AM Adrien Grand <jpountz@gmail.com> wrote:
> >
> > Hi Patrick,
> >
> >
> > Why do you feel weird about the fact that segment order impacts the hit
> count estimation? It feels ok to me, especially as segment order has deeper
> implications, e.g. you could get different top hits given that Lucene uses
> the global doc ID as a tie breaker for documents that produce the same
> score.
> >
> > Could IndexRearranger write segments in deterministic order in the
> segments file to improve reproducibility? Or could your application
> configure leaf order explicitly? (LUCENE-9507)
> >
> > Le ven. 21 mai 2021 à 08:29, Patrick Zhai <zhai7631@gmail.com> a écrit :
> >>
> >> Hi folks
> >>
> >> For the past few weeks I've been working with Mike McCandless to use
> the recent introduced IndexRearranger to replace the old way of guarantee
> deterministic index -- using a single index thread and a LogDocMergePolicy.
> >>
> >> In the progress we found out that with two concurrently built but
> rearranged indexes, the estimation hit count will show a small difference.
> I've carefully checked the index and found they're almost the same but the
> segment order is different (index 1 might be segment 1,2,3,4,5 while index
> 2 might be segment 2,1,3,5,4 where nth segment contains exactly the same
> documents and sorted using the same criteria). So I suspected the segment
> order impacted the hit count estimation and to confirm that I turned off
> the concurrency of rearranger so that it will always create segments in
> order. The result proved my theory that the segment order was impacting the
> hit count estimation.
> >>
> >> Later on I did some investigation and found in TopScoreDocCollector we
> do have logic of updating the global minScore so I guess that's where makes
> the difference. Mike and I both feel a little weird that segment order will
> affect the hit count estimation, so just want to
> >> 1. See whether there's any chance we could improve the API or
> documentation
> >> 2. Seek some advice on how should we tackle the problem, obviously we
> don't want rearranger to execute on only 1 thread (since we use it for
> speed!), currently what we're considering is to relax the check for hit
> count estimation, but maybe there's a better way?
> >>
> >> Best
> >> Patrick
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

Re: Should segment order impact hit count estimation? [ In reply to ]

lucene at mikemccandless

May 24, 2021, 6:54 AM

Post #5 of 7 (512 views)

OK, it sounds like this is indeed expected (thanks Adrien, Greg, Mike,
Patrick!), that the estimated hit count (GREATER_THAN_OR_EQUAL_TO) can
indeed change if the segment order is different but otherwise identical
documents are in each segment (yes, with different global doc ids, so the
true identical document case would tie break by those).

That was/is surprising to me -- I thought segment order did not alter
anything at search time (except the tie-break-by-global-docid). I thought
each segment was searched, independently, by itself, and then results
"merged" together. But this has become more complex/interesting with time
:)

So is this because BMW will use the best results from previous segments to
optimize/improve skipping while searching future segments? I.e. some sort
of state carries over from having searched past segments, to better
optimize searching the next segment? This makes total sense (that we would
optimize this way). I guess this also means that concurrent search ("thread
per slice/segment") would not be able to optimize with BMW as well?

Mike McCandless

http://blog.mikemccandless.com

On Sun, May 23, 2021 at 8:24 AM Michael Sokolov <msokolov@gmail.com> wrote:

> Agree with Adrien and Greg, approximate hit counts may vary. That global
> docid tiebreaking may affect the actual results is an interesting
> observation too, thanks Adrien. It seems to me the best way forward is to
> relax the check. Depending on how the index is constructed the difference
> could be quite large. Eg if the index is sorted and all the best documents
> are in one segment, we could end up with quite different counts; they could
> get many multiples different if you collect single threaded.
>
> On Fri, May 21, 2021, 2:22 PM Greg Miller <gsmiller@gmail.com> wrote:
>
>> Hi Patrick-
>>
>> Interesting finds and deep dive!
>>
>> Just to confirm, in the cases of differing hit counts that you've
>> observed, these are TotalHits with a Relation of
>> "GREATER_THAN_OR_EQUAL_TO" right? Never cases where the Relation is
>> "EQUAL_TO"? I ask because that would change my opinion of
>> whether-or-not this is concerning. If we're proving a hit count with
>> "equal to" semantics, it should be correct and shouldn't change based
>> on segment ordering, etc. But, on the other hand, if we're just
>> providing a floor (i.e., "there are at least this many hits"), then
>> I'm not particularly concerned by this (assuming that floor is
>> actually correct in call cases and there aren't fewer hits).
>>
>> Cheers,
>> -Greg
>>
>> On Fri, May 21, 2021 at 2:25 AM Adrien Grand <jpountz@gmail.com> wrote:
>> >
>> > Hi Patrick,
>> >
>> >
>> > Why do you feel weird about the fact that segment order impacts the hit
>> count estimation? It feels ok to me, especially as segment order has deeper
>> implications, e.g. you could get different top hits given that Lucene uses
>> the global doc ID as a tie breaker for documents that produce the same
>> score.
>> >
>> > Could IndexRearranger write segments in deterministic order in the
>> segments file to improve reproducibility? Or could your application
>> configure leaf order explicitly? (LUCENE-9507)
>> >
>> > Le ven. 21 mai 2021 à 08:29, Patrick Zhai <zhai7631@gmail.com> a écrit
>> :
>> >>
>> >> Hi folks
>> >>
>> >> For the past few weeks I've been working with Mike McCandless to use
>> the recent introduced IndexRearranger to replace the old way of guarantee
>> deterministic index -- using a single index thread and a LogDocMergePolicy.
>> >>
>> >> In the progress we found out that with two concurrently built but
>> rearranged indexes, the estimation hit count will show a small difference.
>> I've carefully checked the index and found they're almost the same but the
>> segment order is different (index 1 might be segment 1,2,3,4,5 while index
>> 2 might be segment 2,1,3,5,4 where nth segment contains exactly the same
>> documents and sorted using the same criteria). So I suspected the segment
>> order impacted the hit count estimation and to confirm that I turned off
>> the concurrency of rearranger so that it will always create segments in
>> order. The result proved my theory that the segment order was impacting the
>> hit count estimation.
>> >>
>> >> Later on I did some investigation and found in TopScoreDocCollector we
>> do have logic of updating the global minScore so I guess that's where makes
>> the difference. Mike and I both feel a little weird that segment order will
>> affect the hit count estimation, so just want to
>> >> 1. See whether there's any chance we could improve the API or
>> documentation
>> >> 2. Seek some advice on how should we tackle the problem, obviously we
>> don't want rearranger to execute on only 1 thread (since we use it for
>> speed!), currently what we're considering is to relax the check for hit
>> count estimation, but maybe there's a better way?
>> >>
>> >> Best
>> >> Patrick
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>>

Re: Should segment order impact hit count estimation? [ In reply to ]

jpountz at gmail

May 24, 2021, 10:38 AM

Post #6 of 7 (512 views)

Indeed collectors keep state about the minimum required score for a hit to
be competitive, typically the next float after the score of the bottom of
the priority queue. It can only go up as more documents/segments get
collected. Collectors have a feedback loop with scorers that enables
scorers to skip hits that cannot possibly make it to the top hits.

If you can put segments that produce higher scores first, you will get
better performance. I don't think it can be easily done when sorting by
score, but when sorting by a field that correlates with index order, it
certainly can. This is the point of LUCENE-9507.

With multiple slices/threads, every slice gets its own collector, but
LUCENE-8978 made collectors able to share information about their minimum
competitive scores across threads so that seeing good hits on a segment
could still help skip hits on other segments like when searching without a
threadpool.

A possibly surprising consequence is that running the same query multiple
times could return different hit count estimations when IndexSearcher is
configured with a threadpool.

Le lun. 24 mai 2021 à 15:55, Michael McCandless <lucene@mikemccandless.com>
a écrit :

> OK, it sounds like this is indeed expected (thanks Adrien, Greg, Mike,
> Patrick!), that the estimated hit count (GREATER_THAN_OR_EQUAL_TO) can
> indeed change if the segment order is different but otherwise identical
> documents are in each segment (yes, with different global doc ids, so the
> true identical document case would tie break by those).
>
> That was/is surprising to me -- I thought segment order did not alter
> anything at search time (except the tie-break-by-global-docid). I thought
> each segment was searched, independently, by itself, and then results
> "merged" together. But this has become more complex/interesting with time
> :)
>
> So is this because BMW will use the best results from previous segments to
> optimize/improve skipping while searching future segments? I.e. some sort
> of state carries over from having searched past segments, to better
> optimize searching the next segment? This makes total sense (that we would
> optimize this way). I guess this also means that concurrent search ("thread
> per slice/segment") would not be able to optimize with BMW as well?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Sun, May 23, 2021 at 8:24 AM Michael Sokolov <msokolov@gmail.com>
> wrote:
>
>> Agree with Adrien and Greg, approximate hit counts may vary. That global
>> docid tiebreaking may affect the actual results is an interesting
>> observation too, thanks Adrien. It seems to me the best way forward is to
>> relax the check. Depending on how the index is constructed the difference
>> could be quite large. Eg if the index is sorted and all the best documents
>> are in one segment, we could end up with quite different counts; they could
>> get many multiples different if you collect single threaded.
>>
>> On Fri, May 21, 2021, 2:22 PM Greg Miller <gsmiller@gmail.com> wrote:
>>
>>> Hi Patrick-
>>>
>>> Interesting finds and deep dive!
>>>
>>> Just to confirm, in the cases of differing hit counts that you've
>>> observed, these are TotalHits with a Relation of
>>> "GREATER_THAN_OR_EQUAL_TO" right? Never cases where the Relation is
>>> "EQUAL_TO"? I ask because that would change my opinion of
>>> whether-or-not this is concerning. If we're proving a hit count with
>>> "equal to" semantics, it should be correct and shouldn't change based
>>> on segment ordering, etc. But, on the other hand, if we're just
>>> providing a floor (i.e., "there are at least this many hits"), then
>>> I'm not particularly concerned by this (assuming that floor is
>>> actually correct in call cases and there aren't fewer hits).
>>>
>>> Cheers,
>>> -Greg
>>>
>>> On Fri, May 21, 2021 at 2:25 AM Adrien Grand <jpountz@gmail.com> wrote:
>>> >
>>> > Hi Patrick,
>>> >
>>> >
>>> > Why do you feel weird about the fact that segment order impacts the
>>> hit count estimation? It feels ok to me, especially as segment order has
>>> deeper implications, e.g. you could get different top hits given that
>>> Lucene uses the global doc ID as a tie breaker for documents that produce
>>> the same score.
>>> >
>>> > Could IndexRearranger write segments in deterministic order in the
>>> segments file to improve reproducibility? Or could your application
>>> configure leaf order explicitly? (LUCENE-9507)
>>> >
>>> > Le ven. 21 mai 2021 à 08:29, Patrick Zhai <zhai7631@gmail.com> a
>>> écrit :
>>> >>
>>> >> Hi folks
>>> >>
>>> >> For the past few weeks I've been working with Mike McCandless to use
>>> the recent introduced IndexRearranger to replace the old way of guarantee
>>> deterministic index -- using a single index thread and a LogDocMergePolicy.
>>> >>
>>> >> In the progress we found out that with two concurrently built but
>>> rearranged indexes, the estimation hit count will show a small difference.
>>> I've carefully checked the index and found they're almost the same but the
>>> segment order is different (index 1 might be segment 1,2,3,4,5 while index
>>> 2 might be segment 2,1,3,5,4 where nth segment contains exactly the same
>>> documents and sorted using the same criteria). So I suspected the segment
>>> order impacted the hit count estimation and to confirm that I turned off
>>> the concurrency of rearranger so that it will always create segments in
>>> order. The result proved my theory that the segment order was impacting the
>>> hit count estimation.
>>> >>
>>> >> Later on I did some investigation and found in TopScoreDocCollector
>>> we do have logic of updating the global minScore so I guess that's where
>>> makes the difference. Mike and I both feel a little weird that segment
>>> order will affect the hit count estimation, so just want to
>>> >> 1. See whether there's any chance we could improve the API or
>>> documentation
>>> >> 2. Seek some advice on how should we tackle the problem, obviously we
>>> don't want rearranger to execute on only 1 thread (since we use it for
>>> speed!), currently what we're considering is to relax the check for hit
>>> count estimation, but maybe there's a better way?
>>> >>
>>> >> Best
>>> >> Patrick
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>
>>>

Re: Should segment order impact hit count estimation? [ In reply to ]

lucene at mikemccandless

May 25, 2021, 7:46 AM

Post #7 of 7 (512 views)

Great, thanks for the explanation Adrien! I still learn something new
about Lucene every day ;)

Mike McCandless

http://blog.mikemccandless.com

On Mon, May 24, 2021 at 1:38 PM Adrien Grand <jpountz@gmail.com> wrote:

> Indeed collectors keep state about the minimum required score for a hit to
> be competitive, typically the next float after the score of the bottom of
> the priority queue. It can only go up as more documents/segments get
> collected. Collectors have a feedback loop with scorers that enables
> scorers to skip hits that cannot possibly make it to the top hits.
>
> If you can put segments that produce higher scores first, you will get
> better performance. I don't think it can be easily done when sorting by
> score, but when sorting by a field that correlates with index order, it
> certainly can. This is the point of LUCENE-9507.
>
> With multiple slices/threads, every slice gets its own collector, but
> LUCENE-8978 made collectors able to share information about their minimum
> competitive scores across threads so that seeing good hits on a segment
> could still help skip hits on other segments like when searching without a
> threadpool.
>
> A possibly surprising consequence is that running the same query multiple
> times could return different hit count estimations when IndexSearcher is
> configured with a threadpool.
>
>
> Le lun. 24 mai 2021 à 15:55, Michael McCandless <lucene@mikemccandless.com>
> a écrit :
>
>> OK, it sounds like this is indeed expected (thanks Adrien, Greg, Mike,
>> Patrick!), that the estimated hit count (GREATER_THAN_OR_EQUAL_TO) can
>> indeed change if the segment order is different but otherwise identical
>> documents are in each segment (yes, with different global doc ids, so the
>> true identical document case would tie break by those).
>>
>> That was/is surprising to me -- I thought segment order did not alter
>> anything at search time (except the tie-break-by-global-docid). I thought
>> each segment was searched, independently, by itself, and then results
>> "merged" together. But this has become more complex/interesting with time
>> :)
>>
>> So is this because BMW will use the best results from previous segments
>> to optimize/improve skipping while searching future segments? I.e. some
>> sort of state carries over from having searched past segments, to better
>> optimize searching the next segment? This makes total sense (that we would
>> optimize this way). I guess this also means that concurrent search ("thread
>> per slice/segment") would not be able to optimize with BMW as well?
>>
>> Mike McCandless
>>
>> http://blog.mikemccandless.com
>>
>>
>> On Sun, May 23, 2021 at 8:24 AM Michael Sokolov <msokolov@gmail.com>
>> wrote:
>>
>>> Agree with Adrien and Greg, approximate hit counts may vary. That global
>>> docid tiebreaking may affect the actual results is an interesting
>>> observation too, thanks Adrien. It seems to me the best way forward is to
>>> relax the check. Depending on how the index is constructed the difference
>>> could be quite large. Eg if the index is sorted and all the best documents
>>> are in one segment, we could end up with quite different counts; they could
>>> get many multiples different if you collect single threaded.
>>>
>>> On Fri, May 21, 2021, 2:22 PM Greg Miller <gsmiller@gmail.com> wrote:
>>>
>>>> Hi Patrick-
>>>>
>>>> Interesting finds and deep dive!
>>>>
>>>> Just to confirm, in the cases of differing hit counts that you've
>>>> observed, these are TotalHits with a Relation of
>>>> "GREATER_THAN_OR_EQUAL_TO" right? Never cases where the Relation is
>>>> "EQUAL_TO"? I ask because that would change my opinion of
>>>> whether-or-not this is concerning. If we're proving a hit count with
>>>> "equal to" semantics, it should be correct and shouldn't change based
>>>> on segment ordering, etc. But, on the other hand, if we're just
>>>> providing a floor (i.e., "there are at least this many hits"), then
>>>> I'm not particularly concerned by this (assuming that floor is
>>>> actually correct in call cases and there aren't fewer hits).
>>>>
>>>> Cheers,
>>>> -Greg
>>>>
>>>> On Fri, May 21, 2021 at 2:25 AM Adrien Grand <jpountz@gmail.com> wrote:
>>>> >
>>>> > Hi Patrick,
>>>> >
>>>> >
>>>> > Why do you feel weird about the fact that segment order impacts the
>>>> hit count estimation? It feels ok to me, especially as segment order has
>>>> deeper implications, e.g. you could get different top hits given that
>>>> Lucene uses the global doc ID as a tie breaker for documents that produce
>>>> the same score.
>>>> >
>>>> > Could IndexRearranger write segments in deterministic order in the
>>>> segments file to improve reproducibility? Or could your application
>>>> configure leaf order explicitly? (LUCENE-9507)
>>>> >
>>>> > Le ven. 21 mai 2021 à 08:29, Patrick Zhai <zhai7631@gmail.com> a
>>>> écrit :
>>>> >>
>>>> >> Hi folks
>>>> >>
>>>> >> For the past few weeks I've been working with Mike McCandless to use
>>>> the recent introduced IndexRearranger to replace the old way of guarantee
>>>> deterministic index -- using a single index thread and a LogDocMergePolicy.
>>>> >>
>>>> >> In the progress we found out that with two concurrently built but
>>>> rearranged indexes, the estimation hit count will show a small difference.
>>>> I've carefully checked the index and found they're almost the same but the
>>>> segment order is different (index 1 might be segment 1,2,3,4,5 while index
>>>> 2 might be segment 2,1,3,5,4 where nth segment contains exactly the same
>>>> documents and sorted using the same criteria). So I suspected the segment
>>>> order impacted the hit count estimation and to confirm that I turned off
>>>> the concurrency of rearranger so that it will always create segments in
>>>> order. The result proved my theory that the segment order was impacting the
>>>> hit count estimation.
>>>> >>
>>>> >> Later on I did some investigation and found in TopScoreDocCollector
>>>> we do have logic of updating the global minScore so I guess that's where
>>>> makes the difference. Mike and I both feel a little weird that segment
>>>> order will affect the hit count estimation, so just want to
>>>> >> 1. See whether there's any chance we could improve the API or
>>>> documentation
>>>> >> 2. Seek some advice on how should we tackle the problem, obviously
>>>> we don't want rearranger to execute on only 1 thread (since we use it for
>>>> speed!), currently what we're considering is to relax the check for hit
>>>> count estimation, but maybe there's a better way?
>>>> >>
>>>> >> Best
>>>> >> Patrick
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>>
>>>>