Hi folks,
For the past few weeks I've been working with Mike McCandless on using the
recently introduced IndexRearranger to replace our old way of guaranteeing a
deterministic index, which was to use a single indexing thread and a
LogDocMergePolicy.
In the process we found that two indexes that were built concurrently and
then rearranged report slightly different estimated hit counts.
I checked the indexes carefully and found them to be almost identical, except
that the segment order differs (index 1 might have segments 1,2,3,4,5 while
index 2 might have segments 2,1,3,5,4, where the nth segment contains exactly
the same documents, sorted by the same criteria). I suspected that the segment
order affects the hit count estimation, and to confirm that I turned off
concurrency in the rearranger so that it always creates segments in
order. With that change the two indexes reported the same estimated hit
count, which supports the theory that segment order affects the estimation.
Later I investigated further and found that TopScoreDocCollector has logic
for updating the global minScore, so my guess is that this is where the
difference comes from. Mike and I both find it a little odd that segment order
affects the hit count estimation, so we would like to:
1. See whether there's any chance we could improve the API or documentation.
2. Get some advice on how we should tackle the problem. Obviously we don't
want the rearranger to run on only one thread (we use it for speed!).
What we're currently considering is relaxing our check on the estimated hit
count, but maybe there's a better way?
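To illustrate the effect as I understand it, here is a minimal, self-contained sketch. It is a toy model, not Lucene's actual code: the class SegmentOrderHitCount, the method estimateHits, and the rule that every non-competitive document is skipped without being counted are all assumptions made for illustration. Once the top-k heap is full and the count has reached the threshold, documents that cannot beat the current minimum score are no longer counted, so the final count depends on which segment is visited first.

```java
import java.util.PriorityQueue;

class SegmentOrderHitCount {

    // Hypothetical, simplified stand-in for a top-k collector.
    // Each inner array holds the scores of the matching docs in one segment.
    static int estimateHits(float[][] segments, int topK, int threshold) {
        PriorityQueue<Float> topScores = new PriorityQueue<>(); // min-heap of the best scores so far
        int counted = 0;
        for (float[] segment : segments) {
            for (float score : segment) {
                // Pruning starts once enough hits were counted and the heap is full.
                boolean pruning = counted >= threshold && topScores.size() == topK;
                if (pruning && score <= topScores.peek()) {
                    continue; // non-competitive: skipped and never counted
                }
                counted++;
                if (topScores.size() < topK) {
                    topScores.add(score);
                } else if (score > topScores.peek()) {
                    topScores.poll();
                    topScores.add(score); // raises the minimum competitive score
                }
            }
        }
        return counted;
    }

    public static void main(String[] args) {
        float[] low = {1f, 1f, 1f, 1f};
        float[] high = {9f, 9f, 9f, 9f};
        // Same segments, same documents, different order.
        int lowFirst = estimateHits(new float[][] {low, high}, 2, 2);
        int highFirst = estimateHits(new float[][] {high, low}, 2, 2);
        System.out.println("low segment first:  " + lowFirst);
        System.out.println("high segment first: " + highFirst);
    }
}
```

In this toy model the top hits are the same in both orders, but the counted total is not: visiting the high-scoring segment first raises the minimum score earlier, so more of the remaining documents are skipped.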
Best
Patrick