Mailing List Archive: AddIndexes(CodecReader...) API Question

Hi All,

I am working on a patch that would leverage the MergePolicy and
MergeScheduler to run addIndexes(CodecReader...) triggered merges
concurrently (LUCENE-10216
<https://issues.apache.org/jira/browse/LUCENE-10216>, WIP-PR
<https://github.com/apache/lucene/pull/633>). I have some general questions
about the API's current implementation.

At the start of the API, we trigger a flush(triggerMerge: false,
applyAllDeletes: true)
<https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/IndexWriter.java#L3132>.
I was wondering why we need this. My understanding is that the readers
brought in by the addIndexes() API would be unrelated to any pending updates
or deletes.

I tried removing this call, and testExistingDeletes()
<https://github.com/apache/lucene/blob/main/lucene/core/src/test/org/apache/lucene/index/TestAddIndexes.java#L1022-L1052>
failed. This leads me to understand that we flush and apply all deletes
so that a pending delete by term does not affect incoming readers that
coincidentally contain docs with the same term.

Is this correct?
Also, since we may still get such a delete before the API completes, and
those deletes would get applied, this is likely a best-effort scenario,
right?
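
To make the scenario concrete, here is a minimal sketch of the ordering problem described above. It is a toy model in plain Java, not Lucene code: ToyWriter, Doc, and every method name are hypothetical, and the model assumes that buffered term deletes are applied to every doc the writer holds at the next flush. The real IndexWriter bookkeeping is more involved.

```java
import java.util.ArrayList;
import java.util.List;

public class PendingDeleteSketch {
    /** A doc reduced to its origin ("local" or "incoming") and a single term. */
    static final class Doc {
        final String origin;
        final String term;
        Doc(String origin, String term) { this.origin = origin; this.term = term; }
        @Override public String toString() { return origin + "/" + term; }
    }

    /** Hypothetical toy writer: deletes by term are buffered until the next flush. */
    static final class ToyWriter {
        private final List<Doc> docs = new ArrayList<>();
        private final List<String> pendingDeleteTerms = new ArrayList<>();

        void addDocument(String term) { docs.add(new Doc("local", term)); }

        void deleteByTerm(String term) { pendingDeleteTerms.add(term); }

        /** Assumption of this model: buffered deletes hit every doc held right now. */
        void flushAndApplyDeletes() {
            docs.removeIf(d -> pendingDeleteTerms.contains(d.term));
            pendingDeleteTerms.clear();
        }

        /** Brings in external docs, optionally resolving pending deletes first. */
        void addIndexes(List<String> incomingTerms, boolean flushFirst) {
            if (flushFirst) {
                flushAndApplyDeletes();
            }
            for (String term : incomingTerms) {
                docs.add(new Doc("incoming", term));
            }
        }

        List<Doc> commit() {
            flushAndApplyDeletes();
            return new ArrayList<>(docs);
        }
    }

    /** Runs the scenario: add a doc, delete it by term, then import a doc with the same term. */
    static List<Doc> run(boolean flushFirst) {
        ToyWriter writer = new ToyWriter();
        writer.addDocument("id:1");
        writer.deleteByTerm("id:1");          // still pending at this point
        List<String> incoming = new ArrayList<>();
        incoming.add("id:1");                 // same term, by coincidence
        writer.addIndexes(incoming, flushFirst);
        return writer.commit();
    }

    public static void main(String[] args) {
        System.out.println("flush first: " + run(true));
        System.out.println("no flush:    " + run(false));
    }
}
```

In this model, flushing first lets the pending delete resolve against the local doc only, so the incoming doc survives the commit; skipping the flush lets the same delete also remove the incoming doc. A delete issued while addIndexes is still running is not modeled here.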

On a related note, the regular merge for existing segments writes all
pending DV (doc values) updates before merging, but we skip this in the
addIndexes API. Should we be doing this in both places?

Thanks,
Vigya

From looking at https://issues.apache.org/jira/browse/LUCENE-2996, I
think your analysis is correct, but I wasn't around for that, so I am
just reading the historical record, same as you. To my way of
thinking, doing these opportunistic flushes clutters up the logic, and
ideally we would *not* do either kind of flushing (neither deletes nor
DV updates) here, but rather allow them to happen in the normal merging
flow. There may be some good practical reason for it, though; I'm not
sure. Perhaps we don't get a chance to run a "normal" merge soon after
doing addIndexes, so this is the best opportunity? In that case, yes, it
would make sense to me to also flush DV updates.

On Tue, Feb 1, 2022 at 3:04 AM Vigya Sharma <vigya.work@gmail.com> wrote:
>
> Hi All,
>
> I am working on a patch that would leverage the MergePolicy and MergeScheduler to run addIndexes(CodecReader...) triggered merges concurrently (LUCENE-10216, WIP-PR). I have some general questions about the API's current implementation.
>
> At the start of the API, we trigger a flush(triggerMerge: false, applyAllDeletes: true). I was wondering why we need this. My understanding is that the readers brought in by the addIndexes() API would be unrelated to any pending updates or deletes.
>
> I tried removing this call, and testExistingDeletes() failed. This leads me to understand that we flush and apply all deletes so that a pending delete by term does not affect incoming readers that coincidentally contain docs with the same term.
>
> Is this correct?
> Also, since we may still get such a delete before the API completes, and those deletes would get applied, this is likely a best-effort scenario, right?
>
> On a related note, the regular merge for existing segments writes all pending DV updates before merging, but we skip this in the addIndexes API. Should we be doing this in both places?
>
> Thanks,
> Vigya
>
