Mailing List Archive: Should IndexWriter.flush return seqNo?

Should IndexWriter.flush return seqNo?

zhai7631 at gmail

Apr 19, 2023, 10:28 PM

Post #1 of 8

Hi folks,
I just realized that while "commit" returns the sequence number of the
latest operation included in the commit, "flush" still returns nothing.
Since the two are essentially the same apart from the fsync, I wonder
whether there is a specific reason for flush not to return one as well?

Best
Patrick

Re: Should IndexWriter.flush return seqNo?

rcmuir at gmail

Apr 21, 2023, 7:16 AM

Post #2 of 8

This is not true: if I call IndexWriter.commit, then I can open an
IndexReader and see the documents.

IndexWriter.flush doesn't really do anything at all: it just moves stuff
from RAM to disk, but not in a way that an IndexReader can see it, right?

It doesn't make much sense that this method is public in the API, and
adding a sequence number definitely makes no sense, since nothing was
committed here.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: Should IndexWriter.flush return seqNo?

zhai7631 at gmail

Apr 21, 2023, 11:11 AM

Post #3 of 8

Hi Rob,
Thanks for explaining, that makes sense to me.

Patrick

Re: Should IndexWriter.flush return seqNo?

uwe at thetaphi

Apr 23, 2023, 3:19 AM

Post #4 of 8

Hi,

On 21.04.2023 at 16:16, Robert Muir wrote:
> This is not true: if I call IndexWriter.commit, then I can open an
> IndexReader and see the documents.
>
> IndexWriter.flush doesn't really do anything at all: it just moves stuff
> from RAM to disk, but not in a way that an IndexReader can see it, right?

Yes, that's true. I just have to add: you can still open an NRT reader
directly from IndexWriter. But you don't need a sequence number there, as
it's hidden completely. So flushing is fine to allow users to get a new
NRT reader with the state up to that point, but it does not need to
return anything.

Having the sequence number public in the API does not bring any benefit, as
you cannot use it for anything.

> It doesn't make much sense that this method is public in the API, and
> adding a sequence number definitely makes no sense, since nothing was
> committed here.
+1
--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: uwe@thetaphi.de

Re: Should IndexWriter.flush return seqNo?

rcmuir at gmail

Apr 23, 2023, 3:31 AM

Post #5 of 8

> Yes, that's true. I just have to add: you can still open an NRT reader
> directly from IndexWriter. But you don't need a sequence number there, as
> it's hidden completely. So flushing is fine to allow users to get a new
> NRT reader with the state up to that point, but it does not need to
> return anything.

Uwe, sorry, I must correct you: flushing doesn't do that. It doesn't
allow you to get an NRT reader or any other type of reader. It is the
same as if you filled up the RAM buffer with documents, that is all. If
you want an NRT reader you should be calling openIfChanged (and calling
flush yourself is irrelevant/unnecessary). The two methods are
completely separate and, to me, unrelated. That's why flush makes no sense
in the API.

Re: Should IndexWriter.flush return seqNo?

lucene at mikemccandless

Apr 25, 2023, 6:05 AM

Post #6 of 8

On Sun, Apr 23, 2023 at 6:19 AM Uwe Schindler <uwe@thetaphi.de> wrote:

> Having the sequence number public in API does not bring any benefit, as
> you cannot use it for anything.

Actually there are some interesting use cases for sequence numbers:

They enable the caller to know the effective order of operations of
concurrent indexing events. This can be useful for applications that might
sometimes update the same document at the same time across threads: they
can implement optimistic concurrency, re-indexing the document if the
order was not correct according to the application's external version
tracking for out-of-order updates. OpenSearch has an array of locks to
implement pessimistic concurrency (ensuring that the same id is never
updated concurrently), but for cases where conflicts are rare, the
optimistic implementation based on Lucene's sequence numbers is likely
more efficient.
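
To make the optimistic idea concrete, here is a minimal sketch. It is a toy
model only: ToyWriter and OptimisticIndexer are hypothetical names, not Lucene
or OpenSearch classes, and the "index" is just a map from id to the external
version that was written last. The only property borrowed from Lucene is that
each update returns a strictly increasing sequence number. The indexer writes
without holding a per-id lock, then uses the returned sequence number to check
whether a stale version ended up as the effective last write, and re-indexes
the newest version if so.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

/** Toy stand-in for an index writer (hypothetical, not Lucene's IndexWriter). */
class ToyWriter {
    private final AtomicLong nextSeqNo = new AtomicLong();
    private final Map<String, Long> indexed = new HashMap<>();

    /** Last write wins, as with an update-by-id; returns this operation's sequence number. */
    synchronized long update(String id, long externalVersion) {
        indexed.put(id, externalVersion);
        return nextSeqNo.incrementAndGet();
    }

    synchronized Long indexedVersion(String id) {
        return indexed.get(id);
    }
}

/** Optimistic indexer: write first, then use sequence numbers to detect and repair misordering. */
class OptimisticIndexer {
    private static final class State {
        long newestVersion = Long.MIN_VALUE; // highest external version seen for this id
        long lastSeqNo = -1;                 // highest sequence number seen for this id
        long versionAtLastSeqNo;             // external version written by that operation
    }

    private final ToyWriter writer;
    private final Map<String, State> states = new HashMap<>();
    int repairs = 0; // number of re-index operations triggered by misordering

    OptimisticIndexer(ToyWriter writer) {
        this.writer = writer;
    }

    void index(String id, long externalVersion) {
        long version = externalVersion;
        while (true) {
            long seqNo = writer.update(id, version); // no per-id lock held while indexing
            synchronized (this) {
                State s = states.computeIfAbsent(id, k -> new State());
                s.newestVersion = Math.max(s.newestVersion, version);
                if (seqNo > s.lastSeqNo) {
                    s.lastSeqNo = seqNo;
                    s.versionAtLastSeqNo = version;
                }
                if (s.versionAtLastSeqNo == s.newestVersion) {
                    return; // the effective last write is the newest version
                }
                // A stale version was the effective last write: re-index the newest one.
                version = s.newestVersion;
                repairs++;
            }
        }
    }
}
```

For example, indexing version 2 of "doc1" and then a late-arriving version 1
leaves version 2 in the toy index after one repair. In Lucene itself the
document-level IndexWriter methods such as updateDocument return the sequence
number as a long; how an application tracks external versions is up to the
application.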

Another use case is precise indexing operation replay (e.g. from a Kinesis
queue or transaction log or whatever) on recovering from a commit point:
upon commit, you know which precise indexing event was captured in the
commit, and on recovering you can resume indexing from precisely the next
indexing event. This doesn't matter for idempotent updates, but, for other
cases like append only, it is useful and performant.
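
The replay use case can be sketched in the same toy style. Everything here is
an assumption made for illustration: ToyIndex and ReplayDemo are hypothetical
classes, the "log" is an in-memory list standing in for a queue or transaction
log, and the seqNo-to-offset map is kept in memory even though a real
application would have to persist the resume position somewhere durable. The
one property taken from the discussion is that commit returns the sequence
number of the last operation it captured, so every operation with a sequence
number at or below that value is in the commit.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy append-only index (hypothetical, not Lucene): adds are lost on crash unless committed. */
class ToyIndex {
    private long lastSeqNo = 0;
    private final List<String> pending = new ArrayList<>();
    private final List<String> committed = new ArrayList<>();

    long add(String doc) {
        pending.add(doc);
        return ++lastSeqNo;
    }

    /** Persists all pending adds; every operation with seqNo <= the result is in the commit. */
    long commit() {
        committed.addAll(pending);
        pending.clear();
        return lastSeqNo;
    }

    /** Simulated crash: whatever was not committed is gone. */
    void crash() {
        pending.clear();
    }

    List<String> committedDocs() {
        return new ArrayList<>(committed);
    }
}

class ReplayDemo {
    /** First log offset that is NOT covered by the commit. */
    static int resumeOffset(TreeMap<Long, Integer> offsetBySeqNo, long committedSeqNo) {
        Map.Entry<Long, Integer> last = offsetBySeqNo.floorEntry(committedSeqNo);
        return last == null ? 0 : last.getValue() + 1;
    }

    static List<String> run() {
        List<String> log = Arrays.asList("a", "b", "c", "d");
        ToyIndex index = new ToyIndex();
        TreeMap<Long, Integer> offsetBySeqNo = new TreeMap<>();

        // Index the first two events and commit them.
        for (int i = 0; i < 2; i++) {
            offsetBySeqNo.put(index.add(log.get(i)), i);
        }
        long committedSeqNo = index.commit();

        // Index a third event, then crash before it is committed.
        offsetBySeqNo.put(index.add(log.get(2)), 2);
        index.crash();

        // Recover: resume from precisely the next event after the commit.
        int from = resumeOffset(offsetBySeqNo, committedSeqNo);
        for (int i = from; i < log.size(); i++) {
            index.add(log.get(i));
        }
        index.commit();
        return index.committedDocs();
    }
}
```

Because replay resumes at the first event not covered by the commit, the
append-only toy index ends up with each event exactly once, with nothing
duplicated and nothing skipped.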

I also don't see why flush should return a sequence number -- it is not an
externally visible event. Patrick, maybe you had an interesting use case in
mind? Note that commit also writes (and fsyncs) the next segments_N file,
to light up all the newly written/fsync'd segments for the next reader to open.
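
The difference in visibility between flush and commit can be pictured with a
small toy model. ToySegmentWriter is a hypothetical class, not Lucene's
IndexWriter: "segments" are plain lists, and the commit point is a field
standing in for the segments_N file. The near-real-time reader path discussed
earlier in the thread is deliberately not modeled; the sketch only covers a
reader opened on the last commit point.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model (hypothetical, not Lucene's classes) of buffered, flushed and committed state. */
class ToySegmentWriter {
    private final List<String> ramBuffer = new ArrayList<>();
    // Segments written to "disk" but not necessarily referenced by any commit point.
    private final List<List<String>> flushedSegments = new ArrayList<>();
    // What the latest "segments_N" lists; only this is visible to a newly opened reader.
    private List<List<String>> commitPoint = new ArrayList<>();

    void addDocument(String doc) {
        ramBuffer.add(doc);
    }

    /** Moves buffered docs into a new on-disk segment; the commit point is untouched. */
    void flush() {
        if (!ramBuffer.isEmpty()) {
            flushedSegments.add(new ArrayList<>(ramBuffer));
            ramBuffer.clear();
        }
    }

    /** Flushes, then publishes a new commit point that references all flushed segments. */
    void commit() {
        flush();
        commitPoint = new ArrayList<>(flushedSegments);
    }

    /** What a reader opened on the last commit point would see. */
    List<String> openReader() {
        List<String> visible = new ArrayList<>();
        for (List<String> segment : commitPoint) {
            visible.addAll(segment);
        }
        return visible;
    }
}
```

In this model a document that has only been flushed is on disk but invisible
to openReader; it becomes visible once commit publishes a new commit point.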

Mike McCandless

http://blog.mikemccandless.com

Re: Should IndexWriter.flush return seqNo?

ichattopadhyaya at gmail

Apr 25, 2023, 7:25 PM

Post #7 of 8

I think Apache Solr could explore leveraging the returned sequence number
for its transaction logs.

Re: Should IndexWriter.flush return seqNo?

zhai7631 at gmail

Apr 26, 2023, 4:19 PM

Post #8 of 8

> Patrick maybe you had an interesting use case in mind?

I had one, but later on I found out that I don't necessarily need flush to
achieve it, so it's not really a valid use case that definitely needs
flush...
