Mailing List Archive

Is there a way to customize segment names?
Hi Folks,

We're trying to build a search architecture using segment replication
(the indexer and searcher are separated, and the indexer ships new segments
to the searchers). One of the problems we're facing is this: for
availability reasons we need to have multiple indexers running, and when the
searcher switches from consuming one indexer to another, there is a chance
that segment names collide with each other (because segment names are
counter-based) and the searcher has to reload the whole index.
To avoid that, we're looking for a way to name the segments so that Lucene
is able to tell the difference and load only the difference (by calling
`openIfChanged`). I've checked IndexWriter and DocumentsWriter, and it
seems naming is controlled by a private final method, `newSegmentName()`,
so it's likely not possible there. So I wonder whether there are any other
ways people are aware of that can help control the segment names?

An example of the situation described above:
The searcher was previously consuming from indexer 1 and has the following
segments: _1, _2, _3, _4
Indexer 2 previously synced from indexer 1, sharing the first 3 segments,
and produced its own 4th segment (denoted _4', but it shares the same
"_4" name): _1, _2, _3, _4'
Suddenly indexer 1 dies and the searcher switches from indexer 1 to
indexer 2. When it finishes downloading the segments and tries to refresh
the reader, it will likely hit the exception here
<https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/StandardDirectoryReader.java#L218>,
and it seems all we can do right now is reload the whole index, which
could be costly.
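
A minimal sketch of why the names collide, assuming (as IndexWriter's
`newSegmentName()` appears to do) that a name is `_` followed by a per-index
counter rendered in base 36. The class `SegmentNameSketch` is hypothetical
and uses only the JDK; it is a model of the idea, not Lucene code.

```java
/** Simplified model of counter-based segment naming; not Lucene's actual class. */
final class SegmentNameSketch {
    private long counter;

    SegmentNameSketch(long startCounter) {
        this.counter = startCounter;
    }

    /** Returns the next name: "_" followed by the counter in base 36, then advances the counter. */
    String next() {
        return "_" + Long.toString(counter++, Character.MAX_RADIX);
    }

    public static void main(String[] args) {
        // Assumption: both indexers hold the same index state, so both counters stand at 4.
        SegmentNameSketch indexer1 = new SegmentNameSketch(4);
        SegmentNameSketch indexer2 = new SegmentNameSketch(4);
        String a = indexer1.next();
        String b = indexer2.next();
        // Each writer only sees its own counter, so nothing stops the names from matching.
        System.out.println(a + " vs " + b + " collide: " + a.equals(b));
    }
}
```

Two writers that start from the same replicated state hand out the same next
name, even though the files behind that name differ.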

Sorry for the long email and thank you in advance for any replies!

Best
Patrick
Re: Is there a way to customize segment names?
This multiple-writer approach isn't going to work, and customizing names
won't allow it anyway. Each file also contains a unique identifier tied to
its commit so that we know everything is intact.
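
The effect of such an identifier can be modeled roughly as follows. This is
a sketch under assumptions: `Segment` and `reusable` are hypothetical names,
the ids are plain byte arrays, and the real check inside Lucene's reader
reopen code compares more than is shown here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model of a name-plus-id check on reader refresh; not Lucene code. */
final class SegmentIdCheckSketch {

    /** Stand-in for a segment: its name and the unique id written into its files. */
    static final class Segment {
        final String name;
        final byte[] id;

        Segment(String name, byte[] id) {
            this.name = name;
            this.id = id;
        }
    }

    /**
     * Returns the names of segments a refreshed reader could keep from the old one.
     * Throws if a segment keeps its name but carries a different id.
     */
    static List<String> reusable(List<Segment> oldSegments, List<Segment> newSegments) {
        Map<String, byte[]> oldIds = new HashMap<>();
        for (Segment s : oldSegments) {
            oldIds.put(s.name, s.id);
        }
        List<String> reused = new ArrayList<>();
        for (Segment s : newSegments) {
            byte[] oldId = oldIds.get(s.name);
            if (oldId == null) {
                continue; // unseen name: a genuinely new segment, opened from scratch
            }
            if (!Arrays.equals(oldId, s.id)) {
                throw new IllegalStateException(
                    "segment " + s.name + " has the same name but a different id");
            }
            reused.add(s.name);
        }
        return reused;
    }
}
```

In this model, renaming `_4` to something else would only move the problem:
the id still records which writer produced the file.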

I would look at the segment replication in lucene/replicator and not
try to play games with files and mixing multiple writers.

On Thu, Dec 15, 2022 at 5:45 PM Patrick Zhai <zhai7631@gmail.com> wrote:

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
Re: Is there a way to customize segment names?
Hi Robert,

Maybe I didn't explain it clearly, but we're not going to constantly switch
between writers or share effort between writers. It's purely for
availability: the second writer only kicks in when the first writer is not
available for some reason.
And as far as I know, the replicator/nrt module does not provide a solution
for when the primary node (main indexer) is down. How would we recover with
a backup indexer?

Thanks
Patrick


On Thu, Dec 15, 2022 at 7:16 PM Robert Muir <rcmuir@gmail.com> wrote:

Re: Is there a way to customize segment names?
You are still talking about "multiple writers". Like I said, going down
this path (playing tricks with filenames) isn't going to work out well.

On Fri, Dec 16, 2022 at 2:48 AM Patrick Zhai <zhai7631@gmail.com> wrote:
Re: Is there a way to customize segment names?
+1, trying to coordinate multiple writers running independently will
not work. My 2c for availability: you can have a single primary active
writer with a backup one waiting, receiving all the segments from the
primary. Then if the primary goes down, the secondary one has the most
recent commit replicated from the primary (identical commit, same
segments, etc.) and can pick up from there. You would need a mechanism
to replay the writes the primary never had a chance to commit.
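
One possible shape for that replay mechanism, as a sketch only: it assumes
every write is first appended to a log that lives outside the primary (so it
survives the primary's death) and that each commit records the sequence
number of the last write it contains. `ReplayLogSketch` and its methods are
hypothetical names, not part of Lucene.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical write log kept outside the primary indexer; not part of Lucene. */
final class ReplayLogSketch {
    private final List<String> docs = new ArrayList<>();

    /** Appends a write and returns its sequence number (1-based). */
    long append(String doc) {
        docs.add(doc);
        return docs.size();
    }

    /** Returns the writes with a sequence number greater than lastCommittedSeqNo, in order. */
    List<String> pendingAfter(long lastCommittedSeqNo) {
        // Clamp so that a stale or oversized sequence number cannot index out of bounds.
        int from = (int) Math.min(Math.max(lastCommittedSeqNo, 0), docs.size());
        return new ArrayList<>(docs.subList(from, docs.size()));
    }
}
```

On failover, the backup would read the sequence number stored with the last
replicated commit, call `pendingAfter` with it, and index the returned writes
before accepting new ones.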

On Fri, Dec 16, 2022 at 5:41 AM Robert Muir <rcmuir@gmail.com> wrote:
Re: Is there a way to customize segment names?
Hi Mike, Robert

Thanks for replying. The system is almost like what Mike has described: one
writer is primary, and the other is trying to catch up and wait. But in our
internal discussion we found there might be a small chance that the
secondary mistakenly thinks it is the primary (due to errors in another
component) while the primary is still alive, and thus we end up in the
situation I described. And because we want to tolerate the error in case we
can't prevent it from happening, we're looking into customizing filenames.

Thanks again for discussing this with me. I've learnt that playing with
filenames can become quite troublesome, but still, even out of my own
curiosity, I want to understand whether we're able to control the segment
names in some way?

Best
Patrick


On Fri, Dec 16, 2022 at 6:36 AM Michael Sokolov <msokolov@gmail.com> wrote:

Re: Is there a way to customize segment names?
Hi,

Have you thought about storing additional metadata in the commits? If
you want to have custom information in each commit, just save some
internal tracking identifiers so you can figure out whether node A or
node B is primary by checking their latest commit metadata.

Generally I do not understand your request: do you want to give segments
completely custom filenames, e.g. via setters? This is impossible to do.
If you just want another "algorithm" to generate new segment names (it
is actually a base-36 ID) you can patch Lucene, but this would
not solve your problem.
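
A sketch of how such commit metadata might be used. In Lucene the key/value
pairs could be attached with `IndexWriter.setLiveCommitData(...)` before a
commit and read back from the commit's user data; the code below models them
as plain maps so that it runs on the JDK alone. The keys `writer_id` and
`primary_term`, and the rule that the higher term wins, are assumptions made
for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical use of per-commit user data to decide which node is primary. */
final class CommitMetadataSketch {
    static final String WRITER_KEY = "writer_id";  // hypothetical key
    static final String TERM_KEY = "primary_term"; // hypothetical key

    /** Builds the key/value pairs a writer would store with each commit. */
    static Map<String, String> commitData(String writerId, long term) {
        Map<String, String> data = new HashMap<>();
        data.put(WRITER_KEY, writerId);
        data.put(TERM_KEY, Long.toString(term));
        return data;
    }

    /** Reads the term from commit user data; a missing term counts as 0. */
    static long termOf(Map<String, String> userData) {
        String value = userData.get(TERM_KEY);
        return value == null ? 0L : Long.parseLong(value);
    }

    /** Returns the writer id of the commit with the higher term; ties go to the first commit. */
    static String primaryOf(Map<String, String> first, Map<String, String> second) {
        return termOf(first) >= termOf(second) ? first.get(WRITER_KEY) : second.get(WRITER_KEY);
    }
}
```

A searcher could run the same comparison before downloading anything and
refuse commits from a node whose term is older than the last one it accepted.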

Uwe

On 17.12.2022 at 01:28, Patrick Zhai wrote:
--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail:uwe@thetaphi.de
Re: Is there a way to customize segment names?
No, you can't control them. And we must not open up anything to try to
support this.

On Fri, Dec 16, 2022 at 7:28 PM Patrick Zhai <zhai7631@gmail.com> wrote:
>
> Hi Mike, Robert
>
> Thanks for replying, the system is almost like what Mike has described: one writer is primary,
> and the other is trying to catch up and wait, but in our internal discussion we found there might
> be small chances where the secondary mistakenly think itself as primary (due to errors of other component)
> while primary is still alive and thus goes into the situation I described.
> And because we want to tolerate the error in case we can't prevent it from happening, we're looking for customizing
> filenames.
>
> Thanks again for discussing this with me and I've learnt that playing with filenames can become quite
> troublesome, but still, even out of my own curiosity, I want to understand whether we're able to control
> the segment names in some way?
>
> Best
> Patrick
>
>
> On Fri, Dec 16, 2022 at 6:36 AM Michael Sokolov <msokolov@gmail.com> wrote:
>>
>> +1 trying to coordinate multiple writers running independently will
>> not work. My 2c for availability: you can have a single primary active
>> writer with a backup one waiting, receiving all the segments from the
>> primary. Then if the primary goes down, the secondary one has the most
>> recent commit replicated from the primary (identical commit, same
>> segments etc) and can pick up from there. You would need a mechanism
>> to replay the writes the primary never had a chance to commit.
>>
>> On Fri, Dec 16, 2022 at 5:41 AM Robert Muir <rcmuir@gmail.com> wrote:
>> >
>> > You are still talking "Multiple writers". Like i said, going down this
>> > path (playing tricks with filenames) isn't going to work out well.
>> >
>> > On Fri, Dec 16, 2022 at 2:48 AM Patrick Zhai <zhai7631@gmail.com> wrote:
>> > >
>> > > Hi Robert,
>> > >
>> > > Maybe I didn't explain it clearly but we're not going to constantly switch
>> > > between writers or share effort between writers, it's purely for
>> > > availability: the second writer only kicks in when the first writer is not
>> > > available for some reason.
>> > > And as far as I know the replicator/nrt module has not provided a solution
>> > > on when the primary node (main indexer) is down, how would we recover with
>> > > a back up indexer?
>> > >
>> > > Thanks
>> > > Patrick
>> > >
>> > >
>> > > On Thu, Dec 15, 2022 at 7:16 PM Robert Muir <rcmuir@gmail.com> wrote:
>> > >
>> > > > This multiple-writer isn't going to work and customizing names won't
>> > > > allow it anyway. Each file also contains a unique identifier tied to
>> > > > its commit so that we know everything is intact.
>> > > >
>> > > > I would look at the segment replication in lucene/replicator and not
>> > > > try to play games with files and mixing multiple writers.
>> > > >
>> > > > On Thu, Dec 15, 2022 at 5:45 PM Patrick Zhai <zhai7631@gmail.com> wrote:
>> > > > >
>> > > > > Hi Folks,
>> > > > >
>> > > > > We're trying to build a search architecture using segment replication
>> > > > (indexer and searcher are separated and indexer shipping new segments to
>> > > > searchers) right now and one of the problems we're facing is: for
>> > > > availability reason we need to have multiple indexers running, and when the
>> > > > searcher is switching from consuming one indexer to another, there are
>> > > > chances where the segment names collide with each other (because segment
>> > > > names are count based) and the searcher have to reload the whole index.

Re: Is there a way to customize segment names? [ In reply to ]
Hi Robert, got it, thanks!

Hi Uwe, yes, we do have a way to detect whether a segment was created by
node A or B even if they share the same name. However, Lucene does not
allow such a situation (same name but generated by a different writer) when
calling `openIfChanged` to incrementally load the new index. So what
I want is to attach a prefix (or postfix, anything lol) to the segment
name, say "A_4" and "B_4", so that when DirectoryReader does
`openIfChanged` it proceeds without throwing an exception.
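
For readers following the thread, here is a minimal toy model of the
collision described above. It is not Lucene code: `ToyWriter`, `Segment` and
`refresh` are hypothetical names, the name format (underscore plus a base-36
counter) is an assumption, and the random id only stands in for the
per-file unique identifier Robert mentioned.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.UUID;

/** Toy model (not Lucene code): count-based segment names plus a per-segment random id. */
public class SegmentNameCollision {

    /** A segment as this sketch sees it: a name and an id unique to the flush that created it. */
    static final class Segment {
        final String name;
        final String id;

        Segment(String name, String id) {
            this.name = name;
            this.id = id;
        }
    }

    /** Hypothetical writer that derives segment names from a counter only. */
    static final class ToyWriter {
        private long counter;

        ToyWriter(long startCounter) {
            this.counter = startCounter;
        }

        Segment flush() {
            // The name depends only on the counter; the id is random per flush.
            String name = "_" + Long.toString(counter++, Character.MAX_RADIX);
            return new Segment(name, UUID.randomUUID().toString());
        }
    }

    /** Hypothetical incremental refresh: reuse known segments, reject a same-name segment with another id. */
    static Map<String, Segment> refresh(Map<String, Segment> open, List<Segment> latest) {
        Map<String, Segment> next = new LinkedHashMap<>();
        for (Segment s : latest) {
            Segment old = open.get(s.name);
            if (old != null && !old.id.equals(s.id)) {
                throw new IllegalStateException(
                    "segment " + s.name + " was replaced by one from a different writer");
            }
            next.put(s.name, old != null ? old : s);
        }
        return next;
    }

    public static void main(String[] args) {
        ToyWriter indexer1 = new ToyWriter(4);
        ToyWriter indexer2 = new ToyWriter(4); // synced from indexer 1, so same counter
        Segment a = indexer1.flush();
        Segment b = indexer2.flush();
        System.out.println(a.name + " vs " + b.name + ", same id: " + a.id.equals(b.id));

        Map<String, Segment> open = new LinkedHashMap<>();
        open.put(a.name, a);
        try {
            refresh(open, Collections.singletonList(b));
        } catch (IllegalStateException e) {
            System.out.println("refresh rejected: " + e.getMessage());
        }
    }
}
```

In this model a prefix such as "A_4" / "B_4" would get past the name check,
but it says nothing about whether the two segments hold the same documents.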

Re: Is there a way to customize segment names? [ In reply to ]
Hi Patrick,

This is an interesting question, and from what I understand, I see
correctness problems in what you're trying to implement. Let me make sure I
understand correctly...

So indexer-1 created segments 1, 2, 3, 4 and indexer-2 created segments 1',
2', 3', 4' independently (they just have the same segment names, i.e.
name(1).equals(name(1'))). Now you have a reader opened on (1, 2, 3) from
indexer-1, and you're trying to call openIfChanged() to load only segment
4' into it, after downloading 1', 2', 3' and 4' into your indexDir.

But indexer-1 and indexer-2 would have flushed at different intervals, and
would likely have different documents (and doc counts) in each segment. So
even if Lucene allowed it, you could miss some data if your reader is opened
on (1, 2, 3, 4'). Worse, you might double count data. In general, it opens
the door to all sorts of corruption bugs, which is indeed what the exception
you link says.
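
To make the missed/double-counted data concrete, here is a small
self-contained sketch. It assumes both indexers see the same stream of
documents 1..8 and differ only in where they flush; the class and method
names are made up for illustration and nothing here touches Lucene itself.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

/** Toy illustration: mixing segments from two writers that flushed at different points. */
public class MixedSegmentsDemo {

    /** Splits one document stream into segments of the given sizes (a stand-in for flush timing). */
    static List<List<Integer>> flush(List<Integer> docs, int... sizes) {
        List<List<Integer>> segments = new ArrayList<>();
        int from = 0;
        for (int size : sizes) {
            segments.add(new ArrayList<>(docs.subList(from, from + size)));
            from += size;
        }
        return segments;
    }

    /** Docs visible to a reader holding segments 1-3 of writer a and segment 4 of writer b. */
    static List<Integer> mixedView(List<List<Integer>> a, List<List<Integer>> b) {
        List<Integer> view = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            view.addAll(a.get(i));
        }
        view.addAll(b.get(3));
        return view;
    }

    /** Docs from the full stream that the view does not contain. */
    static TreeSet<Integer> missing(List<Integer> all, List<Integer> view) {
        TreeSet<Integer> result = new TreeSet<>(all);
        result.removeAll(view);
        return result;
    }

    /** Docs that appear more than once in the view. */
    static TreeSet<Integer> duplicated(List<Integer> view) {
        TreeSet<Integer> seen = new TreeSet<>();
        TreeSet<Integer> dup = new TreeSet<>();
        for (Integer d : view) {
            if (!seen.add(d)) {
                dup.add(d);
            }
        }
        return dup;
    }

    public static void main(String[] args) {
        List<Integer> docs = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8);
        List<List<Integer>> indexer1 = flush(docs, 2, 2, 2, 2);

        // indexer-2 flushed later for the first segments: its 4' holds only doc 8.
        List<Integer> viewA = mixedView(indexer1, flush(docs, 3, 2, 2, 1));
        System.out.println("missing: " + missing(docs, viewA));

        // indexer-2 flushed segment 3' early: its 4' holds docs 6, 7, 8.
        List<Integer> viewB = mixedView(indexer1, flush(docs, 2, 2, 1, 3));
        System.out.println("duplicated: " + duplicated(viewB));
    }
}
```

With these made-up flush points, the first mixed view loses document 7 and
the second sees document 6 twice.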

If you're sure that you only want the incremental data from segment 4',
have you considered adding it via the addIndexes(Directory...) API, which
simply copies the segment into the dir? openIfChanged() would then work
like you (seem to) want.

It seems like a bad idea to silently replace segments from under open
readers while tricking them into thinking the originals are still present.
Thankfully, Lucene provides important protections against it. Apologies if
I misunderstood your use case.


--
- Vigya