Mailing List Archive: Deterministic index construction

Deterministic index construction

zhai7631 at gmail

Dec 18, 2020, 10:26 AM

Post #1 of 9

Hi,
Our team is looking for a way to construct (or rebuild) a deterministic sorted
index concurrently. (I know Lucene can do this sequentially, but that is
sometimes too slow for us.)
We currently have roughly two ideas, both assuming there is a pre-built index
and that we have dumped a doc-to-segment map, so that IndexWriter can know
which doc belongs to which segment:
1. First build the index in the normal way (concurrently); once it is built,
use the "addIndexes" functionality to merge documents into the correct
segment.
2. By controlling FlushPolicy and other related classes, make sure each
segment created (before merging) contains only documents that belong to one
of the segments in the pre-built index. Then create a dedicated MergePolicy
that only merges segments belonging to the same pre-built segment.

We think the first one is easier to implement and the second one is
faster. We would welcome ideas, suggestions, and feedback.

Thanks
Patrick Zhai

Re: Deterministic index construction

jpountz at gmail

Dec 19, 2020, 8:38 AM

Post #2 of 9

Have you considered leveraging Lucene's built-in index sorting? It supports
concurrent indexing and is quite fast.

--
Adrien

Re: Deterministic index construction

msokolov at gmail

Dec 19, 2020, 9:11 AM

Post #3 of 9

I think the idea is to exert control over the distribution of documents
among the segments, in a deterministic reproducible way.

Re: Deterministic index construction

msokolov at gmail

Dec 19, 2020, 9:13 AM

Post #4 of 9

I don't know about addIndexes. Does that let you say which document goes
where somehow? Wouldn't you have to select a subset of documents from each
originally indexed segment?

Re: Deterministic index construction

zhai7631 at gmail

Dec 19, 2020, 11:50 AM

Post #5 of 9

Hi Adrien,
I think Mike's comment is correct: we already have a sorted index, but we want
to reconstruct an index with exactly the same number of segments, where each
segment contains exactly the same documents.

Mike,
addIndexes can take CodecReader as input [1], which I think lets us pass in
a customized filtering reader (a FilterCodecReader, say), so it knows which
docs to take. Suppose the original index has N segments: we could open N
IndexWriters concurrently to rebuild those N segments, and finally merge them
back into a single index somehow. (I am not sure we can do that last step
easily, but it doesn't sound too hard?)

[1]
https://lucene.apache.org/core/7_4_0/core/org/apache/lucene/index/IndexWriter.html#addIndexes-org.apache.lucene.index.CodecReader...-

Re: Deterministic index construction

lucene at mikemccandless

Dec 20, 2020, 7:23 AM

Post #6 of 9

I think the addIndexes approach could work as Haoyu describes! One
IndexWriter per segment in the original source index, using a filtering
reader (e.g. FilterCodecReader) to ... mark all documents NOT in the target
segment as deleted?

For the final step, you could use addIndexes(Directory[]), which more or
less does a simple file copy of the incoming segments' files.

But this is a whole extra and costly-sounding step that might undo the
wall-clock speedup from the concurrent indexing in the first pass. Maybe
it is still faster net/net than what luceneutil benchmarks do, which is
single-threaded-everything (single indexing thread, SerialMergeScheduler,
LogDocMergePolicy)?

The second option Haoyu listed sounds interesting too! Could we somehow
build a new index, concurrently, but force certain docs to go to certain
in-memory segments (DWPTs)? Today the routing of incoming indexing threads
to DWPTs is sort of random, but there is indeed a dedicated internal class
that decides that: DocumentsWriterPerThreadPool. And here is a fun PR
that Adrien is working on to improve how threads are scheduled onto
in-memory segments, to try to create larger initially flushed segments and
less merge pressure as a result:
https://github.com/apache/lucene-solr/pull/1912

If we could carefully guide threads to the right DWPT when indexing the
second time, and then use a custom MergePolicy that is also careful to only
merge segments that "belong" together, and the index is sorted, I think you
would get the same segment geometry in the end, and exactly the same
documents in each segment? This would likely be nearly as fast as freely
building an index concurrently! It would be a nice addition to luceneutil
benchmarks too, since right now it takes crazy long to build the
deterministic index.

Mike McCandless

http://blog.mikemccandless.com
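
To make the "mark everything NOT in the target segment as deleted" idea concrete, here is a minimal plain-Java sketch with no Lucene dependency. It only models the decision a filtering reader would have to make for one target segment: which docs of a source segment stay live. The names LiveDocsMask, liveDocsFor, idByDoc and idToTargetSegment are hypothetical, and the sketch assumes the dumped doc-segment map is keyed by a stable external id, since internal doc ids would likely differ between builds.

```java
import java.util.BitSet;
import java.util.Map;

/** Plain-Java model of a per-target-segment live-docs mask; not Lucene code. */
final class LiveDocsMask {

    /**
     * Returns a bit set over the source segment's doc ids where a set bit
     * means "keep this doc for targetSegment". Docs missing from the map
     * are treated as deleted.
     *
     * idByDoc: hypothetical external id of each doc, indexed by doc id.
     * idToTargetSegment: the dumped doc-segment map (assumed shape).
     */
    static BitSet liveDocsFor(int targetSegment, String[] idByDoc,
                              Map<String, Integer> idToTargetSegment) {
        BitSet live = new BitSet(idByDoc.length);
        for (int doc = 0; doc < idByDoc.length; doc++) {
            Integer target = idToTargetSegment.get(idByDoc[doc]);
            if (target != null && target == targetSegment) {
                live.set(doc);
            }
        }
        return live;
    }

    public static void main(String[] args) {
        String[] idByDoc = {"a", "b", "c", "d"};
        Map<String, Integer> map = Map.of("a", 0, "b", 1, "c", 0, "d", 1);
        System.out.println(liveDocsFor(0, idByDoc, map)); // prints {0, 2}
        System.out.println(liveDocsFor(1, idByDoc, map)); // prints {1, 3}
    }
}
```

In a real implementation, a mask like this would presumably back the live-docs view of a filtering reader passed to addIndexes, with one such reader per source segment and target segment.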

Re: Deterministic index construction

dsmiley at apache

Dec 23, 2020, 1:30 PM

Post #7 of 9

I like Mike McCandless's suggestion of controlling which DWPT (and thus
segment) an incoming document goes to. I've thought of this before for a
different use case: grouping documents into segments by the underlying
"type" of the document. This could make sense for a use case that queries
by document type, where you don't want to create an index per document type
(maybe because the index is too small to warrant it). It could even be
used in a soft, hint-like way rather than as an absolute strict
separation. For example, if some subset of DWPTs is known to hold
docs of a given type, then add incoming docs of that type to any of those
and not the others. But if none exist, then just add to any DWPT. I also
thought of this sort of thing at the MergePolicy level, but at that point
any mixing of doc types has already occurred and the MP can't separate them;
it can only combine, though it can try to avoid introducing too much mixing.
It would be nice if it were possible to atomically merge some documents in
a segment but not the whole segment, thus still leaving the segment in
place but with the extracted documents marked deleted. This is similar to
"shard splitting" (index splitting), but done atomically/transactionally.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley

Re: Deterministic index construction

simon.willnauer at gmail

Jan 5, 2021, 3:08 AM

Post #8 of 9

You can do something similar to this today by exploiting the
addDocuments/updateDocuments API, which takes an Iterable of documents. All
docs in this iterable will be sent to the same segment, in order. If you have
multiple threads, you can feed a defined number of docs per iterable
(stream them to be memory efficient) and then let them go at the same
time. This way you have thread affinity (we had this in the early days
of DWPT; I'd be reluctant to make it configurable again). Then, with a
custom merge policy, you should be able to get the exact same number of
segments without remerging etc. There is some sync overhead on top, but it's
doable, I think.

simon
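
The batching described above can be sketched in plain Java. This is only a model under stated assumptions, not Lucene code: BatchSink is a hypothetical stand-in for a call that hands one whole batch to the writer (such as addDocuments), documents are represented by their ids, and docToTarget is the assumed shape of the dumped doc-segment map. For brevity it builds a single batch per target segment; a large target segment could equally be split into several batches that carry the same target.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Plain-Java model of "one batch per target segment, fed from several threads". */
final class SegmentBatches {

    /** Hypothetical stand-in for handing one whole batch to the index writer. */
    interface BatchSink {
        void addBatch(int targetSegment, List<String> docs);
    }

    /** Groups docs by target segment, preserving input order within each group. */
    static Map<Integer, List<String>> groupByTarget(List<String> docsInOrder,
                                                    Map<String, Integer> docToTarget) {
        Map<Integer, List<String>> batches = new TreeMap<>();
        for (String doc : docsInOrder) {
            Integer target = docToTarget.get(doc);
            if (target == null) {
                throw new IllegalArgumentException("no target segment for " + doc);
            }
            batches.computeIfAbsent(target, k -> new ArrayList<>()).add(doc);
        }
        return batches;
    }

    /** Submits one task per batch, so each batch reaches the sink as a single call. */
    static void indexConcurrently(Map<Integer, List<String>> batches,
                                  BatchSink sink, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> pending = new ArrayList<>();
            for (Map.Entry<Integer, List<String>> e : batches.entrySet()) {
                pending.add(pool.submit(() -> sink.addBatch(e.getKey(), e.getValue())));
            }
            for (Future<?> f : pending) {
                f.get(); // wait for completion and surface any failure
            }
        } catch (InterruptedException | ExecutionException ex) {
            throw new IllegalStateException("batch indexing failed", ex);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> docs = List.of("d1", "d2", "d3", "d4");
        Map<String, Integer> target = Map.of("d1", 0, "d2", 1, "d3", 0, "d4", 1);
        Map<Integer, List<String>> batches = groupByTarget(docs, target);
        System.out.println(batches); // prints {0=[d1, d3], 1=[d2, d4]}
        indexConcurrently(batches, (t, d) -> { /* hand the batch to the writer here */ }, 2);
    }
}
```

The point of the design is that the unit of concurrency is the batch, not the document, so no batch is ever interleaved with documents destined for another target segment.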

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: Deterministic index construction

lucene at mikemccandless

Jan 14, 2021, 2:31 PM

Post #9 of 9

Thanks Simon, I think that could work!

At first I didn't think it would work, because we cannot control flushing,
but then I realized you could ask IndexWriter to flush after N documents
(rather than by RAM usage, its default), and then always send >= N documents
in each stream you index (via IW.updateDocuments). This ensures that IW
flushes a new segment immediately after your stream of >= N documents is
done, and the initially flushed segments will always contain documents from
a single final segment of the index you are re-creating.

Then, yeah, a custom MergePolicy that chooses segments to merge that
belong to the same eventual segment.

If you are using index sorting, you can be a bit loosey-goosey here, indexing
your source documents in any convenient order and merging segments in any
order, since Lucene will re-sort the docs back into the target order. But
if not, you must stream your documents in precisely the same order as in the
index you are trying to recreate. You could probably use
ConcurrentMergeScheduler in either case!

This is a nice solution ... you'd get nearly the full (concurrent!)
performance of the original unconstrained index creation, with no second
step to re-organize the index. Impressive :) This would be super helpful
for benchmarking changes that alter the search index.

Mike McCandless

http://blog.mikemccandless.com
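
The merge-selection rule described above ("only merge segments that belong to the same eventual segment") can be modelled in a few lines of plain Java. This is a sketch, not a Lucene MergePolicy: FlushedSegment and selectMerges are hypothetical names, and how a flushed segment gets tagged with its target segment is an assumption the thread leaves open. A real implementation would presumably apply the same grouping inside a custom MergePolicy.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Plain-Java model of target-aware merge selection; not a Lucene MergePolicy. */
final class TargetAwareMergeSelector {

    /** A flushed segment tagged with the target segment it belongs to (assumed tagging). */
    static final class FlushedSegment {
        final String name;
        final int targetSegment;

        FlushedSegment(String name, int targetSegment) {
            this.name = name;
            this.targetSegment = targetSegment;
        }
    }

    /**
     * Proposes one merge per target segment that currently has more than one
     * flushed segment. Segments with different targets are never combined.
     */
    static List<List<String>> selectMerges(List<FlushedSegment> segments) {
        Map<Integer, List<String>> byTarget = new TreeMap<>();
        for (FlushedSegment s : segments) {
            byTarget.computeIfAbsent(s.targetSegment, k -> new ArrayList<>()).add(s.name);
        }
        List<List<String>> merges = new ArrayList<>();
        for (List<String> group : byTarget.values()) {
            if (group.size() > 1) {
                merges.add(group); // a lone segment already matches its target
            }
        }
        return merges;
    }

    public static void main(String[] args) {
        List<FlushedSegment> flushed = List.of(
            new FlushedSegment("_0", 0),
            new FlushedSegment("_1", 1),
            new FlushedSegment("_2", 0),
            new FlushedSegment("_3", 2));
        System.out.println(selectMerges(flushed)); // prints [[_0, _2]]
    }
}
```

With index sorting enabled, the order of segments inside each proposed merge should not matter, per the explanation above; without it, the groups would also need to preserve the original document order.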
