I think the addIndexes approach could work as Haoyu describes! One
IndexWriter per segment in the original source index, using
FilterIndexReader to ... mark all documents NOT in the target segment as
deleted?
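In recent Lucene the wrapper for the addIndexes(CodecReader...) path would be FilterCodecReader; a minimal sketch of that filtering reader follows. The keep bitset (which local docIDs belong to the target segment) is an assumption here -- it would presumably come from the dumped doc-segment map:

```java
import org.apache.lucene.index.CodecReader;
import org.apache.lucene.index.FilterCodecReader;
import org.apache.lucene.util.Bits;
import org.apache.lucene.util.FixedBitSet;

// Wraps one source segment so that only the docs destined for the target
// segment appear live; addIndexes(CodecReader...) will drop the rest.
final class TargetSegmentReader extends FilterCodecReader {
  private final FixedBitSet keep; // hypothetical: local docIDs belonging to the target segment
  private final int numDocs;

  TargetSegmentReader(CodecReader in, FixedBitSet keep) {
    super(in);
    // NOTE: if the source segment already has deletions, keep should first
    // be intersected with in.getLiveDocs().
    this.keep = keep;
    this.numDocs = keep.cardinality();
  }

  @Override
  public Bits getLiveDocs() {
    // Everything not in keep looks deleted to addIndexes.
    return keep;
  }

  @Override
  public int numDocs() {
    return numDocs;
  }

  @Override
  public CacheHelper getCoreCacheHelper() {
    return null; // no caching for this throwaway wrapper
  }

  @Override
  public CacheHelper getReaderCacheHelper() {
    return null;
  }
}
```

One such reader per (source segment, target segment) pair would then be passed to the target segment's IndexWriter via addIndexes.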
For the final step, you could use addIndexes(Directory[]), which more or
less does a simple file copy of the incoming segments' files.
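That final stitch might look roughly like this (a sketch; perSegmentDirs, the N single-segment indexes built concurrently, is an assumption):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;

public class StitchSegments {
  // perSegmentDirs: the N single-segment indexes built concurrently
  static void stitch(Directory target, Directory[] perSegmentDirs) throws Exception {
    try (IndexWriter writer = new IndexWriter(target,
        new IndexWriterConfig(new StandardAnalyzer()))) {
      // addIndexes(Directory...) more or less copies the incoming segments'
      // files wholesale; no re-indexing happens here.
      writer.addIndexes(perSegmentDirs);
      // Deliberately no forceMerge: we want to keep the segment geometry.
      writer.commit();
    }
  }
}
```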
But this is a whole extra and costly-sounding step that might undo the
wall-clock speedup from the concurrent indexing in the first pass. Maybe
it is still faster net/net than what luceneutil benchmarks today, which is
single-threaded-everything (single indexing thread, SerialMergeScheduler,
LogDocMergePolicy)?
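For reference, that single-threaded-everything setup is roughly this kind of IndexWriterConfig (a sketch, not luceneutil's exact code):

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.LogDocMergePolicy;
import org.apache.lucene.index.SerialMergeScheduler;

public class DeterministicConfig {
  static IndexWriterConfig deterministic() {
    // Merges run serially on the indexing thread, and segments are chosen
    // for merging by doc count, so the final segment geometry is
    // reproducible as long as a single thread feeds the documents in a
    // fixed order.
    return new IndexWriterConfig(new StandardAnalyzer())
        .setMergeScheduler(new SerialMergeScheduler())
        .setMergePolicy(new LogDocMergePolicy());
  }
}
```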
The first option Haoyu listed sounds interesting too! Could we somehow
build a new index, concurrently, but force certain docs to go to certain
in-memory segments (DWPT)? Today the routing of incoming indexing threads
to DWPTs is sort of random, but there is indeed a dedicated internal class
that decides that: DocumentsWriterPerThreadPool. And, here is a fun PR
that Adrien is working on to improve how threads are scheduled onto
in-memory segments, to try to create larger initially flushed segments and
less merge pressure as a result:
https://github.com/apache/lucene-solr/pull/1912

If we could carefully guide threads to the right DWPT during indexing the
2nd time, and then use a custom MergePolicy that is also careful to only
merge segments that "belong" together, and the index is sorted, I think you
would get the same segment geometry in the end, and the exact same documents
in each segment? This'd likely be nearly as fast as freely building an index
concurrently! It'd be a nice addition to the luceneutil benchmarks too,
since right now it takes crazy long to build the deterministic index.
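As a very rough sketch of the merge half of that idea: assuming each flushed segment somehow records which pre-built segment it "belongs" to -- e.g. under a made-up "targetSegment" diagnostics key -- a wrapper MergePolicy could veto any merge that mixes groups:

```java
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.lucene.index.FilterMergePolicy;
import org.apache.lucene.index.MergePolicy;
import org.apache.lucene.index.MergeTrigger;
import org.apache.lucene.index.SegmentCommitInfo;
import org.apache.lucene.index.SegmentInfos;

// Only lets through merges whose segments all share the same (hypothetical)
// "targetSegment" diagnostics value, so documents destined for different
// final segments are never mixed by a merge.
public class GroupedMergePolicy extends FilterMergePolicy {

  public GroupedMergePolicy(MergePolicy in) {
    super(in);
  }

  @Override
  public MergeSpecification findMerges(MergeTrigger trigger, SegmentInfos infos,
      MergeContext ctx) throws IOException {
    MergeSpecification proposed = in.findMerges(trigger, infos, ctx);
    if (proposed == null) {
      return null;
    }
    MergeSpecification allowed = new MergeSpecification();
    for (OneMerge merge : proposed.merges) {
      Set<String> groups = new HashSet<>();
      for (SegmentCommitInfo sci : merge.segments) {
        // "targetSegment" is a made-up key; something at flush time would
        // have to record which pre-built segment the docs target.
        groups.add(sci.info.getDiagnostics().get("targetSegment"));
      }
      if (groups.size() == 1) {
        allowed.add(merge); // all segments belong together; let it proceed
      }
    }
    return allowed.merges.isEmpty() ? null : allowed;
  }
}
```

A real version would probably want to propose merges per group itself, rather than just filtering the delegate's suggestions, but this shows the "only merge segments that belong together" constraint.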
Mike McCandless
http://blog.mikemccandless.com

On Sat, Dec 19, 2020 at 2:50 PM Haoyu Zhai <zhai7631@gmail.com> wrote:
> Hi Adrien
> I think Mike's comment is correct: we already have the index sorted, but
> we want to reconstruct an index with the exact same number of segments,
> where each segment contains the exact same documents.
>
> Mike
> addIndexes could take CodecReader as input [1], which allows us to pass in
> a customized FilterCodecReader, I think? Then it knows which docs to take.
> And then, supposing the original index has N segments, we could open N
> IndexWriters concurrently and rebuild those N segments, and at last
> somehow merge them back into a whole index. (I am not quite sure whether
> we could achieve the last step easily, but it doesn't sound so hard?)
>
> [1]
> https://lucene.apache.org/core/7_4_0/core/org/apache/lucene/index/IndexWriter.html#addIndexes-org.apache.lucene.index.CodecReader...-
>
> On Sat, Dec 19, 2020 at 9:13 AM Michael Sokolov <msokolov@gmail.com> wrote:
>
>> I don't know about addIndexes. Does that let you say which document goes
>> where somehow? Wouldn't you have to select a subset of documents from each
>> originally indexed segment?
>>
>> On Sat, Dec 19, 2020, 12:11 PM Michael Sokolov <msokolov@gmail.com>
>> wrote:
>>
>>> I think the idea is to exert control over the distribution of documents
>>> among the segments, in a deterministic reproducible way.
>>>
>>> On Sat, Dec 19, 2020, 11:39 AM Adrien Grand <jpountz@gmail.com> wrote:
>>>
>>>> Have you considered leveraging Lucene's built-in index sorting? It
>>>> supports concurrent indexing and is quite fast.
>>>>
>>>> On Fri, Dec 18, 2020 at 7:26 PM Haoyu Zhai <zhai7631@gmail.com> wrote:
>>>>
>>>>> Hi
>>>>> Our team is seeking a way to construct (or rebuild) a deterministic
>>>>> sorted index concurrently (I know Lucene can achieve that in a
>>>>> sequential manner, but that might be too slow for us sometimes).
>>>>> Currently we have roughly 2 ideas, both assuming there's a pre-built
>>>>> index and we have dumped a doc-segment map, so that the IndexWriter
>>>>> would be able to know which doc belongs to which segment:
>>>>> 1. First build the index in the normal way (concurrently); after the
>>>>> index is built, use the "addIndexes" functionality to merge documents
>>>>> into the correct segments.
>>>>> 2. By controlling FlushPolicy and other related classes, make sure
>>>>> each segment created (before merging) contains only the documents
>>>>> that belong to one of the segments in the pre-built index, and create
>>>>> a dedicated MergePolicy that only merges segments belonging to the
>>>>> same pre-built segment.
>>>>>
>>>>> Basically, we think the first one is easier to implement and the
>>>>> second one is faster. We want to seek some ideas & suggestions &
>>>>> feedback here.
>>>>>
>>>>> Thanks
>>>>> Patrick Zhai
>>>>>
>>>>
>>>>
>>>> --
>>>> Adrien
>>>>
>>>