After some more reading, the NoMergePolicy seems to mostly solve my problem.
I've configured my IndexWriterConfig with:

    .setMaxBufferedDocs(Integer.MAX_VALUE)
    .setRAMBufferSizeMB(Double.MAX_VALUE)
    .setMergePolicy(NoMergePolicy.INSTANCE)

With this config I consistently end up with a number of segments that is a
multiple of the number of processors on the indexing VM. I don't have to
force merge at all. This also makes the indexing job faster overall.
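For anyone who wants to try this, here is a minimal sketch of the setup,
not my actual indexing job. The class name NoMergeSketch, the method
indexAndCountSegments, the "id" field, and the thread/doc counts are all
made up for illustration; only the three IndexWriterConfig settings come
from the config above. It assumes lucene-core is on the classpath.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.NoMergePolicy;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// Hypothetical sketch: class, method, and field names are illustrative only.
final class NoMergeSketch {

  /** Indexes from several threads with merging disabled, then counts segments. */
  static int indexAndCountSegments(Path path, int threads, int docsPerThread)
      throws Exception {
    IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer())
        .setMaxBufferedDocs(Integer.MAX_VALUE)    // no flush triggered by doc count
        .setRAMBufferSizeMB(Double.MAX_VALUE)     // no flush triggered by total buffer size
        .setMergePolicy(NoMergePolicy.INSTANCE);  // flushed segments are never merged

    try (Directory dir = FSDirectory.open(path)) {
      try (IndexWriter writer = new IndexWriter(dir, config)) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<Void>> futures = new ArrayList<>();
        for (int t = 0; t < threads; t++) {
          final int threadId = t;
          Callable<Void> task = () -> {
            for (int i = 0; i < docsPerThread; i++) {
              Document doc = new Document();
              doc.add(new StringField("id", threadId + "-" + i, Field.Store.NO));
              writer.addDocument(doc);
            }
            return null;
          };
          futures.add(pool.submit(task));
        }
        for (Future<Void> f : futures) {
          f.get(); // rethrows any indexing failure
        }
        pool.shutdown();
      } // closing the writer flushes the buffered docs and commits

      try (DirectoryReader reader = DirectoryReader.open(dir)) {
        return reader.leaves().size(); // one leaf per segment
      }
    }
  }

  public static void main(String[] args) throws Exception {
    int threads = Runtime.getRuntime().availableProcessors();
    Path path = Files.createTempDirectory("no-merge-sketch");
    int segments = indexAndCountSegments(path, threads, 10_000);
    System.out.println(threads + " threads -> " + segments + " segments");
  }
}
```

In a small run like this the segment count should be at most the number
of indexing threads. As far as I understand, IndexWriter still flushes an
individual thread's buffer once it passes the per-thread hard limit
(setRAMPerThreadHardLimitMB, 1945 MB by default), which would explain why
a large job ends up with a multiple of the thread count rather than
exactly one segment per thread.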
I think I was previously confused by the behavior of the
ConcurrentMergeScheduler. I'm sure it's great for most use-cases, but I
really need to just move as many docs as possible as fast as possible to a
predictable number of segments, so the NoMergePolicy seems to be a good
choice for my use-case.
Also, I learned a lot from Uwe's recent talk at Berlin Buzzwords
<https://2021.berlinbuzzwords.de/sites/berlinbuzzwords.de/files/2021-06/The%20future%20of%20Lucene%27s%20MMapDirectory.pdf>,
and his great post about MMapDirectory from a few years ago
<https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html>.
Definitely recommended for others.
Thanks,
Alex
On Mon, Jul 5, 2021 at 1:53 PM Alex K <aklibisz@gmail.com> wrote:
> Ok, so it sounds like if you want a very specific number of segments you
> have to do a forceMerge at some point?
>
> Is there some simple summary on how segments are formed in the first
> place? Something like, "one segment is created every time you flush from an
> IndexWriter"? Based on some experimenting and reading the code, it seems to
> be quite complicated, especially once you start calling addDocument from
> several threads in parallel.
>
> It's good to learn about the MultiReader. I'll look into that some more.
>
> Thanks,
> Alex
>
> On Mon, Jul 5, 2021 at 9:14 AM Uwe Schindler <uwe@thetaphi.de> wrote:
>
>> If you want an exact number of segments, create 64 indexes, each
>> forceMerged to one segment.
>> After that use MultiReader to create a view on all separate indexes.
>> MultiReader's contents are always flattened to a list of those 64 indexes.
>>
>> But keep in mind that this should only ever be done with *static*
>> indexes. As soon as you have updates, forceMerge in general is a bad
>> idea, and so is splitting indexes like this. Parallelization should
>> normally come from multiple queries running in parallel, but you shouldn't
>> force Lucene to run a single query over so many indexes.
>>
>> Uwe
>>
>> -----
>> Uwe Schindler
>> Achterdiek 19, D-28357 Bremen
>> https://www.thetaphi.de
>> eMail: uwe@thetaphi.de
>>
>> > -----Original Message-----
>> > From: Alex K <aklibisz@gmail.com>
>> > Sent: Monday, July 5, 2021 4:04 AM
>> > To: java-user@lucene.apache.org
>> > Subject: Control the number of segments without using forceMerge.
>> >
>> > Hi all,
>> >
>> > I'm trying to figure out if there is a way to control the number of
>> > segments in an index without explicitly calling forceMerge.
>> >
>> > My use-case looks like this: I need to index a static dataset of ~1
>> > billion documents. I know the exact number of docs before indexing
>> > starts. I know the VM where this index is searched has 64 threads. I'd
>> > like to end up with exactly 64 segments, so I can search them in a
>> > parallelized fashion.
>> >
>> > I know that I could call forceMerge(64), but this takes an extremely
>> > long time.
>> >
>> > Is there a straightforward way to ensure that I end up with 64 segments
>> > without force-merging after adding all of the documents?
>> >
>> > Thanks in advance for any tips
>> >
>> > Alex Klibisz