After some more reading, the NoMergePolicy seems to mostly solve my problem.
I've configured my IndexWriterConfig with:
With this config I consistently end up with a number of segments that is a
multiple of the number of processors on the indexing VM. I don't have to
force merge at all. This also makes the indexing job faster overall.
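
For reference, a minimal sketch of an IndexWriterConfig that disables merging might look like the following. It is an illustration, not necessarily the exact settings described in this message: the index path, the StandardAnalyzer, the single `id` field, the document count, and the use of NoMergeScheduler alongside NoMergePolicy are all assumptions.

```java
import java.io.IOException;
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.NoMergePolicy;
import org.apache.lucene.index.NoMergeScheduler;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class NoMergeIndexingSketch {
  public static void main(String[] args) throws IOException {
    // Hypothetical index location; replace with a real path.
    try (Directory dir = FSDirectory.open(Paths.get("/tmp/no-merge-index"))) {
      IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer())
          // Never select segments for merging, so flushed segments stay as written.
          .setMergePolicy(NoMergePolicy.INSTANCE)
          // Assumption: with no merges to run, a no-op scheduler is sufficient.
          .setMergeScheduler(NoMergeScheduler.INSTANCE);

      try (IndexWriter writer = new IndexWriter(dir, config)) {
        // Illustrative documents only; a real job would add them from several threads.
        for (int i = 0; i < 1000; i++) {
          Document doc = new Document();
          doc.add(new StringField("id", Integer.toString(i), Field.Store.YES));
          writer.addDocument(doc);
        }
        writer.commit();
      }

      // Each leaf of the reader corresponds to one segment in the index.
      try (DirectoryReader reader = DirectoryReader.open(dir)) {
        System.out.println("segments: " + reader.leaves().size());
      }
    }
  }
}
```

Counting `reader.leaves()` after the final commit is one way to check how many segments the job actually produced.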
I think I was previously confused by the behavior of the
ConcurrentMergeScheduler. I'm sure it's great for most use-cases, but I
really need to just move as many docs as possible as fast as possible to a
predictable number of segments, so the NoMergePolicy seems to be a good
choice for my use-case.
Also, I learned a lot from Uwe's recent talk at Berlin Buzzwords
and from his great post about MMapDirectory from a few years ago.
Both are definitely recommended for others.
On Mon, Jul 5, 2021 at 1:53 PM Alex K <firstname.lastname@example.org> wrote:
> Ok, so it sounds like if you want a very specific number of segments you
> have to do a forceMerge at some point?
> Is there some simple summary on how segments are formed in the first
> place? Something like, "one segment is created every time you flush from an
> IndexWriter"? Based on some experimenting and reading the code, it seems to
> be quite complicated, especially once you start calling addDocument from
> several threads in parallel.
> It's good to learn about the MultiReader. I'll look into that some more.
> On Mon, Jul 5, 2021 at 9:14 AM Uwe Schindler <email@example.com> wrote:
>> If you want an exact number of segments, create 64 indexes, each
>> forceMerged to one segment.
>> After that use MultiReader to create a view on all separate indexes.
>> MultiReader's contents are always flattened to a list of those 64 indexes.
>> But keep in mind that this should only ever be done with *static*
>> indexes. As soon as you have updates, forceMerge is a bad idea in
>> general, and so is splitting indexes like this. Parallelization should
>> normally come from multiple queries running in parallel, but you shouldn't
>> force Lucene to run a single query over so many indexes.
>> Uwe Schindler
>> Achterdiek 19, D-28357 Bremen
>> eMail: firstname.lastname@example.org
>> > -----Original Message-----
>> > From: Alex K <email@example.com>
>> > Sent: Monday, July 5, 2021 4:04 AM
>> > To: firstname.lastname@example.org
>> > Subject: Control the number of segments without using forceMerge.
>> > Hi all,
>> > I'm trying to figure out if there is a way to control the number of
>> > segments in an index without explicitly calling forceMerge.
>> > My use-case looks like this: I need to index a static dataset of ~1
>> > billion documents. I know the exact number of docs before indexing.
>> > I know the VM where this index is searched has 64 threads. I'd like to
>> > end up with exactly 64 segments, so I can search them in parallel.
>> > I know that I could call forceMerge(64), but this takes an extremely
>> > long time.
>> > Is there a straightforward way to ensure that I end up with 64 segments
>> > without force-merging after adding all of the documents?
>> > Thanks in advance for any tips.
>> > Alex Klibisz