Mailing List Archive

Control the number of segments without using forceMerge.
Hi all,

I'm trying to figure out if there is a way to control the number of
segments in an index without explicitly calling forceMerge.

My use-case looks like this: I need to index a static dataset of ~1
billion documents. I know the exact number of docs before indexing starts.
I know the VM where this index is searched has 64 threads. I'd like to end
up with exactly 64 segments, so I can search them in a parallelized fashion.
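
As a rough sizing check: if the dataset were exactly 1,000,000,000 documents (an assumption; the post only says ~1 billion), an even split over 64 segments would put 15,625,000 documents in each. The sketch below shows that arithmetic for any total, handing the remainder to the first few segments. The class and method names are hypothetical and not part of Lucene.

```java
final class SegmentPlan {

  /**
   * Number of documents assigned to segment `index` (0-based) when
   * `totalDocs` documents are split as evenly as possible over `segments`.
   */
  static long docsInSegment(long totalDocs, int segments, int index) {
    long base = totalDocs / segments;      // every segment gets at least this many
    long remainder = totalDocs % segments; // the first `remainder` segments get one extra
    return index < remainder ? base + 1 : base;
  }

  public static void main(String[] args) {
    long totalDocs = 1_000_000_000L; // assumption: exactly one billion documents
    int segments = 64;               // one segment per search thread, as in the post
    for (int i = 0; i < segments; i++) {
      System.out.println("segment " + i + ": " + docsInSegment(totalDocs, segments, i));
    }
  }
}
```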

I know that I could call forceMerge(64), but this takes an extremely long
time.

Is there a straightforward way to ensure that I end up with 64 segments
without force-merging after adding all of the documents?

Thanks in advance for any tips

Alex Klibisz
RE: Control the number of segments without using forceMerge.
If you want an exact number of segments, create 64 indexes, each forceMerged to one segment.
After that, use MultiReader to create a view over all of the separate indexes. A MultiReader's contents are always flattened to a list of those 64 indexes.

But keep in mind that this should only ever be done with *static* indexes. As soon as you have updates, forceMerge in general is a bad idea, and so is splitting indexes like this. Parallelization should normally come from multiple queries running in parallel; you shouldn't force Lucene to run a single query over so many indexes.
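
A minimal sketch of the MultiReader approach, assuming a recent Lucene (8.x or 9.x) on the classpath. To stay self-contained it builds a few tiny in-memory indexes with ByteBuffersDirectory instead of 64 large on-disk ones; the class name, method names, and the "id" field are hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.MultiReader;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

final class MultiShardDemo {

  /** Builds one small index and force-merges it down to a single segment. */
  static Directory buildShard(int shard, int docs) throws IOException {
    Directory dir = new ByteBuffersDirectory();
    try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
      for (int i = 0; i < docs; i++) {
        Document doc = new Document();
        doc.add(new StringField("id", shard + "-" + i, Field.Store.YES));
        writer.addDocument(doc);
      }
      writer.forceMerge(1); // static data only, per the caveat above
    } // closing the writer commits the index
    return dir;
  }

  /** Opens every shard and returns the number of leaves the MultiReader exposes. */
  static int countLeaves(int numShards, int docsPerShard) {
    try {
      IndexReader[] readers = new IndexReader[numShards];
      for (int s = 0; s < numShards; s++) {
        readers[s] = DirectoryReader.open(buildShard(s, docsPerShard));
      }
      // This constructor closes the sub-readers when the MultiReader is closed.
      try (MultiReader multi = new MultiReader(readers)) {
        return multi.leaves().size();
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println("leaves: " + countLeaves(4, 100));
  }
}
```

With one single-segment index per shard, the leaf count equals the number of shards. For searching, an IndexSearcher can be built on the MultiReader together with an executor; how leaves are grouped onto threads depends on the Lucene version.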

Uwe

-----
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: uwe@thetaphi.de

> -----Original Message-----
> From: Alex K <aklibisz@gmail.com>
> Sent: Monday, July 5, 2021 4:04 AM
> To: java-user@lucene.apache.org
> Subject: Control the number of segments without using forceMerge.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: Control the number of segments without using forceMerge.
Ok, so it sounds like if you want a very specific number of segments you
have to do a forceMerge at some point?

Is there some simple summary on how segments are formed in the first place?
Something like, "one segment is created every time you flush from an
IndexWriter"? Based on some experimenting and reading the code, it seems to
be quite complicated, especially once you start calling addDocument from
several threads in parallel.

It's good to learn about the MultiReader. I'll look into that some more.

Thanks,
Alex

On Mon, Jul 5, 2021 at 9:14 AM Uwe Schindler <uwe@thetaphi.de> wrote:

Re: Control the number of segments without using forceMerge.
After some more reading, the NoMergePolicy seems to mostly solve my problem.

I've configured my IndexWriterConfig with:

.setMaxBufferedDocs(Integer.MAX_VALUE)
.setRAMBufferSizeMB(Double.MAX_VALUE)
.setMergePolicy(NoMergePolicy.INSTANCE)

With this config I consistently end up with a number of segments that is a
multiple of the number of processors on the indexing VM. I don't have to
force merge at all. This also makes the indexing job faster overall.
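
The configuration above can be tried in isolation with a sketch like the following, assuming a recent Lucene (8.x or 9.x) on the classpath. It indexes from a single thread into an in-memory ByteBuffersDirectory so the outcome is deterministic: each commit() flushes the buffered documents as one new segment, and NoMergePolicy leaves those segments unmerged. The class name, method name, and "id" field are hypothetical. With several indexing threads, as in the post, the number of segments written per flush typically follows the number of threads that were writing.

```java
import java.io.IOException;
import java.io.UncheckedIOException;

import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.NoMergePolicy;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

final class NoMergeDemo {

  /** Indexes `commits` batches from one thread and returns the resulting segment count. */
  static int segmentsAfterCommits(int commits, int docsPerCommit) {
    // Same three settings as in the post: no flush by doc count or RAM buffer size,
    // and no background merging.
    IndexWriterConfig config = new IndexWriterConfig()
        .setMaxBufferedDocs(Integer.MAX_VALUE)
        .setRAMBufferSizeMB(Double.MAX_VALUE)
        .setMergePolicy(NoMergePolicy.INSTANCE);

    try (Directory dir = new ByteBuffersDirectory();
         IndexWriter writer = new IndexWriter(dir, config)) {
      for (int c = 0; c < commits; c++) {
        for (int i = 0; i < docsPerCommit; i++) {
          Document doc = new Document();
          doc.add(new StringField("id", c + "-" + i, Field.Store.NO));
          writer.addDocument(doc);
        }
        writer.commit(); // flushes the buffered docs as a new segment
      }
      // Open a reader on the last commit and count its segments (leaves).
      try (DirectoryReader reader = DirectoryReader.open(dir)) {
        return reader.leaves().size();
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println("segments: " + segmentsAfterCommits(3, 5));
  }
}
```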

I think I was previously confused by the behavior of the
ConcurrentMergeScheduler. I'm sure it's great for most use-cases, but I
really need to just move as many docs as possible as fast as possible to a
predictable number of segments, so the NoMergePolicy seems to be a good
choice for my use-case.

Also, I learned a lot from Uwe's recent talk at Berlin Buzzwords
<https://2021.berlinbuzzwords.de/sites/berlinbuzzwords.de/files/2021-06/The%20future%20of%20Lucene%27s%20MMapDirectory.pdf>,
and his great post about MMapDirectory from a few years ago
<https://blog.thetaphi.de/2012/07/use-lucenes-mmapdirectory-on-64bit.html>.
Definitely recommended for others.

Thanks,
Alex

On Mon, Jul 5, 2021 at 1:53 PM Alex K <aklibisz@gmail.com> wrote:
