Mailing List Archive

forceMerge(1) leads to ~10% perf gains
After testing on 4800 fairly complex queries, I see a performance gain of
10% after doing indexWriter.forceMerge(1); indexWriter.commit(); from 209
ms per query, to 185 ms per query.

Queries are quite complex, often about 30 or so words, of the format OR
text:<word>

It went from 214 to 14 files on the forceMerge.

It's a 6GB static/read only index with about 6.4M documents. Documents are
around 1MB or so of text.

Was wondering - are there any other techniques which can be used to speed
up queries, and which work well when forceMerge helps like this?

Is there a better way to query and still maintain accuracy than simply word
tokenizing a sentence and joining with OR text: ?

Re: forceMerge(1) leads to ~10% perf gains
Hi,

Yes, a force-merged index can be faster, as less work is spent on
looking up terms in different index segments.

If you are looking for higher speed, non-merged indexes can actually
perform better, IF you parallelize. You can do this by adding an
Executor instance to IndexSearcher
(<https://lucene.apache.org/core/9_7_0/core/org/apache/lucene/search/IndexSearcher.html#%3Cinit%3E(org.apache.lucene.index.IndexReader,java.util.concurrent.Executor)>).
If you do this, each segment of the index is searched in parallel (using
the thread pool limits of the Executor) and the results are merged at the end.
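
A toy sketch of that pattern, using only the JDK: each "segment" is searched
on the Executor, and the per-segment top hits are merged afterwards. The class
name, the score arrays and the topK/search helpers are made up for
illustration and are not Lucene APIs; with Lucene itself you would simply pass
the Executor to the IndexSearcher constructor linked above and let it do this
work.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

final class ParallelSegmentSketch {

    // Hypothetical stand-in for searching one segment: the "segment" is just
    // an array of per-document scores, and we return its k best scores.
    static List<Double> topK(double[] segmentScores, int k) {
        List<Double> scores = new ArrayList<>();
        for (double s : segmentScores) {
            scores.add(s);
        }
        scores.sort(Collections.reverseOrder());
        return new ArrayList<>(scores.subList(0, Math.min(k, scores.size())));
    }

    // One task per segment runs on the executor; the per-segment results are
    // merged into a global top k once all tasks have finished.
    static List<Double> search(List<double[]> segments, int k, Executor executor) {
        List<CompletableFuture<List<Double>>> futures = new ArrayList<>();
        for (double[] segment : segments) {
            futures.add(CompletableFuture.supplyAsync(() -> topK(segment, k), executor));
        }
        List<Double> merged = new ArrayList<>();
        for (CompletableFuture<List<Double>> future : futures) {
            merged.addAll(future.join()); // waits for that segment's task
        }
        merged.sort(Collections.reverseOrder());
        return new ArrayList<>(merged.subList(0, Math.min(k, merged.size())));
    }

    public static void main(String[] args) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<double[]> segments = List.of(
                new double[] {0.2, 0.9, 0.4},
                new double[] {0.7, 0.1},
                new double[] {0.8, 0.3, 0.6});
            System.out.println(search(segments, 3, pool)); // prints [0.9, 0.8, 0.7]
        } finally {
            pool.shutdown();
        }
    }
}
```

With a single force-merged segment there would be only one task in this
sketch, so nothing would run in parallel.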

If an index is read-only and static, force-merge is a good idea - unless
you want to parallelize.

Tokenizing and joining with OR is the correct way, but for speed you may
also use AND, which only matches documents that contain every term. To
further improve the speed, also take a look at block-max WAND: if you are
not interested in the exact total number of matching documents, you can
get huge speed improvements. This is enabled by default in Lucene 9.x
with the default IndexSearcher, but on Solr/Elasticsearch you may need to
actively request it. In that case the exact number of hits is only
counted until 1000 docs are found.
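
For reference, a minimal sketch of the "tokenize and join" step in plain
Java, producing either the OR form from the original mail or the stricter AND
form. The class and method names are hypothetical, and splitting on
non-letter/non-digit characters is an assumption made to keep the example
short; a real application would normally tokenize with the same Analyzer that
was used at index time.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class OrQuerySketch {

    // Builds e.g. "text:fast OR text:merged" from a sentence.
    // operator is expected to be "OR" or "AND".
    static String build(String sentence, String field, String operator) {
        List<String> clauses = new ArrayList<>();
        // Assumption: lowercase and split on anything that is not a letter or digit.
        for (String token : sentence.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                clauses.add(field + ":" + token);
            }
        }
        return String.join(" " + operator + " ", clauses);
    }

    public static void main(String[] args) {
        // prints text:fast OR text:merged OR text:index
        System.out.println(build("Fast, merged index!", "text", "OR"));
        // prints text:fast AND text:merged AND text:index
        System.out.println(build("Fast, merged index!", "text", "AND"));
    }
}
```

The OR form keeps every document that matches at least one word, while the
AND form is cheaper to evaluate but drops documents that miss any single word.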

Uwe

On 22.09.2023 at 03:40, qrdl kaggle wrote:
--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: uwe@thetaphi.de


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: forceMerge(1) leads to ~10% perf gains
> Was wondering - are there any other techniques which can be used to speed
> up that work well when forceMerge works like this?

Lucene 9.8 (to be released in a few days hopefully) will add support
for recursive graph bisection, which is another thing that can be used
to speed up querying on read-only indices.

https://github.com/apache/lucene/pull/12489

On Fri, Sep 22, 2023 at 12:54 PM Uwe Schindler <uwe@thetaphi.de> wrote:


--
Adrien


Re: forceMerge(1) leads to ~10% perf gains
Also, try index sorting. Often, there are performance gains to be had with
the right sort key for various query workloads.

On Fri, 22 Sept, 2023, 4:28 pm Adrien Grand, <jpountz@gmail.com> wrote:
