Mailing List Archive

Returning large resultset is slow and resource intensive
Hello everyone,
For our use case, we need to run queries which return the full
matched result set. In some cases, this result set can be large (50k+
results out of 4 million total documents).
Perf test showed that just 4 threads running random queries returning 50k
results make Lucene utilize 100% CPU on a 4-core machine (profiler
screenshot
<https://user-images.githubusercontent.com/6069066/157188814-fbd9d205-c2e4-45b6-b98d-b7622b6ac801.png>).
The query is very simple and contains only a single-term filter clause, all
unrelated parts of the application are disabled, no stored fields are
fetched, GC is doing minimal amount of work
<https://user-images.githubusercontent.com/6069066/157191646-eb8c5ccc-41c1-4af1-afcf-37d0c5f86054.png>
.

My understanding is that fetching a large result set is not exactly
the best use case for Lucene, as explained here
<http://philosophyforprogrammers.blogspot.com/2010/09/lucene-performance.html>.
But I wonder if there are ways to optimize something / use a special type
of collector in order to minimize CPU utilization?

Thank you,
Alex
RE: Returning large resultset is slow and resource intensive
Hi,

> For our use case, we need to run queries which return the full
> matched result set. In some cases, this result set can be large (50k+
> results out of 4 million total documents).
> Perf test showed that just 4 threads running random queries returning 50k
> results make Lucene utilize 100% CPU on a 4-core machine (profiler
> screenshot
> <https://user-images.githubusercontent.com/6069066/157188814-fbd9d205-
> c2e4-45b6-b98d-b7622b6ac801.png>).

This screenshot shows the problem: the search methods returning TopDocs (or TopFieldDocs) should never be used to retrieve a large number of results, let alone ALL results. This is the well-known "deep paging" problem. Lucene cannot easily return "paged" results starting at a specific result page; it has to score all results and insert them into a priority queue. This does not scale well, because the priority-queue approach is designed for quickly getting the top-ranking results. So to get all results, don't call: <https://lucene.apache.org/core/9_0_0/core/org/apache/lucene/search/IndexSearcher.html#search(org.apache.lucene.search.Query,int)>

If you just want to get all results, you should write your own collector (single-threaded, as a subclass of SimpleCollector; an alternative is CollectorManager for multi-threaded search, with a separate "reduce" step that merges the results of each index segment) that just retrieves document ids and processes them. If you don't need the score, don't call the scoring methods on the Scorable.

For this you have to create a subclass of SimpleCollector (and a CollectorManager, if needed) and implement its methods, which are called by the query internals as a kind of "notification" about which index segment you are in and which result, *relative* to that segment, matched. Important things:
- You get notified about new segments via SimpleCollector#doSetNextReader(). Save the context in a local field of the collector for later use.
- If you need the scores, also implement SimpleCollector#setScorer().
- For each search hit of the reader passed in the previous call, the SimpleCollector#collect() method gets called. Take the (segment-relative) document id passed in and resolve it via the leaf reader to the actual document and its fields/doc values. To get the score, ask the Scorable from the previous call.
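To make the steps above concrete, here is a minimal sketch of such a collector (not from the original thread; it assumes Lucene 9.x APIs, and the index contents, the "color" field, and the class names are purely illustrative). It collects every matching doc id without scoring and without a priority queue:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreMode;
import org.apache.lucene.search.SimpleCollector;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class AllDocsCollectorDemo {

  /** Collects the global doc id of every hit; no scoring, no priority queue. */
  static final class AllDocsCollector extends SimpleCollector {
    final List<Integer> docIds = new ArrayList<>();
    private int docBase; // doc id offset of the current segment

    @Override
    protected void doSetNextReader(LeafReaderContext context) {
      // Called once per segment; remember its base (and keep context.reader()
      // around if you need to resolve fields/doc values in collect()).
      this.docBase = context.docBase;
    }

    @Override
    public void collect(int doc) {
      // 'doc' is relative to the current segment; add docBase for a global id.
      docIds.add(docBase + doc);
    }

    @Override
    public ScoreMode scoreMode() {
      // We never ask for scores, so Lucene can skip scoring work entirely.
      return ScoreMode.COMPLETE_NO_SCORES;
    }
  }

  static long runDemo() throws IOException {
    try (Directory dir = new ByteBuffersDirectory()) {
      try (IndexWriter w = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
        for (int i = 0; i < 100; i++) {
          Document d = new Document();
          d.add(new StringField("color", i % 2 == 0 ? "red" : "blue", Field.Store.NO));
          w.addDocument(d);
        }
      }
      try (DirectoryReader reader = DirectoryReader.open(dir)) {
        IndexSearcher searcher = new IndexSearcher(reader);
        AllDocsCollector collector = new AllDocsCollector();
        searcher.search(new TermQuery(new Term("color", "red")), collector);
        return collector.docIds.size();
      }
    }
  }

  public static void main(String[] args) throws IOException {
    System.out.println("hits=" + runDemo()); // prints hits=50
  }
}
```

Returning COMPLETE_NO_SCORES from scoreMode() is what lets Lucene skip the scoring work; if you do need scores, return ScoreMode.COMPLETE and keep the Scorable handed to setScorer().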

Another approach is to use searchAfter with smaller windows, but for getting all results this is still slower, as priority queues still have to be managed (just smaller ones).

> The query is very simple and contains only a single-term filter clause, all
> unrelated parts of the application are disabled, no stored fields are
> fetched, GC is doing minimal amount of work
> <https://user-images.githubusercontent.com/6069066/157191646-eb8c5ccc-
> 41c1-4af1-afcf-37d0c5f86054.png>

Lucene itself never uses much heap space, so GC load should always be low.

Uwe


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: Returning large resultset is slow and resource intensive
Another approach for retrieving large result sets can work if you have
a unique sort key and don't mind retrieving your results sorted by
this key. Then you can retrieve the results in batches using a
cursor-style approach: request the top N sorted by the key, then
request the top N such that the key is greater than the greatest value
in the last batch. Rinse and repeat.
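A sketch of that cursor-style batching (not from the original thread; it assumes Lucene 9.x, and the "key" field, batch size, and class name are illustrative). Lucene's searchAfter does the "key greater than the last batch's greatest value" part for you when you hand it the last hit of the previous batch:

```java
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.FieldDoc;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class SearchAfterCursorDemo {

  static long runDemo() throws IOException {
    try (Directory dir = new ByteBuffersDirectory()) {
      try (IndexWriter w = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
        // "key" is a unique, sortable doc-values field acting as the cursor key.
        for (long i = 0; i < 95; i++) {
          Document d = new Document();
          d.add(new NumericDocValuesField("key", i));
          w.addDocument(d);
        }
      }
      try (DirectoryReader reader = DirectoryReader.open(dir)) {
        IndexSearcher searcher = new IndexSearcher(reader);
        Query query = new MatchAllDocsQuery();
        Sort byKey = new Sort(new SortField("key", SortField.Type.LONG));
        int batchSize = 20;
        long total = 0;
        FieldDoc after = null; // cursor: last hit of the previous batch
        while (true) {
          TopDocs batch = searcher.searchAfter(after, query, batchSize, byKey);
          if (batch.scoreDocs.length == 0) {
            break; // no more batches
          }
          total += batch.scoreDocs.length;
          // The last hit carries its sort key; Lucene resumes strictly after it.
          after = (FieldDoc) batch.scoreDocs[batch.scoreDocs.length - 1];
        }
        return total;
      }
    }
  }

  public static void main(String[] args) throws IOException {
    System.out.println("total=" + runDemo()); // prints total=95
  }
}
```

Each batch still maintains a (small) priority queue, so this trades the custom-collector approach's efficiency for simpler code and natural resumability between calls.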

On Tue, Mar 8, 2022 at 4:13 AM Uwe Schindler <uwe@thetaphi.de> wrote:
