Mailing List Archive

NRT readers and overall indexing/querying throughput
Hello everyone,

We are considering switching from regular to NRT readers, hoping it would
improve overall indexing/querying throughput and also optimize the
turnaround time.
I did some benchmarks, mostly to understand how much benefit we can get and
make sure I'm implementing everything correctly.

To my surprise, no matter how I tweak it, our indexing throughput is 10%
lower with NRT, and query throughput (measured in parallel with indexing) is
pretty much the same. I do see an almost 5x turnaround time improvement,
though. Maybe my expectations are wrong, and less frequent commits with NRT
refresh were never intended to improve overall performance?

Some details about the tests -
Base implementation commits and refreshes a regular reader every second.
NRT implementation commits every 60 seconds and refreshes NRT reader every
second.
The indexing rate is about 23 Mb/sec, query rate ~300 rps (text search with
avg 50ms latency). Document size is about 35 Kb.
A 36-core machine is used for the tests, and I don't see a big difference in
JVM metrics between the tests. Also, there is no obvious bottleneck in
CPU/memory/disk utilization (everything is way below 100%).
NRT readers are implemented using the SearcherManager, the same as the
implementation in the Lucene benchmark repository:
<https://github.com/mikemccand/luceneutil/blob/master/src/main/perf/NRTPerfTest.java>
With NRT, commit latency is about 3 sec, average refresh latency is 150ms.
In the base approach, commit latency is about 500 ms, refresh 300 ms.
I tried NRTCachingDirectory (with MMapDirectory and NIOFSDirectory), insert
vs. update workloads, `applyAllDeletes=false`, and a single indexing thread;
nothing helps to match the base version throughput.
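
As a rough sanity check on the figures above, the following minimal Java
sketch adds up the wall-clock time spent in commits and refreshes per minute
for each setup. The class and method names are hypothetical, and it assumes
the quoted latencies are averages. Commits and refreshes can overlap with
indexing on other threads, so this total is not a prediction of indexing
throughput.

```java
public class MaintenanceBudget {

    /** Milliseconds per minute spent in an operation that runs every intervalSec seconds. */
    static long perMinuteMs(long latencyMs, long intervalSec) {
        return (60 / intervalSec) * latencyMs;
    }

    /** Base setup: commit (500 ms) and refresh (300 ms), both every second. */
    public static long baseMs() {
        return perMinuteMs(500, 1) + perMinuteMs(300, 1);
    }

    /** NRT setup: commit (3 sec) every 60 seconds, refresh (150 ms) every second. */
    public static long nrtMs() {
        return perMinuteMs(3000, 60) + perMinuteMs(150, 1);
    }

    public static void main(String[] args) {
        System.out.println("base: " + baseMs() + " ms/min"); // 48000
        System.out.println("NRT:  " + nrtMs() + " ms/min");  // 12000
    }
}
```

By this count the NRT setup spends about a quarter of the time in commits and
refreshes, yet the measured indexing throughput is still lower.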

I'd appreciate any advice. Am I missing something obvious, or is the
expectation that NRT with less frequent commits would be more
performant/resource-efficient incorrect?

--
Regards,
Alex
RE: NRT readers and overall indexing/querying throughput
Hi,

In general, NRT indexing throughput is always a bit lower than with normal indexing, because it reopens readers and needs to flush segments more often (which is why you should use NRTCachingDirectory). So 10% lower indexing throughput is quite normal. You can improve it by parallelizing, but during a refresh you still have a small delay on each reopen of readers by SearcherManager.

Search speed is mostly the same because, while indexing, most of the segments don't change and can be reused after a reopen; only the new, small segments are cold. Merged segments also need warming, so generally you only see small spikes in search performance when newly merged and possibly huge "cold" segments go live.

Of course, if you use more parallel threads during indexing you will also see a slowdown in search performance.

When doing NRT, always use NRTCachingDirectory; for "normal bulk indexing", MMapDirectory alone is fine.

I don't fully understand your expectations, but everything you describe looks quite normal. The main reason to use NRT indexing is shorter turnaround times by not doing expensive commits. And that's what you see, while indexing performance and also search performance go down depending on the refresh rate.
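
The setup discussed in this thread can be sketched roughly as follows: an IndexWriter over an NRTCachingDirectory wrapping an MMapDirectory, with a SearcherManager that is refreshed without a commit. This is only a minimal sketch, assuming Lucene 8.x on the classpath; the class name NrtSketch, the field name "body", and the cache sizes (5 MB / 60 MB) are illustrative assumptions, not values from the thread.

```java
import java.io.IOException;
import java.nio.file.Path;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.SearcherFactory;
import org.apache.lucene.search.SearcherManager;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.MMapDirectory;
import org.apache.lucene.store.NRTCachingDirectory;

public class NrtSketch {

    /** Indexes one document and returns how many hits an NRT searcher sees before any commit. */
    public static int indexAndCount(Path indexPath) throws IOException {
        // Assumed cache settings: cache merges up to 5 MB, at most 60 MB cached in total.
        try (Directory dir = new NRTCachingDirectory(new MMapDirectory(indexPath), 5.0, 60.0);
             IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {

            // applyAllDeletes=true, writeAllDeletes=false
            SearcherManager manager = new SearcherManager(writer, true, false, new SearcherFactory());
            try {
                Document doc = new Document();
                doc.add(new TextField("body", "hello nrt", Field.Store.NO));
                writer.addDocument(doc);

                // Makes the new document searchable without calling writer.commit().
                manager.maybeRefresh();

                IndexSearcher searcher = manager.acquire();
                try {
                    return searcher.count(new TermQuery(new Term("body", "nrt")));
                } finally {
                    manager.release(searcher); // always release what was acquired
                }
            } finally {
                manager.close();
            }
        } // closing the writer commits pending changes by default
    }
}
```

In a real application, maybeRefresh() would run on a timer (every second in the tests described in this thread) and writer.commit() on a much longer interval (60 seconds there).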

Uwe

-----
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: uwe@thetaphi.de

> -----Original Message-----
> From: Alexander Lukyanchikov <alexanderlukyanchikov@gmail.com>
> Sent: Wednesday, August 4, 2021 4:43 AM
> To: java-user@lucene.apache.org
> Subject: NRT readers and overall indexing/querying throughput


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: NRT readers and overall indexing/querying throughput
On Tue, Aug 3, 2021 at 10:43 PM Alexander Lukyanchikov
<alexanderlukyanchikov@gmail.com> wrote:
>
> Maybe I have wrong expectations, and less frequent commits with NRT refresh
> were not intended to improve overall performance?
>
> Some details about the tests -
> Base implementation commits and refreshes a regular reader every second.
> NRT implementation commits every 60 seconds and refreshes NRT reader every
> second.

FYI: if you really want to delay syncing the data to disk for as long as
60 seconds, you may also have to modify the filesystem mount options to
get that.

For example, ext4 syncs on its own every 5 seconds by default; see the
'commit=' option:
https://www.kernel.org/doc/Documentation/filesystems/ext4.txt
