Inlining, virtual calls and BKDPointsTree
Hi everyone, long time lurker here.

I recently investigated Elasticsearch/OpenSearch performance in a blog post
[1], and saw some interesting behavior of numerical range queries and
numerical sorting with regard to inlining and virtual calls.

In short, the DocIdsWriter::readInts method seems to get much slower if it
is called with 3 or more implementations of IntersectVisitor during the JVM
lifetime. I believe that this is due to IntersectVisitor.visit(docid)
being heavily inlined with 2 or fewer IntersectVisitor implementations,
while becoming a virtual call with 3 or more.
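
To make this concrete, here is a toy sketch of the effect (my own
illustration, not Lucene code; the names are made up, and the exact
behavior depends on HotSpot defaults such as -XX:TypeProfileWidth=2). It
also shows why a warm-up that only ever uses one or two visitor
implementations leaves the JVM in an unrealistically fast state:

// Toy illustration: with at most two receiver types seen at the call
// site, HotSpot can inline visit(); once a third type reaches it, the
// site typically goes megamorphic and falls back to a virtual call.
interface Visitor {
  void visit(int docId);
}

public class MegamorphicDemo {
  static final class Counting implements Visitor {
    long count;
    public void visit(int docId) { count++; }
  }

  static final class Summing implements Visitor {
    long sum;
    public void visit(int docId) { sum += docId; }
  }

  static final class Max implements Visitor {
    int max;
    public void visit(int docId) { max = Math.max(max, docId); }
  }

  // The hot call site, analogous to DocIdsWriter::readInts.
  static void readInts(int[] docIds, Visitor v) {
    for (int docId : docIds) {
      v.visit(docId);
    }
  }

  static void timeCounting(int[] docIds, Counting c, String label) {
    long t0 = System.nanoTime();
    for (int i = 0; i < 20; i++) readInts(docIds, c);
    System.out.printf("%s: %d ms%n", label, (System.nanoTime() - t0) / 1_000_000);
  }

  public static void main(String[] args) {
    int[] docIds = new int[10_000_000];
    for (int i = 0; i < docIds.length; i++) docIds[i] = i;
    Counting c = new Counting();
    // Warm up with two receiver types: visit() can stay inlined.
    for (int i = 0; i < 20; i++) { readInts(docIds, c); readInts(docIds, new Summing()); }
    timeCounting(docIds, c, "with 2 visitor types");
    // Let a third type reach the call site, then re-time the same loop:
    // it is typically measurably slower from here on.
    for (int i = 0; i < 20; i++) readInts(docIds, new Max());
    timeCounting(docIds, c, "with 3 visitor types");
  }
}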

This leads to two interesting points wrt Lucene:

1) For benchmarks, warm-ups should not only be done to trigger JIT
speedups, but also to put the JVM into a realistic production state. For
the BKDPointTree, this means at least 3 implementations of the
IntersectVisitor. I'm not sure if this is top of mind when writing Lucene
benchmarks?
2) I tried changing DocIdsWriter::readInts32 (and readDelta16) to instead
call the IntersectVisitor with a DocIdSetIterator, reducing the number of
virtual calls (rough sketch below). In the benchmark setup by Elastic [2],
I saw a 35-45% decrease in execution time for range queries and numerical
sorting with this patch applied. PR: https://github.com/apache/lucene/pull/13149
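
Roughly, the change has the following shape (a simplified sketch of my
own; the buffer handling and the iterator class here are illustrative,
see the PR for the actual code):

import java.io.IOException;
import org.apache.lucene.index.PointValues.IntersectVisitor;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.store.IndexInput;

class ReadIntsSketch {
  // Before: one virtual call to visit(int) per document.
  static void readInts32(IndexInput in, int count, IntersectVisitor visitor)
      throws IOException {
    for (int i = 0; i < count; i++) {
      visitor.visit(in.readInt());
    }
  }

  // After: decode into a buffer, then make a single bulk
  // visit(DocIdSetIterator) call per leaf block. The per-document loop
  // then lives inside the visitor implementation, where it can be
  // inlined no matter how many IntersectVisitor implementations the JVM
  // has seen.
  static void readInts32Bulk(IndexInput in, int count, IntersectVisitor visitor)
      throws IOException {
    int[] buffer = new int[count];
    for (int i = 0; i < count; i++) {
      buffer[i] = in.readInt();
    }
    visitor.visit(new IntArrayDocIdSetIterator(buffer, count));
  }

  // Minimal DocIdSetIterator over an int[] (illustrative, not the class
  // from the PR).
  static final class IntArrayDocIdSetIterator extends DocIdSetIterator {
    private final int[] docs;
    private final int length;
    private int i = -1;

    IntArrayDocIdSetIterator(int[] docs, int length) {
      this.docs = docs;
      this.length = length;
    }

    @Override
    public int docID() {
      if (i < 0) return -1;
      return i >= length ? NO_MORE_DOCS : docs[i];
    }

    @Override
    public int nextDoc() {
      i++;
      return docID();
    }

    @Override
    public int advance(int target) throws IOException {
      int doc;
      do {
        doc = nextDoc();
      } while (doc < target);
      return doc;
    }

    @Override
    public long cost() {
      return length;
    }
  }
}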

I have not been able to reproduce the speedup with luceneutil - I suspect
that the default tasks in it do not trigger this code path much.

If you want to understand more of my line of thinking, consider skimming
through the blog post [1].

[1] https://blunders.io/posts/es-benchmark-4-inlining
[2] https://github.com/elastic/elasticsearch-opensearch-benchmark

best regards,
Anton H

Re: Inlining, virtual calls and BKDPointsTree
Hi Anton,

It took me a while to get through the blog post, and I suspect I will need
to read through it a couple more times to understand it fully.
Thanks for writing up something so detailed. I learnt a lot! (especially
about how the JVM inlines methods).

> I have not been able to reproduce the speedup with luceneutil - I suspect
that the default tasks in it do not trigger this code path much.

Yeah, it seems like luceneutil is not stressing the code path that
Elasticsearch's benchmarks are?

> I tried changing DocIdsWriter::readInts32 (and readDelta16) to instead
call the IntersectVisitor with a DocIdSetIterator, reducing the number of
virtual calls. In the benchmark setup by Elastic [2], I saw a 35-45%
decrease in execution time for range queries and numerical sorting with
this patch applied.

So it seems like switching over from an iterative visit(int docID) call to
a bulk visit(DocIdSetIterator iterator) gave us these gains? Cool!

I am running Amazon Product Search's benchmarks to see if the change is
needle moving for us.

Small suggestion on the blog: the JVM inlining and the Elasticsearch
short-circuiting/optimization causing a difference in performance could've
been a blog post of its own, part 1 maybe... I got confused when the blog
shifted from the performance differences between Elasticsearch and
OpenSearch to how you ended up improving Lucene.

Regards,
Gautam Worah.


Re: Inlining, virtual calls and BKDPointsTree
Thank you Gautam!

> Yeah, it seems like luceneutil is not stressing the code path that
Elasticsearch's benchmarks are?

Yes, as far as I understand it - though it might just be that I don't
understand luceneutil well enough. I believe that in order to see the
performance diff, numerical range queries or numerical sorting would have
to be involved - the more documents matched, the larger the difference.
This is what the relevant benchmark operations from Elastic do.
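
For reference, the kind of search I have in mind, in plain Lucene API
terms (the field name and constants are made up; it assumes the field is
indexed both as a LongPoint and as doc values):

import java.io.IOException;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.SortedNumericSortField;
import org.apache.lucene.search.TopDocs;

class RangeSortExample {
  // A point range query combined with a numeric sort: the wider the
  // range, the more docIDs flow through IntersectVisitor.visit, and the
  // larger the effect of the patch.
  static TopDocs rangeSortedByTime(IndexSearcher searcher, long lo, long hi)
      throws IOException {
    Query range = LongPoint.newRangeQuery("timestamp", lo, hi);
    Sort byTime =
        new Sort(new SortedNumericSortField("timestamp", SortField.Type.LONG));
    return searcher.search(range, 100, byTime);
  }
}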

> So it seems like switching over from an iterative visit(int docID) call
to a bulk visit(DocIdSetIterator iterator) gave us these gains? Cool!

Yes, it seems like it, based on this benchmark.

> I am running Amazon Product Search's benchmarks to see if the change is
needle moving for us.

Thank you, much appreciated!

> Small suggestion on the blog...

Thank you for the feedback! The post is definitely a bit confusing; I
struggled with keeping it clear. I will try to make some edits to make it
clearer what conclusions can be drawn after each section.

/Anton

Re: Inlining, virtual calls and BKDPointsTree
> I am running Amazon Product Search's benchmarks to see if the change is
needle moving for us.

Results were flat to slightly positive (+0.94% redline QPS) on our workload.
Although we do have numeric range queries that would've improved, I
suspect the result is flat because our workload is heavily dominated by
TermQueries and their combinations with various clauses.

I'll try tweaking the query set to target queries with more Point hits
during the week and see what comes out.

Regards,
Gautam Worah.


Re: Inlining, virtual calls and BKDPointsTree
> I'll try tweaking the query set to target queries with more Point hits
during the week and see what comes out.

I tried this, and also benchmarked the change on 2 other types of
indexes with slightly varying attributes. They roughly correspond to
indexes for different categories of products.
Performance was flat on both throughput and latency.

The change still LGTM.

Regards,
Gautam Worah.


Re: Inlining, virtual calls and BKDPointsTree
> I tried this, and also benchmarked the change on 2 other types of
indexes with slightly varying attributes. They roughly correspond to
indexes for different categories of products.
> Performance was flat on both throughput and latency.

Thank you very much for running the benchmarks and reviewing the code!

After thinking a bit about this, I think it would be best if the PR could
be shown to improve performance in luceneutil before merging. Since
luceneutil does not currently, as far as I understand, have good coverage
for point range queries and numerical sorting, this means adding that
functionality to luceneutil first. I'll start looking into that next week
(I'm currently travelling).
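
For such tasks, the benchmark documents would need the numeric field
indexed in two shapes - a minimal sketch of the indexing side (field name
made up, standard Lucene API):

import java.io.IOException;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.SortedNumericDocValuesField;
import org.apache.lucene.index.IndexWriter;

class TimestampedDocExample {
  // The LongPoint feeds the BKD tree (point range queries), while the
  // doc values field feeds numeric sorting.
  static void addTimestampedDoc(IndexWriter writer, long millis) throws IOException {
    Document doc = new Document();
    doc.add(new LongPoint("timestamp", millis));
    doc.add(new SortedNumericDocValuesField("timestamp", millis));
    writer.addDocument(doc);
  }
}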

best regards,
Anton
