Mailing List Archive

30% query performance degradation for documents with small stored fields
Hello everyone,
We are in the process of upgrading from Lucene 8.5.0, and on the latest
version our query performance tests show a significant latency
degradation for one of our important use cases. In this test, each query
retrieves a relatively large result set of 40k documents with a small
stored fields payload (< 100 bytes per doc).
It looks like the change that affects this use case was introduced in
LUCENE-9486 <https://issues.apache.org/jira/browse/LUCENE-9486> (Lucene
8.7); on that version our tests show almost 3x higher latency. Later, in
LUCENE-9917 <https://issues.apache.org/jira/browse/LUCENE-9917>, the
block size for BEST_SPEED was reduced, and since Lucene 8.10 we see
about a 30% degradation.

It is still a significant performance regression, and in our case query
latency is more important than index size. Unless I'm missing something,
the only way to fix that today is to introduce our own Codec,
StoredFieldsFormat and CompressionMode - an experiment with the preset
dictionary disabled and a lower block size showed that these changes let
us achieve the query latency we need on Lucene 9.2. While this can solve
the problem, there is a concern about maintaining our own version of the
codec and dealing with more complicated upgrades in the future.
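For concreteness, the experiment was roughly along the lines of the
sketch below (illustrative rather than our exact code - the format name
and sizes are placeholders; `CompressionMode.FAST` is Lucene's public
plain-LZ4 mode, i.e. no preset dictionaries):

import org.apache.lucene.codecs.StoredFieldsFormat;
import org.apache.lucene.codecs.compressing.CompressionMode;
import org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsFormat;

class ExperimentStoredFields {
  // Plain LZ4 without preset dictionaries, and a chunk size deliberately
  // smaller than the BEST_SPEED default, trading compression ratio for
  // cheaper per-document fetches.
  static final StoredFieldsFormat FORMAT =
      new Lucene90CompressingStoredFieldsFormat(
          "ExperimentStoredFields", // placeholder format name
          CompressionMode.FAST,     // preset dict disabled
          16 * 1024,                // chunkSize in bytes (placeholder)
          128,                      // maxDocsPerChunk (placeholder)
          10);                      // blockShift
}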

Are there any less obvious ways to improve the situation for this use case?
If not, does it make sense to expose related settings so users can tune the
compression without copying several internal classes?

Thank you,
Alex
Re: 30% query performance degradation for documents with small stored fields
Hi Alexander,

Sorry that these changes impacted your workload negatively. We're trying
to have sensible defaults for the performance/compression trade-off in
the default codec, and indeed our guidance is to write a custom codec
when the defaults don't work for a workload. As you identified, Lucene
only guarantees backward compatibility of file formats for the default
codec, so if you write a custom codec you will have to maintain backward
compatibility on your own.

> Are there any less obvious ways to improve the situation for this use case?

I can't think of other workarounds.

One supported approach consists of rewriting indexes to the default
codec in order to perform upgrades, using
`IndexWriter#addIndexes(CodecReader)`. Say you have a custom codec: you
could rewrite the index to the default codec, then upgrade to a new
Lucene version, and rewrite the index again using your custom codec.
This doesn't remove the maintenance overhead entirely, but it means you
don't have to worry about backward compatibility of file formats.
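For example, a minimal sketch (class name and paths are placeholders,
and the index's current codec must be on the classpath so the reader can
open it):

import java.io.IOException;
import java.nio.file.Path;
import org.apache.lucene.codecs.Codec;
import org.apache.lucene.index.CodecReader;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.SlowCodecReaderWrapper;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class RewriteIndex {
  public static void main(String[] args) throws IOException {
    try (Directory src = FSDirectory.open(Path.of("index-custom"));   // placeholder
         Directory dst = FSDirectory.open(Path.of("index-default"));  // placeholder
         DirectoryReader reader = DirectoryReader.open(src);
         IndexWriter writer = new IndexWriter(
             dst, new IndexWriterConfig().setCodec(Codec.getDefault()))) {
      CodecReader[] segments = new CodecReader[reader.leaves().size()];
      for (int i = 0; i < segments.length; i++) {
        // Wrap each segment so addIndexes can consume it whatever its codec is.
        segments[i] = SlowCodecReaderWrapper.wrap(reader.leaves().get(i).reader());
      }
      writer.addIndexes(segments); // re-encodes documents with the writer's codec
      writer.commit();
    }
  }
}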

> does it make sense to expose related settings so users can tune the compression without copying several internal classes?

Lucene exposes ways to customize stored fields; look at the constructor
of `Lucene90CompressingStoredFieldsFormat`, for instance, which allows
configuring block sizes, compression strategies, etc. These classes are
considered internal, so the API is not stable, but they can be used to
avoid copying lots of code from Lucene's stored fields format.
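For example, a minimal sketch (the class name, format name and sizes are
made up for illustration, and the delegate should match the Lucene
version you're on):

import org.apache.lucene.codecs.FilterCodec;
import org.apache.lucene.codecs.StoredFieldsFormat;
import org.apache.lucene.codecs.compressing.CompressionMode;
import org.apache.lucene.codecs.lucene90.compressing.Lucene90CompressingStoredFieldsFormat;
import org.apache.lucene.codecs.lucene92.Lucene92Codec;

// Delegates everything to the default codec except stored fields.
public class LightStoredFieldsCodec extends FilterCodec {
  private final StoredFieldsFormat storedFields =
      new Lucene90CompressingStoredFieldsFormat(
          "LightStoredFields", CompressionMode.FAST, 16 * 1024, 128, 10);

  public LightStoredFieldsCodec() {
    super("LightStoredFieldsCodec", new Lucene92Codec());
  }

  @Override
  public StoredFieldsFormat storedFieldsFormat() {
    return storedFields;
  }
}

You would pass an instance to `IndexWriterConfig#setCodec` for writing;
since the codec name gets recorded in the segment files, the class also
needs an SPI registration (a META-INF/services/org.apache.lucene.codecs.Codec
entry) so it can be looked up again at read time.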

The consensus is that the stored fields of the default codec shouldn't
expose more tuning options than BEST_SPEED/BEST_COMPRESSION. Supporting
these two options is already quite a burden in terms of testing and
backward compatibility, and the idea of exposing more tuning options has
been brought up a few times and rejected.

Not directly related to your question, but possibly still of interest to
you:
- We're now tracking the performance of stored fields on small documents
nightly:
http://people.apache.org/~mikemccand/lucenebench/stored_fields_benchmarks.html
- If you're seeing a 30% performance degradation with recent changes to
stored fields, there is a good chance that you could significantly
improve the performance of this workload with a custom codec that is
lighter on compression.




--
Adrien
Re: 30% query performance degradation for documents with small stored fields
I wonder whether it would be worth trying to switch from stored fields
to doc values. The access patterns are different, so the change would
not be trivial, but you might be able to achieve gains this way. I
really am not sure whether or not you would, since the storage model is
completely different, but if you have a small number of fields, it could
be better.
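Roughly this shape, as a sketch (the field name is made up, and note
that a segment's doc values iterator must be advanced in increasing
docID order):

import java.io.IOException;
import org.apache.lucene.document.BinaryDocValuesField;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.BinaryDocValues;
import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.util.BytesRef;

public class PayloadAsDocValues {
  // Indexing: keep the small payload as binary doc values instead of a
  // stored field.
  static void addDoc(IndexWriter writer, byte[] payload) throws IOException {
    Document doc = new Document();
    doc.add(new BinaryDocValuesField("payload", new BytesRef(payload)));
    writer.addDocument(doc);
  }

  // Fetching: doc values are column-oriented and accessed per segment,
  // which is the main access-pattern difference from stored fields.
  static BytesRef getPayload(LeafReader leaf, int docId) throws IOException {
    BinaryDocValues values = DocValues.getBinary(leaf, "payload");
    return values.advanceExact(docId) ? values.binaryValue() : null;
  }
}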

Re: 30% query performance degradation for documents with small stored fields
Hi Adrien, Michael,
Thank you, your responses are very helpful.

> We're trying to have sensible defaults for the performance/compression trade-off in the default codec
Sure, the compression improvement achieved with these changes is
amazing, and the fetch-speed tradeoff makes a lot of sense since it's
likely unnoticeable for the general use case with larger stored fields
payloads.

> One supported approach consists of rewriting indexes to the default codec in order to perform upgrades, using `IndexWriter#addIndexes(CodecReader)`

That indeed could be really useful, although the ability to upgrade from
the previous Lucene version without re-indexing is very important for
us. Is my understanding correct that changing only the block size and
disabling preset dictionaries are changes that likely won't require
re-indexing and could easily be carried over to future Lucene versions?
I understand there is no guarantee, but I'm curious to hear your opinion
because this introduces additional risk for us.

> I wonder whether it would be worth trying to switch from stored fields to doc values

Yes, that is something we considered before but discarded due to the
specifics of our access patterns and the fact that the payload size can
also be large in some cases. That said, in the future we will likely
need to use doc values for a less generic feature where a small size is
guaranteed.

Regards,
Alex


Re: 30% query performance degradation for documents with small stored fields
> Is my understanding correct that changing only the block size and disabling preset dictionaries are changes that likely won't require re-indexing and could easily be carried over to future Lucene versions? I understand there is no guarantee, but I'm curious to hear your opinion because this introduces additional risk for us.

This assessment looks correct to me.


--
Adrien