Mailing List Archive

Needs help reviewing on Lucene PostingsFormat memory improvement
Hi Lucene devs!

I have 2 PRs to optimize Lucene PostingsFormat
(Lucene90BlockTreePostingsFormat and FSTPostingsFormat) by utilizing a new
feature to stream the FST to IndexOutput directly, bypassing the on-heap
writing:
- https://github.com/apache/lucene/pull/12980
- https://github.com/apache/lucene/pull/12985

It would be great if someone could help review them. I also have some general
questions:
- How do I measure the memory improvement impact in Lucene?
- Is Lucene90BlockTreePostingsFormat the main index format used in Lucene?
If not, what is the main format?
- Are there other places worth using the new streaming FST feature?

Thank you!
Anh Dung Bui
Re: Needs help reviewing on Lucene PostingsFormat memory improvement
Hi Anh Dũng Bùi,

Thank you for tackling these, and for being so patient and persistent! Sorry
for the delay. I will try to review them soon. The off-heap (streaming?)
building of FSTs is really a massive improvement to Lucene, inspired by
Tantivy's FST implementation: https://blog.burntsushi.net/transducers/

Read-time for Lucene90BlockTreePostingsFormat was already off-heap? And
your PR changes write-time to do so as well? This will reduce RAM pressure
during indexing which is great. And some Lucene usages generate incredibly
large FSTs (I'm looking at you, HathiTrust!). I don't think we need to
explicitly measure any performance impact before merging, but let's watch
the nightly benchmarks to see if there is any measurable impact.
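
For anyone who wants a rough local number before the nightly benchmarks
pick the change up, the difference between on-heap buffering and streaming
can be sketched without Lucene at all. The class below is only an
illustration under stated assumptions: the class and method names are
hypothetical, zero-filled chunks stand in for FST bytes, and the heap delta
taken from Runtime is approximate because GC timing is not controlled.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch, not Lucene code: contrasts buffering all bytes on heap
// with streaming them straight to a file.
final class StreamingVsOnHeapSketch {

    /** A task that may throw IOException, so the writers can be passed as lambdas. */
    interface IoTask {
        void run() throws IOException;
    }

    /** Stand-in for serialized FST bytes: writes `count` zero bytes in 8 KiB chunks. */
    static void writeChunks(OutputStream out, int count) throws IOException {
        byte[] chunk = new byte[8192];
        int remaining = count;
        while (remaining > 0) {
            int n = Math.min(chunk.length, remaining);
            out.write(chunk, 0, n);
            remaining -= n;
        }
    }

    /** On-heap style: accumulate everything in memory, then copy it to the file. */
    static void writeBuffered(Path file, int count) throws IOException {
        ByteArrayOutputStream heapBuffer = new ByteArrayOutputStream();
        writeChunks(heapBuffer, count);
        try (OutputStream out = Files.newOutputStream(file)) {
            heapBuffer.writeTo(out);
        }
    }

    /** Streaming style: each chunk goes to the file as soon as it is produced. */
    static void writeStreaming(Path file, int count) throws IOException {
        try (OutputStream out = Files.newOutputStream(file)) {
            writeChunks(out, count);
        }
    }

    /** Approximate heap currently in use, in bytes. */
    static long usedHeap() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Rough heap growth across a task; a GC during the task can skew the value. */
    static long heapDelta(IoTask task) throws IOException {
        System.gc(); // only a hint to the JVM
        long before = usedHeap();
        task.run();
        return usedHeap() - before;
    }

    public static void main(String[] args) throws IOException {
        int n = 32 * 1024 * 1024; // 32 MiB of stand-in data
        Path buffered = Files.createTempFile("buffered", ".bin");
        Path streamed = Files.createTempFile("streamed", ".bin");
        try {
            System.out.println("buffered heap delta (approx): "
                + heapDelta(() -> writeBuffered(buffered, n)));
            System.out.println("streamed heap delta (approx): "
                + heapDelta(() -> writeStreaming(streamed, n)));
        } finally {
            Files.deleteIfExists(buffered);
            Files.deleteIfExists(streamed);
        }
    }
}
```

Both writers produce the same file; only where the bytes live before they
reach disk differs. For real indexing runs, a heap profiler or GC logs would
give more trustworthy numbers than this kind of before/after snapshot.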

And, yes, Lucene90BlockTreePostingsFormat is the default. You can find the
default codec via Codec.getDefault() and then trace downwards from there
through its source code.

Maybe building the synonyms FST (SynonymMap.Builder) would be a good place
for off-heap writing too?

And this exciting PR <https://github.com/apache/lucene/pull/12688> (still a
work in progress) would likely benefit strongly from streaming FST building:
its FSTs will be much larger than the Lucene90BlockTree ones, since it
stores all terms (not just the sampled prefix/index) in a single FST for
the segment.

Mike McCandless

http://blog.mikemccandless.com


On Thu, Feb 1, 2024 at 10:40 PM Anh Dũng Bùi <dungba.sg@gmail.com> wrote:

> Hi Lucene devs!
>
> I have 2 PRs to optimize Lucene PostingsFormat
> (Lucene90BlockTreePostingsFormat and FSTPostingsFormat) by utilizing a new
> feature to stream the FST to IndexOutput directly, bypassing the on-heap
> writing:
> - https://github.com/apache/lucene/pull/12980
> - https://github.com/apache/lucene/pull/12985
>
> It would be great if someone could help review them. I also have some general
> questions:
> - How do I measure the memory improvement impact in Lucene?
> - Is Lucene90BlockTreePostingsFormat the main index format used in Lucene?
> If not, what is the main format?
> - Are there other places worth using the new streaming FST feature?
>
> Thank you!
> Anh Dung Bui
>
Re: Needs help reviewing on Lucene PostingsFormat memory improvement
Thanks Mike for the reply!

> Read-time for Lucene90BlockTreePostingsFormat was already off-heap? And
> your PR changes write-time to do so as well?

Yeah that's the idea. I changed just the Terms Writer to be off-heap.
Thanks, let's monitor it after the merge.

> Maybe building the synonyms FST (SynonymMap.Builder) would be a good
> place for off-heap writing too?

This is a good idea. I see there's an ongoing PR that tackles this
already: https://github.com/apache/lucene/pull/13054. I'm excited to see
the feature rolling out to different parts of Lucene.

> And this exciting PR <https://github.com/apache/lucene/pull/12688> (still
> a work in progress) would likely strongly benefit from streaming FST
> building, since its FSTs will be much larger than the Lucene90BlockTree
> since it stores all terms (not just the sampled prefix/index) in a single
> FST for the segment.

I can try to fork this PR and convert it to off-heap writing as well.

Regards,
Anh Dung Bui

On Thu, Feb 8, 2024 at 7:43 AM Michael McCandless <lucene@mikemccandless.com>
wrote:

> Hi Anh Dũng Bùi,
>
> Thank you for tackling these, and for being so patient and persistent!
> Sorry for the delay. I will try to review them soon. The off-heap
> (streaming?) building of FSTs is really a massive improvement to Lucene,
> inspired by Tantivy's FST implementation:
> https://blog.burntsushi.net/transducers/
>
> Read-time for Lucene90BlockTreePostingsFormat was already off-heap? And
> your PR changes write-time to do so as well? This will reduce RAM pressure
> during indexing which is great. And some Lucene usages generate incredibly
> large FSTs (I'm looking at you, HathiTrust!). I don't think we need to
> explicitly measure any performance impact before merging, but let's watch
> the nightly benchmarks to see if there is any measurable impact.
>
> And, yes, Lucene90BlockTreePostingsFormat is the default. You can find the
> default codec via Codec.getDefault() and then trace downwards from there
> through its source code.
>
> Maybe building the synonyms FST (SynonymMap.Builder) would be a good place
> for off-heap writing too?
>
> And this exciting PR <https://github.com/apache/lucene/pull/12688> (still
> a work in progress) would likely benefit strongly from streaming FST
> building: its FSTs will be much larger than the Lucene90BlockTree ones,
> since it stores all terms (not just the sampled prefix/index) in a single
> FST for the segment.
>
> Mike McCandless
>
> http://blog.mikemccandless.com