Unnecessary float[256] allocation on every (non-scoring) BM25Scorer
Hi all,

I was looking into a customer issue where they noticed some increased GC
time after upgrading from Lucene 7.x to 9.x. After taking some heap dumps
from both systems, the big difference was tracked down to the float[256]
allocated (as a norms cache) when creating a BM25Scorer (in
BM25Similarity.scorer()).

The change seems to have come in with
https://github.com/apache/lucene/commit/8fd7ead940f69a892dfc951a1aa042e8cae806c1,
which removed some of the special-case logic around the "non-scoring
similarity" embedded in IndexSearcher (returned in the false case from the
old IndexSearcher#scorer(boolean needsScores)).

While I really like that we no longer have that special-case logic in
IndexSearcher, we now have the issue that every time we create a new
TermWeight (or other Weight) it allocates a float[256], even if the
TermWeight doesn't need scores. Also, I think it's the exact same
float[256] for all non-scoring weights, since it's being computed using the
same "all 1s" CollectionStatistics and TermStatistics.

(For the record, yes, the queries in question have an obscene number of
TermQueries, so 1024 bytes times lots of TermWeights, times multiple
queries running concurrently makes lots of heap allocation.)

I'd like to submit a patch to fix this, but I'm wondering what approach to
take. One option I'm considering is precomputing a singleton float[256] for
the non-scoring case (where CollectionStatistics and TermStatistics are all
1s). That would have the least functional impact, but would let all
non-scoring clauses share the same array. Is there a better way to tackle
this?
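
For concreteness, here's a rough sketch of the shape I'm imagining inside
BM25Similarity (names made up, untested, and assuming the cache formula I
see in 9.x):

    // One shared table per similarity instance (k1 and b are instance
    // fields, so it can't be static). With "all 1s" statistics the
    // average field length is 1, so every non-scoring weight computes
    // exactly this same array today.
    private volatile float[] nonScoringCache;

    private float[] nonScoringCache() {
      float[] cache = nonScoringCache;
      if (cache == null) {
        cache = new float[256];
        for (int i = 0; i < cache.length; i++) {
          cache[i] = 1f / (k1 * ((1 - b) + b * LENGTH_TABLE[i])); // avgdl == 1
        }
        nonScoringCache = cache; // benign race: the computation is idempotent
      }
      return cache;
    }

scorer() could then check for the all-1s statistics and hand the shared
array to BM25Scorer instead of allocating a fresh one.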

Thanks,
Froh
Re: Unnecessary float[256] allocation on every (non-scoring) BM25Scorer
On Tue, May 2, 2023 at 12:49 PM Michael Froh <msfroh@gmail.com> wrote:

This seems ok if it isn't invasive. I still feel like something is
"off" if you are seeing GC time from 1KB-per-segment allocation. Do
you have way too many segments?

Originally (for various similar reasons) there was a place in the API
to do this, so it would only happen per-Weight instead of per-Scorer,
which was the SimWeight that got eliminated by the commit you point
to. But I'd love it if we could steer clear of that complexity:
simplifying the API here was definitely the right move. It's been more
than 5 years since this change was made, and this is the first
complaint I've heard about the 1KB, which is why I asked about your
setup.

Re: Unnecessary float[256] allocation on every (non-scoring) BM25Scorer
On Tue, May 2, 2023 at 2:34 PM Robert Muir <rcmuir@gmail.com> wrote:

One last thought: we should re-check if the cache is still needed :) I
think decoding norms used to be more expensive in the past. This cache
is now only precomputing part of the BM25 formula to save a few
adds, multiplies, and divides.
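
If memory serves, the cache is just the per-norm-byte factor from the
formula (modulo the inversion we do for monotonicity), something like:

    // in BM25Similarity.scorer(): one entry per possible norm byte
    float[] cache = new float[256];
    for (int i = 0; i < cache.length; i++) {
      cache[i] = 1f / (k1 * ((1 - b) + b * LENGTH_TABLE[i] / avgdl));
    }

and then score() is roughly weight - weight / (1 + freq * cache[norm & 0xFF]),
so the table only saves a handful of float ops per hit.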

Re: Unnecessary float[256] allocation on every (non-scoring) BM25Scorer
> This seems ok if it isn't invasive. I still feel like something is
> "off" if you are seeing GC time from 1KB-per-segment allocation. Do
> you have way too many segments?

From what I saw, it's 1KB per "leaf query" to create the
BM25Scorer instance (at the Weight level), but then that BM25Scorer is
shared across all scorer (DISI) instances for all segments. So it doesn't
scale with segment count. It looks like the old logic used to allocate a
SimScorer per segment, so this is a big improvement in that regard (for
scoring clauses, since the non-scoring clauses had a super-lightweight
SimScorer).

In this particular case, they're running these gnarly machine-generated
BooleanQuery trees with at least 512 non-scoring TermQuery clauses (across a
bunch of different fields, so TermInSetQuery isn't an option). From what I
can see, each of those TermQueries produces a TermWeight that holds a
BM25Scorer that holds yet another instance of this float[256] array, for
512KB+ of these caches per running query. It's definitely only going to be
an issue for folks who are flying close to the max clause count.
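
For context, a toy version of the query shape (fields and terms invented)
looks something like:

    BooleanQuery.Builder builder = new BooleanQuery.Builder();
    for (int i = 0; i < 512; i++) {
      // FILTER clauses never contribute to the score, but each TermQuery's
      // TermWeight still ends up holding its own float[256] norms cache
      builder.add(
          new TermQuery(new Term("field" + (i % 64), "value" + i)),
          BooleanClause.Occur.FILTER);
    }
    Query query = builder.build();

so each running query drags along 512 copies of what is effectively the
same array.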

> One last thought: we should re-check if the cache is still needed :) I
> think decoding norms used to be more expensive in the past. This cache
> is now only precomputing part of the bm25 formula to save some
> add/multiply/divide.

Yeah -- when I saw the cache calculation, it reminded me of precomputed
tables of trigonometric functions in the demoscene.

I could try inlining those calculations and measuring the impact with the
luceneutil benchmarks.
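
Untested sketch of the inlined version I'd benchmark (reusing the names
from BM25Scorer):

    @Override
    public float score(float freq, long encodedNorm) {
      // decode the norm byte and compute the BM25 length factor directly,
      // rather than looking it up in the per-scorer float[256]
      float dl = LENGTH_TABLE[((byte) encodedNorm) & 0xFF];
      float norm = k1 * ((1 - b) + b * dl / avgdl);
      return weight - weight / (1f + freq / norm);
    }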


Re: Unnecessary float[256] allocation on every (non-scoring) BM25Scorer
On Tue, May 2, 2023 at 3:24 PM Michael Froh <msfroh@gmail.com> wrote:

Yeah, but the same situation could be said for buffers like this one:
https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/codecs/lucene90/Lucene90PostingsReader.java#L311-L312
So I'm actually still confused why this float[256] stands out in your
measurements vs. the two long[128]s. Maybe it's a profiler ghost?

Re: Unnecessary float[256] allocation on every (non-scoring) BM25Scorer
> So I'm actually still confused why this float[256] stands out in your
> measurements vs. the two long[128]s. Maybe it's a profiler ghost?

Huh... that's a really good point.

I'm going to spend a bit more time digging and see if I can reliably
reproduce it on my own machine. I've just been comparing heap dumps from
production hosts so far, so I'll try measuring in an environment where I
can see what's going on.
