Mailing List Archive

Lucene cpu utilization & scoring
Hi,

We have a large index that we divide into X Lucene indices - we use Lucene
6.5.0. Each of our serving machines serves 8 Lucene indices in parallel.
We are getting realtime updates to each of these 8 indices. We are seeing a
couple of things:

a) When we turn off realtime updates, performance is significantly better.
When we turn on realtime updates, due to the accumulation of segments, CPU
utilization by Lucene goes up by at least *3X* [based on profiling].

b) A profile shows that the vast majority of time is being spent in
scoring methods even though we are setting *needsScores() to false* in our
collectors.

We do commit our index frequently and we are roughly at ~25 segments per
index - so a total of 8 * 25 ~ 200 segments across all the 8 indices.

Changing the number of indices per machine (currently 8) to reduce the
number of segments is a significant effort. So, we would like to know if
there are ways to improve performance, w.r.t. a) & b):

i) We have tried some parameters with the merge policy &
NRTCachingDirectory and they did not help significantly.
ii) Since we don't care about Lucene-level scores, is there a way to
completely disable scoring? Should setting needsScores() to false in our
collectors do the trick? Or should we create our own dummy weight/scorer
and inject it into the Query classes?

Thanks
Varun
Re: Lucene cpu utilization & scoring
I think the usual usage pattern is to *refresh* frequently and commit
less frequently. Is there a reason you need to commit often?
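To make the distinction concrete, the refresh-often / commit-rarely pattern can be sketched roughly as below. The class name, intervals, and use of a single scheduler are illustrative assumptions, not a prescription for any particular setup:

```java
// Sketch: refresh (cheap, NRT reader reopen) on a fast cadence for freshness,
// commit (expensive, fsyncs a durable commit point) on a slow cadence.
import java.io.IOException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.search.SearcherManager;

public class NrtRefresher {
    private final IndexWriter writer;
    private final SearcherManager manager;
    private final ScheduledExecutorService scheduler =
        Executors.newScheduledThreadPool(1);

    public NrtRefresher(IndexWriter writer) throws IOException {
        this.writer = writer;
        // Readers acquired from the manager see near-real-time changes.
        this.manager = new SearcherManager(writer, null);
    }

    public void start() {
        // Cheap: reopen the NRT reader every second so searches stay fresh.
        scheduler.scheduleAtFixedRate(() -> {
            try { manager.maybeRefresh(); } catch (IOException e) { /* log */ }
        }, 1, 1, TimeUnit.SECONDS);

        // Expensive: write a durable commit point only every few minutes.
        scheduler.scheduleAtFixedRate(() -> {
            try { writer.commit(); } catch (IOException e) { /* log */ }
        }, 5, 5, TimeUnit.MINUTES);
    }
}
```

Searches would then acquire/release an IndexSearcher from the SearcherManager, and durability after a crash is bounded by the commit interval rather than the refresh interval.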

You may also have overlooked this newish method: MergePolicy.findFullFlushMerges

If you implement that, you can tell IndexWriter to (for example) merge
multiple small segments on commit, which may be piling up given
frequent commits, and if you are indexing across multiple threads. We
found this can help reduce the number of segments, and the variability
in the number of segments. I don't know if that is truly a root cause
of your performance problems here though.
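For readers on a Lucene version that has this hook (8.6+), a sketch of such a policy might look like the following. The size threshold and the "merge all small segments into one" heuristic are illustrative assumptions, not a tuned recommendation:

```java
// Sketch: merge small segments at commit/full-flush time via
// MergePolicy.findFullFlushMerges (available in Lucene 8.6+).
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.index.MergeTrigger;
import org.apache.lucene.index.SegmentCommitInfo;
import org.apache.lucene.index.SegmentInfos;
import org.apache.lucene.index.TieredMergePolicy;

public class CommitTimeMergePolicy extends TieredMergePolicy {
    // Illustrative threshold: treat anything under 16 MB as "small".
    private static final long SMALL_SEGMENT_BYTES = 16L * 1024 * 1024;

    @Override
    public MergeSpecification findFullFlushMerges(MergeTrigger trigger,
                                                  SegmentInfos infos,
                                                  MergeContext context)
            throws IOException {
        List<SegmentCommitInfo> small = new ArrayList<>();
        for (SegmentCommitInfo sci : infos) {
            if (sci.sizeInBytes() < SMALL_SEGMENT_BYTES) {
                small.add(sci);
            }
        }
        if (small.size() < 2) {
            return null; // nothing worth merging at commit time
        }
        MergeSpecification spec = new MergeSpecification();
        spec.add(new OneMerge(small)); // collapse all small segments into one
        return spec;
    }
}
```

Note that IndexWriterConfig#setMaxFullFlushMergeWaitMillis controls how long a commit will wait for these merges to complete.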

Regarding scoring costs - I don't think creating a dummy Weight and
Scorer will do what you think: Scorers in fact do matching as well as
scoring. You won't get any results if you don't have a real Scorer.

I *think* that setting needsScores() to false should disable work done
to compute relevance scores - you can confirm by looking at the scores
you get back with your hits - are they all zero? Also, we did
something similar in our system, and then later re-enabled scoring,
and it did not add significant cost for us. YMMV, but are you sure the
costs you are seeing are related to computing scores and not required
for matching?
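As a concrete point of comparison, a minimal Lucene 6.x collector that opts out of scoring looks something like this (the class name is made up for illustration):

```java
// Sketch: a collector that declares it does not need scores. When
// needsScores() returns false, Lucene can skip per-hit score computation,
// but the Scorer still performs matching - which is why a dummy Scorer
// would return no results at all.
import java.io.IOException;

import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.search.SimpleCollector;

public class DocIdOnlyCollector extends SimpleCollector {
    private int docBase;
    private int hitCount;

    @Override
    protected void doSetNextReader(LeafReaderContext context) {
        this.docBase = context.docBase; // offset for per-segment doc ids
    }

    @Override
    public void collect(int doc) throws IOException {
        // Only the global doc id is used; scorer.score() is never called.
        int globalDoc = docBase + doc;
        hitCount++;
    }

    @Override
    public boolean needsScores() {
        return false; // lets Lucene skip score computation entirely
    }

    public int getHitCount() {
        return hitCount;
    }
}
```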

-Mike

On Fri, Aug 20, 2021 at 2:02 PM Varun Sharma
<varun.sharma@airbnb.com.invalid> wrote:

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: Lucene cpu utilization & scoring
Thanks, Michael. It's good to know that scorers also do matching. I
will check and verify whether the scores returned are 0 or not.

Just to give some background, we have two setups:
a) Old setup - Each machine serves a single lucene index which has roughly
30'ish segments with realtime updates.
b) New setup - Each machine serves 8 lucene indices with 1/8th the data.
Each index has 25'ish segments - a total of 200 segments. Total data is the
same.

We find that the % CPU time spent in Lucene goes up by a factor of 2-3X
as we move from a) to b). When realtime updates are disabled, the setup
in a) has 1 segment while the setup in b) has 8 total segments. The
increase in CPU time is not as bad with realtime updates disabled.

This makes us think that the large CPU increase is somehow coming from
the increase in segments (30 -> 200), even though the total amount of
data across the segments is the same. A couple of questions:
a) Should we expect a CPU increase of this magnitude, or could there be
something wrong with our setup?
b) Does an index refresh also lead to the creation of a new segment? We
can experiment with committing less often, though we probably need to
refresh at least every minute for index freshness. (Unfortunately,
findFullFlushMerges is not present in the Lucene 6.5.0 MergePolicy.)
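Question b) could be checked empirically with a small standalone sketch (Lucene 6.x API; the RAMDirectory and field names are purely for illustration) that counts reader leaves across NRT reopens:

```java
// Sketch: each NRT reopen that sees newly buffered documents flushes them
// into a new segment, visible as an additional leaf in the reader.
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.RAMDirectory;

public class RefreshCreatesSegment {
    public static void main(String[] args) throws Exception {
        RAMDirectory dir = new RAMDirectory();
        IndexWriter writer =
            new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()));

        Document doc = new Document();
        doc.add(new StringField("id", "1", Field.Store.NO));
        writer.addDocument(doc);

        // NRT open flushes the in-memory buffer to a new (uncommitted) segment.
        DirectoryReader r1 = DirectoryReader.open(writer);
        System.out.println("leaves after first reopen: " + r1.leaves().size());

        writer.addDocument(doc);
        DirectoryReader r2 = DirectoryReader.openIfChanged(r1, writer);
        System.out.println("leaves after second reopen: " + r2.leaves().size());
    }
}
```

In our understanding the leaf count grows by one per reopen that flushes buffered docs, until the merge policy folds the small segments back together.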

Thanks
Varun

On Fri, Aug 20, 2021 at 12:57 PM Michael Sokolov <msokolov@gmail.com> wrote:
