Mailing List Archive

Why is AttributeSource#addAttribute a hotspot for nightly benchmarks?
Hello,

I've been looking a bit more carefully at nightly benchmarks recently and
I'm puzzled by the fact that indexing spends almost 5% of the time on
AttributeSource#addAttribute. Here is the link
<http://people.apache.org/~mikemccand/lucenebench/2021.10.20.08.24.09.html#profiler_4kb_indexing_1_cpu>
.

4.37% 14731
org.apache.lucene.util.AttributeSource#addAttribute()
at
org.apache.lucene.document.Field$StringTokenStream#()
at
org.apache.lucene.document.Field#tokenStream()
at
org.apache.lucene.index.IndexingChain$PerField#invert()
at
org.apache.lucene.index.IndexingChain#processField()
at
org.apache.lucene.index.IndexingChain#processDocument()
at
org.apache.lucene.index.DocumentsWriterPerThread#updateDocuments()
at
org.apache.lucene.index.DocumentsWriter#updateDocuments()
at
org.apache.lucene.index.IndexWriter#updateDocuments()
at
org.apache.lucene.index.IndexWriter#updateDocument()
at
org.apache.lucene.index.IndexWriter#addDocument()
at perf.IndexThreads$IndexThread#run()

Given that nightly benchmarks reuse Field instances across documents, this
should only happen once per thread, so why does it show up as a bottleneck
in our nightly benchmarks? I tried to reproduce locally, but I'm not seeing
AttributeSource among top CPU consumers.

--
Adrien
Re: Why is AttributeSource#addAttribute a hotspot for nightly benchmarks? [ In reply to ]
Looking at the source/stating the obvious, creating a new
StringTokenString from here only happens under certain conditions:
* field is indexed
* field is not tokenized (e.g. not using analyzer, ID field or similar)
* incoming "reuse" parameter is not a StringTokenStream

What is puzzling to me is that it only seems to hit the 4KB documents.
If there is an issue here, I'd expect it to have an even higher impact
for the 1KB documents indexing.

But also the internal reuse of IndexingChain.PerField (which houses
the reused tokenstream) isn't just per-thread, it is
per-thread-per-segment, right? So if Mike is indexing with 100
threads, and flushes 200 times, I'd expect 20k of these things to be
made. There's a lot going on in the benchmark code for nightly and it
is tricky for me to try to navigate the various cases (1KB,
1KB-with-vectors, 4KB, "deterministic indexing", etc)

On Thu, Oct 21, 2021 at 3:40 AM Adrien Grand <jpountz@gmail.com> wrote:
>
> Hello,
>
> I've been looking a bit more carefully at nightly benchmarks recently and I'm puzzled by the fact that indexing spends almost 5% of the time on AttributeSource#addAttribute. Here is the link.
>
> 4.37% 14731 org.apache.lucene.util.AttributeSource#addAttribute()
> at org.apache.lucene.document.Field$StringTokenStream#()
> at org.apache.lucene.document.Field#tokenStream()
> at org.apache.lucene.index.IndexingChain$PerField#invert()
> at org.apache.lucene.index.IndexingChain#processField()
> at org.apache.lucene.index.IndexingChain#processDocument()
> at org.apache.lucene.index.DocumentsWriterPerThread#updateDocuments()
> at org.apache.lucene.index.DocumentsWriter#updateDocuments()
> at org.apache.lucene.index.IndexWriter#updateDocuments()
> at org.apache.lucene.index.IndexWriter#updateDocument()
> at org.apache.lucene.index.IndexWriter#addDocument()
> at perf.IndexThreads$IndexThread#run()
>
> Given that nightly benchmarks reuse Field instances across documents, this should only happen once per thread, so why does it show up as a bottleneck in our nightly benchmarks? I tried to reproduce locally, but I'm not seeing AttributeSource among top CPU consumers.
>
> --
> Adrien

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
Re: Why is AttributeSource#addAttribute a hotspot for nightly benchmarks? [ In reply to ]
On Thu, Oct 21, 2021 at 8:36 AM Robert Muir <rcmuir@gmail.com> wrote:
>
> But also the internal reuse of IndexingChain.PerField (which houses
> the reused tokenstream) isn't just per-thread, it is
> per-thread-per-segment, right? So if Mike is indexing with 100
> threads, and flushes 200 times, I'd expect 20k of these things to be
> made. There's a lot going on in the benchmark code for nightly and it
> is tricky for me to try to navigate the various cases (1KB,
> 1KB-with-vectors, 4KB, "deterministic indexing", etc)

I think this might be the case with your link. If you look at the URL
of your actual link, you see it ends with #profiler_4kb_indexing_1_cpu
?
This makes me think i'm looking at the profiler output of the
"deterministic indexing".
For this one, LogDocMergePolicy is used.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
Re: Why is AttributeSource#addAttribute a hotspot for nightly benchmarks? [ In reply to ]
Small correction: nightly benchmarks currently uses "only" 36 threads.

And indeed the "#profiler_4kb_indexing_1_cpu" is single-threaded indexing!
Why on earth is addAttribute so costly in that case! Wow, I'm glad I named
those anchor links well lol.

Mike McCandless

http://blog.mikemccandless.com


On Thu, Oct 21, 2021 at 8:36 AM Robert Muir <rcmuir@gmail.com> wrote:

> Looking at the source/stating the obvious, creating a new
> StringTokenString from here only happens under certain conditions:
> * field is indexed
> * field is not tokenized (e.g. not using analyzer, ID field or similar)
> * incoming "reuse" parameter is not a StringTokenStream
>
> What is puzzling to me is that it only seems to hit the 4KB documents.
> If there is an issue here, I'd expect it to have an even higher impact
> for the 1KB documents indexing.
>
> But also the internal reuse of IndexingChain.PerField (which houses
> the reused tokenstream) isn't just per-thread, it is
> per-thread-per-segment, right? So if Mike is indexing with 100
> threads, and flushes 200 times, I'd expect 20k of these things to be
> made. There's a lot going on in the benchmark code for nightly and it
> is tricky for me to try to navigate the various cases (1KB,
> 1KB-with-vectors, 4KB, "deterministic indexing", etc)
>
> On Thu, Oct 21, 2021 at 3:40 AM Adrien Grand <jpountz@gmail.com> wrote:
> >
> > Hello,
> >
> > I've been looking a bit more carefully at nightly benchmarks recently
> and I'm puzzled by the fact that indexing spends almost 5% of the time on
> AttributeSource#addAttribute. Here is the link.
> >
> > 4.37% 14731
> org.apache.lucene.util.AttributeSource#addAttribute()
> > at
> org.apache.lucene.document.Field$StringTokenStream#()
> > at
> org.apache.lucene.document.Field#tokenStream()
> > at
> org.apache.lucene.index.IndexingChain$PerField#invert()
> > at
> org.apache.lucene.index.IndexingChain#processField()
> > at
> org.apache.lucene.index.IndexingChain#processDocument()
> > at
> org.apache.lucene.index.DocumentsWriterPerThread#updateDocuments()
> > at
> org.apache.lucene.index.DocumentsWriter#updateDocuments()
> > at
> org.apache.lucene.index.IndexWriter#updateDocuments()
> > at
> org.apache.lucene.index.IndexWriter#updateDocument()
> > at
> org.apache.lucene.index.IndexWriter#addDocument()
> > at perf.IndexThreads$IndexThread#run()
> >
> > Given that nightly benchmarks reuse Field instances across documents,
> this should only happen once per thread, so why does it show up as a
> bottleneck in our nightly benchmarks? I tried to reproduce locally, but I'm
> not seeing AttributeSource among top CPU consumers.
> >
> > --
> > Adrien
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>
Re: Why is AttributeSource#addAttribute a hotspot for nightly benchmarks? [ In reply to ]
Yeah, I'm pretty lost in all the ways we index here. But if we are
passing maxBufferedDocs <low number> for this deterministic indexing,
I think it would cause the issue? I have no idea what the IW config
here is...

On Thu, Oct 21, 2021 at 8:48 AM Robert Muir <rcmuir@gmail.com> wrote:
>
> On Thu, Oct 21, 2021 at 8:36 AM Robert Muir <rcmuir@gmail.com> wrote:
> >
> > But also the internal reuse of IndexingChain.PerField (which houses
> > the reused tokenstream) isn't just per-thread, it is
> > per-thread-per-segment, right? So if Mike is indexing with 100
> > threads, and flushes 200 times, I'd expect 20k of these things to be
> > made. There's a lot going on in the benchmark code for nightly and it
> > is tricky for me to try to navigate the various cases (1KB,
> > 1KB-with-vectors, 4KB, "deterministic indexing", etc)
>
> I think this might be the case with your link. If you look at the URL
> of your actual link, you see it ends with #profiler_4kb_indexing_1_cpu
> ?
> This makes me think i'm looking at the profiler output of the
> "deterministic indexing".
> For this one, LogDocMergePolicy is used.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
Re: Why is AttributeSource#addAttribute a hotspot for nightly benchmarks? [ In reply to ]
Ahh we are indeed doing that. The maxBufferedDocs is total-doc-count /
555, to provoke precisely a "5 big segments + 5 medium segments + 5 baby
segments" consistent segment geometry in the end.

But that works out to:

maxBufferedDocs=49774

Which is not too tiny?

Mike McCandless

http://blog.mikemccandless.com


On Thu, Oct 21, 2021 at 8:52 AM Robert Muir <rcmuir@gmail.com> wrote:

> Yeah, I'm pretty lost in all the ways we index here. But if we are
> passing maxBufferedDocs <low number> for this deterministic indexing,
> I think it would cause the issue? I have no idea what the IW config
> here is...
>
> On Thu, Oct 21, 2021 at 8:48 AM Robert Muir <rcmuir@gmail.com> wrote:
> >
> > On Thu, Oct 21, 2021 at 8:36 AM Robert Muir <rcmuir@gmail.com> wrote:
> > >
> > > But also the internal reuse of IndexingChain.PerField (which houses
> > > the reused tokenstream) isn't just per-thread, it is
> > > per-thread-per-segment, right? So if Mike is indexing with 100
> > > threads, and flushes 200 times, I'd expect 20k of these things to be
> > > made. There's a lot going on in the benchmark code for nightly and it
> > > is tricky for me to try to navigate the various cases (1KB,
> > > 1KB-with-vectors, 4KB, "deterministic indexing", etc)
> >
> > I think this might be the case with your link. If you look at the URL
> > of your actual link, you see it ends with #profiler_4kb_indexing_1_cpu
> > ?
> > This makes me think i'm looking at the profiler output of the
> > "deterministic indexing".
> > For this one, LogDocMergePolicy is used.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>
Re: Why is AttributeSource#addAttribute a hotspot for nightly benchmarks? [ In reply to ]
So ~ 555 flushes?

I see over 3k samples from Adrien's link in
org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsWriter$TermsWriter#()
I still think the issue is, tokenstreams from analyzers reuse "better"
than ones from StringField, because they have a threadlocal? Whereas
the StringField relies upon the reuse of IndexingChain.PerField.

Maybe it can be better inside IndexWriter, so that it isn't lost on
flush? Just don't cross the tokenstreams. It would be bad :)

On Thu, Oct 21, 2021 at 9:03 AM Michael McCandless
<lucene@mikemccandless.com> wrote:
>
> Ahh we are indeed doing that. The maxBufferedDocs is total-doc-count / 555, to provoke precisely a "5 big segments + 5 medium segments + 5 baby segments" consistent segment geometry in the end.
>
> But that works out to:
>
> maxBufferedDocs=49774
>
> Which is not too tiny?
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Thu, Oct 21, 2021 at 8:52 AM Robert Muir <rcmuir@gmail.com> wrote:
>>
>> Yeah, I'm pretty lost in all the ways we index here. But if we are
>> passing maxBufferedDocs <low number> for this deterministic indexing,
>> I think it would cause the issue? I have no idea what the IW config
>> here is...
>>
>> On Thu, Oct 21, 2021 at 8:48 AM Robert Muir <rcmuir@gmail.com> wrote:
>> >
>> > On Thu, Oct 21, 2021 at 8:36 AM Robert Muir <rcmuir@gmail.com> wrote:
>> > >
>> > > But also the internal reuse of IndexingChain.PerField (which houses
>> > > the reused tokenstream) isn't just per-thread, it is
>> > > per-thread-per-segment, right? So if Mike is indexing with 100
>> > > threads, and flushes 200 times, I'd expect 20k of these things to be
>> > > made. There's a lot going on in the benchmark code for nightly and it
>> > > is tricky for me to try to navigate the various cases (1KB,
>> > > 1KB-with-vectors, 4KB, "deterministic indexing", etc)
>> >
>> > I think this might be the case with your link. If you look at the URL
>> > of your actual link, you see it ends with #profiler_4kb_indexing_1_cpu
>> > ?
>> > This makes me think i'm looking at the profiler output of the
>> > "deterministic indexing".
>> > For this one, LogDocMergePolicy is used.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
Re: Why is AttributeSource#addAttribute a hotspot for nightly benchmarks? [ In reply to ]
LOL don't cross the tokenstreams!

Yeah should be 555 or 556 flushes I think. Probably times the number of
indexed fields, gets us to the 3K count?

+1 to improve IW's internal re-use in the non-analyzed StringField case.

Mike McCandless

http://blog.mikemccandless.com


On Thu, Oct 21, 2021 at 9:14 AM Robert Muir <rcmuir@gmail.com> wrote:

> So ~ 555 flushes?
>
> I see over 3k samples from Adrien's link in
>
> org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsWriter$TermsWriter#()
> I still think the issue is, tokenstreams from analyzers reuse "better"
> than ones from StringField, because they have a threadlocal? Whereas
> the StringField relies upon the reuse of IndexingChain.PerField.
>
> Maybe it can be better inside IndexWriter, so that it isn't lost on
> flush? Just don't cross the tokenstreams. It would be bad :)
>
> On Thu, Oct 21, 2021 at 9:03 AM Michael McCandless
> <lucene@mikemccandless.com> wrote:
> >
> > Ahh we are indeed doing that. The maxBufferedDocs is total-doc-count /
> 555, to provoke precisely a "5 big segments + 5 medium segments + 5 baby
> segments" consistent segment geometry in the end.
> >
> > But that works out to:
> >
> > maxBufferedDocs=49774
> >
> > Which is not too tiny?
> >
> > Mike McCandless
> >
> > http://blog.mikemccandless.com
> >
> >
> > On Thu, Oct 21, 2021 at 8:52 AM Robert Muir <rcmuir@gmail.com> wrote:
> >>
> >> Yeah, I'm pretty lost in all the ways we index here. But if we are
> >> passing maxBufferedDocs <low number> for this deterministic indexing,
> >> I think it would cause the issue? I have no idea what the IW config
> >> here is...
> >>
> >> On Thu, Oct 21, 2021 at 8:48 AM Robert Muir <rcmuir@gmail.com> wrote:
> >> >
> >> > On Thu, Oct 21, 2021 at 8:36 AM Robert Muir <rcmuir@gmail.com> wrote:
> >> > >
> >> > > But also the internal reuse of IndexingChain.PerField (which houses
> >> > > the reused tokenstream) isn't just per-thread, it is
> >> > > per-thread-per-segment, right? So if Mike is indexing with 100
> >> > > threads, and flushes 200 times, I'd expect 20k of these things to be
> >> > > made. There's a lot going on in the benchmark code for nightly and
> it
> >> > > is tricky for me to try to navigate the various cases (1KB,
> >> > > 1KB-with-vectors, 4KB, "deterministic indexing", etc)
> >> >
> >> > I think this might be the case with your link. If you look at the URL
> >> > of your actual link, you see it ends with #profiler_4kb_indexing_1_cpu
> >> > ?
> >> > This makes me think i'm looking at the profiler output of the
> >> > "deterministic indexing".
> >> > For this one, LogDocMergePolicy is used.
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: dev-help@lucene.apache.org
> >>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>
Re: Why is AttributeSource#addAttribute a hotspot for nightly benchmarks? [ In reply to ]
Thanks Robert and Mike for helping dig. I opened
https://issues.apache.org/jira/browse/LUCENE-10203.

On Thu, Oct 21, 2021 at 3:22 PM Michael McCandless <
lucene@mikemccandless.com> wrote:

> LOL don't cross the tokenstreams!
>
> Yeah should be 555 or 556 flushes I think. Probably times the number of
> indexed fields, gets us to the 3K count?
>
> +1 to improve IW's internal re-use in the non-analyzed StringField case.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>
>
> On Thu, Oct 21, 2021 at 9:14 AM Robert Muir <rcmuir@gmail.com> wrote:
>
>> So ~ 555 flushes?
>>
>> I see over 3k samples from Adrien's link in
>>
>> org.apache.lucene.codecs.lucene90.blocktree.Lucene90BlockTreeTermsWriter$TermsWriter#()
>> I still think the issue is, tokenstreams from analyzers reuse "better"
>> than ones from StringField, because they have a threadlocal? Whereas
>> the StringField relies upon the reuse of IndexingChain.PerField.
>>
>> Maybe it can be better inside IndexWriter, so that it isn't lost on
>> flush? Just don't cross the tokenstreams. It would be bad :)
>>
>> On Thu, Oct 21, 2021 at 9:03 AM Michael McCandless
>> <lucene@mikemccandless.com> wrote:
>> >
>> > Ahh we are indeed doing that. The maxBufferedDocs is total-doc-count /
>> 555, to provoke precisely a "5 big segments + 5 medium segments + 5 baby
>> segments" consistent segment geometry in the end.
>> >
>> > But that works out to:
>> >
>> > maxBufferedDocs=49774
>> >
>> > Which is not too tiny?
>> >
>> > Mike McCandless
>> >
>> > http://blog.mikemccandless.com
>> >
>> >
>> > On Thu, Oct 21, 2021 at 8:52 AM Robert Muir <rcmuir@gmail.com> wrote:
>> >>
>> >> Yeah, I'm pretty lost in all the ways we index here. But if we are
>> >> passing maxBufferedDocs <low number> for this deterministic indexing,
>> >> I think it would cause the issue? I have no idea what the IW config
>> >> here is...
>> >>
>> >> On Thu, Oct 21, 2021 at 8:48 AM Robert Muir <rcmuir@gmail.com> wrote:
>> >> >
>> >> > On Thu, Oct 21, 2021 at 8:36 AM Robert Muir <rcmuir@gmail.com>
>> wrote:
>> >> > >
>> >> > > But also the internal reuse of IndexingChain.PerField (which houses
>> >> > > the reused tokenstream) isn't just per-thread, it is
>> >> > > per-thread-per-segment, right? So if Mike is indexing with 100
>> >> > > threads, and flushes 200 times, I'd expect 20k of these things to
>> be
>> >> > > made. There's a lot going on in the benchmark code for nightly and
>> it
>> >> > > is tricky for me to try to navigate the various cases (1KB,
>> >> > > 1KB-with-vectors, 4KB, "deterministic indexing", etc)
>> >> >
>> >> > I think this might be the case with your link. If you look at the URL
>> >> > of your actual link, you see it ends with
>> #profiler_4kb_indexing_1_cpu
>> >> > ?
>> >> > This makes me think i'm looking at the profiler output of the
>> >> > "deterministic indexing".
>> >> > For this one, LogDocMergePolicy is used.
>> >>
>> >> ---------------------------------------------------------------------
>> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> >> For additional commands, e-mail: dev-help@lucene.apache.org
>> >>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>>

--
Adrien