Mailing List Archive: Analyzer lifecycles

Analyzer lifecycles

Jun 8, 2021, 7:16 AM

Post #1 of 5 (563 views)

Hi all,

I’ve been on holiday and away from a keyboard for a week, so that means I of course spent my time thinking about lucene Analyzers and specifically their ReuseStrategies…

Building a TokenStream can be quite a heavy operation, and so we try and reuse already-constructed token streams as much as possible. This is particularly important at index time, as having to create lots and lots of very short-lived token streams for documents with many short text fields could mean that we spend longer building these objects than we do pulling data from them. To help support this, lucene Analyzers have a ReuseStrategy, which defaults to storing a map of fields to token streams in a ThreadLocal object. Because ThreadLocals can behave badly when it comes to containers that have large thread pools, we use a special CloseableThreadLocal class that can null out its contents once the Analyzer is done with, and this leads to Analyzer itself being Closeable. This makes extending analyzers more complicated, as delegating wrappers need to ensure that they don’t end up sharing token streams with their delegates.

It’s common to use the same analyzer for indexing and for parsing user queries. At query time, reusing token streams is a lot less important - the amount of time spent building the query is typically much lower than the amount of time spent rewriting and executing it. The fact that this re-use is only really useful for index time and that the lifecycle of the analyzer is therefore very closely tied to the lifecycle of its associated IndexWriter makes me think that we should think about moving the re-use strategies into IndexWriter itself. One option would be to have token streams be constructed once per DocumentsWriterPerThread, which would lose some re-use but would mean we could avoid ThreadLocals entirely.

Any thoughts?
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: Analyzer lifecycles [ In reply to ]

rcmuir at gmail

Jun 8, 2021, 8:12 AM

Post #2 of 5 (563 views)

Permalink

Alan: a couple thoughts:

Analyzers are not just used for formulating queries, but also may be
used by highlighters and other things on document results at query
time.
Some analyzers may do too-expensive/garbage-creating stuff on
construction, that you wouldn't want to do at query-time.
Separately, I think Analyzer being closable makes sense.
Users still need to carefully consider the lifecycle of this thing for
performance, and may want to return their own resources for some
reason (close() is a non-final method today)
Analyzers might require large amounts of resources (such as parsing
files/lists, ml models, who knows what).
For the built-in minimal resources that we ship, we try to make
construction cheap and use static holder classes, and so on. I'm
concerned some of these are costly.
But I'm definitely worried about longer files and stuff that many
users might use.

I feel like some of this "large threadpool" stuff is just a java
antipattern for search. I configure servers with fixed threadpools
matching the number of CPU cores, and tell my load balancer about that
number (e.g. haproxy maxconn), so that it can effectively queue and
not overload search servers.

On Tue, Jun 8, 2021 at 10:23 AM Alan Woodward <romseygeek@gmail.com> wrote:
>
> Hi all,
>
> I’ve been on holiday and away from a keyboard for a week, so that means I of course spent my time thinking about lucene Analyzers and specifically their ReuseStrategies…
>
> Building a TokenStream can be quite a heavy operation, and so we try and reuse already-constructed token streams as much as possible. This is particularly important at index time, as having to create lots and lots of very short-lived token streams for documents with many short text fields could mean that we spend longer building these objects than we do pulling data from them. To help support this, lucene Analyzers have a ReuseStrategy, which defaults to storing a map of fields to token streams in a ThreadLocal object. Because ThreadLocals can behave badly when it comes to containers that have large thread pools, we use a special CloseableThreadLocal class that can null out its contents once the Analyzer is done with, and this leads to Analyzer itself being Closeable. This makes extending analyzers more complicated, as delegating wrappers need to ensure that they don’t end up sharing token streams with their delegates.
>
> It’s common to use the same analyzer for indexing and for parsing user queries. At query time, reusing token streams is a lot less important - the amount of time spent building the query is typically much lower than the amount of time spent rewriting and executing it. The fact that this re-use is only really useful for index time and that the lifecycle of the analyzer is therefore very closely tied to the lifecycle of its associated IndexWriter makes me think that we should think about moving the re-use strategies into IndexWriter itself. One option would be to have token streams be constructed once per DocumentsWriterPerThread, which would lose some re-use but would mean we could avoid ThreadLocals entirely.
>
> Any thoughts?
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: Analyzer lifecycles [ In reply to ]

romseygeek at gmail

Jun 9, 2021, 6:03 AM

Post #3 of 5 (563 views)

Permalink

Hey Robert,

Analyzers themselves can be heavy and load large data files, etc, I agree, but I’m really talking about token stream construction. The way things are set up, we expect the heavy lifting to be done when the Analyzer is constructed, but these heavy resources should then be shared between token streams (I fixed a bug in the Ukrainian analyzer a while back that was getting this wrong, see LUCENE-9930). So you’d build your Analyzers once and use the same instances at query and at index time, and there’s no worry about reloading large dictionaries on every use.

But re-using token streams is different. The reason we have TokenStreamComponents and ReuseStrategies (as I understand it) is not because they may have to load large resource files or dictionaries or whatever, but it’s because building a TokenStream is itself quite a heavy operation due to AttributeFactories and reflection. My argument is that this is only heavy relative to the cost of indexing a single field, and that this only really matters when you have documents with lots of small fields in them. For query building or highlighting or MoreLlikeThis, the cost of building a small number of token streams is tiny compared to all the other heavy lifting and IO going on. And so if we pushed this *TokenStream* reuse into IndexWriter we wouldn’t have to have a close() method on Analyzer (because the thread locals go away, and we expect file resources etc to be closed once the analyzer has finished building itself), and delegating or wrapping analyzers becomes much simpler.

Does that make more sense?

(I agree on the thread pool stuff, but we need to be careful about not blowing up users systems even if they are implementing anti-patterns!)

> On 8 Jun 2021, at 16:12, Robert Muir <rcmuir@gmail.com> wrote:
>
> Alan: a couple thoughts:
>
> Analyzers are not just used for formulating queries, but also may be
> used by highlighters and other things on document results at query
> time.
> Some analyzers may do too-expensive/garbage-creating stuff on
> construction, that you wouldn't want to do at query-time.
> Separately, I think Analyzer being closable makes sense.
> Users still need to carefully consider the lifecycle of this thing for
> performance, and may want to return their own resources for some
> reason (close() is a non-final method today)
> Analyzers might require large amounts of resources (such as parsing
> files/lists, ml models, who knows what).
> For the built-in minimal resources that we ship, we try to make
> construction cheap and use static holder classes, and so on. I'm
> concerned some of these are costly.
> But I'm definitely worried about longer files and stuff that many
> users might use.
>
> I feel like some of this "large threadpool" stuff is just a java
> antipattern for search. I configure servers with fixed threadpools
> matching the number of CPU cores, and tell my load balancer about that
> number (e.g. haproxy maxconn), so that it can effectively queue and
> not overload search servers.
>
> On Tue, Jun 8, 2021 at 10:23 AM Alan Woodward <romseygeek@gmail.com> wrote:
>>
>> Hi all,
>>
>> I’ve been on holiday and away from a keyboard for a week, so that means I of course spent my time thinking about lucene Analyzers and specifically their ReuseStrategies…
>>
>> Building a TokenStream can be quite a heavy operation, and so we try and reuse already-constructed token streams as much as possible. This is particularly important at index time, as having to create lots and lots of very short-lived token streams for documents with many short text fields could mean that we spend longer building these objects than we do pulling data from them. To help support this, lucene Analyzers have a ReuseStrategy, which defaults to storing a map of fields to token streams in a ThreadLocal object. Because ThreadLocals can behave badly when it comes to containers that have large thread pools, we use a special CloseableThreadLocal class that can null out its contents once the Analyzer is done with, and this leads to Analyzer itself being Closeable. This makes extending analyzers more complicated, as delegating wrappers need to ensure that they don’t end up sharing token streams with their delegates.
>>
>> It’s common to use the same analyzer for indexing and for parsing user queries. At query time, reusing token streams is a lot less important - the amount of time spent building the query is typically much lower than the amount of time spent rewriting and executing it. The fact that this re-use is only really useful for index time and that the lifecycle of the analyzer is therefore very closely tied to the lifecycle of its associated IndexWriter makes me think that we should think about moving the re-use strategies into IndexWriter itself. One option would be to have token streams be constructed once per DocumentsWriterPerThread, which would lose some re-use but would mean we could avoid ThreadLocals entirely.
>>
>> Any thoughts?
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: Analyzer lifecycles [ In reply to ]

rcmuir at gmail

Jun 9, 2021, 6:22 AM

Post #4 of 5 (563 views)

Permalink

Yes I'm using the term "Analyzer" in a generic sense, also concerned
about TokenStream init costs, garbage, etc.

There are a ton of uses here other than indexwriter,
AnalyzingSuggesters building FSTs, etc etc.

I don't think we need to try to add even more complexity because of
users implementing these anti-patterns on their end. This problem of
using hundreds/thousands of threads is a uniquely "java-developer"
problem. I don't see these issues with applications written in other
programming languages. We really can't shield them from it. If we stop
reusing, they will just get different _symptoms_ (slow performance, GC
issues, etc), but the underlying problem is using all those
unnecessary threads. That's what should get fixed.

On Wed, Jun 9, 2021 at 9:03 AM Alan Woodward <romseygeek@gmail.com> wrote:
>
> Hey Robert,
>
> Analyzers themselves can be heavy and load large data files, etc, I agree, but I’m really talking about token stream construction. The way things are set up, we expect the heavy lifting to be done when the Analyzer is constructed, but these heavy resources should then be shared between token streams (I fixed a bug in the Ukrainian analyzer a while back that was getting this wrong, see LUCENE-9930). So you’d build your Analyzers once and use the same instances at query and at index time, and there’s no worry about reloading large dictionaries on every use.
>
> But re-using token streams is different. The reason we have TokenStreamComponents and ReuseStrategies (as I understand it) is not because they may have to load large resource files or dictionaries or whatever, but it’s because building a TokenStream is itself quite a heavy operation due to AttributeFactories and reflection. My argument is that this is only heavy relative to the cost of indexing a single field, and that this only really matters when you have documents with lots of small fields in them. For query building or highlighting or MoreLlikeThis, the cost of building a small number of token streams is tiny compared to all the other heavy lifting and IO going on. And so if we pushed this *TokenStream* reuse into IndexWriter we wouldn’t have to have a close() method on Analyzer (because the thread locals go away, and we expect file resources etc to be closed once the analyzer has finished building itself), and delegating or wrapping analyzers becomes much simpler.
>
> Does that make more sense?
>
> (I agree on the thread pool stuff, but we need to be careful about not blowing up users systems even if they are implementing anti-patterns!)
>
> > On 8 Jun 2021, at 16:12, Robert Muir <rcmuir@gmail.com> wrote:
> >
> > Alan: a couple thoughts:
> >
> > Analyzers are not just used for formulating queries, but also may be
> > used by highlighters and other things on document results at query
> > time.
> > Some analyzers may do too-expensive/garbage-creating stuff on
> > construction, that you wouldn't want to do at query-time.
> > Separately, I think Analyzer being closable makes sense.
> > Users still need to carefully consider the lifecycle of this thing for
> > performance, and may want to return their own resources for some
> > reason (close() is a non-final method today)
> > Analyzers might require large amounts of resources (such as parsing
> > files/lists, ml models, who knows what).
> > For the built-in minimal resources that we ship, we try to make
> > construction cheap and use static holder classes, and so on. I'm
> > concerned some of these are costly.
> > But I'm definitely worried about longer files and stuff that many
> > users might use.
> >
> > I feel like some of this "large threadpool" stuff is just a java
> > antipattern for search. I configure servers with fixed threadpools
> > matching the number of CPU cores, and tell my load balancer about that
> > number (e.g. haproxy maxconn), so that it can effectively queue and
> > not overload search servers.
> >
> > On Tue, Jun 8, 2021 at 10:23 AM Alan Woodward <romseygeek@gmail.com> wrote:
> >>
> >> Hi all,
> >>
> >> I’ve been on holiday and away from a keyboard for a week, so that means I of course spent my time thinking about lucene Analyzers and specifically their ReuseStrategies…
> >>
> >> Building a TokenStream can be quite a heavy operation, and so we try and reuse already-constructed token streams as much as possible. This is particularly important at index time, as having to create lots and lots of very short-lived token streams for documents with many short text fields could mean that we spend longer building these objects than we do pulling data from them. To help support this, lucene Analyzers have a ReuseStrategy, which defaults to storing a map of fields to token streams in a ThreadLocal object. Because ThreadLocals can behave badly when it comes to containers that have large thread pools, we use a special CloseableThreadLocal class that can null out its contents once the Analyzer is done with, and this leads to Analyzer itself being Closeable. This makes extending analyzers more complicated, as delegating wrappers need to ensure that they don’t end up sharing token streams with their delegates.
> >>
> >> It’s common to use the same analyzer for indexing and for parsing user queries. At query time, reusing token streams is a lot less important - the amount of time spent building the query is typically much lower than the amount of time spent rewriting and executing it. The fact that this re-use is only really useful for index time and that the lifecycle of the analyzer is therefore very closely tied to the lifecycle of its associated IndexWriter makes me think that we should think about moving the re-use strategies into IndexWriter itself. One option would be to have token streams be constructed once per DocumentsWriterPerThread, which would lose some re-use but would mean we could avoid ThreadLocals entirely.
> >>
> >> Any thoughts?
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: dev-help@lucene.apache.org
> >>
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: dev-help@lucene.apache.org
> >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org

Re: Analyzer lifecycles [ In reply to ]

rcmuir at gmail

Jun 9, 2021, 6:33 AM

Post #5 of 5 (563 views)

Permalink

Alan, I'd also like to comment on this:

The reason we have TokenStreamComponents and ReuseStrategies (as I
understand it) is not because they may have to load large resource
files or dictionaries or whatever, but it’s because building a
TokenStream is itself quite a heavy operation due to
AttributeFactories and reflection.

That's actually not my major concern. There are plenty of
TokenStream.<init>'s doing a fair amount of work, creating objects,
setting up buffers, anything to make the actual processing fast. This
makes sense today because they are reused. But if we *sometimes* reuse
tokenstreams (indexwriter) and *other times dont* (query time), it
just adds more pain to keeping the analyzers efficient. Now they have
to optimize for 2 cases.

I also don't want all this stuff added up to increased garbage for
users that actually are search-heavy. Some of these users don't care
about index speed at all and are more concerned with QPS, latencies,
etc. This is very different from e.g. logging use-cases where people
are just indexing all day and maybe rarely searching. Current analyzer
design is efficient for both use-cases.

On Wed, Jun 9, 2021 at 9:03 AM Alan Woodward <romseygeek@gmail.com> wrote:
>
> Hey Robert,
>
> Analyzers themselves can be heavy and load large data files, etc, I agree, but I’m really talking about token stream construction. The way things are set up, we expect the heavy lifting to be done when the Analyzer is constructed, but these heavy resources should then be shared between token streams (I fixed a bug in the Ukrainian analyzer a while back that was getting this wrong, see LUCENE-9930). So you’d build your Analyzers once and use the same instances at query and at index time, and there’s no worry about reloading large dictionaries on every use.
>
> But re-using token streams is different. The reason we have TokenStreamComponents and ReuseStrategies (as I understand it) is not because they may have to load large resource files or dictionaries or whatever, but it’s because building a TokenStream is itself quite a heavy operation due to AttributeFactories and reflection. My argument is that this is only heavy relative to the cost of indexing a single field, and that this only really matters when you have documents with lots of small fields in them. For query building or highlighting or MoreLlikeThis, the cost of building a small number of token streams is tiny compared to all the other heavy lifting and IO going on. And so if we pushed this *TokenStream* reuse into IndexWriter we wouldn’t have to have a close() method on Analyzer (because the thread locals go away, and we expect file resources etc to be closed once the analyzer has finished building itself), and delegating or wrapping analyzers becomes much simpler.
>
> Does that make more sense?
>
> (I agree on the thread pool stuff, but we need to be careful about not blowing up users systems even if they are implementing anti-patterns!)
>
> > On 8 Jun 2021, at 16:12, Robert Muir <rcmuir@gmail.com> wrote:
> >
> > Alan: a couple thoughts:
> >
> > Analyzers are not just used for formulating queries, but also may be
> > used by highlighters and other things on document results at query
> > time.
> > Some analyzers may do too-expensive/garbage-creating stuff on
> > construction, that you wouldn't want to do at query-time.
> > Separately, I think Analyzer being closable makes sense.
> > Users still need to carefully consider the lifecycle of this thing for
> > performance, and may want to return their own resources for some
> > reason (close() is a non-final method today)
> > Analyzers might require large amounts of resources (such as parsing
> > files/lists, ml models, who knows what).
> > For the built-in minimal resources that we ship, we try to make
> > construction cheap and use static holder classes, and so on. I'm
> > concerned some of these are costly.
> > But I'm definitely worried about longer files and stuff that many
> > users might use.
> >
> > I feel like some of this "large threadpool" stuff is just a java
> > antipattern for search. I configure servers with fixed threadpools
> > matching the number of CPU cores, and tell my load balancer about that
> > number (e.g. haproxy maxconn), so that it can effectively queue and
> > not overload search servers.
> >
> > On Tue, Jun 8, 2021 at 10:23 AM Alan Woodward <romseygeek@gmail.com> wrote:
> >>
> >> Hi all,
> >>
> >> I’ve been on holiday and away from a keyboard for a week, so that means I of course spent my time thinking about lucene Analyzers and specifically their ReuseStrategies…
> >>
> >> Building a TokenStream can be quite a heavy operation, and so we try and reuse already-constructed token streams as much as possible. This is particularly important at index time, as having to create lots and lots of very short-lived token streams for documents with many short text fields could mean that we spend longer building these objects than we do pulling data from them. To help support this, lucene Analyzers have a ReuseStrategy, which defaults to storing a map of fields to token streams in a ThreadLocal object. Because ThreadLocals can behave badly when it comes to containers that have large thread pools, we use a special CloseableThreadLocal class that can null out its contents once the Analyzer is done with, and this leads to Analyzer itself being Closeable. This makes extending analyzers more complicated, as delegating wrappers need to ensure that they don’t end up sharing token streams with their delegates.
> >>
> >> It’s common to use the same analyzer for indexing and for parsing user queries. At query time, reusing token streams is a lot less important - the amount of time spent building the query is typically much lower than the amount of time spent rewriting and executing it. The fact that this re-use is only really useful for index time and that the lifecycle of the analyzer is therefore very closely tied to the lifecycle of its associated IndexWriter makes me think that we should think about moving the re-use strategies into IndexWriter itself. One option would be to have token streams be constructed once per DocumentsWriterPerThread, which would lose some re-use but would mean we could avoid ThreadLocals entirely.
> >>
> >> Any thoughts?
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: dev-help@lucene.apache.org
> >>
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> > For additional commands, e-mail: dev-help@lucene.apache.org
> >
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org