Hi all,
I’ve been on holiday and away from a keyboard for a week, which of course means I spent my time thinking about Lucene Analyzers, and specifically their ReuseStrategies…
Building a TokenStream can be quite a heavy operation, so we try to reuse already-constructed token streams as much as possible. This is particularly important at index time: creating lots and lots of very short-lived token streams for documents with many short text fields could mean we spend longer building these objects than we do pulling data from them. To support this, Lucene Analyzers have a ReuseStrategy, which defaults to storing a map of fields to token streams in a ThreadLocal object. Because ThreadLocals can behave badly in containers with large thread pools, we use a special CloseableThreadLocal class that can null out its contents once we are done with the Analyzer, which is why Analyzer itself is Closeable. This makes extending analyzers more complicated, as delegating wrappers need to ensure that they don’t end up sharing token streams with their delegates.
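To make the pattern concrete, here is a minimal standalone sketch of the per-field, per-thread reuse idea described above, including the close() that nulls out the per-thread cache. The class and method names (CachingAnalyzer, Components) are illustrative only, not Lucene's actual API; Lucene uses CloseableThreadLocal and TokenStreamComponents for the same roles.

```java
import java.io.Closeable;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the per-field reuse strategy: each thread keeps its
// own field -> components map, and close() drops the cache so long-lived pool
// threads don't pin the memory.
class CachingAnalyzer implements Closeable {
    static final class Components {
        final String field;
        Components(String field) { this.field = field; }
    }

    // One cache per thread; Lucene uses CloseableThreadLocal in this role.
    private final ThreadLocal<Map<String, Components>> cache =
            ThreadLocal.withInitial(HashMap::new);

    Components tokenStream(String field) {
        // Reuse the components built earlier for this field on this thread.
        return cache.get().computeIfAbsent(field, Components::new);
    }

    @Override
    public void close() {
        // Null out this thread's map so pooled threads release their entries.
        cache.remove();
    }
}
```

A delegating wrapper over a class like this has to be careful not to hand out entries from its delegate's cache, which is the complication mentioned above.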
It’s common to use the same analyzer for indexing and for parsing user queries. At query time, reusing token streams matters much less: the time spent building a query is typically far lower than the time spent rewriting and executing it. Given that this reuse is only really useful at index time, and that the lifecycle of the analyzer is therefore very closely tied to the lifecycle of its associated IndexWriter, I think we should consider moving the reuse strategies into IndexWriter itself. One option would be to construct token streams once per DocumentsWriterPerThread, which would lose some reuse but would let us avoid ThreadLocals entirely.
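For comparison, a sketch of the proposed alternative: if the per-field cache is owned by each indexing thread's state object (DocumentsWriterPerThread in Lucene), a plain Map suffices and no ThreadLocal is needed, since the cache's lifetime is bounded by the writer's. The names here (IndexingThreadState, Components) are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical per-writer-thread state: the cache is confined to one indexing
// thread and dies with this object, so there is nothing to null out later.
class IndexingThreadState {
    static final class Components {
        final String field;
        Components(String field) { this.field = field; }
    }

    // Owned by exactly one indexing thread, so a plain map is safe.
    private final Map<String, Components> perField = new HashMap<>();

    Components tokenStream(String field) {
        return perField.computeIfAbsent(field, Components::new);
    }
}
```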
Any thoughts?
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org