Mailing List Archive

Analyzer.createComponents(String fieldname) only being called once, when indexing multiple documents
Hello

I hope somebody can offer suggestions/advice regarding this.

I'm going through some old Lucene code and have a custom Analyzer which
overrides the createComponents method. See the snippet below:

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class BulletinPayloadsAnalyzer extends Analyzer {
    private boolean bulletin;
    private float boost;

    BulletinPayloadsAnalyzer(float boost) {
        this.boost = boost;
    }

    public void setBulletin(boolean bulletin) {
        this.bulletin = bulletin;
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        Tokenizer src = new StandardTokenizer();
        // boost and bulletin are read here, at the moment the components are built
        BulletinPayloadsFilter result = new BulletinPayloadsFilter(src, boost, bulletin);
        return new TokenStreamComponents(src, result);
    }
}

I then use the boost and bulletin params inside my BulletinPayloadsFilter
for some specialized logic, e.g. if bulletin is true and a keyword is
tokenized, then boost the document by setting a PayloadAttribute with the
boost amount.

However, I've noticed that when indexing several documents at once, the
createComponents method is only called the first time. For all subsequent
documents, execution goes straight into the incrementToken method of my
custom BulletinPayloadsFilter.

Is there a way of ensuring the createComponents method is called when
indexing each document? I need to make sure the correct parameters are
passed to the filter, and these params could change for each document.

Thank you
Usman
Re: Analyzer.createComponents(String fieldname) only being called once, when indexing multiple documents
Hi Usman,

Long ago Lucene switched to reusing these analysis components (per
Analyzer, per thread), so that explains why createComponents is called once.
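
To make the effect of reuse concrete, here is a toy model in plain Java. It is not Lucene's implementation: ReuseModel, ModelAnalyzer and ModelFilter are hypothetical stand-ins, and the only point is that a value captured when the components are built stays frozen once those components are cached per thread.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of per-thread component reuse; none of these classes are Lucene's. */
class ReuseModel {

    /** Stand-in for a filter: captures its parameter once, at construction time. */
    static final class ModelFilter {
        final boolean bulletin;

        ModelFilter(boolean bulletin) {
            this.bulletin = bulletin;
        }
    }

    /** Stand-in for an analyzer that caches one filter per thread. */
    static final class ModelAnalyzer {
        private final ThreadLocal<ModelFilter> cached = new ThreadLocal<>();
        private boolean bulletin;
        int createCalls; // counts how often the components are built

        void setBulletin(boolean bulletin) {
            this.bulletin = bulletin;
        }

        private ModelFilter createComponents() {
            createCalls++;
            return new ModelFilter(bulletin);
        }

        /** Returns the cached filter, building it only on the first call per thread. */
        ModelFilter filterForNextDocument() {
            ModelFilter filter = cached.get();
            if (filter == null) {
                filter = createComponents();
                cached.set(filter);
            }
            return filter;
        }
    }

    /** "Indexes" two documents with different flags and records what the filter saw. */
    static List<Boolean> run() {
        ModelAnalyzer analyzer = new ModelAnalyzer();
        List<Boolean> seen = new ArrayList<>();

        analyzer.setBulletin(false);
        seen.add(analyzer.filterForNextDocument().bulletin);

        analyzer.setBulletin(true); // never reaches the cached filter
        seen.add(analyzer.filterForNextDocument().bulletin);

        return seen;
    }

    public static void main(String[] args) {
        System.out.println(run()); // prints [false, false]
    }
}
```

The second setBulletin(true) call has no effect on the already-built filter, which matches the behaviour described in the question.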

However, the reuse policy is controllable (expert usage), so in theory you
could implement an Analyzer.ReuseStrategy that never reuses and pass that
to super() when you create your custom Analyzer.

That said, this is generally not a great idea: it leads to poor indexing
throughput.
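
A minimal sketch of the never-reuse approach, assuming a recent Lucene version with lucene-core on the classpath. NoReuseStrategy and PerDocumentAnalyzer are hypothetical names, and createCalls exists only to make the effect visible. A real analyzer would wrap the tokenizer in BulletinPayloadsFilter at the marked spot.

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.standard.StandardTokenizer;

/** Reuse strategy that never hands back cached components (hypothetical name). */
final class NoReuseStrategy extends Analyzer.ReuseStrategy {
    @Override
    public Analyzer.TokenStreamComponents getReusableComponents(Analyzer analyzer, String fieldName) {
        return null; // null makes Analyzer call createComponents again
    }

    @Override
    public void setReusableComponents(Analyzer analyzer, String fieldName,
                                      Analyzer.TokenStreamComponents components) {
        // intentionally empty: nothing is cached
    }
}

/** Hypothetical analyzer that rebuilds its components on every tokenStream call. */
final class PerDocumentAnalyzer extends Analyzer {
    int createCalls; // for illustration only

    PerDocumentAnalyzer() {
        super(new NoReuseStrategy());
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        createCalls++;
        Tokenizer src = new StandardTokenizer();
        // A real analyzer would wrap src in its payload filter here,
        // reading the per-document parameters at this point.
        return new TokenStreamComponents(src);
    }
}
```

With this strategy createComponents runs for every field of every document, which is where the throughput cost comes from. Changing a flag on a shared analyzer between documents is also only safe when a single thread is indexing.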

Another possibility is to create a Field with a pre-analyzed TokenStream,
basically bypassing Analyzer entirely and making your own TokenStream chain
that will alter these payload values.
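
A rough sketch of the pre-analyzed route, assuming lucene-core on the classpath. FixedPayloadFilter is a hypothetical, simplified stand-in for BulletinPayloadsFilter (it tags every token rather than only keywords), the field name "body" is made up, and the 4-byte big-endian float encoding of the payload is an assumption rather than anything the original code is known to use.

```java
import java.io.IOException;
import java.io.StringReader;
import java.nio.ByteBuffer;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.TextField;
import org.apache.lucene.util.BytesRef;

/** Hypothetical stand-in for BulletinPayloadsFilter: attaches one fixed payload to every token. */
final class FixedPayloadFilter extends TokenFilter {
    private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);
    private final BytesRef payload;

    FixedPayloadFilter(TokenStream input, BytesRef payload) {
        super(input);
        this.payload = payload;
    }

    @Override
    public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) {
            return false;
        }
        payloadAtt.setPayload(payload);
        return true;
    }
}

final class PreAnalyzedExample {
    /** Builds a document whose field carries its own token stream, bypassing the Analyzer. */
    static Document buildDocument(String text, boolean bulletin, float boost) {
        StandardTokenizer tokenizer = new StandardTokenizer();
        tokenizer.setReader(new StringReader(text));

        TokenStream stream = tokenizer;
        if (bulletin) {
            // Assumed encoding: the boost as a 4-byte big-endian float.
            byte[] bytes = ByteBuffer.allocate(Float.BYTES).putFloat(boost).array();
            stream = new FixedPayloadFilter(tokenizer, new BytesRef(bytes));
        }

        Document doc = new Document();
        doc.add(new TextField("body", stream)); // indexed, not stored
        return doc;
    }
}
```

Because a fresh chain is built for each document, the per-document parameters are simply arguments to buildDocument and no shared analyzer state is involved.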

Usually payloads are set/derived from the incoming tokens and would not be
dynamically set externally. Or, such a parameter that changes per document
but not per token could be set in a doc values field instead.
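
The doc values idea could look roughly like this on the indexing side, assuming lucene-core on the classpath. The field names "body" and "bulletin_boost" and the neutral value 1.0 for non-bulletin documents are assumptions made for illustration.

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FloatDocValuesField;
import org.apache.lucene.document.TextField;

final class DocValuesBoostExample {
    /** Stores the per-document boost next to the text instead of inside token payloads. */
    static Document buildDocument(String text, boolean bulletin, float boost) {
        Document doc = new Document();
        doc.add(new TextField("body", text, Field.Store.NO));
        // One value per document; 1.0 is used here as an assumed "no boost" default.
        doc.add(new FloatDocValuesField("bulletin_boost", bulletin ? boost : 1.0f));
        return doc;
    }
}
```

The analyzer then needs no per-document state at all. At search time the value can be read back per document, for example through DoubleValuesSource.fromFloatField("bulletin_boost"), and folded into the score.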

Mike McCandless

http://blog.mikemccandless.com


On Thu, Jun 8, 2023 at 7:08 AM Usman Shaikh <shaikhu@gmail.com> wrote:
