Mailing List Archive

Help to find the RC of incompatible analyzers
Hi, Lucene dev community:

Our current code is based on Lucene 7.
In one analyzer test case, given the string "Google's biologist’s", the
tokenization result is ["google", "biologist"].

But after migrating the codebase to Lucene 9,
the result becomes ["googles", "biologist’s"].

It looks like some behavior has changed between the major versions.

But I cannot find the root cause (RC) of this change.
Could someone please provide some clues? Maybe some grammar has changed?

The analyzer uses the following three Lucene token filters:

org.apache.lucene.analysis.core.FlattenGraphFilter;
org.apache.lucene.analysis.shingle.ShingleFilter;
org.apache.lucene.analysis.synonym.SynonymGraphFilter;


Thanks
Re: Help to find the RC of incompatible analyzers
It sounds like an EnglishPossessiveFilter is missing, which I think is
unrelated to the filters you listed.
Are there other Lucene filters you're using?

Also what exact versions are you upgrading from and to?
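
For readers unfamiliar with possessive stripping, here is a minimal
plain-Java sketch of the effect such a filter has on the example input.
It is only a stand-in for illustration, not Lucene's implementation: the
class and method names are hypothetical, whitespace splitting replaces a
real tokenizer, and the set of apostrophe characters handled is an
assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for illustration only; not Lucene code.
final class PossessiveSketch {

    // Drop a trailing 's when it follows a straight ('), curly (U+2019),
    // or fullwidth (U+FF07) apostrophe; otherwise return the token as-is.
    static String stripPossessive(String token) {
        int n = token.length();
        if (n >= 2) {
            char apos = token.charAt(n - 2);
            char last = token.charAt(n - 1);
            boolean isApostrophe = apos == '\'' || apos == '\u2019' || apos == '\uFF07';
            if (isApostrophe && (last == 's' || last == 'S')) {
                return token.substring(0, n - 2);
            }
        }
        return token;
    }

    // Assumption: split on whitespace, strip possessives, then lowercase.
    static List<String> analyze(String text) {
        List<String> out = new ArrayList<>();
        for (String raw : text.trim().split("\\s+")) {
            if (raw.isEmpty()) {
                continue; // empty input yields no tokens
            }
            out.add(stripPossessive(raw).toLowerCase(Locale.ROOT));
        }
        return out;
    }

    public static void main(String[] args) {
        // Same input as in the report: one straight and one curly apostrophe.
        System.out.println(analyze("Google's biologist\u2019s")); // prints [google, biologist]
    }
}
```

With a step like this in the chain, both apostrophe forms give the
Lucene 7 style result; without it, the possessive suffix stays in the
token.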

On Fri, Apr 28, 2023 at 10:20 AM MyCoy Z <mycoy.zhang@gmail.com> wrote:

> [...]
Re: Help to find the RC of incompatible analyzers
You provided a list of TokenFilters that you use in your Analyzer,
but you didn't mention anything about what Tokenizer you are using.

You also mentioned seeing a difference in the "tokenization result", and
the example output you gave does in fact seem to be the output of the
tokenizer -- not the output of the TokenFilters you mentioned -- since
ShingleFilter would be producing more output tokens than you listed.

All of which suggests that the discrepancy you are seeing is in your
tokenizer.
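
To make the ShingleFilter point concrete, here is a small plain-Java
sketch of what two-word shingling does to a two-token stream. It is a
stand-in for illustration, not Lucene's ShingleFilter: the class and
method names are hypothetical, and it assumes a shingle size of 2, a
single-space separator, and unigrams kept in the output.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for illustration only; not Lucene code.
final class ShingleSketch {

    // Emit each token, followed by the two-word shingle that starts at it
    // (assumptions: shingle size 2, single-space separator, unigrams kept).
    static List<String> shingles(List<String> tokens) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            out.add(tokens.get(i));
            if (i + 1 < tokens.size()) {
                out.add(tokens.get(i) + " " + tokens.get(i + 1));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [google, google biologist, biologist]
        System.out.println(shingles(List.of("google", "biologist")));
    }
}
```

Two input tokens become three output tokens under these assumptions, so
a two-element result is unlikely to be the final output of a chain that
ends in shingling.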

Generally speaking: the best way to ensure folks on the mailing list can
make sense of your situation and offer assistance is to provide
reproducible snippets of code with hardcoded input (à la unit tests) that
demonstrate what you're seeing.

-Hoss
http://www.lucidworks.com/
Re: Help to find the RC of incompatible analyzers
Thanks to Chris and Patrick for the help.

Chris is right. I don't have much knowledge of analyzers, so I
missed many details.
Following Chris's advice, I dug deeper into the code related to
tokenizers and fixed the problem.

Thanks for the help.

On Fri, Apr 28, 2023 at 4:44 PM Chris Hostetter <hossman_lucene@fucit.org> wrote:

> [...]