Hi All!
I think I've found a bug with sentence boundary detection, explained in
detail here: https://github.com/apache/lucene/issues/11735
It affects the KeywordRepeatFilter + OpenNLPLemmatizer configuration, which
is apparently considered common enough to be mentioned directly in the Solr
documentation/examples:
https://solr.apache.org/guide/7_3/language-analysis.html#opennlp-lemmatizer-filter
The bug should be fairly easy to verify with this test:
https://github.com/kotman12/lucene/blob/8ecd42ec88685f47d42a88dd2536e879028af023/lucene/analysis/opennlp/src/test/org/apache/lucene/analysis/opennlp/TestOpenNLPLemmatizerFilterFactory.java#L298
I'd greatly appreciate it if someone could take a look. I'm also
proposing a fix here: https://github.com/apache/lucene/pull/11734
Naturally, I am open to other thoughts on how to approach this.
Thanks,
Luke