Mailing List Archive

Sentence boundary bug
Hi All!

I think I've found a bug in sentence boundary detection, explained in
detail here: https://github.com/apache/lucene/issues/11735

It affects the KeywordRepeatFilter + OpenNLPLemmatizer configuration, which
is apparently thought to be common enough to be mentioned directly in the
Solr documentation/examples:
https://solr.apache.org/guide/7_3/language-analysis.html#opennlp-lemmatizer-filter
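
For anyone who wants the setup at a glance, the affected chain looks roughly
like the fragment below. This is only a sketch modeled on the example in the
Solr guide linked above; the model and dictionary file names are placeholders
and will likely differ in your install.

```xml
<analyzer>
  <!-- Sentence detection and tokenization (model file names are placeholders) -->
  <tokenizer class="solr.OpenNLPTokenizerFactory"
             sentenceModel="en-sent.bin"
             tokenizerModel="en-tokenizer.bin"/>
  <!-- The lemmatizer needs part-of-speech tags -->
  <filter class="solr.OpenNLPPOSFilterFactory" posTaggerModel="en-pos-maxent.bin"/>
  <!-- Emits every token twice: once as a keyword, once for lemmatization -->
  <filter class="solr.KeywordRepeatFilterFactory"/>
  <filter class="solr.OpenNLPLemmatizerFilterFactory"
          dictionary="lemmas.txt"
          lemmatizerModel="en-lemmatizer.bin"/>
  <!-- Drops the repeat when the lemma equals the original token -->
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
```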

The bug should be fairly easy to verify with this test:
https://github.com/kotman12/lucene/blob/8ecd42ec88685f47d42a88dd2536e879028af023/lucene/analysis/opennlp/src/test/org/apache/lucene/analysis/opennlp/TestOpenNLPLemmatizerFilterFactory.java#L298
I'd greatly appreciate it if someone could give this a look. I'm also
proposing a fix here https://github.com/apache/lucene/pull/11734 but
naturally I am open to other thoughts on how to approach this.

Thanks,
Luke