Mailing List Archive

Re: Hunspell performance
On Fri, Feb 12, 2021 at 7:05 AM Peter Gromov
<peter.gromov@jetbrains.com.invalid> wrote:

>
> Robert, for n=20 the speedup is quite small, 2-8% for me depending on the
> language. Unfortunately Hunspell dictionaries don't have stop word
> information, it'd be quite useful.
>
>
OK, maybe with a cache that small the stopwords won't stay cached, I
don't know. I was just mentioning it as an aside. We do have stopword
lists for a lot of languages as resource files in Lucene:

https://github.com/apache/lucene-solr/tree/master/lucene/analysis/common/src/resources/org/apache/lucene/analysis

Some users will remove them before the tokens reach Hunspell, some users
won't.

But we also have a way in the analysis chain to override stemming for
particular words. It stems them the way you want and then sets a marker so
that Hunspell isn't even called on them:

https://lucene.apache.org/core/8_8_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/StemmerOverrideFilter.html

So if the user really wants to keep the stopwords, they could put this
filter in front of the Hunspell stemmer to keep them from slowing things down.
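
To make the idea concrete, here is a rough plain-Java sketch of the pattern, not Lucene code: an override map is consulted first, and the expensive stemmer only runs on a miss. The class OverrideSketch, the expensiveStem method, and the toy suffix rule are all made up for illustration. In a real chain the corresponding pieces would be StemmerOverrideFilter, which marks the token as a keyword, followed by the Hunspell stem filter, which as far as I know skips keyword tokens.

```java
import java.util.Map;

public class OverrideSketch {
    // Hypothetical override table: word -> the stem we want for it.
    private final Map<String, String> overrides;
    // Counts how often the (pretend) expensive stemmer actually runs.
    private int stemmerCalls = 0;

    public OverrideSketch(Map<String, String> overrides) {
        this.overrides = overrides;
    }

    // Stand-in for the costly dictionary-based stemmer (assumption:
    // a toy rule that strips a trailing "s" from longer words).
    private String expensiveStem(String word) {
        stemmerCalls++;
        if (word.length() > 3 && word.endsWith("s")) {
            return word.substring(0, word.length() - 1);
        }
        return word;
    }

    // Overridden words are answered from the map and never reach
    // the expensive stemmer; everything else falls through to it.
    public String stem(String word) {
        String fixed = overrides.get(word);
        if (fixed != null) {
            return fixed;
        }
        return expensiveStem(word);
    }

    public int stemmerCalls() {
        return stemmerCalls;
    }

    public static void main(String[] args) {
        // Stopwords mapped to themselves so they pass through unchanged.
        OverrideSketch sketch =
            new OverrideSketch(Map.of("the", "the", "this", "this", "is", "is"));
        for (String w : "this is the list of words".split(" ")) {
            System.out.println(w + " -> " + sketch.stem(w));
        }
        System.out.println("expensive stemmer calls: " + sketch.stemmerCalls());
    }
}
```

Mapping each stopword to itself keeps it in the token stream unchanged while still bypassing the stemmer, which is the case discussed above for users who want to keep their stopwords.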
