Mailing List Archive

Hunspell performance
Hi there,

I'm mostly done with supporting the major Hunspell features necessary for most
European languages (https://issues.apache.org/jira/browse/LUCENE-9687), but
of course I anticipate more minor fixes to come. Thanks to Dawid Weiss for the
thorough reviews and for promptly accepting my PRs so far!

Now I'd like to make this Hunspell implementation at least as fast as the
native Hunspell called via JNI, ideally faster. Currently it seems 1.5-3 times
slower for me, depending on the language (I've checked en/de/fr so far).
I've profiled it and done some minor optimizations, and it now appears that
most of the time is spent in FST traversals. I've prototyped reducing the
number of these traversals, and the execution time goes down noticeably (e.g.
by 30%), but it's still not enough, and the code becomes complicated.

So I'm considering other data structures instead of FSTs (Hunspell/C++
itself doesn't bother with tries: it uses hash tables and linear searches
instead). The problem is, FST is very well space-optimized, and other data
structures consume more memory.

So my question is: what's the relative importance of speed and memory in
Lucene's stemmer? E.g. now the FST for German takes 2.2MB. Would it be OK
to use a CharArrayMap taking 20-25MB, but be much faster on lookup (45%
improvement in stemming)? Or, with a BytesRefHash plus an array I can make
it ~9MB, with almost the same speedup (but more complex code).

How much memory usage is acceptable at all?

Maybe there are other suitable data structures in Lucene core that I'm not
aware of? I basically need a Map<String, Object> that would be better
queried with char[]+offset+length keys (as CharArrayMap allows).
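The kind of allocation-free lookup meant here can be sketched as follows. This is a simplified, hypothetical map (CharSliceMap is not Lucene's CharArrayMap, and it omits resizing and deletion); the point is that the key is read directly from a char[] slice, so no String is allocated per token:

```java
// Sketch of a CharArrayMap-style hash map: keys are compared directly
// against a char[] slice, avoiding a String allocation per lookup.
// Hypothetical illustration only: fixed capacity, linear probing, no resize.
public class CharSliceMap<V> {
  private final char[][] keys;
  private final Object[] values;
  private final int mask;

  public CharSliceMap(int capacity) {
    int size = Integer.highestOneBit(Math.max(capacity, 8) * 2);
    keys = new char[size][];
    values = new Object[size];
    mask = size - 1;
  }

  private static int hash(char[] text, int off, int len) {
    int h = 0;
    for (int i = off; i < off + len; i++) h = 31 * h + text[i];
    return h;
  }

  private static boolean sameKey(char[] key, char[] text, int off, int len) {
    if (key.length != len) return false;
    for (int i = 0; i < len; i++) if (key[i] != text[off + i]) return false;
    return true;
  }

  public void put(String key, V value) {
    char[] k = key.toCharArray();
    int slot = hash(k, 0, k.length) & mask;
    while (keys[slot] != null && !sameKey(keys[slot], k, 0, k.length)) {
      slot = (slot + 1) & mask;  // linear probing
    }
    keys[slot] = k;
    values[slot] = value;
  }

  /** Looks up the slice text[off, off+len) without allocating. */
  @SuppressWarnings("unchecked")
  public V get(char[] text, int off, int len) {
    int slot = hash(text, off, len) & mask;
    while (keys[slot] != null) {
      if (sameKey(keys[slot], text, off, len)) return (V) values[slot];
      slot = (slot + 1) & mask;
    }
    return null;
  }
}
```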

Peter
Re: Hunspell performance
The RAM usage used to be as bad as you describe; it blows up way worse
for languages other than German. There were many issues :)

For Lucene, one common issue was that users wanted to have a lot of
these things in RAM: e.g. supporting many different languages on a
single server (multilingual data) and so forth.
Can we speed up your use-case by using the FST in a smarter way? Why
are there so many traversals... is it the way it is doing inexact
matching? decomposition?

That was the trick done with stemming: the stemming was
accelerated with some side data structures. For example, the "patternIndex"
thing, which is a scary precomputed list of tableized DFAs... it's
wasting a "little" space on these tables to speed up the hotspot for
stemming. In that patternIndex example, some assumptions/limits had
to be set that hopefully no dictionary would ever hit: that's all
the "please report this to dev@lucene.apache.org" checks in the code.
Some tests were run against all the crazy OO dictionaries out there to
examine the memory usage when looking at changes like this. Some of
these are really, really crazy and do surprising things.

On Wed, Feb 10, 2021 at 6:16 AM Peter Gromov
<peter.gromov@jetbrains.com.invalid> wrote:

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
Re: Hunspell performance
Just throwing out another random idea: if you are doing a lot of FST
traversals (e.g. for inexact matching or decomposition), you may end
up "hammering" the root arcs of the FST heavily, depending on how the
algorithm works. Because root arcs are "busy", they end up being
O(logN) lookups in the FST and get slow. The Japanese and Korean analyzers
do "decompounding" too, and have hacks that waste some RAM to
ensure the heavy root arc traversals are O(1):
https://github.com/apache/lucene-solr/blob/7f8b7ffbcad2265b047a5e2195f76cc924028063/lucene/analysis/kuromoji/src/java/org/apache/lucene/analysis/ja/dict/TokenInfoFST.java
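The trick in question is roughly: precompute the targets of the root node's arcs into a table indexed by label, so hot first-character lookups become O(1) array reads instead of searches. A self-contained sketch of the idea (the RootArcCache class below is hypothetical, not the actual TokenInfoFST code):

```java
import java.util.Arrays;

// Sketch of root-arc caching: instead of searching the (sorted) root
// arcs on every lookup, precompute a direct-address table for labels
// below some cutoff. Illustrative only, not Lucene's implementation.
public class RootArcCache {
  private final int[] labels;   // sorted arc labels at the root
  private final int[] targets;  // target node per arc
  private final int[] cache;    // cache[label] = target node, or -1
  private final int cutoff;

  public RootArcCache(int[] sortedLabels, int[] targets, int cutoff) {
    this.labels = sortedLabels;
    this.targets = targets;
    this.cutoff = cutoff;
    this.cache = new int[cutoff];
    Arrays.fill(cache, -1);
    for (int i = 0; i < sortedLabels.length; i++) {
      if (sortedLabels[i] < cutoff) cache[sortedLabels[i]] = targets[i];
    }
  }

  /** Returns the target node of the root arc for this label, or -1. */
  public int findTarget(int label) {
    if (label < cutoff) return cache[label];       // O(1) hot path
    int idx = Arrays.binarySearch(labels, label);  // O(log n) fallback
    return idx >= 0 ? targets[idx] : -1;
  }
}
```

The cost is cutoff ints of RAM per cached node; the win is that the busiest lookups never touch the search path.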

Bruno did some FST improvements across the board here, but last time
we checked, these hacks were still needed for segmentation use cases
like this: see his benchmark here: https://s.apache.org/ffelc

For example, maybe it makes sense to cache a few hundred nodes here in
a similar way, depending on the dictionary's alphabet size, to accelerate
segmentation; I don't know if it will help. Maybe also the current FST
"INPUT_TYPE" is inappropriate and it would work better as a BYTE1 FST
rather than BYTE2 or BYTE4 or whatever it is using now. The current
stemming doesn't put much pressure on this, so it isn't optimized.
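For context on the INPUT_TYPE remark: a BYTE4 FST keys arcs by whole Unicode codepoints (a huge label space, harder to direct-address), while a BYTE1 FST keys them by UTF-8 bytes, so every label fits in 0..255 at the cost of more arcs per word. The difference is easy to see without any Lucene dependency (InputTypeDemo is a hypothetical illustration):

```java
import java.nio.charset.StandardCharsets;

// How many arcs a word would need under the two key encodings.
public class InputTypeDemo {
  // BYTE4-style key: one arc per Unicode codepoint.
  static int codepointArcs(String word) {
    return word.codePointCount(0, word.length());
  }

  // BYTE1-style key: one arc per UTF-8 byte; labels always fit in 0..255.
  static int byteArcs(String word) {
    return word.getBytes(StandardCharsets.UTF_8).length;
  }
}
```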

On Wed, Feb 10, 2021 at 7:53 AM Robert Muir <rcmuir@gmail.com> wrote:

Re: Hunspell performance
Hi Robert,

Yes, having multiple dictionaries in the same process would increase the
memory significantly. Do you have any idea about how many of them people
are loading, and how much memory they give to Lucene?

Yes, I've mentioned I've prototyped "using FST in a smarter way" :) Namely,
it's possible to cache the arcs/outputs used for searching for
"electrification" and reuse most of them after an affix is stripped and
we're now faced with "electrify". This allocates a bit more for each token,
but gives a noticeable speedup. I'm not entirely happy with the resulting
code complexity and performance, but I can create a PR.
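The caching described above amounts to remembering the per-character traversal states of the previous word, so that a lookup sharing a prefix (as "electrify" does with "electrification") resumes from the deepest shared state instead of the root. A hypothetical sketch, with a plain trie standing in for the FST:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of reusing traversal state across lookups that share a prefix,
// as when stemming "electrification" and then "electrify". A trie stands
// in for the FST; the idea, not the actual API, is what's illustrated.
public class CachedTrieWalker {
  static class Node {
    Map<Character, Node> children = new HashMap<>();
    boolean terminal;
  }

  private final Node root = new Node();
  private char[] lastWord = new char[0];
  private Node[] lastStates = new Node[0];  // lastStates[i]: node after lastWord[0..i]

  public void add(String word) {
    Node n = root;
    for (char c : word.toCharArray()) n = n.children.computeIfAbsent(c, k -> new Node());
    n.terminal = true;
  }

  /** Looks up word, resuming from cached states where the prefix matches. */
  public boolean contains(String word) {
    char[] w = word.toCharArray();
    int shared = 0;
    while (shared < w.length && shared < lastWord.length && w[shared] == lastWord[shared]) shared++;
    Node n = shared > 0 ? lastStates[shared - 1] : root;
    Node[] states = new Node[w.length];
    System.arraycopy(lastStates, 0, states, 0, shared);
    for (int i = shared; i < w.length; i++) {
      if (n == null) break;          // dead end: word absent
      n = n.children.get(w[i]);
      states[i] = n;
    }
    lastWord = w;
    lastStates = states;
    return n != null && n.terminal;
  }
}
```

The allocation per token Peter mentions corresponds to the states array here; the saving is that the shared-prefix walk is skipped entirely.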

I'm talking only about plain old affix removal. I have no inexact matching.
Decomposition basically works like "try to break the word in various places
and stem them separately, looking at some additional flags". For the first
word part, some arc/outputs could be reused from initial analysis, but for
the next ones most likely not. And when I tried the aforementioned reusing,
the code became so unpleasant that I started looking for alternatives :)

One thing I don't like about the arc caching approach is that it looks like
a dead end: the FST invocation count seems to be already close to minimal,
and yet its traversal is still very visible in the CPU snapshots. And I see
no low-hanging fruit in FST internals. They just seem to need
reading/analyzing too many bytes, doing much more work than a typical
hashmap access :)

Thanks for the idea about root arcs. I've done some quick sampling and
tracing (for German). 80% of root arc processing time is spent in direct
addressing, and the remainder is linear scan (so root arcs don't seem to
present major issues). For non-root arcs, ~50% is directly addressed, ~45%
linearly-scanned, and the remainder binary-searched. Overall there's about
60% of direct addressing, both in time and invocation counts, which doesn't
seem too bad (or am I mistaken?). Currently BYTE4 inputs are used. Reducing
that might increase the number of directly addressed arcs, but I'm not sure
that'd speed up much given that time and invocation counts seem to
correlate.

Peter

On Wed, Feb 10, 2021 at 2:52 PM Robert Muir <rcmuir@gmail.com> wrote:

Re: Hunspell performance
> They just seem to need reading/analyzing too many bytes, doing much more
> work than a typical hashmap access :)

This is a very tough score to beat... Pretty much any trie structure will
have to descend somehow. FSTs are additionally densely packed in Lucene, and
outgoing arc lookup is what's causing the slowdown. An alternative
memory-conservative FST representation is a hash table of [fst-node_id,
out-arc-label] -> fst-node_id. A structure like this can be made
reasonably compact, and path traversals (lookups) are constant-time... at
the price of not being able to enumerate all of a node's outgoing arcs. I've
used this approach in the past and it gave a reasonable speedup, although
it's a really specialized data structure.
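That transition-table representation can be sketched as follows. Everything here is hypothetical illustration: [node_id, label] is packed into a long key, java.util.HashMap stands in for the compact primitive-keyed table a real implementation would use, and the construction builds a plain trie rather than a minimized automaton:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Sketch of an automaton stored as a transition hash table:
// (node_id, label) -> next node_id. Lookups cost O(1) per character,
// but a node's outgoing arcs cannot be enumerated cheaply.
public class HashedAutomaton {
  private final Map<Long, Integer> transitions = new HashMap<>();
  private final Set<Integer> finalStates = new HashSet<>();
  private int nextState = 1;  // state 0 is the start state

  private static long key(int state, char label) {
    return ((long) state << 16) | label;  // label is a UTF-16 unit, fits 16 bits
  }

  public void add(String word) {
    int state = 0;
    for (char c : word.toCharArray()) {
      Integer next = transitions.get(key(state, c));
      if (next == null) {
        next = nextState++;
        transitions.put(key(state, c), next);
      }
      state = next;
    }
    finalStates.add(state);
  }

  public boolean contains(String word) {
    int state = 0;
    for (char c : word.toCharArray()) {
      Integer next = transitions.get(key(state, c));  // O(1) per char
      if (next == null) return false;
      state = next;
    }
    return finalStates.contains(state);
  }
}
```

Note how `contains` never needs to list a state's arcs, which is exactly why an FSTEnum-style enumeration over this layout is not possible.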

D.

On Wed, Feb 10, 2021 at 3:43 PM Peter Gromov
<peter.gromov@jetbrains.com.invalid> wrote:

Re: Hunspell performance
Peter, looks like you are way ahead of me :) Thanks for all the work
you have been doing here, and thanks to Dawid for helping!

You probably know a lot of this code better than me at this point, but
I remember a couple of these pain points, inline below:

On Wed, Feb 10, 2021 at 9:44 AM Peter Gromov
<peter.gromov@jetbrains.com.invalid> wrote:
>
> Hi Robert,
>
> Yes, having multiple dictionaries in the same process would increase the memory significantly. Do you have any idea about how many of them people are loading, and how much memory they give to Lucene?

Yeah, in many cases the user is running a server such as Solr or Elasticsearch.
Let's use Solr as an example; others are here to correct me if I am wrong.

An example to understand the challenges: the user uses one of Solr's 3
mechanisms to detect language and route documents to different pipelines:
https://lucene.apache.org/solr/guide/8_8/detecting-languages-during-indexing.html
Now, we know these language detectors are imperfect: if the user maps a
lot of languages to Hunspell pipelines, they may load lots of
dictionaries, even from just one stray miscategorized document.
So it doesn't have to be some extreme "enterprise" use case like
wikipedia.org; it can happen for a little guy faced with a
multilingual corpus.

Imagine the user decides to go further and host Solr search in this
way for a couple of local businesses or govt agencies.
They support many languages, and possibly use the detection scheme
above to try to make language a "non-issue".
The user may assign each customer a Solr "core" (separate index) with
this configuration.
Does each Solr core load its own HunspellStemFactory? I think it might
(in an isolated classloader), but I could be wrong.

For the Elasticsearch case, maybe the resource usage in the same scenario
is lower, because they reuse dictionaries per node?
I think this is how it works, but I honestly can't remember.
Still, the problem remains: it's easy to end up with dozens of these things in memory.

Also, we have the problem that memory usage for a specific language can blow
up in several ways.
Some languages have a bigger .aff file than .dic!

> Thanks for the idea about root arcs. I've done some quick sampling and tracing (for German). 80% of root arc processing time is spent in direct addressing, and the remainder is linear scan (so root acrs don't seem to present major issues). For non-root arcs, ~50% is directly addressed, ~45% linearly-scanned, and the remainder binary-searched. Overall there's about 60% of direct addressing, both in time and invocation counts, which doesn't seem too bad (or am I mistaken?). Currently BYTE4 inputs are used. Reducing that might increase the number of directly addressed arcs, but I'm not sure that'd speed up much given that time and invocation counts seem to correlate.
>

Sure, but 20% of those linear scans are maybe 7x slower; it's
O(log2(alphabet_size)), right (assuming alphabet size ~ 128)?
Hard to reason about, but maybe worth testing out. It still helps
all the other segmenters (Japanese, Korean) using FSTs.

Re: Hunspell performance
>
> at the price of not being able to enumerate all of a node's outgoing arcs.
>

So FSTEnum isn't possible there? Too bad, I need it for suggestions.
Re: Hunspell performance
I was hoping for some numbers :) In the meantime, I've got some of my own.
I loaded 90 dictionaries from https://github.com/wooorm/dictionaries
(there are more, but I ignored dialects of the same base language). Together
they currently consume a humble 166MB. With one of my less memory-hungry
approaches, they'd take ~500MB (maybe less if I optimize, but probably not
significantly). Is this very bad, or tolerable for, say, a 50% speedup?

I've seen huge *.aff files, and I'm planning to do something with the affix
FSTs, too. They take some noticeable time as well, but much less than the
*.dic ones, so for now I'm concentrating on *.dic.

> Sure, but 20% of those linear scans are maybe 7x slower

Checked that. The distribution appears to be decreasing monotonically. No
linear scans are longer than 8, and ~85% of all linear scans end after no
more than 1 miss.

I'll try BYTE1 if I manage to do it. It turned out to be surprisingly
complicated :(

On Wed, Feb 10, 2021 at 5:04 PM Robert Muir <rcmuir@gmail.com> wrote:

Re: Hunspell performance [ In reply to ]
A 50% speedup for the HunspellStemmer use case, for 3x the memory?

Just my opinion: Seems like the correct tradeoff to me.
The analysis chain is a serious bottleneck for indexing speed, and this
hunspell implementation is one of the slower ones.

To me the challenge with such a change is just trying to prevent
strange dictionaries from blowing up to 30x the space :)

On Wed, Feb 10, 2021 at 12:53 PM Peter Gromov
<peter.gromov@jetbrains.com.invalid> wrote:
>
> I was hoping for some numbers :) In the meantime, I've got some of my own. I loaded 90 dictionaries from https://github.com/wooorm/dictionaries (there's more, but I ignored dialects of the same base language). Together they currently consume a humble 166MB. With one of my less memory-hungry approaches, they'd take ~500MB (maybe less if I optimize, but probably not significantly). Is this very bad or tolerable for, say, 50% speedup?
>
> > I've seen huge *.aff files, and I'm planning to do something with affix FSTs, too. They take noticeable time as well, but much less than the *.dic ones, so for now I'm concentrating on *.dic.
>
> > Sure, but 20% of those linear scans are maybe 7x slower
>
> Checked that. The distribution appears to be decreasing monotonically. No linear scans are longer than 8, and ~85% of all linear scans end after no more than 1 miss.
>
> I'll try BYTE1 if I manage to do it. It turned out to be surprisingly complicated :(
>
> On Wed, Feb 10, 2021 at 5:04 PM Robert Muir <rcmuir@gmail.com> wrote:
>>
>> Peter, looks like you are way ahead of me :) Thanks for all the work
>> you have been doing here, and thanks to Dawid for helping!
>>
>> You probably know a lot of this code better than me at this point, but
>> I remember a couple of these pain points, inline below:
>>
>> On Wed, Feb 10, 2021 at 9:44 AM Peter Gromov
>> <peter.gromov@jetbrains.com.invalid> wrote:
>> >
>> > Hi Robert,
>> >
>> > Yes, having multiple dictionaries in the same process would increase the memory significantly. Do you have any idea about how many of them people are loading, and how much memory they give to Lucene?
>>
>> Yeah in many cases, the user is using a server such as solr or elasticsearch.
>> Let's use solr as an example, as others are here to correct it, if I am wrong.
>>
>> Example to understand the challenges: user uses one of solr's 3
>> mechanisms to detect language and send to different pipeline:
>> https://lucene.apache.org/solr/guide/8_8/detecting-languages-during-indexing.html
>> Now we know these language detectors are imperfect, if the user maps a
>> lot of languages to hunspell pipelines, they may load lots of
>> dictionaries, even by just one stray miscategorized document.
>> So it doesn't have to be some extreme "enterprise" use-case like
>> wikipedia.org, it can happen for a little guy faced with a
>> multilingual corpus.
>>
>> Imagine the user decides to go further, and host solr search in this
>> way for a couple local businesses or govt agencies.
>> They support many languages and possibly use this detection scheme
>> above to try to make language a "non-issue".
>> The user may assign each customer a solr "core" (separate index) with
>> this configuration.
>> Does each solr core load its own HunspellStemFactory? I think it might
>> (in isolated classloader), I could be wrong.
>>
>> For the elasticsearch case, maybe the resource usage in the same case
>> is lower, because they reuse dictionaries per-node?
>> I think this is how it works, but I honestly can't remember.
>> Still the problem remains, easy to end up with dozens of these things in memory.
>>
>> Also we have the problem that memory usage for a specific can blow up
>> in several ways.
>> Some languages have bigger .aff file than .dic!
>>
>> > Thanks for the idea about root arcs. I've done some quick sampling and tracing (for German). 80% of root arc processing time is spent in direct addressing, and the remainder is linear scan (so root acrs don't seem to present major issues). For non-root arcs, ~50% is directly addressed, ~45% linearly-scanned, and the remainder binary-searched. Overall there's about 60% of direct addressing, both in time and invocation counts, which doesn't seem too bad (or am I mistaken?). Currently BYTE4 inputs are used. Reducing that might increase the number of directly addressed arcs, but I'm not sure that'd speed up much given that time and invocation counts seem to correlate.
>> >
>>
>> Sure, but 20% of those linear scans are maybe 7x slower, its
>> O(log2(alphabet_size)) right (assuming alphabet size ~ 128)?
>> Hard to reason about, but maybe worth testing out. It still helps for
>> all the other segmenters (japanese, korean) using fst.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>

Re: Hunspell performance [ In reply to ]
> To me the challenge with such a change is just trying to prevent
> strange dictionaries from blowing up to 30x the space :)

Maybe the "backend" could be configurable somehow so that you could change
the strategy depending on your needs?.. I haven't looked at how FSTs are
used, but if they can be hidden behind a facade, then an alternative
implementation could be provided depending on one's needs.

D.


Re: Hunspell performance [ In reply to ]
+1 to configurability that is well documented, and reasonably actionable
downstream in Solr... Some folks struggle with the costs of buying machines
with lots of memory.


--
http://www.needhamsoftware.com (work)
http://www.the111shift.com (play)
Re: Hunspell performance [ In reply to ]
On Wed, Feb 10, 2021 at 3:05 PM Dawid Weiss <dawid.weiss@gmail.com> wrote:
> Maybe the "backend" could be configurable somehow so that you could change the strategy depending on your needs?... I haven't looked at how FSTs are used but if can be hidden behind a facade then an alternative implementation could be provided depending on one's need?
>
> D.
>

I don't have any confidence that solr would default to the "smaller"
option or fix how they manage different solr cores or thousands of
threads or any of the analyzer issues. And who would maintain this
separate hunspell backend? I don't think it is fair to Peter to have
to cope with 2 implementations of hunspell, 1 is certainly enough...
:). It's all Apache license; at the end of the day, if someone wants to
step up, let 'em. Otherwise, let's get out of their way.

Re: Hunspell performance [ In reply to ]
I didn't mean for Peter to write both backends but perhaps, if he's
experimenting already anyway, make it possible to extract an interface
which could be substituted externally with different implementations. Makes
it easier to tinker with various options, even for us.

D.

Re: Hunspell performance [ In reply to ]
I peeked at the code, and I still think it's not a bad idea to experiment
with extracting a facade for construction and lookup of words. There may
even be a middle ground between size and speed: if you assume a Zipfian
distribution of words, the top common ones could be stored/cached outside
of the FST (even in an associative dictionary). This would require external
frequency information during construction, but this isn't something
difficult.
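
To illustrate the idea, such a facade might look roughly like this; the `WordLookup` name, its method, and the hash-backed class below are purely hypothetical sketches, not actual Lucene API:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical facade over dictionary word storage; names are illustrative,
// not part of Lucene's Hunspell code.
interface WordLookup {
  /** Looks up a word given as a char[] slice; returns its entry id, or -1 if absent. */
  int lookupWord(char[] word, int offset, int length);
}

// One possible backend: a plain hash map, trading memory for lookup speed.
// An FST-backed implementation could sit behind the same interface.
final class HashMapLookup implements WordLookup {
  private final Map<String, Integer> entries = new HashMap<>();

  void add(String word, int id) {
    entries.put(word, id);
  }

  @Override
  public int lookupWord(char[] word, int offset, int length) {
    Integer id = entries.get(new String(word, offset, length));
    return id == null ? -1 : id;
  }
}

public class FacadeDemo {
  public static void main(String[] args) {
    HashMapLookup lookup = new HashMapLookup();
    lookup.add("hello", 1);
    char[] buffer = "say hello".toCharArray();
    System.out.println(lookup.lookupWord(buffer, 4, 5)); // found: prints 1
    System.out.println(lookup.lookupWord(buffer, 0, 3)); // absent: prints -1
  }
}
```

With such a seam in place, swapping backends (FST, hash map, hybrid with a hot-word cache) becomes a local change rather than a rewrite.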

D.

Re: Hunspell performance [ In reply to ]
> To me the challenge with such a change is just trying to prevent
> strange dictionaries from blowing up to 30x the space :)

Checked that. And indeed, such dictionaries exist, 20x and even 30x, and
then they start taking up to 30MB. Not nice.

> if you assume zipfian distribution of words, the top common ones could be
> stored/cached outside of the fst (even in an associative dictionary). This
> would require external frequency information during construction but this
> isn't something difficult.

That's an interesting idea about storing shortest (easy) or most frequent
(hard) words separately. Unfortunately the distribution isn't entirely
zipfian, as the dictionaries tend to contain a lot of short but uncommon
words, like abbreviations. Still, something for me to explore, thanks!

As for configurability, I'm considering that option as well (but would
prefer to avoid it). It's not as easy as adding a facade around one
"lookup" method. Well, now it is, but if we're staying with FST, then I'd
better finish that arc caching optimization I described (some 20-30%
speedup, not bad), and that'd require changing multiple signatures to pass
around some arc cache info in addition to the simple char[].

Re: Hunspell performance [ In reply to ]
>
> That's an interesting idea about storing shortest (easy) or most frequent
> (hard) words separately. Unfortunately the distribution isn't entirely
> zipfian, as the dictionaries tend to contain a lot of short but uncommon
> words, like abbreviations. Still, something for me to explore, thanks!
>

I meant taking advantage of empirical distribution of words occurring in an
average text (not their lengths) and providing a faster code path for
these.
I'd assume a typical run would spend the majority of the time looking up
the same words over and over. A small cache could be added to the
dictionary but also anywhere else - even on top of Stemmer or SpellChecker.
This cache could be very, very simplistic - even a dumb throw-away HashMap
discarded after it reaches a certain size (cache hits would occur only
until it's filled up). The distribution of the input will still make it
perform relatively well.
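
A minimal sketch of the throw-away, no-eviction cache described above; the `ThrowAwayCache` class, the tiny 2-entry cap, and the toy `stem` function stand in for the real stemmer and are illustrative only:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Memoize results until a size cap is reached, then stop adding entries.
// No eviction policy: relies on the skewed word distribution of real text,
// where the hot entries are cached early and keep hitting.
final class ThrowAwayCache<K, V> {
  private final int maxSize;
  private final Map<K, V> map = new HashMap<>();

  ThrowAwayCache(int maxSize) { this.maxSize = maxSize; }

  V computeIfAbsent(K key, Function<K, V> compute) {
    V cached = map.get(key);
    if (cached != null) return cached; // cache hit
    V value = compute.apply(key);
    if (map.size() < maxSize) {
      map.put(key, value); // past the cap we simply recompute every time
    }
    return value;
  }
}

public class CacheDemo {
  public static void main(String[] args) {
    ThrowAwayCache<String, String> cache = new ThrowAwayCache<>(2);
    // Toy stand-in for Dictionary lookup / stemming.
    Function<String, String> stem = w -> w.toLowerCase();
    System.out.println(cache.computeIfAbsent("Walking", stem)); // computed, stored
    System.out.println(cache.computeIfAbsent("Walking", stem)); // cache hit
    System.out.println(cache.computeIfAbsent("Dogs", stem));    // computed, stored
    System.out.println(cache.computeIfAbsent("Cats", stem));    // cap reached: computed, not stored
  }
}
```

The same wrapper could sit at any level - around Dictionary lookup, Stemmer, or SpellChecker - without touching the underlying FST code.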

Here's an example - I used Mike McCandless's enwiki and ran English test
from TestPerformance on it. My master branch results:

Loaded c:\_tmp\hunspell\libreoffice\en\en_AU.aff
Stemming en: average 1958.7142857142858, all times = [1968, 1960, 1945,
1955, 1959, 1972, 1952]
Spellchecking en: average 2438.1428571428573, all times = [2465, 2425,
2429, 2434, 2452, 2429, 2433]

I then added [1] a dumb no-eviction-policy caching on top of stemming and
spell checking.

Results for cache size of 2500 entries:
Stemming en: average 886.5714285714286, all times = [885, 885, 889, 889,
890, 886, 882]
Spellchecking en: average 1182.0, all times = [1207, 1177, 1184, 1187,
1179, 1170, 1170]

Results for cache size of 5000 entries:
Stemming en: average 762.8571428571429, all times = [764, 758, 756, 766,
763, 766, 767]
Spellchecking en: average 1065.2857142857142, all times = [1070, 1068,
1061, 1065, 1064, 1067, 1062]

Results for cache size of 10000 entries:
Stemming en: average 628.0, all times = [626, 628, 627, 624, 627, 634, 630]
Spellchecking en: average 926.8571428571429, all times = [924, 925, 930,
927, 925, 930, 927]

The sizes of these caches are literally nothing and they don't require any
pretraining or eviction policies. This is all I meant by "hybrid" lookup
based on the distribution... this could be built in the dictionary but
doesn't have to be.

D.

[1]
https://github.com/apache/lucene-solr/commit/2b3cc7d1fc4ee3d999873f056764f70129390121
Re: Hunspell performance [ In reply to ]
>
> I don't have any confidence that solr would default to the "smaller"
> option or fix how they manage different solr cores or thousands of
> threads or any of the analyzer issues.


Certainly there's work to be done there. Many things to improve. Separate
issue however.

>
> And who would maintain this
> separate hunspell backend? I don't think it is fair to Peter to have
> to cope with 2 implementations of hunspell, 1 is certainly enough...
> :). It's all apache license, at the end of the day if someone wants to
> step up, let 'em. otherwise let's get out of their way.
>

Entirely valid point, but what I wouldn't want to see is a case where
someone using an existing install had to buy significantly more servers to
continue using it with the new version. I also think it's great to have
improved performance :) I've had several customers that have been
disappointed at the cost of servers necessary for the size of their data.
Usually this cost is due to memory requirements, not cpu needs. I often
have to explain that search is all about trading memory for speed, but have
found myself wishing that it were easier to vary the degree of that
trade-off. So that's the root of my comment...
Re: Hunspell performance [ In reply to ]
Dawid,

I like these numbers very much! Did you put the caching inside
Dictionary#lookupWord? Did you cache negative results as well?

Peter

Re: Hunspell performance [ In reply to ]
Hi Peter,

I provided a link to the commit diff. It was just a dumb wrapper in the
perf test - take a look. I'm sure it can be done better, but the approach itself
is sound (and I've done it before). The cache-no-eviction-policy approach
essentially relies on non-uniform distribution of the elements on input
(which is the case in many situations). I've used it before on token
streams (to cache token attributes of frequent tokens). I'm sure it'd do
some good here as well.

It doesn't negate the need to work on speeding up the core of the algorithm
itself, but it's an easy way to squeeze some additional performance without
making things much more complex.

Dawid

On Thu, Feb 11, 2021 at 6:28 PM Peter Gromov
<peter.gromov@jetbrains.com.invalid> wrote:

> Dawid,
>
> I like these numbers very much! Did you put the caching inside
> Dictionary#lookupWord? Did you cache negative results as well?
>
> Peter
>
> On Thu, Feb 11, 2021 at 12:17 PM Dawid Weiss <dawid.weiss@gmail.com>
> wrote:
>
>> That's an interesting idea about storing shortest (easy) or most frequent
>>> (hard) words separately. Unfortunately the distribution isn't entirely
>>> zipfian, as the dictionaries tend to contain a lot of short but uncommon
>>> words, like abbreviations. Still, something for me to explore, thanks!
>>>
>>
>> I meant taking advantage of empirical distribution of words occurring in
>> an average text (not their lengths) and providing a faster code path for
>> these.
>> I'd assume a typical run would spend the majority of the time looking up
>> the same words over and over. A small cache could be added to the
>> dictionary but also anywhere else - even on top of Stemmer or SpellChecker.
>> This cache could be very, very simplistic - even a dumb throw-away HashMap
>> discarded after it reaches a certain size (cache hits would occur only
>> until it's filled up). The distribution of the input still will make it
>> perform relatively well.
>>
>> Here's an example - I used Mike McCandless's enwiki and ran English test
>> from TestPerformance on it. My master branch results:
>>
>> Loaded c:\_tmp\hunspell\libreoffice\en\en_AU.aff
>> Stemming en: average 1958.7142857142858, all times = [1968, 1960, 1945,
>> 1955, 1959, 1972, 1952]
>> Spellchecking en: average 2438.1428571428573, all times = [2465, 2425,
>> 2429, 2434, 2452, 2429, 2433]
>>
>> I then added [1] a dumb no-eviction-policy caching on top of stemming and
>> spell checking.
>>
>> Results for cache size of 2500 entries:
>> Stemming en: average 886.5714285714286, all times = [885, 885, 889, 889,
>> 890, 886, 882]
>> Spellchecking en: average 1182.0, all times = [1207, 1177, 1184, 1187,
>> 1179, 1170, 1170]
>>
>> Results for cache size of 5000 entries:
>> Stemming en: average 762.8571428571429, all times = [764, 758, 756, 766,
>> 763, 766, 767]
>> Spellchecking en: average 1065.2857142857142, all times = [1070, 1068,
>> 1061, 1065, 1064, 1067, 1062]
>>
>> Results for cache size of 10000 entries:
>> Stemming en: average 628.0, all times = [626, 628, 627, 624, 627, 634,
>> 630]
>> Spellchecking en: average 926.8571428571429, all times = [924, 925, 930,
>> 927, 925, 930, 927]
>>
>> The sizes of these caches are literally nothing and they don't require
>> any pretraining or eviction policies. This is all I meant by "hybrid"
>> lookup based on the distribution... this could be built in the dictionary
>> but doesn't have to be.
>>
>> D.
>>
>> [1]
>> https://github.com/apache/lucene-solr/commit/2b3cc7d1fc4ee3d999873f056764f70129390121
>>
>>
Re: Hunspell performance [ In reply to ]
For TokenStreams, things are simple since they're accessed by only one
thread: you can also do a proper LRU via LinkedHashMap, which is maybe even
less code.
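The LinkedHashMap variant really is only a few lines - a sketch:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * LRU cache via LinkedHashMap: accessOrder=true keeps entries ordered by
 * most recent access, and removeEldestEntry evicts once capacity is exceeded.
 */
final class LruCache<K, V> extends LinkedHashMap<K, V> {
  private final int maxSize;

  LruCache(int maxSize) {
    super(16, 0.75f, true); // true = access order (a get() refreshes an entry)
    this.maxSize = maxSize;
  }

  @Override
  protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
    return size() > maxSize;
  }
}
```

Note the third constructor argument: with the default insertion order this degenerates into a FIFO cache - the exact mistake corrected later in this thread.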

But putting caches around the token stream/stemmer has implications if the
user has huge numbers of threads and especially high churn (like Solr with
its dynamic thread pool with a max of 10000; see my earlier mail).
Another alternative would be to centralize it around the Hunspell
"Dictionary", but then it needs to be thread-safe and so on.
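A centralized, thread-safe variant might look like the following sketch (hypothetical names, not Lucene code): a bounded ConcurrentHashMap that simply stops admitting entries when full, which sidesteps eviction but not the churn and footprint concerns above.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/**
 * Thread-safe lookup cache suitable for sharing at the Dictionary level:
 * lock-free reads, and once full it simply stops admitting new entries.
 */
final class SharedLookupCache<K, V> {
  private final int maxSize;
  private final ConcurrentHashMap<K, V> map = new ConcurrentHashMap<>();

  SharedLookupCache(int maxSize) {
    this.maxSize = maxSize;
  }

  V lookup(K key, Function<K, V> compute) {
    V value = map.get(key);
    if (value != null) {
      return value;
    }
    value = compute.apply(key);
    if (map.size() < maxSize) {
      map.putIfAbsent(key, value); // racy size check is fine: the bound is approximate
    }
    return value;
  }
}
```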


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
Re: Hunspell performance [ In reply to ]
> for TokenStreams, they are simple, accessed by only one thread: you
> can also do a proper LRU via LinkedHashMap which is maybe even less
> code.
>

Maybe surprisingly, the naive approach of just discarding the hash map
worked very well on real data (for me). The difference between LRU (or
anything else) and that naive strategy was fairly small.

Updated that branch to include LRU as well, here are the results:

Naive clear on hashmap:

Cache size: 0: Stemming en: average 1934.5714285714287, all times = [1952,
1941, 1922, 1932, 1938, 1931, 1926]
Cache size: 1000: Stemming en: average 990.2857142857143, all times = [981,
984, 985, 995, 999, 997, 991]
Cache size: 2000: Stemming en: average 879.7142857142857, all times = [878,
875, 874, 882, 880, 885, 884]
Cache size: 4000: Stemming en: average 774.2857142857143, all times = [771,
770, 773, 774, 775, 777, 780]
Cache size: 8000: Stemming en: average 659.0, all times = [652, 658, 657,
673, 672, 655, 646]

Cache size: 0: Spellchecking en: average 2422.285714285714, all times =
[2421, 2418, 2437, 2445, 2422, 2406, 2407]
Cache size: 1000: Spellchecking en: average 1264.857142857143, all times =
[1259, 1251, 1255, 1254, 1274, 1274, 1287]
Cache size: 2000: Spellchecking en: average 1172.2857142857142, all times =
[1189, 1185, 1185, 1157, 1159, 1169, 1162]
Cache size: 4000: Spellchecking en: average 1058.0, all times = [1052,
1050, 1063, 1070, 1059, 1057, 1055]
Cache size: 8000: Spellchecking en: average 937.0, all times = [932, 942,
943, 927, 925, 935, 955]

LRU on linked hash map:

Cache size: 0 Stemming en: average 1960.142857142857, all times = [1955,
1954, 1966, 1974, 1960, 1957, 1955]
Cache size: 1000 Stemming en: average 928.8571428571429, all times = [921,
929, 953, 929, 925, 924, 921]
Cache size: 2000 Stemming en: average 817.8571428571429, all times = [818,
818, 818, 820, 819, 815, 817]
Cache size: 4000 Stemming en: average 706.0, all times = [705, 705, 708,
705, 706, 706, 707]
Cache size: 8000 Stemming en: average 583.0, all times = [583, 584, 582,
584, 582, 585, 581]

Cache size: 0 Spellchecking en: average 2452.714285714286, all times =
[2496, 2477, 2436, 2435, 2441, 2439, 2445]
Cache size: 1000 Spellchecking en: average 1203.7142857142858, all times =
[1205, 1204, 1203, 1201, 1202, 1205, 1206]
Cache size: 2000 Spellchecking en: average 1108.0, all times = [1110, 1107,
1105, 1106, 1106, 1111, 1111]
Cache size: 4000 Spellchecking en: average 1004.0, all times = [999, 995,
993, 996, 997, 1019, 1029]
Cache size: 8000 Spellchecking en: average 885.1428571428571, all times =
[902, 904, 885, 872, 876, 877, 880]

A cache-all for comparison (a lower bound... of sorts):

Cache size: 0 Stemming en: average 36.142857142857146, all times = [53, 36,
35, 32, 32, 33, 32]
Cache size: 1000 Stemming en: average 33.285714285714285, all times = [32,
31, 32, 35, 34, 35, 34]

Cache size: 0 Spellchecking en: average 31.285714285714285, all times =
[33, 32, 36, 30, 29, 30, 29]
Cache size: 1000 Spellchecking en: average 31.571428571428573, all times =
[31, 31, 32, 31, 31, 35, 30]

> but putting caches around tokenstream/stemmer has implications, if the
> user has huge numbers of threads and especially high churn (like solr
> with its dynamic threadpool with max of 10000, my earlier mail).


Right. I'm not saying it's for everyone, just throwing an idea that worked
really well in practice.

> another alternative would be to centralize it around hunspell
> "Dictionary", but then it needs to be thread-safe and so on.
>

I'd leave it to application layers higher up, to be honest, where the usage
context is better known.

D.
Re: Hunspell performance [ In reply to ]
Corrected a dumb mistake in the LRU (insertion vs. access order); the
results are not that much different, though:

Insertion-order

LRU on linked hash map:
>
> Cache size: 1000 Stemming en: average 928.8571428571429, all times = [921,
> 929, 953, 929, 925, 924, 921]
> Cache size: 2000 Stemming en: average 817.8571428571429, all times = [818,
> 818, 818, 820, 819, 815, 817]
> Cache size: 4000 Stemming en: average 706.0, all times = [705, 705, 708,
> 705, 706, 706, 707]
> Cache size: 8000 Stemming en: average 583.0, all times = [583, 584, 582,
> 584, 582, 585, 581]
>

Access-order:

Cache size: 1000 Stemming en: average 894.5714285714286, all times = [896,
894, 894, 897, 895, 895, 891]
Cache size: 2000 Stemming en: average 782.0, all times = [782, 783, 783,
784, 781, 781, 780]
Cache size: 4000 Stemming en: average 667.5714285714286, all times = [661,
666, 666, 666, 677, 668, 669]
Cache size: 8000 Stemming en: average 534.2857142857143, all times = [544,
532, 531, 532, 542, 529, 530]
Cache size: 30000 Stemming en: average 323.2857142857143, all times = [317,
330, 326, 325, 319, 325, 321]
Cache size: 50000 Stemming en: average 250.28571428571428, all times =
[256, 249, 247, 250, 249, 256, 245]
Cache size: 53000 Stemming en: average 242.42857142857142, all times =
[246, 239, 240, 245, 241, 245, 241]
Cache size: 55000 Stemming en: average 44.285714285714285, all times = [44,
45, 44, 44, 44, 45, 44]

There are exactly 53869 unique keys on input, and it's interesting to see
how sharply the run time drops between a cache that barely fits and a fully
cached input. I think it may be the JIT just optimizing something away...

It's all an intellectual exercise, though. I consider the initial run-time
drop from small cache windows a nice, cheap win. Increasing the cache leads
to other problems that may sting later (GC activity, memory consumption
across multiple parallel threads, etc.).

D.
Re: Hunspell performance [ In reply to ]
On Fri, Feb 12, 2021 at 4:01 AM Dawid Weiss <dawid.weiss@gmail.com> wrote:

>
> It's all an intellectual exercise though. I consider the initial
> performance drop on small cache windows a nice, cheap, win. Increasing
> the cache leads to other problems that may sting later (gc activity,
> memory consumption on multiple parallel threads, etc.).
>
>
Is that because of stopwords that aren't being removed? I guess what I'm
asking is, for this test is n=20 enough? :)

If so, leads to many other potential solutions (without caches). Also it
would suggest the benchmark might be a bit too biased.
Re: Hunspell performance [ In reply to ]
Dawid, I didn't notice the commit link then, thanks for pointing that out!

This "// TODO: make sure these returned charsref are immutable?" is a good
point, because right now they're very mutable, referring to internal
preallocated buffers in Stemmer that are constantly reused.

In the cache-all condition, you ignore maxSize intentionally, right?

I've reproduced your results for English. I also checked German and French,
which have compounds and more advanced inflection. They're improved as
well, but not so much (30-40% on cache=10000, while calling native Hunspell
via JNI is 2-4 times faster).

I dream of making the Hunspell Stemmer thread-safe, and have even gotten
rid of some of the preallocated state there, but some still remains, so for
the near future it'll stay thread-unsafe, and caching can fit in there.

Clients might deduplicate the requests themselves; I've done that a couple
of times. Then the cache inside Hunspell would be useless and just add some
overhead (luckily not much, as per my CPU snapshots).

Robert, for n=20 the speedup is quite small, 2-8% for me depending on the
language. Unfortunately, Hunspell dictionaries don't have stop word
information; it'd be quite useful.

On Fri, Feb 12, 2021 at 12:56 PM Robert Muir <rcmuir@gmail.com> wrote:

>
> On Fri, Feb 12, 2021 at 4:01 AM Dawid Weiss <dawid.weiss@gmail.com> wrote:
>
>>
>> It's all an intellectual exercise though. I consider the initial
>> performance drop on small cache windows a nice, cheap, win. Increasing
>> the cache leads to other problems that may sting later (gc activity,
>> memory consumption on multiple parallel threads, etc.).
>>
>>
> Is that because of stopwords that aren't being removed? I guess what I'm
> asking is, for this test is n=20 enough? :)
>
> If so, leads to many other potential solutions (without caches). Also it
> would suggest the benchmark might be a bit too biased.
>
Re: Hunspell performance [ In reply to ]
> This "// TODO: make sure these returned charsref are immutable?" is a good
> point, because now they're very mutable, referring to internal preallocated
> buffers in Stemmer which are constantly reused.
>

You'd need to copy them or otherwise make sure they remain constant while
in the cache, obviously.
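That defensive copy can be as simple as snapshotting the relevant slice of the reused buffer into an immutable String before it enters the cache (a hypothetical helper, not the actual Lucene code):

```java
/**
 * Cached values must not alias the stemmer's reused char[] buffers.
 * Copying the slice into a String detaches it: later mutation of the
 * buffer no longer affects the cached value.
 */
final class BufferSnapshot {
  static String snapshot(char[] buffer, int offset, int length) {
    return new String(buffer, offset, length); // the String constructor copies the chars
  }
}
```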


> In cache-all condition, you ignore the maxSize intentionally, right?
>

Yes - I was just trying to figure out how many tokens there are in that
sample.


> I've reproduced your results for English. I also checked German and
> French, which have compounds and more advanced inflection. They're improved
> as well, but not so much (30-40% on cache=10000, while calling native
> Hunspell via JNI is 2-4 times faster).
>

Sure. Like I said, it was just an idea off the top of my head.

D.
