Mailing List Archive

Lucene Hunspell Spell checker
Hello,
I'm trying to create a Java wrapper library that detects the language of a text and then spell-checks it for the detected language. I'm currently using Apache Tika as the language detector, and I'm trying to use the lucene.analysis.hunspell package for spell checking, as I've seen it supports many languages.

My issue is that I can't get good accuracy for some languages that have "special" characters. E.g. in Swedish I'm checking the word bästa, which is classified as misspelled, and the word basta is suggested instead. bästa exists in the dictionary, so I think this is some encoding issue.

I'm on Windows, with Lucene 8.11.2.
I'm using lucene.analysis.hunspell.Hunspell as the spellchecker
and lucene.analysis.hunspell.Dictionary to create the dictionaries.
I'm using .dic and .aff files from here.


Any guidance on where I should look, or how I should implement the spell checks, would be welcome, as I've hardly found anything :)

Thanks a lot in advance,
Thanos
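
One way to test the encoding hypothesis above, before digging into Lucene itself, is to print the code points of the word the checker actually receives. The sketch below is not taken from the thread or from Lucene: EncodingCheck, codePoints, and misdecode are hypothetical names, and it only assumes two possible causes, namely that UTF-8 bytes were decoded with a single-byte legacy charset (on Windows with older JDKs the platform default is often windows-1252, which maps these particular bytes the same way as ISO-8859-1), or that the word arrives in decomposed Unicode form.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

final class EncodingCheck {

    /** Renders each code point as U+XXXX so invisible differences become visible. */
    static String codePoints(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    /** Simulates UTF-8 bytes being decoded with a single-byte legacy charset. */
    static String misdecode(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // "bästa", written with an escape so the source file encoding cannot interfere
        String word = "b\u00e4sta";

        // Correctly decoded, precomposed form
        System.out.println(codePoints(word));
        // prints: U+0062 U+00E4 U+0073 U+0074 U+0061

        // What the same word looks like after a wrong decode (mojibake)
        System.out.println(codePoints(misdecode(word)));
        // prints: U+0062 U+00C3 U+00A4 U+0073 U+0074 U+0061

        // Decomposed form: 'a' followed by a combining diaeresis
        String decomposed = Normalizer.normalize(word, Normalizer.Form.NFD);
        System.out.println(codePoints(decomposed));
        // prints: U+0062 U+0061 U+0308 U+0073 U+0074 U+0061

        // Normalizing back to NFC restores the precomposed form
        System.out.println(word.equals(Normalizer.normalize(decomposed, Normalizer.Form.NFC)));
        // prints: true
    }
}
```

If the word passed to the spellchecker shows the second or third pattern, it will not match a dictionary entry stored in the first form, whatever the checker does afterwards. If it shows the first pattern, the input encoding is probably not the culprit.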
RE: Lucene Hunspell Spell checker [ In reply to ]
*here
Re: Lucene Hunspell Spell checker [ In reply to ]
It'd be good if you could share the problematic scenario as a piece of code
(ideally a forked Lucene repository, with a test case?) so that we can take
a look. There's been a ton of improvements to hunspell packages in Lucene 9
(and on the main branch) - you should take a look and perhaps take some
inspiration from existing test cases there?

Dawid

On Mon, Feb 13, 2023 at 1:52 PM Thanos Agelakpoulos
<agel_thanos@yahoo.gr.invalid> wrote:

RE: Lucene Hunspell Spell checker [ In reply to ]
Thanks for the response, Dawid!

I created a quick repo just to showcase the problem: https://github.com/aggelako/JavaSpellchecker
In there you can see how I'm using Lucene: the spell check is performed in the spellCheck function of the SpellChecker class. I have also provided the dictionaries as resources.
You can also see the cases I'm referring to in the last 2 tests. This happens for a bunch of the languages; I just presented 2 examples.
Feel free to propose any changes, comments, or fixes :)
Thanks a lot in advance,
Thanos
Re: Lucene Hunspell Spell checker [ In reply to ]
Can't open this repository, it's probably private.

Dawid

On Tue, Feb 14, 2023 at 2:42 PM Thanos Agelakpoulos
<agel_thanos@yahoo.gr.invalid> wrote:

Re: Lucene Hunspell Spell checker [ In reply to ]
FYI, from what I saw there, there was a `dictionary gap`: the
dictionary files are kind of incomplete.
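
A rough way to check for such a gap is to look at the dictionary files directly. The sketch below is only an illustration, not Lucene code: DicLookup, declaredEncoding, and listsStem are hypothetical names, and the sample lines in main are made up rather than copied from a real Swedish dictionary. It assumes the usual Hunspell layout, where the .aff file may declare its encoding with a SET line and the .dic file starts with an entry count followed by one stem per line, optionally with /FLAGS. Escaped slashes and stems containing spaces are ignored for simplicity.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

final class DicLookup {

    /** Returns the encoding named by a "SET <encoding>" line of the .aff file, if there is one. */
    static Optional<String> declaredEncoding(List<String> affLines) {
        for (String line : affLines) {
            String[] parts = line.trim().split("\\s+");
            if (parts.length >= 2 && parts[0].equals("SET")) {
                return Optional.of(parts[1]);
            }
        }
        return Optional.empty();
    }

    /** Returns true if the word is listed verbatim as a stem in the .dic file. */
    static boolean listsStem(List<String> dicLines, String word) {
        return dicLines.stream()
            .skip(1) // the first line holds the entry count, not a word
            .map(line -> line.split("[/\\s]", 2)[0]) // keep the text before flags or extra fields
            .anyMatch(word::equals);
    }

    public static void main(String[] args) {
        // Made-up sample content; the flags are placeholders.
        List<String> aff = Arrays.asList("# sample affix file", "SET UTF-8");
        List<String> dic = Arrays.asList("3", "basta/AB", "b\u00e4sta", "bra/Y");

        System.out.println(declaredEncoding(aff).orElse("not declared")); // prints: UTF-8
        System.out.println(listsStem(dic, "b\u00e4sta"));                 // prints: true
        System.out.println(listsStem(dic, "b\u00e4ste"));                 // prints: false
    }
}
```

For real files, the lines could be read with Files.readAllLines using the charset named in the SET line, keeping in mind that not every encoding name used in Hunspell files is necessarily accepted by Charset.forName. A negative result only means the word is not listed verbatim: it may still be valid through affix rules applied to another stem.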

Another question that always makes me wonder: why is there no Hunspell-based
suggester or spellchecker in the Lucene codebase?

On Fri, Feb 17, 2023 at 11:23 AM Dawid Weiss <dawid.weiss@gmail.com> wrote:

--
Sincerely yours
Mikhail Khludnev
https://t.me/MUST_SEARCH
A caveat: Cyrillic!