Mailing List Archive

Limitations of StempelStemmer
Hi,

I have just checked out the latest version of Lucene from Git master branch.

I have tried to stem a few words using StempelStemmer for Polish.
However, it looks like it cannot handle some words properly, e.g.

joyce -> ??
wielce -> ??
piwko -> ??
royce -> ??
pip -> ??
xyz -> xyz

1. I am surprised it cannot handle Polish words like wielce, piwko and
royce. Is this a limitation of the stemming algorithm, of the training
of the algorithm, or something else? The latter would help improve the
situation. How can I improve that behaviour?
2. I am surprised that for non-Polish words it returns "a?". I would
expect that for words it has not been trained on it would return their
original forms, as happens, for instance, when stemming words like
"xyz".

With kind regards,
Maciej Gawinecki

Here's a minimal example to reproduce the issue:

package org.apache.lucene.analysis;

import java.io.InputStream;
import org.apache.lucene.analysis.stempel.StempelStemmer;

public class Try {

    public static void main(String[] args) throws Exception {
        InputStream stemmerTable = ClassLoader.getSystemClassLoader()
                .getResourceAsStream("org/apache/lucene/analysis/pl/stemmer_20000.tbl");
        StempelStemmer stemmer = new StempelStemmer(stemmerTable);
        String[] words = {"joyce", "wielce", "piwko", "royce", "pip", "xyz"};
        for (String word : words) {
            System.out.println(String.format("%s -> %s", word,
                    stemmer.stem("piwko")));
        }

    }

}

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: Limitations of StempelStemmer
Hi Maciej,

Stempel uses a pretrained heuristic. You can find a longer description
at [1] and [2]. The specific reason for the problems you mentioned may
be the smaller training dictionary used for the version embedded in
Lucene, but I honestly don't know. If you need exact
stemming/lemmatization, take a look at dictionary-based methods:
Morfologik or the tools listed at [3].

Dawid

[1] http://www.getopt.org/stempel/
[2] https://lucene.apache.org/core/8_2_0/analyzers-stempel/index.html
[3] http://zil.ipipan.waw.pl/

Re: Limitations of StempelStemmer
Hi,

On Tue, Sep 10, 2019, 22:31 Maciej Gawinecki <mgawinecki@gmail.com> wrote:

> for (String word : words) {
>     System.out.println(String.format("%s -> %s", word,
>             stemmer.stem("piwko")));

You always pass "piwko" for stemming.

Re: Limitations of StempelStemmer
> You always pass "piwko" for stemming.

I'm afraid that's not correct? You should *never* pass on piwko when
stemming. :)

Dawid

Re: Limitations of StempelStemmer
> You always pass "piwko" for stemming.

Right, I spotted my mistake just after posting my question, but I
didn't want to spam the list with too many posts (there's no way to
edit an already posted question on a mailing list :-)). Anyway, the
issue still persists. Here's the corrected version to reproduce it:

import java.io.InputStream;
import org.apache.lucene.analysis.stempel.StempelStemmer;

public class Try {

    public static void main(String[] args) throws Exception {
        // Load the Polish stemmer table from the classpath.
        InputStream stemmerTable = ClassLoader.getSystemClassLoader()
                .getResourceAsStream("org/apache/lucene/analysis/pl/stemmer_20000.tbl");
        StempelStemmer stemmer = new StempelStemmer(stemmerTable);
        String[] words = {"joyce", "wielce", "piwko", "royce", "pip", "xyz"};
        for (String word : words) {
            // Stem the current word (the first version always stemmed "piwko").
            System.out.println(String.format("%s -> %s", word,
                    stemmer.stem(word)));
        }

    }
}
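
Until the model is retrained, one possible workaround for the second
point (getting the original form back for words the stemmer was not
trained on) is to wrap the stemmer in a small plausibility check. The
sketch below is only an illustration: the class SafeStem, the method
stemOrOriginal and the "shared prefix" rule are assumptions made up for
this example, not part of Lucene, and the stemmer is passed in as a
plain Function so the snippet runs without Lucene on the classpath. In
real code the function would be something like
w -> stemmer.stem(w), which may return null when no stem is found.

```java
import java.util.function.Function;

class SafeStem {

    // Length of the longest common prefix of two character sequences.
    static int commonPrefix(CharSequence a, CharSequence b) {
        int n = Math.min(a.length(), b.length());
        int i = 0;
        while (i < n && a.charAt(i) == b.charAt(i)) {
            i++;
        }
        return i;
    }

    // Returns the stem if it looks plausible, otherwise the original word.
    // "stemmer" is a stand-in for a call such as w -> stempel.stem(w).
    // The shared-prefix rule is a hypothetical heuristic, not Lucene behaviour.
    static String stemOrOriginal(String word,
                                 Function<String, ? extends CharSequence> stemmer,
                                 int minSharedPrefix) {
        CharSequence stem = stemmer.apply(word);
        if (stem == null || stem.length() == 0) {
            return word; // no stem produced at all
        }
        if (commonPrefix(word, stem) < minSharedPrefix) {
            return word; // stem shares too little with the word to be trusted
        }
        return stem.toString();
    }

    public static void main(String[] args) {
        // Toy stand-in stemmer with made-up outputs, for illustration only.
        Function<String, CharSequence> toy = w -> {
            switch (w) {
                case "domami": return "dom";
                case "joyce": return "zz";
                default: return null;
            }
        };
        for (String w : new String[] {"domami", "joyce", "xyz"}) {
            System.out.println(w + " -> " + stemOrOriginal(w, toy, 2));
        }
    }
}
```

A rule like this would also reject legitimate stems whose beginning
differs from the inflected form, so the threshold would need checking
against real Polish data before relying on it.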

Re: Limitations of StempelStemmer
>
> > You always pass "piwko" for stemming.
>
> I'm afraid that's not correct? You should *never* pass on piwko when
> stemming. :)

Haha, right, one should not mix both.

Anyway, thank you for your original suggestions. Training it with a
bigger corpus of inflected forms seems like a great idea. We now have
many more corpora available (e.g., the SGJP [1] and Polimorf [2]
morphological dictionaries from Morfeusz) than Andrzej Białecki, the
original author, had when training the stemmer. I might give it a try,
I just need to find some spare time :-)

[1]: http://download.sgjp.pl/morfeusz/20190925/sgjp-20190925.tab.gz
[2]: http://download.sgjp.pl/morfeusz/20190925/polimorf-20190925.tab.gz
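
A rough sketch of the first retraining step, turning such a dictionary
into training input, could look like the code below. It rests on two
assumptions that should be verified before use: that each dictionary
row is tab-separated with the inflected form in the first column and
the lemma in the second (with '#' marking comment lines), and that the
trainer expects one line per lemma in the shape
"lemma form1 form2 ...". The class and method names are made up for
this example, and the sample rows use a placeholder tag column.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class TrainingLines {

    // Groups (form, lemma) rows into one "lemma form1 form2 ..." line per lemma.
    static List<String> toTrainingLines(List<String> tabRows) {
        Map<String, Set<String>> formsByLemma = new LinkedHashMap<>();
        for (String row : tabRows) {
            if (row.isEmpty() || row.startsWith("#")) {
                continue; // assumed: blank and '#' lines carry no entries
            }
            String[] cols = row.split("\t");
            if (cols.length < 2) {
                continue; // not an entry row
            }
            String form = cols[0];
            String lemma = cols[1];
            if (form.contains(" ") || lemma.contains(" ")) {
                continue; // multi-word entries would break a space-separated line
            }
            formsByLemma.computeIfAbsent(lemma, k -> new LinkedHashSet<>()).add(form);
        }
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Set<String>> entry : formsByLemma.entrySet()) {
            StringBuilder line = new StringBuilder(entry.getKey());
            for (String form : entry.getValue()) {
                if (!form.equals(entry.getKey())) {
                    line.append(' ').append(form); // skip the form equal to the lemma
                }
            }
            lines.add(line.toString());
        }
        return lines;
    }

    public static void main(String[] args) {
        // Made-up sample rows; "tag" is a placeholder for the real tag column.
        List<String> rows = Arrays.asList(
                "# sample header",
                "piwka\tpiwko\ttag",
                "piwko\tpiwko\ttag",
                "piwkiem\tpiwko\ttag");
        for (String line : toTrainingLines(rows)) {
            System.out.println(line);
        }
    }
}
```

Real dictionary files are large, so an actual conversion would stream
the file line by line rather than hold every row in a list, and lemmas
may need extra cleanup depending on how the dictionary encodes them.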
