Question about the light and minimal French stemmers

adriengallou at gmail

Jul 23, 2019, 5:53 AM

Post #1 of 8 (1266 views)

Hi,

I'm using both light and minimal French stemmers and encountered an issue
when using the minimal stemmer.

The light stemmer removes the last character of a word if the last two
characters are identical.
We can see that here:
https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/fr/FrenchLightStemmer.java#L263
In this light stemmer, there is a check to avoid altering the token if the
token is a number.

The minimal stemmer also removes the last character of a word if the last
two characters are identical.
We can see that here:
https://github.com/apache/lucene-solr/blob/master/lucene/analysis/common/src/java/org/apache/lucene/analysis/fr/FrenchMinimalStemmer.java#L77

But in this minimal stemmer there is no check to see if the character is a
letter or not.
So when we have numeric tokens with the last two characters identical they
are altered.
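
As a rough illustration of the behaviour described above, here is a minimal sketch of a "drop the last character when the last two are identical" rule with no letter check. It is not the Lucene source: the class name `UnguardedDoubleCharRule` and the helpers `stripFinalDouble` and `apply` are hypothetical, and the real stemmer applies other rules as well.

```java
public class UnguardedDoubleCharRule {

    // Hypothetical helper: returns the new token length after dropping the
    // final character when the last two characters are identical.
    // There is deliberately no check that the character is a letter.
    static int stripFinalDouble(char[] s, int len) {
        if (len >= 2 && s[len - 1] == s[len - 2]) {
            return len - 1;
        }
        return len;
    }

    // Convenience wrapper that works on strings instead of char buffers.
    static String apply(String token) {
        char[] chars = token.toCharArray();
        return new String(chars, 0, stripFinalDouble(chars, chars.length));
    }

    public static void main(String[] args) {
        System.out.println(apply("123455")); // numeric token loses its last digit
        System.out.println(apply("12345"));  // unchanged: last two characters differ
    }
}
```

With a rule like this, two different numeric tokens such as "123455" and "12345" end up indexed as the same term.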

Is there a reason for this?
Should I file an issue on Jira to add this check?

Thanks,

Adrien Gallou

Re: Question about the light and minimal French stemmers [ In reply to ]

tomoko.uchida.1111 at gmail

Jul 27, 2019, 4:29 AM

Post #3 of 8 (1258 views)

Hi Adrien,

To me, this simply sounds like a bug. Could you please open a JIRA issue
(with a patch, if possible)?

Tomoko

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Question about the light and minimal French stemmers [ In reply to ]

tomoko.uchida.1111 at gmail

Jul 27, 2019, 5:05 AM

Post #4 of 8 (1258 views)

I found an issue that added the isLetter() check to FrenchLightStemmer:
https://issues.apache.org/jira/browse/LUCENE-4063

It seems the same change was never applied to FrenchMinimalStemmer.
Would it be a good idea to add the same check there, to avoid overly
aggressive stemming?
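
For illustration, a minimal sketch of what such a guard could look like, assuming the rule only needs an extra `Character.isLetter` test before removing the final character. This is not the actual patch or the Lucene source; the class name `GuardedDoubleCharRule` and the helpers `stripFinalDouble` and `apply` are hypothetical.

```java
public class GuardedDoubleCharRule {

    // Hypothetical helper: drops the final character only when the last two
    // characters are identical AND that character is a letter, so numeric
    // tokens are left untouched.
    static int stripFinalDouble(char[] s, int len) {
        if (len >= 2
                && s[len - 1] == s[len - 2]
                && Character.isLetter(s[len - 1])) {
            return len - 1;
        }
        return len;
    }

    // Convenience wrapper that works on strings instead of char buffers.
    static String apply(String token) {
        char[] chars = token.toCharArray();
        return new String(chars, 0, stripFinalDouble(chars, chars.length));
    }

    public static void main(String[] args) {
        System.out.println(apply("123455")); // digits: token is kept as-is
        System.out.println(apply("abcdd"));  // letters: final 'd' is dropped
    }
}
```

The only difference from an unguarded rule is the extra `Character.isLetter` condition, so tokens ending in doubled letters are stemmed as before.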

Tomoko

Re: Question about the light and minimal French stemmers [ In reply to ]

msokolov at gmail

Jul 27, 2019, 12:55 PM

Post #5 of 8 (1258 views)

I'm not so sure. I think the whole idea of having both stemmers is that the
minimal one does less than the light one.

Removing the final character of a double letter suffix is going to
sacrifice some precision. For example mes/mess, ne/née, I'm sure there are
others.

So having both options is helpful; I don't think it's a bug on the face of
it. However, I didn't look closely at the code, so I'm not sure exactly what
the intent is.

Re: Question about the light and minimal French stemmers [ In reply to ]

tomoko.uchida.1111 at gmail

Jul 27, 2019, 7:35 PM

Post #6 of 8 (1258 views)

Let me clarify.
I think the concern here is that FrenchMinimalStemmer removes the last
digit from a token because it does not check whether the character is
a letter.
For example, "123455" is trimmed to "12345" by FrenchMinimalStemmer.

To me, this behaviour is beyond stemming.

Tomoko

Re: Question about the light and minimal French stemmers [ In reply to ]

msokolov at gmail

Jul 28, 2019, 4:51 AM

Post #7 of 8 (1258 views)

Oh sorry for jumping in with my irrelevant comment, you are right, of
course!

Re: Question about the light and minimal French stemmers [ In reply to ]

adriengallou at gmail

Jul 28, 2019, 12:50 PM

Post #8 of 8 (1251 views)

Hi Tomoko,

Thanks for your answer.

Following this discussion, I have opened an issue with a patch attached:
https://issues.apache.org/jira/browse/LUCENE-8937

Adrien
