Mailing List Archive

A question on PhraseQuery and slop
Hello.


The Javadoc for PhraseQuery.getSlop
<https://lucene.apache.org/core/8_0_0/core/org/apache/lucene/search/PhraseQuery.html#getSlop-->
says that "the fox is quick" is at an edit distance of 3 from the query
"quick fox"; this seems inaccurate to me.

I don't know if the edit distance used by Lucene is the Levenshtein
distance (insertion, deletion, substitution, all of weight 1) - a standard
in information retrieval - but a test of "quick fox" PhraseQuery with a
slop of 2 hits the text "the fox is quick" (1 deletion + 1 insertion); the
slop does not have to be 3.

I wonder if I'm right.


Claude Lepère, Belgium

claudelepere@gmail.com

Re: A question on PhraseQuery and slop
Hello Claude,

Hmm, that is interesting that you see slop=2 matching query "quick fox"
against document "the fox is quick".

Edit distance (Levenshtein) is a bit tricky because a transposition (just
swapping the two words) might count as edit distance 1 OR 2, depending on
the variant used.

So maybe Lucene's PhraseQuery is counting transposition as edit distance 1,
in which case, your test makes sense, and the javadocs are wrong?
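
To make the "1 OR 2" point concrete, here is a minimal sketch of a word-level edit distance, written for illustration only. The function name `edit_distance` and the `allow_transposition` flag are hypothetical, and this is not how Lucene's PhraseQuery computes slop. Plain Levenshtein has no swap operation, so reversing two adjacent words costs 2; the optimal string alignment variant adds a swap of cost 1.

```python
def edit_distance(a, b, allow_transposition=False):
    """Word-level edit distance between token lists a and b.

    allow_transposition=False: plain Levenshtein (insert, delete, substitute).
    allow_transposition=True: optimal string alignment variant, where
    swapping two adjacent tokens costs 1.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            if (allow_transposition and i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # adjacent swap
    return d[-1][-1]


query = ["quick", "fox"]
swapped = ["fox", "quick"]
print(edit_distance(query, swapped))                            # 2
print(edit_distance(query, swapped, allow_transposition=True))  # 1
```

Which of the two conventions a given search library follows for phrase slop is a separate question that only its documentation or source can answer.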

I am far from an expert on PhraseQuery :) Does anyone know if we changed
the behavior? In any case, we must at least fix the javadocs. Claude,
maybe open a Jira issue
(https://issues.apache.org/jira/projects/LUCENE/summary) and we can
discuss there?

Thank you for catching this!

Mike McCandless

http://blog.mikemccandless.com


On Fri, Dec 10, 2021 at 8:47 AM Claude Lepere <claudelepere@gmail.com>
wrote:

> [...] a test of "quick fox" PhraseQuery with a
> slop of 2 hits the text "the fox is quick" (1 deletion + 1 insertion); the
> slop does not have to be 3.

Re: A question on PhraseQuery and slop
I wonder if the analysis chain could be involved. If stop words such as
"is" are removed without leaving a hole (a position gap), then that could
explain it?
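
A rough sketch of how that could play out, under a simplified model that is an assumption here and not Lucene's actual matcher: the slop a match needs is taken to be the spread of (document position minus query position) over the query terms, each of which occurs exactly once in the document. The stop word list and the helper names `positions` and `required_slop` are hypothetical.

```python
STOP_WORDS = {"the", "is"}  # hypothetical stop word list, for illustration only


def positions(text, keep_holes):
    """Map each surviving token to its position after stop word removal.

    keep_holes=True: a removed stop word still consumes a position.
    keep_holes=False: surviving tokens are packed together with no gaps.
    """
    result = {}
    pos = 0
    for token in text.lower().split():
        if token in STOP_WORDS:
            if keep_holes:
                pos += 1  # leave a hole where the stop word was
            continue
        result[token] = pos
        pos += 1
    return result


def required_slop(query_terms, doc_positions):
    """Spread of (document position - query position) over the query terms.

    Simplified model: assumes every query term occurs exactly once.
    """
    deltas = [doc_positions[term] - i for i, term in enumerate(query_terms)]
    return max(deltas) - min(deltas)


query = ["quick", "fox"]
doc = "the fox is quick"
print(required_slop(query, positions(doc, keep_holes=True)))   # 3
print(required_slop(query, positions(doc, keep_holes=False)))  # 2
```

In this model, keeping the holes gives the 3 quoted from the javadocs, while dropping them leaves "fox quick" as two adjacent, reversed terms that need only 2, which would be consistent with the slop=2 hit reported at the start of the thread.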

On Mon, Dec 13, 2021 at 9:35 AM Michael McCandless
<lucene@mikemccandless.com> wrote:
>
> Hmm, that is interesting that you see slop=2 matching query "quick fox"
> against document "the fox is quick".
