Mailing List Archive

IntervalQuery replacement for SpanFirstQuery? Closest replacement for slops?
Hi all, hi Alan,

I am currently rewriting some SpanQuery code to use IntervalQuery. Most of the transformations are straightforward, and the code is easier to read afterwards. What I am missing is a document that compares the different query types and a guide on how to convert between them.

I did not find a replacement for SpanFirstQuery (or at least any query that takes absolute positions). I know intervals deal more with term intervals, but I was successful in replacing a SpanFirstQuery with this:
IntervalsSource term = Intervals.term("foo");
// Accept only intervals that end within the first 'distance' positions of the field.
IntervalsSource filtered = new FilteredIntervalsSource("FIRST" + distance, term) {
  @Override
  protected boolean accept(IntervalIterator it) {
    return it.end() < distance; // or should this be <= distance???
  }
};
Query query = new IntervalQuery(field, filtered);

I am not fully sure if this works under all circumstances. To me it looks fine, and it also worked with more complex intervals than "term". If this is OK, how about adding a "first(int n, IntervalsSource iv)" method to the Intervals class?

The second question: What's the "closest" replacement for a PhraseQuery with slop? Should I use maxwidth(slop + 1), maxgaps(slop - 1), or maxgaps(slop)? I know SpanQuery slops cannot be fully replaced with intervals, but I don't care about those SpanQuery bugs.

Uwe

-----
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: uwe@thetaphi.de



---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: IntervalQuery replacement for SpanFirstQuery? Closest replacement for slops?
For what it is worth, I would also be interested in answers to these questions. ;)

On Mon, Sep 21, 2020, 19:08 Uwe Schindler <uwe@thetaphi.de> wrote:

> [...]

Re: IntervalQuery replacement for SpanFirstQuery? Closest replacement for slops?
Your filtered query should work the same as a SpanFirst, yes. I didn’t add a shortcut just because you can do it this way, but feel free to add it if you think it’s useful!
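On the open question in the code comment (`<` versus `<=`): the following is a minimal plain-Java sketch, not Lucene code, under two assumptions that should be checked against the Javadocs of the Lucene version in use: that an interval's end() is the inclusive position of its last term, and that SpanFirstQuery accepts a span whose exclusive end position is at most the configured end. The class and method names are hypothetical. Under those assumptions, `it.end() < distance` is the matching condition.

```java
/** Plain-Java model of the two position checks; it does not call Lucene. */
final class FirstPositionCheck {

    /** Interval-style test: inclusiveEnd is the position of the last matched term. */
    static boolean acceptInterval(int inclusiveEnd, int distance) {
        return inclusiveEnd < distance;
    }

    /** Span-style test (assumed): endExclusive is one past the last matched term. */
    static boolean acceptSpan(int endExclusive, int distance) {
        return endExclusive <= distance;
    }

    public static void main(String[] args) {
        int distance = 3;
        for (int pos = 0; pos < 5; pos++) {
            // A single term at 'pos' has inclusive end 'pos' and exclusive end 'pos + 1'.
            System.out.println("pos=" + pos
                + " interval=" + acceptInterval(pos, distance)
                + " span=" + acceptSpan(pos + 1, distance));
        }
    }
}
```

With `<=` instead of `<` in the interval check, a term at position `distance` would also be accepted, which is one position further than the span-style check allows in this model.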

Re sloppy phrases, this one is trickier. The closest you can get at the moment is an unordered near, but that’s not the same thing as it doesn’t take transpositions into account when calculating the slop. I think it should be possible to write something that works similarly to SloppyPhraseMatcher, but as always the tricky part is in dealing with duplicate entries. I have some ideas but they’re not ready to commit yet, unfortunately.

In terms of your suggested replacements: maxwidth will give you the equivalent of a SpanNearUnordered. Maxgaps gives a restriction on how many internal holes there are in the query, so works better if the constituent intervals are not necessarily single terms.
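To make the difference between the two restrictions concrete, here is a small plain-Java model (it does not call Lucene; the class, method names, and positions are made up for illustration). It treats each constituent as an inclusive {start, end} pair, takes the width as the inclusive size of the covering span, and takes the gaps as the positions inside that span that no constituent occupies, assuming the constituents do not overlap.

```java
/** Plain-Java model of interval width and internal gaps; it does not call Lucene. */
final class WidthAndGaps {

    /** Inclusive width of the span covering all sub-intervals, each given as {start, end}. */
    static int width(int[][] subIntervals) {
        int min = Integer.MAX_VALUE;
        int max = Integer.MIN_VALUE;
        for (int[] iv : subIntervals) {
            min = Math.min(min, iv[0]);
            max = Math.max(max, iv[1]);
        }
        return max - min + 1;
    }

    /** Positions inside the covering span that no sub-interval occupies (assumes no overlaps). */
    static int gaps(int[][] subIntervals) {
        int covered = 0;
        for (int[] iv : subIntervals) {
            covered += iv[1] - iv[0] + 1;
        }
        return width(subIntervals) - covered;
    }

    public static void main(String[] args) {
        int[][] twoTerms = {{0, 0}, {2, 2}};      // two single terms with one hole between them
        int[][] phraseAndTerm = {{4, 5}, {7, 7}}; // a two-word phrase, one hole, then a term
        System.out.println(width(twoTerms) + " " + gaps(twoTerms));           // 3 1
        System.out.println(width(phraseAndTerm) + " " + gaps(phraseAndTerm)); // 4 1
    }
}
```

Both matches have a single internal hole, but their widths differ because one constituent is two positions wide. A width limit therefore has to be adjusted to the size of the constituents, while a gap limit does not.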

> On 21 Sep 2020, at 18:47, Dawid Weiss <dawid.weiss@gmail.com> wrote:
>
> [...]

RE: IntervalQuery replacement for SpanFirstQuery? Closest replacement for slops?
Hi Alan,

This was all very helpful. Another point about migrating from SpanQuery to IntervalQuery: IntervalQuery only returns a score between 0 and 1 and does not take term statistics into account. To bring term scoring back in, one should combine it with some term queries (which is perfectly fine, as it decouples term scoring from term positions and allows more flexibility).

My question now (and maybe this should be documented in some MIGRATE.txt or the Javadocs): How to best combine the scores from TermQuery and IntervalQuery to get a scoring *similar* (not identical) to the good old SpanQueries? I tried to read the SpanQuery scoring mechanisms but gave up because I did not figure out where the final score of the terms is combined with the span score.

My first idea was to create a BooleanQuery with the IntervalQuery as a MUST clause and all terms appearing somewhere in the (positive) intervals added as SHOULD clauses. My problem now is that the number of terms differs from query to query, but the IntervalQuery only adds 0..1 to the total score. So should one use a BoostQuery around the IntervalQuery that boosts by the number of terms added as sibling SHOULD clauses? Other suggestions?

Uwe


> -----Original Message-----
> From: Alan Woodward <alan.woodward@romseysoftware.co.uk>
> Sent: Monday, September 21, 2020 7:56 PM
> To: Dawid Weiss <dawid.weiss@gmail.com>
> Cc: Lucene Users <java-user@lucene.apache.org>
> Subject: Re: IntervalQuery replacement for SpanFirstQuery? Closest
> replacement for slops?
>
> [...]



Re: IntervalQuery replacement for SpanFirstQuery? Closest replacement for slops?
Scoring is tricky, yes. Spans score by treating all the terms as if they were a phrase, which for example BM25Similarity handles by summing the idf of the individual terms to get a total idf, and then using that as a pseudo-idf for the whole phrase. The problem here is that if you have a disjunction in there somewhere - for example, NEAR(a, OR(b, c)) - then documents that don’t contain term ‘c’ at all are still scored as if ‘c’ was in there, so you don’t get any boost if your document contains infrequent terms. For this reason I’m not very keen on the idea of trying to reproduce Span scoring, as it’s pretty fundamentally broken.
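The disjunction problem can be sketched in plain Java, without calling Lucene. The idf formula below is the usual BM25 form, ln(1 + (N - n + 0.5) / (n + 0.5)); the class name, the document frequencies, and the collection size are made-up illustration values, not measurements.

```java
/** Plain-Java sketch of summed-idf scoring for NEAR(a, OR(b, c)); it does not call Lucene. */
final class PhraseIdfSketch {

    /** BM25-style idf for a term occurring in docFreq of docCount documents. */
    static double idf(long docFreq, long docCount) {
        return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    /** Pseudo-idf of a phrase-like query: the sum of the idfs of the given terms. */
    static double summedIdf(long docCount, long... docFreqs) {
        double sum = 0;
        for (long df : docFreqs) {
            sum += idf(df, docCount);
        }
        return sum;
    }

    public static void main(String[] args) {
        long docCount = 1_000_000;                   // hypothetical collection size
        long dfA = 50_000, dfB = 200_000, dfC = 10;  // hypothetical: 'c' is the rare term

        // Span-style scoring uses one weight for every match, whichever branch matched.
        System.out.printf("pseudo-idf over a, b, c: %.3f%n", summedIdf(docCount, dfA, dfB, dfC));

        // What the weights would be if only the terms actually present were counted.
        System.out.printf("match via b: %.3f%n", summedIdf(docCount, dfA, dfB));
        System.out.printf("match via c: %.3f%n", summedIdf(docCount, dfA, dfC));
    }
}
```

In this model a document matching through the common term 'b' and one matching through the rare term 'c' receive the same pseudo-idf, so the rarity of 'c' gives the second document no advantage.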

Coupling an IntervalQuery with a set of boolean disjunctions should get you the best result, I think. When I looked a couple of years ago I couldn’t find any particularly useful papers on how to combine BM25 and proximity scores, but something like using (1 + proximity score) as a multiplicative boost would work? You can do this now using FunctionQuery but perhaps we should add functionality to IntervalQuery to make this work automatically.
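The two combination strategies discussed in this thread can be compared with a few lines of arithmetic. This is a plain-Java sketch that does not call Lucene; the class name and the term scores are hypothetical, and the proximity score is assumed to lie between 0 and 1 as described above.

```java
/** Plain-Java comparison of two ways to combine term scores with a proximity score. */
final class ProximityBoostSketch {

    static double sum(double[] values) {
        double total = 0;
        for (double v : values) {
            total += v;
        }
        return total;
    }

    /** Additive: term scores plus the proximity score scaled by a boost (e.g. the number of terms). */
    static double additive(double[] termScores, double proximity, double boost) {
        return sum(termScores) + boost * proximity;
    }

    /** Multiplicative: the summed term scores multiplied by (1 + proximity). */
    static double multiplicative(double[] termScores, double proximity) {
        return sum(termScores) * (1.0 + proximity);
    }

    public static void main(String[] args) {
        double[] termScores = {2.1, 0.7, 1.4}; // hypothetical per-term scores
        double proximity = 0.5;                // hypothetical interval score in [0, 1]
        System.out.println("additive:       " + additive(termScores, proximity, termScores.length));
        System.out.println("multiplicative: " + multiplicative(termScores, proximity));
    }
}
```

One property of the multiplicative form is that the proximity contribution scales with the term scores themselves, so it needs no per-query boost tied to the number of terms; whether that produces better rankings would have to be evaluated on real data.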

> On 8 Oct 2021, at 11:42, Uwe Schindler <uwe@thetaphi.de> wrote:
>
> [...]