Mailing List Archive

Inconsistent query results in Lucene 8.1.0
Hello,

I’m Fiona with Basis Technology. We’re investigating what we believe to be
a bug involving inconsistent query results. We bisected the issue and found
that it first appears with the flattening of nested disjunctions, introduced
by the merge of LUCENE-7386
<https://issues.apache.org/jira/browse/LUCENE-7386>. To reproduce the issue,
I have attached a Lucene index built with Lucene 8.1.0 (names_index.tar.gz)
and a Java class (LuceneSearchIndex.java). If you run the class multiple
times against Lucene 8.0.0, max_score is the same on every run. If you run
it against Lucene 8.1.0, max_score varies between runs (within about 10 runs
you should see it return sometimes 1.8651859 and sometimes 2.1415303).

From debugging in Lucene 8.1.0, the query against the name index before
its nested disjunctions are flattened looks like this:

(((bt_rni_name_encoded_1:ALFR)^0.75 bt_rni_name_encoded_1:ALTR
(bt_rni_name_encoded_1:ANTR)^0.75
(bt_rni_name_encoded_1:LTR)^0.6666666)
((bt_rni_name_encoded_1:ALTR)^0.75 (bt_rni_name_encoded_1:FLTMR)^0.75
(bt_rni_name_encoded_1:FLTRN)^0.75 (bt_rni_name_encoded_1:FLTS)^0.75
(bt_rni_name_encoded_1:FTR)^0.6666666
(bt_rni_name_encoded_1:LTR)^0.6666666)) |
(((bt_rni_name_encoded_2:FLTR)^0.75) (bt_rni_name_encoded_2:FLTR
(bt_rni_name_encoded_2:FLTRN)^0.75))


The term causing the difference in the final score is
bt_rni_name_encoded_1:ALTR. As the query above shows, it appears twice,
nested under different clauses: in the first clause its docFreq is 3, and in
the second clause its docFreq is 2. This happens in Lucene 8.0.0 as well; *is
a term being read with different docFreq values expected behaviour?*
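
To illustrate why the docFreq value matters for the final score, here is a small sketch of the idf component of BM25. It assumes the default BM25Similarity is in use (the message does not say which similarity applies), and the docCount of 1000 is a hypothetical value chosen only for illustration; only the docFreq values 2 and 3 come from the observation above.

```java
public class IdfSketch {
    // BM25-style idf: ln(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)).
    // A smaller docFreq yields a larger idf, and therefore a larger term score.
    public static double idf(long docFreq, long docCount) {
        return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    public static void main(String[] args) {
        long docCount = 1000; // hypothetical number of documents with the field
        System.out.println("idf(df=2) = " + idf(2, docCount));
        System.out.println("idf(df=3) = " + idf(3, docCount));
    }
}
```

With everything else equal, whichever docFreq ends up attached to the term changes its idf, which would be enough to move max_score between two values.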

After the nested disjunctions are flattened (part of the query rewrite
process), the query looks like this:

((bt_rni_name_encoded_1:FTR)^0.6666666
(bt_rni_name_encoded_1:FLTRN)^0.75 (bt_rni_name_encoded_1:FLTMR)^0.75
(bt_rni_name_encoded_1:ALFR)^0.75 (bt_rni_name_encoded_1:FLTS)^0.75
(bt_rni_name_encoded_1:ANTR)^0.75
(bt_rni_name_encoded_1:LTR)^1.3333333
(bt_rni_name_encoded_1:ALTR)^1.75) |
((bt_rni_name_encoded_2:FLTRN)^0.75 (bt_rni_name_encoded_2:FLTR)^1.75)


As you can see, bt_rni_name_encoded_1:ALTR now appears only once, and its
boost is the sum of its boosts in the original query. This is the version of
the query that actually gets executed. Here the docFreq for the
bt_rni_name_encoded_1:ALTR term is sometimes 3 and sometimes 2 between runs,
and the final score changes accordingly. *Is
this "coin toss" pick of docFreq for the same term expected behaviour?*

It looks like the issue stems from one of the two behaviours highlighted in
bold above.
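
The summed boosts can be checked with a little arithmetic. The sketch below is not Lucene code; it only merges duplicate clauses by adding their boosts, using the two overlapping terms from the un-flattened query (an unboosted clause is taken to have boost 1.0). The class and method names are made up for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class BoostMergeSketch {
    // Merge duplicate clauses by summing their boosts, keeping first-seen order.
    public static Map<String, Float> mergeBoosts(String[] terms, float[] boosts) {
        Map<String, Float> merged = new LinkedHashMap<>();
        for (int i = 0; i < terms.length; i++) {
            merged.merge(terms[i], boosts[i], Float::sum);
        }
        return merged;
    }

    public static void main(String[] args) {
        // ALTR: unboosted (1.0) in the first clause, ^0.75 in the second.
        // LTR: ^0.6666666 in both clauses.
        String[] terms = {"ALTR", "LTR", "ALTR", "LTR"};
        float[] boosts = {1.0f, 0.6666666f, 0.75f, 0.6666666f};
        System.out.println(mergeBoosts(terms, boosts));
    }
}
```

This matches the ^1.75 on ALTR and the roughly ^1.3333333 on LTR in the flattened query, so the boosts themselves are merged consistently; only the docFreq differs between runs.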

Looking forward to hearing back from you.
Re: Inconsistent query results in Lucene 8.1.0
Hi all,

I looked into this today. I can reproduce it and I believe it's a bug.
This is caused by the following working together:
- LUCENE-7386 <https://issues.apache.org/jira/browse/LUCENE-7386> Flatten
nested disjunctions
- LUCENE-7925 <https://issues.apache.org/jira/browse/LUCENE-7925>
Deduplicate SHOULD and MUST clauses in BooleanQuery

Blended term queries modify the df/ttf of their terms to make sure all
terms produce identical scores. In this case, two blended term queries
contain a few terms each, only some of which overlap. The two queries
calculate different df/ttf for their terms respectively, since the two sets
are different. During the rewrite process,

1. the two Blended queries get rewritten as Boolean queries themselves,
with each (modified) TermQuery as a SHOULD clause
2. the nested Boolean queries get flattened, since they are nested
disjunctions
3. the Term queries (some of which are actually Boost queries) are
deduplicated, with one of the two TermQuery instances, along with its
modified TermStates, being picked at random (the randomness is due to the
HashSet underlying Lucene's MultiSet).
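
The three steps above can be modelled in plain Java. This is a simplified sketch, not Lucene's implementation: ModelTermQuery is a hypothetical stand-in for a TermQuery carrying statistics, and it assumes that equality considers only the term and ignores the attached statistics, which is what the behaviour described here suggests. In the model, the clause order is passed in explicitly; in the real case the order comes from a hash-based collection and may differ between runs.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DedupSketch {
    // Hypothetical stand-in for a term query with query-specific statistics.
    public static final class ModelTermQuery {
        final String term;
        final int docFreq; // df as adjusted by the enclosing blended query

        public ModelTermQuery(String term, int docFreq) {
            this.term = term;
            this.docFreq = docFreq;
        }

        // Assumption: equality looks at the term only, not at docFreq.
        @Override
        public boolean equals(Object o) {
            return o instanceof ModelTermQuery && term.equals(((ModelTermQuery) o).term);
        }

        @Override
        public int hashCode() {
            return term.hashCode();
        }
    }

    // Deduplicate equal clauses and report which docFreq survives per term.
    // The first instance encountered is the one that is kept as the map key.
    public static Map<String, Integer> dedup(List<ModelTermQuery> clauses) {
        Map<ModelTermQuery, Integer> counts = new LinkedHashMap<>();
        for (ModelTermQuery q : clauses) {
            counts.merge(q, 1, Integer::sum);
        }
        Map<String, Integer> retained = new LinkedHashMap<>();
        for (ModelTermQuery q : counts.keySet()) {
            retained.put(q.term, q.docFreq);
        }
        return retained;
    }

    public static void main(String[] args) {
        ModelTermQuery fromFirst = new ModelTermQuery("ALTR", 3);
        ModelTermQuery fromSecond = new ModelTermQuery("ALTR", 2);
        System.out.println(dedup(List.of(fromFirst, fromSecond)));
        System.out.println(dedup(List.of(fromSecond, fromFirst)));
    }
}
```

In this model the surviving docFreq depends only on which of the two equal instances is seen first, so any run-to-run variation in clause order translates directly into a different score.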

I haven't managed to create a failing test yet; I'll share it when I have
one ready.
If anybody has suggestions or pointers on how this should be fixed, I'm
also happy to provide a patch. I'm just not sure what the right
thing to do would be here: I have a feeling (2.) should not happen for
(rewritten) Blended Queries?

Cheers,
Michele


Re: Inconsistent query results in Lucene 8.1.0
So - I think you should open an issue. Can you determine whether
flattening on its own would result in a bug? If not, then perhaps
focus on the merging (deduplication) and whether it properly respects
boosting?


---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
Re: Inconsistent query results in Lucene 8.1.0
> the two Blended queries get rewritten as Boolean queries themselves, with each (modified) TermQuery as a SHOULD clause
> the nested Boolean queries get flattened, since they are nested disjunctions
> the Term queries (some of which are actually Boost queries) are deduplicated, with one of the two TermQuery and its modified TermStates being picked at random (the randomness is due to the HashSet underlying Lucene's MultiSet).

This seems a bit worrisome in itself -- the data structure supporting
the implementation should not affect the selection.

--
Regards,

Atri
Apache Concerted

RE: Inconsistent query results in Lucene 8.1.0
We recently upgraded our Drupal 8 sites to Solr 8.3.1. We are now getting reports of certain patterns of search terms resulting in an error that reads, “The website encountered an unexpected error. Please try again later.”

Below is a list of example terms that always result in this error and a similar list that works fine. The problem pattern seems to be a search term that contains 2 or 3 characters followed by a space, followed by additional text.

To confirm that the problem lies with Solr 8, I updated our local and UAT sites with the latest Drupal updates (which included an update to the Search API Solr module) and tested the terms below under Solr 7.7.2, 8.3.1, and 8.4.1. Under version 7.7.2 everything works fine. Under either of the 8.x versions, the problem returns.

Thoughts?

Search terms that result in error

• w-2 agency directory
• agency w-2 directory
• w-2 agency
• w-2 directory
• w2 agency directory
• w2 agency
• w2 directory

Search terms that do not result in error

• w-22 agency directory
• agency directory w-2
• agency w-2directory
• agencyw-2 directory
• w-2
• w2
• agency directory
• agency
• directory
• -2 agency directory
• 2 agency directory
• w-2agency directory
• w2agency directory


Re: Inconsistent query results in Lucene 8.1.0
Hi Phil,

Please start a new thread (email) for a new problem instead of replying to
an existing one. The behavior discussed in this thread does not result in an
error, whereas yours does, so I think the two are entirely dissimilar. Also,
you'll need to dig deeper to learn what the particular error was and report
that; check Solr's logs.

~ David Smiley
Apache Lucene/Solr Search Developer
http://www.linkedin.com/in/davidwsmiley


Re: Inconsistent query results in Lucene 8.1.0
Fiona - I opened a ticket
<https://issues.apache.org/jira/browse/LUCENE-9269> for this. You can find
some recommendations there that might help you fix your issue.
Re: Inconsistent query results in Lucene 8.1.0
Thanks for digging into this issue, Michele.



--
Adrien
RE: Re: Inconsistent query results in Lucene 8.1.0
Hello,

I'm Nicholas with Basis Technology. We have done more work investigating
the bug we reported in March 2020, LUCENE-9269
<https://issues.apache.org/jira/browse/LUCENE-9269>.

We have verified that this bug is present in two more recent versions of
Lucene: 8.11.1 and 9.0.0. Since it still exists in these newer versions, we
infer that it has not been fixed since 8.1. Because this issue prevents us
from getting consistent scores between runs, we are stuck on Lucene 7.

Would you mind giving us an update on the progress of this issue, please?

Thank you,
Nicholas Selvitelli

