Mailing List Archive

PointInSetQuery does not terminate early if DocIdSetBuilder's bitSet is null
Hi Lucene developers,

In Lucene 7.7.0, I found that in `PointInSetQuery.createWeight()`, in the method `scorer()` after `values.intersect()`, if `result.bitSet` is null, then `result.build()` uses `concat()` to generate a Buffer of length 1 whose only element is `NO_MORE_DOCS`.

Why not return null in the method `scorer()` if `result.bitSet` is null?

Consider the following case:

SubQuery1 AND SubQuery2 AND SubQuery3 ...

BooleanWeight.scorerSupplier() ->

the first sub-scorer of the query is PointInSetQuery ->

scorer() ->

after intersect, if result.bitSet is null, no point value matched at all, so return null ->

This would terminate early in BooleanWeight.scorerSupplier(), because the sub-scorer is null and the boolean clause is required.

The following SubQuery2, SubQuery3, SubQuery4 ... would then not need to call scorerSupplier() to build a scorer.
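
The flow above can be sketched with a toy model. This is only an illustration under stated assumptions: `RequiredClauseSketch`, `buildConjunction`, and the use of `int[]` as a stand-in for a scorer are all hypothetical names, not Lucene classes or methods.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Toy model only: these types are hypothetical stand-ins, not Lucene classes.
final class RequiredClauseSketch {

    /** Counts how many clauses were asked to build a scorer. */
    static int built = 0;

    /**
     * Builds "scorers" for required (AND) clauses in order.
     * Returns null as soon as one clause reports "no matches" by returning
     * null, so later clauses are never asked to build anything.
     */
    static List<int[]> buildConjunction(List<Supplier<int[]>> clauses) {
        List<int[]> scorers = new ArrayList<>();
        for (Supplier<int[]> clause : clauses) {
            built++;
            int[] docs = clause.get();
            if (docs == null) {
                return null; // a required clause cannot match: stop here
            }
            scorers.add(docs);
        }
        return scorers;
    }

    public static void main(String[] args) {
        Supplier<int[]> idMatch = () -> null;                 // matched no points
        Supplier<int[]> rangeMatch = () -> new int[] {1, 2};
        Supplier<int[]> stringMatch = () -> new int[] {2, 3};

        List<int[]> result = buildConjunction(List.of(idMatch, rangeMatch, stringMatch));
        // prints "result=null, clauses visited=1"
        System.out.println("result=" + result + ", clauses visited=" + built);
    }
}
```

In this model only the first clause is visited when it returns null, which is the saving being asked for.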


hacker win7
hackerswin7@gmail.com
Re: PointInSetQuery does not terminate early if DocIdSetBuilder's bitSet is null
What are you storing in your points? If you are storing numbers, I wonder
if a better approach to this problem might be to start leveraging
IndexOrDocValuesQuery and scorerSupplier() for point-in-set queries like we
did for range queries.

The approach you suggested would help in some cases, but I'm a bit unhappy
that it would be quite fragile, e.g. SubQuery1 AND SubQuery2 might become
faster as we could save evaluating matches of SubQuery2 but SubQuery2 AND
SubQuery1 would still be slow.
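
The order dependence can be illustrated with a toy model. It is a sketch only: `ClauseOrderSketch` and `clausesEvaluated` are hypothetical names, and the suppliers stand in for sub-queries rather than using any Lucene API.

```java
import java.util.List;
import java.util.function.Supplier;

// Toy model only: hypothetical stand-ins, not Lucene classes.
final class ClauseOrderSketch {

    /** Returns how many clauses were evaluated before a null stopped the loop. */
    static int clausesEvaluated(List<Supplier<int[]>> requiredClauses) {
        int evaluated = 0;
        for (Supplier<int[]> clause : requiredClauses) {
            evaluated++;
            if (clause.get() == null) {
                break; // no matches for a required clause
            }
        }
        return evaluated;
    }

    public static void main(String[] args) {
        Supplier<int[]> subQuery1 = () -> null;                // empty point-in-set
        Supplier<int[]> subQuery2 = () -> new int[] {1, 2, 3}; // expensive in practice

        System.out.println(clausesEvaluated(List.of(subQuery1, subQuery2))); // 1
        System.out.println(clausesEvaluated(List.of(subQuery2, subQuery1))); // 2
    }
}
```

With the empty clause first, one clause is evaluated; with it second, the other clause has already been evaluated before the null is seen.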

On Mon, Sep 28, 2020 at 5:02 AM hacker win7 <hackerswin7@gmail.com> wrote:

> Why not return null in the method `scorer()` if `result.bitSet` is null?

--
Adrien
Re: PointInSetQuery does not terminate early if DocIdSetBuilder's bitSet is null
Thanks Adrien Grand

We store long numbers in our points. In our search service, most search requests look like this:

id-match AND range-match AND string-match AND ... (the clause count is high)

The id is a long number, and we find that for most searches the id-match query matches no points at all, yet the subsequent sub-queries are still evaluated. This adds extra cost to searches that ought to terminate early at the id-match sub-query.

As a result, most of our search requests spend extra time before responding.

We also found that for string matches, TermWeight.scorer() calls getTermsEnum() before returning a scorer. getTermsEnum() calls termsEnum.seekExact(term.bytes()) to check whether the value exists; if it does not, getTermsEnum() returns null and TermWeight.scorer() returns null in turn.

This confuses me: for a string match, TermQuery's scorer() returns null if the value does not exist. For an id match, however, PointInSetQuery's scorer() does not return null if the value does not exist, because DocIdSetBuilder.build() builds an array of length 1 containing only NO_MORE_DOCS.

For background, I checked the earliest version of DocIdSetBuilder, from LUCENE-5938, where build() could return null; as its comment says: “This method may return <tt>null</tt> if no documents were added to this”

I’m not sure why this changed.

hacker win7
hackerswin7@gmail.com



> On Sep 28, 2020, at 21:06, Adrien Grand <jpountz@gmail.com> wrote:
>
> What are you storing in your points? If you are storing numbers, I wonder if a better approach to this problem might be to start leveraging IndexOrDocValuesQuery and scorerSupplier() for point-in-set queries like we did for range queries.
>
> The approach you suggested would help in some cases, but I'm a bit unhappy that it would be quite fragile, e.g. SubQuery1 AND SubQuery2 might become faster as we could save evaluating matches of SubQuery2 but SubQuery2 AND SubQuery1 would still be slow.
Re: PointInSetQuery does not terminate early if DocIdSetBuilder's bitSet is null
Is there any reason why PointInSetQuery returns NO_MORE_DOCS instead of null, given that null can terminate early?


hacker win7
hackerswin7@gmail.com



> On Sep 29, 2020, at 11:59, hacker win7 <hackerswin7@gmail.com> wrote:
>
> For background, I checked the earliest version of DocIdSetBuilder, from LUCENE-5938, where build() could return null; as its comment says: “This method may return <tt>null</tt> if no documents were added to this”
>
> I’m not sure why this changed.
Re: PointInSetQuery does not terminate early if DocIdSetBuilder's bitSet is null
Hi, Adrien Grand

Could we check the result after BKD.intersect() inside PointInSetQuery.scorer(), to see whether the first doc returned by result.build().iterator() is already NO_MORE_DOCS? If it is, we need not build a scorer with new ConstantScoreScorer(); we can just return null. This way, a BooleanQuery with AND clauses can terminate early, and the subsequent clauses need not be evaluated to build scorers, which wastes compute resources.

Terminating earlier in PointInSetQuery would help workloads where a number match is the leading clause of the boolean query and a high percentage of requests have no hits for that number match.
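
A minimal sketch of that check, using a toy iterator rather than Lucene's classes. `EmptyCheckSketch`, `ToyIterator`, and `scorerOrNull` are hypothetical names introduced for illustration; only the sentinel value `Integer.MAX_VALUE` mirrors Lucene's `DocIdSetIterator.NO_MORE_DOCS`.

```java
// Toy model only: hypothetical stand-ins, not Lucene's DocIdSetIterator or Scorer.
final class EmptyCheckSketch {

    // Same sentinel value Lucene uses for DocIdSetIterator.NO_MORE_DOCS.
    static final int NO_MORE_DOCS = Integer.MAX_VALUE;

    /** Minimal forward-only iterator over a sorted doc ID buffer. */
    static final class ToyIterator {
        private final int[] docs;
        private int pos = -1;

        ToyIterator(int[] docs) {
            this.docs = docs;
        }

        int nextDoc() {
            pos++;
            return pos < docs.length ? docs[pos] : NO_MORE_DOCS;
        }
    }

    /**
     * Returns null when the buffer yields no documents, otherwise a fresh
     * iterator. The probe iterator is discarded because nextDoc() has
     * already consumed its first element.
     */
    static ToyIterator scorerOrNull(int[] buffer) {
        ToyIterator probe = new ToyIterator(buffer);
        if (probe.nextDoc() == NO_MORE_DOCS) {
            return null;
        }
        return new ToyIterator(buffer);
    }

    public static void main(String[] args) {
        int[] empty = {NO_MORE_DOCS};       // the length-1 buffer described in this thread
        int[] some = {3, 7, NO_MORE_DOCS};

        System.out.println(scorerOrNull(empty) == null);  // true
        System.out.println(scorerOrNull(some).nextDoc()); // 3
    }
}
```

One point the sketch makes visible: probing with nextDoc() advances the iterator, so the caller has to hand out a fresh iterator (or otherwise avoid losing the first document) when the set is not empty.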


hacker win7
hackerswin7@gmail.com


