Mailing List Archive

Slow DV equivalent of TermInSetQuery
Hi Team,

I was discussing this problem with Greg Miller (also at Amazon Product
Search):

If I want to make a query that filters out a few primary keys (ASIN in our
Amazon Product Search world), I can make a TermInSetQuery and add it as a
MUST_NOT onto a BooleanQuery that has all the other interesting clauses for
my query.

But if I have many, many ASINs to filter out, at some point it may become
more efficient to just use doc values and filter them out like Solr's
"post-filter" / during collection, e.g. by loading the BINARY value or
SORTED (globalized) ordinal, and checking e.g. a HashSet to see if it
should be skipped. Not using the inverted index at all...

Do we already have such a "slow DV TermInSet" query?

It seems like it could belong in SortedDocValues where we already have
newSlowRangeQuery, newSlowExactQuery, we could add a newSlowInSetQuery?

And then we could make an IndexOrDocValuesQuery with both the
TermInSetQuery and this SDV.newSlowInSetQuery?

Or maybe there is already a good way to do this in Lucene?

Thanks!,

Mike McCandless

http://blog.mikemccandless.com
Re: Slow DV equivalent of TermInSetQuery [ In reply to ]
We have SortedSetDocValuesField.newSlowRangeQuery() which does something close to what you want here, I think.

> On 26 Oct 2021, at 15:23, Michael McCandless <lucene@mikemccandless.com <mailto:lucene@mikemccandless.com>> wrote:
>
> Hi Team,
>
> I was discussing this problem with Greg Miller (also at Amazon Product Search):
>
> If I want to make a query that filters out a few primary keys (ASIN in our Amazon Product Search world), I can make a TermInSetQuery and add it as a MUST_NOT onto a BooleanQuery that has all the other interesting clauses for my query.
>
> But if I have many, many ASINs to filter out, at some point it may become more efficient to just use doc values and filter them out like Solr's "post-filter" / during collection, e.g. by loading the BINARY value or SORTED (globalized) ordinal, and checking e.g. a HashSet to see if it should be skipped. Not using the inverted index at all...
>
> Do we already have such a "slow DV TermInSet" query?
>
> It seems like it could belong in SortedDocValues where we already have newSlowRangeQuery, newSlowExactQuery, we could add a newSlowInSetQuery?
>
> And then we could make an IndexOrDocValuesQuery with both the TermInSetQuery and this SDV.newSlowInSetQuery?
>
> Or maybe there is already a good way to do this in Lucene?
>
> Thanks!,
>
> Mike McCandless
>
> http://blog.mikemccandless.com <http://blog.mikemccandless.com/>
Re: Slow DV equivalent of TermInSetQuery [ In reply to ]
On Tue, Oct 26, 2021 at 10:58 AM Alan Woodward <romseygeek@gmail.com> wrote:
>
> We have SortedSetDocValuesField.newSlowRangeQuery() which does something close to what you want here, I think.
>

See also DocValuesRewriteMethod which might be useful, at least as a
start. You'd have to express the "SetQuery" as a MultiTermQuery for
that to work, but It would be more efficient than a disjunction of
slow-exact-queries.

e.g. for each segment it will first sequentially fill a bitset
corresponding to the ordinals matching the terms in your set.
Then when checking a single doc, it looks at document's ordinals to
see if one is in the bitset.

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
Re: Slow DV equivalent of TermInSetQuery [ In reply to ]
On Tue, Oct 26, 2021 at 11:24 AM Robert Muir <rcmuir@gmail.com> wrote:
>
> On Tue, Oct 26, 2021 at 10:58 AM Alan Woodward <romseygeek@gmail.com> wrote:
> >
> > We have SortedSetDocValuesField.newSlowRangeQuery() which does something close to what you want here, I think.
> >
>
> See also DocValuesRewriteMethod which might be useful, at least as a
> start. You'd have to express the "SetQuery" as a MultiTermQuery for
> that to work, but It would be more efficient than a disjunction of
> slow-exact-queries.

Maybe that's the issue here? If TermInSetQuery extended
MultiTermQuery, then this would be trivial, you wouldn't have to write
any code to use the DV ordinals instead of the terms+postings, you'd
just call .setRewriteMethod().

Could/should TermInSetQuery be refactored to extend multitermquery?

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
Re: Slow DV equivalent of TermInSetQuery [ In reply to ]
> And then we could make an IndexOrDocValuesQuery with both the
TermInSetQuery and this SDV.newSlowInSetQuery?

Unfortunately IndexOrDocValuesQuery relies on the fact that the "index"
query can evaluate its cost (ScorerSupplier#cost) without doing anything
costly, which isn't the case for TermInSetQuery.

So we'd need to make some changes. Estimating the cost of a TermInSetQuery
in general without seeking the terms is a hard problem, but maybe we could
specialize the unique key case to return the number of terms as the cost?

On Tue, Oct 26, 2021 at 5:37 PM Robert Muir <rcmuir@gmail.com> wrote:

> On Tue, Oct 26, 2021 at 11:24 AM Robert Muir <rcmuir@gmail.com> wrote:
> >
> > On Tue, Oct 26, 2021 at 10:58 AM Alan Woodward <romseygeek@gmail.com>
> wrote:
> > >
> > > We have SortedSetDocValuesField.newSlowRangeQuery() which does
> something close to what you want here, I think.
> > >
> >
> > See also DocValuesRewriteMethod which might be useful, at least as a
> > start. You'd have to express the "SetQuery" as a MultiTermQuery for
> > that to work, but It would be more efficient than a disjunction of
> > slow-exact-queries.
>
> Maybe that's the issue here? If TermInSetQuery extended
> MultiTermQuery, then this would be trivial, you wouldn't have to write
> any code to use the DV ordinals instead of the terms+postings, you'd
> just call .setRewriteMethod().
>
> Could/should TermInSetQuery be refactored to extend multitermquery?
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

--
Adrien
Re: Slow DV equivalent of TermInSetQuery [ In reply to ]
On Tue, Oct 26, 2021 at 1:37 PM Adrien Grand <jpountz@gmail.com> wrote:
>
> > And then we could make an IndexOrDocValuesQuery with both the TermInSetQuery and this SDV.newSlowInSetQuery?
>
> Unfortunately IndexOrDocValuesQuery relies on the fact that the "index" query can evaluate its cost (ScorerSupplier#cost) without doing anything costly, which isn't the case for TermInSetQuery.
>
> So we'd need to make some changes. Estimating the cost of a TermInSetQuery in general without seeking the terms is a hard problem, but maybe we could specialize the unique key case to return the number of terms as the cost?

Yes we know each term in terms dict only has a single document, when
terms.size() == terms.getSumDocFreq(): there's only one posting for
each term.
But we can probably generalize a cost estimation a bit more, just
based on these two stats?

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
Re: Slow DV equivalent of TermInSetQuery [ In reply to ]
I opened https://issues.apache.org/jira/browse/LUCENE-10207 about these
ideas.

On Tue, Oct 26, 2021 at 7:52 PM Robert Muir <rcmuir@gmail.com> wrote:

> On Tue, Oct 26, 2021 at 1:37 PM Adrien Grand <jpountz@gmail.com> wrote:
> >
> > > And then we could make an IndexOrDocValuesQuery with both the
> TermInSetQuery and this SDV.newSlowInSetQuery?
> >
> > Unfortunately IndexOrDocValuesQuery relies on the fact that the "index"
> query can evaluate its cost (ScorerSupplier#cost) without doing anything
> costly, which isn't the case for TermInSetQuery.
> >
> > So we'd need to make some changes. Estimating the cost of a
> TermInSetQuery in general without seeking the terms is a hard problem, but
> maybe we could specialize the unique key case to return the number of terms
> as the cost?
>
> Yes we know each term in terms dict only has a single document, when
> terms.size() == terms.getSumDocFreq(): there's only one posting for
> each term.
> But we can probably generalize a cost estimation a bit more, just
> based on these two stats?
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>

--
Adrien
Re: Slow DV equivalent of TermInSetQuery [ In reply to ]
If the list of ASIN's is presorted you can quickly merge it with the
SortedDocValues and produce a FixedBitSet of the top level ordinals, which
can be used as the post filter. This is a nice approach for things like
passing in a long list of access control predicates.


Joel Bernstein
http://joelsolr.blogspot.com/


On Tue, Oct 26, 2021 at 3:52 PM Adrien Grand <jpountz@gmail.com> wrote:

> I opened https://issues.apache.org/jira/browse/LUCENE-10207 about these
> ideas.
>
> On Tue, Oct 26, 2021 at 7:52 PM Robert Muir <rcmuir@gmail.com> wrote:
>
>> On Tue, Oct 26, 2021 at 1:37 PM Adrien Grand <jpountz@gmail.com> wrote:
>> >
>> > > And then we could make an IndexOrDocValuesQuery with both the
>> TermInSetQuery and this SDV.newSlowInSetQuery?
>> >
>> > Unfortunately IndexOrDocValuesQuery relies on the fact that the "index"
>> query can evaluate its cost (ScorerSupplier#cost) without doing anything
>> costly, which isn't the case for TermInSetQuery.
>> >
>> > So we'd need to make some changes. Estimating the cost of a
>> TermInSetQuery in general without seeking the terms is a hard problem, but
>> maybe we could specialize the unique key case to return the number of terms
>> as the cost?
>>
>> Yes we know each term in terms dict only has a single document, when
>> terms.size() == terms.getSumDocFreq(): there's only one posting for
>> each term.
>> But we can probably generalize a cost estimation a bit more, just
>> based on these two stats?
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>
>>
>
> --
> Adrien
>
Re: Slow DV equivalent of TermInSetQuery [ In reply to ]
One more wrinkle for extremely large lists, is pass the list in as an
InputStream which is a presorted binary representation of the ASIN's and
slide a BytesRef across the stream and merge it with the SortedDocValues.
This saves on all the object creation and String overhead for really long
lists of id's.

Joel Bernstein
http://joelsolr.blogspot.com/


On Tue, Oct 26, 2021 at 4:50 PM Joel Bernstein <joelsolr@gmail.com> wrote:

> If the list of ASIN's is presorted you can quickly merge it with the
> SortedDocValues and produce a FixedBitSet of the top level ordinals, which
> can be used as the post filter. This is a nice approach for things like
> passing in a long list of access control predicates.
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
>
> On Tue, Oct 26, 2021 at 3:52 PM Adrien Grand <jpountz@gmail.com> wrote:
>
>> I opened https://issues.apache.org/jira/browse/LUCENE-10207 about these
>> ideas.
>>
>> On Tue, Oct 26, 2021 at 7:52 PM Robert Muir <rcmuir@gmail.com> wrote:
>>
>>> On Tue, Oct 26, 2021 at 1:37 PM Adrien Grand <jpountz@gmail.com> wrote:
>>> >
>>> > > And then we could make an IndexOrDocValuesQuery with both the
>>> TermInSetQuery and this SDV.newSlowInSetQuery?
>>> >
>>> > Unfortunately IndexOrDocValuesQuery relies on the fact that the
>>> "index" query can evaluate its cost (ScorerSupplier#cost) without doing
>>> anything costly, which isn't the case for TermInSetQuery.
>>> >
>>> > So we'd need to make some changes. Estimating the cost of a
>>> TermInSetQuery in general without seeking the terms is a hard problem, but
>>> maybe we could specialize the unique key case to return the number of terms
>>> as the cost?
>>>
>>> Yes we know each term in terms dict only has a single document, when
>>> terms.size() == terms.getSumDocFreq(): there's only one posting for
>>> each term.
>>> But we can probably generalize a cost estimation a bit more, just
>>> based on these two stats?
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>
>>>
>>
>> --
>> Adrien
>>
>
Re: Slow DV equivalent of TermInSetQuery [ In reply to ]
Sorry, I don't think there is a need to use any top-level ordinals.
none of these docvalues-based query implementations need it.

As far as query intersecting an input-stream, that is a big no-go.
Lucene Queries need to have correct hashcode/equals/etc.

That's why current stuff around this such as TermInSetQuery encode
everything into a PrefixCodedTerms.

On Tue, Oct 26, 2021 at 4:57 PM Joel Bernstein <joelsolr@gmail.com> wrote:
>
> One more wrinkle for extremely large lists, is pass the list in as an InputStream which is a presorted binary representation of the ASIN's and slide a BytesRef across the stream and merge it with the SortedDocValues. This saves on all the object creation and String overhead for really long lists of id's.
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
>
> On Tue, Oct 26, 2021 at 4:50 PM Joel Bernstein <joelsolr@gmail.com> wrote:
>>
>> If the list of ASIN's is presorted you can quickly merge it with the SortedDocValues and produce a FixedBitSet of the top level ordinals, which can be used as the post filter. This is a nice approach for things like passing in a long list of access control predicates.
>>
>>
>> Joel Bernstein
>> http://joelsolr.blogspot.com/
>>
>>
>> On Tue, Oct 26, 2021 at 3:52 PM Adrien Grand <jpountz@gmail.com> wrote:
>>>
>>> I opened https://issues.apache.org/jira/browse/LUCENE-10207 about these ideas.
>>>
>>> On Tue, Oct 26, 2021 at 7:52 PM Robert Muir <rcmuir@gmail.com> wrote:
>>>>
>>>> On Tue, Oct 26, 2021 at 1:37 PM Adrien Grand <jpountz@gmail.com> wrote:
>>>> >
>>>> > > And then we could make an IndexOrDocValuesQuery with both the TermInSetQuery and this SDV.newSlowInSetQuery?
>>>> >
>>>> > Unfortunately IndexOrDocValuesQuery relies on the fact that the "index" query can evaluate its cost (ScorerSupplier#cost) without doing anything costly, which isn't the case for TermInSetQuery.
>>>> >
>>>> > So we'd need to make some changes. Estimating the cost of a TermInSetQuery in general without seeking the terms is a hard problem, but maybe we could specialize the unique key case to return the number of terms as the cost?
>>>>
>>>> Yes we know each term in terms dict only has a single document, when
>>>> terms.size() == terms.getSumDocFreq(): there's only one posting for
>>>> each term.
>>>> But we can probably generalize a cost estimation a bit more, just
>>>> based on these two stats?
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: dev-help@lucene.apache.org
>>>>
>>>
>>>
>>> --
>>> Adrien

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org
Re: Slow DV equivalent of TermInSetQuery [ In reply to ]
There are times, particularly in ecommerce and access control, where speed
really matters. So, you build stuff that's really fast at query time, with
a tradeoff at commit time.


Joel Bernstein
http://joelsolr.blogspot.com/


On Tue, Oct 26, 2021 at 5:31 PM Robert Muir <rcmuir@gmail.com> wrote:

> Sorry, I don't think there is a need to use any top-level ordinals.
> none of these docvalues-based query implementations need it.
>
> As far as query intersecting an input-stream, that is a big no-go.
> Lucene Queries need to have correct hashcode/equals/etc.
>
> That's why current stuff around this such as TermInSetQuery encode
> everything into a PrefixCodedTerms.
>
> On Tue, Oct 26, 2021 at 4:57 PM Joel Bernstein <joelsolr@gmail.com> wrote:
> >
> > One more wrinkle for extremely large lists, is pass the list in as an
> InputStream which is a presorted binary representation of the ASIN's and
> slide a BytesRef across the stream and merge it with the SortedDocValues.
> This saves on all the object creation and String overhead for really long
> lists of id's.
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> >
> >
> > On Tue, Oct 26, 2021 at 4:50 PM Joel Bernstein <joelsolr@gmail.com>
> wrote:
> >>
> >> If the list of ASIN's is presorted you can quickly merge it with the
> SortedDocValues and produce a FixedBitSet of the top level ordinals, which
> can be used as the post filter. This is a nice approach for things like
> passing in a long list of access control predicates.
> >>
> >>
> >> Joel Bernstein
> >> http://joelsolr.blogspot.com/
> >>
> >>
> >> On Tue, Oct 26, 2021 at 3:52 PM Adrien Grand <jpountz@gmail.com> wrote:
> >>>
> >>> I opened https://issues.apache.org/jira/browse/LUCENE-10207 about
> these ideas.
> >>>
> >>> On Tue, Oct 26, 2021 at 7:52 PM Robert Muir <rcmuir@gmail.com> wrote:
> >>>>
> >>>> On Tue, Oct 26, 2021 at 1:37 PM Adrien Grand <jpountz@gmail.com>
> wrote:
> >>>> >
> >>>> > > And then we could make an IndexOrDocValuesQuery with both the
> TermInSetQuery and this SDV.newSlowInSetQuery?
> >>>> >
> >>>> > Unfortunately IndexOrDocValuesQuery relies on the fact that the
> "index" query can evaluate its cost (ScorerSupplier#cost) without doing
> anything costly, which isn't the case for TermInSetQuery.
> >>>> >
> >>>> > So we'd need to make some changes. Estimating the cost of a
> TermInSetQuery in general without seeking the terms is a hard problem, but
> maybe we could specialize the unique key case to return the number of terms
> as the cost?
> >>>>
> >>>> Yes we know each term in terms dict only has a single document, when
> >>>> terms.size() == terms.getSumDocFreq(): there's only one posting for
> >>>> each term.
> >>>> But we can probably generalize a cost estimation a bit more, just
> >>>> based on these two stats?
> >>>>
> >>>> ---------------------------------------------------------------------
> >>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> >>>> For additional commands, e-mail: dev-help@lucene.apache.org
> >>>>
> >>>
> >>>
> >>> --
> >>> Adrien
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>
Re: Slow DV equivalent of TermInSetQuery [ In reply to ]
Well if, as I suggest, we use MultiTermQuery + DocValuesRewriteMethod
to implement this, then the choice is yours. just run it against a
"slow IndexReader" and go thru the ordinal map if you choose? There's
nothing stopping you from doing that, and it will do what you want
already.

I just personally don't recommend it for this case. As the number of
documents increases, the ordinal map indirection probably costs more
than the construction cost is worth. Better tradeoff to simply work
per-segment with no indirection. The number of lookupOrds is bounded
in a simple way, unlike faceting, where I would recommend the ordinal
map.


On Tue, Oct 26, 2021 at 6:10 PM Joel Bernstein <joelsolr@gmail.com> wrote:
>
> There are times, particularly in ecommerce and access control, where speed really matters. So, you build stuff that's really fast at query time, with a tradeoff at commit time.
>
>
> Joel Bernstein
> http://joelsolr.blogspot.com/
>
>
> On Tue, Oct 26, 2021 at 5:31 PM Robert Muir <rcmuir@gmail.com> wrote:
>>
>> Sorry, I don't think there is a need to use any top-level ordinals.
>> none of these docvalues-based query implementations need it.
>>
>> As far as query intersecting an input-stream, that is a big no-go.
>> Lucene Queries need to have correct hashcode/equals/etc.
>>
>> That's why current stuff around this such as TermInSetQuery encode
>> everything into a PrefixCodedTerms.
>>
>> On Tue, Oct 26, 2021 at 4:57 PM Joel Bernstein <joelsolr@gmail.com> wrote:
>> >
>> > One more wrinkle for extremely large lists, is pass the list in as an InputStream which is a presorted binary representation of the ASIN's and slide a BytesRef across the stream and merge it with the SortedDocValues. This saves on all the object creation and String overhead for really long lists of id's.
>> >
>> > Joel Bernstein
>> > http://joelsolr.blogspot.com/
>> >
>> >
>> > On Tue, Oct 26, 2021 at 4:50 PM Joel Bernstein <joelsolr@gmail.com> wrote:
>> >>
>> >> If the list of ASIN's is presorted you can quickly merge it with the SortedDocValues and produce a FixedBitSet of the top level ordinals, which can be used as the post filter. This is a nice approach for things like passing in a long list of access control predicates.
>> >>
>> >>
>> >> Joel Bernstein
>> >> http://joelsolr.blogspot.com/
>> >>
>> >>
>> >> On Tue, Oct 26, 2021 at 3:52 PM Adrien Grand <jpountz@gmail.com> wrote:
>> >>>
>> >>> I opened https://issues.apache.org/jira/browse/LUCENE-10207 about these ideas.
>> >>>
>> >>> On Tue, Oct 26, 2021 at 7:52 PM Robert Muir <rcmuir@gmail.com> wrote:
>> >>>>
>> >>>> On Tue, Oct 26, 2021 at 1:37 PM Adrien Grand <jpountz@gmail.com> wrote:
>> >>>> >
>> >>>> > > And then we could make an IndexOrDocValuesQuery with both the TermInSetQuery and this SDV.newSlowInSetQuery?
>> >>>> >
>> >>>> > Unfortunately IndexOrDocValuesQuery relies on the fact that the "index" query can evaluate its cost (ScorerSupplier#cost) without doing anything costly, which isn't the case for TermInSetQuery.
>> >>>> >
>> >>>> > So we'd need to make some changes. Estimating the cost of a TermInSetQuery in general without seeking the terms is a hard problem, but maybe we could specialize the unique key case to return the number of terms as the cost?
>> >>>>
>> >>>> Yes we know each term in terms dict only has a single document, when
>> >>>> terms.size() == terms.getSumDocFreq(): there's only one posting for
>> >>>> each term.
>> >>>> But we can probably generalize a cost estimation a bit more, just
>> >>>> based on these two stats?
>> >>>>
>> >>>> ---------------------------------------------------------------------
>> >>>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> >>>> For additional commands, e-mail: dev-help@lucene.apache.org
>> >>>>
>> >>>
>> >>>
>> >>> --
>> >>> Adrien
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: dev-help@lucene.apache.org
>>

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org