Mailing List Archive

Dense union of doc IDs
Hi,

I was recently poking around in the createWeight implementation for
MultiTermQueryConstantScoreWrapper to get to the bottom of some slow
queries, and I realized that the worst-case performance could be pretty
bad, but (maybe) possible to optimize for.

Imagine if we have a segment with N docs and our MultiTermQuery expands to
hit M terms, where each of the M terms matches N-1 docs. (If we matched all
N docs, then Greg Miller's recent optimization to replace the
MultiTermQuery with a TermQuery would kick in.) In this case, my
understanding is that we would iterate through all the terms and pass each
one's postings to a DocIdSetBuilder, which will iterate through the
postings to set bits. This whole thing would be O(MN), I think.
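
For reference, the shape of that path as I understand it (a paraphrase,
not the actual Lucene source):

import java.io.IOException;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.DocIdSet;
import org.apache.lucene.util.DocIdSetBuilder;

// Each term's postings are fed to the DocIdSetBuilder one doc at a time.
static DocIdSet unionOfPostings(TermsEnum termsEnum, int maxDoc) throws IOException {
  DocIdSetBuilder builder = new DocIdSetBuilder(maxDoc);
  PostingsEnum postings = null;
  while (termsEnum.next() != null) {           // M terms
    postings = termsEnum.postings(postings, PostingsEnum.NONE);
    builder.add(postings);                     // walks ~N docs per term
  }
  return builder.build();                      // O(M * N) total work
}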

I was thinking that it would be cool if the DocIdSetBuilder could detect
long runs of set bits and advance() each DISI to skip over them (since
they're guaranteed not to contribute anything new to the union). In the
worst case that I described above, I think it would make the whole thing
O(M log N) (assuming advance() takes log time).

At the risk of overcomplicating things, maybe DocIdSetBuilder could use a
third ("dense") BulkAdder implementation that kicks in once enough bits are
set, to efficiently implement the "or" operation to skip over known
(sufficiently long) runs of set bits?
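
To make that concrete, here's roughly what I'm picturing (just a sketch:
addDense and RUN_THRESHOLD are made-up names, and I've inlined a naive
nextClearBit helper, since that's the primitive the idea needs):

import java.io.IOException;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.util.FixedBitSet;

// Once the bitset is dense, skip each DISI past runs of already-set bits,
// since those docs can't contribute anything new to the union.
static void addDense(FixedBitSet bits, DocIdSetIterator disi) throws IOException {
  final int RUN_THRESHOLD = 64; // made-up tuning knob
  int doc = disi.nextDoc();
  while (doc != DocIdSetIterator.NO_MORE_DOCS) {
    bits.set(doc);
    int endOfRun = nextClearBit(bits, doc + 1);
    if (endOfRun - doc > RUN_THRESHOLD) {
      doc = disi.advance(endOfRun); // jump past docs that are already set
    } else {
      doc = disi.nextDoc();
    }
  }
}

// Naive stand-in; a real implementation would scan word-at-a-time.
static int nextClearBit(FixedBitSet bits, int from) {
  int i = from;
  while (i < bits.length() && bits.get(i)) {
    i++;
  }
  return i;
}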

Would something like that be useful? Is the "dense union of doc IDs" case
common enough to warrant it?

Thanks,
Froh
Re: Dense union of doc IDs
Hi Froh,

The idea sounds reasonable to me, although I wonder whether using
CONSTANT_SCORE_BOOLEAN_REWRITE would help in your case, since that dense
union case should already be handled by the disjunction query, I suppose.
But that boolean rewrite is subject to the max clause limit, so it may
cause other problems depending on the use case.
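
For reference, switching the rewrite is roughly a one-liner (a sketch
with a made-up field; setRewriteMethod and CONSTANT_SCORE_BOOLEAN_REWRITE
are the real MultiTermQuery API):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.MultiTermQuery;
import org.apache.lucene.search.PrefixQuery;

// Rewrites the multi-term query into a BooleanQuery of per-term clauses,
// so the union is handled by the regular disjunction machinery. This is
// what makes it subject to IndexSearcher's max clause count.
PrefixQuery query = new PrefixQuery(new Term("field", "foo"));
query.setRewriteMethod(MultiTermQuery.CONSTANT_SCORE_BOOLEAN_REWRITE);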

Best
Patrick


On Thu, Nov 3, 2022 at 5:15 PM Michael Froh <msfroh@gmail.com> wrote:

> [...]
Re: Dense union of doc IDs
It sounds like a lot of complexity to handle an unusual edge case, but
... I guess this actually happened? Can you give any sense of the
end-user behavior that caused it?

On Fri, Nov 4, 2022 at 2:26 AM Patrick Zhai <zhai7631@gmail.com> wrote:
> [...]

Re: Dense union of doc IDs
+1 to exploring this idea. A couple additional thoughts...

I can imagine real-world use cases that would benefit from this beyond the
super-pathological N-1 case. With dense terms, I can believe that the
disjunction would start to accumulate "blocks" of documents that all match
as the bitset gets populated. As that starts to happen, it could be very
beneficial to `advance` other term postings beyond the "block." For
example, in an ecommerce search engine, a disjunction of "product types"
would likely exhibit this behavior (where some product types are likely
pretty dense).

Also, it would be nice to try this same idea in TermInSetQuery. It's very
similar, but still has its own implementation.
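
To make that concrete, think of a filter like this (hypothetical field
and values):

import java.util.List;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermInSetQuery;
import org.apache.lucene.util.BytesRef;

// A disjunction over a handful of terms, where some of them (say,
// "apparel") may each match a big chunk of the index.
Query productTypes = new TermInSetQuery("product_type",
    List.of(new BytesRef("shoes"), new BytesRef("apparel"),
        new BytesRef("electronics")));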

Thanks for raising this idea, Froh! I'd be excited to see what you come up
with, and I may have a use case in mind to benchmark with if you end up
with a patch to test.

Cheers,
-Greg

On Fri, Nov 4, 2022 at 6:46 AM Michael Sokolov <msokolov@gmail.com> wrote:

> [...]
Re: Dense union of doc IDs
I have been thinking about a similar feature for conjunctions and
negations. When you have many low-cardinality fields, a good way to speed
up queries on these fields is to configure an index sort on these fields.
This automatically creates large gaps in postings between long runs of
ones, and Lucene naturally takes advantage of it to skip over non-matching
documents. In that case, the bottleneck becomes linearly scanning runs of
ones until the first non-matching doc ID of one of the postings is found.
We could do much better if Lucene could give us the next non-matching doc
ID.
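
For reference, configuring such a sort is straightforward at index time
(a sketch with hypothetical fields, which would also need to be indexed
as sorted doc values):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;

// Sorting the index on low-cardinality fields makes docs with equal values
// adjacent, so each term's postings contain long runs of consecutive IDs.
IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
config.setIndexSort(new Sort(
    new SortField("color", SortField.Type.STRING),
    new SortField("size", SortField.Type.STRING)));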

This would also help negations, since after advancing the main clause,
Lucene needs to ask the prohibited clause: "what is your next non-matching
doc ID on or after this doc ID?" in order to figure out 1. if the doc ID of
the main clause is a match and 2. what is the next target for the main
clause. In the case of a long run of ones in the prohibited clause, we
currently scan linearly until the next non-matching doc ID.

I had created a prototype for this by introducing a new
`DocIdSetIterator#peekNextNonMatchingDocID()` method that would default to
returning `docID() + 1`, and patching postings to return `docID() + 128` in
case a postings block would consist of ones. To fully take advantage of
this, we'd create a new postings format that would specialize encoding long
runs of ones.
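
The core of it looked roughly like this (from memory and simplified;
blockIsAllOnes stands in for however the postings reader detects a block
of 128 consecutive doc IDs, and reqd/excl are stand-ins for the required
and prohibited scorers):

// Default on DocIdSetIterator: conservatively, only the current doc is
// known to match, so the next potentially non-matching doc is docID() + 1.
public int peekNextNonMatchingDocID() throws IOException {
  return docID() + 1;
}

// Override in the postings reader: if the current block encodes 128
// consecutive doc IDs (all ones), everything up to docID() + 128 matches.
@Override
public int peekNextNonMatchingDocID() throws IOException {
  return blockIsAllOnes ? docID() + 128 : docID() + 1;
}

// A negation can then skip the linear scan: if the required clause landed
// on a prohibited doc, jump it straight past the prohibited run of ones.
if (excl.docID() == doc) {
  doc = reqd.advance(excl.peekNextNonMatchingDocID());
}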

In summary, I'm also +1 on exploring this idea. :)

On Fri, Nov 4, 2022 at 3:54 PM Greg Miller <gsmiller@gmail.com> wrote:

> [...]

--
Adrien
Re: Dense union of doc IDs
I just found the branch I had used to play with this idea:
https://github.com/apache/lucene/compare/main...jpountz:lucene:bulk_collection,
in case you're interested in having a look.

There are a few changes to DefaultBulkScorer and LeafCollector too,
because I was also interested in other ways to speed things up when index
sorting is enabled, as described in my previous email. For example, we
could optimize counting hits or terms facets with such an index
organization, but that's something we could evaluate separately.

On Mon, Nov 7, 2022 at 12:05 PM Adrien Grand <jpountz@gmail.com> wrote:

> [...]

--
Adrien
Re: Dense union of doc IDs
+1. If we had a new BulkAdder that could detect long runs of set bits,
couldn't it also be used in LRUQueryCache to cache partly-dense doc sets
instead of always building a huge BitSet sized by maxDoc?

Xugang

https://www.amazingkoala.com.cn




> On Nov 4, 2022, at 08:15, Michael Froh <msfroh@gmail.com> wrote:
> [...]
Re: Dense union of doc IDs
I opened an issue to track this idea:
https://github.com/apache/lucene/issues/11915

On Mon, Nov 7, 2022 at 1:21 PM LuXugang <xuganglu@icloud.com.invalid> wrote:

> [...]

--
Adrien