Mailing List Archive

Deduplication/inversion for dimensional points
Hi Team!

I know we have made recent strides so that if the same dimensional point
value (or N-dimensional point value) is indexed many, many times in one
segment, we try to somehow optimize for that case. I think this happens
only at the leaf-block level, i.e. if all points in the block are
identical, we write a special header byte and the one value.

This is a massive reduction in index size, and a good speedup at search
time, since we only need to check if the value passes the query's shape
once, and can then collect all docids in that block.

Do we have any more deduping logic if many leaf blocks also share a single
value? E.g. an inner node in the KD tree could note that all leaf blocks
under it have the same value? And then somehow when intersecting we might
treat those leaf blocks (whose docids will be in postings order, right?) as
a strange postings list, maybe disjunctively inserted into something like a
disjunctive clause in a BooleanQuery? I think I saw an issue talking about
something like this idea, but cannot find it now.

Context: we (Amazon Product Search team) are working out how we could use
Lucene's awesome geo features to help customers search/shop better :) And
the simplest approach, index a lat/lon point on every "offer", would result
in many many duplicate lat/lons.

Thanks,

Mike McCandless

http://blog.mikemccandless.com
Re: Deduplication/inversion for dimensional points [ In reply to ]
Hey Mike,

I believe that we always handled the case when all documents in a leaf have
the same value efficiently. The case that got improved more recently is the
case when there are many duplicates within a leaf but still more than one
value. We added run-length encoding for this case (LUCENE-8868
<https://issues.apache.org/jira/browse/LUCENE-8868>) and added a new
optional API that allows checking the value only once to make queries more
efficient (LUCENE-8885 <https://issues.apache.org/jira/browse/LUCENE-8885>).
Kudos to Ignacio for this work.

To my knowledge, we don't have more deduplication logic. When an inner
block has a single value, the IntersectVisitor likely
returns CELL_INSIDE_QUERY and Lucene will only collect doc IDs for all leaf
blocks that are under this leaf block without doing any additional
comparison. This should be quite good already.

The fact that we have a visitor API makes it hard to actually treat these
doc IDs as a postings list even though they are in order. This is one
reason why I've been wondering if we should move to a different API (
LUCENE-9619 <https://issues.apache.org/jira/browse/LUCENE-9619>) where we
could explore the tree programmatically, which would hopefully allow us to
treat these doc IDs like a postings list. It wouldn't support skipping, but
at least we could avoid materializing a BitSet containing all doc IDs that
match this value. Not supporting skipping feels bad, but maybe it's not
actually that bad actually given that if you are using
IndexOrDocValuesQuery, the BKD tree will only be consulted if it leads
iteration, ie. we'd consume most of the matching doc IDs anyway. We could
record the min/max doc ID for each leaf block, but this feels like a major
change to the API for a very narrow use-case?

Also related, Ignacio has been asking if our BKD trees should encode doc
IDs more like postings: LUCENE-9368
<https://issues.apache.org/jira/browse/LUCENE-9368>.

On Fri, Jul 16, 2021 at 2:42 PM Michael McCandless <
lucene@mikemccandless.com> wrote:

> Hi Team!
>
> I know we have made recent strides so that if the same dimensional point
> value (or N-dimensional point value) is indexed many, many times in one
> segment, we try to somehow optimize for that case. I think this happens
> only at the leaf-block level, i.e. if all points in the block are
> identical, we write a special header byte and the one value.
>
> This is a massive reduction in index size, and a good speedup at search
> time, since we only need to check if the value passes the query's shape
> once, and can then collect all docids in that block.
>
> Do we have any more deduping logic if many leaf blocks also share a single
> value? E.g. an inner node in the KD tree could note that all leaf blocks
> under it have the same value? And then somehow when intersecting we might
> treat those leaf blocks (whose docids will be in postings order, right?) as
> a strange postings list, maybe disjunctively inserted into something like a
> disjunctive clause in a BooleanQuery? I think I saw an issue talking about
> something like this idea, but cannot find it now.
>
> Context: we (Amazon Product Search team) are working out how we could use
> Lucene's awesome geo features to help customers search/shop better :) And
> the simplest approach, index a lat/lon point on every "offer", would result
> in many many duplicate lat/lons.
>
> Thanks,
>
> Mike McCandless
>
> http://blog.mikemccandless.com
>


--
Adrien
Re: Deduplication/inversion for dimensional points [ In reply to ]
Whoa, thank you for the detailed response with links Adrien! Dimensional
points sure have come a long way! And thank you Ignacio and everyone else
for all the awesome improvements :)

Inlined responses below:

On Fri, Jul 16, 2021 at 9:51 AM Adrien Grand <jpountz@gmail.com> wrote:

> Hey Mike,
>
> I believe that we always handled the case when all documents in a leaf
> have the same value efficiently. The case that got improved more recently
> is the case when there are many duplicates within a leaf but still more
> than one value. We added run-length encoding for this case (LUCENE-8868
> <https://issues.apache.org/jira/browse/LUCENE-8868>) and added a new
> optional API that allows checking the value only once to make queries more
> efficient (LUCENE-8885 <https://issues.apache.org/jira/browse/LUCENE-8885>).
> Kudos to Ignacio for this work.
>

Ahh that's great -- so it makes the optimization less brittle, e.g. if you
have nearly the same value but one value is not, we still strongly optimize
that using RLE.

To my knowledge, we don't have more deduplication logic. When an inner
> block has a single value, the IntersectVisitor likely
> returns CELL_INSIDE_QUERY and Lucene will only collect doc IDs for all leaf
> blocks that are under this leaf block without doing any additional
> comparison. This should be quite good already.
>

I agree, this should be very effective. We check the value once, and then
go on to collect or skip those 1K (default maxPointsInLeafNode) docids.

The fact that we have a visitor API makes it hard to actually treat these
> doc IDs as a postings list even though they are in order. This is one
> reason why I've been wondering if we should move to a different API (
> LUCENE-9619 <https://issues.apache.org/jira/browse/LUCENE-9619>) where we
> could explore the tree programmatically, which would hopefully allow us to
> treat these doc IDs like a postings list.
>

Aha! That was the issue I was thinking of, awesome.


> It wouldn't support skipping, but at least we could avoid materializing a
> BitSet containing all doc IDs that match this value. Not supporting
> skipping feels bad, but maybe it's not actually that bad actually given
> that if you are using IndexOrDocValuesQuery, the BKD tree will only be
> consulted if it leads iteration, ie. we'd consume most of the matching doc
> IDs anyway.
>

We could eventually support skipping if this worked out right? Like
nothing serious would preclude that. Maybe something like the randomized
inlined skip data structure (LUCENE-2962
<https://issues.apache.org/jira/browse/LUCENE-2962>)

We could record the min/max doc ID for each leaf block, but this feels like
> a major change to the API for a very narrow use-case?
>

Yeah maybe this is a narrow use case, not sure. We are just exploring now
if we could make wasteful heavily duplicated lat/lon work efficiently, but
really we should find a way to simply convert them to postings :) Dedup up
front and then index those dedup'd tokens.


> Also related, Ignacio has been asking if our BKD trees should encode doc
> IDs more like postings: LUCENE-9368
> <https://issues.apache.org/jira/browse/LUCENE-9368>.
>

Oh yeah! Vectorizing the decode right? That makes total sense. That was
a nice performance bump for postings.

Thanks Adrien,

Mike

>
Re: Deduplication/inversion for dimensional points [ In reply to ]
On Tue, Jul 20, 2021 at 5:50 PM Michael McCandless <
lucene@mikemccandless.com> wrote:

> To my knowledge, we don't have more deduplication logic. When an inner
>> block has a single value, the IntersectVisitor likely
>> returns CELL_INSIDE_QUERY and Lucene will only collect doc IDs for all leaf
>> blocks that are under this leaf block without doing any additional
>> comparison. This should be quite good already.
>>
>
> I agree, this should be very effective. We check the value once, and then
> go on to collect or skip those 1K (default maxPointsInLeafNode) docids.
>

It's actually 512 since LUCENE-9087
<https://issues.apache.org/jira/browse/LUCENE-9087>. :) (And an even more
interesting feature of this change is that it changed n-dimensional BKD
trees to always have 512 points per leaf, while they used to have between
512 and 1024 points per leaf before depending on the number of points of
the segment.)

--
Adrien