Hi,
my response was a bit unclear. Before Lucene 4.0 we stored *deletions* in
a bitset (1 = doc deleted), so you could use the DocIdSetIterator
provided directly. At that point there was no sparse implementation.
My idea was more about this: "Because we marked *deleted* docs (not live
docs) in the bitset, the cardinality of the bitset was small, so a
sparse one would work well." Of course, we can simply invert on set/get
to make use of a SparseFixedBitSet.
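
For illustration, a minimal sketch (not an existing Lucene class) of a
live-docs Bits view over a SparseFixedBitSet that marks deletions:

    import org.apache.lucene.util.Bits;
    import org.apache.lucene.util.SparseFixedBitSet;

    // Sketch: store *deletions* sparsely and invert on read so that
    // callers still see live-docs semantics (true = doc is live).
    final class SparseLiveDocs implements Bits {
      private final SparseFixedBitSet deleted; // 1 = doc deleted
      private final int maxDoc;

      SparseLiveDocs(SparseFixedBitSet deleted, int maxDoc) {
        this.deleted = deleted;
        this.maxDoc = maxDoc;
      }

      @Override
      public boolean get(int index) {
        return deleted.get(index) == false; // live = not deleted
      }

      @Override
      public int length() {
        return maxDoc;
      }
    }
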
Uwe
On Tue, Feb 6, 2024 at 9:05 PM Adrien Grand wrote:
> Good point, I opened an issue to discuss this:
> https://github.com/apache/lucene/issues/13084.
>
> Did we actually use a sparse bit set to encode deleted docs before? I
> don't recall that.
>
> On Tue, Feb 6, 2024 at 2:42 PM Uwe Schindler <uwe@thetaphi.de> wrote:
>
> Hi,
>
> A SparseBitset impl for DELETES would be fine if the model in
> Lucene encoded deleted docs (as it did in earlier times). Since
> deletes are sparse (in most cases <40% of docs), this would help
> make the iterator cheaper.
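>
> For example (a sketch using existing utilities, no new API): if
> deletions lived in a SparseFixedBitSet, a BitSetIterator over it
> would visit only the deleted docs:
>
>     import java.io.IOException;
>     import org.apache.lucene.search.DocIdSetIterator;
>     import org.apache.lucene.util.BitSetIterator;
>     import org.apache.lucene.util.SparseFixedBitSet;
>
>     // Sketch: visit only the deleted docs, assuming deletions were
>     // kept in a SparseFixedBitSet (set(docID) on each delete).
>     static void visitDeletedDocs(SparseFixedBitSet deletes) throws IOException {
>       DocIdSetIterator it =
>           new BitSetIterator(deletes, deletes.approximateCardinality());
>       for (int doc = it.nextDoc();
>            doc != DocIdSetIterator.NO_MORE_DOCS;
>            doc = it.nextDoc()) {
>         // 'doc' is a deleted document
>       }
>     }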
>
> Uwe
>
> On Tue, Feb 6, 2024 at 9:01 AM Adrien Grand wrote:
>> Hey Michael,
>>
>> You are right, iterating all deletes with nextClearBit() would
>> run in O(maxDoc). I am coming from the other direction, where I
>> expect the number of deletes to be more on the order of 1%-5% of
>> the doc ID space, so a separate int[] would use lots of heap and
>> probably not help much compared with nextClearBit(). My mental
>> model is that the two most common use cases are append-only
>> workloads, where there are no deletes at all, and update
>> workloads, which commonly have several percent of deleted docs.
>> It's not clear to me how common it is to have very few deletes.
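>>
>> To put rough numbers on it (illustrative figures, not
>> measurements): with maxDoc = 10M and 5% deletes, an int[] needs
>> 500,000 entries * 4 bytes ~= 2 MB per segment, while the
>> FixedBitSet costs maxDoc / 8 bytes ~= 1.25 MB regardless of how
>> many docs are deleted.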
>>
>> On Tue, Feb 6, 2024 at 7:03 AM Michael Froh <msfroh@gmail.com> wrote:
>>
>> Thanks Adrien!
>>
>> My thinking with a separate iterator was that nextClearBit()
>> is relatively expensive (O(maxDoc) to traverse everything, I
>> think). The solution I was imagining would involve an
>> index-time change to output, say, an int[] of deleted docIDs
>> if the number is sufficiently small (like maybe less than
>> 1000). Then the livedocs interface could optionally return a
>> cheap deleted docs iterator (i.e. only if the number of
>> deleted docs is less than the threshold). Technically, the
>> cost would be O(1), since we set a constant bound on the
>> effort and fail otherwise. :)
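>>
>> Roughly what I have in mind (a sketch; the helper name and the
>> threshold are made up):
>>
>>     import java.util.Arrays;
>>     import org.apache.lucene.util.Bits;
>>
>>     // Sketch: collect deleted doc IDs into an int[], giving up
>>     // (null) once a constant threshold is exceeded. The scan is
>>     // still O(maxDoc), so it would run once at index/refresh time.
>>     static int[] collectDeletedDocs(Bits liveDocs, int maxDoc, int threshold) {
>>       int[] deleted = new int[threshold];
>>       int count = 0;
>>       for (int doc = 0; doc < maxDoc; doc++) {
>>         if (liveDocs.get(doc) == false) {
>>           if (count == threshold) {
>>             return null; // too many deletes; use the default path
>>           }
>>           deleted[count++] = doc;
>>         }
>>       }
>>       return Arrays.copyOf(deleted, count);
>>     }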
>>
>> I think 1000 doc value lookups would be cheap, but I don't
>> know if the guarantee is cheap enough to make it into
>> Weight#count.
>>
>> That said, I'm going to see if iterating with nextClearBit()
>> is sufficiently cheap. Hmm... precomputing that int[] for
>> deleted docIDs on refresh could be an option too.
>>
>> Thanks again,
>> Froh
>>
>> On Fri, Feb 2, 2024 at 11:38 PM Adrien Grand
>> <jpountz@gmail.com> wrote:
>>
>> Hi Michael,
>>
>> Indeed, only MatchAllDocsQuery knows how to produce a
>> count when there are deletes.
>>
>> Your idea sounds good to me. Do you actually need a sidecar
>> iterator for deletes, or could you use a nextClearBit()
>> operation on the bit set?
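>>
>> (As far as I know FixedBitSet doesn't expose a nextClearBit()
>> today, but a sketch along the same word-at-a-time lines as
>> nextSetBit() could look like this; hypothetical helper:)
>>
>>     import org.apache.lucene.search.DocIdSetIterator;
>>     import org.apache.lucene.util.FixedBitSet;
>>
>>     // Sketch: first clear (i.e. deleted) bit at or after 'index'.
>>     static int nextClearBit(FixedBitSet bits, int index) {
>>       final int numBits = bits.length();
>>       final long[] words = bits.getBits();
>>       int i = index >> 6;
>>       // Invert the word and drop bits below 'index'
>>       // (Java shifts use the low 6 bits of the shift count).
>>       long word = (~words[i]) >>> index;
>>       if (word != 0) {
>>         int result = index + Long.numberOfTrailingZeros(word);
>>         return result < numBits ? result : DocIdSetIterator.NO_MORE_DOCS;
>>       }
>>       while (++i < words.length) {
>>         word = ~words[i];
>>         if (word != 0) {
>>           int result = (i << 6) + Long.numberOfTrailingZeros(word);
>>           return result < numBits ? result : DocIdSetIterator.NO_MORE_DOCS;
>>         }
>>       }
>>       return DocIdSetIterator.NO_MORE_DOCS;
>>     }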
>>
>> I don't think we can fold it into Weight#count since
>> there is an expectation that it is negligible compared
>> with the cost of a naive count, but we may be able to do
>> it in IndexSearcher#count or on the OpenSearch side.
>>
>> On Fri, Feb 2, 2024 at 11:50 PM Michael Froh
>> <msfroh@gmail.com> wrote:
>>
>> Hi,
>>
>> On OpenSearch, we've been taking advantage of the various O(1)
>> Weight#count() implementations to quickly compute aggregations
>> without needing to iterate over all the matching documents (at
>> least when the top-level query is functionally a match-all at
>> the segment level). Of course, from what I've seen, every clever
>> Weight#count() implementation falls apart (returns -1) in the
>> face of deletes.
>>
>> I was thinking that we could still handle small
>> numbers of deletes efficiently if only we could get a
>> DocIdSetIterator for deleted docs.
>>
>> Suppose, for example, you're doing a date histogram
>> aggregation: you could get the counts for each bucket from the
>> points tree (ignoring deletes), then iterate through the deleted
>> docs and decrement their contribution from the relevant bucket
>> (determined via a docvalues lookup). Assuming the number of
>> deleted docs is small, it should be cheap, right?
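>>
>> A sketch of the decrement step (hypothetical names; assumes the
>> timestamp field has numeric doc values and some bucketFor
>> mapping from timestamp to bucket index):
>>
>>     import java.io.IOException;
>>     import java.util.function.LongToIntFunction;
>>     import org.apache.lucene.index.NumericDocValues;
>>     import org.apache.lucene.search.DocIdSetIterator;
>>
>>     // Start from per-bucket counts taken from the points tree,
>>     // then walk the (few) deleted docs and subtract each one's
>>     // contribution from its bucket.
>>     static void subtractDeletes(DocIdSetIterator deletedDocs,
>>                                 NumericDocValues timestamps,
>>                                 long[] bucketCounts,
>>                                 LongToIntFunction bucketFor) throws IOException {
>>       for (int doc = deletedDocs.nextDoc();
>>            doc != DocIdSetIterator.NO_MORE_DOCS;
>>            doc = deletedDocs.nextDoc()) {
>>         if (timestamps.advanceExact(doc)) { // doc has a value
>>           bucketCounts[bucketFor.applyAsInt(timestamps.longValue())]--;
>>         }
>>       }
>>     }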
>>
>> The current LiveDocs implementation is just a
>> FixedBitSet, so AFAIK it's not great for iteration.
>> I'm imagining adding a supplementary "deleted docs
>> iterator" that could sit next to the FixedBitSet if
>> and only if the number of deletes is "small". Is
>> there a better way that I should be thinking about this?
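>>
>> For what it's worth, the iterator itself could be trivial, e.g. a
>> DocIdSetIterator over a sorted int[] (hypothetical class, not an
>> existing Lucene API):
>>
>>     import java.io.IOException;
>>     import org.apache.lucene.search.DocIdSetIterator;
>>
>>     // Sketch: iterate a small, sorted array of deleted doc IDs.
>>     final class DeletedDocsIterator extends DocIdSetIterator {
>>       private final int[] docs; // sorted, no duplicates
>>       private int i = -1;
>>
>>       DeletedDocsIterator(int[] docs) {
>>         this.docs = docs;
>>       }
>>
>>       @Override
>>       public int docID() {
>>         if (i < 0) return -1;
>>         return i < docs.length ? docs[i] : NO_MORE_DOCS;
>>       }
>>
>>       @Override
>>       public int nextDoc() {
>>         i++;
>>         return docID();
>>       }
>>
>>       @Override
>>       public int advance(int target) throws IOException {
>>         return slowAdvance(target); // fine for small arrays
>>       }
>>
>>       @Override
>>       public long cost() {
>>         return docs.length;
>>       }
>>     }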
>>
>> Thanks,
>> Froh
>>
>>
>>
>> --
>> Adrien
>
> --
> Uwe Schindler
> Achterdiek 19, D-28357 Bremen
> https://www.thetaphi.de
> eMail: uwe@thetaphi.de
>
>
>
> --
> Adrien
--
Uwe Schindler
Achterdiek 19, D-28357 Bremen
https://www.thetaphi.de
eMail: uwe@thetaphi.de