Mailing List Archive

Querying into a Collector visits documents multiple times
Hi Lucene users,

I am developing a search application that needs to do some basic
summary statistics. We use Lucene 8.9.0.
To improve performance when, e.g., summing a value across 10,000
documents, we are using DocValues as columnar storage.

In order to retrieve the DocValues without collecting all hits into a
TopDocs, which we determined to cause a lot of memory pressure and
take a lot of time, we are using the expert Collector query interface.

Here's the code, simplified a bit for the list:

final var collector = new Collector() {
    long sum = 0;

    @Override
    public ScoreMode scoreMode() {
        // Scores are never read, so ask Lucene not to compute them.
        return ScoreMode.COMPLETE_NO_SCORES;
    }

    @Override
    public LeafCollector getLeafCollector(final LeafReaderContext context)
            throws IOException {
        if (context.docBase == 0) {
            sum = 0; // XXX: this should not be necessary?
        }
        final var subtotalValue =
                context.reader().getNumericDocValues("subtotal");
        return new LeafCollector() {
            @Override
            public void setScorer(final Scorable scorer) throws IOException {
            }

            @Override
            public void collect(final int doc) throws IOException {
                // Skip docs with no "subtotal" value or a zero value.
                if (subtotalValue.docID() > doc
                        || !subtotalValue.advanceExact(doc)
                        || subtotalValue.longValue() == 0) {
                    return;
                }
                sum += subtotalValue.longValue();
            }
        };
    }
};
searcher.search(myQuery, collector);
return collector.sum;

The query is a moderately complicated Boolean query with some
TermQuery and MultiTermQuery instances combined together.
During initial testing, I observed that the collector seems to be called
twice for each document, and the sum is exactly double what you would
expect.

It seems that the Collector is observing every matched document twice,
and by printing out the Scorer, I see that it's done with two
different BooleanScorer instances.
You can see my hack that resets the collector every time it starts at
docBase 0, which I am sure is not the right approach, but it seems to
work.
What is the right pattern to ensure my Collector only observes result
documents once, no matter the input query? I see a note in the
documentation that state is supposed to be stored on the Scorer
implementation, but I am not providing a custom Scorer, nor do I
actually want any scoring at all.

Thank you for any guidance!
Steven

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: Querying into a Collector visits documents multiple times
Hi Steven,

This collector looks correct to me. Resetting the counter to 0 on the first
segment is indeed not necessary.

We have plenty of collectors that are very similar to this one and we never
observed any double-counting issue. I would suspect an issue in the code
that calls this collector. Maybe try printing the stack trace inside the
`if (context.docBase == 0) {` check to see why your collector is being
called twice?
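
As a rough illustration of that debugging step, the sketch below records a
stack trace every time a suspect method is entered, so that two unexpected
entries can be compared side by side. It uses only the JDK. The names
CallSiteRecorder and suspectedMethod are hypothetical stand-ins, not Lucene
classes. In the real collector, the equivalent would likely be a single
`new Throwable().printStackTrace()` inside the docBase check.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: remembers who called a method.
class CallSiteRecorder {
    private final List<StackTraceElement[]> traces = new ArrayList<>();

    // Call this at the top of the method under suspicion.
    void record() {
        // A Throwable captures the current stack when it is constructed.
        traces.add(new Throwable().getStackTrace());
    }

    int callCount() {
        return traces.size();
    }

    // Print every recorded stack so duplicate call paths can be compared.
    void dump() {
        for (int i = 0; i < traces.size(); i++) {
            System.err.println("call #" + (i + 1));
            for (StackTraceElement frame : traces.get(i)) {
                System.err.println("    at " + frame);
            }
        }
    }

    // Stand-in for the code path being investigated (assumption: in the
    // thread this would be getLeafCollector for the first segment).
    static void suspectedMethod(CallSiteRecorder recorder) {
        recorder.record();
    }

    public static void main(String[] args) {
        CallSiteRecorder recorder = new CallSiteRecorder();
        suspectedMethod(recorder); // the call you expect
        suspectedMethod(recorder); // the call you did not expect
        recorder.dump();           // two traces, one per entry
    }
}
```

If the two printed traces differ, the frames where they diverge point at the
caller that triggers the second search.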

On Tue, Sep 21, 2021 at 9:30 PM Steven Schlansker <
stevenschlansker@gmail.com> wrote:

> [...]

--
Adrien
Re: Querying into a Collector visits documents multiple times
Separate issue, but this collector is not going to work with concurrent
search since the sum is not updated in a thread safe manner. Maybe you
don't care, since you don't use a thread pool to execute your queries, but
you probably should!
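
To make the thread-safety concern concrete, here is a minimal JDK-only sketch
of the usual remedy: give each worker its own accumulator and merge the
partial results once all workers have finished. This is the same general idea
as Lucene's CollectorManager (one collector per slice, combined in a reduce
step), but the code below does not use Lucene. PartialSums, Partial, and
sumInSlices are hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch, not Lucene code.
class PartialSums {
    // One accumulator per worker; only that worker ever writes to it,
    // so the plain long field needs no synchronization.
    static final class Partial {
        long sum = 0;

        void collect(long value) {
            sum += value;
        }
    }

    // Split the values into `slices` chunks, sum each chunk on its own
    // thread, then add up the partial sums on the calling thread.
    static long sumInSlices(long[] values, int slices) {
        ExecutorService pool = Executors.newFixedThreadPool(slices);
        try {
            List<Future<Partial>> futures = new ArrayList<>();
            int chunk = (values.length + slices - 1) / slices;
            for (int s = 0; s < slices; s++) {
                final int from = Math.min(values.length, s * chunk);
                final int to = Math.min(values.length, from + chunk);
                futures.add(pool.submit(() -> {
                    Partial partial = new Partial();
                    for (int i = from; i < to; i++) {
                        partial.collect(values[i]);
                    }
                    return partial;
                }));
            }
            long total = 0;
            for (Future<Partial> future : futures) {
                // Future.get() waits for the worker and makes its writes
                // visible here; this loop is the "reduce" step.
                total += future.get().sum;
            }
            return total;
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("worker failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        long[] values = new long[10_000];
        Arrays.fill(values, 3L);
        System.out.println(sumInSlices(values, 4)); // prints 30000
    }
}
```

A single shared `long sum` updated from several threads would need an
AtomicLong or similar instead. With per-worker accumulators, no shared
mutable state is touched until the final merge.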

On Wed, Sep 22, 2021, 8:38 AM Adrien Grand <jpountz@gmail.com> wrote:

> [...]
Re: Querying into a Collector visits documents multiple times
Ah sorry, never mind. I confused Collector and CollectorManager.

On Fri, Sep 24, 2021, 6:51 AM Michael Sokolov <msokolov@gmail.com> wrote:

> [...]