Mailing List Archive

How to access block-max metadata?
Hi all,
There was some fairly recent work in Lucene to introduce Block-Max WAND
Scoring (
https://cs.uwaterloo.ca/~jimmylin/publications/Grand_etal_ECIR2020_preprint.pdf
, https://issues.apache.org/jira/browse/LUCENE-8135).

I've been working on a use case where I need very efficient top-k scoring
for hundreds of query terms (usually between 300 and 600 terms, k between 100
and 10000, each term contributes a simple TF-IDF score). There's some
discussion here: https://github.com/alexklibisz/elastiknn/issues/160.

Now that block-based metadata are presumably available in Lucene, how would
I access this metadata?

I've read the WANDScorer.java code, but I couldn't quite understand how
exactly it is leveraging a block-max codec or block-based statistics. In my
own code, I'm exploring some ways to prune low-quality docs, and I figured
there might be some block-max metadata that I can access to improve the
pruning. I'm iterating over the docs matching each term using the
.advance() and .nextDoc() methods on a PostingsEnum. I don't see any
block-related methods on the PostingsEnum interface. I feel like I'm
missing something, hopefully something simple!

I appreciate any tips or examples!

Thanks,
Alex
Re: How to access block-max metadata?
Hi Alex,

The entry point for block-max metadata is TermsEnum#impacts (
https://lucene.apache.org/core/8_6_0/core/org/apache/lucene/index/TermsEnum.html#impacts(int))
which returns a view of the postings lists that includes block-max
metadata. In particular, see documentation for ImpactsSource#advanceShallow
and ImpactsSource#getImpacts (
https://lucene.apache.org/core/8_6_0/core/org/apache/lucene/index/ImpactsSource.html
).

You can look at ImpactsDISI to see how this metadata is turned into score
upper bounds in practice, which are in turn used to skip irrelevant
documents.
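
To make the "score upper bound" idea concrete, here is a minimal sketch in
plain Java. It is a toy model, not Lucene code: the Impact class only mimics
the (freq, norm) pairs that the impacts API exposes per block, and score() is
a hypothetical scoring function. The one assumption that matters is that the
score never decreases as freq grows and never increases as norm grows, so the
maximum over a block's impacts bounds the score of every document in it.

```java
import java.util.Arrays;
import java.util.List;

public class ImpactUpperBound {
    /** Toy stand-in for a (term frequency, length norm) pair kept per block. */
    static final class Impact {
        final int freq;
        final long norm;

        Impact(int freq, long norm) {
            this.freq = freq;
            this.norm = norm;
        }
    }

    /** Hypothetical scoring function: grows with freq, shrinks with norm. */
    static double score(double weight, int freq, long norm) {
        return weight * freq / (freq + norm);
    }

    /** Largest score any of the given impacts can produce (0.0 if there are none). */
    static double maxScore(double weight, List<Impact> impacts) {
        double max = 0.0;
        for (Impact impact : impacts) {
            max = Math.max(max, score(weight, impact.freq, impact.norm));
        }
        return max;
    }

    public static void main(String[] args) {
        // Impacts recorded for one block of postings (made-up values).
        List<Impact> block = Arrays.asList(
                new Impact(1, 1), new Impact(3, 4), new Impact(8, 16));
        System.out.println(maxScore(2.0, block)); // prints 1.0
    }
}
```

If that bound is below the score needed to enter the current top-k, no
document in the block has to be scored at all.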

On Mon, Oct 12, 2020 at 2:45 AM Alex K <aklibisz@gmail.com> wrote:



--
Adrien
Re: How to access block-max metadata?
Thanks Adrien. Very helpful.
The doc for ImpactsSource.advanceShallow says it's more efficient than
DocIdSetIterator.advance.
Is that because advanceShallow is skipping entire blocks at a time, whereas
advance is not?
One possible optimization I've explored involves skipping pruned docIDs. I
tried this using .advance() instead of .nextDoc(), but found the
improvement was negligible. I'm thinking maybe advanceShallow() would let
me get that speedup.
- AK

On Mon, Oct 12, 2020 at 2:59 AM Adrien Grand <jpountz@gmail.com> wrote:

Re: How to access block-max metadata?
advanceShallow is indeed faster than advance because it does less:
advanceShallow only advances the cursor for block-max metadata, which allows
reasoning about maximum scores without actually advancing the doc ID.
Calling advance implicitly calls advanceShallow as well.

If your optimization rarely helps skip entire blocks, then it's expected
that advance doesn't help much over nextDoc. advanceShallow is rarely a
drop-in replacement for advance, since it's unable to tell whether a
document matches or not; it can only be used to reason about maximum scores
for a range of doc IDs when combined with ImpactsSource#getImpacts.
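
A small sketch of why skipping only pays off when whole blocks can be ruled
out. This is a toy model in plain Java, not Lucene's implementation: Block,
Result, collect and sampleBlocks are hypothetical names, and each block simply
stores its own maximum score. The block-level check plays the role of the
cheap "shallow" step, and the inner loop plays the role of actually visiting
postings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BlockMaxSkipping {
    /** Toy block of postings: sorted doc IDs, their scores, and the block maximum. */
    static final class Block {
        final int[] docIds;
        final double[] scores;
        final double maxScore;

        Block(int[] docIds, double[] scores) {
            this.docIds = docIds;
            this.scores = scores;
            double max = 0.0;
            for (double s : scores) {
                max = Math.max(max, s);
            }
            this.maxScore = max;
        }
    }

    /** Matching doc IDs plus the number of postings that had to be visited. */
    static final class Result {
        final List<Integer> hits = new ArrayList<>();
        int docsVisited = 0;
    }

    /** Collects docs scoring at least minScore, skipping blocks whose bound is too low. */
    static Result collect(List<Block> blocks, double minScore) {
        Result result = new Result();
        for (Block block : blocks) {
            // Shallow step: only block-level metadata is consulted.
            if (block.maxScore < minScore) {
                continue; // no document in this block can qualify
            }
            // Deep step: every posting in the block is visited.
            for (int i = 0; i < block.docIds.length; i++) {
                result.docsVisited++;
                if (block.scores[i] >= minScore) {
                    result.hits.add(block.docIds[i]);
                }
            }
        }
        return result;
    }

    /** Three made-up blocks of three postings each. */
    static List<Block> sampleBlocks() {
        return Arrays.asList(
                new Block(new int[] {1, 4, 7}, new double[] {0.2, 0.3, 0.1}),
                new Block(new int[] {9, 12, 15}, new double[] {0.9, 0.1, 0.5}),
                new Block(new int[] {20, 21, 30}, new double[] {0.4, 0.2, 0.3}));
    }

    public static void main(String[] args) {
        Result high = collect(sampleBlocks(), 0.5);
        System.out.println(high.hits + " visited=" + high.docsVisited);
        Result low = collect(sampleBlocks(), 0.25);
        System.out.println(low.hits + " visited=" + low.docsVisited);
    }
}
```

With a threshold of 0.5, two of the three sample blocks are skipped and only
3 postings are visited. With a threshold of 0.25, every block's maximum clears
the bar, so all 9 postings are visited and the block metadata saves nothing.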

On Mon, Oct 12, 2020 at 5:21 PM Alex K <aklibisz@gmail.com> wrote:



--
Adrien
Re: How to access block-max metadata?
I see. I'm most likely rarely skipping a whole block's worth of docs, so
using advance() vs. nextDoc() doesn't make much of a difference.
All good to know. Thank you.

On Mon, Oct 12, 2020 at 11:42 AM Adrien Grand <jpountz@gmail.com> wrote:
