Hi all,
There was some fairly recent work in Lucene to introduce Block-Max WAND
Scoring (paper:
https://cs.uwaterloo.ca/~jimmylin/publications/Grand_etal_ECIR2020_preprint.pdf
, issue: https://issues.apache.org/jira/browse/LUCENE-8135).
I've been working on a use case where I need very efficient top-k scoring
for hundreds of query terms (usually between 300 and 600 terms, k between 100
and 10000, each term contributing a simple TF-IDF score). There's some
discussion here: https://github.com/alexklibisz/elastiknn/issues/160.
Now that block-based metadata are presumably available in Lucene, how would
I access this metadata?
I've read the WANDScorer.java code, but I couldn't quite understand how
exactly it is leveraging a block-max codec or block-based statistics. In my
own code, I'm exploring some ways to prune low-quality docs, and I figured
there might be some block-max metadata that I can access to improve the
pruning. I'm iterating over the docs matching each term using the
.advance() and .nextDoc() methods on a PostingsEnum. I don't see any
block-related methods on PostingsEnum itself. I feel like I'm missing
something, hopefully something simple!
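To make the kind of pruning I mean concrete, here is a toy, self-contained
sketch. It is not Lucene code and uses no Lucene APIs: Postings, Hit, topK,
the fixed block size and the doc-id window are all names and parameters made
up for illustration. It assumes non-negative per-term scores and k >= 1. The
idea is that if the per-block max scores, summed over all terms, cannot beat
the current k-th best score for a window of doc ids, the window is skipped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class BlockMaxSketch {

    /** Hypothetical in-memory postings: sorted doc ids, one score per posting, one max per block. */
    static final class Postings {
        final int[] docs;
        final float[] scores;
        final int blockSize;
        final float[] blockMax;

        Postings(int[] docs, float[] scores, int blockSize) {
            this.docs = docs;
            this.scores = scores;
            this.blockSize = blockSize;
            this.blockMax = new float[(docs.length + blockSize - 1) / blockSize];
            for (int i = 0; i < docs.length; i++) {
                blockMax[i / blockSize] = Math.max(blockMax[i / blockSize], scores[i]);
            }
        }

        /** Upper bound on this term's score for any doc id in [lo, hi]. */
        float maxScore(int lo, int hi) {
            float max = 0f;
            for (int b = 0; b < blockMax.length; b++) {
                int first = docs[b * blockSize];
                int last = docs[Math.min(docs.length, (b + 1) * blockSize) - 1];
                if (first <= hi && last >= lo) max = Math.max(max, blockMax[b]);
            }
            return max;
        }
    }

    static final class Hit {
        final int doc;
        final float score;

        Hit(int doc, float score) {
            this.doc = doc;
            this.score = score;
        }
    }

    /** Scores doc ids window by window; with prune=true, skips windows whose bound cannot enter the heap. */
    static List<Hit> topK(Postings[] terms, int k, int numDocs, int window, boolean prune) {
        PriorityQueue<Hit> heap = new PriorityQueue<>(Comparator.comparingDouble((Hit h) -> h.score));
        int[] pos = new int[terms.length];   // one cursor per term
        float[] acc = new float[window];     // score accumulator for the current window
        for (int lo = 0; lo < numDocs; lo += window) {
            int hi = Math.min(numDocs, lo + window) - 1;
            if (prune && heap.size() == k) {
                float bound = 0f;
                for (Postings t : terms) bound += t.maxScore(lo, hi);
                if (bound <= heap.peek().score) continue; // no doc in this window can be competitive
            }
            Arrays.fill(acc, 0f);
            for (int t = 0; t < terms.length; t++) {
                Postings p = terms[t];
                while (pos[t] < p.docs.length && p.docs[pos[t]] < lo) pos[t]++; // catch up after skips
                while (pos[t] < p.docs.length && p.docs[pos[t]] <= hi) {
                    acc[p.docs[pos[t]] - lo] += p.scores[pos[t]];
                    pos[t]++;
                }
            }
            for (int i = 0; i <= hi - lo; i++) {
                if (acc[i] <= 0f) continue; // doc matched no term
                if (heap.size() < k) {
                    heap.add(new Hit(lo + i, acc[i]));
                } else if (acc[i] > heap.peek().score) {
                    heap.poll();
                    heap.add(new Hit(lo + i, acc[i]));
                }
            }
        }
        List<Hit> out = new ArrayList<>(heap);
        out.sort(Comparator.comparingDouble((Hit h) -> -h.score).thenComparingInt(h -> h.doc));
        return out;
    }

    public static void main(String[] args) {
        Postings a = new Postings(new int[] {0, 5, 9, 40}, new float[] {1f, 2f, 1f, 3f}, 2);
        Postings b = new Postings(new int[] {5, 41, 90}, new float[] {2f, 1f, 0.5f}, 2);
        for (Hit h : topK(new Postings[] {a, b}, 2, 100, 10, true)) {
            System.out.println(h.doc + " " + h.score);
        }
    }
}
```

With the two toy terms in main, this reports doc 5 first and doc 40 second,
and windows such as [50, 59] are never scored because their bound is 0. What
I'm after is whatever plays the role of blockMax and maxScore(lo, hi) in
Lucene.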
I appreciate any tips or examples!
Thanks,
Alex