Mailing List Archive

Impact and WAND
Hi,

We were discussing some topics on https://github.com/apache/lucene-solr/pull/595, and Atri Sharma proposed continuing the discussion on the Java dev list.


As I understand it, impacts store `frequency` and `norm` just to accelerate the scoring process by terminating early.

In impact mode, `CompetitiveImpactAccumulator` records (freq, norm) pairs, which are stored at the index level. I also noted that `CompetitiveImpactAccumulator` is documented with `This class accumulates the (freq, norm) pairs that may produce competitive scores`, so maybe it is related to `WAND`?


The norm values are produced and consumed by `Lucene80NormsFormat`.

With impacts, can we avoid reading norms from `Lucene80NormsProducer`, which may generate extra IO (the norm value is then stored twice by Lucene), and take full advantage of the WAND method?
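For reference, here is a minimal sketch of how I understand
`CompetitiveImpactAccumulator` to behave (the (freq, norm) values below
are made up, not taken from a real index):

    import org.apache.lucene.codecs.CompetitiveImpactAccumulator;

    public class ImpactAccumulatorSketch {
      public static void main(String[] args) {
        CompetitiveImpactAccumulator acc = new CompetitiveImpactAccumulator();
        // hypothetical (freq, norm) pairs seen while writing one block of postings
        acc.add(3, 12);  // freq=3, norm=12
        acc.add(5, 12);  // dominates (3, 12): higher freq, same norm
        acc.add(2, 4);   // much shorter doc, may still be competitive
        // only the pairs that may produce competitive scores are kept
        System.out.println(acc.getCompetitiveFreqNormPairs());
      }
    }

If I read it correctly, a pair survives only if no other pair has at
least the same freq and at most the same norm.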
Re: Impact and WAND
To clarify, the scoring process is not accelerated because we
terminate early but because we can skip low-scoring matches (there
might be competitive hits at the very end of the index).

CompetitiveImpactAccumulator is indeed related to WAND. It helps store
the maximum score impacts per block of documents in postings lists.
Then this information is leveraged by block-max WAND in order to skip
low-scoring blocks.

This does indeed help avoid reading norms, but also document IDs and
term frequencies.
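
If it helps, here is a rough sketch of how those per-block impacts can
be inspected through the ImpactsEnum API (the index path, field and
term are placeholders, not from any particular setup):

    import java.nio.file.Paths;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.Impacts;
    import org.apache.lucene.index.ImpactsEnum;
    import org.apache.lucene.index.LeafReader;
    import org.apache.lucene.index.PostingsEnum;
    import org.apache.lucene.index.TermsEnum;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.BytesRef;

    public class InspectImpacts {
      public static void main(String[] args) throws Exception {
        try (DirectoryReader reader =
            DirectoryReader.open(FSDirectory.open(Paths.get("/tmp/index")))) {
          LeafReader leaf = reader.leaves().get(0).reader();
          TermsEnum terms = leaf.terms("body").iterator(); // assumes "body" is an indexed field
          if (terms.seekExact(new BytesRef("lucene"))) {
            ImpactsEnum impactsEnum = terms.impacts(PostingsEnum.FREQS);
            impactsEnum.advanceShallow(0); // position impact data without decoding postings
            Impacts impacts = impactsEnum.getImpacts();
            for (int level = 0; level < impacts.numLevels(); level++) {
              // level 0 covers the current block; higher levels cover larger doc ranges
              System.out.println("docs up to " + impacts.getDocIdUpTo(level)
                  + " -> competitive (freq, norm) pairs: " + impacts.getImpacts(level));
            }
          }
        }
      }
    }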




--
Adrien

Re: Impact and WAND
@Adrien Grand <jpountz@gmail.com>: Thanks for your reply.

The explanation about skipping low-scoring matches is great; I looked up some docs and inspected some related code.

I noticed that the block-max WAND mode only works when ScoreMode.TOP_SCORES is used, is that right? (A basic TermQuery generates an ImpactsDISI when the score mode is TOP_SCORES.)

Lucene computes the max score per block and caches it in `MaxScoreCache`. This means we can skip low-scoring blocks (currently a block is 128 doc IDs), while inside a competitive block every doc ID still has to be scored as it is seen. I am confused by `MaxScoreCache#getMaxScoreForLevel(int level)`: what does the level mean? A skip level? (Somewhere this method is invoked with an integer upTo parameter.)

Thanks Lucene Team


Re: Impact and WAND
Block-max WAND and other optimizations that improve the retrieval of
top hits (block-max WAND is about disjunctions, but we have
optimizations for conjunctions, phrases and boolean queries that mix
MUST and SHOULD clauses too) are only applied when the score mode is
TOP_SCORES indeed.

The level in MaxScoreCache is indeed a skip level. Impacts are stored
alongside skip data.
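
As a rough illustration, the collector's score mode is what enables
this. A TopScoreDocCollector with a finite totalHitsThreshold reports
TOP_SCORES once it no longer needs exact counts, while asking for exact
hit counts keeps every hit visited (the field, term and reader below
are placeholders):

    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopScoreDocCollector;

    public class ScoreModeSketch {
      public static void topHits(DirectoryReader reader) throws Exception {
        IndexSearcher searcher = new IndexSearcher(reader);
        Query query = new TermQuery(new Term("body", "lucene"));

        // totalHitsThreshold = 1000: once the threshold is reached the collector only
        // needs top scores, so the term scorer may use impacts to skip blocks.
        TopScoreDocCollector topScores = TopScoreDocCollector.create(10, 1000);
        searcher.search(query, topScores);

        // Integer.MAX_VALUE threshold: exact hit counts are required, every matching
        // document is visited and impacts give no benefit.
        TopScoreDocCollector exactCount = TopScoreDocCollector.create(10, Integer.MAX_VALUE);
        searcher.search(query, exactCount);
      }
    }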



--
Adrien

Re: Impact and WAND
Note that any other scoring mode (COMPLETE or COMPLETE_NO_SCORES) must
visit all hits, so there is no scope for skipping and hence no point in
using impacts.
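
For example, a hypothetical collector like the one below reports
COMPLETE_NO_SCORES, so every hit has to be visited and impacts cannot
help:

    import java.io.IOException;
    import org.apache.lucene.search.ScoreMode;
    import org.apache.lucene.search.SimpleCollector;

    // Illustrative only: counts matches without scoring them.
    public class CountingCollector extends SimpleCollector {
      private int count;

      @Override
      public void collect(int doc) throws IOException {
        count++; // every matching doc must be seen, so nothing can be skipped
      }

      @Override
      public ScoreMode scoreMode() {
        return ScoreMode.COMPLETE_NO_SCORES;
      }

      public int getCount() {
        return count;
      }
    }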


Re: Impact and WAND
I was looking at the nightly benchmarks[1] and noticed a big jump in
performance for conjunction queries when LUCENE-8060 was merged. I was
puzzled because I didn't expect BMW to help with this type of query, but I
guess that's the "other optimizations" you were talking about? Do you have
any pointers to those?


[1] https://home.apache.org/~mikemccand/lucenebench

Re: Impact and WAND
Hi,

Indeed BMW is only about disjunctions, but the paper
(http://engineering.nyu.edu/~suel/papers/bmw.pdf) briefly describes how
block-max indexes can be used to speed up conjunctions as well, using a
simple algorithm that they call Block Max And, which we implemented in the
BlockMaxConjunctionScorer class.
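
At a very high level the idea looks like the sketch below. This is only
an illustration of the approach, not the actual
BlockMaxConjunctionScorer code (clause ordering and other details are
simplified):

    import java.io.IOException;
    import org.apache.lucene.search.DocIdSetIterator;
    import org.apache.lucene.search.Scorer;

    public class BlockMaxAndSketch {
      /** Returns the first candidate doc whose block might contain a competitive hit. */
      static int nextCompetitiveCandidate(Scorer[] clauses, int candidate,
          float minCompetitiveScore) throws IOException {
        while (candidate != DocIdSetIterator.NO_MORE_DOCS) {
          // advanceShallow positions the impact data; its return value bounds the current block
          int blockEnd = DocIdSetIterator.NO_MORE_DOCS;
          for (Scorer clause : clauses) {
            blockEnd = Math.min(blockEnd, clause.advanceShallow(candidate));
          }
          // the sum of per-clause max scores over this block upper-bounds any hit's score
          float maxScoreSum = 0;
          for (Scorer clause : clauses) {
            maxScoreSum += clause.getMaxScore(blockEnd);
          }
          if (maxScoreSum >= minCompetitiveScore) {
            return candidate; // this block may contain a hit that beats the current top-k
          }
          if (blockEnd == DocIdSetIterator.NO_MORE_DOCS) {
            return DocIdSetIterator.NO_MORE_DOCS;
          }
          // the whole block is non-competitive: jump the lead clause past it
          candidate = clauses[0].iterator().advance(blockEnd + 1);
        }
        return candidate;
      }
    }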

Re: Impact and WAND
Ah, great. Thanks!
