Mailing List Archive

Boolean Scorer
Doug,

It's a nuisance that BooleanScorer does not deliver documents in the correct
order and does not implement skipTo. It already broke my neck twice :-)

I know that you also had problems with that behavior of BooleanScorer several
times. Last but not least bug#32467 was caused by this. I am not completely
sure whether my fix is correct.

I think we should change BooleanScorer. An easy way would be to sort the bucket
list before it is used. Do you think that would affect performance dramatically?
Otherwise we should reimplement BooleanScorer. I haven't looked into the
DisjunctionScorer patch in Bugzilla yet. Maybe it's a good starting point.

Christoph


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
Re: Boolean Scorer [ In reply to ]
Christoph Goller wrote:
> I think we should change BooleanScorer. An easy way would be to sort the
> bucket
> list before it is used. Do you think that would affect performance
> dramatically?

I think it would make it slower.

> Otherwise we should reimplement BooleanScorer. I haven't looked into the
> DisjunctionScorer patch in Bugzilla yet. Maybe it's a good starting point.

I think we should incorporate Paul's code into CVS. This algorithm may
be slower in some cases, but it may also be faster in some cases. We
should add a static method to switch back to the old implementation, and
encourage folks to benchmark their code. If it proves no slower then we
could remove the old implementation altogether.

What do others think?

Paul's code is in:

http://issues.apache.org/bugzilla/show_bug.cgi?id=31785

Has anyone tried this?

Doug

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
RE: Boolean Scorer [ In reply to ]
I'm just a user and haven't tried Paul's DisjunctionScorer yet, but I
must say this sounds like a great idea. The ability to specialize
combineScores() is a significant advantage. I'd suggest adding
DisjunctionMaxScorer in addition to DisjunctionSumScorer and
DisjunctionSumCoordScorer in the default distribution, and adding a
variant to MultiFieldQueryParser that uses these to expand the query
differently (expand the innermost clauses to search the multiple fields
with Max rather than duplicating the original query across each field; I
have code to do that for my MaxDisjunctionQuery which could be trivially
changed to use DisjunctionMaxScorer if this was desired). I expect the
variant MultiFieldQueryParser would be what most users want most of the
time (e.g., it does a much better job searching for "red elephants"
across fields "title" and "description").

Chuck

> -----Original Message-----
> From: Doug Cutting [mailto:cutting@apache.org]
> Sent: Friday, December 10, 2004 12:36 PM
> To: Lucene Developers List
> Subject: Re: Boolean Scorer
>
> Christoph Goller wrote:
> > I think we should change BooleanScorer. An easy way would be to
sort
> the
> > bucket
> > list before it is used. Do you think that would affect performance
> > dramatically?
>
> I think it would make it slower.
>
> > Otherwise we should reimplement BooleanScorer. I haven't looked
into
> the
> > DisjunctionScorer patch in Bugzilla yet. Maybe it's a good
starting
> point.
>
> I think we should incorporate Paul's code into CVS. This algorithm
may
> be slower in some cases, but it may also be faster in some cases.
We
> should add a static method to switch back to the old implementation,
and
> encourage folks to benchmark their code. If it proves no slower
then we
> could remove the old implementation altogether.
>
> What do others think?
>
> Paul's code is in:
>
> http://issues.apache.org/bugzilla/show_bug.cgi?id=31785
>
> Has anyone tried this?
>
> Doug
>
>
---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
Re: Boolean Scorer [ In reply to ]
Chuck,

On Friday 10 December 2004 22:16, Chuck Williams wrote:
> I'm just a user and haven't tried Paul's DisjunctionScorer yet, but I
> must say this sounds like a great idea. The ability to specialize
> combineScores() is a significant advantage. ...

In the latest version I dropped the combineScores() and inlined
the summing (in advanceAfterCurrent()).
The main reason is that passing the scoring values of the scorers
in an array does not scale well, ie. it will always cost an amount of
work in proportion to the number of scorers, even if only one of
them actually has a score.

Changing this is should be no problem, though.

Regards,
Paul Elschot


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
RE: Boolean Scorer [ In reply to ]
Paul,

Would there be a way to get the best of both worlds? E.g., could you
factor the specializable score combination differently, so that one
method was called with each new score to generate a state entity, while
a final method computed the score from the state. For both sum and max,
the state entity could just be a float, not requiring an array. The
final operation for the sum with coord case would do the coord. I
haven't looked at the code carefully enough to see if this actually
works, but it seemed worth mentioning.

Chuck

> -----Original Message-----
> From: Paul Elschot [mailto:paul.elschot@xs4all.nl]
> Sent: Friday, December 10, 2004 1:45 PM
> To: lucene-dev@jakarta.apache.org
> Subject: Re: Boolean Scorer
>
> Chuck,
>
> On Friday 10 December 2004 22:16, Chuck Williams wrote:
> > I'm just a user and haven't tried Paul's DisjunctionScorer yet,
but I
> > must say this sounds like a great idea. The ability to specialize
> > combineScores() is a significant advantage. ...
>
> In the latest version I dropped the combineScores() and inlined
> the summing (in advanceAfterCurrent()).
> The main reason is that passing the scoring values of the scorers
> in an array does not scale well, ie. it will always cost an amount
of
> work in proportion to the number of scorers, even if only one of
> them actually has a score.
>
> Changing this is should be no problem, though.
>
> Regards,
> Paul Elschot
>
>
>
---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
Re: Boolean Scorer [ In reply to ]
On Friday 10 December 2004 21:35, Doug Cutting wrote:
> Christoph Goller wrote:
> > I think we should change BooleanScorer. An easy way would be to sort the
> > bucket
> > list before it is used. Do you think that would affect performance
> > dramatically?
>
> I think it would make it slower.
>
> > Otherwise we should reimplement BooleanScorer. I haven't looked into the
> > DisjunctionScorer patch in Bugzilla yet. Maybe it's a good starting point.
>
> I think we should incorporate Paul's code into CVS. This algorithm may
> be slower in some cases, but it may also be faster in some cases. We
> should add a static method to switch back to the old implementation, and
> encourage folks to benchmark their code. If it proves no slower then we
> could remove the old implementation altogether.
>

There may be an alternative to this in the form of adding skipTo() to the
current Boolean Scorer. Before I wrote the alternative
boolean scorer, I investigated this possibility shortly, but I did not
see how adding skipTo() could be done easily.
Nonetheless, it might be possible.

Here is some background on the alternative boolean scorer.
More information is in the posting on bugzilla and from the javadocs.
http://issues.apache.org/bugzilla/show_bug.cgi?id=31785

The core of the DisjunctionScorer is based on a simplification of
SpanOrQuery. In particular class DisjunctionScorer.ScorerQueue is a
simplified version of SpanOrQuery.SpanQueue in that it only needs to
use document numbers, but not term positions.

The existing ConjunctionScorer needed to be slightly extended to implement
NrMatchersScorer, which is a Scorer that also provides the number of
matching subscorers. The number of matchers is needed to provide
coordination factor back the level of the BooleanQuery through some
nested scorers.
In case the code of the alternative boolean is added in cvs, it might be
considered to merge the nrMatchers() method into the current Scorer.

To complete the alternative boolean scorer, I added scorers for
combining with prohibited scorers and for combining with optional scorers.
These combining scorers were available from an extension of the Surround
query language I posted in April this year.

Mapping the required, optional and prohibited scorers of a BooleanQuery
to a nesting of these combining scorers, DisjunctionScorer and
ConjunctionScorer was straightforward, but a bit tedious.
It is done by the make...SumScorer methods.

Regards,
Paul Elschot


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
Re: Boolean Scorer [ In reply to ]
Chuck,

On Friday 10 December 2004 23:12, Chuck Williams wrote:
> Paul,
>
> Would there be a way to get the best of both worlds? E.g., could you
> factor the specializable score combination differently, so that one
> method was called with each new score to generate a state entity, while
> a final method computed the score from the state. For both sum and max,
> the state entity could just be a float, not requiring an array. The
> final operation for the sum with coord case would do the coord. I
> haven't looked at the code carefully enough to see if this actually
> works, but it seemed worth mentioning.

It's simple enough to do some abstract method call instead of initializing
a sum or adding to it. The problem is that as long as such a call is not
effectively inlined by the JVM, it will cause a performance hit for the sum
case.

The latest version of the advanceAfterCurrent method that computes the
score is java protected. It can be overridden to make the best
of it in another world.

Regards,
Paul Elschot


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
RE: Boolean Scorer [ In reply to ]
I'd be surprised if the function call overhead was significant, but
nonetheless I can't argue with optimizing the sum case. However, it
would seem this could be achieved without losing the generality by
having DisjunctionScorer.advanceAfterCurrent() call the initialization
and accumulation methods, while DisjunctionSumScorer overrides
advanceAfterCurrent() to implement its optimization. This seems more
natural to me than having the general class be optimized for sum.

<soapbox>
I maintain the belief that max is *required* to implement reasonable
multi-field searching (1). I can't imagine a case where the current
MultiFieldQueryParser actually does the right thing. For users wishing
to take a typical query and have it search all fields of their
documents, they're going to get horrible results. Maybe they won't
notice -- I care about the quality of results, did notice, and was
surprised I had to write my own class. After all, Lucene is generally
an excellent search engine and it uses a multi-field-based document
model (which is a good thing). I would think that good results for
multi-field searching out-of-the-box, and therefore built-in support for
max, would be viewed as required.

It doesn't really matter to me, because I have made it work right, and
will be able to make it work right again with the new scheme. It's just
that I really like Lucene and am encouraging others to use it. I love
the performance and am glad there is such emphasis placed in this area.
I'm also happy there is serious attention paid to ensuring the software
is easy to specialize or otherwise customize. However, that same kind
of care does not seem to carry over to the quality of built-in relevance
ranking, nor to the quality and consistency of the scoring model in
general. In these areas, I must say Lucene is weak. Based on
experience in the commercial enterprise search engine market, this is
all too common, and the reason that most internal and site searches
produce such horrible results. IT people focus on the performance,
scalability and architecture only while the users are screaming that the
results are no good. I've seen this pattern many places.
</soapbox>

Chuck

(1) Actually MaxDisjunctionScorer does something a little more refined
-- it starts with max and then adds in a specified, presumably small,
constant times the sum of the other terms. The max part solves the
multi-field problem that is currently in Lucene; i.e., a result matching
multiple distinct query terms spread over multiple fields generally gets
a higher score than another result matching fewer query terms overall
but having the same number of matches in each field. The contribution
of the small constant times the sum over the remaining terms allows a
result where a term matches in multiple fields to rise above other
results matching the same total term set in the same fields but without
the multiple matches.

> -----Original Message-----
> From: Paul Elschot [mailto:paul.elschot@xs4all.nl]
> Sent: Saturday, December 11, 2004 2:05 PM
> To: lucene-dev@jakarta.apache.org
> Subject: Re: Boolean Scorer
>
> Chuck,
>
> On Friday 10 December 2004 23:12, Chuck Williams wrote:
> > Paul,
> >
> > Would there be a way to get the best of both worlds? E.g., could
you
> > factor the specializable score combination differently, so that
one
> > method was called with each new score to generate a state entity,
> while
> > a final method computed the score from the state. For both sum
and
> max,
> > the state entity could just be a float, not requiring an array.
The
> > final operation for the sum with coord case would do the coord. I
> > haven't looked at the code carefully enough to see if this
actually
> > works, but it seemed worth mentioning.
>
> It's simple enough to do some abstract method call instead of
> initializing
> a sum or adding to it. The problem is that as long as such a call is
not
> effectively inlined by the JVM, it will cause a performance hit for
the
> sum
> case.
>
> The latest version of the advanceAfterCurrent method that computes
the
> score is java protected. It can be overridden to make the best
> of it in another world.
>
> Regards,
> Paul Elschot
>
>
>
---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
Re: Boolean Scorer [ In reply to ]
Chuck,

On Sunday 12 December 2004 04:01, Chuck Williams wrote:
> I'd be surprised if the function call overhead was significant, but
> nonetheless I can't argue with optimizing the sum case. However, it

In case you have performance data, I'm all ears.

...

> <soapbox>
> I maintain the belief that max is *required* to implement reasonable
> multi-field searching (1). I can't imagine a case where the current

I agree that a maximum score over fields is useful.

...

> > -----Original Message-----
> > From: Paul Elschot [mailto:paul.elschot@xs4all.nl]
> > Sent: Saturday, December 11, 2004 2:05 PM
> > To: lucene-dev@jakarta.apache.org
> > Subject: Re: Boolean Scorer
> >
> > Chuck,
> >
> > On Friday 10 December 2004 23:12, Chuck Williams wrote:
> > > Paul,
> > >
> > > Would there be a way to get the best of both worlds? E.g., could
> you
> > > factor the specializable score combination differently, so that
> one
> > > method was called with each new score to generate a state entity,
> > while
> > > a final method computed the score from the state. For both sum
> and
> > max,
> > > the state entity could just be a float, not requiring an array.
> The
> > > final operation for the sum with coord case would do the coord. I
> > > haven't looked at the code carefully enough to see if this
> actually
> > > works, but it seemed worth mentioning.
> >
> > It's simple enough to do some abstract method call instead of
> > initializing
> > a sum or adding to it. The problem is that as long as such a call is
> not
> > effectively inlined by the JVM, it will cause a performance hit for
> the
> > sum
> > case.
> >
> > The latest version of the advanceAfterCurrent method that computes
> the
> > score is java protected. It can be overridden to make the best
> > of it in another world.

I'm afraid I should have left the last three words out. I'm sorry if they
can be interpreted in a negative way, that was not my intention.

Regards,
Paul Elschot


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
Re: Boolean Scorer [ In reply to ]
On Sunday 12 December 2004 04:01, Chuck Williams wrote:

> I maintain the belief that max is *required* to implement reasonable
> multi-field searching (1).

Could you give a small example -- preferably a test case -- that shows what
the problem is? I know it has been discussed before but I hadn't been able
to follow that discussion closely enough. I assume the problem is in the
scoring, not in MultiFieldQueryParser. MultiFieldQueryParser has a
different problem, namely that it doesn't correctly work with AND queries.
Or is that the issue you're talking about? Anyway, that will be fixed
soon.

Regards
Daniel

--
http://www.danielnaber.de

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
RE: Boolean Scorer [ In reply to ]
Daniel,

A perfectly reasonable request -- I'll put together a simple test case
but can't do it today.

The problem is with scoring -- nothing to do with and queries.

The test will run along these lines:
1. Use a custom similarity to eliminate all tf and idf effects, just
to isolate what is being tested.
2. Create two documents doc1 and doc2, each with two fields title and
description. doc1 has "elephant" in title and "elephant" in
description. doc2 has "elephant" in title and "albino" in description.
3. Express query for "albino elephant" against both fields.
Problems:
a. MultiFieldQueryParser won't recognize either document as
containing both terms, due to the way it expands the query across
fields.
b. Expressing query as "title:albino description:albino
title:elephant description:elephant" will score both documents
equivalently, since each matches two query terms.
4. Comparison to MaxDisjunctionQuery and my method for expanding
queries across fields. Using notation that () represents a BooleanQuery
and {} represents a MaxDisjunctionQuery, "albino elephant" expands to:
( {title:albino description:albino}
{title:elephant description:elephant} )
This will recognize that doc2 has both terms matched while doc1 only has
1 term matched, score doc2 over doc1.

Refinement note: the actual expansion for "albino query" that I use is:
( {title:albino description:albino}~0.1
{title:elephant description:elephant}~0.1 )
This causes the score of each MaxDisjunctionQuery to be the score of
highest scoring MDQ subclause plus 0.1 times the sum of the scores of
the other MDQ subclauses. Thus, doc1 gets some credit for also having
"elephant" in the description but only 1/10 as much as doc2 gets for
covering another query term in its description. If doc3 has "elephant"
in title and both "albino" and "elephant" in the description, then with
the actual refined expansion, it gets the highest score of all (whereas
with pure max, without the 0.1, it would get the same score as doc2).

In real apps, tf's and idf's also come into play of course, but can
affect these either way (i.e., mitigate this fundamental problem or
exacerbate it).

Chuck

> -----Original Message-----
> From: Daniel Naber [mailto:daniel.naber@t-online.de]
> Sent: Sunday, December 12, 2004 2:24 AM
> To: Lucene Developers List
> Subject: Re: Boolean Scorer
>
> On Sunday 12 December 2004 04:01, Chuck Williams wrote:
>
> > I maintain the belief that max is *required* to implement
reasonable
> > multi-field searching (1).
>
> Could you give a small example -- preferably a test case -- that
shows
> what
> the problem is? I know it has been discussed before but I hadn't
been
> able
> to follow that discussion closely enough. I assume the problem is in
the
> scoring, not in MultiFieldQueryParser. MultiFieldQueryParser has a
> different problem, namely that it doesn't correctly work with AND
> queries.
> Or is that the issue you're talking about? Anyway, that will be
fixed
> soon.
>
> Regards
> Daniel
>
> --
> http://www.danielnaber.de
>
>
---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
RE: Boolean Scorer [ In reply to ]
Daniel,

The test case is now attached as Bug #32674. It's commented with lines
from the email below to make the correspondence easy. Please let me
know your thoughts,

Chuck

> -----Original Message-----
> From: Chuck Williams [mailto:chuck@manawiz.com]
> Sent: Sunday, December 12, 2004 11:23 AM
> To: Lucene Developers List
> Subject: RE: Boolean Scorer
>
> Daniel,
>
> A perfectly reasonable request -- I'll put together a simple test
case
> but can't do it today.
>
> The problem is with scoring -- nothing to do with and queries.
>
> The test will run along these lines:
> 1. Use a custom similarity to eliminate all tf and idf effects,
just
> to isolate what is being tested.
> 2. Create two documents doc1 and doc2, each with two fields title
and
> description. doc1 has "elephant" in title and "elephant" in
> description. doc2 has "elephant" in title and "albino" in
description.
> 3. Express query for "albino elephant" against both fields.
> Problems:
> a. MultiFieldQueryParser won't recognize either document as
> containing both terms, due to the way it expands the query across
> fields.
> b. Expressing query as "title:albino description:albino
> title:elephant description:elephant" will score both documents
> equivalently, since each matches two query terms.
> 4. Comparison to MaxDisjunctionQuery and my method for expanding
> queries across fields. Using notation that () represents a
BooleanQuery
> and {} represents a MaxDisjunctionQuery, "albino elephant" expands
to:
> ( {title:albino description:albino}
> {title:elephant description:elephant} )
> This will recognize that doc2 has both terms matched while doc1 only
has
> 1 term matched, score doc2 over doc1.
>
> Refinement note: the actual expansion for "albino query" that I use
is:
> ( {title:albino description:albino}~0.1
> {title:elephant description:elephant}~0.1 )
> This causes the score of each MaxDisjunctionQuery to be the score of
> highest scoring MDQ subclause plus 0.1 times the sum of the scores
of
> the other MDQ subclauses. Thus, doc1 gets some credit for also
having
> "elephant" in the description but only 1/10 as much as doc2 gets for
> covering another query term in its description. If doc3 has
"elephant"
> in title and both "albino" and "elephant" in the description, then
with
> the actual refined expansion, it gets the highest score of all
(whereas
> with pure max, without the 0.1, it would get the same score as
doc2).
>
> In real apps, tf's and idf's also come into play of course, but can
> affect these either way (i.e., mitigate this fundamental problem or
> exacerbate it).
>
> Chuck
>
> > -----Original Message-----
> > From: Daniel Naber [mailto:daniel.naber@t-online.de]
> > Sent: Sunday, December 12, 2004 2:24 AM
> > To: Lucene Developers List
> > Subject: Re: Boolean Scorer
> >
> > On Sunday 12 December 2004 04:01, Chuck Williams wrote:
> >
> > > I maintain the belief that max is *required* to implement
> reasonable
> > > multi-field searching (1).
> >
> > Could you give a small example -- preferably a test case -- that
> shows
> > what
> > the problem is? I know it has been discussed before but I hadn't
> been
> > able
> > to follow that discussion closely enough. I assume the problem
is in
> the
> > scoring, not in MultiFieldQueryParser. MultiFieldQueryParser has
a
> > different problem, namely that it doesn't correctly work with
AND
> > queries.
> > Or is that the issue you're talking about? Anyway, that will be
> fixed
> > soon.
> >
> > Regards
> > Daniel
> >
> > --
> > http://www.danielnaber.de
> >
> >
>
---------------------------------------------------------------------
> > To unsubscribe, e-mail:
lucene-dev-unsubscribe@jakarta.apache.org
> > For additional commands, e-mail:
lucene-dev-help@jakarta.apache.org
>
>
>
---------------------------------------------------------------------
> To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
> For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org
Re: Boolean Scorer [ In reply to ]
Hello Arihant,

The Scorer for disjunctions uses a heap data structure that needs to be
reordered upon every hit. While reordering heaps is efficient as it runs in
logarithmic time, the fact that it needs to run on every document might add
non-negligible overhead. BooleanScorer tries to work around this overhead
by scoring large windows of documents in a more TAAT (term-at-a-time)
fashion so that Lucene only needs to reorder the heap every 2048 doc IDs
(the hardcoded window size).

This paper gives a bit more context:
http://www.savar.se/media/1181/space_optimizations_for_total_ranking.pdf,
see section 4 in particular.

On Sat, Jun 12, 2021 at 5:47 PM Arihant Samar <arisamjay@gmail.com> wrote:

> Hi ,
>
> I am new here . I would like to know what is the exact optimisation
> carried out in “Boolean Scorer.java” code which led to a separate class for
> resolving Boolean Queries in bulk documents. I could not find any material
> in the documentation for this as well, hence I decided to ask here.
>
>
> Thanking you in advance,
>
> Arihant.
>
>
>
> Sent from Mail <https://go.microsoft.com/fwlink/?LinkId=550986> for
> Windows 10
>
>
>


--
Adrien
Re: Boolean Scorer [ In reply to ]
Thanks for this explanation Adrien! I'd been wondering about this a bit
myself since seeing that DrillSideways also implements a TAAT approach (in
addition to a doc-at-a-time approach). This really helps clear that up.
Appreciate you taking the time to explain!

Cheers,
-Greg

On Mon, Jun 14, 2021 at 2:35 AM Adrien Grand <jpountz@gmail.com> wrote:

> Hello Arihant,
>
> The Scorer for disjunctions uses a heap data structure that needs to be
> reordered upon every hit. While reordering heaps is efficient as it runs in
> logarithmic time, the fact that it needs to run on every document might add
> non-negligible overhead. BooleanScorer tries to work around this overhead
> by scoring large windows of documents in a more TAAT (term-at-a-time)
> fashion so that Lucene only needs to reorder the heap every 2048 doc IDs
> (the hardcoded window size).
>
> This paper gives a bit more context:
> http://www.savar.se/media/1181/space_optimizations_for_total_ranking.pdf,
> see section 4 in particular.
>
> On Sat, Jun 12, 2021 at 5:47 PM Arihant Samar <arisamjay@gmail.com> wrote:
>
>> Hi ,
>>
>> I am new here . I would like to know what is the exact optimisation
>> carried out in “Boolean Scorer.java” code which led to a separate class for
>> resolving Boolean Queries in bulk documents. I could not find any material
>> in the documentation for this as well, hence I decided to ask here.
>>
>>
>> Thanking you in advance,
>>
>> Arihant.
>>
>>
>>
>> Sent from Mail <https://go.microsoft.com/fwlink/?LinkId=550986> for
>> Windows 10
>>
>>
>>
>
>
> --
> Adrien
>
Re: Boolean Scorer [ In reply to ]
Glad it helped. :)

On Tue, Jun 15, 2021 at 3:28 PM Greg Miller <gsmiller@gmail.com> wrote:

> Thanks for this explanation Adrien! I'd been wondering about this a bit
> myself since seeing that DrillSideways also implements a TAAT approach (in
> addition to a doc-at-a-time approach). This really helps clear that up.
> Appreciate you taking the time to explain!
>
> Cheers,
> -Greg
>
> On Mon, Jun 14, 2021 at 2:35 AM Adrien Grand <jpountz@gmail.com> wrote:
>
>> Hello Arihant,
>>
>> The Scorer for disjunctions uses a heap data structure that needs to be
>> reordered upon every hit. While reordering heaps is efficient as it runs in
>> logarithmic time, the fact that it needs to run on every document might add
>> non-negligible overhead. BooleanScorer tries to work around this overhead
>> by scoring large windows of documents in a more TAAT (term-at-a-time)
>> fashion so that Lucene only needs to reorder the heap every 2048 doc IDs
>> (the hardcoded window size).
>>
>> This paper gives a bit more context:
>> http://www.savar.se/media/1181/space_optimizations_for_total_ranking.pdf,
>> see section 4 in particular.
>>
>> On Sat, Jun 12, 2021 at 5:47 PM Arihant Samar <arisamjay@gmail.com>
>> wrote:
>>
>>> Hi ,
>>>
>>> I am new here . I would like to know what is the exact optimisation
>>> carried out in “Boolean Scorer.java” code which led to a separate class for
>>> resolving Boolean Queries in bulk documents. I could not find any material
>>> in the documentation for this as well, hence I decided to ask here.
>>>
>>>
>>> Thanking you in advance,
>>>
>>> Arihant.
>>>
>>>
>>>
>>> Sent from Mail <https://go.microsoft.com/fwlink/?LinkId=550986> for
>>> Windows 10
>>>
>>>
>>>
>>
>>
>> --
>> Adrien
>>
>

--
Adrien
Re: Boolean Scorer [ In reply to ]
Hi,
There is a function "ScoreWindowIntoBitSetAndReplay" in
"BooleanScorer.java" which runs over all the scorers.
I was wondering if we can use multi-threading here with numScorers threads.
Anyways we are using a special OrCollector here which updates the matching
array and the score in the buckets of 2048 docs. So we can use a Reentrant
lock for synchronization in the collector.

I just wanted reviews on this since I tried this and some tests were not
passing. So if you could tell what is wrong in this approach, I
would appreciate it.

Thanking You in advance,
Arihant.

On Tue, 15 Jun 2021, 19:05 Adrien Grand, <jpountz@gmail.com> wrote:

> Glad it helped. :)
>
> On Tue, Jun 15, 2021 at 3:28 PM Greg Miller <gsmiller@gmail.com> wrote:
>
>> Thanks for this explanation Adrien! I'd been wondering about this a bit
>> myself since seeing that DrillSideways also implements a TAAT approach (in
>> addition to a doc-at-a-time approach). This really helps clear that up.
>> Appreciate you taking the time to explain!
>>
>> Cheers,
>> -Greg
>>
>> On Mon, Jun 14, 2021 at 2:35 AM Adrien Grand <jpountz@gmail.com> wrote:
>>
>>> Hello Arihant,
>>>
>>> The Scorer for disjunctions uses a heap data structure that needs to be
>>> reordered upon every hit. While reordering heaps is efficient as it runs in
>>> logarithmic time, the fact that it needs to run on every document might add
>>> non-negligible overhead. BooleanScorer tries to work around this overhead
>>> by scoring large windows of documents in a more TAAT (term-at-a-time)
>>> fashion so that Lucene only needs to reorder the heap every 2048 doc IDs
>>> (the hardcoded window size).
>>>
>>> This paper gives a bit more context:
>>> http://www.savar.se/media/1181/space_optimizations_for_total_ranking.pdf,
>>> see section 4 in particular.
>>>
>>> On Sat, Jun 12, 2021 at 5:47 PM Arihant Samar <arisamjay@gmail.com>
>>> wrote:
>>>
>>>> Hi ,
>>>>
>>>> I am new here . I would like to know what is the exact optimisation
>>>> carried out in “Boolean Scorer.java” code which led to a separate class for
>>>> resolving Boolean Queries in bulk documents. I could not find any material
>>>> in the documentation for this as well, hence I decided to ask here.
>>>>
>>>>
>>>> Thanking you in advance,
>>>>
>>>> Arihant.
>>>>
>>>>
>>>>
>>>> Sent from Mail <https://go.microsoft.com/fwlink/?LinkId=550986> for
>>>> Windows 10
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Adrien
>>>
>>
>
> --
> Adrien
>
Re: Boolean Scorer [ In reply to ]
It should be possible to make something like this work. The main issue is
that Lucene has the expectation that a (Bulk)Scorer is consumed in the
thread where it was pulled, so this would require substantial changes to
how BooleanScorer currently operates I believe.

I'd be curious to know why you are looking into this rather than passing an
Executor to IndexSearcher so that it can search segments concurrently. Is
it not providing enough concurrency for you?

On Mon, Jun 21, 2021 at 9:46 AM Arihant Samar <arisamjay@gmail.com> wrote:

> Hi,
> There is a function "ScoreWindowIntoBitSetAndReplay" in
> "BooleanScorer.java" which runs over all the scorers.
> I was wondering if we can use multi-threading here with numScorers
> threads. Anyways we are using a special OrCollector here which updates the
> matching array and the score in the buckets of 2048 docs. So we can use a
> Reentrant lock for synchronization in the collector.
>
> I just wanted reviews on this since I tried this and some tests were not
> passing. So if you could tell what is wrong in this approach, I
> would appreciate it.
>
> Thanking You in advance,
> Arihant.
>
> On Tue, 15 Jun 2021, 19:05 Adrien Grand, <jpountz@gmail.com> wrote:
>
>> Glad it helped. :)
>>
>> On Tue, Jun 15, 2021 at 3:28 PM Greg Miller <gsmiller@gmail.com> wrote:
>>
>>> Thanks for this explanation Adrien! I'd been wondering about this a bit
>>> myself since seeing that DrillSideways also implements a TAAT approach (in
>>> addition to a doc-at-a-time approach). This really helps clear that up.
>>> Appreciate you taking the time to explain!
>>>
>>> Cheers,
>>> -Greg
>>>
>>> On Mon, Jun 14, 2021 at 2:35 AM Adrien Grand <jpountz@gmail.com> wrote:
>>>
>>>> Hello Arihant,
>>>>
>>>> The Scorer for disjunctions uses a heap data structure that needs to be
>>>> reordered upon every hit. While reordering heaps is efficient as it runs in
>>>> logarithmic time, the fact that it needs to run on every document might add
>>>> non-negligible overhead. BooleanScorer tries to work around this overhead
>>>> by scoring large windows of documents in a more TAAT (term-at-a-time)
>>>> fashion so that Lucene only needs to reorder the heap every 2048 doc IDs
>>>> (the hardcoded window size).
>>>>
>>>> This paper gives a bit more context:
>>>> http://www.savar.se/media/1181/space_optimizations_for_total_ranking.pdf,
>>>> see section 4 in particular.
>>>>
>>>> On Sat, Jun 12, 2021 at 5:47 PM Arihant Samar <arisamjay@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi ,
>>>>>
>>>>> I am new here . I would like to know what is the exact optimisation
>>>>> carried out in “Boolean Scorer.java” code which led to a separate class for
>>>>> resolving Boolean Queries in bulk documents. I could not find any material
>>>>> in the documentation for this as well, hence I decided to ask here.
>>>>>
>>>>>
>>>>> Thanking you in advance,
>>>>>
>>>>> Arihant.
>>>>>
>>>>>
>>>>>
>>>>> Sent from Mail <https://go.microsoft.com/fwlink/?LinkId=550986> for
>>>>> Windows 10
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Adrien
>>>>
>>>
>>
>> --
>> Adrien
>>
>

--
Adrien
Re: Boolean Scorer [ In reply to ]
TBH, the proposal sound like an overkill to me - IndexSearcher's
concurrency should be good enough (unless you are searching a single large
segment)

On Mon, 21 Jun 2021, 19:04 Adrien Grand, <jpountz@gmail.com> wrote:

> It should be possible to make something like this work. The main issue is
> that Lucene has the expectation that a (Bulk)Scorer is consumed in the
> thread where it was pulled, so this would require substantial changes to
> how BooleanScorer currently operates I believe.
>
> I'd be curious to know why you are looking into this rather than passing
> an Executor to IndexSearcher so that it can search segments concurrently.
> Is it not providing enough concurrency for you?
>
> On Mon, Jun 21, 2021 at 9:46 AM Arihant Samar <arisamjay@gmail.com> wrote:
>
>> Hi,
>> There is a function "ScoreWindowIntoBitSetAndReplay" in
>> "BooleanScorer.java" which runs over all the scorers.
>> I was wondering if we can use multi-threading here with numScorers
>> threads. Anyways we are using a special OrCollector here which updates the
>> matching array and the score in the buckets of 2048 docs. So we can use a
>> Reentrant lock for synchronization in the collector.
>>
>> I just wanted reviews on this since I tried this and some tests were not
>> passing. So if you could tell what is wrong in this approach, I
>> would appreciate it.
>>
>> Thanking You in advance,
>> Arihant.
>>
>> On Tue, 15 Jun 2021, 19:05 Adrien Grand, <jpountz@gmail.com> wrote:
>>
>>> Glad it helped. :)
>>>
>>> On Tue, Jun 15, 2021 at 3:28 PM Greg Miller <gsmiller@gmail.com> wrote:
>>>
>>>> Thanks for this explanation Adrien! I'd been wondering about this a bit
>>>> myself since seeing that DrillSideways also implements a TAAT approach (in
>>>> addition to a doc-at-a-time approach). This really helps clear that up.
>>>> Appreciate you taking the time to explain!
>>>>
>>>> Cheers,
>>>> -Greg
>>>>
>>>> On Mon, Jun 14, 2021 at 2:35 AM Adrien Grand <jpountz@gmail.com> wrote:
>>>>
>>>>> Hello Arihant,
>>>>>
>>>>> The Scorer for disjunctions uses a heap data structure that needs to
>>>>> be reordered upon every hit. While reordering heaps is efficient as it runs
>>>>> in logarithmic time, the fact that it needs to run on every document might
>>>>> add non-negligible overhead. BooleanScorer tries to work around this
>>>>> overhead by scoring large windows of documents in a more TAAT
>>>>> (term-at-a-time) fashion so that Lucene only needs to reorder the heap
>>>>> every 2048 doc IDs (the hardcoded window size).
>>>>>
>>>>> This paper gives a bit more context:
>>>>> http://www.savar.se/media/1181/space_optimizations_for_total_ranking.pdf,
>>>>> see section 4 in particular.
>>>>>
>>>>> On Sat, Jun 12, 2021 at 5:47 PM Arihant Samar <arisamjay@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi ,
>>>>>>
>>>>>> I am new here . I would like to know what is the exact optimisation
>>>>>> carried out in “Boolean Scorer.java” code which led to a separate class for
>>>>>> resolving Boolean Queries in bulk documents. I could not find any material
>>>>>> in the documentation for this as well, hence I decided to ask here.
>>>>>>
>>>>>>
>>>>>> Thanking you in advance,
>>>>>>
>>>>>> Arihant.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Sent from Mail <https://go.microsoft.com/fwlink/?LinkId=550986> for
>>>>>> Windows 10
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Adrien
>>>>>
>>>>
>>>
>>> --
>>> Adrien
>>>
>>
>
> --
> Adrien
>
Re: Boolean Scorer [ In reply to ]
There was a Jira relating to GPU acceleration where it was mentioned that
Boolean Scorer has possibilities of GPU usage.
So I was just checking first with multithreading in Java itself and
thought that this function may be amenable to parallelization.
Hence I was just giving it a try.
Will this not be useful if there are very long Boolean queries with a lot
of SHOULD clauses although I have no clue if this is a common situation.

I just need one more little help. Although some of the tests do give the
error Adrien mentioned that docs should be collected in the same thread
they were generated, but some tests also give wrong scores itself. Do you
see anything wrong in the synchronization I have done?
The synchronization I have done is basically creating an array of
matching.length size of Reentrant locks and just running the function
"ScoreWindowIntoBitSetAndReplay " with numScorer threads instead of the for
loop.
/// in BooleanScorer.java -> OrCollector -> collect function
Lock[idx].lock();
matching[idx] |= 1L << i;
final Bucket bucket = buckets[i];
bucket.freq++;
bucket.score += scorer.score();
Lock[idx].unlock();



On Mon, 21 Jun 2021 at 19:04, Adrien Grand <jpountz@gmail.com> wrote:

> It should be possible to make something like this work. The main issue is
> that Lucene has the expectation that a (Bulk)Scorer is consumed in the
> thread where it was pulled, so this would require substantial changes to
> how BooleanScorer currently operates I believe.
>
> I'd be curious to know why you are looking into this rather than passing
> an Executor to IndexSearcher so that it can search segments concurrently.
> Is it not providing enough concurrency for you?
>
> On Mon, Jun 21, 2021 at 9:46 AM Arihant Samar <arisamjay@gmail.com> wrote:
>
>> Hi,
>> There is a function "ScoreWindowIntoBitSetAndReplay" in
>> "BooleanScorer.java" which runs over all the scorers.
>> I was wondering if we can use multi-threading here with numScorers
>> threads. Anyways we are using a special OrCollector here which updates the
>> matching array and the score in the buckets of 2048 docs. So we can use a
>> Reentrant lock for synchronization in the collector.
>>
>> I just wanted reviews on this since I tried this and some tests were not
>> passing. So if you could tell what is wrong in this approach, I
>> would appreciate it.
>>
>> Thanking You in advance,
>> Arihant.
>>
>> On Tue, 15 Jun 2021, 19:05 Adrien Grand, <jpountz@gmail.com> wrote:
>>
>>> Glad it helped. :)
>>>
>>> On Tue, Jun 15, 2021 at 3:28 PM Greg Miller <gsmiller@gmail.com> wrote:
>>>
>>>> Thanks for this explanation Adrien! I'd been wondering about this a bit
>>>> myself since seeing that DrillSideways also implements a TAAT approach (in
>>>> addition to a doc-at-a-time approach). This really helps clear that up.
>>>> Appreciate you taking the time to explain!
>>>>
>>>> Cheers,
>>>> -Greg
>>>>
>>>> On Mon, Jun 14, 2021 at 2:35 AM Adrien Grand <jpountz@gmail.com> wrote:
>>>>
>>>>> Hello Arihant,
>>>>>
>>>>> The Scorer for disjunctions uses a heap data structure that needs to
>>>>> be reordered upon every hit. While reordering heaps is efficient as it runs
>>>>> in logarithmic time, the fact that it needs to run on every document might
>>>>> add non-negligible overhead. BooleanScorer tries to work around this
>>>>> overhead by scoring large windows of documents in a more TAAT
>>>>> (term-at-a-time) fashion so that Lucene only needs to reorder the heap
>>>>> every 2048 doc IDs (the hardcoded window size).
>>>>>
>>>>> This paper gives a bit more context:
>>>>> http://www.savar.se/media/1181/space_optimizations_for_total_ranking.pdf,
>>>>> see section 4 in particular.
>>>>>
>>>>> On Sat, Jun 12, 2021 at 5:47 PM Arihant Samar <arisamjay@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi ,
>>>>>>
>>>>>> I am new here . I would like to know what is the exact optimisation
>>>>>> carried out in “Boolean Scorer.java” code which led to a separate class for
>>>>>> resolving Boolean Queries in bulk documents. I could not find any material
>>>>>> in the documentation for this as well, hence I decided to ask here.
>>>>>>
>>>>>>
>>>>>> Thanking you in advance,
>>>>>>
>>>>>> Arihant.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Sent from Mail <https://go.microsoft.com/fwlink/?LinkId=550986> for
>>>>>> Windows 10
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Adrien
>>>>>
>>>>
>>>
>>> --
>>> Adrien
>>>
>>
>
> --
> Adrien
>
Re: Boolean Scorer [ In reply to ]
I managed to correct some mistakes and now the tests which checks scores
are passing. Obviously the tests which check about the same thread
generating and collecting fail , but just out of interest I removed those
asserts. Are there any tests or benchmarks which I can compare how these
changes perform.

Thanking you in advance,
Arihant.

On Tue, 22 Jun 2021 at 11:37, Arihant Samar <arisamjay@gmail.com> wrote:

> There was a Jira relating to GPU acceleration where it was mentioned that
> Boolean Scorer has possibilities of GPU usage.
> So I was just checking first with multithreading in Java itself and
> thought that this function may be amenable to parallelization.
> Hence I was just giving it a try.
> Will this not be useful if there are very long Boolean queries with a lot
> of SHOULD clauses although I have no clue if this is a common situation.
>
> I just need one more little help. Although some of the tests do give the
> error Adrien mentioned that docs should be collected in the same thread
> they were generated, but some tests also give wrong scores itself. Do you
> see anything wrong in the synchronization I have done?
> The synchronization I have done is basically creating an array of
> matching.length size of Reentrant locks and just running the function
> "ScoreWindowIntoBitSetAndReplay " with numScorer threads instead of the for
> loop.
> /// in BooleanScorer.java -> OrCollector -> collect function
> Lock[idx].lock();
> matching[idx] |= 1L << i;
> final Bucket bucket = buckets[i];
> bucket.freq++;
> bucket.score += scorer.score();
> Lock[idx].unlock();
>
>
>
> On Mon, 21 Jun 2021 at 19:04, Adrien Grand <jpountz@gmail.com> wrote:
>
>> It should be possible to make something like this work. The main issue is
>> that Lucene has the expectation that a (Bulk)Scorer is consumed in the
>> thread where it was pulled, so this would require substantial changes to
>> how BooleanScorer currently operates I believe.
>>
>> I'd be curious to know why you are looking into this rather than passing
>> an Executor to IndexSearcher so that it can search segments concurrently.
>> Is it not providing enough concurrency for you?
>>
>> On Mon, Jun 21, 2021 at 9:46 AM Arihant Samar <arisamjay@gmail.com>
>> wrote:
>>
>>> Hi,
>>> There is a function "ScoreWindowIntoBitSetAndReplay" in
>>> "BooleanScorer.java" which runs over all the scorers.
>>> I was wondering if we can use multi-threading here with numScorers
>>> threads. Anyways we are using a special OrCollector here which updates the
>>> matching array and the score in the buckets of 2048 docs. So we can use a
>>> Reentrant lock for synchronization in the collector.
>>>
>>> I just wanted reviews on this since I tried this and some tests were not
>>> passing. So if you could tell what is wrong in this approach, I
>>> would appreciate it.
>>>
>>> Thanking You in advance,
>>> Arihant.
>>>
>>> On Tue, 15 Jun 2021, 19:05 Adrien Grand, <jpountz@gmail.com> wrote:
>>>
>>>> Glad it helped. :)
>>>>
>>>> On Tue, Jun 15, 2021 at 3:28 PM Greg Miller <gsmiller@gmail.com> wrote:
>>>>
>>>>> Thanks for this explanation Adrien! I'd been wondering about this a
>>>>> bit myself since seeing that DrillSideways also implements a TAAT approach
>>>>> (in addition to a doc-at-a-time approach). This really helps clear that up.
>>>>> Appreciate you taking the time to explain!
>>>>>
>>>>> Cheers,
>>>>> -Greg
>>>>>
>>>>> On Mon, Jun 14, 2021 at 2:35 AM Adrien Grand <jpountz@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hello Arihant,
>>>>>>
>>>>>> The Scorer for disjunctions uses a heap data structure that needs to
>>>>>> be reordered upon every hit. While reordering heaps is efficient as it runs
>>>>>> in logarithmic time, the fact that it needs to run on every document might
>>>>>> add non-negligible overhead. BooleanScorer tries to work around this
>>>>>> overhead by scoring large windows of documents in a more TAAT
>>>>>> (term-at-a-time) fashion so that Lucene only needs to reorder the heap
>>>>>> every 2048 doc IDs (the hardcoded window size).
>>>>>>
>>>>>> This paper gives a bit more context:
>>>>>> http://www.savar.se/media/1181/space_optimizations_for_total_ranking.pdf,
>>>>>> see section 4 in particular.
>>>>>>
>>>>>> On Sat, Jun 12, 2021 at 5:47 PM Arihant Samar <arisamjay@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi ,
>>>>>>>
>>>>>>> I am new here . I would like to know what is the exact optimisation
>>>>>>> carried out in “Boolean Scorer.java” code which led to a separate class for
>>>>>>> resolving Boolean Queries in bulk documents. I could not find any material
>>>>>>> in the documentation for this as well, hence I decided to ask here.
>>>>>>>
>>>>>>>
>>>>>>> Thanking you in advance,
>>>>>>>
>>>>>>> Arihant.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Sent from Mail <https://go.microsoft.com/fwlink/?LinkId=550986> for
>>>>>>> Windows 10
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Adrien
>>>>>>
>>>>>
>>>>
>>>> --
>>>> Adrien
>>>>
>>>
>>
>> --
>> Adrien
>>
>