Mailing List Archive

positional scoring
Hi Marvin ---

I'm still trying to figure out how to architect the custom scoring apparatus
I want, and I could use someone to talk a little sense into me. I
fear I'm heading down a blind alley. Since part of what I want could
be viewed as a generic proximity scorer, it's possible you've already
thought this through better than I've been able to.

I started with what seemed like the reasonable assumption: I wanted a
proximity scorer that was able to wrap itself around an existing
scorer. It would, for example, have an internal ANDNotScorer, which
would run to produce matching documents. The ANDNot scorer would make
the positions of the matches available to its parent, and the
ProximityScorer would use these positions to determine a score. The
alternative would be to have a more complex ProximityScorer, which
finds the matching document as well as assessing the positions of the
matches. It seemed cleaner and more flexible to keep each level
simpler, but I'm finding it more complex to implement than I expected.

It's probably easiest to think about this in terms of the existing
PhraseScorer, which currently takes the do-it-all approach. What
would it take to redesign PhraseScorer to so that rather than finding
documents that contain all terms and then checking their positions, it
would wrap an internal ANDScorer and just do the position checking?
And what would be the benefits and costs of this approach?

The main change would be that the ANDScorer would have to allow the
parent PhraseScorer access to its children's position records. In the
simple case where the ANDScorer's children are TermScorers, this is
fairly simple: it can just create a Positions array, with one element
for each of its children[*]. But if the children are potentially
complex (subphrases, for example), the ANDScorer must create a record
that includes both the positions and the lengths of all its children's
matches, otherwise the parent PhraseScorer will misjudge positions.

The benefit of doing this would be that one can simplify some of the
existing scorers and that it would be easier to add layers of
encapsulating scorers. It would make the layers more like building
blocks that can be combined at will. The disadvantage would be that it
is expensive to always create this composite position record. One can
avoid most of this expense by having a 'want_positions' flag that is
set when the subscorer is created, but this starts adding complexity
back. One could also make the length records default to 1, so that
existing position records can be used in situ.

You mentioned you were interested in proximity scoring. Is this a
direction that interests you? I think it has potential if one is
trying to design a customizable search toolkit. Or would I do better
to concentrate on a solution that works well for the exact problem I'm
trying to solve without trying to generalize it yet? The generalized
solution seems like a lot more work, and probably isn't worth doing
unless you think it would be useful elsewhere.

Thanks!

Nathan Kurz
nate@verse.com


[*] An ORScorer by contrast would merge all the positions into a
single element.
positional scoring [ In reply to ]
On Jul 9, 2007, at 4:05 PM, Nathan Kurz wrote:

Nathan, please forgive the brevity of my response... I'm beat...

> Hi Marvin ---
>
> I'm still trying to figure out how to architect the custom scoring
> apparatus
> I want, and I could use someone to talk a little sense into me. I
> fear I'm heading down a blind alley. Since part of what I want could
> be viewed as a generic proximity scorer, it's possible you've already
> thought this through better than I've been able to.
>
> I started with what seemed like the reasonable assumption: I wanted a
> proximity scorer that was able to wrap itself around an existing
> scorer. It would, for example, have an internal ANDNotScorer, which
> would run to produce matching documents. The ANDNot scorer would make
> the positions of the matches available to its parent, and the
> ProximityScorer would use these positions to determine a score. The
> alternative would be to have a more complex ProximityScorer, which
> finds the matching document as well as assessing the positions of the
> matches. It seemed cleaner and more flexible to keep each level
> simpler, but I'm finding it more complex to implement than I expected.

Tally is the vessel that's intended to facilitate this. Instead of
just asking for scorer.score() as in Lucene, you ask for Scorer_Tally
() in KS, which is supposed to be an extendable object for holding a
score, a doc num, and any other relevant data -- such as arrays of
matching positions.

The problem is that you need to keep separate positions arrays for
each field.
>
> It's probably easiest to think about this in terms of the existing
> PhraseScorer, which currently takes the do-it-all approach. What
> would it take to redesign PhraseScorer to so that rather than finding
> documents that contain all terms and then checking their positions, it
> would wrap an internal ANDScorer and just do the position checking?
> And what would be the benefits and costs of this approach?
>
> The main change would be that the ANDScorer would have to allow the
> parent PhraseScorer access to its children's position records. In the
> simple case where the ANDScorer's children are TermScorers, this is
> fairly simple: it can just create a Positions array, with one element
> for each of its children[*]. But if the children are potentially
> complex (subphrases, for example), the ANDScorer must create a record
> that includes both the positions and the lengths of all its children's
> matches, otherwise the parent PhraseScorer will misjudge positions.
>
> The benefit of doing this would be that one can simplify some of the
> existing scorers and that it would be easier to add layers of
> encapsulating scorers. It would make the layers more like building
> blocks that can be combined at will. The disadvantage would be that it
> is expensive to always create this composite position record.

Yes, absolutely true. However, for individual TermScorers, we can
use pointers to the original Posting from the PostingList, which has
all the position data already. It's only the aggregating scorers
that need to build the composite position data.l

> One can
> avoid most of this expense by having a 'want_positions' flag that is
> set when the subscorer is created, but this starts adding complexity
> back.

I think it ought to be done with dedicated subclasses.

> One could also make the length records default to 1, so that
> existing position records can be used in situ.

I intentionally designed Posting with this application in mind...

>
> You mentioned you were interested in proximity scoring. Is this a
> direction that interests you? I think it has potential if one is
> trying to design a customizable search toolkit. Or would I do better
> to concentrate on a solution that works well for the exact problem I'm
> trying to solve without trying to generalize it yet? The generalized
> solution seems like a lot more work, and probably isn't worth doing
> unless you think it would be useful elsewhere.

A position-aware scoring suite would be very very nice. It would
improve IR precision, at the cost of some memory and some CPU. I
wanted to put something like this in 0.20, but as I mentioned in my
last note, it's getting moved to a Project To Be Named Later.

Maybe a KSx::Search::PositionQueries should be the first KSx Scorer
subclass distro. :)

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/