Mailing List Archive

passing positions
(prescript: I'm not attached to the names or terminology I'm using
here. Sometimes it is because I'm being sloppy, sometimes because I'm
wrong, sometimes because I've chosen a name different than the current
so I can contrast them, and sometimes because the current names don't
resonate for me. But in the end, we'll call things whatever you
want.)

I was able to think about how to pass positions a bit today, but am
still searching for better ideas. But I thought I'd bounce my current
thoughts off you.

My mental goal is to allow a Phrase scorer to sit on top of an
arbitrary tree of Boolean type scorers, and have enough information to
determine if the terms found occur in the document as a phrase. For
example, "a|b c" might parse to scorers nested as such:
Phrase(
And(
Or(a b)
c
)
)
It's not that the actual Phrase scorer has to work like this, but if
it did (and was efficient) I think we'd be in a good position to do
generalized proximity matching.

I'm thinking that every Scorer will have a Match struct, either within
it's Tally or standalone. I'm not against having it in the Tally (or
conjoined with it), but I haven't yet figured out how to make it work
that way.

The trickiness (and I don't like trickiness) is that each Match is
allowed to contain either an array of positions, or an array of Match
structs:

struct Match {
int field; // set to MATCH_HAS_SUBMATCHES if submatch array
// set to MATCH_MULTIPLE_FIELDS if necessary
// defaults to MATCH_INVALID_FIELD
// else field number that positions come from

int alloc; // set to 0 if static position data is referenced
// otherwise allocated length of positions
// (and lengths if lengths is non-null)
// arrays below are copy on write if alloc is 0

int count; // number of positions or submatches in data array
// also number in lengths array if lengths non-null

void *data; // used as union {
// // field == MATCH_HAS_SUBMATCHES ?
// Match **submatches; // :
// int *positions;
// } data;

int *lengths; // NULL means all 1, else 'int array[count]'
// (lengths is unused if MATCH_HAS_SUBMATCHES)
}

I'd hoped to be able to avoid the position/submatch duality, but I
couldn't come up with any way to flatten the results that seemed
efficient and general.

This structure, like Tally, would be created at Scorer init, and
used/reused over the life of the Scorer. For efficiency, it wouldn't
be refcounted, and the values would only be valid until the next call
to Scorer_Advance().

I think it can be very fast to create the positions, because the
pointers to the subscorers Match structs remain the same over the life
of a scorer.

For an And or Or scorer:
// at init:
self->match->field = MATCH_HAS_SUBMATCHES;
for (int i = 0; i < self->num_clauses; i++) {
// push adds match to data taking care of alloc and count
self->match->push(self->match, self->subclause[i]->match);
}

// per doc:
// nothing! it just points through to the static matches in the subscorers
// (although an Or scorer needs to be sure unmatched scorers are zeroed)

For a Term scorer:
// at init:
self->match->field = field_num; // as passed by query
// self->match->alloc left at zero

// per doc:
self->match->data = posting->positions;
self->match->count = posting->num_positions;

Things are going to be trickier for a the Phrase Scorer. Rather than
passing through Matches of its children, it will create its own. The
trickiness comes because if the subscorer Matches themselves have
submatches, it needs to descend to the bottom of these trees,
presumably treating the various bottom matches as alternative
possibilities for that position in the phrase. Then it will add the
positions (and lengths) of these phrases to its own Match struct. I'm
not sure how to deal with (or exclude) the possibility of phrases
occurring in multiple fields (submatches?). I'm not sure yet how to
do any of this efficiently.

I'll be able to continue thinking about it tomorrow. Criticisms and
suggestions for how to do it better would be appreciated.

Nathan Kurz
nate@verse.com
Re: passing positions [ In reply to ]
On Jul 25, 2007, at 12:23 AM, Nathan Kurz wrote:

> (prescript: I'm not attached to the names or terminology I'm using
> here. Sometimes it is because I'm being sloppy, sometimes because I'm
> wrong, sometimes because I've chosen a name different than the current
> so I can contrast them, and sometimes because the current names don't
> resonate for me. But in the end, we'll call things whatever you
> want.)

FWIW, I've written up the naming principles I follow as a blog entry:
<http://www.rectangular.com/blog/my_name_is_variable.html>.

One of your inventions is Scorer_Advance. I like it as a substitute
for Scorer_Next, and it might be worth a global search and replace
since that method isn't public yet. :) However, in your code it
appears to be a substitute for Scorer_Skip_To. If that's correct, I
think it's not the best choice, because there's greater contrast in a
Scorer_Next:Scorer_Skip_To duality than a Scorer_Next:Scorer_Advance
duality.

> I was able to think about how to pass positions a bit today, but am
> still searching for better ideas. But I thought I'd bounce my current
> thoughts off you.
>
> My mental goal is to allow a Phrase scorer to sit on top of an
> arbitrary tree of Boolean type scorers, and have enough information to
> determine if the terms found occur in the document as a phrase. For
> example, "a|b c" might parse to scorers nested as such:
> Phrase(
> And(
> Or(a b)
> c
> )
> )
> It's not that the actual Phrase scorer has to work like this, but if
> it did (and was efficient) I think we'd be in a good position to do
> generalized proximity matching.

From a DRY standpoint, it would be nice to have a single
PhraseScorer working over sub-scorers rather than having one which
uses sub-Scorers and one which uses PostingLists. Let's proceed
under that assumption and mod PhraseScorer. We can go back and
unroll an auxiliary PostingLists version for efficiency if need be.
>
> I'm thinking that every Scorer will have a Match struct, either within
> it's Tally or standalone. I'm not against having it in the Tally (or
> conjoined with it), but I haven't yet figured out how to make it work
> that way.

I think similar reasoning led you to Match and me to Tally.

> The trickiness (and I don't like trickiness) is that each Match is
> allowed to contain either an array of positions, or an array of Match
> structs:

I doubt that's necessary. Just create a default wrapper at the
lowest level. That's how TermScorer does things presently.

> struct Match {
> int field; // set to MATCH_HAS_SUBMATCHES if submatch
> array
> // set to MATCH_MULTIPLE_FIELDS if necessary
> // defaults to MATCH_INVALID_FIELD
> // else field number that positions come from

This variable name violates my "avoid overload overload" rule. :)
"field" has a very specific meaning in the context of KS and this
isn't it.

>
> int alloc; // set to 0 if static position data is
> referenced
> // otherwise allocated length of positions
> // (and lengths if lengths is non-null)
> // arrays below are copy on write if alloc is 0



>
> int count; // number of positions or submatches in data
> array
> // also number in lengths array if lengths
> non-null
>
> void *data; // used as union {
> // // field ==
> MATCH_HAS_SUBMATCHES ?
> // Match **submatches; // :
> // int *positions;
> // } data;

I think we can avoid this union. See below.

>
> int *lengths; // NULL means all 1, else 'int array[count]'
> // (lengths is unused if MATCH_HAS_SUBMATCHES)
> }
>
> I'd hoped to be able to avoid the position/submatch duality, but I
> couldn't come up with any way to flatten the results that seemed
> efficient and general.

This was the driving factor behind the ScoreProx class.

The solution is general because each scorer presents an array of
ScoreProx in its Tally. That includes TermScorer, which presents a
single element array.

It's efficient because the ScoreProx objects persist. TermScorer
keeps just copies pointers into it from the Posting Object.
PhraseScorer keeps the same one around and reallocates space for the
positions on an as-needed basis.

A better name for the ScoreProx class would be appreciated. :) It's
the worst class name in KS, and the "num_sproxen" member var in Tally
is the worst member var name.

> This structure, like Tally, would be created at Scorer init, and
> used/reused over the life of the Scorer. For efficiency, it wouldn't
> be refcounted, and the values would only be valid until the next call
> to Scorer_Advance().

Yes. Just so. Tally does that.

> I think it can be very fast to create the positions, because the
> pointers to the subscorers Match structs remain the same over the life
> of a scorer.
>
> For an And or Or scorer:
> // at init:
> self->match->field = MATCH_HAS_SUBMATCHES;
> for (int i = 0; i < self->num_clauses; i++) {
> // push adds match to data taking care of alloc and count
> self->match->push(self->match, self->subclause[i]->match);
> }
>
> // per doc:
> // nothing! it just points through to the static matches in the
> subscorers
> // (although an Or scorer needs to be sure unmatched scorers are
> zeroed)

Collation of positions gets complicated when these scorers are nested.

> For a Term scorer:
> // at init:
> self->match->field = field_num; // as passed by query
> // self->match->alloc left at zero
>
> // per doc:
> self->match->data = posting->positions;
> self->match->count = posting->num_positions;

Yes.

> Things are going to be trickier for a the Phrase Scorer. Rather than
> passing through Matches of its children, it will create its own. The
> trickiness comes because if the subscorer Matches themselves have
> submatches, it needs to descend to the bottom of these trees,
> presumably treating the various bottom matches as alternative
> possibilities for that position in the phrase. Then it will add the
> positions (and lengths) of these phrases to its own Match struct. I'm
> not sure how to deal with (or exclude) the possibility of phrases
> occurring in multiple fields (submatches?). I'm not sure yet how to
> do any of this efficiently.

Probably we'll need a priority queue. Or some other sorting scheme.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
Re: passing positions [ In reply to ]
After sending you this message, I worked on this particular approach
enough to convince myself that it was a dead end. I've move on to
another approach that I hope you'll like better, and while I don't
have it working yet I think I'm convinced it can be made to work.
There is still a Match structure, but it doesn't nest. Instead, the
existing Scorer hierarchy is used. I'll send you more details in a
followup message, and respond more generally here.

> FWIW, I've written up the naming principles I follow as a blog entry:
> <http://www.rectangular.com/blog/my_name_is_variable.html>.

Yes, I can agree with most of that. Personally, I have less fear of
long identifiers and find 'string_compare' to be clearer to purpose
than 'scomp'. :)

> One of your inventions is Scorer_Advance. I like it as a substitute
> for Scorer_Next, and it might be worth a global search and replace
> since that method isn't public yet. :) However, in your code it
> appears to be a substitute for Scorer_Skip_To.

I'm hoping to collapse those two down to a single function.
Currently, I'm thinking that function is Scorer_Match(), to emphasize
that the contents of the Match struct are available only until the
next such call, in a manner parallel to Scorer_Tally().

> From a DRY standpoint, it would be nice to have a single
> PhraseScorer working over sub-scorers rather than having one which
> uses sub-Scorers and one which uses PostingLists.

Yes, I think that PhraseScorer should use a subscorer and not PostingLists.
That said, it may be simpler to restrict complexity of that subscorer
at least temporarily so that we don't have to start with a fully
recursive phrase scorer.

Something like allowing:
PhraseScorer -> AndScorer -> [TermScorer TermScorer TermScorer]
and not yet handling:
PhraseScorer -> AndScorer -.> [OrScorer PhraseScorer AndScorer]

> I think similar reasoning led you to Match and me to Tally.

Well, that and the hope that if I paralleled Match and Tally you'd
like the idea better :).

> > The trickiness (and I don't like trickiness) is that each Match is
> > allowed to contain either an array of positions, or an array of Match
> > structs:
>
> I doubt that's necessary. Just create a default wrapper at the
> lowest level. That's how TermScorer does things presently.

I fear the trickiness is still necessary at some level, but I think
I've managed to hide it in a place you'll like better. Essentially,
I'm going to propose two main subclasses for Scorer, MultiScorer and
MatchScorer. MultiScorer's contain a public VArray of other Scorer's,
while MatchScorer's contain a public Match struct.

> This variable name violates my "avoid overload overload" rule. :)
> "field" has a very specific meaning in the context of KS and this
> isn't it.

I agree with you in general, but I thought this was the specific
meaning. It's removed from Match in my new incarnation, but would
would you prefer it to be called: 'index_field', 'field_num'?

> I think we can avoid this union. See below.

Yes, it's gone in current incarnation. Unfortunately, what it
switches to is a run-time type check for OBJ_IS_A(MultiScorer).

> This was the driving factor behind the ScoreProx class.

I've forgotten the details, but I came to the conclusion that
ScoreProx was at odds with Rich Positions, and that to allow a
Proximity type scorer to use Positions specific weights some wider
interface was needed.

> A better name for the ScoreProx class would be appreciated. :) It's
> the worst class name in KS, and the "num_sproxen" member var in Tally
> is the worst member var name.

:)

> Collation of positions gets complicated when these scorers are nested.

It's possible we are defining terms differently here, but my current
plan is that there never will be any collation. Instead, the
MultiScorer's (AndScorer, OrScorer) will allow their children's Match
structs to be accessed directly. I tried to pursue collation at one
point, and gave up: positions from multiple fields, phrases of
different lengths. On the bright side, direct access is very
efficient!

More tomorrow about what I'm currently aiming for. It's still rough,
but I think you'll like it better than the initial proposal.

Nathan Kurz
nate@verse.com

_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
Re: passing positions [ In reply to ]
On Sep 6, 2007, at 12:17 AM, Nathan Kurz wrote:

>> One of your inventions is Scorer_Advance. I like it as a substitute
>> for Scorer_Next, and it might be worth a global search and replace
>> since that method isn't public yet. :) However, in your code it
>> appears to be a substitute for Scorer_Skip_To.
>
> I'm hoping to collapse those two down to a single function.

That would be very nice. I tried to pull that off, but I ran into
some problem. I don't remember what it was, though. :(

> Yes, I think that PhraseScorer should use a subscorer and not
> PostingLists.
> That said, it may be simpler to restrict complexity of that subscorer
> at least temporarily so that we don't have to start with a fully
> recursive phrase scorer.
>
> Something like allowing:
> PhraseScorer -> AndScorer -> [TermScorer TermScorer TermScorer]

Good plan. Can we do that now, in isolation from the rest of the
changes?

>> I think similar reasoning led you to Match and me to Tally.
>
> Well, that and the hope that if I paralleled Match and Tally you'd
> like the idea better :).

Heh.

>>> The trickiness (and I don't like trickiness) is that each Match is
>>> allowed to contain either an array of positions, or an array of
>>> Match
>>> structs:
>>
>> I doubt that's necessary. Just create a default wrapper at the
>> lowest level. That's how TermScorer does things presently.
>
> I fear the trickiness is still necessary at some level, but I think
> I've managed to hide it in a place you'll like better. Essentially,
> I'm going to propose two main subclasses for Scorer, MultiScorer and
> MatchScorer. MultiScorer's contain a public VArray of other Scorer's,
> while MatchScorer's contain a public Match struct.

Interesting. Do you end up with more subscorers than before?

>> This variable name violates my "avoid overload overload" rule. :)
>> "field" has a very specific meaning in the context of KS and this
>> isn't it.
>
> I agree with you in general, but I thought this was the specific
> meaning. It's removed from Match in my new incarnation, but would
> would you prefer it to be called: 'index_field', 'field_num'?

field_num.

"field", when it's used at all, means "field name". It used to mean
a Field object -- before I killed that class -- and that's still the
place it holds in concept-space.

>> This was the driving factor behind the ScoreProx class.
>
> I've forgotten the details, but I came to the conclusion that
> ScoreProx was at odds with Rich Positions, and that to allow a
> Proximity type scorer to use Positions specific weights some wider
> interface was needed.

I'm having trouble visualizing this. I wish there was a way to
divide and conquer this problem more effectively.

>> Collation of positions gets complicated when these scorers are
>> nested.
>
> It's possible we are defining terms differently here, but my current
> plan is that there never will be any collation. Instead, the
> MultiScorer's (AndScorer, OrScorer) will allow their children's Match
> structs to be accessed directly.

I think you need collation for the PhraseScorer. Say you're
iterating over positions in several subscorers. You have position 35
and 36; now you need 37. If you haven't kept track of where each
subscorer is at, you'll have to start from scratch with each one.
If you don't, and the subscorer has multiple subscorers itself, you
might miss something.

> I tried to pursue collation at one
> point, and gave up: positions from multiple fields, phrases of
> different lengths.

Yes, this is the same problem that thwarted me in my first go.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
Re: passing positions [ In reply to ]
On 9/7/07, Marvin Humphrey <marvin@rectangular.com> wrote:
> > Something like allowing:
> > PhraseScorer -> AndScorer -> [TermScorer TermScorer TermScorer]
>
> Good plan. Can we do that now, in isolation from the rest of the
> changes?

It's possible with your greater familiarity with the current code that
you could do so, but I haven't found a comfortable way to do it.

> > I'm going to propose two main subclasses for Scorer, MultiScorer and
> > MatchScorer. MultiScorer's contain a public VArray of other Scorer's,
> > while MatchScorer's contain a public Match struct.
>
> Interesting. Do you end up with more subscorers than before?

I'm not done yet, but I think it will end up the same or fewer than
before. The parent classes are new, but ANDOR and ANDNOT go away,
replaced by simple combinations of And, Or, and Not. Not is new (and
not yet done), but BooleanScorer goes. More importantly, I think
things like PhraseScorer and its unborn ilk will be simpler as they
won't have to duplicate the low-level work.

> > It's removed from Match in my new incarnation, but would
> > would you prefer it to be called: 'index_field', 'field_num'?
>
> field_num.

Agreed to be better, and changed.

> "field", when it's used at all, means "field name". It used to mean
> a Field object -- before I killed that class -- and that's still the
> place it holds in concept-space.

At one point I asked my grandfather about some directions he gave me
based on "the road where the bridge is out", and was suprised to learn
he'd never actually seen the bridge in the half-century he'd been in
that area, and that it must have washed out before he was born.
That was just how people referred to that road. :)

> I'm having trouble visualizing this. I wish there was a way to
> divide and conquer this problem more effectively.

I haven't found it yet. I think it might be possible, though, at
least in pieces. I'm hoping to get it working independent from the
existing code, and then work with you to integrate it. Hopefully once
I have a working model (ie, once I have the ocean at a rolling boil),
the ways it can be incrementally incorporated will become clearer.

> I think you need collation for the PhraseScorer. Say you're
> iterating over positions in several subscorers. You have position 35
> and 36; now you need 37. If you haven't kept track of where each
> subscorer is at, you'll have to start from scratch with each one.
> If you don't, and the subscorer has multiple subscorers itself, you
> might miss something.

We might just be defining 'collation' differently. I agree that one
needs to keep track the current position within each Scorer, but I
think this can be done with a pointer rather than a copy. I'll fire
off another message with my current version of PhraseScorer so you can
see what I'm doing. Likely we mean the same thing but are just
describing it differently.

Nathan Kurz
nate@verse.com

_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch