(prescript: I'm not attached to the names or terminology I'm using
here. Sometimes it is because I'm being sloppy, sometimes because I'm
wrong, sometimes because I've chosen a name different than the current
so I can contrast them, and sometimes because the current names don't
resonate for me. But in the end, we'll call things whatever you
want.)
I was able to think about how to pass positions a bit today, but am
still searching for better ideas. But I thought I'd bounce my current
thoughts off you.
My mental goal is to allow a Phrase scorer to sit on top of an
arbitrary tree of Boolean type scorers, and have enough information to
determine if the terms found occur in the document as a phrase. For
example, "a|b c" might parse to scorers nested as such:
Phrase(
And(
Or(a b)
c
)
)
It's not that the actual Phrase scorer has to work like this, but if
it did (and was efficient) I think we'd be in a good position to do
generalized proximity matching.
I'm thinking that every Scorer will have a Match struct, either within
it's Tally or standalone. I'm not against having it in the Tally (or
conjoined with it), but I haven't yet figured out how to make it work
that way.
The trickiness (and I don't like trickiness) is that each Match is
allowed to contain either an array of positions, or an array of Match
structs:
struct Match {
int field; // set to MATCH_HAS_SUBMATCHES if submatch array
// set to MATCH_MULTIPLE_FIELDS if necessary
// defaults to MATCH_INVALID_FIELD
// else field number that positions come from
int alloc; // set to 0 if static position data is referenced
// otherwise allocated length of positions
// (and lengths if lengths is non-null)
// arrays below are copy on write if alloc is 0
int count; // number of positions or submatches in data array
// also number in lengths array if lengths non-null
void *data; // used as union {
// // field == MATCH_HAS_SUBMATCHES ?
// Match **submatches; // :
// int *positions;
// } data;
int *lengths; // NULL means all 1, else 'int array[count]'
// (lengths is unused if MATCH_HAS_SUBMATCHES)
}
I'd hoped to be able to avoid the position/submatch duality, but I
couldn't come up with any way to flatten the results that seemed
efficient and general.
This structure, like Tally, would be created at Scorer init, and
used/reused over the life of the Scorer. For efficiency, it wouldn't
be refcounted, and the values would only be valid until the next call
to Scorer_Advance().
I think it can be very fast to create the positions, because the
pointers to the subscorers Match structs remain the same over the life
of a scorer.
For an And or Or scorer:
// at init:
self->match->field = MATCH_HAS_SUBMATCHES;
for (int i = 0; i < self->num_clauses; i++) {
// push adds match to data taking care of alloc and count
self->match->push(self->match, self->subclause[i]->match);
}
// per doc:
// nothing! it just points through to the static matches in the subscorers
// (although an Or scorer needs to be sure unmatched scorers are zeroed)
For a Term scorer:
// at init:
self->match->field = field_num; // as passed by query
// self->match->alloc left at zero
// per doc:
self->match->data = posting->positions;
self->match->count = posting->num_positions;
Things are going to be trickier for a the Phrase Scorer. Rather than
passing through Matches of its children, it will create its own. The
trickiness comes because if the subscorer Matches themselves have
submatches, it needs to descend to the bottom of these trees,
presumably treating the various bottom matches as alternative
possibilities for that position in the phrase. Then it will add the
positions (and lengths) of these phrases to its own Match struct. I'm
not sure how to deal with (or exclude) the possibility of phrases
occurring in multiple fields (submatches?). I'm not sure yet how to
do any of this efficiently.
I'll be able to continue thinking about it tomorrow. Criticisms and
suggestions for how to do it better would be appreciated.
Nathan Kurz
nate@verse.com
here. Sometimes it is because I'm being sloppy, sometimes because I'm
wrong, sometimes because I've chosen a name different than the current
so I can contrast them, and sometimes because the current names don't
resonate for me. But in the end, we'll call things whatever you
want.)
I was able to think about how to pass positions a bit today, but am
still searching for better ideas. But I thought I'd bounce my current
thoughts off you.
My mental goal is to allow a Phrase scorer to sit on top of an
arbitrary tree of Boolean type scorers, and have enough information to
determine if the terms found occur in the document as a phrase. For
example, "a|b c" might parse to scorers nested as such:
Phrase(
And(
Or(a b)
c
)
)
It's not that the actual Phrase scorer has to work like this, but if
it did (and was efficient) I think we'd be in a good position to do
generalized proximity matching.
I'm thinking that every Scorer will have a Match struct, either within
it's Tally or standalone. I'm not against having it in the Tally (or
conjoined with it), but I haven't yet figured out how to make it work
that way.
The trickiness (and I don't like trickiness) is that each Match is
allowed to contain either an array of positions, or an array of Match
structs:
struct Match {
int field; // set to MATCH_HAS_SUBMATCHES if submatch array
// set to MATCH_MULTIPLE_FIELDS if necessary
// defaults to MATCH_INVALID_FIELD
// else field number that positions come from
int alloc; // set to 0 if static position data is referenced
// otherwise allocated length of positions
// (and lengths if lengths is non-null)
// arrays below are copy on write if alloc is 0
int count; // number of positions or submatches in data array
// also number in lengths array if lengths non-null
void *data; // used as union {
// // field == MATCH_HAS_SUBMATCHES ?
// Match **submatches; // :
// int *positions;
// } data;
int *lengths; // NULL means all 1, else 'int array[count]'
// (lengths is unused if MATCH_HAS_SUBMATCHES)
}
I'd hoped to be able to avoid the position/submatch duality, but I
couldn't come up with any way to flatten the results that seemed
efficient and general.
This structure, like Tally, would be created at Scorer init, and
used/reused over the life of the Scorer. For efficiency, it wouldn't
be refcounted, and the values would only be valid until the next call
to Scorer_Advance().
I think it can be very fast to create the positions, because the
pointers to the subscorers Match structs remain the same over the life
of a scorer.
For an And or Or scorer:
// at init:
self->match->field = MATCH_HAS_SUBMATCHES;
for (int i = 0; i < self->num_clauses; i++) {
// push adds match to data taking care of alloc and count
self->match->push(self->match, self->subclause[i]->match);
}
// per doc:
// nothing! it just points through to the static matches in the subscorers
// (although an Or scorer needs to be sure unmatched scorers are zeroed)
For a Term scorer:
// at init:
self->match->field = field_num; // as passed by query
// self->match->alloc left at zero
// per doc:
self->match->data = posting->positions;
self->match->count = posting->num_positions;
Things are going to be trickier for a the Phrase Scorer. Rather than
passing through Matches of its children, it will create its own. The
trickiness comes because if the subscorer Matches themselves have
submatches, it needs to descend to the bottom of these trees,
presumably treating the various bottom matches as alternative
possibilities for that position in the phrase. Then it will add the
positions (and lengths) of these phrases to its own Match struct. I'm
not sure how to deal with (or exclude) the possibility of phrases
occurring in multiple fields (submatches?). I'm not sure yet how to
do any of this efficiently.
I'll be able to continue thinking about it tomorrow. Criticisms and
suggestions for how to do it better would be appreciated.
Nathan Kurz
nate@verse.com