On Mon, Sep 17, 2007 at 02:14:00PM -0600, Nathan Kurz wrote:
> I'm trying to understand better how TermScorers make use of the
> underlying Postings. I thought it was going to be straightforward,
> but once I get under the hood I get lost. I think I understand the
> terms as you define them in the IRTheory doc, but I have trouble
> matching these concepts up with the actual classes.
>
> When you have time, could you offer a couple paragraphs on the
> lifecycle of a Posting, with a one-sentence gloss of the relevant
> classes?

There's one thing that's really wacky about PostingList and the Postings that
TermScorer sees.

malloc() and free() are expensive ops, and a hell of a lot of Postings go by
during scoring.

So... to save time, PList_Bulk_Read doesn't actually create individual Posting
objects. It reads new data into the *same* master Posting over and over and
stacks copies of the master end to end within a ByteBuf. Instead of creating
and destroying many, many Postings, we create and destroy a single ByteBuf.
These copies are what the Scorers actually see.
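
To make the trick concrete, here is a minimal sketch of the "one master,
many stacked copies" idea. It is not the KinoSearch source: BulkPosting,
BulkBuf, bulk_read and bulk_posting_at are hypothetical stand-ins for
Posting, ByteBuf and PList_Bulk_Read, and the "encoded" input is just an
array of (doc_num, freq) pairs.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, stripped-down posting; real Postings carry more data. */
typedef struct {
    unsigned doc_num;
    unsigned freq;
} BulkPosting;

/* Hypothetical growable byte buffer standing in for ByteBuf. */
typedef struct {
    char  *ptr;
    size_t len;   /* bytes in use */
    size_t cap;   /* bytes allocated */
} BulkBuf;

/* Decode `count` postings from `src` (pairs of doc_num, freq).  Each one is
 * read into the same master posting, then a copy of the master is appended
 * to `buf`.  Returns the number stacked, or 0 if allocation fails. */
static size_t
bulk_read(BulkBuf *buf, BulkPosting *master, const unsigned *src,
          size_t count)
{
    size_t needed = count * sizeof(BulkPosting);
    assert(buf != NULL && master != NULL);
    if (needed > buf->cap) {
        /* At most one allocation per batch, not one per posting. */
        char *grown = realloc(buf->ptr, needed);
        if (grown == NULL) return 0;
        buf->ptr = grown;
        buf->cap = needed;
    }
    buf->len = 0;
    for (size_t i = 0; i < count; i++) {
        master->doc_num = src[2 * i];
        master->freq    = src[2 * i + 1];
        memcpy(buf->ptr + buf->len, master, sizeof(BulkPosting));
        buf->len += sizeof(BulkPosting);
    }
    return count;
}

/* What a scorer would look at: the i-th stacked copy. */
static const BulkPosting*
bulk_posting_at(const BulkBuf *buf, size_t i)
{
    return (const BulkPosting*)(buf->ptr + i * sizeof(BulkPosting));
}
```

A caller would keep one BulkBuf and one master around for the life of the
PostingList, call bulk_read() once per batch, and free buf.ptr at the end.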

With that out of the way, let's step back.

The order of creation is...

    TermQuery
    TermWeight
    PostingList
    Posting (within PostingList constructor)
    TermScorer
    [many copies of Posting]

A lot happens in TermWeight->scorer():

    sub scorer {
        my ( $self, $reader ) = @_;
        my $term  = $self->{parent}{term};
        my $plist = $reader->posting_list( term => $term );

        # Bail out if there is no posting list or no documents match.
        return unless defined $plist;
        return unless $plist->get_doc_freq;

        # Delegate scorer creation to the PostingList.
        return $plist->make_scorer(
            similarity   => $self->{similarity},
            weight       => $self,
            weight_value => $self->get_value,
        );
    }

The PostingList contains a single master Posting object, of whatever type is
appropriate for the field -- almost always ScorePosting, for now.

TermWeight->scorer() has to delegate to the PostingList object because it
doesn't know what kind of TermScorer subclass to create. PostingList *also*
has to delegate, because it doesn't know *either*. Only the master Posting
knows.
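
The double delegation can be sketched with a function pointer on the
posting. This is an illustration under assumed names, not the real class
layout: DispPosting, DispPostingList, DispScorer and every function below
are hypothetical, and the "scorer" is reduced to a tag plus the weight value.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical scorer kinds, one per posting type. */
typedef enum { DISP_SCORE_SCORER, DISP_OTHER_SCORER } DispScorerKind;

typedef struct {
    DispScorerKind kind;
    float          weight_value;
} DispScorer;

typedef struct DispPosting DispPosting;

/* Only the posting knows which scorer matches its own data layout. */
struct DispPosting {
    DispScorer (*make_scorer)(const DispPosting *self, float weight_value);
};

typedef struct {
    const DispPosting *master;   /* the single master posting */
} DispPostingList;

/* Scorer factory for a ScorePosting-like type. */
static DispScorer
score_posting_make_scorer(const DispPosting *self, float weight_value)
{
    DispScorer scorer = { DISP_SCORE_SCORER, weight_value };
    (void)self;
    return scorer;
}

/* Scorer factory for a second, made-up posting type. */
static DispScorer
other_posting_make_scorer(const DispPosting *self, float weight_value)
{
    DispScorer scorer = { DISP_OTHER_SCORER, weight_value };
    (void)self;
    return scorer;
}

/* The posting list has no opinion of its own; it asks its master. */
static DispScorer
plist_make_scorer(const DispPostingList *plist, float weight_value)
{
    assert(plist != NULL && plist->master != NULL);
    return plist->master->make_scorer(plist->master, weight_value);
}

/* The weight has no opinion either; it asks the posting list. */
static DispScorer
weight_scorer(const DispPostingList *plist, float weight_value)
{
    return plist_make_scorer(plist, weight_value);
}
```

Swapping the master posting changes which scorer comes out of
weight_scorer() without touching the weight or the posting list.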

The master Posting for each PostingList is ultimately created here:

    Posting*
    Schema_fetch_posting(Schema *self, const ByteBuf *field_name)
    {
        Similarity *sim   = Schema_Fetch_Sim(self, field_name);
        FieldSpec  *fspec = (FieldSpec*)Hash_Fetch_BB(self->fspecs, field_name);
        if (fspec == NULL)
            CONFESS("Can't Fetch_Posting for unknown field %s", field_name->ptr);
        /* Duplicate the field's Posting, passing along the field's Similarity. */
        return Post_Dupe(fspec->posting, sim);
    }
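
The pattern in that function (look up the field, copy its prototype
posting, attach the field's similarity) can be shown standalone. The sketch
below assumes a flat array in place of the Schema's hash, and ProtoSim,
ProtoPosting, ProtoFieldSpec and proto_fetch_posting are all hypothetical
names chosen for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-field similarity settings. */
typedef struct {
    float length_norm_factor;
} ProtoSim;

/* Hypothetical posting: a type tag plus the similarity it will score with. */
typedef struct {
    int             type_id;
    const ProtoSim *sim;
} ProtoPosting;

/* Hypothetical field spec holding the prototype posting for one field. */
typedef struct {
    const char     *field_name;
    ProtoPosting    prototype;
    const ProtoSim *sim;
} ProtoFieldSpec;

/* Copy the prototype posting for `field_name` into *out and bind that
 * field's similarity to the copy.  Returns 1 on success, 0 if the field
 * is unknown (where the original would CONFESS). */
static int
proto_fetch_posting(const ProtoFieldSpec *specs, size_t num_specs,
                    const char *field_name, ProtoPosting *out)
{
    assert(specs != NULL && field_name != NULL && out != NULL);
    for (size_t i = 0; i < num_specs; i++) {
        if (strcmp(specs[i].field_name, field_name) == 0) {
            *out     = specs[i].prototype;   /* duplicate the prototype */
            out->sim = specs[i].sim;         /* per-field similarity */
            return 1;
        }
    }
    return 0;
}
```

The prototype itself is never modified, so every PostingList for the same
field starts from an identical master.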

The sequence for creating TermScorers is kind of convoluted because it wasn't
easy to support per-field Similarity. As with many other internal things in
KS, I wouldn't mind refactoring this stuff if we can come up with a better
implementation.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/