Mailing List Archive

the lifecycle of a Posting
Hi Marvin ---

I'm trying to understand better how TermScorers make use of the
underlying Postings. I thought it was going to be straightforward,
but once I get under the hood I get lost. I think I understand the
terms as you define them in the IRTheory doc, but I have trouble
matching these concepts up with the actual classes.

When you have time, could you offer a couple paragraphs on the
lifecycle of a Posting, with a one-sentence gloss of the relevant
classes?

Thanks!

Nathan Kurz
nate@verse.com

_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
Re: the lifecycle of a Posting
On 9/17/07, Nathan Kurz <nate@verse.com> wrote:
> could you offer a couple paragraphs on the lifecycle of a Posting

I think I mean PostingList here, but I got confused since a
PostingList also encompasses multiple terms via PList_Seek(). Things
are making a little more sense, but I'm still not getting my head around
it.

Nathan Kurz
nate@verse.com

Re: the lifecycle of a Posting
On Mon, Sep 17, 2007 at 02:14:00PM -0600, Nathan Kurz wrote:
> I'm trying to understand better how TermScorers make use of the
> underlying Postings. I thought it was going to be straightforward,
> but once I get under the hood I get lost. I think I understand the
> terms as you define them in the IRTheory doc, but I have trouble
> matching these concepts up with the actual classes.
>
> When you have time, could you offer a couple paragraphs on the
> lifecycle of a Posting, with a one-sentence gloss of the relevant
> classes?

There's one thing that's really wacky about PostingList and the Postings that
TermScorer sees.

malloc() and free() are expensive ops. And a hell of a lot of Postings go by
during scoring.

So... to save time, PList_Bulk_Read doesn't actually create individual Posting
objects. It reads new data into the *same* master Posting over and over and
stacks copies of the master end to end within a ByteBuf. Instead of creating
and destroying many, many Postings, we create and destroy a single ByteBuf.

These copies are what the Scorers actually see.
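
To make the trick concrete, here is a minimal sketch in C of reading into one master struct and stacking copies end to end in a single buffer. It is not the KinoSearch code: DemoPosting, DemoBuf, demo_bulk_read and demo_buf_get are hypothetical names, and the input arrays stand in for data that would really be decoded from the index.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a Posting; not the real KinoSearch struct. */
typedef struct {
    uint32_t doc_num;
    uint32_t freq;
} DemoPosting;

/* Hypothetical stand-in for a ByteBuf: one growable block of bytes. */
typedef struct {
    char   *ptr;
    size_t  len;
    size_t  cap;
} DemoBuf;

/* Fill `buf` with `want` postings. Each record is written into the single
 * `master`, then the master's bytes are appended to `buf`, so a batch costs
 * at most one allocation instead of one per posting. */
size_t
demo_bulk_read(DemoPosting *master, DemoBuf *buf, const uint32_t *doc_nums,
               const uint32_t *freqs, size_t want)
{
    size_t needed = want * sizeof(DemoPosting);
    if (needed > buf->cap) {
        char *grown = realloc(buf->ptr, needed);
        assert(grown != NULL);
        buf->ptr = grown;
        buf->cap = needed;
    }
    buf->len = 0;
    for (size_t i = 0; i < want; i++) {
        master->doc_num = doc_nums[i];   /* overwrite the same master */
        master->freq    = freqs[i];
        memcpy(buf->ptr + buf->len, master, sizeof(DemoPosting));
        buf->len += sizeof(DemoPosting);
    }
    return want;
}

/* Copy the posting at index `tick` back out of the buffer. */
DemoPosting
demo_buf_get(const DemoBuf *buf, size_t tick)
{
    DemoPosting copy;
    assert((tick + 1) * sizeof(DemoPosting) <= buf->len);
    memcpy(&copy, buf->ptr + tick * sizeof(DemoPosting), sizeof copy);
    return copy;
}
```

A caller would run demo_bulk_read once per batch, walk the buffer with demo_buf_get, and free buf.ptr once at the end.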

With that out of the way, let's step back.

The order of creation is...

TermQuery
TermWeight
PostingList
Posting (within PostingList constructor)
TermScorer
[many copies of Posting]

A lot happens in TermWeight->scorer():

    sub scorer {
        my ( $self, $reader ) = @_;
        my $term  = $self->{parent}{term};
        my $plist = $reader->posting_list( term => $term );
        return unless defined $plist;
        return unless $plist->get_doc_freq;

        return $plist->make_scorer(
            similarity   => $self->{similarity},
            weight       => $self,
            weight_value => $self->get_value,
        );
    }


The PostingList contains a single master Posting object, of whatever type is
appropriate for the field -- almost always ScorePosting, for now.
TermWeight->scorer() has to delegate to the PostingList object because it
doesn't know what kind of TermScorer subclass to create. PostingList *also*
has to delegate, because it doesn't know *either*. Only the master Posting
knows.
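
The double delegation can be sketched in C with a function pointer on the posting. This is only an illustration under assumed names (DemoPosting, DemoPostingList, DemoScorer and the demo_* functions are all hypothetical), not how KinoSearch dispatches internally.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct DemoPosting DemoPosting;

/* Hypothetical scorer: it only records which kind of scorer was built. */
typedef struct {
    const char *kind;
    float       weight_value;
} DemoScorer;

/* Each posting type carries its own scorer factory. */
struct DemoPosting {
    DemoScorer (*make_scorer)(const DemoPosting *self, float weight_value);
};

/* Hypothetical posting list holding the single master posting. */
typedef struct {
    DemoPosting *master;
    unsigned     doc_freq;
} DemoPostingList;

/* Factory belonging to one particular posting type. */
DemoScorer
demo_score_posting_make_scorer(const DemoPosting *self, float weight_value)
{
    DemoScorer scorer = { "score-posting scorer", weight_value };
    (void)self;   /* unused in this sketch */
    return scorer;
}

/* The posting list does not know the scorer type, so it asks its master. */
DemoScorer
demo_plist_make_scorer(const DemoPostingList *plist, float weight_value)
{
    assert(plist != NULL && plist->master != NULL);
    return plist->master->make_scorer(plist->master, weight_value);
}
```

The weight calls demo_plist_make_scorer, which forwards to whatever factory the master posting carries; swapping the posting type swaps the scorer without touching either caller.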

The master Posting for each PostingList is ultimately created here:

    Posting*
    Schema_fetch_posting(Schema *self, const ByteBuf *field_name)
    {
        Similarity *sim   = Schema_Fetch_Sim(self, field_name);
        FieldSpec  *fspec = (FieldSpec*)Hash_Fetch_BB(self->fspecs, field_name);

        if (fspec == NULL)
            CONFESS("Can't Fetch_Posting for unknown field %s", field_name->ptr);

        return Post_Dupe(fspec->posting, sim);
    }
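
Read as a pattern, the function above clones a per-field prototype Posting and attaches that field's Similarity to the copy. A rough standalone sketch of the same idea follows; every name in it (DemoSim, DemoFieldSpec, demo_post_dupe, demo_fetch_posting) is hypothetical, and a linear search stands in for the hash lookup.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical per-field similarity settings. */
typedef struct {
    float boost;
} DemoSim;

/* Hypothetical posting that remembers which similarity it scores with. */
typedef struct {
    const char    *type_name;
    const DemoSim *sim;
} DemoPosting;

/* Hypothetical field spec: a name, a prototype posting, and a similarity. */
typedef struct {
    const char  *field_name;
    DemoPosting  prototype;
    DemoSim      sim;
} DemoFieldSpec;

/* Duplicate the prototype and bind the copy to `sim`. */
DemoPosting*
demo_post_dupe(const DemoPosting *proto, const DemoSim *sim)
{
    DemoPosting *copy = malloc(sizeof *copy);
    if (copy == NULL) { return NULL; }
    *copy = *proto;
    copy->sim = sim;
    return copy;
}

/* Find the spec for `field_name` and clone its prototype posting.
 * Returns NULL for an unknown field (the original CONFESSes instead). */
DemoPosting*
demo_fetch_posting(const DemoFieldSpec *specs, size_t num_specs,
                   const char *field_name)
{
    for (size_t i = 0; i < num_specs; i++) {
        if (strcmp(specs[i].field_name, field_name) == 0) {
            return demo_post_dupe(&specs[i].prototype, &specs[i].sim);
        }
    }
    return NULL;
}
```

The caller owns the returned copy and frees it; the prototype stored in the field spec is never handed out directly.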

The sequence for creating TermScorers is kind of convoluted because it wasn't
easy to support per-field Similarity. Like many other internal things in KS,
I wouldn't mind refactoring this stuff if we can come up with a better
implementation.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

Re: the lifecycle of a Posting
I wrote:
> There's one thing that's really wacky about PostingList and the Postings that
> TermScorer sees.
>
> malloc() and free() are expensive ops. And a hell of a lot of Postings go by
> during scoring.
>
> So... to save time, PList_Bulk_Read doesn't actually create individual Posting
> objects. It reads new data into the *same* master Posting over and over and
> stacks copies of the master end to end within a ByteBuf. Instead of creating
> and destroying many many Postings, we create and destroy a single ByteBuf.
>
> These copies are what the Scorers actually see.

Explaining this got me thinking. The "bulk read" functionality is a Lucene
artifact. Of necessity, it's implemented differently in KS. But I don't
think we really need it at all.

Hard drive buffering is already handled by InStream, and by the underlying
FILE* object as well, since I've never been able to figure out why turning
off buffering with setvbuf slows things down. There's really no reason to
buffer a bunch of Posting objects on top of that.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


Re: the lifecycle of a Posting
On 9/27/07, Marvin Humphrey <marvin@rectangular.com> wrote:
> Explaining this got me thinking. The "bulk read" functionality is a Lucene
> artifact. Of necessity, it's implemented differently in KS. But I don't
> think we really need it at all.

I think that is true, and that we might even get a speed increase
processing Postings one at a time. If we handle each Posting immediately,
it should still be hot in L2, maybe even L1. As it is, we churn through a
lot of data at a time and then have to refetch it from RAM when it's time
to use it.

Avoiding the bulk read is also going to simplify the code and make it
possible to do things more flexibly within the Posting class.
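
For comparison, a one-at-a-time loop might look roughly like the sketch below: the list overwrites one Posting per call and the consumer uses it right away. All names here (DemoPosting, DemoPostingList, demo_plist_next, demo_sum_freqs) are made up for illustration, and the arrays stand in for data decoded from the index.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a Posting. */
typedef struct {
    uint32_t doc_num;
    uint32_t freq;
} DemoPosting;

/* Hypothetical posting list that exposes one reusable posting. */
typedef struct {
    const uint32_t *doc_nums;
    const uint32_t *freqs;
    size_t          count;
    size_t          pos;
    DemoPosting     current;   /* overwritten on every call to next */
} DemoPostingList;

/* Advance to the next posting. Returns 1 if `current` was refreshed,
 * 0 once the list is exhausted. */
int
demo_plist_next(DemoPostingList *plist)
{
    if (plist->pos >= plist->count) { return 0; }
    plist->current.doc_num = plist->doc_nums[plist->pos];
    plist->current.freq    = plist->freqs[plist->pos];
    plist->pos++;
    return 1;
}

/* Consume each posting immediately after it is read, so no batch of
 * copies is ever built up. Summing freqs stands in for real scoring. */
uint32_t
demo_sum_freqs(DemoPostingList *plist)
{
    uint32_t total = 0;
    while (demo_plist_next(plist)) {
        total += plist->current.freq;
    }
    return total;
}
```

Nothing is allocated per posting here either, and the only posting data touched in the loop is the one struct that was just written.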

I'd like to try to get the Position code I've been working on finished
sometime soon, although I'm distracted with other projects right now.
Will you be freed up to spend some time working with me on that in a
week or two?

Nathan Kurz
nate@verse.com

Re: the lifecycle of a Posting
On Thu, Sep 27, 2007 at 02:18:11PM -0600, Nathan Kurz wrote:
> I think that is true, and that we might even get a speed increase
> processing Postings one at a time. If we handle it immediately, it
> still should be hot in L2, maybe even L1. As it is we churn through a
> lot of data at a time and then have to refetch from RAM when it's time
> to use it.

Seems likely. We're even going to be overwriting the same prox buffer.

> I'd like to try to get the Position code I've been working on finished
> sometime soon, although I'm distracted with other projects right now.
> Will you be freed up to spend some time working with me on that in a
> week or two?

Yes.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


Re: the lifecycle of a Posting
On Thu, Sep 27, 2007 at 02:18:11PM -0600, Nathan Kurz wrote:
> Avoiding the bulk read is also going to simplify the code and make it
> possible to do things more flexibly within the Posting class.

OK, the bulk read capability is now gone from Posting and PostingList as of
r2557, r2558, and r2560.

I look forward to further simplifications.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

Re: the lifecycle of a Posting
On 9/28/07, Marvin Humphrey <marvin@rectangular.com> wrote:
> On Thu, Sep 27, 2007 at 02:18:11PM -0600, Nathan Kurz wrote:
> > Will you be freed up to spend some time working with me on that in a
> > week or two?
>
> Yes.

Wonderful. I'm busy and/or traveling until October 10th. It will
probably take me a couple days after that to get my head back into
KinoSearch. So pencil in some time starting October 12th, lasting
until we get it perfect and faster than lightning. Boil, C, boil! :)

> > Avoiding the bulk read is also going to simplify the code and make it
> > possible to do things more flexibly within the Posting class.
>
> OK, the bulk read capability is now gone from Posting and PostingList as of
> r2557, r2558, and r2560.

Wow! That's great. I'll check that out when I get a chance (probably
this weekend).

Nathan Kurz
nate@verse.com
