Mailing List Archive

subclassing term scorers
Hi Marvin --

In your last response, you asked why it was bad to have term scorers
be specific to TF/IDF scoring. I didn't do a good job of answering
clearly. I've been thinking about it more, and hopefully I can do a
better job of explaining why this time.

The problem is that while it is easy to subclass a term scorer, it's
difficult to get that subclassed scorer to be used. In the general
case you don't know what sort of posting format you are dealing with,
so you ask the posting list to generate a scorer for you. How do you
subclass something that is generated for you?

Here's some some alternatives I've considered:

I could bypass PostingList->make_scorer, as PhraseScorer does, and
work directly on the Posting. This feels presumptive, and reliant on
Posting internals. I'm confused by your comfort in having
PhraseScorer do so.

I could call PostingList->make_scorer, and then twiddle with VTable so
that the scorer calls my custom tally function. This feels tricksy,
and my custom tally function would still have to forcibly cast the
Posting into something specific.

I could try to wrap the posting specific scorer with my own, and never
call its tally, but the tally I want needs access to posting->impact
(among other things), so I'd have to peek the internals of the posting
anyway.

I could change the architecture so that rather than passing a Sim
object, I pass a Tally object with a run method. Then Scorer_Tally()
is changed to call scorer->tally->run(scorer, tally). Then I
wouldn't have to subclass the generated scorer, as it would already be
customized.

(I like that this further encapsulates the particular scoring scheme,
but it would be complex, and would still require the internals of the
Scorer to be exposed, if only to the Tally object. Either it would
need an ISA() check and a cast, or some way of interrogating the
Scorer as to whether certain features are available.)

Anyway, is there a solution you would recommend here? I feel like I
must be missing something obvious. I'd like to have a path that would
continue to work even as I move to other posting formats, particularly
the mmap() approach I mentioned.

Thanks again!

Nathan Kurz
nate@verse.com
subclassing term scorers [ In reply to ]
On Jul 17, 2007, at 1:50 AM, Nathan Kurz wrote:

> The problem is that while it is easy to subclass a term scorer, it's
> difficult to get that subclassed scorer to be used.

The intent is that each Posting subclass will have a fixed
association with a corresponding TermScorer subclass. You're not
supposed to be able to override that association without additional
subclassing.

Furthermore, each Posting subclass shall wholly define a posting file
format. This is very different from Lucene. In Lucene, the code to
read postings data is spread out over many classes, resulting in a
tight binding between code base and file format which impedes
extensibility. The postings data itself is spread out over 2-3 index
files per field -- .frq and .prx, and optionally .fNUM -- whose
formats have slowly become mangled by the persistent nibbling of
feeping creatures. Take a look at <http://lucene.apache.org/java/
docs/fileformats.html#Frequencies> and try to imagine writing actual
code to support that spec. :)

The introduction of the Posting class in KS and the ironclad
association of a posting_type with a particular field is an attempt
to disentangle that situation by applying the refactoring technique I
referenced earlier, "replace conditionals with polymorphism".
Instead of keeping track of "hasPositions", "hasBoost", "hasPayload"
and such, we point a read function at the postings file which knows
*exactly* how to decode it.

Posting currently works great for internal use, but it's not quite
ready for public subclassing yet. The next step along the path as I
see it is to refactor Posting and the classes that touch it so that
writing a Posting subclass is as simple as defining...

* a write method
* a read method
* a make_scorer method
* a TermScorer subclass that overrides Scorer_Tally

In either C or Perl. :)

Another advantage of the unified posting file design is superior
locality of reference compared with Lucene's spread-out posting
files. This is an engineering tradeoff, as when matching only doc
number on a ScorePosting field, Lucene has to plow through less
data... but when positions are needed, KS does fewer hard disk
seeks. (: It would seem natural for us to exploit positions data
fully, since we're reading it regardless -- which is one reason I'm
enthusiastic about your positional work. :)

Lastly, Posting and PostingList also happen to align well with IR
theory, making for what seems to me is a more coherent conceptual OO
model than Lucene's TermDocs/TermPositions. (A "posting" is one
"term" indexing one "document", and each "term" in an index has a
"posting list" of documents which it indexes - see <http://
www.rectangular.com/kinosearch/docs/devel/KinoSearch/Docs/
IRTheory.html>.)

Here's a rundown of how TermScorer subclasses might interact with
their corresponding Posting subclasses:

* If a field uses MatchPosting, then iterating over a posting
list is really just iterating over a set of document numbers
(transported via MatchPosting objects).
* ScorePosting adds position, freq, and boost information.
* RichPosting is like ScorePosting with enhanced boost info.
* A "PartOfSpeechPosting" might consist of a document number
and an integer representing "verb", "noun", etc -- allowing
someone to perform linguistic analysis of a document
collection.
* An MMapMatchPosting might encode document numbers using
native 32-bit integers rather than delta vints.
* A proprietary GraphPosting might include a node_id and a
node_type, facilitating the kind of search engine described in
that FaceBook blog post -- weighting the document score
according to data looked up in an external structure.

In all cases, these Posting subclasses would meet the minimum
criteria of supplying a document number, but that would be all they
had in common. A PartOfSpeech term scorer wouldn't really know what
to make of a GraphPosting.

> I could bypass PostingList->make_scorer, as PhraseScorer does, and
> work directly on the Posting. This feels presumptive, and reliant on
> Posting internals. I'm confused by your comfort in having
> PhraseScorer do so.

The only way to get a PhraseScorer is via the factory method
PhraseWeight->make_scorer, which will return undef unless it detects
that the posting subclass isa ScorePosting, i.e. it makes positional
data available. We're safe.

> Either it would
> need an ISA() check and a cast, or some way of interrogating the
> Scorer as to whether certain features are available.)

I don't believe that will be necessary. We just need to make sure
that whatever technique we use for inquiring about the positional
data is safe. Tally has a num_sproxen member indicating how many
ScoreProx objects it has. That number would always be 0 for a Tally
supplied by a MatchPostingScorer.

You know, as an alternative to deleting all this positional stuff, we
could more or less finish it off. :) You and I both need positional
data. TermScorer provides it. PhraseScorer provides it modulo a
couple glaring, easily-squashed bugs. We just need the BooleanScorer
components to provide it, too.

The hardest part of what I want to do is figuring out exactly how
positional data should contribute to the score. It's going to be
tricky enough that I think it will require some trial and error.
However, we can defer that task. The positional data itself
probably won't change, so we can get the positional data engine
working and just not connect it up.

One thing I'm concerned about is RAM footprint. Let's try and come
up with a worst case scenario.

'the OR (and OR (it OR is)))'

KS isn't optimized for book-sized documents, but let's say that's
what's in an index, and the query above hits "War and Peace". Each
TermScorer will have a Posting object that grows to some large number
of KiB or even MiB to accommodate so many positions. That's OK, but
is there any danger of geometric growth as our position-resolving
algorithm recurses?

> I'd like to have a path that would
> continue to work even as I move to other posting formats, particularly
> the mmap() approach I mentioned.

That's not gonna fly. You'll need dedicated Posting subclasses for
mmap. MMapMatchPosting, MMapScorePosting, etc.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
subclassing term scorers [ In reply to ]
On 7/18/07, Marvin Humphrey <marvin@rectangular.com> wrote:
> On Jul 17, 2007, at 1:50 AM, Nathan Kurz wrote:

I've rearranged my responses to emphasize our agreement :).

> Lastly, Posting and PostingList also happen to align well with IR
> theory, making for what seems to me is a more coherent conceptual OO
> model than Lucene's TermDocs/TermPositions.

Yes, I agree. I'm not intimately familiar with Lucene's model apart
from via yours, but Posting and PostingList inherently make sense.

> Furthermore, each Posting subclass shall wholly define a posting file
> format. This is very different from Lucene.

Yes, this is a great improvement.

> Instead of keeping track of "hasPositions", "hasBoost", "hasPayload"
> and such, we point a read function at the postings file which knows
> *exactly* how to decode it.

Yes, this is a win. Although it would be possible to work trait by
trait, having a single reader per format is better for simplicity,
efficiency, and maintainability.

> The next step along the path as I
> see it is to refactor Posting and the classes that touch it so that
> writing a Posting subclass is as simple as defining...
>
> * a write method
> * a read method
> * a make_scorer method
> * a TermScorer subclass that overrides Scorer_Tally

Here's where we separate a little. I'd like to make it even simpler,
and require only that it define a read method (and presumably a write
method, although I've thought very little about that side). A new
scorer could be defined to make use of new information in new Posting,
but this would be optional. A subclassed Posting can continue to use
the Scorer used by its parent. Thus if if ScorePosting is a
descendant of MatchPosting, MatchPostingScorer can call
ScorePosting->read() and end up with a Posting it can handle.

This is essentially how PhraseScorer works now, and apart from wanting
some better type-checking, I like this. If I write a
CustomScorePosting class that adds a field to the ScorePosting struct,
PhraseScorer doesn't care, and can continue to directly access posting
struct as a ScorePosting. This is good, because I don't want to have
to write a custom PhraseScorer for every custom Posting class I come
up with. In this view, the purpose of the reader method is to return
a filled in Posting struct for use by a Term or Phrase scorer.

> Here's a rundown of how TermScorer subclasses might interact with
> their corresponding Posting subclasses:

Yes, those, like that.

> In all cases, these Posting subclasses would meet the minimum
> criteria of supplying a document number, but that would be all they
> had in common. A PartOfSpeech term scorer wouldn't really know what
> to make of a GraphPosting.

Yes to the specific example, but I'd like to take advantage of the
inheritance hierarchy when it exists. Thus a MatchPostingScorer would
work just fine if given a ScorePosting (since ScorePosting is a
MatchPosting), but not vice versa. So while a PartOfSpeechScorer
wouldn't try to handle a GraphPosting (presuming PartOfSpeechPosting
is not a GraphPosting), a scorer that wants only a MatchPosting would
likely be able to handle both.

> The intent is that each Posting subclass will have a fixed
> association with a corresponding TermScorer subclass. You're not
> supposed to be able to override that association without additional
> subclassing.

This I don't like. I can see how you got here, but I think there is
a better solution: the TermScorers depend only on the format of the
Posting struct, and Posting->read() is the sole point of conversion
from Index as file to Posting as object. Thus so long as the custom
Posting is a subclass of a standard Posting format, all the scorers
that worked with that parent will work with the subclass.

On the bright side, I'm now confident enough that this will work that
I think we can talk about it later, and concentrate now on how to make
positions work.

> You know, as an alternative to deleting all this positional stuff, we
> could more or less finish it off. :)

I was planning to try to give you a patch removing it this evening,
but I making it work would certainly feel more rewarding. I'll send
you another email later tonight with further thoughts on this.

Nathan Kurz
nate@verse.com
subclassing term scorers [ In reply to ]
On Jul 18, 2007, at 6:27 PM, Nathan Kurz wrote:

> I've rearranged my responses to emphasize our agreement :).

Well done! ;)

>> Lastly, Posting and PostingList also happen to align well with IR
>> theory, making for what seems to me is a more coherent conceptual OO
>> model than Lucene's TermDocs/TermPositions.
>
> Yes, I agree. I'm not intimately familiar with Lucene's model apart
> from via yours, but Posting and PostingList inherently make sense.

In the TermDocs/TermPositions model, the traits are added to the
iterator itself.

while (termDocs.next()) {
system.out.println("DOC: " + termDocs.doc());
system.out.println("FREQ: " + termDocs.freq());
}

while (termPositions.next()) {
system.out.println("DOC: " + termPositions.doc());
int freq = termPositions.freq());
system.out.println("FREQ: " + freq);
while (freq--) {
int position = termPositions.nextPosition();
system.out.println("POS: " + position);
if (termPositions.isPayloadAvailable()) {
byte[] payload = termPositions.getPayload(null, 0);
printPayloadSomeHow(payload);
}
}
}

There isn't an object which represents a posting.

Another significant difference is that Lucene iterates over positions
one at a time via nextPosition(), while KS loads them all into memory
at once.

>> * a write method
>> * a read method
>> * a make_scorer method
>> * a TermScorer subclass that overrides Scorer_Tally
>
> Here's where we separate a little. I'd like to make it even simpler,
> and require only that it define a read method (and presumably a write
> method, although I've thought very little about that side).

Yes, you could do that. Presumably, the subclass would interpret the
same postings file data differently somehow from the parent class.

> A new scorer could be defined to make use of new information in new
> Posting,
> but this would be optional.

You're right. In general that would work, provided that the subclass
was serious about fulfilling the parent class's interface.

> A subclassed Posting can continue to use
> the Scorer used by its parent. Thus if if ScorePosting is a
> descendant of MatchPosting, MatchPostingScorer can call
> ScorePosting->read() and end up with a Posting it can handle.

I can't think of a reason why this wouldn't work. Boilerplater
implements single inheritance only, a very limited OO model. There's
a little trickiness in there -- RichPosting's file format doesn't
"inherit" from ScorePosting's, for instance...

<doc, freq, shared_boost, <position>+>+
<doc, freq, <position, boost>+>+

... and the generated posting->impact would presumably differ (that's
the whole point of RichPosting after all). But the C structs would
be compatible.

>> The intent is that each Posting subclass will have a fixed
>> association with a corresponding TermScorer subclass. You're not
>> supposed to be able to override that association without additional
>> subclassing.
>
> This I don't like. I can see how you got here, but I think there is
> a better solution: the TermScorers depend only on the format of the
> Posting struct, and Posting->read() is the sole point of conversion
> from Index as file to Posting as object.

Well put. You've persuaded me.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
subclassing term scorers [ In reply to ]
On 7/18/07, Marvin Humphrey <marvin@rectangular.com> wrote:
>
> Well put. You've persuaded me.
>

Wonderful! I'm sure there are details left, but I think we can safely
wait until later to discuss them, and concentrate on positions.

I said I'd send you more thoughts tonight, but I'm not going to have
time. Things came up (happy things: an opportunity to go out tuna
fishing leaving tomorrow morning at 2:30 am) so I'm not going to get
to it until Friday.

The gist of my thoughts was that:

1) I think we can get away with a single flat array of positions
rather than a complex structure.

2) All the Booleans can should be able to pass positions by reference,
so no memory troubles foreseen there.

3) PhraseScorer and the like end up reducing the number of positions
so shouldn't be a problem.

4) I haven't been able to think of any geometric space issues.

5) Geometric time: that could be a problem, which is why I'm
concerned with avoiding unnecessary work on unused branches.

I think your test case is a good one to think about.

Possibly disappearing until Friday,

Nathan Kurz
nate@verse.com