Mailing List Archive

removing position code from scorer subclasses
On 7/13/07, Marvin Humphrey <marvin@rectangular.com> wrote:
> On Jul 13, 2007, at 11:49 AM, Nathan Kurz wrote:
>
> > What should I tackle next?
>
> Remove position-generating code in Scorer subclasses and Tally,
> now that the decision has been made not to implement the
> position-aware BooleanScorer in core.
>

I think I understand enough to do what you suggest, but could we
discuss the end goal a bit first? In particular, I'd like to talk
more about how positions will be made available to a parent scorer,
and how custom scorers will interface with custom indexes.

As I mentioned in a previous message, I think that PhraseScorer is a
good example to think about. If we come up with a solution that
allows for a generalized and efficient phrase scorer, I think we'll be
on the right path. Here are some premises I'm starting with:

1. There are two stages to a scorer: matching and scoring.
2. Matching is done for every possible document, thus should be optimizable.
3. The scoring stage only occurs for documents that match, so can be slower.

You are proposing that the position information (when needed) be
passed along as part of the Tally. This creates problems for scorers
like PhraseScorer, which need[*] to get positions from its subscorers
before determining if the document is a match.

The current version of PhraseScorer sidesteps this by working directly
off the position data in the Posting. While I like the efficiency of
this approach, I'm worried it is not going to extend well to custom
index formats. In particular, I don't want the author of a new index
format to be required to write a CustomPhraseScorer. I'd also like it
to be possible to do phrase-type matching on items other than raw
terms: things like "a b|c d".

I think there needs to be a way to signal to a Scorer that its parent
wants this Positions to be set. You suggested that this be done by
subclasses instead, but I don't see this working well. First, it
would double the number of Scorer classes, or worse, it wouldn't
double them and if you wanted a position scorer you'd have to subclass
the existing scorers yourself. Second, I think that each
position-passing subclass is going to end up duplicating most of the
code within its 'parent' class (efficient but bug-prone), or doing the
work twice (think of a PhraseScorer that has already found a phrase).

I think the path between these rocks is going to involve conflating
some of the notions of Posting and Scorer, in particular making
position information available from a Scorer in the way analagous to
that which PhraseScorer currently grabs it directly from
Postings. Rather than having the position data within the Tally, I
think it needs to be part of the base Scorer class, so that it can
always be accessed as directly as Scorer->positions.

Thus here's my view of the end goal of the Scorer classes:

Scorer components (And, Or, Nor, Phrase) are directly reusable.
1. be used for Match-only (simple -- don't call Tally)
2. be used for Scoring (great --- call Tally on subscorers as necessary)
3. Make positions of matches available to parent scorer prior to Tally

Custom scoring algorithms can layer on top of standard components.
1. Base components do not presume any particular scoring scheme.
2. Scorer can signal to its subscorers that it needs position data.
3. Scorers fail gracefully if required data not in the index.

Custom index formats require only a single custom term scorer.
1. This term scorer is the sole interface that a search touches.
2. Scorers don't need to be aware of underlying index format.
3. A speed-for-size trading mmap'ed index format should be possible.

How do these strike you as goals? These make sense to me, but perhaps
I'm confusing a particular implementation with a general need.
Alternatively, I'm willing to just remove the position data as you
suggest, but I think it might produce a better end result if I better
understood where we were headed.

Nathan Kurz
nate@verse.com

* According to my premises, which might be wrong. One way out would be
to relax the requirement that Tally only be called on Matching
documents. This would have things work the way that the current
ORScorer does, which seems to be working. But we can come up with
cases where the performance of this approach might be poor, and I'm
worried that these cases might end up being my normal usage pattern.
removing position code from scorer subclasses [ In reply to ]
Nathan,

We're pretty much on the same page. Some of what you suggest is
already implemented. :)

On Jul 14, 2007, at 5:15 PM, Nathan Kurz wrote:

> As I mentioned in a previous message, I think that PhraseScorer is a
> good example to think about. If we come up with a solution that
> allows for a generalized and efficient phrase scorer, I think we'll be
> on the right path.

I think ORScorer is likely to be the hardest. ANDScorer and
ANDORScorer aren't going to be much easier.

PhraseScorer deals with multiple PostingLists, but all of them belong
to the same field, so their intersection is a single array. In
contrast, the BooleanScorer components have to resolve/remember
multiple positions arrays belonging to different fields -- which is
harder.

> 2. Matching is done for every possible document, thus should be
> optimizable.

Searches like 'rare_term AND common_term' and '"rare_term
common_term"' are already optimized, and do not perform matching for
every document, thanks to the fact that all the Scorer subclasses in
KS 0.20_04 properly implement Scorer_Skip_To. (SegPList_Skip_To is
where the actual optimization takes place.)

> 3. The scoring stage only occurs for documents that match, so can
> be slower.

Maybe it can be, but it probably won't be.

When you're plowing through a postings file, you have to read pretty
much all the data. You can't easily read the matching data and skip
the scoring data. So you'll already have done most of the work by
the time you get to Tally. Take a look at ScorePost_Read_Record,
which you need for matching, and ScorePostScorer_Tally, which isn't
very costly.

I think worrying about the cost of calling Scorer_Tally() may be
premature optimization.

> You are proposing that the position information (when needed) be
> passed along as part of the Tally. This creates problems for scorers
> like PhraseScorer, which need[*] to get positions from its subscorers
> before determining if the document is a match.

PhraseScorer doesn't actually have subscorers -- it handles
PostingList objects directly. If we were to have PhraseScorer
operate over TermScorers, that would slow it down some -- hard to say
how much.

> The current version of PhraseScorer sidesteps this by working directly
> off the position data in the Posting. While I like the efficiency of
> this approach, I'm worried it is not going to extend well to custom
> index formats. In particular, I don't want the author of a new index
> format to be required to write a CustomPhraseScorer.

So long as the Posting subclass can supply an array of positions and
a freq, PhraseScorer_Next will work fine.

To guarantee such a behavior for all Posting subclasses, we'd have to
move the freq, prox, and num_prox member vars from ScorePosting into
the base class Posting. I have some misgivings about that because
the bigger Posting is, the less efficient various implementations of
Post_Bulk_Read will be. Maybe we can figure out a better way to
handle things there. :)

> I'd also like it
> to be possible to do phrase-type matching on items other than raw
> terms: things like "a b|c d".

In Lucene-land, people generally meet this requirement by injecting
synonyms at index time. Implementing a search-time solution would be
tricky.

> I think there needs to be a way to signal to a Scorer that its parent
> wants this Positions to be set.

Ugh, I'm not a fan of flags like that. You wind up with a bunch of
extra conditional code everywhere.

http://www.refactoring.com/catalog/
replaceConditionalWithPolymorphism.html

> You suggested that this be done by
> subclasses instead, but I don't see this working well. First, it
> would double the number of Scorer classes, or worse, it wouldn't
> double them and if you wanted a position scorer you'd have to subclass
> the existing scorers yourself. Second, I think that each
> position-passing subclass is going to end up duplicating most of the
> code within its 'parent' class (efficient but bug-prone), or doing the
> work twice (think of a PhraseScorer that has already found a phrase).

PhraseScorer is easy.

bool_t
PosPhraseScorer_Skip_To(PosPhraseScorer *self, i32_t target)
{
/* if ($self->SUPER::skip_to($target)) */
if (PhraseScorer_skip_to((PhraseScorer*)self, target)) {
build_prox(self);
return true;
}
return false;
}

> I think the path between these rocks is going to involve conflating
> some of the notions of Posting and Scorer, in particular making
> position information available from a Scorer in the way analagous to
> that which PhraseScorer currently grabs it directly from
> Postings. Rather than having the position data within the Tally, I
> think it needs to be part of the base Scorer class, so that it can
> always be accessed as directly as Scorer->positions.

This is fairly close to how things are now. Tally objects "belong"
to a particular Scorer.

> Thus here's my view of the end goal of the Scorer classes:
>
> Scorer components (And, Or, Nor, Phrase) are directly reusable.

Note that all of these *are* directly reusable whether the field
specifies a posting type of ScorePosting or RichPosting, and that the
only difference between the TermScorer subclasses ScorePostingScorer
and RichPostingScorer is a rebless!

> 1. be used for Match-only (simple -- don't call Tally)

Or, make calling Tally so cheap that it doesn't matter:

BoolPostingScorer_Tally(BoolPostingScorer *self)
{
return self->tally;
}

> 2. be used for Scoring (great --- call Tally on subscorers as
> necessary)
> 3. Make positions of matches available to parent scorer prior to
> Tally
>
> Custom scoring algorithms can layer on top of standard components.
> 1. Base components do not presume any particular scoring scheme.

Implementation:

Tally*
Scorer_tally(Scorer *self)
{
ABSTRACT_DEATH(self, "Tally");
UNREACHABLE_RETURN(Tally*);
}

Subclasses override Tally to provide their own scoring scheme.

> 2. Scorer can signal to its subscorers that it needs position data.

I think a better way to handle this is to make the presence or
absence of position data a property of the field, via FieldSpec-
>posting_type.

As originally envisioned, "MatchPosting" would not provide any
positional data. A MatchPosting file is just a stack of doc nums.
(Actually, compressed delta doc nums).

<doc_num>+

Perhaps we should rename "MatchPosting" to "BoolPosting", and have KS
throw an error if you try to create a phrase-match against a field
with that posting_type.

We could modify MatchPosting to include freq and position data, which
would make phrase matching possible.

<doc_num, freq, <position>+>+

That's perilously close to the current implementation of
ScorePosting, though:

<doc_num, freq, shared_boost, <position>+>+

The only difference is a single byte per document for the shared boost.

> 3. Scorers fail gracefully if required data not in the index.

The way to do this is to indicate that the document matches, but
there aren't any positions. The position loop code will never be
entered.

> Custom index formats require only a single custom term scorer.
> 1. This term scorer is the sole interface that a search touches.
> 2. Scorers don't need to be aware of underlying index format.

Yes. :) That's how things are working right now.

However, position-aware Scorers are fundamentally different from
ordinary scorers, and probably should be distinct classes.

> 3. A speed-for-size trading mmap'ed index format should be possible.

To be frank, my experience with mmap is not extensive. FWIW, there's
a Lucene class called MMapDirectory. (Directory => Folder in KS.)
It's gotten mixed reviews.

I don't like it because porting it would require that InStream become
subclassable. Right now, InStream is a "final" class, which means
that all of its method macros are just aliases for the actual functions:

/* (from InStream.r, generated by Boilerplater) */
#define Kino_InStream_Read_VInt(self) \
kino_InStream_read_vint((kino_InStream*)self)

If we have to allow both buffering and non-buffering versions of
InStream, then I need to make InStream's methods overrideable, which
means a double dereference:

#define Kino_InStream_Read_VInt(self) \
(self)->_->read_vint((kino_InStream*)self)

I rather like the simplicity of having only one InStream.

> How do these strike you as goals? These make sense to me, but perhaps
> I'm confusing a particular implementation with a general need.
> Alternatively, I'm willing to just remove the position data as you
> suggest, but I think it might produce a better end result if I better
> understood where we were headed.

This position data was never used, because I never wrote the
implementation for BooleanScorer.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
removing position code from scorer subclasses [ In reply to ]
On 7/15/07, Marvin Humphrey <marvin@rectangular.com> wrote:
> > As I mentioned in a previous message, I think that PhraseScorer is a
> > good example to think about. If we come up with a solution that
> > allows for a generalized and efficient phrase scorer, I think we'll be
> > on the right path.
>
> I think ORScorer is likely to be the hardest. ANDScorer and
> ANDORScorer aren't going to be much easier.

I agree with you that ORScorer is going to be hard, but I think I'm
actually positing a harder problem. By 'generalized', I mean a phrase
scorer that can make use of the multiple position arrays passed up by
that ORScorer.
The reason I want it to be generalized is not for the sake of
PhraseScorer, but because my real goal is to write a positional scorer
that can wrap a BooleanScorer in the same way that PhraseScorer wraps
its subscorers (or rather what would be subscorers if it was
generalized).

Multiple fields does complicate things a bit, but not much (I think):
just associate the field used with the positional array. Dealing with
multiple position arrays at all is going to be the complicated part,
whether they are single or multiple fields. I think answer is that
all position data is represented as an array of position arrays, with
Term scorers just being the 1-dimensional case. Both And and Or
scorers return a separate position array for each matching clause.

>> 3. The scoring stage only occurs for documents that match, so can
>> be slower.
>
> Maybe it can be, but it probably won't be.

Maybe I should define what I mean by 'optimizeable' and 'slower'. To
me, optimizeable C code means that I want to be able to avoid function
calls, think about L2 cache optimization, and generally pick apart the
assembly code to see what the compiler produced. Slower means using a
smart algorithm, but not worrying much about things like function call
overhead. So while it will be slower, it hopefully will not be slow.

> I think worrying about the cost of calling Scorer_Tally() may be
> premature optimization.

You may be quite right that due to the optimization that happens
during matching phase, there isn't any great difference between the
matching and scoring phases. But I would argue that there is a
difference between designing an architecture that makes the
optimization possible (which strikes me as prudent) and actually
performing that optimization (which I agree is premature). But I
certainly could be silly wrong about that.

> When you're plowing through a postings file, you have to read pretty
> much all the data. You can't easily read the matching data and skip
> the scoring data. So you'll already have done most of the work by
> the time you get to Tally.

Yes, for Term scorers, but for things like a sloppy phrase scorer,
determining how many times a phrase occurs (or almost occurs) might be
quite expensive.

> PhraseScorer doesn't actually have subscorers -- it handles
> PostingList objects directly. If we were to have PhraseScorer
> operate over TermScorers, that would slow it down some -- hard to say
> how much.

Or worse, over ORScorers. But this is the problem I'm proposing: how
can this be done efficiently? The right answer might be to deny that
the question matters. I can definitely do without a general solution
for my purposes, but if I do that, any solution I come up with won't
be that useful to anyone else.

> To guarantee such a behavior for all Posting subclasses, we'd have to
> move the freq, prox, and num_prox member vars from ScorePosting into
> the base class Posting.

I'm not sure if it is better or worse, but I was suggesting moving
those (well, the last two) into the base Scorer! I'm hoping that the
internals of Posting won't be visible at all from the component
scorers perspective.

> > I'd also like it
> > to be possible to do phrase-type matching on items other than raw
> > terms: things like "a b|c d".
>
> In Lucene-land, people generally meet this requirement by injecting
> synonyms at index time. Implementing a search-time solution would be
> tricky.

Yeah, but that trickiness was the focus of this message. If we want a
position scorer that plays well with others, we would need to tackle
it. Should we?

>> I think that each
>> position-passing subclass is going to end up duplicating most of the
>> code within its 'parent' class (efficient but bug-prone), or doing the
>> work twice (think of a PhraseScorer that has already found a phrase).
>
> PhraseScorer is easy.
>
> bool_t
> PosPhraseScorer_Skip_To(PosPhraseScorer *self, i32_t target)
> {
> /* if ($self->SUPER::skip_to($target)) */
> if (PhraseScorer_skip_to((PhraseScorer*)self, target)) {
> build_prox(self);
> return true;
> }
> return false;
> }

This has what I feel is the significant disadvantage of finding the
first occurrence of the phrase twice: PhraseScorer_skip_to finds the
first occurence of the phrase and returns a match, then build_prox
does it again.
Perhaps this is acceptable, though.

> > Scorer components (And, Or, Nor, Phrase) are directly reusable.
>
> Note that all of these *are* directly reusable whether the field
> specifies a posting type of ScorePosting or RichPosting, and that the
> only difference between the TermScorer subclasses ScorePostingScorer
> and RichPostingScorer is a rebless!

True, but I'm not finding them reuseable for positional stuff, as they
don't make it positions available. And PhraseScorer (as well as the
TermScorers, which I meant to include in the list) are currently
TF/IDF specific.

> > 1. Base components do not presume any particular scoring scheme.
>
> Implementation:
>
> Tally*
> Scorer_tally(Scorer *self)
> {
> ABSTRACT_DEATH(self, "Tally");
> UNREACHABLE_RETURN(Tally*);
> }
>
> Subclasses override Tally to provide their own scoring scheme.

Are they required to subclass each TermScorer as well? And each
PhraseScorer? And what about when a custom scorer suite meets a
custom index format? And can God create a rock so large that even he
cannot lift it? OK, skip that last question. :)


> We could modify MatchPosting to include freq and position data, which
> would make phrase matching possible.

MatchPosting is just fine the way it is. If it wasn't, the way it
was, I'd have to write it!

> > 3. Scorers fail gracefully if required data not in the index.
>
> The way to do this is to indicate that the document matches, but
> there aren't any positions. The position loop code will never be
> entered.

Yes, it's just a matter of figuring out how to signal this by some
means other than a segfault. Does PhraseScorer currently prevent
this somewhere?
I confess I didn't know about the FieldSpec->posting_type you mentioned.

> > Custom index formats require only a single custom term scorer.
> > 1. This term scorer is the sole interface that a search touches.
> > 2. Scorers don't need to be aware of underlying index format.
>
> Yes. :) That's how things are working right now.

Well, sort of. PhraseScorer currently uses the Posting directly,
instead of via a Scorer. I was suggesting that it should go through
the scorer as well.

> > 3. A speed-for-size trading mmap'ed index format should be possible.
>
> To be frank, my experience with mmap is not extensive. FWIW, there's
> a Lucene class called MMapDirectory. (Directory => Folder in KS.)
> It's gotten mixed reviews.
>
> I don't like it because porting it would require that InStream become
> subclassable. Right now, InStream is a "final" class, which means
> that all of its method macros are just aliases for the actual functions:

mmap() provides a way to directly use the system pager to handle
caching and pool filling very efficiently and easily. Being system
level, it also means that cache is automatically shared across
processes. It provides no real advantage is used simply to replace
File IO.

The real gain would come only if you iterate directly over the mapped
index file without translation or buffering. As such, a useful
mmap-based index format would be bypassing InStream. It couldn't use
VInts, rather it would need to have a format that can be accessed and
used directly as a C-struct.

Without having done the tests, I'm guessing it's probably worth at
least a 2x speedup (probably more), with better shared cache
performance as well.
Among the downsides are that the index files would be byte-order
dependent, and as far as I know it won't work on Windows.

In my somewhat considered opinion, it's as close to a silver bullet as
exists for a problem of this sort. While I hadn't known about
MMapDirectory, I'm pretty sure it's not doing the same thing, as I
don't think it's even possible to mmap a struct in Java. It's really
only makes sense for languages that make their internals fully
accessible, but in this case it's certainly worth exploring further.

Nathan Kurz
nate@verse.com
removing position code from scorer subclasses [ In reply to ]
On 7/15/07, Marvin Humphrey <marvin@rectangular.com> wrote:
> PhraseScorer is easy.
>
> bool_t
> PosPhraseScorer_Skip_To(PosPhraseScorer *self, i32_t target)
> {
> /* if ($self->SUPER::skip_to($target)) */
> if (PhraseScorer_skip_to((PhraseScorer*)self, target)) {
> build_prox(self);
> return true;
> }
> return false;
> }
>
> http://www.refactoring.com/catalog/replaceConditionalWithPolymorphism.html

Possibly useful for discussion, here's the not-yet-compiled code I was
working on for this. Tally would call Scorer_Find_Phrases (or an
inlined version) as it needed to, checking first if it had already
been done. I offer it not because it would actually work, but just to
say that it's not quite as pathological as the refactoring example you
linked to. :)

------------------------------------------------------------------------------------------------

u32_t
ProxPhraseScorer_advance(ProxPhraseScorer *self, u32_t target_doc)
{
Scorer *element_scorer = self->subscorer;

while (1) {
// advance to the next document that satisfies scorer
target_doc = Scorer_Advance(element_scorer, target_doc);

// stop if there are no more matching documents
if (target_doc == DOC_NOT_FOUND) break;

// use this doc if the positions match the offsets
if (Scorer_Find_Phrases(self)) break;

// otherwise try the next doc
target_doc++;
}

return target_doc;
}


u32_t
ProxPhraseScorer_find_phrases(ProxPhraseScorer *self)
{
Positions *positions = scorer->positions;
chy_u32_t *offsets = self->offsets;

chy_u32_t *position;
chy_u32_t length = phrase_occurs(positions, offsets, &position);

// no phrases found here
if (length == 0) return 0;

chy_u32_t num_phrases = 1;
if (self->want_positions) {
// record the phrase we already found
Scorer_Add_Occurrence(self, position, length);
// find further occurrences of the phrase
while (length = phrase_occurs(positions, offsets, &position)) {
Scorer_Add_Occurrence(self, position, length);
num_phrases++;
}
}

return num_phrases;
}

-------------------------------------------------------------------------------

My instinct this morning (it changes often) is that I should go
through and remove the position code as you suggested, and then write
a ProximityScorer that will serve my purposes but not more. Rather
than being the difficult generalized position scorer, this would look
more like an AndNotScorer hybridized with a PhraseScorer.

The And portion would contain ShortCircuitOrScorer's (need to write),
which in turn would contain term or phrase scorers. Since the
ShortCircuitOrScorer would stop after the first subclause match, it
will only have a single position array to report. And since the
AndClauses will be directly embedded in the ProximityScorer, I can
peek directly at their positions like PhraseScorer does.

Rather than worrying about properly figuring the positions for the
phrases, the position scorer will just bail if it sees a subphrase,
under the likely correct assumption that in the very rare case that a
user is searching for a phrase the position bonus isn't that
important. I haven't fully thought it out, but I think it should work
for now, and we can reconsider the generalized solution at a later
date.

Nathan Kurz
nate@verse.com
removing position code from scorer subclasses [ In reply to ]
On Jul 15, 2007, at 4:21 AM, Nathan Kurz wrote:

> I agree with you that ORScorer is going to be hard, but I think I'm
> actually positing a harder problem. By 'generalized', I mean a phrase
> scorer that can make use of the multiple position arrays passed up by
> that ORScorer.
> The reason I want it to be generalized is not for the sake of
> PhraseScorer, but because my real goal is to write a positional scorer
> that can wrap a BooleanScorer in the same way that PhraseScorer wraps
> its subscorers (or rather what would be subscorers if it was
> generalized).

Mmm. A wrapper class...

I'd been thinking that the positions logic would go into
BooleanScorer itself. But then, my goal was to give a little extra
oomph to existing matches when positions occurred near each other, in
order to improve relevancy. You want to do something different.

> Multiple fields does complicate things a bit, but not much (I think):
> just associate the field used with the positional array.

That's what ScoreProx is for:

struct kino_ScoreProx {
KINO_SCOREPROX_VTABLE *_;
KINO_OBJ_MEMBER_VARS;
chy_i32_t field_num;
chy_u32_t num_prox;
chy_u32_t *prox;
};

> I think answer is that
> all position data is represented as an array of position arrays, with
> Term scorers just being the 1-dimensional case.

Yes. TermScorers and PhraseScorers -- both of which are single-field
-- keep their own ScoreProx and just populate it with new numbers for
each match.

For TermScorer, all you have to do is copy pointers from the Posting
object.

For PhraseScorer, you have to build up new ScoreProx data based on
which positions actually matched.

All this is currently implemented, though the information is not used.

[. Heh. I just realized that PhraseScorer's implementation is buggy.
I've added an explanatory comment to PhraseScorer.c. ]

> Both And and Or
> scorers return a separate position array for each matching clause.

Exactly. We'd need an array of ScoreProx objects, and a count
indicating the number of elements in the array. That's what's in
Tally at this moment.

struct kino_Tally {
KINO_TALLY_VTABLE *_;
KINO_OBJ_MEMBER_VARS;
float score;
chy_u32_t num_matchers;
chy_u32_t sprox_cap; /* sproxen allocation
size */
chy_u32_t num_sproxen; /* number of ScoreProx */
struct kino_ScoreProx **sproxen; /* array of ScoreProx
objects */
};

>> When you're plowing through a postings file, you have to read pretty
>> much all the data. You can't easily read the matching data and skip
>> the scoring data. So you'll already have done most of the work by
>> the time you get to Tally.
>
> Yes, for Term scorers, but for things like a sloppy phrase scorer,
> determining how many times a phrase occurs (or almost occurs) might be
> quite expensive.

It's expensive, but the way things are now, that determination has to
be made during the matching phase. You have to know that the phrase
occurs at least once in order to decide that the document matches;
once you've started down that path, you might as well finish up and
find all the phrase matches within the doc.

>> PhraseScorer doesn't actually have subscorers -- it handles
>> PostingList objects directly. If we were to have PhraseScorer
>> operate over TermScorers, that would slow it down some -- hard to say
>> how much.
>
> Or worse, over ORScorers. But this is the problem I'm proposing: how
> can this be done efficiently?

I don't think efficiency is going to be a big problem, because the
pathological cases (like your even/odd example) aren't common enough
to worry about.

Vanilla PhraseScorer should remain as is, operating on PostingLists.
Your custom scorer will operate on subscorers.

>> To guarantee such a behavior for all Posting subclasses, we'd have to
>> move the freq, prox, and num_prox member vars from ScorePosting into
>> the base class Posting.
>
> I'm not sure if it is better or worse, but I was suggesting moving
> those (well, the last two) into the base Scorer!

Posting is single-field, so it only needs prox and num_prox, while
Scorers need to be able to pass multi-field data so they need sproxen
and num_sproxen. I think this information should be made available
via the Scorer's Tally object, rather than via the Scorer struct.
The Scorer struct's layout should remain private.

You seem very determined not to call Scorer_Tally when doing match-
only. There's a cost to generate the positions information, though,
in PhraseScorer, OrScorer, etc. How about delaying that effort until
Tally is called? If you don't care about scoring info, then don't
calculate it -- just manipulate the positions.

> I'm hoping that the
> internals of Posting won't be visible at all from the component
> scorers perspective.

They wouldn't have to be visible to your high-level scorer.

>> > I'd also like it
>> > to be possible to do phrase-type matching on items other than raw
>> > terms: things like "a b|c d".
>>
>> In Lucene-land, people generally meet this requirement by injecting
>> synonyms at index time. Implementing a search-time solution would be
>> tricky.
>
> Yeah, but that trickiness was the focus of this message. If we want a
> position scorer that plays well with others, we would need to tackle
> it. Should we?

Hell yeah. I don't know whether the current PhraseScorer should be
the base class, though. I think you're right that we need a hybrid.

>> PhraseScorer is easy.
>>
>> bool_t
>> PosPhraseScorer_Skip_To(PosPhraseScorer *self, i32_t target)
>> {
>> /* if ($self->SUPER::skip_to($target)) */
>> if (PhraseScorer_skip_to((PhraseScorer*)self, target)) {
>> build_prox(self);
>> return true;
>> }
>> return false;
>> }
>
> This has what I feel is the significant disadvantage of finding the
> first occurrence of the phrase twice: PhraseScorer_skip_to finds the
> first occurence of the phrase and returns a match, then build_prox
> does it again.

That's not the case. The way PhraseScorer works, you're left with
an "anchor set" of positions at the end of the phrase matching
process. Each item in the anchor_set array represents one phrase.
The static function build_prox in PhraseScorer.c just riffs off of
that, building outward from each anchor.

>> > Scorer components (And, Or, Nor, Phrase) are directly reusable.
>>
>> Note that all of these *are* directly reusable whether the field
>> specifies a posting type of ScorePosting or RichPosting, and that the
>> only difference between the TermScorer subclasses ScorePostingScorer
>> and RichPostingScorer is a rebless!
>
> True, but I'm not finding them reuseable for positional stuff, as they
> don't make it positions available.

We may have no choice but to modify core then. :(

I would really like to avoid that if we can. Adding to my
maintenance burden is the wrong way to go. I'd rather see you
publish a CPAN distro, so you get the credit and the responsibility.

> And PhraseScorer (as well as the
> TermScorers, which I meant to include in the list) are currently
> TF/IDF specific.

Why does that matter?

>> Subclasses override Tally to provide their own scoring scheme.
>
> Are they required to subclass each TermScorer as well? And each
> PhraseScorer?

> And what about when a custom scorer suite meets a
> custom index format?

Those are very general questions. It's not possible to anticipate
all the possible scoring configurations people will want to build.

> Yes, it's just a matter of figuring out how to signal this by some
> means other than a segfault. Does PhraseScorer currently prevent
> this somewhere?

Of course. If freq is 0, the positions array never gets dereferenced.

u32_t *candidates = posting->prox;
u32_t *candidates_end = posting->prox + posting->freq;

/* ... */

while (candidates < candidates_end && *candidates <
target) {
candidates++;
}

if (candidates == candidates_end) {
break;
}

> The real gain would come only if you iterate directly over the mapped
> index file without translation or buffering. As such, a useful
> mmap-based index format would be bypassing InStream. It couldn't use
> VInts, rather it would need to have a format that can be accessed and
> used directly as a C-struct.

This could be implemented using a custom Posting subclass. For the
simplest case -- doc num only -- you'd just have it write native
integers, rather than deltas encoded as vints. The size of such a
postings file would probably increase by a factor of between 3 and 4.

Also note that these postings "files" are actually sub-files within a
compound file. That complicates things a bit, but it's probably not
an insurmountable problem.

> Among the downsides are that the index files would be byte-order
> dependent, and as far as I know it won't work on Windows.

File format portability isn't a major issue with search indexes.

Sheer index size might be a problem.

I'll be a little concerned about bypassing InStream until I see an
actual implementation.

InStream can work off of a RAMFolder, which stores data using a
ragged array technique. That won't be compatible with this.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
removing position code from scorer subclasses [ In reply to ]
On Jul 15, 2007, at 12:41 PM, Nathan Kurz wrote:

> u32_t
> ProxPhraseScorer_advance(ProxPhraseScorer *self, u32_t target_doc)
> {
> Scorer *element_scorer = self->subscorer;
>
> while (1) {
> // advance to the next document that satisfies scorer
> target_doc = Scorer_Advance(element_scorer, target_doc);
>
> // stop if there are no more matching documents
> if (target_doc == DOC_NOT_FOUND) break;
>
> // use this doc if the positions match the offsets
> if (Scorer_Find_Phrases(self)) break;
>
> // otherwise try the next doc
> target_doc++;
> }
>
> return target_doc;
> }

That part looks straightforward. It resembles the abstract
PhraseScorer base class in Lucene, which has two subclasses,
ExactPhraseScorer and SloppyPhraseScorer. The only difference
between the two is how they implement the abstract method
PhraseScorer.phraseFreq().

>
> u32_t
> ProxPhraseScorer_find_phrases(ProxPhraseScorer *self)
> {
> Positions *positions = scorer->positions;
> chy_u32_t *offsets = self->offsets;
>
> chy_u32_t *position;
> chy_u32_t length = phrase_occurs(positions, offsets, &position);
>
> // no phrases found here
> if (length == 0) return 0;
>
> chy_u32_t num_phrases = 1;
> if (self->want_positions) {
> // record the phrase we already found
> Scorer_Add_Occurrence(self, position, length);
> // find further occurrences of the phrase
> while (length = phrase_occurs(positions, offsets, &position)) {
> Scorer_Add_Occurrence(self, position, length);
> num_phrases++;
> }
> }
>
> return num_phrases;
> }

This part resembles Lucene's SpanNearQuery, particularly the
Scorer_Add_Occurrence part, which looks like a Span getting stored away.

> My instinct this morning (it changes often) is that I should go
> through and remove the position code as you suggested, and then write
> a ProximityScorer that will serve my purposes but not more.

OK. I think by now you have a pretty good idea of how things are put
together down there. Please forward your code as it progresses.

> we can reconsider the generalized solution at a later date.

My main concern is that you're currently forced to violate private
API in order to do what you need to do. The challenge is to come up
with an infrastructure that will allow you to publish your
generalized solution as a separate CPAN distro.

I have some ideas as to how to pull that off. We can air them out by
test-porting some of the less ambitious query/scorer classes from
Lucene, like ConstantScoringRangeQuery.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
removing position code from scorer subclasses [ In reply to ]
On 7/15/07, Marvin Humphrey <marvin@rectangular.com> wrote:
> > Multiple fields does complicate things a bit, but not much (I think):
> > just associate the field used with the positional array.
>
> That's what ScoreProx is for:
>
> struct kino_ScoreProx {
> KINO_SCOREPROX_VTABLE *_;
> KINO_OBJ_MEMBER_VARS;
> chy_i32_t field_num;
> chy_u32_t num_prox;
> chy_u32_t *prox;
> };

Yes, it's mostly there. I fear that for generality we also need:
chy_u32_t *lengths;
If null, the lengths are all presumed as one, but if present it has
the same length as prox and each entry represents the length of that
occurence.
Sometimes I can convince myself that you could get away with a single
length entry, but suppose you are trying to do a phrase match on:
'a b|"c d" e'. Ugly, but if we want generality I think we need per
position length. You mentioned something about planning for this?

> Exactly. We'd need an array of ScoreProx objects, and a count
> indicating the number of elements in the array. That's what's in
> Tally at this moment.
>
> struct kino_Tally {
> KINO_TALLY_VTABLE *_;
> KINO_OBJ_MEMBER_VARS;
> float score;
> chy_u32_t num_matchers;
> chy_u32_t sprox_cap; /* sproxen allocation
> size */
> chy_u32_t num_sproxen; /* number of ScoreProx */
> struct kino_ScoreProx **sproxen; /* array of ScoreProx
> objects */
> };

Yup, just about exactly that. The 'Positions' object in the stuff I
quoted was intended to be an encapsulation of the last three fields
here. I'm not sure about the name, but I think it would be good to
factor it out.


> [. Heh. I just realized that PhraseScorer's implementation is buggy.
> I've added an explanatory comment to PhraseScorer.c. ]

Since you bring it up, I think there's another bug in that loop: the
phrase_offsets are ignored and the phrase is presumed to be
contiguous. But it seemed churlish to report a bug in unused
section I wasn't sure I understood when I was about to take out
anyway. Since it's not being used, I wasn't sure if it was supposed
to be that way: is it?

> I think this information should be made available
> via the Scorer's Tally object, rather than via the Scorer struct.
> The Scorer struct's layout should remain private.

I do like the way that Tally currently works, but I'm not sure that
exposing a data API based on the struct layout is a bad thing,
particularly as we are already depending on a rigid layout to make
BoilerPlater work.

> You seem very determined not to call Scorer_Tally when doing match-
> only. There's a cost to generate the positions information, though,
> in PhraseScorer, OrScorer, etc. How about delaying that effort until
> Tally is called?

Certainly the goal is to delay it as long as possible in the hope
that one won't have to do it at all, but I think there are times
(Position scoring a PhraseScorer) when you are either going to need to
Tally the subscorers during the parent's matching phase (as OrScorer
does now) else come up with a different way of accessing position
information. You may be right that there is no need to force all the
tallying to occur after all the matching, though. I like that rule
for conceptual clarity, but there's no implementation reason it should
be that way.

> >> > terms: things like "a b|c d".
> >>
> > Yeah, but that trickiness was the focus of this message. If we want a
> > position scorer that plays well with others, we would need to tackle
> > it. Should we?
>
> Hell yeah.

Great! I think if we do end up with an architecture that enables that
level of generality, there will be a lot of interesting things one can
do with this.

On a slightly related note regarding the potential target audience,
did you happen to see this post from Facebook on why they wrote own
search engine: http://blog.facebook.com/blog.php?post=2535632130
It's similar to some of scoring mechanisms I'm thinking about.


> That's not the case. The way PhraseScorer works, you're left with
> an "anchor set" of positions at the end of the phrase matching
> process.

My fault. I'd read through that before, but just remembered it all wrong.

> I would really like to avoid that if we can. Adding to my
> maintenance burden is the wrong way to go.

Au contraire! From my position, it's definitely the way to go! Is
there some reason it doesn't seem like a perfect solution to you? :)

> > And PhraseScorer (as well as the
> > TermScorers, which I meant to include in the list) are currently
> > TF/IDF specific.
>
> Why does that matter?

Because they don't (in my limited experience) make good base classes
for scorers using other schemes, thus one ends up rewriting them
rather than subclassing them. But it may just be me not using them
properly.

> Those are very general questions. It's not possible to anticipate
> all the possible scoring configurations people will want to build.

No, but we can try to anticipate the likely configurations: TF/IDF,
match-only, and position/scope, for starters.

> > Yes, it's just a matter of figuring out how to signal this by some
> > means other than a segfault. Does PhraseScorer currently prevent
> > this somewhere?
>
> Of course. If freq is 0, the positions array never gets dereferenced.

I was more worried that the explicit cast to ScorePosting might cause
posting->freq and posting->prox to point to something unrelated if a
custom Posting class inherits directly from Posting rather than from
ScorePosting. Is this case caught somewhere?

>> (mmap)
> The size of such a
> postings file would probably increase by a factor of between 3 and 4.

Yes, there's definitely a size trade-off. If the active portion of
your expanded dataset is larger than available RAM and you start
paging from disk per search, the advantage goes away. But if it does
fit (or can be split across enough machines so it does) you end up
with something faster than you could achieve otherwise, with no
startup penalty, and perfect interprocess sharing.

> I'll be a little concerned about bypassing InStream until I see an
> actual implementation.

Certainly, and you should be. While this might be a useful custom
format, it's not going to be able to replace the existing formats in
most cases.

Thanks for your responses. I'll try to send you a
position-code-removing patch in the next week.

Nathan Kurz
nate@verse.com