Mailing List Archive

adding a proximity scorer
I've continued looking at KinoSearch and also trying to figure out how
I want my search to work. There are a lot of changes I want to make,
but I think the most straightforward is a scorer that approximates a
sloppy phrase match. Unfortunately, even this seems pretty complex.

Here's what I'm aiming for in my proximity scorer:

--------------

All terms (or clauses really) of the main query are required -- that
is, the top level is an AND query.

All queries are treated as 'sloppy phrase queries', and a hit is given
a bonus if it matches the terms in order (i.e., it would have matched
as a phrase query).
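
To make the in-order bonus concrete, here is a minimal sketch in Python
(chosen only for brevity; it is not KinoSearch code). It assumes each
record has already been reduced to a map from term to token positions.
The function name and the data layout are hypothetical.

```python
def matches_in_order(query_terms, positions):
    """Return True if the terms occur at consecutive positions, in query order.

    positions maps each term to the token positions where it occurs.
    """
    if not query_terms:
        return False
    first, rest = query_terms[0], query_terms[1:]
    for start in positions.get(first, []):
        # Each later term must sit exactly one token further along.
        if all(start + offset in positions.get(term, ())
               for offset, term in enumerate(rest, start=1)):
            return True
    return False


# "movie title" appears as a phrase at positions 3 and 4.
doc = {"movie": [0, 3], "title": [4, 9]}
in_order = matches_in_order(["movie", "title"], doc)
reversed_order = matches_in_order(["title", "movie"], doc)
```

A hit for which this returns True would get the bonus; a hit that only
contains the terms somewhere would still match, just without it.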

There is no bonus for rare terms (IDF) or for terms occurring multiple times
within a record (TF). There is no preference for short or long
records (no normalization).

No stemming occurs, but individual terms are expanded with
lower-boosted alternate spellings using a slightly modified version of
aspell. So the user query "movie titel" is expanded into something
like:
(movie^10 || movies^7 || move^5) && (titel^10 || title^7 || tilt^3)

There is no boost for matching multiple terms within the OR portions
of the alternate spellings. Only the highest score is used.
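
A rough Python sketch of this "highest alternate wins" rule follows. It
assumes a record is just a set of terms and that the scores of the
required clauses are summed; the summing is my assumption, since the
description only says the clauses are ANDed. The function names are
hypothetical.

```python
def score_clause(alternates, doc_terms):
    """Score one OR clause: the highest boost among alternates present, else None."""
    boosts = [boost for term, boost in alternates if term in doc_terms]
    return max(boosts) if boosts else None


def score_query(clauses, doc_terms):
    """AND the clauses: every clause must match; matched scores are summed (assumed)."""
    total = 0
    for alternates in clauses:
        clause_score = score_clause(alternates, doc_terms)
        if clause_score is None:
            return None  # a required clause is missing, so no hit
        total += clause_score
    return total


# The expansion of "movie titel" given above.
clauses = [
    [("movie", 10), ("movies", 7), ("move", 5)],
    [("titel", 10), ("title", 7), ("tilt", 3)],
]
both_clauses = score_query(clauses, {"movie", "movies", "title"})
missing_clause = score_query(clauses, {"movie", "poster"})
```

In the first call the record contains both "movie" and "movies", but
only the boost of 10 counts for that clause; the second call is not a
hit because nothing matches the "titel" clause.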

There is a preference for 'popular' records, and each record will be
assigned a popularity boost in advance. This boost will be small
enough that a misspelled popular match will not rank higher than a
properly spelled rare match.
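
One way to keep the popularity boost "small enough" is to cap it below
the smallest possible difference between two spelling scores. The
Python sketch below assumes integer boosts (so the smallest gap is 1)
and a popularity value already normalised to the range 0 to 1; both
are assumptions, as are the function name and the 0.9 factor.

```python
def final_score(spelling_score, popularity, min_gap=1.0):
    """Add a popularity bonus that stays strictly below min_gap.

    popularity is assumed to be precomputed and normalised to [0, 1].
    """
    if not 0.0 <= popularity <= 1.0:
        raise ValueError("popularity must be in [0, 1]")
    # Scale to at most 90% of the gap so spelling always dominates.
    return spelling_score + 0.9 * min_gap * popularity


rare_exact = final_score(20, popularity=0.0)        # both terms spelled correctly
popular_misspelt = final_score(17, popularity=1.0)  # one alternate spelling used
```

With this cap, popularity only breaks ties between records whose
spelling scores are equal.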

My actual searches will be bracketed by START and END tokens, so:
START && (movie^10 || movies^7 || move^5) && (titel^10 || title^7) && END

These tokens represent the beginnings and ends of lines, and I'll
insert them when indexing. For now I think this is simpler than using
a custom position format with separate values for line and prox.
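
As an illustration of inserting the tokens at index time, here is a
small Python sketch. It assumes whitespace tokenisation and lowercasing
of ordinary words (which also keeps the word "start" from colliding
with the uppercase sentinel); neither choice comes from KinoSearch, and
the function name is hypothetical.

```python
START, END = "START", "END"


def tokenize_record(text):
    """Split a record into lowercase tokens, wrapping each non-empty line in sentinels."""
    tokens = []
    for line in text.splitlines():
        words = line.lower().split()
        if words:
            tokens.append(START)
            tokens.extend(words)
            tokens.append(END)
    return tokens


tokens = tokenize_record("Movie Title\nDirected by Someone")
```

Because the sentinels are ordinary tokens, they get ordinary positions,
so "starts a line" becomes "immediately follows START".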

There is a partial bonus for having the first two or last two terms of
the bracketed query occurring in order (i.e., the first word of the
user query starts a line in the doc, or the last word ends a line).

There is a partial bonus for having all the middle terms of a query
occurring between the START and END tokens, without an intervening
START or END (i.e., all words occur on a single line).
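
The two partial bonuses can be sketched in Python as checks over a
token stream that already contains START and END sentinels. This is a
simplification: it ignores alternate spellings and works on a plain
token list rather than an index, and the function name is hypothetical.

```python
def line_bonuses(query_terms, tokens):
    """Return (starts_line, ends_line, same_line) for a tokenised record.

    query_terms must be non-empty; tokens uses "START"/"END" as line sentinels.
    """
    first, last = query_terms[0], query_terms[-1]
    starts_line = ends_line = same_line = False
    line = []
    for token in tokens:
        if token == "START":
            line = []
        elif token == "END":
            if line:
                starts_line = starts_line or line[0] == first
                ends_line = ends_line or line[-1] == last
                same_line = same_line or all(t in line for t in query_terms)
        else:
            line.append(token)
    return starts_line, ends_line, same_line


tokens = ["START", "movie", "title", "END",
          "START", "directed", "by", "someone", "END"]
full_match = line_bonuses(["movie", "title"], tokens)
split_match = line_bonuses(["title", "someone"], tokens)
```

The first query earns all three bonuses; the second only earns the
"last word ends a line" bonus, because its terms sit on different
lines.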

Words with a leading '-' are excluded (as in BooleanScorer).

Quoted and hyphenated phrases must exist in the query order. Quoted
words (single or multi), hyphenated words, and words with a leading
'+' or '-' are not expanded for alternate spellings.
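
The rules for which words get alternate spellings can be sketched as a
small query classifier. This Python version only decides, per word or
quoted phrase, whether it is excluded and whether it may be expanded;
it does not enforce phrase order. The regular expression and the
function name are my own assumptions, not KinoSearch's query parser.

```python
import re

# Optional leading sign, then either a double-quoted phrase or a bare word.
TOKEN_RE = re.compile(r'([+-]?)(?:"([^"]*)"|(\S+))')


def classify_query(query):
    """Split a raw query into (text, excluded, expand) triples."""
    parsed = []
    for sign, quoted, bare in TOKEN_RE.findall(query):
        text = quoted if quoted else bare
        if not text:
            continue
        excluded = sign == "-"
        # Only plain, unsigned, unquoted, unhyphenated words get alternates.
        expand = not sign and not quoted and "-" not in text
        parsed.append((text, excluded, expand))
    return parsed


parsed = classify_query('movie +titel -poster "star wars" sci-fi')
```

Here only "movie" is eligible for spelling expansion; "poster" is
marked as excluded, and the rest are matched exactly as typed.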

---------------

I'm not really sure how to go about adding such a Scorer to the
existing code base. The easiest way for me to get it working would be
to directly modify the existing BooleanScorer apparatus to do what I
want, since almost everything has a parallel. But this would be hard
to make useful for anyone else.

Better would be to add a few new classes derived from the existing
ones, but I don't know how to do this without just copying code
wholesale. Take for example ORScorer. I'd like to override just
ORScorer_tally() (in ORScorer.c) so it returns the highest score
rather than the sum.
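
In object terms, the override I have in mind looks like the Python
sketch below. The class and method names are hypothetical stand-ins and
say nothing about how the BoilerPlater vtables actually dispatch; the
sketch only shows the shape of replacing one method while inheriting
the rest.

```python
class SumORScorer:
    """Hypothetical stand-in for an OR scorer that sums its matching children."""

    def __init__(self, child_scores):
        self.child_scores = child_scores  # scores of the subscorers that matched

    def combine(self, scores):
        return sum(scores)

    def tally(self):
        # Shared logic lives here; only combine() varies between subclasses.
        return self.combine(self.child_scores) if self.child_scores else 0.0


class MaxORScorer(SumORScorer):
    """Same scorer, but only the best-scoring alternate counts."""

    def combine(self, scores):
        return max(scores)


summed = SumORScorer([10.0, 7.0]).tally()
best = MaxORScorer([10.0, 7.0]).tally()
```

The subclass is a few lines long because everything except the
combining step is inherited, which is what I would like to achieve
without copying ORScorer.c.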

Is there a way I can do this with the BoilerPlater vtable stuff? I
haven't yet figured out how ORScorer_tally() actually gets called
from Scorer_Tally(). Or more generally, how would you like a piece of
potential user-contributed code to be arranged?

Thanks again!

Nathan Kurz
nate@verse.com

ps. I've been trying to refrain from responding line-by-line to your
other responses, but I assure you I am reading them closely and
appreciate the effort you put into them. But I'm trying to prioritize
my questions so as to delay the point at which you get sick of my
constant pestering... :)