Mailing List Archive: rich positions and custom scoring

Hello --

I'm looking at KinoSearch again after about a year away
(http://www.rectangular.com/pipermail/kinosearch/2006-May/000170.html),
and I was excited to see that rich positions seem to be included as
promised. Fabulous! Are there any examples yet of using customized
rich position formats?

For my particular use case, I'd like to be able to provide a strong
boost to docs in which the query terms occur within the same
'sentence', which would be delineated by the occurrence of some
regexp. Rather than storing position as a single increasing int, I'd
store it as a (sentence_num, word_num) pair, with word_num increasing
monotonically until a new sentence is started, at which point
sentence_num is incremented and word_num is restarted at zero. The
scorer would then give a bonus if all the terms share a common
sentence number.
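
The (sentence_num, word_num) scheme described above can be sketched in plain Perl, independent of KinoSearch. This is only an illustration under stated assumptions: rich_positions and terms_share_sentence are hypothetical helper names, the boundary regexp is a placeholder, and nothing here reflects the actual RichPosting file format.

```perl
use strict;
use warnings;

# Hypothetical helper: tag each token with (sentence_num, word_num).
# The sentence-boundary regexp is an assumption; substitute any pattern.
sub rich_positions {
    my ( $text, $boundary_re ) = @_;
    $boundary_re ||= qr/[.!?]+/;
    my @postings;
    my $sentence_num = 0;
    for my $sentence ( split $boundary_re, $text ) {
        my @words = $sentence =~ /(\w+)/g;
        next unless @words;    # skip empty fragments
        my $word_num = 0;      # restarts at zero in every sentence
        for my $word (@words) {
            push @postings, [ lc($word), $sentence_num, $word_num++ ];
        }
        $sentence_num++;
    }
    return \@postings;
}

# Hypothetical helper: true if at least one sentence contains every
# query term. A scorer could apply its bonus when this holds.
sub terms_share_sentence {
    my ( $postings, @terms ) = @_;
    my %wanted = map { ( lc($_) => 1 ) } @terms;
    my %seen;    # sentence_num => { term => 1 }
    for my $posting (@$postings) {
        my ( $term, $sentence_num ) = @$posting;
        $seen{$sentence_num}{$term} = 1 if $wanted{$term};
    }
    my $needed = keys %wanted;
    for my $found ( values %seen ) {
        return 1 if keys(%$found) == $needed;
    }
    return 0;
}
```

Calling terms_share_sentence( rich_positions($text), 'quick', 'fox' ) would return true only when both terms occur between the same pair of boundaries.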

I started looking at the code, and it seems like this would be
possible if I define a custom tokenizer, a custom posting, and a
custom scorer (what else?), but I can't figure out how to do this
easily without just editing some of the existing classes in place. It
seems like it should be possible to do this with some artful
subclassing, but I can't figure out how to do it. For example,
Index::SegWriter seems hard-coded to use the built-in
Analysis::TokenBatch in a way that I'm not sure how to override
gracefully. Subclass SegWriter and then redefine add_doc? Then call
a subclassed InvIndex which calls my subclass? Then again, I'm
somewhat OO-illiterate, so I may be missing something obvious.

I'm sure I'll figure out some way to get it done, but suggestions
would be appreciated.

Thanks!

Nathan Kurz
nate@verse.com

On Jun 7, 2007, at 4:56 PM, Nathan Kurz wrote:

> Are there any examples yet of using customized
> rich position formats?

The RichPosting file format is done. BooleanScorer does not yet take
position into account by default. That's near the top of my TODO list.

There are not yet any meaningful examples using RichPosition.
KinoSearch::Simple::HTML will be the first. That's just below the
adaptations to BooleanScorer on my TODO list.

> For my particular use case, I'd like to be able to provide a strong
> boost to docs in which the query terms occur within the same
> 'sentence', which would be delineated by the occurrence of some
> regexp. Rather than storing position as a single increasing int, I'd
> store it as a (sentence_num, word_num) pair, with word_num increasing
> monotonically until a new sentence is started, at which point
> sentence_num is incremented and word_num is restarted at zero. The
> scorer would then give a bonus if all the terms share a common
> sentence number.

The adaptations to BooleanScorer will make it behave in a way which
is somewhat similar to this, especially if your custom Tokenizer
separates sentences by injecting a large position increment.
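
As a rough illustration of the position-increment idea, the sketch below keeps a single integer position per token but jumps ahead at each sentence boundary, so a proximity window never spans two sentences. The gap size, the boundary regexp, and the names gapped_positions and within_window are all assumptions made for the example; this is not KinoSearch's Tokenizer or BooleanScorer.

```perl
use strict;
use warnings;

# Assumed gap size; it only needs to exceed any proximity window the
# scorer will ever consider.
my $SENTENCE_GAP = 1000;

# Hypothetical helper: map each term to its list of integer positions,
# adding $SENTENCE_GAP after every sentence.
sub gapped_positions {
    my ($text) = @_;
    my %positions;    # term => [ positions ]
    my $pos = 0;
    for my $sentence ( split /[.!?]+/, $text ) {
        my @words = $sentence =~ /(\w+)/g;
        next unless @words;
        push @{ $positions{ lc($_) } }, $pos++ for @words;
        $pos += $SENTENCE_GAP;
    }
    return \%positions;
}

# Hypothetical helper: true if some occurrence of $term_a lies within
# $window positions of some occurrence of $term_b.
sub within_window {
    my ( $positions, $term_a, $term_b, $window ) = @_;
    for my $a_pos ( @{ $positions->{ lc($term_a) } || [] } ) {
        for my $b_pos ( @{ $positions->{ lc($term_b) } || [] } ) {
            return 1 if abs( $a_pos - $b_pos ) <= $window;
        }
    }
    return 0;
}
```

With any window smaller than the gap, two terms can only match when they fall in the same sentence, which approximates the same-sentence bonus without changing the position format.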

> I started looking at the code, and it seems like this would be
> possible if I define a custom tokenizer, a custom posting, and a
> custom scorer (what else?),

A custom Query and Weight.

Exposing an API which allows people to create their own custom
scoring algorithms relatively easily would be the ultimate ambition
for KinoSearch. I REALLY, REALLY want to make that happen.

The New York Times recently published an article about the process by
which Google tweaks its search engine.

http://www.nytimes.com/2007/06/03/business/yourmoney/03google.html

THAT is the direction I want to take. There's pressure on indexers
like KS and Lucene to behave more like relational databases -- people
want transactions, instant updates, and so on. But the RDBMS path is
well-traveled. The unexplored, exciting territory is all in the
search domain.

I would love to sculpt KinoSearch into a modular open source
framework where people could try out scoring ideas like yours
easily. Many of the puzzle pieces are already in place: Schema,
Posting, and the recent Analyzer API changes Peter Karman and I
hammered out all help. Crucially, I think the file format design is
done and will support future expansion without problems. I believe
the tight coupling between file format and code base that KinoSearch
inherited from Lucene has been released.

At the same time... it's important to just get 0.20 out, and make the
API stable enough that people can build apps with it safely. Plus,
for the next couple months I have a full-time contract job unrelated
to KS. So I just need to fix bugs, hack the last few items of the
TODO list and release an official version, and try to avoid being
seduced by this really, really interesting but difficult problem. :)

> but I can't figure out how to do this
> easily without just editing some of the existing classes in place. It
> seems like it should be possible to do this with some artful
> subclassing, but I can't figure out how to do it.

Absolutely, it should be possible. But there are still a lot of
rough private APIs that would need to be refined and made public.

> For example,
> Index::SegWriter seems hard-coded to use the built-in
> Analysis::TokenBatch in a way that I'm not sure how to override
> gracefully. Subclass SegWriter and then redefine add_doc? Then call
> a subclassed InvIndex which calls my subclass?

I like your basic idea. How would you like your Analyzer subclass to
look? And how would you like to change SegWriter->add_doc?

I think the high level code would look like this:

    my $custom_seg_writer = CustomSegWriter->new;
    my $invindexer = KinoSearch::InvIndexer->new(
        invindex   => $invindex,
        seg_writer => $custom_seg_writer,
    );

We'd keep InvIndexer simple and make subclassing SegWriter the point
of departure for advanced users.

SegWriter->add_doc would need to be broken up somehow to make it more
customizable. That would either happen by adding methods to
SegWriter itself, or more likely, by pushing bigger chunks of
responsibility down to components like DocWriter and Analyzer.
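
One way to picture the first option, adding methods to SegWriter itself, is the sketch below. It uses stand-in classes only: Sketch::SegWriter, Sketch::SentenceSegWriter, analyze_field, and write_postings are hypothetical names invented for illustration and are not part of KinoSearch. The point is just that once add_doc delegates to small steps, a subclass can override one step and leave the rest alone.

```perl
use strict;
use warnings;

# Stand-in base class (NOT KinoSearch's SegWriter): add_doc is split
# into small steps that a subclass can override individually.
package Sketch::SegWriter;

sub new {
    my $class = shift;
    return bless { postings => [] }, $class;
}

sub add_doc {
    my ( $self, $doc ) = @_;
    for my $field ( sort keys %$doc ) {
        my $tokens = $self->analyze_field( $field, $doc->{$field} );
        $self->write_postings( $field, $tokens );
    }
}

# Default analysis: one increasing integer position per token.
sub analyze_field {
    my ( $self, $field, $text ) = @_;
    my $pos = 0;
    return [ map { [ lc($_), $pos++ ] } $text =~ /(\w+)/g ];
}

# Store each token as [ field, term, position data... ].
sub write_postings {
    my ( $self, $field, $tokens ) = @_;
    push @{ $self->{postings} }, map { [ $field, @$_ ] } @$tokens;
}

# Subclass that overrides only the analysis step to emit
# (sentence_num, word_num) pairs instead of a single integer.
package Sketch::SentenceSegWriter;
our @ISA = ('Sketch::SegWriter');

sub analyze_field {
    my ( $self, $field, $text ) = @_;
    my @tokens;
    my $sentence_num = 0;
    for my $sentence ( split /[.!?]+/, $text ) {
        my @words = $sentence =~ /(\w+)/g;
        next unless @words;
        my $word_num = 0;
        push @tokens, map { [ lc($_), $sentence_num, $word_num++ ] } @words;
        $sentence_num++;
    }
    return \@tokens;
}

package main;

my $writer = Sketch::SentenceSegWriter->new;
$writer->add_doc( { body => 'Rich positions work. Scoring comes next.' } );
```

Pushing the same steps down into separate components such as DocWriter and Analyzer would look similar, with the overridable methods living on those objects rather than on the writer.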

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/