Hello --
I'm looking at KinoSearch again after about a year away
(http://www.rectangular.com/pipermail/kinosearch/2006-May/000170.html),
and I was excited to see that rich positions seem to be included as
promised. Fabulous! Are there any examples yet of using customized
rich position formats?
For my particular use case, I'd like to be able to provide a strong
boost to docs in which the query terms occur within the same
'sentence', which would be delineated by the occurrence of some
regexp. Rather than storing position as a single increasing int, I'd
store it as a (sentence_num, word_num) pair, with word_num increasing
monotonically until a new sentence is started, at which point
sentence_num is incremented and word_num is restarted at zero. The
scorer would then give a bonus if all the terms share a common
sentence number.
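
To make the numbering scheme concrete, here is a minimal sketch in plain Python. It does not use any KinoSearch classes; the delimiter regexp and the names rich_positions and shares_sentence are hypothetical, chosen only for illustration.

```python
import re

# Assumed sentence delimiter; the real one would be whatever regexp
# the application chooses.
SENTENCE_END = re.compile(r"[.!?]+")
# Assumed tokenizer: runs of word characters, or a run of delimiters.
TOKEN = re.compile(r"\w+|[.!?]+")


def rich_positions(text):
    """Yield (term, (sentence_num, word_num)) for each word in text."""
    sentence_num = 0
    word_num = 0
    for match in TOKEN.finditer(text):
        tok = match.group()
        if SENTENCE_END.fullmatch(tok):
            # Delimiter seen: start a new sentence, restart word_num at zero.
            sentence_num += 1
            word_num = 0
            continue
        yield tok.lower(), (sentence_num, word_num)
        word_num += 1


def shares_sentence(postings, terms):
    """Return True if at least one sentence contains every term.

    postings must be a list (it is scanned once per term).
    """
    sentence_sets = [
        {sent for term_text, (sent, _word) in postings if term_text == term}
        for term in terms
    ]
    return bool(sentence_sets) and bool(set.intersection(*sentence_sets))


postings = list(rich_positions("The cat sat. The dog ran."))
print(shares_sentence(postings, ["cat", "sat"]))  # same sentence
print(shares_sentence(postings, ["cat", "dog"]))  # different sentences
```

A scorer along these lines would apply the bonus only when shares_sentence is true; how large the bonus should be is left open here.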
I started looking at the code, and it seems like this would be
possible if I define a custom tokenizer, a custom posting, and a
custom scorer (what else?), but I can't figure out how to do this
easily without just editing some of the existing classes in place. It
seems like it should be possible to do this with some artful
subclassing, but I can't figure out how. For example,
Index::SegWriter seems hard-coded to use the built-in
Analysis::TokenBatch in a way that I'm not sure how to override
gracefully. Subclass SegWriter and then redefine add_doc? Then call
a subclassed InvIndex which calls my subclass? Then again, I'm
somewhat OO-illiterate, so I may be missing something obvious.
I'm sure I'll figure out some way to get it done, but suggestions
would be appreciated.
Thanks!
Nathan Kurz
nate@verse.com