Mailing List Archive: Subclassable Highlighter (was Re: KinoSearch feature suggestions)

On Jan 23, 2008, at 6:59 AM, Peter Karman wrote:

> fwiw, Search::Tools offers highlighting and excerpting (snipping) via
> the building of complex regular expressions. See
> http://search.cpan.org/~karman/Search-Tools-0.16/lib/Search/Tools/Snipper.pm
> http://search.cpan.org/~karman/Search-Tools-0.16/lib/Search/Tools/HiLiter.pm
>
> The algorithm I use for snipping/excerpting is slow, and I would love
> to see how a different approach could improve performance. I believe
> the primary reason my approach is slow is that it uses a big regex.

KinoSearch's highlighter is fast because it utilizes information
generated at index time and stored in the "term vectors" file. Each
"vectorized" field's data consists of...

* Term text.
* Each term's position in the field, measured in tokens.
* Each position's start offset, measured in Unicode code points.
* Each position's end offset, measured in Unicode code points.

Because the start offset and end offset are stored, it is possible to
highlight stemmed terms accurately. For instance, if a field starts
off with "Horses are fast", the stemmed text "hors" is stored along
with a start offset of 0 and an end offset of 6, allowing us to
insert highlighting emphasis marks at those positions. The same
technique could be used, for example, to highlight synonyms after
synonym analysis.
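
As a rough illustration of the offset idea (not KinoSearch's actual
code), here is a minimal Python sketch. The names toy_stem,
build_term_vector and highlight are hypothetical, and the "stemmer" is
a crude stand-in that merely strips a few suffixes. Python string
indices count Unicode code points, which matches the offsets described
above.

```python
import re
from collections import defaultdict

def toy_stem(token):
    """Hypothetical stand-in for a real stemmer: lowercase, strip one suffix."""
    t = token.lower()
    for suffix in ("es", "s", "e"):
        if t.endswith(suffix) and len(t) > len(suffix) + 2:
            return t[: -len(suffix)]
    return t

def build_term_vector(text):
    """Map each stemmed term to a list of (position, start, end) tuples.

    position is measured in tokens; start/end are code point offsets
    into the original, unstemmed text.
    """
    vector = defaultdict(list)
    for position, match in enumerate(re.finditer(r"\w+", text)):
        vector[toy_stem(match.group())].append(
            (position, match.start(), match.end())
        )
    return dict(vector)

def highlight(text, vector, query_terms, pre="<strong>", post="</strong>"):
    """Wrap every occurrence of the (stemmed) query terms using stored offsets."""
    spans = sorted({
        (start, end)
        for term in query_terms
        for (_position, start, end) in vector.get(toy_stem(term), [])
    })
    pieces, cursor = [], 0
    for start, end in spans:
        pieces.append(text[cursor:start])          # untouched text before the hit
        pieces.append(pre + text[start:end] + post)  # original surface form
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)
```

Because the emphasis marks are placed by offset, the original surface
form ("Horses") is what gets wrapped, even though only the stem was
matched.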

The essence of the Highlighter is that after we have a result set, we
rerun the query against the documents one-at-a-time and see what
parts are most important. For this to work, we need...

* Query/Scorer classes which are capable of telling us why they scored
  a document the way they did. Right now, this is done via
  $query->extract_terms, but that's a crude mechanism that will not
  hold up for esoteric subclasses of Query.
* Access to the parsed, analyzed document.
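
To make "see what parts are most important" concrete, here is one
possible sketch in Python. It assumes the query's hits have already
been reduced to a list of (start, end) offsets, and it simply picks
the excerpt window containing the most hits. The function name
best_window and the default excerpt length are hypothetical, and this
is not necessarily how the KinoSearch highlighter scores excerpts.

```python
def best_window(hits, field_length, excerpt_length=200):
    """Return (start, end) of the excerpt window covering the most hits.

    hits is a list of (start, end) offsets for matched terms.
    Candidate windows begin at each hit; ties go to the earliest one.
    """
    if not hits:
        # Nothing matched: fall back to the beginning of the field.
        return (0, min(excerpt_length, field_length))
    hits = sorted(hits)
    best_start, best_count = hits[0][0], 0
    for start, _end in hits:
        limit = start + excerpt_length
        # Count the hits that fit entirely inside this candidate window.
        count = sum(1 for s, e in hits if s >= start and e <= limit)
        if count > best_count:
            best_start, best_count = start, count
    return (best_start, min(best_start + excerpt_length, field_length))
```

A real scorer would presumably weight hits rather than count them,
which is exactly where extract_terms falls short: it reports which
terms matter, but not how much each one contributed.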

If we did not store the "term vectors" information, we would have the
option of rerunning analysis on the fly. Unfortunately, this doesn't
work well if you have either large documents or costly Analyzer
chains. So, storing some serialized version of the parsed document
which can be reassembled into an object quickly will remain a crucial
facet of the KinoSearch highlighter.
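
For illustration only, here is a sketch of what "a serialized version
which can be reassembled quickly" might look like, using Python's
struct module. The byte layout is invented for this example and is not
the KinoSearch term vectors format; pack_term_vector and
unpack_term_vector are hypothetical names.

```python
import struct

def pack_term_vector(vector):
    """Serialize {term: [(position, start, end), ...]} to bytes.

    Invented layout: term count, then for each term its UTF-8 byte
    length, entry count, the raw term bytes, and three little-endian
    32-bit unsigned ints per entry.
    """
    chunks = [struct.pack("<I", len(vector))]
    for term, entries in sorted(vector.items()):
        raw = term.encode("utf-8")
        chunks.append(struct.pack("<II", len(raw), len(entries)))
        chunks.append(raw)
        for position, start, end in entries:
            chunks.append(struct.pack("<III", position, start, end))
    return b"".join(chunks)

def unpack_term_vector(data):
    """Rebuild the dict from bytes without rerunning any analysis."""
    (num_terms,) = struct.unpack_from("<I", data, 0)
    cursor = 4
    vector = {}
    for _ in range(num_terms):
        raw_len, num_entries = struct.unpack_from("<II", data, cursor)
        cursor += 8
        term = data[cursor:cursor + raw_len].decode("utf-8")
        cursor += raw_len
        entries = []
        for _ in range(num_entries):
            entries.append(struct.unpack_from("<III", data, cursor))
            cursor += 12
        vector[term] = entries
    return vector
```

Reading the data back is a linear scan over fixed-width integers, with
no tokenizing or stemming involved, which is why this kind of storage
is cheap next to rerunning an Analyzer chain on a large document.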

I wish it were realistic to perform analysis on the fly, because then
it would not be necessary to worry about the file format of
persistent term vector data within the index. TermVectors probably
won't be part of the official file spec, in order to limit the
clutter. However, for backwards compatibility purposes, we'll still
be stuck with the format once it's set.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
