Mailing List Archive

Re: KinoSearch::Highlight::Highlighter

On Jul 15, 2008, at 8:04 PM, Marvin Humphrey wrote:
> Hello, Michael,
>
>> I'm using KinoSearch to develop a search engine for the IRC logs I
>> have browsable on my website. I am currently using the developer
>> release, version 0.20_051 due to a need for non-score based sorting
>> (sort by date). I am very pleased with KinoSearch so far. For IRC
>> logs it makes the most sense to break on line breaks versus periods
>> for excerpts. This is an easy one line patch[1] in Highlight/
>> Highlighter.pm but it seems a bit overkill to subclass Highlighter
>> for a one line patch to _gen_excerpt.
>>
>> Perhaps it may make sense to have an argument that allows you to
>> specify a character/string to prefer breaking on that defaults to
>> '\.'. Allowing regex syntax would be most flexible, and I think
>> most people overriding the default wouldn't have trouble escaping
>> things, but you are the author ;). I'm really not sure what, other
>> than periods and newlines, someone may want to break on (perhaps
>> tabs), so I would definitely understand should you decide this is a
>> feature request that wouldn't be used widely enough to merit
>> inclusion.
>
> Sorry for the delayed response.

Not a problem.

> I've been working on Highlighter lately, and I think the answer is
> to define a couple methods that the user can override:
> find_sentence_boundaries() and raw_excerpt(). If you're interested
> in discussing API design for those, we should take up the matter on
> the KinoSearch mailing list:
> <http://www.rectangular.com/mailman/listinfo/kinosearch/>

Indeed, this makes sense and allows for even more specialization.

Re mailing list: yes, my mistake. I'm too used to small modules that
have no list. I subscribed to this one a couple of days ago and am
CCing this reply there.

Mike

_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch

Re: KinoSearch::Highlight::Highlighter
On Jul 16, 2008, at 6:23 AM, Michael Greb wrote:

>>> Perhaps it may make sense to have an argument that allows you to
>>> specify a character/string to prefer breaking on that defaults to
>>> '\.'.

Please note that Highlighter's API has changed since the last dev
release.

Here's Highlighter's current algorithm:

* Hand the Query a document and ask it what sections of the field in
  question it thinks are important, if any. Any "hot" sections are
  expressed via HighlightSpan objects, which define a start_offset,
  an end_offset, and a floating point "weight".
* Take all the HighlightSpan objects and create a HeatMap, which
  muxes all the spans plus adds bonus heat whenever spans occur
  close together.
* Analyze the HeatMap and find the hottest section of the field,
  using boundaries a little larger than the desired excerpt size.
  (Right now, it's find_best_fragment() that does this, but it's not
  clear that that method needs to be public.)
* Use Highlighter::find_sentence_boundaries to locate bounds inside
  and immediately outside the hot window.
* Have Highlighter::raw_excerpt determine the formal boundaries of
  the excerpt. Use sentence boundaries when possible, but apply
  ellipses when necessary.
* Have Highlighter::highlight_excerpt process the raw excerpt by
  applying Highlighter::highlight and Highlighter::encode.
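
To make the HeatMap and "hottest section" steps concrete, here is a
minimal sketch in Python. It is an illustration only, not KinoSearch
code: the Span class, build_heat(), hottest_window(), and the
proximity and bonus parameters are all hypothetical names and values
chosen for the example, and the real HeatMap may weigh things
differently.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical stand-in for a HighlightSpan."""
    start: int      # start offset, inclusive
    end: int        # end offset, exclusive
    weight: float


def build_heat(spans, length, proximity=10, bonus=0.5):
    """Per-character heat: summed span weights, plus a bonus for
    each pair of neighbouring spans that lie close together.
    (Assumed scoring rule, for illustration only.)"""
    heat = [0.0] * length

    def add(span, amount):
        for i in range(max(0, span.start), min(length, span.end)):
            heat[i] += amount

    for span in spans:
        add(span, span.weight)

    ordered = sorted(spans, key=lambda s: s.start)
    for a, b in zip(ordered, ordered[1:]):
        if b.start - a.end <= proximity:
            add(a, bonus)
            add(b, bonus)
    return heat


def hottest_window(heat, width):
    """Start offset of the width-sized window with the most total
    heat. Ties go to the earliest window."""
    width = min(width, len(heat))
    current = sum(heat[:width])
    best_sum, best_start = current, 0
    for start in range(1, len(heat) - width + 1):
        # Slide the window one position to the right.
        current += heat[start + width - 1] - heat[start - 1]
        if current > best_sum:
            best_sum, best_start = current, start
    return best_start
```

With spans at offsets 5-9, 40-45, and 48-52 in a 60-character field,
the last two sit close together, earn the bonus, and pull the window
toward themselves.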

The question right now is what the APIs should look like for
find_sentence_boundaries() and raw_excerpt(). FWIW, they are
surprisingly hard to implement, because grammatical inconsistencies
are hard to avoid and there are lots of edge cases.
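
As a rough picture of what raw_excerpt() has to decide, here is a
sketch in Python (an illustration, not KinoSearch's implementation).
The signature, the ELLIPSIS constant, and the trimming rules are
assumptions made for the example: it snaps the hot window to sentence
starts when any fall inside it, and otherwise cuts at the window edge
and marks the cut with an ellipsis.

```python
ELLIPSIS = "..."


def raw_excerpt(text, window_start, window_end, sentence_starts):
    """Trim [window_start, window_end) to an excerpt, preferring
    sentence boundaries and adding ellipses where text is cut."""
    window_start = max(0, window_start)
    window_end = min(len(text), window_end)
    inside = [s for s in sorted(sentence_starts)
              if window_start <= s < window_end]

    # Begin at the first sentence start inside the window, if any.
    if inside:
        start, prefix = inside[0], ""
    else:
        start = window_start
        prefix = ELLIPSIS if start > 0 else ""

    later = [s for s in inside if s > start]
    if window_end == len(text):
        # The window reaches the end of the field: nothing is cut.
        end, suffix = window_end, ""
    elif later:
        # Stop where the last sentence inside the window begins.
        end, suffix = later[-1], ""
    else:
        # No boundary available: back up to a space and mark the cut.
        end = window_end
        cut = text.rfind(" ", start, end)
        if cut > start:
            end = cut
        suffix = ELLIPSIS
    return prefix + text[start:end].strip() + suffix
```

Even this simplified version shows where the edge cases come from: a
window with no boundary inside it can begin mid-word, and a window
containing a single sentence start must choose between a short
excerpt and a truncated one.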

For starters: right now, find_sentence_boundaries() returns an array
of offsets marking where each sentence starts. However, this is not
ideal; it would be better to know the exact end offsets as well.
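
One possible shape for a boundary finder that reports both ends is
sketched below in Python. This is not the KinoSearch API: the
function name is borrowed from the discussion, but the break_pattern
argument (echoing Michael's suggestion of a configurable break
defaulting to '\.') and the (start, end) return format are
assumptions for illustration.

```python
import re


def find_sentence_boundaries(text, break_pattern=r"\."):
    """Return (start, end) offset pairs, end exclusive.

    Each match of break_pattern terminates a sentence. Surrounding
    whitespace is trimmed, so a period stays inside its sentence
    while a newline terminator does not."""
    bounds = []

    def add(start, end):
        while start < end and text[start].isspace():
            start += 1
        while end > start and text[end - 1].isspace():
            end -= 1
        if start < end:
            bounds.append((start, end))

    pos = 0
    for match in re.finditer(break_pattern, text):
        if match.end() == match.start():
            continue  # ignore empty matches
        add(pos, match.end())
        pos = match.end()
    add(pos, len(text))  # trailing text with no terminator
    return bounds
```

For IRC logs, passing break_pattern=r"\n" would break on lines
rather than periods, with no need to subclass anything.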

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
