Mailing List Archive

Feature request: highlight without excerpt
I'd like to be able to highlight the matches in a field without
creating an excerpt from it. A typical use case would be for
highlighting in titles, like Google does.

I would have a go at implementing it, but I'm not sure how best to fit
it into the class hierarchy, and where to put the result in the data
structure returned by fetch_hit_hashref.

Thanks
--
Edward Betts
Feature request: highlight without excerpt
Edward Betts wrote on 6/7/07 3:55 PM:
> I'd like to be able to highlight the matches in a field without
> creating an excerpt from it. A typical use case would be for
> highlighting in titles, like Google does.
>
> I would have a go at implementing it, but I'm not sure how best to fit
> it into the class hierarchy, and where to put the result in the data
> structure returned by fetch_hit_hashref.
>

Marvin might have a native KS solution for this; otherwise, I recommend
(shameless plug) Search::Tools::HiLiter.

--
Peter Karman . http://peknet.com/ . peter@peknet.com
Feature request: highlight without excerpt
On Jun 7, 2007, at 1:55 PM, Edward Betts wrote:

> I'd like to be able to highlight the matches in a field without
> creating an excerpt from it.

At least one other person has made the same feature request
(<http://rt.cpan.org/Ticket/Display.html?id=25400>).

The revised Highlighter API introduced in 0.20_03 is intended to
facilitate such features. You can even process the same field
multiple times if you want.

    $highlighter->add_spec(
        field          => 'content',
        name           => 'less',
        excerpt_length => 50,
    );
    $highlighter->add_spec(
        field          => 'content',
        name           => 'more',
        excerpt_length => 2000,
    );
    ...
    print "$hit->{excerpts}{less}\n";
    ...
    print "$hit->{excerpts}{more}\n";

As I mentioned in my reply to that bug report, you can sort of fake
up a non-excerpted excerpt by making excerpt_length a large number.
However, as was pointed out to me, Highlighter will tack on an
ellipsis unless the field ends with a period.

That's a bug that needs fixin'. Highlighter should not tack on an
ellipsis if the end of the excerpt coincides with the end of the
field value.
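
The intended rule can be sketched roughly as follows. This is a minimal illustration only: finish_excerpt is a hypothetical standalone helper, not KinoSearch code, and it assumes the excerpt is described by a character offset and length within the field value.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of KinoSearch: $offset is where the excerpt
# starts within the field value, $length is how many characters it may span.
sub finish_excerpt {
    my ( $field_value, $offset, $length ) = @_;
    my $excerpt = substr( $field_value, $offset, $length );

    # Append an ellipsis only if text remains after the excerpt.
    my $reaches_end = $offset + length($excerpt) >= length($field_value);
    $excerpt .= '...' unless $reaches_end;
    return $excerpt;
}

print finish_excerpt( 'Highlight without excerpt', 0, 2000 ), "\n";
print finish_excerpt( 'Highlight without excerpt', 0, 9 ),    "\n";
```

With a large length the whole value comes back untouched, whether or not it ends with a period; only a genuinely truncated excerpt gets the ellipsis.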

> A typical use case would be for
> highlighting in titles, like Google does.

Another use would be highlighting within URLs, something Google also
does.

> I would have a go at implementing it, but I'm not sure how best to fit
> it into the class hierarchy, and where to put the result in the data
> structure returned by fetch_hit_hashref.

It should still go under $hit->{excerpts}.

I think there's a bit of a disconnect because of the name of that
hash key and the name of the Hits method, "create_excerpts". Those
names sort-of imply that you can't use the Highlighter without
excerpting. Maybe that Hits method should be named "set_highlighter"
instead, though having the word "highlighter" in there sort-of
implies the opposite -- that you can't create excerpts without
highlighting -- which is just as misleading.

In any case, there should be a way to turn off excerpting via the
Highlighter->add_spec API. I think the best way to do that is to add
an extract_excerpt parameter to add_spec():

    $highlighter->add_spec(
        field           => 'title',
        extract_excerpt => 0,    # default 1
    );

Another possibility would be to treat an explicit undef supplied to
excerpt_length as an indication that no excerpting should be
performed, but I think people reading the docs wouldn't find that as
easily.
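
To make the named-parameter option concrete, here is a rough sketch of how the argument handling might look. normalize_spec is a hypothetical stand-in for add_spec's internals, and the excerpt_length default of 200 is an assumption for illustration; only extract_excerpt (default 1) and the name-falls-back-to-field behavior come from this thread.

```perl
use strict;
use warnings;

# Hypothetical stand-in for add_spec's argument handling; the real
# KinoSearch method may well differ.
sub normalize_spec {
    my %spec = (
        extract_excerpt => 1,      # proposed default
        excerpt_length  => 200,    # assumed default, for illustration only
        @_,
    );
    die "Missing required parameter 'field'" unless defined $spec{field};

    # use field name as key unless specified
    $spec{name} = $spec{field} unless defined $spec{name};

    # excerpt_length has no meaning when the whole field is highlighted
    delete $spec{excerpt_length} unless $spec{extract_excerpt};

    return \%spec;
}

my $spec = normalize_spec( field => 'title', extract_excerpt => 0 );
print join( ', ', map {"$_=$spec->{$_}"} sort keys %$spec ), "\n";
```

An explicit flag keeps the "no excerpting" case visible in the parameter list, which is the discoverability argument made above.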

Do you feel like taking this on?

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
Feature request: highlight without excerpt
On 08/06/07, Marvin Humphrey <marvin@rectangular.com> wrote:
> It should still go under $hit->{excerpts}.

First attempt, I still need to write docs and tests. I went with
"excerpt_length => undef", but this could easily be switched to 0 or
support both.

Index: lib/KinoSearch/Highlight/Highlighter.pm
===================================================================
--- lib/KinoSearch/Highlight/Highlighter.pm (revision 2474)
+++ lib/KinoSearch/Highlight/Highlighter.pm (working copy)
@@ -57,7 +57,8 @@
}

# scoring window is 1.66 * excerpt_length, with the loc in the middle
- $spec{limit} = int( $spec{excerpt_length} / 3 );
+ $spec{limit} = int( $spec{excerpt_length} / 3 )
+ if defined $spec{excerpt_length};

# use field name as key unless specified
$spec{name} = $spec{field} unless defined $spec{name};
@@ -71,13 +72,45 @@
# create an excerpt for each spec
my %excerpts;
for my $spec ( @{ $self->{specs} } ) {
- $excerpts{ $spec->{name} }
- = $self->_gen_excerpt( $doc, $doc_vector, $spec );
+ if (defined $spec->{excerpt_length}) {
+ $excerpts{ $spec->{name} }
+ = $self->_gen_excerpt( $doc, $doc_vector, $spec );
+ } else {
+ $excerpts{ $spec->{name} }
+ = $self->_gen_excerpt_no_length( $doc, $doc_vector, $spec );
+ }
}

return \%excerpts;
}

+sub _gen_excerpt_no_length {
+ my ( $self, $doc, $doc_vector, $spec ) = @_;
+ my $excerpt_field = $spec->{field};
+
+ my $text = $doc->{$excerpt_field};
+ return unless defined $text;
+ return '' unless length $text;
+
+ my $formatter = $spec->{formatter};
+ my $encoder = $spec->{encoder};
+
+ my $output_text = '';
+ my $posits = $self->_starts_and_ends( $doc_vector, $excerpt_field );
+ my $last_end = 0;
+ foreach (@$posits) {
+ my ($start, $end) = @$_;
+ $output_text .= $encoder->encode(
+ substr( $text, $last_end, $start - $last_end ) );
+ $output_text .= $formatter->highlight(
+ $encoder->encode( substr( $text, $start, $end - $start ) ) );
+ $last_end = $end;
+ }
+ $output_text .= $encoder->encode( substr( $text, $last_end ) );
+
+ return $output_text;
+}
+
sub _gen_excerpt {
my ( $self, $doc, $doc_vector, $spec ) = @_;
my $excerpt_field = $spec->{field};
@@ -186,7 +219,7 @@
my $formatter = $spec->{formatter};
my $encoder = $spec->{encoder};
my $output_text = '';
- my ( $start, $end, $last_start, $last_end ) = ( undef, undef, 0, 0 );
+ my ( $start, $end, $last_end ) = ( undef, undef, 0 );
while (@relative_starts) {
$end = shift @relative_ends;
$start = shift @relative_starts;

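The splicing loop in _gen_excerpt_no_length can be tried outside KinoSearch with something like the sketch below. DemoEncoder, DemoFormatter and highlight_whole_field are hypothetical stand-ins; only the encode() and highlight() method names are taken from the patch, and the match spans are assumed to be sorted, non-overlapping [start, end) character offsets.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the encoder object: escapes HTML metacharacters.
package DemoEncoder;
sub new { return bless {}, shift }

sub encode {
    my ( $self, $text ) = @_;
    my %entity = ( '&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;' );
    $text =~ s/([&<>"])/$entity{$1}/g;
    return $text;
}

# Hypothetical stand-in for the formatter object: wraps a match in a tag.
package DemoFormatter;
sub new { return bless {}, shift }
sub highlight { return "<strong>$_[1]</strong>" }

package main;

# Highlight every [start, end) span in $text without truncating the text.
sub highlight_whole_field {
    my ( $text, $spans, $encoder, $formatter ) = @_;
    my $output   = '';
    my $last_end = 0;
    for my $span (@$spans) {
        my ( $start, $end ) = @$span;

        # unmatched text between the previous match and this one
        $output .= $encoder->encode(
            substr( $text, $last_end, $start - $last_end ) );

        # the matched text itself, encoded and then wrapped
        $output .= $formatter->highlight(
            $encoder->encode( substr( $text, $start, $end - $start ) ) );
        $last_end = $end;
    }

    # remainder of the field after the final match
    $output .= $encoder->encode( substr( $text, $last_end ) );
    return $output;
}

my $title = 'Perl & search: highlight titles';
print highlight_whole_field( $title, [ [ 0, 4 ], [ 7, 13 ] ],
    DemoEncoder->new, DemoFormatter->new ), "\n";
```

Encoding each piece separately, rather than the finished string, keeps the formatter's own markup from being escaped.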

--
Edward Betts
Feature request: highlight without excerpt
On Jun 15, 2007, at 8:25 AM, Edward Betts wrote:

> On 08/06/07, Marvin Humphrey <marvin@rectangular.com> wrote:
>> It should still go under $hit->{excerpts}.
>
> First attempt, I still need to write docs and tests.

This looks right to me. I look forward to adding it. :)

> I went with "excerpt_length => undef", but this could easily be
> switched to 0 or
> support both.

I prefer your proposal of using 0 over both the named parameter and
the explicit undef I originally suggested.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/