Hi,
On Jan 21, 2008, at 2:16 PM, Father Chrysostomos wrote:
> I’d like to request that a few features be added to KinoSearch. I
> need these features myself, so I’m willing to contribute patches.
> Please let me know what you think.
I'm going to take the liberty of cc'ing this to the KinoSearch
mailing list, since it was filed as a public rt.cpan.org issue.
> 1. Wildcards in search queries
I am in favor of wildcards being available via a separate
distribution, and I would very much like to hammer out an elegant low-
level API to support such a distro. A lot of the work I have been
doing lately is intended to facilitate such endeavors.
Wildcards should not be in core KS, because they are by their nature
vastly more expensive than whole-word queries. I have observed that
their comparative cost often comes as an unpleasant shock. However,
providing a separate distro will prompt people to assess the costs
with open eyes.
> 2. I’d like KinoSearch::Highlight::Highlighter to be able to create
> non-contiguous excerpts (which I’m calling ‘summaries’; the
> contiguous sub-parts of each summary I’m calling excerpts):
>
> $highlighter->add_spec( excerpt_length => 50, summary_length =>
> 200, ...);
>
> The highlighter would find the most important word to highlight (as
> it currently does), and create a 50-char excerpt. Then it would
> create an excerpt for the second most important word and add that
> (removing overlap if necessary), repeating this process until the
> summary is the right length.
I think this should be implemented by abstracting out the excerpt
selection engine, analogous to the way that
KinoSearch::Highlight::Encoder and KinoSearch::Highlight::Formatter
abstract out other functionality used by the Highlighter. How about
if we outsource excerpting to subclasses of a new class,
KinoSearch::Highlight::Excerpter? Then you could release your own
distro, e.g. KSx::Highlight::SummaryExcerpter.
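For illustration only, here is a rough sketch in plain C of the summary-building loop described in the quoted request: take hits in order of importance, build a fixed-size window around each, merge overlapping windows, and stop before the total exceeds the summary length. Nothing here is KinoSearch API. The names Span, add_span, total_length, build_summary, and MAX_SPANS are made up for the sketch, and "stop before exceeding the limit" is just one reading of "until the summary is the right length".

```c
#include <stddef.h>
#include <string.h>

#define MAX_SPANS 32   /* hypothetical cap on excerpts per summary */

/* Half-open byte range [start, end) within the source text. */
typedef struct { size_t start; size_t end; } Span;

/* Insert s into the sorted, non-overlapping list spans[0..n), merging it
 * with any spans it overlaps or touches.  Requires n < MAX_SPANS.
 * Returns the new number of spans. */
static size_t
add_span(Span *spans, size_t n, Span s)
{
    Span out[MAX_SPANS];
    size_t m = 0, i = 0;
    /* Spans that end before s begins are copied unchanged. */
    while (i < n && spans[i].end < s.start) out[m++] = spans[i++];
    /* Spans that overlap or touch s are absorbed into it. */
    while (i < n && spans[i].start <= s.end) {
        if (spans[i].start < s.start) s.start = spans[i].start;
        if (spans[i].end > s.end) s.end = spans[i].end;
        i++;
    }
    out[m++] = s;
    /* Everything after s is copied unchanged. */
    while (i < n) out[m++] = spans[i++];
    memcpy(spans, out, m * sizeof(Span));
    return m;
}

/* Sum of the lengths of all spans. */
static size_t
total_length(const Span *spans, size_t n)
{
    size_t total = 0;
    for (size_t i = 0; i < n; i++) total += spans[i].end - spans[i].start;
    return total;
}

/* hits holds byte offsets of matched terms, most important first.
 * spans must have room for MAX_SPANS entries.  Returns the number of
 * excerpts selected; their combined length never exceeds summary_len. */
size_t
build_summary(const size_t *hits, size_t num_hits, size_t text_len,
              size_t excerpt_len, size_t summary_len, Span *spans)
{
    size_t n = 0;
    size_t half = excerpt_len / 2;
    for (size_t h = 0; h < num_hits && n < MAX_SPANS; h++) {
        Span s;
        Span trial[MAX_SPANS];
        size_t trial_n;

        if (hits[h] >= text_len) continue;   /* ignore out-of-range hits */
        /* Center a window on the hit, clamped to the text boundaries. */
        s.start = hits[h] > half ? hits[h] - half : 0;
        s.end = s.start + excerpt_len;
        if (s.end > text_len) s.end = text_len;

        /* Try the addition on a copy; keep it only if the summary fits. */
        memcpy(trial, spans, n * sizeof(Span));
        trial_n = add_span(trial, n, s);
        if (total_length(trial, trial_n) > summary_len) break;
        memcpy(spans, trial, trial_n * sizeof(Span));
        n = trial_n;
    }
    return n;
}
```

A real Excerpter would also need to snap window edges to word or sentence boundaries and join the pieces with an ellipsis, which this sketch leaves out.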
> 3. Custom ellipsis marks:
>
> $highlighter->add_spec( ellipsis_mark => "\x{2026}", ... )
I understand the problem, but adding such a specific param to
Highlighter->add_spec seems brittle. I think this should be
something which is set via a custom excerpting engine.
Incidentally, Highlighter's treatment of the ellipsis also prompted
part of <http://rt.cpan.org/Public/Bug/Display.html?id=25400>.
> 4. Pagination (another highlighter feature): An index field could
> be designated as the ‘page offset’ field, containing byte offsets
> of page breaks.
>
> $highlighter->add_spec(
> page_offset_field => 'pageoffsets',
> page_offset_formatter => $object,
> );
>
> And $object would have to have a page_label method: sub page_label
> { my ($self, $fields_hashref, $page_no) = @_; ... }
This feature also seems like it should belong to a particular
Excerpter implementation.
> Though it might be more complicated, maybe we could have page
> breaks (chr 12) recorded automatically when the index is created.
> Then ‘page_offset_field’ won’t be necessary.
That would work well. It's trivial to implement effectively using C/
XS, because you can just zip along the string counting page breaks.
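    long
    count_breaks(SV *input_sv) {
        STRLEN len;
        /* SvPV stringifies the scalar if necessary and sets len. */
        char *ptr = SvPV(input_sv, len);
        /* Derive the end from len; SvEND assumes the SV already holds
         * a string buffer. */
        char *end = ptr + len;
        long count = 0;
        /* 12 is the form feed (page break) character. */
        while (ptr < end) { if (*ptr++ == 12) count++; }
        return count;
    }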
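With Perl, tr// works for efficient character counting, IIRC: with an
empty replacement list, ($text =~ tr/\f//) returns the number of form
feeds and leaves the string unchanged.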
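Since the original request mentions byte offsets of page breaks, the same scan could record positions instead of only counting. A minimal sketch in plain C, deliberately free of the Perl API so it can be compiled on its own; the function name find_page_breaks and its parameters are hypothetical, not anything in KinoSearch:

```c
#include <stddef.h>

/* Record the byte offset of each form feed (chr 12) found in buf.
 * At most max_offsets entries are written to offsets.  The return value
 * is the total number of page breaks found, which may be larger than
 * max_offsets if the caller's array was too small. */
size_t
find_page_breaks(const char *buf, size_t len,
                 size_t *offsets, size_t max_offsets)
{
    size_t count = 0;
    for (size_t i = 0; i < len; i++) {
        if (buf[i] == '\f') {
            if (count < max_offsets) offsets[count] = i;
            count++;
        }
    }
    return count;
}
```

Given the offsets, the page number for any excerpt position is one plus the number of recorded offsets that are less than that position.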
> For examples of 2 and 4 in use, see <http://synodinresistance.org/
> cgi-bin/anazetesis?all=1&and-glossa=&and-morphe=&g=en&q=thing>
> (which I’d like to switch to using KinoSearch, because it’s
> currently too slow).
I admire the sophistication of the excerpting provided. Kudos.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch