Mailing List Archive

Re: KinoSearch feature suggestions
Hi,

On Jan 21, 2008, at 2:16 PM, Father Chrysostomos wrote:

> I’d like to request that a few features be added to KinoSearch. I
> need these features myself, so I’m willing to contribute patches.
> Please let me know what you think.

I'm going to take the liberty of cc'ing this to the KinoSearch
mailing list, since it was filed as a public rt.cpan.org issue.

> 1. Wildcards in search queries

I am in favor of wildcards being available via a separate
distribution, and I would very much like to hammer out an elegant low-
level API to support such a distro. A lot of the work I have been
doing lately is intended to facilitate such endeavors.

Wildcards should not be in core KS, because they are by their nature
vastly more expensive than whole-word queries. I have observed that
their comparative cost often comes as an unpleasant shock. However,
providing a separate distro will prompt people to assess the costs
with open eyes.

> 2. I’d like KinoSearch::Highlight::Highlighter to be able to create
> non-contiguous excerpts (which I’m calling ‘summaries’; the
> contiguous sub-parts of each summary I’m calling excerpts):
>
> $highlighter->add_spec( excerpt_length => 50, summary_length =>
> 200, ...);
>
> The highlighter would find the most important word to highlight (as
> it currently does), and create a 50-char excerpt. Then it would
> create an excerpt for the second most important word and add that
> (removing overlap if necessary), repeating this process until the
> summary is the right length.

I think this should be implemented by abstracting out the excerpt
selection engine, analogous to the way that
KinoSearch::Highlight::Encoder and KinoSearch::Highlight::Formatter
abstract out other functionality used by the Highlighter. How about
if we outsource excerpting to subclasses of a new class,
KinoSearch::Highlight::Excerpter? Then you could release your own
distro, e.g. KSx::Highlight::SummaryExcerpter.

> 3. Custom ellipsis marks:
>
> $highlighter->add_spec( ellipsis_mark => "\x{2026}", ... )

I understand the problem, but adding a such a specific param to
Highlighter->add_spec seems brittle. I think this should be
something which is set via a custom excerpting engine.

Incidentally, Highlighter's treatment of the ellipsis also prompted
part of <http://rt.cpan.org/Public/Bug/Display.html?id=25400>.

> 4. Pagination (another highlighter feature): An index field could
> be designated as the ‘page offset’ field, containing byte offsets
> of page breaks.
>
> $highlighter->add_spec(
> page_offset_field => 'pageoffsets',
> page_offset_formatter => $object,
> );
>
> And $object would have to have a page_label method: sub page_label
> { my ($self, $fields_hashref, $page_no) = @_; ... }

This feature also seems like it should belong to a particular
Excerpter implementation.

> Though it might be more complicated, maybe we could have page
> breaks (chr 12) recorded automatically when the index is created.
> Then ‘page_offset_field’ won’t be necessary.

That would work well. It's trivial to implement effectively using C/
XS, because you can just zip along the string counting page breaks.

long
count_breaks(SV *input_sv) {
STRLEN len;
char *ptr = SvPV(input_sv, len);
char *end = SvEND(input_sv);
long count = 0;
while (ptr < end) { if (*ptr++ == 12) count++; }
return count;
}

With Perl, tr// works for efficient character counting, IIRC.

> For examples of 2 and 4 in use, see <http://synodinresistance.org/
> cgi-bin/anazetesis?all=1&and-glossa=&and-morphe=&g=en&q=thing>
> (which I’d like to switch to using KinoSearch, because it’s
> currently too slow).

I admire the sophistication of the excerpting provided. Kudos.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
Re: KinoSearch feature suggestions [ In reply to ]
On Jan 22, 2008, at 5:04 PM, Father Chrysostomos wrote:

>> I am in favor of wildcards being available via a separate
>> distribution, and I would very much like to hammer out an elegant
>> low-level API to support such a distro. A lot of the work I have
>> been doing lately is intended to facilitate such endeavors.
>
> Is there anything I can do to help on the Perl side?

If things work out as I hope, we'll be able to prototype this in pure
Perl. :)

Abstract methods for many KS base classes are now implemented by
calling back from C to the host language (in this case, Perl). These
capabilities aren't well-documented or API-stable yet, but
KinoSearch::Search::MockScorer serves as proof-of-concept.

We'll need at least three classes:

* KSx::Search::WildCard::WildCardQuery
* KSx::Search::WildCard::WildCardWeight
* KSx::Search::WildCard::WildCardScorer

A Lucene-like implementation would work something like this:

* Find all the terms that match the wildcard query string.
* Create one PostingList per term, and put them in a
PriorityQueue which
sorts by $posting_list->get_doc_num.
* Have $wild_card_scorer->next advance the priority queue.

There are some flaws in this design. It tends to blow up if you
match too many terms (Lucene throws an exception at 1024 by default
rather than run the risk of an out-of-memory error). Scoring is a
bit of a bugaboo, but for now we can punt and just have
WildCardScorer assess a fixed score of 0 for each doc.

However, with this design we can get something working reasonably
quickly, and then check out prior art from other IR projects. Peter,
does SWISH do wildcards?

>> How about if we outsource excerpting to subclasses of a new class,
>> KinoSearch::Highlight::Excerpter?
>
> I think I can have a patch for this in a couple of days.

Sweet. :)

> But the *offsets* of the page breaks need to be recorded. Counting
> is not sufficient. I still have to think more about how this should
> work—unless you have some ideas.

We can modify that function to record offsets in a Perl array. This
(untested) variant renders those offsets as counts of Unicode code
points:

use Inline => C <<'END_C';

AV*
count_breaks(SV *input_sv) {
STRLEN len;
char *ptr = SvPV_utf8(input_sv, len);
char *start = ptr;
char *end = SvEND(input_sv);
long offset = 0;
AV *offsets = newAV();

/* Accumulate an array of offsets for form-feed characters. */
while (ptr < end) {
if (*ptr++ == 12) {
SV *offset_sv = newSViv(offset);
av_push(offsets, offset_sv);
}
ptr += UTF8SKIP(ptr);
offset++;
}

return offsets;
}

END_C

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
Re: Re: KinoSearch feature suggestions [ In reply to ]
Hi folks,

Turns out the mailing list and rt.cpan.org aren't interacting well
and each new message is opening a new RT issue. So please reply to
the mailing list only (<kinosearch@rectangular.com>) from here on
out. Father Chrysostomos, please subscribe if you haven't already:
<http://www.rectangular.com/mailman/listinfo/kinosearch/>.

Thanks,

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
Re: Re: KinoSearch feature suggestions [ In reply to ]
On 01/23/2008 08:32 AM, Marvin Humphrey wrote:

> However, with this design we can get something working reasonably
> quickly, and then check out prior art from other IR projects. Peter,
> does SWISH do wildcards?
>

Yes, Swish-e supports the '*' to match 1 or more characters at the end of the word, and
'?' to match exactly one character anywhere in the word. However, Swish-e does that by
means of a 256-byte wide lookup table (iirc), which works only because Swish-e supports
single-byte encodings.


>>> How about if we outsource excerpting to subclasses of a new class,
>>> KinoSearch::Highlight::Excerpter?
>>

fwiw, Search::Tools offers highlighting and excerpting (snipping) via the building of
complex regular expressions. See
http://search.cpan.org/~karman/Search-Tools-0.16/lib/Search/Tools/Snipper.pm
http://search.cpan.org/~karman/Search-Tools-0.16/lib/Search/Tools/HiLiter.pm

The algorithm I use for snipping/excerpting is slow, and I would love to see how a
different approach could improve performance. I believe the primary reason my approach is
slow is that it uses a big regex.

--
Peter Karman . peter@peknet.com . http://peknet.com/


_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
Re: KinoSearch feature suggestions [ In reply to ]
On Jan 23, 2008, at 6:32 AM, Marvin Humphrey wrote:

> If things work out as I hope, we'll be able to prototype this in
> pure Perl. :)

I¢m trying to have a go at this.

How many times is the disk accessed when one does a boolean search
(e.g., 'this OR that OR the-other')? And what are those times?

I could find the answer myself by reading more source code, but it¢s
awfully time consuming....

Sorry for the trouble.


Father Chrysostomos


_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
Re: Re: KinoSearch feature suggestions [ In reply to ]
On Jan 24, 2008, at 8:20 PM, Father Chrysostomos wrote:

> I’m trying to have a go at this.
>
> How many times is the disk accessed when one does a boolean search
> (e.g., 'this OR that OR the-other')? And what are those times?

The stack is pretty deep. The Perl side looks something like...

KinoSearch::Search::Searchable::search
KinoSearch::Searcher::top_docs
KinoSearch::Searcher::collect

Then the C stack looks something like...

Scorer_collect (Scorer.c)
BoolScorer_skip_to (BooleanScorer.c)
ORScorer_skip_to (ORScorer.c)
advance_after_current (ORScorer.c)
ScorerDocQ_top_next (ScorerDocQueue.c)
TermScorer_next (TermScorer.c)
SegPList_next (SegPostingList.c)
ScorePost_read_record (ScorePosting.c)
InStream_read_c32 (Instream.c)
InStream_read_u8 (Instream.c)
refill (InStream.c)
read_internal (InStream.c)
FSFileDes_fdread (FSFileDes.c)
fread (stdio)

The InStream class is analogous to a filehandle. But we won't really
have to concern ourselves with InStream for this purpose. The
deepest we'll need to get is PostingList.

> I could find the answer myself by reading more source code, but
> it’s awfully time consuming....

In order to create legitimate subclasses to implement WildCard
queries, a bunch of stuff that isn't yet public will have to become
public. I'm starting that off by exposing the Lexicon class, along
with the factory method $index_reader->blank_lexicon($field_name).

The first thing we'll need to do is accumulate an array of terms that
match the wildcard using the Lexicon object.

Soon, we'll need the PostingList class and $index_reader-
>blank_posting_list($field_name);

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/



_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch