On Jan 23, 2008, at 6:59 AM, Peter Karman wrote:
> Yes, Swish-e supports the '*' to match 1 or more characters at the
> end of the word, and
> '?' to match exactly one character anywhere in the word. However,
> Swish-e does that by
> means of a 256-byte wide lookup table (iirc), which works only
> because Swish-e supports
> single-byte encodings.
Interesting. Is the lookup table used when matching terms, or also
postings? Is there a clear division between the two stages, as there
is in KS?
Accumulating the set of terms that match the wildcard isn't that
hard. Here's a simple example with a trailing wildcard.
my $wildcard_query_string = "pet*";
my ($frag) = $wildcard_query_string =~ /(.*?)\*/;
my $lexicon = $index_reader->look_up_field("content");
$lexicon->seek($frag);
my @terms;
while ( $lexicon->next ) {
my $term = $lexicon->get_term;
last unless index( $term->get_text, $frag ) == 0;
push @terms, $term;
}
From there, we assemble a priority queue of PostingLists:
my $pri_q = KinoSearch::Search::PListQueue->new( size => scalar
@terms );
for my $term (@terms) {
my $posting_list = $index_reader->posting_list($term);
$pri_q->insert($posting_list);
}
The problem we have now is that the priority queue of PostingLists
probably isn't a good way to zip through a lot of matching terms.
There's going to be some disk seeking, as the results for "peter" and
"petroleum" and "petunia" are interleaved. Hmm...
If we punt on scoring, it might make sense from an i/o standpoint to
iterate through all the matches up front and save a BitVector with
matching doc nums set.
Actually, if we iterate up front, we could find out the IDF of the
fragment and then use that to assess a crude score. However, we
wouldn't have TF info available unless we use something bigger than
a BitVector to hold the temporary results.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch
> Yes, Swish-e supports the '*' to match 1 or more characters at the
> end of the word, and
> '?' to match exactly one character anywhere in the word. However,
> Swish-e does that by
> means of a 256-byte wide lookup table (iirc), which works only
> because Swish-e supports
> single-byte encodings.
Interesting. Is the lookup table used when matching terms, or also
postings? Is there a clear division between the two stages, as there
is in KS?
Accumulating the set of terms that match the wildcard isn't that
hard. Here's a simple example with a trailing wildcard.
my $wildcard_query_string = "pet*";
my ($frag) = $wildcard_query_string =~ /(.*?)\*/;
my $lexicon = $index_reader->look_up_field("content");
$lexicon->seek($frag);
my @terms;
while ( $lexicon->next ) {
my $term = $lexicon->get_term;
last unless index( $term->get_text, $frag ) == 0;
push @terms, $term;
}
From there, we assemble a priority queue of PostingLists:
my $pri_q = KinoSearch::Search::PListQueue->new( size => scalar
@terms );
for my $term (@terms) {
my $posting_list = $index_reader->posting_list($term);
$pri_q->insert($posting_list);
}
The problem we have now is that the priority queue of PostingLists
probably isn't a good way to zip through a lot of matching terms.
There's going to be some disk seeking, as the results for "peter" and
"petroleum" and "petunia" are interleaved. Hmm...
If we punt on scoring, it might make sense from an i/o standpoint to
iterate through all the matches up front and save a BitVector with
matching doc nums set.
Actually, if we iterate up front, we could find out the IDF of the
fragment and then use that to assess a crude score. However, we
wouldn't have TF info available unless we use something bigger than
a BitVector to hold the temporary results.
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch