Mailing List Archive: Wildcards (was Re: KinoSearch feature suggestions)

Wildcards (was Re: KinoSearch feature suggestions)

marvin at rectangular

Jan 23, 2008, 10:31 PM

Post #1 of 6 (1997 views)

On Jan 23, 2008, at 6:59 AM, Peter Karman wrote:
> Yes, Swish-e supports the '*' to match 1 or more characters at the
> end of the word, and
> '?' to match exactly one character anywhere in the word. However,
> Swish-e does that by
> means of a 256-byte wide lookup table (iirc), which works only
> because Swish-e supports
> single-byte encodings.

Interesting. Is the lookup table used when matching terms, or also
postings? Is there a clear division between the two stages, as there
is in KS?

Accumulating the set of terms that match the wildcard isn't that
hard. Here's a simple example with a trailing wildcard.

my $wildcard_query_string = "pet*";
my ($frag) = $wildcard_query_string =~ /(.*?)\*/;
my $lexicon = $index_reader->look_up_field("content");
$lexicon->seek($frag);
my @terms;
while ( $lexicon->next ) {
my $term = $lexicon->get_term;
last unless index( $term->get_text, $frag ) == 0;
push @terms, $term;
}

From there, we assemble a priority queue of PostingLists:

my $pri_q = KinoSearch::Search::PListQueue->new( size => scalar
@terms );
for my $term (@terms) {
my $posting_list = $index_reader->posting_list($term);
$pri_q->insert($posting_list);
}

The problem we have now is that the priority queue of PostingLists
probably isn't a good way to zip through a lot of matching terms.
There's going to be some disk seeking, as the results for "peter" and
"petroleum" and "petunia" are interleaved. Hmm...

If we punt on scoring, it might make sense from an i/o standpoint to
iterate through all the matches up front and save a BitVector with
matching doc nums set.

Actually, if we iterate up front, we could find out the IDF of the
fragment and then use that to assess a crude score. However, we
wouldn't have TF info available unless we use something bigger than
a BitVector to hold the temporary results.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch

Re: Wildcards (was Re: KinoSearch feature suggestions) [ In reply to ]

Jan 23, 2008, 11:47 PM

Post #2 of 6 (1921 views)

On 1/23/08, Marvin Humphrey <marvin@rectangular.com> wrote:
> If we punt on scoring, it might make sense from an i/o standpoint to
> iterate through all the matches up front and save a BitVector with
> matching doc nums set.

Hi Marvin! I'm only half paying attention from here in the peanut
gallery, but this got my attention. Don't punt on the scoring!

>From my naive point of view, a wildcard just looks like another way of
specifying a boolean OR. Why not expand it out with the parser level?
Sure it might be really big, but there's nothing wrong with providing
support for industrial strength boolean queries. Of course, I say
that because I'm going to want them one day for my own nefarious
purposes, and with flexible scoring at that.

> Actually, if we iterate up front, we could find out the IDF of the
> fragment and then use that to assess a crude score.

I will be so appreciative some day if you move away from architectures
that presumes IDF is always going to be the way that things are
scored.

> The problem we have now is that the priority queue of PostingLists
> probably isn't a good way to zip through a lot of matching terms.
> There's going to be some disk seeking, as the results for "peter" and
> "petroleum" and "petunia" are interleaved. Hmm...

All these disk seeks you are seeing... have you ever caught one in
the wild? Modern systems have several levels of caching between you
and the disk head. While truly random lookup is bad, anything
remotely predictable is probably going to be cached. Don't avoid the
interleaved data until you're sure it is going to be a problem.

Nathan Kurz
nate@verse.com

_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch

Re: Wildcards (was Re: KinoSearch feature suggestions) [ In reply to ]

Jan 24, 2008, 9:52 AM

Post #3 of 6 (1927 views)

On Jan 23, 2008, at 10:31 PM, Marvin Humphrey wrote:

> Actually, if we iterate up front, we could find out the IDF of the
> fragment and then use that to assess a crude score. However, we
> wouldn't have TF info available unless we use something bigger than
> a BitVector to hold the temporary results.

Forgive my ignorance, but what are IDF and TF?

_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch

Re: Wildcards (was Re: KinoSearch feature suggestions) [ In reply to ]

Jan 24, 2008, 10:24 AM

Post #4 of 6 (1921 views)

On Jan 24, 2008, at 9:52 AM, Father Chrysostomos wrote:

>
> On Jan 23, 2008, at 10:31 PM, Marvin Humphrey wrote:
>
>> Actually, if we iterate up front, we could find out the IDF of the
>> fragment and then use that to assess a crude score. However, we
>> wouldn't have TF info available unless we use something bigger
>> than a BitVector to hold the temporary results.
>
> Forgive my ignorance, but what are IDF and TF?

Never mind. :-) I just found out.

_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch

Re: Wildcards (was Re: KinoSearch feature suggestions) [ In reply to ]

peter at peknet

Jan 24, 2008, 7:10 PM

Post #5 of 6 (1907 views)

Marvin Humphrey wrote on 1/24/08 12:31 AM:
> On Jan 23, 2008, at 6:59 AM, Peter Karman wrote:
>> Yes, Swish-e supports the '*' to match 1 or more characters at the end
>> of the word, and
>> '?' to match exactly one character anywhere in the word. However,
>> Swish-e does that by
>> means of a 256-byte wide lookup table (iirc), which works only because
>> Swish-e supports
>> single-byte encodings.
>
> Interesting. Is the lookup table used when matching terms, or also
> postings? Is there a clear division between the two stages, as there is
> in KS?

The lookup table is used in matching terms only.

There is not such a clear in division in Swish-e. There is really only the
concept of the 'word' (i.e. term).

--
Peter Karman . http://peknet.com/ . peter@peknet.com

_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch

Re: Wildcards (was Re: KinoSearch feature suggestions) [ In reply to ]

marvin at rectangular

Jan 25, 2008, 2:11 AM

Post #6 of 6 (1920 views)

On Jan 24, 2008, at 9:52 AM, Father Chrysostomos wrote:

> Forgive my ignorance, but what are IDF and TF?

I saw that you've already gotten these, but you may still find the
documentation KinoSearch::Docs::IRTheory and
KinoSearch::Docs::FileFormat illuminating.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch