Mailing List Archive: Re: Wildcards (was: Re: KinoSearch feature suggestions)

Re: Wildcards (was: Re: KinoSearch feature suggestions)

Jan 25, 2008, 8:38 AM

Post #1 of 4 (1686 views)

On Jan 25, 2008, at 2:26 AM, Marvin Humphrey wrote:

>
> On Jan 24, 2008, at 8:20 PM, Father Chrysostomos wrote:
>
>> I¢m trying to have a go at this.
>>
>> How many times is the disk accessed when one does a boolean search
>> (e.g., 'this OR that OR the-other')? And what are those times?
>
> The stack is pretty deep. The Perl side looks something like...
>
> KinoSearch::Search::Searchable::search
> KinoSearch::Searcher::top_docs
> KinoSearch::Searcher::collect
>
I was wondering whether it would be just as efficient to create a
BooleanQuery as Mr. Kurz suggested, but I see the problem with the IDFs.

>> I could find the answer myself by reading more source code, but
>> it¢s awfully time consuming....
>
> In order to create legitimate subclasses to implement WildCard
> queries, a bunch of stuff that isn't yet public will have to become
> public. I'm starting that off by exposing the Lexicon class, along
> with the factory method $index_reader->blank_lexicon($field_name).

I think I¢m confused as to what the lexicon is for. In your earlier
example, you used

my $lexicon = $reader->look_up_field($field);

so it appears that $lexicon is a pointer (not in the C sense) into the
index for the list of terms in that field. Why would we need to create
a blank one? Or is the idea to have a lexicon that covers multiple
fields?

Another thing: Since ¡pet*¢ is essentially a type of simple regular
expression, why not provide support for Regexp queries? It should be
no less efficient if we look for a literal prefix (completely untested):

# get the literal prefix of the regexp, if any.
if($self->{re} =~
/^
(?: # prefix for qr//'s, without allowing /i :
\(\? ([a-hj-z]*) (?:-[a-z]*)?:
)?
(\\[GA]|\^) # anchor
([^#\$()*+.?[\]\\^]+) # literal pat (no metachars or
comments)
/x
) {{
my ($mod,$anchor,$prefix) = ($1,$2,$3);
$anchor eq '^' and $mod =~ /m/ and last;
$mod =~ /x/ and $prefix =~ s/\s+//g;
$self->{prefix} = $prefix;
}}

Then a wild card query could be a subclass that does the following to
its input:

$str = quotemeta $str;
for($str) {
s/\\\*/.*/g;
s/(?:\.\*){2,}/.*/g;
s/^/^/;
s/\z/\\z/;
}

_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch

Re: Wildcards (was: Re: KinoSearch feature suggestions) [ In reply to ]

marvin at rectangular

Jan 25, 2008, 11:34 AM

Post #2 of 4 (1607 views)

Permalink

On Jan 25, 2008, at 8:38 AM, Father Chrysostomos wrote:

> I think I’m confused as to what the lexicon is for.

I've just committed some revised docs for Lexicon. Please let me
know if this is sufficiently clear:

KinoSearch::Index::Lexicon - Iterator for a field's Terms.

=head1 SYNOPSIS

my $lexicon = $index_reader->blank_lexicon('content');
while ( $lexicon->next ) {
print $lexicon->get_term->get_text . "\n";
}

=head1 DESCRIPTION

A Lexicon is an iterator which provides access to all the unique
Terms
for a given field in sorted order.

If an index consists of two documents with a 'content' field
holding
"three blind mice" and "three musketeers" respectively, then the
code
from the synopsis above would print this:

blind
mice
musketeers
three

> In your earlier example, you used
>
> my $lexicon = $reader->look_up_field($field);

I had just changed the name of this method to be "blank_lexicon".

my $lexicon = $reader->blank_lexicon($field);

However, that name choice seems to have been a bad call, judging by
your reaction:

> Why would we need to create a blank one?

Let's see if we can't come up with a less confusing solution.

Here's the rationale for the current names:

IndexReader provides factory methods which supply both Lexicons and
PostingLists.

my $seeked_lexicon = $reader->lexicon($term);
my $unseeked_lexicon = $reader->blank_lexicon($field);
my $seeked_posting_list = $reader->posting_list($term);
my $unseeked_posting_list = $reader->blank_posting_list($field);

blank_lexicon() and blank_posting_list() both return uninitialized
iterators. lexicon() and posting_list() both return iterators which
have been pre-located via seek($term).

With the name change, I'd hoped to eliminate any ambiguity as to
whether "look_up_field" would return a Lexicon or a PostingList, and
also psychologically reinforce the implication of parallel behavior
between "blank_lexicon" and "blank_posting_list".

One possible fix would be to supply only two methods, and have them
behave polymorphically based on whether a Term or a plain string
representing a field name was supplied:

my $seeked_lexicon = $reader->lexicon($term);
my $unseeked_lexicon = $reader->lexicon($field);
my $seeked_posting_list = $reader->posting_list($term);
my $unseeked_posting_list = $reader->posting_list($field);

That would be fine at the Perl level. However, I don't want to do
that at the C level, where these methods are actually implemented,
and it would be best if the C API and the Perl API resembled each
other as closely as possible.

So here's my latest thought:

my $seeked_lexicon = $reader->lexicon( 'content', 'foo' );
my $unseeked_lexicon = $reader->lexicon( 'content', undef );
my $seeked_posting_list = $reader->posting_list( 'content',
'foo' );
my $unseeked_posting_list = $reader->posting_list( 'content',
undef );

Does that make sense?

> Or is the idea to have a lexicon that covers multiple fields?

Lexicons explicitly do *not* cover multiple fields. In this, they
differ from the corresponding Lucene class, TermEnum, which always
iterates over all terms in all fields, sorting first by field name
then by term text. The file formats differ as well: for KS, each
field gets a dedicated file for simplicity's sake, while in Lucene,
values for multiple fields coexist within the same file.

There are two reasons behind this:

1. At some point, I intend to add support for specifying custom
sorting routines via FieldSpec, which would affect the order
in which the Terms in the Lexicon are sorted. Managing multiple
comparison functions within a single Lexicon object would be
complex and ugly.
2. At some point, it would be nice to support non-text fields.

Hmm. Thinking over the second point, perhaps it would be best if
Lexicons only stored field values rather than terms. In Lucene, that
wouldn't work because TermEnum objects handle multiple fields, but in
KS, the field is fixed.

Making such a change wouldn't be trivial, but it's probably worthwhile.

> Another thing: Since ‘pet*’ is essentially a type of simple
> regular expression, why not provide support for Regexp queries? It
> should be no less efficient if we look for a literal prefix
> (completely untested):
>
> # get the literal prefix of the regexp, if any.
> if($self->{re} =~
> /^
> (?: # prefix for qr//'s, without allowing /i :
> \(\? ([a-hj-z]*) (?:-[a-z]*)?:
> )?
> (\\[GA]|\^) # anchor
> ([^#\$()*+.?[\]\\^]+) # literal pat (no metachars or
> comments)
> /x
> ) {{
> my ($mod,$anchor,$prefix) = ($1,$2,$3);
> $anchor eq '^' and $mod =~ /m/ and last;
> $mod =~ /x/ and $prefix =~ s/\s+//g;
> $self->{prefix} = $prefix;
> }}

A RegexQuery class would be nice to have, but it would have some
significant limitations. If it used the existing KS index data
structures, it would not behave like a typical SQL regex or LIKE
query, matching the regex against the non-tokenized contents for each
field. If you did something like this...

my $regex_query = KSx::Search::RegexQuery->new(
field => 'content',
regex => qr/three blind/,
);

... and the 'content' field was tokenized, the regex wouldn't match
against any of the values in the Lexicon, since e.g. "blind" doesn't
match qr/three blind/.

In contrast, a trailing wildcard is relatively easy to implement,
because you can find all the terms that match with this routine:

my ($wild_text) = "pet*" =~ /^(.*?)\*/;
my $lexicon = $index_reader->posting_list( 'content',
$wild_text );
while (1) {
my $term = $lexicon->get_term;
last unless $term;
my $term_text = $term->get_text;
last unless index( $term_text, $wild_text ) == 0;
push @term_texts, $term_text;
last unless $lexicon->next;
}

A non-trailing wildcard is more expensive to implement, because it
requires iterating over the entire Lexicon.

> Then a wild card query could be a subclass that does the following
> to its input:
>
> $str = quotemeta $str;
> for($str) {
> s/\\\*/.*/g;
> s/(?:\.\*){2,}/.*/g;
> s/^/^/;
> s/\z/\\z/;
> }

Using regexes to modify regexes is probably not something I would
have thought to do, but that's all the better. My goal is to provide
the scaffolding. I'm focused on getting Lexicon and PostingList
right, so that you can use or abuse them as you wish. :)

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch

Re: Wildcards (was: Re: KinoSearch feature suggestions) [ In reply to ]

sprout at cpan

Jan 25, 2008, 12:57 PM

Post #3 of 4 (1610 views)

Permalink

On Jan 25, 2008, at 11:34 AM, Marvin Humphrey wrote:

> I've just committed some revised docs for Lexicon. Please let me
> know if this is sufficiently clear:
>

> KinoSearch::Index::Lexicon - Iterator for a field's Terms.
>
> =head1 SYNOPSIS
[...]

Perfect. :-)

>
> So here's my latest thought:
>
> my $seeked_lexicon = $reader->lexicon( 'content', 'foo' );
> my $unseeked_lexicon = $reader->lexicon( 'content', undef );
> my $seeked_posting_list = $reader->posting_list( 'content',
> 'foo' );
> my $unseeked_posting_list = $reader->posting_list( 'content',
> undef );
>
> Does that make sense?

Yes, it does.

> 2. At some point, it would be nice to support non-text fields.

Since binary data can be stored in a string, it is already supported,
is it not?

> Hmm. Thinking over the second point, perhaps it would be best if
> Lexicons only stored field values rather than terms. In Lucene,
> that wouldn't work because TermEnum objects handle multiple fields,
> but in KS, the field is fixed.

Do you mean that the field contains the terms, which contain the field
name? This does seem redundant.

> Making such a change wouldn't be trivial, but it's probably
> worthwhile.

That would certainly make things simpler. Of course, it¢s up to you.

> A RegexQuery class would be nice to have, but it would have some
> significant limitations. If it used the existing KS index data
> structures, it would not behave like a typical SQL regex or LIKE
> query, matching the regex against the non-tokenized contents for
> each field. If you did something like this...
>
> my $regex_query = KSx::Search::RegexQuery->new(
> field => 'content',
> regex => qr/three blind/,
> );
>
> ... and the 'content' field was tokenized, the regex wouldn't match
> against any of the values in the Lexicon, since e.g. "blind" doesn't
> match qr/three blind/.

I should have called it RegexpTermQuery. :-) I still like the idea,
though.

_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch

Re: Wildcards [ In reply to ]

marvin at rectangular

Jan 27, 2008, 8:20 PM

Post #4 of 4 (1608 views)

Permalink

On Jan 25, 2008, at 12:57 PM, Father Chrysostomos wrote:

>> 2. At some point, it would be nice to support non-text fields.
>
> Since binary data can be stored in a string, it is already
> supported, is it not?

No, that's not the case. In the devel branch, non-binary fields are
forced to UTF-8 very early on, in InvIndexer::add_doc.

>> Hmm. Thinking over the second point, perhaps it would be best if
>> Lexicons only stored field values rather than terms. In Lucene,
>> that wouldn't work because TermEnum objects handle multiple
>> fields, but in KS, the field is fixed.
>
> Do you mean that the field contains the terms, which contain the
> field name? This does seem redundant.

The KS implementation is not inefficient, it's just a little bloated.

Both the Lucene and KS file formats store the text change for each
use string deltas, so encoding "petard" followed by "petunia" looks
something like this:

my $prefix_length = 3;
my $diff_length = 4;
$outstream->print( pack( 'wwa*', $prefix_length, $diff_length,
"unia" ) );

Lucene needs one extra compressed integer per entry to represent the
field number -- not a big deal. The real bottleneck in search occurs
when processing postings, not when scanning through lexicon data.

>> Making such a change wouldn't be trivial, but it's probably
>> worthwhile.
>
> That would certainly make things simpler. Of course, it’s up to you.

I'll give it a whirl.

I'm thinking that Lexicon's get_term() method will probably go away
in favor of two new methods.

my $term_text = $lexicon->get_value;
my $value_obj = $lexicon->peek_value;

For now, only get_value will be public. Internally, KS will use
peek_value.

At the C level, Lex_Peek_Value() will return a pointer to the lexicon-
>value member variable. In anticipation of extending support for
non-text fields, we'll make that an Obj*. For now, it will always be
a ByteBuf.

Obj*
Lex_peek_value(Lexicon *self)
{
return self->value;
}

Before the iterator starts and after it finishes, Lex_Peek_Value will
return NULL.

get_value() will be XS-only. It will return either undef, a perl
object representing lexicon->value, or, if lexicon->value is a
ByteBuf, a plain Perl scalar string.

SV*
get_value(self)
kino_Lexicon *self;
CODE:
{
kino_Obj *value = Kino_Lex_Peek_Value(self);
if (value == NULL) {
RETVAL = newSV(0);
}
else if (KINO_OBJ_IS_A(value, KINO_BYTEBUF)) {
RETVAL = bb_to_sv(value);
}
else {
RETVAL = Kino_Obj_To_Native(value);
}
}
OUTPUT: RETVAL

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch