On Jan 25, 2008, at 2:26 AM, Marvin Humphrey wrote:
>
> On Jan 24, 2008, at 8:20 PM, Father Chrysostomos wrote:
>
>> I'm trying to have a go at this.
>>
>> How many times is the disk accessed when one does a boolean search
>> (e.g., 'this OR that OR the-other')? And what are those times?
>
> The stack is pretty deep. The Perl side looks something like...
>
> KinoSearch::Search::Searchable::search
> KinoSearch::Searcher::top_docs
> KinoSearch::Searcher::collect
>

I was wondering whether it would be just as efficient to create a
BooleanQuery as Mr. Kurz suggested, but I see the problem with the IDFs.

>> I could find the answer myself by reading more source code, but
>> it's awfully time consuming....
>
> In order to create legitimate subclasses to implement WildCard
> queries, a bunch of stuff that isn't yet public will have to become
> public. I'm starting that off by exposing the Lexicon class, along
> with the factory method $index_reader->blank_lexicon($field_name).

I think I'm confused as to what the lexicon is for. In your earlier
example, you used

    my $lexicon = $reader->look_up_field($field);

so it appears that $lexicon is a pointer (not in the C sense) into the
index for the list of terms in that field. Why would we need to create
a blank one? Or is the idea to have a lexicon that covers multiple
fields?

Another thing: Since 'pet*' is essentially a type of simple regular
expression, why not provide support for Regexp queries? It should be
no less efficient if we look for a literal prefix (completely untested):
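
    # Get the literal prefix of the regexp, if any.
    # (Comments inside the pattern must not contain the "/" delimiter.)
    if($self->{re} =~
        /^
          (?: # wrapper of a stringified qr object, not allowing the i flag;
              # \^? covers the (?^...) form that Perl 5.14 and later emit:
            \(\? \^? ([a-hj-z]*) (?:-[a-z]*)?:
          )?
          (\\[GA]|\^)             # anchor
          ([^#\$()*+.?[\]\\^]+)   # literal pat (no metachars or comments)
          (?![*?{])               # a quantifier would make the last char optional
        /x
    ) {{
        my ($mod,$anchor,$prefix) = ($1,$2,$3);
        $mod = '' unless defined $mod;   # no qr wrapper was present
        $anchor eq '^' and $mod =~ /m/ and last;
        $mod =~ /x/ and $prefix =~ s/\s+//g;
        $self->{prefix} = $prefix;
    }}
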
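To show why a literal prefix should keep this cheap, here is a rough sketch. It assumes the terms of a field can be treated as a sorted list; the @terms array and the prefix_scan function are hypothetical stand-ins, not KinoSearch API. It seeks to the first term at or after the prefix and stops as soon as a term no longer starts with it, so the regexp is only tried on that slice of the list.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a field's lexicon: a sorted list of terms.
my @terms = sort qw(pat pea pet petal peter pets petty pew pit);

# Return the terms matching $re, testing only those that start with $prefix.
sub prefix_scan {
    my ($terms, $prefix, $re) = @_;

    # Binary search for the first term that sorts at or after $prefix.
    my ($lo, $hi) = (0, scalar @$terms);
    while ($lo < $hi) {
        my $mid = int(($lo + $hi) / 2);
        if ($terms->[$mid] lt $prefix) { $lo = $mid + 1 }
        else                           { $hi = $mid }
    }

    my @hits;
    for my $i ($lo .. $#$terms) {
        my $term = $terms->[$i];
        # Once a term no longer starts with the prefix, no later one will.
        last if substr($term, 0, length $prefix) ne $prefix;
        push @hits, $term if $term =~ $re;
    }
    return @hits;
}

my @hits = prefix_scan(\@terms, 'pet', qr/^pet.*\z/);
print "@hits\n";    # pet petal peter pets petty
```

Without a prefix, every term in the list would have to be tested against the regexp.
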
Then a wild card query could be a subclass that does the following to
its input:

    $str = quotemeta $str;
    for($str) {
        s/\\\*/.*/g;
        s/(?:\.\*){2,}/.*/g;
        s/^/^/;
        s/\z/\\z/;
    }

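The same transformation as a standalone function, so it can be tried outside any class. The name wildcard_to_regexp is made up for illustration, and this is only a sketch of the idea above.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a '*' wildcard pattern into an anchored
# regexp string.
sub wildcard_to_regexp {
    my ($str) = @_;
    $str = quotemeta $str;          # escape everything, including '*'
    $str =~ s/\\\*/.*/g;            # an escaped '*' becomes '.*'
    $str =~ s/(?:\.\*){2,}/.*/g;    # collapse runs of '.*' into one
    return '^' . $str . '\z';       # anchor both ends
}

for my $wild ('pet*', 'p**t', 'a.b*') {
    my $re = wildcard_to_regexp($wild);
    print "$wild => $re\n";
}
# pet* => ^pet.*\z
# p**t => ^p.*t\z
# a.b* => ^a\.b.*\z
```

The result for 'pet*' starts with '^pet', so the literal-prefix check would pick out 'pet' from it.
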
_______________________________________________
KinoSearch mailing list
KinoSearch@rectangular.com
http://www.rectangular.com/mailman/listinfo/kinosearch