Mailing List Archive

Analyzer API mods (was API request for KS::InvIndexer...)
On May 3, 2007, at 8:01 PM, Peter Karman wrote:

> Sounds like you are suggesting an API change, but to the Analyzer
> class instead, giving it more power to affect individual fields.
> Sounds fine to me, especially given some of the API specifics you
> mention below.

Excellent. I think Analyzer needs three public analyze_xxxxx
methods: analyze_field, analyze_text, and analyze_batch. They will
take different arguments, but each will return a TokenBatch.

sub analyze_field {
    my ( $self, $doc, $field_name ) = @_;
    my $batch = KinoSearch::Analysis::TokenBatch->new(
        text => $doc->{$field_name},
    );
    return $self->analyze_batch($batch);
}

sub analyze_text {
    my $batch = KinoSearch::Analysis::TokenBatch->new( text => $_[1] );
    return $_[0]->analyze_batch($batch);
}

sub analyze_batch { shift->abstract_death }

analyze_batch() should take the place of the current analyze(). All
subclasses will have to implement it.
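The subclass contract is simple: take a TokenBatch, return a TokenBatch
(either the same one transformed, or a new one). Here's a sketch of the
shape of an override -- MyLCAnalyzer and the array-of-hashes "batch" are
stand-ins, not real KinoSearch classes:

```perl
use strict;
use warnings;

# Sketch of the mandatory analyze_batch() override. The plain array of
# hashes stands in for a real TokenBatch -- just enough to show the
# batch-in, batch-out contract.
package MyLCAnalyzer;

sub new { bless {}, shift }

sub analyze_batch {
    my ( $self, $batch ) = @_;
    # Transform each token in place, then hand the batch back.
    $_->{text} = lc $_->{text} for @$batch;
    return $batch;
}

package main;

my $analyzer = MyLCAnalyzer->new;
my $batch    = [ { text => 'FOO' }, { text => 'Bar' } ];
$batch = $analyzer->analyze_batch($batch);
print join( ' ', map { $_->{text} } @$batch ), "\n";    # foo bar
```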

The only change from the current implementation of analyze_text() is
that calls to analyze() will need to be swapped out for calls to
analyze_batch(). Then it needs public docs.

SegWriter should be adapted to use analyze_field instead of
analyze_text as it does now.

Note: analyze_text() is not just a convenience method; it also
allows a small optimization, avoiding a string copy or two when
a subclass overrides it. Instead of copying the text into a
TokenBatch then processing the copy and creating a second TokenBatch,
we start with the original, process, and create a TokenBatch. That
saves 3 string alloc/copy ops in the case of LCNormalizer and 1 in
the case of Tokenizer. (LCNormalizer has more due to crossing the
Perl/C boundary in Token->get_text and Token->set_text.)

In order to make things work for you, I think we need to add
TokenBatch->eat.

$token_batch->eat( $other, $additional_pos_inc );

$additional_pos_inc would be added to the pos_inc of the last Token
in the cannibalistic batch. From perl-space we can have it default
to 0 if only one arg is supplied; from C it will be required, of
course. By setting it to 1, you'll be able to interrupt phrase
matching as requested.

sub analyze_field {
    my ( $self, $doc, $field_name ) = @_;
    my $token_batch = KinoSearch::Analysis::TokenBatch->new;
    my @frags = $self->{parser}->parse( $doc->{$field_name} );
    for my $frag (@frags) {
        my $sub_batch = $self->{tokenizer}->analyze_text($frag);
        $token_batch->eat( $sub_batch, 1 );
    }
    return $token_batch;
}
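The effect of that extra position increment can be modeled in a few lines
of plain Perl (a toy model, not real TokenBatch internals): a token's
absolute position is the running sum of pos_inc values, and a phrase query
matches only consecutive positions, so bumping the increment at a fragment
boundary leaves a gap no phrase can span.

```perl
use strict;
use warnings;

# Toy model of position increments -- not real KinoSearch internals.
# The first token of fragment 2 gets pos_inc 2 because eat( $sub_batch, 1 )
# adds 1 to the normal increment at the fragment boundary.
my @tokens = (
    { text => 'like',  pos_inc => 1 },    # fragment 1 ...
    { text => 'it',    pos_inc => 1 },
    { text => 'hot',   pos_inc => 1 },
    { text => 'stove', pos_inc => 2 },    # fragment 2: 1 + additional_pos_inc
);

my $pos = 0;
for my $t (@tokens) {
    $pos += $t->{pos_inc};
    $t->{pos} = $pos;
}

# A phrase query for "hot stove" needs adjacent positions (N, N+1);
# here "hot" lands at 3 and "stove" at 5, so the phrase cannot match.
my %pos_of = map { $_->{text} => $_->{pos} } @tokens;
print $pos_of{stove} - $pos_of{hot} == 1
    ? "phrase matches\n"
    : "phrase blocked\n";    # phrase blocked
```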


>> * The utf8::upgrade calls performed by InvIndexer, which
>> can probably be moved to individual analyzers.
>
> agreed. perhaps with a syntactically sweet wrapper in the base
> Analyzer class?
> So analyzer methods that care could call:
>
> $self->utf8ify( $field_value );

That's not a bad idea. utf8::upgrade is a funny, non-perlish
function. It modifies its argument in place. (So would utf8ify.)
Also, it's always available: you don't have to 'use utf8' in order to
get it -- and indeed you shouldn't, unless you really want your
source code interpreted as utf8.

Probably what we should do is implement our own replacement in XS.
I'm not sure it ought to be a method in Analyzer, though. It might
be better as a function in KinoSearch::Util::StringHelper (which
would get a public API). That way other classes can use it:
QueryParser, etc.

>> The other possibility is to add a tutorial under KinoSearch::Docs,
>> or even publish such a tutorial on a WikiToBeNamedLater, reserving
>> Analyzer's POD for concise API documentation. I lean towards
>> stuffing everything into Analyzer, though.
>
> docs_in_analyzer++

OK, cool.

Would you like to work collaboratively on this stuff, the way Pudge
and I did on the Filter classes? I can take care of everything, but
A) there's other work that has to be done that only I can do, B) the
code will come out better if we ensure that at least two people grok
it, and C) this is a point at which KS and Swish3 meet and the
handshaking will probably be cleaner if you develop a deep
understanding of how the KS side works.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
Re: Analyzer API mods (was API request for KS::InvIndexer...)
Marvin Humphrey scribbled on 5/4/07 12:58 PM:

> I think Analyzer needs three public analyze_xxxxx methods:
> analyze_field, analyze_text, and analyze_batch. They will take
> different arguments, but each will return a TokenBatch.

sounds sane.

> analyze_batch() should take the place of the current analyze(). All
> subclasses will have to implement it.
>
> The only change from the current implementation of analyze_text() is
> that calls to analyze() will need to be swapped out for calls to
> analyze_batch(). Then it needs public docs.
>
> SegWriter should be adapted to use analyze_field instead of analyze_text
> as it does now.
>
> Note: analyze_text() is not just a convenience method; it also allows a
> small optimization, avoiding a string copy or two when a subclass
> overrides it. Instead of copying the text into a TokenBatch then
> processing the copy and creating a second TokenBatch, we start with the
> original, process, and create a TokenBatch. That saves 3 string
> alloc/copy ops in the case of LCNormalizer and 1 in the case of
> Tokenizer. (LCNormalizer has more due to crossing the Perl/C boundary
> in Token->get_text and Token->set_text.)
>

glad you're thinking of these things.


> In order to make things work for you, I think we need to add
> TokenBatch->eat.
>
> $token_batch->eat( $other, $additional_pos_inc );
>
> $additional_pos_inc would be added to the pos_inc of the last Token in
> the cannibalistic batch. From perl-space we can have it default to 0 if
> only one arg is supplied; from C it will be required, of course. By
> setting it to 1, you'll be able to interrupt phrase matching as requested.
>
> sub analyze_field {
>     my ( $self, $doc, $field_name ) = @_;
>     my $token_batch = KinoSearch::Analysis::TokenBatch->new;
>     my @frags = $self->{parser}->parse( $doc->{$field_name} );
>     for my $frag (@frags) {
>         my $sub_batch = $self->{tokenizer}->analyze_text($frag);
>         $token_batch->eat( $sub_batch, 1 );
>     }
>     return $token_batch;
> }
>

nice.


>
>>> * The utf8::upgrade calls performed by InvIndexer, which
>>> can probably be moved to individual analyzers.
>>
>> agreed. perhaps with a syntactically sweet wrapper in the base
>> Analyzer class?
>> So analyzer methods that care could call:
>>
>> $self->utf8ify( $field_value );
>
> That's not a bad idea. utf8::upgrade is a funny, non-perlish function.
> It modifies its argument in place. (So would utf8ify.) Also, it's
> always available: you don't have to 'use utf8' in order to get it -- and
> indeed you shouldn't, unless you really want your source code
> interpreted as utf8.
>
> Probably what we should do is implement our own replacement in XS. I'm
> not sure it ought to be a method in Analyzer, though. It might be
> better as a function in KinoSearch::Util::StringHelper (which would get
> a public API). That way other classes can use it: QueryParser, etc.
>

good idea to make it a Util method.

In Swish XS, I do something like this (not real code):

swish_utf8ify(self, str)
    SV* self;
    SV* str;

CODE:

    char * buf = SvPV(str, PL_na);

    if (!SvUTF8(str))
    {
        if (swish_is_ascii(buf))
            SvUTF8_on(str);  /* flags original SV */
        else
            croak("%s is not flagged as a UTF-8 string and is not ASCII", buf);
    }

where swish_is_ascii() just makes sure there is no byte > 127. I'm sure there's
a native Perl equivalent.
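A plausible pure-Perl counterpart is a one-line regex check for any byte
above 127 (is_ascii here is a hypothetical name, not part of KS or Swish):

```perl
use strict;
use warnings;

# Hypothetical pure-Perl equivalent of swish_is_ascii(): true when the
# string contains no byte greater than 127.
sub is_ascii { $_[0] !~ /[^\x00-\x7F]/ }

print is_ascii('plain text')   ? "ascii\n" : "not ascii\n";    # ascii
print is_ascii("Mot\xF6rhead") ? "ascii\n" : "not ascii\n";    # not ascii
```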

You'll notice I don't handle Latin1 or EBCDIC as utf8::upgrade() claims to.
The utf8 pod recommends Encode, and that's what Search::Tools::Transliterate
uses for its to_utf8() method (all in perl space).

That's not as friendly for the (common western) case of full Latin1, but I
figure it's a Good Thing to force users to be aware of their source encodings,
rather than quietly converting them to UTF-8. I know that not everyone agrees with
me on that.


>>> The other possibility is to add a tutorial under KinoSearch::Docs, or
>>> even publish such a tutorial on a WikiToBeNamedLater, reserving
>>> Analyzer's POD for concise API documentation. I lean towards
>>> stuffing everything into Analyzer, though.
>>
>> docs_in_analyzer++
>
> OK, cool.
>
> Would you like to work collaboratively on this stuff, the way Pudge and
> I did on the Filter classes? I can take care of everything, but A)
> there's other work that has to be done that only I can do, B) the code
> will come out better if we ensure that at least two people grok it, and
> C) this is a point at which KS and Swish3 meet and the handshaking will
> probably be cleaner if you develop a deep understanding of how the KS
> side works.

I'd be happy to try, though I can't promise to be as fast at it as that last
round with you and Pudge. I'll be more likely to chip away rather than plunge ahead.

I think your example code above is a great roadmap though and I'll get at it as
I can.

--
Peter Karman . http://peknet.com/ . peter@peknet.com
Re: Analyzer API mods (was API request for KS::InvIndexer...)
On May 4, 2007, at 11:52 AM, Peter Karman wrote:
> In Swish XS, I do something like this (not real code):
>
> swish_utf8ify(self, str)
>     SV* self;
>     SV* str;
>
> CODE:
>
>     char * buf = SvPV(str, PL_na);
>
>     if (!SvUTF8(str))
>     {
>         if (swish_is_ascii(buf))
>             SvUTF8_on(str);  /* flags original SV */
>         else
>             croak("%s is not flagged as a UTF-8 string and is not ASCII", buf);
>     }
>
> where swish_is_ascii() just makes sure there is no byte > 127. I'm
> sure there's
> a native Perl equivalent.

The crucial perlapi functions are:

* sv_utf8_upgrade -- Converts an SV's string to utf8. The SV's
UTF8 flag will end up set no matter what. There are 3
possible outcomes.
o Source SV has UTF8 flag set: no-op.
o Source SV is pure ASCII: sets UTF8 flag, but no effect on
string.
o Source SV does not have UTF8 flag set, has some bytes > 127:
Converts string to utf8 assuming source encoding of Latin1,
reallocating as necessary.

* SvPVutf8 -- like SvPV, but converts the SV to utf8 first if
necessary.

* is_utf8_string -- Tests if a char* sequence of a length you
specify is valid utf8. Use this when you don't have access to or
don't want to trust a scalar's UTF8 flag.

None of them are directly equivalent to what you're doing. However,
using those (and a few others), I believe I've gotten KS to the point
where it handles all Perl character data transparently.

All output from KS is valid UTF-8 and has the UTF8 flag set, but you
wouldn't know that. Perl transparently downgrades everything to
Latin1 when it needs to. If there's a code point > 255, you might
see a "wide character in print" warning when printing to a filehandle
which thinks it's Latin1 (as STDOUT does by default), but you'd only
see that if you were supplying KS with something other than Latin1 to
begin with.
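That transparency can be seen with nothing but core Perl. The utf8::upgrade
built-in (always available, no 'use utf8' required) does the same job as the
sv_utf8_upgrade perlapi call described above: it re-encodes the scalar's
internals from Latin-1 to UTF-8 in place, turns the UTF8 flag on, and leaves
the character data unchanged.

```perl
use strict;
use warnings;

# Core-Perl demonstration of the upgrade semantics: internal bytes change,
# UTF8 flag comes on, character-level view stays identical.
my $str = "Mot\xF6rhead";          # Latin-1 "Motorhead" with o-umlaut, flag off
my $chars_before = length $str;    # 9 characters

utf8::upgrade($str);               # in-place; no 'use utf8' needed

print utf8::is_utf8($str) ? "UTF8 flag on\n" : "UTF8 flag off\n";
print length($str) == $chars_before
    ? "character length unchanged\n"
    : "character length changed\n";
```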

Here's XS code for StringHelper::utf8ify:

void
utf8ify(sv)
    SV *sv;
PPCODE:
    sv_utf8_upgrade(sv);

> You'll notice I don't handle Latin1 or EBCDIC as utf8::upgrade() claims to.

EBCDIC isn't worth worrying about, IMO -- too few machines use it.

Handling Latin1 might be easier than you think -- you just have to be
consistent about using SvPVutf8 instead of SvPV so that all entry
points into your own string handling code convert if necessary. You
don't have to convert back -- just give Perl SVs with valid UTF-8
strings and the UTF8 flag on, and Perl will behave correctly (modulo
esoteric weirdness in pack/unpack and the regex engine).

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
Re: Analyzer API mods (was API request for KS::InvIndexer...)
> I'd be happy to try, though I can't promise to be as fast at it as
> that last
> round with you and Pudge. I'll be more likely to chip away rather
> than plunge ahead.

This is nowhere near as ambitious as what Pudge and I took on. We're
talkin' order-of-magnitude-less-ambitious.

That project had several major challenges. In addition to all the
good stuff Pudge contributed, I had to write the classes which are
now MultiLexicon, LexCache, SegLexCache, MatchFieldQuery,
MatchFieldWeight, and MatchFieldScorer, plus add a bunch of stuff to
BitVector. There was a lot of complex C code which had to be
written, tested, and debugged.

Here's what you and I need to do:

Task 1:

* Copy and paste analyze_field, analyze_text, and analyze_batch
from that previous email into Analyzer.pm, replacing the current
analyze() and analyze_text().
* Perform minor mods on existing analyze_text() methods in
LCNormalizer and PolyAnalyzer.
* Change 151-analyzer.t to use a custom subclass of Analyzer (since
analyze() is a no-op, but its replacement analyze_batch() dies an
abstract death).
* Add perfunctory tests for analyze_field to the relevant test
files.
o 150-polyanalyzer.t
o 151-analyzer.t
o 153-lc_normalizer.t
o 154-tokenizer.t
o 155-stopalizer.t
o 156-stemmer.t
* Adjust Analyzer's documentation to reflect the new regime.

Task 2:

* Change SegWriter to use analyze_field.
* Add optimized analyze_field implementations to LCNormalizer and
PolyAnalyzer.
* Add optimized analyze_field implementation to Tokenizer. This one's
harder because it requires some advanced XS.
* Test that you can mod a document's contents, using code nearly
identical to what will end up in the Swish/KS glue eventually.

Task 3:

* Expand Analyzer's docs with regard to subclassing.

Task 4:

* Copy and paste the utf8ify code into StringHelper.pm.
* Add some tests to verify that it works.
* Replace calls to utf8::upgrade with utf8ify.
* We'll skip moving the utf8 conversion from InvIndexer to the
Analyzers for now, since that has other implications.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
Re: Analyzer API mods (was API request for KS::InvIndexer...)
Marvin Humphrey scribbled on 5/4/07 5:20 PM:
>
> Here's what you and I need to do:
>
> Task 1:
>
> * Copy and paste analyze_field, analyze_text, and analyze_batch
> from that previous email into Analyzer.pm, replacing the current
> analyze() and analyze_text().
> * Perform minor mods on existing analyze_text() methods in
> LCNormalizer and PolyAnalyzer.
> * Change 151-analyzer.t to use a custom subclass of Analyzer (since
> analyze() is a no-op, but its replacement analyze_batch() dies an
> abstract death).
> * Add perfunctory tests for analyze_field to the relevant test files.
> o 150-polyanalyzer.t
> o 151-analyzer.t
> o 153-lc_normalizer.t
> o 154-tokenizer.t
> o 155-stopalizer.t
> o 156-stemmer.t
> * Adjust Analyzer's documentation to reflect the new regime.
>

The attached patch addresses all items in Task 1 but the new tests for
analyze_field(). There were considerably more files affected than the ones you
listed above, so rather than try and lump the analyze -> analyze_batch change in
with the new analyze_field feature, I'm opting to just do them in two steps.

Here's step 1. 'make test' passes for me at r2396.


--
Peter Karman . http://peknet.com/ . peter@peknet.com
-------------- next part --------------
Index: buildlib/KinoTestUtils.pm
===================================================================
--- buildlib/KinoTestUtils.pm (revision 2396)
+++ buildlib/KinoTestUtils.pm (working copy)
@@ -130,12 +130,12 @@
}


-# Verify an Analyzer's analyze, analyze_text, and analyze_raw methods.
+# Verify an Analyzer's analyze_batch, analyze_field, analyze_text, and analyze_raw methods.
sub test_analyzer {
my ( $analyzer, $source, $expected, $message ) = @_;

my $batch = KinoSearch::Analysis::TokenBatch->new( text => $source );
- $batch = $analyzer->analyze($batch);
+ $batch = $analyzer->analyze_batch($batch);
my @got;
while ( my $token = $batch->next ) {
push @got, $token->get_text;
Index: t/154-tokenizer.t
===================================================================
--- t/154-tokenizer.t (revision 2396)
+++ t/154-tokenizer.t (working copy)
@@ -12,7 +12,7 @@
is( $text, "o'malley's", "multiple apostrophes for default token_re" );

my $batch = KinoSearch::Analysis::TokenBatch->new( text => "a b c" );
-$batch = $tokenizer->analyze($batch);
+$batch = $tokenizer->analyze_batch($batch);

my ( @token_texts, @start_offsets, @end_offsets );
while ( my $token = $batch->next ) {
@@ -26,7 +26,7 @@

$tokenizer = KinoSearch::Analysis::Tokenizer->new( token_re => qr/./ );
$batch = KinoSearch::Analysis::TokenBatch->new( text => "a b c" );
-$batch = $tokenizer->analyze($batch);
+$batch = $tokenizer->analyze_batch($batch);

@token_texts = ();
@start_offsets = ();
@@ -41,7 +41,7 @@
is_deeply( \@end_offsets, [ 1 .. 5 ], "ends: custom re" );

$batch->reset;
-$batch = $tokenizer->analyze($batch);
+$batch = $tokenizer->analyze_batch($batch);
@token_texts = ();
while ( my $token = $batch->next ) {
push @token_texts, $token->get_text;
Index: t/204-doc_reader.t
===================================================================
--- t/204-doc_reader.t (revision 2396)
+++ t/204-doc_reader.t (working copy)
@@ -3,6 +3,10 @@

use Test::More tests => 5;

+package TestAnalyzer;
+use base qw( KinoSearch::Analysis::Analyzer );
+sub analyze_batch { $_[1] }
+
package MySchema::textcomp;
use base qw( KinoSearch::Schema::FieldSpec );

@@ -23,10 +27,9 @@
use base qw( KinoSearch::Schema::FieldSpec );

sub stored {0}
-
+
package MySchema;
use base qw( KinoSearch::Schema );
-use KinoSearch::Analysis::Analyzer;

our %fields = (
text => 'KinoSearch::Schema::FieldSpec',
@@ -36,7 +39,7 @@
unstored => 'MySchema::unstored',
);

-sub analyzer { KinoSearch::Analysis::Analyzer->new }
+sub analyzer { TestAnalyzer->new }

package main;
use Encode qw( _utf8_on );
Index: t/605-store_pos_boost.t
===================================================================
--- t/605-store_pos_boost.t (revision 2396)
+++ t/605-store_pos_boost.t (working copy)
@@ -6,7 +6,7 @@
use base qw( KinoSearch::Analysis::Analyzer );
use KinoSearch::Analysis::TokenBatch;

-sub analyze {
+sub analyze_batch {
my ( $self, $batch ) = @_;
my $new_batch = KinoSearch::Analysis::TokenBatch->new;

Index: t/207-seg_lexicon.t
===================================================================
--- t/207-seg_lexicon.t (revision 2396)
+++ t/207-seg_lexicon.t (working copy)
@@ -3,18 +3,20 @@

use Test::More tests => 5;

+package TestAnalyzer;
+use base qw( KinoSearch::Analysis::Analyzer );
+sub analyze_batch { $_[1] }
+
package MySchema;
use base qw( KinoSearch::Schema );

-use KinoSearch::Analysis::Analyzer;
-
our %fields = (
a => 'KinoSearch::Schema::FieldSpec',
b => 'KinoSearch::Schema::FieldSpec',
c => 'KinoSearch::Schema::FieldSpec',
);

-sub analyzer { KinoSearch::Analysis::Analyzer->new }
+sub analyzer { TestAnalyzer->new }

package main;

Index: t/151-analyzer.t
===================================================================
--- t/151-analyzer.t (revision 2396)
+++ t/151-analyzer.t (working copy)
@@ -7,8 +7,13 @@
use KinoSearch::Analysis::Analyzer;
use KinoTestUtils qw( utf8_test_strings test_analyzer );

-my $analyzer = KinoSearch::Analysis::Analyzer->new;
+package TestAnalyzer;
+use base qw( KinoSearch::Analysis::Analyzer );
+sub analyze_batch { $_[1] } # satisfy mandatory override

+package main;
+my $analyzer = TestAnalyzer->new;
+
my ( $smiley, $not_a_smiley, $frowny ) = utf8_test_strings();

my ($got) = $analyzer->analyze_raw($not_a_smiley);
Index: lib/KinoSearch/Analysis/Analyzer.pm
===================================================================
--- lib/KinoSearch/Analysis/Analyzer.pm (revision 2396)
+++ lib/KinoSearch/Analysis/Analyzer.pm (working copy)
@@ -12,16 +12,24 @@

use KinoSearch::Analysis::TokenBatch;

-# usage: $token_batch = $analyzer->analyze($token_batch);
-sub analyze { return $_[1] }
+sub analyze_field {
+ my ( $self, $doc, $field_name ) = @_;
+ my $batch = KinoSearch::Analysis::TokenBatch->new(
+ text => $doc->{$field_name},
+ );
+ return $self->analyze_batch($batch);
+}

# Kick off an analysis chain, creating a TokenBatch. Occasionally optimized
# to minimize string copies.
sub analyze_text {
my $batch = KinoSearch::Analysis::TokenBatch->new( text => $_[1] );
- return $_[0]->analyze($batch);
+ return $_[0]->analyze_batch($batch);
}

+# Must override in a subclass
+sub analyze_batch { shift->abstract_death }
+
# Convenience method which takes text as input and returns a Perl array of
# token texts.
sub analyze_raw {
@@ -47,7 +55,7 @@

package MyAnalyzer;

- sub analyze {
+ sub analyze_batch {
my ( $self, $token_batch ) = @_;

while ( my $token = $token_batch->next ) {
@@ -72,11 +80,11 @@

=head1 SUBCLASSING

-All Analyzer subclasses must provide an C<analyze> method.
+All Analyzer subclasses must provide an C<analyze_batch> method.

-=head2 analyze
+=head2 analyze_batch

-C<analyze()> takes a single L<TokenBatch|KinoSearch::Analysis::TokenBatch> as
+C<analyze_batch()> takes a single L<TokenBatch|KinoSearch::Analysis::TokenBatch> as
input, and it returns a TokenBatch, either the same one (presumably
transformed in some way), or a new one.

Index: lib/KinoSearch/Analysis/PolyAnalyzer.pm
===================================================================
--- lib/KinoSearch/Analysis/PolyAnalyzer.pm (revision 2396)
+++ lib/KinoSearch/Analysis/PolyAnalyzer.pm (working copy)
@@ -33,11 +33,11 @@
}
}

-sub analyze {
+sub analyze_batch {
my ( $self, $token_batch ) = @_;

# iterate through each of the analyzers in order
- $token_batch = $_->analyze($token_batch) for @{ $self->{analyzers} };
+ $token_batch = $_->analyze_batch($token_batch) for @{ $self->{analyzers} };

return $token_batch;
}
@@ -54,7 +54,7 @@
}
else {
my $batch = $analyzers->[0]->analyze_text( $_[1] );
- $batch = $_->analyze($batch) for @{$analyzers}[ 1 .. $#$analyzers ];
+ $batch = $_->analyze_batch($batch) for @{$analyzers}[ 1 .. $#$analyzers ];
return $batch;
}
}
Index: lib/KinoSearch/Analysis/Tokenizer.pm
===================================================================
--- lib/KinoSearch/Analysis/Tokenizer.pm (revision 2396)
+++ lib/KinoSearch/Analysis/Tokenizer.pm (working copy)
@@ -29,8 +29,8 @@
HV *self_hv;
SV *batch_or_text_sv;
ALIAS:
- analyze = 1
- analyze_text = 2
+ analyze_batch = 1
+ analyze_text = 2
CODE:
{
kino_TokenBatch *batch = NULL;
Index: lib/KinoSearch/Analysis/Stemmer.pm
===================================================================
--- lib/KinoSearch/Analysis/Stemmer.pm (revision 2396)
+++ lib/KinoSearch/Analysis/Stemmer.pm (working copy)
@@ -35,7 +35,7 @@
);
}

-sub analyze {
+sub analyze_batch {
my ( $self, $batch ) = @_;

# replace terms with stemmed versions.
Index: lib/KinoSearch/Analysis/Stopalizer.pm
===================================================================
--- lib/KinoSearch/Analysis/Stopalizer.pm (revision 2396)
+++ lib/KinoSearch/Analysis/Stopalizer.pm (working copy)
@@ -45,7 +45,7 @@
MODULE = KinoSearch PACKAGE = KinoSearch::Analysis::Stopalizer

SV*
-analyze(self_hash, batch)
+analyze_batch(self_hash, batch)
HV *self_hash;
kino_TokenBatch *batch;
CODE:
Index: lib/KinoSearch/Analysis/LCNormalizer.pm
===================================================================
--- lib/KinoSearch/Analysis/LCNormalizer.pm (revision 2396)
+++ lib/KinoSearch/Analysis/LCNormalizer.pm (working copy)
@@ -14,7 +14,7 @@
use KinoSearch::Analysis::Token;
use KinoSearch::Analysis::TokenBatch;

-sub analyze {
+sub analyze_batch {
my ( $self, $batch ) = @_;

# lc all of the terms, one by one
@@ -27,7 +27,8 @@
}

sub analyze_text {
- return KinoSearch::Analysis::TokenBatch->new( text => lc($_[1] ) );
+ my $batch = KinoSearch::Analysis::TokenBatch->new( text => lc($_[1] ) );
+ return $_[0]->analyze_batch($batch);
}

1;
Re: Analyzer API mods (was API request for KS::InvIndexer...)
On May 4, 2007, at 8:36 PM, Peter Karman wrote:

> The attached patch addresses all items in Task 1 but the new tests
> for analyze_field(). There were considerably more files affected
> than the ones you listed above,

Heh. I originally had one more bullet point in there: "Fix things
until the test suite passes." I'd forgotten quite how many test
files used a plain ol' Analyzer. Looks like 3. Might be worth
factoring that into buildlib/TestAnalyzer.pm to satisfy DRY. Or
not. It's simple stuff.

> so rather than try and lump the analyze -> analyze_batch change in
> with the new analyze_field feature, I'm opting to just do them in
> two steps.
>
> Here's step 1. 'make test' passes for me at r2396.

Looks great. I've applied it as revision 2397. Thanks!

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
Re: Analyzer API mods (was API request for KS::InvIndexer...)
Of the remaining tasks, the attached patch addresses these:

> * Add perfunctory tests for analyze_field to the relevant test files.
> o 150-polyanalyzer.t
> o 151-analyzer.t
> o 153-lc_normalizer.t
> o 154-tokenizer.t
> o 155-stopalizer.t
> o 156-stemmer.t
>
>
> * Change SegWriter to use analyze_field. [NOTE: Marvin did this]
> * Add optimized analyze_field implementations to LCNormalizer and
> PolyAnalyzer.
> * Add optimized analyze_field implementation to Tokenizer. This one's
> harder because it requires some advanced XS.
>
> * Copy and paste the utf8ify code into StringHelper.pm.
> * Add some tests to verify that it works.
> * Replace calls to utf8::upgrade with utf8ify.
> * We'll skip moving the utf8 conversion from InvIndexer to the
> Analyzers for now, since that has other implications.
>

I'm not convinced that the approach I took in Tokenizer's XS was optimal.
But it passes all tests.


Still TODO:

> * Test that you can mod a document's contents, using code nearly
> identical to what will end up in the Swish/KS glue eventually.

> * Expand Analyzer's docs with regard to subclassing.

I expect I'll be able to do both of those once I think a little more about what
I want Swish to do.

--
Peter Karman . http://peknet.com/ . peter@peknet.com
-------------- next part --------------
Index: buildlib/KinoTestUtils.pm
===================================================================
--- buildlib/KinoTestUtils.pm (revision 2433)
+++ buildlib/KinoTestUtils.pm (working copy)
@@ -150,6 +150,13 @@

@got = $analyzer->analyze_raw($source);
Test::More::is_deeply( \@got, $expected, "analyze_raw: $message" );
+
+ $batch = $analyzer->analyze_field({content => $source}, 'content');
+ @got = ();
+ while ( my $token = $batch->next ) {
+ push @got, $token->get_text;
+ }
+ Test::More::is_deeply( \@got, $expected, "analyze_field: $message" );
}

1;
Index: t/508-hits.t
===================================================================
--- t/508-hits.t (revision 2433)
+++ t/508-hits.t (working copy)
@@ -8,7 +8,7 @@
use KinoSearch::Searcher;
use KinoTestUtils qw( create_invindex );

-my @docs = ( 'a b', 'a a b', 'a a a b', 'x' );
+my @docs = ( 'a b', 'a a b', 'a a a b', 'x' );
my $invindex = create_invindex(@docs);

my $searcher = KinoSearch::Searcher->new( invindex => $invindex, );
Index: t/154-tokenizer.t
===================================================================
--- t/154-tokenizer.t (revision 2433)
+++ t/154-tokenizer.t (working copy)
@@ -1,7 +1,7 @@
use strict;
use warnings;

-use Test::More tests => 8;
+use Test::More tests => 9;

use KinoSearch::Analysis::Tokenizer;
use KinoSearch::Analysis::TokenBatch;
@@ -51,3 +51,13 @@
[ 'a', ' ', 'b', ' ', 'c' ],
"no freakout when fed multiple tokens"
);
+
+$batch->reset;
+$tokenizer = KinoSearch::Analysis::Tokenizer->new();
+$batch
+ = $tokenizer->analyze_field( { monroe => 'some like it hot' }, 'monroe' );
+@token_texts = ();
+while ( my $token = $batch->next ) {
+ push @token_texts, $token->get_text;
+}
+is_deeply( \@token_texts, [ 'some', 'like', 'it', 'hot' ], "analyze_field" );
Index: t/601-queryparser.t
===================================================================
--- t/601-queryparser.t (revision 2433)
+++ t/601-queryparser.t (working copy)
@@ -39,7 +39,7 @@
sub analyzer { KinoSearch::Analysis::Tokenizer->new }

package main;
-use Test::More tests => 210;
+use Test::More tests => 212;

use KinoSearch::QueryParser::QueryParser;
use KinoSearch::Analysis::PolyAnalyzer;
@@ -47,7 +47,7 @@
use KinoSearch::InvIndexer;
use KinoSearch::Searcher;
use KinoSearch::Store::RAMFolder;
-use KinoSearch::Util::StringHelper qw( utf8_flag_on );
+use KinoSearch::Util::StringHelper qw( utf8_flag_on utf8ify );

use KinoTestUtils qw( create_invindex );

@@ -197,6 +197,16 @@
$hits = $searcher->search( query => $motorhead );
is( $hits->total_hits, 1, "QueryParser parses UTF-8 strings correctly" );

+$motorhead = "Mot\xF6rhead";
+utf8ify($motorhead);
+$unicode_invindex = create_invindex($motorhead);
+$searcher = KinoSearch::Searcher->new( invindex => $unicode_invindex, );
+
+$hits = $searcher->search( query => 'Mot' );
+is( $hits->total_hits, 0, "Pre-test - indexing worked properly" );
+$hits = $searcher->search( query => $motorhead );
+is( $hits->total_hits, 1, "QueryParser utf8ifys UTF-8 strings correctly" );
+
my $mf_folder = KinoSearch::Store::RAMFolder->new;
my $mf_schema = MultiFieldSchema->new;
my $mf_invindex = KinoSearch::InvIndex->create(
Index: t/519-and_or_scorer.t
===================================================================
--- t/519-and_or_scorer.t (revision 2433)
+++ t/519-and_or_scorer.t (working copy)
@@ -123,7 +123,7 @@
my $score_docs = $hc->get_hit_queue->score_docs;
my @by_score_then_num = map { $_->get_doc_num }
sort {
- $b->get_score <=> $a->get_score
+ $b->get_score <=> $a->get_score
|| $a->get_doc_num <=> $b->get_doc_num
} @$score_docs;
my @by_num = sort { $a <=> $b } @by_score_then_num;
Index: t/155-stopalizer.t
===================================================================
--- t/155-stopalizer.t (revision 2433)
+++ t/155-stopalizer.t (working copy)
@@ -2,7 +2,7 @@
use warnings;
use lib 'buildlib';

-use Test::More tests => 6;
+use Test::More tests => 8;
use KinoTestUtils qw( test_analyzer );

use KinoSearch::Analysis::Stopalizer;
Index: t/016-varray.t
===================================================================
--- t/016-varray.t (revision 2433)
+++ t/016-varray.t (working copy)
@@ -10,7 +10,7 @@
my ( $varray, @orig, @got );

$varray = KinoSearch::Util::VArray->new( capacity => 0 );
-@orig = 1 .. 10;
+@orig = 1 .. 10;

$varray->push( KinoSearch::Util::ByteBuf->new($_) ) for @orig;
is( $varray->get_size, 10, "get_size after pushing 10 elements" );
Index: t/215-term_vectors.t
===================================================================
--- t/215-term_vectors.t (revision 2433)
+++ t/215-term_vectors.t (working copy)
@@ -47,12 +47,12 @@
$invindexer->finish;

my $searcher = KinoSearch::Searcher->new( invindex => $invindex );
-my $doc_vec = $searcher->fetch_doc_vec(0);
+my $doc_vec = $searcher->fetch_doc_vec(0);

my $term_vector = $doc_vec->term_vector( "content", "foo" );
ok( defined $term_vector, "successfully retrieved term vector" );

-$doc_vec = $searcher->fetch_doc_vec(1);
+$doc_vec = $searcher->fetch_doc_vec(1);
$term_vector = $doc_vec->term_vector( 'content', 'mañana' );

ok( defined $term_vector, "utf-8 term vector retrieved" );
Index: t/151-analyzer.t
===================================================================
--- t/151-analyzer.t (revision 2433)
+++ t/151-analyzer.t (working copy)
@@ -2,7 +2,7 @@
use warnings;
use lib 'buildlib';

-use Test::More tests => 5;
+use Test::More tests => 6;

use KinoSearch::Analysis::Analyzer;
use KinoTestUtils qw( utf8_test_strings test_analyzer );
Index: t/505-hit_queue.t
===================================================================
--- t/505-hit_queue.t (revision 2433)
+++ t/505-hit_queue.t (working copy)
@@ -26,7 +26,8 @@
} @docs_and_scores;

my @correct_order = sort {
- $b->get_score <=> $a->get_score or $a->get_doc_num <=> $b->get_doc_num
+ $b->get_score <=> $a->get_score
+ or $a->get_doc_num <=> $b->get_doc_num
} @score_docs;
my @correct_docs = map { $_->get_doc_num } @correct_order;
my @correct_scores = map { $_->get_score } @correct_order;
Index: t/153-lc_normalizer.t
===================================================================
--- t/153-lc_normalizer.t (revision 2433)
+++ t/153-lc_normalizer.t (working copy)
@@ -2,7 +2,7 @@
use warnings;
use lib 'buildlib';

-use Test::More tests => 3;
+use Test::More tests => 4;
use KinoTestUtils qw( test_analyzer );

use KinoSearch::Analysis::LCNormalizer;
Index: t/518-or_scorer.t
===================================================================
--- t/518-or_scorer.t (revision 2433)
+++ t/518-or_scorer.t (working copy)
@@ -83,7 +83,7 @@
perform_search( [ 'a' .. $_ ] ) for 'a' .. 'z';

sub perform_search {
- my $letters = shift;
+ my $letters = shift;
my $letter_string = join ' ', @$letters;

my $subscorers
@@ -125,8 +125,8 @@
my @doc_nums = keys %{ $letters{$letter} };
$counts{$_} += 1 for @doc_nums;
}
- my @by_count_then_num =
- sort { $counts{$b} <=> $counts{$a} || $a <=> $b }
+ my @by_count_then_num
+ = sort { $counts{$b} <=> $counts{$a} || $a <=> $b }
keys %counts;

my @by_num = sort { $a <=> $b } @by_count_then_num;
@@ -139,7 +139,7 @@
my $score_docs = $hc->get_hit_queue->score_docs;
my @by_score_then_num = map { $_->get_doc_num }
sort {
- $b->get_score <=> $a->get_score
+ $b->get_score <=> $a->get_score
|| $a->get_doc_num <=> $b->get_doc_num
} @$score_docs;
my @by_num = sort { $a <=> $b } @by_score_then_num;
Index: t/012-priority_queue.t
===================================================================
--- t/012-priority_queue.t (revision 2433)
+++ t/012-priority_queue.t (working copy)
@@ -30,7 +30,7 @@
);

1 while defined $pq->pop; # empty queue;
-$pq = KinoSearch::Util::PriorityQueue->new( max_size => 5 );
+$pq = KinoSearch::Util::PriorityQueue->new( max_size => 5 );
@prioritized = ();

$pq->insert($_) for ( 1 .. 10, -3, 1590 .. 1600, 5 );
@@ -50,4 +50,3 @@
$pq->insert( splice( @nums, $tick, 1 ) );
}
is_deeply( $pq->pop_all, [ reverse 1 .. 100 ], "random order insertion" );
-
Index: t/156-stemmer.t
===================================================================
--- t/156-stemmer.t (revision 2433)
+++ t/156-stemmer.t (working copy)
@@ -2,7 +2,7 @@
use warnings;
use lib 'buildlib';

-use Test::More tests => 6;
+use Test::More tests => 8;
use KinoTestUtils qw( test_analyzer );

use KinoSearch::Analysis::Stemmer;
Index: t/514-and_scorer.t
===================================================================
--- t/514-and_scorer.t (revision 2433)
+++ t/514-and_scorer.t (working copy)
@@ -19,7 +19,7 @@
push @docs, ('c d x');
my $invindex = create_invindex(@docs);

-my $searcher = KinoSearch::Searcher->new( invindex => $invindex, );
+my $searcher = KinoSearch::Searcher->new( invindex => $invindex, );
my $similarity = KinoSearch::Search::Similarity->new;

my $c_query = KinoSearch::Search::TermQuery->new(
@@ -96,4 +96,3 @@
}
return \@doc_nums;
}
-
Index: t/150-polyanalyzer.t
===================================================================
--- t/150-polyanalyzer.t (revision 2433)
+++ t/150-polyanalyzer.t (working copy)
@@ -2,7 +2,7 @@
use warnings;
use lib 'buildlib';

-use Test::More tests => 15;
+use Test::More tests => 20;

use KinoTestUtils qw( test_analyzer );

Index: t/013-bit_vector.t
===================================================================
--- t/013-bit_vector.t (revision 2433)
+++ t/013-bit_vector.t (working copy)
@@ -60,7 +60,7 @@
}
}

-my @set_1 = ( 1 .. 3, 10, 20, 30 );
+my @set_1 = ( 1 .. 3, 10, 20, 30 );
my @set_2 = ( 2 .. 10, 25 .. 35 );

$bit_vec = KinoSearch::Util::BitVector->new;
Index: lib/KinoSearch/QueryParser/QueryParser.pm
===================================================================
--- lib/KinoSearch/QueryParser/QueryParser.pm (revision 2433)
+++ lib/KinoSearch/QueryParser/QueryParser.pm (working copy)
@@ -3,6 +3,7 @@

package KinoSearch::QueryParser::QueryParser;
use KinoSearch::Util::ToolSet;
+use KinoSearch::Util::StringHelper qw( utf8ify );
use base qw( KinoSearch::Util::Class );

our %instance_vars = (
@@ -100,7 +101,7 @@
sub parse {
my ( $self, $qstring_orig ) = @_;
$qstring_orig = '' unless defined $qstring_orig;
- utf8::upgrade($qstring_orig);
+ utf8ify($qstring_orig);
my $default_fields = $self->{fields};
my $default_boolop = $self->{default_boolop};
my @clauses;
Index: lib/KinoSearch/Analysis/PolyAnalyzer.pm
===================================================================
--- lib/KinoSearch/Analysis/PolyAnalyzer.pm (revision 2433)
+++ lib/KinoSearch/Analysis/PolyAnalyzer.pm (working copy)
@@ -18,7 +18,7 @@
use KinoSearch::Analysis::Stemmer;

sub init_instance {
- my $self = shift;
+ my $self = shift;
my $language = $self->{language} = lc( $self->{language} );

# create a default set of analyzers if language was specified
@@ -61,6 +61,24 @@
}
}

+sub analyze_field {
+ my $analyzers = $_[0]->{analyzers};
+
+ if ( !@$analyzers ) {
+ return KinoSearch::Analysis::TokenBatch->new(
+ text => $_[1]->{ $_[2] } );
+ }
+ elsif ( @$analyzers == 1 ) {
+ return $analyzers->[0]->analyze_field( $_[1], $_[2] );
+ }
+ else {
+ my $batch = $analyzers->[0]->analyze_field( $_[1], $_[2] );
+ $batch = $_->analyze_batch($batch)
+ for @{$analyzers}[ 1 .. $#$analyzers ];
+ return $batch;
+ }
+}
+
1;

__END__
Index: lib/KinoSearch/Analysis/Tokenizer.pm
===================================================================
--- lib/KinoSearch/Analysis/Tokenizer.pm (revision 2433)
+++ lib/KinoSearch/Analysis/Tokenizer.pm (working copy)
@@ -25,12 +25,13 @@
MODULE = KinoSearch PACKAGE = KinoSearch::Analysis::Tokenizer

kino_TokenBatch*
-_do_analyze(self_hv, batch_or_text_sv)
+_do_analyze(self_hv, batch_or_text_sv, ...)
HV *self_hv;
SV *batch_or_text_sv;
ALIAS:
analyze_batch = 1
analyze_text = 2
+ analyze_field = 3
CODE:
{
kino_TokenBatch *batch = NULL;
@@ -40,11 +41,30 @@
chy_u32_t num_code_points = 0;
SV *wrapper = sv_newmortal();
RETVAL = kino_TokenBatch_new(NULL);
-
+ char *string = NULL;
+ STRLEN string_len = 0;
+
if (ix == 1) {
EXTRACT_STRUCT( batch_or_text_sv, batch, kino_TokenBatch*,
"KinoSearch::Analysis::TokenBatch");
}
+ if (ix == 2) {
+ string = SvPVutf8( ST(1), string_len );
+ }
+ if (ix == 3) {
+ if (items != 3)
+ CONFESS("analyze_field() takes 2 arguments, got %d", items - 1);
+ if (!SvROK(batch_or_text_sv))
+ CONFESS("first argument to analyze_field() must be hash ref");
+
+ STRLEN len;
+ char *field_name = SvPV(ST(2), len);
+ string = SvPVutf8(extract_sv(
+ (HV*)SvRV(batch_or_text_sv),
+ field_name,
+ len),
+ string_len);
+ }

/* extract regexp struct from qr// entity */
if (SvROK(token_re)) {
@@ -63,7 +83,6 @@
SvUTF8_on(wrapper);

while (1) {
- STRLEN len;
char *string_beg;
char *string_end;
char *string_arg;
@@ -72,20 +91,20 @@
kino_Token *token = Kino_TokenBatch_Next(batch);
if (token == NULL)
break;
- len = token->len;
+ string_len = token->len;
string_beg = token->text;
- string_end = string_beg + len;
+ string_end = string_beg + string_len;
string_arg = string_beg;
}
else {
- string_beg = SvPVutf8( ST(1), len );
- string_end = string_beg + len;
+ string_beg = string;
+ string_end = string_beg + string_len;
string_arg = string_beg;
}

/* wrap the string in an SV to please the regex engine */
SvPVX(wrapper) = string_beg;
- SvCUR_set(wrapper, len);
+ SvCUR_set(wrapper, string_len);
SvPOK_on(wrapper);

while (
@@ -128,7 +147,7 @@
REFCOUNT_DEC(new_token);
}

- if (ix == 2) /* analyze_text only runs one loop iter */
+ if (ix > 1) /* analyze_text and analyze_field only run one loop iter */
break;
}
}
Index: lib/KinoSearch/Analysis/Stopalizer.pm
===================================================================
--- lib/KinoSearch/Analysis/Stopalizer.pm (revision 2433)
+++ lib/KinoSearch/Analysis/Stopalizer.pm (working copy)
@@ -16,7 +16,7 @@
use Lingua::StopWords;

sub init_instance {
- my $self = shift;
+ my $self = shift;
my $language = $self->{language} = lc( $self->{language} );

# verify a supplied stoplist
@@ -139,4 +139,3 @@
See L<KinoSearch> version 0.20.

=cut
-
Index: lib/KinoSearch/Analysis/LCNormalizer.pm
===================================================================
--- lib/KinoSearch/Analysis/LCNormalizer.pm (revision 2433)
+++ lib/KinoSearch/Analysis/LCNormalizer.pm (working copy)
@@ -31,6 +31,12 @@
return $_[0]->analyze_batch($batch);
}

+sub analyze_field {
+ my $batch = KinoSearch::Analysis::TokenBatch->new(
+ text => lc( $_[1]->{ $_[2] } ) );
+ return $_[0]->analyze_batch($batch);
+}
+
1;

__END__
Index: lib/KinoSearch/Index/SegWriter.pm
===================================================================
--- lib/KinoSearch/Index/SegWriter.pm (revision 2433)
+++ lib/KinoSearch/Index/SegWriter.pm (working copy)
@@ -3,6 +3,7 @@

package KinoSearch::Index::SegWriter;
use KinoSearch::Util::ToolSet;
+use KinoSearch::Util::StringHelper qw( utf8ify );
use base qw( KinoSearch::Util::Class );

our %instance_vars = (
@@ -91,7 +92,7 @@

# upgrade fields that aren't binary to utf8
if ( !$field_spec->binary ) {
- utf8::upgrade( $doc->{$field_name} );
+ utf8ify( $doc->{$field_name} );
}

next unless $field_spec->indexed;
Index: lib/KinoSearch/Util/StringHelper.pm
===================================================================
--- lib/KinoSearch/Util/StringHelper.pm (revision 2433)
+++ lib/KinoSearch/Util/StringHelper.pm (working copy)
@@ -9,6 +9,7 @@
utf8_flag_off
to_base36
from_base36
+ utf8ify
);

1;
@@ -62,6 +63,19 @@
RETVAL = strtol(str, NULL, 36);
OUTPUT: RETVAL

+=for comment
+
+Upgrade an SV to UTF8, converting Latin1 if necessary. Equivalent to utf8::upgrade().
+
+=cut
+
+void
+utf8ify(sv)
+ SV *sv;
+PPCODE:
+ sv_utf8_upgrade(sv);
+
+
__POD__

=begin devdocs
Re: Analyzer API mods (was API request for KS::InvIndexer...) [ In reply to ]
On May 14, 2007, at 6:35 AM, Peter Karman wrote:

> Of the remaining tasks, the attached patch addresses these:

---->8 SNIP 8<----

Excellent, thanks! I applied the meat of this as r2439.

Some parts of the patch were whitespace only; they alerted me to the
fact that my local copies of Perl::Tidy needed to be updated. I then
tidied all of KS; that commit, r2438, included the cosmetic
portions of your patch.

There was a spot I chose to modify in 601-query_parser.t. These two
frags are nearly equivalent:

my $motorhead = "Mot\xC3\xB6rhead";
utf8_flag_on($motorhead);

$motorhead = "Mot\xF6rhead";
utf8ify($motorhead);

The only difference is that utf8ify introduces magical cached utf8
length information:

SV = PV(0x1935aa8) at 0x18e9908
REFCNT = 1
FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
PV = 0x1594460 "Mot\303\266rhead"\0 [UTF8 "Mot\x{f6}rhead"]
CUR = 10
LEN = 11

SV = PVMG(0x194d380) at 0x18e9908
REFCNT = 1
FLAGS = (PADBUSY,PADMY,SMG,POK,pPOK,UTF8)
IV = 0
NV = 0
PV = 0x1594fb0 "Mot\303\266rhead"\0 [UTF8 "Mot\x{f6}rhead"]
CUR = 10
LEN = 11
MAGIC = 0x15953d0
MG_VIRTUAL = &PL_vtbl_utf8
MG_TYPE = PERL_MAGIC_utf8(w)
MG_LEN = 9

That meant that the new test was testing the same information, just
prepared a different way. What I did was mod the old test to use
your way of prepping the string.
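
For anyone who wants to poke at the difference themselves, here's a
rough sketch. It uses Encode's _utf8_on() and the core utf8::upgrade()
as stand-ins for KS's utf8_flag_on() and utf8ify() (the latter just
wraps sv_utf8_upgrade()), so treat it as an approximation rather than
the exact test code:

```perl
use strict;
use warnings;
use Devel::Peek qw( Dump );
use Encode qw( _utf8_on );

# Variant 1: hand-encoded UTF-8 bytes, then flip the SVf_UTF8 flag.
my $motorhead = "Mot\xC3\xB6rhead";
_utf8_on($motorhead);           # stand-in for utf8_flag_on()

# Variant 2: Latin-1 source string, upgraded in place.
my $motorhead2 = "Mot\xF6rhead";
utf8::upgrade($motorhead2);     # what utf8ify() does via sv_utf8_upgrade()

# Same characters either way.
print $motorhead eq $motorhead2 ? "equal\n" : "NOT equal\n";

# On the perl used in the dumps above, only the second SV carries the
# PERL_MAGIC_utf8 length-cache magic; compare the FLAGS and MAGIC lines.
Dump($motorhead);
Dump($motorhead2);
```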

> I'm not convinced that the approach I took in Tokenizer's XS was
> most optimal. But it passes all tests.

I don't think it's possible to improve upon what you chose to do.
Nice work!

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/