Mailing List Archive

Analyzer API mods (was API request for KS::InvIndexer...)
On May 3, 2007, at 8:01 PM, Peter Karman wrote:

> Sounds like you are suggesting an API change, but to the Analyzer
> class instead, giving it more power to affect individual fields.
> Sounds fine to me, especially given some of the API specifics you
> mention below.

Excellent. I think Analyzer needs three public analyze_xxxxx
methods: analyze_field, analyze_text, and analyze_batch. They will
take different arguments, but each will return a TokenBatch.

sub analyze_field {
    my ( $self, $doc, $field_name ) = @_;
    my $batch = KinoSearch::Analysis::TokenBatch->new(
        text => $doc->{$field_name},
    );
    return $self->analyze_batch($batch);
}

sub analyze_text {
    my $batch = KinoSearch::Analysis::TokenBatch->new( text => $_[1] );
    return $_[0]->analyze_batch($batch);
}

sub analyze_batch { shift->abstract_death }

analyze_batch() should take the place of the current analyze(). All
subclasses will have to implement it.
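The subclass contract is simple: take a TokenBatch, return a TokenBatch
(either the same one transformed, or a new one). Here's a sketch of the
shape of an override -- MyLCAnalyzer and the array-of-hashes "batch" are
stand-ins, not real KinoSearch classes:

```perl
use strict;
use warnings;

# Sketch of the mandatory analyze_batch() override. The plain array of
# hashes stands in for a real TokenBatch -- just enough to show the
# batch-in, batch-out contract.
package MyLCAnalyzer;

sub new { bless {}, shift }

sub analyze_batch {
    my ( $self, $batch ) = @_;
    # Transform each token in place, then hand the batch back.
    $_->{text} = lc $_->{text} for @$batch;
    return $batch;
}

package main;

my $analyzer = MyLCAnalyzer->new;
my $batch    = [ { text => 'FOO' }, { text => 'Bar' } ];
$batch = $analyzer->analyze_batch($batch);
print join( ' ', map { $_->{text} } @$batch ), "\n";    # foo bar
```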

The only change from the current implementation of analyze_text() is
that calls to analyze() will need to be swapped out for calls to
analyze_batch(). Then it needs public docs.

SegWriter should be adapted to use analyze_field instead of
analyze_text as it does now.

Note: analyze_text() is not just a convenience method; it also
allows a small optimization, avoiding a string copy or two when
a subclass overrides it. Instead of copying the text into a
TokenBatch then processing the copy and creating a second TokenBatch,
we start with the original, process, and create a TokenBatch. That
saves 3 string alloc/copy ops in the case of LCNormalizer and 1 in
the case of Tokenizer. (LCNormalizer has more due to crossing the
Perl/C boundary in Token->get_text and Token->set_text.)

In order to make things work for you, I think we need to add
TokenBatch->eat.

$token_batch->eat( $other, $additional_pos_inc );

$additional_pos_inc would be added to the pos_inc of the last Token
in the cannibalistic batch. From perl-space we can have it default
to 0 if only one arg is supplied; from C it will be required, of
course. By setting it to 1, you'll be able to interrupt phrase
matching as requested.

sub analyze_field {
    my ( $self, $doc, $field_name ) = @_;
    my $token_batch = KinoSearch::Analysis::TokenBatch->new;
    my @frags = $self->{parser}->parse( $doc->{$field_name} );
    for my $frag (@frags) {
        my $sub_batch = $self->{tokenizer}->analyze_text($frag);
        $token_batch->eat( $sub_batch, 1 );
    }
    return $token_batch;
}
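The effect of that extra position increment can be modeled in a few lines
of plain Perl (a toy model, not real TokenBatch internals): a token's
absolute position is the running sum of pos_inc values, and a phrase query
matches only consecutive positions, so bumping the increment at a fragment
boundary leaves a gap no phrase can span.

```perl
use strict;
use warnings;

# Toy model of position increments -- not real KinoSearch internals.
# The first token of fragment 2 gets pos_inc 2 because eat( $sub_batch, 1 )
# adds 1 to the normal increment at the fragment boundary.
my @tokens = (
    { text => 'like',  pos_inc => 1 },    # fragment 1 ...
    { text => 'it',    pos_inc => 1 },
    { text => 'hot',   pos_inc => 1 },
    { text => 'stove', pos_inc => 2 },    # fragment 2: 1 + additional_pos_inc
);

my $pos = 0;
for my $t (@tokens) {
    $pos += $t->{pos_inc};
    $t->{pos} = $pos;
}

# A phrase query for "hot stove" needs adjacent positions (N, N+1);
# here "hot" lands at 3 and "stove" at 5, so the phrase cannot match.
my %pos_of = map { $_->{text} => $_->{pos} } @tokens;
print $pos_of{stove} - $pos_of{hot} == 1
    ? "phrase matches\n"
    : "phrase blocked\n";    # phrase blocked
```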


>> * The utf8::upgrade calls performed by InvIndexer, which
>> can probably be moved to individual analyzers.
>
> agreed. perhaps with a syntactically sweet wrapper in the base
> Analyzer class?
> So analyzer methods that care could call:
>
> $self->utf8ify( $field_value );

That's not a bad idea. utf8::upgrade is a funny, non-perlish
function. It modifies its argument in place. (So would utf8ify.)
Also, it's always available: you don't have to 'use utf8' in order to
get it -- and indeed you shouldn't, unless you really want your
source code interpreted as utf8.

Probably what we should do is implement our own replacement in XS.
I'm not sure it ought to be a method in Analyzer, though. It might
be better as a function in KinoSearch::Util::StringHelper (which
would get a public API). That way other classes can use it:
QueryParser, etc.

>> The other possibility is to add a tutorial under KinoSearch::Docs,
>> or even publish such a tutorial on a WikiToBeNamedLater, reserving
>> Analyzer's POD for concise API documentation. I lean towards
>> stuffing everything into Analyzer, though.
>
> docs_in_analyzer++

OK, cool.

Would you like to work collaboratively on this stuff, the way Pudge
and I did on the Filter classes? I can take care of everything, but
A) there's other work that has to be done that only I can do, B) the
code will come out better if we ensure that at least two people grok
it, and C) this is a point at which KS and Swish3 meet and the
handshaking will probably be cleaner if you develop a deep
understanding of how the KS side works.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
Re: Analyzer API mods (was API request for KS::InvIndexer...)
Marvin Humphrey scribbled on 5/4/07 12:58 PM:

> I think Analyzer needs three public analyze_xxxxx methods:
> analyze_field, analyze_text, and analyze_batch. They will take
> different arguments, but each will return a TokenBatch.

sounds sane.

> analyze_batch() should take the place of the current analyze(). All
> subclasses will have to implement it.
>
> The only change from the current implementation of analyze_text() is
> that calls to analyze() will need to be swapped out for calls to
> analyze_batch(). Then it needs public docs.
>
> SegWriter should be adapted to use analyze_field instead of analyze_text
> as it does now.
>
> Note: analyze_text() is not just a convenience method; it also allows a
> small optimization, avoiding a string copy or two when a subclass
> overrides it. Instead of copying the text into a TokenBatch then
> processing the copy and creating a second TokenBatch, we start with the
> original, process, and create a TokenBatch. That saves 3 string
> alloc/copy ops in the case of LCNormalizer and 1 in the case of
> Tokenizer. (LCNormalizer has more due to crossing the Perl/C boundary
> in Token->get_text and Token->set_text.)
>

glad you're thinking of these things.


> In order to make things work for you, I think we need to add
> TokenBatch->eat.
>
> $token_batch->eat( $other, $additional_pos_inc );
>
> $additional_pos_inc would be added to the pos_inc of the last Token in
> the cannibalistic batch. From perl-space we can have it default to 0 if
> only one arg is supplied; from C it will be required, of course. By
> setting it to 1, you'll be able to interrupt phrase matching as requested.
>
> sub analyze_field {
>     my ( $self, $doc, $field_name ) = @_;
>     my $token_batch = KinoSearch::Analysis::TokenBatch->new;
>     my @frags = $self->{parser}->parse( $doc->{$field_name} );
>     for my $frag (@frags) {
>         my $sub_batch = $self->{tokenizer}->analyze_text($frag);
>         $token_batch->eat( $sub_batch, 1 );
>     }
>     return $token_batch;
> }
>

nice.


>
>>> * The utf8::upgrade calls performed by InvIndexer, which
>>> can probably be moved to individual analyzers.
>>
>> agreed. perhaps with a syntactically sweet wrapper in the base
>> Analyzer class?
>> So analyzer methods that care could call:
>>
>> $self->utf8ify( $field_value );
>
> That's not a bad idea. utf8::upgrade is a funny, non-perlish function.
> It modifies its argument in place. (So would utf8ify.) Also, it's
> always available: you don't have to 'use utf8' in order to get it -- and
> indeed you shouldn't, unless you really want your source code
> interpreted as utf8.
>
> Probably what we should do is implement our own replacement in XS. I'm
> not sure it ought to be a method in Analyzer, though. It might be
> better as a function in KinoSearch::Util::StringHelper (which would get
> a public API). That way other classes can use it: QueryParser, etc.
>

good idea to make it a Util method.

In Swish XS, I do something like this (not real code):

swish_utf8ify(self, str)
    SV* self;
    SV* str;

CODE:

    char * buf = SvPV(str, PL_na);

    if (!SvUTF8(str))
    {
        if (swish_is_ascii(buf))
            SvUTF8_on(str);  /* flags original SV */
        else
            croak("%s is not flagged as a UTF-8 string and is not ASCII", buf);
    }

where swish_is_ascii() just makes sure there is no byte > 127. I'm sure there's
a native Perl equivalent.
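A plausible pure-Perl counterpart is a one-line regex check for any byte
above 127 (is_ascii here is a hypothetical name, not part of KS or Swish):

```perl
use strict;
use warnings;

# Hypothetical pure-Perl equivalent of swish_is_ascii(): true when the
# string contains no byte greater than 127.
sub is_ascii { $_[0] !~ /[^\x00-\x7F]/ }

print is_ascii('plain text')   ? "ascii\n" : "not ascii\n";    # ascii
print is_ascii("Mot\xF6rhead") ? "ascii\n" : "not ascii\n";    # not ascii
```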

You'll notice I don't handle Latin1 or EBCDIC as utf8::upgrade() claims to.
The utf8 pod recommends Encode, and that's what Search::Tools::Transliterate
uses for its to_utf8() method (all in perl space).

That's not as friendly for the (common western) case of full Latin1, but I
figure it's a Good Thing to force users to be aware of their source encodings,
rather than quietly converting them to UTF-8. I know that not everyone agrees with
me on that.


>>> The other possibility is to add a tutorial under KinoSearch::Docs, or
>>> even publish such a tutorial on a WikiToBeNamedLater, reserving
>>> Analyzer's POD for concise API documentation. I lean towards
>>> stuffing everything into Analyzer, though.
>>
>> docs_in_analyzer++
>
> OK, cool.
>
> Would you like to work collaboratively on this stuff, the way Pudge and
> I did on the Filter classes? I can take care of everything, but A)
> there's other work that has to be done that only I can do, B) the code
> will come out better if we ensure that at least two people grok it, and
> C) this is a point at which KS and Swish3 meet and the handshaking will
> probably be cleaner if you develop a deep understanding of how the KS
> side works.

I'd be happy to try, though I can't promise to be as fast at it as that last
round with you and Pudge. I'll be more likely to chip away rather than plunge ahead.

I think your example code above is a great roadmap though and I'll get at it as
I can.

--
Peter Karman . http://peknet.com/ . peter@peknet.com
Re: Analyzer API mods (was API request for KS::InvIndexer...)
On May 4, 2007, at 11:52 AM, Peter Karman wrote:
> In Swish XS, I do something like this (not real code):
>
> swish_utf8ify(self, str)
>     SV* self;
>     SV* str;
>
> CODE:
>
>     char * buf = SvPV(str, PL_na);
>
>     if (!SvUTF8(str))
>     {
>         if (swish_is_ascii(buf))
>             SvUTF8_on(str);  /* flags original SV */
>         else
>             croak("%s is not flagged as a UTF-8 string and is not ASCII", buf);
>     }
>
> where swish_is_ascii() just makes sure there is no byte > 127. I'm
> sure there's
> a native Perl equivalent.

The crucial perlapi functions are:

* sv_utf8_upgrade -- Converts an SV's string to utf8. The SV's
UTF8 flag will end up set no matter what. There are 3
possible outcomes.
o Source SV has UTF8 flag set: no-op.
o Source SV is pure ASCII: sets UTF8 flag, but no effect on
string.
o Source SV does not have UTF8 flag set, has some bytes > 127:
Converts string to utf8 assuming source encoding of Latin1,
reallocating as necessary.

* SvPVutf8 -- like SvPV, but converts the SV to utf8 first if
necessary.

* is_utf8_string -- Tests if a char* sequence of a length you
specify is valid utf8. Use this when you don't have access to or
don't want to trust a scalar's UTF8 flag.

None of them are directly equivalent to what you're doing. However,
using those (and a few others), I believe I've gotten KS to the point
where it handles all Perl character data transparently.

All output from KS is valid UTF-8 and has the UTF8 flag set, but you
wouldn't know that. Perl transparently downgrades everything to
Latin1 when it needs to. If there's a code point > 255, you might
see a "wide character in print" warning when printing to a filehandle
which thinks it's Latin1 (as STDOUT does by default), but you'd only
see that if you were supplying KS with something other than Latin1 to
begin with.
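That transparency can be seen with nothing but core Perl. The utf8::upgrade
built-in (always available, no 'use utf8' required) does the same job as the
sv_utf8_upgrade perlapi call described above: it re-encodes the scalar's
internals from Latin-1 to UTF-8 in place, turns the UTF8 flag on, and leaves
the character data unchanged.

```perl
use strict;
use warnings;

# Core-Perl demonstration of the upgrade semantics: internal bytes change,
# UTF8 flag comes on, character-level view stays identical.
my $str = "Mot\xF6rhead";          # Latin-1 "Motorhead" with o-umlaut, flag off
my $chars_before = length $str;    # 9 characters

utf8::upgrade($str);               # in-place; no 'use utf8' needed

print utf8::is_utf8($str) ? "UTF8 flag on\n" : "UTF8 flag off\n";
print length($str) == $chars_before
    ? "character length unchanged\n"
    : "character length changed\n";
```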

Here's XS code for StringHelper::utf8ify:

void
utf8ify(sv)
    SV *sv;
PPCODE:
    sv_utf8_upgrade(sv);

> You'll notice I don't handle Latin1 or EBCDIC as utf8::upgrade() claims to.

EBCDIC isn't worth worrying about, IMO -- too few machines use it.

Handling Latin1 might be easier than you think -- you just have to be
consistent about using SvPVutf8 instead of SvPV so that all entry
points into your own string handling code convert if necessary. You
don't have to convert back -- just give Perl SVs with valid UTF-8
strings and the UTF8 flag on, and Perl will behave correctly (modulo
esoteric weirdness in pack/unpack and the regex engine).

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
Re: Analyzer API mods (was API request for KS::InvIndexer...)
> I'd be happy to try, though I can't promise to be as fast at it as
> that last
> round with you and Pudge. I'll be more likely to chip away rather
> than plunge ahead.

This is nowhere near as ambitious as what Pudge and I took on. We're
talkin' order-of-magnitude-less-ambitious.

That project had several major challenges. In addition to all the
good stuff Pudge contributed, I had to write the classes which are
now MultiLexicon, LexCache, SegLexCache, MatchFieldQuery,
MatchFieldWeight, and MatchFieldScorer, plus add a bunch of stuff to
BitVector. There was a lot of complex C code which had to be
written, tested, and debugged.

Here's what you and I need to do:

Task 1:

* Copy and paste analyze_field, analyze_text, and analyze_batch
from that previous email into Analyzer.pm, replacing the current
analyze() and analyze_text().
* Perform minor mods on existing analyze_text() methods in
LCNormalizer and PolyAnalyzer.
* Change 151-analyzer.t to use a custom subclass of Analyzer (since
analyze() is a no-op, but its replacement analyze_batch() dies an
abstract death).
* Add perfunctory tests for analyze_field to the relevant test
files.
o 150-polyanalyzer.t
o 151-analyzer.t
o 153-lc_normalizer.t
o 154-tokenizer.t
o 155-stopalizer.t
o 156-stemmer.t
* Adjust Analyzer's documentation to reflect the new regime.

Task 2:

* Change SegWriter to use analyze_field.
* Add optimized analyze_field implementations to LCNormalizer and
PolyAnalyzer.
* Add optimized analyze_field implementation to Tokenizer. This one's
harder because it requires some advanced XS.
* Test that you can mod a document's contents, using code nearly
identical to what will end up in the Swish/KS glue eventually.

Task 3:

* Expand Analyzer's docs with regard to subclassing.

Task 4:

* Copy and paste the utf8ify code into StringHelper.pm.
* Add some tests to verify that it works.
* Replace calls to utf8::upgrade with utf8ify.
* We'll skip moving the utf8 conversion from InvIndexer to the
Analyzers for now, since that has other implications.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
Re: Analyzer API mods (was API request for KS::InvIndexer...)
Marvin Humphrey scribbled on 5/4/07 5:20 PM:
>
> Here's what you and I need to do:
>
> Task 1:
>
> * Copy and paste analyze_field, analyze_text, and analyze_batch
> from that previous email into Analyzer.pm, replacing the current
> analyze() and analyze_text().
> * Perform minor mods on existing analyze_text() methods in
> LCNormalizer and PolyAnalyzer.
> * Change 151-analyzer.t to use a custom subclass of Analyzer (since
> analyze() is a no-op, but its replacement analyze_batch() dies an
> abstract death).
> * Add perfunctory tests for analyze_field to the relevant test files.
> o 150-polyanalyzer.t
> o 151-analyzer.t
> o 153-lc_normalizer.t
> o 154-tokenizer.t
> o 155-stopalizer.t
> o 156-stemmer.t
> * Adjust Analyzer's documentation to reflect the new regime.
>

The attached patch addresses all items in Task 1 but the new tests for
analyze_field(). There were considerably more files affected than the ones you
listed above, so rather than try and lump the analyze -> analyze_batch change in
with the new analyze_field feature, I'm opting to just do them in two steps.

Here's step 1. 'make test' passes for me at r2396.


--
Peter Karman . http://peknet.com/ . peter@peknet.com
-------------- next part --------------
Index: buildlib/KinoTestUtils.pm
===================================================================
--- buildlib/KinoTestUtils.pm (revision 2396)
+++ buildlib/KinoTestUtils.pm (working copy)
@@ -130,12 +130,12 @@
}


-# Verify an Analyzer's analyze, analyze_text, and analyze_raw methods.
+# Verify an Analyzer's analyze_batch, analyze_field, analyze_text, and analyze_raw methods.
sub test_analyzer {
my ( $analyzer, $source, $expected, $message ) = @_;

my $batch = KinoSearch::Analysis::TokenBatch->new( text => $source );
- $batch = $analyzer->analyze($batch);
+ $batch = $analyzer->analyze_batch($batch);
my @got;
while ( my $token = $batch->next ) {
push @got, $token->get_text;
Index: t/154-tokenizer.t
===================================================================
--- t/154-tokenizer.t (revision 2396)
+++ t/154-tokenizer.t (working copy)
@@ -12,7 +12,7 @@
is( $text, "o'malley's", "multiple apostrophes for default token_re" );

my $batch = KinoSearch::Analysis::TokenBatch->new( text => "a b c" );
-$batch = $tokenizer->analyze($batch);
+$batch = $tokenizer->analyze_batch($batch);

my ( @token_texts, @start_offsets, @end_offsets );
while ( my $token = $batch->next ) {
@@ -26,7 +26,7 @@

$tokenizer = KinoSearch::Analysis::Tokenizer->new( token_re => qr/./ );
$batch = KinoSearch::Analysis::TokenBatch->new( text => "a b c" );
-$batch = $tokenizer->analyze($batch);
+$batch = $tokenizer->analyze_batch($batch);

@token_texts = ();
@start_offsets = ();
@@ -41,7 +41,7 @@
is_deeply( \@end_offsets, [ 1 .. 5 ], "ends: custom re" );

$batch->reset;
-$batch = $tokenizer->analyze($batch);
+$batch = $tokenizer->analyze_batch($batch);
@token_texts = ();
while ( my $token = $batch->next ) {
push @token_texts, $token->get_text;
Index: t/204-doc_reader.t
===================================================================
--- t/204-doc_reader.t (revision 2396)
+++ t/204-doc_reader.t (working copy)
@@ -3,6 +3,10 @@

use Test::More tests => 5;

+package TestAnalyzer;
+use base qw( KinoSearch::Analysis::Analyzer );
+sub analyze_batch { $_[1] }
+
package MySchema::textcomp;
use base qw( KinoSearch::Schema::FieldSpec );

@@ -23,10 +27,9 @@
use base qw( KinoSearch::Schema::FieldSpec );

sub stored {0}
-
+
package MySchema;
use base qw( KinoSearch::Schema );
-use KinoSearch::Analysis::Analyzer;

our %fields = (
text => 'KinoSearch::Schema::FieldSpec',
@@ -36,7 +39,7 @@
unstored => 'MySchema::unstored',
);

-sub analyzer { KinoSearch::Analysis::Analyzer->new }
+sub analyzer { TestAnalyzer->new }

package main;
use Encode qw( _utf8_on );
Index: t/605-store_pos_boost.t
===================================================================
--- t/605-store_pos_boost.t (revision 2396)
+++ t/605-store_pos_boost.t (working copy)
@@ -6,7 +6,7 @@
use base qw( KinoSearch::Analysis::Analyzer );
use KinoSearch::Analysis::TokenBatch;

-sub analyze {
+sub analyze_batch {
my ( $self, $batch ) = @_;
my $new_batch = KinoSearch::Analysis::TokenBatch->new;

Index: t/207-seg_lexicon.t
===================================================================
--- t/207-seg_lexicon.t (revision 2396)
+++ t/207-seg_lexicon.t (working copy)
@@ -3,18 +3,20 @@

use Test::More tests => 5;

+package TestAnalyzer;
+use base qw( KinoSearch::Analysis::Analyzer );
+sub analyze_batch { $_[1] }
+
package MySchema;
use base qw( KinoSearch::Schema );

-use KinoSearch::Analysis::Analyzer;
-
our %fields = (
a => 'KinoSearch::Schema::FieldSpec',
b => 'KinoSearch::Schema::FieldSpec',
c => 'KinoSearch::Schema::FieldSpec',
);

-sub analyzer { KinoSearch::Analysis::Analyzer->new }
+sub analyzer { TestAnalyzer->new }

package main;

Index: t/151-analyzer.t
===================================================================
--- t/151-analyzer.t (revision 2396)
+++ t/151-analyzer.t (working copy)
@@ -7,8 +7,13 @@
use KinoSearch::Analysis::Analyzer;
use KinoTestUtils qw( utf8_test_strings test_analyzer );

-my $analyzer = KinoSearch::Analysis::Analyzer->new;
+package TestAnalyzer;
+use base qw( KinoSearch::Analysis::Analyzer );
+sub analyze_batch { $_[1] } # satisfy mandatory override

+package main;
+my $analyzer = TestAnalyzer->new;
+
my ( $smiley, $not_a_smiley, $frowny ) = utf8_test_strings();

my ($got) = $analyzer->analyze_raw($not_a_smiley);
Index: lib/KinoSearch/Analysis/Analyzer.pm
===================================================================
--- lib/KinoSearch/Analysis/Analyzer.pm (revision 2396)
+++ lib/KinoSearch/Analysis/Analyzer.pm (working copy)
@@ -12,16 +12,24 @@

use KinoSearch::Analysis::TokenBatch;

-# usage: $token_batch = $analyzer->analyze($token_batch);
-sub analyze { return $_[1] }
+sub analyze_field {
+ my ( $self, $doc, $field_name ) = @_;
+ my $batch = KinoSearch::Analysis::TokenBatch->new(
+ text => $doc->{$field_name},
+ );
+ return $self->analyze_batch($batch);
+}

# Kick off an analysis chain, creating a TokenBatch. Occasionally optimized
# to minimize string copies.
sub analyze_text {
my $batch = KinoSearch::Analysis::TokenBatch->new( text => $_[1] );
- return $_[0]->analyze($batch);
+ return $_[0]->analyze_batch($batch);
}

+# Must override in a subclass
+sub analyze_batch { shift->abstract_death }
+
# Convenience method which takes text as input and returns a Perl array of
# token texts.
sub analyze_raw {
@@ -47,7 +55,7 @@

package MyAnalyzer;

- sub analyze {
+ sub analyze_batch {
my ( $self, $token_batch ) = @_;

while ( my $token = $token_batch->next ) {
@@ -72,11 +80,11 @@

=head1 SUBCLASSING

-All Analyzer subclasses must provide an C<analyze> method.
+All Analyzer subclasses must provide an C<analyze_batch> method.

-=head2 analyze
+=head2 analyze_batch

-C<analyze()> takes a single L<TokenBatch|KinoSearch::Analysis::TokenBatch> as
+C<analyze_batch()> takes a single L<TokenBatch|KinoSearch::Analysis::TokenBatch> as
input, and it returns a TokenBatch, either the same one (presumably
transformed in some way), or a new one.

Index: lib/KinoSearch/Analysis/PolyAnalyzer.pm
===================================================================
--- lib/KinoSearch/Analysis/PolyAnalyzer.pm (revision 2396)
+++ lib/KinoSearch/Analysis/PolyAnalyzer.pm (working copy)
@@ -33,11 +33,11 @@
}
}

-sub analyze {
+sub analyze_batch {
my ( $self, $token_batch ) = @_;

# iterate through each of the analyzers in order
- $token_batch = $_->analyze($token_batch) for @{ $self->{analyzers} };
+ $token_batch = $_->analyze_batch($token_batch) for @{ $self->{analyzers} };

return $token_batch;
}
@@ -54,7 +54,7 @@
}
else {
my $batch = $analyzers->[0]->analyze_text( $_[1] );
- $batch = $_->analyze($batch) for @{$analyzers}[ 1 .. $#$analyzers ];
+ $batch = $_->analyze_batch($batch) for @{$analyzers}[ 1 .. $#$analyzers ];
return $batch;
}
}
Index: lib/KinoSearch/Analysis/Tokenizer.pm
===================================================================
--- lib/KinoSearch/Analysis/Tokenizer.pm (revision 2396)
+++ lib/KinoSearch/Analysis/Tokenizer.pm (working copy)
@@ -29,8 +29,8 @@
HV *self_hv;
SV *batch_or_text_sv;
ALIAS:
- analyze = 1
- analyze_text = 2
+ analyze_batch = 1
+ analyze_text = 2
CODE:
{
kino_TokenBatch *batch = NULL;
Index: lib/KinoSearch/Analysis/Stemmer.pm
===================================================================
--- lib/KinoSearch/Analysis/Stemmer.pm (revision 2396)
+++ lib/KinoSearch/Analysis/Stemmer.pm (working copy)
@@ -35,7 +35,7 @@
);
}

-sub analyze {
+sub analyze_batch {
my ( $self, $batch ) = @_;

# replace terms with stemmed versions.
Index: lib/KinoSearch/Analysis/Stopalizer.pm
===================================================================
--- lib/KinoSearch/Analysis/Stopalizer.pm (revision 2396)
+++ lib/KinoSearch/Analysis/Stopalizer.pm (working copy)
@@ -45,7 +45,7 @@
MODULE = KinoSearch PACKAGE = KinoSearch::Analysis::Stopalizer

SV*
-analyze(self_hash, batch)
+analyze_batch(self_hash, batch)
HV *self_hash;
kino_TokenBatch *batch;
CODE:
Index: lib/KinoSearch/Analysis/LCNormalizer.pm
===================================================================
--- lib/KinoSearch/Analysis/LCNormalizer.pm (revision 2396)
+++ lib/KinoSearch/Analysis/LCNormalizer.pm (working copy)
@@ -14,7 +14,7 @@
use KinoSearch::Analysis::Token;
use KinoSearch::Analysis::TokenBatch;

-sub analyze {
+sub analyze_batch {
my ( $self, $batch ) = @_;

# lc all of the terms, one by one
@@ -27,7 +27,8 @@
}

sub analyze_text {
- return KinoSearch::Analysis::TokenBatch->new( text => lc($_[1] ) );
+ my $batch = KinoSearch::Analysis::TokenBatch->new( text => lc($_[1] ) );
+ return $_[0]->analyze_batch($batch);
}

1;
Re: Analyzer API mods (was API request for KS::InvIndexer...)
On May 4, 2007, at 8:36 PM, Peter Karman wrote:

> The attached patch addresses all items in Task 1 but the new tests
> for analyze_field(). There were considerably more files affected
> than the ones you listed above,

Heh. I originally had one more bullet point in there: "Fix things
until the test suite passes." I'd forgotten quite how many test
files used a plain ol' Analyzer. Looks like 3. Might be worth
factoring that into buildlib/TestAnalyzer.pm to satisfy DRY. Or
not. It's simple stuff.

> so rather than try and lump the analyze -> analyze_batch change in
> with the new analyze_field feature, I'm opting to just do them in
> two steps.
>
> Here's step 1. 'make test' passes for me at r2396.

Looks great. I've applied it as revision 2397. Thanks!

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
Re: Analyzer API mods (was API request for KS::InvIndexer...)
Of the remaining tasks, the attached patch addresses these:

> * Add perfunctory tests for analyze_field to the relevant test files.
> o 150-polyanalyzer.t
> o 151-analyzer.t
> o 153-lc_normalizer.t
> o 154-tokenizer.t
> o 155-stopalizer.t
> o 156-stemmer.t
>
>
> * Change SegWriter to use analyze_field. [NOTE: Marvin did this]
> * Add optimized analyze_field implementations to LCNormalizer and
> PolyAnalyzer.
> * Add optimized analyze_field implementation to Tokenizer. This one's
> harder because it requires some advanced XS.
>
> * Copy and paste the utf8ify code into StringHelper.pm.
> * Add some tests to verify that it works.
> * Replace calls to utf8::upgrade with utf8ify.
> * We'll skip moving the utf8 conversion from InvIndexer to the
> Analyzers for now, since that has other implications.
>

I'm not convinced that the approach I took in Tokenizer's XS was optimal.
But it passes all tests.


Still TODO:

> * Test that you can mod a document's contents, using code nearly
> identical to what will end up in the Swish/KS glue eventually.

> * Expand Analyzer's docs with regard to subclassing.

I expect I'll be able to do both of those once I think a little more about what
I want Swish to do.

--
Peter Karman . http://peknet.com/ . peter@peknet.com
-------------- next part --------------
Index: buildlib/KinoTestUtils.pm
===================================================================
--- buildlib/KinoTestUtils.pm (revision 2433)
+++ buildlib/KinoTestUtils.pm (working copy)
@@ -150,6 +150,13 @@

@got = $analyzer->analyze_raw($source);
Test::More::is_deeply( \@got, $expected, "analyze_raw: $message" );
+
+ $batch = $analyzer->analyze_field({content => $source}, 'content');
+ @got = ();
+ while ( my $token = $batch->next ) {
+ push @got, $token->get_text;
+ }
+ Test::More::is_deeply( \@got, $expected, "analyze_field: $message" );
}

1;
Index: t/508-hits.t
===================================================================
--- t/508-hits.t (revision 2433)
+++ t/508-hits.t (working copy)
@@ -8,7 +8,7 @@
use KinoSearch::Searcher;
use KinoTestUtils qw( create_invindex );

-my @docs = ( 'a b', 'a a b', 'a a a b', 'x' );
+my @docs = ( 'a b', 'a a b', 'a a a b', 'x' );
my $invindex = create_invindex(@docs);

my $searcher = KinoSearch::Searcher->new( invindex => $invindex, );
Index: t/154-tokenizer.t
===================================================================
--- t/154-tokenizer.t (revision 2433)
+++ t/154-tokenizer.t (working copy)
@@ -1,7 +1,7 @@
use strict;
use warnings;

-use Test::More tests => 8;
+use Test::More tests => 9;

use KinoSearch::Analysis::Tokenizer;
use KinoSearch::Analysis::TokenBatch;
@@ -51,3 +51,13 @@
[ 'a', ' ', 'b', ' ', 'c' ],
"no freakout when fed multiple tokens"
);
+
+$batch->reset;
+$tokenizer = KinoSearch::Analysis::Tokenizer->new();
+$batch
+ = $tokenizer->analyze_field( { monroe => 'some like it hot' }, 'monroe' );
+@token_texts = ();
+while ( my $token = $batch->next ) {
+ push @token_texts, $token->get_text;
+}
+is_deeply( \@token_texts, [ 'some', 'like', 'it', 'hot' ], "analyze_field" );
Index: t/601-queryparser.t
===================================================================
--- t/601-queryparser.t (revision 2433)
+++ t/601-queryparser.t (working copy)
@@ -39,7 +39,7 @@
sub analyzer { KinoSearch::Analysis::Tokenizer->new }

package main;
-use Test::More tests => 210;
+use Test::More tests => 212;

use KinoSearch::QueryParser::QueryParser;
use KinoSearch::Analysis::PolyAnalyzer;
@@ -47,7 +47,7 @@
use KinoSearch::InvIndexer;
use KinoSearch::Searcher;
use KinoSearch::Store::RAMFolder;
-use KinoSearch::Util::StringHelper qw( utf8_flag_on );
+use KinoSearch::Util::StringHelper qw( utf8_flag_on utf8ify );

use KinoTestUtils qw( create_invindex );

@@ -197,6 +197,16 @@
$hits = $searcher->search( query => $motorhead );
is( $hits->total_hits, 1, "QueryParser parses UTF-8 strings correctly" );

+$motorhead = "Mot\xF6rhead";
+utf8ify($motorhead);
+$unicode_invindex = create_invindex($motorhead);
+$searcher = KinoSearch::Searcher->new( invindex => $unicode_invindex, );
+
+$hits = $searcher->search( query => 'Mot' );
+is( $hits->total_hits, 0, "Pre-test - indexing worked properly" );
+$hits = $searcher->search( query => $motorhead );
+is( $hits->total_hits, 1, "QueryParser utf8ifys UTF-8 strings correctly" );
+
my $mf_folder = KinoSearch::Store::RAMFolder->new;
my $mf_schema = MultiFieldSchema->new;
my $mf_invindex = KinoSearch::InvIndex->create(
Index: t/519-and_or_scorer.t
===================================================================
--- t/519-and_or_scorer.t (revision 2433)
+++ t/519-and_or_scorer.t (working copy)
@@ -123,7 +123,7 @@
my $score_docs = $hc->get_hit_queue->score_docs;
my @by_score_then_num = map { $_->get_doc_num }
sort {
- $b->get_score <=> $a->get_score
+ $b->get_score <=> $a->get_score
|| $a->get_doc_num <=> $b->get_doc_num
} @$score_docs;
my @by_num = sort { $a <=> $b } @by_score_then_num;
Index: t/155-stopalizer.t
===================================================================
--- t/155-stopalizer.t (revision 2433)
+++ t/155-stopalizer.t (working copy)
@@ -2,7 +2,7 @@
use warnings;
use lib 'buildlib';

-use Test::More tests => 6;
+use Test::More tests => 8;
use KinoTestUtils qw( test_analyzer );

use KinoSearch::Analysis::Stopalizer;
Index: t/016-varray.t
===================================================================
--- t/016-varray.t (revision 2433)
+++ t/016-varray.t (working copy)
@@ -10,7 +10,7 @@
my ( $varray, @orig, @got );

$varray = KinoSearch::Util::VArray->new( capacity => 0 );
-@orig = 1 .. 10;
+@orig = 1 .. 10;

$varray->push( KinoSearch::Util::ByteBuf->new($_) ) for @orig;
is( $varray->get_size, 10, "get_size after pushing 10 elements" );
Index: t/215-term_vectors.t
===================================================================
--- t/215-term_vectors.t (revision 2433)
+++ t/215-term_vectors.t (working copy)
@@ -47,12 +47,12 @@
$invindexer->finish;

my $searcher = KinoSearch::Searcher->new( invindex => $invindex );
-my $doc_vec = $searcher->fetch_doc_vec(0);
+my $doc_vec = $searcher->fetch_doc_vec(0);

my $term_vector = $doc_vec->term_vector( "content", "foo" );
ok( defined $term_vector, "successfully retrieved term vector" );

-$doc_vec = $searcher->fetch_doc_vec(1);
+$doc_vec = $searcher->fetch_doc_vec(1);
$term_vector = $doc_vec->term_vector( 'content', 'mañana' );

ok( defined $term_vector, "utf-8 term vector retrieved" );
Index: t/151-analyzer.t
===================================================================
--- t/151-analyzer.t (revision 2433)
+++ t/151-analyzer.t (working copy)
@@ -2,7 +2,7 @@
use warnings;
use lib 'buildlib';

-use Test::More tests => 5;
+use Test::More tests => 6;

use KinoSearch::Analysis::Analyzer;
use KinoTestUtils qw( utf8_test_strings test_analyzer );
Index: t/505-hit_queue.t
===================================================================
--- t/505-hit_queue.t (revision 2433)
+++ t/505-hit_queue.t (working copy)
@@ -26,7 +26,8 @@
} @docs_and_scores;

my @correct_order = sort {
- $b->get_score <=> $a->get_score or $a->get_doc_num <=> $b->get_doc_num
+ $b->get_score <=> $a->get_score
+ or $a->get_doc_num <=> $b->get_doc_num
} @score_docs;
my @correct_docs = map { $_->get_doc_num } @correct_order;
my @correct_scores = map { $_->get_score } @correct_order;
Index: t/153-lc_normalizer.t
===================================================================
--- t/153-lc_normalizer.t (revision 2433)
+++ t/153-lc_normalizer.t (working copy)
@@ -2,7 +2,7 @@
use warnings;
use lib 'buildlib';

-use Test::More tests => 3;
+use Test::More tests => 4;
use KinoTestUtils qw( test_analyzer );

use KinoSearch::Analysis::LCNormalizer;
Index: t/518-or_scorer.t
===================================================================
--- t/518-or_scorer.t (revision 2433)
+++ t/518-or_scorer.t (working copy)
@@ -83,7 +83,7 @@
perform_search( [ 'a' .. $_ ] ) for 'a' .. 'z';

sub perform_search {
- my $letters = shift;
+ my $letters = shift;
my $letter_string = join ' ', @$letters;

my $subscorers
@@ -125,8 +125,8 @@
my @doc_nums = keys %{ $letters{$letter} };
$counts{$_} += 1 for @doc_nums;
}
- my @by_count_then_num =
- sort { $counts{$b} <=> $counts{$a} || $a <=> $b }
+ my @by_count_then_num
+ = sort { $counts{$b} <=> $counts{$a} || $a <=> $b }
keys %counts;

my @by_num = sort { $a <=> $b } @by_count_then_num;
@@ -139,7 +139,7 @@
my $score_docs = $hc->get_hit_queue->score_docs;
my @by_score_then_num = map { $_->get_doc_num }
sort {
- $b->get_score <=> $a->get_score
+ $b->get_score <=> $a->get_score
|| $a->get_doc_num <=> $b->get_doc_num
} @$score_docs;
my @by_num = sort { $a <=> $b } @by_score_then_num;
Index: t/012-priority_queue.t
===================================================================
--- t/012-priority_queue.t (revision 2433)
+++ t/012-priority_queue.t (working copy)
@@ -30,7 +30,7 @@
);

1 while defined $pq->pop; # empty queue;
-$pq = KinoSearch::Util::PriorityQueue->new( max_size => 5 );
+$pq = KinoSearch::Util::PriorityQueue->new( max_size => 5 );
@prioritized = ();

$pq->insert($_) for ( 1 .. 10, -3, 1590 .. 1600, 5 );
@@ -50,4 +50,3 @@
$pq->insert( splice( @nums, $tick, 1 ) );
}
is_deeply( $pq->pop_all, [ reverse 1 .. 100 ], "random order insertion" );
-
Index: t/156-stemmer.t
===================================================================
--- t/156-stemmer.t (revision 2433)
+++ t/156-stemmer.t (working copy)
@@ -2,7 +2,7 @@
use warnings;
use lib 'buildlib';

-use Test::More tests => 6;
+use Test::More tests => 8;
use KinoTestUtils qw( test_analyzer );

use KinoSearch::Analysis::Stemmer;
Index: t/514-and_scorer.t
===================================================================
--- t/514-and_scorer.t (revision 2433)
+++ t/514-and_scorer.t (working copy)
@@ -19,7 +19,7 @@
push @docs, ('c d x');
my $invindex = create_invindex(@docs);

-my $searcher = KinoSearch::Searcher->new( invindex => $invindex, );
+my $searcher = KinoSearch::Searcher->new( invindex => $invindex, );
my $similarity = KinoSearch::Search::Similarity->new;

my $c_query = KinoSearch::Search::TermQuery->new(
@@ -96,4 +96,3 @@
}
return \@doc_nums;
}
-
Index: t/150-polyanalyzer.t
===================================================================
--- t/150-polyanalyzer.t (revision 2433)
+++ t/150-polyanalyzer.t (working copy)
@@ -2,7 +2,7 @@
use warnings;
use lib 'buildlib';

-use Test::More tests => 15;
+use Test::More tests => 20;

use KinoTestUtils qw( test_analyzer );

Index: t/013-bit_vector.t
===================================================================
--- t/013-bit_vector.t (revision 2433)
+++ t/013-bit_vector.t (working copy)
@@ -60,7 +60,7 @@
}
}

-my @set_1 = ( 1 .. 3, 10, 20, 30 );
+my @set_1 = ( 1 .. 3, 10, 20, 30 );
my @set_2 = ( 2 .. 10, 25 .. 35 );

$bit_vec = KinoSearch::Util::BitVector->new;
Index: lib/KinoSearch/QueryParser/QueryParser.pm
===================================================================
--- lib/KinoSearch/QueryParser/QueryParser.pm (revision 2433)
+++ lib/KinoSearch/QueryParser/QueryParser.pm (working copy)
@@ -3,6 +3,7 @@

package KinoSearch::QueryParser::QueryParser;
use KinoSearch::Util::ToolSet;
+use KinoSearch::Util::StringHelper qw( utf8ify );
use base qw( KinoSearch::Util::Class );

our %instance_vars = (
@@ -100,7 +101,7 @@
sub parse {
my ( $self, $qstring_orig ) = @_;
$qstring_orig = '' unless defined $qstring_orig;
- utf8::upgrade($qstring_orig);
+ utf8ify($qstring_orig);
my $default_fields = $self->{fields};
my $default_boolop = $self->{default_boolop};
my @clauses;
Index: lib/KinoSearch/Analysis/PolyAnalyzer.pm
===================================================================
--- lib/KinoSearch/Analysis/PolyAnalyzer.pm (revision 2433)
+++ lib/KinoSearch/Analysis/PolyAnalyzer.pm (working copy)
@@ -18,7 +18,7 @@
use KinoSearch::Analysis::Stemmer;

sub init_instance {
- my $self = shift;
+ my $self = shift;
my $language = $self->{language} = lc( $self->{language} );

# create a default set of analyzers if language was specified
@@ -61,6 +61,24 @@
}
}

+sub analyze_field {
+ my $analyzers = $_[0]->{analyzers};
+
+ if ( !@$analyzers ) {
+ return KinoSearch::Analysis::TokenBatch->new(
+ text => $_[1]->{ $_[2] } );
+ }
+ elsif ( @$analyzers == 1 ) {
+ return $analyzers->[0]->analyze_field( $_[1], $_[2] );
+ }
+ else {
+ my $batch = $analyzers->[0]->analyze_field( $_[1], $_[2] );
+ $batch = $_->analyze_batch($batch)
+ for @{$analyzers}[ 1 .. $#$analyzers ];
+ return $batch;
+ }
+}
+
1;

__END__
Index: lib/KinoSearch/Analysis/Tokenizer.pm
===================================================================
--- lib/KinoSearch/Analysis/Tokenizer.pm (revision 2433)
+++ lib/KinoSearch/Analysis/Tokenizer.pm (working copy)
@@ -25,12 +25,13 @@
MODULE = KinoSearch PACKAGE = KinoSearch::Analysis::Tokenizer

kino_TokenBatch*
-_do_analyze(self_hv, batch_or_text_sv)
+_do_analyze(self_hv, batch_or_text_sv, ...)
HV *self_hv;
SV *batch_or_text_sv;
ALIAS:
analyze_batch = 1
analyze_text = 2
+ analyze_field = 3
CODE:
{
kino_TokenBatch *batch = NULL;
@@ -40,11 +41,30 @@
chy_u32_t num_code_points = 0;
SV *wrapper = sv_newmortal();
RETVAL = kino_TokenBatch_new(NULL);
-
+ char *string = NULL;
+ STRLEN string_len = 0;
+
if (ix == 1) {
EXTRACT_STRUCT( batch_or_text_sv, batch, kino_TokenBatch*,
"KinoSearch::Analysis::TokenBatch");
}
+ if (ix == 2) {
+ string = SvPVutf8( ST(1), string_len );
+ }
+ if (ix == 3) {
+ if (items != 3)
+ CONFESS("analyze_field() takes 2 arguments, got %d", items - 1);
+ if (!SvROK(batch_or_text_sv))
+ CONFESS("first argument to analyze_field() must be hash ref");
+
+ STRLEN len;
+ char *field_name = SvPV(ST(2), len);
+ string = SvPVutf8(extract_sv(
+ (HV*)SvRV(batch_or_text_sv),
+ field_name,
+ len),
+ string_len);
+ }

/* extract regexp struct from qr// entity */
if (SvROK(token_re)) {
@@ -63,7 +83,6 @@
SvUTF8_on(wrapper);

while (1) {
- STRLEN len;
char *string_beg;
char *string_end;
char *string_arg;
@@ -72,20 +91,20 @@
kino_Token *token = Kino_TokenBatch_Next(batch);
if (token == NULL)
break;
- len = token->len;
+ string_len = token->len;
string_beg = token->text;
- string_end = string_beg + len;
+ string_end = string_beg + string_len;
string_arg = string_beg;
}
else {
- string_beg = SvPVutf8( ST(1), len );
- string_end = string_beg + len;
+ string_beg = string;
+ string_end = string_beg + string_len;
string_arg = string_beg;
}

/* wrap the string in an SV to please the regex engine */
SvPVX(wrapper) = string_beg;
- SvCUR_set(wrapper, len);
+ SvCUR_set(wrapper, string_len);
SvPOK_on(wrapper);

while (
@@ -128,7 +147,7 @@
REFCOUNT_DEC(new_token);
}

- if (ix == 2) /* analyze_text only runs one loop iter */
+ if (ix > 1) /* analyze_text and analyze_field only run one loop iter */
break;
}
}
Index: lib/KinoSearch/Analysis/Stopalizer.pm
===================================================================
--- lib/KinoSearch/Analysis/Stopalizer.pm (revision 2433)
+++ lib/KinoSearch/Analysis/Stopalizer.pm (working copy)
@@ -16,7 +16,7 @@
use Lingua::StopWords;

sub init_instance {
- my $self = shift;
+ my $self = shift;
my $language = $self->{language} = lc( $self->{language} );

# verify a supplied stoplist
@@ -139,4 +139,3 @@
See L<KinoSearch> version 0.20.

=cut
-
Index: lib/KinoSearch/Analysis/LCNormalizer.pm
===================================================================
--- lib/KinoSearch/Analysis/LCNormalizer.pm (revision 2433)
+++ lib/KinoSearch/Analysis/LCNormalizer.pm (working copy)
@@ -31,6 +31,12 @@
return $_[0]->analyze_batch($batch);
}

+sub analyze_field {
+ my $batch = KinoSearch::Analysis::TokenBatch->new(
+ text => lc( $_[1]->{ $_[2] } ) );
+ return $_[0]->analyze_batch($batch);
+}
+
1;

__END__
Index: lib/KinoSearch/Index/SegWriter.pm
===================================================================
--- lib/KinoSearch/Index/SegWriter.pm (revision 2433)
+++ lib/KinoSearch/Index/SegWriter.pm (working copy)
@@ -3,6 +3,7 @@

package KinoSearch::Index::SegWriter;
use KinoSearch::Util::ToolSet;
+use KinoSearch::Util::StringHelper qw( utf8ify );
use base qw( KinoSearch::Util::Class );

our %instance_vars = (
@@ -91,7 +92,7 @@

# upgrade fields that aren't binary to utf8
if ( !$field_spec->binary ) {
- utf8::upgrade( $doc->{$field_name} );
+ utf8ify( $doc->{$field_name} );
}

next unless $field_spec->indexed;
Index: lib/KinoSearch/Util/StringHelper.pm
===================================================================
--- lib/KinoSearch/Util/StringHelper.pm (revision 2433)
+++ lib/KinoSearch/Util/StringHelper.pm (working copy)
@@ -9,6 +9,7 @@
utf8_flag_off
to_base36
from_base36
+ utf8ify
);

1;
@@ -62,6 +63,19 @@
RETVAL = strtol(str, NULL, 36);
OUTPUT: RETVAL

+=for comment
+
+Upgrade an SV to UTF8, converting Latin1 if necessary. Equivalent to utf8::upgrade().
+
+=cut
+
+void
+utf8ify(sv)
+ SV *sv;
+PPCODE:
+ sv_utf8_upgrade(sv);
+
+
__POD__

=begin devdocs
Re: Analyzer API mods (was API request for KS::InvIndexer...) [ In reply to ]
On May 14, 2007, at 6:35 AM, Peter Karman wrote:

> Of the remaining tasks, the attached patch addresses these:

---->8 SNIP 8<----

Excellent, thanks! I applied the meat of this as r2439.

Some parts of the patch were whitespace only; they alerted me to the
fact that my local copies of Perl::Tidy needed to be updated. I then
tidied all of KS; that commit, r2438, included the cosmetic
portions of your patch.

There was a spot I chose to modify in 601-query_parser.t. These two
frags are nearly equivalent:

my $motorhead = "Mot\xC3\xB6rhead";
utf8_flag_on($motorhead);

$motorhead = "Mot\xF6rhead";
utf8ify($motorhead);

The only difference is that utf8ify introduces magical cached utf8
length information:

SV = PV(0x1935aa8) at 0x18e9908
REFCNT = 1
FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
PV = 0x1594460 "Mot\303\266rhead"\0 [UTF8 "Mot\x{f6}rhead"]
CUR = 10
LEN = 11

SV = PVMG(0x194d380) at 0x18e9908
REFCNT = 1
FLAGS = (PADBUSY,PADMY,SMG,POK,pPOK,UTF8)
IV = 0
NV = 0
PV = 0x1594fb0 "Mot\303\266rhead"\0 [UTF8 "Mot\x{f6}rhead"]
CUR = 10
LEN = 11
MAGIC = 0x15953d0
MG_VIRTUAL = &PL_vtbl_utf8
MG_TYPE = PERL_MAGIC_utf8(w)
MG_LEN = 9

That meant that the new test was testing the same information, just
prepared a different way. What I did was mod the old test to use
your way of prepping the string.
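
For anyone who wants to poke at the difference themselves, here's a
rough sketch. It uses Encode's _utf8_on() and the core utf8::upgrade()
as stand-ins for KS's utf8_flag_on() and utf8ify() (the latter just
wraps sv_utf8_upgrade()), so treat it as an approximation rather than
the exact test code:

```perl
use strict;
use warnings;
use Devel::Peek qw( Dump );
use Encode qw( _utf8_on );

# Variant 1: hand-encoded UTF-8 bytes, then flip the SVf_UTF8 flag.
my $motorhead = "Mot\xC3\xB6rhead";
_utf8_on($motorhead);           # stand-in for utf8_flag_on()

# Variant 2: Latin-1 source string, upgraded in place.
my $motorhead2 = "Mot\xF6rhead";
utf8::upgrade($motorhead2);     # what utf8ify() does via sv_utf8_upgrade()

# Same characters either way.
print $motorhead eq $motorhead2 ? "equal\n" : "NOT equal\n";

# On the perl used in the dumps above, only the second SV carries the
# PERL_MAGIC_utf8 length-cache magic; compare the FLAGS and MAGIC lines.
Dump($motorhead);
Dump($motorhead2);
```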

> I'm not convinced that the approach I took in Tokenizer's XS was
> most optimal. But it passes all tests.

I don't think it's possible to improve upon what you chose to do.
Nice work!

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/