Mailing List Archive

Fwd: [rt.cpan.org #21359] Default tokenizer regex breaks unicode
Begin forwarded message:

From: " via RT" <bug-KinoSearch@rt.cpan.org>
Date: September 6, 2006 1:57:06 PM PDT
To: undisclosed-recipients:;
Subject: [rt.cpan.org #21359] Default tokenizer regex breaks unicode
Reply-To: bug-KinoSearch@rt.cpan.org


Wed Sep 06 16:57:05 2006: Request 21359 was acted upon.
Transaction: Ticket created by MCRAWFOR
Queue: KinoSearch
Subject: Default tokenizer regex breaks unicode
Broken in: 0.12, 0.13
Severity: Important
Owner: Nobody
Requestors: mcrawfor@cpan.org
Status: new
Ticket <URL: http://rt.cpan.org/Ticket/Display.html?id=21359 >


The default regex in KinoSearch::Analysis::Tokenizer breaks unicode.

Building a custom Tokenizer with just non-whitespace like so:
my $tokenizer = KinoSearch::Analysis::Tokenizer->new( token_re => qr/\S+/ );
fixes the issue.

I'm not sure why the built-in regex breaks unicode, but it seems like it
could leave unicode alone without too much trouble.

Example that fails to match:
--------------------------------
#!/usr/bin/perl
use KinoSearch::InvIndexer;
use KinoSearch::Analysis::PolyAnalyzer;
use KinoSearch::Searcher;

my $uni = "\x{3028}\x{3063}\x{3057}\x{3024}";

my $analyzer = KinoSearch::Analysis::PolyAnalyzer->new( language => 'en' );

my $invindexer = KinoSearch::InvIndexer->new(
    invindex => 'kino.idx',
    create   => 1,
    analyzer => $analyzer,
);

$invindexer->spec_field(
    name  => 'title',
    boost => 3,
);
$invindexer->spec_field( name => 'bodytext' );

my $doc = $invindexer->new_doc;

$doc->set_value( title    => $uni . " hellos" );
$doc->set_value( bodytext => 'horatio' );

$invindexer->add_doc($doc);

$invindexer->finish;


my $searcher = KinoSearch::Searcher->new(
    invindex => 'kino.idx',
    analyzer => $analyzer,
);

my $hits = $searcher->search( query => $uni );
while ( my $hit = $hits->fetch_hit_hashref ) {
    print "$hit->{title}\n";
}
------------------------------------

But this same example works if you just create the analyzer like:

my $lc_normalizer = KinoSearch::Analysis::LCNormalizer->new();
my $tokenizer     = KinoSearch::Analysis::Tokenizer->new( token_re => qr/\S+/ );
my $stemmer       = KinoSearch::Analysis::Stemmer->new( language => 'en' );
my $analyzer      = KinoSearch::Analysis::PolyAnalyzer->new(
    analyzers => [ $lc_normalizer, $tokenizer, $stemmer ],
);

This is essentially the default PolyAnalyzer, except for the replaced token_re.
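
For what it's worth, the behavior of the workaround pattern can be checked in
isolation, without KinoSearch at all. A minimal sketch using the sample string
from the report (the exact default token_re is not reproduced here, only the
replacement):

#!/usr/bin/perl
use strict;
use warnings;

# Sample string from the report: four non-ASCII codepoints, no whitespace.
my $uni = "\x{3028}\x{3063}\x{3057}\x{3024}";

# With token_re => qr/\S+/, the whole string becomes a single token,
# since there is no whitespace to split on.
my @tokens = $uni =~ /\S+/g;
printf "%d token(s), first is %d chars long\n", scalar @tokens, length $tokens[0];
# prints "1 token(s), first is 4 chars long"

So a \S+ tokenizer passes the CJK characters through intact, which is
consistent with the search succeeding in the second example above.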