On Aug 15, 2006, at 1:11 AM, Marc Elser wrote:
> First of all thanks a lot for the fast patch. I just installed it
> from svn and stumbled across the following problems:
Thanks for the detailed reports.
> 1.) UTF-8 QueryStrings are still split by QueryParser at UTF-8
> special characters, for example at a-umlaut (or in German "ä"). This
> still leads to the described problems: words like
> "anlässlich" are split into "anl" and "sslich", which produces false
> matches -- for example, a document which contains "Anleitung" and
> "verlässlich", which is something completely different, would match.
This is an important result. Even though the mis-tokenization
happens to be due to a bug (see below), it illustrates why moving to
UTF-8 is not backwards compatible.
If you have an index based on, say, Latin 1, and it uses characters
above 127, they will have been indexed verbatim -- but now, as you're
searching, the Query string will get passed through a UTF-8
converter, changing it into a different sequence of bytes -- and
either producing no results, or incorrect results.
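To make that concrete, here's a minimal sketch of the byte-level
mismatch (the word and encodings are just illustrative):

#!/usr/bin/perl
use strict;
use warnings;
use Encode qw( encode );
# "a-umlaut" is one byte in Latin-1 but two bytes in UTF-8, so the
# same term turns into a different byte sequence under each encoding.
my $word   = "verl\x{E4}sslich";
my $latin1 = encode( 'iso-8859-1', $word ); # "verl\xE4sslich"     (11 bytes)
my $utf8   = encode( 'utf-8',      $word ); # "verl\xC3\xA4sslich" (12 bytes)
print length($latin1), " vs ", length($utf8), "\n";  # 11 vs 12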
Indexes which were produced under the current version WILL NOT WORK
PROPERLY after we make the transition. I intend to make KinoSearch
refuse to read them, so that you'll know you need to revert if you
can't regenerate right away.
It would be difficult, maybe impossible, to make a translator. I
think I'm going to have to invoke the KinoSearch alpha clause.
We may as well make some planned changes to the file format at the
same time (this is for the rich positions stuff mentioned a few
months ago), consolidating all the disruptive stuff into one release.
> I don't know exactly where the problem is because your regex
> \b\w+(?:'\w+)?\b should still work if the string it is used against
> has the utf-8 flag on.
It's because the locale pragma was in effect (removing it fixed the
bug):
slothbear:~/perltest marvin$ cat locale_regex.plx
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw( _utf8_on );
my $m = "Mot\xC3\xB6rhead";
_utf8_on($m);
$m =~ /(\w*)/;
print "$1\n";
use locale;
$m =~ /(\w*)/;
print "$1\n";
slothbear:~/perltest marvin$ perl locale_regex.plx
Motörhead
Mot
slothbear:~/perltest marvin$
> Is it possible that the TokenBatch does not set the utf-8 flag
> correctly in get_text, or does it somehow corrupt the string it returns?
I don't believe so. There are a few ways of testing a scalar to see
if the flag is on. For future reference, if you want to verify that
for yourself, my favorite is Devel::Peek::Dump(). Look for "UTF8" in
the "FLAGS" field, and for the decoded form of the string in the
trailing [UTF8 ...] annotation.
slothbear:~/perltest marvin$ cat peek_dump.plx
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw( _utf8_on );
use Devel::Peek;
my $m = "Mot\xC3\xB6rhead";
_utf8_on($m);
Dump($m);
slothbear:~/perltest marvin$ perl peek_dump.plx
SV = PV(0x1801660) at 0x180b59c
REFCNT = 1
FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
PV = 0x300bd0 "Mot\303\266rhead"\0 [UTF8 "Mot\x{f6}rhead"]
CUR = 10
LEN = 11
slothbear:~/perltest marvin$
> Because, before trying your patched version, I was also playing
> around with the "utf8::upgrade" function to upgrade the text
> returned from TokenBatch to utf8 before feeding it through the
> regex, but it somehow did an additional utf8 encoding to the
> string, causing the special character 'ä' to be encoded twice,
> resulting in 2 strange characters instead of 'ä'. Maybe the same
> happens now with the modified TokenBatch class.
If you can reproduce the problem, can you please provide me with a
Devel::Peek Dump of before and after?
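For reference, here's a sketch of the double-encoding failure mode you
describe -- assuming the scalar already held UTF-8 bytes with the UTF8
flag off when utf8::upgrade ran:

#!/usr/bin/perl
use strict;
use warnings;
use Devel::Peek;
my $m = "Mot\xC3\xB6rhead"; # UTF-8 bytes, but the UTF8 flag is off
utf8::upgrade($m);          # treats each byte as Latin-1 and re-encodes
Dump($m);                   # PV becomes "Mot\303\203\302\266rhead"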
>
> 2.) The Highlighter does not like the UTF-8 sequences. It looks
> like it doesn't count UTF-8 special characters when computing the
> insertion points for the <strong>...</strong> tags. It looks like
> they are shifted left by the number of UTF-8 special characters.
> Here's an example where the search term was "klammerte":
>
> === start example ===
> verfilzten Pelz beiden Händen nach Ansatzpunkten durchwühlend,
> schwang sich Ford Rücken mächtigen Tie<strong>res klamm</
> strong>erte sich, endlich sicher oben saß, beiden Händen braune
> Gestrüpp ...
> === end example ===
I believe that this was actually due to a problem in Tokenizer/
TokenBatch. Last night's patch measured Token starts in bytes from
the beginning of the field, but Token ends in Unicode code points
from the beginning of the field. Highlighter uses this information
(which has been stored in the index) to place the tags. It expects
byte counts -- the code-point count was bogus.
I've replaced the faulty algorithm with something slightly slower but
less tricky.
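As a quick illustration of the two offset systems (the string is just
an example):

#!/usr/bin/perl
use strict;
use warnings;
use bytes ();   # load bytes::length without enabling the pragma
my $text = "Mot\x{F6}rhead klammerte";
utf8::upgrade($text);   # force the internal UTF-8 representation
print length($text), "\n";        # 19 code points
print bytes::length($text), "\n"; # 20 bytes -- "\x{F6}" is 2 bytes in UTF-8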
> Please let me know if you fixed these problems. Especially the
> splitting of query terms at the wrong position is a big one. But I'll
> gladly play beta tester for utf-8 text in KinoSearch if it makes
> KinoSearch UTF-8 compatible :-)
I appreciate the offer. We're going to need a few more tests. It's
clear that the current test suite is not adequate, since it did not
reveal these problems.
> P.S. Did you ever think of wildcard searches like Lucene offers? I
> would very much like to search for "business*" and also find
> "businesspartner", or search for "b*ess" and find "business" but
> also "boldness". I know that wildcards in front of the search terms
> are not supported by CLucene, and I think it slows things down, but I
> really wonder if wildcards in between characters or at the end of
> the search term could be implemented in KinoSearch with good search
> speed.
Wildcard search in Lucene basically turns bus* into a giant OR'd
query, iterating through all the terms in the index that begin with
"bus". It's highly inefficient, producing unacceptably poor
performance much sooner than any of the other Query types. It also
raises an exception when more than 1024 terms are matched -- you can
set that number higher, but then you may well run out of memory.
People write to the Lucene user list all the time, griping about
either the poor performance or the exceptions.
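Conceptually, the expansion looks something like this (a hypothetical
sketch, not Lucene's actual code):

#!/usr/bin/perl
use strict;
use warnings;
# Expand "bus*" against a sorted term list into one giant OR'd query.
my @lexicon = sort qw( boldness bus bush business businesspartner busy );
my $prefix  = 'bus';
my @matches = grep { index( $_, $prefix ) == 0 } @lexicon;
die "Too many clauses\n" if @matches > 1024;  # Lucene's default cap
print join( ' OR ', @matches ), "\n";
# bus OR bush OR business OR businesspartner OR busy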
I'm not a big fan of that implementation. But then, the
alternatives, such as indexing all bigrams and trigrams etc, aren't
really that much better. Wildcard search is just inherently more
resource intensive than keyword search, because wildcards typically
match SO many more documents. Looking at it from a user's
perspective, though -- e.g. browsing the Lucene docs -- you wouldn't
know that.
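For comparison, the n-gram route trades index size for query speed; a
rough sketch of trigram extraction (purely illustrative):

#!/usr/bin/perl
use strict;
use warnings;
# Index every character trigram of each word; a pattern like "b*ess"
# can then be approximated by intersecting trigram postings.
sub trigrams {
    my ($word) = @_;
    return map { substr( $word, $_, 3 ) } 0 .. length($word) - 3;
}
print join( ' ', trigrams('business') ), "\n"; # bus usi sin ine nes ess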
There are many, many opportunities for expanding KinoSearch's
capabilities which make better use of the inverted index data
structure design. Personally I don't imagine I'll ever add Wildcard
queries to KinoSearch, and I'll be casting a vote for Lucy to avoid
them as well -- at least in core. They probably belong in a
"contrib" or "experimental" section somewhere with a "WARNING: VERY
SLOW" label at the top of the docs.
Do you know if anyone has tried a dictionary-based tokenizer for
German? I understand that with all the compound words, German needs
substring search more than other Indo-European languages. Stealing a
page from the CJK playbook and splitting on words would cost a lot at
index time, but be much faster than wildcards at search-time and
maybe address the same need. Would that help, at least in theory?
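In case it helps the discussion, here's a toy greedy splitter, assuming
a dictionary of known words (everything about it is hypothetical):

#!/usr/bin/perl
use strict;
use warnings;
my %dict = map { $_ => 1 } qw( business partner );

# Recursively split a compound into dictionary words, longest head first.
sub split_compound {
    my ($word) = @_;
    return ($word) if $dict{$word};
    for my $len ( reverse 1 .. length($word) - 1 ) {
        my $head = substr( $word, 0, $len );
        next unless $dict{$head};
        my @rest = split_compound( substr( $word, $len ) );
        return ( $head, @rest ) if @rest;
    }
    return;   # no split found; caller indexes the word whole
}

print join( ' + ', split_compound('businesspartner') ), "\n";
# business + partner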
Best,
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

slothbear:~/projects/ks marvin$ svn diff -r 1026
Index: t/601-queryparser.t
===================================================================
--- t/601-queryparser.t (revision 1026)
+++ t/601-queryparser.t (working copy)
@@ -4,17 +4,20 @@
use lib 't';
use KinoSearch qw( kdump );
-use Test::More tests => 205;
+use Test::More tests => 207;
use File::Spec::Functions qw( catfile );
BEGIN { use_ok('KinoSearch::QueryParser::QueryParser') }
+use KinoSearchTestInvIndex qw( create_invindex );
+
use KinoSearch::InvIndexer;
use KinoSearch::Searcher;
use KinoSearch::Store::RAMInvIndex;
use KinoSearch::Analysis::Tokenizer;
use KinoSearch::Analysis::Stopalizer;
use KinoSearch::Analysis::PolyAnalyzer;
+use KinoSearch::Util::StringHelper qw( utf8_flag_on );
my $whitespace_tokenizer
= KinoSearch::Analysis::Tokenizer->new( token_re => qr/\S+/ );
@@ -175,3 +178,16 @@
#exit;
}
+my $motorhead = "Mot\xC3\xB6rhead";
+utf8_flag_on($motorhead);
+$invindex = create_invindex($motorhead);
+my $tokenizer = KinoSearch::Analysis::Tokenizer->new;
+$searcher = KinoSearch::Searcher->new(
+ analyzer => $tokenizer,
+ invindex => $invindex,
+);
+
+my $hits = $searcher->search('Mot');
+is( $hits->total_hits, 0, "Pre-test - indexing worked properly" );
+$hits = $searcher->search($motorhead);
+is( $hits->total_hits, 1, "QueryParser parses UTF-8 strings correctly" );
Index: lib/KinoSearch/Analysis/Tokenizer.pm
===================================================================
--- lib/KinoSearch/Analysis/Tokenizer.pm (revision 1026)
+++ lib/KinoSearch/Analysis/Tokenizer.pm (working copy)
@@ -3,7 +3,6 @@
use warnings;
use KinoSearch::Util::ToolSet;
use base qw( KinoSearch::Analysis::Analyzer );
-use locale;
BEGIN {
__PACKAGE__->init_instance_vars(
@@ -50,20 +49,24 @@
# alias input to $_
while ( $batch->next ) {
local $_ = $batch->get_text;
+ my $copy = $_;
- # ensure that pos is set to 0 for this scalar
- pos = 0;
-
# accumulate token start_offsets and end_offsets
my ( @starts, @ends );
- 1 while ( m/$separator_re/g and push @starts,
- pos and m/$token_re/g and push @ends, pos );
+ my $orig_length = bytes::length($_);
+ while (1) {
+ s/$separator_re//;
+ push @starts, $orig_length - bytes::length($_);
+ last unless s/$token_re//;
+ push @ends, $orig_length - bytes::length($_);
+ }
+
# correct for overshoot
$#starts = $#ends;
# add the new tokens to the batch
- $new_batch->add_many_tokens( $_, \@starts, \@ends );
+ $new_batch->add_many_tokens( $copy, \@starts, \@ends );
}
return $new_batch;
Index: lib/KinoSearch/Analysis/TokenBatch.pm
===================================================================
--- lib/KinoSearch/Analysis/TokenBatch.pm (revision 1026)
+++ lib/KinoSearch/Analysis/TokenBatch.pm (working copy)
@@ -69,7 +69,6 @@
char *string_start = SvPV(string_sv, len);
I32 i;
const I32 max = av_len(starts_av);
- STRLEN unicount = 0;
for (i = 0; i <= max; i++) {
STRLEN start_offset, end_offset;
@@ -93,24 +92,9 @@
Kino_confess("end_offset > len (%d > %"UVuf")",
end_offset, (UV)len);
- /* advance the pointer past as many unicode characters as required */
- while (1) {
- if (unicount == start_offset)
- break;
-
- /* header byte */
- string_start++;
-
- /* continutation bytes */
- while ((*string_start & 0xC0) == 0xC0)
- string_start++;
-
- unicount++;
- }
-
/* calculate the start of the substring and add the token */
token = Kino_Token_new(
- string_start,
+ string_start + start_offset,
(end_offset - start_offset),
start_offset,
end_offset,
Index: lib/KinoSearch/Index/Term.pm
===================================================================
--- lib/KinoSearch/Index/Term.pm (revision 1026)
+++ lib/KinoSearch/Index/Term.pm (working copy)
@@ -12,6 +12,8 @@
__PACKAGE__->ready_get_set(qw( field text ));
}
+use KinoSearch::Util::StringHelper qw( utf8_flag_on utf8_flag_off );
+
sub new {
croak("usage: KinoSearch::Index::Term->new( field, text )")
unless @_ == 3;
@@ -26,6 +28,7 @@
sub new_from_string {
my ( $class, $termstring, $finfos ) = @_;
my $field_num = unpack( 'n', bytes::substr( $termstring, 0, 2, '' ) );
+ utf8_flag_on($termstring);
my $field_name = $finfos->field_name($field_num);
return __PACKAGE__->new( $field_name, $termstring );
}
@@ -37,7 +40,9 @@
my ( $self, $finfos ) = @_;
my $field_num = $finfos->get_field_num( $self->{field} );
return unless defined $field_num;
- return pack( 'n', $field_num ) . $self->{text};
+ my $termtext = $self->{text};
+ utf8_flag_off($termtext);
+ return pack( 'n', $field_num ) . $termtext;
}
sub to_string {
Index: lib/KinoSearch/Util/StringHelper.pm
===================================================================
--- lib/KinoSearch/Util/StringHelper.pm (revision 1026)
+++ lib/KinoSearch/Util/StringHelper.pm (working copy)
@@ -3,7 +3,7 @@
use warnings;
use base qw( Exporter );
-our @EXPORT_OK = qw( utf8_flag_on );
+our @EXPORT_OK = qw( utf8_flag_on utf8_flag_off );
1;
@@ -19,6 +19,12 @@
PPCODE:
SvUTF8_on(sv);
+void
+utf8_flag_off(sv)
+ SV *sv;
+PPCODE:
+ SvUTF8_off(sv);
+
__H__
#ifndef H_KINO_STRING_HELPER