Mailing List Archive

utf8 (unicode) any progress on TokenBatch?
Hi Marvin,

I desperately need UTF-8 support in KinoSearch. I already found your
discussion with someone else where you wrote that one would have to
write a special TokenBatch class, but I didn't exactly understand what
you meant by absorbing the utf8 flag from the last scalar.

Did you have time to do some work in this yet? Or would it be sufficient
to modify the analyzers and set the utf8 flag after the "gettext" method
from the TokenBatch?

It's especially a problem because words like "fröhlich" are split into
two search terms, "fr" and "lich", which produces false matches. As we
don't deal with one language only, I must use UTF-8 strings.

Thanks for any help.

PS: I'm doing speed tests and have indexed parsed office files. Currently
I have indexed about 100'000 files and it's lightning fast. I was trying
Plucene before and it could not even handle 10'000 files (the index grew
to hundreds of megs and searches were ultra slow).

With KinoSearch it takes about 0.005 secs to search through the 100'000
docs. Amazing work! If this UTF-8 problem is solved, KinoSearch is just
perfect.
utf8 (unicode) any progress on TokenBatch? [ In reply to ]
On Aug 14, 2006, at 3:04 AM, Marc Elser wrote:

> I desperately need UTF-8 support in KinoSearch. I already found your
> discussion with someone else where you wrote that one would have to
> write a special TokenBatch class, but I didn't exactly understand what
> you meant by absorbing the utf8 flag from the last scalar.

Each Perl scalar carries a flag indicating whether it is UTF-8 or
not. When a scalar is added to a TokenBatch, that flag is lost.
When the scalar is recreated for Perl again, it can't be known
whether the UTF-8 flag should be set.

One way of addressing this is to add a per-Token flag which stores
the UTF-8 flag. Another way is to make the flag per-TokenBatch
rather than per-Token, and assume that the flag will be consistent.
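
Here's a quick demonstration of that flag from the Perl side (plain
Perl, not KinoSearch code):

use Encode qw( decode );

my $bytes = "Mot\xC3\xB6rhead";          # raw UTF-8 bytes, flag off
my $chars = decode( 'UTF-8', $bytes );   # decoded to characters, flag on

print utf8::is_utf8($bytes) ? "flag on\n" : "flag off\n";   # flag off
print utf8::is_utf8($chars) ? "flag on\n" : "flag off\n";   # flag on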

> Did you have time to do some work in this yet? Or would it be
> sufficient to modify the analyzers and set the utf8 flag after the
> "gettext" method from the TokenBatch?

Marking the scalars with a UTF-8 flag as they emerge from the
TokenBatch is necessary but not sufficient. Tokenizer and
LCNormalizer would function properly then; however, Stemmer and
Stopalizer would not. There is also the issue of what to do when
information is retrieved from the index, since the index doesn't keep
track of how the text within it is encoded.

Stemmer is a wrapper around Lingua::Stem::Snowball, and Stopalizer is
a wrapper around Lingua::StopWords. Both of those modules present a
UTF-8 and a non-UTF-8 interface. Currently, the non-UTF-8 versions
are being loaded. One option is to load both versions -- so that
each Stemmer object keeps two LSS objects around and each Stopalizer
keeps two stoplist hashes -- and then decide which one to use based
on the TokenBatch's UTF-8 status.
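
A rough sketch of that option, using the parameter names from the two
modules' UTF-8 interfaces (how the switch gets wired into the Analyzer
classes is still an open question):

use Lingua::Stem::Snowball;
use Lingua::StopWords qw( getStopWords );

# non-UTF-8 versions -- what gets loaded now
my $stemmer_bytes  = Lingua::Stem::Snowball->new( lang => 'de' );
my $stoplist_bytes = getStopWords('de');

# UTF-8 versions -- used when the TokenBatch carries the UTF-8 flag
my $stemmer_utf8 = Lingua::Stem::Snowball->new(
    lang     => 'de',
    encoding => 'UTF-8',
);
my $stoplist_utf8 = getStopWords( 'de', 'UTF-8' );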

Another option is to add an 'encoding' parameter to the constructors
for InvIndexer, Searcher, Analyzer (subclasses of Analyzer would
inherit), and whatever else might need it (Highlighter?
QueryParser?). That's ugly and error-prone, though.

The guts of KinoSearch -- the file format, the search algorithms, etc
-- are encoding-agnostic. I used to think that the ability to
support multiple encodings was an important feature, but now I'm
seeing the drawbacks more clearly. Those of you at my OSCON talk
heard me remark that Doug Cutting had made a powerful argument for
supporting UTF-8 only. The exchange took place on
java-dev@lucene.apache.org, over the last 4 messages in the thread
"Hacking Luke for bytecount-based strings".

http://www.mail-archive.com/java-dev@lucene.apache.org/msg04373.html

This is the key message from Doug:

http://www.mail-archive.com/java-dev@lucene.apache.org/msg04391.html

My inclination now is to go with the radical solution: force non-
UTF-8 source material into UTF-8 when it gets imported into a
TokenBatch (we'd guess the source encoding based on locale), make all
of KinoSearch's internals expect Unicode, and always output Unicode.
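
A sketch of the "guess from locale" step -- this assumes we only apply
the guess to scalars whose UTF-8 flag is off, and uses I18N::Langinfo
to ask for the locale's character set:

use Encode qw( decode );
use I18N::Langinfo qw( langinfo CODESET );
use POSIX qw( setlocale LC_CTYPE );

setlocale( LC_CTYPE, '' );          # adopt the environment's locale
my $codeset = langinfo(CODESET);    # e.g. 'ISO-8859-1', 'UTF-8', ...

sub ensure_utf8 {
    my $text = shift;
    return $text if utf8::is_utf8($text);   # already decoded -- leave it alone
    return decode( $codeset, $text );
}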

Downsides:

* This would be a backwards-incompatible, disruptive change.
* More people would start seeing the Unicode replacement
character (U+FFFD) in their retrieved texts and highlighted
excerpts. Encoded as UTF-8 and viewed as Latin-1, it renders as an
i with a diaeresis, an inverted question mark, and a one-half sign.
* The people most confused would be people least prepared to adapt.
* Unicode ain't easy, and Perl Unicode especially ain't easy.
* Unintended consequences.

Upsides:

* Simplest API.
* Simplest internals.
* Because of the simple API and internals, fewer bugs, both in
KinoSearch itself and in apps that use it.
* Parallel to Lucene.
* Might be the only way to complete UTF-8 support without
appalling kludges.

Thoughts?

Reflection: Doug's diplomatic dis of Perl's i18n implementation rings
true in my ears as I contemplate how to disentangle this mess...

[aside to Dave Balmain: is there any hope for doing something similar
in Ruby?]

> It's especially a problem because words like "fröhlich" are split
> into two search terms, "fr" and "lich", which produces false matches.
> As we don't deal with one language only, I must use UTF-8 strings.

Good example. We're going to need some tests.

> PS: I'm doing speed tests and have indexed parsed office files.
> Currently I have indexed about 100'000 files and it's lightning fast.
> I was trying Plucene before and it could not even handle 10'000 files
> (the index grew to hundreds of megs and searches were ultra slow).

Plucene was a valiant and important effort. See
<http://wiki.apache.org/jakarta-lucene/MinimizingObjectOverhead> for a
shout-out to Plucene's developers.

> With KinoSearch it takes about 0.005 secs to search through the
> 100'000 docs. Amazing work! If this UTF-8 problem is solved,
> KinoSearch is just perfect.

:)

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
utf8 (unicode) any progress on TokenBatch? [ In reply to ]
On Aug 14, 2006, at 12:10, Marvin Humphrey wrote:

> My inclination now is to go with the radical solution: force non-
> UTF-8 source material into UTF-8 when it gets imported into a
> TokenBatch (we'd guess the source encoding based on locale), make
> all of KinoSearch's internals expect Unicode, and always output
> Unicode.

Make it possible for the encoding to be supplied to TokenBatch so
that you don't always have to guess. And guessing will be wrong, at
least sometimes.

We've taken a similar approach for Bricolage: Everything is required
to be UTF8, and if it's not, we have to know what it is so that we
can convert it to UTF8. This makes things a hell of a lot simpler in
the long run, because the rules are so straightforward. I'll admit,
though, that it took a bit of doing to find all those places that
weren't setting the utf8 flag...

I have to admit, Unicode is the one thing that Java got right and
better than any other language. I hope Perl 6 does the same. :-)

Best,

David
utf8 (unicode) any progress on TokenBatch? [ In reply to ]
On 8/14/06, Marvin Humphrey <marvin@rectangular.com> wrote:
> My inclination now is to go with the radical solution: force non-
> UTF-8 source material into UTF-8
> Thoughts?

Yay?

I recently put together a Web aggregator scraping and parsing various
documents of various encodings into a single summary page. I basically
decided everything would be converted into utf8 and output as utf8.
Along the way I discovered a number of utf8 issues in modules ranging
from LWP::Simple to XML::Atom.

The main problem is lack of awareness of encoding issues. But this is
the kind of thing people really, really should learn. Particularly in
dealing with Web-distributed documents. (See
http://www.joelonsoftware.com/printerFriendly/articles/Unicode.html
for a good rant on the topic)

It seems if you force utf8, you have solved a big problem for people
who need it, and at worst forced some important education on people
who do not think about character encoding.
utf8 (unicode) any progress on TokenBatch? [ In reply to ]
On Aug 14, 2006, at 12:24 PM, David Wheeler wrote:

> On Aug 14, 2006, at 12:10, Marvin Humphrey wrote:
>
>> My inclination now is to go with the radical solution: force non-
>> UTF-8 source material into UTF-8 when it gets imported into a
>> TokenBatch (we'd guess the source encoding based on locale), make
>> all of KinoSearch's internals expect Unicode, and always output
>> Unicode.
>
> Make it possible for the encoding to be supplied to TokenBatch so
> that you don't always have to guess. And guessing will be wrong, at
> least sometimes.

Yes, absolutely, guessing based on locale will be wrong sometimes. I
was thinking that savvy users could go to the Encode module and do
the conversion themselves. Stuff that was already flagged as UTF-8
wouldn't be touched.

Encode is a bit of a beast, though... Maybe provide
KinoSearch::Analysis::UTF8alizer? I dunno about that. Complicating
matters is that Field properties are set once per InvIndexer session,
so submitting documents whose fields may be in different encodings
from Doc to Doc is problematic, unless set_encoding($field_name,
$encoding) is added to Doc. Yuck.

> We've taken a similar approach for Bricolage: Everything is
> required to be UTF8, and if it's not, we have to know what it is so
> that we can convert it to UTF8.

KinoSearch and Bricolage both target high-level users, and there
doesn't seem to be any question about what the high-level users would
prefer. (Somebody's been making noise about sponsoring this bit of
development, actually, though I'm not going to wait for them to come
through before addressing it). I agree that your "convert-up-front"
approach is the right one.

However, KinoSearch is also meant to Just Work for slapping up a
basic search box on your website. That serves the relative novices,
but it also serves expert people who appreciate KinoSearch's initial
approachability. For the sake of both, it would be nice to keep
encoding as something you aren't forced to address right away.

Does Bricolage throw a fatal error if it encounters a non-UTF-8
scalar and you haven't told it an encoding?

Maybe it would suffice to create KinoSearch::Docs::FAQ and make "What
are those strange characters?" an item, with an Encode-based recipe
for solving the problem.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
utf8 (unicode) any progress on TokenBatch? [ In reply to ]
On Aug 14, 2006, at 14:39, Marvin Humphrey wrote:

> Encode is a bit of a beast, though... Maybe provide
> KinoSearch::Analysis::UTF8alizer?

Encode is not difficult to use if you know what encoding you're
converting from:

use Encode 'decode';
my $utf8 = decode($encoding, $string);

That's it. As long as you know the encoding, just use decode() to
decode your text to UTF8 and turn on the utf8 flag.

> Does Bricolage throw a fatal error if it encounters a non-UTF-8
> scalar and you haven't told it an encoding?

Yes. If you don't tell it an encoding, it assumes UTF-8, so you'll
get errors if you have non-UTF-8, non-ASCII text.

> Maybe it would suffice to create KinoSearch::Docs::FAQ and make
> "What are those strange characters?" an item, with an Encode-based
> recipe for solving the problem.

It'd be a pretty simple recipe, I think.
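
Something along these lines, presumably -- the field names and the
add_doc() call below are just placeholders for whatever the indexing
code already does:

use Encode qw( decode );

# Decode everything up front, using whatever encoding the source data
# is actually in; after this, KinoSearch only ever sees decoded text.
my $title   = decode( 'ISO-8859-1', $raw_title );
my $content = decode( 'ISO-8859-1', $raw_content );

$invindexer->add_doc( { title => $title, content => $content } );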

Best,

David
utf8 (unicode) any progress on TokenBatch? [ In reply to ]
Marc,

I committed a patch tonight which may address your needs, provided
you precondition all input by converting it to UTF-8 before
KinoSearch sees it. A couple tests failed when I made the change --
which made me happy, as they were designed to deal with UTF-8 and
they should have failed. We could definitely stand some more, but in
the meantime, you might try checking out the current revision from
subversion and see if it behaves as you need it to.

Cheers,

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


slothbear:~/projects/ks marvin$ svn diff
Index: lib/KinoSearch/Analysis/Stemmer.pm
===================================================================
--- lib/KinoSearch/Analysis/Stemmer.pm (revision 998)
+++ lib/KinoSearch/Analysis/Stemmer.pm (working copy)
@@ -27,7 +27,10 @@
         unless $supported_languages{$language};
     # create instance of Snowball stemmer
-    $self->{stemmifier} = Lingua::Stem::Snowball->new( lang => $language );
+    $self->{stemmifier} = Lingua::Stem::Snowball->new(
+        lang     => $language,
+        encoding => 'UTF-8',
+    );
 }
 sub analyze {
Index: lib/KinoSearch/Analysis/Stopalizer.pm
===================================================================
--- lib/KinoSearch/Analysis/Stopalizer.pm (revision 1002)
+++ lib/KinoSearch/Analysis/Stopalizer.pm (working copy)
@@ -25,7 +25,8 @@
     else {
         # create a stoplist if language was supplied
         if ( $language =~ /^(?:da|de|en|es|fr|it|nl|no|pt|ru|sv)$/ ) {
-            $self->{stoplist} = Lingua::StopWords::getStopWords($language);
+            $self->{stoplist}
+                = Lingua::StopWords::getStopWords($language, 'UTF-8');
         }
         # No Finnish stoplist, though we have a stemmmer.
         elsif ( $language eq 'fi' ) {
Index: lib/KinoSearch/Analysis/TokenBatch.pm
===================================================================
--- lib/KinoSearch/Analysis/TokenBatch.pm (revision 1005)
+++ lib/KinoSearch/Analysis/TokenBatch.pm (working copy)
@@ -38,6 +38,7 @@
     I32 pos_inc = 1;
     Token *token;
 PPCODE:
+    sv_utf8_upgrade(text_sv);
     text = SvPV(text_sv, len);
     if (items == 5)
         pos_inc = SvIV( ST(4) );
@@ -50,7 +51,9 @@
 =for comment
 Add many tokens to the batch, by supplying the string to be tokenized, and
-arrays of token starts and token ends (specified in bytes).
+arrays of token starts and token ends.  The starts and ends must be specified
+in *unicode code points*, which is bizarre, but works well with the current
+implementation of Tokenizer.
 =cut
@@ -66,6 +69,7 @@
     char *string_start = SvPV(string_sv, len);
     I32 i;
     const I32 max = av_len(starts_av);
+    STRLEN unicount = 0;
     for (i = 0; i <= max; i++) {
         STRLEN start_offset, end_offset;
@@ -89,9 +93,24 @@
             Kino_confess("end_offset > len (%d > %"UVuf")",
                 end_offset, (UV)len);
+        /* advance the pointer past as many unicode characters as required */
+        while (1) {
+            if (unicount == start_offset)
+                break;
+
+            /* header byte */
+            string_start++;
+
+            /* continutation bytes */
+            while ((*string_start & 0xC0) == 0xC0)
+                string_start++;
+
+            unicount++;
+        }
+
         /* calculate the start of the substring and add the token */
         token = Kino_Token_new(
-            (string_start + start_offset),
+            string_start,
             (end_offset - start_offset),
             start_offset,
             end_offset,
@@ -209,6 +228,7 @@
     }
     /* fall through */
     case 2: RETVAL = newSVpvn(batch->current->text, batch->current->len);
+            SvUTF8_on(RETVAL);
             break;
     case 3: batch->current->start_offset = SvIV( ST(1) );
On Aug 15, 2006, at 1:11 AM, Marc Elser wrote:

> First of all thanks a lot for the fast patch. I Just installed it
> from svn and stumbled across the following problems:

Thanks for the detailed reports.

> 1.) UTF-8 query strings are still split by QueryParser at UTF-8
> special characters, for example at a-umlaut (or in German, ä). This
> still leads to the described problem that a word like "anlässlich"
> is split into "anl" and "sslich", which produces false matches: for
> example, a document which contains "Anleitung" and "verlässlich",
> which is something completely different, would match.

This is an important result. Even though the mis-tokenization
happens to be due to a bug (see below), it illustrates why moving to
UTF-8 is not backwards compatible.

If you have an index based on, say, Latin 1, and it uses characters
above 127, they will have been indexed verbatim -- but now, as you're
searching, the Query string will get passed through a UTF-8
converter, changing it into a different sequence of bytes -- and
either producing no results, or incorrect results.
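
To make the byte-level mismatch concrete (a demonstration only, not
KinoSearch code):

use Encode qw( encode );

my $word = "verl\x{E4}sslich";    # 'verlässlich' as characters
printf "Latin-1 bytes: %vX\n", encode( 'ISO-8859-1', $word );
printf "UTF-8 bytes:   %vX\n", encode( 'UTF-8', $word );

# Latin-1 stores the umlaut as the single byte E4, while UTF-8 stores
# it as C3.A4 -- so terms indexed under one encoding can't match
# queries encoded under the other.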

Indexes which were produced under the current version WILL NOT WORK
PROPERLY after we make the transition. I intend to make KinoSearch
refuse to read them, so that you'll know you need to revert if you
can't regenerate right away.

It would be difficult, maybe impossible to make a translator. I
think I'm going to have to invoke the KinoSearch alpha clause.

We may as well make some planned changes to the file format at the
same time (this is for the rich positions stuff mentioned a few
months ago), consolidating all the disruptive stuff into one release.

> I don't know exactly where the problem is because your regex \b\w+
> (?:'\w+)?\b should still work if the string it is used against has
> the utf-8 flag on.

It's because the locale pragma was in effect (removing it fixed the
bug):

slothbear:~/perltest marvin$ cat locale_regex.plx
#!/usr/bin/perl
use strict;
use warnings;

use Encode qw( _utf8_on );

my $m = "Mot\xC3\xB6rhead";
_utf8_on($m);

$m =~ /(\w*)/;
print "$1\n";

use locale;
$m =~ /(\w*)/;
print "$1\n";

slothbear:~/perltest marvin$ perl locale_regex.plx
Mot?rhead
Mot
slothbear:~/perltest marvin$

> Is it possible that the TokenBatch does not set the utf-8 flag
> correctly in gettext or does it somehow corrupt the string it returns?

I don't believe so. There are a few ways of testing a scalar to see
if the flag is on. For future reference, if you want to verify that
for yourself, my favorite is Devel::Peek::Dump(). Look for "UTF8" in
the "FLAGS" field, and the UTF8 value of the string.

slothbear:~/perltest marvin$ cat peek_dump.plx
#!/usr/bin/perl
use strict;
use warnings;

use Encode qw( _utf8_on );
use Devel::Peek;

my $m = "Mot\xC3\xB6rhead";
_utf8_on($m);
Dump($m);

slothbear:~/perltest marvin$ perl peek_dump.plx
SV = PV(0x1801660) at 0x180b59c
REFCNT = 1
FLAGS = (PADBUSY,PADMY,POK,pPOK,UTF8)
PV = 0x300bd0 "Mot\303\266rhead"\0 [UTF8 "Mot\x{f6}rhead"]
CUR = 10
LEN = 11
slothbear:~/perltest marvin$


> Because I was also playing around with the "utf8::upgrade" function
> to upgrade the text returned from TokenBatch to utf8 before feeding
> it through the regex (before trying your patched version), but it
> somehow did an additional utf8 encoding of the string, causing the
> special character '?' to be encoded twice, resulting in 2 strange
> characters instead of '?'. Maybe the same happens now with the
> modified TokenBatch class.

If you can reproduce the problem, can you please provide me with a
Devel::Peek Dump of before and after?

>
> 2.) The Highlighter does not like the UTF-8 sequences. It looks
> like it doesn't count UTF-8 special characters when computing the
> insertion points for the <strong>...</strong> tags. It looks like
> they are shifted left by the number of UTF-8 special characters.
> Here's an example where the search term was "klammerte":
>
> === start example ===
> verfilzten Pelz beiden Händen nach Ansatzpunkten durchwühlend,
> schwang sich Ford Rücken mächtigen Tie<strong>res klamm</strong>erte
> sich, endlich sicher oben saß, beiden Händen braune Gestrüpp ...
> === end example ===


I believe that this was actually due to a problem in Tokenizer/
TokenBatch. Last night's patch measured Token starts in bytes from
the beginning of the field, but Token ends in Unicode code points
from the beginning of the field. Highlighter uses this information
(which has been stored in the index) to place the tags. It expects
bytecounts -- the code-point-count was bogus.
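
(Not KinoSearch code, just the underlying Perl behavior -- the two
counts diverge as soon as a multi-byte character shows up:)

use Encode qw( decode );

my $text = decode( 'UTF-8', "H\xC3\xA4nden" );   # 'Händen'
print length($text), "\n";        # 6 -- code points
{
    use bytes;
    print length($text), "\n";    # 7 -- bytes; the umlaut takes two
}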

I've replaced the faulty algorithm with something slightly slower but
less tricky.

> Please, let me know if you fixed these problems. Especially the
> splitting of query terms at the wrong position is a big one. But
> I'll gladly play beta tester for UTF-8 text in KinoSearch if it
> makes KinoSearch UTF-8 compatible :-)

I appreciate the offer. We're going to need a few more tests. It's
clear that the current test suite is not adequate, since it did not
reveal these problems.

> P.S. Did you ever think of wildcard searches like Lucene offers? I
> would very much like to search for "business*" and also find
> "businesspartner", or search for "b*ess" and find "business" but
> also "boldness". I know that wildcards in front of the search term
> are not supported by CLucene, and I think it slows things down, but
> I really wonder if wildcards in between characters or at the end of
> the search term could be implemented in KinoSearch with good search
> speed.

Wildcard search in Lucene basically turns bus* into a giant OR'd
query, iterating through all the terms in the index that begin with
"bus". It's highly inefficient, producing unacceptably poor
performance much sooner than any of the other Query types. It also
raises an exception when more than 1024 terms are matched -- you can
set that number higher, but then you may well run out of memory.
People write all the time to the Lucene user list griping about
either the poor performance or the exceptions.
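
Roughly speaking, the expansion amounts to something like this -- a
sketch only, with @terms standing in for an iterator over the index's
term dictionary:

# Hypothetical expansion of 'bus*' against a list of indexed terms.
my $prefix  = 'bus';
my @matches = grep { /^\Q$prefix\E/ } @terms;   # bus, bush, business, ...
die "too many clauses" if @matches > 1024;      # Lucene's default limit
my $query   = join ' OR ', @matches;            # one giant OR'd query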

I'm not a big fan of that implementation. But then, the
alternatives, such as indexing all bigrams and trigrams etc, aren't
really that much better. Wildcard search is just inherently more
resource intensive than keyword search, because wildcards typically
match SO many more documents. Looking at it from a user's
perspective, though -- e.g. browsing the Lucene docs -- you wouldn't
know that.

There are many, many opportunities for expanding KinoSearch's
capabilities which make better use of the inverted index data
structure design. Personally I don't imagine I'll ever add Wildcard
queries to KinoSearch, and I'll be casting a vote for Lucy to avoid
them as well -- at least in core. They probably belong in a
"contrib" or "experimental" section somewhere with a "WARNING: VERY
SLOW" label at the top of the docs.

Do you know if anyone has tried a dictionary-based tokenizer for
German? I understand that with all the compound words, German needs
substring search more than other Indo-European languages. Stealing a
page from the CJK playbook and splitting on words would cost a lot at
index time, but be much faster than wildcards at search-time and
maybe address the same need. Would that help, at least in theory?
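
For what it's worth, a naive greedy version of that idea might look
something like the sketch below -- %dict is an assumed hash of known
German vocabulary, and a real splitter would also need to handle
linking elements ('s', 'en') and do some scoring to avoid bogus splits:

# Greedy longest-prefix splitting against a word list.
sub split_compound {
    my ( $word, $dict ) = @_;
    my @parts;
    my $rest = lc $word;
    while ( length $rest ) {
        my $matched = 0;
        for my $len ( reverse 1 .. length $rest ) {
            if ( $dict->{ substr( $rest, 0, $len ) } ) {
                push @parts, substr( $rest, 0, $len );
                $rest    = substr( $rest, $len );
                $matched = 1;
                last;
            }
        }
        return ($word) unless $matched;   # unknown segment -- don't split
    }
    return @parts;
}

# split_compound( 'businesspartner', { business => 1, partner => 1 } )
#     returns ( 'business', 'partner' )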

Best,

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

slothbear:~/projects/ks marvin$ svn diff -r 1026
Index: t/601-queryparser.t
===================================================================
--- t/601-queryparser.t (revision 1026)
+++ t/601-queryparser.t (working copy)
@@ -4,17 +4,20 @@
 use lib 't';
 use KinoSearch qw( kdump );
-use Test::More tests => 205;
+use Test::More tests => 207;
 use File::Spec::Functions qw( catfile );
 BEGIN { use_ok('KinoSearch::QueryParser::QueryParser') }
+use KinoSearchTestInvIndex qw( create_invindex );
+
 use KinoSearch::InvIndexer;
 use KinoSearch::Searcher;
 use KinoSearch::Store::RAMInvIndex;
 use KinoSearch::Analysis::Tokenizer;
 use KinoSearch::Analysis::Stopalizer;
 use KinoSearch::Analysis::PolyAnalyzer;
+use KinoSearch::Util::StringHelper qw( utf8_flag_on );
 my $whitespace_tokenizer
     = KinoSearch::Analysis::Tokenizer->new( token_re => qr/\S+/ );
@@ -175,3 +178,16 @@
     #exit;
 }
+my $motorhead = "Mot\xC3\xB6rhead";
+utf8_flag_on($motorhead);
+$invindex = create_invindex($motorhead);
+my $tokenizer = KinoSearch::Analysis::Tokenizer->new;
+$searcher = KinoSearch::Searcher->new(
+    analyzer => $tokenizer,
+    invindex => $invindex,
+);
+
+my $hits = $searcher->search('Mot');
+is( $hits->total_hits, 0, "Pre-test - indexing worked properly" );
+$hits = $searcher->search($motorhead);
+is( $hits->total_hits, 1, "QueryParser parses UTF-8 strings correctly" );
Index: lib/KinoSearch/Analysis/Tokenizer.pm
===================================================================
--- lib/KinoSearch/Analysis/Tokenizer.pm (revision 1026)
+++ lib/KinoSearch/Analysis/Tokenizer.pm (working copy)
@@ -3,7 +3,6 @@
 use warnings;
 use KinoSearch::Util::ToolSet;
 use base qw( KinoSearch::Analysis::Analyzer );
-use locale;
 BEGIN {
     __PACKAGE__->init_instance_vars(
@@ -50,20 +49,24 @@
     # alias input to $_
     while ( $batch->next ) {
         local $_ = $batch->get_text;
+        my $copy = $_;
-        # ensure that pos is set to 0 for this scalar
-        pos = 0;
-
         # accumulate token start_offsets and end_offsets
         my ( @starts, @ends );
-        1 while ( m/$separator_re/g and push @starts,
-            pos and m/$token_re/g and push @ends, pos );
+        my $orig_length = bytes::length($_);
+        while (1) {
+            s/$separator_re//;
+            push @starts, $orig_length - bytes::length($_);
+            last unless s/$token_re//;
+            push @ends, $orig_length - bytes::length($_);
+        }
+
         # correct for overshoot
         $#starts = $#ends;
         # add the new tokens to the batch
-        $new_batch->add_many_tokens( $_, \@starts, \@ends );
+        $new_batch->add_many_tokens( $copy, \@starts, \@ends );
     }
     return $new_batch;
Index: lib/KinoSearch/Analysis/TokenBatch.pm
===================================================================
--- lib/KinoSearch/Analysis/TokenBatch.pm (revision 1026)
+++ lib/KinoSearch/Analysis/TokenBatch.pm (working copy)
@@ -69,7 +69,6 @@
     char *string_start = SvPV(string_sv, len);
     I32 i;
     const I32 max = av_len(starts_av);
-    STRLEN unicount = 0;
     for (i = 0; i <= max; i++) {
         STRLEN start_offset, end_offset;
@@ -93,24 +92,9 @@
             Kino_confess("end_offset > len (%d > %"UVuf")",
                 end_offset, (UV)len);
-        /* advance the pointer past as many unicode characters as required */
-        while (1) {
-            if (unicount == start_offset)
-                break;
-
-            /* header byte */
-            string_start++;
-
-            /* continutation bytes */
-            while ((*string_start & 0xC0) == 0xC0)
-                string_start++;
-
-            unicount++;
-        }
-
         /* calculate the start of the substring and add the token */
         token = Kino_Token_new(
-            string_start,
+            string_start + start_offset,
             (end_offset - start_offset),
             start_offset,
             end_offset,
Index: lib/KinoSearch/Index/Term.pm
===================================================================
--- lib/KinoSearch/Index/Term.pm (revision 1026)
+++ lib/KinoSearch/Index/Term.pm (working copy)
@@ -12,6 +12,8 @@
     __PACKAGE__->ready_get_set(qw( field text ));
 }
+use KinoSearch::Util::StringHelper qw( utf8_flag_on utf8_flag_off );
+
 sub new {
     croak("usage: KinoSearch::Index::Term->new( field, text )")
         unless @_ == 3;
@@ -26,6 +28,7 @@
 sub new_from_string {
     my ( $class, $termstring, $finfos ) = @_;
     my $field_num = unpack( 'n', bytes::substr( $termstring, 0, 2, '' ) );
+    utf8_flag_on($termstring);
     my $field_name = $finfos->field_name($field_num);
     return __PACKAGE__->new( $field_name, $termstring );
 }
@@ -37,7 +40,9 @@
     my ( $self, $finfos ) = @_;
     my $field_num = $finfos->get_field_num( $self->{field} );
     return unless defined $field_num;
-    return pack( 'n', $field_num ) . $self->{text};
+    my $termtext = $self->{text};
+    utf8_flag_off($termtext);
+    return pack( 'n', $field_num ) . $termtext;
 }
 sub to_string {
Index: lib/KinoSearch/Util/StringHelper.pm
===================================================================
--- lib/KinoSearch/Util/StringHelper.pm (revision 1026)
+++ lib/KinoSearch/Util/StringHelper.pm (working copy)
@@ -3,7 +3,7 @@
 use warnings;
 use base qw( Exporter );
-our @EXPORT_OK = qw( utf8_flag_on );
+our @EXPORT_OK = qw( utf8_flag_on utf8_flag_off );
 1;
@@ -19,6 +19,12 @@
 PPCODE:
     SvUTF8_on(sv);
+void
+utf8_flag_off(sv)
+    SV *sv;
+PPCODE:
+    SvUTF8_off(sv);
+
 __H__
 #ifndef H_KINO_STRING_HELPER
utf8 (unicode) any progress on TokenBatch? [ In reply to ]
On Aug 15, 2006, at 1:11 AM, Marc Elser wrote:

> Please, let me know if you fixed these problems.

I've taken a shot at it. :) Please give the current repository
revision 1030 a try.

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/


utf8 (unicode) any progress on TokenBatch? [ In reply to ]
On 8/15/06, Ryan Tate <lists@ryantate.com> wrote:
> I recently put together a Web aggregator scraping and parsing various
> documents of various encodings into a single summary page. I basically
> decided everything would be converted into utf8 and output as utf8.
> Along the way I discovered a number of utf8 issues in modules ranging
> from LWP::Simple to XML::Atom.

Starting from XML::Atom 0.20, it has $ForceUnicode global flag (which
defaults to 0 for backward compat.) to make it explicitly work in
Unicode mode, rather than UTF-8 bytes.

HTH.


--
Tatsuhiko Miyagawa
utf8 (unicode) any progress on TokenBatch? [ In reply to ]
On 8/15/06, Tatsuhiko Miyagawa <miyagawa@gmail.com> wrote:
> Starting from XML::Atom 0.20, it has $ForceUnicode global flag (which
> defaults to 0 for backward compat.) to make it explicitly work in
> Unicode mode, rather than UTF-8 bytes.

Yes, thank you. I was the one who filed the CPAN bug report, you may
recall ;-> Thank you for fixing it so promptly.

Also, I apologize that I did not reply to your email asking me to test it. I
did test it, but I had previously never made a failing test case,
simply fixing it with a workaround. So I decided to apply your fix and
try it in the wild for a while, but "a while" passed and I never
remembered to email you :-)

Thanks again!
utf8 (unicode) any progress on TokenBatch? [ In reply to ]
Hi Marvin,

> Indexes which were produced under the current version WILL NOT WORK
> PROPERLY after we make the transition. I intend to make KinoSearch
> refuse to read them, so that you'll know you need to revert if you
> can't regenerate right away.
I think refusing to read those indexes is a good idea, because
especially if you have lots of indexes it might happen that you forget
to reconstruct one of them, which would lead to strange results.

>> Because I was also playing around with the "utf8::upgrade" function
>> to upgrade the text returned from TokenBatch to utf8 before feeding
>> it through the regex (before trying your patched version), but it
>> somehow did an additional utf8 encoding of the string, causing the
>> special character '?' to be encoded twice, resulting in 2 strange
>> characters instead of '?'. Maybe the same happens now with the
>> modified TokenBatch class.
>
> If you can reproduce the problem, can you please provide me with a
> Devel::Peek Dump of before and after?
I will have a look at that.
>
>>
> Do you know if anyone has tried a dictionary-based tokenizer for
> German? I understand that with all the compound words, German needs
> substring search more than other Indo-European languages. Stealing a
> page from the CJK playbook and splitting on words would cost a lot at
> index time, but be much faster than wildcards at search-time and maybe
> address the same need. Would that help, at least in theory?
Very interesting idea. I don't know if anyone has ever tried, but
combined with the words from the stemmer it would be really great,
because the number of compound words in German is nearly uncountable.
You can combine almost any words (of course the combined word has to
make sense, but in theory you can combine whatever you want). So
dictionary splitting or stemmer splitting (if possible) would greatly
increase the relevant hits.

But for compound words it would be enough if the search term could
match partially. I think wildcards would not be so important if I
could search for "partner", for example, and also find a word like
"Geschäftspartner"; likewise, searching for "Geschäft" would also
match, and even "part" should match. This would not be a "wildcard"
search, but it would help a lot, and I could not manage to do this
with the current version of KinoSearch. Would this be possible with
KinoSearch's index structure, or would it also lead to very slow
search results?

Best regards,

Marc
utf8 (unicode) any progress on TokenBatch? [ In reply to ]
Hi Marvin,

I just installed revision 1030 of KinoSearch and did more tests. Just
to be sure, I rebuilt the index with this revision, tried various query
terms, and looked at the results of QueryParser and the created excerpts.

But I'm just stunned. It works perfectly now. I tried every known form
of search term, with and without UTF-8, combinations with MUST and
SHOULD, plus and minus, exact matches with hyphens, and the "AND NOT"
clause. Nothing has failed so far. It also handled the punctuation
problems of the sample texts I use very well.

As all the texts I've tested with so far are in German, I will do some
more tests with Polish texts. We have some at hand, and I wonder if
they will perform just as well, although I will have to disable the
stemmer for the Polish documents.

I will keep you informed.

I also did some googling about German compound word splitting. There
is a lot of documentation, and the methods for splitting German
compound words are controversial because they can also produce false
matches. The biggest problem is that I didn't find any compound
splitter I could use; such splitters seem to exist, but I couldn't
find a downloadable one anywhere. So I think decompounding German
words would be a great thing, but it's nearly impossible for me, since
I'm not such a genius that I can write one of these described compound
splitters based on theoretical discussion papers.

So, I'm really keen to know whether partial word matching would be an
option for KinoSearch, as wildcards seem to be impossible. Thanks for
the explanation of how Lucene handles wildcards, by the way. I didn't
know that it causes problems if you have more than 1024 matching terms
in the index.