Mailing List Archive: Re: [rt.cpan.org #21359] Default tokenizer regex breaks unicode

On Sep 6, 2006, at 1:57 PM, via RT wrote:
> The default regex in KinoSearch::Analysis::Tokenizer breaks unicode.

Thank you for the report. Thank you especially for the test case,
which I will incorporate into KinoSearch's test suite.

The problem exposed by your test appears to be due to the loss of the
scalar's UTF8 flag as the text is absorbed into a
KinoSearch::Analysis::TokenBatch object, then recreated later. By
adding Encode::_utf8_on($_) at the right spot in Tokenizer::analyze,
we get the desired behavior in your test with the stock English
PolyAnalyzer. Unfortunately, the TokenBatch bug is not the only
place where Unicode support does not work properly in KinoSearch
0.12/0.13.
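
To illustrate the flag loss described above, here is a minimal sketch in plain Perl. It is not KinoSearch code: the TokenBatch round trip is only simulated with Encode::encode_utf8, the tokenize() helper is hypothetical, and the /\w+/ pattern is a stand-in rather than the Tokenizer's actual default regex.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: split a string into word tokens.
# The pattern is a stand-in, not necessarily KinoSearch's default.
sub tokenize {
    my ($string) = @_;
    my @tokens = $string =~ /\w+/g;
    return @tokens;
}

# "naive caj" with diacritics. U+010D is above 0xFF, so Perl stores
# this string with the UTF8 flag switched on.
my $text = "na\x{ef}ve \x{10d}aj";

# Simulated round trip (assumption): the same UTF-8 bytes come back,
# but as a plain byte string whose UTF8 flag is off.
my $bytes = Encode::encode_utf8($text);

# Work on a copy and switch the flag back on. No bytes change; Perl
# simply interprets them as UTF-8 characters again.
my $restored = $bytes;
Encode::_utf8_on($restored);

my @flagged_tokens  = tokenize($text);
my @byte_tokens     = tokenize($bytes);
my @restored_tokens = tokenize($restored);

printf "original: %d tokens\n", scalar @flagged_tokens;   # 2
printf "bytes:    %d tokens\n", scalar @byte_tokens;      # 3
printf "restored: %d tokens\n", scalar @restored_tokens;  # 2
```

With the flag off, \w sees the bytes of each multi-byte character as non-word bytes, so words are split in the middle. Note that Encode::_utf8_on performs no validation, so it is only safe on bytes already known to be well-formed UTF-8.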

All these issues were addressed a few weeks back, but there has not
yet been a release incorporating the changes. The fix -- KS now
converts everything to Unicode for internal processing -- is not
backwards compatible, and so I'm trying to put together a single 0.20
release which aggregates multiple backwards-incompatible changes.

I would appreciate it if you would try a recent version from
KinoSearch's Subversion repository and see if it works properly for
you. As of this email, the current repository revision is 1216,
which I believe will work. However, there has been quite a bit of
churn lately, so if it gives you trouble, you may wish to try
revision 1030 instead.

svn co -r 1216 http://www.rectangular.com/svn/kinosearch/trunk kinosearch

Best,

--
Marvin Humphrey