Mailing List Archive

Question) Unicode AND Sorting
Hi.

I have two questions.


1) How can I index Unicode (UTF-8) text?

2) How can I sort by field value?


Sorry for my poor English, and thank you for KinoSearch.
Question) Unicode AND Sorting
On Aug 3, 2006, at 6:49 AM, ???? wrote:

> 1) How can I index Unicode (UTF-8) text?

I was going to say, "the same way you handle regular text", but I've
just realized that the TokenBatch class is not preserving the UTF-8
flag of the scalars that it's derived from -- and therefore, all of
KinoSearch's Analyzers function in a non-UTF-8 context. :( So right
this moment the only way to do it is to write your own Tokenizer class.

I'm slammed putting out fires for my main client right now and can't
work on this today, but fixing this behavior is a high priority. The
fix will be to have the TokenBatch absorb the UTF8 flag of the latest
scalar that gets assigned to it. After that, the regular expressions
in KinoSearch's Tokenizer will adapt themselves and function either
in a UTF-8 context or not depending on the input.
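
As a rough analogue of that adaptive behavior, here is a minimal sketch in
Python (KinoSearch itself is Perl, and Python has no UTF8 flag; the bytes
versus str distinction stands in for it here). The function name `tokenize`
is hypothetical and not part of KinoSearch. It picks a byte-oriented or
character-oriented pattern depending on the input it receives:

```python
import re


def tokenize(data):
    """Split input into word tokens, choosing the regex by input type.

    Hypothetical helper for illustration only; not KinoSearch's API.
    """
    if isinstance(data, bytes):
        # Byte context: \w matches ASCII word characters only, so the
        # bytes of a multi-byte UTF-8 character break a token apart.
        pattern = rb"\w+"
    else:
        # Character context: \w matches Unicode letters and digits.
        pattern = r"\w+"
    return re.findall(pattern, data)


text = "café naïve"
print(tokenize(text.encode("utf-8")))  # raw UTF-8 bytes
print(tokenize(text))                  # decoded characters
```

With byte input, the accented letters split the words apart, which mirrors
the non-UTF-8 behavior described above; with decoded text, each word stays
whole.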

> 2) How can I sort by field value?

This is only possible at present using a somewhat inefficient hack
that violates KinoSearch's public API.

Sorry,

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/
Question) Unicode AND Sorting
Welcome back Marvin!

How about posting the text of your OSCON presentation to the list?

Cheers
Henk
Question) Unicode AND Sorting
On Aug 4, 2006, at 1:15 AM, henka@cityweb.co.za wrote:

>
> Welcome back Marvin!
>
> How about posting the text of your OSCON presentation to the list?

Text? We can do better. :)

A PDF is available from the KinoSearch homepage.

http://www.rectangular.com/kinosearch/

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/