On Aug 3, 2006, at 6:49 AM, ???? wrote:
> 1) How can I index Unicode (UTF-8) text?
I was going to say, "the same way you handle regular text", but I've
just realized that the TokenBatch class is not preserving the UTF-8
flag of the scalars that it's derived from -- and therefore, all of
KinoSearch's Analyzers function in a non-UTF-8 context. :( So right
now the only way to do it is to write your own Tokenizer class.
I'm slammed putting out fires for my main client right now and can't
work on this today, but fixing this behavior is a high priority. The
fix will be to have the TokenBatch absorb the UTF-8 flag of the latest
scalar that gets assigned to it. After that, the regular expressions
in KinoSearch's Tokenizer will adapt themselves and function either
in a UTF-8 context or not depending on the input.
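
To illustrate why the flag matters, here is a minimal sketch in plain
Perl. It does not use KinoSearch at all: tokenize() is a hypothetical
stand-in for a word-matching regex, and the sample string is made up.
It only shows that the same pattern splits text differently depending
on whether the scalar carries the UTF-8 flag.

```perl
use strict;
use warnings;
use Encode qw( decode );

# UTF-8 encoded bytes for "café naïve"; the UTF8 flag is off.
my $bytes = "caf\xc3\xa9 na\xc3\xafve";

# Decoding yields a character string with the UTF8 flag on.
my $chars = decode( 'UTF-8', $bytes );

# Hypothetical stand-in for a tokenizer; NOT KinoSearch's Tokenizer.
sub tokenize {
    my ($text) = @_;
    return $text =~ /(\w+)/g;
}

my @byte_tokens = tokenize($bytes);
my @char_tokens = tokenize($chars);

# Flag off: the bytes of each accented letter are not word characters,
# so the words are broken apart ("caf", "na", "ve").
printf "flag %s: %d tokens\n",
    ( utf8::is_utf8($bytes) ? "on" : "off" ), scalar @byte_tokens;

# Flag on: \w matches the accented letters, giving two whole words.
printf "flag %s: %d tokens\n",
    ( utf8::is_utf8($chars) ? "on" : "off" ), scalar @char_tokens;
```

If the flag is lost somewhere between your input and the regex, you
get the first behavior even though you passed in properly decoded text.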
> 2) How can I sort by field value?
This is only possible at present using a somewhat inefficient hack
that violates KinoSearch's public API.
Sorry,
Marvin Humphrey
Rectangular Research
http://www.rectangular.com/