Mailing List Archive

Re: Choice of indexed Character set
Manish, in the future, please send questions to lucene-dev, not to me
directly. Thanks.

Manish Shukla wrote:
> Just wanted to ask you: what logic did we use to choose
> which characters to index when creating the
> StandardTokenizer.jj file?
>
> We currently use the following ranges to index, and break
> tokens on the rest.
>
> "\u0041"-"\u005a",
> "\u0061"-"\u007a",
> "\u00c0"-"\u00d6",
> "\u00d8"-"\u00f6",
> "\u00f8"-"\u00ff",
> "\u0100"-"\u1fff",
> "\u3040"-"\u318f",
> "\u3300"-"\u337f",
> "\u3400"-"\u3d2d",
> "\u4e00"-"\u9fff",
> "\uf900"-"\ufaff"
>
> Looking at the list, it seems a little arbitrary in
> some respects. We are indexing Katakana, Hiragana,
> Bopomofo, and Hangul Compatibility Jamo, but we are
> skipping some of the characters in the Latin-1
> Supplement and Latin Extended ranges.
>
> I am a little confused. I want to index only the
> ISO 8859 character set, hence I want to understand the
> logic. Am I missing something?

I don't remember where that came from. I think it may have been copied
from the Java 1.0 implementation of Character.isLetter(). It could
probably stand to be updated. Please feel free to make a proposal.
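One rough way to see how far it has drifted would be to compare
those ranges against what the current JDK reports as letters.
Untested sketch (the class name is just for illustration):

// Compare the grammar's hard-coded LETTER/CJK ranges against
// Character.isLetter() and count where the two disagree.
public class LetterRangeCheck {

  // The ranges from StandardTokenizer.jj quoted above.
  private static final char[][] RANGES = {
    {'\u0041', '\u005a'}, {'\u0061', '\u007a'},
    {'\u00c0', '\u00d6'}, {'\u00d8', '\u00f6'},
    {'\u00f8', '\u00ff'}, {'\u0100', '\u1fff'},
    {'\u3040', '\u318f'}, {'\u3300', '\u337f'},
    {'\u3400', '\u3d2d'}, {'\u4e00', '\u9fff'},
    {'\uf900', '\ufaff'}
  };

  private static boolean inRanges(char c) {
    for (int i = 0; i < RANGES.length; i++) {
      if (c >= RANGES[i][0] && c <= RANGES[i][1]) return true;
    }
    return false;
  }

  public static void main(String[] args) {
    int missed = 0, extra = 0;
    for (int i = 0; i <= 0xFFFF; i++) {
      char c = (char) i;
      if (Character.isLetter(c) && !inRanges(c)) missed++; // letters the grammar skips
      if (inRanges(c) && !Character.isLetter(c)) extra++;  // grammar chars the JDK doesn't call letters
    }
    System.out.println("letters not covered by the grammar: " + missed);
    System.out.println("grammar chars the JDK doesn't call letters: " + extra);
  }
}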

If you only want 8859, then you're probably best off writing your own
tokenizer, perhaps modelling it after StandardTokenizer.
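If all you need there is letter-based tokenization, subclassing
CharTokenizer (the way LetterTokenizer does) may be even simpler
than a JavaCC grammar. Untested sketch, class name just for
illustration:

import java.io.Reader;
import org.apache.lucene.analysis.CharTokenizer;

// Treat only ISO-8859-1 (Latin-1) letters as token characters;
// everything else acts as a delimiter.
public class Latin1LetterTokenizer extends CharTokenizer {

  public Latin1LetterTokenizer(Reader input) {
    super(input);
  }

  // ASCII letters plus the Latin-1 Supplement letters, skipping
  // the multiplication (U+00D7) and division (U+00F7) signs.
  protected boolean isTokenChar(char c) {
    return (c >= 'A' && c <= 'Z')
        || (c >= 'a' && c <= 'z')
        || (c >= '\u00C0' && c <= '\u00D6')
        || (c >= '\u00D8' && c <= '\u00F6')
        || (c >= '\u00F8' && c <= '\u00FF');
  }
}

You could then wrap it in a small Analyzer, much as SimpleAnalyzer
wraps LowerCaseTokenizer.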

Doug


