Mailing List Archive

[contrib]: StandardTokenizer with sigram based CJK Support
> Attached is StandardTokenizer.jj with sigram-based East
> Asian language support (one token per CJK character),
> tested under Windows and GNU/Linux.
>
> It simply applies a different word segmentation
> method to each UnicodeBlock.
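
The per-block idea described here can be illustrated outside the grammar with a small Java sketch. This is not the attached StandardTokenizer.jj: the class name `BlockAwareSegmenter`, its methods, and the choice of Unicode blocks are assumptions made for illustration only. It emits each CJK character as its own token and groups runs of other letters and digits into words.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not part of Lucene or of the attached patch.
final class BlockAwareSegmenter {

    // Assumption: these blocks stand in for the "east asia" scripts of the post.
    static boolean isCjk(char c) {
        Character.UnicodeBlock b = Character.UnicodeBlock.of(c);
        return b == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
            || b == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS_EXTENSION_A
            || b == Character.UnicodeBlock.CJK_COMPATIBILITY_IDEOGRAPHS
            || b == Character.UnicodeBlock.HIRAGANA
            || b == Character.UnicodeBlock.KATAKANA;
    }

    // CJK characters become single-character tokens; other letters and
    // digits are accumulated into words; everything else is a separator.
    static List<String> segment(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isCjk(c)) {
                flush(word, tokens);
                tokens.add(String.valueOf(c));
            } else if (Character.isLetterOrDigit(c)) {
                word.append(c);
            } else {
                flush(word, tokens);
            }
        }
        flush(word, tokens);
        return tokens;
    }

    // Moves any pending word into the token list.
    private static void flush(StringBuilder word, List<String> tokens) {
        if (word.length() > 0) {
            tokens.add(word.toString());
            word.setLength(0);
        }
    }
}
```

Calling `BlockAwareSegmenter.segment("Lucene \u4e2d\u6587 search")` yields four tokens: the word `Lucene`, the two ideographs as separate tokens, and the word `search`.
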
>
> I hope that in future releases we can add more language
> support to StandardTokenizer.jj step by step and keep
> it suitable for most i18n environments.
> Common applications such as Jive could then use it as
> the default Analyzer, with localized Analyzers reserved
> for advanced usage.
>
> Thank you.
>
> Che, Dong
>
> diff StandardTokenizer.jj StandardTokenizer.jj.orig
> 59c59
> < UNICODE_INPUT = true;
> ---
> > //UNICODE_INPUT = true;
> 121d120
> < | <SIGRAM: (<CJK>) >
> 130c129
> < | < #LETTER: // alphabets
> ---
> > | < #LETTER: // unicode letters
> 137c136,141
> < "\u0100"-"\u1fff"
> ---
> > "\u0100"-"\u1fff",
> > "\u3040"-"\u318f",
> > "\u3300"-"\u337f",
> > "\u3400"-"\u3d2d",
> > "\u4e00"-"\u9fff",
> > "\uf900"-"\ufaff"
> 140,148d143
> < | < #CJK: // non-alphabets
> < [
> < "\u3040"-"\u318f",
> < "\u3300"-"\u337f",
> < "\u3400"-"\u3d2d",
> < "\u4e00"-"\u9fff",
> < "\uf900"-"\ufaff"
> < ]
> < >
>
> < token = <SIGRAM> |
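
To make the effect of the diff concrete: the five ranges moved out of `#LETTER` into the new `#CJK` definition decide which characters become single-character `SIGRAM` tokens. A minimal Java sketch of that range test follows. The class name `CjkRanges` and the method `isCjk` are hypothetical; only the range boundaries are taken from the diff.

```java
// Hypothetical helper mirroring the ranges listed under #CJK in the diff.
final class CjkRanges {

    // Inclusive [start, end] pairs, copied from the #CJK token definition.
    private static final char[][] RANGES = {
        {'\u3040', '\u318f'},
        {'\u3300', '\u337f'},
        {'\u3400', '\u3d2d'},
        {'\u4e00', '\u9fff'},
        {'\uf900', '\ufaff'},
    };

    // True when c falls in any of the ranges above, i.e. when the patched
    // grammar would match it as <CJK> rather than as <LETTER>.
    static boolean isCjk(char c) {
        for (char[] range : RANGES) {
            if (c >= range[0] && c <= range[1]) {
                return true;
            }
        }
        return false;
    }
}
```

With these ranges, an ideograph such as U+4E2D or a Hiragana character such as U+3042 is classified as CJK, while ASCII and Latin-1 letters are not.
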
>
> More on Unicode standards:
>
> http://www.unicode.org/charts/normalization/
> http://www.unicode.org/charts/
>
> http://octopus.cdut.edu.cn/~yf17/oreilly/langref/appa_01.htm
> http://klomp.org/mark/classpath/html/Character_8java-source.html
>
Re: [contrib]: StandardTokenizer with sigram based CJK Support
+1

