> Attached is StandardTokenizer.jj with sigram-based (single-character)
> East Asian language support,
> tested under Windows and GNU/Linux.
>
> It simply applies a different word segmentation method to each
> UnicodeBlock.
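
The per-block idea above can be sketched in plain Java, outside the JavaCC grammar. This is only an illustration under stated assumptions: `BlockSegmenter`, `isSingleCharBlock` and `segment` are hypothetical names, not part of Lucene, and the choice of which blocks get single-character tokens is an assumption rather than the exact set used in the attached grammar.

```java
import java.lang.Character.UnicodeBlock;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: picks a segmentation
// method per character based on its Unicode block.
class BlockSegmenter {

    // Assumption: characters in these blocks become one token each.
    static boolean isSingleCharBlock(char c) {
        UnicodeBlock b = UnicodeBlock.of(c);
        return b == UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
            || b == UnicodeBlock.CJK_COMPATIBILITY_IDEOGRAPHS
            || b == UnicodeBlock.HIRAGANA
            || b == UnicodeBlock.KATAKANA;
    }

    static List<String> segment(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isSingleCharBlock(c)) {
                // Single-character method: close any pending word,
                // then emit the character as its own token.
                flush(word, tokens);
                tokens.add(String.valueOf(c));
            } else if (Character.isLetterOrDigit(c)) {
                // Alphabetic method: accumulate a run of letters and digits.
                word.append(c);
            } else {
                // Anything else (spaces, punctuation) ends the current word.
                flush(word, tokens);
            }
        }
        flush(word, tokens);
        return tokens;
    }

    private static void flush(StringBuilder word, List<String> tokens) {
        if (word.length() > 0) {
            tokens.add(word.toString());
            word.setLength(0);
        }
    }
}
```

Calling `BlockSegmenter.segment("Lucene 中文 test")` would yield the tokens `Lucene`, `中`, `文`, `test`: the Latin runs stay whole while each ideograph stands alone.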
>
> I hope that in future releases we can add more language
> support to StandardTokenizer.jj step by step, so that it
> fits most i18n environments.
> Common applications such as Jive could then use it as their default
> Analyzer,
> with localized Analyzers reserved for advanced usage.
>
> Thank you.
>
> Che, Dong
>
> diff StandardTokenizer.jj StandardTokenizer.jj.orig
> 59c59
> < UNICODE_INPUT = true;
> ---
> > //UNICODE_INPUT = true;
> 121d120
> < | <SIGRAM: (<CJK>) >
> 130c129
> < | < #LETTER: // alphabets
> ---
> > | < #LETTER: // unicode letters
> 137c136,141
> < "\u0100"-"\u1fff"
> ---
> > "\u0100"-"\u1fff",
> > "\u3040"-"\u318f",
> > "\u3300"-"\u337f",
> > "\u3400"-"\u3d2d",
> > "\u4e00"-"\u9fff",
> > "\uf900"-"\ufaff"
> 140,148d143
> < | < #CJK: // non-alphabets
> < [.
> < "\u3040"-"\u318f",
> < "\u3300"-"\u337f",
> < "\u3400"-"\u3d2d",
> < "\u4e00"-"\u9fff",
> < "\uf900"-"\ufaff"
> < ]
> < >
>
> < token = <SIGRAM> |
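
To make the character ranges in the diff easier to check, here is a minimal Java sketch of a predicate that mirrors the `#CJK` set. `CjkRanges` and `isCjk` are hypothetical names used only for illustration. The range bounds are copied from the diff, but the class itself is not generated by or taken from the grammar.

```java
// Hypothetical helper mirroring the #CJK character set from the diff.
class CjkRanges {

    // Each pair is an inclusive [low, high] range copied from the grammar.
    private static final char[][] RANGES = {
        {'\u3040', '\u318f'},
        {'\u3300', '\u337f'},
        {'\u3400', '\u3d2d'},
        {'\u4e00', '\u9fff'},
        {'\uf900', '\ufaff'},
    };

    // True when c falls in one of the ranges, i.e. when the modified
    // grammar would match it as a single-character SIGRAM token.
    static boolean isCjk(char c) {
        for (char[] r : RANGES) {
            if (c >= r[0] && c <= r[1]) {
                return true;
            }
        }
        return false;
    }
}
```

With this predicate, `CjkRanges.isCjk('中')` is true and `CjkRanges.isCjk('a')` is false, which matches the intent of moving these ranges out of `#LETTER` and into their own token class.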
>
>
>
>
>
> More on Unicode standards:
>
> http://www.unicode.org/charts/normalization/
> http://www.unicode.org/charts/
>
> http://octopus.cdut.edu.cn/~yf17/oreilly/langref/appa_01.htm
> http://klomp.org/mark/classpath/html/Character_8java-source.html
>
>