Mailing List Archive

Handling Indian regional languages
For handling Indian regional languages, what is the advisable approach?

1. Indexing each language's data (Tamil, Hindi, etc.) in a dedicated field
such as content_tamil or content_hindi, with a per-field Analyzer, e.g. a
Tamil analyzer for content_tamil and HindiAnalyzer for content_hindi (see
the sketch after this list)?

2. Indexing all languages' data in the same field, but handling
tokenization with script-specific Unicode ranges in the tokenizer grammar
(similar to THAI), as in the JFlex snippet below:

THAI = [\u0E00-\u0E59]
TAMIL = [\u0B80-\u0BFF]
// basic word: a sequence of digits & letters (includes Thai to enable
// ThaiAnalyzer to function)
ALPHANUM = ({LETTER}|{THAI}|{TAMIL}|[:digit:])+
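To make option 1 concrete, this is roughly what I have in mind (an
untested sketch; HindiAnalyzer ships in lucene-analyzers-common, and since
I am not aware of a dedicated Tamil analyzer in Lucene 4.10,
StandardAnalyzer stands in for content_tamil here):

import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.hi.HindiAnalyzer;
import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;

public class PerFieldSetup {
    public static Analyzer build() {
        // Route each per-language field to its own analysis chain.
        Map<String, Analyzer> perField = new HashMap<String, Analyzer>();
        perField.put("content_hindi", new HindiAnalyzer());
        perField.put("content_tamil", new StandardAnalyzer()); // placeholder chain
        // Any other field falls back to the default analyzer.
        return new PerFieldAnalyzerWrapper(new StandardAnalyzer(), perField);
    }
}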


Note: I am using Lucene 4.10.4, but I am open to suggestions based on the
latest Lucene versions as well as Lucene 4.


--
Kumaran R
Re: Handling Indian regional languages
On Tue, Jan 10, 2023 at 2:04 AM Kumaran Ramasubramanian
<kums.134@gmail.com> wrote:
>
> For handling Indian regional languages, what is the advisable approach?
>
> 1. Indexing each language's data (Tamil, Hindi, etc.) in a dedicated field
> such as content_tamil or content_hindi, with a per-field Analyzer, e.g. a
> Tamil analyzer for content_tamil and HindiAnalyzer for content_hindi?

You don't need to do this just to tokenize. You only need it if you want
to do something fancier on top (e.g. stemming and so on). If you look at
newer Lucene versions, there are analyzers for more languages.
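
For example, the per-field route buys you a chain like this (untested
sketch against the 4.10-era API, loosely modeled on what HindiAnalyzer
does internally, not its exact chain; in 5.x+ createComponents drops the
Reader argument):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.core.LowerCaseFilter;
import org.apache.lucene.analysis.hi.HindiNormalizationFilter;
import org.apache.lucene.analysis.hi.HindiStemFilter;
import org.apache.lucene.analysis.in.IndicNormalizationFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class HindiFieldAnalyzer extends Analyzer {
    @Override
    protected TokenStreamComponents createComponents(String field, Reader reader) {
        // Same tokenizer either way; the extra value is in the filters.
        Tokenizer source = new StandardTokenizer(reader);
        TokenStream sink = new LowerCaseFilter(source);
        sink = new IndicNormalizationFilter(sink);  // normalize Indic character variants
        sink = new HindiNormalizationFilter(sink);  // fold Hindi spelling variations
        sink = new HindiStemFilter(sink);           // light Hindi stemming
        return new TokenStreamComponents(source, sink);
    }
}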

>
> 2. Indexing all languages' data in the same field, but handling
> tokenization with script-specific Unicode ranges in the tokenizer grammar
> (similar to THAI), as in the JFlex snippet below:
>
> THAI = [\u0E00-\u0E59]
> TAMIL = [\u0B80-\u0BFF]
> // basic word: a sequence of digits & letters (includes Thai to enable
> // ThaiAnalyzer to function)
> ALPHANUM = ({LETTER}|{THAI}|{TAMIL}|[:digit:])+

Don't do this: just use StandardTokenizer instead of ClassicTokenizer.
StandardTokenizer can tokenize all the Indian writing systems out of the
box.
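
A quick way to verify (sketch, 4.10-era API; in 5.x+ you construct the
tokenizer without a Reader and call setReader instead):

import java.io.StringReader;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class TokenizeDemo {
    public static void main(String[] args) throws Exception {
        // Tamil, Devanagari and Latin text mixed in one string.
        StandardTokenizer ts =
            new StandardTokenizer(new StringReader("தமிழ் சோதனை और हिंदी plus English"));
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        ts.reset();
        while (ts.incrementToken()) {
            System.out.println(term.toString()); // one word token per script
        }
        ts.end();
        ts.close();
    }
}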

Re: Handling Indian regional languages
Hi Robert Muir, we will check on this. Thanks a lot for the pointers.

--
Kumaran R


