For handling Indian regional languages, what is the advisable approach?
1. Indexing each language's data (Tamil, Hindi, etc.) in language-specific
fields such as content_tamil and content_hindi, with a per-field analyzer
(e.g. a Tamil analyzer for content_tamil, HindiAnalyzer for content_hindi)?
2. Indexing all languages' data in the same field, but handling tokenization
with script-specific Unicode ranges (similar to THAI) in the tokenizer, as
shown below:
> THAI = [\u0E00-\u0E59]
> TAMIL = [\u0B80-\u0BFF]
> // basic word: a sequence of digits & letters (includes Thai to enable
> // ThaiAnalyzer to function)
> ALPHANUM = ({LETTER}|{THAI}|{TAMIL}|[:digit:])+
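To make option 2 concrete, here is a minimal, self-contained sketch of the range-check idea behind those JFlex macros, written in plain Java (no Lucene dependency). The class and method names (ScriptRuns, scriptOf, splitByScript) are illustrative only, and the classification is a simplification: the real JFlex ALPHANUM macro folds THAI and TAMIL into one token class, whereas this sketch keeps each script's runs separate so the routing is visible.

```java
import java.util.ArrayList;
import java.util.List;

public class ScriptRuns {
    // Same code-point ranges as the JFlex macros quoted above.
    static String scriptOf(int cp) {
        if (cp >= 0x0E00 && cp <= 0x0E59) return "THAI";
        if (cp >= 0x0B80 && cp <= 0x0BFF) return "TAMIL";
        if (Character.isLetterOrDigit(cp)) return "ALPHANUM";
        return "OTHER";
    }

    // Split text into maximal runs of code points sharing one script class,
    // returning {scriptClass, runText} pairs.
    static List<String[]> splitByScript(String text) {
        List<String[]> runs = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            String script = scriptOf(text.codePointAt(i));
            int start = i;
            while (i < text.length()
                    && scriptOf(text.codePointAt(i)).equals(script)) {
                i += Character.charCount(text.codePointAt(i));
            }
            runs.add(new String[] { script, text.substring(start, i) });
        }
        return runs;
    }

    public static void main(String[] args) {
        for (String[] run : splitByScript("hello தமிழ்")) {
            System.out.println(run[0] + ": " + run[1]);
        }
    }
}
```

A per-script classifier like this is essentially what the extended tokenizer grammar does at the lexer level; once runs (or tokens) carry a script label, each can be routed to script-appropriate downstream filters.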
Note: I am using Lucene 4.10.4, but I am open to suggestions based on the
latest Lucene versions as well as Lucene 4.
--
*K*umaran *R*