Mailing List Archive: language support

Greetings. Using the Lucene demo, I set up a JSP front end into an index of
HTML documentation. This was a FTS prototype for a publications department.
It's worked well and we would like to develop it further. There appears,
however, to be one potential hitch -- language support. Our documentation
will eventually be localized into several languages (not yet specified). I'm
not a programmer, but I have some understanding of what a search engine has
to do to support a language. I'd like to determine how much language-support
you get with the standard tokenizer/analyzer (I know that a German analyzer
is included with the core API).

1) The FAQ states:
"You can use Lucene for documents in any language. However if you want to
use a stemmer you will have to write one yourself.
Also, for some languages, the tokenizing rules of Lucene's tokenizes
inappropriate(ly) so you many need to write your own tokenizer as well."

Does this mean that the standard tokenizer/analyzer would suffice for some
European languages? And if so, which ones?

2) For Japanese or Chinese, Lucene supports the character sets. But could
you use the default tokenizer and analyzer for those languages and expect
decent results?

Thanks,

Frank Eichenlaub

--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>