Hi,
You could try Doug Beeferman's variable-length character n-gram approach
to identify a language among 13 european ones.
http://www.dougb.com/ident.html I tried it and it works pretty well. It's based on a similarity mesure
(cosine)
between a corpus-model and the input text.
There are issues depending on which character set you used (iso-latin1,
or other asci flavor).
If you just have 4 or 5 languages to deal with, you can build your
own with the most frequent word lists for each language. I have some
trivial C++ code that does it and can send it to you it you need.
Identified language is choosen on a frequency criterion.
Of course commercial product are available for that (try Xerox & inXight
for instance $$).
The point is how many language you have to identify...
Complement:
Some time ago, Bright Station (UK) had some open source C/C++ code
for a variety of stemmers for european language (adapted from
the Porter stemmer approach).
I hope this helps,
Elie Naulleau
Semio-Sys
-----Message d'origine-----
De : Strittmatter Stephan (external)
[mailto:Stephan.Strittmatter.ext@kst.siemens.de]
Envoyé : vendredi 23 novembre 2001 14:56
À : 'Lucene Users List'
Objet : Automatically determin Language of document
Hi,
has anyone done anything to autodetect Language of an
HTML-Document which will be indexed by Lucene?
I will use Lucene to index an multilingual Portal
and want to filter the hits by language.
Thanks for any ideas,
Stephan
--
To unsubscribe, e-mail:
<mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail:
<mailto:lucene-user-help@jakarta.apache.org>
--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>