Mailing List Archive: Automatically determine Language of document

Nov 23, 2001, 6:56 AM

Post #1 of 4 (1340 views)

Hi,

Has anyone done anything to autodetect the language of an
HTML document that will be indexed by Lucene?

I will use Lucene to index a multilingual portal
and want to filter the hits by language.

Thanks for any ideas,

Stephan

--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>

RE: Automatically determine Language of document

semiosys at wanadoo

Nov 23, 2001, 7:13 AM

Post #2 of 4 (1314 views)

Hi,
You could try Doug Beeferman's variable-length character n-gram approach
to identify a language among 13 European ones.
http://www.dougb.com/ident.html

I tried it and it works pretty well. It's based on a cosine similarity
measure between a corpus model and the input text.
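
To make the idea concrete, here is a minimal Python sketch of character n-gram matching with cosine similarity. It is not Doug Beeferman's code: the fixed n-gram range (1 to 3) stands in for his variable-length n-grams, and the function names and the tiny training samples are assumptions made up for illustration.

```python
import math
from collections import Counter

def ngram_profile(text, n_min=1, n_max=3):
    """Count character n-grams of length n_min..n_max in lowercased text."""
    text = " ".join(text.lower().split())  # collapse runs of whitespace
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts

def cosine(a, b):
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def guess_language(text, models):
    """Return the language whose corpus model is most similar to the text."""
    profile = ngram_profile(text)
    return max(models, key=lambda lang: cosine(profile, models[lang]))

# Toy training samples (assumption): far too small for real use.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then runs away with the others",
    "de": "der schnelle braune fuchs springt über den faulen hund und läuft dann mit den anderen weg",
}
MODELS = {lang: ngram_profile(sample) for lang, sample in SAMPLES.items()}
```

A call such as `guess_language(some_text, MODELS)` returns the key of the closest model. In practice each model would be built from a much larger corpus per language.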

There are issues depending on which character set you use (ISO Latin-1
or some other ASCII flavor).
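
A small Python sketch of why the character set matters, assuming the page's charset is known (for example from an HTTP header or a meta tag). The helper name `to_text` is hypothetical. Decoding to Unicode before counting n-grams or words keeps the input comparable with the model.

```python
def to_text(raw_bytes, declared_charset="iso-8859-1"):
    """Decode raw page bytes to a Unicode string using the declared charset."""
    return raw_bytes.decode(declared_charset, errors="replace")

# The same French word stored in two different encodings.
latin1_bytes = "fenêtre".encode("iso-8859-1")
utf8_bytes = "fenêtre".encode("utf-8")

# Decoded with the right charset, both yield identical characters,
# so they would also yield identical n-grams or word counts.
same = to_text(latin1_bytes, "iso-8859-1") == to_text(utf8_bytes, "utf-8")

# Decoded with the wrong charset, the accented letter becomes two
# unrelated characters and no longer matches anything in the model.
garbled = to_text(utf8_bytes, "iso-8859-1")
```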

If you just have 4 or 5 languages to deal with, you can build your
own with the most frequent word lists for each language. I have some
trivial C++ code that does it and can send it to you if you need it.
The identified language is chosen on a frequency criterion.
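
The frequent-word approach can be sketched in Python as below. This is not the C++ code mentioned above: the word lists are short, hand-picked examples, and the scoring rule (the share of tokens that are frequent words of a language) is only one possible reading of "a frequency criterion".

```python
import re
from collections import Counter

# Hypothetical, hand-picked lists of very frequent words per language.
FREQUENT_WORDS = {
    "en": {"the", "and", "of", "to", "in", "is", "that", "it", "for", "with"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "ein", "zu", "mit", "von"},
    "fr": {"le", "la", "les", "et", "est", "un", "une", "des", "que", "pour"},
}

def language_scores(text):
    """Share of tokens in the text that are frequent words of each language."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())  # runs of letters only
    if not tokens:
        return {lang: 0.0 for lang in FREQUENT_WORDS}
    counts = Counter(tokens)
    return {
        lang: sum(counts[word] for word in words) / len(tokens)
        for lang, words in FREQUENT_WORDS.items()
    }

def guess_by_frequent_words(text, default=None):
    """Pick the best-scoring language, or default if no frequent word occurs."""
    scores = language_scores(text)
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default
```

With real documents the lists would hold the few hundred most frequent words of each language, and very short texts may contain none of them, hence the `default` fallback.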

Of course, commercial products are available for that (try Xerox & inXight,
for instance; $$).

The point is how many languages you have to identify...

Addendum:
Some time ago, Bright Station (UK) had some open source C/C++ code
for a variety of stemmers for European languages (adapted from
the Porter stemmer approach).

I hope this helps,
Elie Naulleau
Semio-Sys



RE: Automatically determine Language of document

Stephan.Strittmatter.ext at kst

Nov 28, 2001, 12:40 AM

Post #3 of 4 (1302 views)

Hi Elie,

> You could try Doug Beeferman's variable-length character n-gram approach
> to identify a language among 13 European ones.
> http://www.dougb.com/ident.html

> If you just have 4 or 5 languages to deal with, you can build your
> own with the most frequent word lists for each language. I have some
> trivial C++ code that does it and can send it to you if you need it.
> The identified language is chosen on a frequency criterion.
>

At the moment I have only two languages (en, de), but this could increase,
though I don't expect more than your 4 or 5.
It would be great if you could send me your example code.
I will probably try to port it to Java.

Thanks in advance,

Stephan Strittmatter


RE: Automatically determine Language of document

semiosys at wanadoo

Nov 28, 2001, 2:20 AM

Post #4 of 4 (1320 views)

Hi Stephan,

You'll find the example code attached to this message. The archive
also contains a model.dat file for French and English.
Remember that this is a simplistic approach to language guessing.
It will work to distinguish between French, English, Spanish, etc.,
but is likely to fail between Finnish, Swedish, Norwegian, etc.
Porting to Java should be straightforward.

Elie

