Hi,
I'd like to talk about the normalization (aka filtering) applied to a string
being indexed or searched, and how it is done in Lucene. I'll end with a
proposal for another way of handling it.
The Lucene engine includes filters whose purpose is to remove meaningless
morphological marks, in order to extend retrieval to relevant documents that
do not contain the exact forms used in the queries.
Some filters are provided off-the-shelf along with Lucene: a Porter stemmer
and a stemmer specific to German. However, my point is that not only can
there not be a single stemmer for all languages (this is obvious to
everybody, I guess), but ideally there would be several filters for the same
language. For example, the Porter filter is fine for standard English, but
rather inappropriate for proper nouns. Conversely, Soundex is probably fine
for names, but it gives inaccurate results when used as a filter on a whole
document. Generally speaking, normalization strategies can differ widely,
from highly aggressive (like Soundex) to rather soft (like simple diacritics
removal). It is up to the designer of the search engine to choose a strategy
carefully, according to his/her audience and targeted documents. It is even
possible to mix several strategies by including an information extraction
system that would additionally store proper nouns, dates, places, etc. in
separate indexes.
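For concreteness, the "soft" end of that spectrum, plain diacritics removal,
can be sketched in a few lines with the standard java.text.Normalizer (this
is only an illustration, not code from Lucene):

```java
import java.text.Normalizer;

public class DiacriticsStripper {

    // Decompose accented letters into base letter + combining mark (NFD),
    // then delete the combining marks (Unicode general category M).
    public static String strip(String s) {
        String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}", "");
    }

    public static void main(String[] args) {
        System.out.println(strip("démesuré")); // prints "demesure"
    }
}
```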
In my opinion, stemming is not the perfect, unique solution for
normalization. For example, I personally prefer a normalization that
includes stemming, but also some light phonetic simplification that discards
the differences between close phonemes (like the French
é/è/ê/ei/ai/ait/ais/aient/etc. or ain/ein/in/un/etc.), as it gives good
results on texts from Usenet (while it may be a bit too aggressive for
newspaper texts written by journalists).
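Such a phoneme-class merge can be sketched as a longest-match rewrite over a
table of grapheme groups; the groups and target strings below are purely
illustrative (not my actual French rules), and the longest graphemes are
listed first so that e.g. "aient" is not consumed as "ai" plus a leftover:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class PhoneticMerge {

    // Hypothetical grapheme groups, longest first; each maps to an
    // arbitrary class label shared by all members of the group.
    static final Map<String, String> GROUPS = new LinkedHashMap<>();
    static {
        GROUPS.put("aient", "e");
        GROUPS.put("ain", "in");
        GROUPS.put("ein", "in");
        GROUPS.put("ait", "e");
        GROUPS.put("ais", "e");
        GROUPS.put("ai", "e");
        GROUPS.put("ei", "e");
        GROUPS.put("un", "in");
        GROUPS.put("é", "e");
        GROUPS.put("è", "e");
        GROUPS.put("ê", "e");
    }

    // Scan left to right, emitting the class label of the first
    // (i.e. longest) group that matches at the current position.
    public static String merge(String word) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < word.length()) {
            boolean matched = false;
            for (Map.Entry<String, String> e : GROUPS.entrySet()) {
                if (word.startsWith(e.getKey(), i)) {
                    out.append(e.getValue());
                    i += e.getKey().length();
                    matched = true;
                    break;
                }
            }
            if (!matched) out.append(word.charAt(i++));
        }
        return out.toString();
    }
}
```

With this table, "plein" and a hypothetical "plin" would collapse to the
same form.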
Well, in fact my main point is the following: having one filter per language
is wrong. My second point is: having the filter algorithm hard-coded in a
programming language is wrong as well. There should be a simple way of
specifying a filter in a small, dedicated language. In this respect, the
Snowball project is really interesting, as it addresses exactly this issue.
In my mind, there should mainly be a normalizer engine, with many
configuration files that are easy to modify in order to implement or adapt a
filter. This is an important issue, as the accuracy of the search engine is
directly linked to the normalization strategy.
However, an important point is also the ease of use of such a language. In
my attempt to build such a description language, I came up with something
that I hope is quite simple, yet powerful enough: rules that just specify
the letters to transform, the left and right contexts, and the replacement
string. In my opinion, this covers 80% of the needs for (at least) European
languages. I implemented it (in Java) and wrote a normalizer for French,
which stems and phonetically simplifies its input.
Just as an example, here is a small excerpt of my French normalizer (written
in the toy language I implemented):
:: sh :: > ch
:: sch :: > ch
// transform the "in"/"yn" into the same string, when not pronounced "inn"
:: in :: [~aeiouymn] > 1
[~aeiouy] :: yn :: [~aeiouynm] > 1 // "syndicat", "synchro", but not "payer"
:: ives :: $ > if // "consécutives"
Before the first "::" is the left context; after the second "::" is the
right context. "$" indicates a word boundary, and "[~...]" denotes any
character not in the set.
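To make the semantics concrete, here is one possible (hypothetical) way such
a rule could be compiled: the left context becomes a regex lookbehind, the
right context a lookahead, and "~" inside a character class becomes regex
negation "^". This is just a sketch of the idea, not my actual
implementation:

```java
import java.util.regex.Pattern;

public class ContextRule {

    final Pattern pattern;
    final String replacement;

    // A rule "L :: T :: R > S" compiles to the regex (?<=L)T(?=R),
    // replacing each match of T with S. Contexts are assumed to
    // already be in regex syntax (e.g. "[~...]" rewritten to "[^...]").
    ContextRule(String left, String target, String right, String repl) {
        StringBuilder re = new StringBuilder();
        if (!left.isEmpty()) re.append("(?<=").append(left).append(")");
        re.append(Pattern.quote(target));
        if (!right.isEmpty()) re.append("(?=").append(right).append(")");
        this.pattern = Pattern.compile(re.toString());
        this.replacement = repl;
    }

    String apply(String word) {
        return pattern.matcher(word).replaceAll(replacement);
    }
}
```

For example, the rule `:: in :: [~aeiouymn] > 1` would become
`new ContextRule("", "in", "[^aeiouymn]", "1")`, which rewrites "vindicte"
to "v1dicte", and `:: ives :: $ > if` would rewrite "consécutives" to
"consécutif".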
Some features are still missing in my implementation, such as constraints on
word length (i.e. applying a transformation only to words that have more
than x letters) or the like, but I am globally satisfied with it.
As an example of the result (the two input forms are pronounced identically
in French, although the second is not spelled correctly):
read: <démesuré> result: <demezur>
read: <daimesurré> result: <demezur>
Before going through the process of submitting it to the Lucene project, I'd
like to hear your comments on the approach. Of particular concern is the
language used to describe the normalization process, as I am not fully
satisfied with it; but hey, it's hard to find something really simple yet
expressive enough. Any ideas?
Rodrigo
http://www.charabia.net
--
To unsubscribe, e-mail: <mailto:lucene-dev-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-dev-help@jakarta.apache.org>