Mailing List Archive

Strange Results with German Analyzer
Hi,

I used a German Analyzer for indexing and searching. AFAIK, the search is
case insensitive. At least I get the same search results for

kapitalanlagen
Kapitalanlagen

But, for some words the Analyzer behaves somewhat funny:

Holland -> 22 results
hollAnd -> 22 results
hollanD -> 22 results
HOLLAND -> 22 results

holland -> 1 result (!) which is NOT in the 22 results mentioned above.

I have no idea why, and my knowledge of searching, stemming, indexing etc. is,
well, small.

Jan


--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>
RE: Strange Results with German Analyzer
Take a look at the end of GermanAnalyzer.java:

http://cvs.apache.org/viewcvs/jakarta-lucene/src/java/org/apache/lucene/analysis/de/GermanAnalyzer.java?rev=1.2&content-type=text/vnd.viewcvs-markup


public final TokenStream tokenStream( String fieldName, Reader reader ) {
    TokenStream result = new StandardTokenizer( reader );
    result = new StandardFilter( result );
    result = new StopFilter( result, stoptable );
    result = new GermanStemFilter( result, excltable );
    // Convert to lowercase after stemming!
    result = new LowerCaseFilter( result );
    return result;
}

As you can see, the analyzer converts all words to lowercase to save some
space. You can of course remove the LowerCaseFilter to get case-sensitive
search. Why holland gives 1 result and hollAnd returns 22, I cannot
say...

mvh karl øie



-----Original Message-----
From: Jan Stövesand [mailto:j.stoevesand@finix.de]
Sent: 20 December 2001 12:36
To: Lucene Users List
Subject: Strange Results with German Analyzer


Hi,

I used a German Analyzer for indexing and searching. AFAIK, the search is
case insensitive. At least I get the same search results for

kapitalanlagen
Kapitalanlagen

But, for some words the Analyzer behaves somewhat funny:

Holland -> 22 results
hollAnd -> 22 results
hollanD -> 22 results
HOLLAND -> 22 results

holland -> 1 result (!) which is NOT in the 22 results mentioned above.

I have no idea why, and my knowledge of searching, stemming, indexing etc. is,
well, small.

Jan


Re: Strange Results with German Analyzer
Hello,

Jan Stövesand wrote:
>
> Hi,
>
> I used a German Analyzer for indexing and searching. AFAIK, the search is
> case insensitive. At least I get the same search results for
>
> kapitalanlagen
> Kapitalanlagen
>
> But, for some words the Analyzer behaves somewhat funny:
>
> Holland -> 22 results
> hollAnd -> 22 results
> hollanD -> 22 results
> HOLLAND -> 22 results
>
> holland -> 1 result (!) which is NOT in the 22 results mentioned above.

That result is correct.

> I have no idea why, and my knowledge of searching, stemming, indexing etc. is,
> well, small.

Well, I'll try to explain it briefly.
Words starting with an uppercase letter are stemmed differently
from all other words. Words containing a single uppercase letter that
is not the first letter, and words containing more than one uppercase
letter, are not stemmed at all.

So the stemming looks like this:

Holland -> possibly a noun -> stemmed to "holland"
hollAnd -> not a regular German word -> ignored -> LowerCaseFilter -> "holland"
hollanD -> not a regular German word -> ignored -> LowerCaseFilter -> "holland"
HOLLAND -> not a regular German word -> ignored -> LowerCaseFilter -> "holland"
holland -> stemmed to "holla" ("nd" is a suffix to be stripped from non-nouns)
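
The case rules above can be sketched in plain Java without Lucene. This is
only an illustration of the classification described in this mail, not the
actual GermanStemFilter code: the class name GermanCaseSketch, the Kind enum
and both toy stemming helpers are hypothetical, and the only suffix rule
modelled is the "nd" stripping mentioned above. The noun stemmer is a
placeholder that leaves the word unchanged.

```java
import java.util.Locale;

public class GermanCaseSketch {

    enum Kind { POSSIBLE_NOUN, IRREGULAR, REGULAR }

    // Classify a token by the position of its uppercase letters.
    static Kind classify(String token) {
        // Any uppercase letter after the first position marks the token as
        // irregular; this covers both "hollAnd" and "HOLLAND".
        for (int i = 1; i < token.length(); i++) {
            if (Character.isUpperCase(token.charAt(i))) {
                return Kind.IRREGULAR;
            }
        }
        if (!token.isEmpty() && Character.isUpperCase(token.charAt(0))) {
            return Kind.POSSIBLE_NOUN;
        }
        return Kind.REGULAR;
    }

    // Hypothetical placeholder: the real noun rules are not given in the mail.
    static String stemNoun(String token) {
        return token;
    }

    // Toy non-noun rule taken from the mail: strip a trailing "nd".
    static String stemRegular(String token) {
        if (token.length() > 2 && token.endsWith("nd")) {
            return token.substring(0, token.length() - 2);
        }
        return token;
    }

    // Stem first (or skip stemming for irregular tokens), then lowercase,
    // mirroring the filter order in the analyzer.
    static String analyze(String token) {
        String stemmed;
        switch (classify(token)) {
            case POSSIBLE_NOUN: stemmed = stemNoun(token); break;
            case REGULAR:       stemmed = stemRegular(token); break;
            default:            stemmed = token; break; // irregular: not stemmed
        }
        return stemmed.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String[] queries = { "Holland", "hollAnd", "hollanD", "HOLLAND", "holland" };
        for (String q : queries) {
            System.out.println(q + " -> " + analyze(q));
        }
    }
}
```

With these toy rules, the first four spellings all end up as the term
"holland", while the all-lowercase query becomes "holla". That would explain
why the lowercase query matches a different set of documents.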

It looks like the check for irregular words needs some improvement;
it should be less restrictive with possibly mistyped words.

Another thing is that the search _is_ case sensitive when the
GermanAnalyzer is used. This is because in German you should search
for a noun as a noun. Stemming nouns differently from the rest gives
much better results than medium stemming that ignores case from the
beginning.


HTH,
Gerhard
