Mailing List Archive: scandinavian characters.

scandinavian characters.

Nov 27, 2001, 2:16 AM

Post #1 of 11 (4516 views)

Hi, I have a problem with Scandinavian characters (æåø): when I insert text
with Scandinavian characters it passes the analyzer correctly, but the
QueryParser chokes when I try to search for the same characters.

Does anyone know how I can fix this?

karl øie/gan media

--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>

Re: scandinavian characters.

david at bit-bang

Nov 27, 2001, 2:43 AM

Post #2 of 11 (4440 views)

Hi Karl !!!

I'm Spanish and I have had a lot of problems programming with our non-English characters. I use Lucene with Spanish accents and it works fine...

Have you tried using java.net.URLEncoder and java.net.URLDecoder on the fields you index?

Best Regards from Spain !
__________________________
David Bonilla Fuertes
THE BIT BANG NETWORK
http://www.bit-bang.com
Profesor Waksman, 8, 6º B
28036 Madrid
SPAIN
Tel.: (+34) 914 577 747
Móvil: 656 62 83 92
Fax: (+34) 914 586 176
__________________________

RE: scandinavian characters.

karl at gan

Nov 27, 2001, 3:11 AM

Post #3 of 11 (4435 views)

No, it's even stranger than that. I have decoded the query string; the problem
is that something seems to change it on the way in. If I search for
"fjøs" I get the Swedish "fjä" (fjÄ), where ø is
changed to Ä and the 's' is removed.

Is the query string translated somewhere?

regards, karl øie

Re: scandinavian characters.

david at bit-bang

Nov 27, 2001, 3:33 AM

Post #4 of 11 (4440 views)

Hi again Karl !!!

Are you using the SimpleAnalyzer? Some of my problems went away when I started to use the StopAnalyzer.

Try it my friend !!!

Re: scandinavian characters.

david at bit-bang

Nov 27, 2001, 3:40 AM

Post #5 of 11 (4447 views)

Excuse me... I was confused... I don't use the StopAnalyzer but the StandardAnalyzer!!!

I beg your pardon !!!

RE: scandinavian characters.

karl at gan

Nov 27, 2001, 4:34 AM

Post #6 of 11 (4435 views)

I tried the SimpleAnalyzer and got the same result. But I forgot to provide
the stack trace:

org.apache.lucene.queryParser.TokenMgrError: Lexical error at line 1, column 1. Encountered: "\u00c3" (195), after : ""
    at org.apache.lucene.queryParser.QueryParserTokenManager.getNextToken(Unknown Source)
    at org.apache.lucene.queryParser.QueryParser.jj_ntk(Unknown Source)
    at org.apache.lucene.queryParser.QueryParser.Modifiers(Unknown Source)
    at org.apache.lucene.queryParser.QueryParser.Query(Unknown Source)
    at org.apache.lucene.queryParser.QueryParser.parse(Unknown Source)
    at org.apache.lucene.queryParser.QueryParser.parse(Unknown Source)
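
The "\u00c3" (195) in the trace is consistent with UTF-8 bytes being decoded as ISO-8859-1: ø is encoded in UTF-8 as the two bytes 0xC3 0xB8, and reading those bytes as ISO-8859-1 gives "Ã¸", whose first character is U+00C3. The sketch below reproduces that effect with the plain JDK. The class and method names are made up for illustration, and whether this is what actually happened between the browser and the servlet in this thread is an assumption.

```java
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Encode the string as UTF-8, then (wrongly) decode those bytes as ISO-8859-1.
    // This imitates a receiver that assumes the wrong charset for incoming text.
    static String misdecode(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // "fj\u00f8s" is "fjøs"; the escape keeps this file independent of source encoding.
        String garbled = misdecode("fj\u00f8s");
        // Print each character of the garbled string as a number.
        for (int i = 0; i < garbled.length(); i++) {
            System.out.println((int) garbled.charAt(i));
        }
    }
}
```

With "fjøs" as input the garbled string has five characters instead of four, and the third one is 195, the same value the parser reports.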

RE: scandinavian characters.

karl at gan

Nov 27, 2001, 5:00 AM

Post #7 of 11 (4436 views)

There must be something seriously broken in the query parser code.

If a query starts with ø/æ/å, an exception occurs in the query parser:

org.apache.lucene.queryParser.TokenMgrError: Lexical error at line 1, column 1. Encountered: "\u00c3" (195), after : ""
    at org.apache.lucene.queryParser.QueryParserTokenManager.getNextToken(Unknown Source)
    at org.apache.lucene.queryParser.QueryParser.jj_ntk(Unknown Source)
    at org.apache.lucene.queryParser.QueryParser.Modifiers(Unknown Source)
    at org.apache.lucene.queryParser.QueryParser.Query(Unknown Source)
    at org.apache.lucene.queryParser.QueryParser.parse(Unknown Source)
    at org.apache.lucene.queryParser.QueryParser.parse(Unknown Source)

But if the query contains ø/æ/å, the character is wrongly translated
into the Swedish/German ä, regardless of which character it was.

If someone could point me to where to start, I could try to find the problem,
because I guess it is an erroneous Unicode translation...

regards, karl

RE: scandinavian characters.

jonas.bechlund at framfab

Nov 27, 2001, 5:51 AM

Post #8 of 11 (4436 views)

Hi Karl,

It is a little bit tricky, but once you get the idea it is not that bad...

I had the same problem with the Danish characters. I changed the TOKEN
definition in the "Token Definitions" section of the file "QueryParser.jj"
and that actually solved the problem. One minor detail is that you have to
rebuild the jar file with Ant (see build.txt for instructions).

I guess that solves your problem,
Regards,
/ Jonas

RE: scandinavian characters.

karl at gan

Nov 27, 2001, 7:20 AM

Post #9 of 11 (4445 views)

That is a good step on the way. I added:

| <#_ALPHANUM_CHAR: [ "a"-"z", "A"-"Z", "0"-"9", "\u0080"-"\uFFFE" ] >

to "QueryParser.jj" and now it doesn't throw any exceptions on øæå,
but they are still translated into ä ?!?

The strange thing is that the cvs version already has this in
its code. Perhaps I should try a full rebuild from the cvs version...
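
To illustrate what the character class in the added grammar line admits, here is a small Java sketch that mirrors its ranges as a plain predicate. It is not the code JavaCC generates from "QueryParser.jj"; the class and method names are hypothetical and exist only for this example.

```java
class AlphanumCharCheck {
    // Mirrors the four ranges of the _ALPHANUM_CHAR definition quoted above:
    // a-z, A-Z, 0-9, and everything from U+0080 up to U+FFFE.
    static boolean isAlphanumChar(char c) {
        return (c >= 'a' && c <= 'z')
            || (c >= 'A' && c <= 'Z')
            || (c >= '0' && c <= '9')
            || (c >= '\u0080' && c <= '\uFFFE');
    }

    public static void main(String[] args) {
        System.out.println(isAlphanumChar('\u00f8')); // the letter ø
        System.out.println(isAlphanumChar('\u00c3')); // the character from the stack trace
        System.out.println(isAlphanumChar('+'));      // query syntax, not a term character
    }
}
```

Note that U+00C3 also falls inside the last range. That would explain why the exception disappears with this token definition while the query text itself can still arrive garbled, though that reading is an inference from the thread rather than something confirmed in it.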

Could you send me your "QueryParser.jj" so I could have a look at it?

btw: thanks for the tips!

regards, karl øie

RE: scandinavian characters.

karl at gan

Nov 27, 2001, 7:52 AM

Post #10 of 11 (4443 views)

After I replaced "QueryParser.jj" with the newest version from cvs, the
query parser accepts my query and I can now perform ø/æ/å searches from the
command line. So I guess there is something wrong with my search servlet's
Unicode handling :-)
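
One way to narrow down where a query string gets altered is to print its characters as code points at each stage (command line, servlet, just before parsing) and compare the output. The helper below is a minimal sketch of that idea using only the JDK; the class and method names are invented for illustration and are not part of Lucene.

```java
class CodePointDump {
    // Render every char of the string as U+XXXX, separated by spaces,
    // so that invisible or mis-decoded characters become easy to spot.
    static String dump(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) s.charAt(i)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "fj\u00f8s" is "fjøs" written with an escape.
        System.out.println(dump("fj\u00f8s")); // prints "U+0066 U+006A U+00F8 U+0073"
    }
}
```

If the servlet logs U+00C3 U+00B8 where the command line shows U+00F8, the string was most likely decoded with the wrong charset before it reached the query parser.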

thank you very much!

karl øie/gan media

RE: scandinavian characters.

karl at gan

Nov 27, 2001, 8:34 AM

Post #11 of 11 (4443 views)

Found a fix to the problem:

The "QueryParser.jj" in rc2 does not accept Unicode; version 1.6 in cvs
does, so I replaced the file with the newest one from cvs and also had to
include the FastCharStream.java class to compile.

Then I just had to force-convert the query string that came from the browser
to UTF-8 and it worked (I guess the browser sent the string as ASCII)!!! I'm so
happy, and thanks to you both, Jonas and David!!

String query = this.request.getParameter( "query" );
if( query != null ) {
    // encode with the platform default charset, then decode those bytes as UTF-8
    query = new String( query.getBytes(), "UTF-8" );
}
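
A note on the snippet above: query.getBytes() with no argument uses the platform default charset, so the conversion only works when that default happens to match the charset the servlet container used to decode the parameter. A slightly more portable sketch names both charsets explicitly. It assumes the container decoded the parameter as ISO-8859-1 (a common default, but not confirmed anywhere in this thread), and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

class QueryRecode {
    // Undo a wrong ISO-8859-1 decoding of UTF-8 data:
    // recover the original bytes, then decode them as UTF-8.
    // ASSUMPTION: the input was produced by decoding UTF-8 bytes as ISO-8859-1.
    static String recode(String raw) {
        if (raw == null) {
            return null;
        }
        byte[] originalBytes = raw.getBytes(StandardCharsets.ISO_8859_1);
        return new String(originalBytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "fj\u00c3\u00b8s" is the garbled form "fjÃ¸s"; the result should read "fjøs".
        String fixed = recode("fj\u00c3\u00b8s");
        System.out.println(fixed.equals("fj\u00f8s"));
    }
}
```

This should only be applied to strings that really were mis-decoded; running it on a correctly decoded "fjøs" would damage the text instead of repairing it. Where the servlet API version supports it, calling request.setCharacterEncoding("UTF-8") before reading any parameter may avoid the problem for POST bodies, although how containers treat GET query strings varies.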

regards, karl øie/gan media
