Mailing List Archive

non-ASCII char search problem with nightly build (12 nov.)
Hi!
I am trying to use Lucene with Russian texts. I created an index of XML
documents (UTF-8 encoded), but when I search the index with a query from a
servlet, Lucene finds nothing (though I am SURE it MUST find the term). The
search string is re-encoded to UTF-8 too, so I do not know what to do... If I
search this index with English letters, it works as it should (the XML files
contain mixed characters). Could anybody help me? Please note that I use the
latest nightly build (12 Nov), which claims to have non-ASCII search
ability :(

P.S. Maybe I must reindex the docs with this new version?

Phil
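
One thing worth ruling out for the servlet case above is the query string
being decoded with the wrong charset before it ever reaches Lucene. The
following is only a minimal sketch in modern Java, not code from this thread:
the class and the helper redecodeAsUtf8 are hypothetical names, and it assumes
the container decoded the UTF-8 request bytes as ISO-8859-1.

```java
import java.nio.charset.StandardCharsets;

class QueryEncodingSketch {
    // Hypothetical helper: undo an ISO-8859-1 decode of bytes that were
    // really UTF-8. This only works because ISO-8859-1 maps every byte
    // to exactly one char, so the original bytes can be recovered.
    static String redecodeAsUtf8(String misdecoded) {
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // A Russian search term, written with escapes to keep this file ASCII.
        String term = "\u043f\u043e\u0438\u0441\u043a";

        // Simulate the assumed failure: UTF-8 bytes decoded as ISO-8859-1.
        String misdecoded = new String(
                term.getBytes(StandardCharsets.UTF_8),
                StandardCharsets.ISO_8859_1);

        System.out.println(misdecoded.equals(term));                 // false
        System.out.println(redecodeAsUtf8(misdecoded).equals(term)); // true
    }
}
```

If the string is already correct at that point, this step is not the cause
and should not be applied, since re-decoding a correct string corrupts it.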


--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>
Re: non-ASCII char search problem with nightly build (12 nov.) [ In reply to ]
Philipp Chudinov wrote:

>Hi!
>I am trying to use Lucene with Russian texts. I created an index of XML
>documents (UTF-8 encoded), but when I search the index with a query from a
>servlet, Lucene finds nothing (though I am SURE it MUST find the term). The
>search string is re-encoded to UTF-8 too, so I do not know what to do... If I
>search this index with English letters, it works as it should (the XML files
>contain mixed characters). Could anybody help me? Please note that I use the
>latest nightly build (12 Nov), which claims to have non-ASCII search
>ability :(
>
The problem probably lies in the QueryParser class, as it takes only the
least significant byte of each character given in the query. I had a
very similar problem when querying for Polish strings, as they contain
characters that are encoded as two bytes in UTF-8. Also, the characters
of the Polish alphabet were not included in the grammar definition that
the query parser accepted.

The solution I developed was to modify the original grammar definition
so that it accepts non-English characters. Also, the term delimiters
list in the grammar must not contain the least significant bytes of the
accepted characters.
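
To illustrate the behaviour described above, here is a minimal sketch of what
happens when only the low byte of each character is kept. It is a model of the
suspected problem, not Lucene's actual code: the class and the keepLowByte
helper are hypothetical names introduced for the example.

```java
import java.nio.charset.StandardCharsets;

class LowByteSketch {
    // Hypothetical model of a char stream that keeps only the low 8 bits
    // of every character it reads.
    static String keepLowByte(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            out.append((char) (s.charAt(i) & 0xFF));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Polish "l with stroke" (U+0142) takes two bytes in UTF-8.
        String polish = "\u0142";
        System.out.println(polish.getBytes(StandardCharsets.UTF_8).length); // 2

        // Cyrillic capital Er (U+0420): its low byte is 0x20, a space,
        // so a truncating reader would see a term delimiter instead.
        String er = "\u0420";
        System.out.println(keepLowByte(er).equals(" ")); // true
    }
}
```

The second case shows why the low bytes of accepted characters matter for the
delimiter list: under truncation, a letter can turn into a delimiter and
split the term.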

This was posted as a bug report on the old Lucene site, but when the
project moved to Jakarta, all the old bug reports were lost.

If you want the modified grammar as an example, or if you want further
guidance, feel free to mail me at: andrzej.jarmoniuk@e-point.pl

Andrzej Jarmoniuk
Internet Developer
E-Point S.A.


Re: non-ASCII char search problem with nightly build (12 nov.) [ In reply to ]
>The problem probably lies in the QueryParser class, as it takes only the
>least significant byte of each character given in the query.

Are you sure of that? I recently switched from the standard JavaCC
AsciiCharStream implementation to Doug's FastCharStream implementation,
which should accept two-byte characters. Are you sure you're using the
current version?

>I had a very similar problem when querying for Polish strings, as they
>contain characters that are encoded as two bytes in UTF-8. Also, the
>characters of the Polish alphabet were not included in the grammar
>definition that the query parser accepted.

There are definitely recent fixes for that -- are you sure you're using the
current version?



--
Brian Goetz
Quiotix Corporation
brian@quiotix.com Tel: 650-843-1300 Fax: 650-324-8032

http://www.quiotix.com


Re: non-ASCII char search problem with nightly build (12 nov.) [ In reply to ]
Additional question: in QueryParser.jj, at lines 258-259, there is a
Unicode character range ("\u0080"-"\uFFFE"). I understand that Lucene can
search an index for Unicode characters in this range, right? So, my question:
Russian (Cyrillic) characters in the Unicode table are in the range \u0401 -
\u04F9 (I took this from the Character Map app in System Tools on W2K). Does
the Cyrillic range fit within the original QueryParser.jj range?
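
The range question can be checked numerically. The sketch below only compares
code points against the two bounds quoted from the grammar; the class and
method names are hypothetical, and it says nothing about how the generated
parser treats those characters at run time.

```java
class RangeCheckSketch {
    // Bounds of the range quoted from the grammar file in the question.
    static final char GRAMMAR_LO = '\u0080';
    static final char GRAMMAR_HI = '\uFFFE';

    static boolean inRange(char c, char lo, char hi) {
        return c >= lo && c <= hi;
    }

    // True if every character from first to last lies inside the grammar
    // range; checking the two end points is enough for a contiguous range.
    static boolean rangeInside(char first, char last) {
        return first <= last
                && inRange(first, GRAMMAR_LO, GRAMMAR_HI)
                && inRange(last, GRAMMAR_LO, GRAMMAR_HI);
    }

    public static void main(String[] args) {
        // Cyrillic range from the question: U+0401 to U+04F9.
        System.out.println(rangeInside('\u0401', '\u04F9')); // true
    }
}
```

So the Cyrillic range does fall inside the quoted range, which suggests the
grammar range itself is not what excludes Russian terms.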

