Mailing List Archive

setting encoding
I need to search non-English text that is written in the Cp1252 encoding.
There are some fields I need to store using that encoding. I am able to
store them, but some characters specific to Cp1252 are lost. How can I tell Lucene
to store fields using a specific encoding?

thanks everybody



--
To unsubscribe, e-mail: <mailto:lucene-user-unsubscribe@jakarta.apache.org>
For additional commands, e-mail: <mailto:lucene-user-help@jakarta.apache.org>
Re: setting encoding [ In reply to ]
I don't know how to have Lucene store in cp1252 (Windows Latin-1), but I don't
think you have to.
I'm pretty sure it will take whatever you have in a Java String,
save it as Unicode, and then recreate it as a Java String.

So I think the issue you have is converting from cp1252 into a Java String,
which is pretty straightforward.
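
A minimal sketch of that conversion, assuming the raw bytes are already in memory. The class and method names are made up for illustration and are not part of Lucene. Decoding with the Cp1252 charset gives an ordinary Java String, which is Unicode internally:

```java
import java.nio.charset.Charset;

class Cp1252Decode {
    // Hypothetical helper: decode raw Cp1252 bytes into a Java String.
    static String decode(byte[] cp1252Bytes) {
        return new String(cp1252Bytes, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        // 0x80 (euro sign) and 0x9A (s with caron) are printable in Cp1252
        // but not in ISO-8859-1; 0xE9 (e with acute) is the same in both.
        byte[] raw = { (byte) 0x80, (byte) 0x9A, (byte) 0xE9 };
        String s = decode(raw);
        // Print the Unicode code point of each decoded character.
        for (int i = 0; i < s.length(); i++) {
            System.out.printf("U+%04X%n", (int) s.charAt(i));
        }
    }
}
```

If the bytes were decoded with a different charset instead, the characters in the 0x80 to 0x9F range are the ones that would come out wrong, which may be what "lost" means in the original question.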


Also, does the encoding matter?

Can you convert cp1252 to UTF-8 on the fly (and even backward if needed)?
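
One way the on-the-fly conversion could look, as a sketch only: it goes through a Java String in both directions. The class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Cp1252Utf8RoundTrip {
    static final Charset CP1252 = Charset.forName("windows-1252");

    // Cp1252 bytes -> UTF-8 bytes, going through a Java String.
    static byte[] toUtf8(byte[] cp1252Bytes) {
        return new String(cp1252Bytes, CP1252).getBytes(StandardCharsets.UTF_8);
    }

    // UTF-8 bytes -> Cp1252 bytes. Characters that do not exist in Cp1252
    // are replaced by the encoder (with '?'), so this direction can lose data.
    static byte[] toCp1252(byte[] utf8Bytes) {
        return new String(utf8Bytes, StandardCharsets.UTF_8).getBytes(CP1252);
    }

    public static void main(String[] args) {
        // S with caron, z with caron, u with diaeresis in Cp1252.
        byte[] original = { (byte) 0x8A, (byte) 0x9E, (byte) 0xFC };
        byte[] utf8 = toUtf8(original);
        byte[] back = toCp1252(utf8);
        System.out.println("UTF-8 length: " + utf8.length);
        System.out.println("Round trip intact: " + Arrays.equals(original, back));
    }
}
```

Text that started out as valid Cp1252 should survive the trip in both directions. The byte values Cp1252 leaves undefined (0x81, 0x8D, 0x8F, 0x90, 0x9D) are the exception and would not round-trip this way.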

The biggest problem is some cp1252 characters are "private" in the unicode
byte set.

You can get the conversion from the Glue lossless transcoder project.
http://www.ascc.net/xml/en/utf-8/transcode-index.html

I hope these random thoughts help.

--Peter

On 5/18/02 3:15 PM, "Dario Novakovic" <darionis@hotmail.com> wrote:

> i need to search non-english text and it is written using Cp1252 encoding.
> there are some fields i need to store using that encoding. i am able to
> store them but some chars specific to 1252 are lost. how can i tell lucene
> to store fields using specific encoding?
>
> thanks everybody


Re: setting encoding [ In reply to ]
> The biggest problem is some cp1252 characters are "private" in the unicode
> byte set.

Those characters may not be in the Unicode character set at all, and that is the major trouble with processing Chinese.

Convert your native encoding to Unicode (UTF-16) with the following lines:

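import java.io.*;

// Decode the file's bytes as Cp1252 while reading; Java Strings are UTF-16 internally.
File f = new File("cp1252_input");
FileInputStream tmp = new FileInputStream(f);
BufferedReader brin = new BufferedReader(new InputStreamReader(tmp, "CP1252"));
String inputString = brin.readLine(); // reads only the first line of the file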

I'm not sure the encoding name is exactly CP1252; check that in the Java docs.
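
Besides the docs, the running JVM can be asked directly whether it knows an encoding name. A small sketch using java.nio.charset.Charset; the class and method names are made up for illustration:

```java
import java.nio.charset.Charset;

class CharsetNameCheck {
    // Returns the canonical name the running JVM uses for the given encoding
    // name, or null if the name is unsupported or not a legal charset name.
    static String canonicalName(String name) {
        try {
            return Charset.isSupported(name) ? Charset.forName(name).name() : null;
        } catch (IllegalArgumentException e) {
            // Thrown for syntactically illegal charset names.
            return null;
        }
    }

    public static void main(String[] args) {
        for (String name : new String[] { "CP1252", "Cp1252", "windows-1252" }) {
            System.out.println(name + " -> " + canonicalName(name));
        }
    }
}
```

Charset names are not case-sensitive, so the spelling variants above should all resolve to the same charset on a JVM that supports it.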


redpineseed
Re: setting encoding [ In reply to ]
Actually, there is no need to set an encoding. I only need to read the files using
the proper decoding, and then Lucene stores them in its index properly, so when I retrieve
docs, they are proper strings with accented letters.

I thought it couldn't be so simple. The whole thing is in reading and decoding;
Lucene takes care of the rest.
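
A sketch of the "reading and decoding" half, assuming the whole file is wanted as one String. The class name, method name, and sample file are hypothetical, and the Lucene indexing call is deliberately left out because that API differs between versions:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.file.Files;
import java.nio.file.Path;

class ReadCp1252File {
    // Reads the whole file, decoding its bytes as Cp1252.
    static String readAll(Path file) throws IOException {
        StringBuilder text = new StringBuilder();
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                Files.newInputStream(file), Charset.forName("windows-1252")))) {
            String line;
            while ((line = in.readLine()) != null) {
                text.append(line).append('\n');
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // Write a tiny Cp1252 file so the sketch is self-contained.
        Path tmp = Files.createTempFile("cp1252_input", ".txt");
        Files.write(tmp, new byte[] { 'c', 'a', 'f', (byte) 0xE9, ' ', (byte) 0x9A, '\n' });
        String text = readAll(tmp);
        Files.delete(tmp);
        // At this point the String would be handed to the indexing code.
        System.out.println(text.length() + " characters decoded");
    }
}
```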

thanks everybody for suggestions

dario




>From: "redpineseed" <redpineseed@telus.net>
>Reply-To: "Lucene Users List" <lucene-user@jakarta.apache.org>
>To: "Lucene Users List" <lucene-user@jakarta.apache.org>
>Subject: Re: setting encoding
>Date: Mon, 20 May 2002 13:29:58 -0700
>
>
>Convert your native encoding to Unicode (UTF-16) with the following lines:
>
>File f = new File("cp1252_input");
>FileInputStream tmp = new FileInputStream(f);
>BufferedReader brin = new BufferedReader(new InputStreamReader(tmp, "CP1252"));
>String inputString = brin.readLine();
>
>I'm not sure the encoding name is exactly CP1252; check that in the Java docs.
>
>
>redpineseed

