Mailing List Archive

UTF-8 well-formedness for SimpleTextCodec
Hi there,

I was recently writing up a short Lucene file format tutorial (
https://msfroh.github.io/lucene-university/docs/DirectoryFileContents.html),
using SimpleTextCodec for educational purposes.

I found that SimpleTextSegmentInfo tries to output the segment ID as raw
bytes, which will often result in malformed UTF-8 output. I wrote a little
fix to output as the text representation of a byte array (
https://github.com/apache/lucene/pull/12897). I noticed that it's a similar
sort of thing with binary doc values (where the bytes get written
directly).
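
To make the problem concrete, here is a minimal sketch (not the code from the PR, just an illustration under my own assumptions). The class name SegmentIdTextSketch and its helper methods are hypothetical; only JDK classes are used. It checks whether a byte array is well-formed UTF-8 with a strict decoder, then round-trips the bytes through a text form via Arrays.toString.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration class; not part of Lucene.
public class SegmentIdTextSketch {

    // Returns true only if the bytes decode as strict UTF-8.
    public static boolean isWellFormedUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
            .onMalformedInput(CodingErrorAction.REPORT)
            .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Text representation of a byte array, e.g. "[-61, 40]" (always plain ASCII).
    public static String toText(byte[] bytes) {
        return Arrays.toString(bytes);
    }

    // Parses the output of toText back into the original bytes.
    public static byte[] fromText(String text) {
        String inner = text.substring(1, text.length() - 1).trim();
        if (inner.isEmpty()) {
            return new byte[0];
        }
        String[] parts = inner.split(",");
        byte[] out = new byte[parts.length];
        for (int i = 0; i < parts.length; i++) {
            out[i] = Byte.parseByte(parts[i].trim());
        }
        return out;
    }

    public static void main(String[] args) {
        // Stand-in for a random segment ID: 0xC3 must be followed by a
        // continuation byte, and 0xFF never appears in UTF-8.
        byte[] id = {(byte) 0xC3, (byte) 0x28, (byte) 0xFF, 0x41};

        System.out.println("raw bytes well-formed: " + isWellFormedUtf8(id));

        String text = toText(id);
        byte[] textBytes = text.getBytes(StandardCharsets.UTF_8);
        System.out.println("text form: " + text);
        System.out.println("text form well-formed: " + isWellFormedUtf8(textBytes));
        System.out.println("round trip ok: " + Arrays.equals(id, fromText(text)));
    }
}
```

The text form is longer than the raw bytes, but it stays readable in any editor and carries the same information, which seems like the right trade-off for a codec meant for humans.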

Is there any general desire for SimpleTextCodec to output well-formed UTF-8
where possible?

Thanks,
Froh
Re: UTF-8 well-formedness for SimpleTextCodec
Hey Michael,

Writing well-formed UTF-8 with the SimpleText format sounds desirable indeed,
e.g. your PR makes sense. I don't think we would want to be heroic about
it, but if we can serialize the same information easily, then it sounds
like something we should do. Thanks for improving SimpleTextCodec!

On Mon, Dec 18, 2023 at 6:01 PM Michael Froh <msfroh@gmail.com> wrote:

> Hi there,
>
> I was recently writing up a short Lucene file format tutorial (
> https://msfroh.github.io/lucene-university/docs/DirectoryFileContents.html),
> using SimpleTextCodec for educational purposes.
>
> I found that SimpleTextSegmentInfo tries to output the segment ID as raw
> bytes, which will often result in malformed UTF-8 output. I wrote a little
> fix to output as the text representation of a byte array (
> https://github.com/apache/lucene/pull/12897). I noticed that it's a
> similar sort of thing with binary doc values (where the bytes get written
> directly).
>
> Is there any general desire for SimpleTextCodec to output well-formed
> UTF-8 where possible?
>
> Thanks,
> Froh
>


--
Adrien