Mailing List Archive: UTF-8 well-formedness for SimpleTextCodec

Hey Michael,

Writing well-formed UTF-8 with SimpleTextCodec sounds desirable indeed,
and your PR makes sense. I don't think we would want to be heroic about
it, but if we can serialize the same information easily, then it sounds
like something we should do. Thanks for improving SimpleTextCodec!
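
To illustrate the problem and the kind of fix discussed in this thread,
here is a minimal Java sketch. It is not the code from the PR: the class
and method names (SegmentIdText, isWellFormedUtf8, toText, fromText) are
hypothetical, and using Arrays.toString as the "text representation of a
byte array" is an assumption made for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper, not part of Lucene.
final class SegmentIdText {

  // Returns true only if the bytes decode as strict UTF-8.
  static boolean isWellFormedUtf8(byte[] bytes) {
    try {
      StandardCharsets.UTF_8.newDecoder()
          .onMalformedInput(CodingErrorAction.REPORT)
          .onUnmappableCharacter(CodingErrorAction.REPORT)
          .decode(ByteBuffer.wrap(bytes));
      return true;
    } catch (CharacterCodingException e) {
      return false;
    }
  }

  // Renders the bytes as ASCII text such as "[1, -1, 127]".
  static String toText(byte[] id) {
    return Arrays.toString(id);
  }

  // Parses the "[1, -1, 127]" form back into the original bytes.
  static byte[] fromText(String text) {
    String inner = text.substring(1, text.length() - 1).trim();
    if (inner.isEmpty()) {
      return new byte[0];
    }
    String[] parts = inner.split(",");
    byte[] out = new byte[parts.length];
    for (int i = 0; i < parts.length; i++) {
      out[i] = Byte.parseByte(parts[i].trim());
    }
    return out;
  }

  public static void main(String[] args) {
    // Arbitrary bytes standing in for a segment ID; 0xC3 0x28 is not valid UTF-8.
    byte[] id = {(byte) 0xC3, (byte) 0x28, (byte) 0xFF, 0x41};
    String text = toText(id);
    System.out.println("raw well-formed:  " + isWellFormedUtf8(id));
    System.out.println("text form:        " + text);
    System.out.println("text well-formed: "
        + isWellFormedUtf8(text.getBytes(StandardCharsets.UTF_8)));
    System.out.println("round trip ok:    " + Arrays.equals(id, fromText(text)));
  }
}
```

The text form is longer than the raw bytes, but it carries the same
information and stays readable in any UTF-8 viewer, which seems like the
right trade-off for a codec meant for debugging and teaching.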

On Mon, Dec 18, 2023 at 6:01 PM Michael Froh <msfroh@gmail.com> wrote:

> Hi there,
>
> I was recently writing up a short Lucene file format tutorial (
> https://msfroh.github.io/lucene-university/docs/DirectoryFileContents.html),
> using SimpleTextCodec for educational purposes.
>
> I found that SimpleTextSegmentInfo tries to output the segment ID as raw
> bytes, which will often result in malformed UTF-8 output. I wrote a little
> fix to output as the text representation of a byte array (
> https://github.com/apache/lucene/pull/12897). I noticed that it's a
> similar sort of thing with binary doc values (where the bytes get written
> directly).
>
> Is there any general desire for SimpleTextCodec to output well-formed
> UTF-8 where possible?
>
> Thanks,
> Froh
>

--
Adrien