Mailing List Archive

Internal Format (Re: Internationalization Toolkit)
Guido van Rossum <guido@CNRI.Reston.VA.US> wrote:
> http://starship.skyport.net/~lemburg/unicode-proposal.txt

Marc-Andre writes:

The internal format for Unicode objects should either use a
Python-specific fixed cross-platform format <PythonUnicode> (e.g.
2-byte little-endian byte order) or a compiler-provided wchar_t
format (if available). Using the wchar_t format will ease embedding
of Python in other Unicode-aware applications, but will also make
internal format dumps platform dependent.

having been there and done that, I strongly suggest
a third option: a 16-bit unsigned integer, in platform
specific byte order (PY_UNICODE_T). along all other
roads lie code bloat and speed penalties...

(besides, this is exactly how it's already done in
unicode.c and what 'sre' prefers...)

</F>
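The third option can be sketched in a few lines of Python: the array type code "H" (an unsigned 16-bit integer in native byte order) stands in for the proposed PY_UNICODE_T. This is an illustrative sketch, not code from unicode.c.

```python
import array
import sys

# sketch of the third option: code points held as 16-bit unsigned
# integers in the platform's native byte order (array type code "H")
units = array.array("H", map(ord, "hej"))
assert units.itemsize == 2        # 16-bit code units
assert units[2] == 0x006A         # 'j' is U+006A

# a raw dump of the buffer is what makes the format platform dependent
raw = units.tobytes()
if sys.byteorder == "little":
    assert raw[:2] == b"h\x00"    # low byte first
else:
    assert raw[:2] == b"\x00h"    # high byte first
```

No byte swapping is ever needed on access, which is where the speed argument comes from; only external dumps differ between platforms.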
Re: Internal Format (Re: Internationalization Toolkit)
Fredrik Lundh wrote:
>
> Guido van Rossum <guido@CNRI.Reston.VA.US> wrote:
> > http://starship.skyport.net/~lemburg/unicode-proposal.txt
>
> Marc-Andre writes:
>
> The internal format for Unicode objects should either use a
> Python-specific fixed cross-platform format <PythonUnicode> (e.g.
> 2-byte little-endian byte order) or a compiler-provided wchar_t
> format (if available). Using the wchar_t format will ease embedding
> of Python in other Unicode-aware applications, but will also make
> internal format dumps platform dependent.
>
> having been there and done that, I strongly suggest
> a third option: a 16-bit unsigned integer, in platform
> specific byte order (PY_UNICODE_T). along all other
> roads lie code bloat and speed penalties...
>
> (besides, this is exactly how it's already done in
> unicode.c and what 'sre' prefers...)

Ok, byte order can cause a speed penalty, so it might be worthwhile
introducing sys.bom (or sys.endianness) and sticking to 16-bit
integers, as you have already done in unicode.h.
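What such a sys.endianness (or sys.bom) attribute could report can be sketched as follows; modern Python spells this sys.byteorder, which the sketch checks itself against.

```python
import struct
import sys

# sketch of a sys.endianness attribute: derive the platform byte order
# from how a native-order ("=") 16-bit integer is laid out in memory
def endianness():
    return "little" if struct.pack("=H", 1)[0] == 1 else "big"

# the attribute Python eventually grew is sys.byteorder; they agree
assert endianness() == sys.byteorder
```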

What I don't like is using wchar_t if available (and then addressing
it as if it were defined as an unsigned integer). IMO, it's better
to define a Python Unicode representation which then gets converted
to whatever wchar_t represents on the target machine.

Another issue is whether to use UCS-2 (as you have done) or UTF-16
(which is what Unicode 3.0 requires)... see my other post
for a discussion.

--
Marc-Andre Lemburg
______________________________________________________________________
Y2000: 51 days left
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
Re: Internal Format (Re: Internationalization Toolkit)
> What I don't like is using wchar_t if available (and then addressing
> it as if it were defined as an unsigned integer). IMO, it's better
> to define a Python Unicode representation which then gets converted
> to whatever wchar_t represents on the target machine.

you should read the unicode.h file a bit more carefully:

...

/* Unicode declarations. Tweak these to match your platform */

/* set this flag if the platform has "wchar.h", "wctype.h" and the
wchar_t type is a 16-bit unsigned type */
#define HAVE_USABLE_WCHAR_H

#if defined(WIN32) || defined(HAVE_USABLE_WCHAR_H)

(this uses wchar_t, and also iswspace and friends)

...

#else

/* Use if you have a standard ANSI compiler, without wchar_t support.
If a short is not 16 bits on your platform, you have to fix the
typedef below, or the module initialization code will complain. */

(this maps iswspace to isspace, for 8-bit characters).

#endif

...

the plan was to use the second solution (using "configure"
to figure out what integer type to use), and its own unicode
database table for the is/to primitives
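The complaint the header comment describes ("the module initialization code will complain") amounts to a size check at startup. A sketch in Python, with a hypothetical function name; the real check would live in the C module's init code:

```python
import array

# sketch of the init-time sanity check: verify that the configured
# integer type really is 16 bits wide, and complain loudly if not
def check_unicode_typedef():
    if array.array("H").itemsize != 2:
        raise SystemError("unicode configuration error: "
                          "'unsigned short' is not 16 bits here")

check_unicode_typedef()  # passes wherever short is a 16-bit type
```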

(iirc, the unicode.txt file discussed this, but that one
seems to be missing from the zip archive).

</F>
Re: Internal Format (Re: Internationalization Toolkit)
Fredrik Lundh wrote:
>
> > What I don't like is using wchar_t if available (and then addressing
> > it as if it were defined as an unsigned integer). IMO, it's better
> > to define a Python Unicode representation which then gets converted
> > to whatever wchar_t represents on the target machine.
>
> you should read the unicode.h file a bit more carefully:
>
> ...
>
> /* Unicode declarations. Tweak these to match your platform */
>
> /* set this flag if the platform has "wchar.h", "wctype.h" and the
> wchar_t type is a 16-bit unsigned type */
> #define HAVE_USABLE_WCHAR_H
>
> #if defined(WIN32) || defined(HAVE_USABLE_WCHAR_H)
>
> (this uses wchar_t, and also iswspace and friends)
>
> ...
>
> #else
>
> /* Use if you have a standard ANSI compiler, without wchar_t support.
> If a short is not 16 bits on your platform, you have to fix the
> typedef below, or the module initialization code will complain. */
>
> (this maps iswspace to isspace, for 8-bit characters).
>
> #endif
>
> ...
>
> the plan was to use the second solution (using "configure"
> to figure out what integer type to use), and its own unicode
> database table for the is/to primitives

Oh, I did read unicode.h, stumbled across the mixed usage
and decided not to like it ;-)

Seriously, I find the second solution, where you use the plain
'unsigned short', much more portable and straightforward. You never
know what the compiler does for isw*(), and it's probably better to
stick to one format for all platforms. Only endianness gets in the
way, but that's easy to handle.
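The "easy to handle" part can be sketched as a one-pass byte swap over a dump written in the opposite order (hypothetical helper, not code from unicode.c):

```python
import struct

# sketch: reading a dump written on a machine with the opposite byte
# order is a single swap pass over the 16-bit units
def swap16(data):
    count = len(data) // 2
    units = struct.unpack("<%dH" % count, data)   # read as little-endian
    return struct.pack(">%dH" % count, *units)    # write as big-endian

assert swap16(b"\x00\x41\x00\x42") == b"\x41\x00\x42\x00"
assert swap16(swap16(b"\x12\x34")) == b"\x12\x34"  # swapping twice is a no-op
```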

So I opt for 'unsigned short'. The encoding used in these 2 bytes
is a different question though. If HP insists on Unicode 3.0, there's
probably no other way than to use UTF-16.
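The practical difference between the two: UCS-2 simply cannot represent code points above U+FFFF, while UTF-16 escapes them as a pair of two 16-bit surrogate units. A sketch of that escape (illustrative helper, not part of the proposal):

```python
# UCS-2 stores one 16-bit unit per character and stops at U+FFFF;
# UTF-16 maps larger code points onto a surrogate pair
def to_utf16_units(cp):
    if cp < 0x10000:
        return [cp]                     # UCS-2 and UTF-16 agree here
    cp -= 0x10000
    return [0xD800 | (cp >> 10),        # high surrogate
            0xDC00 | (cp & 0x3FF)]      # low surrogate

assert to_utf16_units(0x006A) == [0x006A]
assert to_utf16_units(0x10000) == [0xD800, 0xDC00]
assert to_utf16_units(0x1D11E) == [0xD834, 0xDD1E]  # MUSICAL SYMBOL G CLEF
```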

> (iirc, the unicode.txt file discussed this, but that one
> seems to be missing from the zip archive).

It's not in the file I downloaded from your site. Could you post
it here?

--
Marc-Andre Lemburg
______________________________________________________________________
Y2000: 51 days left
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
Re: Internal Format (Re: Internationalization Toolkit)
Fredrik Lundh writes:
> having been there and done that, I strongly suggest
> a third option: a 16-bit unsigned integer, in platform
> specific byte order (PY_UNICODE_T). along all other

I actually like this best, but I understand that there are reasons
for using wchar_t, especially for interfacing with other code that
uses Unicode.
Perhaps someone who knows more about the specific issues of
interfacing via wchar_t can summarize them, or point me to whatever
I've already missed. p-)


-Fred

--
Fred L. Drake, Jr. <fdrake@acm.org>
Corporation for National Research Initiatives