Mailing List Archive

Re: gnupg-1.0.2 patch: LC_CTYPE needs to be imported
On Wed, Aug 16, 2000 at 03:25:04PM +0200, Werner Koch wrote:
> On Tue, 15 Aug 2000, Daniel Resare wrote:
>
> > was translated into US-ASCII by glibc. To correct this problem the
> > LC_CTYPE locale needs to be something other than "C". The included
>
> From the man page:
>
> The setlocale() function is used to set or query the
> program's current locale. If locale is "C" or "POSIX", the
> current locale is set to the portable locale.
>
> If locale is "", the locale is set to the default locale
> which is selected from the environment variable LANG.
>
> On startup of the main program, the portable "C" locale is
> selected as default.
>
> By using "" we use isxxxxx() functions which handle 8 bit characters and
> that is something for which the program is not designed. The problem must
> be somewhere else. Using setlocale() for LC_TIME and _MESSAGES is what we
> actually want (if we have those and are not forced to use LC_ALL).
>
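
A minimal sketch of the setup described in the quote above, assuming
LC_MESSAGES is available (it is a POSIX category, not part of ISO C).
Only LC_TIME and LC_MESSAGES are imported from the environment, and
LC_CTYPE is left alone so that it keeps its startup value "C". The
function name init_locale_categories is made up for illustration and is
not taken from the GnuPG source.

```c
#include <locale.h>

/* Hypothetical helper: import only the categories the program wants. */
static void init_locale_categories(void)
{
#ifdef LC_MESSAGES
    /* Language of translated messages (POSIX, may be missing elsewhere). */
    setlocale(LC_MESSAGES, "");
#endif
    /* Date and time formatting. */
    setlocale(LC_TIME, "");
    /* LC_CTYPE is deliberately not touched, so isalpha() and friends
       keep the behaviour of the portable "C" locale. */
}
```

Calling setlocale(LC_CTYPE, NULL) afterwards only queries the category,
so it can be used to confirm that it is still "C".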

Ok, I'll take this up with the glibc guys. Let me just see if I get this
correctly:

The LC_CTYPE locale category traditionally only affected the
functions in ctype.h (isalpha() and so on), not which characters
can be printed on the screen. GnuPG relies on isalpha() and
friends using the US-ASCII charset when determining what is a
character and what is not. Therefore LC_CTYPE needs to be set
to "C" (the value it happens to have when untouched). With the
new glibc, LC_CTYPE also affects which characters are displayed
on a console. Therefore internationalized messages from GnuPG
get munged into US-ASCII by glibc. Possible solutions to the
problem:

1) Ask the glibc maintainers kindly not to let the LC_CTYPE
affect console output. That behaviour is IMHO unobvious and
should be considered an extension so large that a new LC_
category could be invented for the purpose.

* interesting question: how does this work on other systems?

2) wrap all calls to the ctype.h functions (isalpha() and friends)
in setlocale(LC_CTYPE, "C") (some grep'ing shows 56 occurrences)

3) review all uses of the ctype.h functions and perhaps use the
isascii() function instead where applicable (according to
the manpage isascii() is a BSD and SVID extension and should
be quite widely available)

4) write platform independent replacements of the used ctype.h
functions that check against the US-ASCII charset. (Shouldn't
be difficult)
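
A rough sketch of what option 4 could look like. The ascii_* names are
hypothetical and nothing here is taken from the GnuPG source. The
functions compare against character constants, so they assume an
ASCII-compatible execution character set (they would be wrong on
EBCDIC), and they never consult the locale.

```c
/* Hypothetical ASCII-only classifiers, independent of setlocale(). */
static int ascii_isupper(int c) { return c >= 'A' && c <= 'Z'; }
static int ascii_islower(int c) { return c >= 'a' && c <= 'z'; }
static int ascii_isdigit(int c) { return c >= '0' && c <= '9'; }

static int ascii_isalpha(int c)
{
    return ascii_isupper(c) || ascii_islower(c);
}

static int ascii_isalnum(int c)
{
    return ascii_isalpha(c) || ascii_isdigit(c);
}

/* Space plus the five control characters TAB, LF, VT, FF, CR,
   which are contiguous (9..13) in US-ASCII. */
static int ascii_isspace(int c)
{
    return c == ' ' || (c >= '\t' && c <= '\r');
}

/* Only a..z are mapped; every other value is returned unchanged. */
static int ascii_toupper(int c)
{
    return ascii_islower(c) ? c - 'a' + 'A' : c;
}
```

Because any value outside 0..127 simply falls through every range test,
these are safe to call with a plain (possibly negative) char, which the
ctype.h functions are not.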

If needed, I could volunteer for the work. (If my wife doesn't
kill me *smile*)

cheers
/daniel


--
nuclear cia fbi spy password code president bomb
8D97 F297 CA0D 8751 D8EB 12B6 6EA6 727F 9B8D EC2A
Re: gnupg-1.0.2 patch: LC_CTYPE needs to be imported
I'd forgotten I was on this list, it has been so quiet.

I expect the glibc people can give you a more authoritative answer,
but here's my opinion anyway.

Daniel Resare <noa@metamatrix.se>:

> 1) Ask the glibc maintainers kindly not to let the LC_CTYPE
> affect console output. That behaviour is IMHO unobvious and
> should be considered an extension so large that a new LC_
> category could be invented for the purpose.

I assume that what's happening is that gettext is converting messages
into the locale charset in order to display them. This behaviour is
correct: if you have German messages stored in ISO-8859-1 and the
locale charset is UTF-8, or vice versa, then you have to convert the
messages before displaying them. If the locale charset is US-ASCII
then the conversion is non-reversible. (What's it doing in fact? Does
it change Ä to AE?)

By the way, I think there's no guarantee that the charset of the
portable "C" locale is US-ASCII. Today it usually is, but in the
future it might more often be UTF-8.

It seems to me quite reasonable that the locale charset depends on
LC_CTYPE. You couldn't really have the behaviour of isalpha() being
independent of the locale charset.

> 2) wrap all calls to the ctype.h functions (isalpha() and friends)
> in setlocale(LC_CTYPE, "C") (some grep'ing shows 56 occurrences)

Yuck.

> 3) review all uses of the ctype.h functions and perhaps use the
> isascii() function instead where applicable (according to
> the manpage isascii() is a BSD and SVID extension and should
> be quite widely available)
>
> 4) write platform independent replacements of the used ctype.h
> functions that check against the US-ASCII charset. (Shouldn't
> be difficult)

To me, these solutions look best. You could use a configure test to
choose between them. I assume that either way you're assuming that the
locale charset is compatible with US-ASCII, so GnuPG won't work in an
EBCDIC locale, but who cares.
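
For comparison, a sketch of how option 3 might be combined with a
configure test. HAVE_ISASCII is a hypothetical macro that such a test
would define, and is_ascii_byte and safe_isalpha are made-up names. The
idea is that the locale-dependent isalpha() is only ever reached for
values in the 7-bit range, where an ASCII-compatible locale charset
should give the same answer as the "C" locale.

```c
#include <ctype.h>

/* HAVE_ISASCII: hypothetical result of a configure check.
   Without it, fall back to a plain 7-bit range test. */
#ifdef HAVE_ISASCII
# define is_ascii_byte(c) isascii(c)
#else
# define is_ascii_byte(c) (((c) & ~0x7f) == 0)
#endif

/* The guard short-circuits, so isalpha() never sees 8-bit or
   negative values, whatever LC_CTYPE is set to. */
static int safe_isalpha(int c)
{
    return is_ascii_byte(c) && isalpha(c);
}
```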

Edmund
Re: gnupg-1.0.2 patch: LC_CTYPE needs to be imported
On Wed, Aug 16, 2000 at 04:32:06PM +0100, Edmund GRIMLEY EVANS wrote:
> (What's it doing in fact? Does
> it change Ä to AE?)

Yes, the Swedish characters I have in my translation get mangled:

å -> aa
ä -> ae
ö -> oe

>
> By the way, I think there's no guarantee that the charset of the
> portable "C" locale is US-ASCII. Today it usually is, but in the
> future it might more often be UTF-8.

I think the glibc infopages prove you wrong here, at least if we
work with a system based on ISO C.

(libc.info.gz)Standard Locales:
`"C"'
This is the standard C locale. The attributes and behavior it
provides are specified in the ISO C standard. When your program
starts up, it initially uses this locale by default.


>
> > 2) wrap all calls to the ctype.h functions (isalpha() and friends)
> > in setlocale(LC_CTYPE, "C") (some grep'ing shows 56 occurrences)
>
> Yuck.
>
> > 3) review all uses of the ctype.h functions and perhaps use the
> > isascii() function instead where applicable (according to
> > the manpage isascii() is a BSD and SVID extension and should
> > be quite widely available)
> >
> > 4) write platform independent replacements of the used ctype.h
> > functions that check against the US-ASCII charset. (Shouldn't
> > be difficult)
>
> To me, these solutions look best. You could use a configure test to
> choose between them. I assume that either way you're assuming that the
> locale charset is compatible with US-ASCII, so GnuPG won't work in an
> EBCDIC locale, but who cares.
>

Even though Werner Koch mailed me privately saying 'please no' to
alternative 4, I fail to see the problem with it. The US-ASCII definition
(as found in ISO 646) is set in stone and will never change; it defines
the values of a char (as defined in ISO C) that map to glyphs.
A completely portable, clear, bug-free and efficient implementation of
an isascii() function could be written in about an hour.

Benefits:
1) no dependency on the layout of the C locale (who knows, AIX
or someone might have gotten it wrong)
2) no dependency on the LC_CTYPE setting (I fooled some Red Hat person
into accepting my patch to change LC_CTYPE to "" before I was caught by
Werner). What can happen once usually happens twice.

So, please do, until I (or someone else) have enough time to convert
everything to UTF-8.

cheers/daniel

--
nuclear cia fbi spy password code president bomb
8D97 F297 CA0D 8751 D8EB 12B6 6EA6 727F 9B8D EC2A
Re: gnupg-1.0.2 patch: LC_CTYPE needs to be imported
Daniel Resare <noa@metamatrix.se>:

> > By the way, I think there's no guarantee that the charset of the
> > portable "C" locale is US-ASCII. Today it usually is, but in the
> > future it might more often be UTF-8.
>
> I think the glibc infopages prove you wrong here, at least if we
> work with a system based on ISO C.
>
> (libc.info.gz)Standard Locales:
> `"C"'
> This is the standard C locale. The attributes and behavior it
> provides are specified in the ISO C standard. When your program
> starts up, it initially uses this locale by default.

And what does ISO say about the character set?

I don't have the ISO standard, but I've seen various documents that
seem to be carefully worded so as to allow EBCDIC, e.g.
http://www.opennc.org/onlinepubs/7908799/xbd/charset.html

But clearly you have to use US-ASCII in network protocols, so it can't
be much fun getting most programs to work on an EBCDIC system ...

> so, please do. Until I (or someone else) have time enough to convert
> everything to UTF-8

The usual advice is to use the wchar_t API, let the system libraries
handle it and avoid putting anything UTF-8-specific into the
application.

Edmund
Re: gnupg-1.0.2 patch: LC_CTYPE needs to be imported
On Wed, 16 Aug 2000, Edmund GRIMLEY EVANS wrote:

> The usual advice is to use the wchar_t API, let the system libraries
> handle it and avoid putting anything UTF-8-specific into the

The problem we have is that OpenPGP specifies the use of UTF-8 and
therefore I don't see any reason to assume an unknown encoding. Okay,
there are some output functions for it, just because not all systems
support UTF-8 and, frankly, I don't know how to determine whether a
system supports UTF-8 or how to switch the TTY to UTF-8.

Werner


--
Werner Koch GnuPG key: 621CC013
OpenIT GmbH http://www.OpenIT.de
Re: gnupg-1.0.2 patch: LC_CTYPE needs to be imported
Werner Koch <wk@gnupg.org>:

> > The usual advice is to use the wchar_t API, let the system libraries
> > handle it and avoid putting anything UTF-8-specific into the
>
> The problem we have is that OpenPGP specifies the use of UTF-8 and
> therefore I don't see any reason to assume an unknown encoding. Okay,
> there are some output functions for it, just because not all systems
> support UTF-8 and, frankly, I don't know how to determine whether a
> system supports UTF-8 or how to switch the TTY to UTF-8.

What I wrote about using the wchar_t API concerns any data that is in
the local charset, e.g. terminal input and output.

Where UTF-8 is specified for data that is transmitted between
machines, on a modern system you should be able to convert between
UTF-8 and the locale charset using iconv, and find out what the locale
charset is using nl_langinfo(CODESET).

In case you don't have nl_langinfo(CODESET), or to override it if it's
wrong or uses names that are incompatible with iconv for some stupid
reason, there should be a way for the user to optionally specify the
local charset.
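
A minimal sketch of that lookup, assuming <langinfo.h> and CODESET are
available. The name local_charset and the fallback string "US-ASCII"
are choices made for this example only. Note that nl_langinfo(CODESET)
reflects the environment only after setlocale(LC_CTYPE, "") has been
called; otherwise it reports the codeset of the "C" locale.

```c
#include <langinfo.h>
#include <stddef.h>

/* Hypothetical helper: choose the charset name to pass to iconv_open().
   user_override would come from an option or a config file; NULL or ""
   means "ask the locale". */
static const char *local_charset(const char *user_override)
{
    const char *cs;

    if (user_override && *user_override)
        return user_override;        /* the user knows best */

    cs = nl_langinfo(CODESET);       /* codeset of the current LC_CTYPE */
    return (cs && *cs) ? cs : "US-ASCII";
}
```

The returned pointer may refer to storage owned by the C library, so a
caller that changes the locale later should copy the string first.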

In case you don't have iconv you can include your own simple version
of it. You might do nothing, or replace non-ascii chars by '?', or
include full support for a few popular charsets, as you wish.

Is there a separate development branch of GnuPG, or is development
happening on the same branch as gnupg-1.0.2?

Edmund
Re: gnupg-1.0.2 patch: LC_CTYPE needs to be imported
On Thu, 17 Aug 2000, Edmund GRIMLEY EVANS wrote:

> Where UTF-8 is specified for data that is transmitted between
> machines, on a modern system you should be able to convert between
> UTF-8 and the locale charset using iconv, and find out what the locale
> charset is using nl_langinfo(CODESET).

Quite some time ago, Thomas Roesler told me that there is no portable
way to query the locale charset. It seems that this has changed and
in any case we can do an autoconf check for nl_langinfo and check
whether the iconv implementation regarding UTF-8 is "secure", meaning
not to allow overlong UTF-8 encodings to give a different encoding for
the standard ASCII characters like LF or BS.
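
To make "overlong" concrete: LF is the single byte 0x0A, but a lax
decoder would also accept the two bytes C0 8A for it. The following is
a sketch of a strict check an application could run itself; the
function name is hypothetical, it handles sequences of up to four bytes
only, and it does not reject surrogates or values above U+10FFFF.

```c
#include <stddef.h>

/* Returns 1 if s[0..n) is well-formed UTF-8 in shortest form, else 0. */
static int utf8_is_shortest_form(const unsigned char *s, size_t n)
{
    /* Smallest code point that legitimately needs 1..4 bytes. */
    static const unsigned long min_cp[5] = { 0, 0x0, 0x80, 0x800, 0x10000 };
    size_t i = 0;

    while (i < n) {
        unsigned char b = s[i];
        unsigned long cp;
        size_t len, k;

        if (b < 0x80)                { len = 1; cp = b; }
        else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; }
        else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; }
        else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; }
        else return 0;               /* stray continuation or bad lead byte */

        if (len > n - i)
            return 0;                /* truncated sequence */

        for (k = 1; k < len; k++) {
            if ((s[i + k] & 0xC0) != 0x80)
                return 0;            /* not a continuation byte */
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }

        if (cp < min_cp[len])
            return 0;                /* overlong, e.g. C0 8A for LF */
        i += len;
    }
    return 1;
}
```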

> In case you don't have iconv you can include your own simple version
> of it. You might do nothing, or replace non-ascii chars by '?', or

BTW, do you know what happens in glibc's iconv when an invalid
sequence etc. is encountered? I assume you get one of the error codes
back and then you have to output ? or C-quoted characters.

Well, I think it is time to change GnuPG's simple UTF-8 conversion to
an iconv() based one.

> Is there a separate development branch of GnuPG, or is development
> happening on the same branch as gnupg-1.0.2?

GnuPG 1.1 (CVS head) has been merged with the stable branch and
development should happen there.

Werner


--
Werner Koch GnuPG key: 621CC013
OpenIT GmbH http://www.OpenIT.de
Re: gnupg-1.0.2 patch: LC_CTYPE needs to be imported
On Wed, 16 Aug 2000, Edmund GRIMLEY EVANS wrote:

> But clearly you have to use US-ASCII in network protocols, so it can't
> be much fun getting most programs to work on an EBCDIC system ...

And given the fact that Linux has been ported to OS/390 using ASCII
and doing EBCDIC translation only in some drivers for IBM devices, I
don't see much reason to cope with EBCDIC anymore.

Werner

--
Werner Koch GnuPG key: 621CC013
OpenIT GmbH http://www.OpenIT.de
Re: gnupg-1.0.2 patch: LC_CTYPE needs to be imported
Werner Koch <wk@gnupg.org>:

> > Where UTF-8 is specified for data that is transmitted between
> > machines, on a modern system you should be able to convert between
> > UTF-8 and the locale charset using iconv, and find out what the locale
> > charset is using nl_langinfo(CODESET).
>
> Quite some time ago, Thomas Roesler told me that there is no portable
> way to query the locale charset. It seems that this has changed and
> in every case we can do an autoconf check for nl_langinfo and check
> whether the iconv implementation regarding UTF-8 is "secure", meaning
> not to allow overlong UTF-8 encodings to give a different encoding for
> the standard ASCII characters like LF or BS.

Is that check definitely necessary, or are you just being extra careful?

> BTW, do you know what happens in glibc's iconv when an invalid
> sequence etc. is encountered? I assume you get one of the error codes
> back and then you have to output ? or C-quoted characters.

An invalid sequence gives EILSEQ. According to the spec, a valid input
sequence should always be converted by iconv, even if the conversion
is non-reversible. In practice, every implementation of iconv sometimes
returns EILSEQ in this case, too. The implementation in glibc-2.1
never converts non-reversibly and also gives a different return value
from the spec. If you're lucky, you can ignore these implementation
annoyances and always do the same thing when you get EILSEQ: output
the original octet as ? or quoted and advance the input pointer.
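
A sketch of that loop, assuming an iconv() with the common prototype
that takes char ** for the input (some systems declare it as
const char **). The name convert_lossy is made up. What iconv does with
a valid but unconvertible character differs between implementations, as
described above, so the exact replacement output is not guaranteed to
be the same everywhere.

```c
#include <errno.h>
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: convert in[0..inlen) using an already opened
   descriptor cd, writing at most outsize-1 bytes plus a NUL to out.
   Returns the number of bytes written, or (size_t)-1 if outsize is 0. */
static size_t convert_lossy(iconv_t cd, const char *in, size_t inlen,
                            char *out, size_t outsize)
{
    char *ip = (char *)in;   /* iconv() does not modify the input bytes */
    char *op = out;
    size_t il = inlen, ol;

    if (outsize == 0)
        return (size_t)-1;
    ol = outsize - 1;        /* keep room for the terminating NUL */

    while (il > 0) {
        if (iconv(cd, &ip, &il, &op, &ol) != (size_t)-1)
            break;           /* everything was converted */

        if (errno == EILSEQ || errno == EINVAL) {
            /* Invalid, unconvertible or truncated input: emit '?' for
               one octet and advance the input pointer past it. */
            if (ol == 0)
                break;
            *op++ = '?';
            ol--;
            ip++;
            il--;
        } else {
            break;           /* E2BIG (output full) or another error */
        }
    }
    *op = '\0';
    return (size_t)(op - out);
}
```

A caller would obtain cd from something like
iconv_open("US-ASCII", "UTF-8") and release it with iconv_close().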

I discovered the following with Mutt, which might or might not be
relevant for you:

dnl (2) In glibc-2.1.2 and earlier there is a bug that messes up ob and
dnl obl when args 2 and 3 are 0 (fixed in glibc-2.1.3).

Edmund
Re: gnupg-1.0.2 patch: LC_CTYPE needs to be imported
On Thu, 17 Aug 2000, Edmund GRIMLEY EVANS wrote:

> > not to allow overlong UTF-8 encodings to give a different encoding for
> > the standard ASCII characters like LF or BS.
>
> Is that check definitely necessary, or are you just being extra careful?

Yes. Otherwise I wouldn't need the print_string functions, which are used
to filter such things out. Assuming the user sits at some standard
terminal, you can create GPG messages which fake the output: e.g. you
apply a faked user ID to a key and, by using control sequences, you
overwrite the warning GPG gives, or you use the control sequences in
notation data to replace GnuPG's BAD SIGNATURE message with "Good
signature".
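
As an illustration of the kind of filtering meant here (this is a
sketch with a made-up name, not GnuPG's actual print_string code):
every byte outside printable ASCII, and the backslash itself, is
written as \xNN, so that ESC, CR, BS and friends reach the terminal
only as harmless text. It is deliberately conservative and also escapes
all 8-bit bytes, including valid UTF-8.

```c
#include <stddef.h>

/* Hypothetical filter: copy s[0..n) to out, escaping anything that is
   not printable ASCII as \xNN. Always NUL-terminates when outsize > 0
   and returns the number of bytes written (excluding the NUL). */
static size_t sanitize_for_tty(const unsigned char *s, size_t n,
                               char *out, size_t outsize)
{
    static const char hex[] = "0123456789abcdef";
    size_t i, used = 0;

    if (outsize == 0)
        return 0;

    for (i = 0; i < n; i++) {
        unsigned char c = s[i];

        if (c >= 0x20 && c < 0x7F && c != '\\') {
            if (outsize - used < 2)      /* char + NUL */
                break;
            out[used++] = (char)c;
        } else {
            if (outsize - used < 5)      /* "\xNN" + NUL */
                break;
            out[used++] = '\\';
            out[used++] = 'x';
            out[used++] = hex[c >> 4];
            out[used++] = hex[c & 0x0F];
        }
    }
    out[used] = '\0';
    return used;
}
```

With this, a user ID containing ESC [ 2 K followed by CR would be shown
with those bytes spelled out instead of erasing the line above it.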

There are probably a lot more attacks possible. Bruce Schneier talked
about such issues in one of his last CryptoGrams and Markus Kuhn gave
additional information in the last CryptoGram.

> annoyances and always do the same thing when you get EILSEQ: output
> the original octet as ? or quoted and advance the input pointer.

Okay, I will see whether I can get this into the next release.

> dnl (2) In glibc-2.1.2 and earlier there is a bug that messes up ob and
> dnl obl when args 2 and 3 are 0 (fixed in glibc-2.1.3).

Thanks.

Werner


--
Werner Koch GnuPG key: 621CC013
OpenIT GmbH http://www.OpenIT.de