Mailing List Archive

Japanese and UTF8
Hi,

now that we have a Japanese translation, we have to do a conversion from
EOC_JP to UTF-8, because UTF-8 is the required encoding for user IDs
and some other strings in OpenPGP.

I don't think that the currently used simple mapping approach works
with that character set, because it is a simple one-to-one mapping and
I expect that EOC_JP uses state shifting.

What is a portable way to do this conversion? I had some talks about
that in Tokyo, and it boiled down to letting the OS/libc do it. Okay, how?


Werner
Re: Japanese and UTF8 [ In reply to ]
Werner Koch writes:
>I expect that EOC_JP uses state shifting.

I see you meant EUC-JP :-).

At the single-octet level, it has state shifting.
At the multi-octet (character) level, it doesn't.

* EUC uses two kinds of single-shift characters. They are
SS2R and SS3R, which are coded at 8/14 and 8/15 respectively.
In this sense, it has state shifting.

* OTOH,
- The ASCII character set (strictly speaking, it may be the
Latin alphabetic character set of JIS X 0201, but there
is no big difference between the two) is ALWAYS designated in
the G0 element. G0 is ALWAYS invoked in the GL area.
- The 2-byte KANZI character set is ALWAYS designated in the G1
element. A sequence of two GR bytes without a leading single
shift invokes G1.
- The KATAKANA character set is ALWAYS designated in the G2
element. A sequence of SS2R and one GR byte invokes G2.
- The 2-byte supplementary KANZI character set is ALWAYS
designated in the G3 element. A sequence of SS3R and one GR
byte invokes G3.
- Other sequences are illegal.
In this sense, it doesn't have state shifting, and thus the
logic can be hard-coded (if you want :-).
Roughly, the following is the code:

    if (isascii(c)) {
        /* bit pattern 0xxx xxxx */
        Frob_Ascii(c);
    } else switch (0xff & c) {
    case 0x8e:
        /* bit pattern 1000 1110 1xxx xxxx */
        Get_one_more_byte_and_frob_it_as_KANA();
        break;
    case 0x8f:
        /* bit pattern 1000 1111 1xxx xxxx 1xxx xxxx */
        Get_two_more_bytes_and_frob_them_as_supplementary_KANZI();
        break;
    default:
        if (0xa0 < c && c < 0xff) {
            /* bit pattern 1xxx xxxx 1xxx xxxx */
            Get_one_more_byte_and_frob_them_as_KANZI(c);
        } else {
            Alert_Error(c);
        }
    }
--
iida
Re: Japanese and UTF8 [ In reply to ]
Werner Koch <wk@gnupg.org>:

> now that we have a Japanese translation, we have to do a conversion from
> EOC_JP to UTF-8, because UTF-8 is the required encoding for user IDs
> and some other strings in OpenPGP.
>
> I don't think that the currently used simple mapping approach works
> with that character set, because it is a simple one-to-one mapping and
> I expect that EOC_JP uses state shifting.

Probably.

> What is a portable way to this conversion? I had some talks about
> that in Tokyo and it boiled down to let the OS/libc do it. Okay, how?

The official API uses iconv_open, iconv and iconv_close and is defined
in iconv.h. The version in glibc-2.1 doesn't do Japanese and deviates
from the standard. I hope glibc-2.2 will have a more correct and
complete implementation. Bruno Haible has a portable libiconv that
provides the same functions and does do Japanese. (I'm using it now,
linked with mutt, on a glibc-2.1 machine.)

There's concise info and relevant links at:

ftp://ftp.ilog.fr/pub/Users/haible/utf8/Unicode-HOWTO-5.html#ss5.1
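For the record, here is a minimal sketch of that API, assuming an iconv
implementation that knows the "EUC-JP" and "UTF-8" names (libiconv does;
glibc-2.1's built-in iconv may not). The function name is my own:

```c
#include <iconv.h>
#include <string.h>
#include <sys/types.h>

/* Convert a NUL-terminated EUC-JP string to UTF-8 with the standard
 * iconv_open/iconv/iconv_close API.  Returns the number of bytes
 * written to outbuf, or -1 on error (unsupported pair, bad input,
 * or output buffer too small). */
ssize_t euc_jp_to_utf8(const char *in, char *outbuf, size_t outsize)
{
    iconv_t cd = iconv_open("UTF-8", "EUC-JP");
    if (cd == (iconv_t)-1)
        return -1;

    char *inptr = (char *)in;       /* iconv wants char**, not const */
    size_t inleft = strlen(in);
    char *outptr = outbuf;
    size_t outleft = outsize;

    /* iconv() advances the pointers and decrements the counts. */
    size_t rc = iconv(cd, &inptr, &inleft, &outptr, &outleft);
    iconv_close(cd);
    return rc == (size_t)-1 ? -1 : (ssize_t)(outptr - outbuf);
}
```

For example, the EUC-JP bytes 0xC6 0xFC (one kanji) should come out as
three UTF-8 bytes.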

If you're going to use iconv, then you might want to get rid of the
charset tables in util/strgutil.c. On the other hand, you might want
to do what I did with mutt: leave them in but use them only if
configure fails to detect iconv. I can send you my configure.in for
mutt, if you want; it's mostly adapted from Bruno Haible's
configure.in for clisp, if I remember correctly. You probably know
more about autoconf and can tell me what I did wrong ...

Edmund
Re: Japanese and UTF8 [ In reply to ]
Werner Koch <wk@gnupg.org> writes:
>that in Tokyo and it boiled down to let the OS/libc do it. Okay, how?

The libc does have conversion functions, but we need to
specify from which encoding to which.
The inside of OpenPGP is ALWAYS UTF-8, so we are happy.
The outside of OpenPGP may vary depending on the environment.

I see there are some options.

* To assume outside is also UTF-8 and we don't convert at
all.
* To assume outside is also UTF-8, but we do convert from/to
printable ASCII (with technique such as in RFC 2253).
* To assume outside is always fixed charset, say
ISO-2022-JP (or EUC-JP or whatever).
* To assume that the charset is specified explicitly.
--
iida
Re: Japanese and UTF8 [ In reply to ]
On Thu, 17 Feb 2000, Edmund GRIMLEY EVANS wrote:

> The official API uses iconv_open, iconv and iconv_close and is defined
> in iconv.h. The version in glibc-2.1 doesn't do Japanese and deviates

Okay.

> ftp://ftp.ilog.fr/pub/Users/haible/utf8/Unicode-HOWTO-5.html#ss5.1

Thanks.

> to do what I did with mutt: leave them in but use them only if
> configure fails to detect iconv. I can send you my configure.in for

Probably this is what I will do.

> mutt, if you want; it's mostly adapated from Bruno Haible's
> configure.in for clisp, if I remember correctly. You probably know
> more about aoutconf and can tell me what I did wrong ...

Is this in the current mutt CVS version?


Werner
Re: Japanese and UTF8 [ In reply to ]
On Thu, 17 Feb 2000, IIDA Yosiaki wrote:

> * To assume outside is also UTF-8 and we don't convert at
> all.

That is easy. I simply add a dummy --charset utf8 which does not do
any conversion.

> * To assume outside is also UTF-8, but we do convert from/to
> printable ASCII (with technique such as in RFC 2253).

Do you mean to escape all non-7bit characters like "\dd"?
I wonder why they don't suggest using "\xdd".

> * To assume outside is always fixed charset, say
> ISO-2022-JP (or EUC-JP or whatever).
> * To assume that the charset is specified explicitly.

You now have to do this using the --charset option, which defaults to
latin-1. Therefore I should use libiconv.

There is still one problem: are there any control characters outside
of the 0..127 range? I assume yes, and then I need a way to test for
them. For security reasons we can't print any data without checking
first. Hmmm, the second option seems to be best for this, but then
you won't see any Japanese characters :-(. I'd better go and read
something about libiconv.


Werner
Re: Japanese and UTF8 [ In reply to ]
Werner Koch <wk@gnupg.org> writes:
>>* To assume outside is also UTF-8, but we do convert from/to
>>printable ASCII (with technique such as in RFC 2253).
>Do you mean to escape all non-7bit characters like "\dd"?
>I wonder why they don't suggest using "\xdd".

I don't stick to \dd. The point is that the outside is
ASCII physically, but it can be interpreted as another
charset logically. So I suggest \xdd also.
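To illustrate (my own sketch; the function name and the exact escaping
rules are invented for the example, not part of any proposal), such a
\xdd transfer encoding works on raw bytes and so applies regardless of
which charset those bytes are in:

```c
#include <stdio.h>
#include <string.h>

/* Escape a string into printable ASCII, turning every byte outside
 * the printable 7-bit range (plus the backslash itself) into \xdd.
 * dst must have room for up to 4 bytes per input byte plus a NUL. */
void escape_xdd(const char *src, char *dst)
{
    for (; *src; src++) {
        unsigned char c = (unsigned char)*src;
        if (c >= 0x20 && c < 0x7f && c != '\\')
            *dst++ = c;                         /* pass printable ASCII */
        else
            dst += sprintf(dst, "\\x%02x", c);  /* escape the rest */
    }
    *dst = '\0';
}
```

So "a" followed by the EUC-JP bytes 0xC6 0xFC would become the pure
ASCII string a\xc6\xfc, which any terminal can display safely.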

>There is still one problem: are there any control characters outside
>of the 0..127 range? I assume yes, and then I need a way to test for

128 through 159 are C1 controls in most environments
(but not all )-: .
--
iida
Re: Japanese and UTF8 [ In reply to ]
Werner Koch <wk@gnupg.org>:

> > to do what I did with mutt: leave them in but use them only if
> > configure fails to detect iconv. I can send you my configure.in for
>
> Probably this is what I will do.
>
> > mutt, if you want; it's mostly adapted from Bruno Haible's
> > configure.in for clisp, if I remember correctly. You probably know
> > more about autoconf and can tell me what I did wrong ...
>
> Is this in the current mutt CVS version?

No. Thomas Roessler wants to get another stable release out first, I
think. My patch and a description of how to use it are at
http://www.rano.org/mutt.html. (It doesn't apply cleanly to the
current CVS version, but only for unimportant reasons; it should work
against CVS from a week or so ago.)

Edmund
Re: Japanese and UTF8 [ In reply to ]
Werner Koch <wk@gnupg.org> writes:
>now that we have a Japanese translation, we have to do a conversion from
>EOC_JP to UTF-8, because UTF-8 is the required encoding for user IDs
>and some other strings in OpenPGP.

When I last replied to this message, maybe I was
too excited and missed some points.

As this list is named gnupg-i18n, it is for internationalization.
Internationalization is a process of generalization. OTOH,
Japanization and other localizations are processes of
specialization, facing in the very opposite direction.

So my first missed point is this:
In this list, do we want to discuss localizations as
well as internationalization?

>Do you mean to escape all non-7bit characters like "\dd"?
...
>You now have to do this using the --charset option, which defaults to
>latin-1. Therefore I should use libiconv.

My second missed point: escapings like \dd, \xdd and others come
to me as the idea of a --transfer-encoding option. And specifying
a charset and specifying a transfer encoding are two
DIFFERENT things. You may want to use \xdd even when you
are using the default Latin-1 charset, and thus we'd better not
confuse these two.
--
iida
Re: Japanese and UTF8 [ In reply to ]
Werner Koch <wk@gnupg.org>:

> > * To assume outside is always fixed charset, say
> > ISO-2022-JP (or EUC-JP or whatever).
> > * To assume that the charset is specified explicitly.
>
> You now have to do this using the --charset option, which defaults to
> latin-1. Therefore I should use libiconv.

There is a function that lets you discover the charset of the current
locale. This would be a better default than iso-8859-1, when the
function is available.
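If it helps, I believe the function meant here is nl_langinfo(CODESET);
a minimal sketch, assuming a libc that provides it:

```c
#include <langinfo.h>
#include <locale.h>

/* Discover the charset of the current locale.  nl_langinfo(CODESET)
 * is in SUSv2 but, as noted above, not available everywhere, so a
 * --charset override is still needed for portability. */
const char *locale_charset(void)
{
    setlocale(LC_CTYPE, "");      /* switch from "C" to the user's locale */
    return nl_langinfo(CODESET);  /* e.g. "UTF-8", "EUC-JP", "ISO-8859-1" */
}
```

configure would have to check for langinfo.h and fall back to the
--charset default when it is missing.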

The trouble is that this function is not very widely available, and
some people have broken locales, so you probably also need a --charset
option that overrides the locale, if you want gnupg to be portable and
easy to install.

> There is still one problem: are there any control characters outside
> of the 0..127 range? I assume yes, and then I need a way to test for
> them. For security reasons we can't print any data without checking
> first.

I think the officially correct way to remove control characters from a
string before printing it is to use mbtowc and iswprint. Both of these
functions take account of the locale.

As usual, you would have to supply your own simplified version for use
where configure fails to find them in a library.
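A sketch of that approach (the function name is mine, and it assumes
the caller has already activated the locale with setlocale):

```c
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Overwrite unprintable characters in a multibyte string with '?',
 * using mbtowc and iswprint as described above.  Locale-dependent:
 * the caller should have done setlocale(LC_CTYPE, "") first. */
void sanitize_mb(char *s)
{
    mbtowc(NULL, NULL, 0);             /* reset any shift state */
    while (*s) {
        wchar_t wc;
        int n = mbtowc(&wc, s, MB_CUR_MAX);
        if (n <= 0) {                  /* invalid sequence: blot one byte */
            *s++ = '?';
            mbtowc(NULL, NULL, 0);
        } else if (!iswprint((wint_t)wc)) {
            memset(s, '?', n);         /* blot out the whole character */
            s += n;
        } else {
            s += n;
        }
    }
}
```

In the plain "C" locale this degrades gracefully to a per-byte check,
which is roughly what the fallback version would do.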

Do you have to line non-ascii things up in columns at all? I need that
in mutt. I use wcwidth to tell me how many character cells a character
will occupy on the display. For a printable character, the result is
0, 1 or 2. Since wcwidth returns -1 for a non-printable character
(except null), you don't need iswprint when you're using wcwidth. If I
remember correctly, wcwidth is in glibc-2.2, but it's not part of the
UNIX98 standard. In mutt I supply my own definition, copied from
Markus Kuhn.

You probably don't need wcwidth.

int mbtowc(wchar_t *pwc, const char *s, size_t n);
int iswprint(wint_t wc);

int wcwidth (wchar_t wc);

Edmund