Mailing List Archive

Feature request: Kana entities
This feature should be easy to do.
Unfortunately my PHP knowledge is limited, so I think
it will be better if I just ask for it instead of trying to do it myself :)

Using Japanese characters in non-Japanese wikipedias is currently hard.
One has to write them as &#xHEXCODE; or &#DECIMALCODE;

I think that it would be much better if the parser were able to parse
fake kana &entities; (at least basic kana; full kanji would be much more work)
and convert them to numeric codes.

So one can write &hiragana_wa; or &katakana_chi;
This isn't likely to conflict with anything.

Kana Unicode table (in "English") is on http://pl.wikipedia.com/wiki.cgi?Kana

Entities that would be needed:
* Full hiragana ぁ to ゔ
* Full katakana ァ to ヺ
* Prolongation mark ー

Proposed names:
* &hiragana_x; &hiragana_smallx;
* &katakana_x; &katakana_smallx;
* &kana_long;
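
As a rough illustration of the proposal above, here is a minimal Python sketch (not the wiki's actual PHP parser) that builds a table of kana entity names and expands them to decimal numeric references. The function names `build_kana_entities` and `expand_kana_entities` are hypothetical, and deriving the `x` part of each name from the Unicode character name is an assumption made only for this sketch. One consequence of that assumption: Unicode names use spellings such as TI and TU, so the sketch yields &katakana_ti; rather than &katakana_chi;.

```python
import re
import unicodedata


def build_kana_entities():
    """Map entity names such as 'hiragana_wa' to Unicode code points.

    Assumption for this sketch: the name suffix is taken from the
    Unicode character name ("HIRAGANA LETTER SMALL A" -> "smalla").
    """
    table = {}
    ranges = {
        "hiragana": (0x3041, 0x3094),  # small a .. vu
        "katakana": (0x30A1, 0x30FA),  # small a .. vo
    }
    for script, (first, last) in ranges.items():
        prefix = script.upper() + " LETTER "
        for cp in range(first, last + 1):
            name = unicodedata.name(chr(cp))
            suffix = name[len(prefix):].lower().replace(" ", "")
            table[f"{script}_{suffix}"] = cp
    table["kana_long"] = 0x30FC  # prolonged sound mark
    return table


KANA_ENTITIES = build_kana_entities()
ENTITY_RE = re.compile(r"&([a-z_]+);")


def expand_kana_entities(text):
    """Replace known kana entities with &#DECIMAL; and leave the rest alone."""
    def repl(match):
        cp = KANA_ENTITIES.get(match.group(1))
        return f"&#{cp};" if cp is not None else match.group(0)
    return ENTITY_RE.sub(repl, text)
```

Under these assumptions, `expand_kana_entities("&hiragana_wa;")` gives `&#12431;`, while standard entities such as `&amp;` pass through unchanged because they are not in the table.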

I also think that it might be a good idea to extend it to other writing systems
in the future.

Is it possible?
Re: Feature request: Kana entities
This is a neat idea, but...

> So one can write &hiragana_wa; or &katakana_chi;
> This isn't likely to conflict with anything.
>
> Kana Unicode table (in "English") is on http://pl.wikipedia.com/wiki.cgi?Kana

In general, why would it ever be common that one would wish to put
Kana into non-Japanese wikipedias?

At least at first thought (and I'm totally open to have my mind
changed!) the only place in an English or Polish wikipedia where
showing Kana would make sense would be an article _about_ Kana.

--Jimbo
Re: Feature request: Kana entities
On Thu, Mar 07, 2002 at 03:59:34PM -0800, Jimmy Wales wrote:
> This is a neat idea, but...
>
> > So one can write &hiragana_wa; or &katakana_chi;
> > This isn't likely to conflict with anything.
> >
> > Kana Unicode table (in "English") is on http://pl.wikipedia.com/wiki.cgi?Kana
>
> In general, why would it ever be common that one would wish to put
> Kana into non-Japanese wikipedias?
>
> At least at first thought (and I'm totally open to have my mind
> changed!) the only place in an English or Polish wikipedia where
> showing Kana would make sense would be an article _about_ Kana.
>
> --Jimbo

Just see articles about anything Japanese on English Wikipedia.
They contain Japanese names of everything.
Re: Feature request: Kana entities
On Thu, 2002-03-07 at 22:07, Tomasz Wegrzanowski wrote:
> On Thu, Mar 07, 2002 at 03:59:34PM -0800, Jimmy Wales wrote:
> > This is a neat idea, but...
> >
> > > So one can write &hiragana_wa; or &katakana_chi;
> > > This isn't likely to conflict with anything.
> > >
> > > Kana Unicode table (in "English") is on http://pl.wikipedia.com/wiki.cgi?Kana
> >
> > In general, why would it ever be common that one would wish to put
> > Kana into non-Japanese wikipedias?
> >
> > At least at first thought (and I'm totally open to have my mind
> > changed!) the only place in an English or Polish wikipedia where
> > showing Kana would make sense would be an article _about_ Kana.
>
> Just see articles about anything Japanese on English Wikipedia.
> They contain Japanese names of everything.

Sure, but more often kanji than kana, so special kana markup wouldn't be
that big a win. See the thread "International Upgrades"; the vague plan
is to standardise the internal character set and present the wikipedias
in Unicode to capable browsers. (Please comment!)

As a result, we should be able to use the customary input methods or
cut-n-paste to put any characters into any of the wikis, which is
certainly a lot easier than looking up entities or running text through
a UTF-8-to-entities convertor (which is what I currently do).

-- brion vibber (brion @ pobox.com)
Re: Feature request: Kana entities
On Thu, Mar 07, 2002 at 11:34:53PM -0800, Brion L. VIBBER wrote:
> On Thu, 2002-03-07 at 22:07, Tomasz Wegrzanowski wrote:
> > Just see articles about anything Japanese on English Wikipedia.
> > They contain Japanese names of everything.
>
> Sure, but more often kanji than kana, so special kana markup wouldn't be
> that big a win. See the thread "International Upgrades"; the vague plan
> is to standardise the internal character set and present the wikipedias
> in Unicode to capable browsers. (Please comment!)

Uhm, right. But most non-Japanese people don't know too many kanji,
so kanji aren't that important. ;) On the other hand, more people than
is usually thought know kana, so it might be beneficial for them.

Hmmm. Now I think that some general method would be more useful:
&katakana_a; &kanji_b; &hebrew_c; or &cyrilic_d;

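I think that it won't need too many changes in the parser.
Perl code:
Init:

# map each wiki entity name to the character it stands for
%Entities = ('&katakana_o;' => 'オ',
...
);

On HTML output:

# replace known entities; leave anything else (e.g. &amp;) untouched
s/(&[a-zA-Z0-9_]+;)/exists $Entities{$1} ? $Entities{$1} : $1/eg;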

> As a result, we should be able to use the customary input methods or
> cut-n-paste to put any characters into any of the wikis, which is
> certainly a lot easier than looking up entities or running text through
> a UTF-8-to-entities convertor (which is what I currently do).
>
> -- brion vibber (brion @ pobox.com)

Hmmm. Wouldn't that need some modifications to browsers?
Re: Feature request: Kana entities
I wrote:
> > In general, why would it ever be common that one would wish to put
> > Kana into non-Japanese wikipedias?
> >
> > At least at first thought (and I'm totally open to have my mind
> > changed!) the only place in an English or Polish wikipedia where
> > showing Kana would make sense would be an article _about_ Kana.

Tomasz Wegrzanowski wrote:
> Just see articles about anything Japanese on English Wikipedia.
> They contain Japanese names of everything.

But shouldn't these Japanese names generally be written in the Roman
alphabet (Romaji), not in Kana? If I open up an Encyclopedia
Britannica article about 'anime' or 'sushi' or 'Hirohito' or 'Konoe
Fumimaro' I don't expect to see kana, but Romaji.

I'm not a real stickler on this point; as I say, I could be convinced.
I'm just saying that it strikes me as fairly odd to put Kana or Kanji
character sets into other languages, except in some very special
cases.

--Jimbo
Re: Feature request: Kana entities
Brion L. VIBBER wrote:
> Sure, but more often kanji than kana, so special kana markup wouldn't be
> that big a win. See the thread "International Upgrades"; the vague plan
> is to standardise the internal character set and present the wikipedias
> in Unicode to capable browsers. (Please comment!)

Really? There are kanji in articles about Japan? I mean, articles
other than articles about the language or other special cases?

That seems odd to me. I'm not opposed to it, necessarily, but it
seems very odd. I mean, there's no reason to expect that kanji will
be useful to the vast majority of readers.

Can you send some examples?

--Jimbo
Re: Feature request: Kana entities
O.k., I'm starting to see the light on this.

Brion L. VIBBER wrote:
> > I'm not a real stickler on this point; as I say, I could be convinced.
> > I'm just saying that it strikes me as fairly odd to put Kana or Kanji
> > character sets into other languages, except in some very special
> > cases.
>
> What other special case could there be than "something originating in
> culture X, here's its real name in the language of X in case you can
> read X and want to look up more information or, heck, are just curious".

Oh, the special cases I had in mind were articles _about_ kana or kanji, or in
cases where the kana or kanji are likely to be well-known in a certain context.

I don't know much about 'anime', for example, but I imagine that fans
of anime are familiar with the kana for 'a ni me'. New people
interested in that area may have seen those kana around but not yet
grasped what they mean. So they'd be excited to find out by reading
our 'a ni me' article.

That's different from just sticking a kanji in after someone's name.

But, now I'm starting to see the light. So long as this is just
presented as parenthetical information, there's no harm and it could
be very useful. Take the Sushi article as an example. Someone could
use it to become familiar with the kanji for different things like
nigirisushi, and then have more fun the next time at a Japanese
restaurant.

Boku wa nihongo no gakusei. De mo watashi no nihongo joozu de wa
arimasen. "I am a Japanese language student. But, my Japanese
language is not proficient."

So consider me totally converted on this point.

(However, my Konqueror browser does not render these characters at all.)

--Jimbo
Re: Feature request: Kana entities
On Fri, 2002-03-08 at 09:18, Jimmy Wales wrote:
> Tomasz Wegrzanowski wrote:
> > Just see articles about anything Japanese on English Wikipedia.
> > They contain Japanese names of everything.
>
> But shouldn't these Japanese names generally be written in the Roman
> alphabet (Romaji), not in Kana? If I open up an Encyclopedia
> Britannica article about 'anime' or 'sushi' or 'Hirohito' or 'Konoe
> Fumimaro' I don't expect to see kana, but Romaji.

Bring up the wikipedia article on [[Miyazaki Hayao]] (or, for that
matter, [[Sushi]]) for an example of what we're talking about.
Kanji/kana are provided as supplementary parenthetical information,
while the main text uses the English name and, if different, the Romaji
form.

> I'm not a real stickler on this point; as I say, I could be convinced.
> I'm just saying that it strikes me as fairly odd to put Kana or Kanji
> character sets into other languages, except in some very special
> cases.

What other special case could there be than "something originating in
culture X, here's its real name in the language of X in case you can
read X and want to look up more information or, heck, are just curious".

-- brion vibber (brion @ pobox.com)
Re: Feature request: Kana entities
On Fri, Mar 08, 2002 at 09:20:16AM -0800, Jimmy Wales wrote:
> Brion L. VIBBER wrote:
> > Sure, but more often kanji than kana, so special kana markup wouldn't be
> > that big a win. See the thread "International Upgrades"; the vague plan
> > is to standardise the internal character set and present the wikipedias
> > in Unicode to capable browsers. (Please comment!)
>
> Really? There are kanji in articles about Japan? I mean, articles
> other than articles about the language or other special cases?
>
> That seems odd to me. I'm not opposed to it, necessarily, but it
> seems very odd. I mean, there's no reason to expect that kanji will
> be useful to the vast majority of readers.
>
> Can you send some examples?

Murasaki Shikibu
Anime
Princess Mononoke
Neon Genesis Evangelion
Miyazaki Hayao
(and a lot more)

In fact, the majority of Japan-related articles have some kanji.
Re: Feature request: Kana entities
On Fri, 2002-03-08 at 09:20, Jimmy Wales wrote:
> Brion L. VIBBER wrote:
> > Sure, but more often kanji than kana, so special kana markup wouldn't be
> > that big a win. See the thread "International Upgrades"; the vague plan
> > is to standardise the internal character set and present the wikipedias
> > in Unicode to capable browsers. (Please comment!)
>
> Really? There are kanji in articles about Japan?

Yeeessss.... You have such a difficult time accepting this. :)

> I mean, articles
> other than articles about the language or other special cases?

Define "special cases".

> That seems odd to me. I'm not opposed to it, necessarily, but it
> seems very odd. I mean, there's no reason to expect that kanji will
> be useful to the vast majority of readers.

No, but there's no reason to expect that any particular *article* will
be useful to the vast majority of readers for that matter.

A few question marks or boxes in parentheses aren't going to drive
non-Japanese readers mad with one look, but for those who *do* know it,
they *do* get more information because they now can recognize the term
in Japanese text, or usefully look it up in Japanese informational
resources.

The English wikipedia isn't just for English monolinguals, is it?

> Can you send some examples?

From a quick search...

Nagano, Japan
Japan/Meiji
Emperor Akihito of Japan
Emperor Jimmu of Japan
Satsuma
Okinawa
Hideki Tojo
Tokyo
Meiji-era leaders
Shogun
Koto
Dejima
Yen
Tokugawa shoguns
Kamakura shoguns
Hanko
Akihabara
Samurai
Cyprinus carpio
Nintendo
Nissan
Ashikaga shoguns
Kyoto
Morihei Ueshiba
Toyotomi Hideyoshi
Jokichi Takamine
Ju-jitsu
Junichiro Koizumi
World War II/Hiryu
World War II/Kaga
Iron Chef
World War II/Soryu
Akira Kurosawa
Kamikaze
World War II/Zuikaku
Heisuke Hironaka
Amakusa
Tsurugi
Raku
Karaoke
Kaifu Toshiki
Toyota
Tsunami
Shibasaburo Kitasato
Judo
Gomoku
Suzuki
Kendo
Zhu Shijie
Isoroku Yamamoto
Miyazaki Hayao
Sushi
Choshu
Anime
Otaku No Video
The Vision of Escaflowne
Kia Asayama
Ghost in the Shell
Tenchi Muyo
Star Blazers
Princess Mononoke
Doraemon
Hentai
Masamune Shirow
Manga
My Neighbor Totoro
Trigun
Sailor Moon
Ranma 1/2
Rumiko Takahashi

I'm sure I missed plenty. These can be broadly categorized as:
* Geographical names, with the local native kanji "spelling" as a
sidenote
* Personal names of politicians, scientists, and artists, with their
native kanji "spelling" as a sidenote
* Various cultural items originating in Japan (sushi, karaoke, martial
arts, companies, works of art/pop culture) with their native kanji/kana
"spelling" as a sidenote

That said, I'm still not convinced there's much usefulness in more
special codes that work on our wiki and nowhere else in the world; only
a fraction of the above use kana at all.

-- brion vibber (brion @ pobox.com)
Re: Feature request: Kana entities
On Fri, Mar 08, 2002 at 10:07:58AM -0800, Brion L. VIBBER wrote:
> That said, I'm still not convinced there's much usefulness in more
> special codes that work on our wiki and nowhere else in the world; only
> a fraction of the above use kana at all.

*Anything* is better than using numerics.
Many of them are mixed kana + kanji.
That still saves half of the work.
Re: Feature request: Kana entities
On Fri, 2002-03-08 at 04:07, Tomasz Wegrzanowski wrote:
> On Thu, Mar 07, 2002 at 11:34:53PM -0800, Brion L. VIBBER wrote:
> > On Thu, 2002-03-07 at 22:07, Tomasz Wegrzanowski wrote:
> > > Just see articles about anything Japanese on English Wikipedia.
> > > They contain Japanese names of everything.
> >
> > Sure, but more often kanji than kana, so special kana markup wouldn't be
> > that big a win. See the thread "International Upgrades"; the vague plan
> > is to standardise the internal character set and present the wikipedias
> > in Unicode to capable browsers. (Please comment!)
>
> Uhm, right. But most non-Japanese people don't know too many kanji,
> so kanji aren't that important. ;) On the other hand, more people than
> is usually thought know kana, so it might be beneficial for them.

But, what are people who don't know much Japanese going to _do_ with
kana?

Speaking as someone with a very very poor command of the Japanese
language, my own usage of Japanese characters on the non-Japanese
wikipedias is limited to:
* Demonstration of Japanese characters in articles about the language
* Showing the local form of a place, personal, or other name in
articles about Japan and Japanese culture

The former are a limited genre (Jimbo's "special case"), and the latter
are overwhelmingly kanji.

> Hmmm. Now I think that some general method would be more useful:
> &katakana_a; &kanji_b; &hebrew_c; or &cyrilic_d;

Hmm. Perhaps you should take this up with the w3 and get these put into
the next XHTML standard. :)

> > As a result, we should be able to use the customary input methods or
> > cut-n-paste to put any characters into any of the wikis, which is
> > certainly a lot easier than looking up entities or running text through
> > a UTF-8-to-entities convertor (which is what I currently do).
>
> Hmmm. Wouldn't that need some modifications to browsers?

Only if you've got a really limited browser. (Perhaps Netscape 4, the
bane of web developers worldwide, or a text-mode browser in a non UTF-8
locale.)

Mozilla/Netscape 6, Internet Explorer 5+, Konqueror (if fonts are set up
right), you should have no problem. Configuring keyboards/input methods,
of course, is a system-dependent matter. (Japanese input is notoriously
difficult to set up on Unixish systems that aren't running a primarily
Japanese locale; it's quite easy on relatively current Mac or Windows
systems, though.)

-- brion vibber (brion @ pobox.com)
Re: Feature request: Kana entities
On Fri, 2002-03-08 at 10:26, Tomasz Wegrzanowski wrote:
> On Fri, Mar 08, 2002 at 10:07:58AM -0800, Brion L. VIBBER wrote:
> > That said, I'm still not convinced there's much usefulness in more
> > special codes that work on our wiki and nowhere else in the world; only
> > a fraction of the above use kana at all.
>
> *Anything* is better than using numerics.
> Many of them are mixed kana + kanji.
> That still saves half of the work.

What are you doing, looking up every character individually? No wonder
you're having trouble!

What I currently do is to type the desired text into yudit
(http://yudit.org) using its support for the kinput2 input method (or
cut-n-paste into yudit from another web page), save the file, and run it
through this little program:

#!/usr/bin/perl -p
# disassemble non-ASCII codes from UTF-8 stream
# (each multi-byte sequence becomes a decimal &#NNNN; reference;
# plain ASCII passes through unchanged)

# borrowed from http://czyborra.com/utf/

#$format=$ENV{"UCFORMAT"}||'<U%04X>';
$format='&#%d;';
# two-byte sequences: 110xxxxx 10xxxxxx
s/([\xC0-\xDF])([\x80-\xBF])/sprintf($format,
unpack("c",$1)<<6&0x07C0|unpack("c",$2)&0x003F)/ge;
# three-byte sequences: 1110xxxx 10xxxxxx 10xxxxxx (covers kana and kanji)
s/([\xE0-\xEF])([\x80-\xBF])([\x80-\xBF])/sprintf($format,
unpack("c",$1)<<12&0xF000|unpack("c",$2)<<6&0x0FC0|unpack("c",$3)&0x003F)/ge;
# four-byte sequences: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
s/([\xF0-\xF7])([\x80-\xBF])([\x80-\xBF])([\x80-\xBF])/sprintf($format,
unpack("c",$1)<<18&0x1C0000|unpack("c",$2)<<12&0x3F000|
unpack("c",$3)<<6&0x0FC0|unpack("c",$4)&0x003F)/ge;

Paste the output into the Wikipedia edit box, and presto!
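
For readers who do not use Perl, the same UTF-8-to-entities conversion can be sketched in Python. This is an illustrative equivalent, not part of the original post; `utf8_to_entities` and `decode_three_bytes` are hypothetical names. The second function only spells out the bit arithmetic that the three-byte substitution performs, using the katakana オ (bytes E3 82 AA) as a worked case.

```python
def utf8_to_entities(data: bytes) -> str:
    """Decode UTF-8 bytes and emit &#DECIMAL; for every non-ASCII character."""
    text = data.decode("utf-8")
    return "".join(ch if ord(ch) < 0x80 else f"&#{ord(ch)};" for ch in text)


def decode_three_bytes(b0: int, b1: int, b2: int) -> int:
    """Combine a three-byte UTF-8 sequence (1110xxxx 10xxxxxx 10xxxxxx).

    The lead byte contributes its low 4 bits, each continuation byte
    contributes its low 6 bits.
    """
    return (b0 & 0x0F) << 12 | (b1 & 0x3F) << 6 | (b2 & 0x3F)


if __name__ == "__main__":
    # E3 82 AA -> 0x3000 | 0x0080 | 0x002A = 0x30AA (decimal 12458)
    print(hex(decode_three_bytes(0xE3, 0x82, 0xAA)))
    print(utf8_to_entities(b"Anime (\xe3\x82\xa2\xe3\x83\x8b\xe3\x83\xa1)"))
```

Letting the standard decoder do the work avoids the hand-written masks entirely; the manual version is kept only to show where the numbers come from.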

If I have one name, it gets done at once. If I have two names, they get
done at once. If I put in a whole passage of text, it all gets done at
once. It would actually be *more* work for me to separately write out
the kana characters in special codes.

Once we've got the new system with Unicode up, you should be able to
type or paste the characters in directly (unless you have a very limited
browser, see my earlier post) and bypass all this rigamarole.

-- brion vibber (brion @ pobox.com)
Re: Feature request: Kana entities
On Fri, Mar 08, 2002 at 04:03:02PM -0800, lcrocker@nupedia.com wrote:
>
>
> >Hmmm. Now I think that some general method would be more useful:
> >&katakana_a; &kanji_b; &hebrew_c; or &cyrilic_d;
>
> If and when the W3C ever /standardizes/ these as HTML named
> entity references, we might use them. Until then, I think it's
> better to be able to point to an officially sanctioned doc and
> say "we support these", and let people complain to the
> standards body.

W3C standardizes HTML, and I'm talking about Wiki markup.
Wiki named entities will be converted into HTML numeric entities
on output, so they have nothing to do with HTML.
Re: Feature request: Kana entities
Brion L. Vibber wrote:
> On Fri, 2002-03-08 at 09:20, Jimmy Wales wrote:
> > Really? There are kanji in articles about Japan?
> Yeeessss.... You have such a difficult time accepting this. :)

For the matter of implementing the search engine, Latin search and
Kanji search could be two different functions. Just like image search
(Google style) is a third function and mathematical equation search
could be a fourth kind of search, once LaTeX support is integrated in
Wikipedia. To me, the kanji is just like images and I have no
keyboard to input that in the search window anyway.

The English Wikipedia might implement all three searches, but the
Norwegian and German ones might only need the Latin search (until a
significant number of German Wikipedia pages have kanji, images or
equations in them).

Perhaps this separation of implementations can help us get forward?
I still have no advice for the non-Latin Wikipediae.

> The English wikipedia isn't just for English monolinguals, is it?

Is this the new politically correct term for Americans? :-)


--
Lars Aronsson (lars@aronsson.se)
Aronsson Datateknik
Teknikringen 1e, SE-583 30 Linuxköping, Sweden
tel +46-70-7891609
http://aronsson.se/ http://elektrosmog.nu/ http://susning.nu/
Re: Feature request: Kana entities
On Sun, 2002-03-10 at 04:34, Lars Aronsson wrote:
> Brion L. Vibber wrote:
> > On Fri, 2002-03-08 at 09:20, Jimmy Wales wrote:
> > > Really? There are kanji in articles about Japan?
> > Yeeessss.... You have such a difficult time accepting this. :)
>
> For the matter of implementing the search engine, Latin search and
> Kanji search could be two different functions. Just like image search
> (Google style) is a third function and mathematical equation search
> could be a fourth kind of search, once LaTeX support is integrated in
> Wikipedia. To me, the kanji is just like images and I have no
> keyboard to input that in the search window anyway.
>
> The English Wikipedia might implement all three searches, but the
> Norwegian and German ones might only need the Latin search (until a
> significant number of German Wikipedia pages have kanji, images or
> equations in them).
>
> Perhaps this separation of implementations can help us get forward?
> I still have no advice for the non-Latin Wikipediae.

Well, my point is that there is no need whatsoever to make these
separate functions. They work 100% THE SAME WAY. You put in text, it
munges it a bit to make non-ASCII characters behave, and searches for
it. No images involved.
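
To make the point concrete, here is a small Python sketch of one way such munging could work. The post does not say how the search code actually transforms text, so the scheme here (each non-ASCII character becomes an ASCII token like `u5bff`) and the names `munge` and `search` are assumptions for illustration only. The same function is applied to the page text and to the query, so Latin, umlaut and kanji queries all take one code path.

```python
def munge(text: str) -> str:
    """Lowercase the text and turn each non-ASCII character into 'uXXXX'.

    Assumed scheme for this sketch; a literal ASCII string such as
    'u5bff' in a page would collide with it.
    """
    return "".join(
        ch if ord(ch) < 0x80 else f"u{ord(ch):04x}" for ch in text.lower()
    )


def search(pages: dict, query: str) -> list:
    """Return titles of pages whose munged body contains the munged query."""
    needle = munge(query)
    return [title for title, body in pages.items() if needle in munge(body)]


if __name__ == "__main__":
    pages = {
        "Sushi": "Sushi (\u5bff\u53f8) is a Japanese dish.",
        "Link\u00f6ping": "Link\u00f6ping is a city in Sweden.",
    }
    print(search(pages, "\u5bff\u53f8"))      # kanji query
    print(search(pages, "link\u00f6ping"))    # o-with-umlaut query
    print(search(pages, "sushi"))             # plain Latin query
```

A real index would tokenize and store the munged text once rather than rescanning every page, but the single code path is the point here.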

No great hardship to the person who never types a kanji. (Or an
o-with-umlaut!)

No reason to make them separate.

> > The English wikipedia isn't just for English monolinguals, is it?
>
> Is this the new politically correct term for Americans? :-)

:)

-- brion vibber (brion @ pobox.com)