Mailing List Archive

UTF

>Opera, Mozilla, IE, Netscape and Konqueror ALL
>SUPPORT UTF-8.
>I have used them all to edit the Polish Wikipedia
>(which is UTF-8) and never had any problems with
>it. Opera has broken charset autodetection; you may
>wish to play with the settings a bit.

Not on Macintosh.

Btw, I left you a sample in the sandbox (done with Opera).
And I saved the main page at meta (excuse me, I am
a vandal :-))))) (done with Netscape) (I'll restore it
later...)

You sounded so sure of yourself... please have a look.
This is not a minor issue, but I have no idea why
Netscape does that...

>Other problems you describe have nothing to do with
>UTF-8. If it really doesn't work, try upgrading your
>browser.

The fact that they don't have anything to do with UTF
doesn't make them work any better. I forgot one other
bug: with Netscape and UTF, the browser inserts
random spaces in the text. Brion had to clean up after me
a couple of times.
I was told to upgrade to Opera 6. I tried, and it
crashed the computer twice, so I gave up.

Well, your problem is a *major* one. So I need to find
a way. I'll try that Opera 6 again.

>One reason is interwiki links, which now must be
>coded as %-sequences, and making en->pl links is very
>inconvenient, especially when using Konqueror, which
>displays normal characters, not %-sequences, in the URL
>bar. But the really important one is that we really
>need characters outside ISO-8859-1. How can I write an
>article about any Polish city if I can't write half of
>the Polish diacritics? The same applies to most of
>central Europe, and to most romanization schemes for
>other scripts (which use a lot of diacritics).
>How can I write any article about people from such
>places? The English Wikipedia screws this issue up
>completely by stripping essential diacritics (like
>http://www.wikipedia.org/wiki/Lech_Walesa). How can I
>write an article about some language that doesn't
>use ISO-8859-1 (that's some 90% of world languages)?
>The Polish Wikipedia contains more linguistic
>information than the English one now, mostly because
>you can't do any decent linguistics without using
>native scripts. Even if I wanted to translate Polish
>articles to English, I wouldn't be able to put them on
>the English Wikipedia.


Ok. That's a major issue. Though... I wonder how many
articles will be using diacritics on the fr.wiki... The
international links, yes. But they should not be in the
article anyway. Could they not be coded differently?
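[Editor's note: the "%-sequences" complained about in the quote are URL percent-encoding of a title's UTF-8 bytes. A small sketch in modern Python, added purely as illustration of the mechanism, not part of the 2003 software:]

```python
from urllib.parse import quote, unquote

# A Polish title with characters outside ISO-8859-1.
title = "Lech Wałęsa"

# Each non-ASCII character becomes the %XX codes of its UTF-8 bytes,
# which is what an en->pl interwiki link author would have to type.
encoded = quote(title.replace(" ", "_"))
print(encoded)           # Lech_Wa%C5%82%C4%99sa

# Konqueror's URL bar effectively shows the decoded form instead.
print(unquote(encoded))  # Lech_Wałęsa
```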

Well... I know I stand no chance, because we have a
linguist on the fr. wiki, and I know he would *love*
UTF.

But I wonder... even if I succeed in making that Opera 6
work (I have no other option, right?), that does not
change the fact that Wikipedia will not be editable by
a French Macintosh user with Netscape 4.5, or IE, or
Opera 5, or Mozilla (and a couple of other minor
versions).

You might say "who cares about Macintosh users"?

Yeah, who cares...
iMacs were sold a lot in France (a much higher percentage
than in any other country, I believe). Many scientific
people use Macs. Basically 100% of graphic artists, a lot
of journalists... So, yes, it troubles me to know that
most Mac editors will maybe be treated as vandals
first.

Please, mac users, what do you use ????

__________________________________________________
Do you Yahoo!?
Yahoo! Mail Plus - Powerful. Affordable. Sign up now.
http://mailplus.yahoo.com
Re: UTF [ In reply to ]
Anthere wrote:

>
>Please, mac users, what do you use ????
>
I use Mac OS X 10.1
Chimera browser
:-)

Are you still on OS 9?



Re: UTF [ In reply to ]
>If it really doesn't work, try upgrading your browser.

This is a totally wrong-headed way to look at things.
Most net users have out-of-date browsers and we
shouldn't expect them to upgrade to the latest version
of their browser just to write articles. /Very/
unWiki.

The only major issue I see with UTF support deals with
interlanguage links, but there are already plans being
worked out to have these links outside the main
content window of articles. Therefore the main content
window can be in the charset that is best for that
particular language Wikipedia, and the interlanguage
link edit window can be in UTF.

Most browsers can deal with Latin-charset-based accent
marks so this isn't such a big issue anymore. BTW
everyone should be writing in the language of their
Wikipedia. French articles should be in French,
English articles in English, Spanish articles in
Spanish (unless most common usage of the term is to
use the foreign spelling). Yes this means that the
titles of articles are often approximated to fit
within the language you are writing when the two
languages have different alphabets. Most important
terms have more or less widely-used latin-based
approximations. All the less important terms have to
be dealt with on a case-by-case basis.

Wiktionary is UTF though (and rightly so since all
words in all languages are the subjects of articles).
In the encyclopedias, however, the subjects are not
the words, but persons, places, things and ideas (the
names are for indexing purposes only - so that the
information about the subjects can be found by the
target audience). So providing access to your target
audience is more important than using a zillion
different alphabets (just how the subject's name looks
in another language can be expressed in the article
using Unicode).

In short, for now at least, the benefits of using UTF
for Latin-based Wikipedias are over-shadowed by the
negative repercussions.

-- Daniel Mayer (aka mav)

Re: Re: UTF [ In reply to ]
Daniel Mayer wrote:

>>If it really doesn't work, try upgrading your browser.
>This is a totally wrong-headed way to look at things.
>Most net users have out-of-date browsers and we
>shouldn't expect them to upgrade to the latest version
>of their browser just to write articles. /Very/
>unWiki.
Does anyone know a way to detect the browser's settings/abilities for UTF?
We could use UTF and convert back and forth on the fly for old browsers...
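[Editor's note: one hedged sketch of that idea. HTTP/1.1 browsers can advertise supported charsets via the Accept-Charset request header, so a server could send UTF-8 to capable clients and fall back to Latin-1 with numeric character references for the rest. The header is real HTTP; the functions below are a hypothetical illustration, not MediaWiki code:]

```python
def prefers_utf8(accept_charset):
    """Rough check of an HTTP Accept-Charset header value."""
    if not accept_charset:   # header absent: assume the client copes
        return True
    charsets = [part.split(";")[0].strip().lower()
                for part in accept_charset.split(",")]
    return "utf-8" in charsets or "*" in charsets

def render(text, accept_charset):
    """Serve UTF-8 when accepted, else Latin-1 with &#...; references."""
    if prefers_utf8(accept_charset):
        return text.encode("utf-8"), "utf-8"
    # Anything outside Latin-1 degrades to a numeric character reference.
    return text.encode("iso-8859-1", "xmlcharrefreplace"), "iso-8859-1"

# A UTF-8-capable browser gets the real bytes...
body, charset = render("Wałęsa", "utf-8, iso-8859-1")
# ...an old Latin-1-only browser gets entities for ł and ę.
legacy_body, legacy_charset = render("Wałęsa", "iso-8859-1")
```

The on-the-fly conversion back (from a legacy browser's edit submission to UTF-8) is the harder half, which may be why this was never done this way.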
Re: Re: UTF [ In reply to ]
On Thu, Jan 09, 2003 at 12:42:53PM -0800, Daniel Mayer wrote:
> >If it really doesn't work, try upgrading your browser.
>
> This is a totally wrong-headed way to look at things.
> Most net users have out-of-date browsers and we
> shouldn't expect them to upgrade to the latest version
> of their browser just to write articles. /Very/
> unWiki.

There's NOTHING wrong with it. UTF-8 isn't any new technology,
and every browser that's not really ancient or really
broken supports it.

If we wanted "every ancient broken browser", then
we shouldn't use PNG, CSS, JavaScript, OGG, colors,
and all things like that in articles, because there's
always some ancient broken browser that doesn't support
that. In fact all these are more likely to cause problems
than UTF-8.
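[Editor's note: in support of the "nothing new" point, UTF-8 is backward compatible with ASCII, which is why it deployed so easily. A quick demonstration in modern Python, illustrative only:]

```python
# Plain ASCII encodes to the same single bytes it always had.
assert "Wikipedia".encode("utf-8") == b"Wikipedia"

# Non-ASCII characters become 2-4 byte sequences.
assert "ł".encode("utf-8") == b"\xc5\x82"   # U+0142, two bytes
assert "ę".encode("utf-8") == b"\xc4\x99"   # U+0119, two bytes

# Decoding restores the original text losslessly.
assert b"Wa\xc5\x82\xc4\x99sa".decode("utf-8") == "Wałęsa"
```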

> The only major issue I see with UTF support deals with
> interlanguage links but there are already plans being
> worked-out to have these links outside the main
> content window of articles. Therefore the main content
> window can be in the charset that is best for that
> particular language Wikipedia and the interlanguage
> link edit window can be in UTF.

No, that's a rather minor issue, but it should be fixed too.

> Most browsers can deal with Latin-charset-based accent
> marks so this isn't such a big issue anymore. BTW
> everyone should be writing in the language of their
> Wikipedia. French articles should be in French,
> English articles in English, Spanish articles in
> Spanish (unless most common usage of the term is to
> use the foreign spelling). Yes this means that the
> titles of articles are often approximated to fit
> within the language you are writing when the two
> languages have different alphabets. Most important
> terms have more or less widely-used latin-based
> approximations. All the less important terms have to
> be dealt with on a case-by-case basis.
>
> Wiktionary is UTF though (and rightly so since all
> words in all languages are the subjects of articles).
> In the encyclopedias, however, the subjects are not
> the words, but persons, places, things and ideas (the
> names are for indexing purposes only - so that the
> information about the subjects can be found by the
> target audience). So providing access to your target
> audience is more important than using a zillion
> different alphabets (just how the subject's name looks
> in another language can be expressed in the article
> using Unicode).

On the Polish Wikipedia the policy is: if a word is in a Latin-based script,
it should be spelled in Latin characters with all diacritics,
and if it's not, the original spelling (be it Cyrillic, Kanji or whatever else)
should be given in the article.

If "screw the spelling" is not, as is widely thought now,
just a temporary technical problem, but an official policy,
then sooner or later you will see a fork of the English Wikipedia.

> In short, for now at least, the benefits of using UTF
> for Latin-based Wikipedias are over-shadowed by the
> negative repercussions.

It's completely the opposite. The "problems" of UTF-8, coming
from broken and ancient browsers, are completely irrelevant compared
to the benefits of being able to write things correctly.
Re: UTF [ In reply to ]
On 10 Jan 2003 at 0:03, Tomasz Wegrzanowski wrote:
> On Thu, Jan 09, 2003 at 12:42:53PM -0800, Daniel Mayer wrote:
> > >If it really doesn't work, try upgrading your browser.
> >
> > This is a totally wrong-headed way to look at things.
> > Most net users have out-of-date browsers and we
> > shouldn't expect them to upgrade to the latest version
> > of their browser just to write articles. /Very/
> > unWiki.
>
> There's NOTHING wrong with it. UTF-8 isn't any new technology,
> and every browser that's not really ancient or really
> broken supports it.
>
> If we wanted "every ancient broken browser", then
> we shouldn't use PNG, CSS, JavaScript, OGG, colors,
> and all things like that in articles, because there's
> always some ancient broken browser that doesn't support
> that. In fact all these are more likely to cause problems
> than UTF-8.

So the problem might be finding out what the standard
browser in some parts of the world is and... write
for a browser slightly below that standard.
The policy of "share your knowledge with Wikipedia
using all the high technology that we have" is what makes
me wonder what all these Wikipedias are for...?

> On the Polish Wikipedia the policy is: if a word is in a Latin-based script,
> it should be spelled in Latin characters with all diacritics, and if
> it's not, the original spelling (be it Cyrillic, Kanji or whatever else)
> should be given in the article.

Mind you though, Tomasz, that there's no consensus
on the Polish Wikipedia on the exception: how to deal with words
in Latin-based scripts which have a popular or established Polish way
of writing them without diacritics. So your sentence
gave only a first approximation of the policy and was lacking something.
The working copy of the policy and the rationale can be found,
unfortunately in Polish only, at:
http://pl.wikipedia.org/wiki/Wikipedia:Zasady_pisowni_nazw_obcoj%C4%99zycznych

> If "screw the spelling" is not, as is widely thought now,
> just a temporary technical problem, but an official policy,
> then sooner or later you will see a fork of the English Wikipedia.

Let's hope that ancient browsers go extinct sooner.

> > In short, for now at least, the benefits of using UTF
> > for Latin-based Wikipedias are over-shadowed by the
> > negative repercussions.
>
> It's completely opposite. "Problems" of UTF-8, coming
> from broken and ancient browsers, are completely irrelevant compared
> to benefits of being able to write thing correctly.

For people with direct access to the Internet, living in the
so-called Western World, it is a "big" benefit.
For people living in other parts of the world without all the
"very" high technology, using some kind of "lower" high technology
and using a future off-line version of Wikipedia, it might not be.
Both of us are happy because we've got the knowledge
of how to make our computers and our browsers deal with UTF-8.
Other people, who just want to read a free encyclopedia,
might not be as happy as we are - just watching some
broken texts.
Some people in this world prefer basic knowledge over nice looks.

By the way: what about the accessibility issue and all
the unexpected non-Latin characters and strange diacritics?

Youandme


----------------------------------------------------------------------
Rozrywkowe info w portalu INTERIA.PL >>> http://link.interia.pl/f16b8
Re: UTF [ In reply to ]
On Fri, Jan 10, 2003 at 03:34:18AM +0100, youandme@poczta.fm wrote:
> On 10 Jan 2003 at 0:03, Tomasz Wegrzanowski wrote:
> > If we wanted "every ancient broken browser", then
> > we shouldn't use PNG, CSS, JavaScript, OGG, colors,
> > and all things like that in articles, because there's
> > always some ancient broken browser that doesn't support
> > that. In fact all these are more likely to cause problems
> > than UTF-8.
>
> So the problem might be finding out what the standard
> browser in some parts of the world is and... write
> for a browser slightly below that standard.
> The policy of "share your knowledge with Wikipedia
> using all the high technology that we have" is what makes
> me wonder what all these Wikipedias are for...?

UTF-8 certainly qualifies.

> > On the Polish Wikipedia the policy is: if a word is in a Latin-based script,
> > it should be spelled in Latin characters with all diacritics, and if
> > it's not, the original spelling (be it Cyrillic, Kanji or whatever else)
> > should be given in the article.
>
> Mind you though, Tomasz, that there's no consensus
> on the Polish Wikipedia on the exception: how to deal with words
> in Latin-based scripts which have a popular or established Polish way
> of writing them without diacritics. So your sentence
> gave only a first approximation of the policy and was lacking something.
> The working copy of the policy and the rationale can be found,
> unfortunately in Polish only, at:
> http://pl.wikipedia.org/wiki/Wikipedia:Zasady_pisowni_nazw_obcoj%C4%99zycznych

Nobody ever suggested stripping diacritics other than from
"traditionally polonized" names. I even remember accents
on all those French names when we had the old software.

> > If "screw the spelling" is not, as is widely thought now,
> > just a temporary technical problem, but an official policy,
> > then sooner or later you will see a fork of the English Wikipedia.
>
> Let's hope that ancient browsers go extinct sooner.

Unless somebody can show stats that say there are more than 1% of them,
we can think of them as already extinct.

> > It's completely opposite. "Problems" of UTF-8, coming
> > from broken and ancient browsers, are completely irrelevant compared
> > to benefits of being able to write thing correctly.
>
> For people with direct access to the Internet, living in the
> so called Western World it is a "big" benefit.
> For people living in other parts of the world without all the
> "very" high technology, using some kind of "lower" high technology
> and using future off-line version of Wikipedia it might not be

What do you mean?
Standard browsers on both Linux and Windows support UTF-8.
How does UTF-8 or its absence change that?

A much bigger problem for them is the lack of CD Wikipedia distributions.

> Both of us are happy because we've got the knowledge
> how to make our computers and our browsers deal with UTF-8.

I haven't configured anything. It can deal with it out of the box.

> Other people who just to want to read free encyclopedia,
> might not be as happy as we are - just watching some
> broken texts.
> Some people in this world prefer basic knowledge over nice look.

It's not about looks, but correctness. They are losing a lot of knowledge
now - after reading the description of "Wroclaw" on the English Wikipedia
they won't be able to find it on a map or in a search engine.
They are losing linguistic knowledge, because you can't write that
without UTF-8.

Anyway, how will UTF-8 or its absence affect them?

> By the way: what about the accessibility issue and all
> the unexpected non-Latin characters and strange diacritics?

Accessibility isn't handled at all now. In the future it should be,
but it will require a lot of design changes, about as many as CJK.

If that's what you mean: speech synthesisers won't be able to
pronounce things right without the right diacritics, so we aren't doing them any good.

CJK and accessibility even have something in common - we should have
optional furigana support on the Japanese Wiki and optional speech synthesis
on every Wiki for blind users. Both will require strong default
synthesisers, the option of giving additional information (for example,
that a given word is in another language), and overriding things.
Re: UTF [ In reply to ]
On 10 Jan 2003 at 4:41, Tomasz Wegrzanowski wrote:

> On Fri, Jan 10, 2003 at 03:34:18AM +0100, youandme@poczta.fm wrote:
> > > On 10 Jan 2003 at 0:03, Tomasz Wegrzanowski wrote:
> > > If we wanted "every ancient broken browser", then
> > > we shouldn't use PNG, CSS, JavaScript, OGG, colors,
> > > and all things like that in articles, because there's
> > > always some ancient broken browser that doesn't support
> > > that. In fact all these are more likely to cause problems than UTF-8.
> > So the problem might be finding out what the
> > standard browser in some parts of the world is and... write
> > for a browser slightly below that standard.
> > The policy of "share your knowledge with Wikipedia
> > using all the high technology that we have" is what makes
> > me wonder what all these Wikipedias are for...?
>
> UTF-8 certainly qualifies.

IMHO UTF-8 at present _only just_ qualifies (but it is a must for the future),
and OGG is still a matter of "tomorrow".

> > > On the Polish Wikipedia the policy is: if a word is in a Latin-based
> > > script, it should be spelled in Latin characters with all
> > > diacritics, and if it's not, the original spelling (be it Cyrillic,
> > > Kanji or whatever else) should be given in the article.
> >
> > Mind you though, Tomasz, that there's no consensus
> > on the Polish Wikipedia on the exception: how to deal with words
> > in Latin-based scripts which have a popular or established Polish way
> > of writing them without diacritics. So your sentence gave only a first
> > approximation of the policy and was lacking something. The working copy of
> > the policy and the rationale can be found, unfortunately in Polish
> > only, at:
> > http://pl.wikipedia.org/wiki/Wikipedia:Zasady_pisowni_nazw_obcoj%C4%99zycznych
>
> Nobody ever suggested stripping diactrics other than from
> "traditionally polonized" names. I even remember accents
> on all these French names when we had old software.

On the Polish Wikipedia, Kuomintang was changed to Guomindang with tone marks,
and Tokio was changed to Tokyo with macrons, by... you!
(instead of the traditionally polonized Polish versions, Kuomintang and Tokio
- and these have priority, IMO and in the opinion of others, over the romanization rules).

Let me quote you from another mail:

On 9 Jan 2003 at 14:29, Tomasz Wegrzanowski wrote:
> How can I write an article about any Polish city if I can't
> write half of the Polish diacritics? The same applies to most
> of central Europe, and to most romanization schemes for other
> scripts (which use a lot of diacritics).
> How can I write any article about people from such places?
> The English Wikipedia screws this issue up completely by stripping
> essential diacritics (like http://www.wikipedia.org/wiki/Lech_Walesa).

So I can understand their willingness to write Lech Walesa
without our diacritics if that's the way they consider already
traditional in the English language.
However, let anglophones speak for themselves.
For sure you can write articles in English without our diacritics,
as I can. There are more painful things in this world.
Writing in any language means writing in that specific language,
respecting _its_ rules, even if the lack of diacritics is a rule,
rather than writing in the language overridden by some
extra syntax. Giving the correct spelling once, in parentheses,
is IMO sufficient.

Going back to:

On 10 Jan 2003 at 4:41, Tomasz Wegrzanowski wrote:
> > Let's hope that ancient browsers go extinct sooner.
>
> Unless somebody can show stats that say there are more than 1% of them,
> we can think of them as already extinct.

For sure, results are available somewhere on the net. I'll try to find out.
But I'm doubtful... who does research in Third World countries?

> > For people with direct access to the Internet, living in the
> > so called Western World it is a "big" benefit.
> > For people living in other parts of the world without all the
> > "very" high technology, using some kind of "lower" high technology
> > and using future off-line version of Wikipedia it might not be
>
> What do you mean?
> Standard browsers on both Linux and Windows support UTF-8.
> How does UTF-8 or its absence change that?

Hmm... W98 PL SE + its standard IE 5.0 caused some problems
(that's my experience).
I know people who like to stay on an even cheaper system: W95 and IE 3.0
(I guess that's what was installed on W95) - let's hope that they are
in that 1%.

> Much bigger problem for them is lack of CD Wikipedia distributions.

I agree.

> > Both of us are happy because we've got the knowledge
> > how to make our computers and our browsers deal with UTF-8.
>
> I haven't configured anything. It can deal with it out of the box.

Same as me - I use IE 6.0. But I tried to use 5.0 and... failed.
And IE 5.0 is not an extinct one!

> > Other people who just to want to read free encyclopedia,
> > might not be as happy as we are - just watching some
> > broken texts.
> > Some people in this world prefer basic knowledge over nice look.
>
> It's not about look, but correctness.

What's the use of correctness if one cannot see the correct form?

> They are losing a lot of knowledge
> now - after reading the description of "Wroclaw" on the English Wikipedia they
> won't be able to find it on a map

For sure they are more intelligent than you suggest!

> or in search engine.

One could say: that's the problem of the search engines, not ours!

> They are losing
> linguistic knowledge, because you can't write that without UTF-8.

Yes, they lose that, but at _present_ that's not so important.
Many of them don't care about it in daily life: you see how most
of the users of the English Wikipedia are happy with what they've got?
Why make them happier?
Of course, once the English Wikipedia gains more non-anglophone
users, it will gain more insight into those things as well.

Wikipedia is meant to last more than 2 years. So the problem will
be solved somehow, sooner or later. But I understand
that you can't wait...

> Anyway, how will UTF-8 or not affect them ?

See above

Youandme

Re: UTF [ In reply to ]
On Fri, Jan 10, 2003 at 06:29:06AM +0100, youandme@poczta.fm wrote:
> So I can understand their willingness to write Lech Walesa
> without our diacritics if that's the way they consider already
> traditional in the English language.
> However, let anglophones speak for themselves.

English Wikipedia, more than any other, is also used a lot by
people whose native language is not English.

> For sure you can write articles in English without our diacritics,
> as I can. There are more painful things in this world.

But we can fix that one, and not the others.

> Writing in any language means writing in that specific language,
> respecting _its_ rules, even if the lack of diacritics is a rule.

No. Its rules must be obeyed only for words of that language.
For words from other languages, like the names of people, places, etc.,
the rules of the source language must be obeyed.

> Rather than writing in the language overridden by some
> extra syntax. Giving the correct spelling once, in parentheses,
> is IMO sufficient.

Without UTF-8 you can't even do that.

> > > Both of us are happy because we've got the knowledge
> > > how to make our computers and our browsers deal with UTF-8.
> >
> > I haven't configured anything. It can deal with it out of the box.
>
> Same as me - I use IE 6.0. But I tried to use 5.0 and... failed.
> And IE 5.0 is not an extinct one!

Wikipedia's logs show that there are people using MSIE 5.0 and editing,
so I don't know what the problem is this time.

> > They are losing a lot of knowledge
> > now - after reading the description of "Wroclaw" on the English Wikipedia they
> > won't be able to find it on a map
>
> For sure they are more intelligent than you suggest!

If you strip diacritics, the names of many places suddenly become the same.
So instead of one place with that name, you may have 5.
Readers will then probably choose the least diacriticized of them,
and usually be wrong.
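[Editor's note: the collision described here is easy to demonstrate with Unicode normalization - a modern Python sketch, not anything from 2003. Canonical decomposition (NFD) splits each character into a base letter plus combining marks; dropping the marks merges distinct words:]

```python
import unicodedata

def strip_diacritics(text):
    """Drop combining marks after canonical (NFD) decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Two unrelated Polish words collapse to one string:
# "żądanie" (a demand) becomes indistinguishable from "zadanie" (a task).
assert strip_diacritics("żądanie") == "zadanie"

# Stripping is not even consistent: ó decomposes to o + acute accent,
# but Polish ł is a stroked letter (no combining mark), so it survives.
assert strip_diacritics("Kraków") == "Krakow"
assert strip_diacritics("Wałęsa") == "Wałesa"   # ł kept, ę stripped
```

So naive stripping both merges distinct names and produces hybrid spellings, which is exactly the ambiguity described above.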

> > or in search engine.
>
> One could say: that's the problem of the search engines, not our!

Try searching for 'Lech Walesa' on the Polish Wikipedia.

A search engine is not supposed to accept broken spelling.
If it were, you couldn't search for a word that genuinely has no
diacritics when some other word with the same base letters has them
(zadanie vs., ehm, "żądanie").

> > They are losing
> > linguistic knowledge, because you can't write that without UTF-8.
>
> Yes, they lose that but at _present_ that's not so important.
> Many of them don't care about it in daily life: you see how most
> of the users of English Wikipedia are happy with what they've got?

Polls asking "Are you happy with the English Wikipedia?" have never been done so far.
Most people I know seem to have one problem or another with it.

> Why make them happier?
> Of course, once the English Wikipedia gains more non-anglophone
> users, it will gain more insight into those things as well.

It already has a lot of non-anglophone users.

> Wikipedia is meant to last more than 2 years. So the problem will
> be solved somehow, sooner or later. But I understand
> that you can't wait...

You have to switch to UTF-8 at some point. The sooner the better.
Re: UTF [ In reply to ]
On Fri, Jan 10, 2003 at 02:27:57PM +0100, Tomasz Wegrzanowski wrote:
> On Fri, Jan 10, 2003 at 06:29:06AM +0100, youandme@poczta.fm wrote:
> > Writing in any language is writing in this specific language
> > respecting _it's_ rules, even if the lack of diactrics is a rule.
>
> No. Its rules must be obeyed only for words of that language.
> For words from other languages, like names of people, places etc.
> rules of source language must be obeyed.

No. If you, for example, write an article on China and always write
city names in Chinese, no average user will be able to read it.
I know about Peking, or Beijing, but I won't recognize it in
Chinese.

I agree that an article about Beijing should mention the
Chinese name, like e.g. one would do in
'''Munich''' (German: München) is a town in Germany.

We could provide Chinese characters as images, like we provide
mathematical formulas as images.

The same might be true for other languages if we find that there
is a remarkable number of users unable to use Wikipedia
if we switch to UTF-8.

BTW, I feel the same about Cologne Blue. I have several computers
and browsers that will not render Cologne Blue readably.
I'd vote for autodetecting "known good" browsers and presenting
Cologne Blue only to them. Forcing users to log in in order to read
an article should not be our policy.

Best regards,

JeLuF
Re: UTF [ In reply to ]
Tomasz Wegrzanowski wrote:

>youandme wrote:

>>Writing in any language means writing in that specific language,
>>respecting _its_ rules, even if the lack of diacritics is a rule.

>No. Its rules must be obeyed only for words of that language.
>For words from other languages, like names of people, places etc.
>rules of source language must be obeyed.

Have y'all read the discussions from November on <wikiEN-l>
(often spilling into <wikipedia-l> since the English list was new)
on the English Wikipedia's policy of anglicising names?
There was a lot of argument that could give context here.

>>Rather than writing in the language overridden by some
>>extra syntax. Giving the correct spelling once, in parentheses,
>>is IMO sufficient.

>Without UTF-8 you can't even do that.

You can (and we often do, on [[en:]]) using HTML entities,
such as &#268; (for "C" with a hacek, TeX's "\v C").
What UTF-8 encoding does is to allow:
* Direct entry of the UTF-8 character into the edit box;
* UTF-8 characters in titles.

Direct entry of even Latin-1 characters into the edit box
already screws up a few browsers every once in a while,
so I always change them to HTML entities when I see them.
Thus, the reason for UTF-8 would be non-Latin-1 titles.
Since [[en:]]'s policy favours anglicisation
(more than *I* would like! ^_^), this isn't vital.

Still, there are a few times when it would be appropriate under the policy.
When there is no English standard for a foreign name,
then we give it diacritics in the title if it's Latin-1;
with UTF-8, we could extend this practice to other Latin alphabets.
And (thanks to TeX, I suppose), mathematicians commonly use
any feature of the Latin alphabet supported by plain TeX,
including the aforementioned &#268; (as in "C(ech homology").
So UTF-8 encoding would still be useful on [[en:]] --
just not (IMO) a pressing concern.
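[Editor's note: the entity workaround described here can be mechanized. Python's 'xmlcharrefreplace' codec error handler emits exactly these numeric references for whatever the target charset lacks - a modern illustration of the idea, not the workflow of the time:]

```python
# Č (C with hacek, Toby's "C(") is U+010C, i.e. 268 decimal - hence &#268;.
assert ord("Č") == 268

# Encoding to ASCII turns every non-ASCII character into &#...;
text = "Čech homology"
assert text.encode("ascii", "xmlcharrefreplace") == b"&#268;ech homology"

# With Latin-1 as the target, accented Western characters stay literal
# and only what Latin-1 lacks is escaped.
assert "é Č".encode("latin-1", "xmlcharrefreplace") == b"\xe9 &#268;"
```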


-- Toby
Re: UTF [ In reply to ]
On Sat, Jan 11, 2003 at 12:44:18PM -0800, Toby Bartels wrote:
> Tomasz Wegrzanowski wrote:
> >Without UTF-8 you can't even do that.
>
> You can (and we often do, on [[en:]]) using HTML entities,
> such as &#268; (for "C" with a hacek, TeX's "\v C").
> What UTF-8 encoding does is to allow:
> * Direct entry of the UTF-8 character into the edit box;
> * UTF-8 characters in titles.

Unless it has an obvious symbolic name AND is used just once or twice,
it is not any kind of solution.

Do you really expect people to write articles like
http://wiktionary.org/wiki/Polish_language or
http://pl.wikipedia.org/wiki/S%C5%82ownictwo_informatyczne_w_j%C4%99zyku_japo%C5%84skim
with &#numbers; ?
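[Editor's note: for a sense of scale, the second URL above is nothing but percent-encoded UTF-8. Decoding it - here with Python's urllib, purely illustrative - shows the fully diacriticized title that would otherwise need a numeric entity for every accented letter:]

```python
from urllib.parse import unquote

url_path = "S%C5%82ownictwo_informatyczne_w_j%C4%99zyku_japo%C5%84skim"

# Each %XX pair is one UTF-8 byte; unquote reassembles the characters.
title = unquote(url_path)
print(title)  # Słownictwo_informatyczne_w_języku_japońskim

# Written as HTML entities instead, this title alone would need
# &#322; for every ł, &#281; for every ę and &#324; for every ń -
# and so would every occurrence of those letters in the article body.
```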
Re: UTF [ In reply to ]
On Fri, Jan 10, 2003 at 02:27:57PM +0100, Tomasz Wegrzanowski wrote:
>> Wikipedia is meant to last more than 2 years. So the problem will
>> be solved somehow, sooner or later. But I understand
>> that you can't wait...
>
>You have to switch to UTF-8 at some point. The sooner the better.

I second the motion.

Jonathan

--
Geek House Productions, Ltd.

Providing Unix & Internet Contracting and Consulting,
QA Testing, Technical Documentation, Systems Design & Implementation,
General Programming, E-commerce, Web & Mail Services since 1998

Phone: 604-435-1205
Email: djw@reactor-core.org
Webpage: http://reactor-core.org
Address: 2459 E 41st Ave, Vancouver, BC V5R2W2
Re: UTF [ In reply to ]
On Sat, Jan 11, 2003 at 12:44:18PM -0800, Toby Bartels wrote:
>You can (and we often do, on [[en:]]) using HTML entities,
>such as &#268; (for "C" with a hacek, TeX's "\v C").

That approach borks things up. Specifically, it screws up web searches.
How many people are going to enter, or know to enter, the HTML entity
when they type in a search term? Related, but slightly different:
it screws up collation. With collation you can find things with
diacritics even when you aren't putting the diacritics in yourself,
and the sorting order gets done properly.

I think UTF-8 is the way to go. It's been out for years, and is now
widely supported.

Jonathan

Re: UTF [ In reply to ]
Tomasz Wegrzanowski wrote:

>Toby Bartels wrote:

>>You can (and we often do, on [[en:]]) using HTML entities,
>>such as &#268; (for "C" with a hacek, TeX's "\v C").
>>What UTF-8 encoding does is to allow:
>>* Direct entry of the UTF-8 character into the edit box;
>>* UTF-8 characters in titles.

>Unless it has obvious symbolic name AND is used just once or twice
>it is not any solution.

They should all have symbolic names *eventually*.
Let's write a letter to the W3C ^_^!

>Do you really expect people to write articles like
>http://wiktionary.org/wiki/Polish_language or
>http://pl.wikipedia.org/wiki/S%C5%82ownictwo_informatyczne_w_j%C4%99zyku_japo%C5%84skim
>with &#numbers; ?

No, I don't think that it would work very well at all!
Which must be why these wikis are already on UTF-8 --
I was talking about switching [[en:]], [[fr:]], and the like.
(I wouldn't even presume to change "é" to "&eacute;" on [[fr:]],
but I'd definitely change "C(" to "&#268;" on [[en:]].)
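(Editorial aside: for concreteness, here is how the numeric entity, the code point, and the UTF-8 bytes line up; an illustrative sketch, not part of the original mail:)

```python
import html

# "&#268;" is the numeric character reference for code point 268 (hex 0x10C),
# LATIN CAPITAL LETTER C WITH CARON -- the letter written "C(" in ASCII above.
c = html.unescape("&#268;")
print(ord(c))              # 268
print(c.encode("utf-8"))   # b'\xc4\x8c' -- the two bytes a UTF-8 wiki stores
```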


-- Toby
Re: UTF
Clutch wrote:

>Toby Bartels wrote:

>>You can (and we often do, on [[en:]]) using HTML entities,
>>such as &#268; (for "C(", "C" with a hacek, TeX's "\v C").

>That approach borks things up. Specifically, it screws websearches.
>How many people are going to enter, or know to enter, the HTML entity
>when they type in a search term?

Google is quite capable of finding "&#268;" when a user enters "C("
(well not literally "C(", but the actual Czech letter itself).
If Wikipedia's own search engine isn't, then we should fix that anyway.

>Related, but slightly different,
>it screws up collation. With collation you can find things with
>diacritics even when you aren't putting the diacritics in yourself,
>and sorting order gets done properly.

I don't see how this is relevant to text.
It *is* relevant to titles, but I already agree with you
that UTF-8 would be nice to have for those!
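(Editorial aside: to see the collation concern concretely, a minimal sketch with hypothetical titles; this uses plain code-point ordering only -- proper locale-aware collation, which knows Czech sorts 'Č' after 'C', is a further step:)

```python
import html

# Three hypothetical titles, one stored entity-encoded ("České Budějovice"):
titles = ["&#268;esk&#233; Bud&#283;jovice", "Brno", "Praha"]

# Sorting the raw wikitext ranks the encoded title by its '&' (0x26 < 'B'),
# so the entity string sorts first, ahead of "Brno":
print(sorted(titles)[0])

# Even plain code-point order on the decoded text is saner:
print(sorted(html.unescape(t) for t in titles))
```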

>I think UTF-8 is the way to go. It's been out for years, and is now
>widely supported.

Not widely enough, if Anthere's evidence is accurate.
(Better evidence would be citations from server logs
for the various Latin-1 wikis that people want to switch over.)
I don't know anybody that would oppose switching everything to Unicode
once it's nearly universally supported -- so that's what it comes down to.

I don't want to get into this argument too much --
I support switching to UTF-8 if it won't screw things up,
and I oppose switching if it will screw things up.
Other than that, I just have some evidence (from meta)
that it *can* screw things up, so we need to watch for it;
but switching may well still be the right thing to do!
I just wanted to point out that the functionality is there
(but not conveniently) in the body of the article (but not the title).


-- Toby
Re: Re: UTF
On Sun, Jan 12, 2003 at 04:22:21PM -0800, Daniel Mayer wrote:
> On Saturday 11 January 2003 04:00 am, Tomasz Wegrzanowski wrote:
> > No. Its rules must be obeyed only for words of that language.
> > For words from other languages, like names of people, places etc.
> > rules of source language must be obeyed.
>
> When words from other languages are used in any language that language tends
> to modify those words so that they are pronounceable and usable by people who
> speak that language. In English this is called Anglicisation and in Polish it
> is called Polonisation (which has already been pointed out to you).
>
> All languages pick up and modify words from other languages. When these
> languages do this the words are no longer foreign, but they are now part of a
> new language. Words are merely used to name things and different languages
> have different words for things.

"Wrocl/aw" or "Wal/e,sa" don't suddenly become "English" words because you
use them it English sentence. They're still Polish words and should use
correct Polish spelling. The same applies to all other languages.
Re: UTF
On Saturday 11 January 2003 04:00 am, Tomasz Wegrzanowski wrote:
> No. Its rules must be obeyed only for words of that language.
> For words from other languages, like names of people, places etc.
> rules of source language must be obeyed.

And who are you to dictate what the naming conventions on the English
Wikipedia should be?

When words from other languages are used in any language that language tends
to modify those words so that they are pronounceable and usable by people who
speak that language. In English this is called Anglicisation and in Polish it
is called Polonisation (which has already been pointed out to you).

All languages pick up and modify words from other languages. When these
languages do this the words are no longer foreign, but they are now part of a
new language. Words are merely used to name things and different languages
have different words for things.

-- Daniel Mayer (aka mav)