Mailing List Archive

Title characters
In the process of writing some standards documents for the Wikipedia
content model (some lower level behind-the-scenes stuff that needs to
be done before working on the syntax and to beef up the test suite),
I've come to the point where I need to decide exactly what characters
are and are not allowed in page titles. I'd like to solicit input on
this. Keep in mind here that what I'm specifying is the set of
characters a page title can be chosen from; that is, what strings
will be allowed between the brackets of a link, and displayed at the
top of a page, regardless of whatever URL-encoding tricks we have to
use to make that happen. _After_ we specify that, then we can specify
exactly how to construct URLs from them. Here are my current thoughts:

* Cannot allow: # (sharp), | (pipe), " (quote), [] (brackets),
{} (braces), <> (greater,less), + (plus), \ (backslash) because
allowing them would interfere with link syntax and make the
software more tricky to write. I can live without these, though
I think + might be handy in some places (like C++), and might be
worth the effort to allow.

* Should allow anything Unicode calls a letter, numeral, syllable,
or ideograph.

* Should not allow Unicode diacriticals, combining forms, display
forms (ligatures), controls, and other specials.

* Should allow most ASCII punctuation that might appear in a name
or title in text, specifically - , . ( ) ' & : ; % ! ? / $ *
(Note that some of these, like *, are not currently allowed,
and that : is a special case that's allowed but only when the
text before it doesn't match a namespace, etc.)

* Should not allow non-ASCII punctuation like em dash, curly
quotes, etc., because they cause problems on machines with
strict ISO character sets.

* Space is allowed. Underscore is allowed, but indistinguishable
from space. No other controls (tab, etc.) are allowed.
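
To make the rules in this list concrete, here is a minimal Python sketch of
a title check. It is not the Wikipedia code; the names FORBIDDEN,
ALLOWED_PUNCT, and is_valid_title are made up for illustration. Mapping
"letter, numeral, syllable, or ideograph" onto the Unicode general
categories L* and N* is an assumption, the namespace special case for : is
ignored, and ligatures (which Unicode also classes as letters) are not
filtered out.

```python
import unicodedata

# Characters the proposal rules out because they clash with link syntax.
FORBIDDEN = set('#|"[]{}<>+\\')
# ASCII punctuation the proposal would allow in titles.
ALLOWED_PUNCT = set("-,.()'&:;%!?/$*")


def is_valid_title(title: str) -> bool:
    """Return True if every character of `title` passes the proposed rules."""
    if not title:
        return False
    for ch in title:
        if ch in FORBIDDEN:
            return False
        # Space and underscore are both allowed (and treated as the same).
        if ch in (" ", "_") or ch in ALLOWED_PUNCT:
            continue
        # Assumption: general categories L* (letters, syllables, ideographs)
        # and N* (numerals) are allowed. Combining marks (M*), controls (C*),
        # and non-ASCII punctuation (P*) fall through and are rejected.
        if unicodedata.category(ch)[0] in ("L", "N"):
            continue
        return False
    return True
```

Under these assumptions "Mercury (planet)" passes, while "C++", a title
containing a tab, or one containing an em dash is rejected.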

Anyone have other ideas/suggestions?

--
Lee Daniel Crocker <lee@piclab.com> <http://www.piclab.com/lee/>
"All inventions or works of authorship original to me, herein and past,
are placed irrevocably in the public domain, and may be used or modified
for any purpose, without permission, attribution, or notification."--LDC
Re: Title characters
On Fri, 23 May 2003, Lee Daniel Crocker wrote:
> * Cannot allow: # (sharp), | (pipe), " (quote), [] (brackets),
> {} (braces), <> (greater,less), + (plus), \ (backslash) because
> allowing them would interfere with link syntax and make the
> software more tricky to write. I can live without these, though
> I think + might be handy in some places (like C++), and might be
> worth the effort to allow.

Plus + and quote " are frequently asked for. These would not interfere
with wiki syntax at all, though both would require escaping in URLs (as
does the ampersand & when used in the query string and the percent % and
question mark ? always, all of which we presently allow).
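
As an illustration of the escaping involved, here is a small Python sketch
using the standard urllib.parse module. The helper name title_to_path and
the choice to turn spaces into underscores are assumptions for the example,
not a description of the existing software.

```python
from urllib.parse import quote, unquote


def title_to_path(title: str) -> str:
    """Percent-encode a title for use as one URL path segment."""
    # safe="" means every reserved character (including / + " & % ?)
    # is escaped; letters, digits and _ . - ~ are left alone.
    return quote(title.replace(" ", "_"), safe="")


for title in ["C++", '"Heroes"', "AT&T", "100%", "Who?"]:
    encoded = title_to_path(title)
    # Decoding gives back the original characters, so nothing is lost.
    print(title, "->", encoded, "->", unquote(encoded))
```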

> * Should allow anything Unicode calls a letter, numeral, syllable,
> or ideograph.

Okay...

> * Should not allow Unicode diacriticals, combining forms, display
> forms (ligatures), controls, and other specials.

Waitaminute... that would seem to exclude the use of accented characters
that do not have a precombined form. This could be seriously detrimental
to some languages.

(In any case, we ought to do a little fancier work with UTF-8 to make sure
that canonical forms are used to prevent false non-matches. I don't know
if there's a library we can link into PHP to do this or if we'd have to
write something.)
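
To show the kind of false non-match meant here, a minimal Python sketch
follows. It uses Python's standard unicodedata module purely as an
illustration (it says nothing about what is available from PHP), and the
choice of normalization form NFC and the name canonical_title are
assumptions.

```python
import unicodedata


def canonical_title(title: str) -> str:
    """Bring a title into one canonical Unicode form before comparing."""
    return unicodedata.normalize("NFC", title)


precomposed = "Caf\u00e9"    # e-acute as a single code point
decomposed = "Cafe\u0301"    # plain e followed by a combining acute accent

# The two strings look identical on screen but compare unequal as raw text.
print(precomposed == decomposed)
# After normalization they compare equal.
print(canonical_title(precomposed) == canonical_title(decomposed))
```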

-- brion vibber (brion @ pobox.com)
Re: Title characters
On Fri, 23 May 2003, Lee Daniel Crocker wrote:

> Date: Fri, 23 May 2003 13:46:28 -0500
> From: Lee Daniel Crocker <lee@piclab.com>
> Subject: [Wikitech-l] Title characters
>
<snip>
>
> * Cannot allow: # (sharp), | (pipe), " (quote), [] (brackets),
> {} (braces), <> (greater,less), + (plus), \ (backslash) because
> allowing them would interfere with link syntax and make the
> software more tricky to write. I can live without these, though
> I think + might be handy in some places (like C++), and might be
> worth the effort to allow.
>
> * Should allow most ASCII punctuation that might appear in a name
> or title in text, specifically - , . ( ) ' & : ; % ! ? / $ *
> (Note that some of these, like *, are not currently allowed,
> and that : is a special case that's allowed but only when the
> text before it doesn't match a namespace, etc.)
>
> * Should not allow non-ASCII punctuation like em dash, curly
> quotes, etc., because they cause problems on machines with
> strict ISO character sets.
>
> * Space is allowed. Underscore is allowed, but indistinguishable
> from space. No other controls (tab, etc.) are allowed.
>
> Anyone have other ideas/suggestions?

Missed one: the "at" symbol, @, is currently not allowed. I don't feel
strongly one way or the other about it myself, but it's come up on the
Village Pump recently when someone wanted to use it, so it should probably
be on one of those lists.

--
John R. Owens http://www.ghiapet.homeip.net/
Ah, arrogance and stupidity all in the same package. How efficient of
you!
--Londo Mollari
Re: Title characters
"Lee Daniel Crocker" <lee@piclab.com> wrote in
message news:20030523184628.GA22556@piclab.com...

...

> Anyone have other ideas/suggestions?

Here's one that's almost unrelated, but your post has reminded me of it.
We've had a number of requests for titles with an initial lowercase letter.
I would like to see a checkbox next to "Watch this article" on the edit
page, labelled "Title starts with lowercase letter". Clicking on "save" with
this checked would set a flag in the cur table, instructing
getPrefixedText() to set the initial letter to lowercase. So [[IMac]] and
[[iMac]] still go to the same place, but when the software has to display
the title, it comes out as [[iMac]].

Alternately you could just have a link in the sidebar, like for page
protection -- but please make sure the change is registered in RC.

-- Tim Starling.
Re: Re: Title characters
On Sun, May 25, 2003 at 12:20:12AM +1000, Tim Starling wrote:
>
> "Lee Daniel Crocker" <lee@piclab.com> wrote in
> message news:20030523184628.GA22556@piclab.com...
>
> ...
>
> > Anyone have other ideas/suggestions?
>
> Here's one that's almost unrelated, but your post has reminded me of it.
> We've had a number of requests for titles with an initial lowercase letter.
> I would like to see a checkbox next to "Watch this article" on the edit
> page, labelled "Title starts with lowercase letter". Clicking on "save" with
> this checked would set a flag in the cur table, instructing
> getPrefixedText() to set the initial letter to lowercase. So [[IMac]] and
> [[iMac]] still go to the same place, but when the software has to display
> the title, it comes out as [[iMac]].
>
> Alternately you could just have a link in the sidebar, like for page
> protection -- but please make sure the change is registered in RC.

IMHO it makes no sense to have 2 articles that differ only by capitalization -
there should be some "canonical" form, but all links, no matter what
capitalization they have, should go to the same article.

That would help a lot with computer stuff.
Re: Re: Title characters
On Sun, 25 May 2003, Tomasz Wegrzanowski wrote:

> On Sun, May 25, 2003 at 12:20:12AM +1000, Tim Starling wrote:
> >
> > "Lee Daniel Crocker" <lee@piclab.com> wrote in
> > message news:20030523184628.GA22556@piclab.com...
> >
> > ...
> >
> > > Anyone have other ideas/suggestions?
> >
> > Here's one that's almost unrelated, but your post has reminded me of it.
> > We've had a number of requests for titles with an initial lowercase letter.
> > I would like to see a checkbox next to "Watch this article" on the edit
> > page, labelled "Title starts with lowercase letter". Clicking on "save" with
> > this checked would set a flag in the cur table, instructing
> > getPrefixedText() to set the initial letter to lowercase. So [[IMac]] and
> > [[iMac]] still go to the same place, but when the software has to display
> > the title, it comes out as [[iMac]].
> >
> > Alternately you could just have a link in the sidebar, like for page
> > protection -- but please make sure the change is registered in RC.
>
> IMHO it makes no sense to have 2 articles that differ only by capitalization -
> there should be some "canonical" form, but all links, no matter what
> capitalization they have, should go to the same article.

That's exactly what Tim is proposing - the only change is that the canonical
form can be decided for each word separately rather than always being set
at 'with capital'.

Andre Engels
Re: Title characters
On Fri, 23 May 2003, Lee Daniel Crocker wrote:

> * Cannot allow: # (sharp), | (pipe), " (quote), [] (brackets),
> {} (braces), <> (greater,less), + (plus), \ (backslash) because
> allowing them would interfere with link syntax and make the
> software more tricky to write. I can live without these, though
> I think + might be handy in some places (like C++), and might be
> worth the effort to allow.

(...)

> * Should allow most ASCII punctuation that might appear in a name
> or title in text, specifically - , . ( ) ' & : ; % ! ? / $ *
> (Note that some of these, like *, are not currently allowed,
> and that : is a special case that's allowed but only when the
> text before it doesn't match a namespace, etc.)

Note that currently & is allowed, but not working - linking to a page
with '&' in the title takes you to the page with only the part before
the '&'. Thus, in my opinion, this one also counts as 'interfering with
link syntax'. It's a very useful one, so if you are going to do things
to make some of these possible, this one should certainly be included.
The same type of problem might exist with '?'; I don't know about that.

Andre Engels
Re: Title characters
On Monday 26 May 2003 00:52, Andre Engels wrote:
> Note that currently & is allowed, but not working - linking to a page
> with '&' in the title, takes you to the page with only the part
> before the '&'.

No, that's just a bug in the rewrite rules; some of the wikis didn't get
the proper escaping fix added in. Fixed on NL now.
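
The rewrite rules themselves are not shown in this thread, so the following
Python sketch only illustrates the general mechanism by which an unescaped
'&' cuts a title short once it lands in a query string. The parameter name
"title" is an assumption for the example.

```python
from urllib.parse import parse_qs, quote

title = "AT&T"

# Unescaped: the '&' is read as a parameter separator, so the title
# parameter ends right before it.
unescaped = parse_qs("title=" + title)

# Escaped: the '&' travels as %26 and is decoded back into the value.
escaped = parse_qs("title=" + quote(title, safe=""))

print(unescaped)
print(escaped)
```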

-- brion vibber (brion @ pobox.com)
Re: Re: Title characters
On Mon, May 26, 2003 at 09:49:15AM +0200, Andre Engels wrote:
> On Sun, 25 May 2003, Tomasz Wegrzanowski wrote:
>
> > On Sun, May 25, 2003 at 12:20:12AM +1000, Tim Starling wrote:
> > >
> > > "Lee Daniel Crocker" <lee@piclab.com> wrote in
> > > message news:20030523184628.GA22556@piclab.com...
> > >
> > > ...
> > >
> > > > Anyone have other ideas/suggestions?
> > >
> > > Here's one that's almost unrelated, but your post has reminded me of it.
> > > We've had a number of requests for titles with an initial lowercase letter.
> > > I would like to see a checkbox next to "Watch this article" on the edit
> > > page, labelled "Title starts with lowercase letter". Clicking on "save" with
> > > this checked would set a flag in the cur table, instructing
> > > getPrefixedText() to set the initial letter to lowercase. So [[IMac]] and
> > > [[iMac]] still go to the same place, but when the software has to display
> > > the title, it comes out as [[iMac]].
> > >
> > > Alternately you could just have a link in the sidebar, like for page
> > > protection -- but please make sure the change is registered in RC.
> >
> > IMHO it makes no sense to have 2 articles that differ only by capitalization -
> > there should be some "canonical" form, but all links, no matter what
> > capitalization they have, should go to the same article.
>
> That's exactly what Tim is proposing - the only change is that the canonical
> form can be decided for each word separately rather than always being set
> at 'with capital'.

Well, I'm more concerned about "UNIX" vs. "Unix".
Re: Re: Title characters
> (Tomasz Wegrzanowski <taw@users.sourceforge.net>):
>
> Well, I'm more concerned about "UNIX" vs. "Unix".

Or more generally, acronyms. "CAT" is computer assisted tomography,
while "cat" is a furry creature. But if we did go to complete
case-insensitivity, the problem would be merely another source of
title ambiguity, which we are already used to dealing with (i.e.,
the "cat" page would deal with the creature and the machine just
as the "Mercury" page deals with the metal, the planet, and the god),
so that's not a major impediment.

We'd have to canonicalize the URLs in some way (for example, by
making every character in the URL lowercase all the time), and then
make a guess about what actual title to create for new pages.

I don't know if it's possible to make every case easy, so we have
to settle for making the majority of cases easy. I think most page
titles are still such that they should be capitalized as titles but
not in running text, just like "cat". So the present system handles
the common case well. True, it doesn't handle some other cases, but
I'm not really sure we could do that without complicating the more
common case.

I'd need to see more argument about exactly how to handle this
before I'd be convinced to change it.

--
Lee Daniel Crocker <lee@piclab.com> <http://www.piclab.com/lee/>
"All inventions or works of authorship original to me, herein and past,
are placed irrevocably in the public domain, and may be used or modified
for any purpose, without permission, attribution, or notification."--LDC
Re: Re: Title characters
On Tue, May 27, 2003 at 02:18:39PM -0500, Lee Daniel Crocker wrote:
> > (Tomasz Wegrzanowski <taw@users.sourceforge.net>):
> >
> > Well, I'm more concerned about "UNIX" vs. "Unix".
>
> Or more generally, acronyms. "CAT" is computer assisted tomography,
> while "cat" is a furry creature. But if we did go to complete
> case-insensitivity, the problem would be merely another source of
> title ambiguity, which we are already used to dealing with (i.e.,
> the "cat" page would deal with the creature and the machine just
> as the "Mercury" page deals with the metal, the planet, and the god),
> so that's not a major impediment.
>
> We'd have to canonicalize the URLs in some way (for example, by
> making every character in the URL lowercase all the time), and then
> make a guess about what actual title to create for new pages.
>
> I don't know if it's possible to make every case easy, so we have
> to settle for making the majority of cases easy. I think most page
> titles are still such that they should be capitalized as titles but
> not in running text, just like "cat". So the present system handles
> the common case well. True, it doesn't handle some other cases, but
> I'm not really sure we could do that without complicating the more
> common case.
>
> I'd need to see more argument about exactly how to handle this
> before I'd be convinced to change it.

We need 2 canonical forms: a database canonical form for linking, which is
always lowercase, and a presentation canonical form, which is by default
ucfirst(title_of_link_that_created_article) and can be overridden
by #CANONICALFORM iMac or something.
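
A minimal Python sketch of the two forms, with made-up names: db_key is the
always-lowercase linking form, TitleIndex holds the presentation form, and
set_canonical_form stands in for whatever a #CANONICALFORM directive would
do. Plain str.lower() is used for case folding, which is a simplification
for non-ASCII titles.

```python
def ucfirst(s: str) -> str:
    return s[:1].upper() + s[1:]


def db_key(title: str) -> str:
    """Database canonical form: what links are matched against."""
    return title.lower()


class TitleIndex:
    def __init__(self) -> None:
        self._display: dict[str, str] = {}  # db key -> presentation form

    def create(self, link_text: str) -> str:
        # Default presentation form: ucfirst of the link that made the page.
        key = db_key(link_text)
        self._display.setdefault(key, ucfirst(link_text))
        return key

    def set_canonical_form(self, link_text: str, form: str) -> None:
        # Hypothetical override; it may only change capitalization.
        key = db_key(link_text)
        if key not in self._display:
            raise KeyError(link_text)
        if db_key(form) != key:
            raise ValueError("override must differ only in capitalization")
        self._display[key] = form

    def display(self, link_text: str) -> str:
        return self._display[db_key(link_text)]
```

For example, after create("unix") both "UNIX" and "Unix" resolve to the same
page shown as "Unix", and set_canonical_form("imac", "iMac") changes only
how that page is displayed.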
Re: Title characters
> (Brion Vibber <vibber@aludra.usc.edu>):
>
> > * Should not allow Unicode diacriticals, combining forms, display
> > forms (ligatures), controls, and other specials.
>
> Waitaminute... that would seem to exclude the use of accented characters
> that do not have a precombined form. This could be seriously detrimental
> to some languages.
>
> (In any case, we ought to do a little fancier work with UTF-8 to make sure
> that canonical forms are used to prevent false non-matches. I don't know
> if there's a library we can link into PHP to do this or if we'd have to
> write something.)

I confess ignorance here. Are there really languages for which the
simplest canonical representation in Unicode requires combining forms?
If so, then I remove the restriction, but we must then specify a
specific canonical representation for titles in each language, as you
suggest; perhaps something like a Stringprep profile would be needed.

--
Lee Daniel Crocker <lee@piclab.com> <http://www.piclab.com/lee/>
"All inventions or works of authorship original to me, herein and past,
are placed irrevocably in the public domain, and may be used or modified
for any purpose, without permission, attribution, or notification."--LDC
Re: Title characters
On Tue, 27 May 2003, Lee Daniel Crocker wrote:
> I confess ignorance here. Are there really languages for which the
> simplest canonical representation in Unicode requires combining forms?

Off the top of my head, one Aleutian language (Unangam Tunuu) uses
x-with-circumflex; Guarani apparently uses g-with-tilde. Tone marks for
Chinese Zhuyin phonetic script are combining characters; I think the
Indian scripts are pretty dependent on this kind of thing as well.

Precombined characters are theoretically only included for round-trip
conversion with legacy character sets, so they're not really making new
ones for orthographies that are just getting started in the wonderful
world of character encoding.
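
A quick way to check whether a given letter-plus-accent pair has a
precombined form is to compose it and see whether a combining mark is left
over. The Python sketch below does that with the standard unicodedata
module; the helper name needs_combining is made up, and the result depends
on the Unicode data shipped with the Python version in use.

```python
import unicodedata


def needs_combining(text: str) -> bool:
    """True if `text` still contains combining marks after composition."""
    composed = unicodedata.normalize("NFC", text)
    return any(unicodedata.combining(ch) for ch in composed)


samples = {
    "e + acute": "e\u0301",       # has a precombined form
    "x + circumflex": "x\u0302",  # the Unangam Tunuu example
    "g + tilde": "g\u0303",       # the Guarani example
}

for name, text in samples.items():
    print(name, "needs combining characters:", needs_combining(text))
```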

> If so, then I remove the restriction, but we must then specify a
> specific canonical representation for titles in each language, as you
> suggest; perhaps something like a Stringprep profile would be needed.

They've thought of that already too, it seems. :)
See Unicode Standard Annex #15, "Unicode normalization forms":
http://www.unicode.org/unicode/reports/tr15/

-- brion vibber (brion @ pobox.com)
Re: Title characters
Brion Vibber wrote:

>On Tue, 27 May 2003, Lee Daniel Crocker wrote:
>
>>I confess ignorance here. Are there really languages for which the
>>simplest canonical representation in Unicode requires combining forms?
>>
>
>Off the top of my head, one Aleutian language (Unangam Tunuu) uses
>x-with-circumflex; Guarani apparently uses g-with-tilde. Tone marks for
>Chinese Zhuyin phonetic script are combining characters; I think the
>Indian scripts are pretty dependent on this kind of thing as well.
>
Also nasalized vowels for IPA.

Ec