Mailing List Archive

Avoid invisible characters in page titles
Hi,

wouldn't it be better to avoid invisible characters in page titles when
pages are created?

Please look here: invisible Unicode characters in page titles have caused
problems when parsing those titles or linking to those pages:
https://de.wikipedia.org/wiki/Benutzer_Diskussion:Wurgl#Liste_der_Biografien/Ci

If invisible characters were removed from the title when a page is
created, such problems could never occur.

What do you think about this, and which technical approaches already exist?
How are LTR and RTL marks handled when pages are created with them?
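
As a starting point for the question above, a title can at least be scanned
for such characters. The following is a minimal Python sketch, not MediaWiki
code: it assumes that "invisible" means Unicode general category Cf (format
characters, which covers ZERO WIDTH SPACE and the LTR/RTL marks), and the
function name and sample title are made up for illustration.

```python
import unicodedata

def find_invisible(title):
    """Return (index, code point, Unicode name) for each Cf character in title."""
    hits = []
    for i, ch in enumerate(title):
        if unicodedata.category(ch) == "Cf":
            hits.append((i, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")))
    return hits

# Hypothetical title containing a ZERO WIDTH SPACE and a LEFT-TO-RIGHT MARK.
title = "Liste\u200bder Biografien\u200e"
for hit in find_invisible(title):
    print(hit)
```

This only reports the characters; whether any of them may safely be removed
is a separate question.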

Thank you very much and kind regards
Martin
aka user:Doc_Taxon
Re: Avoid invisible characters in page titles
Hi,

On Tue, 2023-01-17 at 12:03 +0100, Martin Domdey wrote:
> wouldn't it be better to avoid invisible characters in page titles
> when pages are created?
>
> Please look here: invisible Unicode characters in page titles have
> caused problems when parsing those titles or linking to those pages:
> https://de.wikipedia.org/wiki/Benutzer_Diskussion:Wurgl#Liste_der_Biografien/Ci

See also
https://en.wikipedia.org/wiki/Wikipedia:Village_pump_%28technical%29#UTF-8_ZERO_WIDTH_SPACE_in_page_title

> If invisible characters were removed from the title when a page is
> created, such problems could never occur.
>
> What do you think about this, and which technical approaches
> already exist? How are LTR and RTL marks handled when pages are
> created with them?

See https://phabricator.wikimedia.org/maniphest/query/GDxAs4QdEDTG/#R
for related bugs, and a ticket about improving cleanupTitles.php.

Cheers,
andre
--
Andre Klapper (he/him) | Bugwrangler / Developer Advocate
https://blogs.gnome.org/aklapper/
_______________________________________________
Wikitech-l mailing list -- wikitech-l@lists.wikimedia.org
To unsubscribe send an email to wikitech-l-leave@lists.wikimedia.org
https://lists.wikimedia.org/postorius/lists/wikitech-l.lists.wikimedia.org/
Re: Avoid invisible characters in page titles
Thank you,

it seems that there's nobody working on it anymore, right?

Kind regards,
Martin ...



Re: Avoid invisible characters in page titles
Disallowing invisible characters or stripping them is a bad idea.

Invisible characters are heavily used in many languages, including
Persian, where they are part of the official manual of style taught in
schools. It would be downright wrong to check for and "fix" them on the
many wikis in those languages.

Also, many wikis have titles in other languages, such as Wiktionaries, or
redirects in a different language (for example:
https://en.wikipedia.org/w/index.php?title=%D8%AA%D9%87%D8%B1%D8%A7%D9%86&redirect=no),
which means removing ZWNJ or similar characters would be unacceptable on
English Wiktionary or English Wikipedia as well.

There are some exceptions, though: two invisible characters in a row are
wrong, as is an invisible character at the end or beginning. But all of
these rules are specific to Persian, and another language might actually
allow such cases.
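
The exceptions above can be sketched as a small check. This is only an
illustration in Python under stated assumptions: it looks at ZWNJ (U+200C)
alone, the function name is hypothetical, and the rules are the
Persian-specific ones just described, not something to apply to every
language.

```python
ZWNJ = "\u200c"  # ZERO WIDTH NON-JOINER

def suspicious_zwnj(title):
    """List which of the Persian-specific ZWNJ rules a title breaks."""
    problems = []
    if title.startswith(ZWNJ):
        problems.append("leading")
    if title.endswith(ZWNJ):
        problems.append("trailing")
    if ZWNJ * 2 in title:
        problems.append("doubled")
    return problems

# A Persian verb form that is correctly spelled with a single ZWNJ after the prefix.
word = "\u0645\u06cc\u200c\u062e\u0648\u0627\u0647\u0645"

print(suspicious_zwnj(word))            # [] : a single ZWNJ inside the word is fine
print(suspicious_zwnj(word + ZWNJ))     # ['trailing']
print(word.replace(ZWNJ, "") == word)   # False: stripping changes the spelling
```

A check like this flags titles for review; it does not rewrite them.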

Best




--
Amir (he/him)
Re: Avoid invisible characters in page titles
Even in English, you still have emoji that use ZWJ characters, e.g. ??????????
has "invisible" characters. There are all sorts of control characters in
Unicode that usually do nothing but sometimes do something.

There may be some invisible characters that make sense to strip or
normalize. Indeed we already do so for some. But one has to be very careful
about these things.
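
To make the emoji point concrete, here is a rough Python sketch comparing a
blanket strip with a narrow one. The function names and the set of
characters to strip are hypothetical choices for illustration; they are not
the list MediaWiki actually uses.

```python
import unicodedata

ZWJ = "\u200d"  # ZERO WIDTH JOINER

# Man + ZWJ + woman + ZWJ + girl: fonts that support the sequence draw one family emoji.
family = "\U0001F468" + ZWJ + "\U0001F469" + ZWJ + "\U0001F467"

def strip_all_format_chars(text):
    """Naive approach: drop every character of general category Cf."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")

# Hypothetical, deliberately small set: only the LTR and RTL marks.
STRIP_ONLY = {"\u200e", "\u200f"}

def strip_selected(text, to_strip=STRIP_ONLY):
    """Conservative approach: drop only explicitly listed characters."""
    return "".join(ch for ch in text if ch not in to_strip)

title = family + "\u200e"  # made-up title: the emoji followed by a stray LTR mark

print(len(strip_all_format_chars(title)))  # 3: the emoji falls apart into three
print(strip_selected(title) == family)     # True: the ZWJ sequence is left intact
```

The blanket version silently turns one emoji into three separate ones, which
is the kind of side effect that calls for care.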

-
Bawolff
