Mailing List Archive

ListFiles special page
Hello,

I have got issue with ListFiles page in mediawiki 1.35.1
Filtering worked not very good, was case-sensitive and not always got
text in middle of file name.
I looked in DB and saw that img_name column is varbinary, but
pagers/ImageListPager.php tries to do case-insensitive select with
LOWERing both sides of strings. But LOWER does not work for varbinary
So I think that following change will be reasonable:

--- ImageListPager.php.orig 2021-10-14 16:31:52.000000000 +0300
+++ ImageListPager.php 2021-10-14 16:00:10.127694733 +0300
@@ -90,9 +90,10 @@

if ( $nt ) {
$dbr = wfGetDB( DB_REPLICA );
- $this->mQueryConds[] = 'LOWER(img_name)'
.
+ $this->mQueryConds[] =
'LOWER(CONVERT(img_name USING utf8))' .
$dbr->buildLike(
$dbr->anyString(),
- strtolower(
$nt->getDBkey() ), $dbr->anyString() );
+ mb_strtolower(
$nt->getDBkey() ), $dbr->anyString() );
+
}
}

@@ -161,9 +162,9 @@
$nt = Title::newFromText( $this->mSearch );
if ( $nt ) {
$dbr = wfGetDB( DB_REPLICA );
- $conds[] = 'LOWER(' . $prefix . '_name)'
.
+ $conds[] = 'LOWER(CONVERT(' . $prefix .
'_name USING utf8))' .
$dbr->buildLike(
$dbr->anyString(),
- strtolower(
$nt->getDBkey() ), $dbr->anyString() );
+ mb_strtolower(
$nt->getDBkey() ), $dbr->anyString() );
}
}



--
Sergey
_______________________________________________
Wikitech-l mailing list -- wikitech-l@lists.wikimedia.org
To unsubscribe send an email to wikitech-l-leave@lists.wikimedia.org
https://lists.wikimedia.org/postorius/lists/wikitech-l.lists.wikimedia.org/
Re: ListFiles special page [ In reply to ]
Hi Sergey,

On Thu, 2021-10-14 at 18:16 +0300, Sergey Dorofeev wrote:
> I have got issue with ListFiles page in mediawiki 1.35.1
> text in middle of file name.
> I looked in DB and saw that img_name column is varbinary, but
> pagers/ImageListPager.php tries to do case-insensitive select with
> LOWERing both sides of strings. But LOWER does not work for varbinary
> So I think that following change will be reasonable:

Thanks for looking at the code! If you think that you found a software
bug, please feel free to report it in the issue tracker:
https://www.mediawiki.org/wiki/How_to_report_a_bug
Feel also very welcome to upload the patch into Gerrit for review:
https://www.mediawiki.org/wiki/Gerrit/Tutorial

Again, thanks a lot!
andre
--
Andre Klapper (he/him) | Bugwrangler / Developer Advocate
https://blogs.gnome.org/aklapper/
_______________________________________________
Wikitech-l mailing list -- wikitech-l@lists.wikimedia.org
To unsubscribe send an email to wikitech-l-leave@lists.wikimedia.org
https://lists.wikimedia.org/postorius/lists/wikitech-l.lists.wikimedia.org/
Re: ListFiles special page [ In reply to ]
I agree that LOWER doesn't make much sense in binary collation.

Sadly, a utf8 (3-byte UTF-8) conversion may fail for 4-byte characters, so
at the very least it should be utf8mb4 (4-byte UTF-8). I am not so familiar
with ListPager to say if there could be other issues arising from that-
sending a code review would be easier for better context.

On Thu, Oct 14, 2021 at 5:16 PM Sergey Dorofeev <sergey@fidoman.ru> wrote:

> Hello,
>
> I have got issue with ListFiles page in mediawiki 1.35.1
> Filtering worked not very good, was case-sensitive and not always got
> text in middle of file name.
> I looked in DB and saw that img_name column is varbinary, but
> pagers/ImageListPager.php tries to do case-insensitive select with
> LOWERing both sides of strings. But LOWER does not work for varbinary
> So I think that following change will be reasonable:
>
> --- ImageListPager.php.orig 2021-10-14 16:31:52.000000000 +0300
> +++ ImageListPager.php 2021-10-14 16:00:10.127694733 +0300
> @@ -90,9 +90,10 @@
>
> if ( $nt ) {
> $dbr = wfGetDB( DB_REPLICA );
> - $this->mQueryConds[] = 'LOWER(img_name)'
> .
> + $this->mQueryConds[] =
> 'LOWER(CONVERT(img_name USING utf8))' .
> $dbr->buildLike(
> $dbr->anyString(),
> - strtolower(
> $nt->getDBkey() ), $dbr->anyString() );
> + mb_strtolower(
> $nt->getDBkey() ), $dbr->anyString() );
> +
> }
> }
>
> @@ -161,9 +162,9 @@
> $nt = Title::newFromText( $this->mSearch );
> if ( $nt ) {
> $dbr = wfGetDB( DB_REPLICA );
> - $conds[] = 'LOWER(' . $prefix . '_name)'
> .
> + $conds[] = 'LOWER(CONVERT(' . $prefix .
> '_name USING utf8))' .
> $dbr->buildLike(
> $dbr->anyString(),
> - strtolower(
> $nt->getDBkey() ), $dbr->anyString() );
> + mb_strtolower(
> $nt->getDBkey() ), $dbr->anyString() );
> }
> }
>
>
>
> --
> Sergey
> _______________________________________________
> Wikitech-l mailing list -- wikitech-l@lists.wikimedia.org
> To unsubscribe send an email to wikitech-l-leave@lists.wikimedia.org
> https://lists.wikimedia.org/postorius/lists/wikitech-l.lists.wikimedia.org/
>


--
Jaime Crespo
<http://wikimedia.org>
Re: ListFiles special page [ In reply to ]
Thank you, did not know about it. Real UTF-8 in mysql is utf8mb4, I
think it should be used here.

---
Sergey

Jaime Crespo ????? 2021-10-14 18:32:

> I agree that LOWER doesn't make much sense in binary collation.
>
> Sadly, a utf8 (3-byte UTF-8) conversion may fail for 4-byte characters, so at the very least it should be utf8mb4 (4-byte UTF-8). I am not so familiar with ListPager to say if there could be other issues arising from that- sending a code review would be easier for better context.
>
> On Thu, Oct 14, 2021 at 5:16 PM Sergey Dorofeev <sergey@fidoman.ru> wrote:
>
>> Hello,
>>
>> I have got issue with ListFiles page in mediawiki 1.35.1
>> Filtering worked not very good, was case-sensitive and not always got
>> text in middle of file name.
>> I looked in DB and saw that img_name column is varbinary, but
>> pagers/ImageListPager.php tries to do case-insensitive select with
>> LOWERing both sides of strings. But LOWER does not work for varbinary
>> So I think that following change will be reasonable:
>>
>> --- ImageListPager.php.orig 2021-10-14 16:31:52.000000000 +0300
>> +++ ImageListPager.php 2021-10-14 16:00:10.127694733 +0300
>> @@ -90,9 +90,10 @@
>>
>> if ( $nt ) {
>> $dbr = wfGetDB( DB_REPLICA );
>> - $this->mQueryConds[] = 'LOWER(img_name)'
>> .
>> + $this->mQueryConds[] =
>> 'LOWER(CONVERT(img_name USING utf8))' .
>> $dbr->buildLike(
>> $dbr->anyString(),
>> - strtolower(
>> $nt->getDBkey() ), $dbr->anyString() );
>> + mb_strtolower(
>> $nt->getDBkey() ), $dbr->anyString() );
>> +
>> }
>> }
>>
>> @@ -161,9 +162,9 @@
>> $nt = Title::newFromText( $this->mSearch );
>> if ( $nt ) {
>> $dbr = wfGetDB( DB_REPLICA );
>> - $conds[] = 'LOWER(' . $prefix . '_name)'
>> .
>> + $conds[] = 'LOWER(CONVERT(' . $prefix .
>> '_name USING utf8))' .
>> $dbr->buildLike(
>> $dbr->anyString(),
>> - strtolower(
>> $nt->getDBkey() ), $dbr->anyString() );
>> + mb_strtolower(
>> $nt->getDBkey() ), $dbr->anyString() );
>> }
>> }
>>
>> --
>> Sergey
>> _______________________________________________
>> Wikitech-l mailing list -- wikitech-l@lists.wikimedia.org
>> To unsubscribe send an email to wikitech-l-leave@lists.wikimedia.org
>> https://lists.wikimedia.org/postorius/lists/wikitech-l.lists.wikimedia.org/
>
> --
>
> Jaime Crespo <http://wikimedia.org>
> _______________________________________________
> Wikitech-l mailing list -- wikitech-l@lists.wikimedia.org
> To unsubscribe send an email to wikitech-l-leave@lists.wikimedia.org
> https://lists.wikimedia.org/postorius/lists/wikitech-l.lists.wikimedia.org/
Re: ListFiles special page [ In reply to ]
It's years since I discovered that mysql's utf8 is broken in this way, but I can still feel the pain. What part of "universal" did they not understand? The mysql docs more or less say that "utf8" is deprecated, certainly not future-proof, and suggest you use utf8mb4. See https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-utf8.html <https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-utf8.html>)

> On Oct 14, 2021, at 1:17 PM, Sergey Dorofeev <sergey@fidoman.ru> wrote:
>
> Thank you, did not know about it. Real UTF-8 in mysql is utf8mb4, I think it should be used here.
>
> ---
> Sergey
>
>
> Jaime Crespo ????? 2021-10-14 18:32:
>
>> I agree that LOWER doesn't make much sense in binary collation.
>>
>> Sadly, a utf8 (3-byte UTF-8) conversion may fail for 4-byte characters, so at the very least it should be utf8mb4 (4-byte UTF-8). I am not so familiar with ListPager to say if there could be other issues arising from that- sending a code review would be easier for better context.
>>
>> On Thu, Oct 14, 2021 at 5:16 PM Sergey Dorofeev <sergey@fidoman.ru <mailto:sergey@fidoman.ru>> wrote:
>> Hello,
>>
>> I have got issue with ListFiles page in mediawiki 1.35.1
>> Filtering worked not very good, was case-sensitive and not always got
>> text in middle of file name.
>> I looked in DB and saw that img_name column is varbinary, but
>> pagers/ImageListPager.php tries to do case-insensitive select with
>> LOWERing both sides of strings. But LOWER does not work for varbinary
>> So I think that following change will be reasonable:
>>
>> --- ImageListPager.php.orig 2021-10-14 16:31:52.000000000 +0300
>> +++ ImageListPager.php 2021-10-14 16:00:10.127694733 +0300
>> @@ -90,9 +90,10 @@
>>
>> if ( $nt ) {
>> $dbr = wfGetDB( DB_REPLICA );
>> - $this->mQueryConds[] = 'LOWER(img_name)'
>> .
>> + $this->mQueryConds[] =
>> 'LOWER(CONVERT(img_name USING utf8))' .
>> $dbr->buildLike(
>> $dbr->anyString(),
>> - strtolower(
>> $nt->getDBkey() ), $dbr->anyString() );
>> + mb_strtolower(
>> $nt->getDBkey() ), $dbr->anyString() );
>> +
>> }
>> }
>>
>> @@ -161,9 +162,9 @@
>> $nt = Title::newFromText( $this->mSearch );
>> if ( $nt ) {
>> $dbr = wfGetDB( DB_REPLICA );
>> - $conds[] = 'LOWER(' . $prefix . '_name)'
>> .
>> + $conds[] = 'LOWER(CONVERT(' . $prefix .
>> '_name USING utf8))' .
>> $dbr->buildLike(
>> $dbr->anyString(),
>> - strtolower(
>> $nt->getDBkey() ), $dbr->anyString() );
>> + mb_strtolower(
>> $nt->getDBkey() ), $dbr->anyString() );
>> }
>> }
>>
>>
>>
>> --
>> Sergey
>> _______________________________________________
>> Wikitech-l mailing list -- wikitech-l@lists.wikimedia.org <mailto:wikitech-l@lists.wikimedia.org>
>> To unsubscribe send an email to wikitech-l-leave@lists.wikimedia.org <mailto:wikitech-l-leave@lists.wikimedia.org>
>> https://lists.wikimedia.org/postorius/lists/wikitech-l.lists.wikimedia.org/ <https://lists.wikimedia.org/postorius/lists/wikitech-l.lists.wikimedia.org/>
>>
>> --
>> Jaime Crespo
>> <http://wikimedia.org <http://wikimedia.org/>>
>>
>> _______________________________________________
>> Wikitech-l mailing list -- wikitech-l@lists.wikimedia.org <mailto:wikitech-l@lists.wikimedia.org>
>> To unsubscribe send an email to wikitech-l-leave@lists.wikimedia.org <mailto:wikitech-l-leave@lists.wikimedia.org>
>> https://lists.wikimedia.org/postorius/lists/wikitech-l.lists.wikimedia.org/ <https://lists.wikimedia.org/postorius/lists/wikitech-l.lists.wikimedia.org/>_______________________________________________
> Wikitech-l mailing list -- wikitech-l@lists.wikimedia.org
> To unsubscribe send an email to wikitech-l-leave@lists.wikimedia.org
> https://lists.wikimedia.org/postorius/lists/wikitech-l.lists.wikimedia.org/
Re: ListFiles special page [ In reply to ]
I don't want to defend MySQL development decisions- in fact PHP made some
similarly bad ones, but it would be unfair to judge them too harsly with
the "power of hindsight" [0]- but... /pedantic on

On Thu, Oct 14, 2021 at 7:37 PM Roy Smith <roy@panix.com> wrote:

> What part of "universal" did they not understand?
>

... several years ago, during the end of the century/start of a new one, no
one used UTF-8 [1] and PHP didn't even support multi-byte strings. The
original spec for UTF-8 called for up to 6 bytes[2]. The BMP, however (3
bytes) contained characters for most modern languages [3], which was a
waste of space and performance because at the time, MySQL worked much
faster with fixed-width columns, which would be a waste of space (double!).
My guess is that someone said "this is probably good enough", and would it
be too outrageous to think that we may not need as many extra characters as
stars in our Galaxy, when less than 65K were practically needed?

3 things changed after that:
* Unicode limited UTF-8 to encoding for 21 bits in 2003 [4], requiring only
4 bytes- only one more than on MySQL's utf8
* Apple wanted to sell iPhones in Japan, so they were added to unicode in
2010, and its subsequent popularity
* MySQL/InnoDB has been highly optimized for the fast handling of
variable-length strings

However, you cannot just arbitrarily break backwards compatibility and
rename the meaning of configuration- specially with storage software that
has been continuously supporting incremental upgrades as long as I can
remember. You can just support the new standard and encourage its usage,
make it the default, etc.

This is a bit offtopic here (feel free to PM to continue the conversation),
and just to be clear, I am _not fully justifying the decisions_, just
giving historical context, but I want to end with some relevant lessons to
the list:

* It is very difficult to build future-proof applications- PHP, MySQL,
Mediawiki, they have a long history and we should be gentle when we judge
them from the future. My work, involving backups, makes sometimes
supporting storage of stuff for over 5 years (unchanged) challenging,
because encryption algorithms are found to be weak, or end up being
unsupported/unavailable in just 2 releases of the operating system!
* Standards also change, they are not as "universal" as we may want to
believe (there have been 32 extra unicode versions since 1991). I expect
new collations to be needed in the future that are currently not
implemented, too.
* It is ok to make "mistakes", as long as we learn from them and improve
upon them :-)

Sorry for the text block.

[0] <url:https://powerlisting.fandom.com/wiki/Hindsight>
[1] <url:https://commons.wikimedia.org/wiki/File:Utf8webgrowth.svg>
[2] <url:https://www.rfc-editor.org/rfc/rfc2279>
[3] <url:
https://en.wikipedia.org/wiki/Plane_(Unicode)#Basic_Multilingual_Plane>
[4] <url:https://www.rfc-editor.org/rfc/rfc3629>
Re: ListFiles special page [ In reply to ]
s/they/emojis/

On Fri, Oct 15, 2021 at 2:12 PM Jaime Crespo <jcrespo@wikimedia.org> wrote:

> I don't want to defend MySQL development decisions- in fact PHP made some
> similarly bad ones, but it would be unfair to judge them too harsly with
> the "power of hindsight" [0]- but... /pedantic on
>
> On Thu, Oct 14, 2021 at 7:37 PM Roy Smith <roy@panix.com> wrote:
>
>> What part of "universal" did they not understand?
>>
>
> ... several years ago, during the end of the century/start of a new one,
> no one used UTF-8 [1] and PHP didn't even support multi-byte strings. The
> original spec for UTF-8 called for up to 6 bytes[2]. The BMP, however (3
> bytes) contained characters for most modern languages [3], which was a
> waste of space and performance because at the time, MySQL worked much
> faster with fixed-width columns, which would be a waste of space (double!).
> My guess is that someone said "this is probably good enough", and would it
> be too outrageous to think that we may not need as many extra characters as
> stars in our Galaxy, when less than 65K were practically needed?
>
> 3 things changed after that:
> * Unicode limited UTF-8 to encoding for 21 bits in 2003 [4], requiring
> only 4 bytes- only one more than on MySQL's utf8
> * Apple wanted to sell iPhones in Japan, so they were added to unicode in
> 2010, and its subsequent popularity
> * MySQL/InnoDB has been highly optimized for the fast handling of
> variable-length strings
>
> However, you cannot just arbitrarily break backwards compatibility and
> rename the meaning of configuration- specially with storage software that
> has been continuously supporting incremental upgrades as long as I can
> remember. You can just support the new standard and encourage its usage,
> make it the default, etc.
>
> This is a bit offtopic here (feel free to PM to continue the
> conversation), and just to be clear, I am _not fully justifying the
> decisions_, just giving historical context, but I want to end with some
> relevant lessons to the list:
>
> * It is very difficult to build future-proof applications- PHP, MySQL,
> Mediawiki, they have a long history and we should be gentle when we judge
> them from the future. My work, involving backups, makes sometimes
> supporting storage of stuff for over 5 years (unchanged) challenging,
> because encryption algorithms are found to be weak, or end up being
> unsupported/unavailable in just 2 releases of the operating system!
> * Standards also change, they are not as "universal" as we may want to
> believe (there have been 32 extra unicode versions since 1991). I expect
> new collations to be needed in the future that are currently not
> implemented, too.
> * It is ok to make "mistakes", as long as we learn from them and improve
> upon them :-)
>
> Sorry for the text block.
>
> [0] <url:https://powerlisting.fandom.com/wiki/Hindsight>
> [1] <url:https://commons.wikimedia.org/wiki/File:Utf8webgrowth.svg>
> [2] <url:https://www.rfc-editor.org/rfc/rfc2279>
> [3] <url:
> https://en.wikipedia.org/wiki/Plane_(Unicode)#Basic_Multilingual_Plane>
> [4] <url:https://www.rfc-editor.org/rfc/rfc3629>
>
>

--
Jaime Crespo
<http://wikimedia.org>