Mailing List Archive

Bug: setlocale and case-insensitive pattern matching
Hi !

I run perl 5.001m in an environment where I use the Iso-Latin-1
character set.

When trying to do case-insensitive pattern matching with strings
and patterns containing non-ASCII characters (the Swedish
characters åäöÅÄÖ) strange things happened.

The code looked something like:

    if ($word =~ /dalaröbrygga/i) {
        print "Found $&\n";
    }

This matched "dalaröbrygga" and "DALARöBRYGGA" but not "DALARÖBRYGGA"
Re: Bug: setlocale and case-insensitive pattern matching
Hi again !

Yesterday I sent this bug report to perlbug@perl.com, but from
the first reply I got it seems that the mail was cut off in the
middle by some stupid sendmail (!?).
Therefore I am sending it once again.

Since this is MIME, the Swedish characters (that the bug is about)
may show up as the infamous Quoted-Printable.

The following was sent yesterday. The last line should be "THE END".

------------------------

I run perl 5.001m in an environment where I use the Iso-Latin-1
character set.

When trying to do case-insensitive pattern matching with strings
and patterns containing non-ASCII characters (the Swedish
characters åäöÅÄÖ) strange things happened.

The code looked something like:

    if ($word =~ /dalaröbrygga/i) {
        print "Found $&\n";
    }

This matched "dalaröbrygga" and "DALARöBRYGGA" but not "DALARÖBRYGGA".

I have looked around in the code and believe two things (at least)
need to be changed to make this work as expected:


1) The "locale" set by the LC_* variables should be honored
   by perl without the need to call "setlocale" in the script.
   I saw that the bug report NETaa14799 already deals with this,
   but I think that maybe it should be done slightly differently
   from what is described in that bug report:

   a) The second argument to 'setlocale' should be "", not the
      value of LC_CTYPE. 'setlocale' has its own logic for
      choosing an appropriate value from LC_ALL, LC_CTYPE,
      LANG, ...

   b) Maybe all categories should be set, not just the
      LC_CTYPE category.

   I tried adding a call to "setlocale" at the beginning of "main".
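
As a rough sketch of points a) and b), assuming a hosted C environment
with <locale.h>: passing "" as the second argument asks the C library
to choose the locale from the environment itself (on POSIX systems
LC_ALL takes precedence, then the category's own variable, then LANG).
The helper name init_locale_from_env is hypothetical and is not a
function in the perl source.

```c
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: initialise the locale from the environment.
 * If all_categories is non-zero, every category is set (point b);
 * otherwise only LC_CTYPE is set. Returns the locale name reported by
 * the C library, or NULL if the request could not be honoured. */
const char *init_locale_from_env(int all_categories)
{
    /* "" means: let setlocale consult the environment variables. */
    return setlocale(all_categories ? LC_ALL : LC_CTYPE, "");
}
```

A caller would check the result for NULL before relying on locale-aware
functions such as toupper, since a NULL return means the locale was
left unchanged.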


2) In the file "perl.h" there is an array 'fold' mapping characters
   to their "opposite case". This array seems to be used when
   case-insensitive pattern matching is done. The array should be
   different depending on which "locale" is in use.
   I added the following right after the call to "setlocale" in "main":


    int i;

    for (i=0; i<256; i++) {
        fold[i] = i ^ toupper(i) ^ tolower(i);
    }


After I made these changes my code seems to work as expected.
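
The XOR in that loop works because, for a cased character in a
single-byte locale, one of toupper(i) and tolower(i) is i itself, so the
two equal terms cancel and the opposite-case character is left. For a
character with no case, both calls return i and the result is i. The
following is a self-contained sketch of the same idea under stated
assumptions: build_fold_table and fold_equal are hypothetical names, the
table is a local array rather than perl's own 'fold', and the comparison
is only an illustration of how such a table could be used, not perl's
actual matching code.

```c
#include <ctype.h>
#include <stddef.h>

/* Fill table[0..255] so that table[c] is the opposite-case form of c
 * in the current LC_CTYPE locale, or c itself when c has no case. */
void build_fold_table(unsigned char table[256])
{
    int i;

    for (i = 0; i < 256; i++) {
        /* For a lowercase letter tolower(i) == i, so the XOR leaves
         * toupper(i); for an uppercase letter it leaves tolower(i). */
        table[i] = (unsigned char)(i ^ toupper(i) ^ tolower(i));
    }
}

/* Compare two byte strings of length len, ignoring case as defined by
 * the table. Returns 1 if they are equal, 0 otherwise. */
int fold_equal(const unsigned char *a, const unsigned char *b,
               size_t len, const unsigned char table[256])
{
    size_t i;

    for (i = 0; i < len; i++) {
        /* Bytes match if identical or if one is the fold of the other. */
        if (a[i] != b[i] && a[i] != table[b[i]])
            return 0;
    }
    return 1;
}
```

In the default "C" locale the table only pairs A-Z with a-z, so the byte
0xF6 (ö) maps to itself. After a successful setlocale(LC_CTYPE, "") in an
ISO Latin-1 locale one would expect it to map to 0xD6 (Ö), which is what
the pattern in this report needs.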


Do you agree that what I describe is a bug ?
Have I fixed it the right way ?

Regards,

Johan Holmberg

-----------------------------------------------------------------------
Johan Holmberg Email: holmberg@upp.promotor.telia.se
Telia Promotor AB Phone: +46 18 18 94 55
Box 1218 Mobile: +46 70 528 94 55
751 42 Uppsala, SWEDEN Fax: +46 18 18 94 99
-----------------------------------------------------------------------

THE END
Re: Bug: setlocale and case-insensitive pattern matching
I already sent an answer to Johan this morning saying that in perl
5.002beta1 this "just works". I do not know why he has not received
my reply.

johan> b) Maybe all categories should be set, not just the
johan> LC_CTYPE category.

Maybe. But this might break things in places where the locale support
by the vendor is broken. LC_CTYPE should be the most harmless one of
them. I say we watch for a while whether the setlocale(LC_CTYPE, "")
in the main() of 5.002 seems to work ok. If it does, we can add the
rest of the LC_*.

++jhi;
Re: Bug: setlocale and case-insensitive pattern matching
Yes, you seem to be right -- your example does not work even with
the setlocale() stuff already in main().

The funny thing is that my example (which I used when I saw your
complaint) does work:

    for $pat ("dalar\xf6brygga", "DALAR\xd6BRYGGA") {
        for $word ("dalar\xf6brygga", "DALAR\xd6BRYGGA") {
            print "$word matched $pat\n" if ($word =~ /$pat/i);
        }
    }

Here \xf6 and \xd6 are the odiaeresis (ö) and Odiaeresis (Ö).
All four matches succeed.

If Johan's test is rewritten to be:

    $x = "dalar\xf6brygga";

    if ($word =~ /$x/i) {
        print "Found $word\n";
    }
    else {
        print "\tCan't find $word\n";
    }

The word is found. Strange. After some testing it seems that the
patch is needed when

    an uppercase word is matched against a fixed (literal) pattern

When the pattern is 'soft' (interpolated from a variable, as in my
example) or the word is lowercase, things 'just work' as they are
supposed to (disclaimer: tested on Digital UNIX 3.2C, Finnish locale,
perl 5.002b1f).

In conclusion: Johan's patch seems a wise choice.
Please include it in 5.002; it fixes one naughty gap in Perl's I18N.
(A gap which I tried to fix with my setlocale(LC_CTYPE, "") but
apparently did not try hard enough...)

++jhi;