I've noticed that the traditional locale-based case conversion functions
(ucfirst(), strtolower(), etc.) aren't reliable for anything but English.
Even when they do work, the result depends heavily on the system
configuration, so it isn't transparently portable.

So I've added new case conversion functions ucfirstIntl(),
strtoupperIntl(), and strtolowerIntl(), which convert case more or less
properly in a system-independent manner. For single-byte character
encodings this is very simple and is based on the PHP strtr() function:
just define the strings $wikiUpperChars, containing all the uppercase
characters, and $wikiLowerChars, containing all the lowercase characters
in the same order. (See the example for ISO-8859-1 in wikiTextEn.php.)
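
As a rough sketch of the single-byte case: the tables below are abbreviated
and made up for illustration (only a few ISO-8859-1 accented letters, written
as byte escapes), and the function bodies are my assumption of how strtr()
would be applied, not a copy of the actual code.

```php
<?php
// Illustrative, abbreviated tables. Position N in one string must
// correspond to position N in the other.
$wikiUpperChars = "ABCDEFGHIJKLMNOPQRSTUVWXYZ\xC0\xC9\xD6\xDC"; // À É Ö Ü in ISO-8859-1
$wikiLowerChars = "abcdefghijklmnopqrstuvwxyz\xE0\xE9\xF6\xFC"; // à é ö ü in ISO-8859-1

function strtoupperIntl($str) {
    global $wikiUpperChars, $wikiLowerChars;
    // Three-argument strtr() translates byte by byte.
    return strtr($str, $wikiLowerChars, $wikiUpperChars);
}

function strtolowerIntl($str) {
    global $wikiUpperChars, $wikiLowerChars;
    return strtr($str, $wikiUpperChars, $wikiLowerChars);
}

function ucfirstIntl($str) {
    if ($str === '') {
        return $str;
    }
    // In a single-byte encoding the first byte is the first letter.
    return strtoupperIntl(substr($str, 0, 1)) . substr($str, 1);
}
```

With these tables, strtoupperIntl("\xE9t\xE9") gives "\xC9T\xC9" regardless of
the system locale.
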

For multibyte character sets it's a little more complex: the same
function is used in its array mode, which maps byte sequences to byte
sequences. Most multibyte character sets are for Asian languages that
have no case distinction, so this is unlikely to come up often except
with UTF-8. I've included conversion arrays for UTF-8 in utf8Case.php,
which should cover just about everything, so any future 'pedias that use
UTF-8 need only include that file (as wikiTextEo.php does).
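
A minimal sketch of the array mode, assuming a lowercase-to-uppercase table
keyed by UTF-8 byte sequences. The table name, its handful of entries, and
the helper functions are hypothetical; the real arrays in utf8Case.php are far
larger and may be organized differently.

```php
<?php
// Hypothetical excerpt of a lowercase => uppercase table (UTF-8 bytes).
$lowerToUpper = array(
    "\xC4\x89" => "\xC4\x88", // ĉ => Ĉ
    "\xC4\x9D" => "\xC4\x9C", // ĝ => Ĝ
    "\xC5\xAD" => "\xC5\xAC", // ŭ => Ŭ
    "\xC3\xA9" => "\xC3\x89", // é => É
);
// Add the ASCII letters without relying on locale-aware functions.
foreach (range('a', 'z') as $c) {
    $lowerToUpper[$c] = chr(ord($c) - 32);
}
// The reverse table is just the same pairs flipped.
$upperToLower = array_flip($lowerToUpper);

function utf8ToUpper($str, $lowerToUpper) {
    // Two-argument strtr() replaces whole keys; longest keys are tried
    // first and replaced text is not scanned again.
    return strtr($str, $lowerToUpper);
}

function utf8ToLower($str, $upperToLower) {
    return strtr($str, $upperToLower);
}
```

Because the keys are complete byte sequences, a multibyte letter is replaced
as a unit, and bytes that match no key pass through unchanged.
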

Also, it should be possible to extend ucfirstIntl() a bit to allow for
multi-character first-letter sequences (for instance treating ij -> IJ
as one letter, which I believe is the officially correct behavior for
Dutch).
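
One way such an extension might look, purely as a sketch: the sequence table,
both function names, and the ASCII-only fallback are hypothetical stand-ins,
and nothing like this exists in the code yet.

```php
<?php
// Hypothetical per-language table: leading sequences that are
// capitalized as a single letter.
$dutchSequences = array('ij' => 'IJ');

// Stand-in for the ordinary one-letter case (ASCII only, for brevity).
function ucfirstSingle($str) {
    if ($str === '') {
        return $str;
    }
    $first = strtr(substr($str, 0, 1),
        'abcdefghijklmnopqrstuvwxyz', 'ABCDEFGHIJKLMNOPQRSTUVWXYZ');
    return $first . substr($str, 1);
}

function ucfirstSequences($str, $sequences) {
    foreach ($sequences as $lower => $upper) {
        // Does the string begin with this lowercase sequence?
        if (strncmp($str, $lower, strlen($lower)) === 0) {
            return $upper . substr($str, strlen($lower));
        }
    }
    // No special sequence at the start: capitalize one letter as usual.
    return ucfirstSingle($str);
}
```

Here ucfirstSequences('ijsselmeer', $dutchSequences) would give 'IJsselmeer',
while a word not starting with "ij" falls through to the normal rule.
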
-- brion vibber (brion @ pobox.com)