Mailing List Archive: Log and special characters

ben.rubson at gmail

Aug 1, 2017, 10:30 AM

Post #1 of 9 (2604 views)

Hi,

The following UTF-8 :
warn("warn with special char ééèè");
$r->log->error("log with special char ééèè");

Produces :
warn with special char ééèè at ...
[Tue Aug 01 19:25:28.914947 2017] [perl:error] [pid 56938] [client 127.0.0.1:59952] log with special char \xc3\xa9\xc3\xa9\xc3\xa8\xc3\xa8

Why all these \x symbols ?
How to avoid them ?

Thank you very much !

Ben

Re: Log and special characters [ In reply to ]

ben.rubson at gmail

Aug 2, 2017, 1:26 AM

Post #2 of 9 (2603 views)

> On 01 Aug 2017, at 19:30, Ben RUBSON <ben.rubson@gmail.com> wrote:
>
> $r->log->error("log with special char ééèè");
>
> [Tue Aug 01 19:25:28.914947 2017] [perl:error] [pid 56938] [client 127.0.0.1:59952] log with special char \xc3\xa9\xc3\xa9\xc3\xa8\xc3\xa8
>
> Why all these \x symbols ?

Well, it looks like Apache escapes bytes > 127 in server/util.c, ap_escape_errorlog_item().
Once encoded as UTF-8, é and è become bytes greater than 127, so they are escaped.
Here's the explanation :)
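
The visible effect can be reproduced with a minimal Perl sketch. The helper escape_like_errorlog is hypothetical: it is not Apache's code, it only imitates the result under the assumption that every byte outside printable ASCII is written as \xHH, and it ignores other cases (backslashes, quotes).

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper, NOT Apache's implementation: rewrite every byte
# outside printable ASCII (0x20-0x7e) as a \xHH escape.
sub escape_like_errorlog {
    my ($bytes) = @_;
    $bytes =~ s/([^\x20-\x7e])/sprintf('\\x%02x', ord $1)/ge;
    return $bytes;
}

# "ééèè" written with codepoint escapes, so the result does not depend
# on the encoding of this source file.
my $message = "log with special char \x{e9}\x{e9}\x{e8}\x{e8}";
my $bytes   = encode('UTF-8', $message);   # characters -> UTF-8 bytes

# prints: log with special char \xc3\xa9\xc3\xa9\xc3\xa8\xc3\xa8
print escape_like_errorlog($bytes), "\n";
```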

Ben

Re: Log and special characters [ In reply to ]

Aug 2, 2017, 1:52 AM

Post #3 of 9 (2603 views)

On 01.08.2017 19:30, Ben RUBSON wrote:
> Hi,
>
> The following UTF-8 :
> warn("warn with special char ééèè");
> $r->log->error("log with special char ééèè");
>
> Produces :
> warn with special char ééèè at ...
> [Tue Aug 01 19:25:28.914947 2017] [perl:error] [pid 56938] [client 127.0.0.1:59952] log with special char \xc3\xa9\xc3\xa9\xc3\xa8\xc3\xa8
>
> Why all these \x symbols ?

These represent the *bytes* which correspond to the UTF-8 encoding of your "special"
characters above. E.g. the character "é" has the Unicode codepoint 233 (decimal) or E9
(hexadecimal). When encoded using the UTF-8 encoding, this is represented by 2 bytes C3 A9
(hexadecimal). The "\x" prefix is a common way to indicate that the symbols which follow
should be interpreted as a hexadecimal number.
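
The codepoint and its two UTF-8 bytes can be checked with a few lines of Perl. This is only an illustrative sketch using the core Encode module; the variable names are arbitrary.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $char  = "\x{e9}";                 # "é", Unicode codepoint U+00E9
my $bytes = encode('UTF-8', $char);   # its UTF-8 encoding, as a byte string

# prints: codepoint: 233 decimal, E9 hexadecimal
printf "codepoint: %d decimal, %X hexadecimal\n", ord($char), ord($char);

# prints: UTF-8 bytes (2): C3 A9
printf "UTF-8 bytes (%d): %s\n",
    length($bytes),
    join ' ', map { sprintf '%02X', ord($_) } split //, $bytes;
```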

The exact reason why $r->log->error chooses to represent these characters in such a way in
the logfile (instead of just printing them as the bytes that constitute their UTF-8
encoding) is not really known to me, but I can make a guess :

Internally, perl "knows" that these characters are Unicode. But when it writes them out
to a file (such as here the logfile of Apache), it does not necessarily know that this
file itself is opened "in UTF-8 mode" and that it can just send the characters that way.
So it "escapes" them in a way that will make them readable by a human, no matter what (*).
And those are the \x.. (pure ASCII) representations that you see in the logfile.

On the other hand, the "warn()" that you also use above, that is perl writing directly to
its STDERR. And because that is a file that perl opened itself, it knows that it can
handle UTF-8, so it writes these characters directly that way.

> How to avoid them ?

In this case, I don't know, because it may depend on the way that Apache handles its
logfiles, and not only on perl/mod_perl.

>
(*) for example, no matter which text editor you later use to view the logfile. All text
editors can handle ASCII, but not necessarily UTF-8.

Ah, and I just saw your follow-up message, and between that and the above, we should have
some reasonable explanation together.

Re: Log and special characters [ In reply to ]

ben.rubson at gmail

Aug 2, 2017, 1:59 AM

Post #4 of 9 (2603 views)

Thank you very much for your detailed answer André !
Yes Perl must certainly escape UTF-8 characters as you just explained.
If we convert the string to ascii first (using Encode), these special characters are not correctly displayed, this time due to Apache ap_escape_errorlog_item() function.

Best thing is then to avoid them :)

Many thanks !

Re: Log and special characters [ In reply to ]

Aug 2, 2017, 2:17 AM

Post #5 of 9 (2603 views)

On 02.08.2017 10:59, Ben RUBSON wrote:
> Thank you very much for your detailed answer André !
> Yes Perl must certainly escape UTF-8 characters as you just explained.
> If we convert the string to ascii first (using Encode), these special characters are not correctly displayed, this time due to Apache ap_escape_errorlog_item() function.
>
> Best thing is then to avoid them :)
>

Unfortunately, this is not an option when applications have to deal with multiple
languages, and maybe log some important data that just is "not english" (like names of
people, or filenames that people use).
And unfortunately too, that is an issue which often does not seem so important to a lot of
english-native-language programmers, who tend to consider such characters as indeed
"special" and get very confused by them. To 80% of the people on earth, such characters
are not "special" at all; they are an integral part of their language, just like "a" or
"b" are an integral part of the English language. Hell, I can't even write my own name
correctly without them ! (and neither can a multitude of websites and email programs,
still today. I still get called Andr~O or similar all the time).

Re: Log and special characters [ In reply to ]

aarts.eric at gmail

Aug 2, 2017, 2:24 AM

Post #6 of 9 (2603 views)

What a stunning coincidence…

?? starting a new conversation ‘MP framework’ just after André his reply on
‘Log and special characters’.

Totally agree with you André, as we serve customers all over Europe and in
China.

Regards, Eric

Re: Log and special characters [ In reply to ]

ben.rubson at gmail

Aug 2, 2017, 2:25 AM

Post #7 of 9 (2603 views)

> On 02 Aug 2017, at 11:17, André Warnier (tomcat) <aw@ice-sa.com> wrote:
>
> Unfortunately, this is not an option when applications have to deal with multiple languages, and maybe log some important data that just is "not english" (like names of people, or filenames that people use).
> And unfortunately too, that is an issue which often does not seem so important to a lot of english-native-language programmers, who tend to consider such characters as indeed "special" and get very confused by them. To 80% of the people on earth, such characters are not "special" at all; they are an integral part of their language, just like "a" or "b" are an integral part of the English language. Hell, I can't even write my own name correctly without them ! (and neither can a multitude of websites and email programs, still today. I still get called Andr~O or similar all the time).

Yes you're right, this is an issue if we need to log things such as user input.
Supporting the extended ASCII table (up to decimal 255) would at least help a little.
We would then be able to correctly log 'André' :)
But many characters would still not be supported...

Re: Log and special characters [ In reply to ]

Aug 2, 2017, 4:37 AM

Post #8 of 9 (2603 views)

On 02.08.2017 11:25, Ben RUBSON wrote:
> Yes you're right, this is an issue if we need to log things such as user input.
> Supporting the extended ASCII table (up to decimal 255) would at least help a little.
> We would then be able to correctly log 'André' :)
> But many characters would still not be supported...
>

One thing to say, is that the current way in which Apache handles its logfiles, at least
logs these "extended" characters, without generating an error, and without corrupting or
losing data (the bytes composing the correct UTF-8 encoded characters are there, even if
they are difficult to read by a human). That's better than having some undecipherable "[]"
or "?" replacement symbol.
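
Since the bytes are all there, the escapes can be turned back into readable text when viewing the log. The following is a minimal sketch, not a tested tool: unescape_log_line is a hypothetical helper, and it assumes that the log only contains \xHH escapes plus doubled backslashes standing for a literal backslash.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $backslash = chr(92);

# Hypothetical helper: turn \xHH escapes back into raw bytes, then decode
# the resulting byte string as UTF-8. A doubled backslash is assumed to be
# an escaped literal backslash and becomes a single one.
sub unescape_log_line {
    my ($line) = @_;
    $line =~ s{\\(?:x([0-9a-fA-F]{2})|\\)}
              { defined $1 ? chr(hex $1) : $backslash }ge;
    return decode('UTF-8', $line);
}

binmode STDOUT, ':encoding(UTF-8)';
print unescape_log_line('log with special char \xc3\xa9\xc3\xa9\xc3\xa8\xc3\xa8'), "\n";
```

Run over a copy of the error log, something like this would show the message as it was originally written, provided the viewer itself handles UTF-8.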

But I guess that a "proper" or "better" solution would be for Apache to write the logs
(always/optionally) in Unicode/UTF-8, and be able to tell mod_perl's $r->log->error() that
this is the case (or maybe that would not even be necessary then).
Maybe a problem with this however, would be the multiple "log analysis" programs which
exist (AWStats, for example), and which may not be able to handle this right now.

I'll try to float the idea on the Apache httpd list.

Re: Log and special characters [ In reply to ]

Aug 2, 2017, 6:45 AM

Post #9 of 9 (2603 views)

On 02.08.2017 11:25, Ben RUBSON wrote:
> We would then be able to correctly log 'André'

Actually, this is how I most often get it, on the web and in scam emails :

"Hi AndrÃ©,"

To the savvy and experienced multilingual-application-programming expert, this of course
is entirely transparent :
- the letters "Ã©" are in reality the (bad) interpretation as ISO-8859-1, of the UTF-8
2-bytes sequence \xc3\xa9, which as you well know now, represents the Unicode character
with codepoint 233 (decimal) (or E9 (hexadecimal)), which is the printable latin letter "é".
- the misinterpretation is due to some program in the chain leading to this HTML page or
email, which does not, or incorrectly, support UTF-8, and which has interpreted this as 2
bytes, instead of 1 character.

(It gets even funnier when some other program in the chain tries to do "the right thing"
and re-encodes these 2 characters as UTF-8, thus yielding 4 bytes which have nothing to do
anymore with the original, no matter how encoded. And I am sure that anyone dealing with
the Chinese language has better stories to tell.)
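
Both accidents are easy to reproduce in Perl with the core Encode module. This is only an illustrative sketch; the variable names are arbitrary.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $name = "Andr\x{e9}";               # "André" as Perl characters
my $utf8 = encode('UTF-8', $name);     # 6 bytes: A n d r C3 A9

# Misreading the UTF-8 bytes as ISO-8859-1 turns C3 A9 into two characters.
my $mojibake = decode('ISO-8859-1', $utf8);   # "AndrÃ©"

# Re-encoding that misread text as UTF-8 yields 4 bytes for the one "é".
my $double = encode('UTF-8', $mojibake);

binmode STDOUT, ':encoding(UTF-8)';
print "misread as ISO-8859-1: $mojibake\n";

# prints: double-encoded bytes for the last letter: C3 83 C2 A9
printf "double-encoded bytes for the last letter: %s\n",
    join ' ', map { sprintf '%02X', ord($_) } split //, substr($double, 4);
```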

Knowing how it happens does nothing to alleviate the frustration though.

It even happens with organisations which by all means /should/ really know better.
The following is extracted from emails received from the *Apache httpd* Developers
mailing list :

[...]
Weird behaviour with mod_ssl and SSLCryptoDevice
85066 by: jean-frederic clere
85067 by: Stefan Eissing
85068 by: jean-frederic clere
85069 by: jean-frederic clere
85075 by: Jan KaluÅ¾a <------------

[...]
scoreboard and http2
85003 by: Stefan Eissing
85004 by: Graham Leggett
85005 by: PlÃ¼m, RÃ¼diger, Vodafone Group <---

Poor Rüdiger, who systematically sees his name mangled each time he contributes.
And who knows what Jan's name really looks like ?

Mind you, this is about Apache httpd, the webserver which powers more than 50% of
worldwide websites. So I believe that other character-set confused programmers have some
excuse.
:-)