Mailing List Archive

Problem with accented characters in mailbox.Maildir()
I have a custom mail filter in python that uses the mailbox package to
open a mail message and give me access to the headers.

So I have the following code to open each mail message:-

#
#
# Read the message from standard input and make a message object from it
#
msg = mailbox.MaildirMessage(sys.stdin.buffer.read())

and then later I have (among many other bits and pieces):-

#
#
# test for string in Subject:
#
if searchTxt in str(msg.get("subject", "unknown")):
    do
    various
    things


This works exactly as intended most of the time but occasionally a
message whose subject should match the test is missed. I have just
realised when this happens, it's when the Subject: has accented
characters in it (this is from a mailing list about canals in France).

So, for example, the latest case of this happening has:-

Subject: aka Marne à la Saône (Waterways Continental Europe)

where the searchTxt in the code above is "Waterways Continental Europe".


Is there any way I can work round this issue? E.g. is there a way to
strip out all extended characters from a string? Or maybe it's
msg.get() that isn't managing to handle the accented string correctly?

Yes, I know that accented characters probably aren't allowed in
Subject: but I'm not going to get that changed! :-)


--
Chris Green
--
https://mail.python.org/mailman/listinfo/python-list
Re: Problem with accented characters in mailbox.Maildir()
Chris Green wrote:
> I have a custom mail filter in python that uses the mailbox package to
> open a mail message and give me access to the headers.
>
> So I have the following code to open each mail message:-
>
> #
> #
> # Read the message from standard input and make a message object from it
> #
> msg = mailbox.MaildirMessage(sys.stdin.buffer.read())
>
> and then later I have (among many other bits and pieces):-
>
> #
> #
> # test for string in Subject:
> #
> if searchTxt in str(msg.get("subject", "unknown")):
>     do
>     various
>     things
>
>
> This works exactly as intended most of the time but occasionally a
> message whose subject should match the test is missed. I have just
> realised when this happens, it's when the Subject: has accented
> characters in it (this is from a mailing list about canals in France).
>
> So, for example, the latest case of this happening has:-
>
> Subject: aka Marne à la Saône (Waterways Continental Europe)
>
> where the searchTxt in the code above is "Waterways Continental Europe".
>
>
> Is there any way I can work round this issue? E.g. is there a way to
> strip out all extended characters from a string? Or maybe it's
> msg.get() that isn't managing to handle the accented string correctly?
>
> Yes, I know that accented characters probably aren't allowed in
> Subject: but I'm not going to get that changed! :-)
>
>

Hi,
you could try extracting the "Content-Type:charset" and then using it
for subject conversion:

subj = str(raw_subj, encoding='...')

Re: Problem with accented characters in mailbox.Maildir()
A bit more information, msg.get("subject", "unknown") does return a
string, as follows:-

Subject: =?utf-8?Q?aka_Marne_=C3=A0_la_Sa=C3=B4ne_(Waterways_Continental_Europe)?=

So it's the 'searchTxt in msg.get("subject", "unknown")' that's
failing. I.e. for some reason 'in' isn't working when the searched
string has utf-8 characters.

Surely there's a way to handle this.

--
Chris Green
Re: Problem with accented characters in mailbox.Maildir()
Chris Green <cl@isbd.net> wrote:
> A bit more information, msg.get("subject", "unknown") does return a
> string, as follows:-
>
> Subject: =?utf-8?Q?aka_Marne_=C3=A0_la_Sa=C3=B4ne_(Waterways_Continental_Europe)?=
>
> So it's the 'searchTxt in msg.get("subject", "unknown")' that's
> failing. I.e. for some reason 'in' isn't working when the searched
> string has utf-8 characters.
>
> Surely there's a way to handle this.
>
... and of course I now see the issue! The Subject: with utf-8
characters in it gets spaces changed to underscores. So searching for
'(Waterways Continental Europe)' fails.

I'll either need to test for both versions of the string or I'll need
to change underscores to spaces in the Subject: returned by msg.get().
It's a long enough string that I'm searching for that I won't get any
false positives.


Sorry for the noise everyone, it's a typical case of explaining the
problem shows one how to fix it! :-)

--
Chris Green
Re: Problem with accented characters in mailbox.Maildir()
Chris Green wrote:
> Chris Green <cl@isbd.net> wrote:
>> A bit more information, msg.get("subject", "unknown") does return a
>> string, as follows:-
>>
>> Subject: =?utf-8?Q?aka_Marne_=C3=A0_la_Sa=C3=B4ne_(Waterways_Continental_Europe)?=
>>
>> So it's the 'searchTxt in msg.get("subject", "unknown")' that's
>> failing. I.e. for some reason 'in' isn't working when the searched
>> string has utf-8 characters.
>>
>> Surely there's a way to handle this.
>>
> ... and of course I now see the issue! The Subject: with utf-8
> characters in it gets spaces changed to underscores. So searching for
> '(Waterways Continental Europe)' fails.
>
> I'll either need to test for both versions of the string or I'll need
> to change underscores to spaces in the Subject: returned by msg.get().
> It's a long enough string that I'm searching for that I won't get any
> false positives.
>
>
> Sorry for the noise everyone, it's a typical case of explaining the
> problem shows one how to fix it! :-)
>

This is probably what you need:

import email.header

raw_subj = '=?utf-8?Q?aka_Marne_=C3=A0_la_Sa=C3=B4ne_(Waterways_Continental_Europe)?='

subj = email.header.decode_header(raw_subj)[0]

subj[0].decode(subj[1])

'aka Marne à la Saône (Waterways Continental Europe)'




Re: Problem with accented characters in mailbox.Maildir()
On 2023-05-06 16:27:04 +0200, jak wrote:
> Chris Green wrote:
> > Chris Green <cl@isbd.net> wrote:
> > > A bit more information, msg.get("subject", "unknown") does return a
> > > string, as follows:-
> > >
> > > Subject: =?utf-8?Q?aka_Marne_=C3=A0_la_Sa=C3=B4ne_(Waterways_Continental_Europe)?=
[...]
> > ... and of course I now see the issue! The Subject: with utf-8
> > characters in it gets spaces changed to underscores. So searching for
> > '(Waterways Continental Europe)' fails.
> >
> > I'll either need to test for both versions of the string or I'll need
> > to change underscores to spaces in the Subject: returned by msg.get().

You need to decode the Subject properly. Unfortunately the Python email
module doesn't do that for you automatically. But it does provide the
necessary tools. Don't roll your own unless you've read and understood
the relevant RFCs.

>
> This is probably what you need:
>
> import email.header
>
> raw_subj = '=?utf-8?Q?aka_Marne_=C3=A0_la_Sa=C3=B4ne_(Waterways_Continental_Europe)?='
>
> subj = email.header.decode_header(raw_subj)[0]
>
> subj[0].decode(subj[1])
>
> 'aka Marne à la Saône (Waterways Continental Europe)'

You are on the right track, but that works only because the example
consists of a single encoded word. This is not always the case (and
indeed not what the RFC recommends).

email.header.decode_header returns a *list* of chunks and you have to
process and concatenate all of them.

Here is a snippet from a mail to html converter I wrote a few years ago:

import email.header

def decode_rfc2047(s):
    if s is None:
        return None
    r = ""
    for chunk in email.header.decode_header(s):
        if chunk[1]:
            try:
                r += chunk[0].decode(chunk[1])
            except LookupError:
                r += chunk[0].decode("windows-1252")
            except UnicodeDecodeError:
                r += chunk[0].decode("windows-1252")
        elif type(chunk[0]) == bytes:
            r += chunk[0].decode('us-ascii')
        else:
            r += chunk[0]
    return r

(this is maybe a bit more forgiving than the OP needs, but I had to deal
with malformed mails)
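
To illustrate why all chunks matter, here is a minimal sketch. The
subject string is a made-up variant of the OP's in which only part of
the text is an encoded word, and the names raw_subj, chunks and full
are purely illustrative:

```python
import email.header

# Hypothetical subject: plain ASCII text followed by one encoded word.
raw_subj = "aka Marne =?utf-8?Q?=C3=A0_la_Sa=C3=B4ne?="

# decode_header returns a list of (data, charset) pairs.
chunks = email.header.decode_header(raw_subj)

for data, charset in chunks:
    # charset is None for the part that was not an encoded word
    print(repr(data), charset)

# Decode and concatenate every chunk, not just chunks[0].
full = "".join(
    data.decode(charset or "us-ascii") if isinstance(data, bytes) else data
    for data, charset in chunks
)
print(full)
```

Taking only the first chunk here would keep "aka Marne" and silently
drop the accented part of the subject.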

I do have to say that Python is extraordinarily clumsy in this regard.

hp

--
_ | Peter J. Holzer | Story must make more sense than reality.
|_|_) | |
| | | hjp@hjp.at | -- Charles Stross, "Creative writing
__/ | http://www.hjp.at/ | challenge!"
Re: Problem with accented characters in mailbox.Maildir()
Peter J. Holzer wrote:
> On 2023-05-06 16:27:04 +0200, jak wrote:
>> Chris Green wrote:
>>> Chris Green <cl@isbd.net> wrote:
>>>> A bit more information, msg.get("subject", "unknown") does return a
>>>> string, as follows:-
>>>>
>>>> Subject: =?utf-8?Q?aka_Marne_=C3=A0_la_Sa=C3=B4ne_(Waterways_Continental_Europe)?=
> [...]
>>> ... and of course I now see the issue! The Subject: with utf-8
>>> characters in it gets spaces changed to underscores. So searching for
>>> '(Waterways Continental Europe)' fails.
>>>
>>> I'll either need to test for both versions of the string or I'll need
>>> to change underscores to spaces in the Subject: returned by msg.get().
>
> You need to decode the Subject properly. Unfortunately the Python email
> module doesn't do that for you automatically. But it does provide the
> necessary tools. Don't roll your own unless you've read and understood
> the relevant RFCs.
>
>>
>> This is probably what you need:
>>
>> import email.header
>>
>> raw_subj = '=?utf-8?Q?aka_Marne_=C3=A0_la_Sa=C3=B4ne_(Waterways_Continental_Europe)?='
>>
>> subj = email.header.decode_header(raw_subj)[0]
>>
>> subj[0].decode(subj[1])
>>
>> 'aka Marne à la Saône (Waterways Continental Europe)'
>
> You are on the right track, but that works only because the example
> consists of a single encoded word. This is not always the case (and
> indeed not what the RFC recommends).
>
> email.header.decode_header returns a *list* of chunks and you have to
> process and concatenate all of them.
>
> Here is a snippet from a mail to html converter I wrote a few years ago:
>
> def decode_rfc2047(s):
>     if s is None:
>         return None
>     r = ""
>     for chunk in email.header.decode_header(s):
>         if chunk[1]:
>             try:
>                 r += chunk[0].decode(chunk[1])
>             except LookupError:
>                 r += chunk[0].decode("windows-1252")
>             except UnicodeDecodeError:
>                 r += chunk[0].decode("windows-1252")
>         elif type(chunk[0]) == bytes:
>             r += chunk[0].decode('us-ascii')
>         else:
>             r += chunk[0]
>     return r
>
> (this is maybe a bit more forgiving than the OP needs, but I had to deal
> with malformed mails)
>
> I do have to say that Python is extraordinarily clumsy in this regard.
>
> hp
>

Thanks for the reply. In fact, I gave that answer because I did
not understand what the OP wanted to achieve. In addition, the
OP opened a second thread on the similar topic in which I gave a
more correct answer (subject: "What do these '=?utf-8?' sequences
mean in python?", date: "Sat, 6 May 2023 14:50:40 UTC").
I was interested in this thread because a few years ago I wrote a
program in C that emailed an application's log file whenever the
application crashed. I created the attachment as base64, but at the
time I did not know about RFC 2047, which covers the subject. In
addition, investigating the needs of
the OP, I discovered that the MAME is not the only format used
to compose the subject. I found an example in a thread from some
days ago where the subject contained Arabic text (sender:
"Uhrda education <Fatmaelhlwany9@gmail.com>", date: "Wed, 03
May 2023 00:18:14 UTC"). This is the raw version of the subject:

=?UTF-8?B?2LTZh9in2K/YqSDYo9iu2LXYp9im2Yog2K7Yr9mF2Kkg2LnZhdmE2KfYoSDZhdi52KrZhQ==?=

=?UTF-8?B?2K8gI9in2YjZhtmE2KfZitmGINio2LHYs9mI2YUg2YXYrtmB2LbYqSDYrtmE2KfZhCDYtNmH2LEg2YU=?=

=?UTF-8?B?2KfZitmIMjAyMyDZhNmE2KfYs9iq2YHYs9in2LEg2YjYp9iq2LMgLyAwMDIwMTAwOTMwNjExMQ==?=

As you can see, the encoding letter in each encoded word is not a
'Q' as in the OP's message but a 'B', and the text itself is
encoded as base64. This made me think
that a library could not delegate to the programmer the burden of
managing all these exceptions, then I have further investigated
to discover that the library also provides the conversion
function beyond that of coding and this makes our labors vain:

----------
from email.header import decode_header, make_header

subject = make_header(decode_header(raw_subject))
----------

This line of code correctly converts the OP's subject
and also the one with the text in Arabic.
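
As a rough sketch of that one-liner handling both encodings: the
helper name decode_subject is made up for illustration, and the
base64 subject is a short hypothetical one built on the spot rather
than the Arabic header above.

```python
import base64
from email.header import decode_header, make_header


def decode_subject(raw_subject):
    """Return an RFC 2047 encoded header value as a plain str."""
    return str(make_header(decode_header(raw_subject)))


# The OP's subject, which uses the 'Q' (quoted-printable) variant.
q_subject = "=?utf-8?Q?aka_Marne_=C3=A0_la_Sa=C3=B4ne_(Waterways_Continental_Europe)?="

# A hypothetical subject using the 'B' (base64) variant.
b_word = base64.b64encode("Saône".encode("utf-8")).decode("ascii")
b_subject = "=?UTF-8?B?" + b_word + "?="

print(decode_subject(q_subject))
print(decode_subject(b_subject))
```

Both calls go through the same code path, so the caller does not
need to look at the encoding letter at all.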

I greet you with cordiality.
Re: Problem with accented characters in mailbox.Maildir()
On 2023-05-08 23:02:18 +0200, jak wrote:
> Peter J. Holzer wrote:
> > On 2023-05-06 16:27:04 +0200, jak wrote:
> > > Chris Green wrote:
> > > > Chris Green <cl@isbd.net> wrote:
> > > > > A bit more information, msg.get("subject", "unknown") does return a
> > > > > string, as follows:-
> > > > >
> > > > > Subject: =?utf-8?Q?aka_Marne_=C3=A0_la_Sa=C3=B4ne_(Waterways_Continental_Europe)?=
> > [...]
> > > > ... and of course I now see the issue! The Subject: with utf-8
> > > > characters in it gets spaces changed to underscores. So searching for
> > > > '(Waterways Continental Europe)' fails.
> > > >
> > > > I'll either need to test for both versions of the string or I'll need
> > > > to change underscores to spaces in the Subject: returned by msg.get().
[...]
> > >
> > > subj = email.header.decode_header(raw_subj)[0]
> > >
> > > subj[0].decode(subj[1])
[...]
> > email.header.decode_header returns a *list* of chunks and you have to
> > process and concatenate all of them.
> >
> > Here is a snippet from a mail to html converter I wrote a few years ago:
> >
> > def decode_rfc2047(s):
> > if s is None:
> > return None
> > r = ""
> > for chunk in email.header.decode_header(s):
[...]
> > r += chunk[0].decode(chunk[1])
[...]
> > return r
[...]
> >
> > I do have to say that Python is extraordinarily clumsy in this regard.
>
> Thanks for the reply. In fact, I gave that answer because I did
> not understand what the OP wanted to achieve. In addition, the
> OP opened a second thread on the similar topic in which I gave a
> more correct answer (subject: "What do these '=?utf-8?' sequences
> mean in python?", date: "Sat, 6 May 2023 14:50:40 UTC").

Right. I saw that after writing my reply. I should have read all
messages, not just that thread before replying.

> the OP, I discovered that the MAME is not the only format used
> to compose the subject.

Not sure what "MAME" is. If it's a typo for MIME, then the base64
variant of RFC 2047 is just as much a part of it as the quoted-printable
variant.

> This made me think that a library could not delegate to the programmer
> the burden of managing all these exceptions,

email.header.decode_header handles both variants, but it produces bytes
sequences which still have to be decoded to get a Python string.
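
A small sketch of what that looks like in practice (the variable names
are illustrative only). Note that a header without any encoded words
comes back as a str rather than bytes, which is why a hand-written
loop has to check the type:

```python
from email.header import decode_header

# No encoded words: the value is returned unchanged, as a str.
plain = decode_header("Waterways Continental Europe")

# One encoded word: the data is bytes, paired with the charset name.
encoded = decode_header("=?utf-8?Q?Sa=C3=B4ne?=")

print(plain)
print(encoded)

# The bytes still need an explicit decode to become a Python string.
data, charset = encoded[0]
text = data.decode(charset)
print(text)
```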


> then I have further investigated to discover that the library also
> provides the conversion function beyond that of coding and this makes
> our labors vain:
>
> ----------
> from email.header import decode_header, make_header
>
> subject = make_header(decode_header(raw_subject))
> ----------

Yup. I somehow missed that. That's a lot more convenient than calling
decode in a loop (or generator expression). Depending on what you want
to do with the subject you may have to wrap that in a call to str(), but
it's still a one-liner.
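
Applied to the OP's filter, this might look roughly like the sketch
below. The function name subject_matches and the sample message are
made up for illustration; in the real filter the raw bytes would come
from sys.stdin.buffer.read() as in the original code.

```python
import mailbox
from email.header import decode_header, make_header

SEARCH_TXT = "Waterways Continental Europe"


def subject_matches(raw_message, search_txt=SEARCH_TXT):
    """Return True if the decoded Subject: contains search_txt."""
    msg = mailbox.MaildirMessage(raw_message)
    raw_subject = str(msg.get("subject", "unknown"))
    # Decode any RFC 2047 encoded words before searching.
    subject = str(make_header(decode_header(raw_subject)))
    return search_txt in subject


# Hypothetical message carrying the subject from this thread.
sample = (
    b"From: list@example.org\r\n"
    b"Subject: =?utf-8?Q?aka_Marne_=C3=A0_la_Sa=C3=B4ne_"
    b"(Waterways_Continental_Europe)?=\r\n"
    b"\r\n"
    b"body\r\n"
)

print(subject_matches(sample))
```

Here str() is needed because the 'in' test wants a plain string, not
the Header object that make_header returns.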

hp

--
_ | Peter J. Holzer | Story must make more sense than reality.
|_|_) | |
| | | hjp@hjp.at | -- Charles Stross, "Creative writing
__/ | http://www.hjp.at/ | challenge!"