Mailing List Archive

What do these '=?utf-8?' sequences mean in python?
I'm having a real hard time trying to do anything to a string (?)
returned by mailbox.MaildirMessage.get().

I'm extracting the Subject: header from a message and, if I write what
it returns to a log file using the python logging module what I see
in the log file (when the Subject: has non-ASCII characters in it) is:-

=?utf-8?Q?aka_Marne_=C3=A0_la_Sa=C3=B4ne_(Waterways_Continental_Europe)?=

Whatever I try I am unable to change the underscore characters in the
above string back to spaces.


So, what do those =?utf-8? and ?= sequences mean? Are they part of
the string or are they wrapped around the string on output as a way to
show that it's utf-8 encoded?

If I have the string in a variable how do I replace the underscores
with spaces? Simply doing "subject.replace('_', ' ')" doesn't work,
nothing happens at all.

All I really want to do is throw the non-ASCII characters away as the
string I'm trying to match in the subject is guaranteed to be ASCII.

--
Chris Green
--
https://mail.python.org/mailman/listinfo/python-list
Re: What do these '=?utf-8?' sequences mean in python?
On Sat, 6 May 2023 14:50:40 +0100, Chris Green <cl@isbd.net> wrote:
[snip]
> So, what do those =?utf-8? and ?= sequences mean? Are they part of
> the string or are they wrapped around the string on output as a way to
> show that it's utf-8 encoded?

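They are part of the string: "=?utf-8?Q?...?=" is MIME header
encoding (an RFC 2047 "encoded-word"), where "utf-8" names the
charset and "Q" the encoding, in which "_" stands for a space.

I've only blundered about briefly in this area, but I think you
need to make sure that all header values you work with have been
decoded to ordinary Python strings before proceeding.
Here's the code that seemed to work for me: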

import email.header

def mime_decode_single(pair):
    """Decode a single (bytestring, charset) pair.
    """
    b, charset = pair
    # decode_header() yields str when the header has no encoded-words,
    # otherwise bytes; fall back to UTF-8 when no charset is given.
    result = b if isinstance(b, str) else b.decode(
        charset if charset else "utf-8")
    return result

def mime_decode(s):
    """Decode a MIME-header-encoded character string.
    """
    decoded_pairs = email.header.decode_header(s)
    return "".join(mime_decode_single(d) for d in decoded_pairs)
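
For illustration, here is a minimal sketch applying the same
decode_header() approach to the Subject: value from the original
question. It assumes Python 3; the helper name decode_subject and the
variable raw are hypothetical and only there to keep the example
self-contained.

```python
import email.header

def decode_subject(value):
    """Hypothetical helper: decode an RFC 2047 encoded header value to str."""
    parts = []
    for chunk, charset in email.header.decode_header(value):
        if isinstance(chunk, str):
            # No encoded-words were found, so the value is already text.
            parts.append(chunk)
        else:
            # Encoded chunks arrive as bytes; assume UTF-8 if no charset is given.
            parts.append(chunk.decode(charset or "utf-8"))
    return "".join(parts)

# The Subject: value quoted in the original question.
raw = "=?utf-8?Q?aka_Marne_=C3=A0_la_Sa=C3=B4ne_(Waterways_Continental_Europe)?="
subject = decode_subject(raw)
print(subject)
```

The underscores come back as spaces as part of the Q decoding itself,
so no separate replace() step should be needed after this.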



--
To email me, substitute nowhere->runbox, invalid->com.
Re: What do these '=?utf-8?' sequences mean in python?
Peter Pearson wrote:
> On Sat, 6 May 2023 14:50:40 +0100, Chris Green <cl@isbd.net> wrote:
> [snip]
>> So, what do those =?utf-8? and ?= sequences mean? Are they part of
>> the string or are they wrapped around the string on output as a way to
>> show that it's utf-8 encoded?
>
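> They are part of the string: "=?utf-8?Q?...?=" is MIME header
> encoding (an RFC 2047 "encoded-word"), where "utf-8" names the
> charset and "Q" the encoding, in which "_" stands for a space.
>
> I've only blundered about briefly in this area, but I think you
> need to make sure that all header values you work with have been
> decoded to ordinary Python strings before proceeding.
> Here's the code that seemed to work for me: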
>
> def mime_decode_single(pair):
>     """Decode a single (bytestring, charset) pair.
>     """
>     b, charset = pair
>     result = b if isinstance(b, str) else b.decode(
>         charset if charset else "utf-8")
>     return result
>
> def mime_decode(s):
>     """Decode a MIME-header-encoded character string.
>     """
>     decoded_pairs = email.header.decode_header(s)
>     return "".join(mime_decode_single(d) for d in decoded_pairs)

Hi,
You could also use make_header:

from email.header import decode_header, make_header

print(make_header(decode_header(subject)))
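
A sketch of how the make_header() route could be combined with the
original goal of throwing the non-ASCII characters away. It assumes
Python 3; the variable names are made up for illustration, and encoding
with errors="ignore" is only one possible way to drop those characters.

```python
from email.header import decode_header, make_header

# The Subject: value quoted in the original question.
raw = "=?utf-8?Q?aka_Marne_=C3=A0_la_Sa=C3=B4ne_(Waterways_Continental_Europe)?="

# make_header() returns a Header object; str() gives the decoded text.
decoded = str(make_header(decode_header(raw)))

# Drop every character that cannot be represented in ASCII.
ascii_only = decoded.encode("ascii", errors="ignore").decode("ascii")

# str methods return new strings rather than modifying in place,
# so results have to be assigned (or used directly, as here).
matches = "Waterways Continental Europe" in ascii_only

print(decoded)
print(ascii_only)
print(matches)
```

Dropping characters leaves gaps such as a doubled space where the
accented letter was, so matching on an ASCII-only substring, as in the
last step, is probably safer than comparing whole subjects.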
