Tom Lynn <tlynn@users.sourceforge.net> added the comment:
The only difference between the two regexps is that the email/header.py
version looks for::
    (?=[ \t]|$)   # whitespace or the end of the string

at the end (with re.MULTILINE, so $ also matches just before a '\n').
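To make the effect of that lookahead concrete, here is a minimal sketch. The two patterns are simplified stand-ins that I wrote for illustration (the names CORE, loose and strict are hypothetical); they are not copied from either module, and only the trailing lookahead is taken from the text above.

```python
import re

# Simplified encoded-word pattern (an assumption for illustration,
# not the actual source of either regexp being compared).
CORE = r'=\?[^?]*?\?[qQbB]\?.*?\?='

# Variant without the trailing lookahead.
loose = re.compile(CORE)

# Variant with the lookahead quoted above: the encoded-word must be
# followed by a space, a tab, or the end of a line.
strict = re.compile(CORE + r'(?=[ \t]|$)', re.MULTILINE)

h = '=?utf-8?q?=E2=98=BA?=(unicode white smiling face)'

loose_match = loose.search(h)      # finds the encoded-word
strict_match = strict.search(h)    # rejected: next character is '('
spaced_match = strict.search(h.replace('(', ' (', 1))  # accepted

print(loose_match.group(0))
print(strict_match)
print(spaced_match.group(0))
```

With the lookahead, the encoded-word is only recognised when something whitespace-like follows it, which is the behaviour questioned in the rest of this comment.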
To expand on "There is nothing about that thing in RFC 2047", RFC 2047 says::

    IMPORTANT: 'encoded-word's are designed to be recognized as 'atom's
    by an RFC 822 parser.

RFC 822 says::

    atom        =  1*<any CHAR except specials, SPACE and CTLs>
    ...
    specials    =  "(" / ")" / "<" / ">" / "@"  ; Must be in quoted-
                /  "," / ";" / ":" / "\" / <">  ;  string, to use
                /  "." / "[" / "]"              ;  within a word.

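As a rough illustration of that grammar, the sketch below builds a character class for 'atom' from the RFC 822 definition quoted above and shows where the atom ends in the problem header. The names SPECIALS, ATOM_CHARS, ATOM and leading_atom are hypothetical helpers of mine, not part of the email package, and the sketch ignores comments, quoted strings and folding.

```python
import re

# The RFC 822 'specials' listed above.
SPECIALS = '()<>@,;:\\".[]'

# atom = 1*<any CHAR except specials, SPACE and CTLs>
# i.e. printable ASCII (0x21-0x7e) minus the specials.
ATOM_CHARS = ''.join(
    chr(c) for c in range(0x21, 0x7f) if chr(c) not in SPECIALS
)
ATOM = re.compile('[' + re.escape(ATOM_CHARS) + ']+')


def leading_atom(text):
    """Return the atom at the start of text, or '' if there is none."""
    m = ATOM.match(text)
    return m.group(0) if m else ''


h = '=?utf-8?q?=E2=98=BA?=(unicode white smiling face)'
atom = leading_atom(h)
rest = h[len(atom):]

print(atom)   # the encoded-word on its own
print(rest)   # starts at '(', which is a special and so ends the atom
```

Since '(' is a special, an RFC 822 parser would end the atom right after the closing '?=', without needing any whitespace.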
So an example of mis-parsing is::

    >>> import email.header
    >>> h = '=?utf-8?q?=E2=98=BA?=(unicode white smiling face)'
    >>> email.header.decode_header(h)
    [('=?utf-8?q?=E2=98=BA?=(unicode white smiling face)', None)]

The correct result would be::

    >>> email.header.decode_header(h)
    [('\xe2\x98\xba', 'utf-8'), ('(unicode white smiling face)', None)]

which is what you get if you insert a space before the '(' in h.
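
For comparison, here is a minimal sketch of a splitter that does not require whitespace after the encoded-word. It is not the email.header implementation and not a proposed patch: ENCODED_WORD and decode_parts are hypothetical names, the pattern is simplified, and adjacent encoded-words, charset aliases and malformed input are not handled. It assumes Python 3, so decoded parts are bytes.

```python
import binascii
import re

# Simplified encoded-word pattern without the trailing lookahead
# (an assumption for illustration only).
ENCODED_WORD = re.compile(r'=\?([^?]*?)\?([qQbB])\?(.*?)\?=')


def decode_parts(header):
    """Split header into (value, charset) pairs, decoding encoded-words."""
    parts = []
    pos = 0
    for m in ENCODED_WORD.finditer(header):
        # Plain text between the previous match and this one.
        if m.start() > pos:
            parts.append((header[pos:m.start()], None))
        charset, encoding, payload = m.groups()
        data = payload.encode('ascii')
        if encoding.lower() == 'q':
            # header=True also turns '_' into a space, as the Q encoding requires.
            raw = binascii.a2b_qp(data, header=True)
        else:
            raw = binascii.a2b_base64(data)
        parts.append((raw, charset.lower()))
        pos = m.end()
    # Trailing plain text after the last encoded-word.
    if pos < len(header):
        parts.append((header[pos:], None))
    return parts


h = '=?utf-8?q?=E2=98=BA?=(unicode white smiling face)'
parts = decode_parts(h)
print(parts)
print(parts[0][0].decode(parts[0][1]))  # the decoded character itself
```

On this input the sketch yields the same two-part split as the result given above, with the first part as a bytes object.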
----------
nosy: +tlynn
_______________________________________
Python tracker <report@bugs.python.org>
<http://bugs.python.org/issue1079>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/list-python-bugs%40lists.gossamer-threads.com