Mailing List Archive

[Bug 3191] Word boundaries are lost after HTML processing
http://bugzilla.spamassassin.org/show_bug.cgi?id=3191





------- Additional Comments From michaelb@opentext.com 2004-03-18 11:19 -------
Created an attachment (id=1858)
--> (http://bugzilla.spamassassin.org/attachment.cgi?id=1858&action=view)
Example message that causes problems.




------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.





------- Additional Comments From kcivey@cpcug.org 2004-03-18 11:31 -------
That's nothing SpamAssassin-specific. It has to do with the way '\b' is defined
and what your locale is set to. Presumably you're using a locale in which 'á' is
not a word character, so '\b' won't match after it (unless the following
character is a word character). You need something more complicated than a
simple '\b'. Even if you adjusted your locale, you'd run into similar problems
with obfuscation techniques that use characters like '@' and '|' that aren't
word characters in any locale.
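A minimal sketch of the point above, in Python rather than the Perl that SpamAssassin rules use. It assumes `re.ASCII` is a fair stand-in for a locale in which accented letters are not word characters, and Python's default Unicode matching for one in which they are. The sample strings and variable names are made up for illustration.

```python
import re

# Hypothetical sample strings; "foo" stands in for the word a rule targets.
accented = "buy foó now"
symbol = "buy fo@ now"

# re.ASCII limits \w to [A-Za-z0-9_], so "ó" counts as a non-word character
# and the trailing \b cannot match between "ó" and the space.
ascii_hit = re.search(r"\bfo[oó]\b", accented, re.ASCII)

# Python's default for str patterns is Unicode-aware: "ó" is a word
# character, so the trailing \b matches before the space.
unicode_hit = re.search(r"\bfo[oó]\b", accented)

# "@" is a non-word character in both modes, so the trailing \b never
# matches between "@" and the space.
symbol_ascii = re.search(r"\bfo[o@]\b", symbol, re.ASCII)
symbol_unicode = re.search(r"\bfo[o@]\b", symbol)

print(ascii_hit, unicode_hit, symbol_ascii, symbol_unicode)
```

Only `unicode_hit` is a match object; the other three are `None`, which is why changing the locale helps with accents but not with '@' or '|'.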








------- Additional Comments From cmt-spamassassin@someone.dhs.org 2004-03-18 11:40 -------
This one is easy to explain: a word boundary is the position between a word
char [\w] and a non-word char [\W], in either order, so \w\W or \W\w (the start
and end of the string count as non-word). An accented character is NOT part of
the \w class here, therefore "á " is "\W\W" and doesn't count as "\w\W".

I worked around this in my obfu rule generator (http://sandgnat.com/cmos/) by
using an "or grouping" when matching word boundaries. See how the rules
generated by http://sandgnat.com/cmos/cmos.jsp?words=foo (which is based on the
regexp /\bfoo\b/) have this pattern embedded at the very end:

(?:[o0]\b|(?:[\*\xB0\xBA\xD8\xF8\xD2-\xD6\xF2-\xF6]|\(\)|\[\]|\xC5[\x8C-\x91]|\xC6[\xA0-\xA1]|\xC7[\x91-\x92]|\xC7[\xBE-\xBF]|\xCE\x8C|\xCE\x98|\xCE\x9F|\xCE\xB8|\xCE\xBF|\xCF\x8C|\xD0\x9E|\xD0\xBE|\xD5\x95)\B)

That regexp snippet matches the "o\b" part of /\bfoo\b/, including all the
accented versions and also multi-byte characters (which HTML::Entities
generates from &xxx; entities). By default my script prints the escaped
versions, such as "\xB0", rather than the literal accented values, because some
browsers have issues with copying and pasting them.
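A cut-down illustration of the "or grouping" idea, written in Python on byte strings (where \w is ASCII-only by default). The pattern below keeps only three of the alternatives from the generated rule and does not obfuscate the leading "fo", so it is a sketch of the technique rather than the generator's actual output; `FOO` and `hits` are names invented for this example.

```python
import re

# Plain "o"/"0" must be followed by \b. The obfuscated forms are non-word
# bytes, so they must be followed by \B instead (non-word next to non-word,
# or next to the end of the string).
FOO = re.compile(
    rb"\bfo(?:[o0]\b|(?:[\xF2-\xF6\xF8]|\(\)|\xC5[\x8C-\x91])\B)"
)

def hits(data):
    # True if the simplified rule matches anywhere in the byte string.
    return FOO.search(data) is not None

print(hits(b"buy foo now"))                      # plain word
print(hits("buy foó now".encode("latin-1")))     # single-byte accented o
print(hits("buy foő now".encode("utf-8")))       # two-byte sequence \xC5\x91
print(hits(b"buy fo() now"))                     # "()" standing in for "o"
print(hits(b"buy food now"))                     # longer word, should not hit
```

The first four calls return `True` and the last returns `False`: the \B branch restores the "end of word" check for the variants that a bare \b would miss.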




michaelb@opentext.com changed:

      What|Removed                     |Added
----------------------------------------------------------------------------
    Status|NEW                         |RESOLVED
Resolution|                            |INVALID



------- Additional Comments From michaelb@opentext.com 2004-03-18 14:48 -------
The error of my ways has been pointed out to me. :) Thanks.


