Mailing List Archive

[Bug 3449] New: Testing markup tags, HTML_FONT_LOW_CONTRAST not triggered due to bad HTML parsing
http://bugzilla.spamassassin.org/show_bug.cgi?id=3449

Summary: Testing markup tags, HTML_FONT_LOW_CONTRAST not
triggered due to bad HTML parsing
Product: Spamassassin
Version: 2.63
Platform: PC
URL: http://olo.ab.altkom.pl/domowa/spam/samples/low_contrast/
OS/Version: Linux
Status: NEW
Severity: normal
Priority: P5
Component: Libraries
AssignedTo: spamassassin-dev@incubator.apache.org
ReportedBy: aleksander.adamowski.spamassassin@altkom.pl


This bug is a follow-up to the discussion started back on February 18 on the
spamassassin-users mailing list ("Testing markup tags", "Semi-invisible font
missed by SA").

There was a consensus that something is definitely wrong with SpamAssassin's
HTML parsing when a spammer inserts excessive line breaks inside HTML FONT tags
between an attribute name ("color") and its value ("#FFFFsomething").

Back then, I published sample messages here:
http://olo.ab.altkom.pl/domowa/spam/samples/low_contrast/

The problem was that spammers use the following construct, aimed directly at
SpamAssassin's HTML analysis, to bypass the test
html_test('font_near_invisible') and thus avoid triggering the rule
HTML_FONT_LOW_CONTRAST:
<font color=

"#FFFFFB">some random text to fool Bayes</font>

The excessive line breaks between "color=" and "#FFFFFB" fool the parser into
not detecting the presence of that attribute.

I analysed the SpamAssassin 2.63 code back then, on 23 Feb, and discovered that
the SA code does indeed receive the string "color" instead of the "#"-prefixed
color code as the value of the "color" attribute.

Those messages keep coming and sometimes pass through SA without triggering
HTML_FONT_LOW_CONTRAST, so I'm currently using a custom rule to give them an
additional score:

rawbody LOC_HTMLSPLITFONT /^\"?\#[a-z0-9]{6}\"?\>/i
describe LOC_HTMLSPLITFONT font color on separate line from font tag
score LOC_HTMLSPLITFONT 2.1 1.6 2.1 1.6

But this rule has potential for false positives (FPs), so the ideal solution
would be to make SpamAssassin parse those tags correctly using HTML::Parser.
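
To make the behaviour of the workaround rule concrete, here is a minimal sketch
in Python (not SpamAssassin code). It translates the LOC_HTMLSPLITFONT pattern
and tests it against each line separately, which is an assumption about how the
rawbody rule is applied, suggested by the "^" anchor. The "ham" sample is a
hypothetical hand-wrapped page invented for illustration; it shows the kind of
false positive mentioned above.

```python
import re

# Python translation of the LOC_HTMLSPLITFONT pattern (/i becomes re.IGNORECASE).
SPLIT_FONT = re.compile(r'^"?#[a-z0-9]{6}"?>', re.IGNORECASE)


def matching_lines(body):
    """Return the lines of body that the pattern matches, testing each line on its own."""
    return [line for line in body.splitlines() if SPLIT_FONT.search(line)]


# The construct quoted in this report.
spam = '<font color=\n\n"#FFFFFB">some random text to fool Bayes</font>'

# Hypothetical legitimate HTML that was merely hard-wrapped after "bgcolor=".
ham = '<body bgcolor=\n"#ffffff">\n<p>Hello</p>'

print(matching_lines(spam))  # ['"#FFFFFB">some random text to fool Bayes</font>']
print(matching_lines(ham))   # ['"#ffffff">']
```

The pattern only looks at a line that starts with a color value and a closing
">", so it cannot tell a near-invisible font from any other wrapped color
attribute.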

I've made a test Perl script that parses HTML and outputs the attribute names
and values, and running it indicates that HTML::Parser works fine. You can see
the script and test data here:
http://olo.ab.altkom.pl/domowa/admin/spamassassin/

There are 4 files there:

My_prgivate_s_ge_x_life_is_now_available_to_you_unarmed_grotesques.eml
font_attribute_line_break_corrected.html
font_attribute_line_break_orig.html
parse_test.pl


The .eml file contains the message that passed through without triggering
HTML_FONT_LOW_CONTRAST.
The file parse_test.pl is the Perl script.
The 2 .html files contain the HTML code from the .eml message: the "_orig" one
contains the code unchanged, while the "_corrected" one has the excessive line
breaks removed.

Running parse_test.pl on both HTML files shows that HTML::Parser does its job
fine in both cases, so the problem must lie somewhere in the SpamAssassin code
that does the parsing using HTML::Parser. However, the SA code is a bit too
convoluted for me, so I'm asking its original author to have a look at it.
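
The kind of check described here can be approximated without the original
files. The sketch below is not parse_test.pl and does not use HTML::Parser; it
uses Python's standard html.parser module as a stand-in, and the class and
function names are made up for illustration. It feeds the split and the joined
form of the FONT tag to a tokenizer and lists the attributes reported for each.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Record (tag, attribute name, attribute value) for every start tag."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            self.seen.append((tag, name, value))


def attributes(html):
    """Parse html and return the attributes found on its start tags."""
    parser = AttrCollector()
    parser.feed(html)
    parser.close()
    return parser.seen


# Same markup as in this report: once with the line breaks, once without.
orig = '<font color=\n\n"#FFFFFB">some random text to fool Bayes</font>'
corrected = '<font color="#FFFFFB">some random text to fool Bayes</font>'

print(attributes(orig))
print(attributes(corrected))
```

If a tokenizer treats whitespace around "=" as insignificant, both inputs yield
the same single "color" attribute, which is the result a fixed SpamAssassin
would need to act on.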

SA needs to be fixed to trigger the HTML_FONT_LOW_CONTRAST rule when processing
the message
My_prgivate_s_ge_x_life_is_now_available_to_you_unarmed_grotesques.eml.



------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.