Mailing List Archive

Semi-invisible font missed by SA
A message containing a very bright font used to conceal Bayes poison got
through yesterday.

Spamassassin score was:

X-Spam-Checker-Version: SpamAssassin 2.60 (1.212-2003-09-23-exp) on
nmail.altkom.pl
X-Spam-Status: No, hits=1.9 required=5.0 tests=BAYES_50,
HTML_FONTCOLOR_UNKNOWN,HTML_MESSAGE,MIME_HTML_ONLY,SUBJECT_PHARMACY
autolearn=no version=2.60
X-Spam-Level: *

Relevant fragments:

--- Start sample ---
<font color=

#fee4e8> sleepwalk negotiate sonic, diluting democrat humpback,
encamping palaces MacDonald Hewlett brainstem crops cautions chartering
discharged chronicle disagree presided accordion. sentiments transplant
corpse defeat downright, immersed Boltzmann skulk beatitudes espouse
planks palmer compresses populace almsman bivouac tolerance. cookery
Ridgway scalded. ribbing mockery Oakley glover reopens
satellites.</font><br>
ONLY REAL SUPER VIAGDRA CALLED CIADLIS IS EFFECTIVE! Annual Sale: ONLY
$3 per dose<br>
--- End sample ---

another one:

--- Start sample ---
<font color=

#ebeff4>whitely chord cowing gayety aviary, nostalgic glucose Hyannis
employ; subdued movements mischief smartly intonation reserved distaff
standoff terrifies. heavily acquirable beach adulthood invertible,
traversing vacuo enraged Dobbin Avogadro Agnes Bruno enfeeble credible
notorious carelessly octaves. negotiate makeup SIMULA. sagebrush
imaginably heiressesfalcons.</font><br>
--- End sample ---

Anybody got a rule for this type of stuff?

BTW I've refined the rule that catches invisible font sizes to include 0
and 1 pixel/point fontsizes:

rawbody LOCAL_ZERO_FONTSIZE /\bfont-size\: ?[01]p[xt]\b/i

Please check if it hits your ham.
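
A minimal sketch, in Python rather than SA's Perl, of how the regex in
the rule above could be tried against a few lines before deploying it.
The sample lines are hypothetical, not taken from real mail, and the
only change to the pattern is dropping the unneeded backslash before
the colon.

```python
import re

# Python transcription of the rawbody rule's regex; ":" needs no escaping.
ZERO_FONTSIZE = re.compile(r'\bfont-size: ?[01]p[xt]\b', re.IGNORECASE)

def hits(line):
    """Return True if the rule's regex would match this raw body line."""
    return ZERO_FONTSIZE.search(line) is not None

# Hypothetical sample lines for illustration only.
samples = [
    '<span style="font-size:0px">hidden</span>',
    '<span style="FONT-SIZE: 1pt">hidden</span>',
    '<p style="font-size: 10px">small but legible</p>',
    '<p style="font-size:12pt">normal</p>',
]

for line in samples:
    print(hits(line), line)
```

The first two samples match and the last two do not, because the
character after the 0 or 1 has to be "p". Legitimate mail that uses
1px text for spacer elements would still hit, which is the ham check
asked for above.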

--
Best Regards,
Aleksander Adamowski
GG#: 274614
ICQ UIN: 19780575
http://olo.ab.altkom.pl
Re: [spa] Semi-invisible font missed by SA
On Wed, 18 Feb 2004, Aleksander Adamowski wrote:
> <font color=
>
> #fee4e8> sleepwalk negotiate sonic, diluting democrat humpback,

I added a check for the dangling color spec....

rawbody LOC_HTMLSPLITFONT /^\#[a-z0-9]{6}\>/i
describe LOC_HTMLSPLITFONT font color on separate line from font tag
score LOC_HTMLSPLITFONT 0.7

Can't score it too high because of potential FPs, but it helps.
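
A minimal sketch, in Python rather than Perl, of what this pattern
matches when it is applied line by line, as rawbody rules are. The
"ham" string is a hypothetical example of hard-wrapped legitimate
HTML, included to show where a false positive could come from.

```python
import re

# Same pattern as LOC_HTMLSPLITFONT; "#" and ">" need no escaping here.
SPLIT_FONT = re.compile(r'^#[a-z0-9]{6}>', re.IGNORECASE)

def split_font_lines(raw_body):
    """Return the lines that start with a dangling '#xxxxxx>' value."""
    return [ln for ln in raw_body.splitlines() if SPLIT_FONT.search(ln)]

spam = '<font color=\n\n#fee4e8> sleepwalk negotiate sonic,\n</font><br>'
ham = '<td bgcolor=\n#ffffff>Invoice</td>'      # hypothetical wrapped ham
normal = '<font color="#fee4e8">text</font>'

print(split_font_lines(spam))    # the dangling colour line
print(split_font_lines(ham))     # also hits: possible false positive
print(split_font_lines(normal))  # no hit when the tag is on one line
```

Any mailer that wraps a tag right after the "=" would trigger the
rule, which is one reason to keep the score low.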

- C
Re: [spa] Semi-invisible font missed by SA
Charles Gregory wrote:

>On Wed, 18 Feb 2004, Aleksander Adamowski wrote:
>
>
>><font color=
>>
>>#fee4e8> sleepwalk negotiate sonic, diluting democrat humpback,
>>
>>
>
>I added a check for the dangling color spec....
>
>rawbody LOC_HTMLSPLITFONT /^\#[a-z0-9]{6}\>/i
>describe LOC_HTMLSPLITFONT font color on separate line from font tag
>score LOC_HTMLSPLITFONT 0.7
>
>Can't score it too high because of potential FPs, but it helps.
>

What's intriguing is that the standard SA 2.60 test
HTML_FONT_LOW_CONTRAST didn't catch it. I've tried raising the distance
threshold in HTML.pm (line 350) to even absurdly high values (over 80)
and it still didn't match!

Is it possible that the dangling value causes HTML.pm tests to miss this
tag?
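
A rough sketch of how far the two sample colours are from a white
background. Both the white background and the plain Euclidean RGB
distance are assumptions made for illustration; the metric HTML.pm
actually uses may differ.

```python
import math

def hex_to_rgb(code):
    """Convert '#rrggbb' to an (r, g, b) tuple of integers."""
    code = code.lstrip('#')
    return tuple(int(code[i:i + 2], 16) for i in (0, 2, 4))

def rgb_distance(fg, bg='#ffffff'):
    """Euclidean distance between two colours (assumed metric)."""
    return math.sqrt(sum((a - b) ** 2
                         for a, b in zip(hex_to_rgb(fg), hex_to_rgb(bg))))

for colour in ('#fee4e8', '#ebeff4'):
    print(colour, round(rgb_distance(colour), 2))
```

Under these assumptions both colours come out well below 80, so a
threshold that high would be expected to fire if the colour value
reached the test at all.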

--
Best Regards,
Aleksander Adamowski
Re: [spa] Semi-invisible font missed by SA
Here's the message in question, it's interesting in general as it
received an unusually low SA score.

--
Best Regards,
Aleksander Adamowski
Re: [spa] Re: [spa] Semi-invisible font missed by SA
On Thu, 19 Feb 2004, Aleksander Adamowski wrote:
> What's intriguing is that the standard SA 2.60 test
> HTML_FONT_LOW_CONTRAST didn't catch it. I've tried raising distance
> threshold in HTML.pm (line 350) to even absurdly high values (over 80)
> and it still didn't match!
> Is it possible that the dangling value causes HTML.pm tests to miss this
> tag?

That would be my guess. We are now fully into the part of the 'game' where
the spammers get hold of spamassassin and run their spew through it
*before* trying to mail it, so that they can try 'tricks' like these, and
keep trying different ones until something works.

- Charles
Re: [spa] Re: [spa] Semi-invisible font missed by SA
On Thu, 19 Feb 2004, Aleksander Adamowski wrote:
> Here's the message in question, it's interesting in general as it
> received an unusually low SA score.

Sorry, you sent it as an attachment. We don't do those. (smile)
Please post plain text, where possible....

- C
Re: [spa] Re: [spa] Semi-invisible font missed by SA
Raquel Rice wrote:

>On Thu, 19 Feb 2004 13:46:25 -0500 (EST)
>Charles Gregory <cgregory@hwcn.org> wrote:
>
>
>The problem with your theory, is that your bayes hasn't been trained
>the way mine has, nor has mine been trained the way that Matt's has.
> The likelihood of any given spam getting past two of us, let alone
>all three of us, is very slim indeed.
>

Unfortunately the sample spam I've sent is quite good at defeating bayes
with its poison and hiding the poison from both Spamassassin rules and
the eye of the recipient.
Re: [spa] Re: [spa] Semi-invisible font missed by SA
Aleksander Adamowski <aleksander.adamowski.spamassassin@altkom.pl> wrote:
> Raquel Rice wrote:

> >The problem with your theory, is that your bayes hasn't been trained
> >the way mine has, nor has mine been trained the way that Matt's has.
> > The likelihood of any given spam getting past two of us, let alone
> >all three of us, is very slim indeed.
>
> Unfortunately the sample spam I've sent is quite good at defeating bayes
> with its poison and hiding the poison from both Spamassassin rules and
> the eye of the recipient.

Could you post the debug output from spamassassin for one of
these messages? I'm very curious to see why you think the
poison is defeating Bayes. It's certainly possible that every
once in a while a spammer will randomly hit on a word that's a
good nonspam indicator for you, but I don't believe it can
happen for any substantial fraction of messages.

The only SA change the message you posted seems to suggest is
a modification of the rule for catching low-contrast font
color, which has nothing to do with Bayes. Looking at the
spam, it got BAYES_50, so the "poison" didn't affect Bayes at
all. It had no strong spam or nonspam indicators even without
the added words.

--
Keith C. Ivey <kcivey@cpcug.org>
Washington, DC
Re: [spa] Re: [spa] Semi-invisible font missed by SA
Keith C. Ivey wrote:

>Could you post the debug output from spamassassin for one of
>these messages? I'm very curious to see why you think the
>poison is defeating Bayes. It's certainly possible that every
>once in a while a spammer will randomly hit on a word that's a
>good nonspam indicator for you, but I don't believe it can
>happen for any substantial fraction of messages.

The output you asked for is posted at the end. I've already trained my
Bayes with 2 messages similar to this one (layout is identical, but the
URL is in a different domain in each one, and of course, completely
different set of random poison words).

>The only SA change the message you posted seems to suggest is
>a modification of the rule for catching low-contrast font
>color, which has nothing to do with Bayes. Looking at the
>spam, it got BAYES_50, so the "poison" didn't affect Bayes at
>all. It had no strong spam or nonspam indicators even without
>the added words.

Understood, what I wanted to say is that Bayes isn't effective against
this sort of stuff and currently the other SA mechanisms aren't
sufficient to catch this spam.

This is mainly because HTML.pm can be fooled by dangling attributes.
Ideally, the HTML parser should parse HTML the same way popular browsers
(IE, Mozilla) do. Unfortunately I cannot fix this in HTML.pm myself; the
code is a bit too convoluted for me. I think the help of the original
author of HTML.pm is needed here.

--- BEGIN OUTPUT ---
debug: Score set 0 chosen.
debug: running in taint mode? yes
debug: Running in taint mode, removing unsafe env vars, and resetting PATH
debug: PATH included '/usr/kerberos/sbin', keeping.
debug: PATH included '/usr/kerberos/bin', keeping.
debug: PATH included '/usr/lib/courier/bin', keeping.
debug: PATH included '/usr/lib/courier/sbin', keeping.
debug: PATH included '/usr/local/sbin', keeping.
debug: PATH included '/usr/local/bin', keeping.
debug: PATH included '/sbin', keeping.
debug: PATH included '/bin', keeping.
debug: PATH included '/usr/sbin', keeping.
debug: PATH included '/usr/bin', keeping.
debug: PATH included '/usr/X11R6/bin', keeping.
debug: PATH included '/root/bin', keeping.
debug: Final PATH set to:
/usr/kerberos/sbin:/usr/kerberos/bin:/usr/lib/courier/bin:/usr/lib/courier/sbin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/X11R6/bin:/root/bin
debug: using "/usr/share/spamassassin" for default rules dir
debug: using "/etc/mail/spamassassin" for site rules dir
debug: using "/root/.spamassassin" for user state dir
debug: using "/root/.spamassassin/user_prefs" for user prefs file
debug: bayes: 28543 tie-ing to DB file R/O /etc/mail/spamassassin/bayes_toks
debug: bayes: 28543 tie-ing to DB file R/O /etc/mail/spamassassin/bayes_seen
debug: bayes: found bayes db version 2
debug: Score set 3 chosen.
debug: Initialising learner
debug: is Net::DNS::Resolver available? yes
debug: trying (3) microsoft.com...
debug: looking up MX for 'microsoft.com'
debug: MX for 'microsoft.com' exists? 1
debug: MX lookup of microsoft.com succeeded => Dns available (set
dns_available to hardcode)
debug: is DNS available? 1
debug: all '*From' addrs: ninawrithed@beerbloat.com
debug: running header regexp tests; score so far=0
debug: running body-text per-line regexp tests; score so far=0.5
debug: bayes corpus size: nspam = 1819, nham = 6265
debug: uri tests: Done uriRE
debug: tokenize: header tokens for *p = "U*ninawrithed D*beerbloat.com
D*com"
debug: tokenize: header tokens for *M = " qsnlk 636881hohclfgayg
Kmorphynbkzderzfc com "
debug: tokenize: header tokens for *F = "U*ninawrithed D*beerbloat.com
D*com"
debug: tokenize: header tokens for To = "U*olo D*altkom.com.pl D*com.pl
D*pl"
debug: tokenize: header tokens for Mime-Version = "1.0"
debug: tokenize: header tokens for *c = "/html; charset=iso-8859-1"
debug: tokenize: header tokens for Content-Transfer-Encoding = "7bit"
debug: tokenize: header tokens for X-Mime-Autoconverted = "from 8bit to
7bit by courier 0.44"
debug: tokenize: header tokens for *r = " olo ([::ffff:202.196.220]) by
nmail.altkom.pl esmtp; "
debug: bayes token 'H*c:html' => 0.997358361790176
debug: bayes token 'disagree' => 0.00297237569060773
debug: bayes token 'Sale' => 0.996940397350993
debug: bayes token 'Bruno' => 0.00410687022900763
debug: bayes token 'H*r:olo' => 0.993492957746479
debug: bayes token 'beach' => 0.993492957746479
debug: bayes token 'CALLED' => 0.990941176470588
debug: bayes token 'adulthood' => 0.985096774193548
debug: bayes token 'CheapPharmacy' => 0.978
debug: bayes token 'CIADLIS' => 0.978
debug: bayes token 'VIAGDRA' => 0.978
debug: bayes token 'EFFECTIVE!' => 0.978
debug: bayes token 'tolerance' => 0.0256190476190476
debug: bayes token 'carelessly' => 0.0256190476190476
debug: bayes token 'URI' => 0.96844194358858
debug: bayes token 'REAL' => 0.958964997782887
debug: bayes token 'movements' => 0.958
debug: bayes token 'makeup' => 0.958
debug: bayes token 'chord' => 0.958
debug: bayes token 'downright' => 0.958
debug: bayes token 'sagebrush' => 0.958
debug: bayes token 'corpse' => 0.958
debug: bayes token 'aviary' => 0.958
debug: bayes token 'HTo:U*olo' => 0.95257244243949
debug: bayes token 'Hewlett' => 0.0489090909090909
debug: bayes token 'reopens' => 0.0489090909090909
debug: bayes token 'chronicle' => 0.0489090909090909
debug: bayes token 'cautions' => 0.0489090909090909
debug: bayes token 'discharged' => 0.0489090909090909
debug: bayes token 'compresses' => 0.0489090909090909
debug: bayes token 'notorious' => 0.0489090909090909
debug: bayes token 'defeat' => 0.0489090909090909
debug: bayes token 'smartly' => 0.0489090909090909
debug: bayes token 'credible' => 0.947986086684282
debug: bayes token 'ONLY' => 0.942476065364982
debug: bayes token 'employ' => 0.929714918635859
debug: bayes token 'SUPER' => 0.928422130125509
debug: bayes token 'dose' => 0.916960992788963
debug: bayes token 'href' => 0.902075983318674
debug: bayes token 'HTo:D*altkom.com.pl' => 0.885057514092106
debug: bayes token 'HTo:D*com.pl' => 0.883537429955864
debug: bayes token 'H*r:ffff' => 0.876766493636693
debug: bayes token 'Annual' => 0.864494012282618
debug: bayes: score = 0.622580654529175
debug: bayes: 28543 untie-ing
debug: bayes: 28543 untie-ing db_toks
debug: bayes: 28543 untie-ing db_seen
debug: Razor2 is not available
debug: running raw-body-text per-line regexp tests; score so far=1.92
debug: running uri tests; score so far=2.62
debug: uri tests: Done uriRE
debug: running full-text regexp tests; score so far=2.62
debug: Current PATH is:
/usr/kerberos/sbin:/usr/kerberos/bin:/usr/lib/courier/bin:/usr/lib/courier/sbin:/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin:/usr/X11R6/bin:/root/bin
debug: Pyzor is not available: pyzor not found
debug: Razor2 is not available
debug: DCCifd is not available: no r/w dccifd socket found.
debug: DCC is not available: no executable dccproc found.
debug: all '*To' addrs: olo@altkom.com.pl
debug: DNS MX records found: 1
debug: RBL: success for 1 of 1 queries
debug: running meta tests; score so far=2.62
debug: is spam? score=4.212 required=5
tests=BAYES_60,HTML_FONTCOLOR_UNKNOWN,HTML_MESSAGE,LOC_HTMLSPLITFONT,MIME_HTML_ONLY,SUBJECT_PHARMACY
Delivered-To: olo@altkom.com.pl
Return-Path: <ninawrithed@beerbloat.com>
Received: from olo ([::ffff:202.196.220.93])
by nmail.altkom.pl with esmtp; Tue, 17 Feb 2004 10:21:19 +0100
Message-ID: <qsnlk.636881hohclfgayg@Kmorphynbkzderzfc.com>
From: "Kmorphy" <ninawrithed@beerbloat.com>
Date: Tue, 17 Feb 2004 17:21:40 +0800
To: olo@altkom.com.pl
Subject: upholders CheapPharmacy acoustics
Mime-Version: 1.0
Content-Type: text/html; charset=iso-8859-1
Content-Transfer-Encoding: 7bit
X-Mime-Autoconverted: from 8bit to 7bit by courier 0.44
X-Spam-Level: ****
X-Spam-Checker-Version: SpamAssassin 2.60 (1.212-2003-09-23-exp) on
nmail.altkom.pl
X-Spam-Status: No, hits=4.2 required=5.0 tests=BAYES_60,
HTML_FONTCOLOR_UNKNOWN,HTML_MESSAGE,LOC_HTMLSPLITFONT,MIME_HTML_ONLY,
SUBJECT_PHARMACY autolearn=no version=2.60

<html>
<font color=

#fee4e8> sleepwalk negotiate sonic, diluting democrat humpback,
encamping palaces MacDonald Hewlett brainstem crops cautions chartering
discharged chronicle disagree presided accordion. sentiments transplant
corpse defeat downright, immersed Boltzmann skulk beatitudes espouse
planks palmer compresses populace almsman bivouac tolerance. cookery
Ridgway scalded. ribbing mockery Oakley glover reopens
satellites.</font><br>
ONLY REAL SUPER VIAGDRA CALLED CIADLIS IS EFFECTIVE! Annual Sale: ONLY
$3 per dose<br>
<br>convening<br>
<br><a hrefredrawnhref=http://multilayer.com href=

"http://goandtakeit.com/sv/index.php?pid=expert">Website</a>
<br><br>
<font color=

#ebeff4>whitely chord cowing gayety aviary, nostalgic glucose Hyannis
employ; subdued movements mischief smartly intonation reserved distaff
standoff terrifies. heavily acquirable beach adulthood invertible,
traversing vacuo enraged Dobbin Avogadro Agnes Bruno enfeeble credible
notorious carelessly octaves. negotiate makeup SIMULA. sagebrush
imaginably heiressesfalcons.</font><br>
</html>

--- END OUTPUT ---




--
Best Regards,
Aleksander Adamowski
Re: [spa] Re: [spa] Semi-invisible font missed by SA
Aleksander Adamowski <aleksander.adamowski.spamassassin@altkom.pl> wrote:

> Understood, what I wanted to say is that Bayes isn't effective against
> this sort of stuff and currently the other SA mechanisms aren't
> sufficient to catch this spam.

My point was that the extra words have no effect one way or the
other on the Bayes classification. If they hadn't been there,
the message would still have slipped through, so it's not
appropriate to call the extra words "Bayes poison". People
talk about "Bayes poison" a lot, but I have yet to see an
example that actually affects Bayes.
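
A small sketch, in Python, of one way to look at this in the posted
debug output: split the per-token probabilities into those below and
above 0.5. Only a handful of lines from the output are included here,
the 0.5 cut-off is an assumption, and this does not reproduce how SA
combines the tokens into the final score.

```python
import re

TOKEN_RE = re.compile(r"debug: bayes token '(.+)' => ([0-9.]+)")

def split_tokens(debug_text, neutral=0.5):
    """Return (hammy, spammy) lists of (token, probability) pairs."""
    hammy, spammy = [], []
    for m in TOKEN_RE.finditer(debug_text):
        token, prob = m.group(1), float(m.group(2))
        (spammy if prob > neutral else hammy).append((token, prob))
    return hammy, spammy

# A subset of the token lines from the debug output in this thread.
excerpt = """\
debug: bayes token 'disagree' => 0.00297237569060773
debug: bayes token 'Bruno' => 0.00410687022900763
debug: bayes token 'tolerance' => 0.0256190476190476
debug: bayes token 'Hewlett' => 0.0489090909090909
debug: bayes token 'corpse' => 0.958
debug: bayes token 'aviary' => 0.958
debug: bayes token 'CIADLIS' => 0.978
debug: bayes token 'VIAGDRA' => 0.978
"""

hammy, spammy = split_tokens(excerpt)
print('hammy: ', [t for t, _ in hammy])
print('spammy:', [t for t, _ in spammy])
```

In this excerpt the hidden filler words land on both sides ("corpse"
and "aviary" score as spammy, "disagree" and "Bruno" as hammy), so
the filler does not pull in one direction only.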

> This is mainly because HTML.pm can be fooled by dangling attributes.

You've lost me there. What do "dangling attributes" have to do
with this case? HTML_FONTCOLOR_UNKNOWN was triggered, so the
COLOR attributes were seen. The problem is they weren't
recognized as being nearly invisible, so the problem seems to
be with the HTML_FONT_LOW_CONTRAST test, not with parsing.

--
Keith C. Ivey <kcivey@cpcug.org>
Washington, DC
Re: [spa] Re: [spa] Semi-invisible font missed by SA
Keith C. Ivey wrote:

>>This is mainly because HTML.pm can be fooled by dangling attributes.
>>
>>
>You've lost me there. What do "dangling attributes" have to do
>with this case? HTML_FONTCOLOR_UNKNOWN was triggered, so the
>COLOR attributes were seen. The problem is they weren't
>recognized as being nearly invisible, so the problem seems to
>be with the HTML_FONT_LOW_CONTRAST test, not with parsing.
>
That's interesting. I'll try to debug HTML.pm tomorrow using those test
messages. I have a suspicion that the html_font_invisible() function
isn't called at all...

--
Best Regards,
Aleksander Adamowski
Re: [spa] Re: [spa] Semi-invisible font missed by SA
Keith C. Ivey wrote:

>You've lost me there. What do "dangling attributes" have to do
>with this case? HTML_FONTCOLOR_UNKNOWN was triggered, so the
>COLOR attributes were seen. The problem is they weren't
>recognized as being nearly invisible, so the problem seems to
>be with the HTML_FONT_LOW_CONTRAST test, not with parsing.
>
Well, there's some problem, hard to tell if it's parsing or the test
itself, but here's what I've found out after adding some debugging calls
to HTML.pm:

1. The html_font_invisible() method gets called.
2. There's a problem with the arguments passed to it. The foreground
color is seen by Perl code as the string 'color', not the string
in the form of '#feefea'. This explains why the test
HTML_FONTCOLOR_UNKNOWN was triggered ('color' is not a known color
name or an HTML hex code), and why the HTML_FONT_LOW_CONTRAST test
failed.

More specifically, when there's a dangling attribute value like this in
HTML source:

<font color=

#feefea>

, then in the html_font_invisible() method the foreground color ($fg
variable) has the value of 'color' instead of hex code of the HTML color.

If I add a single space after the equality mark in the tag seen above,
html_font_invisible() receives correct data and $fg variable holds the
hex code of font color. So this (notice the space after equality mark):

<font color=

#feefea>

is processed correctly and the nearly invisible font is detected.

Moreover, after adding debugging calls to the method html_fgcolor()
(which extracts foreground color information from an HTML element), I can
see that the attribute "color" of this font tag already has the value
'color' instead of the hex code (which should be #feefea in my testcase),
so the problem lies deeper than the html_font_invisible() method.

This suggests a parsing problem somewhere, as far as I understood the
code... If I am correct in my suspicions that the Perl expression
"$attr->{color}" is an attribute of a HTML::Parser object, then the
problem is indeed in HTML parser code (correct me if I'm wrong).
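
A hypothetical illustration, in Python, of how an attribute parser
could end up with 'color' as the value. This is not HTML.pm's or
HTML::Parser's code. It assumes the convention that an attribute
with no value reports its own name as the value, which is
HTML::Parser's documented default for boolean attributes and would be
consistent with what is seen here. NAIVE_ATTR and TOLERANT_ATTR are
made-up patterns for the comparison.

```python
import re

# Value must follow "=" immediately; whitespace after "=" loses the value.
NAIVE_ATTR = re.compile(r'([a-zA-Z][\w-]*)(?:=([^\s>]+))?')

# Whitespace, including newlines, is allowed on either side of "=".
TOLERANT_ATTR = re.compile(
    r"""([a-zA-Z][\w-]*)(?:\s*=\s*("[^"]*"|'[^']*'|[^\s>]+))?""")

def parse_start_tag(tag, attr_re):
    """Split '<name attrs...>' into (name, {attr: value})."""
    parts = tag.strip().lstrip('<').rstrip('>').split(None, 1)
    name = parts[0].lower()
    attrs = {}
    if len(parts) > 1:
        for m in attr_re.finditer(parts[1]):
            key, value = m.group(1).lower(), m.group(2)
            # Assumed convention: a valueless attribute gets its own name.
            attrs[key] = key if value is None else value.strip('"\'')
    return name, attrs

tag = '<font color=\n\n#feefea>'
print(parse_start_tag(tag, NAIVE_ATTR)[1]['color'])     # 'color'
print(parse_start_tag(tag, TOLERANT_ATTR)[1]['color'])  # '#feefea'
```

With the naive pattern the hex code is cut off from its attribute and
the colour test receives the attribute name instead, which is the
symptom described above.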

--
Best Regards,
Aleksander Adamowski
Re: Semi-invisible font missed by SA
On Mon, 23 Feb 2004, Keith C. Ivey wrote:
> > This is mainly because HTML.pm can be fooled by dangling attributes.
> You've lost me there. What do "dangling attributes" have to do
> with this case? HTML_FONTCOLOR_UNKNOWN was triggered.....

I would suspect that rule triggers whenever it fails to see a proper color
spec on the SAME LINE. Which makes sense because otherwise the color
specified is NOT an 'unknown' - it is a valid color.

- C
Re: [spa] Re: [spa] Semi-invisible font missed by SA
Aleksander Adamowski <aleksander.adamowski.spamassassin@altkom.pl> wrote:

> More specifically, when there's a dangling attribute value like this in
> HTML source:
>
> <font color=
>
> #feefea>
>
> , then in the html_font_invisible() method the foreground color ($fg
> variable) has the value of 'color' instead of hex code of the HTML color.

Yes, that does sound like a parsing problem. It may be
connected to the fact that the HTML is invalid, since an
attribute value containing "#" must be quoted. Of course, SA
needs to handle that sort of invalid markup, just as mail
readers do.

Have you submitted a bug report at
http://bugzilla.spamassassin.org/ ?


--
Keith C. Ivey <kcivey@cpcug.org>
Washington, DC