Mailing List Archive

Spammer's new tricks
Recently, I've seen a lot of spam with either a lot of random words
appended to the bottom, or even whole unrelated paragraphs from news
reports or other random rubbish. I assume this is meant to mess up bayes
filtering...

I doubt I'm the only one seeing this phenomenon. I'm interested as to
what kind of effect this will have, if any, on SA's scoring/bayes
analysis of mail?

-j

--
-jamie <jamie@silverdream.org> | spamtrap: spam@silverdream.org
w: http://silverdream.org | p: sms@silverdream.org
pgp key @ http://silverdream.org/~jps/pub.key
15:30:01 up 7 days, 50 min, 13 users, load average: 0.59, 0.49, 0.35
RE: Spammer's new tricks [ In reply to ]
There are a couple of rules you can use that do a good job of catching this:

rawbody HN_WORDWORD10 /(?:\b(?!=:q(?:from|even|more|this|that|were|with)\b)[a-z]{4,12}[.,:;'!?-]?\s+){10}/
describe HN_WORDWORD10 LOCAL: string of 10+ random words
score HN_WORDWORD10 .5

rawbody HN_WORDWORD15 /(?:\b(?!=(?:from|even|more|this|that|were|with)\b)[a-z]{4,12}[.,:;'!?-]?\s+){15}/
describe HN_WORDWORD15 LOCAL: string of 15+ random words
score HN_WORDWORD15 2.5

rawbody HN_WORDWORD30 /(?:\b(?!=(?:from|even|more|this|that|were|with)\b)[a-z]{4,12}[.,:;'!?-]?\s+){30}/
describe HN_WORDWORD30 LOCAL: string of 30+ random words
score HN_WORDWORD30 5

> -----Original Message-----
> From: Jamie Penman-Smithson [mailto:jamie@silverdream.org]
> Sent: March 10, 2004 10:41 AM
> To: spamassassin-users@incubator.apache.org
> Subject: Spammer's new tricks
>
>
> Recently, I've seen a lot of spam with either a lot of random words
> appended to the bottom, or even whole unrelated paragraphs from news
> reports or other random rubbish. I assume this is meant to
> mess up bayes
> filtering...
>
> I doubt I'm the only one seeing this phenomenon. I'm interested as to
> what kind of effect this will have, if any, on SA's scoring/bayes
> analysis of mail?
>
> -j
Re: Spammer's new tricks [ In reply to ]
Jamie Penman-Smithson wrote:
> Recently, I've seen a lot of spam with either a lot of random words
> appended to the bottom, or even whole unrelated paragraphs from news
> reports or other random rubbish. I assume this is meant to mess up bayes
> filtering...
>
> I doubt I'm the only one seeing this phenomenon. I'm interested as to
> what kind of effect this will have, if any, on SA's scoring/bayes
> analysis of mail?

There have been MANY discussions of this on this list recently. In
short: these probably flag a message as spam. Bayes is (hopefully) tuned
to YOUR patterns, not any specific word list. If words are used that are
NOT part of your normal patterns, they tend to indicate spam. Careful
training is required.

- Bob
RE: Spammer's new tricks [ In reply to ]
Paul Barbeau <Paul@hypernet.ca> wrote:

> rawbody HN_WORDWORD10 /(?:\b(?!=:q(?:from|even|more|this|that|were|with)\b)[a-z]{4,12}[.,:;'!?-]?\s+){10}/
> describe HN_WORDWORD10 LOCAL: string of 10+ random words
> score HN_WORDWORD10 .5

What's that ":q" doing in there? Looks like something got
garbled somewhere along the line.

Also (and this affects the other rules you posted), the syntax
for a negative lookahead is "(?!whatever)", not
"(?!=whatever)". The extraneous equals sign means you're
really excluding those short words only if they've got an
equals sign in front of them (or in this case a "=:q"), but
that can't happen anyway because the main match requires a
lowercase letter in that position, so the negative lookahead is
doing nothing.
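
To illustrate the point about the lookahead, here is a minimal sketch in Python, whose `re` module treats "(?!...)" the same way for this purpose. The repeat count of 3, the sample string, and the variable names are made up for the demo, and the stray ":q" is left out.

```python
import re

# Stop words taken from the posted rules.
STOPWORDS = r"(?:from|even|more|this|that|were|with)"

# As posted: "(?!=" is a negative lookahead for a literal "=" plus a stop word.
broken = re.compile(
    r"(?:\b(?!=" + STOPWORDS + r"\b)[a-z]{4,12}[.,:;'!?-]?\s+){3}"
)

# Corrected: "(?!" excludes the stop words themselves.
fixed = re.compile(
    r"(?:\b(?!" + STOPWORDS + r"\b)[a-z]{4,12}[.,:;'!?-]?\s+){3}"
)

# Hypothetical sample made up only of stop words.
sample = "this that with "

broken_hit = broken.search(sample) is not None
fixed_hit = fixed.search(sample) is not None

print("with '(?!=': ", broken_hit)
print("with '(?!':  ", fixed_hit)
```

With the extra equals sign the stop words are counted like any other word, so the run still matches; with the corrected lookahead they are skipped.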

--
Keith C. Ivey <kcivey@cpcug.org>
Washington, DC
Re: Spammer's new tricks [ In reply to ]
He probably meant something along these lines:

# Moderately serious ham hits
# 17.294 21.1387 1.2243 0.945 0.85 1.00 LW_WORDLIST_10
body LW_WORDLIST_10 /(?:\b(?!(?:from|that|have|this|were|with)\b)[a-z\.\,\-\;]{4,18}\s+){10}/
describe LW_WORDLIST_10 string of 10+ random words
score LW_WORDLIST_10 1.0

Note that that test with the punctuation included hits 1.2% of the ham, so
you have to score it low. You have to go up to 20 words before you only hit
spam with that test, at which point you hit about 17% of the spam and no
ham.

The same rule WITHOUT punctuation is safe at 15 words, but will obviously
miss some of the newer spam that now uses punctuation. I haven't yet
checked to see what percentage of stuff is caught by the 20-word rule with
punctuation and isn't caught by the 15-word rule without punctuation.
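
A rough Python sketch of the difference between the two variants, for anyone who wants to experiment outside of SA. The helper name `wordlist_pattern` and both sample strings are invented for illustration; this only shows which pattern matches which text, and says nothing about real ham/spam hit rates.

```python
import re

# Negative lookahead shared by both variants (stop words from the rule above).
EXCLUDE = r"(?!(?:from|that|have|this|were|with)\b)"


def wordlist_pattern(count, punctuation):
    """Build the word-run pattern, with or without punctuation in the word class."""
    chars = r"[a-z\.\,\-\;]{4,18}" if punctuation else r"[a-z]{4,12}"
    return re.compile(r"(?:\b" + EXCLUDE + chars + r"\s+){" + str(count) + "}")


# Hypothetical samples: ten words, once plain and once comma-separated.
plain_list = "apple river stone cloud maple tiger ocean piano lemon candle "
comma_list = "apple, river, stone, cloud, maple, tiger, ocean, piano, lemon, candle, "

results = {
    (label, punct): wordlist_pattern(10, punct).search(text) is not None
    for label, text in (("plain", plain_list), ("comma", comma_list))
    for punct in (False, True)
}

for (label, punct), hit in results.items():
    print(label, "punctuation class" if punct else "letters only", "->", hit)
```

The letters-only pattern needs whitespace directly after each word, so a comma-separated list slips past it, while the punctuation variant matches both samples.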

Loren
Re: Spammer's new tricks [ In reply to ]
From: "Keith C. Ivey" <kcivey@cpcug.org>

> Paul Barbeau <Paul@hypernet.ca> wrote:
>
> > rawbody HN_WORDWORD10 /(?:\b(?!=:q(?:from|even|more|this|that|were|with)\b)[a-z]{4,12}[.,:;'!?-]?\s+){10}/
> > describe HN_WORDWORD10 LOCAL: string of 10+ random words
> > score HN_WORDWORD10 .5
>
> What's that ":q" doing in there? Looks like something got
> garbled somewhere along the line.
>
> Also (and this affects the other rules you posted), the syntax
> for a negative lookahead is "(?!whatever)", not
> "(?!=whatever)". The extraneous equals sign means you're
> really excluding those short words only if they've got an
> equals sign in front of them (or in this case a "=:q"), but
> that can't happen anyway because the main match requires a
> lowercase letter in that position, so the negative lookahead is
> doing nothing.

FWIW, here are two working versions of that rule, each expanded into a set of
three rules with escalating scores:

# match Bayes-poison lists of lowercase words without articles or common
# prepositions

body PT_WORDLIST_10 /(?:\b(?!(?:from|that|have|this|were|with)\b)[a-z]{4,12}\s+){10}/
describe PT_WORDLIST_10 string of 10+ random words
score PT_WORDLIST_10 1.0

body PT_WORDLIST_13 /(?:\b(?!(?:from|that|have|this|were|with)\b)[a-z]{4,12}\s+){13}/
describe PT_WORDLIST_13 string of 13+ random words
score PT_WORDLIST_13 3.0

body PT_WORDLIST_30 /(?:\b(?!(?:from|that|have|this|were|with)\b)[a-z]{4,12}\s+){30}/
describe PT_WORDLIST_30 string of 30+ random words
score PT_WORDLIST_30 10.0

# match Bayes-poison lists of lowercase words without articles or common
# prepositions, ignoring punctuation.

body XX_WORDLIST_10 /(?:\b(?!(?:from|that|have|this|were|with)\b)[a-z\.\,\-\;]{4,18}\s+){10}/
describe XX_WORDLIST_10 string of 10+ random words
score XX_WORDLIST_10 1.0

body XX_WORDLIST_13 /(?:\b(?!(?:from|that|have|this|were|with)\b)[a-z\.\,\-\;]{4,18}\s+){13}/
describe XX_WORDLIST_13 string of 13+ random words
score XX_WORDLIST_13 3.0

body XX_WORDLIST_30 /(?:\b(?!(?:from|that|have|this|were|with)\b)[a-z\.\,\-\;]{4,18}\s+){30}/
describe XX_WORDLIST_30 string of 30+ random words
score XX_WORDLIST_30 10.0
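
Since SA adds up the scores of every rule that hits, a long enough word list triggers all three tiers of a set at once. Here is a small Python sketch of that arithmetic for the PT_ set; the `TIERS` table, the helper names, and the sample word runs are made up for illustration, and this is not how SA itself evaluates rules.

```python
import re

# (word count, score) for each tier of the PT_ set above.
TIERS = {
    "PT_WORDLIST_10": (10, 1.0),
    "PT_WORDLIST_13": (13, 3.0),
    "PT_WORDLIST_30": (30, 10.0),
}


def tier_pattern(count):
    """Same pattern as the PT_ rules, with the repeat count filled in."""
    return re.compile(
        r"(?:\b(?!(?:from|that|have|this|were|with)\b)[a-z]{4,12}\s+){"
        + str(count) + "}"
    )


def total_score(text):
    """Return the tiers that match and the sum of their scores."""
    hits = [name for name, (count, _) in TIERS.items()
            if tier_pattern(count).search(text)]
    return hits, sum(TIERS[name][1] for name in hits)


# Hypothetical word runs of 12 and 30 words.
words = ["marble", "quartz", "lantern", "orchid", "pebble"]
run_12 = " ".join(words[i % 5] for i in range(12)) + " "
run_30 = " ".join(words[i % 5] for i in range(30)) + " "

hits_12, score_12 = total_score(run_12)
hits_30, score_30 = total_score(run_30)
print(hits_12, score_12)
print(hits_30, score_30)
```

So a 30-word run does not score 10.0 with these rules but 1.0 + 3.0 + 10.0, which is worth keeping in mind when picking the per-tier scores.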

Loren put them together.
{^_^}