Mailing List Archive

SA Problem: spam with random words to defeat Baysian filtering ...
I've just joined the list, and requested FAQ and info from the majordomo.
In the absence of either one, I am forced to ask the following of the list
with no knowledge of whether it is an FAQ or not -- sorry.

As indicated in the subject line, I'm getting negative hit rates on spam
which uses random dictionary words. Obviously sa-learn cannot learn how
to deal with such an approach, and my formerly brilliant
sendmail/spamassassin configuration is now next to useless - as I'm
getting 200 - 300 spams per day.

Can anyone point me to a solution or a counter-counter measure to kill
this damn spam??

--
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=--=-=-=-=
Robert S. Sciuk http://www.controlq.com 259 Simcoe St. S.
Control-Q Research tel: 905.576.8028 Oshawa, Ont.
rob@controlq.com fax: 905.576.8386 Canada, L1H 4H3
RE: SA Problem: spam with random words to defeat Baysian filtering ...
Some of us are working on a ruleset to combat this sort of spam, but it's not quite ready for release. A couple of custom rules in your local.cf will go a long way; adjust the scores to meet your needs.

Several versions of these rules have been posted to the SAtalk list; look at:
http://article.gmane.org/gmane.mail.spam.spamassassin.general/38455/match=randomword

Pierre Thomson
BIC
Re: SA Problem: spam with random words to defeat Baysian filtering ...
At 11:05 AM 2/11/2004, Robert S. Sciuk wrote:
>I've just joined the list, and requested FAQ and info from the majordomo.
>In the absence of either one, I am forced to ask the following of the list
>with no knowledge of whether it is an FAQ or not -- sorry.

The FAQ is actually a wiki web, and it's linked from the spamassassin.org
main page.

http://wiki.spamassassin.org/w/

>As indicated in the subject line, I'm getting negative hit rates on spam
>which uses random dictionary words. Obviously sa-learn cannot learn how
>to deal with such an approach, and my formerly brilliant
>sendmail/spamassassin configuration is now next to useless - as I'm
>getting 200 - 300 spams per day.
>
>Can anyone point me to a solution or a counter-counter measure to kill
>this damn spam??

This is quite surprising to me.. I've been getting a lot of the "random
word" spams too, but feeding them to sa-learn has been quite effective.

If you've got a lot of input to bayes, the random-word attacks wind up
being more-or-less a wash.

So far this month, I've had 7 false negatives, 0 false positives. Most of
the "dictionary bayes poison" spams are getting BAYES_99 for me.

For reference, and for those wondering about the full details of how I get
those results, my config consists of:

DCC, razor2 and RBLs used.
habeas_swe score forced down to -1.0
bayes_ignore_header statements for all the habeas SWE headers
bayes_auto_learn_threshold_nonspam -0.3

A few add-on rules:
antidrug.cf (gee, there's a shock, since I wrote it ;)
http://mywebpages.comcast.net/mkettler/sa/antidrug.cf

A collapsed version of popcorn that's just 2 rules.
Based on http://www.emtinc.net/includes/popcorn.cf, but edited by me to
only be 2 rules

A few rules from
http://www.merchantsoverseas.com/wwwroot/gorilla/body.txt
L_b_MaskedW0rds*
A few rules from
http://www.exit0.us/index.php/FredsRules-SUBJECT
FVGT_s_OBFU_*

One of the blackholes.us blacklists added, with score set
fairly low to avoid FPs.
header RCVD_IN_CHINA_KR eval:check_rbl('country', 'cn-kr.blackholes.us.')
describe RCVD_IN_CHINA_KR Received from China or Korea
score RCVD_IN_CHINA_KR 1.0

About 15 negative-scoring rules containing "industry-specific" phrases
for my company's business.

I feed bayes with some spamtraps and nonspamtraps each day, giving it about
100 spams, and 25 nonspams in manual training daily.
Re: SA Problem: spam with random words to defeat Baysian filtering ...
Hello Robert,

Wednesday, February 11, 2004, 8:05:38 AM, you wrote:

RSS> As indicated in the subject line, I'm getting negative hit rates on spam
RSS> which uses random dictionary words. Obviously sa-learn cannot learn how
RSS> to deal with such an approach, and my formerly brilliant
RSS> sendmail/spamassassin configuration is now next to useless - as I'm
RSS> getting 200 - 300 spams per day.

RSS> Can anyone point me to a solution or a counter-counter measure to kill
RSS> this damn spam??

1) Yes, sa-learn DOES deal with these emails, and does so exceedingly
well here. I call them "bayes fodder", since those random words are
teaching bayes that emails with those random words are spam.

2) I then augment bayes with the following rules:

# longwords -- possible sign of random words placed into spam to confuse anti-spam software
body RM_bpt_longwords68a /\b(?:[a-z]{6,}\s+){8}/
describe RM_bpt_longwords68a Long string of long words
score RM_bpt_longwords68a 1.500 # type=FP - 7429s/2h of 91714 corpus (74113s/17601h) 01/23/04
# ham: userid list,
# "improving compatibility between computer platforms demands certain levels "
body RM_bpt_longwords69a /\b(?:[a-z]{6,}\s+){9}/
describe RM_bpt_longwords69a Long string of long words
score RM_bpt_longwords69a 1.000 # type=max:1 (add to 59a,68a) - 6595s/1h of 91714 corpus (74113s/17601h) 01/23/04
# ham: userid list
body RM_bpt_longwords78a /\b(?:[a-z]{7,}\s+){8}/
describe RM_bpt_longwords78a Long string of long words
score RM_bpt_longwords78a 2.000 # type=max:2 (add to 68a) - 4163s/0h of 91714 corpus (74113s/17601h) 01/23/04
body RM_bpt_longwords59a /\b(?:[a-z]{5,}\s+){9}/
describe RM_bpt_longwords59a Long string of long words
score RM_bpt_longwords59a 1.500 # type=FP - 8753s/8h of 91714 corpus (74113s/17601h) 01/23/04
# ham: userid list
body RM_bpt_longwords79a /\b(?:[a-z]{7,}\s+){9}/
describe RM_bpt_longwords79a Long string of long words
score RM_bpt_longwords79a 1.000 # type=max:1 (add to 78a) - 2950s/0h of 91714 corpus (74113s/17601h) 01/23/04
body RM_bpt_longwords96a /\b(?:[a-z]{9,}\s+){6}/
describe RM_bpt_longwords96a Long string of long words
score RM_bpt_longwords96a 4.000 # 1162s/0h of 91714 corpus (74113s/17601h) 01/23/04
body RM_bpt_longwords88a /\b(?:[a-z]{8,}\s+){8}/
describe RM_bpt_longwords88a Long string of long words
score RM_bpt_longwords88a 4.000 # 1025s/0h of 91714 corpus (74113s/17601h) 01/23/04
body RM_bpt_longwords89a /\b(?:[a-z]{8,}\s+){9}/
describe RM_bpt_longwords89a Long string of long words
score RM_bpt_longwords89a 1.000 # type=max:1 (add to 88a) - 590s/0h of 91714 corpus (74113s/17601h) 01/23/04
body RM_bpt_longwords97 /\b(?:\w{9,}\s+){7}/
describe RM_bpt_longwords97 Long string of long words
score RM_bpt_longwords97 3.000 # 545s/0h of 91714 corpus (74113s/17601h) 01/23/04
body RM_bpt_longwords98 /\b(?:\w{9,}\s+){8}/
describe RM_bpt_longwords98 Long string of long words
score RM_bpt_longwords98 1.000 # type=max:1 (add to 97) - 442s/0h of 91714 corpus (74113s/17601h) 01/23/04
body RM_bpt_longwords99 /\b(?:\w{9,}\s+){9}/
describe RM_bpt_longwords99 Long string of long words
score RM_bpt_longwords99 1.000 # type=max:1 (add to 98) - 330s/0h of 91714 corpus (74113s/17601h) 01/23/04

Bob Menschel
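
To get a feel for what the longwords patterns above will and will not hit, one option is to try them outside SpamAssassin. The following is only a rough sketch, not part of the posted ruleset: it uses Python's `re` module rather than Perl, it assumes the message body has already been reduced to plain text (SpamAssassin does its own rendering before applying body rules), and the helper name `hits` and the random-word sample are made up for illustration. The ham sample is the phrase quoted in the rule comments.

```python
import re

# Three of the patterns posted above, copied verbatim.
RULES = {
    "RM_bpt_longwords68a": r"\b(?:[a-z]{6,}\s+){8}",
    "RM_bpt_longwords69a": r"\b(?:[a-z]{6,}\s+){9}",
    "RM_bpt_longwords78a": r"\b(?:[a-z]{7,}\s+){8}",
}

def hits(text):
    """Return the sorted names of the rules whose pattern occurs in text."""
    return sorted(name for name, pat in RULES.items() if re.search(pat, text))

# Ham phrase from the rule comments: eight words of 6+ letters in a row.
ham = "improving compatibility between computer platforms demands certain levels "

# Made-up random-word filler: nine lowercase words of 7+ letters in a row.
filler = ("quatrain billboard freckled gymnasium hardware "
          "illusion jasmine kangaroo lavender ")

# Ordinary short-word sentence.
plain = "please call me when you get home tonight "

for sample in (ham, filler, plain):
    print(hits(sample))
```

The ham phrase trips only the 68a rule, which is consistent with that rule being marked as FP-prone and given a modest score; ordinary prose with short words mixed in trips none of them.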
Re: SA Problem: spam with random words to defeat Baysian filtering ...
On Wed, 11 Feb 2004, Robert Menschel wrote:

[...]
> 1) Yes, sa-learn DOES deal with these emails, and does so exceedingly
> well here. I call them "bayes fodder", since those random words are
> teaching bayes that emails with those random words are spam.
>
> 2) I then augment bayes with the following rules:

[...]
> score RM_bpt_longwords98 1.000 # type=max:1 (add to 97) - 442s/0h of 91714 corpus (74113s/17601h) 01/23/04
> body RM_bpt_longwords99 /\b(?:\w{9,}\s+){9}/
> describe RM_bpt_longwords99 Long string of long words
> score RM_bpt_longwords99 1.000 # type=max:1 (add to 98) - 330s/0h of 91714 corpus (74113s/17601h) 01/23/04
>
> Bob Menschel

can you give some examples for what those rules will hit?
i've been trying some emails and misc texts with it, and got no hit yet :)

regards,
Matthias
Re: SA Problem: spam with random words to defeat Baysian filtering ...
Robert Menschel <Robert@Menschel.net> wrote:
> [...]
> 1) Yes, sa-learn DOES deal with these emails, and does so
> exceedingly well here. I call them "bayes fodder", since those random
> words are teaching bayes that emails with those random words are spam.

Just to avoid confusion, you're saying that AFTER TRAINING, bayes works quite
well for those messages, right? The key is feeding any messages that DO slip
through into sa-learn as spam UNTIL you get those results, no?

The "random words" question seems to come up frequently, and TRAINED bayes
seems to be a good answer.

> 2) I then augment bayes with the following rules:
> [...]

"Add-on" rules do seem to help get bayes there quicker!

- Bob
Re: SA Problem: spam with random words to defeat Baysian filtering ...
From: "Bob George" <mailings02@ttlexceeded.com>

> Robert Menschel <Robert@Menschel.net> wrote:
> > [...]
> > 1) Yes, sa-learn DOES deal with these emails, and does so
> > exceedingly well here. I call them "bayes fodder", since those random
> > words are teaching bayes that emails with those random words are spam.
>
> Just to avoid confusion, you're saying that AFTER TRAINING, bayes works quite
> well for those messages, right? The key is feeding any messages that DO slip
> through into sa-learn as spam UNTIL you get those results, no?
>
> The "random words" question seems to come up frequently, and TRAINED bayes
> seems to be a good answer.
>
> > 2) I then augment bayes with the following rules:
> > [...]
>
> "Add-on" rules do seem to help get bayes there quicker!

After watching the Bayes filter "learn" to auto white list spam when
first installed I disabled the auto white list feature and explicitly
generated lists of ham and spam. When the Bayes filter kicked in after
it had accumulated a couple hundred ham and spam messages the results
were dramatic. Before then it was somewhat discouraging. I do believe
I shall leave automatic learning and white listing turned off because
it seems to misfire entirely too often for my tastes. (The concept also
seems a little strange. If it already knows it's spam then train it
that the message is spam. I'd rather teach it with the new spam that
is not found than simply rack up higher scores by training it that
material it knows is spam is indeed spam. What am I missing here?)

{^_^} Joanne
Re: SA Problem: spam with random words to defeat Baysian filtering ...
From: "Raquel Rice" <raquel@thericehouse.net>
> On Wed, 11 Feb 2004 11:35:01 -0500
> Matt Kettler <mkettler@evi-inc.com> wrote:
>
> > I feed bayes with some spamtraps and nonspamtraps each day, giving
> > it about 100 spams, and 25 nonspams in manual training daily.
>
> How do you select, out of all your mail, 125 emails to train bayes
> with?

Might it be because SA seems to need 200 spams before the Bayes
filter kicks in? (It performs remarkably well here with a corpus
of some 450 spams and 700 or so hams.)

{^_-}
Re: SA Problem: spam with random words to defeat Baysian filtering ...
jdow <jdow@earthlink.net> wrote:
> After watching the Bayes filter "learn" to auto white list
> spam when first installed I disabled the auto white list
> feature and explicitly generated lists of ham and spam.

AWL works well for me, but that may have been due to a combination of add-on rules
and luck. I've left it enabled, but scoring of spam has swung to such extremes
(a good thing) thanks to bayes and other rules that it really hasn't impacted
things much one way or the other lately.

It does seem most of the auto-whitelist options are now missing from the
manpage (Mail::SpamAssassin::Conf) so perhaps they've been deprecated as of
late? (Must search archives.)

> When the Bayes filter kicked in after it had accumulated a couple
> hundred ham and spam messages the results were dramatic.

I learned my lesson and have begun storing a collection of 'borderline' spam
for training purposes. Thankfully, I had bayes trained before some of the more
clever spams began to hit, so none have gotten through lately, despite all their
attempts.

> Before then it was somewhat discouraging. I do believe I shall
> leave automatic learning and white listing turned off because
> it seems to misfire entirely too often for my tastes.

Now that I've read the latest manpage, I'm not really sure WHAT AWL is doing in
my case. I do see AWL score adjustments, but they tend to be slight... at least
in comparison to the massive scores most spam gets. Unless I'm mistaken, unless
spammers have forged addresses from real people I get good messages from, AWL
should NOT result in false positives.

> (The concept also seems a little strange. If it already knows it's
> spam then train it that the message is spam. I'd rather teach
> it with the new spam that is not found than simply rack up
> higher scores by training it that material it knows is spam is
> indeed spam. What am I missing here?)

I think there's a difference between auto-whitelist (AWL) -- based on sender --
and bayes_auto, which trains on content. AWL makes good sense... especially for
messages from my good friend that occasionally forwards spammy stuff of
interest. I've left the defaults for bayes_auto (to autolearn high-scoring
spam), but I do augment it with training from my corpus of about 1,000
low-scoring spams that I verified by hand, and the (infrequent) false negative.

I think the reason for bayes auto-learning being useful is that the words in
spam that DIDN'T trip the score get added as well. If those same words appear
commonly in non-spam, they cancel out. But as was pointed out recently, if
spammers use random dictionary words that DON'T appear in non-spam, that itself
is a hint that it might be spammy. It adds to the "smell" of spam, which is why
I think bayes has been so effective at catching the random-word spams that
bypass so many rudimentary filters.

Then again, this may simply be an indicator that I subscribe to low-brow lists.
:)

- Bob
Re: SA Problem: spam with random words to defeat Baysian filtering ...
From: "Bob George" <mailings02@ttlexceeded.com>

> I think the reason for bayes auto-learning being useful is that the words in
> spam that DIDN'T trip the score get added as well. If those same words appear
> commonly in non-spam, they cancel out. But as was pointed out recently, if
> spammers use random dictionary words that DON'T appear in non-spam, that itself
> is a hint that it might be spammy. It adds to the "smell" of spam, which is why
> I think bayes has been so effective at catching the random-word spams that
> bypass so many rudimentary filters.
>
> Then again, this may simply be an indicator that I subscribe to low-brow lists.
> :)

Bob, as far as I can figure it is not the words themselves that trigger
the rules as much as the ratios of word lengths. The random alphabet
word spammers were blocked before I ever installed the .cf files I found
referenced here like chickenpox, 99_FVGT_Tripwire, or any of the others.
I was having zero false positives and maybe 1-2% false spams. (Now that
those extra filters are in the path some spams are going right around
the filter. The poor 133MHz machine I am using to filter two people gets
plain swamped so some seem to go around the spam filtering with no
spamd available for connections. So the extra rules actually made things
worse here. {^_-})

I am thinking of pulling out the "useless" Tripwire and chickenpox
scans. Working too hard to achieve perfection wastes more time than
1-2% spam does.

{^_^}
Re[2]: SA Problem: spam with random words to defeat Baysian filtering ...
Hello Matthias,

Wednesday, February 11, 2004, 5:44:16 PM, you wrote:

>> body RM_bpt_longwords99 /\b(?:\w{9,}\s+){9}/
>> describe RM_bpt_longwords99 Long string of long words
>> score RM_bpt_longwords99 1.000 # type=max:1 (add to 98) - 330s/0h of 91714 corpus (74113s/17601h) 01/23/04

MF> can you give some examples for what those rules will hit?
MF> i've been trying some emails and misc texts with it, and got no hit yet :)

Attached.

Bob Menschel
Re[2]: SA Problem: spam with random words to defeat Baysian filtering ...
Hello Bob,

Wednesday, February 11, 2004, 5:53:03 PM, you wrote:

BG> Robert Menschel <Robert@Menschel.net> wrote:
>> 1) Yes, sa-learn DOES deal with these emails, and does so
>> exceedingly well here. I call them "bayes fodder", since those random
>> words are teaching bayes that emails with those random words are spam.

BG> Just to avoid confusion, you're saying that AFTER TRAINING, bayes works quite
BG> well for those messages, right? The key is feeding any messages that DO slip
BG> through into sa-learn as spam UNTIL you get those results, no?

Correct, with one clarification: The key is feeding ANY/ALL messages to
sa-learn, whether or not they have slipped through. The great majority of
spam is caught regardless; if we sa-learn only those that slip through,
then IMO there isn't enough information for Bayes to make this
determination. If ALL confirmed spam is fed to sa-learn, then Bayes will
have enough information.

BG> The "random words" question seems to come up frequently, and TRAINED bayes
BG> seems to be a good answer.

Agreed.

Bob Menschel
Re[2]: SA Problem: spam with random words to defeat Baysian filtering ...
On Wed, 11 Feb 2004, Robert Menschel wrote:

> Hello Matthias,
>
> Wednesday, February 11, 2004, 5:44:16 PM, you wrote:
>
> >> body RM_bpt_longwords99 /\b(?:\w{9,}\s+){9}/
> >> describe RM_bpt_longwords99 Long string of long words
> >> score RM_bpt_longwords99 1.000 # type=max:1 (add to 98) - 330s/0h of 91714 corpus (74113s/17601h) 01/23/04
>
> MF> can you give some examples for what those rules will hit?
> MF> i've been trying some emails and misc texts with it, and got no hit yet :)
>
> Attached.

thnx :)

regards,
Matthias
Re: SA Problem: spam with random words to defeat Baysian filtering ...
Raquel Rice <raquel@thericehouse.net> wrote:
> [...]
> All those lists you're so willing to throw away are working
> for me. I run all the rule lists chickenpox and Tripwire.

This is an interesting issue, since I am also on a low-resource computing list
(lots of DOS holdouts!) and they're as bedeviled by spam as the rest of us.

I've noticed that the add-on rules help recognize new patterns, which is very
useful for training bayes. But once bayes has the patterns, it alone is more
than adequate.

I'm wondering how practical it would be to "train up" a more powerful bayes
system with the full boat of rules, then just transfer the bayes data files to
a lower end machine. Or run the additional rules, then disable them for
performance until new patterns emerge.

Would there be a problem with creating a "bayes repository" and sharing it with
others? Of course, it's a shared bayes configuration, so there'd need to be
some general consensus as to what constitutes spam, etc.

> It takes my poor little 466 only a few seconds to scan for
> viruses and then for SA to do its work. I'd be swamped by
> spam if it weren't for the extra rulesets ... as far as I can
> tell from all the spam that's caught. My partner downloaded
> from our server 137 spam messages yesterday, all tagged, and
> two false negatives ... which I fed to sa-learn.

That's the model we've discussed for the "low-end gateway" for users. Have a
"smarter" machine capable of running tools such as SA do the work, then just
poll for the cleaned up messages using whatever software the users want.

- Bob
Re: SA Problem: spam with random words to defeat Baysian filtering ...
Bob George <mailings02@ttlexceeded.com> wrote:

> I've noticed that the add-on rules help recognize new patterns, which is very
> useful for training bayes. But once bayes has the patterns, it alone is more
> than adequate.

I'm not sure what you mean by "patterns", but it should be
clarified that Bayes doesn't deal with patterns like the ones
recognized by most rules. It deals only with the presence of
tokens, and individual tokens at that, not even combinations.
Rules can recognize much more general and complex patterns in
messages than anything Bayes can (at least as Bayes is
implemented in SA).

--
Keith C. Ivey <kcivey@cpcug.org>
Washington, DC
Re: SA Problem: spam with random words to defeat Baysian filtering ...
From: "Bob George" <mailings02@ttlexceeded.com>

> > It takes my poor little 466 only a few seconds to scan for
> > viruses and then for SA to do its work. I'd be swamped by
> > spam if it weren't for the extra rulesets ... as far as I can
> > tell from all the spam that's caught. My partner downloaded
> > from our server 137 spam messages yesterday, all tagged, and
> > two false negatives ... which I fed to sa-learn.
>
> That's the model we've discussed for the "low-end gateway" for users. Have a
> "smarter" machine capable of running tools such as SA do the work, then just
> poll for the cleaned up messages using whatever software the users want.

Bob, my trick here is a simple procmail rule to clone the messages into
a junk mailbox on the linux mailserver machine:
--8<--
:0c:
/$HOME/mail/rawmbox
--8<--

Then I use "mail" as a tool for performing the quick sort into spam and
ham. It took two days to generate my current spam database. Actual time
spent doing it was about an hour or two. Now that the database is trained
I look for any emails that slip through, find them in the raw mailbox,
and toss them into the spam training file. That takes maybe 10 minutes
every few days if I get worked up when more than a couple percent escape
the scanning process. The Bayesian analysis has made me lazy about
fomenting new explicit rules here. It builds the rules for me. That's
what a computer should do for me, isn't it?

(I'm worried about when the spammers figure out how to defeat the
simple Bayesian analysis. But by then they might have learned that
a trick to survive is to make the advertising interesting. TV was
a LONG time learning this. The current spammers haven't a clue on
this one, yet. But then, it'd take real creative work on their
part. I read that as well beyond them.)

{^_^} Joanne
Re: SA Problem: spam with random words to defeat Baysian filtering ...
Raquel Rice <raquel@thericehouse.net> wrote:
> [...]
> That isn't what I asked. I get over a thousand emails per day,
> personally. Those are from all the lists I'm on, all the
> personal mail, and all the business mail. I assume that
> Matt's email is similar. What I'm asking is, how to select
> 125 per day out of 1000?

I just go through and delete "borderline" cases from my inbox (mbox) (that is,
messages that are OK, but "spammy"), and manually sa-learn that as ham
occasionally. I do the same against folders for mailing lists that have low/no
spam hits. So I simply PRUNE my inbox before training for any large amount of
ham. (more below)

> (I've been going through all my messages each day, manually
> moving "ham" to a ham directory and moving "spam" to a spam
> directory ... a long and tedious job ... then using that to
> train bayes)

The key for me is keeping spam OUT of my inbox altogether for quick downloads,
reading and required daily maintenance. Perhaps set a lower spam threshold
initially, then automatically sort messages above threshold into "obvious" and
"maybe" spam folders? This would help keep your inbox spam-free (mostly), while
not dumping useful but not-as-important stuff.

I manually sort the false-positives out of the "maybe spam" folder and just
drag to "not spam" and "confirmed spam" folders. I have a cron script
automatically run sa-learn on them several times a day. Since anything in the "not"
or "confirmed" folder has been verified, I'm comfortable with this. This way, I
don't have to worry about training daily. I just do it as time allows, yet
still enjoy a spam-free inbox.

Daily use is virtually spam-free, and I just sort when convenient. Once bayes
came up to speed, I started dumping anything over the bayes auto_learn
threshold, since I had zero false positives at that level. So even the "maybe
spam" folder isn't overwhelming. If it starts to get cumbersome, I might even
crank this threshold back a couple of points, as I've yet to have a false
positive score much more than 6.

I don't get 1,000 messages personally each day, but over 500 come through
regularly. I find this quite manageable.

- Bob
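
The two-tier sorting described above (an "obvious" spam folder and a "maybe" folder for hand review) comes down to comparing the SpamAssassin score against two cut-offs. A minimal sketch of that decision follows. The function name, folder names, and both threshold values are placeholders chosen for illustration and are not taken from anyone's actual setup; in practice this logic would more likely live in a procmail recipe keyed on the score header.

```python
def triage(score, maybe_at=5.0, obvious_at=12.0):
    """Pick a folder for a message from its spam score.

    maybe_at and obvious_at are hypothetical cut-offs: at or above
    obvious_at the message is filed as spam without review, between the
    two it goes to a folder that gets sorted by hand, and below maybe_at
    it is delivered normally.
    """
    if obvious_at <= maybe_at:
        raise ValueError("obvious_at must be higher than maybe_at")
    if score >= obvious_at:
        return "obvious-spam"
    if score >= maybe_at:
        return "maybe-spam"
    return "inbox"

for s in (0.3, 6.1, 23.8):
    print(s, triage(s))
```

Lowering `obvious_at` shrinks the "maybe" folder at the cost of less review, which is the trade-off being weighed in the message above.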
Re: SA Problem: spam with random words to defeat Baysian filtering ...
Keith C. Ivey <kcivey@cpcug.org> wrote:
> Bob George <mailings02@ttlexceeded.com> wrote:
>
>> I've noticed that the add-on rules help recognize new
>> patterns, which is very useful for training bayes. But once
>> bayes has the patterns, it alone is more than adequate.
>
> I'm not sure what you mean by "patterns", but it should be
> clarified that Bayes doesn't deal with patterns like the ones
> recognized by most rules. It deals only with the presence of
> tokens, and individual tokens at that, not even combinations.
> Rules can recognize much more general and complex patterns in
> messages than anything Bayes can (at least as Bayes is
> implemented in SA).

Ah, I hope I'm not spreading bad information. I'm hardly an SA expert, just a
very happy end-user. It seems that using the add-on rules in conjunction with
bayes has resulted in NONE of the "clever" spams getting through. I have spent
some time thinking through training bayes (including NOT feeding it this list
as ham!) and it seems to have paid off. Perhaps I'm simply benefitting from
better recognition in the basic SA rules.

Just to verify, most spam I receive -- regardless of technique used -- seems to
be tagged with BAYES lately (90+ mostly). So I thought the weird "patterns"
(more correctly, broken-word tokens) were also going into bayes, with the
result that since those odd spellings of v-drug, backhair, spammer domains and
such ONLY show in spam, bayes associates them with statistically indicating
spam. Have I misunderstood?

So if the word "quatrain" only appears in random-word spam (here at least), or
more importantly, never shows in non-spam, it won't help (nor necessarily
hinder) detecting spam. And "eeVagra" and such will ONLY be in spam.
If spammers are using common word lists, I'd think there would be some
repetition, so it *might* help.

Am I off base?

- Bob
Re: SA Problem: spam with random words to defeat Baysian filtering ...
>
>> From: "Raquel Rice" <raquel@thericehouse.net>
>> > On Wed, 11 Feb 2004 18:54:13 -0800
>> > "jdow" <jdow@earthlink.net> wrote:
>> >
>> > > From: "Raquel Rice" <raquel@thericehouse.net>
>> > > > On Wed, 11 Feb 2004 11:35:01 -0500
>> > > > Matt Kettler <mkettler@evi-inc.com> wrote:
>> > > >
>> > > > > I feed bayes with some spamtraps and nonspamtraps each
>> > > > > day, giving it about 100 spams, and 25 nonspams in manual
>> > > > > training daily.
>> > > >
>> > > > How do you select, out of all your mail, 125 emails to train
>> > > > bayes with?
>> > >
>> > > Might it be because SA seems to need 200 spams before the
>> > > Bayes filter kicks in? (It performs remarkably well here with
>> > > a corpus of some 450 spams and 700 or so hams.)
>> > >
>> >
>> > That isn't what I asked. I get over a thousand emails per day,
>> > personally. Those are from all the lists I'm on, all the
>> > personal mail, and all the business mail. I assume that Matt's
>> > email is similar. What I'm asking is, how to select 125 per day
>> > out of 1000?
>> >
>> > (I've been going through all my messages each day, manually
>> > moving "ham" to a ham directory and moving "spam" to a spam
>> > directory ... a long and tedious job ... then using that to
>> > train bayes)
>>

Raquel: Don't know if this is what you want either, but sounds like it.

Right down at the very bottom of my global procmailrc, I place this recipe
to send a copy of the "HAM" to a special HAM collection folder. The other
copy is delivered to the appropriate user mbox. This figures that if the
messages made it through all of the other recipes above -- it's HAM.

Same with SPAM. Whenever any of the recipes spots a SPAM, a copy goes to a SPAM
collection folder.

Then, at midnight, a cron job feeds both HAM & SPAM using sa-learn.

Hope this helps......

Best regards,
Jack L. Stone,
Administrator

Sage American
http://www.sage-american.com
jacks@sage-american.com
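
The nightly training step described above could look roughly like the sketch below, which a cron entry would invoke. It assumes both collection folders are mbox files and relies on sa-learn's `--ham`, `--spam` and `--mbox` options; the folder paths and function names are hypothetical, and the real job may well be a plain shell script instead.

```python
import subprocess

def sa_learn_cmd(kind, mbox_path):
    """Build the sa-learn command line for one mbox collection folder."""
    if kind not in ("ham", "spam"):
        raise ValueError("kind must be 'ham' or 'spam'")
    return ["sa-learn", "--" + kind, "--mbox", mbox_path]

def train(ham_mbox, spam_mbox, run=subprocess.run):
    """Feed both folders to sa-learn and return the commands that were run.

    The run argument exists so the commands can be inspected or dry-run
    without sa-learn being installed.
    """
    cmds = [sa_learn_cmd("ham", ham_mbox), sa_learn_cmd("spam", spam_mbox)]
    for cmd in cmds:
        run(cmd, check=True)
    return cmds

# Dry run with hypothetical paths: print the commands instead of executing them.
train("/var/spool/collect/ham", "/var/spool/collect/spam",
      run=lambda cmd, check: print(" ".join(cmd)))
```

Emptying or rotating the collection folders after a successful run is left out here; without it, every night's run walks over the same messages again.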
Re: SA Problem: spam with random words to defeat Baysian filtering ...
From my earlier message:
Right down at the very bottom of my global procmailrc, I place this recipe
to send a copy of the "HAM" to a special HAM collection folder. The other
copy is delivered to the appropriate user mbox. This figures that if the
messages made it through all of the other recipes above -- it's HAM.
-------------------------------------------------
Sorry.......
NOW, for the recipe at the bottom:
## Send copy to Ham folder
:0
{
  :0c:
  $HAM

  :0:
  $DEFAULT
}

Best regards,
Jack L. Stone,
Administrator

Sage American
http://www.sage-american.com
jacks@sage-american.com
RE: SA Problem: spam with random words to defeat Baysian filtering ...
This may be slightly off-topic but I think I have a related question.

If spammers start putting a bunch of "good" words at the end of the
spam, which some of them seem to be doing, then when you "learn" them,
won't that screw things up a bit and defeat the whole process?

In this case the rules-based checks would still work, but the Bayes
checks may offset them.

Please tell me if I'm misunderstanding this.

Thanks,
Mark DeMichele
Re: SA Problem: spam with random words to defeat Baysian filtering ...
Mark A. DeMichele <demi@intellipro.com> wrote:
> This may be slightly off-topic but I think I have a related
> question.
>
> If spammers start putting a bunch of "good" words at the end
> of the spam, which some of them seem to be doing, then when
> you "learn" them, won't that screw things up a bit and defeat
> the whole process?

There are also other 'tidbits' in those messages that are useful indicators
though.

> In this case the rules-based checks would still work, but
> the Bayes checks may offset them.


I am not an expert but... if the spammers are using *truly random* words, there
should still be a large number of words that are NOT normally present in ham.
And although a random assortment might contain some "good" words, statistically
they won't be significant so -- if I've got it right -- won't break things at
all. So I don't think a random smattering of non-spam words will have much
impact.

> Please tell me if I'm misunderstanding this.

Any enlightenment welcome here too!

FWIW: I've been feeding random-word spams to bayes, and it is still working
well for me (admittedly in a non-heavy production use setting).

- Bob
Re[2]: SA Problem: spam with random words to defeat Baysian filtering ...
Hello Mark,

Thursday, February 12, 2004, 8:37:12 AM, you wrote:

MAD> If spammers start putting a bunch of "good" words at the end of the
MAD> spam, which some of them seem to be doing, then when you "learn"
MAD> them, won't that screw things up a bit and defeat the whole process?

That certainly seems to be what the spammers are hoping for.

MAD> In this case the rules-based checks would still work, but the Bayes
MAD> checks may offset them.

MAD> Please tell me if I'm misunderstanding this.

1) As already pointed out, Bayes collects information from the headers
and the message body of the spam, as well as the random words. Those are
important fodder for Bayes.

2) The random words always contain plenty of words that do NOT appear in
normal emails. They are therefore not in conflict with ham, and become
good spam sign. As Bayes learns more and more of these truly random
words, they become better and better spam sign.

3) Those few words which are randomly included in this misguided attempt
to confuse Bayes and which actually do occur in normal ham are then known
by Bayes to occur in both ham and spam, with the effect that Bayes will
tend to ignore them when determining that messages with all those other
random words and spam tokens are spam.

I've been feeding ALL such emails to Bayes for three or four months now,
and my experience is that Bayes is working beautifully.

Bob Menschel
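
Points 2) and 3) above can be illustrated with a toy per-token estimate. This is a deliberately simplified sketch and not SpamAssassin's actual Bayes implementation, which tokenizes headers and body differently and applies smoothing and a more careful way of combining token probabilities. The function name and the four tiny sample messages are invented for the example.

```python
from collections import Counter

def token_probs(spam_msgs, ham_msgs):
    """Toy per-token spam probability: s / (s + h), where s and h are the
    fractions of spam and ham messages that contain the token."""
    spam_counts = Counter(t for m in spam_msgs for t in set(m.split()))
    ham_counts = Counter(t for m in ham_msgs for t in set(m.split()))
    probs = {}
    for tok in set(spam_counts) | set(ham_counts):
        s = spam_counts[tok] / len(spam_msgs)
        h = ham_counts[tok] / len(ham_msgs)
        probs[tok] = s / (s + h)
    return probs

# Spam padded with one rare random word each, plus a common "good" word.
spam = ["viagra quatrain meeting", "viagra zephyr meeting"]
ham = ["meeting report", "meeting lunch"]

probs = token_probs(spam, ham)
for tok in ("quatrain", "zephyr", "meeting", "lunch"):
    print(tok, probs[tok])
```

In this toy corpus the random words seen only in spam score 1.0, so they end up counting as spam sign, while "meeting", which is equally common in both sets, lands at 0.5 and carries no weight either way.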