Mailing List Archive

spam/ham submission via email (qmail)
Hi,

A few quick questions on bayes:

-I assume I should not feed it any of the "big random word" emails,
correct? Those are designed to pollute bayes with bad stuff is my
understanding.

-Under qmail, I'd like to set up an alias that a few trusted staff members
could forward false negatives to and have sa-learn learn those as spam.
Has anyone seen a nice safe way to do this? I have maildrop available as
well, if it would be easier to do this there.

Thanks,

Charles
Re: spam/ham submission via email (qmail)
On Fri, 12 Mar 2004, Charles Sprickman wrote:

> -I assume I should not feed it any of the "big random word" emails,
> correct? Those are designed to pollute bayes with bad stuff is my
> understanding.

They're designed to defeat hashing schemes that look for messages that are
similar to one another. The only way to "pollute" SA's Bayes classifier
is to feed it mis-classified messages. The worst that the random word
messages can do is make your ham look like spam; they can't make your spam
look like ham unless you _don't_ train on them.
Re: spam/ham submission via email (qmail)
From: "Charles Sprickman" <spork@fasttrackmonkey.com>

> Hi,
>
> A few quick questions on bayes:
>
> -I assume I should not feed it any of the "big random word" emails,
> correct? Those are designed to pollute bayes with bad stuff is my
> understanding.

I do. It seems effective. Random word tests also prove effective.
Putting both together is "Dy-no-mite!"

> -Under qmail, I'd like to set up an alias that a few trusted staff members
> could forward false negatives to and have sa-learn learn those as spam.
> Has anyone seen a nice safe way to do this? I have maildrop available as
> well, if it would be easier to do this there.

I'm not sure whether what I have is close to what you want. I
have a rude and crude hack that allows individual spam databases, even.
And it works with this email program, Outlook Expunge er Express. I am
still experimenting with it because at the moment it requires two accounts
per user with each in shared groups. (jdow is in jdow_train's group and
jdow_train is in jdow's group given the RedHat group setup paradigm where
each user has their own group.) I don't like that and need to experiment
with folding it all into the same account. That is going to take a little
work, though. And I am time swamped at the moment.

{^_^}
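
For the alias question, one possible approach is sketched below in Python. It is only a sketch under stated assumptions: staff forward the missed spam *as an attachment* (an inline forward loses the original headers), the addresses in TRUSTED are placeholders, and the function names are hypothetical. It also assumes sa-learn accepts a single message on standard input. Note that a From: header is easy to forge, so this check alone is not strong protection.

```python
import email
import subprocess
from email import policy
from email.utils import parseaddr

# Assumption: placeholder addresses; replace with your own staff list.
TRUSTED = {"alice@example.com", "bob@example.com"}


def extract_attached_messages(raw_bytes, trusted=TRUSTED):
    """Return the bytes of each message/rfc822 attachment in a forwarded
    mail, or an empty list if the forwarder is not in the trusted set."""
    outer = email.message_from_bytes(raw_bytes, policy=policy.default)
    sender = parseaddr(str(outer.get("From", "")))[1].lower()
    if sender not in trusted:
        return []
    found = []
    for part in outer.walk():
        if part.get_content_type() == "message/rfc822":
            # The payload of a message/rfc822 part is a one-item list
            # holding the attached message.
            found.append(part.get_payload(0).as_bytes())
    return found


def learn_as_spam(msg_bytes):
    """Feed one message to sa-learn on stdin (assumed behaviour)."""
    subprocess.run(["sa-learn", "--spam"], input=msg_bytes, check=True)


def handle_delivery(raw_bytes, learn=learn_as_spam):
    """Learn every attached message; return how many were learned."""
    count = 0
    for msg_bytes in extract_attached_messages(raw_bytes):
        learn(msg_bytes)
        count += 1
    return count

# A delivery script would end with something like:
#   import sys; handle_delivery(sys.stdin.buffer.read())
```

Under qmail, a script like this could be run from a .qmail file for the alias using a program-delivery line that starts with "|"; the script path and alias name are up to you. The attached message is re-serialised by the email library, so it may differ slightly from the bytes that originally arrived.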
Re: spam/ham submission via email (qmail)
Hello Charles,

Friday, March 12, 2004, 4:17:40 PM, you wrote:

CS> -I assume I should not feed it any of the "big random word" emails,
CS> correct? Those are designed to pollute bayes with bad stuff is my
CS> understanding.

You assume wrong. Yes, they may be /designed/ to pollute bayes, but with
thorough training, they have the opposite effect. Those emails with what
I call spam fodder in them are now being caught by Bayes 99% of the time
here.

Bob Menschel
Re: spam/ham submission via email (qmail)
From: "Robert Menschel" <Robert@Menschel.net>

> You assume wrong. Yes, they may be /designed/ to pollute bayes, but with
> thorough training, they have the opposite effect. Those emails with what
> I call spam fodder in them are now being caught by Bayes 99% of the time
> here.

Ironically, Bob, it is being caught by Bayes_99 99% of the time.
{^_-}
Re: spam/ham submission via email (qmail)
On Fri, 12 Mar 2004 20:49:42 -0800, Robert Menschel wrote:

>CS> -I assume I should not feed it any of the "big random word" emails,
>CS> correct? Those are designed to pollute bayes with bad stuff is my
>CS> understanding.
>
>You assume wrong. Yes, they may be /designed/ to pollute bayes, but with
>thorough training, they have the opposite effect. Those emails with what
>I call spam fodder in them are now being caught by Bayes 99% of the time
>here.

I don't recall ever reading an "executive summary" of why this is the
case, so I'll take a swing at it:

"Random word strings, or 'bayes poison,' often work counter to their
intention of getting the message past your bayes filters. The reason
is that /your/ bayes DB assigns a 'hammy' score only to words that have
previously appeared in more ham than spam in /your/ mailbox. So if a
message suddenly appears that contains 'xylophone uganda unctuous
perspiration', those words will not contribute to a 'hammy' bayes score
unless they routinely appear in non-spam mail to you. They will be
assigned a neutral bayes score, because they appear neither 'hammy' nor
'spammy,' and spamassassin will determine the 'spamminess' of the
message based solely on other tests.

"If the total of all tests does result in such a message getting
incorrectly tagged as ham, you should feed it back into your bayes
database via sa-learn, so that these words get flagged as having
appeared in a message that you considered spam. Thus the next time a
message appears which includes any of these 'poison' words, they will
actually score /against/ the message because you have told bayes that
they only ever appear in spam."

Cheers,
Jon Etkins
Austin, TX
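
To make the "neutral score" point concrete, here is a minimal Python sketch of per-token scoring. It uses a simplified smoothing formula in the style of Gary Robinson's and is not SpamAssassin's actual code: the class name ToyBayes, the strength and prior parameters, and the whitespace tokenizer are all assumptions made for illustration.

```python
from collections import Counter


class ToyBayes:
    """Simplified per-token scoring; NOT SpamAssassin's implementation."""

    def __init__(self, strength=1.0, prior=0.5):
        self.spam_tokens = Counter()  # token -> number of spam mails containing it
        self.ham_tokens = Counter()   # token -> number of ham mails containing it
        self.nspam = 0
        self.nham = 0
        self.strength = strength      # weight given to the neutral prior
        self.prior = prior            # score assumed for a never-seen token

    def learn(self, text, is_spam):
        tokens = set(text.lower().split())
        if is_spam:
            self.spam_tokens.update(tokens)
            self.nspam += 1
        else:
            self.ham_tokens.update(tokens)
            self.nham += 1

    def token_score(self, token):
        """Return a value in [0, 1]; 0.5 is neutral, higher is spammier."""
        s = self.spam_tokens[token]
        h = self.ham_tokens[token]
        n = s + h
        if n == 0:
            return self.prior  # unseen word: neither hammy nor spammy
        spam_rate = s / self.nspam if self.nspam else 0.0
        ham_rate = h / self.nham if self.nham else 0.0
        p = spam_rate / (spam_rate + ham_rate)
        return (self.strength * self.prior + n * p) / (self.strength + n)


bayes = ToyBayes()
bayes.learn("lunch meeting on friday", is_spam=False)
print(bayes.token_score("xylophone"))  # neutral: never seen before
bayes.learn("cheap pills xylophone uganda", is_spam=True)
print(bayes.token_score("xylophone"))  # now leans spammy
```

In this toy model a never-seen word cannot pull a message toward ham, and once the poisoned message has been learned as spam the same word starts counting against later messages that contain it.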
Re: spam/ham submission via email (qmail)
At 18:27 2004/03/13, Jon Etkins wrote:
>On Fri, 12 Mar 2004 20:49:42 -0800, Robert Menschel wrote:
>
> >CS> -I assume I should not feed it any of the "big random word" emails,
> >CS> correct? Those are designed to pollute bayes with bad stuff is my
> >CS> understanding.
> >
> >You assume wrong. Yes, they may be /designed/ to pollute bayes, but with
> >thorough training, they have the opposite effect. Those emails with what
> >I call spam fodder in them are now being caught by Bayes 99% of the time
> >here.
>
>I don't recall ever reading an "executive summary" of why this is the
>case, so I'll take a swing at it:
>
>"Random word strings, or 'bayes poison,' often work counter to their
>intention of getting the message past your bayes filters. The reason
>is that /your/ bayes DB assigns a 'hammy' score only to words that have
>previously appeared in more ham than spam in /your/ mailbox. So if a
>message suddenly appears that contains 'xylophone uganda unctuous
>perspiration', those words will not contribute to a 'hammy' bayes score
>unless they routinely appear in non-spam mail to you. They will be
>assigned a neutral bayes score, because they appear neither 'hammy' nor
>'spammy,' and spamassassin will determine the 'spamminess' of the
>message based solely on other tests.
>
>"If the total of all tests does result in such a message getting
>incorrectly tagged as ham, you should feed it back into your bayes
>database via sa-learn, so that these words get flagged as having
>appeared in a message that you considered spam. Thus the next time a
>message appears which includes any of these 'poison' words, they will
>actually score /against/ the message because you have told bayes that
>they only ever appear in spam."

I don't think the objective of "bayes poison" was ever really to help spam
get past spam filters--as you say, it's trivial to spot, whether with Bayes
or just with a pattern-based rule. The point of "bayes poison", as I
understand it, is to make it a "poison pill" for your Bayes database--if
you tell your database the message is spam, it has to associate all of
these potentially-hammy dictionary words with spam, since those tokens now
show up in the spam column. If words like "perspiration" now show up in
your spam column, there's a greater likelihood that legitimate mail that
contains those words will be misclassified as spam (i.e. more false
positives). The spammers' intention is not so much to get past your
filters, I think, but to eventually corrupt your Bayes database to the
point where it generates an intolerable number of false positives, so that
you'll stop using it altogether due to user complaints.

The selection of these random dictionary words is interesting as well,
since on the surface they don't appear to be words that many people use in
regular conversation. Not many of my conversations use the word
"xylophone" :) That said, there may be method to this madness as well. If
the spammers tried to poison your database with more common words, e.g.
"music", the effect would be largely diluted, since you'd likely have many
times more entries for "music" in your ham column than in your spam column,
and adding one more to the spam side wouldn't significantly change the
probability of the mail being misclassified. On the other hand, if you
only have "xylophone" in your database on the spam side (due to the poison
you ingested earlier), then the first time you *do* receive ham containing
that word your Bayes database is more likely to scream "spam!". The
spammers' poison strings tend to contain dozens if not hundreds of such
words, each one a potential "land mine" for a legitimate mail to step on in
the future.
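
The dilution argument can be checked with a little arithmetic. The Python sketch below uses a simplified Robinson-style token estimate, which is an assumption and not SpamAssassin's exact formula; the function name token_score, the counts for "music", and the corpus sizes are all made up for illustration.

```python
def token_score(spam_hits, ham_hits, nspam, nham, strength=1.0, prior=0.5):
    """Simplified token spamminess in [0, 1]; 0.5 means neutral."""
    n = spam_hits + ham_hits
    if n == 0:
        return prior
    spam_rate = spam_hits / nspam
    ham_rate = ham_hits / nham
    p = spam_rate / (spam_rate + ham_rate)
    return (strength * prior + n * p) / (strength + n)


NSPAM = NHAM = 1000  # hypothetical corpus sizes, held fixed for simplicity

# A common word: seen in 40 ham and 2 spam, then one poison mail is learned.
music_before = token_score(2, 40, NSPAM, NHAM)  # about 0.058
music_after = token_score(3, 40, NSPAM, NHAM)   # about 0.080

# A rare word: never seen, then one poison mail is learned.
xylo_before = token_score(0, 0, NSPAM, NHAM)    # 0.5
xylo_after = token_score(1, 0, NSPAM, NHAM)     # 0.75

print(f"music:     {music_before:.3f} -> {music_after:.3f}")
print(f"xylophone: {xylo_before:.3f} -> {xylo_after:.3f}")
```

With these made-up counts the common word barely moves and stays firmly hammy, while the rare word jumps from neutral to spammy after a single poisoned message.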

Another possible intention of "bayes poison" may be an attempt to overrun
the Bayes database itself by forcing out existing tokens when the database
is size-limited. If your Bayes database is at its size limit (say 5 MB),
and there are suddenly 100 new spam tokens to add, where do they get
added? What has to get removed to make room for them? With enough of
these "attacks", a number of more useful spam tokens may eventually be
forced out of your database, making your filter less effective.
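
A toy model of that eviction pressure is sketched below in Python. It assumes a fixed token cap and least-recently-seen eviction; both are assumptions for illustration, and the expiry policy of a real Bayes database may well differ. The class name CappedTokenStore is hypothetical.

```python
from collections import OrderedDict


class CappedTokenStore:
    """Toy size-limited token table with least-recently-seen eviction."""

    def __init__(self, max_tokens):
        self.max_tokens = max_tokens
        self.tokens = OrderedDict()  # token -> count, least recently seen first

    def see(self, token):
        """Record a token and return the list of tokens evicted to make room."""
        self.tokens[token] = self.tokens.get(token, 0) + 1
        self.tokens.move_to_end(token)  # mark as most recently seen
        evicted = []
        while len(self.tokens) > self.max_tokens:
            old_token, _count = self.tokens.popitem(last=False)
            evicted.append(old_token)
        return evicted


store = CappedTokenStore(max_tokens=3)
for useful in ["viagra", "mortgage", "refinance"]:
    store.see(useful)

# A poisoned message arrives with tokens never seen before.
forced_out = []
for poison in ["xylophone", "uganda"]:
    forced_out.extend(store.see(poison))

print(forced_out)          # useful tokens pushed out by the poison
print(list(store.tokens))  # what is left in the table
```

In this model, tokens that keep reappearing in real spam are refreshed and survive, so the attack only works if poison arrives faster than useful tokens are seen again.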

Note that "bayes poison" is not aimed at SpamAssassin in particular, but at
any content-filtering tool that uses Bayesian methods. Some
implementations are more vulnerable than others, and certainly those people
who rely *solely* on Bayesian filtering are most vulnerable. This is an
area where SpamAssassin's broad-spectrum approach really shines--the
vulnerabilities of one method are compensated for by strengths in other
methods, all within the same tool.

Pattern-based rules can easily catch "bayes poison", and perhaps in some
future version it might even be possible to prevent such items from being
learned by the Bayes database if a "bayes poison" rule was also
triggered. More generally, it might be useful to have the Bayes
database ignore certain words if they match certain pattern-based rules
(e.g. poison words included in HTML comments, such as <!-- xylophone -->).
The pattern-based rules are straightforward to write, and having
SpamAssassin make use of these to do smarter Bayes training might be a
worthwhile idea. That way you could still train on a "poisoned" item, and
as long as one (or more) of SpamAssassin's non-Bayes rules identified a
poison pattern, the contents of that pattern-match could be omitted from
the Bayes tokenisation.
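
As a rough illustration of that idea, the Python sketch below strips anything matched by a "poison" pattern before the text is tokenised for training. It is hypothetical: SpamAssassin does not necessarily expose such a hook, and the names HTML_COMMENT, WORD and tokens_for_training are invented for this example.

```python
import re

# Pattern for one kind of poison hiding place: HTML comments.
HTML_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
# Very crude word tokenizer, for illustration only.
WORD = re.compile(r"[a-z0-9']+")


def tokens_for_training(body, skip_patterns=(HTML_COMMENT,)):
    """Return tokens from body, omitting text matched by any skip pattern."""
    for pattern in skip_patterns:
        body = pattern.sub(" ", body)
    return WORD.findall(body.lower())


poisoned = "Cheap <!-- xylophone uganda --> pills"
print(tokens_for_training(poisoned))                    # poison omitted
print(tokens_for_training(poisoned, skip_patterns=()))  # poison included
```

The message can still be learned as spam, but the words hidden inside the matched pattern never reach the token table.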


Robert LeBlanc <rjl@renaissoft.com>
Renaissoft, Inc.
Maia Mailguard <http://www.renaissoft.com/maia/>
RE: spam/ham submission via email (qmail)
Regards,

Julian Milano
IT Manager


Davenport Group
79-81 Coppin Street (PO Box 12) Richmond Victoria 3121
Ph : (613) 8416 6666

Limits of Liability and Disclaimer - Davenport Industries is not liable for
any loss, damages, claims, cost demand and expense whatsoever and howsoever
arising in connection with this email transmission. The receiver of this
transmission shall ascertain the accuracy and suitability of this data for
their purposes. Although computer virus scanning software is used by
Davenport Industries, the receiver shall be responsible for their own virus
protection and Davenport Industries shall not be held liable for any
subsequent loss, damage, cost or expense.

This email and any attachment is confidential and intended solely for the
use of the individual or entity to whom they are addressed. If you have
received this email in error you are prohibited from disclosing, copying or
using the information contained in it and please inform us by reply email
and delete.



-----Original Message-----
From: Robert LeBlanc [mailto:rjl@renaissoft.com]
Sent: Monday, 15 March 2004 10:02 am
To: spamassassin-users@incubator.apache.org
Subject: Re: spam/ham submission via email (qmail)


At 18:27 2004/03/13, Jon Etkins wrote:
>On Fri, 12 Mar 2004 20:49:42 -0800, Robert Menschel wrote:
>
> >CS> -I assume I should not feed it any of the "big random word" emails,
> >CS> correct? Those are designed to pollute bayes with bad stuff is my
> >CS> understanding.

--8<--sNiP-->8--

The point of "bayes poison", as I
understand it, is to make it a "poison pill" for your Bayes database--if
you tell your database the message is spam, it has to associate all of
these potentially-hammy dictionary words with spam, since those tokens now
show up in the spam column. If words like "perspiration" now show up in
your spam column, there's a greater likelihood that legitimate mail that
contains those words will be misclassified as spam (i.e. more false
positives).

--8<--sNiP-->8--

My opinion: from what I have read about Bayes, it's going to take maybe 100
occurrences of the same word flagged as spam before it has a significant
effect on the overall score.

