At 18:27 2004/03/13, Jon Etkins wrote:
>On Fri, 12 Mar 2004 20:49:42 -0800, Robert Menschel wrote:
>
> >CS> -I assume I should not feed it any of the "big random word" emails,
> >CS> correct? Those are designed to pollute bayes with bad stuff is my
> >CS> understanding.
> >
> >You assume wrong. Yes, they may be /designed/ to pollute bayes, but with
> >thorough training, they have the opposite effect. Those emails with what
> >I call spam fodder in them are now being caught by Bayes 99% of the time
> >here.
>
>I don't recall ever reading an "executive summary" of why this is the
>case, so I'll take a swing at it:
>
>"Random word strings, or 'bayes poison,' often work counter to their
>intention of getting the message past your bayes filters. The reason
>is that /your/ bayes DB assigns a 'hammy' score only to words that have
>previously appeared in more ham than spam in /your/ mailbox. So if a
>message suddenly appears that contains 'xylophone uganda unctious
>perspiration', those words will not contribute to a 'hammy' bayes score
>unless they routinely appear in non-spam mail to you. They will be
>assigned a neutral bayes score, because they appear neither 'hammy' nor
>'spammy,' and spamassassin will determine the 'spamminess' of the
>message based solely on other tests.
>
>"If the total of all tests does result in such a message getting
>incorrectly tagged as ham, you should feed it back into your bayes
>database via sa-learn, so that these words get flagged as having
>appeared in a message that you considered spam. Thus the next time a
>message appears which includes any of these 'poison' words, they will
>actually score /against/ the message because you have told bayes that
>they only ever appear in spam."

I don't think the objective of "bayes poison" was ever really to help spam
get past spam filters--as you say, it's trivial to spot, whether with Bayes
or just with a pattern-based rule. The point of "bayes poison", as I
understand it, is to act as a "poison pill" for your Bayes database--if
you tell your database the message is spam, it has to associate all of
these potentially-hammy dictionary words with spam, since those tokens now
show up in the spam column. If words like "perspiration" now show up in
your spam column, there's a greater likelihood that legitimate mail that
contains those words will be misclassified as spam (i.e. more false
positives). The spammers' intention is not so much to get past your
filters, I think, but to eventually corrupt your Bayes database to the
point where it generates an intolerable number of false positives, so that
you'll stop using it altogether due to user complaints.

The selection of these random dictionary words is interesting as well,
since on the surface they don't appear to be words that many people use in
regular conversation. Not many of my conversations use the word
"xylophone" :) That said, there may be method to this madness as well. If
the spammers tried to poison your database with more common words, e.g.
"music", the effect would be largely diluted, since you'd likely have many
times more entries for "music" in your ham column than in your spam column,
and adding one more to the spam side wouldn't significantly change the
probability of the mail being misclassified. On the other hand, if you
only have "xylophone" in your database on the spam side (due to the poison
you ingested earlier), then the first time you *do* receive ham containing
that word your Bayes database is more likely to scream "spam!". The
spammers' poison strings tend to contain dozens if not hundreds of such
words, each one a potential "land mine" for a legitimate mail to step on in
the future.
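
To put rough numbers on the dilution argument, here is a minimal sketch.
It uses a naive per-token estimate (the spam share of a token's relative
frequency), which is an assumption for illustration and not SpamAssassin's
actual Bayes computation. The function name, message totals and counts
are all made up.

```python
def spam_probability(spam_count, ham_count, n_spam, n_ham):
    """Naive per-token estimate: spam share of the token's relative frequency."""
    spam_freq = spam_count / n_spam
    ham_freq = ham_count / n_ham
    if spam_freq + ham_freq == 0:
        return 0.5  # token never seen: treat as neutral
    return spam_freq / (spam_freq + ham_freq)


# Hypothetical corpus sizes, held fixed for simplicity.
N_SPAM, N_HAM = 1000, 1000

# "music" already appears in 200 ham messages; one poison spam adds 1 spam hit.
music_after = spam_probability(1, 200, N_SPAM, N_HAM)

# "xylophone" has never been seen, then shows up once in a poison spam.
xylophone_before = spam_probability(0, 0, N_SPAM, N_HAM)
xylophone_after = spam_probability(1, 0, N_SPAM, N_HAM)

print(f"music after poison:      {music_after:.4f}")
print(f"xylophone before poison: {xylophone_before:.4f}")
print(f"xylophone after poison:  {xylophone_after:.4f}")
```

Under these assumptions "music" barely moves away from the ham end, while
"xylophone" jumps from neutral to fully spammy after a single poisoned
message. Real implementations damp tokens with very few observations, so
the jump would be less extreme in practice, but the direction is the same.
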
Another possible aim of "bayes poison" is to overrun
the Bayes database itself by forcing out existing tokens when the database
is size-limited. If your Bayes database is at its size limit (say 5 MB),
and there are suddenly 100 new spam tokens to add, where do they get
added? What has to get removed to make room for them? With enough of
these "attacks", a number of more useful spam tokens may eventually be
forced out of your database, making your filter less effective.
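
The forcing-out effect can be sketched with a toy token store. The
eviction policy here (drop the least recently seen token once a token
limit is exceeded) is an assumption chosen for illustration; real
implementations have their own expiry rules, and the class and method
names are hypothetical.

```python
from collections import OrderedDict


class TokenStore:
    """Toy size-limited token store that evicts the least recently seen token."""

    def __init__(self, max_tokens):
        self.max_tokens = max_tokens
        self.tokens = OrderedDict()  # token -> spam count, oldest-seen first

    def learn_spam(self, words):
        """Record each word as seen in spam; return the tokens evicted to make room."""
        evicted = []
        for word in words:
            self.tokens[word] = self.tokens.get(word, 0) + 1
            self.tokens.move_to_end(word)  # mark as most recently seen
            while len(self.tokens) > self.max_tokens:
                old_token, _ = self.tokens.popitem(last=False)
                evicted.append(old_token)
        return evicted


store = TokenStore(max_tokens=5)
store.learn_spam(["viagra", "mortgage", "refinance", "unsubscribe", "winner"])

# One poisoned message arrives with three never-before-seen words.
evicted = store.learn_spam(["xylophone", "uganda", "perspiration"])
print("evicted:", evicted)
print("remaining:", list(store.tokens))
```

In this toy run the three poison words push out three genuinely useful
spam tokens. How much damage this does in practice depends entirely on
the size limit and on how the implementation chooses what to expire.
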
Note that "bayes poison" is not aimed at SpamAssassin in particular, but at
any content-filtering tool that uses Bayesian methods. Some
implementations are more vulnerable than others, and certainly those people
who rely *solely* on Bayesian filtering are most vulnerable. This is an
area where SpamAssassin's broad-spectrum approach really shines--the
vulnerabilities of one method are compensated for by strengths in other
methods, all within the same tool.

Pattern-based rules can easily catch "bayes poison", and perhaps in some
future version it might even be possible to prevent such items from being
learned by the Bayes database if a "bayes poison" rule was also
triggered. More generally, it might be useful to have the Bayes
database ignore certain words if they match certain pattern-based rules
(e.g. poison words hidden in HTML comments such as <!-- xylophone -->).
The pattern-based rules are straightforward to write, but having
SpamAssassin make use of these to do smarter Bayes training might be a
worthwhile idea. That way you could still train on a "poisoned" item, and
as long as one (or more) of SpamAssassin's non-Bayes rules identified a
poison pattern, the contents of that pattern-match could be omitted from
the Bayes tokenisation.
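
A minimal sketch of that idea, assuming the only poison pattern is an
HTML comment: the matched text is cut out before tokenising, so the rest
of the message can still be learned. This is not an existing SpamAssassin
feature; the pattern list, function name and tokeniser are hypothetical.

```python
import re

# Hypothetical poison patterns; here just HTML comments.
POISON_PATTERNS = [re.compile(r"<!--.*?-->", re.DOTALL)]


def tokens_for_training(body):
    """Strip text matched by poison patterns, then tokenise what is left.

    Returns (tokens, poisoned), where poisoned says whether any pattern matched.
    """
    poisoned = False
    for pattern in POISON_PATTERNS:
        body, hits = pattern.subn(" ", body)
        poisoned = poisoned or hits > 0
    tokens = re.findall(r"[a-z0-9']+", body.lower())
    return tokens, poisoned


sample = "Buy cheap pills now <!-- xylophone uganda perspiration --> click here"
tokens, poisoned = tokens_for_training(sample)
print(tokens, poisoned)
```

With this filter the message still trains "pills" and "cheap" as spam,
but "xylophone" and friends never reach the database. Poison placed in
the visible body text would need its own patterns.
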
Robert LeBlanc <rjl@renaissoft.com>
Renaissoft, Inc.
Maia Mailguard <http://www.renaissoft.com/maia/>