Mailing List Archive

Re: Some real anti-bayes stuffing followup
After feeding it through sa-learn, it did catch example1.txt as
spam...
but now if I strip out the ad portion, it's still spam.

Let's hope I don't have too many folks writing fiction around here.

- pat
UW - madison

>>> Dan Melomedman <dan@devonit.com> 2/13/2004 12:24 PM >>>
Pat Noordsij wrote:
> I have one email that included 2 pages of text from Tom Sawyer.
>
> It didn't get caught.

There are also sentence-writing AI programs conveniently available for
spammers. Finally they found a way to foil Bayesian filters.
Congratulations.

Welp, time to find a new anti-spam mechanism. What is it this time?
Re: Some real anti-bayes stuffing followup
Pat Noordsij wrote:
> Let's hope I don't have too many folks writing fiction around here.

Do you expect spammers to always use fiction pieces for filter foiling?
It will become a serious nuisance eventually.
RE: Some real anti-bayes stuffing followup
> Pat Noordsij wrote:
> > Let's hope I don't have too many folks writing fiction around here.
>
> Do you expect spammers to always use fiction pieces for
> filter foiling?
> It will become a serious nuisance eventually.


I have seen this quite a bit. It's not always Tom Sawyer, but
a lot of classic fiction. I go ahead and train it.

I also use bogofilter and I wonder if it will be more accurate long term.
SpamAssassin's Bayes filter only allows a message (token?) to be read once,
whereas bogofilter allows it to be read multiple times. I may be showing my
ignorance, but if a token can be read multiple times, won't that allow Bayes
to work through this type of poison as long as it's continually trained?

Jason
RE: Some real anti-bayes stuffing followup
On Fri, 13 Feb 2004, Jason Crowe wrote:

> I also use bogofilter and I wonder if it will be more accurate long term.
> SpamAssassin's Bayes filter only allows a message (token?) to be read once,

That's not quite true. SA only allows the same *instance* of a message to
be learned once -- where an instance is (last I checked) determined by the
message-id. So if you get the "same" spam seven times with a different
message-id each time, SA will learn it every time.
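
A minimal sketch of the learn-once-per-instance behaviour described above. This is not SpamAssassin's code: the `LearnOnceStore` class, the whitespace tokenizer, and the `make_message` helper are all hypothetical, and keying on the Message-ID header is taken only from the "last I checked" description in this message.

```python
from collections import Counter
from email import message_from_string


class LearnOnceStore:
    """Toy token store that refuses to relearn an already-seen Message-ID."""

    def __init__(self):
        self.seen_ids = set()          # Message-IDs that were already learned
        self.spam_tokens = Counter()   # token -> number of spam messages learned

    def learn_spam(self, raw_message):
        """Learn a message as spam; return False if this instance was seen."""
        msg = message_from_string(raw_message)
        msg_id = (msg.get("Message-ID") or "").strip()
        if msg_id and msg_id in self.seen_ids:
            return False               # same instance: skipped
        if msg_id:
            self.seen_ids.add(msg_id)
        body = msg.get_payload()       # plain string for non-multipart mail
        self.spam_tokens.update(body.lower().split())
        return True


def make_message(msg_id, body):
    """Hypothetical helper that builds a tiny single-part message."""
    return f"Message-ID: {msg_id}\nSubject: test\n\n{body}\n"


store = LearnOnceStore()
store.learn_spam(make_message("<a@example.com>", "cheap pills"))  # learned
store.learn_spam(make_message("<a@example.com>", "cheap pills"))  # skipped
store.learn_spam(make_message("<b@example.com>", "cheap pills"))  # learned again
```

In this sketch the "same" spam under a new Message-ID is counted a second time, which is the point made above about seven copies with seven different IDs.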

(I hope the use of message-id for this goes by the wayside soon, before
spammers get the bright idea to steal old message-id headers from nonspam
usenet or list archives and insert them into newly generated spam.)

> if a token can be read multiple times, won't that allow Bayes to work
> through this type of poison as long as it's continually trained?

Yes.

Also note that these sorts of attacks are mostly effective against systems
that use a ratio of spammy tokens to total tokens in the given message;
that is, where never-before-seen tokens can reduce the spamminess. As I
understand it, SA does not work that way -- a new token is not given any
weight one way or the other during SA's classification. (Someone will
doubtless correct me on that.) Tokens seen in approximately equal amounts
in both ham and spam tend to be ignored by the classifier as well -- so
unless the poisoner manages to include tokens that previously appeared
mostly in ham, the poison simply vanishes into the noise.
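
To make the argument above concrete, here is a rough sketch of a classifier front end that gives no weight to never-before-seen tokens and drops tokens seen about equally in ham and spam. It is an illustration only, not SpamAssassin's actual algorithm: the function names, the simple frequency ratio, and the `min_distance` cutoff of 0.1 are all assumptions.

```python
def token_spam_probability(spam_count, ham_count, n_spam, n_ham):
    """Estimate P(spam | token); return None for a never-before-seen token."""
    if spam_count + ham_count == 0:
        return None                      # unseen: no weight either way
    spam_freq = spam_count / n_spam      # fraction of spam containing the token
    ham_freq = ham_count / n_ham         # fraction of ham containing the token
    return spam_freq / (spam_freq + ham_freq)


def informative_tokens(tokens, counts, n_spam, n_ham, min_distance=0.1):
    """Keep only tokens whose probability is far enough from neutral (0.5)."""
    kept = {}
    for tok in set(tokens):
        spam_count, ham_count = counts.get(tok, (0, 0))
        p = token_spam_probability(spam_count, ham_count, n_spam, n_ham)
        if p is None:
            continue                     # new token: ignored
        if abs(p - 0.5) < min_distance:
            continue                     # equally hammy and spammy: noise
        kept[tok] = p
    return kept


# Hypothetical training counts: token -> (spam messages, ham messages)
counts = {"viagra": (90, 1), "the": (500, 500), "meeting": (2, 80)}
message = ["viagra", "the", "huckleberry", "meeting"]
kept = informative_tokens(message, counts, n_spam=1000, n_ham=1000)
```

Here "huckleberry" (unseen) and "the" (neutral) drop out, so padding made of such words changes nothing. Only a token like "meeting", which previously appeared mostly in ham, pulls the score down, which is the one case the paragraph above singles out.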
RE: Some real anti-bayes stuffing followup
Jason Crowe said:
> I also use bogofilter and I wonder if it will be more accurate long term.
> SpamAssassin's Bayes filter only allows a message (token?) to be read once,
> whereas bogofilter allows it to be read multiple times. I may be showing my
> ignorance, but if a token can be read multiple times, won't that allow
> Bayes to work through this type of poison as long as it's continually
> trained?

My friend uses bogofilter+CRM114+SA and he gets better spam-filtering
results with the combination.

According to http://bugzilla.spamassassin.org
(keyword search bogofilter)

CRM114 and bogofilter will be included natively in 2.70, but they are
currently only available as diffs against 2.55.
I'm trying to convince my buddy to code them up for 2.63...

--
Luke Computer Science System Administrator
Security Administrator,College of Engineering
Montana State University-Bozeman,Montana
Re: Some real anti-bayes stuffing followup
Lucas Albers writes:
> Jason Crowe said:
> > I also use bogofilter and I wonder if it will be more accurate long term.
> > SpamAssassin's Bayes filter only allows a message (token?) to be read once,
> > whereas bogofilter allows it to be read multiple times. I may be showing my
> > ignorance, but if a token can be read multiple times, won't that allow
> > Bayes to work through this type of poison as long as it's continually
> > trained?
>
> My friend uses bogofilter+CRM114+SA and he gets better spam-filtering
> results with the combination.
>
> According to http://bugzilla.spamassassin.org
> (keyword search bogofilter)
>
> CRM114 and bogofilter will be included natively in 2.70, but they are
> currently only available as diffs against 2.55.
> I'm trying to convince my buddy to code them up for 2.63...

They won't be included natively -- however 2.70 will include support
for plugins. It would be easy enough to write a plugin to support
them that way.

--j.
Re: Some real anti-bayes stuffing followup
Bart Schaefer <schaefer@zanshin.com> wrote, responding to Robert
Menschel's proposal for catching duplicate message IDs:

> Just two points before I go to bed:
>
> (1) Isn't this effectively what DCC, Razor, Pyzor, etc.
> already do?

How is that? We're talking about different messages that have
the same message ID.

> (2) Isn't most of this data already in the Bayes database,
> just being used differently?

It's true that the message IDs of learned messages are in the
Bayes DB, so it should be possible to use that to catch the
duplicates. I agree with Jon that it's probably not worth the
trouble, though. I have seen spammers occasionally reuse
message IDs, but it doesn't really give them much benefit, so
it's not widespread.
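
For illustration, a rough sketch of what catching a reused message ID could look like. It assumes a body digest is stored next to each learned message ID, which is an assumption made for this example and not a description of the actual Bayes DB layout; `MessageIdRegistry` and its return values are hypothetical names.

```python
import hashlib


class MessageIdRegistry:
    """Toy registry that maps each learned message ID to a body digest."""

    def __init__(self):
        self._digests = {}   # message ID -> SHA-1 hex digest of the body

    @staticmethod
    def _digest(body):
        return hashlib.sha1(body.encode("utf-8")).hexdigest()

    def check(self, msg_id, body):
        """Classify a message as 'new', 'duplicate', or 'reused-id'."""
        digest = self._digest(body)
        known = self._digests.get(msg_id)
        if known is None:
            self._digests[msg_id] = digest
            return "new"
        # Same ID and same body is a true duplicate; same ID with a
        # different body is a different message borrowing an old ID.
        return "duplicate" if known == digest else "reused-id"


registry = MessageIdRegistry()
registry.check("<a@example.com>", "lunch at noon?")      # "new"
registry.check("<a@example.com>", "lunch at noon?")      # "duplicate"
registry.check("<a@example.com>", "cheap pills inside")  # "reused-id"
```

Unlike DCC, Razor, or Pyzor, which look for the same content arriving many times, this only flags different content arriving under an ID that has already been learned.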

--
Keith C. Ivey <kcivey@cpcug.org>
Washington, DC