Mailing List Archive

Re[2]: Some real anti-bayes stuffing followup
Hello Bart, Devs,

Friday, February 13, 2004, 12:33:27 PM, you wrote, concerning Bayes:

BS> (I hope the use of message-id for this goes by the wayside soon,
BS> before spammers get the bright idea to steal old message-id headers
BS> from nonspam usenet or list archives and insert them into newly
BS> generated spam.)

Actually, a new spam-detecting mechanism could be to look for duplicate
message ids. I've received multiple spams all using the same message id.

a) If a ham is sent to my domain with four recipients here, then because
of the way I run SA, I could process that email four times, once for each
mailbox. That's expected. And it's expected that each of those emails
will have identical bodies, and identical subjects.

b) In a given day I can receive similar spams with identical message ids
but with different subject headers (usually random words or letters added
to a subject) and/or different bodies (sometimes minor random differences,
sometimes very different messages).

c) I can receive a spam with a given message id on Jan 2, and then spam
(similar or not) with the identical message id on Jan 14, Jan 30, Feb 12,
etc.

I suggest that if we could store a record with three or four fields,
message-id, checksum(subject), checksum(body), and maybe time(firstseen),
we could use this as a database, and apply a rule (maybe named
DUPLICATE_MESSAGEID) where either (1) checksums don't match, or (2)
time(now) is significantly different from time(firstseen).
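
As a rough sketch of the record and rule described above (Python is used
purely for illustration; the in-memory dict, the SHA-1 checksum, the
one-day threshold, and all names other than DUPLICATE_MESSAGEID are
assumptions, not anything SA implements):

```python
import hashlib
import time

# Hypothetical threshold: how long one Message-ID may plausibly keep arriving.
MAX_AGE_SECONDS = 24 * 60 * 60


def checksum(text):
    """Return a hex digest of the text (SHA-1 chosen arbitrarily)."""
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


class MessageIdStore:
    """In-memory stand-in for the proposed database, keyed by Message-ID."""

    def __init__(self):
        # message-id -> (subject checksum, body checksum, time first seen)
        self._seen = {}

    def is_duplicate_messageid(self, msgid, subject, body, now=None):
        """Return True when the proposed DUPLICATE_MESSAGEID rule would fire."""
        now = time.time() if now is None else now
        sums = (checksum(subject), checksum(body))
        if msgid not in self._seen:
            # First sighting: record it and do not fire.
            self._seen[msgid] = sums + (now,)
            return False
        subject_sum, body_sum, first_seen = self._seen[msgid]
        if sums != (subject_sum, body_sum):
            return True  # (1) same id, different subject or body
        return now - first_seen > MAX_AGE_SECONDS  # (2) same id, much later
```

Case (a), the same ham scanned once per recipient, would not fire, since
the checksums match and the copies arrive close together. Cases (b) and
(c) would fire on the second and later sightings.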

Does this seem like a worthwhile approach?

Bob Menschel
Re: Re[2]: Some real anti-bayes stuffing followup
On Fri, 2004-02-13 at 20:59, Robert Menschel wrote:
> I suggest that if we could store a record with three or four fields,
> message-id, checksum(subject), checksum(body), and maybe time(firstseen),
> we could use this as a database, and apply a rule (maybe named
> DUPLICATE_MESSAGEID) where either (1) checksums don't match, or (2)
> time(now) is significantly different from time(firstseen).
>
> Does this seem like a worthwhile approach?
>

IANAD (I am not a developer), but I don't think this is a worthwhile
approach, for two related reasons:

* it costs us (the mail admins) too much
* it costs spammers too little

We would need to go through the effort of implementing this in code,
then setting aside resources (disk and CPU) to checksum and record these
attributes of incoming messages.

In response, spammers would only need to insert a %RND_MSG_ID to render
all our efforts useless.

- Jon

--
jon@tgpsolutions.com

Administrator, tgpsolutions
http://www.tgpsolutions.com
Re[4]: Some real anti-bayes stuffing followup
Hello Jon,

Friday, February 13, 2004, 9:11:41 PM, you wrote:

J> On Fri, 2004-02-13 at 20:59, Robert Menschel wrote:
>> I suggest that if we could store a record with three or four fields,
>> message-id, checksum(subject), checksum(body), and maybe time(firstseen),
>> we could use this as a database, and apply a rule (maybe named
>> DUPLICATE_MESSAGEID) where either (1) checksums don't match, or (2)
>> time(now) is significantly different from time(firstseen).
>>
>> Does this seem like a worthwhile approach?

J> IANAD (I am not a developer), but I don't think this is a worthwhile
J> approach, for two related reasons:

J> * it costs us (the mail admins) too much
J> * it costs spammers too little

J> We would need to go through the effort of implementing this in code,
J> then setting aside resources (disk and CPU) to checksum and record these
J> attributes of incoming messages.

I see this resource requirement as being minimal -- a small fraction of
what we do currently with Bayes.

J> In response, spammers would only need to insert a %RND_MSG_ID to
J> render all our efforts useless.

It'd be easier to simply have their spam-mail programs generate normal,
unique message ids...

Bob Menschel
Re[2]: Some real anti-bayes stuffing followup
On Fri, 13 Feb 2004, Robert Menschel wrote:

> Hello Bart, Devs,
>
> Friday, February 13, 2004, 12:33:27 PM, you wrote, concerning Bayes:
>
> BS> (I hope the use of message-id for this goes by the wayside soon,
> BS> before spammers get the bright idea to steal old message-id headers
> BS> from nonspam usenet or list archives and insert them into newly
> BS> generated spam.)
>
> Actually, a new spam-detecting mechanism could be to look for duplicate
> message ids. I've received multiple spams all using the same message id.

Silly question: how does Bayes deal with a message that has -no-
Message-ID? Unlike NNTP, SMTP does not require a Message-ID, just
recommends one.

I see many messages a day that come into our mail server that totally lack
a Message-ID (I use that as a spam-sign and assign a value of 1.5 to it ;).
My sendmail daemon synthesizes a Message-ID before delivery but it isn't
there during the filtering process.
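
For illustration only, that local check amounts to something like the
following Python sketch. The function and constant names are made up
here, and this is not how the rule is actually written in SA; only the
1.5 score comes from my setup.

```python
from email import message_from_string

# Score assigned locally to mail that arrives without a Message-ID.
MISSING_MSGID_SCORE = 1.5


def missing_msgid_score(raw_message):
    """Return the spam-sign score for a message lacking a Message-ID."""
    msg = message_from_string(raw_message)
    msgid = msg.get("Message-ID")  # header lookup is case-insensitive
    if msgid is None or not msgid.strip():
        return MISSING_MSGID_SCORE
    return 0.0
```

The check has to run on the message as received, before the MTA
synthesizes its own Message-ID.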


--
Dave Funk University of Iowa
<dbfunk (at) engineering.uiowa.edu> College of Engineering
319/335-5751 FAX: 319/384-0549 1256 Seamans Center
Sys_admin/Postmaster/cell_admin Iowa City, IA 52242-1527
#include <std_disclaimer.h>
Better is not better, 'standard' is better. B{
Re: Some real anti-bayes stuffing followup
On Fri, 13 Feb 2004, Robert Menschel wrote:

> I suggest that if we could store a record with three or four fields,
> message-id, checksum(subject), checksum(body), and maybe
> time(firstseen), we could use this as a database, and apply a rule
> (maybe named DUPLICATE_MESSAGEID) where either (1) checksums don't
> match, or (2) time(now) is significantly different from time(firstseen).

On Fri, 13 Feb 2004, Jon wrote:

> IANAD (I am not a developer), but I don't think this is a worthwhile
> approach, for two related reasons:
>
> * it costs us (the mail admins) too much
> * it costs spammers too little


Just two points before I go to bed:

(1) Isn't this effectively what DCC, Razor, Pyzor, etc. already do?

(2) Isn't most of this data already in the Bayes database, just being
used differently?
Re: Re[2]: Some real anti-bayes stuffing followup
David B Funk <dbfunk@engineering.uiowa.edu> wrote:

> Silly question: how does Bayes deal with a message that has -no-
> Message-ID? Unlike NNTP, SMTP does not require a Message-ID, just
> recommends one.

If there is no message ID, SA uses a hash of the message text
followed by '@sa_generated'. Unfortunately that means if the
message is modified at a later stage before delivery it won't
be possible to correct mislearning (of course, relearning a
modified message doesn't work completely right even if there is
a message ID).
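
A minimal Python sketch of that fallback, for illustration only: SA's
real code is Perl, and the choice of SHA-1 over the whole raw message is
an assumption here, as is the function name. Only the '@sa_generated'
suffix is taken from the description above.

```python
import hashlib
from email import message_from_string


def bayes_msgid(raw_message):
    """Return the Message-ID, or a synthesized one if the header is absent."""
    msg = message_from_string(raw_message)
    msgid = msg.get("Message-ID")
    if msgid and msgid.strip():
        return msgid.strip()
    # No usable header: derive an identifier from the message text itself.
    digest = hashlib.sha1(raw_message.encode("utf-8")).hexdigest()
    return digest + "@sa_generated"
```

Because the synthesized id depends on the text, any later modification
of the message yields a different id, which is why the earlier learning
can no longer be located and corrected.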

--
Keith C. Ivey <kcivey@cpcug.org>
Washington, DC
Re: Re[2]: Some real anti-bayes stuffing followup
On Sat, 14 Feb 2004, Keith C. Ivey <kcivey@cpcug.org> wrote:

> David B Funk <dbfunk@engineering.uiowa.edu> wrote:
>
> > Silly question: how does Bayes deal with a message that has -no-
> > Message-ID? Unlike NNTP, SMTP does not require a Message-ID, just
> > recommends one.
>
> If there is no message ID, SA uses a hash of the message text
> followed by '@sa_generated'.

Note that many MTAs will add a Message-ID on the way through if the
message didn't have one already. SA, in turn, uses that as useful
intelligence; search for MSGID_FROM_MTA_ in the *.cf rules distributed
with SA.
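
If grep isn't handy, a small Python sketch along these lines would list
the matching rule names. The function name is made up, and the rules
directory is left as a parameter because its location varies by
installation.

```python
import re
from pathlib import Path

# Any rule name beginning with the MSGID_FROM_MTA_ prefix.
RULE_PATTERN = re.compile(r"\bMSGID_FROM_MTA_\w*")


def find_msgid_mta_rules(rules_dir):
    """Return the sorted, distinct MSGID_FROM_MTA_* names found in *.cf files."""
    names = set()
    for cf_file in Path(rules_dir).glob("*.cf"):
        text = cf_file.read_text(encoding="utf-8", errors="replace")
        names.update(RULE_PATTERN.findall(text))
    return sorted(names)
```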

--
Brent J. Nordquist <b-nordquist@bethel.edu> N0BJN
Other contact information: http://kepler.acns.bethel.edu/~bjn/contact.html
* Fast pipe * Always on * Get out of the way - Tim Bray http://tinyurl.com/7sti
Re: Re[4]: Some real anti-bayes stuffing followup

Robert Menschel writes:
> Hello Jon,
>
> Friday, February 13, 2004, 9:11:41 PM, you wrote:
>
> J> On Fri, 2004-02-13 at 20:59, Robert Menschel wrote:
> >> I suggest that if we could store a record with three or four fields,
> >> message-id, checksum(subject), checksum(body), and maybe time(firstseen),
> >> we could use this as a database, and apply a rule (maybe named
> >> DUPLICATE_MESSAGEID) where either (1) checksums don't match, or (2)
> >> time(now) is significantly different from time(firstseen).
> >>
> >> Does this seem like a worthwhile approach?
>
> J> IANAD (I am not a developer), but I don't think this is a worthwhile
> J> approach, for two related reasons:
>
> J> * it costs us (the mail admins) too much
> J> * it costs spammers too little
>
> J> We would need to go through the effort of implementing this in code,
> J> then setting aside resources (disk and CPU) to checksum and record these
> J> attributes of incoming messages.
>
> I see this resource requirement as being minimal -- a small fraction of
> what we do currently with Bayes.
>
> J> In response, spammers would only need to insert a %RND_MSG_ID to
> J> render all our efforts useless.
>
> It'd be easier to simply have their spam-mail programs generate normal,
> unique message ids...

That's what a real message-ID *is* anyway. The reason they don't do
it is that we can then use those patterns as spam signs.

--j.