Mailing List Archive: Bayes & Ham

Feb 2, 2004, 2:35 PM

Post #1 of 3

This might sound more than a little stupid, but...

I am looking into implementing Bayes filtering and have stockpiled a TON
of Spam to train with. Where are you getting an equal amount of Ham to
train with? I administer an email domain, but only have access to my
own mail (ethically). What are your suggestions on rounding up 1000 or
so Ham messages from my users so that it is not too intrusive or
annoying for the user?

Any suggestions would be great!!!

Mike

Re: Bayes & Ham

mkettler at evi-inc

Feb 2, 2004, 3:02 PM

Post #2 of 3

FWIW I use a combination of two sources for HAM training:

1) some selected chunks of my own email (i.e., mailing lists not involving
SA, personal email, etc.)

2) I set up a "nonspamtrap" account and subscribed it to a few of
the newsletters my users commonly subscribe to.

Note that an equal amount of spam and ham isn't required, and it isn't
optimal either, so don't kill yourself trying to make the numbers match
exactly. Just avoid a huge imbalance. Optimal would be the same spam/ham
ratio in your training set that your server sees in reality.
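
If you want to trim a lopsided corpus toward the ratio your server actually
sees, something like the following would do it. This is only a rough sketch:
the function name match_ratio, the spam_fraction parameter, and the 0.4 in the
usage note are made up for illustration, and none of this is part of
SpamAssassin.

```python
import random

def match_ratio(spam, ham, spam_fraction, seed=0):
    """Subsample so that spam makes up about spam_fraction of the result.

    spam and ham are lists of message identifiers (file names, for example).
    spam_fraction is a hypothetical parameter: the share of spam you believe
    your server sees in real traffic, strictly between 0 and 1.
    """
    if not 0 < spam_fraction < 1:
        raise ValueError("spam_fraction must be strictly between 0 and 1")
    rng = random.Random(seed)
    # Largest total we can build at this ratio without exhausting a class.
    total = min(len(spam) / spam_fraction, len(ham) / (1 - spam_fraction))
    n_spam = min(len(spam), round(total * spam_fraction))
    n_ham = min(len(ham), round(total * (1 - spam_fraction)))
    # Random subsets, so the training set is not biased toward older mail.
    return rng.sample(spam, n_spam), rng.sample(ham, n_ham)
```

For example, with 5000 spam, 1000 ham, and spam_fraction=0.4, this keeps all
of the ham and a random subset of the spam. A rough match is enough; the
point is only to avoid a huge imbalance.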

At 04:35 PM 2/2/2004, Mike Samba wrote:
>This might sound more than a little stupid, but...
>
>I am looking into implementing Bayes filtering and have stockpiled a TON
>of Spam to train with. Where are you getting an equal amount of Ham to
>train with? I administer an email domain, but only have access to my own
>mail (ethically). What are your suggestions on rounding up 1000 or so Ham
>messages from my users so that it is not too intrusive or annoying for the
>user?
>
>Any suggestions would be great!!!
>
>Mike

Re: Bayes & Ham

spamassassin at sasknow

Feb 2, 2004, 3:25 PM

Post #3 of 3

Matt Kettler wrote to Mike Samba and spamassassin-users@incubator.apache.org:

> FWIW I use a combination of two sources for HAM training:
>
> 1) some selected chunks of my own email (i.e., mailing lists not
> involving SA, personal email, etc.)
>
> 2) I set up a "nonspamtrap" account and subscribed it to a few
> of the newsletters my users commonly subscribe to.

Good sources. We provide "spam" and "nonspam" accounts so that our more
proactive clients can forward spam and ham, particularly messages that
were incorrectly classified. As long as they're instructed to forward
such messages as attachments, the original messages come through
intact.
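
For what it's worth, pulling the originals back out of such a forward can be
done with Python's standard email package. The sketch below assumes the mail
client attaches the original as a message/rfc822 part, which is the usual
result of "forward as attachment" but worth checking for your clients. The
names extract_forwarded and make_forward are invented for the example, and
make_forward exists only to build a sample to try it on.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage

def extract_forwarded(raw_bytes):
    """Return the raw bytes of every message attached as message/rfc822."""
    outer = message_from_bytes(raw_bytes, policy=policy.default)
    originals = []
    for part in outer.walk():
        if part.get_content_type() == "message/rfc822":
            # The payload of a message/rfc822 part is a one-item list
            # holding the attached message, headers included.
            inner = part.get_payload(0)
            originals.append(inner.as_bytes())
    return originals

def make_forward(original):
    """Hypothetical helper: wrap a message the way a client might."""
    outer = EmailMessage()
    outer["Subject"] = "Fwd: missed spam"
    outer.set_content("The filter missed this one.")
    outer.add_attachment(original)  # attached as message/rfc822
    return outer.as_bytes()
```

Each item returned could then be written to a file or mailbox and handed to
whatever does the training.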

I'm fortunate enough to personally own a domain that is now very close
in spelling (same name, different TLD) to a domain used by a large ISP
in our region. After seeing the postmaster logs on our email server, I
set up an account to catch all of the incoming email on my domain. There
are enough mistypes that I get several hundred messages per day for
different recipients, including ham, spam, and viruses. It's the closest
thing to broadly varied user email that we can get without violating our
own privacy policy.

I have a staff member (otherwise known as our Resident SpamQueen) go
through that, as well as our shared email boxes (sales, support, etc),
and train the filter. She has no problem finding 1000+ SPAM and HAM
weekly. It's done wonders for our filtering.

If we didn't have such a good source of email, I guess I'd ask a small
percentage of our customers to *voluntarily* allow us to use their
accounts to train the filter... at which point we could just have the
server FCC all of their messages to another shared mailbox on our system
for our bodacious SpamQueen to traverse. That's trivial to implement on
most systems.

Yes, filtering can be configured on a per-user basis, but we chose to
make it as simple for our clients (and as simple for us) as possible,
and go site-wide. So, the filtering may not be quite as precise, but at
least *we* control the QoS, and we err on the side of caution.

It's worked remarkably well. We've been sustaining about 95% correctly
filtered, with no false positives. Server-wide, our HAM:SPAM ratio is
about 1.5:1. With many personal accounts, though, it's more like 1:15
(90-95% SPAM), after viruses are taken out of the equation (but that's
another tangent). We'd be sunk without SpamAssassin.
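
To put those ratios on the same footing as the percentages, a HAM:SPAM ratio
converts to a spam share like this. It is just a worked bit of arithmetic;
the function name spam_share is made up for the example.

```python
def spam_share(ham, spam):
    """Fraction of all mail that is spam, given a HAM:SPAM ratio."""
    return spam / (ham + spam)

# Server-wide ratio of 1.5:1 (HAM:SPAM): 1 / 2.5 of the mail is spam.
server_wide = spam_share(1.5, 1)

# Personal-account ratio of 1:15 (HAM:SPAM): 15 / 16 of the mail is spam.
personal = spam_share(1, 15)
```

That works out to 40% spam server-wide and 93.75% for the 1:15 case, which
sits inside the 90-95% range mentioned above.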

- Ryan

--
Ryan Thompson <ryan@sasknow.com>

SaskNow Technologies - http://www.sasknow.com
901-1st Avenue North - Saskatoon, SK - S7K 1Y4

Tel: 306-664-3600 Fax: 306-244-7037 Saskatoon
Toll-Free: 877-727-5669 (877-SASKNOW) North America