Mailing List Archive

optimum configuration without bayes
Hi Guys

I'm about to implement a SpamAssassin setup for nearly one hundred users.
I'll be using amavisd-new, and so can't have Bayes per user... should I
avoid Bayes altogether?

Any suggestions for an optimum configuration for an installation without
using Bayes... how successful will SpamAssassin be without Bayes...?

Any information much appreciated.


Gareth
Re: optimum configuration without bayes
At 12:20 PM 2/6/2004, Gareth wrote:
>I'm about to implement a SpamAssassin setup for nearly one hundred users.
>I'll be using amavisd-new, and so can't have Bayes per user... should I
>avoid Bayes altogether?

No. Bayes performs better when it's per-user, but that doesn't mean that
multiple users sharing a single Bayes DB doesn't work.

I don't quite understand why there's this massive misconception that you
shouldn't do multi-user bayes databases... Tons of setups do it, mine
included. MailScanner setups do this _by default_.

The difference between per-user and multi-user Bayes depends a lot on how
much your users' email differs from one another's.

If you're a business, most of your email will be related to the market your
company works in. Sure, the purchasing, accounting, engineering and
marketing folks will have different focuses in their email, but they'll
also have a lot in common, as they'll all be getting a lot of
market-specific terminology in their email.

In an "ISP"-type scenario, multi-user Bayes takes a heavier hit to its
effectiveness, but it's still generally better than not having Bayes.


>Any suggestions for an optimum configuration for an installation without
>using Bayes...

Make sure you have the network tools:
- Net::DNS Perl module (for DNSBLs)
- DCC and/or Razor

And you might want to load some of the custom rulesets like popcorn,
backhair, weeds, etc.

>how successful will SpamAssassin be without Bayes...?

It works, but gets more dependent on DNSBLs and Razor-esque systems for help.
RE: optimum configuration without bayes
> -----Original Message-----
> From: Gareth [mailto:gareth@bim7.com]
> Sent: Friday, February 06, 2004 12:20 PM
> To: spamassassin-users@incubator.apache.org
> Subject: optimum configuration without bayes
>
>
> Hi Guys
>
> I'm about to implement a SpamAssassin setup for nearly one hundred
> users. I'll be using amavisd-new, and so can't have Bayes per user...
> should I avoid Bayes altogether?
>
> Any suggestions for an optimum configuration for an installation
> without using Bayes... how successful will SpamAssassin be without
> Bayes...?
>
> Any information much appreciated.
>
>
> Gareth
>

99.999999% without bayes. Sure I may have a custom rule or two ;)

--Chris
Re: optimum configuration without bayes
From: "Matt Kettler" <mkettler@evi-inc.com>
> At 12:20 PM 2/6/2004, Gareth wrote:
> >I'm about to implement a SpamAssassin setup for nearly one hundred users.
> >I'll be using amavisd-new, and so can't have Bayes per user... should I
> >avoid Bayes altogether?
>
> No. Bayes performs better when it's per-user, but that doesn't mean that
> multiple users sharing a single Bayes DB doesn't work.

Would I still need to have sa-learn running on ham and spam, and make
employees resend any original messages to a ham and spam mailbox in the
event of a false positive / negative...?

> And you might want to load some of the custom rulesets like popcorn,
> backhair, weeds, etc.

How do these custom rules work, who writes them, why are they not included
with SpamAssassin... any 'How to' documents on setting them up?

Thanks for your help.


Gareth
RE: optimum configuration without bayes
> How do these custom rules work, who writes them, why are they
> not included with SpamAssassin... any 'How to' documents on
> setting them up?

See http://wiki.spamassassin.org/w/CustomRulesets. (Thanks, Matt!)

Setting them up is a simple matter of putting the .cf files in the same
folder as your local.cf. They are not included in SA (yet) because they
have not been run through the significant testing and scoring procedure
that standard SA rules go through. Some are limited by your business
environment, as they may end up hitting on e-mail you need to receive. My
general recommendation would be to start with backhair, weeds,
chickenpox and bigevil and then see if any of the other rulesets make sense
for you.

(How I wish "reply" would reply to the list... I must send messages to
the wrong place at least every other time I post!)

Bret
Re: optimum configuration without bayes
Hi Gareth,

Gareth wrote:
> I'm about to implement a SpamAssassin setup for nearly one hundred users.
> I'll be using amavisd-new, and so can't have Bayes per user... should I
> avoid Bayes altogether?
>
> Any suggestions for an optimum configuration for an installation without
> using Bayes... how successful will SpamAssassin be without Bayes...?

To get some idea of how good SpamAssassin could be with/without certain
features, have a look in 50_scores.cf, wherever that lives on your server.

At the top, in the comments, it lists the percentages of false positives
and negatives the developers found using the default weights for each of
the 'sets'. This is how the default weights for the rules were
calculated. This assumes you're using a threshold of 5.

From my SpamAssassin-2.61 50_scores.cf:

Set 0: (Pure SpamAssassin rules)
False positives: 0.06% (0.16% of nonspam)
False negatives: 3.87% (5.93% of spam)

Set 1: (SpamAssassin + DNS lookups etc.)
False positives: 0.07% (0.21% of nonspam)
False negatives: 3.79% (5.82% of spam)

Set 2: (SpamAssassin + Bayes)
False positives: 0.05% (0.09% of nonspam)
False negatives: 1.45% (3.13% of spam)

Set 3: (SpamAssassin + Bayes + DNS etc.)
False positives: 0.04% (0.1% of nonspam)
False negatives: 0.49% (0.92% of spam)

At my company we're using Bayes in a sitewide mode, without AWL.
SpamAssassin is working without any DNS or other external lookups, though
sendmail does reject all nonexistent domains. We've been running it for
4 weeks now and the percentages I'm seeing for e.g. the last week are:
False negatives: 0.98% (2.6% of spam)
False positives: none (that we know of).
This ignores the 34% of all incoming messages that were flat-out
rejected by sendmail: though they were hopefully all spam and other
unwanted stuff, I can't guess how the filter would have performed on those.

All in all, we seem to do better with our sitewide Bayesian filtering
than should be expected on the basis of SpamAssassin's own tests.
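
As a rough illustration of how the two figures on each line relate (a
sketch, not taken from SpamAssassin itself): the first percentage is
measured against all mail, the one in parentheses against only the
nonspam or only the spam. The message counts below are invented, chosen
so the result lands near the sitewide numbers above, and error_rates is
a hypothetical helper name.

```python
def error_rates(ham_total, spam_total, false_pos, false_neg):
    """Return FP/FN rates as percentages of all mail and of their own class."""
    total = ham_total + spam_total
    return {
        "fp_pct_of_all": 100.0 * false_pos / total,
        "fp_pct_of_ham": 100.0 * false_pos / ham_total,
        "fn_pct_of_all": 100.0 * false_neg / total,
        "fn_pct_of_spam": 100.0 * false_neg / spam_total,
    }


# Hypothetical one-week tally (invented counts, for illustration only).
rates = error_rates(ham_total=6200, spam_total=3800, false_pos=0, false_neg=98)
for name, value in rates.items():
    print(f"{name}: {value:.2f}%")
```

With these made-up counts the false negatives come out at 0.98% of all
mail and roughly 2.6% of spam, which can be set side by side with the
Set 2 line (1.45%, 3.13% of spam) from 50_scores.cf.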

Hope this helps,

Regards, Paul Boven.
Re: optimum configuration without bayes
At 02:03 PM 2/6/2004, Gareth wrote:
>From: "Matt Kettler" <mkettler@evi-inc.com>
> > At 12:20 PM 2/6/2004, Gareth wrote:
> > >I'm about to implement a SpamAssassin setup for nearly one hundred users.
> > >I'll be using amavisd-new, and so can't have Bayes per user... should I
> > >avoid Bayes altogether?
> >
> > No. Bayes performs better when it's per-user, but that doesn't mean that
> > multiple users sharing a single Bayes DB doesn't work.
>
>Would I still need to have sa-learn running on ham and spam, and make
>employees resend any original messages to a ham and spam mailbox in the
>event of a false positive / negative...?

Yes and no... You definitely need to have some sa-learning going on.
Getting direct feedback from your users is helpful but not always needed.

Usually FPs are newsletters, so I just set up a dedicated account that gets
its mail left on the server. I subscribe that account to the newsletters
that are FPing and feed that account's mailbox to sa-learn on a daily basis.
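
A minimal sketch of that daily feed, assuming the dedicated account's
mail sits in a single mbox file. The path /var/mail/newsletters and the
helper names are made up for illustration; the sa-learn options used
are --ham, --spam and --mbox.

```python
import subprocess


def sa_learn_command(mbox_path, kind):
    """Build the sa-learn command line for one mbox file."""
    if kind not in ("ham", "spam"):
        raise ValueError("kind must be 'ham' or 'spam'")
    return ["sa-learn", f"--{kind}", "--mbox", str(mbox_path)]


def learn(mbox_path, kind):
    """Run sa-learn on the mailbox; raises CalledProcessError on failure."""
    subprocess.run(sa_learn_command(mbox_path, kind), check=True)


# Show the command that a daily cron job would end up running.
# The mailbox path is a hypothetical example.
print(" ".join(sa_learn_command("/var/mail/newsletters", "ham")))
```

Calling learn("/var/mail/newsletters", "ham") from a daily cron job
would teach Bayes that the false-positive newsletters are ham. It needs
to run as the user that owns the sitewide Bayes database.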




> > And you might want to load some of the custom rulesets like popcorn,
> > backhair, weeds, etc.
>
>How do these custom rules work,


Different custom add-on rules do different things.

>who writes them,

They're all written by different SA users (e.g. I wrote antidrug.cf). Some
are very good at it, some are a bit amateur.


> why are they not included with SpamAssassin..

Some aren't included in SA just because they are "too new" and were
literally written since the last SA release that updated the ruleset.
Others are kind of "non-mainstream" and require regular updates (e.g.
bigevil, Stearns' blacklist, etc.).

Because of the GA (genetic algorithm) process, well-tuned and balanced
scoresets take a while to build, so SA doesn't update the main ruleset
rapidly (it takes about 4 weeks of computer crunching for mass-check and
the GA).

On the other hand, I can hand score a couple rules quickly and put them out
today... drawback is that my hand-scored rules might have their scores
mis-placed, and haven't been tested against the massive piles of spam in
the corpus. The users of a custom ruleset have to trust the ruleset author
to set scores well, or look at the rules and adjust them themselves.

As a detailed example, the story of antidrug:

Antidrug was recently submitted for inclusion in future versions, but right
now it's an add-on. I was working on some "pill spam" rules to try to
submit to SA when the whole pharmacourt/habeas debacle broke out. Since so
many people were being affected by a barrage of spam from pill spammers
(e.g. pharmacourt), I released the not-quite-completed ruleset to the
public, figuring many users would benefit from "early access" to these rules.





>... any 'How to' documents on setting them up?

SpamAssassin automatically parses *.cf in your /etc/mail/spamassassin
directory, not just local.cf. Download a ruleset of your choice, copy it
over, and run spamassassin --lint to make sure it's not broken.
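
The copy-then-lint step could be scripted roughly like this. It is only
a sketch: install_ruleset is a hypothetical helper, it assumes that
spamassassin --lint reports problems through a non-zero exit status (do
read its printed output as well), and removing the file again on
failure is a choice made here, not something SA does for you.

```python
import shutil
import subprocess
from pathlib import Path


def install_ruleset(cf_file, config_dir="/etc/mail/spamassassin",
                    lint_cmd=("spamassassin", "--lint")):
    """Copy a .cf ruleset next to local.cf, then lint the configuration."""
    src = Path(cf_file)
    if src.suffix != ".cf":
        raise ValueError("only *.cf files are parsed from the config directory")
    dest = Path(config_dir) / src.name
    shutil.copy(src, dest)
    # Assumption: a non-zero exit status from the lint command means trouble.
    result = subprocess.run(list(lint_cmd), capture_output=True, text=True)
    if result.returncode != 0:
        dest.unlink()  # back the new ruleset out again
        raise RuntimeError(f"lint failed:\n{result.stdout}{result.stderr}")
    return dest


# Example call (needs write access to the config directory):
#   install_ruleset("backhair.cf")
```

The lint_cmd parameter is there only so the function can be tried out
on a machine without SpamAssassin installed.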

A community site for add-on rules exists at:

http://www.exit0.us
Re: optimum configuration without bayes
> I wish I could tell you more, but I've already said too much. They grow
> close now. I must flee.
>
> --Hobnobyle of the shire.
>
>
> (OK, I was bored and it's almost time to go home!)
>

OK, Chris has lost it... who's next up to take over BigEvil?
RE: optimum configuration without bayes
> >
> >How do these custom rules work,
>

They work very well :)

>
> >who writes them,
>

There is a secret society of rule writers. They dress in all white, and sit
in a dimly lit room high above the regex-muggles. (Except for one member who
works in her spiderhole!) This group thumbs their noses at reality, favoring
the warm glow of a CRT and the arcane beauty of often unused characters like
([/\*^$])\..

They are a restless bunch who feed on each other, giving up their human
emotions for the love of "Scoring on a big one." They begin every other
sentence with, "I already googled....." Their fingers are robust and strong
from hours of pecking away at their craft. They could build a small hovel
using nothing but their fingers as tools.

Their unquenchable thirst for pattern matching is matched only by their
ability to grep. Some often choose to decipher linguistic statistical
abnormalities, rather than feed. They worship a similar, yet more powerful
group known only as...Da Devs.

We may never know their true identities. We can only listen to what the
bards may tell of them.

I wish I could tell you more, but I've already said too much. They grow
close now. I must flee.

--Hobnobyle of the shire.


(OK, I was bored and it's almost time to go home!)
Re[2]: optimum configuration without bayes
Hello Chris,

Friday, February 6, 2004, 10:29:07 AM, you wrote:

>> Any suggestions for an optimum configuration for an installation
>> without using Bayes... how successful will SpamAssassin be without
>> Bayes...?

CS> 99.999999% without bayes. Sure I may have a custom rule or two ;)

A custom rule or 2? I think you may have left a few zeroes off there.

I have fewer email users than are being discussed, but I do use Bayes
across three domains, and I feed ALL spam and MOST ham from all three
domains into all three domains.

a) It's easier for me to manage,
b) Nobody in any of the three domains is concerned about bad credit
records, the size of their [mumble], nor how to make millions through
humanitarian efforts on the African continent.

There are some conflicts -- my father actually buys ink from an
organization that /I/ consider a spammer, but since he doesn't, I
can't auto-learn those emails as spam.

Even with Bayes conflicts like that, I consider Bayes to be one of the
most powerful tools I use to fight spam. And with fewer rules than
Chris uses, I hit 99.98% of all spam.

Bob Menschel
Re: optimum configuration without bayes
On Friday 06 February 2004 11:39, Bret Miller wrote:
> (How I wish "reply" would reply to the list... I must send messages to
> the wrong place at least every other time I post!)

Any good mail reader can be set up to do this with no problem.

Douglas