Mailing List Archive

SpamAssassin Scalability issues in Enterprise environments......
I've been asked to design a fairly large email gatewaying system, for
which antispam measures are required. I'd like to hear from others
about any issues, notes, etc. regarding scaling SpamAssassin to this sort
of level... hardware requirements, kernel tuning (I'm using FreeBSD),
good and bad issues, perhaps even commercial alternatives.

Thanks.
Re: SpamAssassin Scalability issues in Enterprise environments......
Forrest

There are a few of us using MailScanner (www.mailscanner.info), which glues
SA and various anti-virus tools together, on FreeBSD with fairly large
throughputs: 200,000 messages a day has been quoted.

--
Martin Hepworth
Snr Systems Administrator
Solid State Logic
Tel: +44 (0)1865 842300


**********************************************************************

This email and any files transmitted with it are confidential and
intended solely for the use of the individual or entity to whom they
are addressed. If you have received this email in error please notify
the system manager.

This footnote confirms that this email message has been swept
for the presence of computer viruses and is believed to be clean.

**********************************************************************
Re: SpamAssassin Scalability issues in Enterprise environments......
I should have specified that the gateway system will eventually be
pushing hundreds of thousands to millions of messages.




Re: SpamAssassin Scalability issues in Enterprise environments......
> I've been asked to design a fairly large email gatewaying system, for
> which antispam measures are required.

How many mailboxes are we talking about, and what type of growth?
Re: SpamAssassin Scalability issues in Enterprise environments......
The expected growth is from hundreds of thousands of users to perhaps
millions, and each user would have about 8 mailboxes, with
separate username/password combinations. Most likely using Qmail as
the MTA.

It would seem that, to accomplish this with SA, you'd need to deploy yet
more boxes to spread the load (layer switching) of SA queries, at the
very least - but how to share the DB?


Forrest


Re: SpamAssassin Scalability issues in Enterprise environments......
Forrest

It depends on which machines update the Bayes DB. From what I understand,
the current idea is to NFS-mount the Bayes DBs read-only so that only one
system updates them (the one where they are on a local filesystem).

The problem seems to be with file locking on writes.

--
Martin Hepworth
Snr Systems Administrator
Solid State Logic
Tel: +44 (0)1865 842300


Re: SpamAssassin Scalability issues in Enterprise environments......
On Wednesday 03 March 2004 10:29 am, Forrest Aldrich wrote:
> It would seem that, to accomplish this with SA, you'd need to deploy yet
> more boxes to spread the load (layer switching) of SA queries, at the
> very least - but how to share the DB?

Load sharing with SA is trivial: simply CNAME the spamd systems and let DNS
round-robin and spamc failover take care of the rest.
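
A minimal Python sketch of the round-robin-plus-failover behaviour described
above. It is not spamc's own implementation, only an illustration of the idea:
resolve one name to several addresses, then try each address in turn until one
accepts the connection. The hostname is hypothetical, and port 783 is assumed
because it is spamd's usual default.

```python
import socket


def resolve_all(hostname, port):
    """Return every distinct IPv4 address that DNS publishes for hostname."""
    infos = socket.getaddrinfo(hostname, port, socket.AF_INET, socket.SOCK_STREAM)
    addresses = []
    for info in infos:
        addr = info[4][0]  # sockaddr is (address, port) for AF_INET
        if addr not in addresses:
            addresses.append(addr)
    return addresses


def connect_with_failover(addresses, port, start=0,
                          connect=socket.create_connection, timeout=5.0):
    """Try each address once, beginning at index `start`.

    Returns (address, connection) for the first backend that answers and
    raises ConnectionError if none of them does.
    """
    last_error = None
    count = len(addresses)
    for offset in range(count):
        addr = addresses[(start + offset) % count]
        try:
            return addr, connect((addr, port), timeout)
        except OSError as exc:  # refused, timed out, unreachable, ...
            last_error = exc
    raise ConnectionError(f"no spamd backend reachable: {last_error}")


if __name__ == "__main__":
    # "spamd.example.com" is a placeholder name, not a real service.
    backends = resolve_all("spamd.example.com", 783)
    print(connect_with_failover(backends, 783)[0])
```

Rotating `start` between calls spreads the load; the loop over the remaining
addresses provides the failover.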

As for sharing the DB, simply put: you don't. For high throughput you'll have to
disable auto-learning and do manual training daily/weekly/whatever, which
keeps the Bayes DBs "in sync" as a side effect since auto-learn is turned off.
A caching DNS server on the spamd systems is a must as well if you want to do
RBL checks.
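
To illustrate why caching matters for RBL checks: each check is a DNS query
for the sender's IP address with its octets reversed, prefixed to the list's
zone, and the same addresses come up again and again. The sketch below builds
that query name and puts a small in-process cache in front of a resolver
function. The in-process cache is only a stand-in for the caching DNS server
recommended above, and the zone name is hypothetical.

```python
import ipaddress
from functools import lru_cache


def rbl_query_name(ip, zone):
    """Build the DNS name queried for an IPv4 address on a blocklist zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def cached_lookup(resolve, maxsize=4096):
    """Wrap a resolver function so repeated (ip, zone) pairs hit a cache."""
    @lru_cache(maxsize=maxsize)
    def lookup(ip, zone):
        return resolve(rbl_query_name(ip, zone))
    return lookup


if __name__ == "__main__":
    # "rbl.example.org" is a placeholder zone.
    print(rbl_query_name("192.0.2.10", "rbl.example.org"))
```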

If you're going to do virus scanning, I highly suggest using whatever the *BSD
equivalent of Linux's /dev/shm (tmpfs) auto-resizing ramdisk filesystem is as
the work area.
Re: SpamAssassin Scalability issues in Enterprise environments......
Martin Hepworth writes:
> Forest
>
> depends on which machines update the bayes DB. From what I understand
> the current idea is to nfs mount the bayes DB's readonly so only one
> system updates it (the one where it's on a local filesystem).
>
> The problem seems to be with file locking on writes.

The problem is more with file writes in general. Locking is implemented
in an NFS-safe way, but DB writes over NFS are *extremely* slow
and heavy on the network.

I'd even suggest keeping one machine as the "master" for the Bayes DB,
doing sa-learn runs there by hand, and copying the DBs out to the "slave"
servers which do the scanning. Each "slave" has a copy of the Bayes DB on
its local disk, which is periodically overwritten by the master's copy.

--j.
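
A minimal sketch of the "overwrite the local copy" step on a scanning server,
assuming the master's database file has already been transferred to the
machine by some other means (the transfer itself is not shown). The file is
first copied next to the live database and then renamed over it, so a scanner
never sees a half-written file. All paths are hypothetical; if the database
consists of several files, each would need the same treatment.

```python
import os
import shutil
import tempfile


def install_db_copy(master_copy, live_path):
    """Replace live_path with the contents of master_copy in one rename."""
    live_dir = os.path.dirname(os.path.abspath(live_path))
    # The temporary file must be on the same filesystem as the live file,
    # otherwise the final rename is not atomic.
    fd, tmp_path = tempfile.mkstemp(dir=live_dir, prefix=".bayes-incoming-")
    os.close(fd)
    try:
        shutil.copyfile(master_copy, tmp_path)
        os.replace(tmp_path, live_path)  # atomic on POSIX filesystems
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Run from cron on each scanning server, this gives the periodic overwrite
without any writes over NFS.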
Re: SpamAssassin Scalability issues in Enterprise environments......
On Wed, 3 Mar 2004, Forrest Aldrich wrote:

> I've been asked to design a fairly large email gatewaying system, for
> which antispam measures are required. I'd like to hear from others
> about any issues notes etc regarding scaling SpamAssassin to this sort
> of level... hardware requirements, kernel tuning (I'm using FreeBSD),
> good and bad issues, perhaps even commercial alternatives.

We move about 2 million emails inbound/month, and 1 million
outbound/month. Our MTAs are 3 linux boxes running sendmail, and pass
emails into our internal mail system which stores user mailboxes (about
15000). Each MTA runs MIMEDefang and SpamAssassin. Each box does a
site-wide Bayes. Our email is evenly distributed across the 3 MTAs so
they don't really need to share the Bayes dbs, and we've never seen a
problem with each MTA having its own corpus. Bayes works REALLY well for
us, and has virtually eliminated our false positives and the need to
whitelist. Each MTA also runs the SpamAssassin network tests.

We do NO scanning of outbound mail. Virus scanning is handled by the
internal mail system, as we have not yet needed to do virus scanning on
the Internet MTAs.

I must really stress that it is MIMEDefang that allows us to process this
kind of load. I tried one other milter and it crashed constantly. MD is
rock solid, because it codes around the non-solidity of perl. ;-)


Matt

--
Matthew S. Cramer <mscramer@armstrong.com> Office: 717-396-5032
Infrastructure Security Analyst Fax: 717-396-5590
Armstrong World Industries, Inc. Cell: 717-917-7099
Re: SpamAssassin Scalability issues in Enterprise environments......
> I'd suggest even keeping 1 machine as the "master" bayes DB, do sa-learns
> there by hand, and copy out DBs onto the "slave" servers which do
> the scanning. Each "slave" has a copy of the bayes DB on their
> local disk, which is overwritten periodically by the master.

Which brings up an interesting thought.

Does Bayes have a "learn to log" option where it can spool all the messages
an auto-learn has decided to learn?

If it did, each system could "autolearn" to a local log, and a cron script
could suck them up every so often, cat them together, and do a dedicated
update of the Bayes db, then redeploy the updated version.

I'd think with that the only place you will take a hit will be doing the
actual db update, which probably will shut down Bayes for a few moments
during the file switch.

You could do the entire learn process on a dedicated small machine that
wasn't even part of the actual mail filtering system, but was just a db
updater and redeployer.
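
A rough sketch of the collect-and-relearn step proposed above, assuming each
scanning host spools the messages it would have auto-learned into flat "spam"
and "ham" directories that the updater machine can read. That spool layout is
an assumption made for illustration, not an existing SpamAssassin feature.
The sketch merges the spools and builds the sa-learn commands for the
dedicated update; it does not run them.

```python
import os
import shutil


def collect_spools(host_spools, merged_dir):
    """Copy every spooled message into merged_dir/spam or merged_dir/ham.

    host_spools maps a host name to that host's spool directory. File names
    are prefixed with the host name so messages from different hosts cannot
    overwrite each other. Returns the number of messages copied per kind.
    """
    counts = {"spam": 0, "ham": 0}
    for host, spool in sorted(host_spools.items()):
        for kind in ("spam", "ham"):
            src_dir = os.path.join(spool, kind)
            if not os.path.isdir(src_dir):
                continue
            dst_dir = os.path.join(merged_dir, kind)
            os.makedirs(dst_dir, exist_ok=True)
            for name in sorted(os.listdir(src_dir)):
                shutil.copyfile(os.path.join(src_dir, name),
                                os.path.join(dst_dir, f"{host}-{name}"))
                counts[kind] += 1
    return counts


def learn_commands(merged_dir):
    """Commands the updater would run against the merged spool."""
    return [
        ["sa-learn", "--spam", os.path.join(merged_dir, "spam")],
        ["sa-learn", "--ham", os.path.join(merged_dir, "ham")],
    ]
```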

Loren
Re: SpamAssassin Scalability issues in Enterprise environments......
On Wed, Mar 03, 2004 at 06:46:30PM -0800, Loren Wilton wrote:
> Does Bayes have a "learn to log" option where it can spool all the messages
> an auto-learn has decided to learn?

Yes, there's a learn-to-journal option.

Re: SpamAssassin Scalability issues in Enterprise environments......
Matt Cramer wrote:

Forrest, let me add some "numbers" to Matt's point.

> We move about 2 million emails inbound/month, and 1 million
> outbound/month. Our MTAs are 3 linux boxes running sendmail, and pass
> emails into our internal mail system which stores user mailboxes (about
> 15000). Each MTA runs MIMEDefang and SpamAssassin. Each box does a
[cut]
> We do NO scanning of outbound mail. Virus scanning is handled by the
> internal mail system, as we have not yet needed to do virus scanning on
> the Internet MTAs.
[cut]

We run a "front-line" Postfix+SpamAssassin SMTP farm that accepts
incoming mail from the Internet. Intra-ISP (yep, we're an ISP) mail doesn't
get antispam filtering. The inbound SMTP chain is completed by an
antivirus farm.

Both AntiSpam and AntiVirus gateways are under hardware load balancing,
and are composed of two identical servers.

As of yesterday (March 3rd, 2004), on one of the two servers:

Postfix Summary

Inbound Msgs    44012
Inbound Size    1586 Mbytes
---------------------

Clean Msgs Summary

Clean Msgs      15416
Size            258 Mbytes
Proc Time       2402 seconds
---------------------

Spam Summary

Spam Msgs       12058
Size            102 Mbytes
Proc Time       1543 seconds
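
The figures above can be turned into per-message costs with a little
arithmetic. The short Python sketch below uses only the numbers from the
summary; the variable names are illustrative. Note that the clean and spam
counts add up to 27474 messages, which is 16538 fewer than the inbound total,
and the summary does not say how the remainder was handled.

```python
# One day's figures for a single server, taken from the summary above.
inbound_msgs = 44012
clean_msgs, clean_secs = 15416, 2402
spam_msgs, spam_secs = 12058, 1543

scanned_msgs = clean_msgs + spam_msgs
scanned_secs = clean_secs + spam_secs

avg_clean = clean_secs / clean_msgs        # seconds per clean message
avg_spam = spam_secs / spam_msgs           # seconds per spam message
avg_scanned = scanned_secs / scanned_msgs  # seconds per scanned message
spam_share = spam_msgs / scanned_msgs      # fraction of scanned mail that is spam
busy_share = scanned_secs / 86400          # fraction of the day spent scanning

print(f"clean:   {avg_clean:.3f} s/msg")
print(f"spam:    {avg_spam:.3f} s/msg")
print(f"overall: {avg_scanned:.3f} s/msg")
print(f"spam share of scanned mail: {spam_share:.1%}")
print(f"scanning time as share of day: {busy_share:.1%}")
```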


As a matter of reference, our AntiSpam machines are IBM xSeries 345:

1st) 2x Xeon @ 2.4 GHz, 1 GB RAM, 200 MHz bus
2nd) 2x Xeon @ 2.8 GHz, 1 GB RAM, 400 MHz bus

Both run an untuned Linux 2.4 kernel and SpamAssassin 2.62 with two custom
rules, the threshold at 3.5, and no analysis of mails larger than 150k.

During worktime hours (8-18 local) the load on the first machine is
around 2, on the second machine is way below 1.

Bayes DB is on each single machine and we do training sessions on both
servers, when needed. Sharing the Bayes DB between multiple SpamAssassin
servers can be achieved using a RDBMS as a backend, but this adds
another point-of-failure. Hardware load balancing does a good job in
presenting all servers the same incoming spam-base.
The number of managed domains doesn't really matter as long as there's no
per-domain AntiSpam customization.

If anyone would like to have more numbers, just ask and I'll do my best
to produce some statistics.

Paolo Cravero
Re: SpamAssassin Scalability issues in Enterprise environments......
At 23:52 2004/03/03, Paolo Cravero as2594 wrote:

>Sharing the Bayes DB between multiple SpamAssassin servers can be achieved
>using a RDBMS as a backend, but this adds another point-of-failure.

I've been looking for/considering something like this, actually. I assumed
that I'd have to rewrite the BayesStore.pm file to have it use SQL queries
instead. Are you saying that such a solution already exists?
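
To make the idea concrete, here is a toy sketch of token counts kept in an SQL
table rather than a local DB file. The table layout is invented for
illustration and is not SpamAssassin's schema, and sqlite3 merely stands in
for the networked RDBMS that several scanning servers would share.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (and if needed create) the hypothetical token table."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS tokens ("
        " token TEXT PRIMARY KEY,"
        " spam_count INTEGER NOT NULL DEFAULT 0,"
        " ham_count INTEGER NOT NULL DEFAULT 0)"
    )
    return db


def learn(db, tokens, is_spam):
    """Count each distinct token of one message as spam or ham."""
    column = "spam_count" if is_spam else "ham_count"
    with db:  # one transaction per learned message
        for tok in set(tokens):
            db.execute("INSERT OR IGNORE INTO tokens (token) VALUES (?)", (tok,))
            db.execute(
                f"UPDATE tokens SET {column} = {column} + 1 WHERE token = ?",
                (tok,),
            )


def counts(db, token):
    """Return (spam_count, ham_count) for a token, (0, 0) if unseen."""
    row = db.execute(
        "SELECT spam_count, ham_count FROM tokens WHERE token = ?", (token,)
    ).fetchone()
    return row if row else (0, 0)
```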


Robert LeBlanc <rjl@renaissoft.com>
Renaissoft, Inc.
Maia Mailguard <http://www.renaissoft.com/maia/>
Re: SpamAssassin Scalability issues in Enterprise environments......
On Thu, Mar 04, 2004 at 12:19:25AM -0800, Robert LeBlanc wrote:
> I've been looking for/considering something like this, actually. I assumed
> that I'd have to rewrite the BayesStore.pm file to have it use SQL queries
> instead. Are you saying that such a solution already exists?

3.0... :)
