Mailing List Archive

Comparing dspam and sa bayes implementations speed
Hello!
I've decided to compare how the dspam and SpamAssassin
bayes implementations perform, because speed is very
important for large installations and the author of dspam
says that his pure C implementation is much faster than
Perl. I've created two Perl scripts that run the dspam
agent and spamc over my ham corpus and measure the total,
min, max and average message processing time. Dspam
was configured with MySQL storage optimized for
speed. For the SpamAssassin benchmark I used spamd with
only the bayes rules. Both were trained on exactly the same
spam and ham corpus. Here are the results:

# DSPAM

1630 messages processed.
Total time: 230.084 wallclock secs (30.42 cusr + 17.97 csys = 48.39 CPU)
Max message processing time: 13.3091468811035
Avg message processing time: 0.140900963947086
Min message processing time: 0.0444350242614746

# SpamAssassin

1630 messages processed.
Total time: 254.895 wallclock secs ( 3.54 cusr + 10.46 csys = 14.00 CPU)
Max message processing time: 3.65092492103577
Avg message processing time: 0.156147952606342
Min message processing time: 0.0727198123931885
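
A timing harness of the kind described above could be sketched as follows. This is only an illustration in Python, not the Perl scripts used for the benchmark; `time_command`, `summarize` and `demo` are hypothetical names, and the stand-in command simply reads stdin, so the real dspam or spamc invocation would have to be substituted. Note that "total" here is the sum of per-message times, not the wallclock time of the whole run.

```python
import subprocess
import sys
import time


def time_command(cmd, messages):
    """Pipe each message (bytes) to cmd on stdin; return per-message seconds."""
    timings = []
    for msg in messages:
        start = time.perf_counter()
        # Output is discarded: only the elapsed wallclock time matters here.
        subprocess.run(cmd, input=msg, stdout=subprocess.DEVNULL,
                       stderr=subprocess.DEVNULL, check=False)
        timings.append(time.perf_counter() - start)
    return timings


def summarize(timings):
    """Reduce per-message timings to the figures reported in the post."""
    if not timings:
        raise ValueError("no messages were processed")
    total = sum(timings)
    return {
        "messages": len(timings),
        "total": total,
        "max": max(timings),
        "avg": total / len(timings),
        "min": min(timings),
    }


def demo():
    # Stand-in filter that only reads stdin (assumption); swap in the real
    # client command line for an actual measurement.
    cmd = [sys.executable, "-c", "import sys; sys.stdin.buffer.read()"]
    corpus = [b"Subject: test\n\nhello\n"] * 3  # hypothetical sample messages
    stats = summarize(time_command(cmd, corpus))
    print(f"{stats['messages']} messages processed.")
    for key in ("total", "max", "avg", "min"):
        print(f"{key}: {stats[key]:.6f} s")
```

Calling `demo()` prints a summary in roughly the same shape as the result blocks above. Each measurement includes process start-up cost, which is probably true of any script that launches one client process per message.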

It seems that SpamAssassin is not much slower than
dspam, although the results are biased because:
1) dspam was configured with default settings, which
enable two algorithms (bayes and altbayes);
2) dspam was configured to attach signatures with
tokens for re-learning;
3) dspam uses chained tokens, which increase the volume of
data to be processed.
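
To put a number on "not much slower", the reported totals can be compared directly. The short Python sketch below uses only the figures from the two result blocks; the helper names are hypothetical. One caveat, stated as an assumption: the CPU columns probably cover only the client processes started by the benchmark scripts, while spamd and the MySQL server do their work in separate daemons, so the CPU figures may not be comparable between the two runs.

```python
# Figures copied from the two result blocks in the post.
RESULTS = {
    "dspam": {"messages": 1630, "wallclock": 230.084, "cpu": 48.39},
    "spamassassin": {"messages": 1630, "wallclock": 254.895, "cpu": 14.00},
}


def per_message(name, key):
    """Average seconds per message for one tool and one time column."""
    r = RESULTS[name]
    return r[key] / r["messages"]


def ratio(key, numerator, denominator):
    """How many times larger one tool's figure is than the other's."""
    return RESULTS[numerator][key] / RESULTS[denominator][key]


wall = ratio("wallclock", "spamassassin", "dspam")
cpu = ratio("cpu", "dspam", "spamassassin")
print(f"SpamAssassin wallclock / dspam wallclock: {wall:.3f}")
print(f"dspam CPU / SpamAssassin CPU: {cpu:.3f}")
```

By this arithmetic SpamAssassin needed about 11% more wallclock time over the 1630 messages, while the dspam run used roughly 3.5 times the measured CPU time.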

I'm also very surprised that dspam's max message processing
time is higher.

This is mostly a toy benchmark but I would like to hear
suggestions on how results can be improved.
Eugene

--
Email: jmv /at/ online.ru
Re: Comparing dspam and sa bayes implementations speed
At 01:28 PM 3/3/04 +0300, Eugene Morozov wrote:
>This is mostly a toy benchmark but I would like to hear
>suggestions on how results can be improved.
>Eugene

If you really only want to compare the speed of the bayes engines, remove
or zero-out the scores of all the non-bayes rules from SA.

If you've got a test-box for it, you can accomplish this by moving all the
stuff out of /usr/share/spamassassin, and only moving back in the bayes stuff.

And then configure dspam to use only one of its bayes engines. You could
even generate a table with all the combinations: SA (full), SA (bayes
only), dspam (bayes), dspam (altbayes), dspam (bayes+altbayes).
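
A table of that kind could be laid out with a small script. The Python sketch below is only an illustration: `CONFIGS` lists the five combinations named above, `format_table` is a hypothetical helper, and unmeasured configurations are shown as "n/a" rather than guessed.

```python
# Configurations named in the reply; timings stay unset until measured.
CONFIGS = [
    "SA (full)",
    "SA (bayes only)",
    "dspam (bayes)",
    "dspam (altbayes)",
    "dspam (bayes+altbayes)",
]


def format_table(avg_seconds):
    """Render a plain-text table; avg_seconds maps config name -> float or None."""
    width = max(len(name) for name in CONFIGS)
    lines = [f"{'Configuration':<{width}}  Avg s/msg"]
    for name in CONFIGS:
        value = avg_seconds.get(name)
        cell = "n/a" if value is None else f"{value:.4f}"
        lines.append(f"{name:<{width}}  {cell}")
    return "\n".join(lines)


# The only two averages available so far are those from the original post:
# spamd with bayes rules only, and dspam with its default two algorithms.
measured = {
    "SA (bayes only)": 0.156147952606342,
    "dspam (bayes+altbayes)": 0.140900963947086,
}
print(format_table(measured))
```

The remaining three rows would be filled in as each configuration is benchmarked on the same corpus.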
Re: Comparing dspam and sa bayes implementations speed
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


Matt Kettler writes:
>At 01:28 PM 3/3/04 +0300, Eugene Morozov wrote:
>>This is mostly a toy benchmark but I would like to hear
>>suggestions on how results can be improved.
>>Eugene
>
>If you really only want to compare the speed of the bayes engines, remove
>or zero-out the scores of all the non-bayes rules from SA.

Matt, I think he said he did that ;)

BTW SpamAssassin also rewrites the messages with spamc -- so DSpam's
addition of an ID string is not really higher-overhead than that. In fact
it's possibly less complex than SpamAssassin's
entire-message-encapsulation report stuff.

If it would be possible to inhibit DSpam's ID string addition, then
testing with that, and with SpamAssassin set to give just the score
instead of the marked-up msg ("spamc -c" or similar), might also be more
accurate -- as that would remove the markup overhead from the timings,
too.

Also, run it multiple times and average the results, of course ;)

- --j.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFARix2QTcbUG5Y7woRAu7iAJ0XfPnBYxubp7Lvgqm7Xk3Yf3vQ/QCdGU8N
NRi10+NIyTbQ6pdzlAgpnaQ=
=O0L9
-----END PGP SIGNATURE-----
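
The suggestion to run the benchmark several times and average the results could be sketched like this in Python. It is a minimal illustration, not part of the original scripts: `repeat_and_average` is a hypothetical helper, and `run_once` stands for any callable that performs one full pass over the corpus and returns its total time in seconds.

```python
import statistics


def repeat_and_average(run_once, repeats=5):
    """Call run_once() `repeats` times and report the mean and spread."""
    if repeats < 2:
        raise ValueError("need at least two runs to estimate spread")
    totals = [run_once() for _ in range(repeats)]
    return {
        "runs": totals,
        "mean": statistics.mean(totals),
        # Sample standard deviation: a rough gauge of run-to-run noise.
        "stdev": statistics.stdev(totals),
    }
```

Reporting the standard deviation alongside the mean helps show whether a gap between the two tools is larger than the run-to-run noise.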
Re: Comparing dspam and sa bayes implementations speed
At 02:05 PM 3/3/2004, Justin Mason wrote:
> >If you really only want to compare the speed of the bayes engines, remove
> >or zero-out the scores of all the non-bayes rules from SA.
>
>Matt, I think he said he did that ;)

Forgive me, that was a post I made during my pre-coffee morning haze.