Mailing List Archive: sa-learn/bayes

What exactly is the negative effect of having a ham corpus that
is much larger than a spam corpus? Say we initialize both corpora
with equal amounts of spam and ham. However, with the majority
of email being non-spam (or at least non-super-high-scoring spam),
the auto-learn over time will cause the ham corpus to grow much
larger than the spam corpus.

The auto-learn feature is very attractive (us being, like most
of you all, overbusy and reluctant to take on more regular
maintenance tasks), but is the only real answer to stay on
top of spam by saving it and feeding it regularly to sa-learn? It
could be automated, but still... what's the point
of auto-learn if it creates a less useful corpus set?

So I'm hoping I'm wrong about that :-)

Thanks...

-glenn

At 02:42 PM 3/11/2004, little@cs.ucsd.edu wrote:
>What exactly is the negative effect of having a ham corpus that
>is much larger than a spam corpus? Say we initialize both corpora
>with equal amounts of spam and ham. However, with the majority
>of email being non-spam (or at least non-super-high-scoring spam),
>the auto-learn over time will cause the ham corpus to grow much
>larger than the spam corpus.

It's a complete misconception that your spam and ham training should be of
equal size.

Think about it: bayes is a statistical probability system. It works best
when its model is as close to reality as possible.

Therefore, I propose that the theoretical best bayes training ratio is not
1:1, but instead whatever your real-world spam:ham ratio is.

But frankly, bayes is pretty tolerant of considerable deviation from
this ratio.
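
One possible reason for that tolerance, sketched below under stated
assumptions: in a Graham-style token estimate, each token's counts are
divided by the number of messages trained in that class, so a larger ham
corpus does not by itself pull a token toward "ham". This is a simplified
illustration, not SpamAssassin's actual implementation; the function name
and the sample counts are made up for the example.

```python
def token_spam_probability(spam_hits, ham_hits, n_spam, n_ham):
    """Graham-style estimate: compare per-class frequencies, not raw counts.

    spam_hits / ham_hits: trained messages of each class containing the token.
    n_spam / n_ham: total trained messages of each class (must be > 0).
    """
    spam_freq = spam_hits / n_spam
    ham_freq = ham_hits / n_ham
    if spam_freq + ham_freq == 0:
        return 0.5  # token never seen: no evidence either way
    return spam_freq / (spam_freq + ham_freq)


# Balanced corpus: 1000 spam, 1000 ham; token seen in 20% of spam, 2% of ham.
balanced = token_spam_probability(200, 20, 1000, 1000)

# Ham-heavy corpus: 1000 spam, 5000 ham; same per-class rates (20% and 2%).
ham_heavy = token_spam_probability(200, 100, 1000, 5000)

print(round(balanced, 3), round(ham_heavy, 3))
```

Both calls give the same estimate because only the per-class rates enter
the formula. What a lopsided corpus does change is how well each rate is
estimated: the smaller class has fewer samples behind its numbers.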

>The auto-learn feature is very attractive (us being, like most
>of you all, over busy and reluctant to take on more regular
>maintenance tasks), but is the only real answer to stay on
>top of spam, save it, and feed it regularly to sa-learn? It
>could be automated, but still... what's the point
>of auto-learn if it creates a less useful corpus set?

Autolearn isn't a replacement for manual training. You can never have a
successful bayes database if you initialize manually and then rely
exclusively on autolearning. Autolearning doesn't do a good job as a sole
source of training because it never learns "mid-scoring" emails. And
because autolearning isn't bayes-based, your previous training doesn't help
the autolearner (this is to prevent "runaway bayes self-feeding" problems).
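
The gap described above can be sketched as a simple threshold check. This
is a minimal illustration, not SpamAssassin's code: the function name is
hypothetical, the score is assumed to be computed without any bayes rules,
the default values are assumed to mirror the stock
bayes_auto_learn_threshold_nonspam and bayes_auto_learn_threshold_spam
settings, and the handling of scores exactly on a threshold is a guess.

```python
def autolearn_decision(score, ham_threshold=0.1, spam_threshold=12.0):
    """Return "ham", "spam", or None (not learned) for a non-bayes score.

    The thresholds are assumptions; check the configuration documentation
    for your SpamAssassin version before relying on them.
    """
    if score < ham_threshold:
        return "ham"
    if score > spam_threshold:
        return "spam"
    return None  # mid-scoring mail falls in the gap and is never learned


# A message scoring 6.3 may well be tagged as spam, but it teaches bayes nothing.
for score in (-1.2, 6.3, 15.0):
    print(score, autolearn_decision(score))
```

Everything between the two thresholds, which is where the borderline mail
sits, only reaches the database through manual training.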

However, autolearning is a great supplement to your regular manual
training, and reduces the frequency with which hand feeding is needed.

Personally, I handle mine with the following:
- Autolearning is on, but with the thresholds widened slightly.
- Email sent to select spamtraps is archived into a separate mbox.
- Email sent to select nonspamtraps is archived into a separate mbox.
- Both the spam and non-spam mboxes are fed to sa-learn by a daily
  cron job, and then rotated into a multi-day storage archive.
- The subject lines of the spam and nonspam training mail are emailed
  to me for review. Any mistakes are corrected by hand.
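
The daily feed-and-rotate step above could look roughly like the sketch
below, meant to be called from cron. It is not the poster's actual script:
the function names, the dated archive naming, and any paths passed in are
hypothetical. The only external command used is sa-learn with its --spam,
--ham and --mbox options. The subject-line review mail and the pruning of
old archives are left out.

```python
import shutil
import subprocess
from datetime import date
from pathlib import Path


def build_sa_learn_commands(spam_mbox, ham_mbox):
    """Return the two sa-learn invocations for the trap mboxes."""
    return [
        ["sa-learn", "--spam", "--mbox", str(spam_mbox)],
        ["sa-learn", "--ham", "--mbox", str(ham_mbox)],
    ]


def archive_path(mbox, archive_dir, day):
    """Name the rotated copy after the mbox file plus an ISO date suffix."""
    return Path(archive_dir) / f"{Path(mbox).name}.{day.isoformat()}"


def run_daily_training(spam_mbox, ham_mbox, archive_dir):
    """Feed both mboxes to sa-learn, then move them into the archive."""
    for cmd in build_sa_learn_commands(spam_mbox, ham_mbox):
        subprocess.run(cmd, check=True)  # stop if sa-learn reports failure
    Path(archive_dir).mkdir(parents=True, exist_ok=True)
    today = date.today()
    for mbox in (spam_mbox, ham_mbox):
        shutil.move(str(mbox), str(archive_path(mbox, archive_dir, today)))
```

A cron entry would then call run_daily_training() once a day with the two
trap mboxes and an archive directory. Moving the files only after both
sa-learn runs succeed means a failed run leaves the mail in place for the
next attempt.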