Mailing List Archive

Clarification/validation of bayes-training approach
I've got SA set up with site-wide bayes for my small network for testing, and
am quite impressed. I'm getting very good results with bayes, but I want to
make sure I'm using a good approach. Here's what I've done:

INITIAL TRAINING
1. I trained using a collection of about 2,000 each of ham and spam initially.
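For concreteness, the initial mass-training looked roughly like this (the maildir paths are illustrative, not my actual layout):

```shell
#!/bin/sh
# Corpus locations are placeholders -- point these at wherever the
# 2,000-message ham/spam collections actually live.
HAM_DIR=${HAM_DIR:-/var/spool/training/ham}
SPAM_DIR=${SPAM_DIR:-/var/spool/training/spam}

# Mass-train the site-wide Bayes DB; run as the user that owns it.
# Guarded so the sketch is a no-op on a box without SpamAssassin.
if command -v sa-learn >/dev/null 2>&1; then
    [ -d "$HAM_DIR" ]  && sa-learn --ham  "$HAM_DIR"
    [ -d "$SPAM_DIR" ] && sa-learn --spam "$SPAM_DIR"
    sa-learn --dump magic    # nham/nspam counts confirm what was learned
fi
```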

ONGOING TRAINING
2. I use auto-learning with defaults.
3. Messages flagged as spam (even those above the auto-learning threshold) are
fed back at least once, because I've got a script set up that trains several
other bayes-based tools. I average fewer than 100 spam messages daily. Is this
detrimental? Is there a problem with re-training on the same messages repeatedly?
4. Any message SA misses is also fed into the spam store for training after
review. (Note that I haven't seen any that would've been auto-learned as ham).
5. I've recently started adding new ham daily, taken from "accepted" messages.
(It's OK in my world to do so.) I skim them to remove any 'spammy' content,
then feed them through the training script. I average 200-300 messages daily.
Is this detrimental?
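The ongoing feeding in items 3-5 amounts to something like the sketch below (function name and paths are illustrative, not my actual script). As I understand it, sa-learn records the message ID of everything it learns, so re-feeding the same corpus should skip already-seen mail rather than double-count it -- which is part of what I'm asking about:

```shell
#!/bin/sh
# Illustrative daily feeder; $1 = spam|ham, $2 = directory of
# human-reviewed messages. sa-learn tracks message IDs it has already
# learned, so feeding the same corpus repeatedly skips seen mail.
feed_corpus() {
    [ -d "$2" ] || return 0                   # nothing to do for a missing dir
    command -v sa-learn >/dev/null 2>&1 || return 0
    sa-learn --"$1" "$2"
}

feed_corpus spam /var/spool/training/missed-spam   # item 4: reviewed misses
feed_corpus ham  /var/spool/training/new-ham       # item 5: skimmed accepts
```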

AVOIDING TAINTING
6. I use bayes_ignore_header on any locally-generated headers.
7. I do not feed any spam-related mailing list messages (SA, procmail,
bogofilter) through SA.
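To illustrate item 6, the local.cf entries look like this (the header names below are made-up examples, not necessarily mine; any header added by local infrastructure is a candidate):

```
# local.cf: keep tokens from locally-added headers out of the Bayes DB
bayes_ignore_header X-Spam-Filtered
bayes_ignore_header X-Scanned-By
bayes_ignore_header X-Local-MailScanner
```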

RE-TRAINING
8. I'm maintaining the spam and ham messages used for training.

I'm trying to determine the best long-term strategy for maintaining bayes
effectiveness. I'm concerned because SA has NOT flagged some messages (notably
the most recent logo spams) as effectively as bogofilter and spamprobe have,
although I've used similar training for all three. BAYES_80 seems to be the
norm for variants of the logo spam.
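This is how I've been checking which BAYES_* rule fires on a saved sample (the message path is a placeholder):

```shell
#!/bin/sh
# Run a saved sample through SA's test mode and pull out the Bayes hits.
MSG=${MSG:-/tmp/logo-spam.eml}
if command -v spamassassin >/dev/null 2>&1 && [ -f "$MSG" ]; then
    # --test-mode (-t) appends the full rule-hit report to the message
    spamassassin --test-mode < "$MSG" | grep -i bayes
fi
```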

From the sa-learn manpage:

Learning filters require training to be effective. If you don't train
them, they won't work. In addition, you need to train them with new
messages regularly to keep them up-to-date, or their data will become
stale and impact accuracy.

Based on this, I've adopted the "ongoing training" described above, which fits
under "Unsupervised learning from SpamAssassin rules" as described on the
manpage, along with some supervised training.

My confusion comes later in the manpage where it states:

Another thing to be aware of, is that typically you should aim to train
with at least 1000 messages of spam, and 1000 ham messages, if possible.
More is better, but anything over about 5000 messages does not improve
accuracy significantly in our tests.


I'm well above 1,000 and approaching 5,000 fast. Does auto-expiry take care of
using more than 5,000 messages, or should I switch to "train on exception"
rather than mass-feeding SA?
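For reference, these are the expiry settings I believe are in play, with the stock defaults written out explicitly -- my understanding is that expiry trims the oldest tokens (by access time) once the database passes the size cap, rather than capping the number of trained messages:

```
# local.cf: expiry defaults, shown explicitly
bayes_auto_expire        1
bayes_expiry_max_db_size 150000
```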

I've read the bogofilter documentation on "full training", "train on error" and
"train to exhaustion" and am wondering if any of those techniques are relevant
to SA's bayes implementation. From the sa-learn manpage, I understand that
initial full-training with auto-learning supplemented by train on exception is
"recommended." Is that correct? Am I force-feeding SA more than is helpful?

Thanks all,

- Bob