What exactly is the negative effect of having a ham corpus that
is much larger than a spam corpus? Say we initialize both corpora
with equal amounts of spam and ham. However, with the majority
of email being non-spam (or at least non-super-high-scoring spam),
auto-learn will, over time, cause the ham corpus to grow much
larger than the spam corpus.
The auto-learn feature is very attractive (we are, like most
of you, overly busy and reluctant to take on more regular
maintenance tasks), but is the only real answer to stay on
top of spam, save it, and feed it regularly to sa-learn? That
could be automated, but still... what's the point
of auto-learn if it creates a less useful corpus set?
So I'm hoping I'm wrong about that :-)
Thanks...
-glenn