Mailing List Archive

Training spamassassin past 5,000 emails
I've read through
https://spamassassin.apache.org/full/3.1.x/doc/sa-learn.html which
states that "anything over about 5000 messages does not improve accuracy
significantly in our tests."

So once I hit 5,000, what do? Do I run --forget on say the 500 oldest
emails, delete those from my ham/spam folders and then add in a batch of
500 newer ham/spam emails and then run sa-learn on all the emails in my
spam/ham folders?
Re: Training spamassassin past 5,000 emails
Steve Dondley <s@dondley.com> writes:

> I've read through
> https://spamassassin.apache.org/full/3.1.x/doc/sa-learn.html which
> states that "anything over about 5000 messages does not improve
> accuracy significantly in our tests."

I would take that with a grain of salt. Based on my experience running
SA for many years, I'd say that if you have new spam that isn't like
the spam you already have, learning on it will help.

Also, I take it as a comment about "there's no need to try hard to get
more than 5K messages". It doesn't say, "if you train on more than 5000
bad things will happen".

> So once I hit 5,000, what do? Do I run --forget on say the 500 oldest
> emails, delete those from my ham/spam folders and then add in a batch
> of 500 newer ham/spam emails and then run sa-learn on all the emails
> in my spam/ham folders?

I've been running sa-learn daily over my ham folders and my spam folders
for years. I refile spam and ham so that it will be learned. I find
the bayes scoring is quite good except for novel spam. My bayes_* files
are about 83M in total.

So I don't think you necessarily have a problem to solve.
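
The daily routine described above can be sketched as a small wrapper. This is only an illustration, not the poster's actual setup: the folder paths are hypothetical, and the function names are made up for this example. It relies on the documented behaviour that sa-learn remembers which messages it has already learned and skips them, so pointing it at the same folders every day is safe.

```python
import shlex
import subprocess
from pathlib import Path


def build_training_commands(spam_dir, ham_dir):
    """Return the two sa-learn invocations for one training pass."""
    return [
        ["sa-learn", "--spam", str(spam_dir)],
        ["sa-learn", "--ham", str(ham_dir)],
    ]


def run_training(spam_dir, ham_dir, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in build_training_commands(spam_dir, ham_dir):
        print(shlex.join(cmd))
        if not dry_run:
            # Requires SpamAssassin to be installed and on PATH.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Hypothetical Maildir-style locations; substitute your own.
    run_training(Path("Maildir/.Spam/cur"), Path("Maildir/.Ham/cur"))
```

Run from cron with dry_run=False once the printed commands look right for your layout.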
Re: Training spamassassin past 5,000 emails
On Tue, 09 Mar 2021 07:49:38 -0500
Steve Dondley wrote:

> I've read through
> https://spamassassin.apache.org/full/3.1.x/doc/sa-learn.html which
> states that "anything over about 5000 messages does not improve
> accuracy significantly in our tests."
>
> So once I hit 5,000, what do? Do I run --forget on say the 500 oldest
> emails, delete those from my ham/spam folders and then add in a batch
> of 500 newer ham/spam emails and then run sa-learn on all the emails
> in my spam/ham folders?


You don't *need* to do anything, that figure is about diminishing
returns.

If you keep a full archive of what's been trained, I think it makes
sense to trim out old mail occasionally and recreate the database -
particularly if it's a single-user Bayes.
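
A rough sketch of the "trim and recreate" idea, under several assumptions that are not from the thread: the archive holds one message per file, file modification time is a good enough stand-in for message age, and the helper names are invented for this example. It only builds the list of commands (sa-learn --clear followed by relearning), so nothing is wiped until you choose to run them.

```python
import time
from pathlib import Path


def recent_messages(archive_dir, max_age_days, now=None):
    """Return archive files modified within the last max_age_days, sorted by name."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    return sorted(
        p for p in Path(archive_dir).iterdir()
        if p.is_file() and p.stat().st_mtime >= cutoff
    )


def rebuild_plan(spam_files, ham_files):
    """List the commands for a from-scratch rebuild: clear, then relearn."""
    plan = [["sa-learn", "--clear"]]
    if spam_files:
        plan.append(["sa-learn", "--spam", *map(str, spam_files)])
    if ham_files:
        plan.append(["sa-learn", "--ham", *map(str, ham_files)])
    return plan
```

For a large archive, copying the selected files into a fresh directory and passing that directory to sa-learn avoids very long command lines. Taking a copy of the existing database first (for instance with sa-learn --backup) is a sensible precaution before clearing it.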
Re: Training spamassassin past 5,000 emails
On 2021-03-09 08:28 AM, Greg Troxel wrote:
> Steve Dondley <s@dondley.com> writes:
>
>> I've read through
>> https://spamassassin.apache.org/full/3.1.x/doc/sa-learn.html which
>> states that "anything over about 5000 messages does not improve
>> accuracy significantly in our tests."
>
> I would take that with a grain of salt. Based on my experience running
> SA for many years, I'd say that if you have new spam that isn't like
> the spam you already have, learning on it will help.
>
> Also, I take it as a comment about "there's no need to try hard to get
> more than 5K messages". It doesn't say, "if you train on more than 5000
> bad things will happen".
>
>> So once I hit 5,000, what do? Do I run --forget on say the 500 oldest
>> emails, delete those from my ham/spam folders and then add in a batch
>> of 500 newer ham/spam emails and then run sa-learn on all the emails
>> in my spam/ham folders?
>
> I've been running sa-learn daily over my ham folders and my spam folders
> for years. I refile spam and ham so that it will be learned. I find
> the bayes scoring is quite good except for novel spam. My bayes_* files
> are about 83M in total.
>
> So I don't think you necessarily have a problem to solve.

OK, thanks for the advice. Appreciated.
Re: Training spamassassin past 5,000 emails
On 9 Mar 2021, at 7:49, Steve Dondley wrote:

> I've read through
> https://spamassassin.apache.org/full/3.1.x/doc/sa-learn.html which
> states that "anything over about 5000 messages does not improve
> accuracy significantly in our tests."

Did you read the section on expiration?
https://spamassassin.apache.org/full/3.1.x/doc/sa-learn.html#expiration

> So once I hit 5,000, what do?

Be happy that you've reached near-optimal Bayes accuracy.

> Do I run --forget on say the 500 oldest emails, delete those from my
> ham/spam folders and then add in a batch of 500 newer ham/spam emails
> and then run sa-learn on all the emails in my spam/ham folders?

There are edge cases where using --force-expire periodically is
necessary to get expiration to run often enough to avoid bloat, but
unless you have autolearn on and high volume you are unlikely to run
into that problem. If you are only doing manual learning, all should be
well.


--
Bill Cole
bill@scconsult.com or billcole@apache.org
(AKA @grumpybozo and many *@billmail.scconsult.com addresses)
Not Currently Available For Hire
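
To see how many messages and tokens the Bayes database currently holds, the output of sa-learn --dump magic can be parsed. The sketch below assumes the usual column layout, in which each summary line ends in "non-token data: <label>" and carries its value in the third column; check this against your own version's output. The sample text uses made-up numbers and is not output from a real system. Note that expiry operates on tokens, not on the number of messages learned.

```python
import subprocess


def parse_dump_magic(text):
    """Map each 'non-token data' label to its integer value."""
    stats = {}
    for line in text.splitlines():
        head, sep, label = line.partition("non-token data:")
        fields = head.split()
        if sep and len(fields) >= 3:
            stats[label.strip()] = int(fields[2])
    return stats


def read_bayes_stats():
    """Run sa-learn --dump magic and parse it (needs SpamAssassin installed)."""
    out = subprocess.run(
        ["sa-learn", "--dump", "magic"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_dump_magic(out)


# Illustrative sample with made-up numbers, not output from a real system.
SAMPLE = """\
0.000          0          3          0  non-token data: bayes db version
0.000          0       2600          0  non-token data: nspam
0.000          0       2400          0  non-token data: nham
0.000          0     140000          0  non-token data: ntokens
"""

stats = parse_dump_magic(SAMPLE)
print(stats["nspam"] + stats["nham"], "messages learned,", stats["ntokens"], "tokens")
```

If the token count keeps growing well past what you expect, that is the situation in which a periodic sa-learn --force-expire may be worth scheduling.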
Re: Training spamassassin past 5,000 emails
On Tue, 09 Mar 2021 08:52:28 -0500
Steve Dondley wrote:

> On 2021-03-09 08:42 AM, RW wrote:

> >
> > If you keep a full archive of what's been trained, I think it makes
> > sense to trim out old mail occasionally and recreate the database -
> > particularly if it's a single user Bayes.
>
> I'm harvesting spam/ham across multiple servers from many different
> users on each server. Is there anything I should be aware of or
> worried about doing something like this? Do I risk the effectiveness
> of SA if it's not tailored to a specific user?

I was really thinking more of an individual running SA for their
own mail. It would be unusual for an admin to keep a full archive of
trained mail for each account.

Per-user Bayes can be more accurate, but only if users take the
training seriously.

> I will also be allowing users to flag their own spam using the
> roundcube webmail client.

If you do that you should review the submissions.

> I'm not clear how the individual SA
> database works when there is also a server-wide database.

It's one or the other.
Re: Training spamassassin past 5,000 emails
RW wrote:
> On Tue, 09 Mar 2021 08:52:28 -0500
> Steve Dondley wrote:
>> I will also be allowing users to flag their own spam using the
>> roundcube webmail client.
>
> If you do that you should review the submissions.

This. SO much this. ALL THE THIS.

If you're using the "Mark as Junk" or "Mark as Junk 2" plugin you will
get a LOT of mail mistakenly marked as spam when the user intended to
just delete it. The icon in the "classic" theme/skin is VERY easy to
mistake for "Delete". It was so bad here we had to patch in a little
JavaScript confirmation popup when we first added Roundcube to our
webmail stable.

Aside from that you will *also* get people who deliberately mark
anything they don't want as spam. This is not terribly healthy for the
Bayes DB, and if you do any other local processing or deconstruction it
will poison those processes as well.

-kgd
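
One way to put "review the submissions" into practice is a holding area between the webmail button and sa-learn. The sketch below is an assumption about how such a queue could look, not how Roundcube or any of its plugins works: user-flagged messages are copied into a pending directory, and only those an admin approves are moved into the folder that sa-learn trains on. All function and directory names are hypothetical.

```python
import shutil
from pathlib import Path


def queue_submission(message_path, pending_dir):
    """Copy a user-flagged message into the pending-review directory."""
    pending = Path(pending_dir)
    pending.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy2(message_path, pending))


def review(pending_path, approved, spam_dir):
    """Move an approved submission into the training folder; discard a rejected one."""
    pending_path = Path(pending_path)
    if not approved:
        pending_path.unlink()
        return None
    target_dir = Path(spam_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.move(str(pending_path), str(target_dir / pending_path.name)))
```

Because nothing reaches the training folder without an explicit approval, mis-clicks and "I just don't want this" reports never touch the Bayes DB.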