Mailing List Archive: Why shouldn't I set the score for SPAM_99 and SPAM_999 higher?

thomas.cameron at camerontech

May 5, 2022, 8:37 AM

Post #1 of 8

I understand that turning knobs without understanding the consequences
can do bad things, but almost all of the spam that gets through SA on my
server has SPAM_99 or SPAM_999 set in the headers. It is obviously spam,
so I don't really get how it wasn't flagged, but it wasn't. What are the
risks of giving more weight to SPAM_99 and/or SPAM_999? Explain it like
I'm five, sorry, it's probably something simple that I just don't
understand.

Thomas

Re: Why shouldn't I set the score for SPAM_99 and SPAM_999 higher?

thomas.cameron at camerontech

May 5, 2022, 8:54 AM

Post #2 of 8

On 5/5/22 10:46, Reindl Harald wrote:
>
>
> On 05.05.22 at 17:37, Thomas Cameron wrote:
>> I understand that turning knobs without understanding the
>> consequences can do bad things, but almost all of the spam that gets
>> through SA on my server has SPAM_99 or SPAM_999 set in the headers.
>> It is obviously spam, so I don't really get how it wasn't flagged,
>> but it wasn't. What are the risks of giving more weight to SPAM_99
>> and/or SPAM_999? Explain it like I'm five, sorry, it's probably
>> something simple that I just don't understand
>
> when your bayes is well trained just raise it
>
> the risk is simple: when your bayes isn't trained well or is poisoned
> (autolearning is the root of all evil) you risk FPs
>
> we milter-reject at 8.0 points and BAYES_99 + BAYES_999 have been 7.5
> points since 2014, most junk collects the remaining 0.5 points with
> other rules and the few FPs typically hit some DNSWL/SPF rules with
> negative score
>
> well, our bayes has 160k messages....

Many thanks! I appreciate the response!

Thomas
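
The arithmetic Reindl describes can be sketched roughly as follows. This is only an illustration, not anyone's actual configuration: the thread gives the 7.5 total and the 8.0 reject threshold, but the 3.5/4.0 split between the two BAYES rules, the two placeholder rule names, and their scores are assumptions made up for the example.

```python
# Assumed values: only the 7.5 total and the 8.0 threshold come from the thread.
SCORES = {
    "BAYES_99": 3.5,           # assumed split of the 7.5 total
    "BAYES_999": 4.0,          # assumed split of the 7.5 total
    "OTHER_SPAM_RULE": 0.5,    # hypothetical rule that junk usually also hits
    "DNSWL_STYLE_RULE": -2.3,  # hypothetical whitelist rule with a negative score
}
REJECT_AT = 8.0


def total_score(hits, scores=SCORES):
    """Sum the scores of every rule that fired on a message."""
    return sum(scores[rule] for rule in hits)


def is_rejected(hits, threshold=REJECT_AT, scores=SCORES):
    """Assume a message is rejected once its total reaches the threshold."""
    return total_score(hits, scores) >= threshold


# Bayes alone stays just under the threshold.
print(is_rejected(["BAYES_99", "BAYES_999"]))
# Typical junk picks up the missing half point from some other rule.
print(is_rejected(["BAYES_99", "BAYES_999", "OTHER_SPAM_RULE"]))
# A false positive that also hits a whitelist rule is pulled back down.
print(is_rejected(["BAYES_99", "BAYES_999", "OTHER_SPAM_RULE", "DNSWL_STYLE_RULE"]))
```

The point of keeping the Bayes total just below the threshold is that Bayes can never reject a message on its own; at least one other rule has to agree.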

Re: Why shouldn't I set the score for SPAM_99 and SPAM_999 higher?

mnalis-sa-list at voyager

May 5, 2022, 9:47 AM

Post #3 of 8

You should probably check that none of your ham (i.e. non-spam)
messages contains SPAM_99 or SPAM_999. It can happen when spammers
poison your bayes database, and increased score in that case might
lead to legitimate mail being misclassified as spam.

On Thu, May 05, 2022 at 10:37:40AM -0500, Thomas Cameron wrote:
> I understand that turning knobs without understanding the consequences can
> do bad things, but almost all of the spam that gets through SA on my server
> has SPAM_99 or SPAM_999 set in the headers. It is obviously spam, so I don't
> really get how it wasn't flagged, but it wasn't. What are the risks of
> giving more weight to SPAM_99 and/or SPAM_999? Explain it like I'm five,
> sorry, it's probably something simple that I just don't understand.
>
> Thomas
>

--
Opinions above are GNU-copylefted.

Re: Why shouldn't I set the score for SPAM_99 and SPAM_999 higher?

thomas.cameron at camerontech

May 5, 2022, 9:53 AM

Post #4 of 8

On 5/5/22 11:47, Matija Nalis wrote:
> On Thu, May 05, 2022 at 10:37:40AM -0500, Thomas Cameron wrote:
>> I understand that turning knobs without understanding the consequences can
>> do bad things, but almost all of the spam that gets through SA on my server
>> has SPAM_99 or SPAM_999 set in the headers. It is obviously spam, so I don't
>> really get how it wasn't flagged, but it wasn't. What are the risks of
>> giving more weight to SPAM_99 and/or SPAM_999? Explain it like I'm five,
>> sorry, it's probably something simple that I just don't understand.
>>
>> Thomas
>>
> You should probably check that none of your ham (i.e. non-spam)
> messages contains SPAM_99 or SPAM_999. It can happen when spammers
> poison your bayes database, and increased score in that case might
> lead to legitimate mail being misclassified as spam.

That's a great call, thanks. I grepped my mail files and didn't find any
SPAM_99 headers in any of them.

Thomas

Re: Why shouldn't I set the score for SPAM_99 and SPAM_999 higher?

dwreski at guardiandigital

May 5, 2022, 9:59 AM

Post #5 of 8

>> You should probably check that none of your ham (i.e. non-spam)
>> messages contains SPAM_99 or SPAM_999. It can happen when spammers
>> poison your bayes database, and increased score in that case might
>> lead to legitimate mail being misclassified as spam.
>
> That's a great call, thanks. I grepped my mail files and didn't find any
> SPAM_99 headers in any of them.

You should be looking for BAYES_99 and BAYES_999 in your corpus.

Best,
Dave
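
One way to do the check Dave suggests, sketched in Python rather than grep: the function below counts which BAYES_* rules show up in the headers of an mbox file. It assumes that the results are recorded in an X-Spam-Status header (the usual default, but a local setup may use a different header) and that the mailbox is in mbox format; the function name is made up for this example.

```python
import mailbox
import re
from collections import Counter

# Matches whole rule names such as BAYES_00, BAYES_50, BAYES_99, BAYES_999.
# The trailing \b keeps BAYES_99 from also matching inside BAYES_999.
BAYES_RULE = re.compile(r"\bBAYES_\d+\b")


def count_bayes_rules(mbox_path, header="X-Spam-Status"):
    """Count, per message, which BAYES_* rules appear in the given header."""
    counts = Counter()
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for message in box:
            rules = set()
            # A message may carry the header more than once.
            for value in message.get_all(header, []):
                rules.update(BAYES_RULE.findall(str(value)))
            counts.update(rules)
    finally:
        box.close()
    return counts
```

Run against a ham mailbox, any non-zero count for BAYES_99 or BAYES_999 is a message that a higher score would push toward being flagged; run against the spam mailbox, the counts show how much of the missed spam actually hit those rules.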

Re: Why shouldn't I set the score for SPAM_99 and SPAM_999 higher?

thomas.cameron at camerontech

May 5, 2022, 10:40 AM

Post #6 of 8

On 5/5/22 11:59, Dave Wreski wrote:
>
>>> You should probably check that none of your ham (i.e. non-spam)
>>> messages contains SPAM_99 or SPAM_999. It can happen when spammers
>>> poison your bayes database, and increased score in that case might
>>> lead to legitimate mail being misclassified as spam.
>>
>> That's a great call, thanks. I grepped my mail files and didn't find
>> any SPAM_99 headers in any of them.
>
> You should be looking for BAYES_99 and BAYES_999 in your corpus.

Thanks, Dave. I use my various mailboxes (sa-learn --ham --mbox
/home/thomas.cameron/mail/INBOX/[mailbox file] and then sa-learn --spam
--mbox /home/thomas.cameron/mail/INBOX/spam) to train SA, doesn't that
mean that I've already checked my corpora?

Thomas

Re: Why shouldn't I set the score for SPAM_99 and SPAM_999 higher?

dwreski at guardiandigital

May 5, 2022, 12:28 PM

Post #7 of 8

>>> That's a great call, thanks. I grepped my mail files and didn't find
>>> any SPAM_99 headers in any of them.
>>
>> You should be looking for BAYES_99 and BAYES_999 in your corpus.
>
>
> Thanks, Dave. I use my various mailboxes (sa-learn --ham --mbox
> /home/thomas.cameron/mail/INBOX/[mailbox file] and then sa-learn --spam
> --mbox /home/thomas.cameron/mail/INBOX/spam) to train SA, doesn't that
> mean that I've already checked my corpora?

No, that's how you train your corpora. If you manually look through the
headers of mail that's already been processed by your mail system, the
ham should be as close to BAYES_00 as possible, and spam should be at
BAYES_99. If that's not the case, then it's been trained incorrectly.

I'd also recommend disabling auto-learn, if you have that enabled.

/etc/mail/spamassassin/local.cf:
bayes_auto_learn 0
bayes_auto_expire 0

If you've gone through your corpus manually, and are certain the ham is
all good mail and the spam emails are all bad mail, then it might be
worth it to dump the existing bayes database and just retrain it with
the corresponding mboxes.

I also typically add --progress to sa-learn.

Best,
Dave

>
> Thomas
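
The dump-and-retrain step Dave describes could look roughly like the sketch below. It assumes that "dump" here means discarding the old database, which is what sa-learn --clear does; the function names and the dry-run default are made up for this example, and the mbox paths are left as parameters because the thread does not list the individual mailbox files.

```python
import shlex
import subprocess


def retrain_commands(ham_mboxes, spam_mboxes, progress=True):
    """Build the sa-learn invocations for wiping and retraining Bayes."""
    extra = ["--progress"] if progress else []
    # Start by wiping the existing Bayes database.
    commands = [["sa-learn", "--clear"]]
    for path in ham_mboxes:
        commands.append(["sa-learn", "--ham", "--mbox", *extra, path])
    for path in spam_mboxes:
        commands.append(["sa-learn", "--spam", "--mbox", *extra, path])
    return commands


def retrain(ham_mboxes, spam_mboxes, dry_run=True):
    """Print the commands, or run them when dry_run is False."""
    for argv in retrain_commands(ham_mboxes, spam_mboxes):
        if dry_run:
            print(shlex.join(argv))
        else:
            # check=True stops at the first sa-learn failure.
            subprocess.run(argv, check=True)
```

Calling retrain() with the hand-checked ham and spam mboxes and the default dry_run=True only prints the commands, which makes it easy to review them before anything is wiped.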

Re: Why shouldn't I set the score for SPAM_99 and SPAM_999 higher?

thomas.cameron at camerontech

May 5, 2022, 12:53 PM

Post #8 of 8

On 5/5/22 14:28, Dave Wreski wrote:
> No, that's how you train your corpora. If you manually look through
> the headers of mail that's already been processed by your mail system,
> the ham should be as close to BAYES_00 as possible, and spam should be
> at BAYES_99. If that's not the case, then it's been trained incorrectly.
>
> I'd also recommend disabling auto-learn, if you have that enabled.
>
> /etc/mail/spamassassin/local.cf:
> bayes_auto_learn 0
> bayes_auto_expire 0
>
> If you've gone through your corpus manually, and are certain the ham
> is all good mail and the spam emails are all bad mail, then it might
> be worth it to dump the existing bayes database and just retrain it
> with the corresponding mboxes.
>
> I also typically add --progress to sa-learn.
>
> Best,
> Dave

Thanks, I appreciate it. I'll tune it a bit.

Thomas