Mailing List Archive

Bayes autolearn: how does it resolve whether rules are body or header related?
Dear fellow Spamassassin users,

I recently noticed that quite a lot of spam emails with high scores
weren't marked for Bayes autolearning. While some senders and receivers
were a common match, explaining why autolearn was no, there was no
clear explanation for other cases. I therefore put Spamassassin in debug
mode to check in more detail, and noticed that fairly often autolearn is
not used because the minimum score for body tests isn't achieved. After
looking at some specific cases, it seems however that several rules are
not considered when calculating the header rule score and body
rule score for Bayes autolearning. I've always presumed these scores are
calculated based on whether the underlying rule performs a regex on a
header or on the body, but now I'm not so sure any more. I hope you can
help clear up whether this is intended behaviour (and what that
behaviour is) or whether I should report this as a bug.

One example I noticed is URI_DEOBFU_INSTR=3.595. This is, if I understand
it correctly, a URI test that's performed on the body. Should a test like
this be counted towards the body score count? Then there's the question
of meta rules such as MONEY_NOHTML. If you resolve the different meta
levels within this rule, it's a combination of header and body, however
it's only counted towards the header score. Finally, it seems as if
custom rules I've added within local.cf aren't considered. Is that
indeed the case (and if so, is that by design)? I'm also not completely
sure if UNWANTED_BODY_LANGUAGE and tests like razor, pyzor and DCC are
considered for body scores.

Within the same realm, I'm also wondering whether these expected numbers
for body and header can be tweaked and if so, how. For example the case
below isn't autolearned even though it has a huge score and a vast
amount of tests going off, but seemingly not enough body-related scores.
Is that really the intended behaviour?

May  8 10:40:32 mail amavis[4076058]: (4076058-16)
header_edits_for_quar: <fineart@dasanart.com> ->
<gdpr@notgoingtoshare.tld>, Yes, score=24.619 tag=-9999 tag2=5 kill=7.5
tests=[.ADVANCE_FEE_3_NEW_MONEY=0.001,
AXB_XMAILER_MIMEOLE_OL_024C2=0.001, BAYES_50=0.8, BERT_KULSPAM=1,
FORGED_MUA_OUTLOOK=1.927, FREEMAIL_FORGED_REPLYTO=2.095,
FREEMAIL_REPLYTO=1, FREEMAIL_REPLYTO_END_DIGIT=0.25,
FROM_MISSPACED=0.001, FROM_MISSP_EH_MATCH=0.001,
FROM_MISSP_FREEMAIL=0.001, FROM_MISSP_MSFT=0.001,
FROM_MISSP_REPLYTO=2.497, FSL_BULK_SIG=0.001, FSL_CTYPE_WIN1251=0.001,
FSL_NEW_HELO_USER=0.001, KHOP_HELO_FCRDNS=0.398, LOTS_OF_MONEY=0.001,
MISSING_HEADERS=1.021, MISSING_MID=0.497, MONEY_FREEMAIL_REPTO=1.202,
MONEY_FROM_MISSP=0.001, MONEY_NOHTML=2.497, NSL_RCVD_HELO_USER=0.001,
PYZOR_CHECK=1.392, REPLYTO_WITHOUT_TO_CC=1.552, REPTO_419_FRAUD=2.996,
SPF_HELO_NONE=0.001, TO_NO_BRKTS_FROM_MSSP=1.593,
TO_NO_BRKTS_MSFT=1.888, XFER_LOTSA_MONEY=0.001] autolearn=no
autolearn_force=no

Thank you in advance for your help. If you need any more examples or
would like us to run some tests, then feel free to let me know.

Kind regards,
Bert Van de Poel
ULYSSIS
Re: Bayes autolearn: how does it resolve whether rules are body or header related?
On Sun, 9 May 2021 04:17:26 +0200
Bert Van de Poel wrote:


> Within the same realm, I'm also wondering whether these expected
> numbers for body and header can be tweaked and if so, how.

You can create a meta-rule for definite spam and set:

tflags <rule name> autolearn_force

A hit on any rule with this flag set causes the 3+3 check to be
ignored. It does nothing else.
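
As a minimal sketch of that suggestion, a local.cf fragment might look like the following. The rule name LOCAL_DEFINITE_SPAM, the choice of sub-rules (borrowed from the log earlier in the thread) and the token score are illustrative assumptions, not a recommended ruleset:

```
# Hypothetical meta rule: fires only when several strong fraud indicators agree.
# The name and the combination of sub-rules are assumptions for illustration.
meta     LOCAL_DEFINITE_SPAM  (REPTO_419_FRAUD && MONEY_NOHTML && FREEMAIL_FORGED_REPLYTO)
describe LOCAL_DEFINITE_SPAM  Several advance-fee fraud indicators hit together
score    LOCAL_DEFINITE_SPAM  0.001

# Skip the 3+3 header/body point check whenever this rule hits.
tflags   LOCAL_DEFINITE_SPAM  autolearn_force
```

Since the flag only removes the header/body split requirement, the message would still have to reach the usual autolearn spam threshold. Running `spamassassin --lint` after editing is a quick way to catch syntax errors in a fragment like this.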



One thing that does look wrong is that maybe_body_only() looks
for:

(($type == $TYPE_BODY_TESTS) || ($type == $TYPE_BODY_EVALS)
|| ($type == $TYPE_URI_TESTS) || ($type == $TYPE_URI_EVALS))

so it's missing any rawbody and full rules.


Specifically Pyzor, Razor2 and DCC are full eval rules.
Re: Bayes autolearn: how does it resolve whether rules are body or header related?
On 09.05.21 04:17, Bert Van de Poel wrote:
>May  8 10:40:32 mail amavis[4076058]: (4076058-16)
>header_edits_for_quar: <fineart@dasanart.com> ->
><gdpr@notgoingtoshare.tld>, Yes, score=24.619 tag=-9999 tag2=5
>kill=7.5 tests=[.ADVANCE_FEE_3_NEW_MONEY=0.001,
>AXB_XMAILER_MIMEOLE_OL_024C2=0.001, BAYES_50=0.8, BERT_KULSPAM=1,
>FORGED_MUA_OUTLOOK=1.927, FREEMAIL_FORGED_REPLYTO=2.095,
>FREEMAIL_REPLYTO=1, FREEMAIL_REPLYTO_END_DIGIT=0.25,
>FROM_MISSPACED=0.001, FROM_MISSP_EH_MATCH=0.001,
>FROM_MISSP_FREEMAIL=0.001, FROM_MISSP_MSFT=0.001,
>FROM_MISSP_REPLYTO=2.497, FSL_BULK_SIG=0.001, FSL_CTYPE_WIN1251=0.001,
>FSL_NEW_HELO_USER=0.001, KHOP_HELO_FCRDNS=0.398, LOTS_OF_MONEY=0.001,
>MISSING_HEADERS=1.021, MISSING_MID=0.497, MONEY_FREEMAIL_REPTO=1.202,
>MONEY_FROM_MISSP=0.001, MONEY_NOHTML=2.497, NSL_RCVD_HELO_USER=0.001,
>PYZOR_CHECK=1.392, REPLYTO_WITHOUT_TO_CC=1.552, REPTO_419_FRAUD=2.996,
>SPF_HELO_NONE=0.001, TO_NO_BRKTS_FROM_MSSP=1.593,
>TO_NO_BRKTS_MSFT=1.888, XFER_LOTSA_MONEY=0.001] autolearn=no
>autolearn_force=no

Looks like most of those are meta rules. The ones that aren't:

header FREEMAIL_REPLYTO_END_DIGIT
header MISSING_HEADERS
body BAYES_50
header SPF_HELO_NONE
header FSL_CTYPE_WIN1251
header NSL_RCVD_HELO_USER
header REPTO_419_FRAUD

score FREEMAIL_REPLYTO_END_DIGIT 0.25
score MISSING_HEADERS 0.915 1.207 1.204 1.021
score SPF_HELO_NONE 0.001

so you don't have points from body rules.

your mentioned URI_DEOBFU_INSTR is a meta rule:

meta URI_DEOBFU_INSTR __URI_DEOBFU_INSTR && !__MSGID_OK_HOST

so maybe it's not considered.


--
Matus UHLAR - fantomas, uhlar@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Linux IS user friendly, it's just selective who its friends are...
Re: Bayes autolearn: how does it resolve whether rules are body or header related?
On Sun, 9 May 2021 20:03:27 +0200
Matus UHLAR - fantomas wrote:


> so you don't have points from body rules.
>
> your mentioned URI_DEOBFU_INSTR is a meta rule:
>
> meta URI_DEOBFU_INSTR __URI_DEOBFU_INSTR && !__MSGID_OK_HOST
>
> so maybe it's not considered.

They are treated as header, or ignored if marked as net.
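
To illustrate that classification, here is a hedged local.cf sketch. All LOCAL_* rule names and patterns are hypothetical, and the comments only restate how this thread describes the autolearn accounting:

```
# Plain header rule: its score counts toward the header points.
header   LOCAL_HDR_EXAMPLE       Subject =~ /\binheritance\b/i
score    LOCAL_HDR_EXAMPLE       1.0

# Plain body rule: its score counts toward the body points.
body     LOCAL_BODY_EXAMPLE      /\bwire transfer\b/i
score    LOCAL_BODY_EXAMPLE      1.0

# Meta rule without "tflags net": its score is counted as header points,
# even though one of its sub-rules is a body rule.
meta     LOCAL_META_EXAMPLE      (LOCAL_HDR_EXAMPLE && LOCAL_BODY_EXAMPLE)
score    LOCAL_META_EXAMPLE      2.0

# Meta rule flagged as net: ignored for both header and body points.
meta     LOCAL_NET_META_EXAMPLE  (PYZOR_CHECK && LOCAL_BODY_EXAMPLE)
tflags   LOCAL_NET_META_EXAMPLE  net
score    LOCAL_NET_META_EXAMPLE  1.5
```

Under this accounting, a message that mostly hits meta rules can reach a high total score while still collecting very few body points.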
Re: Bayes autolearn: how does it resolve whether rules are body or header related?
>> so you don't have points from body rules.
>>
>> your mentioned URI_DEOBFU_INSTR is a meta rule:
>>
>> meta URI_DEOBFU_INSTR __URI_DEOBFU_INSTR && !__MSGID_OK_HOST
>>
>> so maybe it's not considered.
>
> They are treated as header, or ignored if marked as net.

I think a bug report should be submitted for this.

Either they should be split 50/50 between header and body score, or when
the metas are built they should get a "body rule" flag that is then used to
determine where the score goes.

I tried, but for some reason apache decided that I'm evil and blocked the
submission attempt, so someone else can do it.

Loren
Re: Bayes autolearn: how does it resolve whether rules are body or header related?
Dear Loren,

Thank you very much for your email. Based on your message I could deduce
there were earlier messages (which I then read through a web archive).
For some unexplained reason I never received the previous 3 responses to
my email. I hope the university network isn't randomly over-filtering
spam again (we've had those kinds of problems for a while now, and it's
quite a nuisance; we are much more careful about how we mark spam).

Based on what I've read, I agree that this is indeed a bug (or actually
several). I've filed the following bug reports:
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7904 (missing body
types, as mentioned by RW)
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7905 (meta tflags=net
tests are ignored)
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7906 (meta
tflags!=net tests are always header tests)
https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7907 (better support
for meta tests in autolearning in general, with 2 possible solutions)

Thank you very much to RW and Matus Uhlar for helping me figure out what
code to look at, and to all three of you for confirming that this is clearly
a set of bugs.

Feel free to file more bugs if you consider there are more based on my
issue, as well as to give support, write suggestions or submit patches
on the bugs I have already filed.

Kind regards,
Bert Van de Poel

Re: Bayes autolearn: how does it resolve whether rules are body or header related?
On Mon, 10 May 2021 20:39:31 +0200
Bert Van de Poel wrote:


> Based on what I've read, I agree that this is indeed a bug (or
> actually several). I've filed the following bug reports:
> https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7904 (missing body
> types, as mentioned by RW)
> https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7905 (meta
> tflags=net tests are ignored)
> https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7906 (meta
> tflags!=net tests are always header tests)
> https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7907 (better
> support for meta tests in autolearning in general, with 2 possible
> solutions)
>
> Thank you very much to RW and Matus Uhlar for helping me figure out
> what code to look at, and to all three of you for confirming that this is
> clearly a set of bugs.


I don't agree that they are bugs. I think it would be useful to add
missing body types, but I don't think the rest is hugely wrong, and
it's not sensible for anyone to spend a lot of time on it, particularly
when it is so easy to turn off the 3+3 test selectively with
autolearn_force.

Net meta rules usually contain scored net eval rules so it's sensible
to ignore them. Treating meta rules as header points seems to be erring
on the right side. There's a case for ignoring meta rules altogether.

Autolearning is something that's best avoided if at all possible.
Erring on the side of avoiding mistraining is a good thing.