Mailing List Archive

SHORT_WORD_LINES & KAM_LINEPADDING
Hi,

I'm curious about the SHORT_WORD_LINES, KAM_LINEPADDING and HK_RANDOM
rules. I received a legitimate email from a gmail sender that was pushed
beyond 5.0 because of these rules. It hit both SCC_5_SHORT_WORD_LINES and
SCC_10_SHORT_WORD_LINES, and because a score isn't explicitly set, the two
rules added 2.0 points to the score.

describe SCC_5_SHORT_WORD_LINES 5 lines with many short words
meta SCC_5_SHORT_WORD_LINES __SCC_SHORT_WORDS >= 5
describe SCC_10_SHORT_WORD_LINES 10 lines with many short words
meta SCC_10_SHORT_WORD_LINES __SCC_SHORT_WORDS >= 10
describe SCC_20_SHORT_WORD_LINES 20 lines with many short words
meta SCC_20_SHORT_WORD_LINES __SCC_SHORT_WORDS >= 20
describe SCC_35_SHORT_WORD_LINES 35 lines with many short words
meta SCC_35_SHORT_WORD_LINES __SCC_SHORT_WORDS >= 35
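
The four meta rules above are cumulative tiers over the same counter, so a message that reaches one threshold also fires every lower one. A minimal Python sketch of that behaviour follows. It is an illustration, not SpamAssassin code; the function name is hypothetical, and the 1.0 figure is the default score that applies when a rule has no explicit "score" line.

```python
# Illustrative model of the four SCC_*_SHORT_WORD_LINES meta rules.
# Thresholds are taken from the rules quoted above; DEFAULT_SCORE is the
# score a rule gets when no explicit "score" line is set for it.
THRESHOLDS = (5, 10, 20, 35)
DEFAULT_SCORE = 1.0


def short_word_rules_hit(short_word_lines):
    """Return the names of the tier rules that fire for a given line count."""
    return [
        f"SCC_{threshold}_SHORT_WORD_LINES"
        for threshold in THRESHOLDS
        if short_word_lines >= threshold
    ]


# Hypothetical message with 12 qualifying lines: the 5 and 10 tiers both fire.
hits = short_word_rules_hit(12)
print(hits, len(hits) * DEFAULT_SCORE)
```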

KAM_LINEPADDING was hit because it was a longer email chain with many
quoted lines containing only a ">" character.

rawbody __KAM_LINEPADDING /(\n[^\n]){8}/
meta KAM_LINEPADDING (__KAM_LINEPADDING >= 1)
score KAM_LINEPADDING 1.2
describe KAM_LINEPADDING Spam that tries to get past blank line filters
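
To see why a deeply quoted thread trips this pattern, here is a rough Python sketch that applies the same regular expression to a whole string. It is only an approximation: SpamAssassin feeds rawbody rules the message in its own chunks, and the sample texts and function name below are made up for illustration.

```python
import re

# Same pattern as the __KAM_LINEPADDING rule: eight consecutive occurrences
# of "newline followed by exactly one non-newline character", which means a
# run of lines that each hold a single character.
LINEPADDING = re.compile(r"(\n[^\n]){8}")


def looks_padded(body):
    """Return True if the text contains a run that the pattern matches."""
    return LINEPADDING.search(body) is not None


# Hypothetical samples: a reply chain with nine ">"-only lines, and a short
# ordinary message with real blank lines.
quoted_chain = "> earlier reply\n" + ">\n" * 9 + "> older text\n"
ordinary = "Hello,\n\nThanks for the update.\n\nRegards\n"

print(looks_padded(quoted_chain), looks_padded(ordinary))
```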

1.0 HK_RANDOM_FROM From username looks random
1.0 HK_RANDOM_ENVFROM Envelope sender username looks random

The envelope-from and From addresses were the same
(killercopywritingltd@gmail.com), so because they "look random" another 2.0
points were added.

On top of that, the IP address Gmail used to send it had a relatively poor
sender score:
0.7 RCVD_IN_SENDERSCORE_70_79 RBL: Senderscore.org score of 70 to 79
[209.85.208.54 listed in score.senderscore.com]

It also hit BAYES_50, which pushed it beyond 5.0.
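
For reference, the scores quoted in this message can be tallied with a few lines of Python. This is only a subtotal: the BAYES_50 score and any other hits in the full report (including negative-scoring ones) are not quoted here, so they are left out rather than guessed.

```python
# Scores exactly as quoted in this message. BAYES_50 and any other rules the
# message hit are omitted because their scores are not given here.
quoted_hits = {
    "SCC_5_SHORT_WORD_LINES": 1.0,
    "SCC_10_SHORT_WORD_LINES": 1.0,
    "KAM_LINEPADDING": 1.2,
    "HK_RANDOM_FROM": 1.0,
    "HK_RANDOM_ENVFROM": 1.0,
    "RCVD_IN_SENDERSCORE_70_79": 0.7,
}

# Round to one decimal place to avoid floating-point noise in the sum.
subtotal = round(sum(quoted_hits.values()), 1)
print(subtotal)
```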

Of course I could welcomelist the sender, train bayes or manually reduce
the scores of these rules, but they stood out to me as something that's
worth consideration. Should they be reevaluated?
Re: SHORT_WORD_LINES & KAM_LINEPADDING
On 2023-03-16 at 12:35:12 UTC-0400 (Thu, 16 Mar 2023 12:35:12 -0400)
Alex <mysqlstudent@gmail.com>
is rumored to have said:

> Hi,
>
> I'm curious about the SHORT_WORD_LINES, KAM_LINEPADDING and HK_RANDOM
> rules.

The SCC_*_SHORT_WORD_LINES cluster is my fault, developed into its
current state for KAM. KAM_LINEPADDING is (obviously) also from the KAM
channel, but I'm unfamiliar with its motivation. Unless KAM has recently
(and quietly) started doing so, there is no structured QA or rescoring
for that channel.

The HK_RANDOM_* rules are part of the default channel, and so they go
through QA and rescoring based on the masscheck results we get.

> I received a legitimate email from a gmail sender that was pushed
> beyond 5.0 because of these rules. It hit both SCC_5_SHORT_WORD_LINES
> and SCC_10_SHORT_WORD_LINES,

That's impressive. Legitimate mail with ten or more lines that each
contain 11 or more consecutive 1-4 letter words is not common.
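
As a rough illustration of what "lines with 11 or more consecutive 1-4 letter words" means, here is a Python sketch that counts such lines. The regular expression and function name are assumptions made for this example; the actual __SCC_SHORT_WORDS definition is not quoted in this thread and may well differ in detail.

```python
import re

# Assumed reading of the description: a line counts when it contains at least
# 11 consecutive words of 1 to 4 letters (ten followed by separators, plus one
# more). This is NOT the real __SCC_SHORT_WORDS rule.
SHORT_RUN = re.compile(r"(?:\b[A-Za-z]{1,4}\b\W+){10}\b[A-Za-z]{1,4}\b")


def count_short_word_lines(body):
    """Count the lines that contain a qualifying run of short words."""
    return sum(1 for line in body.splitlines() if SHORT_RUN.search(line))


# Hypothetical sample text: two qualifying lines and one ordinary line.
sample = (
    "we do not go to the zoo at all if it is wet\n"
    "This sentence contains several considerably longer words\n"
    "we do not go to the zoo at all if it is wet\n"
)
print(count_short_word_lines(sample))
```

Under this reading, a message would need ten such lines to reach the SCC_10 tier, which is why it is unusual in ordinary prose.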

> and because a score isn't explicitly set, the two
> rules added 2.0 points to the score.

That's intentional. The default 1.0 score and those thresholds worked
well for the mail streams the rules were targeting when I came up with
them. They still work well where I use them, although I occasionally
see non-spam hit the 20 rule. The original spam phyla that were hitting
the 35 rule have abated in the past year, but I've never had a false
positive problem with them.

In general, many sites will find it necessary to broadly reduce rule
scores or raise the spam threshold when adding the KAM channel to an
installation that uses a common basic configuration, especially with
Bayes and AWL/TxRep active. You should also be prepared to welcomelist
senders liberally and/or make use of the
more_spam_to/all_spam_to/whitelist_to settings to selectively relax the
effective threshold based on the "To:" header.

[...]

> Of course I could welcomelist the sender, train bayes or manually
> reduce the scores of these rules, but they stood out to me as something
> that's worth consideration. Should they be reevaluated?

Rules in the ruleset distributed by the ASF SpamAssassin project (e.g.
HK_RANDOM_*) are normally re-evaluated nightly by the RuleQA process,
which enables, disables, and re-scores rules based on the masscheck
submissions we receive. This has been broken for a few weeks because of
a problem in one site's masscheck submission process, but it is my
understanding that this has now been fixed.

The QA of the KAM rule channel is much more opaque, as it is
fundamentally a custom local ruleset used by PCCC's Raptor service. This
means that it is designed to work best with predominantly "small
business" email. It also is designed to work in conjunction with mature
self-service quarantine and welcomelist/blocklist tools. You can report
problems at https://raptor.pccc.com/raptor.cgim?template=report_problem
(Select "KAM.cf Issues" from the subject selection pull-down.)


--
Bill Cole
bill@scconsult.com or billcole@apache.org
(AKA @grumpybozo and many *@billmail.scconsult.com addresses)
Not Currently Available For Hire