Mailing List Archive

Really hard-to-filter spam
Hey, all. I've recently started getting spam that's really hard to deal
with, and I'm open to suggestions as to how to approach it.
Superficially, they all look much like this:

Sender: "ivy" <epltbv@rehc.com>
From: "ivy" <bkwtzk@rehc.com>
To: ken@jots.org
Date: 27 Jul 2023 06:46:13 +0800
Subject: cxUP
---
mnGRZIrmMwvufsQdRRJ?Nlh?132-1532-1334

Now, the _only_ thing that stays the same is the /132.1532.1334/ (even
the separators change). "Well, great, Ken. Use a regex and zap 'em."
I did, and the regex did nothing, which completely confused me. So I
actually _looked_ at the damn e-mail:
---------------------------- cut here --------------------------
Subject: cxUP
Content-Type: multipart/alternative;
boundary=--boundary_1294650_c95a1e92-a32e-44c3-b4a8-21415b9755c6


----boundary_1294650_c95a1e92-a32e-44c3-b4a8-21415b9755c6
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: base64

bW5HUlpJcm1Nd3Z1ZnNRZFJSSuWIkU5saOmjhDEzMi0xNTMyLTEzMzQ=
----boundary_1294650_c95a1e92-a32e-44c3-b4a8-21415b9755c6
Content-Type: text/html; charset=utf-8
Content-Transfer-Encoding: base64
---------------------------- cut here --------------------------

The damn body's been encoded! And there's so little in there that it's
not triggering on many rules (e.g., Bayesian doesn't go over 20%). If
anyone has a bright idea -- maybe a way to decode the attachments and
run a regex against _that_? -- I'm all ears.

Thanks much,

-Ken
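
One way to see what a rule would actually be matched against is to decode the part by hand. What follows is a minimal Python sketch, not SpamAssassin's own code: it feeds a trimmed copy of the sample above (only the text/plain part, with the boundary quoted and a closing delimiter added so it parses cleanly) to the standard-library email package and runs a regex over the decoded text. The names RAW, decoded_text_parts and PATTERN are made up for illustration.

```python
import re
from email import message_from_string, policy

# Trimmed copy of the sample message: a single Base64-encoded text/plain part.
RAW = """\
Subject: cxUP
Content-Type: multipart/alternative;
 boundary="--boundary_1294650_c95a1e92-a32e-44c3-b4a8-21415b9755c6"

----boundary_1294650_c95a1e92-a32e-44c3-b4a8-21415b9755c6
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: base64

bW5HUlpJcm1Nd3Z1ZnNRZFJSSuWIkU5saOmjhDEzMi0xNTMyLTEzMzQ=
----boundary_1294650_c95a1e92-a32e-44c3-b4a8-21415b9755c6--
"""

# Illustrative pattern: the three digit groups with any non-digit separators.
PATTERN = re.compile(r"132\D*1532\D*1334")


def decoded_text_parts(raw):
    """Yield the decoded text of every text/* part in a raw message."""
    msg = message_from_string(raw, policy=policy.default)
    for part in msg.walk():
        if part.get_content_maintype() == "text":
            # get_content() undoes the transfer encoding and the charset.
            yield part.get_content()


texts = list(decoded_text_parts(RAW))
hits = [bool(PATTERN.search(text)) for text in texts]
for text, hit in zip(texts, hits):
    print(repr(text), "matches" if hit else "does not match")
```

If the decoded text matches here but a rule still does not fire, the encoding is probably not the culprit.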
RE: Really hard-to-filter spam
>
> Hey, all. I've recently started getting spam that's really hard to deal
> with, and I'm open to suggestions as to how to approach it.
> Superficially, they all look much like this:
>

Post the complete message source including headers.
Re: Really hard-to-filter spam
On 7/27/2023 12:08 PM, Ken D'Ambrosio wrote:
> Hey, all. I've recently started getting spam that's really hard to
> deal with, and I'm open to suggestions as to how to approach it.
> Superficially, they all look much like this:
>
> Sender: "ivy" <epltbv@rehc.com>
> From: "ivy" <bkwtzk@rehc.com>
> To: ken@jots.org
> Date: 27 Jul 2023 06:46:13 +0800
> Subject: cxUP
> ---
> mnGRZIrmMwvufsQdRRJ?Nlh?132-1532-1334
>
> Now, the _only_ thing that stays the same is the /132.1532.1334/ (even
> the separators change).  "Well, great, Ken.  Use a regex and zap
> 'em."  I did, and the regex did nothing, which completely confused
> me.  So I actually _looked_ at the damn e-mail:
>
>
> The damn body's been encoded!  And there's so little in there that
> it's not triggering on many rules (e.g., Bayesian doesn't go over
> 20%).  If anyone has a bright idea -- maybe a way to decode the
> attachments and run a regex against _that_? -- I'm all ears.
>

1.  There are milters/content-filters that decode Base64 message parts
(amavisd-new, mimedefang, etc) for processing by SA.
2.  There are still sufficiently unique items: First-Name-Only,
Mixed-Case word in the Subject (NLP modeling), and a Base-64 encoded
HTML attachment (w/ UTF-8 encoding no less).  Combined in a Meta rule,
these innocuous items will likely hit with good accuracy even without
Base64 decoding.

$0.02,

-- Jared Hall
Re: Really hard-to-filter spam
On Fri, 28 Jul 2023, Jared Hall wrote:

> On 7/27/2023 12:08 PM, Ken D'Ambrosio wrote:
>> Hey, all. I've recently started getting spam that's really hard to deal
>> with, and I'm open to suggestions as to how to approach it. Superficially,
[snip..]
>> The damn body's been encoded!  And there's so little in there that it's not
>> triggering on many rules (e.g., Bayesian doesn't go over 20%).  If anyone
>> has a bright idea -- maybe a way to decode the attachments and run a regex
>> against _that_? -- I'm all ears.
>>
>
> 1.  There are milters/content-filters that decode Base64 message parts
> (amavisd-new, mimedefang, etc) for processing by SA.
> 2.  There are still sufficiently unique items: First-Name-Only, Mixed-Case
> word in the Subject (NLP modeling), and a Base-64 encoded HTML attachment (w/
> UTF-8 encoding no less).  Combined in a Meta rule, these innocuous items will
> likely hit with good accuracy even without Base64 decoding.

Umm, unless I'm really missing something here the usual SA processing decodes
such body stuff (QP, Base64, etc) and feeds the "cleaned" text to the rule
processing engine.

You have to work hard to get matches done on the raw stuff if you want to do
special rule matching on the un-decoded body.


--
Dave Funk                               University of Iowa
<dbfunk (at) engineering.uiowa.edu>     College of Engineering
319/335-5751   FAX: 319/384-0549        1256 Seamans Center, 103 S Capitol St.
Sys_admin/Postmaster/cell_admin         Iowa City, IA 52242-1527
#include <std_disclaimer.h>
Better is not better, 'standard' is better. B{
Re: Really hard-to-filter spam
On 2023-07-28 at 00:26:51 UTC-0400 (Thu, 27 Jul 2023 23:26:51 -0500
(CDT))
David B Funk <users@spamassassin.apache.org>
is rumored to have said:

> On Fri, 28 Jul 2023, Jared Hall wrote:
>
>> On 7/27/2023 12:08 PM, Ken D'Ambrosio wrote:
>>> Hey, all. I've recently started getting spam that's really hard to
>>> deal with, and I'm open to suggestions as to how to approach it.
>>> Superficially,
> [snip..]
>>> The damn body's been encoded!  And there's so little in there that
>>> it's not triggering on many rules (e.g., Bayesian doesn't go over
>>> 20%).  If anyone has a bright idea -- maybe a way to decode the
>>> attachments and run a regex against _that_? -- I'm all ears.
>>>
>>
>> 1.  There are milters/content-filters that decode Base64 message
>> parts (amavisd-new, mimedefang, etc) for processing by SA.
>> 2.  There are still sufficiently unique items: First-Name-Only,
>> Mixed-Case word in the Subject (NLP modeling), and a Base-64 encoded
>> HTML attachment (w/ UTF-8 encoding no less).  Combined in a Meta
>> rule, these innocuous items will likely hit with good accuracy even
>> without Base64 decoding.
>
> Umm, unless I'm really missing something here the usual SA processing
> decodes such body stuff (QP, Base64, etc) and feeds the "cleaned" text
> to the rule processing engine.

Correct. It has nothing to do with the calling glue.

> You have to work hard to get matches done on the raw stuff if you want
> to do special rule matching on the un-decoded body.

Correct. That should only be needed in rare cases where you're looking
for a pattern in a non-text part.

I'm not sure why the OP's rule didn't match the target message, but it
is NOT because of the Base64 encoding of parts with the 'text' primary
MIME type. If I had to guess, I'd look for invisible characters hidden
in the text (e.g. Unicode "zero width non-joiner" marks and the like)
that break the pattern and for lookalike non-ASCII characters (often
Cyrillic or Greek) in the target string.
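
A quick way to hunt for that kind of thing is to list every non-ASCII character in the decoded text together with its Unicode name. This is only a sketch in Python using the standard unicodedata module; the sample string is hypothetical (a zero-width non-joiner planted inside the first digit group), and suspicious_chars is a made-up helper name.

```python
import re
import unicodedata


def suspicious_chars(text):
    """Return (index, codepoint, category, name) for every non-ASCII character."""
    found = []
    for index, ch in enumerate(text):
        if ord(ch) < 128:
            continue
        # Category "Cf" covers invisible format characters such as ZWNJ.
        category = unicodedata.category(ch)
        name = unicodedata.name(ch, "UNNAMED")
        found.append((index, f"U+{ord(ch):04X}", category, name))
    return found


# Hypothetical sample: looks like "132-1532-1334" on screen.
sample = "13\u200c2-1532-1334"
found = suspicious_chars(sample)
pattern_hit = re.search(r"132\D*1532\D*1334", sample)

for entry in found:
    print(entry)
print("pattern matched" if pattern_hit else "pattern did not match")
```

A hidden character between the digit groups would still be absorbed by \D*, so it only breaks the pattern when it lands inside one of the groups, as in this sample.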

--
Bill Cole
bill@scconsult.com or billcole@apache.org
(AKA @grumpybozo and many *@billmail.scconsult.com addresses)
Not Currently Available For Hire
Re: Really hard-to-filter spam
On 2023-07-27 at 12:08:02 UTC-0400 (Thu, 27 Jul 2023 12:08:02 -0400)
Ken D'Ambrosio <ken@jots.org>
is rumored to have said:

> Hey, all. I've recently started getting spam that's really hard to
> deal with, and I'm open to suggestions as to how to approach it.
> Superficially, they all look much like this:
>
> Sender: "ivy" <epltbv@rehc.com>
> From: "ivy" <bkwtzk@rehc.com>
> To: ken@jots.org
> Date: 27 Jul 2023 06:46:13 +0800
> Subject: cxUP
> ---
> mnGRZIrmMwvufsQdRRJ?Nlh?132-1532-1334
>
> Now, the _only_ thing that stays the same is the /132.1532.1334/ (even
> the separators change). "Well, great, Ken. Use a regex and zap 'em."
> I did, and the regex did nothing, which completely confused me. So I
> actually _looked_ at the damn e-mail:
> ---------------------------- cut here --------------------------
> Subject: cxUP
> Content-Type: multipart/alternative;
> boundary=--boundary_1294650_c95a1e92-a32e-44c3-b4a8-21415b9755c6
>
>
> ----boundary_1294650_c95a1e92-a32e-44c3-b4a8-21415b9755c6
> Content-Type: text/plain; charset=utf-8
> Content-Transfer-Encoding: base64
>
> bW5HUlpJcm1Nd3Z1ZnNRZFJSSuWIkU5saOmjhDEzMi0xNTMyLTEzMzQ=
> ----boundary_1294650_c95a1e92-a32e-44c3-b4a8-21415b9755c6
> Content-Type: text/html; charset=utf-8
> Content-Transfer-Encoding: base64
> ---------------------------- cut here --------------------------
>
> The damn body's been encoded!

A very common situation that SpamAssassin handles transparently, as
documented.
In SA 4.0, rules are checked against the text parts of a message after
they have been QP or B64 decoded and (by default) normalized to UTF-8.

> And there's so little in there that it's not triggering on many rules
> (e.g., Bayesian doesn't go over 20%). If anyone has a bright idea --
> maybe a way to decode the attachments and run a regex against _that_?
> -- I'm all ears.

Debug your regex rule. This SHOULD work to catch that specific number
grouping, with or without delimiting non-digits:

body BAD_NUMBER /132\D*1532\D*1334/
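
To sanity-check the pattern outside SA, the same expression can be tried against a few strings. This is a rough sketch using Python's re module as a stand-in for the Perl regex engine that SA uses (for these inputs \D behaves the same way); apart from the first sample, which is the body as displayed in the original report, the test strings are invented.

```python
import re

# Same expression as the BAD_NUMBER rule above.
BAD_NUMBER = re.compile(r"132\D*1532\D*1334")

# Sample text mapped to whether the rule is expected to hit.
samples = {
    "mnGRZIrmMwvufsQdRRJ?Nlh?132-1532-1334": True,  # body as shown in the report
    "call 132.1532.1334 now": True,                 # different separators
    "132 1532 1334": True,                          # spaces as separators
    "13215321334": True,                            # no separators at all
    "132-1532-1335": False,                         # a different number
}

results = {text: bool(BAD_NUMBER.search(text)) for text in samples}
for text, hit in results.items():
    print(f"{str(hit):5}  {text}")
```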



--
Bill Cole
Re: Really hard-to-filter spam
>>> On 7/27/2023 12:08 PM, Ken D'Ambrosio wrote:
>>>> Hey, all. I've recently started getting spam that's really hard to
>>>> deal with, and I'm open to suggestions as to how to approach it.
>>>> Superficially,
> I'm not sure why the OP's rule didn't match the target message, but it
> is NOT because of the Base64 encoding of parts with the 'text' primary
> MIME type. If I had to guess, I'd look for invisible characters hidden
> in the text (e.g. Unicode "zero width non-joiner" marks and the like)
> that break the pattern and for lookalike non-ASCII characters (often
> Cyrillic or Greek) in the target string.

Sweet! The assistance of those who actually felt like assisting,
instead of simply critiquing, is much appreciated. I see some
assumptions I made were wrong (e.g., decoding apparently isn't a
problem), and I'm guessing it is probably something stupid like Unicode.
I'll also make sure I match those other rules; my rules file, I now
realize, is ancient, and likely badly needs to be made more current.

Much appreciated!

-Ken
Re: Really hard-to-filter spam
On 7/28/2023 1:49 AM, Ken D'Ambrosio wrote:
>>>> On 7/27/2023 12:08 PM, Ken D'Ambrosio wrote:
>>>>> Hey, all. I've recently started getting spam that's really hard to
>>>>> deal with, and I'm open to suggestions as to how to approach it.
>>>>> Superficially,
>
> Sweet!  The assistance of those who actually felt like assisting,
> instead of simply critiquing, is much appreciated.  I see some
> assumptions I made were wrong (e.g., decoding apparently isn't a
> problem), and I'm guessing it is probably something stupid like
> Unicode.  I'll also make sure I match those other rules; my rules
> file, I now realize, is ancient, and likely badly needs to be made
> more current.

Bill and Dave are correct.  The calling glue does NOT make a difference.
Using your sample data as is,

Content-Type: multipart/alternative;
boundary=--boundary_1294650_c95a1e92-a32e-44c3-b4a8-21415b9755c6


----boundary_1294650_c95a1e92-a32e-44c3-b4a8-21415b9755c6
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: base64

bW5HUlpJcm1Nd3Z1ZnNRZFJSSuWIkU5saOmjhDEzMi0xNTMyLTEzMzQ=


it works fine and hits your regex.  I tested this with SA v3.4.6, Perl
5.26.1.  Maybe check for the presence of the MIME::Base64 and
Unicode::UTF8 modules (something like "instmodsh" option "l").

Just another great mystery; like Bigfoot, Pyramids, UFOs, Crop Circles,
Plains of Nazca, and Microsoft Fax Server.


-- Jared Hall
Re: Really hard-to-filter spam
On 7/28/23 00:23, Bill Cole wrote:
>>> 1.  There are milters/content-filters that decode Base64 message
>>> parts (amavisd-new, mimedefang, etc) for processing by SA.
>>> 2.  There are still sufficiently unique items: First-Name-Only,
>>> Mixed-Case word in the Subject (NLP modeling), and a Base-64 encoded
>>> HTML attachment (w/ UTF-8 encoding no less).  Combined in a Meta
>>> rule, these innocuous items will likely hit with good accuracy even
>>> without Base64 decoding.
>>
>> Umm, unless I'm really missing something here the usual SA processing
>> decodes such body stuff (QP, Base64, etc) and feeds the "cleaned"
>> text to the rule processing engine.
>
> Correct. It has nothing to do with the calling glue.
>
>> You have to work hard to get matches done on the raw stuff if you
>> want to do special rule matching on the un-decoded body.
>
> Correct. That should only be needed in rare cases where you're looking
> for a pattern in a non-text part.
>
> I'm not sure why the OP's rule didn't match the target message, but it
> is NOT because of the Base64 encoding of parts with the 'text' primary
> MIME type. If I had to guess, I'd look for invisible characters hidden
> in the text (e.g. Unicode "zero width non-joiner" marks and the like)
> that break the pattern and for lookalike non-ASCII characters (often
> Cyrillic or Greek) in the target string.

I am seeing the same issue. I get those same emails, with that
132.1532.1334 string or similar. SA is definitely not catching them,
even though I dump them into my spam folder and run sa-learn --spam
against them day after day. How can I check to see if it's actually
decoding the base64? Or is that just a fact? It seems incredibly weird
that I get these things every day, I mark them as spam every day, and
they never hit more than a couple of points on the spam scale.

Thomas
Re: Really hard-to-filter spam
On 8/2/23 13:28, Reindl Harald wrote:
> then i bet you have the same "RCVD_IN_ZEN_BLOCKED_OPENDNS" as the OP
> which means you are not capable to operate a mailserver
>
> https://www.spamhaus.org/returnc/pub/
>
> thrown against our spam filter it would be blocked without any
> question - above 8.0 points the spamass-milter rejects
>
> Content analysis details:   (32.3 points, 5.5 required)
>
>  pts rule name              description
> ---- ---------------------- --------------------------------------------------
>  1.0 CUST_DNSBL_26_UCE2     RBL: dnsbl-uce-2.thelounge.net
>                             (dnsbl-2.uceprotect.net)
>                             [60.176.201.72 listed in dnsbl-uce-2.thelounge.net]
>  6.5 CUST_DNSBL_4_ZEN_PBL   RBL: zen.spamhaus.org (pbl.spamhaus.org)
>                             [60.176.201.72 listed in zen.spamhaus.org]
>  5.5 CUST_DNSBL_6_ZEN_XBL   RBL: zen.spamhaus.org (xbl.spamhaus.org)
>  1.0 CUST_DNSBL_25_NSZONES  RBL: bl.nszones.com
>                             [60.176.201.72 listed in bl.nszones.com]
>  5.5 BAYES_80               BODY: Bayes spam probability is 80 to 95%
>                             [score: 0.9084]
>  0.1 HK_RANDOM_ENVFROM      Envelope sender username looks random
>  0.1 HK_RANDOM_FROM         From username looks random
>  6.5 CUST_DNSBL_2_SORBS_DUL RBL: dnsbl.sorbs.net
>                             (dul.dnsbl.sorbs.net)
>                             [60.176.201.72 listed in dnsbl.sorbs.net]
>  0.0 SPF_HELO_NONE          SPF: HELO does not publish an SPF Record
>  0.1 SPF_NONE               SPF: sender does not publish an SPF Record
>  0.0 HTML_MESSAGE           BODY: HTML included in message
>  0.1 TVD_SPACE_RATIO        No description available.
>  2.5 RDNS_NONE              Delivered to internal network by a host with no rDNS
> -0.0 T_SCC_BODY_TEXT_LINE   No description available.
>  0.5 INVALID_MSGID          Message-Id is not valid, according to RFC 2822
>  2.5 TVD_SPACE_RATIO_MINFP  Space ratio (vertical text obfuscation?)
>  0.5 BOGOFILTER_PROB_SPAM   BOGOFILTER: No description available.

Wow! What a charming response! You must be a LOT of fun at parties, and
have lots of friends! <eyeroll>

No, I did not get that response. I don't have any of those specific spam
to sample, as I have not gotten one today. But the last spam I got that
slipped through SA had this score:

X-Spam-Status: No, score=-5.1 required=5.0 tests=BAYES_00,DEAR_SOMETHING,
    DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,FREEMAIL_FROM,
    HTML_MESSAGE,RCVD_IN_DNSWL_HI,RCVD_IN_MSPIKE_H2,RCVD_IN_PBL,
    SPF_HELO_NONE,SPF_PASS,T_SCC_BODY_TEXT_LINE shortcircuit=no

So nothing about any tests not working, or queries being rejected.
Nothing that looks like misconfiguration on my end. I am not saying
there are no misconfigurations on my end, but if there are, it's not
super obvious to me.

Cheers!
--
Thomas
Re: Really hard-to-filter spam
On Wed, 2 Aug 2023, Thomas Cameron via users wrote:

> Wow! What a charming response! You must be a LOT of fun at parties, and have lots of friends! <eyeroll>

Please don't feed the troll. There's a reason that Reindl is blocked from this list.

>
> No, I did not get that response. I don't have any of those specific spam to sample, as I have not gotten one today. But the last spam I got that
> slipped through SA had this score:
>
> X-Spam-Status: No, score=-5.1 required=5.0 tests=BAYES_00,DEAR_SOMETHING,
> DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,FREEMAIL_FROM,
> HTML_MESSAGE,RCVD_IN_DNSWL_HI,RCVD_IN_MSPIKE_H2,RCVD_IN_PBL,
> SPF_HELO_NONE,SPF_PASS,T_SCC_BODY_TEXT_LINE shortcircuit=no
> So nothing about any tests not working, or queries being rejected. Nothing that looks like misconfiguration on my end. I am not saying there are
> no misconfigurations on my end, but if there are, it's not super obvious to me.

The fact that you're getting BAYES_00 on that message indicates that Bayes
-really- thinks it's ham.
Since you've trained multiple instances of this kind of message to Bayes as
spam and it still scores BAYES_00, one of two things is going on:
1) Either you've got thousands of instances of similar messages that were
learned as 'ham'
2) or the database that Bayes in your running SA instance is using is not the
same one that you were training.

This could be configuration issues or pilot error (using the wrong identity when
doing the training, training on the wrong machine, etc).

On your SA machine what does the output of "sa-learn --dump magic" show you?
(i.e., the nspam, nham & ntokens counts, the newest "atime", etc.)

If careful config & log inspection doesn't give clues, try this brute-force
test.
Shut down your SA, move the directory containing your Bayes database out of the
way and create a new empty one.
("sa-learn --dump magic" should now show 0 tokens).

Then train a few ham & spam messages (only a dozen or so), recheck the --dump
magic to see that there are now some tokens in the database but not too many.

Restart your SA and watch the log results. SA won't use a Bayes database that
holds fewer than 200 messages each of ham & spam, so make sure that's the case:
your new database should be too empty for SA to be willing to use it.
If you -are- getting Bayes scores anyway, that indicates that SA is using some
database other than the one you think it is.

Now start manually training more messages (spam & ham). When you hit the 200
count threshold, Bayes scores should start showing up in your logs.

Good luck.

--
Dave Funk
Re: Really hard-to-filter spam
On 8/2/23 14:32, Dave Funk wrote:
> On Wed, 2 Aug 2023, Thomas Cameron via users wrote:
>
>> Wow! What a charming response! You must be a LOT of fun at parties,
>> and have lots of friends! <eyeroll>
>
> Please don't feed the troll. There's a reason that Reindl is blocked
> from this list.

I was not aware, and I apologize.

>>
>> No, I did not get that response. I don't have any of those specific
>> spam to sample, as I have not gotten one today. But the last spam I
>> got that
>> slipped through SA had this score:
>>
>> X-Spam-Status: No, score=-5.1 required=5.0
>> tests=BAYES_00,DEAR_SOMETHING,
>>     DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,FREEMAIL_FROM,
>>     HTML_MESSAGE,RCVD_IN_DNSWL_HI,RCVD_IN_MSPIKE_H2,RCVD_IN_PBL,
>>     SPF_HELO_NONE,SPF_PASS,T_SCC_BODY_TEXT_LINE shortcircuit=no
>> So nothing about any tests not working, or queries being rejected.
>> Nothing that looks like misconfiguration on my end. I am not saying
>> there are
>> no misconfigurations on my end, but if there are, it's not super
>> obvious to me.
>
> The fact that you're getting BAYES_00 on that message indicates that
> Bayes -really- thinks it's ham.
> Given that you've trained multiple instances of this kind of message
> to Bayes as spam but it still gets BAYES_00 score means one of two
> things:
> 1) Either you've got thousands of instances of similar messages that
> were learned as 'ham'
> 2) or the database that Bayes in your running SA instance is using is
> not the same one that you were doing your training to.
>
> This could be configuration issues or pilot error (using the wrong
> identity when doing the training, training on the wrong machine, etc).
>
> On your SA machine what does the output of "sa-learn --dump magic"
> show you?
> (IE how many nspam & nham tokens, what is the newest "atime", etc).
>
> If careful config & log inspection doesn't give clues, try this
> brute-force test.
> Shut down your SA, move the directory containing your Bayes database
> out of the way and create a new empty one.
> ("sa-learn --dump magic" should now show 0 tokens).
>
> Then train a few ham & spam messages (only a dozen or so), recheck the
> --dump magic to see that there are now some tokens in the database but
> not too many.
>
> Restart your SA and watch the log results. If there are fewer than 200
> messages (both ham & spam) in your Bayes database then SA won't use
> it, so make sure that's the case, your new database should be too
> empty for SA to be willing to use it.
> So if you -are- getting Bayes scores then that indicates that SA is
> using some database other than what you think it has.
>
> Now start manually training more messages (spam & ham). When you hit
> the 200 count threshold Bayes scores should start showing up in your
> logs.
>
> Good luck.

Thank you very much. The message that slipped through today was NOT one
of the ones being discussed in this thread, it was a different format
and totally different message. I only included it to demonstrate that my
server was not being rejected for queries as the blocked user intimated.
I will dig deeper into the --dump magic output and make sure I'm feeding
Bayes with spam and ham.

Thanks for your response, and again, I apologize for leaking that user's
garbage to the list. I was not aware that he was blocked.

--
Thomas
Re: Really hard-to-filter spam
On Wed, 2 Aug 2023, Thomas Cameron via users wrote:

> Thank you very much. The message that slipped through today was NOT one of
> the ones being discussed in this thread, it was a different format and
> totally different message. I only included it to demonstrate that my server
> was not being rejected for queries as the blocked user intimated. I will dig
> deeper into the --magic and make sure I'm feeding Bayes with spam and ham.

Regardless, if a message has never been seen before and has little correlation
to earlier messages its Bayes should hit someplace in the 40% to 60% range.

The fact that it hit 00% indicates a strong correlation to lots of ham (or
something is screwy with your Bayes).


--
Dave Funk
Re: Really hard-to-filter spam
On 8/2/23 15:52, David B Funk wrote:
>
> Regardless, if a message has never been seen before and has little
> correlation to earlier messages its Bayes should hit someplace in the
> 40% to 60% range.
>
> The fact that it hit 00% indicates a strong correlation to lots of ham
> (or something is screwy with your Bayes).

OK, here's what I got just now:

[thomas.cameron@mail-east ~]$ sa-learn --dump magic
0.000 0 3 0 non-token data: bayes db version
0.000 0 41449 0 non-token data: nspam
0.000 0 49720 0 non-token data: nham
0.000 0 162741 0 non-token data: ntokens
0.000 0 1689089541 0 non-token data: oldest atime
0.000 0 1691009577 0 non-token data: newest atime
0.000 0 1691007146 0 non-token data: last journal sync atime
0.000 0 1690991018 0 non-token data: last expiry atime
0.000 0 1382400 0 non-token data: last expire atime delta
0.000 0 13879 0 non-token data: last expire reduction count
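
For reading a dump like this, a small script can pull out the counters and turn the atime values into dates. This is a minimal Python sketch that assumes the line layout shown above; parse_magic and as_utc are made-up helper names, and the 200-message floor is the threshold mentioned earlier in the thread (SA's default minimum for both ham and spam).

```python
import re
from datetime import datetime, timezone

# Subset of the "sa-learn --dump magic" output quoted above.
DUMP = """\
0.000 0 3 0 non-token data: bayes db version
0.000 0 41449 0 non-token data: nspam
0.000 0 49720 0 non-token data: nham
0.000 0 162741 0 non-token data: ntokens
0.000 0 1689089541 0 non-token data: oldest atime
0.000 0 1691009577 0 non-token data: newest atime
"""

# The third column carries the value; the label follows "non-token data:".
LINE = re.compile(r"^\s*\S+\s+\d+\s+(\d+)\s+\d+\s+non-token data:\s+(.+?)\s*$")

MIN_MESSAGES = 200  # Bayes stays inactive below this many ham or spam messages


def parse_magic(dump):
    """Map each non-token label to its integer value."""
    values = {}
    for line in dump.splitlines():
        match = LINE.match(line)
        if match:
            values[match.group(2)] = int(match.group(1))
    return values


def as_utc(epoch):
    """Format a Unix timestamp as a UTC date and time."""
    return datetime.fromtimestamp(epoch, tz=timezone.utc).strftime("%Y-%m-%d %H:%M:%S")


magic = parse_magic(DUMP)
bayes_active = magic["nspam"] >= MIN_MESSAGES and magic["nham"] >= MIN_MESSAGES

print("nspam:", magic["nspam"], "nham:", magic["nham"])
print("oldest atime:", as_utc(magic["oldest atime"]))
print("newest atime:", as_utc(magic["newest atime"]))
print("enough messages for Bayes to be used:", bayes_active)
```

A newest atime close to the time the command was run suggests this database is the one being updated; it does not by itself prove that the scanning process reads the same one.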

I can absolutely re-train Bayes. I am kind of an email pack-rat, so I
have over a gig of saved known good emails in various folders. I have SA
set up so that emails are scanned individually on a per user basis via
procmail rule:

[thomas.cameron@mail-east ~]$ head .procmailrc
MAILDIR=$HOME/mail
LOGFILE=$MAILDIR/procmail.log

:0fw: spamassassin.lock
* < 512000
| spamassassin

I have the users move spam to an imap folder, and then run (via the
user's cron job):

sa-learn --mbox --spam /home/[username]/mail/spam

If something is flagged as spam and it's not supposed to be, I have them
copy it to the ham folder and I run (also via cron job):

sa-learn --mbox --ham /home/[username]/mail/spam

For my email account, I've used my inbox and various other folders to
train Bayes in the past (although it's definitely been a while since I
did Bayes maintenance), but I have zero issue nuking my personal Bayes
data and starting over.

Thoughts?

--
Thomas
Re: Really hard-to-filter spam
On Wed, Aug 02, 2023 at 04:17:22PM -0500, Thomas Cameron via users wrote:
> On 8/2/23 15:52, David B Funk wrote:
>
> <snip>
>
> I have the users move spam to an imap folder, and then run (via the user's
> cron job):
>
> sa-learn --mbox --spam /home/[username]/mail/spam
>
> If something is flagged as spam and it's not supposed to be, I have them
> copy it to the ham folder and I run (also via cron job):
>
> sa-learn --mbox --ham /home/[username]/mail/spam

^^^^
Hopefully this is just a typo in your email, but the above line trains
your spam folder as if it's ham. That could easily cause your screwed-up
bayes scores.
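
A mix-up like this is easy to check for mechanically before it goes into a cron job. The following is only a sketch in Python: it assumes the training folders are literally named "spam" and "ham" (as in the commands quoted above) and that the mailbox path is the last non-option argument; mislabeled is a made-up helper name.

```python
import shlex


def mislabeled(command):
    """Return True when an sa-learn command trains a folder under the opposite label."""
    args = shlex.split(command)
    if not args or args[0] != "sa-learn":
        return False
    if "--spam" in args:
        label = "spam"
    elif "--ham" in args:
        label = "ham"
    else:
        return False
    # Treat the last non-option argument as the mailbox path.
    paths = [arg for arg in args[1:] if not arg.startswith("-")]
    if not paths:
        return False
    folder = paths[-1].rstrip("/").rsplit("/", 1)[-1].lower()
    opposite = "ham" if label == "spam" else "spam"
    return folder == opposite


commands = [
    "sa-learn --mbox --spam /home/[username]/mail/spam",
    "sa-learn --mbox --ham /home/[username]/mail/spam",
]
flags = [mislabeled(command) for command in commands]
for command, bad in zip(commands, flags):
    print("MISMATCH" if bad else "ok      ", command)
```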

--Sean
Re: Really hard-to-filter spam
On 8/4/23 02:15, Sean Greenslade wrote:
> On Wed, Aug 02, 2023 at 04:17:22PM -0500, Thomas Cameron via users wrote:
>> On 8/2/23 15:52, David B Funk wrote:
>>
>> <snip>
>>
>> I have the users move spam to an imap folder, and then run (via the user's
>> cron job):
>>
>> sa-learn --mbox --spam /home/[username]/mail/spam
>>
>> If something is flagged as spam and it's not supposed to be, I have them
>> copy it to the ham folder and I run (also via cron job):
>>
>> sa-learn --mbox --ham /home/[username]/mail/spam
>
> ^^^^
> Hopefully this is just a typo in your email, but the above line trains
> your spam folder as if it's ham. That could easily cause your screwed-up
> bayes scores.
>
> --Sean

It was a typo, sorry. I have a cron job that uses --spam against the
spam folder, and --ham against the ham folder. I just copied and pasted
poorly. This is the actual script for my account:

[thomas.cameron@mail-east ~]$ cat bin/spamcheck
#!/bin/bash
sa-learn --progress --spam --mbox /home/thomas.cameron/mail/INBOX/spam
sa-learn --progress --ham --mbox /home/thomas.cameron/mail/INBOX/ham

Bayes tests for other messages, like the one you sent me, look like this:

------------------------------------------------------------------
Return-Path: <sean@redacted.foo>
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on
    mail-east.camerontech.com
X-Spam-Level:
X-Spam-Status: No, score=-7.1 required=5.0 tests=BAYES_00,DKIM_SIGNED,
    DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_HI,SPF_HELO_NONE,
    SPF_PASS,T_SCC_BODY_TEXT_LINE shortcircuit=no autolearn=ham
    autolearn_force=no version=3.4.6
------------------------------------------------------------------

But messages flagged as spam look like this:

------------------------------------------------------------------
Return-Path:
    <usawildseafood_ad-thomas.cameron=camerontech.com@redacted.click>
X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on
    mail-east.camerontech.com
X-Spam-Flag: YES
X-Spam-Level: ************************************
X-Spam-Status: Yes, score=36.8 required=5.0 tests=BAYES_99,BAYES_999,
    DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,FROM_FMBLA_NEWDOM,
    FROM_SUSPICIOUS_NTLD,FROM_SUSPICIOUS_NTLD_FP,HTML_IMAGE_ONLY_32,
    HTML_MESSAGE,PDS_OTHER_BAD_TLD,RAZOR2_CF_RANGE_51_100,RAZOR2_CHECK,
    RCVD_IN_DNSWL_HI,RDNS_NONE,SH_HELO_DBL,SH_HELO_ZRD_FRESH,
    SH_ZRD_HEADERS_FRESH,SPF_HELO_NONE,SPF_PASS,T_SCC_BODY_TEXT_LINE,
    URIBL_ABUSE_SURBL,URIBL_BLACK,URIBL_ZRD shortcircuit=no autolearn=spam
    autolearn_force=no version=3.4.6
------------------------------------------------------------------

The previous email I copied headers from as an example was just a bad
example. Usually Bayes is /pretty/ accurate on my system. I only used
that one because it was a message which made it through SpamAssassin. I
was trying to demonstrate that the checks were not failing, as suggested
in an earlier comment.

Thanks for catching that, though. I have made silly mistakes like that
so I appreciate you checking me.

--
Thomas
Re: Really hard-to-filter spam
On Fri, Aug 04, 2023 at 08:38:24AM -0500, Thomas Cameron wrote:
> It was a typo, sorry. I have a cron job that uses --spam against the spam
> folder, and --ham against the ham folder. I just copied and pasted poorly.
> This is the actual script for my account:
>
> [thomas.cameron@mail-east ~]$ cat bin/spamcheck
> #!/bin/bash
> sa-learn --progress --spam --mbox /home/thomas.cameron/mail/INBOX/spam
> sa-learn --progress --ham --mbox /home/thomas.cameron/mail/INBOX/ham
>
> Bayes tests for other messages, like the one you sent me, look like this:
>
> ------------------------------------------------------------------
> Return-Path: <sean@redacted.foo>
> X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on
> mail-east.camerontech.com
> X-Spam-Level:
> X-Spam-Status: No, score=-7.1 required=5.0 tests=BAYES_00,DKIM_SIGNED,
> DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_HI,SPF_HELO_NONE,
> SPF_PASS,T_SCC_BODY_TEXT_LINE shortcircuit=no autolearn=ham
> autolearn_force=no version=3.4.6
> ------------------------------------------------------------------
>
> But messages flagged as spam look like this:
>
> ------------------------------------------------------------------
> Return-Path:
> <usawildseafood_ad-thomas.cameron=camerontech.com@redacted.click>
> X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on
> mail-east.camerontech.com
> X-Spam-Flag: YES
> X-Spam-Level: ************************************
> X-Spam-Status: Yes, score=36.8 required=5.0 tests=BAYES_99,BAYES_999,
> DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,FROM_FMBLA_NEWDOM,
> FROM_SUSPICIOUS_NTLD,FROM_SUSPICIOUS_NTLD_FP,HTML_IMAGE_ONLY_32,
> HTML_MESSAGE,PDS_OTHER_BAD_TLD,RAZOR2_CF_RANGE_51_100,RAZOR2_CHECK,
> RCVD_IN_DNSWL_HI,RDNS_NONE,SH_HELO_DBL,SH_HELO_ZRD_FRESH,
> SH_ZRD_HEADERS_FRESH,SPF_HELO_NONE,SPF_PASS,T_SCC_BODY_TEXT_LINE,
> URIBL_ABUSE_SURBL,URIBL_BLACK,URIBL_ZRD shortcircuit=no autolearn=spam
> autolearn_force=no version=3.4.6
> ------------------------------------------------------------------
>
> The previous email I copied headers from as an example was just a bad
> example. Usually Bayes is /pretty/ accurate on my system. I only used that
> one because it was a message which made it through SpamAssassin. I was
> trying to demonstrate that the checks were not failing, as suggested in an
> earlier comment.
>
> Thanks for catching that, though. I have made silly mistakes like that so I
> appreciate you checking me.

In that case, I think I can only offer some general suggestions that I
personally follow.

I have the autolearn function completely disabled. In my experience, if
you have a decent training corpus of known ham and known spam, autolearn
doesn't really add anything.

Like yours, my bayes results are usually quite accurate. At this point,
I only train messages that are actually false positives or false
negatives. I can't say for sure how effective this is, but my intuition
is that by only training on "hard" messages (meaning ones that the
non-bayes SA rules couldn't take care of on their own), I'm keeping the
bayes engine focused on the most important messages to classify
correctly. Your above spample has such a high score, my mail server
would have rejected that message at SMTP time even if it had triggered
BAYES_00. I wouldn't bother training such a message; the rest of the
rules have it covered.

Another thing to note is that spam tends to change over time. Having
really old spams in your bayes DB could be diluting its effectiveness by
having it look for signs that the current crop of spams don't show. It
might be worth starting fresh with an empty bayes db and training just a
few hundred of your most recent hams and spams.
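
Picking out the most recent messages from an existing mbox folder could look roughly like this. It is a sketch using Python's standard mailbox module, under the assumption that the Date header is a good enough proxy for recency; newest_messages and write_mbox are made-up helper names, and messages without a parsable Date are simply skipped.

```python
import mailbox
from email.utils import parsedate_to_datetime


def newest_messages(mbox_path, limit):
    """Return up to `limit` messages from an mbox file, newest Date header first."""
    dated = []
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for msg in box:
            try:
                when = parsedate_to_datetime(msg["Date"])
            except (TypeError, ValueError):
                continue  # missing or unparsable Date header
            # Sort on the Unix timestamp so mixed time zones compare correctly.
            dated.append((when.timestamp(), msg))
    finally:
        box.close()
    dated.sort(key=lambda pair: pair[0], reverse=True)
    return [msg for _, msg in dated[:limit]]


def write_mbox(messages, dest_path):
    """Write the given messages to a new mbox file at dest_path."""
    out = mailbox.mbox(dest_path)
    try:
        for msg in messages:
            out.add(msg)
    finally:
        out.close()
```

The resulting file (say, a hypothetical recent-spam mbox written by write_mbox) can then be fed to "sa-learn --spam --mbox" in the same way as the folders mentioned earlier in the thread.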

And finally, if there's something consistent about the messages, don't
be afraid to write a manual rule. I have a few special rules in my
configs that alter the bayes scoring based on other aspects of the
messages.

--Sean