Mailing List Archive

Strange findings debugging bayes results
I was investigating a bunch of bitcoin spam: different subjects,
different senders (all from Gmail), different text, different PDF
attachments.

Unfortunately, at the time my Bayes DB was polluted and they all got
a BAYES_50 (score 0.8).

I tested the messages now with a recreated Bayes DB and got some
BAYES_999 hits. So I dug in to understand whether I had already seen
this spam...

But the debug result was unpleasant:
dbg: bayes: tokenized header: 92 tokens
dbg: bayes: token 'HX-Received:Jan' => 0.998028449502134
dbg: bayes: token 'HX-Google-DKIM-Signature:20210112' => 0.997244532803181
dbg: bayes: token 'H*r:sk:<START_OF_RECIPIENT_EMAIL_ADDRESS>' =>
0.997244532803181
dbg: bayes: token 'H*r:a05' => 0.995425742574258
dbg: bayes: token 'HAuthentication-Results:sk:<MY_SA_HOSTNAME>.' =>
0.986543689320388
dbg: bayes: token 'HX-Google-DKIM-Signature:reply-to' => 0.916110175863517
dbg: bayes: token 'H*r:2002' => 0.877842810325844
dbg: bayes: token 'HAuthentication-Results:2048-bit' => 0.858520043212023
dbg: bayes: token 'HAuthentication-Results:pass' => 0.855193895034317
dbg: bayes: score = 0.999997915091326


Every score is based on headers: very generic headers, and some
related to my setup.

Not a single token from the message body....
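
For anyone who wants to reproduce this kind of output: the per-token
lines above come from a debug run along these lines (msg.eml standing
in for one of the spam samples):

  spamassassin -D bayes < msg.eml 2>&1 | grep 'dbg: bayes'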
Re: Strange findings debugging bayes results
On Thu, Feb 16, 2023 at 10:18:50AM +0100, hg user wrote:
> [...]
> Every score is based on headers: very generic headers, and some
> related to my setup.
>
> Not a single token from the message body....

The Bayes implementation has been practically unmaintained for a long time,
so YMMV.

You can try something like this; most headers are parsed badly and
generate biasing random garbage (unscientific observation):

bayes_ignore_header ARC-Authentication-Results
bayes_ignore_header ARC-Message-Signature
bayes_ignore_header ARC-Seal
bayes_ignore_header Authentication-Results
bayes_ignore_header Autocrypt
bayes_ignore_header IronPort-SDR
bayes_ignore_header suggested_attachment_session_id
bayes_ignore_header X-Brightmail-Tracker
bayes_ignore_header X-Exchange-Antispam-Report-CFA-Test
bayes_ignore_header X-Forefront-Antispam-Report
bayes_ignore_header X-Forefront-Antispam-Report-Untrusted
bayes_ignore_header X-Gm-Message-State
bayes_ignore_header X-Google-DKIM-Signature
bayes_ignore_header x-microsoft-antispam
bayes_ignore_header X-Microsoft-Antispam-Message-Info
bayes_ignore_header X-Microsoft-Antispam-Message-Info-Original
bayes_ignore_header X-Microsoft-Antispam-Untrusted
bayes_ignore_header X-Microsoft-Exchange-Diagnostics
bayes_ignore_header x-ms-exchange-antispam-messagedata
bayes_ignore_header x-ms-exchange-antispam-messagedata-0
bayes_ignore_header x-ms-exchange-crosstenant-id
bayes_ignore_header x-ms-exchange-crosstenant-network-message-id
bayes_ignore_header x-ms-exchange-crosstenant-rms-persistedconsumerorg
bayes_ignore_header X-MS-Exchange-CrossTenant-userprincipalname
bayes_ignore_header x-ms-exchange-slblob-mailprops
bayes_ignore_header x-ms-office365-filtering-correlation-id
bayes_ignore_header X-MSFBL
bayes_ignore_header X-Provags-ID
bayes_ignore_header X-SG-EID
bayes_ignore_header X-SG-ID
bayes_ignore_header X-UI-Out-Filterresults
bayes_ignore_header X-ClientProxiedBy
bayes_ignore_header X-MS-Exchange-CrossTenant-FromEntityHeader
bayes_ignore_header X-OriginatorOrg
bayes_ignore_header X-MS-Exchange-CrossTenant-OriginalArrivalTime
bayes_ignore_header X-MS-TrafficTypeDiagnostic
bayes_ignore_header X-MS-Exchange-CrossTenant-AuthAs
bayes_ignore_header X-MS-Exchange-Transport-CrossTenantHeadersStamped
bayes_ignore_header X-MS-Exchange-CrossTenant-AuthSource
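
If I read the tokenizer right, these directives apply whenever a
message is tokenized, at both learn and scan time, so previously
learned header tokens simply stop matching; they do stay in the
database, though. For a genuinely clean slate (assuming you still have
mbox archives of classified mail to relearn from), something like:

  sa-learn --clear
  sa-learn --mbox --spam /path/to/spam-archive.mbox
  sa-learn --mbox --ham /path/to/ham-archive.mbox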
Re: Strange findings debugging bayes results
Hi,

Here are another 50+ headers we've collected over the years; I
believe the list started with one from AXB 10+ years ago.

https://pastebin.com/raw/f6Fwh8HJ

dave

On 2/16/23 6:02 AM, Henrik K wrote:
> [...]
> You can try something like this; most headers are parsed badly and
> generate biasing random garbage (unscientific observation):
> [...]
Re: Strange findings debugging bayes results
I've updated 23_bayes_ignore_header.cf
(last update was from 2016 :)

https://svn.apache.org/repos/asf/spamassassin/trunk/rulesrc/sandbox/axb/23_bayes_ignore_header.cf

Axb

On 2/16/23 14:17, Dave Wreski wrote:
> Here's also another 50+ headers we've collected over the years that I
> believe started as a list from AXB 10+ years ago.
>
> https://pastebin.com/raw/f6Fwh8HJ
Re: Strange findings debugging bayes results
On Thu, Feb 16, 2023 at 01:02:25PM +0200, Henrik K wrote:
> On Thu, Feb 16, 2023 at 10:18:50AM +0100, hg user wrote:
> > Every score is based on headers: very generic headers, and some
> > related to my setup.
> >
> > Not a single token from the message body....
>
> The Bayes implementation has been practically unmaintained for a long time,
> so YMMV.
>
> You can try something like this; most headers are parsed badly and
> generate biasing random garbage (unscientific observation):
>
> bayes_ignore_header ARC-Authentication-Results
> bayes_ignore_header ARC-Message-Signature

Yeah, Bayes on headers (and CSS/HTML stuff) has caused me far more
misclassifications than good, so I eventually gave up on updating the
ever-growing bayes_ignore_header list and disabled Bayes on headers
altogether:

bayes_token_sources none visible uri mimepart

My stance being: if an end user wouldn't classify based on those
sources (except the Subject header), neither should automatic Bayes
classification...

Perhaps the OP has a bayes_token_sources setting that takes only
headers into account?

https://man.archlinux.org/man/Mail::SpamAssassin::Conf.3pm.en#bayes_token_sources

--
Opinions above are GNU-copylefted.
Re: Strange findings debugging bayes results
> bayes_token_sources none visible uri mimepart

I added this line to my config, but there was no change in the tokens
used to compute the Bayes score; headers are still used. It may be a
setting that is only honored during learning, but I should check the
sources.
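
One thing worth checking first (a guess, not something I've verified
against your setup): whether your SpamAssassin version recognizes the
option at all, since lint will flag a directive it can't parse:

  spamassassin --version
  spamassassin --lint 2>&1 | grep -i bayes_token_sources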


> Perhaps the OP has a bayes_token_sources setting that takes only
> headers into account?
>

No. That mail had very few words in the text, and probably the Bayes
system considered them not relevant.

The real question is: does Bayes still have a use case in 2023? Is it
still used with significant scores, or just to flag messages for
review?
Re: Strange findings debugging bayes results
> The real question is: does Bayes still have a use case in 2023? Is it still used with significant scores, or just to flag messages for review?

It works fine for me here.
Re: Strange findings debugging bayes results
Can you please give me some details on your Bayes setup? Header
exclusions, bayes_token_sources, how you "sa-learn" messages...

thank you

On Sun, Feb 19, 2023 at 11:53 PM Loren Wilton <lwilton@earthlink.net> wrote:

> > The real question is: does Bayes still have a use case in 2023? Is it
> > still used with significant scores, or just to flag messages for review?
>
> It works fine for me here.
Re: Strange findings debugging bayes results
> Can you please give me some details on your Bayes setup?
> Header exclusions, bayes_token_sources, how you "sa-learn" messages...

Standard options on Bayes. No autolearn. A cron job harvests the Spam and Ham mboxes and feeds them to sa-learn once a day, then archives the learned messages. Per-user Bayes and learning. Mail is hand-moved into the spam and ham learning folders, and for my personal account I do this rarely, generally only when a message is mis-categorized. That said, messages mis-categorized as spam are often the result of my quite aggressive local rules rather than a Bayes misclassification.
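
A minimal sketch of such a cron job, with hypothetical paths and a
single user for brevity (per-user setups would run the equivalent
under each account):

  #!/bin/sh
  # learn from the hand-sorted folders (mbox format assumed)
  sa-learn --mbox --spam /home/user/mail/Spam
  sa-learn --mbox --ham /home/user/mail/Ham
  # archive what was just learned, then empty the learning folders
  cat /home/user/mail/Spam >> /home/user/mail/Oldspam
  cat /home/user/mail/Ham >> /home/user/mail/Oldham
  : > /home/user/mail/Spam
  : > /home/user/mail/Ham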
Re: Strange findings debugging bayes results
On 20 February 2023 12:28:00 CET, Loren Wilton <lwilton@earthlink.net> wrote:
>
> A cron job that will harvest Spam and Ham mboxes and feed them to sa-learn once a day, then archive the learned messages. Per-user bayes and learning. Mail is hand-moved into the spam and ham learning folders, and for my personal account, I do this rarely, generally only when a message is mis-categorized. Although messages being mis-categorized as spam is often the result of a lot of quite aggressive local rules I have rather than a Bayes mis-classification.

When you "harvest" ham from mboxes, what do you consider ham?

So you also have a Ham folder for your users, then? Interesting. Did you manage to get your users to use it easily? Does it grow unbounded, or are old messages removed from it? If so, how do you know when they can be deleted, as with the Spam folder?

It's an interesting idea, just wondering about the details. Getting my users to train SpamAssassin has always been impossible for me.
Re: Strange findings debugging bayes results
This is a home system with only a few users. All users have "Spam" and "Ham" folders showing up in their email program of choice, and they just drag messages they do or don't like into the appropriate folders. There are "Oldham" and "Oldspam" mboxes; the new ham and spam (respectively) get merged into these folders after learning and removed from the current Spam and Ham folders.
Re: Strange findings debugging bayes results
> From: "Reindl Harald" <h.reindl@thelounge.net>
> in other words a system for morons - morons which will drag mails to spam
> instead click on "unsubscribe"
>
> per-user bayes don't work well, never

Well Harald, you are certainly welcome to your opinion. It would be
nicer if you had kept it to yourself, though.
The system works just fine with the userbase it has. It probably wouldn't
work for AOL or *.online.
Re: Strange findings debugging bayes results
On Mon, Feb 20, 2023 at 01:30:15PM -0800, Loren Wilton wrote:
> This is a home system with only a few users. All users have "Spam" and "Ham"
> folders showing up in their email program of choice, and they just drag
> messages they do or don't like into the appropriate folders. There are "Oldham"
> and "Oldspam" mboxes, and the new spam and ham (respectively) get merged into
> these folders after learning, and removed from the current Spam and Ham
> folders.

I had a similar idea but never implemented it because I felt it was
too difficult for users to deal with. I was considering two folders,
'Spam Training Set' and 'Ham Training Set', which would always
represent the set of messages that SpamAssassin was currently trained
with. If you changed the contents of these mboxes, a cron job would
delete the old bayes tokens and retrain with the current set.
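
A rough sketch of that retrain step (my own illustration, with a
hypothetical per-user database path):

  # wipe the per-user database, then relearn from the current training sets
  sa-learn --dbpath /home/user/.spamassassin --clear
  sa-learn --dbpath /home/user/.spamassassin --mbox --spam '/home/user/mail/Spam Training Set'
  sa-learn --dbpath /home/user/.spamassassin --mbox --ham '/home/user/mail/Ham Training Set'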

The difference between these folders and the Spam folder (or Junk, or
whatever you call it locally) is that in the Spam folder, messages
older than 30 days get auto-deleted; after 30 days, those messages
would no longer represent the training set.

Having 2 spam folders is confusing and not easy to manage.

Neither of these two extra folders is a place users would normally
look for messages, so they really do have to copy messages into them,
which isn't just dragging them. That, for me, was the main issue.

So I abandoned this line of thinking.

You mentioned harvesting ham and spam from mboxes, as in from the
inbox directly. That got me thinking about it some more.

Clearly, messages the user dragged to Spam that SpamAssassin did not
mark as spam get trained as spam. Easy.

And messages the user left in their mailbox, deleted, or archived get
used as ham. Could be OK, but I'm less sure.

And lastly, messages that were in Spam (since SpamAssassin marked them
as spam) that a user moved out of Spam: just look through all their
folders (except Spam) for messages that SpamAssassin marked as spam
and retrain on those as ham. Again, maybe a bad assumption, but it
could work.
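
For that last case, something along these lines might work (a sketch
only; it assumes a Maildir layout and that SpamAssassin stamps
X-Spam-Flag: YES on messages it classifies as spam):

  # find messages SA flagged as spam that now live outside the Spam
  # folder, and relearn them as ham
  grep -rl '^X-Spam-Flag: YES' /home/user/Maildir/.Archive/cur | xargs -r sa-learn --ham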

I was really just curious whether other people had workable ideas for
getting Bayes trained with the least amount of friction.