bayes_ignore_header policy?
There's lots of common headers that are basically huge base64 strings,
creating stupid amounts of random Bayes tokens.

Apparently rulesrc/sandbox/axb/23_bayes_ignore_header.cf was created to
handle some of these already?

I've found at least these missing:

IronPort-SDR
X-Exchange-Antispam-Report-CFA-Test
X-Forefront-Antispam-Report-Untrusted
X-Gm-Message-State
X-MS-Exchange-AntiSpam-MessageData
X-MS-Exchange-AntiSpam-MessageData-0
X-MS-Exchange-CrossTenant-UserPrincipalName
X-MS-Exchange-SLBlob-MailProps
X-MSFBL
X-Microsoft-Antispam-Message-Info
X-Microsoft-Antispam-Message-Info-Original
X-Microsoft-Antispam-Untrusted
X-Microsoft-Exchange-Diagnostics
X-Provags-ID
X-SG-EID
X-SG-ID

Wouldn't these be better put directly into bayes/23_bayes.cf instead of some
sandbox, that's intended more for testing rules than changing SA config?

Any objections 1) adding these new ones 2) moving everything to 23_bayes.cf?
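
For reference, a minimal sketch of how the headers listed above could be turned into bayes_ignore_header lines for local.cf (or for 23_bayes.cf, if the move goes ahead). bayes_ignore_header is the existing SpamAssassin option this thread is about; the helper name render_ignore_lines is made up for illustration and is not part of SpamAssassin.

```python
# Headers reported as missing in the message above.
MISSING_HEADERS = [
    "IronPort-SDR",
    "X-Exchange-Antispam-Report-CFA-Test",
    "X-Forefront-Antispam-Report-Untrusted",
    "X-Gm-Message-State",
    "X-MS-Exchange-AntiSpam-MessageData",
    "X-MS-Exchange-AntiSpam-MessageData-0",
    "X-MS-Exchange-CrossTenant-UserPrincipalName",
    "X-MS-Exchange-SLBlob-MailProps",
    "X-MSFBL",
    "X-Microsoft-Antispam-Message-Info",
    "X-Microsoft-Antispam-Message-Info-Original",
    "X-Microsoft-Antispam-Untrusted",
    "X-Microsoft-Exchange-Diagnostics",
    "X-Provags-ID",
    "X-SG-EID",
    "X-SG-ID",
]


def render_ignore_lines(headers):
    """Return one 'bayes_ignore_header NAME' line per distinct header.

    Header names are compared case-insensitively, blank entries are
    skipped, and the output is sorted by the lowercased name.
    """
    unique = {}
    for name in headers:
        name = name.strip()
        if name:
            # Keep the first spelling seen for each header name.
            unique.setdefault(name.lower(), name)
    return [f"bayes_ignore_header {unique[key]}" for key in sorted(unique)]


if __name__ == "__main__":
    print("\n".join(render_ignore_lines(MISSING_HEADERS)))
```

The output can be pasted into a .cf file as-is; each line names exactly one header.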
Re: bayes_ignore_header policy?
On 2022-05-07 10:42, Henrik K wrote:

> Wouldn't these be better put directly into bayes/23_bayes.cf instead of
> some
> sandbox, that's intended more for testing rules than changing SA
> config?

I think we could change it to be a list of headers that are always indexed, not
just a list of randomly ignored ones?

Could it be done so that DKIM-signed headers are always trusted headers in Bayes?

Headers that are not signed would be ignored.

> Any objections 1) adding these new ones 2) moving everything to
> 23_bayes.cf?

3) a better patch :=)
Re: bayes_ignore_header policy?
On Sat, May 07, 2022 at 03:08:08PM +0200, Benny Pedersen wrote:
> On 2022-05-07 10:42, Henrik K wrote:
>
> > Wouldn't these be better put directly into bayes/23_bayes.cf instead of
> > some
> > sandbox, that's intended more for testing rules than changing SA config?
>
> i think we would change it to be a list of headers always indexed, not just
> random ignored ?
>
> could be done with dkim header signed is always trusted headers in bayes ?
>
> not signed header ignore
>
> > Any objections 1) adding these new ones 2) moving everything to
> > 23_bayes.cf?
>
> 3) a better patch :=)

Sorry, I tried reading three times, but I don't understand your suggestion..
Re: bayes_ignore_header policy?
On 2022-05-07 15:19, Henrik K wrote:

>> 3) a better patch :=)
> Sorry, I tried reading three times, but I don't understand your
> suggestion..

DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=hege.li; s=hege2;
t=1651929565; bh=mDJa8KTm9fQckBvDRBvqjvPDd2nTerfIP3931ZK07fQ=;
h=Date:From:To:Subject:References:In-Reply-To:From;
b=awKAzs8isiW1Kxq04b2g1pnn5lLt+4VLiyJH9ai0V/YyOy0TLi3ajY7TtFgz5abf5
ioZMIvY9aIAk5swUr6yGEnEZqG/5yW2hiId5Vb78Vwj4PoGwuO1WxRNE+8VdlrRkFf
StgfHdXlle+6wqJ5H47fzBIobEvoEM0C3g/QKJbhys0sGzrjbEqVzAKIETzACaGVG+
8EWwYM7utrUnENIaf48/M3kL5qFReJJ32Fzol7x1G1O7qoBQT4zrB1cnQ19Tt6tpRR
InkL7nZx4cPXx4laYiTxiymQkEHxwqD43AXGzdng/BESAYeEImU4zHZSHBwNXJ2Mah
Ej8ihZ2Zi+m4g==

From appears twice in the h= list above.

I think the headers in DKIM h= could be trusted in Bayes, instead of defining untrusted ones with
bayes_ignore_header randomheader

in local.cf

bayes_trust_header From
bayes_trust_header Date
bayes_trust_header To

etc

all other headers, like the ones axb listed, would still be untrusted then

hope I have explained it well now
Re: bayes_ignore_header policy?
On Sat, May 07, 2022 at 03:46:13PM +0200, Benny Pedersen wrote:
> On 2022-05-07 15:19, Henrik K wrote:
>
> > > 3) a better patch :=)
> > Sorry, I tried reading three times, but I don't understand your
> > suggestion..
>
> DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=hege.li; s=hege2;
> t=1651929565; bh=mDJa8KTm9fQckBvDRBvqjvPDd2nTerfIP3931ZK07fQ=;
> h=Date:From:To:Subject:References:In-Reply-To:From;
> b=awKAzs8isiW1Kxq04b2g1pnn5lLt+4VLiyJH9ai0V/YyOy0TLi3ajY7TtFgz5abf5
> ioZMIvY9aIAk5swUr6yGEnEZqG/5yW2hiId5Vb78Vwj4PoGwuO1WxRNE+8VdlrRkFf
> StgfHdXlle+6wqJ5H47fzBIobEvoEM0C3g/QKJbhys0sGzrjbEqVzAKIETzACaGVG+
> 8EWwYM7utrUnENIaf48/M3kL5qFReJJ32Fzol7x1G1O7qoBQT4zrB1cnQ19Tt6tpRR
> InkL7nZx4cPXx4laYiTxiymQkEHxwqD43AXGzdng/BESAYeEImU4zHZSHBwNXJ2Mah
> Ej8ihZ2Zi+m4g==
>
> From appears twice in the h= list above.
>
> I think the headers in DKIM h= could be trusted in Bayes, instead of defining untrusted ones with
> bayes_ignore_header randomheader

I don't understand what you mean with "trust".

bayes_ignore_header is used for ignoring headers that produce random
garbage tokens, or too common ones that have no use for classifying other
messages.

It has nothing to do with trust or no trust.
Re: bayes_ignore_header policy?
On 2022-05-07 15:58, Henrik K wrote:

> I don't understand what you mean with "trust".

sorry for my wording then :/

> bayes_ignore_header is used for ignoring headers that produce random
> garbage tokens, or too common ones that have no use for classifying
> other
> messages.

let's just continue adding more bayes_ignore_header lines, that is much
simpler than reading what I write :/

with bayes_indexed_header fewer lines would need to be added, it would
not need many updates either, and less memory would be
needed

> It has nothing to do with trust or no trust.

you don't need to trust me, lol, I still have a point

clamav has around 9 million signatures here, but still no virus hits,
is this not a wake-up call? It seems it's not for oracle. I remember you
preferred not to load main.cld in clamav to save memory, how did it go?
Re: bayes_ignore_header policy?
On Sat, May 07, 2022 at 04:09:02PM +0200, Benny Pedersen wrote:
>
> with bayes_indexed_header it would make less lines needed to be added, and
> it would not need much updates either, and also less memory usage is needed

Please explain how "bayes_indexed_header" (whatever it is) would make less
lines, and me not needing to add "bayes_ignore_header X-Microsoft-Antispam"
manually so it won't fill my database with garbage.
Re: bayes_ignore_header policy?
On 2022-05-07 16:14, Henrik K wrote:
> On Sat, May 07, 2022 at 04:09:02PM +0200, Benny Pedersen wrote:
>>
>> with bayes_indexed_header it would make less lines needed to be added,
>> and
>> it would not need much updates either, and also less memory usage is
>> needed
>
> Please explain how "bayes_indexed_header" (whatever it is) would make
> less
> lines, and me not needing to add "bayes_ignore_header
> X-Microsoft-Antispam"
> manually so it won't fill my database with garbage.

I provided a way to define wanted headers, axb provided a way to define unwanted ones

eod
Re: bayes_ignore_header policy?
On Sat, May 07, 2022 at 04:29:33PM +0200, Benny Pedersen wrote:
> On 2022-05-07 16:14, Henrik K wrote:
> > On Sat, May 07, 2022 at 04:09:02PM +0200, Benny Pedersen wrote:
> > >
> > > with bayes_indexed_header it would make less lines needed to be
> > > added, and
> > > it would not need much updates either, and also less memory usage is
> > > needed
> >
> > Please explain how "bayes_indexed_header" (whatever it is) would make
> > less
> > lines, and me not needing to add "bayes_ignore_header
> > X-Microsoft-Antispam"
> > manually so it won't fill my database with garbage.
>
> i provided a way to define wanted, axb provided a way to define unwanted

Ok I got it now..

So no header would be tokenized by default, unless there is
"bayes_allow_header From To Received" etc.

Dunno, it might help or might not. Would the allow list be actually any
shorter, and would it be maintainable in the long run?

There's many headers that should be parsed more intelligently anyway.
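
To make the two policies concrete, here is a small sketch that contrasts the existing deny-list behaviour (bayes_ignore_header) with the allow-list idea discussed above. bayes_allow_header is only a name floated in this thread, not an existing SpamAssassin option, and the function headers_to_tokenize is likewise hypothetical.

```python
def headers_to_tokenize(message_headers, ignore=(), allow=None):
    """Return the header names that would be handed to the Bayes tokenizer.

    ignore: deny-list, the semantics of today's bayes_ignore_header.
    allow:  hypothetical allow-list; None means no allow-list is active.
    Header names are compared case-insensitively.
    """
    ignore_set = {h.lower() for h in ignore}
    allow_set = None if allow is None else {h.lower() for h in allow}

    kept = []
    for name in message_headers:
        key = name.lower()
        if key in ignore_set:
            continue  # explicitly ignored
        if allow_set is not None and key not in allow_set:
            continue  # allow-list active and header not on it
        kept.append(name)
    return kept


if __name__ == "__main__":
    msg = ["From", "To", "Received", "X-SG-EID", "X-Brand-New-Tracker"]
    # Deny-list: the unknown tracker header still gets tokenized.
    print(headers_to_tokenize(msg, ignore=["X-SG-EID"]))
    # Allow-list: only the named headers are tokenized.
    print(headers_to_tokenize(msg, allow=["From", "To", "Received"]))
```

The trade-off is visible in the example: with a deny-list, a new noisy header is tokenized until someone adds it, while with an allow-list, a new useful header is ignored until someone adds it.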
Re: bayes_ignore_header policy?
On 2022-05-07 16:37, Henrik K wrote:

> Ok I got it now..

and its not yet monday ? :)

> So no header would be tokenized by default, unless there is
> "bayes_allow_header From To Received" etc.

yes, I believe this is "install and forget", the good headers that are needed don't
change

> Dunno, it might help or might not. Would the allow list be actually
> any
> shorter, and would it be maintainable in the long run?

at least the axb list is longer and not DKIM-signed :)

I don't know if DKIM-signed headers are what to limit it to, but IMHO it
makes sense to run Bayes only on these headers, and hope that emails that are
not DKIM-signed don't use other headers useful for Bayes

> There's many headers that should be parsed more intelligently anyway.

next step
Re: bayes_ignore_header policy?
On Sat, May 07, 2022 at 04:47:37PM +0200, Benny Pedersen wrote:
> > So no header would be tokenized by default, unless there is
> > "bayes_allow_header From To Received" etc.
>
> yes, I believe this is "install and forget", the good headers that are needed don't change

Who knows what are "good headers" and if random spammer added headers make
any difference.. all this would require running days of "10-fold cross
validation", JM is not around anymore so no one knows what to do with the
Bayes engine..

> > Dunno, it might help or might not. Would the allow list be actually any
> > shorter, and would it be maintainable in the long run?
>
> at least the axb list is longer and not DKIM-signed :)
>
> I don't know if DKIM-signed headers are what to limit it to, but IMHO it makes
> sense to run Bayes only on these headers, and hope that emails that are not
> DKIM-signed don't use other headers useful for Bayes

I don't see why DKIM should be mixed up with Bayes. Anyone can add DKIM
headers, and making Bayes wait for network lookups is complete waste. "Good
headers" for Bayes are irrelevant to that..
Re: bayes_ignore_header policy?
On 2022-05-07 16:55, Henrik K wrote:
> On Sat, May 07, 2022 at 04:47:37PM +0200, Benny Pedersen wrote:
>> > So no header would be tokenized by default, unless there is
>> > "bayes_allow_header From To Received" etc.
>>
>> yes, I believe this is "install and forget", the good headers that are needed don't
>> change
>
> Who knows what are "good headers"

trusteddomains aka opendkim has a list of required headers to sign; for
Bayes we could trust them as well, instead of making a list of unwanted
headers

> and if random spammer added headers make
> any difference.. all this would require running days of "10-fold cross
> validation",

not really, the "did this header pass dkim" part is not needed for bayes

> JM is not around anymore so no one knows what to do with the
> Bayes engine..

outsource bayes to dspam ? :=)

>> > Dunno, it might help or might not. Would the allow list be actually any
>> > shorter, and would it be maintainable in the long run?
>>
>> at least the axb list is longer and not DKIM-signed :)
>>
>> I don't know if DKIM-signed headers are what to limit it to, but IMHO it
>> makes
>> sense to run Bayes only on these headers, and hope that emails that are not
>> DKIM-signed
>> don't use other headers useful for Bayes
>
> I don't see why DKIM should be mixed up with Bayes. Anyone can add
> DKIM
> headers, and making Bayes wait for network lookups is complete waste.
> "Good
> headers" for Bayes are irrelevant to that..

that was not what I meant by tying it to DKIM, the only point was to use the same
headers as DKIM to get a stable Bayes header list
Re: bayes_ignore_header policy?
On Sun, May 08, 2022 at 11:27:03AM +0200, Benny Pedersen wrote:
> On 2022-05-07 16:55, Henrik K wrote:
> > On Sat, May 07, 2022 at 04:47:37PM +0200, Benny Pedersen wrote:
> > > > So no header would be tokenized by default, unless there is
> > > > "bayes_allow_header From To Received" etc.
> > >
> > > yes, I believe this is "install and forget", the good headers that are needed don't
> > > change
> >
> > Who knows what are "good headers"
>
> trusteddomains aka opendkim has a list of required headers to sign; for
> Bayes we could trust them as well, instead of making a list of unwanted headers

It makes no sense to get a list of headers from "somewhere" like opendkim,
it has no relevance on what Bayes should use.

> > JM is not around anymore so no one knows what to do with the
> > Bayes engine..
>
> outsource bayes to dspam ? :=)

You realise that dspam has also been dead for years? :-)

> that was not what I meant by tying it to DKIM, the only point was to use the same
> headers as DKIM to get a stable Bayes header list

I'm sure we can figure out what common email headers are by ourselves
without referring to DKIM.
Re: bayes_ignore_header policy?
On 2022-05-07 at 04:42:25 UTC-0400 (Sat, 7 May 2022 11:42:25 +0300)
Henrik K <hege@hege.li>
is rumored to have said:

> There's lots of common headers that are basically huge base64 strings,
> creating stupid amounts of random Bayes tokens.
>
> Apparently rulesrc/sandbox/axb/23_bayes_ignore_header.cf was created to
> handle some of these already?
>
> I've found at least these missing:
>
> IronPort-SDR
> X-Exchange-Antispam-Report-CFA-Test
> X-Forefront-Antispam-Report-Untrusted
> X-Gm-Message-State
> X-MS-Exchange-AntiSpam-MessageData
> X-MS-Exchange-AntiSpam-MessageData-0
> X-MS-Exchange-CrossTenant-UserPrincipalName
> X-MS-Exchange-SLBlob-MailProps
> X-MSFBL
> X-Microsoft-Antispam-Message-Info
> X-Microsoft-Antispam-Message-Info-Original
> X-Microsoft-Antispam-Untrusted
> X-Microsoft-Exchange-Diagnostics
> X-Provags-ID
> X-SG-EID
> X-SG-ID
>
> Wouldn't these be better put directly into bayes/23_bayes.cf instead of some
> sandbox, that's intended more for testing rules than changing SA config?

Yes.

However, I'm not convinced that all of those are unhelpful for Bayes. Some will never repeat and so are pure noise, but those which identify specific senders may be useful. The MS anti-spam headers may be tokenized into useful pieces (e.g. "NSPM" or "SPM") even if the headers as a whole are opaque.

> Any objections 1) adding these new ones

I have not researched all of those, but I believe that some of those should in theory be useful in Bayes.

> 2) moving everything to 23_bayes.cf?

+1


--
Bill Cole
bill@scconsult.com or billcole@apache.org
(AKA @grumpybozo and many *@billmail.scconsult.com addresses)
Not Currently Available For Hire
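
A rough sketch of the "useful pieces" idea from the message above: instead of ignoring a Microsoft anti-spam report header outright, split it into fields and keep only a few short ones. Microsoft documents X-Forefront-Antispam-Report as a semicolon-separated list of KEY:VALUE fields (SFV:NSPM, SCL:1 and so on); that the -Untrusted variant has the same layout, the sample value below, and the choice of which fields to keep are all assumptions made for illustration.

```python
def split_report_fields(value):
    """Split a 'KEY:VALUE;KEY:VALUE;' header value into a dict."""
    fields = {}
    for part in value.split(";"):
        # Split on the first colon only, so values may contain colons.
        key, sep, val = part.strip().partition(":")
        if sep and key:
            fields[key] = val
    return fields


def report_tokens(value, keep=("SFV", "SCL")):
    """Return 'KEY:VALUE' tokens for the selected fields only."""
    fields = split_report_fields(value)
    return [f"{key}:{fields[key]}" for key in keep if key in fields]


if __name__ == "__main__":
    # Illustrative value, not taken from a real message.
    sample = "CIP:192.0.2.1;CTRY:US;LANG:en;SCL:1;SRV:;IPV:NLI;SFV:NSPM;H:mail.example.com;"
    print(report_tokens(sample))
```

Whether such tokens actually improve classification would still need testing against a real corpus.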
Re: bayes_ignore_header policy?
On 2022-05-08 17:29, Bill Cole wrote:
> On 2022-05-07 at 04:42:25 UTC-0400 (Sat, 7 May 2022 11:42:25 +0300)
> Henrik K <hege@hege.li>
> is rumored to have said:
>
>> There's lots of common headers that are basically huge base64 strings,
>> creating stupid amounts of random Bayes tokens.
>>
>> Apparently rulesrc/sandbox/axb/23_bayes_ignore_header.cf was created
>> to
>> handle some of these already?
>>
>> I've found at least these missing:
>>
>> IronPort-SDR
>> X-Exchange-Antispam-Report-CFA-Test
>> X-Forefront-Antispam-Report-Untrusted
>> X-Gm-Message-State
>> X-MS-Exchange-AntiSpam-MessageData
>> X-MS-Exchange-AntiSpam-MessageData-0
>> X-MS-Exchange-CrossTenant-UserPrincipalName
>> X-MS-Exchange-SLBlob-MailProps
>> X-MSFBL
>> X-Microsoft-Antispam-Message-Info
>> X-Microsoft-Antispam-Message-Info-Original
>> X-Microsoft-Antispam-Untrusted
>> X-Microsoft-Exchange-Diagnostics
>> X-Provags-ID
>> X-SG-EID
>> X-SG-ID
>>
>> Wouldn't these be better put directly into bayes/23_bayes.cf instead
>> of some
>> sandbox, that's intended more for testing rules than changing SA
>> config?
>
> Yes.
>
> However, I'm not convinced that all of those are unhelpful for Bayes.
> Some will never repeat and so are pure noise, but those which identify
> specific senders may be useful. The MS anti-spam headers may be
> tokenized into useful pieces (e.g. "NSPM" or "SPM") even if the
> headers as a whole are opaque.
>
>> Any objections 1) adding these new ones
>
> I have not researched all of those, but I believe that some of those
> should in theory be useful in Bayes.

so there would be a need for an opt-out for bayes_ignore_header?

>> 2) moving everything to 23_bayes.cf?
> +1

+1
Re: bayes_ignore_header policy?
On Sun, May 08, 2022 at 11:29:29AM -0400, Bill Cole wrote:
>
> I have not researched all of those, but I believe that some of those
> should in theory be useful in Bayes.

So is someone going to research them then? And the 268 older headers that
axb already implemented?

Honestly I couldn't care less, since I've already ignored them locally.
Just figured they would save a few database bytes globally. This is not
something I want to waste time running 10-fold cross validations on anyway.
Re: bayes_ignore_header policy?
On 2022-05-10 06:44, Henrik K wrote:
> On Sun, May 08, 2022 at 11:29:29AM -0400, Bill Cole wrote:
>>
>> I have not researched all of those, but I believe that some of those
>> should in theory be useful in Bayes.
>
> So is someone going to research them then? And the 268 older headers
> that
> axb already implemented?
>
> Honestly I couldn't care less, since I've already ignored them locally.
> Just figured they would save few database bytes globally. This is not
> something I want to waste time running 10-fold cross validations
> anyway.

why have that bayes_ignore_header support in spamassassin in the first
place then ?

we could add one single option, ignore ALL headers ?

saves much more in bayes :)
Re: bayes_ignore_header policy?
> On 2022-05-10 06:44, Henrik K wrote:
>> On Sun, May 08, 2022 at 11:29:29AM -0400, Bill Cole wrote:
>>>
>>> I have not researched all of those, but I believe that some of those
>>> should in theory be useful in Bayes.
>>
>> So is someone going to research them then? And the 268 older headers
>> that
>> axb already implemented?
>>
>> Honestly I couldn't care less, since I've already ignored them locally.
>> Just figured they would save few database bytes globally. This is not
>> something I want to waste time running 10-fold cross validations anyway.
>
> why have that bayes_ignore_header support in spamassassin in the first
> place then ?
>
> we could add one single option, ignore ALL headers ?
>
> saves much more in bayes :)

Just at a guess, ignoring any header that is a long string of gibberish
(longer than, say, 16 bytes) is probably worthwhile. All of the stuff I see in
headers of that sort seems to be related to spam tracking on various platforms.
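
One way the "long string of gibberish" guess could be sketched: look for a run of at least 16 characters from the base64/base64url alphabet that mixes upper case, lower case and digits. The mixed-class requirement is an extra assumption added here so that long ordinary words and purely numeric IDs are not flagged; the function names are hypothetical.

```python
import re


def looks_like_gibberish(run):
    """True if the run mixes upper case, lower case and digits."""
    has_upper = any(c.isupper() for c in run)
    has_lower = any(c.islower() for c in run)
    has_digit = any(c.isdigit() for c in run)
    return has_upper and has_lower and has_digit


def has_gibberish(value, min_len=16):
    """True if the header value contains a long, random-looking run."""
    pattern = r"[A-Za-z0-9+/=_-]{%d,}" % min_len
    return any(looks_like_gibberish(m.group(0))
               for m in re.finditer(pattern, value))


if __name__ == "__main__":
    print(has_gibberish("awKAzs8isiW1Kxq04b2g1pnn5lLt+4VLiyJH9ai0V/YyOy0TLi3a"))
    print(has_gibberish("Henrik K <hege@hege.li>"))
```

A heuristic like this will have false positives and negatives, so it is a starting point for experiments rather than a rule.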
Re: bayes_ignore_header policy?
On 7 May 2022, Henrik K. spake thusly:

> There's lots of common headers that are basically huge base64 strings,
> creating stupid amounts of random Bayes tokens.

Honestly I'm wondering if a simpler way to deal with these might simply
be to detect lengthy base64ed regions in headers (not actually that
difficult), try to unbase64 them and use *that* for Bayes, probably as a
new pseudo-header with a name derived from the old one, and with the
content dropped from the tokenization of the old one. Combine that with
an extra check: "words" containing control characters (in the range
0x00-0x1f) are not tokenized. (Maybe another check imposing a maximum
length on things Bayes considers words might be a good idea, but I'm not
sure if we do that already.)

It is clearly wrong to tokenize long base64 strings in any case, and
we already decode these in body text: maybe we should start doing
something similar for regions of headers, since this is such a common
thing for non-spam to do these days.
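
The steps proposed above could be sketched roughly as follows. This is not how SpamAssassin's Bayes tokenizer works today; the minimum run length of 24, the "-Decoded" pseudo-header suffix, the word cap of 32 characters and all function names are assumptions picked for illustration.

```python
import base64
import binascii
import re

B64_RUN = re.compile(r"[A-Za-z0-9+/]{24,}={0,2}")  # assumed minimum length
CONTROL = re.compile(r"[\x00-\x1f]")
MAX_WORD = 32  # assumed maximum word length


def decode_b64_regions(value):
    """Return (remaining_text, decoded_texts) for one header value."""
    decoded = []

    def replace(match):
        run = match.group(0)
        try:
            raw = base64.b64decode(run, validate=True)
        except (binascii.Error, ValueError):
            return run  # not valid base64, leave it where it was
        decoded.append(raw.decode("utf-8", errors="replace"))
        return " "  # drop the region from the original header

    remaining = B64_RUN.sub(replace, value)
    return remaining, decoded


def bayes_words(text):
    """Split on spaces; skip overlong words and words with control chars."""
    return [w for w in text.split(" ")
            if w and len(w) <= MAX_WORD and not CONTROL.search(w)]


def tokenize_header(name, value):
    """Return words per header, with decoded text under a pseudo-header."""
    remaining, decoded = decode_b64_regions(value)
    tokens = {name: bayes_words(remaining)}
    if decoded:
        tokens[name + "-Decoded"] = bayes_words(" ".join(decoded))
    return tokens


if __name__ == "__main__":
    blob = base64.b64encode(b"campaign 42 newsletter user").decode()
    print(tokenize_header("X-Example", "id=" + blob + " ok"))
```

Regions that decode to binary data mostly fall out through the control-character and length checks, which is the point of combining the two ideas.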