Auto-learning ‘considered harmful’: not so much when rejecting spam?

Jan 17, 2023, 4:33 AM

I have heard it said many times on this list that auto-learning is
discouraged, so I decided to finally look into disabling it.

But then I realised that I do have a use for auto-learning: In my setup,
I use a milter to reject certain spam (score > 10.0). Now, if I turn off
auto-learning I lose something. Because, as far as I understand the
default spam auto-learning threshold of 12.0 causes incoming
high-probability spam to be learned as spam, even though the message is
then rejected and not available locally later.

Is my understanding correct? Auto-learning of spam can be useful if spam
is rejected during the SMTP conversation but after it has been seen
– and learned – by SpamAssassin?
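
For reference, the settings in question can be sketched as a local.cf fragment. This is a sketch rather than a recommendation: the values shown are the stock defaults as far as I know, and the reject-at-10.0 step is done by the milter, not by anything in this file. Also note that the auto-learn decision uses its own score, computed without the Bayes rules and without rules flagged noautolearn, so it can differ from the score the milter acts on.

```
# Auto-learning on or off (1 is the stock default; 0 disables it).
bayes_auto_learn 1

# Learn as spam when the auto-learn score is at or above this value.
# SpamAssassin additionally requires at least 3 points from header
# rules and 3 points from body rules before learning as spam.
bayes_auto_learn_threshold_spam 12.0

# Learn as ham when the auto-learn score is below this value.
bayes_auto_learn_threshold_nonspam 0.1
```

With a layout like this, a message scoring 10.5 would be rejected by the milter but not learned, while one whose auto-learn score reaches 12.0 would be learned as spam before the rejection.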

Re: Auto-learning ‘considered harmful’: not so much when rejecting spam?

kmcgrail at apache

Jan 17, 2023, 6:37 AM

On 1/17/2023 7:33 AM, David Bürgin wrote:
> I have heard it said many times on this list that auto-learning is
> discouraged, so I decided to finally look into disabling it.
>
> But then I realised that I do have a use for auto-learning: In my setup,
> I use a milter to reject certain spam (score > 10.0). Now, if I turn off
> auto-learning I lose something. Because, as far as I understand the
> default spam auto-learning threshold of 12.0 causes incoming
> high-probability spam to be learned as spam, even though the message is
> then rejected and not available locally later.
>
> Is my understanding correct? Auto-learning of spam can be useful if spam
> is rejected during the SMTP conversation but after it has been seen
> – and learned – by SpamAssassin?

The problem I've seen with auto-learning is that miscategorization
errors slowly spiral. The technical term is that it reinforces a
bias. A Bayes database should be carefully maintained. It's not
much of a fire-and-forget technology.

And, for example, letting users control it becomes a question of "what
is spam?" For example, users might get a very legit mail BUT they are
tired of seeing it in their inbox. So they want to train it as spam.
If you have per-user implementations, that can be good BUT you need a
few hundred samples of good email and bad email to activate Bayes.

In short, I don't have a good solution for training Bayes that isn't a
lot of work but auto-learning is usually a bad solution.

One case where it might be good is if you had a system set up that you
fed already-classified emails to. It would then use that good feed for
auto-learning, which adds a way of learning without using the
command line.

Regards,
KAM

--
Kevin A. McGrail
KMcGrail@Apache.org

Member, Apache Software Foundation
Chair Emeritus Apache SpamAssassin Project

Re: Auto-learning ‘considered harmful’: not so much when rejecting spam?

uhlar at fantomas

Jan 17, 2023, 7:32 AM

>On 1/17/2023 7:33 AM, David Bürgin wrote:
>>I have heard it said many times on this list that auto-learning is
>>discouraged, so I decided to finally look into disabling it.
>>
>>But then I realised that I do have a use for auto-learning: In my setup,
>>I use a milter to reject certain spam (score > 10.0). Now, if I turn off
>>auto-learning I lose something. Because, as far as I understand the
>>default spam auto-learning threshold of 12.0 causes incoming
>>high-probability spam to be learned as spam, even though the message is
>>then rejected and not available locally later.
>>
>>Is my understanding correct? Auto-learning of spam can be useful if spam
>>is rejected during the SMTP conversation but after it has been seen
>>– and learned – by SpamAssassin?

On 17.01.23 09:37, Kevin A. McGrail wrote:
>The problem with auto learning I've seen is that it slowly spirals
>miscategorization errors.

mostly because there are no really useful indicators of hamminess, and if
there are, spammers use them to spread their junk.

after long manual training being occasionally spoiled by autolearn,
I have manually set all rules that have negative scores to noautolearn:

tflags RCVD_IN_RP_CERTIFIED noautolearn net nice
tflags RCVD_IN_VALIDITY_CERTIFIED noautolearn net nice
tflags RCVD_IN_RP_SAFE noautolearn net nice
tflags RCVD_IN_VALIDITY_SAFE noautolearn net nice
tflags RCVD_IN_DNSWL_LOW noautolearn net nice
tflags RCVD_IN_DNSWL_MED noautolearn net nice
tflags RCVD_IN_DNSWL_HI noautolearn net nice
tflags RCVD_IN_MSPIKE_H2 noautolearn net nice
tflags RCVD_IN_MSPIKE_H3 noautolearn net nice
tflags RCVD_IN_MSPIKE_H4 noautolearn net nice
tflags RCVD_IN_MSPIKE_H5 noautolearn net nice
tflags RCVD_IN_MSPIKE_WL noautolearn net nice
tflags RCVD_IN_IADB_DK noautolearn net nice
tflags RCVD_IN_IADB_DOPTIN noautolearn net nice
tflags RCVD_IN_IADB_LISTED noautolearn net nice
tflags RCVD_IN_IADB_MI_CPR_MAT noautolearn net nice
tflags RCVD_IN_IADB_ML_DOPTIN noautolearn net nice
tflags RCVD_IN_IADB_OPTIN noautolearn net nice
tflags RCVD_IN_IADB_OPTIN_GT50 noautolearn net nice
tflags RCVD_IN_IADB_RDNS noautolearn net nice
tflags RCVD_IN_IADB_SENDERID noautolearn net nice
tflags RCVD_IN_IADB_SPF noautolearn net nice
tflags RCVD_IN_IADB_UT_CPR_MAT noautolearn net nice
tflags RCVD_IN_IADB_VOUCHED noautolearn net nice
tflags DKIMWL_WL_HIGH noautolearn net nice
tflags DKIMWL_WL_MEDHI noautolearn net nice
tflags DKIMWL_WL_MED noautolearn net nice
tflags DKIM_VALID noautolearn net nice
tflags DKIM_VALID_EF noautolearn net nice

it still needs some training.

and, in some cases, you may need to dump the database and re-train from
scratch.
That's why manual training is great and why you need to keep a corpus of
some spam, but mostly ham.
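
The tflags lines above all follow one pattern, sketched here with a hypothetical local rule (the name LOCAL_TRUSTED_RELAY and the host mail.example.org are made up for illustration). As far as I know, a tflags line sets the rule's whole flag list rather than adding to it, which is presumably why the existing net and nice flags are restated next to noautolearn in the list.

```
# Hypothetical local rule, for illustration only.
header    LOCAL_TRUSTED_RELAY  Received =~ /\bmail\.example\.org\b/
describe  LOCAL_TRUSTED_RELAY  Passed through our own relay (hypothetical)
score     LOCAL_TRUSTED_RELAY  -1.0

# nice: the rule is meant to lower the score.
# noautolearn: ignore the rule when computing the auto-learn score,
# so a hit here cannot push a message under the ham threshold.
tflags    LOCAL_TRUSTED_RELAY  nice noautolearn
```

The rule still counts toward the normal spam score; only the separate auto-learn score ignores it.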

> The technical term is that it reinforces a
>bias. A Bayes database should be carefully maintained. It's not very
>much of a fire and forget technology.
>
>And, for example, letting users control it becomes a question of
>"what is spam?" For example, users might get a very legit mail BUT
>they are tired of seeing it in their inbox. So they want to train it
>as spam. If you have per-user implementations, that can be good BUT
>you need a few hundred samples of good email and bad email to activate
>Bayes.
>
>In short, I don't have a good solution for training Bayes that isn't a
>lot of work but auto-learning is usually a bad solution.
>
>One case where it might be good is if you had a system setup that you
>fed emails to that were classified. It would then use that good feed
>to use the auto-learning and add a way of learning without using the
>command line.

--
Matus UHLAR - fantomas, uhlar@fantomas.sk ; http://www.fantomas.sk/