
sa-learn, TXREP, network queries, documentation
This is a blend of a not-entirely-sure documentation bug report and
questions. I am using 3.4.5.

I used to use BAYES. To train it, I sorted ham that landed in spam
folders back to where it should have gone, and sorted spam that landed
in ham folders to "spam.manual". I had a cron job run sa-learn over
each folder once a day, with --spam and --ham arguments. This worked
reasonably well, even though there are a vast number of messages; most
are not new and the relearning process tended to just pick up the new or
re-filed ones.
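
A daily job like the one described could be sketched roughly as follows.
This is only an illustration, not the actual script: the folder paths
and the helper name build_commands are hypothetical, and the loop only
prints the commands (a dry run). The --ham and --spam options are the
ones mentioned above.

```python
# Hypothetical folder layout; adjust to the real mail store.
FOLDERS = {
    "ham": ["/home/user/Maildir/cur"],
    "spam": ["/home/user/Maildir/.spam.manual/cur"],
}


def build_commands(folders):
    """Return one sa-learn argv list per folder.

    `folders` maps "ham" or "spam" to a list of directories, so the
    key doubles as the sa-learn option name (--ham / --spam).
    """
    commands = []
    for kind, paths in folders.items():
        for path in paths:
            commands.append(["sa-learn", f"--{kind}", path])
    return commands


if __name__ == "__main__":
    # Dry run: print what a cron job would execute once a day.
    for argv in build_commands(FOLDERS):
        print(" ".join(argv))
```

In a real cron job the print would be replaced by something that
actually runs each command, for example subprocess.run(argv, check=True).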

Recently I enabled TXREP, and I'm generally very happy with it.
I did run sa-learn on a few messages that were misclassified: both
ham that scored above 1 and low-scoring spam.

I received advice that bayes was difficult to use correctly in terms of
training and keeping the database in good shape, and I had some
misclassifications, so I decided to clear out my bayes db and retrain,
by which I mean running sa-learn over my current set of ham/spam.

I was surprised by two aspects of this (note that I am only 98% sure I
interpreted things right):

With TXREP enabled, sa-learn seems to cause a full reevaluation of
each message. On one hand this makes a lot of sense once considered,
because the foundation of TXREP is moving scores towards the learned
average.

Because of TXREP's re-evaluation, without "-L", sa-learn causes RBL
queries to be made for each message scanned, and the rate of queries
is very high. After doing this, I found that I was blocked by URIBL.

So therefore:

1) The sa-learn documentation for -L says

-L, --local
Do not perform any network accesses while learning details about
the mail messages. This will speed up the learning process, but
may result in a slightly lower accuracy.

Note that this is currently ignored, as current versions of
SpamAssassin will not perform network access while learning; but
future versions may.

and while I haven't quite proved it, the second paragraph seems
wrong.

2) The web page at

https://cwiki.apache.org/confluence/display/SPAMASSASSIN/TxRep

says to use sa-learn and doesn't caution about -L. (Of course, if one
manually trains a few errant messages it doesn't matter.)

3) sa-learn does not document that it is no longer for BAYES, but a
general interface to mechanisms that learn. (There's also no
"sa-learn --methods" to show the current list.) Many of the sa-learn
options seem to really be about bayes only, and some seem to be higher
level.

4) There is a bonus of txrep_learn_penalty for learning spam, default
20. If the user says it is spam by calling learn, then I don't
understand why it isn't just treated as score 20. Likewise
txrep_learn_bonus and being treated as -20. It would seem to avoid
much processing and also potentially huge amounts of RBL traffic.
(I've added -L to my script that calls sa-learn.)

5) It's very nice to have URIBL_BLOCKED, which is how I noticed.
Thanks to whoever added that, and to URIBL for providing the BL. I'm
sorry my machine generated excessive queries (and I'm glad the block
expired after a few weeks of not making any).


I'm curious how others see this, and if anyone else has had trouble with
dnsbl blocks from running sa-learn with txrep.

Greg
Re: sa-learn, TXREP, network queries, documentation
On Mon, 12 Apr 2021 09:40:47 -0400
Greg Troxel wrote:



> 3) sa-learn does not document that it is no longer for BAYES, but a
> general interface to mechanisms that learn.

It always was in theory.

> 4) There is a bonus of txrep_learn_penalty for learning spam,
> default 20. If the user says it is spam by calling learn, then I
> don't understand why it isn't just treated as score 20.

Probably it's to keep retraining straightforward if you make a mistake.

> (I've added -L to my script that calls sa-learn.)

That sounds like a bad idea because you will be feeding it bogus data.
It's probably better to just turn off TxRep when training from a
historical corpus and only run your periodic training on new mail.
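
One way to limit periodic training to new mail is to pick out only the
files that changed since the previous run. The sketch below does that by
modification time, which is an assumption: a message that is merely
moved between folders usually keeps its old mtime, so a real job might
need a different test. The helper names are hypothetical. The optional
--cf argument relies on sa-learn accepting extra configuration lines on
the command line; whether passing "use_txrep 0" that way really keeps
TxRep out of the learning step is untested here and should be checked
before relying on it.

```python
import os


def new_messages(folder, since):
    """Return files in `folder` modified after `since` (epoch seconds)."""
    found = []
    for entry in os.scandir(folder):
        if entry.is_file() and entry.stat().st_mtime > since:
            found.append(entry.path)
    return sorted(found)


def learn_command(kind, files, extra_conf=None):
    """Build an sa-learn argv list for `kind` ("ham" or "spam").

    `extra_conf` is passed through as --cf=...; using it to disable
    TxRep ("use_txrep 0") is an assumption, not verified behaviour.
    """
    argv = ["sa-learn", f"--{kind}"]
    if extra_conf:
        argv.append(f"--cf={extra_conf}")
    return argv + list(files)
```

For a one-off pass over an old corpus this would be called as
learn_command("spam", files, extra_conf="use_txrep 0"), and for the
daily run as learn_command("spam", new_messages(folder, last_run)),
where last_run is the time of the previous run.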