Mailing List Archive

sa-learn absurdly slow on re-iterating over mailboxes (TxRep involved)
Hello,

I set up SpamAssassin the other week on my inbox mail server, and so far
it's been running well. Now I want to train my Bayes database
with some mails I have stored (200+ each of spam and ham, which should
be enough according to the documentation).

Here is the version I am using:

[foo@mailcollect ~]$ sa-learn --version
SpamAssassin version 3.4.3

I have spamassassin running as a systemd service:
/system.slice/spamassassin.service
+-645 /usr/bin/perl -T -w /usr/bin/spamd -c -m5 -H
--razor-log-file=sys-syslog
+-652 spamd child
+-653 spamd child

And the platform I am running on:

[foo@mailcollect ~]$ grep -e PRETTY_NAME /etc/os-release
PRETTY_NAME="Fedora 31 (Thirty One)"
[foo@mailcollect ~]$ rpm -qi spamassassin
Name : spamassassin
Version : 3.4.3
Release : 2.fc31
Architecture: x86_64

The problem is this: when I initially ran sa-learn on my mailboxes,
it was fairly fast.

For example:

+ /usr/bin/sa-learn --no-sync --progress --ham /var/spool/fetchmail/Maildir/.Congstar
92% [================================= ] 5.21 msgs/sec 00m09s DONE
Learned tokens from 45 message(s) (49 message(s) examined)

This is just a small Maildir, I have other much bigger ones (including
my spam Maildir, which contains 2000+ messages).

Now, if I run sa-learn again on the same folder (the manual says
"SpamAssassin remembers which mail messages it has learnt already, and
will not re-learn those messages again, unless you use the --forget
option.", so I think this is OK to do), it gets absurdly slow, taking
over 2 minutes for the same directory with 45 mails.

+ /usr/bin/sa-learn --no-sync --progress --ham /var/spool/fetchmail/Maildir/.Congstar
92% [============================= ] 0.30 msgs/sec 02m40s DONE
Learned tokens from 0 message(s) (49 message(s) examined)

Now imagine this for a folder with over 2k messages (of which I have
several).
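
For a rough sense of scale, the sketch below extrapolates the two rates
reported above (5.21 msgs/sec on the first run, 0.30 msgs/sec on the
re-run) to a 2000-message folder. It assumes the per-message cost stays
constant as the folder grows, which nothing in this thread confirms, and
the helper names are hypothetical.

```python
def estimate_seconds(messages: int, msgs_per_sec: float) -> float:
    """Estimate wall-clock seconds, assuming a constant per-message rate."""
    if msgs_per_sec <= 0:
        raise ValueError("msgs_per_sec must be positive")
    return messages / msgs_per_sec


def format_duration(seconds: float) -> str:
    """Format a duration in seconds as e.g. '1h05m09s'."""
    total = round(seconds)
    hours, rest = divmod(total, 3600)
    minutes, secs = divmod(rest, 60)
    return f"{hours}h{minutes:02d}m{secs:02d}s"


if __name__ == "__main__":
    folder_size = 2000  # messages, as in the spam Maildir mentioned above
    print("first run:", format_duration(estimate_seconds(folder_size, 5.21)))
    print("re-run:   ", format_duration(estimate_seconds(folder_size, 0.30)))
```

Under that assumption a single 2000-message folder goes from roughly six
and a half minutes on the first pass to nearly two hours on a re-run.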

I am not sure why this is. I ran sa-learn with debug enabled to see
whether I could spot something, and it looks like it spends ~3s on each
message updating the TxRep database (which I enabled in
SpamAssassin with "loadplugin Mail::SpamAssassin::Plugin::TxRep"):

Jan 8 23:49:50.745 [308] dbg: TxRep: forgetting a message
Jan 8 23:49:50.746 [308] dbg: auto-whitelist: db-based ec300f7aa9c95003b94439831b843605e9a94660@sa_generated|ip=none scores 2/-40
Jan 8 23:49:50.746 [308] dbg: check: tagrun - tag TXREPMSG_ID is now ready, value: -20.0
Jan 8 23:49:50.746 [308] dbg: TxRep: reputation: -20.000, count: 2, weight: 1.0, delta: -20.000, MSG_ID: ec300f7aa9c95003b94439831b843605e9a94660@sa_generated
Jan 8 23:49:52.202 [308] dbg: TxRep: forgetting stored score -20.000 of message ec300f7aa9c95003b94439831b843605e9a94660@sa_generated
Jan 8 23:49:52.203 [308] dbg: TxRep: active, ec300f7aa9c95003b94439831b843605e9a94660@sa_generated pre-score: ?, autolearn score: -20, IP: 93.191.162.21, address: noreply@congstarnews.de (unsigned)

Jan 8 23:49:52.209 [308] dbg: TxRep: reputation: none, count: 0, learning: -20, MSG_ID: ec300f7aa9c95003b94439831b843605e9a94660@sa_generated
Jan 8 23:49:52.209 [308] dbg: auto-whitelist: add_score: new count: 1, new totscore: 20
Jan 8 23:49:53.710 [308] dbg: auto-whitelist: DB addr list: untie-ing and unlocking
Jan 8 23:49:53.715 [308] dbg: auto-whitelist: DB addr list: file locked, breaking lock
Jan 8 23:49:53.716 [308] dbg: locker: safe_unlock: unlink /var/spool/fetchmail/.spamassassin/tx-reputation.lock

Note the timestamps. This happens for each of the 49 messages, whenever
it wants to forget a score, which also explains why it was so much
faster initially: it didn't know the score yet and had nothing to
forget. Adding the new score also seems slow-ish.
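
Reading the gaps off the timestamps by eye gets tedious over a full run.
The following is a minimal sketch, not part of SpamAssassin, that scans
debug output in the format shown above and reports the steps followed by
a pause of at least one second. The regular expression is an assumption
based only on the lines in this post, and the function names are
hypothetical. Lines without a timestamp (blank lines, or continuation
lines produced by mail wrapping) are skipped.

```python
import re

# Matches e.g. "Jan 8 23:49:52.209 [308] dbg: auto-whitelist: add_score: ..."
LINE_RE = re.compile(
    r"^\w{3}\s+\d{1,2}\s+(\d{2}):(\d{2}):(\d{2})\.(\d{3})\s+\[\d+\]\s+dbg:\s+(.*)$"
)

DAY_MS = 24 * 60 * 60 * 1000


def parse(lines):
    """Yield (milliseconds_since_midnight, message) for timestamped lines."""
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None:
            continue  # blank line or wrapped continuation line
        hours, minutes, seconds, millis, message = match.groups()
        total_ms = ((int(hours) * 60 + int(minutes)) * 60 + int(seconds)) * 1000
        yield total_ms + int(millis), message


def slow_steps(lines, threshold=1.0):
    """Return (gap_seconds, message) pairs for gaps of at least `threshold`.

    The gap is attributed to the earlier of two consecutive log lines,
    i.e. the last thing logged before the pause.
    """
    events = list(parse(lines))
    result = []
    for (t0, message), (t1, _) in zip(events, events[1:]):
        gap_ms = (t1 - t0) % DAY_MS  # tolerate a run that crosses midnight
        if gap_ms >= threshold * 1000:
            result.append((gap_ms / 1000, message))
    return result


SAMPLE = """\
Jan 8 23:49:50.746 [308] dbg: TxRep: reputation: -20.000, count: 2,
weight: 1.0, delta: -20.000, MSG_ID:
ec300f7aa9c95003b94439831b843605e9a94660@sa_generated
Jan 8 23:49:52.202 [308] dbg: TxRep: forgetting stored score -20.000 of
Jan 8 23:49:52.209 [308] dbg: auto-whitelist: add_score: new count: 1,
new totscore: 20
Jan 8 23:49:53.710 [308] dbg: auto-whitelist: DB addr list: untie-ing
"""

if __name__ == "__main__":
    for gap, message in slow_steps(SAMPLE.splitlines()):
        print(f"{gap:6.3f}s after: {message}")
```

On the excerpt above this flags two pauses of about 1.5 seconds each: one
before the stored score is forgotten and one after add_score, which
together account for the ~3s per message.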

What's going on here? These are the file sizes of my databases:

[mageta@mailcollect ~]$ ls -lh .spamassassin/
total 45M
-rw------- 1 mageta mail 61K Jan 9 00:23 bayes_journal
-rw------- 1 mageta mail 4.7M Jan 8 23:39 bayes_seen
-rw------- 1 mageta mail 41M Jan 8 23:39 bayes_toks
-rw------- 1 mageta mail 11M Jan 9 00:23 tx-reputation
-rw------- 1 mageta mail 4 Jan 9 00:23 tx-reputation.mutex
-rw-r--r-- 1 mageta mail 2.7K Jan 9 00:09 user_prefs
[mageta@mailcollect ~]$ file .spamassassin/tx-reputation
.spamassassin/tx-reputation: Berkeley DB (Hash, version 9, native byte-order)
[mageta@mailcollect ~]$ file .spamassassin/bayes_toks
.spamassassin/bayes_toks: Berkeley DB (Hash, version 9, native byte-order)

Any ideas? Can I fix this somehow? Should I file a bug report? This
makes sa-learn pretty unusable for me atm. I have let it run once over
everything I have, so I should be good for now (which is great!),
but letting it rerun will tie up one CPU on my server for _hours_.


best regards,
- Benjamin
Re: sa-learn absurdly slow on re-iterating over mailboxes (TxRep involved)
On 8 Jan 2020, Benjamin Block told this:

> Now, if I run sa-learn again on the same folder (the manual says "SpamAssassin remembers which mail messages it has learnt already,
> and will not re-learn those messages again, unless you use the --forget option.", so I think this is OK to do), it gets absurdly
> slow, taking over 2 minutes for the same directory with 45 mails.
>
> + /usr/bin/sa-learn --no-sync --progress --ham /var/spool/fetchmail/Maildir/.Congstar
> 92% [============================= ] 0.30 msgs/sec 02m40s DONE
> Learned tokens from 0 message(s) (49 message(s) examined)
>
> Now imagine this for a folder with over 2k messages (of which I have several).

Possibly related to <https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7587>?

> Jan 8 23:49:52.209 [308] dbg: TxRep: reputation: none, count: 0, learning: -20, MSG_ID:
> ec300f7aa9c95003b94439831b843605e9a94660@sa_generated
> Jan 8 23:49:52.209 [308] dbg: auto-whitelist: add_score: new count: 1, new totscore: 20
> Jan 8 23:49:53.710 [308] dbg: auto-whitelist: DB addr list: untie-ing and unlocking
> Jan 8 23:49:53.715 [308] dbg: auto-whitelist: DB addr list: file locked, breaking lock
> Jan 8 23:49:53.716 [308] dbg: locker: safe_unlock: unlink /var/spool/fetchmail/.spamassassin/tx-reputation.lock

... looks like it to me. It's at least spotting the lock and breaking
it, but it's still taking a second and a half to do it, and it happens
for each message. That's better than the 90s it used to take, but still
bad.

I've come to the conclusion that TxRep is essentially unmaintained and
basically doesn't work unless you use SQL storage, and have migrated
back to the AWL, which still works fine. I hope I'm wrong.
Re: sa-learn absurdly slow on re-iterating over mailboxes (TxRep involved)
On Tue, Jan 14, 2020 at 12:05:57PM +0000, Nix wrote:
>
> I've come to the conclusion that TxRep is essentially unmaintained and
> basically doesn't work unless you use SQL storage, and have migrated
> back to the AWL, which still works fine. I hope I'm wrong.

There's only so much a few inactive developers can do, I don't even use
TxRep (or AWL for that matter), so my priorities are on more critical
things. Feel free to contribute. :-)

In any case one should use SQL if possible (and Redis for Bayes), file based
databases have always been painful to use.
Re: sa-learn absurdly slow on re-iterating over mailboxes (TxRep involved)
On 2020-01-14 7:05 am, Nix wrote:
> On 8 Jan 2020, Benjamin Block told this:
>
> ... looks like it to me. It's at least spotting the lock and breaking
> it, but it's still taking a second and a half to do it, and it happens
> for each message. That's better than the 90s it used to take, but still
> bad.
>
> I've come to the conclusion that TxRep is essentially unmaintained and
> basically doesn't work unless you use SQL storage, and have migrated
> back to the AWL, which still works fine. I hope I'm wrong.

Are you saying that it DOES work cleanly when using SQL storage? I've
been using AWL with SQL for years and it's been "fine". I want to switch
to TxRep with SQL, but now I'm not so sure ...
Re: sa-learn absurdly slow on re-iterating over mailboxes (TxRep involved)
On Tue, Jan 14, 2020 at 12:05:57PM +0000, Nix wrote:
> On 8 Jan 2020, Benjamin Block told this:
>
> > Now, if I run sa-learn again on the same folder (the manual says "SpamAssassin remembers which mail messages it has learnt already,
> > and will not re-learn those messages again, unless you use the --forget option.", so I think this is OK to do), it gets absurdly
> > slow, taking over 2 minutes for the same directory with 45 mails.
> >
> > + /usr/bin/sa-learn --no-sync --progress --ham /var/spool/fetchmail/Maildir/.Congstar
> > 92% [============================= ] 0.30 msgs/sec 02m40s DONE
> > Learned tokens from 0 message(s) (49 message(s) examined)
> >
> > Now imagine this for a folder with over 2k messages (of which I have several).
>
> Possibly related to <https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7587>?

Ah yes, I saw that as well, and thought it might be related. But I saw
they made changes in response to the bug, so I wasn't sure that still
applies.

>
> > Jan 8 23:49:52.209 [308] dbg: TxRep: reputation: none, count: 0, learning: -20, MSG_ID:
> > ec300f7aa9c95003b94439831b843605e9a94660@sa_generated
> > Jan 8 23:49:52.209 [308] dbg: auto-whitelist: add_score: new count: 1, new totscore: 20
> > Jan 8 23:49:53.710 [308] dbg: auto-whitelist: DB addr list: untie-ing and unlocking
> > Jan 8 23:49:53.715 [308] dbg: auto-whitelist: DB addr list: file locked, breaking lock
> > Jan 8 23:49:53.716 [308] dbg: locker: safe_unlock: unlink /var/spool/fetchmail/.spamassassin/tx-reputation.lock
>
> ... looks like it to me. It's at least spotting the lock and breaking
> it, but it's still taking a second and a half to do it, and it happens
> for each message. That's better than the 90s it used to take, but still
> bad.
>
> I've come to the conclusion that TxRep is essentially unmaintained and
> basically doesn't work unless you use SQL storage, and have migrated
> back to the AWL, which still works fine. I hope I'm wrong.

Hmm, interesting. Maybe I should try SQL then to see whether it's faster
with that. It makes my setup more complex, which I'm not a huge fan of,
but OK.

Thanks,
- Benjamin
Re: sa-learn absurdly slow on re-iterating over mailboxes (TxRep involved)
On Tue, 14 Jan 2020 12:05:57 +0000
Nix wrote:


> I've come to the conclusion that TxRep is essentially unmaintained and
> basically doesn't work unless you use SQL storage, and have migrated
> back to the AWL, which still works fine. I hope I'm wrong.

I think people should think about whether they actually need TxRep. To
me it's an additional risk rather than a safety net.

TxRep looks to be hacked out of AWL; it's complex and lacks
transparency. Most of its reported bugs are clearly visible: they
involve long delays, runtime errors and debug messages. The chances are
that these bugs are the tip of the iceberg. If it's also getting its
computed score wrong, it will have to be pretty bad, pretty often,
before anyone notices.

Most of what it does doesn't seem well designed. I think in part this
is because it reuses AWL's database code and so sees everything as a
score-averaging problem.

The chief flaw in AWL was that it used the first-public IP address
from a forgeable received header. This potentially allows spammers to
exploit a good reputation if they can match email addresses to IP
address blocks.

TxRep uses a trusted IP address, which is mostly a step forward (except
for forwarded email, where it's very much worse). However, in practice
the IP address is rarely used, and TxRep uses DKIM or SPF reputations
instead.

Unfortunately TxRep appears to mishandle SPF and treats the header
"From" as being authenticated by a pass regardless of alignment with
the envelope sender. This can allow spam to abuse good reputations
without the spammer even trying.
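
To make the alignment point concrete: an SPF pass authenticates the
domain of the envelope sender (MAIL FROM), not the header "From". The
sketch below shows what a strict alignment check could look like. It is
not TxRep's code, the function names are hypothetical, and it only
accepts an exact domain match. Relaxed alignment as used by DMARC
compares organizational domains, which needs the Public Suffix List and
is left out here.

```python
from email.utils import parseaddr


def domain_of(address: str) -> str:
    """Return the lowercased domain of an address, or '' if there is none."""
    _, addr = parseaddr(address)
    _, sep, domain = addr.rpartition("@")
    return domain.lower().rstrip(".") if sep else ""


def spf_authenticates_from(spf_result: str, envelope_sender: str,
                           header_from: str) -> bool:
    """True only if SPF passed AND the header From domain is identical to
    the envelope sender domain (strict alignment)."""
    if spf_result.lower() != "pass":
        return False
    envelope_domain = domain_of(envelope_sender)
    from_domain = domain_of(header_from)
    return bool(envelope_domain) and envelope_domain == from_domain


if __name__ == "__main__":
    # SPF passes for the spammer's own domain, but the header From claims
    # someone else: the pass says nothing about the header From.
    print(spf_authenticates_from(
        "pass", "bounce@spammer.example", "Bank <support@bank.example>"))
    # Same domain in both places: the pass does cover the header From.
    print(spf_authenticates_from(
        "pass", "news@bank.example", "Bank <support@bank.example>"))
```

Treating the first case as authenticated is the failure mode described
above: a reputation earned by the header From address gets applied to
mail whose SPF pass was for an unrelated domain.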
Re: sa-learn absurdly slow on re-iterating over mailboxes (TxRep involved)
On 14 Jan 2020, Henrik K. said:

> On Tue, Jan 14, 2020 at 12:05:57PM +0000, Nix wrote:
>>
>> I've come to the conclusion that TxRep is essentially unmaintained and
>> basically doesn't work unless you use SQL storage, and have migrated
>> back to the AWL, which still works fine. I hope I'm wrong.
>
> There's only so much a few inactive developers can do, I don't even use
> TxRep (or AWL for that matter), so my priorities are on more critical
> things. Feel free to contribute. :-)

I tried, but couldn't make it work well enough to not lose functionality
at the same time as I gained the ability to not break locks.

> In any case one should use SQL if possible (and Redis for Bayes), file based
> databases have always been painful to use.

Honestly, for smaller installations a whole database server is just one
more thing to break. I'd only want one of those tied up with my email if
I was storing *the email itself* in the database as well.

--
NULL && (void)
Re: sa-learn absurdly slow on re-iterating over mailboxes (TxRep involved)
On 14 Jan 2020, Dean Carpenter spake thusly:

> On 2020-01-14 7:05 am, Nix wrote:
>> On 8 Jan 2020, Benjamin Block told this:
>>
>> ... looks like it to me. It's at least spotting the lock and breaking
>> it, but it's still taking a second and a half to do it, and it happens
>> for each message. That's better than the 90s it used to take, but still
>> bad.
>>
>> I've come to the conclusion that TxRep is essentially unmaintained and
>> basically doesn't work unless you use SQL storage, and have migrated
>> back to the AWL, which still works fine. I hope I'm wrong.
>
> Are you saying that it DOES work cleanly when using SQL storage ? I've
> been using AWL with SQL for years and it's been "fine".

Other people seem to be saying that. I never tried SQL anything with
SpamAssassin, so I can't be sure.

--
NULL && (void)