Mailing List Archive: sa-learn absurdly slow on re-iterating over mailboxes (TxRep involved)

Jan 8, 2020, 3:48 PM

Hello,

I set up SpamAssassin on my inbox mail server the other week, and so far
it has been running well. Now I want to train my Bayes database
with some mail I have stored (200+ each of spam and ham, which should
be enough according to the documentation).

Here is the version I am using:

[foo@mailcollect ~]$ sa-learn --version
SpamAssassin version 3.4.3

I have spamassassin running as a systemd service:
/system.slice/spamassassin.service
+-645 /usr/bin/perl -T -w /usr/bin/spamd -c -m5 -H --razor-log-file=sys-syslog
+-652 spamd child
+-653 spamd child

And the platform I am running on:

[foo@mailcollect ~]$ grep -e PRETTY_NAME /etc/os-release
PRETTY_NAME="Fedora 31 (Thirty One)"
[foo@mailcollect ~]$ rpm -qi spamassassin
Name : spamassassin
Version : 3.4.3
Release : 2.fc31
Architecture: x86_64

The problem is this: when I initially ran sa-learn on my mailboxes,
it was reasonably fast.

For example:

+ /usr/bin/sa-learn --no-sync --progress --ham /var/spool/fetchmail/Maildir/.Congstar
92% [================================= ] 5.21 msgs/sec 00m09s DONE
Learned tokens from 45 message(s) (49 message(s) examined)

This is just a small Maildir; I have other, much bigger ones (including
my spam Maildir, which contains 2000+ messages).

Now, if I run sa-learn again on the same folder (the manual says
"SpamAssassin remembers which mail messages it has learnt already, and
will not re-learn those messages again, unless you use the --forget
option.", so I think this is OK to do), it gets absurdly slow, taking
over 2 minutes for the same directory with 45 mails.

+ /usr/bin/sa-learn --no-sync --progress --ham /var/spool/fetchmail/Maildir/.Congstar
92% [============================= ] 0.30 msgs/sec 02m40s DONE
Learned tokens from 0 message(s) (49 message(s) examined)

Now imagine this for a folder with over 2k messages (of which I have
several).
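
As a rough sketch of the arithmetic, the two rates reported by --progress
above can be extrapolated to a larger folder. This assumes the rate stays
constant as the folder grows, which nothing above confirms; the function
name is made up for illustration.

```python
# Rates taken from the two sa-learn --progress lines quoted above.
FIRST_RUN_RATE = 5.21  # msgs/sec on the initial run
RERUN_RATE = 0.30      # msgs/sec when re-running over the same folder


def estimated_seconds(messages: int, msgs_per_sec: float) -> float:
    """Estimate the run time, assuming a constant processing rate."""
    return messages / msgs_per_sec


if __name__ == "__main__":
    for count in (49, 2000):
        first = estimated_seconds(count, FIRST_RUN_RATE)
        rerun = estimated_seconds(count, RERUN_RATE)
        print(f"{count} messages: first run ~{first:.0f}s, "
              f"re-run ~{rerun:.0f}s (~{rerun / 3600:.1f}h)")
```

For the 49 examined messages this gives a bit over 160 seconds on the
re-run, in line with the 02m40s shown above. At the same rate, a folder of
2000 messages would take well over an hour and a half.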

I am not sure why this is. I ran sa-learn with debugging enabled to see
whether anything stood out, and it looks like it spends ~3s on each
message updating the TxRep database (which I enabled in SpamAssassin
with "loadplugin Mail::SpamAssassin::Plugin::TxRep"):

Jan 8 23:49:50.745 [308] dbg: TxRep: forgetting a message
Jan 8 23:49:50.746 [308] dbg: auto-whitelist: db-based
ec300f7aa9c95003b94439831b843605e9a94660@sa_generated|ip=none scores 2/-40
Jan 8 23:49:50.746 [308] dbg: check: tagrun - tag TXREPMSG_ID is now
ready, value: -20.0
Jan 8 23:49:50.746 [308] dbg: TxRep: reputation: -20.000, count: 2,
weight: 1.0, delta: -20.000, MSG_ID:
ec300f7aa9c95003b94439831b843605e9a94660@sa_generated
Jan 8 23:49:52.202 [308] dbg: TxRep: forgetting stored score -20.000 of
message ec300f7aa9c95003b94439831b843605e9a94660@sa_generated
Jan 8 23:49:52.203 [308] dbg: TxRep: active,
ec300f7aa9c95003b94439831b843605e9a94660@sa_generated pre-score: ?,
autolearn score: -20, IP: 93.191.162.21, address:
noreply@congstarnews.de (unsigned)

Jan 8 23:49:52.209 [308] dbg: TxRep: reputation: none, count: 0,
learning: -20, MSG_ID: ec300f7aa9c95003b94439831b843605e9a94660@sa_generated
Jan 8 23:49:52.209 [308] dbg: auto-whitelist: add_score: new count: 1,
new totscore: 20
Jan 8 23:49:53.710 [308] dbg: auto-whitelist: DB addr list: untie-ing
and unlocking
Jan 8 23:49:53.715 [308] dbg: auto-whitelist: DB addr list: file
locked, breaking lock
Jan 8 23:49:53.716 [308] dbg: locker: safe_unlock: unlink
/var/spool/fetchmail/.spamassassin/tx-reputation.lock

Note the timestamps. This happens for each of the 49 messages,
whenever it wants to forget a score. That also explains why it was so
much faster initially: it didn't know the score yet and so had nothing
to forget. Adding the new score also seems to be slowish.
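
To make the gaps easier to spot, here is a minimal sketch that parses the
timestamps from the excerpt above and reports where consecutive debug lines
are far apart. The helper names and the 0.5s threshold are arbitrary choices
for illustration. Only the first line of each wrapped log entry is kept, and
the sketch assumes the excerpt does not cross midnight.

```python
import re
from datetime import datetime

# First line of each debug entry from the excerpt above (wrapped text dropped).
LOG = """\
Jan 8 23:49:50.745 [308] dbg: TxRep: forgetting a message
Jan 8 23:49:50.746 [308] dbg: auto-whitelist: db-based
Jan 8 23:49:50.746 [308] dbg: check: tagrun - tag TXREPMSG_ID is now
Jan 8 23:49:50.746 [308] dbg: TxRep: reputation: -20.000, count: 2,
Jan 8 23:49:52.202 [308] dbg: TxRep: forgetting stored score -20.000 of
Jan 8 23:49:52.203 [308] dbg: TxRep: active,
Jan 8 23:49:52.209 [308] dbg: TxRep: reputation: none, count: 0,
Jan 8 23:49:52.209 [308] dbg: auto-whitelist: add_score: new count: 1,
Jan 8 23:49:53.710 [308] dbg: auto-whitelist: DB addr list: untie-ing
Jan 8 23:49:53.715 [308] dbg: auto-whitelist: DB addr list: file
Jan 8 23:49:53.716 [308] dbg: locker: safe_unlock: unlink
"""

# Month, day, HH:MM:SS.mmm, [pid], "dbg:", then the message text.
STAMP = re.compile(
    r"^\w{3}\s+\d+\s+(\d{2}:\d{2}:\d{2}\.\d{3})\s+\[\d+\]\s+dbg:\s+(.*)$"
)


def parse(log):
    """Return (time, message) pairs for every line that starts with a timestamp."""
    entries = []
    for line in log.splitlines():
        match = STAMP.match(line)
        if match:
            when = datetime.strptime(match.group(1), "%H:%M:%S.%f")
            entries.append((when, match.group(2)))
    return entries


def gaps(entries, threshold=0.5):
    """Return (seconds, preceding message) for each gap of at least `threshold` seconds."""
    slow = []
    for (t0, msg0), (t1, _) in zip(entries, entries[1:]):
        delta = (t1 - t0).total_seconds()
        if delta >= threshold:
            slow.append((delta, msg0))
    return slow


if __name__ == "__main__":
    for delta, message in gaps(parse(LOG)):
        print(f"{delta:.3f}s after: {message}")
```

On this excerpt the sketch flags two gaps of about 1.5s each: one right
after the "TxRep: reputation" line that precedes forgetting the stored
score, and one right after "auto-whitelist: add_score". Together they
account for the ~3s per message.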

What's going on here? These are the file sizes of my databases:

[mageta@mailcollect ~]$ ls -lh .spamassassin/
total 45M
-rw------- 1 mageta mail 61K Jan 9 00:23 bayes_journal
-rw------- 1 mageta mail 4.7M Jan 8 23:39 bayes_seen
-rw------- 1 mageta mail 41M Jan 8 23:39 bayes_toks
-rw------- 1 mageta mail 11M Jan 9 00:23 tx-reputation
-rw------- 1 mageta mail 4 Jan 9 00:23 tx-reputation.mutex
-rw-r--r-- 1 mageta mail 2.7K Jan 9 00:09 user_prefs
[mageta@mailcollect ~]$ file .spamassassin/tx-reputation
.spamassassin/tx-reputation: Berkeley DB (Hash, version 9, native
byte-order)
[mageta@mailcollect ~]$ file .spamassassin/bayes_toks
.spamassassin/bayes_toks: Berkeley DB (Hash, version 9, native byte-order)

Any ideas? Can I fix this somehow? Should I file a bug report? This
makes sa-learn pretty much unusable for me at the moment. I have let it
run once over everything I have, so I should be good for now (which is
great!), but rerunning it will now tie up one CPU on my server for _hours_.

best regards,
- Benjamin