Mailing List Archive: [Bug 7943] New: TxRep gives nonsensical scores?

https://bz.apache.org/SpamAssassin/show_bug.cgi?id=7943

Bug ID: 7943
Summary: TxRep gives nonsensical scores?
Product: Spamassassin
Version: 3.4.6
Hardware: PC
OS: Linux
Status: NEW
Severity: normal
Priority: P2
Component: Learner
Assignee: dev@spamassassin.apache.org
Reporter: mnalis-sabug@voyager.hr
Target Milestone: Undefined

TxRep seems to return nonsensical scores. I'm using a MySQL table, if it
matters (the DB files became unusable for me long ago due to heavy locking and
timeouts).

I've finally taken some time to debug it. The first issue was that 3.4.6 was
generating many identical MSGID tokens
("da39a3ee5e6b4b0d3255bfef95601890afd80709@sa_generated" had count>10 within a
few minutes), which would then get reused by both ham and spam because "that
mail was already seen".

(I've partially tracked that problem down to how the SHA-1 hash for
"xxxxxx@sa_generated" is created in 3.4.6: TxRep was using
"Mail::SpamAssassin::Plugin::Bayes->get_msgid()", which seems to be
case-sensitive and only works for one capitalization of "Message-Id";
otherwise it tries to fall back to a hash of the date/body, but...)
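
A side observation on the repeated token: da39a3ee5e6b4b0d3255bfef95601890afd80709
is the SHA-1 digest of an empty input. That would be consistent with the
fallback ending up hashing an empty string, but this interpretation is an
assumption and has not been confirmed against the TxRep or Bayes source. The
short Python check below only verifies the digest itself:

```python
import hashlib

# Token reported above, without the "@sa_generated" suffix.
reported = "da39a3ee5e6b4b0d3255bfef95601890afd80709"

# SHA-1 digest of zero bytes of input.
empty_digest = hashlib.sha1(b"").hexdigest()

# True if the repeated token is just the hash of "nothing".
print(empty_digest == reported)
```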

Anyway, I saw that SVN trunk has changed that part of the code, so I simply
disabled MSGID tokens with "txrep_track_messages 0" and truncated the txrep
table, hoping that would solve the issue. It did not: it still returned
strange results (spammy scores for ham, etc.)

I then tried the SVN trunk version of TxRep.pm, with no luck (it still gave
wrong results, and I had to copy the new generate_msgid() to make it work).

I then nuked the txrep table, added some debug output, and started feeding one
clearly ham e-mail through "spamassassin -L -t" several times. This is how the
MySQL table looked after each of the first 5 runs (I'm only focusing on the
EMAILIP tag here, but the other tags show the same problem):

    +----------+---------------+------+----------+----------+----------+---------------------+
    | username | email         | ip   | msgcount | totscore | signedby | last_hit            |
    +----------+---------------+------+----------+----------+----------+---------------------+
1st | amavis   | hepi@hep.hr   | none |        1 |   -10.21 | spf      | 2021-11-12 03:07:03 |
2nd | amavis   | hepi@hep.hr   | none |        2 |   -10.21 | spf      | 2021-11-12 03:09:27 |
3rd | amavis   | hepi@hep.hr   | none |        3 |   -10.21 | spf      | 2021-11-12 03:10:24 |
4th | amavis   | hepi@hep.hr   | none |        4 |   -10.21 | spf      | 2021-11-12 03:11:17 |
5th | amavis   | hepi@hep.hr   | none |        5 |   -10.21 | spf      | 2021-11-12 03:12:54 |
    +----------+---------------+------+----------+----------+----------+---------------------+

I've added the following debug line just after:
$delta = ($self->total() + $msgscore) / (1 + $self->count()) - $msgscore;

dbg("TxRep: mn %s _formula delta = (total()=%0.3f + msgscore=%0.3f) / (1 + count()=%0.3f) - msgscore=%0.3f = %0.3f",
    $tag_id, $self->total(), $msgscore, $self->count(), $msgscore, $delta);

And this is what it printed for those first 5 runs:
dbg: TxRep: mn EMAILIP _formula delta = (total()=0.000 + msgscore=-10.210) / (1 + count()=0.000) - msgscore=-10.210 = 0.000
dbg: TxRep: mn EMAILIP _formula delta = (total()=-10.210 + msgscore=-10.210) / (1 + count()=1.000) - msgscore=-10.210 = 0.000
dbg: TxRep: mn EMAILIP _formula delta = (total()=-10.210 + msgscore=-10.210) / (1 + count()=2.000) - msgscore=-10.210 = 3.403
dbg: TxRep: mn EMAILIP _formula delta = (total()=-10.210 + msgscore=-10.210) / (1 + count()=3.000) - msgscore=-10.210 = 5.105
dbg: TxRep: mn EMAILIP _formula delta = (total()=-10.210 + msgscore=-10.210) / (1 + count()=4.000) - msgscore=-10.210 = 6.126
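
The logged values can be checked by re-evaluating the same expression outside
SpamAssassin. This is a minimal Python sketch: the function name txrep_delta
is made up for illustration and is not part of TxRep, and the inputs are
copied from the debug lines above.

```python
def txrep_delta(total, count, msgscore):
    """Same expression as the Perl line: (total + msgscore) / (1 + count) - msgscore."""
    return (total + msgscore) / (1 + count) - msgscore

# (total(), count()) pairs as logged for the five runs; msgscore is -10.21 each time.
runs = [(0.0, 0), (-10.21, 1), (-10.21, 2), (-10.21, 3), (-10.21, 4)]
deltas = [round(txrep_delta(t, c, -10.21), 3) for t, c in runs]
print(deltas)
```

Note that total() stays at -10.210 from the second run on while count() keeps
growing, so the averaged term shrinks toward zero and delta drifts toward
+10.21.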

This looks wrong. I started with a TXREP score of 0, and after receiving 5 HAM
messages from that sender, TXREP now returns a high positive SPAM score:
3.1 TXREP TXREP: Score normalizing based on sender's reputation

The more HAM I feed it, the higher the SPAM score gets.

I'm thinking $delta is supposed to get slightly more negative with each HAM
that passes through, or at least remain the same, and definitely not start
classifying the email as SPAM. Is my assumption correct? Any idea how $delta
calculation should actually work here?
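
For comparison, here is a hypothetical Python simulation of the same formula
in two scenarios: the stored total stays at the first message's score (which
is what the table above shows), and the stored total grows by msgscore on
every run. The second scenario is an assumption about the intended behaviour,
not something confirmed from the TxRep source, and the helper simulate() is
made up for this illustration.

```python
def simulate(msgscore, runs, accumulate):
    """Return the delta computed on each run for a sender who always scores msgscore."""
    total, count, deltas = 0.0, 0, []
    for _ in range(runs):
        deltas.append(round((total + msgscore) / (1 + count) - msgscore, 3))
        # Assumed storage step: count always increments; total either
        # accumulates every score or keeps only the first one.
        if accumulate or count == 0:
            total += msgscore
        count += 1
    return deltas

stuck = simulate(-10.21, 5, accumulate=False)
accumulated = simulate(-10.21, 5, accumulate=True)
print(stuck)
print(accumulated)
```

Under these assumptions, an accumulating total keeps delta at (numerically)
zero for a sender whose messages always score the same, while a stuck total
makes delta climb toward -msgscore, matching the debug output.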
