http://bugzilla.spamassassin.org/show_bug.cgi?id=3173

------- Additional Comments From jm@jmason.org 2004-03-17 18:52 -------
OK -- urgh. It looks like the prior checkin was more accurate partly because it
ignored invisible tokens, and partly because of a bug in how the new
token-decomposition code worked!
(The bug was that the decomposition ran before very long tokens were shortened
to "skip" tokens, so it wound up generating long decomposed tokens. See the end
of this mail for discussion of *those* results.)
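A rough sketch of that ordering problem, in Python rather than the real code. Everything here is an assumption for illustration: the length cutoff, the "sk:" skip form, and the lowercasing stand-in for decomposition are hypothetical, not taken from the actual tokenizer.

```python
MAX_TOKEN_LENGTH = 15  # assumed cutoff, chosen only for illustration


def to_skip(token):
    # Hypothetical skip form: a marker plus a short prefix of the long token.
    return "sk:" + token[:7]


def decompose(token):
    # Stand-in decomposition: the token itself plus a lowercased variant.
    out = [token]
    if token.lower() != token:
        out.append(token.lower())
    return out


def tokenize_buggy(words):
    tokens = []
    for word in words:
        variants = decompose(word)  # runs on the raw, possibly very long word
        head = to_skip(word) if len(word) > MAX_TOKEN_LENGTH else word
        tokens.append(head)
        tokens.extend(variants[1:])  # long decomposed variants slip through
    return tokens


def tokenize_fixed(words):
    tokens = []
    for word in words:
        if len(word) > MAX_TOKEN_LENGTH:
            tokens.append(to_skip(word))  # shorten first, decompose nothing long
        else:
            tokens.extend(decompose(word))
    return tokens


words = ["Viagra", "XJ9QKZPLMWRTYUIOPASDFGH"]
print(tokenize_buggy(words))
print(tokenize_fixed(words))
```

With the buggy order the long hashbuster still leaves a full-length decomposed token behind; with the fixed order only its skip token survives.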
So accurate figures for invisible-text treatment are:

  invisnone: SUMMARY: 0.30/0.70  fp 2  fn 270  uh 273  us 3872  c 704.50
  invis1:    SUMMARY: 0.30/0.70  fp 2  fn 270  uh 274  us 3861  c 703.50
  invis2:    SUMMARY: 0.30/0.70  fp 2  fn 261  uh 284  us 3840  c 693.40

namely,

  invisnone: do not add invisible tokens at all
  invis1:    add with an "I*" prefix to keep in a separate namespace
  invis2:    add, with no prefix
how's that for unexpected. ;)
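For anyone who wants the three treatments spelled out, here's a minimal sketch. The function name, the argument layout, and the exact way the "I*" prefix is glued onto the token are assumptions for illustration, not the real implementation.

```python
def bayes_tokens(visible, invisible, mode):
    """Return the tokens handed to the Bayes db under one treatment."""
    if mode == "invisnone":
        # Invisible tokens are dropped entirely.
        return list(visible)
    if mode == "invis1":
        # Invisible tokens are kept, but in their own "I*" namespace.
        return list(visible) + ["I*" + t for t in invisible]
    if mode == "invis2":
        # Invisible tokens are mixed in with the visible ones.
        return list(visible) + list(invisible)
    raise ValueError("unknown mode: %r" % (mode,))


visible = ["cheap", "meds"]
invisible = ["meds", "weather"]
for mode in ("invisnone", "invis1", "invis2"):
    print(mode, bayes_tokens(visible, invisible, mode))
```

Note that under invis2 a hidden "meds" counts toward the same db entry as a visible "meds", while under invis1 it is learned separately as "I*meds".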
Inspecting the Bayes dbs, it looks like everything's working as it should; the
"I*" tokens in invis1 really are the ones found in bayes-poison blocks. It
really does seem that (at least on this corpus) allowing the invisible
bayes-pollution tokens to pollute the db actually INCREASES accuracy. Very odd.
I would still prefer to keep the invis tokens separate from the real ones -- and
given David's point that it's entirely possible to make tokens that look
invisible enough to fool a filter but are still visible in a MUA, I'd prefer not
to just throw them out invisnone-style, since that'll be exploited. So I'd
prefer invis1. But the invis2 result is surprising.
Possible reason: as David noted in bug 2282, a lot of recent spam does *not* use
invis blocks; it just leaves the text fully visible. That could be why...
db sizes:
2889847 testset7/invisnone/results/bucket1/bayes_db.dump
2917427 testset7/invis2/results/bucket1/bayes_db.dump
2929940 testset7/invis1/results/bucket1/bayes_db.dump
from 260 spams, 451 hams.
Anyway, I'll check the code into SVN in "invisnone" mode now, since it also
fixes a couple of other bugs.
... On a separate issue: here are some ROUGH figures for skip and noskip:

  skip:   SUMMARY: 0.30/0.70  fp 2  fn 261  uh 284  us 3840  c 693.40
  noskip: SUMMARY: 0.30/0.70  fp 2  fn 260  uh 267  us 3825  c 689.20

Rough because the way that was implemented is slightly buggy, being the
side-effect of a bug anyway. ;) And DB sizes:
noskip: 2956604 testset5/checkin/results/bucket1/bayes_db.dump
skip: 2917427 testset7/invis2/results/bucket1/bayes_db.dump
I think it's best to stick with "skip" tokens to save that 40k or so on-disk,
since noskip only clears up 1 FN. Bear in mind that's 40k of skipped data
(long hashbusters, msgids, etc.) from only 260 spams and 451 hams. Thoughts?
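To put numbers on that trade-off, here's a small sketch that parses the two SUMMARY lines and the two dump sizes quoted above. The cost weights (10 per fp, 1 per fn, 0.1 per unsure) are an assumption inferred from the figures in this comment, which they happen to reproduce; they aren't stated anywhere in it. The helper names are made up for illustration.

```python
import re

# Assumed weights, inferred from the figures above (not stated in the comment).
FP_WEIGHT, FN_WEIGHT, UNSURE_WEIGHT = 10.0, 1.0, 0.1


def parse_summary(line):
    """Pull the fp/fn/uh/us/c fields out of a SUMMARY line as floats."""
    fields = re.findall(r"\b(fp|fn|uh|us|c)\s+([\d.]+)", line)
    return {name: float(value) for name, value in fields}


def cost(r):
    """Recompute the 'c' column under the assumed weights."""
    return (FP_WEIGHT * r["fp"] + FN_WEIGHT * r["fn"]
            + UNSURE_WEIGHT * (r["uh"] + r["us"]))


skip = parse_summary("SUMMARY: 0.30/0.70 fp 2 fn 261 uh 284 us 3840 c 693.40")
noskip = parse_summary("SUMMARY: 0.30/0.70 fp 2 fn 260 uh 267 us 3825 c 689.20")

for name, r in (("skip", skip), ("noskip", noskip)):
    print(name, "reported", r["c"], "recomputed", round(cost(r), 2))

# Dump sizes keyed by test run, in whatever unit the listing above uses.
dump_sizes = {"testset5/checkin": 2956604, "testset7/invis2": 2917427}
size_gap = abs(dump_sizes["testset5/checkin"] - dump_sizes["testset7/invis2"])

print("FN difference:", int(skip["fn"] - noskip["fn"]))
print("dump size difference:", size_gap)
```

The size difference works out to 39177, which is the "40k or so" being weighed against a single FN.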