[Bug 3173] Bayes can be circumvented by faking invisible or near-invisible text.
http://bugzilla.spamassassin.org/show_bug.cgi?id=3173





------- Additional Comments From jm@jmason.org 2004-03-17 09:14 -------
yeah, I went back to first principles and have been verifying that the tokens
are being treated correctly this time around. Not sure what the situation was
before.



------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.





------- Additional Comments From jm@jmason.org 2004-03-17 18:52 -------
ok -- urgh. it looks like the prior checkin's better accuracy was partly due to
ignoring invisible tokens, and partly due to a bug in how the new
token-decomposition code worked!

(the bug was that the decomposition ran before very long tokens were shortened
to "skip" tokens, so it wound up generating long decomposed tokens. see end of
mail for discussion on *those* results.)
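To make the ordering problem concrete, here is a minimal sketch. The 15-character cutoff, the "sk:" format and the decompose() helper are all assumptions for illustration; the thread does not show the real code.

```python
MAX_TOKEN_LENGTH = 15  # assumed cutoff; the thread does not give the real value


def shorten(token):
    """Collapse an over-long token into a short "skip" token (assumed "sk:" format)."""
    if len(token) > MAX_TOKEN_LENGTH:
        return "sk:" + token[:7]
    return token


def decompose(token):
    """Hypothetical decomposition: the token plus a punctuation-free variant."""
    stripped = "".join(ch for ch in token if ch.isalnum())
    if stripped and stripped != token:
        return [token, stripped]
    return [token]


def tokens_buggy(token):
    """Decompose first: the extra variants never pass through shorten()."""
    variants = decompose(token)
    return [shorten(variants[0])] + variants[1:]


def tokens_fixed(token):
    """Shorten first: a skip token is emitted alone and never decomposed."""
    short = shorten(token)
    if short != token:
        return [short]
    return decompose(token)


hashbuster = "x9f3-a71c-77e0-b2d4-zzq"  # made-up long token
print(tokens_buggy(hashbuster))  # still contains a long decomposed token
print(tokens_fixed(hashbuster))  # only the short skip token
```

Short tokens come out the same either way; only the over-long ones differ, which is why the bug looked like a "noskip" mode.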

So accurate figures for invisible-text treatment are:

invisnone: SUMMARY: 0.30/0.70 fp 2 fn 270 uh 273 us 3872 c 704.50
invis1: SUMMARY: 0.30/0.70 fp 2 fn 270 uh 274 us 3861 c 703.50
invis2: SUMMARY: 0.30/0.70 fp 2 fn 261 uh 284 us 3840 c 693.40

namely,

invisnone: do not add invisible tokens at all
invis1: add with an "I*" prefix to keep in a separate namespace
invis2: add, with no prefix
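The three modes can be sketched roughly like this. The function name and the idea that the tokenizer hands over visible and invisible tokens as two separate lists are assumptions, not the actual code; only the "I*" prefix comes from the description above.

```python
def bayes_tokens(visible, invisible, mode):
    """Return the tokens that would be learned/scored under each mode."""
    if mode == "invisnone":
        # invisible text is dropped entirely
        return list(visible)
    if mode == "invis1":
        # invisible text is kept, but in its own "I*" namespace
        return list(visible) + ["I*" + tok for tok in invisible]
    if mode == "invis2":
        # invisible text is treated exactly like visible text
        return list(visible) + list(invisible)
    raise ValueError("unknown mode: %r" % (mode,))


visible = ["cheap", "pills"]
invisible = ["meeting", "agenda"]  # e.g. words hidden in a poison block
for mode in ("invisnone", "invis1", "invis2"):
    print(mode, bayes_tokens(visible, invisible, mode))
```

With invis1, a hidden "meeting" becomes "I*meeting" and cannot borrow the hammy probability of the visible token "meeting".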

how's that for unexpected. ;)

Inspecting the Bayes dbs, it looks like everything's working as it should; the "I*"
tokens in invis1 really are the ones found in bayes-poison blocks. it really
does seem that (at least on this corpus) allowing the invisible, bayes-pollution
tokens into the db actually INCREASES accuracy. Very odd.
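For anyone comparing the SUMMARY lines above: the "c" column appears to be a weighted cost, 10 per fp, 1 per fn and 0.1 per unsure message. Those weights are inferred from the figures, not stated in this thread, and reading "uh"/"us" as unsure ham/unsure spam is likewise an assumption. A small sketch that parses the lines and checks the inferred formula:

```python
import re

SUMMARY_RE = re.compile(
    r"fp\s+(\d+)\s+fn\s+(\d+)\s+uh\s+(\d+)\s+us\s+(\d+)\s+c\s+([\d.]+)"
)


def parse_summary(line):
    """Pull the counts and the reported cost out of one SUMMARY line."""
    m = SUMMARY_RE.search(line)
    if m is None:
        raise ValueError("not a SUMMARY line: %r" % (line,))
    fp, fn, uh, us = (int(g) for g in m.groups()[:4])
    return {"fp": fp, "fn": fn, "uh": uh, "us": us, "c": float(m.group(5))}


def inferred_cost(r):
    """Inferred weights: 10*fp + fn + 0.1*(uh + us), computed in tenths."""
    tenths = 100 * r["fp"] + 10 * r["fn"] + r["uh"] + r["us"]
    return tenths / 10


LINES = {
    "invisnone": "SUMMARY: 0.30/0.70 fp 2 fn 270 uh 273 us 3872 c 704.50",
    "invis1": "SUMMARY: 0.30/0.70 fp 2 fn 270 uh 274 us 3861 c 703.50",
    "invis2": "SUMMARY: 0.30/0.70 fp 2 fn 261 uh 284 us 3840 c 693.40",
}

for name, line in LINES.items():
    r = parse_summary(line)
    print(name, r["c"], inferred_cost(r))
```

Read that way, invis2 wins mostly through 9 fewer FNs (261 vs 270), with a small extra gain from fewer unsures.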

I would still prefer to keep the inviz tokens separate from the real ones -- and
given David's point that it's entirely possible to make tokens that look invisible
enough to fool a filter but are still visible in a MUA, I'd prefer not to just
throw them out invisnone style, since that'll be exploited. So I'd prefer
invis1. But invis2 is surprising.

Possible reason: as David noted in bug 2282, a lot of recent spam does *not* use
invis blocks, it just leaves the text fully visible. could be why...

db sizes:

2889847 testset7/invisnone/results/bucket1/bayes_db.dump
2917427 testset7/invis2/results/bucket1/bayes_db.dump
2929940 testset7/invis1/results/bucket1/bayes_db.dump

from 260 spams, 451 hams.

Anyway, I'll check the code into SVN in "invisnone" mode now, since it fixes
a couple of other bugs anyway.


... On a separate issue. here are some ROUGH figures for skip and non-skip:

skip: SUMMARY: 0.30/0.70 fp 2 fn 261 uh 284 us 3840 c 693.40
noskip: SUMMARY: 0.30/0.70 fp 2 fn 260 uh 267 us 3825 c 689.20

rough because it's slightly buggy how that was implemented, being the
side-effect of a bug anyway. ;) and DB sizes:

noskip: 2956604 testset5/checkin/results/bucket1/bayes_db.dump
skip: 2917427 testset7/invis2/results/bucket1/bayes_db.dump

I think it's best to stick with "skip" tokens to save that 40k or so on-disk,
since it only clears up 1 FN otherwise. bear in mind that's 40k of skipped data
(long hashbusters, msgids, etc.) from only 260 spams and 451 hams. thoughts?
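The "40k or so" is just the difference between the two dump sizes listed above. A quick check of the arithmetic, assuming the sizes are in bytes (the per-message figure is an added back-of-envelope number, not something measured here):

```python
DB_CHECKIN = 2956604  # testset5/checkin/results/bucket1/bayes_db.dump
DB_INVIS2 = 2917427   # testset7/invis2/results/bucket1/bayes_db.dump
MESSAGES = 260 + 451  # spams + hams in the training set

difference = DB_CHECKIN - DB_INVIS2
per_message = difference / MESSAGES

print(difference)             # size gap between the two dumps
print(round(per_message, 1))  # average gap per trained message
```

That works out to 39177, or roughly 55 per trained message on this corpus.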







jm@jmason.org changed:

What                    |Removed |Added
----------------------------------------------------------------------------
OtherBugsDependingOnThis|        |2282






jm@jmason.org changed:

What                    |Removed |Added
----------------------------------------------------------------------------
OtherBugsDependingOnThis|        |2892






jm@jmason.org changed:

What                    |Removed |Added
----------------------------------------------------------------------------
OtherBugsDependingOnThis|        |2423










------- Additional Comments From quinlan@pathname.com 2004-03-17 19:17 -------
I agree about invis1 appearing to be the best option.

One note: it seems somewhat wasteful that we're not doing a Huffman encoding
of our prefixes.
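As a rough illustration of the Huffman idea: the more common a token class is, the shorter its prefix code should be. The class names and counts below are made up for the example, not measured from any Bayes db, and this only computes code lengths rather than actual prefix strings.

```python
import heapq
from itertools import count


def huffman_code_lengths(frequencies):
    """Return {symbol: code length in bits} for a Huffman code."""
    if len(frequencies) == 1:
        return {symbol: 1 for symbol in frequencies}
    tiebreak = count()  # keeps heap entries comparable when weights tie
    heap = [(weight, next(tiebreak), [symbol])
            for symbol, weight in frequencies.items()]
    heapq.heapify(heap)
    lengths = dict.fromkeys(frequencies, 0)
    while len(heap) > 1:
        w1, _, s1 = heapq.heappop(heap)
        w2, _, s2 = heapq.heappop(heap)
        # every symbol under the merged node gets one bit deeper
        for symbol in s1 + s2:
            lengths[symbol] += 1
        heapq.heappush(heap, (w1 + w2, next(tiebreak), s1 + s2))
    return lengths


# hypothetical token-class counts
classes = {"body": 5000, "header": 2000, "invisible": 300, "skip": 200}
print(huffman_code_lengths(classes))
```

With counts like these the plain body tokens get the shortest code and the rare invisible and skip classes the longest.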









------- Additional Comments From koppel@ece.lsu.edu 2004-03-18 06:28 -------
It looks like my fears about an increase in fp when invisible tokens
were marked (comment 18) were unfounded, at least based on these
results.

I'd still prefer not marking them (treating them the same as visible
regions), but marking them is not nearly as dangerous (IMO) as
ignoring invisible regions outright.

I've written some code that detects possible poisoning by looking at
the number of new tokens. Originally it was only to look in invisible
regions but given how rare they are I modified it to look at the whole
message. (It counts the number of new body tokens that are not part of
URLs and compares that to the number of previously seen tokens of any
type.) There is still some tuning to do; when it's ready I'll file it
under a new bug.
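A minimal sketch of the counting described above, assuming each message arrives as (token, kind) pairs and the Bayes db is just a set of known tokens. The function name, the kind labels and the threshold of 2.0 are placeholders, not the real code or its tuning.

```python
def looks_poisoned(message_tokens, known_tokens, max_ratio=2.0):
    """Flag a message whose new non-URL body tokens swamp its known tokens.

    message_tokens: iterable of (token, kind), kind in {"body", "url", "header"}
    known_tokens:   set of tokens already present in the db
    max_ratio:      placeholder threshold, would need tuning
    """
    new_body = 0   # body tokens (outside URLs) never seen before
    seen_any = 0   # previously seen tokens of any type
    for token, kind in message_tokens:
        if token in known_tokens:
            seen_any += 1
        elif kind == "body":
            new_body += 1
    # max(..., 1) avoids dividing by zero for a message with no known tokens
    return new_body / max(seen_any, 1) > max_ratio


known = {"meeting", "tomorrow", "from:alice", "example.com"}
normal = [("meeting", "body"), ("tomorrow", "body"), ("from:alice", "header"),
          ("xqzv", "body"), ("example.com", "url")]
stuffed = [("meeting", "body")] + [("zz%d" % i, "body") for i in range(10)]

print(looks_poisoned(normal, known))
print(looks_poisoned(stuffed, known))
```

New tokens inside URLs are deliberately left out of the count, as in the description.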





jm@jmason.org changed:

What      |Removed |Added
----------------------------------------------------------------------------
Status    |NEW     |RESOLVED
Resolution|        |FIXED



------- Additional Comments From jm@jmason.org 2004-03-23 21:48 -------
ok, this is now closed.


