Mailing List Archive: [Bug 3173] Bayes can be circumvented by faking invisible or near-invisible text.

[Bug 3173] Bayes can be circumvented by faking invisible or near-invisible text.

bugzilla-daemon at bugzilla

Mar 14, 2004, 2:05 PM

Post #1 of 33 (461 views)

http://bugzilla.spamassassin.org/show_bug.cgi?id=3173

------- Additional Comments From sidney@sidney.com 2004-03-14 13:05 -------
Ok, now there's an appropriate place to talk about this :-)

If it is possible to craft such a message, then our code is identifying text as
invisible when it is not invisible. That would be a bug in our code, which can
be fixed. The correct approach is to attach such a message to a bug report.

I would not consider any solution acceptable if it allows a spammer to create a
message with 20,000 unique random 4-letter combinations that would be processed
by Bayes and not visible in a mail reader, unless someone comes up with a way
for that not to be a DoS attack on SpamAssassin with Bayes. That doesn't mean do
nothing to fix a problem, but it is a security issue that cannot be ignored.

I don't see introducing a vulnerability in order to fix a problem that has not
been demonstrated. Where is this message that is labeled invisible but isn't and
for which there is no fix in the invisibility detector code? If there is no such
example after some time, I'll be closing this bug as a WONTFIX. Of course if I
do that and an example shows up in the future, I would be happy to see this
reopened.

------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.

[Bug 3173] Bayes can be circumvented by faking invisible or near-invisible text. [ In reply to ]

bugzilla-daemon at bugzilla

Mar 14, 2004, 2:53 PM

Post #2 of 33 (460 views)

http://bugzilla.spamassassin.org/show_bug.cgi?id=3173

------- Additional Comments From koppel@ece.lsu.edu 2004-03-14 13:53 -------
> If it is possible to craft such a message, then our code is
> identifying text as invisible when it is not invisible. That would
> be a bug in our code, which can be fixed. The correct approach is to
> attach such a message to a bug report.

A lot of time can go by between when such messages appear and when users
download the fixed releases of SA. Why introduce this weakness into
BC without considering the alternatives?

To protect against random word strings slowing down SA one might limit
the number of tokens processed by BC, or at least processed by BC
within invisible regions. The former would protect against DoS
attacks that r9460 would not.
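A minimal sketch of the token-limit idea, assuming the message has already been split into visible and invisible token lists. The function name and both limits are hypothetical and are not SpamAssassin code or settings.

```python
def cap_tokens(visible, invisible, max_total=5000, max_invisible=500):
    """Return the unique tokens to hand to the classifier, capped in size.

    Both limits are illustrative values, not SpamAssassin defaults.
    """
    kept = []
    seen = set()
    # Visible tokens go first so hidden filler cannot crowd them out.
    for tok in visible:
        if tok not in seen and len(kept) < max_total:
            seen.add(tok)
            kept.append(tok)
    # Invisible tokens get a separate, smaller budget.
    budget = min(max_invisible, max_total - len(kept))
    for tok in invisible:
        if budget <= 0:
            break
        if tok not in seen:
            seen.add(tok)
            kept.append(tok)
            budget -= 1
    return kept
```

With a cap like this, a message carrying 20,000 hidden random strings costs at most max_invisible extra database lookups, whatever the invisibility detector decides.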

Specially marking tokens in invisible regions might also improve
classification accuracy.

The reduction in FNs cited in Bug 2129 is impressive. I'd like to
take a closer look at what's going on. For example, it appears that
BC does look at tokens in invisible regions for scoring a message, it
just doesn't learn them. (If that's true I'd call it a bug.) Another
thing I'd like to know is how the BC was trained before starting the
testing used to get the results. I'd appreciate it if anyone could
enlighten me before I take a look.

> I don't see introducing a vulnerability in order to fix a problem
> that has not been demonstrated. Where is this message that is
> labeled invisible but isn't and for which there is no fix in the
> invisibility detector code? If there is no such example after some
> time, I'll be closing this bug as a WONTFIX.

We'll have to wait until r9460 is released, only then would spammers
try to take advantage of it. (Or did you want me to come up with one
on my own?)

To summarize, r9460 opens a vulnerability in BC in order to prevent a
DoS attack that could be mounted anyway (by making the random strings
visible).

[Bug 3173] Bayes can be circumvented by faking invisible or near-invisible text. [ In reply to ]

bugzilla-daemon at bugzilla

Mar 14, 2004, 3:10 PM

Post #3 of 33 (460 views)

http://bugzilla.spamassassin.org/show_bug.cgi?id=3173

------- Additional Comments From jm@jmason.org 2004-03-14 14:10 -------
Subject: Re: Bayes can be circumvented by faking invisible or near-invisible text.

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

> The reduction in FNs cited in Bug 2129 is impressive. I'd like to
> take a closer look at what's going on. For example, it appears that
> BC does look at tokens in invisible regions for scoring a message, it
> just doesn't learn them. (If that's true I'd call it a bug.)

??? shouldn't be the case.

> Another
> thing I'd like to know is how the BC was trained before starting the
> testing used to get the results. I'd appreciate it if anyone could
> enlighten me before I take a look.

See "masses/bayes-testing/README". it's a 10-fold cross-validation
run, the std for testing trained classifiers.

David, I'd be happy to test some code using this method on the same
corpus, if you'd care to come up with a patch against current SVN.

(I'm thinking there *may* be a case where this will be useful in
future.)

- --j.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFAVNhhQTcbUG5Y7woRAoNHAJ4wPQ7EMka31+Ewxfbb06h2mw1oygCfVLNK
+bLfYFa3LQBG55014wpj6T0=
=ouq6
-----END PGP SIGNATURE-----

[Bug 3173] Bayes can be circumvented by faking invisible or near-invisible text. [ In reply to ]

bugzilla-daemon at bugzilla

Mar 14, 2004, 3:43 PM

Post #4 of 33 (460 views)

http://bugzilla.spamassassin.org/show_bug.cgi?id=3173

------- Additional Comments From sidney@sidney.com 2004-03-14 14:43 -------
We are dealing with two possible attacks. One we know how to do now, loading
spam with invisible text that SpamAssassin will process in a way that overloads
SpamAssassin, effectively crashing it. The second we don't know how to do,
loading spam with visible text that SpamAssassin thinks is invisible to create
false negatives. All I am saying is that we must not create a vulnerability to a
known crash attack just to protect against the unproven possibility of false
negatives.

Setting a limit on the number of tokens that we process would make us vulnerable
to the Bayes poisoning that we have been seeing. So far they have not worked
because we score based on the 15 most significant tokens. If we have to throw
away tokens we can't find the most significant 15.
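A rough sketch of why a hard token limit conflicts with scoring on the most significant tokens. It assumes "significance" means distance of a token's spam probability from the neutral 0.5. The count is left as a parameter n, and the function is illustrative, not SpamAssassin's implementation.

```python
def most_significant(token_probs, n):
    """Return the n (token, probability) pairs farthest from neutral 0.5."""
    ranked = sorted(token_probs.items(),
                    key=lambda kv: abs(kv[1] - 0.5),
                    reverse=True)
    return ranked[:n]


# Every token has to be looked up before the ranking can be done, so
# discarding tokens up front may discard the decisive ones.
probs = {"the": 0.5, "viagra": 0.99, "meeting": 0.02, "and": 0.48}
top = most_significant(probs, 2)
```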

You are correct that a spammer could put 20,000 tokens in visible text. If they
go so far as to have a three line ad for v*ag*a followed by two thousand lines
of gibberish, we would have to do something about that. Perhaps exceeding some
limit of unique tokens in one message could suppress Bayes and trigger another
rule. Yes, that would also work to prevent the DoS if the random words were in
invisible text, but doing so would give spammers a way to turn off Bayes
processing with no visible effect on the spam. That can't be a good thing.
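The "suppress Bayes and trigger another rule" idea could look roughly like the sketch below. The rule name TOO_MANY_UNIQUE_TOKENS, the limit, and the classify callback are all hypothetical.

```python
def bayes_or_flood_rule(tokens, classify, max_unique=10000):
    """Skip Bayes and fire a separate rule when a message has too many unique tokens.

    classify is any callable that maps a set of tokens to a spam probability.
    """
    unique = set(tokens)
    if len(unique) > max_unique:
        # Bayes is switched off for this message, which is exactly the
        # behaviour a spammer could provoke on purpose with hidden filler.
        return {"bayes_score": None, "rules": ["TOO_MANY_UNIQUE_TOKENS"]}
    return {"bayes_score": classify(unique), "rules": []}
```

The tradeoff described above is visible in the first branch: the same limit that stops the DoS also lets the sender decide whether Bayes runs at all.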

So, yes, given the tradeoffs, I would like you to come up with examples on your
own, so we can pre-empt the efforts of the spammers. If this bug report remains
purely theoretical for too long, I will close it, subject to being reopened if
and when someone can come up with an example. I would not approve of the simple
solution of I*tokens without an included patch for DoS protection.

Ok, I've made my points. I'll shut up now until I see either some code or an
example of the problem.

> Another thing I'd like to know is how the BC was trained
> before starting the testing used to get the results.

Justin talked about how he tested in comments 6 and 7 in bug #2129. I interpret
that as saying that he used 10-fold cross-validation. That repeats training on
samples of the corpus and testing on the remainder.

[Bug 3173] Bayes can be circumvented by faking invisible or near-invisible text. [ In reply to ]

bugzilla-daemon at bugzilla

Mar 14, 2004, 4:38 PM

Post #5 of 33 (460 views)

http://bugzilla.spamassassin.org/show_bug.cgi?id=3173

------- Additional Comments From koppel@ece.lsu.edu 2004-03-14 15:38 -------
Justin, thanks for the pointer to the BC testing methodology, I'll
have a look.

When I get a chance I'll work up a patch to mark invisible tokens.

I've come up with a simple message that SA thinks is invisible but
Firefox (and presumably other HTML renderers) does not. I'll attach
it below. My concern is that any attempt to detect truly invisible
text would be easy to get around, unless we include something close
to a complete HTML/CSS rendering system.

[Bug 3173] Bayes can be circumvented by faking invisible or near-invisible text. [ In reply to ]

bugzilla-daemon at bugzilla

Mar 14, 2004, 4:40 PM

Post #6 of 33 (460 views)

http://bugzilla.spamassassin.org/show_bug.cgi?id=3173

------- Additional Comments From koppel@ece.lsu.edu 2004-03-14 15:40 -------
Created an attachment (id=1840)
--> (http://bugzilla.spamassassin.org/attachment.cgi?id=1840&action=view)
Message faking invisibility.

[Bug 3173] Bayes can be circumvented by faking invisible or near-invisible text. [ In reply to ]

bugzilla-daemon at bugzilla

Mar 14, 2004, 4:40 PM

Post #7 of 33 (460 views)

http://bugzilla.spamassassin.org/show_bug.cgi?id=3173

------- Additional Comments From quinlan@pathname.com 2004-03-14 15:40 -------
Subject: Re: Bayes can be circumvented by faking invisible or near-invisible text.

If we use a single html_text array and a single body array and set up a
parallel array of properties, then it would be trivial to add an option
to ignore or not ignore invisible text.
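A minimal sketch of the parallel-array layout, assuming each entry of html_text has a matching dict of properties at the same index. The property key "invisible" and the mode names are assumptions. The "I*" prefix follows the token-marking variant discussed elsewhere in this bug.

```python
def select_tokens(html_text, props, invisible="ignore"):
    """Tokenize text chunks, handling hidden chunks according to one option.

    invisible: "ignore" drops hidden chunks, "keep" treats them as normal
    text, "mark" prefixes each hidden token with "I*".
    """
    if len(html_text) != len(props):
        raise ValueError("html_text and props must be parallel arrays")
    if invisible not in ("ignore", "keep", "mark"):
        raise ValueError("unknown mode: %r" % invisible)
    tokens = []
    for chunk, prop in zip(html_text, props):
        words = chunk.split()
        if not prop.get("invisible", False) or invisible == "keep":
            tokens.extend(words)
        elif invisible == "mark":
            tokens.extend("I*" + w for w in words)
        # invisible == "ignore": hidden chunk contributes nothing
    return tokens
```

Because visibility lives in the property array rather than in separate text arrays, switching behaviour is a one-argument change.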

[Bug 3173] Bayes can be circumvented by faking invisible or near-invisible text. [ In reply to ]

bugzilla-daemon at bugzilla

Mar 14, 2004, 4:43 PM

Post #8 of 33 (460 views)

http://bugzilla.spamassassin.org/show_bug.cgi?id=3173

------- Additional Comments From quinlan@pathname.com 2004-03-14 15:43 -------
Subject: Re: Bayes can be circumvented by faking invisible or near-invisible text.

> Message faking invisibility.

Part of the solution is for us to handle CSS properly. Sadly, it is
necessary.

[Bug 3173] Bayes can be circumvented by faking invisible or near-invisible text. [ In reply to ]

bugzilla-daemon at bugzilla

Mar 14, 2004, 4:44 PM

Post #9 of 33 (460 views)

http://bugzilla.spamassassin.org/show_bug.cgi?id=3173

------- Additional Comments From koppel@ece.lsu.edu 2004-03-14 15:44 -------
An option is a great idea! What's the default setting? :-)

[Bug 3173] Bayes can be circumvented by faking invisible or near-invisible text. [ In reply to ]

bugzilla-daemon at bugzilla

Mar 14, 2004, 4:48 PM

Post #10 of 33 (460 views)

http://bugzilla.spamassassin.org/show_bug.cgi?id=3173

------- Additional Comments From koppel@ece.lsu.edu 2004-03-14 15:48 -------
> Part of the solution is for us to handle CSS properly. Sadly, it is
> necessary.

Add to that a better handling of HTML. There might be ways of faking
invisibility by assigning fg and bg colors to different blocks and confusing SA
about which one applies.

[Bug 3173] Bayes can be circumvented by faking invisible or near-invisible text. [ In reply to ]

bugzilla-daemon at bugzilla

Mar 14, 2004, 4:58 PM

Post #11 of 33 (460 views)

http://bugzilla.spamassassin.org/show_bug.cgi?id=3173

------- Additional Comments From quinlan@pathname.com 2004-03-14 15:58 -------
Subject: Re: Bayes can be circumvented by faking invisible or near-invisible text.

> Add to that a better handling of HTML. There might be ways of faking
> invisibility by assigning fg and bg colors to different blocks and
> confusing SA about which one applies.

We *know* already...

Well, you are able to submit code improvements.

For starters, you can look at jgc's spam tricks page. I know a
half-dozen more, but CSS parsing and some basic CSS handling are the
main things lacking right now.

[Bug 3173] Bayes can be circumvented by faking invisible or near-invisible text. [ In reply to ]

bugzilla-daemon at bugzilla

Mar 14, 2004, 5:17 PM

Post #12 of 33 (460 views)

http://bugzilla.spamassassin.org/show_bug.cgi?id=3173

------- Additional Comments From koppel@ece.lsu.edu 2004-03-14 16:17 -------
> We *know* already...

My argument is that we should not omit seemingly invisible text, at
least for now, because it would be too easy to fake.

[Bug 3173] Bayes can be circumvented by faking invisible or near-invisible text. [ In reply to ]

bugzilla-daemon at bugzilla

Mar 14, 2004, 5:53 PM

Post #13 of 33 (460 views)

http://bugzilla.spamassassin.org/show_bug.cgi?id=3173

------- Additional Comments From jm@jmason.org 2004-03-14 16:53 -------
Subject: Re: Bayes can be circumvented by faking invisible or near-invisible text.

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

>Setting a limit on the number of tokens that we process would make us vulnerable
>to the Bayes poisoning that we have been seeing. So far they have not worked
>because we score based on the 15 most significant tokens. If we have to throw
>away tokens we can't find the most significant 15.

BTW -- it's 150, not 15.

>You are correct that a spammer could put 20,000 tokens in visible text. If they
>go so far as to have a three line ad for v*ag*a followed by two thousand lines
>of gibberish, we would have to do something about that. Perhaps exceeding some
>limit of unique tokens in one message could suppress Bayes and trigger another
>rule. Yes, that would also work to prevent the DoS if the random words were in
>invisible text, but doing so would give spammers a way to turn off Bayes
>processing with no visible effect on the spam. That can't be a good thing.

I suggest reading the "Dobly" paper before thinking up new details along
these lines -- it sounds quite practical to detect this.

>> Another thing I'd like to know is how the BC was trained
>> before starting the testing used to get the results.
>
>Justin talked about how he tested in comments 6 and 7 in bug #2129. I interpret
>that as saying that he used 10-fold cross-validation. That repeats training on
>samples of the corpus and testing on the remainder.

yep. Specifically, training on 1/10th, testing against the other 9/10ths
of the corpus, and repeat over all "folds". I've added a page to
describe it here:
http://wiki.apache.org/spamassassin/TenFoldCrossValidation

- --j.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFAVP6eQTcbUG5Y7woRAuBRAKDTYR68APi3eo/Ttqnn08jaHFkEzwCggpLW
h8DNjyXDjNrimCULva1dZzI=
=ZSuF
-----END PGP SIGNATURE-----
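A small sketch of the fold split as described in the message above: train on one tenth, test on the other nine tenths, and repeat over all folds. The function name is hypothetical and the real scripts live under masses/bayes-testing. Note that the more common k-fold convention swaps the two roles and trains on nine folds.

```python
def ten_fold_runs(messages, folds=10):
    """Yield (train, test) pairs: one fold for training, the rest for testing."""
    # Deal messages round-robin so every fold gets a similar share.
    buckets = [messages[i::folds] for i in range(folds)]
    for i, train in enumerate(buckets):
        test = [m for j, bucket in enumerate(buckets) if j != i for m in bucket]
        yield train, test
```

Each message is used for training in exactly one run and for testing in the other nine.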

[Bug 3173] Bayes can be circumvented by faking invisible or near-invisible text. [ In reply to ]

bugzilla-daemon at bugzilla

Mar 14, 2004, 7:35 PM

Post #14 of 33 (460 views)

http://bugzilla.spamassassin.org/show_bug.cgi?id=3173

------- Additional Comments From sidney@sidney.com 2004-03-14 18:35 -------
> BTW -- it's 150, not 15

Oh, Paul Graham and DSPAM used 15 (later up to 20 to 25). That's an interesting
difference.

> I suggest reading the "Dobly" paper before thinking up new details
> along these lines -- it sounds quite practical to detect this.

My understanding after reading the paper is that Dobly uses the spam/ham counts
of the tokens to determine the "sparseness" of text, so it would have to access
the database entry for every token in the message. That would not help with a
DoS attack at all, since it takes as much I/O to determine that the words are
noise. It would only improve accuracy of the final result, if it works at all.
Wouldn't Dobly only help when the random words contained embedded high
probability ham indicators and Dobly led us to ignore them? How would a spammer
produce high probability ham indicator words targeted to each recipient?

[Bug 3173] Bayes can be circumvented by faking invisible or near-invisible text. [ In reply to ]

bugzilla-daemon at bugzilla

Mar 14, 2004, 8:59 PM

Post #16 of 33 (460 views)

http://bugzilla.spamassassin.org/show_bug.cgi?id=3173

------- Additional Comments From jm@jmason.org 2004-03-14 19:59 -------
Subject: Re: Bayes can be circumvented by faking invisible or near-invisible text.

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

bugzilla-daemon@bugzilla.spamassassin.org writes:
> > BTW -- it's 150, not 15
>
> Oh, Paul Graham and DSPAM used 15 (later up to 20 to 25). That's an
> interesting difference.

Yeah -- experimentally verified in "bayes tweaks round 1" if I recall
correctly ;) Spambayes found the same.

Note that PG doesn't do 10-fold cross-validation as far as I know ;)

> > I suggest reading the "Dobly" paper before thinking up new details
> > along these lines -- it sounds quite practical to detect this.
>
> My understanding after reading the paper is that Dobly uses the spam/ham
> counts of the tokens to determine the "sparseness" of text, so it would
> have to access the database entry for every token in the message. That
> would not help with a DoS attack at all, since it takes as much I/O to
> determine that the words are noise. It would only improve accuracy of
> the final result, if it works at all.

that's very true. OK, I see your point...

> Wouldn't Dobly only help when the
> random words contained embedded high probability ham indicators and
> Dobly led us to ignore them? How would a spammer produce high
> probability ham indicator words targeted to each recipient?

Yeah -- true. The issue is that occasionally they can be lucky
and hit one or two 0.001's. Chi2 combining is much better at
dealing with that. But I bet there's the occasional spam that
gets through...

- --j.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.4 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFAVSohQTcbUG5Y7woRAmJ3AKCgexbzNj8SCmwdp/ZyLHDv+EY0UgCfZmte
nV4FqAfwMPne//3ANzVk/50=
=n+8y
-----END PGP SIGNATURE-----
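For readers unfamiliar with chi-squared combining, here is a sketch in the style popularised by SpamBayes (Fisher's method applied to the spam and ham evidence separately). It is not SpamAssassin's code and it assumes every probability lies strictly between 0 and 1.

```python
import math


def chi2q(x2, v):
    """Upper tail of the chi-squared distribution for even degrees of freedom v."""
    m = x2 / 2.0
    term = total = math.exp(-m)
    for i in range(1, v // 2):
        term *= m / i
        total += term
    return min(total, 1.0)


def chi2_combine(probs):
    """Combine per-token spam probabilities into one score in [0, 1]."""
    n = len(probs)
    if n == 0:
        return 0.5
    # s measures the spam evidence, h the ham evidence.
    s = 1.0 - chi2q(-2.0 * sum(math.log(1.0 - p) for p in probs), 2 * n)
    h = 1.0 - chi2q(-2.0 * sum(math.log(p) for p in probs), 2 * n)
    return (s - h + 1.0) / 2.0


# Twenty spammy tokens plus two "lucky" 0.001 hits.
lucky = chi2_combine([0.95] * 20 + [0.001] * 2)
```

In this sketch a couple of lucky strongly hammy tokens pull the score down only slightly when the rest of the evidence is consistently spammy.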

[Bug 3173] Bayes can be circumvented by faking invisible or near-invisible text. [ In reply to ]

bugzilla-daemon at bugzilla

Mar 15, 2004, 12:25 AM

Post #17 of 33 (460 views)

http://bugzilla.spamassassin.org/show_bug.cgi?id=3173

------- Additional Comments From lwilton@earthlink.net 2004-03-14 23:25 -------
As I see it from examining spam, there are two kinds (generally) of invisible
text in the messages. Type 1 is a whole lot of "words", doubtless intended as
Bayes poisoning, and generally showing up at the end of the body of an HTML
message, or sometimes after the body (illegal HTML). It is also common to see
them in the text fork where they will generally be ignored by mail readers.

So far as I can tell, this Bayes "poisoning" in the vast majority of cases has
just the opposite of the intended effect: it marks the message as 99%
probability spam. This can often be very useful in pushing a message solidly
into the spam bucket when the available rules were indecisive. I would hate to
lose this form of Bayes "poisoning", since it is so effective in detecting spam.

The second form of invisible text is individual letters or small sequences of
letters or numbers, typically in a 0 or 1 pt font, inserted in the middle of a
word, typically something like via<small letters here>gra. The intent here is
clearly random obfuscation of key words to make them hard to match by rules,
and hopefully hard to match by Bayes.

In these cases simply making the invisible letters go away will have the
advantage of making the evil words immediately obvious to both rules and Bayes,
since they will no longer contain the invisible obfuscation text.

Which means I both do and don't want invisible text to vanish from the text
rendering of the html. A simple and probably effective rule would be to throw
away any invisible text that isn't bounded by a wordbreak on at least one side,
or that doesn't contain whitespace. Or maybe more simply any run of invisible
text of less than say 6 characters. Any multi-word run of invisible text
should REMAIN in the rendering, since it is very effective in Bayes for
detecting spam.
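A rough sketch of that heuristic, assuming the HTML renderer hands back the message as (text, invisible) segments in document order. The segment format, the function name, and the 6-character default are illustrative only.

```python
def render_for_bayes(segments, min_len=6):
    """Drop short or single-word invisible runs, keep multi-word ones."""
    out = []
    for text, invisible in segments:
        if invisible:
            core = text.strip()
            multi_word = len(core.split()) > 1
            # Short fragments and lone "words" are treated as obfuscation
            # inside a visible word and removed; longer multi-word runs
            # stay in the rendering for Bayes to see.
            if len(core) < min_len or not multi_word:
                continue
        out.append(text)
    return "".join(out)


segments = [("Buy via", False), ("xq", True), ("gra now ", False),
            ("lorem ipsum dolor sit", True)]
rendered = render_for_bayes(segments)
```

Here the hidden "xq" splitting the keyword is removed, so the word is reassembled, while the hidden multi-word filler remains visible to Bayes.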

(Which implies also that having a set of rules of percentage of invisible text
to non-invisible text could be a good spam detector all by itself.)
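One way such percentage rules might be expressed, again assuming (text, invisible) segments. The sketch measures invisible characters as a share of all characters, and the rule names and thresholds are made up for illustration.

```python
def invisible_pct_rules(segments, thresholds=(10, 50, 90)):
    """Return the names of the percentage rules that a message triggers."""
    hidden = sum(len(text) for text, invisible in segments if invisible)
    total = sum(len(text) for text, _ in segments)
    if total == 0:
        return []
    # Integer comparison avoids floating-point rounding at the thresholds.
    return ["INVISIBLE_PCT_%d" % t for t in thresholds
            if hidden * 100 >= t * total]
```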

[Bug 3173] Bayes can be circumvented by faking invisible or near-invisible text. [ In reply to ]

bugzilla-daemon at bugzilla

Mar 15, 2004, 8:08 AM

Post #18 of 33 (460 views)

http://bugzilla.spamassassin.org/show_bug.cgi?id=3173

------- Additional Comments From koppel@ece.lsu.edu 2004-03-15 07:08 -------
I commented above (Comment 2) that invisible text is not being ignored
for classification but is being ignored (as intended) for learning.
Actually, I don't think it's being ignored in either case, see Bug
3176.

If I'm not wrong then we need to look at what the improved performance
Justin observed is due to. (See Bug 2129 Comment 7.) Perhaps he
tested using a working copy of the invisible text feature but a
non-working version was checked into the repository.

[Bug 3173] Bayes can be circumvented by faking invisible or near-invisible text. [ In reply to ]

bugzilla-daemon at bugzilla

Mar 16, 2004, 7:33 AM

Post #19 of 33 (460 views)

http://bugzilla.spamassassin.org/show_bug.cgi?id=3173

------- Additional Comments From koppel@ece.lsu.edu 2004-03-16 06:33 -------
I've come to the conclusion that using special marking for invisible
Bayes tokens, such as "I:poison", is a bad idea. The imbalance
between "invisible" regions in spam and ham messages can lead to false
positives for legitimate messages with regions deemed invisible. (In
a recent 1-day HTML run HTML_FONT_INVISIBLE hits on 10.7% of spam and
2.4% of ham.)

I still think that invisible regions should not be ignored.

Right now I'm thinking along the lines of having the following options:

1: Always ignore invisible regions.

2: Never ignore invisible regions.

3: Always learn invisible regions.
If percent of recognized tokens > some threshold,
use ordinary scoring;
if percent of recognized tokens <= some threshold,
consider only spammy tokens (in case region contains random strings).

Option 3 should reduce the impact of Bayes poisoning, just as ignoring
invisible regions does, without enabling spammers to get around BC by
faking invisible text.
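The scoring half of option 3 can be sketched as follows. It assumes db maps already-learned tokens to a spam probability and that "spammy" means a probability above 0.5. The names and the threshold value are hypothetical, and learning is not shown.

```python
def invisible_region_probs(tokens, db, threshold=0.5):
    """Pick which token probabilities from an invisible region to score.

    tokens: tokens found in the invisible region.
    db: dict mapping known tokens to their spam probability.
    """
    if not tokens:
        return []
    known = [db[t] for t in tokens if t in db]
    if len(known) / len(tokens) > threshold:
        # Mostly recognized words: score the region like ordinary text.
        return known
    # Mostly unknown strings, likely random filler: keep only spammy evidence.
    return [p for p in known if p > 0.5]
```

Random filler can then no longer contribute accidental hammy tokens, while a region wrongly judged invisible still gets scored normally if its words are known.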

If anyone is familiar with work or discussion on considering only
spammy tokens when the number of recognized tokens is small, I'd
appreciate a pointer.

I'll work on a patch to implement this idea and post it probably later
today.

[Bug 3173] Bayes can be circumvented by faking invisible or near-invisible text. [ In reply to ]

bugzilla-daemon at bugzilla

Mar 16, 2004, 12:32 PM

Post #20 of 33 (460 views)

http://bugzilla.spamassassin.org/show_bug.cgi?id=3173

------- Additional Comments From quinlan@pathname.com 2004-03-16 11:32 -------
Subject: Re: Bayes can be circumvented by faking invisible or near-invisible text.

> I've come to the conclusion that using special marking for invisible
> Bayes tokens, such as "I:poison", is a bad idea. The imbalance
> between "invisible" regions in spam and ham messages can lead to false
> positives for legitimate messages with regions deemed invisible. (In
> a recent 1-day HTML run HTML_FONT_INVISIBLE hits on 10.7% of spam and
> 2.4% of ham.)

There's zero basis for your conclusion. Accidentally invisible text in
ham is very likely going to use different words than intentionally
invisible text in spam. It's fine to speculate, but until you've done a
test or have a reference you can point to, laying down firm conclusions
is a waste of everyone's time.

[Bug 3173] Bayes can be circumvented by faking invisible or near-invisible text. [ In reply to ]

bugzilla-daemon at bugzilla

Mar 16, 2004, 1:05 PM

Post #21 of 33 (460 views)

http://bugzilla.spamassassin.org/show_bug.cgi?id=3173

------- Additional Comments From koppel@ece.lsu.edu 2004-03-16 12:05 -------
> There's zero basis for your conclusion.

If I've been pompous or otherwise offensive I apologize. Please suggest
a less offensive way of phrasing the conclusion of my speculation.

> Accidentally invisible text in ham is very likely going to use
> different words than intentionally invisible text in spam.

I'm not concerned about clearly hammy or spammy words, I'm concerned
about the large number of words that are neutral. Those words are
considered neutral by BC (as now used) because it sees roughly equal
amounts of ham and spam. My fear is that for those who get little or
no invisible ham, in the invisible hams that do arrive words that
should be scored neutral will be scored spammy. Keeping track of the
number of ham and spam messages having invisible regions won't help if
what should be a neutral word has not yet appeared in an invisible ham
region.
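A toy calculation of that concern, using a bare Graham-style estimate with no smoothing or clamping. The counts are invented, and a real classifier would temper the second result.

```python
def token_prob(spam_hits, ham_hits, n_spam, n_ham):
    """Unsmoothed spam probability from per-class message counts."""
    s = spam_hits / n_spam
    h = ham_hits / n_ham
    if s + h == 0:
        return 0.5  # never seen: neutral
    return s / (s + h)


# "tuesday" as an ordinary token: equally common in spam and ham.
plain = token_prob(300, 300, 1000, 1000)
# "I*tuesday": seen in 30 invisible spam regions, never yet in invisible ham.
marked = token_prob(30, 0, 1000, 1000)
```

The unmarked token comes out neutral while the marked one looks maximally spammy, purely because no ham with an invisible region has contained it yet.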

> It's fine to speculate, but until you've done a test or have a
> reference you can point to, laying down firm conclusions is a waste of
> everyone's time.

I'm just trying to decide what to do next. I have some data, I'd be happy
to post it if you (or others) want to help me interpret it.

[Bug 3173] Bayes can be circumvented by faking invisible or near-invisible text. [ In reply to ]

bugzilla-daemon at bugzilla

Mar 16, 2004, 6:06 PM

Post #22 of 33 (460 views)

http://bugzilla.spamassassin.org/show_bug.cgi?id=3173

------- Additional Comments From koppel@ece.lsu.edu 2004-03-16 17:06 -------
I'm getting a version ready that will specially mark invisible
tokens and which works along the lines described in Comment 18
(different options, not all at once).

[Bug 3173] Bayes can be circumvented by faking invisible or near-invisible text. [ In reply to ]

bugzilla-daemon at bugzilla

Mar 16, 2004, 10:39 PM

Post #23 of 33 (460 views)

http://bugzilla.spamassassin.org/show_bug.cgi?id=3173

------- Additional Comments From jm@jmason.org 2004-03-16 21:39 -------
testing the "I*" variant now. I'd be happy to test another variant given a patch ;)

BTW, another question: should "body" see the visible parts only? or both vis
and invis?

if the latter, how will that interact with bug 3139 (in that "tiny font"
sections should be considered "invisible text")?

[Bug 3173] Bayes can be circumvented by faking invisible or near-invisible text. [ In reply to ]

bugzilla-daemon at bugzilla

Mar 17, 2004, 12:32 AM

Post #24 of 33 (460 views)

http://bugzilla.spamassassin.org/show_bug.cgi?id=3173

------- Additional Comments From quinlan@pathname.com 2004-03-16 23:32 -------
Subject: Re: Bayes can be circumvented by faking invisible or near-invisible text.

> BTW, another question: should "body" see the visible parts only? or
> both vis and invis?

Maybe both to make sure we match as much as possible?

Only maybe 10% of spam has invisible text, another 10% has low contrast,
and maybe another 10% of other random ways to hide text.

> if the latter, how will that interact with bug 3139 (in that "tiny
> font" sections should be considered "invisible text")?

I think ignoring tiny fonts would have to be tested. It seems to be a
smaller percentage of spam, so I think it can wait.

Daniel

[Bug 3173] Bayes can be circumvented by faking invisible or near-invisible text. [ In reply to ]

bugzilla-daemon at bugzilla

Mar 17, 2004, 5:06 AM

Post #25 of 33 (460 views)

http://bugzilla.spamassassin.org/show_bug.cgi?id=3173

------- Additional Comments From koppel@ece.lsu.edu 2004-03-17 04:06 -------
Did anyone look at bug 3176, mentioned in comment 17 here? I don't think
invisible text is being ignored, at least in the trunk code as of last night.
