Mailing List Archive

FW: Bayes Tokens
> ----------
> From: Jim Grusendorf
> Sent: Friday, February 06, 2004 3:25:23 PM
> To: 'sa-list@hudsonca.ca'
> Subject: Bayes Tokens
> Auto forwarded by a Rule
>
>
>
I'm sure this is nothing new to you SpamAssassin experts, but it seemed like
a profound realization when it hit me:

In addition to the actual words and phrases in a message and its headers,
*any* aspect or property of the message -- whether intrinsic or derived --
can be considered a Bayes token. POPFile takes this into account to a
limited extent with their "pseudowords"
(http://popfile.sourceforge.net/cgi-bin/wiki.pl?Glossary/PseudoWord). But a
rule-based system can be enfolded into a Bayes solution, because the result
of evaluating any rule can be considered a token. Imagine if SpamAssassin
had an optional mode where, instead of using manual scoring, all the active
rules of its impressively comprehensive ruleset were tokenized? I think
this would result in most impressive accuracy.
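
A minimal sketch of the idea above, in Python. It is not how SpamAssassin
or POPFile actually work. The rule names (ON_BLACKLIST, HTML_HEAVY), the
"RULE:" prefix, and the TokenStats class are all hypothetical, and the
scoring is a plain Laplace-smoothed naive Bayes with equal priors. The
point is only that a fired rule can be counted like any other token.

```python
import math
from collections import Counter

def tokenize(words, rule_hits):
    """Combine ordinary word tokens with one pseudo-token per fired rule."""
    return {w.lower() for w in words} | {f"RULE:{name}" for name in rule_hits}

class TokenStats:
    """Per-token spam/ham counts (hypothetical helper, not a real SA class)."""

    def __init__(self):
        self.spam, self.ham = Counter(), Counter()
        self.n_spam = self.n_ham = 0

    def learn(self, tokens, is_spam):
        if is_spam:
            self.spam.update(tokens)
            self.n_spam += 1
        else:
            self.ham.update(tokens)
            self.n_ham += 1

    def token_prob(self, token):
        # Laplace-smoothed estimate of P(spam | token), assuming equal priors.
        p_s = (self.spam[token] + 1) / (self.n_spam + 2)
        p_h = (self.ham[token] + 1) / (self.n_ham + 2)
        return p_s / (p_s + p_h)

    def score(self, tokens):
        # Sum log-odds per token (the naive independence assumption).
        log_odds = sum(math.log(p / (1 - p))
                       for p in map(self.token_prob, tokens))
        return 1 / (1 + math.exp(-log_odds))

stats = TokenStats()
stats.learn(tokenize(["cheap", "pills"], ["ON_BLACKLIST"]), True)
stats.learn(tokenize(["buy", "now"], ["ON_BLACKLIST", "HTML_HEAVY"]), True)
stats.learn(tokenize(["meeting", "notes"], []), False)
stats.learn(tokenize(["lunch", "tomorrow"], ["HTML_HEAVY"]), False)

# In this toy corpus the blacklist rule leans spammy; the HTML rule is neutral.
blacklist_p = stats.token_prob("RULE:ON_BLACKLIST")
html_p = stats.token_prob("RULE:HTML_HEAVY")
new_msg_score = stats.score(tokenize(["hello"], ["ON_BLACKLIST"]))
```

With this toy corpus the classifier, rather than a hand-assigned score,
decides how much weight each rule gets.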

I'm imagining examples such as testing for blacklist membership, which some
people (mostly elsewhere, apparently) think is a bad idea. I could let
Bayes decide just exactly how relevant it is to *my* corpus. How about the
"Message is x% to y% HTML" rules, or "text to image ratio" rules? This
would develop a very precise profile of what I consider to be ham and spam.
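
One way to turn such properties into tokens is to bucket them, sketched
here in Python. The token names (HTML_PCT, DNSBL) and the 10-point bucket
width are assumptions made for illustration, not SpamAssassin behaviour.

```python
def html_ratio_token(html_chars, total_chars, bucket_width=10):
    """Map an HTML percentage onto a coarse bucket token.

    Assumes bucket_width divides 100 evenly. Token names are hypothetical.
    """
    if total_chars <= 0:
        return "HTML_PCT:none"
    pct = 100 * html_chars / total_chars
    # Clamp so that exactly 100% falls into the top bucket.
    low = min(int(pct // bucket_width) * bucket_width, 100 - bucket_width)
    return f"HTML_PCT:{low}-{low + bucket_width}"

def blacklist_token(listed):
    """Turn a yes/no blacklist lookup result into a token."""
    return "DNSBL:listed" if listed else "DNSBL:clean"

tokens = {html_ratio_token(450, 1000), blacklist_token(True)}
```

Each bucket then accumulates its own ham and spam counts, so the
relevance of "40% to 50% HTML" is learned from the user's own corpus.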

I've emailed Paul Graham and Gary Robinson for their opinions, and both
agree that this is a good idea. Paul pointed out that he mentions the
possibility in the appendix to A Plan for Spam, and Gary mentioned a
commercial product (PureMessage) that apparently does some of this. It sure
would be nice to see it in SpamAssassin someday.

Jim Grusendorf
Computer Systems Manager
HHS Management Limited Partnership
jgrusendorf@hudsonca.ca
PGP Key ID 0x5534507C
Re: FW: Bayes Tokens
On Fri, 2004-02-06 at 16:25, SpamAssassin List wrote:
> I'm sure this is nothing new to you SpamAssassin experts, but it seemed like
> a profound realization when it hit me:
>
> In addition to the actual words and phrases in a message and its headers,
> *any* aspect or property of the message -- whether intrinsic or derived --
> can be considered a Bayes token. POPFile takes this into account to a

Quite right...

> limited extent with their "pseudowords"
> (http://popfile.sourceforge.net/cgi-bin/wiki.pl?Glossary/PseudoWord). But a
> rule-based system can be enfolded into a Bayes solution, because the result
> of evaluating any rule can be considered a token. Imagine if SpamAssassin
> had an optional mode where, instead of using manual scoring, all the active
> rules of its impressively comprehensive ruleset were tokenized? I think
> this would result in most impressive accuracy.
>
There was actually some discussion of a similar topic on the sa-dev
list a week or two back. That discussion was specifically about
*generating* the (static) scores using naive Bayes. IIRC, the general
consensus was that the existing GA, or the new perceptron (which is
orders of magnitude faster), can come up with better scores. The
reasoning was something about including contextual information in the
computation instead of a simple "good" or "bad" (e.g. rule1 alone is
neutral between ham and spam, but rule1 and rule2 together are spammy).
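
A toy Python illustration of that rule1/rule2 case. It is not the sa-dev
approach and says nothing about how the GA or perceptron work. It only
shows that per-rule tokens, treated independently, see nothing spammy in
this made-up corpus, while a compound token does. The pairing scheme
("RULE1&RULE2") is an assumption introduced here.

```python
from itertools import combinations

def with_pairs(rule_hits):
    """Return the fired rules plus one compound token per pair of rules."""
    hits = sorted(rule_hits)
    return set(hits) | {f"{a}&{b}" for a, b in combinations(hits, 2)}

def spam_prob(token, spam_msgs, ham_msgs):
    """Laplace-smoothed P(spam | token), assuming equal priors."""
    s = sum(token in m for m in spam_msgs)
    h = sum(token in m for m in ham_msgs)
    p_s = (s + 1) / (len(spam_msgs) + 2)
    p_h = (h + 1) / (len(ham_msgs) + 2)
    return p_s / (p_s + p_h)

# Made-up corpus: each rule alone shows up only in ham, both together only
# in spam, so each rule is seen equally often in ham and in spam.
spam = [with_pairs({"RULE1", "RULE2"})] * 4 + [with_pairs(set())] * 4
ham = [with_pairs({"RULE1"})] * 4 + [with_pairs({"RULE2"})] * 4

p_rule1 = spam_prob("RULE1", spam, ham)        # neutral on its own
p_rule2 = spam_prob("RULE2", spam, ham)        # neutral on its own
p_both = spam_prob("RULE1&RULE2", spam, ham)   # leans spammy
```

Compound tokens grow quadratically with the number of rules, so this is
only one possible workaround, not a recommendation.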

I fully expect somebody to reply telling me I'm smoking crack. Here,
have a grain of salt.

> I'm imagining examples such as testing for blacklist membership, which some
> people (mostly elsewhere, apparently) think is a bad idea. I could let
> Bayes decide just exactly how relevant it is to *my* corpus. How about the
> "Message is x% to y% HTML" rules, or "text to image ratio" rules. This
> would develop a very precise profile of what I consider to be ham and spam.
>
This is definitely an interesting idea... letting Bayes tweak scores...

> I've emailed Paul Graham and Gary Robinson for their opinions, and both
> agree that this is a good idea. Paul pointed out that he mentions the

An even more interesting idea.

> possibility in the appendix to A Plan for Spam, and Gary mentioned a
> commercial product (PureMessage) that apparently does some of this. It sure
> would be nice to see it in SpamAssassin someday.


--
Chris Thielen

Easily generate SpamAssassin rules to catch obfuscated spam phrases:
http://www.sandgnat.com/cmos/
Re: FW: Bayes Tokens
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


> In addition to the actual words and phrases in a message and its headers,
> *any* aspect or property of the message -- whether intrinsic or derived --
> can be considered a Bayes token. POPFile takes this into account to a
> limited extent with their "pseudowords"
> (http://popfile.sourceforge.net/cgi-bin/wiki.pl?Glossary/PseudoWord).

We do this extensively ;) see the source to Bayes.pm for details.

> But a
> rule-based system can be enfolded into a Bayes solution, because the result
> of evaluating any rule can be considered a token. Imagine if SpamAssassin
> had an optional mode where, instead of using manual scoring, all the active
> rules of its impressively comprehensive ruleset were tokenized? I think
> this would result in most impressive accuracy.

This is under investigation. Some previous attempts failed pretty
badly, but we think there might be a way to do it correctly.

> I'm imagining examples such as testing for blacklist membership, which some
> people (mostly elsewhere, apparently) think is a bad idea. I could let
> Bayes decide just exactly how relevant it is to *my* corpus. How about the
> "Message is x% to y% HTML" rules, or "text to image ratio" rules. This
> would develop a very precise profile of what I consider to be ham and spam.
>
> I've emailed Paul Graham and Gary Robinson for their opinions, and both
> agree that this is a good idea. Paul pointed out that he mentions the
> possibility in the appendix to A Plan for Spam, and Gary mentioned a
> commercial product (PureMessage) that apparently does some of this. It sure
> would be nice to see it in SpamAssassin someday.

- --j.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.3 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFAJEfRQTcbUG5Y7woRApcIAJ0R45Y8fXC+p+XCCt3MmnLAFEWawACfVA9k
EgVQDJPoVcZGbprzJsPOZ5A=
=20Ub
-----END PGP SIGNATURE-----