> ----------
> From: Jim Grusendorf
> Sent: Friday, February 06, 2004 3:25:23 PM
> To: 'sa-list@hudsonca.ca'
> Subject: Bayes Tokens
> Auto forwarded by a Rule
>
>
>
I'm sure this is nothing new to you SpamAssassin experts, but it seemed like
a profound realization when it hit me:
In addition to the actual words and phrases in a message and its headers,
*any* aspect or property of the message -- whether intrinsic or derived --
can be considered a Bayes token. POPFile takes this into account to a
limited extent with their "pseudowords"
(http://popfile.sourceforge.net/cgi-bin/wiki.pl?Glossary/PseudoWord). But a
rule-based system can be enfolded into a Bayes solution, because the result
of evaluating any rule can be considered a token. Imagine if SpamAssassin
had an optional mode where, instead of using manual scoring, the results of
all the active rules in its impressively comprehensive ruleset were treated
as tokens. I think this would result in very impressive accuracy.
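
A minimal sketch of the idea, assuming a plain naive Bayes classifier with
add-one smoothing and equal priors (this is not SpamAssassin's actual Bayes
implementation). The rule name ON_BLACKLIST, the "RULE:" prefix, and the
TinyBayes class are hypothetical names chosen for illustration:

```python
import math
from collections import Counter


def tokens_for(words, rule_hits):
    """Combine ordinary word tokens with one pseudo-token per rule that fired."""
    return set(words) | {"RULE:" + name for name in rule_hits}


class TinyBayes:
    """Toy naive Bayes over token sets; illustrative only."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.totals = {"spam": 0, "ham": 0}

    def train(self, label, tokens):
        # Count each token once per message, and count the message itself.
        self.counts[label].update(set(tokens))
        self.totals[label] += 1

    def spam_probability(self, tokens):
        # Sum per-token log-likelihood ratios (add-one smoothing, equal priors).
        log_odds = 0.0
        for t in set(tokens):
            p_spam = (self.counts["spam"][t] + 1) / (self.totals["spam"] + 2)
            p_ham = (self.counts["ham"][t] + 1) / (self.totals["ham"] + 2)
            log_odds += math.log(p_spam / p_ham)
        return 1 / (1 + math.exp(-log_odds))


clf = TinyBayes()
clf.train("spam", tokens_for(["cheap", "pills"], ["ON_BLACKLIST"]))
clf.train("spam", tokens_for(["win", "money"], ["ON_BLACKLIST"]))
clf.train("ham", tokens_for(["meeting", "notes"], []))
clf.train("ham", tokens_for(["lunch", "today"], []))

# A message with no known words still gets evidence from the rule token.
p = clf.spam_probability(tokens_for(["hello"], ["ON_BLACKLIST"]))
```

The classifier never sees the rule as anything special: a rule hit is just
one more token whose weight is learned from the training corpus.
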
I'm imagining examples such as testing for blacklist membership, which some
people (mostly elsewhere, apparently) think is a bad idea. I could let
Bayes decide just exactly how relevant it is to *my* corpus. How about the
"Message is x% to y% HTML" rules, or "text to image ratio" rules? Tokenizing
these would develop a very precise profile of what I consider to be ham and
spam.
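
One way a numeric property like "x% to y% HTML" could become a token is to
bucket it into ranges. This is only a sketch: the function name, the
"HTML_PCT:" prefix, and the 10-point bucket width are assumptions made for
illustration, not anything SpamAssassin defines:

```python
def html_ratio_token(html_chars, total_chars, bucket_width=10):
    """Map an HTML percentage to a range token such as 'HTML_PCT:30-40'."""
    if total_chars <= 0:
        return "HTML_PCT:none"
    pct = 100 * html_chars / total_chars
    # Clamp so that exactly 100% falls into the top bucket.
    low = min(int(pct // bucket_width) * bucket_width, 100 - bucket_width)
    return f"HTML_PCT:{low}-{low + bucket_width}"


token = html_ratio_token(35, 100)
```

Each range then gets its own learned weight, so the classifier decides which
HTML proportions matter for a particular corpus.
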
I've emailed Paul Graham and Gary Robinson for their opinions, and both
agree that this is a good idea. Paul pointed out that he mentions the
possibility in the appendix to A Plan for Spam, and Gary mentioned a
commercial product (PureMessage) that apparently does some of this. It sure
would be nice to see it in SpamAssassin someday.
Jim Grusendorf
Computer Systems Manager
HHS Management Limited Partnership
jgrusendorf@hudsonca.ca
PGP Key ID 0x5534507C