Mailing List Archive

[SpamAssassin Wiki] Updated: BayesInSpamAssassin
Date: 2004-06-05T22:58:36
Editor: 64.252.169.209 <>
Wiki: SpamAssassin Wiki
Page: BayesInSpamAssassin
URL: http://wiki.apache.org/spamassassin/BayesInSpamAssassin

no comment

Change Log:

------------------------------------------------------------------------------
@@ -46,3 +46,8 @@
Bayesian+Dictionary analysis

It may be worthwhile to develop a Bayesian checker which checks for the proportion of dictionary words vs. non-dictionary words. This may quickly assist in identifying messages that utilize Bayesian avoidance techniques with punctuation/spacing interspersed through commonly identified words. Possible adjunct to the current methods.
+
+(Added by Guest)
+Bayesian+Grammar analysis
+
+One standard Bayesian avoidance technique is to throw in a large number of randomly-selected unusual words (e.g., chelate, diorite, swathe, crowberry) after the main message. A significant boost could be achieved by checking things like "What ratio of the message is punctuated like a sentence?" and "What ratio of the punctuation-defined sentences are grammatical sentences (i.e., have a subject and a verb)?"
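
As a rough illustration of the two ratios proposed above (dictionary words vs. non-dictionary words, and sentence-like lines vs. other lines), here is a minimal Python sketch. Everything in it is an assumption made for illustration: the tiny `DICTIONARY` word list, the function names, and the regular expressions are hypothetical and are not part of SpamAssassin. It also only checks punctuation, not whether a sentence has a subject and a verb.

```python
import re

# Hypothetical tiny word list; a real check would load a full dictionary file.
DICTIONARY = {"buy", "cheap", "pills", "now", "the", "offer", "is", "good"}

# A "word" here is simply a run of ASCII letters (an assumption of this sketch).
WORD_RE = re.compile(r"[A-Za-z]+")


def dictionary_ratio(text, dictionary=DICTIONARY):
    """Return the fraction of alphabetic tokens found in the dictionary."""
    words = [w.lower() for w in WORD_RE.findall(text)]
    if not words:
        return 0.0
    known = sum(1 for w in words if w in dictionary)
    return known / len(words)


def sentence_ratio(text):
    """Return the fraction of non-empty lines punctuated like a sentence."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    # "Punctuated like a sentence": starts with a capital, ends with . ! or ?
    sentence_like = sum(1 for ln in lines if re.match(r"^[A-Z].*[.!?]$", ln))
    return sentence_like / len(lines)
```

A word split by punctuation, such as `V.i.a.g.r.a`, breaks into single-letter tokens that are not in the word list, so it lowers `dictionary_ratio`. A trailing block of random unusual words lowers both ratios.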
[SpamAssassin Wiki] Updated: BayesInSpamAssassin
Date: 2004-06-06T00:39:00
Editor: JustinMason <jm@jmason.org>

no comment

Change Log:

------------------------------------------------------------------------------
@@ -41,13 +41,15 @@
If you have "maildir" mailboxes, running ''spamassassin -r'' multiple times can be tedious for large numbers of spam. So you can use this ["report_spam.pl"] script to run it for you. The script is written in perl. You can save the script to your spamassassin computer and then run it using ''report_spam.pl your_spam_directory''. Each message in your_spam_directory will then be learned in bayes '''and''' reported to the checksum services.

= Possible Future Directions =
-(InSanity)

-Bayesian+Dictionary analysis
+(InSanity): Bayesian+Dictionary analysis

It may be worthwhile to develop a Bayesian checker which checks for the proportion of dictionary words vs. non-dictionary words. This may quickly assist in identifying messages that utilize Bayesian avoidance techniques with punctuation/spacing interspersed through commonly identified words. Possible adjunct to the current methods.

-(Added by Guest)
-Bayesian+Grammar analysis
+(Added by Guest): Bayesian+Grammar analysis

One standard Bayesian avoidance technique is to throw in a large number of randomly-selected unusual words (e.g., chelate, diorite, swathe, crowberry) after the main message. A significant boost could be achieved by checking things like "What ratio of the message is punctuated like a sentence?" and "What ratio of the punctuation-defined sentences are grammatical sentences (i.e., have a subject and a verb)?"
+
+(JustinMason): answers to those two
+
+The danger here is that it's pretty trivial for spammers to use chunks from freely-available text on the internet, which do form correct, grammatical sentences.
[SpamAssassin Wiki] Updated: BayesInSpamAssassin
Date: 2004-06-10T00:27:49
Editor: 145.18.136.124 <>

no comment

Change Log:

------------------------------------------------------------------------------
@@ -53,3 +53,12 @@
(JustinMason): answers to those two

The danger here is that it's pretty trivial for spammers to use chunks from freely-available text on the internet, which do form correct, grammatical sentences.
+
+
+(Added by Guest): Bayesian+Spell checking
+
+More and more spam includes misspelled spam words such as V.|agarra. Perhaps Bayesian analysis would benefit from running the mail to be categorized through a filter that detects or translates leetspeak and simple misspellings, and/or from using the leetspeak/misspelling count as a categorizer. After all, spammers may include Shakespeare to evade Bayesian filters, but they must still include misspellings of the wares they are selling.
+
+(Added by Guest): Bayesian performance of spamassassin cf. mozilla mail
+
+It struck me that the performance of spamassassin, especially when it has only been fed small amounts of hams and spams, is under-par compared to the performance of mozilla mail, even though I'm retraining SA about every day. Any ideas what might be the culprit?
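
The spell-checking idea above could be sketched roughly as follows in Python. This is only an illustration under stated assumptions: the `LEET_MAP` substitution table, the `WATCHED_WORDS` list, and both function names are hypothetical, and nothing here is SpamAssassin code. The sketch handles character substitution and interspersed punctuation only. A genuine misspelling, such as the example in the text, would need some form of fuzzy matching, which is not shown.

```python
import re

# Hypothetical substitution table; real obfuscations are far more varied.
LEET_MAP = str.maketrans({
    "0": "o", "1": "i", "3": "e", "4": "a", "5": "s",
    "@": "a", "$": "s", "|": "i", "!": "i",
})

# Hypothetical list of words the filter cares about.
WATCHED_WORDS = {"viagra", "pills"}


def normalize(token):
    """Map leetspeak characters to letters, then drop any other non-letters."""
    mapped = token.lower().translate(LEET_MAP)
    return re.sub(r"[^a-z]", "", mapped)


def obfuscation_count(text, watched=WATCHED_WORDS):
    """Count tokens that normalize to a watched word but were not written plainly."""
    count = 0
    for token in text.split():
        # Ignore ordinary trailing punctuation when testing for a plain spelling.
        plain = token.lower().strip(".,;:!?")
        if normalize(token) in watched and plain not in watched:
            count += 1
    return count
```

The normalized tokens could be fed to the classifier in place of the originals, and the count could serve as an extra feature, which are the two options the paragraph above mentions.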
[SpamAssassin Wiki] Updated: BayesInSpamAssassin
Date: 2004-06-10T00:28:37
Editor: 145.18.136.124 <>

M

Change Log:

------------------------------------------------------------------------------
@@ -61,4 +61,4 @@

(Added by Guest): Bayesian performance of spamassassin cf. mozilla mail

-It struck me that the performance of spamassassin, especially when it has only been fed small amounts of hams and spams, is under-par compared to the performance of mozilla mail, even though I'm retraining SA about every day. Any ideas what might be the culprit?
+It struck me that the performance of spamassassin, especially when it has only been fed small amounts (though more than 200) of hams and spams, is under-par compared to the performance of mozilla mail, even though I'm retraining SA about every day. Any ideas what might be the culprit?
[SpamAssassin Wiki] Updated: BayesInSpamAssassin
Date: 2004-06-10T09:51:53
Editor: JustinMason <jm@jmason.org>

no comment

Change Log:

------------------------------------------------------------------------------
@@ -40,25 +40,6 @@

If you have "maildir" mailboxes, running ''spamassassin -r'' multiple times can be tedious for large numbers of spam. So you can use this ["report_spam.pl"] script to run it for you. The script is written in perl. You can save the script to your spamassassin computer and then run it using ''report_spam.pl your_spam_directory''. Each message in your_spam_directory will then be learned in bayes '''and''' reported to the checksum services.

-= Possible Future Directions =
+= Questions, Comments, Future Directions =

-(InSanity): Bayesian+Dictionary analysis
-
-It may be worthwhile to develop a Bayesian checker which checks for the proportion of dictionary words vs. non-dictionary words. This may quickly assist in identifying messages that utilize Bayesian avoidance techniques with punctuation/spacing interspersed through commonly identified words. Possible adjunct to the current methods.
-
-(Added by Guest): Bayesian+Grammar analysis
-
-One standard Bayesian avoidance technique is to throw in a large number of randomly-selected unusual words (e.g., chelate, diorite, swathe, crowberry) after the main message. A significant boost could be achieved by checking things like "What ratio of the message is punctuated like a sentence?" and "What ratio of the punctuation-defined sentences are grammatical sentences (i.e., have a subject and a verb)?"
-
-(JustinMason): answers to those two
-
-The danger here is that it's pretty trivial for spammers to use chunks from freely-available text on the internet, which do form correct, grammatical sentences.
-
-
-(Added by Guest): Bayesian+Spell checking
-
-More and more spam includes misspelled spam words such as V.|agarra. Perhaps Bayesian analysis would benefit from running the mail to be categorized through a filter that detects or translates leetspeak and simple misspellings, and/or from using the leetspeak/misspelling count as a categorizer. After all, spammers may include Shakespeare to evade Bayesian filters, but they must still include misspellings of the wares they are selling.
-
-(Added by Guest): Bayesian performance of spamassassin cf. mozilla mail
-
-It struck me that the performance of spamassassin, especially when it has only been fed small amounts (though more than 200) of hams and spams, is under-par compared to the performance of mozilla mail, even though I'm retraining SA about every day. Any ideas what might be the culprit?
+It is not appropriate to discuss those here. Please use the Spamassassin-users mailing list as a forum for those discussions.
[SpamAssassin Wiki] Updated: BayesInSpamAssassin
Date: 2004-06-20T16:57:44
Editor: DanKohn <dan@dankohn.com>
Wiki: SpamAssassin Wiki
Page: BayesInSpamAssassin
URL: http://wiki.apache.org/spamassassin/BayesInSpamAssassin

no comment

Change Log:

------------------------------------------------------------------------------
@@ -31,6 +31,11 @@

Again, it's important to do both.

+= How to train Bayes without logging on =
+(DanKohn)
+
+If you don't read your mail on the account where SpamAssassin is running, it can be challenging to do mistake-based training, where you learn false negatives (i.e., spam that was not caught) as spam. One approach is to redirect your false negatives and use procmail to train on them, as described in ProcmailToForwardMail.
+
= Training *and* reporting =
(KurtYoder)
[SpamAssassin Wiki] Updated: BayesInSpamAssassin
Date: 2004-06-21T11:00:37
Editor: MalteStretz <mss@apache.org>

Don't forget to put an empty line after headers, or else weird things happen

Change Log:

------------------------------------------------------------------------------
@@ -1,18 +1,23 @@
-== Bayes Introduction ==
-The Bayesian classifier in Spamassassin tries to identify spam by looking at what are called ''tokens''; words or short character sequences that are commonly found in spam or ham. If I've handed 100 messages to sa-learn that have the phrase ''penis enlargement'' and told it that those are all spam, when the 101st message comes in with the words ''penis'' and ''enlargement'', the Bayesian classifier will be pretty sure that the new message is spam and will increase the spam score of that message.
+= Bayes Introduction =
+
+The Bayesian classifier in SpamAssassin tries to identify spam by looking at what are called ''tokens'': words or short character sequences that are commonly found in spam or ham.
+If I've handed 100 messages to sa-learn that contain the phrase ''penis enlargement'' and told it that those are all spam, then when the 101st message comes in with the words ''penis'' and ''enlargement'', the Bayesian classifier will be pretty sure that the new message is spam and will increase the spam score of that message.

If you're having trouble with Bayes, see BayesFaq for help.

+
= Things to remember =
+
* To ''train'' Spamassassin, you get a mailbox full of messages that you know are spam and use the sa-learn program to pull out the tokens and remember them for later:

'''sa-learn --showdots --mbox --spam ''' ''spam-file''

Then you get a mailbox full of messages you're sure are ham and teach Bayes about those:

- '''sa-learn --showdots --mbox --ham ''' ''ham-file''
+ '''sa-learn --showdots --mbox --ham ''' ''ham-file''

It is important to do both.
+
 * The Bayesian classifier can only score new messages once it has learned at least 200 known spams and 200 known hams.
* If Spamassassin fails to identify a spam, teach it so it can do better next time. Run it through the sa-learn program and it will be more likely to correctly identify it as spam next time. Likewise, if SA puts a ham in your spam folder, run that message through '''sa-learn --ham ''' ''ham-folder''.
* It's OK to feed emails with Spamassassin markup into the sa-learn command -- sa-learn will ignore any standard Spamassassin headers, and if the original email has been encapsulated into an attachment it will decapsulate the email. In other words sa-learn will undo any changes which Spamassassin has done before learning the spam/ham character of the email.
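
To make the token idea and the 200/200 training threshold from the list above more concrete, here is a deliberately simplified Python sketch. It is not SpamAssassin's actual algorithm: the class name `ToyBayes`, the whitespace tokenizer, and the averaging of per-token scores are all assumptions chosen for brevity. Only the threshold value of 200 comes from the text.

```python
from collections import Counter

MIN_TRAINED = 200  # from the text: 200 known spams and 200 known hams


class ToyBayes:
    """Toy token-frequency classifier; an illustration, not SpamAssassin."""

    def __init__(self, min_trained=MIN_TRAINED):
        self.min_trained = min_trained
        self.spam_tokens = Counter()  # token -> number of spams containing it
        self.ham_tokens = Counter()   # token -> number of hams containing it
        self.nspam = 0
        self.nham = 0

    def learn(self, text, is_spam):
        """Record each distinct token of one message as spam or ham."""
        tokens = set(text.lower().split())
        if is_spam:
            self.spam_tokens.update(tokens)
            self.nspam += 1
        else:
            self.ham_tokens.update(tokens)
            self.nham += 1

    def token_spamminess(self, token):
        """Relative spam frequency of a token; 0.5 if it was never seen."""
        s = self.spam_tokens[token] / max(self.nspam, 1)
        h = self.ham_tokens[token] / max(self.nham, 1)
        if s + h == 0:
            return 0.5
        return s / (s + h)

    def score(self, text):
        """Return the mean token spamminess, or None while undertrained."""
        if self.nspam < self.min_trained or self.nham < self.min_trained:
            return None
        tokens = set(text.lower().split())
        if not tokens:
            return 0.5
        return sum(self.token_spamminess(t) for t in tokens) / len(tokens)
```

The sketch also shows why training on both spam and ham matters: a token's score depends on how often it appears in each class, so without ham there is nothing to compare against.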
@@ -32,19 +37,35 @@
Again, it's important to do both.

= How to train Bayes without logging on =
-(DanKohn)

-If you don't read your mail on the account where SpamAssassin is running, it can be challenging to do mistake-based training, where you learn false negatives (i.e., spam that was not caught) as spam. One approach is to redirect your false negatives and use procmail to train on them, as described in ProcmailToForwardMail.
+If you don't read your mail on the account where SpamAssassin is running, it can be challenging to do mistake-based training, where you learn false negatives (i.e., spam that was not caught) as spam.
+One approach is to redirect your false negatives and use procmail to train on them, as described in ProcmailToForwardMail.

-= Training *and* reporting =
-(KurtYoder)
+(DanKohn)

-If you only train your own bayes database using ''sa-learn'', you will not be reporting the spam message you received to spam checksum services such as dcc, pyzor, or razor. To report the spam to the checksum services, you will need to use ''spamassassin -r < the_spam_message_file''. You may also need to register as a spam reporter for services such as razor. If you are not sure your reports are being accepted, run ''spamassassin -rD < the_spam_message_file'' and look for any debugging output telling you that you need to register.
+= Training plus reporting =

-You can only invoke spamassassin using ''spamassassin -r'' on single files. This is fine for "mbox" spam mailboxes which are all contained in one file. However, for "maildir" directories, you will need to run ''spamassassin -r'' on each message individually. If you are not sure which format you have, look at your mail directory. If you see one or more files and each file contains one or more messages, you have "mbox" format. If you see directories containing files, each file name is a long string of numbers, letters, and punctuation, and each file contains one email message, you have "maildir" format.
+If you only train your own Bayes database using ''sa-learn'', you will not be reporting the spam message you received to spam checksum services such as DCC, Pyzor, or Razor.
+To report the spam to the checksum services, you will need to use ''spamassassin -r < the_spam_message_file''.
+You may also need to register as a spam reporter for services such as Razor.
+If you are not sure your reports are being accepted, run ''spamassassin -rD < the_spam_message_file'' and look for any debugging output telling you that you need to register.
+
+You can only invoke spamassassin using ''spamassassin -r'' on single files.
+This is fine for "mbox" spam mailboxes which are all contained in one file.
+However, for "maildir" directories, you will need to run ''spamassassin -r'' on each message individually.
+If you are not sure which format you have, look at your mail directory.
+If you see one or more files and each file contains one or more messages, you have "mbox" format.
+If you see directories containing files, each file name is a long string of numbers, letters, and punctuation, and each file contains one email message, you have "maildir" format.
+
+If you have "maildir" mailboxes, running ''spamassassin -r'' multiple times can be tedious for large numbers of spam messages.
+You can use this ["report_spam.pl"] script to run it for you.
+The script is written in Perl.
+You can save the script to the machine running SpamAssassin and then run it using ''report_spam.pl your_spam_directory''.
+Each message in your_spam_directory will then be learned by Bayes '''and''' reported to the checksum services.

-If you have "maildir" mailboxes, running ''spamassassin -r'' multiple times can be tedious for large numbers of spam. So you can use this ["report_spam.pl"] script to run it for you. The script is written in perl. You can save the script to your spamassassin computer and then run it using ''report_spam.pl your_spam_directory''. Each message in your_spam_directory will then be learned in bayes '''and''' reported to the checksum services.
+(KurtYoder)
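
The per-message loop described above for "maildir" directories might look roughly like this in Python. This is not the report_spam.pl script, whose contents are not shown here. It is a minimal sketch that assumes every regular file below the directory is one message, and it runs only the ''spamassassin -r'' command quoted in the text. The function names and the injectable `run` parameter are hypothetical.

```python
import subprocess
from pathlib import Path


def message_files(spam_dir):
    """Yield every regular file below spam_dir (assumed: one message per file)."""
    for path in sorted(Path(spam_dir).rglob("*")):
        if path.is_file():
            yield path


def report_all(spam_dir, run=subprocess.run):
    """Feed each message to 'spamassassin -r' on stdin; return how many were sent."""
    count = 0
    for path in message_files(spam_dir):
        with path.open("rb") as msg:
            # Equivalent to: spamassassin -r < the_spam_message_file
            run(["spamassassin", "-r"], stdin=msg, check=True)
        count += 1
    return count
```

Calling `report_all("your_spam_directory")` would invoke the command once per message and raise an error if any invocation fails. The `run` parameter exists only so the loop can be exercised without SpamAssassin installed.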

= Questions, Comments, Future Directions =

-It is not appropriate to discuss those here. Please use the Spamassassin-users mailing list as a forum for those discussions.
+It is not appropriate to discuss those here.
+Please use the Spamassassin-users mailing list as a forum for those discussions.