[Spamassassin Wiki] Trivial Update of "CorpusCleaning" by JustinMason
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Spamassassin Wiki" for change notification.

The following page has been changed by JustinMason:
http://wiki.apache.org/spamassassin/CorpusCleaning

------------------------------------------------------------------------------

Here are a few methods for dealing with common forms of corpus pollution -- messages in a mail corpus that aren't suitable for use in a MassCheck.

- == False Positives and False Negatives ==
+ == Cleaning Out False Positives ==

To clean a spam corpus of FalsePositives, first do a mass-check. You will wind up with a 'spam.log' and a 'ham.log' file. Run these commands to get a list of the 200 lowest-scoring spams, create an mbox file containing just those messages, and open that mbox in the "mutt" mail client:

@@ -28, +28 @@

}}}

You can also remove the offending files, or messages from the source mailboxes, directly. However, how you do this depends on the format you use to store messages (Maildirs, mboxes, etc.). Maildirs are easiest, since you can just delete the files named in the 'id.fps' file.
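
As a rough illustration of the selection step described above (picking the lowest-scoring spams out of 'spam.log'), here is a minimal Python sketch. It is not the command sequence from the page. It assumes each log line has the layout `<Y or .> <score> <message path> <rules hit>` and that lines starting with `#` are comments; that layout, the function name `extreme_scoring`, and the `highest` parameter are assumptions made for this example.

```python
from typing import Iterable, List, Tuple


def extreme_scoring(log_lines: Iterable[str], n: int = 200,
                    highest: bool = False) -> List[Tuple[float, str]]:
    """Return up to n (score, message_path) pairs from mass-check log lines.

    Assumed line layout (hypothetical, not confirmed by the page):
        <Y or .> <score> <message path> <rules hit> ...
    Lines beginning with '#' are treated as comments and skipped.
    """
    entries = []
    for line in log_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(None, 3)
        if len(fields) < 3:
            continue  # too few fields to hold a score and a path
        try:
            score = float(fields[1])
        except ValueError:
            continue  # score field is not numeric, skip the line
        entries.append((score, fields[2]))
    # Ascending order gives the lowest scores first; reverse for the highest.
    entries.sort(key=lambda entry: entry[0], reverse=highest)
    return entries[:n]


if __name__ == "__main__":
    # Made-up sample lines, for illustration only.
    sample = [
        "# example log header",
        "Y 12.5 /corpus/spam/a RULE_A,RULE_B",
        ". 1.2 /corpus/spam/b RULE_C",
    ]
    for score, path in extreme_scoring(sample, n=1):
        print(score, path)
```

For a Maildir-style corpus, the returned paths could be written one per line to a file such as 'id.fps' and reviewed before anything is deleted. Passing `highest=True` selects the highest-scoring entries instead, which may suit the reversed case for a ham log.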
+
+ == Cleaning Out False Negatives ==

Cleaning the ham corpus of FalseNegatives is similar, but reverses a few things. Here are the commands to do that: