[Spamassassin Wiki] Update of "HandClassifiedCorpora" by JustinMason
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Spamassassin Wiki" for change notification.

The following page has been changed by JustinMason:
http://wiki.apache.org/spamassassin/HandClassifiedCorpora

The comment on the change is:
add note about bayes issue noticed during 3.1.0 rescoring

------------------------------------------------------------------------------
Like a Bayesian learning system, SpamAssassin's GeneticAlgorithm requires a corpus of hand-classified mail. Our guidelines are (quoting and expanding on "masses/CORPUS_POLICY"):

- * hand-verified as "spam" and "ham" (non-spam) piles -- *not* just classified using existing spam-classification algorithms (such as SpamAssassin itself). Note that it's fine to use SpamAssassin to pre-filter them into the right piles, just make sure you scan the results "by hand" afterwards to verify that SpamAssassin made the correct diagnosis in each case.
+ * hand-verified as "spam" and "ham" (non-spam) piles -- ''not'' just classified using existing spam-classification algorithms, such as SpamAssassin itself. Note that it's fine to use SpamAssassin to pre-filter them into the right piles; just make sure you scan the results "by hand" afterwards to verify that SpamAssassin made the correct diagnosis in each case.

* containing a representative mix of ham mail -- that includes commercial-sounding-but-not-spam messages, legitimate business discussions (which may include talk of "sales", "marketing", "offers", bankruptcies, mortgages, etc.), and verified opt-in mail newsletters. This is a ''very'' important point! Your ham corpus should contain as much ham as possible, as close as possible to ''all'' valid email received by everybody, with only the exceptions noted here. ("As possible" recognizes that, for privacy and confidentiality reasons, some ham cannot be stored anywhere but its destination email folder.)

@@ -12, +12 @@


* cleaned of viruses, bounce mails from broken virus and spam filters, and forwarded spam messages. These will skew the results.

- * and finally, cleaned of discussion of spam or virus messages or signatures (such as SpamAssassin-talk or bugtraq mailing list messages). Even though they are ham, these often contain snippets of code that incorrectly trigger tests, and again will skew the results. (Rewriting the tests to avoid triggering on SpamAssassin-talk messages is not realistic!)
+ * and finally, cleaned of discussion of spam or virus messages or signatures (such as SpamAssassin-talk or bugtraq mailing list messages). Even though they are ham, these often contain snippets of code that incorrectly trigger tests, and again will skew the results. (Rewriting the tests to avoid triggering on SpamAssassin-talk messages is not realistic, unfortunately.)

+ * ''(if you're mass-checking for a RescoreMassCheck:)'' the corpus must contain ''both'' ham and spam. If it contains only one, the accuracy figures reported for the Bayes rules will be invalid.
+
- Once you run "mass-check" on a corpus (MassCheck), see the instructions in "CORPUS_SUBMIT" for details of how to verify that the top scorers are not accidental spam that got through.
+ Once you run MassCheck, see the instructions in CorpusCleaning for details of how to verify that the top scorers are not accidental spam that got through.
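
The RescoreMassCheck requirement above (both ham and spam present) is easy to check before starting a long run. The following is only a minimal sketch, not part of the SpamAssassin tools: it assumes the corpus is stored as one message per file under two directories, and the function names and the directory paths in the usage note are hypothetical.

```python
from pathlib import Path


def count_messages(folder: Path) -> int:
    """Count regular files under folder (assumption: one message per file)."""
    return sum(1 for p in folder.rglob("*") if p.is_file())


def missing_classes(ham_dir, spam_dir) -> list:
    """Return the labels ("ham", "spam") whose directory is absent or empty.

    An empty list means the corpus holds both classes, as a rescoring
    mass-check needs.
    """
    missing = []
    for label, folder in (("ham", Path(ham_dir)), ("spam", Path(spam_dir))):
        if not folder.is_dir() or count_messages(folder) == 0:
            missing.append(label)
    return missing
```

For example, `missing_classes("corpus/ham", "corpus/spam")` returning `["spam"]` would mean there is no spam to mass-check, so the corpus is not suitable for a rescoring run as it stands.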

(Aside: yes, it's "corpora". See PluralOfCorpus.)

== Minor things that are nice to have ==

- * eliminate duplicates -- there should be one and only one copy of any single email, whether spam or ham. (JustinMason: in my opinion, this isn't a hard and fast rule, as it can be very time-consuming. I'd suggest just removing dups where they all arrive at the same time, in sequence, if possible, but don't really worry about it too much.)
+ * eliminate duplicates -- there should be one and only one copy of any single email, whether spam or ham. (This isn't a hard and fast rule, as it can be very time-consuming. Just remove dups where they all arrive at the same time, in sequence, if possible, but don't really worry about it too much.)
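
The cheap form of duplicate removal described above (dropping copies that arrive one after another) could be sketched roughly as follows. This is an illustration under stated assumptions, not an existing SpamAssassin script: messages are given as raw bytes in arrival order, and two messages count as duplicates when their Subject header and body match, so that differing delivery headers such as Received are ignored. The names `fingerprint` and `drop_sequential_duplicates` are hypothetical.

```python
import hashlib
from email import message_from_bytes


def fingerprint(raw: bytes) -> str:
    """Hash the Subject header plus the body, ignoring all other headers."""
    msg = message_from_bytes(raw)
    subject = str(msg.get("Subject", ""))
    # Everything after the first blank line is treated as the body.
    normalized = raw.replace(b"\r\n", b"\n")
    _, _, body = normalized.partition(b"\n\n")
    digest = hashlib.sha256()
    digest.update(subject.encode("utf-8", "replace"))
    digest.update(b"\0")
    digest.update(body)
    return digest.hexdigest()


def drop_sequential_duplicates(messages):
    """Keep the first message of every run of consecutive duplicates."""
    kept = []
    previous = None
    for raw in messages:
        current = fingerprint(raw)
        if current != previous:
            kept.append(raw)
        previous = current
    return kept
```

Only adjacent copies are compared, which matches the "don't really worry about it too much" advice: a duplicate that shows up again later in the corpus is left alone.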