Mailing List Archive: [Spamassassin Wiki] Update of "RescoreMassCheck" by DanielQuinlan

The following page has been changed by DanielQuinlan:
http://wiki.apache.org/spamassassin/RescoreMassCheck

The comment on the change is:
revise for 3.1.0 mass-check

------------------------------------------------------------------------------

We generate new scores by analyzing a massive collection of mail (a "corpus") and running software that searches for the set of scores under which the largest possible number of mails in that corpus is correctly classified (i.e. so that SA thinks the ham messages are nearly all ham and the spam messages are nearly all spam).

- As MattKettler explains:
+ = Summary =

- The corpus is a LOT (aprox 1million pieces) of real-world, hand sorted mail.
+ The corpus consists of many pieces (approximately 1 million) of real-world, hand-sorted mail.

- Basically a smallish number (less than 100) people, including the
+ Basically a smallish number of people (about 15), including the
- developers themselves, work as volunteer "corpus submitters". They hand
+ developers themselves, work as volunteer "corpus submitters". They hand
- sort their mail and run mass-check over it. They submit the output logs
+ classify their mail and then run mass-check over it. They submit the output logs
- mass-check generates. Occasionally people review the submitted logs for
+ mass-check generates. Occasionally people review the submitted logs for
- obvious mistakes, but it is largely a trust system.
+ obvious mistakes, but it is largely a trust system.

- If you want to see the statistics from the last corpus run, check the
+ If you want to see the statistics from the last corpus run, check the
- STATISTICS.txt files that come in the SA tarball. It will tell you how many
+ STATISTICS.txt files that come in the SA tarball. They will tell you how many
- emails were used, and what the hit rates of all the rules were.
+ emails were used, and what the hit rates of all the rules were.
-

= Procedure =

- Here's the process for generating the score-set.
+ Here's the process for generating the scores as of SpamAssassin 3.1.0:
-
- TODO: this is no longer accurate -- iirc we can do all mass-checks in one sitting. Daniel, can you update this?

== 1. heads-up ==

Inform everyone in advance on the -users and -dev lists that we will be starting mass-checks shortly, and they should get their corpora nice and clean.

- == 2. announce mass-check run 1 (score sets 0 and 1) ==
+ == 2. announce mass-check ==

See MassCheck. The mass-check for all score sets can be done in a single run, e.g.

{{{
cd masses
+ mkdir -p spamassassin
+
- echo "use_bayes 0" > spamassassin/user_prefs
+ cat > spamassassin/user_prefs
- mass-check --net [targets etc]
+ bayes_auto_learn 0
+ lock_method flock
+ bayes_store_module Mail::SpamAssassin::BayesStore::SDBM
+ use_auto_whitelist 0
+ hit [Control-D]
+
+ mass-check -j 4 --bayes --net --restart=400 --learn=35 --reuse [all targets]
+
+ (note if a --after flag is part of the announcement, please add it as well)
}}}
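
The interactive `cat` step above (type the settings, then press Control-D) can also be written as a here-document, which is easier to paste into an announcement or a script. This is only a sketch of the same four settings; it assumes it is run from the masses directory, and the mass-check line is left as a comment because the targets depend on each submitter's corpus.

```shell
# Run from the masses directory of a SpamAssassin checkout (assumption).
mkdir -p spamassassin

# Quoted delimiter: the body is written literally, with no shell expansion.
cat > spamassassin/user_prefs <<'EOF'
bayes_auto_learn 0
lock_method flock
bayes_store_module Mail::SpamAssassin::BayesStore::SDBM
use_auto_whitelist 0
EOF

# Then start the run as given in the announcement, e.g.:
# mass-check -j 4 --bayes --net --restart=400 --learn=35 --reuse [all targets]
```

The result is the same file the interactive version produces, without relying on the reader to end the input by hand.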

- Here's the full announcement text for this phase: RescoreSet01Details
+ Here's the full announcement text for this phase: RescoreDetails

- We then take the log files rsync'd up to the server, and use those logs for both set 0 and set 1; set 0 can be generated from set 1 by stripping out the network tests.
+ We then take the log files rsync'd up to the server and use those logs for all 4 score sets. The initial logs are for score set 3 (the fourth); sets 0, 1, and 2 can be generated from set 3 by stripping out the network tests and/or the Bayes tests.
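
Because the uploaded logs carry both network and Bayes hits, the lower score sets can be derived by filtering rule names out of each log line. The following is a rough sketch of that idea, not the tooling the project actually uses. It assumes a log line layout of `<Y-or-.> <score> <path> <RULE1,RULE2,...> <extras>`, and `drop_rules` plus the rule-list files named in the comments are hypothetical.

```shell
# Sketch only. Assumed log line layout (not taken from the page):
#   <Y-or-.> <score> <message-path> <RULE1,RULE2,...> <key=value,...>
# drop_rules reads a log on stdin and removes every rule named in the
# file given as $1 (one rule name per line) from the 4th field.
# The score in field 2 is NOT recomputed, and runs of whitespace
# between fields collapse to single spaces.
drop_rules() {
    awk -v dropfile="$1" '
        BEGIN {
            while ((getline name < dropfile) > 0)
                drop[name] = 1
        }
        /^#/ || NF < 4 { print; next }   # pass comments and short lines through
        {
            n = split($4, rules, ",")
            kept = ""
            for (i = 1; i <= n; i++)
                if (!(rules[i] in drop))
                    kept = (kept == "" ? rules[i] : kept "," rules[i])
            # "-" is a made-up placeholder that keeps the field count stable
            $4 = (kept == "" ? "-" : kept)
            print
        }
    '
}

# Hypothetical usage, with net-rules.txt and bayes-rules.txt as
# hand-made lists of rule names:
#   drop_rules net-rules.txt   < spam-set3.log > spam-set2.log  # Bayes, no net
#   drop_rules bayes-rules.txt < spam-set3.log > spam-set1.log  # net, no Bayes
#   drop_rules net-rules.txt   < spam-set1.log > spam-set0.log  # neither
```

Keeping the rule lists in plain files means the same function serves all three derived sets; only the list passed in changes.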

- == 3. allow several days to complete ==
+ == 3. allow several days to complete (it takes a really long time!) ==

- Provide enough time, including a weekend if possible, giving people enough time to get around to running it given that they may be busy with day-job stuff ;)
+ Provide enough time, including a weekend if possible, so that people can get around to running it even if they are busy with day-job stuff. ;)

- == 4. generate scores for score sets 0 and 1 ==
+ == 4. generate scores for all score sets ==

See RunningPerceptron.

Once this is complete, update rules/50_scores.cf with the generated scores.

- == 5. announce mass-check run 2 (set 2) and run 3 (set 3) ==
-
- See MassCheck. Because set 2 and set 3 both require scores from set 0 and set 1, and both depend on auto-learning, they have to be run separately. TODO: this is no longer the case!
-
- Scoreset 2:
-
- {{{
- cd masses
- echo "use_bayes 1" > spamassassin/user_prefs
- mass-check [targets etc]
- }}}
-
- Scoreset 3:
-
- {{{
- cd masses
- echo "use_bayes 1" > spamassassin/user_prefs
- mass-check --net [targets etc]
- }}}
-
- Here's the full announcement text for this phase: RescoreSet23Details
-
- == 6. wait for everyone to complete them, as per step #3. ==
-
- Waiting...
-
- == 7. generate scores for score sets 2 and 3 ==
-
- See RunningPerceptron.
-