Mailing List Archive: [Spamassassin Wiki] Update of "RunningPerceptron" by HenryStern

The following page has been changed by HenryStern:
http://wiki.apache.org/spamassassin/RunningPerceptron

------------------------------------------------------------------------------
= Running the Perceptron to generate scores =

- If all goes well, the ["Perceptron"] will take over from the GeneticAlgorithm (GA) as the main way we generate scores. (This text was copied from RescoreTenFcv and needs editing.)
+ Generating scores is a two-step process: model validation and score generation. To prepare your environment for running the perceptron, set CORPUS and then execute the following:

- Change these lines:
{{{
- make clean >> make.output
- make >> make.output 2>&1
- ./evolve
- pwd; date
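+ # Paths are relative to the directory that contains masses/; CORPUS must already be set.
+ mkdir -p masses/ORIG
+ # Unquoted list, so the loop runs once for ham and once for spam.
+ for CLASS in ham spam; do
+   cat "$CORPUS"/submit/$CLASS*.log > masses/ORIG/$CLASS.log
+   for I in $(seq 0 3); do
+     # A relative symlink target is resolved from the link's own directory.
+     ln -s $CLASS.log masses/ORIG/$CLASS-set$I.log
+   done
+ done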
}}}
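Because a relative symlink target is resolved from the directory that holds the link, it is easy to end up with dangling links after the preparation step. The following is a minimal sketch of a sanity check, not part of the SpamAssassin scripts; the function name `check_links` is hypothetical, and it assumes the file layout used in the commands above.

```shell
# Report any per-scoreset log link that is missing or does not resolve.
check_links() {
  dir=$1
  status=0
  for CLASS in ham spam; do
    for I in 0 1 2 3; do
      link="$dir/$CLASS-set$I.log"
      # -e follows symlinks, so a dangling link fails this test.
      if [ ! -e "$link" ]; then
        echo "missing or dangling: $link" >&2
        status=1
      fi
    done
  done
  return "$status"
}
```

It could be called as `check_links masses/ORIG && echo "all links resolve"` before starting a validation run.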

- to
+ == Model validation ==

+ Before generating the final set of scores, you need to pick a configuration for the training program. To do this, run a series of ten-fold cross-validations and use Student's t-test to compare their results. Should you be so inclined, you can also use ANOVA to compare result sets. This is left as an exercise for the reader.
- {{{
- make clean >> make.output
- make -C perceptron_c clean >> make.output
- make tmp/tests.h >> make.output 2>&1
- rm -rf perceptron_c/tmp; cp -r tmp perceptron_c/tmp
- make -C perceptron_c >> make.output
- ( cd perceptron_c ; ./perceptron -p 0.75 -e 100 )
- pwd; date
- }}}

- Change
+ In the masses directory, you will find a file called "config". As its name suggests, this is the configuration for the validate-model and runGA scripts. It consists of five fields: SCORESET, HAM_PREFERENCE, THRESHOLD, EPOCHS and NOTE. SCORESET is an integer between 0 and 3. Set 0 is for the ruleset with Bayes and network tests disabled. Set 1 is for the ruleset with network tests enabled. Set 2 is for the ruleset with Bayes enabled. Set 3 is for the ruleset with both network tests and Bayes enabled. HAM_PREFERENCE, THRESHOLD and EPOCHS correspond to options passed to perceptron; see its documentation for details. NOTE is appended to the name of the directory containing the result sets.

+ To refine your parameters, alternate between editing the config file and running validate-model. Each result set is stored in a directory of the form "vm-$NAME-$HAM_PREFERENCE-$THRESHOLD-$EPOCHS-$NOTE", which contains a file called "validate" holding the aggregated results from the cross-validation. To compare the results of two runs using Student's t-test, use the "compare-models" script, like so: ./compare-models vm-set0-2.0-5.0-100-before/validate vm-set0-2.0-5.0-100-after/validate
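To make the directory naming concrete, here is a small sketch that rebuilds the result path from the five config fields. The values are illustrative, and deriving NAME as "set" plus SCORESET is an assumption inferred from the example directory names above; the actual scripts may compute it differently.

```shell
# Values as they might appear in masses/config (illustrative only).
SCORESET=0
HAM_PREFERENCE=2.0
THRESHOLD=5.0
EPOCHS=100
NOTE=before

# Assumption: NAME is "set" followed by SCORESET, inferred from the
# example directory names; the scripts may derive it differently.
NAME="set$SCORESET"

# Directory that would hold this run's cross-validation results.
RESULT_DIR="vm-$NAME-$HAM_PREFERENCE-$THRESHOLD-$EPOCHS-$NOTE"
echo "$RESULT_DIR/validate"
```

Changing only NOTE between runs (for example "before" and "after") keeps the two result sets side by side, ready to be passed to compare-models.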
- {{{
- cp craig-evolve.scores [output]
- }}}

- to
+ To speed things up, validate-model caches most of the compiled files. If you change your logs or any of the scripts that are used as part of compilation, you will need to clear the cache with rm -rf vm-cache.

+ == Score generation ==
- {{{
- perl -pe 's/^(score\s+\S+\s+)0\s+/$1/gs;' \
- < perceptron_c/perceptron.scores \
- > [output]
- }}}

- (required to work around an extra digit output by the perceptron app).
+ When you are happy with your configuration, set it in your config file and execute the runGA script. You will find your results in a directory of the form "gen-$NAME-$HAM_PREFERENCE-$THRESHOLD-$EPOCHS-$NOTE". runGA uses a randomly selected corpus, with 90% used for training and 10% used for testing.
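For readers unfamiliar with this kind of split, the sketch below shows one way to divide a log file at random into 90% training and 10% test lines. It is not runGA's implementation, only an illustration of the idea; the function name `split_90_10` is hypothetical, and it assumes GNU `shuf` is available.

```shell
# Randomly split the lines of a file: about 10% to test, the rest to train.
split_90_10() {
  input=$1
  train_out=$2
  test_out=$3
  total=$(wc -l < "$input")
  ntest=$((total / 10))
  # Shuffle once, then cut the shuffled file into two disjoint parts.
  shuf "$input" > "$input.shuffled"
  head -n "$ntest" "$input.shuffled" > "$test_out"
  tail -n "+$((ntest + 1))" "$input.shuffled" > "$train_out"
  rm -f "$input.shuffled"
}
```

A call such as `split_90_10 ham.log ham-train.log ham-test.log` would leave the input file untouched and write the two parts next to it.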