svn commit: rev 6321 - incubator/spamassassin/trunk/masses
Author: felicity
Date: Tue Jan 27 11:19:10 2004
New Revision: 6321

Added:
incubator/spamassassin/trunk/masses/README.perceptron
incubator/spamassassin/trunk/masses/perceptron.c
Log:
bug 2910: looks like these weren't "svn add"ed ... oops!


Added: incubator/spamassassin/trunk/masses/README.perceptron
==============================================================================
--- (empty file)
+++ incubator/spamassassin/trunk/masses/README.perceptron Tue Jan 27 11:19:10 2004
@@ -0,0 +1,154 @@
+Copyright (c)2003 Henry Stern
+
+Fast SpamAssassin Score Learning Tool
+
+Henry Stern
+Faculty of Computer Science
+Dalhousie University
+6050 University Avenue
+Halifax, NS Canada
+B3H 1W5
+henry@stern.ca
+
+January 8, 2004
+
+1. WHAT IS IT?
+
+This program is used to compute scores for SpamAssassin rules. It makes
+use of data files generated by the suite of scripts in
+spamassassin/masses. The program outputs the generated scores in a file
+titled 'perceptron.scores'.
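+
+A typical run, roughly (the exact steps depend on your local setup), is:
+generate the data files under tmp/ with logs-to-c in spamassassin/masses,
+build this program, then run ./perceptron from the masses directory; the
+learned scores appear in perceptron.scores.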
+
+The advantage of this program over the genetic algorithm (GA)
+implementation in spamassassin/masses/craig_evolve.c is that while the GA
+requires several hours to run on high-end machines, the perceptron
+requires only about 15 seconds of CPU time on an Athlon XP 1700+ system.
+
+This makes incremental updates and score personalization practical for the
+end-user and gives developers a better idea of just how useful a new rule is.
+
+2. OPTIONS
+
+There are four options that can be passed to the perceptron program.
+
+ -p ham_preference
+
+ This increases the number of non-spam messages in the training
+ set. It does this by adding 1 + (number of tests hit) *
+ ham_preference instances of each non-spam message to the training
+ set. This is intended to reduce false positives by encouraging the
+ training program to look at the harder-to-classify ham messages
+ more often. By default, this parameter is 2.0.
+
+ -e num_epochs
+
+ This parameter sets how many passes the perceptron will make
+ through the training set before terminating. On each pass, the
+ training set is shuffled and then iterated through. By default,
+ it will make 15 passes.
+
+ -l learning_rate
+
+ This parameter modifies the learning rate of the perceptron. The
+ error gradient is computed for each instance. This program uses a
+ logsig activation function y = (1/(1+exp(-x))), so the error
+ gradient is computed as E(x) = y*(1-y)*(is_spam - y). For
+ each instance and score hit in the training set, the scores are
+ modified by adding E(x) / (number of tests hit + 1) *
+ learning_rate. The default value for this is 2.0, but it can be
+ whatever you want.
+
+ -w weight_decay
+
+ To keep the scores from growing too large (or to actively shrink
+ them, if you want), the scores and network bias are multiplied by
+ the weight decay before each epoch. This is off by default (1.0),
+ but it can be useful.
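+
+For example, an invocation that keeps the defaults but applies a mild
+weight decay might look like this (the value 0.95 is purely illustrative):
+
+    ./perceptron -p 2.0 -e 15 -l 2.0 -w 0.95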
+
+3. HOW DOES IT WORK?
+
+This program implements the "Stochastic Gradient Descent" method of
+training a neural network. It uses a single perceptron with a logsig
+activation function and maps the weights to SpamAssassin score space.
+
+The perceptron is the simplest form of neural network. It consists of a
+transfer function and an activation function. Together, they simulate the
+average firing rate of a biological neuron over time.
+
+The transfer function is the sum of the product of weights and inputs.
+It simulates the membrane potential of a neuron. When the accumulated
+charge on the membrane exceeds a certain threshold, the neuron fires,
+sending an electrical impulse down the axon. This implementation uses a
+linear transfer function:
+
+[1] f(x) = bias + sum_{i=1}^{n} w_i * x_i
+
+This is quite similar to how the SpamAssassin score system works. If you
+set the bias to be 0, w_i to be the score for rule i and x_i to be whether
+or not rule i is activated by a given message, the transfer function will
+return the score of the message.
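+
+For example, if a message hits only two rules whose scores are 1.1 and
+2.3 (numbers chosen purely for illustration) and the bias is 0, the
+transfer function returns 1.1 + 2.3 = 3.4, which is exactly the
+SpamAssassin score of the message.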
+
+The activation function simulates the electrical spike travelling down the
+axon. It takes the output of the transfer function as input and applies
+some sort of transformation to it. This implementation uses a logsig
+activation function:
+
+[2] y(x) = 1 / (1 + exp(-f(x)))
+
+This non-linear function constrains the output of the transfer function
+to between 0 and 1. When plotted, it looks somewhat S-shaped with
+horizontal asymptotes at y=0 as x approaches -infinity and y=1 as x
+approaches infinity. The slope of the function is greatest at x=0 and
+tapers off as it approaches the asymptotes.
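+
+As a minimal sketch of [1] and [2] (perceptron.c implements these in
+evaluate_test_nogain() and evaluate_test(); the helpers and argument
+names below are just for illustration):
+
+    #include <math.h>
+
+    /* [1] linear transfer function: bias plus the weights of the rules hit */
+    double transfer (double bias, const double *w, const int *hits, int nhits) {
+        double sum = bias;
+        int i;
+        for (i = 0; i < nhits; i++)
+            sum += w[hits[i]];
+        return sum;
+    }
+
+    /* [2] logsig activation function */
+    double logsig (double x) {
+        return 1.0 / (1.0 + exp(-x));
+    }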
+
+Lastly, the performance of the perceptron needs to be measured using an
+error function. Two error functions are commonly used: mean squared
+error and entropic error. By default, this implementation uses mean
+squared error but entropic error may be substituted by adding a compiler
+directive.
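+
+The relevant directives in perceptron.c are LEAST_SQUARES_ERROR (the
+default) and ENTROPIC_ERROR; for example, adding -DENTROPIC_ERROR to the
+compiler flags (however you happen to build perceptron.c) selects the
+entropic error path.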
+
+The most common method of training neural networks is called gradient
+descent. It involves iteratively tuning the parameters of the network so
+that the mean error rate always decreases. This is done by finding the
+direction of steepest descent down the "error gradient," reducing the
+value of the error function for the next iteration of the algorithm.
+
+If the transfer function and activation function are both differentiable,
+the error gradient of a neural network can be calculated with respect to
+the weights and bias. Without getting into calculus, the error gradient
+for a perceptron with a linear transfer function, logsig activation
+function and mean squared error function is:
+
+[3] E(x) = y(x) * (1-y(x)) * (y_expected - y(x))
+
+The weights are updated using the function:
+
+[4] w_i = w_i + E(x) * x_i * learning_rate
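+
+A minimal sketch of one stochastic update step combining [3] and [4]
+(the real code lives in train() in perceptron.c, which additionally
+divides the step by the number of rules hit plus one; the function and
+argument names here are illustrative only):
+
+    /* y is the logsig output for this instance, is_spam is the 0/1 target */
+    void update (double *w, double *bias, const int *hits, int nhits,
+                 double y, int is_spam, double learning_rate) {
+        int i;
+        double delta = y * (1.0 - y) * (is_spam - y) * learning_rate; /* [3] */
+        *bias += delta;
+        for (i = 0; i < nhits; i++)
+            w[hits[i]] += delta;                                      /* [4] */
+    }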
+
+Since the SpamAssassin rule hits are sparse, the basic gradient descent
+algorithm is impractical. This implementation uses a variation called
+"Stochastic gradient descent." Instead of doing one batch update per
+epoch, the training set is randomly walked through, doing incremental
+updates. In addition, in this implementation, the learning rate is
+modified by the number of rule hits for a given training instance.
+Together, these allow good weights to be computed for
+infrequently-occurring rules.
+
+Once the perceptron has finished running, the weights are converted to
+scores and exported using the familiar file format. Weights are converted
+to scores using this function:
+
+[5] score(weight) = -threshold * weight / bias
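+
+where the threshold is the standard 5.0. For example (numbers chosen
+purely for illustration), with a learned bias of -2.5, a weight of 0.5 is
+emitted as a score of -5 * 0.5 / -2.5 = 1.0.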
+
+4. ACKNOWLEDGEMENTS
+
+I would like to thank my PhD supervisor, Michael Shepherd, for not getting
+mad at me while I worked on this. I'd also like to thank Thomas
+Trappenberg for his invaluable assistance while I was tweaking the
+performance of the learning algorithm. I would like to thank Daniel
+Quinlan, Justin Mason and Theo Van Dinter for their valuable input and
+constructive criticism.
+
+--
+hs
+8/1/2004

Added: incubator/spamassassin/trunk/masses/perceptron.c
==============================================================================
--- (empty file)
+++ incubator/spamassassin/trunk/masses/perceptron.c Tue Jan 27 11:19:10 2004
@@ -0,0 +1,425 @@
+/* Copyright (c)2003 Henry Stern
+ *
+ * This program uses stochastic gradient descent to learn a scoreset for
+ * SpamAssassin. You'll need to run logs-to-c from spamassassin/masses to
+ * generate the stuff in tmp.
+ */
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <assert.h>
+#include <sys/time.h>
+#include <math.h>
+#include <unistd.h>
+
+#include "tmp/scores.h"
+#include "tmp/tests.h"
+
+/* Ensure that multiple error functions have not been chosen. */
+#ifdef ENTROPIC_ERROR
+#ifdef LEAST_SQUARES_ERROR
+#warning Both entropic error and least squares error chosen. Using least squares.
+#endif
+#endif
+
+/* Choose the least squares error function by default. */
+#ifndef ENTROPIC_ERROR
+#ifndef LEAST_SQUARES_ERROR
+#define LEAST_SQUARES_ERROR
+#endif
+#endif
+
+#define OUTPUT_FILE "perceptron.scores"
+
+void init_wheel ();
+void destroy_wheel ();
+int get_random_test ();
+void init_weights();
+void destroy_weights ();
+void write_weights (FILE * fp);
+double evaluate_test (int test);
+double evaluate_test_nogain (int test);
+void train (int num_epochs, double learning_rate);
+void usage ();
+
+/* Converts a weight to a SpamAssassin score. */
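+/* With the spam threshold fixed at 5, these are formula [5] from
+ * README.perceptron: score = -threshold * weight / bias, and its inverse. */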
+#define weight_to_score(x) (-5*(x)/bias)
+#define score_to_weight(x) (-(x)*bias/5)
+
+int wheel_size; /* The number of entries in the roulette wheel (NOT THE
+ SIZE OF THE ROULETTE_WHEEL ARRAY!). */
+int * roulette_wheel; /* Used for roulette wheel selection. */
+double ham_preference = 2.0;
+
+double * weights; /* The weights of the single-layer perceptron. */
+double bias; /* The network bias for the single-layer perceptron. */
+
+int num_epochs = 15;
+double learning_rate = 2.0;
+double weight_decay = 1.0;
+
+/* Initialize the roulette wheel and populate it, replicating harder-to-classify hams. */
+void init_wheel () {
+ int i;
+ int spam = 0, ham = 0;
+
+ /* cache locations on the roulette wheel for fast lookups */
+ roulette_wheel = (int*)calloc(num_nondup, sizeof(int));
+ wheel_size = 0;
+
+ roulette_wheel[0] = 0;
+
+ for (i = 0; i < num_nondup - 1; i++) {
+ int slot_size = 1;
+
+ /* Hams with more tests are rare and harder to classify but are the
+ * most important to classify correctly. They are thus replicated in the
+ * training set proportionally to their difficulty. */
+ if ( ! is_spam[i] ) {
+ slot_size += (int)(num_tests_hit[i] * ham_preference);
+ }
+
+ /* The database is compressed with all instances mapped in the same place. */
+ slot_size *= tests_count[i];
+ wheel_size += slot_size;
+
+ if ( ! is_spam[i] ) {
+ ham += slot_size;
+ } else {
+ spam++;
+ }
+
+ roulette_wheel[i+1] = roulette_wheel[i] + slot_size;
+ }
+
+ printf ("Modified training set statistics: %d spam, %d ham.\n", spam, ham);
+}
+
+/* Free the resources for the roulette wheel selector. */
+void destroy_wheel () {
+ if ( roulette_wheel ) {
+ free (roulette_wheel);
+ roulette_wheel = 0;
+ wheel_size = 0;
+ }
+}
+
+/* Get a random test using roulette wheel selection. This is not used anymore. */
+int get_random_test () {
+ int r;
+ int bottom, middle, top;
+
+ r = lrand48() % wheel_size;
+
+ bottom = 0;
+ top = num_nondup - 1;
+ middle = top / 2;
+
+ while (1) {
+ if ( r < roulette_wheel[bottom+1] ) {
+ return bottom;
+ } else if ( r >= roulette_wheel[top] ) {
+ return top;
+ } else if ( r >= roulette_wheel[middle] && r < roulette_wheel[middle+1] ) {
+ return middle;
+ } else if ( r < roulette_wheel[middle] ) {
+ top = middle-1;
+ bottom++;
+ middle = bottom + (top-bottom)/2;
+ } else {
+ bottom = middle+1;
+ top--;
+ middle = bottom + (top-bottom)/2;
+ }
+ }
+}
+
+/* Allocate and initialize the weights over the range [-0.5..0.5] */
+void init_weights () {
+ int i;
+
+ weights = (double*)calloc(num_scores, sizeof(double));
+
+ bias = drand48() - 0.5;
+ for (i = 0; i < num_scores; i++) {
+ weights[i] = drand48() - 0.5;
+ }
+}
+
+/* Free the resources allocated for the weights. */
+void destroy_weights () {
+ if ( weights ) {
+ free(weights);
+ weights = 0;
+ }
+}
+
+/* Writes out the weights in SpamAssassin score space. */
+void write_weights (FILE * fp) {
+ int i;
+ int ga_nn, ga_yy, ga_ny, ga_yn;
+ double nnscore, yyscore, nyscore, ynscore;
+ double threshold = 5;
+
+ ga_nn = ga_yy = ga_ny = ga_yn = 0;
+ nnscore = yyscore = nyscore = ynscore = 0;
+
+ /* Run through all of the instances in the training set and tally up
+ * the scores. */
+ for (i = 0; i < num_nondup; i++) {
+ double score = weight_to_score(evaluate_test_nogain(i)) + 5;
+
+ if ( score >= threshold && is_spam[i] ) {
+ ga_yy += tests_count[i];
+ yyscore += tests_count[i] * score;
+ } else if ( score < threshold && !is_spam[i] ) {
+ ga_nn += tests_count[i];
+ nnscore += tests_count[i] * score;
+ } else if ( score >= threshold && !is_spam[i] ) {
+ ga_ny += tests_count[i];
+ nyscore += tests_count[i] * score;
+ } else {
+ ga_yn += tests_count[i];
+ ynscore += tests_count[i] * score;
+ }
+ }
+
+ /* This is copied from the dump() function in craig-evolve.c. It
+ * outputs some nice statistics about the learned classifier. */
+ fprintf (fp,"\n# SUMMARY for threshold %3.1f:\n", threshold);
+ fprintf (fp,
+ "# Correctly non-spam: %6d %4.2f%% (%4.2f%% of non-spam corpus)\n",
+ ga_nn,
+ (ga_nn / (float) num_tests) * 100.0,
+ (ga_nn / (float) num_nonspam) * 100.0);
+ fprintf (fp,
+ "# Correctly spam: %6d %4.2f%% (%4.2f%% of spam corpus)\n",
+ ga_yy,
+ (ga_yy / (float) num_tests) * 100.0,
+ (ga_yy / (float) num_spam) * 100.0);
+ fprintf (fp,
+ "# False positives: %6d %4.2f%% (%4.2f%% of nonspam)\n",
+ ga_ny,
+ (ga_ny / (float) num_tests) * 100.0,
+ (ga_ny / (float) num_nonspam) * 100.0);
+ fprintf (fp,
+ "# False negatives: %6d %4.2f%% (%4.2f%% of spam)\n",
+ ga_yn,
+ (ga_yn / (float) num_tests) * 100.0,
+ (ga_yn / (float) num_spam) * 100.0);
+
+ fprintf (fp,"# Average score for spam: %3.3f nonspam: %3.1f\n",(ynscore+yyscore)/((double)(ga_yn+ga_yy)),(nyscore+nnscore)/((double)(ga_nn+ga_ny)));
+ fprintf (fp,"# Average for false-pos: %3.3f false-neg: %3.1f\n",(nyscore/(double)ga_ny),(ynscore/(double)ga_yn));
+
+ fprintf (fp,"# TOTAL: %6d %3.2f%%\n\n", num_tests, 100.0);
+
+
+ for (i = 0; i < num_scores; i++) {
+ if ( is_mutatable[i] ) {
+ fprintf(fp, "score %-30s 0 %2.3f\n", score_names[i], weight_to_score(weights[i]));
+ } else {
+ fprintf(fp, "score %-30s 0 %2.3f # not mutable\n", score_names[i], range_lo[i]);
+ }
+ }
+}
+
+/* Computes the value of the activation function of the perceptron for
+ * a given input. */
+double evaluate_test (int test) {
+#ifdef LEAST_SQUARES_ERROR
+ return 1/(1+exp(-evaluate_test_nogain(test)));
+#else
+#ifdef ENTROPIC_ERROR
+ return tanh(evaluate_test_nogain(test));
+#endif
+#endif
+}
+
+/* Computes the value of the transfer function (in this case, linear) for
+ * an input defined in tests_hit[test]. */
+double evaluate_test_nogain (int test) {
+ double sum;
+ int i;
+
+ sum = bias;
+
+ for (i = 0; i < num_tests_hit[test]; i++) {
+ sum += weights[tests_hit[test][i]];
+ }
+
+ /* Translate the 'unmutatable' scores to weight space. */
+ sum += score_to_weight(scores[test]);
+
+ return sum;
+}
+
+/* Trains the perceptron using stochastic gradient descent. */
+void train (int num_epochs, double learning_rate) {
+ int epoch, random_test;
+ int i, j;
+ int * tests;
+ double y_out, error, delta;
+
+ /* Initialize and populate an array containing indices of training
+ * instances. This is shuffled on every epoch and then iterated
+ * through. I had originally planned to use roulette wheel selection,
+ * but shuffled selection seems to work better. */
+ tests = (int*)calloc(wheel_size, sizeof(int));
+
+ for (i = 0, j = 0; i < num_nondup-1; i++) {
+ for (; j < roulette_wheel[i+1]; j++) {
+ tests[j] = i;
+ }
+ }
+ for (; j < wheel_size; j++) {
+ tests[j] = num_nondup-1;
+ }
+
+ for (epoch = 0; epoch < num_epochs; epoch++) {
+ /* decay the weights on every epoch to smooth out statistical
+ * anomalies */
+ if ( weight_decay != 1.0 ) {
+ bias *= weight_decay;
+ for (i = 0; i < num_mutable; i++) {
+ weights[i] *= weight_decay;
+ }
+ }
+
+ /* shuffle the training instances */
+ for (i = 0; i < wheel_size-1; i++) {
+ int tmp;
+ int r = lrand48 () % (wheel_size - i);
+
+ tmp = tests[i];
+ tests[i] = tests[r+i];
+ tests[r+i] = tmp;
+ }
+
+ for (j = 0; j < wheel_size; j++) {
+
+ /* select a random test (they have been randomized above) */
+ random_test = tests[j];
+
+ /* compute the output of the network */
+ y_out = evaluate_test(random_test);
+
+/* compute the error gradient for the logsig node with least squares error */
+#ifdef LEAST_SQUARES_ERROR
+ error = is_spam[random_test] - y_out;
+ delta = y_out * (1-y_out) * error / (num_tests_hit[random_test]+1) * learning_rate;
+#else
+/* compute the error gradient for the tanh node with entropic error */
+#ifdef ENTROPIC_ERROR
+ error = (2.0*is_spam[random_test]-1) - y_out;
+ delta = error / (num_tests_hit[random_test]+1) * learning_rate;
+#endif
+#endif
+
+ /* adjust the weights to descend the steepest part of the error gradient */
+ bias += delta;
+ for (i = 0; i < num_tests_hit[random_test]; i++) {
+ int idx = tests_hit[random_test][i];
+ weights[idx] += delta;
+
+ /* Constrain the weights so that nice rules are always <= 0 etc. */
+ if ( range_lo[idx] >= 0 && weights[idx] < 0 ) {
+ weights[idx] = 0;
+ } else if ( range_hi[idx] <= 0 && weights[idx] > 0 ) {
+ weights[idx] = 0;
+ }
+
+ /*
+ if ( weights[idx] < score_to_weight(range_lo[idx]) ) {
+ weights[idx] = score_to_weight(range_lo[idx]);
+ } else if ( weights[idx] > score_to_weight(range_hi[idx]) ) {
+ weights[idx] = score_to_weight(range_hi[idx]);
+ }
+ */
+ }
+ }
+ }
+
+ free(tests);
+}
+
+void usage () {
+ printf ("usage: perceptron [args]\n"
+ "\n"
+ " -p ham_preference = adds extra ham to training set multiplied by number of\n"
+ " tests hit (2.0 default)\n"
+ " -e num_epochs = number of epochs to train (15 default)\n"
+ " -l learning_rate = learning rate for gradient descent (2.0 default)\n"
+ " -w weight_decay = per-epoch decay of learned weight and bias (1.0 default)\n"
+ "\n");
+ exit(30);
+}
+
+int main (int argc, char ** argv) {
+ struct timeval tv, tv_start, tv_end;
+ long long int t_usec;
+ FILE * fp;
+ int arg;
+
+ /* Read the command line options */
+ while ((arg = getopt (argc, argv, "p:e:l:w:?")) != -1) {
+ switch (arg) {
+ case 'p':
+ ham_preference = atof(optarg);
+ break;
+
+ case 'e':
+ num_epochs = atoi(optarg);
+ break;
+
+ case 'l':
+ learning_rate = atof(optarg);
+ break;
+
+ case 'w':
+ weight_decay = atof(optarg);
+ break;
+
+ case '?':
+ usage();
+ break;
+ }
+ }
+
+ /* Seed the PRNG */
+ gettimeofday (&tv, 0);
+ t_usec = tv.tv_sec * 1000000 + tv.tv_usec;
+ srand48 ((int)t_usec);
+
+ /* Load the instances and score constraints generated by logs-to-c. */
+ loadtests();
+ loadscores();
+
+ /* Replicate instances from the training set to bias against false positives. */
+ init_wheel ();
+
+ /* Allocate and initialize the weight vector. */
+ init_weights ();
+
+ /* Train the network using stochastic gradient descent. */
+ gettimeofday(&tv_start, 0);
+ train(num_epochs, learning_rate);
+ gettimeofday(&tv_end, 0);
+
+ t_usec = tv_end.tv_sec * 1000000 + tv_end.tv_usec -
+ (tv_start.tv_sec *1000000 + tv_start.tv_usec);
+ printf ("Training time = %fs.\n", t_usec / 1000000.0f);
+
+ /* Translate the learned weights to SA score space and output to a file. */
+ fp = fopen (OUTPUT_FILE, "w");
+ if ( fp ) {
+ write_weights(fp);
+ fclose(fp);
+ } else {
+ perror (OUTPUT_FILE);
+ }
+
+ /* Free resources */
+ destroy_weights ();
+ destroy_wheel ();
+ return 0;
+}