Date: Tue Jan 27 11:19:10 2004

New Revision: 6321

Added:

incubator/spamassassin/trunk/masses/README.perceptron

incubator/spamassassin/trunk/masses/perceptron.c

Log:

bug 2910: looks like these weren't "svn add"ed ... oops!

Added: incubator/spamassassin/trunk/masses/README.perceptron

==============================================================================

--- (empty file)

+++ incubator/spamassassin/trunk/masses/README.perceptron Tue Jan 27 11:19:10 2004

@@ -0,0 +1,154 @@

+Copyright (c)2003 Henry Stern

+

+Fast SpamAssassin Score Learning Tool

+

+Henry Stern

+Faculty of Computer Science

+Dalhousie University

+6050 University Avenue

+Halifax, NS Canada

+B3H 1W5

+henry@stern.ca

+

+January 8, 2004

+

+1. WHAT IS IT?

+

+This program is used to compute scores for SpamAssassin rules. It makes

+use of data files generated by the suite of scripts in

+spamassassin/masses. The program outputs the generated scores in a file

+titled 'perceptron.scores'.

+

+The advantage of this program over that of the genetic algorithm (GA)

+implementation in spamassassin/masses/craig-evolve.c is that while the GA

+requires several hours to run on high-end machines, the perceptron

+requires only about 15 seconds of CPU time on an Athlon XP 1700+ system.

+

+This makes incremental updates and score personalization practical for the

+end-user and gives developers a better idea of just how useful a new rule is.

+

+2. OPTIONS

+

+There are four options that can be passed to the perceptron program.

+

+ -p ham_preference

+

+ This increases the number of non-spam messages in the training

+ set. It does this by adding 1 + (number of tests hit) *

+ ham_preference instances of non-spam messages to the training set.

+ This is intended to reduce false positives by encouraging the

+	training program to look at the harder-to-classify ham messages

+ more often. By default, this parameter is 2.0.

+

+ -e num_epochs

+

+ This parameter sets how many passes the perceptron will make

+ through the training set before terminating. On each pass, the

+ training set is shuffled and then iterated through. By default,

+ it will make 15 passes.

+

+ -l learning_rate

+

+ This parameter modifies the learning rate of the perceptron. The

+ error gradient is computed for each instance. This program uses a

+ logsig activation function y = (1/(1+exp(-x))), so the error

+ gradient is computed as E(x) = y*(1-y)*(is_spam - y). For

+ each instance and score hit in the training set, the scores are

+ modified by adding E(x) / (number of tests hit + 1) *

+ learning_rate. The default value for this is 2.0, but it can be

+ whatever you want.

+

+ -w weight_decay

+

+ To prevent the scores from getting too high (or even forcing them

+ to if you want), before each epoch, the scores and network bias

+ are multiplied by the weight decay. This is off by default (1.0),

+ but it can be useful.

+

+3. HOW DOES IT WORK?

+

+This program implements the "Stochastic Gradient Descent" method of

+training a neural network. It uses a single perceptron with a logsig

+activation function and maps the weights to SpamAssassin score space.

+

+The perceptron is the simplest form of neural network. It consists of a

+transfer function and an activation function. Together, they simulate the

+average firing rate of a biological neuron over time.

+

+The transfer function is the sum of the product of weights and inputs.

+It simulates the membrane potential of a neuron. When the accumulated

+charge on the membrane exceeds a certain threshold, the neuron fires,

+sending an electrical impulse down the axon. This implementation uses a

+linear transfer function:

+

+[1] f(x) = bias + sum_{i=1}^{n} w_i * x_i

+

+This is quite similar to how the SpamAssassin score system works. If you

+set the bias to be 0, w_i to be the score for rule i and x_i to be whether

+or not rule i is activated by a given message, the transfer function will

+return the score of the message.
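This correspondence can be sketched directly in C. The array names below are illustrative stand-ins for the program's actual data structures, not its real variables:

```c
#include <math.h>

/* Linear transfer function [1]: bias + sum of w_i * x_i.
 * 'w' holds the per-rule scores and 'x' is 1 where a rule fired,
 * 0 otherwise. With bias = 0, this is exactly the message's score. */
static double transfer(double bias, const double *w, const int *x, int n) {
    double sum = bias;
    for (int i = 0; i < n; i++)
        sum += w[i] * x[i];
    return sum;
}
```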

+

+The activation function simulates the electrical spike travelling down the

+axon. It takes the output of the transfer function as input and applies

+some sort of transformation to it. This implementation uses a logsig

+activation function:

+

+[2] y(x) = 1 / (1 + exp(-f(x)))

+

+This non-linear function constrains the output of the transfer function

+between 0 and 1. When plotted, it looks somewhat S-shaped with horizontal

+asymptotes at 0 as x approaches -infinity and 1 as x approaches infinity.

+The slope of the function is greatest at x=0 and tapers off as it

+approaches the asymptotes.

+

+Lastly, the performance of the perceptron needs to be measured using an

+error function. Two error functions are commonly used: mean squared

+error and entropic error. By default, this implementation uses mean

+squared error but entropic error may be substituted by adding a compiler

+directive.

+

+The most common method of training neural networks is called gradient

+descent. It involves iteratively tuning the parameters of the network so

+that the mean error rate always decreases. This is done by finding the

+direction of steepest descent down the "error gradient," reducing the

+value of the error function for the next iteration of the algorithm.

+

+If the transfer function and activation function are both differentiable,

+the error gradient of a neural network can be calculated with respect to

+the weights and bias. Without getting into calculus, the error gradient

+for a perceptron with a linear transfer function, logsig activation

+function and mean squared error function is:

+

+[3] E(x) = y(x) * (1-y(x)) * (y_expected - y(x))

+

+The weights are updated using the function:

+

+[4] w_i = w_i + E(x) * x_i * learning_rate

+

+Since the SpamAssassin rule hits are sparse, the basic gradient descent

+algorithm is impractical. This implementation uses a variation called

+"Stochastic gradient descent." Instead of doing one batch update per

+epoch, the training set is randomly walked through, doing incremental

+updates. In addition, in this implementation, the learning rate is

+modified by the number of rule hits for a given training instance.

+Together, these allow good weights to be computed for

+infrequently-occurring rules.

+

+Once the perceptron has finished running, the weights are converted to

+scores and exported using the familiar file format. Weights are converted

+to scores using this function:

+

+[5] score(weight) = -threshold * weight / bias
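With the program's default threshold of 5, the conversion [5] is a one-liner (written out here as a function rather than the macro the code actually uses):

```c
/* Map a learned weight to a SpamAssassin score [5]. Dividing by the
 * bias (negative for a sane score set) and negating places the logsig
 * decision boundary at the spam threshold once the threshold offset
 * is added back in. */
static double weight_to_score(double weight, double bias) {
    const double threshold = 5.0;
    return -threshold * weight / bias;
}
```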

+

+4. ACKNOWLEDGEMENTS

+

+I would like to thank my PhD supervisor, Michael Shepherd, for not getting

+mad at me while I worked on this. I'd also like to thank Thomas

+Trappenberg for his invaluable assistance while I was tweaking the

+performance of the learning algorithm. I would like to thank Daniel

+Quinlan, Justin Mason and Theo Van Dinter for their valuable input and

+constructive criticism.

+

+--

+hs

+8/1/2004

Added: incubator/spamassassin/trunk/masses/perceptron.c

==============================================================================

--- (empty file)

+++ incubator/spamassassin/trunk/masses/perceptron.c Tue Jan 27 11:19:10 2004

@@ -0,0 +1,425 @@

+/* Copyright (c)2003 Henry Stern

+ *

+ * This program uses stochastic gradient descent to learn a scoreset for

+ * SpamAssassin. You'll need to run logs-to-c from spamassassin/masses to

+ * generate the stuff in tmp.

+ */

+

+#include <stdio.h>

+#include <stdlib.h>

+#include <assert.h>

+#include <sys/time.h>

+#include <math.h>

+#include <unistd.h>

+

+#include "tmp/scores.h"

+#include "tmp/tests.h"

+

+/* Ensure that multiple error functions have not been chosen. */

+#ifdef ENTROPIC_ERROR

+#ifdef LEAST_SQUARES_ERROR

+#warning Both entropic error and least squares error chosen. Using least squares.

+#endif

+#endif

+

+/* Choose the least squares error function by default. */

+#ifndef ENTROPIC_ERROR

+#ifndef LEAST_SQUARES_ERROR

+#define LEAST_SQUARES_ERROR

+#endif

+#endif

+

+#define OUTPUT_FILE "perceptron.scores"

+

+void init_wheel ();

+void destroy_wheel ();

+int get_random_test ();

+void init_weights();

+void destroy_weights ();

+void write_weights (FILE * fp);

+double evaluate_test (int test);

+double evaluate_test_nogain (int test);

+void train (int num_epochs, double learning_rate);

+void usage ();

+

+/* Converts a weight to a SpamAssassin score. */

+#define weight_to_score(x) (-5*(x)/bias)

+#define score_to_weight(x) (-(x)*bias/5)

+

+int wheel_size; /* The number of entries in the roulette wheel (NOT THE

+ SIZE OF THE ROULETTE_WHEEL ARRAY!). */

+int * roulette_wheel; /* Used for roulette wheel selection. */

+double ham_preference = 2.0;

+

+double * weights; /* The weights of the single-layer perceptron. */

+double bias; /* The network bias for the single-layer perceptron. */

+

+int num_epochs = 15;

+double learning_rate = 2.0;

+double weight_decay = 1.0;

+

+/* Initialize the roulette wheel and populate it, replicating harder-to-classify hams. */

+void init_wheel () {

+ int i;

+ int spam = 0, ham = 0;

+

+ /* cache locations on the roulette wheel for fast lookups */

+ roulette_wheel = (int*)calloc(num_nondup, sizeof(int));

+ wheel_size = 0;

+

+ roulette_wheel[0] = 0;

+

+ for (i = 0; i < num_nondup - 1; i++) {

+ int slot_size = 1;

+

+ /* Hams with more tests are rare and harder to classify but are the

+ * most important to classify correctly. They are thus replicated in the

+ * training set proportionally to their difficulty. */

+ if ( ! is_spam[i] ) {

+ slot_size += (int)(num_tests_hit[i] * ham_preference);

+ }

+

+ /* The database is compressed with all instances mapped in the same place. */

+ slot_size *= tests_count[i];

+ wheel_size += slot_size;

+

+ if ( ! is_spam[i] ) {

+ ham += slot_size;

+ } else {

+ spam++;

+ }

+

+ roulette_wheel[i+1] = roulette_wheel[i] + slot_size;

+ }

+

+ printf ("Modified training set statistics: %d spam, %d ham.\n", spam, ham);

+}

+

+/* Free the resources for the roulette wheel selector. */

+void destroy_wheel () {

+ if ( roulette_wheel ) {

+ free (roulette_wheel);

+ roulette_wheel = 0;

+ wheel_size = 0;

+ }

+}

+

+/* Get a random test using roulette wheel selection. This is not used anymore. */

+int get_random_test () {

+ int r;

+ int bottom, middle, top;

+

+ r = lrand48() % wheel_size;

+

+ bottom = 0;

+ top = num_nondup - 1;

+ middle = top / 2;

+

+ while (1) {

+ if ( r < roulette_wheel[bottom+1] ) {

+ return bottom;

+ } else if ( r >= roulette_wheel[top] ) {

+ return top;

+ } else if ( r >= roulette_wheel[middle] && r < roulette_wheel[middle+1] ) {

+ return middle;

+ } else if ( r < roulette_wheel[middle] ) {

+ top = middle-1;

+ bottom++;

+ middle = bottom + (top-bottom)/2;

+ } else {

+ bottom = middle+1;

+ top--;

+ middle = bottom + (top-bottom)/2;

+ }

+ }

+}

+

+/* Allocate and initialize the weights over the range [-0.5..0.5] */

+void init_weights () {

+ int i;

+

+ weights = (double*)calloc(num_scores, sizeof(double));

+

+ bias = drand48() - 0.5;

+ for (i = 0; i < num_scores; i++) {

+ weights[i] = drand48() - 0.5;

+ }

+}

+

+/* Free the resources allocated for the weights. */

+void destroy_weights () {

+ if ( weights ) {

+ free(weights);

+ weights = 0;

+ }

+}

+

+/* Writes out the weights in SpamAssassin score space. */

+void write_weights (FILE * fp) {

+ int i;

+ int ga_nn, ga_yy, ga_ny, ga_yn;

+ double nnscore, yyscore, nyscore, ynscore;

+ double threshold = 5;

+

+ ga_nn = ga_yy = ga_ny = ga_yn = 0;

+ nnscore = yyscore = nyscore = ynscore = 0;

+

+ /* Run through all of the instances in the training set and tally up

+ * the scores. */

+ for (i = 0; i < num_nondup; i++) {

+ double score = weight_to_score(evaluate_test_nogain(i)) + 5;

+

+ if ( score >= threshold && is_spam[i] ) {

+ ga_yy += tests_count[i];

+ yyscore += tests_count[i] * score;

+ } else if ( score < threshold && !is_spam[i] ) {

+ ga_nn += tests_count[i];

+ nnscore += tests_count[i] * score;

+ } else if ( score >= threshold && !is_spam[i] ) {

+ ga_ny += tests_count[i];

+ nyscore += tests_count[i] * score;

+ } else {

+ ga_yn += tests_count[i];

+ ynscore += tests_count[i] * score;

+ }

+ }

+

+ /* This is copied from the dump() function in craig-evolve.c. It

+ * outputs some nice statistics about the learned classifier. */

+ fprintf (fp,"\n# SUMMARY for threshold %3.1f:\n", threshold);

+ fprintf (fp,

+ "# Correctly non-spam: %6d %4.2f%% (%4.2f%% of non-spam corpus)\n",

+ ga_nn,

+ (ga_nn / (float) num_tests) * 100.0,

+ (ga_nn / (float) num_nonspam) * 100.0);

+ fprintf (fp,

+ "# Correctly spam: %6d %4.2f%% (%4.2f%% of spam corpus)\n",

+ ga_yy,

+ (ga_yy / (float) num_tests) * 100.0,

+ (ga_yy / (float) num_spam) * 100.0);

+ fprintf (fp,

+ "# False positives: %6d %4.2f%% (%4.2f%% of nonspam)\n",

+ ga_ny,

+ (ga_ny / (float) num_tests) * 100.0,

+ (ga_ny / (float) num_nonspam) * 100.0);

+ fprintf (fp,

+ "# False negatives: %6d %4.2f%% (%4.2f%% of spam)\n",

+ ga_yn,

+ (ga_yn / (float) num_tests) * 100.0,

+ (ga_yn / (float) num_spam) * 100.0);

+

+ fprintf (fp,"# Average score for spam: %3.3f nonspam: %3.1f\n",(ynscore+yyscore)/((double)(ga_yn+ga_yy)),(nyscore+nnscore)/((double)(ga_nn+ga_ny)));

+ fprintf (fp,"# Average for false-pos: %3.3f false-neg: %3.1f\n",(nyscore/(double)ga_ny),(ynscore/(double)ga_yn));

+

+ fprintf (fp,"# TOTAL: %6d %3.2f%%\n\n", num_tests, 100.0);

+

+

+ for (i = 0; i < num_scores; i++) {

+ if ( is_mutatable[i] ) {

+ fprintf(fp, "score %-30s 0 %2.3f\n", score_names[i], weight_to_score(weights[i]));

+ } else {

+ fprintf(fp, "score %-30s 0 %2.3f # not mutable\n", score_names[i], range_lo[i]);

+ }

+ }

+}

+

+/* Computes the value of the activation function of the perceptron for

+ * a given input. */

+double evaluate_test (int test) {

+#ifdef LEAST_SQUARES_ERROR

+ return 1/(1+exp(-evaluate_test_nogain(test)));

+#else

+#ifdef ENTROPIC_ERROR

+ return tanh(evaluate_test_nogain(test));

+#endif

+#endif

+}

+

+/* Computes the value of the transfer function (in this case, linear) for

+ * an input defined in tests_hit[test]. */

+double evaluate_test_nogain (int test) {

+ double sum;

+ int i;

+

+ sum = bias;

+

+ for (i = 0; i < num_tests_hit[test]; i++) {

+ sum += weights[tests_hit[test][i]];

+ }

+

+ /* Translate the 'unmutatable' scores to weight space. */

+ sum += score_to_weight(scores[test]);

+

+ return sum;

+}

+

+/* Trains the perceptron using stochastic gradient descent. */

+void train (int num_epochs, double learning_rate) {

+ int epoch, random_test;

+ int i, j;

+ int * tests;

+ double y_out, error, delta;

+

+ /* Initialize and populate an array containing indices of training

+ * instances. This is shuffled on every epoch and then iterated

+ * through. I had originally planned to use roulette wheel selection,

+ * but shuffled selection seems to work better. */

+ tests = (int*)calloc(wheel_size, sizeof(int));

+

+ for (i = 0, j = 0; i < num_nondup-1; i++) {

+ for (; j < roulette_wheel[i+1]; j++) {

+ tests[j] = i;

+ }

+ }

+ for (; j < wheel_size; j++) {

+ tests[j] = num_nondup-1;

+ }

+

+ for (epoch = 0; epoch < num_epochs; epoch++) {

+ /* decay the weights on every epoch to smooth out statistical

+ * anomalies */

+ if ( weight_decay != 1.0 ) {

+ bias *= weight_decay;

+ for (i = 0; i < num_mutable; i++) {

+ weights[i] *= weight_decay;

+ }

+ }

+

+ /* shuffle the training instances */

+ for (i = 0; i < wheel_size-1; i++) {

+ int tmp;

+ int r = lrand48 () % (wheel_size - i);

+

+ tmp = tests[i];

+ tests[i] = tests[r+i];

+ tests[r+i] = tmp;

+ }

+

+ for (j = 0; j < wheel_size; j++) {

+

+ /* select a random test (they have been randomized above) */

+ random_test = tests[j];

+

+ /* compute the output of the network */

+ y_out = evaluate_test(random_test);

+

+/* compute the error gradient for the logsig node with least squares error */

+#ifdef LEAST_SQUARES_ERROR

+ error = is_spam[random_test] - y_out;

+ delta = y_out * (1-y_out) * error / (num_tests_hit[random_test]+1) * learning_rate;

+#else

+/* compute the error gradient for the tanh node with entropic error */

+#ifdef ENTROPIC_ERROR

+ error = (2.0*is_spam[random_test]-1) - y_out;

+ delta = error / (num_tests_hit[random_test]+1) * learning_rate;

+#endif

+#endif

+

+ /* adjust the weights to descend the steepest part of the error gradient */

+ bias += delta;

+ for (i = 0; i < num_tests_hit[random_test]; i++) {

+ int idx = tests_hit[random_test][i];

+ weights[idx] += delta;

+

+ /* Constrain the weights so that nice rules are always <= 0 etc. */

+ if ( range_lo[idx] >= 0 && weights[idx] < 0 ) {

+ weights[idx] = 0;

+ } else if ( range_hi[idx] <= 0 && weights[idx] > 0 ) {

+ weights[idx] = 0;

+ }

+

+ /*

+ if ( weights[idx] < score_to_weight(range_lo[idx]) ) {

+ weights[idx] = score_to_weight(range_lo[idx]);

+ } else if ( weights[idx] > score_to_weight(range_hi[idx]) ) {

+ weights[idx] = score_to_weight(range_hi[idx]);

+ }

+ */

+ }

+ }

+ }

+

+ free(tests);

+}

+

+void usage () {

+ printf ("usage: perceptron [args]\n"

+ "\n"

+ " -p ham_preference = adds extra ham to training set multiplied by number of\n"

+ " tests hit (2.0 default)\n"

+ " -e num_epochs = number of epochs to train (15 default)\n"

+ " -l learning_rate = learning rate for gradient descent (2.0 default)\n"

+ " -w weight_decay = per-epoch decay of learned weight and bias (1.0 default)\n"

+ "\n");

+ exit(30);

+}

+

+int main (int argc, char ** argv) {

+ struct timeval tv, tv_start, tv_end;

+ long long int t_usec;

+ FILE * fp;

+ int arg;

+

+ /* Read the command line options */

+ while ((arg = getopt (argc, argv, "p:e:l:w:?")) != -1) {

+ switch (arg) {

+ case 'p':

+ ham_preference = atof(optarg);

+ break;

+

+ case 'e':

+ num_epochs = atoi(optarg);

+ break;

+

+ case 'l':

+ learning_rate = atof(optarg);

+ break;

+

+ case 'w':

+ weight_decay = atof(optarg);

+ break;

+

+ case '?':

+ usage();

+ break;

+ }

+ }

+

+ /* Seed the PRNG */

+ gettimeofday (&tv, 0);

+ t_usec = tv.tv_sec * 1000000 + tv.tv_usec;

+ srand48 ((int)t_usec);

+

+ /* Load the instances and score constraints generated by logs-to-c. */

+ loadtests();

+ loadscores();

+

+ /* Replicate instances from the training set to bias against false positives. */

+ init_wheel ();

+

+ /* Allocate and initialize the weight vector. */

+ init_weights ();

+

+ /* Train the network using stochastic gradient descent. */

+ gettimeofday(&tv_start, 0);

+ train(num_epochs, learning_rate);

+ gettimeofday(&tv_end, 0);

+

+ t_usec = tv_end.tv_sec * 1000000 + tv_end.tv_usec -

+ (tv_start.tv_sec *1000000 + tv_start.tv_usec);

+ printf ("Training time = %fs.\n", t_usec / 1000000.0f);

+

+ /* Translate the learned weights to SA score space and output to a file. */

+ fp = fopen (OUTPUT_FILE, "w");

+ if ( fp ) {

+ write_weights(fp);

+ fclose(fp);

+ } else {

+ perror (OUTPUT_FILE);

+ }

+

+ /* Free resources */

+ destroy_weights ();

+ destroy_wheel ();

+ return 0;

+}