svn commit: rev 6319 - in incubator/spamassassin/trunk: . masses rules
Author: jm
Date: Tue Jan 27 00:24:29 2004
New Revision: 6319

Removed:
incubator/spamassassin/trunk/masses/craig-evolve.c
incubator/spamassassin/trunk/masses/true-false-pos-neg-filter.pl
Modified:
incubator/spamassassin/trunk/MANIFEST
incubator/spamassassin/trunk/MANIFEST.SKIP
incubator/spamassassin/trunk/masses/Makefile
incubator/spamassassin/trunk/masses/README
incubator/spamassassin/trunk/masses/mk-baseline-results
incubator/spamassassin/trunk/masses/rewrite-cf-with-new-scores
incubator/spamassassin/trunk/masses/runGA
incubator/spamassassin/trunk/rules/50_scores.cf
incubator/spamassassin/trunk/spamassassin.spec
Log:
bug 2910: replaced evolver with fast perceptron score generation tool by Henry Stern; also cleaned up a few old scripts that aren't used anymore
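
For readers unfamiliar with the technique named in the log, here is a minimal sketch of a generic perceptron learner applied to rule scores. It is not the code in perceptron.c: every function name, parameter and the flat data layout below are hypothetical, chosen only for illustration. The assumption is that each message is a 0/1 vector of rule hits, a message is called spam when the sum of its hit rules' scores reaches a threshold, and each misclassified message nudges the scores of the rules it hit.

```c
#include <stddef.h>

/* Sum the scores of every rule that hit one message.
 * hits[j] is 1 if rule j fired on the message, 0 otherwise. */
double message_score(const double *scores, const int *hits, size_t num_rules)
{
    double total = 0.0;
    for (size_t j = 0; j < num_rules; j++)
        if (hits[j])
            total += scores[j];
    return total;
}

/* One pass over the corpus. hits is a flat num_msgs x num_rules array,
 * is_spam[i] is 1 for spam and 0 for ham. For each misclassified message,
 * move the scores of the rules it hit by `rate`: up for missed spam,
 * down for ham flagged as spam. Returns the number of misclassified
 * messages seen in this pass. */
int train_epoch(double *scores, const int *hits, const int *is_spam,
                size_t num_msgs, size_t num_rules,
                double threshold, double rate)
{
    int errors = 0;
    for (size_t i = 0; i < num_msgs; i++) {
        const int *row = hits + i * num_rules;
        int predicted = message_score(scores, row, num_rules) >= threshold;
        if (predicted == is_spam[i])
            continue;
        errors++;
        double step = is_spam[i] ? rate : -rate;
        for (size_t j = 0; j < num_rules; j++)
            if (row[j])
                scores[j] += step;
    }
    return errors;
}

/* Repeat passes until one finishes with no errors or max_epochs is
 * reached. Returns the error count of the last pass that ran. */
int train(double *scores, const int *hits, const int *is_spam,
          size_t num_msgs, size_t num_rules,
          double threshold, double rate, int max_epochs)
{
    int errors = 0;
    for (int epoch = 0; epoch < max_epochs; epoch++) {
        errors = train_epoch(scores, hits, is_spam, num_msgs, num_rules,
                             threshold, rate);
        if (errors == 0)
            break;
    }
    return errors;
}
```

A caller would fill scores with starting values (zeros, for instance), pass the hit matrix and labels, and read the adjusted scores back from the same array once train() returns. The real tool presumably also handles details this sketch ignores, such as per-rule score ranges (tmp/ranges.data) and the validation split.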

Modified: incubator/spamassassin/trunk/MANIFEST
==============================================================================
--- incubator/spamassassin/trunk/MANIFEST (original)
+++ incubator/spamassassin/trunk/MANIFEST Tue Jan 27 00:24:29 2004
@@ -63,7 +63,7 @@
masses/corpora/remove-tests-from-logs
masses/corpora/uniq-mailbox
masses/corpora/uniq-maildirs
-masses/craig-evolve.c
+masses/perceptron.c
masses/extract-message-from-mbox
masses/find-extremes
masses/fp-fn-statistics
@@ -93,7 +93,6 @@
masses/tenpass/README
masses/tenpass/compute-current-tcr
masses/tenpass/split-log-into-buckets
-masses/true-false-pos-neg-filter.pl
masses/uniq-scores
ninjabutton.png
old/Changes.before-2002-06-18.gz

Modified: incubator/spamassassin/trunk/MANIFEST.SKIP
==============================================================================
--- incubator/spamassassin/trunk/MANIFEST.SKIP (original)
+++ incubator/spamassassin/trunk/MANIFEST.SKIP Tue Jan 27 00:24:29 2004
@@ -38,8 +38,8 @@
masses/commands.sh
masses/cont_evolve.pid
masses/copy-logs-to-deimos
-masses/craig-evolve
-masses/craig-evolve.o
+masses/perceptron
+masses/perceptron.o
masses/download-trapped-spam
masses/evolve
masses/evolve.o

Modified: incubator/spamassassin/trunk/masses/Makefile
==============================================================================
--- incubator/spamassassin/trunk/masses/Makefile (original)
+++ incubator/spamassassin/trunk/masses/Makefile Tue Jan 27 00:24:29 2004
@@ -1,60 +1,42 @@
-# Pick whether or not to use MPI here
-# Allowed values are yes, no
-MPI = no
-
-# Pick a platform to build for - currently support linux, macosx
-PLATFORM = linux
-
-# Pick where you have PGApack installed
-PGAPATH = pgapack
+# portions Copyright (c)2003 Henry Stern
+CC= gcc
+CFLAGS= -g -O2 -Wall
+LDFLAGS= -lm

# What rule scoreset are we using?
-SCORESET = 0
+SCORESET = 0

#### Should be no need to modify below this line

-ifeq ($(MPI),yes)
- CC = mpicc
- USE_MPI = -DUSE_MPI
-else
- CC = gcc
- USE_MPI =
-endif
-
-PGAINC = $(PGAPATH)/include
-
-ifeq ($(PLATFORM),macosx)
- OPT = -O3 -mcpu=750 -faltivec
- PGALIB = $(PGAPATH)/lib/freebsd
- EXTRALIBS =
-else
- OPT = -O9 #-mcpu=athlon
- PGALIB = $(PGAPATH)/lib/linux
- EXTRALIBS = -lm
-endif
-
-all: badrules evolve
-
-evolve: tmp/tests.h craig-evolve.c
- $(CC) -Wall $(OPT) $(USE_MPI) -DWL=2 -DOPTIMIZE -L$(PGALIB) -I$(PGAINC) craig-evolve.c -o evolve -lpgaO $(EXTRALIBS)
- strip evolve
+all: badrules perceptron
+
+perceptron: perceptron.o
+ $(CC) -o perceptron perceptron.o $(LDFLAGS)
+
+perceptron.o: tmp/rules.pl tmp/tests.h tmp/scores.h
+ $(CC) $(CFLAGS) -c -o perceptron.o perceptron.c
+
+tmp/rules.pl: tmp/.created parse-rules-for-masses
+ perl parse-rules-for-masses -d ../rules -s $(SCORESET)

tmp/tests.h: tmp/.created tmp/ranges.data logs-to-c
- ./logs-to-c --scoreset=$(SCORESET)
+ perl logs-to-c --scoreset=$(SCORESET)
+
+tmp/scores.h: tmp/tests.h

tmp/ranges.data: tmp/.created freqs score-ranges-from-freqs
- ./score-ranges-from-freqs ../rules $(SCORESET) < freqs
+ perl score-ranges-from-freqs ../rules $(SCORESET) < freqs

freqs: spam.log ham.log
- ./hit-frequencies -x -p -s $(SCORESET) > freqs
+ perl hit-frequencies -x -p -s $(SCORESET) > freqs

badrules: freqs
- ./lint-rules-from-freqs < freqs > badrules
+ perl lint-rules-from-freqs < freqs > badrules

tmp/.created:
-mkdir tmp
touch tmp/.created

clean:
- rm -rf *.o evolve tmp freqs
+ rm -rf *.o perceptron tmp freqs


Modified: incubator/spamassassin/trunk/masses/README
==============================================================================
--- incubator/spamassassin/trunk/masses/README (original)
+++ incubator/spamassassin/trunk/masses/README Tue Jan 27 00:24:29 2004
@@ -61,9 +61,9 @@
listing the path to the message or its message ID, its score, and the tests
that triggered on that mail.

- Using this info, and the genetic algorithm in evolve.cxx, I can figure out
- which tests get good hits with few false positives, etc., and re-score the
- tests to optimise the ratio.
+ Using this info, and a score optimization tool, I can figure out which tests
+ get good hits with few false positives, etc., and re-score the tests to
+ optimise the ratio.

This script relies on the spamassassin distribution directory living in "..".

@@ -71,19 +71,11 @@
logs-to-c :

Takes the "spam.log" and "nonspam.log" files and converts them into C
- source files and simplified data files for use by the "evolve" genetic
- algorithm. (Called by "make" when you build the evolver, so generally
+ source files and simplified data files for use by the C score optimization
+ algorithm. (Called by "make" when you build the perceptron, so generally
you won't need to run it yourself.)


-evolve.cxx :
-
- Source for "evolve". To build this, use "make". Note that it requires
- GAlib ( ftp://lancet.mit.edu/pub/ga/ ) unpacked in a dir called "galib245"
- to build. Alternatively just mail the data files to me and I'll run
- evolve for all of us ;)
-
-
hit-frequencies :

Analyses the log files and computes how often each test hits, overall,
@@ -100,11 +92,9 @@
suitable for a release build of SpamAssassin.


-continual_evolve.sh :
+perceptron.c :

- Continually runs the evolver, saving each run's best genome (and its results)
- into separate files named "result.n" where n starts at 1 and counts up.
- Handy for running overnight.
+ Perceptron learner by Henry Stern. See "README.perceptron" for details.


--- EOF -- lastmod: Aug 23 2002 jm
+-- EOF

Modified: incubator/spamassassin/trunk/masses/mk-baseline-results
==============================================================================
--- incubator/spamassassin/trunk/masses/mk-baseline-results (original)
+++ incubator/spamassassin/trunk/masses/mk-baseline-results Tue Jan 27 00:24:29 2004
@@ -5,14 +5,11 @@
SCORESET=$1
fi

-#make evolve SCORESET=$SCORESET || exit 1
-
echo "STATISTICS REPORT FOR SPAMASSASSIN RULESET"
echo
echo "Classification success on test corpora, at default threshold:"
echo

-#./evolve -C -t 5 | egrep '^#'
./logs-to-c --spam=spam-validate.log --nonspam=nonspam-validate.log --threshold 5 --count --scoreset=$SCORESET | sed -e 's/^Reading.*//' -e '/^$/d'

echo
@@ -21,7 +18,6 @@

# list a wide range of thresholds, so that we can make graphs later ;)
for thresh in -4 -3 -2 -1 0 1 2 3 4 4.5 5.5 6 6.5 7 8 9 10 12 15 17 20 ; do
- #./evolve -C -t $thresh | egrep '^#' | egrep -v '^# (Average|TOTAL)'
./logs-to-c --spam=spam-validate.log --nonspam=nonspam-validate.log --threshold $thresh --count --scoreset=$SCORESET | sed -e 's/^Reading.*//' -e '/^$/d'
echo
done

Modified: incubator/spamassassin/trunk/masses/rewrite-cf-with-new-scores
==============================================================================
--- incubator/spamassassin/trunk/masses/rewrite-cf-with-new-scores (original)
+++ incubator/spamassassin/trunk/masses/rewrite-cf-with-new-scores Tue Jan 27 00:24:29 2004
@@ -34,7 +34,7 @@
die "parse-rules-for-masses had no error but no tmp/rules.pl!?!";
}

-# now read the evolved scores
+# now read the generated scores
my @gascoreorder = ();
my %gascorelines = ();
open (STDIN, "<$newscores") or die "cannot open $newscores";
@@ -53,7 +53,7 @@
my $out = '';
my $pre = '';

-# read until '# Start of GA-evolved scores', removing scores from our
+# read until '# Start of generated scores', removing scores from our
# new list if we come across them.
while (<IN>) {
if (/^\s*score\s+(\S+)\s/) {
@@ -61,17 +61,17 @@
next unless (exists ($rules{$name}) && $rules{$name}->{issubrule} == 0);
}
$pre .= $_;
- /^# Start of GA-evolved scores/ and last;
+ /^# Start of generated scores/ and last;
}

-# now skip until '# End of GA-evolved scores'
+# now skip until '# End of generated scores'
while (<IN>) {
if (/^\s*score\s+\S+/) {
my($score,$name,@scores) = split;
@{$oldscores{$name}} = @scores;
}

- /^# End of GA-evolved scores/ and last;
+ /^# End of generated scores/ and last;
}
if (defined $_) {
$out .= $_;

Modified: incubator/spamassassin/trunk/masses/runGA
==============================================================================
--- incubator/spamassassin/trunk/masses/runGA (original)
+++ incubator/spamassassin/trunk/masses/runGA Tue Jan 27 00:24:29 2004
@@ -9,12 +9,12 @@
fi

if [ "x$1" = "x" ]; then
-echo "[Doing a scoreset $SCORESET GA run]"
+echo "[Doing a scoreset $SCORESET score-generation run]"

# Clean out old runs
echo "[Cleaning up]"
-rm -rf spam-validate.log nonspam-validate.log ham-validate.log spam.log nonspam.log ham.log NSBASE SPBASE tmp make.output freqs craig-evolve.scores \
- GA-$NAME.out GA-$NAME.scores GA-$NAME.validate
+rm -rf spam-validate.log nonspam-validate.log ham-validate.log spam.log nonspam.log ham.log NSBASE SPBASE tmp make.output freqs perceptron.scores \
+ gen-$NAME.out gen-$NAME.scores gen-$NAME.validate
make clean >/dev/null

# Generate 90/10 split logs
@@ -34,7 +34,7 @@
mv split-10.log spam-validate.log
cd ..

-echo "[Setting up for GA run]"
+echo "[Setting up for gen run]"
# Ok, setup for a run
ln -s SPBASE/spam.log .
ln -s NSBASE/nonspam.log .
@@ -43,25 +43,27 @@
ln -s NSBASE/nonspam-validate.log .
ln -s NSBASE/nonspam-validate.log ham-validate.log

-echo "[Generating evolve]"
-# Generate evolve with full logs
-make -j `cpucount` SCORESET=$SCORESET > make.output 2>&1
-
-( echo "[GA run start]" ; pwd ; date ; \
-./evolve -b 20; \
-mv craig-evolve.scores GA-$NAME.scores ; \
-echo "[GA run end]" ; pwd ; date ) | tee GA-$NAME.out
+numcpus=`cpucount || echo 1` # cpucount isn't available everywhere
+
+echo "[Generating perceptron]"
+# Generate perceptron with full logs
+make -j $numcpus SCORESET=$SCORESET > make.output 2>&1
+
+( echo "[gen run start]" ; pwd ; date ; \
+./perceptron -p 2.0 -e 100; \
+mv perceptron.scores gen-$NAME.scores ; \
+echo "[gen run end]" ; pwd ; date ) | tee gen-$NAME.out

else

# This needs to have 50_scores.cf in place first ...
-echo "[GA validation results]"
+echo "[gen validation results]"
./logs-to-c --spam=SPBASE/spam-validate.log \
--nonspam=NSBASE/nonspam-validate.log \
- --count --cffile=../rules --scoreset=$SCORESET | tee GA-$NAME.validate
+ --count --cffile=../rules --scoreset=$SCORESET | tee gen-$NAME.validate

echo "[STATISTICS file generation]"
-./mk-baseline-results $SCORESET | tee GA-$NAME.statistics
+./mk-baseline-results $SCORESET | tee gen-$NAME.statistics
fi

exit 0

Modified: incubator/spamassassin/trunk/rules/50_scores.cf
==============================================================================
--- incubator/spamassassin/trunk/rules/50_scores.cf (original)
+++ incubator/spamassassin/trunk/rules/50_scores.cf Tue Jan 27 00:24:29 2004
@@ -97,7 +97,7 @@
# False negatives: 164 0.49% (0.92% of spam, 437 weighted)
# TCR: 74.527197 SpamRecall: 99.079% SpamPrec: 99.915% FP: 0.04% FN: 0.49%

-# Start of GA-evolved scores
+# Start of generated scores

score ACCEPT_CREDIT_CARDS 2.507 2.596 0 1.037
score ACT_NOW_CAPS 0.545 1.343 0 0.699
@@ -881,9 +881,9 @@
score DOMAIN_SUBJECT 0
score RATWARE_EVAMAIL 2.900 2.800 0 0

-# End of GA-evolved scores.
+# End of generated scores.

-# Scores for tests that are scored manually or with isolated GA runs. Most
+# Scores for tests that are scored manually or with isolated rescore runs. Most
# are net tests, userconf tests, tests occuring with very low frequency, or
# tests with many false positives.

@@ -994,7 +994,7 @@
# accessdb lookups
score ACCESSDB 0

-# GA never changes the whitelist/blacklist scores
+# rescore never changes the whitelist/blacklist scores

score USER_IN_BLACKLIST 100.000
score USER_IN_WHITELIST -100.000

Modified: incubator/spamassassin/trunk/spamassassin.spec
==============================================================================
--- incubator/spamassassin/trunk/spamassassin.spec (original)
+++ incubator/spamassassin/trunk/spamassassin.spec Tue Jan 27 00:24:29 2004
@@ -56,7 +56,7 @@
SpamAssassin provides you with a way to reduce, if not completely eliminate,
Unsolicited Bulk Email (or "spam") from your incoming email. It can be
invoked by a MDA such as sendmail or postfix, or can be called from a procmail
-script, .forward file, etc. It uses a genetic-algorithm-evolved scoring system
+script, .forward file, etc. It uses a perceptron-optimized scoring system
to identify messages which look spammy, then adds headers to the message so
they can be filtered by the user's mail reading software. This distribution
includes the spamc/spamc components which considerably speeds processing of