Hi,
On Fri, 5 Mar 2004 14:24:24 -0000 (GMT) "Andy Blanchard" <andyb@zocalo.uk.com> wrote:
> ...I've basically
> rolled my own (my needs were not that sophisticated) by using an excellent
> Perl script called "hgrep" that allows you to grep mail headers by
> intelligently dealing with the line breaks for you. You can grab a copy
> from here:
>
> http://www.cpan.org/authors/id/E/EL/ELIJAH/
... which led me to analyze my spam corpus (the 3219 messages flagged by
SA 2.6x) to get my Top 25 rule list (percentage of the corpus, number of
messages hit, rule name):
86.30% 2778 HTML_MESSAGE
82.82% 2666 BAYES_99
71.51% 2302 MIME_HTML_ONLY
54.61% 1758 MIME_HTML_NO_CHARSET
50.85% 1637 RCVD_IN_SORBS
37.65% 1212 BIZ_TLD
34.61% 1114 DCC_CHECK
33.55% 1080 HTML_FONT_BIG
26.53% 854 MISSING_MIMEOLE
25.82% 831 RAZOR2_CHECK
25.41% 818 FORGED_OUTLOOK_TAGS
24.79% 798 MIME_HTML_ONLY_MULTI
24.42% 786 RAZOR2_CF_RANGE_51_100
23.33% 751 RCVD_IN_NJABL
22.83% 735 HTML_FONTCOLOR_RED
22.24% 716 CLICK_BELOW
21.78% 701 HTML_IMAGE_ONLY_02
21.00% 676 RCVD_IN_DSBL
19.01% 612 HTML_FONT_INVISIBLE
18.27% 588 HTML_70_80
18.11% 583 USERPASS
17.05% 549 FORGED_OUTLOOK_HTML
16.56% 533 HTML_FONTCOLOR_UNKNOWN
15.91% 512 HTML_60_70
15.56% 501 RCVD_IN_RFCI
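
For anyone who wants to produce a list like the one above without hgrep,
here is a rough Python sketch. It assumes each message carries an
X-Spam-Status header with a "tests=RULE1,RULE2,..." field and that folded
header lines break after a comma; check your own headers first. The
function names and the mbox path argument are mine, not part of
SpamAssassin.

```python
import mailbox
import re
from collections import Counter


def rules_from_status(status):
    """Return the rule names listed in one X-Spam-Status header value."""
    # Assumption: folded header lines continue after a comma, so removing
    # whitespace that follows a comma rejoins the test list.
    unfolded = re.sub(r",\s+", ",", status)
    match = re.search(r"tests=(\S+)", unfolded)
    if not match:
        return []
    return [name for name in match.group(1).split(",")
            if name and name != "none"]


def top_rules(statuses, limit=25):
    """Return (percent, hits, rule) rows, most frequently hit rule first."""
    statuses = list(statuses)
    counts = Counter()
    for status in statuses:
        # set() so a rule counts at most once per message
        counts.update(set(rules_from_status(status)))
    total = len(statuses)
    return [(100.0 * hits / total, hits, rule)
            for rule, hits in counts.most_common(limit)]


def statuses_from_mbox(path):
    """Collect X-Spam-Status values from an mbox file (path is up to you)."""
    return [str(msg.get("X-Spam-Status", "")) for msg in mailbox.mbox(path)]
```

Something like top_rules(statuses_from_mbox("spam.mbox")) would then give
rows you can print as "%6.2f%% %5d %s".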
And with all the network and Bayes tests removed:
86.30% 2778 HTML_MESSAGE
71.51% 2302 MIME_HTML_ONLY
54.61% 1758 MIME_HTML_NO_CHARSET
37.65% 1212 BIZ_TLD
33.55% 1080 HTML_FONT_BIG
26.53% 854 MISSING_MIMEOLE
25.41% 818 FORGED_OUTLOOK_TAGS
24.79% 798 MIME_HTML_ONLY_MULTI
22.83% 735 HTML_FONTCOLOR_RED
22.24% 716 CLICK_BELOW
21.78% 701 HTML_IMAGE_ONLY_02
19.01% 612 HTML_FONT_INVISIBLE
18.27% 588 HTML_70_80
18.11% 583 USERPASS
17.05% 549 FORGED_OUTLOOK_HTML
16.56% 533 HTML_FONTCOLOR_UNKNOWN
15.91% 512 HTML_60_70
13.20% 425 HTML_FONTCOLOR_UNSAFE
12.30% 396 MISSING_OUTLOOK_NAME
12.24% 394 HTML_FONTCOLOR_BLUE
11.96% 385 PENIS_ENLARGE2
11.18% 360 HTML_50_60
10.84% 349 HTML_LINK_CLICK_HERE
9.94% 320 HTTP_EXCESSIVE_ESCAPES
9.85% 317 DATE_IN_FUTURE_12_24
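
The filtering above can be approximated by rule-name prefix. This is only
a sketch: the prefix list is inferred from which rules dropped out between
the two lists in this mail, not from SpamAssassin's own rule metadata, so
treat it as an assumption and extend it for your setup.

```python
# Prefixes inferred from the rules that disappear in the second list
# (BAYES_99, RCVD_IN_*, DCC_CHECK, RAZOR2_*); an assumption, not an
# official classification.
NETWORK_OR_BAYES_PREFIXES = ("BAYES_", "RCVD_IN_", "DCC_", "RAZOR2_")


def is_local_rule(rule):
    """True if the rule name does not look like a network or Bayes test."""
    return not rule.startswith(NETWORK_OR_BAYES_PREFIXES)


def local_only(rows):
    """Keep only (percent, hits, rule) rows whose rule is a local test."""
    return [row for row in rows if is_local_rule(row[2])]
```

Filter the full ranking first and cut it to 25 afterwards, otherwise the
list comes out shorter than 25.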
Note that during this period the Tripwire rules changed name from
FVGT_TRIPWIRE_xx to TW_xx, so their hits are split across both names.
Rule sets like Chickenpox, Backhair, Weeds, and Tripwire should also be
condensed into groups.
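
A possible way to do that grouping, as a sketch only: map rule-name
prefixes to a group label and count each group once per message. The two
Tripwire prefixes come from the rename mentioned above; the group label
and the function names are made up for illustration, and the prefixes for
Chickenpox, Backhair, and Weeds are left for you to fill in from your own
rule files.

```python
from collections import Counter

GROUP_PREFIXES = {
    "FVGT_TRIPWIRE_": "TRIPWIRE",  # old Tripwire naming
    "TW_": "TRIPWIRE",             # new Tripwire naming
    # Add the prefixes your Chickenpox, Backhair, and Weeds rules use here;
    # they are deliberately not guessed in this sketch.
}


def group_of(rule):
    """Return the group label for a rule, or the rule name if ungrouped."""
    for prefix, group in GROUP_PREFIXES.items():
        if rule.startswith(prefix):
            return group
    return rule


def grouped_counts(messages):
    """Count messages per group; `messages` is a list of rule-name lists."""
    counts = Counter()
    for rules in messages:
        # A message that hits several rules of one group counts once.
        counts.update({group_of(rule) for rule in rules})
    return counts
```

Counting per message matters here: summing the per-rule totals would
count a mail twice if it trips two rules from the same set.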
One should analyze ham as well to see which tests it triggers; if you
want good, detailed statistics, you might as well run a mass-check.
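
Short of a full mass-check, a quick side-by-side of spam and ham hit rates
could look like the sketch below. It assumes you already have, for each
corpus, one list of rule names per message; the function names are
hypothetical and this is not what mass-check itself does.

```python
from collections import Counter


def hit_rates(messages):
    """Map each rule to the fraction of messages that hit it."""
    counts = Counter()
    for rules in messages:
        counts.update(set(rules))  # one hit per message
    total = len(messages)
    if total == 0:
        return {}
    return {rule: hits / total for rule, hits in counts.items()}


def compare(spam, ham):
    """Return (rule, spam_rate, ham_rate) rows, highest ham rate first."""
    spam_rates = hit_rates(spam)
    ham_rates = hit_rates(ham)
    rows = [(rule, spam_rates.get(rule, 0.0), ham_rates.get(rule, 0.0))
            for rule in set(spam_rates) | set(ham_rates)]
    # Sort by ham rate (descending), then by name for a stable order.
    return sorted(rows, key=lambda row: (-row[2], row[0]))
```

Rules near the top of that output fire often on ham, so a high spam
percentage alone says little about how useful they are.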
fyi,
-- Bob