[Spamassassin Wiki] Update of "DnsblAccuracy082005" by JustinMason
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Spamassassin Wiki" for change notification.

The following page has been changed by JustinMason:
http://wiki.apache.org/spamassassin/DnsblAccuracy082005

The comment on the change is:
everybody likes DNSBL stats ;)

New page:
= DNS Blocklist Accuracy Figures (as of July 2005) =

Many people, whether they use SpamAssassin or not, find accuracy figures for DNSBLs useful. Here are accuracy figures for the DNS blocklists included in SpamAssassin 3.1.0, as measured during our July rescoring run. We used the following techniques to ensure that these figures are accurate:

* hits are recorded from 'live' data at the time the messages were received (using 'mass-check --reuse'), not from post-facto testing
* there were 9 people contributing their hit data, from a variety of geographical locations and organisational types
* both ["Ham"] and ["Spam"] hitrates are measured, and the corpora were hand-verified in advance
* the corpora use (relatively) fresh mail, received between January 2004 and July 2005

123778 spam messages and 53091 ham messages were used:

{{{
OVERALL% SPAM% HAM% S/O RANK SCORE NAME
176869 123778 53091 0.700 0.00 0.00 (all messages)
100.000 69.9829 30.0171 0.700 0.00 0.00 (all messages as %)
}}}

These messages were randomly chosen from all contributors' logs (see below).

First off, the DNS blocklists.

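Note: the rows are sorted by mass-check's RANK metric, which puts 'better' results
near the top, and the results are in HitFrequencies format, with the same columns
as the header row above. The 'S/O', 'SPAM%', and 'HAM%' columns are the most
important metrics. For blocklists, S/O values approaching 1.0 are best; for the two
negatively-scored rules (RCVD_IN_IADB_VOUCHED and RCVD_IN_BSP_TRUSTED), which are
meant to hit ham, values approaching 0.0 are best.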

{{{
17.449 24.9285 0.0113 1.000 0.97 3.90 RCVD_IN_XBL
3.841 5.4824 0.0132 0.998 0.88 2.16 RCVD_IN_SORBS_SOCKS
5.865 8.3690 0.0283 0.997 0.88 3.16 RCVD_IN_SBL
9.438 13.4652 0.0490 0.996 0.84 2.23 RCVD_IN_WHOIS_INVALID
2.237 3.1839 0.0301 0.991 0.79 0.02 RCVD_IN_SORBS_MISC
27.913 39.8423 0.0998 0.998 0.76 2.60 RCVD_IN_DSBL
4.914 6.9883 0.0772 0.989 0.74 0.02 RCVD_IN_SORBS_HTTP
0.914 1.3015 0.0113 0.991 0.72 2.77 RCVD_IN_NJABL_SPAM
7.692 10.9486 0.0979 0.991 0.72 2.43 RCVD_IN_WHOIS_BOGONS
22.130 31.5662 0.1300 0.996 0.71 2.05 RCVD_IN_SORBS_DUL
10.642 15.1449 0.1450 0.991 0.67 0.72 RCVD_IN_NJABL_PROXY
18.739 26.6946 0.1921 0.993 0.61 1.95 RCVD_IN_NJABL_DUL
0.345 0.4888 0.0094 0.981 0.57 0.20 RCVD_IN_SORBS_SMTP
5.309 7.5062 0.1865 0.976 0.56 1.46 RCVD_IN_SORBS_WEB
16.300 23.1463 0.3372 0.986 0.53 1.56 RCVD_IN_BL_SPAMCOP_NET
0.166 0.0016 0.5481 0.003 0.47 -2.20 RCVD_IN_IADB_VOUCHED
0.161 0.0016 0.5330 0.003 0.47 -4.30 RCVD_IN_BSP_TRUSTED
0.096 0.1365 0.0000 1.000 0.41 1.00 RCVD_IN_WHOIS_HIJACKED
0.118 0.1656 0.0075 0.956 0.41 0.10 RCVD_IN_NJABL_RELAY
0.040 0.0533 0.0094 0.850 0.32 0.26 RCVD_IN_SORBS_ZOMBIE
0.000 0.0000 0.0000 0.500 0.28 0.00 RCVD_IN_SORBS_BLOCK
0.000 0.0000 0.0000 0.500 0.28 0.00 RCVD_IN_NJABL_MULTI
0.000 0.0000 0.0000 0.500 0.28 0.00 RCVD_IN_NJABL_CGI
}}}
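
As a rough guide to reading the S/O column: the figures in the table above are consistent with S/O being SPAM% divided by the sum of SPAM% and HAM%, with 0.500 shown for rules that never hit. The following Python sketch reproduces a few rows under that assumption. The function name `s_o_ratio` is illustrative only and is not part of SpamAssassin.

```python
def s_o_ratio(spam_pct: float, ham_pct: float) -> float:
    """Return spam_pct / (spam_pct + ham_pct).

    Assumption: a rule with no hits at all is reported as 0.5,
    which matches the zero-hit rows in the table.
    """
    total = spam_pct + ham_pct
    if total == 0:
        return 0.5
    return spam_pct / total


# (SPAM%, HAM%) pairs copied from the table above.
rows = {
    "RCVD_IN_XBL": (24.9285, 0.0113),
    "RCVD_IN_BL_SPAMCOP_NET": (23.1463, 0.3372),
    "RCVD_IN_IADB_VOUCHED": (0.0016, 0.5481),
    "RCVD_IN_SORBS_BLOCK": (0.0, 0.0),
}

for name, (spam_pct, ham_pct) in rows.items():
    print(f"{s_o_ratio(spam_pct, ham_pct):.3f} {name}")
```

Because both inputs are percentages of their own corpus, the ratio is not skewed by the corpus containing more spam than ham.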

URI blocklist lookups, against SURBL and SBL:

{{{
17.882 25.5522 0.0000 1.000 1.00 4.50 URIBL_SC_SURBL
9.684 13.8369 0.0019 1.000 0.98 3.81 URIBL_AB_SURBL
34.260 48.9497 0.0132 1.000 0.98 4.09 URIBL_JP_SURBL
36.356 51.9317 0.0414 0.999 0.90 3.01 URIBL_OB_SURBL
30.956 44.1605 0.1695 0.996 0.66 2.14 URIBL_WS_SURBL
0.266 0.3805 0.0000 1.000 0.56 2.80 URIBL_PH_SURBL
22.415 31.8425 0.4370 0.986 0.49 1.64 URIBL_SBL
}}}
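
For anyone wanting to process these tables, here is a minimal Python sketch that splits a row into named fields, assuming the seven whitespace-separated columns shown in the header row (OVERALL%, SPAM%, HAM%, S/O, RANK, SCORE, NAME). The `HitFreq` type and `parse_row` function are hypothetical names for illustration, not part of any SpamAssassin tool.

```python
from typing import NamedTuple


class HitFreq(NamedTuple):
    overall_pct: float
    spam_pct: float
    ham_pct: float
    s_o: float
    rank: float
    score: float
    name: str


def parse_row(line: str) -> HitFreq:
    """Parse one table row; the name is everything after the sixth number."""
    fields = line.strip().split(maxsplit=6)
    if len(fields) != 7:
        raise ValueError(f"expected 7 columns, got {len(fields)}: {line!r}")
    *numbers, name = fields
    return HitFreq(*(float(n) for n in numbers), name)


# A few rows copied from the URI blocklist table above.
table = """\
17.882 25.5522 0.0000 1.000 1.00 4.50 URIBL_SC_SURBL
9.684 13.8369 0.0019 1.000 0.98 3.81 URIBL_AB_SURBL
0.266 0.3805 0.0000 1.000 0.56 2.80 URIBL_PH_SURBL
22.415 31.8425 0.4370 0.986 0.49 1.64 URIBL_SBL
"""

rows = [parse_row(line) for line in table.splitlines() if line.strip()]

# Example query: rules that never hit ham in this sample.
zero_ham = [row for row in rows if row.ham_pct == 0.0]
for row in zero_ham:
    print(f"{row.name}: {row.spam_pct}% of spam, no ham hits")
```

Splitting with `maxsplit=6` keeps a name containing spaces, such as "(all messages)", in one field.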

SPF lookups:

{{{
3.437 4.8942 0.0396 0.992 0.80 1.38 SPF_SOFTFAIL
1.006 1.4292 0.0207 0.986 0.71 2.43 SPF_HELO_SOFTFAIL
2.550 3.5717 0.1676 0.955 0.53 1.14 SPF_FAIL
2.297 3.2090 0.1695 0.950 0.52 1.07 SPF_NEUTRAL
1.796 2.5029 0.1488 0.944 0.51 0.00 SPF_HELO_FAIL
0.935 1.2724 0.1488 0.895 0.43 0.00 SPF_HELO_NEUTRAL
5.334 2.5925 11.7252 0.181 0.21 -0.00 SPF_HELO_PASS
3.267 2.6241 4.7654 0.355 0.10 -0.00 SPF_PASS
}}}

More details of the source mass-check log files and test procedure can be read
in SpamAssassin bug 4505. The full list of hit frequencies can be found in the STATISTICS-set3.txt file in the 3.1.0 release.

Here's a list of the data files used. Note that only a randomly chosen one
tenth of each file was used.

{{{
bash-3.00$ ls -l /home/corpus-rsync/corpus/submit/
total 2839184
-r--r--r-- 1 rsync rsync 7967268 Jul 16 18:18 ham-bayes-net-bzoetekouw.log
-r--r--r-- 1 rsync rsync 1987090 Jul 16 14:49 ham-bayes-net-cthielen.log
-r--r--r-- 1 rsync rsync 23284450 Jul 24 08:04 ham-bayes-net-daf.log
-r--r--r-- 1 rsync rsync 51469171 Jul 19 02:26 ham-bayes-net-jm.log
-r--r--r-- 1 rsync rsync 45026386 Jul 19 02:27 ham-bayes-net-jm2.log
-r--r--r-- 1 rsync rsync 294744 Jul 25 18:57 ham-bayes-net-misak.log
-r--r--r-- 1 rsync rsync 22130676 Jul 27 04:17 ham-bayes-net-parkerm.log
-r--r--r-- 1 rsync rsync 14056970 Jul 27 19:37 ham-bayes-net-quinlan.log
-r--r--r-- 1 rsync rsync 8603737 Jul 27 17:01 ham-bayes-net-rod.log
-r--r--r-- 1 rsync rsync 28410747 Jul 27 02:34 ham-bayes-net-theo.log
-r--r--r-- 1 rsync rsync 62685697 Jul 16 18:22 spam-bayes-net-bzoetekouw.log
-r--r--r-- 1 rsync rsync 11891366 Jul 16 14:50 spam-bayes-net-cthielen.log
-r--r--r-- 1 rsync rsync 96553037 Jul 24 08:09 spam-bayes-net-daf.log
-r--r--r-- 1 rsync rsync 28662170 Jul 19 02:28 spam-bayes-net-jm.log
-r--r--r-- 1 rsync rsync 209202453 Jul 19 02:34 spam-bayes-net-jm2.log
-r--r--r-- 1 rsync rsync 243487 Jul 25 18:57 spam-bayes-net-misak.log
-r--r--r-- 1 rsync rsync 39357821 Jul 27 04:19 spam-bayes-net-parkerm.log
-r--r--r-- 1 rsync rsync 41987897 Jul 27 19:39 spam-bayes-net-quinlan.log
-r--r--r-- 1 rsync rsync 97404262 Jul 27 17:03 spam-bayes-net-rod.log
-r--r--r-- 1 rsync rsync 358576609 Jul 27 02:34 spam-bayes-net-theo.log
}}}
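
The exact sampling procedure is not described on this page (bug 4505 has the details). As an illustration only, here is one way a random one-tenth sample of a large log could be taken in Python, keeping each line independently with probability 0.1. The function name `sample_fraction`, the fixed seed, and the per-line approach are all assumptions and not the method actually used for the rescoring run.

```python
import random
from typing import Iterable, Iterator


def sample_fraction(
    lines: Iterable[str], fraction: float = 0.1, seed: int = 0
) -> Iterator[str]:
    """Yield each line with probability `fraction`.

    Streams the input, so a multi-hundred-megabyte log never has to be
    held in memory. The result is approximately, not exactly, the
    requested fraction of the input.
    """
    rng = random.Random(seed)  # fixed seed makes the sample repeatable
    for line in lines:
        if rng.random() < fraction:
            yield line


# Stand-in data; a real run would iterate over an open log file instead.
demo = [f"message-{i}" for i in range(10_000)]
kept = list(sample_fraction(demo))
print(len(kept), "of", len(demo), "lines kept")
```

If an exact tenth is needed, `random.sample` over the full list of lines would do that at the cost of loading the whole file.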
[Spamassassin Wiki] Update of "DnsblAccuracy082005" by JustinMason
The comment on the change is:
--reuse status

------------------------------------------------------------------------------

Many people, whether they use SpamAssassin or not, find accuracy figures for DNSBLs to be useful. Here are accuracy figures for the DNS blocklists included in SpamAssassin 3.1.0, as measured during our July rescoring run. We use the following techniques to assure high accuracy on these figures:

- * hits are recorded from 'live' data at the time the messages were received, not post-facto testing (using 'mass-check --reuse')
+ * some hits are recorded from 'live' data at the time the messages were received, not post-facto testing (using 'mass-check --reuse')
* there were 9 people contributing their hit data, from a variety of geographical locations and organisational types
* both ["Ham"] and ["Spam"] hitrates are measured, and the corpora were hand-verified in advance
* the corpora use (relatively) fresh mail, received between January 2004 and July 2005
@@ -81, +81 @@

Here's a list of the data files used. Note that only a randomly-chosen one
tenth of each file was used.

+ Use of --reuse for real-time network results: confirmed on: 2 users (bmenschel, jm); confirmed off: 1 user (duncf); unknown: 6 users.
+
{{{
bash-3.00$ ls -l /home/corpus-rsync/corpus/submit/
total 2839184
[Spamassassin Wiki] Update of "DnsblAccuracy082005" by JustinMason
The comment on the change is:
we really need to figure out a way for mass-check logs to note "--reuse" status

------------------------------------------------------------------------------
Here's a list of the data files used. Note that only a randomly-chosen one
tenth of each file was used.

- Use of --reuse for real-time network results: confirmed on: 4 users (bmenschel, jm, parker, cthielen); confirmed off: 1 user (duncf); unknown: 5 users.
+ Use of --reuse for real-time network results: confirmed on: 4 users (bmenschel, jm, parker, cthielen); confirmed off: 1 user (duncf); unknown: 4 users (bzoetekouw, misak, quinlan, theo).

{{{
bash-3.00$ ls -l /home/corpus-rsync/corpus/submit/
[Spamassassin Wiki] Update of "DnsblAccuracy082005" by JustinMason
The comment on the change is:
add some other network test results

------------------------------------------------------------------------------
3.267 2.6241 4.7654 0.355 0.10 -0.00 SPF_PASS
}}}

+ RFC-ignorant, testing against the envelope sender's domain:
+
+ {{{
+ 3.038 4.3352 0.0132 0.997 0.86 2.60 DNS_FROM_RFC_DSN
+ 1.174 1.6715 0.0151 0.991 0.75 1.94 DNS_FROM_RFC_BOGUSMX
+ 3.590 5.0607 0.1620 0.969 0.57 1.45 DNS_FROM_RFC_WHOIS
+ 13.930 19.7071 0.4615 0.977 0.47 1.71 DNS_FROM_RFC_POST
+ 12.120 16.7154 1.4051 0.922 0.34 0.20 DNS_FROM_RFC_ABUSE
+ }}}
+
+ Other network rules:
+
+ {{{
+ 1.898 2.7081 0.0094 0.997 0.82 3.20 NO_DNS_FOR_FROM
+ 1.449 2.0593 0.0245 0.988 0.74 1.51 DNS_FROM_SECURITYSAGE
+ 7.200 10.0898 0.4615 0.956 0.44 0.23 DNS_FROM_AHBL_RHSBL
+ }}}
+
More details of the source mass-check log files and test procedure can be read
in SpamAssassin bug 4505. The full list of hit frequencies can be found in the STATISTICS-set3.txt file in the 3.1.0 release.
Here's a list of the data files used. Note that only a randomly-chosen one