Mailing List Archive

Any Statistics available from SpamAssassin?
I have been using SpamAssassin for a year or so, and couldn't
get along without it.

Still, I keep seeing writeups of SpamAssassin where they say
that it is 99+% efficient at recognizing spam, at least now
that it has the Bayesian filtering in it...

Well, that's not the case here. It does recognize on the order
of 300 messages a day (thank you, thank you), but probably
misses on the order of another 75-100.

So that's more like 75-80%, not 99+%.
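
For reference, the 75-80% figure follows directly from the counts above
(300 caught, 75-100 missed). A minimal sketch of the arithmetic; the
function name catch_rate is made up for illustration:

```python
def catch_rate(caught: int, missed: int) -> float:
    """Fraction of all spam received that was actually flagged."""
    total = caught + missed
    if total == 0:
        raise ValueError("no spam counted")
    return caught / total


# Counts taken from the message: about 300 caught, 75 to 100 missed per day.
low = catch_rate(300, 100)
high = catch_rate(300, 75)
print(f"catch rate between {low:.0%} and {high:.0%}")
```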

Now the Bayesian filtering is up (I think) and primed.
The DCC stuff is up...

But HOW DO I TELL IF THESE THINGS ARE REALLY WORKING?

Is there any way to get any statistics out of SpamAssassin?

I do see comments about the DCC stuff in the maillog on occasion,
mostly complaints about not being able to get to a specific site,
but it would be nice to know if the Bayesian stuff is working at all,
or if it (SpamAssassin) was having long-term problems getting to DCC sites.

I don't see any 'flags' for any statistics; am I missing something?
Even a script to grep the 'tossed' messages (I save them for a few
days) would be acceptable, but at the moment SpamAssassin is a great
big black box: it seems to work, but it could be having real problems
and I wouldn't have a clue.

Well, that's longer than I wanted the message to be, but...


--
Reg.Clemens
reg@dwf.com
Re: Any Statistics available from SpamAssassin?
| Now the Bayesian filtering is up (I think) and primed.
| The DCC stuff is up...
|
| But HOW DO I TELL IF THESE THINGS ARE REALLY WORKING?


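Adding
    add_header spam DCC _DCCB_: _DCCR_
to your configuration will show the DCC results in a header on messages
flagged as spam, and the Bayes rules will show up in the list of tests
once they start hitting, after sufficient training.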
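
Once that header is in place, a rough way to see whether DCC is answering
at all is to tally the header over the saved spam. This is only a sketch:
it assumes the add_header line above produces a header named X-Spam-DCC,
and the sample messages and their header values are placeholders, not real
DCC output.

```python
from collections import Counter
from email import message_from_string


def dcc_header_summary(raw_messages):
    """Count the distinct X-Spam-DCC header values across raw messages."""
    counts = Counter()
    for raw in raw_messages:
        value = message_from_string(raw).get("X-Spam-DCC")
        counts[value.strip() if value else "(no DCC header)"] += 1
    return counts


# Placeholder messages; in practice, read the saved spam from disk.
sample = [
    "X-Spam-DCC: placeholder-brand: placeholder-result\n\nbody\n",
    "X-Spam-DCC: placeholder-brand: placeholder-result\n\nbody\n",
    "Subject: no DCC information here\n\nbody\n",
]
for value, n in dcc_header_summary(sample).most_common():
    print(n, value)
```

If nearly every saved message ends up under "(no DCC header)", that would
suggest the DCC lookups are not completing.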

Greg


Re: Any Statistics available from SpamAssassin?
On Thu, 19 Feb 2004 reg@dwf.com wrote:

> Well, that's not the case here. It does recognize on the order
> of 300 messages a day (thank you, thank you), but probably
> misses on the order of another 75-100.

You might run a spam through spamassassin with the debug flag (-D) on and
check to make sure all the DNS blacklists are actually getting queried.

Mojo
--
Morris Jones <*>
Monrovia, CA
mojo@whiteoaks.com
http://www.whiteoaks.com
Re: Any Statistics available from SpamAssassin?
Thanks for the reply.
Actually, I was not clear enough about what I was after.

Knowing what percentage of the spam messages SpamAssassin
actually caught might be of interest to an author, but not
necessarily to a user (me).

I was looking for something simpler: just the number of messages
that were marked as spam by each of SpamAssassin's subsystems.

And if 'A' marks it before 'B' looks at it, I don't care if
'B' might have marked it. I'm just looking for something that
says that method 'C' didn't catch anything at all,
to tell me that something is wrong with 'C' (in my install).
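
A per-subsystem count like the one described above can be approximated
from the saved spam itself, since SpamAssassin lists the rules that fired
in the tests= field of the X-Spam-Status header. The following is a
minimal sketch, not a supported tool: the watch list is an assumption
(pick the rule names your own install should be producing), and the
sample messages are invented.

```python
import re
from collections import Counter
from email import message_from_string

# Assumed watch list: one rule per subsystem we expect to see firing.
WATCHED = ("BAYES_99", "DCC_CHECK", "RAZOR2_CHECK")

# Capture everything after tests= up to the next "name=" field or the end.
TESTS_RE = re.compile(r"tests=(.*?)(?:\s+\w+=|$)", re.S)


def rule_hits(raw_messages):
    """Count how many messages each rule name appears in."""
    counts = Counter()
    for raw in raw_messages:
        status = message_from_string(raw).get("X-Spam-Status", "")
        match = TESTS_RE.search(status)
        if not match:
            continue
        names = {n.strip() for n in match.group(1).split(",")}
        counts.update(n for n in names if n and n != "none")
    return counts


def silent_rules(counts, watched=WATCHED):
    """Watched rules that never fired: candidates for a broken subsystem."""
    return [rule for rule in watched if counts[rule] == 0]


# Invented sample; in practice, read the saved 'tossed' messages from disk.
sample = [
    "X-Spam-Status: Yes, hits=12.3 required=5.0 tests=BAYES_99,DCC_CHECK,\n"
    "\tEXAMPLE_RULE autolearn=no version=2.63\n\nbody\n",
    "X-Spam-Status: Yes, hits=7.1 required=5.0 tests=BAYES_99 version=2.63\n\nbody\n",
]
hits = rule_hits(sample)
print(hits.most_common())
print("never fired:", silent_rules(hits))
```

A watched rule that never shows up over a few days of saved spam is the
'C' case: worth a closer look at that subsystem's configuration.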

--
Reg.Clemens
reg@dwf.com
Re: Any Statistics available from SpamAssassin?
I think that your figure of 80% is probably pretty accurate for a default
install. I watched my spam closely, was able to see a few patterns
myself, and adjusted a couple of things accordingly.

I noticed that a very large portion of my spam, missed or not, was
coming from SORBS-listed and other blacklisted addresses. So I raised
the score of each of those rules to 2.5, 7-8 of them in total. With just
those and a default trained Bayes, I was consistently hitting
about 93%.
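
To illustrate why raising a blacklist rule's score helps: a message is
flagged when the scores of the rules it hits add up to the required
threshold. The sketch below uses made-up rule names and default scores;
only the 2.5 value and the 5.0 threshold come from this thread.

```python
def total_score(hit_rules, scores):
    """Sum the configured scores of the rules a message hit."""
    return sum(scores.get(rule, 0.0) for rule in hit_rules)


def is_spam(hit_rules, scores, required=5.0):
    """Flag the message when the total reaches the required threshold."""
    return total_score(hit_rules, scores) >= required


# Hypothetical rules and scores, for illustration only.
default_scores = {"EXAMPLE_DNSBL": 1.0, "EXAMPLE_BODY_RULE": 2.9}
tuned_scores = dict(default_scores, EXAMPLE_DNSBL=2.5)

hits = ["EXAMPLE_DNSBL", "EXAMPLE_BODY_RULE"]
print("default:", is_spam(hits, default_scores))
print("tuned:  ", is_spam(hits, tuned_scores))
```

With the blacklist rule at its hypothetical default the message stays
under 5.0; at 2.5 the same two hits are enough to cross it.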

I also installed several add-on rule sets: weeds, chickenpox, antidrug. I
left out backhair for fear of FPs on the attachments that I get a lot of.

I've installed Razor2 and DCC. I haven't gotten to Pyzor yet;
I'm evaluating each one as I go.

Now with a default autolearned Bayes of about 18000 messages, which has
quite a few mislearned emails, I'm correctly catching 99.1% of my spam
messages with only 1 FP, at a spam threshold of 5.0.

After getting Pyzor running, I'm going to dump my Bayes database and
actively train it.

So for a set-and-forget install, a rate of 80% is good. If you babysit it
for a few thousand emails (a couple of days here), you can hit those
numbers.

Bryan Britt
Beltane Web Services


--
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ICQ: 53037451
Bryan L. Britt 501-327-8558
Beltane Web Services, Conway, AR http://www.beltane.com
~~~~~~~~~~Support Private Communications on the Internet~~~~~~~~~~



Re: Any Statistics available from SpamAssassin?
Bryan Britt said:
>
> Now with a default autolearned Bayes of about 18000 messages, which has
> quite a few mislearned emails, I'm correctly catching 99.1% of my spam
> messages with only 1 FP, at a spam threshold of 5.0.
>
> After getting Pyzor running, I'm going to dump my Bayes database and
> actively train it.
>
> So for a set-and-forget install, a rate of 80% is good. If you babysit it
> for a few thousand emails (a couple of days here), you can hit those
> numbers.
>
Look in the tarball: it has a mass-check statistics file that lists
the FN/FP rates for each score level:
Mail-SpamAssassin-2.63/rules/Statistics.txt

It's all right if your Bayes database has a few mislearned emails in it;
Bayes will adjust for that.
Set your learn-ham/learn-spam thresholds so that you have very few
mislearnings.

I throw in Pyzor/DCC/Razor, and that knocks the average score
up to 17.5.
That is the average over all email that scores 6 or higher, so it is not
the actual average score.

In my case I'm only catching around 65%-70% of the incoming spam, but I
have too high a mail volume to relearn, nor can I have individual user
preferences. I catch more of the mail at the second SA install, which is
more customized.

I was too afraid of FPs to use the SARE rules, except for
backhair.cf and randomword.cf; they both work well.
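
The distinction above, between the average over mail scoring 6 or higher
and the average over all mail, can be checked from the X-Spam-Status
headers. A minimal sketch with invented sample headers, assuming the score
appears as hits=N (SpamAssassin 2.x) or score=N (3.x):

```python
import re
from statistics import mean

# The score field is "hits=" in 2.x headers and "score=" in 3.x headers.
SCORE_RE = re.compile(r"\b(?:hits|score)=(-?\d+(?:\.\d+)?)")


def average_scores(status_headers, cutoff=6.0):
    """Return (mean of scores >= cutoff, mean of all scores)."""
    scores = []
    for header in status_headers:
        match = SCORE_RE.search(header)
        if match:
            scores.append(float(match.group(1)))
    flagged = [s for s in scores if s >= cutoff]
    return (
        mean(flagged) if flagged else None,
        mean(scores) if scores else None,
    )


# Invented header values for illustration.
sample = [
    "Yes, hits=21.0 required=5.0 tests=EXAMPLE_A,EXAMPLE_B",
    "Yes, hits=14.0 required=5.0 tests=EXAMPLE_A",
    "No, hits=1.0 required=5.0 tests=none",
]
above_cutoff, overall = average_scores(sample)
print("mean of scores >= 6:", above_cutoff)
print("mean of all scores: ", overall)
```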

randomword.cf for bayesian poisoning.
body RANDOMWORD_10 /(?:\b(?!(?:from|even|more|were|with)\b)[a-z]{4,12}\s+){10}/
describe RANDOMWORD_10 String of 10+ random words
score RANDOMWORD_10 1

body RANDOMWORD_15 /(?:\b(?!(?:from|even|more|were|with)\b)[a-z]{4,12}\s+){15}/
describe RANDOMWORD_15 String of 15+ random words
score RANDOMWORD_15 3

body RANDOMWORD_20 /(?:\b(?!(?:from|even|more|were|with)\b)[a-z]{4,12}\s+){20}/
describe RANDOMWORD_20 String of 20+ random words
score RANDOMWORD_20 5

# Then I raised these scores by .4 to .6.
# normally .1
score HTML_FONTCOLOR_UNSAFE .5
score HTML_FONTCOLOR_UNKNOWN .5
# normally .4
score HTML_FONTCOLOR_INVISIBLE 1
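
To get a feel for what the RANDOMWORD patterns above match before
enabling them, the same regular expression can be tried outside
SpamAssassin. This is only a sketch: Python's re module accepts this
particular pattern unchanged, but SpamAssassin preprocesses the message
body before applying body rules, so results on real mail may differ. The
sample strings are invented.

```python
import re


def randomword_rule(count):
    """Build the RANDOMWORD pattern for a run of `count` words."""
    return re.compile(
        r"(?:\b(?!(?:from|even|more|were|with)\b)[a-z]{4,12}\s+){%d}" % count
    )


RANDOMWORD_10 = randomword_rule(10)
RANDOMWORD_15 = randomword_rule(15)

# Ten lowercase words of 4 to 12 letters in a row, each followed by a space.
poison = "apple banana cherry grape lemon mango olive peach plum berry "
# Ordinary prose: short words and capitals keep breaking the run.
normal = "Please send the report to me by noon on Friday, thanks. "

print(bool(RANDOMWORD_10.search(poison)))
print(bool(RANDOMWORD_15.search(poison)))
print(bool(RANDOMWORD_10.search(normal)))
```

Ordinary text rarely strings together ten lowercase words of four or more
letters without a short word, capital, or punctuation mark in between,
which is what keeps the false-positive risk of these rules low.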


Computer Science System Administrator
Security Administrator,College of Engineering
Montana State University-Bozeman,Montana