Mailing List Archive

bayes scoring q
I'm trying to come up to speed with spamassassin, so apologies if
I ask some ignorant questions. I've checked the FAQ and haven't
found what I'm looking for.

The situation is that we have spamassassin installed here site-wide.
The problem is that a *lot* of spam gets through, scored fairly
low. Is this a common problem? Probably 50% of what appears to
the eye as obvious spam gets scored in the 0.4 to 4.9 range. We
had some that scored around -4.x because it had Habeas headers in
it that apparently were not detected as fake.

Is this (missing a lot of spam) a common occurrence? Our site config is:

rewrite_subject 0
report_safe 0
skip_rbl_checks 1
use_razor2 0
use_bayes 1
bayes_path /opt/spamassassin/db/bayesdb
bayes_file_mode 666
bayes_auto_learn 1
use_pyzor 0
dns_available test: skuld.ucsd.edu noc.ucsd.edu ucsd.ucsd.edu
use_dcc 0
allow_user_rules 0

So we're using default scoring for almost everything, and would
rather stay with the defaults as much as possible rather than start
hacking at scoring to artificially inflate the scores.

For the most part, bayes is not a factor (I'm hand-running messages
with the -D flag to try to see what's going on).

Any pointers on what else to check?

Thanks much!

-glenn
Re: bayes scoring q
At 05:56 PM 3/10/2004, little@cs.ucsd.edu wrote:
>Probably 50% of what appears to
>the eye as obvious spam gets scored in the 0.4 to 4.9 range. We
>had some that scored around -4.x because it had Habeas headers in
>it that apparently were not detected as fake.

>So we're using default scoring for almost everything, and would
>rather stay with the defaults as much as possible rather than start
>hacking at scoring to artificially inflate the scores.
>
>For the most part, bayes is not a factor (I'm hand-running messages
>with the -D flag to try to see what's going on).

Um... that's a definite misstatement. Bayes almost certainly IS a factor.
Anything "obviously spam" should be getting BAYES_90 or BAYES_99. If you
are getting lots of FN (false negative) mail that's not scoring high in
Bayes, you might want to examine your training.
Re: bayes scoring q
By "not a factor" I meant that it was scoring midrange, so did not
end up contributing to the final score. But you're right, the
database is screwy. Largely because of my problem that spam
scores too low. We have auto-learn, and many spams don't make
a high enough score to be auto-learned as spam. In addition,
some spams actually score low enough (see the habeas problem
I mentioned earlier) to be auto-learned as ham :-(

Any pointers on how to get the non-bayes spam score up, so that
auto-learn will be more useful?

Thanks...
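
The autolearn behavior described above comes down to two thresholds on the
message score. Here is a minimal Python sketch of that decision. The values
0.1 and 12.0 are assumptions (believed to match the stock
bayes_auto_learn_threshold_nonspam and bayes_auto_learn_threshold_spam
settings, but check your installed version), and the function name is
hypothetical. It is not SpamAssassin's actual code, which as far as I know
also leaves some rules out of the score it uses for autolearning.

```python
# Assumed thresholds; verify against your SpamAssassin version's defaults.
NONSPAM_THRESHOLD = 0.1
SPAM_THRESHOLD = 12.0


def autolearn_decision(score,
                       nonspam_threshold=NONSPAM_THRESHOLD,
                       spam_threshold=SPAM_THRESHOLD):
    """Return 'ham', 'spam', or 'none' for a message score (hypothetical helper)."""
    if score < nonspam_threshold:
        return "ham"      # low scorers are trained as ham
    if score >= spam_threshold:
        return "spam"     # only very high scorers are trained as spam
    return "none"         # everything in between is not learned at all


if __name__ == "__main__":
    for score in (-4.2, 0.4, 4.9, 15.0):
        print(score, autolearn_decision(score))
```

With these values a spam that the Habeas rule pulls down to -4.2 is learned
as ham, while one scoring 4.9 is never learned at all, which is the failure
mode described above.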

> From mkettler@evi-inc.com Wed Mar 10 15:00:31 2004
>
> [...]
Re: bayes scoring q
On Wed, 2004-03-10 at 22:56, little@cs.ucsd.edu wrote:
<snip>
> Is this (missing a lot of spam) a common occurrence? Our site config is:
>
<snip>
> skip_rbl_checks 1

Personally, I've found that enabling RBL checks brings a lot of FN
(false negative) spam scores high enough to count as spam, and in some
cases high enough to be auto-learned. I'm not an SA guru by any means,
and YMMV; on the flip side, RBL checks mean some overhead for the DNS
lookups.

> use_razor2 0
> use_pyzor 0

Any reason why you've disabled razor/pyzor? My personal view is 'the
more the better' in terms of anti-spam technologies. One method of
catching spam on its own isn't likely to do much, but when combined with
others it makes a big difference.

Just my £0.02p

[...]

-j

--
-jamie <jamie@silverdream.org> | spamtrap: spam@silverdream.org
w: http://silverdream.org | p: sms@silverdream.org
pgp key @ http://silverdream.org/~jps/pub.key
22:30:01 up 7 days, 7:50, 13 users, load average: 0.10, 0.28, 0.23
Re: bayes scoring q
At 06:05 PM 3/10/2004, little@cs.ucsd.edu wrote:
>We have auto-learn, and many spams don't make
>a high enough score to be auto-learned as spam. In addition,
>some spams actually score low enough (see the habeas problem
>I mentioned earlier) to be auto-learned as ham :-(

Autolearn is a good thing, but how much manual training are you doing?

Autolearning alone as your sole source of bayes training is a very bad
idea, and prone to disaster.

I might also suggest the following to help mitigate some of the habeas damage:

bayes_ignore_header X-Habeas-SWE-1
bayes_ignore_header X-Habeas-SWE-2
bayes_ignore_header X-Habeas-SWE-3
bayes_ignore_header X-Habeas-SWE-4
bayes_ignore_header X-Habeas-SWE-5
bayes_ignore_header X-Habeas-SWE-6
bayes_ignore_header X-Habeas-SWE-7
bayes_ignore_header X-Habeas-SWE-8
bayes_ignore_header X-Habeas-SWE-9

This will make the bayes database never give ham or spam points because an
email has these headers. Since there's already a rule for them, there's no
reason to give "double credit" and give them bayes consideration as well.
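
The effect of ignoring those headers can be pictured as dropping them before
any tokens are produced, so they never reach the Bayes database. The Python
sketch below is a toy illustration only: the tokenizer, the token format and
the function name are assumptions, not SpamAssassin's real implementation.

```python
# Header names taken from the bayes_ignore_header lines above (lowercased).
IGNORED_HEADERS = {"x-habeas-swe-%d" % n for n in range(1, 10)}


def header_tokens(headers, ignored=IGNORED_HEADERS):
    """Turn (name, value) header pairs into tokens, skipping ignored headers.

    Hypothetical helper: real Bayes tokenization is considerably more involved.
    """
    tokens = []
    for name, value in headers:
        if name.lower() in ignored:
            continue  # contributes neither ham nor spam evidence
        tokens.extend("%s:%s" % (name.lower(), word) for word in value.split())
    return tokens


if __name__ == "__main__":
    sample = [("Subject", "cheap pills"),
              ("X-Habeas-SWE-1", "winter into spring")]
    print(header_tokens(sample))
```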
Re: bayes scoring q
Regarding the RBL checks, we just didn't want the overhead (we
process a ton of email). Also, we're a university and some of
our "customer base" is pretty against the RBL concept.

Same thing with razor and pyzor.

Maybe those are what would get us more reasonable scores,
I don't know. But for better or worse, using any of them
would be a difficult sell at this point.

-glenn
Re: bayes scoring q
At 15:40 2004/03/10, Glenn Little wrote:
>Regarding the RBL checks, we just didn't want the overhead (we
>process a ton of email). Also, we're a university and some of
>our "customer base" is pretty against the RBL concept.
>
>Same thing with razor and pyzor.
>
>Maybe those are what would get us more reasonable scores,
>I don't know. But for better or worse, using any of them
>would be a difficult sell at this point.

Ah, but the difference in this case is that SpamAssassin doesn't use these
tests the same way your "customer base" is likely expecting. SpamAssassin
queries DNSBLs and other external sources like Razor, Pyzor, and DCC, but
does not consider a single positive result to be damning--it merely adds to
the mail's score, just like every other test. In effect, these external
tests are just treated as another form of evidence as SpamAssassin
assembles its case against suspicious mail. This allows the mail to be
analyzed more thoroughly, resulting in fewer false positives and false
negatives, at the expense of a little more processing time and bandwidth.

The folks that oppose the use of DNSBLs and such are usually against the
idea of blocking mail outright, based on someone else's
recommendation. With SpamAssassin you can assign different scores to each
DNSBL, so that if you trust one of them more than another, you can adjust
their scores individually. If a particular DNSBL seems too trigger-happy
for you, or has questionable listing policies, you can disable it entirely
(by assigning it a score of 0), or just downgrade its weight with a lower
score.
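
The additive model described above can be sketched in a few lines of Python.
Everything named here is an assumption for illustration: the rule names and
weights are made up, and the threshold of 5.0 is simply the cutoff implied
earlier in the thread. In a real setup the weights would be set with `score`
lines in the SpamAssassin configuration.

```python
REQUIRED_SCORE = 5.0  # assumed spam threshold

# Hypothetical rule names and weights, not real SpamAssassin rules.
SCORES = {
    "DNSBL_TRUSTED": 2.5,     # a list you trust keeps a meaningful weight
    "DNSBL_AGGRESSIVE": 0.5,  # a trigger-happy list is downweighted
    "DNSBL_DISABLED": 0.0,    # a score of 0 effectively disables the test
    "CONTENT_RULE_A": 1.8,
    "CONTENT_RULE_B": 1.2,
}


def total_score(hits, scores=SCORES):
    """Sum the weights of every rule that fired; unknown rules count as 0."""
    return sum(scores.get(rule, 0.0) for rule in hits)


def is_spam(hits, scores=SCORES, required=REQUIRED_SCORE):
    """A message is tagged only when the combined evidence reaches the threshold."""
    return total_score(hits, scores) >= required


if __name__ == "__main__":
    print(is_spam(["DNSBL_TRUSTED"]))
    print(is_spam(["DNSBL_TRUSTED", "CONTENT_RULE_A", "CONTENT_RULE_B"]))
```

A single DNSBL hit stays well under the threshold here; it takes the list
hit plus independent content evidence to tag the message.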


Robert LeBlanc <rjl@renaissoft.com>
Renaissoft, Inc.
Maia Mailguard <http://www.renaissoft.com/maia/>
Re: bayes scoring q
Glenn Little wrote:

> Regarding the RBL checks, we just didn't want the overhead (we
> process a ton of email). Also, we're a university and some of
> our "customer base" is pretty against the RBL concept.
>
> Same thing with razor and pyzor.
>
> Maybe those are what would get us more reasonable scores,
> I don't know. But for better or worse, using any of them
> would be a difficult sell at this point.
>
> -glenn
>
>
Why? If you use these things within SpamAssassin, the rule still holds
true that no one thing will make a message spam or ham. So, if a
message hits 3 separate black lists, and that makes it spam, can your
"customer base" seriously say that 3 different black lists falsely
listed a server and it was falsely tagged as spam? I think THAT would
be the difficult sell. Using these lists alone as the sole reason to
block email definitely can be argued against, but I've found that some
lists, specifically the ones that list dynamic IPs, have little to no
false positives. Considering the amount of email I block with a single
RBL that lists dynamic IPs, I would NEVER turn it off. 6 months with
150000-200000 emails a day, and not a single complaint of a false
positive is pretty damn good. So, if you choose your RBLs wisely, using
them to block email directly with your MTA can work.

Razor, Pyzor, and DCC list emails based on human submissions (the best
spam detector in my opinion is a human). These lists are probably more
accurate because it is the actual content of the message that is
blacklisted, not the server it came from. So, Bob Legit's email, even
though it comes from the same server as Joe Spammer's email, will not
get marked by one of these lists. Again, the rule holds true though, no
single one of these lists alone will cause a message to be marked spam.
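
The distinction drawn above, listing the content rather than the sending
server, can be illustrated with a toy digest lookup in Python. This is only
a sketch under loud assumptions: Razor, Pyzor and DCC each use their own
checksum schemes and network protocols, none of which is reproduced here.
The SHA-1 over a whitespace-normalized body and the ReportedDigests class
are invented for the example.

```python
import hashlib


def body_digest(body):
    """Digest of a crudely normalized body (assumed scheme, not a real one)."""
    normalized = " ".join(body.split()).lower()
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


class ReportedDigests:
    """In-memory stand-in for a shared database of reported message digests."""

    def __init__(self):
        self._digests = set()

    def report(self, body):
        # Record the digest of a message that was reported as spam.
        self._digests.add(body_digest(body))

    def is_reported(self, body):
        # The lookup key is the content; the sending server plays no part.
        return body_digest(body) in self._digests


if __name__ == "__main__":
    db = ReportedDigests()
    db.report("Buy  CHEAP pills\nnow")
    print(db.is_reported("buy cheap pills now"))
    print(db.is_reported("Meeting moved to 3pm"))
```

Because the key is derived from the body, two messages relayed through the
same server are judged separately.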


Chris
Re: bayes scoring q
I think the performance hit and reliance on an external
connection for every email processed was at least an
equal concern.

In practice is that just not a problem? How about
a network timeout situation? How does that end up
working out? Sounds like you (Christopher) at least
have a fairly high mail volume.

-glenn

Christopher M. Iarocci wrote:

> [...]
Re: bayes scoring q
On Wed, 10 Mar 2004, Glenn Little wrote:

> I think the performance hit and reliance on an external
> connection for every email processed was at least an
> equal concern.
>
> In practice is that just not a problem? How about
> a network timeout situation? How does that end up
> working out? Sounds like you (Christopher) at least
> have a fairly high mail volume.
>
> -glenn

Glenn,
Are you responsible for the mail @cs.ucsd.edu ?
Then you -DO- use a DNSBL for every message, and I've got the
transcript to prove it:

server15$ telnet fast.ucsd.edu smtp
Trying...
Connected to fast.ucsd.edu.
Escape character is '^]'.
220 fast.ucsd.edu ESMTP
helo server15.icaen.uiowa.edu
250 fast.ucsd.edu Hello server15.icaen.uiowa.edu [128.255.17.10], pleased to meet you
mail from: <bill@nowhere.icaen.uiowa.edu>
553 5.1.8 <bill@nowhere.icaen.uiowa.edu>... Domain of sender address bill@nowhere.icaen.uiowa.edu does not exist

You do a DNS resolvability check on each message that somebody tries to
hand you, which is dependent upon an arbitrary remote DNS server.
Based upon the results of that DNS lookup, you do reject mail.
Thus I could argue that you do use DNSBL lists.

Now I do that -and- I do a DNS lookup on rbl-plus.mail-abuse.org.
As we have a paid subscription to MAPS, with zone-xfer rights,
that second lookup is often faster and easier than the first is.

If performance is truly a concern, just secondary the zones that
you want to use.
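
For reference, a DNSBL lookup is an ordinary DNS query: the IPv4 octets are
reversed and prepended to the list's zone, and any answer means "listed".
The rough Python sketch below also bounds how long a slow list can hold up a
message, which speaks to the timeout worry raised earlier. The zone
dnsbl.example, the function names and the 2-second default are assumptions
for illustration. SpamAssassin has its own rbl_timeout setting, and this
does not attempt to reproduce how that works.

```python
import concurrent.futures
import ipaddress
import socket


def dnsbl_query_name(ip, zone):
    """Build the DNS name to query for an IPv4 address in a DNSBL zone."""
    octets = str(ipaddress.IPv4Address(ip)).split(".")
    return ".".join(reversed(octets)) + "." + zone


def listed(ip, zone, timeout=2.0, resolver=socket.gethostbyname):
    """Return True if listed, False if not, None if the lookup timed out.

    `resolver` is injectable so the logic can be exercised without a network.
    """
    name = dnsbl_query_name(ip, zone)
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(resolver, name)
    try:
        future.result(timeout=timeout)
        return True                    # any address record counts as a listing
    except concurrent.futures.TimeoutError:
        return None                    # treat as "no evidence", not as a hit
    except OSError:
        return False                   # lookup failure such as NXDOMAIN
    finally:
        pool.shutdown(wait=False)      # do not block on a still-running lookup


if __name__ == "__main__":
    print(dnsbl_query_name("192.0.2.10", "dnsbl.example"))
```

Returning None on timeout keeps an unreachable list from either delaying
mail indefinitely or counting against the message.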

So it boils down to an issue of trust. Who do you trust to tell you
whether you should accept any particular message?
Some arbitrary remote DNS server who claims authority for the
sender's domain (DNS spoofing issues considered)?
A 3rd party who has done some work to collect information about
some particular desirability criteria of a particular sending host?
(EG open-relay, dial-up/dynamic IP, open proxy, ...).

Do the due diligence; it can pay off in the end.
(hmm, is this some kind of stock ad? ;)

Dave

--
Dave Funk University of Iowa
<dbfunk (at) engineering.uiowa.edu> College of Engineering
319/335-5751 FAX: 319/384-0549 1256 Seamans Center
Sys_admin/Postmaster/cell_admin Iowa City, IA 52242-1527
#include <std_disclaimer.h>
Better is not better, 'standard' is better. B{
Re: bayes scoring q
Dave Funk wrote:
> 553 5.1.8 <bill@nowhere.icaen.uiowa.edu>... Domain of sender address bill@nowhere.icaen.uiowa.edu does not exist
>
>You do a DNS resolvability check on each message that somebody tries to
>hand you, which is dependent upon an arbitrary remote DNS server.
>Based upon the results of that DNS lookup, you do reject mail.
>Thus I could argue that you do use DNSBL lists.

DNS blacklist maintainers are self-appointed and have varying
reputations that may be difficult to ascertain reliably. Few people
would recommend the risky idea of using a poorly-chosen DNS-based
blacklist.

Confirming that a sending domain really exists is a form of whitelist
query to the properly delegated authority for the domain. This
function is very different from DNSBL queries, and it would be better
if you didn't try to mislead people into thinking that they are
morally equivalent just because they both use the DNS.

:: Jeff Makey
jeff@sdsc.edu
Re: bayes scoring q
From: "Matt Kettler" <mkettler@evi-inc.com>

> Autolearn is a good thing, but how much manual training are you doing?
>
> Autolearning alone as your sole source of bayes training is a very bad
> idea, and prone to disaster.
>
> I might also suggest the following to help mitigate some of the habeas
> damage:
>
> bayes_ignore_header X-Habeas-SWE-1
> bayes_ignore_header X-Habeas-SWE-2
> bayes_ignore_header X-Habeas-SWE-3
> bayes_ignore_header X-Habeas-SWE-4
> bayes_ignore_header X-Habeas-SWE-5
> bayes_ignore_header X-Habeas-SWE-6
> bayes_ignore_header X-Habeas-SWE-7
> bayes_ignore_header X-Habeas-SWE-8
> bayes_ignore_header X-Habeas-SWE-9
>
> This will make the bayes database never give ham or spam points because
> an email has these headers. Since there's already a rule for them,
> there's no reason to give "double credit" and give them bayes
> consideration as well.

Er, I take it as a given that the X-Habeas headers are prime indicators
of spam. So what's wrong with scoring them twice? How does Granny Pratfall
install X-Habeas headers on her Outlook Express email? It's hard for her.
It's easy for spammers. Relayed spams leave spammers virtually immune to
any of the Habeas sanctions. So it's cheaper for spammers to use that
header than it is for most computer users out there.

Score it twice. Score it thrice. Score it whatever. The truth of the header
will eventually fall out of the mishmash.

{^_^}
Re: bayes scoring q
From: "Robert LeBlanc" <rjl@renaissoft.com>

> The folks that oppose the use of DNSBLs and such are usually against the
> idea of blocking mail outright, based on someone else's
> recommendation. With SpamAssassin you can assign different scores to each
> DNSBL, so that if you trust one of them more than another, you can adjust
> their scores individually. If a particular DNSBL seems too trigger-happy
> for you, or has questionable listing policies, you can disable it entirely
> (by assigning it a score of 0), or just downgrade its weight with a lower
> score.

I oppose them on those grounds and on the grounds that I found it
useless. I manually performed the DNS checks at three different
black lists over the better part of a day. As soon as I saw a spam
I checked it via the DNS lookup. I got maybe one hit out of twenty.
By the time I started getting hits on email addresses, spam from
those addresses was no longer appearing in my inbox. I picked twenty
spams at random out of my inbox, which gets maybe 200 - 300 spams a
day. I guess I randomly picked a selection of email that was not from
typical fixed spam addresses. I do get a fair amount of that and maybe
subconsciously I recognized their header style and didn't test them.

Far more importantly I tested "fall through" spam that was missed
by SpamAssassin and fell through my folder rules or actually found
its way into valid folders. On the day I tested, Bayes was not running,
so I had about 10 such. Not a one of them hit any of the DNS tests I
made until well after I had received them and the relays were no longer
being used.

Since the DNS test is an added load on a rather underpowered machine
that somewhat loses its mind when a collection of kernel patches hit
the Linux Kernel mailing list I turned them off. (When the machine
loses its mind I seem to get two copies of all email I received for
an hour or two in my mailbox. At least it self recovers.)

{^_^}
Re: bayes scoring q
On Wed, 10 Mar 2004, Jeff Makey wrote:

> Dave Funk wrote:
> > 553 5.1.8 <bill@nowhere.icaen.uiowa.edu>... Domain of sender address bill@nowhere.icaen.uiowa.edu does not exist
> >
> >You do a DNS resolvability check on each message that somebody tries to
> >hand you, which is dependent upon an arbitrary remote DNS server.
> >Based upon the results of that DNS lookup, you do reject mail.
> >Thus I could argue that you do use DNSBL lists.
>
> DNS blacklist maintainers are self-appointed and have varying
> reputations that may be difficult to ascertain reliably. Few people
> would recommend the risky idea of using a poorly-chosen DNS-based
> blacklist.

Agreed 100%. If you had read the rest of my post you would note that I
talked about issues of trust and "do due diligence" when deciding what
sources of information to use. This applies to DNS or other forms of
remote information such as DCC, Razor, BondedSender, etc.

ANY badly maintained DNS system is a disservice to those affected by it.

The main purpose of that post was to point out to Glenn Little
that he was doing DNS lookups on every message arriving at his system
and making accept/reject decisions upon that information.
That was in answer to his previous post questioning the sense of doing
DNSBL lookups PURELY on the grounds of performance and reliability
concerns.


> Confirming that a sending domain really exists is a form of whitelist
> query to the properly delegated authority for the domain. This
> function is very different from DNSBL queries, and it would be better
> if you didn't try to mislead people into thinking that they are
> morally equivalent just because they both use the DNS.
>
> :: Jeff Makey
> jeff@sdsc.edu

I am not trying to get into a holy war here; I was trying to point
out the technical feasibility of using DNS to get information about
a message for filtering purposes (either for accept/reject decisions
or SA scoring).


--
Dave Funk University of Iowa
<dbfunk (at) engineering.uiowa.edu> College of Engineering
319/335-5751 FAX: 319/384-0549 1256 Seamans Center
Sys_admin/Postmaster/cell_admin Iowa City, IA 52242-1527
#include <std_disclaimer.h>
Better is not better, 'standard' is better. B{