Mailing List Archive

Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On Mon, 14 Jan 2013, Ben Johnson wrote:

> I understand that snowshoe spam may not hit any net tests. I guess my
> confusion is around what, exactly, classifies spam as "snowshoe".

http://www.spamhaus.org/faq/section/Glossary

Basically, a large number of spambots sending the message so that no one
sending IP can be easily tagged as evil.

Question: do you have any SMTP-time hard-reject DNSBL tests in place? Or
are they all performed by SA?

Recommendation: consider using the Spamhaus ZEN DNSBL as a hard-reject
SMTP-time DNS check in your MTA. It is well-respected and very reliable.
One thing it includes is ranges of IP addresses that should not ever be
sending email, so it may help reduce snowshoe spam.

http://www.spamhaus.org/zen/

Another tactic that many report good results from is Greylisting. Do you
have greylisting in place? Does your userbase demand no delays in mail
delivery? In addition to blocking spam from spambots that do not retry, it
can delay mail enough for the BLs to get a chance to list new IPs/domains,
which can reduce the leakage if you happen to be at the leading edge of a
new delivery campaign.

http://www.greylisting.org/

> Are most/all of the BL services hash-based?

Generally:

DNSBL: Blacklist of IP addresses
URIBL: Blacklist of domain and host names appearing in URIs
EMAILBL: (not widely used) Blacklist of email addresses (e.g.
phishing response addresses)
Razor, Pyzor: Blacklist of message content checksums/hashes

> In other words, if a known spam message was added yesterday, will it be
> considered "snowshoe" spam if the spammer sends the same message today
> and changes only one character within the body?

No, the diverse IP addresses are the hallmark of "snowshoe", not so much
the specific message content. If you see identical or generally-similar
(e.g.) pharma spam coming from a wide range of different IP addresses,
that's snowshoe.

> If so, then I guess the only remedy here is to focus on why Bayes seems
> to perform so miserably.

Agreed.

> It must be a configuration issue, because I've sa-learn-ed messages that
> are incredibly similar for two days now and not only do their Bayes
> scores not change significantly, but sometimes they decrease. And I have
> a hard time believing that one of my users is sa-train-ing these
> messages as ham and negating my efforts.

This is why you retain your Bayes training corpora: so that if Bayes goes
off the rails you can review your corpora for misclassifications, wipe and
retrain. Do you have your training corpora? Or do you discard messages
once you've trained them?
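
That wipe-and-retrain cycle comes down to a handful of sa-learn invocations. A sketch only, not a prescription: the corpus paths are placeholders, and it must be run as whichever user owns the Bayes database:

```shell
sa-learn --backup > bayes-backup.txt   # snapshot first, in case retraining goes badly
sa-learn --clear                       # wipe the Bayes database
sa-learn --spam /path/to/corpus/spam   # retrain from the retained spam corpus
sa-learn --ham  /path/to/corpus/ham    # ...and from the retained ham corpus
sa-learn --dump magic                  # sanity-check the nspam/nham counts afterward
```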

_Do_ you allow your users to train Bayes? Do they do so unsupervised or do
you review their submissions? And if the process is automated, do you
retain what they have provided for training so that you can go back later
and do a troubleshooting review?

Do you have autolearn turned on? My opinion is that autolearn is only
appropriate for a large and very diverse userbase where a sufficiently
"common" corpus of ham can't be manually collected. But then, I don't
admin a Really Large Install, so YMMV.

Do you use per-user or sitewide Bayes? If per-user, then you need to make
sure that you're training Bayes as the same user that the MTA is running
SA as.

What user does your MTA run SA as? What user do you train Bayes as?

One possibility is that the MTA is running SA as a different user than you
are training Bayes as, and you have autolearn turned on, and Bayes has
been running in its own little world since day one regardless of what you
think you're telling it to do.

> I have ensured that the spam token count increases when I train these
> messages. That said, I do notice that the token count does not *always*
> change; sometimes, sa-learn reports "Learned tokens from 0 message(s) (1
> message(s) examined)". Does this mean that all tokens from these
> messages have already been learned, thereby making it pointless to
> continue feeding them to sa-learn?

No, it means that Message-ID has been learned from before.

> Finally, I added the test you supplied to my SA configuration, restarted
> Amavis, and all messages appear to be tagged with RCVD_IN_HITALL=0.001.

So this proves DNS lookups are indeed working for all messages.

--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhardin@impsec.org FALaholic #11174 pgpk -a jhardin@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
One death is a tragedy; thirty is a media sensation;
a million is a statistic. -- Joseph Stalin, modernized
-----------------------------------------------------------------------
3 days until Benjamin Franklin's 307th Birthday
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 1/14/2013 7:48 PM, Noel wrote:
> On 1/14/2013 2:59 PM, Ben Johnson wrote:
>
>> I understand that snowshoe spam may not hit any net tests. I guess my
>> confusion is around what, exactly, classifies spam as "snowshoe".
>
> Snowshoe spam - spreading a spam run across a large number of IPs so
> no single IP is sending a large volume. Typically also combined
> with "natural language" text, RFC compliant mail servers, verified
> SPF and DKIM, business-class ISP with FCrDNS, and every other
> criterion needed to look like a legit mail source. This type of spam is
> difficult to catch.
>
> http://www.spamhaus.org/faq/section/Glossary#233
> and countless other links if you ask google.
>
>> Are most/all of the BL services hash-based? In other words, if a known
>> spam message was added yesterday, will it be considered "snowshoe" spam
>> if the spammer sends the same message today and changes only one
>> character within the body?
>
> No, almost all DNS blacklists are based on IP reputation. Check each
> list's website for their listing policy to see how an IP gets on
> their list; generally honeypot email addresses or trusted user
> reports. Most lists require some number of reports before listing
> an IP to prevent false positives; snowshoe spammers take advantage
> of this.
>
>> If so, then I guess the only remedy here is to focus on why Bayes seems
>> to perform so miserably.
>
> Sounds as if your bayes has been improperly trained in the past.
> You might do better to just delete the bayes db and start over with
> hand-picked spam and ham.
>
>
>
> -- Noel Jones
>

jdow, Noel, and John, I can't thank you enough for your very thorough
responses. Your time is valuable and I sincerely appreciate your
willingness to help.

John, I'll respond to you separately, for the sake of keeping this
organized.

> Ben, do be aware that sometimes you draw the short straw and sit at the
> very start of the spam distribution cycle. In those cases the BLs will
> generally not have been alerted yet so they may not trigger. For those
> situations the rules should be your friends. (I still use my treasured
> set of SARE rules and personally hand crafted rules my partner and I
> have created that fit OUR needs but may not be good general purpose
> rules.)

This makes perfect sense and underscores the importance of a
finely-tuned rule-set. It's become apparent just how dynamic and capable
a monster the spam industry is. No one approach will ever be a panacea,
it seems.

The advice from your second email is well-received, too. Especially the
part about not killing anybody. ;) I do hope fighting spam becomes fun
for me, because so far, it's been an uphill battle! Hehe.

Noel, thanks for excellent responses to my questions.

> Sounds as if your bayes has been improperly trained in the past.
> You might do better to just delete the bayes db and start over with
> hand-picked spam and ham.

I hope not, because this is my second go-round with the Bayes DB. The
first time (as Mr. Hardin may remember), auto-learning was enabled
out-of-the-box and some misconfiguration or another (seemingly related
to DNSWL_* rules) caused a lot of spam to be learned as ham. With John's
help, I corrected the issues (I hope), which I'll detail in my reply to
John.

Thanks again,

-Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 1/14/2013 8:16 PM, John Hardin wrote:
> On Mon, 14 Jan 2013, Ben Johnson wrote:
>
>> I understand that snowshoe spam may not hit any net tests. I guess my
>> confusion is around what, exactly, classifies spam as "snowshoe".
>
> http://www.spamhaus.org/faq/section/Glossary
>
> Basically, a large number of spambots sending the message so that no one
> sending IP can be easily tagged as evil.
>
> Question: do you have any SMTP-time hard-reject DNSBL tests in place? Or
> are they all performed by SA?

In postfix's main.cf:

smtpd_recipient_restrictions =
    permit_mynetworks,
    permit_sasl_authenticated,
    check_recipient_access mysql:/etc/postfix/mysql-virtual_recipient.cf,
    reject_unauth_destination,
    reject_rbl_client bl.spamcop.net

Do you recommend something more?

> Recommendation: consider using the Spamhaus ZEN DNSBL as a hard-reject
> SMTP-time DNS check in your MTA. It is well-respected and very reliable.
> One thing it includes is ranges of IP addresses that should not ever be
> sending email, so it may help reduce snowshoe spam.
>
> http://www.spamhaus.org/zen/

This article looks to be pretty thorough:

http://www.cyberciti.biz/faq/howto-configure-postfix-dnsrbls-under-linux-unix/

I'll add Spamhaus ZEN and a few others to the list.
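
Something like this in main.cf, I imagine. The ordering is my own guess: keeping the DNSBL checks after reject_unauth_destination so local and SASL-authenticated mail is never RBL-tested:

```
smtpd_recipient_restrictions =
    permit_mynetworks,
    permit_sasl_authenticated,
    check_recipient_access mysql:/etc/postfix/mysql-virtual_recipient.cf,
    reject_unauth_destination,
    reject_rbl_client zen.spamhaus.org,
    reject_rbl_client bl.spamcop.net
```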

> Another tactic that many report good results from is Greylisting. Do you
> have greylisting in place? Does your userbase demand no delays in mail
> delivery? In addition to blocking spam from spambots that do not retry,
> it can delay mail enough for the BLs to get a chance to list new
> IPs/domains, which can reduce the leakage if you happen to be at the
> leading edge of a new delivery campaign.
>
> http://www.greylisting.org/

Hmm, very interesting. No, I have no greylisting in place as yet, and
no, my userbase doesn't demand immediate delivery. I will look into
greylisting further.

>> Are most/all of the BL services hash-based?
>
> Generally:
>
> DNSBL: Blacklist of IP addresses
> URIBL: Blacklist of domain and host names appearing in URIs
> EMAILBL: (not widely used) Blacklist of email addresses (e.g.
> phishing response addresses)
> Razor, Pyzor: Blacklist of message content checksums/hashes

Perfect; that answers my question.

>> In other words, if a known spam message was added yesterday, will it
>> be considered "snowshoe" spam if the spammer sends the same message
>> today and changes only one character within the body?
>
> No, the diverse IP addresses are the hallmark of "snowshoe", not so much
> the specific message content. If you see identical or generally-similar
> (e.g.) pharma spam coming from a wide range of different IP addresses,
> that's snowshoe.

I see. Given this information, it concerns me that Bayes scores hardly
seem to budge when I feed sa-learn nearly identical messages 3+ times.
We'll get into that below.

>> If so, then I guess the only remedy here is to focus on why Bayes seems
>> to perform so miserably.
>
> Agreed.
>
>> It must be a configuration issue, because I've sa-learn-ed messages
>> that are incredibly similar for two days now and not only do their
>> Bayes scores not change significantly, but sometimes they decrease.
>> And I have a hard time believing that one of my users is sa-train-ing
>> these messages as ham and negating my efforts.
>
> This is why you retain your Bayes training corpora: so that if Bayes
> goes off the rails you can review your corpora for misclassifications,
> wipe and retrain. Do you have your training corpora? Or do you discard
> messages once you've trained them?

I had the good sense to retain the corpora.

> _Do_ you allow your users to train Bayes? Do they do so unsupervised or
> do you review their submissions? And if the process is automated, do you
> retain what they have provided for training so that you can go back
> later and do a troubleshooting review?

Yes, users are allowed to train Bayes, via Dovecot's Antispam plug-in.
They do so unsupervised. Why this could be a problem is obvious. And no,
I don't retain their submissions. I probably should. I wonder if I can
make a few slight modifications to the shell script that Antispam calls,
such that it simply sends a copy of the message to an administrator
rather than calling sa-learn on the message.
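
Something along these lines, perhaps: a hypothetical wrapper where the "spam"/"ham" argument and the corpus location are my guesses, not the plug-in's documented interface, that archives every submission before optionally training:

```shell
# Hypothetical replacement for the script the Antispam plug-in invokes.
train_and_archive() {
    class="$1"                                  # "spam" or "ham"
    corpus="${CORPUS_DIR:-/var/lib/sa-corpus}/$class"
    mkdir -p "$corpus"
    msg="$corpus/$(date +%s).$$.eml"            # one file per submission
    cat > "$msg"                                # archive the message first
    # Train immediately only when sa-learn is present; either way the
    # copy is retained for later supervised review and retraining.
    if command -v sa-learn >/dev/null 2>&1; then
        sa-learn --"$class" "$msg" >/dev/null 2>&1 || true
    fi
}
```

Pointing the plug-in at a wrapper like this would keep a reviewable corpus even if the immediate-training step were later dropped.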

> Do you have autolearn turned on? My opinion is that autolearn is only
> appropriate for a large and very diverse userbase where a sufficiently
> "common" corpus of ham can't be manually collected. But then, I don't
> admin a Really Large Install, so YMMV.

No, I was sure to disable autolearn after the last Bayes fiasco. :)

> Do you use per-user or sitewide Bayes? If per-user, then you need to
> make sure that you're training Bayes as the same user that the MTA is
> running SA as.

Site-wide. And I have hard-coded the username in the SA configuration to
prevent confusion in this regard:

bayes_sql_override_username amavis
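
(For context, the rest of the SQL Bayes glue in local.cf looks roughly like this; the DSN and credentials below are placeholders, not my real values:)

```
bayes_store_module          Mail::SpamAssassin::BayesStore::MySQL
bayes_sql_dsn               DBI:mysql:sa_bayes:localhost
bayes_sql_username          sa_user
bayes_sql_password          sa_pass
bayes_sql_override_username amavis
```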

> What user does your MTA run SA as? What user do you train Bayes as?

The MTA should pass scanning off to "amavis". I train the DB in two
ways: via Dovecot Antispam and by calling sa-learn on my training
mailbox. Given that I have hard-coded the username, the output of
"sa-learn --dump magic" is the same whether I issue the command under my
own account or "su" to the "amavis" user.

> One possibility is that the MTA is running SA as a different user than
> you are training Bayes as, and you have autolearn turned on, and Bayes
> has been running in its own little world since day one regardless of
> what you think you're telling it to do.

That is what happened last year. I hope to have eliminated those issues
this time around. (I dumped the old DB and started over after that
debacle.) The X-Spam-Status header always displays "autolearn=disabled".

>> I have ensured that the spam token count increases when I train these
>> messages. That said, I do notice that the token count does not *always*
>> change; sometimes, sa-learn reports "Learned tokens from 0 message(s) (1
>> message(s) examined)". Does this mean that all tokens from these
>> messages have already been learned, thereby making it pointless to
>> continue feeding them to sa-learn?
>
> No, it means that Message-ID has been learned from before.

I see. So, when this happens, it means that one of my users has already
dragged the message from Inbox to Junk (which triggers the Antispam
plug-in and feeds the message to sa-learn).

When this scenario occurs, my efforts in feeding the same message to
sa-learn are wasted, right? Bayes doesn't "learn more" from the message
the second time, or increase its tokens' "weight", right? It would be
nice if I could eliminate this duplicate effort.

>> Finally, I added the test you supplied to my SA configuration, restarted
>> Amavis, and all messages appear to be tagged with RCVD_IN_HITALL=0.001.
>
> So this proves DNS lookups are indeed working for all messages.
>

Okay, good to know. I think we're "all clear" in the DNS/network test
department.

Based on my responses, what's the next move? Back up the Bayes DB, wipe
it, and feed my corpus through the ol' chipper?

Thanks again!

-Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On Tue, 15 Jan 2013, Ben Johnson wrote:

> On 1/14/2013 8:16 PM, John Hardin wrote:
>> On Mon, 14 Jan 2013, Ben Johnson wrote:
>>
>> Question: do you have any SMTP-time hard-reject DNSBL tests in place? Or
>> are they all performed by SA?
>
> In postfix's main.cf:
>
> smtpd_recipient_restrictions = permit_mynetworks,
> permit_sasl_authenticated, check_recipient_access
> mysql:/etc/postfix/mysql-virtual_recipient.cf,
> reject_unauth_destination, reject_rbl_client bl.spamcop.net
>
> Do you recommend something more?

Unfortunately I have no experience administering Postfix. Perhaps one of
the other listies can help.

>> http://www.greylisting.org/
>
> Hmm, very interesting. No, I have no greylisting in place as yet, and
> no, my userbase doesn't demand immediate delivery. I will look into
> greylisting further.

One other thing you might try is publishing an SPF record for your domain.
There is anecdotal evidence that this reduces the raw spam volume to that
domain a bit.
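
A minimal record is a single TXT line in the zone file. The domain and the mechanism choice here are illustrative only; publish whatever actually matches your outbound hosts:

```
example.com.   IN  TXT  "v=spf1 a mx -all"
```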

> Given this information, it concerns me that Bayes scores hardly seem to
> budge when I feed sa-learn nearly identical messages 3+ times. We'll get
> into that below.
>
>>> If so, then I guess the only remedy here is to focus on why Bayes seems
>>> to perform so miserably.
>>
>> Agreed.
>>
>>> It must be a configuration issue, because I've sa-learn-ed messages
>>> that are incredibly similar for two days now and not only do their
>>> Bayes scores not change significantly, but sometimes they decrease.
>>> And I have a hard time believing that one of my users is sa-train-ing
>>> these messages as ham and negating my efforts.
>>
>> This is why you retain your Bayes training corpora: so that if Bayes
>> goes off the rails you can review your corpora for misclassifications,
>> wipe and retrain. Do you have your training corpora? Or do you discard
>> messages once you've trained them?
>
> I had the good sense to retain the corpora.

Yay!

>> _Do_ you allow your users to train Bayes? Do they do so unsupervised or
>> do you review their submissions? And if the process is automated, do you
>> retain what they have provided for training so that you can go back
>> later and do a troubleshooting review?
>
> Yes, users are allowed to train Bayes, via Dovecot's Antispam plug-in.
> They do so unsupervised. Why this could be a problem is obvious. And no,
> I don't retain their submissions. I probably should. I wonder if I can
> make a few slight modifications to the shell script that Antispam calls,
> such that it simply sends a copy of the message to an administrator
> rather than calling sa-learn on the message.

That would be a very good idea if the number of users doing training is
small. At the very least, the messages should be captured to a permanent
corpus mailbox.

Do your users also train ham? Are the procedures similar enough that your
users could become easily confused?

>> Do you have autolearn turned on? My opinion is that autolearn is only
>> appropriate for a large and very diverse userbase where a sufficiently
>> "common" corpus of ham can't be manually collected. But then, I don't
>> admin a Really Large Install, so YMMV.
>
> No, I was sure to disable autolearn after the last Bayes fiasco. :)

OK.

>> Do you use per-user or sitewide Bayes? If per-user, then you need to
>> make sure that you're training Bayes as the same user that the MTA is
>> running SA as.
>
> Site-wide. And I have hard-coded the username in the SA configuration to
> prevent confusion in this regard:
>
> bayes_sql_override_username amavis
>
>> What user does your MTA run SA as? What user do you train Bayes as?
>
> The MTA should pass scanning off to "amavis". I train the DB in two
> ways: via Dovecot Antispam and by calling sa-learn on my training
> mailbox. Given that I have hard-coded the username, the output of
> "sa-learn --dump magic" is the same whether I issue the command under my
> own account or "su" to the "amavis" user.

OK, good.

>>> I have ensured that the spam token count increases when I train these
>>> messages. That said, I do notice that the token count does not *always*
>>> change; sometimes, sa-learn reports "Learned tokens from 0 message(s) (1
>>> message(s) examined)". Does this mean that all tokens from these
>>> messages have already been learned, thereby making it pointless to
>>> continue feeding them to sa-learn?
>>
>> No, it means that Message-ID has been learned from before.
>
> I see. So, when this happens, it means that one of my users has already
> dragged the message from Inbox to Junk (which triggers the Antispam
> plug-in and feeds the message to sa-learn).

Very likely.

The extremely odd thing is that you say you sometimes train a message as
spam, and its Bayes score goes *down*. Are you training a message and
then running it through spamc to see if the score changed, or is this
about _similar_ messages rather than _that_ message?
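
The direct way to check is to train one specific saved message and then immediately re-score that very same file (message.eml is a placeholder; spamc -R prints the score along with the matched rules):

```shell
sa-learn --spam message.eml   # train this exact message as spam
spamc -R < message.eml        # re-score it; the BAYES_* hit should move toward spam
```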

> When this scenario occurs, my efforts in feeding the same message to
> sa-learn are wasted, right? Bayes doesn't "learn more" from the message
> the second time, or increase its tokens' "weight", right? It would be
> nice if I could eliminate this duplicate effort.

Correct, no new information is learned.

> Based on my responses, what's the next move? Back up the Bayes DB, wipe
> it, and feed my corpus through the ol' chipper?

That, and configure the user-based training to at the very least capture
what they submit to a corpus so you can review it. Whether you do that
review pre-training or post-bayes-is-insane is up to you.

--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhardin@impsec.org FALaholic #11174 pgpk -a jhardin@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
The difference is that Unix has had thirty years of technical
types demanding basic functionality of it. And the Macintosh has
had fifteen years of interface fascist users shaping its progress.
Windows has the hairpin turns of the Microsoft marketing machine
and that's all. -- Red Drag Diva
-----------------------------------------------------------------------
2 days until Benjamin Franklin's 307th Birthday
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 1/15/2013 1:55 PM, John Hardin wrote:
> On Tue, 15 Jan 2013, Ben Johnson wrote:
>
>> On 1/14/2013 8:16 PM, John Hardin wrote:
>>> On Mon, 14 Jan 2013, Ben Johnson wrote:
>>>
>>> Question: do you have any SMTP-time hard-reject DNSBL tests in place? Or
>>> are they all performed by SA?
>>
>> In postfix's main.cf:
>>
>> smtpd_recipient_restrictions = permit_mynetworks,
>> permit_sasl_authenticated, check_recipient_access
>> mysql:/etc/postfix/mysql-virtual_recipient.cf,
>> reject_unauth_destination, reject_rbl_client bl.spamcop.net
>>
>> Do you recommend something more?
>
> Unfortunately I have no experience administering Postfix. Perhaps one of
> the other listies can help.

Wow! Adding several more reject_rbl_client entries to the
smtpd_recipient_restrictions directive in the Postfix configuration
seems to be having a tremendous impact. The amount of spam coming
through has dropped by 90% or more. This was a HUGELY helpful
suggestion, John!

>>> http://www.greylisting.org/
>>
>> Hmm, very interesting. No, I have no greylisting in place as yet, and
>> no, my userbase doesn't demand immediate delivery. I will look into
>> greylisting further.
>
> One other thing you might try is publishing an SPF record for your
> domain. There is anecdotal evidence that this reduces the raw spam
> volume to that domain a bit.

We do publish SPF records for the domains within our control. The need
to do this arose when senderbase.org et al. began blacklisting
domains without SPF records. So, we're good there.

>> Given this information, it concerns me that Bayes scores hardly seem
>> to budge when I feed sa-learn nearly identical messages 3+ times.
>> We'll get into that below.
>>
>>>> If so, then I guess the only remedy here is to focus on why Bayes seems
>>>> to perform so miserably.
>>>
>>> Agreed.
>>>
>>>> It must be a configuration issue, because I've sa-learn-ed messages
>>>> that are incredibly similar for two days now and not only do their
>>>> Bayes scores not change significantly, but sometimes they decrease.
>>>> And I have a hard time believing that one of my users is sa-train-ing
>>>> these messages as ham and negating my efforts.
>>>
>>> This is why you retain your Bayes training corpora: so that if Bayes
>>> goes off the rails you can review your corpora for misclassifications,
>>> wipe and retrain. Do you have your training corpora? Or do you discard
>>> messages once you've trained them?
>>
>> I had the good sense to retain the corpora.
>
> Yay!
>
>>> _Do_ you allow your users to train Bayes? Do they do so unsupervised or
>>> do you review their submissions? And if the process is automated, do you
>>> retain what they have provided for training so that you can go back
>>> later and do a troubleshooting review?
>>
>> Yes, users are allowed to train Bayes, via Dovecot's Antispam plug-in.
>> They do so unsupervised. Why this could be a problem is obvious. And no,
>> I don't retain their submissions. I probably should. I wonder if I can
>> make a few slight modifications to the shell script that Antispam calls,
>> such that it simply sends a copy of the message to an administrator
>> rather than calling sa-learn on the message.
>
> That would be a very good idea if the number of users doing training is
> small. At the very least, the messages should be captured to a permanent
> corpus mailbox.

Good idea! I'll see if I can set this up.

> Do your users also train ham? Are the procedures similar enough that
> your users could become easily confused?

They do. The procedure is implemented via Dovecot's Antispam plug-in.
Basically, moving mail from Inbox to Junk trains it as spam, and moving
mail from Junk to Inbox trains it as ham. I really like this setup
(Antispam + calling SA through Amavis [i.e. not using spamd]) because
the results take effect immediately, which seems to be crucial for
combating this snowshoe spam (performance and scalability aside).

I don't find that procedure to be confusing, but people are different, I
suppose.

>>> Do you have autolearn turned on? My opinion is that autolearn is only
>>> appropriate for a large and very diverse userbase where a sufficiently
>>> "common" corpus of ham can't be manually collected. But then, I don't
>>> admin a Really Large Install, so YMMV.
>>
>> No, I was sure to disable autolearn after the last Bayes fiasco. :)
>
> OK.
>
>>> Do you use per-user or sitewide Bayes? If per-user, then you need to
>>> make sure that you're training Bayes as the same user that the MTA is
>>> running SA as.
>>
>> Site-wide. And I have hard-coded the username in the SA configuration to
>> prevent confusion in this regard:
>>
>> bayes_sql_override_username amavis
>>
>>> What user does your MTA run SA as? What user do you train Bayes as?
>>
>> The MTA should pass scanning off to "amavis". I train the DB in two
>> ways: via Dovecot Antispam and by calling sa-learn on my training
>> mailbox. Given that I have hard-coded the username, the output of
>> "sa-learn --dump magic" is the same whether I issue the command under my
>> own account or "su" to the "amavis" user.
>
> OK, good.
>
>>>> I have ensured that the spam token count increases when I train these
>>>> messages. That said, I do notice that the token count does not *always*
>>>> change; sometimes, sa-learn reports "Learned tokens from 0
>>>> message(s) (1
>>>> message(s) examined)". Does this mean that all tokens from these
>>>> messages have already been learned, thereby making it pointless to
>>>> continue feeding them to sa-learn?
>>>
>>> No, it means that Message-ID has been learned from before.
>>
>> I see. So, when this happens, it means that one of my users has already
>> dragged the message from Inbox to Junk (which triggers the Antispam
>> plug-in and feeds the message to sa-learn).
>
> Very likely.
>
> The extremely odd thing is that you say you sometimes train a message as
> spam, and its Bayes score goes *down*. Are you training a message and
> then running it through spamc to see if the score changed, or is this
> about _similar_ messages rather than _that_ message?

Sorry for the ambiguity. This is about *similar* messages. Identical
messages, at least visually speaking (I realize that there is a lot more
to it than the visual component). For example, yesterday, I saw several
Canadian Pharmacy emails, all of which were identical with respect to
appearance. I classified each as spam, yet the Bayes score didn't budge
more than a few percent for the first three, and went *down* for the 4th.

I have to assume that while the messages (HTML-formatted) *appear* to be
identical, the underlying code has some pseudo-random element that is
designed very specifically to throw Bayes classifiers.

Out of curiosity, does the Bayes engine (or some other element of
SpamAssassin) have the ability to "see" rendered HTML messages, by
appearance, and not by source code? If it could, it seems it would be
far more effective.

>> When this scenario occurs, my efforts in feeding the same message to
>> sa-learn are wasted, right? Bayes doesn't "learn more" from the message
>> the second time, or increase its tokens' "weight", right? It would be
>> nice if I could eliminate this duplicate effort.
>
> Correct, no new information is learned.
>
>> Based on my responses, what's the next move? Back up the Bayes DB, wipe
>> it, and feed my corpus through the ol' chipper?
>
> That, and configure the user-based training to at the very least capture
> what they submit to a corpus so you can review it. Whether you do that
> review pre-training or post-bayes-is-insane is up to you.
>

Right, right, that makes sense. I hope I can modify the Antispam plug-in
to accommodate this requirement.

Well, I can't thank you enough here, John and everyone else. I seem to
be on the right track; all is not lost.

That said, it seems clear that SA is nowhere near as effective as it can
be when an off-the-shelf configuration is used (and without configuring
the MTA to do some of the blocking).

I'll keep the list posted (pardon the pun) with regard to configuring
Antispam to fire-off a copy of any message that is submitted for
training. Ideally, whether the message is reviewed before or after
sa-learn is called will be configurable.

-Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
One final question on this subject (sorry...).

Is there value in training Bayes on messages that SA classified as spam
*due to other test scores*? In other words, if a message is classified
as SPAM due to a block-list test, but the message is new enough for
Bayes to assign a zero score, should that message be kept and fed to
sa-learn so that Bayes can soak-up all the tokens from a message that is
almost certainly spam (based on the other tests)?

Am I making any sense?

Thanks again!

-Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 1/15/2013 3:47 PM, Ben Johnson wrote:
> One final question on this subject (sorry...).
>
> Is there value in training Bayes on messages that SA classified as spam
> *due to other test scores*? In other words, if a message is classified
> as SPAM due to a block-list test, but the message is new enough for
> Bayes to assign a zero score, should that message be kept and fed to
> sa-learn so that Bayes can soak-up all the tokens from a message that is
> almost certainly spam (based on the other tests)?
>
> Am I making any sense?

It is always worthwhile to train Bayes. In an ideal world, you would
hand-sort and train every email that comes through your system. The
more mail Bayes sees the more accurate it can be.

--
Bowie
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 1/15/2013 4:05 PM, Bowie Bailey wrote:
> On 1/15/2013 3:47 PM, Ben Johnson wrote:
>> One final question on this subject (sorry...).
>>
>> Is there value in training Bayes on messages that SA classified as spam
>> *due to other test scores*? In other words, if a message is classified
>> as SPAM due to a block-list test, but the message is new enough for
>> Bayes to assign a zero score, should that message be kept and fed to
>> sa-learn so that Bayes can soak-up all the tokens from a message that is
>> almost certainly spam (based on the other tests)?
>>
>> Am I making any sense?
>
> It is always worthwhile to train Bayes. In an ideal world, you would
> hand-sort and train every email that comes through your system. The
> more mail Bayes sees the more accurate it can be.
>

Thanks, Bowie. Given your response, would it then be prudent to call
"sa-learn --spam" on any message that *other tests* (non-Bayes tests)
determine to be spam (given some score threshold)?

The crux of my question/point is that I don't want to have to feed
messages that Bayes "misses" but that other tests identify *correctly*
as spam to "sa-learn --spam".

Is there value in implementing something like this? Or is there some
caveat that would make doing so self-defeating?

Thanks a bunch,

-Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new [ In reply to ]
On 1/15/2013 4:27 PM, Ben Johnson wrote:
> On 1/15/2013 4:05 PM, Bowie Bailey wrote:
>> On 1/15/2013 3:47 PM, Ben Johnson wrote:
>>> One final question on this subject (sorry...).
>>>
>>> Is there value in training Bayes on messages that SA classified as spam
>>> *due to other test scores*? In other words, if a message is classified
>>> as SPAM due to a block-list test, but the message is new enough for
>>> Bayes to assign a zero score, should that message be kept and fed to
>>> sa-learn so that Bayes can soak-up all the tokens from a message that is
>>> almost certainly spam (based on the other tests)?
>>>
>>> Am I making any sense?
>> It is always worthwhile to train Bayes. In an ideal world, you would
>> hand-sort and train every email that comes through your system. The
>> more mail Bayes sees the more accurate it can be.
>>
> Thanks, Bowie. Given your response, would it then be prudent to call
> "sa-learn --spam" on any message that *other tests* (non-Bayes tests)
> determine to be spam (given some score threshold)?

That is exactly what the autolearn setting does. I let my system run
with the default autolearn settings. Some people adjust the thresholds
and some people prefer to turn off autolearn and do purely manual training.
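
For reference, the knobs involved live in local.cf; a sketch with what I believe are the stock 3.3.x defaults (verify against "perldoc Mail::SpamAssassin::Conf" on your own install):

```
# local.cf -- SpamAssassin autolearn settings (values believed to be
# the shipped defaults; adjust the thresholds to taste)
bayes_auto_learn 1
bayes_auto_learn_threshold_spam 12.0
bayes_auto_learn_threshold_nonspam 0.1
```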

> The crux of my question/point is that I don't want to have to feed
> messages that Bayes "misses" but that other tests identify *correctly*
> as spam to "sa-learn --spam".

At one point, I had a script running on my server that looked for
messages that were marked as spam with a low Bayes rating (BAYES_00 to
BAYES_40) or messages marked as ham with a high Bayes rating (BAYES_60
to BAYES_99). I was then able to check the messages and learn them
properly. This let me learn from the edge cases that were not being
scored properly by Bayes while still making it to the correct folder due
to other rules.

If you do this, you MUST check the messages yourself prior to learning
since there is no other way to know whether they should be learned as
ham or spam.
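
A rough reconstruction of that sort of script (the Maildir layout is an assumption, and the exact BAYES_* buckets you flag are a matter of taste):

```shell
# find_bayes_edge_cases DIR -- list messages where the overall SA verdict
# disagrees with the Bayes score, i.e. candidates for manual sa-learn review.
find_bayes_edge_cases() {
    dir=$1
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        if grep -q '^X-Spam-Flag: YES' "$f" \
           && grep -Eq 'BAYES_(00|05|20|40)' "$f"; then
            echo "spam-but-low-bayes: $f"     # review, then sa-learn --spam
        elif ! grep -q '^X-Spam-Flag: YES' "$f" \
           && grep -Eq 'BAYES_(60|80|95|99)' "$f"; then
            echo "ham-but-high-bayes: $f"     # review, then sa-learn --ham
        fi
    done
}

# e.g.: find_bayes_edge_cases /var/vmail/example.com/user/Maildir/cur
```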

> Is there value in implementing something like this? Or is there some
> caveat that would make doing so self-defeating?

I find that Bayes autolearn works quite well for me, but others have had
problems with it.

--
Bowie
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new [ In reply to ]
On 1/15/2013 4:39 PM, Bowie Bailey wrote:
> On 1/15/2013 4:27 PM, Ben Johnson wrote:
>> On 1/15/2013 4:05 PM, Bowie Bailey wrote:
>>> On 1/15/2013 3:47 PM, Ben Johnson wrote:
>>>> One final question on this subject (sorry...).
>>>>
>>>> Is there value in training Bayes on messages that SA classified as spam
>>>> *due to other test scores*? In other words, if a message is classified
>>>> as SPAM due to a block-list test, but the message is new enough for
>>>> Bayes to assign a zero score, should that message be kept and fed to
>>>> sa-learn so that Bayes can soak-up all the tokens from a message
>>>> that is
>>>> almost certainly spam (based on the other tests)?
>>>>
>>>> Am I making any sense?
>>> It is always worthwhile to train Bayes. In an ideal world, you would
>>> hand-sort and train every email that comes through your system. The
>>> more mail Bayes sees the more accurate it can be.
>>>
>> Thanks, Bowie. Given your response, would it then be prudent to call
>> "sa-learn --spam" on any message that *other tests* (non-Bayes tests)
>> determine to be spam (given some score threshold)?
>
> That is exactly what the autolearn setting does. I let my system run
> with the default autolearn settings. Some people adjust the thresholds
> and some people prefer to turn off autolearn and do purely manual training.
>
>> The crux of my question/point is that I don't want to have to feed
>> messages that Bayes "misses" but that other tests identify *correctly*
>> as spam to "sa-learn --spam".
>
> At one point, I had a script running on my server that looked for
> messages that were marked as spam with a low Bayes rating (BAYES_00 to
> BAYES_40) or messages marked as ham with a high Bayes rating (BAYES_60
> to BAYES_99). I was then able to check the messages and learn them
> properly. This let me learn from the edge cases that were not being
> scored properly by Bayes while still making it to the correct folder due
> to other rules.
>
> If you do this, you MUST check the messages yourself prior to learning
> since there is no other way to know whether they should be learned as
> ham or spam.
>
>> Is there value in implementing something like this? Or is there some
>> caveat that would make doing so self-defeating?
>
> I find that Bayes autolearn works quite well for me, but others have had
> problems with it.
>

Aaaaah... I get it. Finally. :)

Excellent info here; thanks again!

You guys are heroes... seriously.

Best regards,

-Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new [ In reply to ]
On Tue, 15 Jan 2013, Ben Johnson wrote:

>
>
> On 1/15/2013 1:55 PM, John Hardin wrote:
>> On Tue, 15 Jan 2013, Ben Johnson wrote:
>>
>>> On 1/14/2013 8:16 PM, John Hardin wrote:
>>>> On Mon, 14 Jan 2013, Ben Johnson wrote:
>>>>
>>>> Question: do you have any SMTP-time hard-reject DNSBL tests in place? Or
>>>> are they all performed by SA?
>>>
>>> In postfix's main.cf:
>>>
>>> smtpd_recipient_restrictions = permit_mynetworks,
>>> permit_sasl_authenticated, check_recipient_access
>>> mysql:/etc/postfix/mysql-virtual_recipient.cf,
>>> reject_unauth_destination, reject_rbl_client bl.spamcop.net
>>>
>>> Do you recommend something more?
>>
>> Unfortunately I have no experience administering Postfix. Perhaps one of
>> the other listies can help.
>
> Wow! Adding several more reject_rbl_client entries to the
> smtpd_recipient_restrictions directive in the Postfix configuration
> seems to be having a tremendous impact. The amount of spam coming
> through has dropped by 90% or more. This was a HUGELY helpful
> suggestion, John!

Which ones are you using now? There are DNSBLs that are good, but not
quite good enough to trust as hard-reject SMTP-time filters. That's why SA
does scored DNSBL checks.

>>> Yes, users are allowed to train Bayes, via Dovecot's Antispam plug-in.
>>> They do so unsupervised. Why this could be a problem is obvious. And no,
>>> I don't retain their submissions. I probably should. I wonder if I can
>>> make a few slight modifications to the shell script that Antispam calls,
>>> such that it simply sends a copy of the message to an administrator
>>> rather than calling sa-learn on the message.
>>
>> That would be a very good idea if the number of users doing training is
>> small. At the very least, the messages should be captured to a permanent
>> corpus mailbox.
>
> Good idea! I'll see if I can set this up.
>
>> Do your users also train ham? Are the procedures similar enough that
>> your users could become easily confused?
>
> They do. The procedure is implemented via Dovecot's Antispam plug-in.
> Basically, moving mail from Inbox to Junk trains it as spam, and moving
> mail from Junk to Inbox trains it as ham. I really like this setup
> (Antispam + calling SA through Amavis [i.e. not using spamd]) because
> the results are effective immediately, which seems to be crucial for
> combating this snowshoe spam (performance and scalability aside).
>
> I don't find that procedure to be confusing, but people are different, I
> suppose.

Hm. One thing I would watch out for in that environment is people who have
intentionally subscribed to some sort of mailing list deciding they don't
want to receive it any longer and just junking the messages rather than
unsubscribing.

However, your problem is FN Bayes scores...

>> The extremely odd thing is that you say you sometimes train a message as
>> spam, and its Bayes score goes *down*. Are you training a message and
>> then running it through spamc to see if the score changed, or is this
>> about _similar_ messages rather than _that_ message?
>
> Sorry for the ambiguity. This is about *similar* messages. Identical
> messages, at least visually speaking (I realize that there is a lot more
> to it than the visual component). For example, yesterday, I saw several
> Canadian Pharmacy emails, all of which were identical with respect to
> appearance. I classified each as spam, yet the Bayes score didn't budge
> more than a few percent for the first three, and went *down* for the 4th.
>
> I have to assume that while the messages (HTML-formatted) *appear* to be
> identical, the underlying code has some pseudo-random element that is
> designed very specifically to throw Bayes classifiers.
>
> Out of curiosity, does the Bayes engine (or some other element of
> SpamAssassin) have the ability to "see" rendered HTML messages, by
> appearance, and not by source code? If it could, it would be far more
> effective it seems.

That I don't know.

>> That, and configure the user-based training to at the very least capture
>> what they submit to a corpus so you can review it. Whether you do that
>> review pre-training or post-bayes-is-insane is up to you.
>
> Right, right, that makes sense. I hope I can modify the Antispam plug-in
> to accommodate this requirement.
>
> Well, I can't thank you enough here, John and everyone else. I seem to
> be on the right track; all is not lost.
>
> That said, it seems clear that SA is nowhere near as effective as it can
> be when an off-the-shelf configuration is used (and without configuring
> the MTA to do some of the blocking).
>
> I'll keep the list posted (pardon the pun) with regard to configuring
> Antispam to fire-off a copy of any message that is submitted for
> training. Ideally, whether the message is reviewed before or after
> sa-learn is called will be configurable.

Great! Thanks!

--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhardin@impsec.org FALaholic #11174 pgpk -a jhardin@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
Your mouse has moved. Your Windows Operating System must be
relicensed due to this hardware change. Please contact Microsoft
to obtain a new activation key. If this hardware change results in
added functionality you may be subject to additional license fees.
Your system will now shut down. Thank you for choosing Microsoft.
-----------------------------------------------------------------------
2 days until Benjamin Franklin's 307th Birthday
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new [ In reply to ]
On 2013/01/15 07:27, Ben Johnson wrote:
>
>
> On 1/14/2013 7:48 PM, Noel wrote:
>> On 1/14/2013 2:59 PM, Ben Johnson wrote:
> jdow, Noel, and John, I can't thank you enough for your very thorough
> responses. Your time is valuable and I sincerely appreciate your
> willingness to help.

Glad it was even marginally helpful.

>> Ben, do be aware that sometimes you draw the short straw and sit at the
>> very start of the spam distribution cycle. In those cases the BLs will
>> generally not have been alerted yet so they may not trigger. For those
>> situations the rules should be your friends. (I still use my treasured
>> set of SARE rules and personally hand crafted rules my partner and I
>> have created that fit OUR needs but may not be good general purpose
>> rules.)
>
> This makes perfect sense and underscores the importance of a
> finely-tuned rule-set. It's become apparent just how dynamic and capable
> a monster the spam industry is. No one approach will ever be a panacea,
> it seems.
>
> The advice from your second email is well-received, too. Especially the
> part about not killing anybody. ;) I do hope fighting spam becomes fun
> for me, because so far, it's been an uphill battle! Hehe.
>
> Noel, thanks for excellent responses to my questions.

It got fun enough in the old days with more spam than I'm getting now
to taunt the spammers who monitored this list. "Gee, XXXX, you only
managed a 95 on that last spam I got. Surely you can do better and
make it to 100 on small scoring rules." He did.

You actually get to the point you can recognize the style of various
spam programs and often relate them back to the spammer using spamhaus.
These days of full automation might make that harder. But, still, you
can probably start recognizing stylistic elements of the various
programs soon enough.

{^_^}
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new [ In reply to ]
On 2013/01/15 08:26, Ben Johnson wrote:

> Based on my responses, what's the next move? Backup the Bayes DB, wipe
> it, and feed my corpus through the ol' chipper?

(Sure to infuriate BUT - read the WHOLE note.)

Are you sure your Bayes database is well trained? But let's change that
to, "Is the Bayes database SpamAssassin is using when receiving email
the same as the Bayes database you are training with sa-learn?"

If you are training a per-user database and do not have that enabled
in SpamAssassin, then the training is pretty useless. Worst case, you waste
some CPU and disk cycles finding every SpamAssassin-related Bayes database
on your system. If you find more than one and shouldn't, ask yourself
why and sort out that problem.
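
Something as blunt as this will do for the default Berkeley-DB backend (SQL-backed Bayes obviously won't show up this way; the helper name is mine):

```shell
# Locate every on-disk Bayes database under a given root. The default
# per-user location is ~/.spamassassin/bayes_toks plus bayes_seen.
find_bayes_dbs() {
    find "$1" \( -name 'bayes_toks' -o -name 'bayes_seen' \) 2>/dev/null
}

# e.g.: find_bayes_dbs /    (slow, but exhaustive)
```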

{^_^}
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new [ In reply to ]
On Tue, 15 Jan 2013, jdow wrote:

> On 2013/01/15 08:26, Ben Johnson wrote:
>
>> Based on my responses, what's the next move? Backup the Bayes DB, wipe
>> it, and feed my corpus through the ol' chipper?
>
> (Sure to infuriate BUT - read the WHOLE note.)
>
> Are you sure your Bayes database is well trained? But let's change that
> to, "Is the Bayes database SpamAssassin is using when receiving email
> the same as the Bayes database you are training with sa-learn?"

Yeah, we already checked that possibility.

--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhardin@impsec.org FALaholic #11174 pgpk -a jhardin@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
Gun Control enables genocide while doing little to reduce crime.
-----------------------------------------------------------------------
2 days until Benjamin Franklin's 307th Birthday
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new [ In reply to ]
On 2013/01/15 17:23, John Hardin wrote:
> On Tue, 15 Jan 2013, jdow wrote:
>
>> On 2013/01/15 08:26, Ben Johnson wrote:
>>
>>> Based on my responses, what's the next move? Backup the Bayes DB, wipe
>>> it, and feed my corpus through the ol' chipper?
>>
>> (Sure to infuriate BUT - read the WHOLE note.)
>>
>> Are you sure your Bayes database is well trained? But let's change that
>> to, "Is the Bayes database SpamAssassin is using when receiving email
>> the same as the Bayes database you are training with sa-learn?"
>
> Yeah, we already checked that possibility.

OK, then I shut my fat mouth.

{^_-}
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new [ In reply to ]
On 1/15/13 5:26 PM, Ben Johnson wrote:

>
> In postfix's main.cf:
>
<snip>
>
> Hmm, very interesting. No, I have no greylisting in place as yet, and
> no, my userbase doesn't demand immediate delivery. I will look into
> greylisting further.

If you're running Postfix, consider using postscreen. It's a recent
addition to Postfix that can also behave in a greylisting-like way, and
much more.

Read: http://www.postfix.org/POSTSCREEN_README.html
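
A minimal main.cf sketch, just to give a feel for it (values are illustrative, and postscreen's deeper protocol tests need more setup; see the README):

```
# main.cf sketch for Postfix >= 2.8 -- illustrative settings only
postscreen_greet_action    = enforce
postscreen_dnsbl_sites     = zen.spamhaus.org*2, bl.spamcop.net*1
postscreen_dnsbl_threshold = 2
postscreen_dnsbl_action    = enforce
```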

--
Tom
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new [ In reply to ]
On 1/15/2013 5:22 PM, John Hardin wrote:
> On Tue, 15 Jan 2013, Ben Johnson wrote:
>
>>
>>
>> On 1/15/2013 1:55 PM, John Hardin wrote:
>>> On Tue, 15 Jan 2013, Ben Johnson wrote:
>>>
>>>> On 1/14/2013 8:16 PM, John Hardin wrote:
>>>>> On Mon, 14 Jan 2013, Ben Johnson wrote:
>>>>>
>>>>> Question: do you have any SMTP-time hard-reject DNSBL tests in
>>>>> place? Or
>>>>> are they all performed by SA?
>>>>
>>>> In postfix's main.cf:
>>>>
>>>> smtpd_recipient_restrictions = permit_mynetworks,
>>>> permit_sasl_authenticated, check_recipient_access
>>>> mysql:/etc/postfix/mysql-virtual_recipient.cf,
>>>> reject_unauth_destination, reject_rbl_client bl.spamcop.net
>>>>
>>>> Do you recommend something more?
>>>
>>> Unfortunately I have no experience administering Postfix. Perhaps one of
>>> the other listies can help.
>>
>> Wow! Adding several more reject_rbl_client entries to the
>> smtpd_recipient_restrictions directive in the Postfix configuration
>> seems to be having a tremendous impact. The amount of spam coming
>> through has dropped by 90% or more. This was a HUGELY helpful
>> suggestion, John!
>
> Which ones are you using now? There are DNSBLs that are good, but not
> quite good enough to trust as hard-reject SMTP-time filters. That's why
> SA does scored DNSBL checks.

smtpd_recipient_restrictions =
reject_rbl_client bl.spamcop.net,
reject_rbl_client list.dsbl.org,
reject_rbl_client sbl-xbl.spamhaus.org,
reject_rbl_client cbl.abuseat.org,
reject_rbl_client dul.dnsbl.sorbs.net,

I acquired this list from the article that I cited a few responses back.
It is quite possible that some of these are obsolete, as the article is
from 2009. I seem to recall reading that sbl-xbl.spamhaus.org is
obsolete, but now I can't find the source.

These are "hard rejects", right? So if this change has reduced spam,
said spam would not be accepted for delivery at all; it would be
rejected outright. Correct? (And if I understand you, this is part of
your concern.)

The reason I ask, and a point that I should have clarified in my last
post, is that the *volume* of spam didn't drop by 90% (although, it may
have dropped by some measure), but rather the accuracy with which SA
tagged spam was 90% higher.

Ultimately, I'm wondering if the observed change was simply a product of
these message "campaigns" being black-listed after a few days of
circulation, and not the Postfix configuration change.

At this point, the vast majority of X-Spam-Status headers include Razor2
and Pyzor tests that contribute significantly to the score. I should
have mentioned earlier that I installed Razor2 and Pyzor after making my
initial post. The only reasons I didn't are that a) they didn't seem to
be making a significant difference for the first day or so after I
installed them (this could be for the snowshoe reasons we've already
discussed), and b) the low Bayes scores seemed to be the real problem
anyway.

That said, the Bayes scores seem to be much more accurate now, too. I
was hardly ever seeing BAYES_99 before, but now almost all spam messages
have BAYES_99.

Is it possible that the training I've been doing over the last week or
so wasn't *effective* until recently, say, after restarting some
component of the mail stack? My understanding is that calling SA via
Amavis, which does not need/use the spamd daemon, forces all Bayes data
to be up-to-date on each call to spamassassin.

It bears mention that I haven't yet dumped the Bayes DB and retrained
using my corpus. I'll do that next and see where we land once the DB is
repopulated.

>>>> Yes, users are allowed to train Bayes, via Dovecot's Antispam plug-in.
>>>> They do so unsupervised. Why this could be a problem is obvious. And
>>>> no,
>>>> I don't retain their submissions. I probably should. I wonder if I can
>>>> make a few slight modifications to the shell script that Antispam
>>>> calls,
>>>> such that it simply sends a copy of the message to an administrator
>>>> rather than calling sa-learn on the message.
>>>
>>> That would be a very good idea if the number of users doing training is
>>> small. At the very least, the messages should be captured to a permanent
>>> corpus mailbox.
>>
>> Good idea! I'll see if I can set this up.
>>
>>> Do your users also train ham? Are the procedures similar enough that
>>> your users could become easily confused?
>>
>> They do. The procedure is implemented via Dovecot's Antispam plug-in.
>> Basically, moving mail from Inbox to Junk trains it as spam, and moving
>> mail from Junk to Inbox trains it as ham. I really like this setup
>> (Antispam + calling SA through Amavis [i.e. not using spamd]) because
>> the results are effective immediately, which seems to be crucial for
>> combating this snowshoe spam (performance and scalability aside).
>>
>> I don't find that procedure to be confusing, but people are different, I
>> suppose.
>
> Hm. One thing I would watch out for in that environment is people who
> have intentionally subscribed to some sort of mailing list deciding they
> don't want to receive it any longer and just junking the messages rather
> than unsubscribing.

Good point. I hadn't thought of that. All the more reason to "screen"
the messages that are submitted for training.

> However, your problem is FN Bayes scores...
>
>>> The extremely odd thing is that you say you sometimes train a message as
>>> spam, and its Bayes score goes *down*. Are you training a message and
>> then running it through spamc to see if the score changed, or is this
>>> about _similar_ messages rather than _that_ message?
>>
>> Sorry for the ambiguity. This is about *similar* messages. Identical
>> messages, at least visually speaking (I realize that there is a lot more
>> to it than the visual component). For example, yesterday, I saw several
>> Canadian Pharmacy emails, all of which were identical with respect to
>> appearance. I classified each as spam, yet the Bayes score didn't budge
>> more than a few percent for the first three, and went *down* for the 4th.
>>
>> I have to assume that while the messages (HTML-formatted) *appear* to be
>> identical, the underlying code has some pseudo-random element that is
>> designed very specifically to throw Bayes classifiers.
>>
>> Out of curiosity, does the Bayes engine (or some other element of
>> SpamAssassin) have the ability to "see" rendered HTML messages, by
>> appearance, and not by source code? If it could, it would be far more
>> effective it seems.
>
> That I don't know.
>
>>> That, and configure the user-based training to at the very least capture
>>> what they submit to a corpus so you can review it. Whether you do that
>>> review pre-training or post-bayes-is-insane is up to you.
>>
>> Right, right, that makes sense. I hope I can modify the Antispam plug-in
>> to accommodate this requirement.
>>
>> Well, I can't thank you enough here, John and everyone else. I seem to
>> be on the right track; all is not lost.
>>
>> That said, it seems clear that SA is nowhere near as effective as it can
>> be when an off-the-shelf configuration is used (and without configuring
>> the MTA to do some of the blocking).
>>
>> I'll keep the list posted (pardon the pun) with regard to configuring
>> Antispam to fire-off a copy of any message that is submitted for
>> training. Ideally, whether the message is reviewed before or after
>> sa-learn is called will be configurable.
>
> Great! Thanks!
>

Thanks again for all the insight here, John.

-Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new [ In reply to ]
On 1/16/2013 2:02 AM, Tom Hendrikx wrote:
> On 1/15/13 5:26 PM, Ben Johnson wrote:
>
>>
>> In postfix's main.cf:
>>
> <snip>
>>
>> Hmm, very interesting. No, I have no greylisting in place as yet, and
>> no, my userbase doesn't demand immediate delivery. I will look into
>> greylisting further.
>
> If you're running postfix, consider using postscreen. It's a recent
> addition to postfix that also can behave in a greylisting alike way, and
> much more.
>
> Read: http://www.postfix.org/POSTSCREEN_README.html
>
> --
> Tom
>

Thanks for the suggestion, Tom!

Unfortunately, I'm stuck on Postfix 2.7 for a while yet, and Postscreen
is available for versions >= 2.8 only.

I will definitely look into it once I'm on 2.8+, however.

Cheers,

-Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new [ In reply to ]
On Wed, 16 Jan 2013, Ben Johnson wrote:

> On 1/15/2013 5:22 PM, John Hardin wrote:
>> On Tue, 15 Jan 2013, Ben Johnson wrote:
>>>
>>> Wow! Adding several more reject_rbl_client entries to the
>>> smtpd_recipient_restrictions directive in the Postfix configuration
>>> seems to be having a tremendous impact. The amount of spam coming
>>> through has dropped by 90% or more. This was a HUGELY helpful
>>> suggestion, John!
>>
>> Which ones are you using now? There are DNSBLs that are good, but not
>> quite good enough to trust as hard-reject SMTP-time filters. That's why
>> SA does scored DNSBL checks.
>
> smtpd_recipient_restrictions =
> reject_rbl_client bl.spamcop.net,
> reject_rbl_client list.dsbl.org,
> reject_rbl_client sbl-xbl.spamhaus.org,
> reject_rbl_client cbl.abuseat.org,
> reject_rbl_client dul.dnsbl.sorbs.net,

Several of those are combined into ZEN. If you use ZEN instead, you'll save
some DNS queries. See the Spamhaus link I provided earlier for details; I
don't offhand remember which ones go into ZEN.
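
As an aside, a DNSBL lookup is just a DNS query against the reversed IP, so spot-checking a listing is easy with dig (the helper function name is mine):

```shell
# Build the DNSBL query name for an IPv4 address: octets reversed,
# list zone appended. An A record in 127.0.0.0/8 means "listed";
# NXDOMAIN means "not listed".
dnsbl_query_name() {
    echo "$1" | awk -F. -v zone="$2" '{ print $4"."$3"."$2"."$1"."zone }'
}

# e.g.: dig +short "$(dnsbl_query_name 127.0.0.2 zen.spamhaus.org)"
#       (127.0.0.2 is the conventional always-listed test address)
```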

> These are "hard rejects", right? So if this change has reduced spam,
> said spam would not be accepted for delivery at all; it would be
> rejected outright. Correct? (And if I understand you, this is part of
> your concern.)

Correct.

> The reason I ask, and a point that I should have clarified in my last
> post, is that the *volume* of spam didn't drop by 90% (although, it may
> have dropped by some measure), but rather the accuracy with which SA
> tagged spam was 90% higher.

That's odd. That suggests your SA wasn't looking up those DNSBLs, or they
would have contributed to the score.

Check your trusted networks setting. One difference between SMTP-time and
SA-time DNSBL checks is that SMTP-time checks the IP address of the client
talking to the MTA, while SA-time can go back up the relay chain if
necessary (e.g. to check the client IP submitting to your ISP if your
ISP's MTA is between your MTA and the Internet, rather than always
checking your ISP's MTA IP address).
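
For concreteness, those settings live in local.cf; the ranges below are placeholders for your own LAN and border relays, not a recommendation:

```
# local.cf sketch -- example ranges only, substitute your own
trusted_networks  192.168.0.0/16 203.0.113.25   # hosts trusted not to forge Received headers
internal_networks 192.168.0.0/16                # your own MX/relay infrastructure
```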

> Ultimately, I'm wondering if the observed change was simply a product of
> these message "campaigns" being black-listed after a few days of
> circulation, and not the Postfix configuration change.

Maybe.

> At this point, the vast majority of X-Spam-Status headers include Razor2
> and Pyzor tests that contribute significantly to the score. I should
> have mentioned earlier that I installed Razor2 and Pyzor after making my
> initial post. The only reasons I didn't are that a) they didn't seem to
> be making a significant difference for the first day or so after I
> installed them (this could be for the snowshoe reasons we've already
> discussed), and b) the low Bayes scores seemed to be the real problem
> anyway.
>
> That said, the Bayes scores seem to be much more accurate now, too. I
> was hardly ever seeing BAYES_99 before, but now almost all spam messages
> have BAYES_99.

Odd. SMTP-time hard rejects shouldn't change that.

> Is it possible that the training I've been doing over the last week or
> so wasn't *effective* until recently, say, after restarting some
> component of the mail stack? My understanding is that calling SA via
> Amavis, which does not need/use the spamd daemon, forces all Bayes data
> to be up-to-date on each call to spamassassin.

That shouldn't be the case. SA and sa-learn both use a shared-access
database; if you're training the database that SA is using, the results
of training should be effective immediately.

--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhardin@impsec.org FALaholic #11174 pgpk -a jhardin@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
One difference between a liberal and a pickpocket is that if you
demand your money back from a pickpocket he will not question your
motives. -- William Rusher
-----------------------------------------------------------------------
Tomorrow: Benjamin Franklin's 307th Birthday
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new [ In reply to ]
On 1/16/2013 10:49 AM, Ben Johnson wrote:
> On 1/15/2013 5:22 PM, John Hardin wrote:
>> On Tue, 15 Jan 2013, Ben Johnson wrote:
>>
>>> Wow! Adding several more reject_rbl_client entries to the
>>> smtpd_recipient_restrictions directive in the Postfix configuration
>>> seems to be having a tremendous impact. The amount of spam coming
>>> through has dropped by 90% or more. This was a HUGELY helpful
>>> suggestion, John!
>> Which ones are you using now? There are DNSBLs that are good, but not
>> quite good enough to trust as hard-reject SMTP-time filters. That's why
>> SA does scored DNSBL checks.
> smtpd_recipient_restrictions =
> reject_rbl_client bl.spamcop.net,
> reject_rbl_client list.dsbl.org,
> reject_rbl_client sbl-xbl.spamhaus.org,
> reject_rbl_client cbl.abuseat.org,
> reject_rbl_client dul.dnsbl.sorbs.net,
>
> I acquired this list from the article that I cited a few responses back.
> It is quite possible that some of these are obsolete, as the article is
> from 2009. I seem to recall reading that sbl-xbl.spamhaus.org is
> obsolete, but now I can't find the source.

I'm not sure if it is considered "obsolete", but it has been generally
replaced by zen.spamhaus.org instead. Zen incorporates SBL, XBL, CSS,
and PBL. (See http://www.spamhaus.org/zen/)

> These are "hard rejects", right? So if this change has reduced spam,
> said spam would not be accepted for delivery at all; it would be
> rejected outright. Correct? (And if I understand you, this is part of
> your concern.)

Exactly.

> The reason I ask, and a point that I should have clarified in my last
> post, is that the *volume* of spam didn't drop by 90% (although, it may
> have dropped by some measure), but rather the accuracy with which SA
> tagged spam was 90% higher.

These rejects will drop the total volume of spam. SA's accuracy may
appear to go up if some of the more difficult spams are now being
blocked by the blacklists.

> Ultimately, I'm wondering if the observed change was simply a product of
> these message "campaigns" being black-listed after a few days of
> circulation, and not the Postfix configuration change.
>
> At this point, the vast majority of X-Spam-Status headers include Razor2
> and Pyzor tests that contribute significantly to the score. I should
> have mentioned earlier that I installed Razor2 and Pyzor after making my
> initial post. The only reasons I didn't are that a) they didn't seem to
> be making a significant difference for the first day or so after I
> installed them (this could be for the snowshoe reasons we've already
> discussed), and b) the low Bayes scores seemed to be the real problem
> anyway.
>
> That said, the Bayes scores seem to be much more accurate now, too. I
> was hardly ever seeing BAYES_99 before, but now almost all spam messages
> have BAYES_99.
>
> Is it possible that the training I've been doing over the last week or
> so wasn't *effective* until recently, say, after restarting some
> component of the mail stack? My understanding is that calling SA via
> Amavis, which does not need/use the spamd daemon, forces all Bayes data
> to be up-to-date on each call to spamassassin.

Amavis incorporates the SA code into itself. So in any instance where
you would normally need to restart spamd, you should instead restart
Amavis. In effect, Amavis is its own spamd daemon.

--
Bowie
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new [ In reply to ]
On 1/16/2013 9:49 AM, Ben Johnson wrote:
> smtpd_recipient_restrictions =
> reject_rbl_client bl.spamcop.net,

spamcop has a reputation for being somewhat aggressive in its
blocking, and its website recommends using it in a scoring system
(e.g. SpamAssassin) rather than for outright blocking. That said,
many folks (including me) use it anyway and find it acceptable.

See the spamcop website for details, and make your own choice.

> reject_rbl_client list.dsbl.org,

list.dsbl.org is no longer active. Remove this line.

> reject_rbl_client sbl-xbl.spamhaus.org,
> reject_rbl_client cbl.abuseat.org,

The spamhaus lists are now consolidated in zen.spamhaus.org; replace
the above two lines with it. See the spamhaus web site for details.
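That is, the two spamhaus entries above collapse to a single line in
main.cf (the rest of the restriction list is untouched):

```
    reject_rbl_client zen.spamhaus.org,
```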

> reject_rbl_client dul.dnsbl.sorbs.net,

This one is OK. Again, you should check their website and review
their published listing policy to see if this is something you want
to block.

Blocking mail is a very site-specific choice. Use the advice you
get as a starting point and make your own decision about how
aggressive you want to be.



-- Noel Jones
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new [ In reply to ]
On 1/16/2013 11:00 AM, John Hardin wrote:
> On Wed, 16 Jan 2013, Ben Johnson wrote:
>
>> On 1/15/2013 5:22 PM, John Hardin wrote:
>>> On Tue, 15 Jan 2013, Ben Johnson wrote:
>>>>
>>>> Wow! Adding several more reject_rbl_client entries to the
>>>> smtpd_recipient_restrictions directive in the Postfix configuration
>>>> seems to be having a tremendous impact. The amount of spam coming
>>>> through has dropped by 90% or more. This was a HUGELY helpful
>>>> suggestion, John!
>>>
>>> Which ones are you using now? There are DNSBLs that are good, but not
>>> quite good enough to trust as hard-reject SMTP-time filters. That's why
>>> SA does scored DNSBL checks.
>>
>> smtpd_recipient_restrictions =
>> reject_rbl_client bl.spamcop.net,
>> reject_rbl_client list.dsbl.org,
>> reject_rbl_client sbl-xbl.spamhaus.org,
>> reject_rbl_client cbl.abuseat.org,
>> reject_rbl_client dul.dnsbl.sorbs.net,
>
> Several of those are combined into ZEN. If you use Zen instead you'll
> save some DNS queries. See the Spamhaus link I provided earlier for
> details, I don't offhand remember which ones go into ZEN.

Per Noel's advice, I have shortened the list (dsbl.org is defunct) and
acted upon your mutual suggestion regarding ZEN:

reject_rbl_client bl.spamcop.net,
reject_rbl_client zen.spamhaus.org,
reject_rbl_client dnsbl.sorbs.net,

Indeed, block entries for all three lists are being registered in the
mail log. Very nice.

It seems as though adding these SMTP-time rejects has blocked about 1/2
of the spam that was coming through previously. Awesome.
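As an aside on how these rejects work mechanically: the MTA reverses
the octets of the connecting client's IP and looks that name up under
the list's zone; NXDOMAIN means "not listed". A minimal Python sketch
of the name construction only (the IP and zone below are just
examples; no actual DNS query is made):

```python
def dnsbl_query_name(ip: str, zone: str) -> str:
    """Build the DNS name an MTA queries to check `ip` against `zone`.

    The client IP's octets are reversed and prepended to the DNSBL
    zone, e.g. 203.0.113.7 checked against zen.spamhaus.org becomes
    7.113.0.203.zen.spamhaus.org. A listed IP resolves (typically to
    an address in 127.0.0.0/8 whose last octet encodes the sub-list);
    an unlisted IP yields NXDOMAIN.
    """
    octets = ip.split(".")
    if len(octets) != 4:
        raise ValueError("expected a dotted-quad IPv4 address")
    return ".".join(reversed(octets)) + "." + zone

print(dnsbl_query_name("203.0.113.7", "zen.spamhaus.org"))
# -> 7.113.0.203.zen.spamhaus.org
```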

>> These are "hard rejects", right? So if this change has reduced spam,
>> said spam would not be accepted for delivery at all; it would be
>> rejected outright. Correct? (And if I understand you, this is part of
>> your concern.)
>
> Correct.
>
>> The reason I ask, and a point that I should have clarified in my last
>> post, is that the *volume* of spam didn't drop by 90% (although, it may
>> have dropped by some measure), but rather the accuracy with which SA
>> tagged spam was 90% higher.
>
> That's odd. That suggests your SA wasn't looking up those DNSBLs, or they
> would have contributed to the score.
>
> Check your trusted networks setting. One difference between SMTP-time
> and SA-time DNSBL checks is that SMTP-time checks the IP address of the
> client talking to the MTA, while SA-time can go back up the relay chain
> if necessary (e.g. to check the client IP submitting to your ISP if your
> ISP's MTA is between your MTA and the Internet, rather than always
> checking your ISP's MTA IP address).

Are you referring to SA's "trusted_networks" directive? If so, it is
commented-out (presumably by default). Does this need to be set? I've
read the info re: trusted_networks at
http://spamassassin.apache.org/full/3.3.x/doc/Mail_SpamAssassin_Conf.html,
but I'm struggling to understand it.

If the info is helpful, I have a very simple setup here: a single server
with a single public IP address and a single MTA.

>> Ultimately, I'm wondering if the observed change was simply a product of
>> these message "campaigns" being black-listed after a few days of
>> circulation, and not the Postfix configuration change.
>
> Maybe.
>
>> At this point, the vast majority of X-Spam-Status headers include Razor2
>> and Pyzor tests that contribute significantly to the score. I should
>> have mentioned earlier that I installed Razor2 and Pyzor after making my
>> initial post. The only reasons I didn't are that a) they didn't seem to
>> be making a significant difference for the first day or so after I
>> installed them (this could be for the snowshoe reasons we've already
>> discussed), and b) the low Bayes scores seemed to be the real problem
>> anyway.
>>
>> That said, the Bayes scores seem to be much more accurate now, too. I
>> was hardly ever seeing BAYES_99 before, but now almost all spam messages
>> have BAYES_99.
>
> Odd. SMTP-time hard rejects shouldn't change that.

That's what I figured. I wonder if feeding all of the messages that I
"auto-learned manually" -- messages that were tagged as spam (but for
reasons unrelated to Bayes) -- contributed significantly to this change.
I did this late yesterday afternoon and when I took a status check this
morning, I was seeing BAYES_99 for almost every message.
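One quick way to keep an eye on which rules are firing (BAYES_99,
Razor2, Pyzor, etc.) is to pull the tests= field out of the
X-Spam-Status header. A rough Python sketch, assuming the common
single-line header format (the sample header below is made up, and
real headers may wrap the tests list across lines):

```python
import re

def spam_tests(header_value: str) -> list[str]:
    """Extract the rule names from an X-Spam-Status header value."""
    match = re.search(r"tests=([\w,]+)", header_value)
    return match.group(1).split(",") if match else []

sample = ("Yes, score=15.3 required=5.0 "
          "tests=BAYES_99,PYZOR_CHECK,RAZOR2_CHECK autolearn=spam")
tests = spam_tests(sample)
print("BAYES_99" in tests)  # -> True
```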

>> Is it possible that the training I've been doing over the last week or
>> so wasn't *effective* until recently, say, after restarting some
>> component of the mail stack? My understanding is that calling SA via
>> Amavis, which does not need/use the spamd daemon, forces all Bayes data
>> to be up-to-date on each call to spamassassin.
>
> That shouldn't be the case. SA and sa-learn both use a shared-access
> database; if you're training the database that SA is learning, the
> results of training should be effective immediately.
>

Okay, good. Bowie's response to this question differed (he suggested
that Amavis would need to be restarted for Bayes to be updated), but I'm
pretty sure that restarting Amavis is not necessary. It seems unlikely
that Amavis would copy the entire Bayes DB (which is stored in MySQL on
this server) into memory every time that the Amavis service is started.
To do so seems self-defeating: more RAM usage, worse performance, etc.

So, I emptied the Bayes DB and re-trained ham and spam on my hand-sorted
corpus. The net result was to discard all previous end-user training, if
I understand correctly.
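After a retrain like this, a handy sanity check is `sa-learn --dump
magic`, which reports the nspam/nham message counts. A small Python
sketch that parses those counts out of the dump output (the sample
text is hard-coded here, in the column format SA 3.3 prints; run the
real command to get live numbers):

```python
def bayes_counts(dump_output: str) -> dict[str, int]:
    """Pull nspam/nham counts from `sa-learn --dump magic` output."""
    counts = {}
    for line in dump_output.splitlines():
        fields = line.split()
        # Lines look like: 0.000  0  1583  0  non-token data: nspam
        if len(fields) >= 7 and fields[4:6] == ["non-token", "data:"]:
            key = fields[6]
            if key in ("nspam", "nham"):
                counts[key] = int(fields[2])
    return counts

sample = """\
0.000          0          3          0  non-token data: bayes db version
0.000          0       1583          0  non-token data: nspam
0.000          0       1236          0  non-token data: nham
"""
print(bayes_counts(sample))  # -> {'nspam': 1583, 'nham': 1236}
```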

Everything still looks good; mostly BAYES_99 on the messages that are
and should be marked as spam, and no false-positives at all.

I've disabled the Antispam plug-in for now, for the reasons we've
already discussed. I have asked the Dovecot mailing list for suggestions
regarding how best to pre-screen end-user training submissions.

I think I'm in pretty good shape here, unless setting trusted_networks
is a must, in which case I could use some guidance.

All the best,

-Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new [ In reply to ]
On Wed, 16 Jan 2013, Ben Johnson wrote:

> On 1/16/2013 11:00 AM, John Hardin wrote:
>>
>> That's odd. That suggests your SA wasn't looking up those DNSBLs, or they
>> would have contributed to the score.
>>
>> Check your trusted networks setting. One difference between SMTP-time
>> and SA-time DNSBL checks is that SMTP-time checks the IP address of the
>> client talking to the MTA, while SA-time can go back up the relay chain
>> if necessary (e.g. to check the client IP submitting to your ISP if your
>> ISP's MTA is between your MTA and the Internet, rather than always
>> checking your ISP's MTA IP address).
>
> Are you referring to SA's "trusted_networks" directive?

Yes.

> If so, it is commented-out (presumably by default). Does this need to be
> set? I've read the info re: trusted_networks at
> http://spamassassin.apache.org/full/3.3.x/doc/Mail_SpamAssassin_Conf.html
> , but I'm struggling to understand it.

It means "which MTAs are trusted to not forge Received headers".

There is a related one: internal_networks, which lists networks that are
considered "internal" to your inbound mail topology. Sorry I missed that
one in my first message. This one you'd set if you were retrieving your
email from your ISP rather than directly exposing a MTA to the Internet.

> If the info is helpful, I have a very simple setup here: a single server
> with a single public IP address and a single MTA.

That's the assumed default environment. If you aren't explicitly setting
trusted_networks and internal_networks you should be okay.
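For the record, if you ever do need to set them (say, with a relay or
webmail host in front of your MTA), the syntax in local.cf is simple.
The addresses below are purely hypothetical examples:

```
# local.cf -- example values only
trusted_networks  192.168.0.0/16   # MTAs trusted not to forge Received headers
internal_networks 192.168.1.10     # hosts internal to your mail topology
```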

>>> That said, the Bayes scores seem to be much more accurate now, too. I
>>> was hardly ever seeing BAYES_99 before, but now almost all spam messages
>>> have BAYES_99.
>>
>> Odd. SMTP-time hard rejects shouldn't change that.
>
> That's what I figured. I wonder if feeding all of the messages that I
> "auto-learned manually" -- messages that were tagged as spam (but for
> reasons unrelated to Bayes) -- contributed significantly to this change.

Quite possibly.

>> That shouldn't be the case. SA and sa-learn both use a shared-access
>> database; if you're training the database that SA is learning, the
>> results of training should be effective immediately.
>
> Okay, good. Bowie's response to this question differed (he suggested
> that Amavis would need to be restarted for Bayes to be updated),

No, he didn't, he said that in a situation where you'd have to restart
spamd, you instead need to restart amavisd. One such situation is after
running sa-update and getting updated rules.

> but I'm pretty sure that restarting Amavis is not necessary. It seems
> unlikely that Amavis would copy the entire Bayes DB (which is stored in
> MySQL on this server) into memory every time that the Amavis service is
> started. To do so seems self-defeating: more RAM usage, worse
> performance, etc.

Right.

> So, I emptied the Bayes DB and re-trained ham and spam on my hand-sorted
> corpus. The net result was to discard all previous end-user training, if
> I understand correctly.

That is correct.

> Everything still looks good; mostly BAYES_99 on the messages that are
> and should be marked as spam, and no false-positives at all.

yay!

> I've disabled the Antispam plug-in for now, for the reasons we've
> already discussed. I have asked the Dovecot mailing list for suggestions
> regarding how best to pre-screen end-user training submissions.
>
> I think I'm in pretty good shape here, unless setting trusted_networks
> is a must, in which case I could use some guidance.

No, sounds like you're good for that.

--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhardin@impsec.org FALaholic #11174 pgpk -a jhardin@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
It is criminal to teach a man not to defend himself when he is the
constant victim of brutal attacks. -- Malcolm X (1964)
-----------------------------------------------------------------------
Tomorrow: Benjamin Franklin's 307th Birthday
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new [ In reply to ]
On 1/16/2013 1:18 PM, Ben Johnson wrote:
>
> On 1/16/2013 11:00 AM, John Hardin wrote:
>> On Wed, 16 Jan 2013, Ben Johnson wrote:
>>
>>> Is it possible that the training I've been doing over the last week or
>>> so wasn't *effective* until recently, say, after restarting some
>>> component of the mail stack? My understanding is that calling SA via
>>> Amavis, which does not need/use the spamd daemon, forces all Bayes data
>>> to be up-to-date on each call to spamassassin.
>> That shouldn't be the case. SA and sa-learn both use a shared-access
>> database; if you're training the database that SA is learning, the
>> results of training should be effective immediately.
>>
> Okay, good. Bowie's response to this question differed (he suggested
> that Amavis would need to be restarted for Bayes to be updated), but I'm
> pretty sure that restarting Amavis is not necessary. It seems unlikely
> that Amavis would copy the entire Bayes DB (which is stored in MySQL on
> this server) into memory every time that the Amavis service is started.
> To do so seems self-defeating: more RAM usage, worse performance, etc.

Actually, I was making a general observation.

For cases where you would normally need to restart spamd, you will need
to restart amavis. This includes things like rule and configuration
changes.

Bayes data is read dynamically from your MySQL database and thus does
not require a restart of amavis/spamd when updated.

--
Bowie
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new [ In reply to ]
Hi,

>>> smtpd_recipient_restrictions =
>>> reject_rbl_client bl.spamcop.net,
>>> reject_rbl_client list.dsbl.org,
>>> reject_rbl_client sbl-xbl.spamhaus.org,
>>> reject_rbl_client cbl.abuseat.org,
>>> reject_rbl_client dul.dnsbl.sorbs.net,
>>
>> Several of those are combined into ZEN. If you use Zen instead you'll
>> save some DNS queries. See the Spamhaus link I provided earlier for
>> details, I don't offhand remember which ones go into ZEN.
>
> Per Noel's advice, I have shortened the list (dsbl.org is defunct) and
> acted upon your mutual suggestion regarding ZEN:
>
> reject_rbl_client bl.spamcop.net,
> reject_rbl_client zen.spamhaus.org,
> reject_rbl_client dnsbl.sorbs.net,

I've also started using the following, but it could be specific to postfix v2.9:

reject_rhsbl_reverse_client zen.spamhaus.org,
reject_rhsbl_sender zen.spamhaus.org,
reject_rhsbl_helo zen.spamhaus.org,

Are you using rbl_reply_maps? Prior to postscreen, I was using it in this way:

rbl_reply_maps = hash:/etc/postfix/rbl_reply_maps

I'm not sure it's necessary in your situation. You can find more about
this here:

http://www.postfix.org/STRESS_README.html
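For context, each entry in that map pairs a DNSBL zone with a custom
SMTP reply template. A hypothetical /etc/postfix/rbl_reply_maps might
look like this (the $rbl_* attributes are the ones documented in
postconf(5) under default_rbl_reply; run postmap on the file after
editing it):

```
# /etc/postfix/rbl_reply_maps (hypothetical example)
zen.spamhaus.org  $rbl_code Service unavailable; client [$rbl_what] blocked using $rbl_domain
```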

No doubt the guys on this list have been incredibly helpful in the
past. I'd like to thank them again as well.

> Okay, good. Bowie's response to this question differed (he suggested
> that Amavis would need to be restarted for Bayes to be updated), but I'm
> pretty sure that restarting Amavis is not necessary. It seems unlikely
> that Amavis would copy the entire Bayes DB (which is stored in MySQL on
> this server) into memory every time that the Amavis service is started.
> To do so seems self-defeating: more RAM usage, worse performance, etc.

I also don't believe it's necessary to restart amavisd when changes
are made to Bayes. I'm also using MySQL. I just wish replication were
faster, or I would use it across my multiple mail servers. Instead, I
have to maintain multiple separate MySQL Bayes databases, each with
its own tokens, training corpus, etc., despite it all being for a
single domain.

Regarding restarting amavisd, this is always frustrating for me. I'm
sometimes making changes very frequently, and amavisd doesn't always
restart reliably. Despite a "service amavisd stop" on Fedora, it
doesn't completely stop; it just goes catatonic and requires me to
kill it manually.

I've asked on the amavisd list, but no one has been able to help. I've
tried just issuing a "reload", but that doesn't always work either.
Does anyone know whether there's a signal, or some other way, to
restart amavisd more reliably?

Thanks,
Alex
