Mailing List Archive

Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
About five months ago, I experienced a problem that I *thought* I had
resolved, but I am observing similar behavior after retraining the Bayes
database. While the symptoms are similar, the root cause seems to be
different (thankfully). The original problem is documented at
http://spamassassin.1065346.n5.nabble.com/Very-spammy-messages-yield-BAYES-00-1-9-td101167.html
.

In any case, I am again seeing SA scores that seem way too low for the
message content in question. My "glue", as it were, is Amavis-New.

In particular, certain messages that are clearly SPAM are scored between
0 and 3 when processed via Amavis. However, if I process the same
messages with the "spamassassin" binary, directly, the scores are much
higher and much more in-line with what one would expect.

The X-Spam-Status header, when processed via Amavis, looks like this:

X-Spam-Status: No, score=0.8 tagged_above=-999 required=2
tests=[BAYES_50=0.8, HTML_MESSAGE=0.001, SPF_PASS=-0.001] autolearn=disabled

When I process the same message with spamassassin, directly
(spamassassin -t -D < /tmp/msg.txt), the header looks like this:

----------------------------------------------------------------------
X-Spam-Status: Yes, score=7.5 required=5.0
tests=BAYES_50,MISSING_DATE,MISSING_HEADERS,MISSING_MID,MISSING_SUBJECT,NO_HEADERS_MESSAGE,NO_RECEIVED,NO_RELAYS
autolearn=disabled version=3.3.1

[...]

Content analysis details: (7.5 points, 5.0 required)

 pts rule name              description
---- ---------------------- --------------------------------------------------
-0.0 NO_RELAYS Informational: message was not relayed via SMTP
1.2 MISSING_HEADERS Missing To: header
2.0 BAYES_50 BODY: Bayes spam probability is 40 to 60%
[score: 0.5000]
1.2 MISSING_MID Missing Message-Id: header
1.3 MISSING_SUBJECT Missing Subject: header
-0.0 NO_RECEIVED Informational: message has no Received headers
1.8 MISSING_DATE Missing Date: header
0.0 NO_HEADERS_MESSAGE Message appears to be missing most RFC-822
headers
----------------------------------------------------------------------

In short, my question is, how the **** is the message scoring 0.8 in one
case and 7.5 in another? That is a massive discrepancy.

From what I can tell, the same tests aren't even being performed in each
case.

I have to assume that the options that are passed to SA are wildly
different in each case.

It bears mention that the server in question uses ISPConfig 3. ISPConfig
allows for SA policies to be configured per-domain and per-user, and
Amavis leverages MySQL to make that happen. If relevant, I can provide
more information about this aspect of my setup.

These are the only directives that I've added to /etc/spamassassin/local.cf:

----------------------------------------------------------------------
bayes_path /var/lib/amavis/.spamassassin/bayes

use_bayes 1
bayes_auto_expire 0
bayes_store_module Mail::SpamAssassin::BayesStore::MySQL
bayes_sql_dsn DBI:mysql:sa_bayes:localhost
bayes_sql_username sa_user
bayes_sql_password [scrubbed]
bayes_sql_override_username amavis
----------------------------------------------------------------------

Given the first directive, SA should always use the same Bayes database
(the one I've configured in MySQL), regardless of how SA is called, right?
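
For what it's worth, my sanity check for which Bayes backend actually
gets loaded (possibly naive; the "bayes" debug channel should name the
store module and the SQL DSN in use) is:

----------------------------------------------------------------------
# Run as the amavis user so the same configuration applies:
su - amavis -s /bin/sh -c "spamassassin -D bayes --lint" 2>&1 | grep -i bayes
----------------------------------------------------------------------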

For those curious about the state of the Bayes database, here's the
output from "sa-learn --dump magic":

0.000          0          3          0  non-token data: bayes db version
0.000          0       2007          0  non-token data: nspam
0.000          0       6554          0  non-token data: nham
0.000          0     188379          0  non-token data: ntokens
0.000          0 1356345829          0  non-token data: oldest atime
0.000          0 1357769317          0  non-token data: newest atime
0.000          0          0          0  non-token data: last journal sync atime
0.000          0 1357727978          0  non-token data: last expiry atime
0.000          0    1382400          0  non-token data: last expire atime delta
0.000          0       3191          0  non-token data: last expire reduction count

Ultimately, it seems that I should be trying to figure out how, exactly,
Amavis is calling SpamAssassin in the course of normal operation.
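
The closest thing I've found so far (untested; going from the
amavisd-new docs, so please correct me) is to stop the daemon and run
it in the foreground with SA debugging switched on:

----------------------------------------------------------------------
# Debian-style paths; adjust the init script / binary name as needed:
/etc/init.d/amavis stop
amavisd-new debug-sa
----------------------------------------------------------------------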

Thanks for any help here, folks!

-Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On Wed, Jan 09, 2013 at 05:14:05PM -0500, Ben Johnson wrote:
> Content analysis details: (7.5 points, 5.0 required)
>
>  pts rule name              description
> ---- ---------------------- --------------------------------------------------
> -0.0 NO_RELAYS Informational: message was not relayed via SMTP
> 1.2 MISSING_HEADERS Missing To: header
> 2.0 BAYES_50 BODY: Bayes spam probability is 40 to 60%
> [score: 0.5000]
> 1.2 MISSING_MID Missing Message-Id: header
> 1.3 MISSING_SUBJECT Missing Subject: header
> -0.0 NO_RECEIVED Informational: message has no Received headers
> 1.8 MISSING_DATE Missing Date: header
> 0.0 NO_HEADERS_MESSAGE Message appears to be missing most RFC-822
> headers
> ----------------------------------------------------------------------

These hits indicate that the mail you're testing (/tmp/msg.txt) is
corrupted, as it is missing most email headers.

> In short, my question is, how the **** is the message scoring 0.8 in one
> case and 7.5 in another? That is a massive discrepancy.

In the second case, the mail you are testing is corrupted. Open /tmp/msg.txt
in a text editor and check if it looks sane.
--
Marius Gavrilescu
(warnings) Do not dangle the mouse by its cable or throw the mouse at co-workers. --From a manual for an SGI computer.
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On Wed, 09 Jan 2013 17:14:05 -0500
Ben Johnson wrote:

> About five months ago, I experienced a problem that I *thought* I had
> resolved, but I am observing similar behavior after retraining the
> Bayes database. While the symptoms are similar, the root cause seems
> to be different (thankfully). The original problem is documented at
> http://spamassassin.1065346.n5.nabble.com/Very-spammy-messages-yield-BAYES-00-1-9-td101167.html
> ..
>
> In any case, I am again seeing SA scores that seem way too low for the
> message content in question. My "glue", as it were, is Amavis-New.
>
> In particular, certain messages that are clearly SPAM are scored
> between 0 and 3 when processed via Amavis. However, if I process the
> same messages with the "spamassassin" binary, directly, the scores
> are much higher and much more in-line with what one would expect.
>...
> When I process the same message with spamassassin, directly
> (spamassassin -t -D < /tmp/msg.txt), the header looks like this:
>
> ----------------------------------------------------------------------
> X-Spam-Status: Yes, score=7.5 required=5.0
> tests=BAYES_50,MISSING_DATE,MISSING_HEADERS,MISSING_MID,MISSING_SUBJECT,NO_HEADERS_MESSAGE,NO_RECEIVED,NO_RELAYS
> autolearn=disabled version=3.3.1


This is not better, it indicates that SA didn't recognise it as an
email, not that it recognised it as a spam. Whatever /tmp/msg.txt was
it wasn't a properly formatted email.
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 1/9/2013 5:36 PM, RW wrote:
> On Wed, 09 Jan 2013 17:14:05 -0500
> Ben Johnson wrote:
>
>> About five months ago, I experienced a problem that I *thought* I had
>> resolved, but I am observing similar behavior after retraining the
>> Bayes database. While the symptoms are similar, the root cause seems
>> to be different (thankfully). The original problem is documented at
>> http://spamassassin.1065346.n5.nabble.com/Very-spammy-messages-yield-BAYES-00-1-9-td101167.html
>> ..
>>
>> In any case, I am again seeing SA scores that seem way too low for the
>> message content in question. My "glue", as it were, is Amavis-New.
>>
>> In particular, certain messages that are clearly SPAM are scored
>> between 0 and 3 when processed via Amavis. However, if I process the
>> same messages with the "spamassassin" binary, directly, the scores
>> are much higher and much more in-line with what one would expect.
>> ...
>> When I process the same message with spamassassin, directly
>> (spamassassin -t -D < /tmp/msg.txt), the header looks like this:
>>
>> ----------------------------------------------------------------------
>> X-Spam-Status: Yes, score=7.5 required=5.0
>> tests=BAYES_50,MISSING_DATE,MISSING_HEADERS,MISSING_MID,MISSING_SUBJECT,NO_HEADERS_MESSAGE,NO_RECEIVED,NO_RELAYS
>> autolearn=disabled version=3.3.1
>
>
> This is not better, it indicates that SA didn't recognise it as an
> email, not that it recognised it as a spam. Whatever /tmp/msg.txt was
> it wasn't a properly formatted email.
>

Thanks for the quick replies, Marius and RW.

I see; I saved the email message out of Thunderbird (with View ->
Headers -> All), as a plain text file. Apparently, that process butchers
the original message.

I'm reviewing SA's behavior using an email client to view the messages,
but I also have access to the mailbox on the server. I realize that this
question may seem amateurish, but how does one discern the "message ID"
from the email client and locate the corresponding file in the user's
"Maildir"? I'm using Dovecot 1.x.

The file names in the user's Maildir look like this:

1357762471.M952293P32429.example.com,S=4300,W=4381:2,

I assume that the first bit is a UNIX timestamp. Is there any means by
which to correlate the second bit (M952293P32429) to the message as I
see it in my email client (Thunderbird)? I don't see that string
anywhere in the headers (maybe that's by design).

In other words, when I spot a message that SA seems to be scoring
incorrectly in my Inbox, how do I track down the actual file on the
server that should be fed into "spamassassin"?

Is there some better method than doing something like

# grep -ir 20B2834E4242 /var/vmail/example.com/user/Maildir

where 20B2834E4242 is the ID in the "Received" header?
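
The best refinement I've come up with on my own is to grep for the
Message-ID header instead, since it's stored verbatim in the file (the
ID below is made up, of course):

----------------------------------------------------------------------
# -l prints just the matching filename(s):
grep -lr "Message-ID: <ABC123@example.com>" \
    /var/vmail/example.com/user/Maildir
----------------------------------------------------------------------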

In any case, I tracked down the original message on the server and
repeated the process (spamassassin -t < /tmp/msg.txt):

----------------------------------------------------------------------
X-Spam-Status: Yes, score=9.3 required=5.0 tests=BAYES_50,HTML_MESSAGE,
RCVD_IN_BRBL_LASTEXT,RCVD_IN_CSS,RCVD_IN_PSBL,RCVD_IN_XBL,URIBL_DBL_SPAM,
URIBL_JP_SURBL autolearn=disabled version=3.3.1

[...]

Content analysis details: (9.3 points, 5.0 required)

 pts rule name              description
---- ---------------------- --------------------------------------------------
0.4 RCVD_IN_XBL RBL: Received via a relay in Spamhaus XBL
[188.165.126.107 listed in zen.spamhaus.org]
1.0 RCVD_IN_CSS RBL: Received via a relay in Spamhaus CSS
2.7 RCVD_IN_PSBL RBL: Received via a relay in PSBL
[188.165.126.107 listed in psbl.surriel.com]
1.2 URIBL_JP_SURBL Contains an URL listed in the JP SURBL blocklist
[URIs: ehylle.info]
1.4 RCVD_IN_BRBL_LASTEXT RBL: RCVD_IN_BRBL_LASTEXT
[188.165.126.107 listed in
bb.barracudacentral.org]
1.7 URIBL_DBL_SPAM Contains an URL listed in the DBL blocklist
[URIs: ehylle.info]
0.0 HTML_MESSAGE BODY: HTML included in message
0.8 BAYES_50 BODY: Bayes spam probability is 40 to 60%
[score: 0.5428]
----------------------------------------------------------------------

So, if I've done this correctly, the score discrepancy is even larger.

Thanks, guys!

-Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 2013-01-10 01:03, Ben Johnson wrote:

> I see; I saved the email message out of Thunderbird (with View ->
> Headers -> All), as a plain text file. Apparently, that process
> butchers the original message.

In Thunderbird, use File > Save As instead to save the entire message.

> RCVD_IN_BRBL_LASTEXT,RCVD_IN_CSS,RCVD_IN_PSBL,RCVD_IN_XBL,URIBL_DBL_SPAM,
> URIBL_JP_SURBL autolearn=disabled version=3.3.1

Rules based on RBL/URIBL checks depend on DNS-based blacklist queries,
and between the time you first receive an email and the time you
re-scan it, the originating client IP and/or URIs from the mail body
may have been added to the blacklists. Did you re-scan the mail with
amavis, too, or did you post the X-Spam header lines from the original
amavis scan and re-scan the mail with spamassassin significantly later?

I am not familiar with amavis, but I know that it calls spamassassin in
a special way, depending on the amavis config. Wild guess: could it be
that RBL/URIBL queries are disabled in your amavis config?

Hope this helps.

Cheers,

wolfgang
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 1/9/2013 7:36 PM, wolfgang wrote:
> On 2013-01-10 01:03, Ben Johnson wrote:
>
>> I see; I saved the email message out of Thunderbird (with View ->
>> Headers -> All), as a plain text file. Apparently, that process
>> butchers the original message.
>
> In Thunderbird, use File > Save As instead to save the entire message.
>
>> RCVD_IN_BRBL_LASTEXT,RCVD_IN_CSS,RCVD_IN_PSBL,RCVD_IN_XBL,URIBL_DBL_SPAM,
>> URIBL_JP_SURBL autolearn=disabled version=3.3.1
>
> Rules based on RBL/URIBL checks depend on DNS-based blacklist queries,
> and between the time you first receive an email and the time you
> re-scan it, the originating client IP and/or URIs from the mail body
> may have been added to the blacklists. Did you re-scan the mail with
> amavis, too, or did you post the X-Spam header lines from the original
> amavis scan and re-scan the mail with spamassassin significantly later?
>
> I am not familiar with amavis, but I know that it calls spamassassin in
> a special way, depending on the amavis config. Wild guess: could it be
> that RBL/URIBL queries are disabled in your amavis config?
>
> Hope this helps.
>
> Cheers,
>
> wolfgang
>

Hi, Wolfgang,

Thanks for the reply.

What you say about the RBL/URIBL tests makes sense. I did not rescan the
message with amavis; I posted the X-Spam-Status header contents from the
original scan. The only reason I did not rescan the message with Amavis
is that I don't know how to perform a SpamAssassin scan through Amavis
manually, and I can't find any instructions for the process.

All of that said, less than eight hours elapsed between the original
scan with Amavis and the manual scan with "spamassassin". But, that's
probably long enough for the IP addresses to be blacklisted.
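
The only candidate I've found for a manual re-scan through Amavis is to
resubmit the saved message to amavisd's SMTP listener (which defaults
to 127.0.0.1:10024), e.g. with swaks, though I haven't verified this:

----------------------------------------------------------------------
# Envelope addresses here are placeholders:
swaks --server 127.0.0.1 --port 10024 \
      --from sender@example.com --to user@example.com \
      --data /tmp/msg.txt
----------------------------------------------------------------------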

If nobody knows how to scan messages through Amavis, maybe I need to
take this question over to the Amavis list for the time being.

Thanks again,

-Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On Wed, 9 Jan 2013, Ben Johnson wrote:

> On 1/9/2013 7:36 PM, wolfgang wrote:
>>
>>> RCVD_IN_BRBL_LASTEXT,RCVD_IN_CSS,RCVD_IN_PSBL,RCVD_IN_XBL,URIBL_DBL_SPAM,
>>> URIBL_JP_SURBL autolearn=disabled version=3.3.1
>>
>> I am not familiar with amavis, but I know that it calls spamassassin in
>> a special way, depending on the amavis config. Wild guess: could it be
>> that RBL/URIBL queries are disabled in your amavis config?
>
> Thanks for the reply.
>
> What you say about the RBL/URIBL tests makes sense.

Check your amavis configuration to see whether you have network tests
disabled. That's the simplest explanation.

--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhardin@impsec.org FALaholic #11174 pgpk -a jhardin@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
"They will be slaughtered as result of England's anti-gun laws
that concentrates power to the Government."
-- Shifty Powers (101 abn) observing British
subjects training to repel a German invasion
using rakes, hoes and pitchforks
-----------------------------------------------------------------------
8 days until Benjamin Franklin's 307th Birthday
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 10/01/13 00:03, Ben Johnson wrote:
>
>
> On 1/9/2013 5:36 PM, RW wrote:
>>
>>
>> This is not better, it indicates that SA didn't recognise it as an
>> email, not that it recognised it as a spam. Whatever /tmp/msg.txt was
>> it wasn't a properly formatted email.
>>
>
> Thanks for the quick replies, Marius and RW.
>
> I see; I saved the email message out of Thunderbird (with View ->
> Headers -> All), as a plain text file. Apparently, that process butchers
> the original message.
>

Ben,

In Thunderbird, select the message and then press Ctrl-U (or from the
menus: View > Message Source) and select File > Save to save the email
including all headers in plain text format. You can then feed it to
spamassassin as above.

Hope that helps.
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 1/9/2013 9:13 PM, John Hardin wrote:
> On Wed, 9 Jan 2013, Ben Johnson wrote:
>
>> On 1/9/2013 7:36 PM, wolfgang wrote:
>>>
>>>> RCVD_IN_BRBL_LASTEXT,RCVD_IN_CSS,RCVD_IN_PSBL,RCVD_IN_XBL,URIBL_DBL_SPAM,
>>>> URIBL_JP_SURBL autolearn=disabled version=3.3.1
>>>
>>> I am not familiar with amavis, but I know that it calls spamassassin in
>>> a special way, depending on the amavis config. Wild guess: could it be
>>> that RBL/URIBL queries are disabled in your amavis config?
>>
>> Thanks for the reply.
>>
>> What you say about the RBL/URIBL tests makes sense.
>
> Check your amavis configuration to see whether you have network tests
> disabled. That's the simplest explanation.
>

Thanks, John.

On the surface, network tests appear to be enabled:

# grep -ir sa_local_tests_only /etc/amavis
/etc/amavis/conf.d/20-debian_defaults:$sa_local_tests_only = 0; # only tests which do not require internet access?

Also, some of the incoming messages do contain network test scoring data
in the X-Spam-Status header; here are two examples:

Yes, score=8.451 tagged_above=-999 required=2 tests=[BAYES_99=3.5,
RCVD_IN_BRBL_LASTEXT=1.449, RCVD_IN_CSS=1, RDNS_NONE=0.793,
SPF_PASS=-0.001, T_LOTS_OF_MONEY=0.01, URIBL_DBL_SPAM=1.7]
autolearn=disabled

Yes, score=12.266 tagged_above=-999 required=2 tests=[BAYES_50=0.8,
DATE_IN_FUTURE_12_24=3.199, DIET_1=0.001, HTML_MESSAGE=0.001,
RCVD_IN_BRBL_LASTEXT=1.449, RCVD_IN_PSBL=2.7, RCVD_IN_XBL=0.375,
RDNS_NONE=0.793, SPF_HELO_PASS=-0.001, SPF_PASS=-0.001,
URIBL_DBL_SPAM=1.7, URIBL_JP_SURBL=1.25] autolearn=disabled

(Several of those are network tests, right?)

What's strange is that another message was delivered at nearly the same
time as the above two, yet it shows no evidence of network tests being
performed (right?):

No, score=0.8 tagged_above=-999 required=2 tests=[BAYES_50=0.8,
HTML_MESSAGE=0.001, SPF_PASS=-0.001] autolearn=disabled

It seems as though the SPAM that slips through never shows evidence of
network tests, whereas the SPAM that is caught (and usually has a high
score -- 10 or higher) always seems to show evidence of network tests.

This observation begs the question: why are network tests being
performed for some messages but not others? To my knowledge, no
white/gray/black listing has been done on this box.

Thanks again,

-Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On Thu, 10 Jan 2013 11:43:44 -0500
Ben Johnson wrote:


> This observation begs the question: why are network tests being
> performed for some messages but not others? To my knowledge, no
> white/gray/black listing has been done on this box.

As has already been said, the score from network tests is commonly a
lot higher on retesting because of all the reporting that happened
in-between.
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 1/10/2013 11:49 AM, RW wrote:
> On Thu, 10 Jan 2013 11:43:44 -0500
> Ben Johnson wrote:
>
>
>> This observation begs the question: why are network tests being
>> performed for some messages but not others? To my knowledge, no
>> white/gray/black listing has been done on this box.
>
> As has already been said, the score from network tests is commonly a
> lot higher on retesting because of all the reporting that happened
> in-between.
>

RW,

I understand that, but it doesn't explain why, if I retest a given
message by calling SpamAssassin directly with network tests *disabled*,
the score is sometimes *higher* than when the message was scanned
initially with AMaViS.

When this message came through initially, the X-Spam-Status header was:

No, score=1.593 tagged_above=-999 required=2 tests=[BAYES_50=0.8,
HTML_MESSAGE=0.001, RDNS_NONE=0.793, SPF_PASS=-0.001] autolearn=disabled

About an hour later, I fed the same message to the spamassassin
executable, while disabling network tests:

# spamassassin -L -t -D < /tmp/msg.txt

Content analysis details: (5.0 points, 5.0 required)

 pts rule name              description
---- ---------------------- --------------------------------------------------
3.8 BAYES_99 BODY: Bayes spam probability is 99 to 100%
[score: 1.0000]
0.0 HTML_MESSAGE BODY: HTML included in message
1.2 RDNS_NONE Delivered to internal network by a host with
no rDNS

To restate the question, if network tests are not outright disabled in
Amavis, why is Amavis returning lower scores than the SA binary does
when called directly with network tests disabled? Shouldn't the SA score
with network tests disabled *always* be lower than or equal to the
Amavis score with network tests enabled (provided that all else is equal)?

Or am I way off-base here?

Thanks again,

-Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 1/10/2013 12:18 PM, Ben Johnson wrote:
> [...]

Upon further consideration, this behavior makes perfect sense if the
mailbox user has moved the message from Inbox to Junk between scans;
Dovecot's Antispam filter is in use on this server. This action would
cause the message tokens to be added to the Bayes database, which
explains why the SA score is higher on subsequent scans, even with
network tests disabled.
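
I suppose I can confirm that by rescanning with only the "bayes" debug
channel enabled and watching the reported probability move between runs:

----------------------------------------------------------------------
# Show just the Bayes-related debug lines for a saved message:
spamassassin -D bayes -t < /tmp/msg.txt 2>&1 | grep -i bayes
----------------------------------------------------------------------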

Sorry... I'm still trying to wrap my head around all of this. Lots of
moving parts.

-Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On Thu, 10 Jan 2013 12:48:07 -0500
Ben Johnson wrote:
> Upon further consideration, this behavior makes perfect sense if the
> mailbox user has moved the message from Inbox to Junk between scans;
> Dovecot's Antispam filter is in use on this server. This action would
> cause the message tokens to be added to the Bayes database, which
> explains why the SA score is higher on subsequent scans, even with
> network tests disabled.

Also, by turning off network tests you switch to a different score set,
so the score for RDNS_NONE rose.
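
Scores in the stock rules can carry four values, one per score set
(the values below are illustrative, not the stock ones):

----------------------------------------------------------------------
# The four columns are score sets 0-3:
# (no Bayes, no net) (net only) (Bayes only) (Bayes + net)
score RDNS_NONE 1.3 1.3 1.2 0.793
----------------------------------------------------------------------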
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 1/10/2013 1:06 PM, RW wrote:
> On Thu, 10 Jan 2013 12:48:07 -0500
> Ben Johnson wrote:
>> Upon further consideration, this behavior makes perfect sense if the
>> mailbox user has moved the message from Inbox to Junk between scans;
>> Dovecot's Antispam filter is in use on this server. This action would
>> cause the message tokens to be added to the Bayes database, which
>> explains why the SA score is higher on subsequent scans, even with
>> network tests disabled.
>
> Also, by turning off network tests you switch to a different score
> set, so the score for RDNS_NONE rose.
>

Ahh; I didn't realize that disabling network tests changes the score set
entirely. Thanks for the clarification there.

So, at this point, I'm struggling to understand how the following happened.

Over the course of 15 minutes, I received the same exact message four
times. Each time, the message was sent to the same recipient mailbox.
The "From" and "Return-Path" headers changed slightly each time, but the
message bodies appear to be identical.

Here are the X-Spam-Status headers for each message:

1:28 PM

Yes, score=7.008 tagged_above=-999 required=2 tests=[BAYES_00=-1.9,
HTML_MESSAGE=0.001, MIME_HTML_ONLY=0.723, RCVD_IN_BRBL_LASTEXT=1.449,
RCVD_IN_CSS=1, RCVD_IN_XBL=0.375, RDNS_NONE=0.793, SPF_PASS=-0.001,
T_LOTS_OF_MONEY=0.01, URIBL_DBL_SPAM=1.7, URIBL_JP_SURBL=1.25,
URIBL_WS_SURBL=1.608] autolearn=disabled

1:35 PM

No, score=-0.374 tagged_above=-999 required=2 tests=[BAYES_00=-1.9,
HTML_MESSAGE=0.001, MIME_HTML_ONLY=0.723, RDNS_NONE=0.793,
SPF_PASS=-0.001, T_LOTS_OF_MONEY=0.01] autolearn=disabled

1:36 PM

Yes, score=7.008 tagged_above=-999 required=2 tests=[BAYES_00=-1.9,
HTML_MESSAGE=0.001, MIME_HTML_ONLY=0.723, RCVD_IN_BRBL_LASTEXT=1.449,
RCVD_IN_CSS=1, RCVD_IN_XBL=0.375, RDNS_NONE=0.793, SPF_PASS=-0.001,
T_LOTS_OF_MONEY=0.01, URIBL_DBL_SPAM=1.7, URIBL_JP_SURBL=1.25,
URIBL_WS_SURBL=1.608] autolearn=disabled

1:41 PM

Yes, score=7.008 tagged_above=-999 required=2 tests=[BAYES_00=-1.9,
HTML_MESSAGE=0.001, MIME_HTML_ONLY=0.723, RCVD_IN_BRBL_LASTEXT=1.449,
RCVD_IN_CSS=1, RCVD_IN_XBL=0.375, RDNS_NONE=0.793, SPF_PASS=-0.001,
T_LOTS_OF_MONEY=0.01, URIBL_DBL_SPAM=1.7, URIBL_JP_SURBL=1.25,
URIBL_WS_SURBL=1.608] autolearn=disabled

Questions:

1.) I have a fairly well-trained Bayes DB; why on earth does a message
with the subject "Cash Quick? Get up to 1500 Now", and an equally
nefarious body, trigger BAYES_00?

2.) Why weren't network tests performed on message 2 of 4? This seems to
be evidence of the fact that network tests are not being performed some
percentage of the time, which could very well be at the root of this
whole problem.

Thanks,

-Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 10-01-13 19:55, Ben Johnson wrote:
> [...]
>
> Questions:
>
> 1.) I have a fairly well-trained Bayes DB; why on earth does a message
> with the subject "Cash Quick? Get up to 1500 Now", and an equally
> nefarious body, trigger BAYES_00?

This will depend solely on the contents of your Bayes DB. Is it shared
between users, etc.? No good answer is possible without looking at it.

> 2.) Why weren't network tests performed on message 2 of 4? This seems to
> be evidence of the fact that network tests are not being performed some
> percentage of the time, which could very well be at the root of this
> whole problem.

The fact that not a single network test was triggered is indeed
suspicious. The DNSBL tests are of course sender-dependent, but if the
body is the same, the URIBL stuff should fire. Maybe your DNS queries
timed out because your DNS setup is borked? Maybe you should
temporarily enable debug logging for DNS lookups in SpamAssassin?

--
Tom
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On Thu, 10 Jan 2013, Ben Johnson wrote:

> So, at this point, I'm struggling to understand how the following happened.
>
> Over the course of 15 minutes, I received the same exact message four
> times. Each time, the message was sent to the same recipient mailbox.
> The "From" and "Return-Path" headers changed slightly each time, but the
> message bodies appear to be identical.
>
> Here are the X-Spam-Status headers for each message:
>
> [...]
>
> Questions:
>
> 1.) I have a fairly well-trained Bayes DB; why on earth does a message
> with the subject "Cash Quick? Get up to 1500 Now", and an equally
> nefarious body, trigger BAYES_00?
>
> 2.) Why weren't network tests performed on message 2 of 4? This seems to
> be evidence of the fact that network tests are not being performed some
> percentage of the time, which could very well be at the root of this
> whole problem.

How many MTAs do you have? Is it possible the low-scoring one went via a
different MTA?

Have you stopped amavisd, killed all of the amavis processes, and
restarted it?


--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhardin@impsec.org FALaholic #11174 pgpk -a jhardin@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
Maxim I: Pillage, _then_ burn.
-----------------------------------------------------------------------
7 days until Benjamin Franklin's 307th Birthday
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On Thu, 10 Jan 2013 13:55:58 -0500
Ben Johnson wrote:

> So, at this point, I'm struggling to understand how the following
> happened.
>
> Over the course of 15 minutes, I received the same exact message four
> times. Each time, the message was sent to the same recipient mailbox.
> The "From" and "Return-Path" headers changed slightly each time, but
> the message bodies appear to be identical.

> 1.) I have a fairly well-trained Bayes DB; why on earth does a message
> with the subject "Cash Quick? Get up to 1500 Now", and an equally
> nefarious body, trigger BAYES_00?

From what you wrote before, the database is trained by end users, so you
can't really be sure that it is well-trained.

> 2.) Why weren't network tests performed on message 2 of 4? This seems
> to be evidence of the fact that network tests are not being performed
> some percentage of the time, which could very well be at the root of
> this whole problem.


It may be that there was some local problem, but there is a simpler
explanation. Are you sure that message 2 has exactly the same IP and
URI as 1, and that it hasn't been delayed with respect to 1? The rest
are in RCVD_IN_CSS, which is a snowshoe spam list, so you'd expect that
early spams from a given IP address won't hit any URI or IP blocklist
at all.
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 1/10/2013 4:12 PM, John Hardin wrote:
> On Thu, 10 Jan 2013, Ben Johnson wrote:
>
>> So, at this point, I'm struggling to understand how the following
>> happened.
>>
>> Over the course of 15 minutes, I received the same exact message four
>> times. Each time, the message was sent to the same recipient mailbox.
>> The "From" and "Return-Path" headers changed slightly each time, but the
>> message bodies appear to be identical.
>>
>> Here are the X-Spam-Status headers for each message:
>>
>> [...]
>>
>> Questions:
>>
>> 1.) I have a fairly well-trained Bayes DB; why on earth does a message
>> with the subject "Cash Quick? Get up to 1500 Now", and an equally
>> nefarious body, trigger BAYES_00?
>>
>> 2.) Why weren't network tests performed on message 2 of 4? This seems to
>> be evidence of the fact that network tests are not being performed some
>> percentage of the time, which could very well be at the root of this
>> whole problem.
>
> How many MTAs do you have? Is it possible the low-scoring one went via a
> different MTA?

Just one; there should be no possibility of that.

> Have you sotpped amavisd, killed all of the amavis processes, and
> restarted it?
>
>

I have now. And I enabled amavis's $sa_debug option, so we should see a
lot more in the way of useful SA debugging information now.
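
For reference, I believe the only change needed was this one line,
which I put in /etc/amavis/conf.d/50-user on this Debian/ISPConfig box:

----------------------------------------------------------------------
# Enable SpamAssassin debug output within amavisd-new:
$sa_debug = 1;
----------------------------------------------------------------------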

In fact, I was just able to capture the output that I believe we're
after, and I'll paste a link in my response to RW's message (shortly
forthcoming).

Thanks,

-Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 1/10/2013 3:13 PM, Tom Hendrikx wrote:
> On 10-01-13 19:55, Ben Johnson wrote:
>> [...]
>>
>> Questions:
>>
>> 1.) I have a fairly well-trained Bayes DB; why on earth does a message
>> with the subject "Cash Quick? Get up to 1500 Now", and an equally
>> nefarious body, trigger BAYES_00?
>
> This will depend solely on the contents of your Bayes DB. Is it shared
> between users, etc.? No good answer is possible without looking at it.

Yes, the Bayes DB is shared between users. But it seems that focusing on
the "low-hanging fruit" (the network test issues) will be more
productive in the short term.

>> 2.) Why weren't network tests performed on message 2 of 4? This seems to
>> be evidence of the fact that network tests are not being performed some
>> percentage of the time, which could very well be at the root of this
>> whole problem.
>
> The fact that not a single network test was triggered is indeed
> suspicious. The DNSBL tests are of course sender-dependent, but if the
> body is the same, the URIBL stuff should fire. Maybe your DNS queries
> timed out because your DNS setup is borked? Maybe you should
> temporarily enable debug logging for DNS lookups in SpamAssassin?
>

I enabled Amavis's SA debugging mode on the server in question and was
able to extract the debug output for two messages that seem like they
should definitely be classified as spam.

Message #1: http://pastebin.com/xLMikNJH

Message #2: http://pastebin.com/Ug78tPrt

A couple points of note and a couple of questions:

a.) There seems to be plenty of network activity, but I don't see any
"results" (for lack of a better term) for those queries. The final
X-Spam-Status header that is generated looks like this:

No, score=1.592 tagged_above=-999 required=2 tests=[BAYES_50=0.8,
RDNS_NONE=0.793, SPF_PASS=-0.001] autolearn=disabled

Does the absence of network tests in the resultant header simply mean
that none of the network tests contributed to the score? If so, why
might that be? Are these messages simply "too new" to appear in any
blacklists?

b.) The scores for both messages are identical, which, I suppose, is not
surprising, given that the same exact tests were performed and produced
the same exact results. Is this normal?

c.) 45 minutes after receiving Message #2 from above, I received a very
similar message. The subjects varied only in the dollar amount
advertised, and the bodies varied only in the hyperlink URLs and the
footer/signature.

Here's the debug output: http://pastebin.com/sLMgXrf5

The second message was scored at 14.75, which seems much better. Of
course, the second score was so much higher because the
network/blacklist tests contributed significantly.

Is the conclusion to be drawn the same as in a) (these messages are "too
new" to appear in blacklists)?

One final point of concern on this item: the Bayes score for the first
of the two emails was BAYES_50=0.8, and I fed the message through
sa-learn as spam shortly after it arrived. Yet, the Bayes score for the
second message was BAYES_40=-0.001 -- *lower* than the first. How could
this be? Is there some rational explanation?

Thanks for all the help here, guys!

-Ben

> --
> Tom
>
>
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 1/11/2013 4:27 PM, Ben Johnson wrote:
> [...]

Nobody?

A clear pattern has emerged: the X-Spam-Status headers for very
obviously spammy messages never contain evidence that network tests
contributed to their SA scores.

Ultimately, I need to know whether:

a.) Network tests are not being run at all for these messages

b.) Network tests are being run, but are failing in some way

c.) Network tests are being run, and are succeeding, but return
responses that do not contribute to the messages' scores

I've had a look at the log entries to which I link in my previous
message and I just need a little help interpreting the "dns" and "async"
messages.
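
For anyone willing to take a look, these are the lines I've been
pulling out of the mail log (path per Debian's syslog layout; adjust
as needed):

----------------------------------------------------------------------
grep -E "dns:|async:" /var/log/mail.log | less
----------------------------------------------------------------------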

Thanks for any insight,

-Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On Mon, 14 Jan 2013 13:24:55 -0500
Ben Johnson wrote:


> A clear pattern has emerged: the X-Spam-Status headers for very
> obviously spammy messages never contain evidence that network tests
> contributed to their SA scores.
>
> Ultimately, I need to know whether:
>
> a.) Network tests are not being run at all for these messages
>
> b.) Network tests are being run, but are failing in some way
>
> c.) Network tests are being run, and are succeeding, but return
> responses that do not contribute to the messages' scores
>
> I've had a look at the log entries to which I link in my previous
> message and I just need a little help interpreting the "dns" and
> "async" messages.

As I said before, it's not unusual for snowshoe spam to hit no net
tests at all. Also obvious spam isn't any more likely to be in a
blocklist than less obvious spam.

However, try adding this to your SpamAssassin configuration, and
restart the appropriate daemon:

header RCVD_IN_HITALL eval:check_rbl('hitall-lastexternal', 'ipv4.fahq2.com.')
tflags RCVD_IN_HITALL net
score RCVD_IN_HITALL 0.001


It should add a DNS test that is hit for all mail delivered from an
IPv4 address.
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 1/14/2013 2:49 PM, RW wrote:
> On Mon, 14 Jan 2013 13:24:55 -0500
> Ben Johnson wrote:
>
>
>> [...]
>
> As I said before, it's not unusual for snowshoe spam to hit no net
> tests at all. Also obvious spam isn't any more likely to be in a
> blocklist than less obvious spam.
>
> However, try adding this to your SpamAssassin configuration, and
> restart the appropriate daemon:
>
> header RCVD_IN_HITALL eval:check_rbl('hitall-lastexternal', 'ipv4.fahq2.com.')
> tflags RCVD_IN_HITALL net
> score RCVD_IN_HITALL 0.001
>
>
> It should add a dns test that is hit for all mail delivered from an
> IPv4 address.
>

Thanks, RW.

I understand that snowshoe spam may not hit any net tests. I guess my
confusion is around what, exactly, classifies spam as "snowshoe".

Are most/all of the BL services hash-based? In other words, if a known
spam message was added yesterday, will it be considered "snowshoe" spam
if the spammer sends the same message today and changes only one
character within the body?

If so, then I guess the only remedy here is to focus on why Bayes seems
to perform so miserably. It must be a configuration issue, because I've
sa-learn-ed messages that are incredibly similar for two days now and
not only do their Bayes scores not change significantly, but sometimes
they decrease. And I have a hard time believing that one of my users is
training these messages as ham and negating my efforts.

I have ensured that the spam token count increases when I train these
messages. That said, I do notice that the token count does not *always*
change; sometimes, sa-learn reports "Learned tokens from 0 message(s) (1
message(s) examined)". Does this mean that all tokens from these
messages have already been learned, thereby making it pointless to
continue feeding them to sa-learn?
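
For reference, the check I've been running each time:

----------------------------------------------------------------------
# Train a message as spam, then confirm the counters moved:
sa-learn --spam /tmp/msg.txt
sa-learn --dump magic | grep -E "nspam|nham"
----------------------------------------------------------------------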

If I receive one more uncaught message about how some mom is angering
doctors by doing something crazy to her face, I'm going to hunt down the
****er and rip her face OFF.

Finally, I added the test you supplied to my SA configuration, restarted
Amavis, and all messages appear to be tagged with RCVD_IN_HITALL=0.001.

Thanks for all your help,

-Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 2013/01/14 10:24, Ben Johnson wrote:
>
>
> [...]
>
> Nobody?
>
> A clear pattern has emerged: the X-Spam-Status headers for very
> obviously spammy messages never contain evidence that network tests
> contributed to their SA scores.
>
> Ultimately, I need to know whether:
>
> a.) Network tests are not being run at all for these messages
>
> b.) Network tests are being run, but are failing in some way
>
> c.) Network tests are being run, and are succeeding, but return
> responses that do not contribute to the messages' scores
>
> I've had a look at the log entries to which I link in my previous
> message and I just need a little help interpreting the "dns" and "async"
> messages.

Ben, do be aware that sometimes you draw the short straw and sit at the
very start of the spam distribution cycle. In those cases the BLs will
generally not have been alerted yet, so they may not trigger. For those
situations, rules should be your friends. (I still use my treasured set
of SARE rules and personally hand-crafted rules my partner and I have
created that fit OUR needs but may not be good general-purpose rules.)

{^_^}
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 2013/01/14 12:59, Ben Johnson wrote:
> [...]
>
> If I receive one more uncaught message about how some mom is angering
> doctors by doing something crazy to her face, I'm going to hunt down the
> ****er and rip her face OFF.
>
> Finally, I added the test you supplied to my SA configuration, restarted
> Amavis, and all messages appear to be tagged with RCVD_IN_HITALL=0.001.

As much as I might applaud that sentiment, I'd like to note two things.
First, it might involve a whole lot of nasty paperwork and unpleasant
contact with authorities. Second, the energy wasted doing that might be
better spent learning how to create rules and how to recognize the
elements of a spam that are likely to be relatively unique, so that you
can create rules for it.

After a while, creating rules to knock down such "stuff" can become fun.
(Then after a longer while it gets "old", sigh.)

Another thing to learn in the process is that what you consider to be
spam is another person's (jerk's?) ham. So crafting rules needs to be
done with care if you're filtering for more than one person. Erm, of
course this is what allowing per user rules is good for.

{^_^}
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 1/14/2013 2:59 PM, Ben Johnson wrote:

> I understand that snowshoe spam may not hit any net tests. I guess my
> confusion is around what, exactly, classifies spam as "snowshoe".

Snowshoe spam - spreading a spam run across a large number of IPs so
no single IP is sending a large volume. Typically also combined
with "natural language" text, RFC compliant mail servers, verified
SPF and DKIM, business-class ISP with FCrDNS, and every other
criteria to look like a legit mail source. This type of spam is
difficult to catch.

http://www.spamhaus.org/faq/section/Glossary#233
and countless other links if you ask google.

> Are most/all of the BL services hash-based? In other words, if a known
> spam message was added yesterday, will it be considered "snowshoe" spam
> if the spammer sends the same message today and changes only one
> character within the body?

No, most all DNS blacklists are based on IP reputation. Check each
list's website for their listing policy to see how an IP gets on
their list; generally honeypot email addresses or trusted user
reports. Most lists require some number of reports before listing
an IP to prevent false positives; snowshoe spammers take advantage
of this.

> If so, then I guess the only remedy here is to focus on why Bayes seems
> to perform so miserably.

Sounds as if your bayes has been improperly trained in the past.
You might do better to just delete the bayes db and start over with
hand-picked spam and ham.



-- Noel Jones
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On Mon, 14 Jan 2013, Ben Johnson wrote:

> I understand that snowshoe spam may not hit any net tests. I guess my
> confusion is around what, exactly, classifies spam as "snowshoe".

http://www.spamhaus.org/faq/section/Glossary

Basically, a large number of spambots sending the message so that no one
sending IP can be easily tagged as evil.

Question: do you have any SMTP-time hard-reject DNSBL tests in place? Or
are they all performed by SA?

Recommendation: consider using the Spamhaus ZEN DNSBL as a hard-reject
SMTP-time DNS check in your MTA. It is well-respected and very reliable.
One thing it includes is ranges of IP addresses that should not ever be
sending email, so it may help reduce snowshoe spam.

http://www.spamhaus.org/zen/

Another tactic that many report good results from is Greylisting. Do you
have greylisting in place? Does your userbase demand no delays in mail
delivery? In addition to blocking spam from spambots that do not retry, it
can delay mail enough for the BLs to get a chance to list new IPs/domains,
which can reduce the leakage if you happen to be at the leading edge of a
new delivery campaign.

http://www.greylisting.org/

> Are most/all of the BL services hash-based?

Generally:

DNSBL: Blacklist of IP addresses
URIBL: Blacklist of domain and host names appearing in URIs
EMAILBL: (not widely used) Blacklist of email addresses (e.g.
phishing response addresses)
Razor, Pyzor: Blacklist of message content checksums/hashes

> In other words, if a known spam message was added yesterday, will it be
> considered "snowshoe" spam if the spammer sends the same message today
> and changes only one character within the body?

No, the diverse IP addresses are the hallmark of "snowshoe", not so much
the specific message content. If you see identical or generally-similar
(e.g.) pharma spam coming from a wide range of different IP addresses,
that's snowshoe.

> If so, then I guess the only remedy here is to focus on why Bayes seems
> to perform so miserably.

Agreed.

> It must be a configuration issue, because I've sa-learn-ed messages that
> are incredibly similar for two days now and not only do their Bayes
> scores not change significantly, but sometimes they decrease. And I have
> a hard time believing that one of my users is sa-train-ing these
> messages as ham and negating my efforts.

This is why you retain your Bayes training corpora: so that if Bayes goes
off the rails you can review your corpora for misclassifications, wipe and
retrain. Do you have your training corpora? Or do you discard messages
once you've trained them?

_Do_ you allow your users to train Bayes? Do they do so unsupervised or do
you review their submissions? And if the process is automated, do you
retain what they have provided for training so that you can go back later
and do a troubleshooting review?

Do you have autolearn turned on? My opinion is that autolearn is only
appropriate for a large and very diverse userbase where a sufficiently
"common" corpus of ham can't be manually collected. but then, I don't
admin a Really Large Install, so YMMV.

Do you use per-user or sitewide Bayes? If per-user, then you need to make
sure that you're training Bayes as the same user that the MTA is running
SA as.

What user does your MTA run SA as? What user do you train Bayes as?

One possibility is that the MTA is running SA as a different user than you
are training Bayes as, and you have autolearn turned on, and Bayes has
been running in its own little world since day one regardless of what you
think you're telling it to do.

> I have ensured that the spam token count increases when I train these
> messages. That said, I do notice that the token count does not *always*
> change; sometimes, sa-learn reports "Learned tokens from 0 message(s) (1
> message(s) examined)". Does this mean that all tokens from these
> messages have already been learned, thereby making it pointless to
> continue feeding them to sa-learn?

No, it means that Message-ID has been learned from before.

> Finally, I added the test you supplied to my SA configuration, restarted
> Amavis, and all messages appear to be tagged with RCVD_IN_HITALL=0.001.

So this proves DNS lookups are indeed working for all messages.

--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhardin@impsec.org FALaholic #11174 pgpk -a jhardin@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
One death is a tragedy; thirty is a media sensation;
a million is a statistic. -- Joseph Stalin, modernized
-----------------------------------------------------------------------
3 days until Benjamin Franklin's 307th Birthday
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 1/14/2013 7:48 PM, Noel wrote:
> On 1/14/2013 2:59 PM, Ben Johnson wrote:
>
>> I understand that snowshoe spam may not hit any net tests. I guess my
>> confusion is around what, exactly, classifies spam as "snowshoe".
>
> Snowshoe spam - spreading a spam run across a large number of IPs so
> no single IP is sending a large volume. Typically also combined
> with "natural language" text, RFC compliant mail servers, verified
> SPF and DKIM, business-class ISP with FCrDNS, and every other
> criteria to look like a legit mail source. This type of spam is
> difficult to catch.
>
> http://www.spamhaus.org/faq/section/Glossary#233
> and countless other links if you ask google.
>
>> Are most/all of the BL services hash-based? In other words, if a known
>> spam message was added yesterday, will it be considered "snowshoe" spam
>> if the spammer sends the same message today and changes only one
>> character within the body?
>
> No, most all DNS blacklists are based on IP reputation. Check each
> list's website for their listing policy to see how an IP gets on
> their list; generally honeypot email addresses or trusted user
> reports. Most lists require some number of reports before listing
> an IP to prevent false positives; snowshoe spammers take advantage
> of this.
>
>> If so, then I guess the only remedy here is to focus on why Bayes seems
>> to perform so miserably.
>
> Sounds as if your bayes has been improperly trained in the past.
> You might do better to just delete the bayes db and start over with
> hand-picked spam and ham.
>
>
>
> -- Noel Jones
>

jdow, Noel, and John, I can't thank you enough for your very thorough
responses. Your time is valuable and I sincerely appreciate your
willingness to help.

John, I'll respond to you separately, for the sake of keeping this
organized.

> Ben, do be aware that sometimes you draw the short straw and sit at the
> very start of the spam distribution cycle. In those cases the BLs will
> generally not have been alerted yet so they may not trigger. For those
> situations the rules should be your friends. (I still use my treasured
> set of SARE rules and personally hand crafted rules my partner and I
> have created that fit OUR needs but may not be good general purpose
> rules.)

This makes perfect sense and underscores the importance of a
finely-tuned rule-set. It's become apparent just how dynamic and capable
a monster the spam industry is. No one approach will ever be a panacea,
it seems.

The advice from your second email is well-received, too. Especially the
part about not killing anybody. ;) I do hope fighting spam becomes fun
for me, because so far, it's been an uphill battle! Hehe.

Noel, thanks for excellent responses to my questions.

> Sounds as if your bayes has been improperly trained in the past.
> You might do better to just delete the bayes db and start over with
> hand-picked spam and ham.

I hope not, because this is my second go-round with the Bayes DB. The
first time (as Mr. Hardin may remember), auto-learning was enabled
out-of-the-box and some misconfiguration or another (seemingly related
to DNSWL_* rules) caused a lot of spam to be learned as ham. With John's
help, I corrected the issues (I hope), which I'll detail in my reply to
John.

Thanks again,

-Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 1/14/2013 8:16 PM, John Hardin wrote:
> On Mon, 14 Jan 2013, Ben Johnson wrote:
>
>> I understand that snowshoe spam may not hit any net tests. I guess my
>> confusion is around what, exactly, classifies spam as "snowshoe".
>
> http://www.spamhaus.org/faq/section/Glossary
>
> Basically, a large number of spambots sending the message so that no one
> sending IP can be easily tagged as evil.
>
> Question: do you have any SMTP-time hard-reject DNSBL tests in place? Or
> are they all performed by SA?

In postfix's main.cf:

smtpd_recipient_restrictions = permit_mynetworks,
    permit_sasl_authenticated,
    check_recipient_access mysql:/etc/postfix/mysql-virtual_recipient.cf,
    reject_unauth_destination,
    reject_rbl_client bl.spamcop.net

Do you recommend something more?

> Recommendation: consider using the Spamhaus ZEN DNSBL as a hard-reject
> SMTP-time DNS check in your MTA. It is well-respected and very reliable.
> One thing it includes is ranges of IP addresses that should not ever be
> sending email, so it may help reduce snowshoe spam.
>
> http://www.spamhaus.org/zen/

This article looks to be pretty thorough:

http://www.cyberciti.biz/faq/howto-configure-postfix-dnsrbls-under-linux-unix/

I'll add Spamhaus ZEN and a few others to the list.
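
For the archives, what I have in mind is roughly this (a sketch, untested;
I'll trim the list once I've read up on each entry):

----------------------------------------------------------------------
smtpd_recipient_restrictions = permit_mynetworks,
    permit_sasl_authenticated,
    check_recipient_access mysql:/etc/postfix/mysql-virtual_recipient.cf,
    reject_unauth_destination,
    reject_rbl_client zen.spamhaus.org,
    reject_rbl_client bl.spamcop.net
----------------------------------------------------------------------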

> Another tactic that many report good results from is Greylisting. Do you
> have greylisting in place? Does your userbase demand no delays in mail
> delivery? In addition to blocking spam from spambots that do not retry,
> it can delay mail enough for the BLs to get a chance to list new
> IPs/domains, which can reduce the leakage if you happen to be at the
> leading edge of a new delivery campaign.
>
> http://www.greylisting.org/

Hmm, very interesting. No, I have no greylisting in place as yet, and
no, my userbase doesn't demand immediate delivery. I will look into
greylisting further.
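
As a note to self: with Postfix, postgrey seems to be the usual route. If
I'm reading its docs correctly, it boils down to running the postgrey
policy daemon and adding a single check (sketch, assuming postgrey's
default port of 10023):

----------------------------------------------------------------------
smtpd_recipient_restrictions =
    ...
    reject_unauth_destination,
    check_policy_service inet:127.0.0.1:10023
----------------------------------------------------------------------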

>> Are most/all of the BL services hash-based?
>
> Generally:
>
> DNSBL: Blacklist of IP addresses
> URIBL: Blacklist of domain and host names appearing in URIs
> EMAILBL: (not widely used) Blacklist of email addresses (e.g.
> phishing response addresses)
> Razor, Pyzor: Blacklist of message content checksums/hashes

Perfect; that answers my question.

>> In other words, if a known spam message was added yesterday, will it
>> be considered "snowshoe" spam if the spammer sends the same message
>> today and changes only one character within the body?
>
> No, the diverse IP addresses are the hallmark of "snowshoe", not so much
> the specific message content. If you see identical or generally-similar
> (e.g.) pharma spam coming from a wide range of different IP addresses,
> that's snowshoe.

I see. Given this information, it concerns me that Bayes scores hardly
seem to budge when I feed sa-learn nearly identical messages 3+ times.
We'll get into that below.

>> If so, then I guess the only remedy here is to focus on why Bayes seems
>> to perform so miserably.
>
> Agreed.
>
>> It must be a configuration issue, because I've sa-learn-ed messages
>> that are incredibly similar for two days now and not only do their
>> Bayes scores not change significantly, but sometimes they decrease.
>> And I have a hard time believing that one of my users is sa-train-ing
>> these messages as ham and negating my efforts.
>
> This is why you retain your Bayes training corpora: so that if Bayes
> goes off the rails you can review your corpora for misclassifications,
> wipe and retrain. Do you have your training corpora? Or do you discard
> messages once you've trained them?

I had the good sense to retain the corpora.

> _Do_ you allow your users to train Bayes? Do they do so unsupervised or
> do you review their submissions? And if the process is automated, do you
> retain what they have provided for training so that you can go back
> later and do a troubleshooting review?

Yes, users are allowed to train Bayes, via Dovecot's Antispam plug-in.
They do so unsupervised. Why this could be a problem is obvious. And no,
I don't retain their submissions. I probably should. I wonder if I can
make a few slight modifications to the shell script that Antispam calls,
such that it simply sends a copy of the message to an administrator
rather than calling sa-learn on the message.
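
Something like this wrapper is what I'm picturing (a rough sketch; the
paths, and the assumption that the plug-in passes --spam/--ham and pipes
the message on stdin, would all need to be verified against my actual
Antispam configuration):

----------------------------------------------------------------------
#!/bin/sh
# sa-learn-wrapper.sh - capture a copy of every training submission
# before feeding it to sa-learn (message arrives on stdin; assumed).
CORPUS=/var/lib/sa-corpus        # assumed path
case "$1" in                     # assumed: plug-in passes --spam/--ham
    --spam) DIR=$CORPUS/spam ;;
    --ham)  DIR=$CORPUS/ham  ;;
    *)      exit 64 ;;
esac
mkdir -p "$DIR"
FILE=$DIR/$(date +%s).$$.eml     # unique-ish filename
cat > "$FILE"                    # save the message for later review
exec sa-learn "$1" < "$FILE"     # drop this line to review-then-train
----------------------------------------------------------------------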

> Do you have autolearn turned on? My opinion is that autolearn is only
> appropriate for a large and very diverse userbase where a sufficiently
> "common" corpus of ham can't be manually collected. but then, I don't
> admin a Really Large Install, so YMMV.

No, I was sure to disable autolearn after the last Bayes fiasco. :)

> Do you use per-user or sitewide Bayes? If per-user, then you need to
> make sure that you're training Bayes as the same user that the MTA is
> running SA as.

Site-wide. And I have hard-coded the username in the SA configuration to
prevent confusion in this regard:

bayes_sql_override_username amavis

> What user does your MTA run SA as? What user do you train Bayes as?

The MTA should pass scanning off to "amavis". I train the DB in two
ways: via Dovecot Antispam and by calling sa-learn on my training
mailbox. Given that I have hard-coded the username, the output of
"sa-learn --dump magic" is the same whether I issue the command under my
own account or "su" to the "amavis" user.

> One possibility is that the MTA is running SA as a different user than
> you are training Bayes as, and you have autolearn turned on, and Bayes
> has been running in its own little world since day one regardless of
> what you think you're telling it to do.

That is what happened last year. I hope to have eliminated those issues
this time around. (I dumped the old DB and started over after that
debacle.) The X-Spam-Status header always displays "autolearn=disabled".

>> I have ensured that the spam token count increases when I train these
>> messages. That said, I do notice that the token count does not *always*
>> change; sometimes, sa-learn reports "Learned tokens from 0 message(s) (1
>> message(s) examined)". Does this mean that all tokens from these
>> messages have already been learned, thereby making it pointless to
>> continue feeding them to sa-learn?
>
> No, it means that Message-ID has been learned from before.

I see. So, when this happens, it means that one of my users has already
dragged the message from Inbox to Junk (which triggers the Antispam
plug-in and feeds the message to sa-learn).

When this scenario occurs, my efforts in feeding the same message to
sa-learn are wasted, right? Bayes doesn't "learn more" from the message
the second time, or increase its tokens' "weight", right? It would be
nice if I could eliminate this duplicate effort.

>> Finally, I added the test you supplied to my SA configuration, restarted
>> Amavis, and all messages appear to be tagged with RCVD_IN_HITALL=0.001.
>
> So this proves DNS lookups are indeed working for all messages.
>

Okay, good to know. I think we're "all clear" in the DNS/network test
department.

Based on my responses, what's the next move? Backup the Bayes DB, wipe
it, and feed my corpus through the ol' chipper?
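
(For my own reference, I imagine the procedure looks roughly like this;
the corpus paths are placeholders:)

----------------------------------------------------------------------
sa-learn --backup > /root/bayes-backup.txt   # dump the current DB
sa-learn --clear                             # wipe it
sa-learn --spam /path/to/corpus/spam/        # retrain from my corpora
sa-learn --ham /path/to/corpus/ham/
sa-learn --dump magic                        # sanity-check the counts
----------------------------------------------------------------------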

Thanks again!

-Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On Tue, 15 Jan 2013, Ben Johnson wrote:

> On 1/14/2013 8:16 PM, John Hardin wrote:
>> On Mon, 14 Jan 2013, Ben Johnson wrote:
>>
>> Question: do you have any SMTP-time hard-reject DNSBL tests in place? Or
>> are they all performed by SA?
>
> In postfix's main.cf:
>
> smtpd_recipient_restrictions = permit_mynetworks,
> permit_sasl_authenticated, check_recipient_access
> mysql:/etc/postfix/mysql-virtual_recipient.cf,
> reject_unauth_destination, reject_rbl_client bl.spamcop.net
>
> Do you recommend something more?

Unfortunately I have no experience administering Postfix. Perhaps one of
the other listies can help.

>> http://www.greylisting.org/
>
> Hmm, very interesting. No, I have no greylisting in place as yet, and
> no, my userbase doesn't demand immediate delivery. I will look into
> greylisting further.

One other thing you might try is publishing an SPF record for your domain.
There is anecdotal evidence that this reduces the raw spam volume to that
domain a bit.

> Given this information, it concerns me that Bayes scores hardly seem to
> budge when I feed sa-learn nearly identical messages 3+ times. We'll get
> into that below.
>
>>> If so, then I guess the only remedy here is to focus on why Bayes seems
>>> to perform so miserably.
>>
>> Agreed.
>>
>>> It must be a configuration issue, because I've sa-learn-ed messages
>>> that are incredibly similar for two days now and not only do their
>>> Bayes scores not change significantly, but sometimes they decrease.
>>> And I have a hard time believing that one of my users is sa-train-ing
>>> these messages as ham and negating my efforts.
>>
>> This is why you retain your Bayes training corpora: so that if Bayes
>> goes off the rails you can review your corpora for misclassifications,
>> wipe and retrain. Do you have your training corpora? Or do you discard
>> messages once you've trained them?
>
> I had the good sense to retain the corpora.

Yay!

>> _Do_ you allow your users to train Bayes? Do they do so unsupervised or
>> do you review their submissions? And if the process is automated, do you
>> retain what they have provided for training so that you can go back
>> later and do a troubleshooting review?
>
> Yes, users are allowed to train Bayes, via Dovecot's Antispam plug-in.
> They do so unsupervised. Why this could be a problem is obvious. And no,
> I don't retain their submissions. I probably should. I wonder if I can
> make a few slight modifications to the shell script that Antispam calls,
> such that it simply sends a copy of the message to an administrator
> rather than calling sa-learn on the message.

That would be a very good idea if the number of users doing training is
small. At the very least, the messages should be captured to a permanent
corpus mailbox.

Do your users also train ham? Are the procedures similar enough that your
users could become easily confused?

>> Do you have autolearn turned on? My opinion is that autolearn is only
>> appropriate for a large and very diverse userbase where a sufficiently
>> "common" corpus of ham can't be manually collected. but then, I don't
>> admin a Really Large Install, so YMMV.
>
> No, I was sure to disable autolearn after the last Bayes fiasco. :)

OK.

>> Do you use per-user or sitewide Bayes? If per-user, then you need to
>> make sure that you're training Bayes as the same user that the MTA is
>> running SA as.
>
> Site-wide. And I have hard-coded the username in the SA configuration to
> prevent confusion in this regard:
>
> bayes_sql_override_username amavis
>
>> What user does your MTA run SA as? What user do you train Bayes as?
>
> The MTA should pass scanning off to "amavis". I train the DB in two
> ways: via Dovecot Antispam and by calling sa-learn on my training
> mailbox. Given that I have hard-coded the username, the output of
> "sa-learn --dump magic" is the same whether I issue the command under my
> own account or "su" to the "amavis" user.

OK, good.

>>> I have ensured that the spam token count increases when I train these
>>> messages. That said, I do notice that the token count does not *always*
>>> change; sometimes, sa-learn reports "Learned tokens from 0 message(s) (1
>>> message(s) examined)". Does this mean that all tokens from these
>>> messages have already been learned, thereby making it pointless to
>>> continue feeding them to sa-learn?
>>
>> No, it means that Message-ID has been learned from before.
>
> I see. So, when this happens, it means that one of my users has already
> dragged the message from Inbox to Junk (which triggers the Antispam
> plug-in and feeds the message to sa-learn).

Very likely.

The extremely odd thing is that you say you sometimes train a message as
spam, and its Bayes score goes *down*. Are you training a message and
then running it through spamc to see if the score changed, or is this
about _similar_ messages rather than _that_ message?

> When this scenario occurs, my efforts in feeding the same message to
> sa-learn are wasted, right? Bayes doesn't "learn more" from the message
> the second time, or increase its tokens' "weight", right? It would be
> nice if I could eliminate this duplicate effort.

Correct, no new information is learned.

> Based on my responses, what's the next move? Backup the Bayes DB, wipe
> it, and feed my corpus through the ol' chipper?

That, and configure the user-based training to at the very least capture
what they submit to a corpus so you can review it. Whether you do that
review pre-training or post-bayes-is-insane is up to you.

--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhardin@impsec.org FALaholic #11174 pgpk -a jhardin@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
The difference is that Unix has had thirty years of technical
types demanding basic functionality of it. And the Macintosh has
had fifteen years of interface fascist users shaping its progress.
Windows has the hairpin turns of the Microsoft marketing machine
and that's all. -- Red Drag Diva
-----------------------------------------------------------------------
2 days until Benjamin Franklin's 307th Birthday
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 1/15/2013 1:55 PM, John Hardin wrote:
> On Tue, 15 Jan 2013, Ben Johnson wrote:
>
>> On 1/14/2013 8:16 PM, John Hardin wrote:
>>> On Mon, 14 Jan 2013, Ben Johnson wrote:
>>>
>>> Question: do you have any SMTP-time hard-reject DNSBL tests in place? Or
>>> are they all performed by SA?
>>
>> In postfix's main.cf:
>>
>> smtpd_recipient_restrictions = permit_mynetworks,
>> permit_sasl_authenticated, check_recipient_access
>> mysql:/etc/postfix/mysql-virtual_recipient.cf,
>> reject_unauth_destination, reject_rbl_client bl.spamcop.net
>>
>> Do you recommend something more?
>
> Unfortunately I have no experience administering Postfix. Perhaps one of
> the other listies can help.

Wow! Adding several more reject_rbl_client entries to the
smtpd_recipient_restrictions directive in the Postfix configuration
seems to be having a tremendous impact. The amount of spam coming
through has dropped by 90% or more. This was a HUGELY helpful
suggestion, John!

>>> http://www.greylisting.org/
>>
>> Hmm, very interesting. No, I have no greylisting in place as yet, and
>> no, my userbase doesn't demand immediate delivery. I will look into
>> greylisting further.
>
> One other thing you might try is publishing an SPF record for your
> domain. There is anecdotal evidence that this reduces the raw spam
> volume to that domain a bit.

We do publish SPF records for the domains within our control. The need
to do this arose when senderbase.org, et al., began blacklisting
domains without SPF records. So, we're good there.

>> Given this information, it concerns me that Bayes scores hardly seem
>> to budge when I feed sa-learn nearly identical messages 3+ times.
>> We'll get into that below.
>>
>>>> If so, then I guess the only remedy here is to focus on why Bayes seems
>>>> to perform so miserably.
>>>
>>> Agreed.
>>>
>>>> It must be a configuration issue, because I've sa-learn-ed messages
>>>> that are incredibly similar for two days now and not only do their
>>>> Bayes scores not change significantly, but sometimes they decrease.
>>>> And I have a hard time believing that one of my users is sa-train-ing
>>>> these messages as ham and negating my efforts.
>>>
>>> This is why you retain your Bayes training corpora: so that if Bayes
>>> goes off the rails you can review your corpora for misclassifications,
>>> wipe and retrain. Do you have your training corpora? Or do you discard
>>> messages once you've trained them?
>>
>> I had the good sense to retain the corpora.
>
> Yay!
>
>>> _Do_ you allow your users to train Bayes? Do they do so unsupervised or
>>> do you review their submissions? And if the process is automated, do you
>>> retain what they have provided for training so that you can go back
>>> later and do a troubleshooting review?
>>
>> Yes, users are allowed to train Bayes, via Dovecot's Antispam plug-in.
>> They do so unsupervised. Why this could be a problem is obvious. And no,
>> I don't retain their submissions. I probably should. I wonder if I can
>> make a few slight modifications to the shell script that Antispam calls,
>> such that it simply sends a copy of the message to an administrator
>> rather than calling sa-learn on the message.
>
> That would be a very good idea if the number of users doing training is
> small. At the very least, the messages should be captured to a permanent
> corpus mailbox.

Good idea! I'll see if I can set this up.

> Do your users also train ham? Are the procedures similar enough that
> your users could become easily confused?

They do. The procedure is implemented via Dovecot's Antispam plug-in.
Basically, moving mail from Inbox to Junk trains it as spam, and moving
mail from Junk to Inbox trains it as ham. I really like this setup
(Antispam + calling SA through Amavis [i.e. not using spamd]) because
the results are effective immediately, which seems to be crucial for
combating this snowshoe spam (performance and scalability aside).

I don't find that procedure to be confusing, but people are different, I
suppose.

>>> Do you have autolearn turned on? My opinion is that autolearn is only
>>> appropriate for a large and very diverse userbase where a sufficiently
>>> "common" corpus of ham can't be manually collected. but then, I don't
>>> admin a Really Large Install, so YMMV.
>>
>> No, I was sure to disable autolearn after the last Bayes fiasco. :)
>
> OK.
>
>>> Do you use per-user or sitewide Bayes? If per-user, then you need to
>>> make sure that you're training Bayes as the same user that the MTA is
>>> running SA as.
>>
>> Site-wide. And I have hard-coded the username in the SA configuration to
>> prevent confusion in this regard:
>>
>> bayes_sql_override_username amavis
>>
>>> What user does your MTA run SA as? What user do you train Bayes as?
>>
>> The MTA should pass scanning off to "amavis". I train the DB in two
>> ways: via Dovecot Antispam and by calling sa-learn on my training
>> mailbox. Given that I have hard-coded the username, the output of
>> "sa-learn --dump magic" is the same whether I issue the command under my
>> own account or "su" to the "amavis" user.
>
> OK, good.
>
>>>> I have ensured that the spam token count increases when I train these
>>>> messages. That said, I do notice that the token count does not *always*
>>>> change; sometimes, sa-learn reports "Learned tokens from 0
>>>> message(s) (1
>>>> message(s) examined)". Does this mean that all tokens from these
>>>> messages have already been learned, thereby making it pointless to
>>>> continue feeding them to sa-learn?
>>>
>>> No, it means that Message-ID has been learned from before.
>>
>> I see. So, when this happens, it means that one of my users has already
>> dragged the message from Inbox to Junk (which triggers the Antispam
>> plug-in and feeds the message to sa-learn).
>
> Very likely.
>
> The extremely odd thing is that you say you sometimes train a message as
> spam, and its Bayes score goes *down*. Are you training a message and
>> then running it through spamc to see if the score changed, or is this
> about _similar_ messages rather than _that_ message?

Sorry for the ambiguity. This is about *similar* messages. Identical
messages, at least visually speaking (I realize that there is a lot more
to it than the visual component). For example, yesterday, I saw several
Canadian Pharmacy emails, all of which were identical with respect to
appearance. I classified each as spam, yet the Bayes score didn't budge
more than a few percent for the first three, and went *down* for the 4th.

I have to assume that while the messages (HTML-formatted) *appear* to be
identical, the underlying code has some pseudo-random element that is
designed very specifically to throw Bayes classifiers.

Out of curiosity, does the Bayes engine (or some other element of
SpamAssassin) have the ability to "see" rendered HTML messages, by
appearance, and not by source code? If it could, it would be far more
effective, it seems.

>> When this scenario occurs, my efforts in feeding the same message to
>> sa-learn are wasted, right? Bayes doesn't "learn more" from the message
>> the second time, or increase its tokens' "weight", right? It would be
>> nice if I could eliminate this duplicate effort.
>
> Correct, no new information is learned.
>
>> Based on my responses, what's the next move? Backup the Bayes DB, wipe
>> it, and feed my corpus through the ol' chipper?
>
> That, and configure the user-based training to at the very least capture
> what they submit to a corpus so you can review it. Whether you do that
> review pre-training or post-bayes-is-insane is up to you.
>

Right, right, that makes sense. I hope I can modify the Antispam plug-in
to accommodate this requirement.

Well, I can't thank you enough here, John and everyone else. I seem to
be on the right track; all is not lost.

That said, it seems clear that SA is nowhere near as effective as it can
be when an off-the-shelf configuration is used (and without configuring
the MTA to do some of the blocking).

I'll keep the list posted (pardon the pun) with regard to configuring
Antispam to fire-off a copy of any message that is submitted for
training. Ideally, whether the message is reviewed before or after
sa-learn is called will be configurable.

-Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
One final question on this subject (sorry...).

Is there value in training Bayes on messages that SA classified as spam
*due to other test scores*? In other words, if a message is classified
as SPAM due to a block-list test, but the message is new enough for
Bayes to assign a zero score, should that message be kept and fed to
sa-learn so that Bayes can soak-up all the tokens from a message that is
almost certainly spam (based on the other tests)?

Am I making any sense?

Thanks again!

-Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 1/15/2013 3:47 PM, Ben Johnson wrote:
> One final question on this subject (sorry...).
>
> Is there value in training Bayes on messages that SA classified as spam
> *due to other test scores*? In other words, if a message is classified
> as SPAM due to a block-list test, but the message is new enough for
> Bayes to assign a zero score, should that message be kept and fed to
> sa-learn so that Bayes can soak-up all the tokens from a message that is
> almost certainly spam (based on the other tests)?
>
> Am I making any sense?

It is always worthwhile to train Bayes. In an ideal world, you would
hand-sort and train every email that comes through your system. The
more mail Bayes sees the more accurate it can be.

--
Bowie
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 1/15/2013 4:05 PM, Bowie Bailey wrote:
> On 1/15/2013 3:47 PM, Ben Johnson wrote:
>> One final question on this subject (sorry...).
>>
>> Is there value in training Bayes on messages that SA classified as spam
>> *due to other test scores*? In other words, if a message is classified
>> as SPAM due to a block-list test, but the message is new enough for
>> Bayes to assign a zero score, should that message be kept and fed to
>> sa-learn so that Bayes can soak-up all the tokens from a message that is
>> almost certainly spam (based on the other tests)?
>>
>> Am I making any sense?
>
> It is always worthwhile to train Bayes. In an ideal world, you would
> hand-sort and train every email that comes through your system. The
> more mail Bayes sees the more accurate it can be.
>

Thanks, Bowie. Given your response, would it then be prudent to call
"sa-learn --spam" on any message that *other tests* (non-Bayes tests)
determine to be spam (given some score threshold)?

The crux of my question/point is that I don't want to have to feed
messages that Bayes "misses" but that other tests identify *correctly*
as spam to "sa-learn --spam".

Is there value in implementing something like this? Or is there some
caveat that would make doing so self-defeating?

Thanks a bunch,

-Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 1/15/2013 4:27 PM, Ben Johnson wrote:
> On 1/15/2013 4:05 PM, Bowie Bailey wrote:
>> On 1/15/2013 3:47 PM, Ben Johnson wrote:
>>> One final question on this subject (sorry...).
>>>
>>> Is there value in training Bayes on messages that SA classified as spam
>>> *due to other test scores*? In other words, if a message is classified
>>> as SPAM due to a block-list test, but the message is new enough for
>>> Bayes to assign a zero score, should that message be kept and fed to
>>> sa-learn so that Bayes can soak-up all the tokens from a message that is
>>> almost certainly spam (based on the other tests)?
>>>
>>> Am I making any sense?
>> It is always worthwhile to train Bayes. In an ideal world, you would
>> hand-sort and train every email that comes through your system. The
>> more mail Bayes sees the more accurate it can be.
>>
> Thanks, Bowie. Given your response, would it then be prudent to call
> "sa-learn --spam" on any message that *other tests* (non-Bayes tests)
> determine to be spam (given some score threshold)?

That is exactly what the autolearn setting does. I let my system run
with the default autolearn settings. Some people adjust the thresholds
and some people prefer to turn off autolearn and do purely manual training.
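
For reference, the knobs live in local.cf and look like this (these are
the stock defaults in 3.3.x, as I recall; check the
Mail::SpamAssassin::Conf docs for your version):

----------------------------------------------------------------------
bayes_auto_learn 1
bayes_auto_learn_threshold_spam 12.0
bayes_auto_learn_threshold_nonspam 0.1
----------------------------------------------------------------------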

> The crux of my question/point is that I don't want to have to feed
> messages that Bayes "misses" but that other tests identify *correctly*
> as spam to "sa-learn --spam".

At one point, I had a script running on my server that looked for
messages that were marked as spam with a low Bayes rating (BAYES_00 to
BAYES_40) or messages marked as ham with a high Bayes rating (BAYES_60
to BAYES_99). I was then able to check the messages and learn them
properly. This let me learn from the edge cases that were not being
scored properly by Bayes while still making it to the correct folder due
to other rules.

If you do this, you MUST check the messages yourself prior to learning
since there is no other way to know whether they should be learned as
ham or spam.
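
Something with the same effect can be had from a couple of greps. This is
a sketch rather than my original script; the maildir paths are
placeholders, and it assumes the X-Spam-Status header fits on one line
(wrapped headers will defeat it):

----------------------------------------------------------------------
#!/bin/sh
# Spam-tagged messages whose Bayes score was low:
grep -lE 'X-Spam-Status: Yes.*BAYES_(00|05|20|40)' /path/to/Junk/cur/*
# Ham-tagged messages whose Bayes score was high:
grep -lE 'X-Spam-Status: No.*BAYES_(60|80|95|99)' /path/to/INBOX/cur/*
----------------------------------------------------------------------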

> Is there value in implementing something like this? Or is there some
> caveat that would make doing so self-defeating?

I find that Bayes autolearn works quite well for me, but others have had
problems with it.

--
Bowie
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 1/15/2013 4:39 PM, Bowie Bailey wrote:
> On 1/15/2013 4:27 PM, Ben Johnson wrote:
>> On 1/15/2013 4:05 PM, Bowie Bailey wrote:
>>> On 1/15/2013 3:47 PM, Ben Johnson wrote:
>>>> One final question on this subject (sorry...).
>>>>
>>>> Is there value in training Bayes on messages that SA classified as spam
>>>> *due to other test scores*? In other words, if a message is classified
>>>> as SPAM due to a block-list test, but the message is new enough for
>>>> Bayes to assign a zero score, should that message be kept and fed to
>>>> sa-learn so that Bayes can soak-up all the tokens from a message
>>>> that is
>>>> almost certainly spam (based on the other tests)?
>>>>
>>>> Am I making any sense?
>>> It is always worthwhile to train Bayes. In an ideal world, you would
>>> hand-sort and train every email that comes through your system. The
>>> more mail Bayes sees the more accurate it can be.
>>>
>> Thanks, Bowie. Given your response, would it then be prudent to call
>> "sa-learn --spam" on any message that *other tests* (non-Bayes tests)
>> determine to be spam (given some score threshold)?
>
> That is exactly what the autolearn setting does. I let my system run
> with the default autolearn settings. Some people adjust the thresholds
> and some people prefer to turn off autolearn and do purely manual training.
>
>> The crux of my question/point is that I don't want to have to feed
>> messages that Bayes "misses" but that other tests identify *correctly*
>> as spam to "sa-learn --spam".
>
> At one point, I had a script running on my server that looked for
> messages that were marked as spam with a low Bayes rating (BAYES_00 to
> BAYES_40) or messages marked as ham with a high Bayes rating (BAYES_60
> to BAYES_99). I was then able to check the messages and learn them
> properly. This let me learn from the edge cases that were not being
> scored properly by Bayes while still making it to the correct folder due
> to other rules.
>
> If you do this, you MUST check the messages yourself prior to learning
> since there is no other way to know whether they should be learned as
> ham or spam.
>
>> Is there value in implementing something like this? Or is there some
>> caveat that would make doing so self-defeating?
>
> I find that Bayes autolearn works quite well for me, but others have had
> problems with it.
>

Aaaaah... I get it. Finally. :)

Excellent info here; thanks again!

You guys are heroes... seriously.

Best regards,

-Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On Tue, 15 Jan 2013, Ben Johnson wrote:

>
>
> On 1/15/2013 1:55 PM, John Hardin wrote:
>> On Tue, 15 Jan 2013, Ben Johnson wrote:
>>
>>> On 1/14/2013 8:16 PM, John Hardin wrote:
>>>> On Mon, 14 Jan 2013, Ben Johnson wrote:
>>>>
>>>> Question: do you have any SMTP-time hard-reject DNSBL tests in place? Or
>>>> are they all performed by SA?
>>>
>>> In postfix's main.cf:
>>>
>>> smtpd_recipient_restrictions = permit_mynetworks,
>>> permit_sasl_authenticated, check_recipient_access
>>> mysql:/etc/postfix/mysql-virtual_recipient.cf,
>>> reject_unauth_destination, reject_rbl_client bl.spamcop.net
>>>
>>> Do you recommend something more?
>>
>> Unfortunately I have no experience administering Postfix. Perhaps one of
>> the other listies can help.
>
> Wow! Adding several more reject_rbl_client entries to the
> smtpd_recipient_restrictions directive in the Postfix configuration
> seems to be having a tremendous impact. The amount of spam coming
> through has dropped by 90% or more. This was a HUGELY helpful
> suggestion, John!

Which ones are you using now? There are DNSBLs that are good, but not
quite good enough to trust as hard-reject SMTP-time filters. That's why SA
does scored DNSBL checks.

>>> Yes, users are allowed to train Bayes, via Dovecot's Antispam plug-in.
>>> They do so unsupervised. Why this could be a problem is obvious. And no,
>>> I don't retain their submissions. I probably should. I wonder if I can
>>> make a few slight modifications to the shell script that Antispam calls,
>>> such that it simply sends a copy of the message to an administrator
>>> rather than calling sa-learn on the message.
>>
>> That would be a very good idea if the number of users doing training is
>> small. At the very least, the messages should be captured to a permanent
>> corpus mailbox.
>
> Good idea! I'll see if I can set this up.
>
>> Do your users also train ham? Are the procedures similar enough that
>> your users could become easily confused?
>
> They do. The procedure is implemented via Dovecot's Antispam plug-in.
> Basically, moving mail from Inbox to Junk trains it as spam, and moving
> mail from Junk to Inbox trains it as ham. I really like this setup
> (Antispam + calling SA through Amavis [i.e. not using spamd]) because
> the results are effective immediately, which seems to be crucial for
> combating this snowshoe spam (performance and scalability aside).
>
> I don't find that procedure to be confusing, but people are different, I
> suppose.

Hm. One thing I would watch out for in that environment is people who have
intentionally subscribed to some sort of mailing list deciding they don't
want to receive it any longer and just junking the messages rather than
unsubscribing.

However, your problem is FN Bayes scores...

>> The extremely odd thing is that you say you sometimes train a message as
>> spam, and its Bayes score goes *down*. Are you training a message and
>> then running it through spamc to see if the score changed, or is this
>> about _similar_ messages rather than _that_ message?
>
> Sorry for the ambiguity. This is about *similar* messages. Identical
> messages, at least visually speaking (I realize that there is a lot more
> to it than the visual component). For example, yesterday, I saw several
> Canadian Pharmacy emails, all of which were identical with respect to
> appearance. I classified each as spam, yet the Bayes score didn't budge
> more than a few percent for the first three, and went *down* for the 4th.
>
> I have to assume that while the messages (HTML-formatted) *appear* to be
> identical, the underlying code has some pseudo-random element that is
> designed very specifically to throw Bayes classifiers.
>
> Out of curiosity, does the Bayes engine (or some other element of
> SpamAssassin) have the ability to "see" rendered HTML messages, by
> appearance, and not by source code? If it could, it would be far more
> effective it seems.

That I don't know.

>> That, and configure the user-based training to at the very least capture
>> what they submit to a corpus so you can review it. Whether you do that
>> review pre-training or post-bayes-is-insane is up to you.
>
> Right, right, that makes sense. I hope I can modify the Antispam plug-in
> to accommodate this requirement.
>
> Well, I can't thank you enough here, John and everyone else. I seem to
> be on the right track; all is not lost.
>
> That said, it seems clear that SA is nowhere near as effective as it can
> be when an off-the-shelf configuration is used (and without configuring
> the MTA to do some of the blocking).
>
> I'll keep the list posted (pardon the pun) with regard to configuring
> Antispam to fire-off a copy of any message that is submitted for
> training. Ideally, whether the message is reviewed before or after
> sa-learn is called will be configurable.

Great! Thanks!

--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhardin@impsec.org FALaholic #11174 pgpk -a jhardin@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
Your mouse has moved. Your Windows Operating System must be
relicensed due to this hardware change. Please contact Microsoft
to obtain a new activation key. If this hardware change results in
added functionality you may be subject to additional license fees.
Your system will now shut down. Thank you for choosing Microsoft.
-----------------------------------------------------------------------
2 days until Benjamin Franklin's 307th Birthday
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 2013/01/15 07:27, Ben Johnson wrote:
>
>
> On 1/14/2013 7:48 PM, Noel wrote:
>> On 1/14/2013 2:59 PM, Ben Johnson wrote:
> jdow, Noel, and John, I can't thank you enough for your very thorough
> responses. Your time is valuable and I sincerely appreciate your
> willingness to help.

Glad it was even marginally helpful.

>> Ben, do be aware that sometimes you draw the short straw and sit at the
>> very start of the spam distribution cycle. In those cases the BLs will
>> generally not have been alerted yet so they may not trigger. For those
>> situations the rules should be your friends. (I still use my treasured
>> set of SARE rules and personally hand crafted rules my partner and I
>> have created that fit OUR needs but may not be good general purpose
>> rules.)
>
> This makes perfect sense and underscores the importance of a
> finely-tuned rule-set. It's become apparent just how dynamic and capable
> a monster the spam industry is. No one approach will ever be a panacea,
> it seems.
>
> The advice from your second email is well-received, too. Especially the
> part about not killing anybody. ;) I do hope fighting spam becomes fun
> for me, because so far, it's been an uphill battle! Hehe.
>
> Noel, thanks for excellent responses to my questions.

In the old days, when I was getting more spam than I am now, it got fun
enough that I would taunt the spammers who monitored this list. "Gee,
XXXX, you only managed a 95 on that last spam I got. Surely you can do
better and make it to 100 on small scoring rules." He did.

You actually get to the point you can recognize the style of various
spam programs and often relate them back to the spammer using spamhaus.
These days of full automation might make that harder. But, still, you
can probably start recognizing stylistic elements of the various
programs soon enough.

{^_^}
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 2013/01/15 08:26, Ben Johnson wrote:

> Based on my responses, what's the next move? Backup the Bayes DB, wipe
> it, and feed my corpus through the ol' chipper?

(Sure to infuriate BUT - read the WHOLE note.)

Are you sure your Bayes database is well trained? But let's change that
to, "Is the Bayes database SpamAssassin is using when receiving email
the same as the Bayes database you are training with sa-learn?"

If you are training a per-user database and do not have that enabled
in SpamAssassin, then the training is pretty useless. Worst case, waste
some CPU and disk cycles to find every SpamAssassin-related Bayes
database on your system. If you find more than one and shouldn't, then
ask yourself why and sort out that problem.
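
(Something like this would do the brute-force hunt; a sketch that assumes
the default file-based store, where the token database is a file named
bayes_toks. A SQL-backed Bayes obviously won't show up this way.)

----------------------------------------------------------------------
find / -name 'bayes_toks*' 2>/dev/null
----------------------------------------------------------------------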

{^_^}
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On Tue, 15 Jan 2013, jdow wrote:

> On 2013/01/15 08:26, Ben Johnson wrote:
>
>> Based on my responses, what's the next move? Backup the Bayes DB, wipe
>> it, and feed my corpus through the ol' chipper?
>
> (Sure to infuriate BUT - read the WHOLE note.)
>
> Are you sure your Bayes database is well trained? But let's change that
> to, "Is the Bayes database SpamAssassin is using when receiving email
> the same as the Bayes database you are training with sa-learn?"

Yeah, we already checked that possibility.

--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhardin@impsec.org FALaholic #11174 pgpk -a jhardin@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
Gun Control enables genocide while doing little to reduce crime.
-----------------------------------------------------------------------
2 days until Benjamin Franklin's 307th Birthday
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 2013/01/15 17:23, John Hardin wrote:
> On Tue, 15 Jan 2013, jdow wrote:
>
>> On 2013/01/15 08:26, Ben Johnson wrote:
>>
>>> Based on my responses, what's the next move? Backup the Bayes DB, wipe
>>> it, and feed my corpus through the ol' chipper?
>>
>> (Sure to infuriate BUT - read the WHOLE note.)
>>
>> Are you sure your Bayes database is well trained? But let's change that
>> to, "Is the Bayes database SpamAssassin is using when receiving email
>> the same as the Bayes database you are training with sa-learn?"
>
> Yeah, we already checked that possibility.

OK, then I shut my fat mouth.

{^_-}
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 1/15/13 5:26 PM, Ben Johnson wrote:

>
> In postfix's main.cf:
>
<snip>
>
> Hmm, very interesting. No, I have no greylisting in place as yet, and
> no, my userbase doesn't demand immediate delivery. I will look into
> greylisting further.

If you're running Postfix, consider using postscreen. It's a recent
addition to Postfix that can also behave in a greylisting-like way, and
much more.

Read: http://www.postfix.org/POSTSCREEN_README.html
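
A minimal example of the DNSBL side, for when you get there (the weights
and threshold are just a starting point, and master.cf also needs the
postscreen/smtpd pass-through described in the README):

----------------------------------------------------------------------
postscreen_greet_action = enforce
postscreen_dnsbl_sites = zen.spamhaus.org*2, bl.spamcop.net*1
postscreen_dnsbl_threshold = 2
postscreen_dnsbl_action = enforce
----------------------------------------------------------------------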

--
Tom
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 1/15/2013 5:22 PM, John Hardin wrote:
> On Tue, 15 Jan 2013, Ben Johnson wrote:
>
>>
>>
>> On 1/15/2013 1:55 PM, John Hardin wrote:
>>> On Tue, 15 Jan 2013, Ben Johnson wrote:
>>>
>>>> On 1/14/2013 8:16 PM, John Hardin wrote:
>>>>> On Mon, 14 Jan 2013, Ben Johnson wrote:
>>>>>
>>>>> Question: do you have any SMTP-time hard-reject DNSBL tests in
>>>>> place? Or
>>>>> are they all performed by SA?
>>>>
>>>> In postfix's main.cf:
>>>>
>>>> smtpd_recipient_restrictions = permit_mynetworks,
>>>> permit_sasl_authenticated, check_recipient_access
>>>> mysql:/etc/postfix/mysql-virtual_recipient.cf,
>>>> reject_unauth_destination, reject_rbl_client bl.spamcop.net
>>>>
>>>> Do you recommend something more?
>>>
>>> Unfortunately I have no experience administering Postfix. Perhaps one of
>>> the other listies can help.
>>
>> Wow! Adding several more reject_rbl_client entries to the
>> smtpd_recipient_restrictions directive in the Postfix configuration
>> seems to be having a tremendous impact. The amount of spam coming
>> through has dropped by 90% or more. This was a HUGELY helpful
>> suggestion, John!
>
> Which ones are you using now? There are DNSBLs that are good, but not
> quite good enough to trust as hard-reject SMTP-time filters. That's why
> SA does scored DNSBL checks.

smtpd_recipient_restrictions =
    reject_rbl_client bl.spamcop.net,
    reject_rbl_client list.dsbl.org,
    reject_rbl_client sbl-xbl.spamhaus.org,
    reject_rbl_client cbl.abuseat.org,
    reject_rbl_client dul.dnsbl.sorbs.net,

I acquired this list from the article that I cited a few responses back.
It is quite possible that some of these are obsolete, as the article is
from 2009. I seem to recall reading that sbl-xbl.spamhaus.org is
obsolete, but now I can't find the source.
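
If memory serves, Spamhaus folded the SBL and XBL (which carries the CBL
data) into zen.spamhaus.org, and list.dsbl.org was shut down years ago.
If that's right, a trimmed list might look like this (my assumption; I
still need to verify each entry):

----------------------------------------------------------------------
smtpd_recipient_restrictions =
    reject_rbl_client zen.spamhaus.org,
    reject_rbl_client bl.spamcop.net,
    reject_rbl_client dul.dnsbl.sorbs.net,
----------------------------------------------------------------------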

These are "hard rejects", right? So if this change has reduced spam,
said spam would not be accepted for delivery at all; it would be
rejected outright. Correct? (And if I understand you, this is part of
your concern.)

The reason I ask, and a point that I should have clarified in my last
post, is that the *volume* of spam didn't drop by 90% (although it may
have dropped by some measure); rather, the rate at which SA correctly
tagged spam rose by about 90%.

Ultimately, I'm wondering if the observed change was simply a product of
these message "campaigns" being black-listed after a few days of
circulation, and not the Postfix configuration change.

At this point, the vast majority of X-Spam-Status headers include Razor2
and Pyzor tests that contribute significantly to the score. I should
have mentioned earlier that I installed Razor2 and Pyzor after making my
initial post. The only reasons I didn't mention it are that a) they
didn't seem to be making a significant difference for the first day or so
after I
installed them (this could be for the snowshoe reasons we've already
discussed), and b) the low Bayes scores seemed to be the real problem
anyway.

That said, the Bayes scores seem to be much more accurate now, too. I
was hardly ever seeing BAYES_99 before, but now almost all spam messages
have BAYES_99.

Is it possible that the training I've been doing over the last week or
so wasn't *effective* until recently, say, after restarting some
component of the mail stack? My understanding is that calling SA via
Amavis, which does not need/use the spamd daemon, forces all Bayes data
to be up-to-date on each call to spamassassin.

It bears mention that I haven't yet dumped the Bayes DB and retrained
using my corpus. I'll do that next and see where we land once the DB is
repopulated.

>>>> Yes, users are allowed to train Bayes, via Dovecot's Antispam plug-in.
>>>> They do so unsupervised. Why this could be a problem is obvious. And
>>>> no,
>>>> I don't retain their submissions. I probably should. I wonder if I can
>>>> make a few slight modifications to the shell script that Antispam
>>>> calls,
>>>> such that it simply sends a copy of the message to an administrator
>>>> rather than calling sa-learn on the message.
>>>
>>> That would be a very good idea if the number of users doing training is
>>> small. At the very least, the messages should be captured to a permanent
>>> corpus mailbox.
>>
>> Good idea! I'll see if I can set this up.
>>
>>> Do your users also train ham? Are the procedures similar enough that
>>> your users could become easily confused?
>>
>> They do. The procedure is implemented via Dovecot's Antispam plug-in.
>> Basically, moving mail from Inbox to Junk trains it as spam, and moving
>> mail from Junk to Inbox trains it as ham. I really like this setup
>> (Antispam + calling SA through Amavis [i.e. not using spamd]) because
>> the results are effective immediately, which seems to be crucial for
>> combating this snowshoe spam (performance and scalability aside).
>>
>> I don't find that procedure to be confusing, but people are different, I
>> suppose.
>
> Hm. One thing I would watch out for in that environment is people who
> have intentionally subscribed to some sort of mailing list deciding they
> don't want to receive it any longer and just junking the messages rather
> than unsubscribing.

Good point. I hadn't thought of that. All the more reason to "screen"
the messages that are submitted for training.

> However, your problem is FN Bayes scores...
>
>>> The extremely odd thing is that you say you sometimes train a message as
>>> spam, and its Bayes score goes *down*. Are you training a message and
>>> then running it through spamc to see if the score changed, or is this
>>> about _similar_ messages rather than _that_ message?
>>
>> Sorry for the ambiguity. This is about *similar* messages. Identical
>> messages, at least visually speaking (I realize that there is a lot more
>> to it than the visual component). For example, yesterday, I saw several
>> Canadian Pharmacy emails, all of which were identical with respect to
>> appearance. I classified each as spam, yet the Bayes score didn't budge
>> more than a few percent for the first three, and went *down* for the 4th.
>>
>> I have to assume that while the messages (HTML-formatted) *appear* to be
>> identical, the underlying code has some pseudo-random element that is
>> designed very specifically to throw Bayes classifiers.
>>
>> Out of curiosity, does the Bayes engine (or some other element of
>> SpamAssassin) have the ability to "see" rendered HTML messages, by
>> appearance, and not by source code? If it could, it would be far more
>> effective it seems.
>
> That I don't know.
>
>>> That, and configure the user-based training to at the very least capture
>>> what they submit to a corpus so you can review it. Whether you do that
>>> review pre-training or post-bayes-is-insane is up to you.
>>
>> Right, right, that makes sense. I hope I can modify the Antispam plug-in
>> to accommodate this requirement.
>>
>> Well, I can't thank you enough here, John and everyone else. I seem to
>> be on the right track; all is not lost.
>>
>> That said, it seems clear that SA is nowhere near as effective as it can
>> be when an off-the-shelf configuration is used (and without configuring
>> the MTA to do some of the blocking).
>>
>> I'll keep the list posted (pardon the pun) with regard to configuring
>> Antispam to fire-off a copy of any message that is submitted for
>> training. Ideally, whether the message is reviewed before or after
>> sa-learn is called will be configurable.
>
> Great! Thanks!
>

Thanks again for all the insight here, John.

-Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new [ In reply to ]
On 1/16/2013 2:02 AM, Tom Hendrikx wrote:
> On 1/15/13 5:26 PM, Ben Johnson wrote:
>
>>
>> In postfix's main.cf:
>>
> <snip>
>>
>> Hmm, very interesting. No, I have no greylisting in place as yet, and
>> no, my userbase doesn't demand immediate delivery. I will look into
>> greylisting further.
>
> If you're running postfix, consider using postscreen. It's a recent
> addition to postfix that can also behave in a greylisting-like way, and
> much more.
>
> Read: http://www.postfix.org/POSTSCREEN_README.html
>
> --
> Tom
>

Thanks for the suggestion, Tom!

Unfortunately, I'm stuck on Postfix 2.7 for a while yet, and Postscreen
is available for versions >= 2.8 only.

I will definitely look into it once I'm on 2.8+, however.

Cheers,

-Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new [ In reply to ]
On Wed, 16 Jan 2013, Ben Johnson wrote:

> On 1/15/2013 5:22 PM, John Hardin wrote:
>> On Tue, 15 Jan 2013, Ben Johnson wrote:
>>>
>>> Wow! Adding several more reject_rbl_client entries to the
>>> smtpd_recipient_restrictions directive in the Postfix configuration
>>> seems to be having a tremendous impact. The amount of spam coming
>>> through has dropped by 90% or more. This was a HUGELY helpful
>>> suggestion, John!
>>
>> Which ones are you using now? There are DNSBLs that are good, but not
>> quite good enough to trust as hard-reject SMTP-time filters. That's why
>> SA does scored DNSBL checks.
>
> smtpd_recipient_restrictions =
> reject_rbl_client bl.spamcop.net,
> reject_rbl_client list.dsbl.org,
> reject_rbl_client sbl-xbl.spamhaus.org,
> reject_rbl_client cbl.abuseat.org,
> reject_rbl_client dul.dnsbl.sorbs.net,

Several of those are combined into ZEN. If you use Zen instead you'll save
some DNS queries. See the Spamhaus link I provided earlier for details, I
don't offhand remember which ones go into ZEN.

> These are "hard rejects", right? So if this change has reduced spam,
> said spam would not be accepted for delivery at all; it would be
> rejected outright. Correct? (And if I understand you, this is part of
> your concern.)

Correct.

> The reason I ask, and a point that I should have clarified in my last
> post, is that the *volume* of spam didn't drop by 90% (although, it may
> have dropped by some measure), but rather the accuracy with which SA
> tagged spam was 90% higher.

That's odd. That suggests your SA wasn't looking up those DNSBLs;
otherwise they would have contributed to the score.

Check your trusted networks setting. One difference between SMTP-time and
SA-time DNSBL checks is that SMTP-time checks the IP address of the client
talking to the MTA, while SA-time can go back up the relay chain if
necessary (e.g. to check the client IP submitting to your ISP if your
ISP's MTA is between your MTA and the Internet, rather than always
checking your ISP's MTA IP address).

> Ultimately, I'm wondering if the observed change was simply a product of
> these message "campaigns" being black-listed after a few days of
> circulation, and not the Postfix configuration change.

Maybe.

> At this point, the vast majority of X-Spam-Status headers include Razor2
> and Pyzor tests that contribute significantly to the score. I should
> have mentioned earlier that I installed Razor2 and Pyzor after making my
> initial post. The only reasons I didn't are that a) they didn't seem to
> be making a significant difference for the first day or so after I
> installed them (this could be for the snowshoe reasons we've already
> discussed), and b) the low Bayes scores seemed to be the real problem
> anyway.
>
> That said, the Bayes scores seem to be much more accurate now, too. I
> was hardly ever seeing BAYES_99 before, but now almost all spam messages
> have BAYES_99.

Odd. SMTP-time hard rejects shouldn't change that.

> Is it possible that the training I've been doing over the last week or
> so wasn't *effective* until recently, say, after restarting some
> component of the mail stack? My understanding is that calling SA via
> Amavis, which does not need/use the spamd daemon, forces all Bayes data
> to be up-to-date on each call to spamassassin.

That shouldn't be the case. SA and sa-learn both use a shared-access
database; if you're training the database that SA is learning, the results
of training should be effective immediately.
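
A quick way to sanity-check that is to compare the counters before and
after a training pass:

sa-learn --dump magic | grep -E 'nspam|nham|ntokens'

The numbers should move as soon as sa-learn finishes.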

--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhardin@impsec.org FALaholic #11174 pgpk -a jhardin@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
One difference between a liberal and a pickpocket is that if you
demand your money back from a pickpocket he will not question your
motives. -- William Rusher
-----------------------------------------------------------------------
Tomorrow: Benjamin Franklin's 307th Birthday
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new [ In reply to ]
On 1/16/2013 10:49 AM, Ben Johnson wrote:
> On 1/15/2013 5:22 PM, John Hardin wrote:
>> On Tue, 15 Jan 2013, Ben Johnson wrote:
>>
>>> Wow! Adding several more reject_rbl_client entries to the
>>> smtpd_recipient_restrictions directive in the Postfix configuration
>>> seems to be having a tremendous impact. The amount of spam coming
>>> through has dropped by 90% or more. This was a HUGELY helpful
>>> suggestion, John!
>> Which ones are you using now? There are DNSBLs that are good, but not
>> quite good enough to trust as hard-reject SMTP-time filters. That's why
>> SA does scored DNSBL checks.
> smtpd_recipient_restrictions =
> reject_rbl_client bl.spamcop.net,
> reject_rbl_client list.dsbl.org,
> reject_rbl_client sbl-xbl.spamhaus.org,
> reject_rbl_client cbl.abuseat.org,
> reject_rbl_client dul.dnsbl.sorbs.net,
>
> I acquired this list from the article that I cited a few responses back.
> It is quite possible that some of these are obsolete, as the article is
> from 2009. I seem to recall reading that sbl-xbl.spamhaus.org is
> obsolete, but now I can't find the source.

I'm not sure if it is considered "obsolete", but it has generally been
replaced by zen.spamhaus.org. Zen incorporates SBL, XBL, CSS, and PBL.
(See http://www.spamhaus.org/zen/)

> These are "hard rejects", right? So if this change has reduced spam,
> said spam would not be accepted for delivery at all; it would be
> rejected outright. Correct? (And if I understand you, this is part of
> your concern.)

Exactly.

> The reason I ask, and a point that I should have clarified in my last
> post, is that the *volume* of spam didn't drop by 90% (although, it may
> have dropped by some measure), but rather the accuracy with which SA
> tagged spam was 90% higher.

These rejects will drop the total volume of spam. SA's accuracy may
appear to go up if some of the more difficult spams are now being
blocked by the blacklists.

> Ultimately, I'm wondering if the observed change was simply a product of
> these message "campaigns" being black-listed after a few days of
> circulation, and not the Postfix configuration change.
>
> At this point, the vast majority of X-Spam-Status headers include Razor2
> and Pyzor tests that contribute significantly to the score. I should
> have mentioned earlier that I installed Razor2 and Pyzor after making my
> initial post. The only reasons I didn't are that a) they didn't seem to
> be making a significant difference for the first day or so after I
> installed them (this could be for the snowshoe reasons we've already
> discussed), and b) the low Bayes scores seemed to be the real problem
> anyway.
>
> That said, the Bayes scores seem to be much more accurate now, too. I
> was hardly ever seeing BAYES_99 before, but now almost all spam messages
> have BAYES_99.
>
> Is it possible that the training I've been doing over the last week or
> so wasn't *effective* until recently, say, after restarting some
> component of the mail stack? My understanding is that calling SA via
> Amavis, which does not need/use the spamd daemon, forces all Bayes data
> to be up-to-date on each call to spamassassin.

Amavis incorporates the SA code into itself. So in any instance where
you would normally need to restart spamd, you should instead restart
Amavis. In effect, Amavis is its own spamd daemon.

--
Bowie
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new [ In reply to ]
On 1/16/2013 9:49 AM, Ben Johnson wrote:
> smtpd_recipient_restrictions =
> reject_rbl_client bl.spamcop.net,

spamcop has a reputation of being somewhat aggressive on blocking,
and their website recommends using it in a scoring system (e.g.
SpamAssassin) rather than for outright blocking. That said, many
folks (including me) use it anyway and find it acceptable.

See the spamcop website for details, and make your own choice.

> reject_rbl_client list.dsbl.org,

list.dsbl.org is no longer active. Remove this line.

> reject_rbl_client sbl-xbl.spamhaus.org,
> reject_rbl_client cbl.abuseat.org,

The spamhaus lists are now consolidated in zen.spamhaus.org; replace
the above two lines with it. See the spamhaus web site for details.

> reject_rbl_client dul.dnsbl.sorbs.net,

This one is OK. Again, you should check their website and review
their published listing policy to see if this is something you want
to block.

Blocking mail is a very site-specific choice. Use the advice you
get as a starting point and make your own decision about how
aggressive you want to be.
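
If you want to evaluate a list before trusting it with outright
rejects, Postfix lets you prefix a restriction with warn_if_reject,
which logs what *would* have been rejected without blocking anything:

warn_if_reject reject_rbl_client dul.dnsbl.sorbs.net,

Watch the log for a while, then drop the prefix once you're
comfortable with the list's accuracy.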



-- Noel Jones
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new [ In reply to ]
On 1/16/2013 11:00 AM, John Hardin wrote:
> On Wed, 16 Jan 2013, Ben Johnson wrote:
>
>> On 1/15/2013 5:22 PM, John Hardin wrote:
>>> On Tue, 15 Jan 2013, Ben Johnson wrote:
>>>>
>>>> Wow! Adding several more reject_rbl_client entries to the
>>>> smtpd_recipient_restrictions directive in the Postfix configuration
>>>> seems to be having a tremendous impact. The amount of spam coming
>>>> through has dropped by 90% or more. This was a HUGELY helpful
>>>> suggestion, John!
>>>
>>> Which ones are you using now? There are DNSBLs that are good, but not
>>> quite good enough to trust as hard-reject SMTP-time filters. That's why
>>> SA does scored DNSBL checks.
>>
>> smtpd_recipient_restrictions =
>> reject_rbl_client bl.spamcop.net,
>> reject_rbl_client list.dsbl.org,
>> reject_rbl_client sbl-xbl.spamhaus.org,
>> reject_rbl_client cbl.abuseat.org,
>> reject_rbl_client dul.dnsbl.sorbs.net,
>
> Several of those are combined into ZEN. If you use Zen instead you'll
> save some DNS queries. See the Spamhaus link I provided earlier for
> details, I don't offhand remember which ones go into ZEN.

Per Noel's advice, I have shortened the list (dsbl.org is defunct) and
acted upon your mutual suggestion regarding ZEN:

reject_rbl_client bl.spamcop.net,
reject_rbl_client zen.spamhaus.org,
reject_rbl_client dnsbl.sorbs.net,

Indeed, block entries for all three lists are being registered in the
mail log. Very nice.

It seems as though adding these SMTP-time rejects has blocked about 1/2
of the spam that was coming through previously. Awesome.

>> These are "hard rejects", right? So if this change has reduced spam,
>> said spam would not be accepted for delivery at all; it would be
>> rejected outright. Correct? (And if I understand you, this is part of
>> your concern.)
>
> Correct.
>
>> The reason I ask, and a point that I should have clarified in my last
>> post, is that the *volume* of spam didn't drop by 90% (although, it may
>> have dropped by some measure), but rather the accuracy with which SA
>> tagged spam was 90% higher.
>
> That's odd. That suggests your SA wasn't looking up those DNSBLs;
> otherwise they would have contributed to the score.
>
> Check your trusted networks setting. One difference between SMTP-time
> and SA-time DNSBL checks is that SMTP-time checks the IP address of the
> client talking to the MTA, while SA-time can go back up the relay chain
> if necessary (e.g. to check the client IP submitting to your ISP if your
> ISP's MTA is between your MTA and the Internet, rather than always
> checking your ISP's MTA IP address).

Are you referring to SA's "trusted_networks" directive? If so, it is
commented-out (presumably by default). Does this need to be set? I've
read the info re: trusted_networks at
http://spamassassin.apache.org/full/3.3.x/doc/Mail_SpamAssassin_Conf.html ,
but I'm struggling to understand it.

If the info is helpful, I have a very simple setup here: a single server
with a single public IP address and a single MTA.

>> Ultimately, I'm wondering if the observed change was simply a product of
>> these message "campaigns" being black-listed after a few days of
>> circulation, and not the Postfix configuration change.
>
> Maybe.
>
>> At this point, the vast majority of X-Spam-Status headers include Razor2
>> and Pyzor tests that contribute significantly to the score. I should
>> have mentioned earlier that I installed Razor2 and Pyzor after making my
>> initial post. The only reasons I didn't are that a) they didn't seem to
>> be making a significant difference for the first day or so after I
>> installed them (this could be for the snowshoe reasons we've already
>> discussed), and b) the low Bayes scores seemed to be the real problem
>> anyway.
>>
>> That said, the Bayes scores seem to be much more accurate now, too. I
>> was hardly ever seeing BAYES_99 before, but now almost all spam messages
>> have BAYES_99.
>
> Odd. SMTP-time hard rejects shouldn't change that.

That's what I figured. I wonder if feeding all of the messages that I
"auto-learned manually" -- messages that were tagged as spam (but for
reasons unrelated to Bayes) -- contributed significantly to this change.
I did this late yesterday afternoon and when I took a status check this
morning, I was seeing BAYES_99 for almost every message.

>> Is it possible that the training I've been doing over the last week or
>> so wasn't *effective* until recently, say, after restarting some
>> component of the mail stack? My understanding is that calling SA via
>> Amavis, which does not need/use the spamd daemon, forces all Bayes data
>> to be up-to-date on each call to spamassassin.
>
> That shouldn't be the case. SA and sa-learn both use a shared-access
> database; if you're training the database that SA is learning, the
> results of training should be effective immediately.
>

Okay, good. Bowie's response to this question differed (he suggested
that Amavis would need to be restarted for Bayes to be updated), but I'm
pretty sure that restarting Amavis is not necessary. It seems unlikely
that Amavis would copy the entire Bayes DB (which is stored in MySQL on
this server) into memory every time that the Amavis service is started.
To do so seems self-defeating: more RAM usage, worse performance, etc.

So, I emptied the Bayes DB and re-trained ham and spam on my hand-sorted
corpus. The net result was to discard all previous end-user training, if
I understand correctly.
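
For the record, the retraining amounted to something along these lines
(the corpus paths are placeholders for my hand-sorted maildirs):

-----------------------------------------------------
sa-learn --clear
sa-learn --spam /path/to/corpus/spam
sa-learn --ham /path/to/corpus/ham
sa-learn --dump magic
-----------------------------------------------------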

Everything still looks good; mostly BAYES_99 on the messages that are
and should be marked as spam, and no false-positives at all.

I've disabled the Antispam plug-in for now, for the reasons we've
already discussed. I have asked the Dovecot mailing list for suggestions
regarding how best to pre-screen end-user training submissions.

I think I'm in pretty good shape here, unless setting trusted_networks
is a must, in which case I could use some guidance.

All the best,

-Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new [ In reply to ]
On Wed, 16 Jan 2013, Ben Johnson wrote:

> On 1/16/2013 11:00 AM, John Hardin wrote:
>>
>> That's odd. That suggests your SA wasn't looking up those DNSBLs;
>> otherwise they would have contributed to the score.
>>
>> Check your trusted networks setting. One difference between SMTP-time
>> and SA-time DNSBL checks is that SMTP-time checks the IP address of the
>> client talking to the MTA, while SA-time can go back up the relay chain
>> if necessary (e.g. to check the client IP submitting to your ISP if your
>> ISP's MTA is between your MTA and the Internet, rather than always
>> checking your ISP's MTA IP address).
>
> Are you referring to SA's "trusted_networks" directive?

Yes.

> If so, it is commented-out (presumably by default). Does this need to be
> set? I've read the info re: trusted_networks at
> http://spamassassin.apache.org/full/3.3.x/doc/Mail_SpamAssassin_Conf.html
> , but I'm struggling to understand it.

It means "which MTAs are trusted to not forge Received headers".

There is a related one: internal_networks, which lists networks that are
considered "internal" to your inbound mail topology. Sorry I missed that
one in my first message. This one you'd set if you were retrieving your
email from your ISP rather than directly exposing an MTA to the Internet.
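
For example (a sketch only, with made-up addresses; not something you
need for a single-MTA setup), a host that receives its mail through an
ISP relay might set:

trusted_networks 192.168.0/24 203.0.113.25
internal_networks 192.168.0/24

where 203.0.113.25 stands in for the ISP's MTA and 192.168.0/24 for
the local network.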

> If the info is helpful, I have a very simple setup here: a single server
> with a single public IP address and a single MTA.

That's the assumed default environment. If you aren't explicitly setting
trusted_networks and internal_networks you should be okay.

>>> That said, the Bayes scores seem to be much more accurate now, too. I
>>> was hardly ever seeing BAYES_99 before, but now almost all spam messages
>>> have BAYES_99.
>>
>> Odd. SMTP-time hard rejects shouldn't change that.
>
> That's what I figured. I wonder if feeding all of the messages that I
> "auto-learned manually" -- messages that were tagged as spam (but for
> reasons unrelated to Bayes) -- contributed significantly to this change.

Quite possibly.

>> That shouldn't be the case. SA and sa-learn both use a shared-access
>> database; if you're training the database that SA is learning, the
>> results of training should be effective immediately.
>
> Okay, good. Bowie's response to this question differed (he suggested
> that Amavis would need to be restarted for Bayes to be updated),

No, he didn't, he said that in a situation where you'd have to restart
spamd, you instead need to restart amavisd. One such situation is after
running sa-update and getting updated rules.

> but I'm pretty sure that restarting Amavis is not necessary. It seems
> unlikely that Amavis would copy the entire Bayes DB (which is stored in
> MySQL on this server) into memory every time that the Amavis service is
> started. To do so seems self-defeating: more RAM usage, worse
> performance, etc.

Right.

> So, I emptied the Bayes DB and re-trained ham and spam on my hand-sorted
> corpus. The net result was to discard all previous end-user training, if
> I understand correctly.

That is correct.

> Everything still looks good; mostly BAYES_99 on the messages that are
> and should be marked as spam, and no false-positives at all.

yay!

> I've disabled the Antispam plug-in for now, for the reasons we've
> already discussed. I have asked the Dovecot mailing list for suggestions
> regarding how best to pre-screen end-user training submissions.
>
> I think I'm in pretty good shape here, unless setting trusted_networks
> is a must, in which case I could use some guidance.

No, sounds like you're good for that.

--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhardin@impsec.org FALaholic #11174 pgpk -a jhardin@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
It is criminal to teach a man not to defend himself when he is the
constant victim of brutal attacks. -- Malcolm X (1964)
-----------------------------------------------------------------------
Tomorrow: Benjamin Franklin's 307th Birthday
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new [ In reply to ]
On 1/16/2013 1:18 PM, Ben Johnson wrote:
>
> On 1/16/2013 11:00 AM, John Hardin wrote:
>> On Wed, 16 Jan 2013, Ben Johnson wrote:
>>
>>> Is it possible that the training I've been doing over the last week or
>>> so wasn't *effective* until recently, say, after restarting some
>>> component of the mail stack? My understanding is that calling SA via
>>> Amavis, which does not need/use the spamd daemon, forces all Bayes data
>>> to be up-to-date on each call to spamassassin.
>> That shouldn't be the case. SA and sa-learn both use a shared-access
>> database; if you're training the database that SA is learning, the
>> results of training should be effective immediately.
>>
> Okay, good. Bowie's response to this question differed (he suggested
> that Amavis would need to be restarted for Bayes to be updated), but I'm
> pretty sure that restarting Amavis is not necessary. It seems unlikely
> that Amavis would copy the entire Bayes DB (which is stored in MySQL on
> this server) into memory every time that the Amavis service is started.
> To do so seems self-defeating: more RAM usage, worse performance, etc.

Actually, I was making a general observation.

For cases where you would normally need to restart spamd, you will need
to restart amavis. This includes things like rule and configuration
changes.
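
For example, after pulling new rules (the exact service name is a
guess and varies by distribution):

sa-update && service amavis restart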

Bayes data is read dynamically from your MySQL database and thus does
not require a restart of amavis/spamd when updated.

--
Bowie
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new [ In reply to ]
Hi,

>>> smtpd_recipient_restrictions =
>>> reject_rbl_client bl.spamcop.net,
>>> reject_rbl_client list.dsbl.org,
>>> reject_rbl_client sbl-xbl.spamhaus.org,
>>> reject_rbl_client cbl.abuseat.org,
>>> reject_rbl_client dul.dnsbl.sorbs.net,
>>
>> Several of those are combined into ZEN. If you use Zen instead you'll
>> save some DNS queries. See the Spamhaus link I provided earlier for
>> details, I don't offhand remember which ones go into ZEN.
>
> Per Noel's advice, I have shortened the list (dsbl.org is defunct) and
> acted upon your mutual suggestion regarding ZEN:
>
> reject_rbl_client bl.spamcop.net,
> reject_rbl_client zen.spamhaus.org,
> reject_rbl_client dnsbl.sorbs.net,

I've also started using the following, but it could be specific to postfix v2.9:

reject_rhsbl_reverse_client zen.spamhaus.org,
reject_rhsbl_sender zen.spamhaus.org,
reject_rhsbl_helo zen.spamhaus.org,

Are you using rbl_reply_maps? Prior to postscreen, I was using it in this way:

rbl_reply_maps = hash:/etc/postfix/rbl_reply_maps

I'm not sure it's necessary in your situation. You can find more about
this here:

http://www.postfix.org/STRESS_README.html
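
For reference, an rbl_reply_maps file pairs a list name with a reply
template; this sketch just mirrors Postfix's default_rbl_reply, and
the file is compiled with "postmap /etc/postfix/rbl_reply_maps":

zen.spamhaus.org $rbl_code Service unavailable; $rbl_class [$rbl_what] blocked using $rbl_domain${rbl_reason?; $rbl_reason}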

No doubt the guys on this list have been incredibly helpful in the
past. I'd like to thank them again as well.

> Okay, good. Bowie's response to this question differed (he suggested
> that Amavis would need to be restarted for Bayes to be updated), but I'm
> pretty sure that restarting Amavis is not necessary. It seems unlikely
> that Amavis would copy the entire Bayes DB (which is stored in MySQL on
> this server) into memory every time that the Amavis service is started.
> To do so seems self-defeating: more RAM usage, worse performance, etc.

I also don't believe it's necessary to restart amavisd when changes
are made to bayes. I'm also using mysql. I just wish replication were
faster, or I would use it across my multiple mail servers. Instead, I
have to maintain multiple separate mysql bayes databases, each with
its own tokens, training corpus, etc., despite it all being for a
single domain.

Regarding restarting amavisd, this is always frustrating to me. I'm
sometimes making changes very frequently, and amavisd doesn't always
restart reliably. Despite a "service amavisd stop" on fedora, it
doesn't completely stop, but instead just goes catatonic and requires
me to manually kill it.

I've asked on the amavisd list, but no one has been able to help. I've
tried just issuing a "reload" but that doesn't always work either.
Does anyone know if it's possible to send it a signal, or of a way to
more reliably signal amavisd?

Thanks,
Alex
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new [ In reply to ]
On 1/16/2013 2:22 PM, Bowie Bailey wrote:
> On 1/16/2013 1:18 PM, Ben Johnson wrote:
>>
>> On 1/16/2013 11:00 AM, John Hardin wrote:
>>> On Wed, 16 Jan 2013, Ben Johnson wrote:
>>>
>>>> Is it possible that the training I've been doing over the last week or
>>>> so wasn't *effective* until recently, say, after restarting some
>>>> component of the mail stack? My understanding is that calling SA via
>>>> Amavis, which does not need/use the spamd daemon, forces all Bayes data
>>>> to be up-to-date on each call to spamassassin.
>>> That shouldn't be the case. SA and sa-learn both use a shared-access
>>> database; if you're training the database that SA is learning, the
>>> results of training should be effective immediately.
>>>
>> Okay, good. Bowie's response to this question differed (he suggested
>> that Amavis would need to be restarted for Bayes to be updated), but I'm
>> pretty sure that restarting Amavis is not necessary. It seems unlikely
>> that Amavis would copy the entire Bayes DB (which is stored in MySQL on
>> this server) into memory every time that the Amavis service is started.
>> To do so seems self-defeating: more RAM usage, worse performance, etc.
>
> Actually, I was making a general observation.
>
> For cases where you would normally need to restart spamd, you will need
> to restart amavis. This includes things like rule and configuration
> changes.
>
> Bayes data is read dynamically from your MySQL database and thus does
> not require a restart of amavis/spamd when updated.
>

My apologies, Bowie. I misinterpreted your response. Thank you very much
for the follow-up and for the clear explanation.

Best regards,

-Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new [ In reply to ]
So, I've been keeping an eye on things again today.

Overall, things look pretty good, and most spam is being blocked
outright at the MTA and scored appropriately in SA if not.

I've been inspecting the X-Spam-Status headers for the handful of
messages that do slip through and noticed that most of them lack any
evidence of the BAYES_* tests. Here's one such header:

No, score=3.115 tagged_above=-999 required=4.5 tests=[HK_NAME_FREE=1,
HTML_MESSAGE=0.001, MIME_HTML_ONLY=0.723, PYZOR_CHECK=1.392,
SPF_PASS=-0.001] autolearn=disabled

The messages that were delivered just before and after this one do have
evidence of BAYES_* tests, so, it's not as though something is
completely broken.

Are there any normal circumstances under which Bayes tests are not run?
Do I need to turn debugging back on and wait until this happens again?
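
In the meantime, I suppose I can save one of these messages to a file,
headers intact, and run it through SA's Bayes debug channels:

spamassassin -D bayes,learn < /tmp/msg.txt 2>&1 | grep -i bayes

which should show whether the Bayes DB is being opened and whether the
message yields enough usable tokens to be scored.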

Thanks for all the help, everyone!

-Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new [ In reply to ]
On 1/15/2013 5:22 PM, John Hardin wrote:
>>>> Yes, users are allowed to train Bayes, via Dovecot's Antispam plug-in.
>>>> They do so unsupervised. Why this could be a problem is obvious. And
>>>> no,
>>>> I don't retain their submissions. I probably should. I wonder if I can
>>>> make a few slight modifications to the shell script that Antispam
>>>> calls,
>>>> such that it simply sends a copy of the message to an administrator
>>>> rather than calling sa-learn on the message.
>>>
>>> That would be a very good idea if the number of users doing training is
>>> small. At the very least, the messages should be captured to a permanent
>>> corpus mailbox.
>>
>> Good idea! I'll see if I can set this up.

So, I finally got around to tackling this change.

With a couple of simple modifications, I was able to achieve the desired
result with the Dovecot Antispam plug-in.

In dovecot.conf:

-----------------------------------------------------
plugin {
# [...]

# For Dovecot < 2.0.
antispam_spam_pattern_ignorecase = SPAM;JUNK
antispam_mail_tmpdir = /tmp
antispam_mail_sendmail = /usr/bin/sa-learn-pipe.sh
antispam_mail_spam = proposed-spam@example.com
antispam_mail_notspam = proposed-ham@example.com
}
-----------------------------------------------------

Basically, I changed the last two directive values from the switches
that are normally passed to the "sa-learn" binary (--spam and --ham) to
destination email addresses that are passed to "sendmail" in my revised
pipe script.

Here is the full pipe script, /usr/bin/sa-learn-pipe.sh; the original
commands are retained but commented out with two pound symbols [##]:

-----------------------------------------------------
#!/bin/sh

# Add "starting now" string to log.
echo "$$-start ($*)" >> /tmp/sa-learn-pipe.log

# Copy the message contents to a temporary text file.
cat<&0 >> /tmp/sendmail-msg-$$.txt

CURRENT_USER=$(whoami)

##echo "Calling (as user $CURRENT_USER) '/usr/bin/sa-learn $* /tmp/sendmail-msg-$$.txt'" >> /tmp/sa-learn-pipe.log
echo "Calling (as user $CURRENT_USER) 'sendmail $* < /tmp/sendmail-msg-$$.txt'" >> /tmp/sa-learn-pipe.log

# Pipe the temporary message, with the passed destination address, through
# sendmail (the original sa-learn invocation is retained below, commented
# out with ##). Send the output to the log file while redirecting stderr
# to stdout (so we capture debug output).
##/usr/bin/sa-learn $* /tmp/sendmail-msg-$$.txt >> /tmp/sa-learn-pipe.log 2>&1
sendmail $* < /tmp/sendmail-msg-$$.txt >> /tmp/sa-learn-pipe.log 2>&1

# Remove the temporary message.
rm -f /tmp/sendmail-msg-$$.txt

# Add "ending now" string to log.
echo "$$-end" >> /tmp/sa-learn-pipe.log

# Exit with "success" status code.
exit 0
-----------------------------------------------------

It seems as though creating a temporary copy of the message is not
strictly necessary, as the message contents could be passed to the
"sendmail" command via standard input (stdin), but creating the copy
could be useful in debugging.
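
That is, if the "cat" line that drains stdin were removed, the
delivery step could presumably be reduced to something like:

sendmail "$@" >> /tmp/sa-learn-pipe.log 2>&1

with the plug-in's message flowing straight from stdin to sendmail.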

>>> Do your users also train ham? Are the procedures similar enough that
>>> your users could become easily confused?
>>
>> They do. The procedure is implemented via Dovecot's Antispam plug-in.
>> Basically, moving mail from Inbox to Junk trains it as spam, and moving
>> mail from Junk to Inbox trains it as ham. I really like this setup
>> (Antispam + calling SA through Amavis [i.e. not using spamd]) because
>> the results are effective immediately, which seems to be crucial for
>> combating this snowshoe spam (performance and scalability aside).
>>
>> I don't find that procedure to be confusing, but people are different, I
>> suppose.
>
> Hm. One thing I would watch out for in that environment is people who
> have intentionally subscribed to some sort of mailing list deciding they
> don't want to receive it any longer and just junking the messages rather
> than unsubscribing.

The steps I've taken above will allow me to review submissions and
educate users who engage in this practice. Thanks again for elucidating
this scenario.

I hope that this approach to user-based SpamAssassin training is useful
to others.

Best regards,

-Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new [ In reply to ]
On Thu, 31 Jan 2013, Ben Johnson wrote:

> On 1/15/2013 5:22 PM, John Hardin wrote:
>>>>> Yes, users are allowed to train Bayes, via Dovecot's Antispam
>>>>> plug-in. They do so unsupervised. Why this could be a problem is
>>>>> obvious. And no, I don't retain their submissions. I probably
>>>>> should. I wonder if I can make a few slight modifications to the
>>>>> shell script that Antispam calls, such that it simply sends a copy
>>>>> of the message to an administrator rather than calling sa-learn on
>>>>> the message.
>>>>
>>>> That would be a very good idea if the number of users doing training is
>>>> small. At the very least, the messages should be captured to a permanent
>>>> corpus mailbox.
>>>
>>> Good idea! I'll see if I can set this up.
>
> So, I finally got around to tackling this change.
>
> With a couple of simple modifications, I was able to achieve the desired
> result with the Dovecot Antispam plug-in.
>
> Basically, I changed the last two directive values from the switches
> that are normally passed to the "sa-learn" binary (--spam and --ham) to
> destination email addresses that are passed to "sendmail" in my revised
> pipe script.

Passing the messages through sendmail again isn't optimal as that will
make further changes to the headers. This may have effects on the quality
of the learning, unless the original message is attached as an RFC-822
attachment to the message being sent to the corpus mailbox, which of
course means you then can't just run sa-learn directly against that
mailbox - the review process would involve moving the attachment as a
standalone message to the spam or ham learning mailbox.

Ideally you want to just move the messages between mailboxes without
involving another delivery processing. I don't know enough about Dovecot
or your topology to say whether that's going to be as easy as using
sendmail to mail the message to you.

--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhardin@impsec.org FALaholic #11174 pgpk -a jhardin@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
If guns kill people, then...
-- pencils miss spel words.
-- cars make people drive drunk.
-- spoons make people fat.
-----------------------------------------------------------------------
Tomorrow: the 10th anniversary of the loss of STS-107 Columbia
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new [ In reply to ]
On Thu, 31 Jan 2013 12:12:15 -0800 (PST)
John Hardin wrote:

> On Thu, 31 Jan 2013, Ben Johnson wrote:
>

> > So, I finally got around to tackling this change.
> >
> > With a couple of simple modifications, I was able to achieve the
> > desired result with the Dovecot Antispam plug-in.
> >
> > Basically, I changed the last two directive values from the switches
> > that are normally passed to the "sa-learn" binary (--spam and
> > --ham) to destination email addresses that are passed to "sendmail"
> > in my revised pipe script.
>
> Passing the messages through sendmail again isn't optimal as that
> will make further changes to the headers. This may have effects on
> the quality of the learning, unless the original message is attached
> as an RFC-822 attachment to the message being sent to the corpus
> mailbox, which of course means you then can't just run sa-learn
> directly against that mailbox - the review process would involve
> moving the attachment as a standalone message to the spam or ham
> learning mailbox.
>
> Ideally you want to just move the messages between mailboxes without
> involving another delivery processing. I don't know enough about
> Dovecot or your topology to say whether that's going to be as easy as
> using sendmail to mail the message to you.

Actually that's the way that the dovecot plugin works. I think that the
sendmail option is mainly a way to get training done on a remote
machine - it's a standard feature of DSPAM for which the plugin was
originally developed.

When I looked at the plugin it seemed to have quite a serious flaw.
IIRC it disables IMAP APPENDs on the Spam folder which makes it
incompatible with synchronisation tools like OfflineImap and probably
some IMAP clients that implement offline support in the same way.
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new [ In reply to ]
On 1/31/2013 5:50 PM, RW wrote:
> On Thu, 31 Jan 2013 12:12:15 -0800 (PST)
> John Hardin wrote:
>
>> On Thu, 31 Jan 2013, Ben Johnson wrote:
>>
>
>>> So, I finally got around to tackling this change.
>>>
>>> With a couple of simple modifications, I was able to achieve the
>>> desired result with the Dovecot Antispam plug-in.
>>>
>>> Basically, I changed the last two directive values from the switches
>>> that are normally passed to the "sa-learn" binary (--spam and
>>> --ham) to destination email addresses that are passed to "sendmail"
>>> in my revised pipe script.
>>
>> Passing the messages through sendmail again isn't optimal as that
>> will make further changes to the headers. This may have effects on
>> the quality of the learning, unless the original message is attached
>> as an RFC-822 attachment to the message being sent to the corpus
>> mailbox, which of course means you then can't just run sa-learn
>> directly against that mailbox - the review process would involve
>> moving the attachment as a standalone message to the spam or ham
>> learning mailbox.
>>
>> Ideally you want to just move the messages between mailboxes without
>> involving another delivery processing. I don't know enough about
>> Dovecot or your topology to say whether that's going to be as easy as
>> using sendmail to mail the message to you.
>
> Actually that's the way that the dovecot plugin works. I think that the
> sendmail option is mainly a way to get training done on a remote
> machine - it's a standard feature of DSPAM for which the plugin was
> originally developed.
>
> When I looked at the plugin it seemed to have quite a serious flaw.
> IIRC it disables IMAP APPENDs on the Spam folder which makes it
> incompatible with synchronisation tools like OfflineImap and probably
> some IMAP clients that implement offline support in the same way.
>

John, thanks for pointing-out the problems associated with re-sending
the messages via sendmail.

I threw a line out to the Dovecot users group and learned how to move
messages without going through the MTA. Dovecot has a utility
executable, "deliver", which is well-suited to the task.

For those who may have a similar need, here's the Dovecot Antispam pipe
script that I'm using, courtesy of Steffen Kaiser on the Dovecot Users
mailing list:

---------------------------------------
#!/bin/bash

# Scan the arguments passed by the Antispam plug-in for --ham/--spam
# and pick the matching training folder.
mode=
for opt; do
if test "x$opt" == "x--ham"; then
mode=HAM
break
elif test "x$opt" == "x--spam"; then
mode=SPAM
break
fi
done

if test -n "$mode"; then
# options from http://wiki1.dovecot.org/LDA
/usr/lib/dovecot/deliver -d user@example.com -m Training.$mode
fi

exit 0
---------------------------------------


And here are the Antispam plug-in options:


---------------------------------------
# For Dovecot < 2.0.
antispam_spam_pattern_ignorecase = SPAM;JUNK
antispam_mail_tmpdir = /tmp
antispam_mail_sendmail = /usr/bin/sa-learn-pipe.sh
antispam_mail_spam = --spam
antispam_mail_notspam = --ham
---------------------------------------

RW, thank you for underscoring the issue with IMAP appends. It looks as
though a configuration directive exists to control this behavior:

# Whether to allow APPENDing to SPAM folders or not. Must be set to
# "yes" (case insensitive) to be activated. Before activating, please
# read the discussion below.
# antispam_allow_append_to_spam = no

Unfortunately, I don't fully understand the implications of enabling or
disabling this option. Here's the "discussion below" that is referenced
in the above comment:

---------------------------------------
ALLOWING APPENDS?

You should be careful with allowing APPENDs to SPAM folders. The reason
for possibly allowing it is to allow not-SPAM --> SPAM transitions to
work with offlineimap. However, because with APPEND the plugin cannot
know the source of the message, multiple bad scenarios can happen:

1. SPAM --> SPAM transitions cannot be recognised and are trained

2. the same holds for Trash --> SPAM transitions

Additionally, because we cannot recognise SPAM --> not-SPAM
transitions, training good messages will never work with APPEND.
---------------------------------------

In consideration of the first point, what is a "SPAM --> SPAM
transition"? Is that when the mailbox contains more than one "spam
folder", e.g., "JUNK" and "SPAM", and the user drags a message from one
to the other?

Regarding the second point, I'm not sure I understand the problem. If
someone drags a message from Trash to SPAM, shouldn't it be submitted
for learning as spam?

The last sentence sounds like somewhat of a deal-breaker. Doesn't my
whole strategy go somewhat limp if ham cannot be submitted for training?

John and RW, do you recommend enabling or disabling the append option,
given the way I'm reviewing the submissions and sorting them manually?

Sorry for all the questions! And thanks!

-Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new [ In reply to ]
On Fri, 1 Feb 2013, Ben Johnson wrote:

> John, thanks for pointing-out the problems associated with re-sending
> the messages via sendmail.
>
> I threw a line out to the Dovecot users group and learned how to move
> messages without going through the MTA. Dovecot has a utility
> executable, "deliver", which is well-suited to the task.
>
> For those who may have a similar need, here's the Dovecot Antispam pipe
> script that I'm using, courtesy of Steffen Kaiser on the Dovecot Users
> mailing list:
>
> ---------------------------------------
> #!/bin/bash
>
> mode=
> for opt; do
> if test "x$opt" == "x--ham"; then
> mode=HAM
> break
> elif test "x$opt" == "x--spam"; then
> mode=SPAM
> break
> fi
> done
>
> if test -n "$mode"; then
> # options from http://wiki1.dovecot.org/LDA
> /usr/lib/dovecot/deliver -d user@example.com -m Training.$mode
> fi
>
> exit 0
> ---------------------------------------

That seems a lot better.

> Regarding the second point, I'm not sure I understand the problem. If
> someone drags a message from Trash to SPAM, shouldn't it be submitted
> for learning as spam?
>
> The last sentence sounds like somewhat of a deal-breaker. Doesn't my
> whole strategy go somewhat limp if ham cannot be submitted for training?
>
> John and RW, do you recommend enabling or disabling the append option,
> given the way I'm reviewing the submissions and sorting them manually?

I think they're proceeding from the assumption of *un-reviewed* training,
i.e. blind trust in the reliability of the users.

If it's possible to enable IMAP Append on a per-folder basis then enabling
it only on your training inbox folders shouldn't be an issue - the
messages won't be trained until you've reviewed them.

Without that level of fine-grain control I still don't see an issue from
this if you can prevent the users from adding content directly to the
folders that sa-learn actually processes. If IMAP Append only applies to
"shared" folders then there shouldn't be a problem - configure sa-learn to
learn from folders in *your account*, that nobody else can access
directly.

--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhardin@impsec.org FALaholic #11174 pgpk -a jhardin@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
Gun Control laws aren't enacted to control guns, they are enacted
to control people: catholics (1500s), japanese peasants (1600s),
blacks (1860s), italian immigrants (1911), the irish (1920s),
jews (1930s), blacks (1960s), the poor (always)
-----------------------------------------------------------------------
Today: the 10th anniversary of the loss of STS-107 Columbia
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new [ In reply to ]
On Fri, 1 Feb 2013 09:00:48 -0800 (PST)
John Hardin wrote:

> On Fri, 1 Feb 2013, Ben Johnson wrote:
>
> > John, thanks for pointing-out the problems associated with
> > re-sending the messages via sendmail.
> >
> > I threw a line out to the Dovecot users group and learned how to
> > move messages without going through the MTA. Dovecot has a utility
> > executable, "deliver", which is well-suited to the task.
> >
> > For those who may have a similar need, here's the Dovecot Antispam
> > pipe script that I'm using, courtesy of Steffen Kaiser on the
> > Dovecot Users mailing list:
> >
> > ---------------------------------------
> > #!/bin/bash
> >
> > mode=
> > for opt; do
> > if test "x$opt" == "x--ham"; then
> > mode=HAM
> > break
> > elif test "x$opt" == "x--spam"; then
> > mode=SPAM
> > break
> > fi
> > done
> >
> > if test -n "$mode"; then
> > # options from http://wiki1.dovecot.org/LDA
> > /usr/lib/dovecot/deliver -d user@example.com -m Training.$mode
> > fi
> >
> > exit 0
> > ---------------------------------------
>
> That seems a lot better.
>
> > Regarding the second point, I'm not sure I understand the problem.
> > If someone drags a message from Trash to SPAM, shouldn't it be
> > submitted for learning as spam?
> >
> > The last sentence sounds like somewhat of a deal-breaker. Doesn't my
> > whole strategy go somewhat limp if ham cannot be submitted for
> > training?
> >
> > John and RW, do you recommend enabling or disabling the append
> > option, given the way I'm reviewing the submissions and sorting
> > them manually?
>
> I think they're proceeding from the assumption of *un-reviewed*
> training, i.e. blind trust in the reliability of the users.
>
> If it's possible to enable IMAP Append on a per-folder basis then
> enabling it only on your training inbox folders shouldn't be an issue
> - the messages won't be trained until you've reviewed them.
>
> Without that level of fine-grain control I still don't see an issue
> from this if you can prevent the users from adding content directly
> to the folders that sa-learn actually processes. If IMAP Append only
> applies to "shared" folders then there shouldn't be a problem -
> configure sa-learn to learn from folders in *your account*, that
> nobody else can access directly.

This is what it says:


antispam_allow_append_to_spam (boolean) Specifies whether to allow
appending mails to the spam folder from the unknown source. See the
ALLOWING APPENDS section below for the details on why it is not
advised to turn this option on. Optional, default = NO.

...

ALLOWING APPENDS
By appends we mean the case of mail moving when the source folder is
unknown, e.g. when you move from some other account or with tools
like offlineimap. You should be careful with allowing APPENDs to
SPAM folders. The reason for possibly allowing it is to allow
not-SPAM --> SPAM transitions to work and be trained. However,
because the plugin cannot know the source of the message (it is
assumed to be from OTHER folder), multiple bad scenarios can happen:

1. SPAM --> SPAM transitions cannot be recognised and are trained;
2. TRASH --> SPAM transitions cannot be recognised and are trained;
3. SPAM --> not-SPAM transitions cannot be recognised therefore
training good messages will never work with APPENDs.


I presume that the plugin works by monitoring COPY commands and so
can't work properly when a move is done by FETCH-APPEND-DELETE.

For sa-learn the problem would be 3, but I don't see how that is
affected by allowing appends on the spam folder.
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new [ In reply to ]
On Sat, 2 Feb 2013, RW wrote:

> ALLOWING APPENDS
> By appends we mean the case of mail moving when the source folder is
> unknown, e.g. when you move from some other account or with tools
> like offlineimap. You should be careful with allowing APPENDs to
> SPAM folders. The reason for possibly allowing it is to allow
> not-SPAM --> SPAM transitions to work and be trained. However,
> because the plugin cannot know the source of the message (it is
> assumed to be from OTHER folder), multiple bad scenarios can happen:
>
> 1. SPAM --> SPAM transitions cannot be recognised and are trained;
> 2. TRASH --> SPAM transitions cannot be recognised and are trained;
> 3. SPAM --> not-SPAM transitions cannot be recognised therefore
> training good messages will never work with APPENDs.
>
>
> I presume that the plugin works by monitoring COPY commands and so
> can't work properly when a move is done by FETCH-APPEND-DELETE.
>
> For sa-learn the problem would be 3, but I don't see how that is
> affected by allowing appends on the spam folder.

Yeah, all of that sounds like they're talking about non-vetted training
mailboxes where the users are effectively talking directly to sa-learn.

I think I may see at least part of what they are driving at.

If one user trains a message as ham and another user who got a copy of the
same message trains it as spam, who wins?

Absent some conflict-detection mechanism, the last mailbox trained (either
spam or ham) wins.

As for the other two:

spam -> spam transitions don't matter, sa-learn recognises message-IDs and
won't learn from the same message in the same corpus more than once (i.e.
having the same message in the spam corpus multiple times does not
"weight" the tokens learned from that message). So (1) may be a
performance concern but it won't affect the database.

trash -> spam transition being learned is a problem how?

The latter brings up another concern for the vetted-corpora model: if a
message is *removed* from a training corpus mailbox rather than
reclassified, you'd have to wipe and retrain your database from scratch to
remove that message's effects.

So, you need *three* vetted corpus mailboxes: spam, ham, and
should-not-have-been-trained (forget). Rather than deleting a message from
the ham or spam corpus mailbox you move it to the forget mailbox, and in
the next training pass sa-learn forgets the message and removes it from the
forget mailbox. This would be some special scripting, because you can't
just "sa-learn --forget" a whole mailbox.

There would also need to be an audit process to detect whether the same
message_id is in both the ham and spam corpus mailboxes, so that the admin
can delete (NOT forget) the incorrect classification, or forget the
message if neither classification is reasonable.
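
That audit could be as crude as comparing Message-IDs across the two
corpus mailboxes (again, hypothetical Maildir paths):

-----------------------------------------------------
grep -rhi '^Message-ID:' /path/to/Maildir/.Training.SPAM/cur | sort -u > /tmp/spam.ids
grep -rhi '^Message-ID:' /path/to/Maildir/.Training.HAM/cur | sort -u > /tmp/ham.ids
# Message-IDs present in both files indicate conflicting training.
comm -12 /tmp/spam.ids /tmp/ham.ids
-----------------------------------------------------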

--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhardin@impsec.org FALaholic #11174 pgpk -a jhardin@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
When designing software, any time you think to yourself "a user
would never be stupid enough to do *that*", you're wrong.
-----------------------------------------------------------------------
Today: the 10th anniversary of the loss of STS-107 Columbia
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new [ In reply to ]
On 2/1/2013 12:00 PM, John Hardin wrote:
> On Fri, 1 Feb 2013, Ben Johnson wrote:
>
>> John, thanks for pointing-out the problems associated with re-sending
>> the messages via sendmail.
>>
>> I threw a line out to the Dovecot users group and learned how to move
>> messages without going through the MTA. Dovecot has a utility
>> executable, "deliver", which is well-suited to the task.
>>
>> For those who may have a similar need, here's the Dovecot Antispam pipe
>> script that I'm using, courtesy of Steffen Kaiser on the Dovecot Users
>> mailing list:
>>
>> ---------------------------------------
>> #!/bin/bash
>>
>> mode=
>> for opt; do
>> if test "x$*" == "x--ham"; then
>> mode=HAM
>> break
>> elif test "x$*" == "x--spam"; then
>> mode=SPAM
>> break
>> fi
>> done
>>
>> if test -n "$mode"; then
>> # options from http://wiki1.dovecot.org/LDA
>> /usr/lib/dovecot/deliver -d user@example.com -m Training.$mode
>> fi
>>
>> exit 0
>> ---------------------------------------
>
> That seems a lot better.
>
>> Regarding the second point, I'm not sure I understand the problem. If
>> someone drags a message from Trash to SPAM, shouldn't it be submitted
>> for learning as spam?
>>
>> The last sentence sounds like somewhat of a deal-breaker. Doesn't my
>> whole strategy go somewhat limp if ham cannot be submitted for training?
>>
>> John and RW, do you recommend enabling or disabling the append option,
>> given the way I'm reviewing the submissions and sorting them manually?
>
> I think they're proceeding from the assumption of *un-reviewed*
> training, i.e. blind trust in the reliability of the users.
>
> If it's possible to enable IMAP Append on a per-folder basis then
> enabling it only on your training inbox folders shouldn't be an issue -
> the messages won't be trained until you've reviewed them.
>
> Without that level of fine-grain control I still don't see an issue from
> this if you can prevent the users from adding content directly to the
> folders that sa-learn actually processes. If IMAP Append only applies to
> "shared" folders then there shouldn't be a problem - configure sa-learn
> to learn from folders in *your account*, that nobody else can access
> directly.
>

Thanks, John.

If I'm understanding you correctly, your assessment is that enabling
IMAP append in the Antispam plug-in configuration (not the default, by
the way) shouldn't cause problems for my Bayes training setup, primarily
because users cannot train Bayes unsupervised.

If that is so, what's the real benefit to enabling this "feature" that
is off by default? Users will be able to submit messages for training
while "offline" and when they reconnect the plug-in will be triggered
and the messages copied to the training mailbox?

-Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new [ In reply to ]
On 2/1/2013 7:58 PM, John Hardin wrote:
> On Sat, 2 Feb 2013, RW wrote:
>
>> ALLOWING APPENDS
>> By appends we mean the case of mail moving when the source folder is
>> unknown, e.g. when you move from some other account or with tools
>> like offlineimap. You should be careful with allowing APPENDs to
>> SPAM folders. The reason for possibly allowing it is to allow
>> not-SPAM --> SPAM transitions to work and be trained. However,
>> because the plugin cannot know the source of the message (it is
>> assumed to be from OTHER folder), multiple bad scenarios can happen:
>>
>> 1. SPAM --> SPAM transitions cannot be recognised and are trained;
>> 2. TRASH --> SPAM transitions cannot be recognised and are trained;
>> 3. SPAM --> not-SPAM transitions cannot be recognised therefore
>> training good messages will never work with APPENDs.
>>
>>
>> I presume that the plugin works by monitoring COPY commands and so
>> can't work properly when a move is done by FETCH-APPEND-DELETE.
>>
>> For sa-learn the problem would be 3, but I don't see how that is
>> affected by allowing appends on the spam folder.
>
> Yeah, all of that sounds like they're talking about non-vetted training
> mailboxes where the users are effectively talking directly to sa-learn.
>
> I think I may see at least part of what they are driving at.
>
> If one user trains a message as ham and another user who got a copy of
> the same message trains it as spam, who wins?
>
> Absent some conflict-detection mechanism, the last mailbox trained
> (either spam or ham) wins.
>
> As for the other two:
>
> spam -> spam transitions don't matter: sa-learn recognises message-IDs
> and won't learn from the same message in the same corpus more than once
> (i.e. having the same message in the spam corpus multiple times does not
> "weight" the tokens learned from that message). So (1) may be a
> performance concern but it won't affect the database.
>
> trash -> spam transition being learned is a problem how?
>
> The latter brings up another concern for the vetted-corpora model: if a
> message is *removed* from a training corpus mailbox rather than
> reclassified, you'd have to wipe and retrain your database from scratch
> to remove that message's effects.
>
> So, you need *three* vetted corpus mailboxes: spam, ham, and
> should-not-have-been-trained (forget). Rather than deleting a message
> from the ham or spam corpus mailbox, you move it to the forget mailbox,
> and in the next training pass sa-learn forgets the message and removes
> it from the forget mailbox. This would take some special scripting,
> because you can't just "sa-learn --forget" a whole mailbox.
>
> There would also need to be an audit process to detect whether the same
> message_id is in both the ham and spam corpus mailboxes, so that the
> admin can delete (NOT forget) the incorrect classification, or forget
> the message if neither classification is reasonable.
>

You reveal some crucial information with regard to corpora management
here, John.

I've taken your good advice and created a third mailbox (well, a third
"folder" within the same mailbox), named "Forget".

It sounds as though the key here is never to delete messages from either
corpus -- unless the same message exists in both corpora, in which case
the misclassified message should be deleted. If neither classification
is reasonable and the message should instead be forgotten, what's the
order of operations? Should a copy of the message be created in the
"Forget" corpus and then the message deleted from both the "Ham" and
"Spam" corpora?

With regard to the specialized scripting required to "forget" messages,
this sounds cumbersome

> because you can't just "sa-learn --forget" a whole mailbox.

Is there a non-obvious reason for this? Would the logic behind a
recursive --forget switch not be the same as, or similar to, that behind
the existing --ham and --spam switches?
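
In the meantime, here's the kind of per-message loop I'm imagining (a
rough sketch only, assuming Maildir storage; the folder path below is
made up):

----------------------------------------------------------------------
#!/bin/bash
# Sketch: forget every message sitting in the Maildir "Forget" folder,
# removing each one once sa-learn has processed it. The path is
# hypothetical; substitute the real training mailbox.
FORGET_DIR="/var/vmail/example.com/training/.Forget/cur"

for msg in "$FORGET_DIR"/*; do
    [ -f "$msg" ] || continue
    # sa-learn --forget operates on one message at a time, hence the loop
    sa-learn --forget "$msg" && rm -f "$msg"
done
----------------------------------------------------------------------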

Finally, when a user submits a message to be classified as ham or spam,
how should I be sorting the messages? I see the following scenarios:

1.) I agree with the end-user's classification.

2.) I disagree with the end-user's classification.
a.) Because the message was submitted as ham but is really spam (or
vice versa)
b.) Because neither classification is reasonable

In case 1.), should I *copy* the message from the submission inbox's Ham
folder to the permanent Ham corpus folder? Or should I *move* the
message? I'm trying to discern whether or not there's value in retaining
end-user submissions *as they were classified upon submission*.

In case 2.), should I simply delete the message from the submission
folder? Or is there some reason to retain the message (i.e., move it
into an "Erroneous" folder within the submission mailbox)?

I did read http://wiki.apache.org/spamassassin/HandClassifiedCorpora ,
but it doesn't address these issues, specifically.

Thanks again!

-Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new [ In reply to ]
On Wed, 6 Feb 2013, Ben Johnson wrote:
> On 2/1/2013 7:58 PM, John Hardin wrote:
>>
>> The latter brings up another concern for the vetted-corpora model: if a
>> message is *removed* from a training corpus mailbox rather than
>> reclassified, you'd have to wipe and retrain your database from scratch
>> to remove that message's effects.
>>
>> So, you need *three* vetted corpus mailboxes: spam, ham, and
>> should-not-have-been-trained (forget). Rather than deleting a message
>> from the ham or spam corpus mailbox, you move it to the forget mailbox,
>> and in the next training pass sa-learn forgets the message and removes
>> it from the forget mailbox. This would take some special scripting,
>> because you can't just "sa-learn --forget" a whole mailbox.
>>
>> There would also need to be an audit process to detect whether the same
>> message_id is in both the ham and spam corpus mailboxes, so that the
>> admin can delete (NOT forget) the incorrect classification, or forget
>> the message if neither classification is reasonable.
>
> You reveal some crucial information with regard to corpora management
> here, John.
>
> I've taken your good advice and created a third mailbox (well, a third
> "folder" within the same mailbox), named "Forget".
>
> It sounds as though the key here is never to delete messages from either
> corpus -- unless the same message exists in both corpora, in which case
> the misclassified message should be deleted. If neither classification
> is reasonable and the message should instead be forgotten, what's the
> order of operations? Should a copy of the message be created in the
> "Forget" corpus and then the message deleted from both the "Ham" and
> "Spam" corpora?

I would suggest: *move* one to the Forget folder and delete the other.

I am assuming that learning from the vetted corpora folders is on a
schedule rather than in real-time, so that you have a liberal window for
completing these operations.
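
Something along these lines, as a minimal sketch (the Maildir paths are
assumptions; point them at wherever your vetted folders actually live):

----------------------------------------------------------------------
# Scheduled (e.g. nightly) pass over the vetted corpora
TRAIN="/var/vmail/example.com/training"

sa-learn --spam "$TRAIN/.Spam/cur"
sa-learn --ham  "$TRAIN/.Ham/cur"
----------------------------------------------------------------------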

> With regard to the specialized scripting required to "forget" messages,
> this sounds cumbersome

Yeah.

>> because you can't just "sa-learn --forget" a whole mailbox.
>
> Is there a non-obvious reason for this? Would the logic behind a
> recursive --forget switch not be the same as, or similar to, that behind
> the existing --ham and --spam switches?

Oh, the logic would be the same, it's just not implemented. That's why you
can't do it. :)

> Finally, when a user submits a message to be classified as ham or spam,
> how should I be sorting the messages? I see the following scenarios:
>
> 1.) I agree with the end-user's classification.
>
> 2.) I disagree with the end-user's classification.
> a.) Because the message was submitted as ham but is really spam (or
> vice versa)
> b.) Because neither classification is reasonable

> In case 1.), should I *copy* the message from the submission inbox's Ham
> folder to the permanent Ham corpus folder? Or should I *move* the
> message? I'm trying to discern whether or not there's value in retaining
> end-user submissions *as they were classified upon submission*.

I don't see any value to retaining them in the public submission folders.

In fact, you may want to make the ham submission folder write-only (if
that's possible) in order to help preserve your individual users' privacy.

> In case 2.), should I simply delete the message from the submission
> folder? Or is there some reason to retain the message (i.e., move it
> into an "Erroneous" folder within the submission mailbox)?

You might want to do that if you intend to approach the user to explain
why it wasn't a correct submission and you want evidence - for
example, to say that this looks like a message from a legitimate mailing
list that they intentionally subscribed to at some point, and the
unsubscribe link is right there (points at screen).

Apart from that, I don't see a reason to keep erroneous submissions
either.

> I did read http://wiki.apache.org/spamassassin/HandClassifiedCorpora ,
> but it doesn't address these issues, specifically.

Yeah, that assumes familiarity with these issues, and managing masscheck
corpora is a slightly different task than managing user-fed Bayes training
corpora.

--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhardin@impsec.org FALaholic #11174 pgpk -a jhardin@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
...we talk about creating "millions of shovel-ready jobs" for a
society that doesn't really encourage people to pick up a shovel.
-- Mike Rowe, testifying before Congress
-----------------------------------------------------------------------
6 days until Abraham Lincoln's and Charles Darwin's 204th Birthdays
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new [ In reply to ]
Apologies for resurrecting the thread, but I never did receive a
response to this particular aspect of the problem (asked on Jan 18,
2013, 8:51 AM). This is probably because I replied to my own post before
anyone else did, and changed the subject slightly.

We are being hammered pretty hard with spam (again), and as I inspect
messages whose score is below tag2_level, BAYES_* is conspicuously
absent from the headers.

To reiterate my question:

>> Are there any normal circumstances under which Bayes tests are not run?

If not, are there circumstances under which Bayes tests are run but
their results are not included in the message headers? (I have tag_level
set to -999, so SA headers are always added.)

Likewise, for the vast majority of spam messages that slip through, I
see no evidence of Pyzor or Razor2 activity. I have heretofore assumed
that this observation indicates that the network tests were performed,
but did not contribute to the SA score. Is this assumption valid?

Also, is there some means by which to *force* Pyzor and Razor2 scores to
be added to the SA header, even if they did not contribute to the score?

To refresh folks' memories, we have verified that Bayes is set up
correctly (database was wiped and now training is done manually and is
supervised), and that network tests are being performed when messages
are scanned.

Thanks for sticking with me through all of this, guys!

-Ben



On 1/18/2013 11:51 AM, Ben Johnson wrote:
> So, I've been keeping an eye on things again today.
>
> Overall, things look pretty good, and most spam is being blocked
> outright at the MTA and scored appropriately in SA if not.
>
> I've been inspecting the X-Spam-Status headers for the handful of
> messages that do slip through and noticed that most of them lack any
> evidence of the BAYES_* tests. Here's one such header:
>
> No, score=3.115 tagged_above=-999 required=4.5 tests=[HK_NAME_FREE=1,
> HTML_MESSAGE=0.001, MIME_HTML_ONLY=0.723, PYZOR_CHECK=1.392,
> SPF_PASS=-0.001] autolearn=disabled
>
> The messages that were delivered just before and after this one do have
> evidence of BAYES_* tests, so, it's not as though something is
> completely broken.
>
> Are there any normal circumstances under which Bayes tests are not run?
> Do I need to turn debugging back on and wait until this happens again?
>
> Thanks for all the help, everyone!
>
> -Ben
>
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new [ In reply to ]
On 4/16/13 2:59 PM, "Ben Johnson" <ben@indietorrent.org> wrote:

>Are there any normal circumstances under which Bayes tests are not run?
Yes, if use_bayes 0 is set in the local.cf file.

>
> If not, are there circumstances under which Bayes tests are run but
> their results are not included in the message headers? (I have tag_level
> set to -999, so SA headers are always added.)

That sounds like an amavisd setting; you may want to check in
~amavisd/.spamassassin/user_prefs as well....

>
> Likewise, for the vast majority of spam messages that slip through, I
> see no evidence of Pyzor or Razor2 activity. I have heretofore assumed
> that this observation indicates that the network tests were performed,
> but did not contribute to the SA score. Is this assumption valid?
Yes.

>
> Also, is there some means by which to *force* Pyzor and Razor2 scores to
> be added to the SA header, even if they did not contribute to the score?

I imagine you would want something like this:

full RAZOR2_CF_RANGE_0_50 eval:check_razor2_range('','0','50')
tflags RAZOR2_CF_RANGE_0_50 net
reuse RAZOR2_CF_RANGE_0_50
describe RAZOR2_CF_RANGE_0_50 Razor2 gives confidence level under 50%
score RAZOR2_CF_RANGE_0_50 0.01

full RAZOR2_CF_RANGE_E4_0_50 eval:check_razor2_range('4','0','50')
tflags RAZOR2_CF_RANGE_E4_0_50 net
reuse RAZOR2_CF_RANGE_E4_0_50
describe RAZOR2_CF_RANGE_E4_0_50 Razor2 gives engine 4 confidence level
below 50%
score RAZOR2_CF_RANGE_E4_0_50 0.01

full RAZOR2_CF_RANGE_E8_0_50 eval:check_razor2_range('8','0','50')
tflags RAZOR2_CF_RANGE_E8_0_50 net
reuse RAZOR2_CF_RANGE_E8_0_50
describe RAZOR2_CF_RANGE_E8_0_50 Razor2 gives engine 8 confidence level
below 50%
score RAZOR2_CF_RANGE_E8_0_50 0.01

>
> To refresh folks' memories, we have verified that Bayes is set up
> correctly (database was wiped and now training is done manually and is
> supervised), and that network tests are being performed when messages
> are scanned.
>
> Thanks for sticking with me through all of this, guys!
>
> -Ben

--
Daniel J McDonald, CCIE # 2495, CISSP # 78281
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new [ In reply to ]
Daniel, thanks for the quick reply. I'll reply inline, below.

On 4/16/2013 5:01 PM, Daniel McDonald wrote:
>
>
>
> On 4/16/13 2:59 PM, "Ben Johnson" <ben@indietorrent.org> wrote:
>
>> Are there any normal circumstances under which Bayes tests are not run?
> Yes, if use_bayes 0 is set in the local.cf file.

I checked in /etc/spamassassin/local.cf, and find the following:

use_bayes 1

So, that seems not to be the issue.

>>
>> If not, are there circumstances under which Bayes tests are run but
>> their results are not included in the message headers? (I have tag_level
>> set to -999, so SA headers are always added.)
>
> That sounds like an amavisd setting; you may want to check in
> ~amavisd/.spamassassin/user_prefs as well....

I checked in the equivalent path on my system
(/var/lib/amavis/.spamassassin/user_prefs) and the entire file is
commented-out. So, that seems not to be the issue, either.

Is there anything else that would cause Bayes tests not to be performed? I
ask because other types of tests are disabled automatically under
certain circumstances (e.g., network tests), and I'm wondering if there
is some obscure combination of factors that causes Bayes tests not to be
performed.

>>
>> Likewise, for the vast majority of spam messages that slip-through, I
>> see no evidence of Pyzor or Razor2 activity. I have heretofore assumed
>> that this observation indicates that the network tests were performed,
>> but did not contribute to the SA score. Is this assumption valid?
> Yes.

Okay, very good.

It occurred to me that perhaps the Pyzor and/or Razor2 tests are
timing out (both timeouts are set to 15 seconds) some percentage of the
time, which may explain why these tests do not contribute to a given
message's score.

That's why I asked about forcing the results into the SA header.

>>
>> Also, is there some means by which to *force* Pyzor and Razor2 scores to
>> be added to the SA header, even if they did not contribute to the score?
>
> I imagine you would want something like this:
>
> full RAZOR2_CF_RANGE_0_50 eval:check_razor2_range('','0','50')
> tflags RAZOR2_CF_RANGE_0_50 net
> reuse RAZOR2_CF_RANGE_0_50
> describe RAZOR2_CF_RANGE_0_50 Razor2 gives confidence level under 50%
> score RAZOR2_CF_RANGE_0_50 0.01
>
> full RAZOR2_CF_RANGE_E4_0_50 eval:check_razor2_range('4','0','50')
> tflags RAZOR2_CF_RANGE_E4_0_50 net
> reuse RAZOR2_CF_RANGE_E4_0_50
> describe RAZOR2_CF_RANGE_E4_0_50 Razor2 gives engine 4 confidence level
> below 50%
> score RAZOR2_CF_RANGE_E4_0_50 0.01
>
> full RAZOR2_CF_RANGE_E8_0_50 eval:check_razor2_range('8','0','50')
> tflags RAZOR2_CF_RANGE_E8_0_50 net
> reuse RAZOR2_CF_RANGE_E8_0_50
> describe RAZOR2_CF_RANGE_E8_0_50 Razor2 gives engine 8 confidence level
> below 50%
> score RAZOR2_CF_RANGE_E8_0_50 0.01

This seems to work brilliantly. I can't thank you enough; I never would
have figured this out.

Ideally, using the above directives will tell us whether we're
experiencing timeouts, or these spam messages are simply not in the
Pyzor or Razor2 databases.

Off the top of your head, do you happen to know what will happen if one
or both of the Pyzor/Razor2 tests time out? Will some indication that the
tests were at least *started* still be added to the SA header?
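
In the meantime, I've been exercising both clients by hand against a
saved message, which at least shows whether they respond at all (this
assumes both clients are installed and configured for the user that
runs SA):

----------------------------------------------------------------------
# razor-check exits 0 if the message is listed, 1 if it is not
razor-check < /tmp/msg.txt && echo "razor: listed" || echo "razor: not listed"

# pyzor prints the server's spam and whitelist counts for the digest
pyzor check < /tmp/msg.txt
----------------------------------------------------------------------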

>>
>> To refresh folks' memories, we have verified that Bayes is setup
>> correctly (database was wiped and now training is done manually and is
>> supervised), and that network tests are being performed when messages
>> are scanned.
>>
>> Thanks for sticking with me through all of this, guys!
>>
>> -Ben
>

Thanks again, Daniel!

-Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new [ In reply to ]
Ben Johnson wrote:
> Is there anything else that would cause Bayes tests not to be performed? I
> ask because other types of tests are disabled automatically under
> certain circumstances (e.g., network tests), and I'm wondering if there
> is some obscure combination of factors that causes Bayes tests not to be
> performed.

Do you have bayes_sql_override_username set? (This forces use of a
single Bayes DB for all SA calls that reference this configuration file
set.)

If not, you may be getting a Bayes DB for each user on your system;
IIRC this is supported (sort of) and default with Amavis.

-kgd
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new [ In reply to ]
On 17-04-13 21:40, Ben Johnson wrote:
> Ideally, using the above directives will tell us whether we're
> experiencing timeouts, or these spam messages are simply not in the
> Pyzor or Razor2 databases.
>
> Off the top of your head, do you happen to know what will happen if one
>> or both of the Pyzor/Razor2 tests time out? Will some indication that the
> tests were at least *started* still be added to the SA header?

The razor client (I don't know about pyzor) logs its activity to a
logfile in ~razor. There you can see what is (or is not) happening.

It's also possible to raise logfile verbosity by changing the razor
config file. See the man page for details.
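
For example, something along these lines (the razor home below is a
guess; point it at wherever the scanning user's razor files actually
live):

----------------------------------------------------------------------
# Raise razor verbosity, then watch the log
echo 'debuglevel = 9' >> /var/lib/amavis/.razor/razor-agent.conf
tail -f /var/lib/amavis/.razor/razor-agent.log
----------------------------------------------------------------------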

--
Tom
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new [ In reply to ]
On 4/17/2013 5:05 PM, Kris Deugau wrote:
> Ben Johnson wrote:
>> Is there anything else that would cause Bayes tests not to be performed? I
>> ask because other types of tests are disabled automatically under
>> certain circumstances (e.g., network tests), and I'm wondering if there
>> is some obscure combination of factors that causes Bayes tests not to be
>> performed.
>
> Do you have bayes_sql_override_username set? (This forces use of a
> single Bayes DB for all SA calls that reference this configuration file
> set.)
>
> If not, you may be getting a Bayes DB for each user on your system;
> IIRC this is supported (sort of) and default with Amavis.
>
> -kgd
>

Thanks for jumping-in here, Kris.

Yes, I do have the following in my SA local.cf:

bayes_sql_override_username amavis

So, all users are sharing the same Bayes DB. I train Bayes daily and the
token count, etc., etc. all look good and correct.
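
(As a sanity check, the per-user rows can be inspected directly in SQL;
the database name and credentials below are placeholders for whatever
your bayes_sql_dsn points at:)

----------------------------------------------------------------------
mysql -u BAYES_USER -p BAYES_DB \
  -e 'SELECT id, username, spam_count, ham_count, token_count FROM bayes_vars;'
----------------------------------------------------------------------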

Just a quick update to my previous post.

The Pyzor and Razor2 score information is indeed coming through for the
handful of messages that have landed since I made those configuration
changes. So, all seems to be well on the Pyzor / Razor2 front.

However, I still don't see any evidence that Bayes testing was performed
on the messages that are "slipping through".

It bears mention that *most* messages do indeed show evidence of Bayes
scoring.

--- OH, SNAP! I found the root cause. ---

Well, when I went to confirm the above statement, regarding most
messages showing evidence of Bayes scoring, I realized that *none* show
evidence of it since 3/23! No wonder all of this garbage is slipping
through!

I recognized the date 3/23 immediately; it was the date on which we
upgraded ISPConfig from 3.0.4.6 to 3.0.5. (For those who have no
knowledge of ISPConfig, it is basically a FOSS solution for managing vast
numbers of websites, domains, mailboxes, etc., as the name implies.)

We also updated OS packages (security only) on that day.

After diff-ing all of the relevant service configuration files
(amavis-new, spamassassin, postfix, dovecot, etc.) I couldn't find any
discrepancies.

Then, I tried:

-----------------------------------------------------
# spamassassin -D -t < /tmp/msg.txt 2>&1 | egrep '(bayes:|whitelist:|AWL)'

Apr 17 15:36:08.723 [23302] dbg: bayes: learner_new
self=Mail::SpamAssassin::Plugin::Bayes=HASH(0x2fbc508),
bayes_store_module=Mail::SpamAssassin::BayesStore::MySQL
Apr 17 15:36:08.746 [23302] dbg: bayes: using username: amavis
Apr 17 15:36:08.746 [23302] dbg: bayes: learner_new: got
store=Mail::SpamAssassin::BayesStore::MySQL=HASH(0x3305358)
Apr 17 15:36:08.758 [23302] dbg: bayes: database connection established
Apr 17 15:36:08.758 [23302] dbg: bayes: found bayes db version 3
Apr 17 15:36:08.759 [23302] dbg: bayes: Using userid: 1
Apr 17 15:36:08.914 [23302] dbg: bayes: corpus size: nspam = 6083, nham
= 2334
Apr 17 15:36:08.920 [23302] dbg: bayes: tok_get_all: token count: 163
Apr 17 15:36:08.921 [23302] dbg: bayes: tok_get_all: SQL error: Illegal
mix of collations for operation ' IN '
Apr 17 15:36:08.921 [23302] dbg: bayes: cannot use bayes on this
message; none of the tokens were found in the database
Apr 17 15:36:08.921 [23302] dbg: bayes: not scoring message, returning undef
Apr 17 15:36:13.116 [23302] dbg: timing: total 5159 ms - init: 804
(15.6%), parse: 10 (0.2%), extract_message_metadata: 99 (1.9%),
poll_dns_idle: 3426 (66.4%), get_uri_detail_list: 0.24 (0.0%),
tests_pri_-1000: 11 (0.2%), compile_gen: 133 (2.6%), compile_eval: 18
(0.3%), tests_pri_-950: 5 (0.1%), tests_pri_-900: 5 (0.1%),
tests_pri_-400: 12 (0.2%), check_bayes: 8 (0.1%), tests_pri_0: 804
(15.6%), dkim_load_modules: 23 (0.4%), check_dkim_signature: 5 (0.1%),
check_dkim_adsp: 99 (1.9%), check_spf: 61 (1.2%), check_razor2: 211
(4.1%), check_pyzor: 138 (2.7%), tests_pri_500: 3387 (65.7%)
-----------------------------------------------------

Check-out the message buried half-way down:

bayes: tok_get_all: SQL error: Illegal mix of collations for operation '
IN '

I have run into this unsightly message before, but in that case, I could
see the entire query, which enabled me to change the collations accordingly.

In this case, I have no idea what the original query might have been.

Further, I have no idea what changed that introduced this problem on 3/23.

Was it a MySQL upgrade? Was it an ISPConfig change?

Has anybody else run into this?

Thanks again,

-Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new [ In reply to ]
On 4/17/2013 6:47 PM, Ben Johnson wrote:
>
> [...]
>
> Then, I tried:
>
> # spamassassin -D -t < /tmp/msg.txt 2>&1 | egrep '(bayes:|whitelist:|AWL)'
>
> [...]
>
> Check-out the message buried half-way down:
>
> bayes: tok_get_all: SQL error: Illegal mix of collations for operation '
> IN '
>
> In this case, I have no idea what the original query might have been.
>
> Further, I have no idea what changed that introduced this problem on 3/23.
>
> Was it a MySQL upgrade? Was it an ISPConfig change?
>
> Has anybody else run into this?
>
> Thanks again,
>
> -Ben
>

I managed to fix this issue.

The date on which Bayes stopped "working" was relevant only inasmuch as
it was the first date on which MySQL had been restarted in months. The
software updates had nothing to do with the issue. The critical change
was that sometime since the previous restart, I had added a handful of
[mysqld] configuration directives to my.cnf:

default_storage_engine=InnoDB
skip-character-set-client-handshake
collation_server=utf8_unicode_ci
character_set_server=utf8

Searching the Web for "bayes: _put_token: Updated an unexpected number
of rows" yielded the following thread:

http://spamassassin.1065346.n5.nabble.com/Migrating-bayes-to-mysql-fails-with-parsing-errors-td10064i20.html

There were several clues here. After backing up the token DB with
"sa-learn --backup", I dropped and recreated the tables using the schema
from
http://svn.apache.org/repos/asf/spamassassin/tags/spamassassin_current_release_3.3.x/sql/bayes_mysql.sql
.

Then, I tried restoring the data into the new tables:

# sa-learn --restore bayes-backup
bayes: encountered too many errors (20) while parsing token line,
reverting to empty database and exiting
ERROR: Bayes restore returned an error, please re-run with -D for more
information

# sa-learn -D --restore bayes-backup

dbg: bayes: _put_token: Updated an unexpected number of rows.
dbg: bayes: error inserting token for line: t 1 0 1364380202 3f3f1a2a3f
dbg: bayes: _put_token: Updated an unexpected number of rows.
dbg: bayes: error inserting token for line: t 1 0 1365878113 727f3f3f20

Still no joy. So I re-read the above thread, cover-to-cover.

The first post on that page was the key. In particular, adding the
following to each MySQL "CREATE TABLE" statement:

ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;

To create the new tables, I performed a find-and-replace-all in the
above .sql file, replacing "TYPE=MyISAM;" with "ENGINE=InnoDB DEFAULT
CHARSET=utf8 COLLATE=utf8_bin;". All the tables were created successfully.

To restore the data, I used "sa-learn --restore", and this time, the
process completed without error.
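
Condensed, the working sequence was essentially the following (the
database name and user are placeholders, and I dropped the old tables
before re-creating them):

----------------------------------------------------------------------
# 1. Back up the existing token data
sa-learn --backup > bayes-backup

# 2. Rebuild the schema with InnoDB and a binary collation
sed 's/TYPE=MyISAM;/ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;/' \
    bayes_mysql.sql > bayes_mysql_innodb.sql
mysql -u BAYES_USER -p BAYES_DB < bayes_mysql_innodb.sql

# 3. Restore the tokens into the new tables
sa-learn --restore bayes-backup
----------------------------------------------------------------------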

I was a little nervous that restoring from the backup would create
corrupted token data, and because I retain my spam and ham corpora, it
felt "safer" to empty the tables and re-run sa-learn on the corpora.

In any case, problem solved!

Thanks for the help and pointers in the right direction.

-Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new [ In reply to ]
On Wed, 17 Apr 2013, Ben Johnson wrote:

> The first post on that page was the key. In particular, adding the
> following to each MySQL "CREATE TABLE" statement:
>
> ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;

Please check the SpamAssassin bugzilla to see if this situation is already
mentioned, and if not, add a bug. This seems pretty critical.

It's possible that there's a good reason the default script still uses
myISAM. If so, the documentation for this fix should at least be easier to
find.

--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhardin@impsec.org FALaholic #11174 pgpk -a jhardin@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
Our government should bear in mind the fact that the American
Revolution was touched off by the then-current government
attempting to confiscate firearms from the people.
-----------------------------------------------------------------------
2 days until the 238th anniversary of The Shot Heard 'Round The World
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new [ In reply to ]
On 4/17/2013 5:39 PM, Tom Hendrikx wrote:
> On 17-04-13 21:40, Ben Johnson wrote:
>> Ideally, using the above directives will tell us whether we're
>> experiencing timeouts, or these spam messages are simply not in the
>> Pyzor or Razor2 databases.
>>
>> Off the top of your head, do you happen to know what will happen if one
>> or both of the Pyzor/Razor2 tests timeout? Will some indication that the
>> tests were at least *started* still be added to the SA header?
>
> The razor client (don't know about pyzor) logs its activity to some
> logfile in ~razor. There you can see what (or what not) is happening.
>
> It's also possible to raise logfile verbosity by changing the razor
> config file. See the man page for details.
>
> --
> Tom
>

Tom, thanks for the excellent tip regarding Razor's own log file.
Tailing that log will make this kind of debugging much simpler in the
future. Much appreciated.

One of the reasons for which I also like the idea of using Daniel
McDonald's include-scores-in-header rule (for Pyzor and Razor) is that
the data is embedded right in the message, which can be useful. For one,
this makes the scoring data more "portable" (it stays with the message
to which it applies). Secondly, when tailing a log, it can be difficult
to determine where the data relevant to one message ends and another begins.

Thanks again,

-Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new [ In reply to ]
On 4/17/2013 10:15 PM, John Hardin wrote:
> On Wed, 17 Apr 2013, Ben Johnson wrote:
>
>> The first post on that page was the key. In particular, adding the
>> following to each MySQL "CREATE TABLE" statement:
>>
>> ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
>
> Please check the SpamAssassin bugzilla to see if this situation is
> already mentioned, and if not, add a bug. This seems pretty critical.

Mark Martinec opened three reports in relation to this issue (quoted
from the archive thread cited in my previous post):

[Bug 6624] BayesStore/MySQL.pm fails to update tokens due to
MySQL server bug (wrong count of rows affected)
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6624

(^^ Fixed in 3.4 ^^)

[Bug 6625] Bayes SQL schema treats bayes_token.token as char
instead of binary, fails chset checks
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6625

(^^ Fixed in 3.4 ^^)

[Bug 6626] Newer MySQL chokes on TYPE=MyISAM syntax
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6626

(^^ Fixed in 3.4 ^^)

My concern now is that I am on 3.3.1, with little control over upgrades.
I have read all three bug reports in their entirety and Bug 6624 seems
to be a very legitimate concern. To quote Mark in the bug description:

> The effect of the bug with SpamAssassin is that tokens are only able
> to be inserted once, but their counts cannot increase, leading to
> terrible bayes results if the bug is not noticed. Also the conversion
> form db fails, as reported by Dave.
>
> Attached is a patch for lib/Mail/SpamAssassin/BayesStore/MySQL.pm to
> provide a workaround for the MySQL server bug, and improved debug logging.

How can I discern whether or not this bug does, in fact, affect me? Are
my Bayes results being crippled as a result of this bug?

> It's possible that there's a good reason the default script still uses
> myISAM. If so, the documentation for this fix should at least be easier
> to find.
>

If there is a good reason, I have yet to discern what it might be. The
third bug from above (Mark's comments, specifically) implies that there
is no particular reason for using MyISAM.

I have good reason for wanting to use the InnoDB storage engine, and I
have seen no performance hit as a result of so doing. (In fact,
performance seems better than with MyISAM in my scripted, once-a-day
training setup.)

The perfectly acceptable performance I'm observing could be because a)
the InnoDB-related resources allocated to MySQL are more than
sufficient, b) the schema that I used has a newly-added INDEX whereas
those prior to it did not, or c) I was sure to use the "MySQL" module
instead of the "SQL" module with my InnoDB setup:

bayes_store_module Mail::SpamAssassin::BayesStore::MySQL

The bottom line seems to be that for those who have settings like these
in their MySQL configurations

> default_storage_engine=InnoDB
> skip-character-set-client-handshake
> collation_server=utf8_unicode_ci
> character_set_server=utf8

it is absolutely necessary to include

ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;

at the end of each CREATE TABLE statement (otherwise, the tables are
created with a mismatched collation and all Bayes SELECT statements fail).

In any event, I'm a little concerned because while the majority of
messages are now tagged with BAYES_* hits, I am now seeing this debug
output on a significant percentage of messages ("cannot use bayes on
this message; not enough usable tokens found"):

# spamassassin -D -t < /tmp/msg.txt 2>&1 | egrep '(bayes:|whitelist:|AWL)'

--------------------------------------------------------------
Apr 18 09:15:36.537 [21797] dbg: bayes: learner_new
self=Mail::SpamAssassin::Plugin::Bayes=HASH(0x4430388),
bayes_store_module=Mail::SpamAssassin::BayesStore::MySQL
Apr 18 09:15:36.568 [21797] dbg: bayes: using username: amavis
Apr 18 09:15:36.568 [21797] dbg: bayes: learner_new: got
store=Mail::SpamAssassin::BayesStore::MySQL=HASH(0x4779778)
Apr 18 09:15:36.580 [21797] dbg: bayes: database connection established
Apr 18 09:15:36.580 [21797] dbg: bayes: found bayes db version 3
Apr 18 09:15:36.581 [21797] dbg: bayes: Using userid: 1
Apr 18 09:15:36.781 [21797] dbg: bayes: corpus size: nspam = 6155, nham
= 2342
Apr 18 09:15:36.787 [21797] dbg: bayes: tok_get_all: token count: 176
Apr 18 09:15:36.790 [21797] dbg: bayes: cannot use bayes on this
message; not enough usable tokens found
Apr 18 09:15:36.790 [21797] dbg: bayes: not scoring message, returning undef
Apr 18 09:15:37.861 [21797] dbg: timing: total 2109 ms - init: 830
(39.4%), parse: 7 (0.4%), extract_message_metadata: 123 (5.9%),
poll_dns_idle: 74 (3.5%), get_uri_detail_list: 2 (0.1%),
tests_pri_-1000: 26 (1.3%), compile_gen: 155 (7.4%), compile_eval: 19
(0.9%), tests_pri_-950: 7 (0.3%), tests_pri_-900: 7 (0.3%),
tests_pri_-400: 15 (0.7%), check_bayes: 10 (0.5%), tests_pri_0: 1018
(48.3%), dkim_load_modules: 25 (1.2%), check_dkim_signature: 3 (0.2%),
check_dkim_adsp: 16 (0.7%), check_spf: 78 (3.7%), check_razor2: 91
(4.3%), check_pyzor: 430 (20.4%), tests_pri_500: 50 (2.4%)
--------------------------------------------------------------

I have done some searching around on the string "cannot use bayes on
this message; not enough usable tokens found" and have not found
anything authoritative regarding what this message might mean and
whether or not it can be ignored or if it is symptomatic of a larger
Bayes problem.

Thank you,

-Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new [ In reply to ]
On 04/18/2013 06:18 PM, Ben Johnson wrote:
> I have done some searching around on the string "cannot use bayes on
> this message; not enough usable tokens found" and have not found
> anything authoritative regarding what this message might mean and
> whether or not it can be ignored or if it is symptomatic of a larger
> Bayes problem.

Curious: what are your reasons for using Bayes in SQL?
Are you sharing the DB among several machines? Or is this a single
box/global bayes setup?
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new [ In reply to ]
On 4/18/2013 12:26 PM, Axb wrote:
> On 04/18/2013 06:18 PM, Ben Johnson wrote:
>> I have done some searching around on the string "cannot use bayes on
>> this message; not enough usable tokens found" and have not found
>> anything authoritative regarding what this message might mean and
>> whether or not it can be ignored or if it is symptomatic of a larger
>> Bayes problem.
>
> Curious: what are your reasons for using Bayes in SQL?
> Are you sharing the DB among several machines? Or is this a single
> box/global bayes setup?
>
>

Not yet, but that is the ultimate plan (to share the DB across multiple
servers). Also, I like the idea that the Bayes DB is backed up
automatically along with all other databases on the server (we run a
cron script that performs the dump). Granted, it would be trivial to
schedule a call to "sa-learn --backup", but storing the data in SQL
seems more portable and makes it easier to query the data for reporting
purposes.
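
(For reference, the trivial scheduled call would be a crontab entry
along these lines; the destination path is arbitrary, and the '%' must
be escaped because of how crontab treats it:)

----------------------------------------------------------------------
# Nightly Bayes dump at 03:00
0 3 * * * sa-learn --backup > /var/backups/bayes-$(date +\%F).txt
----------------------------------------------------------------------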

Then again, I retain the corpora, so backing up the DB is only useful
for when data needs to be moved from one server or database to another
(as moving the corpora seems far less practical).

Are you suggesting that I should scrap SQL and go back to a flat-file
DB? Is that the only path to a fix (short of upgrading SA)?

Thanks for your help!

-Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new [ In reply to ]
Hi,

> Curious: what are your reasons for using Bayes in SQL?

> > Are you sharing the DB among several machines? Or is this a single
> > box/global bayes setup?
> >
> >
>
> Not yet, but that is the ultimate plan (to share the DB across multiple
> servers). Also, I like the idea that the Bayes DB is backed-up
> automatically along with all other databases on the server (we run a
> cron script that performs the dump). Granted, it would be trivial to
> schedule a call to "sa-learn --backup", but storing the data in SQL
> seems more portable and makes it easier to query the data for reporting
> purposes.
>

I have bayes in MySQL now, and I think it performs better than with just
a flat-file Berkeley DB. I believe it solved some locking/sharing issues
I was having, too.

I converted to it a few months ago (relearned the corpus from scratch
into mysql) with the intention of sharing between three systems, but the
network latency and general performance between the systems for updates
was horrible, so they're all separate databases now. I'm still a mysql
novice, so I don't doubt someone with more mysql networking experience
could figure out how to share them between systems properly. I thought
there would be one master system with two slaves, but instead they all
seemed to be consulted interactively for every query or update.

For the InnoDB/MyISAM issue, if I'm understanding it correctly, I just
edited the sql file I used to create the database, and I'm using InnoDB now
without any issues on v3.3.2.

I believe I used these instructions, with the sql modifications from above:

http://www200.pair.com/mecham/spam/debian-spamassassin-sql.html

Regards,
Alex
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new [ In reply to ]
On 4/18/2013 12:18 PM, Ben Johnson wrote:
>
> My concern now is that I am on 3.3.1, with little control over upgrades.
> I have read all three bug reports in their entirety and Bug 6624 seems
> to be a very legitimate concern.
>
> [...]
>
> How can I discern whether or not this bug does, in fact, affect me? Are
> my Bayes results being crippled as a result of this bug?
>
> [...]
>
> In any event, I'm a little concerned because while the majority of
> messages are now tagged with BAYES_* hits, I am now seeing this debug
> output on a significant percentage of messages:
>
> dbg: bayes: cannot use bayes on this message; not enough usable tokens found
> dbg: bayes: not scoring message, returning undef
>
> [...]
>
> I have done some searching around on the string "cannot use bayes on
> this message; not enough usable tokens found" and have not found
> anything authoritative regarding what this message might mean and
> whether or not it can be ignored or if it is symptomatic of a larger
> Bayes problem.
>
> Thank you,
>
> -Ben
>

Might anyone be in a position to offer an authoritative response to
these questions?

I continue to see messages that are very similar to dozens of messages
that have been marked as SPAM slipping through with *no Bayes scoring*
(this is *after* fixing the SQL syntax error issue):

bayes: cannot use bayes on this message; not enough usable tokens found
bayes: not scoring message, returning undef

Is this normal? If so, what is the explanation for this behavior? I have
marked dozens of nearly-identical messages with the subject "Garden hose
expands up to three times its length" as SPAM (over the course of
several weeks), and yet SA reports "not enough usable tokens found".

Is SA referring to the number of tokens in the message? Or the Bayes DB?

Thanks,

-Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new [ In reply to ]
Hi,

> Might anyone be in a position to offer an authoritative response to

> these questions?
>
> I continue to see messages that are very similar to dozens of messages
> that have been marked as SPAM slipping through with *no Bayes scoring*
> (this is *after* fixing the SQL syntax error issue):
>
> bayes: cannot use bayes on this message; not enough usable tokens found
> bayes: not scoring message, returning undef
>

Have you tried to find out how many tokens are in your bayes DB? As the
user specified by bayes_sql_username (actually, it probably doesn't
matter, but do so to be sure), run the following:

# sa-learn --dump magic
0.000          0          3          0  non-token data: bayes db version
0.000          0     466417          0  non-token data: nspam
0.000          0     508868          0  non-token data: nham
0.000          0   10788203          0  non-token data: ntokens
0.000          0 1320901921          0  non-token data: oldest atime
0.000          0 1366385643          0  non-token data: newest atime
0.000          0          0          0  non-token data: last journal sync atime
0.000          0 1366348380          0  non-token data: last expiry atime
0.000          0   28651364          0  non-token data: last expire atime delta
0.000          0          0          0  non-token data: last expire reduction count

This shows the number of spam (nspam) and ham (nham) messages learned,
and the total number of tokens (ntokens) in the db.

> Is this normal? If so, what is the explanation for this behavior? I have

> marked dozens of nearly-identical messages with the subject "Garden hose
> expands up to three times its length" as SPAM (over the course of
> several weeks) as SPAM, and yet SA reports "not enough usable tokens
> found".
>

If they are identical, I don't believe it will create new tokens, per se.


> Is SA referring to the number of tokens in the message? Or the Bayes DB?
>

I believe it would be talking about the database, not the message.

Regards,
Alex
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new [ In reply to ]
Hi,

>> Is this normal? If so, what is the explanation for this behavior? I
>> have marked dozens of nearly-identical messages with the subject
>> "Garden hose expands up to three times its length" as SPAM (over the
>> course of several weeks), and yet SA reports "not enough usable
>> tokens found".
>>
>
> If they are identical, I don't believe it will create new tokens, per se.
>
>
>
>> Is SA referring to the number of tokens in the message? Or the Bayes DB?
>>
>
I should also mention that while training a message, you can use
"--progress", like so (assuming you're running it on an mbox, or a
message that's in mbox format):

# sa-learn --progress --spam --mbox mymboxfile

It will show you how many tokens have been learned during that run. It
might also be a good idea to add the token summary flag to your config:

add_header all Tok-Stat _TOKENSUMMARY_

If you run spamassassin on a message directly, and add the -t option, it
will show you the number of different types of tokens found in the message:

X-Spam-Tok-Stat: Tokens: new, 0; hammy, 6; neutral, 84; spammy, 36.

Regards,
Alex
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new [ In reply to ]
On 4/19/2013 11:42 AM, Alex wrote:
> Hi,
>
>> Is this normal? If so, what is the explanation for this behavior? I
>> have marked dozens of nearly-identical messages with the subject
>> "Garden hose expands up to three times its length" as SPAM (over the
>> course of several weeks), and yet SA reports "not enough usable
>> tokens found".
>
> If they are identical, I don't believe it will create new tokens,
> per se.
>
>> Is SA referring to the number of tokens in the message? Or the
>> Bayes DB?
>
>
> I should also mention that while training a message, you can use
> "--progress", like so (assuming you're running it on an mbox, or a
> message that's in mbox format):
>
> # sa-learn --progress --spam --mbox mymboxfile
>
> It will show you how many tokens have been learned during that run. It
> might also be a good idea to add the token summary flag to your config:
>
> add_header all Tok-Stat _TOKENSUMMARY_
>
> If you run spamassassin on a message directly, and add the -t option, it
> will show you the number of different types of tokens found in the message:
>
> X-Spam-Tok-Stat: Tokens: new, 0; hammy, 6; neutral, 84; spammy, 36.
>
> Regards,
> Alex
>

Alex, thanks very much for the quick reply. I really appreciate it.

One can see from the output in my previous message (two messages back)
that the user is amavis (correct for my system) and the corpus size, as
well as the token count:

dbg: bayes: corpus size: nspam = 6155, nham = 2342
dbg: bayes: tok_get_all: token count: 176
dbg: bayes: cannot use bayes on this message; not enough usable tokens found
bayes: not scoring message, returning undef

Now that I look at this output again, the "token count: 176" stands out.
That seems like a pretty low value. Is this the token count for the
entire Bayes DB??? Or only the tokens that apply to the particular
message being fed to SA?

The "garden hose" messages are probably not *identical*, but they are
very similar, so it seems that each variant should have tokens to offer.

The concern I expressed around bug 6624 relates to Mark's comment, which
implies that while SA will not insert a token twice, it normally *will*
increase the token's "count" -- and that under this bug the count cannot
increase. Here's an excerpt from Mark's comment on that bug report:

"The effect of the bug with SpamAssassin is that tokens are only able
to be inserted once, but their counts cannot increase, leading to
terrible bayes results if the bug is not noticed. Also the conversion
from db fails, as reported by Dave."

Is it possible that training similar messages as SPAM is not having the
intended effect due to this bug in my version of SA?

My "bayes_vars" table looks like this (sorry for the wrapping, this is
the best I could do):

id username spam_count ham_count token_count last_expire
last_atime_delta last_expire_reduce oldest_token_age newest_token_age
1 amavis 6185 2427 120092 1366364379 8380417
14747 1357985848 1366386865

The SQL query:

SELECT COUNT(*) FROM `bayes_token`;

returns 120092, so the above value is accurate (that is, the
"token_count" value in the `bayes_vars` table matches the actual row
count in the `bayes_token` table).

Also, thanks for the other tips regarding the "token summary flag"
directive and the -t switch. I was actually using the -t switch to
produce the output that I pasted two messages back. So, it seems that
the "X-Spam-Tok-Stat" output is added only when the token count is high
enough to be useful.

Still stumped here...

-Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new [ In reply to ]
On 04/19/2013 06:02 PM, Ben Johnson wrote:

> Still stumped here...

do a bayes backup: sa-learn --backup

switch to a file-based DB in SDBM format (which is fast)

do a

sa-learn --restore

feed it a few thousand NEW spams

see what happens
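
roughly like this (local.cf changes shown as comments; the bayes_path is
a placeholder):

----------------------------------------------------------------------
# 1. dump the SQL-backed DB
sa-learn --backup > bayes-backup

# 2. in local.cf, comment out the bayes_sql_* lines and use:
#      bayes_store_module Mail::SpamAssassin::BayesStore::SDBM
#      bayes_path         /var/lib/amavis/.spamassassin/bayes

# 3. load the dump into the file-based DB
sa-learn --restore bayes-backup
----------------------------------------------------------------------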
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new [ In reply to ]
John Hardin skrev den 2013-04-18 04:15:

>> ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;

unicode is overkill, since bayes tokens are just ascii

using unicode creates a bigger db, which will slow things down more
than ascii

> Please check the SpamAssassin bugzilla to see if this situation is
> already mentioned, and if not, add a bug. This seems pretty critical.

i don't know how bayes in 3.4.x is nowadays, it's long since i have seen
the source for it, but i made some changes to bayes mysql so it can be
cleaned up with timed expiry of data; this is probably lost in the
transition to 3.4.x :(

> It's possible that there's a good reason the default script still
> uses myISAM. If so, the documentation for this fix should at least be
> easier to find.

was it documented?

--
senders that put my email into body content will deliver it to my own
trashcan, so if you like to get reply, dont do it
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new [ In reply to ]
Ben Johnson skrev den 2013-04-19 18:02:

> Still stumped here...

for amavisd-new, putting the spamassassin sql setup into the user_prefs
file for the user amavisd-new runs as might work better than having
insecure sql settings in /etc/mail/spamassassin :)

i don't know if the issue is really that you have one user for amavisd,
and test spamassassin -t msg as another user that uses another sql
user?

make sure both users are really using the same sql user as intended

--
senders that put my email into body content will deliver it to my own
trashcan, so if you like to get reply, dont do it
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new [ In reply to ]
On 4/19/2013 12:12 PM, Axb wrote:
> On 04/19/2013 06:02 PM, Ben Johnson wrote:
>
>> Still stumped here...
>
> do a bayes backup: sa-learn --backup
>
> switch to a file-based DB in SDBM format (which is fast)
>
> do a
>
> sa-learn --restore
>
> feed it a few thousand NEW spams
>
> see what happens
>
>

Thanks for the suggestion, Axb. Your help and time are much appreciated.

By "feed it a few thousand NEW spams", do you mean to scrap the training
corpora that I've hand-sorted in favor of starting over? Or do you mean
to clear the database and re-run the training script against the corpora?

If your thinking is that the token data may be "stale", then I will
really be stumped. When I hand-classify 12 messages with a subject and
body about a retractable garden hose as spam, I expect the 13th message
about the same hose to score high on the Bayes tests. Is this an
unreasonable expectation?

I commented out all of the DB-related lines in my SA configuration file
(local.cf) and restarted amavis-new.

I also cleared the existing DB tokens (with "sa-learn --clear") after
amavis had restarted, and then executed my normal training script
against my ham and spam corpora.

I'll keep an eye on incoming messages to see if those that "slip
through" and score below 4.0 demonstrate evidence of Bayes testing.

I am beginning to wonder if some kind of "corruption", for lack of a
better term, had been introduced by using utf8 to store the token data
(Benny Pedersen mentioned that Unicode is overkill, and he is probably
right). Performance aside, could using utf8_bin (instead of ascii)
introduce a problem for SA (despite no errors during "sa-learn" training
or --restore commands)?
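
(For the record, here's how I went about checking which collations were
actually in effect on the token tables; the credentials and database
name are placeholders:)

----------------------------------------------------------------------
mysql -u BAYES_USER -p BAYES_DB -e 'SHOW FULL COLUMNS FROM bayes_token;'
mysql -u BAYES_USER -p BAYES_DB -e 'SHOW TABLE STATUS;'
----------------------------------------------------------------------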

The strange thing is that Bayes seems to work fine most of the time. But
as I've stated previously, almost all "obvious to a human" spam that
scores below 4.0 lacks evidence of Bayes testing.

Since switching back to a DBM Bayes setup, the results look pretty much
as expected (wrapped), and this is the type of thing I expect to see on
every message:

-----------------------------------------------------------
spamassassin -D -t < "/tmp/email.txt" 2>&1 | egrep '(bayes:|whitelist:|AWL)'
dbg: bayes: learner_new
self=Mail::SpamAssassin::Plugin::Bayes=HASH(0x37520f0),
bayes_store_module=Mail::SpamAssassin::BayesStore::DBM
dbg: bayes: learner_new: got
store=Mail::SpamAssassin::BayesStore::DBM=HASH(0x2c52558)
dbg: bayes: tie-ing to DB file R/O /var/lib/amavis/.spamassassin/bayes_toks
dbg: bayes: tie-ing to DB file R/O /var/lib/amavis/.spamassassin/bayes_seen
dbg: bayes: found bayes db version 3
dbg: bayes: DB journal sync: last sync: 0
dbg: bayes: DB journal sync: last sync: 0
dbg: bayes: corpus size: nspam = 6203, nham = 2479
dbg: bayes: score = 5.55111512312578e-17
dbg: bayes: DB journal sync: last sync: 0
dbg: bayes: untie-ing
dbg: timing: total 2925 ms - init: 907 (31.0%), parse: 1.92 (0.1%),
extract_message_metadata: 108 (3.7%), poll_dns_idle: 1040 (35.6%),
get_uri_detail_list: 1.22 (0.0%), tests_pri_-1000: 19 (0.7%),
compile_gen: 185 (6.3%), compile_eval: 19 (0.6%), tests_pri_-950: 5
(0.2%), tests_pri_-900: 5 (0.2%), tests_pri_-400: 32 (1.1%),
check_bayes: 26 (0.9%), tests_pri_0: 836 (28.6%), dkim_load_modules: 27
(0.9%), check_dkim_signature: 1.23 (0.0%), check_dkim_adsp: 24 (0.8%),
check_spf: 70 (2.4%), check_razor2: 202 (6.9%), check_pyzor: 135 (4.6%),
tests_pri_500: 988 (33.8%)
-----------------------------------------------------------

I'll wait and see if I receive messages without Bayes results and report
back.

Even if using DBM "works", I don't see this as a long-term solution --
only as a troubleshooting step. I would really like to keep my Bayes
data in a MySQL or PostgreSQL database.

Thanks again for the help!

-Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new [ In reply to ]
On 4/19/2013 1:54 PM, Benny Pedersen wrote:
> Ben Johnson wrote on 2013-04-19 18:02:
>
>> Still stumped here...
>
> For amavisd-new, putting the SpamAssassin SQL setup into the user_prefs
> file for the user amavisd-new runs as might work better than having
> insecure SQL settings in /etc/mail/spamassassin :)
>
> Could it be that amavisd runs as one user, while you test with
> "spamassassin -t msg" as another user that uses a different SQL user?
>
> Make sure both users are really using the same SQL user, as intended.
>

Benny, thanks for the suggestion regarding moving the SA SQL setup into
user_prefs. I will look into that soon.

Yes, I believe that both I and the system always execute SA commands as
the "amavis" user. When I was using the SQL setup, I had the following
in local.cf:

bayes_path /var/lib/amavis/.spamassassin/bayes

With the DBM setup, I had the following (I have since commented it out,
while attempting to debug this Bayes issue):

bayes_sql_override_username amavis

Is something more required to ensure that my mail system, which runs
under the "amavis" user, is always reading from and writing to the same DB?

Best regards,

-Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new [ In reply to ]
Apologies for the rapid-fire here, folks, but I wanted to correct something.

I had these backwards:

>> Yes, I believe that both I and the system always execute SA commands as
>> the "amavis" user. When I was using the SQL setup, I had the following
>> in local.cf:
>>
>> bayes_path /var/lib/amavis/.spamassassin/bayes
>>
>> With the DBM setup, I had the following (I have since commented it out,
>> while attempting to debug this Bayes issue):
>>
>> bayes_sql_override_username amavis

I meant to say that I have *always* had

bayes_path /var/lib/amavis/.spamassassin/bayes

in local.cf, and that when using the SQL setup, I added

bayes_sql_override_username amavis

Sorry for the confusion!

-Ben

Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new [ In reply to ]
So, the problem seems not to be SQL-specific, as it occurs with both
the SQL and the flat-file DB back-ends.

Upon following Benny Pedersen's advice (to move SA configuration
directives from /etc/spamassassin/local.cf to
/var/lib/amavis/.spamassassin/user_prefs), I noticed something unusual:

$ ls -lah /var/lib/amavis/.spamassassin/
total 7.6M
drwx------ 2 amavis amavis 4.0K Apr 20 08:54 .
drwxr-xr-x 7 amavis amavis 4.0K Apr 20 08:56 ..
-rw------- 1 root root 8.0K Apr 20 08:33 bayes_journal
-rw------- 1 root root 1.3M Apr 20 00:09 bayes_seen
-rw------- 1 root root 4.8M Apr 20 08:29 bayes_toks
-rw-r--r-- 1 root root 799 Jun 28 2004 gtube.txt
-rw-r--r-- 1 amavis amavis 2.7K Apr 20 08:55 user_prefs

Welp, that'll do it! How those four files were set to root:root
ownership is beyond me, but that was certainly a factor. Maybe this was
a result of executing my training script as root (even though I had
hard-coded the bayes_path to use /var/lib/amavis/.spamassassin/bayes,
and when using SQL, hard-coded bayes_sql_override_username to use amavis)?

I changed ownership to amavis:amavis and now messages are being scored
with Bayes (all of them, from what I can tell so far).
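
For the record, the fix amounted to something like:

----------------------------------------------------------------------
chown -R amavis:amavis /var/lib/amavis/.spamassassin
----------------------------------------------------------------------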

Also, I looked into the fact that I was running the cron job that trains
ham and spam as root. I did this only because the amavis user lacks
access to /var/vmail, which is where mail is stored on this system. (As
an aside, I'm a bit curious as to how amavis is able to scan incoming
mail, given this lack of access -- maybe it does so using a pipe or some
other method that does not require access to /var/vmail.)

I think the disconnect was in the fact that I placed my custom
configuration directives in /etc/spamassassin/local.cf, when I should
have placed them in /var/lib/amavis/.spamassassin/user_prefs (for
message scoring) *and* /root/.spamassassin/user_prefs (for ham/spam
training). (Thanks for pointing out this mistake, Benny P.)

Putting my custom SA configuration directives in both of these files was
the only way I was able to train mail and score incoming messages using
the same credentials across the board.

Once I did this, I was able to use SQL or flat-file DB with the same
exact results.

Is there a better way to achieve this consistency, aside from putting
duplicate content into /var/lib/amavis/.spamassassin/user_prefs and
/root/.spamassassin/user_prefs?
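
(If nothing cleaner exists, perhaps a symlink would at least keep the
two files from drifting apart -- an untested idea on my part:)

----------------------------------------------------------------------
# untested: share one user_prefs between the two locations
mkdir -p /root/.spamassassin
ln -s /var/lib/amavis/.spamassassin/user_prefs \
      /root/.spamassassin/user_prefs
----------------------------------------------------------------------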

Feels like I'm out of the woods here! Thanks for all the expert help, guys.

-Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new [ In reply to ]
Ben Johnson wrote on 2013-04-20 04:40:

> By "feed it a few thousand NEW spams", do you mean to scrap the
> training
> corpora that I've hand-sorted in favor of starting over? Or do you
> mean
> to clear the database and re-run the training script against the
> corpora?

# build a list of the spam messages, then learn them from inside the dir
ls /path/to/maildir/spam >/tmp/spam
cd /path/to/maildir/spam
sa-learn --spam --progress -f /tmp/spam

# same again for the ham corpus
ls /path/to/maildir/ham >/tmp/ham
cd /path/to/maildir/ham
sa-learn --ham --progress -f /tmp/ham

Do this for each Bayes user; depending on how your setup looks, this
should basically be all it takes to get Bayes back on track.
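
If more than one Bayes user is involved with an SQL back-end, sa-learn's
--username switch should select which per-user database to train (if I
recall the switch correctly; "amavis" here is just an example):

----------------------------------------------------------------------
sa-learn --spam --progress --username amavis -f /tmp/spam
----------------------------------------------------------------------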

--
Senders that put my email address into the body content will deliver it
to my own trashcan, so if you'd like to get a reply, don't do it.
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new [ In reply to ]
Ben Johnson wrote on 2013-04-20 05:02:

> Yes, I believe that both I and the system always execute SA commands as
> the "amavis" user. When I was using the SQL setup, I had the following
> in local.cf:
>
> bayes_path /var/lib/amavis/.spamassassin/bayes

Does the amavis user have its home directory in /var/lib/?

On Gentoo, the default is /var/amavis, where the .spamassassin dir is
created by amavisd.

Using user_prefs to set bayes_path does not make sense if SQL is used.

> With the DBM setup, I had the following (I have since commented it out,
> while attempting to debug this Bayes issue):
>
> bayes_sql_override_username amavis

+1 to this one, since amavis can't use multiple SA users very easily --
though, depending on the amavis version, complicated setups may be
supported :(

I moved away from amavisd to clamav-milter plus spampd, running
after-queue in Postfix; this is working very well for me, and I hope SA
3.4.x will not break spampd :=)

> Is something more required to ensure that my mail system, which runs
> under the "amavis" user, is always reading from and writing to the
> same DB?

Nope; just remember that amavis also reads .spamassassin/user_prefs.

If you like, you can copy that user_prefs to /root/.spamassassin so you
don't have to remember anything :)

user_prefs should ONLY be readable by the user that runs it.

--
Senders that put my email address into the body content will deliver it
to my own trashcan, so if you'd like to get a reply, don't do it.
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new [ In reply to ]
Ben Johnson wrote on 2013-04-20 19:01:

> Welp, that'll do it! How those four files were set to root:root
> ownership is beyond me,

That means root has been doing some testing :)

Afterwards, amavisd can't write to the files; you should switch to the
amavis user before testing:

su amavis -c 'cmd foo'

> but that was certainly a factor. Maybe this was
> a result of executing my training script as root

Yep; relearn scripts should be run from cron as the amavisd user, so
the files stay owned by amavis. If run as root, they would change to be
owned by root.

Change the setup so the cron job is started by amavis; then it works.
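
For example (a sketch; the training-script path is hypothetical):

----------------------------------------------------------------------
# edit the amavis user's crontab instead of root's
crontab -u amavis -e

# e.g. run the (hypothetical) training script nightly at 03:00
0 3 * * * /usr/local/bin/sa-train.sh
----------------------------------------------------------------------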

> (even though I had
> hard-coded the bayes_path to use /var/lib/amavis/.spamassassin/bayes,
> and when using SQL, hard-coded bayes_sql_override_username to use
> amavis)?

Do you want SQL Bayes? You are still using a DBM-based Bayes setup;
note that SQL Bayes does not use bayes_path.

> I changed ownership to amavis:amavis and now messages are being
> scored
> with Bayes (all of them, from what I can tell so far).
>
> Also, I looked into the fact that I was running the cron job that trains
> ham and spam as root. I did this only because the amavis user lacks
> access to /var/vmail, which is where mail is stored on this system. (As
> an aside, I'm a bit curious as to how amavis is able to scan incoming
> mail, given this lack of access -- maybe it does so using a pipe or some
> other method that does not require access to /var/vmail.)

If you use SQL-based Bayes, then you can change the learning script to
run as the vmail user; problem solved. Remember that vmail should then
have its own user_prefs :)

> I think the disconnect was in the fact that I placed my custom
> configuration directives in /etc/spamassassin/local.cf, when I should
> have placed them in /var/lib/amavis/.spamassassin/user_prefs (for
> message scoring) *and* /root/.spamassassin/user_prefs (for ham/spam
> training). (Thanks for pointing out this mistake, Benny P.)

Yep, this is a common error. I also remember that pyzord was, in the
latest ebuild, set up to run as root -- but hey, it uses a UDP port
above 1023 :)

> Putting my custom SA configuration directives in both of these files was
> the only way I was able to train mail and score incoming messages using
> the same credentials across the board.

Would it be possible for you to use dovecot-antispam? It calls sa-learn
per message, as the user that owns vmail. I dropped it, though, since I
still haven't upgraded to Dovecot 2.x yet.

> Once I did this, I was able to use SQL or flat-file DB with the same
> exact results.

Progress?

> Is there a better way to achieve this consistency, aside from putting
> duplicate content into /var/lib/amavis/.spamassassin/user_prefs and
> /root/.spamassassin/user_prefs?

Nope, this is the perfect way -- also security-wise.

> Feels like I'm out of the woods here! Thanks for all the expert help,
> guys.

+1

--
Senders that put my email address into the body content will deliver it
to my own trashcan, so if you'd like to get a reply, don't do it.
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new [ In reply to ]
On 4/20/2013 3:20 PM, Benny Pedersen wrote:
> Ben Johnson wrote on 2013-04-20 05:02:
>
>> Yes, I believe that both I and the system always execute SA commands as
>> the "amavis" user. When I was using the SQL setup, I had the following
>> in local.cf:
>>
>> bayes_path /var/lib/amavis/.spamassassin/bayes
>
> Does the amavis user have its home directory in /var/lib/?

The amavis user's home directory is /var/lib/amavis. This seems to be
the default on Ubuntu; I didn't change this path.

> On Gentoo, the default is /var/amavis, where the .spamassassin dir is
> created by amavisd.
>
> Using user_prefs to set bayes_path does not make sense if SQL is used.
>

Thanks; I did comment out the "bayes_path" directive. I figured that it
wouldn't matter whether it is commented or not in the presence of the
SQL-related directives, but it can't hurt to comment out this line.

>> With the DBM setup, I had the following (I have since commented it out,
>> while attempting to debug this Bayes issue):
>>
>> bayes_sql_override_username amavis
>
> +1 to this one, since amavis can't use multiple SA users very easily --
> though, depending on the amavis version, complicated setups may be
> supported :(

I only need one Bayes user, so this setup is adequate.

> I moved away from amavisd to clamav-milter plus spampd, running
> after-queue in Postfix; this is working very well for me, and I hope SA
> 3.4.x will not break spampd :=)

Hmm, I will consider your sound advice in this regard. amavis does seem
to be overly memory-hungry (despite setting $max_servers = 1 and
$max_requests = 1). If there is a better alternative, I'm all ears (or
eyes, as the case may be).

>> Is something more required to ensure that my mail system, which runs
>> under the "amavis" user, is always reading from and writing to the
>> same DB?
>
> Nope; just remember that amavis also reads .spamassassin/user_prefs.
>
> If you like, you can copy that user_prefs to /root/.spamassassin so you
> don't have to remember anything :)
>
> user_prefs should ONLY be readable by the user that runs it.
>

Thanks for pointing this out. I will double-check the permissions.

I'll respond to your other email momentarily.

Thanks, Benny!

-Ben