Mailing List Archive

Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 1/16/2013 2:22 PM, Bowie Bailey wrote:
> On 1/16/2013 1:18 PM, Ben Johnson wrote:
>>
>> On 1/16/2013 11:00 AM, John Hardin wrote:
>>> On Wed, 16 Jan 2013, Ben Johnson wrote:
>>>
>>>> Is it possible that the training I've been doing over the last week or
>>>> so wasn't *effective* until recently, say, after restarting some
>>>> component of the mail stack? My understanding is that calling SA via
>>>> Amavis, which does not need/use the spamd daemon, forces all Bayes data
>>>> to be up-to-date on each call to spamassassin.
>>> That shouldn't be the case. SA and sa-learn both use a shared-access
>>> database; if you're training the database that SA is learning, the
>>> results of training should be effective immediately.
>>>
>> Okay, good. Bowie's response to this question differed (he suggested
>> that Amavis would need to be restarted for Bayes to be updated), but I'm
>> pretty sure that restarting Amavis is not necessary. It seems unlikely
>> that Amavis would copy the entire Bayes DB (which is stored in MySQL on
>> this server) into memory every time that the Amavis service is started.
>> To do so seems self-defeating: more RAM usage, worse performance, etc.
>
> Actually, I was making a general observation.
>
> For cases where you would normally need to restart spamd, you will need
> to restart amavis. This includes things like rule and configuration
> changes.
>
> Bayes data is read dynamically from your MySQL database and thus does
> not require a restart of amavis/spamd when updated.
>

My apologies, Bowie. I misinterpreted your response. Thank you very much
for the follow-up and for the clear explanation.

Best regards,

-Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
So, I've been keeping an eye on things again today.

Overall, things look pretty good, and most spam is being blocked
outright at the MTA and scored appropriately in SA if not.

I've been inspecting the X-Spam-Status headers for the handful of
messages that do slip through and noticed that most of them lack any
evidence of the BAYES_* tests. Here's one such header:

No, score=3.115 tagged_above=-999 required=4.5 tests=[HK_NAME_FREE=1,
HTML_MESSAGE=0.001, MIME_HTML_ONLY=0.723, PYZOR_CHECK=1.392,
SPF_PASS=-0.001] autolearn=disabled

The messages that were delivered just before and after this one do have
evidence of BAYES_* tests, so, it's not as though something is
completely broken.
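
(For bulk checking, here's a rough way to list delivered messages with no
BAYES_* hit -- a sketch, assuming a Maildir layout and procmail's formail
to unfold the folded header; the path is illustrative:)

-----------------------------------------------------
#!/bin/sh

# List messages whose X-Spam-Status header mentions no BAYES_* test.
# formail -c unfolds continuation lines first, since long X-Spam-Status
# headers are wrapped across several lines.
for msg in /var/vmail/example.com/user/Maildir/cur/*; do
    formail -c -X 'X-Spam-Status:' < "$msg" | grep -q 'BAYES_' || echo "$msg"
done
-----------------------------------------------------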

Are there any normal circumstances under which Bayes tests are not run?
Do I need to turn debugging back on and wait until this happens again?

Thanks for all the help, everyone!

-Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 1/15/2013 5:22 PM, John Hardin wrote:
>>>> Yes, users are allowed to train Bayes, via Dovecot's Antispam plug-in.
>>>> They do so unsupervised. Why this could be a problem is obvious. And
>>>> no,
>>>> I don't retain their submissions. I probably should. I wonder if I can
>>>> make a few slight modifications to the shell script that Antispam
>>>> calls,
>>>> such that it simply sends a copy of the message to an administrator
>>>> rather than calling sa-learn on the message.
>>>
>>> That would be a very good idea if the number of users doing training is
>>> small. At the very least, the messages should be captured to a permanent
>>> corpus mailbox.
>>
>> Good idea! I'll see if I can set this up.

So, I finally got around to tackling this change.

With a couple of simple modifications, I was able to achieve the desired
result with the Dovecot Antispam plug-in.

In dovecot.conf:

-----------------------------------------------------
plugin {
# [...]

# For Dovecot < 2.0.
antispam_spam_pattern_ignorecase = SPAM;JUNK
antispam_mail_tmpdir = /tmp
antispam_mail_sendmail = /usr/bin/sa-learn-pipe.sh
antispam_mail_spam = proposed-spam@example.com
antispam_mail_notspam = proposed-ham@example.com
}
-----------------------------------------------------

Basically, I changed the last two directive values from the switches
that are normally passed to the "sa-learn" binary (--spam and --ham) to
destination email addresses that are passed to "sendmail" in my revised
pipe script.

Here is the full pipe script, /usr/bin/sa-learn-pipe.sh; the original
sa-learn commands are commented out with two pound signs (##):

-----------------------------------------------------
#!/bin/sh

# Add "starting now" string to log.
echo "$$-start ($*)" >> /tmp/sa-learn-pipe.log

# Copy the message contents to a temporary text file.
cat<&0 >> /tmp/sendmail-msg-$$.txt

CURRENT_USER=$(whoami)

##echo "Calling (as user $CURRENT_USER) '/usr/bin/sa-learn $*
/tmp/sendmail-msg-$$.txt'" >> /tmp/sa-learn-pipe.log
echo "Calling (as user $CURRENT_USER) 'sendmail $* <
/tmp/sendmail-msg-$$.txt'" >> /tmp/sa-learn-pipe.log

# Execute sa-learn, with the passed ham/spam argument, and the temporary
message contents.
# Send the output to the log file while redirecting stderr to stdout (so
we capture debug output).
##/usr/bin/sa-learn $* /tmp/sendmail-msg-$$.txt >>
/tmp/sa-learn-pipe.log 2>&1
sendmail $* < /tmp/sendmail-msg-$$.txt >> /tmp/sa-learn-pipe.log 2>&1

# Remove the temporary message.
rm -f /tmp/sendmail-msg-$$.txt

# Add "ending now" string to log.
echo "$$-end" >> /tmp/sa-learn-pipe.log

# Exit with "success" status code.
exit 0
-----------------------------------------------------

It seems as though creating a temporary copy of the message is not
strictly necessary, as the message contents could be passed to the
"sendmail" command via standard input (stdin), but creating the copy
could be useful in debugging.
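
(For reference, the script could be trimmed to roughly this if the
debugging copy is not wanted -- a sketch:)

-----------------------------------------------------
#!/bin/sh

# Minimal variant: relay the message from stdin straight to sendmail.
# No temp file, so nothing is left behind for debugging.
echo "$$-start ($*)" >> /tmp/sa-learn-pipe.log
sendmail "$@" >> /tmp/sa-learn-pipe.log 2>&1
echo "$$-end" >> /tmp/sa-learn-pipe.log
exit 0
-----------------------------------------------------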

>>> Do your users also train ham? Are the procedures similar enough that
>>> your users could become easily confused?
>>
>> They do. The procedure is implemented via Dovecot's Antispam plug-in.
>> Basically, moving mail from Inbox to Junk trains it as spam, and moving
>> mail from Junk to Inbox trains it as ham. I really like this setup
>> (Antispam + calling SA through Amavis [i.e. not using spamd]) because
>> the results are effective immediately, which seems to be crucial for
>> combating this snowshoe spam (performance and scalability aside).
>>
>> I don't find that procedure to be confusing, but people are different, I
>> suppose.
>
> Hm. One thing I would watch out for in that environment is people who
> have intentionally subscribed to some sort of mailing list deciding they
> don't want to receive it any longer and just junking the messages rather
> than unsubscribing.

The steps I've taken above will allow me to review submissions and
educate users who engage in this practice. Thanks again for elucidating
this scenario.

I hope that this approach to user-based SpamAssassin training is useful
to others.

Best regards,

-Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On Thu, 31 Jan 2013, Ben Johnson wrote:

> On 1/15/2013 5:22 PM, John Hardin wrote:
>>>>> Yes, users are allowed to train Bayes, via Dovecot's Antispam
>>>>> plug-in. They do so unsupervised. Why this could be a problem is
>>>>> obvious. And no, I don't retain their submissions. I probably
>>>>> should. I wonder if I can make a few slight modifications to the
>>>>> shell script that Antispam calls, such that it simply sends a copy
>>>>> of the message to an administrator rather than calling sa-learn on
>>>>> the message.
>>>>
>>>> That would be a very good idea if the number of users doing training is
>>>> small. At the very least, the messages should be captured to a permanent
>>>> corpus mailbox.
>>>
>>> Good idea! I'll see if I can set this up.
>
> So, I finally got around to tackling this change.
>
> With a couple of simple modifications, I was able to achieve the desired
> result with the Dovecot Antispam plug-in.
>
> Basically, I changed the last two directive values from the switches
> that are normally passed to the "sa-learn" binary (--spam and --ham) to
> destination email addresses that are passed to "sendmail" in my revised
> pipe script.

Passing the messages through sendmail again isn't optimal as that will
make further changes to the headers. This may have effects on the quality
of the learning, unless the original message is attached as an RFC-822
attachment to the message being sent to the corpus mailbox, which of
course means you then can't just run sa-learn directly against that
mailbox - the review process would involve moving the attachment as a
standalone message to the spam or ham learning mailbox.

Ideally you want to just move the messages between mailboxes without
involving another delivery pass. I don't know enough about Dovecot
or your topology to say whether that's going to be as easy as using
sendmail to mail the message to you.

--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhardin@impsec.org FALaholic #11174 pgpk -a jhardin@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
If guns kill people, then...
-- pencils miss spel words.
-- cars make people drive drunk.
-- spoons make people fat.
-----------------------------------------------------------------------
Tomorrow: the 10th anniversary of the loss of STS-107 Columbia
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On Thu, 31 Jan 2013 12:12:15 -0800 (PST)
John Hardin wrote:

> On Thu, 31 Jan 2013, Ben Johnson wrote:
>

> > So, I finally got around to tackling this change.
> >
> > With a couple of simple modifications, I was able to achieve the
> > desired result with the Dovecot Antispam plug-in.
> >
> > Basically, I changed the last two directive values from the switches
> > that are normally passed to the "sa-learn" binary (--spam and
> > --ham) to destination email addresses that are passed to "sendmail"
> > in my revised pipe script.
>
> Passing the messages through sendmail again isn't optimal as that
> will make further changes to the headers. This may have effects on
> the quality of the learning, unless the original message is attached
> as an RFC-822 attachment to the message being sent to the corpus
> mailbox, which of course means you then can't just run sa-learn
> directly against that mailbox - the review process would involve
> moving the attachment as a standalone message to the spam or ham
> learning mailbox.
>
> Ideally you want to just move the messages between mailboxes without
> involving another delivery processing. I don't know enough about
> Dovecot or your topology to say whether that's going to be as easy as
> using sendmail to mail the message to you.

Actually that's the way that the dovecot plugin works. I think that the
sendmail option is mainly a way to get training done on a remote
machine - it's a standard feature of DSPAM for which the plugin was
originally developed.

When I looked at the plugin it seemed to have quite a serious flaw.
IIRC it disables IMAP APPENDs on the Spam folder which makes it
incompatible with synchronisation tools like OfflineImap and probably
some IMAP clients that implement offline support in the same way.
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 1/31/2013 5:50 PM, RW wrote:
> On Thu, 31 Jan 2013 12:12:15 -0800 (PST)
> John Hardin wrote:
>
>> On Thu, 31 Jan 2013, Ben Johnson wrote:
>>
>
>>> So, I finally got around to tackling this change.
>>>
>>> With a couple of simple modifications, I was able to achieve the
>>> desired result with the Dovecot Antispam plug-in.
>>>
>>> Basically, I changed the last two directive values from the switches
>>> that are normally passed to the "sa-learn" binary (--spam and
>>> --ham) to destination email addresses that are passed to "sendmail"
>>> in my revised pipe script.
>>
>> Passing the messages through sendmail again isn't optimal as that
>> will make further changes to the headers. This may have effects on
>> the quality of the learning, unless the original message is attached
>> as an RFC-822 attachment to the message being sent to the corpus
>> mailbox, which of course means you then can't just run sa-learn
>> directly against that mailbox - the review process would involve
>> moving the attachment as a standalone message to the spam or ham
>> learning mailbox.
>>
>> Ideally you want to just move the messages between mailboxes without
>> involving another delivery processing. I don't know enough about
>> Dovecot or your topology to say whether that's going to be as easy as
>> using sendmail to mail the message to you.
>
> Actually that's the way that the dovecot plugin works. I think that the
> sendmail option is mainly a way to get training done on a remote
> machine - it's a standard feature of DSPAM for which the plugin was
> originally developed.
>
> When I looked at the plugin it seemed to have quite a serious flaw.
> IIRC it disables IMAP APPENDs on the Spam folder which makes it
> incompatible with synchronisation tools like OfflineImap and probably
> some IMAP clients that implement offline support in the same way.
>

John, thanks for pointing out the problems associated with re-sending
the messages via sendmail.

I threw a line out to the Dovecot users group and learned how to move
messages without going through the MTA. Dovecot has a utility
executable, "deliver", which is well-suited to the task.

For those who may have a similar need, here's the Dovecot Antispam pipe
script that I'm using, courtesy of Steffen Kaiser on the Dovecot Users
mailing list:

---------------------------------------
#!/bin/bash

# Map the --spam/--ham switch passed by the Antispam plugin to a
# training folder name.
mode=
for opt; do
    if test "x$opt" == "x--ham"; then
        mode=HAM
        break
    elif test "x$opt" == "x--spam"; then
        mode=SPAM
        break
    fi
done

# Deliver the message (on stdin) into the matching Training folder.
if test -n "$mode"; then
    # options from http://wiki1.dovecot.org/LDA
    /usr/lib/dovecot/deliver -d user@example.com -m Training.$mode
fi

exit 0
---------------------------------------


And here are the Antispam plug-in options:


---------------------------------------
# For Dovecot < 2.0.
antispam_spam_pattern_ignorecase = SPAM;JUNK
antispam_mail_tmpdir = /tmp
antispam_mail_sendmail = /usr/bin/sa-learn-pipe.sh
antispam_mail_spam = --spam
antispam_mail_notspam = --ham
---------------------------------------

RW, thank you for underscoring the issue with IMAP appends. It looks as
though a configuration directive exists to control this behavior:

# Whether to allow APPENDing to SPAM folders or not. Must be set to
# "yes" (case insensitive) to be activated. Before activating, please
# read the discussion below.
# antispam_allow_append_to_spam = no

Unfortunately, I don't fully understand the implications of enabling or
disabling this option. Here's the "discussion below" that is referenced
in the above comment:

---------------------------------------
ALLOWING APPENDS?

You should be careful with allowing APPENDs to SPAM folders. The reason
for possibly allowing it is to allow not-SPAM --> SPAM transitions to
work with offlineimap. However, because with APPEND the plugin cannot
know the source of the message, multiple bad scenarios can happen:

1. SPAM --> SPAM transitions cannot be recognised and are trained

2. the same holds for Trash --> SPAM transitions

Additionally, because we cannot recognise SPAM --> not-SPAM
transitions, training good messages will never work with APPEND.
---------------------------------------

In consideration of the first point, what is a "SPAM --> SPAM
transition"? Is that when the mailbox contains more than one "spam
folder", e.g., "JUNK" and "SPAM", and the user drags a message from one
to the other?

Regarding the second point, I'm not sure I understand the problem. If
someone drags a message from Trash to SPAM, shouldn't it be submitted
for learning as spam?

The last sentence sounds like somewhat of a deal-breaker. Doesn't my
whole strategy go somewhat limp if ham cannot be submitted for training?

John and RW, do you recommend enabling or disabling the append option,
given the way I'm reviewing the submissions and sorting them manually?

Sorry for all the questions! And thanks!

-Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On Fri, 1 Feb 2013, Ben Johnson wrote:

> John, thanks for pointing-out the problems associated with re-sending
> the messages via sendmail.
>
> I threw a line out to the Dovecot users group and learned how to move
> messages without going through the MTA. Dovecot has a utility
> executable, "deliver", which is well-suited to the task.
>
> For those who may have a similar need, here's the Dovecot Antispam pipe
> script that I'm using, courtesy of Steffen Kaiser on the Dovecot Users
> mailing list:
>
> ---------------------------------------
> #!/bin/bash
>
> mode=
> for opt; do
> if test "x$opt" == "x--ham"; then
> mode=HAM
> break
> elif test "x$opt" == "x--spam"; then
> mode=SPAM
> break
> fi
> done
>
> if test -n "$mode"; then
> # options from http://wiki1.dovecot.org/LDA
> /usr/lib/dovecot/deliver -d user@example.com -m Training.$mode
> fi
>
> exit 0
> ---------------------------------------

That seems a lot better.

> Regarding the second point, I'm not sure I understand the problem. If
> someone drags a message from Trash to SPAM, shouldn't it be submitted
> for learning as spam?
>
> The last sentence sounds like somewhat of a deal-breaker. Doesn't my
> whole strategy go somewhat limp if ham cannot be submitted for training?
>
> John and RW, do you recommend enabling or disabling the append option,
> given the way I'm reviewing the submissions and sorting them manually?

I think they're proceeding from the assumption of *un-reviewed* training,
i.e. blind trust in the reliability of the users.

If it's possible to enable IMAP Append on a per-folder basis then enabling
it only on your training inbox folders shouldn't be an issue - the
messages won't be trained until you've reviewed them.

Without that level of fine-grain control I still don't see an issue from
this if you can prevent the users from adding content directly to the
folders that sa-learn actually processes. If IMAP Append only applies to
"shared" folders then there shouldn't be a problem - configure sa-learn to
learn from folders in *your account*, that nobody else can access
directly.

--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhardin@impsec.org FALaholic #11174 pgpk -a jhardin@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
Gun Control laws aren't enacted to control guns, they are enacted
to control people: catholics (1500s), japanese peasants (1600s),
blacks (1860s), italian immigrants (1911), the irish (1920s),
jews (1930s), blacks (1960s), the poor (always)
-----------------------------------------------------------------------
Today: the 10th anniversary of the loss of STS-107 Columbia
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On Fri, 1 Feb 2013 09:00:48 -0800 (PST)
John Hardin wrote:

> On Fri, 1 Feb 2013, Ben Johnson wrote:
>
> > John, thanks for pointing-out the problems associated with
> > re-sending the messages via sendmail.
> >
> > I threw a line out to the Dovecot users group and learned how to
> > move messages without going through the MTA. Dovecot has a utility
> > executable, "deliver", which is well-suited to the task.
> >
> > For those who may have a similar need, here's the Dovecot Antispam
> > pipe script that I'm using, courtesy of Steffen Kaiser on the
> > Dovecot Users mailing list:
> >
> > ---------------------------------------
> > #!/bin/bash
> >
> > mode=
> > for opt; do
> > if test "x$opt" == "x--ham"; then
> > mode=HAM
> > break
> > elif test "x$opt" == "x--spam"; then
> > mode=SPAM
> > break
> > fi
> > done
> >
> > if test -n "$mode"; then
> > # options from http://wiki1.dovecot.org/LDA
> > /usr/lib/dovecot/deliver -d user@example.com -m Training.$mode
> > fi
> >
> > exit 0
> > ---------------------------------------
>
> That seems a lot better.
>
> > Regarding the second point, I'm not sure I understand the problem.
> > If someone drags a message from Trash to SPAM, shouldn't it be
> > submitted for learning as spam?
> >
> > The last sentence sounds like somewhat of a deal-breaker. Doesn't my
> > whole strategy go somewhat limp if ham cannot be submitted for
> > training?
> >
> > John and RW, do you recommend enabling or disabling the append
> > option, given the way I'm reviewing the submissions and sorting
> > them manually?
>
> I think they're proceeding from the assumption of *un-reviewed*
> training, i.e. blind trust in the reliability of the users.
>
> If it's possible to enable IMAP Append on a per-folder basis then
> enabling it only on your training inbox folders shouldn't be an issue
> - the messages won't be trained until you've reviewed them.
>
> Without that level of fine-grain control I still don't see an issue
> from this if you can prevent the users from adding content directly
> to the folders that sa-learn actually processes. If IMAP Append only
> applies to "shared" folders then there shouldn't be a problem -
> configure sa-learn to learn from folders in *your account*, that
> nobody else can access directly.

This is what it says:


antispam_allow_append_to_spam (boolean) Specifies whether to allow
appending mails to the spam folder from the unknown source. See the
ALLOWING APPENDS section below for the details on why it is not
advised to turn this option on. Optional, default = NO.

...

ALLOWING APPENDS
By appends we mean the case of mail moving when the source folder is
unknown, e.g. when you move from some other account or with tools
like offlineimap. You should be careful with allowing APPENDs to
SPAM folders. The reason for possibly allowing it is to allow
not-SPAM --> SPAM transitions to work and be trained. However,
because the plugin cannot know the source of the message (it is
assumed to be from OTHER folder), multiple bad scenarios can happen:

1. SPAM --> SPAM transitions cannot be recognised and are trained;
2. TRASH --> SPAM transitions cannot be recognised and are trained;
3. SPAM --> not-SPAM transitions cannot be recognised therefore
training good messages will never work with APPENDs.


I presume that the plugin works by monitoring COPY commands and so
can't work properly when a move is done by FETCH-APPEND-DELETE.

For sa-learn the problem would be 3, but I don't see how that is
affected by allowing appends on the spam folder.
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On Sat, 2 Feb 2013, RW wrote:

> ALLOWING APPENDS
> By appends we mean the case of mail moving when the source folder is
> unknown, e.g. when you move from some other account or with tools
> like offlineimap. You should be careful with allowing APPENDs to
> SPAM folders. The reason for possibly allowing it is to allow
> not-SPAM --> SPAM transitions to work and be trained. However,
> because the plugin cannot know the source of the message (it is
> assumed to be from OTHER folder), multiple bad scenarios can happen:
>
> 1. SPAM --> SPAM transitions cannot be recognised and are trained;
> 2. TRASH --> SPAM transitions cannot be recognised and are trained;
> 3. SPAM --> not-SPAM transitions cannot be recognised therefore
> training good messages will never work with APPENDs.
>
>
> I presume that the plugin works by monitoring COPY commands and so
> can't work properly when a move is done by FETCH-APPEND-DELETE.
>
> For sa-learn the problem would be 3, but I don't see how that is
> affected by allowing appends on the spam folder.

Yeah, all of that sounds like they're talking about non-vetted training
mailboxes where the users are effectively talking directly to sa-learn.

I think I may see at least part of what they are driving at.

If one user trains a message as ham and another user who got a copy of the
same message trains it as spam, who wins?

Absent some conflict-detection mechanism, the last mailbox trained (either
spam or ham) wins.

As for the other two:

spam -> spam transitions don't matter, sa-learn recognises message-IDs and
won't learn from the same message in the same corpus more than once (i.e.
having the same message in the spam corpus multiple times does not
"weight" the tokens learned from that message). So (1) may be a
performance concern but it won't affect the database.

trash -> spam transition being learned is a problem how?

The latter brings up another concern for the vetted-corpora model: if a
message is *removed* from a training corpus mailbox rather than
reclassified, you'd have to wipe and retrain your database from scratch to
remove that message's effects.

So, you need *three* vetted corpus mailboxes: spam, ham, and
should-not-have-been-trained (forget). Rather than deleting a message from
the ham or spam corpus mailbox, you move it to the forget mailbox, and in
the next training pass sa-learn forgets the message and removes it from the
forget mailbox. This would require some special scripting, because you can't
just "sa-learn --forget" a whole mailbox.

There would also need to be an audit process to detect whether the same
message_id is in both the ham and spam corpus mailboxes, so that the admin
can delete (NOT forget) the incorrect classification, or forget the
message if neither classification is reasonable.
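
A crude audit along those lines (a sketch, assuming Maildir corpora and
procmail's formail to unfold headers):

---------------------------------------
#!/bin/sh

# Print any Message-ID seen more than once across the ham and spam
# corpus folders (duplicates within a single folder also show up).
CORPUS=/var/vmail/example.com/admin/Maildir
for msg in "$CORPUS"/.Training.HAM/cur/* "$CORPUS"/.Training.SPAM/cur/*; do
    formail -c -x Message-ID: < "$msg"
done | sort | uniq -d
---------------------------------------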

--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhardin@impsec.org FALaholic #11174 pgpk -a jhardin@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
When designing software, any time you think to yourself "a user
would never be stupid enough to do *that*", you're wrong.
-----------------------------------------------------------------------
Today: the 10th anniversary of the loss of STS-107 Columbia
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 2/1/2013 12:00 PM, John Hardin wrote:
> On Fri, 1 Feb 2013, Ben Johnson wrote:
>
>> John, thanks for pointing-out the problems associated with re-sending
>> the messages via sendmail.
>>
>> I threw a line out to the Dovecot users group and learned how to move
>> messages without going through the MTA. Dovecot has a utility
>> executable, "deliver", which is well-suited to the task.
>>
>> For those who may have a similar need, here's the Dovecot Antispam pipe
>> script that I'm using, courtesy of Steffen Kaiser on the Dovecot Users
>> mailing list:
>>
>> ---------------------------------------
>> #!/bin/bash
>>
>> mode=
>> for opt; do
>> if test "x$opt" == "x--ham"; then
>> mode=HAM
>> break
>> elif test "x$opt" == "x--spam"; then
>> mode=SPAM
>> break
>> fi
>> done
>>
>> if test -n "$mode"; then
>> # options from http://wiki1.dovecot.org/LDA
>> /usr/lib/dovecot/deliver -d user@example.com -m Training.$mode
>> fi
>>
>> exit 0
>> ---------------------------------------
>
> That seems a lot better.
>
>> Regarding the second point, I'm not sure I understand the problem. If
>> someone drags a message from Trash to SPAM, shouldn't it be submitted
>> for learning as spam?
>>
>> The last sentence sounds like somewhat of a deal-breaker. Doesn't my
>> whole strategy go somewhat limp if ham cannot be submitted for training?
>>
>> John and RW, do you recommend enabling or disabling the append option,
>> given the way I'm reviewing the submissions and sorting them manually?
>
> I think they're proceeding from the assumption of *un-reviewed*
> training, i.e. blind trust in the reliability of the users.
>
> If it's possible to enable IMAP Append on a per-folder basis then
> enabling it only on your training inbox folders shouldn't be an issue -
> the messages won't be trained until you've reviewed them.
>
> Without that level of fine-grain control I still don't see an issue from
> this if you can prevent the users from adding content directly to the
> folders that sa-learn actually processes. If IMAP Append only applies to
> "shared" folders then there shouldn't be a problem - configure sa-learn
> to learn from folders in *your account*, that nobody else can access
> directly.
>

Thanks, John.

If I'm understanding you correctly, your assessment is that enabling
IMAP append in the Antispam plug-in configuration (not the default, by
the way) shouldn't cause problems for my Bayes training setup, primarily
because users cannot train Bayes unsupervised.

If that is so, what's the real benefit to enabling this "feature" that
is off by default? Users will be able to submit messages for training
while "offline" and when they reconnect the plug-in will be triggered
and the messages copied to the training mailbox?

-Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 2/1/2013 7:58 PM, John Hardin wrote:
> On Sat, 2 Feb 2013, RW wrote:
>
>> ALLOWING APPENDS
>> By appends we mean the case of mail moving when the source folder is
>> unknown, e.g. when you move from some other account or with tools
>> like offlineimap. You should be careful with allowing APPENDs to
>> SPAM folders. The reason for possibly allowing it is to allow
>> not-SPAM --> SPAM transitions to work and be trained. However,
>> because the plugin cannot know the source of the message (it is
>> assumed to be from OTHER folder), multiple bad scenarios can happen:
>>
>> 1. SPAM --> SPAM transitions cannot be recognised and are trained;
>> 2. TRASH --> SPAM transitions cannot be recognised and are trained;
>> 3. SPAM --> not-SPAM transitions cannot be recognised therefore
>> training good messages will never work with APPENDs.
>>
>>
>> I presume that the plugin works by monitoring COPY commands and so
>> can't work properly when a move is done by FETCH-APPEND-DELETE.
>>
>> For sa-learn the problem would be 3, but I don't see how that is
>> affected by allowing appends on the spam folder.
>
> Yeah, all of that sounds like they're talking about non-vetted training
> mailboxes where the users are effectively talking directly to sa-learn.
>
> I think I may see at least part of what they are driving at.
>
> If one user trains a message as ham and another user who got a copy of
> the same message trains it as spam, who wins?
>
> Absent some conflict-detection mechanism, the last mailbox trained
> (either spam or ham) wins.
>
> As for the other two:
>
> spam -> spam transitions don't matter, sa-learn recognises message-IDs
> and won't learn from the same message in the same corpus more than once
> (i.e. having the same message in the spam corpus multiple times does not
> "weight" the tokens learned from that message). So (1) may be a
> performance concern but it won't affect the database.
>
> trash -> spam transition being learned is a problem how?
>
> The latter brings up another concern for the vetted-corpora model: if a
> message is *removed* from a training corpus mailbox rather than
> reclassified, you'd have to wipe and retrain your database from scratch
> to remove that message's effects.
>
> So, you need *three* vetted corpus mailboxes: spam, ham, and
> should-not-have-been-trained (forget). Rather than deleting a message
> from the ham or spam corpus mailbox, you move it to the forget mailbox,
> and in the next training pass sa-learn forgets the message and removes
> it from the forget mailbox. This would require some special scripting,
> because you can't just "sa-learn --forget" a whole mailbox.
>
> There would also need to be an audit process to detect whether the same
> message_id is in both the ham and spam corpus mailboxes, so that the
> admin can delete (NOT forget) the incorrect classification, or forget
> the message if neither classification is reasonable.
>

You reveal some crucial information with regard to corpora management
here, John.

I've taken your good advice and created a third mailbox (well, a third
"folder" within the same mailbox), named "Forget".

It sounds as though the key here is never to delete messages from either
corpus -- unless the same message exists in both corpora, in which case
the misclassified message should be deleted. If neither classification
is reasonable and the message should instead be forgotten, what's the
order of operations? Should a copy of the message be created in the
"Forget" corpus and then the message deleted from both the "Ham" and
"Spam" corpora?

With regard to the specialized scripting required to "forget" messages,
this sounds cumbersome:

> because you can't just "sa-learn --forget" a whole mailbox.

Is there a non-obvious reason for this? Would the logic behind a
recursive --forget switch not be the same as, or similar to, that of the
existing --ham and --spam switches?

Finally, when a user submits a message to be classified as ham or spam,
how should I be sorting the messages? I see the following scenarios:

1.) I agree with the end-user's classification.

2.) I disagree with the end-user's classification.
a.) Because the message was submitted as ham but is really spam (or
vice versa)
b.) Because neither classification is reasonable

In case 1.), should I *copy* the message from the submission inbox's Ham
folder to the permanent Ham corpus folder? Or should I *move* the
message? I'm trying to discern whether or not there's value in retaining
end-user submissions *as they were classified upon submission*.

In case 2.), should I simply delete the message from the submission
folder? Or is there some reason to retain the message (i.e., move it
into an "Erroneous" folder within the submission mailbox)?

I did read http://wiki.apache.org/spamassassin/HandClassifiedCorpora, but
it doesn't address these issues specifically.

Thanks again!

-Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On Wed, 6 Feb 2013, Ben Johnson wrote:
> On 2/1/2013 7:58 PM, John Hardin wrote:
>>
>> The latter brings up another concern for the vetted-corpora model: if a
>> message is *removed* from a training corpus mailbox rather than
>> reclassified, you'd have to wipe and retrain your database from scratch
>> to remove that message's effects.
>>
>> So, you need *three* vetted corpus mailboxes: spam, ham, and
>> should-not-have-been-trained (forget). Rather than deleting a message
>> from the ham or spam corpus mailbox, you move it to the forget mailbox,
>> and in the next training pass sa-learn forgets the message and removes
>> it from the forget mailbox. This would require some special scripting,
>> because you can't just "sa-learn --forget" a whole mailbox.
>>
>> There would also need to be an audit process to detect whether the same
>> message_id is in both the ham and spam corpus mailboxes, so that the
>> admin can delete (NOT forget) the incorrect classification, or forget
>> the message if neither classification is reasonable.
>
> You reveal some crucial information with regard to corpora management
> here, John.
>
> I've taken your good advice and created a third mailbox (well, a third
> "folder" within the same mailbox), named "Forget".
>
> It sounds as though the key here is never to delete messages from either
> corpus -- unless the same message exists in both corpora, in which case
> the misclassified message should be deleted. If neither classification
> is reasonable and the message should instead be forgotten, what's the
> order of operations? Should a copy of the message be created in the
> "Forget" corpus and then the message deleted from both the "Ham" and
> "Spam" corpora?

I would suggest: *move* one to the Forget folder and delete the other.

I am assuming that learning from the vetted corpora folders is on a
schedule rather than in real-time, so that you have a liberal window for
completing these operations.
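
Something like the following, run nightly from cron, would fit that model
(a sketch; the Maildir paths are illustrative):

---------------------------------------
#!/bin/sh

# Scheduled training pass over the vetted corpus folders; sa-learn
# accepts a directory of individual messages (Maildir).
CORPUS=/var/vmail/example.com/admin/Maildir
sa-learn --spam "$CORPUS"/.Training.SPAM/cur
sa-learn --ham "$CORPUS"/.Training.HAM/cur
---------------------------------------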

> With regard to the specialized scripting required to "forget" messages,
> this sounds cumbersome

Yeah.

>> because you can't just "sa-learn --forget" a whole mailbox.
>
> Is there a non-obvious reason for this? Would the logic behind a
> recursive --forget switch not be the same or similar as with the
> existing --ham and --spam switches?

Oh, the logic would be the same, it's just not implemented. That's why you
can't do it. :)

> Finally, when a user submits a message to be classified as ham or spam,
> how should I be sorting the messages? I see the following scenarios:
>
> 1.) I agree with the end-user's classification.
>
> 2.) I disagree with the end-user's classification.
> a.) Because the message was submitted as ham but is really spam (or
> vice versa)
> b.) Because neither classification is reasonable

> In case 1.), should I *copy* the message from the submission inbox's Ham
> folder to the permanent Ham corpus folder? Or should I *move* the
> message? I'm trying to discern whether or not there's value in retaining
> end-user submissions *as they were classified upon submission*.

I don't see any value to retaining them in the public submission folders.

In fact, you may want to make the ham submission folder write-only (if
that's possible) in order to help preserve your individual users' privacy.
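
With Dovecot's ACL plugin something like this might do it (a sketch; the
rights letters are Dovecot's, while the path and identifiers are
illustrative):

---------------------------------------
# dovecot-acl file inside the shared ham-submission folder ("l" =
# lookup, "i" = insert): users can file messages into the folder but
# cannot read back what others submitted; the owner keeps full rights.
cat > /path/to/shared/.Training.HAM/dovecot-acl <<'EOF'
authenticated li
owner lrwstipekxa
EOF
---------------------------------------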

> In case 2.), should I simply delete the message from the submission
> folder? Or is there some reason to retain the message (i.e., move it
> into an "Erroneous" folder within the submission mailbox)?

You might want to do that if you intend to approach the user and train
them about why it wasn't a correct submission and you want evidence - for
example, to say that this looks like a message from a legitimate mailing
list that they intentionally subscribed to at some point, and the
unsubscribe link is right there (points at screen).

Apart from that, I don't see a reason to keep erroneous submissions
either.

> I did read http://wiki.apache.org/spamassassin/HandClassifiedCorpora ,
> but it doesn't address these issues, specifically.

Yeah, that assumes familiarity with these issues, and managing masscheck
corpora is a slightly different task than managing user-fed Bayes training
corpora.

--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhardin@impsec.org FALaholic #11174 pgpk -a jhardin@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
...we talk about creating "millions of shovel-ready jobs" for a
society that doesn't really encourage people to pick up a shovel.
-- Mike Rowe, testifying before Congress
-----------------------------------------------------------------------
6 days until Abraham Lincoln's and Charles Darwin's 204th Birthdays
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
Apologies for resurrecting the thread, but I never did receive a
response to this particular aspect of the problem (asked on Jan 18,
2013, 8:51 AM). This is probably because I replied to my own post before
anyone else did, and changed the subject slightly.

We are being hammered pretty hard with spam (again), and as I inspect
messages whose score is below tag2_level, BAYES_* is conspicuously
absent from the headers.

To reiterate my question:

>> Are there any normal circumstances under which Bayes tests are not run?

If not, are there circumstances under which Bayes tests are run but
their results are not included in the message headers? (I have tag_level
set to -999, so SA headers are always added.)

Likewise, for the vast majority of spam messages that slip through, I
see no evidence of Pyzor or Razor2 activity. I have heretofore assumed
that this observation indicates that the network tests were performed,
but did not contribute to the SA score. Is this assumption valid?

Also, is there some means by which to *force* Pyzor and Razor2 scores to
be added to the SA header, even if they did not contribute to the score?

To refresh folks' memories, we have verified that Bayes is set up
correctly (database was wiped and now training is done manually and is
supervised), and that network tests are being performed when messages
are scanned.

Thanks for sticking with me through all of this, guys!

-Ben



On 1/18/2013 11:51 AM, Ben Johnson wrote:
> So, I've been keeping an eye on things again today.
>
> Overall, things look pretty good, and most spam is being blocked
> outright at the MTA and scored appropriately in SA if not.
>
> I've been inspecting the X-Spam-Status headers for the handful of
> messages that do slip through and noticed that most of them lack any
> evidence of the BAYES_* tests. Here's one such header:
>
> No, score=3.115 tagged_above=-999 required=4.5 tests=[HK_NAME_FREE=1,
> HTML_MESSAGE=0.001, MIME_HTML_ONLY=0.723, PYZOR_CHECK=1.392,
> SPF_PASS=-0.001] autolearn=disabled
>
> The messages that were delivered just before and after this one do have
> evidence of BAYES_* tests, so, it's not as though something is
> completely broken.
>
> Are there any normal circumstances under which Bayes tests are not run?
> Do I need to turn debugging back on and wait until this happens again?
>
> Thanks for all the help, everyone!
>
> -Ben
>
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 4/16/13 2:59 PM, "Ben Johnson" <ben@indietorrent.org> wrote:

>Are there any normal circumstances under which Bayes tests are not run?
Yes, if use_bayes 0 is included in the local.cf file.

>
> If not, are there circumstances under which Bayes tests are run but
> their results are not included in the message headers? (I have tag_level
> set to -999, so SA headers are always added.)

That sounds like an amavisd setting; you may want to check in
~amavisd/.spamassassin/user_prefs as well....

>
> Likewise, for the vast majority of spam messages that slip-through, I
> see no evidence of Pyzor or Razor2 activity. I have heretofore assumed
> that this observation indicates that the network tests were performed,
> but did not contribute to the SA score. Is this assumption valid?
Yes.

>
> Also, is there some means by which to *force* Pyzor and Razor2 scores to
> be added to the SA header, even if they did not contribute to the score?

I imagine you would want something like this:

full RAZOR2_CF_RANGE_0_50 eval:check_razor2_range('','0','50')
tflags RAZOR2_CF_RANGE_0_50 net
reuse RAZOR2_CF_RANGE_0_50
describe RAZOR2_CF_RANGE_0_50 Razor2 gives confidence level under 50%
score RAZOR2_CF_RANGE_0_50 0.01

full RAZOR2_CF_RANGE_E4_0_50 eval:check_razor2_range('4','0','50')
tflags RAZOR2_CF_RANGE_E4_0_50 net
reuse RAZOR2_CF_RANGE_E4_0_50
describe RAZOR2_CF_RANGE_E4_0_50 Razor2 gives engine 4 confidence level below 50%
score RAZOR2_CF_RANGE_E4_0_50 0.01

full RAZOR2_CF_RANGE_E8_0_50 eval:check_razor2_range('8','0','50')
tflags RAZOR2_CF_RANGE_E8_0_50 net
reuse RAZOR2_CF_RANGE_E8_0_50
describe RAZOR2_CF_RANGE_E8_0_50 Razor2 gives engine 8 confidence level below 50%
score RAZOR2_CF_RANGE_E8_0_50 0.01

>
> To refresh folks' memories, we have verified that Bayes is setup
> correctly (database was wiped and now training is done manually and is
> supervised), and that network tests are being performed when messages
> are scanned.
>
> Thanks for sticking with me through all of this, guys!
>
> -Ben

--
Daniel J McDonald, CCIE # 2495, CISSP # 78281
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
Daniel, thanks for the quick reply. I'll reply inline, below.

On 4/16/2013 5:01 PM, Daniel McDonald wrote:
>
>
>
> On 4/16/13 2:59 PM, "Ben Johnson" <ben@indietorrent.org> wrote:
>
>> Are there any normal circumstances under which Bayes tests are not run?
> Yes, if use_bayes 0 is included in the local.cf file.

I checked in /etc/spamassassin/local.cf, and find the following:

use_bayes 1

So, that seems not to be the issue.

>>
>> If not, are there circumstances under which Bayes tests are run but
>> their results are not included in the message headers? (I have tag_level
>> set to -999, so SA headers are always added.)
>
> That sounds like an amavisd command, you may want to check in
> ~amavisd/.spamassassin/user_prefs as well....

I checked in the equivalent path on my system
(/var/lib/amavis/.spamassassin/user_prefs) and the entire file is
commented out. So, that seems not to be the issue, either.

Is there anything else that would cause Bayes tests not to be performed? I
ask because other types of tests are disabled automatically under
certain circumstances (e.g., network tests), and I'm wondering if there
is some obscure combination of factors that causes Bayes tests not to be
performed.

>>
>> Likewise, for the vast majority of spam messages that slip-through, I
>> see no evidence of Pyzor or Razor2 activity. I have heretofore assumed
>> that this observation indicates that the network tests were performed,
>> but did not contribute to the SA score. Is this assumption valid?
> Yes.

Okay, very good.

It occurred to me that perhaps the Pyzor and/or Razor2 tests are
timing out (both timeouts are set to 15 seconds) some percentage of the
time, which may explain why these tests do not contribute to a given
message's score.

That's why I asked about forcing the results into the SA header.

>>
>> Also, is there some means by which to *force* Pyzor and Razor2 scores to
>> be added to the SA header, even if they did not contribute to the score?
>
> I imagine you would want something like this:
>
> full RAZOR2_CF_RANGE_0_50 eval:check_razor2_range('','0','50')
> tflags RAZOR2_CF_RANGE_0_50 net
> reuse RAZOR2_CF_RANGE_0_50
> describe RAZOR2_CF_RANGE_0_50 Razor2 gives confidence level under 50%
> score RAZOR2_CF_RANGE_0_50 0.01
>
> full RAZOR2_CF_RANGE_E4_0_50 eval:check_razor2_range('4','0','50')
> tflags RAZOR2_CF_RANGE_E4_0_50 net
> reuse RAZOR2_CF_RANGE_E4_0_50
> describe RAZOR2_CF_RANGE_E4_0_50 Razor2 gives engine 4 confidence level below 50%
> score RAZOR2_CF_RANGE_E4_0_50 0.01
>
> full RAZOR2_CF_RANGE_E8_0_50 eval:check_razor2_range('8','0','50')
> tflags RAZOR2_CF_RANGE_E8_0_50 net
> reuse RAZOR2_CF_RANGE_E8_0_50
> describe RAZOR2_CF_RANGE_E8_0_50 Razor2 gives engine 8 confidence level below 50%
> score RAZOR2_CF_RANGE_E8_0_50 0.01

This seems to work brilliantly. I can't thank you enough; I never would
have figured this out.

Ideally, using the above directives will tell us whether we're
experiencing timeouts or whether these spam messages are simply not in
the Pyzor or Razor2 databases.

Off the top of your head, do you happen to know what will happen if one
or both of the Pyzor/Razor2 tests time out? Will some indication that the
tests were at least *started* still be added to the SA header?

>>
>> To refresh folks' memories, we have verified that Bayes is setup
>> correctly (database was wiped and now training is done manually and is
>> supervised), and that network tests are being performed when messages
>> are scanned.
>>
>> Thanks for sticking with me through all of this, guys!
>>
>> -Ben
>

Thanks again, Daniel!

-Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
Ben Johnson wrote:
> Is there anything else that would cause Bayes tests not be performed? I
> ask because other types of tests are disabled automatically under
> certain circumstances (e.g., network tests), and I'm wondering if there
> is some obscure combination of factors that causes Bayes tests not to be
> performed.

Do you have bayes_sql_override_username set? (This forces use of a
single Bayes DB for all SA calls that reference this configuration file
set.)

If not, you may be getting a Bayes DB for each user on your system;
IIRC this is supported (sort of) and default with Amavis.

-kgd
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 17-04-13 21:40, Ben Johnson wrote:
> Ideally, using the above directives will tell us whether we're
> experiencing timeouts, or these spam messages are simply not in the
> Pyzor or Razor2 databases.
>
> Off the top of your head, do you happen to know what will happen if one
> or both of the Pyzor/Razor2 tests timeout? Will some indication that the
> tests were at least *started* still be added to the SA header?

The razor client (don't know about pyzor) logs its activity to some
logfile in ~razor. There you can see what (or what not) is happening.

It's also possible to raise logfile verbosity by changing the razor
config file. See the man page for details.
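
For example (a sketch; on an amavis setup the razor home is typically the
amavis user's home directory):

---------------------------------------
# Raise Razor client verbosity and watch its log; "debuglevel" and the
# razor-agent.log name are standard razor-agent.conf settings.
echo 'debuglevel = 9' >> /var/lib/amavis/.razor/razor-agent.conf
tail -f /var/lib/amavis/.razor/razor-agent.log
---------------------------------------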

--
Tom
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 4/17/2013 5:05 PM, Kris Deugau wrote:
> Ben Johnson wrote:
>> Is there anything else that would cause Bayes tests not be performed? I
>> ask because other types of tests are disabled automatically under
>> certain circumstances (e.g., network tests), and I'm wondering if there
>> is some obscure combination of factors that causes Bayes tests not to be
>> performed.
>
> Do you have bayes_sql_override_username set? (This forces use of a
> single Bayes DB for all SA calls that reference this configuration file
> set.)
>
> If not, you may be getting a Bayes DB for each user on your system;
> IIRC this is supported (sort of) and default with Amavis.
>
> -kgd
>

Thanks for jumping in here, Kris.

Yes, I do have the following in my SA local.cf:

bayes_sql_override_username amavis

So, all users are sharing the same Bayes DB. I train Bayes daily and the
token count, etc., etc. all look good and correct.

Just a quick update to my previous post.

The Pyzor and Razor2 score information is indeed coming through for the
handful of messages that have landed since I made those configuration
changes. So, all seems to be well on the Pyzor / Razor2 front.

However, I still don't see any evidence that Bayes testing was performed
on the messages that are "slipping through".

It bears mention that *most* messages do indeed show evidence of Bayes
scoring.

--- OH, SNAP! I found the root cause. ---

Well, when I went to confirm the above statement, regarding most
messages showing evidence of Bayes scoring, I realized that *none* show
evidence of it since 3/23! No wonder all of this garbage is slipping
through!

I recognized the date 3/23 immediately; it was the date on which we
upgraded ISPConfig from 3.0.4.6 to 3.0.5. (For those who have no
knowledge of ISPConfig, it is basically a FOSS solution to managing vast
numbers of websites, domains, mailboxes, etc., as the name implies.)

We also updated OS packages (security only) on that day.

After diff-ing all of the relevant service configuration files
(amavis-new, spamassassin, postfix, dovecot, etc.) I couldn't find any
discrepancies.

Then, I tried:

-----------------------------------------------------
# spamassassin -D -t < /tmp/msg.txt 2>&1 | egrep '(bayes:|whitelist:|AWL)'

Apr 17 15:36:08.723 [23302] dbg: bayes: learner_new
self=Mail::SpamAssassin::Plugin::Bayes=HASH(0x2fbc508),
bayes_store_module=Mail::SpamAssassin::BayesStore::MySQL
Apr 17 15:36:08.746 [23302] dbg: bayes: using username: amavis
Apr 17 15:36:08.746 [23302] dbg: bayes: learner_new: got
store=Mail::SpamAssassin::BayesStore::MySQL=HASH(0x3305358)
Apr 17 15:36:08.758 [23302] dbg: bayes: database connection established
Apr 17 15:36:08.758 [23302] dbg: bayes: found bayes db version 3
Apr 17 15:36:08.759 [23302] dbg: bayes: Using userid: 1
Apr 17 15:36:08.914 [23302] dbg: bayes: corpus size: nspam = 6083, nham
= 2334
Apr 17 15:36:08.920 [23302] dbg: bayes: tok_get_all: token count: 163
Apr 17 15:36:08.921 [23302] dbg: bayes: tok_get_all: SQL error: Illegal
mix of collations for operation ' IN '
Apr 17 15:36:08.921 [23302] dbg: bayes: cannot use bayes on this
message; none of the tokens were found in the database
Apr 17 15:36:08.921 [23302] dbg: bayes: not scoring message, returning undef
Apr 17 15:36:13.116 [23302] dbg: timing: total 5159 ms - init: 804
(15.6%), parse: 10 (0.2%), extract_message_metadata: 99 (1.9%),
poll_dns_idle: 3426 (66.4%), get_uri_detail_list: 0.24 (0.0%),
tests_pri_-1000: 11 (0.2%), compile_gen: 133 (2.6%), compile_eval: 18
(0.3%), tests_pri_-950: 5 (0.1%), tests_pri_-900: 5 (0.1%),
tests_pri_-400: 12 (0.2%), check_bayes: 8 (0.1%), tests_pri_0: 804
(15.6%), dkim_load_modules: 23 (0.4%), check_dkim_signature: 5 (0.1%),
check_dkim_adsp: 99 (1.9%), check_spf: 61 (1.2%), check_razor2: 211
(4.1%), check_pyzor: 138 (2.7%), tests_pri_500: 3387 (65.7%)
-----------------------------------------------------

Check out the message buried halfway down:

bayes: tok_get_all: SQL error: Illegal mix of collations for operation '
IN '

I have run into this unsightly message before, but in that case, I could
see the entire query, which enabled me to change the collations accordingly.

In this case, I have no idea what the original query might have been.
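
(One way to capture it, in hindsight, would be to enable MySQL's general
query log for a moment -- a sketch; it logs every statement, so it should
be switched off again promptly. general_log is available in MySQL 5.1+.)

-----------------------------------------------------
# Log every statement the server receives, re-run the failing scan,
# then turn logging back off and inspect the bayes queries.
mysql -u root -p -e "SET GLOBAL general_log_file='/tmp/mysql.log'; SET GLOBAL general_log='ON';"
spamassassin -D -t < /tmp/msg.txt > /dev/null 2>&1
mysql -u root -p -e "SET GLOBAL general_log='OFF';"
grep 'bayes_token' /tmp/mysql.log | tail
-----------------------------------------------------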

Further, I have no idea what changed that introduced this problem on 3/23.

Was it a MySQL upgrade? Was it an ISPConfig change?

Has anybody else run into this?

Thanks again,

-Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 4/17/2013 6:47 PM, Ben Johnson wrote:
>
>
> On 4/17/2013 5:05 PM, Kris Deugau wrote:
>> Ben Johnson wrote:
>>> Is there anything else that would cause Bayes tests not be performed? I
>>> ask because other types of tests are disabled automatically under
>>> certain circumstances (e.g., network tests), and I'm wondering if there
>>> is some obscure combination of factors that causes Bayes tests not to be
>>> performed.
>>
>> Do you have bayes_sql_override_username set? (This forces use of a
>> single Bayes DB for all SA calls that reference this configuration file
>> set.)
>>
>> If not, you may be getting a Bayes DB for each user on your system;
>> IIRC this is supported (sort of) and default with Amavis.
>>
>> -kgd
>>
>
> Thanks for jumping-in here, Kris.
>
> Yes, I do have the following in my SA local.cf:
>
> bayes_sql_override_username amavis
>
> So, all users are sharing the same Bayes DB. I train Bayes daily and the
> token count, etc., etc. all look good and correct.
>
> Just a quick update to my previous post.
>
> The Pyzor and Razor2 score information is indeed coming through for the
> handful of messages that have landed since I made those configuration
> changes. So, all seems to be well on the Pyzor / Razor2 front.
>
> However, I still don't see any evidence that Bayes testing was performed
> on the messages that are "slipping through".
>
> It bears mention that *most* messages do indeed show evidence of Bayes
> scoring.
>
> --- OH, SNAP! I found the root cause. ---
>
> Well, when I went to confirm the above statement, regarding most
> messages showing evidence of Bayes scoring, I realized that *none* show
> evidence of it since 3/23! No wonder all of this garbage is slipping
> through!
>
> I recognized the date 3/23 immediately; it was the date on which we
> upgraded ISPConfig from 3.0.4.6 to 3.0.5. (For those who have no
> knowledge of ISPConfig, it is basically a FOSS solution to managing vast
> numbers of websites, domains, mailboxes, etc., as the name implies.)
>
> We also updated OS packages (security only) on that day.
>
> After diff-ing all of the relevant service configuration files
> (amavis-new, spamassassin, postfix, dovecot, etc.) I couldn't find any
> discrepancies.
>
> Then, I tried:
>
> -----------------------------------------------------
> # spamassassin -D -t < /tmp/msg.txt 2>&1 | egrep '(bayes:|whitelist:|AWL)'
>
> Apr 17 15:36:08.723 [23302] dbg: bayes: learner_new
> self=Mail::SpamAssassin::Plugin::Bayes=HASH(0x2fbc508),
> bayes_store_module=Mail::SpamAssassin::BayesStore::MySQL
> Apr 17 15:36:08.746 [23302] dbg: bayes: using username: amavis
> Apr 17 15:36:08.746 [23302] dbg: bayes: learner_new: got
> store=Mail::SpamAssassin::BayesStore::MySQL=HASH(0x3305358)
> Apr 17 15:36:08.758 [23302] dbg: bayes: database connection established
> Apr 17 15:36:08.758 [23302] dbg: bayes: found bayes db version 3
> Apr 17 15:36:08.759 [23302] dbg: bayes: Using userid: 1
> Apr 17 15:36:08.914 [23302] dbg: bayes: corpus size: nspam = 6083, nham
> = 2334
> Apr 17 15:36:08.920 [23302] dbg: bayes: tok_get_all: token count: 163
> Apr 17 15:36:08.921 [23302] dbg: bayes: tok_get_all: SQL error: Illegal
> mix of collations for operation ' IN '
> Apr 17 15:36:08.921 [23302] dbg: bayes: cannot use bayes on this
> message; none of the tokens were found in the database
> Apr 17 15:36:08.921 [23302] dbg: bayes: not scoring message, returning undef
> Apr 17 15:36:13.116 [23302] dbg: timing: total 5159 ms - init: 804
> (15.6%), parse: 10 (0.2%), extract_message_metadata: 99 (1.9%),
> poll_dns_idle: 3426 (66.4%), get_uri_detail_list: 0.24 (0.0%),
> tests_pri_-1000: 11 (0.2%), compile_gen: 133 (2.6%), compile_eval: 18
> (0.3%), tests_pri_-950: 5 (0.1%), tests_pri_-900: 5 (0.1%),
> tests_pri_-400: 12 (0.2%), check_bayes: 8 (0.1%), tests_pri_0: 804
> (15.6%), dkim_load_modules: 23 (0.4%), check_dkim_signature: 5 (0.1%),
> check_dkim_adsp: 99 (1.9%), check_spf: 61 (1.2%), check_razor2: 211
> (4.1%), check_pyzor: 138 (2.7%), tests_pri_500: 3387 (65.7%)
> -----------------------------------------------------
>
> Check out the message buried halfway down:
>
> bayes: tok_get_all: SQL error: Illegal mix of collations for operation '
> IN '
>
> I have run into this unsightly message before, but in that case, I could
> see the entire query, which enabled me to change the collations accordingly.
>
> In this case, I have no idea what the original query might have been.
>
> Further, I have no idea what change introduced this problem on 3/23.
>
> Was it a MySQL upgrade? Was it an ISPConfig change?
>
> Has anybody else run into this?
>
> Thanks again,
>
> -Ben
>

I managed to fix this issue.

The date on which Bayes stopped "working" was relevant only inasmuch as
it was the first date on which MySQL had been restarted in months. The
software updates had nothing to do with the issue. The critical change
was that at some point after the previous restart, I had added a handful
of [mysqld] configuration directives to my.cnf:

default_storage_engine=InnoDB
skip-character-set-client-handshake
collation_server=utf8_unicode_ci
character_set_server=utf8

Searching the Web for "bayes: _put_token: Updated an unexpected number
of rows" yielded the following thread:

http://spamassassin.1065346.n5.nabble.com/Migrating-bayes-to-mysql-fails-with-parsing-errors-td10064i20.html

There were several clues here. After backing up the token DB with
"sa-learn --backup", I dropped and recreated the tables using the schema
from
http://svn.apache.org/repos/asf/spamassassin/tags/spamassassin_current_release_3.3.x/sql/bayes_mysql.sql

Then, I tried restoring the data into the new tables:

# sa-learn --restore bayes-backup
bayes: encountered too many errors (20) while parsing token line,
reverting to empty database and exiting
ERROR: Bayes restore returned an error, please re-run with -D for more
information

# sa-learn -D --restore bayes-backup

dbg: bayes: _put_token: Updated an unexpected number of rows.
dbg: bayes: error inserting token for line: t 1 0 1364380202 3f3f1a2a3f
dbg: bayes: _put_token: Updated an unexpected number of rows.
dbg: bayes: error inserting token for line: t 1 0 1365878113 727f3f3f20

Still no joy. So I re-read the above thread, cover to cover.

The first post on that page was the key. In particular, adding the
following to each MySQL "CREATE TABLE" statement:

ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;

To create the new tables, I did a global find-and-replace on the above
.sql file, replacing "TYPE=MyISAM;" with "ENGINE=InnoDB DEFAULT
CHARSET=utf8 COLLATE=utf8_bin;". All of the tables were then created
successfully.
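
In shell terms, the replacement boils down to a one-liner, and for
illustration, here is what the smallest table in the schema ends up
looking like afterward (I'm quoting the bayes_expire definition from
memory, so trust the .sql file itself over this excerpt):

# sed -i 's/TYPE=MyISAM;/ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;/' bayes_mysql.sql

CREATE TABLE bayes_expire (
  id int(11) NOT NULL default '0',
  runtime int(11) NOT NULL default '0',
  KEY bayes_expire_idx1 (id)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;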

To restore the data, I used "sa-learn --restore", and this time, the
process completed without error.

I was a little nervous that restoring from the backup would create
corrupted token data, and because I retain my spam and ham corpora, it
felt "safer" to empty the tables and re-run sa-learn on the corpora.

In any case, problem solved!

Thanks for the help and pointers in the right direction.

-Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new [ In reply to ]
On Wed, 17 Apr 2013, Ben Johnson wrote:

> The first post on that page was the key. In particular, adding the
> following to each MySQL "CREATE TABLE" statement:
>
> ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;

Please check the SpamAssassin bugzilla to see if this situation is already
mentioned, and if not, add a bug. This seems pretty critical.

It's possible that there's a good reason the default script still uses
MyISAM. If so, the documentation for this fix should at least be easier to
find.

--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
jhardin@impsec.org FALaholic #11174 pgpk -a jhardin@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
Our government should bear in mind the fact that the American
Revolution was touched off by the then-current government
attempting to confiscate firearms from the people.
-----------------------------------------------------------------------
2 days until the 238th anniversary of The Shot Heard 'Round The World
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new [ In reply to ]
On 4/17/2013 5:39 PM, Tom Hendrikx wrote:
> On 17-04-13 21:40, Ben Johnson wrote:
>> Ideally, using the above directives will tell us whether we're
>> experiencing timeouts, or these spam messages are simply not in the
>> Pyzor or Razor2 databases.
>>
>> Off the top of your head, do you happen to know what will happen if one
>> or both of the Pyzor/Razor2 tests timeout? Will some indication that the
>> tests were at least *started* still be added to the SA header?
>
> The razor client (don't know about pyzor) logs its activity to some
> logfile in ~razor. There you can see what (or what not) is happening.
>
> It's also possible to raise logfile verbosity by changing the razor
> config file. See the man page for details.
>
> --
> Tom
>

Tom, thanks for the excellent tip regarding Razor's own log file.
Tailing that log will make this kind of debugging much simpler in the
future. Much appreciated.
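
For anyone else digging into this, the relevant settings live in
razor-agent.conf (typically under ~/.razor for whichever user runs the
checks); something along these lines should raise the verbosity, per the
razor-agent.conf(5) man page Tom mentioned:

debuglevel = 9
logfile    = razor-agent.log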

One reason I also like the idea of using Daniel McDonald's
include-scores-in-header rule (for Pyzor and Razor) is that the data is
embedded right in the message. For one, this makes the scoring data more
"portable" (it stays with the message to which it applies). For another,
when tailing a log, it can be difficult to tell where the data for one
message ends and the next begins.

Thanks again,

-Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new [ In reply to ]
On 4/17/2013 10:15 PM, John Hardin wrote:
> On Wed, 17 Apr 2013, Ben Johnson wrote:
>
>> The first post on that page was the key. In particular, adding the
>> following to each MySQL "CREATE TABLE" statement:
>>
>> ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;
>
> Please check the SpamAssassin bugzilla to see if this situation is
> already mentioned, and if not, add a bug. This seems pretty critical.

Mark Martinec opened three reports in relation to this issue (quoted
from the archive thread cited in my previous post):

[Bug 6624] BayesStore/MySQL.pm fails to update tokens due to
MySQL server bug (wrong count of rows affected)
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6624

(^^ Fixed in 3.4 ^^)

[Bug 6625] Bayes SQL schema treats bayes_token.token as char
instead of binary, fails chset checks
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6625

(^^ Fixed in 3.4 ^^)

[Bug 6626] Newer MySQL chokes on TYPE=MyISAM syntax
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6626

(^^ Fixed in 3.4 ^^)

My concern now is that I am on 3.3.1, with little control over upgrades.
I have read all three bug reports in their entirety and Bug 6624 seems
to be a very legitimate concern. To quote Mark in the bug description:

> The effect of the bug with SpamAssassin is that tokens are only able
> to be inserted once, but their counts cannot increase, leading to
> terrible bayes results if the bug is not noticed. Also the conversion
> from db fails, as reported by Dave.
>
> Attached is a patch for lib/Mail/SpamAssassin/BayesStore/MySQL.pm to
> provide a workaround for the MySQL server bug, and improved debug logging.

How can I tell whether this bug does, in fact, affect me? Are my Bayes
results being crippled as a result?
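
In the meantime, the only crude check I can think of is to train a
message that hasn't been seen before and grep the debug stream for the
tell-tale line quoted above (this is just pattern-matching on the debug
string, not an official diagnostic; the path is a stand-in):

# sa-learn -D --spam /tmp/new-spam.txt 2>&1 | grep '_put_token'

If "Updated an unexpected number of rows." shows up during routine
training, I would assume the bug is in play.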

> It's possible that there's a good reason the default script still uses
> myISAM. If so, the documentation for this fix should at least be easier
> to find.
>

If there is a good reason, I have yet to discern what it might be. The
third bug from above (Mark's comments, specifically) implies that there
is no particular reason for using MyISAM.

I have good reason for wanting to use the InnoDB storage engine, and I
have seen no performance hit from doing so. (In fact, performance seems
better than with MyISAM in my scripted, once-a-day training setup.)

The perfectly acceptable performance I'm observing could be because a)
the InnoDB-related resources allocated to MySQL are more than
sufficient, b) the schema that I used has a newly-added INDEX whereas
those prior to it did not, or c) I was sure to use the "MySQL" module
instead of the "SQL" module with my InnoDB setup:

bayes_store_module Mail::SpamAssassin::BayesStore::MySQL
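
For completeness, the rest of the Bayes-SQL glue in my local.cf looks
roughly like this (the DSN, username, and password shown here are
placeholders rather than my real values):

bayes_store_module          Mail::SpamAssassin::BayesStore::MySQL
bayes_sql_dsn               DBI:mysql:sa_bayes:localhost
bayes_sql_username          sa_user
bayes_sql_password          secret
bayes_sql_override_username amavis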

The bottom line seems to be that for those who have settings like these
in their MySQL configurations

> default_storage_engine=InnoDB
> skip-character-set-client-handshake
> collation_server=utf8_unicode_ci
> character_set_server=utf8

it is absolutely necessary to include

ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;

at the end of each CREATE TABLE statement. Otherwise, the obsolete
TYPE=MyISAM syntax triggers a MySQL error at creation time, and tables
created with the server-default collation make every Bayes SELECT fail
with the "Illegal mix of collations" error shown earlier.
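
A quick way to confirm what a given installation actually ended up with
(the "sa_bayes" schema name is again a placeholder for whatever your DSN
references):

mysql> SELECT table_name, engine, table_collation
    ->   FROM information_schema.tables
    ->  WHERE table_schema = 'sa_bayes' AND table_name LIKE 'bayes%';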

In any event, I'm a little concerned because while the majority of
messages are now tagged with BAYES_* hits, I am now seeing this debug
output on a significant percentage of messages ("cannot use bayes on
this message; not enough usable tokens found"):

# spamassassin -D -t < /tmp/msg.txt 2>&1 | egrep '(bayes:|whitelist:|AWL)'

--------------------------------------------------------------
Apr 18 09:15:36.537 [21797] dbg: bayes: learner_new
self=Mail::SpamAssassin::Plugin::Bayes=HASH(0x4430388),
bayes_store_module=Mail::SpamAssassin::BayesStore::MySQL
Apr 18 09:15:36.568 [21797] dbg: bayes: using username: amavis
Apr 18 09:15:36.568 [21797] dbg: bayes: learner_new: got
store=Mail::SpamAssassin::BayesStore::MySQL=HASH(0x4779778)
Apr 18 09:15:36.580 [21797] dbg: bayes: database connection established
Apr 18 09:15:36.580 [21797] dbg: bayes: found bayes db version 3
Apr 18 09:15:36.581 [21797] dbg: bayes: Using userid: 1
Apr 18 09:15:36.781 [21797] dbg: bayes: corpus size: nspam = 6155, nham
= 2342
Apr 18 09:15:36.787 [21797] dbg: bayes: tok_get_all: token count: 176
Apr 18 09:15:36.790 [21797] dbg: bayes: cannot use bayes on this
message; not enough usable tokens found
Apr 18 09:15:36.790 [21797] dbg: bayes: not scoring message, returning undef
Apr 18 09:15:37.861 [21797] dbg: timing: total 2109 ms - init: 830
(39.4%), parse: 7 (0.4%), extract_message_metadata: 123 (5.9%),
poll_dns_idle: 74 (3.5%), get_uri_detail_list: 2 (0.1%),
tests_pri_-1000: 26 (1.3%), compile_gen: 155 (7.4%), compile_eval: 19
(0.9%), tests_pri_-950: 7 (0.3%), tests_pri_-900: 7 (0.3%),
tests_pri_-400: 15 (0.7%), check_bayes: 10 (0.5%), tests_pri_0: 1018
(48.3%), dkim_load_modules: 25 (1.2%), check_dkim_signature: 3 (0.2%),
check_dkim_adsp: 16 (0.7%), check_spf: 78 (3.7%), check_razor2: 91
(4.3%), check_pyzor: 430 (20.4%), tests_pri_500: 50 (2.4%)
--------------------------------------------------------------

I have done some searching around on the string "cannot use bayes on
this message; not enough usable tokens found" and have not found
anything authoritative about what it means, or whether it can safely be
ignored or is symptomatic of a larger Bayes problem.

Thank you,

-Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new [ In reply to ]
On 04/18/2013 06:18 PM, Ben Johnson wrote:
> I have done some searching-around on the string "cannot use bayes on
> this message; not enough usable tokens found" and have not found
> anything authoritative regarding what this message might mean and
> whether or not it can be ignored or if it is symptomatic of a larger
> Bayes problem.

Curious: what are your reasons for using Bayes in SQL?
Are you sharing the DB among several machines? Or is this a single
box/global bayes setup?
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new [ In reply to ]
On 4/18/2013 12:26 PM, Axb wrote:
> On 04/18/2013 06:18 PM, Ben Johnson wrote:
>> I have done some searching-around on the string "cannot use bayes on
>> this message; not enough usable tokens found" and have not found
>> anything authoritative regarding what this message might mean and
>> whether or not it can be ignored or if it is symptomatic of a larger
>> Bayes problem.
>
> Curious: what are your reasons for using Bayes in SQL?
> Are you sharing the DB among several machines? Or is this a single
> box/global bayes setup?
>
>

Not yet, but that is the ultimate plan (to share the DB across multiple
servers). Also, I like the idea that the Bayes DB is backed up
automatically along with all other databases on the server (we run a
cron script that performs the dump). Granted, it would be trivial to
schedule a call to "sa-learn --backup", but storing the data in SQL
seems more portable and makes it easier to query the data for reporting
purposes.
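
To make that last point concrete, per-user training totals are a
one-line query away; the column names below come from the stock
bayes_vars schema, and the database name is a placeholder:

mysql> SELECT username, spam_count, ham_count, token_count
    ->   FROM sa_bayes.bayes_vars;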

Then again, I retain the corpora, so backing up the DB is only useful
when data needs to be moved from one server or database to another
(moving the corpora seems far less practical).

Are you suggesting that I should scrap SQL and go back to a flat-file
DB? Is that the only path to a fix (short of upgrading SA)?

Thanks for your help!

-Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new [ In reply to ]
Hi,

> Curious: what are your reasons for using Bayes in SQL?

> > Are you sharing the DB among several machines? Or is this a single
> > box/global bayes setup?
> >
> >
>
> Not yet, but that is the ultimate plan (to share the DB across multiple
> servers). Also, I like the idea that the Bayes DB is backed-up
> automatically along with all other databases on the server (we run a
> cron script that performs the dump). Granted, it would be trivial to
> schedule a call to "sa-learn --backup", but storing the data in SQL
> seems more portable and makes it easier to query the data for reporting
> purposes.
>

I have bayes in MySQL now, and I think it performs better than with a
flat-file Berkeley DB. I believe it solved some locking/sharing issues I
was having, too.

I converted to it a few months ago (relearned the corpus from scratch
into mysql) with the intention of sharing between three systems, but the
network latency and general update performance between the systems were
horrible, so they're all separate databases now. I'm still a mysql
novice, so I don't doubt someone with more mysql networking experience
could figure out how to share them between systems properly. I thought
there would be one master system with two slaves, but instead all of
them seemed to be hit interactively for every query or update.

For the InnoDB/MyISAM issue, if I'm understanding it correctly, I just
edited the .sql file I used to create the database, and I'm using InnoDB
now without any issues on v3.3.2.

I believe I used these instructions, with the sql modifications from above:

http://www200.pair.com/mecham/spam/debian-spamassassin-sql.html

Regards,
Alex
