Mailing List Archive

Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 4/18/2013 12:18 PM, Ben Johnson wrote:
>
> My concern now is that I am on 3.3.1, with little control over upgrades.
> I have read all three bug reports in their entirety and Bug 6624 seems
> to be a very legitimate concern. To quote Mark in the bug description:
>
>> The effect of the bug with SpamAssassin is that tokens are only able
>> to be inserted once, but their counts cannot increase, leading to
>> terrible bayes results if the bug is not noticed. Also the conversion
>> form db fails, as reported by Dave.
>>
>> Attached is a patch for lib/Mail/SpamAssassin/BayesStore/MySQL.pm to
>> provide a workaround for the MySQL server bug, and improved debug logging.
>
> How can I discern whether or not this bug does, in fact, affect me? Are
> my Bayes results being crippled as a result of this bug?
>
>> It's possible that there's a good reason the default script still uses
>> myISAM. If so, the documentation for this fix should at least be easier
>> to find.
>>
>
> In any event, I'm a little concerned because while the majority of
> messages are now tagged with BAYES_* hits, I am now seeing this debug
> output on a significant percentage of messages ("cannot use bayes on
> this message; not enough usable tokens found"):
>
> # spamassassin -D -t < /tmp/msg.txt 2>&1 | egrep '(bayes:|whitelist:|AWL)'
>
> --------------------------------------------------------------
> Apr 18 09:15:36.537 [21797] dbg: bayes: learner_new
> self=Mail::SpamAssassin::Plugin::Bayes=HASH(0x4430388),
> bayes_store_module=Mail::SpamAssassin::BayesStore::MySQL
> Apr 18 09:15:36.568 [21797] dbg: bayes: using username: amavis
> Apr 18 09:15:36.568 [21797] dbg: bayes: learner_new: got
> store=Mail::SpamAssassin::BayesStore::MySQL=HASH(0x4779778)
> Apr 18 09:15:36.580 [21797] dbg: bayes: database connection established
> Apr 18 09:15:36.580 [21797] dbg: bayes: found bayes db version 3
> Apr 18 09:15:36.581 [21797] dbg: bayes: Using userid: 1
> Apr 18 09:15:36.781 [21797] dbg: bayes: corpus size: nspam = 6155, nham
> = 2342
> Apr 18 09:15:36.787 [21797] dbg: bayes: tok_get_all: token count: 176
> Apr 18 09:15:36.790 [21797] dbg: bayes: cannot use bayes on this
> message; not enough usable tokens found
> Apr 18 09:15:36.790 [21797] dbg: bayes: not scoring message, returning undef
> Apr 18 09:15:37.861 [21797] dbg: timing: total 2109 ms - init: 830
> (39.4%), parse: 7 (0.4%), extract_message_metadata: 123 (5.9%),
> poll_dns_idle: 74 (3.5%), get_uri_detail_list: 2 (0.1%),
> tests_pri_-1000: 26 (1.3%), compile_gen: 155 (7.4%), compile_eval: 19
> (0.9%), tests_pri_-950: 7 (0.3%), tests_pri_-900: 7 (0.3%),
> tests_pri_-400: 15 (0.7%), check_bayes: 10 (0.5%), tests_pri_0: 1018
> (48.3%), dkim_load_modules: 25 (1.2%), check_dkim_signature: 3 (0.2%),
> check_dkim_adsp: 16 (0.7%), check_spf: 78 (3.7%), check_razor2: 91
> (4.3%), check_pyzor: 430 (20.4%), tests_pri_500: 50 (2.4%)
> --------------------------------------------------------------
>
> I have done some searching-around on the string "cannot use bayes on
> this message; not enough usable tokens found" and have not found
> anything authoritative regarding what this message might mean and
> whether or not it can be ignored or if it is symptomatic of a larger
> Bayes problem.
>
> Thank you,
>
> -Ben
>

Might anyone be in a position to offer an authoritative response to
these questions?

I continue to see messages that are very similar to dozens of messages
that have been marked as SPAM slipping through with *no Bayes scoring*
(this is *after* fixing the SQL syntax error issue):

bayes: cannot use bayes on this message; not enough usable tokens found
bayes: not scoring message, returning undef

Is this normal? If so, what is the explanation for this behavior? I have
marked dozens of nearly-identical messages with the subject "Garden hose
expands up to three times its length" as SPAM (over the course of
several weeks), and yet SA reports "not enough usable tokens found".

Is SA referring to the number of tokens in the message? Or the Bayes DB?

Thanks,

-Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
Hi,

> Might anyone be in a position to offer an authoritative response to

> these questions?
>
> I continue to see messages that are very similar to dozens of messages
> that have been marked as SPAM slipping through with *no Bayes scoring*
> (this is *after* fixing the SQL syntax error issue):
>
> bayes: cannot use bayes on this message; not enough usable tokens found
> bayes: not scoring message, returning undef
>

Have you tried to find out how many tokens are in your bayes DB? As the
user specified by bayes_sql_username (actually, it probably doesn't matter,
but you should, to be sure), run the following:

# sa-learn --dump magic
0.000 0 3 0 non-token data: bayes db version
0.000 0 466417 0 non-token data: nspam
0.000 0 508868 0 non-token data: nham
0.000 0 10788203 0 non-token data: ntokens
0.000 0 1320901921 0 non-token data: oldest atime
0.000 0 1366385643 0 non-token data: newest atime
0.000 0 0 0 non-token data: last journal sync atime
0.000 0 1366348380 0 non-token data: last expiry atime
0.000 0 28651364 0 non-token data: last expire atime delta
0.000 0 0 0 non-token data: last expire reduction count

This should show you the number of messages learned as spam (nspam) and
ham (nham), as well as the number of tokens (ntokens) in the db.
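
If you only want those three counters, a small awk filter over the same
output is enough. This is just a sketch: the function name
bayes_magic_summary is made up for illustration, and it assumes the column
layout shown above, where the value sits in the third column.

```shell
# Summarize the counters from `sa-learn --dump magic` output read on stdin.
# Assumption: the value of each "non-token data" row is in the third column,
# as in the sample output quoted in this thread.
bayes_magic_summary() {
    awk '
        /non-token data: nspam$/   { nspam = $3 }
        /non-token data: nham$/    { nham = $3 }
        /non-token data: ntokens$/ { ntokens = $3 }
        END { printf "nspam=%s nham=%s ntokens=%s\n", nspam, nham, ntokens }
    '
}
```

It would be used as: sa-learn --dump magic | bayes_magic_summary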

> Is this normal? If so, what is the explanation for this behavior? I have

> marked dozens of nearly-identical messages with the subject "Garden hose
> expands up to three times its length" as SPAM (over the course of
> several weeks) as SPAM, and yet SA reports "not enough usable tokens
> found".
>

If they are identical, I don't believe it will create new tokens, per se.


> Is SA referring to the number of tokens in the message? Or the Bayes DB?
>

I believe it would be talking about the database, not the message.

Regards,
Alex
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
Hi,

> Is this normal? If so, what is the explanation for this behavior? I have

> marked dozens of nearly-identical messages with the subject "Garden hose
>> expands up to three times its length" as SPAM (over the course of
>> several weeks) as SPAM, and yet SA reports "not enough usable tokens
>> found".
>>
>
> If they are identical, I don't believe it will create new tokens, per se.
>
>
>
>> Is SA referring to the number of tokens in the message? Or the Bayes DB?
>>
>
I should also mention that when training, you can use "--progress", like
so (assuming you're running it on an mbox or a message that's in mbox
format):

# sa-learn --progress --spam --mbox mymboxfile

It will show you how many messages it learned tokens from during that run.
It might also be a good idea to add the token summary header to your config:

add_header all Tok-Stat _TOKENSUMMARY_

If you run spamassassin on a message directly, and add the -t option, it
will show you the number of different types of tokens found in the message:

X-Spam-Tok-Stat: Tokens: new, 0; hammy, 6; neutral, 84; spammy, 36.

Regards,
Alex
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 4/19/2013 11:42 AM, Alex wrote:
> Hi,
>
>> Is this normal? If so, what is the explanation for this behavior? I have
>
> marked dozens of nearly-identical messages with the subject
> "Garden hose
> expands up to three times its length" as SPAM (over the course of
> several weeks) as SPAM, and yet SA reports "not enough usable
> tokens found".
>
>
> If they are identical, I don't believe it will create new tokens,
> per se.
>
>
>
> Is SA referring to the number of tokens in the message? Or the
> Bayes DB?
>
>
> I should also mention that while training a message, use "--progress",
> as such (assuming you're running it on an mbox or message that's in mbox
> format):
>
> # sa-learn --progress --spam --mbox mymboxfile
>
> It will show you how many tokens have been learned during that run. It
> might also be a good idea to add the token summary flag to your config:
>
> add_header all Tok-Stat _TOKENSUMMARY_
>
> If you run spamassassin on a message directly, and add the -t option, it
> will show you the number of different types of tokens found in the message:
>
> X-Spam-Tok-Stat: Tokens: new, 0; hammy, 6; neutral, 84; spammy, 36.
>
> Regards,
> Alex
>

Alex, thanks very much for the quick reply. I really appreciate it.

One can see from the output in my previous message (two messages back)
that the user is amavis (correct for my system) and the corpus size, as
well as the token count:

dbg: bayes: corpus size: nspam = 6155, nham = 2342
dbg: bayes: tok_get_all: token count: 176
dbg: bayes: cannot use bayes on this message; not enough usable tokens found
bayes: not scoring message, returning undef

Now that I look at this output again, the "token count: 176" stands out.
That seems like a pretty low value. Is this the token count for the
entire Bayes DB? Or only the tokens that apply to the particular
message being fed to SA?

The "garden hose" messages are probably not *identical*, but they are
very similar, so it seems that each variant should have tokens to offer.

The concern I expressed around bug 6624 relates to Mark's comment, which
seems to imply that while SA will not insert a token twice, it *will*
increase the token "count". Here's an excerpt from Mark's comment from
that bug report:

"The effect of the bug with SpamAssassin is that tokens are only able
to be inserted once, but their counts cannot increase, leading to
terrible bayes results if the bug is not noticed. Also the conversion
form db fails, as reported by Dave."

Is it possible that training similar messages as SPAM is not having the
intended effect due to this bug in my version of SA?

My "bayes_vars" table looks like this (sorry for the wrapping, this is
the best I could do):

id username spam_count ham_count token_count last_expire
last_atime_delta last_expire_reduce oldest_token_age newest_token_age
1 amavis 6185 2427 120092 1366364379 8380417
14747 1357985848 1366386865

The SQL query:

SELECT count( * )
FROM `bayes_token`

returns 120092 rows, so the above value is accurate (that is, the
"token_count" value in the `bayes_vars` table matches the actual row
count in the `bayes_token` table).
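
One way to get a hint about whether bug 6624 is in play is to count tokens
whose spam_count or ham_count has risen above 1; with thousands of learned
messages, a result of zero would be suspicious. The sketch below only prints
the query. The function name is hypothetical, and it assumes the stock
bayes_token columns spam_count and ham_count; the database name and
credentials to pipe it to depend on the setup.

```shell
# Print a query that counts tokens seen more than once.
# Assumption: the table uses the stock bayes_token columns
# spam_count and ham_count, unmodified.
bug6624_check_sql() {
    cat <<'EOF'
SELECT COUNT(*) AS tokens_seen_more_than_once
FROM bayes_token
WHERE spam_count > 1 OR ham_count > 1;
EOF
}
```

For example: bug6624_check_sql | mysql -u USER -p DATABASE (USER and
DATABASE are placeholders).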

Also, thanks for the other tips regarding the "token summary flag"
directive and the -t switch. I was actually using the -t switch to
produce the output that I pasted two messages back. So, it seems that
the "X-Spam-Tok-Stat" output is added only when the token count is high
enough to be useful.

Still stumped here...

-Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 04/19/2013 06:02 PM, Ben Johnson wrote:

> Still stumped here...

Back up your Bayes data with sa-learn --backup.

Switch to file-based storage in SDBM format (which is fast).

Then do a

sa-learn --restore

Feed it a few thousand NEW spams.

See what happens.
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
John Hardin skrev den 2013-04-18 04:15:

>> ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;

Unicode is overkill, since Bayes is just ASCII.

If Unicode is used, it will create a bigger DB, which will slow things down
more than ASCII.

> Please check the SpamAssassin bugzilla to see if this situation is
> already mentioned, and if not, add a bug. This seems pretty critical.

I don't know how Bayes in 3.4.x is nowadays; it's been a long time since I
have seen the source for it. I made some changes to Bayes MySQL so it can be
cleaned up with timed expiry of data, but this is probably lost in the
transition to 3.4.x :(

> It's possible that there's a good reason the default script still
> uses myISAM. If so, the documentation for this fix should at least be
> easier to find.

Was it documented?

--
senders that put my email into body content will deliver it to my own
trashcan, so if you like to get reply, dont do it
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
Ben Johnson skrev den 2013-04-19 18:02:

> Still stumped here...

For amavisd-new, putting the SpamAssassin SQL setup into the user_prefs file
for the user amavisd-new runs as might work better than having insecure
SQL settings in /etc/mail/spamassassin :)

I don't know if this is really the case, but do you have one user for
amavisd, and test spamassassin -t msg as another user that uses a different
SQL user?

Make sure both users are really using the same SQL user as intended.

--
senders that put my email into body content will deliver it to my own
trashcan, so if you like to get reply, dont do it
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 4/19/2013 12:12 PM, Axb wrote:
> On 04/19/2013 06:02 PM, Ben Johnson wrote:
>
>> Still stumped here...
>
> do a bayes sa-learn --backup
>
> switch to file based in SDBM format (which is fast)
>
> do a
>
> sa-learn --restore
>
> feed it a few thousand NEW spams
>
> see what happens

Thanks for the suggestion, Axb. Your help and time are much appreciated.

By "feed it a few thousand NEW spams", do you mean to scrap the training
corpora that I've hand-sorted in favor of starting over? Or do you mean
to clear the database and re-run the training script against the corpora?

If your thinking is that the token data may be "stale", then I will
really be stumped. When I hand-classify 12 messages with a subject and
body about a retractable garden hose as spam, I expect the 13th message
about the same hose to score high on the Bayes tests. Is this an
unreasonable expectation?

I commented-out all of the DB-related lines in my SA configuration file
(local.cf) and restarted amavis-new.

I also cleared the existing DB tokens (with "sa-learn --clear") after
amavis had restarted, and then executed my normal training script
against my ham and spam corpora.

I'll keep an eye on incoming messages to see if those that "slip
through" and score below 4.0 demonstrate evidence of Bayes testing.

I am beginning to wonder if some kind of "corruption", for lack of a
better term, had been introduced by using utf8 to store the token data
(Benny Pedersen mentioned that Unicode is overkill, and he is probably
right). Performance aside, could using utf8_bin (instead of ascii)
introduce a problem for SA (despite no errors during "sa-learn" training
or --restore commands)?

The strange thing is that Bayes seems to work fine most of the time. But
as I've stated previously, almost all "obvious to a human" spam that
scores below 4.0 lacks evidence of Bayes testing.

Since switching back to a DBM Bayes setup, the results look pretty much
as expected (wrapped), and this is the type of thing I expect to see on
every message:

-----------------------------------------------------------
spamassassin -D -t < "/tmp/email.txt" 2>&1 | egrep '(bayes:|whitelist:|AWL)'
dbg: bayes: learner_new
self=Mail::SpamAssassin::Plugin::Bayes=HASH(0x37520f0),
bayes_store_module=Mail::SpamAssassin::BayesStore::DBM
dbg: bayes: learner_new: got
store=Mail::SpamAssassin::BayesStore::DBM=HASH(0x2c52558)
dbg: bayes: tie-ing to DB file R/O /var/lib/amavis/.spamassassin/bayes_toks
dbg: bayes: tie-ing to DB file R/O /var/lib/amavis/.spamassassin/bayes_seen
dbg: bayes: found bayes db version 3
dbg: bayes: DB journal sync: last sync: 0
dbg: bayes: DB journal sync: last sync: 0
dbg: bayes: corpus size: nspam = 6203, nham = 2479
dbg: bayes: score = 5.55111512312578e-17
dbg: bayes: DB journal sync: last sync: 0
dbg: bayes: untie-ing
dbg: timing: total 2925 ms - init: 907 (31.0%), parse: 1.92 (0.1%),
extract_message_metadata: 108 (3.7%), poll_dns_idle: 1040 (35.6%),
get_uri_detail_list: 1.22 (0.0%), tests_pri_-1000: 19 (0.7%),
compile_gen: 185 (6.3%), compile_eval: 19 (0.6%), tests_pri_-950: 5
(0.2%), tests_pri_-900: 5 (0.2%), tests_pri_-400: 32 (1.1%),
check_bayes: 26 (0.9%), tests_pri_0: 836 (28.6%), dkim_load_modules: 27
(0.9%), check_dkim_signature: 1.23 (0.0%), check_dkim_adsp: 24 (0.8%),
check_spf: 70 (2.4%), check_razor2: 202 (6.9%), check_pyzor: 135 (4.6%),
tests_pri_500: 988 (33.8%)
-----------------------------------------------------------

I'll wait and see if I receive messages without Bayes results and report
back.
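
To make that check repeatable, the debug output can be reduced to a single
word. This is a rough sketch that keys off the two debug lines seen in this
thread ("bayes: score = ..." and "bayes: not scoring message"); the function
name bayes_outcome is made up for illustration.

```shell
# Classify `spamassassin -D -t` debug output (read on stdin) by what Bayes did.
# Prints one of: scored, skipped, no-bayes-output.
bayes_outcome() {
    awk '
        /bayes: not scoring message/ { skipped = 1 }
        /bayes: score = /            { scored = 1 }
        END {
            if (scored)       print "scored"
            else if (skipped) print "skipped"
            else              print "no-bayes-output"
        }
    '
}
```

It would be used as: spamassassin -D -t < /tmp/email.txt 2>&1 | bayes_outcome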

Even if using DBM "works", I don't see this as a long-term solution --
only as a troubleshooting step. I would really like to keep my Bayes
data in a MySQL or PostgreSQL database.

Thanks again for the help!

-Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 4/19/2013 1:54 PM, Benny Pedersen wrote:
> Ben Johnson skrev den 2013-04-19 18:02:
>
>> Still stumped here...
>
> for amavisd-new, put spamassassin sql setup into user_prefs file for the
> user amavisd-new runs as might be working better then have insecure sql
> settings in /etc/mail/spamassassin :)
>
> i dont know if this is really that you have another user for amavisd,
> and test spamassassin -t msg with another user that uses another sql user ?
>
> make sure both users is really using same sql user as intended
>

Benny, thanks for the suggestion regarding moving the SA SQL setup into
user_prefs. I will look into that soon.

Yes, I believe that the system and I always execute SA commands as the
"amavis" user. When I was using the SQL setup, I had the following in
local.cf:

bayes_path /var/lib/amavis/.spamassassin/bayes

With the DBM setup, I had the following (I have since commented it out,
while attempting to debug this Bayes issue):

bayes_sql_override_username amavis

Is something more required to ensure that my mail system, which runs
under the "amavis" user, is always reading from and writing to the same DB?

Best regards,

-Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
Apologies for the rapid-fire here folks, but I wanted to correct something.

I had these backwards:

>> Yes, I believe that me and the system always execute SA commands as the
>> "amavis" user. When I was using the SQL setup, I had the following in
>> local.cf:
>>
>> bayes_path /var/lib/amavis/.spamassassin/bayes
>>
>> With the DBM setup, I had the following (I have since commented it out,
>> while attempting to debug this Bayes issue):
>>
>> bayes_sql_override_username amavis

I meant to say that I have *always* had

bayes_path /var/lib/amavis/.spamassassin/bayes

in local.cf, and using the SQL setup, I added

bayes_sql_override_username amavis

Sorry for the confusion!

-Ben



Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
So, the problem seems not to be SQL-specific, as it occurs with SQL or
flat-file DB.

Upon following Benny Pedersen's advice (to move SA configuration
directives from /etc/spamassassin/local.cf to
/var/lib/amavis/.spamassassin/user_prefs), I noticed something unusual:

$ ls -lah /var/lib/amavis/.spamassassin/
total 7.6M
drwx------ 2 amavis amavis 4.0K Apr 20 08:54 .
drwxr-xr-x 7 amavis amavis 4.0K Apr 20 08:56 ..
-rw------- 1 root root 8.0K Apr 20 08:33 bayes_journal
-rw------- 1 root root 1.3M Apr 20 00:09 bayes_seen
-rw------- 1 root root 4.8M Apr 20 08:29 bayes_toks
-rw-r--r-- 1 root root 799 Jun 28 2004 gtube.txt
-rw-r--r-- 1 amavis amavis 2.7K Apr 20 08:55 user_prefs

Welp, that'll do it! How those four files were set to root:root
ownership is beyond me, but that was certainly a factor. Maybe this was
a result of executing my training script as root (even though I had
hard-coded the bayes_path to use /var/lib/amavis/.spamassassin/bayes,
and when using SQL, hard-coded bayes_sql_override_username to use amavis)?

I changed ownership to amavis:amavis and now messages are being scored
with Bayes (all of them, from what I can tell so far).
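
A quick way to spot this kind of ownership problem is to list anything in
the directory that is not owned by the expected user. A minimal sketch (the
function name not_owned_by is hypothetical):

```shell
# List everything under directory $1 that is NOT owned by user $2.
# Empty output means ownership is consistent.
not_owned_by() {
    find "$1" ! -user "$2" -print
}
```

For example, not_owned_by /var/lib/amavis/.spamassassin amavis would have
printed the four root-owned files from the listing above.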

Also, I looked into the fact that I was running the cron job that trains
ham and spam as root. I did this only because the amavis user lacks
access to /var/vmail, which is where mail is stored on this system. (As
a corollary, I'm a bit curious as to how amavis is able to scan incoming
mail, given this lack of access -- maybe it does so using a pipe or some
other method that does not require access to /var/vmail.)

I think the disconnect was in the fact that I placed my custom
configuration directives in /etc/spamassassin/local.cf, when I should
have placed them in /var/lib/amavis/.spamassassin/user_prefs (for
message scoring) *and* /root/.spamassassin/user_prefs (for ham/spam
training). (Thanks for pointing out this mistake, Benny P.)

Putting my custom SA configuration directives in both of these files was
the only way I was able to train mail and score incoming messages using
the same credentials "across-the-board".

Once I did this, I was able to use SQL or flat-file DB with the same
exact results.

Is there a better way to achieve this consistency, aside from putting
duplicate content into /var/lib/amavis/.spamassassin/user_prefs and
/root/.spamassassin/user_prefs?

Feels like I'm out of the woods here! Thanks for all the expert help, guys.

-Ben
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
Ben Johnson skrev den 2013-04-20 04:40:

> By "feed it a few thousand NEW spams", do you mean to scrap the
> training
> corpora that I've hand-sorted in favor of starting over? Or do you
> mean
> to clear the database and re-run the training script against the
> corpora?

ls /path/to/maildir/spam >/tmp/spam
cd /path/to/maildir/spam
sa-learn --spam --progress -f /tmp/spam

ls /path/to/maildir/ham >/tmp/ham
cd /path/to/maildir/ham
sa-learn --ham --progress -f /tmp/ham

Do this for each Bayes user. Depending on how your setup is, this should
basically be all it takes to get Bayes on track again.

--
senders that put my email into body content will deliver it to my own
trashcan, so if you like to get reply, dont do it
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
Ben Johnson skrev den 2013-04-20 05:02:

> Yes, I believe that me and the system always execute SA commands as
> the
> "amavis" user. When I was using the SQL setup, I had the following in
> local.cf:
>
> bayes_path /var/lib/amavis/.spamassassin/bayes

Does amavis have its homedir in /var/lib/?

In Gentoo the default is /var/amavis, where the .spamassassin dir is
created by amavisd.

Using user_prefs to set bayes_path does not make sense if SQL is used.

> With the DBM setup, I had the following (I have since commented it
> out,
> while attempting to debug this Bayes issue):
>
> bayes_sql_override_username amavis

+1 to this one, since amavis can't use multiple SA users very easily, though
depending on the amavis version it is supported with complicated setups :(

I changed away from amavisd to clamav-milter and spampd in Postfix
after-queue; this is working very well for me, and I hope SA 3.4.x will not
break spampd :=)

> Is something more required to ensure that my mail system, which runs
> under the "amavis" user, is always reading from and writing to the
> same DB?

Nope, just remember that amavis also reads .spamassassin/user_prefs.

If you like, you can copy that user_prefs to /root/.spamassassin so you
don't have to remember anything :)

user_prefs should ONLY be readable by the user that runs it.
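
A quick check for that, as a sketch (the function name is_private is made
up; it only looks at the mode string printed by ls -ld):

```shell
# Succeed only if file $1 grants no permissions to group or others,
# i.e. the last six permission characters of `ls -ld` are all "-".
is_private() {
    case $(ls -ld "$1") in
        ????------*) return 0 ;;
        *)           return 1 ;;
    esac
}
```

If the check fails, chmod 600 on the file restricts it to its owner.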

--
senders that put my email into body content will deliver it to my own
trashcan, so if you like to get reply, dont do it
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
Ben Johnson skrev den 2013-04-20 19:01:

> Welp, that'll do it! How those four files were set to root:root
> ownership is beyond me,

That means that root has been doing some testing :)

After that, amavisd can't write to the files, so you should change to the
amavis user before testing:

su amavis -c 'cmd foo'

> but that was certainly a factor. Maybe this was
> a result of executing my training script as root

Yep, relearn scripts should run from the crontab of the amavisd user, to keep
the files owned by amavis; if run as root, the files would end up
owned by root.

Change the setup so the cron job is started by amavis, then it works.
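
One way to avoid repeating this mistake is to have the training script
refuse to run as the wrong user. A minimal sketch, with a hypothetical
function name:

```shell
# Refuse to continue unless the current user is $1.
# Returns 0 when the user matches, 1 (with a message on stderr) otherwise.
require_user() {
    current=$(id -un)
    if [ "$current" != "$1" ]; then
        echo "refusing to run as '$current'; expected '$1'" >&2
        return 1
    fi
}
```

At the top of the training script this would read: require_user amavis || exit 1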

> (even though I had
> hard-coded the bayes_path to use /var/lib/amavis/.spamassassin/bayes,
> and when using SQL, hard-coded bayes_sql_override_username to use
> amavis)?

Do you want SQL Bayes? You are still using a DBM-based Bayes setup, but SQL
Bayes does not use bayes_path.

> I changed ownership to amavis:amavis and now messages are being
> scored
> with Bayes (all of them, from what I can tell so far).
>
> Also, I looked into the fact that I was running the cron job that
> trains
> ham and spam as root. I did this only because the amavis user lacks
> access to /var/vmail, which is where mail is stored on this system.
> (As
> a corollary, I'm a bit curious as to how amavis is able to scan
> incoming
> mail, given this lack of access -- maybe it does so using a pipe or
> some
> other method that does not require access to /var/vmail.)

If you use SQL-based Bayes, then you can change the learn script to be
run by the vmail user; problem solved. Remember that vmail should then have
a user_prefs :)

> I think the disconnect was in the fact that I placed my custom
> configuration directives in /etc/spamassassin/local.cf, when I should
> have placed them in /var/lib/amavis/.spamassassin/user_prefs (for
> message scoring) *and* /root/.spamassassin/user_prefs (for ham/spam
> training). (Thanks for pointing-out this mistake, Benny P.)

Yep, this is a common error. I also remember pyzord was set up in the latest
ebuild to run as root, but hey, it uses a UDP port above 1023 :)

> Putting my custom SA configuration directives in both of these files
> was
> the only way I was able to train mail and score incoming messages
> using
> the same credentials "across-the-board".

Is it possible to use dovecot-antispam? Then it will call sa-learn per
message, as the user that owns vmail. I dropped it since I still have not
upgraded to Dovecot 2.x.

> Once I did this, I was able to use SQL or flat-file DB with the same
> exact results.

progress ?

> Is there a better way to achieve this consistency, aside from putting
> duplicate content into /var/lib/amavis/.spamassassin/user_prefs and
> /root/.spamassassin/user_prefs?

Nope, this is the perfect way, also security-wise.

> Feels like I'm out of the woods here! Thanks for all the expert help,
> guys.

+1

--
senders that put my email into body content will deliver it to my own
trashcan, so if you like to get reply, dont do it
Re: Calling spamassassin directly yields very different results than calling spamassassin via amavis-new
On 4/20/2013 3:20 PM, Benny Pedersen wrote:
> Ben Johnson skrev den 2013-04-20 05:02:
>
>> Yes, I believe that me and the system always execute SA commands as the
>> "amavis" user. When I was using the SQL setup, I had the following in
>> local.cf:
>>
>> bayes_path /var/lib/amavis/.spamassassin/bayes
>
> is amavis have homedir in /var/lib/ ?

The amavis user's home directory is /var/lib/amavis. This seems to be
the default on Ubuntu; I didn't change this path.

> in gentoo its default as /var/amavis where the .spamassassin dir is
> created by amavisd
>
> use user_prefs to set bayes_path does not make sense if sql is used
>

Thanks; I did comment-out the "bayes_path" directive. I figured that it
wouldn't matter whether it is commented or not, in the presence of
SQL-related directives, but it can't hurt to comment-out this line.

>> With the DBM setup, I had the following (I have since commented it out,
>> while attempting to debug this Bayes issue):
>>
>> bayes_sql_override_username amavis
>
> +1 to this one since amavis cant use multiple sa users very easy, but
> depending on what amavis it being supported with complicated setups :(

I only need one Bayes user, so this setup is adequate.

> i changed away from amavisd to clamav-milter, spampd in postfix after
> queue, this is working very well for me, and i hope sa 3.4.x will not
> break spampd :=)

Hmm, I will consider your sound advice in this regard. amavis does seem
to be overly memory-hungry (despite setting $max_servers = 1 and
$max_requests = 1). If there is a better alternative, I'm all ears (or
eyes, as the case may be).

>> Is something more required to ensure that my mail system, which runs
>> under the "amavis" user, is always reading from and writing to the
>> same DB?
>
> nope just remember that amavis also reads .spamassassin/user_prefs
>
> if you like you can copy that user_prefs to /root/.spamassassin so you
> dont have to remember something :)
>
> user_prefs should ONLY be readble by that user that runs it
>

Thanks for pointing this out. I will double-check the permissions.

I'll respond to your other email momentarily.

Thanks, Benny!

-Ben
