Mailing List Archive

per-user bayes
Hi all,

I've got a site-wide bayes mysql setup. It keeps getting poisoned
quickly, because the user patterns are far too divergent from each
other. One person's spam is another person's ham, nobody is happy.

A per-user setup would let each user do their own thing, but I don't see
how I can do that because our system doesn't have individual system
users and I don't see that there are options in the bayes sql
configuration or per-user tables possible.

There is this bayes_sql_override_username configuration option, but this
is a configuration option that I can only set once, and is not
dynamic. There is this hint in the documentation that you can also use
this config option to trick sa-learn to learn data as a specific user,
but there is not much more information.

Has someone out there done this, and can show how you have done it?

At this point my options are to turn down the score for bayes so it has
less of an impact, maybe turn off bayes auto-learning, or simply
disable bayes altogether.

thanks for any information

--
micah
Re: per-user bayes
On 07 Dec 2020, at 13:56, micah anderson <micah@riseup.net> wrote:
> A per-user setup would let each user do their own thing, but I don't see
> how I can do that because our system doesn't have individual system
> users and I don't see that there are options in the bayes sql
> configuration or per-user tables possible.

This may help

<http://svn.apache.org/repos/asf/spamassassin/branches/duncf_masses/sql/README.bayes>

--
"Dignity intact! Dignity intact!" -- Aisling Bee, dancing on a pier in her pants.
Re: per-user bayes
Hi

> This may help
>
> <http://svn.apache.org/repos/asf/spamassassin/branches/duncf_masses/sql/README.bayes>

I sort of have the same issue. Unfortunately that does not help, it
merely explains how to store bayes data in a database. But there is
still only one 'global' database on your mail platform which applies to
all your customers.

Especially in Switzerland, with four national languages, this causes
the bayes filter not to be very efficient.

What we would need is a way for the bayes module to store
bayes data per 'recipient', not just globally.

So SpamAssassin would need to somehow pass the recipient(s) to the bayes
module.

Kind regards

-Benoît Panizzon-
--
I m p r o W a r e A G - Leiter Commerce Kunden
______________________________________________________

Zurlindenstrasse 29 Tel +41 61 826 93 00
CH-4133 Pratteln Fax +41 61 826 93 01
Schweiz Web http://www.imp.ch
______________________________________________________
Re: per-user bayes
Hi

Adding the list back to CC as I believe this is an interesting topic
many have pondered over.

Yes, I see that it states 'per user' but I still don't see how that
'bayes user' is being set on a per-recipient basis.

On the email platform there is ONE config file for spamassassin. So if I
set the user with:

bayes_sql_override_username someusername

That is the username under which the bayes data is being stored for all
recipients (thousands of mailboxes on a big ISP mailserver)

How do I tell SpamAssassin to pass the recipient to the bayes
filter while scanning an email?

Kind regards

-Benoît Panizzon-
Re: per-user bayes
On 08 Dec 2020, at 08:36, Benoit Panizzon <benoit.panizzon@imp.ch> wrote:
> Adding the list back to CC as I believe this is an interesting topic
> many have pondered over.

Forgot to fix the reply to on this list for some reason. Fixed now.

> Yes, I see that it states 'per user' but I still don't see how that
> 'bayes user' is being set on a per-recipient basis.
>
> On the email platform there is ONE config file for spamassassin. So if I
> set the user with:
>
> bayes_sql_override_username someusername
>
> That is the username under which the bayes data is being stored for all
> recipients (thousands of mailboxes on a big ISP mailserver)


It can be. It can also be, for example, %u (It may be more complicated than that). Or perhaps sa_username_maps?

> How do I tell SpamAssassin to pass the recipient to the bayes
> filter while scanning an email?

Through the SQL query, IIRC.

--
Nothing like grilling a kosher dog over human hair to bring out the
subtle flavors.
Re: per-user bayes
Benoit Panizzon wrote:
> Hi
>
>> This may help
>>
>> <http://svn.apache.org/repos/asf/spamassassin/branches/duncf_masses/sql/README.bayes>
>
> I sort of have the same issue. Unfortunately that does not help, it
> merely explains how to store bayes data in a database. But there is
> still only one 'global' database on your mail platform which applies to
> all your customers.
>
> Especially in Switzerland, with four national languages, this causes
> the bayes filter not to be very efficient.
>
> What we would need is a way for the bayes module to store
> bayes data per 'recipient', not just globally.
>
> So SpamAssassin would need to somehow pass the recipient(s) to the bayes
> module.

When using spamc/spamd, this is the default, so long as each user
calling spamc has a unique argument for -u (or a distinct local Unix
user on the calling system, since spamc will automatically set the user
to the local Unix username when called without -u).
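
A minimal sketch of such a spamc call, in Python, assuming a filter hook
that already knows the envelope recipient. The helper names
(spamc_command, scan) are made up for illustration, and lower-casing the
address is an assumption so that differently-cased spellings share one
Bayes user. spamd also has to be set up to accept usernames that are not
local Unix accounts.

```python
import subprocess
from typing import List


def spamc_command(recipient: str) -> List[str]:
    """Build a spamc command line that scans as the given recipient."""
    # Assumption: normalise case so "User@Example.org" and
    # "user@example.org" end up under the same Bayes user.
    user = recipient.strip().lower()
    # -u sets the username that spamc passes on to spamd.
    return ["spamc", "-u", user]


def scan(message: bytes, recipient: str) -> bytes:
    """Pipe one message through spamc and return the processed message."""
    result = subprocess.run(
        spamc_command(recipient),
        input=message,            # spamc reads the message on stdin
        stdout=subprocess.PIPE,   # and writes the result to stdout
        check=False,
    )
    return result.stdout
```

A multi-recipient message would need one call per recipient, which is
the efficiency cost mentioned further down.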

There will only be one database and set of tables, but one of the fields
in each table is the user identifier. Fair warning - if you go full
per-user on a large system, this will MASSIVELY balloon the size of your
Bayes database, and most users will idle below the learning thresholds
for quite a long time.

Here, we get per-user behaviours when calling SA from MIMEDefang for
outbound mail by replacing the stock library-level integration with a
custom call to spamc. (As it happens we share a Bayes DB between
inbound and outbound mail, and use bayes_sql_override_username to force
it to be sitewide instead of per-user.)

IIRC Amavis has some support for doing this when calling SA through the
library interface, but you lose the efficiency benefit of only calling
SA once on multirecipient messages.

After a bit of searching and reading I suspect you'd either have to just
convert the library call into a spamc call, or port huge chunks of spamd
internals into Amavis or MIMEDefang to get them to do library-level
per-user SA processing.

-kgd
Re: per-user bayes
Kris Deugau <kdeugau@vianet.ca> writes:

> There will only be one database and set of tables, but one of the fields
> in each table is the user identifier. Fair warning - if you go full
> per-user on a large system, this will MASSIVELY balloon the size of your
> Bayes database, and most users will idle below the learning thresholds
> for quite a long time.

Can you give an idea of the size calculation? I'm wanting to do this,
but I need to figure out how much space I need to allocate per user!

Thanks for the clarifications, this is super helpful.

--
micah
Re: per-user bayes
micah anderson wrote on 2020-12-08 21:54:
> Kris Deugau <kdeugau@vianet.ca> writes:
>
>> There will only be one database and set of tables, but one of the
>> fields
>> in each table is the user identifier. Fair warning - if you go full
>> per-user on a large system, this will MASSIVELY balloon the size of
>> your
>> Bayes database, and most users will idle below the learning thresholds
>> for quite a long time.
>
> Can you give an idea of the size calculation? I'm wanting to do this,
> but I need to figure out how much space I need to allocate per user!
>
> Thanks for the clarifications, this is super helpful.

I use fuglu, where per-user bayes is simple. Now that fuglu has fixed
the problem, so that the recipient's envelope address is used
case-insensitively in the bayes user database, it just works with fuglu.

But there is more on my wish list: I do not yet have per-user retraining
of incorrectly classified mails; currently only autolearn is working.

With global bayes one should keep the database as big as possible and
well trained for all users. Manual training would be best; it is
just a lot of time that users need to spend for very little gain.

fuglu does use spamd, and if I recall correctly also spamc. I have
verified it is running per-user now.

Let's see if MIMEDefang can do it better.

In amavisd you can make sasl usermaps to use bayes user maps. I know it
exists, but I have never successfully got that to work.
Re: per-user bayes
I believe that an SA plugin (like bayes) is able to know the envelope MAIL
FROM and RCPT TO values. Is that correct? If so, we "just" have
to modify the bayes plugin.

On Tue, Dec 8, 2020 at 10:13 PM Benny Pedersen <me@junc.eu> wrote:

> I use fuglu, where per-user bayes is simple. Now that fuglu has fixed
> the problem, so that the recipient's envelope address is used
> case-insensitively in the bayes user database, it just works with fuglu.
Re: per-user bayes
On 08 Dec 2020, at 13:54, micah anderson <micah@riseup.net> wrote:
> Kris Deugau <kdeugau@vianet.ca> writes:

>> There will only be one database and set of tables, but one of the fields
>> in each table is the user identifier. Fair warning - if you go full
>> per-user on a large system, this will MASSIVELY balloon the size of your
>> Bayes database, and most users will idle below the learning thresholds
>> for quite a long time.

> Can you give an idea of the size calculation? I'm wanting to do this,
> but I need to figure out how much space I need to allocate per user!

That would be pretty hard to predict as it would vary a lot based on the users and the mail.

I don't think Bayes is really that big (a few MB max?)

--
Varium et mutabile semper Femina.
Re: per-user bayes
On 2020-12-09 4:41 am, @lbutlr wrote:

> On 08 Dec 2020, at 13:54, micah anderson <micah@riseup.net> wrote:
>> Kris Deugau <kdeugau@vianet.ca> writes:
>>> There will only be one database and set of tables, but one of the fields
>>> in each table is the user identifier. Fair warning - if you go full
>>> per-user on a large system, this will MASSIVELY balloon the size of your
>>> Bayes database, and most users will idle below the learning thresholds
>>> for quite a long time.
>>
>> Can you give an idea of the size calculation? I'm wanting to do this,
>> but I need to figure out how much space I need to allocate per user!
>
> That would be pretty hard to predict as it would vary a lot based on the
> users and the mail.
>
> I don't think Bayes is really that big (a few MB max?)

It's not big. Here's my personal spamassassin database (just a few
users, but SA has been running for years and years): about 48 MB.

mysql> SELECT TABLE_NAME AS `Table`,
    ->        ROUND((DATA_LENGTH + INDEX_LENGTH) / 1024) AS `Size (KB)`
    ->   FROM information_schema.TABLES
    ->  WHERE TABLE_SCHEMA = "spamassassin"
    ->  ORDER BY (DATA_LENGTH + INDEX_LENGTH) DESC;
+-------------------+-----------+
| Table             | Size (KB) |
+-------------------+-----------+
| bayes_token       |     48160 |
| awl               |      1040 |
| bayes_vars        |        32 |
| bayes_seen        |        16 |
| bayes_global_vars |        16 |
| bayes_expire      |        16 |
+-------------------+-----------+
6 rows in set (0.00 sec)
Re: per-user bayes
On 2020-12-09 9:48 am, deano-spamassassin@areyes.com wrote:

> It's not big. Here's my personal spamassassin database (just a few
> users, but SA has been running for years and years): about 48 MB.


I did it again on a test server - same corpus, latest SA etc. It's been
trained on ham/spam.

MariaDB [spamassassin]> SELECT TABLE_NAME AS `Table`,
    ->        ROUND((DATA_LENGTH + INDEX_LENGTH) / 1024 / 1024) AS `Size (MB)`
    ->   FROM information_schema.TABLES
    ->  WHERE TABLE_SCHEMA = "spamassassin"
    ->  ORDER BY (DATA_LENGTH + INDEX_LENGTH) DESC;
+-------------------+-----------+
| Table             | Size (MB) |
+-------------------+-----------+
| bayes_token       |       118 |
| txrep             |        17 |
| bayes_seen        |         3 |
| bayes_vars        |         0 |
| awl               |         0 |
| bayes_expire      |         0 |
| bayes_global_vars |         0 |
+-------------------+-----------+
7 rows in set (0.001 sec)

So a bit bigger.
Re: per-user bayes
hg user wrote on 2020-12-09 08:57:
> I believe that an SA plugin (like bayes) is able to know the envelope
> MAIL FROM and RCPT TO values. Is that correct? If so, we
> "just" have to modify the bayes plugin.

provide this patch first and ask later :=)

bayes does not focus on specific headers
Re: per-user bayes [ In reply to ]
micah anderson wrote:
> Kris Deugau <kdeugau@vianet.ca> writes:
>
>> There will only be one database and set of tables, but one of the fields
>> in each table is the user identifier. Fair warning - if you go full
>> per-user on a large system, this will MASSIVELY balloon the size of your
>> Bayes database, and most users will idle below the learning thresholds
>> for quite a long time.
>
> Can you give an idea of the size calculation? I'm wanting to do this,
> but I need to figure out how much space I need to allocate per user!

The SA docs estimate 5-10M per user for file-based per-user Bayes with
the default token expiry settings. I'd expect about the same in SQL,
with anywhere up to 3x bloat over time due to token churn. (Checking my
personal mailbox, I have just over 5M in bayes_token, but bayes_seen
has grown over time to 83M. However, the message-ids stored there
aren't being expired.)

Sitewide, with ~1.7M active tokens (expiry set at 2.1M currently), the
database occupies about 342M on disk here, with a 156M SQL dump. This
comes out to about 200 bytes per token of used storage. A single user
with default settings (and plenty of learning) will probably settle down
to somewhere between ~110K and ~140K tokens, so you can probably expect
their data to occupy anywhere from the minimal 5M on up to close to 30M.
Multiply by the number of users and that's what you would have to look
at provisioning for storage. Even at a minimal steady-state you're
likely looking at 100G for 20K users.
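
The arithmetic behind those figures can be sketched as follows. This
only replays the numbers quoted above (342M on disk, ~1.7M tokens, 110K
to 140K tokens per user, 5M minimum per user); the function names are
illustrative and decimal megabytes are assumed for simplicity.

```python
MB = 1_000_000  # assumption: decimal megabytes, these are rough figures


def bytes_per_token(db_bytes: float, tokens: int) -> float:
    """Average on-disk storage used per Bayes token."""
    return db_bytes / tokens


def per_user_bytes(tokens: int, bytes_each: int = 200) -> int:
    """Estimated storage for one user's tokens at ~200 bytes per token."""
    return tokens * bytes_each


def total_gb(users: int, per_user_mb: float) -> float:
    """Total storage in GB for a number of users at a given MB per user."""
    return users * per_user_mb / 1000


# Sitewide: 342M on disk for ~1.7M active tokens, roughly 200 bytes each.
sitewide = bytes_per_token(342 * MB, 1_700_000)

# One well-trained user: roughly 110K to 140K tokens.
low = per_user_bytes(110_000) / MB    # 22.0 MB
high = per_user_bytes(140_000) / MB   # 28.0 MB

# Minimal steady state of 5M per user across 20K users.
minimum = total_gb(20_000, 5)         # 100.0 GB

print(round(sitewide), low, high, minimum)
```

The 22M to 28M range is for token data alone; the "close to 30M" upper
figure above leaves a little room for the other tables.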

If you have more than a handful of users, you're probably better off
looking for ways to group your users with a small number of Bayes
datasets rather than full-on per-user. I haven't tried, but you might
be able to use bayes_sql_override_username in userprefs (also storable
in SQL) to assign users to a particular dataset, with a fallback to a
global default. The documentation reads to me like this should work
(note the last sentence):

    bayes_sql_override_username
        Used by BayesStore::SQL storage implementation.

        If this options is set the BayesStore::SQL module will
        override the set username with the value given. This could
        be useful for implementing global or group bayes databases.
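
The grouping idea can be sketched like this: map each recipient to one
of a few shared Bayes dataset names, with a global fallback. The
domain-to-dataset table and every name in it are hypothetical, and
whether bayes_sql_override_username is honoured from userprefs remains
untested, as noted above.

```python
from typing import Dict

# Hypothetical name of the fallback dataset shared by everyone else.
DEFAULT_DATASET = "global"

# Hypothetical grouping: recipient domain -> shared Bayes dataset name.
DATASET_BY_DOMAIN: Dict[str, str] = {
    "example.ch": "bayes_de",
    "example.fr": "bayes_fr",
}


def bayes_dataset_for(recipient: str) -> str:
    """Pick the Bayes dataset name for a recipient, falling back to global."""
    # Take everything after the last "@" and compare case-insensitively.
    domain = recipient.rsplit("@", 1)[-1].lower()
    return DATASET_BY_DOMAIN.get(domain, DEFAULT_DATASET)
```

The returned name is the value one would try as
bayes_sql_override_username for that user. Grouping by domain is only
one possible choice; language or customer would work the same way.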

-kgd