Mailing List Archive

Workflow for adding new ham/spam to existing site-wide database?
I have been accumulating spam/ham samples and sorting them out into
different directories on my server. As new spam/ham comes in, I throw it
into the existing pile and then run "sa-learn --spam|--ham" on the whole
pile.

It dawned on me that this will get very slow as I eventually collect
tens of thousands of emails. So I'm wondering if it's better to:

1) Place all new, incoming spam/ham into empty directories
2) Run sa-learn only on these directories with small samples
3) Once done, move these new emails to an archive of spam/ham samples
4) Repeat

Is this typically how it's done?
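For concreteness, the four steps above could be scripted roughly like this. This is only a sketch: the directory layout and the function name are made up, and error handling is minimal.

```shell
#!/bin/sh
# Rough sketch of steps 1-4 above. Paths are hypothetical; adjust to
# your own layout.

# train_batch SRC_DIR ARCHIVE_DIR CLASS
#   Runs "sa-learn --CLASS" on SRC_DIR (the small batch of new mail),
#   then moves the freshly trained messages into ARCHIVE_DIR.
train_batch() {
    src=$1; archive=$2; class=$3
    sa-learn "--$class" "$src" || return 1
    mkdir -p "$archive"
    find "$src" -maxdepth 1 -type f -exec mv {} "$archive/" \;
}

# Example invocation, e.g. from a nightly cron job:
#   train_batch /srv/corpus/new-spam /srv/corpus/archive/spam spam
#   train_batch /srv/corpus/new-ham  /srv/corpus/archive/ham  ham
```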
Re: Workflow for adding new ham/spam to existing site-wide database?
On Tue, 16 Mar 2021 13:16:49 -0400
Steve Dondley wrote:

> I have been accumulating spam/ham samples and sorting them out into
> different directories on my server. As new spam/ham comes in, I throw
> it into the existing pile and then run "sa-learn --spam|--ham" on the
> whole pile.
>
> It dawned on me that this will get very slow as I eventually collect
> tens of thousands of emails. So I'm wondering if it's better to:
>
> 1) Place all new, incoming spam/ham into empty directories
> 2) Run sa-learn only on these directories with small samples

Why with small samples? Just train on new spam and ham and then
move them.


> 3) Once done, move these new emails to an archive of spam/ham samples
> 4) Repeat
Re: Workflow for adding new ham/spam to existing site-wide database?
You covered a lot of ground here. Thanks. If you have some spare
cycles, I have follow up questions to get an understanding of how you
process your email:

> 21 seconds, and that includes fetching the samples via IMAP from two
> folders, firing them against a bayes-only SpamAssassin instance,

What is a "bayes-only" instance? I don't follow. What other kinds of
instances are there?


> ignore
> BAYES_00/BAYES_99 messages, move the rest to both training
> folders, anonymize them, strip useless headers, fire sa-learn against

OK, so it looks like you are suggesting that emails get kind of
pre-screened to determine if they are obvious spam or not.

And by anonymize, what do you mean? Remove the headers that contain
email addresses? What other headers are useless? What exactly is the
goal of anonymizing and removing the headers? I think I have a vague
idea why but can't quite crystallize it in my head.

> both folders, fire bogofilter training against both folders and verify
> that the new sample files score with BAYES_99/BAYES_00 now

Bogofilter training?

So the goal is to get all the new emails to score either BAYES_99 (spam)
or BAYES_00 (ham).

So once I verify they score BAYES_00 or BAYES_99, do I then throw them
onto the larger collection of ham/spam with all headers restored? And
what do I do if they still don't score BAYES_00 or BAYES_99?
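If I'm reading the suggestion right, the pre-screening step might look something like this. A sketch only, and the function name is mine: it checks whether Bayes already scores a sample confidently before bothering to train on it.

```shell
# Hypothetical sketch of the pre-screening idea: run a new sample
# through SpamAssassin and only queue it for training if Bayes is NOT
# already confident about it (BAYES_00 for ham, BAYES_99 for spam).
# The tests that fired appear in spamassassin's report output, so a
# simple grep over that output is enough for a rough filter.

# needs_training MSG_FILE  ->  exits 0 if the message should be trained
needs_training() {
    msg=$1
    if spamassassin -t < "$msg" | grep -qE 'BAYES_(00|99)'; then
        return 1   # already scored confidently; retraining adds little
    fi
    return 0
}
```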
Re: Workflow for adding new ham/spam to existing site-wide database?
On Tue, 16 Mar 2021 15:33:58 -0400
Steve Dondley wrote:

> You covered a lot of ground here. Thanks. If you have some spare
> cycles, I have follow up questions to get an understanding of how you
> process your email:
>

I presume this is a reply to Harold, in which case I would take it
with a pinch of salt. He's banned from this list and, I gather, many
others. He has some idiosyncratic ideas which he defends with
Anglo-Saxon invective.

I've never seen any sign that this is because he's a troubled genius.
Re: Workflow for adding new ham/spam to existing site-wide database?
On 16.03.21 13:16, Steve Dondley wrote:
>I have been accumulating spam/ham samples and sorting them out into
>different directories on my server. As new spam/ham comes in, I throw
>it into the existing pile and then run "sa-learn --spam|--ham" on the
>whole pile.
>
>It dawned on me that this will get very slow as I eventually collect
>tens of thousands of emails. So I'm wondering if it's better to:
>
>1) Place all new, incoming spam/ham into empty directories
>2) Run sa-learn only on these directories with small samples
>3) Once done, move these new emails to an archive of spam/ham samples
>4) Repeat
>
>Is this typically how it's done?

I mostly take care of false positives, false negatives,
near-false-negatives that don't hit BAYES_999, and phish.

That means that once your Bayes is well trained, occasional retraining
is necessary; at multiple sites I've seen that training one false
negative is enough to move similar mail from BAYES_50 to BAYES_999.

--
Matus UHLAR - fantomas, uhlar@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
- Holmes, what kind of school did you study to be a detective?
- Elementary, Watkins. -- Daffy Duck & Porky Pig
Re: Workflow for adding new ham/spam to existing site-wide database?
Steve Dondley wrote:
> I have been accumulating spam/ham samples and sorting them out into
> different directories on my server. As new spam/ham comes in, I throw it
> into the existing pile and then run "sa-learn --spam|--ham" on the whole
> pile.
>
> It dawned on me that this will get very slow as I eventually collect
> tens of thousands of emails. So I'm wondering if it's better to:
>
> 1) Place all new, incoming spam/ham into empty directories
> 2) Run sa-learn only on these directories
> 3) Once done, move these new emails to an archive of spam/ham samples
> 4) Repeat
>
> Is this typically how it's done?

Common advice both here and in most of the documentation seems to be as
above, with the minor edit I made on point 2.

My own experience has been that accumulating blobs of ham/spam and just
repeatedly running sa-learn over those works just fine. It also reduces
the incidence of tokens from somewhat rarer mail automatically expiring
out of Bayes, leading to FPs and FNs.

I maintain a long-term ham directory plus a rolling directory[*] of FNs
for Bayes learning, plus I've been filing a number of spam subtypes in
their own long-term directories for automatic rule generation with the
SOUGHT tools in the SA source tree. I keep the long-term subtype
folders indefinitely because I've found some types of spam seem to come
and go from the FN stream I see.

-kgd
[*] Newly reported FNs get added, older ones (>90 days IIRC) get
automatically moved to an archive folder on a daily basis for eventual
manual deletion.
Re: Workflow for adding new ham/spam to existing site-wide database?
On Wed, 17 Mar 2021 10:42:14 -0400
Kris Deugau wrote:


> My own experience has been that accumulating blobs of ham/spam and
> just repeatedly running sa-learn over those works just fine. It also
> reduces the incidence of tokens from somewhat rarer mail
> automatically expiring out of Bayes, leading to FPs and FNs.

It won't do that by default. You would need something that removes the
signature hashes from the database.
Re: Workflow for adding new ham/spam to existing site-wide database?
Are you able to submit your spam to PCC / KAM?
That way the community as a whole can benefit.

Regards
Brent

On 2021/03/16 19:16, Steve Dondley wrote:
> I have been accumulating spam/ham samples and sorting them out into
> different directories on my server. As new spam/ham comes in, I throw
> it into the existing pile and then run "sa-learn --spam|--ham" on the
> whole pile.
>
> It dawned on me that this will get very slow as I eventually collect
> tens of thousands of emails. So I'm wondering if it's better to:
>
> 1) Place all new, incoming spam/ham into empty directories
> 2) Run sa-learn only on these directories with small samples
> 3) Once done, move these new emails to an archive of spam/ham samples
> 4) Repeat
>
> Is this typically how it's done?
Re: Workflow for adding new ham/spam to existing site-wide database?
>On Wed, 17 Mar 2021 10:42:14 -0400 Kris Deugau wrote:

>> My own experience has been that accumulating blobs of ham/spam and
>> just repeatedly running sa-learn over those works just fine. It also
>> reduces the incidence of tokens from somewhat rarer mail
>> automatically expiring out of Bayes, leading to FPs and FNs.

On 17.03.21 22:01, RW wrote:
>It won't do that by default. You would need something that removes the
>signature hashes from the database.

oh, yes, it does:

bayes_auto_expire (default: 1)
If enabled, the Bayes system will try to automatically expire old
tokens from the database. Auto-expiry occurs when the number of
tokens in the database surpasses the bayes_expiry_max_db_size
value. If a bayes datastore backend does not implement individual
key/value expirations, the setting is silently ignored.

Note that multiple people have reported long delivery times when
expiration occurs, and it's often recommended to turn this off and do
the expiration from a cron job instead.
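For example, something along these lines (the schedule is arbitrary; sa-learn's --force-expire option runs a synchronous expiry pass):

```
# In local.cf, turn off in-process auto-expiry:
bayes_auto_expire 0

# Then run the expiry explicitly from cron instead, e.g. nightly:
30 3 * * *   sa-learn --force-expire
```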

A Bayes database stored in Redis does not have this issue.

--
Matus UHLAR - fantomas, uhlar@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
The early bird may get the worm, but the second mouse gets the cheese.
Re: Workflow for adding new ham/spam to existing site-wide database?
Matus UHLAR - fantomas wrote:
>> On Wed, 17 Mar 2021 10:42:14 -0400 Kris Deugau wrote:
>
>>> My own experience has been that accumulating blobs of ham/spam and
>>> just repeatedly running sa-learn over those works just fine.  It also
>>> reduces the incidence of tokens from somewhat rarer mail
>>> automatically expiring out of Bayes, leading to FPs and FNs.
>
> On 17.03.21 22:01, RW wrote:
>> It won't do that by default. You would need something that removes
>> the signature hashes from the database.
>
> oh, yes, it does:
>
>       bayes_auto_expire             (default: 1)
>           If enabled, the Bayes system will try to automatically expire
>           old tokens from the database. Auto-expiry occurs when the
>           number of tokens in the database surpasses the
>           bayes_expiry_max_db_size value. If a bayes datastore backend
>           does not implement individual key/value expirations, the
>           setting is silently ignored.
>
> Note that multiple people have reported long delivery times when
> expiration occurs, and it's often recommended to turn this off and do
> the expiration from a cron job instead.
>
> A Bayes database stored in Redis does not have this issue.

That option only controls when Bayes expiry is run, not what gets
expired when it does happen.

Thinking about it more, I may have conflated several processes that any
long-lived Bayes DB will experience.

- Token expiry will happen automatically out of the box as above, or
manually as scheduled (for BDB or SQL backends - IIRC the Redis backend
uses a Redis feature to expire tokens automatically). Historically
autoexpire worked well for file-based per-user or (very) small sitewide
DBs, but very poorly for larger ones (even a couple of tens to hundreds of
users) due to the strict locking and extra time taken while processing a
message. I'm not sure this is so much of a problem with an SQL back end.

- The list of "seen" messages (bayes_seen file or DB table - not sure
what the Redis backend uses) may grow without limit unless manually
trimmed. On a long-lived Bayes DB this can get very large indeed.

Between these two items, rarely matched/learned tokens will tend to
expire out (by design - I have no problem with this), but even with a
(much) larger bayes_expiry_max_db_size there will be a few more FNs or
FPs if these tokens are more or less permanently expired. Keeping them
in circulation by re-learning the same mail over and over helps nudge
the overall accuracy just a little closer to that impossible "perfect"
filter that catches all the spam and none of the ham.

This is just from my own experience, although some things may have been
refined and changed since Bayes was first introduced in 2.x (2.4? 2.6?
don't recall any more).

-kgd
Re: Workflow for adding new ham/spam to existing site-wide database?
>>>On Wed, 17 Mar 2021 10:42:14 -0400 Kris Deugau wrote:
>>>>My own experience has been that accumulating blobs of ham/spam and
>>>>just repeatedly running sa-learn over those works just fine. It also
>>>>reduces the incidence of tokens from somewhat rarer mail
>>>>automatically expiring out of Bayes, leading to FPs and FNs.
>>
>>On 17.03.21 22:01, RW wrote:
>>>It won't do that by default. You would need something that removes
>>>the signature hashes from the database.

>Matus UHLAR - fantomas wrote:
>>oh, yes, it does:
>>
>>       bayes_auto_expire             (default: 1)
>>           If enabled, the Bayes system will try to automatically expire old
>>           tokens from the database. Auto-expiry occurs when the number of
>>           tokens in the database surpasses the bayes_expiry_max_db_size
>>           value. If a bayes datastore backend does not implement individual
>>           key/value expirations, the setting is silently ignored.
>>
>>Note that multiple people have reported long delivery times when
>>expiration occurs, and it's often recommended to turn this off and do
>>the expiration from a cron job instead.
>>
>>A Bayes database stored in Redis does not have this issue.

On 18.03.21 11:02, Kris Deugau wrote:
>That option only controls when Bayes expiry is run, not what gets
>expired when it does happen.

It says that the expiry is run automatically when certain conditions are met.

Of course it doesn't affect how expiry works; other options control that.

>Thinking more I may have conflated several actions that any long-lived
>Bayes DB will have to experience.
>
>- Token expiry will happen automatically out of the box as above, or
>manually as scheduled (for BDB or SQL backends - IIRC the Redis
>backend uses a Redis feature to expire tokens automatically).
>Historically autoexpire worked well for file-based per-user or (very)
>small sitewide DBs, but very poorly for larger ones (even a couple
>tens to hundreds of users) due to the strict locking and extra time
>taken while processing a message. I'm not sure this is so much of a
>problem with an SQL back end.

That's what I wrote above about long delivery times. Maybe I should
have explained it in more depth (there were more problems, but I don't
remember the details).

Manual expiration does not work with Redis, so with Redis
bayes_auto_expire should be left at 1 (the default; simply don't turn
it off).

>- The list of "seen" messages (bayes_seen file or DB table - not sure
>what the Redis backend uses) may grow without limit unless manually
>trimmed. On a long-lived Bayes DB this can get very large indeed if
>not.
>
>Between these two items, rarely matched/learned tokens will tend to
>expire out (by design - I have no problem with this), but even with a
>(much) larger bayes_expiry_max_db_size there will be a few more FNs or
>FPs if these tokens are more or less permanently expired. Keeping
>them in circulation by re-learning the same mail over and over helps
>nudge the overall accuracy just a little closer to that impossible
>"perfect" filter that catches all the spam and none of the ham.

So, in fact, you want to keep tokens fresh.

Another solution would be to increase their TTL and the database size.

I wonder whether the TTL is updated when tokens are used (so that only
unused tokens get expired).

>This is just from my own experience, although some things may have
>been refined and changed since Bayes was first introduced in 2.x (2.4?
>2.6? don't recall any more).

--
Matus UHLAR - fantomas, uhlar@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
A day without sunshine is like, night.
Re: Workflow for adding new ham/spam to existing site-wide database?
On Thu, 18 Mar 2021 14:01:28 +0100
Matus UHLAR - fantomas wrote:

> >On Wed, 17 Mar 2021 10:42:14 -0400 Kris Deugau wrote:
>
> >> My own experience has been that accumulating blobs of ham/spam and
> >> just repeatedly running sa-learn over those works just fine. It
> >> also reduces the incidence of tokens from somewhat rarer mail
> >> automatically expiring out of Bayes, leading to FPs and FNs.
>
> On 17.03.21 22:01, RW wrote:
> >It won't do that by default. You would need something that removes
> >the signature hashes from the database.
>
> oh, yes, it does:
>
> bayes_auto_expire (default: 1)

I meant that sa-learn will ignore mail that's already been trained. So,
by default, rerunning it over a corpus that has already been trained
won't prevent any tokens from expiring.

Redis does support ageing-out signatures, but I don't see why you would
want to retrain on old mail at the expense of losing tokens from new
mail. You'll also end up with a database where very old emails have
been trained many times while recent, more relevant FPs & FNs have
only been trained once.
Re: Workflow for adding new ham/spam to existing site-wide database?
>> >On Wed, 17 Mar 2021 10:42:14 -0400 Kris Deugau wrote:
>>
>> >> My own experience has been that accumulating blobs of ham/spam and
>> >> just repeatedly running sa-learn over those works just fine. It
>> >> also reduces the incidence of tokens from somewhat rarer mail
>> >> automatically expiring out of Bayes, leading to FPs and FNs.
>>
>> On 17.03.21 22:01, RW wrote:
>> >It won't do that by default. You would need something that removes
>> >the signature hashes from the database.

>On Thu, 18 Mar 2021 14:01:28 +0100 Matus UHLAR - fantomas wrote:
>> oh, yes, it does:
>>
>> bayes_auto_expire (default: 1)

On 18.03.21 16:09, RW wrote:
>I meant that sa-learn will ignore mail that's already been trained. So,
>by default, rerunning it over a corpus that has already been trained
>won't prevent any tokens expiring.

Aha - yes, correct.

Also, re-training over a huge file takes time just to parse it, so I
usually split old trained mailboxes into one per year or similar.
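The per-year split can be done with a short script. This is only a sketch, assuming classic mbox files where each message starts with a "From " separator line that ends in the year:

```shell
# Sketch: split one big trained mbox into one file per year, going by
# the year at the end of each "From " separator line, e.g.
# "From sender@example.com Tue Mar 16 13:16:49 2021".

# split_mbox_by_year MBOX OUTDIR
split_mbox_by_year() {
    mbox=$1; outdir=$2
    mkdir -p "$outdir"
    awk -v out="$outdir" '
        /^From / { year = $NF }            # new message: note its year
        year == "" { year = "unknown" }    # guard against odd input
        { print >> (out "/" year ".mbox") }
    ' "$mbox"
}
```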

>Redis does support ageing-out signatures, but I don't see why you would
>want to retrain on old mail at the expense of losing tokens from new
>mail. You'll also end up with a database where very old emails have
>been trained many times while recent, more relevant FPs & FNs have
>only been trained once.

I have already encountered a case where an (apparently poorly trained)
Bayes DB failed to properly classify ham/spam, and training multiple
mails didn't change the results (BAYES_50 nearly all the time).

Dropping the Bayes DB and re-training it on the old corpus made it work
like a charm (training a single spam usually pushes new mail from
BAYES_50 to BAYES_999).

Keeping the old corpus made a lot of sense there, especially for
targeted phish, which is quite rare and highly unique.

--
Matus UHLAR - fantomas, uhlar@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
WinError #99999: Out of error messages.