Mailing List Archive

spamassassin and *compressed* Maildir
Do spamassassin or sa-learn understand compressed files or compressed
Maildir?

I've been running spamassassin on my Ubuntu mail server for years very
successfully. Recently, I've been experiencing a lot of difficulty and I'm
trying to figure it out. Earlier this year we upgraded the server from
Trusty Tahr to Xenial (long time coming!) and some other stuff got
upgraded as well. We run an IMAP server with Dovecot against a
Maildir-formatted message store. I noticed the message store was taking a
fair amount of space, so I decided to compress it with zlib (gzip
compression).

Pretty much since the upgrade (and simultaneous switch to compressed
Maildir) spamassassin has been doing a much worse job. I upgraded from the
distribution version of spamassassin (3.4.2) to the most recent version
(3.4.6) but no real joy. I keep a 'learn spam' folder to put false
negatives in (stuff that makes it into my inbox which ought not to), and
every night, run sa-learn on it and also spamassassin -r to report it. I
started noticing that DCC was complaining on report that "missing message
body; fatal error".

I ran spamassassin -d -r to see what was happening and noticed that it
interacted with DCC using dccproc. Maybe dccproc doesn't understand
compressed mail? Well, if it doesn't, then perhaps sa-learn doesn't
either. That might explain why my Bayes rules don't seem to be working
very well despite retraining.

-CJ
Re: spamassassin and *compressed* Maildir
On Fri, May 21, 2021 at 10:54:54AM -0400, Clive Jacques wrote:
> Do spamassassin or sa-learn understand compressed files or compressed Maildir?

I believe sa-learn will automatically decompress if the files have .gz or
.bz2 extension, but yes Maildir files without extension will not work.

It should be easy to detect compressed Maildir files; perhaps file an
enhancement request in Bugzilla.
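
As a rough illustration of the detection idea: gzip streams begin with the
two bytes 1f 8b, so a message file can be classified by content rather than
by name. This is only a sketch. The function name is_gzip is made up for the
example, and it recognises gzip only, not bzip2 or any other format.

```shell
# Succeeds (exit 0) if the file starts with the gzip magic bytes 1f 8b.
is_gzip() {
    # od prints the first two bytes as hex; tr strips spaces and newlines.
    [ "$(od -An -tx1 -N2 "$1" 2>/dev/null | tr -d ' \n')" = "1f8b" ]
}
```

A caller could then branch on the result, for example
is_gzip "$f" && echo "$f is compressed".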
Re: spamassassin and *compressed* Maildir
That's confirmed. sa-learn doesn't like compressed files. I don't know if
it will dine on compressed files with the correct extension (i.e., .gz).
Unfortunately, when using compression with Maildir format, Dovecot doesn't
seem to like to use extensions. So, I copied the directory to a temporary
location, decompressed the files and then set sa-learn on them. Even
getting gunzip to operate on the files was a pain because it only wants
files with the .gz extension (so I had to rename all 6,000 of them first -
using a utility like 'rename'). I then did the same thing with about 9,000
hams.

There was much good news. Learning proceeded at about the same pace, but
syncing the journal to the database was *much* faster. Maybe the tokens
were smaller? I verified that it seemed to work with sa-learn --dump magic.

Then, all by itself, SpamAssassin's Bayes filtering was instantly much
better. Stuff that was tripping BAYES_00 was suddenly popping BAYES_99.

Now, I just need to update my nightly learning/reporting script.

Still, a very nice result.

On Fri, May 21, 2021 at 11:30 AM Henrik K <hege@hege.li> wrote:

> On Fri, May 21, 2021 at 10:54:54AM -0400, Clive Jacques wrote:
> > Do spamassassin or sa-learn understand compressed files or compressed
> Maildir?
>
> I believe sa-learn will automatically decompress if the files have .gz or
> .bz2 extension, but yes Maildir files without extension will not work.
>
> Should be easy to detect compressed Maildir files, perhaps file enhancement
> request in bugzilla.
Re: spamassassin and *compressed* Maildir
You can use `zcat -f` or `gunzip -c -f`, which don't need the files to have a .gz extension, so you can skip the rename step.

Best Regards,
Lucas Rolff
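
The copy-and-decompress step described earlier in the thread can be sketched
without any renaming. This assumes GNU gzip, where -d -c -f copies input that
is not gzip data through unchanged, and a folder with the usual cur/ and new/
subdirectories. The function name decompress_folder is hypothetical.

```shell
# Write a decompressed copy of every message in SRC (a Maildir folder)
# into DST, leaving SRC untouched.
decompress_folder() {
    src=$1
    dst=$2
    mkdir -p "$dst" || return 1
    for f in "$src"/cur/* "$src"/new/*; do
        [ -f "$f" ] || continue   # skip unmatched globs and non-files
        # Reading from stdin means gzip never looks at the file name.
        gzip -dc -f < "$f" > "$dst/$(basename "$f")" || return 1
    done
}
```

The plain copy can then be handed to sa-learn as a directory, for example
sa-learn --spam /tmp/learn-spam (the path is only an example).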
Re: spamassassin and *compressed* Maildir
I have a mail folder that I put false negatives in (i.e., spam which ends
up in my inbox) and another for false positives (ham that ends up in my
spam folder). Each night I run sa-learn on each folder (sa-learn will
munch on entire Maildirs) and also feed each message to spamassassin -r to
report it. So using zcat or gunzip -c will work for spamassassin -r, but
not for sa-learn.

Unless sa-learn can munch on stdin as well as files....

-CJ
Re: spamassassin and *compressed* Maildir
$ cat RMScaa8wVRnMfwqlQ0RxAzDjYGmIumlp1wlA8QNr8z.eml | sa-learn --spam
Learned tokens from 0 message(s) (1 message(s) examined)

It does indeed work from stdin.

- Lucas
Re: spamassassin and *compressed* Maildir
On Fri, 21 May 2021 15:41:22 -0400
Clive Jacques wrote:

> I have a mail folder that I put false negatives in (i.e., spam which
> ends up in my inbox) and another for false positives (ham that ends
> up in my spam folder). Each night I run sa-learn on each folder
> (sa-learn will munch on entire Maildirs) and also feed each message
> to spamassassin -r to report it. So using zcat or gunzip -c will
> work for spamassassin -r, but not for sa-learn.

spamassassin -r also trains Bayes.
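
Putting the thread's pieces together (sa-learn and spamassassin -r both read
a message on stdin, and gzip -dc -f does not need a .gz extension), a nightly
job could stream each message straight from the compressed folder. This is a
sketch under those assumptions, not a tested production script. The helper
name feed_each and the folder names in the usage note are hypothetical.

```shell
# Run a command once per message in DIR (a Maildir folder), with the
# decompressed message on the command's stdin.
feed_each() {
    dir=$1
    shift
    for f in "$dir"/cur/* "$dir"/new/*; do
        [ -f "$f" ] || continue   # skip unmatched globs and non-files
        # With GNU gzip, -f passes non-gzip input through unchanged.
        gzip -dc -f < "$f" | "$@" || echo "failed: $f" >&2
    done
}
```

Usage might look like feed_each "$HOME/Maildir/.learn-spam" spamassassin -r
for spam and feed_each "$HOME/Maildir/.learn-ham" sa-learn --ham for ham.
Starting one process per message is likely slower than pointing sa-learn at
a decompressed directory, so this trades speed for not needing a temporary
copy.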