Scan Attachment Content Using Spamassassin

Hello Folks,

Is there any way we can scan the content of an attachment, e.g. .doc/.pdf/.xls/.ppt etc.?

The plan is to have DLP-style protection with the help of SpamAssassin.


Regards,
Siddhesh
Re: Scan Attachment Content Using Spamassassin
On Thu, 3 Jun 2021, KADAM, SIDDHESH wrote:

> Hello Folks,
>
> Is there any possible way using we can scan for the content of an attachment ie .doc/pdf/.xls/ppt etc...
>
> Planning is to have a DLP kind of protection with the help of Spamassassin.
>
> Regards,
> Siddhesh

SpamAssassin really isn't the best tool for this job. It's designed for
looking at text, and how do you squeeze the text out of a ppt or xls in a
meaningful way?
Even more limiting, SpamAssassin is designed for small to medium size messages;
scanning anything over 500KB or so is going to be a resource hog.

What would be better is a tool that is already designed for scanning
.doc/.pdf/.xls/.ppt etc.: an anti-virus program with custom rules for the kinds
of info you want to detect.

ClamAV has built-in DLP rules for standard kinds of PII (e.g. CC#s, SSNs, etc.)
and comes with tools to help you craft custom rules if you have particular kinds
of info you need DLP for.
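
To make "DLP rules for CC#s and SSNs" concrete, here is a minimal sketch of
what such a rule checks. It is not ClamAV's implementation; the patterns and
the function names (luhn_ok, find_pii) are assumptions made up for this
illustration, and the SSN check is a format match only.

```python
import re

# Candidate card numbers: 13 to 16 digits, optionally separated by spaces or dashes.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")
# US SSN written as 123-45-6789 (format check only, no validity rules).
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def luhn_ok(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def find_pii(text: str) -> list[tuple[str, str]]:
    """Return (kind, match) pairs for card-like and SSN-like strings."""
    hits = []
    for m in CARD_RE.finditer(text):
        digits = re.sub(r"[ -]", "", m.group())
        if luhn_ok(digits):  # drop digit runs that are not plausible card numbers
            hits.append(("credit_card", m.group()))
    for m in SSN_RE.finditer(text):
        hits.append(("ssn", m.group()))
    return hits
```

The checksum step is what keeps order numbers and phone numbers from firing
the rule; custom DLP rules usually need a similar second check to keep false
positives down.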

Start with a mail scanning framework (e.g. amavis or mimedefang) and plug in
SpamAssassin for spam and two instances of ClamAV, one with standard anti-virus
rulesets and another with your DLP rules. Then you can use the framework
to take whatever kinds of actions you want based on which components 'fired'.
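
The "act on which components fired" step could look roughly like the sketch
below. It is not amavis or mimedefang configuration; ScanResult, decide, the
action names, and the 5.0 threshold are all hypothetical, chosen only to show
the decision logic.

```python
from dataclasses import dataclass


@dataclass
class ScanResult:
    spam_score: float      # score reported by the spam filter
    virus_hits: list[str]  # signature names from the anti-virus instance
    dlp_hits: list[str]    # signature names from the DLP instance


def decide(result: ScanResult, spam_threshold: float = 5.0) -> str:
    """Map which components fired to a single action, strictest first."""
    if result.virus_hits:
        return "reject"
    if result.dlp_hits:
        return "quarantine"
    if result.spam_score >= spam_threshold:
        return "tag_spam"
    return "deliver"
```

Keeping the anti-virus and DLP results in separate fields is the point of
running two instances: the framework can then treat a DLP match differently
from a malware match.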




--
Dave Funk University of Iowa
<dbfunk (at) engineering.uiowa.edu> College of Engineering
319/335-5751 FAX: 319/384-0549 1256 Seamans Center, 103 S Capitol St.
Sys_admin/Postmaster/cell_admin Iowa City, IA 52242-1527
#include <std_disclaimer.h>
Better is not better, 'standard' is better. B{
Re: Scan Attachment Content Using Spamassassin
On Thu, Jun 03, 2021 at 01:15:03AM -0500, Dave Funk wrote:
>
> Even more limiting, spamassassin is designed for small to medium size
> messages, scanning anything over 500KB or so is going to be a resource hog.

That's just outdated information. It's fine to scan even 20MB+ messages, it
just requires some memory.
Re: Scan Attachment Content Using Spamassassin
>On Thu, Jun 03, 2021 at 01:15:03AM -0500, Dave Funk wrote:
>> Even more limiting, spamassassin is designed for small to medium size
>> messages, scanning anything over 500KB or so is going to be a resource hog.

500KB is the default max size for spamc, not for spamassassin itself.
You can raise it.
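
A minimal sketch of raising that limit, assuming spamc is called from a Python
wrapper. The -s option (maximum message size in bytes) is spamc's own; the
helper names and the exact byte value used for "500 KB" are assumptions.

```python
# spamc documents its default limit as 500 KB; the exact byte count is assumed here.
DEFAULT_SPAMC_MAX = 500 * 1024


def spamc_argv(max_size: int = DEFAULT_SPAMC_MAX) -> list[str]:
    """Build a spamc command line with an explicit -s (max message size, bytes)."""
    return ["spamc", "-s", str(max_size)]


def will_be_scanned(message: bytes, max_size: int = DEFAULT_SPAMC_MAX) -> bool:
    """Approximate spamc's size check: larger messages are returned unprocessed."""
    return len(message) <= max_size


# Example, not run here because it needs spamc and a running spamd:
#   import subprocess
#   subprocess.run(spamc_argv(20 * 1024 * 1024), input=raw_message, capture_output=True)
```

Messages over the limit are not rejected, they are simply passed through
unscanned, so a limit that is too low silently exempts large attachments.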

On 03.06.21 09:23, Henrik K wrote:
>That's just outdated information. It's fine to scan even 20MB+ messages, it
>just requires some memory.

and CPU and time...

--
Matus UHLAR - fantomas, uhlar@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
"The box said 'Requires Windows 95 or better', so I bought a Macintosh".
Re: Scan Attachment Content Using Spamassassin
On Thu, Jun 03, 2021 at 09:32:28AM +0200, Matus UHLAR - fantomas wrote:
> On 03.06.21 09:23, Henrik K wrote:
> > That's just outdated information. It's fine to scan even 20MB+ messages, it
> > just requires some memory.
>
> and CPU and time...

Those are affected very little by message size. And all that is pretty much
negated by large messages being uncommon.
Re: Scan Attachment Content Using Spamassassin
> On 03 Jun 2021, at 01:32, Matus UHLAR - fantomas <uhlar@fantomas.sk> wrote:
>
>> On Thu, Jun 03, 2021 at 01:15:03AM -0500, Dave Funk wrote:
>>> Even more limiting, spamassassin is designed for small to medium size
>>> messages, scanning anything over 500KB or so is going to be a resource hog.
>
> 500KB is default max size for spamc, not for spamassassin itself.
> You can rise it.
>
> On 03.06.21 09:23, Henrik K wrote:
>> That's just outdated information. It's fine to scan even 20MB+ messages, it
>> just requires some memory.
>
> and CPU and time...

If you have the RAM you will be hard pressed to notice any spike in CPU. Not sure about the amount of time to process, but it's not going to take much processing on anything but a very, very low-end and old CPU. (Think pre-Pentium, not anything from the last decade or so.)

--
If puns are outlawed, only outlaws will have puns.
Re: Scan Attachment Content Using Spamassassin
On Thu, 3 Jun 2021, Henrik K wrote:

> On Thu, Jun 03, 2021 at 09:32:28AM +0200, Matus UHLAR - fantomas wrote:
>> On 03.06.21 09:23, Henrik K wrote:
>>> That's just outdated information. It's fine to scan even 20MB+ messages, it
>>> just requires some memory.
>>
>> and CPU and time...
>
> Those are affected very little by message size. And all that is pretty much
> negated by large messages being uncommon.

Be that as it may, the OP wanted to do DLP scanning of messages containing
PPTX, XLSX, etc., and it's uncommon to see a small PPTX file; large is more
common with such media.

Also, SpamAssassin does not have a native built-in component for parsing such
media attachments. It would need some kind of add-in (e.g. the "fuzzy ocr"
plugin that was the rage a while ago).
As such it adds an additional complication that needs to be integrated,
managed, updated, etc.

Probably better to use a whole different tool that comes with that kind of
capability built-in (EG ClamAV).
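
To gauge how many such attachments you actually get, and how large they are,
something like the following sketch can be run over stored messages. It uses
only Python's standard email package; the extension list and the function
names are assumptions for the example, and it only finds the attachments, it
does not extract their text.

```python
from email import message_from_bytes, policy
from email.message import EmailMessage

OFFICE_EXTS = (".doc", ".docx", ".xls", ".xlsx", ".ppt", ".pptx", ".pdf")


def office_attachments(raw: bytes) -> list[tuple[str, int]]:
    """Return (filename, decoded size in bytes) for office-type attachments."""
    msg = message_from_bytes(raw, policy=policy.default)
    found = []
    for part in msg.iter_attachments():
        name = part.get_filename() or ""
        if name.lower().endswith(OFFICE_EXTS):
            payload = part.get_payload(decode=True) or b""
            found.append((name, len(payload)))
    return found


def sample_message() -> bytes:
    """Build a small test message with one fake .pptx attachment."""
    msg = EmailMessage()
    msg["Subject"] = "slides"
    msg.set_content("see attached")
    msg.add_attachment(b"\x00" * 2048, maintype="application",
                       subtype="octet-stream", filename="deck.pptx")
    return msg.as_bytes()
```

Matching on the file name is the weak part of this sketch: a real DLP setup
would identify the file type from its content, which is one more reason to
hand the attachment to a tool built for that.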


--
Dave Funk University of Iowa
<dbfunk (at) engineering.uiowa.edu> College of Engineering
319/335-5751 FAX: 319/384-0549 1256 Seamans Center, 103 S Capitol St.
Sys_admin/Postmaster/cell_admin Iowa City, IA 52242-1527
#include <std_disclaimer.h>
Better is not better, 'standard' is better. B{
Re: Scan Attachment Content Using Spamassassin
On Thu, Jun 03, 2021 at 08:39:08AM -0500, Dave Funk wrote:
>
> Also, spamassassin does not have a native built-in component for parsing
> such media attachments, it would need to be some kind of add-in (EG the
> "fuzzy ocr" plugin that was the rage a while ago).
> As such it adds an additional complication that needs to be integrated/
> managed/updated etc.
>
> Probably better to use a whole different tool that comes with that kind of
> capability built-in (EG ClamAV).

Of course this is true. I was simply correcting the size issue, which
really isn't an issue.
Re: Scan Attachment Content Using Spamassassin
On 2021-06-03 07:22, KADAM, SIDDHESH wrote:
> Hello Folks,
>
> Is there any possible way using we can scan for the content of an
> attachment ie .doc/pdf/.xls/ppt etc...
>
> Planning is to have a DLP kind of protection with the help of
> Spamassassin.

Good plan, but SpamAssassin is not a malware project or even a virus
scanner. That said, it is possible to use ClamAV through Perl modules so that
its results can be bridged into SpamAssassin. I just hope that will not happen,
since malware / viruses are a different problem to handle.

https://www.clamav.net/documents/libclamav : make Perl modules for this,
then make the needed plugins to use the Perl modules.
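
As an alternative to binding libclamav from Perl, a scanner can also talk to a
running clamd daemon over its socket. The sketch below uses Python and clamd's
INSTREAM command rather than the Perl-module route described above; the host,
port 3310, and function names are assumptions, and scan_bytes needs a clamd
that is actually listening on TCP.

```python
import socket
import struct
from typing import Iterator, Optional


def instream_frames(data: bytes, chunk_size: int = 8192) -> Iterator[bytes]:
    """Yield the byte frames of a clamd INSTREAM request for `data`."""
    yield b"zINSTREAM\0"
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        # Each chunk is prefixed with its length as a 4-byte big-endian integer.
        yield struct.pack("!I", len(chunk)) + chunk
    yield struct.pack("!I", 0)  # a zero-length chunk ends the stream


def parse_reply(reply: bytes) -> Optional[str]:
    """Return the signature name from a 'stream: NAME FOUND' reply, else None."""
    text = reply.rstrip(b"\0\n").decode("utf-8", "replace")
    if text.startswith("stream: ") and text.endswith(" FOUND"):
        return text[len("stream: "):-len(" FOUND")]
    return None


def scan_bytes(data: bytes, host: str = "127.0.0.1", port: int = 3310) -> Optional[str]:
    """Send `data` to clamd over TCP and return the matched signature, if any."""
    with socket.create_connection((host, port), timeout=30) as sock:
        for frame in instream_frames(data):
            sock.sendall(frame)
        return parse_reply(sock.recv(4096))
```

A plugin built this way would call scan_bytes on each decoded attachment and
turn a non-empty result into a rule hit.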

That would be useful.
Re: Scan Attachment Content Using Spamassassin
On 2021-06-03 08:23, Henrik K wrote:
> On Thu, Jun 03, 2021 at 01:15:03AM -0500, Dave Funk wrote:
>>
>> Even more limiting, spamassassin is designed for small to medium size
>> messages, scanning anything over 500KB or so is going to be a resource
>> hog.
>
> That's just outdated information. It's fine to scan even 20MB+
> messages, it
> just requires some memory.

Postfix has its limits as well :=)

Only about 10MB is accepted by default (message_size_limit = 10240000 bytes).
Re: Scan Attachment Content Using Spamassassin
>> On 03.06.21 09:23, Henrik K wrote:
>> > That's just outdated information. It's fine to scan even 20MB+ messages, it
>> > just requires some memory.

>On Thu, Jun 03, 2021 at 09:32:28AM +0200, Matus UHLAR - fantomas wrote:
>> and CPU and time...

On 03.06.21 11:14, Henrik K wrote:
>Those are affected very little by message size. And all that is pretty much
>negated by large messages being uncommon.

Depends. I've had problems a few times when receiving huge mail with a lot of
text content...

Yes, most of that was parsing BAYES tokens, the number of which can be limited
now, but there may still be bottlenecks.

Also, I use FuzzyOCR, which needs time.

AFAIK, SA4 will support parsers for different file types, which is a perfect
idea but will take time too.

--
Matus UHLAR - fantomas, uhlar@fantomas.sk ; http://www.fantomas.sk/
Warning: I wish NOT to receive e-mail advertising to this address.
Varovanie: na tuto adresu chcem NEDOSTAVAT akukolvek reklamnu postu.
Despite the cost of living, have you noticed how popular it remains?