Mailing List Archive

PATCH: Enhanced exclude / include support for clamscan
My employer uses ClamAV to do a daily scan of the client data stored on our appliances. Our appliances are pretty beefy, but some of our clients have so much data that the nightly scan simply takes too long. For example, there's at least one client with so much data that the supposedly daily scan takes over 24 hours to run.

To address this problem, I wanted to make clamscan a bit smarter, so that I could tell it to do the following:

* don't scan files more than 30 days old; however,

* scan a rotating subset of those old files every day, so that over the course of a few weeks all of them will be scanned; in addition,

* scan any files that were previously found to be infected, so they'll show up in the scan report every day; and finally,

* don't scan files which have been explicitly suppressed by a list of files we don't want to scan no matter what.
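The four rules above can be sketched as a single decision function. This is only an illustration of the intended policy, not code from the patch: the function name, the parameters, the 21-day rotation period, and the use of a CRC of the path to pick a rotation bucket are all assumptions made for the example.

```python
import zlib

DAY = 86400  # seconds per day


def should_scan(path, mtime, now, infected, suppressed,
                max_age_days=30, rotation_days=21):
    """Return True if `path` should be scanned today (hypothetical policy)."""
    # Suppressed files are never scanned, no matter what.
    if path in suppressed:
        return False
    # Files previously found infected are scanned every day.
    if path in infected:
        return True
    # Files modified within the age limit are always scanned.
    if now - mtime <= max_age_days * DAY:
        return True
    # Old files: each path falls into one of `rotation_days` buckets and is
    # scanned only on the day whose index matches its bucket. zlib.crc32 is
    # used instead of hash() because str hashes vary between interpreter runs.
    bucket = zlib.crc32(path.encode("utf-8")) % rotation_days
    today = int(now // DAY) % rotation_days
    return bucket == today
```

With these assumptions, every old file is visited exactly once in any run of `rotation_days` consecutive days, which is the "over the course of a few weeks" behaviour described above.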

The --include / --exclude functionality built into clamscan isn't nearly powerful enough to implement this logic. Building a list of files to scan and/or ignore and then specifying that list with many --include and --exclude options won't work, for two reasons:

1. The clamscan command line will be orders of magnitude too long.

2. The logic in clamscan for checking --include and --exclude recompiles all of the regular expressions for every single file being checked, which makes it run horrendously slowly.
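To illustrate the second point, here is a minimal sketch of the compile-once approach. It is written in Python for brevity and is not taken from clamscan or the patch; the class name `PatternList` is hypothetical, and Python's regex syntax differs from the one libclamav uses.

```python
import re


class PatternList:
    """Holds patterns that are compiled once, when the list is built."""

    def __init__(self, patterns):
        # The compilation cost is paid here, once per pattern,
        # rather than once per pattern per file checked.
        self._compiled = [re.compile(p) for p in patterns]

    def matches(self, path):
        # Unanchored search: a pattern may match anywhere in the path.
        return any(rx.search(path) for rx in self._compiled)
```

For N files and M patterns this does M compilations in total instead of N * M, which is where the time goes when the list of patterns is long.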

After evaluating what I needed to accomplish and what clamscan is currently capable of doing, I decided to go ahead and enhance clamscan by implementing much more powerful and flexible exclude / include support. This I have done in the enclosed patch, which makes the following changes:

* A new module, excludelist.c / excludelist.h, is added for managing lists of files or directories to exclude or include. There's also a pretty beefy unit-test program to go with this new module -- excludelist_test.c.

* The regular expressions being excluded or included are compiled just once and stored in memory.

* The code in clamscan.c that checks HAVE_REGEX_H has been removed; I'm pretty sure that it is obsolete, because cli_regcomp and cli_regexec were added to libclamav and are always used elsewhere in the code.

* A number of new command-line options for clamscan have been added:

  * --include-list to specify a file containing a list of patterns to include, and similarly --exclude-list, --include-dir-list, and --exclude-dir-list
  * --include-newer to scan only files newer than the specified timestamp, and similarly --exclude-newer, --include-older, and --exclude-older

* The order in which --include and --exclude are specified on the command line is now significant. THIS IS A USER-VISIBLE INTERFACE CHANGE. The last --include-* or --exclude-* option that matches a given item decides whether that item is scanned. If the first exclude / include option specified on the command line is a --include-* option, then the default for unmatched items of that type is to exclude.

* To make this work, I had to add support to optparser.c for keeping track of the order in which arguments appear on the command line and making that information available to the caller.
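The ordering rule described in the list above (last matching option wins, and a leading include flips the default to exclude) can be sketched as follows. This is an illustration only, not the patch's code: the function names are hypothetical, only the plain --include=PATTERN and --exclude=PATTERN forms are handled, and the separate treatment of files and directories is left out.

```python
import re


def ordered_rules(argv):
    """Collect (action, compiled_regex) pairs in command-line order."""
    rules = []
    for arg in argv:
        for action in ("include", "exclude"):
            prefix = "--" + action + "="
            if arg.startswith(prefix):
                rules.append((action, re.compile(arg[len(prefix):])))
    return rules


def decide(path, rules):
    """Return True if `path` should be scanned under the ordered rules."""
    if not rules:
        return True  # no rules at all: scan everything
    # If the first rule is an include, unmatched items default to excluded;
    # otherwise they default to included.
    verdict = rules[0][0] != "include"
    # Later matches override earlier ones, so the last matching rule wins.
    for action, rx in rules:
        if rx.search(path):
            verdict = (action == "include")
    return verdict
```

For example, with `--include=\.doc$ --exclude=^/tmp/`, the path /home/a.doc is scanned, /tmp/a.doc is not (the later exclude matches), and /home/a.txt is not (nothing matches and the first option was an include).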

Please note the following caveats:

* As noted above, THERE IS A USER-VISIBLE INTERFACE CHANGE, because the order of --include and --exclude on the command line is now significant. I feel that the new functionality is sufficiently powerful and useful that the UI change is appropriate. If y'all disagree, then I am open to suggestions for how to preserve the old interface while still adding the new functionality.

* The code for parsing timestamps for the --*-older and --*-newer options is known to work on recent Linux versions, but may not be portable to old Linux versions, other UNIX variants, or Windows. I hope that other developers who have access to those platforms will be willing to make the changes to my patch necessary to support them; I am not in a position to do this myself. Worst-case scenario, getdate() support in the timestamp parser can be disabled on platforms on which it simply isn't available.
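As a rough idea of what a portable fallback for the timestamp options could look like when getdate() is unavailable, here is a sketch that accepts only a few fixed formats. The accepted formats (epoch seconds, "YYYY-MM-DD HH:MM:SS", "YYYY-MM-DD") and the function name are assumptions for the example; the formats the patch actually accepts are not stated in this message.

```python
from datetime import datetime

# Hypothetical set of accepted formats, tried in order.
FORMATS = ("%Y-%m-%d %H:%M:%S", "%Y-%m-%d")


def parse_timestamp(text):
    """Parse `text` into seconds since the epoch (local time assumed)."""
    text = text.strip()
    # A string of digits is taken to be epoch seconds already.
    if text.isdigit():
        return float(text)
    for fmt in FORMATS:
        try:
            # A naive datetime is interpreted in the local time zone.
            return datetime.strptime(text, fmt).timestamp()
        except ValueError:
            continue
    raise ValueError("unrecognized timestamp: %r" % text)
```

A fixed list of formats is far less flexible than getdate(), but it behaves the same way on every platform.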

I would be delighted to hear any feedback that anyone has about this patch. I would be even more delighted if the maintainers of ClamAV decided to fold my patch into an upcoming ClamAV release, so I don't have to maintain it myself forever :-).

Thanks,

Jonathan Kamens
Advent Software, Inc.
Re: PATCH: Enhanced exclude / include support for clamscan
On Tue May 12 2009 20:37:19 GMT+0200 (CEST)
Kamens, Jonathan <jkamens@Advent.COM> wrote:

> I would be delighted to hear any feedback that anyone has about this
> patch. I would be even more delighted if the maintainers of ClamAV
> decided to fold my patch into an upcoming ClamAV release, so I don't
> have to maintain it myself forever :-).

Hi Jonathan,

Please open a bug report ("enhancement") at bugs.clamav.net and attach
your patches there together with all the additional information. This
will make things easier for us, and you and other users will have a
better way to track what's happening with them.

Thanks,

--
oo ..... Tomasz Kojm <tkojm@clamav.net>
(\/)\......... http://www.ClamAV.net/gpg/tkojm.gpg
\..........._ 0DCA5A08407D5288279DB43454822DC8985A444B
//\ /\ Wed May 13 09:51:25 CEST 2009
_______________________________________________
http://lurker.clamav.net/list/clamav-devel.html
Please submit your patches to our Bugzilla: http://bugs.clamav.net
Re: PATCH: Enhanced exclude / include support for clamscan
Hi there,

On Wed, 13 May 2009 Kamens, Jonathan wrote:

> ... wanted to make clamscan a bit smarter ...
>
> * don't scan files more than 30 days old ...
> * scan a rotating subset of those old files ...
> * scan any files that were previously found to be infected ...
> * don't scan files which have been explicitly suppressed
>
> ... I decided to go ahead and enhance clamscan ...
>
> I would be delighted to hear any feedback ...

While these features might make a lot of sense, I can't help thinking
that this isn't the 'Unix' (right:) way to go about implementing them.

Would it not be better just to tell clamscan to accept the names of
the files to scan on stdin (something like 'clamscan -f -' maybe?)
and create a tool which can provide this list if one doesn't already
exist? For most users, things like the standard 'find' or 'xargs'
will be more than sufficient.
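A separate list-producing tool of the kind suggested above could be as small as the following sketch, which walks a directory tree and yields files modified within a given number of days. The function name and parameters are hypothetical, and the 'clamscan -f -' form is only the suggestion made in this message, not something the sketch assumes exists.

```python
import os
import time


def recent_files(root, max_age_days=30, now=None):
    """Yield paths under `root` modified within the last `max_age_days`."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat so that symbolic links are not followed.
                if os.lstat(path).st_mtime >= cutoff:
                    yield path
            except OSError:
                # The file vanished or is unreadable; skip it.
                continue
```

Printing each yielded path on its own line gives a list that can be written to a file or piped to whatever consumes it.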

You mentioned maintenance. IMO, what you're doing is asking for a
maintenance headache. You mentioned changing the API. Please, for
the sake of people who are already using the package, don't do that
unless there is (a) a VERY compelling reason and (b) NO other way.
I'm already leaning towards abandoning ClamAV because of instability
in the API. Sanesecurity is the only reason I continue to use it.

--

73,
Ged.