My employer uses ClamAV to do a daily scan of the client data stored on our appliances. Our appliances are pretty beefy, but some of our clients have so much data that the nightly scan simply takes too long. For example, there's at least one client with so much data that the supposedly daily scan takes over 24 hours to run.
To address this problem, I wanted to make clamscan a bit smarter, so that I could tell it to do the following:
* don't scan files more than 30 days old; however,
* scan a rotating subset of those old files every day, so that over the course of a few weeks all of them will be scanned; in addition,
* scan any files that were previously found to be infected, so they'll show up in the scan report every day; and finally,
* don't scan files which have been explicitly suppressed by a list of files we don't want to scan no matter what.
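To make the selection rules above concrete, here is a minimal Python sketch of the decision for a single file. It is not code from the patch: the function name `should_scan`, the 21-day rotation length (one reading of "a few weeks"), and the CRC-based bucketing are all assumptions chosen for illustration.

```python
import zlib

RECENT_DAYS = 30     # from the list above: newer files are always scanned
ROTATION_DAYS = 21   # assumption: "a few weeks" taken as 21 days


def should_scan(path, mtime, now, day_index, infected, suppressed):
    """Return True if path belongs in today's scan (hypothetical helper).

    mtime and now are seconds since the epoch; day_index is any counter
    that increases by one per nightly run; infected and suppressed are
    sets of paths.
    """
    if path in suppressed:      # suppressed files are never scanned
        return False
    if path in infected:        # known infections are re-scanned every day
        return True
    if now - mtime <= RECENT_DAYS * 86400:
        return True             # recent files are always scanned
    # Old file: a stable hash of the path assigns it to exactly one
    # rotation bucket, so it is scanned once per ROTATION_DAYS runs.
    bucket = zlib.crc32(path.encode("utf-8")) % ROTATION_DAYS
    return bucket == day_index % ROTATION_DAYS
```

Hashing the path rather than picking files at random keeps the bucket assignment stable between runs, so every old file gets scanned once per rotation without any state being stored.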
The --include / --exclude functionality built into clamscan isn't nearly powerful enough to implement this logic. Building a list of files to scan and/or ignore and then specifying that list with many --include and --exclude options won't work, for two reasons:
1. The clamscan command line will be orders of magnitude too long.
2. The logic in clamscan for checking --include and --exclude recompiles all of the regular expressions for every single file being checked, which makes it run horrendously slow.
After evaluating what I needed to accomplish and what clamscan is currently capable of doing, I decided to go ahead and enhance clamscan by implementing much more powerful and flexible exclude / include support. This I have done in the enclosed patch, which makes the following changes:
* A new module, excludelist.c / excludelist.h, is added for managing lists of files or directories to exclude or include. There's also a pretty beefy unit-test program to go with this new module -- excludelist_test.c.
* The regular expressions being excluded or included are compiled just once and stored in memory.
* The code in clamscan.c that checks HAVE_REGEX_H has been removed; I'm pretty sure that it is obsolete, because cli_regcomp and cli_regexec were added to libclamav and are always used elsewhere in the code.
* A number of new command-line options for clamscan have been added:
  * --include-list to specify a file containing a list of patterns to include, plus similarly for --exclude-list, --include-dir-list, --exclude-dir-list
  * --include-newer to only scan files newer than the specified timestamp, plus similarly for --exclude-newer, --include-older, --exclude-older
* The order in which --include and --exclude are specified on the command line is now significant. THIS IS A USER-VISIBLE INTERFACE CHANGE. The last --include-* and --exclude-* option that matches a given item rules for that item. If the first exclude / include option specified on the command line is a --include-* option, then the default for unmatched items of that type is to exclude.
* To make this work, I had to add support to optparser.c for keeping track of the order in which arguments appear on the command line and making that information available to the caller.
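The ordering semantics described above (last matching option wins, and a leading include flips the default to exclude) can be sketched in a few lines of Python. This is only an illustration of the behavior, not the C code in excludelist.c; the class name `RuleList` and its methods are hypothetical. Two points are assumptions beyond what is stated above: that unmatched items default to being included when the first option is an exclude, and that everything is included when no options are given.

```python
import re


class RuleList:
    """Ordered include/exclude rules (hypothetical illustration)."""

    def __init__(self):
        # (compiled_regex, include_flag) pairs in command-line order
        self._rules = []

    def add(self, pattern, include):
        # Each pattern is compiled once, here, not on every file check.
        self._rules.append((re.compile(pattern), include))

    def allows(self, path):
        # Default for unmatched items: exclude if the first rule is an
        # include; otherwise include (assumption, see lead-in).
        if self._rules and self._rules[0][1]:
            verdict = False
        else:
            verdict = True
        for regex, include in self._rules:
            if regex.search(path):
                verdict = include   # a later match overrides earlier ones
        return verdict
```

For example, adding an exclude for `\.log$` followed by an include for `important\.log$` excludes `/var/a.log` but still scans `/var/important.log`; reversing the two options would exclude both.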
Please note the following caveats:
* As noted above, THERE IS A USER-VISIBLE INTERFACE CHANGE, because the order of --include and --exclude on the command line is now significant. I feel that the new functionality is sufficiently powerful and useful that the UI change is appropriate. If y'all disagree, then I am open to suggestions for how to preserve the old interface while still adding the new functionality.
* The code for parsing timestamps for the --*-older and --*-newer options is known to work on recent Linux versions, but may not be portable to old Linux versions, other UNIX variants, or Windows. I hope that other developers who have access to those platforms will be willing to make the changes to my patch necessary to support them; I am not in a position to do this myself. Worst-case scenario, getdate() support in the timestamp parser can be disabled on platforms on which it simply isn't available.
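For anyone considering a fallback on platforms without getdate(), the sketch below shows one portable approach: try a fixed list of formats in turn. It is written in Python purely for illustration and is not how the patch parses timestamps. The function name `parse_timestamp`, the accepted formats, and the choice to interpret times as UTC are all assumptions (getdate() itself takes its templates from the file named by the DATEMSK environment variable).

```python
from datetime import datetime, timezone

# Hypothetical set of accepted formats, most specific first.
FORMATS = ("%Y-%m-%d %H:%M:%S", "%Y-%m-%d %H:%M", "%Y-%m-%d")


def parse_timestamp(text):
    """Return seconds since the epoch, or raise ValueError.

    Times are interpreted as UTC here so results are reproducible;
    a real implementation would probably use local time.
    """
    text = text.strip()
    if text.isdigit():              # a bare number is taken as epoch seconds
        return int(text)
    for fmt in FORMATS:
        try:
            parsed = datetime.strptime(text, fmt)
        except ValueError:
            continue                # try the next format
        return int(parsed.replace(tzinfo=timezone.utc).timestamp())
    raise ValueError("unrecognized timestamp: %r" % text)
```

A fixed format list is less flexible than getdate(), but it behaves the same on every platform, which may be an acceptable trade for the --*-older and --*-newer options.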
I would be delighted to hear any feedback that anyone has about this patch. I would be even more delighted if the maintainers of ClamAV decided to fold my patch into an upcoming ClamAV release, so I don't have to maintain it myself forever :-).
Thanks,
Jonathan Kamens
Advent Software, Inc.