Mailing List Archive

ClamAV, improvement in performance?
hi,

Two months back, we started working on the idea of using bloom filters for
string matching in network intrusion detection. However, a high percentage
of those signatures are two-three bytes long which made them unideal for
hashing. Then, we realized that virus signatures tend to be longer, and we
looked around for available open source virus signature DBs.

We downloaded ClamAV like two weeks ago, and used the provided API to see
if it was suitable for our bloom filters. We simply mmap'ed a sample
executable file and called cl_scanbuff() to test performance. On an AMD XP
2000 -1.8 Ghz- (with 768 MB RAM), it went at 6.69 MB/s.

ClamAV is using Aho-Corasick to detect an exact match between the input
and the virus signatures. However, we can get significant performance
improvements by using approximate string matching algorithms as a first
line of defense.

That is, we can take the first X bytes of every signature, and use hashing
functions to put them into a bloom filter. Then, when we are scanning a
buffer, we can first check the input data against our bloom filter, and
call Aho-Corasick only if there is a possible match. By using sparse
enough bloom filters, we can eliminate 96%+ of the input at a first pass.

After some math and trial-error, it seemed like 4 MB is a reasonable
filter size. CPU instructions and memory accesses aren't for free though,
so after some playing around, I realized that having two level hashing (a
simple mask and mod against 512 KB, and then better hash functions against
4 MB) is a better idea than plain hashing. I slightly modified ClamAV
code, and ran the same performance test:

[X@X clamav_test]# ./perf_test sample-file

Detected ClamAV-Test-Signature virus!!!!!

file length: 64120100
this many megs: 61.149693
processed 22.589469 MB/sec
!!!found 188276 possible matches

Without the calls to ClamAV, hashing goes around 44.70 MB/s. This
implementation requires that the prefix of the signature is at least 6
bytes long, and therefore throws away 25 sigs out of 20718.

Memory accesses are more expensive than CPU instructions these days, so we
can do a small trick to further improve performance. For a sample
signature "abcdefghi", we can hash:

abcd... ; bcde... ; cdef... ; defg...

and go four bytes at a time instead of one. Depending on how much memory
you're willing to use, you can increment that four into eight, twelve...
Of course, you need bigger prefixes in here, I used the first twelve bytes
which throws away 195 sigs out of 20718. Testing this again with 4 MB
(but this time with a 1 MB first level bloom filter) we get:

[X@X clamav_test]# ./perf_test sample-file

file length: 64120100
this many megs: 61.149693
processed 39.173410 MB/sec
!!!found 158159 possible matches

Hashing itself goes at 57.25 MB/s in here. I think this looks good, and
there is lots of space for improvement too. I got my hashing functions off
the web.

I have made the sample code available at:
http://www.stanford.edu/~kushcu/antivirus/

With the addition of this code fragment (600 lines), ClamAV can scan 1 GBs
in 25 secs on my home desktop, which I think puts it in the same league as
commercial anti-virus products, performance-wise.

The code is in its very early stages, and there may be possible bugs. Any
feedback on the idea/code would well be appreciated, and please let me
know if there are any errors with the methodology followed in testing of
this algorithm.

Thanks,

Ozgun.
Re: ClamAV, improvement in performance? [ In reply to ]
ozgun@cs.stanford.edu schrieb:

>With the addition of this code fragment (600 lines), ClamAV can scan 1 GBs
>in 25 secs on my home desktop, which I think puts it in the same league as
>commercial anti-virus products, performance-wise.
>
>The code is in its very early stages, and there may be possible bugs. Any
>feedback on the idea/code would well be appreciated, and please let me
>know if there are any errors with the methodology followed in testing of
>this algorithm.

Not that i understand the technical background, but it sounds logical and
the numbers are very impressive!

Cu
Re: ClamAV, improvement in performance? [ In reply to ]
On Tue, 6 Apr 2004 13:09:16 -0700 (PDT)
ozgun@cs.stanford.edu wrote:

> Without the calls to ClamAV, hashing goes around 44.70 MB/s. This
> implementation requires that the prefix of the signature is at least 6
> bytes long, and therefore throws away 25 sigs out of 20718.

That's unacceptable because eliminates important signatures for
polymorphic viruses.

You can get quite a big performance boost by increasing the depth
level but in the current implementation this will eliminate polymorphic
signatures, too.

--
oo ..... Tomasz Kojm <tkojm@clamav.net>
(\/)\......... http://www.ClamAV.net/gpg/tkojm.gpg
\..........._ 0DCA5A08407D5288279DB43454822DC8985A444B
//\ /\ Wed Apr 7 01:22:36 CEST 2004