Mailing List Archive

performance increase
>> That's unacceptable because it eliminates important signatures for
>> polymorphic viruses.

I know that's unacceptable. Still, this hashing algorithm checks for 99.9%
of the signatures; adding a second test for the remaining 0.1% will slow it
down, but it should still perform much faster than ClamAV.

I read the sample virus signature database, and there are complete virus
signatures that are only ten bytes long. I assume those are just the
short strings by which you can completely identify a virus (ten bytes is
about two or three assembly instructions). Is there any way to extract
longer prefixes even for polymorphic signatures? What kind of documentation
should I read to learn more about identifying virus signatures? The reason
I am asking is that this hashing algorithm can "theoretically" run at
memory speed, given that the prefixes are long enough. I was able to
achieve 140 MB/s on hashing. It also performs better for scanning HTML and
plaintext mail files (even with the current implementation it ran at
75 MB/s versus 5 MB/s on a complete scan). And these are preliminary
results.

>> You can get quite a big performance boost by increasing the depth
>> level but in the current implementation this will eliminate polymorphic
>> signatures, too.

Thanks for pointing that out. I searched the posts and read the related
paper on "increasing depth". However, I think our approaches are
different. In Aho-Corasick, you take a byte, see where it points to in the
trie, and follow the links in the trie until you get a match/no match for
that byte. In bloom filters, you take a byte, hash it (mod it with a
prime), and see if there's a possible match in the filter. If you have
longer prefixes, you can go 4, 8, 12 bytes at a time at the cost of a
memory access (since memory accesses are more expensive than simple
hashing).
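
A minimal sketch of the filter-then-verify idea described above, written in
Python for brevity (the numbers in this thread come from a C/C++
implementation, so nothing here says anything about speed). PREFIX_LEN,
PRIMES, the PrefixBloom class and the sample signatures are illustrative
assumptions, not ClamAV code. It assumes every signature is static and at
least PREFIX_LEN bytes long.

```python
PREFIX_LEN = 4            # assumed prefix length in bytes
PRIMES = (65521, 65537)   # assumed moduli, one per hash function


def positions(chunk: bytes):
    """Hash a chunk by reducing its integer value modulo each prime."""
    value = int.from_bytes(chunk, "big")
    return [value % p for p in PRIMES]


class PrefixBloom:
    def __init__(self, signatures):
        # One shared bit vector, large enough for the biggest modulus.
        self.bits = bytearray((max(PRIMES) + 7) // 8)
        self.by_prefix = {}
        for name, sig in signatures.items():
            prefix = sig[:PREFIX_LEN]
            for pos in positions(prefix):
                self.bits[pos >> 3] |= 1 << (pos & 7)
            self.by_prefix.setdefault(prefix, []).append((name, sig))

    def maybe_contains(self, chunk: bytes) -> bool:
        """False means 'definitely absent'; True means 'possible match'."""
        return all(self.bits[pos >> 3] & (1 << (pos & 7))
                   for pos in positions(chunk))

    def scan(self, data: bytes):
        hits = []
        for i in range(len(data) - PREFIX_LEN + 1):
            chunk = data[i:i + PREFIX_LEN]
            if not self.maybe_contains(chunk):
                continue  # cheap reject: no signature can start here
            # Possible match: verify against the full signature bytes.
            for name, sig in self.by_prefix.get(chunk, ()):
                if data.startswith(sig, i):
                    hits.append((i, name))
        return hits
```

Most offsets in a clean file should fail the bit test and never reach the
verification step, which is where the expected gain over walking a trie
byte by byte would come from.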

That paper improves Aho-Corasick's performance by 2x-3x (at the expense
of 40 MB of RAM). With bloom filters, you are looking at an increase of
3x-6x (4 MB of RAM) with a simple two-week implementation.

It'd be great if someone could point to links on "sample infected files",
and how signatures are extracted from them. I'm also eager to learn more
about *s and ??s in the ClamAV DB library. Do we have *s because some part
of the virus tells the code to jump to a relative offset? Or is it
something else? Are the ??s there for self-encrypting viruses (like XOR
myself with the next byte and execute the remaining code)?

Thanks for the feedback,

Ozgun.
Re: performance increase
On Tue, 6 Apr 2004 ozgun@cs.stanford.edu wrote:
>
> It'd be great if someone could point to links on "sample infected files",
> and how signatures are extracted from them.

http://www.clamav.net/doc/0.70/signatures.pdf has some introductory info.

> I'm also eager to learn more about *s and ??s in the ClamAV DB library.
> Do we have *s because some part of the virus tells the code to jump to a
> relative offset? Or is it something else?

AFAIK * means skip a variable number of bytes. I don't know about ??.
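
For illustration, a small sketch of how these two wildcards could be
interpreted, assuming the usual reading of the ClamAV hex signature format:
"*" skips any number of bytes (including none) and "??" stands for exactly
one arbitrary byte. The function name compile_signature and the sample
signature are made up for this example, and the translation to a Python
regular expression is only a convenient way to show the semantics, not how
the scanner itself works.

```python
import re


def compile_signature(hex_sig: str) -> re.Pattern:
    """Translate a hex signature with * and ?? into a bytes regex."""
    parts = []
    i = 0
    while i < len(hex_sig):
        if hex_sig[i] == "*":
            parts.append(b".*?")      # any number of bytes, possibly zero
            i += 1
        elif hex_sig[i:i + 2] == "??":
            parts.append(b".")        # exactly one arbitrary byte
            i += 2
        else:
            # Two hex digits describe one literal byte.
            parts.append(re.escape(bytes.fromhex(hex_sig[i:i + 2])))
            i += 2
    # DOTALL so that "." also matches the byte 0x0a.
    return re.compile(b"".join(parts), re.DOTALL)


pattern = compile_signature("9090??eb*cd21")
print(bool(pattern.search(b"\x90\x90\x41\xeb\x00\x00\xcd\x21")))
```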

--
Tony Finch <dot@dotat.at> http://dotat.at/
Re: performance increase
On Tue, 6 Apr 2004 19:44:30 -0700 (PDT)
ozgun@cs.stanford.edu wrote:

>
> >> That's unacceptable because it eliminates important signatures for
> >> polymorphic viruses.
>
> I know that's unacceptable. Still, this hashing algorithm checks for
> 99.9% of the signatures; adding a second test for the remaining 0.1%
> will slow it down, but it should still perform much faster than ClamAV.

There is a good chance that two separate engines (for static and
regex-extended signatures) will perform much better than the current
implementation.
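
A rough sketch of how a signature set might be divided between two such
engines. The rule used here (a signature is "static" when it consists of
hex digits only, and "extended" as soon as it contains anything else, such
as * or ??) is an assumption made for the example, as are the function
name and the sample entries.

```python
import string

HEX_DIGITS = set(string.hexdigits)


def split_signatures(signatures):
    """Return (static, extended) dictionaries keyed by signature name."""
    static, extended = {}, {}
    for name, hex_sig in signatures.items():
        if all(ch in HEX_DIGITS for ch in hex_sig):
            static[name] = hex_sig      # candidate for the fast hashing path
        else:
            extended[name] = hex_sig    # needs the wildcard-aware engine
    return static, extended


static, extended = split_signatures({
    "Example.Static": "deadbeef0102",
    "Example.Wildcard": "9090??eb*cd21",
})
print(len(static), len(extended))
```

Each input is then scanned once by each engine, so the slower matcher only
has to carry the (presumably small) extended subset.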

> It'd be great if someone could point to links on "sample infected
> files", and how signatures are extracted from them. I'm also eager to
> learn more about *s and ??s in the ClamAV DB library. Do we have *s
> because some part of the virus tells the code to jump to a relative
> offset? Or is it something else? Are the ??s there for self-encrypting
> viruses (like XOR myself with the next byte and execute the remaining
> code)?

In most cases we're using wildcards to catch nop insertions as well as
to detect some non-executable non-static malware that contains static
parts (see the signature for Baggle-zippwd).

--
oo ..... Tomasz Kojm <tkojm@clamav.net>
(\/)\......... http://www.ClamAV.net/gpg/tkojm.gpg
\..........._ 0DCA5A08407D5288279DB43454822DC8985A444B
//\ /\ Thu Apr 8 22:52:38 CEST 2004
Re: performance increase
> There is a good chance that two separate engines (for static and
> regex-extended signatures) will perform much better than the current
> implementation.
>

It probably would. Btw (in case anyone's interested), when porting the
code to C, I realized that g++ stores bools as chars (I wasn't aware of
that). I quickly implemented a bit array, which increased the performance
to 50.66 MB/s (from 39.17 MB/s). I will optimize the algorithm for my CPU
cache. A scan for ~20K static signatures runs at around 63.49 MB/s.
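
To make the layout difference concrete, here is a small sketch of a packed
bit array, in Python purely for illustration. The BitArray class and its
method names are hypothetical. One flag per byte (what an array of bools
amounts to with g++) needs eight times the memory of one flag per bit, so
the packed form is more likely to stay in the CPU cache. The throughput
figures above come from the C version and do not carry over to this
sketch.

```python
class BitArray:
    """Fixed-size array of flags stored eight to a byte."""

    def __init__(self, nbits: int):
        self.nbits = nbits
        self.data = bytearray((nbits + 7) // 8)

    def set(self, i: int) -> None:
        # Byte index is i // 8, bit within that byte is i % 8.
        self.data[i >> 3] |= 1 << (i & 7)

    def test(self, i: int) -> bool:
        return bool(self.data[i >> 3] & (1 << (i & 7)))


flags_as_bytes = bytearray(1 << 20)   # one byte per flag
flags_as_bits = BitArray(1 << 20)     # one bit per flag
print(len(flags_as_bytes), len(flags_as_bits.data))
```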

> In most cases we're using wildcards to catch nop insertions as well as
> to detect some non-executable non-static malware that contains static
> parts (see the signature for Baggle-zippwd).

I think it makes more sense now. I read a bunch of papers during the past
few days, and am just curious: does the ClamAV team have any plans for
implementing an emulator and a heuristic engine in the future? I'm aware
that this would make the task of extracting signatures even more arduous
(to decrypt the encrypted part of a signature), but all commercial
companies are doing it, and if anyone is planning to start coding it, I
could help, too. If this idea gets adopted, in the long run, tools could
be created to automatically extract a virus signature using only two
infected files (just execute the code in the emulator, and look for
matching patterns in virtual memory).
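
As a rough sketch of the "look for matching patterns" step only: given two
byte buffers (standing in here for memory dumps of two emulated samples),
the longest run of bytes they share is one plausible signature candidate.
The function name candidate_signature and the sample bytes are invented
for the example, and there is no emulator involved. It uses
difflib.SequenceMatcher from the Python standard library, which is fine
for small inputs but far too slow for real dumps.

```python
from difflib import SequenceMatcher


def candidate_signature(dump_a: bytes, dump_b: bytes) -> str:
    """Return the longest common byte run of two buffers as a hex string."""
    matcher = SequenceMatcher(None, dump_a, dump_b, autojunk=False)
    match = matcher.find_longest_match(0, len(dump_a), 0, len(dump_b))
    return dump_a[match.a:match.a + match.size].hex()


shared = bytes.fromhex("e8000000005d81ed")   # made-up common code fragment
sample_one = b"AAAA" + shared + b"BBB"
sample_two = b"zz" + shared + b"yyyy"
print(candidate_signature(sample_one, sample_two))
```

A common run found this way could still be ordinary library or packer
code, so any candidate would need to be checked against clean files before
being used as a signature.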

What are the opinions of the developers in the ClamAV team? Does this seem
like a long shot?

Ozgun.
Re: performance increase
On Thu, 15 Apr 2004 02:27:08 -0700 (PDT)
ozgun@cs.stanford.edu wrote:

> I think it makes more sense now. I read a bunch of papers during the
> past few days, and am just curious: does the ClamAV team have any plans
> for implementing an emulator and a heuristic engine in the future?
> I'm aware that this would make the task of extracting signatures even
> more arduous (to decrypt the encrypted part of a signature), but all
> commercial companies are doing it, and if anyone is planning to start
> coding it, I could help, too. If this idea gets adopted, in the long
> run, tools could be created to automatically extract a virus signature
> using only two infected files (just execute the code in the emulator,
> and look for matching patterns in virtual memory).
>
> What are the opinions of the developers in the ClamAV team? Does this
> seem like a long shot?

Yes, it does. It's not an easy task to write a code emulator from
scratch, and unfortunately there are no ready-to-use open-source
solutions. Of course we can use code from Bochs and some other
projects, but it will still require a lot of work.

--
oo ..... Tomasz Kojm <tkojm@clamav.net>
(\/)\......... http://www.ClamAV.net/gpg/tkojm.gpg
\..........._ 0DCA5A08407D5288279DB43454822DC8985A444B
//\ /\ Mon Apr 19 01:43:15 CEST 2004
Re: performance increase
> Yes, it does. It's not an easy task to write a code emulator from
> scratch, and unfortunately there are no ready-to-use open-source
> solutions. Of course we can use code from Bochs and some other
> projects, but it will still require a lot of work.
>

It's pretty tough to write a decent emulator, but in the end, the product
may well be worth the effort. I'll take a look at the current open source
approaches out there, and maybe I'll even start coding it up in my free
time. Meanwhile, if anyone's interested and/or has any suggestions, I'd
like to hear them too.

Ozgun.