Mailing List Archive

Question
Does ClamAV use Aho-Corasick algorithm to match files against static signatures and Boyer-Moore against signatures that have *'s and ??'s ?

Thanks much,

~Moe

_______________________________________________
http://lurker.clamav.net/list/clamav-devel.html
Please submit your patches to our Bugzilla: http://bugs.clamav.net
Re: Question [ In reply to ]
On 04/24/2010 11:39 PM, Mohammed Al-Saleh wrote:
> Does ClamAV use Aho-Corasick algorithm to match files against static signatures and Boyer-Moore against signatures that have *'s and ??'s ?

No it is not as simple as that, and it is usually the other way around.

read the cli_parse_add() function, it has all the logic of choosing
between AC and BM.

Why do you ask?

Best regards,
--Edwin
_______________________________________________
http://lurker.clamav.net/list/clamav-devel.html
Please submit your patches to our Bugzilla: http://bugs.clamav.net
Re: Question [ In reply to ]
Hi Edwin,

Thanks for your reply.
I need to know the cases where ClamAV has performance bottlenecks or issues.
What kind of texts that could make ClamAV takes more time than usual.
Aho-Corasick and Boyer-Moore might have some situations that cause performance issue.
I might consider doing improvements or study performance impact.
Do you think that this could be a realistic problem to study?

Any help with that is appreciated.

Thanks much,

~Moe




On Apr 26, 2010, at 4:30 AM, Török Edwin wrote:

> On 04/24/2010 11:39 PM, Mohammed Al-Saleh wrote:
>> Does ClamAV use Aho-Corasick algorithm to match files against static signatures and Boyer-Moore against signatures that have *'s and ??'s ?
>
> No it is not as simple as that, and it is usually the other way around.
>
> read the cli_parse_add() function, it has all the logic of choosing
> between AC and BM.
>
> Why do you ask?
>
> Best regards,
> --Edwin
> _______________________________________________
> http://lurker.clamav.net/list/clamav-devel.html
> Please submit your patches to our Bugzilla: http://bugs.clamav.net

~Moe

_______________________________________________
http://lurker.clamav.net/list/clamav-devel.html
Please submit your patches to our Bugzilla: http://bugs.clamav.net
Re: Question [ In reply to ]
On 04/26/2010 10:20 PM, Mohammed Al-Saleh wrote:
> Hi Edwin,
>
> Thanks for your reply.
> I need to know the cases where ClamAV has performance bottlenecks or issues.

The best way to do that is by measuring it.
Read the last part of this reply:
http://lurker.clamav.net/message/20081204.212941.c9fa45c2.en.html

> What kind of texts that could make ClamAV takes more time than usual.

That question is hard to answer, since the signatures change each day,
thus the AC trie changes, the prefiltering patterns change ...

> Aho-Corasick and Boyer-Moore might have some situations that cause performance issue.

There is also a prefiltering step now.
You can search bugzilla on why it was introduced.

> I might consider doing improvements or study performance impact.

Don't expect it to be easy to make improvements.

I spent quite a lot of time on the prefiltering step, and the problem is
that some signatures falsely match a lot of times (like 'PE' from the PE
signature), but the entire signature usually doesn't.
So ClamAV has to stop the trie lookup, test the match, continue the trie
lookup lots of times.
Although the actual test is "fast enough", if it happens a million times
it does slow things down.

Also the AC and BM are not "textbook" versions, they contain extensions
(like wildcards).
It is important that you study the performance with the actual
signatures from main/daily.cvd, and on real files (both clean and infected).

> Do you think that this could be a realistic problem to study?

That depends if you have some specific ideas on how to improve AC/BM, or
you just want to try improving it, and give up if its not possible.

Best regards,
--Edwin
_______________________________________________
http://lurker.clamav.net/list/clamav-devel.html
Please submit your patches to our Bugzilla: http://bugs.clamav.net
Re: Question [ In reply to ]
Hi Edwin,

On Apr 27, 2010, at 7:19 AM, Török Edwin wrote:

> On 04/26/2010 10:20 PM, Mohammed Al-Saleh wrote:
>> Hi Edwin,
>>
>> Thanks for your reply.
>> I need to know the cases where ClamAV has performance bottlenecks or issues.
>
> The best way to do that is by measuring it.
> Read the last part of this reply:
> http://lurker.clamav.net/message/20081204.212941.c9fa45c2.en.html
>
>> What kind of texts that could make ClamAV takes more time than usual.
>
> That question is hard to answer, since the signatures change each day,
> thus the AC trie changes, the prefiltering patterns change ...
>
>> Aho-Corasick and Boyer-Moore might have some situations that cause performance issue.
>
> There is also a prefiltering step now.
> You can search bugzilla on why it was introduced.
>
>> I might consider doing improvements or study performance impact.
>
> Don't expect it to be easy to make improvements.
>
> I spent quite a lot of time on the prefiltering step, and the problem is
> that some signatures falsely match a lot of times (like 'PE' from the PE
> signature), but the entire signature usually doesn't.
> So ClamAV has to stop the trie lookup, test the match, continue the trie
> lookup lots of times.

My understanding (please correct me if I am wrong) is that the first step in matching (let's ignore the filetype recognition and such) is the prefiltering step.
If the filter matches then further matching (using either AC or BM) is needed to make sure that it is not a false positive because the filter could contain more patterns than it should (and the filter matches at most 8 characters of the original signature so the other parts might not match).
I am not sure if I understand your point here and I really want to understand it:
"So ClamAV has to stop the trie lookup, test the match, continue the trie lookup lots of times."
Can you please explain this to me more?
If the filter matches but AC or BM does not, would we return back to the filter to continue from the point it matches?


> Although the actual test is "fast enough", if it happens a million times
> it does slow things down.
>
> Also the AC and BM are not "textbook" versions, they contain extensions
> (like wildcards).
> It is important that you study the performance with the actual
> signatures from main/daily.cvd, and on real files (both clean and infected).
>
>> Do you think that this could be a realistic problem to study?
>
> That depends if you have some specific ideas on how to improve AC/BM, or
> you just want to try improving it, and give up if its not possible.
>
> Best regards,
> --Edwin
> _______________________________________________
> http://lurker.clamav.net/list/clamav-devel.html
> Please submit your patches to our Bugzilla: http://bugs.clamav.net

Thanks much,

~Moe

_______________________________________________
http://lurker.clamav.net/list/clamav-devel.html
Please submit your patches to our Bugzilla: http://bugs.clamav.net
Re: Question [ In reply to ]
On 05/18/2010 09:09 PM, Mohammed Al-Saleh wrote:
> Hi Edwin,
>
> On Apr 27, 2010, at 7:19 AM, Török Edwin wrote:
>
>> On 04/26/2010 10:20 PM, Mohammed Al-Saleh wrote:
>>> Hi Edwin,
>>>
>>> Thanks for your reply.
>>> I need to know the cases where ClamAV has performance bottlenecks or issues.
>>
>> The best way to do that is by measuring it.
>> Read the last part of this reply:
>> http://lurker.clamav.net/message/20081204.212941.c9fa45c2.en.html
>>
>>> What kind of texts that could make ClamAV takes more time than usual.
>>
>> That question is hard to answer, since the signatures change each day,
>> thus the AC trie changes, the prefiltering patterns change ...
>>
>>> Aho-Corasick and Boyer-Moore might have some situations that cause performance issue.
>>
>> There is also a prefiltering step now.
>> You can search bugzilla on why it was introduced.
>>
>>> I might consider doing improvements or study performance impact.
>>
>> Don't expect it to be easy to make improvements.
>>
>> I spent quite a lot of time on the prefiltering step, and the problem is
>> that some signatures falsely match a lot of times (like 'PE' from the PE
>> signature), but the entire signature usually doesn't.
>> So ClamAV has to stop the trie lookup, test the match, continue the trie
>> lookup lots of times.
>
> My understanding (please correct me if I am wrong) is that the first step in matching (let's ignore the filetype recognition and such) is the prefiltering step.
> If the filter matches then further matching (using either AC or BM) is needed to make sure that it is not a false positive because the filter could contain more patterns than it should (and the filter matches at most 8 characters of the original signature so the other parts might not match).

Yes.

> I am not sure if I understand your point here and I really want to understand it:
> "So ClamAV has to stop the trie lookup, test the match, continue the trie lookup lots of times."
> Can you please explain this to me more?
> If the filter matches but AC or BM does not, would we return back to the filter to continue from the point it matches?

No, I was refering to how AC works.

After the AC trie detects a match it needs to check it, the AC trie
contains only a tiny part of the entire signature (up to ac_max_depth),
and the trie itself doesn't contain wildcards etc.

Best regards,
--Edwin
_______________________________________________
http://lurker.clamav.net/list/clamav-devel.html
Please submit your patches to our Bugzilla: http://bugs.clamav.net