Mailing List Archive

Question about matcher-bm.c
Hello,

I'm studying multi-pattern matching and I was browsing the source code for
ClamAV's implementation of a multi-pattern matcher (Wu-Maber based)
algorithm.

I've got a question regarding the block and minimum size values.

At the moment, both the block size and the minimum pattern length are set
to 3 bytes.

If I understood the algorithm correctly, this means that the only possible
shift values are either 0 (at which point a match is possible), or 1
(minimum pattern size - block size + 1).

If this is the case, given that the algorithm can only move at most one
byte at a time, what is the advantage of using this algorithm instead of
Aho-Corasick (besides space efficiency) ?

Thank you for your time.

Best regards,

-Alexandre Dias
_______________________________________________
http://lurker.clamav.net/list/clamav-devel.html
Please submit your patches to our Bugzilla: http://bugs.clamav.net
Re: Question about matcher-bm.c [ In reply to ]
On Mon, Jul 2, 2012 at 5:07 PM, Alexandre Dias <lexx.pt@gmail.com> wrote:

> Hello,
>
> I'm studying multi-pattern matching and I was browsing the source code for
> ClamAV's implementation of a multi-pattern matcher (Wu-Maber based)
> algorithm.
>
> I've got a question regarding the block and minimum size values.
>
> At the moment, both the block size and the minimum pattern length are set
> to 3 bytes.
>
> If I understood the algorithm correctly, this means that the only possible
> shift values are either 0 (at which point a match is possible), or 1
> (minimum pattern size - block size + 1).
>
> If this is the case, given that the algorithm can only move at most one
> byte at a time, what is the advantage of using this algorithm instead of
> Aho-Corasick (besides space efficiency) ?
>
> Thank you for your time.
>
> Best regards,
>
> -Alexandre Dias
> _______________________________________________
> http://lurker.clamav.net/list/clamav-devel.html
> Please submit your patches to our Bugzilla: http://bugs.clamav.net
>

Space efficiency is important. We do need to care about memory usage. But
ruling that out, consider that ClamAV has different places and different
ways it uses pattern matching. For the sake of consistency with how it is
named in the code, I'll refer to the two modified styles of matching as B-M
(for Boyer-Moore/Wu-Manber style) and A-C (for Aho-Corasick).

ClamAV has over 113,000 signatures right now and they are split between the
A-C and B-M categories. ClamAV is not using pure pattern matching of either
style and has pre-filtering steps. Some signatures are scanning direct file
content. Other signatures are matching hashes [or in some cases, hashes of
file segments]. Files can have wildly varying lengths, while the hashes
have predetermined lengths. There are logical signatures that require
certain combinations of matches. ClamAV even uses pattern matching when
checking signatures at load time to filter out those that have been added
to the ignore lists. Any optimization would be impacted daily with each new
signature that is added. To sum up, there are quite a variety of needles
and haystacks involved in the searching.

Back to your question. You are correct that the shift values will be 0 or
1. While I cannot give you an analytical defense to the choice of minimum
pattern size & block size, there is a natural tension between the two. From
what I read, Wu & Manber used a block size of 3 in their May 1994 paper.
And any efficiency gained from longer shifts (which would be based on
values which never appear in any signature) could be targeted by malware
writers to eliminate it by forcing creation of signatures that fill that
gap. I also don't know the difference in effective cost of frequent partial
matches between A-C and B-M. These are things that could be measurable but
I do not have statistics at hand.

There is more history on the topic of algorithms and their use in ClamAV to
be found in the back history of the mailing list. Discussions of everything
from extended Boyer-Moore to bloom filters.

Hope this helps,

Dave R.
---
Dave Raynor
Sourcefire Vulnerability Research Team
draynor@sourcefire.com
_______________________________________________
http://lurker.clamav.net/list/clamav-devel.html
Please submit your patches to our Bugzilla: http://bugs.clamav.net
Re: Question about matcher-bm.c [ In reply to ]
On Wed, Jul 4, 2012 at 4:25 AM, David Raynor <draynor@sourcefire.com> wrote:

> On Mon, Jul 2, 2012 at 5:07 PM, Alexandre Dias <lexx.pt@gmail.com> wrote:
>
> > Hello,
> >
> > I'm studying multi-pattern matching and I was browsing the source code
> for
> > ClamAV's implementation of a multi-pattern matcher (Wu-Maber based)
> > algorithm.
> >
> > I've got a question regarding the block and minimum size values.
> >
> > At the moment, both the block size and the minimum pattern length are set
> > to 3 bytes.
> >
> > If I understood the algorithm correctly, this means that the only
> possible
> > shift values are either 0 (at which point a match is possible), or 1
> > (minimum pattern size - block size + 1).
> >
> > If this is the case, given that the algorithm can only move at most one
> > byte at a time, what is the advantage of using this algorithm instead of
> > Aho-Corasick (besides space efficiency) ?
> >
> > Thank you for your time.
> >
> > Best regards,
> >
> > -Alexandre Dias
> > _______________________________________________
> > http://lurker.clamav.net/list/clamav-devel.html
> > Please submit your patches to our Bugzilla: http://bugs.clamav.net
> >
>
> Space efficiency is important. We do need to care about memory usage. But
> ruling that out, consider that ClamAV has different places and different
> ways it uses pattern matching. For the sake of consistency with how it is
> named in the code, I'll refer to the two modified styles of matching as B-M
> (for Boyer-Moore/Wu-Manber style) and A-C (for Aho-Corasick).
>
> ClamAV has over 113,000 signatures right now and they are split between the
> A-C and B-M categories. ClamAV is not using pure pattern matching of either
>
Hello Dave R,

1) How to ClamAV categories virus signature in SHA1, SHA256, MD5 and
Hexdump types?
2) What's estimate signature types of virus load to A-C and B-M on
ClamAV? I see flags --ac-only for loading signature file to A-C tires, But
I not sure how to selected virus types load to A-C and B-M algorithms when
scanning virus in common mode.

style and has pre-filtering steps. Some signatures are scanning direct file
> content. Other signatures are matching hashes [or in some cases, hashes of
> file segments]. Files can have wildly varying lengths, while the hashes
> have predetermined lengths. There are logical signatures that require
> certain combinations of matches. ClamAV even uses pattern matching when
> checking signatures at load time to filter out those that have been added
> to the ignore lists. Any optimization would be impacted daily with each new
> signature that is added. To sum up, there are quite a variety of needles
> and haystacks involved in the searching.
>
> Back to your question. You are correct that the shift values will be 0 or
> 1. While I cannot give you an analytical defense to the choice of minimum
> pattern size & block size, there is a natural tension between the two. From
> what I read, Wu & Manber used a block size of 3 in their May 1994 paper.
> And any efficiency gained from longer shifts (which would be based on
> values which never appear in any signature) could be targeted by malware
> writers to eliminate it by forcing creation of signatures that fill that
> gap. I also don't know the difference in effective cost of frequent partial
> matches between A-C and B-M. These are things that could be measurable but
> I do not have statistics at hand.
>
> There is more history on the topic of algorithms and their use in ClamAV to
> be found in the back history of the mailing list. Discussions of everything
> from extended Boyer-Moore to bloom filters.
>
> Hope this helps,
>
> Dave R.
> ---
> Dave Raynor
> Sourcefire Vulnerability Research Team
> draynor@sourcefire.com
> _______________________________________________
> http://lurker.clamav.net/list/clamav-devel.html
> Please submit your patches to our Bugzilla: http://bugs.clamav.net
>



--
:--------------------------------------------------------
_______________________________________________
http://lurker.clamav.net/list/clamav-devel.html
Please submit your patches to our Bugzilla: http://bugs.clamav.net
Re: Question about matcher-bm.c [ In reply to ]
On Wed, Aug 15, 2012 at 6:58 AM, Chatsiri Ratana <insiderboy@gmail.com>wrote:

> Hello Dave R,
>
> 1) How to ClamAV categories virus signature in SHA1, SHA256, MD5 and
> Hexdump types?
> 2) What's estimate signature types of virus load to A-C and B-M on
> ClamAV? I see flags --ac-only for loading signature file to A-C tires, But
> I not sure how to selected virus types load to A-C and B-M algorithms when
> scanning virus in common mode.
>
>
>
>
> --
> :--------------------------------------------------------
> _______________________________________________
> http://lurker.clamav.net/list/clamav-devel.html
> Please submit your patches to our Bugzilla: http://bugs.clamav.net
>

1) Details on signature formats are in the signatures.pdf included in the
docs folder of the source.

2) This question is a little confusing. If you are asking about numbers of
signatures, the numbers change daily. If you run clamscan in debug mode, it
will report the size and contents of the tries with signature counts
grouped by the filetypes they will scan. There are counts for both BM and
AC.

Hope this helps,

Dave R.

--
---
Dave Raynor
Sourcefire Vulnerability Research Team
draynor@sourcefire.com
_______________________________________________
http://lurker.clamav.net/list/clamav-devel.html
Please submit your patches to our Bugzilla: http://bugs.clamav.net
Re: Question about matcher-bm.c [ In reply to ]
On Wed, Aug 15, 2012 at 11:35 PM, David Raynor <draynor@sourcefire.com>wrote:

> On Wed, Aug 15, 2012 at 6:58 AM, Chatsiri Ratana <insiderboy@gmail.com
> >wrote:
>
> > Hello Dave R,
> >
> > 1) How to ClamAV categories virus signature in SHA1, SHA256, MD5 and
> > Hexdump types?
> > 2) What's estimate signature types of virus load to A-C and B-M on
> > ClamAV? I see flags --ac-only for loading signature file to A-C tires,
> But
> > I not sure how to selected virus types load to A-C and B-M algorithms
> when
> > scanning virus in common mode.
> >
> >
> >
> >
> > --
> > :--------------------------------------------------------
> > _______________________________________________
> > http://lurker.clamav.net/list/clamav-devel.html
> > Please submit your patches to our Bugzilla: http://bugs.clamav.net
> >
>
> 1) Details on signature formats are in the signatures.pdf included in the
> docs folder of the source.
>
Hello Dave R,

I not found section in detail of why we selected signature virus is MD5
or SHA1 when using Sigtool get signature from binary files. Signature.pdf
present only method for creating signature virus with MD5.

Best Regards,
Chatsiri Rattana.


> 2) This question is a little confusing. If you are asking about numbers of
> signatures, the numbers change daily. If you run clamscan in debug mode, it
> will report the size and contents of the tries with signature counts
> grouped by the filetypes they will scan. There are counts for both BM and
> AC.
>
> Hope this helps,
>
> Dave R.
>
> --
> ---
> Dave Raynor
> Sourcefire Vulnerability Research Team
> draynor@sourcefire.com
> _______________________________________________
> http://lurker.clamav.net/list/clamav-devel.html
> Please submit your patches to our Bugzilla: http://bugs.clamav.net
>



--
:--------------------------------------------------------
_______________________________________________
http://lurker.clamav.net/list/clamav-devel.html
Please submit your patches to our Bugzilla: http://bugs.clamav.net
Re: Question about matcher-bm.c [ In reply to ]
Hi Chatsiri,
PE section MD5 signatures are more useful than MD5 signatures of the entire
file (because it allows the other section of the PE to vary, thus catching more
samples with a single signature. Moreover, updating becomes easy this way.
Hope you got your answer.

On Thu, Aug 16, 2012 at 5:51 PM, Chatsiri Ratana <insiderboy@gmail.com>wrote:

> On Wed, Aug 15, 2012 at 11:35 PM, David Raynor <draynor@sourcefire.com
> >wrote:
>
> > On Wed, Aug 15, 2012 at 6:58 AM, Chatsiri Ratana <insiderboy@gmail.com
> > >wrote:
> >
> > > Hello Dave R,
> > >
> > > 1) How to ClamAV categories virus signature in SHA1, SHA256, MD5
> and
> > > Hexdump types?
> > > 2) What's estimate signature types of virus load to A-C and B-M on
> > > ClamAV? I see flags --ac-only for loading signature file to A-C tires,
> > But
> > > I not sure how to selected virus types load to A-C and B-M algorithms
> > when
> > > scanning virus in common mode.
> > >
> > >
> > >
> > >
> > > --
> > > :--------------------------------------------------------
> > > _______________________________________________
> > > http://lurker.clamav.net/list/clamav-devel.html
> > > Please submit your patches to our Bugzilla: http://bugs.clamav.net
> > >
> >
> > 1) Details on signature formats are in the signatures.pdf included in the
> > docs folder of the source.
> >
> Hello Dave R,
>
> I not found section in detail of why we selected signature virus is MD5
> or SHA1 when using Sigtool get signature from binary files. Signature.pdf
> present only method for creating signature virus with MD5.
>
> Best Regards,
> Chatsiri Rattana.
>
>
> > 2) This question is a little confusing. If you are asking about numbers
> of
> > signatures, the numbers change daily. If you run clamscan in debug mode,
> it
> > will report the size and contents of the tries with signature counts
> > grouped by the filetypes they will scan. There are counts for both BM and
> > AC.
> >
> > Hope this helps,
> >
> > Dave R.
> >
> > --
> > ---
> > Dave Raynor
> > Sourcefire Vulnerability Research Team
> > draynor@sourcefire.com
> > _______________________________________________
> > http://lurker.clamav.net/list/clamav-devel.html
> > Please submit your patches to our Bugzilla: http://bugs.clamav.net
> >
>
>
>
> --
> :--------------------------------------------------------
> _______________________________________________
> http://lurker.clamav.net/list/clamav-devel.html
> Please submit your patches to our Bugzilla: http://bugs.clamav.net
>



--
Vishrut Sharma
Security Researcher
Vice Chair, Membership Growth
and Sustainability Committee, IEEE CS India Council
---------------------------------
Member of ACM, IEEE,
IEEE Computer Society, DSCI
---------------------------------
URL: *http://member.acm.org/~vishrut1* <http://member.acm.org/~vishrut1>
_______________________________________________
http://lurker.clamav.net/list/clamav-devel.html
Please submit your patches to our Bugzilla: http://bugs.clamav.net
Re: Question about matcher-bm.c [ In reply to ]
On Thu, Aug 16, 2012 at 8:01 PM, Vishrut Sharma <v.vishrut@gmail.com> wrote:

> Hi Chatsiri,
> PE section MD5 signatures are more useful than MD5 signatures of the entire
> file (because it allows the other section of the PE to vary, thus catching
> more
> samples with a single signature. Moreover, updating becomes easy this way.
> Hope you got your answer.


Hello Vishrut Sharma,

If not PE type in system, Such as javascript(malicious code) and
another file types. Should we use SHA1, SHA256 and Hexdump?

Best Regards,
Chatsiri Rattana.


>
> On Thu, Aug 16, 2012 at 5:51 PM, Chatsiri Ratana <insiderboy@gmail.com
> >wrote:
>
> > On Wed, Aug 15, 2012 at 11:35 PM, David Raynor <draynor@sourcefire.com
> > >wrote:
> >
> > > On Wed, Aug 15, 2012 at 6:58 AM, Chatsiri Ratana <insiderboy@gmail.com
> > > >wrote:
> > >
> > > > Hello Dave R,
> > > >
> > > > 1) How to ClamAV categories virus signature in SHA1, SHA256, MD5
> > and
> > > > Hexdump types?
> > > > 2) What's estimate signature types of virus load to A-C and B-M
> on
> > > > ClamAV? I see flags --ac-only for loading signature file to A-C
> tires,
> > > But
> > > > I not sure how to selected virus types load to A-C and B-M algorithms
> > > when
> > > > scanning virus in common mode.
> > > >
> > > >
> > > >
> > > >
> > > > --
> > > > :--------------------------------------------------------
> > > > _______________________________________________
> > > > http://lurker.clamav.net/list/clamav-devel.html
> > > > Please submit your patches to our Bugzilla: http://bugs.clamav.net
> > > >
> > >
> > > 1) Details on signature formats are in the signatures.pdf included in
> the
> > > docs folder of the source.
> > >
> > Hello Dave R,
> >
> > I not found section in detail of why we selected signature virus is
> MD5
> > or SHA1 when using Sigtool get signature from binary files. Signature.pdf
> > present only method for creating signature virus with MD5.
> >
> > Best Regards,
> > Chatsiri Rattana.
> >
> >
> > > 2) This question is a little confusing. If you are asking about numbers
> > of
> > > signatures, the numbers change daily. If you run clamscan in debug
> mode,
> > it
> > > will report the size and contents of the tries with signature counts
> > > grouped by the filetypes they will scan. There are counts for both BM
> and
> > > AC.
> > >
> > > Hope this helps,
> > >
> > > Dave R.
> > >
> > > --
> > > ---
> > > Dave Raynor
> > > Sourcefire Vulnerability Research Team
> > > draynor@sourcefire.com
> > > _______________________________________________
> > > http://lurker.clamav.net/list/clamav-devel.html
> > > Please submit your patches to our Bugzilla: http://bugs.clamav.net
> > >
> >
> >
> >
> > --
> > :--------------------------------------------------------
> > _______________________________________________
> > http://lurker.clamav.net/list/clamav-devel.html
> > Please submit your patches to our Bugzilla: http://bugs.clamav.net
> >
>
>
>
> --
> Vishrut Sharma
> Security Researcher
> Vice Chair, Membership Growth
> and Sustainability Committee, IEEE CS India Council
> ---------------------------------
> Member of ACM, IEEE,
> IEEE Computer Society, DSCI
> ---------------------------------
> URL: *http://member.acm.org/~vishrut1* <http://member.acm.org/~vishrut1>
> _______________________________________________
> http://lurker.clamav.net/list/clamav-devel.html
> Please submit your patches to our Bugzilla: http://bugs.clamav.net
>



--
:--------------------------------------------------------
_______________________________________________
http://lurker.clamav.net/list/clamav-devel.html
Please submit your patches to our Bugzilla: http://bugs.clamav.net
Re: Question about matcher-bm.c [ In reply to ]
I'd go towards an ndb type signature first.

On Thu, Aug 16, 2012 at 10:37 PM, Chatsiri Ratana <insiderboy@gmail.com>wrote:

> On Thu, Aug 16, 2012 at 8:01 PM, Vishrut Sharma <v.vishrut@gmail.com>
> wrote:
>
> > Hi Chatsiri,
> > PE section MD5 signatures are more useful than MD5 signatures of the
> entire
> > file (because it allows the other section of the PE to vary, thus
> catching
> > more
> > samples with a single signature. Moreover, updating becomes easy this
> way.
> > Hope you got your answer.
>
>
> Hello Vishrut Sharma,
>
> If not PE type in system, Such as javascript(malicious code) and
> another file types. Should we use SHA1, SHA256 and Hexdump?
>
> Best Regards,
> Chatsiri Rattana.
>
>
> >
> > On Thu, Aug 16, 2012 at 5:51 PM, Chatsiri Ratana <insiderboy@gmail.com
> > >wrote:
> >
> > > On Wed, Aug 15, 2012 at 11:35 PM, David Raynor <draynor@sourcefire.com
> > > >wrote:
> > >
> > > > On Wed, Aug 15, 2012 at 6:58 AM, Chatsiri Ratana <
> insiderboy@gmail.com
> > > > >wrote:
> > > >
> > > > > Hello Dave R,
> > > > >
> > > > > 1) How to ClamAV categories virus signature in SHA1, SHA256, MD5
> > > and
> > > > > Hexdump types?
> > > > > 2) What's estimate signature types of virus load to A-C and B-M
> > on
> > > > > ClamAV? I see flags --ac-only for loading signature file to A-C
> > tires,
> > > > But
> > > > > I not sure how to selected virus types load to A-C and B-M
> algorithms
> > > > when
> > > > > scanning virus in common mode.
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > :--------------------------------------------------------
> > > > > _______________________________________________
> > > > > http://lurker.clamav.net/list/clamav-devel.html
> > > > > Please submit your patches to our Bugzilla: http://bugs.clamav.net
> > > > >
> > > >
> > > > 1) Details on signature formats are in the signatures.pdf included in
> > the
> > > > docs folder of the source.
> > > >
> > > Hello Dave R,
> > >
> > > I not found section in detail of why we selected signature virus is
> > MD5
> > > or SHA1 when using Sigtool get signature from binary files.
> Signature.pdf
> > > present only method for creating signature virus with MD5.
> > >
> > > Best Regards,
> > > Chatsiri Rattana.
> > >
> > >
> > > > 2) This question is a little confusing. If you are asking about
> numbers
> > > of
> > > > signatures, the numbers change daily. If you run clamscan in debug
> > mode,
> > > it
> > > > will report the size and contents of the tries with signature counts
> > > > grouped by the filetypes they will scan. There are counts for both BM
> > and
> > > > AC.
> > > >
> > > > Hope this helps,
> > > >
> > > > Dave R.
> > > >
> > > > --
> > > > ---
> > > > Dave Raynor
> > > > Sourcefire Vulnerability Research Team
> > > > draynor@sourcefire.com
> > > > _______________________________________________
> > > > http://lurker.clamav.net/list/clamav-devel.html
> > > > Please submit your patches to our Bugzilla: http://bugs.clamav.net
> > > >
> > >
> > >
> > >
> > > --
> > > :--------------------------------------------------------
> > > _______________________________________________
> > > http://lurker.clamav.net/list/clamav-devel.html
> > > Please submit your patches to our Bugzilla: http://bugs.clamav.net
> > >
> >
> >
> >
> > --
> > Vishrut Sharma
> > Security Researcher
> > Vice Chair, Membership Growth
> > and Sustainability Committee, IEEE CS India Council
> > ---------------------------------
> > Member of ACM, IEEE,
> > IEEE Computer Society, DSCI
> > ---------------------------------
> > URL: *http://member.acm.org/~vishrut1* <http://member.acm.org/~vishrut1>
> > _______________________________________________
> > http://lurker.clamav.net/list/clamav-devel.html
> > Please submit your patches to our Bugzilla: http://bugs.clamav.net
> >
>
>
>
> --
> :--------------------------------------------------------
> _______________________________________________
> http://lurker.clamav.net/list/clamav-devel.html
> Please submit your patches to our Bugzilla: http://bugs.clamav.net
>



--
Joel Esler
Senior Research Engineer, VRT
OpenSource Community Manager
Sourcefire
_______________________________________________
http://lurker.clamav.net/list/clamav-devel.html
Please submit your patches to our Bugzilla: http://bugs.clamav.net