Mailing List Archive

Does clamav work with hex or characters?
Hi

I have a basic question. Most body-based signatures are hex-based (let's
focus on fixed-string signatures alone for simplicity), whereas some of the
files are hex (EXE) or character-based (HTML).

In the code I see unsigned chars used predominantly to represent patterns
and file contents. At the very core of the string matching algorithms,
mainly extended Boyer-Moore, I would like to understand how the datatypes
get manipulated.

1) Do the character-based files get translated to hex to compare with
body-based signatures?

2) Does the signature get treated as a string of chars?
If yes,
does a toy signature "fe" get treated as two chars (8 bits each) for "f"
and "e", or
does the code read the signature "fe" and map it into one character based
on the ASCII table (for example)?

Thank you..
_______________________________________________
http://lurker.clamav.net/list/clamav-devel.html
Please submit your patches to our Bugzilla: http://bugs.clamav.net
Re: Does clamav work with hex or characters?
Well... data is data. There is no difference (from a storage perspective)
between an executable with an "inc ecx" instruction (in 32-bit x86) and a
text document with an "A". Both are represented by the value 0x41. So from
Clam's perspective, a signature matching a single A would be identical to a
signature that detected a single "inc ecx" instruction. Both would look
for 41.

In short, your statement "some files are hex and some are character-based"
isn't really accurate. At the risk of painting with a broad brush, I would
say that all files are stored as a series of values, a series of bytes.
How you display them is different. When I use 010 Editor to view a file
as hex, I get a set of ASCII-hex representations. When I look at a file
with a web browser, I get ASCII text. But underlying all of that is the
same idea, a set of bytes. And that is how ClamAV treats all files.

A signature with a 41 in it would be converted in memory to look for 0x41,
a single byte of value 0x41. A signature written like that would detect an
executable, a PDF, a Flash file, or anything else that has 0x41 in the data.
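
A minimal sketch of the point above, in Python rather than ClamAV's C: the
same byte value 0x41 appears in a text buffer and in a buffer of 32-bit x86
machine code, and a signature written as the hex text "41" decodes to that
one byte. The sample buffers and the helper name matches() are made up for
illustration, and a plain substring search stands in for ClamAV's real
matchers.

```python
# Illustrative only: plain substring search stands in for ClamAV's matchers.

def matches(signature_hex: str, data: bytes) -> bool:
    """Decode a fixed hex signature to raw bytes and look for it in data."""
    pattern = bytes.fromhex(signature_hex)  # "41" -> b"\x41" (one byte)
    return pattern in data

text_file = b"A"                 # the letter 'A' in ASCII is byte 0x41
code_file = bytes([0x41, 0xC3])  # 32-bit x86: inc ecx (0x41), ret (0xC3)

print(matches("41", text_file))        # True
print(matches("41", code_file))        # True
print(matches("41", b"\x00\x01\x02"))  # False
```

Neither buffer is "hex" or "character-based" here; both are just bytes, and
the signature text is only a convenient way of writing those bytes down.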

Hope that answers your question.

Matt


On Fri, Mar 22, 2013 at 8:46 PM, Kaushik Vaidyanathan <
kvaidya1@andrew.cmu.edu> wrote:

Re: Does clamav work with hex or characters?
It was pointed out to me that in my explanation I failed to lay out how
ClamAV avoids alerting on both executable and HTML files with a single
signature. Signatures can be tagged with a target type. A signature of
target type 1 is only evaluated against Portable Executable (PE) files,
while a signature of target type 3 only looks at HTML files. File types are
laid out on page 8 of the "Creating Signatures for ClamAV" document, found
here: http://www.clamav.net/doc/latest/signatures.pdf
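
As a rough sketch of how a target type narrows what a signature is checked
against, here is a toy Python filter. It assumes the extended signature
layout Name:TargetType:Offset:HexSignature and uses only the target types
mentioned in this thread (1 for PE, 3 for HTML) plus 0 for "any file". The
signature names, the ToySig class and the file_type argument are
hypothetical, and real file-type detection is not modelled at all.

```python
from dataclasses import dataclass

# Target types used in this sketch: 0 = any file, 1 = PE, 3 = HTML.
ANY, PE, HTML = 0, 1, 3

@dataclass
class ToySig:
    name: str
    target: int
    pattern: bytes

def parse_sig(line: str) -> ToySig:
    """Parse 'Name:TargetType:Offset:HexSignature' (offset is ignored here)."""
    name, target, _offset, hexsig = line.strip().split(":")
    return ToySig(name, int(target), bytes.fromhex(hexsig))

def scan(data: bytes, file_type: int, sigs: list) -> list:
    """Return names of signatures whose target fits the file and whose bytes occur."""
    return [s.name for s in sigs
            if s.target in (ANY, file_type) and s.pattern in data]

sigs = [parse_sig("Toy.PE.IncEcx:1:*:41"), parse_sig("Toy.HTML.A:3:*:41")]
print(scan(b"A", HTML, sigs))       # ['Toy.HTML.A']
print(scan(b"\x41\xc3", PE, sigs))  # ['Toy.PE.IncEcx']
```

Both toy signatures look for the same byte, 0x41; only the target type
decides which one is allowed to fire for a given file.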

Matt


On Sat, Mar 23, 2013 at 2:02 PM, Matt Olney <molney@sourcefire.com> wrote:

Re: Does clamav work with hex or characters?
Hi Matt

Thanks for your detailed explanation of how signatures get stored and
interpreted.

I was looking through the code in libclamav to see what data formats get
used for string comparison. Some backtracking from cli_bm_scanbuff took me
to str.c, where I see there is a function "cli_hex2str", which, if I
understand correctly, maps two hex digits to one character (unsigned char).
Would it be fair to speculate that this function is used by the clamav
engine to map two hex digits read from a signature or scanned file into one
char for string matching purposes?

Thank you..


Re: Does clamav work with hex or characters?
Kaushik,

Those assumptions are correct. You should also have a look at the file
libclamav/readdb.c. That is where the signatures are read and loaded into
memory for the pattern matchers.

Steve

Re: Does clamav work with hex or characters?
On Sat, Mar 23, 2013 at 3:34 PM, Kaushik Vaidyanathan <
kvaidya1@andrew.cmu.edu> wrote:

> Hi Matt
>
> Thanks for your detailed explanation of how signatures get stored and
> interpreted.
>
> I was looking through the code in libclamav to see what data formats get
> used for string comparison. Some backtracking from cli_bm_scanbuff took me
> to str.c, where I see there is a function "cli_hex2str", which, if I
> understand correctly, maps two hex digits to one character (unsigned char).
> Would it be fair to speculate that this function is used by the clamav
> engine to map two hex digits read from a signature or scanned file into one
> char for string matching purposes?

Read from signature, yes. Read from file, no. To compare bytes quickly, it
is better to use the in-file binary representation. It is more direct to
say that cli_hex2str() converts the human-readable representation of a
hexadecimal number into its binary equivalent. For any byte pattern to
match, the signature-format equivalent will take twice as many bytes as the
raw binary value.

Example: "Hex" in ASCII
Actual data is 3 bytes long. 1st byte: 0x48. 2nd byte: 0x65. 3rd byte: 0x78
Signature-format equivalent is 6 bytes long, one for each hex digit.

This is where the name of the function came from. Input and output are both
char arrays (i.e. strings). The function takes in the "hex"-format version
of the content [486578], and returns the content in a usable string format
[Hex]. Hence, from "hex" to string.
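
The conversion described here can be sketched in a few lines of Python. The
function hex_to_bytes() below is a hypothetical stand-in for what
cli_hex2str() does conceptually (two hex digits in, one byte out); it is
not the libclamav code, and it ignores wildcards and any other signature
syntax.

```python
def hex_to_bytes(hex_text: str) -> bytes:
    """Convert hex text such as '486578' into the raw bytes it denotes."""
    if len(hex_text) % 2 != 0:
        raise ValueError("hex text must have an even number of digits")
    out = bytearray()
    for i in range(0, len(hex_text), 2):
        # Two hex digits (4 bits each) combine into one 8-bit value.
        out.append(int(hex_text[i:i + 2], 16))
    return bytes(out)

sig_text = "486578"
raw = hex_to_bytes(sig_text)
print(raw)                      # b'Hex'
print(len(sig_text), len(raw))  # 6 3
print(hex_to_bytes("fe"))       # b'\xfe' (one byte, not 'f' and 'e')
```

The last line bears on the toy signature "fe" from the original question:
in this sketch it becomes the single byte 0xFE, and the scanned data is
compared against that byte without any conversion of the file itself.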

Dave R.

--
---
Dave Raynor
Sourcefire Vulnerability Research Team
draynor@sourcefire.com
Re: Does clamav work with hex or characters?
Ah, that helps a lot, Dave.

P.S. Excuse my typos. Touched and not typed.
On Mar 25, 2013 5:39 PM, "David Raynor" <draynor@sourcefire.com> wrote:
