[clamav-users] Filetype determination
One problem we keep running into is web pages and CGI scripts that are
"inconsistently" normalized. I put "inconsistently" in quotes because,
without knowing exactly how ClamAV normalizes files, it is sometimes
difficult to understand why two similar files are normalized differently.
For example, a PHP script that contains no HTML tags is normalized using
'ascii-normalise', while the exact same PHP code is normalized with
'html-normalise' if it happens to be tacked on to some HTML.

This seems to be particularly prevalent with phishing kits, where we want
to write a signature based on the PHP code, not necessarily on the HTML. As
a result, we end up having to write two signatures, because HTML
normalization seems to remove the spaces around equal signs while ASCII
normalization leaves them in. Additionally, HTML normalization replaces
single quotes (') with double quotes ("), while ASCII normalization leaves
them unchanged.

Example:
$ip = getenv("REMOTE_ADDR");
$password = $_POST['password'];

ASCII normalized:
$ip = getenv("remote_addr");
$password = $_post['password'];

HTML normalized:
$ip=getenv("remote_addr");
$password=$_post["password"];
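To make the difference concrete, here is a rough Python sketch that
reproduces only the behaviour observed in the example above. It is not
ClamAV's actual normalization code, and the function names are made up for
illustration; the real normalizers may well do more than this.

```python
import re


def ascii_normalise_sketch(text: str) -> str:
    """Hypothetical stand-in for 'ascii-normalise'.

    Only models what the example shows: the text is lowercased, while
    spacing and quote characters are left alone.
    """
    return text.lower()


def html_normalise_sketch(text: str) -> str:
    """Hypothetical stand-in for 'html-normalise'.

    Only models what the example shows: the text is lowercased, spaces
    around '=' are dropped, and single quotes become double quotes.
    """
    lowered = text.lower()
    # Remove spaces and tabs on either side of an equal sign.
    tightened = re.sub(r"[ \t]*=[ \t]*", "=", lowered)
    # Replace single quotes with double quotes.
    return tightened.replace("'", '"')


if __name__ == "__main__":
    for line in ('$ip = getenv("REMOTE_ADDR");',
                 "$password = $_POST['password'];"):
        print("ASCII:", ascii_normalise_sketch(line))
        print("HTML: ", html_normalise_sketch(line))
```

Running the same PHP line through both functions shows why one signature
cannot match both normalized forms.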

So, my question is this:
How can we get files containing PHP tags ( <? and <?php ) detected as the
'HTML' file type so they are normalized the same way as other 'web' files?


Also, more than a few HTML files that browsers render 'properly' contain
none of the following tags (the ones that seem to trigger HTML detection):

'<html>'
'<head>'
'<a*href'
'<img'
'<script'
'<object'
'<iframe'
'<table'


Matching a few other tags, such as <style, <!doctype, <meta, <title, and
<form, might help, as would relaxing the html and head patterns to require
only the opening part of the tag (<html instead of <html>).
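
As a rough illustration of the suggestion, the Python sketch below compares
the tag list above with a loosened list. It is only a toy substring/regex
check, not how ClamAV actually determines file types; the names
LISTED_PATTERNS, PROPOSED_PATTERNS and looks_like_html are hypothetical, the
"<a*href" entry is assumed to mean "<a", any attributes, then "href", and
the "<?" entry is my wish from the question above, not existing behaviour.

```python
import re

# Tags from the list above, written as regular expressions.
LISTED_PATTERNS = [
    r"<html>", r"<head>", r"<a[^>]*href", r"<img",
    r"<script", r"<object", r"<iframe", r"<table",
]

# Loosened variant: <html and <head need only the opening part of the tag,
# plus the extra tags suggested above and the PHP open tag.
PROPOSED_PATTERNS = [
    r"<html", r"<head", r"<a[^>]*href", r"<img",
    r"<script", r"<object", r"<iframe", r"<table",
    r"<style", r"<!doctype", r"<meta", r"<title", r"<form",
    r"<\?",  # matches both <? and <?php
]


def looks_like_html(data: str, patterns: list[str]) -> bool:
    """Return True if any pattern occurs in the lowercased data."""
    lowered = data.lower()
    return any(re.search(pattern, lowered) for pattern in patterns)


if __name__ == "__main__":
    samples = [
        '<html lang="en"><body>hi</body></html>',
        '<?php $ip = getenv("REMOTE_ADDR"); ?>',
        '<form action="x.php" method="post">',
    ]
    for sample in samples:
        print(looks_like_html(sample, LISTED_PATTERNS),
              looks_like_html(sample, PROPOSED_PATTERNS),
              sample)
```

All three samples are missed by the strict list and caught by the loosened
one, which is the kind of file we keep seeing in phishing kits.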


--Maarten