Mailing List Archive

scanning large zip files -> file loaded entirely into (RSS) memory during extraction?
Hi,

I'm having a 'problem' with scanning a large zip file.
What appears to be happening is that clamav loads the entire file into
memory (RSS not VSZ) and then unzips the file.

The increase in VSZ memory is expected since the file is mmap'ed; the
increase in RSS memory is/was not expected (by me).
When scanning a .bz2 or a .gz file then this does not happen. The VSZ
increases but the RSS does not (or at least not that much).

My test:
$ dd if=/dev/urandom of=test_400rand.bin bs=1M count=400
$ zip zip test_400rand.bin.zip test_400rand.bin
$ bzip2 --keep test_400rand.bin
$ gzip test_400rand.bin

The resulting files:
- test_400rand.bin: 400 MB
- test_400rand.bin.zip: 401 MB
- test_400rand.bin.bz2: 402 MB
- test_400rand.bin.gz: 401 MB



Scanning with clamav version 0.97.3 (compiled from source):
$ clamscan --max-filesize=500000000 --max-scansize=4000000000
--scan-archive=yes test_400rand.bin.bz2
Start: VSZ: 114MB, RSS: 105MB
Scanning: VSZ: 520MB, RSS: 114MB (mmap of archive file)
Scanning-2: VSZ: 917 MB, RSS: 119 MB (mmap of extracted file)

$ clamscan --max-filesize=500000000 --max-scansize=4000000000
--scan-archive=yes test_400rand.bin.gz
Start: VSZ: 115MB, RSS: 105MB
Scanning: VSZ: 515MB, RSS: 114 MB (mmap of archive file)
Scanning-2: VSZ: 916 MB, RSS: 121 MB (mmap of extracted file)

$ clamscan --max-filesize=500000000 --max-scansize=4000000000
--scan-archive=yes test_400rand.bin.zip
Start: VSZ: 113MB, RSS: 105MB
Scanning: VSZ: 515MB, RSS: 111MB (mmap of archive file)
Scanning-2: VSZ: 515MB, RSS: 506MB (???)
Scanning-3: VSZ: 916MB, RSS: 514MB (mmap of extracted file)


I looked at the source code and the increase in (RSS) memory happens
in libclamav/unzip.c in the function 'lhdr':
if (!csize) { /* FIXME: what's used for method0 files? csize or
usize? Nothing in the specs, needs testing */
cli_dbgmsg("cli_unzip: lh - skipping empty file\n");
} else {
if(zsize<csize) {
cli_dbgmsg("cli_unzip: lh - stream out of file\n");
fmap_unneed_off(map, loff, SIZEOF_LH);
return 0;
}
if(LH_flags & F_ENCR) {
cli_dbgmsg("cli_unzip: lh - skipping encrypted file\n");
} else {
if(fmap_need_ptr_once(map, zip, csize)) {
*ret = unz(zip, csize, usize, LH_method, LH_flags, fu,
ctx, tmpd);
}
}
zip+=csize;
zsize-=csize;
}

The call to 'fmap_need_ptr_once' causes an increase in the RSS memory
of about 400MB... (csize = 419.498.208, file size = 419.498.372)



Questions:
Is it expected that the RSS memory increases with approximately the
size of the zip file before extracting it?

Is this necessary? (Tools such as unzip are able to decompress the
file without loading the entire file into memory).

What is the purpose of fmap/fmap_need_ptr_once?




Best regards,

Bram


_______________________________________________
http://lurker.clamav.net/list/clamav-devel.html
Please submit your patches to our Bugzilla: http://bugs.clamav.net
Re: scanning large zip files -> file loaded entirely into (RSS) memory during extraction? [ In reply to ]
On 12/12/11 09:31, Bram wrote:
> Questions:
> Is it expected that the RSS memory increases with approximately the size
> of the zip file before extracting it?

No. The RSS memory size will increase at most by the smaller of
- compressed internal file size
- max-filesize (the file is skipped if the value's exceeded)
- max-scansize (the scan is aborted if the value's exceeded)

> Is this necessary? (Tools such as unzip are able to decompress the file
> without loading the entire file into memory).

Not necessary but it makes the code much simpler in the default case
scenario as, with the default settings, the allocated memory is way
below CLI_MAX_ALLOCATION.
I agree that the implicit relation between the alloc size and the
max-(scan|file)size setting could be better documented.
But frankly I don't see a point in refactoring the code to handle
insanely large values of max-XXXsize in a way that is memory efficient.
After all we're targeting malware which is on avg 2-300 KB in size.

> What is the purpose of fmap/fmap_need_ptr_once?

That's basically an alloc and read.

HtH,
--aCaB
_______________________________________________
http://lurker.clamav.net/list/clamav-devel.html
Please submit your patches to our Bugzilla: http://bugs.clamav.net
Re: scanning large zip files -> file loaded entirely into (RSS) memory during extraction? [ In reply to ]
Quoting aCaB <acabng@digitalfuture.it>:
>
>> Is this necessary? (Tools such as unzip are able to decompress the file
>> without loading the entire file into memory).
>
> Not necessary but it makes the code much simpler in the default case
> scenario as, with the default settings, the allocated memory is way
> below CLI_MAX_ALLOCATION.
> I agree that the implicit relation between the alloc size and the
> max-(scan|file)size setting could be better documented.

Is this implicit relation only for zip files or also for other
compression/archive formats? (I tested bzip2 and gzip and there the
RSS does not increase - or at least not that much)

> But frankly I don't see a point in refactoring the code to handle
> insanely large values of max-XXXsize in a way that is memory efficient.
> After all we're targeting malware which is on avg 2-300 KB in size.

Just to make sure: what part of the code would need refactoring? I
*assume* it's (only) the decompression code and not the rest of the
code since the extraction code creates a temp file (which, I assume,
is later mmap'ed)


Best regards,

Bram


_______________________________________________
http://lurker.clamav.net/list/clamav-devel.html
Please submit your patches to our Bugzilla: http://bugs.clamav.net
Re: scanning large zip files -> file loaded entirely into (RSS) memory during extraction? [ In reply to ]
On 12/12/11 15:34, Bram wrote:
> Is this implicit relation only for zip files or also for other
> compression/archive formats? (I tested bzip2 and gzip and there the RSS
> does not increase - or at least not that much)

I'm sure it affects other formats too. But I can't tell which on the top
of my mind, sorry.

> Just to make sure: what part of the code would need refactoring?

The unz() function.

Cheers,
--aCaB
_______________________________________________
http://lurker.clamav.net/list/clamav-devel.html
Please submit your patches to our Bugzilla: http://bugs.clamav.net
Re: scanning large zip files -> file loaded entirely into (RSS) memory during extraction? [ In reply to ]
>> Is it expected that the RSS memory increases with approximately the size
>> of the zip file before extracting it?
>
> No. The RSS memory size will increase at most by the smaller of
> - compressed internal file size
> - max-filesize (the file is skipped if the value's exceeded)
> - max-scansize (the scan is aborted if the value's exceeded)

Can you clarify 'compressed internal file size' a bit further?
Assume:
* a zip file which contains 4 files of 100 MB.
* max-filesize set to 1GB
* max-scansize set to 4GB
* RSS before the scan +/- 115 MB.

What is the expected value of the RSS during the scan?
* somewhere between 100MB and 200MB or
* somewhere between 200MB and 300MB or
* somewhere between 300MB and 400MB or
* somewhere between 400MB and 500MB or
* somewhere between 500MB and 600MB?


Best regards,

Bram


_______________________________________________
http://lurker.clamav.net/list/clamav-devel.html
Please submit your patches to our Bugzilla: http://bugs.clamav.net
Re: scanning large zip files -> file loaded entirely into (RSS) memory during extraction? [ In reply to ]
On 12/12/11 21:31, Bram wrote:
> * a zip file which contains 4 files of 100 MB.
> * max-filesize set to 1GB
> * max-scansize set to 4GB

> What is the expected value of the RSS during the scan?

About MIN(zip_size(100MB), 1GB, 4GB), where zip_size() is the size of
the compressed stream.

Hope it's clear now.

Cheers,
--aCaB
_______________________________________________
http://lurker.clamav.net/list/clamav-devel.html
Please submit your patches to our Bugzilla: http://bugs.clamav.net
Re: scanning large zip files -> file loaded entirely into (RSS) memory during extraction? [ In reply to ]
Quoting aCaB <acabng@digitalfuture.it>:

> On 12/12/11 15:34, Bram wrote:
>> Is this implicit relation only for zip files or also for other
>> compression/archive formats? (I tested bzip2 and gzip and there the RSS
>> does not increase - or at least not that much)
>
> I'm sure it affects other formats too. But I can't tell which on the top
> of my mind, sorry.

Out of intrest I decided to test some other formats:

.zip -> increase in RSS memory
.bz2 -> no real increase in RSS memory
.gz -> no real increase in RSS memory
.rar -> no real increase in RSS memory
.cab -> no real increase in RSS memory
.arj -> no real increase in RSS memory
.tar -> no real increase in RSS memory
.tar (posix) -> no real increase in RSS memory
.tar (v7) -> no real increase in RSS memory
.7z -> increase in RSS memory


Best regards,

Bram


_______________________________________________
http://lurker.clamav.net/list/clamav-devel.html
Please submit your patches to our Bugzilla: http://bugs.clamav.net