Mailing List Archive

[issue42096] zipfile.is_zipfile incorrectly identifying a gzipped file as a zip archive
New submission from Alex Roussel <alexandredrr14@gmail.com>:

Hello,

I've come across an issue that seems similar to the false positives problem outlined in this ticket (https://bugs.python.org/issue28494), however this issue relates to a single gzipped json file which is incorrectly identified as a .zip archive because (I suspect) is_zipfile is mistaking bytes in the file's data for the ending bytes that correspond to a .zip archive.

I'm afraid I'm not well versed on the way is_zipfile 'seeks' the bytes of a file to compare its magic number, so I apologise if my description isn't completely accurate.

Here's my attempt at a summary of the problem:

My .json.gz file includes the correct magic number (1f8b) to identify it as a gzipped file, however when zipfile.is_zipfile is called on the filepath, it returns True.

I'm going to ask if I'm allowed to upload the file directly for you (it's work related), in the meantime I've included a head and tail of the file's hexdump below.

I am running Python 3.6.9 on Ubuntu 18.04.

----------

~/Téléchargements » python3
Python 3.6.9 (default, Oct 8 2020, 12:12:24)
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import zipfile
>>> zipfile.is_zipfile('2020-10-18-1602979256-http_get_7549.json.gz')
True

----------

~/Téléchargements » xxd 2020-10-18-1602979256-http_get_7549.json.gz | head
00000000: 1f8b 0800 7090 8b5f 0003 ecbd 5993 e338 ....p.._....Y..8
00000010: 922e fa7e 7fc6 bcea 588a 8b28 8ad7 ec3c ...~....X..(...<
00000020: 883b 292e e2be bc71 91b8 5394 b8f3 d8fd .;)....q..S.....
00000030: ef17 9422 3223 22ab 2ab3 7ba6 67a6 e774 ..."2#".*.{.g..t
00000040: 5bab 4209 6175 7738 3e77 3880 fff3 6f71 [..B.auw8>w8...oq
00000050: d005 fff6 fffe 9bc1 ea96 451d 2629 6712 ..........E.&)g.
00000060: 853e 4202 830d 3145 7221 6af7 fe11 3ae9 .>B...1Er!j...:.
00000070: 1c0b f9e6 2db1 c0bf 25ea 38a9 1479 f650 ....-...%.8..y.P

----------

~/Téléchargements » xxd 2020-10-18-1602979256-http_get_7549.json.gz | tail
01d492b0: 98a2 3d5c 25e1 c5b8 d9c5 3287 c5d8 3d7c ..=\%.....2...=|
01d492c0: 968f 3652 6fd4 4a0c 243b 166d 5640 97b5 ..6Ro.J.$;.mV@..
01d492d0: 9308 8376 fe17 1fac 0c90 0fdb b3e3 4e4a ...v..........NJ
01d492e0: 605c a870 5120 955b 6267 e318 406f e1e2 `\.pQ .[.bg..@o..
01d492f0: 2c50 12ec 5eb0 43cc 8d97 4daf 6017 3412 ,P..^.C...M.`.4.
01d49300: 3bdb 40ce 743f 7aa8 6ff9 f30d 796f f784 ;.@.t?z.o...yo..
01d49310: cec2 c45d b012 7e07 c70c dafd e16e fee2 ...]..~......n..
01d49320: c8a6 c01c 627f e004 f9c7 4770 e5e7 6bbf ....b.....Gp..k.
01d49330: 44d5 97bb ffdf ffe7 ff03 2263 b46f 3d5c D........."c.o=\
01d49340: cb04 ..

----------
components: Library (Lib)
messages: 379105
nosy: aroussel
priority: normal
severity: normal
status: open
title: zipfile.is_zipfile incorrectly identifying a gzipped file as a zip archive
type: behavior
versions: Python 3.6

_______________________________________
Python tracker <report@bugs.python.org>
<https://bugs.python.org/issue42096>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/list-python-bugs%40lists.gossamer-threads.com
[issue42096] zipfile.is_zipfile incorrectly identifying a gzipped file as a zip archive [ In reply to ]
Irit Katriel <iritkatriel@yahoo.com> added the comment:

Are you able to attach a file with which you see this problem?
Have you tried with newer Python versions?

----------
nosy: +iritkatriel

_______________________________________
Python tracker <report@bugs.python.org>
<https://bugs.python.org/issue42096>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/list-python-bugs%40lists.gossamer-threads.com
[issue42096] zipfile.is_zipfile incorrectly identifying a gzipped file as a zip archive [ In reply to ]
Alex Roussel <alexandredrr14@gmail.com> added the comment:

Hi Irit,

Thank you for the response, I'm afraid I'm not allowed to upload the file myself, however the file in question is '2020-10-18-1602979256-http_get_7549.json.gz', which is available at this link https://opendata.rapid7.com/sonar.http/?page=1. It becomes free to download one month after release so I'll make sure to come back on the 18th Nov and upload it once it is made freely available.

In the meantime, I've tested on python 3.8.5 and 3.9 and I get the same issue.

----------

_______________________________________
Python tracker <report@bugs.python.org>
<https://bugs.python.org/issue42096>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/list-python-bugs%40lists.gossamer-threads.com
[issue42096] zipfile.is_zipfile incorrectly identifying a gzipped file as a zip archive [ In reply to ]
Brian Kohan <bckohan@gmail.com> added the comment:

Hi all,

I'm experiencing the same issue. I took a look at the is_zipfile code - seems like its not checking the start of the file for the magic numbers, but looking deeper in. I presume because the magic numbers at the start are considered unreliable for some reason? Seems like this opens the check up to unlucky random false positives though.

Offending file:

https://www.dropbox.com/s/t2kafn6ek1m2huy/CHPI_Rinex.crx?dl=1

----------
nosy: +bckohan
versions: +Python 3.7, Python 3.8

_______________________________________
Python tracker <report@bugs.python.org>
<https://bugs.python.org/issue42096>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/list-python-bugs%40lists.gossamer-threads.com
[issue42096] zipfile.is_zipfile incorrectly identifying a gzipped file as a zip archive [ In reply to ]
Alex Roussel <alexandredrr14@gmail.com> added the comment:

The impression I got from reading https://bugs.python.org/issue28494 was that this was fixed in python 3.7 ? Or perhaps as you say it's just a matter of stumbling across the rare files that generate false positives.

----------

_______________________________________
Python tracker <report@bugs.python.org>
<https://bugs.python.org/issue42096>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/list-python-bugs%40lists.gossamer-threads.com
[issue42096] zipfile.is_zipfile incorrectly identifying a gzipped file as a zip archive [ In reply to ]
Change by Irit Katriel <iritkatriel@yahoo.com>:


----------
nosy: +gregory.p.smith, vstinner

_______________________________________
Python tracker <report@bugs.python.org>
<https://bugs.python.org/issue42096>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/list-python-bugs%40lists.gossamer-threads.com
[issue42096] zipfile.is_zipfile incorrectly identifying a gzipped file as a zip archive [ In reply to ]
Gregory P. Smith <greg@krypto.org> added the comment:

for what it's worth: false positives are always going to be possible in any such "magic" check as is_zipfile is.

we don't check the start of the file because zip files are defined by their end of file central directory which contains length information to determine where within the file the zip archive actually starts.

The issue28494 tests are a demonstration of this; It is somewhat common practice to append a zipfile to an executable of various forms for use as application specific data.

If you need more more reliable determination of file type not tied to a specific Python release, you might look at what the various file type sniffing magic libraries do for you, some examples include:
https://pypi.org/project/filetype/
https://pypi.org/project/puremagic/
https://pypi.org/project/python-magic/

I _can_ reproduce this issue with the testdata @bckohan provided.

But I can't promise there is anything to fix here. Even if we make the test slightly more robust by looking at another byte or two, it is always possible for files to appear to be a bunch of things at once based on small data signatures.

If nothing else we should reinforce in the documentation that is_zipfile is at best a guess. False means it is not as far as the zipfile module is concerned. True cannot guarantee that it is.

----------

_______________________________________
Python tracker <report@bugs.python.org>
<https://bugs.python.org/issue42096>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/list-python-bugs%40lists.gossamer-threads.com
[issue42096] zipfile.is_zipfile incorrectly identifying a gzipped file as a zip archive [ In reply to ]
Alex Roussel <alexandredrr14@gmail.com> added the comment:

OK understood. Thanks for the explanation, I wasn't aware of those and will take a look.

----------

_______________________________________
Python tracker <report@bugs.python.org>
<https://bugs.python.org/issue42096>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/list-python-bugs%40lists.gossamer-threads.com
[issue42096] zipfile.is_zipfile incorrectly identifying a gzipped file as a zip archive [ In reply to ]
STINNER Victor <vstinner@python.org> added the comment:

ZipFile.open() checks the first 4 bytes:

# Skip the file header:
fheader = zef_file.read(sizeFileHeader)
if len(fheader) != sizeFileHeader:
raise BadZipFile("Truncated file header")
fheader = struct.unpack(structFileHeader, fheader)
if fheader[_FH_SIGNATURE] != stringFileHeader:
raise BadZipFile("Bad magic number for file header")

But is_zipfile() does not. Code could be shared for that.

.gz and .zip files don't start by the same bytes, so this check should reduce the number of false positives.

--

You may have a look at the validate() methods of my old Hachoir project, they check a few bytes to check if a file looks a valid gzip or ZIP archive.

gzip:

https://github.com/vstinner/hachoir/blob/0f56883d7cea7082e784bfbdd2882e0f2dd2f34b/hachoir/parser/archive/gzip_parser.py#L51-L62

zip:

https://github.com/vstinner/hachoir/blob/0f56883d7cea7082e784bfbdd2882e0f2dd2f34b/hachoir/parser/archive/zip.py#L411-L430

----------

_______________________________________
Python tracker <report@bugs.python.org>
<https://bugs.python.org/issue42096>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/list-python-bugs%40lists.gossamer-threads.com
[issue42096] zipfile.is_zipfile incorrectly identifying a gzipped file as a zip archive [ In reply to ]
Change by Serhiy Storchaka <storchaka+cpython@gmail.com>:


----------
nosy: +serhiy.storchaka

_______________________________________
Python tracker <report@bugs.python.org>
<https://bugs.python.org/issue42096>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/list-python-bugs%40lists.gossamer-threads.com
[issue42096] zipfile.is_zipfile incorrectly identifying a gzipped file as a zip archive [ In reply to ]
Change by Serhiy Storchaka <storchaka+cpython@gmail.com>:


----------
assignee: -> serhiy.storchaka

_______________________________________
Python tracker <report@bugs.python.org>
<https://bugs.python.org/issue42096>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/list-python-bugs%40lists.gossamer-threads.com
[issue42096] zipfile.is_zipfile incorrectly identifying a gzipped file as a zip archive [ In reply to ]
Gregory P. Smith <greg@krypto.org> added the comment:

ZipFile.open() is not the code for opening a zip file. :)

That's the code for opening a file embedded within an already constructed mode='r' archive as done the ZipFile.__init__() constructor. By the time you've gotten to the open() method, you've loaded the entire unbounded in size central directory into memory as ZipInfo objects [constructor] and are checking signature of an individual file header you are attempting to read out of the archive.

Follow the ZipFile() constructor, it calls ZipFile._RealGetContents() which is the true start of parsing the archive. https://github.com/python/cpython/blob/master/Lib/zipfile.py#L1317

Sure, more and more steps can be done. But if you want to do that, you may as well just get rid of is_zipfile() entirely - a functions who's point is to be fast and not consume an amount of memory determined by the input data - and have people just call `zipfile.ZipFile(path_in_question, mode='r')` and live with the consequences of attempting to load and parse the whole thing. If that doesn't raise an exception, it is more likely to be a zip file. But that could still raise an exception when trying to open each of the files inside, so you'd need to iterate over this and open those and make sure they're valid.

is_zipfile() isn't a verify_zipfile_integrity() routine. Just a quick best guess.

is_zipfile() cannot be perfect and is not intended to be. There is always going to be yet another thing it _could_ try. It isn't worth chasing the impossible goal and making it not be fast.

Just update the is_zipfile() docs.

----------

_______________________________________
Python tracker <report@bugs.python.org>
<https://bugs.python.org/issue42096>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/list-python-bugs%40lists.gossamer-threads.com
[issue42096] zipfile.is_zipfile incorrectly identifying a gzipped file as a zip archive [ In reply to ]
Brian Kohan <bckohan@gmail.com> added the comment:

I concur with Gregory. It seems that the action here is to just make it apparent in the docs the very real possibility of false positives.

In my experience processing data from the wild, I see a pretty high rate of about 1/1000. I'm sure the probability is a function of the types of files I'm working with. But in any case, is_zipfile can't be made to be sufficient in and of itself for reliably identifying zip files. It still has utility in weeding out true negatives though. In my case I don't ever expect to see a self extracting file or a file compounded into an executable so I use the results of is_zipfile as well as a manual check of the magic bytes at the start. So far so good.

----------

_______________________________________
Python tracker <report@bugs.python.org>
<https://bugs.python.org/issue42096>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/list-python-bugs%40lists.gossamer-threads.com
[issue42096] zipfile.is_zipfile incorrectly identifying a gzipped file as a zip archive [ In reply to ]
STINNER Victor <vstinner@python.org> added the comment:

> functions who's point is to be fast and not consume an amount of memory determined by the input data

I proposed to read the first 4 bytes of the file. It's a fixed length.

----------

_______________________________________
Python tracker <report@bugs.python.org>
<https://bugs.python.org/issue42096>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/list-python-bugs%40lists.gossamer-threads.com
[issue42096] zipfile.is_zipfile incorrectly identifying a gzipped file as a zip archive [ In reply to ]
Gregory P. Smith <greg@krypto.org> added the comment:

The first four bytes of the file do not identify a zip file. A zip file is identified by the end of file central directory. Which you then must read entries of to determine where the start of the archive may be... often not at position zero.

----------

_______________________________________
Python tracker <report@bugs.python.org>
<https://bugs.python.org/issue42096>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/list-python-bugs%40lists.gossamer-threads.com
[issue42096] zipfile.is_zipfile incorrectly identifying a gzipped file as a zip archive [ In reply to ]
Serhiy Storchaka <storchaka+cpython@gmail.com> added the comment:

This is a duplicate of issue28494.

I concur with Gregory. What we can do -- improve the documentation.

----------
resolution: -> duplicate
stage: -> resolved
status: open -> closed
superseder: -> is_zipfile false positives

_______________________________________
Python tracker <report@bugs.python.org>
<https://bugs.python.org/issue42096>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/list-python-bugs%40lists.gossamer-threads.com