Mailing List Archive

gzip.py: allow deterministic compression (without time stamp)
gzip compression, using class GzipFile from gzip.py, by default
inserts a timestamp to the compressed stream. If the optional
argument `mtime` is absent or None, then the current time is used [1].

This makes outputs non-deterministic, which can badly confuse
unsuspecting users: If you run "diff" over two outputs to see
whether they are unaffected by changes in your application,
then you would not expect that the *.gz binaries differ just
because they were created at different times.
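
The effect is easy to reproduce. The following is a minimal sketch, assuming Python 3.8 or later (where `gzip.compress` accepts an `mtime` argument). The payload and the two timestamp values are made up for illustration, and stand in for two runs at different times.

```python
import gzip

payload = b"identical application output\n"

# Two runs at different times, simulated with explicit mtime values
# instead of waiting for the clock to advance.
first = gzip.compress(payload, mtime=1_000_000_000)
second = gzip.compress(payload, mtime=1_000_000_001)

# Same payload, same compression level, yet the archives are compared
# byte for byte here, header included.
print(first == second)

# Both archives still decompress to the same data.
print(gzip.decompress(first) == gzip.decompress(second))
```

A byte-wise `diff` of the two files therefore reports a difference even though the decompressed content is the same.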

I'd propose to introduce a new constant `NO_TIMESTAMP` as
possible value of `mtime`.

Furthermore, if policy about API changes allows, I'd suggest
that `NO_TIMESTAMP` become the new default value for `mtime`.

How to proceed from here? Is this the kind of proposal that
has to go through a PEP?

- Joachim

[1]
https://github.com/python/cpython/blob/6f1e8ccffa5b1272a36a35405d3c4e4bbba0c082/Lib/gzip.py#L163
Re: gzip.py: allow deterministic compression (without time stamp)
Hi,

gzip.NO_TIMESTAMP sounds like a good idea. But I'm not sure about
changing the default behavior. I would prefer to leave it unchanged.

I guess that your problem is that you don't access gzip directly, but
use a higher-level API which doesn't give access to the timestamp
parameter, like the tarfile module?
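
For the tarfile case, one possible workaround is to create the `GzipFile` yourself and hand it to `tarfile` as the underlying file object. This is only a sketch: `build_tar_gz` is a hypothetical helper, the members are assumed to be small in-memory byte strings, and the member timestamps are pinned to 0 as well because the tar headers carry their own `mtime`.

```python
import gzip
import io
import tarfile


def build_tar_gz(files):
    """Return the bytes of a .tar.gz built from a {name: bytes} mapping."""
    buffer = io.BytesIO()
    # filename="" keeps the optional FNAME header field empty; mtime=0
    # fixes the gzip MTIME field.
    with gzip.GzipFile(filename="", fileobj=buffer, mode="wb", mtime=0) as gz:
        with tarfile.open(fileobj=gz, mode="w") as tar:
            # Sort the names so the member order does not depend on the caller.
            for name in sorted(files):
                info = tarfile.TarInfo(name)
                info.size = len(files[name])
                info.mtime = 0  # the tar member header has its own timestamp
                tar.addfile(info, io.BytesIO(files[name]))
    return buffer.getvalue()


archive_a = build_tar_gz({"a.txt": b"alpha\n", "b.txt": b"beta\n"})
archive_b = build_tar_gz({"b.txt": b"beta\n", "a.txt": b"alpha\n"})
print(archive_a == archive_b)
```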

If your use case is reproducible builds, you could follow the py_compile
behavior: the default depends on whether the SOURCE_DATE_EPOCH
environment variable is set:

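    import os
    from py_compile import PycInvalidationMode

    def _get_default_invalidation_mode():
        # Hash-based .pyc files when a reproducible build is requested
        if os.environ.get('SOURCE_DATE_EPOCH'):
            return PycInvalidationMode.CHECKED_HASH
        else:
            return PycInvalidationMode.TIMESTAMP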

Victor

On Wed, Apr 14, 2021 at 6:34 PM Joachim Wuttke <j.wuttke@fz-juelich.de> wrote:
>
> gzip compression, using class GzipFile from gzip.py, by default
> inserts a timestamp to the compressed stream. If the optional
> argument `mtime` is absent or None, then the current time is used [1].
>
> This makes outputs non-deterministic, which can badly confuse
> unsuspecting users: If you run "diff" over two outputs to see
> whether they are unaffected by changes in your application,
> then you would not expect that the *.gz binaries differ just
> because they were created at different times.
>
> I'd propose to introduce a new constant `NO_TIMESTAMP` as
> possible value of `mtime`.
>
> Furthermore, if policy about API changes allows, I'd suggest
> that `NO_TIMESTAMP` become the new default value for `mtime`.
>
> How to proceed from here? Is this the kind of proposal that
> has to go through a PEP?
>
> - Joachim
>
> [1]
> https://github.com/python/cpython/blob/6f1e8ccffa5b1272a36a35405d3c4e4bbba0c082/Lib/gzip.py#L163
>
> _______________________________________________
> Python-Dev mailing list -- python-dev@python.org
> To unsubscribe send an email to python-dev-leave@python.org
> https://mail.python.org/mailman3/lists/python-dev.python.org/
> Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/OTUGLATLYB736SAPPRWSSXWAKM5JHWZN/
> Code of Conduct: http://python.org/psf/codeofconduct/



--
Night gathers, and now my watch begins. It shall not end until my death.
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/5FLBLVY3DJFGIBMED57SASLS5ASZ65KF/

Re: gzip.py: allow deterministic compression (without time stamp)
On Wed, Apr 14, 2021 at 5:00 AM Joachim Wuttke <j.wuttke@fz-juelich.de>
wrote:

> gzip compression, using class GzipFile from gzip.py, by default
> inserts a timestamp to the compressed stream. If the optional
> argument `mtime` is absent or None, then the current time is used [1].
>
> This makes outputs non-deterministic, which can badly confuse
> unsuspecting users: If you run "diff" over two outputs to see
> whether they are unaffected by changes in your application,
> then you would not expect that the *.gz binaries differ just
> because they were created at different times.
>
> I'd propose to introduce a new constant `NO_TIMESTAMP` as
> possible value of `mtime`.
>
> Furthermore, if policy about API changes allows, I'd suggest
> that `NO_TIMESTAMP` become the new default value for `mtime`.
>
> How to proceed from here? Is this the kind of proposal that
> has to go through a PEP?
>

For something like this you would open an issue and see if a core developer
is intrigued enough to work with you to see the change occur; no PEP is
necessary.

-Brett


>
> - Joachim
>
> [1]
>
> https://github.com/python/cpython/blob/6f1e8ccffa5b1272a36a35405d3c4e4bbba0c082/Lib/gzip.py#L163
>
Re: gzip.py: allow deterministic compression (without time stamp)
The gzip specification [1] makes clear that the mtime field is always present.
The time is in Unix format, i.e., seconds since 00:00:00 GMT, Jan. 1, 1970.
MTIME = 0 means no time stamp is available. Hence no need for a
new constant NO_TIMESTAMP.
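
The header layout can be checked directly. The sketch below assumes Python 3.8 or later for the `mtime` argument of `gzip.compress`. It unpacks the fixed 10-byte header described in RFC 1952 and shows that the MTIME field is still present when `mtime=0` is passed, and simply holds zero. The payload is arbitrary.

```python
import gzip
import struct

blob = gzip.compress(b"example payload", mtime=0)

# Fixed 10-byte header from RFC 1952: ID1, ID2, CM, FLG, MTIME (4 bytes,
# little-endian), XFL, OS.
id1, id2, method, flags, mtime, xfl, os_code = struct.unpack("<BBBBIBB", blob[:10])

print(hex(id1), hex(id2))  # gzip magic number
print(method)              # compression method (8 means deflate)
print(mtime)               # the field is present and holds the value passed in
```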

So this is primarily a documentation problem [2]. For this, I will create a
pull request to gzip.py.

Joachim

[1] https://www.ietf.org/rfc/rfc1952.txt
[2] https://discuss.python.org/t/gzip-py-allow-deterministic-compression-without-time-stamp
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/LCPWERWIFG4AJS6DPHNEMOGBYJ2APDJ3/

Re: gzip.py: allow deterministic compression (without time stamp)
On Wed, 2021-04-14 at 18:06 +0000, j.wuttke@fz-juelich.de wrote:
> The gzip specification [1] makes clear that the mtime field is always present.
> The time is in Unix format, i.e., seconds since 00:00:00 GMT, Jan. 1, 1970.
> MTIME = 0 means no time stamp is available. Hence no need for a
> new constant NO_TIMESTAMP.
>
> So this is primarily a documentation problem [2]. For this, I will create a
> pull request to gzip.py.

I think having an extra constant (equal to 0) wouldn't hurt and could
make the code a bit more explicit.

--
Best regards,
Michał Górny


Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/KK7J5GZTAR5NKNNN7BGRHZYEP3CWZTPM/

Re: gzip.py: allow deterministic compression (without time stamp)
If so, then a better name than NO_TIMESTAMP should be chosen, as the gzip specification does not allow for no timestamp.
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/O3ENOZ5OAFYX6PBXMEDGS3RJ3OSPKNYC/

Re: gzip.py: allow deterministic compression (without time stamp)
DEFAULT_TIMESTAMP?

Kind regards,
Steve


On Wed, Apr 14, 2021 at 8:03 PM <j.wuttke@fz-juelich.de> wrote:

> If so, then a better name than NO_TIMESTAMP should be chosen, as the
> gzip specification does not allow for no timestamp.
Re: gzip.py: allow deterministic compression (without time stamp)
On Wed, 14 Apr 2021 21:38:11 +0100
Steve Holden <steve@holdenweb.com> wrote:
> DEFAULT_TIMESTAMP?

It's not a default timestamp, it's a placeholder value meaning "no
timestamp". The aforementioned RFC 1952 explicitly says: "MTIME = 0
means no time stamp is available".

So yes, it really means "no timestamp", regardless of the fact that
it's encoded as integer value 0.

Regards

Antoine.


Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/D7MERE7EKE4RINSGNOFK52MDLM7QTRKB/

Re: gzip.py: allow deterministic compression (without time stamp)
On 4/14/2021 8:00 AM, Joachim Wuttke wrote:

> Furthermore, if policy about API changes allows, I'd suggest
> that `NO_TIMESTAMP` become the new default value for `mtime`.

Changing defaults is a huge pain, which we mostly avoid.



--
Terry Jan Reedy

Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/3LA5EIND4O2JGF5SBQBWF3K3W2YPG3EG/

Re: gzip.py: allow deterministic compression (without time stamp)
If gzip is modified to use SOURCE_DATE_EPOCH timestamp, you get a
reproducible binary and you don't need to add a new constant ;-)
SOURCE_DATE_EPOCH is a timestamp: number of seconds since Unix Epoch
(January 1, 1970 at 00:00).

Victor


On Wed, Apr 14, 2021 at 8:15 PM <j.wuttke@fz-juelich.de> wrote:
>
> The gzip specification [1] makes clear that the mtime field is always present.
> The time is in Unix format, i.e., seconds since 00:00:00 GMT, Jan. 1, 1970.
> MTIME = 0 means no time stamp is available. Hence no need for a
> new constant NO_TIMESTAMP.
>
> So this is primarily a documentation problem [2]. For this, I will create a
> pull request to gzip.py.
>
> Joachim
>
> [1] https://www.ietf.org/rfc/rfc1952.txt
> [2] https://discuss.python.org/t/gzip-py-allow-deterministic-compression-without-time-stamp
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/FZTBVELC53IX6CGRCG53IGECJC3SANLE/

Re: gzip.py: allow deterministic compression (without time stamp)
On Thu, 15 Apr 2021 11:28:03 +0200
Victor Stinner <vstinner@python.org> wrote:
> If gzip is modified to use SOURCE_DATE_EPOCH timestamp, you get a
> reproducible binary and you don't need to add a new constant ;-)
> SOURCE_DATE_EPOCH is a timestamp: number of seconds since Unix Epoch
> (January 1, 1970 at 00:00).

Changing the behaviour of a stdlib module based on an environment
variable sounds a bit undesirable. That behaviour can be implemented
at a higher-level in application code (for example the tarfile or
zipfile command line).
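
As a rough illustration of that application-level approach (a sketch, not a proposal for the stdlib): the two helpers below, `mtime_from_environment` and `compress_for_build`, are hypothetical names. They read SOURCE_DATE_EPOCH in the caller's code and pass the result to `gzip.compress` through its existing `mtime` argument, which assumes Python 3.8 or later. The environment is passed in as a mapping so that the example does not depend on the real process environment.

```python
import gzip
import os


def mtime_from_environment(environ=os.environ):
    """Return SOURCE_DATE_EPOCH as an int, or None to keep gzip's default."""
    value = environ.get("SOURCE_DATE_EPOCH")
    if not value:
        return None  # gzip then falls back to the current time
    return int(value)


def compress_for_build(data, environ=os.environ):
    # The policy lives in the application; the gzip module is used unchanged.
    return gzip.compress(data, mtime=mtime_from_environment(environ))


# A made-up environment standing in for a reproducible-build setup.
fake_env = {"SOURCE_DATE_EPOCH": "1618400000"}
print(compress_for_build(b"data", fake_env) == compress_for_build(b"data", fake_env))
```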

Regards

Antoine.


Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/HPX62SVAMT6ELIKCDWE2JDTY4ATX2NKU/

Re: gzip.py: allow deterministic compression (without time stamp)
SOURCE_DATE_EPOCH is not a random variable, but is a *standardised*
environment variable:
https://reproducible-builds.org/docs/source-date-epoch/

This page explains the rationale. See the “Lying about the time” /
“violates language spec” section ;-)

More and more projects adopted it. As I wrote, the Python stdlib
already uses it in compileall and py_compile modules.

Victor

On Thu, Apr 15, 2021 at 12:34 PM Antoine Pitrou <antoine@python.org> wrote:
>
> On Thu, 15 Apr 2021 11:28:03 +0200
> Victor Stinner <vstinner@python.org> wrote:
> > If gzip is modified to use SOURCE_DATE_EPOCH timestamp, you get a
> > reproducible binary and you don't need to add a new constant ;-)
> > SOURCE_DATE_EPOCH is a timestamp: number of seconds since Unix Epoch
> > (January 1, 1970 at 00:00).
>
> Changing the behaviour of a stdlib module based on an environment
> variable sounds a bit undesirable. That behaviour can be implemented
> at a higher-level in application code (for example the tarfile or
> zipfile command line).
>
> Regards
>
> Antoine.
>
>
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/MHDIQZZXQJRBSXDMQKV4JYR6J5UU2OFH/

Re: gzip.py: allow deterministic compression (without time stamp)
On Thu, 15 Apr 2021 14:32:05 +0200
Victor Stinner <vstinner@python.org> wrote:
> SOURCE_DATE_EPOCH is not a random variable, but is a *standardised*
> environment variable:
> https://reproducible-builds.org/docs/source-date-epoch/

Standardized by whom? This is not a POSIX nor Windows standard at
least. Just because a Web page claims it is standardized doesn't mean
that it is.

> More and more projects adopted it. As I wrote, the Python stdlib
> already uses it in compileall and py_compile modules.

Those are higher-level modules. Doing it in the gzip module directly
sounds like the wrong place.

Regards

Antoine.


Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/BRLYIG7SMN2KJFFAOWMW6HQBIR3WQHNU/

Re: gzip.py: allow deterministic compression (without time stamp)
> On 15 Apr 2021, at 14:48, Antoine Pitrou <antoine@python.org> wrote:
>
> On Thu, 15 Apr 2021 14:32:05 +0200
> Victor Stinner <vstinner@python.org> wrote:
>> SOURCE_DATE_EPOCH is not a random variable, but is a *standardised*
>> environment variable:
>> https://reproducible-builds.org/docs/source-date-epoch/
>
> Standardized by whom? This is not a POSIX nor Windows standard at
> least. Just because a Web page claims it is standardized doesn't mean
> that it is.
>
>> More and more projects adopted it. As I wrote, the Python stdlib
>> already uses it in compileall and py_compile modules.
>
> Those are higher-level modules. Doing it in the gzip module directly
> sounds like the wrong place.

I agree. According to the documentation, this variable is meant to be used by build tools to accomplish reproducible builds. This should IMHO not affect lower-level APIs and libraries that aren’t build-related.

Ronald



Twitter / micro.blog: @ronaldoussoren
Blog: https://blog.ronaldoussoren.net/