Mailing List Archive

Tool to search in the source code of PyPI top 5000 projects
Hi,

I wrote two scripts based on INADA-san's work to (1) download the
source code of the PyPI top 5000 projects and (2) search for a regex
in these projects (compressed source archives).

You can use these tools if you work on an incompatible Python or C API
change to estimate how many projects are impacted.

The HPy project created a Git repository for a similar need (latest
update in June 2021):
https://github.com/hpyproject/top4000-pypi-packages

There are also online services for code search:

* GitHub: https://github.com/search
* https://grep.app/ (I didn't try it yet)
* Debian: https://codesearch.debian.net/


(1) Download

Script:
https://github.com/vstinner/misc/blob/main/cpython/download_pypi_top.py

Usage: download_pypi_top.py PATH

It uses this JSON file:
https://hugovk.github.io/top-pypi-packages/top-pypi-packages-30-days.min.json

From this service:
https://hugovk.github.io/top-pypi-packages/

As of December 1, out of 5000 projects, it downloaded only 4760
tarball and ZIP archives: I guess that the remaining 240 projects
don't provide a source archive. The archives take around 5.2 GB of
disk space.
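For the curious, the approach can be sketched in a few lines. This is a simplified sketch, not the linked script; the per-project download through PyPI's JSON API and the key names used below are my assumptions about the formats involved:

```python
import json
import pathlib
import urllib.request

TOP_JSON = ("https://hugovk.github.io/top-pypi-packages/"
            "top-pypi-packages-30-days.min.json")

def pick_sdist(release_files):
    """Return the URL of the first source archive in a list of release
    files, or None when the project publishes no sdist (the ~240 cases
    mentioned above)."""
    for entry in release_files:
        if entry.get("packagetype") == "sdist":
            return entry.get("url")
    return None

def download_top(dest_dir):
    """Download the latest sdist of every listed project into dest_dir."""
    dest = pathlib.Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(TOP_JSON) as resp:
        projects = [row["project"] for row in json.load(resp)["rows"]]
    for name in projects:
        # PyPI's JSON API lists the files of the latest release.
        with urllib.request.urlopen(f"https://pypi.org/pypi/{name}/json") as resp:
            url = pick_sdist(json.load(resp)["urls"])
        if url is not None:
            target = dest / url.rpartition("/")[2]
            if not target.exists():
                urllib.request.urlretrieve(url, target)
```

Only the latest release is fetched, which matters for the discussion further down in this thread.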


(2) Code search

First, I used the fast and nice "ripgrep" tool with the command "rg
-zl REGEX path/*.{zip,gz,bz2,tgz}" (-z searches inside compressed
archives). But it doesn't show the path inside the archive, and it
searches in files generated by Cython, which I wanted to ignore.

So I wrote a short Python script which decompresses tarball and ZIP
archives in memory and looks for a regex:
https://github.com/vstinner/misc/blob/main/cpython/search_pypi_top.py

Usage: search_pypi_top.py "REGEX" output_filename

The command line parsing is minimal, and pypi_dir =
"PYPI-2021-12-01-TOP-5000" is hardcoded :-D

It ignores files generated by Cython and .so binary files (Linux
dynamic libraries).
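The core idea can be sketched like this (a simplified sketch, not the linked script; the skip heuristics, i.e. the suffix list and the Cython banner check, are my assumptions):

```python
import re
import tarfile
import zipfile

# Heuristics: skip compiled extension modules and Cython-generated C files.
SKIP_SUFFIXES = (".so", ".pyd")
CYTHON_MARKER = b"/* Generated by Cython"

def iter_members(path):
    """Yield (member_name, content_bytes) for each file in a tarball or
    ZIP archive, decompressed in memory."""
    if path.endswith(".zip"):
        with zipfile.ZipFile(path) as zf:
            for info in zf.infolist():
                yield info.filename, zf.read(info)
    else:
        with tarfile.open(path) as tf:
            for member in tf:
                if member.isfile():
                    yield member.name, tf.extractfile(member).read()

def search_archive(path, pattern):
    """Return 'member:line' matches, skipping binaries and Cython output."""
    regex = re.compile(pattern.encode())
    matches = []
    for name, data in iter_members(path):
        if name.endswith(SKIP_SUFFIXES) or data.startswith(CYTHON_MARKER):
            continue
        for line in data.splitlines():
            if regex.search(line):
                matches.append(f"{name}:{line.decode(errors='replace')}")
    return matches
```

Unlike "rg -z", this reports the path *inside* the archive for each match.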

While "rg" is very fast, my script is very slow. But I don't mind:
once the regex is written, I only need to run the search once, so I
can wait 10-15 min ;-) I prefer to wait longer and get a more
accurate result. Also, there is room for enhancement, like running
multiple jobs in different processes or threads.
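As a sketch of that enhancement (with hypothetical helper names; a thread pool is often enough here, since zlib/bz2 decompression releases the GIL and the workers genuinely overlap):

```python
import glob
from concurrent.futures import ThreadPoolExecutor

def search_all(pypi_dir, pattern, search_one, jobs=8):
    """Run search_one(path, pattern) over every archive in pypi_dir
    using a pool of worker threads; return {path: matches}."""
    paths = sorted(glob.glob(f"{pypi_dir}/*"))
    results = {}
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        # Submit one task per archive, then collect results in order.
        futures = {pool.submit(search_one, p, pattern): p for p in paths}
        for future, path in futures.items():
            matches = future.result()
            if matches:
                results[path] = matches
    return results
```

For CPU-bound regex matching, swapping in ProcessPoolExecutor would be the next step.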

Victor
--
Night gathers, and now my watch begins. It shall not end until my death.
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-leave@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/WQVEHLRIVISPFMWSSX5N4TQPIUN2XS22/
Code of Conduct: http://python.org/psf/codeofconduct/
Re: Tool to search in the source code of PyPI top 5000 projects
On Fri, 2021-12-03 at 00:44 +0100, Victor Stinner wrote:
> I wrote two scripts based on the work of INADA-san's work to (1)
> download the source code of the PyPI top 5000 projects (2) search for
> a regex in these projects (compressed source archives).
>
> You can use these tools if you work on an incompatible Python or C API
> change to estimate how many projects are impacted.
>

Am I correct that this script downloads only the newest version of each
package? It might be worth adding a disclaimer: since many Python
packages pin their dependencies to old versions, you are quite likely to
miss impact on projects that use the deprecated API in old versions
which are still in use because of their reverse dependencies.

--
Best regards,
Michał Górny

Re: Tool to search in the source code of PyPI top 5000 projects
Hi,

You're correct that the download_pypi_top.py script only downloads the
latest version. I'm looking for projects impacted by incompatible
changes. If the latest version is fine, a project just has to update
its dependencies. If the latest version has an issue, it's very likely
that old versions are also affected.

Victor


Re: Tool to search in the source code of PyPI top 5000 projects
FTR, I don't consider the top projects on PyPI to be representative of
our user base, and *especially* not representative of people compiling
native modules.

This is not a good way to evaluate the impact of breaking changes.

It would be far safer to assume that every change is going to break
someone and evaluate:
* how will they find out that upgrading Python will cause them to break
* how will they find out where that break occurs
* how will they find out how to fix it
* how will they manage that fix across multiple releases
* how will we explain that upgrading and fixing breaks is better for
*them* than staying on the older version

This last one is particularly important, as many large organisations
(anecdotally) seem to have settled on Python 3.7 for a while now.
Inevitably, this means they're all going to be faced with a painful time
when it comes to an upgrade, and every little change we add on is going
to hurt more. Every extra thing that needs fixing is motivation to just
rewrite in a new language with more hype (and the promise of better
compatibility... which I won't comment specifically on, but I suspect
they won't manage it any better than us ;) ).

This is not the case for the top PyPI projects. They incrementally
update and crowdsource fixes, often from us. The pain is distributed to
the level of permanent background noise, which sucks in its own way, but
is ultimately not representative of much of our user base.

So by all means, use this tool for checking stuff. But it's not a
substitute for justifying every incompatible change in its own right.

/rant

Cheers,
Steve

Re: Tool to search in the source code of PyPI top 5000 projects
Hi Steve,

I completely agree with all you said ;-)

I will not debate here if incompatible changes are worth it or not,
this topic was discussed recently in another thread.


On Fri, Dec 3, 2021 at 2:56 PM Steve Dower <steve.dower@python.org> wrote:
> FTR, I don't consider the top projects on PyPI to be representative of
> our user base, and *especially* not representative of people compiling
> native modules.
>
> This is not a good way to evaluate the impact of breaking changes.

I don't claim that a code search over the PyPI top 5000 projects is
the only way, or an exhaustive way, to measure the impact of
incompatible changes. I'm only trying to advertise that *there is one
practical tool*, which is better than nothing.

In recent years, I saw many incompatible changes introduced in Python where:

* the number of impacted projects was not estimated: "release Python
and pray", in the hope that only a minority is impacted
* the change was not documented at all, or the documentation just said
that something is now broken; it was rare that practical instructions
explained how to port code while keeping support for old Python versions.

I have seen a clear improvement recently: better documentation, core
devs proactively fixing impacted projects, better communication
announcing incompatible changes in advance, and practical instructions
to port code without losing support for old Python versions.


> It would be far safer to assume that every change is going to break
> someone and evaluate:
> * how will they find out that upgrading Python will cause them to break
> * how will they find out where that break occurs
> * how will they find out how to fix it
> * how will they manage that fix across multiple releases
> * how will we explain that upgrading and fixing breaks is better for
> *them* than staying on the older version

In PEP 674, I wrote an explicit section "Port C extensions to Python 3.11":
https://www.python.org/dev/peps/pep-0674/#port-c-extensions-to-python-3-11

It doesn't cover all your questions, but it tries to reply to most of them.

I'm open to suggestions to enhance this section ;-) IMO it's good
practice for a PEP introducing incompatible changes to explain how to
port existing code, and this practice should become more common ;-)


> * how will we explain that upgrading and fixing breaks is better for
> *them* than staying on the older version

This part is always the hardest :-( Staying on an old Python version
is usually cheaper: no further development needed. There are still
companies using Python 2 nowadays. Don't underestimate the technical
debt and the cost of upgrading ;-)

For PEP 674, the promise is that updated C extensions should work
better with HPy and GraalPython. I'm not sure that's enough to
motivate developers to port their code.

IMO one important thing is the cost of upgrading a C extension. For
PEP 674, all you need to do is run a single command, once!

Done!

It reminds me of the Python 2 to Python 3 migration, before 2to3 and
six were usable and popular. The migration was super painful, so
nobody wanted to do it: everybody wanted to keep support for Python 2,
and adding Python 3 support alone brought no benefit in the short term
(Python 3-only features couldn't be used). People didn't migrate
because migrating code was dangerous, painful and complicated.

I'm now in favor of limiting the number of incompatible changes per
Python release, and of never again doing a Python 4 "break the world"
release. I prefer to spread incompatible changes over Python releases,
a few at a time :-)


> This last one is particularly important, as many large organisations
> (anecdotally) seem to have settled on Python 3.7 for a while now.
> Inevitably, this means they're all going to be faced with a painful time
> when it comes to an upgrade, and every little change we add on is going
> to hurt more. Every extra thing that needs fixing is motivation to just
> rewrite in a new language with more hype (and the promise of better
> compatibility... which I won't comment specifically on, but I suspect
> they won't manage it any better than us ;) ).

IMO we need to invest more time in developing tools that ease the
migration to newer Python versions, like:

* Python: https://github.com/asottile/pyupgrade
* C code: https://github.com/pythoncapi/pythoncapi_compat

Victor
Re: Tool to search in the source code of PyPI top 5000 projects
It's really great to see data being gathered on the impact of changes.

As we've already seen in this thread, there are many suggestions for how to
gather more data and thoughts on how the methodology might be enhanced --
and these suggestions are great -- but just having a means to gather some
important data is an excellent step.

Anecdotes and people's mental models of the Python ecosystem are certainly
valuable, but by themselves they don't provide a way to improve our joint
view of the consequences of particular changes.

As with unit tests and static analysis we should not expect such data
gathering to provide complete proof that a change is okay to make, but
having *some* quantitative data and the idea that we should pay attention
to this data are definitely a big step forward.

- Simon