Hi,
I wrote two scripts, based on INADA-san's work, to (1) download the
source code of the PyPI top 5000 projects and (2) search for a regex
in these projects (compressed source archives).
If you work on an incompatible Python or C API change, you can use
these tools to estimate how many projects are impacted.
The HPy project created a Git repository for a similar need (latest
update in June 2021):
https://github.com/hpyproject/top4000-pypi-packages
There are also online services for code search:
* GitHub: https://github.com/search
* https://grep.app/ (I haven't tried it yet)
* Debian: https://codesearch.debian.net/
(1) Download
Script:
https://github.com/vstinner/misc/blob/main/cpython/download_pypi_top.py
Usage: download_pypi_top.py PATH
It uses this JSON file:
https://hugovk.github.io/top-pypi-packages/top-pypi-packages-30-days.min.json
From this service:
https://hugovk.github.io/top-pypi-packages/
As of December 1, out of 5000 projects, it only downloads 4760 tarball
and ZIP archives: I guess that the other 240 projects don't provide a
source archive. It takes around 5.2 GB of disk space.
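A minimal sketch of the download step, not the actual code of
download_pypi_top.py. It assumes that the JSON file has the layout
{"rows": [{"project": NAME, ...}, ...]} and that the PyPI JSON API
(https://pypi.org/pypi/NAME/json) lists release files under "urls"
with a "packagetype" field. The function names are hypothetical.

```python
import json
import os
import urllib.request

TOP_URL = ("https://hugovk.github.io/top-pypi-packages/"
           "top-pypi-packages-30-days.min.json")


def fetch_json(url):
    """Download a URL and decode its body as JSON."""
    with urllib.request.urlopen(url) as response:
        return json.load(response)


def project_names(top_data):
    """Return the project names listed in the top-pypi-packages document."""
    # Assumption: {"rows": [{"project": "name", ...}, ...]}
    return [row["project"] for row in top_data["rows"]]


def find_sdist_url(release_data):
    """Return the URL of the first source archive, or None if there is none."""
    # Assumption: {"urls": [{"packagetype": "sdist", "url": ...}, ...]}
    for entry in release_data.get("urls", []):
        if entry.get("packagetype") == "sdist":
            return entry["url"]
    return None


def download_sdist(name, dest_dir):
    """Download the source archive of a project; return its path or None."""
    url = find_sdist_url(fetch_json(f"https://pypi.org/pypi/{name}/json"))
    if url is None:
        # The project only publishes wheels: nothing to download.
        return None
    filename = os.path.join(dest_dir, os.path.basename(url))
    urllib.request.urlretrieve(url, filename)
    return filename
```

Calling download_sdist() on each name returned by
project_names(fetch_json(TOP_URL)) and counting the None results would
give the number of projects without a source archive.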
(2) Code search
First, I used the fast and nice "ripgrep" tool with the command "rg
-zl REGEX path/*.{zip,gz,bz2,tgz}" (-z searches in ZIP and tarball
archives). But it doesn't show the path inside the archive, and it
searches in files generated by Cython, whereas I wanted to ignore
these files.
So I wrote a short Python script which decompresses tarball and ZIP
archives in memory and looks for a regex:
https://github.com/vstinner/misc/blob/main/cpython/search_pypi_top.py
Usage: search_pypi_top.py "REGEX" output_filename
The command line parsing and pypi_dir = "PYPI-2021-12-01-TOP-5000"
are hardcoded :-D
It ignores files generated by Cython and .so binary files (Linux
dynamic libraries).
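A rough sketch of the in-memory search, not the actual code of
search_pypi_top.py. It assumes that the archive type can be guessed
from the ".zip" extension and that Cython-generated files can be
recognized by the "Generated by Cython" comment near the top of the
file. The function names are hypothetical.

```python
import re
import tarfile
import zipfile


def iter_archive_files(archive_path):
    """Yield (member_name, content_bytes) for each regular file."""
    if archive_path.endswith(".zip"):
        with zipfile.ZipFile(archive_path) as zf:
            for info in zf.infolist():
                if not info.is_dir():
                    yield info.filename, zf.read(info)
    else:
        # tarfile detects gzip/bzip2/xz compression transparently.
        with tarfile.open(archive_path) as tf:
            for member in tf:
                if member.isfile():
                    with tf.extractfile(member) as fp:
                        yield member.name, fp.read()


def search_archive(archive_path, regex):
    """Yield (path_in_archive, line_number, line) for each matching line."""
    pattern = re.compile(regex.encode())
    for name, content in iter_archive_files(archive_path):
        if name.endswith(".so"):
            # Skip Linux dynamic libraries.
            continue
        if b"Generated by Cython" in content[:1000]:
            # Assumption: Cython writes this marker in the file header.
            continue
        for lineno, line in enumerate(content.splitlines(), 1):
            if pattern.search(line):
                yield name, lineno, line.decode("utf-8", "replace")
```

Unlike "rg -zl", each result carries the path inside the archive and
the line number, since every member is decompressed and scanned
separately.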
While "rg" is very fast, my script is very slow. But I don't care:
once the regex is written, I only need to run the search once, and I
can wait 10-15 min ;-) I prefer to wait longer and get a more
accurate result. Also, there is room for enhancement, like running
multiple jobs in different processes or threads.
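One possible shape for the "multiple jobs" idea, as a sketch only:
each archive is an independent job submitted to a
concurrent.futures executor. To stay short, the hypothetical worker
count_matches() handles tarballs only and just counts matching lines.
A thread pool is the default here; passing ProcessPoolExecutor as
executor_class is the likely better choice for CPU-bound regex
matching, but I did not measure either.

```python
import re
import tarfile
from concurrent.futures import ThreadPoolExecutor


def count_matches(archive_path, regex):
    """Return (archive_path, number of matching lines) for one tarball."""
    pattern = re.compile(regex.encode())
    count = 0
    with tarfile.open(archive_path) as tf:
        for member in tf:
            if not member.isfile():
                continue
            with tf.extractfile(member) as fp:
                count += sum(1 for line in fp if pattern.search(line))
    return archive_path, count


def search_parallel(archive_paths, regex, jobs=4,
                    executor_class=ThreadPoolExecutor):
    """Search all archives with `jobs` workers; return {path: count}."""
    with executor_class(max_workers=jobs) as executor:
        futures = [executor.submit(count_matches, path, regex)
                   for path in archive_paths]
        # result() re-raises any exception raised in a worker.
        return dict(future.result() for future in futures)
```

With ProcessPoolExecutor, the worker must be a picklable top-level
function, which is why count_matches() takes the regex as a string
and compiles it inside the job.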
Victor
--
Night gathers, and now my watch begins. It shall not end until my death.
_______________________________________________
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-leave@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/WQVEHLRIVISPFMWSSX5N4TQPIUN2XS22/
Code of Conduct: http://python.org/psf/codeofconduct/