Mailing List Archive

Changing strings in files
I have a situation where in a directory tree I want to change a certain
string in all files where that string occurs.

My idea was to do

- os.scandir and for each file
- check if a file is a text file
- if it is not a text file skip that file
- change the string as often as it occurs in that file


What is the best way to check if a file is a text file? In a script I
could use the `file` command which is not ideal as I have to grep the
result. In Perl I could do -T file.

How to do best in Python?

--
Thanks,
Manfred

--
https://mail.python.org/mailman/listinfo/python-list
Re: Changing strings in files
Manfred Lotz <ml_news@posteo.de> writes:

> I have a situation where in a directory tree I want to change a certain
> string in all files where that string occurs.
>
> My idea was to do
>
> - os.scandir and for each file
> - check if a file is a text file
> - if it is not a text file skip that file
> - change the string as often as it occurs in that file
>
>
> What is the best way to check if a file is a text file? In a script I
> could use the `file` command which is not ideal as I have to grep the
> result. In Perl I could do -T file.
>
> How to do best in Python?

If you are on Linux and more interested in the result than the
programming exercise, I would suggest the following non-Python solution:

find . -type f -exec sed -i 's/foo/bar/g' {} \;

Having said that, I would be interested to know what the most compact
way of doing the same thing in Python might be.
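
Perhaps something along these lines would do it - untested, it assumes
the files are more or less UTF-8 and makes no attempt to skip binaries
(the surrogateescape handler just passes non-UTF-8 bytes through
unchanged, much as sed would):

import pathlib

for p in pathlib.Path(".").rglob("*"):
    if p.is_file():
        text = p.read_text(encoding="utf-8", errors="surrogateescape")
        p.write_text(text.replace("foo", "bar"),
                     encoding="utf-8", errors="surrogateescape")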

Cheers,

Loris

--
This signature is currently under construction.
--
https://mail.python.org/mailman/listinfo/python-list
Re: Changing strings in files
On 10Nov2020 07:24, Manfred Lotz <ml_news@posteo.de> wrote:
>I have a situation where in a directory tree I want to change a certain
>string in all files where that string occurs.
>
>My idea was to do
>
>- os.scandir and for each file

Use os.walk for trees. scandir does a single directory.

> - check if a file is a text file

This requires reading the entire file. You want to check that it
consists entirely of lines of text. In your expected text encoding -
these days UTF-8 is the common default, but getting this correct is
essential if you want to recognise text. So as a first cut, totally
untested:

for dirpath, dirnames, filenames in os.walk(top_dirpath):
    for filename in filenames:
        filename = os.path.join(dirpath, filename)
        is_text = False
        try:
            # expect utf-8, fail if non-utf-8 bytes encountered
            with open(filename, encoding='utf-8', errors='strict') as f:
                for lineno, line in enumerate(f, 1):
                    # ... other checks on each line of the file ...
                    if not line.endswith('\n'):
                        raise ValueError("line %d: no trailing newline" % lineno)
                    if not line[:-1].isprintable():
                        raise ValueError("line %d: not all printable" % lineno)
            # if we get here all checks passed, consider the file to be text
            is_text = True
        except Exception as e:
            print(filename, "not text", e)
        if not is_text:
            print("skip", filename)
            continue

You could add all sorts of other checks. "text" is a loosely defined
idea. But you could assert: all these lines decoded cleanly, so I can't
do much damage rewriting them.

> - if it is not a text file skip that file
> - change the string as often as it occurs in that file

You could, above, gather up all the lines in the file in a list. If you
get through, replace your string in the list and if anything was
changed, rewrite the file from the list of lines.
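
Roughly like this, untested; find_s and repl_s stand for whatever string
you are replacing and its replacement:

lines = []
with open(filename, encoding='utf-8', errors='strict') as f:
    for line in f:
        # ... the same per-line checks as above would go here ...
        lines.append(line)
new_lines = [line.replace(find_s, repl_s) for line in lines]
if new_lines != lines:
    # something changed, rewrite the file from the modified lines
    with open(filename, 'w', encoding='utf-8') as f:
        f.writelines(new_lines)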

>What is the best way to check if a file is a text file? In a script I
>could use the `file` command which is not ideal as I have to grep the
>result.

Not to mention relying on file, which (a) has a simple idea of text and
(b) only looks at the start of each file, not the whole content. Very
dodgy.

If you're really batch editing files, you could (a) put everything into
a VCS (eg hg or git) so you can roll back changes or (b) work on a copy
of your directory tree or (c) just print the "text" filenames to stdout
and pipe that into GNU parallel, invoking "sed -i.bak s/this/that/g" to
batch edit the checked files, keeping a backup.

Cheers,
Cameron Simpson <cs@cskk.id.au>
--
https://mail.python.org/mailman/listinfo/python-list
Re: Changing strings in files
On 10/11/2020 5:24 pm, Manfred Lotz wrote:
> I have a situation where in a directory tree I want to change a certain
> string in all files where that string occurs.
>
> My idea was to do
>
> - os.scandir and for each file
> - check if a file is a text file
> - if it is not a text file skip that file
> - change the string as often as it occurs in that file
>
>
> What is the best way to check if a file is a text file? In a script I
> could use the `file` command which is not ideal as I have to grep the
> result. In Perl I could do -T file.
>
> How to do best in Python?
Not necessarily best but I rolled this earlier. Much earlier. But I
still use it. I don't need to worry about text files or binary because I
specify .py files.



# -*- coding: utf-8 -*-
"""
Look for string 'find' and replace it with string 'repl' in all files in the
current directory and all sub-dirs.

If anything untoward happens, laboriously retrieve each original file from each
individual backup made by suffixing ~ to the filename.

If everything went well, make find == repl and run again to remove backups.

"""
import os

find = """# Copyright (c) 2019 Xyz Pty Ltd"""

repl = """# Copyright (c) 2020 Xyz Pty Ltd"""

ii = 0
kk = 0
for dirpath, dirnames, filenames in os.walk(".", topdown=True):
if "migrations" in dirpath:
continue
for filename in filenames:
if filename.endswith(".py"): # or filename.endswith(".txt"):
fil = os.path.join(dirpath, filename)
bak = "{0}~".format(fil)
if find == repl:
if os.path.isfile(bak):
ii += 1
os.remove(bak)
else:
with open(fil, "r") as src:
lines = src.readlines()
# make a backup file
with open(bak, "w", encoding="utf-8") as dst:
for line in lines:
dst.write(line)
with open(bak, "r") as src:
lines = src.readlines()
# re-write the original src
with open(fil, "w", encoding="utf-8") as dst:
kk += 1
for line in lines:
dst.write(line.replace(find, repl))
print("\nbak deletions = %s, tried = %s\n" % (ii, kk))








--
Signed email is an absolute defence against phishing. This email has
been signed with my private key. If you import my public key you can
automatically decrypt my signature and be sure it came from me. Just
ask and I'll send it to you. Your email software can handle signing.
Re: Changing strings in files
Loris Bennett <loris.bennett@fu-berlin.de> wrote:
> Having said that, I would be interested to know what the most compact
> way of doing the same thing in Python might be.
>
Here's my Python replace script:-


#!/usr/bin/python3
#
#
# String replacement utility
#
import os
import re
import sys
import shutil

def replaceString(s1, s2, fn):
tmpfn = "/tmp/replace.tmp"
tofn = fn
#
#
# copy the file to /tmp
#
shutil.copy(fn, tmpfn);
#
#
# Open the files
#
fromfd = open(tmpfn, 'r');
tofd = open(tofn, 'w');
#
#
# copy the file back where it came from, replacing the string on the way
for ln in fromfd:
ln = re.sub(s1, s2, ln)
tofd.write(ln)
tofd.close()

s1 = sys.argv[1]
s2 = sys.argv[2]

for fn in sys.argv[3:]:
if (os.path.isfile(fn)):
replaceString(s1, s2, fn)
else:
for srcPath, srcDirs, srcFiles in os.walk(fn):
for f in srcFiles:
replaceString(s1, s2, os.path.join(srcPath, f))

--
Chris Green
·
--
https://mail.python.org/mailman/listinfo/python-list
Re: Changing strings in files
On Tue, 10 Nov 2020 08:19:55 +0100
"Loris Bennett" <loris.bennett@fu-berlin.de> wrote:

> Manfred Lotz <ml_news@posteo.de> writes:
>
> > I have a situation where in a directory tree I want to change a
> > certain string in all files where that string occurs.
> >
> > My idea was to do
> >
> > - os.scandir and for each file
> > - check if a file is a text file
> > - if it is not a text file skip that file
> > - change the string as often as it occurs in that file
> >
> >
> > What is the best way to check if a file is a text file? In a script
> > I could use the `file` command which is not ideal as I have to grep
> > the result. In Perl I could do -T file.
> >
> > How to do best in Python?
>
> If you are on Linux and more interested in the result than the
> programming exercise, I would suggest the following non-Python
> solution:
>
> find . -type f -exec sed -i 's/foo/bar/g' {} \;
>

In my existing Perl script, which I wanted to migrate to Python, I used
`-T $file` and called sed.

I like the -T which I assume does some heuristics to tell me if a file
is a text file.

--
Manfred



--
https://mail.python.org/mailman/listinfo/python-list
Re: Changing strings in files
On Tue, 10 Nov 2020 18:37:54 +1100
Cameron Simpson <cs@cskk.id.au> wrote:

> On 10Nov2020 07:24, Manfred Lotz <ml_news@posteo.de> wrote:
> >I have a situation where in a directory tree I want to change a
> >certain string in all files where that string occurs.
> >
> >My idea was to do
> >
> >- os.scandir and for each file
>
> Use os.walk for trees. scandir does a single directory.
>

Perhaps better. I like to use os.scandir this way

from typing import Iterator
import os

def scantree(path: str) -> Iterator[os.DirEntry[str]]:
    """Recursively yield DirEntry objects (no directories)
    for a given directory.
    """
    for entry in os.scandir(path):
        if entry.is_dir(follow_symlinks=False):
            yield from scantree(entry.path)
        else:
            yield entry

Worked fine so far. I think I coded it this way because I wanted the
full path of the file the easy way.



> > - check if a file is a text file
>
> This requires reading the entire file. You want to check that it
> consists entirely of lines of text. In your expected text encoding -
> these days UTF-8 is the common default, but getting this correct is
> essential if you want to recognise text. So as a first cut, totally
> untested:
>
> ...

The reason I want to check if a file is a text file is that I don't
want to try replacing patterns in binary files (executable binaries,
archives, audio files aso).

Of course, to make this work nicely some heuristic check would be the
right thing (this is what the file command does). I am aware that a
heuristic check is not 100% but I think it is good enough.


--
Manfred

--
https://mail.python.org/mailman/listinfo/python-list
Re: Changing strings in files
On Tue, 10 Nov 2020 18:52:26 +1100
Mike Dewhirst <miked@dewhirst.com.au> wrote:

> On 10/11/2020 5:24 pm, Manfred Lotz wrote:
> > I have a situation where in a directory tree I want to change a
> > certain string in all files where that string occurs.
> >
> > My idea was to do
> >
> > - os.scandir and for each file
> > - check if a file is a text file
> > - if it is not a text file skip that file
> > - change the string as often as it occurs in that file
> >
> >
> > What is the best way to check if a file is a text file? In a script
> > I could use the `file` command which is not ideal as I have to grep
> > the result. In Perl I could do -T file.
> >
> > How to do best in Python?
> Not necessarily best but I rolled this earlier. Much earlier. But I
> still use it. I don't need to worry about text files or binary
> because I specify .py files.
>
>
>
> # -*- coding: utf-8 -*-
> """
> Look for string 'find' and replace it with string 'repl' in all files
> in the current directory and all sub-dirs.
>
> If anything untoward happens, laboriously retrieve each original file
> from each individual backup made by suffixing ~ to the filename.
>
> If everything went well, make find == repl and run again to remove
> backups.
>
> """
> import os
>
> find = """# Copyright (c) 2019 Xyz Pty Ltd"""
>
> repl = """# Copyright (c) 2020 Xyz Pty Ltd"""
>
> ii = 0
> kk = 0
> for dirpath, dirnames, filenames in os.walk(".", topdown=True):
> if "migrations" in dirpath:
> continue
> for filename in filenames:
> if filename.endswith(".py"): # or filename.endswith(".txt"):
> fil = os.path.join(dirpath, filename)
> bak = "{0}~".format(fil)
> if find == repl:
> if os.path.isfile(bak):
> ii += 1
> os.remove(bak)
> else:
> with open(fil, "r") as src:
> lines = src.readlines()
> # make a backup file
> with open(bak, "w", encoding="utf-8") as dst:
> for line in lines:
> dst.write(line)
> with open(bak, "r") as src:
> lines = src.readlines()
> # re-write the original src
> with open(fil, "w", encoding="utf-8") as dst:
> kk += 1
> for line in lines:
> dst.write(line.replace(find, repl))
> print("\nbak deletions = %s, tried = %s\n" % (ii, kk))
>
>

Thanks, will take a look.

--
Manfred

--
https://mail.python.org/mailman/listinfo/python-list
Re: Changing strings in files
Manfred Lotz <ml_news@posteo.de> writes:

> On Tue, 10 Nov 2020 08:19:55 +0100
> "Loris Bennett" <loris.bennett@fu-berlin.de> wrote:
>
>> Manfred Lotz <ml_news@posteo.de> writes:
>>
>> > I have a situation where in a directory tree I want to change a
>> > certain string in all files where that string occurs.
>> >
>> > My idea was to do
>> >
>> > - os.scandir and for each file
>> > - check if a file is a text file
>> > - if it is not a text file skip that file
>> > - change the string as often as it occurs in that file
>> >
>> >
>> > What is the best way to check if a file is a text file? In a script
>> > I could use the `file` command which is not ideal as I have to grep
>> > the result. In Perl I could do -T file.
>> >
>> > How to do best in Python?
>>
>> If you are on Linux and more interested in the result than the
>> programming exercise, I would suggest the following non-Python
>> solution:
>>
>> find . -type f -exec sed -i 's/foo/bar/g' {} \;
>>
>
> In my existing Perl script, which I wanted to migrate to Python, I used
> `-T $file` and called sed.
>
> I like the -T which I assume does some heuristics to tell me if a file
> is a text file.

Sorry, I missed the bit about text files. By '-T' I assume you mean
Perl's taint mode option. I am no security expert, but as I understand
it, taint mode does more than just check whether something is a text
file, although the "more" probably applies mainly to files which contain
Perl code.

Sorry also to bang on about non-Python solutions but you could do

find . -type f -exec grep -Iq . {} \; -and -exec sed -i 's/foo/bar/g' {} \;

i.e. let grep ignore binary files and quietly match all non-binary files.

--
This signature is currently under construction.
--
https://mail.python.org/mailman/listinfo/python-list
Re: Changing strings in files
On Tue, Nov 10, 2020 at 9:06 PM Manfred Lotz <ml_news@posteo.de> wrote:
> The reason I want to check if a file is a text file is that I don't
> want to try replacing patterns in binary files (executable binaries,
> archives, audio files aso).
>

I'd recommend two checks, then:

1) Can the file be decoded as UTF-8?
2) Does it contain any NULs?

The checks can be done in either order; you can check if the file
contains any b"\0" or you can check if the decoded text contains any
u"\0", since UTF-8 guarantees that those are the same.

If both those checks pass, it's still possible that the file isn't one
you want to edit, but it is highly likely to be text.
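
As a rough sketch, doing both checks on the raw bytes (the names and the
whole-file read are arbitrary choices):

def probably_text(path):
    with open(path, 'rb') as f:
        data = f.read()
    if b'\0' in data:            # check 2: any NULs?
        return False
    try:
        data.decode('utf-8')     # check 1: does it decode as UTF-8?
    except UnicodeDecodeError:
        return False
    return True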

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list
Re: Changing strings in files
On 10Nov2020 10:07, Manfred Lotz <ml_news@posteo.de> wrote:
>On Tue, 10 Nov 2020 18:37:54 +1100
>Cameron Simpson <cs@cskk.id.au> wrote:
>> Use os.walk for trees. scandir does a single directory.
>
>Perhaps better. I like to use os.scandir this way
>
>def scantree(path: str) -> Iterator[os.DirEntry[str]]:
>    """Recursively yield DirEntry objects (no directories)
>    for a given directory.
>    """
>    for entry in os.scandir(path):
>        if entry.is_dir(follow_symlinks=False):
>            yield from scantree(entry.path)
>        else:
>            yield entry
>
>Worked fine so far. I think I coded it this way because I wanted the
>full path of the file the easy way.

Yes, that's fine and easy to read. Note that this is effectively a
recursive call though, with the associated costs:

- a scandir (or listdir, whatever) has the directory open, and holds it
open while you scan the subdirectories; by contrast os.walk only opens
one directory at a time

- likewise, if you're maintaining data during a scan, that is held while
you process the subdirectories; with an os.walk you tend to do that
and release the memory before the next iteration of the main loop
(obviously, depending exactly what you're doing)

However, directory trees tend not to be particularly deep, and the depth
governs the excess state you're keeping around.

>> > - check if a file is a text file
>>
>> This requires reading the entire file. You want to check that it
>> consists entirely of lines of text. In your expected text encoding -
>> these days UTF-8 is the common default, but getting this correct is
>> essential if you want to recognise text. So as a first cut, totally
>> untested:
>>
>> ...
>
>The reason I want to check if a file is a text file is that I don't
>want to try replacing patterns in binary files (executable binaries,
>archives, audio files aso).

Exactly, which is why you should not trust, say, the "file" utility. It
scans only the opening part of the file. Great for rejecting files, but
not reliable for being _sure_ about the whole file being text when it
doesn't reject.

>Of course, to make this work nicely some heuristic check would be the
>right thing (this is what the file command does). I am aware that a
>heuristic check is not 100% but I think it is good enough.

Shrug. That is a risk you must evaluate yourself. I'm quite paranoid
about data loss, myself. If you've got backups or are working on copies
the risks are mitigated.

You could perhaps take a more targeted approach: do your target files
have distinctive file extensions (for example, all the .py files in a
source tree).

Cheers,
Cameron Simpson <cs@cskk.id.au>
--
https://mail.python.org/mailman/listinfo/python-list
Re: Changing strings in files
On 2020-11-10, Manfred Lotz <ml_news@posteo.de> wrote:

> What is the best way to check if a file is a text file?

Step 1: define "text file"

--
Grant

--
https://mail.python.org/mailman/listinfo/python-list
Re: Changing strings in files
In comp.lang.python, Loris Bennett <loris.bennett@fu-berlin.de> wrote:
> Manfred Lotz <ml_news@posteo.de> writes:
> > My idea was to do
> >
> > - os.scandir and for each file
> > - check if a file is a text file
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > - if it is not a text file skip that file
> > - change the string as often as it occurs in that file
> >
> > What is the best way to check if a file is a text file? In a script I
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > could use the `file` command which is not ideal as I have to grep the
> > result. In Perl I could do -T file.
> If you are on Linux and more interested in the result than the
> programming exercise, I would suggest the following non-Python solution:
>
> find . -type f -exec sed -i 's/foo/bar/g' {} \;

That 100% fails the "check if a text file" part.

> Having said that, I would be interested to know what the most compact
> way of doing the same thing in Python might be.

Read first N lines of a file. If all parse as valid UTF-8, consider it text.
That's probably the rough method file(1) and Perl's -T use. (In
particular allow no nulls. Maybe allow ISO-8859-1.)
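
As an untested sketch (N, the strict UTF-8 requirement and the no-NULs
rule are all tweakable):

def looks_like_text(path, n=100):
    try:
        with open(path, encoding='utf-8', errors='strict') as f:
            for lineno, line in enumerate(f, 1):
                if '\0' in line:      # no NULs allowed
                    return False
                if lineno >= n:       # only sample the first n lines
                    break
    except (UnicodeDecodeError, OSError):
        return False
    return True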

Elijah
------
pretty no nulls is file(1) check
--
https://mail.python.org/mailman/listinfo/python-list
Re: Changing strings in files
On Wed, Nov 11, 2020 at 5:36 AM Eli the Bearded <*@eli.users.panix.com> wrote:
> Read first N lines of a file. If all parse as valid UTF-8, consider it text.
> That's probably the rough method file(1) and Perl's -T use. (In
> particular allow no nulls. Maybe allow ISO-8859-1.)
>

ISO-8859-1 is basically "allow any byte values", so all you'd be doing
is checking for a lack of NUL bytes. I'd definitely recommend
mandating UTF-8, as that's a very good way of recognizing valid text,
but if you can't do that then the simple NUL check is all you really
need.

And let's be honest here, there aren't THAT many binary files that
manage to contain a total of zero NULs, so you won't get many false
hits :)

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list
Re: Changing strings in files
On Tue, 10 Nov 2020 10:57:05 +0100
"Loris Bennett" <loris.bennett@fu-berlin.de> wrote:

> Manfred Lotz <ml_news@posteo.de> writes:
>
> > On Tue, 10 Nov 2020 08:19:55 +0100
> > "Loris Bennett" <loris.bennett@fu-berlin.de> wrote:
> >
> >> Manfred Lotz <ml_news@posteo.de> writes:
> >>
> >> > I have a situation where in a directory tree I want to change a
> >> > certain string in all files where that string occurs.
> >> >
> >> > My idea was to do
> >> >
> >> > - os.scandir and for each file
> >> > - check if a file is a text file
> >> > - if it is not a text file skip that file
> >> > - change the string as often as it occurs in that file
> >> >
> >> >
> >> > What is the best way to check if a file is a text file? In a
> >> > script I could use the `file` command which is not ideal as I
> >> > have to grep the result. In Perl I could do -T file.
> >> >
> >> > How to do best in Python?
> >>
> >> If you are on Linux and more interested in the result than the
> >> programming exercise, I would suggest the following non-Python
> >> solution:
> >>
> >> find . -type f -exec sed -i 's/foo/bar/g' {} \;
> >>
> >
> > In my existing Perl script, which I wanted to migrate to Python, I
> > used `-T $file` and called sed.
> >
> > I like the -T which I assume does some heuristics to tell me if a
> > file is a text file.
>
> Sorry, I missed the bit about text files. By '-T' I assume you mean
> Perl's taint mode option. I am no security expert, but as I
> understand it, taint mode does more than just check whether something
> is a text file, although the "more" probably applies mainly to files
> which contain Perl code.
>
> Sorry also to bang on about non-Python solutions but you could do
>
> find . -type f -exec grep -Iq . {} \; -and -exec sed -i
> 's/foo/bar/g' {} \;
>
> i.e. let grep ignore binary files and quietly match all non-binary
> files.
>

Very nice. Thanks for this.

--
Manfred

--
https://mail.python.org/mailman/listinfo/python-list
Re: Changing strings in files
On Tue, 10 Nov 2020 22:08:54 +1100
Cameron Simpson <cs@cskk.id.au> wrote:

> On 10Nov2020 10:07, Manfred Lotz <ml_news@posteo.de> wrote:
> >On Tue, 10 Nov 2020 18:37:54 +1100
> >Cameron Simpson <cs@cskk.id.au> wrote:
> >> Use os.walk for trees. scandir does a single directory.
> >
> >Perhaps better. I like to use os.scandir this way
> >
> >def scantree(path: str) -> Iterator[os.DirEntry[str]]:
> >    """Recursively yield DirEntry objects (no directories)
> >    for a given directory.
> >    """
> >    for entry in os.scandir(path):
> >        if entry.is_dir(follow_symlinks=False):
> >            yield from scantree(entry.path)
> >        else:
> >            yield entry
> >
> >Worked fine so far. I think I coded it this way because I wanted the
> >full path of the file the easy way.
>
> Yes, that's fine and easy to read. Note that this is effectively a
> recursive call though, with the associated costs:
>
> - a scandir (or listdir, whatever) has the directory open, and holds
> it open while you scan the subdirectories; by contrast os.walk only
> opens one directory at a time
>
> - likewise, if you're maintaining data during a scan, that is held
> while you process the subdirectories; with an os.walk you tend to do
> that and release the memory before the next iteration of the main
> loop (obviously, depending exactly what you're doing)
>
> However, directory trees tend not to be particularly deep, and the
> depth governs the excess state you're keeping around.
>

Very interesting information. Thanks a lot for this. I will take a
closer look at os.walk.

> >> > - check if a file is a text file
> >>
> >> This requires reading the entire file. You want to check that it
> >> consists entirely of lines of text. In your expected text encoding
> >> - these days UTF-8 is the common default, but getting this correct
> >> is essential if you want to recognise text. So as a first cut,
> >> totally untested:
> >>
> >> ...
> >
> >The reason I want to check if a file is a text file is that I don't
> >want to try replacing patterns in binary files (executable binaries,
> >archives, audio files aso).
>
> Exactly, which is why you should not trust, say, the "file" utility.
> It scans only the opening part of the file. Great for rejecting
> files, but not reliable for being _sure_ about the whole file being
> text when it doesn't reject.
>
> >Of course, to make this work nicely some heuristic check would be the
> >right thing (this is what the file command does). I am aware that a
> >heuristic check is not 100% but I think it is good enough.
>
> Shrug. That is a risk you must evaluate yourself. I'm quite paranoid
> about data loss, myself. If you've got backups or are working on
> copies the risks are mitigated.
>
> You could perhaps take a more targeted approach: do your target files
> have distinctive file extensions (for example, all the .py files in a
> source tree).
>

There are some distinctive file extensions. The reason I am satisfied
with heuristics is that the string to change is pretty long, so there
is no real danger if I try to change it in a binary file, because that
string is not going to be found in binary files.

The idea to skip binary files was simply to save time.

--
Manfred








--
https://mail.python.org/mailman/listinfo/python-list
Re: Changing strings in files
In comp.lang.python, Chris Angelico <rosuav@gmail.com> wrote:
> Eli the Bearded <*@eli.users.panix.com> wrote:
>> Read first N lines of a file. If all parse as valid UTF-8, consider it text.
>> That's probably the rough method file(1) and Perl's -T use. (In
>> particular allow no nulls. Maybe allow ISO-8859-1.)
> ISO-8859-1 is basically "allow any byte values", so all you'd be doing
> is checking for a lack of NUL bytes.

ISO-8859-1, unlike similar Windows "charset"s, does not use octets
128-190. Charsets like Windows CP-1252 are nastier, because they do
use that range. Usage of 1-31 will be pretty restricted in either,
probably not more than tab, linefeed, and carriage return.

> I'd definitely recommend
> mandating UTF-8, as that's a very good way of recognizing valid text,
> but if you can't do that then the simple NUL check is all you really
> need.

Dealing with all UTF-8 is my preference, too.

> And let's be honest here, there aren't THAT many binary files that
> manage to contain a total of zero NULs, so you won't get many false
> hits :)

There's always the issue of how much to read before deciding.

Elijah
------
ASCII with embedded escapes? could be a VT100 animation
--
https://mail.python.org/mailman/listinfo/python-list
Re: Changing strings in files
> On 10 Nov 2020, at 19:30, Eli the Bearded <*@eli.users.panix.com> wrote:
>
> In comp.lang.python, Chris Angelico <rosuav@gmail.com> wrote:
>> Eli the Bearded <*@eli.users.panix.com> wrote:
>>> Read first N lines of a file. If all parse as valid UTF-8, consider it text.
>>> That's probably the rough method file(1) and Perl's -T use. (In
>>> particular allow no nulls. Maybe allow ISO-8859-1.)
>> ISO-8859-1 is basically "allow any byte values", so all you'd be doing
>> is checking for a lack of NUL bytes.

A NUL check does not work for Windows UTF-16 files.

>
> ISO-8859-1, unlike similar Windows "charset"s, does not use octets
> 128-190. Charsets like Windows CP-1252 are nastier, because they do
> use that range. Usage of 1-31 will be pretty restricted in either,
> probably not more than tab, linefeed, and carriage return.

Who told you that?

The C1 control range is used in the ISO 8-bit char sets.

One optimisation for the VT100 family of terminals is to send CSI as the
single byte 0x9b rather than as 0x1b '['. Back in the days of 9600 bps or
1200 bps connections that was worth the effort.

>
>> I'd definitely recommend
>> mandating UTF-8, as that's a very good way of recognizing valid text,
>> but if you can't do that then the simple NUL check is all you really
>> need.
>
> Dealing with all UTF-8 is my preference, too.
>
>> And let's be honest here, there aren't THAT many binary files that
>> manage to contain a total of zero NULs, so you won't get many false
>> hits :)

There is the famous EICAR virus test file, which is a valid 8086 DOS
program made up entirely of printable ASCII.

>
> There's always the issue of how much to read before deciding.

Simply read it all; after all, you have to scan the whole file to do the replacement.

>
> Elijah
> ------
> ASCII with embedded escapes? could be a VT100 animation

The output of software that colours its logs maybe?

Barry

> --
> https://mail.python.org/mailman/listinfo/python-list
>

--
https://mail.python.org/mailman/listinfo/python-list
Re: Changing strings in files
On Wed, Nov 11, 2020 at 6:36 AM Eli the Bearded <*@eli.users.panix.com> wrote:
>
> In comp.lang.python, Chris Angelico <rosuav@gmail.com> wrote:
> > Eli the Bearded <*@eli.users.panix.com> wrote:
> >> Read first N lines of a file. If all parse as valid UTF-8, consider it text.
> >> That's probably the rough method file(1) and Perl's -T use. (In
> >> particular allow no nulls. Maybe allow ISO-8859-1.)
> > ISO-8859-1 is basically "allow any byte values", so all you'd be doing
> > is checking for a lack of NUL bytes.
>
> ISO-8859-1, unlike similar Windows "charset"s, does not use octets
> 128-190. Charsets like Windows CP-1252 are nastier, because they do
> use that range. Usage of 1-31 will be pretty restricted in either,
> probably not more than tab, linefeed, and carriage return.

Define "does not use", though. You can decode those bytes just fine:

>>> bytes(range(256)).decode("ISO-8859-1")
'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f
!"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0¡¢£¤¥¦§¨©ª«¬\xad®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ'

This is especially true of \x01 to \x1F, since they are most
definitely defined, even though they aren't commonly used.

> > I'd definitely recommend
> > mandating UTF-8, as that's a very good way of recognizing valid text,
> > but if you can't do that then the simple NUL check is all you really
> > need.
>
> Dealing with all UTF-8 is my preference, too.
>
> > And let's be honest here, there aren't THAT many binary files that
> > manage to contain a total of zero NULs, so you won't get many false
> > hits :)
>
> There's always the issue of how much to read before deciding.
>

Right; but a lot of binary file formats are going to include
structured data that will frequently include a NUL byte. For instance,
a PNG file (after the eight-byte signature) consists of chunks, where each
chunk starts with a four-byte length; and the first chunk (IHDR) is
always a very short one, meaning that its length field will
have three leading NULs. So a typical PNG file will have a NUL
probably as the ninth byte of the file. Other file formats will be
similar, or even better; an ELF binary actually has a sixteen byte
header of which the last few bytes are reserved for future expansion
and must be zeroes, so that's an even stronger guarantee.

If the main job of the program, as in this situation, is to read the
entire file, I would probably have it read in the first 1KB or 16KB or
thereabouts, see if that has any NUL bytes, and if not, proceed to
read in the rest of the file. But depending on the situation, I might
actually have a hard limit on the file size (say, "any file over 1GB
isn't what I'm looking for"), so that would reduce the risks too.

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list
Re: Changing strings in files
On Wed, Nov 11, 2020 at 6:52 AM Barry Scott <barry@barrys-emacs.org> wrote:
>
>
>
> > On 10 Nov 2020, at 19:30, Eli the Bearded <*@eli.users.panix.com> wrote:
> >
> > In comp.lang.python, Chris Angelico <rosuav@gmail.com> wrote:
> >> Eli the Bearded <*@eli.users.panix.com> wrote:
> >>> Read first N lines of a file. If all parse as valid UTF-8, consider it text.
> >>> That's probably the rough method file(1) and Perl's -T use. (In
> >>> particular allow no nulls. Maybe allow ISO-8859-1.)
> >> ISO-8859-1 is basically "allow any byte values", so all you'd be doing
> >> is checking for a lack of NUL bytes.
>
> NUL check does not work for windows UTF-16 files.

Yeah, so if you're expecting UTF-16, you would have to do the decode
to text first, and the check for NULs second. One of the big
advantages of UTF-8 is that you can do the checks in either order.

> >> And let's be honest here, there aren't THAT many binary files that
> >> manage to contain a total of zero NULs, so you won't get many false
> >> hits :)
>
> There is the famous EICAR virus test file that is a valid 8086 program for
> DOS that is printing ASCII.

Yes. I didn't say "none", I said "aren't many" :) There's
fundamentally no way to know whether something is or isn't text based
on its contents alone; raw audio data might just happen to look like
an RFC822 email, it's just really really unlikely.

> > There's always the issue of how much to read before deciding.
>
> Simple read it all, after all you have to scan all the file to do the replacement.

If the script's assuming it'll mostly work on small text files, it
might be very annoying to suddenly read in a 4GB blob of video file
just to find out that it's not text. But since we're talking
heuristics here, reading in a small chunk of the file is going to give
an extremely high chance of recognizing a binary file, with a
relatively small cost.

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list
Re: Changing strings in files
On 11Nov2020 07:25, Chris Angelico <rosuav@gmail.com> wrote:
>If the main job of the program, as in this situation, is to read the
>entire file, I would probably have it read in the first 1KB or 16KB or
>thereabouts, see if that has any NUL bytes, and if not, proceed to
>read in the rest of the file. But depending on the situation, I might
>actually have a hard limit on the file size (say, "any file over 1GB
>isn't what I'm looking for"), so that would reduce the risks too.

You could shoehorn my suggested code into doing this efficiently.

It had a loop body like this:

is_text = False
try:
    # expect utf-8, fail if non-utf-8 bytes encountered
    with open(filename, encoding='utf-8', errors='strict') as f:
        for lineno, line in enumerate(f, 1):
            # ... other checks on each line of the file ...
            if not line.endswith('\n'):
                raise ValueError("line %d: no trailing newline" % lineno)
            if not line[:-1].isprintable():
                raise ValueError("line %d: not all printable" % lineno)
    # if we get here all checks passed, consider the file to be text
    is_text = True
except Exception as e:
    print(filename, "not text", e)
if not is_text:
    print("skip", filename)
    continue

which scans the entire file to see if it is all text (criteria to be
changed to suit the user, but I was going for clean strict utf-8 decode,
all chars "printable"). Since we're doing that, we could accumulate the
lines as we went and make the replacement in memory. If we get all the
way out the bottom, rewrite the file.

If memory is a concern, we could copy modified lines to a temporary
file, and copy back if everything was good (or not if we make no
replacements).
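
An untested sketch of that variant, using a plain literal replacement
and a temporary file made in the same directory so the final rename
stays on one filesystem:

import os
import tempfile

def rewrite_if_changed(path, old, new, encoding='utf-8'):
    changed = False
    dirpath = os.path.dirname(path) or '.'
    with open(path, encoding=encoding, errors='strict') as src, \
         tempfile.NamedTemporaryFile('w', encoding=encoding,
                                     dir=dirpath, delete=False) as tmp:
        for line in src:
            new_line = line.replace(old, new)
            changed = changed or (new_line != line)
            tmp.write(new_line)
    if changed:
        # note: this does not preserve the original file's permissions
        os.replace(tmp.name, path)
    else:
        os.remove(tmp.name)   # nothing changed, discard the copy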

Cheers,
Cameron Simpson <cs@cskk.id.au>
--
https://mail.python.org/mailman/listinfo/python-list
Re: Changing strings in files
On 11Nov2020 07:30, Chris Angelico <rosuav@gmail.com> wrote:
>If the script's assuming it'll mostly work on small text files, it
>might be very annoying to suddenly read in a 4GB blob of video file
>just to find out that it's not text.

You can abort as soon as the decode fails. Which will usually be pretty
early for video.

Cheers,
Cameron Simpson <cs@cskk.id.au>
--
https://mail.python.org/mailman/listinfo/python-list
Re: Changing strings in files
On Wed, Nov 11, 2020 at 10:00 AM Cameron Simpson <cs@cskk.id.au> wrote:
>
> On 11Nov2020 07:30, Chris Angelico <rosuav@gmail.com> wrote:
> >If the script's assuming it'll mostly work on small text files, it
> >might be very annoying to suddenly read in a 4GB blob of video file
> >just to find out that it's not text.
>
> You can abort as soon as the decode fails. Which will usually be pretty
> early for video.
>
Only if you haven't already read it from the file system, which is the
point of the early abort :)

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list
Re: Changing strings in files
10.11.20 22:40, Dennis Lee Bieber wrote:
> Testing for extension in a list of exclusions would be much faster than
> scanning the contents of a file, and the few that do get through would have
> to be scanned anyway.

Then the simplest method should work: read the first 512 bytes and check
if they contain b'\0'. The chance that a random sequence of bytes does not
contain NUL is (1-1/256)**512 ≈ 0.13. So this will filter out 87% of
binary files. Likely more, because binary files usually have some
structure, and reserve a fixed size for integers. Most integers are much
less than the maximal value, so the higher bits and bytes are zeroes. You
can also decrease the probability of false results by increasing the
size of the tested data or by testing a few other byte values (b'\1', b'\2',
etc). Anything more sophisticated is just a waste of your time.
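
That is, roughly:

def probably_binary(path):
    with open(path, 'rb') as f:
        return b'\0' in f.read(512)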

--
https://mail.python.org/mailman/listinfo/python-list
Re: Changing strings in files
10.11.20 11:07, Manfred Lotz wrote:
> Perhaps better. I like to use os.scandir this way
>
> def scantree(path: str) -> Iterator[os.DirEntry[str]]:
>     """Recursively yield DirEntry objects (no directories)
>     for a given directory.
>     """
>     for entry in os.scandir(path):
>         if entry.is_dir(follow_symlinks=False):
>             yield from scantree(entry.path)
>         else:
>             yield entry
>
> Worked fine so far. I think I coded it this way because I wanted the
> full path of the file the easy way.

If this simple code works for you, that's fine, use it. But there are
some pitfalls:

1. If you add or remove entries while iterating the directory, the
behavior is not specified. It depends on the OS and file system. If you
do not modify directories, all is okay.

2. As Cameron said, this method holds open file descriptors for all
parent directories. It is usually not a problem, because the maximal
depth of directories is usually less than the limit of open files. But
it can be a problem if you process several trees in parallel.

A common mistake is also not handling symlinks, but your code handles
them correctly.

--
https://mail.python.org/mailman/listinfo/python-list