Mailing List Archive

Re: tail
On Sun, 8 May 2022 at 04:37, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
>
> On Sat, 7 May 2022 at 19:02, MRAB <python@mrabarnett.plus.com> wrote:
> >
> > On 2022-05-07 17:28, Marco Sulla wrote:
> > > On Sat, 7 May 2022 at 16:08, Barry <barry@barrys-emacs.org> wrote:
> > >> You need to handle the file in bin mode and do the handling of line endings and encodings yourself. It’s not that hard for the cases you wanted.
> > >
> > >>>> "\n".encode("utf-16")
> > > b'\xff\xfe\n\x00'
> > >>>> "".encode("utf-16")
> > > b'\xff\xfe'
> > >>>> "a\nb".encode("utf-16")
> > > b'\xff\xfea\x00\n\x00b\x00'
> > >>>> "\n".encode("utf-16").lstrip("".encode("utf-16"))
> > > b'\n\x00'
> > >
> > > Can I use the last trick to get the encoding of a LF or a CR in any encoding?
> >
> > In the case of UTF-16, it's 2 bytes per code unit, but those 2 bytes
> > could be little-endian or big-endian.
> >
> > As you didn't specify which you wanted, it defaulted to little-endian
> > and added a BOM (U+FEFF).
> >
> > If you specify which endianness you want with "utf-16le" or "utf-16be",
> > it won't add the BOM:
> >
> > >>> # Little-endian.
> > >>> "\n".encode("utf-16le")
> > b'\n\x00'
> > >>> # Big-endian.
> > >>> "\n".encode("utf-16be")
> > b'\x00\n'
>
> Well, ok, but I need a generic method to get LF and CR for any
> encoding a user can input.
> Do you think that
>
> "\n".encode(encoding).lstrip("".encode(encoding))
>
> is good for any encoding?

No, because it is only useful for stateless encodings. Any encoding
which uses "shift bytes" that cause subsequent bytes to be interpreted
differently will simply not work with this naive technique. Also,
you're assuming that the byte(s) you get from encoding LF will *only*
represent LF, which is also not true for a number of other encodings -
they might always encode LF to the same byte sequence, but could use
that same byte sequence as part of a multi-byte encoding. So, no, for
arbitrarily chosen encodings, this is not dependable.
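
To illustrate (an example of my own, not from the original mail): even
the full two-byte UTF-16-LE encoding of LF can show up where there is
no newline at all, straddling two adjacent code units:

data = "\u0a41\u0100".encode("utf-16le")  # two arbitrary characters
print(data)                # b'A\n\x00\x01'
print(b"\n\x00" in data)   # True: a false "newline" across a unit boundary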

> Furthermore, is there a way to get the
> encoding of an opened file object?

Nope. That's fundamentally not possible. Unless you mean in the
trivial sense of "what was the parameter passed to the open() call?",
in which case f.encoding will give it to you; but to find out the
actual encoding, no, you can't.
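
A quick illustration (my sketch; the file name is a placeholder):

with open("notes.txt", encoding="latin-1") as f:
    # f.encoding merely echoes the parameter (or the locale default);
    # it is not a property detected from the bytes themselves.
    print(f.encoding)    # 'latin-1', whatever the bytes really are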

The ONLY way to 100% reliably decode arbitrary text is to know, from
external information, what encoding it is in. Every other scheme
imposes restrictions. Trying to do something that works for absolutely
any encoding is a doomed project.

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail
On Sun, 8 May 2022 at 07:19, Stefan Ram <ram@zedat.fu-berlin.de> wrote:
>
> MRAB <python@mrabarnett.plus.com> writes:
> >On 2022-05-07 19:47, Stefan Ram wrote:
> ...
> >>def encoding( name ):
> >> path = pathlib.Path( name )
> >> for encoding in( "utf_8", "latin_1", "cp1252" ):
> >> try:
> >> with path.open( encoding=encoding, errors="strict" )as file:
> >> text = file.read()
> >> return encoding
> >> except UnicodeDecodeError:
> >> pass
> >> return "ascii"
> >>Yes, it's potentially slow and might be wrong.
> >>The result "ascii" might mean it's a binary file.
> >"latin-1" will decode any sequence of bytes, so it'll never try
> >"cp1252", nor fall back to "ascii", and falling back to "ascii" is wrong
> >anyway because the file could contain 0x80..0xFF, which aren't supported
> >by that encoding.
>
> Thank you! It's working for my specific application where
> I'm reading from a collection of text files that should be
> encoded in either utf_8, latin_1, or ascii.
>

In that case, I'd exclude ASCII from the check, and just check UTF-8,
and if that fails, decode as Latin-1. Any ASCII files will decode
correctly as UTF-8, and any file will decode as Latin-1.

I've used this exact fallback system when decoding raw data from
Unicode-naive servers - they accept and share bytes, so it's entirely
possible to have a mix of encodings in a single stream. As long as you
can define the span of a single "unit" (say, a line, or a chunk in
some form), you can read as bytes and do the exact same "decode as
UTF-8 if possible, otherwise decode as Latin-1" dance. Sure, it's not
perfectly ideal, but it's about as good as you'll get with a lot of
US-based servers. (Depending on context, you might use CP-1252 instead
of Latin-1, but you might need errors="replace" there, since
Windows-1252 has some undefined byte values.)
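
A minimal sketch of that dance, applied to one unit of bytes (my
wording of it, not code from this thread):

def decode_fallback(raw: bytes) -> str:
    # Try UTF-8 first; Latin-1 maps every byte 0x00-0xFF to a code
    # point, so the fallback can never fail.
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")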

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail
> On 7 May 2022, at 17:29, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
>
> On Sat, 7 May 2022 at 16:08, Barry <barry@barrys-emacs.org> wrote:
>> You need to handle the file in bin mode and do the handling of line endings and encodings yourself. It’s not that hard for the cases you wanted.
>
>>>> "\n".encode("utf-16")
> b'\xff\xfe\n\x00'
>>>> "".encode("utf-16")
> b'\xff\xfe'
>>>> "a\nb".encode("utf-16")
> b'\xff\xfea\x00\n\x00b\x00'
>>>> "\n".encode("utf-16").lstrip("".encode("utf-16"))
> b'\n\x00'
>
> Can I use the last trick to get the encoding of a LF or a CR in any encoding?

In a word, no.

There are cases where you just have to know the encoding you are working
with: utf-16, because you have to deal with the data in 2-byte units and
know whether it is big-endian or little-endian.

There will be other encodings that will also be difficult.

But if you are working with encodings that use ASCII as a base, like
Unicode encoded as utf-8 or the iso-8859 series, then you can just look
for NL and CR using the ASCII values of the bytes.
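
A tiny sketch of that property (my example; any ASCII-superset encoding
behaves the same, since 0x0A never occurs inside a UTF-8 sequence):

data = "pizza margherita\npizza marinara\n".encode("utf-8")
print(data.count(b"\n"))    # 2, a safe byte-level line count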

In short, once you set your requirements, you know which problems you
can avoid and which you must solve.

Is utf-16 important to you? If not, there is no need to solve its
issues.

Barry



--
https://mail.python.org/mailman/listinfo/python-list
Re: tail
I think I've _almost_ found a simpler, general way:

import os

_lf = "\n"
_cr = "\r"


def tail(filepath, n=10, newline=None, encoding=None, chunk_size=100):
    n_chunk_size = n * chunk_size
    pos = os.stat(filepath).st_size
    chunk_line_pos = -1
    lines_not_found = n

    with open(filepath, newline=newline, encoding=encoding) as f:
        text = ""

        hard_mode = False

        if newline is None:
            newline = _lf
        elif newline == "":
            hard_mode = True

        if hard_mode:
            while pos != 0:
                pos -= n_chunk_size

                if pos < 0:
                    pos = 0

                f.seek(pos)
                text = f.read()
                lf_after = False

                for i, char in enumerate(reversed(text)):
                    if char == _lf:
                        lf_after = True
                    elif char == _cr:
                        lines_not_found -= 1
                        # A "\r\n" pair counts as one two-char ending.
                        newline_size = 2 if lf_after else 1
                        lf_after = False
                    elif lf_after:
                        lines_not_found -= 1
                        newline_size = 1
                        lf_after = False

                    if lines_not_found == 0:
                        chunk_line_pos = len(text) - 1 - i + newline_size
                        break

                if lines_not_found == 0:
                    break
        else:
            while pos != 0:
                pos -= n_chunk_size

                if pos < 0:
                    pos = 0

                f.seek(pos)
                text = f.read()

                for i, char in enumerate(reversed(text)):
                    if char == newline:
                        lines_not_found -= 1

                        if lines_not_found == 0:
                            chunk_line_pos = len(text) - 1 - i + len(newline)
                            break

                if lines_not_found == 0:
                    break

    if chunk_line_pos == -1:
        chunk_line_pos = 0

    return text[chunk_line_pos:]


In short: the file is always opened in text mode. The file is read from
the end in bigger and bigger chunks, until the whole file has been read
or all the lines have been found.

Why? Because in encodings that have more than one byte per character,
reading a chunk of n bytes and then reading the previous chunk can end
up splitting a character across the two chunks.

I think one could instead read chunk by chunk and handle the problem at
the chunk junctions; I suppose the code would be faster that way.
Anyway, this trick seems quite fast as it is, and it's a lot simpler.

The final result is read from the chunk, and not from the file, so
there are no problems of misalignment between bytes and text.
Furthermore, the builtin encoding parameter is used, so this should
work with all encodings (untested).

Furthermore, a newline parameter can be specified, as in open(). If
it's equal to the empty string, things are a little more complicated,
but I suppose the code is clear. That path is untested too; I only
tested with a UTF-8 Linux file.

Do you think there's any chance of getting this function as a method of
the file object in CPython? The method for a file object opened in
bytes mode is simpler, since there's no encoding and the newline is
only \n in that case.
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail
> On 7 May 2022, at 14:40, Stefan Ram <ram@zedat.fu-berlin.de> wrote:
>
> Marco Sulla <Marco.Sulla.Python@gmail.com> writes:
>> So there's no way to reliably read lines in reverse in text mode using
>> seek and read, but the only option is readlines?
>
> I think, CPython is based on C. I don't know whether
> Python's seek function directly calls C's fseek function,
> but maybe the following parts of the C standard also are
> relevant for Python?

There is the POSIX API and the C FILE API.

I expect that the oddities you mention about NUL chars are all about
the FILE API. As far as I know it's the POSIX API that CPython uses,
and it does not suffer from issues with binary files.

Barry

>
> |Setting the file position indicator to end-of-file, as with
> |fseek(file, 0, SEEK_END), has undefined behavior for a binary
> |stream (because of possible trailing null characters) or for
> |any stream with state-dependent encoding that does not
> |assuredly end in the initial shift state.
> from a footnote in a draft of a C standard
>
> |For a text stream, either offset shall be zero, or offset
> |shall be a value returned by an earlier successful call to
> |the ftell function on a stream associated with the same file
> |and whence shall be SEEK_SET.
> from a draft of a C standard
>
> |A text stream is an ordered sequence of characters composed
> |into lines, each line consisting of zero or more characters
> |plus a terminating new-line character. Whether the last line
> |requires a terminating new-line character is implementation-defined.
> from a draft of a C standard
>
> This might mean that reading from a text stream that is not
> ending in a new-line character might have undefined behavior
> (depending on the C implementation). In practice, it might
> mean that some things could go wrong near the end of such
> a stream.
>
>
> --
> https://mail.python.org/mailman/listinfo/python-list
>

--
https://mail.python.org/mailman/listinfo/python-list
Re: tail
> On 7 May 2022, at 22:31, Chris Angelico <rosuav@gmail.com> wrote:
>
> On Sun, 8 May 2022 at 07:19, Stefan Ram <ram@zedat.fu-berlin.de> wrote:
>>
>> MRAB <python@mrabarnett.plus.com> writes:
>>> On 2022-05-07 19:47, Stefan Ram wrote:
>> ...
>>>> def encoding( name ):
>>>> path = pathlib.Path( name )
>>>> for encoding in( "utf_8", "latin_1", "cp1252" ):
>>>> try:
>>>> with path.open( encoding=encoding, errors="strict" )as file:
>>>> text = file.read()
>>>> return encoding
>>>> except UnicodeDecodeError:
>>>> pass
>>>> return "ascii"
>>>> Yes, it's potentially slow and might be wrong.
>>>> The result "ascii" might mean it's a binary file.
>>> "latin-1" will decode any sequence of bytes, so it'll never try
>>> "cp1252", nor fall back to "ascii", and falling back to "ascii" is wrong
>>> anyway because the file could contain 0x80..0xFF, which aren't supported
>>> by that encoding.
>>
>> Thank you! It's working for my specific application where
>> I'm reading from a collection of text files that should be
>> encoded in either utf_8, latin_1, or ascii.
>>
>
> In that case, I'd exclude ASCII from the check, and just check UTF-8,
> and if that fails, decode as Latin-1. Any ASCII files will decode
> correctly as UTF-8, and any file will decode as Latin-1.
>
> I've used this exact fallback system when decoding raw data from
> Unicode-naive servers - they accept and share bytes, so it's entirely
> possible to have a mix of encodings in a single stream. As long as you
> can define the span of a single "unit" (say, a line, or a chunk in
> some form), you can read as bytes and do the exact same "decode as
> UTF-8 if possible, otherwise decode as Latin-1" dance. Sure, it's not
> perfectly ideal, but it's about as good as you'll get with a lot of
> US-based servers. (Depending on context, you might use CP-1252 instead
> of Latin-1, but you might need errors="replace" there, since
> Windows-1252 has some undefined byte values.)

There is a very common error on Windows where files, and especially web
pages, that claim to be utf-8 are in fact CP-1252.

There is logic in the HTML standards to try utf-8 and, if it fails, fall
back to CP-1252.

It's usually the left and right "smart" quote chars that cause the
issue, as they encode as invalid utf-8.
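
A small demonstration of that failure mode (my sketch): the CP-1252
"smart" quotes are the single bytes 0x93/0x94, which are invalid in
utf-8:

data = "\u201csmart\u201d".encode("cp1252")    # b'\x93smart\x94'
try:
    data.decode("utf-8")            # 0x93 cannot start a UTF-8 sequence
except UnicodeDecodeError:
    print(data.decode("cp1252"))    # the HTML-style fallback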

Barry


>
> ChrisA
> --
> https://mail.python.org/mailman/listinfo/python-list
>

--
https://mail.python.org/mailman/listinfo/python-list
Re: tail
On Mon, 9 May 2022 at 04:15, Barry Scott <barry@barrys-emacs.org> wrote:
>
>
>
> > On 7 May 2022, at 22:31, Chris Angelico <rosuav@gmail.com> wrote:
> >
> > On Sun, 8 May 2022 at 07:19, Stefan Ram <ram@zedat.fu-berlin.de> wrote:
> >>
> >> MRAB <python@mrabarnett.plus.com> writes:
> >>> On 2022-05-07 19:47, Stefan Ram wrote:
> >> ...
> >>>> def encoding( name ):
> >>>> path = pathlib.Path( name )
> >>>> for encoding in( "utf_8", "latin_1", "cp1252" ):
> >>>> try:
> >>>> with path.open( encoding=encoding, errors="strict" )as file:
> >>>> text = file.read()
> >>>> return encoding
> >>>> except UnicodeDecodeError:
> >>>> pass
> >>>> return "ascii"
> >>>> Yes, it's potentially slow and might be wrong.
> >>>> The result "ascii" might mean it's a binary file.
> >>> "latin-1" will decode any sequence of bytes, so it'll never try
> >>> "cp1252", nor fall back to "ascii", and falling back to "ascii" is wrong
> >>> anyway because the file could contain 0x80..0xFF, which aren't supported
> >>> by that encoding.
> >>
> >> Thank you! It's working for my specific application where
> >> I'm reading from a collection of text files that should be
> >> encoded in either utf_8, latin_1, or ascii.
> >>
> >
> > In that case, I'd exclude ASCII from the check, and just check UTF-8,
> > and if that fails, decode as Latin-1. Any ASCII files will decode
> > correctly as UTF-8, and any file will decode as Latin-1.
> >
> > I've used this exact fallback system when decoding raw data from
> > Unicode-naive servers - they accept and share bytes, so it's entirely
> > possible to have a mix of encodings in a single stream. As long as you
> > can define the span of a single "unit" (say, a line, or a chunk in
> > some form), you can read as bytes and do the exact same "decode as
> > UTF-8 if possible, otherwise decode as Latin-1" dance. Sure, it's not
> > perfectly ideal, but it's about as good as you'll get with a lot of
> > US-based servers. (Depending on context, you might use CP-1252 instead
> > of Latin-1, but you might need errors="replace" there, since
> > Windows-1252 has some undefined byte values.)
>
> There is a very common error on Windows that files and especially web pages that
> claim to be utf-8 are in fact CP-1252.
>
> There is logic in the HTML standards to try utf-8 and if it fails fall back to CP-1252.
>
> Its usually the left and "smart" quote chars that cause the issue as they code
> as an invalid utf-8.
>

Yeah, or sometimes, there isn't *anything* in UTF-8, and it has some
sort of straight-up lie in the form of a meta tag. It's annoying. But
the same logic still applies: attempt one decode (UTF-8) and if it
fails, there's one fallback. Fairly simple.

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail
> On 8 May 2022, at 17:05, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
>
> I think I've _almost_ found a simpler, general way:
>
> import os
>
> _lf = "\n"
> _cr = "\r"
>
> def tail(filepath, n=10, newline=None, encoding=None, chunk_size=100):
> n_chunk_size = n * chunk_size

Why use tiny chunks? You can read 4KiB as fast as 100 bytes, as that is typically the smallest size the file system will allocate.
I tend to read in multiples of MiB as it's near instant.

> pos = os.stat(filepath).st_size

You cannot mix the POSIX API with text mode.
pos is in bytes from the start of the file.
Text mode works in code points, and bytes != code points.

> chunk_line_pos = -1
> lines_not_found = n
>
> with open(filepath, newline=newline, encoding=encoding) as f:
> text = ""
>
> hard_mode = False
>
> if newline == None:
> newline = _lf
> elif newline == "":
> hard_mode = True
>
> if hard_mode:
> while pos != 0:
> pos -= n_chunk_size
>
> if pos < 0:
> pos = 0
>
> f.seek(pos)

In text mode you can only seek to a value returned from f.tell(), otherwise the behaviour is undefined.

> text = f.read()

You have on limit on the amount of data read.

> lf_after = False
>
> for i, char in enumerate(reversed(text)):

Simply use text.rindex('\n') or text.rfind('\n') for speed.
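
For instance (a sketch of that suggestion, for the single-newline case
only; the helper name is mine):

def nth_last_newline(text, n):
    # Walk backwards with rfind instead of a per-character Python loop;
    # returns the index of the n-th newline from the end, or -1.
    pos = len(text)
    for _ in range(n):
        pos = text.rfind("\n", 0, pos)
        if pos == -1:
            break
    return pos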

> if char == _lf:
> lf_after == True
> elif char == _cr:
> lines_not_found -= 1
>
> newline_size = 2 if lf_after else 1
>
> lf_after = False
> elif lf_after:
> lines_not_found -= 1
> newline_size = 1
> lf_after = False
>
>
> if lines_not_found == 0:
> chunk_line_pos = len(text) - 1 - i + newline_size
> break
>
> if lines_not_found == 0:
> break
> else:
> while pos != 0:
> pos -= n_chunk_size
>
> if pos < 0:
> pos = 0
>
> f.seek(pos)
> text = f.read()
>
> for i, char in enumerate(reversed(text)):
> if char == newline:
> lines_not_found -= 1
>
> if lines_not_found == 0:
> chunk_line_pos = len(text) - 1 - i +
> len(newline)
> break
>
> if lines_not_found == 0:
> break
>
>
> if chunk_line_pos == -1:
> chunk_line_pos = 0
>
> return text[chunk_line_pos:]
>
>
> Shortly, the file is always opened in text mode. File is read at the end in
> bigger and bigger chunks, until the file is finished or all the lines are
> found.

It will fail if the contents are not ASCII.

>
> Why? Because in encodings that have more than 1 byte per character, reading
> a chunk of n bytes, then reading the previous chunk, can eventually split
> the character between the chunks in two distinct bytes.

No it cannot. Text mode only knows how to return code points. Now, if
you were in binary mode it could be split, but you are not in binary
mode, so it cannot.

> I think one can read chunk by chunk and test the chunk junction problem. I
> suppose the code will be faster this way. Anyway, it seems that this trick
> is quite fast anyway and it's a lot simpler.

> The final result is read from the chunk, and not from the file, so there's
> no problems of misalignment of bytes and text. Furthermore, the builtin
> encoding parameter is used, so this should work with all the encodings
> (untested).
>
> Furthermore, a newline parameter can be specified, as in open(). If it's
> equal to the empty string, the things are a little more complicated, anyway
> I suppose the code is clear. It's untested too. I only tested with an utf8
> linux file.
>
> Do you think there are chances to get this function as a method of the file
> object in CPython? The method for a file object opened in bytes mode is
> simpler, since there's no encoding and newline is only \n in that case.

State your requirements. Then see if your implementation meets them.

Barry

> --
> https://mail.python.org/mailman/listinfo/python-list
>

--
https://mail.python.org/mailman/listinfo/python-list
Re: tail
On 2022-05-08 19:15, Barry Scott wrote:
>
>
>> On 7 May 2022, at 22:31, Chris Angelico <rosuav@gmail.com> wrote:
>>
>> On Sun, 8 May 2022 at 07:19, Stefan Ram <ram@zedat.fu-berlin.de> wrote:
>>>
>>> MRAB <python@mrabarnett.plus.com> writes:
>>>> On 2022-05-07 19:47, Stefan Ram wrote:
>>> ...
>>>>> def encoding( name ):
>>>>> path = pathlib.Path( name )
>>>>> for encoding in( "utf_8", "latin_1", "cp1252" ):
>>>>> try:
>>>>> with path.open( encoding=encoding, errors="strict" )as file:
>>>>> text = file.read()
>>>>> return encoding
>>>>> except UnicodeDecodeError:
>>>>> pass
>>>>> return "ascii"
>>>>> Yes, it's potentially slow and might be wrong.
>>>>> The result "ascii" might mean it's a binary file.
>>>> "latin-1" will decode any sequence of bytes, so it'll never try
>>>> "cp1252", nor fall back to "ascii", and falling back to "ascii" is wrong
>>>> anyway because the file could contain 0x80..0xFF, which aren't supported
>>>> by that encoding.
>>>
>>> Thank you! It's working for my specific application where
>>> I'm reading from a collection of text files that should be
>>> encoded in either utf_8, latin_1, or ascii.
>>>
>>
>> In that case, I'd exclude ASCII from the check, and just check UTF-8,
>> and if that fails, decode as Latin-1. Any ASCII files will decode
>> correctly as UTF-8, and any file will decode as Latin-1.
>>
>> I've used this exact fallback system when decoding raw data from
>> Unicode-naive servers - they accept and share bytes, so it's entirely
>> possible to have a mix of encodings in a single stream. As long as you
>> can define the span of a single "unit" (say, a line, or a chunk in
>> some form), you can read as bytes and do the exact same "decode as
>> UTF-8 if possible, otherwise decode as Latin-1" dance. Sure, it's not
>> perfectly ideal, but it's about as good as you'll get with a lot of
>> US-based servers. (Depending on context, you might use CP-1252 instead
>> of Latin-1, but you might need errors="replace" there, since
>> Windows-1252 has some undefined byte values.)
>
> There is a very common error on Windows that files and especially web pages that
> claim to be utf-8 are in fact CP-1252.
>
> There is logic in the HTML standards to try utf-8 and if it fails fall back to CP-1252.
>
> Its usually the left and "smart" quote chars that cause the issue as they code
> as an invalid utf-8.
>
Is it CP-1252 or ISO-8859-1 (Latin-1)?
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail
On Sun, 8 May 2022 at 20:31, Barry Scott <barry@barrys-emacs.org> wrote:
>
> > On 8 May 2022, at 17:05, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
> >
> > def tail(filepath, n=10, newline=None, encoding=None, chunk_size=100):
> > n_chunk_size = n * chunk_size
>
> Why use tiny chunks? You can read 4KiB as fast as 100 bytes as its typically the smaller size the file system will allocate.
> I tend to read on multiple of MiB as its near instant.

Well, I tested on a little file, a list of my preferred pizzas, so....

> > pos = os.stat(filepath).st_size
>
> You cannot mix POSIX API with text mode.
> pos is in bytes from the start of the file.
> Textmode will be in code points. bytes != code points.
>
> > chunk_line_pos = -1
> > lines_not_found = n
> >
> > with open(filepath, newline=newline, encoding=encoding) as f:
> > text = ""
> >
> > hard_mode = False
> >
> > if newline == None:
> > newline = _lf
> > elif newline == "":
> > hard_mode = True
> >
> > if hard_mode:
> > while pos != 0:
> > pos -= n_chunk_size
> >
> > if pos < 0:
> > pos = 0
> >
> > f.seek(pos)
>
> In text mode you can only seek to a value return from f.tell() otherwise the behaviour is undefined.

Why? I don't see any recommendation about it in the docs:
https://docs.python.org/3/library/io.html#io.IOBase.seek

> > text = f.read()
>
> You have on limit on the amount of data read.

I explained that previously. Anyway, chunk_size is small, so it's not
a great problem.

> > lf_after = False
> >
> > for i, char in enumerate(reversed(text)):
>
> Simple use text.rindex('\n') or text.rfind('\n') for speed.

I can't use them when I have to find both \n and \r. So I preferred to
simplify the code and use the for loop in both cases. Bear in mind
anyway that this is a prototype for a Python C API implementation
(builtin, I hope, or a C extension if not).

> > Shortly, the file is always opened in text mode. File is read at the end in
> > bigger and bigger chunks, until the file is finished or all the lines are
> > found.
>
> It will fail if the contents is not ASCII.

Why?

> > Why? Because in encodings that have more than 1 byte per character, reading
> > a chunk of n bytes, then reading the previous chunk, can eventually split
> > the character between the chunks in two distinct bytes.
>
> No it cannot. text mode only knows how to return code points. Now if you are in
> binary it could be split, but you are not in binary mode so it cannot.

From the docs:

seek(offset, whence=SEEK_SET)
Change the stream position to the given byte offset.

> > Do you think there are chances to get this function as a method of the file
> > object in CPython? The method for a file object opened in bytes mode is
> > simpler, since there's no encoding and newline is only \n in that case.
>
> State your requirements. Then see if your implementation meets them.

The method should return the last n lines from a file object.
If the file object is in text mode, the newline parameter must be honored.
If the file object is in binary mode, a newline is always b"\n", to be
consistent with readline.

I suppose the current implementation of tail satisfies the
requirements for text mode. The previous one satisfied binary mode.

Anyway, apart from my implementation, I'm curious whether you think a
tail method is worth adding to the builtin file objects in CPython.
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail
On Mon, 9 May 2022 at 05:49, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
> Anyway, apart from my implementation, I'm curious if you think a tail
> method is worth it to be a method of the builtin file objects in
> CPython.

Absolutely not. As has been stated multiple times in this thread, a
fully general approach is extremely complicated, horrifically
unreliable, and hopelessly inefficient. The ONLY way to make this sort
of thing any good whatsoever is to know your own use-case and code to
exactly that. Given the size of files you're working with, for
instance, a simple approach of just reading the whole file would make
far more sense than the complex seeking you're doing. For reading a
multi-gigabyte file, the choices will be different.

No, this does NOT belong in the core language.

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail
> On 8 May 2022, at 20:48, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
>
> On Sun, 8 May 2022 at 20:31, Barry Scott <barry@barrys-emacs.org> wrote:
>>
>>>> On 8 May 2022, at 17:05, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
>>>
>>> def tail(filepath, n=10, newline=None, encoding=None, chunk_size=100):
>>> n_chunk_size = n * chunk_size
>>
>> Why use tiny chunks? You can read 4KiB as fast as 100 bytes as its typically the smaller size the file system will allocate.
>> I tend to read on multiple of MiB as its near instant.
>
> Well, I tested on a little file, a list of my preferred pizzas, so....

Try it on a very big file.

>
>>> pos = os.stat(filepath).st_size
>>
>> You cannot mix POSIX API with text mode.
>> pos is in bytes from the start of the file.
>> Textmode will be in code points. bytes != code points.
>>
>>> chunk_line_pos = -1
>>> lines_not_found = n
>>>
>>> with open(filepath, newline=newline, encoding=encoding) as f:
>>> text = ""
>>>
>>> hard_mode = False
>>>
>>> if newline == None:
>>> newline = _lf
>>> elif newline == "":
>>> hard_mode = True
>>>
>>> if hard_mode:
>>> while pos != 0:
>>> pos -= n_chunk_size
>>>
>>> if pos < 0:
>>> pos = 0
>>>
>>> f.seek(pos)
>>
>> In text mode you can only seek to a value return from f.tell() otherwise the behaviour is undefined.
>
> Why? I don't see any recommendation about it in the docs:
> https://docs.python.org/3/library/io.html#io.IOBase.seek

What does adding 1 to a pos mean?
If it's binary it means 1 byte further down the file, but in text mode
it may need to move the position 1, 2, 3 or 4 bytes down the file.
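
A short illustration (my sketch; the file name is a placeholder):

with open("sample.txt", "w", encoding="utf-8") as f:
    f.write("héllo")
with open("sample.txt", encoding="utf-8") as f:
    f.read(1)         # 'h', a 1-byte character
    print(f.tell())   # typically 1 in CPython
    f.read(1)         # 'é', a 2-byte character in utf-8
    print(f.tell())   # typically 3: the cookie jumped by two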

>
>>> text = f.read()
>>
>> You have on limit on the amount of data read.
>
> I explained that previously. Anyway, chunk_size is small, so it's not
> a great problem.

Typo: I meant you have no limit.

You read all the data to the end of the file, which might be megabytes of data.
>
>>> lf_after = False
>>>
>>> for i, char in enumerate(reversed(text)):
>>
>> Simple use text.rindex('\n') or text.rfind('\n') for speed.
>
> I can't use them when I have to find both \n or \r. So I preferred to
> simplify the code and use the for cycle every time. Take into mind
> anyway that this is a prototype for a Python C Api implementation
> (builtin I hope, or a C extension if not)
>
>>> Shortly, the file is always opened in text mode. File is read at the end in
>>> bigger and bigger chunks, until the file is finished or all the lines are
>>> found.
>>
>> It will fail if the contents is not ASCII.
>
> Why?
>
>>> Why? Because in encodings that have more than 1 byte per character, reading
>>> a chunk of n bytes, then reading the previous chunk, can eventually split
>>> the character between the chunks in two distinct bytes.
>>
>> No it cannot. text mode only knows how to return code points. Now if you are in
>> binary it could be split, but you are not in binary mode so it cannot.
>
> From the docs:
>
> seek(offset, whence=SEEK_SET)
> Change the stream position to the given byte offset.
>
>>> Do you think there are chances to get this function as a method of the file
>>> object in CPython? The method for a file object opened in bytes mode is
>>> simpler, since there's no encoding and newline is only \n in that case.
>>
>> State your requirements. Then see if your implementation meets them.
>
> The method should return the last n lines from a file object.
> If the file object is in text mode, the newline parameter must be honored.
> If the file object is in binary mode, a newline is always b"\n", to be
> consistent with readline.
>
> I suppose the current implementation of tail satisfies the
> requirements for text mode. The previous one satisfied binary mode.
>
> Anyway, apart from my implementation, I'm curious if you think a tail
> method is worth it to be a method of the builtin file objects in
> CPython.
>

--
https://mail.python.org/mailman/listinfo/python-list
Re: tail
On Sun, 8 May 2022 at 22:02, Chris Angelico <rosuav@gmail.com> wrote:
>
> Absolutely not. As has been stated multiple times in this thread, a
> fully general approach is extremely complicated, horrifically
> unreliable, and hopelessly inefficient.

Well, my implementation is quite general now. It's neither complicated
nor inefficient. As for reliability, I can't say anything without a
test case.

> The ONLY way to make this sort
> of thing any good whatsoever is to know your own use-case and code to
> exactly that. Given the size of files you're working with, for
> instance, a simple approach of just reading the whole file would make
> far more sense than the complex seeking you're doing. For reading a
> multi-gigabyte file, the choices will be different.

Apart from the fact that it's very, very simple to optimize for small
files: this is, IMHO, a premature optimization. The code is quite fast
even if the file is small. Can it be faster? Of course, but it depends
on the use case. Every optimization in CPython must pass the benchmark
suite test. If there's little or no gain, the optimization is usually
rejected.

> No, this does NOT belong in the core language.

I respect your opinion, but IMHO you think the task is more complicated
than it really is. It seems to me that the method can be quite simple
and fast.
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail
On Sun, 8 May 2022 at 22:34, Barry <barry@barrys-emacs.org> wrote:
>
> > On 8 May 2022, at 20:48, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
> >
> > On Sun, 8 May 2022 at 20:31, Barry Scott <barry@barrys-emacs.org> wrote:
> >>
> >>>> On 8 May 2022, at 17:05, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
> >>>
> >>> def tail(filepath, n=10, newline=None, encoding=None, chunk_size=100):
> >>> n_chunk_size = n * chunk_size
> >>
> >> Why use tiny chunks? You can read 4KiB as fast as 100 bytes as its typically the smaller size the file system will allocate.
> >> I tend to read on multiple of MiB as its near instant.
> >
> > Well, I tested on a little file, a list of my preferred pizzas, so....
>
> Try it on a very big file.

I'm not saying it's a good idea; it's only the value that I needed for
my tests. Anyway, it's not a problem with big files. The problem is
with files with long lines.

> >> In text mode you can only seek to a value return from f.tell() otherwise the behaviour is undefined.
> >
> > Why? I don't see any recommendation about it in the docs:
> > https://docs.python.org/3/library/io.html#io.IOBase.seek
>
> What does adding 1 to a pos mean?
> If it’s binary it mean 1 byte further down the file but in text mode it may need to
> move the point 1, 2 or 3 bytes down the file.

Emh. I re-quote

seek(offset, whence=SEEK_SET)
Change the stream position to the given byte offset.

And so on. No mention of differences between text and binary mode.

> >> You have on limit on the amount of data read.
> >
> > I explained that previously. Anyway, chunk_size is small, so it's not
> > a great problem.
>
> Typo I meant you have no limit.
>
> You read all the data till the end of the file that might be mega bytes of data.

Yes, I already explained why and how it could be optimized. I quote myself:

In short: the file is always opened in text mode. The file is read from
the end in bigger and bigger chunks, until the whole file has been read
or all the lines have been found.

Why? Because in encodings that have more than one byte per character,
reading a chunk of n bytes and then reading the previous chunk can end
up splitting a character across the two chunks.

I think one could instead read chunk by chunk and handle the problem at
the chunk junctions; I suppose the code would be faster that way.
Anyway, this trick seems quite fast as it is, and it's a lot simpler.
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail
On 9/05/22 7:47 am, Marco Sulla wrote:
>> It will fail if the contents is not ASCII.
>
> Why?

For some encodings, if you seek to an arbitrary byte position and
then read, it may *appear* to succeed but give you complete gibberish.

Your method might work for a certain subset of encodings (those that
are self-synchronising) but it won't work for arbitrary encodings.

Given that limitation, I don't think it's reliable enough to include
in the standard library.

--
Greg

--
https://mail.python.org/mailman/listinfo/python-list
Re: tail
On Sun, 8 May 2022 22:48:32 +0200, Marco Sulla
<Marco.Sulla.Python@gmail.com> declaimed the following:

>
>Emh. I re-quote
>
>seek(offset, whence=SEEK_SET)
>Change the stream position to the given byte offset.
>
>And so on. No mention of differences between text and binary mode.

You ignore that, underneath, Python is just wrapping the C API... And
the documentation for C explicitly specifies that, other than SEEK_END
with offset 0 and SEEK_SET with offset 0, for a text file one can only
rely upon SEEK_SET using an offset previously obtained with (C) ftell()
/ (Python) .tell().

https://docs.python.org/3/library/io.html
"""
class io.IOBase

The abstract base class for all I/O classes.
"""
seek(offset, whence=SEEK_SET)

Change the stream position to the given byte offset. offset is
interpreted relative to the position indicated by whence. The default value
for whence is SEEK_SET. Values for whence are:
"""

Applicable to BINARY MODE I/O: For UTF-8 and any other multibyte
encoding, this means you could end up positioning into the middle of a
"character" and subsequently read garbage. It is on you to handle
synchronizing on a valid character position, and also to handle different
line ending conventions.
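
For UTF-8 specifically, that synchronization is easy because the
encoding is self-synchronizing; a sketch (the helper name is mine):

def resync_utf8(buf, i):
    # UTF-8 continuation bytes look like 0b10xxxxxx, so skipping them
    # finds the start of the next character.
    while i < len(buf) and (buf[i] & 0xC0) == 0x80:
        i += 1
    return i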

"""
class io.TextIOBase

Base class for text streams. This class provides a character and line
based interface to stream I/O. It inherits IOBase.
"""
seek(offset, whence=SEEK_SET)

Change the stream position to the given offset. Behaviour depends on
the whence parameter. The default value for whence is SEEK_SET.

SEEK_SET or 0: seek from the start of the stream (the default);
offset must either be a number returned by TextIOBase.tell(), or zero. Any
other offset value produces undefined behaviour.

SEEK_CUR or 1: “seek” to the current position; offset must be zero,
which is a no-operation (all other values are unsupported).

SEEK_END or 2: seek to the end of the stream; offset must be zero
(all other values are unsupported).
"""

EMPHASIS: "offset must either be a number returned by TextIOBase.tell(), or
zero."

TEXT I/O, with a specified encoding, will return Unicode code points,
and will handle converting line endings to the internal format (<lf>
represents the new-line).

Since your code does not specify BINARY mode in the open statement,
Python should be using TEXT mode.



--
Wulfraed Dennis Lee Bieber AF6VN
wlfraed@ix.netcom.com http://wlfraed.microdiversity.freeddns.org/
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail
On 08May2022 22:48, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
>On Sun, 8 May 2022 at 22:34, Barry <barry@barrys-emacs.org> wrote:
>> >> In text mode you can only seek to a value return from f.tell()
>> >> otherwise the behaviour is undefined.
>> >
>> > Why? I don't see any recommendation about it in the docs:
>> > https://docs.python.org/3/library/io.html#io.IOBase.seek
>>
>> What does adding 1 to a pos mean?
>> If it’s binary it mean 1 byte further down the file but in text mode it may need to
>> move the point 1, 2 or 3 bytes down the file.
>
>Emh. I re-quote
>
>seek(offset, whence=SEEK_SET)
>Change the stream position to the given byte offset.
>
>And so on. No mention of differences between text and binary mode.

You're looking at IOBase, the _binary_ basis of low level common file
I/O. Compare with: https://docs.python.org/3/library/io.html#io.TextIOBase.seek
The positions are "opaque numbers", which means you should not ascribe
any deeper meaning to them except that they represent a point in the
file. It clearly says "offset must either be a number returned by
TextIOBase.tell(), or zero. Any other offset value produces undefined
behaviour."

The point here is that text is a very different thing. Because you
cannot seek to an absolute number of characters in an encoding with
variable sized characters. _If_ you did a seek to an arbitrary number
you can end up in the middle of some character. And there are encodings
where you cannot inspect the data to find a character boundary in the
byte stream.

Reading text files backwards is not a well defined thing without
additional criteria:
- knowing the text file actually ended on a character boundary
- knowing how to find a character boundary
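
The only portable text-mode pattern is therefore to remember tell()
cookies and seek back to them, as sketched below (the file name is a
placeholder):

with open("sample.txt", encoding="utf-8") as f:
    last_start = 0
    start = 0
    while f.readline():
        last_start = start
        start = f.tell()      # cookie for the start of the next line
    f.seek(last_start)        # legal: a value tell() returned (or zero)
    print(f.read(), end="")   # re-reads the final line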

Cheers,
Cameron Simpson <cs@cskk.id.au>
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail
On Mon, 9 May 2022 at 07:56, Cameron Simpson <cs@cskk.id.au> wrote:
>
> The point here is that text is a very different thing. Because you
> cannot seek to an absolute number of characters in an encoding with
> variable sized characters. _If_ you did a seek to an arbitrary number
> you can end up in the middle of some character. And there are encodings
> where you cannot inspect the data to find a character boundary in the
> byte stream.

Ooook, now I understand what you and Barry mean. I suppose there's no
reliable way to tail a big file opened in text mode with decent performance.

Anyway, the previous-previous function I posted worked only for files
opened in binary mode, and I suppose it's reliable, since it searches
only for b"\n", as readline() in binary mode does.
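
For reference, a minimal sketch of that binary-mode approach (my
reconstruction of the idea, with an arbitrary chunk size; not the exact
code posted earlier):

import os

def tail_bytes(filepath, n=10, chunk_size=4096):
    with open(filepath, "rb") as f:
        f.seek(0, os.SEEK_END)
        end = f.tell()
        pos = end
        buf = b""
        # Read backwards until the buffer holds more than n newlines
        # (or the whole file).
        while pos > 0 and buf.count(b"\n") <= n:
            pos = max(0, pos - chunk_size)
            f.seek(pos)
            buf = f.read(end - pos)
        pos_nl = len(buf)
        if buf.endswith(b"\n"):
            pos_nl -= 1    # the final newline terminates the last line
        for _ in range(n):
            pos_nl = buf.rfind(b"\n", 0, pos_nl)
            if pos_nl == -1:
                return buf    # fewer than n lines: return everything
        return buf[pos_nl + 1:]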
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail
On Tue, 10 May 2022 at 03:47, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
>
> On Mon, 9 May 2022 at 07:56, Cameron Simpson <cs@cskk.id.au> wrote:
> >
> > The point here is that text is a very different thing. Because you
> > cannot seek to an absolute number of characters in an encoding with
> > variable sized characters. _If_ you did a seek to an arbitrary number
> > you can end up in the middle of some character. And there are encodings
> > where you cannot inspect the data to find a character boundary in the
> > byte stream.
>
> Ooook, now I understand what you and Barry mean. I suppose there's no
> reliable way to tail a big file opened in text mode with a decent performance.
>
> Anyway, the previous-previous function I posted worked only for files
> opened in binary mode, and I suppose it's reliable, since it searches
> only for b"\n", as readline() in binary mode do.

It's still fundamentally impossible to solve this in a general way, so
the best way to do things will always be to code for *your* specific
use-case. That means that this doesn't belong in the stdlib or core
language, but in your own toolkit.

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail
On 2022-05-08 at 18:52:42 +0000,
Stefan Ram <ram@zedat.fu-berlin.de> wrote:

> Remember how recently people here talked about how you cannot copy
> text from a video? Then, how did I do it? Turns out, for my
> operating system, there's a screen OCR program! So I did this OCR
> and then manually corrected a few wrong characters, and was done!

When you're learning, and the example you tried doesn't work like it
worked on the video, you probably don't know what's wrong, let alone how
to correct it.
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail
On Mon, 9 May 2022 at 19:53, Chris Angelico <rosuav@gmail.com> wrote:
>
> On Tue, 10 May 2022 at 03:47, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
> >
> > On Mon, 9 May 2022 at 07:56, Cameron Simpson <cs@cskk.id.au> wrote:
> > >
> > > The point here is that text is a very different thing. Because you
> > > cannot seek to an absolute number of characters in an encoding with
> > > variable sized characters. _If_ you did a seek to an arbitrary number
> > > you can end up in the middle of some character. And there are encodings
> > > where you cannot inspect the data to find a character boundary in the
> > > byte stream.
> >
> > Ooook, now I understand what you and Barry mean. I suppose there's no
> > reliable way to tail a big file opened in text mode with a decent performance.
> >
> > Anyway, the previous-previous function I posted worked only for files
> > opened in binary mode, and I suppose it's reliable, since it searches
> > only for b"\n", as readline() in binary mode do.
>
> It's still fundamentally impossible to solve this in a general way, so
> the best way to do things will always be to code for *your* specific
> use-case. That means that this doesn't belong in the stdlib or core
> language, but in your own toolkit.

Nevertheless, tail is a fundamental tool in *nix. It's fast and
reliable. Or can't the tail command handle different encodings?
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail
On Tue, 10 May 2022 at 05:12, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
>
> On Mon, 9 May 2022 at 19:53, Chris Angelico <rosuav@gmail.com> wrote:
> >
> > On Tue, 10 May 2022 at 03:47, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
> > >
> > > On Mon, 9 May 2022 at 07:56, Cameron Simpson <cs@cskk.id.au> wrote:
> > > >
> > > > The point here is that text is a very different thing. Because you
> > > > cannot seek to an absolute number of characters in an encoding with
> > > > variable sized characters. _If_ you did a seek to an arbitrary number
> > > > you can end up in the middle of some character. And there are encodings
> > > > where you cannot inspect the data to find a character boundary in the
> > > > byte stream.
> > >
> > > Ooook, now I understand what you and Barry mean. I suppose there's no
> > > reliable way to tail a big file opened in text mode with a decent performance.
> > >
> > > Anyway, the previous-previous function I posted worked only for files
> > > opened in binary mode, and I suppose it's reliable, since it searches
> > > only for b"\n", as readline() in binary mode do.
> >
> > It's still fundamentally impossible to solve this in a general way, so
> > the best way to do things will always be to code for *your* specific
> > use-case. That means that this doesn't belong in the stdlib or core
> > language, but in your own toolkit.
>
> Nevertheless, tail is a fundamental tool in *nix. It's fast and
> reliable. Also the tail command can't handle different encodings?

Like most Unix programs, it handles bytes.

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail
> On 9 May 2022, at 17:41, ram@zedat.fu-berlin.de wrote:
>
> Barry Scott <barry@barrys-emacs.org> writes:
>> Why use tiny chunks? You can read 4KiB as fast as 100 bytes
>
> When optimizing code, it helps to be aware of the orders of
> magnitude

That is true and well known to me; now show how what I said is wrong.

The OS is going to DMA at least 4KiB, with read-ahead more like 64KiB.
So I can get that into Python memory on the same timescale as 1 byte,
because it's the setup of the I/O that is expensive, not the bytes
transferred.

Barry

> . Code that is more cache-friendly is faster, that is,
> code that holds data in single region of memory and that uses
> regular patterns of access. Chandler Carruth talked about this,
> and I made some notes when watching the video of his talk:
>
> CPUS HAVE A HIERARCHICAL CACHE SYSTEM
> (from a 2014 talk by Chandler Carruth)
>
> One cycle on a 3 GHz processor 1 ns
> L1 cache reference 0.5 ns
> Branch mispredict 5 ns
> L2 cache reference 7 ns 14x L1 cache
> Mutex lock/unlock 25 ns
> Main memory reference 100 ns 20xL2, 200xL1
> Compress 1K bytes with Snappy 3,000 ns
> Send 1K bytes over 1 Gbps network 10,000 ns 0.01 ms
> Read 4K randomly from SSD 150,000 ns 0.15 ms
> Read 1 MB sequentially from memory 250,000 ns 0.25 ms
> Round trip within same datacenter 500,000 ns 0.5 ms
> Read 1 MB sequentially From SSD 1,000,000 ns 1 ms 4x memory
> Disk seek 10,000,000 ns 10 ms 20xdatacen. RT
> Read 1 MB sequentially from disk 20,000,000 ns 20 ms 80xmem.,20xSSD
> Send packet CA->Netherlands->CA 150,000,000 ns 150 ms
>
> . Remember how recently people here talked about how you cannot
> copy text from a video? Then, how did I do it? Turns out, for my
> operating system, there's a screen OCR program! So I did this OCR
> and then manually corrected a few wrong characters, and was done!
>
>
> --
> https://mail.python.org/mailman/listinfo/python-list
>

--
https://mail.python.org/mailman/listinfo/python-list
Re: tail
On Mon, 9 May 2022 21:11:23 +0200, Marco Sulla
<Marco.Sulla.Python@gmail.com> declaimed the following:

>Nevertheless, tail is a fundamental tool in *nix. It's fast and
>reliable. Also the tail command can't handle different encodings?

Based upon
https://github.com/coreutils/coreutils/blob/master/src/tail.c the ONLY
thing tail looks at is the single byte "\n". It does not handle other
line endings, and appears to perform BINARY I/O, not text I/O. It does
nothing for bytes that are not "\n". Split multi-byte encodings are
irrelevant since, if it does not find enough "\n" bytes in the buffer
(chunk), it reads another binary chunk and scans for additional "\n"
bytes. Once it finds the desired amount, it is synchronized on the byte
following the "\n" (which, for multi-byte encodings, might be a NUL,
but in any event should be a safe location for subsequent I/O).

Interpretation of encoding appears to fall to the console driver
configuration when displaying the bytes output by tail.


--
Wulfraed Dennis Lee Bieber AF6VN
wlfraed@ix.netcom.com http://wlfraed.microdiversity.freeddns.org/
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail
> On 9 May 2022, at 20:14, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
>
> ?On Mon, 9 May 2022 at 19:53, Chris Angelico <rosuav@gmail.com> wrote:
>>
>>> On Tue, 10 May 2022 at 03:47, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
>>>
>>> On Mon, 9 May 2022 at 07:56, Cameron Simpson <cs@cskk.id.au> wrote:
>>>>
>>>> The point here is that text is a very different thing. Because you
>>>> cannot seek to an absolute number of characters in an encoding with
>>>> variable sized characters. _If_ you did a seek to an arbitrary number
>>>> you can end up in the middle of some character. And there are encodings
>>>> where you cannot inspect the data to find a character boundary in the
>>>> byte stream.
>>>
>>> Ooook, now I understand what you and Barry mean. I suppose there's no
>>> reliable way to tail a big file opened in text mode with a decent performance.
>>>
>>> Anyway, the previous-previous function I posted worked only for files
>>> opened in binary mode, and I suppose it's reliable, since it searches
>>> only for b"\n", as readline() in binary mode do.
>>
>> It's still fundamentally impossible to solve this in a general way, so
>> the best way to do things will always be to code for *your* specific
>> use-case. That means that this doesn't belong in the stdlib or core
>> language, but in your own toolkit.
>
> Nevertheless, tail is a fundamental tool in *nix. It's fast and
> reliable. Also the tail command can't handle different encodings?

POSIX tail just prints to the output the bytes that it finds between \n
bytes. At no time does it need to care about encodings, as that is a
problem solved by the terminal software. I would not expect utf-16 to
work with tail on Linux systems.

You could always get the source of tail and read its implementation.

Barry

> --
> https://mail.python.org/mailman/listinfo/python-list
>

--
https://mail.python.org/mailman/listinfo/python-list
