Mailing List Archive

tail
What about introducing a method for text streams that reads the lines
from the bottom? Java has also a ReversedLinesFileReader with Apache
Commons IO.
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On Sun, 24 Apr 2022 at 04:37, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
>
> What about introducing a method for text streams that reads the lines
> from the bottom? Java has also a ReversedLinesFileReader with Apache
> Commons IO.

It's fundamentally difficult to get precise. In general, there are
three steps to reading the last N lines of a file:

1) Find out the size of the file (currently, if it's being grown)
2) Seek to the end of the file, minus some threshold that you hope
will contain a number of lines
3) Read from there to the end of the file, split it into lines, and
keep the last N

Reading the preceding N lines is basically a matter of repeating the
same exercise, but instead of "end of the file", use the byte position
of the line you last read.

The problem is, seeking around in a file is done by bytes, not
characters. So if you know for sure that you can resynchronize
(possible with UTF-8, not possible with some other encodings), then
you can do this, but it's probably best to build it yourself (opening
the file in binary mode).
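
A rough sketch of those three steps in binary mode (UTF-8 assumed; the
name tail_guess and the 4096-byte threshold are illustrative, and there
is no recovery if the guess turns out too small):

import os

def tail_guess(path, n, threshold=4096):
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size  # 1) find out the size
        f.seek(max(size - threshold, 0))     # 2) seek near the end
        lines = f.read().split(b"\n")        # 3) split, keep the last N
        return lines[-n:]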

This is quite inefficient in general. It would be far FAR easier to do
this instead:

1) Read the entire file and decode bytes to text
2) Split into lines
3) Iterate backwards over the lines

Tada! Done. And in Python, quite easy. The downside, of course, is
that you have to store the entire file in memory.
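
A minimal sketch of that approach (UTF-8 and the function name are
assumptions for illustration):

def tail_simple(path, n):
    # read everything, decode, split, keep the last n lines
    with open(path, encoding="utf-8") as f:
        return f.read().splitlines()[-n:]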

So it's up to you: pay the memory price, or pay the complexity price.

Personally, unless the file is tremendously large and I know for sure
that I'm not going to end up iterating over it all, I would pay the
memory price.

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On Sat, 23 Apr 2022 at 20:59, Chris Angelico <rosuav@gmail.com> wrote:
>
> On Sun, 24 Apr 2022 at 04:37, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
> >
> > What about introducing a method for text streams that reads the lines
> > from the bottom? Java has also a ReversedLinesFileReader with Apache
> > Commons IO.
>
> It's fundamentally difficult to get precise. In general, there are
> three steps to reading the last N lines of a file:
>
> 1) Find out the size of the file (currently, if it's being grown)
> 2) Seek to the end of the file, minus some threshold that you hope
> will contain a number of lines
> 3) Read from there to the end of the file, split it into lines, and
> keep the last N
>
> Reading the preceding N lines is basically a matter of repeating the
> same exercise, but instead of "end of the file", use the byte position
> of the line you last read.
>
> The problem is, seeking around in a file is done by bytes, not
> characters. So if you know for sure that you can resynchronize
> (possible with UTF-8, not possible with some other encodings), then
> you can do this, but it's probably best to build it yourself (opening
> the file in binary mode).

Well, indeed I have an implementation that does more or less what you
described, for utf8 only. The only difference is that I just started
from the end of the file minus 1. I'm just wondering if this would be
useful in the stdlib. I think it's not too difficult to generalise for
every encoding.

> This is quite inefficient in general.

Why inefficient? I think that readlines() will be much slower, not
only more memory-consuming.
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On Sun, 24 Apr 2022 at 06:41, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
>
> On Sat, 23 Apr 2022 at 20:59, Chris Angelico <rosuav@gmail.com> wrote:
> >
> > On Sun, 24 Apr 2022 at 04:37, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
> > >
> > > What about introducing a method for text streams that reads the lines
> > > from the bottom? Java has also a ReversedLinesFileReader with Apache
> > > Commons IO.
> >
> > It's fundamentally difficult to get precise. In general, there are
> > three steps to reading the last N lines of a file:
> >
> > 1) Find out the size of the file (currently, if it's being grown)
> > 2) Seek to the end of the file, minus some threshold that you hope
> > will contain a number of lines
> > 3) Read from there to the end of the file, split it into lines, and
> > keep the last N
> >
> > Reading the preceding N lines is basically a matter of repeating the
> > same exercise, but instead of "end of the file", use the byte position
> > of the line you last read.
> >
> > The problem is, seeking around in a file is done by bytes, not
> > characters. So if you know for sure that you can resynchronize
> > (possible with UTF-8, not possible with some other encodings), then
> > you can do this, but it's probably best to build it yourself (opening
> > the file in binary mode).
>
> Well, indeed I have an implementation that does more or less what you
> described for utf8 only. The only difference is that I just started
> from the end of file -1. I'm just wondering if this will be useful in
> the stdlib. I think it's not too difficult to generalise for every
> encoding.
>
> > This is quite inefficient in general.
>
> Why inefficient? I think that readlines() will be much slower, not
> only more time consuming.

It depends on which is more costly: reading the whole file (cost
depends on size of file) or reading chunks and splitting into lines
(cost depends on how well you guess at chunk size). If the lines are
all *precisely* the same number of bytes each, you can pick a chunk
size and step backwards with near-perfect efficiency (it's still
likely to be less efficient than reading a file forwards, on most file
systems, but it'll be close); but if you have to guess, adjust, and
keep going, then you lose efficiency there.

I don't think this is necessary in the stdlib. If anything, it might
be good on PyPI, but I for one have literally never wanted this.

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On Sat, 23 Apr 2022 at 23:00, Chris Angelico <rosuav@gmail.com> wrote:
> > > This is quite inefficient in general.
> >
> > Why inefficient? I think that readlines() will be much slower, not
> > only more time consuming.
>
> It depends on which is more costly: reading the whole file (cost
> depends on size of file) or reading chunks and splitting into lines
> (cost depends on how well you guess at chunk size). If the lines are
> all *precisely* the same number of bytes each, you can pick a chunk
> size and step backwards with near-perfect efficiency (it's still
> likely to be less efficient than reading a file forwards, on most file
> systems, but it'll be close); but if you have to guess, adjust, and
> keep going, then you lose efficiency there.

Emh, why chunks? My function simply reads byte by byte and compares it to
b"\n". When it finds it, it stops and does a readline():

import os

def tail(filepath):
    """
    @author Marco Sulla
    @date May 31, 2016
    """

    try:
        filepath.is_file
        fp = str(filepath)
    except AttributeError:
        fp = filepath

    with open(fp, "rb") as f:
        size = os.stat(fp).st_size
        start_pos = 0 if size - 1 < 0 else size - 1

        if start_pos != 0:
            f.seek(start_pos)
            char = f.read(1)

            if char == b"\n":
                start_pos -= 1
                f.seek(start_pos)

        if start_pos == 0:
            f.seek(start_pos)
        else:
            for pos in range(start_pos, -1, -1):
                f.seek(pos)

                char = f.read(1)

                if char == b"\n":
                    break

        return f.readline()

This is only for one line and in utf8, but it can be generalised.
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On Sun, 24 Apr 2022 at 07:13, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
>
> On Sat, 23 Apr 2022 at 23:00, Chris Angelico <rosuav@gmail.com> wrote:
> > > > This is quite inefficient in general.
> > >
> > > Why inefficient? I think that readlines() will be much slower, not
> > > only more time consuming.
> >
> > It depends on which is more costly: reading the whole file (cost
> > depends on size of file) or reading chunks and splitting into lines
> > (cost depends on how well you guess at chunk size). If the lines are
> > all *precisely* the same number of bytes each, you can pick a chunk
> > size and step backwards with near-perfect efficiency (it's still
> > likely to be less efficient than reading a file forwards, on most file
> > systems, but it'll be close); but if you have to guess, adjust, and
> > keep going, then you lose efficiency there.
>
> Emh, why chunks? My function simply reads byte per byte and compares it to b"\n". When it find it, it stops and do a readline():
>
> import os
>
> def tail(filepath):
>     """
>     @author Marco Sulla
>     @date May 31, 2016
>     """
>
>     try:
>         filepath.is_file
>         fp = str(filepath)
>     except AttributeError:
>         fp = filepath
>
>     with open(fp, "rb") as f:
>         size = os.stat(fp).st_size
>         start_pos = 0 if size - 1 < 0 else size - 1
>
>         if start_pos != 0:
>             f.seek(start_pos)
>             char = f.read(1)
>
>             if char == b"\n":
>                 start_pos -= 1
>                 f.seek(start_pos)
>
>         if start_pos == 0:
>             f.seek(start_pos)
>         else:
>             for pos in range(start_pos, -1, -1):
>                 f.seek(pos)
>
>                 char = f.read(1)
>
>                 if char == b"\n":
>                     break
>
>         return f.readline()
>
> This is only for one line and in utf8, but it can be generalised.
>

Ah. Well, then, THAT is why it's inefficient: you're seeking back one
single byte at a time, then reading forwards. That is NOT going to
play nicely with file systems or buffers.

Compare reading line by line over the file with readlines() and you'll
see how abysmal this is.

If you really only need one line (which isn't what your original post
suggested), I would recommend starting with a chunk that is likely to
include a full line, and expanding the chunk until you have that
newline. Much more efficient than one byte at a time.

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On 2022-04-24 04:57:20 +1000, Chris Angelico wrote:
> On Sun, 24 Apr 2022 at 04:37, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
> > What about introducing a method for text streams that reads the lines
> > from the bottom? Java has also a ReversedLinesFileReader with Apache
> > Commons IO.
>
> It's fundamentally difficult to get precise. In general, there are
> three steps to reading the last N lines of a file:
>
> 1) Find out the size of the file (currently, if it's being grown)
> 2) Seek to the end of the file, minus some threshold that you hope
> will contain a number of lines
> 3) Read from there to the end of the file, split it into lines, and
> keep the last N
[...]
> This is quite inefficient in general. It would be far FAR easier to do
> this instead:
>
> 1) Read the entire file and decode bytes to text
> 2) Split into lines
> 3) Iterate backwards over the lines

Which one is more efficient depends very much on the size of the file.
For a file of a few kilobytes, the second solution is probably more
efficient. But for a few gigabytes, that's almost certainly not the
case.

> Tada! Done. And in Python, quite easy. The downside, of course, is
> that you have to store the entire file in memory.

Not just memory. You have to read the whole file in the first place. Which is
hardly efficient if you only need a tiny fraction.

> Personally, unless the file is tremendously large and I know for sure
> that I'm not going to end up iterating over it all, I would pay the
> memory price.

Me, too. Problem with a library function (as Marco proposes) is that you
don't know how it will be used.

hp

--
_ | Peter J. Holzer | Story must make more sense than reality.
|_|_) | |
| | | hjp@hjp.at | -- Charles Stross, "Creative writing
__/ | http://www.hjp.at/ | challenge!"
Re: tail [ In reply to ]
On 24/04/2022 09.15, Chris Angelico wrote:
> On Sun, 24 Apr 2022 at 07:13, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
>>
>> On Sat, 23 Apr 2022 at 23:00, Chris Angelico <rosuav@gmail.com> wrote:
>>>>> This is quite inefficient in general.
>>>>
>>>> Why inefficient? I think that readlines() will be much slower, not
>>>> only more time consuming.
>>>
>>> It depends on which is more costly: reading the whole file (cost
>>> depends on size of file) or reading chunks and splitting into lines
>>> (cost depends on how well you guess at chunk size). If the lines are
>>> all *precisely* the same number of bytes each, you can pick a chunk
>>> size and step backwards with near-perfect efficiency (it's still
>>> likely to be less efficient than reading a file forwards, on most file
>>> systems, but it'll be close); but if you have to guess, adjust, and
>>> keep going, then you lose efficiency there.
>>
>> Emh, why chunks? My function simply reads byte per byte and compares it to b"\n". When it find it, it stops and do a readline():
...

> Ah. Well, then, THAT is why it's inefficient: you're seeking back one
> single byte at a time, then reading forwards. That is NOT going to
> play nicely with file systems or buffers.
>
> Compare reading line by line over the file with readlines() and you'll
> see how abysmal this is.
>
> If you really only need one line (which isn't what your original post
> suggested), I would recommend starting with a chunk that is likely to
> include a full line, and expanding the chunk until you have that
> newline. Much more efficient than one byte at a time.


Disagreeing with @Chris in the sense that I use tail very frequently,
and usually in the context of server logs - but I'm talking about the
Linux implementation, not Python code!

Agree with @Chris' assessment of the (in)efficiency. It is more likely
than not, that you will have a good idea of the length of each line.
Even if the line-length is highly-variable (thinking of some of my
applications of the Python logging module!), one can still 'take a stab
at it' (a "thumb suck" as an engineer-colleague used to say - apparently
not an electrical engineer!) by remembering that lines exceeding
80-characters become less readable and thus have likely/hopefully been
split into two.

Thus,

N*(80+p)

where N is the number of lines desired and p is a reasonable
'safety'/over-estimation percentage, would give a good chunk size.
Binar-ily grab that much of the end of the file, split on line-ending,
and take the last N elements from that list. (with 'recovery code' just
in case the 'chunk' wasn't large-enough).
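
A sketch of that estimate, with the 'recovery code' as a simple
doubling retry (the function name and the default over-estimation are
illustrative; it works on bytes):

import os

def tail_estimate(path, n, p=0.25):
    chunk = int(n * 80 * (1 + p))  # the N*(80+p) guess
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        while True:
            pos = max(size - chunk, 0)
            f.seek(pos)
            lines = f.read().split(b"\n")
            if lines and lines[-1] == b"":
                lines.pop()  # ignore the trailing newline
            # n+1 pieces are needed: the first piece may be partial
            if len(lines) > n or pos == 0:
                return lines[-n:]
            chunk *= 2  # 'recovery code': the guess was too small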

Adding to the efficiency (of the algorithm, but not the dev-time),
consider that shorter files are likely to be more easily--handled by
reading serially from the beginning. To side-step @Chris' criticism, use
a generator to produce the individual lines (lazy evaluation and low
storage requirement) and feed them into a circular-queue which is
limited to N-entries. QED, as fast as the machine's I/O, and undemanding
of storage-space!
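
That serial flavour is nearly free to write, since a file object is
already a lazy line generator and collections.deque provides the
bounded circular queue (a sketch, UTF-8 assumed):

from collections import deque

def tail_serial(path, n):
    with open(path, encoding="utf-8") as f:
        return list(deque(f, maxlen=n))  # keeps only the last n lines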

Running a few timing trials should reveal the 'sweet spot', at which one
algorithm takes-over from the other!

NB quite a few of IBM's (extensively researched) algorithms which formed
utility program[me]s on mainframes, made similar such algorithmic
choices, in the pursuit of efficiencies.
--
Regards,
=dn
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On 24Apr2022 07:15, Chris Angelico <rosuav@gmail.com> wrote:
>On Sun, 24 Apr 2022 at 07:13, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
>> Emh, why chunks? My function simply reads byte per byte and compares
>> it to b"\n". When it find it, it stops and do a readline():
[...]
>> This is only for one line and in utf8, but it can be generalised.

For some encodings that generalisation might be hard. But mostly, yes.

>Ah. Well, then, THAT is why it's inefficient: you're seeking back one
>single byte at a time, then reading forwards. That is NOT going to
>play nicely with file systems or buffers.

An approach I think you both may have missed: mmap the file and use
mmap.rfind(b'\n') to locate line delimiters.
https://docs.python.org/3/library/mmap.html#mmap.mmap.rfind

Avoids sucking the whole file into memory in the usual sense; instead
the file is paged in as needed. Far more efficient than a seek/read
single byte approach.

If the file's growing you can do this to start with, then do a normal
file open from your end point to follow accruing text. (Or reuse the
descriptor you used for the mmap, but using its read().)
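
A minimal sketch of that for a single line (assumes a non-empty file
and an ASCII-compatible encoding such as UTF-8):

import mmap

def last_line(path):
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            end = len(mm)
            if mm[end - 1:end] == b"\n":
                end -= 1  # skip the trailing newline
            # rfind returns -1 when there's no newline, so start becomes 0
            start = mm.rfind(b"\n", 0, end) + 1
            return mm[start:end]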

Cheers,
Cameron Simpson <cs@cskk.id.au>
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On Sun, 24 Apr 2022 at 08:03, Peter J. Holzer <hjp-python@hjp.at> wrote:
>
> On 2022-04-24 04:57:20 +1000, Chris Angelico wrote:
> > On Sun, 24 Apr 2022 at 04:37, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
> > > What about introducing a method for text streams that reads the lines
> > > from the bottom? Java has also a ReversedLinesFileReader with Apache
> > > Commons IO.
> >
> > It's fundamentally difficult to get precise. In general, there are
> > three steps to reading the last N lines of a file:
> >
> > 1) Find out the size of the file (currently, if it's being grown)
> > 2) Seek to the end of the file, minus some threshold that you hope
> > will contain a number of lines
> > 3) Read from there to the end of the file, split it into lines, and
> > keep the last N
> [...]
> > This is quite inefficient in general. It would be far FAR easier to do
> > this instead:
> >
> > 1) Read the entire file and decode bytes to text
> > 2) Split into lines
> > 3) Iterate backwards over the lines
>
> Which one is more efficient depends very much on the size of the file.
> For a file of a few kilobytes, the second solution is probably more
> efficient. But for a few gigabytes, that's almost certainly not the
> case.

Yeah. I said "easier", not necessarily more efficient. Which is more
efficient is a virtually unanswerable question (will you need to
iterate over the whole file or stop part way? Is the file stored
contiguously? Can you memory map it in some way?), so it's going to
depend a lot on your use-case.

> > Tada! Done. And in Python, quite easy. The downside, of course, is
> > that you have to store the entire file in memory.
>
> Not just memory. You have to read the whole file in the first place. Which is
> hardly efficient if you only need a tiny fraction.

Right - if that's the case, then the chunked form, even though it's
harder, would be worth doing.

> > Personally, unless the file is tremendously large and I know for sure
> > that I'm not going to end up iterating over it all, I would pay the
> > memory price.
>
> Me, too. Problem with a library function (as Marco proposes) is that you
> don't know how it will be used.
>

Yup. And there may be other options worth considering, like
maintaining an index (a bunch of "line 142857 is at byte position
3141592" entries) which would allow random access... but at some
point, if your file is that big, you probably shouldn't be storing it
as a file of lines of text. Use a database instead.
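
Building such an index is a single sequential pass (a sketch; the name
is hypothetical):

def build_line_index(path):
    # offsets[i] is the byte position at which line i starts
    offsets = [0]
    with open(path, "rb") as f:
        for line in f:
            offsets.append(offsets[-1] + len(line))
    return offsets

After that, f.seek(offsets[i]) followed by readline() gives random
access by line number.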

Reading a text file backwards by lines is, by definition, hard. Every
file format I know of that involves starting at the end of the file is
defined in binary, so you can actually seek, and is usually defined
with fixed-size structures (so you just go "read the last 768 bytes of
the file" or something).

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On Sun, 24 Apr 2022 at 08:06, dn <PythonList@danceswithmice.info> wrote:
>
> On 24/04/2022 09.15, Chris Angelico wrote:
> > On Sun, 24 Apr 2022 at 07:13, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
> >>
> >> On Sat, 23 Apr 2022 at 23:00, Chris Angelico <rosuav@gmail.com> wrote:
> >>>>> This is quite inefficient in general.
> >>>>
> >>>> Why inefficient? I think that readlines() will be much slower, not
> >>>> only more time consuming.
> >>>
> >>> It depends on which is more costly: reading the whole file (cost
> >>> depends on size of file) or reading chunks and splitting into lines
> >>> (cost depends on how well you guess at chunk size). If the lines are
> >>> all *precisely* the same number of bytes each, you can pick a chunk
> >>> size and step backwards with near-perfect efficiency (it's still
> >>> likely to be less efficient than reading a file forwards, on most file
> >>> systems, but it'll be close); but if you have to guess, adjust, and
> >>> keep going, then you lose efficiency there.
> >>
> >> Emh, why chunks? My function simply reads byte per byte and compares it to b"\n". When it find it, it stops and do a readline():
> ...
>
> > Ah. Well, then, THAT is why it's inefficient: you're seeking back one
> > single byte at a time, then reading forwards. That is NOT going to
> > play nicely with file systems or buffers.
> >
> > Compare reading line by line over the file with readlines() and you'll
> > see how abysmal this is.
> >
> > If you really only need one line (which isn't what your original post
> > suggested), I would recommend starting with a chunk that is likely to
> > include a full line, and expanding the chunk until you have that
> > newline. Much more efficient than one byte at a time.
>
>
> Disagreeing with @Chris in the sense that I use tail very frequently,
> and usually in the context of server logs - but I'm talking about the
> Linux implementation, not Python code!

tail(1) doesn't read a single byte at a time. It works in chunks,
more-or-less the way I described. (That's where I borrowed the
technique from.) It finds one block of lines, and displays those; it
doesn't iterate backwards.

(By the way, the implementation of "tail -f" is actually less
complicated than you might think, as long as inotify is available.
That's been much more useful to me than reading a file backwards.)

> Agree with @Chris' assessment of the (in)efficiency. It is more likely
> than not, that you will have a good idea of the length of each line.
> Even if the line-length is highly-variable (thinking of some of my
> applications of the Python logging module!), one can still 'take a stab
> at it' (a "thumb suck" as an engineer-colleague used to say - apparently
> not an electrical engineer!) by remembering that lines exceeding
> 80-characters become less readable and thus have likely/hopefully been
> split into two.
>
> Thus,
>
> N*(80+p)
>
> where N is the number of lines desired and p is a reasonable
> 'safety'/over-estimation percentage, would give a good chunk size.
> Binar-ily grab that much of the end of the file, split on line-ending,
> and take the last N elements from that list. (with 'recovery code' just
> in case the 'chunk' wasn't large-enough).

Yup, that's the broad idea of the chunked read. If you know how many
lines you're going to need, that's not too bad. If you need to iterate
backwards over the file (as the original question suggested), that
gets complicated fast.

> Adding to the efficiency (of the algorithm, but not the dev-time),
> consider that shorter files are likely to be more easily--handled by
> reading serially from the beginning. To side-step @Chris' criticism, use
> a generator to produce the individual lines (lazy evaluation and low
> storage requirement) and feed them into a circular-queue which is
> limited to N-entries. QED, as fast as the machine's I/O, and undemanding
> of storage-space!

Still needs to read the whole file though.

> Running a few timing trials should reveal the 'sweet spot', at which one
> algorithm takes-over from the other!

Unfortunately, the sweet spot will depend on a lot of things other
than just the file size. Is it stored contiguously? Is it on hard
drive or SSD? Can it be read from the disk cache? How frequently do
the chunk size guesses fail, and how often do they grab more than is
necessary?

It's basically unknowable, so the best you can do is judge your own
app, pick one, and go with it.

> NB quite a few of IBM's (extensively researched) algorithms which formed
> utility program[me]s on mainframes, made similar such algorithmic
> choices, in the pursuit of efficiencies.

Indeed. Just make a choice and run with it, and accept that there will
be situations where it'll be inefficient.

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On Sun, 24 Apr 2022 at 08:18, Cameron Simpson <cs@cskk.id.au> wrote:
>
> On 24Apr2022 07:15, Chris Angelico <rosuav@gmail.com> wrote:
> >On Sun, 24 Apr 2022 at 07:13, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
> >> Emh, why chunks? My function simply reads byte per byte and compares
> >> it to b"\n". When it find it, it stops and do a readline():
> [...]
> >> This is only for one line and in utf8, but it can be generalised.
>
> For some encodings that generalisation might be hard. But mostly, yes.
>
> >Ah. Well, then, THAT is why it's inefficient: you're seeking back one
> >single byte at a time, then reading forwards. That is NOT going to
> >play nicely with file systems or buffers.
>
> An approach I think you both may have missed: mmap the file and use
> mmap.rfind(b'\n') to locate line delimiters.
> https://docs.python.org/3/library/mmap.html#mmap.mmap.rfind

Yeah, I made a vague allusion to use of mmap, but didn't elaborate
because I actually have zero idea of how efficient this would be.
Would it be functionally equivalent to the chunking, but with the
chunk size defined by the system as whatever's most optimal? It would
need to be tested.

I've never used mmap for this kind of job, so it's not something I'm
comfortable predicting the performance of.

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On 24Apr2022 08:21, Chris Angelico <rosuav@gmail.com> wrote:
>On Sun, 24 Apr 2022 at 08:18, Cameron Simpson <cs@cskk.id.au> wrote:
>> An approach I think you both may have missed: mmap the file and use
>> mmap.rfind(b'\n') to locate line delimiters.
>> https://docs.python.org/3/library/mmap.html#mmap.mmap.rfind
>
>Yeah, I made a vague allusion to use of mmap, but didn't elaborate
>because I actually have zero idea of how efficient this would be.
>Would it be functionally equivalent to the chunking, but with the
>chunk size defined by the system as whatever's most optimal? It would
>need to be tested.

True. I'd expect better than single byte seek/read though.

>I've never used mmap for this kind of job, so it's not something I'm
>comfortable predicting the performance of.

Fair.

But it would be much easier to read code.

Cheers,
Cameron Simpson <cs@cskk.id.au>
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On Sun, 24 Apr 2022 at 10:04, Cameron Simpson <cs@cskk.id.au> wrote:
>
> On 24Apr2022 08:21, Chris Angelico <rosuav@gmail.com> wrote:
> >On Sun, 24 Apr 2022 at 08:18, Cameron Simpson <cs@cskk.id.au> wrote:
> >> An approach I think you both may have missed: mmap the file and use
> >> mmap.rfind(b'\n') to locate line delimiters.
> >> https://docs.python.org/3/library/mmap.html#mmap.mmap.rfind
> >
> >Yeah, I made a vague allusion to use of mmap, but didn't elaborate
> >because I actually have zero idea of how efficient this would be.
> >Would it be functionally equivalent to the chunking, but with the
> >chunk size defined by the system as whatever's most optimal? It would
> >need to be tested.
>
> True. I'd expect better than single byte seek/read though.
>

Yeah, I think pretty much *anything* would be better than single byte seeks.

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
dn wrote on 24/04/2022 at 0:04:
> Disagreeing with @Chris in the sense that I use tail very frequently,
> and usually in the context of server logs - but I'm talking about the
> Linux implementation, not Python code!
If I understand Marco correctly, what he wants is to read the lines from
bottom to top, i.e. tac instead of tail, despite his subject.
I use tail very frequently too, but tac is something I almost never use.

--
"Peace cannot be kept by force. It can only be achieved through understanding."
-- Albert Einstein

--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On 23/04/2022 at 20:57, Chris Angelico wrote:
> On Sun, 24 Apr 2022 at 04:37, Marco Sulla<Marco.Sulla.Python@gmail.com> wrote:
>> What about introducing a method for text streams that reads the lines
>> from the bottom? Java has also a ReversedLinesFileReader with Apache
>> Commons IO.
>
> 1) Read the entire file and decode bytes to text
> 2) Split into lines
> 3) Iterate backwards over the lines
>
> Tada! Done. And in Python, quite easy. The downside, of course, is
> that you have to store the entire file in memory.

Why not just do:

tail = collections.deque(text_stream, maxlen = nr_of_lines)
tail.reverse()
...

--
Antoon Pardon

--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On Sun, 24 Apr 2022 at 21:11, Antoon Pardon <antoon.pardon@vub.be> wrote:
>
>
>
> On 23/04/2022 at 20:57, Chris Angelico wrote:
> > On Sun, 24 Apr 2022 at 04:37, Marco Sulla<Marco.Sulla.Python@gmail.com> wrote:
> >> What about introducing a method for text streams that reads the lines
> >> from the bottom? Java has also a ReversedLinesFileReader with Apache
> >> Commons IO.
> >
> > 1) Read the entire file and decode bytes to text
> > 2) Split into lines
> > 3) Iterate backwards over the lines
> >
> > Tada! Done. And in Python, quite easy. The downside, of course, is
> > that you have to store the entire file in memory.
>
> Why not just do:
>
> tail = collections.deque(text_stream, maxlen = nr_of_lines)
> tail.reverse()
> ...
>

You still need to read the entire file, and you also restrict the max
line count, so you can't iterate this to take the next block of lines.

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
I have been getting confused by how many interpretations and conditions for chasing tail people seem to be talking about.
A fairly normal task is to want to see just the last N lines of a text-based file. 
A variant is the "tail -f" command from UNIX that continues to follow a growing file, often into a pipeline for further processing.
The variant now being mentioned is a sort of "reverse" that has nothing to do with that kind of "tail" except if the implementation is to read the file backwards. A very straightforward way to reverse a file takes perhaps two lines of Python code by reading forward to fill a list with lines of text then using an index that reverses it.
The issues being considered are memory and whether to read the entire file.
I would think reading a file forwards in big chunks to be far faster and simpler than various schemes mentioned here for reading it backwards. It only makes sense if the goal is not reversal of all the contents.
Also noted is that memory use can be minimized in various ways so that only the final results are kept around. And if you really want more random access to files that you view as being organized as lines of text with a fixed or maximum width, then storing in some database format, perhaps indexed, may be a way to go.
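
The straightforward reversal mentioned above really is about two lines
(UTF-8 assumed; the file name is hypothetical, and the whole file ends
up in memory):

with open("some.log", encoding="utf-8") as f:
    for line in reversed(f.read().splitlines()):
        print(line)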

A time stamped log file is a good example.
So which problem is really supposed to be solved for the original question?



-----Original Message-----
From: Roel Schroeven <roel@roelschroeven.net>
To: python-list@python.org
Sent: Sun, Apr 24, 2022 5:19 am
Subject: Re: tail

dn wrote on 24/04/2022 at 0:04:
> Disagreeing with @Chris in the sense that I use tail very frequently,
> and usually in the context of server logs - but I'm talking about the
> Linux implementation, not Python code!
If I understand Marco correctly, what he want is to read the lines from
bottom to top, i.e. tac instead of tail, despite his subject.
I use tail very frequently too, but tac is something I almost never use.

--
"Peace cannot be kept by force. It can only be achieved through understanding."
        -- Albert Einstein

--
https://mail.python.org/mailman/listinfo/python-list
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On Sat, 23 Apr 2022 at 23:18, Chris Angelico <rosuav@gmail.com> wrote:

> Ah. Well, then, THAT is why it's inefficient: you're seeking back one
> single byte at a time, then reading forwards. That is NOT going to
> play nicely with file systems or buffers.
>
> Compare reading line by line over the file with readlines() and you'll
> see how abysmal this is.
>
> If you really only need one line (which isn't what your original post
> suggested), I would recommend starting with a chunk that is likely to
> include a full line, and expanding the chunk until you have that
> newline. Much more efficient than one byte at a time.
>

Well, I would like to have a sort of tail, so it generalises to more than
one line. But I think that once you have a good algorithm for one line, you
can repeat it N times.

I understand that you can read a chunk instead of a single byte, so when
the newline is found you can return all the cached chunks concatenated. But
will this make the search for the start of the line faster? I suppose you
always have to read byte by byte (or more, if you're using UTF-16 etc.) and
see if there's a newline.
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On Sun, 24 Apr 2022 at 00:19, Cameron Simpson <cs@cskk.id.au> wrote:

> An approach I think you both may have missed: mmap the file and use
> mmap.rfind(b'\n') to locate line delimiters.
> https://docs.python.org/3/library/mmap.html#mmap.mmap.rfind
>

Ah, I played very little with mmap, I didn't know about this. So I suppose
you can locate the newline and at that point read the line without using
chunks?
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On Mon, 25 Apr 2022 at 01:47, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
>
>
>
> On Sat, 23 Apr 2022 at 23:18, Chris Angelico <rosuav@gmail.com> wrote:
>>
>> Ah. Well, then, THAT is why it's inefficient: you're seeking back one
>> single byte at a time, then reading forwards. That is NOT going to
>> play nicely with file systems or buffers.
>>
>> Compare reading line by line over the file with readlines() and you'll
>> see how abysmal this is.
>>
>> If you really only need one line (which isn't what your original post
>> suggested), I would recommend starting with a chunk that is likely to
>> include a full line, and expanding the chunk until you have that
>> newline. Much more efficient than one byte at a time.
>
>
> Well, I would like to have a sort of tail, so to generalise to more than 1 line. But I think that once you have a good algorithm for one line, you can repeat it N times.
>

Not always. If you know you want to read 5 lines, it's much more
efficient than reading 1 line, then going back to the file, five
times. Disk reads are the costliest part, with the possible exception
of memory usage (but usually only because it can cause additional disk
*writes*).

> I understand that you can read a chunk instead of a single byte, so when the newline is found you can return all the cached chunks concatenated. But will this make the search of the start of the line faster? I suppose you have always to read byte by byte (or more, if you're using urf16 etc) and see if there's a newline.
>

Massively massively faster. Try it. Especially, try it on an
artificially slow file system, so you can see what it costs.

But you can't rely on any backwards reads unless you know for sure
that the encoding supports this. UTF-8 does (you have to scan
backwards for a start byte), UTF-16 does (work with pairs of bytes and
check for surrogates), and fixed-width encodings do, but otherwise,
you won't necessarily know when you've found a valid start point. So
any reverse-read algorithm is going to be restricted to specific
encodings.
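
The backwards scan for a UTF-8 start byte is tiny (a sketch;
continuation bytes have the form 0b10xxxxxx):

def sync_to_char_start(buf, pos):
    # step back while buf[pos] is a UTF-8 continuation byte
    while pos > 0 and (buf[pos] & 0xC0) == 0x80:
        pos -= 1
    return pos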

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On Sun, 24 Apr 2022 at 11:21, Roel Schroeven <roel@roelschroeven.net> wrote:

> dn wrote on 24/04/2022 at 0:04:
> > Disagreeing with @Chris in the sense that I use tail very frequently,
> > and usually in the context of server logs - but I'm talking about the
> > Linux implementation, not Python code!
> If I understand Marco correctly, what he want is to read the lines from
> bottom to top, i.e. tac instead of tail, despite his subject.
> I use tail very frequently too, but tac is something I almost never use.
>

Well, the inverse reader is only a secondary suggestion. I suppose a tail
is much more useful.
--
https://mail.python.org/mailman/listinfo/python-list
RE: tail [ In reply to ]
> -----Original Message-----
> From: dn <PythonList@DancesWithMice.info>
> Sent: Saturday, April 23, 2022 6:05 PM
> To: python-list@python.org
> Subject: Re: tail
>
<Snipped>
> NB quite a few of IBM's (extensively researched) algorithms which formed utility
> program[me]s on mainframes, made similar such algorithmic choices, in the
> pursuit of efficiencies.

WRT the mentioned IBM utility program[me]s, the non-Posix part of the IBM mainframe file system has always provided record-managed storage since the late 1960's (as opposed to the byte-managed storage of *ix systems) so searching for line endings was (and is) irrelevant and unnecessary in that environment. That operating system also provides basic "kernel-level" read-backwards API's for the record-managed file system, so there was never any need to build reverse-read into your code for that environment.

The byte-managed file storage used by the Posix kernel running under the actually-in-charge IBM mainframe operating system is, of course, subject to the same constraints and (in)efficiencies discussed in this thread.

Peter
--


--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On Sun, 24 Apr 2022 12:21:36 -0400, <pjfarley3@earthlink.net> declaimed the
following:

>
>WRT the mentioned IBM utility program[me]s, the non-Posix part of the IBM mainframe file system has always provided record-managed storage since the late 1960's (as opposed to the byte-managed storage of *ix systems) so searching for line endings was (and is) irrelevant and unnecessary in that environment. That operating system also provides basic "kernel-level" read-backwards API's for the record-managed file system, so there was never any need to build reverse-read into your code for that environment.
>

IBM wasn't the only one... the Xerox Sigma running CP/V, by default for
text files (those created using a text editor), used numeric ISAM keys (as
record numbers -- which is how their FORTRAN IV compiler did random access
I/O without requiring fixed-length records). The system supported three access
methods: consecutive (similar to UNIX "stream" files, for files that didn't
require editing, these saved disk space as the ISAM headers could be
disposed of), the aforesaid keyed, and "random" (on this system, "random"
meant the ONLY thing the OS did was know where the file was on disk --
files had to be contiguous and pre-allocated, and what data was in the file
was strictly up to the application to manage).

VAX/VMS had lots of different file structures managed by the RMS system
services. The default for FORTRAN text files was a segmented model, making
use of chunks of around 250 bytes [it has been years and I no longer have
the documentation] in which the start of each chunk had a code for "first
chunk", "last chunk", "intermediate chunk" (and maybe length of data in the
chunk). A record that fit completely within one chunk would have both
"first" and "last" codes set (intermediate chunks have neither code). One
had to go out of their way to create a "stream" file in DEC FORTRAN 77
(open the file with CARRIAGECONTROL=CARRIAGERETURN). Other languages on the
OS had different default file structures, but RMS would handle all of them
transparently.


--
Wulfraed Dennis Lee Bieber AF6VN
wlfraed@ix.netcom.com http://wlfraed.microdiversity.freeddns.org/
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On 25/04/2022 04.21, pjfarley3@earthlink.net wrote:
>> -----Original Message-----
>> From: dn <PythonList@DancesWithMice.info>
>> Sent: Saturday, April 23, 2022 6:05 PM
>> To: python-list@python.org
>> Subject: Re: tail
>>
> <Snipped>
>> NB quite a few of IBM's (extensively researched) algorithms which formed utility
>> program[me]s on mainframes, made similar such algorithmic choices, in the
>> pursuit of efficiencies.
>
> WRT the mentioned IBM utility program[me]s, the non-Posix part of the IBM mainframe file system has always provided record-managed storage since the late 1960's (as opposed to the byte-managed storage of *ix systems) so searching for line endings was (and is) irrelevant and unnecessary in that environment. That operating system also provides basic "kernel-level" read-backwards API's for the record-managed file system, so there was never any need to build reverse-read into your code for that environment.
>
> The byte-managed file storage used by the Posix kernel running under the actually-in-charge IBM mainframe operating system is, of course, subject to the same constraints and (in)efficiencies discussed in this thread.


Thanks for the clarification (and @wlfraed's addition).

Apologies if misunderstood. The above comment was about utilities which
would choose between algorithms, based on some rapid, initial,
assessment of the task. It was not about 'tail' utility/ies specifically
- and I don't recall using a 'tail' on mainframes, but...

Thus, the observation that the OP may find that a serial,
read-the-entire-file approach is faster in some situations (relatively
short files). Conversely, with longer files, some sort of 'last chunk'
approach would be superior.
--
Regards,
=dn
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On 25Apr2022 08:08, DL Neil <PythonList@DancesWithMice.info> wrote:
>Thus, the observation that the OP may find that a serial,
>read-the-entire-file approach is faster is some situations (relatively
>short files). Conversely, with longer files, some sort of 'last chunk'
>approach would be superior.

If you make the chunk big enough, they're the same algorithm!

It sounds silly, but if you make your chunk size as big as your threshold
for "this file is too big to read serially in its entirety", you may as
well just write the "last chunk" flavour.

Cheers,
Cameron Simpson <cs@cskk.id.au>
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On 26/04/2022 10.54, Cameron Simpson wrote:
> On 25Apr2022 08:08, DL Neil <PythonList@DancesWithMice.info> wrote:
>> Thus, the observation that the OP may find that a serial,
>> read-the-entire-file approach is faster is some situations (relatively
>> short files). Conversely, with longer files, some sort of 'last chunk'
>> approach would be superior.
>
> If you make the chunk big enough, they're the same algorithm!
>
> It sound silly, but if you make your chunk size as big as your threshold
> for "this file is too big to read serially in its entirety, you may as
> well just write the "last chunk" flavour.


I like it!

Yes: memory-limited mainframes are in-the-past, our thinking has (or
needs to have) moved-on, and memory is so much 'cheaper' and thus
available for use!

That said, it depends on file-size and what else is going-on in the
machine/total-application. (and that's 'probably not much' as far as
resource-mix is concerned!) However, I can't speak for the OP, the
reason behind the post, and/or his circumstances...
--
Regards,
=dn
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
Is something like this OK?

import os

def tail(f):
    chunk_size = 100
    size = os.stat(f.fileno()).st_size

    positions = iter(range(size, -1, -chunk_size))
    next(positions)

    chunk_line_pos = -1
    pos = 0

    for pos in positions:
        f.seek(pos)
        chars = f.read(chunk_size)
        chunk_line_pos = chars.rfind(b"\n")

        if chunk_line_pos != -1:
            break

    if chunk_line_pos == -1:
        nbytes = pos
        pos = 0
        f.seek(pos)
        chars = f.read(nbytes)
        chunk_line_pos = chars.rfind(b"\n")

    if chunk_line_pos == -1:
        line_pos = pos
    else:
        line_pos = pos + chunk_line_pos + 1

    f.seek(line_pos)

    return f.readline()

This is simply for one line and for utf8.
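
(Hypothetical usage, since the function expects a file already opened
in binary mode:)

with open("some.log", "rb") as f:
    print(tail(f))
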
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On 01May2022 18:55, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
>Something like this is OK?
[...]
>def tail(f):
>    chunk_size = 100
>    size = os.stat(f.fileno()).st_size

I think you want os.fstat().

>    positions = iter(range(size, -1, -chunk_size))
>    next(positions)

I was wondering about the iter, but this makes sense. Alternatively you
could put a range check in the for-loop.

>    chunk_line_pos = -1
>    pos = 0
>
>    for pos in positions:
>        f.seek(pos)
>        chars = f.read(chunk_size)
>        chunk_line_pos = chars.rfind(b"\n")
>
>        if chunk_line_pos != -1:
>            break

Normal text files _end_ in a newline. I'd expect this to stop immediately
at the end of the file.

>    if chunk_line_pos == -1:
>        nbytes = pos
>        pos = 0
>        f.seek(pos)
>        chars = f.read(nbytes)
>        chunk_line_pos = chars.rfind(b"\n")

I presume this is because unless you're very lucky, 0 will not be a
position in the range(). I'd be inclined to avoid duplicating this code
and special case and instead maybe make the range unbounded and do
something like this:

    if pos < 0:
        pos = 0
    ... seek/read/etc ...
    if pos == 0:
        break

around the for-loop body.

>    if chunk_line_pos == -1:
>        line_pos = pos
>    else:
>        line_pos = pos + chunk_line_pos + 1
>    f.seek(line_pos)
>    return f.readline()
>
>This is simply for one line and for utf8.

And anything else where a newline is just an ASCII newline byte (10) and
can't be mistaken otherwise. So also ASCII and all the ISO8859-x single
byte encodings. But as Chris has mentioned, not for other encodings.

Seems sane. I haven't tried to run it.

Cheers,
Cameron Simpson <cs@cskk.id.au>
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On Sun, May 1, 2022 at 3:19 PM Cameron Simpson <cs@cskk.id.au> wrote:

> On 01May2022 18:55, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
> >Something like this is OK?
>

Scanning backward for a byte == 10 in ASCII or ISO-8859 seems fine.

But what about Unicode? Are all 10 bytes newlines in Unicode encodings?

If not, and you have a huge file to reverse, it might be better to use a
temporary file.
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On Mon, 2 May 2022 at 09:19, Dan Stromberg <drsalists@gmail.com> wrote:
>
> On Sun, May 1, 2022 at 3:19 PM Cameron Simpson <cs@cskk.id.au> wrote:
>
> > On 01May2022 18:55, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
> > >Something like this is OK?
> >
>
> Scanning backward for a byte == 10 in ASCII or ISO-8859 seems fine.
>
> But what about Unicode? Are all 10 bytes newlines in Unicode encodings?

Most absolutely not. "Unicode" isn't an encoding, but of the Unicode
Transformation Formats and Universal Character Set encodings, most
don't make that guarantee:

* UTF-8 does, as mentioned. It sacrifices some efficiency and
consistency for a guarantee that ASCII characters are represented by
ASCII bytes, and ASCII bytes only ever represent ASCII characters.
* UCS-2 and UTF-16 will both represent BMP characters with two bytes.
Any character U+xx0A or U+0Axx will include an 0x0A in its
representation.
* UTF-16 will also encode anything U+000xxx0A with an 0x0A. (And I
don't think any codepoints have been allocated that would trigger
this, but UTF-16 can also use 0x0A in the high surrogate.)
* UTF-32 and UCS-4 will use 0x0A for any character U+xx0A, U+0Axx, and
U+Axxxx (though that plane has no characters on it either)

So, of all the available Unicode standard encodings, only UTF-8 makes
this guarantee.
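
A quick demonstration of the UTF-16 case (U+0A0A is a Gurmukhi
character; its UTF-16-LE encoding is two 0x0A bytes, neither of which
is a newline):

print("\u0a0a".encode("utf-16-le"))  # b'\n\n' - two 0x0a bytes, no newline
print("\u0a0a".encode("utf-8"))      # b'\xe0\xa8\x8a' - no 0x0a byte at all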

Of course, if you look at documents available on the internet, UTF-8
is the encoding used by the vast majority of them (especially if you
include seven-bit files, which can equally be considered ASCII,
ISO-8859-x, and UTF-8), so while it might only be one encoding out of
many, it's probably the most important :)

In general, you can *only* make this parsing assumption IF you know
for sure that your file's encoding is UTF-8, ISO-8859-x, some OEM
eight-bit encoding (eg Windows-125x), or one of a handful of other
compatible encodings. But it probably will be.

> If not, and you have a huge file to reverse, it might be better to use a
> temporary file.

Yeah, or an in-memory deque if you know how many lines you want.
Either way, you can read the file forwards, guaranteeing correct
decoding even of a shifted character set (where a byte value can
change in meaning based on arbitrarily distant context).

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On 01May2022 23:30, Stefan Ram <ram@zedat.fu-berlin.de> wrote:
>Dan Stromberg <drsalists@gmail.com> writes:
>>But what about Unicode? Are all 10 bytes newlines in Unicode encodings?
> It seems in UTF-8, when a value is above U+007F, it will be
> encoded with bytes that always have their high bit set.

Aye. Design feature enabling easy resync-to-char-boundary at an
arbitrary point in the file.

> But Unicode has NEL "Next Line" U+0085 and other values that
> conforming applications should recognize as line terminators.

I disagree. Maybe for printing things. But textual data records? I would
hope to end them with NL, and only NL (code 10).

Cheers,
Cameron Simpson <cs@cskk.id.au>
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On Mon, 2 May 2022 at 11:54, Cameron Simpson <cs@cskk.id.au> wrote:
>
> On 01May2022 23:30, Stefan Ram <ram@zedat.fu-berlin.de> wrote:
> >Dan Stromberg <drsalists@gmail.com> writes:
> >>But what about Unicode? Are all 10 bytes newlines in Unicode encodings?
> > It seems in UTF-8, when a value is above U+007F, it will be
> > encoded with bytes that always have their high bit set.
>
> Aye. Design festure enabling easy resync-to-char-boundary at an
> arbitrary point in the file.

Yep - and there's also a distinction between "first byte of multi-byte
character" and "continuation byte, keep scanning backwards". So you're
guaranteed to be able to resynchronize.

(If you know whether it's little-endian or big-endian, UTF-16 can also
resync like that, since "high surrogate" and "low surrogate" look
different.)

> > But Unicode has NEL "Next Line" U+0085 and other values that
> > conforming applications should recognize as line terminators.
>
> I disagree. Maybe for printing things. But textual data records? I would
> hope to end them with NL, and only NL (code 10).
>

I'm with you on that - textual data records should end with 0x0A only.
But if there are text entities in there, they should be allowed to
include any Unicode characters, potentially including other types of
whitespace.

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On Mon, 2 May 2022 at 18:31, Stefan Ram <ram@zedat.fu-berlin.de> wrote:
>
> |The Unicode standard defines a number of characters that
> |conforming applications should recognize as line terminators:[7]
> |
> |LF: Line Feed, U+000A
> |VT: Vertical Tab, U+000B
> |FF: Form Feed, U+000C
> |CR: Carriage Return, U+000D
> |CR+LF: CR (U+000D) followed by LF (U+000A)
> |NEL: Next Line, U+0085
> |LS: Line Separator, U+2028
> |PS: Paragraph Separator, U+2029
> |
> Wikipedia "Newline".

Should I suppose that other encodings may have more line ending chars?
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On Tue, 3 May 2022 at 04:38, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
>
> On Mon, 2 May 2022 at 18:31, Stefan Ram <ram@zedat.fu-berlin.de> wrote:
> >
> > |The Unicode standard defines a number of characters that
> > |conforming applications should recognize as line terminators:[7]
> > |
> > |LF: Line Feed, U+000A
> > |VT: Vertical Tab, U+000B
> > |FF: Form Feed, U+000C
> > |CR: Carriage Return, U+000D
> > |CR+LF: CR (U+000D) followed by LF (U+000A)
> > |NEL: Next Line, U+0085
> > |LS: Line Separator, U+2028
> > |PS: Paragraph Separator, U+2029
> > |
> > Wikipedia "Newline".
>
> Should I suppose that other encodings may have more line ending chars?

No, because those are Unicode characters. How they're encoded may
affect the bytes you see, but those are code point values after
decoding.
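
(A small demonstration: Python's str.splitlines() already recognises
the full set above, unlike splitting on "\n":)

s = "a\nb\x85c\u2028d"
print(s.splitlines())  # ['a', 'b', 'c', 'd']
print(s.split("\n"))   # ['a', 'b\x85c\u2028d']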

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
Ok, I suppose \n and \r are enough:

########
readline(size=-1, /)

Read and return one line from the stream. If size is specified, at
most size bytes will be read.

The line terminator is always b'\n' for binary files; for text files,
the newline argument to open() can be used to select the line
terminator(s) recognized.
########
open(file, mode='r', buffering=-1, encoding=None, errors=None,
newline=None, closefd=True, opener=None)
[...]
newline controls how universal newlines mode works (it only applies to
text mode). It can be None, '', '\n', '\r', and '\r\n'
########
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On Mon, 2 May 2022 at 00:20, Cameron Simpson <cs@cskk.id.au> wrote:
>
> On 01May2022 18:55, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
> >Something like this is OK?
> [...]
> >def tail(f):
> > chunk_size = 100
> > size = os.stat(f.fileno()).st_size
>
> I think you want os.fstat().

It's the same since Python 3.3 (os.stat() accepts a file descriptor).

> >    chunk_line_pos = -1
> >    pos = 0
> >
> >    for pos in positions:
> >        f.seek(pos)
> >        chars = f.read(chunk_size)
> >        chunk_line_pos = chars.rfind(b"\n")
> >
> >        if chunk_line_pos != -1:
> >            break
>
> Normal text file _end_ in a newline. I'd expect this to stop immediately
> at the end of the file.

I think it's correct. The last line in this case is an empty bytes.

> >    if chunk_line_pos == -1:
> >        nbytes = pos
> >        pos = 0
> >        f.seek(pos)
> >        chars = f.read(nbytes)
> >        chunk_line_pos = chars.rfind(b"\n")
>
> I presume this is because unless you're very lucky, 0 will not be a
> position in the range(). I'd be inclined to avoid duplicating this code
> and special case and instead maybe make the range unbounded and do
> something like this:
>
>     if pos < 0:
>         pos = 0
>     ... seek/read/etc ...
>     if pos == 0:
>         break
>
> around the for-loop body.

Yes, I was not very happy to duplicate the code... I have to think about it.

> Seems sane. I haven't tried to run it.

Thank you ^^
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
I have a little problem.

I tried to extend the tail function, so it can read lines from the bottom
of a file object opened in text mode.

The problem is it does not work. It gets a starting position that is lower
than expected by 3 characters. So only 2 chars of the first line are read,
and the last line is missing.

import os

_lf = "\n"
_cr = "\r"
_lf_ord = ord(_lf)

def tail(f, n=10, chunk_size=100):
    n_chunk_size = n * chunk_size
    pos = os.stat(f.fileno()).st_size
    chunk_line_pos = -1
    lines_not_found = n
    binary_mode = "b" in f.mode
    lf = _lf_ord if binary_mode else _lf

    while pos != 0:
        pos -= n_chunk_size

        if pos < 0:
            pos = 0

        f.seek(pos)
        chars = f.read(n_chunk_size)

        for i, char in enumerate(reversed(chars)):
            if char == lf:
                lines_not_found -= 1

                if lines_not_found == 0:
                    chunk_line_pos = len(chars) - i - 1
                    print(chunk_line_pos, i)
                    break

        if lines_not_found == 0:
            break

    line_pos = pos + chunk_line_pos + 1

    f.seek(line_pos)

    res = b"" if binary_mode else ""

    for i in range(n):
        res += f.readline()

    return res

Maybe the problem is 1 char != 1 byte?
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On 2022-05-06 20:21, Marco Sulla wrote:
> I have a little problem.
>
> I tried to extend the tail function, so it can read lines from the bottom
> of a file object opened in text mode.
>
> The problem is it does not work. It gets a starting position that is lower
> than the expected by 3 characters. So the first line is read only for 2
> chars, and the last line is missing.
>
> import os
>
> _lf = "\n"
> _cr = "\r"
> _lf_ord = ord(_lf)
>
> def tail(f, n=10, chunk_size=100):
>     n_chunk_size = n * chunk_size
>     pos = os.stat(f.fileno()).st_size
>     chunk_line_pos = -1
>     lines_not_found = n
>     binary_mode = "b" in f.mode
>     lf = _lf_ord if binary_mode else _lf
>
>     while pos != 0:
>         pos -= n_chunk_size
>
>         if pos < 0:
>             pos = 0
>
>         f.seek(pos)
>         chars = f.read(n_chunk_size)
>
>         for i, char in enumerate(reversed(chars)):
>             if char == lf:
>                 lines_not_found -= 1
>
>                 if lines_not_found == 0:
>                     chunk_line_pos = len(chars) - i - 1
>                     print(chunk_line_pos, i)
>                     break
>
>         if lines_not_found == 0:
>             break
>
>     line_pos = pos + chunk_line_pos + 1
>
>     f.seek(line_pos)
>
>     res = b"" if binary_mode else ""
>
>     for i in range(n):
>         res += f.readline()
>
>     return res
>
> Maybe the problem is 1 char != 1 byte?

Is the file UTF-8? That's a variable-width encoding, so are any of the
characters > U+007F?

Which OS? On Windows, it's common/normal for UTF-8 files to start with a
BOM/signature, which is 3 bytes/1 codepoint.
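
For example (hypothetical session):

>>> "".encode("utf-8-sig")   # the UTF-8 BOM/signature
b'\xef\xbb\xbf'

Those 3 bytes would explain a starting position that is off by 3.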
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On Fri, 6 May 2022 21:19:48 +0100, MRAB <python@mrabarnett.plus.com>
declaimed the following:

>Is the file UTF-8? That's a variable-width encoding, so are any of the
>characters > U+007F?
>
>Which OS? On Windows, it's common/normal for UTF-8 files to start with a
>BOM/signature, which is 3 bytes/1 codepoint.

Windows also uses <cr><lf> for the EOL marker, but Python's I/O system
condenses that to just <lf> internally (for TEXT mode) -- so using the
length of a string so read to compute a file position may be off-by-one for
each EOL in the string.

https://docs.python.org/3/tutorial/inputoutput.html#reading-and-writing-files
"""
In text mode, the default when reading is to convert platform-specific line
endings (\n on Unix, \r\n on Windows) to just \n. When writing in text
mode, the default is to convert occurrences of \n back to platform-specific
line endings. This behind-the-scenes modification to file data is fine for
text files, but will corrupt binary data like that in JPEG or EXE files. Be
very careful to use binary mode when reading and writing such files.
"""



--
Wulfraed Dennis Lee Bieber AF6VN
wlfraed@ix.netcom.com http://wlfraed.microdiversity.freeddns.org/
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On Sat, 7 May 2022 at 01:03, Dennis Lee Bieber <wlfraed@ix.netcom.com> wrote:
>
> Windows also uses <cr><lf> for the EOL marker, but Python's I/O system
> condenses that to just <lf> internally (for TEXT mode) -- so using the
> length of a string so read to compute a file position may be off-by-one for
> each EOL in the string.

So there's no way to reliably read lines in reverse in text mode using
seek and read, and the only option is readlines?
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
Marco,
I think it was made clear from the start that "text" files in the classic sense have no random access method at any higher level than reading a byte at some offset from the beginning of the file, or back from the end when it has not grown.
The obvious fact is that most of the time the lines are not of fixed width, and you have heard about multi-byte encodings and how line endings can vary.

When files get long enough that just reading them from the start as a whole, or even in chunks, gets too expensive, some people might consider some other method. Log files can go on for years so it is not uncommon to start a new one periodically and have a folder with many of them in some order. To get the last few lines simply means finding the last file and reading it, or if it is too short, getting the penultimate one too.
And obviously a database or other structure might work better which might make each "line" a record and index them.
But there are ways to create your own data that get around this, such as using an encoding with a large but fixed width for every character, albeit at the cost of more storage space. But if the goal is a general-purpose tool, internationalization beyond ASCII has created a challenge for lots of such tools.


-----Original Message-----
From: Marco Sulla <Marco.Sulla.Python@gmail.com>
To: Dennis Lee Bieber <wlfraed@ix.netcom.com>
Cc: python-list@python.org
Sent: Sat, May 7, 2022 9:21 am
Subject: Re: tail

On Sat, 7 May 2022 at 01:03, Dennis Lee Bieber <wlfraed@ix.netcom.com> wrote:
>
>        Windows also uses <cr><lf> for the EOL marker, but Python's I/O system
> condenses that to just <lf> internally (for TEXT mode) -- so using the
> length of a string so read to compute a file position may be off-by-one for
> each EOL in the string.

So there's no way to reliably read lines in reverse in text mode using
seek and read, and the only option is readlines?
--
https://mail.python.org/mailman/listinfo/python-list
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
> On 7 May 2022, at 14:24, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
>
> On Sat, 7 May 2022 at 01:03, Dennis Lee Bieber <wlfraed@ix.netcom.com> wrote:
>>
>> Windows also uses <cr><lf> for the EOL marker, but Python's I/O system
>> condenses that to just <lf> internally (for TEXT mode) -- so using the
>> length of a string so read to compute a file position may be off-by-one for
>> each EOL in the string.
>
> So there's no way to reliably read lines in reverse in text mode using
> seek and read, and the only option is readlines?

You need to handle the file in bin mode and do the handling of line endings and encodings yourself. It’s not that hard for the cases you wanted.
Figure out which line ending is in use: CRLF, LF, or CR.
Once you have a line, decode it before returning it.

The only OS I know that used CR was Classic Mac OS.
If you do not care about that then you can split on NL and strip any trailing CR.

Barry


> --
> https://mail.python.org/mailman/listinfo/python-list
>

--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On Sat, 7 May 2022 at 16:08, Barry <barry@barrys-emacs.org> wrote:
> You need to handle the file in bin mode and do the handling of line endings and encodings yourself. It’s not that hard for the cases you wanted.

>>> "\n".encode("utf-16")
b'\xff\xfe\n\x00'
>>> "".encode("utf-16")
b'\xff\xfe'
>>> "a\nb".encode("utf-16")
b'\xff\xfea\x00\n\x00b\x00'
>>> "\n".encode("utf-16").lstrip("".encode("utf-16"))
b'\n\x00'

Can I use the last trick to get the encoding of a LF or a CR in any encoding?
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
I believe I'd do something like:

#!/usr/local/cpython-3.10/bin/python3

"""
Output the last 10 lines of a potentially-huge file.

O(n). But technically so is scanning backward from the EOF.

It'd be faster to use a dict, but this has the advantage of working for
huge num_lines.
"""

import dbm
import os
import sys

tempfile = f'/tmp/{os.path.basename(sys.argv[0])}.{os.getpid()}'

db = dbm.open(tempfile, 'n')

num_lines = 10

for cur_lineno, line in enumerate(sys.stdin):
    db[str(cur_lineno)] = line.encode('utf-8')
    max_lineno = cur_lineno
    str_age_out_lineno = str(cur_lineno - num_lines - 1)
    if str_age_out_lineno in db:
        del db[str_age_out_lineno]

for lineno in range(max_lineno, max_lineno - num_lines, -1):
    str_lineno = str(lineno)
    if str_lineno not in db:
        break
    print(db[str(lineno)].decode('utf-8'), end='')

db.close()
os.unlink(tempfile)


On Sat, Apr 23, 2022 at 11:36 AM Marco Sulla <Marco.Sulla.Python@gmail.com>
wrote:

> What about introducing a method for text streams that reads the lines
> from the bottom? Java has also a ReversedLinesFileReader with Apache
> Commons IO.
> --
> https://mail.python.org/mailman/listinfo/python-list
>
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On 2022-05-07 17:28, Marco Sulla wrote:
> On Sat, 7 May 2022 at 16:08, Barry <barry@barrys-emacs.org> wrote:
>> You need to handle the file in bin mode and do the handling of line endings and encodings yourself. It’s not that hard for the cases you wanted.
>
>>>> "\n".encode("utf-16")
> b'\xff\xfe\n\x00'
>>>> "".encode("utf-16")
> b'\xff\xfe'
>>>> "a\nb".encode("utf-16")
> b'\xff\xfea\x00\n\x00b\x00'
>>>> "\n".encode("utf-16").lstrip("".encode("utf-16"))
> b'\n\x00'
>
> Can I use the last trick to get the encoding of a LF or a CR in any encoding?

In the case of UTF-16, it's 2 bytes per code unit, but those 2 bytes
could be little-endian or big-endian.

As you didn't specify which you wanted, it defaulted to little-endian
and added a BOM (U+FEFF).

If you specify which endianness you want with "utf-16le" or "utf-16be",
it won't add the BOM:

>>> # Little-endian.
>>> "\n".encode("utf-16le")
b'\n\x00'
>>> # Big-endian.
>>> "\n".encode("utf-16be")
b'\x00\n'
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On Sat, 7 May 2022 at 19:02, MRAB <python@mrabarnett.plus.com> wrote:
>
> On 2022-05-07 17:28, Marco Sulla wrote:
> > On Sat, 7 May 2022 at 16:08, Barry <barry@barrys-emacs.org> wrote:
> >> You need to handle the file in bin mode and do the handling of line endings and encodings yourself. It’s not that hard for the cases you wanted.
> >
> >>>> "\n".encode("utf-16")
> > b'\xff\xfe\n\x00'
> >>>> "".encode("utf-16")
> > b'\xff\xfe'
> >>>> "a\nb".encode("utf-16")
> > b'\xff\xfea\x00\n\x00b\x00'
> >>>> "\n".encode("utf-16").lstrip("".encode("utf-16"))
> > b'\n\x00'
> >
> > Can I use the last trick to get the encoding of a LF or a CR in any encoding?
>
> In the case of UTF-16, it's 2 bytes per code unit, but those 2 bytes
> could be little-endian or big-endian.
>
> As you didn't specify which you wanted, it defaulted to little-endian
> and added a BOM (U+FEFF).
>
> If you specify which endianness you want with "utf-16le" or "utf-16be",
> it won't add the BOM:
>
> >>> # Little-endian.
> >>> "\n".encode("utf-16le")
> b'\n\x00'
> >>> # Big-endian.
> >>> "\n".encode("utf-16be")
> b'\x00\n'

Well, ok, but I need a generic method to get LF and CR for any
encoding an user can input.
Do you think that

"\n".encode(encoding).lstrip("".encode(encoding))

is good for any encoding? Furthermore, is there a way to get the
encoding of an opened file object?
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On Sat, 7 May 2022 20:35:34 +0200, Marco Sulla
<Marco.Sulla.Python@gmail.com> declaimed the following:

>Well, ok, but I need a generic method to get LF and CR for any
>encoding an user can input.

Other than EBCDIC, <lf> and <cr> AS BYTES should appear as 0x0A and 0x0D
in any of the 8-bit encodings (ASCII, ISO-8859-x, CPxxxx, UTF-8). I believe
those bytes also appear in UTF-16 -- BUT, they will have a null (0x00) byte
associated with them as padding; as a result, you cannot search for just
0x0D 0x0A (the Windows line end convention) -- it may be 0x00 0x0D 0x00 0x0A or
0x0D 0x00 0x0A 0x00 depending on endianness, cf:
https://docs.microsoft.com/en-us/cpp/text/support-for-unicode?view=msvc-170
)

For EBCDIC <cr> is still 0x0D, but <lf> is 0x25 (and there is a separate
<nl> [new line] at 0x15)
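
For example, with Python's cp037 (EBCDIC) codec -- hypothetical session:

>>> "\r".encode("cp037")   # CR is still 0x0D
b'\r'
>>> "\n".encode("cp037")   # LF becomes 0x25 ('%' in ASCII)
b'%'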


--
Wulfraed Dennis Lee Bieber AF6VN
wlfraed@ix.netcom.com http://wlfraed.microdiversity.freeddns.org/
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On 2022-05-07 19:35, Marco Sulla wrote:
> On Sat, 7 May 2022 at 19:02, MRAB <python@mrabarnett.plus.com> wrote:
> >
> > On 2022-05-07 17:28, Marco Sulla wrote:
> > > On Sat, 7 May 2022 at 16:08, Barry <barry@barrys-emacs.org> wrote:
> > >> You need to handle the file in bin mode and do the handling of line endings and encodings yourself. It’s not that hard for the cases you wanted.
> > >
> > >>>> "\n".encode("utf-16")
> > > b'\xff\xfe\n\x00'
> > >>>> "".encode("utf-16")
> > > b'\xff\xfe'
> > >>>> "a\nb".encode("utf-16")
> > > b'\xff\xfea\x00\n\x00b\x00'
> > >>>> "\n".encode("utf-16").lstrip("".encode("utf-16"))
> > > b'\n\x00'
> > >
> > > Can I use the last trick to get the encoding of a LF or a CR in any encoding?
> >
> > In the case of UTF-16, it's 2 bytes per code unit, but those 2 bytes
> > could be little-endian or big-endian.
> >
> > As you didn't specify which you wanted, it defaulted to little-endian
> > and added a BOM (U+FEFF).
> >
> > If you specify which endianness you want with "utf-16le" or "utf-16be",
> > it won't add the BOM:
> >
> > >>> # Little-endian.
> > >>> "\n".encode("utf-16le")
> > b'\n\x00'
> > >>> # Big-endian.
> > >>> "\n".encode("utf-16be")
> > b'\x00\n'
>
> Well, ok, but I need a generic method to get LF and CR for any
> encoding an user can input.
> Do you think that
>
> "\n".encode(encoding).lstrip("".encode(encoding))
>
> is good for any encoding?
'.lstrip' is the wrong method to use because it treats its argument as a
set of characters, so it might strip off too many characters. A better
choice is '.removeprefix'.
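
For example, with utf-16, whose BOM is b'\xff\xfe' (hypothetical session):

>>> data = "\ufeff\n".encode("utf-16")       # BOM + a real U+FEFF + newline
>>> data
b'\xff\xfe\xff\xfe\n\x00'
>>> data.lstrip("".encode("utf-16"))         # strips the real U+FEFF too
b'\n\x00'
>>> data.removeprefix("".encode("utf-16"))   # removes only the BOM
b'\xff\xfe\n\x00'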
> Furthermore, is there a way to get the encoding of an opened file object?
>
How was the file opened?


If it was opened as a text file, use the '.encoding' attribute (which
just tells you what encoding was specified when it was opened, and you'd
be assuming that it's the correct one).


If it was opened as a binary file, all you know is that it contains
bytes, and determining the encoding (assuming that it is a text file) is
down to heuristics (i.e. guesswork).

--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On 2022-05-07 19:47, Stefan Ram wrote:
> Marco Sulla <Marco.Sulla.Python@gmail.com> writes:
>>Well, ok, but I need a generic method to get LF and CR for any
>>encoding an user can input.
>
> "LF" and "CR" come from US-ASCII. It is theoretically
> possible that there might be some encodings out there
> (not for Unicode) that are not based on US-ASCII and
> have no LF or no CR.
>
>>is good for any encoding? Furthermore, is there a way to get the
>>encoding of an opened file object?
>
> I have written a function that might be able to detect one
> of few encodings based on a heuristic algorithm.
>
> def encoding( name ):
>     path = pathlib.Path( name )
>     for encoding in( "utf_8", "latin_1", "cp1252" ):
>         try:
>             with path.open( encoding=encoding, errors="strict" )as file:
>                 text = file.read()
>             return encoding
>         except UnicodeDecodeError:
>             pass
>     return "ascii"
>
> Yes, it's potentially slow and might be wrong.
> The result "ascii" might mean it's a binary file.
>
"latin-1" will decode any sequence of bytes, so it'll never try
"cp1252", nor fall back to "ascii", and falling back to "ascii" is wrong
anyway because the file could contain 0x80..0xFF, which aren't supported
by that encoding.
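
For example (hypothetical session):

>>> len(bytes(range(256)).decode("latin-1"))   # every byte value decodes
256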
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On Sun, 8 May 2022 at 04:37, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
>
> On Sat, 7 May 2022 at 19:02, MRAB <python@mrabarnett.plus.com> wrote:
> >
> > On 2022-05-07 17:28, Marco Sulla wrote:
> > > On Sat, 7 May 2022 at 16:08, Barry <barry@barrys-emacs.org> wrote:
> > >> You need to handle the file in bin mode and do the handling of line endings and encodings yourself. It’s not that hard for the cases you wanted.
> > >
> > >>>> "\n".encode("utf-16")
> > > b'\xff\xfe\n\x00'
> > >>>> "".encode("utf-16")
> > > b'\xff\xfe'
> > >>>> "a\nb".encode("utf-16")
> > > b'\xff\xfea\x00\n\x00b\x00'
> > >>>> "\n".encode("utf-16").lstrip("".encode("utf-16"))
> > > b'\n\x00'
> > >
> > > Can I use the last trick to get the encoding of a LF or a CR in any encoding?
> >
> > In the case of UTF-16, it's 2 bytes per code unit, but those 2 bytes
> > could be little-endian or big-endian.
> >
> > As you didn't specify which you wanted, it defaulted to little-endian
> > and added a BOM (U+FEFF).
> >
> > If you specify which endianness you want with "utf-16le" or "utf-16be",
> > it won't add the BOM:
> >
> > >>> # Little-endian.
> > >>> "\n".encode("utf-16le")
> > b'\n\x00'
> > >>> # Big-endian.
> > >>> "\n".encode("utf-16be")
> > b'\x00\n'
>
> Well, ok, but I need a generic method to get LF and CR for any
> encoding an user can input.
> Do you think that
>
> "\n".encode(encoding).lstrip("".encode(encoding))
>
> is good for any encoding?

No, because it is only useful for stateless encodings. Any encoding
which uses "shift bytes" that cause subsequent bytes to be interpreted
differently will simply not work with this naive technique. Also,
you're assuming that the byte(s) you get from encoding LF will *only*
represent LF, which is also not true for a number of other encodings -
they might always encode LF to the same byte sequence, but could use
that same byte sequence as part of a multi-byte encoding. So, no, for
arbitrarily chosen encodings, this is not dependable.
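
For example, in UTF-16-LE the innocuous character U+0A0A encodes to two
bytes that are both 0x0A, so a byte-level search for b"\n" finds "newlines"
that aren't there (hypothetical session):

>>> "\u0a0a".encode("utf-16le")
b'\n\n'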

> Furthermore, is there a way to get the
> encoding of an opened file object?

Nope. That's fundamentally not possible. Unless you mean in the
trivial sense of "what was the parameter passed to the open() call?",
in which case f.encoding will give it to you; but to find out the
actual encoding, no, you can't.

The ONLY way to 100% reliably decode arbitrary text is to know, from
external information, what encoding it is in. Every other scheme
imposes restrictions. Trying to do something that works for absolutely
any encoding is a doomed project.

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On Sun, 8 May 2022 at 07:19, Stefan Ram <ram@zedat.fu-berlin.de> wrote:
>
> MRAB <python@mrabarnett.plus.com> writes:
> >On 2022-05-07 19:47, Stefan Ram wrote:
> ...
> >>def encoding( name ):
> >>    path = pathlib.Path( name )
> >>    for encoding in( "utf_8", "latin_1", "cp1252" ):
> >>        try:
> >>            with path.open( encoding=encoding, errors="strict" )as file:
> >>                text = file.read()
> >>            return encoding
> >>        except UnicodeDecodeError:
> >>            pass
> >>    return "ascii"
> >>Yes, it's potentially slow and might be wrong.
> >>The result "ascii" might mean it's a binary file.
> >"latin-1" will decode any sequence of bytes, so it'll never try
> >"cp1252", nor fall back to "ascii", and falling back to "ascii" is wrong
> >anyway because the file could contain 0x80..0xFF, which aren't supported
> >by that encoding.
>
> Thank you! It's working for my specific application where
> I'm reading from a collection of text files that should be
> encoded in either utf_8, latin_1, or ascii.
>

In that case, I'd exclude ASCII from the check, and just check UTF-8,
and if that fails, decode as Latin-1. Any ASCII files will decode
correctly as UTF-8, and any file will decode as Latin-1.

I've used this exact fallback system when decoding raw data from
Unicode-naive servers - they accept and share bytes, so it's entirely
possible to have a mix of encodings in a single stream. As long as you
can define the span of a single "unit" (say, a line, or a chunk in
some form), you can read as bytes and do the exact same "decode as
UTF-8 if possible, otherwise decode as Latin-1" dance. Sure, it's not
perfectly ideal, but it's about as good as you'll get with a lot of
US-based servers. (Depending on context, you might use CP-1252 instead
of Latin-1, but you might need errors="replace" there, since
Windows-1252 has some undefined byte values.)
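
A minimal sketch of that dance (the function name is mine, not a standard
API):

def decode_best_effort(raw):
    # Try UTF-8 first; ASCII-only data also succeeds here.
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte value, so this cannot fail.
        return raw.decode("latin-1")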

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
> On 7 May 2022, at 17:29, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
>
> On Sat, 7 May 2022 at 16:08, Barry <barry@barrys-emacs.org> wrote:
>> You need to handle the file in bin mode and do the handling of line endings and encodings yourself. It’s not that hard for the cases you wanted.
>
>>>> "\n".encode("utf-16")
> b'\xff\xfe\n\x00'
>>>> "".encode("utf-16")
> b'\xff\xfe'
>>>> "a\nb".encode("utf-16")
> b'\xff\xfea\x00\n\x00b\x00'
>>>> "\n".encode("utf-16").lstrip("".encode("utf-16"))
> b'\n\x00'
>
> Can I use the last trick to get the encoding of a LF or a CR in any encoding?

In a word no.

There are cases where you just have to know the encoding you are working with:
utf-16, because you have to deal with the data in 2-byte units and know if
it is big-endian or little-endian.

There will be other encodings that will also be difficult.

But if you are working with encodings that use ASCII as a base,
like Unicode encoded as utf-8 or the iso-8859 series, then you can just look
for NL and CR using their ASCII byte values.
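
For example, a byte-level search is safe for utf-8, since 0x0A and 0x0D
never occur inside a multi-byte sequence (hypothetical session):

>>> "héllo\nwörld\n".encode("utf-8").count(b"\n")
2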

In short, once you set your requirements, you can know which problems
you can avoid and which you must solve.

Is utf-16 important to you? If not, no need to solve its issues.

Barry



--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
I think I've _almost_ found a simpler, general way:

import os

_lf = "\n"
_cr = "\r"

def tail(filepath, n=10, newline=None, encoding=None, chunk_size=100):
    n_chunk_size = n * chunk_size
    pos = os.stat(filepath).st_size
    chunk_line_pos = -1
    lines_not_found = n

    with open(filepath, newline=newline, encoding=encoding) as f:
        text = ""

        hard_mode = False

        if newline == None:
            newline = _lf
        elif newline == "":
            hard_mode = True

        if hard_mode:
            while pos != 0:
                pos -= n_chunk_size

                if pos < 0:
                    pos = 0

                f.seek(pos)
                text = f.read()
                lf_after = False

                for i, char in enumerate(reversed(text)):
                    if char == _lf:
                        lf_after = True
                    elif char == _cr:
                        lines_not_found -= 1

                        newline_size = 2 if lf_after else 1

                        lf_after = False
                    elif lf_after:
                        lines_not_found -= 1
                        newline_size = 1
                        lf_after = False


                    if lines_not_found == 0:
                        chunk_line_pos = len(text) - 1 - i + newline_size
                        break

                if lines_not_found == 0:
                    break
        else:
            while pos != 0:
                pos -= n_chunk_size

                if pos < 0:
                    pos = 0

                f.seek(pos)
                text = f.read()

                for i, char in enumerate(reversed(text)):
                    if char == newline:
                        lines_not_found -= 1

                        if lines_not_found == 0:
                            chunk_line_pos = len(text) - 1 - i + len(newline)
                            break

                if lines_not_found == 0:
                    break


    if chunk_line_pos == -1:
        chunk_line_pos = 0

    return text[chunk_line_pos:]


In short, the file is always opened in text mode. The file is read from the
end in bigger and bigger chunks, until the file is exhausted or all the
lines are found.

Why? Because in encodings that have more than 1 byte per character, reading
a chunk of n bytes, then reading the previous chunk, can split a character
across the two chunks.

I think one could instead read chunk by chunk and handle the chunk-junction
problem; I suppose the code would be faster that way. Anyway, it seems that
this trick is quite fast as it is, and it's a lot simpler.

The final result is read from the chunk, and not from the file, so there are
no problems of misalignment between bytes and text. Furthermore, the builtin
encoding parameter is used, so this should work with all the encodings
(untested).

Furthermore, a newline parameter can be specified, as in open(). If it's
equal to the empty string, things are a little more complicated, but I
suppose the code is clear. It's largely untested: I only tested with a UTF-8
Linux file.
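
For example, a quick check could look like this (hypothetical path):

print(tail("/var/log/example.log", n=3, encoding="utf-8"), end="")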

Do you think there are chances to get this function as a method of the file
object in CPython? The method for a file object opened in bytes mode is
simpler, since there's no encoding and newline is only \n in that case.
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
> On 7 May 2022, at 14:40, Stefan Ram <ram@zedat.fu-berlin.de> wrote:
>
> Marco Sulla <Marco.Sulla.Python@gmail.com> writes:
>> So there's no way to reliably read lines in reverse in text mode using
>> seek and read, and the only option is readlines?
>
> I think, CPython is based on C. I don't know whether
> Python's seek function directly calls C's fseek function,
> but maybe the following parts of the C standard also are
> relevant for Python?

There is the POSIX API and the C FILE API.

I expect that the oddities you mention about NUL chars are all about the FILE
API. As far as I know it's the POSIX API that CPython uses, and it
does not suffer from issues with binary files.

Barry

>
> |Setting the file position indicator to end-of-file, as with
> |fseek(file, 0, SEEK_END), has undefined behavior for a binary
> |stream (because of possible trailing null characters) or for
> |any stream with state-dependent encoding that does not
> |assuredly end in the initial shift state.
> from a footnote in a draft of a C standard
>
> |For a text stream, either offset shall be zero, or offset
> |shall be a value returned by an earlier successful call to
> |the ftell function on a stream associated with the same file
> |and whence shall be SEEK_SET.
> from a draft of a C standard
>
> |A text stream is an ordered sequence of characters composed
> |into lines, each line consisting of zero or more characters
> |plus a terminating new-line character. Whether the last line
> |requires a terminating new-line character is implementation-defined.
> from a draft of a C standard
>
> This might mean that reading from a text stream that is not
> ending in a new-line character might have undefined behavior
> (depending on the C implementation). In practice, it might
> mean that some things could go wrong near the end of such
> a stream.
>
>
> --
> https://mail.python.org/mailman/listinfo/python-list
>

--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
> On 7 May 2022, at 22:31, Chris Angelico <rosuav@gmail.com> wrote:
>
> On Sun, 8 May 2022 at 07:19, Stefan Ram <ram@zedat.fu-berlin.de> wrote:
>>
>> MRAB <python@mrabarnett.plus.com> writes:
>>> On 2022-05-07 19:47, Stefan Ram wrote:
>> ...
>>>> def encoding( name ):
>>>>     path = pathlib.Path( name )
>>>>     for encoding in( "utf_8", "latin_1", "cp1252" ):
>>>>         try:
>>>>             with path.open( encoding=encoding, errors="strict" )as file:
>>>>                 text = file.read()
>>>>             return encoding
>>>>         except UnicodeDecodeError:
>>>>             pass
>>>>     return "ascii"
>>>> Yes, it's potentially slow and might be wrong.
>>>> The result "ascii" might mean it's a binary file.
>>> "latin-1" will decode any sequence of bytes, so it'll never try
>>> "cp1252", nor fall back to "ascii", and falling back to "ascii" is wrong
>>> anyway because the file could contain 0x80..0xFF, which aren't supported
>>> by that encoding.
>>
>> Thank you! It's working for my specific application where
>> I'm reading from a collection of text files that should be
>> encoded in either utf_8, latin_1, or ascii.
>>
>
> In that case, I'd exclude ASCII from the check, and just check UTF-8,
> and if that fails, decode as Latin-1. Any ASCII files will decode
> correctly as UTF-8, and any file will decode as Latin-1.
>
> I've used this exact fallback system when decoding raw data from
> Unicode-naive servers - they accept and share bytes, so it's entirely
> possible to have a mix of encodings in a single stream. As long as you
> can define the span of a single "unit" (say, a line, or a chunk in
> some form), you can read as bytes and do the exact same "decode as
> UTF-8 if possible, otherwise decode as Latin-1" dance. Sure, it's not
> perfectly ideal, but it's about as good as you'll get with a lot of
> US-based servers. (Depending on context, you might use CP-1252 instead
> of Latin-1, but you might need errors="replace" there, since
> Windows-1252 has some undefined byte values.)

There is a very common error on Windows that files and especially web pages that
claim to be utf-8 are in fact CP-1252.

There is logic in the HTML standards to try utf-8 and if it fails fall back to CP-1252.

It's usually the left and right "smart" quote chars that cause the issue, as
their bytes are invalid utf-8.

Barry


>
> ChrisA
> --
> https://mail.python.org/mailman/listinfo/python-list
>

--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On Mon, 9 May 2022 at 04:15, Barry Scott <barry@barrys-emacs.org> wrote:
>
>
>
> > On 7 May 2022, at 22:31, Chris Angelico <rosuav@gmail.com> wrote:
> >
> > On Sun, 8 May 2022 at 07:19, Stefan Ram <ram@zedat.fu-berlin.de> wrote:
> >>
> >> MRAB <python@mrabarnett.plus.com> writes:
> >>> On 2022-05-07 19:47, Stefan Ram wrote:
> >> ...
> >>>> def encoding( name ):
> >>>>     path = pathlib.Path( name )
> >>>>     for encoding in( "utf_8", "latin_1", "cp1252" ):
> >>>>         try:
> >>>>             with path.open( encoding=encoding, errors="strict" )as file:
> >>>>                 text = file.read()
> >>>>             return encoding
> >>>>         except UnicodeDecodeError:
> >>>>             pass
> >>>>     return "ascii"
> >>>> Yes, it's potentially slow and might be wrong.
> >>>> The result "ascii" might mean it's a binary file.
> >>> "latin-1" will decode any sequence of bytes, so it'll never try
> >>> "cp1252", nor fall back to "ascii", and falling back to "ascii" is wrong
> >>> anyway because the file could contain 0x80..0xFF, which aren't supported
> >>> by that encoding.
> >>
> >> Thank you! It's working for my specific application where
> >> I'm reading from a collection of text files that should be
> >> encoded in either utf_8, latin_1, or ascii.
> >>
> >
> > In that case, I'd exclude ASCII from the check, and just check UTF-8,
> > and if that fails, decode as Latin-1. Any ASCII files will decode
> > correctly as UTF-8, and any file will decode as Latin-1.
> >
> > I've used this exact fallback system when decoding raw data from
> > Unicode-naive servers - they accept and share bytes, so it's entirely
> > possible to have a mix of encodings in a single stream. As long as you
> > can define the span of a single "unit" (say, a line, or a chunk in
> > some form), you can read as bytes and do the exact same "decode as
> > UTF-8 if possible, otherwise decode as Latin-1" dance. Sure, it's not
> > perfectly ideal, but it's about as good as you'll get with a lot of
> > US-based servers. (Depending on context, you might use CP-1252 instead
> > of Latin-1, but you might need errors="replace" there, since
> > Windows-1252 has some undefined byte values.)
>
> There is a very common error on Windows that files and especially web pages that
> claim to be utf-8 are in fact CP-1252.
>
> There is logic in the HTML standards to try utf-8 and if it fails fall back to CP-1252.
>
> It's usually the left and right "smart" quote chars that cause the issue, as
> their bytes are invalid utf-8.
>

Yeah, or sometimes, there isn't *anything* in UTF-8, and it has some
sort of straight-up lie in the form of a meta tag. It's annoying. But
the same logic still applies: attempt one decode (UTF-8) and if it
fails, there's one fallback. Fairly simple.

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
> On 8 May 2022, at 17:05, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
>
> I think I've _almost_ found a simpler, general way:
>
> import os
>
> _lf = "\n"
> _cr = "\r"
>
> def tail(filepath, n=10, newline=None, encoding=None, chunk_size=100):
>     n_chunk_size = n * chunk_size

Why use tiny chunks? You can read 4KiB as fast as 100 bytes, as that's typically the smallest size the file system will allocate.
I tend to read in multiples of a MiB as it's near instant.

>     pos = os.stat(filepath).st_size

You cannot mix POSIX API with text mode.
pos is in bytes from the start of the file.
Text mode will be in code points. bytes != code points.

>     chunk_line_pos = -1
>     lines_not_found = n
>
>     with open(filepath, newline=newline, encoding=encoding) as f:
>         text = ""
>
>         hard_mode = False
>
>         if newline == None:
>             newline = _lf
>         elif newline == "":
>             hard_mode = True
>
>         if hard_mode:
>             while pos != 0:
>                 pos -= n_chunk_size
>
>                 if pos < 0:
>                     pos = 0
>
>                 f.seek(pos)

In text mode you can only seek to a value returned from f.tell() otherwise the behaviour is undefined.

>                 text = f.read()

You have on limit on the amount of data read.

>                 lf_after = False
>
>                 for i, char in enumerate(reversed(text)):

Simply use text.rindex('\n') or text.rfind('\n') for speed.

>                     if char == _lf:
>                         lf_after = True
>                     elif char == _cr:
>                         lines_not_found -= 1
>
>                         newline_size = 2 if lf_after else 1
>
>                         lf_after = False
>                     elif lf_after:
>                         lines_not_found -= 1
>                         newline_size = 1
>                         lf_after = False
>
>
>                     if lines_not_found == 0:
>                         chunk_line_pos = len(text) - 1 - i + newline_size
>                         break
>
>                 if lines_not_found == 0:
>                     break
>         else:
>             while pos != 0:
>                 pos -= n_chunk_size
>
>                 if pos < 0:
>                     pos = 0
>
>                 f.seek(pos)
>                 text = f.read()
>
>                 for i, char in enumerate(reversed(text)):
>                     if char == newline:
>                         lines_not_found -= 1
>
>                         if lines_not_found == 0:
>                             chunk_line_pos = len(text) - 1 - i + len(newline)
>                             break
>
>                 if lines_not_found == 0:
>                     break
>
>
>     if chunk_line_pos == -1:
>         chunk_line_pos = 0
>
>     return text[chunk_line_pos:]
>
>
> In short, the file is always opened in text mode. The file is read from the
> end in bigger and bigger chunks, until the file is exhausted or all the
> lines are found.

It will fail if the contents is not ASCII.

>
> Why? Because in encodings that have more than 1 byte per character, reading
> a chunk of n bytes, then reading the previous chunk, can split a character
> across the two chunks.

No it cannot. text mode only knows how to return code points. Now if you are in
binary it could be split, but you are not in binary mode so it cannot.

> I think one could instead read chunk by chunk and handle the chunk-junction
> problem; I suppose the code would be faster that way. Anyway, it seems that
> this trick is quite fast as it is, and it's a lot simpler.

> The final result is read from the chunk, and not from the file, so there are
> no problems of misalignment between bytes and text. Furthermore, the builtin
> encoding parameter is used, so this should work with all the encodings
> (untested).
>
> Furthermore, a newline parameter can be specified, as in open(). If it's
> equal to the empty string, things are a little more complicated, but I
> suppose the code is clear. It's largely untested: I only tested with a UTF-8
> Linux file.
>
> Do you think there are chances to get this function as a method of the file
> object in CPython? The method for a file object opened in bytes mode is
> simpler, since there's no encoding and newline is only \n in that case.

State your requirements. Then see if your implementation meets them.

Barry

> --
> https://mail.python.org/mailman/listinfo/python-list
>

--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On 2022-05-08 19:15, Barry Scott wrote:
>
>
>> On 7 May 2022, at 22:31, Chris Angelico <rosuav@gmail.com> wrote:
>>
>> On Sun, 8 May 2022 at 07:19, Stefan Ram <ram@zedat.fu-berlin.de> wrote:
>>>
>>> MRAB <python@mrabarnett.plus.com> writes:
>>>> On 2022-05-07 19:47, Stefan Ram wrote:
>>> ...
>>>>> def encoding( name ):
>>>>>     path = pathlib.Path( name )
>>>>>     for encoding in( "utf_8", "latin_1", "cp1252" ):
>>>>>         try:
>>>>>             with path.open( encoding=encoding, errors="strict" )as file:
>>>>>                 text = file.read()
>>>>>             return encoding
>>>>>         except UnicodeDecodeError:
>>>>>             pass
>>>>>     return "ascii"
>>>>> Yes, it's potentially slow and might be wrong.
>>>>> The result "ascii" might mean it's a binary file.
>>>> "latin-1" will decode any sequence of bytes, so it'll never try
>>>> "cp1252", nor fall back to "ascii", and falling back to "ascii" is wrong
>>>> anyway because the file could contain 0x80..0xFF, which aren't supported
>>>> by that encoding.
>>>
>>> Thank you! It's working for my specific application where
>>> I'm reading from a collection of text files that should be
>>> encoded in either utf_8, latin_1, or ascii.
>>>
>>
>> In that case, I'd exclude ASCII from the check, and just check UTF-8,
>> and if that fails, decode as Latin-1. Any ASCII files will decode
>> correctly as UTF-8, and any file will decode as Latin-1.
>>
>> I've used this exact fallback system when decoding raw data from
>> Unicode-naive servers - they accept and share bytes, so it's entirely
>> possible to have a mix of encodings in a single stream. As long as you
>> can define the span of a single "unit" (say, a line, or a chunk in
>> some form), you can read as bytes and do the exact same "decode as
>> UTF-8 if possible, otherwise decode as Latin-1" dance. Sure, it's not
>> perfectly ideal, but it's about as good as you'll get with a lot of
>> US-based servers. (Depending on context, you might use CP-1252 instead
>> of Latin-1, but you might need errors="replace" there, since
>> Windows-1252 has some undefined byte values.)
>
> There is a very common error on Windows that files and especially web pages that
> claim to be utf-8 are in fact CP-1252.
>
> There is logic in the HTML standards to try utf-8 and if it fails fall back to CP-1252.
>
> Its usually the left and "smart" quote chars that cause the issue as they code
> as an invalid utf-8.
>
Is it CP-1252 or ISO-8859-1 (Latin-1)?
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On Sun, 8 May 2022 at 20:31, Barry Scott <barry@barrys-emacs.org> wrote:
>
> > On 8 May 2022, at 17:05, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
> >
> > def tail(filepath, n=10, newline=None, encoding=None, chunk_size=100):
> >     n_chunk_size = n * chunk_size
>
> Why use tiny chunks? You can read 4KiB as fast as 100 bytes, as that's typically the smallest size the file system will allocate.
> I tend to read in multiples of a MiB as it's near instant.

Well, I tested on a little file, a list of my preferred pizzas, so....

> >     pos = os.stat(filepath).st_size
>
> You cannot mix POSIX API with text mode.
> pos is in bytes from the start of the file.
> Text mode will be in code points. bytes != code points.
>
> >     chunk_line_pos = -1
> >     lines_not_found = n
> >
> >     with open(filepath, newline=newline, encoding=encoding) as f:
> >         text = ""
> >
> >         hard_mode = False
> >
> >         if newline == None:
> >             newline = _lf
> >         elif newline == "":
> >             hard_mode = True
> >
> >         if hard_mode:
> >             while pos != 0:
> >                 pos -= n_chunk_size
> >
> >                 if pos < 0:
> >                     pos = 0
> >
> >                 f.seek(pos)
>
> In text mode you can only seek to a value returned from f.tell() otherwise the behaviour is undefined.

Why? I don't see any recommendation about it in the docs:
https://docs.python.org/3/library/io.html#io.IOBase.seek

> >                 text = f.read()
>
> You have on limit on the amount of data read.

I explained that previously. Anyway, chunk_size is small, so it's not
a great problem.

> >                 lf_after = False
> >
> >                 for i, char in enumerate(reversed(text)):
>
> Simply use text.rindex('\n') or text.rfind('\n') for speed.

I can't use them when I have to find both \n and \r. So I preferred to
simplify the code and use the for loop every time. Keep in mind
anyway that this is a prototype for a Python C API implementation
(builtin I hope, or a C extension if not).

> > In short, the file is always opened in text mode. The file is read from the
> > end in bigger and bigger chunks, until the file is exhausted or all the
> > lines are found.
>
> It will fail if the contents is not ASCII.

Why?

> > Why? Because in encodings that have more than 1 byte per character, reading
> > a chunk of n bytes, then reading the previous chunk, can split a character
> > across the two chunks.
>
> No it cannot. text mode only knows how to return code points. Now if you are in
> binary it could be split, but you are not in binary mode so it cannot.

From the docs:

seek(offset, whence=SEEK_SET)
Change the stream position to the given byte offset.

> > Do you think there are chances to get this function as a method of the file
> > object in CPython? The method for a file object opened in bytes mode is
> > simpler, since there's no encoding and newline is only \n in that case.
>
> State your requirements. Then see if your implementation meets them.

The method should return the last n lines from a file object.
If the file object is in text mode, the newline parameter must be honored.
If the file object is in binary mode, a newline is always b"\n", to be
consistent with readline.

I suppose the current implementation of tail satisfies the
requirements for text mode. The previous one satisfied binary mode.

Anyway, apart from my implementation, I'm curious if you think a tail
method is worth it to be a method of the builtin file objects in
CPython.
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On Mon, 9 May 2022 at 05:49, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
> Anyway, apart from my implementation, I'm curious if you think a tail
> method is worth it to be a method of the builtin file objects in
> CPython.

Absolutely not. As has been stated multiple times in this thread, a
fully general approach is extremely complicated, horrifically
unreliable, and hopelessly inefficient. The ONLY way to make this sort
of thing any good whatsoever is to know your own use-case and code to
exactly that. Given the size of files you're working with, for
instance, a simple approach of just reading the whole file would make
far more sense than the complex seeking you're doing. For reading a
multi-gigabyte file, the choices will be different.

No, this does NOT belong in the core language.

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
> On 8 May 2022, at 20:48, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
>
> On Sun, 8 May 2022 at 20:31, Barry Scott <barry@barrys-emacs.org> wrote:
>>
>>>> On 8 May 2022, at 17:05, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
>>>
>>> def tail(filepath, n=10, newline=None, encoding=None, chunk_size=100):
>>>     n_chunk_size = n * chunk_size
>>
>> Why use tiny chunks? You can read 4KiB as fast as 100 bytes, as that's typically the smallest size the file system will allocate.
>> I tend to read in multiples of a MiB as it's near instant.
>
> Well, I tested on a little file, a list of my preferred pizzas, so....

Try it on a very big file.

>
>>>     pos = os.stat(filepath).st_size
>>
>> You cannot mix POSIX API with text mode.
>> pos is in bytes from the start of the file.
>> Text mode will be in code points. bytes != code points.
>>
>>>     chunk_line_pos = -1
>>>     lines_not_found = n
>>>
>>>     with open(filepath, newline=newline, encoding=encoding) as f:
>>>         text = ""
>>>
>>>         hard_mode = False
>>>
>>>         if newline == None:
>>>             newline = _lf
>>>         elif newline == "":
>>>             hard_mode = True
>>>
>>>         if hard_mode:
>>>             while pos != 0:
>>>                 pos -= n_chunk_size
>>>
>>>                 if pos < 0:
>>>                     pos = 0
>>>
>>>                 f.seek(pos)
>>
>> In text mode you can only seek to a value returned from f.tell() otherwise the behaviour is undefined.
>
> Why? I don't see any recommendation about it in the docs:
> https://docs.python.org/3/library/io.html#io.IOBase.seek

What does adding 1 to a pos mean?
If it’s binary it means 1 byte further down the file, but in text mode it may need to
move the point 1, 2 or 3 bytes down the file.
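
For example (hypothetical session; the file's first character is 2 bytes in utf-8):

>>> open("demo.txt", "w", encoding="utf-8").write("è\n")
2
>>> f = open("demo.txt", encoding="utf-8")
>>> f.seek(1)    # lands inside the two-byte encoding of "è"
1
>>> f.read()
Traceback (most recent call last):
  ...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa8 in position 0: invalid start byte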

>
>>>                 text = f.read()
>>
>> You have on limit on the amount of data read.
>
> I explained that previously. Anyway, chunk_size is small, so it's not
> a great problem.

Typo: I meant you have no limit.

You read all the data till the end of the file, which might be megabytes of data.
>
>>>                 lf_after = False
>>>
>>>                 for i, char in enumerate(reversed(text)):
>>
>> Simply use text.rindex('\n') or text.rfind('\n') for speed.
>
> I can't use them when I have to find both \n or \r. So I preferred to
> simplify the code and use the for cycle every time. Take into mind
> anyway that this is a prototype for a Python C Api implementation
> (builtin I hope, or a C extension if not)
>
>>> In short, the file is always opened in text mode. The file is read from the
>>> end in bigger and bigger chunks, until the file is exhausted or all the
>>> lines are found.
>>
>> It will fail if the contents is not ASCII.
>
> Why?
>
>>> Why? Because in encodings that have more than 1 byte per character, reading
>>> a chunk of n bytes, then reading the previous chunk, can split a character
>>> across the two chunks.
>>
>> No it cannot. text mode only knows how to return code points. Now if you are in
>> binary it could be split, but you are not in binary mode so it cannot.
>
> From the docs:
>
> seek(offset, whence=SEEK_SET)
> Change the stream position to the given byte offset.
>
>>> Do you think there are chances to get this function as a method of the file
>>> object in CPython? The method for a file object opened in bytes mode is
>>> simpler, since there's no encoding and newline is only \n in that case.
>>
>> State your requirements. Then see if your implementation meets them.
>
> The method should return the last n lines from a file object.
> If the file object is in text mode, the newline parameter must be honored.
> If the file object is in binary mode, a newline is always b"\n", to be
> consistent with readline.
>
> I suppose the current implementation of tail satisfies the
> requirements for text mode. The previous one satisfied binary mode.
>
> Anyway, apart from my implementation, I'm curious if you think a tail
> method is worth it to be a method of the builtin file objects in
> CPython.
>

--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On Sun, 8 May 2022 at 22:02, Chris Angelico <rosuav@gmail.com> wrote:
>
> Absolutely not. As has been stated multiple times in this thread, a
> fully general approach is extremely complicated, horrifically
> unreliable, and hopelessly inefficient.

Well, my implementation is quite general now. It's neither complicated nor
inefficient. As for reliability, I can't say anything without a test
case.

> The ONLY way to make this sort
> of thing any good whatsoever is to know your own use-case and code to
> exactly that. Given the size of files you're working with, for
> instance, a simple approach of just reading the whole file would make
> far more sense than the complex seeking you're doing. For reading a
> multi-gigabyte file, the choices will be different.

Apart from the fact that it's very, very simple to optimize for small
files: this is, IMHO, a premature optimization. The code is quite fast
even if the file is small. Can it be faster? Of course, but it depends
on the use case. Every optimization in CPython must pass the benchmark
suite test. If there's little or no gain, the optimization is usually
rejected.

> No, this does NOT belong in the core language.

I respect your opinion, but IMHO you think that the task is more
complicated than the reality. It seems to me that the method can be
quite simple and fast.
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On Sun, 8 May 2022 at 22:34, Barry <barry@barrys-emacs.org> wrote:
>
> > On 8 May 2022, at 20:48, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
> >
> > ?On Sun, 8 May 2022 at 20:31, Barry Scott <barry@barrys-emacs.org> wrote:
> >>
> >>>> On 8 May 2022, at 17:05, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
> >>>
> >>> def tail(filepath, n=10, newline=None, encoding=None, chunk_size=100):
> >>>     n_chunk_size = n * chunk_size
> >>
> >> Why use tiny chunks? You can read 4KiB as fast as 100 bytes, as that's typically the smallest size the file system will allocate.
> >> I tend to read in multiples of a MiB as it's near instant.
> >
> > Well, I tested on a little file, a list of my preferred pizzas, so....
>
> Try it on a very big file.

I'm not saying it's a good idea, it's only the value that I needed for my tests.
Anyway, it's not a problem with big files. The problem is with files
with long lines.

> >> In text mode you can only seek to a value returned from f.tell() otherwise the behaviour is undefined.
> >
> > Why? I don't see any recommendation about it in the docs:
> > https://docs.python.org/3/library/io.html#io.IOBase.seek
>
> What does adding 1 to a pos mean?
> If it’s binary it means 1 byte further down the file, but in text mode it may need to
> move the point 1, 2 or 3 bytes down the file.

Emh. I re-quote

seek(offset, whence=SEEK_SET)
Change the stream position to the given byte offset.

And so on. No mention of differences between text and binary mode.

> >> You have on limit on the amount of data read.
> >
> > I explained that previously. Anyway, chunk_size is small, so it's not
> > a great problem.
>
> Typo: I meant you have no limit.
>
> You read all the data till the end of the file, which might be megabytes of data.

Yes, I already explained why and how it could be optimized. I quote myself:

In short, the file is always opened in text mode. The file is read from
the end in bigger and bigger chunks, until the file is exhausted or all
the lines are found.

Why? Because in encodings that have more than 1 byte per character,
reading a chunk of n bytes, then reading the previous chunk, can split a
character across the two chunks.

I think one could instead read chunk by chunk and handle the
chunk-junction problem; I suppose the code would be faster that way.
Anyway, it seems that this trick is quite fast as it is, and it's a lot
simpler.
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On 9/05/22 7:47 am, Marco Sulla wrote:
>> It will fail if the contents is not ASCII.
>
> Why?

For some encodings, if you seek to an arbitrary byte position and
then read, it may *appear* to succeed but give you complete gibberish.

Your method might work for a certain subset of encodings (those that
are self-synchronising) but it won't work for arbitrary encodings.
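
For the self-synchronising case, the resynchronisation step is simple
enough. A minimal sketch (mine, not stdlib), assuming UTF-8 and a buffer
read in binary mode:

def resync_utf8(buf):
    # UTF-8 continuation bytes look like 0b10xxxxxx; skip any that a
    # byte-oriented seek may have landed us in the middle of.
    i = 0
    while i < len(buf) and (buf[i] & 0xC0) == 0x80:
        i += 1
    return buf[i:]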

Given that limitation, I don't think it's reliable enough to include
in the standard library.

--
Greg

--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On Sun, 8 May 2022 22:48:32 +0200, Marco Sulla
<Marco.Sulla.Python@gmail.com> declaimed the following:

>
>Emh. I re-quote
>
>seek(offset, whence=SEEK_SET)
>Change the stream position to the given byte offset.
>
>And so on. No mention of differences between text and binary mode.

You ignore that, underneath, Python is just wrapping the C API... And
the documentation for C explicitly specifies that other than SEEK_END with
offset 0, and SEEK_SET with offset 0, for a text file one can only rely
upon SEEK_SET using an offset previously obtained with (C) ftell() /
(Python) .tell().

https://docs.python.org/3/library/io.html
"""
class io.IOBase

The abstract base class for all I/O classes.
"""
seek(offset, whence=SEEK_SET)

Change the stream position to the given byte offset. offset is
interpreted relative to the position indicated by whence. The default value
for whence is SEEK_SET. Values for whence are:
"""

Applicable to BINARY MODE I/O: For UTF-8 and any other multibyte
encoding, this means you could end up positioning into the middle of a
"character" and subsequently read garbage. It is on you to handle
synchronizing on a valid character position, and also to handle different
line ending conventions.

"""
class io.TextIOBase

Base class for text streams. This class provides a character and line
based interface to stream I/O. It inherits IOBase.
"""
seek(offset, whence=SEEK_SET)

Change the stream position to the given offset. Behaviour depends on
the whence parameter. The default value for whence is SEEK_SET.

SEEK_SET or 0: seek from the start of the stream (the default);
offset must either be a number returned by TextIOBase.tell(), or zero. Any
other offset value produces undefined behaviour.

SEEK_CUR or 1: “seek” to the current position; offset must be zero,
which is a no-operation (all other values are unsupported).

SEEK_END or 2: seek to the end of the stream; offset must be zero
(all other values are unsupported).
"""

EMPHASIS: "offset must either be a number returned by TextIOBase.tell(), or
zero."

TEXT I/O, with a specified encoding, will return Unicode code points,
and will handle converting line endings to the internal (<lf> represents
new-line) format.

Since your code does not specify BINARY mode in the open statement,
Python should be using TEXT mode.



--
Wulfraed Dennis Lee Bieber AF6VN
wlfraed@ix.netcom.com http://wlfraed.microdiversity.freeddns.org/
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On 08May2022 22:48, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
>On Sun, 8 May 2022 at 22:34, Barry <barry@barrys-emacs.org> wrote:
>> >> In text mode you can only seek to a value returned from f.tell()
>> >> otherwise the behaviour is undefined.
>> >
>> > Why? I don't see any recommendation about it in the docs:
>> > https://docs.python.org/3/library/io.html#io.IOBase.seek
>>
>> What does adding 1 to a pos mean?
>> If it’s binary it mean 1 byte further down the file but in text mode it may need to
>> move the point 1, 2 or 3 bytes down the file.
>
>Emh. I re-quote
>
>seek(offset, whence=SEEK_SET)
>Change the stream position to the given byte offset.
>
>And so on. No mention of differences between text and binary mode.

You're looking at IOBase, the _binary_ basis of low level common file
I/O. Compare with: https://docs.python.org/3/library/io.html#io.TextIOBase.seek
The positions are "opaque numbers", which means you should not ascribe
any deeper meaning to them except that they represent a point in the
file. It clearly says "offset must either be a number returned by
TextIOBase.tell(), or zero. Any other offset value produces undefined
behaviour."

The point here is that text is a very different thing. Because you
cannot seek to an absolute number of characters in an encoding with
variable sized characters. _If_ you did a seek to an arbitrary number
you can end up in the middle of some character. And there are encodings
where you cannot inspect the data to find a character boundary in the
byte stream.

Reading text files backwards is not a well defined thing without
additional criteria:
- knowing the text file actually ended on a character boundary
- knowing how to find a character boundary

Cheers,
Cameron Simpson <cs@cskk.id.au>
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On Mon, 9 May 2022 at 07:56, Cameron Simpson <cs@cskk.id.au> wrote:
>
> The point here is that text is a very different thing. Because you
> cannot seek to an absolute number of characters in an encoding with
> variable sized characters. _If_ you did a seek to an arbitrary number
> you can end up in the middle of some character. And there are encodings
> where you cannot inspect the data to find a character boundary in the
> byte stream.

Ooook, now I understand what you and Barry mean. I suppose there's no
reliable way to tail a big file opened in text mode with decent performance.

Anyway, the previous-previous function I posted worked only for files
opened in binary mode, and I suppose it's reliable, since it searches
only for b"\n", as readline() in binary mode does.
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On Tue, 10 May 2022 at 03:47, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
>
> On Mon, 9 May 2022 at 07:56, Cameron Simpson <cs@cskk.id.au> wrote:
> >
> > The point here is that text is a very different thing. Because you
> > cannot seek to an absolute number of characters in an encoding with
> > variable sized characters. _If_ you did a seek to an arbitrary number
> > you can end up in the middle of some character. And there are encodings
> > where you cannot inspect the data to find a character boundary in the
> > byte stream.
>
> Ooook, now I understand what you and Barry mean. I suppose there's no
> reliable way to tail a big file opened in text mode with decent performance.
>
> Anyway, the previous-previous function I posted worked only for files
> opened in binary mode, and I suppose it's reliable, since it searches
> only for b"\n", as readline() in binary mode does.

It's still fundamentally impossible to solve this in a general way, so
the best way to do things will always be to code for *your* specific
use-case. That means that this doesn't belong in the stdlib or core
language, but in your own toolkit.

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On 2022-05-08 at 18:52:42 +0000,
Stefan Ram <ram@zedat.fu-berlin.de> wrote:

> Remember how recently people here talked about how you cannot copy
> text from a video? Then, how did I do it? Turns out, for my
> operating system, there's a screen OCR program! So I did this OCR
> and then manually corrected a few wrong characters, and was done!

When you're learning, and the example you tried doesn't work like it
worked on the video, you probably don't know what's wrong, let alone how
to correct it.
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On Mon, 9 May 2022 at 19:53, Chris Angelico <rosuav@gmail.com> wrote:
>
> On Tue, 10 May 2022 at 03:47, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
> >
> > On Mon, 9 May 2022 at 07:56, Cameron Simpson <cs@cskk.id.au> wrote:
> > >
> > > The point here is that text is a very different thing. Because you
> > > cannot seek to an absolute number of characters in an encoding with
> > > variable sized characters. _If_ you did a seek to an arbitrary number
> > > you can end up in the middle of some character. And there are encodings
> > > where you cannot inspect the data to find a character boundary in the
> > > byte stream.
> >
> > Ooook, now I understand what you and Barry mean. I suppose there's no
> > reliable way to tail a big file opened in text mode with decent performance.
> >
> > Anyway, the previous-previous function I posted worked only for files
> > opened in binary mode, and I suppose it's reliable, since it searches
> > only for b"\n", as readline() in binary mode does.
>
> It's still fundamentally impossible to solve this in a general way, so
> the best way to do things will always be to code for *your* specific
> use-case. That means that this doesn't belong in the stdlib or core
> language, but in your own toolkit.

Nevertheless, tail is a fundamental tool in *nix. It's fast and
reliable. And can't the tail command handle different encodings?
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On Tue, 10 May 2022 at 05:12, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
>
> On Mon, 9 May 2022 at 19:53, Chris Angelico <rosuav@gmail.com> wrote:
> >
> > On Tue, 10 May 2022 at 03:47, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
> > >
> > > On Mon, 9 May 2022 at 07:56, Cameron Simpson <cs@cskk.id.au> wrote:
> > > >
> > > > The point here is that text is a very different thing. Because you
> > > > cannot seek to an absolute number of characters in an encoding with
> > > > variable sized characters. _If_ you did a seek to an arbitrary number
> > > > you can end up in the middle of some character. And there are encodings
> > > > where you cannot inspect the data to find a character boundary in the
> > > > byte stream.
> > >
> > > Ooook, now I understand what you and Barry mean. I suppose there's no
> > > reliable way to tail a big file opened in text mode with decent performance.
> > >
> > > Anyway, the previous-previous function I posted worked only for files
> > > opened in binary mode, and I suppose it's reliable, since it searches
> > > only for b"\n", as readline() in binary mode does.
> >
> > It's still fundamentally impossible to solve this in a general way, so
> > the best way to do things will always be to code for *your* specific
> > use-case. That means that this doesn't belong in the stdlib or core
> > language, but in your own toolkit.
>
> Nevertheless, tail is a fundamental tool in *nix. It's fast and
> reliable. Besides, can't the tail command handle different encodings?

Like most Unix programs, it handles bytes.

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
> On 9 May 2022, at 17:41, ram@zedat.fu-berlin.de wrote:
>
> ?Barry Scott <barry@barrys-emacs.org> writes:
>> Why use tiny chunks? You can read 4KiB as fast as 100 bytes
>
> When optimizing code, it helps to be aware of the orders of
> magnitude

That is true and well known to me; now show how what I said is wrong.

The OS is going to DMA at least 4 KiB, and with read-ahead more like 64 KiB.
So I can get that into Python's memory on the same time scale as 1 byte,
because it's the setup of the I/O that is expensive, not the bytes
transferred.
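
A crude way to check that from Python (rough sketch; "lorem.txt" just
stands in for any large file, and the absolute numbers will vary by
machine and by what the OS has already cached):

import timeit

setup = "f = open('lorem.txt', 'rb')"

# One tiny read vs one 4 KiB read per call; both are dominated by the
# per-call and seek overhead, not by the bytes transferred.
t_1b = timeit.timeit("f.seek(0); f.read(1)", setup=setup, number=100000)
t_4k = timeit.timeit("f.seek(0); f.read(4096)", setup=setup, number=100000)
print(t_1b, t_4k)  # typically the same order of magnitude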

Barry

> . Code that is more cache-friendly is faster, that is,
> code that holds data in a single region of memory and that uses
> regular patterns of access. Chandler Carruth talked about this,
> and I made some notes when watching the video of his talk:
>
> CPUS HAVE A HIERARCHICAL CACHE SYSTEM
> (from a 2014 talk by Chandler Carruth)
>
> One cycle on a 3 GHz processor                   1 ns
> L1 cache reference                             0.5 ns
> Branch mispredict                                5 ns
> L2 cache reference                               7 ns               14x L1 cache
> Mutex lock/unlock                               25 ns
> Main memory reference                          100 ns               20x L2, 200x L1
> Compress 1K bytes with Snappy                3,000 ns
> Send 1K bytes over 1 Gbps network           10,000 ns     0.01 ms
> Read 4K randomly from SSD                  150,000 ns     0.15 ms
> Read 1 MB sequentially from memory         250,000 ns     0.25 ms
> Round trip within same datacenter          500,000 ns     0.5  ms
> Read 1 MB sequentially from SSD          1,000,000 ns     1    ms   4x memory
> Disk seek                               10,000,000 ns    10    ms   20x datacenter RT
> Read 1 MB sequentially from disk        20,000,000 ns    20    ms   80x memory, 20x SSD
> Send packet CA->Netherlands->CA        150,000,000 ns   150    ms
>
> . Remember how recently people here talked about how you cannot
> copy text from a video? Then, how did I do it? Turns out, for my
> operating system, there's a screen OCR program! So I did this OCR
> and then manually corrected a few wrong characters, and was done!
>
>
> --
> https://mail.python.org/mailman/listinfo/python-list
>

--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On Mon, 9 May 2022 21:11:23 +0200, Marco Sulla
<Marco.Sulla.Python@gmail.com> declaimed the following:

>Nevertheless, tail is a fundamental tool in *nix. It's fast and
>reliable. Besides, can't the tail command handle different encodings?

	Based upon
https://github.com/coreutils/coreutils/blob/master/src/tail.c the ONLY
thing tail looks at is the single byte "\n". It does not handle other line
endings, and it appears to perform BINARY I/O, not text I/O. It does nothing
for bytes that are not "\n". Split multi-byte encodings are irrelevant
since, if it does not find enough "\n" bytes in the buffer (chunk), it reads
another binary chunk and scans for additional "\n" bytes. Once it finds the
desired count, it is synchronized on the byte following the "\n" (which,
for multi-byte encodings might be a NUL, but in any event should be a safe
location for subsequent I/O).
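
	In Python terms, that byte-oriented strategy is roughly this (a
minimal untested sketch, with a made-up name, not a translation of the
coreutils code):

import os

def last_lines(path, n=10, chunk=4096):
    # Walk backwards in binary chunks until enough b"\n" bytes have
    # been seen; no decoding happens at this level, exactly as in
    # tail.c.
    with open(path, "rb") as f:
        size = os.stat(path).st_size
        pos = size
        data = b""
        while pos > 0 and data.count(b"\n") <= n:
            pos = max(0, pos - chunk)
            f.seek(pos)
            data = f.read(size - pos)  # re-reads the tail; fine for a sketch
    return b"".join(data.splitlines(keepends=True)[-n:])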

Interpretation of encoding appears to fall to the console driver
configuration when displaying the bytes output by tail.


--
Wulfraed Dennis Lee Bieber AF6VN
wlfraed@ix.netcom.com http://wlfraed.microdiversity.freeddns.org/
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
> On 9 May 2022, at 20:14, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
>
> ?On Mon, 9 May 2022 at 19:53, Chris Angelico <rosuav@gmail.com> wrote:
>>
>>> On Tue, 10 May 2022 at 03:47, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
>>>
>>> On Mon, 9 May 2022 at 07:56, Cameron Simpson <cs@cskk.id.au> wrote:
>>>>
>>>> The point here is that text is a very different thing. Because you
>>>> cannot seek to an absolute number of characters in an encoding with
>>>> variable sized characters. _If_ you did a seek to an arbitrary number
>>>> you can end up in the middle of some character. And there are encodings
>>>> where you cannot inspect the data to find a character boundary in the
>>>> byte stream.
>>>
>>> Ooook, now I understand what you and Barry mean. I suppose there's no
>>> reliable way to tail a big file opened in text mode with decent performance.
>>>
>>> Anyway, the previous-previous function I posted worked only for files
>>> opened in binary mode, and I suppose it's reliable, since it searches
>>> only for b"\n", as readline() in binary mode does.
>>
>> It's still fundamentally impossible to solve this in a general way, so
>> the best way to do things will always be to code for *your* specific
>> use-case. That means that this doesn't belong in the stdlib or core
>> language, but in your own toolkit.
>
> Nevertheless, tail is a fundamental tool in *nix. It's fast and
> reliable. Besides, can't the tail command handle different encodings?

POSIX tail just prints to the output the bytes it finds between \n bytes.
At no time does it need to care about encodings, as that is a problem solved
by the terminal software. I would not expect UTF-16 to work with tail on
Linux systems.

You could always get the source of tail and read its implementation.

Barry

> --
> https://mail.python.org/mailman/listinfo/python-list
>

--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On Tue, 10 May 2022 at 07:07, Barry <barry@barrys-emacs.org> wrote:
> POSIX tail just prints the bytes to the output that it finds between \n bytes.
> At no time does it need to care about encodings as that is a problem solved
> by the terminal software. I would not expect utf-16 to work with tail on
> linux systems.

ASCII text encoded as UTF-16 seems to work fine on my system, which
probably means the terminal is just ignoring all the NUL bytes. But if
there's a random 0x0A byte anywhere, it would probably be counted as a
line break.
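
The raw bytes make both points obvious (quick illustration):

# ASCII text in UTF-16-LE: every other byte is NUL.
print("hi\n".encode("utf-16-le"))    # b'h\x00i\x00\n\x00'

# A code point whose UTF-16 encoding happens to contain 0x0A,
# e.g. U+0A4B: a byte-oriented tail would see a bogus "newline"
# in the middle of the character.
print("\u0a4b".encode("utf-16-le"))  # b'K\n'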

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
Marco Sulla <Marco.Sulla.Python@gmail.com> writes:

On Mon, 9 May 2022 at 19:53, Chris Angelico <rosuav@gmail.com> wrote:
...
Nevertheless, tail is a fundamental tool in *nix. It's fast and
reliable. Besides, can't the tail command handle different encodings?

It definitely can't. It works for UTF-8 and all the ASCII-compatible
single-byte encodings, but feed it a file encoded in UTF-16, and it will
sometimes screw up. (And if you don't redirect the output away from
your terminal, and your terminal encoding isn't also set to UTF-16, you
will likely find yourself looking at gibberish -- but that's another
problem...)
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On Mon, 9 May 2022 at 23:15, Dennis Lee Bieber <wlfraed@ix.netcom.com>
wrote:
>
> On Mon, 9 May 2022 21:11:23 +0200, Marco Sulla
> <Marco.Sulla.Python@gmail.com> declaimed the following:
>
> >Nevertheless, tail is a fundamental tool in *nix. It's fast and
> >reliable. Besides, can't the tail command handle different encodings?
>
> Based upon
> https://github.com/coreutils/coreutils/blob/master/src/tail.c the ONLY
> thing tail looks at is the single byte "\n". It does not handle other line
> endings, and it appears to perform BINARY I/O, not text I/O. It does nothing
> for bytes that are not "\n". Split multi-byte encodings are irrelevant
> since, if it does not find enough "\n" bytes in the buffer (chunk), it reads
> another binary chunk and scans for additional "\n" bytes. Once it finds the
> desired count, it is synchronized on the byte following the "\n" (which,
> for multi-byte encodings might be a NUL, but in any event should be a safe
> location for subsequent I/O).
>
> Interpretation of encoding appears to fall to the console driver
> configuration when displaying the bytes output by tail.

Ok, I understand. This should be a Python implementation of *nix tail:

import os

_lf = b"\n"
_err_n = "Parameter n must be a positive integer number"
_err_chunk_size = "Parameter chunk_size must be a positive integer number"


def tail(filepath, n=10, chunk_size=100):
    if n <= 0:
        raise ValueError(_err_n)

    if n % 1 != 0:
        raise ValueError(_err_n)

    if chunk_size <= 0:
        raise ValueError(_err_chunk_size)

    if chunk_size % 1 != 0:
        raise ValueError(_err_chunk_size)

    n_chunk_size = n * chunk_size
    pos = os.stat(filepath).st_size
    chunk_line_pos = -1
    lines_not_found = n

    with open(filepath, "rb") as f:
        text = bytearray()

        while pos != 0:
            # Step back one chunk (clamped to the start of the file),
            # read it and prepend it to the buffer.
            pos -= n_chunk_size

            if pos < 0:
                pos = 0

            f.seek(pos)
            chars = f.read(n_chunk_size)
            text[0:0] = chars
            search_pos = n_chunk_size

            # Scan the chunk backwards, counting b"\n" occurrences.
            while search_pos != -1:
                chunk_line_pos = chars.rfind(_lf, 0, search_pos)

                if chunk_line_pos != -1:
                    lines_not_found -= 1

                    if lines_not_found == 0:
                        break

                search_pos = chunk_line_pos

            if lines_not_found == 0:
                break

    return bytes(text[chunk_line_pos+1:])

The function opens the file in binary mode and searches only for b"\n". It
returns the last n lines of the file as bytes.

I suppose this function is fast. It reads the bytes from the file in chunks
and stores them in a bytearray, prepending them to it. The final result is
read from the bytearray and converted to bytes (to be consistent with the
read method).

I suppose the function is reliable. The file is opened in binary mode and
only b"\n" is treated as the line ending, as *nix tail (and Python's
readline() in binary mode) do. And bytes are returned. The caller can use
them as is, or convert them to a string using the encoding they want, or do
whatever their imagination suggests :)

Finally, it seems to me the function is quite simple.
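
Intended usage would be something like this (the path is made up, just
to show the calls):

last10 = tail("/var/log/example.log")  # last 10 lines, as bytes
print(last10.decode("utf-8", errors="replace"))

# Files with long lines may benefit from a bigger chunk_size.
last3 = tail("/var/log/example.log", n=3, chunk_size=1000)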

If all my claims are true, the three obstacles Chris described should be
overcome.

I'd very much like to see a CPython implementation of that function. It
could be a method of a file object opened in binary mode, and *only* in
binary mode.

What do you think about it?
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On Thu, 12 May 2022 at 06:03, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
> I suppose this function is fast. It reads the bytes from the file in chunks
> and stores them in a bytearray, prepending them to it. The final result is
> read from the bytearray and converted to bytes (to be consistent with the
> read method).
>
> I suppose the function is reliable. The file is opened in binary mode and
> only b"\n" is treated as the line ending, as *nix tail (and Python's
> readline() in binary mode) do. And bytes are returned. The caller can use
> them as is, or convert them to a string using the encoding they want, or do
> whatever their imagination suggests :)
>
> Finally, it seems to me the function is quite simple.
>
> If all my claims are true, the three obstacles Chris described should be
> overcome.

Have you actually checked those three, or do you merely suppose them to be true?

> I'd very much like to see a CPython implementation of that function. It
> could be a method of a file object opened in binary mode, and *only* in
> binary mode.
>
> What do you think about it?

Still not necessary. You can simply have it in your own toolkit. Why
should it be part of the core language? How much benefit would it be
to anyone else? All the same assumptions are still there, so it still
isn't general, and you may as well just *code to your own needs* like
I've been saying all along. This does not need to be in the standard
library. Do what you need, assume what you can safely assume, and
other people can write different code.

I don't understand why this wants to be in the standard library.

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On Wed, 11 May 2022 at 22:09, Chris Angelico <rosuav@gmail.com> wrote:
>
> Have you actually checked those three, or do you merely suppose them to be true?

I only suppose, as I said. I should do some benchmarks and some other
tests, and, frankly, I don't want to. I don't want to because I'm
quite sure the implementation is fast, since it reads by chunks and
caches them. I'm not sure it's 100% free of bugs, but the concept is
very simple, since it simply mimics the *nix tail, so it should be
reliable.

>
> > I'd very much like to see a CPython implementation of that function. It
> > could be a method of a file object opened in binary mode, and *only* in
> > binary mode.
> >
> > What do you think about it?
>
> Still not necessary. You can simply have it in your own toolkit. Why
> should it be part of the core language?

Why not?

> How much benefit would it be
> to anyone else?

I suppose that every programmer, at least once in their life, has written a tail.

> All the same assumptions are still there, so it still
> isn't general

It's general. It mimics the *nix tail. I can't think of a more general
way to implement a tail.

> I don't understand why this wants to be in the standard library.

Well, the answer is really simple: I needed it, and if I had found it in
the stdlib, I would have used it instead of writing my first horrible
function. Furthermore, tail is such a useful tool that I suppose many
others are interested, based on this quick Google search:

https://www.google.com/search?q=python+tail

A heavily upvoted question on Stack Overflow, many other Stack Overflow
questions, a package that seems to do exactly the same thing, that is,
mimic *nix tail, and a blog post about how to tail in
Python. Furthermore, if you search python tail pypi, you can find a
bunch of other packages:

https://www.google.com/search?q=python+tail+pypi

It seems the subject is quite popular, and I can't imagine otherwise.
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On Thu, 12 May 2022 at 07:27, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
>
> On Wed, 11 May 2022 at 22:09, Chris Angelico <rosuav@gmail.com> wrote:
> >
> > Have you actually checked those three, or do you merely suppose them to be true?
>
> I only suppose, as I said. I should do some benchmarks and some other
> tests, and, frankly, I don't want to. I don't want to because I'm
> quite sure the implementation is fast, since it reads by chunks and
> caches them. I'm not sure it's 100% free of bugs, but the concept is
> very simple, since it simply mimics the *nix tail, so it should be
> reliable.

If you don't care enough to benchmark it or even debug it, why should
anyone else care?

I'm done discussing. You think that someone else should have done this
for you, but you aren't even willing to put in the effort to make this
useful to anyone else. Just use it yourself and have done with it.

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On Thu, 12 May 2022 06:07:18 +1000, Chris Angelico <rosuav@gmail.com>
declaimed the following:

>I don't understand why this wants to be in the standard library.
>
Especially as any Linux distribution probably includes the compiled
"tail" command, so this would only be of use on Windows.

Under recent Windows, one has an equivalent to "tail" IFF using
PowerShell rather than the "DOS" shell.

https://www.middlewareinventory.com/blog/powershell-tail-file-windows-tail-command/

or install a Windows binary equivalent http://tailforwin32.sourceforge.net/


--
Wulfraed Dennis Lee Bieber AF6VN
wlfraed@ix.netcom.com http://wlfraed.microdiversity.freeddns.org/
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
Just FYI, UNIX had a bunch of utilities that could emulate a vanilla version of tail on the command line.
You can use sed, awk and quite a few others to simply show line N to the end of a file, or other variations.
Of course, the way many things were done back then focused less on efficiency than on making stepwise changes in a pipeline, so reading from beginning to end was not an issue.



-----Original Message-----
From: Marco Sulla <Marco.Sulla.Python@gmail.com>
To: Chris Angelico <rosuav@gmail.com>
Cc: python-list@python.org
Sent: Wed, May 11, 2022 5:27 pm
Subject: Re: tail

On Wed, 11 May 2022 at 22:09, Chris Angelico <rosuav@gmail.com> wrote:
>
> Have you actually checked those three, or do you merely suppose them to be true?

I only suppose, as I said. I should do some benchmarks and some other
tests, and, frankly, I don't want to. I don't want to because I'm
quite sure the implementation is fast, since it reads by chunks and
caches them. I'm not sure it's 100% free of bugs, but the concept is
very simple, since it simply mimics the *nix tail, so it should be
reliable.

>
> > I'd very much like to see a CPython implementation of that function. It
> > could be a method of a file object opened in binary mode, and *only* in
> > binary mode.
> >
> > What do you think about it?
>
> Still not necessary. You can simply have it in your own toolkit. Why
> should it be part of the core language?

Why not?

> How much benefit would it be
> to anyone else?

I suppose that every programmer, at least once in their life, has written a tail.

> All the same assumptions are still there, so it still
> isn't general

It's general. It mimics the *nix tail. I can't think of a more general
way to implement a tail.

> I don't understand why this wants to be in the standard library.

Well, the answer is really simple: I needed it, and if I had found it in
the stdlib, I would have used it instead of writing my first horrible
function. Furthermore, tail is such a useful tool that I suppose many
others are interested, based on this quick Google search:

https://www.google.com/search?q=python+tail

A heavily upvoted question on Stack Overflow, many other Stack Overflow
questions, a package that seems to do exactly the same thing, that is,
mimic *nix tail, and a blog post about how to tail in
Python. Furthermore, if you search python tail pypi, you can find a
bunch of other packages:

https://www.google.com/search?q=python+tail+pypi

It seems the subject is quite popular, and I can't imagine otherwise.
--
https://mail.python.org/mailman/listinfo/python-list
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
This seems to be a regular refrain: someone wants something as STANDARD in a programming language or environment, and others want to keep it lean and mean, or do not see THIS suggestion as particularly important or useful.
Looking at the end of something is extremely common. Packages like numpy/pandas in Python often provide functions with names like head or tail, as do other languages where data structures with names like data.frame are commonly used. These structures are in some way indexed, to make it easy to jump towards the end. Text files are not.

Efficiency aside, a 3-year-old (well, certainly a 30-year-old) can cobble together a function that takes a filename assumed to be textual and reads the file into some data structure that stores the lines of the file, so that it can be indexed by line number and also report the index of the final line. The data structure can be a list of lists, or a dictionary with line numbers as keys, or a numpy ...

So the need for this functionality seems obvious, but then what about someone who wants a bunch of random lines from a file? Need we satisfy their wish to pick random offsets from the file and get the line the offset lands in the middle of, or the one about to start? Would that even be random if line lengths vary? Text files were never designed to be used efficiently except for reading and writing, and certainly not for something like sorting.

Again, generally you can read in the darn file, perform the operation and free up whatever memory you do not need. If you have huge files, fine, but then why make a special function part of the default setup if it is rarely used? Why not put it in a module/package called BigFileBatches alongside other functions useful for doing things in batches? Call that when needed, but for smaller files, KISS.
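
For the fits-in-memory case, the whole job is a couple of lines (sketch, with a made-up name):

from pathlib import Path

def simple_tail(path, n=10):
    # Read everything, split, keep the last n lines; fine whenever
    # the file comfortably fits in memory.
    return Path(path).read_text().splitlines()[-n:]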


-----Original Message-----
From: Dennis Lee Bieber <wlfraed@ix.netcom.com>
To: python-list@python.org
Sent: Wed, May 11, 2022 6:15 pm
Subject: Re: tail

On Thu, 12 May 2022 06:07:18 +1000, Chris Angelico <rosuav@gmail.com>
declaimed the following:

>I don't understand why this wants to be in the standard library.
>
    Especially as any Linux distribution probably includes the compiled
"tail" command, so this would only be of use on Windows.

    Under recent Windows, one has an equivalent to "tail" IFF using
PowerShell rather than the "DOS" shell.

https://www.middlewareinventory.com/blog/powershell-tail-file-windows-tail-command/

or install a Windows binary equivalent http://tailforwin32.sourceforge.net/


--
    Wulfraed                Dennis Lee Bieber        AF6VN
    wlfraed@ix.netcom.com    http://wlfraed.microdiversity.freeddns.org/
--
https://mail.python.org/mailman/listinfo/python-list
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On Thu, 12 May 2022 at 00:50, Stefan Ram <ram@zedat.fu-berlin.de> wrote:
>
> Marco Sulla <Marco.Sulla.Python@gmail.com> writes:
> >def tail(filepath, n=10, chunk_size=100):
> > if (n <= 0):
> > raise ValueError(_err_n)
> ...
>
> There's no spec/doc, so one can't even test it.

Excuse me, you're very right.

"""
A function that "tails" the file. If you don't know what that means,
google "man tail"

filepath: the file path of the file to be "tailed"
n: the numbers of lines "tailed"
chunk_size: oh don't care, use it as is
"""
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
Thank you very much. This helped me to improve the function:

import os

_lf = b"\n"
_err_n = "Parameter n must be a positive integer number"
_err_chunk_size = "Parameter chunk_size must be a positive integer number"


def tail(filepath, n=10, chunk_size=100):
    if n <= 0:
        raise ValueError(_err_n)

    if n % 1 != 0:
        raise ValueError(_err_n)

    if chunk_size <= 0:
        raise ValueError(_err_chunk_size)

    if chunk_size % 1 != 0:
        raise ValueError(_err_chunk_size)

    n_chunk_size = n * chunk_size
    pos = os.stat(filepath).st_size
    chunk_line_pos = -1
    newlines_to_find = n
    first_step = True

    with open(filepath, "rb") as f:
        text = bytearray()

        while pos != 0:
            pos -= n_chunk_size

            if pos < 0:
                pos = 0

            f.seek(pos)
            chars = f.read(n_chunk_size)
            text[0:0] = chars
            # Use the real chunk length, not n_chunk_size: the first
            # read can be shorter than a full chunk, and the
            # trailing-newline check below compares against the
            # position just past the last byte read.
            search_pos = len(chars)

            while search_pos != -1:
                chunk_line_pos = chars.rfind(_lf, 0, search_pos)

                # A newline as the very last byte of the file only
                # terminates the last line; don't count it as a
                # line separator.
                if first_step and chunk_line_pos == search_pos - 1:
                    newlines_to_find += 1

                first_step = False

                if chunk_line_pos != -1:
                    newlines_to_find -= 1

                    if newlines_to_find == 0:
                        break

                search_pos = chunk_line_pos

            if newlines_to_find == 0:
                break

    return bytes(text[chunk_line_pos+1:])
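
And a quick sanity check of the trailing-newline handling (throwaway
sketch using temp files):

import tempfile

for content in (b"one\ntwo\nthree\n", b"one\ntwo\nthree"):
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(content)
    # Either way, the last two lines should start at b"two".
    print(tail(f.name, n=2))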



On Thu, 12 May 2022 at 20:29, Stefan Ram <ram@zedat.fu-berlin.de> wrote:

> I am not aware of a definition of "line" above,
> but the PLR says:
>
> |A physical line is a sequence of characters terminated
> |by an end-of-line sequence.
>
> . So 10 lines should have 10 end-of-line sequences.
>

Maybe. Maybe not. What if the file ends with no newline?
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On Thu, 12 May 2022 22:45:42 +0200, Marco Sulla
<Marco.Sulla.Python@gmail.com> declaimed the following:

>
>Maybe. Maybe not. What if the file ends with no newline?

https://github.com/coreutils/coreutils/blob/master/src/tail.c
Lines 567-569 (also lines 550-557 for "bytes_read" determination)



--
Wulfraed Dennis Lee Bieber AF6VN
wlfraed@ix.netcom.com http://wlfraed.microdiversity.freeddns.org/
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On 12May2022 19:48, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
>On Thu, 12 May 2022 at 00:50, Stefan Ram <ram@zedat.fu-berlin.de> wrote:
>> There's no spec/doc, so one can't even test it.
>
>Excuse me, you're very right.
>
>"""
>A function that "tails" the file. If you don't know what that means,
>google "man tail"
>
>filepath: the file path of the file to be "tailed"
>n: the numbers of lines "tailed"
>chunk_size: oh don't care, use it as is

This is nearly the worst "specification" I have ever seen.

Describe what your function _does_.

Do not just send people to an arbitrary search engine to find possibly
ephemeral web pages where someone has typed "man tail" and/or (if lucky)
web pages with the output of "man tail" for any of several platforms.

But this is sounding more and more like a special purpose task to be
done for your particular use cases. That says it should be in your
personal toolkit. If it has general applicability, _publish_ your
toolkit for others to use. You can do that trivially by pushing your
code repo to any of several free services like bitbucket, gitlab,
sourcehut, github etc. Or you can go the extra few yards and publish a
package to PyPI and see if anyone uses it.

Part of your problem is that you think the term "tail" has a specific
simple obvious meaning. But even to me it means at least 2 things:
- to report the last "n" "lines" of a text file
- to continuously report "new" data appended to a file

These are different, though related, tasks. The latter one is
particularly easy if done purely for bytes (on systems which allow it).
As you've had explained to you, the former task is actually very fiddly.
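
The byte-only "follow" flavour, for instance, is little more than this
(rough polling sketch, nothing clever like inotify):

import time

def follow(path, interval=1.0):
    # Yield new bytes appended to the file, like "tail -f".
    with open(path, "rb") as f:
        f.seek(0, 2)  # jump to the current end of file
        while True:
            data = f.read()
            if data:
                yield data
            else:
                time.sleep(interval)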

It is fiddly both in boundary conditions and complicated by being
dependent on the text encoding, which you do not inherently know - that
implies that you ought to (a) provide a way to specify that encoding and
(b) maybe have a reasonable fallback default. But that default needs to
be carefully and precisely explained. And the "find a line ending"
criteria need to be explained. And the "sync to a character boundary"
needs to be explained, including where it cannot be done.
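
Concretely, an interface that makes those choices explicit might look
like this (sketch only; "tail" is the byte-level function from earlier
in the thread, and the defaults are just placeholders):

def tail_text(path, n=10, encoding="utf-8", errors="strict"):
    # Byte-level tail first, then decode. Only self-synchronising
    # encodings such as UTF-8 make the byte-level newline scan safe;
    # for others this can return garbage or raise.
    return tail(path, n).decode(encoding, errors)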

Cheers,
Cameron Simpson <cs@cskk.id.au>
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On Fri, 13 May 2022 at 00:31, Cameron Simpson <cs@cskk.id.au> wrote:

> On 12May2022 19:48, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
> >On Thu, 12 May 2022 at 00:50, Stefan Ram <ram@zedat.fu-berlin.de> wrote:
> >> There's no spec/doc, so one can't even test it.
> >
> >Excuse me, you're very right.
> >
> >"""
> >A function that "tails" the file. If you don't know what that means,
> >google "man tail"
> >
> >filepath: the file path of the file to be "tailed"
> >n: the numbers of lines "tailed"
> >chunk_size: oh don't care, use it as is
>
> This is nearly the worst "specification" I have ever seen.
>

You're lucky. I've seen much worse (or none at all).
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On 2022-05-13 at 12:16:57 +0200,
Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:

> On Fri, 13 May 2022 at 00:31, Cameron Simpson <cs@cskk.id.au> wrote:

[...]

> > This is nearly the worst "specification" I have ever seen.

> You're lucky. I've seen much worse (or none at all).

At least with *no* documentation, the source code stands for itself. If
I can execute it (whatever that entails), then I can (in theory) figure
out *what* it does. I still don't know what it's *supposed* to do, and
therefore *cannot* know how well it does or doesn't "work", but at
least source code is deterministic and unambiguous (except when it
isn't, but let's not go there).
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On Fri, 13 May 2022 at 12:49, <2QdxY4RzWzUUiLuE@potatochowder.com> wrote:
>
> On 2022-05-13 at 12:16:57 +0200,
> Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
>
> > On Fri, 13 May 2022 at 00:31, Cameron Simpson <cs@cskk.id.au> wrote:
>
> [...]
>
> > > This is nearly the worst "specification" I have ever seen.
>
> > You're lucky. I've seen much worse (or none at all).
>
> At least with *no* documentation, the source code stands for itself.

So I did well not to put one in the first place. I think that after
100 posts about tail, chunks etc. it was clear what that stuff was
about and how to use it.

Speaking about more serious things, so far I've tested with:

* a file that does not end with \n
* a file that ends with \n (after Stefan's test)
* a file with more than 10 lines
* a file with less than 10 lines

It seemed to work. I only have to benchmark it now. I suppose I have to
test with at least a 1 GB file, a big lorem ipsum, and do an (admittedly
unfair) comparison with Linux tail. I'll do it when I have time, so Chris
will no longer be angry with me.
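
For the record, those cases look roughly like this as a pytest sketch
(names and layout are made up; it assumes the tail() posted earlier is
in scope):

import pytest

@pytest.mark.parametrize("content, n, expected", [
    (b"a\nb\nc", 2, b"b\nc"),        # no trailing \n
    (b"a\nb\nc\n", 2, b"b\nc\n"),    # trailing \n
    (b"\n".join(b"%d" % i for i in range(20)), 10,
     b"\n".join(b"%d" % i for i in range(10, 20))),  # more than 10 lines
    (b"a\nb\n", 10, b"a\nb\n"),      # fewer than 10 lines
])
def test_tail(tmp_path, content, n, expected):
    p = tmp_path / "f.txt"
    p.write_bytes(content)
    assert tail(str(p), n) == expected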
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
Well, I've done a benchmark.

>>> timeit.timeit("tail('/home/marco/small.txt')", globals={"tail":tail}, number=100000)
1.5963431186974049
>>> timeit.timeit("tail('/home/marco/lorem.txt')", globals={"tail":tail}, number=100000)
2.5240604374557734
>>> timeit.timeit("tail('/home/marco/lorem.txt', chunk_size=1000)", globals={"tail":tail}, number=100000)
1.8944984432309866

small.txt is a text file of 1.3 KB. lorem.txt is a lorem ipsum of 1.2
GB. It seems the performance is good, thanks to the chunk suggestion.

But the time of Linux tail surprises me:

marco@buzz:~$ time tail lorem.txt
[text]

real 0m0.004s
user 0m0.003s
sys 0m0.001s

It's strange that it's so slow. I thought it was because it decodes
and prints the result, but I timed

timeit.timeit("print(tail('/home/marco/lorem.txt').decode('utf-8'))",
globals={"tail":tail}, number=100000)

and I got ~36 seconds. It seems quite strange to me. Maybe I got the
benchmarks wrong at some point?
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On 17May2022 22:45, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
>Well, I've done a benchmark.
>>>> timeit.timeit("tail('/home/marco/small.txt')", globals={"tail":tail}, number=100000)
>1.5963431186974049
>>>> timeit.timeit("tail('/home/marco/lorem.txt')", globals={"tail":tail}, number=100000)
>2.5240604374557734
>>>> timeit.timeit("tail('/home/marco/lorem.txt', chunk_size=1000)", globals={"tail":tail}, number=100000)
>1.8944984432309866

This suggests that the file size does not dominate your runtime. Ah.
_Or_ that there are similar numbers of newlines vs text in the files, so
you're reading similar amounts of data from the end. If the "line density"
of the files were similar you would hope that the runtimes would be
similar.

>small.txt is a text file of 1.3 KB. lorem.txt is a lorem ipsum of 1.2
>GB. It seems the performance is good, thanks to the chunk suggestion.
>
>But the time of Linux tail surprise me:
>
>marco@buzz:~$ time tail lorem.txt
>[text]
>
>real 0m0.004s
>user 0m0.003s
>sys 0m0.001s
>
>It's strange that it's so slow. I thought it was because it decodes
>and print the result, but I timed

You're measuring different things. timeit() tries hard to measure just
the code snippet you provide. It doesn't measure the startup cost of the
whole python interpreter. Try:

time python3 your-tail-prog.py /home/marco/lorem.txt

BTW, does your `tail()` print output? If not, again not measuring the
same thing.

If you have the source of tail(1) to hand, consider getting to the core
and measuring `time()` immediately before and immediately after the
central tail operation and printing the result.

Also: does tail(1) do character set / encoding stuff? Does your Python
code do that? Might be apples and oranges.

Cheers,
Cameron Simpson <cs@cskk.id.au>
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On Wed, 18 May 2022 at 23:32, Cameron Simpson <cs@cskk.id.au> wrote:
>
> On 17May2022 22:45, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
> >Well, I've done a benchmark.
> >>>> timeit.timeit("tail('/home/marco/small.txt')", globals={"tail":tail}, number=100000)
> >1.5963431186974049
> >>>> timeit.timeit("tail('/home/marco/lorem.txt')", globals={"tail":tail}, number=100000)
> >2.5240604374557734
> >>>> timeit.timeit("tail('/home/marco/lorem.txt', chunk_size=1000)", globals={"tail":tail}, number=100000)
> >1.8944984432309866
>
> This suggests that the file size does not dominate your runtime.

Yes, this is what I wanted to test and it seems good.

> Ah.
> _Or_ that there are similar numbers of newlines vs text in the files, so
> you're reading similar amounts of data from the end. If the "line density"
> of the files were similar you would hope that the runtimes would be
> similar.

No, well, small.txt has very short lines. Lorem.txt is a lorem ipsum,
so really long lines. Indeed I get better results tuning chunk_size.
Anyway, even with the default value the performance is not bad at all.

> >But the time of Linux tail surprise me:
> >
> >marco@buzz:~$ time tail lorem.txt
> >[text]
> >
> >real 0m0.004s
> >user 0m0.003s
> >sys 0m0.001s
> >
> >It's strange that it's so slow. I thought it was because it decodes
> >and print the result, but I timed
>
> You're measuring different things. timeit() tries hard to measure just
> the code snippet you provide. It doesn't measure the startup cost of the
> whole python interpreter. Try:
>
> time python3 your-tail-prog.py /home/marco/lorem.txt

Well, I'll try it, but isn't it a bit unfair to compare Python startup with C?
> BTW, does your `tail()` print output? If not, again not measuring the
> same thing.
> [...]
> Also: does tail(1) do character set / encoding stuff? Does your Python
> code do that? Might be apples and oranges.

Well, as I wrote I also timed

timeit.timeit("print(tail('/home/marco/lorem.txt').decode('utf-8'))",
globals={"tail":tail}, number=100000)

and I got ~36 seconds.

> If you have the source of tail(1) to hand, consider getting to the core
> and measuring `time()` immediately before and immediately after the
> central tail operation and printing the result.

IMHO this is a very good idea, but I have to find the time(). Ahah. Emh.
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On 19May2022 19:50, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
>On Wed, 18 May 2022 at 23:32, Cameron Simpson <cs@cskk.id.au> wrote:
>> You're measuring different things. timeit() tries hard to measure
>> just
>> the code snippet you provide. It doesn't measure the startup cost of the
>> whole python interpreter. Try:
>>
>> time python3 your-tail-prog.py /home/marco/lorem.txt
>
>Well, I'll try it, but isn't it a bit unfair to compare Python startup with C?

Yes it is. But timeit goes the other way and only measures the code.
Admittedly I'd expect a C tail to be pretty quick anyway. But... even a
small C programme often has a surprising degree of startup these days,
what with dynamically linked libraries, locale lookups etc etc. Try:

strace tail some-empty-file.txt

and see what goes on. If you're on slow hard drives, what is cached in
memory and what isn't can have a surprising effect.

Cheers,
Cameron Simpson <cs@cskk.id.au>
--
https://mail.python.org/mailman/listinfo/python-list