Mailing List Archive

tail
What about introducing a method for text streams that reads the lines
from the bottom? Java also has a ReversedLinesFileReader with Apache
Commons IO.
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail
On Sun, 24 Apr 2022 at 04:37, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
>
> What about introducing a method for text streams that reads the lines
> from the bottom? Java also has a ReversedLinesFileReader with Apache
> Commons IO.

It's fundamentally difficult to get precise. In general, there are
three steps to reading the last N lines of a file:

1) Find out the size of the file (its current size, if it's still growing)
2) Seek to the end of the file, minus some threshold that you hope
will contain a number of lines
3) Read from there to the end of the file, split it into lines, and
keep the last N

Reading the preceding N lines is basically a matter of repeating the
same exercise, but instead of "end of the file", use the byte position
of the line you last read.

The problem is, seeking around in a file is done by bytes, not
characters. So if you know for sure that you can resynchronize
(possible with UTF-8, not possible with some other encodings), then
you can do this, but it's probably best to build it yourself (opening
the file in binary mode).
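
To illustrate, a rough sketch of those three steps (untested; the name
tail_n is made up, and it assumes a UTF-8 file):

import os

def tail_n(path, n, chunk=8192):
    """Return the last n lines of a UTF-8 text file (sketch)."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        end = f.tell()                 # step 1: current size
        pos = end
        data = b""
        while pos > 0 and data.count(b"\n") <= n:
            pos = max(0, pos - chunk)  # step 2: seek back a threshold
            f.seek(pos)
            data = f.read(end - pos)
        # step 3: split and keep the last n lines
        return [line.decode("utf-8") for line in data.splitlines()[-n:]]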

This is quite inefficient in general. It would be far FAR easier to do
this instead:

1) Read the entire file and decode bytes to text
2) Split into lines
3) Iterate backwards over the lines

Tada! Done. And in Python, quite easy. The downside, of course, is
that you have to store the entire file in memory.
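
As a quick sketch (again, names made up):

def tail_lines(path, n, encoding="utf-8"):
    # Read everything, split into lines, keep the last n.
    with open(path, encoding=encoding) as f:
        return f.read().splitlines()[-n:]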

So it's up to you: pay the memory price, or pay the complexity price.

Personally, unless the file is tremendously large and I know for sure
that I'm not going to end up iterating over it all, I would pay the
memory price.

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail
On Sat, 23 Apr 2022 at 20:59, Chris Angelico <rosuav@gmail.com> wrote:
>
> On Sun, 24 Apr 2022 at 04:37, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
> >
> > What about introducing a method for text streams that reads the lines
> > from the bottom? Java also has a ReversedLinesFileReader with Apache
> > Commons IO.
>
> It's fundamentally difficult to get precise. In general, there are
> three steps to reading the last N lines of a file:
>
> 1) Find out the size of the file (its current size, if it's still growing)
> 2) Seek to the end of the file, minus some threshold that you hope
> will contain a number of lines
> 3) Read from there to the end of the file, split it into lines, and
> keep the last N
>
> Reading the preceding N lines is basically a matter of repeating the
> same exercise, but instead of "end of the file", use the byte position
> of the line you last read.
>
> The problem is, seeking around in a file is done by bytes, not
> characters. So if you know for sure that you can resynchronize
> (possible with UTF-8, not possible with some other encodings), then
> you can do this, but it's probably best to build it yourself (opening
> the file in binary mode).

Well, indeed I have an implementation that does more or less what you
described, for UTF-8 only. The only difference is that I just started
from the end of the file minus 1. I'm just wondering whether this would
be useful in the stdlib. I think it's not too difficult to generalise
for every encoding.

> This is quite inefficient in general.

Why inefficient? I think that readlines() will be much slower, not
to mention more memory consuming.
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail
On Sun, 24 Apr 2022 at 06:41, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
>
> On Sat, 23 Apr 2022 at 20:59, Chris Angelico <rosuav@gmail.com> wrote:
> >
> > On Sun, 24 Apr 2022 at 04:37, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
> > >
> > > What about introducing a method for text streams that reads the lines
> > > from the bottom? Java also has a ReversedLinesFileReader with Apache
> > > Commons IO.
> >
> > It's fundamentally difficult to get precise. In general, there are
> > three steps to reading the last N lines of a file:
> >
> > 1) Find out the size of the file (its current size, if it's still growing)
> > 2) Seek to the end of the file, minus some threshold that you hope
> > will contain a number of lines
> > 3) Read from there to the end of the file, split it into lines, and
> > keep the last N
> >
> > Reading the preceding N lines is basically a matter of repeating the
> > same exercise, but instead of "end of the file", use the byte position
> > of the line you last read.
> >
> > The problem is, seeking around in a file is done by bytes, not
> > characters. So if you know for sure that you can resynchronize
> > (possible with UTF-8, not possible with some other encodings), then
> > you can do this, but it's probably best to build it yourself (opening
> > the file in binary mode).
>
> Well, indeed I have an implementation that does more or less what you
> described, for UTF-8 only. The only difference is that I just started
> from the end of the file minus 1. I'm just wondering whether this would
> be useful in the stdlib. I think it's not too difficult to generalise
> for every encoding.
>
> > This is quite inefficient in general.
>
> Why inefficient? I think that readlines() will be much slower, not
> to mention more memory consuming.

It depends on which is more costly: reading the whole file (cost
depends on size of file) or reading chunks and splitting into lines
(cost depends on how well you guess at chunk size). If the lines are
all *precisely* the same number of bytes each, you can pick a chunk
size and step backwards with near-perfect efficiency (it's still
likely to be less efficient than reading a file forwards, on most file
systems, but it'll be close); but if you have to guess, adjust, and
keep going, then you lose efficiency there.

I don't think this is necessary in the stdlib. If anything, it might
be good on PyPI, but I for one have literally never wanted this.

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail
On Sat, 23 Apr 2022 at 23:00, Chris Angelico <rosuav@gmail.com> wrote:
> > > This is quite inefficient in general.
> >
> > Why inefficient? I think that readlines() will be much slower, not
> > to mention more memory consuming.
>
> It depends on which is more costly: reading the whole file (cost
> depends on size of file) or reading chunks and splitting into lines
> (cost depends on how well you guess at chunk size). If the lines are
> all *precisely* the same number of bytes each, you can pick a chunk
> size and step backwards with near-perfect efficiency (it's still
> likely to be less efficient than reading a file forwards, on most file
> systems, but it'll be close); but if you have to guess, adjust, and
> keep going, then you lose efficiency there.

Emh, why chunks? My function simply reads byte by byte and compares it to
b"\n". When it finds one, it stops and does a readline():

import os

def tail(filepath):
    """
    @author Marco Sulla
    @date May 31, 2016
    """

    try:
        filepath.is_file  # duck-type check: accept a pathlib.Path too
        fp = str(filepath)
    except AttributeError:
        fp = filepath

    with open(fp, "rb") as f:
        size = os.stat(fp).st_size
        start_pos = 0 if size - 1 < 0 else size - 1

        if start_pos != 0:
            f.seek(start_pos)
            char = f.read(1)

            if char == b"\n":  # skip the newline that terminates the file
                start_pos -= 1
                f.seek(start_pos)

        if start_pos == 0:
            f.seek(start_pos)
        else:
            # walk backwards one byte at a time until a newline is found
            for pos in range(start_pos, -1, -1):
                f.seek(pos)

                char = f.read(1)

                if char == b"\n":
                    break
            else:
                f.seek(0)  # no newline found: the file is a single line

        return f.readline()

This is only for one line and in utf8, but it can be generalised.
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail
On Sun, 24 Apr 2022 at 07:13, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
>
> On Sat, 23 Apr 2022 at 23:00, Chris Angelico <rosuav@gmail.com> wrote:
> > > > This is quite inefficient in general.
> > >
> > > Why inefficient? I think that readlines() will be much slower, not
> > > to mention more memory consuming.
> >
> > It depends on which is more costly: reading the whole file (cost
> > depends on size of file) or reading chunks and splitting into lines
> > (cost depends on how well you guess at chunk size). If the lines are
> > all *precisely* the same number of bytes each, you can pick a chunk
> > size and step backwards with near-perfect efficiency (it's still
> > likely to be less efficient than reading a file forwards, on most file
> > systems, but it'll be close); but if you have to guess, adjust, and
> > keep going, then you lose efficiency there.
>
> Emh, why chunks? My function simply reads byte by byte and compares it to b"\n". When it finds one, it stops and does a readline():
>
> [code snipped; see above]
>
> This is only for one line and in utf8, but it can be generalised.
>

Ah. Well, then, THAT is why it's inefficient: you're seeking back one
single byte at a time, then reading forwards. That is NOT going to
play nicely with file systems or buffers.

Compare reading line by line over the file with readlines() and you'll
see how abysmal this is.

If you really only need one line (which isn't what your original post
suggested), I would recommend starting with a chunk that is likely to
include a full line, and expanding the chunk until you have that
newline. Much more efficient than one byte at a time.
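
Something like this (untested sketch; names made up, assumes UTF-8):

import os

def last_line(path, first_chunk=1024):
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        size = f.tell()
        n = min(first_chunk, size)
        while True:
            f.seek(size - n)
            data = f.read(n)
            # ignore a single newline that terminates the file
            body = data[:-1] if data.endswith(b"\n") else data
            nl = body.rfind(b"\n")
            if nl != -1 or n == size:
                return body[nl + 1:].decode("utf-8")
            n = min(n * 2, size)  # no newline yet: double the chunk, retry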

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail
On 2022-04-24 04:57:20 +1000, Chris Angelico wrote:
> On Sun, 24 Apr 2022 at 04:37, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
> > What about introducing a method for text streams that reads the lines
> > from the bottom? Java also has a ReversedLinesFileReader with Apache
> > Commons IO.
>
> It's fundamentally difficult to get precise. In general, there are
> three steps to reading the last N lines of a file:
>
> 1) Find out the size of the file (its current size, if it's still growing)
> 2) Seek to the end of the file, minus some threshold that you hope
> will contain a number of lines
> 3) Read from there to the end of the file, split it into lines, and
> keep the last N
[...]
> This is quite inefficient in general. It would be far FAR easier to do
> this instead:
>
> 1) Read the entire file and decode bytes to text
> 2) Split into lines
> 3) Iterate backwards over the lines

Which one is more efficient depends very much on the size of the file.
For a file of a few kilobytes, the second solution is probably more
efficient. But for a few gigabytes, that's almost certainly not the
case.

> Tada! Done. And in Python, quite easy. The downside, of course, is
> that you have to store the entire file in memory.

Not just memory. You have to read the whole file in the first place. Which is
hardly efficient if you only need a tiny fraction.

> Personally, unless the file is tremendously large and I know for sure
> that I'm not going to end up iterating over it all, I would pay the
> memory price.

Me, too. Problem with a library function (as Marco proposes) is that you
don't know how it will be used.

hp

--
_ | Peter J. Holzer | Story must make more sense than reality.
|_|_) | |
| | | hjp@hjp.at | -- Charles Stross, "Creative writing
__/ | http://www.hjp.at/ | challenge!"
Re: tail
On 24/04/2022 09.15, Chris Angelico wrote:
> On Sun, 24 Apr 2022 at 07:13, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
>>
>> On Sat, 23 Apr 2022 at 23:00, Chris Angelico <rosuav@gmail.com> wrote:
>>>>> This is quite inefficient in general.
>>>>
>>>> Why inefficient? I think that readlines() will be much slower, not
>>>> to mention more memory consuming.
>>>
>>> It depends on which is more costly: reading the whole file (cost
>>> depends on size of file) or reading chunks and splitting into lines
>>> (cost depends on how well you guess at chunk size). If the lines are
>>> all *precisely* the same number of bytes each, you can pick a chunk
>>> size and step backwards with near-perfect efficiency (it's still
>>> likely to be less efficient than reading a file forwards, on most file
>>> systems, but it'll be close); but if you have to guess, adjust, and
>>> keep going, then you lose efficiency there.
>>
>> Emh, why chunks? My function simply reads byte by byte and compares it to b"\n". When it finds one, it stops and does a readline():
...

> Ah. Well, then, THAT is why it's inefficient: you're seeking back one
> single byte at a time, then reading forwards. That is NOT going to
> play nicely with file systems or buffers.
>
> Compare reading line by line over the file with readlines() and you'll
> see how abysmal this is.
>
> If you really only need one line (which isn't what your original post
> suggested), I would recommend starting with a chunk that is likely to
> include a full line, and expanding the chunk until you have that
> newline. Much more efficient than one byte at a time.


Disagreeing with @Chris in the sense that I use tail very frequently,
and usually in the context of server logs - but I'm talking about the
Linux implementation, not Python code!

Agree with @Chris' assessment of the (in)efficiency. It is more likely
than not that you will have a good idea of the length of each line.
Even if the line-length is highly variable (thinking of some of my
applications of the Python logging module!), one can still 'take a stab
at it' (a "thumb suck" as an engineer-colleague used to say - apparently
not an electrical engineer!) by remembering that lines exceeding
80 characters become less readable and thus have likely/hopefully been
split into two.

Thus,

N*(80+p)

where N is the number of lines desired and p is a reasonable per-line
'safety'/over-estimation allowance, would give a good chunk size.
Grab that much of the end of the file in binary mode, split on
line-endings, and take the last N elements from that list (with
'recovery code' just in case the 'chunk' wasn't large enough).

Adding to the efficiency (of the algorithm, but not the dev-time),
consider that shorter files are likely to be more easily handled by
reading serially from the beginning. To side-step @Chris' criticism, use
a generator to produce the individual lines (lazy evaluation and low
storage requirement) and feed them into a circular-queue which is
limited to N entries. QED: as fast as the machine's I/O, and undemanding
of storage-space!
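
With the stdlib, that circular-queue is nearly a one-liner (a sketch,
assuming UTF-8 text):

import collections

def tail_serial(path, n):
    # Read forward lazily; the deque keeps only the last n lines.
    with open(path, encoding="utf-8") as f:
        return list(collections.deque(f, maxlen=n))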

Running a few timing trials should reveal the 'sweet spot', at which one
algorithm takes-over from the other!

NB quite a few of IBM's (extensively researched) algorithms which formed
utility program[me]s on mainframes, made similar such algorithmic
choices, in the pursuit of efficiencies.
--
Regards,
=dn
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail
On 24Apr2022 07:15, Chris Angelico <rosuav@gmail.com> wrote:
>On Sun, 24 Apr 2022 at 07:13, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
>> Emh, why chunks? My function simply reads byte by byte and compares
>> it to b"\n". When it finds one, it stops and does a readline():
[...]
>> This is only for one line and in utf8, but it can be generalised.

For some encodings that generalisation might be hard. But mostly, yes.

>Ah. Well, then, THAT is why it's inefficient: you're seeking back one
>single byte at a time, then reading forwards. That is NOT going to
>play nicely with file systems or buffers.

An approach I think you both may have missed: mmap the file and use
mmap.rfind(b'\n') to locate line delimiters.
https://docs.python.org/3/library/mmap.html#mmap.mmap.rfind

Avoids sucking the whole file into memory in the usual sense; instead
the file is paged in as needed. Far more efficient than a seek/read
single-byte approach.
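
A sketch of what I mean (untested; assumes a non-empty UTF-8 file):

import mmap

def tail_mmap(path, n):
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            end = len(mm)
            if end and mm[end - 1:end] == b"\n":
                end -= 1  # ignore the newline that terminates the file
            pos = end
            for _ in range(n):  # step back over n line delimiters
                pos = mm.rfind(b"\n", 0, pos)
                if pos == -1:
                    break
            return mm[pos + 1:end].decode("utf-8").splitlines()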

If the file's growing you can do this to start with, then do a normal
file open from your end point to follow accruing text. (Or reuse the
descriptor you used for the mmap, but using f.read().)

Cheers,
Cameron Simpson <cs@cskk.id.au>
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail
On Sun, 24 Apr 2022 at 08:03, Peter J. Holzer <hjp-python@hjp.at> wrote:
>
> On 2022-04-24 04:57:20 +1000, Chris Angelico wrote:
> > On Sun, 24 Apr 2022 at 04:37, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
> > > What about introducing a method for text streams that reads the lines
> > > from the bottom? Java also has a ReversedLinesFileReader with Apache
> > > Commons IO.
> >
> > It's fundamentally difficult to get precise. In general, there are
> > three steps to reading the last N lines of a file:
> >
> > 1) Find out the size of the file (its current size, if it's still growing)
> > 2) Seek to the end of the file, minus some threshold that you hope
> > will contain a number of lines
> > 3) Read from there to the end of the file, split it into lines, and
> > keep the last N
> [...]
> > This is quite inefficient in general. It would be far FAR easier to do
> > this instead:
> >
> > 1) Read the entire file and decode bytes to text
> > 2) Split into lines
> > 3) Iterate backwards over the lines
>
> Which one is more efficient depends very much on the size of the file.
> For a file of a few kilobytes, the second solution is probably more
> efficient. But for a few gigabytes, that's almost certainly not the
> case.

Yeah. I said "easier", not necessarily more efficient. Which is more
efficient is a virtually unanswerable question (will you need to
iterate over the whole file or stop part way? Is the file stored
contiguously? Can you memory map it in some way?), so it's going to
depend a lot on your use-case.

> > Tada! Done. And in Python, quite easy. The downside, of course, is
> > that you have to store the entire file in memory.
>
> Not just memory. You have to read the whole file in the first place. Which is
> hardly efficient if you only need a tiny fraction.

Right - if that's the case, then the chunked form, even though it's
harder, would be worth doing.

> > Personally, unless the file is tremendously large and I know for sure
> > that I'm not going to end up iterating over it all, I would pay the
> > memory price.
>
> Me, too. Problem with a library function (as Marco proposes) is that you
> don't know how it will be used.
>

Yup. And there may be other options worth considering, like
maintaining an index (a bunch of "line 142857 is at byte position
3141592" entries) which would allow random access... but at some
point, if your file is that big, you probably shouldn't be storing it
as a file of lines of text. Use a database instead.
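
Building such an index is a single forward pass, e.g. (hypothetical
helper, not an existing API):

def build_line_index(path):
    # index[i] = byte offset of the start of line i (0-based)
    index = [0]
    with open(path, "rb") as f:
        for line in f:
            index.append(index[-1] + len(line))
    return index

Then f.seek(index[142856]) puts you at the start of line 142857.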

Reading a text file backwards by lines is, by definition, hard. Every
file format I know of that involves starting at the end of the file is
defined in binary, so you can actually seek, and is usually defined
with fixed-size structures (so you just go "read the last 768 bytes of
the file" or something).

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail
On Sun, 24 Apr 2022 at 08:06, dn <PythonList@danceswithmice.info> wrote:
>
> On 24/04/2022 09.15, Chris Angelico wrote:
> > On Sun, 24 Apr 2022 at 07:13, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
> >>
> >> On Sat, 23 Apr 2022 at 23:00, Chris Angelico <rosuav@gmail.com> wrote:
> >>>>> This is quite inefficient in general.
> >>>>
> >>>> Why inefficient? I think that readlines() will be much slower, not
> >>>> to mention more memory consuming.
> >>>
> >>> It depends on which is more costly: reading the whole file (cost
> >>> depends on size of file) or reading chunks and splitting into lines
> >>> (cost depends on how well you guess at chunk size). If the lines are
> >>> all *precisely* the same number of bytes each, you can pick a chunk
> >>> size and step backwards with near-perfect efficiency (it's still
> >>> likely to be less efficient than reading a file forwards, on most file
> >>> systems, but it'll be close); but if you have to guess, adjust, and
> >>> keep going, then you lose efficiency there.
> >>
> >> Emh, why chunks? My function simply reads byte by byte and compares it to b"\n". When it finds one, it stops and does a readline():
> ...
>
> > Ah. Well, then, THAT is why it's inefficient: you're seeking back one
> > single byte at a time, then reading forwards. That is NOT going to
> > play nicely with file systems or buffers.
> >
> > Compare reading line by line over the file with readlines() and you'll
> > see how abysmal this is.
> >
> > If you really only need one line (which isn't what your original post
> > suggested), I would recommend starting with a chunk that is likely to
> > include a full line, and expanding the chunk until you have that
> > newline. Much more efficient than one byte at a time.
>
>
> Disagreeing with @Chris in the sense that I use tail very frequently,
> and usually in the context of server logs - but I'm talking about the
> Linux implementation, not Python code!

tail(1) doesn't read a single byte at a time. It works in chunks,
more-or-less the way I described. (That's where I borrowed the
technique from.) It finds one block of lines, and displays those; it
doesn't iterate backwards.

(By the way, the implementation of "tail -f" is actually less
complicated than you might think, as long as inotify is available.
That's been much more useful to me than reading a file backwards.)
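
(Without inotify, a simple polling version covers a lot of cases. A
rough sketch:)

import time

def follow(path, poll=0.1):
    # Yield lines as they are appended to the file, like "tail -f".
    with open(path) as f:
        f.seek(0, 2)  # 2 == os.SEEK_END
        while True:
            line = f.readline()
            if line:
                yield line
            else:
                time.sleep(poll)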

> Agree with @Chris' assessment of the (in)efficiency. It is more likely
> than not that you will have a good idea of the length of each line.
> Even if the line-length is highly variable (thinking of some of my
> applications of the Python logging module!), one can still 'take a stab
> at it' (a "thumb suck" as an engineer-colleague used to say - apparently
> not an electrical engineer!) by remembering that lines exceeding
> 80 characters become less readable and thus have likely/hopefully been
> split into two.
>
> Thus,
>
> N*(80+p)
>
> where N is the number of lines desired and p is a reasonable per-line
> 'safety'/over-estimation allowance, would give a good chunk size.
> Grab that much of the end of the file in binary mode, split on
> line-endings, and take the last N elements from that list (with
> 'recovery code' just in case the 'chunk' wasn't large enough).

Yup, that's the broad idea of the chunked read. If you know how many
lines you're going to need, that's not too bad. If you need to iterate
backwards over the file (as the original question suggested), that
gets complicated fast.

> Adding to the efficiency (of the algorithm, but not the dev-time),
> consider that shorter files are likely to be more easily handled by
> reading serially from the beginning. To side-step @Chris' criticism, use
> a generator to produce the individual lines (lazy evaluation and low
> storage requirement) and feed them into a circular-queue which is
> limited to N entries. QED: as fast as the machine's I/O, and undemanding
> of storage-space!

Still needs to read the whole file though.

> Running a few timing trials should reveal the 'sweet spot', at which one
> algorithm takes-over from the other!

Unfortunately, the sweet spot will depend on a lot of things other
than just the file size. Is it stored contiguously? Is it on hard
drive or SSD? Can it be read from the disk cache? How frequently do
the chunk size guesses fail, and how often do they grab more than is
necessary?

It's basically unknowable, so the best you can do is judge your own
app, pick one, and go with it.

> NB quite a few of IBM's (extensively researched) algorithms which formed
> utility program[me]s on mainframes, made similar such algorithmic
> choices, in the pursuit of efficiencies.

Indeed. Just make a choice and run with it, and accept that there will
be situations where it'll be inefficient.

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail
On Sun, 24 Apr 2022 at 08:18, Cameron Simpson <cs@cskk.id.au> wrote:
>
> On 24Apr2022 07:15, Chris Angelico <rosuav@gmail.com> wrote:
> >On Sun, 24 Apr 2022 at 07:13, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
> >> Emh, why chunks? My function simply reads byte per byte and compares
> >> it to b"\n". When it find it, it stops and do a readline():
> [...]
> >> This is only for one line and in utf8, but it can be generalised.
>
> For some encodings that generalisation might be hard. But mostly, yes.
>
> >Ah. Well, then, THAT is why it's inefficient: you're seeking back one
> >single byte at a time, then reading forwards. That is NOT going to
> >play nicely with file systems or buffers.
>
> An approach I think you both may have missed: mmap the file and use
> mmap.rfind(b'\n') to locate line delimiters.
> https://docs.python.org/3/library/mmap.html#mmap.mmap.rfind

Yeah, I made a vague allusion to use of mmap, but didn't elaborate
because I actually have zero idea of how efficient this would be.
Would it be functionally equivalent to the chunking, but with the
chunk size defined by the system as whatever's most optimal? It would
need to be tested.

I've never used mmap for this kind of job, so it's not something I'm
comfortable predicting the performance of.

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail
On 24Apr2022 08:21, Chris Angelico <rosuav@gmail.com> wrote:
>On Sun, 24 Apr 2022 at 08:18, Cameron Simpson <cs@cskk.id.au> wrote:
>> An approach I think you both may have missed: mmap the file and use
>> mmap.rfind(b'\n') to locate line delimiters.
>> https://docs.python.org/3/library/mmap.html#mmap.mmap.rfind
>
>Yeah, I made a vague allusion to use of mmap, but didn't elaborate
>because I actually have zero idea of how efficient this would be.
>Would it be functionally equivalent to the chunking, but with the
>chunk size defined by the system as whatever's most optimal? It would
>need to be tested.

True. I'd expect better than single byte seek/read though.

>I've never used mmap for this kind of job, so it's not something I'm
>comfortable predicting the performance of.

Fair.

But it would make for much easier-to-read code.

Cheers,
Cameron Simpson <cs@cskk.id.au>
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail
On Sun, 24 Apr 2022 at 10:04, Cameron Simpson <cs@cskk.id.au> wrote:
>
> On 24Apr2022 08:21, Chris Angelico <rosuav@gmail.com> wrote:
> >On Sun, 24 Apr 2022 at 08:18, Cameron Simpson <cs@cskk.id.au> wrote:
> >> An approach I think you both may have missed: mmap the file and use
> >> mmap.rfind(b'\n') to locate line delimiters.
> >> https://docs.python.org/3/library/mmap.html#mmap.mmap.rfind
> >
> >Yeah, I made a vague allusion to use of mmap, but didn't elaborate
> >because I actually have zero idea of how efficient this would be.
> >Would it be functionally equivalent to the chunking, but with the
> >chunk size defined by the system as whatever's most optimal? It would
> >need to be tested.
>
> True. I'd expect better than single byte seek/read though.
>

Yeah, I think pretty much *anything* would be better than single byte seeks.

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail
dn schreef op 24/04/2022 om 0:04:
> Disagreeing with @Chris in the sense that I use tail very frequently,
> and usually in the context of server logs - but I'm talking about the
> Linux implementation, not Python code!
If I understand Marco correctly, what he wants is to read the lines from
bottom to top, i.e. tac instead of tail, despite his subject.
I use tail very frequently too, but tac is something I almost never use.

--
"Peace cannot be kept by force. It can only be achieved through understanding."
-- Albert Einstein

--
https://mail.python.org/mailman/listinfo/python-list
Re: tail
Op 23/04/2022 om 20:57 schreef Chris Angelico:
> On Sun, 24 Apr 2022 at 04:37, Marco Sulla<Marco.Sulla.Python@gmail.com> wrote:
>> What about introducing a method for text streams that reads the lines
>> from the bottom? Java has also a ReversedLinesFileReader with Apache
>> Commons IO.
>
> 1) Read the entire file and decode bytes to text
> 2) Split into lines
> 3) Iterate backwards over the lines
>
> Tada! Done. And in Python, quite easy. The downside, of course, is
> that you have to store the entire file in memory.

Why not just do:

import collections

tail = collections.deque(text_stream, maxlen=nr_of_lines)
tail.reverse()
...

--
Antoon Pardon

--
https://mail.python.org/mailman/listinfo/python-list
Re: tail
On Sun, 24 Apr 2022 at 21:11, Antoon Pardon <antoon.pardon@vub.be> wrote:
>
>
>
> Op 23/04/2022 om 20:57 schreef Chris Angelico:
> > On Sun, 24 Apr 2022 at 04:37, Marco Sulla<Marco.Sulla.Python@gmail.com> wrote:
> >> What about introducing a method for text streams that reads the lines
> >> from the bottom? Java also has a ReversedLinesFileReader with Apache
> >> Commons IO.
> >
> > 1) Read the entire file and decode bytes to text
> > 2) Split into lines
> > 3) Iterate backwards over the lines
> >
> > Tada! Done. And in Python, quite easy. The downside, of course, is
> > that you have to store the entire file in memory.
>
> Why not just do:
>
> import collections
>
> tail = collections.deque(text_stream, maxlen=nr_of_lines)
> tail.reverse()
> ...
>

You still need to read the entire file, and you also restrict the max
line count, so you can't iterate this to take the next block of lines.

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail
I have been getting confused by how many interpretations and conditions for 'chasing the tail' people seem to be talking about.
A fairly normal task is to want to see just the last N lines of a text-based file.
A variant is the "tail -f" command from UNIX that continues to follow a growing file, often into a pipeline for further processing.
The variant now being mentioned is a sort of "reverse" that has nothing to do with that kind of "tail" except if the implementation is to read the file backwards. A very straightforward way to reverse a file takes perhaps two lines of Python code by reading forward to fill a list with lines of text, then using an index that reverses it.
The issues being considered are memory and whether to read the entire file.
I would think reading a file forwards in big chunks to be far faster and simpler than various schemes mentioned here for reading it backwards. It only makes sense if the goal is not reversal of all the contents.
Also noted is that memory use can be minimized in various ways so that only the final results are kept around. And if you really want more random access to files that you view as being organized as lines of text with a fixed or maximum width, then storing in some database format, perhaps indexed, may be a way to go.

A time-stamped log file is a good example.
So which problem is really supposed to be solved by the original question?



-----Original Message-----
From: Roel Schroeven <roel@roelschroeven.net>
To: python-list@python.org
Sent: Sun, Apr 24, 2022 5:19 am
Subject: Re: tail

dn schreef op 24/04/2022 om 0:04:
> Disagreeing with @Chris in the sense that I use tail very frequently,
> and usually in the context of server logs - but I'm talking about the
> Linux implementation, not Python code!
If I understand Marco correctly, what he wants is to read the lines from
bottom to top, i.e. tac instead of tail, despite his subject.
I use tail very frequently too, but tac is something I almost never use.

--
"Peace cannot be kept by force. It can only be achieved through understanding."
        -- Albert Einstein

--
https://mail.python.org/mailman/listinfo/python-list
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail
On Sat, 23 Apr 2022 at 23:18, Chris Angelico <rosuav@gmail.com> wrote:

> Ah. Well, then, THAT is why it's inefficient: you're seeking back one
> single byte at a time, then reading forwards. That is NOT going to
> play nicely with file systems or buffers.
>
> Compare reading line by line over the file with readlines() and you'll
> see how abysmal this is.
>
> If you really only need one line (which isn't what your original post
> suggested), I would recommend starting with a chunk that is likely to
> include a full line, and expanding the chunk until you have that
> newline. Much more efficient than one byte at a time.
>

Well, I would like to have a sort of tail, so as to generalise to more
than 1 line. But I think that once you have a good algorithm for one line,
you can repeat it N times.

I understand that you can read a chunk instead of a single byte, so when
the newline is found you can return all the cached chunks concatenated. But
will this make the search for the start of the line faster? I suppose you
always have to read byte by byte (or more, if you're using UTF-16 etc.) and
see if there's a newline.
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail
On Sun, 24 Apr 2022 at 00:19, Cameron Simpson <cs@cskk.id.au> wrote:

> An approach I think you both may have missed: mmap the file and use
> mmap.rfind(b'\n') to locate line delimiters.
> https://docs.python.org/3/library/mmap.html#mmap.mmap.rfind
>

Ah, I've played very little with mmap; I didn't know about this. So I
suppose you can locate the newline and at that point read the line without
using chunks?
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail
On Mon, 25 Apr 2022 at 01:47, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
>
>
>
> On Sat, 23 Apr 2022 at 23:18, Chris Angelico <rosuav@gmail.com> wrote:
>>
>> Ah. Well, then, THAT is why it's inefficient: you're seeking back one
>> single byte at a time, then reading forwards. That is NOT going to
>> play nicely with file systems or buffers.
>>
>> Compare reading line by line over the file with readlines() and you'll
>> see how abysmal this is.
>>
>> If you really only need one line (which isn't what your original post
>> suggested), I would recommend starting with a chunk that is likely to
>> include a full line, and expanding the chunk until you have that
>> newline. Much more efficient than one byte at a time.
>
>
> Well, I would like to have a sort of tail, so as to generalise to more than 1 line. But I think that once you have a good algorithm for one line, you can repeat it N times.
>

Not always. If you know you want to read 5 lines, reading them in one
go is much more efficient than reading 1 line, then going back to the
file, five times. Disk reads are the costliest part, with the possible
exception of memory usage (but usually only because it can cause
additional disk *writes*).

> I understand that you can read a chunk instead of a single byte, so when the newline is found you can return all the cached chunks concatenated. But will this make the search for the start of the line faster? I suppose you always have to read byte by byte (or more, if you're using UTF-16 etc.) and see if there's a newline.
>

Massively massively faster. Try it. Especially, try it on an
artificially slow file system, so you can see what it costs.

But you can't rely on any backwards reads unless you know for sure
that the encoding supports this. UTF-8 does (you have to scan
backwards for a start byte), UTF-16 does (work with pairs of bytes and
check for surrogates), and fixed-width encodings do, but otherwise,
you won't necessarily know when you've found a valid start point. So
any reverse-read algorithm is going to be restricted to specific
encodings.
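
For UTF-8, that resynchronisation is just "skip continuation bytes",
e.g. (sketch):

def utf8_start(buf, pos):
    # Back up to the first byte of a UTF-8 character: continuation
    # bytes all have the bit pattern 10xxxxxx.
    while pos > 0 and (buf[pos] & 0xC0) == 0x80:
        pos -= 1
    return pos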

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail
On Sun, 24 Apr 2022 at 11:21, Roel Schroeven <roel@roelschroeven.net> wrote:

> dn schreef op 24/04/2022 om 0:04:
> > Disagreeing with @Chris in the sense that I use tail very frequently,
> > and usually in the context of server logs - but I'm talking about the
> > Linux implementation, not Python code!
> If I understand Marco correctly, what he wants is to read the lines from
> bottom to top, i.e. tac instead of tail, despite his subject.
> I use tail very frequently too, but tac is something I almost never use.
>

Well, the inverse reader is only a secondary suggestion. I suppose a tail
is much more useful.
--
https://mail.python.org/mailman/listinfo/python-list
RE: tail
> -----Original Message-----
> From: dn <PythonList@DancesWithMice.info>
> Sent: Saturday, April 23, 2022 6:05 PM
> To: python-list@python.org
> Subject: Re: tail
>
<Snipped>
> NB quite a few of IBM's (extensively researched) algorithms which formed utility
> program[me]s on mainframes, made similar such algorithmic choices, in the
> pursuit of efficiencies.

WRT the mentioned IBM utility program[me]s, the non-Posix part of the IBM mainframe file system has always provided record-managed storage since the late 1960's (as opposed to the byte-managed storage of *ix systems) so searching for line endings was (and is) irrelevant and unnecessary in that environment. That operating system also provides basic "kernel-level" read-backwards API's for the record-managed file system, so there was never any need to build reverse-read into your code for that environment.

The byte-managed file storage used by the Posix kernel running under the actually-in-charge IBM mainframe operating system is, of course, subject to the same constraints and (in)efficiencies discussed in this thread.

Peter
--


--
https://mail.python.org/mailman/listinfo/python-list
Re: tail
On Sun, 24 Apr 2022 12:21:36 -0400, <pjfarley3@earthlink.net> declaimed the
following:

>
>WRT the mentioned IBM utility program[me]s, the non-Posix part of the IBM mainframe file system has always provided record-managed storage since the late 1960's (as opposed to the byte-managed storage of *ix systems) so searching for line endings was (and is) irrelevant and unnecessary in that environment. That operating system also provides basic "kernel-level" read-backwards API's for the record-managed file system, so there was never any need to build reverse-read into your code for that environment.
>

IBM wasn't the only one... the Xerox Sigma running CP/V by default used,
for text files (those created using a text editor), numeric ISAM keys (as
record numbers -- which is how their FORTRAN IV compiler did random access
I/O without requiring fixed-length records). The system supported three
access methods: consecutive (similar to UNIX "stream" files, for files that
didn't require editing; these saved disk space as the ISAM headers could be
disposed of), the aforesaid keyed, and "random" (on this system, "random"
meant the ONLY thing the OS did was know where the file was on disk --
files had to be contiguous and pre-allocated, and what data was in the file
was strictly up to the application to manage).

VAX/VMS had lots of different file structures managed by the RMS system
services. The default for FORTRAN text files was a segmented model, making
use of chunks of around 250 bytes [it has been years and I no longer have
the documentation] in which the start of each chunk had a code for "first
chunk", "last chunk", "intermediate chunk" (and maybe length of data in the
chunk). A record that fit completely within one chunk would have both
"first" and "last" codes set (intermediate chunks have neither code). One
had to go out of their way to create a "stream" file in DEC FORTRAN 77
(open the file with CARRIAGECONTROL=CARRIAGERETURN). Other languages on the
OS had different default file structures, but RMS would handle all of them
transparently.


--
Wulfraed Dennis Lee Bieber AF6VN
wlfraed@ix.netcom.com http://wlfraed.microdiversity.freeddns.org/
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail
On 25/04/2022 04.21, pjfarley3@earthlink.net wrote:
>> -----Original Message-----
>> From: dn <PythonList@DancesWithMice.info>
>> Sent: Saturday, April 23, 2022 6:05 PM
>> To: python-list@python.org
>> Subject: Re: tail
>>
> <Snipped>
>> NB quite a few of IBM's (extensively researched) algorithms which formed utility
>> program[me]s on mainframes, made similar such algorithmic choices, in the
>> pursuit of efficiencies.
>
> WRT the mentioned IBM utility program[me]s, the non-Posix part of the IBM mainframe file system has always provided record-managed storage since the late 1960's (as opposed to the byte-managed storage of *ix systems) so searching for line endings was (and is) irrelevant and unnecessary in that environment. That operating system also provides basic "kernel-level" read-backwards API's for the record-managed file system, so there was never any need to build reverse-read into your code for that environment.
>
> The byte-managed file storage used by the Posix kernel running under the actually-in-charge IBM mainframe operating system is, of course, subject to the same constraints and (in)efficiencies discussed in this thread.


Thanks for the clarification (and @wlfraed's addition).

Apologies if misunderstood. The above comment was about utilities which
would choose between algorithms, based on some rapid, initial
assessment of the task. It was not about 'tail' utility/ies specifically
- and I don't recall using a 'tail' on mainframes, but...

Thus the observation: the OP may find a serial, read-the-entire-file
approach faster in some situations (relatively short files). Conversely,
with longer files, some sort of 'last chunk' approach would be superior.
--
Regards,
=dn
--
https://mail.python.org/mailman/listinfo/python-list
