Mailing List Archive: tail

Re: tail [ In reply to ]

May 9, 2022, 2:12 PM

Post #76 of 95 (1799 views)

On Tue, 10 May 2022 at 07:07, Barry <barry@barrys-emacs.org> wrote:
> POSIX tail just prints the bytes to the output that it finds between \n bytes.
> At no time does it need to care about encodings as that is a problem solved
> by the terminal software. I would not expect utf-16 to work with tail on
> linux systems.

ASCII text encoded as UTF-16 seems to work fine on my system, which
probably means the terminal is just ignoring all the NUL bytes. But if
there's a stray 0x0A byte anywhere, it would probably be counted as a
line break.

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list

Re: tail [ In reply to ]

alan at csail

May 9, 2022, 2:31 PM

Post #77 of 95 (1799 views)

Marco Sulla <Marco.Sulla.Python@gmail.com> writes:

> On Mon, 9 May 2022 at 19:53, Chris Angelico <rosuav@gmail.com> wrote:
> ...
> Nevertheless, tail is a fundamental tool in *nix. It's fast and
> reliable. Also the tail command can't handle different encodings?

It definitely can't. It works for UTF-8, and all the ASCII compatible
single byte encodings, but feed it a file encoded in UTF-16, and it will
sometimes screw up. (And if you don't redirect the output away from
your terminal, and your terminal encoding isn't also set to UTF-16, you
will likely find yourself looking at gibberish -- but that's another
problem...)
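A small illustration of both effects, assuming little-endian UTF-16 without a BOM (the sample strings are made up for the example). It shows why ASCII-only text mostly survives a split on the single byte b"\n", and how a character such as U+010A adds a 0x0A byte that is not a line break:

```python
# ASCII-only text: every UTF-16-LE code unit is the ASCII byte followed by NUL.
# Splitting on the single byte b"\n" leaves the newline's NUL half at the
# start of the next "line".
ascii_data = "x\ny".encode("utf-16-le")
print(ascii_data.split(b"\n"))   # [b'x\x00', b'\x00y\x00']

# U+010A encodes as the bytes 0A 01, so it contains a byte equal to b"\n"
# even though the text has only one real line break.
text = "a\u010ab\n"
data = text.encode("utf-16-le")
print(text.count("\n"), data.count(b"\n"))   # 1 2
```

A byte-oriented tail would count two line breaks in the second sample, which is the "sometimes" in "it will sometimes screw up".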

Re: tail [ In reply to ]

Marco.Sulla.Python at gmail

May 11, 2022, 12:58 PM

Post #78 of 95 (1786 views)

On Mon, 9 May 2022 at 23:15, Dennis Lee Bieber <wlfraed@ix.netcom.com>
wrote:
>
> On Mon, 9 May 2022 21:11:23 +0200, Marco Sulla
> <Marco.Sulla.Python@gmail.com> declaimed the following:
>
> >Nevertheless, tail is a fundamental tool in *nix. It's fast and
> >reliable. Also the tail command can't handle different encodings?
>
> Based upon
> https://github.com/coreutils/coreutils/blob/master/src/tail.c the ONLY
> thing tail looks at is single byte "\n". It does not handle other line
> endings, and appears to perform BINARY I/O, not text I/O. It does nothing
> for bytes that are not "\n". Split multi-byte encodings are irrelevant
> since, if it does not find enough "\n" bytes in the buffer (chunk) it
> reads another binary chunk and seeks for additional "\n" bytes. Once it
> finds the desired amount, it is synchronized on the byte following the
> "\n" (which, for multi-byte encodings might be a NUL, but in any event,
> should be a safe location for subsequent I/O).
>
> Interpretation of encoding appears to fall to the console driver
> configuration when displaying the bytes output by tail.

Ok, I understand. This should be a Python implementation of *nix tail:

import os

_lf = b"\n"
_err_n = "Parameter n must be a positive integer number"
_err_chunk_size = "Parameter chunk_size must be a positive integer number"

def tail(filepath, n=10, chunk_size=100):
    if (n <= 0):
        raise ValueError(_err_n)

    if (n % 1 != 0):
        raise ValueError(_err_n)

    if (chunk_size <= 0):
        raise ValueError(_err_chunk_size)

    if (chunk_size % 1 != 0):
        raise ValueError(_err_chunk_size)

    n_chunk_size = n * chunk_size
    pos = os.stat(filepath).st_size
    chunk_line_pos = -1
    lines_not_found = n

    with open(filepath, "rb") as f:
        text = bytearray()

        while pos != 0:
            pos -= n_chunk_size

            if pos < 0:
                pos = 0

            f.seek(pos)
            chars = f.read(n_chunk_size)
            text[0:0] = chars
            search_pos = n_chunk_size

            while search_pos != -1:
                chunk_line_pos = chars.rfind(_lf, 0, search_pos)

                if chunk_line_pos != -1:
                    lines_not_found -= 1

                    if lines_not_found == 0:
                        break

                search_pos = chunk_line_pos

            if lines_not_found == 0:
                break

    return bytes(text[chunk_line_pos+1:])

The function opens the file in binary mode and searches only for b"\n". It
returns the last n lines of the file as bytes.

I suppose this function is fast. It reads the bytes from the file in chunks
and stores them in a bytearray, prepending them to it. The final result is
read from the bytearray and converted to bytes (to be consistent with the
read method).

I suppose the function is reliable. File is opened in binary mode and only
b"\n" is searched as line end, as *nix tail (and python readline in binary
mode) do. And bytes are returned. The caller can use them as is or convert
them to a string using the encoding it wants, or do whatever its
imagination can think :)

Finally, it seems to me the function is quite simple.

If all my affirmations are true, the three obstacles written by Chris
should be passed.

I'd very much like to see a CPython implementation of that function. It
could be a method of a file object opened in binary mode, and *only* in
binary mode.

What do you think about it?

Re: tail [ In reply to ]

rosuav at gmail

May 11, 2022, 1:07 PM

Post #79 of 95 (1786 views)

On Thu, 12 May 2022 at 06:03, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
> I suppose this function is fast. It reads the bytes from the file in chunks
> and stores them in a bytearray, prepending them to it. The final result is
> read from the bytearray and converted to bytes (to be consistent with the
> read method).
>
> I suppose the function is reliable. File is opened in binary mode and only
> b"\n" is searched as line end, as *nix tail (and python readline in binary
> mode) do. And bytes are returned. The caller can use them as is or convert
> them to a string using the encoding it wants, or do whatever its
> imagination can think :)
>
> Finally, it seems to me the function is quite simple.
>
> If all my affirmations are true, the three obstacles written by Chris
> should be passed.

Have you actually checked those three, or do you merely suppose them to be true?

> I'd very much like to see a CPython implementation of that function. It
> could be a method of a file object opened in binary mode, and *only* in
> binary mode.
>
> What do you think about it?

Still not necessary. You can simply have it in your own toolkit. Why
should it be part of the core language? How much benefit would it be
to anyone else? All the same assumptions are still there, so it still
isn't general, and you may as well just *code to your own needs* like
I've been saying all along. This does not need to be in the standard
library. Do what you need, assume what you can safely assume, and
other people can write different code.

I don't understand why this wants to be in the standard library.

ChrisA

Re: tail [ In reply to ]

Marco.Sulla.Python at gmail

May 11, 2022, 2:27 PM

Post #80 of 95 (1783 views)

On Wed, 11 May 2022 at 22:09, Chris Angelico <rosuav@gmail.com> wrote:
>
> Have you actually checked those three, or do you merely suppose them to be true?

I only suppose, as I said. I should do some benchmark and some other
tests, and, frankly, I don't want to. I don't want to because I'm
quite sure the implementation is fast, since it reads by chunks and
caches them. I'm not sure it's 100% free of bugs, but the concept is
very simple, since it simply mimics the *nix tail, so it should be
reliable.

>
> > I'd very much like to see a CPython implementation of that function. It
> > could be a method of a file object opened in binary mode, and *only* in
> > binary mode.
> >
> > What do you think about it?
>
> Still not necessary. You can simply have it in your own toolkit. Why
> should it be part of the core language?

Why not?

> How much benefit would it be
> to anyone else?

I suppose that every programmer, at least once in their life, has done a tail.

> All the same assumptions are still there, so it still
> isn't general

It's general. It mimics the *nix tail. I can't think of a more general
way to implement a tail.

> I don't understand why this wants to be in the standard library.

Well, the answer is really simple: I needed it, and if I had found it in
the stdlib, I would have used it instead of writing the first horrible
function.
Furthermore, tail is such a useful tool that I suppose many others are
interested, based on this quick Google search:

https://www.google.com/search?q=python+tail

A highly voted question on Stack Overflow, many other Stack Overflow
questions, a package that seems to do exactly the same thing, that is,
mimic *nix tail, and a blog post about how to tail in Python.
Furthermore, if you search for "python tail pypi", you can find a
bunch of other packages:

https://www.google.com/search?q=python+tail+pypi

It seems the subject is quite popular, and I can't imagine otherwise.

Re: tail [ In reply to ]

rosuav at gmail

May 11, 2022, 2:31 PM

Post #81 of 95 (1783 views)

On Thu, 12 May 2022 at 07:27, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
>
> On Wed, 11 May 2022 at 22:09, Chris Angelico <rosuav@gmail.com> wrote:
> >
> > Have you actually checked those three, or do you merely suppose them to be true?
>
> I only suppose, as I said. I should do some benchmark and some other
> tests, and, frankly, I don't want to. I don't want to because I'm
> quite sure the implementation is fast, since it reads by chunks and
> caches them. I'm not sure it's 100% free of bugs, but the concept is
> very simple, since it simply mimics the *nix tail, so it should be
> reliable.

If you don't care enough to benchmark it or even debug it, why should
anyone else care?

I'm done discussing. You think that someone else should have done this
for you, but you aren't even willing to put in the effort to make this
useful to anyone else. Just use it yourself and have done with it.

ChrisA

Re: tail [ In reply to ]

wlfraed at ix

May 11, 2022, 3:15 PM

Post #82 of 95 (1783 views)

On Thu, 12 May 2022 06:07:18 +1000, Chris Angelico <rosuav@gmail.com>
declaimed the following:

>I don't understand why this wants to be in the standard library.
>
Especially as any Linux distribution probably includes the compiled
"tail" command, so this would only be of use on Windows.

Under recent Windows, one has an equivalent to "tail" IFF using
PowerShell rather than the "DOS" shell.

https://www.middlewareinventory.com/blog/powershell-tail-file-windows-tail-command/

or install a Windows binary equivalent http://tailforwin32.sourceforge.net/

--
Wulfraed Dennis Lee Bieber AF6VN
wlfraed@ix.netcom.com http://wlfraed.microdiversity.freeddns.org/

Re: tail [ In reply to ]

python-list at python

May 11, 2022, 4:47 PM

Post #83 of 95 (1783 views)

Just FYI, UNIX had a bunch of utilities that could emulate a vanilla version of tail on a command line.
You can use sed, awk and quite a few others to simply show line N to the end of a file or other variations.
Of course the way many things were done back then had less focus on efficiency than how to stepwise make changes in a pipeline so reading from the beginning to end was not an issue.


Re: tail [ In reply to ]

python-list at python

May 11, 2022, 6:27 PM

Post #84 of 95 (1777 views)

This seems to be a regular refrain where someone wants something as STANDARD in a programming language or environment and others want to keep it lean and mean or do not see THIS suggestion as particularly important or useful.
Looking at the end of something is extremely common. Packages like numpy/pandas in Python often provide functions with names like head or tail as do other languages where data structures with names like data.frame are commonly used. These structures are in some way indexed to make it easy to jump towards the end. Text files are not.

Efficiency aside, a 3-year-old (well, certainly a 30 year old) can cobble together a function that takes a filename assumed to be textual and reads the file into some data structure that stores the lines of the file and so it can be indexed by line number and also report the index of the final line. The data structure can be a list of lists or a dictionary with line numbers as keys or a numpy ...

So the need for this functionality seems obvious but then what about someone who wants a bunch of random lines from a file? Need we satisfy their wish to pick random offsets from the file and get the line that the offset falls in the middle of, or the one about to start? Would that even be random if line lengths vary? Text files were never designed to be used efficiently except for reading and writing and certainly not for something like sorting.

Again, generally you can read in the darn file and perform the operation and free up whatever memory you do not need. If you have huge files, fine, but then why make a special function be part of the default setup if it is rarely used? Why not put it in a module/package called BigFileBatches alongside other functions useful to do things in batches? Call that when needed but for smaller files, KISS.
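For small files, the "just read it" approach can be sketched in a few lines. This is only an illustration: the name simple_tail is made up, and it assumes that reading the file from start to end is acceptable. It still reads every byte, but it never holds more than n lines in memory.

```python
from collections import deque


def simple_tail(filepath, n=10):
    """Return the last n lines of a file as bytes, reading it front to back."""
    with open(filepath, "rb") as f:
        # Iterating a binary file yields lines terminated by b"\n" only.
        # deque(maxlen=n) drops the oldest line whenever a new one arrives,
        # so at most n lines are kept at any time.
        return b"".join(deque(f, maxlen=n))
```

A caller that wants text can decode the result with whatever encoding it knows the file to use.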


Re: tail [ In reply to ]

Marco.Sulla.Python at gmail

May 12, 2022, 10:48 AM

Post #85 of 95 (1769 views)

On Thu, 12 May 2022 at 00:50, Stefan Ram <ram@zedat.fu-berlin.de> wrote:
>
> Marco Sulla <Marco.Sulla.Python@gmail.com> writes:
> >def tail(filepath, n=10, chunk_size=100):
> > if (n <= 0):
> > raise ValueError(_err_n)
> ...
>
> There's no spec/doc, so one can't even test it.

Excuse me, you're very right.

"""
A function that "tails" the file. If you don't know what that means,
google "man tail"

filepath: the file path of the file to be "tailed"
n: the numbers of lines "tailed"
chunk_size: oh don't care, use it as is
"""

Re: tail [ In reply to ]

Marco.Sulla.Python at gmail

May 12, 2022, 1:45 PM

Post #86 of 95 (1766 views)

Thank you very much. This helped me to improve the function:

import os

_lf = b"\n"
_err_n = "Parameter n must be a positive integer number"
_err_chunk_size = "Parameter chunk_size must be a positive integer number"

def tail(filepath, n=10, chunk_size=100):
    if (n <= 0):
        raise ValueError(_err_n)

    if (n % 1 != 0):
        raise ValueError(_err_n)

    if (chunk_size <= 0):
        raise ValueError(_err_chunk_size)

    if (chunk_size % 1 != 0):
        raise ValueError(_err_chunk_size)

    n_chunk_size = n * chunk_size
    pos = os.stat(filepath).st_size
    chunk_line_pos = -1
    newlines_to_find = n
    first_step = True

    with open(filepath, "rb") as f:
        text = bytearray()

        while pos != 0:
            pos -= n_chunk_size

            if pos < 0:
                pos = 0

            f.seek(pos)
            chars = f.read(n_chunk_size)
            text[0:0] = chars
            search_pos = n_chunk_size

            while search_pos != -1:
                chunk_line_pos = chars.rfind(_lf, 0, search_pos)

                if first_step and chunk_line_pos == search_pos - 1:
                    newlines_to_find += 1

                first_step = False

                if chunk_line_pos != -1:
                    newlines_to_find -= 1

                    if newlines_to_find == 0:
                        break

                search_pos = chunk_line_pos

            if newlines_to_find == 0:
                break

    return bytes(text[chunk_line_pos+1:])

On Thu, 12 May 2022 at 20:29, Stefan Ram <ram@zedat.fu-berlin.de> wrote:

> I am not aware of a definition of "line" above,
> but the PLR says:
>
> |A physical line is a sequence of characters terminated
> |by an end-of-line sequence.
>
> . So 10 lines should have 10 end-of-line sequences.
>

Maybe. Maybe not. What if the file ends with no newline?
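For comparison, a minimal sketch of the same backward-chunk idea that treats a final newline as the terminator of the last line, so a file gives the same number of lines whether or not it ends with a newline. It is not the function posted above: the name tail_bytes, the 4096-byte default and the final re-read from the computed offset are assumptions made for illustration. It also reads only the bytes not yet seen on each step, whereas the listing above still asks for n_chunk_size bytes after pos is clamped to 0, which can overlap bytes already read.

```python
import os


def tail_bytes(filepath, n=10, chunk_size=4096):
    """Return the last n lines of filepath as bytes, splitting only on b"\\n"."""
    if n <= 0 or chunk_size <= 0:
        raise ValueError("n and chunk_size must be positive integers")

    with open(filepath, "rb") as f:
        end = f.seek(0, os.SEEK_END)
        if end == 0:
            return b""

        # A newline at the very end terminates the last line; skip it so that
        # it is not counted as the start of an extra, empty line.
        f.seek(end - 1)
        pos = end - 1 if f.read(1) == b"\n" else end

        newlines_left = n
        start = 0  # fewer than n lines in the file: return everything
        while pos > 0:
            read_from = max(0, pos - chunk_size)
            f.seek(read_from)
            chunk = f.read(pos - read_from)  # only the bytes not yet examined

            idx = len(chunk)
            while newlines_left > 0:
                idx = chunk.rfind(b"\n", 0, idx)
                if idx == -1:
                    break
                newlines_left -= 1

            if newlines_left == 0:
                start = read_from + idx + 1  # byte after the n-th newline
                break
            pos = read_from

        f.seek(start)
        return f.read()
```

With a file containing b"a\nb\nc\n" and n=2 this returns b"b\nc\n"; with b"a\nb\nc" it returns b"b\nc".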

Re: tail [ In reply to ]

wlfraed at ix

May 12, 2022, 2:48 PM

Post #87 of 95 (1765 views)

On Thu, 12 May 2022 22:45:42 +0200, Marco Sulla
<Marco.Sulla.Python@gmail.com> declaimed the following:

>
>Maybe. Maybe not. What if the file ends with no newline?

https://github.com/coreutils/coreutils/blob/master/src/tail.c
Lines 567-569 (also lines 550-557 for "bytes_read" determination)

--
Wulfraed Dennis Lee Bieber AF6VN
wlfraed@ix.netcom.com http://wlfraed.microdiversity.freeddns.org/

Re: tail [ In reply to ]

cs at cskk

May 12, 2022, 3:29 PM

Post #88 of 95 (1765 views)

On 12May2022 19:48, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
>On Thu, 12 May 2022 at 00:50, Stefan Ram <ram@zedat.fu-berlin.de> wrote:
>> There's no spec/doc, so one can't even test it.
>
>Excuse me, you're very right.
>
>"""
>A function that "tails" the file. If you don't know what that means,
>google "man tail"
>
>filepath: the file path of the file to be "tailed"
>n: the numbers of lines "tailed"
>chunk_size: oh don't care, use it as is

This is nearly the worst "specification" I have ever seen.

Describe what your function _does_.

Do not just send people to an arbitrary search engine to find possibly
ephemeral web pages where someone has typed "man tail" and/or (if lucky)
web pages with the output of "man tail" for any of several platforms.

But this is sounding more and more like a special purpose task to be
done for your particular use cases. That says it should be in your
personal toolkit. If it has general applicability, _publish_ your
toolkit for others to use. You can do that trivially by pushing your
code repo to any of several free services like bitbucket, gitlab,
sourcehut, github etc. Or you can go the extra few yards and publish a
package to PyPI and see if anyone uses it.

Part of your problem is that you think the term "tail" has a specific
simple obvious meaning. But even to me it means at least 2 things:
- to report the last "n" "lines" of a text file
- to continuously report "new" data appended to a file

These are different, though related, tasks. The latter one is
particularly easy if done purely for bytes (on systems which allow it).
As you've had explained to you, the former task is actually very fiddly.
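The second meaning, done purely for bytes, can be sketched roughly as below. The helper name read_appended and the policy of starting over when the file has shrunk are assumptions for illustration; a real "follow" loop would call this repeatedly with a short sleep between polls.

```python
import os


def read_appended(filepath, offset):
    """Return (new_bytes, new_offset) for data appended after offset."""
    size = os.stat(filepath).st_size
    if size < offset:
        # The file was truncated or replaced; start again from the beginning
        # (an assumed policy, other choices are possible).
        offset = 0
    with open(filepath, "rb") as f:
        f.seek(offset)
        data = f.read()
    return data, offset + len(data)
```

No line or encoding logic is needed here, which is why this variant is the easy one: the bytes are passed through exactly as they were appended.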

It is fiddly both in boundary conditions and complicated by being
dependent on the text encoding, which you do not inherently know - that
implies that you ought to (a) provide a way to specify that encoding and
(b) maybe have a reasonable fallback default. But that default needs to
be carefully and precisely explained. And the "find a line ending"
criteria need to be explained. And the "sync to a character boundary"
needs to be explained, including where it cannot be done.

Cheers,
Cameron Simpson <cs@cskk.id.au>

Re: tail [ In reply to ]

Marco.Sulla.Python at gmail

May 13, 2022, 3:16 AM

Post #89 of 95 (1754 views)

On Fri, 13 May 2022 at 00:31, Cameron Simpson <cs@cskk.id.au> wrote:

> On 12May2022 19:48, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
> >On Thu, 12 May 2022 at 00:50, Stefan Ram <ram@zedat.fu-berlin.de> wrote:
> >> There's no spec/doc, so one can't even test it.
> >
> >Excuse me, you're very right.
> >
> >"""
> >A function that "tails" the file. If you don't know what that means,
> >google "man tail"
> >
> >filepath: the file path of the file to be "tailed"
> >n: the numbers of lines "tailed"
> >chunk_size: oh don't care, use it as is
>
> This is nearly the worst "specification" I have ever seen.
>

You're lucky. I've seen much worse (or none at all).

Re: tail [ In reply to ]

2QdxY4RzWzUUiLuE at potatochowder

May 13, 2022, 3:47 AM

Post #90 of 95 (1754 views)

On 2022-05-13 at 12:16:57 +0200,
Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:

> On Fri, 13 May 2022 at 00:31, Cameron Simpson <cs@cskk.id.au> wrote:

[...]

> > This is nearly the worst "specification" I have ever seen.

> You're lucky. I've seen much worse (or none at all).

At least with *no* documentation, the source code stands for itself. If
I can execute it (whatever that entails), then I can (in theory) figure
out *what* it does. I still don't know what it's *supposed* to do, and
therefore *cannot* know how well it does or doesn't "work", but at
least source code is deterministic and unambiguous (except when it
isn't, but let's not go there).

Re: tail [ In reply to ]

Marco.Sulla.Python at gmail

May 16, 2022, 11:13 AM

Post #91 of 95 (1641 views)

On Fri, 13 May 2022 at 12:49, <2QdxY4RzWzUUiLuE@potatochowder.com> wrote:
>
> On 2022-05-13 at 12:16:57 +0200,
> Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
>
> > On Fri, 13 May 2022 at 00:31, Cameron Simpson <cs@cskk.id.au> wrote:
>
> [...]
>
> > > This is nearly the worst "specification" I have ever seen.
>
> > You're lucky. I've seen much worse (or none at all).
>
> At least with *no* documentation, the source code stands for itself.

So I did well not to include one the first time. I think that after
100 posts about tail, chunks etc. it was clear what that stuff was
about and how to use it.

Speaking about more serious things, so far I've done a test with:

* a file that does not end with \n
* a file that ends with \n (after Stefan's test)
* a file with more than 10 lines
* a file with less than 10 lines

It seemed to work. Now I only have to benchmark it. I suppose I have to
test with a file of at least 1 GB, a big lorem ipsum, and do an unequal
comparison with Linux tail. I'll do it when I have time, so Chris will
no longer be angry with me.

Re: tail [ In reply to ]

Marco.Sulla.Python at gmail

May 17, 2022, 1:45 PM

Post #92 of 95 (1595 views)

Well, I've done a benchmark.

>>> timeit.timeit("tail('/home/marco/small.txt')", globals={"tail":tail}, number=100000)
1.5963431186974049
>>> timeit.timeit("tail('/home/marco/lorem.txt')", globals={"tail":tail}, number=100000)
2.5240604374557734
>>> timeit.timeit("tail('/home/marco/lorem.txt', chunk_size=1000)", globals={"tail":tail}, number=100000)
1.8944984432309866

small.txt is a text file of 1.3 KB. lorem.txt is a lorem ipsum of 1.2
GB. It seems the performance is good, thanks to the chunk suggestion.

But the time of Linux tail surprises me:

marco@buzz:~$ time tail lorem.txt
[text]

real 0m0.004s
user 0m0.003s
sys 0m0.001s

It's strange that it's so slow. I thought it was because it decodes
and prints the result, but I timed

timeit.timeit("print(tail('/home/marco/lorem.txt').decode('utf-8'))",
globals={"tail":tail}, number=100000)

and I got ~36 seconds. It seems quite strange to me. Maybe I got the
benchmarks wrong at some point?
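One thing worth keeping in mind when reading these numbers: timeit.timeit returns the total time for all `number` executions, not the time of one call. The arithmetic below converts the totals quoted above into per-call figures; the helper name per_call_us is made up for the example, and 36.0 stands in for the "~36 seconds" figure.

```python
def per_call_us(total_seconds, number):
    """Average time of one call in microseconds, given timeit's total."""
    return total_seconds / number * 1_000_000


number = 100_000
for label, total in [
    ("small.txt", 1.5963431186974049),
    ("lorem.txt", 2.5240604374557734),
    ("lorem.txt, chunk_size=1000", 1.8944984432309866),
    ("lorem.txt, print + decode", 36.0),  # "~36 seconds" in the text above
]:
    print(f"{label}: about {per_call_us(total, number):.0f} us per call")

# One run of the tail command, from the `time` output above: 0.004 s.
print(f"tail command: about {0.004 * 1_000_000:.0f} us per run")
```

So a single call of the function takes roughly 16 to 25 microseconds, against about 4000 microseconds for one complete run of the tail process.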

Re: tail [ In reply to ]

cs at cskk

May 18, 2022, 2:30 PM

Post #93 of 95 (1589 views)

On 17May2022 22:45, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
>Well, I've done a benchmark.
>>>> timeit.timeit("tail('/home/marco/small.txt')", globals={"tail":tail}, number=100000)
>1.5963431186974049
>>>> timeit.timeit("tail('/home/marco/lorem.txt')", globals={"tail":tail}, number=100000)
>2.5240604374557734
>>>> timeit.timeit("tail('/home/marco/lorem.txt', chunk_size=1000)", globals={"tail":tail}, number=100000)
>1.8944984432309866

This suggests that the file size does not dominate your runtime. Ah.
_Or_ that there are similar numbers of newlines vs text in the files so
reading similar amounts of data from the end. If the "line density" of
the files were similar you would hope that the runtimes would be
similar.

>small.txt is a text file of 1.3 KB. lorem.txt is a lorem ipsum of 1.2
>GB. It seems the performance is good, thanks to the chunk suggestion.
>
>But the time of Linux tail surprises me:
>
>marco@buzz:~$ time tail lorem.txt
>[text]
>
>real 0m0.004s
>user 0m0.003s
>sys 0m0.001s
>
>It's strange that it's so slow. I thought it was because it decodes
>and prints the result, but I timed

You're measuring different things. timeit() tries hard to measure just
the code snippet you provide. It doesn't measure the startup cost of the
whole python interpreter. Try:

time python3 your-tail-prog.py /home/marco/lorem.txt

BTW, does your `tail()` print output? If not, again not measuring the
same thing.

If you have the source of tail(1) to hand, consider getting to the core
and measuring `time()` immediately before and immediately after the
central tail operation and printing the result.

Also: does tail(1) do character set / encoding stuff? Does your Python
code do that? Might be apples and oranges.
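The gap between the two kinds of measurement can be seen without the tail code at all. In this rough sketch the statement being timed and the child command (`python -c pass`, standing in for a hypothetical your-tail-prog.py) are placeholders chosen for illustration, and the helper names are made up:

```python
import subprocess
import sys
import time
import timeit


def per_call_seconds(stmt, number):
    """Average in-process time of one execution of stmt, via timeit."""
    return timeit.timeit(stmt, number=number) / number


def process_seconds(argv):
    """Wall-clock time to start a child process, run it and wait for exit."""
    start = time.perf_counter()
    subprocess.run(argv, check=True, stdout=subprocess.DEVNULL)
    return time.perf_counter() - start


in_process = per_call_seconds("sum(range(100))", number=10_000)
whole_process = process_seconds([sys.executable, "-c", "pass"])
print(f"in-process snippet: {in_process * 1e6:.1f} us per call")
print(f"whole interpreter run: {whole_process * 1e3:.1f} ms")
```

The second figure includes interpreter startup and shutdown, which is the part that `time tail lorem.txt` also pays (for a C program) and that timeit deliberately leaves out.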

Cheers,
Cameron Simpson <cs@cskk.id.au>

Re: tail [ In reply to ]

Marco.Sulla.Python at gmail

May 19, 2022, 10:50 AM

Post #94 of 95 (1573 views)

On Wed, 18 May 2022 at 23:32, Cameron Simpson <cs@cskk.id.au> wrote:
>
> On 17May2022 22:45, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
> >Well, I've done a benchmark.
> >>>> timeit.timeit("tail('/home/marco/small.txt')", globals={"tail":tail}, number=100000)
> >1.5963431186974049
> >>>> timeit.timeit("tail('/home/marco/lorem.txt')", globals={"tail":tail}, number=100000)
> >2.5240604374557734
> >>>> timeit.timeit("tail('/home/marco/lorem.txt', chunk_size=1000)", globals={"tail":tail}, number=100000)
> >1.8944984432309866
>
> This suggests that the file size does not dominate your runtime.

Yes, this is what I wanted to test and it seems good.

> Ah.
> _Or_ that there are similar numbers of newlines vs text in the files so
> reading similar amounts of data from the end. If the "line density" of
> the files were similar you would hope that the runtimes would be
> similar.

No, well, small.txt has very short lines. Lorem.txt is a lorem ipsum,
so really long lines. Indeed I get better results tuning chunk_size.
Anyway, also with the default value the performance is not bad at all.

> >But the time of Linux tail surprises me:
> >
> >marco@buzz:~$ time tail lorem.txt
> >[text]
> >
> >real 0m0.004s
> >user 0m0.003s
> >sys 0m0.001s
> >
> >It's strange that it's so slow. I thought it was because it decodes
> >and prints the result, but I timed
>
> You're measuring different things. timeit() tries hard to measure just
> the code snippet you provide. It doesn't measure the startup cost of the
> whole python interpreter. Try:
>
> time python3 your-tail-prog.py /home/marco/lorem.txt

Well, I'll try it, but isn't it a bit unfair to compare Python startup with C?

> BTW, does your `tail()` print output? If not, again not measuring the
> same thing.
> [...]
> Also: does tail(1) do character set / encoding stuff? Does your Python
> code do that? Might be apples and oranges.

Well, as I wrote I also timed

timeit.timeit("print(tail('/home/marco/lorem.txt').decode('utf-8'))",
globals={"tail":tail}, number=100000)

and I got ~36 seconds.

> If you have the source of tail(1) to hand, consider getting to the core
> and measuring `time()` immediately before and immediately after the
> central tail operation and printing the result.

IMHO this is a very good idea, but I have to find the time(). Ahah. Emh.

Re: tail [ In reply to ]

cs at cskk

May 19, 2022, 6:33 PM

Post #95 of 95 (1565 views)

On 19May2022 19:50, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
>On Wed, 18 May 2022 at 23:32, Cameron Simpson <cs@cskk.id.au> wrote:
>> You're measuring different things. timeit() tries hard to measure just
>> the code snippet you provide. It doesn't measure the startup cost of the
>> whole python interpreter. Try:
>>
>> time python3 your-tail-prog.py /home/marco/lorem.txt
>
>Well, I'll try it, but isn't it a bit unfair to compare Python startup with C?

Yes it is. But timeit goes the other way and only measures the code.
Admittedly I'd expect a C tail to be pretty quick anyway. But... even a
small C programme often has a surprising degree of startup these days,
what with dynamically linked libraries, locale lookups etc etc. Try:

strace tail some-empty-file.txt

and see what goes on. If you're on slow hard drives what is cached in
memory and what isn't can have a surprising effect.

Cheers,
Cameron Simpson <cs@cskk.id.au>