Mailing List Archive

Re: tail [ In reply to ]
On 25Apr2022 08:08, DL Neil <PythonList@DancesWithMice.info> wrote:
>Thus, the observation that the OP may find that a serial,
>read-the-entire-file approach is faster in some situations (relatively
>short files). Conversely, with longer files, some sort of 'last chunk'
>approach would be superior.

If you make the chunk big enough, they're the same algorithm!

It sounds silly, but if you make your chunk size as big as your threshold
for "this file is too big to read serially in its entirety", you may as
well just write the "last chunk" flavour.

Cheers,
Cameron Simpson <cs@cskk.id.au>
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On 26/04/2022 10.54, Cameron Simpson wrote:
> On 25Apr2022 08:08, DL Neil <PythonList@DancesWithMice.info> wrote:
>> Thus, the observation that the OP may find that a serial,
>> read-the-entire-file approach is faster in some situations (relatively
>> short files). Conversely, with longer files, some sort of 'last chunk'
>> approach would be superior.
>
> If you make the chunk big enough, they're the same algorithm!
>
> It sounds silly, but if you make your chunk size as big as your threshold
> for "this file is too big to read serially in its entirety", you may as
> well just write the "last chunk" flavour.


I like it!

Yes, memory-limited mainframes are in the past, and our thinking has
moved on (or needs to); memory is so much 'cheaper' and thus available
for use!

That said, it depends on file-size and what else is going-on in the
machine/total-application. (and that's 'probably not much' as far as
resource-mix is concerned!) However, I can't speak for the OP, the
reason behind the post, and/or his circumstances...
--
Regards,
=dn
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
Something like this is OK?

import os

def tail(f):
    chunk_size = 100
    size = os.stat(f.fileno()).st_size

    positions = iter(range(size, -1, -chunk_size))
    next(positions)

    chunk_line_pos = -1
    pos = 0

    for pos in positions:
        f.seek(pos)
        chars = f.read(chunk_size)
        chunk_line_pos = chars.rfind(b"\n")

        if chunk_line_pos != -1:
            break

    if chunk_line_pos == -1:
        nbytes = pos
        pos = 0
        f.seek(pos)
        chars = f.read(nbytes)
        chunk_line_pos = chars.rfind(b"\n")

    if chunk_line_pos == -1:
        line_pos = pos
    else:
        line_pos = pos + chunk_line_pos + 1

    f.seek(line_pos)

    return f.readline()

This is just for one line and assumes UTF-8.
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On 01May2022 18:55, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
>Something like this is OK?
[...]
>def tail(f):
> chunk_size = 100
> size = os.stat(f.fileno()).st_size

I think you want os.fstat().

> positions = iter(range(size, -1, -chunk_size))
> next(positions)

I was wondering about the iter, but this makes sense. Alternatively you
could put a range check in the for-loop.

>     chunk_line_pos = -1
>     pos = 0
>
>     for pos in positions:
>         f.seek(pos)
>         chars = f.read(chunk_size)
>         chunk_line_pos = chars.rfind(b"\n")
>
>         if chunk_line_pos != -1:
>             break

Normal text files _end_ in a newline. I'd expect this to stop immediately
at the end of the file.

>     if chunk_line_pos == -1:
>         nbytes = pos
>         pos = 0
>         f.seek(pos)
>         chars = f.read(nbytes)
>         chunk_line_pos = chars.rfind(b"\n")

I presume this is because unless you're very lucky, 0 will not be a
position in the range(). I'd be inclined to avoid duplicating this code
and special case and instead maybe make the range unbounded and do
something like this:

    if pos < 0:
        pos = 0
    ... seek/read/etc ...
    if pos == 0:
        break

around the for-loop body.
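
A rough, untested sketch of that shape, applied to your tail() above
(binary-mode file assumed, as in your version):

import os

def tail(f, chunk_size=100):
    size = os.fstat(f.fileno()).st_size
    chunk_line_pos = -1
    pos = 0
    for pos in range(size - chunk_size, -chunk_size, -chunk_size):
        if pos < 0:
            pos = 0
        f.seek(pos)
        chars = f.read(chunk_size)
        chunk_line_pos = chars.rfind(b"\n")
        if chunk_line_pos != -1 or pos == 0:
            break
    # Start of the last line: just past the newline if one was found,
    # otherwise the start of the file.
    f.seek(pos + chunk_line_pos + 1 if chunk_line_pos != -1 else 0)
    return f.readline()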

>     if chunk_line_pos == -1:
>         line_pos = pos
>     else:
>         line_pos = pos + chunk_line_pos + 1
>     f.seek(line_pos)
>     return f.readline()
>
>This is simply for one line and for utf8.

And anything else where a newline is just an ASCII newline byte (10) and
can't be mistaken otherwise. So also ASCII and all the ISO8859-x single
byte encodings. But as Chris has mentioned, not for other encodings.

Seems sane. I haven't tried to run it.

Cheers,
Cameron Simpson <cs@cskk.id.au>
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On Sun, May 1, 2022 at 3:19 PM Cameron Simpson <cs@cskk.id.au> wrote:

> On 01May2022 18:55, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
> >Something like this is OK?
>

Scanning backward for a byte == 10 in ASCII or ISO-8859 seems fine.

But what about Unicode? Are all 10 bytes newlines in Unicode encodings?

If not, and you have a huge file to reverse, it might be better to use a
temporary file.
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On Mon, 2 May 2022 at 09:19, Dan Stromberg <drsalists@gmail.com> wrote:
>
> On Sun, May 1, 2022 at 3:19 PM Cameron Simpson <cs@cskk.id.au> wrote:
>
> > On 01May2022 18:55, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
> > >Something like this is OK?
> >
>
> Scanning backward for a byte == 10 in ASCII or ISO-8859 seems fine.
>
> But what about Unicode? Are all 10 bytes newlines in Unicode encodings?

Most absolutely not. "Unicode" isn't an encoding, but of the Unicode
Transformation Formats and Universal Character Set encodings, most
don't make that guarantee:

* UTF-8 does, as mentioned. It sacrifices some efficiency and
consistency for a guarantee that ASCII characters are represented by
ASCII bytes, and ASCII bytes only ever represent ASCII characters.
* UCS-2 and UTF-16 will both represent BMP characters with two bytes.
Any character U+xx0A or U+0Axx will include an 0x0A in its
representation.
* UTF-16 will also encode anything U+000xxx0A with an 0x0A. (And I
don't think any codepoints have been allocated that would trigger
this, but UTF-16 can also use 0x0A in the high surrogate.)
* UTF-32 and UCS-4 will use 0x0A for any character U+xx0A, U+0Axx, and
U+Axxxx (though that plane has no characters on it either)

So, of all the available Unicode standard encodings, only UTF-8 makes
this guarantee.
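
A quick illustration (U+010A is just an arbitrary code point whose low
byte happens to be 0x0A):

>>> "\u010a".encode("utf-16-le")  # contains an 0x0A byte, but no newline
b'\n\x01'
>>> "\u010a".encode("utf-8")      # UTF-8 keeps all non-ASCII bytes >= 0x80
b'\xc4\x8a'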

Of course, if you look at documents available on the internet, UTF-8 is
the encoding used by the vast majority of them (especially if you
include seven-bit files, which can equally be considered ASCII,
ISO-8859-x, and UTF-8), so while it might only be one encoding out of
many, it's probably the most important :)

In general, you can *only* make this parsing assumption IF you know
for sure that your file's encoding is UTF-8, ISO-8859-x, some OEM
eight-bit encoding (eg Windows-125x), or one of a handful of other
compatible encodings. But it probably will be.

> If not, and you have a huge file to reverse, it might be better to use a
> temporary file.

Yeah, or an in-memory deque if you know how many lines you want.
Either way, you can read the file forwards, guaranteeing correct
decoding even of a shifted character set (where a byte value can
change in meaning based on arbitrarily distant context).
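
A minimal sketch of the deque approach (the name tail_lines is just
illustrative; it reads forwards, so decoding is always correct):

from collections import deque

def tail_lines(path, n=10, encoding="utf-8"):
    # Read forwards; the deque never holds more than the last n lines.
    with open(path, encoding=encoding) as f:
        return list(deque(f, maxlen=n))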

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On 01May2022 23:30, Stefan Ram <ram@zedat.fu-berlin.de> wrote:
>Dan Stromberg <drsalists@gmail.com> writes:
>>But what about Unicode? Are all 10 bytes newlines in Unicode encodings?
> It seems in UTF-8, when a value is above U+007F, it will be
> encoded with bytes that always have their high bit set.

Aye. Design feature enabling easy resync-to-char-boundary at an
arbitrary point in the file.

> But Unicode has NEL "Next Line" U+0085 and other values that
> conforming applications should recognize as line terminators.

I disagree. Maybe for printing things. But textual data records? I would
hope to end them with NL, and only NL (code 10).

Cheers,
Cameron Simpson <cs@cskk.id.au>
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On Mon, 2 May 2022 at 11:54, Cameron Simpson <cs@cskk.id.au> wrote:
>
> On 01May2022 23:30, Stefan Ram <ram@zedat.fu-berlin.de> wrote:
> >Dan Stromberg <drsalists@gmail.com> writes:
> >>But what about Unicode? Are all 10 bytes newlines in Unicode encodings?
> > It seems in UTF-8, when a value is above U+007F, it will be
> > encoded with bytes that always have their high bit set.
>
> Aye. Design feature enabling easy resync-to-char-boundary at an
> arbitrary point in the file.

Yep - and there's also a distinction between "first byte of multi-byte
character" and "continuation byte, keep scanning backwards". So you're
guaranteed to be able to resynchronize.

(If you know whether it's little-endian or big-endian, UTF-16 can also
resync like that, since "high surrogate" and "low surrogate" look
different.)

> > But Unicode has NEL "Next Line" U+0085 and other values that
> > conforming applications should recognize as line terminators.
>
> I disagree. Maybe for printing things. But textual data records? I would
> hope to end them with NL, and only NL (code 10).
>

I'm with you on that - textual data records should end with 0x0A only.
But if there are text entities in there, they should be allowed to
include any Unicode characters, potentially including other types of
whitespace.

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On Mon, 2 May 2022 at 18:31, Stefan Ram <ram@zedat.fu-berlin.de> wrote:
>
> |The Unicode standard defines a number of characters that
> |conforming applications should recognize as line terminators:[7]
> |
> |LF: Line Feed, U+000A
> |VT: Vertical Tab, U+000B
> |FF: Form Feed, U+000C
> |CR: Carriage Return, U+000D
> |CR+LF: CR (U+000D) followed by LF (U+000A)
> |NEL: Next Line, U+0085
> |LS: Line Separator, U+2028
> |PS: Paragraph Separator, U+2029
> |
> Wikipedia "Newline".

Should I suppose that other encodings may have more line ending chars?
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On Tue, 3 May 2022 at 04:38, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
>
> On Mon, 2 May 2022 at 18:31, Stefan Ram <ram@zedat.fu-berlin.de> wrote:
> >
> > |The Unicode standard defines a number of characters that
> > |conforming applications should recognize as line terminators:[7]
> > |
> > |LF: Line Feed, U+000A
> > |VT: Vertical Tab, U+000B
> > |FF: Form Feed, U+000C
> > |CR: Carriage Return, U+000D
> > |CR+LF: CR (U+000D) followed by LF (U+000A)
> > |NEL: Next Line, U+0085
> > |LS: Line Separator, U+2028
> > |PS: Paragraph Separator, U+2029
> > |
> > Wikipedia "Newline".
>
> Should I suppose that other encodings may have more line ending chars?

No, because those are Unicode characters. How they're encoded may
affect the bytes you see, but those are code point values after
decoding.

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
Ok, I suppose \n and \r are enough:

########
readline(size=-1, /)

Read and return one line from the stream. If size is specified, at
most size bytes will be read.

The line terminator is always b'\n' for binary files; for text files,
the newline argument to open() can be used to select the line
terminator(s) recognized.
########
open(file, mode='r', buffering=-1, encoding=None, errors=None,
newline=None, closefd=True, opener=None)
[...]
newline controls how universal newlines mode works (it only applies to
text mode). It can be None, '', '\n', '\r', and '\r\n'
########
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On Mon, 2 May 2022 at 00:20, Cameron Simpson <cs@cskk.id.au> wrote:
>
> On 01May2022 18:55, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
> >Something like this is OK?
> [...]
> >def tail(f):
> > chunk_size = 100
> > size = os.stat(f.fileno()).st_size
>
> I think you want os.fstat().

It's the same since Python 3.3.

> >     chunk_line_pos = -1
> >     pos = 0
> >
> >     for pos in positions:
> >         f.seek(pos)
> >         chars = f.read(chunk_size)
> >         chunk_line_pos = chars.rfind(b"\n")
> >
> >         if chunk_line_pos != -1:
> >             break
>
> Normal text file _end_ in a newline. I'd expect this to stop immediately
> at the end of the file.

I think it's correct. The last line in this case is an empty bytes.

> >     if chunk_line_pos == -1:
> >         nbytes = pos
> >         pos = 0
> >         f.seek(pos)
> >         chars = f.read(nbytes)
> >         chunk_line_pos = chars.rfind(b"\n")
>
> I presume this is because unless you're very lucky, 0 will not be a
> position in the range(). I'd be inclined to avoid duplicating this code
> and special case and instead maybe make the range unbounded and do
> something like this:
>
>     if pos < 0:
>         pos = 0
>     ... seek/read/etc ...
>     if pos == 0:
>         break
>
> around the for-loop body.

Yes, I was not very happy to duplicate the code... I have to think about it.

> Seems sane. I haven't tried to run it.

Thank you ^^
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
I have a little problem.

I tried to extend the tail function, so it can read lines from the bottom
of a file object opened in text mode.

The problem is that it does not work. It gets a starting position that is
lower than expected by 3 characters. So the first line is read only for 2
chars, and the last line is missing.

import os

_lf = "\n"
_cr = "\r"
_lf_ord = ord(_lf)

def tail(f, n=10, chunk_size=100):
    n_chunk_size = n * chunk_size
    pos = os.stat(f.fileno()).st_size
    chunk_line_pos = -1
    lines_not_found = n
    binary_mode = "b" in f.mode
    lf = _lf_ord if binary_mode else _lf

    while pos != 0:
        pos -= n_chunk_size

        if pos < 0:
            pos = 0

        f.seek(pos)
        chars = f.read(n_chunk_size)

        for i, char in enumerate(reversed(chars)):
            if char == lf:
                lines_not_found -= 1

                if lines_not_found == 0:
                    chunk_line_pos = len(chars) - i - 1
                    print(chunk_line_pos, i)
                    break

        if lines_not_found == 0:
            break

    line_pos = pos + chunk_line_pos + 1

    f.seek(line_pos)

    res = b"" if binary_mode else ""

    for i in range(n):
        res += f.readline()

    return res

Maybe the problem is 1 char != 1 byte?
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On 2022-05-06 20:21, Marco Sulla wrote:
> I have a little problem.
>
> I tried to extend the tail function, so it can read lines from the bottom
> of a file object opened in text mode.
>
> The problem is it does not work. It gets a starting position that is lower
> than the expected by 3 characters. So the first line is read only for 2
> chars, and the last line is missing.
>
> import os
>
> _lf = "\n"
> _cr = "\r"
> _lf_ord = ord(_lf)
>
> def tail(f, n=10, chunk_size=100):
>     n_chunk_size = n * chunk_size
>     pos = os.stat(f.fileno()).st_size
>     chunk_line_pos = -1
>     lines_not_found = n
>     binary_mode = "b" in f.mode
>     lf = _lf_ord if binary_mode else _lf
>
>     while pos != 0:
>         pos -= n_chunk_size
>
>         if pos < 0:
>             pos = 0
>
>         f.seek(pos)
>         chars = f.read(n_chunk_size)
>
>         for i, char in enumerate(reversed(chars)):
>             if char == lf:
>                 lines_not_found -= 1
>
>                 if lines_not_found == 0:
>                     chunk_line_pos = len(chars) - i - 1
>                     print(chunk_line_pos, i)
>                     break
>
>         if lines_not_found == 0:
>             break
>
>     line_pos = pos + chunk_line_pos + 1
>
>     f.seek(line_pos)
>
>     res = b"" if binary_mode else ""
>
>     for i in range(n):
>         res += f.readline()
>
>     return res
>
> Maybe the problem is 1 char != 1 byte?

Is the file UTF-8? That's a variable-width encoding, so are any of the
characters > U+007F?

Which OS? On Windows, it's common/normal for UTF-8 files to start with a
BOM/signature, which is 3 bytes/1 codepoint.
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On Fri, 6 May 2022 21:19:48 +0100, MRAB <python@mrabarnett.plus.com>
declaimed the following:

>Is the file UTF-8? That's a variable-width encoding, so are any of the
>characters > U+007F?
>
>Which OS? On Windows, it's common/normal for UTF-8 files to start with a
>BOM/signature, which is 3 bytes/1 codepoint.

Windows also uses <cr><lf> for the EOL marker, but Python's I/O system
condenses that to just <lf> internally (for TEXT mode) -- so using the
length of a string so read to compute a file position may be off-by-one for
each EOL in the string.

https://docs.python.org/3/tutorial/inputoutput.html#reading-and-writing-files
"""
In text mode, the default when reading is to convert platform-specific line
endings (\n on Unix, \r\n on Windows) to just \n. When writing in text
mode, the default is to convert occurrences of \n back to platform-specific
line endings. This behind-the-scenes modification to file data is fine for
text files, but will corrupt binary data like that in JPEG or EXE files. Be
very careful to use binary mode when reading and writing such files.
"""



--
Wulfraed Dennis Lee Bieber AF6VN
wlfraed@ix.netcom.com http://wlfraed.microdiversity.freeddns.org/
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On Sat, 7 May 2022 at 01:03, Dennis Lee Bieber <wlfraed@ix.netcom.com> wrote:
>
> Windows also uses <cr><lf> for the EOL marker, but Python's I/O system
> condenses that to just <lf> internally (for TEXT mode) -- so using the
> length of a string so read to compute a file position may be off-by-one for
> each EOL in the string.

So there's no way to reliably read lines in reverse in text mode using
seek and read; the only option is readlines?
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
Marco,
I think it was made clear from the start that "text" files in the classic sense have no random access method at any higher level than reading a byte at some offset from the beginning of the file, or back from the end when it has not grown.
The obvious fact is that most of the time the lines are not of fixed widths and you have heard about multiple byte encodings and how the ends of lines can vary.

When files get long enough that just reading them from the start as a whole, or even in chunks, gets too expensive, some people might consider some other method. Log files can go on for years so it is not uncommon to start a new one periodically and have a folder with many of them in some order. To get the last few lines simply means finding the last file and reading it, or if it is too short, getting the penultimate one too.
And obviously a database or other structure might work better which might make each "line" a record and index them.
But there are ways to create your own data that get around this, such as using an encoding with a large but fixed width for every character, albeit at the cost of more storage space. But if the goal is a general-purpose tool, internationalization beyond ASCII has created a challenge for lots of such tools.
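
For instance, with a fixed-width encoding such as UTF-32-LE, character i
always starts at byte 4*i, so seeking to a character position is trivial
(a quick interpreter check):

>>> data = "naïve\n".encode("utf-32-le")
>>> data[4*2:4*3].decode("utf-32-le")   # the third character
'ï'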


-----Original Message-----
From: Marco Sulla <Marco.Sulla.Python@gmail.com>
To: Dennis Lee Bieber <wlfraed@ix.netcom.com>
Cc: python-list@python.org
Sent: Sat, May 7, 2022 9:21 am
Subject: Re: tail

On Sat, 7 May 2022 at 01:03, Dennis Lee Bieber <wlfraed@ix.netcom.com> wrote:
>
>        Windows also uses <cr><lf> for the EOL marker, but Python's I/O system
> condenses that to just <lf> internally (for TEXT mode) -- so using the
> length of a string so read to compute a file position may be off-by-one for
> each EOL in the string.

So there's no way to reliably read lines in reverse in text mode using
seek and read, but the only option is readlines?
--
https://mail.python.org/mailman/listinfo/python-list
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
> On 7 May 2022, at 14:24, Marco Sulla <Marco.Sulla.Python@gmail.com> wrote:
>
> On Sat, 7 May 2022 at 01:03, Dennis Lee Bieber <wlfraed@ix.netcom.com> wrote:
>>
>> Windows also uses <cr><lf> for the EOL marker, but Python's I/O system
>> condenses that to just <lf> internally (for TEXT mode) -- so using the
>> length of a string so read to compute a file position may be off-by-one for
>> each EOL in the string.
>
> So there's no way to reliably read lines in reverse in text mode using
> seek and read, but the only option is readlines?

You need to handle the file in bin mode and do the handling of line endings and encodings yourself. It’s not that hard for the cases you wanted.
Figure out which line ending is in use from among CR LF, LF, and CR.
Once you have a line, decode it before returning it.

The only OS I know that used CR was Classic Mac OS.
If you do not care about that then you can split on NL and strip any trailing CR.

Barry


> --
> https://mail.python.org/mailman/listinfo/python-list
>

--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On Sat, 7 May 2022 at 16:08, Barry <barry@barrys-emacs.org> wrote:
> You need to handle the file in bin mode and do the handling of line endings and encodings yourself. It’s not that hard for the cases you wanted.

>>> "\n".encode("utf-16")
b'\xff\xfe\n\x00'
>>> "".encode("utf-16")
b'\xff\xfe'
>>> "a\nb".encode("utf-16")
b'\xff\xfea\x00\n\x00b\x00'
>>> "\n".encode("utf-16").lstrip("".encode("utf-16"))
b'\n\x00'

Can I use the last trick to get the encoding of a LF or a CR in any encoding?
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
I believe I'd do something like:

#!/usr/local/cpython-3.10/bin/python3

"""
Output the last 10 lines of a potentially-huge file.

O(n). But technically so is scanning backward from the EOF.

It'd be faster to use a dict, but this has the advantage of working for
huge num_lines.
"""

import dbm
import os
import sys

tempfile = f'/tmp/{os.path.basename(sys.argv[0])}.{os.getpid()}'

db = dbm.open(tempfile, 'n')

num_lines = 10

for cur_lineno, line in enumerate(sys.stdin):
    db[str(cur_lineno)] = line.encode('utf-8')
    max_lineno = cur_lineno
    str_age_out_lineno = str(cur_lineno - num_lines - 1)
    if str_age_out_lineno in db:
        del db[str_age_out_lineno]

for lineno in range(max_lineno, max_lineno - num_lines, -1):
    str_lineno = str(lineno)
    if str_lineno not in db:
        break
    print(db[str(lineno)].decode('utf-8'), end='')

db.close()
os.unlink(tempfile)


On Sat, Apr 23, 2022 at 11:36 AM Marco Sulla <Marco.Sulla.Python@gmail.com>
wrote:

> What about introducing a method for text streams that reads the lines
> from the bottom? Java also has a ReversedLinesFileReader in Apache
> Commons IO.
> --
> https://mail.python.org/mailman/listinfo/python-list
>
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On 2022-05-07 17:28, Marco Sulla wrote:
> On Sat, 7 May 2022 at 16:08, Barry <barry@barrys-emacs.org> wrote:
>> You need to handle the file in bin mode and do the handling of line endings and encodings yourself. It’s not that hard for the cases you wanted.
>
>>>> "\n".encode("utf-16")
> b'\xff\xfe\n\x00'
>>>> "".encode("utf-16")
> b'\xff\xfe'
>>>> "a\nb".encode("utf-16")
> b'\xff\xfea\x00\n\x00b\x00'
>>>> "\n".encode("utf-16").lstrip("".encode("utf-16"))
> b'\n\x00'
>
> Can I use the last trick to get the encoding of a LF or a CR in any encoding?

In the case of UTF-16, it's 2 bytes per code unit, but those 2 bytes
could be little-endian or big-endian.

As you didn't specify which you wanted, it defaulted to little-endian
and added a BOM (U+FEFF).

If you specify which endianness you want with "utf-16le" or "utf-16be",
it won't add the BOM:

>>> # Little-endian.
>>> "\n".encode("utf-16le")
b'\n\x00'
>>> # Big-endian.
>>> "\n".encode("utf-16be")
b'\x00\n'
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On Sat, 7 May 2022 at 19:02, MRAB <python@mrabarnett.plus.com> wrote:
>
> On 2022-05-07 17:28, Marco Sulla wrote:
> > On Sat, 7 May 2022 at 16:08, Barry <barry@barrys-emacs.org> wrote:
> >> You need to handle the file in bin mode and do the handling of line endings and encodings yourself. It’s not that hard for the cases you wanted.
> >
> >>>> "\n".encode("utf-16")
> > b'\xff\xfe\n\x00'
> >>>> "".encode("utf-16")
> > b'\xff\xfe'
> >>>> "a\nb".encode("utf-16")
> > b'\xff\xfea\x00\n\x00b\x00'
> >>>> "\n".encode("utf-16").lstrip("".encode("utf-16"))
> > b'\n\x00'
> >
> > Can I use the last trick to get the encoding of a LF or a CR in any encoding?
>
> In the case of UTF-16, it's 2 bytes per code unit, but those 2 bytes
> could be little-endian or big-endian.
>
> As you didn't specify which you wanted, it defaulted to little-endian
> and added a BOM (U+FEFF).
>
> If you specify which endianness you want with "utf-16le" or "utf-16be",
> it won't add the BOM:
>
> >>> # Little-endian.
> >>> "\n".encode("utf-16le")
> b'\n\x00'
> >>> # Big-endian.
> >>> "\n".encode("utf-16be")
> b'\x00\n'

Well, ok, but I need a generic method to get LF and CR for any
encoding a user can input.
Do you think that

"\n".encode(encoding).lstrip("".encode(encoding))

is good for any encoding? Furthermore, is there a way to get the
encoding of an opened file object?
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On Sat, 7 May 2022 20:35:34 +0200, Marco Sulla
<Marco.Sulla.Python@gmail.com> declaimed the following:

>Well, ok, but I need a generic method to get LF and CR for any
>encoding an user can input.

Other than EBCDIC, <lf> and <cr> AS BYTES should appear as x0A and x0D
in any of the 8-bit encodings (ASCII, ISO-8859-x, CPxxxx, UTF-8). I believe
those bytes also appear in UTF-16 -- BUT, they will have a null (x00) byte
associated with them as padding; as a result, you can not search for just
x0Dx0A (Windows line end convention -- they may be x00x0Dx00x0A or
x0Dx00x0Ax00 depending on endianness cf:
https://docs.microsoft.com/en-us/cpp/text/support-for-unicode?view=msvc-170
)

For EBCDIC <cr> is still x0D, but <lf> is x25 (and there is a separate
<nl> [new line] at x15)
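
For example:

>>> "\r\n".encode("utf-16-le")
b'\r\x00\n\x00'
>>> "\r\n".encode("utf-16-be")
b'\x00\r\x00\n'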


--
Wulfraed Dennis Lee Bieber AF6VN
wlfraed@ix.netcom.com http://wlfraed.microdiversity.freeddns.org/
--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On 2022-05-07 19:35, Marco Sulla wrote:
> On Sat, 7 May 2022 at 19:02, MRAB <python@mrabarnett.plus.com> wrote:
> >
> > On 2022-05-07 17:28, Marco Sulla wrote:
> > > On Sat, 7 May 2022 at 16:08, Barry <barry@barrys-emacs.org> wrote:
> > >> You need to handle the file in bin mode and do the handling of line endings and encodings yourself. It’s not that hard for the cases you wanted.
> > >
> > >>>> "\n".encode("utf-16")
> > > b'\xff\xfe\n\x00'
> > >>>> "".encode("utf-16")
> > > b'\xff\xfe'
> > >>>> "a\nb".encode("utf-16")
> > > b'\xff\xfea\x00\n\x00b\x00'
> > >>>> "\n".encode("utf-16").lstrip("".encode("utf-16"))
> > > b'\n\x00'
> > >
> > > Can I use the last trick to get the encoding of a LF or a CR in any encoding?
> >
> > In the case of UTF-16, it's 2 bytes per code unit, but those 2 bytes
> > could be little-endian or big-endian.
> >
> > As you didn't specify which you wanted, it defaulted to little-endian
> > and added a BOM (U+FEFF).
> >
> > If you specify which endianness you want with "utf-16le" or "utf-16be",
> > it won't add the BOM:
> >
> > >>> # Little-endian.
> > >>> "\n".encode("utf-16le")
> > b'\n\x00'
> > >>> # Big-endian.
> > >>> "\n".encode("utf-16be")
> > b'\x00\n'
>
> Well, ok, but I need a generic method to get LF and CR for any
> encoding an user can input.
> Do you think that
>
> "\n".encode(encoding).lstrip("".encode(encoding))
>
> is good for any encoding?
'.lstrip' is the wrong method to use because it treats its argument as a
set of characters, so it might strip off too many characters. A better
choice is '.removeprefix'.
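
For example (U+00FF is just a character whose UTF-16 bytes happen to
collide with the BOM bytes; bytes.removeprefix needs Python 3.9+):

>>> "ÿ\n".encode("utf-16")
b'\xff\xfe\xff\x00\n\x00'
>>> "ÿ\n".encode("utf-16").lstrip("".encode("utf-16"))        # eats part of the character
b'\x00\n\x00'
>>> "ÿ\n".encode("utf-16").removeprefix("".encode("utf-16"))  # removes only the BOM
b'\xff\x00\n\x00'
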
> Furthermore, is there a way to get the encoding of an opened file object?
>
How was the file opened?


If it was opened as a text file, use the '.encoding' attribute (which
just tells you what encoding was specified when it was opened, and you'd
be assuming that it's the correct one).


If it was opened as a binary file, all you know is that it contains
bytes, and determining the encoding (assuming that it is a text file) is
down to heuristics (i.e. guesswork).

--
https://mail.python.org/mailman/listinfo/python-list
Re: tail [ In reply to ]
On 2022-05-07 19:47, Stefan Ram wrote:
> Marco Sulla <Marco.Sulla.Python@gmail.com> writes:
>>Well, ok, but I need a generic method to get LF and CR for any
>>encoding an user can input.
>
> "LF" and "CR" come from US-ASCII. It is theoretically
> possible that there might be some encodings out there
> (not for Unicode) that are not based on US-ASCII and
> have no LF or no CR.
>
>>is good for any encoding? Furthermore, is there a way to get the
>>encoding of an opened file object?
>
> I have written a function that might be able to detect one
> of few encodings based on a heuristic algorithm.
>
> def encoding( name ):
>     path = pathlib.Path( name )
>     for encoding in( "utf_8", "latin_1", "cp1252" ):
>         try:
>             with path.open( encoding=encoding, errors="strict" )as file:
>                 text = file.read()
>             return encoding
>         except UnicodeDecodeError:
>             pass
>     return "ascii"
>
> Yes, it's potentially slow and might be wrong.
> The result "ascii" might mean it's a binary file.
>
"latin-1" will decode any sequence of bytes, so it'll never try
"cp1252", nor fall back to "ascii", and falling back to "ascii" is wrong
anyway because the file could contain 0x80..0xFF, which aren't supported
by that encoding.
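
For instance:

>>> bytes(range(256)).decode("latin-1")[:8]   # never raises, whatever the bytes are
'\x00\x01\x02\x03\x04\x05\x06\x07'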
--
https://mail.python.org/mailman/listinfo/python-list
