Mailing List Archive

End of the line
The latest versions of the Icon language (9.3.1 & beyond) sprouted an
interesting change in semantics: if you open a file for reading in
"translated" (text) mode now, it normalizes Unix, Mac and Windows line
endings to plain \n. Writing in text mode still produces what's natural for
the platform.
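
In Python terms, the read-side translation amounts to something like
this (my sketch of the semantics, not Icon's code):

import string
def translate(data):
    data = string.replace(data, '\r\n', '\n')  # Windows endings first
    data = string.replace(data, '\r', '\n')    # then lone Mac endings
    return data                                # Unix \n passes through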

Anyone think that's *not* a good idea?

c-will-never-get-fixed-ly y'rs - tim
Re: End of the line
> The latest versions of the Icon language (9.3.1 & beyond) sprouted an
> interesting change in semantics: if you open a file for reading in
> "translated" (text) mode now, it normalizes Unix, Mac and Windows line
> endings to plain \n. Writing in text mode still produces what's natural for
> the platform.
>
> Anyone think that's *not* a good idea?

I've been thinking about this myself -- exactly what I would do.

Not clear how easy it is to implement (given that I'm not so enthused
about the idea of rewriting the entire I/O system without using stdio
-- see archives).

The implementation must be as fast as the current one -- people used
to complain bitterly when readlines() or read() were just a tad
slower than they *could* be.

There's a lookahead of 1 character needed -- ungetc() might be
sufficient except that I think it's not guaranteed to work on
unbuffered files.
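
At the Python level the lookahead is easy to spell out -- a sketch
only; note the pushback is done via seek, which is exactly where
unseekable files would hurt:

def readchar_translated(f):
    # Return the next character with \r\n and lone \r mapped to \n.
    c = f.read(1)
    if c == '\r':
        pos = f.tell()
        if f.read(1) != '\n':
            f.seek(pos)        # the "ungetc" -- needs a seekable file!
        c = '\n'
    return c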

Should also do this for the Python parser -- there it would be a lot
easier.

--Guido van Rossum (home page: http://www.python.org/~guido/)
RE: End of the line
[Tim]
> ... Icon ... sprouted an interesting change in semantics: if you open
> a file for reading in ...text mode ... it normalizes Unix, Mac and
> Windows line endings to plain \n. Writing in text mode still produces
> what's natural for the platform.

[Guido]
> I've been thinking about this myself -- exactly what I would do.

Me too <wink>.

> Not clear how easy it is to implement (given that I'm not so enthused
> about the idea of rewriting the entire I/O system without using stdio
> -- see archives).

The Icon implementation is very simple: they *still* open the file in stdio
text mode. "What's natural for the platform" on writing then comes for
free. On reading, libc usually takes care of what's needed, and what
remains is to check for stray '\r' characters that stdio glossed over. That
is, in fileobject.c, replacing

if ((*buf++ = c) == '\n') {
    if (n < 0)
        buf--;
    break;
}

with a block like (untested!)

*buf++ = c;
if (c == '\n' || c == '\r') {
    if (c == '\r') {
        *(buf-1) = '\n';
        /* consume following newline, if any */
        c = getc(fp);
        if (c != '\n')
            ungetc(c, fp);
    }
    if (n < 0)
        buf--;
    break;
}

Related trickery needed in readlines. Of course the '\r' business should be
done only if the file was opened in text mode.

> The implementation must be as fast as the current one -- people used
> to complain bitterly when readlines() or read() were just a tad
> slower than they *could* be.

The above does add one compare per character. Haven't timed it. readlines
may be worse.

BTW, people complain bitterly anyway, but it's in comparison to Perl text
mode line-at-a-time reads!

D:\Python>wc a.c
1146880 3023873 25281537 a.c

D:\Python>

Reading that via

def g():
    f = open("a.c")
    while 1:
        line = f.readline()
        if not line:
            break

and using python -O took 51 seconds. Running the similar Perl (although
it's not idiomatic Perl to assign each line to an explicit var, or to test
that var in the loop, or to use "if !" instead of "unless" -- did all those
to make it more like the Python):

open(DATA, "<a.c");
while ($line = <DATA>) {last if ! $line;}

took 17 seconds. So when people are complaining about a factor of 3, I'm
not inclined to get excited about a few percent <wink>.
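
(For the record, the timings were just wall-clock around the loops,
along these lines -- a sketch; absolute numbers vary by box:)

import time
t0 = time.time()
g()                                  # the readline loop above
print round(time.time() - t0, 1), "seconds"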

> There's a lookahead of 1 character needed -- ungetc() might be
> sufficient except that I think it's not guaranteed to work on
> unbuffered files.

Don't believe I've bumped into that. *Have* bumped into problems with
ungetc not playing nice with fseek/ftell, and that's probably enough to kill
it right there (alas).

> Should also do this for the Python parser -- there it would be a lot
> easier.

And probably the biggest bang for the buck.

the-problem-with-exposing-libc-is-that-libc-isn't-worth-exposing<wink-ly
y'rs - tim
Re: End of the line
> The Icon implementation is very simple: they *still* open the file in stdio
> text mode. "What's natural for the platform" on writing then comes for
> free. On reading, libc usually takes care of what's needed, and what
> remains is to check for stray '\r' characters that stdio glossed over.

This'll work for Unix and PC conventions, but not for the Mac. Mac end of line
is \r, so reading a line from a mac file on unix will give you the whole file.
--
Jack Jansen | ++++ stop the execution of Mumia Abu-Jamal ++++
Jack.Jansen@oratrix.com | ++++ if you agree copy these lines to your sig ++++
www.oratrix.nl/~jack | see http://www.xs4all.nl/~tank/spg-l/sigaction.htm
RE: End of the line
[Tim]
> On reading, libc usually takes care of what's needed, and what
> remains is to check for stray '\r' characters that stdio glossed over.

[Jack Jansen]
> This'll work for Unix and PC conventions, but not for the Mac.
> Mac end of line is \r, so reading a line from a mac file on unix will
> give you the whole file.

I don't see how. Did you look at the code I posted? It treats '\r' the
same as '\n', except that when it sees an '\r' it eats a following '\n' (if
any) too, and replaces the '\r' with '\n' regardless.

Maybe you're missing that Python reads lines one character at a time? So
e.g. the behavior of the platform libc fgets is irrelevant.
Re: End of the line
> [Jack Jansen]
> > This'll work for Unix and PC conventions, but not for the Mac.
> > Mac end of line is \r, so reading a line from a mac file on unix will
> > give you the whole file.
> [...]
>
> Maybe you're missing that Python reads lines one character at a time? So
> e.g. the behavior of the platform libc fgets is irrelevant.

You're absolutely right...
--
Jack Jansen | ++++ stop the execution of Mumia Abu-Jamal ++++
Jack.Jansen@oratrix.com | ++++ if you agree copy these lines to your sig ++++
www.oratrix.nl/~jack | see http://www.xs4all.nl/~tank/spg-l/sigaction.htm
RE: End of the line
> The latest versions of the Icon language [convert \r\n, \r and \n to
> plain \n in text mode upon read, and convert \n to the platform convention
> on write]

It's a trend <wink>: the latest version of the REBOL language also does
this.

The Java compiler does it for Java source files, but I don't know how
runtime file reads and writes behave in Java.

Anyone know offhand if there's a reliable way to determine whether an open
file descriptor (a C FILE*) is seekable?

if-i'm-doomed-to-get-obsessed-by-this-may-as-well-make-it-faster-
too-ly y'rs - tim
Re: End of the line
Tim Peters wrote:
>
> Anyone know offhand if there's a reliable way to determine whether an open
> file descriptor (a C FILE*) is seekable?

I'd simply use trial&error:

if (fseek(stream, 0, SEEK_CUR) < 0) {
    if (errno != EBADF) {
        /* Not seekable */
        errno = 0;
    }
    else
        /* Error */
        ;
}
else
    /* Seekable */
    ;

How to get this thread-safe is left as an exercise for the interested
reader ;)
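
FWIW, the same trial & error is nearly a one-liner from Python (a
sketch):

def is_seekable(f):
    try:
        f.seek(0, 1)       # seek 0 bytes from the current position
    except IOError:
        return 0           # e.g. a pipe or a tty
    return 1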

Cheers,
--
Marc-Andre Lemburg
______________________________________________________________________
Y2000: 166 days left
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
Re: End of the line
Tim Peters <tim_one@email.msn.com> wrote:
> The latest versions of the Icon language (9.3.1 & beyond) sprouted an
> interesting change in semantics: if you open a file for reading in
> "translated" (text) mode now, it normalizes Unix, Mac and Windows line
> endings to plain \n. Writing in text mode still produces what's natural for
> the platform.
>
> Anyone think that's *not* a good idea?

if we were to change this, how would you
tell Python to open a file in text mode?

</F>
RE: End of the line
[Tim]
> The latest versions of the Icon language ... normalizes Unix, Mac
> and Windows line endings to plain \n. Writing in text mode still
> produces what's natural for the platform.

[/F]
> if we were to change this, how would you
> tell Python to open a file in text mode?

Meaning whatever it is the platform libc does?

In Icon or REBOL, you don't. Icon is more interesting because they changed
the semantics of their "t" (for "translated") mode without providing any way
to go back to the old behavior (REBOL did this too, but didn't have Icon's
15 years of history to wrestle with). Curiously (I doubt Griswold *cared*
about this!), the resulting behavior still conforms to ANSI C, because that
std promises little about text mode semantics in the presence of
non-printable characters.

Nothing of mine would miss C's raw text mode (lack of) semantics, so I don't
care.

I *would* like Python to define portable semantics for the mode strings it
accepts in the builtin open regardless, and push platform-specific silliness
(including raw C text mode, if someone really wants that; or MS's "c" mode,
etc) into a new os.fopen function. Push random C crap into expert modules,
where it won't baffle my sister <0.7 wink>.
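
To make the proposal concrete, the split would look something like
this (hypothetical -- os.fopen doesn't exist today):

import os

f = open("spam.txt", "r")         # portable Python semantics, always
f = os.fopen("spam.txt", "rc")    # raw platform modes, experts only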

I expect Python should still open non-binary files in the platform's text
mode, though, to minimize surprises for C extensions mucking with the
underlying stream object (Icon/REBOL don't have this problem, although Icon
opens the file in native libc text mode anyway).

next-step:-define-tabs-to-mean-8-characters-and-drop-unicode-in-
favor-of-7-bit-ascii<wink>-ly y'rs - tim
RE: End of the line
[Tim, notes that Perl line-at-a-time text mode input runs 3x faster than
Python's on his platform]

And much to my surprise, it turns out Perl reads lines a character at a time
too! And they do not reimplement stdio. But they do cheat.

Perl's internals are written on top of an abstract IO API, with "PerlIO *"
instead of "FILE *", "PerlIO_tell(PerlIO *)" instead of "ftell(FILE*)", and
so on. Nothing surprising in the details, except maybe that stdin is
modeled as a function "PerlIO *PerlIO_stdin(void)" instead of as global data
(& ditto for stdout/stderr).

The usual *implementation* of these guys is as straight macro substitution
to the corresponding C stdio call. It's possible to implement them some
other way, but I don't see anything in the source that suggests anyone has
done so, except possibly to build it all on AT&T's SFIO lib.

So where's the cheating? In these API functions:

int PerlIO_has_base(PerlIO *);
int PerlIO_has_cntptr(PerlIO *);
int PerlIO_canset_cnt(PerlIO *);

char *PerlIO_get_ptr(PerlIO *);
int PerlIO_get_cnt(PerlIO *);
void PerlIO_set_cnt(PerlIO *,int);
void PerlIO_set_ptrcnt(PerlIO *,char *,int);
char *PerlIO_get_base(PerlIO *);
int PerlIO_get_bufsiz(PerlIO *);

In almost all platform stdio implementations, the C FILE struct has members
that may vary in name but serve the same purpose: an internal buffer, and
some way (pointer or offset) to get at "the next" buffer character. The
guys above are usually just (after layers & layers of config stuff sets it
up) macros that expand into the platform's internal way of spelling these
things. For example, the count member is spelled under Windows as fp->_cnt
under VC, or as fp->level under Borland.

The payoff is in Perl's sv_gets function, in file sv.c. This is long and
very complicated, but at its core has a fast inner loop that copies
characters (provided the PerlIO_has/canXXX functions say it's possible)
directly from the stdio buffer into a Perl string variable -- in the way a
platform fgets function *would* do it if it bothered to optimize fgets. In
my experience, platforms usually settle for the same kind of
fgetc/EOF?/newline? loop Python uses, as if fgets were a stdio client rather
than a stdio primitive. Perl's keeps everything in registers inside the
loop, updates the FILE struct members only at the boundaries, and doesn't
check for EOF except at the boundaries (so long as the buffer has unread
stuff in it, you can't be at EOF).

If the stdio buffer is exhausted before the input terminator is seen (Perl
has "input record separator" and "paragraph mode" gimmicks, so it's hairier
than just looking for \n), it calls PerlIO_getc once to force the platform
to refill the buffer, and goes back to the screaming loop.
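
The shape of the trick can be spelled at the Python level too (a
sketch of the idea, not Perl's code): scan a buffered chunk in bulk,
and touch the boundaries only to refill.

import os, string

class BufferedLineReader:
    # usage sketch: r = BufferedLineReader(os.open("a.c", os.O_RDONLY))
    def __init__(self, fd, bufsize=8192):
        self.fd = fd
        self.bufsize = bufsize
        self.buf = ""
    def readline(self):
        while 1:
            i = string.find(self.buf, '\n')   # bulk scan, no per-char test
            if i >= 0:
                line = self.buf[:i+1]
                self.buf = self.buf[i+1:]
                return line
            chunk = os.read(self.fd, self.bufsize)  # refill at boundary only
            if not chunk:                           # EOF: return what's left
                line = self.buf
                self.buf = ""
                return line
            self.buf = self.buf + chunk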

Major hackery, but major payoff (on most platforms) too. The abstract I/O
layer is a fine idea regardless. The sad thing is that the real reason Perl
is so fast here is that platform fgets is so needlessly slow.

perl-input-is-faster-than-c-input-ly y'rs - tim
RE: End of the line
> [Tim, notes that Perl line-at-a-time text mode input runs 3x faster
> than Python's on his platform]
>
> And much to my surprise, it turns out Perl reads lines a character at
> a time too! And they do not reimplement stdio. But they do cheat.
>
> [some notes on the cheating and PerlIO api snipped]
>
> The usual *implementation* of these guys is as straight macro
> substitution to the corresponding C stdio call. It's possible to
> implement them some other way, but I don't see anything in the source
> that suggests anyone has done so, except possibly to build it all on
> AT&T's SFIO lib.

Hmm - speed bonuses notwithstanding, an implementation of
such a beast in the Python sources would've helped a lot to
reduce the ugly hairy gymnastics required to get Python going
on Win CE, where (until very recently) there was no concept
of most of the things you expect to find in stdio...


Brian Lloyd brian@digicool.com
Software Engineer 540.371.6909
Digital Creations http://www.digicool.com
RE: End of the line
[Tim, on the cheating PerlIO API]

[Brian Lloyd]
> Hmm - speed bonuses not withstanding, an implementation of
> such a beast in the Python sources would've helped a lot to
> reduce the ugly hairy gymnastics required to get Python going
> on Win CE, where (until very recently) there was no concept
> of most of the things you expect to find in stdio...

I don't think it would have helped you there. If e.g. ftell is missing,
it's no easier to implement it yourself under the name "PerlIO_ftell" than
under the name "ftell" ...

Back before Larry Wall got it into his head that Perl is a grand metaphor
for freedom and creativity (or whatever), he justifiably claimed that Perl's
great achievement was in taming Unix. Which it did! Perl essentially
defined yet a 537th variation of libc/shell/tool semantics, but in a way
that worked the same across its 536 Unix hosts. The PerlIO API is a great
help with *that*: if a platform is a little off kilter in its
implementation of one of these functions, Perl can use a corresponding
PerlIO wrapper to hide the shortcoming in a platform-specific file, and the
rest of Perl blissfully assumes everything works the same everywhere.

That's a good, cool idea. Ironically, Perl does more to hide gratuitous
platform differences here than Python does! But it's just a pile of names
if you've got no stdio to build on.

let's-model-PythonIO-on-the-win32-api<wink>-ly y'rs - tim
RE: End of the line
> let's-model-PythonIO-on-the-win32-api<wink>-ly y'rs - tim

Interestingly, this raises a point worth mentioning sans-wink :-)

Win32 has quite a nice concept that file handles (nearly all handles
really) are "waitable".

Indeed, in the Win32 world, this feature usually prevents me from using the
"threading" module - I need to wait on objects other than threads or locks
(usually files, but sometimes child processes). I also usually need a
"wait for the first one of these objects", which threading doesnt provide,
but that is a digression...

What I'm getting at is that a Python IO model should maybe go a little
further than "traditional" IO - asynchronous IO and synchronisation
capabilities should also be specified. Of course, these would be optional,
but it would be excellent if a platform could easily slot into pre-defined
Python semantics if possible.

Is this reasonable, or really simply too hard to abstract in the manner I
am talking about!?

Mark.
Re: End of the line
> What I'm getting at is that a Python IO model should maybe go a little
> further than "traditional" IO - asynchronous IO and synchronisation
> capabilities should also be specified. Of course, these would be optional,
> but it would be excellent if a platform could easily slot into pre-defined
> Python semantics if possible.

What Python could do with reasonable ease is a sort of "promise" model, where
an I/O operation returns an object that waits for the I/O to complete upon
access or destruction.

Something like

def foo():
    obj = stdin.delayed_read()
    obj2 = stdout.delayed_write("data")
    do_lengthy_computation()
    data = obj.get()    # Here we wait for the read to complete
    del obj2            # Here we wait for the write to complete.

This gives a fairly nice programming model.
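
One way to back such promises with what we have today is a thread per
operation (a sketch only -- real asynch I/O would do better):

import threading

class DelayedRead:
    # Sketch: run the read in a helper thread; get() blocks until done.
    def __init__(self, f, nbytes):
        self._data = None
        self._done = threading.Event()
        threading.Thread(target=self._run, args=(f, nbytes)).start()
    def _run(self, f, nbytes):
        self._data = f.read(nbytes)
        self._done.set()
    def get(self):
        self._done.wait()     # here we wait for the read to complete
        return self._data
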
--
Jack Jansen | ++++ stop the execution of Mumia Abu-Jamal ++++
Jack.Jansen@oratrix.com | ++++ if you agree copy these lines to your sig ++++
www.oratrix.nl/~jack | see http://www.xs4all.nl/~tank/spg-l/sigaction.htm
RE: End of the line
[Jack seems to like an asynch IO model]

> def foo():
>     obj = stdin.delayed_read()
>     obj2 = stdout.delayed_write("data")
>     do_lengthy_computation()
>     data = obj.get()    # Here we wait for the read to complete
>     del obj2            # Here we wait for the write to complete.
>
> This gives a fairly nice programming model.

Indeed. Taking this a little further, I come up with something like:

inlock = threading.Lock()
buffer = stdin.delayed_read(inlock)

outlock = threading.Lock()
stdout.delayed_write(outlock, "The data")

fired = threading.Wait(inlock, outlock) # new fn :-)

if fired is inlock: # etc.

The idea is we can make everything wait on a single lock abstraction.
threading.Wait() could accept lock objects, thread objects, Sockets, etc.

Obviously a bit to work out, but it does make an appealing model. OTOH, I
wonder how it fits with continuations etc. Not too badly from my weak
understanding. May be an interesting convergence!
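
A sketch of the "first of several" wait built on a shared Condition
(threading.Wait itself is hypothetical):

import threading

class WaitAny:
    # Sketch: each pending operation calls signal(op) when it finishes;
    # wait() blocks until the first one does and returns it.
    def __init__(self):
        self._cond = threading.Condition()
        self._fired = None
    def signal(self, op):
        self._cond.acquire()
        if self._fired is None:
            self._fired = op
        self._cond.notify()
        self._cond.release()
    def wait(self):
        self._cond.acquire()
        while self._fired is None:
            self._cond.wait()
        op = self._fired
        self._cond.release()
        return op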

Mark.
Re: End of the line
> [Jack seems to like an asynch IO model]
>
> > def foo():
> >     obj = stdin.delayed_read()
> >     obj2 = stdout.delayed_write("data")
> >     do_lengthy_computation()
> >     data = obj.get()    # Here we wait for the read to complete
> >     del obj2            # Here we wait for the write to complete.
> >
> > This gives a fairly nice programming model.
>
> Indeed. Taking this a little further, I come up with something like:
>
> inlock = threading.Lock()
> buffer = stdin.delayed_read(inlock)
>
> outlock = threading.Lock()
> stdout.delayed_write(outlock, "The data")
>
> fired = threading.Wait(inlock, outlock) # new fn :-)
>
> if fired is inlock: # etc.

I think this is exactly what I _didn't_ want :-)

I'd like the delayed read to return an object that will automatically wait
when I try to get the data from it, and the delayed write object to
automatically wait when I garbage-collect it.

Of course, there's no reason why you couldn't also wait on these objects (or,
on unix, pass them to select(), or whatever).

On second thought the method of the delayed read should be called read()
instead of get(), of course.
--
Jack Jansen | ++++ stop the execution of Mumia Abu-Jamal ++++
Jack.Jansen@oratrix.com | ++++ if you agree copy these lines to your sig ++++
www.oratrix.nl/~jack | see http://www.xs4all.nl/~tank/spg-l/sigaction.htm
RE: End of the line
[I missed Jack's point]

> I think this is exactly what I _didn't_ want:-)
>
> I'd like the delayed read to return an object that will
> automatically wait
> when I try to get the data from it, and the delayed write object to
> automatically wait when I garbage-collect it.

OK - that is fine. My driving requirement was that I be able to wait on
_multiple_ files at the same time - i.e., I don't know which one will complete
first.

There is no reason then why your initial suggestion can not satisfy my
requirement, as long as the "buffer type object" returned from read is
itself waitable. I agree there is no driving need for a separate buffer
type object and a separate waitable object.

[OTOH, your scheme could be simply built on top of my scheme as a
framework]

Unfortunately, this doesn't seem to have grabbed anyone else's interest...

Mark.
RE: delayed I/O; multiple waits
[Mark Hammond]
> ...
> Unfortunately, this doesn't seem to have grabbed anyone else's interest...

You lost me when you said it should be optional -- that's fine for an
extension module, but it sounded like you wanted this to somehow be part of
the language core. If WaitForMultipleObjects (which is what you *really*
want <wink>) is thought to be a cool enough idea to be in the core, we
should think about how to implement it on non-Win32 platforms too.

needs-more-words-ly y'rs - tim
RE: RE: delayed I/O; multiple waits
> You lost me when you said it should be optional -- that's fine for an
> extension module, but it sounded like you wanted this to

Cool - I admit I knew it was too vague, but left it in anyway.

> the language core. If WaitForMultipleObjects (which is what
> you *really*

Sort-of. IMO, the threading module does need a WaitForMultipleObjects
(whatever the spelling) but <sigh> I also recall the discussion that this
is not trivial.

But what I _really_ want is an enhanced concept of "waitable" - threading
can only wait on locks and threads. If we have this, the WaitForMultiple
would become even more pressing, but they are not directly related.

So, I see 2 issues, both of which usually prevent me personally from using
the threading module in the real world.

By "optional", I meant a way for a platform to slot into existing
"waitable" semantics. Win32 file operations are waitable. I dont really
want native win32 file operations to be in the core, but I would like some
standard way that, if possible, I could map the waitable semantics to
Python waitable semantics.

Thus, although the threading module knows nothing about win32 file objects
or handles, it would be nice if it could still wait on them.

> needs-more-words-ly y'rs - tim

Unfortunately, if I knew exactly what I wanted I would be asking for
implementation advice rather than grasping at straws :-)

Attempting to move from totally raw to half-baked, I suppose this is what I
had in mind:

* Platform optionally defines what a "waitable" object is, in the same way
it now defines what a lock is. Locks are currently _required_ only with
threading - waitables would never be required.
* Python defines a "waitable" protocol - e.g., a new "tp_wait"/"__wait__"
slot. If the slot is filled (the function exists), it is expected to provide a
"waitable" object or NULL/None.
* Threading support for platforms that support it defines a tp_wait slot
that maps the Thread ID to the "waitable object".
* Ditto lock support for the platform.
* Extensions such as win32 handles also provide this.
* Dream up extensions to file objects a la Jack's idea. When a file is
opened asynch, tp_wait returns non-NULL (via platform specific hooks), or
NULL when opened sync (making it not waitable). Non-asynch platforms need
zero work here - the asynch open fails, tp_wait slot never filled in.

Thus, for platforms that provide no extra asynch support, threading can
still only wait on threads and locks. The threading module could take
advantage of the new protocol thereby supporting any waitable object.
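
In duck-typed terms the protocol test is tiny (a sketch; __wait__ is
the hypothetical slot from the list above):

def get_waitable(obj):
    # Hypothetical __wait__ protocol: returns a platform waitable
    # object, or None if the object isn't waitable here.
    try:
        method = obj.__wait__
    except AttributeError:
        return None
    return method()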

Like I said, only half-baked, but I think expresses a potentially workable
idea. Does this get closer to either a) explaining what I meant, or b)
confirming I am dribbling?

Biggest problem I see is that the only platform that may take advantage is
Windows, thereby making a platform-specific solution (such as win32event I
use now) perfectly reasonable. Maybe my focus should simply be on allowing
win32event.WaitFor* to accept threading instances and standard Python lock
objects!!

Mark.