Mailing List Archive

buffer design (was: marshal (was: Buffer interface in abstract.c?))
M.-A. Lemburg wrote:
>...
> I meant PyUnicode_* style APIs for dealing with all the aspects
> of Unicode objects -- much like the PyString_* APIs available.

Sure, these could be added as necessary. For raw access to the bytes, I
would refer people to the abstract buffer functions, tho.

> > Your abstract.c functions make it quite simple.
>
> BTW, do we need an extra set of those with buffer index or not ?
> Those would really be one-liners for the sake of hiding the
> type slots from applications.

It sounds like NumPy and PIL would need it, which makes the landscape
quite a bit different from the last time we discussed this (when we
didn't imagine anybody needing those).

>...
> > > Since fp.write() uses "s#" this would use the getreadbuffer
> > > slot in 1.5.2... I think what it *should* do is use the
> > > getcharbuffer slot instead (see my other post), since dumping
> > > the raw unicode data would lose too much information. Again,
> >
> > I very much disagree. To me, fp.write() is not about writing characters
> > to a stream. I think it makes much more sense as "writing bytes to a
> > stream" and the buffer interface fits that perfectly.
>
> This is perfectly ok, but shouldn't the behaviour of fp.write()
> mimic that of previous Python versions ? How does JPython
> write the data ?

fp.write() had no semantics for writing Unicode objects since they
didn't exist. Therefore, we are not breaking or changing any behavior.

> Inlined different subject:
> I think the internal semantics of "s#" using the getreadbuffer slot
> and "t#" the getcharbuffer slot should be switched; see my other post.

1) Too late
2) The use of "t#" ("text") for the getcharbuffer slot was decided by
the Benevolent Dictator.
3) see (2)

> In previous Python versions "s#" had the semantics of string data
> with possibly embedded NULL bytes. Now it suddenly has the meaning
> of binary data and you can't simply change extensions to use the
> new "t#" because people are still using them with older Python
> versions.

Guido and I had a pretty long discussion on what the best approach here
was. I think we even pulled in Tim as a final arbiter, as I recall.

I believe "s#" remained getreadbuffer simply because it *also* meant
"give me the bytes of that object". If it changed to getcharbuffer, then
you could see exceptions in code that didn't raise exceptions
beforehand.

(more below)

> > There is no loss of data. You could argue that the byte order is lost,
> > but I think that is incorrect. The application defines the semantics:
> > the file might be defined as using host-order, or the application may be
> > writing a BOM at the head of the file.
>
> The problem here is that many applications were not written
> to handle this kind of object. Previously they could only
> handle strings; now they can suddenly handle any object
> having the buffer interface and then fail when the data
> gets read back in.

An application is a complete unit. How are you suddenly going to
manifest Unicode objects within that application? One way is for the
developer to go in and change things; let them deal with the issues and
fallout of their change. The other is an external change, such as an
upgrade to the interpreter or a module. Again, (IMO) if you're
perturbing a system, then you are responsible for also correcting any
problems you introduce.

In any case, Guido's position was that things can easily switch over to
the "t#" interface to prevent the class of error where you pass a
Unicode string to a function that expects a standard string.

> > > such things should be handled by extra methods, e.g. fp.rawwrite().
> >
> > I believe this would be a needless complication of the interface.
>
> It would clarify things and make the interface 100% backward
> compatible again.

No. "s#" used to pull bytes from any buffer-capable object. Your
suggestion for "s#" to use the getcharbuffer could introduce exceptions
into currently-working code.

(this was probably Guido's prime motivation for the current meaning of
"t#"... I can dig up the mail thread if people need an authoritative
commentary on the decision that was made)

> > > Hmm, I guess the philosophy behind the interface is not
> > > really clear.
> >
> > I didn't design or implement it initially, but (as you may have guessed)
> > I am a proponent of its existence.
> >
> > > Binary data is fetched via getreadbuffer and then
> > > interpreted as character data... I always thought that the
> > > getcharbuffer should be used for such an interpretation.
> >
> > The former is bad behavior. That is why getcharbuffer was added (by me,
> > for 1.5.2). It was a preventative measure for the introduction of
> > Unicode strings. Using getreadbuffer for characters would break badly
> > given a Unicode string. Therefore, "clients" that want (8-bit)
> > characters from an object supporting the buffer interface should use
> > getcharbuffer. The Unicode object doesn't implement it, implying that it
> > cannot provide 8-bit characters. You can get the raw bytes thru
> > getreadbuffer.
>
> I agree 100%, but did you add the "t#" instead of having
> "s#" use the getcharbuffer interface ?

Yes. For reasons detailed above.

> E.g. my mxTextTools
> package uses "s#" on many APIs. Now someone could stick
> in a Unicode object and get pretty strange results without
> any notice about mxTextTools and Unicode being incompatible.

They could also stick in an array of integers. That supports the buffer
interface, meaning the "s#" in your code would extract the bytes from
it. In other words, people can already stick bogus stuff into your code.

This seems to be a moot argument.

> You could argue that I change to "t#", but that doesn't
> work since many people out there still use Python versions
> <1.5.2 and those didn't have "t#", so mxTextTools would then
> fail completely for them.

If support for the older versions is needed, then use an #ifdef to set
up the appropriate macro in some header. Use that throughout your code.

In any case: yes -- I would argue that you should absolutely be using
"t#".

Cheers,
-g

--
Greg Stein, http://www.lyra.org/
Re: buffer design (was: marshal (was: Buffer interface in abstract.c?))
Greg Stein <gstein@lyra.org> wrote:
> > E.g. my mxTextTools
> > package uses "s#" on many APIs. Now someone could stick
> > in a Unicode object and get pretty strange results without
> > any notice about mxTextTools and Unicode being incompatible.
>
> They could also stick in an array of integers. That supports the buffer
> interface, meaning the "s#" in your code would extract the bytes from
> it. In other words, people can already stick bogus stuff into your code.

Except that people may expect unicode strings
to work just like any other kind of string, while
arrays are surely a different thing.

I'm beginning to suspect that the current buffer
design is partially broken; it tries to work around
at least two problems at once:

a) the current use of "string" objects for two purposes:
as strings of 8-bit characters, and as buffers containing
arbitrary binary data.

b) performance issues when reading/writing certain kinds
of data to/from streams.

and fails to fully address either of them.

</F>
Re: buffer design
Greg Stein wrote:
>
> M.-A. Lemburg wrote:
> >...
> > I meant PyUnicode_* style APIs for dealing with all the aspects
> > of Unicode objects -- much like the PyString_* APIs available.
>
> Sure, these could be added as necessary. For raw access to the bytes, I
> would refer people to the abstract buffer functions, tho.

I guess that's up to them... PyUnicode_AS_WCHAR() could also be
exposed, I guess (are C's wchar strings usable as a Unicode basis ?).

> > > Your abstract.c functions make it quite simple.
> >
> > BTW, do we need an extra set of those with buffer index or not ?
> > Those would really be one-liners for the sake of hiding the
> > type slots from applications.
>
> It sounds like NumPy and PIL would need it, which makes the landscape
> quite a bit different from the last time we discussed this (when we
> didn't imagine anybody needing those).

Ok, then I'll add them and post the new set next week.

> >...
> > > > Since fp.write() uses "s#" this would use the getreadbuffer
> > > > slot in 1.5.2... I think what it *should* do is use the
> > > > getcharbuffer slot instead (see my other post), since dumping
> > > > the raw unicode data would lose too much information. Again,
> > >
> > > I very much disagree. To me, fp.write() is not about writing characters
> > > to a stream. I think it makes much more sense as "writing bytes to a
> > > stream" and the buffer interface fits that perfectly.
> >
> > This is perfectly ok, but shouldn't the behaviour of fp.write()
> > mimic that of previous Python versions ? How does JPython
> > write the data ?
>
> fp.write() had no semantics for writing Unicode objects since they
> didn't exist. Therefore, we are not breaking or changing any behavior.

The problem is hidden in polymorphic functions and tools: previously
they could not handle anything but strings; now they also work
on arbitrary buffers without raising exceptions. That's what I'm
concerned about.

> > Inlined different subject:
> > I think the internal semantics of "s#" using the getreadbuffer slot
> > and "t#" the getcharbuffer slot should be switched; see my other post.
>
> 1) Too late
> 2) The use of "t#" ("text") for the getcharbuffer slot was decided by
> the Benevolent Dictator.
> 3) see (2)

1) It's not too late: most people aren't even aware of the buffer
interface (except maybe the small crowd on this list).

2) A mistake in a patchlevel release of Python can easily be undone
in the next minor release. No big deal.

3) To remain compatible with 1.5.2 in future revisions, a
new explicit marker, e.g. "r#" for raw data, could be added to hold the
code for getreadbuffer. "s#" and "z#" should then switch
to using getcharbuffer.

> > In previous Python versions "s#" had the semantics of string data
> > with possibly embedded NULL bytes. Now it suddenly has the meaning
> > of binary data and you can't simply change extensions to use the
> > new "t#" because people are still using them with older Python
> > versions.
>
> Guido and I had a pretty long discussion on what the best approach here
> was. I think we even pulled in Tim as a final arbiter, as I recall.

What was the final argument then ? (I guess the discussion was
held *before* the addition of getcharbuffer, right ?)

> I believe "s#" remained getreadbuffer simply because it *also* meant
> "give me the bytes of that object". If it changed to getcharbuffer, then
> you could see exceptions in code that didn't raise exceptions
> beforehand.
>
> (more below)

"s#" historically always meant "give me char* data with length".
It did not mean: "give me a pointer to the data area and its length".
That interpretation is new in 1.5.2. Even integers and lists
could provide buffer access under the new interpretation...
(sounds evil ;-)

> > > There is no loss of data. You could argue that the byte order is lost,
> > > but I think that is incorrect. The application defines the semantics:
> > > the file might be defined as using host-order, or the application may be
> > > writing a BOM at the head of the file.
> >
> > The problem here is that many applications were not written
> > to handle this kind of object. Previously they could only
> > handle strings; now they can suddenly handle any object
> > having the buffer interface and then fail when the data
> > gets read back in.
>
> An application is a complete unit. How are you suddenly going to
> manifest Unicode objects within that application? The only way is if the
> developer goes in and changes things; let them deal with the issues and
> fallout of their change. The other is external changes such as an
> upgrade to the interpreter or a module. Again, (IMO) if you're
> perturbing a system, then you are responsible for also correcting any
> problems you introduce.

Well, ok, if you're talking about standalone apps. I was
referring to applications which interact with other applications,
e.g. via files or sockets. You could pass a Unicode object to a
socket and have it transfer the data to the other end without
getting an exception on the sending side of the connection.
The receiver would read the data as a string and most probably
fail.

The whole application sitting in between, dealing with
the protocol and connection management, wouldn't even notice
that you've just tried to extend its capabilities.

> In any case, Guido's position was that things can easily switch over to
> the "t#" interface to prevent the class of error where you pass a
> Unicode string to a function that expects a standard string.

Strange, why should code that relies on 8-bit character data
be changed because a new unsupported object type pops up ?
Code supporting the new type will have to be rewritten anyway,
but why break existing extensions in unpredictable ways ?

> > > > such things should be handled by extra methods, e.g. fp.rawwrite().
> > >
> > > I believe this would be a needless complication of the interface.
> >
> > It would clarify things and make the interface 100% backward
> > compatible again.
>
> No. "s#" used to pull bytes from any buffer-capable object. Your
> suggestion for "s#" to use the getcharbuffer could introduce exceptions
> into currently-working code.

The buffer objects were introduced in 1.5.1, AFAIR. Changing
the semantics back to the original ones would only break
extensions relying on the behaviour you describe -- the distribution
can easily be adapted to use some other marker, such as "r#".

> (this was probably Guido's prime motivation for the current meaning of
> "t#"... I can dig up the mail thread if people need an authoritative
> commentary on the decision that was made)
>
> > > > Hmm, I guess the philosophy behind the interface is not
> > > > really clear.
> > >
> > > I didn't design or implement it initially, but (as you may have guessed)
> > > I am a proponent of its existence.
> > >
> > > > Binary data is fetched via getreadbuffer and then
> > > > interpreted as character data... I always thought that the
> > > > getcharbuffer should be used for such an interpretation.
> > >
> > > The former is bad behavior. That is why getcharbuffer was added (by me,
> > > for 1.5.2). It was a preventative measure for the introduction of
> > > Unicode strings. Using getreadbuffer for characters would break badly
> > > given a Unicode string. Therefore, "clients" that want (8-bit)
> > > characters from an object supporting the buffer interface should use
> > > getcharbuffer. The Unicode object doesn't implement it, implying that it
> > > cannot provide 8-bit characters. You can get the raw bytes thru
> > > getreadbuffer.
> >
> > I agree 100%, but did you add the "t#" instead of having
> > "s#" use the getcharbuffer interface ?
>
> Yes. For reasons detailed above.
>
> > E.g. my mxTextTools
> > package uses "s#" on many APIs. Now someone could stick
> > in a Unicode object and get pretty strange results without
> > any notice about mxTextTools and Unicode being incompatible.
>
> They could also stick in an array of integers. That supports the buffer
> interface, meaning the "s#" in your code would extract the bytes from
> it. In other words, people can already stick bogus stuff into your code.

Right now they can with 1.5.1 and 1.5.2 which is unfortunate.
I'd rather have the parsing function raise an exception.

> This seems to be a moot argument.

Not really when you have to support extensions across three
different patchlevels of Python.

> > You could argue that I change to "t#", but that doesn't
> > work since many people out there still use Python versions
> > <1.5.2 and those didn't have "t#", so mxTextTools would then
> > fail completely for them.
>
> If support for the older versions is needed, then use an #ifdef to set
> up the appropriate macro in some header. Use that throughout your code.
>
> In any case: yes -- I would argue that you should absolutely be using
> "t#".

I can easily change my code, no big deal, but what about
the dozens of other extensions I don't want to bother diving
into ? I'd rather see an exception than complete garbage written
to a file or a socket.

--
Marc-Andre Lemburg
______________________________________________________________________
Y2000: 139 days left
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
Re: buffer design
Fredrik Lundh wrote:
>
> Greg Stein <gstein@lyra.org> wrote:
> > > E.g. my mxTextTools
> > > package uses "s#" on many APIs. Now someone could stick
> > > in a Unicode object and get pretty strange results without
> > > any notice about mxTextTools and Unicode being incompatible.
> >
> > They could also stick in an array of integers. That supports the buffer
> > interface, meaning the "s#" in your code would extract the bytes from
> > it. In other words, people can already stick bogus stuff into your code.
>
> Except that people may expect unicode strings
> to work just like any other kind of string, while
> arrays are surely a different thing.
>
> I'm beginning to suspect that the current buffer
> design is partially broken; it tries to work around
> at least two problems at once:
>
> a) the current use of "string" objects for two purposes:
> as strings of 8-bit characters, and as buffers containing
> arbitrary binary data.
>
> b) performance issues when reading/writing certain kinds
> of data to/from streams.
>
> and fails to fully address either of them.

True, a higher level interface for those two objectives would
certainly address them much better than what we are trying to do at
the bit level. Buffers should probably only be treated as pointers to
abstract memory areas and nothing more.

BTW, what about my suggestion to extend buffers to also allocate
memory (in case you pass None as the object) ? Or should array
be used for that purpose (it's an undocumented feature of arrays) ?

--
Marc-Andre Lemburg
______________________________________________________________________
Y2000: 139 days left
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/