Re: marshal (was:Buffer interface in abstract.c? )
Recently, "Fredrik Lundh" <fredrik@pythonware.com> said:
> on the other hand, I think something is missing from
> the buffer design; I definitely don't like that people
> can write and marshal objects that happen to
> implement the buffer interface, only to find that
> Python didn't do what they expected...
>
> >>> import unicode
> >>> import marshal
> >>> u = unicode.unicode
> >>> s = u("foo")
> >>> data = marshal.dumps(s)
> >>> marshal.loads(data)
> 'f\000o\000o\000'
> >>> type(marshal.loads(data))
> <type 'string'>

Hmm. Looking at the code there is a catchall at the end, with a
comment explicitly saying "Write unknown buffer-style objects as a string".
IMHO this is an incorrect design, but that's a bit philosophical (so
I'll gladly defer to Our Great Philosopher if he has anything to say
on the matter:-). Unless, of course, there are buffer-style non-string
objects around that are better read back as strings than not read back
at all.

Hmm again, I think I'd like it better if marshal.dumps() would barf on
attempts to write unrepresentable data. Currently unrepresentable
objects are written as TYPE_UNKNOWN (unless they have bufferness (or
should I call that "a buffer-aspect"? :-)), which means you think you
are writing correctly marshalled data but you'll be in for an
exception when you try to read it back...
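
A write-time failure of the kind asked for here can be sketched with the marshal module of present-day Python 3. That is an assumption about a much later interpreter than the one under discussion; there, an object with no marshal representation and no buffer support is rejected by dumps() instead of being written out as TYPE_UNKNOWN. The helper name try_dump is made up for the illustration.

```python
import marshal

def try_dump(obj):
    """Return the marshalled bytes, or the exception raised at write time."""
    try:
        return marshal.dumps(obj)
    except ValueError as exc:
        return exc

ok = try_dump([1, "two", 3.0])      # representable: plain list of builtins
failed = try_dump(object())         # not representable: rejected on write
print(type(ok).__name__, "/", type(failed).__name__)
```

The error surfaces where the bad data is produced, not later when somebody tries to read the file back.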
--
Jack Jansen | ++++ stop the execution of Mumia Abu-Jamal ++++
Jack.Jansen@oratrix.com | ++++ if you agree copy these lines to your sig ++++
www.oratrix.nl/~jack | see http://www.xs4all.nl/~tank/spg-l/sigaction.htm
Re: marshal (was:Buffer interface in abstract.c? )
Jack Jansen wrote:
>
> Recently, "Fredrik Lundh" <fredrik@pythonware.com> said:
> > on the other hand, I think something is missing from
> > the buffer design; I definitely don't like that people
> > can write and marshal objects that happen to
> > implement the buffer interface, only to find that
> > Python didn't do what they expected...
> >
> > >>> import unicode
> > >>> import marshal
> > >>> u = unicode.unicode
> > >>> s = u("foo")
> > >>> data = marshal.dumps(s)
> > >>> marshal.loads(data)
> > 'f\000o\000o\000'
> > >>> type(marshal.loads(data))
> > <type 'string'>

Why do Unicode objects implement the bf_getcharbuffer slot ? I thought
that unicode objects use a two-byte character representation.

Note that implementing the char buffer interface will also give
you strange results with other code that uses
PyArg_ParseTuple(...,"s#",...), e.g. you could search through
Unicode strings as if they were normal 1-byte/char strings (and
most certainly not find what you're looking for, I guess).
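
The search problem can be reproduced in a few lines of present-day Python 3. This is only a sketch: the "utf-16-le" encoding is chosen here as an assumption, to mimic the two-byte layout shown in the session above.

```python
text = "foo"
raw = text.encode("utf-16-le")   # two bytes per character, low byte first

found_in_text = "oo" in text     # character-level search
found_in_raw = b"oo" in raw      # byte-level search over the raw storage
print(raw, found_in_text, found_in_raw)
```

The byte-level search misses because the two "o" characters are separated by a zero byte in the raw storage.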

> Hmm again, I think I'd like it better if marshal.dumps() would barf on
> attempts to write unrepresentable data. Currently unrepresentable
> objects are written as TYPE_UNKNOWN (unless they have bufferness (or
> should I call that "a buffer-aspect"? :-)), which means you think you
> are writing correctly marshalled data but you'll be in for an
> exception when you try to read it back...

I'd prefer an exception on write too.

--
Marc-Andre Lemburg
______________________________________________________________________
Y2000: 147 days left
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
Re: marshal (was:Buffer interface in abstract.c? )
> > > >>> import unicode
> > > >>> import marshal
> > > >>> u = unicode.unicode
> > > >>> s = u("foo")
> > > >>> data = marshal.dumps(s)
> > > >>> marshal.loads(data)
> > > 'f\000o\000o\000'
> > > >>> type(marshal.loads(data))
> > > <type 'string'>
>
> Why do Unicode objects implement the bf_getcharbuffer slot ? I thought
> that unicode objects use a two-byte character representation.

>>> import array
>>> import marshal
>>> a = array.array
>>> s = a("f", [1, 2, 3])
>>> data = marshal.dumps(s)
>>> marshal.loads(data)
'\000\000\200?\000\000\000@\000\000@@'

looks like the various implementors haven't
really understood the intentions of whoever
designed the buffer interface...
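
For comparison, a sketch of the same round trip in present-day Python 3, where (as far as CPython's marshal is concerned, and this is an assumption about a later interpreter) buffer-style objects are still written as a plain byte string. Getting the array back requires the reader to already know the type code:

```python
import array
import marshal

floats = array.array("f", [1.0, 2.0, 3.0])
loaded = marshal.loads(marshal.dumps(floats))   # comes back as raw bytes

# The type information is gone; the caller has to supply it again.
restored = array.array("f")
restored.frombytes(loaded)
print(type(loaded).__name__, restored)
```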

</F>
Re: marshal (was:Buffer interface in abstract.c? )
Fredrik Lundh wrote:
>
> > > > >>> import unicode
> > > > >>> import marshal
> > > > >>> u = unicode.unicode
> > > > >>> s = u("foo")
> > > > >>> data = marshal.dumps(s)
> > > > >>> marshal.loads(data)
> > > > 'f\000o\000o\000'
> > > > >>> type(marshal.loads(data))
> > > > <type 'string'>

This was a "nicety" that was put in during a round of patches that I
submitted to Guido. We both had questions about it but figured that it
couldn't hurt since it at least let some things be marshalled out that
couldn't be marshalled before.

I would suggest backing out the marshalling of buffer-interface objects
and adding a mechanism for arbitrary type objects to marshal themselves.
Without the second part, arrays and Unicode objects aren't marshallable
at all (seems bad).

> > Why do Unicode objects implement the bf_getcharbuffer slot ? I thought
> > that unicode objects use a two-byte character representation.

Unicode objects should *not* implement the getcharbuffer slot. Only
read, write, and segcount.

> >>> import array
> >>> import marshal
> >>> a = array.array
> >>> s = a("f", [1, 2, 3])
> >>> data = marshal.dumps(s)
> >>> marshal.loads(data)
> '\000\000\200?\000\000\000@\000\000@@'
>
> looks like the various implementors haven't
> really understood the intentions of whoever
> designed the buffer interface...

Arrays can/should support both the getreadbuffer and getcharbuffer
interface. The former: definitely. The latter: only if the contents are
byte-sized.

The loading back as a string is a different matter, as pointed out
above.

Cheers,
-g

--
Greg Stein, http://www.lyra.org/
Re: marshal (was:Buffer interface in abstract.c? )
Recently, Greg Stein <gstein@lyra.org> said:
> I would suggest backing out the marshalling of buffer-interface objects
> and adding a mechanism for arbitrary type objects to marshal themselves.
> Without the second part, arrays and Unicode objects aren't marshallable
> at all (seems bad).

This sounds like the right approach. It would require 2 slots in the
tp_ structure and a little extra glue for the typecodes (currently
marshal knows all the 1-letter typecodes for all object types it can
handle, but types marshalling their own objects would require a
centralized registry of object types). For the time being it would
probably suffice to have the mapping of type<->letter be hardcoded in
marshal.h, but eventually you probably want a more extensible scheme,
where Joe R. Extension-Writer could add a marshaller to his objects
and know it won't collide with someone else's.
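
A rough sketch of such a registry, written in present-day Python rather than as tp_ slots in C. Every name here (register, dumps, loads, the "A" type code) is hypothetical and only illustrates the idea of one-letter codes that cannot collide:

```python
import array

# Hypothetical registry: one-letter type code -> (type, dump, load).
_registry = {}

def register(code, typ, dump, load):
    """Claim a one-letter code for typ; refuse codes that are already taken."""
    if len(code) != 1:
        raise ValueError("type code must be a single character")
    if code in _registry:
        raise ValueError("type code %r is already registered" % code)
    _registry[code] = (typ, dump, load)

def dumps(obj):
    """Write the type code followed by the type's own payload."""
    for code, (typ, dump, _load) in _registry.items():
        if type(obj) is typ:
            return code.encode("ascii") + dump(obj)
    raise ValueError("unmarshallable object: %r" % (obj,))

def loads(data):
    """Look up the loader by the leading type code and rebuild the object."""
    code = data[:1].decode("ascii")
    if code not in _registry:
        raise ValueError("unknown type code %r" % code)
    return _registry[code][2](data[1:])

# Example: arrays marshal themselves as <array typecode><raw item bytes>.
register(
    "A",
    array.array,
    lambda a: a.typecode.encode("ascii") + a.tobytes(),
    lambda payload: array.array(payload[:1].decode("ascii"), payload[1:]),
)

original = array.array("f", [1.0, 2.0, 3.0])
restored = loads(dumps(original))
```

Unregistered types fail in dumps(), and a second extension asking for "A" fails in register(), which is the collision protection described above.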

--
Jack Jansen | ++++ stop the execution of Mumia Abu-Jamal ++++
Jack.Jansen@oratrix.com | ++++ if you agree copy these lines to your sig ++++
www.oratrix.nl/~jack | see http://www.xs4all.nl/~tank/spg-l/sigaction.htm
Re: marshal (was:Buffer interface in abstract.c? )
Jack Jansen wrote:
>
> Recently, Greg Stein <gstein@lyra.org> said:
> > I would suggest backing out the marshalling of buffer-interface objects
> > and adding a mechanism for arbitrary type objects to marshal themselves.
> > Without the second part, arrays and Unicode objects aren't marshallable
> > at all (seems bad).
>
> This sounds like the right approach. It would require 2 slots in the
> tp_ structure and a little extra glue for the typecodes (currently
> marshal knows all the 1-letter typecodes for all object types it can
> handle, but types marshalling their own objects would require a
> centralized registry of object types). For the time being it would
> probably suffice to have the mapping of type<->letter be hardcoded in
> marshal.h, but eventually you probably want a more extensible scheme,
> where Joe R. Extension-Writer could add a marshaller to his objects
> and know it won't collide with someone else's.

This registry should ideally be reachable via C APIs. Then a module
writer could call these APIs in the init function of his module and
he'd be set. Since marshal won't be able to handle imports on the
fly (like pickle et al.), these modules will have to be imported
before unmarshalling.

Aside: wouldn't it make sense to move from marshal to pickle and
deprecate marshal altogether ? cPickle is quite fast and much more
flexible than marshal, plus it already provides mechanisms for
registering new types.
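
As a small illustration of the difference, here is a sketch using the pickle module of present-day Python 3 (which absorbed cPickle; that version choice is an assumption, not what the thread was running). Unlike the marshal round trip shown earlier in the thread, the object comes back with its type intact:

```python
import array
import pickle

floats = array.array("f", [1.0, 2.0, 3.0])
restored = pickle.loads(pickle.dumps(floats))   # the type survives the trip
print(type(restored).__name__, restored)
```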

--
Marc-Andre Lemburg
______________________________________________________________________
Y2000: 144 days left
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
Re: marshal (was:Buffer interface in abstract.c? )
> Aside: wouldn't it make sense to move from marshal to pickle and
> deprecate marshal altogether ? cPickle is quite fast and much more
> flexible than marshal, plus it already provides mechanisms for
> registering new types.

This is probably the best idea so far. Just remove the buffer-workaround in
marshal, keep it functioning for the things it is used for now (like pyc
files) and refer people to (c)Pickle for new development.

--
Jack Jansen | ++++ stop the execution of Mumia Abu-Jamal ++++
Jack.Jansen@oratrix.com | ++++ if you agree copy these lines to your sig ++++
www.oratrix.nl/~jack | see http://www.xs4all.nl/~tank/spg-l/sigaction.htm
Re: marshal (was:Buffer interface in abstract.c? )
Greg Stein <gstein@lyra.org> wrote:
> > > > > >>> import unicode
> > > > > >>> import marshal
> > > > > >>> u = unicode.unicode
> > > > > >>> s = u("foo")
> > > > > >>> data = marshal.dumps(s)
> > > > > >>> marshal.loads(data)
> > > > > 'f\000o\000o\000'
> > > > > >>> type(marshal.loads(data))
> > > > > <type 'string'>
>
> > > Why do Unicode objects implement the bf_getcharbuffer slot ? I thought
> > > that unicode objects use a two-byte character representation.
>
> Unicode objects should *not* implement the getcharbuffer slot. Only
> read, write, and segcount.

unicode objects do not implement the getcharbuffer slot.
here's the relevant descriptor:

static PyBufferProcs unicode_as_buffer = {
    (getreadbufferproc) unicode_buffer_getreadbuf,
    (getwritebufferproc) unicode_buffer_getwritebuf,
    (getsegcountproc) unicode_buffer_getsegcount
};

the array module uses a similar descriptor.

maybe the unicode class shouldn't implement the
buffer interface at all? sure looks like the best way
to avoid trivial mistakes (the current behaviour of
fp.write(unicodeobj) is even more serious than the
marshal glitch...)

or maybe the buffer design needs an overhaul?

</F>
Re: marshal (was:Buffer interface in abstract.c? )
> Greg Stein <gstein@lyra.org> wrote:
> > > > > > >>> import unicode
> > > > > > >>> import marshal
> > > > > > >>> u = unicode.unicode
> > > > > > >>> s = u("foo")
> > > > > > >>> data = marshal.dumps(s)
> > > > > > >>> marshal.loads(data)
> > > > > > 'f\000o\000o\000'
> > > > > > >>> type(marshal.loads(data))
> > > > > > <type 'string'>
> >
> > > > Why do Unicode objects implement the bf_getcharbuffer slot ? I thought
> > > > that unicode objects use a two-byte character representation.
> >
> > Unicode objects should *not* implement the getcharbuffer slot. Only
> > read, write, and segcount.
>
> unicode objects do not implement the getcharbuffer slot.
> here's the relevant descriptor:
>
> static PyBufferProcs unicode_as_buffer = {
> (getreadbufferproc) unicode_buffer_getreadbuf,
> (getwritebufferproc) unicode_buffer_getwritebuf,
> (getsegcountproc) unicode_buffer_getsegcount
> };
>
> the array module uses a similar descriptor.
>
> maybe the unicode class shouldn't implement the
> buffer interface at all? sure looks like the best way
> to avoid trivial mistakes (the current behaviour of
> fp.write(unicodeobj) is even more serious than the
> marshal glitch...)
>
> or maybe the buffer design needs an overhaul?

I think most places that should use the charbuffer interface actually
use the readbuffer interface. This is what should be fixed.

--Guido van Rossum (home page: http://www.python.org/~guido/)
Re: marshal (was:Buffer interface in abstract.c? )
Fredrik Lundh wrote:
>
> unicode objects do not implement the getcharbuffer slot.
>...
> or maybe the buffer design needs an overhaul?

I think its usage does. The character slot should be used whenever
character data is needed, not the read buffer slot. The latter one is
for passing around raw binary data (without reinterpretation !),
if I understood Greg correctly back when I gave those abstract
APIs a try.

--
Marc-Andre Lemburg
______________________________________________________________________
Y2000: 143 days left
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
Re: marshal (was:Buffer interface in abstract.c? )
> > or maybe the buffer design needs an overhaul?
>
> I think most places that should use the charbuffer interface actually
> use the readbuffer interface. This is what should be fixed.

ok.

btw, how about adding support for buffer access
to data that have strange internal formats (like
certain PIL image memories) or aren't directly accessible
(like "virtual" and "abstract" image buffers in PIL 1.1).
something like:

int initbuffer(PyObject* obj, void** context);
int exitbuffer(PyObject* obj, void* context);

and corresponding context arguments to the
rest of the functions...

</F>
Re: marshal (was:Buffer interface in abstract.c? )
> btw, how about adding support for buffer access
> to data that have strange internal formats (like
> certain PIL image memories) or aren't directly accessible
> (like "virtual" and "abstract" image buffers in PIL 1.1).
> something like:
>
> int initbuffer(PyObject* obj, void** context);
> int exitbuffer(PyObject* obj, void* context);
>
> and corresponding context arguments to the
> rest of the functions...

Can you explain this idea more? Without more understanding of PIL I
have no idea what you're talking about...

--Guido van Rossum (home page: http://www.python.org/~guido/)
Re: marshal (was:Buffer interface in abstract.c? )
On Tue, 10 Aug 1999, Fredrik Lundh wrote:
>...
> unicode objects do not implement the getcharbuffer slot.

This is Goodness. All righty.

>...
> maybe the unicode class shouldn't implement the
> buffer interface at all? sure looks like the best way

It is needed for fp.write(unicodeobj) ...

It is also very handy for C functions to deal with Unicode strings.

> to avoid trivial mistakes (the current behaviour of
> fp.write(unicodeobj) is even more serious than the
> marshal glitch...)

What's wrong with fp.write(unicodeobj)? It should write the unicode value
to the file. Are you suggesting that it will need to be done differently?
Icky.

> or maybe the buffer design needs an overhaul?

Not that I know of.

Cheers,
-g

--
Greg Stein, http://www.lyra.org/
Re: marshal (was:Buffer interface in abstract.c? )
On Tue, 10 Aug 1999, Guido van Rossum wrote:
>...
> > or maybe the buffer design needs an overhaul?
>
> I think most places that should use the charbuffer interface actually
> use the readbuffer interface. This is what should be fixed.

I believe that I properly changed all of these within the core
distribution. Per your requested design, third-party extensions must
switch from "s#" to "t#" to move to the charbuffer interface, as needed.

Cheers,
-g

--
Greg Stein, http://www.lyra.org/
Re: marshal (was:Buffer interface in abstract.c? )
M.-A. Lemburg writes:
> Aside: Is the buffer interface reachable in any way from within
> Python ? Why isn't the interface exposed via __XXX__ methods
> on normal Python instances (could be implemented by returning a
> buffer object) ?

Would it even make sense? I thought a large part of the intent was
performance, avoiding memory copies. Perhaps there should be
an .__as_buffer__() which returned an object that supports the C
buffer interface. I'm not sure how useful it would be; perhaps for
classes that represent image data? They could return a buffer object
created from a string/array/NumPy array.


-Fred

--
Fred L. Drake, Jr. <fdrake@acm.org>
Corporation for National Research Initiatives
Re: marshal (was:Buffer interface in abstract.c? )
> Would it even make sense? I thought a large part of the intent was
> performance, avoiding memory copies.

looks like there's some confusion here over
what the buffer interface is all about. time
for a new GvR essay, perhaps?

</F>
Re: marshal (was:Buffer interface in abstract.c? )
Fredrik Lundh writes:
> looks like there's some confusion here over
> what the buffer interface is all about. time
> for a new GvR essay, perhaps?

If he'll write something about it, I'll be glad to adapt it to the
extending & embedding manual. It seems important that it be included
in the standard documentation since it will be important for extension
writers to understand when they should implement it.


-Fred

--
Fred L. Drake, Jr. <fdrake@acm.org>
Corporation for National Research Initiatives
Re: marshal (was:Buffer interface in abstract.c? )
Guido van Rossum wrote:
> > btw, how about adding support for buffer access
> > to data that have strange internal formats (like
> > certain PIL image memories) or aren't directly accessible
> > (like "virtual" and "abstract" image buffers in PIL 1.1).
> > something like:
> >
> > int initbuffer(PyObject* obj, void** context);
> > int exitbuffer(PyObject* obj, void* context);
> >
> > and corresponding context arguments to the
> > rest of the functions...
>
> Can you explain this idea more? Without more understanding of PIL I
> have no idea what you're talking about...

in code:

    void* context;

    // this can be done at any time
    segments = pb->bf_getsegcount(obj, NULL, context);

    if (!pb->bf_initbuffer(obj, &context))
        ... failed to initialise buffer api ...

    ... allocate segment size buffer ...

    pb->bf_getsegcount(obj, &bytes, context);
    ... calculate total buffer size and allocate buffer ...

    for (i = offset = 0; i < segments; i++) {
        n = pb->bf_getreadbuffer(obj, i, &p, context);
        if (n < 0)
            ... failed to fetch a given segment ...
        memcpy(buf + offset, p, n); // or write to file, or whatever
        offset = offset + n;
    }

    pb->bf_exitbuffer(obj, context);

in other words, this would give the target object a
chance to keep some local context (like a temporary
buffer) during a sequence of buffer operations...

for PIL, this would make it possible to

1) store required metadata (size, mode, palette)
along with the actual buffer contents.

2) possibly pack formats that use extra internal
storage for performance reasons -- RGB pixels
are stored as 32-bit integers, for example.

3) access virtual image memories (that can only
be accessed via a buffer-like interface in
themselves -- given an image object, you acquire an
access handle, and use a getdata method to
access the actual data. without initbuffer,
there's no way to do two buffer accesses in
parallel. without exitbuffer, there's no way
to release the access handle. without the
context variable, there's nowhere to keep
the access handle between calls.)

4) access abstract image memories (like virtual
memories, but they reside outside PIL, like on
a remote server, or inside another image
processing library, or on a hardware device).

5) convert to external formats on the fly:

fp.write(im.buffer("JPEG"))

and probably a lot more. as far as I can tell,
nothing of this can be done using the current
design...

...

besides, what about buffers and threads? if you
return a pointer from getreadbuf, wouldn't it be
good to know exactly when Python doesn't need
that pointer any more? explicit initbuffer/exitbuffer
calls around each sequence of buffer operations
would make that a lot safer...
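
The acquire/release idea can be illustrated with memoryview in present-day Python 3. This is a later design, not the initbuffer/exitbuffer API sketched above, and it is used here only as an assumption-laden analogy: while a view is held, the exporting object refuses operations that would invalidate the pointer.

```python
data = bytearray(b"abc")

with memoryview(data) as view:      # acquire: the buffer is now exported
    first = view[0]
    try:
        data.extend(b"d")           # resizing could move the memory
        resized_while_exported = True
    except BufferError:
        resized_while_exported = False
# leaving the with-block releases the view

data.extend(b"d")                   # allowed again after the release
print(first, resized_while_exported, data)
```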

</F>
Re: marshal (was:Buffer interface in abstract.c? )
Greg Stein wrote:
>
> On Tue, 10 Aug 1999, Guido van Rossum wrote:
> >...
> > > or maybe the buffer design needs an overhaul?
> >
> > I think most places that should use the charbuffer interface actually
> > use the readbuffer interface. This is what should be fixed.
>
> I believe that I properly changed all of these within the core
> distribution. Per your requested design, third-party extensions must
> switch from "s#" to "t#" to move to the charbuffer interface, as needed.

Shouldn't this be the other way around ? After all, extensions
using "s#" do expect character data and not arbitrary binary
encodings of information. IMHO, the latter should be special
cased, not the former. E.g. it doesn't make sense to use the
re module to scan over 2-byte Unicode with single character
based search patterns.

Aside: Is the buffer interface reachable in any way from within
Python ? Why isn't the interface exposed via __XXX__ methods
on normal Python instances (could be implemented by returning a
buffer object) ?

--
Marc-Andre Lemburg
______________________________________________________________________
Y2000: 140 days left
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
Re: marshal (was:Buffer interface in abstract.c? )
Fred L. Drake, Jr. wrote:
>
> M.-A. Lemburg writes:
> > Aside: Is the buffer interface reachable in any way from within
> > Python ? Why isn't the interface exposed via __XXX__ methods
> > on normal Python instances (could be implemented by returning a
> > buffer object) ?
>
> Would it even make sense? I thought a large part of the intent was
> performance, avoiding memory copies. Perhaps there should be
> an .__as_buffer__() which returned an object that supports the C
> buffer interface. I'm not sure how useful it would be; perhaps for
> classes that represent image data? They could return a buffer object
> created from a string/array/NumPy array.

That's what I had in mind.

    def __getreadbuffer__(self):
        return buffer(self.data)

    def __getcharbuffer__(self):
        return buffer(self.string_data)

    def __getwritebuffer__(self):
        return buffer(self.mmaped_file)

Note that buffer() does not copy the data, it only adds a reference
to the object being used.
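
The no-copy behaviour can be checked with memoryview, which took over the role of buffer() in Python 3. Using it here is an assumption made so the sketch runs on a current interpreter:

```python
payload = bytearray(b"spam")
view = memoryview(payload)   # no copy: the view refers to payload's memory

payload[0] = ord("S")        # change the underlying object...
seen = bytes(view)           # ...and the view reflects the change
print(seen)
```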

Hmm, how about adding a writeable binary object to the core ?
This would be useful for the __getwritebuffer__() API because
currently, I think, only mmap'ed files are useable as write
buffers -- no other in-memory type. Perhaps buffer objects
could be used for this purpose too, e.g. by having them
allocate the needed memory chunk in case you pass None as
object.
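
Later Python versions did grow such a type: bytearray. A minimal sketch of a writable in-memory buffer being filled in place, assuming present-day Python 3 (nothing like this existed in 1.5.2):

```python
import io

scratch = bytearray(8)              # writable, zero-filled, in-memory chunk
source = io.BytesIO(b"hello")

count = source.readinto(scratch)    # writes straight into scratch, no copy
print(count, scratch)
```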

--
Marc-Andre Lemburg
______________________________________________________________________
Y2000: 140 days left
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
Re: marshal (was:Buffer interface in abstract.c? )
Greg Stein wrote:
>
> On Tue, 10 Aug 1999, Fredrik Lundh wrote:
> > maybe the unicode class shouldn't implement the
> > buffer interface at all? sure looks like the best way
>
> It is needed for fp.write(unicodeobj) ...
>
> It is also very handy for C functions to deal with Unicode strings.

Wouldn't a special C API be (even) more convenient ?

> > to avoid trivial mistakes (the current behaviour of
> > fp.write(unicodeobj) is even more serious than the
> > marshal glitch...)
>
> What's wrong with fp.write(unicodeobj)? It should write the unicode value
> to the file. Are you suggesting that it will need to be done differently?
> Icky.

Would this also write some kind of Unicode encoding header ?
[Sorry, this is my Unicode ignorance shining through... I only
remember lots of talk about these things on the string-sig.]

Since fp.write() uses "s#" this would use the getreadbuffer
slot in 1.5.2... I think what it *should* do is use the
getcharbuffer slot instead (see my other post), since dumping
the raw unicode data would lose too much information. Again,
such things should be handled by extra methods, e.g. fp.rawwrite().

Hmm, I guess the philosophy behind the interface is not
really clear. Binary data is fetched via getreadbuffer and then
interpreted as character data... I always thought that the
getcharbuffer should be used for such an interpretation.

Or maybe, we should dump the getcharbuffer slot again and
use the getreadbuffer information just as we would a
void* pointer in C: with no explicit or implicit type information.

--
Marc-Andre Lemburg
______________________________________________________________________
Y2000: 140 days left
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
Re: marshal (was:Buffer interface in abstract.c? )
M.-A. Lemburg wrote:
>
> Greg Stein wrote:
> >
> > On Tue, 10 Aug 1999, Fredrik Lundh wrote:
> > > maybe the unicode class shouldn't implement the
> > > buffer interface at all? sure looks like the best way
> >
> > It is needed for fp.write(unicodeobj) ...
> >
> > It is also very handy for C functions to deal with Unicode strings.
>
> Wouldn't a special C API be (even) more convenient ?

Why? Accessing the Unicode values as a series of bytes matches exactly
to the semantics of the buffer interface. Why throw in Yet Another
Function?

Your abstract.c functions make it quite simple.

> > > to avoid trivial mistakes (the current behaviour of
> > > fp.write(unicodeobj) is even more serious than the
> > > marshal glitch...)
> >
> > What's wrong with fp.write(unicodeobj)? It should write the unicode value
> > to the file. Are you suggesting that it will need to be done differently?
> > Icky.
>
> Would this also write some kind of Unicode encoding header ?
> [Sorry, this is my Unicode ignorance shining through... I only
> remember lots of talk about these things on the string-sig.]

Absolutely not. Placing the Byte Order Mark (BOM) into an output stream
is an application-level task. It should never be done by any subsystem.

There are no other "encoding headers" that would go into the output
stream. The output would simply be UTF-16 (2-byte values in host byte
order).
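
A sketch of that division of labour using the codecs of present-day Python 3 (an assumption; the thread concerns the separate unicode extension for 1.5.2). The raw host-order data carries no header, and the application prepends a BOM only if it wants one:

```python
import codecs
import sys

# Pick the codec that matches the host byte order.
native = "utf-16-le" if sys.byteorder == "little" else "utf-16-be"

raw = "foo".encode(native)            # 2-byte values, host order, no BOM
with_bom = codecs.BOM_UTF16 + raw     # the application adds the BOM itself
print(raw, with_bom)
```

A reader that is told the file is host-order can decode raw directly; a reader given with_bom can detect the byte order from the first two bytes.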

> Since fp.write() uses "s#" this would use the getreadbuffer
> slot in 1.5.2... I think what it *should* do is use the
> getcharbuffer slot instead (see my other post), since dumping
> the raw unicode data would lose too much information. Again,

I very much disagree. To me, fp.write() is not about writing characters
to a stream. I think it makes much more sense as "writing bytes to a
stream" and the buffer interface fits that perfectly.

There is no loss of data. You could argue that the byte order is lost,
but I think that is incorrect. The application defines the semantics:
the file might be defined as using host-order, or the application may be
writing a BOM at the head of the file.

> such things should be handled by extra methods, e.g. fp.rawwrite().

I believe this would be a needless complication of the interface.

> Hmm, I guess the philosophy behind the interface is not
> really clear.

I didn't design or implement it initially, but (as you may have guessed)
I am a proponent of its existence.

> Binary data is fetched via getreadbuffer and then
> interpreted as character data... I always thought that the
> getcharbuffer should be used for such an interpretation.

The former is bad behavior. That is why getcharbuffer was added (by me,
for 1.5.2). It was a preventative measure for the introduction of
Unicode strings. Using getreadbuffer for characters would break badly
given a Unicode string. Therefore, "clients" that want (8-bit)
characters from an object supporting the buffer interface should use
getcharbuffer. The Unicode object doesn't implement it, implying that it
cannot provide 8-bit characters. You can get the raw bytes thru
getreadbuffer.
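
For contrast, present-day Python 3 draws the line at the type level: text strings expose no buffer at all, and only byte-oriented objects do. The sketch below shows that later behaviour, which is not the 1.5.2 design described above; accepts_as_bytes is a made-up helper name.

```python
import io

def accepts_as_bytes(obj):
    """Report whether obj exposes raw bytes through the buffer protocol."""
    try:
        memoryview(obj)
    except TypeError:
        return False
    return True

bytes_ok = accepts_as_bytes(b"foo")   # byte string: raw bytes available
text_ok = accepts_as_bytes("foo")     # text string: no buffer exposed

sink = io.BytesIO()
try:
    sink.write("foo")                 # a binary stream refuses text
    wrote_text = True
except TypeError:
    wrote_text = False
print(bytes_ok, text_ok, wrote_text)
```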

> Or maybe, we should dump the getcharbuffer slot again and
> use the getreadbuffer information just as we would a
> void* pointer in C: with no explicit or implicit type information.

Nope. That path is fraught with failure :-)

Cheers,
-g

--
Greg Stein, http://www.lyra.org/
Re: marshal (was:Buffer interface in abstract.c? )
Greg Stein wrote:
>
> M.-A. Lemburg wrote:
> >
> > Greg Stein wrote:
> > >
> > > On Tue, 10 Aug 1999, Fredrik Lundh wrote:
> > > > maybe the unicode class shouldn't implement the
> > > > buffer interface at all? sure looks like the best way
> > >
> > > It is needed for fp.write(unicodeobj) ...
> > >
> > > It is also very handy for C functions to deal with Unicode strings.
> >
> > Wouldn't a special C API be (even) more convenient ?
>
> Why? Accessing the Unicode values as a series of bytes matches exactly
> to the semantics of the buffer interface. Why throw in Yet Another
> Function?

I meant PyUnicode_* style APIs for dealing with all the aspects
of Unicode objects -- much like the PyString_* APIs available.

> Your abstract.c functions make it quite simple.

BTW, do we need an extra set of those with buffer index or not ?
Those would really be one-liners for the sake of hiding the
type slots from applications.

> > > > to avoid trivial mistakes (the current behaviour of
> > > > fp.write(unicodeobj) is even more serious than the
> > > > marshal glitch...)
> > >
> > > What's wrong with fp.write(unicodeobj)? It should write the unicode value
> > > to the file. Are you suggesting that it will need to be done differently?
> > > Icky.
> >
> > Would this also write some kind of Unicode encoding header ?
> > [Sorry, this is my Unicode ignorance shining through... I only
> > remember lots of talk about these things on the string-sig.]
>
> Absolutely not. Placing the Byte Order Mark (BOM) into an output stream
> is an application-level task. It should never be done by any subsystem.
>
> There are no other "encoding headers" that would go into the output
> stream. The output would simply be UTF-16 (2-byte values in host byte
> order).

Ok.

> > Since fp.write() uses "s#" this would use the getreadbuffer
> > slot in 1.5.2... I think what it *should* do is use the
> > getcharbuffer slot instead (see my other post), since dumping
> > the raw unicode data would lose too much information. Again,
>
> I very much disagree. To me, fp.write() is not about writing characters
> to a stream. I think it makes much more sense as "writing bytes to a
> stream" and the buffer interface fits that perfectly.

This is perfectly ok, but shouldn't the behaviour of fp.write()
mimic that of previous Python versions ? How does JPython
write the data ?

Inlined different subject:
I think the internal semantics of "s#" using the getreadbuffer slot
and "t#" the getcharbuffer slot should be switched; see my other post.
In previous Python versions "s#" had the semantics of string data
with possibly embedded NULL bytes. Now it suddenly has the meaning
of binary data and you can't simply change extensions to use the
new "t#" because people are still using them with older Python
versions.

> There is no loss of data. You could argue that the byte order is lost,
> but I think that is incorrect. The application defines the semantics:
> the file might be defined as using host-order, or the application may be
> writing a BOM at the head of the file.

The problem here is that many applications were not written
to handle these kinds of objects. Previously they could only
handle strings, now they can suddenly handle any object
having the buffer interface and then fail when the data
gets read back in.

> > such things should be handled by extra methods, e.g. fp.rawwrite().
>
> I believe this would be a needless complication of the interface.

It would clarify things and make the interface 100% backward
compatible again.

> > Hmm, I guess the philosophy behind the interface is not
> > really clear.
>
> I didn't design or implement it initially, but (as you may have guessed)
> I am a proponent of its existence.
>
> > Binary data is fetched via getreadbuffer and then
> > interpreted as character data... I always thought that the
> > getcharbuffer should be used for such an interpretation.
>
> The former is bad behavior. That is why getcharbuffer was added (by me,
> for 1.5.2). It was a preventative measure for the introduction of
> Unicode strings. Using getreadbuffer for characters would break badly
> given a Unicode string. Therefore, "clients" that want (8-bit)
> characters from an object supporting the buffer interface should use
> getcharbuffer. The Unicode object doesn't implement it, implying that it
> cannot provide 8-bit characters. You can get the raw bytes thru
> getreadbuffer.

I agree 100%, but why did you add "t#" instead of having
"s#" use the getcharbuffer interface ? E.g. my mxTextTools
package uses "s#" on many APIs. Now someone could stick
in a Unicode object and get pretty strange results without
any notice about mxTextTools and Unicode being incompatible.
You could argue that I should change to "t#", but that doesn't
work since many people out there still use Python versions
<1.5.2 and those didn't have "t#", so mxTextTools would then
fail completely for them.

--
Marc-Andre Lemburg
______________________________________________________________________
Y2000: 139 days left
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
Re: marshal (was:Buffer interface in abstract.c? )
M.-A. Lemburg <mal@lemburg.com> wrote:
> I meant PyUnicode_* style APIs for dealing with all the aspects
> of Unicode objects -- much like the PyString_* APIs available.

it's already there, of course. see unicode.h
in the unicode distribution (Mark is hopefully
adding this to 1.6 at this very moment...)

> > I very much disagree. To me, fp.write() is not about writing characters
> > to a stream. I think it makes much more sense as "writing bytes to a
> > stream" and the buffer interface fits that perfectly.
>
> This is perfectly ok, but shouldn't the behaviour of fp.write()
> mimic that of previous Python versions ? How does JPython
> write the data ?

the crucial point is how an average user expects things
to work. the current design is quite asymmetric -- you
can easily *write* things that implement the buffer
interface to a stream, but how the heck do you get them
back?

(as illustrated by the marshal buglet...)

</F>
Re: marshal (was:Buffer interface in abstract.c? )
Fredrik Lundh wrote:
>...
> besides, what about buffers and threads? if you
> return a pointer from getreadbuf, wouldn't it be
> good to know exactly when Python doesn't need
> that pointer any more? explicit initbuffer/exitbuffer
> calls around each sequence of buffer operations
> would make that a lot safer...

This is a pretty obvious one, I think: it lasts only as long as the
object. PyString_AS_STRING is similar. Nothing new or funny here.

Cheers,
-g

--
Greg Stein, http://www.lyra.org/
