Mailing List Archive

python/dist/src/Objects unicodeobject.c,2.139,2.140
Update of /cvsroot/python/python/dist/src/Objects
In directory usw-pr-cvs1:/tmp/cvs-serv30961

Modified Files:
unicodeobject.c
Log Message:
Patch #495401: Count number of required bytes for encoding UTF-8 before
allocating the target buffer.


Index: unicodeobject.c
===================================================================
RCS file: /cvsroot/python/python/dist/src/Objects/unicodeobject.c,v
retrieving revision 2.139
retrieving revision 2.140
diff -C2 -d -r2.139 -r2.140
*** unicodeobject.c 15 Apr 2002 18:42:15 -0000 2.139
--- unicodeobject.c 20 Apr 2002 13:44:01 -0000 2.140
***************
*** 1173,1182 ****
#endif

- /* Allocation strategy: we default to Latin-1, then do one resize
- whenever we hit an order boundary. The assumption is that
- characters from higher orders usually occur often enough to warrant
- this.
- */
-
PyObject *PyUnicode_EncodeUTF8(const Py_UNICODE *s,
int size,
--- 1173,1176 ----
***************
*** 1185,1211 ****
PyObject *v;
char *p;
! int i = 0;
! int overalloc = 2;
! int len;
!
/* Short-cut for empty strings */
if (size == 0)
return PyString_FromStringAndSize(NULL, 0);

! v = PyString_FromStringAndSize(NULL, overalloc * size);
if (v == NULL)
return NULL;

p = PyString_AS_STRING(v);
!
! while (i < size) {
Py_UCS4 ch = s[i++];

! if (ch < 0x80)
! /* Encode ASCII */
*p++ = (char) ch;

else if (ch < 0x0800) {
- /* Encode Latin-1 */
*p++ = (char)(0xc0 | (ch >> 6));
*p++ = (char)(0x80 | (ch & 0x3f));
--- 1179,1221 ----
PyObject *v;
char *p;
! unsigned int allocated = 0;
! int i;
!
/* Short-cut for empty strings */
if (size == 0)
return PyString_FromStringAndSize(NULL, 0);

! for (i = 0; i < size; ) {
! Py_UCS4 ch = s[i++];
! if (ch < 0x80)
! allocated += 1;
! else if (ch < 0x0800)
! allocated += 2;
! else if (ch < 0x10000) {
! /* Check for high surrogate */
! if (0xD800 <= ch && ch <= 0xDBFF &&
! i != size &&
! 0xDC00 <= s[i] && s[i] <= 0xDFFF) {
! allocated += 1;
! i++;
! }
! allocated += 3;
! } else
! allocated += 4;
! }
!
! v = PyString_FromStringAndSize(NULL, allocated);
if (v == NULL)
return NULL;

p = PyString_AS_STRING(v);
! for (i = 0; i < size; ) {
Py_UCS4 ch = s[i++];

! if (ch < 0x80) {
*p++ = (char) ch;
+ }

else if (ch < 0x0800) {
*p++ = (char)(0xc0 | (ch >> 6));
*p++ = (char)(0x80 | (ch & 0x3f));
***************
*** 1213,1268 ****

else {
! /* Encode UCS2 Unicode ordinals */
if (ch < 0x10000) {
!
! /* Special case: check for high surrogate */
if (0xD800 <= ch && ch <= 0xDBFF && i != size) {
Py_UCS4 ch2 = s[i];
! /* Check for low surrogate and combine the two to
! form a UCS4 value */
if (0xDC00 <= ch2 && ch2 <= 0xDFFF) {
! ch = ((ch - 0xD800) << 10 | (ch2 - 0xDC00)) + 0x10000;
! i++;
! goto encodeUCS4;
}
/* Fall through: handles isolated high surrogates */
}
-
- if (overalloc < 3) {
- len = (int)(p - PyString_AS_STRING(v));
- overalloc = 3;
- if (_PyString_Resize(&v, overalloc * size))
- goto onError;
- p = PyString_AS_STRING(v) + len;
- }
*p++ = (char)(0xe0 | (ch >> 12));
*p++ = (char)(0x80 | ((ch >> 6) & 0x3f));
*p++ = (char)(0x80 | (ch & 0x3f));
! continue;
! }
!
! /* Encode UCS4 Unicode ordinals */
! encodeUCS4:
! if (overalloc < 4) {
! len = (int)(p - PyString_AS_STRING(v));
! overalloc = 4;
! if (_PyString_Resize(&v, overalloc * size))
! goto onError;
! p = PyString_AS_STRING(v) + len;
}
- *p++ = (char)(0xf0 | (ch >> 18));
- *p++ = (char)(0x80 | ((ch >> 12) & 0x3f));
- *p++ = (char)(0x80 | ((ch >> 6) & 0x3f));
- *p++ = (char)(0x80 | (ch & 0x3f));
}
}
! *p = '\0';
! if (_PyString_Resize(&v, (int)(p - PyString_AS_STRING(v))))
! goto onError;
return v;
-
- onError:
- Py_DECREF(v);
- return NULL;
}

--- 1223,1257 ----

else {
!
if (ch < 0x10000) {
! /* Check for high surrogate */
if (0xD800 <= ch && ch <= 0xDBFF && i != size) {
Py_UCS4 ch2 = s[i];
! /* Check for low surrogate */
if (0xDC00 <= ch2 && ch2 <= 0xDFFF) {
! ch = ((ch - 0xD800)<<10 | (ch2-0xDC00))+0x10000;
! *p++ = (char)((ch >> 18) | 0xf0);
! *p++ = (char)(0x80 | ((ch >> 12) & 0x3f));
! *p++ = (char)(0x80 | ((ch >> 6) & 0x3f));
! *p++ = (char)(0x80 | (ch & 0x3f));
! i++;
! continue;
}
/* Fall through: handles isolated high surrogates */
}
*p++ = (char)(0xe0 | (ch >> 12));
*p++ = (char)(0x80 | ((ch >> 6) & 0x3f));
*p++ = (char)(0x80 | (ch & 0x3f));
!
! } else {
! *p++ = (char)(0xf0 | (ch>>18));
! *p++ = (char)(0x80 | ((ch>>12) & 0x3f));
! *p++ = (char)(0x80 | ((ch>>6) & 0x3f));
! *p++ = (char)(0x80 | (ch & 0x3f));
}
}
}
! assert(p - PyString_AS_STRING(v) == allocated);
return v;
}
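The counting pass above is compact but easy to misread in diff form. Here is a
minimal pure-Python sketch of the same byte-counting logic (a hypothetical
helper, not part of the patch), assuming the input is a sequence of UTF-16
code units as on a UCS-2 build:

    def utf8_length(units):
        """Bytes needed to UTF-8 encode a sequence of UTF-16 code units,
        mirroring the counting pass introduced in revision 2.140."""
        allocated = 0
        i = 0
        size = len(units)
        while i < size:
            ch = units[i]
            i += 1
            if ch < 0x80:
                allocated += 1            # ASCII: 1 byte
            elif ch < 0x0800:
                allocated += 2            # 2-byte sequence
            elif ch < 0x10000:
                # A valid surrogate pair costs 4 bytes in total: 1 added
                # here plus the 3 added below; the low half is skipped.
                if (0xD800 <= ch <= 0xDBFF and i != size
                        and 0xDC00 <= units[i] <= 0xDFFF):
                    allocated += 1
                    i += 1
                allocated += 3            # 3-byte sequence (or lone surrogate)
            else:
                allocated += 4            # 4-byte sequence (UCS-4 build)
        return allocated

    # For example, the surrogate pair for U+1F600 counts as 4 bytes:
    assert utf8_length([0xD83D, 0xDE00]) == 4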
Re: python/dist/src/Objects unicodeobject.c,2.139,2.140
loewis@sourceforge.net wrote:
>
> Update of /cvsroot/python/python/dist/src/Objects
> In directory usw-pr-cvs1:/tmp/cvs-serv30961
>
> Modified Files:
> unicodeobject.c
> Log Message:
> Patch #495401: Count number of required bytes for encoding UTF-8 before
> allocating the target buffer.

Martin, please back out this change again. We have discussed this
quite a few times, and I am against your strategy: it introduces
a performance hit that is not justified by the advantage of
(temporarily) using less memory.

Your own timings show this as well, so I wonder why you checked
the patch in. From the patch log:
"""
For the current
CVS (unicodeobject.c 2.136: MAL's change to use a variable
overalloc), I get

10 spaces 20.060
100 spaces 2.600
200 spaces 2.030
1000 spaces 0.930
10000 spaces 0.690
10 spaces, 3 bytes 23.520
100 spaces, 3 bytes 3.730
200 spaces, 3 bytes 2.470
1000 spaces, 3 bytes 0.980
10000 spaces, 3 bytes 0.690
30 bytes 24.800
300 bytes 5.220
600 bytes 3.830
3000 bytes 2.480
30000 bytes 2.230

With unicode3.diff (that's the one you checked in), I get

10 spaces 19.940
100 spaces 3.260
200 spaces 2.340
1000 spaces 1.650
10000 spaces 1.450
10 spaces, 3 bytes 21.420
100 spaces, 3 bytes 3.410
200 spaces, 3 bytes 2.420
1000 spaces, 3 bytes 1.660
10000 spaces, 3 bytes 1.450
30 bytes 22.260
300 bytes 5.830
600 bytes 4.700
3000 bytes 3.740
30000 bytes 3.540
"""

The only case where your patch is faster is for very short
strings, and then only by a few percent, whereas for all
longer strings you get worse timings, e.g. 3.74 seconds
compared to 2.48 seconds -- that's a 50% increase in
run-time!

Thanks,
--
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Company & Consulting: http://www.egenix.com/
Python Software: http://www.egenix.com/files/python/
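[Editor's note: the timing program behind the numbers quoted above is not
reproduced in this thread. A hedged sketch of how such a benchmark could be
reconstructed follows; the case strings are guesses from the labels, and the
original script may well have differed:]

    import timeit

    # Hypothetical reconstruction of the benchmark quoted from patch
    # #495401; labels like "10 spaces, 3 bytes" are read here as N ASCII
    # spaces plus one character that encodes to 3 UTF-8 bytes.
    cases = {
        "10 spaces": " " * 10,
        "10000 spaces": " " * 10000,
        "10 spaces, 3 bytes": " " * 10 + "\u20ac",  # U+20AC: 3 bytes in UTF-8
        "30 bytes": "\u20ac" * 10,                  # 10 chars x 3 bytes each
    }

    for label, s in cases.items():
        seconds = timeit.timeit(lambda: s.encode("utf-8"), number=100000)
        print("%-22s %.3f" % (label, seconds))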
Re: python/dist/src/Objects unicodeobject.c,2.139,2.140
> The only case where your patch is faster is for very short
> strings and then only by a few percent, whereas for all
> longer strings you get worse timings, e.g. 3.74 seconds
> compared to 2.48 seconds -- that's a 50% increase in
> run-time !

First decide what's worse -- overallocating memory or slowing down.
This is not at all clear! If the normal use case is that strings to
be encoded are significantly smaller than memory, overallocating is
worth it. If we expect this to happen for strings close to the VM
size, overallocating may cause problems. Does Linux still have the
problem that its malloc() will let you allocate more memory than the
system has available, and then crash hard when you try to touch all of
it?

--Guido van Rossum (home page: http://www.python.org/~guido/)
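[Editor's note: Guido's overcommit question can be demonstrated on a Linux
box in its default heuristic overcommit mode. The demo below is hypothetical
and not from this thread; the exact behavior depends on vm.overcommit_memory
and on available swap:]

    import mmap

    # Hypothetical Linux overcommit demo: the anonymous mapping may succeed
    # even when the machine lacks 64 GiB of RAM plus swap; the process is
    # typically killed by the OOM killer only once the pages are touched.
    SIZE = 64 * 1024 ** 3  # 64 GiB

    m = mmap.mmap(-1, SIZE)                  # "allocation" usually succeeds
    for offset in range(0, SIZE, mmap.PAGESIZE):
        m[offset] = 1                        # faulting pages in may crash hard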
RE: python/dist/src/Objects unicodeobject.c,2.139,2.140
[Guido]
> First decide what's worse -- overallocating memory or slowing down.
> This is not at all clear! If the normal use case is that strings to
> be encoded are significantly smaller than memory, overallocating is
> worth it. If we expect this to happen for strings close to the VM
> size, overallocating may cause problems.

I'd be surprised if people are slinging individual multi-hundred megabyte
Unicode strings. Martin's timing program went up to 10K characters/string
max. Then again, I'm surprised when anyone slings a Unicode string,
regardless of size <wink>.

> Does Linux still have the problem that its malloc() will let you
> allocate more memory than the system has available, and then crash
> hard when you try to touch all of it?

Apparently so, and apparently the memory characteristics of popular
applications running on large servers are such that Linux wouldn't be usable
in that market without overcommitment. Or so some people say. A Google
search turns up many inflamed arguments.