Mailing List Archive

Unicode mapping tables
I am just coding the translate method for Unicode objects and
have come across a design question that may have some importance
with respect to speed and memory allocation size.

Currently, mapping tables map characters to Unicode characters
and vice-versa. Now the .translate method will use a different
kind of table: mapping integer ordinals to integer ordinals.

Question: What is more efficient: having lots of integers
in a dictionary or lots of characters ?

Another aspect of this question is: the translate method
will be able to handle sequences *and* mappings because it
looks up integers which can be interpreted as indexes as well
as dictionary keys. The character mapping codec uses characters
as key and thus only allows dictionaries to be used (the reason
is that in some future version it should be possible to
map single characters to multiple characters or even combinations
to new combinations).
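
A minimal sketch of why integer keys allow both kinds of table. It uses
str.translate from present-day Python 3 as a stand-in (an assumption: the
method being designed in this thread may differ in detail), and the table
names are made up for the example.

```python
# Both tables map integer ordinals to integer ordinals.
# A dict is looked up by key, a list by index; translate() only needs table[ordinal].
dict_table = {ord("a"): ord("A"), ord("b"): ord("B")}

# A list covering ordinals 0..255: identity everywhere except 'a' and 'b'.
list_table = list(range(256))
list_table[ord("a")] = ord("A")
list_table[ord("b")] = ord("B")

text = "abcabc"
via_dict = text.translate(dict_table)
via_list = text.translate(list_table)
print(via_dict, via_list)
```

The list needs 256 entries to change two characters, while the dict needs
only two, so the choice trades memory against lookup cost.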

BTW, I dropped the deletions argument from the translate method:
it is not needed, since a mapping to None will have the same effect.
Note that not specifying a mapping causes the characters to be
copied as-is. This has the nice side-effect of greatly reducing
the mapping table's size.
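
A small sketch of the behaviour described above, again assuming that
present-day Python 3's str.translate is a fair stand-in for the method
under discussion: a mapping to None deletes, and characters without an
entry are copied unchanged.

```python
# Ordinal-to-ordinal table; a value of None deletes the character.
table = {
    ord("a"): ord("A"),  # replace 'a' with 'A'
    ord("-"): None,      # delete hyphens
}

# Characters without an entry ('b', 'c', ' ') are copied as-is,
# so the table only needs entries for characters that change.
result = "a-b-c a".translate(table)
print(result)
```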

Note that there will be no .maketrans() method. The same functionality
can easily be coded in Python if needed and doesn't fit into the
OO-style nature of string and Unicode objects anymore.
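
One way the maketrans functionality might be coded in Python, as a sketch
only. The helper name make_table and its delete parameter are hypothetical
and not part of any API mentioned in this thread; the result is used with
Python 3's str.translate.

```python
def make_table(source, target, delete=""):
    """Build an ordinal-to-ordinal translation table (hypothetical helper)."""
    if len(source) != len(target):
        raise ValueError("source and target must have the same length")
    # Pair up characters position by position and store their ordinals.
    table = {ord(s): ord(t) for s, t in zip(source, target)}
    # Characters to delete map to None.
    table.update({ord(d): None for d in delete})
    return table


table = make_table("abc", "xyz", delete="!")
translated = "aabbcc!".translate(table)
print(translated)
```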

--

Something else that changed is the way .capitalize() works. The
Unicode version uses the Unicode algorithm for it (see TechRep. 13
on the www.unicode.org site). Here's the new doc string:

S.capitalize() -> unicode

Return a capitalized version of S, i.e. words start with title case
characters, all remaining cased characters have lower case.

Note that *all* characters are touched, not just the first one.
The change was needed to get it in sync with the .iscapitalized()
method which is based on the Unicode algorithm too.

Should this change be propagated to the string implementation ?

--
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
Re: Unicode mapping tables
"M.-A. Lemburg" wrote:
>
> I am just coding the translate method for Unicode objects and
> have come across a design question that may have some importance
> with respect to speed and memory allocation size.
>
> Currently, mapping tables map characters to Unicode characters
> and vice-versa. Now the .translate method will use a different
> kind of table: mapping integer ordinals to integer ordinals.
>
> Question: What is more efficient: having lots of integers
> in a dictionary or lots of characters ?

Turns out that integers are more flexible after some tests...
I'll stick with them :-)

Perhaps we could bump the small int optimization limit to
256 (it is currently set to 100) ?! This would be ideal for
these tables, since then at least most of the keys would
be shared between tables.

--
Marc-Andre Lemburg
RE: Unicode mapping tables
[M.-A. Lemburg]
> ...
> Currently, mapping tables map characters to Unicode characters
> and vice-versa. Now the .translate method will use a different
> kind of table: mapping integer ordinals to integer ordinals.

You mean that if I want to map u"a" to u"A", I have to set up some sort of
dict mapping ord(u"a") to ord(u"A")? I simply couldn't follow this.

> Question: What is more efficient: having lots of integers
> in a dictionary or lots of characters ?

My bet is "lots of integers", to reduce both space use and comparison time.

> ...
> Something else that changed is the way .capitalize() works. The
> Unicode version uses the Unicode algorithm for it (see TechRep. 13
> on the www.unicode.org site).

#13 is "Unicode Newline Guidelines". I assume you meant #21 ("Case
Mappings").

> Here's the new doc string:
>
> S.capitalize() -> unicode
>
> Return a capitalized version of S, i.e. words start with title case
> characters, all remaining cased characters have lower case.
>
> Note that *all* characters are touched, not just the first one.
> The change was needed to get it in sync with the .iscapitalized()
> method which is based on the Unicode algorithm too.
>
> Should this change be propagated to the string implementation ?

Unicode makes distinctions among "upper case", "lower case" and "title
case", and you're trying to get away with a single "capitalize" function.
Java has separate toLowerCase, toUpperCase and toTitleCase methods, and
that's the way to do it. Whatever you do, leave .capitalize alone for 8-bit
strings -- there's no reason to break code that currently works.
"capitalize" seems a terrible choice of name for a titlecase method anyway,
because of its baggage connotations from 8-bit strings. Since this stuff is
complicated, I say it would be much better to use the same names for these
things as the Unicode and Java folk do: there's excellent documentation
elsewhere for all this stuff, and it's Bad to make users mentally translate
unique Python terminology to make sense of the official docs.

So my vote is: leave capitalize the hell alone <wink>. Do not implement
capitalize for Unicode strings. Introduce a new titlecase method for
Unicode strings. Add a new titlecase method to 8-bit strings too. Unicode
strings should also have methods to get at uppercase and lowercase (as
Unicode defines those).
Re: Unicode mapping tables
Tim Peters wrote:
>
> [M.-A. Lemburg]
> > ...
> > Currently, mapping tables map characters to Unicode characters
> > and vice-versa. Now the .translate method will use a different
> > kind of table: mapping integer ordinals to integer ordinals.
>
> You mean that if I want to map u"a" to u"A", I have to set up some sort of
> dict mapping ord(u"a") to ord(u"A")? I simply couldn't follow this.

I meant:

'a': u'A' vs. ord('a'): ord(u'A')

The latter wins ;-) Reasoning for the first was that it allows
character sequences to be handled by the same mapping algorithm.
I decided to leave those techniques to some future implementation,
since mapping integers has the nice side-effect of also allowing
sequences to be used as mapping tables... resulting in some
speedup at the cost of memory consumption.

BTW, there are now three different ways to do char translations:

1. char -> unicode (char mapping codec's decode)
2. unicode -> char (char mapping codec's encode)
3. unicode -> unicode (unicode's .translate() method)
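
The three directions listed above can be sketched with present-day
Python 3, where bytes play the role of 8-bit chars. The choice of the
cp437 codec and of the byte 0x81 is only for illustration.

```python
# 1. char -> unicode: decode an 8-bit code through a character mapping codec.
decoded = b"\x81".decode("cp437")

# 2. unicode -> char: encode the character back to its 8-bit code.
encoded = decoded.encode("cp437")

# 3. unicode -> unicode: translate with an ordinal-to-ordinal table.
translated = decoded.translate({ord(decoded): ord("u")})

print(repr(decoded), encoded, translated)
```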

> > Question: What is more efficient: having lots of integers
> > in a dictionary or lots of characters ?
>
> My bet is "lots of integers", to reduce both space use and comparison time.

Right. That's what I found too... it's "lots of integers" now :-)

> > ...
> > Something else that changed is the way .capitalize() works. The
> > Unicode version uses the Unicode algorithm for it (see TechRep. 13
> > on the www.unicode.org site).
>
> #13 is "Unicode Newline Guidelines". I assume you meant #21 ("Case
> Mappings").

Dang. You're right. Here's the URL in case someone
wants to join in:

http://www.unicode.org/unicode/reports/tr21/tr21-2.html

> > Here's the new doc string:
> >
> > S.capitalize() -> unicode
> >
> > Return a capitalized version of S, i.e. words start with title case
> > characters, all remaining cased characters have lower case.
> >
> > Note that *all* characters are touched, not just the first one.
> > The change was needed to get it in sync with the .iscapitalized()
> > method which is based on the Unicode algorithm too.
> >
> > Should this change be propagated to the string implementation ?
>
> Unicode makes distinctions among "upper case", "lower case" and "title
> case", and you're trying to get away with a single "capitalize" function.
> Java has separate toLowerCase, toUpperCase and toTitleCase methods, and
> that's the way to do it.

The Unicode implementation has the corresponding:

.upper(), .lower() and .capitalize()

They work just like .toUpperCase, .toLowerCase, .toTitleCase
resp. (well at least they should ;).

> Whatever you do, leave .capitalize alone for 8-bit
> strings -- there's no reason to break code that currently works.
> "capitalize" seems a terrible choice of name for a titlecase method anyway,
> because of its baggage connotations from 8-bit strings. Since this stuff is
> complicated, I say it would be much better to use the same names for these
> things as the Unicode and Java folk do: there's excellent documentation
> elsewhere for all this stuff, and it's Bad to make users mentally translate
> unique Python terminology to make sense of the official docs.

Hmm, that's an argument but it breaks the current method
naming scheme of all-lowercase letters. Perhaps I should simply
provide a new method for .toTitleCase(), e.g. .title(), and
leave the previous definition of .capitalize() intact...

> So my vote is: leave capitalize the hell alone <wink>. Do not implement
> capitalize for Unicode strings. Introduce a new titlecase method for
> Unicode strings. Add a new titlecase method to 8-bit strings too. Unicode
> strings should also have methods to get at uppercase and lowercase (as
> Unicode defines those).

...looks like we're more or less on the same wavelength here ;-)

Here's what I'll do:

* implement .capitalize() in the traditional way for Unicode
objects (simply convert the first char to uppercase)
* implement u.title() to mean the same as Java's toTitleCase()
* don't implement s.title(): the reasoning here is that it would
confuse the user when she gets different return values for
the same string (titlecase chars usually live in higher Unicode
code ranges not reachable in Latin-1)
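
A sketch of the difference between the two methods in the plan above,
using present-day Python 3. The helper capitalize_first is hypothetical
and follows the description given here (only the first character is
converted); the built-in str.capitalize of current Python also lowercases
the rest of the string, so it is not used for the comparison.

```python
def capitalize_first(s):
    """Traditional capitalize as described above: uppercase only the first character."""
    return s[:1].upper() + s[1:]


sample = "hello wORLD"
first_only = capitalize_first(sample)  # the rest of the string is untouched
titled = sample.title()                # each word: first char title case, rest lower case

# U+01C6 (the digraph 'dz with caron') has distinct upper case and title case forms.
dz = "\u01c6"
dz_upper = dz.upper()  # U+01C4, both letters capital
dz_title = dz.title()  # U+01C5, only the first letter capital

print(first_only, "|", titled, "|", hex(ord(dz_upper)), hex(ord(dz_title)))
```

The digraph shows why title case is a separate mapping and cannot be
derived from the upper case mapping alone.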

Thanks for the feedback,
--
Marc-Andre Lemburg
Re: Unicode mapping tables
> Here's what I'll do:
>
> * implement .capitalize() in the traditional way for Unicode
> objects (simply convert the first char to uppercase)
> * implement u.title() to mean the same as Java's toTitleCase()
> * don't implement s.title(): the reasoning here is that it would
> confuse the user when she gets different return values for
> the same string (titlecase chars usually live in higher Unicode
> code ranges not reachable in Latin-1)

Huh? For ASCII at least, titlecase seems to map to ASCII; in your
current implementation, only two Latin-1 characters (u'\265' and
u'\377', I have no easy way to show them in Latin-1) map outside the
Latin-1 range.
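
The claim can be checked with a short sketch in present-day Python 3. This
relies on the Unicode database shipped with the interpreter, which is an
assumption: it is much newer than the implementation discussed here, so
the result is indicative only. With a recent database it should report
0xb5 and 0xff, the two characters named above.

```python
# Collect Latin-1 characters whose title-cased form contains
# a character above U+00FF, i.e. outside the Latin-1 range.
outside = [
    ch for ch in map(chr, range(256))
    if any(ord(c) > 0xFF for c in ch.title())
]
print([hex(ord(ch)) for ch in outside])
```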

Anyway, I would suggest to add a title() call to 8-bit strings as
well; then we can do away with string.capwords(), which does something
similar but different, mostly by accident.
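
The "similar but different" behaviour can be seen in present-day Python 3,
where both string.capwords() and a title() method exist. A small sketch;
the sample text is made up.

```python
import string

text = "they're   here"

# capwords() splits on whitespace, capitalizes each word,
# and joins the words with a single space.
with_capwords = string.capwords(text)

# title() treats every uncased character (including the apostrophe)
# as a word boundary and keeps the original spacing.
with_title = text.title()

print(repr(with_capwords), repr(with_title))
```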

--Guido van Rossum (home page: http://www.python.org/~guido/)
Re: Unicode mapping tables
Guido van Rossum wrote:
>
> > Here's what I'll do:
> >
> > * implement .capitalize() in the traditional way for Unicode
> > objects (simply convert the first char to uppercase)
> > * implement u.title() to mean the same as Java's toTitleCase()
> > * don't implement s.title(): the reasoning here is that it would
> > confuse the user when she gets different return values for
> > the same string (titlecase chars usually live in higher Unicode
> > code ranges not reachable in Latin-1)
>
> Huh? For ASCII at least, titlecase seems to map to ASCII; in your
> current implementation, only two Latin-1 characters (u'\265' and
> u'\377', I have no easy way to show them in Latin-1) map outside the
> Latin-1 range.

You're right, sorry for the confusion. I was thinking of other
encodings like e.g. cp437 which have corresponding characters
in the higher Unicode ranges.

> Anyway, I would suggest to add a title() call to 8-bit strings as
> well; then we can do away with string.capwords(), which does something
> similar but different, mostly by accident.

Ok, I'll do it this way then: s.title() will use C's toupper() and
tolower() for case mapping and u.title() the Unicode routines.

This will be in sync with the rest of the 8-bit string world
(which is locale aware on many platforms AFAIK), even though
it might not return the same string as the corresponding
u.title() call.
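
A sketch of how the two title() results can differ, using Python 3 bytes
as a stand-in for 8-bit strings. This is an assumption for illustration:
bytes.title() applies ASCII case rules only, whereas the 8-bit
implementation described above would use the locale-aware C functions.

```python
word = "\u00e9cole"               # 'ecole' with an acute accent on the first letter
as_8bit = word.encode("latin-1")  # b'\xe9cole'

# bytes.title() only knows ASCII case rules, so 0xE9 is treated as uncased
# and the following 'c' is taken as the start of a word.
titled_8bit = as_8bit.title()

# str.title() uses the Unicode case mappings.
titled_unicode = word.title()

same = titled_8bit == titled_unicode.encode("latin-1")
print(titled_8bit, repr(titled_unicode), same)
```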

u.capwords() will be disabled in the Unicode implementation...
it wasn't even implemented for the string implementation,
so there's no breakage ;-)

--
Marc-Andre Lemburg
RE: Unicode mapping tables
[M.-A. Lemburg]
> ...
> Here's what I'll do:
>
> * implement .capitalize() in the traditional way for Unicode
> objects (simply convert the first char to uppercase)

Given .title(), is .capitalize() of use for Unicode strings? Or is it just
a temptation to do something senseless in the Unicode world? If it doesn't
make sense, leave it out (this *seems* like compulsion <wink> to implement
all current string methods in *some* way for Unicode, whether or not they
make sense).
Re: Unicode mapping tables
> [M.-A. Lemburg]
> > ...
> > Here's what I'll do:
> >
> > * implement .capitalize() in the traditional way for Unicode
> > objects (simply convert the first char to uppercase)

[Tim]
> Given .title(), is .capitalize() of use for Unicode strings? Or is it just
> a temptation to do something senseless in the Unicode world? If it doesn't
> make sense, leave it out (this *seems* like compulsion <wink> to implement
> all current string methods in *some* way for Unicode, whether or not they
> make sense).

The intention of this is to make code that does something using
strings do exactly the same thing if those strings happen to be
Unicode strings with the same values.

The capitalize method returns self[0].upper() + self[1:] -- that may
not make sense for e.g. Japanese, but it certainly does for Russian or
Greek.

It also does this in JPython.
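
A minimal sketch of that definition in present-day Python 3. The function
name capitalize_first is hypothetical, and s[:1] is used in place of s[0]
so that an empty string does not raise an error.

```python
def capitalize_first(s):
    """Uppercase the first character and leave the rest untouched."""
    return s[:1].upper() + s[1:]


russian = capitalize_first("москва")   # Cyrillic has upper and lower case
greek = capitalize_first("αθήνα")      # so does Greek
japanese = capitalize_first("東京")     # no case distinction: returned unchanged

print(russian, greek, japanese)
```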

--Guido van Rossum (home page: http://www.python.org/~guido/)
Re: Unicode mapping tables
Tim Peters wrote:
>
> [M.-A. Lemburg]
> > ...
> > Here's what I'll do:
> >
> > * implement .capitalize() in the traditional way for Unicode
> > objects (simply convert the first char to uppercase)
>
> Given .title(), is .capitalize() of use for Unicode strings? Or is it just
> a temptation to do something senseless in the Unicode world? If it doesn't
> make sense, leave it out (this *seems* like compulsion <wink> to implement
> all current string methods in *some* way for Unicode, whether or not they
> make sense).

.capitalize() only touches the first char of the string - not
sure whether it makes sense in both worlds ;-)

Anyhow, the difference is there but subtle: string.capitalize()
will use C's toupper() which is locale dependent, while
unicode.capitalize() uses Unicode's toTitleCase() for the first
character.

--
Marc-Andre Lemburg