Mailing List Archive

Unicode character property methods
As you may have noticed, the Unicode objects provide
new methods .islower(), .isupper() and .istitle(). Finn Bock
mentioned that Java also provides .isdigit() and .isspace().

Question: should Unicode also provide these character
property methods: .isdigit(), .isnumeric(), .isdecimal()
and .isspace() ? Plus maybe .digit(), .numeric() and
.decimal() for the corresponding decoding ?

Similar APIs are already available through the unicodedata
module, but could easily be moved to the Unicode object
(they cause the builtin interpreter to grow a bit in size
due to the new mapping tables).

BTW, string.atoi et al. are currently not mapped to
string methods... should they be ?

--
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
Re: Unicode character property methods
> As you may have noticed, the Unicode objects provide
> new methods .islower(), .isupper() and .istitle(). Finn Bock
> mentioned that Java also provides .isdigit() and .isspace().
>
> Question: should Unicode also provide these character
> property methods: .isdigit(), .isnumeric(), .isdecimal()
> and .isspace() ? Plus maybe .digit(), .numeric() and
> .decimal() for the corresponding decoding ?

What would be the difference between isdigit, isnumeric, isdecimal?
I'd say don't do more than Java. I don't understand what the
"corresponding decoding" refers to. What would "3".decimal() return?

> Similar APIs are already available through the unicodedata
> module, but could easily be moved to the Unicode object
> (they cause the builtin interpreter to grow a bit in size
> due to the new mapping tables).
>
> BTW, string.atoi et al. are currently not mapped to
> string methods... should they be ?

They are mapped to int() c.s.

--Guido van Rossum (home page: http://www.python.org/~guido/)
Re: Unicode character property methods
Guido van Rossum wrote:
>
> > As you may have noticed, the Unicode objects provide
> > new methods .islower(), .isupper() and .istitle(). Finn Bock
> > mentioned that Java also provides .isdigit() and .isspace().
> >
> > Question: should Unicode also provide these character
> > property methods: .isdigit(), .isnumeric(), .isdecimal()
> > and .isspace() ? Plus maybe .digit(), .numeric() and
> > .decimal() for the corresponding decoding ?
>
> What would be the difference between isdigit, isnumeric, isdecimal?
> I'd say don't do more than Java. I don't understand what the
> "corresponding decoding" refers to. What would "3".decimal() return?

These originate in the Unicode database; see

ftp://ftp.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.html

Here are the descriptions:

"""
Field 6: Decimal digit value (normative)
    This is a numeric field. If the character has the decimal
    digit property, as specified in Chapter 4 of the Unicode
    Standard, the value of that digit is represented with an
    integer value in this field.

Field 7: Digit value (normative)
    This is a numeric field. If the character represents a
    digit, not necessarily a decimal digit, the value is here.
    This covers digits which do not form decimal radix forms,
    such as the compatibility superscript digits.

Field 8: Numeric value (normative)
    This is a numeric field. If the character has the numeric
    property, as specified in Chapter 4 of the Unicode Standard,
    the value of that character is represented with an integer
    or rational number in this field. This includes fractions
    as, e.g., "1/5" for U+2155 VULGAR FRACTION ONE FIFTH. Also
    included are numerical values for compatibility characters
    such as circled numbers.
"""

u"3".decimal() would return 3; u"\u2155".numeric() would return 0.2.

Some more examples from the unicodedata module (which makes
all fields of the database available in Python):

>>> import unicodedata
>>> unicodedata.decimal(u"3")
3
>>> unicodedata.decimal(u"²")
2
>>> unicodedata.digit(u"²")
2
>>> unicodedata.numeric(u"²")
2.0
>>> unicodedata.numeric(u"\u2155")
0.2
>>> unicodedata.numeric(u'\u215b')
0.125
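
On a current Python 3 the same three lookups can be tried as a small
script. This is only a sketch and not the implementation discussed in
this thread: the helper name lookup() is invented for illustration, the
optional default argument is used so that a missing property yields None
instead of raising ValueError, and the values come from whatever Unicode
database version ships with the interpreter, so they need not match the
session above in every case.

```python
import unicodedata

def lookup(ch):
    """Return (decimal, digit, numeric) for one character.

    A field that is not set for the character is reported as None.
    """
    return (
        unicodedata.decimal(ch, None),
        unicodedata.digit(ch, None),
        unicodedata.numeric(ch, None),
    )

# A plain digit, SUPERSCRIPT TWO, VULGAR FRACTION ONE FIFTH, a letter.
for ch in ("3", "\u00b2", "\u2155", "a"):
    print(repr(ch), lookup(ch))
```

A letter such as "a" has none of the three properties, so all three
entries come back as None for it.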

> > Similar APIs are already available through the unicodedata
> > module, but could easily be moved to the Unicode object
> > (they cause the builtin interpreter to grow a bit in size
> > due to the new mapping tables).
> >
> > BTW, string.atoi et al. are currently not mapped to
> > string methods... should they be ?
>
> They are mapped to int() c.s.

Hmm, I just noticed that int() et friends don't like
Unicode... shouldn't they use the "t" parser marker
instead of requiring a string or tp_int compatible
type ?

--
Marc-Andre Lemburg
Re: Unicode character property methods
[MAL]
> > > As you may have noticed, the Unicode objects provide
> > > new methods .islower(), .isupper() and .istitle(). Finn Bock
> > > mentioned that Java also provides .isdigit() and .isspace().
> > >
> > > Question: should Unicode also provide these character
> > > property methods: .isdigit(), .isnumeric(), .isdecimal()
> > > and .isspace() ? Plus maybe .digit(), .numeric() and
> > > .decimal() for the corresponding decoding ?

[Guido]
> > What would be the difference between isdigit, isnumeric, isdecimal?
> > I'd say don't do more than Java. I don't understand what the
> > "corresponding decoding" refers to. What would "3".decimal() return?

[MAL]
> These originate in the Unicode database; see
>
> ftp://ftp.unicode.org/Public/3.0-Update/UnicodeData-3.0.0.html
>
> Here are the descriptions:
>
> """
> Field 6: Decimal digit value (normative)
>     This is a numeric field. If the character has the decimal
>     digit property, as specified in Chapter 4 of the Unicode
>     Standard, the value of that digit is represented with an
>     integer value in this field.
>
> Field 7: Digit value (normative)
>     This is a numeric field. If the character represents a
>     digit, not necessarily a decimal digit, the value is here.
>     This covers digits which do not form decimal radix forms,
>     such as the compatibility superscript digits.
>
> Field 8: Numeric value (normative)
>     This is a numeric field. If the character has the numeric
>     property, as specified in Chapter 4 of the Unicode Standard,
>     the value of that character is represented with an integer
>     or rational number in this field. This includes fractions
>     as, e.g., "1/5" for U+2155 VULGAR FRACTION ONE FIFTH. Also
>     included are numerical values for compatibility characters
>     such as circled numbers.
> """
>
> u"3".decimal() would return 3; u"\u2155".numeric() would return 0.2.
>
> Some more examples from the unicodedata module (which makes
> all fields of the database available in Python):
>
> >>> unicodedata.decimal(u"3")
> 3
> >>> unicodedata.decimal(u"²")
> 2
> >>> unicodedata.digit(u"²")
> 2
> >>> unicodedata.numeric(u"²")
> 2.0
> >>> unicodedata.numeric(u"\u2155")
> 0.2
> >>> unicodedata.numeric(u'\u215b')
> 0.125

Hm, very Unicode centric. Probably best left out of the general
string methods. Isspace() seems useful, and an isdigit() that is only
true for ASCII '0' - '9' also makes sense.
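
An ASCII-only check of this kind could look roughly like the sketch
below. The function name ascii_isdigit() is hypothetical, and applying it
to a whole non-empty string rather than a single character is an
assumption taken from the question that follows.

```python
def ascii_isdigit(s):
    """True if s is non-empty and every character is ASCII '0'..'9'."""
    # Each c is a single character, so the chained comparison is a
    # plain code point range check.
    return len(s) > 0 and all("0" <= c <= "9" for c in s)

print(ascii_isdigit("123"))      # ASCII digits only
print(ascii_isdigit("12a"))      # contains a letter
print(ascii_isdigit("\u00b2"))   # SUPERSCRIPT TWO is outside '0'..'9'
```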

What about "123".isdigit()? What does Java say? Or do these only
apply to single chars there? I think "123".isdigit() should be true
if "abc".islower() is true.

> > > Similar APIs are already available through the unicodedata
> > > module, but could easily be moved to the Unicode object
> > > (they cause the builtin interpreter to grow a bit in size
> > > due to the new mapping tables).
> > >
> > > BTW, string.atoi et al. are currently not mapped to
> > > string methods... should they be ?
> >
> > They are mapped to int() c.s.
>
> Hmm, I just noticed that int() et friends don't like
> Unicode... shouldn't they use the "t" parser marker
> instead of requiring a string or tp_int compatible
> type ?

Good catch. Go ahead.

--Guido van Rossum (home page: http://www.python.org/~guido/)
Re: Unicode character property methods
Guido van Rossum wrote:
> [MAL about adding .isdecimal(), .isdigit() and .isnumeric()]
> > Some more examples from the unicodedata module (which makes
> > all fields of the database available in Python):
> >
> > >>> unicodedata.decimal(u"3")
> > 3
> > >>> unicodedata.decimal(u"²")
> > 2
> > >>> unicodedata.digit(u"²")
> > 2
> > >>> unicodedata.numeric(u"²")
> > 2.0
> > >>> unicodedata.numeric(u"\u2155")
> > 0.2
> > >>> unicodedata.numeric(u'\u215b')
> > 0.125
>
> Hm, very Unicode centric. Probably best left out of the general
> string methods. Isspace() seems useful, and an isdigit() that is only
> true for ASCII '0' - '9' also makes sense.

Well, how about having all three on Unicode objects
and only .isdigit() on string objects ?

> What about "123".isdigit()? What does Java say? Or do these only
> apply to single chars there? I think "123".isdigit() should be true
> if "abc".islower() is true.

In the current uPython implementation u"123".isdigit() is true;
same for the other two methods.
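
One way to picture the whole-string behaviour is "non-empty, and every
character has the property". The sketch below spells that rule out on
top of the unicodedata module for a current Python 3. It is an
illustration under that assumption, not the actual implementation; the
helper names are made up, and treating the empty string as false is a
choice made here.

```python
import unicodedata

def every_char(s, has_property):
    """True if s is non-empty and has_property(c) holds for each c."""
    return len(s) > 0 and all(has_property(c) for c in s)

def is_decimal(s):
    return every_char(s, lambda c: unicodedata.decimal(c, None) is not None)

def is_digit(s):
    return every_char(s, lambda c: unicodedata.digit(c, None) is not None)

def is_numeric(s):
    return every_char(s, lambda c: unicodedata.numeric(c, None) is not None)

for text in ("123", "12a", "\u2155"):
    print(repr(text), is_decimal(text), is_digit(text), is_numeric(text))
```

Under this rule "123" passes all three checks, while VULGAR FRACTION ONE
FIFTH only passes the numeric one because it has a numeric value but no
digit or decimal digit value.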

> > > > Similar APIs are already available through the unicodedata
> > > > module, but could easily be moved to the Unicode object
> > > > (they cause the builtin interpreter to grow a bit in size
> > > > due to the new mapping tables).
> > > >
> > > > BTW, string.atoi et al. are currently not mapped to
> > > > string methods... should they be ?
> > >
> > > They are mapped to int() c.s.
> >
> > Hmm, I just noticed that int() et friends don't like
> > Unicode... shouldn't they use the "t" parser marker
> > instead of requiring a string or tp_int compatible
> > type ?
>
> Good catch. Go ahead.

Done. float(), int() and long() now accept charbuf
compatible objects as argument.
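
For comparison, here is how the conversion looks on a current Python 3,
where every str is a Unicode string and long() has been merged into
int(). This is a sketch of today's behaviour only, not of the change
described above; the variable names are illustrative.

```python
# The u"" prefix is still accepted in Python 3 and makes no difference.
from_ascii = int(u"123")
as_float = float(u"1.5")

# int() also accepts decimal digits from other scripts, here
# ARABIC-INDIC DIGIT ONE, TWO and THREE.
from_arabic_indic = int(u"\u0661\u0662\u0663")

print(from_ascii, as_float, from_arabic_indic)
```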

--
Marc-Andre Lemburg