Mailing List Archive

unicode alphanumerics
when looking through skip's coverage listing, I noted a bug in
SRE:

#define SRE_UNI_IS_ALNUM(ch) ((ch) < 256 ? isalnum((ch)) : 0)

this predicate is used for \w when a pattern is compiled using
the "unicode locale" (flag U), and should definitely not use 8-bit
locale stuff.
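
A minimal sketch of the problem, wrapping the macro quoted above in a function so it can be called directly. The wrapper name `old_uni_is_alnum` is hypothetical and not part of SRE; the sample code points are illustrative only.

```c
#include <ctype.h>

/* The macro as quoted in the thread. */
#define SRE_UNI_IS_ALNUM(ch) ((ch) < 256 ? isalnum((ch)) : 0)

/* Hypothetical wrapper (not part of SRE): normalizes the result to 0 or 1. */
int old_uni_is_alnum(unsigned int ch)
{
    return SRE_UNI_IS_ALNUM(ch) != 0;
}
```

With this definition, `old_uni_is_alnum('a')` is 1, but `old_uni_is_alnum(0x03B1)` (GREEK SMALL LETTER ALPHA) is 0: every code point at or above 256 is rejected, and the result below 256 depends on the current 8-bit C locale.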

however, there's no such thing as a Py_UNICODE_ISALNUM
(or even a Py_UNICODE_ISALPHA). what should I do? how
about using:

Py_UNICODE_ISLOWER ||
Py_UNICODE_ISUPPER ||
Py_UNICODE_ISTITLE ||
Py_UNICODE_ISDIGIT

</F>
Re: unicode alphanumerics
Fredrik Lundh wrote:
>
> when looking through skip's coverage listing, I noted a bug in
> SRE:
>
> #define SRE_UNI_IS_ALNUM(ch) ((ch) < 256 ? isalnum((ch)) : 0)
>
> this predicate is used for \w when a pattern is compiled using
> the "unicode locale" (flag U), and should definitely not use 8-bit
> locale stuff.
>
> however, there's no such thing as a Py_UNICODE_ISALNUM
> (or even a Py_UNICODE_ISALPHA). what should I do? how
> about using:
>
> Py_UNICODE_ISLOWER ||
> Py_UNICODE_ISUPPER ||
> Py_UNICODE_ISTITLE ||
> Py_UNICODE_ISDIGIT

This will give you all cased chars along with all digits;
it omits the non-cased ones.

It's a good start, but probably won't cover the full range
of letters + numbers.

Perhaps we need another table for isalpha in unicodectype.c ?
(Or at least one which defines all non-cased letters.)
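
To illustrate the gap, here is a rough sketch that uses a tiny hand-picked table in place of the real predicates. Everything prefixed `toy_` is hypothetical and only stands in for the Py_UNICODE_IS* macros; it is not the unicodectype.c data.

```c
#include <stddef.h>

/* Hypothetical mini-database: a handful of code points, NOT real tables. */
typedef enum { CAT_OTHER, CAT_LL, CAT_LU, CAT_LT, CAT_ND, CAT_LO } toy_category;

typedef struct { unsigned int ch; toy_category cat; } toy_entry;

static const toy_entry toy_table[] = {
    { 0x0061, CAT_LL },  /* LATIN SMALL LETTER A */
    { 0x0041, CAT_LU },  /* LATIN CAPITAL LETTER A */
    { 0x01C5, CAT_LT },  /* LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON */
    { 0x0035, CAT_ND },  /* DIGIT FIVE */
    { 0x05D0, CAT_LO },  /* HEBREW LETTER ALEF, a letter without case */
};

/* Linear search over the sample; unknown code points map to CAT_OTHER. */
toy_category toy_lookup(unsigned int ch)
{
    size_t i;
    for (i = 0; i < sizeof(toy_table) / sizeof(toy_table[0]); i++) {
        if (toy_table[i].ch == ch)
            return toy_table[i].cat;
    }
    return CAT_OTHER;
}

/* The proposed approximation: lower || upper || title || digit. */
int approx_isalnum(unsigned int ch)
{
    toy_category c = toy_lookup(ch);
    return c == CAT_LL || c == CAT_LU || c == CAT_LT || c == CAT_ND;
}

/* The same test extended with a table of non-cased letters (Lo). */
int with_noncased_isalnum(unsigned int ch)
{
    return approx_isalnum(ch) || toy_lookup(ch) == CAT_LO;
}
```

In this sketch `approx_isalnum(0x05D0)` is 0 while `with_noncased_isalnum(0x05D0)` is 1, which is the kind of character the cased-only test misses.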

--
Marc-Andre Lemburg
______________________________________________________________________
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
Re: unicode alphanumerics
mal wrote:
> > Py_UNICODE_ISLOWER ||
> > Py_UNICODE_ISUPPER ||
> > Py_UNICODE_ISTITLE ||
> > Py_UNICODE_ISDIGIT
>
> This will give you all cased chars along with all digits;
> it omits the non-cased ones.

but of course...

> It's a good start, but probably won't cover the full range
> of letters + numbers.
>
> Perhaps we need another table for isalpha in unicodectype.c ?
> (Or at least one which defines all non-cased letters.)

+1 from me (SRE needs this, and it doesn't really make much
sense to add unicode tables to SRE just because the built-in
ones are slightly incomplete...)

how about this plan:

-- you add a Py_UNICODE_ALPHA to unicodeobject.h asap,
which does exactly that (or I can do that, if you prefer).
(and maybe even a Py_UNICODE_ALNUM)

-- I change SRE to use that asap.

-- you, I, or someone else add a better implementation,
some other day.

</F>
Re: unicode alphanumerics
Fredrik Lundh wrote:
>
> mal wrote:
> > > Py_UNICODE_ISLOWER ||
> > > Py_UNICODE_ISUPPER ||
> > > Py_UNICODE_ISTITLE ||
> > > Py_UNICODE_ISDIGIT
> >
> > This will give you all cased chars along with all digits;
> > it omits the non-cased ones.
>
> but of course...
>
> > It's a good start, but probably won't cover the full range
> > of letters + numbers.
> >
> > Perhaps we need another table for isalpha in unicodectype.c ?
> > (Or at least one which defines all non-cased letters.)
>
> +1 from me (SRE needs this, and it doesn't really make much
> sense to add unicode tables to SRE just because the built-in
> ones are slightly incomplete...)
>
> how about this plan:
>
> -- you add a Py_UNICODE_ALPHA to unicodeobject.h asap,
> which does exactly that (or I can do that, if you prefer).
> (and maybe even a Py_UNICODE_ALNUM)

Ok, I'll add Py_UNICODE_ISALPHA and Py_UNICODE_ISALNUM
(first with approximations of the sort you give above and
later with true implementations using tables in unicodectype.c)
on Monday... gotta run now.

> -- I change SRE to use that asap.
>
> -- you, I, or someone else add a better implementation,
> some other day.
>
> </F>

Nice weekend :)
--
Marc-Andre Lemburg
Re: unicode alphanumerics
"M.-A. Lemburg" wrote:
>
> Fredrik Lundh wrote:
> > how about this plan:
> >
> > -- you add a Py_UNICODE_ALPHA to unicodeobject.h asap,
> > which does exactly that (or I can do that, if you prefer).
> > (and maybe even a Py_UNICODE_ALNUM)
>
> Ok, I'll add Py_UNICODE_ISALPHA and Py_UNICODE_ISALNUM
> (first with approximations of the sort you give above and
> later with true implementations using tables in unicodectype.c)
> on Monday... gotta run now.
>
> > -- I change SRE to use that asap.
> >
> > -- you, I, or someone else add a better implementation,
> > some other day.

I've just looked into this... the problem here is what to
consider as being "alpha" and what "numeric".

I could add two new tables for the characters with category 'Lo'
(other letters, not cased) and 'Lm' (letter modifiers)
to match all letters in the Unicode database, but those
tables have some 5200 entries (note that there are only 804 lower
case letters and 686 upper case ones).

Note that there seems to be no definition of what is to be
considered alphanumeric in Unicode. The only quote I found was
in http://www.w3.org/TR/xslt#convert which says:

"""
Alphanumeric means any character that has a Unicode
category of Nd, Nl, No, Lu, Ll, Lt, Lm or Lo.
"""
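
Expressed as code, the XSLT wording amounts to a test on the general category alone. The sketch below assumes the category of a character has already been looked up somewhere else; the enum and function names are made up for illustration.

```c
/* Hypothetical enum covering the categories named in the quote,
   plus two others (Po, Zs) for contrast. */
typedef enum {
    GC_LU, GC_LL, GC_LT, GC_LM, GC_LO,   /* letters */
    GC_ND, GC_NL, GC_NO,                 /* numbers */
    GC_PO, GC_ZS                         /* punctuation, space separator */
} general_category;

/* Letters: Lu, Ll, Lt, Lm or Lo. */
int category_is_alpha(general_category gc)
{
    return gc == GC_LU || gc == GC_LL || gc == GC_LT ||
           gc == GC_LM || gc == GC_LO;
}

/* Alphanumeric per the XSLT wording: any letter, or Nd, Nl, No. */
int category_is_alnum(general_category gc)
{
    return category_is_alpha(gc) ||
           gc == GC_ND || gc == GC_NL || gc == GC_NO;
}
```

Note that this definition counts Nl and No (for example Roman numerals and fractions) as alphanumeric, which is broader than a digits-only test.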

Here's what the glibc has to say about these chars:

/* Test for any wide character for which `iswupper' or 'iswlower' is
true, or any wide character that is one of a locale-specific set of
wide-characters for which none of `iswcntrl', `iswdigit',
`iswpunct', or `iswspace' is true. */
extern int iswalpha __P ((wint_t __wc));
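
For comparison, a small sketch built on the standard `<wctype.h>` functions. The helper name `wide_isalnum` is hypothetical; the standard library already offers `iswalnum` for the same purpose.

```c
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: alphabetic or decimal digit, per the C library. */
int wide_isalnum(wint_t wc)
{
    return (iswalpha(wc) || iswdigit(wc)) ? 1 : 0;
}
```

For ASCII input the answers are fixed, e.g. `wide_isalnum(L'a')` is 1 and `wide_isalnum(L' ')` is 0. For characters outside ASCII, what `iswalpha` reports depends on the locale selected with `setlocale`, so this is not a locale-independent Unicode test.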

Question:
Should I go ahead and add the Lo and Lm tables to unicodectype.c ?

Pros: standards-conformant
Cons: huge in size

--
Marc-Andre Lemburg
Re: unicode alphanumerics
FYI, I've added two new macros which allow querying alphabetic
and alphanumeric characters:

Py_UNICODE_ISALPHA() and Py_UNICODE_ISALNUM()

The implementation is currently only experimental -- some 5200
chars are not yet correctly identified as alphanumeric (see my
other post on the topic).

--
Marc-Andre Lemburg
Re: unicode alphanumerics
[M.-A. Lemburg]

>"M.-A. Lemburg" wrote:
>>
>> Fredrik Lundh wrote:
>> > how about this plan:
>> >
>> > -- you add a Py_UNICODE_ALPHA to unicodeobject.h asap,
>> > which does exactly that (or I can do that, if you prefer).
>> > (and maybe even a Py_UNICODE_ALNUM)
>>
>> Ok, I'll add Py_UNICODE_ISALPHA and Py_UNICODE_ISALNUM
>> (first with approximations of the sort you give above and
>> later with true implementations using tables in unicodectype.c)
>> on Monday... gotta run now.
>>
>> > -- I change SRE to use that asap.
>> >
>> > -- you, I, or someone else add a better implementation,
>> > some other day.
>
>I've just looked into this... the problem here is what to
>consider as being "alpha" and what "numeric".
>
>I could add two new tables for the characters with category 'Lo'
>(other letters, not cased) and 'Lm' (letter modifiers)
>to match all letters in the Unicode database, but those
>tables have some 5200 entries (note that there are only 804 lower
>case letters and 686 upper case ones).

In JDK1.3, Character.isLetter(..) and Character.isDigit(..) are
documented as:

http://java.sun.com/j2se/1.3/docs/api/java/lang/Character.html#isLetter(char)
http://java.sun.com/j2se/1.3/docs/api/java/lang/Character.html#isDigit(char)
http://java.sun.com/j2se/1.3/docs/api/java/lang/Character.html#isLetterOrDigit(char)

I guess that Java uses the extra huge tables.

regards,
finn