Mailing List Archive

Re: String methods... finally
> Is there any sort of agreement that Python will use L"..." to denote
> Unicode strings? I would be happy with it.

I don't know of any agreement, but it makes sense.

> Also, should:
> print L"foo" -> 'foo'
> and
> print `L"foo"` -> L'foo'

Yes, I think this should be the way. Exactly what happens to
non-ASCII characters is up to the implementation.

Do we have agreement on escapes like \xDDDD? Should \uDDDD be added?

The difference between the two is that according to the ANSI C
standard, which I follow rather strictly for string literals,
'\xABCDEF' is a single character whose value is the lower bits
(however many fit in a char) of 0xABCDEF; this makes it cumbersome to
write a string consisting of a hex escape followed by a digit or
letter a-f or A-F; you would have to use another hex escape or split
the literal in two, like this: "\xABCD" "EF". (This is true for 8-bit
chars as well as for wide chars (wchar_t) in ANSI C.) The \u escape takes up to
4 bytes but is not ANSI C. In Java, \u has the additional funny
property that it is recognized *everywhere* in the source code, not
just in string literals, and I believe that this complicates the
interpretation of things like "\\uffff" (is the \uffff interpreted
before regular string \ processing happens?). I don't think we ought
to copy this behavior, although JPython users or developers might
disagree. (I don't know anyone who *uses* Unicode strings much, so
it's hard to gauge the importance of these issues.)
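
A minimal Python sketch of the greedy \x rule described above may make the
problem concrete. The function name c_style_hex_escape and the char_bits
parameter are hypothetical, invented for illustration; the masking to the
"lower bits" follows the description in this message, not any particular
compiler.

```python
import string


def c_style_hex_escape(text, char_bits=8):
    """Parse a leading \\x escape the greedy way described above.

    Returns (value, rest): every hex digit after \\x is consumed, and only
    the low `char_bits` bits of the resulting number are kept.
    """
    if not text.startswith("\\x"):
        raise ValueError("text does not start with a \\x escape")
    end = 2
    while end < len(text) and text[end] in string.hexdigits:
        end += 1
    if end == 2:
        raise ValueError("\\x must be followed by at least one hex digit")
    mask = (1 << char_bits) - 1
    return int(text[2:end], 16) & mask, text[end:]


# Greedy parsing swallows all six digits; an 8-bit char keeps only 0xEF.
narrow_value, narrow_rest = c_style_hex_escape(r"\xABCDEF")

# With a 16-bit char the same escape keeps 0xCDEF.
wide_value, wide_rest = c_style_hex_escape(r"\xABCDEF", char_bits=16)

# For contrast: present-day Python 3 reads exactly two hex digits after \x,
# so the trailing "BC" stays ordinary text.
python_literal = "\x41BC"
```

Under the greedy rule there is no way to write "the character 0xAB followed
by the letters CDEF" in a single escape, which is why the literal has to be
split in two.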

--Guido van Rossum (home page: http://www.python.org/~guido/)
Re: String methods... finally
> How do endian issues fit in with \u?

I would assume that it uses the same rules as hex and octal numeric
literals: these are always *written* in big-endian notation, since
that is also what we use for decimal numbers. Thus, on a
little-endian machine, the short integer 0x1234 would be stored as the
bytes {0x34, 0x12} and so would the string literal "\x1234".
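
The byte order described here can be checked with a short Python sketch. It
uses present-day Python 3, where a 16-bit character is written "\u1234"
(\x takes only two digits there), so this illustrates the idea and is not
the 1999 behavior.

```python
import struct

value = 0x1234  # always *written* most-significant digit first

# The same number stored as a 16-bit integer in each byte order.
little = struct.pack("<H", value)
big = struct.pack(">H", value)

# The character U+1234 stored as one 16-bit unit in each byte order.
wide_le = "\u1234".encode("utf-16-le")
wide_be = "\u1234".encode("utf-16-be")

print(little.hex(), wide_le.hex())  # 3412 3412
print(big.hex(), wide_be.hex())     # 1234 1234
```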

--Guido van Rossum (home page: http://www.python.org/~guido/)
Re: String methods... finally
>>>>> "MH" == Mark Hammond <MHammond@skippinet.com.au> writes:

MH> OTOH, my gut tells me this is better - that an implicit
MH> conversion to the separator type be performed.

Right now, the implementation of join uses PyObject_Str() to str-ify
the elements in the sequence. I can't remember, but in our Unicode
worldview doesn't PyObject_Str() return a narrowed string if it can,
and raise an exception if not? So maybe narrow-string's join
shouldn't be doing it this way because that'll autoconvert to the
separator's type, which breaks the symmetry.

OTOH, we could promote sep to the type of sequence[0] and forward the
call to its join if it were a widestring. That should retain the
symmetry.

-Barry
Re: String methods... finally
>>>>> "Guido" == Guido van Rossum <guido@cnri.reston.va.us> writes:

Guido> Should \uDDDD be added?

That'd be nice! :)

Guido> In Java, \u has the additional funny property that it is
Guido> recognized *everywhere* in the source code, not just in
Guido> string literals, and I believe that this complicates the
Guido> interpretation of things like "\\uffff" (is the \uffff
Guido> interpreted before regular string \ processing happens?).

No. JLS section 3.3 says[1]

In addition to the processing implied by the grammar, for each raw
input character that is a backslash \, input processing must
consider how many other \ characters contiguously precede it,
separating it from a non-\ character or the start of the input
stream. If this number is even, then the \ is eligible to begin a
Unicode escape; if the number is odd, then the \ is not eligible
to begin a Unicode escape.

and this is borne out by example.

-------------------- snip snip --------------------Uni.java
public class Uni
{
    public static void main(String[] args) {
        System.out.println("\\u00a9");
        System.out.println("\u00a9");
    }
}
-------------------- snip snip --------------------outputs
\u00a9
©
-------------------- snip snip --------------------
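
The counting rule quoted from the JLS can be sketched in Python roughly as
follows. The function name translate_unicode_escapes is hypothetical, and
the sketch simplifies one point: a character produced by an escape resets
the backslash count, and nothing else about Java's lexer is modeled.

```python
import string


def translate_unicode_escapes(source):
    """Replace \\uXXXX escapes, honoring the even/odd backslash rule."""
    out = []
    i = 0
    run = 0  # contiguous backslashes immediately preceding position i
    while i < len(source):
        ch = source[i]
        # A backslash is eligible only if an even number of backslashes
        # (possibly zero) precede it, and it is followed by 'u'.
        if ch == "\\" and run % 2 == 0 and source.startswith("u", i + 1):
            j = i + 1
            while source.startswith("u", j):  # Java allows \uuuu00a9
                j += 1
            digits = source[j:j + 4]
            if len(digits) != 4 or any(c not in string.hexdigits for c in digits):
                raise ValueError("\\u must be followed by exactly 4 hex digits")
            out.append(chr(int(digits, 16)))
            i = j + 4
            run = 0  # simplification: the produced character ends the run
            continue
        run = run + 1 if ch == "\\" else 0
        out.append(ch)
        i += 1
    return "".join(out)


# The two lines from Uni.java: the first backslash pair blocks the escape,
# the single backslash in the second line starts one.
escaped = translate_unicode_escapes(r'System.out.println("\\u00a9");')
translated = translate_unicode_escapes(r'System.out.println("\u00a9");')
```

In the first line the backslash before the u is preceded by one other
backslash (an odd count), so it is left alone and the ordinary string
processing later turns \\ into a single \.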

-Barry

[1] http://java.sun.com/docs/books/jls/html/3.doc.html#44591

PS. it is wonderful having the JLS online :)
Re: String methods... finally
Guido asks:

> Do we have agreement on escapes like \xDDDD? Should \uDDDD be
> added?

> ... The \u escape
> takes up to 4 bytes but is not ANSI C.

How do endian issues fit in with \u?

- Gordon
RE: String methods... finally
[MarkH agonizes over whether to auto-convert or not]

Well, the rule *could* be that the result type is the widest string type
among the separator and the sequences' string elements (if any), and other
types convert to the result type along the way. I'd be more specific,
except I'm not sure which flavor of string str() returns (or, indeed,
whether that's up to each __str__ implementation). In any case, widening to
Unicode should always be possible, and if "widest wins" it doesn't require a
multi-pass algorithm regardless (although the partial result so far may need
to be widened once -- but that's true even if auto-convert of non-string
types isn't implemented).

Or, IOW,
sep.join([a, b, c]) == f(a) + sep + f(b) + sep + f(c)

where I don't know how to spell f, but f(x) *means*

x' = if x has a string type then x else x.__str__()
return x' coerced to the widest string type seen so far

So I think everyone can get what they want -- except that those who want
auto-convert are at direct odds with those who prefer to wag Guido's fingers
and go "tsk, tsk, we know what you want but you didn't say 'please' so your
program dies" <wink>.
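
The "widest wins" rule can be sketched in present-day Python under some
loudly stated assumptions: bytes stands in for the narrow string type, str
for the wide one, narrow strings widen via Latin-1 (so widening never
fails), and non-string elements are narrowed when possible. The names
widest_join, as_string and same_width are hypothetical.

```python
def as_string(x):
    """f's first step: strings pass through, other types are str-ified.

    Assumption: a non-string converts to a narrow string when it can,
    and to a wide string otherwise.
    """
    if isinstance(x, (bytes, str)):
        return x
    text = str(x)
    try:
        return text.encode("latin-1")
    except UnicodeEncodeError:
        return text


def same_width(a, b):
    """Coerce both operands to the wider of their two types."""
    if isinstance(a, str) or isinstance(b, str):
        a = a.decode("latin-1") if isinstance(a, bytes) else a
        b = b.decode("latin-1") if isinstance(b, bytes) else b
    return a, b


def widest_join(sep, items):
    """Single-pass join; the partial result is widened at most once."""
    result = sep[:0]  # empty string of the separator's type
    first = True
    for item in items:
        piece = as_string(item)
        if not first:
            result, widened_sep = same_width(result, sep)
            result += widened_sep
        result, piece = same_width(result, piece)
        result += piece
        first = False
    return result


all_narrow = widest_join(b"-", [b"x", b"y"])
mixed = widest_join(b", ", [b"a", 1, "\u00e9"])
```

Here all_narrow stays a narrow string, while mixed comes back wide because
one element was wide; the narrow prefix built before that element is
widened once, when the wide element is reached.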

master-of-fair-summaries-ly y'rs - tim
RE: String methods... finally
[Guido]
> Do we have agreement on escapes like \xDDDD?

I think we have to agree to leave that alone -- it affects what e.g. the
regular expression parser does too.

> Should \uDDDD be added?

Yes, but only in string literals. You don't want to be within 10 miles of
Barry if you tell him that Emacs pymode has to treat the Unicode escape for
a newline as if it were -- as Java treats it outside literals -- an actual
line break <0.01 wink>.

> ...
> The \u escape takes up to 4 bytes

Not in Java: it requires exactly 4 hex digits after the \u (== exactly 2 bytes),
and it's an error if it's followed by fewer than 4 hex characters. That's a
good rule (simple!), while ANSI C's is too clumsy to live with if people
want to take Unicode seriously.
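
Present-day Python ended up with the same fixed-width rule, which a small
sketch can show using the standard "unicode_escape" codec. The helper name
decode_escapes is hypothetical; this demonstrates current Python 3, not
anything that existed when this was written.

```python
def decode_escapes(raw):
    """Decode backslash escapes in a bytes object into a str."""
    return raw.decode("unicode_escape")


# Exactly four hex digits make one character, i.e. one 16-bit unit.
copyright_sign = decode_escapes(rb"\u00a9")
unit_size = len(copyright_sign.encode("utf-16-be"))

# Digits after the fourth are ordinary text, so no literal splitting is needed.
followed_by_digits = decode_escapes(rb"\u00a9ff")

# Fewer than four hex digits is an error rather than a shorter escape.
try:
    decode_escapes(rb"\u00a")
except UnicodeDecodeError:
    truncated_is_error = True
else:
    truncated_is_error = False
```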

So what does it mean for a Unicode escape to appear in a non-L string?

aha-the-secret-escape-to-ucs4<wink>-ly y'rs - tim
Re: String methods... finally
Guido van Rossum wrote:
>
> > Is there any sort of agreement that Python will use L"..." to denote
> > Unicode strings? I would be happy with it.
>
> I don't know of any agreement, but it makes sense.

The u"..." looks more intuitive to me. While inheriting C/C++
constructs usually makes sense I think usage in the C community
is not that wide-spread yet and for a Python freak, the small u will
definitely remind him of Unicode whereas the L will stand for
(nearly) unlimited length/precision.

Not that this is important, but...

--
Marc-Andre Lemburg
______________________________________________________________________
Y2000: 198 days left
Business: http://www.lemburg.com/
Python Pages: http://www.lemburg.com/python/
Re: String methods... finally
> > The \u escape takes up to 4 bytes
>
> Not in Java: it requires exactly 4 hex digits after the \u (== exactly 2 bytes),
> and it's an error if it's followed by fewer than 4 hex characters. That's a
> good rule (simple!), while ANSI C's is too clumsy to live with if people
> want to take Unicode seriously.
>
> So what does it mean for a Unicode escape to appear in a non-L string?

my suggestion is to store it as UTF-8; see the patches
included in the unicode package for details.

this also means that a u-string literal (L-string, whatever)
could be stored as an 8-bit string internally. and that the
following two are equivalent:

string = u"foo"
string = unicode("foo")

also note that:

unicode(str(u"whatever")) == u"whatever"
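
a rough sketch of that round trip in present-day Python, assuming the
narrow form is simply the UTF-8 encoding of the wide one. bytes plays the
old 8-bit string type here, and the helper names to_narrow and to_wide are
hypothetical:

```python
def to_narrow(text):
    """Store a wide string as 8-bit data using UTF-8."""
    return text.encode("utf-8")


def to_wide(data):
    """Recover the wide string from its UTF-8 bytes."""
    return data.decode("utf-8")


samples = ["foo", "whatever", "na\u00efve", "\u20ac"]

# The round trip is lossless for every sample.
round_trips = [to_wide(to_narrow(s)) == s for s in samples]

# Pure ASCII text is byte-for-byte unchanged; other characters take
# two or more bytes each.
sizes = {s: len(to_narrow(s)) for s in samples}
```

the catch is that the narrow form's length counts bytes, not characters,
once anything outside ASCII is involved.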

...

on the other hand, this means that we have at least four
major "arrays of bytes or characters" thingies mapped on
two data types:

the old string type is used for:

-- plain old 8-bit strings (ascii, iso-latin-1, whatever)
-- byte buffers containing arbitrary data
-- unicode strings stored as 8-bit characters, using
the UTF-8 encoding.

and the unicode string type is used for:

-- unicode strings stored as 16-bit characters

is this reasonable?

...

yet another question is how to deal with source code.
is a python 1.6 source file written in ASCII, ISO Latin 1,
or UTF-8?

speaking from a non-us standpoint, it would be really
cool if you could write Python sources in UTF-8...

</F>
Re: String methods... finally
>>>>> "M" == M <mal@lemburg.com> writes:

M> The u"..." looks more intuitive to me. While inheriting C/C++
M> constructs usually makes sense I think usage in the C community
M> is not that wide-spread yet and for a Python freak, the small u
M> will definitely remind him of Unicode whereas the L will stand
M> for (nearly) unlimited length/precision.

I don't think I've ever seen C code with L"..." strings in them.
Here's my list in no particular order.

U"..." -- reminds Java/JPython users of Unicode. Alternative
mnemonic: Unamerican-strings

L"..." -- long-strings, Lundh-strings, ...

W"..." -- wide-strings, Warsaw-strings (just trying to take credit
where credit's not due :), what-the-heck-are-these?-strings

H"..." -- happy-strings, Hammond-strings,
hey-you-just-made-my-extension-module-crash-strings

F"..." -- funky-stuff-in-these-hyar-strings

A"..." -- ain't-strings

S"..." -- strange-strings, silly-strings

M> Not that this is important, but...

Agreed.

-Barry
