Mailing List Archive

C's isprint() concept?
If I want to replace all non-printable characters in a string with a single
space, what would be the best way? Do I need to loop over the entire string
character by character checking the ord() value of each one? Anyone have
a sane way to do this with regular expressions?

[ C has isprint() and isgraph() macros ]

Thanks for any help
C's isprint() concept? [ In reply to ]
> If I want to replace all non-printable characters in a string with a single
> space, what would be the best way? Do I need to loop over the entire string
> character by character checking the ord() value of each one? Anyone have
> a sane way to do this with regular expressions?

In Perl, I could do it this way:
$string =~ tr/\x00-\x1f\x80-\xff//d;

What that means is this:
remove the following characters from $string:
characters whose ASCII value is from \x00 (0) to \x1f (31)
characters whose ASCII value is from \x80 (128) to \xff (255)

This can be done, almost as painlessly, in Python.
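
For what it's worth, here is a minimal sketch of the same idea in modern Python 3 (not the 1.5.x of this thread). The names NONPRINTABLE, delete_nonprintable and space_nonprintable are made up for illustration. The first function deletes, like the tr///d above; the second substitutes a space for each character, which is closer to what was asked.

```python
import re

# Same ranges as the Perl tr/// above: 0x00-0x1F and 0x80-0xFF.
# Note that DEL (0x7F) is not in either range, so it is left alone.
NONPRINTABLE = re.compile(r'[\x00-\x1f\x80-\xff]')


def delete_nonprintable(text):
    """Delete every character in the ranges above (like tr/...//d)."""
    return NONPRINTABLE.sub('', text)


def space_nonprintable(text):
    """Replace each such character with one space."""
    return NONPRINTABLE.sub(' ', text)
```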

--
jeff pinyan japhy@pobox.com japhy+perl@pobox.com japhy+crap@pobox.com
japhy's little hole in the (fire) wall: http://www.pobox.com/~japhy/
japhy's perl supposit^Wrepository: http://www.pobox.com/~japhy/perl/
The "CRAP" Project: http://www.pobox.com/~japhy/perl/crap/
CPAN ID: PINYAN http://www.perl.com/CPAN/authors/id/P/PI/PINYAN/
C's isprint() concept? [ In reply to ]
> $string =~ tr/\x00-\x1f\x80-\xff//d;

Perhaps, more readably:

$string =~ tr/ -~//cd;

the /c means complement (take the opposite of) the list
the /d means delete those characters without a replacement
the list is ' ' through '~', which is the class of printables
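
A rough Python 3 counterpart of the complement-and-delete form, assuming the same ' ' through '~' range counts as printable. The function name keep_printable is hypothetical.

```python
def keep_printable(text):
    """Keep only characters from ' ' (0x20) through '~' (0x7E)."""
    return ''.join(ch for ch in text if ' ' <= ch <= '~')
```

Unlike the earlier range list, this also drops DEL (0x7F), since it falls outside ' ' through '~'.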

--
jeff pinyan japhy@pobox.com japhy+perl@pobox.com japhy+crap@pobox.com
C's isprint() concept? [ In reply to ]
In article <slrn7rea88.9uf.jblaine@shell2.shore.net>,
Jeff Blaine <jblaine@shell2.shore.net> wrote:
>
>If I want to replace all non-printable characters in a string with a single
>space, what would be the best way? Do I need to loop over the entire string
>character by character checking the ord() value of each one? Anyone have
>a sane way to do this with regular expressions?

Oh, sure. What isn't clear from the way you write this is whether a run
of multiple non-printable characters should be replaced with a single
space. Here's a regex that I created recently as part of some code to
detect binary documents:

# Everything that isn't CR/LF or 7-bit normal characters
reBinary = re.compile ( r'[^\r\n\x20-\x7F]' )

To make this substitute a space for each character, just do

newString = reBinary.sub ( ' ', string )

I leave the multi-sub as an exercise for the reader.
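
One possible answer to that exercise, sketched in modern Python 3: add a '+' after the character class so a whole run matches at once and is replaced by one space. The names reBinaryRun and collapse_runs are invented here; the character class is the one from reBinary.

```python
import re

# Same class as reBinary, with '+' so a run of such characters is one match.
reBinaryRun = re.compile(r'[^\r\n\x20-\x7F]+')


def collapse_runs(text):
    """Replace each run of matching characters with a single space."""
    return reBinaryRun.sub(' ', text)
```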

>[ C has isprint() and isgraph() macros ]

Don't remember offhand what isgraph() does, but it occurs to me that it
would be useful to have a character class string.printable. Tim? Guido?
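
Later versions of Python's string module do provide a string.printable constant. A minimal sketch of using it in Python 3 follows; PRINTABLE and space_unprintable are hypothetical names. Be aware that string.printable includes the whitespace characters tab, newline, carriage return, vertical tab and form feed, so it is not the same set as ' ' through '~'.

```python
import string

# string.printable = digits + ASCII letters + punctuation + whitespace.
PRINTABLE = frozenset(string.printable)


def space_unprintable(text):
    """Replace each character not in string.printable with a space."""
    return ''.join(ch if ch in PRINTABLE else ' ' for ch in text)
```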
--
--- Aahz (@netcom.com)

Androgynous poly kinky vanilla queer het <*> http://www.rahul.net/aahz/
Hugs and backrubs -- I break Rule 6 (if you want to know, do some research)
C's isprint() concept? [ In reply to ]
If I'm not mistaken, this would be the python equivalent...

teststring = '\000\001that\003\004'

import re
NOTPRINTABLE = r'[^ -~]+'
re.sub( NOTPRINTABLE, ' ', teststring )
# replace each run of characters outside the
# range ' ' through '~' with a single space

or (from the original form):

# note use of octal instead of hexadecimal escape codes
# also note use of a raw string, so that the re module
# interprets the escapes and no literal null byte ends
# up in the regex pattern.
NONPRINTABLE = r'[\000-\037\200-\377]+'
re.sub( NONPRINTABLE, ' ', teststring )
# replace each run of characters in the class
# NONPRINTABLE with a single space

If you want to replace each character with its own " ", leave off the plus
sign after the character class (so that every character is matched
individually). If you need something faster for character-by-character
replacement, check out the string module's mapping functions.
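
A sketch of the mapping approach in modern Python 3, where the translation table is a dict passed to str.translate rather than something built with the old string module functions. TABLE and translate_nonprintable are names invented for this example, and the table only covers code points below 256.

```python
# Map every code point below 0x100 that is outside ' '..'~' to a space.
TABLE = {cp: ' ' for cp in range(256) if not 0x20 <= cp <= 0x7E}


def translate_nonprintable(text):
    """Replace each non-printable character with its own space."""
    return text.translate(TABLE)
```

This replaces character by character, so a run of four control characters becomes four spaces.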

I hope that helps,
Mike

-----Original Message-----
From: python-list-request@cwi.nl [mailto:python-list-request@cwi.nl]On
Behalf Of Jeff Pinyan
Sent: August 15, 1999 5:29 PM
To: python-list@cwi.nl
Subject: Re: C's isprint() concept?


> $string =~ tr/\x00-\x1f\x80-\xff//d;

Perhaps, more readably:

$string =~ tr/ -~//cd;

the /c means complement (take the opposite of) the list
the /d means delete those characters without a replacement
the list is ' ' through '~', which is the class of printables
C's isprint() concept? [ In reply to ]
Jeff Pinyan <jeffp@crusoe.net> wrote:
:> If I want to replace all non-printable characters in a string with a single
:> space, what would be the best way? Do I need to loop over the entire string
:> character by character checking the ord() value of each one? Anyone have
:> a sane way to do this with regular expressions?

: In Perl, I could do it this way:
: $string =~ tr/\x00-\x1f\x80-\xff//d;

: What that means is this:
: remove the following characters from $string:
: characters whose ASCII value is from \x00 (0) to \x1f (31)
: characters whose ASCII value is from \x80 (128) to \xff (255)

: This can be done, almost as painlessly, in Python.

Considering this was asked on a Python newsgroup, how about showing
how to do it in Python?

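import string
# works for ASCII
control_chars = string._idmap[:ord(' ')] # 0 to 31
high_chars = string._idmap[ord('~')+1:] # 127 to 255
to_remove = control_chars + high_chars
# named "table" so the builtin map() is not shadowed
table = string.maketrans(to_remove, ' ' * len(to_remove))

midstr = string.translate(instr, table)
# split/join collapses every run of whitespace to one space
# (and trims leading and trailing whitespace)
outstr = string.join(string.split(midstr))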

or:
import re
outstr = re.sub(r'[^ -~]+', ' ', instr)

(Indented for "easy cut&paste" ;)

-Arcege
C's isprint() concept? [ In reply to ]
[Jeff Blaine, wants to]
> ... replace all non-printable characters in a string
> with a single space

[Aahz Maruch]
> Oh, sure. What isn't clear from the way you write this is whether
> a run of multiple non-printable characters should be replaced with
> a single space. Here's a regex that I created recently as part of
> some code to detect binary documents:
>
> # Everything that isn't CR/LF or 7-bit normal characters
> reBinary = re.compile ( r'[^\r\n\x20-\x7F]' )

Take a look at string.translate.

> ...
> Don't remember offhand what isgraph() does, but it occurs to me that
> it would be useful to have a character class string.printable.
> Tim? Guido?

I can't speak for Tim, but Guido would say "be my guest -- define whatever
character classes you want". I think he doesn't want to pee away time
arguing over e.g. whether \x7F is "normal" or not <wink>.

unicode-is-gonna-be-sooooo-much-fun-ly y'rs - tim
C's isprint() concept? [ In reply to ]
In article <000201bee78f$6bbf6ec0$f22d2399@tim>,
Tim Peters <tim_one@email.msn.com> wrote:
>[Aahz Maruch]
>>
>> Oh, sure. What isn't clear from the way you write this is whether
>> a run of multiple non-printable characters should be replaced with
>> a single space. Here's a regex that I created recently as part of
>> some code to detect binary documents:
>>
>> # Everything that isn't CR/LF or 7-bit normal characters
>> reBinary = re.compile ( r'[^\r\n\x20-\x7F]' )
>
>Take a look at string.translate.

Ah. Interesting. I'm using re.sub because we're still on 1.5.1, but
we're supposed to move to 1.5.2 Real Soon Now. How does
string.translate compare with the speed of re.findall? (I'm only
interested in the number of matches; I don't intend to *do* anything
with the matches.)
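
For counting only, the two approaches might look like this in modern Python 3 on a bytes object. This is a sketch, not a benchmark: NONTEXT, RE_BINARY and both function names are made up, and the only way to know which is faster on your data is to time them (for example with the timeit module).

```python
import re

# Every byte value that is not CR, LF, or in 0x20-0x7F (reBinary's class).
NONTEXT = bytes(b for b in range(256)
                if b not in (0x0D, 0x0A) and not 0x20 <= b <= 0x7F)
RE_BINARY = re.compile(rb'[^\r\n\x20-\x7F]')


def count_with_findall(data):
    """Count matches by building the full list of them."""
    return len(RE_BINARY.findall(data))


def count_with_translate(data):
    """Delete the unwanted bytes; the change in length is the count."""
    return len(data) - len(data.translate(None, NONTEXT))
```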
--
--- Aahz (@netcom.com)

C's isprint() concept? [ In reply to ]
Aahz Maruch wrote:

> Ah. Interesting. I'm using re.sub because we're still on 1.5.1, but
> we're supposed to move to 1.5.2 Real Soon Now.

Just out of curiosity, what is the biggest source of the delay in
switching to 1.5.2? Is it a matter of "if it ain't broke...", or do the
changes warrant enough work (rewriting code, updating libraries,
etc.) that it is a big job? I'm writing apps, and currently rely on
some 1.5.2 behaviors, so I'm curious as to how difficult it may be
for customers to upgrade if need be.

Chad Netzer
chad@vision.arc.nasa.gov
C's isprint() concept? [ In reply to ]
In article <37B8A265.8A9EAC69@vision.arc.nasa.gov>,
Chad Netzer <chad@vision.arc.nasa.gov> wrote:
>Aahz Maruch wrote:
>>
>> Ah. Interesting. I'm using re.sub because we're still on 1.5.1, but
>> we're supposed to move to 1.5.2 Real Soon Now.
>
>Just out of curiosity, what is the biggest source of the delay from
>switching to 1.5.2? Is it a matter of "if it ain't broke...", or do the
>changes warrant enough work (rewriting code, updating libraries,
>etc.) that it is a big job? I'm writing apps, and currently rely on
>some 1.5.2 behaviors, so I'm curious as to how difficult it may be
>for customers to upgrade if need be.

There were a couple of specific bugs in 1.5.1 that required us to create
a special Python package, and it will be a little tricky to back out
those changes (I wasn't involved, so don't ask for more details). Plus
we simply haven't had the time to do the necessary regression testing.

(We've got Zope, Medusa, mxODBC, and a slew of other bits; just the back
end currently has, um, about twelve running processes.)

Overall, if your customers are running unpatched Python distributions
and have not written workarounds for bugs in 1.5.1, upgrading should be
a snap.
--
--- Aahz (@netcom.com)

C's isprint() concept? [ In reply to ]
On Sun, 15 Aug 1999 22:31:18 -0400, "Tim Peters" <tim_one@email.msn.com> wrote:

>
>Take a look at string.translate.

And avoid it, since it isn't compatible with unicode or UTF-8
encodings :-(

John Max Skaller ph:61-2-96600850
mailto:skaller@maxtal.com.au 10/1 Toxteth Rd
http://www.maxtal.com.au/~skaller Glebe 2037 NSW AUSTRALIA
C's isprint() concept? [ In reply to ]
[Tim]
> Take a look at string.translate.

[John (Max) Skaller]
> And avoid it, since it isn't compatible with unicode or UTF-8
> encodings :-(

I guess Aahz had better avoid len() then too ...

something-that-views-the-world-as-a-raw-bytestream-can't-really-be-
incompatible-with-anything-ly y'rs - tim
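
To make the bytes-versus-characters point concrete, here is a small sketch in modern Python 3, where str and bytes are separate types. The sample string and the function name space_unprintable_unicode are assumptions for illustration; str.isprintable() is the built-in method that comes closest to isprint() for Unicode text.

```python
text = 'na\u00efve'          # 5 characters
raw = text.encode('utf-8')   # 6 bytes: the i-with-diaeresis takes two


def space_unprintable_unicode(text):
    """Replace each character that str.isprintable() rejects with a space."""
    return ''.join(ch if ch.isprintable() else ' ' for ch in text)
```

Here len(text) counts characters and len(raw) counts bytes, so the two disagree as soon as anything outside ASCII appears.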