Mailing List Archive: Counting boring characters (was RE: C's isprint() concept?)

[Tim]
> Take a look at string.translate.

[Aahz Maruch]
> Ah. Interesting. I'm using re.sub because we're still on 1.5.1, but
> we're supposed to move to 1.5.2 Real Soon Now.

string.translate has been there since the Beginning of Time -- which, I
believe, includes 1.5.1 <wink>.

> How does string.translate compare with the speed of re.findall?

The former is enormously faster, even if you (as I believe you did)
precompile the regexp. Partly because straight string ops have much less
overhead than firing up all the regexp machinery, and partly because much of
the re code is still written in Python (re.findall in particular is a Python
loop that repeatedly fires up the regexp machinery).
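
For anyone who wants to measure this on their own data, here is a rough sketch. It is written for present-day Python 3 rather than the 1.5-era string module discussed in this thread, so str.translate and str.maketrans stand in for string.translate. The names count_with_re, count_with_translate and time_both, and the sample text, are made up for illustration. No timings are quoted because they depend on the machine and the Python version.

```python
import re
import timeit

def count_with_re(s, pattern):
    # pattern is a precompiled regexp that matches one boring character.
    return len(pattern.findall(s))

def count_with_translate(s, delete_table):
    # delete_table maps each boring character to None, so translate drops it;
    # the count is how much shorter the string gets.
    return len(s) - len(s.translate(delete_table))

# Hypothetical sample data: whitespace plays the role of "boring" here.
boring = " \t\n"
sample = "the quick brown fox\tjumps over\nthe lazy dog " * 1000
pattern = re.compile("[" + re.escape(boring) + "]")
delete_table = str.maketrans("", "", boring)

def time_both(number=100):
    # Returns (regexp seconds, translate seconds) for `number` runs each.
    t_re = timeit.timeit(lambda: count_with_re(sample, pattern), number=number)
    t_tr = timeit.timeit(lambda: count_with_translate(sample, delete_table),
                         number=number)
    return t_re, t_tr
```

Calling time_both() returns the two elapsed times, so the comparison can be rerun on whatever interpreter is at hand.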

> (I'm only interested in the number of matches; I don't intend to
> *do* anything with the matches.)

One way to do it is to map all boring chars into one particular one, then
count the number of the latter:

import string

def count_uninteresting_1(s, boring):
    # Map every boring char onto the first one, then count that char.
    dummy = boring[0]
    boring_map = string.maketrans(boring, dummy * len(boring))
    return string.count(string.translate(s, boring_map), dummy)

A faster way is to *delete* the boring chars (via the 3-arg form of
translate), then see how much shorter the string gets:

_idmap = string.join(map(chr, range(256)), "")

def count_uninteresting_2(s, boring):
    return len(s) - len(string.translate(s, _idmap, boring))
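
For readers on Python 3, where these string-module functions no longer exist, the same two ideas can be sketched with bytes objects, which are the closest match to the 8-bit strings used above. This is an assumed port, not code from the thread: the names count_by_mapping and count_by_deleting are hypothetical, and the guard for an empty `boring` is an addition.

```python
def count_by_mapping(data, boring):
    # Map every boring byte onto the first one, then count that byte.
    if not boring:
        return 0  # nothing is boring; also avoids counting empty matches
    dummy = boring[:1]  # slicing keeps this a one-byte bytes object
    table = bytes.maketrans(boring, dummy * len(boring))
    return data.translate(table).count(dummy)

def count_by_deleting(data, boring):
    # Passing None as the table leaves bytes unchanged; the second
    # argument lists the bytes to delete. The count is the length lost.
    return len(data) - len(data.translate(None, boring))
```

Passing None as the table plays the part of _idmap here, so no 256-byte identity string needs to be built.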

regexps-are-to-speed-as-regexps-are-to-clarity-ly y'rs - tim