Mailing List Archive

Changing Python's string search algorithms
Fredrik Lundh crafted our current string search algorithms, and
they've served us very well. They're nearly always at least as fast as
the dumbest-possible brute force search, and sometimes much faster. This
was bought with some very cheap one-pass preprocessing of the pattern
(the substring to search _for_), to compute a single integer "skip",
and a single 32-bit bitmask, which can sometimes be used to skip over
areas where it can prove matches are doomed to fail.

Unlike most "fancier than brute" algorithms, it does not build
preprocessing tables of size proportional to the length of the
pattern, or to the size of the alphabet. The extra space is O(1), and
tiny, independent of pattern and alphabet size.
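
To make that concrete, here's a rough pure-Python rendition of that
style of search. It's an illustrative simplification with made-up
names; the real thing is the C code in Objects/stringlib/fastsearch.h,
and its details differ:

def fastsearch_sketch(s, p):
    # Illustrative simplification of the "skip + 32-bit bitmask" idea.
    n, m = len(s), len(p)
    if m == 0:
        return 0
    if m > n:
        return -1
    mlast = m - 1
    mask = 0          # degenerate Bloom filter: 32 bits, hashed by ord(c) & 31
    skip = mlast      # shift when the last character matches but the rest doesn't
    for i in range(mlast):
        mask |= 1 << (ord(p[i]) & 31)
        if p[i] == p[mlast]:
            skip = mlast - 1 - i
    mask |= 1 << (ord(p[mlast]) & 31)
    i = 0
    while i <= n - m:
        if s[i + mlast] == p[mlast] and s[i:i + mlast] == p[:mlast]:
            return i                    # full match
        nxt = s[i + m] if i + m < n else None
        if nxt is not None and not (mask >> (ord(nxt) & 31)) & 1:
            i += m + 1                  # next char provably not in the needle: leap past it
        elif s[i + mlast] == p[mlast]:
            i += skip + 1               # last char matched, body didn't
        else:
            i += 1
    return -1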

A weakness is that worst-case search time is proportional to the
product of the lengths of the inputs. This can be so for searches that
succeed as well as for those that fail. This was recently
dramatically illustrated by this bug report:

https://bugs.python.org/issue41972

where a single large string search consumed well over 3 hours of CPU
time before it succeeded. It "should take" a fraction of a second.

While it was news to me, Dennis Sweeney was aware that other
algorithms are now well known that require only O(1) extra space yet
run in worst-case linear time. That's really quite remarkable :-) They
do require fancier (but still linear-time) preprocessing, so they
can't always be a pure win.

So this message: string (which includes byte) search is an extremely
common operation, across all kinds of applications, so changes here
need as many eyeballs on them as they can get. What's typical? What's
important? What doesn't matter? What about reverse searches ("rfind")?
How can we quantify the acceptable region of tradeoffs?

My guess is that searches for substrings less than 50 characters long
in strings under ten thousand characters are most important for most
apps, but I just pulled that guess out of my butt. I'm also guessing
that searches for multi-million-character substrings (like in the bug
report above) are rare indeed, and that any change sparing them from
quadratic-time disaster would be "way more than good enough".

But I don't know. If you can help, Dennis already has a preliminary PR
implementing one of the newer algorithms, augmented with Fredrik's
unique ideas:

https://github.com/python/cpython/pull/22679

By "help" I don't necessarily mean code review - it could, e.g., be a
huge help to test-drive it on your own critical string-slinging apps.
Faster? Slower? A wash? Wrong answer? Crash?

Due to the natures of the old and new algorithms, neither will be
faster or slower in all cases; the newer one should never be
dramatically slower than the old one; the new one will be
spectacularly faster in some cases; and "in general" it seems
impossible to guess in advance. It depends on the value the fancier
preprocessing can extract from the pattern versus the higher
preprocessing cost, and that in turn depends on details of exactly
what the patterns and texts to be searched are.
Re: Changing Python's string search algorithms [ In reply to ]
On 10/13/20 9:54 AM, Tim Peters wrote:
> Due to the natures of the old and new algorithms, neither will be
> faster or slower in all cases; the newer one should never be
> dramatically slower than the old one; the new one will be
> spectacularly faster in some cases; and "in general" it seems
> impossible to guess in advance. It depends on the value the fancier
> preprocessing can extract from the pattern versus the higher
> preprocessing cost, and that in turn depends on details of exactly
> what the patterns and texts to be searched are.


My off-the-top-of-my-head reaction: I've always assumed that most string
searching is tiny.  Like, searching for a < 10 character string in a <
256 character string.  That's what I'm always doing, at least.  So while
I'd obviously like to see us address the pathological cases cited--and
thanks to Dennis Sweeney for the PRs!--I'd hope that it doesn't make
these small searches any slower.  Your email didn't give me a sense of
how much additional preprocessing these new algorithms require; the fact
that they're O(1) space suggests they aren't bad.  But if you do lots
and lots of dinky searches surely it would add up.

Looking at the PR, I see there's a special case for searching for a
single character, and then cases for forward-search and reverse-search. 
I was wondering if I'd find a heuristic like "if the string you're
searching in is < 2k, use the old search algorithm".  Is that worth
pursuing?  It doesn't seem like the maintenance cost would be that high;
I don't think anybody has touched that code in years.  (No surprise, the
Effbot did a good job.)


/arry
Re: Changing Python's string search algorithms [ In reply to ]
I'm sure that the large majority of the string searches I've done are in
Larry's tiny category.

However, I also think that big data needs are increasing, and things like
FASTA files can be enormously large texts that one has good reasons to
search on.

If there is a heuristic switch between algorithms, it seems like it should
threshold on needle, not on haystack. Or possibly some more complex
function of both. But if I understand the two-way approach correctly (which
I probably don't), there's not really much gain for small needle.

Re: Changing Python's string search algorithms [ In reply to ]
Rest assured that Dennis is aware that pragmatics may change for
shorter needles.

The code has always made a special-case of 1-character needles,
because it's impossible "even in theory" to improve over
straightforward brute force search then.

Say the length of the text to search is `t`, and the length of the
pattern `p`. Brute force and the current code have worst case O(t*p)
behavior. The new code, worst case O(t+p). If that's as deep as you
dig into it, it seems all but obvious then that O(t*p) just can't be
that bad when p is small, so keep it simple.

But there's another twist: the current and new code both have O(t/p)
cases too (but brute force, and even Knuth-Morris-Pratt, don't). That
can be highly beneficial even for p as small as 3.

Unfortunately, the exact cases in which the current and new code enjoy
O(t/p) behavior aren't the same.
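
To see where O(t/p) comes from, here's a toy, self-contained sketch.
It is not the current code and not the new code, just the flavor of
the trick: when the character one past the current window provably
can't appear anywhere in the needle, the search can hop len(needle)+1
positions at a time, so it only probes about t/p window positions.

def count_window_probes(text, needle):
    n, m = len(text), len(needle)
    chars = set(needle)
    probes, i = 0, 0
    while i <= n - m:
        probes += 1
        if text[i:i + m] == needle:
            return probes, i
        if i + m < n and text[i + m] not in chars:
            i += m + 1      # that character can't be part of any match window
        else:
            i += 1
    return probes, -1

print(count_window_probes("a" * 100_000, "xyz"))   # ~25_000 probes, not ~100_000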

Lying a bit: In general the current code has just two tricks. One of
those tricks is useless (pure waste of time) if the pattern happens to
end with a repeated character pair, and is most effective if the last
character appears nowhere else in the pattern. The other trick is most
effective if the set of characters in the text has no intersection
with the set of characters in the pattern (note that the search is
certain to fail then), but useless if the set of text characters is a
subset of the set of pattern characters (as, e.g., it very often is in
real-life searches in apps with a small alphabet, like [ACGT]+ for
genomes).

But I don't know how to characterize when the new code gets maximum
benefit. It's not based on intuitive tricks. The paper that
introduced it[1] says it's based on "a deep theorem on words
known as the Critical Factorization Theorem due to Cesari, Duval,
Vincent, and Lothaire", and I still don't fully understand why it
works.

It's clear enough, though, that the new code gets real systematic
value out of patterns with repetitions (like "abab"), where the
current code gets real value from those only by accident (e.g., "a"
and "b" happen to appear rarely in the text being searched, in which
case that the pattern has repetitions is irrelevant).

But, as I said in the first message, the PR is "preliminary". There
are still worlds of tweaks that have been thought of but not yet tried
even once.

[1] search for "Two-Way String Matching" by Crochemore and Perrin.
Re: Changing Python's string search algorithms [ In reply to ]
Maybe someone reading this can finish the Wikipedia page on Two-Way Search?
The code example trails off with a function with some incomprehensible
remarks and then a TODO...

--
--Guido van Rossum (python.org/~guido)
Pronouns: he/him (why is my pronoun here?)
<http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-change-the-world/>
Re: Changing Python's string search algorithms [ In reply to ]
[Guido]
> Maybe someone reading this can finish the Wikipedia page on
> Two-Way Search? The code example trails off with a function with
> some incomprehensible remarks and then a TODO..

Yes, the Wikipedia page is worse than useless in its current state,
although some of the references it lists are helpful. This is a much
better summary:

http://www-igm.univ-mlv.fr/~lecroq/string/node26.html#SECTION00260

but, I believe, still far too telegraphic.

The original paper is by far the best account I've seen so far, with
complete code and exhaustive explanations and proofs. Even examples
;-)

But here's the thing: I don't believe this algorithm _can_ be
reduced to an elevator pitch. Excruciating details appear to be
fundamental at every step, and I've seen nothing yet anywhere
approaching an "intuitive explanation" :-( What happens instead is
that you run it on the hardest cases you can dream up, and your jaw
drops in amazement :-)
Re: Changing Python's string search algorithms [ In reply to ]
On Wed, Oct 14, 2020 at 7:57 PM Tim Peters <tim.peters@gmail.com> wrote:
> [...]
> But here's the thing: I don't believe this algorithm _can_ be
> reduced to an elevator pitch. Excruciating details appear to be
> fundamental at every step, and I've seen nothing yet anywhere
> approaching an "intuitive explanation" :-( What happens instead is
> that you run it on the hardest cases you can dream up, and your jaw
> drops in amazement :-)

That sounds very interesting! I'll definitely be digging deeper into
the algorithm, papers and proofs of the underlying "Critical
Factorization" theorem. My goal would be to find a good way to explain
them, ideally some kind of interactive article. Ideally that would
also lead to better Wikipedia pages.

I'd be happy to help with this project. I've done quite a bit of
relevant work while developing my fuzzysearch[1] library.

- Tal Einat

[1] https://pypi.org/project/fuzzysearch/
Re: Changing Python's string search algorithms [ In reply to ]
Perhaps this is a silly suggestion, but could we offer this as an
external function in the stdlib rather than a string method?

Leave it up to the user to decide whether or not their data best suits
the find method or the new search function. It sounds like we can offer
some rough heuristics, but the only way to really know is "try it and
see which works best for you".

The `string` module is an obvious place for it.

--
Steve
Re: Changing Python's string search algorithms [ In reply to ]
On Wed, Oct 14, 2020 at 7:45 PM Steven D'Aprano <steve@pearwood.info> wrote:

> Perhaps this is a silly suggestion, but could we offer this as an
> external function in the stdlib rather than a string method?
>

That feels unworkable to me.

For one thing, the 'in' operator hits this same issue, doesn't it? But for
another, the example initially posted where the current code did or didn't
hit that quadratic breakdown was just one extra character in a long
string. I cannot imagine end users knowing in advance whether their exact
search will hit that unlucky circumstance that is very data dependent.

If it was a case of "in certain circumstances, this other function might be
twice as fast" I can see that being useful. But here we get a big-O
explosion that went from milliseconds to hours on ostensibly similar
operations.

That said, I do recognize that the `re` module also has pathological cases,
and the standard library documentation explicitly says "maybe you want to
try the third-party `regex` implementation." That is sort of precedent for
this approach.

--
The dead increasingly dominate and strangle both the living and the
not-yet born. Vampiric capital and undead corporate persons abuse
the lives and control the thoughts of homo faber. Ideas, once born,
become abortifacients against new conceptions.
Re: Changing Python's string search algorithms [ In reply to ]
[Steven D'Aprano <steve@pearwood.info>]
> Perhaps this is a silly suggestion, but could we offer this as an
> external function in the stdlib rather than a string method?
>
> Leave it up to the user to decide whether or not their data best suits
> the find method or the new search function. It sounds like we can offer
> some rough heuristics, but the only way to really know is "try it and
> see which works best for you".
>
> The `string` module is an obvious place for it.

I think this is premature. There is almost never an optimization
that's a pure win in all cases. For example, on some platforms
`timsort` will never be as fast as the old samplesort in cases with a
very large number of equal elements, and on all platforms `timsort`
consumes more memory. And it's a wash on "random" data on most
platforms. Nevertheless, it was a significant speed win for many -
possibly even most - real-life cases.

So far, the PR reduces the runtime in the bug report from about 3 1/2
hours to under a tenth of a second. It would be crazy not to gift
users such dramatic relief by default, unless there are good reasons
not to. Too soon to say. On tests of random strings with characters
following a Zipf distribution (more realistic than uniform but still
very easy to generate - and not contrived in any way to favor either
approach), the current new code is usually faster than the status quo,
in no case has been worse than twice as slow, and in a number of cases
has been 20x faster. It's already clearly a win _overall_.

The bulk of the "but it's slower" cases now are found with very short
needles (patterns), which was expected (read my comments on the bug
report), and exacerbated by the fact that the structure of the random test
generation is quite likely to create cases where a short needle is
found early in the haystack. This combination makes the higher
preprocessing overhead as damaging as possible. Dennis also expanded
the current code's "32-bit bitmask" trick (which is an extreme
simplification of Daniel Sunday's algorithm) to a fixed-size 256-byte
table of character-class-specific skip counts, which went a _very_
long way toward eliminating the cases where the current code enjoyed a
"pure luck" advantage over the new code. But that increased the
preprocessing overhead again, which again is relatively most
significant for short needles - those 256 bytes have to be initialized
every time, no matter the lengths of the needle or haystack.

If the new code were changed to exempt short needles, even now it
might be hard to make an objective case not to adopt it.

But it would leave open a different idea for an "opt-in" facility: one
that allowed users to give a needle to a function and get back an object
capturing the results of preprocessing. Then a different wrapper
around the search code that accepted the already-pre-processed info.
There are, e.g., certainly cases where repeated searches for a given
4-character needle could be sped significantly by exploiting the new
code, but where the overhead of preprocessing is too much to bear in
"random" performance testing. It would also open the door to more
aggressive (expensive) kinds of preprocessing. That's where users may
be able to make useful, informed judgments.
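
To make the shape of that opt-in idea concrete, here's a hypothetical
sketch. Nothing like this exists in the stdlib, and the toy
Sunday-style preprocessing is just a stand-in for whatever the real
preprocessing would cache:

class PreparedNeedle:
    def __init__(self, needle):
        self.needle = needle
        # Pay the (possibly expensive) preprocessing cost exactly once.
        # Here: a last-occurrence table; a real version would store the
        # two-way factorization, period, skip table, and so on.
        self.last = {ch: i for i, ch in enumerate(needle)}

def find_prepared(haystack, prep, start=0):
    # Search reusing the preprocessing captured in `prep`.
    needle, last, m = prep.needle, prep.last, len(prep.needle)
    i = start
    while i + m <= len(haystack):
        if haystack[i:i + m] == needle:
            return i
        if i + m < len(haystack):
            nxt = haystack[i + m]
            # Sunday-style shift: realign the rightmost occurrence of `nxt`,
            # or jump past it entirely if it's not in the needle at all.
            i += m - last[nxt] if nxt in last else m + 1
        else:
            i += 1
    return -1

prep = PreparedNeedle("ACGT")                                   # preprocess once ...
print([find_prepared("TTACGTACGT", prep, s) for s in (0, 3)])   # ... search many times -> [2, 6]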

[David Mertz]
> That said, I do recognize that the `re` module also has pathological cases,
> and the standard library documentation explicitly says "maybe you want to
> try the third-party `regex` implementation." That is sort of precedent for
> this approach.

`regex` also suffers exponential time disasters; they're just harder
to stumble into - and there are books explaining in detail how and why
regex engines can fall into these traps.

It's far less _expected_ in plain string searches, and Dennis was
aware of the new algorithms because, apparently, (at least) glibc's
memmem switched to one some years ago. So we're playing catch-up here.
Re: Changing Python's string search algorithms [ In reply to ]
On Thu, Oct 15, 2020 at 11:38 AM Tim Peters <tim.peters@gmail.com> wrote:
> I think this is premature. There is almost never an optimization
> that's a pure win in all cases. For example, on some platforms
> `timsort` will never be as fast as the old samplesort in cases with a
> very large number of equal elements, and on all platforms `timsort`
> consumes more memory. And it's a wash on "random" data on most
> platforms. Nevertheless, it was a significant speed win for many -
> possibly even most - real-life cases.

And timsort is one of my go-tos for teaching the concept of hybrid
sorting algorithms, because, at its heart, it's simple enough to
explain, and it manages to be so incredibly valuable in real-world
code. :)

> But it would leave open a different idea for an "opt-in" facility: one
> that allowed users to give a needle to a function and get back an object
> capturing the results of preprocessing. Then a different wrapper
> around the search code that accepted the already-pre-processed info.
> There are, e.g., certainly cases where repeated searches for a given
> 4-character needle could be sped significantly by exploiting the new
> code, but where the overhead of preprocessing is too much to bear in
> "random" performance testing. It would also open the door to more
> aggressive (expensive) kinds of preprocessing. That's where users may
> be able to make useful, informed judgments.

Kinda like the way a compiled regex is used? I like this idea. So it'd
be heuristics in the core language that choose a good default for most
situations, and then a str method that returns a preprocessed needle.
In a Python implementation that doesn't want to use two different
algorithms, that preprocessor could return the string unchanged, but
otherwise it's an opaque object usable only in string searches.

+1, if my interpretation of your description is correct.

ChrisA
Re: Changing Python's string search algorithms [ In reply to ]
On Wed, Oct 14, 2020 at 9:56 AM Tim Peters <tim.peters@gmail.com> wrote:

> [Guido]
> > Maybe someone reading this can finish the Wikipedia page on
> > Two-Way Search? The code example trails off with a function with
> > some incomprehensible remarks and then a TODO..
>
> Yes, the Wikipedia page is worse than useless in its current state,
> although some of the references it lists are helpful. This is a much
> better summary:
>
> http://www-igm.univ-mlv.fr/~lecroq/string/node26.html#SECTION00260
>
> but, I believe, still far too telegraphic.
>

The key seems to be:

"""
The searching phase of the Two Way algorithm consists in first comparing
the character of xr from left to right, then the character of x from right
to left.
When a mismatch occurs when scanning the k-th character of xr, then a shift
of length k is performed.
When a mismatch occurs when scanning x, or when an occurrence of the
pattern is found, then a shift of length per(x) is performed.
Such a scheme leads to a quadratic worst case algorithm, this can be
avoided by a prefix memorization: when a shift of length per(x) is
performed the length of the matching prefix of the pattern at the beginning
of the window (namely m-per(x)) after the shift is memorized to avoid to
scan it again during the next attempt.
"""

The preprocessing comes down to splitting the original search string
("needle") in two parts, xl and xr, using some clever algorithm (IIUC the
wikipedia page does describe this, although my brain is currently too
addled to visualize it).

> The original paper is by far the best account I've seen so far, with
> complete code and exhaustive explanations and proofs. Even examples
> ;-)
>

I need a Python version though.


> But here's the thing: I don't believe this algorithm _can_ be
> reduced to an elevator pitch. Excruciating details appear to be
> fundamental at every step, and I've seen nothing yet anywhere
> approaching an "intuitive explanation" :-( What happens instead is
> that you run it on the hardest cases you can dream up, and your jaw
> drops in amazement :-)
>

I am not able to dream up any hard cases -- like other posters, my own use
of substring search is usually looking for a short string in a relatively
short piece of text. I doubt even the current optimizations matter to my
uses. What are typical hard cases used for? DNA search? (That would be
cool!)

--
--Guido van Rossum (python.org/~guido)
Pronouns: he/him (why is my pronoun here?)
<http://feministing.com/2015/02/03/how-using-they-as-a-singular-pronoun-can-change-the-world/>
Re: Changing Python's string search algorithms [ In reply to ]
On Wed., Oct. 14, 2020, 17:37 Tim Peters, <tim.peters@gmail.com> wrote:

> [...]
> If the new code were changed to exempt short needles, even now it
> might be hard to make an objective case not to adopt it.
>

This is where it sounds like we might want to go. If we can figure out a
reasonable, cheap heuristic to define "too short" -- for either the search
string or the string to search -- for deciding when to use the fancier
algorithm, then I would support adding the fancy string search.

-Brett


Re: Changing Python's string search algorithms [ In reply to ]
[Guido]
> The key seems to be:

Except none of that quoted text (which I'll skip repeating) gives the
slightest clue as to _why_ it may be an improvement. So you split the
needle into two pieces. So what? What's the _point_? Why would
someone even imagine that might help?

Why is one half then searched right to left, but the other half left
to right? There's actually "a reason" for searching the right half
left to right, but because the shift on a mismatch in the left half is
a constant ("per(x)") independent of the mismatching position, it's
actually irrelevant in which order the left half's characters are
compared (well, at least in some variations of these newer
algorithms). Over-specification can be actively misleading too.

What's the key idea? "Split the needle into two pieces" is a key
_step_, not a key _insight_.


> I need a Python version though.

Dennis wrote one for his first stab, which you can download from the
bug report (it's attached there as fastsearch.py).

On the bug report's data, running that under CPython 3.9.0 on my box
reduced the time from about 3 1/2 hours to about 20 seconds. Running
it under PyPy, to under one second (and the current C implementation
under 0.1 second).

> I am not able to dream up any hard cases -- like other posters, my own
> use of substring search is usually looking for a short string in a relatively
> short piece of text. I doubt even the current optimizations matter to my uses.

They probably do, but not dramatically. Would you, e.g., notice a
factor of 2 either way in those cases? A factor of 4? Of 10? When
Fredrik first wrote this code, he routinely saw factors of 2 to 10
speedup on all sorts of string-searching benchmarks, both contrived
and "natural". The speedups were especially pronounced for Unicode
strings at the time, where skipping futile Unicode character
comparisons could be more valuable than when skipping (shorter)
single-byte comparisons.

Should also note that the fixed string searching code is also used as
a subroutine by parts of our regexp implementation, by str.count(),
str.replace(), similar `bytes` operations, and so on.

> What are typical hard cases used for?

It's kinda like asking what typical hard rounding cases for pow() are
used for ;-) They aren't deliberate. They "just happen", typically
when a pattern and the text to search contain self-similarities "but
with glitches". Search the string

text = "X" * 10_000_000

for

needle = "X" * 1_000_000

Easy! It matches at once. But now tack "Y" on to the end of the
needle. Then it's impossible to match. Brute force search first
finds a prefix match at text[: 1000000] but then fails to match the
trailing "Y". So brute force uses another million compares to match
the prefix at text[1 : 1000001]. But again fails to match the trailing
"Y". Then another million plus 1 compares to fail to match starting
at index 2, and again at index 3, and again at index 4, ... it
approaches a trillion comparisons before it finally fails.

The current code starts comparing at the right end of each trial
position first. Then an "X" from the text and the needle's trailing
"Y" mismatch at once. That's a million times faster.

Here's a simple case where the current code is horribly slow, because
neither "right-to-left" nor any of its preprocessing tricks do any
good at all. Even Daniel Sunday's full-blown algorithm[1] (which the
current code strips down _almost_ to the point of non-existence)
wouldn't help (unless it used a variant I've never actually seen that
compared the rarest pattern character first):

("X" * 10_000_000).find("X" * 500_000 + "Y" + "X" * 500_000)

The newer code returns -1 seemingly instantly.

> DNA search? (That would be cool!)

It can certainly help. Self-similarities are bound to be common in
long strings from a 4-character alphabet (ACGT). But, to be fair,
serious work with genomes typically invests in building a giant suffix
tree (or enhanced suffix array) for each cataloged sequence to be
searched. No preprocessing of needles is needed then to guarantee
worst-case efficient searches of many kinds (including things like
finding the longest adjacent repeated subsequences, more like a regexp
(.+)\1 kind of search).

But the new code could quite plausibly speed more casual prototyping
of such explorations.

[1] https://dl.acm.org/doi/abs/10.1145/79173.79184
Re: Changing Python's string search algorithms [ In reply to ]
On 15/10/20 1:45 pm, Chris Angelico wrote:
> So it'd
> be heuristics in the core language that choose a good default for most
> situations, and then a str method that returns a preprocessed needle.

Or maybe cache the results of the preprocessing?

--
Greg
Re: Changing Python's string search algorithms [ In reply to ]
Here's my attempt at some heuristic motivation:

Try to construct a needle that will perform as poorly as possible when
using the naive two-nested-for-loops algorithm. You'll find that if
there isn't some sort of vague periodicity in your needle, then you
won't ever get *that* unlucky; each particular alignment will fail
early, and if it doesn't then some future alignment would be pigeonholed
to fail early.

So Crochemore and Perrin's algorithm explicitly handles this "worst case"
of periodic strings. Once we've identified in the haystack some period
from the needle, there's no need to re-match it. We can keep a memory
of how many periods we currently remember matching up, and never re-match
them. This is what gives the O(n) behavior for periodic strings.

But wait! There are some bad needles that aren't quite periodic.
For instance:

>>> 'ABCABCAABCABC' in 'ABC'*1_000_000

The key insight though is that the worst strings are still
"periodic enough", and if we have two different patterns going on,
then we can intentionally split them apart. For example,
`"xyxyxyxyabcabc" --> "xyxyxyxy" + "abcabc"`. I believe the goal is to
line it up so that if the right half matches but not the left then we
can be sure to skip somewhat far ahead. This might not correspond
exactly with splitting up two patterns. This is glossing over some
details that I'm admittedly still a little hazy on as well, but
hopefully that gives at least a nudge of intuition.
Re: Changing Python's string search algorithms [ In reply to ]
[Dennis Sweeney <sweeney.dennis650@gmail.com>]
> Here's my attempt at some heuristic motivation:

Thanks, Dennis! It helps. One gloss:

> ....
> The key insight though is that the worst strings are still
> "periodic enough", and if we have two different patterns going on,
> then we can intentionally split them apart.

The amazing (to me) thing is that splitting into JUST two parts is
always enough to guarantee linearity. What if there are a million
different patterns going on? Doesn't matter! I assume this
remarkable outcome is a consequence of the Critical Factorization
Theorem.
Re: Changing Python's string search algorithms [ In reply to ]
[Guido]
> I am not able to dream up any hard cases -- like other posters,
> my own use of substring search is usually looking for a short
> string in a relatively short piece of text. I doubt even the current
> optimizations matter to my uses.

I should have responded to this part differently. What I said was
fine ;-) , but it's a mistake to focus exclusively on pathologies
here. What Fredrik did was with the aim of significantly speeding
utterly ordinary searches. For example, search for "xyz" in
"abcdefgh...xyz".

Brute force starts comparing at the first location:

abcdefgh...
xyz

The current code compares "c" with "z" first. They don't match. Now
what? It looks at the _next_ character in the haystack, "d". Thanks
to preprocessing the needle, it knows that "d" appears nowhere in the
needle (actually, the code is so keen on "tiny constant extra space"
that preprocessing only saves enough info to get a _probabilistic_
guess about whether "d" is in the needle, but one that's always
correct when "not in the needle" is its guess).

For that reason, there's no possible match when starting the attempt
at _any_ position that leaves "d" overlapping with the needle. So it
can immediately skip even trying starting at text indices 1, 2, or 3.
It can jump immediately to try next at index 4:

abcdefgh...
0123xyz

Then the same kind of thing, seeing that "g" and "z" don't match, and
that "h" isn't in the needle. And so on, jumping 4 positions at a
time until finally hitting "xyz" at the end of the haystack.

The new code gets similar kinds of benefits in _many_ similarly
ordinary searches too, but they're harder to explain because they're
not based on intuitive tricks explicitly designed to speed ordinary
cases. They're more happy consequences of making pathological cases
impossible.

From randomized tests so far, it's already clear that the new code is
finding more of this nature to exploit in non-pathological cases than
the current code. Although that's partly (but only partly) a
consequence of Dennis augmenting the new algorithm with a more
powerful version of the specific trick explained above (which is an
extreme simplification of Daniel Sunday's algorithm, which was also
aimed at speeding ordinary searches).
Re: Changing Python's string search algorithms [ In reply to ]
Excuse me if I intrude on an algorithm that I have not understood, but
can the new optimization be applied to regexps too?
Re: Changing Python's string search algorithms [ In reply to ]
[Marco Sulla]
> Excuse me if I intrude on an algorithm that I have not understood, but
> can the new optimization be applied to regexps too?

The algorithm is limited to searching for fixed strings.

However, _part_ of our regexp implementation (the bit that looks ahead
for a fixed string) will inherit it by magic.
Re: Changing Python's string search algorithms [ In reply to ]
I don't plan on making a series of these posts, just this one, to give
people _some_ insight into why the new algorithm gets systematic
benefits the current algorithm can't. It splits the needle into two
pieces, u and v, very carefully selected by subtle linear-time needle
preprocessing (and it's quite a chore to prove that a suitable
splitting always exists!):

needle == u + v

There are a whole bunch of things that can be said about this
splitting (and must be said to prove worst-case linear time), but
here's just one for which it's easy to show the benefit: no
(non-empty) prefix of v is a suffix of u.

Now I think it's quite easy to twist your head into knots trying to
see what follows from that, so I'll just give a highly generalizable
concrete example. Suppose v starts with "xyza". Then u cannot end
with "x", "xy", "xyz", or "xyza". If u did, then u would have a suffix
that's also a prefix of v.

A match attempt at a given haystack position starts by matching the
characters in v one at a time, left to right. Here with haystack on
the first line, needle on the second, and "0" denoting "don't care /
unknown / doesn't exist" - it's just _some_ glyph so the example has
some chance of lining up under annoying proportional fonts :-(

Say the first character matches:

00000x000000
0uuuuxyzavvv

And the second and third:

00000xyz0000
0uuuuxyzavvv

But the 4th doesn't:

00000xyzZ000
0uuuuxyzavvv

Is there any possibility of a match if we shift the needle one to the
right? No! If we tried, the last character of u would line up with the
"x" in the haystack, and we already know "x" is not a suffix of u.

How about if we shifted two to the right? No again! And for the same
reason: then the last two characters of u would line up with "xy" in
the haystack, but we already know "xy" is not a suffix of u.

And so on. In general, if we find a mismatch while trying the k'th
(1-based) character of v, we can shift the needle k places to the
right before there's any possibility of matching. In this case, k=4,
so we can jump to:

00000xyzA0000000
00000uuuuxyzavvv

Note that no "extra" storage is needed to exploit this. No character
lookups, no extra expenses in time or space of any kind. Just "if we
mismatch on the k'th try, we can jump ahead k positions". Nothing
special about "xyza" either - this works with any characters at all!
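
Here's a small runnable sketch of just that shift rule. It implements
only what the argument above actually justifies, capping the jump at
len(u) + 1; the real Crochemore-Perrin code also gets a big,
period-based shift out of mismatches in u, where this sketch just
falls back to shifting by 1:

def find_with_split(haystack, u, v):
    # Assumes no non-empty prefix of v is a suffix of u.
    needle = u + v
    n, m, lu = len(haystack), len(needle), len(u)
    i = 0
    while i <= n - m:
        # Scan v left to right.
        k = 0
        while k < len(v) and haystack[i + lu + k] == v[k]:
            k += 1
        if k < len(v):
            # Mismatch at the (k+1)'th character of v: shifts of
            # 1 .. min(k, len(u)) are impossible, so jump ahead.
            i += min(k + 1, lu + 1)
            continue
        # v matched in full; now check u (scan order doesn't matter here).
        if haystack[i:i + lu] == u:
            return i
        i += 1      # the real algorithm shifts by the period instead
    return -1

print(find_with_split("000abcdxyz000", "abcd", "xyz"))   # -> 3

(The example split "abcd" + "xyz" satisfies the condition: no
non-empty prefix of "xyz" is a suffix of "abcd".)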

"xyza" isn't by any stretch a hard case regardless. But it goes
_some_ way toward hinting at why potentially hard cases are rendered
toothless too. For example, the needle "x" * 100 is split into

u = "" # yup, empty!
v = needle

Now, e.g., if we find a mismatch at the 83'rd character of v, we can
leap ahead 83 positions immediately.

What about needle "x" * 100 + "y"? I picked that to crush your
premature optimism ;-) Now the splitting is

u = "x" * 100
v = "y"

and the wonderful trick above only applies to a trivial 1-character
string. The "hard part" is all in matching the u piece now, which I'll
leave as an exercise for the reader ;-)
Re: Changing Python's string search algorithms [ In reply to ]
On Sat, Oct 17, 2020 at 12:30 PM Tim Peters <tim.peters@gmail.com> wrote:
>
> I don't plan on making a series of these posts, just this one, to give
> people _some_ insight into why the new algorithm gets systematic
> benefits the current algorithm can't. It splits the needle into two
> pieces, u and v, very carefully selected by subtle linear-time needle
> preprocessing (and it's quite a chore to prove that a suitable
> splitting always exists!):
>
> needle == u + v
>
> [... more details...]

Thank you, great explanation. Can this be added to the source code
if/when this algorithm gets implemented? I've learned all kinds of
things from the huge block comments about timsort, list growth, and
the like. They make great reading when you're trying to figure out how
to explain things (or if you're a bored nerd browsing for curiosity's
sake - that works too!).

ChrisA
Re: Changing Python's string search algorithms [ In reply to ]
[Tim Peters, explaining one of the new algorithm's surprisingly
effective moving parts]

[Chris Angelico <rosuav@gmail.com>]
> Thank you, great explanation. Can this be added to the source code
> if/when this algorithm gets implemented?

No ;-) While I enjoy trying to make hard things clear(er), I need to
understand them inside out myself first, and I'm still not there with
this algorithm.

Even of what I do grok now, I cheated some to make that post simple.
For example, if you read it carefully, you'll realize that the
explanation I gave doesn't actually explain why the "x" * 100 example
is correct: the explanation implicitly relied on the left half of
the splitting (u) being at least as long as the right half (v). But
that's not always the case (and, in particular, for "x" * 100, u is
empty!).

Tal Einat posted earlier that he was keen to try to work up a clear
explanation, and I look forward to that. All the expositions I've
found of this algorithm so far are hard to approach. The original
paper is still the best I've seen. Everything I wrote was but a gloss
on its Lemma 3.2.

> I've learned all kinds of things from the huge block comments
> about timsort, list growth, and the like. They make great reading
> when you're trying to figure out how to explain things (or if
> you're a bored nerd browsing for curiosity's sake - that works too!).

Takes one to know one ;-) There's beauty and elegance in this
algorithm, but they'll go unloved without explanations clear enough
for the non-obsessed to follow.
Re: Changing Python's string search algorithms [ In reply to ]
On Fri, 16 Oct 2020 20:24:22 -0500
Tim Peters <tim.peters@gmail.com> wrote:
>
> Note that no "extra" storage is needed to exploit this. No character
> lookups, no extra expenses in time or space of any kind. Just "if we
> mismatch on the k'th try, we can jump ahead k positions".

Ok, so that means that on an N-character haystack, it'll always do at
least N character comparisons?

IIRC, the current algorithm is able to do much less than that in
favorable cases. If the needle is k characters long and they are all
distinct, it's able to do N/k comparisons.

Regards

Antoine.

Re: Changing Python's string search algorithms [ In reply to ]
[Tim]
>> Note that no "extra" storage is needed to exploit this. No character
>> lookups, no extra expenses in time or space of any kind. Just "if we
>> mismatch on the k'th try, we can jump ahead k positions".

[Antoine Pitrou <solipsis@pitrou.net>]
> Ok, so that means that on an N-character haystack, it'll always do at
> least N character comparisons?
>
> IIRC, the current algorithm is able to do much less than that in
> favorable cases. If the needle is k characters long and they are all
> distinct, it's able to do N/k comparisons.

It's an excellent question, and from what I said it's a correct inference.

But I only described how the algorithm matches the "v" part of the
"needle = u + v" splitting. When matching the "u" part, skips
materially exceeding the count of comparisons made are possible.

The paper claims the minimal number of comparisons needed (ignoring
preprocessing) is 2N/k, so same order as the current algorithm. But,
for the constant factor, Dennis's code achieves N/k, because he
augmented the Crochemore-Perrin algorithm with a variant of Sunday's
algorithm (a simplification, but not as extreme as the current code's
simplification). Note that N/k is a known lower bound on the best
case - impossible to do materially better.

In the 'x' * 100 + 'y' example I ended my msg with, recall that v was
just "y", so the part I described was of minimum value - if the last
haystack character in the current search window isn't "y", the fact that
the mismatch occurred on the first try leads to a shift of just 1.

But if the character one beyond the end of the current search window
isn't "x" or "y", then Sunday's algorithm allows skipping 102 positions
(len(needle) + 1). His algorithm requires precomputing a skip-count
table of O(len(sigma)) space, where sigma is the universe of possible
characters ("the alphabet"). That's "a whole lot" for Unicode.

Fredrik and Dennis simplified Sunday's algorithm to weaker versions
that require relatively small constant space instead, independent of
needle and alphabet size.

Despite the current code's simplification of that nearly to the point
of non-existence (it reduces the space needed to 32 bits, and with
only one possible skip count), it's nevertheless so effective that it
beat Dennis's initially pure Crochemore-Perrin code by dramatic margins
in a significant number of exceptionally lucky (for the current code)
cases. After adding a variation of Sunday (the current PR uses 256
bytes, and has 64K possible skip counts - but perhaps other tradeoffs
would work better), we haven't yet seen a case (contrived or natural)
where the current code is dramatically faster.
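
For flavor, here's a rough sketch of a bounded Sunday-style table along
those lines: 256 entries keyed on a hash of the character, so the space
stays fixed no matter how large the alphabet is. This is only an
illustration of the idea, not the PR's actual table layout or code:

def build_skip_table(needle):
    m = len(needle)
    table = [m + 1] * 256           # nothing in this character class: jump past it
    for j, ch in enumerate(needle):
        slot = ord(ch) & 0xFF
        # Keep the smallest (safest) shift for all characters hashing here.
        table[slot] = min(table[slot], m - j)
    return table

def sunday_search(haystack, needle):
    n, m = len(haystack), len(needle)
    if m == 0:
        return 0
    table = build_skip_table(needle)
    i = 0
    while i <= n - m:
        if haystack[i:i + m] == needle:
            return i
        if i + m >= n:
            break
        # Shift based on the character just past the window.
        i += table[ord(haystack[i + m]) & 0xFF]
    return -1

print(sunday_search("some haystack with a needle in it", "needle"))   # -> 21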

But we have seen non-contrived cases where the current PR is
dramatically faster, including in the benchmark code Fredrik gifted us
with (Tools/stringbench/stringbench.py in your Python tree). Alas,
the higher preprocessing costs leave the current PR slower in "too
many" cases too, especially when the needle is short and found early
in the haystack. Then any preprocessing cost approaches a pure waste
of time.