Mailing List Archive

Re: svn commit: r1902572 - /httpd/httpd/trunk/server/util_pcre.c
On Fri, Jul 8, 2022 at 5:07 PM <ivan@apache.org> wrote:
>
> Author: ivan
> Date: Fri Jul 8 15:07:00 2022
> New Revision: 1902572
>
> URL: http://svn.apache.org/viewvc?rev=1902572&view=rev
> Log:
> Rewrite ap_regexec() without a thread-local storage context for allocations.
>
> Provide custom malloc() and free() implementations that use a stack buffer
> for first N bytes and then fall back to an ordinary malloc/free().
>
> The key properties of this approach are:
>
> 1) Allocations with PCRE2 happen the same way as they were happening
> with PCRE1 in httpd 2.4.52 and earlier.
>
> 2) There are no malloc()/free() calls for typical cases where the
> match data can be kept on stack.
>
> 3) The patch avoids a malloc() for the match_data structure itself,
> because the match data is allocated with the provided custom malloc()
> function.
>
> 4) Using custom allocation functions should ensure that PCRE is not
> going to use malloc() for any auxiliary allocations, if they are
> necessary.
>
> 5) There is no per-thread state.
>
> References:
> 1) https://lists.apache.org/thread/l6m7dqjkk0yy3tooyd2so0rb20jmtpwd
> 2) https://lists.apache.org/thread/5k9y264whn4f1ll35tvl2164dz0wphvy
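
(For illustration, the scheme described in the log amounts to something
like the following sketch; the names and structure are illustrative,
not the actual util_pcre.c code:)

[[[
#include <stdlib.h>

typedef struct {
    char  *buf;   /* caller-provided stack buffer */
    size_t size;  /* total size of the buffer */
    size_t used;  /* bytes already handed out */
} stack_alloc_t;

/* Shaped like the callbacks pcre2_general_context_create() expects:
 * serve allocations from the stack buffer first, fall back to malloc(). */
static void *private_malloc(size_t size, void *ctx)
{
    stack_alloc_t *sa = ctx;
    size = (size + 7) & ~(size_t)7;     /* keep 8-byte alignment */
    if (sa->size - sa->used >= size) {
        void *p = sa->buf + sa->used;
        sa->used += size;
        return p;                       /* no malloc() in the common case */
    }
    return malloc(size);                /* fallback beyond the first N bytes */
}

static void private_free(void *ptr, void *ctx)
{
    stack_alloc_t *sa = ctx;
    /* Blocks inside the stack buffer are not individually freed. */
    if ((char *)ptr < sa->buf || (char *)ptr >= sa->buf + sa->size)
        free(ptr);
}
]]]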

The main downside is:

> +#ifndef AP_PCRE_STACKBUF_SIZE
> +#define AP_PCRE_STACKBUF_SIZE (256)
> #endif
[]
> AP_DECLARE(int) ap_regexec(const ap_regex_t *preg, const char *string,
> apr_size_t nmatch, ap_regmatch_t *pmatch,
> int eflags)
> - int small_vector[POSIX_MALLOC_THRESHOLD * 3];
> + /* Use apr_uint64_t to get proper alignment. */
> + apr_uint64_t stack_buf[(AP_PCRE_STACKBUF_SIZE + sizeof(apr_uint64_t) - 1) / sizeof(apr_uint64_t)];

So, if I count correctly, that's >2KB stack usage (regardless of PCRE1
or PCRE2) compared to the 120B we had previously (for PCRE1 only)?
This stack size allows for more than 200 captures for PCRE1 (way too
much), so how many captures/complexity/margin do these 2KB represent
for PCRE2 and how much can it be reduced to have PCRE2 not malloc()ate
for the usual (<10 captures) regexes?
(Remember that ap_regexec() can be called anywhere/deep in the call
stack, e.g. mod_rewrite, so 2KB stack is not reasonable IMHO).

Finally, I don't see how it's better than one allocation (or two) for
the whole lifetime of a thread, given that thread_local is available
on all the platforms where performance matters, nor do I see what
your grievances against thread_local are..


Regards;
Yann.
Re: svn commit: r1902572 - /httpd/httpd/trunk/server/util_pcre.c
On Fri, 8 Jul 2022 at 19:58, Yann Ylavic <ylavic.dev@gmail.com> wrote:
>
> On Fri, Jul 8, 2022 at 5:07 PM <ivan@apache.org> wrote:
> >
> > Author: ivan
> > Date: Fri Jul 8 15:07:00 2022
> > New Revision: 1902572
> >
> > URL: http://svn.apache.org/viewvc?rev=1902572&view=rev
> > Log:
> > Rewrite ap_regexec() without a thread-local storage context for allocations.
> >
> > Provide custom malloc() and free() implementations that use a stack buffer
> > for first N bytes and then fall back to an ordinary malloc/free().
> >
> > The key properties of this approach are:
> >
> > 1) Allocations with PCRE2 happen the same way as they were happening
> > with PCRE1 in httpd 2.4.52 and earlier.
> >
> > 2) There are no malloc()/free() calls for typical cases where the
> > match data can be kept on stack.
> >
> > 3) The patch avoids a malloc() for the match_data structure itself,
> > because the match data is allocated with the provided custom malloc()
> > function.
> >
> > 4) Using custom allocation functions should ensure that PCRE is not
> > going to use malloc() for any auxiliary allocations, if they are
> > necessary.
> >
> > 5) There is no per-thread state.
> >
> > References:
> > 1) https://lists.apache.org/thread/l6m7dqjkk0yy3tooyd2so0rb20jmtpwd
> > 2) https://lists.apache.org/thread/5k9y264whn4f1ll35tvl2164dz0wphvy
>
> The main downside is:
>
> > +#ifndef AP_PCRE_STACKBUF_SIZE
> > +#define AP_PCRE_STACKBUF_SIZE (256)
> > #endif
> []
> > AP_DECLARE(int) ap_regexec(const ap_regex_t *preg, const char *string,
> > apr_size_t nmatch, ap_regmatch_t *pmatch,
> > int eflags)
> > - int small_vector[POSIX_MALLOC_THRESHOLD * 3];
> > + /* Use apr_uint64_t to get proper alignment. */
> > + apr_uint64_t stack_buf[(AP_PCRE_STACKBUF_SIZE + sizeof(apr_uint64_t) - 1) / sizeof(apr_uint64_t)];
>
> So, if I count correctly, that's >2KB stack usage (regardless of PCRE1
> or PCRE2) compared to the 120B we had previously (for PCRE1 only)?
AP_PCRE_STACKBUF_SIZE is in bytes and defaults to 256 bytes, so the
code allocates exactly 256 bytes on the stack.
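
For example, with the default of 256 and sizeof(apr_uint64_t) == 8, the
declaration above works out to a 32-element array:

[[[
apr_uint64_t stack_buf[(256 + 8 - 1) / 8];  /* 32 elements * 8 bytes = 256 bytes */
]]]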

> This stack size allows for more than 200 captures for PCRE1 (way too
> much), so how many captures/complexity/margin do these 2KB represent
> for PCRE2 and how much can it be reduced to have PCRE2 not malloc()ate
> for the usual (<10 captures) regexes?
> (Remember that ap_regexec() can be called anywhere/deep in the call
> stack, e.g. mod_rewrite, so 2KB stack is not reasonable IMHO).
>

> Finally, I don't see how it's better than one allocation (or two) for
> the whole lifetime of a thread, given that thread_local is available
> on all the platforms where performance matters, nor do I see what
> your grievances against thread_local are..
>
1. TLS storage grows to the maximum capture count, so it requires
more than one or two allocations.
2. TLS storage doesn't support a dynamic MPM that uses native threads.
3. Access to TLS storage requires a hash lookup.
4. The Windows build is currently broken due to the TLS storage API changes.
5. Using TLS reserves the memory even if PCRE is not being used
afterwards. This can be thought of as implicitly increasing the
per-thread stack size.
6. With the new code there will be no actual allocations in real-world cases.



--
Ivan Zhakov
Re: svn commit: r1902572 - /httpd/httpd/trunk/server/util_pcre.c
On Fri, Jul 8, 2022 at 7:15 PM Ivan Zhakov <ivan@visualsvn.com> wrote:
>
> On Fri, 8 Jul 2022 at 19:58, Yann Ylavic <ylavic.dev@gmail.com> wrote:
> >
> > The main downside is:
> >
> > > +#ifndef AP_PCRE_STACKBUF_SIZE
> > > +#define AP_PCRE_STACKBUF_SIZE (256)
> > > #endif
> > []
> > > AP_DECLARE(int) ap_regexec(const ap_regex_t *preg, const char *string,
> > > apr_size_t nmatch, ap_regmatch_t *pmatch,
> > > int eflags)
> > > - int small_vector[POSIX_MALLOC_THRESHOLD * 3];
> > > + /* Use apr_uint64_t to get proper alignment. */
> > > + apr_uint64_t stack_buf[(AP_PCRE_STACKBUF_SIZE + sizeof(apr_uint64_t) - 1) / sizeof(apr_uint64_t)];
> >
> > So, if I count correctly, that's >2KB stack usage (regardless of PCRE1
> > or PCRE2) compared to the 120B we had previously (for PCRE1 only)?
> >
> AP_PCRE_STACKBUF_SIZE is in bytes and defaults to 256 bytes, so the
> code allocates exactly 256 bytes on the stack.

Argh, so I can't count either.

Let me explain (maybe), I thought that PCRE2's match_data were not
only containing the captures vector but also all the frames needed for
matching not-too-complex regexes (at least the initial ones since they
will be reallocated on need). So the (supposed) 2KB looked necessary
to me for a few stack frames.
But the actual 256 bytes only made me look at the PCRE2 code again
(better this time) for how that can possibly work, and guess what: the
initial frames are allocated on the stack ([1]) still in PCRE2, and
that's 20KiB ([2]) of stack space for each call to pcre2_match().

On Alpine Linux for instance (quite used in docker envs) there is only
128KiB stack per thread, so I've always configured
--no-recurse-on-stack with PCRE1 and used per-thread (and its pool)
vector cache, with custom pcre_malloc/free() to avoid performance
overhead and be able to run more complex regexes without exploding,
and I can't do that now with PCRE2 and without requiring 20KB of stack
:/
So I created a pull request ([3]).

> >
> > Finally, I don't see how it's better than one allocation (or two) for
> > the whole lifetime of a thread, given that thread_local is available
> > on all the platforms where performance matters, nor do I see what
> > your grievances against thread_local are..
> >
> 1. TLS storage grows to the maximum capture count, so it requires
> more than one or two allocations.

For the lifetime of a thread and the number of ap_regexec() it may
call, that's all negligible.

> 2. TLS storage doesn't support a dynamic MPM that uses native threads.

httpd-2.4 (and APR 1.8 unless reverted) provide a way to apr-ize a
native thread so that ap_thread_current() is available.

> 3. Access to TLS storage requires a hash lookup.

Hmm no, util_pcre uses no hash lookup, for instance.
If one wants to handle TLS data with apr_thread_data_get(&data,
ap_thread_current()) then it will be stored in an apr_hash_t, but
there's nothing implicit. And by the way, using the thread_data
hashtable won't be less efficient than registering as many native
thread-data entries (e.g. pthread_key_t); I already mentioned that in
another thread.

> 4. The Windows build is currently broken due to the TLS storage API changes.

Any Windows dev around to fix the trivial compilation error (which by
the way passed the 2.4.54 release process)?
I'm not one, though I do try to update the code on all platforms when possible..

> 5. Using TLS reserves the memory even if PCRE is not being used
> afterwards. This can be thought of as implicitly increasing the
> per-thread stack size.

Which memory is reserved exactly for TLS?

> 6. With the new code there will be no actual allocations in real-world cases.

Well, there will be no allocation but the 20KB stack space, though
that's true for the thread_local implementation too.
Now if PCRE2 stops using stack frames by allowing users to reserve
that space in match_data (like I propose in [3]), the thread_local
implementation clearly wins thanks to the ~almost~ single allocation
per thread, while your implementation would either use 20KB of stack
or malloc()ate them for every pcre2_match() call.
That'd be great if PCRE2 would allow for that, not to show that my
implementation is better but because 20KB of stack for a single
function in a lib is just insane (IMHO).
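
(For concreteness, the thread_local flavor amounts to something like
this sketch, assuming AP_HAS_THREAD_LOCAL; the helper is illustrative,
not the actual httpd code:)

[[[
/* One cached match data per thread, grown only when a regex needs
 * more capture pairs; reused by every ap_regexec() of the thread. */
static AP_THREAD_LOCAL pcre2_match_data *cached_md = NULL;
static AP_THREAD_LOCAL apr_size_t cached_pairs = 0;

static pcre2_match_data *get_match_data(apr_size_t pairs)
{
    if (!cached_md || cached_pairs < pairs) {
        if (cached_md)
            pcre2_match_data_free(cached_md);
        cached_md = pcre2_match_data_create((uint32_t)pairs, NULL);
        cached_pairs = cached_md ? pairs : 0;
    }
    return cached_md;
}
]]]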


Regards;
Yann.

[1] https://github.com/PCRE2Project/pcre2/blob/master/src/pcre2_match.c#L6362
[2] https://github.com/PCRE2Project/pcre2/blob/master/src/pcre2_internal.h#L231
[3] https://github.com/PCRE2Project/pcre2/pull/134
Re: svn commit: r1902572 - /httpd/httpd/trunk/server/util_pcre.c
On Sun, 10 Jul 2022 at 19:52, Yann Ylavic <ylavic.dev@gmail.com> wrote:
>
> On Fri, Jul 8, 2022 at 7:15 PM Ivan Zhakov <ivan@visualsvn.com> wrote:
> >
> > On Fri, 8 Jul 2022 at 19:58, Yann Ylavic <ylavic.dev@gmail.com> wrote:
> > >
> > > The main downside is:
> > >
> > > > +#ifndef AP_PCRE_STACKBUF_SIZE
> > > > +#define AP_PCRE_STACKBUF_SIZE (256)
> > > > #endif
> > > []
> > > > AP_DECLARE(int) ap_regexec(const ap_regex_t *preg, const char *string,
> > > > apr_size_t nmatch, ap_regmatch_t *pmatch,
> > > > int eflags)
> > > > - int small_vector[POSIX_MALLOC_THRESHOLD * 3];
> > > > + /* Use apr_uint64_t to get proper alignment. */
> > > > + apr_uint64_t stack_buf[(AP_PCRE_STACKBUF_SIZE + sizeof(apr_uint64_t) - 1) / sizeof(apr_uint64_t)];
> > >
> > > So, if I count correctly, that's >2KB stack usage (regardless of PCRE1
> > > or PCRE2) compared to the 120B we had previously (for PCRE1 only)?
> > >
> > AP_PCRE_STACKBUF_SIZE is in bytes and defaults to 256 bytes, so the
> > code allocates exactly 256 bytes on the stack.
>
> Argh, so I can't count either.
>
> Let me explain (maybe), I thought that PCRE2's match_data were not
> only containing the captures vector but also all the frames needed for
> matching not-too-complex regexes (at least the initial ones since they
> will be reallocated on need). So the (supposed) 2KB looked necessary
> to me for a few stack frames.
> But the actual 256 bytes only made me look at the PCRE2 code again
> (better this time) for how that can possibly work, and guess what: the
> initial frames are allocated on the stack ([1]) still in PCRE2, and
> that's 20KiB ([2]) of stack space for each call to pcre2_match().
>
> On Alpine Linux for instance (quite used in docker envs) there is only
> 128KiB stack per thread, so I've always configured
> --no-recurse-on-stack with PCRE1 and used per-thread (and its pool)
> vector cache, with custom pcre_malloc/free() to avoid performance
> overhead and be able to run more complex regexes without exploding,
> and I can't do that now with PCRE2 and without requiring 20KB of stack
> :/
> So I created a pull request ([3]).
>
The 20KB stack usage in PCRE is interesting, but somewhat unrelated.
I would suggest a different solution: it would be nice if PCRE provided
a compile-time directive to specify the maximum stack-allocated buffer.
Distributions like Alpine Linux could then specify more appropriate
defaults.

Creating a match context with 20KB and caching it in thread-local storage
doesn't look like an appropriate solution to me, because it is just an
increase of the per-thread memory overhead: the effective per-thread memory
usage will be 128 KB for the stack plus 20 KB for the match context.

> > >
> > > Finally, I don't see how it's better than one allocation (or two) for
> > > the whole lifetime of a thread, given that thread_local is available
> > > on all the platforms where performance matters, nor do I see what
> > > your grievances against thread_local are..
> > >
> > 1. TLS storage grows to the maximum capture count, so it requires
> > more than one or two allocations.
>
> For the lifetime of a thread and the number of ap_regexec() it may
> call, that's all negligible.
>
> > 2. TLS storage doesn't support a dynamic MPM that uses native threads.
>
> httpd-2.4 (and APR 1.8 unless reverted) provide a way to apr-ize a
> native thread so that ap_thread_current() is available.

I meant native OS thread pools.

>
> > 3. Access to TLS storage requires a hash lookup.
>
> Hmm no, util_pcre uses no hash lookup, for instance.
> If one wants to handle TLS data with apr_thread_data_get(&data,
> ap_thread_current()) then it will be stored in an apr_hash_t, but
> there's nothing implicit. And by the way, using the thread_data
> hashtable won't be less efficient than registering as many native
> thread-data entries (e.g. pthread_key_t); I already mentioned that in
> another thread.
>
[[[
apr_thread_data_get((void **)&tls, "apreg", current);
if (!tls || tls->size < size) {
    apr_pool_t *tp = apr_thread_pool_get(current);
    if (!tls) {
        tls = apr_pcalloc(tp, sizeof(*tls));
#ifdef HAVE_PCRE2
        apr_thread_data_set(tls, "apreg", apreg_tls_cleanup, current);
#else
        apr_thread_data_set(tls, "apreg", NULL, current);
#endif
    }
]]]

apr_thread_data_get() uses apr_pool_userdata_get(), which uses an apr hash.
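
(Roughly, the lookup chain is the following; a paraphrase of the APR
internals, not verbatim code:)

[[[
apr_thread_data_get(&data, key, thread)
  -> apr_pool_userdata_get(&data, key, apr_thread_pool_get(thread))
       -> apr_hash_get(pool->user_data, key, APR_HASH_KEY_STRING)
]]]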

> > 4. The Windows build is currently broken due to the TLS storage API changes.
>
> Any Windows dev around to fix the trivial compilation error (which by
> the way passed the 2.4.54 release process)?
> I'm not one, though I do try to update the code on all platforms when possible..
>
There are more problems than a trivial compilation error: now the
thread stack memory is committed, whereas before all these changes it
was only reserved.

Also, considering the big picture: while I personally appreciate the
development effort, we are talking about a large and complex change that
spans multiple platforms and projects. It has a development cost, but
also an implicit cost of maintenance and testing. Considering that, I
would say that the stackbuf approach is significantly simpler and more
local, and has a much lower overall cost while providing similar
characteristics.

> > 5. Using TLS reserves the memory even if PCRE is not being used
> > afterwards. This can be thought of as implicitly increasing the
> > per-thread stack size.
>
> Which memory is reserved exactly for TLS?
The memory reserved for the PCRE match context.

>
> > 6. With the new code there will be no actual allocations in real-world cases.
>
> Well, there will be no allocation but the 20KB stack space, though
> that's true for the thread_local implementation too.
> Now if PCRE2 stops using stack frames by allowing users to reserve
> that space in match_data (like I propose in [3]), the thread_local
> implementation clearly wins thanks to the ~almost~ single allocation
> per thread, while your implementation would either use 20KB of stack
> or malloc()ate them for every pcre2_match() call.
> That'd be great if PCRE2 would allow for that, not to show that my
> implementation is better but because 20KB of stack for a single
> function in a lib is just insane (IMHO).
>
As noted above, this change effectively increases the per-thread
memory usage by 20 KB. With the same effect, we could just increase
the thread stack size.

--
Ivan Zhakov
Re: svn commit: r1902572 - /httpd/httpd/trunk/server/util_pcre.c
On Wed, Jul 13, 2022 at 1:38 PM Ivan Zhakov <ivan@apache.org> wrote:
>
> On Sun, 10 Jul 2022 at 19:52, Yann Ylavic <ylavic.dev@gmail.com> wrote:
> >
> > Let me explain (maybe), I thought that PCRE2's match_data were not
> > only containing the captures vector but also all the frames needed for
> > matching not-too-complex regexes (at least the initial ones since they
> > will be reallocated on need). So the (supposed) 2KB looked necessary
> > to me for a few stack frames.
> > But the actual 256 bytes only made me look at the PCRE2 code again
> > (better this time) for how that can possibly work, and guess what: the
> > initial frames are allocated on the stack ([1]) still in PCRE2, and
> > that's 20KiB ([2]) of stack space for each call to pcre2_match().
> >
> > On Alpine Linux for instance (quite used in docker envs) there is only
> > 128KiB stack per thread, so I've always configured
> > --no-recurse-on-stack with PCRE1 and used per-thread (and its pool)
> > vector cache, with custom pcre_malloc/free() to avoid performance
> > overhead and be able to run more complex regexes without exploding,
> > and I can't do that now with PCRE2 and without requiring 20KB of stack
> > :/
> > So I created a pull request ([3]).
> >
> The 20KB stack usage in PCRE is interesting, but somewhat unrelated.
> I would suggest a different solution: it would be nice if PCRE provided
> a compile-time directive to specify the maximum stack-allocated buffer.
> Distributions like Alpine Linux could then specify more appropriate
> defaults.

This is what pcre2 folks proposed in [3] as a fast(ly acceptable)
solution indeed.
Say START_FRAMES_SIZE is set to some minimum, a single struct
heapframe with 24 captures, that's (on a 64bit system):
offsetof(heapframe, ovector) + 24 * 2 * sizeof(size_t) = 125 + 384
~= 512 bytes
which would be much appreciated, but would also put much more pressure
on util_regex's private_malloc().
Will ap_regexec() start to use malloc() quite often with the current
implementation then?
Up to the old/default START_FRAMES_SIZE if it was a good default after all?

But we are not there yet, because I don't think distros will change the
default pcre START_FRAMES_SIZE for all possible code they run, so most
httpd users (who run on distros, I suppose) will continue to consume
20K of stack memory for every ap_regexec() call.

The flexible way for pcre2 users (and httpd maintainers) would be to
let them opt-out (at runtime) of stack allocation from pcre2_match(),
which is the patch I proposed.

>
> Creating a match context with 20KB and caching it in thread-local storage
> doesn't look like an appropriate solution to me, because it is just an
> increase of the per-thread memory overhead: the effective per-thread memory
> usage will be 128 KB for the stack plus 20 KB for the match context.

With no cache, the other solutions are a big stack or a lot of
malloc()s for some regexes (not so unreasonable in terms of captures,
by the way). Not only can each pcre frame be big because of the number
of captures, but the number of frames (not necessarily big ones)
needed to run a regex can also be big because of the regex complexity
(mod_security might run some of them, for instance). Whatever the
frame size is, pcre2 needs a new one each time it "recurses" (though
it jumps nowadays, it seems; I didn't look at the jit code).

> >
> > > 3. Access to TLS storage requires a hash lookup.
> >
> > Hmm no, util_pcre uses no hash lookup, for instance.
> > If one wants to handle TLS data with apr_thread_data_get(&data,
> > ap_thread_current()) then it will be stored in an apr_hash_t, but
> > there's nothing implicit. And by the way, using the thread_data
> > hashtable won't be less efficient than registering as many native
> > thread-data entries (e.g. pthread_key_t); I already mentioned that in
> > another thread.
> >
> [[[
> apr_thread_data_get((void **)&tls, "apreg", current);
> if (!tls || tls->size < size) {
>     apr_pool_t *tp = apr_thread_pool_get(current);
>     if (!tls) {
>         tls = apr_pcalloc(tp, sizeof(*tls));
> #ifdef HAVE_PCRE2
>         apr_thread_data_set(tls, "apreg", apreg_tls_cleanup, current);
> #else
>         apr_thread_data_set(tls, "apreg", NULL, current);
> #endif
>     }
> ]]]
>
> apr_thread_data_get() uses apr_pool_userdata_get(), which uses an apr hash.
[]
>
> Also, considering the big picture: while I personally appreciate the
> development effort, we are talking about a large and complex change that
> spans multiple platforms and projects. It has a development cost, but
> also an implicit cost of maintenance and testing. Considering that, I
> would say that the stackbuf approach is significantly simpler and more
> local, and has a much lower overall cost while providing similar
> characteristics.

I don't deny that your implementation is superior when
!AP_HAS_THREAD_LOCAL || !ap_thread_current(), but which case is that?
A real case?
The httpd-2.4.54 implementation is optimized for the other case, the
most common one (i.e. platforms where performance usually counts);
that's a choice.
AFAICT, the httpd-2.4.54 implementation does not perform worse than
the new trunk one in any case where ap_thread_current() != NULL.
So this is/was not so bad?

Anyway, please have a look at my proposed patch (elsethread), which
takes the best of both worlds.
This old vs. new implementation discussion is kind of moot with it, no?
I have fixed and tested things since it was posted (don't look too much
into the details of this v0); comments on the approach are welcome though.


Regards;
Yann.
Re: svn commit: r1902572 - /httpd/httpd/trunk/server/util_pcre.c
On Wed, 13 Jul 2022 at 18:06, Yann Ylavic <ylavic.dev@gmail.com> wrote:

> On Wed, Jul 13, 2022 at 1:38 PM Ivan Zhakov <ivan@apache.org> wrote:
> >
> > On Sun, 10 Jul 2022 at 19:52, Yann Ylavic <ylavic.dev@gmail.com> wrote:
> > >
> > > Let me explain (maybe), I thought that PCRE2's match_data were not
> > > only containing the captures vector but also all the frames needed for
> > > matching not-too-complex regexes (at least the initial ones since they
> > > will be reallocated on need). So the (supposed) 2KB looked necessary
> > > to me for a few stack frames.
> > > But the actual 256 bytes only made me look at the PCRE2 code again
> > > (better this time) for how that can possibly work, and guess what: the
> > > initial frames are allocated on the stack ([1]) still in PCRE2, and
> > > that's 20KiB ([2]) of stack space for each call to pcre2_match().
> > >
> > > On Alpine Linux for instance (quite used in docker envs) there is only
> > > 128KiB stack per thread, so I've always configured
> > > --no-recurse-on-stack with PCRE1 and used per-thread (and its pool)
> > > vector cache, with custom pcre_malloc/free() to avoid performance
> > > overhead and be able to run more complex regexes without exploding,
> > > and I can't do that now with PCRE2 and without requiring 20KB of stack
> > > :/
> > > So I created a pull request ([3]).
> > >
> > The 20KB stack usage in PCRE is interesting, but somewhat unrelated.
> > I would suggest a different solution: it would be nice if PCRE provided
> > a compile-time directive to specify the maximum stack-allocated buffer.
> > Distributions like Alpine Linux could then specify more appropriate
> > defaults.
>
> This is what pcre2 folks proposed in [3] as a fast(ly acceptable)
> solution indeed.
> Say START_FRAMES_SIZE is set to some minimum, a single struct
> heapframe with 24 captures, that's (on a 64bit system):
> offsetof(heapframe, ovector) + 24 * 2 * sizeof(size_t) = 125 + 384
> ~= 512 bytes
> which would be much appreciated, but would also put much more pressure
> on util_regex's private_malloc().
> Will ap_regexec() start to use malloc() quite often with the current
> implementation then?
> Up to the old/default START_FRAMES_SIZE if it was a good default after all?
>
> But we are not there yet, because I don't think distros will change the
> default pcre START_FRAMES_SIZE for all possible code they run, so most
> httpd users (who run on distros, I suppose) will continue to consume
> 20K of stack memory for every ap_regexec() call.
>
> The flexible way for pcre2 users (and httpd maintainers) would be to
> let them opt-out (at runtime) of stack allocation from pcre2_match(),
> which is the patch I proposed.
>
I see several problems in this approach:
1. Every apr_pool_t consumes at least one page, i.e., 4kb or 8kb depending
on the platform.
2. The implementation may introduce unbounded memory usage, if PCRE2
performs multiple malloc/free during pattern matching. It seems that
currently it doesn't, but we cannot rely on the implementation details.
Stackbuf code doesn't have this problem, because it's limited to a
fixed-size buffer.
3. As I said before, the proposed approach effectively increases per thread
memory usage: 4 or 8 kb for match data stored in thread state + stack size.
Which is something I don't really understand, because if we are fine with
that, then we can just increase the thread stack size.

With the above in mind, I'm -1 for this approach and r1902731.


>
> > Creating a match context with 20KB and caching it in thread-local storage
> > doesn't look like an appropriate solution to me, because it is just an
> > increase of the per-thread memory overhead: the effective per-thread memory
> > usage will be 128 KB for the stack plus 20 KB for the match context.
>
> With no cache, the other solutions are a big stack or a lot of
> malloc()s for some regexes (not so unreasonable in terms of captures,
> by the way). Not only can each pcre frame be big because of the number
> of captures, but the number of frames (not necessarily big ones)
> needed to run a regex can also be big because of the regex complexity
> (mod_security might run some of them, for instance). Whatever the
> frame size is, pcre2 needs a new one each time it "recurses" (though
> it jumps nowadays, it seems; I didn't look at the jit code).
>
> > >
> > > > 3. Access to TLS storage requires a hash lookup.
> > >
> > > Hmm no, util_pcre uses no hash lookup, for instance.
> > > If one wants to handle TLS data with apr_thread_data_get(&data,
> > > ap_thread_current()) then it will be stored in an apr_hash_t, but
> > > there's nothing implicit. And by the way, using the thread_data
> > > hashtable won't be less efficient than registering as many native
> > > thread-data entries (e.g. pthread_key_t); I already mentioned that in
> > > another thread.
> > >
> > [.[.[.
> > apr_thread_data_get((void **)&tls, "apreg", current);
> > if (!tls || tls->size < size) {
> > apr_pool_t *tp = apr_thread_pool_get(current);
> > if (!tls) {
> > tls = apr_pcalloc(tp, sizeof(*tls));
> > #ifdef HAVE_PCRE2
> > apr_thread_data_set(tls, "apreg", apreg_tls_cleanup,
> current);
> > #else
> > apr_thread_data_set(tls, "apreg", NULL, current);
> > #endif
> > }
> >
> > ]]]
> >
> > apr_thread_data_get() uses apr_pool_userdata_get(), which uses an apr hash.
> []
> >
> > Also, considering the big picture: while I personally appreciate the
> > development effort, we are talking about a large and complex change that
> > spans multiple platforms and projects. It has a development cost, but
> > also an implicit cost of maintenance and testing. Considering that, I
> > would say that the stackbuf approach is significantly simpler and more
> > local, and has a much lower overall cost while providing similar
> > characteristics.
>
> I don't deny that your implementation is superior when
> !AP_HAS_THREAD_LOCAL || !ap_thread_current(), but which case is that?
> A real case?
> The httpd-2.4.54 implementation is optimized for the other case, the
> most common one (i.e. platforms where performance usually counts);
> that's a choice.
> AFAICT, the httpd-2.4.54 implementation does not perform worse than
> the new trunk one in any case where ap_thread_current() != NULL.
> So this is/was not so bad?
>
> Anyway, please have a look at my proposed patch (elsethread), which
> takes the best of both worlds.
> This old vs. new implementation discussion is kind of moot with it, no?
> I have fixed and tested things since it was posted (don't look too much
> into the details of this v0); comments on the approach are welcome though.
>
>
> Regards;
> Yann.
>


--
Ivan Zhakov
Re: svn commit: r1902572 - /httpd/httpd/trunk/server/util_pcre.c
On Sat, Jul 16, 2022 at 10:29 AM Ivan Zhakov <ivan@apache.org> wrote:
>
> On Wed, 13 Jul 2022 at 18:06, Yann Ylavic <ylavic.dev@gmail.com> wrote:
>>
>> The flexible way for pcre2 users (and httpd maintainers) would be to
>> let them opt-out (at runtime) of stack allocation from pcre2_match(),
>> which is the patch I proposed.
>>
> I see several problems in this approach:

I meant the PR/patch I proposed to the pcre team, not the patch to
util_pcre which you seem to refer to below.

> 1. Every apr_pool_t consumes at least one page, i.e., 4kb or 8kb depending on the platform.
> 2. The implementation may introduce unbounded memory usage, if PCRE2 performs multiple malloc/free during pattern matching. It seems that currently it doesn't, but we cannot rely on the implementation details. Stackbuf code doesn't have this problem, because it's limited to a fixed-size buffer.

The memory usage with r1902731 is bounded by the maximum size ever
asked by pcre2_match() AND the subpool's max_free setting.
This looks quite optimal to me: the pool adapts to the actual needs up
to ap_max_mem_free; if ~20K is what most regexes need, this mostly
just reuses 20K worth of apr_memnode_t(s) of different sizes (numbers
of system pages) for all the ap_regexec() calls of each thread (the
memory management would be in the noise of pcre_exec/match, whereas
the system's malloc/free would/could not be).

Your stack proposal stops where pcre2 starts calling private_malloc()
by itself (and r1902731 left this alone), but that can happen when
either the regex is more complex, or pcre2 reduces START_FRAMES_SIZE,
or pcre2 starts calling pcre2_ctx->malloc() unconditionally for its
frames, or pcre2 optionally lets users choose between 20K of stack
space and a full/20K-initial pcre2_ctx->malloc().
I think that httpd should (still) care about 20K of stack allocated
anywhere in the call stack; we don't do that elsewhere (besides some
"low in call stack" exceptional cases in the MPMs), and that's why I
raised a PR to the pcre team.

> 3. As I said before, the proposed approach effectively increases per thread memory usage: 4 or 8 kb for match data stored in thread state + stack size. Which is something I don't really understand, because if we are fine with that, then we can just increase the thread stack size.

On the one hand you ask some admin to increase the thread stack size
if needed, on the other hand you prevent some admin from reducing it
if possible. Why is that?
If one has/buys a docker slot to run httpd (usually only that), say
with MaxRequestWorkers 1000, would you want it to consume 8GB of stack
(the Linux default) or e.g. 128MB (with ulimit -s, or by default with
the docker-fitted Alpine Linux for instance), or anything in between
for one's needs?
The only thing we (as devs) can do about it is good stack usage
practices (20K is not, IMHO), because stack exhaustion is fatal and
hard to debug.

>
> With the above in mind, I'm -1 for this approach and r1902731.

If (as currently being discussed) the pcre2 team decides to reduce
stack usage (optionally or not) and that the match_data should
preserve the allocated memory between the calls (i.e. until an
explicit pcre2_match_data_free), I think the initial implementation is
all we need (the one in 2.4.x, where I thought match_data was doing
that exactly). We could replace the hash lookup with an
AP_THREAD_LOCAL since you disliked it, but having a pcre_match_data_t
per thread would do (we could still track memory usage through
private_malloc() to free the memory at some point..).
Your approach with 256 bytes of stack space would help allocate the
pcre2_general_context and pcre2_match_data structs only, but not the
very first frame(s).

Until something changes on the pcre2 side (if ever), r1902731 does not
hurt your use case anyway (it takes over where you leave things to
malloc/free, that is when pcre2_match() needs more than 20KB of
frames), so I'm wondering why this -1.
I hear that it complicates the code a bit (not much more than the
stack usage does, by the way), but it helps some
existing/pcre-patched/future cases, so I'm hoping that your -1 comes
from my poor explanation of how it actually works.

Whether apr_thread_current() belongs to apr(-1.8) or not is irrelevant
here, but if ap_thread_current() is useful to httpd I think that the
apr_ version is worth it for other software too. I mean, your
reluctance seems to be against ap(r)_thread_current(); I don't grok
why you deny the advantages (at least) in the util_pcre case though.


Regards;
Yann.
Re: svn commit: r1902572 - /httpd/httpd/trunk/server/util_pcre.c
On 7/16/22 10:28 AM, Ivan Zhakov wrote:
> On Wed, 13 Jul 2022 at 18:06, Yann Ylavic <ylavic.dev@gmail.com> wrote:
>
> On Wed, Jul 13, 2022 at 1:38 PM Ivan Zhakov <ivan@apache.org> wrote:
> >
> > On Sun, 10 Jul 2022 at 19:52, Yann Ylavic <ylavic.dev@gmail.com> wrote:
> > >
> > > Let me explain (maybe), I thought that PCRE2's match_data were not
> > > only containing the captures vector but also all the frames needed for
> > > matching not-too-complex regexes (at least the initial ones since they
> > > will be reallocated on need). So the (supposed) 2KB looked necessary
> > > to me for a few stack frames.
> > > But the actual 256 bytes only made me look at the PCRE2 code again
> > > (better this time) for how that can possibly work, and guess what: the
> > > initial frames are allocated on the stack ([1]) still in PCRE2, and
> > > that's 20KiB ([2]) of stack space for each call to pcre2_match().
> > >
> > > On Alpine Linux for instance (quite used in docker envs) there is only
> > > 128KiB stack per thread, so I've always configured
> > > --no-recurse-on-stack with PCRE1 and used per-thread (and its pool)
> > > vector cache, with custom pcre_malloc/free() to avoid performance
> > > overhead and be able to run more complex regexes without exploding,
> > > and I can't do that now with PCRE2 and without requiring 20KB of stack
> > > :/
> > > So I created a pull request ([3]).
> > >
> > The 20KB stack usage in PCRE is interesting, but somewhat unrelated.
> > I would suggest a different solution: it would be nice if PCRE provided
> > a compile-time directive to specify the maximum stack-allocated buffer.
> > Distributions like Alpine Linux could then specify more appropriate
> > defaults.
>
> This is what pcre2 folks proposed in [3] as a fast(ly acceptable)
> solution indeed.
> Say START_FRAMES_SIZE is set to some minimum, a single struct
> heapframe with 24 captures, that's (on a 64bit system):
>   offsetof(heapframe, ovector) + 24 * 2 * sizeof(size_t) = 125 + 384
> ~= 512 bytes
> which would be much appreciated, but would also put much more pressure
> on util_regex's private_malloc().
> Will ap_regexec() start to use malloc() quite often with the current
> implementation then?
> Up to the old/default START_FRAMES_SIZE if it was a good default after all?
>
> But we are not there yet, because I don't think distros will change the
> default pcre START_FRAMES_SIZE for all possible code they run, so most
> httpd users (who run on distros, I suppose) will continue to consume
> 20K of stack memory for every ap_regexec() call.
>
> The flexible way for pcre2 users (and httpd maintainers) would be to
> let them opt-out (at runtime) of stack allocation from pcre2_match(),
> which is the patch I proposed.
>
> I see several problems in this approach:
> 1. Every apr_pool_t consumes at least one page, i.e., 4kb or 8kb depending on the platform.
> 2. The implementation may introduce unbounded memory usage, if PCRE2 performs multiple malloc/free during pattern matching. It
> seems that currently it doesn't, but we cannot rely on the implementation details. Stackbuf code doesn't have this problem,
> because it's limited to a fixed-size buffer.
> 3. As I said before, the proposed approach effectively increases per thread memory usage: 4 or 8 kb for match data stored in
> thread state + stack size. Which is something I don't really understand, because if we are fine with that, then we can just
> increase the thread stack size.
>
> With the above in mind, I'm -1 for this approach and r1902731.
>

The latest implementation gives you both: a small allocation on the stack that does not drive stack usage
up too much and that should be sufficient for many regex cases. If this does not work, we use TLS to tuck away a pointer to a
per-thread pool, and we only create the pool in case the stack memory isn't sufficient.
Whether we really increase the memory usage if we need to use the pool isn't clear to me, as we use a subpool of the thread pool
and thus its allocator. Maybe that allocator still has free nodes around that we can reuse.
With regards to unbounded memory usage, I don't think that this is an actual problem, but if PCRE really did massively malloc/free
beyond the space we use on the stack, just using malloc/free to satisfy those needs would be problematic as well, and I guess this
would probably need fixing in PCRE then.
Furthermore, if this happens we could have a look at using apr_bucket_alloc / apr_bucket_free, or as this might not fit exactly,
take ideas from there.
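
(For concreteness, a hypothetical sketch of that fallback, with
illustrative names; apr_bucket_alloc() recycles freed blocks, unlike a
no-op private_free():)

[[[
#include "apr_buckets.h"

/* ctx would be a per-thread apr_bucket_alloc_t, created once with
 * apr_bucket_alloc_create() on the thread's pool. */
static void *private_malloc(size_t size, void *ctx)
{
    return apr_bucket_alloc(size, ctx);
}

static void private_free(void *block, void *ctx)
{
    (void)ctx;
    apr_bucket_free(block);  /* returns the block to the free list */
}
]]]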
Could you please reconsider your -1?

Regards

Rüdiger
Re: svn commit: r1902572 - /httpd/httpd/trunk/server/util_pcre.c
On 2022/07/29 06:19:03 Ruediger Pluem wrote:
>
>
> On 7/16/22 10:28 AM, Ivan Zhakov wrote:
> > On Wed, 13 Jul 2022 at 18:06, Yann Ylavic <ylavic.dev@gmail.com> wrote:
> >
> > On Wed, Jul 13, 2022 at 1:38 PM Ivan Zhakov <ivan@apache.org> wrote:
> > >
> > > On Sun, 10 Jul 2022 at 19:52, Yann Ylavic <ylavic.dev@gmail.com> wrote:
> > > >
> > > > Let me explain (maybe), I thought that PCRE2's match_data were not
> > > > only containing the captures vector but also all the frames needed for
> > > > matching not-too-complex regexes (at least the initial ones since they
> > > > will be reallocated on need). So the (supposed) 2KB looked necessary
> > > > to me for a few stack frames.
> > > > But the actual 256 bytes only made me look at the PCRE2 code again
> > > > (better this time) for how that can possibly work, and guess what: the
> > > > initial frames are allocated on the stack ([1]) still in PCRE2, and
> > > > that's 20KiB ([2]) of stack space for each call to pcre2_match().
> > > >
> > > > On Alpine Linux for instance (quite used in docker envs) there is only
> > > > 128KiB stack per thread, so I've always configured
> > > > --no-recurse-on-stack with PCRE1 and used per-thread (and its pool)
> > > > vector cache, with custom pcre_malloc/free() to avoid performance
> > > > overhead and be able to run more complex regexes without exploding,
> > > > and I can't do that now with PCRE2 and without requiring 20KB of stack
> > > > :/
> > > > So I created a pull request ([3]).
> > > >
> > > The 20KB stack usage in PCRE is interesting, but somewhat unrelated.
> > > I would suggest a different solution: it would be nice if PCRE provided
> > > a compile-time directive to specify the maximum stack-allocated buffer.
> > > Distributions like Alpine Linux could then specify more appropriate
> > > defaults.
> >
> > This is what pcre2 folks proposed in [3] as a fast(ly acceptable)
> > solution indeed.
> > Say START_FRAMES_SIZE is set to some minimum, a single struct
> > heapframe with 24 captures, that's (on a 64bit system):
> >   offsetof(heapframe, ovector) + 24 * 2 * sizeof(size_t) = 125 + 384
> > ~= 512 bytes
> > which would be much appreciated, but would also put much more pressure
> > on util_regex's private_malloc().
> > Will ap_regexec() start to use malloc() quite often with the current
> > implementation then?
> > > Up to the old/default START_FRAMES_SIZE if it was a good default after all?
> >
> > > But we are not there yet, because I don't think distros will change the
> > > default pcre START_FRAMES_SIZE for all possible code they run, so most
> > > httpd users (who run on distros, I suppose) will continue to consume
> > > 20K of stack memory for every ap_regexec() call.
> >
> > The flexible way for pcre2 users (and httpd maintainers) would be to
> > let them opt-out (at runtime) of stack allocation from pcre2_match(),
> > which is the patch I proposed.
> >
> > I see several problems in this approach:
> > 1. Every apr_pool_t consumes at least one page, i.e., 4kb or 8kb depending on the platform.
> > 2. The implementation may introduce unbounded memory usage, if PCRE2 performs multiple malloc/free during pattern matching. It
> > seems that currently it doesn't, but we cannot rely on the implementation details. Stackbuf code doesn't have this problem,
> > because it's limited to a fixed-size buffer.
> > 3. As I said before, the proposed approach effectively increases per thread memory usage: 4 or 8 kb for match data stored in
> > thread state + stack size. Which is something I don't really understand, because if we are fine with that, then we can just
> > increase the thread stack size.
> >
> > With the above in mind, I'm -1 for this approach and r1902731.
> >
>
> The latest implementation gives you both: a small allocation on the stack that does not drive stack usage
> up too much and that should be sufficient for many regex cases. If this does not work, we use TLS to tuck away a pointer to a
> per-thread pool, and we only create the pool in case the stack memory isn't sufficient.
> Whether we really increase the memory usage if we need to use the pool isn't clear to me, as we use a subpool of the thread pool
> and thus its allocator. Maybe that allocator still has free nodes around that we can reuse.
> With regards to unbounded memory usage, I don't think that this is an actual problem, but if PCRE really did massively malloc/free
> beyond the space we use on the stack, just using malloc/free to satisfy those needs would be problematic as well, and I guess this
> would probably need fixing in PCRE then.
> Furthermore, if this happens we could have a look at using apr_bucket_alloc / apr_bucket_free, or as this might not fit exactly,
> take ideas from there.
> Could you please reconsider your -1?
>

My overall concern is that with the current solution as of r1902858, there are valid and
reasonable allocation patterns that will cause an unbounded memory usage.

For example, nothing prevents current or future PCRE versions from doing this:

for (int i = 0; i < N; i++)
{
    void *p = private_malloc();
    private_free(p);
}

which is going to cause an unbounded memory usage because private_free() is
a no-op and retains everything that was allocated.
Re: svn commit: r1902572 - /httpd/httpd/trunk/server/util_pcre.c
On 8/1/22 5:05 PM, Ivan Zhakov wrote:
> On 2022/07/29 06:19:03 Ruediger Pluem wrote:
>>
>>
>> On 7/16/22 10:28 AM, Ivan Zhakov wrote:
>>> On Wed, 13 Jul 2022 at 18:06, Yann Ylavic <ylavic.dev@gmail.com> wrote:
>>>
>>> On Wed, Jul 13, 2022 at 1:38 PM Ivan Zhakov <ivan@apache.org> wrote:
>>> >
>>> > On Sun, 10 Jul 2022 at 19:52, Yann Ylavic <ylavic.dev@gmail.com> wrote:
>>> > >
>>> > > Let me explain (maybe), I thought that PCRE2's match_data were not
>>> > > only containing the captures vector but also all the frames needed for
>>> > > matching not-too-complex regexes (at least the initial ones since they
>>> > > will be reallocated on need). So the (supposed) 2KB looked necessary
>>> > > to me for a few stack frames.
>>> > > But the actual 256 bytes only made me look at the PCRE2 code again
>>> > > (better this time) for how that can possibly work, and guess what: the
>>> > > initial frames are allocated on the stack ([1]) still in PCRE2, and
>>> > > that's 20KiB ([2]) of stack space for each call to pcre2_match().
>>> > >
>>> > > On Alpine Linux for instance (quite used in docker envs) there is only
>>> > > 128KiB stack per thread, so I've always configured
>>> > > --no-recurse-on-stack with PCRE1 and used per-thread (and its pool)
>>> > > vector cache, with custom pcre_malloc/free() to avoid performance
>>> > > overhead and be able to run more complex regexes without exploding,
>>> > > and I can't do that now with PCRE2 and without requiring 20KB of stack
>>> > > :/
>>> > > So I created a pull request ([3]).
>>> > >
>>> > The 20KB stack usage in PCRE is interesting, but somewhat unrelated.
>>> > I would suggest a different solution: it would be nice if PCRE provided
>>> > a compile-time directive to specify the maximum stack-allocated buffer.
>>> > Distributions like Alpine Linux could then specify more appropriate
>>> > defaults.
>>>
>>> This is what pcre2 folks proposed in [3] as a fast(ly acceptable)
>>> solution indeed.
>>> Say START_FRAMES_SIZE is set to some minimum, a single struct
>>> heapframe with 24 captures, that's (on a 64bit system):
>>>   offsetof(heapframe, ovector) + 24 * 2 * sizeof(size_t) = 125 + 384
>>> ~= 512 bytes
>>> which would be much appreciated, but would also put much more pressure
>>> on util_regex's private_malloc().
>>> Will ap_regexec() start to use malloc() quite often with the current
>>> implementation then?
>>> Up to the old/default START_FRAMES_SIZE if it was a good default after all?
>>>
>>> But we are not there yet, because I don't think distros will change the
>>> default pcre START_FRAMES_SIZE for all possible code they run, so most
>>> httpd users (who run on distros, I suppose) will continue to consume
>>> 20K of stack memory for every ap_regexec() call.
>>>
>>> The flexible way for pcre2 users (and httpd maintainers) would be to
>>> let them opt-out (at runtime) of stack allocation from pcre2_match(),
>>> which is the patch I proposed.
>>>
>>> I see several problems in this approach:
>>> 1. Every apr_pool_t consumes at least one page, i.e., 4kb or 8kb depending on the platform.
>>> 2. The implementation may introduce unbounded memory usage, if PCRE2 performs multiple malloc/free during pattern matching. It
>>> seems that currently it doesn't, but we cannot rely on the implementation details. Stackbuf code doesn't have this problem,
>>> because it's limited to a fixed-size buffer.
>>> 3. As I said before, the proposed approach effectively increases per thread memory usage: 4 or 8 kb for match data stored in
>>> thread state + stack size. Which is something I don't really understand, because if we are fine with that, then we can just
>>> increase the thread stack size.
>>>
>>> With the above in mind, I'm -1 for this approach and r1902731.
>>>
>>
>> The latest implementation gives you both: a small allocation on the stack that does not drive stack usage
>> up too much and that should be sufficient for many regex cases. If this does not work, we use TLS to tuck away a pointer to a
>> per-thread pool, and we only create the pool in case the stack memory isn't sufficient.
>> Whether we really increase the memory usage if we need to use the pool isn't clear to me, as we use a subpool of the thread pool
>> and thus its allocator. Maybe that allocator still has free nodes around that we can reuse.
>> With regards to unbounded memory usage, I don't think that this is an actual problem, but if PCRE really did massively malloc/free
>> beyond the space we use on the stack, just using malloc/free to satisfy those needs would be problematic as well, and I guess this
>> would probably need fixing in PCRE then.
>> Furthermore, if this happens we could have a look at using apr_bucket_alloc / apr_bucket_free, or as this might not fit exactly,
>> take ideas from there.
>> Could you please reconsider your -1?
>>
>
> My overall concern is that with the current solution as of r1902858, there are valid and
> reasonable allocation patterns that will cause an unbounded memory usage.
>
> For example, nothing prevents current or future PCRE versions from doing this:
>
> for (int i = 0; i < N; i++)
> {
>     void *p = private_malloc();
>     private_free(p);
> }
>
> which is going to cause an unbounded memory usage because private_free() is
> a no-op and retains everything that was allocated.

This is true, but these kinds of allocation patterns would also cause a lot of problems with a malloc/free backend, and thus I think
they are unlikely and would need to be addressed within PCRE. But even if they are happening and cannot be fixed for our users
by adjusting PCRE e.g. because they are stuck to distribution versions we still could tackle this then with an apr_bucket_alloc /
apr_bucket_free approach or worst case even with a malloc / free approach for particular "bad" PCRE versions. But for now the
current code seems to work well with the current PCRE allocation pattern.
While having a view beyond the current situation is good, I think we should not assume the worst for the future.

Regards

Rüdiger
Re: svn commit: r1902572 - /httpd/httpd/trunk/server/util_pcre.c
On Mon, Aug 1, 2022 at 8:15 PM Ruediger Pluem <rpluem@apache.org> wrote:
>
> On 8/1/22 5:05 PM, Ivan Zhakov wrote:
> >
> > My overall concern is that with the current solution as of r1902858, there are valid and
> > reasonable allocation patterns that will cause an unbounded memory usage.
> >
> > For example, nothing prevents current or future PCRE versions from doing this:
> >
> > for (int i = 0; i < N; i++)
> > {
> >     void *p = private_malloc();
> >     private_free(p);
> > }
> >
> > which is going to cause an unbounded memory usage because private_free() is
> > a no-op and retains everything that was allocated.
>
> This is true, but these kinds of allocation patterns would also cause a lot of problems with a malloc/free backend, and thus I think
> they are unlikely and would need to be addressed within PCRE. But even if they are happening and cannot be fixed for our users
> by adjusting PCRE e.g. because they are stuck to distribution versions we still could tackle this then with an apr_bucket_alloc /
> apr_bucket_free approach or worst case even with a malloc / free approach for particular "bad" PCRE versions. But for now the
> current code seems to work well with the current PCRE allocation pattern.

We could switch to an apr_allocator+apr_memnode_t pattern just now,
and trade the 8K used by the current (sub)pool for an overhead of
sizeof(apr_memnode_t)+8 only, and private_free() just calls
apr_allocator_free().

Something like the attached patch?
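
(The attachment isn't reproduced in the archive; roughly, the idea is
something like this sketch, with illustrative names, storing a
back-pointer to the node just before the memory handed to PCRE2, hence
the sizeof(apr_memnode_t)+8 overhead:)

[[[
#include "apr_allocator.h"

static void *private_malloc(size_t size, void *ctx)
{
    apr_allocator_t *allocator = ctx;
    apr_memnode_t *node = apr_allocator_alloc(allocator,
                                              size + sizeof(void *));
    if (!node)
        return NULL;
    *(apr_memnode_t **)node->first_avail = node;   /* back-pointer */
    return node->first_avail + sizeof(void *);
}

static void private_free(void *ptr, void *ctx)
{
    apr_allocator_t *allocator = ctx;
    apr_memnode_t *node = ((apr_memnode_t **)ptr)[-1];
    apr_allocator_free(allocator, node);           /* recycled, not retained */
}
]]]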


Regards;
Yann.
Re: svn commit: r1902572 - /httpd/httpd/trunk/server/util_pcre.c
On 8/2/22 2:31 PM, Yann Ylavic wrote:
> On Mon, Aug 1, 2022 at 8:15 PM Ruediger Pluem <rpluem@apache.org> wrote:
>>
>> On 8/1/22 5:05 PM, Ivan Zhakov wrote:
>>>
>>> My overall concern is that with the current solution as of r1902858, there are valid and
>>> reasonable allocation patterns that will cause an unbounded memory usage.
>>>
>>> For example, nothing prevents current or future PCRE versions from doing this:
>>>
>>> for (int i = 0; i < N; i++)
>>> {
>>>     void *p = private_malloc();
>>>     private_free(p);
>>> }
>>>
>>> which is going to cause an unbounded memory usage because private_free() is
>>> a no-op and retains everything that was allocated.
>>
>> This is true, but these kind of allocation patterns would also cause a lot of problems with a malloc/free backend and thus I think
>> they are unlikely and would need to be addressed within PCRE. But even if they are happening and cannot be fixed for our users
>> by adjusting PCRE e.g. because they are stuck to distribution versions we still could tackle this then with an apr_bucket_alloc /
>> apr_bucket_free approach or worst case even with a malloc / free approach for particular "bad" PCRE versions. But for now the
>> current code seems to work well with the current PCRE allocation pattern.
>
> We could switch to an apr_allocator+apr_memnode_t pattern just now,
> and trade the 8K used by the current (sub)pool for an overhead of
> sizeof(apr_memnode_t)+8 only, and private_free() just calls
> apr_allocator_free().
>
> Something like the attached patch?

But this would allocate at least 8k for each private_malloc call, correct?

Regards

Rüdiger
Re: svn commit: r1902572 - /httpd/httpd/trunk/server/util_pcre.c
On Tue, Aug 2, 2022 at 2:59 PM Ruediger Pluem <rpluem@apache.org> wrote:
>
> On 8/2/22 2:31 PM, Yann Ylavic wrote:
> > On Mon, Aug 1, 2022 at 8:15 PM Ruediger Pluem <rpluem@apache.org> wrote:
> >>
> >> On 8/1/22 5:05 PM, Ivan Zhakov wrote:
> >>>
> >>> My overall concern is that with the current solution as of r1902858, there are valid and
> >>> reasonable allocation patterns that will cause an unbounded memory usage.
> >>>
> >>> For example, nothing prevents current or future PCRE versions from doing this:
> >>>
> >>> for (int i = 0; i < N; i++)
> >>> {
> >>>     void *p = private_malloc();
> >>>     private_free(p);
> >>> }
> >>>
> >>> which is going to cause an unbounded memory usage because private_free() is
> >>> a no-op and retains everything that was allocated.
> >>
> >> This is true, but these kinds of allocation patterns would also cause a lot of problems with a malloc/free backend, and thus I think
> >> they are unlikely and would need to be addressed within PCRE. But even if they are happening and cannot be fixed for our users
> >> by adjusting PCRE e.g. because they are stuck to distribution versions we still could tackle this then with an apr_bucket_alloc /
> >> apr_bucket_free approach or worst case even with a malloc / free approach for particular "bad" PCRE versions. But for now the
> >> current code seems to work well with the current PCRE allocation pattern.
> >
> > We could switch to an apr_allocator+apr_memnode_t pattern just now,
> > and trade the 8K used by the current (sub)pool for an overhead of
> > sizeof(apr_memnode_t)+8 only, and private_free() just calls
> > apr_allocator_free().
> >
> > Something like the attached patch?
>
> But this would allocate at least 8k for each private_malloc call, correct?

Yes, 8K with apr-1.x (4K with apr-trunk IIRC).

But anyway pcre2_match() is going to ask at least 40K for its first
allocation (or 20K with next versions), because it doubles the
number/size of the heap frames when needed and starts with 20K (either
on the stack for the current version, or on the heap for the next
versions [0]).

Regards;
Yann.

[0] https://github.com/PCRE2Project/pcre2/commit/d90fb23878eeb8bcb5153b1ebb23948e7de8cf34
Re: svn commit: r1902572 - /httpd/httpd/trunk/server/util_pcre.c
On 8/2/22 3:19 PM, Yann Ylavic wrote:
> On Tue, Aug 2, 2022 at 2:59 PM Ruediger Pluem <rpluem@apache.org> wrote:
>>
>> On 8/2/22 2:31 PM, Yann Ylavic wrote:
>>> On Mon, Aug 1, 2022 at 8:15 PM Ruediger Pluem <rpluem@apache.org> wrote:
>>>>
>>>> On 8/1/22 5:05 PM, Ivan Zhakov wrote:
>>>>>
>>>>> My overall concern is that with the current solution as of r1902858, there are valid and
>>>>> reasonable allocation patterns that will cause an unbounded memory usage.
>>>>>
>>>>> For example, nothing prevents current or future PCRE versions from doing this:
>>>>>
>>>>> for (int i = 0; i < N; i++)
>>>>> {
>>>>>     void *p = private_malloc();
>>>>>     private_free(p);
>>>>> }
>>>>>
>>>>> which is going to cause an unbounded memory usage because private_free() is
>>>>> a no-op and retains everything that was allocated.
>>>>
>>>> This is true, but these kinds of allocation patterns would also cause a lot of problems with a malloc/free backend, and thus I think
>>>> they are unlikely and would need to be addressed within PCRE. But even if they are happening and cannot be fixed for our users
>>>> by adjusting PCRE e.g. because they are stuck to distribution versions we still could tackle this then with an apr_bucket_alloc /
>>>> apr_bucket_free approach or worst case even with a malloc / free approach for particular "bad" PCRE versions. But for now the
>>>> current code seems to work well with the current PCRE allocation pattern.
>>>
>>> We could switch to an apr_allocator+apr_memnode_t pattern just now,
>>> and trade the 8K used by the current (sub)pool for an overhead of
>>> sizeof(apr_memnode_t)+8 only, and private_free() just calls
>>> apr_allocator_free().
>>>
>>> Something like the attached patch?
>>
>> But this would allocate at least 8k for each private_malloc call, correct?
>
> Yes, 8K with apr-1.x (4K with apr-trunk IIRC).
>
> But anyway pcre2_match() is going to ask at least 40K for its first
> allocation (or 20K with next versions), because it doubles the
> number/size of the heap frames when needed and starts with 20K (either
> on the stack for the current version, or on the heap for the next
> versions [0]).

Thanks for the details, as I was not aware of these; but this is what PCRE currently does, and in the end the discussion, from my
point of view, was that it could do something entirely different. My point is more: the current implementation with the subpool is
fine with the current allocation pattern of PCRE. If the pattern changes, we would have ways to adapt the code on our side to react
to these changes.

Regards

Rüdiger