Mailing List Archive

Caching Modified URLs by Varnish instead of the original requested URL
Hello All,


For our Spring Boot application, we are using Varnish caching in a
production environment.




Requirement: [To utilize cache effectively]

Modify the URL (remove unnecessary parameters) before the user request is
cached, so that Varnish caches the response under the modified URL, which
helps improve cache HITs for similar URLs.


For example:

Let's consider the request URL below.

URL 1, at time t:
samplehost.com/search/ims?q=bags&source=android&options.start=0


Our requirement:

Make Varnish treat URLs with options.start=0 and URLs without the
options.start parameter as EQUIVALENT, so that a single cached response
(single key) can be used in both cases.


*1st URL after modification:*

samplehost.com/search/ims?q=bags&source=android


*Cached URL at Varnish:*

samplehost.com/search/ims?q=bags&source=android


URL 2, at time t+1:
samplehost.com/search/ims?q=bags&source=android


At present, Varnish considers this URL different from the 1st URL and
uses a different key when caching the 2nd URL [so it will be a MISS].


*2nd URL after modification:*

samplehost.com/search/ims?q=bags&source=android


Now the 2nd URL will be a HIT at Varnish, effectively utilizing the cache.



NOTE:

We aim to perform this URL modification without implementing the logic
directly within the default.vcl file. Our intention is to keep the VCL
codebase clean and manageable.



To address this requirement effectively, we have explored two potential
Approaches:


Approach-1: [diagram not preserved in the archive]

Approach-2: [diagram not preserved in the archive]

1. Please go through the approaches mentioned above and let me know which
is the more effective solution.

2. Regarding Approach-2

At Step 2:

Is there any way to access and execute a custom subroutine from another
VCL file to modify the request URL? If yes, please help with details.

At Step 3:

The Tomcat backend should receive the original request URL instead of the
modified URL.

3. Please let us know if there is any better approach that can be
implemented.



Thanks & Regards
Uday Kumar
Re: Caching Modified URLs by Varnish instead of the original requested URL
Hi Uday,

I'm not exactly sure how to read those diagrams, so I apologize if I'm
missing the mark or if I'm too broad here.

There are a few points I'd like to draw your attention to. The first one
is that Varnish doesn't cache the request or the URL. The cache is
essentially a big hashmap/dictionary/database, in which you store the
response. The request/URL is the key for it, so you need to have it in its
"final" form before the cache lookup happens.

From what I read, you are not against it, and you just want to sanitize the
URL in vcl_recv, but you don't like the idea of making the main file too
unwieldy. If I got that right, then I have a nice answer for you: use
includes and function calls.

As an example:

# cat /etc/varnish/url.vcl
sub sanitize_url {
    # do whatever modifications you need here
}

# cat /etc/varnish/default.vcl
include "./url.vcl";

sub vcl_recv {
    call sanitize_url;
}


That should get you going.

Hopefully I didn't miss the mark too much here, let me know if I did.
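
To make that a bit more concrete, here is a minimal sketch of what the two
files might contain for the options.start=0 case. It is only an illustration
under a few assumptions: the helper header name hash-url is borrowed from
later in this thread, the two regexes only handle a literal
"options.start=0", and the backend definition is left out. Hashing on a copy
of the URL rather than on req.url is one way to let the backend still see
the original request URL.

```vcl
# /etc/varnish/url.vcl (hypothetical contents)
sub sanitize_url {
    # Work on a copy so that req.url, and therefore the backend request,
    # keeps the original query string.
    set req.http.hash-url = req.url;

    # "options.start=0" followed by another parameter: drop it together
    # with its trailing "&", keeping the "?" or "&" in front of it.
    set req.http.hash-url =
        regsub(req.http.hash-url, "([?&])options\.start=0&", "\1");

    # "options.start=0" as the last parameter: drop it together with its
    # leading "?" or "&".
    set req.http.hash-url =
        regsub(req.http.hash-url, "[?&]options\.start=0$", "");
}

# /etc/varnish/default.vcl
vcl 4.1;

# backend definition omitted

include "./url.vcl";

sub vcl_recv {
    call sanitize_url;
}

sub vcl_hash {
    # Same logic as the built-in vcl_hash, except that the key is built
    # from the sanitized copy instead of req.url.
    hash_data(req.http.hash-url);
    if (req.http.host) {
        hash_data(req.http.host);
    } else {
        hash_data(server.ip);
    }
    return (lookup);
}

sub vcl_backend_fetch {
    # The helper header is only needed for hashing; do not forward it.
    unset bereq.http.hash-url;
}
```

With this in place, /search/ims?q=bags&source=android&options.start=0 and
/search/ims?q=bags&source=android should end up under the same cache key,
while the backend still receives whichever URL the client actually sent.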

--
Guillaume Quintard


On Tue, Aug 22, 2023 at 3:45 AM Uday Kumar <uday.polu@indiamart.com> wrote:

> [...]
> _______________________________________________
> varnish-misc mailing list
> varnish-misc@varnish-cache.org
> https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc
>
Re: Caching Modified URLs by Varnish instead of the original requested URL
Hi Guillaume,

*use includes and function calls*
This is great, thank you so much for your help!

Thanks & Regards
Uday Kumar


On Wed, Aug 23, 2023 at 1:32 AM Guillaume Quintard <
guillaume.quintard@gmail.com> wrote:

> [...]
Re: Caching Modified URLs by Varnish instead of the original requested URL
Hi Guillaume,

In the process of modifying the query string in VCL code, we have a
requirement to *lowercase the value of a specific parameter*, instead of
the *whole query string*.

*Example Request URL:*
/search/ims?q=CRICKET bat&country_code=IN

*Requirement:*
We have to modify the request URL by lowercasing the value of only the *q*
parameter, i.e. /search/ims?q=cricket bat&country_code=IN

*For that, we have found the regex below:*
set req.http.hash-url = regsuball(req.http.hash-url, "(q=)(.*?)(\&|$)",
    "\1" + std.tolower("\2") + "\3");

*ISSUE:*
std.tolower("\2") in the above statement is *not lowercasing* the string
that's captured, but if I test it using std.tolower("SAMPLE"), it's
lowercasing as expected.

1. May I know why it's not lowercasing if std.tolower("\2") is used?
2. Also, please suggest possible optimal solutions for the same (using
regex).

Thanks & Regards
Uday Kumar


On Wed, Aug 23, 2023 at 12:01 PM Uday Kumar <uday.polu@indiamart.com> wrote:

> [...]
Re: Caching Modified URLs by Varnish instead of the original requested URL
I'm pretty sure it's lowercasing "\2" correctly. The problem is that you
want to lowercase the *value* referenced by "\2" instead. std.tolower() runs
before regsuball() does, so all it ever sees is the literal two-character
string "\2", which has no letters to lowercase; the backreference is only
expanded later, inside regsuball().

On this, I don't think you have a choice: you need to make that captured
group its own string, lowercase it, and only then concatenate it. Something
like:

set req.http.hash-url =
    regsuball(req.http.hash-url, ".*(q=)(.*?)(\&|$).*", "\1") +
    std.tolower(regsuball(req.http.hash-url, ".*(q=)(.*?)(\&|$).*", "\2")) +
    regsuball(req.http.hash-url, ".*(q=)(.*?)(\&|$).*", "\3");

It's disgusting, but eh, we started with regex, so...
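
For what it's worth, here is a slightly fuller sketch of the same
"extract, lowercase, concatenate" idea in plain VCL. It is not the exact
statement above: because that pattern matches the whole string, its three
pieces reassemble only the "q=value&" fragment, so this variant widens the
first and last groups to keep the path and the other parameters. The
subroutine name lowercase_q is made up for the example, and it assumes
req.http.hash-url already holds a copy of the URL.

```vcl
vcl 4.1;

import std;

# Hypothetical helper: lowercase only the value of the q parameter.
sub lowercase_q {
    # Guard first: when a regex does not match, regsub() returns its input
    # unchanged, and gluing three unchanged copies together would
    # triplicate the URL.
    if (req.http.hash-url ~ "[?&]q=") {
        # Group 1: everything up to and including "q="
        # Group 2: the value of q (up to the next "&" or the end)
        # Group 3: whatever follows the value
        set req.http.hash-url =
            regsub(req.http.hash-url, "^(.*?[?&]q=)([^&]*)(.*)$", "\1") +
            std.tolower(
                regsub(req.http.hash-url, "^(.*?[?&]q=)([^&]*)(.*)$", "\2")) +
            regsub(req.http.hash-url, "^(.*?[?&]q=)([^&]*)(.*)$", "\3");
    }
}
```

For /search/ims?q=CRICKET bat&country_code=IN the three groups are
"/search/ims?q=", "CRICKET bat" and "&country_code=IN", so the result is
/search/ims?q=cricket bat&country_code=IN. Running the same regex three
times is still not pretty, which is where the VMODs below come in.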

Other options include vmod_querystring
<https://github.com/Dridi/libvmod-querystring/blob/master/src/vmod_querystring.vcc.in>
(Dridi might possibly be of assistance on this topic) and vmod_urlplus
<https://docs.varnish-software.com/varnish-enterprise/vmods/urlplus/#query_get>
(Varnish Enterprise), and the last, and possibly most promising one, vmod_re2
<https://gitlab.com/uplex/varnish/libvmod-re2/-/blob/master/README.md>, which
would allow you to do something like:

if (myset.match(".*(q=)(.*?)(\&|$).*", "\1")) {
    set req.http.hash-url = myset.matched(1) +
        std.tolower(myset.matched(2)) +
        myset.matched(3);
}

--
Guillaume Quintard


On Thu, Aug 31, 2023 at 1:03 AM Uday Kumar <uday.polu@indiamart.com> wrote:

> [...]
Re: Caching Modified URLs by Varnish instead of the original requested URL
On 8/31/23 22:06, Guillaume Quintard wrote:
>
> Other options include vmod_querystring
> <https://github.com/Dridi/libvmod-querystring/blob/master/src/vmod_querystring.vcc.in>
> (Dridi might possibly be of assistance on this topic) and
> vmod_urlplus
> <https://docs.varnish-software.com/varnish-enterprise/vmods/urlplus/#query_get>
> (Varnish Enterprise), and the last, and possibly most promising one,
> vmod_re2
> <https://gitlab.com/uplex/varnish/libvmod-re2/-/blob/master/README.md>

I would suggest going with vmod_re for a task like this:

https://gitlab.com/uplex/varnish/libvmod-re

Because:

- VMOD re is based on Varnish's internal interface for regex-ing, so it
uses the pcre2 library that's always installed with Varnish. For VMOD
re2 you also have to install the re2 library.

- pcre2 regex matching is generally faster than re2 matching. The point
of re2 regexen is that matches won't go into catastrophic backtracking
on pathological cases.

- The real strength of re2 lies in the set interface, which matches
multiple regexen "simultaneously", and then can tell you which one
matched. The matching regex can be associated with a backend, a
subroutine, or a number of other VCL objects; and there are a variety of
other bells and whistles. VMOD re is just about subexpression capture,
which is the job to be done here.

For either VMOD re or re2, it's a good idea to initialize the regex in
vcl_init, so that it's compiled once, when the VCL is loaded, rather than
at request time. The versions of the match function that take a regex as a
parameter compile the regex on every invocation.

So with VMOD re it would look like this:

import re;
import std;

sub vcl_init {
    # Compiled once, when the VCL is loaded.
    new query_pattern = re.regex(".*(q=)(.*?)(\&|$).*");
}

sub vcl_recv {
    if (query_pattern.match(req.url)) {
        set req.http.hash-url = query_pattern.backref(1) +
            std.tolower(query_pattern.backref(2)) +
            query_pattern.backref(3);
    }
}


HTH,
Geoff
--
** * * UPLEX - Nils Goroll Systemoptimierung

Scheffelstraße 32
22301 Hamburg

Tel +49 40 2880 5731
Mob +49 176 636 90917
Fax +49 40 42949753

http://uplex.de
Re: Caching Modified URLs by Varnish instead of the original requested URL
Sorry, I get nerdy about this subject and can't help following up.

I said:

> - pcre2 regex matching is generally faster than re2 matching. The point
> of re2 regexen is that matches won't go into catastrophic backtracking
> on pathological cases.

Should have mentioned that pcre2 is even better at subexpression
capture, which is what the OP's question is all about.

> sub vcl_init {
>     new query_pattern = re.regex(".*(q=)(.*?)(\&|$).*");
> }

OMG no. Like this please:

new query_pattern = re.regex("\b(q=)(.*?)(?:\&|$)");

I have sent an example of a pcre regex with .* (two of them!) to a
public mailing list, for which I will burn in hell.

To match a name-value pair in a query string, use a regex with \b for 'word
boundary' in front of the name. That way it will match either right after
the '?' that starts the query string, or following an ampersand.

And ?: tells pcre not to bother capturing the last expression in
parentheses (they're just for grouping).

Avoid .* in pcre regexen if you possibly can. You can, almost always.

With .* at the beginning, the pcre matcher searches all the way to the
end of the string, and then backtracks all the way back, looking for the
first letter to match. In this case 'q', and it will stop and search and
backtrack at any other 'q' that it may find while working backwards.

pcre2 fortunately has an optimization that ignores a trailing .* if it
has found a match up until there, so that it doesn't busily match the
dot against every character left in the string. So this time .* does no
harm, but it's superfluous, and violates the golden rule of pcre: avoid
.* if at all possible.

Incidentally, this is an area where re2 does have an advantage over
pcre2. The efficiency of pcre2 matching depends crucially on how you
write the regex, because details like \b instead of .* give it hints for
pruning the search. While re2 matching usually isn't as fast as pcre2
matching against well-written patterns, re2 doesn't depend so much on
that sort of thing.
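
Putting the corrected pattern back into a complete example, a sketch could
look like the following. Since the last group is no longer captured, only
backrefs 1 and 2 exist. Splicing the lowercased value back in with regsub()
is an addition that is not part of the earlier example; it keeps the path
and the other parameters in the result. It assumes the value of q contains
no backslashes, because regsub() interprets \N sequences in its replacement
string.

```vcl
vcl 4.1;

import std;
import re;

sub vcl_init {
    # Compiled once, when the VCL is loaded.
    new query_pattern = re.regex("\b(q=)(.*?)(?:\&|$)");
}

sub vcl_recv {
    # Default: the key is the URL as requested.
    set req.http.hash-url = req.url;

    if (query_pattern.match(req.url)) {
        # backref(2) is the raw value of q. Replace the first "q=value"
        # with its lowercased form and leave the rest of the URL alone.
        set req.http.hash-url = regsub(req.url, "\bq=[^&]*",
            "q=" + std.tolower(query_pattern.backref(2)));
    }
}
```

Both patterns find the leftmost "q=" at a word boundary and stop at the next
ampersand, so the regsub() replaces exactly the pair that was captured. One
caveat with \b: it also matches after other non-word characters, so a
parameter such as "x-q=" would be picked up as well.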


OK I can chill now,
Geoff
Re: Caching Modified URLs by Varnish instead of the original requested URL
Thank you so much Geoff for that very useful knowledge dump!

Good call out on the .*, I realized I carried them over too, when I
copy-pasted the regex from the pure vcl example (where it's needed) to the
vmod one.

And so, just to be clear about it:
- vmod-re is based on libpcre2
- vmod-re2 is based on libre2
Correct?

I see no way I'm going to misremember that, at all :-D

--
Guillaume Quintard


On Fri, Sep 1, 2023 at 7:47 AM Geoff Simmons <geoff@uplex.de> wrote:

> [...]
Re: Caching Modified URLs by Varnish instead of the original requested URL
On 9/1/23 16:58, Guillaume Quintard wrote:
>
> - vmod-re is based on libpcre2
> - vmod-re2 is based on libre2
> Correct?

Correct.

It used to be the case that libvmod-re only used the internal VRE
interface. So the VMOD was using whatever Varnish used, so to speak,
which happened to be pcre. But since the transition to pcre2, we have
some direct calls into libpcre2.

> I see no way I'm going to misremember that, at all :-D

Yeah. Back when it was pcre vs re2, that wasn't so hard. But now, oh well.


Best,
Geoff
Re: Caching Modified URLs by Varnish instead of the original requested URL
Thanks Guillaume, I'll look into it.

Thanks & Regards
Uday Kumar


On Fri, Sep 1, 2023 at 1:36 AM Guillaume Quintard <
guillaume.quintard@gmail.com> wrote:

> I'm pretty sure it's correctly lowercasing "\2" correctly. The problem is
> that you want to lowercase the *value* referenced by "\2" instead.
>
> On this, I don't think you have a choice, you need to make that captured
> group its own string, lowercase it, and only then concatenate it. Something
> like:
>
> set req.http.hash-url = regsuball(req.http.hash-url,
> ".*(q=)(.*?)(\&|$).*", "\1") + *std.tolower("regsuball(req.http.hash-url,
> ".*(q=)(.*?)(\&|$).*", "\2")") + *regsuball(req.http.hash-url,
> ".*(q=)(.*?)(\&|$).*", "\3"));
>
> It's disgusting, but eh, we started with regex, so...
>
> Other options include vmod_querystring
> <https://github.com/Dridi/libvmod-querystring/blob/master/src/vmod_querystring.vcc.in>
> (Dridi might possibly be of assistance on this topic) and vmod_urlplus
> <https://docs.varnish-software.com/varnish-enterprise/vmods/urlplus/#query_get> (Varnish
> Enterprise), and the last, and possibly most promising one, vmod_re2
> <https://gitlab.com/uplex/varnish/libvmod-re2/-/blob/master/README.md> which
> would allow you to do something like
>
> if (myset.match(".*(q=)(.*?)(\&|$).*", "\1")) {
> set req.http.hash-url = myset.matched(1) + std.lower(myset.matched(2))
> + myset.matched(3)
> }
>
> --
> Guillaume Quintard
>
>
> On Thu, Aug 31, 2023 at 1:03?AM Uday Kumar <uday.polu@indiamart.com>
> wrote:
>
>> Hi Guillaume,
>>
>> In the process of modifying the query string in VCL code, we have a
>> requirement of *lowercasing value of specific parameter*, instead of the *whole
>> query string*
>>
>> *Example Request URL:*
>> /search/ims?q=*CRICKET bat*&country_code=IN
>>
>> *Requirement:*
>> We have to modify the request URL by lowercasing the value of only the *q
>> *parameter
>> i.e ./search/ims?q=*cricket bat*&country_code=IN
>>
>> *For that, we have found below regex:*
>> set req.http.hash-url = regsuball(req.http.hash-url, "(q=)(.*?)(\&|$)",
>> "\1"+*std.tolower("\2")*+"\3");
>>
>> *ISSUE:*
>> *std.tolower("\2")* in the above statement is *not lowercasing* the
>> string that's captured, but if I test it using *std.tolower("SAMPLE"),* its
>> lowercasing as expected.
>>
>> 1. May I know why it's not lowercasing if *std.tolower("\2") is used*?
>> 2. Also, please provide possible optimal solutions for the same. (using
>> regex)
>>
>> Thanks & Regards
>> Uday Kumar
>>
>>
>> On Wed, Aug 23, 2023 at 12:01 PM Uday Kumar <uday.polu@indiamart.com>
>> wrote:
>>
>>> Hi Guillaume,
>>>
>>> *use includes and function calls*
>>> This is great, thank you so much for your help!
>>>
>>> Thanks & Regards
>>> Uday Kumar
>>>
>>>
>>> On Wed, Aug 23, 2023 at 1:32 AM Guillaume Quintard <
>>> guillaume.quintard@gmail.com> wrote:
>>>
>>>> Hi Uday,
>>>>
>>>> I'm not exactly sure how to read those diagrams, so I apologize if I'm
>>>> missing the mark or if I'm too broad here.
>>>>
>>>> There are a few points I'd like to draw your attention to. The first
>>>> one is that varnish doesn't cache the request or the URL. The cache is
>>>> essentially a big hashmap/dictionary/database, in which you store the
>>>> response. The request/url is the key for it, so you need to have it in its
>>>> "final" form before you do anything.
>>>>
>>>> From what I read, you are not against it, and you just want to sanitize
>>>> the URL in vcl_recv, but you don't like the idea of making the main file
>>>> too unwieldy. If I got that right, then I have a nice answer for you: use
>>>> includes and function calls.
>>>>
>>>> As an example:
>>>>
>>>> # cat /etc/varnish/url.vcl
>>>> sub sanitize_url {
>>>> # do whatever modifications you need here
>>>> }
>>>>
>>>> # cat /etc/varnish/default.vcl
>>>> include "./url.vcl";
>>>>
>>>> sub vcl_recv {
>>>> call sanitize_url;
>>>> }
>>>>
>>>>
>>>> That should get you going.
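>>>>
>>>> As a minimal sketch of what url.vcl could contain for the
>>>> options.start=0 case: the cache key is built from a sanitized copy of
>>>> the URL kept in a request header, while req.url is left untouched so
>>>> the backend still receives the original URL. The header name hash-url
>>>> and the custom vcl_hash are assumptions:
>>>>
>>>> ```vcl
>>>> sub sanitize_url {
>>>>     # Work on a copy so that req.url reaches the backend unchanged.
>>>>     set req.http.hash-url = req.url;
>>>>     # Drop "options.start=0", keeping the separator that preceded it.
>>>>     set req.http.hash-url = regsuball(req.http.hash-url,
>>>>         "([?&])options\.start=0(&|$)", "\1");
>>>>     # Remove a dangling "?" or "&" left at the end.
>>>>     set req.http.hash-url = regsub(req.http.hash-url, "[?&]$", "");
>>>> }
>>>>
>>>> sub vcl_hash {
>>>>     # Assumes sanitize_url was called from vcl_recv.
>>>>     # Hash the sanitized copy instead of req.url.
>>>>     hash_data(req.http.hash-url);
>>>>     if (req.http.host) {
>>>>         hash_data(req.http.host);
>>>>     } else {
>>>>         hash_data(server.ip);
>>>>     }
>>>>     # Return here so the built-in vcl_hash does not add req.url too.
>>>>     return (lookup);
>>>> }
>>>> ```
>>>>
>>>> With this, /search/ims?q=bags&source=android&options.start=0 and
>>>> /search/ims?q=bags&source=android should share one cache key. The
>>>> hash-url header would also be forwarded to the backend unless it is
>>>> removed, for example with unset bereq.http.hash-url in
>>>> vcl_backend_fetch.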
>>>>
>>>> Hopefully I didn't miss the mark too much here, let me know if I did.
>>>>
>>>> --
>>>> Guillaume Quintard
>>>>
>>>>
>>>> On Tue, Aug 22, 2023 at 3:45 AM Uday Kumar <uday.polu@indiamart.com>
>>>> wrote:
>>>>
>>>>> Hello All,
>>>>>
>>>>>
>>>>> For our spring boot application, we are using Varnish Caching in a
>>>>> production environment.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Requirement: [To utilize cache effectively]
>>>>>
>>>>> Modify the URL (Removal of unnecessary parameters) while caching the
>>>>> user request, so that the modified URL can be cached by varnish which
>>>>> helps improve cache HITS for similar URLs.
>>>>>
>>>>>
>>>>> For Example:
>>>>>
>>>>> Let's consider the below Request URL
>>>>>
>>>>> Url at time t, 1. samplehost.com/search/ims?q=bags&source=android&options.start=0
>>>>>
>>>>>
>>>>> Our Requirement:
>>>>>
>>>>> To make Varnish treat URLs with options.start=0 and URLs without the
>>>>> options.start parameter as EQUIVALENT, so that a single cached
>>>>> response (single key) can be used in both cases.
>>>>>
>>>>>
>>>>> *1st URL after modification:*
>>>>>
>>>>> samplehost.com/search/ims?q=bags&source=android
>>>>>
>>>>>
>>>>> *Cached URL at Varnish:*
>>>>>
>>>>> samplehost.com/search/ims?q=bags&source=android
>>>>>
>>>>>
>>>>>
>>>>> Now, Url at time t+1, 2.
>>>>> samplehost.com/search/ims?q=bags&source=android
>>>>>
>>>>>
>>>>> At present, varnish considers the above URL as different from 1st URL
>>>>> and uses a different key while caching the 2nd URL[So, it will be a
>>>>> miss]
>>>>>
>>>>>
>>>>> *So, URL after Modification:*
>>>>>
>>>>> samplehost.com/search/ims?q=bags&source=android
>>>>>
>>>>>
>>>>> Now, 2nd URL will be a HIT at varnish, effectively utilizing the
>>>>> cache.
>>>>>
>>>>>
>>>>>
>>>>> NOTE:
>>>>>
>>>>> We aim to execute this URL modification without implementing the
>>>>> logic directly within the default.vcl file. Our intention is to
>>>>> maintain a clean and manageable codebase in the VCL.
>>>>>
>>>>>
>>>>>
>>>>> To address this requirement effectively, we have explored two
>>>>> potential Approaches:
>>>>>
>>>>>
>>>>> Approach-1:
>>>>>
>>>>>
>>>>>
>>>>> Approach-2:
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> 1. Please go through the approaches mentioned above and let me know
>>>>> the effective solution.
>>>>>
>>>>> 2. Regarding Approach-2
>>>>>
>>>>> At Step 2:
>>>>>
>>>>> May I know if there is any way to access and execute a custom
>>>>> subroutine from another VCL file to modify the request URL? If yes,
>>>>> please help with details.
>>>>>
>>>>> At Step 3:
>>>>>
>>>>> Tomcat Backend should receive the Original Request URL instead of the
>>>>> Modified URL.
>>>>>
>>>>> 3. Please let us know if there is any better approach that can be
>>>>> implemented.
>>>>>
>>>>>
>>>>>
>>>>> Thanks & Regards
>>>>> Uday Kumar
>>>>