Mailing List Archive

Grace and misbehaving servers
Hi,

I'm currently setting up Varnish for a project, and the grace feature together with health checks/probes seems to be a great savior when working with servers that might misbehave. But I'm not really sure I understand how to actually achieve that, since the example doesn't quite make sense to me:

https://varnish-cache.org/docs/trunk/users-guide/vcl-grace.html

See the section "Misbehaving servers". There the example does "set beresp.grace = 24h" in vcl_backend_response, and "set req.grace = 10s" in vcl_recv, if the backend is healthy. But since vcl_recv is run before vcl_backend_response, doesn't that mean that the 10s grace value of vcl_recv is overwritten by the 24h value in vcl_backend_response?


Also... there is always a risk of some URLs suddenly giving a 500 error (or a timeout) while the probe still returns 200. Is it possible to have Varnish behave more or less as if the backend is sick, but just for those URLs? Basically I would like this logic:

If healthy content exists in the cache:
1. Return the cached (and potentially stale) content to the client
2. Increase the ttl and/or grace, to keep the healthy content longer
3. Only do a bg-fetch if a specified time has passed since the last attempt (let's say 5s), to avoid hammering the backend

If non-healthy content (i.e. a cached 500 error) exists in the cache:
1. Return the cached 500 content to the client
2. Only do a bg-fetch if a specified time has passed since the last attempt (let's say 5s), to avoid hammering the backend

If no content exists in the cache:
1. Perform a synchronous fetch
2. If the result is a 500 error, cache it with, let's say, ttl = 5s
3. Otherwise, cache it with a longer ttl
4. Return the result to the client

Is this possible with the community edition of Varnish?
Re: Grace and misbehaving servers
Hi,

On Sun, Mar 15, 2020 at 9:56 PM J X <batanun@hotmail.com> wrote:
>
> Hi,
>
> I'm currently setting up Varnish for a project, and the grace feature together with health checks/probes seems to be a great savior when working with servers that might misbehave. But I'm not really sure I understand how to actually achieve that, since the example doesn't quite make sense to me:
>
> https://varnish-cache.org/docs/trunk/users-guide/vcl-grace.html
>
> See the section "Misbehaving servers". There the example does "set beresp.grace = 24h" in vcl_backend_response, and "set req.grace = 10s" in vcl_recv, if the backend is healthy. But since vcl_recv is run before vcl_backend_response, doesn't that mean that the 10s grace value of vcl_recv is overwritten by the 24h value in vcl_backend_response?

Not really, it's actually the other way around. The beresp.grace
variable defines how long you may serve an object past its TTL once it
enters the cache.

Subsequent requests can then limit grace mode, so think of req.grace
as a req.max_grace variable (which maybe hints that it should have
been called that in the first place).
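For reference, the pattern from that documentation section looks roughly like this sketch (std.healthy() comes from vmod_std, and req.backend_hint must point at a backend with a probe):

```vcl
vcl 4.1;

import std;

sub vcl_backend_response {
    # Objects may be served up to 24h past their TTL; how much of
    # that is actually used is capped per request below.
    set beresp.grace = 24h;
}

sub vcl_recv {
    if (std.healthy(req.backend_hint)) {
        # Healthy backend: tolerate at most 10s of staleness.
        set req.grace = 10s;
    }
    # Sick backend: req.grace is left alone, so the full 24h of
    # object grace may be used.
}
```

So the 24h in vcl_backend_response doesn't overwrite the 10s; it is the upper bound that the per-request cap applies to.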

> Also... there is always a risk of some URLs suddenly giving a 500 error (or a timeout) while the probe still returns 200. Is it possible to have Varnish behave more or less as if the backend is sick, but just for those URLs? Basically I would like this logic:
>
> If healthy content exists in the cache:
> 1. Return the cached (and potentially stale) content to the client
> 2. Increase the ttl and/or grace, to keep the healthy content longer
> 3. Only do a bg-fetch if a specified time has passed since the last attempt (let's say 5s), to avoid hammering the backend
>
> If non-healthy content (i.e. a cached 500 error) exists in the cache:
> 1. Return the cached 500 content to the client
> 2. Only do a bg-fetch if a specified time has passed since the last attempt (let's say 5s), to avoid hammering the backend

What you are describing is stale-if-error, something we don't support
but could be approximated with somewhat convoluted VCL. It used to be
easier when Varnish had saint mode built-in because it generally
resulted in less convoluted VCL.

It's not something I would recommend attempting today.

> If no content exists in the cache:
> 1. Perform a synchronous fetch
> 2. If the result is a 500 error, cache it with, let's say, ttl = 5s
> 3. Otherwise, cache it with a longer ttl
> 4. Return the result to the client
>
> Is this possible with the community edition of Varnish?

You can do that with plain VCL, but even better, teach your backend to
inform Varnish how to handle either case with the Cache-Control
response header.
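Varnish already derives the TTL from s-maxage/max-age in Cache-Control, so the backend can send a short max-age with error responses. A VCL-only fallback might look like this sketch (the 5s/1h values are illustrative, not recommendations):

```vcl
sub vcl_backend_response {
    if (beresp.status >= 500) {
        # Cache backend errors briefly: a misbehaving URL is not
        # hammered, but recovery happens within seconds.
        set beresp.ttl = 5s;
        set beresp.grace = 0s;
    } else if (!beresp.http.Cache-Control) {
        # No caching policy from the backend: use a longer default.
        set beresp.ttl = 1h;
    }
}
```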

Dridi
_______________________________________________
varnish-misc mailing list
varnish-misc@varnish-cache.org
https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc
Re: Grace and misbehaving servers
Hi Dridi,

On Monday, March 16, 2020 9:58 AM Dridi Boukelmoune <dridi@varni.sh> wrote:

> Not really, it's actually the other way around. The beresp.grace
> variable defines how long you may serve an object past its TTL once it
> enters the cache.
>
> Subsequent requests can then limit grace mode, so think of req.grace
> as a req.max_grace variable (which maybe hints that it should have
> been called that in the first place).

OK. So beresp.grace mainly affects how long the object can stay in the cache? And if ttl + grace + keep is a low value set in vcl_backend_response, then vcl_recv is limited in how high the grace can be?

And req.grace doesn't affect the time that the object is in the cache? Even if req.grace is set to a low value on the very first request (i.e. the same request that triggers the call to the backend)?


> What you are describing is stale-if-error, something we don't support
> but could be approximated with somewhat convoluted VCL. It used to be
> easier when Varnish had saint mode built-in because it generally
> resulted in less convoluted VCL.
>
> It's not something I would recommend attempting today.

That's strange. This stale-if-error sounds like something pretty much everyone would want, right? I mean, if there is stale content available, why show an error page to the end user?

But maybe it was my wish to "cache/remember" previous failed fetches that made it complicated? So if I loosen the requirements/wish-list a bit, into this:

Assuming that:
* A request comes in to Varnish
* The content is stale, but still in the cache
* The backend is considered healthy
* The short (10s) grace has expired
* Varnish triggers a synchronous fetch to the backend
* This fetch fails (timeout or 5xx error)

I would then like Varnish to:
* Return the stale content

Would this be possible using basic Varnish community edition, without a "convoluted VCL", as you put it? Is it possible without triggering a restart of the request? Either way, I am interested in hearing how it can be achieved. Is there any documentation or blog post that mentions this? Or could you give me some example code, perhaps? Even a convoluted example would be OK by me.

Increasing the req.grace value for every request is not an option, since we only want to serve old content if Varnish can't get hold of new content. And some of our pages are visited very rarely, so we can't rely on a constant stream of visitors keeping the content fresh in the cache.

Regards
Re: Grace and misbehaving servers
On Tue, Mar 17, 2020 at 8:06 PM Batanun B <batanun@hotmail.com> wrote:
>
> Hi Dridi,
>
> On Monday, March 16, 2020 9:58 AM Dridi Boukelmoune <dridi@varni.sh> wrote:
>
> > Not really, it's actually the other way around. The beresp.grace
> > variable defines how long you may serve an object past its TTL once it
> > enters the cache.
> >
> > Subsequent requests can then limit grace mode, so think of req.grace
> > as a req.max_grace variable (which maybe hints that it should have
> > been called that in the first place).
>
> OK. So beresp.grace mainly affects how long the object can stay in the cache? And if ttl + grace + keep is a low value set in vcl_backend_response, then vcl_recv is limited in how high the grace can be?

Not quite!

ttl+grace+keep defines how long an object may stay in the cache
(barring any form of invalidation).

The grace I'm referring to is beresp.grace, it defines how long we
might serve a stale object while a background fetch is in progress.

> And req.grace doesn't affect the time that the object is in the cache? Even if req.grace is set to a low value on the very first request (i.e. the same request that triggers the call to the backend)?

Right, req.grace only defines the maximum staleness tolerated by a
client. So if backend selection happens on the client side (in
vcl_recv), you can for example adjust that maximum based on the
health of the backend.

> > What you are describing is stale-if-error, something we don't support
> > but could be approximated with somewhat convoluted VCL. It used to be
> > easier when Varnish had saint mode built-in because it generally
> > resulted in less convoluted VCL.
> >
> > It's not something I would recommend attempting today.
>
> That's strange. This stale-if-error sounds like something pretty much everyone would want, right? I mean, if there is stale content available, why show an error page to the end user?

As always in such cases it's not black or white. Depending on the
nature of your web traffic you may want to lean towards always serving
something, or towards never serving anything stale. For example, live
"real time" traffic may favor failing some requests over serving stale
data.

Many users want stale-if-error, but it's not trivial, and it needs to
be balanced against other aspects like performance.

> But maybe it was my wish to "cache/remember" previous failed fetches that made it complicated? So if I loosen the requirements/wish-list a bit, into this:
>
> Assuming that:
> * A request comes in to Varnish
> * The content is stale, but still in the cache
> * The backend is considered healthy
> * The short (10s) grace has expired
> * Varnish triggers a synchronous fetch to the backend
> * This fetch fails (timeout or 5xx error)
>
> I would then like Varnish to:
> * Return the stale content

I agree that on paper it sounds simple, but in practice it might be
harder to get right.

For example, "add HTTP/3 support" is a simple statement, but the work
it implies can be orders of magnitude more complicated. And
stale-if-error is one of those tricky features: tricky for
performance, and it must not break existing VCL, etc.

> Would this be possible using basic Varnish community edition, without a "convoluted VCL", as you put it? Is it possible without triggering a restart of the request? Either way, I am interested in hearing about how it can be achieved. Is there any documentation or blog post that mentions this? Or can you give me some example code perhaps? Even a convoluted example would be OK by me.

I wouldn't recommend stale-if-error at all today, as I said in my first reply.

> Increasing the req.grace value for every request is not an option, since we only want to serve old content if Varnish can't get hold of new content. And some of our pages are visited very rarely, so we can't rely on a constant stream of visitors keeping the content fresh in the cache.

Is it hurting you that less frequently requested content doesn't stay
in the cache?

Another option is to give Varnish a high TTL (and give clients a lower
TTL) and trigger a form of invalidation directly from the backend when
you know a resource changed.
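Such a split between the internal and the advertised TTL can be sketched like this (values illustrative):

```vcl
sub vcl_backend_response {
    # Keep objects in Varnish for a long time and rely on
    # purges/bans from the backend for invalidation.
    set beresp.ttl = 24h;
}

sub vcl_deliver {
    # Tell clients (and downstream caches) to revalidate sooner.
    set resp.http.Cache-Control = "max-age=60";
}
```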

Dridi
Re: Grace and misbehaving servers
On Thu, Mar 19, 2020 at 11:12 AM Dridi Boukelmoune <dridi@varni.sh> wrote:
>
> Not quite!
>
> ttl+grace+keep defines how long an object may stay in the cache
> (barring any form of invalidation).
>
> The grace I'm referring to is beresp.grace,

Well, when I wrote "if ttl + grace + keep is a low value set in vcl_backend_response", I was talking about beresp.grace, as in beresp.ttl + beresp.grace + beresp.keep.


> it defines how long we might serve a stale object while a background fetch is in progress.

I'm not really seeing how that is different from what I said. If beresp.ttl + beresp.grace + beresp.keep is 10s in total, then a req.grace of say 24h wouldn't do much good, right? Or maybe I just misunderstood what you were saying here.


> As always in such cases it's not black or white. Depending on the
> nature of your web traffic you may want to put the cursor on always
> serving something, or never serving something stale. For example, live
> "real time" traffic may favor failing some requests over serving stale
> data.

Well, I was thinking of the typical "regular" small/medium website, like blogs, corporate profiles, small-town news, etc.


> I agree that on paper it sounds simple, but in practice it might be
> harder to get right.

OK. But what if I implemented it in this way, in my VCL?

* In vcl_backend_response, set beresp.grace to 72h if status < 400
* In vcl_backend_error and vcl_backend_response (when status >= 500), return (abandon)
* In vcl_synth, restart the request, with a special req header set
* In vcl_recv, if this req header is present, set req.grace to 72h

Wouldn't this work? If no, why? If yes, would you say there is something else problematic with it? Of course I would have to handle some special cases, and maybe check req.restarts and such, but I'm talking about the thought process as a whole here. I might be missing something, but I think I would need someone to point it out to me because I just don't get why this would be wrong.


> Is it hurting you that less frequently requested contents don't stay
> in the cache?

If it results in people seeing error pages when stale content would be perfectly fine for them, then yes.

And these less frequently requested pages might still be part of a group of pages that all result in an error in the backend (while the health probe still returns 200 OK). So while one individual page might be visited infrequently, the total number of visits on these kinds of pages might be high.

Let's say that there are 3,000 unique (and cacheable) pages that are visited during an average weekend. All of these are in the Varnish cache, but 2,000 of them have stale content. Now let's say that 50% of all pages start returning 500 errors from the backend on a Friday evening. That would mean that roughly 1,000 of these stale pages would result in an error being displayed to the end users during that weekend. I would much prefer that Varnish still serve them stale content, and then I could look into the problem on Monday morning.


> Another option is to give Varnish a high TTL (and give clients a lower
> TTL) and trigger a form of invalidation directly from the backend when
> you know a resource changed.

Well, that is perfectly fine for pages that have a one-to-one mapping between the page (i.e. the URL) and the updated content. But most pages in our setup contain a mix of multiple pieces of content, and it is not possible to know beforehand whether a specific piece of content will contribute to the result of a specific page. That is especially true for new content that might be included in multiple pages already in the cache.

The only way to handle that in a foolproof way, as far as I can tell, is to invalidate all pages (since any page can contain this kind of content) the moment any object is updated. But that would pretty much clear the cache constantly. And we would still have to handle the case where the cache is invalidated for a page that gives a 500 error when Varnish tries to fetch it.
Re: Grace and misbehaving servers
Hi,

On Fri, Mar 20, 2020 at 10:14 PM Batanun B <batanun@hotmail.com> wrote:
>
> On Thu, Mar 19, 2020 at 11:12 AM Dridi Boukelmoune <dridi@varni.sh> wrote:
> >
> > Not quite!
> >
> > ttl+grace+keep defines how long an object may stay in the cache
> > (barring any form of invalidation).
> >
> > The grace I'm referring to is beresp.grace,
>
> Well, when I wrote "if ttl + grace + keep is a low value set in vcl_backend_response", I was talking about beresp.grace, as in beresp.ttl + beresp.grace + beresp.keep.
>
>
> > it defines how long we might serve a stale object while a background fetch is in progress.
>
> I'm not really seeing how that is different from what I said. If beresp.ttl + beresp.grace + beresp.keep is 10s in total, then a req.grace of say 24h wouldn't do much good, right? Or maybe I just misunderstood what you were saying here.

Or maybe *I* just misunderstood your understanding :)

> > As always in such cases it's not black or white. Depending on the
> > nature of your web traffic you may want to put the cursor on always
> > serving something, or never serving something stale. For example, live
> > "real time" traffic may favor failing some requests over serving stale
> > data.
>
> Well, I was thinking of the typical "regular" small/medium website, like blogs, corporate profiles, small-town news, etc.
>
>
> > I agree that on paper it sounds simple, but in practice it might be
> > harder to get right.
>
> OK. But what if I implemented it in this way, in my VCL?
>
> * In vcl_backend_response, set beresp.grace to 72h if status < 400
> * In vcl_backend_error and vcl_backend_response (when status >= 500), return (abandon)
> * In vcl_synth, restart the request, with a special req header set
> * In vcl_recv, if this req header is present, set req.grace to 72h
>
> Wouldn't this work? If no, why? If yes, would you say there is something else problematic with it? Of course I would have to handle some special cases, and maybe check req.restarts and such, but I'm talking about the thought process as a whole here. I might be missing something, but I think I would need someone to point it out to me because I just don't get why this would be wrong.

For starters, there is currently no way to know for sure that you
entered vcl_synth because of a return(abandon) transition. There are
plans to make that possible, but for now you can only infer it with
less than 100% confidence.

A problem with the restart logic is the race it opens since you now
have two lookups, but overall, that's the kind of convoluted VCL that
should work. The devil might be in the details.
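A sketch of that restart dance, with the caveats above (X-Stale-Retry is a made-up header name, and since vcl_synth cannot reliably tell it was reached via return(abandon), this keys off any 503):

```vcl
sub vcl_recv {
    if (req.http.X-Stale-Retry) {
        # Second pass after a failed fetch: accept very stale content.
        set req.grace = 72h;
    } else {
        set req.grace = 10s;
    }
}

sub vcl_backend_response {
    if (beresp.status < 400) {
        set beresp.grace = 72h;
    }
    if (beresp.status >= 500) {
        return (abandon);
    }
}

sub vcl_backend_error {
    return (abandon);
}

sub vcl_synth {
    if (resp.status == 503 && req.restarts == 0) {
        set req.http.X-Stale-Retry = "1";
        return (restart);
    }
}
```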

> > Is it hurting you that less frequently requested contents don't stay
> > in the cache?
>
> If it results in people seeing error pages when stale content would be perfectly fine for them, then yes.
>
> And these less frequently requested pages might still be part of a group of pages that all result in an error in the backend (while the health probe still returns 200 OK). So while one individual page might be visited infrequently, the total number of visits on these kinds of pages might be high.
>
> Let's say that there are 3,000 unique (and cacheable) pages that are visited during an average weekend. All of these are in the Varnish cache, but 2,000 of them have stale content. Now let's say that 50% of all pages start returning 500 errors from the backend on a Friday evening. That would mean that roughly 1,000 of these stale pages would result in an error being displayed to the end users during that weekend. I would much prefer that Varnish still serve them stale content, and then I could look into the problem on Monday morning.

In this case you might want to combine your VCL restart logic with
vmod_saintmode.

https://github.com/varnish/varnish-modules/blob/6.0-lts/docs/vmod_saintmode.rst#vmod_saintmode

This VMOD allows you to create circuit breakers for individual
resources for a given backend. That will result in more complicated
VCL, but it will help you mark individual resources as sick, making
the "special req header" redundant. And since vmod_saintmode marks
resources sick for a given time, it means that not all individual
clients will go through the complete restart dance during that window.

I think you may still have to do a restart in vcl_miss because only
then will you know the saint-mode health (you need both a backend and
a hash).
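Roughly, following the vmod's documentation (the threshold of 10 objects and the 5s blacklist duration are illustrative):

```vcl
vcl 4.0;

import saintmode;

backend default { .host = "192.0.2.10"; }

sub vcl_init {
    # Consider the whole backend sick once 10+ objects are blacklisted.
    new sm = saintmode.saintmode(default, 10);
}

sub vcl_backend_fetch {
    set bereq.backend = sm.backend();
}

sub vcl_backend_response {
    if (beresp.status >= 500) {
        # Mark this particular object as sick for 5s.
        saintmode.blacklist(5s);
        return (retry);
    }
}
```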

> > Another option is to give Varnish a high TTL (and give clients a lower
> > TTL) and trigger a form of invalidation directly from the backend when
> > you know a resource changed.
>
> Well, that is perfectly fine for pages that have a one-to-one mapping between the page (i.e. the URL) and the updated content. But most pages in our setup contain a mix of multiple pieces of content, and it is not possible to know beforehand whether a specific piece of content will contribute to the result of a specific page. That is especially true for new content that might be included in multiple pages already in the cache.
>
> The only way to handle that in a foolproof way, as far as I can tell, is to invalidate all pages (since any page can contain this kind of content) the moment any object is updated. But that would pretty much clear the cache constantly. And we would still have to handle the case where the cache is invalidated for a page that gives a 500 error when Varnish tries to fetch it.

And you might solve this problem with vmod_xkey!

https://github.com/varnish/varnish-modules/blob/6.0-lts/docs/vmod_xkey.rst#vmod_xkey

You need help from the backend to communicate a list of "abstract
identifiers" of "things" that contribute to a response. This way if a
change in your backend spans multiple responses you can still perform
a single invalidation to affect them all.
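With the backend emitting a header such as "xkey: article-42 author-7" on each response, a purge endpoint in VCL can then invalidate by tag (sketch close to the vmod's documented example; the xkey-purge request header name is a convention, and access control is omitted):

```vcl
import xkey;

sub vcl_recv {
    if (req.method == "PURGE" && req.http.xkey-purge) {
        # Purge every cached object tagged with any of these keys.
        set req.http.n-gone = xkey.purge(req.http.xkey-purge);
        return (synth(200, "Invalidated " + req.http.n-gone + " objects"));
    }
}
```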

Dridi
Re: Grace and misbehaving servers
On Mon, Mar 23, 2020 at 10:00 AM Dridi Boukelmoune <dridi@varni.sh> wrote:
>
> For starters, there currently is no way to know for sure that you
> entered vcl_synth because of a return(abandon) transition. There are
> plans to make it possible, but currently you can do that with
> confidence lower than 100%.

I see. I actually had a feeling about that, since I didn't see an obvious way to pass that kind of information into vcl_synth when triggered by an abandon.

Although just having a general rule there to restart any 500 response, regardless of what caused it, is not really that bad anyway.


> A problem with the restart logic is the race it opens since you now
> have two lookups, but overall, that's the kind of convoluted VCL that
> should work. The devil might be in the details.

Could you describe this race condition that you mean can happen? What could the worst case scenario be? If it is just a guru meditation for this single request, and it happens very rarely, then that is something I can live with. If it is something that can cause Varnish to crash or hang, then it is not something I can live with :)


> In this case you might want to combine your VCL restart logic with
> vmod_saintmode.

Yes, I have already heard some things about this vmod. I will definitely look into it. Thanks.


> And you might solve this problem with vmod_xkey!

We actually already use this vmod. But like I said, it doesn't solve the problem with new content that affects existing pages. Several pages might, for example, include information about the latest objects created in the system. If one of those pages was loaded and cached at time T1, and then at T2 a new object O2 was created, an "xkey purge" with the key "O2" will have no effect, since that page was not associated with the "O2" key at T1 because O2 didn't exist yet.

And since there is no way to know beforehand which pages these are, the only bulletproof way I can see of handling this is to purge all pages* any time any content is updated.

* or at least a large subset of all pages, since the vast majority might include something related to newly created objects
Re: Grace and misbehaving servers
> > A problem with the restart logic is the race it opens since you now
> > have two lookups, but overall, that's the kind of convoluted VCL that
> > should work. The devil might be in the details.
>
> Could you describe this race condition that you mean can happen? What could the worst case scenario be? If it is just a guru meditation for this single request, and it happens very rarely, then that is something I can live with. If it is something that can cause Varnish to crash or hang, then it is not something I can live with :)

In general, by the time you get to the second lookup, the state of
the cache may have changed. An object may go away in between, in which
case a restart would cause unnecessary processing, likely leading to
an additional failing fetch.

Using a combination of saint mode and req.grace to emulate
stale-if-error could in theory lead to something simpler.

At least it would if this change landed one way or the other:

https://github.com/varnishcache/varnish-cache/issues/3259

> > In this case you might want to combine your VCL restart logic with
> > vmod_saintmode.
>
> Yes, I have already heard some things about this vmod. I will definitely look into it. Thanks.

It used to be a no-brainer with Varnish 3, when saint mode was part of VCL...

> > And you might solve this problem with vmod_xkey!
>
> We actually already use this vmod. But like I said, it doesn't solve the problem with new content that affects existing pages.

Oh, now I get it! That's an interesting limitation I don't think I
ever considered. I will give it some thought!

> Several pages might for example include information about the latest objects created in the system. If one of these pages were loaded and cached at time T1, and then at T2 a new object O2 was created, an "xkey purge" with the key "O2" will have no effect since that page was not associated with the "O2" key at time T1, because O2 didn't even exist then.
>
> And since there is no way to know beforehand which pages these are, the only bulletproof way I can see of handling this is to purge all pages* any time any content is updated.
>
> * or at least a large subset of all pages, since the vast majority might include something related to newly created objects

You can always use vmod_xkey to broadly tag responses. An example
I like to use to illustrate this is tagging a response as "article".
If you change the template for articles, you know you can [soft]
purge them all at once.
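So a response can carry both narrow and broad tags ("xkey: article-42 article"), and after e.g. a template change a soft purge of the broad tag expires everything at once while grace keeps the stale copies servable (sketch; the xkey-softpurge header name is a convention):

```vcl
import xkey;

sub vcl_recv {
    if (req.method == "PURGE" && req.http.xkey-softpurge) {
        # Soft purge: TTL drops to 0 but objects stay within
        # grace/keep, so stale copies can still be served.
        set req.http.n-gone = xkey.softpurge(req.http.xkey-softpurge);
        return (synth(200, "Expired " + req.http.n-gone + " objects"));
    }
}
```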

That doesn't solve invalidation using keys not (yet) known to the
cache, but my take would be that if my application can know that, it
should be able to invalidate the individual resources affected by
their new key (I'm aware it's not always that easy).

Dridi