Mailing List Archive

Varnish returning synthetic 500 error even though it has stale content it should serve. But only seems to happen during/after a burst of traffic
Hi,

One of our websites usually has quite a low but steady stream of visitors, but occasionally we get a sudden surge of requests over a very short time period (about 1-2 minutes). Varnish seems to handle the load fine, but the backends struggle with this. But I have noticed that Varnish doesn't serve the stale cached data, but instead shows our synthetic 500 page. This is true even for the start page, which definitely existed in the cache. And we have a grace period of 10 days, so it's quite annoying that we can't simply serve the stale cached data during this short period.

I have tried picturing the entire flow, following the logic in the VCL, but I can't see what we are doing wrong. And annoyingly I can't reproduce the problem locally by simply shutting down the backends (or setting them to unhealthy), because whenever I do that I get the stale content served just as intended. Could the sheer volume itself cause this, making it impossible to reproduce by simply fetching the page a few times in the browser before and after disabling the backends? Or is there some edge case that I haven't thought of that is causing this?

A simplified version of our VCL is included below, with only the relevant parts. But unless I have some blatant problem with the VCL, I think it would be good if I learned how to troubleshoot this using the Varnish tools, like varnishlog, so that next time this happens I can use varnishlog etc. to see what's happening.

Is it possible, using varnishlog, to find the very first request for a specific path ("/" in our case) where it returned the synthetic 500 and put it in the cache? And is it also possible to find the request just before that one, for the same path? If I could extract those two requests (including all the metadata in those transactions) from the jungle of thousands of requests, then maybe I can find some explanation for why the second request doesn't return the stale data.

-----------------------------

import std; // std.healthy() and std.fileread() below need vmod_std.

sub vcl_hit {
    if (obj.ttl > 0s) {
        // Regular cache hit.
        return (deliver);
    } else if (req.restarts == 0 && std.healthy(req.backend_hint)) {
        // Graced cache hit, first attempt.
        // Force a cache miss to trigger the fetch in the foreground (i.e. a synchronous fetch).
        set req.http.X-forced-miss = "true";
        return (miss);
    } else {
        // Graced cache hit, previous attempts failed (or backend unhealthy).
        // Let the fetch happen in the background (i.e. an asynchronous fetch), and return the cached value.
        return (deliver);
    }
}

sub vcl_recv {
    if (req.http.X-cache-pass == "true") {
        return (pass);
    }

    set req.grace = 240h; // 10 day grace
}

sub vcl_backend_response {
    if (bereq.http.X-cache-pass != "true") {
        if (beresp.status < 400) {
            set beresp.grace = 240h;
            // Cache invalidation in the form of xkey softpurge can put objects into grace before the TTL has passed.
            set beresp.ttl = 30m;
        } else if (beresp.status < 500) {
            set beresp.ttl = 1m;
            return (deliver);
        } else if (beresp.status >= 500) {
            // In some cases we want to abandon the backend request on 500 errors, since it would
            // otherwise overwrite the cached object that is still useful for grace.
            // This will make it jump to vcl_synth with a 503 status. There it will restart the request.
            if (bereq.is_bgfetch) {
                // In grace period. Abandon the 5xx request, since it would otherwise overwrite
                // the cached object that is still useful for grace.
                return (abandon);
            } else if (bereq.http.X-forced-miss == "true") {
                return (abandon);
            }
            // Not a background fetch, i.e. no grace period (and no stale content available).
            // Cache the error page for a few seconds.
            set beresp.ttl = 15s;
            return (deliver);
        }
    }
}

sub vcl_synth {
    if (req.http.X-forced-miss == "true" && resp.status >= 500) {
        return (restart);
    }
}

sub vcl_backend_error {
    if (bereq.is_bgfetch || bereq.http.X-forced-miss == "true") {
        // By calling abandon, it will jump to vcl_synth with a 503 status. There it will restart the request.
        return (abandon);
    }
    set beresp.http.Content-Type = "text/html; charset=utf-8";
    set beresp.ttl = 15s;
    synthetic(std.fileread("/etc/varnish/500.html"));
    return (deliver);
}

-----------------------------
Re: Varnish returning synthetic 500 error even though it has stale content it should serve. But only seems to happen during/after a burst of traffic
Could be a Vary issue. Can you post the output of `varnishlog -d -g request
-q 'RespStatus eq 500'`? Please anonymize the bits you don't want us to see
by replacing them with "XXXXXX", but don't simply remove them.

In the meantime, here's a cheat sheet to get started with varnishlog:
https://docs.varnish-software.com/tutorials/vsl-query/

--
Guillaume Quintard


Re: Varnish returning synthetic 500 error even though it has stale content it should serve. But only seems to happen during/after a burst of traffic
> Could be a vary issue

Ah, I remember hearing about that earlier, and made a mental note to read up on that. But I forgot all about it. Now I did just that, and boy was that a cold shower for me! I definitely need to unset that header. But why, for the love of all that is holy, does Varnish not include the vary-data in the hash? Why isn't the hash the _single_ key used when storing and looking up data in the cache? Why does Varnish hide this stuff from us?

However, when checking the Vary header from the backend, it is set to "Accept-Encoding". And since I haven't changed anything in my browser, it should send the same "Accept-Encoding" request header whenever I surf the website. And since I have visited the start page multiple times in the last 10 days, there should be a cached version of it matching my "Accept-Encoding".

> can you post the output of `varnishlog -d -g request -q 'RespStatus eq 500'`?

Well, that gives me nothing that is relevant here, sadly. The last time this happened was a few days ago, and the buffer doesn't seem to be big enough to keep data that far back.

But maybe you could describe what you would look for? I would love to learn how to troubleshoot this.

> In the meantime, here's a cheat sheet to get started with varnishlog:
> https://docs.varnish-software.com/tutorials/vsl-query/

Thanks, although most of that stuff I already knew. And it doesn't really give any more advanced examples, like the problem I mentioned earlier. I really would like to know if it is possible to find the first request where it served the 500 page for the "/" URL, as well as the request just before that, for the same URL. Do you know how to construct a query that gives me that?
Re: Varnish returning synthetic 500 error even though it has stale content it should serve. But only seems to happen during/after a burst of traffic
> Ah, I remember hearing about that earlier, and made a mental note to read
> up on that. But I forgot all about it. Now I did just that, and boy was
> that a cold shower for me! I definitely need to unset that header. But why,
> for the love of all that is holy, does Varnish not include the vary-data in
> the hash? Why isn't the hash the _single_ key used when storing and looking
> up data in the cache? Why does Varnish hide this stuff from us?

Easy one: you build the key from the request, while the Vary header is a
response property. Therefore, Varnish can't put the varied headers into the
hash, because it doesn't have that information yet.
Basically, the Vary header is backend information that allows Varnish to
properly differentiate objects with the same hash, and the "hash collision"
happens because you, the VCL writer, were not specific enough in vcl_hash
(not necessarily your fault, though).
(Technically, and to avoid the wrath of Varnish veterans: it's not a hash
collision, because the variants all represent the same resource and the
varied headers are just secondary keys.)
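If you want to double-check what the backend actually varies on, and what hit/miss decisions Varnish makes for a given page, something along these lines should do it (just a sketch; I'm assuming you only care about the front page, so adjust the URL filter):

```
# Backend side: show the Vary header the backend sends for "/":
varnishlog -b -q 'BereqURL eq "/"' -I 'BerespHeader:Vary'

# Client side: show hit/miss decisions for "/":
varnishlog -c -g request -q 'ReqURL eq "/"' -i Hit,HitMiss,VCL_call,VCL_return
```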

> However, when checking the Vary header from the backend, it is set to
> "Accept-Encoding". And since I haven't changed anything in my browser, it
> should send the same "Accept-Encoding" request header whenever I surf the
> website. And since I have visited the start page multiple times in the last 10
> days, there should be a cached version of it matching my "Accept-Encoding".

Don't touch that guy! Varnish will ignore "accept-encoding" in "vary"
because it handles compression internally, and always forces
"accept-encoding: gzip" before entering vcl_backend_fetch. If your VCL
mentions accept-encoding, it's almost always wrong.
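If you want to see that normalization for yourself, this should show what Varnish actually sends to the backend (a sketch, backend side only):

```
# Varnish normalizes the client's Accept-Encoding before fetching:
varnishlog -b -I 'BereqHeader:Accept-Encoding'
```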

> Well, that gives me nothing that is relevant here, sadly. The last time
> this happened was a few days ago, and the buffer doesn't seem to be big
> enough to keep data that far back.
> But maybe you could describe what you would look for? I would love to
> learn how to troubleshoot this.

The default VSL space is 80MB, which is "only" worth a few (tens of)
thousand requests, so yeah, it can be a short backlog. You can instead
start logging to a file:

> varnishlog -g request -q 'RespStatus eq 500' -w log.bin

Once the file grows, you can start looking at it using "varnishlog -r
log.bin".
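And once you have a file, you can keep narrowing it down when reading it back, for example (assuming the front page is what you care about):

```
# Only the front-page transactions that ended in a 500:
varnishlog -r log.bin -g request -q 'ReqURL eq "/" and RespStatus eq 500'
```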

> Thanks, although most of that stuff I already knew. And it doesn't really
> give any more advanced examples, like the problem I mentioned earlier. I
> really would like to know if it is possible to find the first request where
> it served the 500 page for the "/" URL, as well as the request just before
> that, for the same URL. Do you know how to construct a query that gives me
> that?

The problem is that you have to be recording when the initial request goes
through. But if you were recording at the time, cache hits will show the VXID
of that first request in their "X-Varnish" header, and you can find it this way
("varnishlog -r log.bin -q 'vxid == THE_VXID'").
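To make that concrete (the IDs below are made up): on a cache hit the X-Varnish response header carries two IDs, the current request and the request that put the object into the cache, and the second one is what you feed back into the query:

```
# Example header on a cache hit (illustrative values):
#   X-Varnish: 32773 32769
varnishlog -r log.bin -q 'vxid == 32769'
```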

Hope that helps

--
Guillaume Quintard


Re: Varnish returning synthetic 500 error even though it has stale content it should serve. But only seems to happen during/after a burst of traffic
> Easy one: you build the key from the request, while the vary header is a response property. Therefore, Varnish can't know to put the varied headers into the hash, because it doesn't have that information yet.

Of course! Makes perfect sense. But the documentation should make this crystal clear, if you ask me. A page titled "How is the cache key calculated?" could explain everything related to this: how the hash works, how the default implementation works, how to modify it, how the backend response can still cause multiple different versions to be stored (the Vary header, and maybe other so-called "secondary keys"?), and how to modify that behavior. I also think it would make sense if that documentation mentioned how all this information can be debugged using varnishlog or other tools.

Is there even an official word for this final "cache key"? "Hash" clearly isn't specific enough. I'm talking about a word that refers to the unique key that always corresponds to only a _single_ version of a cached object.


>Don't touch that guy! Varnish will ignore "accept-encoding" in "vary" because it handles compression internally, and always forces "accept-encoding: gzip" before entering vcl_backend_fetch. If your VCL mentions accept-encoding, it's almost always wrong.

Sorry, I'm confused now... Don't touch _which_ guy? Our VCL doesn't contain anything regarding "Accept-Encoding". All I said was that the Vary header in the response from the backend is "Accept-Encoding". And the way I see it, this shouldn't be the cause of the strange problems we are seeing, since even when factoring in this, there should exist a matching cached object for me, and it should be served regardless of TTL or backend health as long as the grace hasn't expired (which it hasn't). Or is my reasoning flawed here, based on the VCL snippet in my original post? Can you think of a scenario where our VCL would return the synthetic 500 page even when there exists a cached object matching the hash and Vary logic?


> The default VSL space is 80MB, which is "only" worth a few (tens of) thousands requests, so yeah, it can be a short backlog.

Yeah, I think 80MB is a bit too small for us. Ideally we should be able to sit down on a Monday and troubleshoot problems that occurred Friday evening, but that might require far too large a VSL space. But a few hundred MB should be fine.
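If I understand the docs correctly, that size is the vsl_space parameter (or the -l option to varnishd), so bumping it would be something like this (the 500M value is just an example, and I believe it needs a restart to take effect):

```
# Check the current VSL size (defaults to 80M):
varnishadm param.show vsl_space

# Then start varnishd with e.g. "-p vsl_space=500M" (or "-l 500M").
```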


> The problem is that you have to be recording when the initial request goes through. But if you were recording at the time, cache hits will show the VXID of that first request in their "X-Varnish" header, and you can find it this way ("varnishlog -r log.bin -q 'vxid == THE_VXID'")

Well, would it really be a cache hit? The main transaction I'm looking for is the first transaction for a specific path (in this case, "/") where Varnish served the synthetic 500 page. And then I would also like to see the closest preceding transaction for that same page, where the hash (and the Vary logic) matches the main transaction mentioned above.

Picture for example this scenario:

1. Request for "/", with a certain hash and "secondary key" (from the "Accept-Encoding"). Data is fetched from the backend and stored in the cache.
2. Lots of other, irrelevant requests, including requests for "/" but with a different hash and/or "Accept-Encoding"
3. Request for "/" that uses the cached data from point 1 above
4. Lots of other, irrelevant requests, same as 2 above, but now in such big numbers that the backend servers start having problems
5. Request for "/" with same hash and/or "Accept-Encoding" as 1 above, but for some reason Varnish now returns the synthetic 500 page and puts it in the cache

The transaction log for the request in step 5 is the main one I want to find, but also the transaction for step 3, and possibly step 1 as well. But the transaction for step 5 is the most interesting one, and I don't understand how that would be a cache hit while returning a fresh synthetic 500 error page.

Or are you saying that we can find "number 5" above by first finding some transaction later in the log, with a cache hit for the 500 page, and that transaction in turn points to "number 5" since that was the transaction that created the 500 response?

This makes me realize that when looking at past log data, i.e. using the "-d" parameter, or reading from a file, it looks for matching transactions/requests starting from the beginning. But in this kind of troubleshooting I think it could be useful to also be able to search from the end and go backwards. Is that possible? So that if I use the "-" parameter to limit the output to a single transaction, it outputs the _last_ one instead of the first one.
Re: Varnish returning synthetic 500 error even though it has stale content it should serve. But only seems to happen during/after a burst of traffic
On 2021-12-17 13:16:42, Batanun B wrote:
> > The default VSL space is 80MB, which is "only" worth a few (tens of) thousands requests, so yeah, it can be a short backlog.
>
> Yeah, I think 80MB is a bit too small for us. Ideally we should be able to sit down on a Monday and troubleshoot problems that occurred Friday evening, but that might require far too large a VSL space. But a few hundred MB should be fine.

I recommend permanent logging, if you want to be able to debug older incidents.
We do it like this:

```
/usr/bin/varnishlog -w /var/log/varnish/varnish-500.log -a -D \
-P /var/run/varnishlog-500.pid \
-q 'RespStatus >= 500 or BerespStatus >= 500'
```

I've attached our systemd unit file, in case you are interested.
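In case the attachment gets stripped by the list, a minimal unit for this kind of setup could look roughly like the following (paths and unit names are just an illustration, not our exact file):

```
[Unit]
Description=varnishlog 5xx recorder
After=varnish.service
Requires=varnish.service

[Service]
Type=simple
# Run in the foreground under systemd instead of daemonizing with -D/-P.
ExecStart=/usr/bin/varnishlog -a -w /var/log/varnish/varnish-500.log \
    -q 'RespStatus >= 500 or BerespStatus >= 500'
Restart=on-failure

[Install]
WantedBy=multi-user.target
```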

--
Cheers
Marco Dickert
Re: Varnish returning synthetic 500 error even though it has stale content it should serve. But only seems to happen during/after a burst of traffic
> I recommend permanent logging, if you want to be able to debug older incidents.
> We do it like this:
>
> ```
> /usr/bin/varnishlog -w /var/log/varnish/varnish-500.log -a -D \
> -P /var/run/varnishlog-500.pid \
> -q 'RespStatus >= 500 or BerespStatus >=500'
> ```
>
> I've attached our systemd unit file, in case you are interested.

Thanks. I have thought about that too. But I think we might want to include non-error transactions as well. I mean, with the problems this post is about, we want to see when the cached version of the start page was generated and when it was last served from cache successfully. But maybe we could have permanent logging just for the start page, regardless of HTTP status. That should hopefully reduce the logging intensity enough so that logging to disk isn't affecting Varnish performance.

One thing though... If you log all "status: 500+" transactions to disk, isn't there a risk that your logging might exacerbate a situation where your site is overwhelmed with traffic? Where a large load causes your backends to start failing, and that triggers intense logging of those erroneous transactions which might reduce the performance of Varnish, causing more timeouts etc which cause more logging and so on...
Re: Varnish returning synthetic 500 error even though it has stale content it should serve. But only seems to happen during/after a burst of traffic
On 2021-12-17 15:25:31, Batanun B wrote:
> Thanks. I have thought about that too. But I think we might want to include
> non-error transactions as well. I mean, with the problems this post is about
> we want to see when the cached version of the start page was generated and
> when it was last served from cache successfully. But maybe we could have a
> permanent logging just for the start page, regardless of http status. That
> should hopefully reduce the logging intensity enough so that logging to disk
> isn't affecting Varnish performance.

Well, it depends on the performance of your storage and the amount of req/sec on
the front page, but these logs can get very large very quickly. I'd suggest
verifying correct delivery of the front page via external monitoring
(e.g. icinga2 or a simple script). As far as I understand, you don't need to
know the exact request, but rather a rough point in time when the requests
start failing. So a monitoring script which curls every minute should be
sufficient and cause a lot less trouble.
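For illustration (the URL and log path are just placeholders), a tiny script run from cron every minute would already give you the timeline you need:

```
#!/bin/sh
# Append a timestamp and the HTTP status of the front page to a log file.
code=$(curl -s -o /dev/null -w '%{http_code}' https://www.example.com/)
echo "$(date '+%Y-%m-%d %H:%M:%S') $code" >> /var/log/frontpage-status.log
```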

> One thing though... If you log all "status: 500+" transactions to disk, isn't
> there a risk that your logging might exacerbate a situation where your site is
> overwhelmed with traffic? Where a large load causes your backends to start
> failing, and that triggers intense logging of those erroneous transactions
> which might reduce the performance of Varnish, causing more timeouts etc which
> cause more logging and so on...

Indeed there is a risk of self-reinforcing effects, but it hasn't happened yet. We
also do not plan to log 500s forever, only until our problem is solved, which is
an issue with Varnish's memory handling. At the moment, our most concerning 500s
are caused by Varnish itself, stating "Could not get storage", when the
configured memory limit is reached.

--
Kind regards
Marco Dickert

Administration and Technology
evolver services GmbH

Phone +49 / (0)3 71 / 4 00 03 78 24
Fax +49 / (0)3 71 / 4 00 03 79

E-mail marco.dickert@evolver.de
Web https://www.evolver.de

Registered office: Chemnitz
Commercial register: Amtsgericht Chemnitz, HRB 22649
Managing directors: Torsten Gramann and Mathias Möckel
Re: Varnish returning synthetic 500 error even though it has stale content it should serve. But only seems to happen during/after a burst of traffic
> I'd suggest verifying correct delivery of the front page via external monitoring
> (e.g. icinga2 or a simple script). As far as I understand, you don't need to
> know the exact request, but more of a rough point in time of when the requests
> start failing. So a monitoring script which curls every minute should be
> sufficient and causes a lot less trouble.

Well, in our use case we would need the exact requests in order to understand why it happens. It might very well be a bug in our VCL, but I would like the logs to show me something that could help narrow it down.
Re: Varnish returning synthetic 500 error even though it has stale content it should serve. But only seems to happen during/after a burst of traffic
On Fri, Dec 17, 2021 at 5:18 AM Batanun B <batanun@hotmail.com> wrote:

> Is there even an official word for this final "cache key"? "Hash" clearly
> isn't specific enough. I'm talking about a word that refers to the unique
> key that always corresponds to only a _single_ version of a cached object.
>

I don't think there is at the moment

> Sorry, I'm confused now... Don't touch _which_ guy? Our VCL doesn't contain
> anything regarding "Accept-Encoding". All I said was that the Vary header
> in the response from the backend is "Accept-Encoding". And the way I see
> it, this shouldn't be the cause of the strange problems we are seeing,
> since even when factoring in this, there should exist a matching cached
> object for me, and it should be served regardless of TTL or backend health
> as long as the grace hasn't expired (which it hasn't). Or is my reasoning
> flawed here, based on the VCL snippet in my original post? Can you think of
> a scenario where our VCL would return the synthetic 500 page even when
> there exists a cached object matching the hash and Vary logic?
>
As you mentioned that your VCL was simplified, I didn't want to assume
anything. So yes, I meant: do not worry about "accept-encoding", either as
a header, or as an entry in the vary header, Varnish will handle that
properly.

So, the vary hypothesis doesn't pan out. Could it be that your cache size
is too small instead and that the churn is pushing the object out?

> Yeah, I think 80MB is a bit too small for us. Ideally we should be able to
> sit down on a Monday and troubleshoot problems that occurred Friday evening,
> but that might require far too large a VSL space. But a few hundred MB should
> be fine.
>
I would just log on disk, and rotate every few days. You mentioned that
the traffic is fairly low, so the disk usage shouldn't be bad, especially
if the backend is on another server. Varnish won't trip itself by choking
the disk, the worst case scenario is that varnishlog will not write the
file enough and will drop a few transactions.

>
> > The problem is that you have to be recording when the initial request
> > goes through. But if you were recording at the time, cache hits will show the VXID of
> > that first request in their "X-Varnish" header, and you can find it this way
> > ("varnishlog -r log.bin -q 'vxid == THE_VXID'")
>
> Well, would it really be a cache hit? The main transaction I'm looking for
> is the first transaction for a specific path (in this case, "/") where
> Varnish served the synthetic 500 page. And then I would also like to see
> the closest preceding transaction for that same page, where the hash (and
> the Vary logic) matches the main transaction mentioned above.
>

In that case, you must log all the requests that could match, and once you
have found your offender, walk your way up to find the previous request. I
don't think there's another way here.

--
Guillaume Quintard
Re: Varnish returning synthetic 500 error even though it has stale content it should serve. But only seems to happen during/after a burst of traffic

If you get a surge of 5XX responses from either Varnish or the
backend, you can also rate-limit logs to the disk:

https://varnish-cache.org/docs/6.0/reference/varnishlog.html

See the -R option.
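For example (the rate and the file name are just placeholders), combined with Marco's command it could look like this:

```
# Write at most 10 request groups per 10 seconds to disk:
varnishlog -a -w /var/log/varnish/varnish-500.log \
    -R 10/10s -q 'RespStatus >= 500 or BerespStatus >= 500'
```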

Cheers,
Dridi