Varnish attempting to serve disconnected clients
Dear List,

I'm debugging a problem with our Varnish, and our QA folks found an
interesting case.

We found this breadcrumb trail while looking into our Varnishes having an
enormous number of open connections in CLOSE_WAIT state, in a way that
doesn't appear to be ideal connection reuse but rather dead connections that
we need to close out; nonetheless, we're focused on a slightly more specific
sub-issue of that issue right now:

We have noticed that if a request is eligible to be coalesced, but the
impatient client disconnects before the request is served, Varnish
continues to try and serve that request by going to the backend even after
the client is disconnected.

I suspect in our case, we can disable request coalescing, but I didn't want
to miss an opportunity to report a possible bug, or learn something about a
corner of Varnish I don't know well... here's our setup:

- With Varnish 6.6.1
- Given a toy Python script backend which answers `200 OK` to the health
check, but answers all other requests after 10s with an `HTTP 500` error;
[source linked below]
- Given the following config [below]
- started with `/usr/local/sbin/varnishd -n /usr/local/var/varnish -F -f
$PWD/foo.vcl -a test=:21601,HTTP`
- When running `for i in $(seq 10); do curl -m 1 localhost:21601/ & done`
(the ampersand, for backgrounding, is important)
- Varnish makes 1 request to the backend, coalescing the others
- Clients all disconnect thanks to `curl -m 1` (`--max-time`) (or `0.1`,
no difference, naturally)
- The first request completes with `HTTP 500`; Varnish continues to retry
requests for the disconnected clients (logs without health checks below)

In the real set-up, Varnish actually handles only the next request, then the
next, then the next, one every 10 seconds. I didn't take the time to
reproduce that here, as I believe the bug is that Varnish continues to do
work for disconnected clients at all. I guess in my toy we benefit from
hit-for-pass, and in our real-world setup that's not true.

I can somehow imagine this as a feature (populate the cache even though the
client went away), but queueing hundreds or thousands of requests and
nibbling away at them one-by-one after the clients have long since hung up
is causing resource exhaustion for us. We can tune the configs significantly
now that we know the issue, but we'd love some opinionated feedback on what
an idiomatic approach to this would be.

- Are we doing something wrong?
- Should Varnish still go to the backend to serve disconnected clients?
- Is this a bug? Should I report it somewhere more formally and attach the
repro case a bit more diligently?

Warm regards everyone, thanks for Varnish, an active list, and active
support on StackOverflow and such.

[config]
vcl 4.0;

backend foo {
    .host = "127.0.0.1";
    .port = "3100";
    .probe = {
        .url = "/healthz";
        .interval = 5s;
        .timeout = 1s;
        .window = 5;
        .threshold = 3;
    }
}

sub vcl_recv {
    return (hash);
}

[toy python server logs]
python3 ./server.py
socket binded to port 3100
2022-07-28 13:45:05.589997 socket is listening
2022-07-28 13:45:17.336371 from client: b'GET / HTTP/1.1\r\nHost: localhost:21601\r\nUser-Agent: curl/7.68.0\r\nAccept: */*\r\nX-Forwarded-For: ::1\r\nX-Varnish: 32771\r\n\r\n'
^^ first request
2022-07-28 13:45:27.425837 from client: b'GET / HTTP/1.1\r\nHost: localhost:21601\r\nUser-Agent: curl/7.68.0\r\nAccept: */*\r\nX-Forwarded-For: ::1\r\nX-Varnish: 163841\r\n\r\n'
2022-07-28 13:45:27.437121 from client: b'GET / HTTP/1.1\r\nHost: localhost:21601\r\nUser-Agent: curl/7.68.0\r\nAccept: */*\r\nX-Forwarded-For: ::1\r\nX-Varnish: 229377\r\n\r\n'
2022-07-28 13:45:27.437913 from client: b'GET / HTTP/1.1\r\nHost: localhost:21601\r\nUser-Agent: curl/7.68.0\r\nAccept: */*\r\nX-Forwarded-For: ::1\r\nX-Varnish: 131075\r\n\r\n'
2022-07-28 13:45:27.437999 from client: b'GET / HTTP/1.1\r\nHost: localhost:21601\r\nUser-Agent: curl/7.68.0\r\nAccept: */*\r\nX-Forwarded-For: ::1\r\nX-Varnish: 196609\r\n\r\n'
2022-07-28 13:45:27.438363 from client: b'GET / HTTP/1.1\r\nHost: localhost:21601\r\nUser-Agent: curl/7.68.0\r\nAccept: */*\r\nX-Forwarded-For: ::1\r\nX-Varnish: 98311\r\n\r\n'
2022-07-28 13:45:27.437782 from client: b'GET / HTTP/1.1\r\nHost: localhost:21601\r\nUser-Agent: curl/7.68.0\r\nAccept: */*\r\nX-Forwarded-For: ::1\r\nX-Varnish: 65545\r\n\r\n'
2022-07-28 13:45:27.438033 from client: b'GET / HTTP/1.1\r\nHost: localhost:21601\r\nUser-Agent: curl/7.68.0\r\nAccept: */*\r\nX-Forwarded-For: ::1\r\nX-Varnish: 327681\r\n\r\n'
2022-07-28 13:45:27.439453 from client: b'GET / HTTP/1.1\r\nHost: localhost:21601\r\nUser-Agent: curl/7.68.0\r\nAccept: */*\r\nX-Forwarded-For: ::1\r\nX-Varnish: 294913\r\n\r\n'
2022-07-28 13:45:27.438401 from client: b'GET / HTTP/1.1\r\nHost: localhost:21601\r\nUser-Agent: curl/7.68.0\r\nAccept: */*\r\nX-Forwarded-For: ::1\r\nX-Varnish: 262145\r\n\r\n'
^^ other requests

[toy python server code]
https://gist.github.com/leehambley/fa634f91936b1d30422e3af96ba2eec5

Lee Hambley
http://lee.hambley.name/
+49 (0) 170 298 5667

Re: Varnish attempting to serve disconnected clients
On Thu, Jul 28, 2022 at 12:07 PM Lee Hambley <lee.hambley@gmail.com> wrote:
>
> Dear List,
>
> I'm debugging a problem with our Varnish, and our QA folks found an interesting case.
>
> We found this breadcrumb trail while looking into our Varnishes having an enormous number of open connections in CLOSE_WAIT state, in a way that doesn't appear to be ideal connection reuse but rather dead connections that we need to close out; nonetheless, we're focused on a slightly more specific sub-issue of that issue right now:
>
> We have noticed that if a request is eligible to be coalesced, but the impatient client disconnects before the request is served, Varnish continues to try and serve that request by going to the backend even after the client is disconnected.

Hi Lee,

Reading up to this point I'm convinced that you have a pretty good
understanding of the problem (but I will read the rest, don't worry).

This is something we fixed a while ago in Varnish Enterprise, but it
took several painful attempts to get it right. While this problem may
look straightforward, it comes with a flurry of details.

The basic idea is to implement a way to "walk away" from the waiting
list, because in this context "there is no rush". The problem is that
your client could be stuck in a variety of places, for example three
levels down a parallel ESI tree over an HTTP/2 connection. Another
problem is that you short-circuit the normal delivery path, so you
need to make sure that the client task and session are accurately torn
down.

This is something we wanted to study more before submitting, in
case we could come up with a less complicated solution, but either
way we also need time to work on porting this nontrivial patch.

> I suspect in our case, we can disable request coalescing, but I didn't want to miss an opportunity to report a possible bug, or learn something about a corner of Varnish I don't know well... here's our setup:

Thanks a lot, very much appreciated!

> - With Varnish 6.6.1
> - Given a toy Python script backend which answers `200 OK` to the health check, but answers all other requests after 10s with an `HTTP 500` error; [source linked below]
> - Given the following config [below]
> - started with `/usr/local/sbin/varnishd -n /usr/local/var/varnish -F -f $PWD/foo.vcl -a test=:21601,HTTP`
> - When running `for i in $(seq 10); do curl -m 1 localhost:21601/ & done` (the ampersand, for backgrounding, is important)
> - Varnish makes 1 request to the backend, coalescing the others
> - Clients all disconnect thanks to `curl -m 1` (`--max-time`) (or `0.1`, no difference, naturally)
> - First request completed with `HTTP 500`, Varnish continues to retry requests for disconnected clients. (logs without health checks below)
>
> In the real set-up, Varnish actually handles only the next request, then the next, then the next, one every 10 seconds. I didn't take the time to reproduce that here, as I believe the bug is that Varnish continues to do work for disconnected clients at all. I guess in my toy we benefit from hit-for-pass, and in our real-world setup that's not true.

Yes, either hit-for-pass (return(pass)), hit-for-miss
(beresp.uncacheable) or a very small TTL.
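
For the archive, a sketch of what that could look like in
vcl_backend_response (untested; the status check and the 2-minute TTL
are placeholders to adjust to your own policy):

```vcl
sub vcl_backend_response {
    if (beresp.status >= 500) {
        # hit-for-miss: cache the decision *not* to cache for 2 minutes,
        # so requests parked on the waiting list are released and fetched
        # in parallel instead of being serialized one by one.
        set beresp.uncacheable = true;
        set beresp.ttl = 2m;
        return (deliver);
    }
}
```

A `return (pass(2m));` in the same place would give you hit-for-pass
instead, with the difference that hit-for-pass responses bypass the
cache lookup entirely on subsequent requests.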

> I can somehow imagine this as a feature (populate the cache even though the client went away) but queueing hundreds or thousands of requests, and nibbling away at them one-by-one even after clients are long-since hung up is causing resource exhaustion for us; we can tune the configs significantly now that we know the issue, but we'd love to get some opinionated feedback on what would be an idiomatic approach to this.

It's the lack of a (hitpass, hitmiss or regular) object that causes
waiting list serialization; we don't need to implement a new feature
in that regard.

> - Are we doing something wrong?

Probably a zero TTL; otherwise you are running into the waiting list
"nepotism" misbehavior, which is not your fault.
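
If your backend is effectively producing zero-TTL responses, even a
small floor on the TTL would break the serialization. Again just a
sketch, with an arbitrary 1-second floor:

```vcl
sub vcl_backend_response {
    if (beresp.ttl <= 0s) {
        # A short positive TTL means the requests queued behind this
        # fetch get a cache hit instead of going to the backend one
        # at a time.
        set beresp.ttl = 1s;
    }
}
```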

> - Should varnish still go to the backend to serve disconnected clients?

No, that would be the walkaway feature.

> - Is this a bug, should I report it somewhere more formally and attach the repro case a bit more diligently?

We don't need a bug report, and I believe we have at least one
reproducer in the enterprise test suite.


_______________________________________________
varnish-misc mailing list
varnish-misc@varnish-cache.org
https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc