Mailing List Archive

Strange Broken Pipe error from Varnish health checks
Hi,

I have a Varnish/nginx cluster running with varnish-7.1.2-1.el7.x86_64 on
CentOS 7.

The issue I am having comes from the Varnish health checks. I am getting a
"broken pipe" error in the nginx error log at random times, like the one below:
Apr 10 17:32:46 VARNISH-MASTER nginx_varnish_error: 2023/04/10 17:32:46
[info] 17808#17808: *67626636 writev() failed (32: Broken pipe), client:
unix:, server: _, request: "GET /varnish_check HTTP/1.1", host: "0.0.0.0"

The strange thing is that this error appears only when Varnish performs the
health checks. I have other tools hitting the same endpoint (Nagios, curl,
wget, AWS ELB), but none of them produce any errors. In addition, Varnish and
the nginx instance it health-checks are on the same server, and it makes no
difference whether I use a TCP connection or a Unix socket.

Below are the Varnish VCL backend and the nginx location used for the health checks:
backend nginx_varnish {
    .path = "/run/nginx/nginx.sock";
    .first_byte_timeout = 600s;
    .probe = health;
}

location = /varnish_check {
    keepalive_timeout 305;
    return 200 'Varnish Check';
    access_log /var/log/nginx/varnish_check.log main;
    error_log /var/log/nginx/varnish_check_errors.log debug;
    error_log syslog:server=unix:/run/nginx_log.in.sock,facility=local1,tag=nginx_varnish_error,nohostname info;
}

Are there any docs I can read about how exactly Varnish performs the health
checks and what internal processes are involved?
Did anyone happen to have similar issues? This is not causing any operational
problems for the cluster, but I want to determine why it is happening,
because it simply should not be happening.

Please help.
Thanks in advance.
Re: Strange Broken Pipe error from Varnish health checks
Hi George,

Just to be sure, how is the probe "health" defined in your VCL?

Cheers,

--
Guillaume Quintard


Re: Strange Broken Pipe error from Varnish health checks
Hi,

Below is the probe "health"; I forgot to include it the first time:
probe health {
    .url = "/varnish_check";
    .timeout = 2s;
    .interval = 5s;
    .window = 3;
    .threshold = 2;
}

Thanks

Re: Strange Broken Pipe error from Varnish health checks
Thanks. I looked at the code quickly, and I'd venture that maybe the
/varnish_check response is a bit too large and doesn't fit Varnish's probe
buffer (we only care about the status line of the response anyway), so
Varnish closes the connection while nginx isn't done writing yet.

If it's that, it's not very polite, but it's harmless.
--
Guillaume Quintard


Re: Strange Broken Pipe error from Varnish health checks
Hi,

In our case the response body is only a few bytes (13), and headers plus body
come to 170 bytes. Could this be a bug related to something else, perhaps?

Please let me know.
Thanks

Re: Strange Broken Pipe error from Varnish health checks
That code hasn't moved in a while, so I'd be surprised to see a bug there,
but that's always possible.
Any chance you could get a tcpdump of a probe request (from connection to
disconnection) so we can see what's going on?
--
Guillaume Quintard


Re: Strange Broken Pipe error from Varnish health checks
I am not sure how a tcpdump will help, since all the health check requests
are virtually the same. The Varnish thread that probes the backend only dies
every once in a while, so a capture will most likely never show anything, and
I have no way of knowing when it will die in order to catch one of the
incomplete requests with tcpdump.

Please advise.


Re: Strange Broken Pipe error from Varnish health checks
On Mon, Apr 17, 2023 at 4:17 PM Guillaume Quintard
<guillaume.quintard@gmail.com> wrote:
>
> That code hasn't moved in a while, so I'd be surprised to see a bug there, but that's always possible.
> Any chance you could get a tcpdump of a probe request (from connection to disconnection) so we can see what's going on?

It has: https://github.com/varnishcache/varnish-cache/pull/3886

But it shouldn't have changed the default behavior.

Dridi
Re: Strange Broken Pipe error from Varnish health checks
> It has: https://github.com/varnishcache/varnish-cache/pull/3886

That change was introduced in 7.3, and he is running 7.1.

Walid



Re: Strange Broken Pipe error from Varnish health checks
Hi,

Attached is the packet capture for the health check.

Please check and advise.


Re: Strange Broken Pipe error from Varnish health checks
Hi George,

That pcap only contains HTTP info; it would be super useful to have the TCP
packets (SYN/ACK/FIN) as well, to see who closes the connection on whom.
--
Guillaume Quintard


Re: Strange Broken Pipe error from Varnish health checks
Couldn't a HEAD request solve this? Then nginx wouldn't bother with the body at all, right?

This is what we do with our health checks. For example:

backend someBackend {
    .host = "[redacted]";
    .port = "80";
    .probe = {
        .interval = 9s;
        .request =
            "HEAD /healthcheck HTTP/1.1"
            "Host: [redacted]"
            "User-Agent: varnish-health-probe"
            "Connection: Close"
            "Accept: */*";
    }
}
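
Adapted to your setup, a HEAD-based probe might look something like the
sketch below. This is untested; the Host value is just a placeholder (use
whatever your nginx server block expects), and the timings are simply copied
from the probe you already posted:

probe health {
    # Send an explicit HEAD request instead of the default GET built from .url.
    # The Host value is a placeholder, adjust it to your nginx vhost.
    .request =
        "HEAD /varnish_check HTTP/1.1"
        "Host: localhost"
        "User-Agent: varnish-health-probe"
        "Connection: close";
    # Timings and thresholds copied from the existing probe definition.
    .timeout = 2s;
    .interval = 5s;
    .window = 3;
    .threshold = 2;
}

With "Connection: close" in the request, nginx should also close the
connection itself once it has answered, rather than keeping it open for the
keepalive_timeout.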