Mailing List Archive

Child process recurrently being restarted
Hello.

I have been having a critical problem with Varnish Cache in production for over a
month, and any help will be appreciated.
The problem is that the Varnish child process is recurrently being restarted
after 10~20h of use, with the following messages:

Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824) not responding to CLI, killed it.
Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Unexpected reply from ping: 400 CLI communication error
Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824) died signal=9
Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child cleanup complete
Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) Started
Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) said Child starts
Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) said SMF.s0 mmap'ed 483183820800 bytes of 483183820800

The following link is the varnishstat output just 1 minute before a restart:

https://pastebin.com/g0g5RVTs

Environment:

varnish-5.1.2 revision 6ece695
Debian 8.7 - Debian GNU/Linux 8 (3.16.0)
Installed using pre-built package from official repo at packagecloud.io
CPU 2x2.9 GHz
Mem 3.69 GiB
Running inside a Docker container
NFILES=131072
MEMLOCK=82000

Additional info:

- I need to cache a large number of objects and the cache should last for
almost a week, so I have set up a 450G storage space; I don't know if this
is a problem;
- I use ban a lot. There were about 40k bans in the system just before the
last crash. I really don't know if this is too much or whether it has anything
to do with it;
- No registered CPU spikes (usage is almost always around 30%);
- No panic is reported; the only info I can retrieve is from syslog;
- During all this time, even moments before the crashes, everything is okay
and requests are being served very quickly.

Best,
Stefano Baldo
Re: Child process recurrently being restarted
Hi Stefano,

Let's cover the usual suspects: I/O. I think Varnish gets stuck here
trying to push/pull data and can't find time to reply to the CLI. I'd
recommend monitoring the disk activity (bandwidth and IOPS) to confirm.
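
For example, iostat from the sysstat package (assuming it's available on the
host or in the container) shows both at a glance:

  iostat -xm 5

Keep an eye on %util and the await columns around the time the child stops
answering the CLI.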

After some time, file storage behaves terribly on a hard drive (SSDs take a
bit more time to degrade) because of fragmentation. One solution to help
the disks cope is to overprovision them if they're SSDs, and you can try
different advice settings in the file storage definition on the command line
(the last parameter, after granularity).
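
For reference, the file storage definition with those extra parameters looks
roughly like this; the path, the 4k granularity and the <advice> placeholder
are only examples, check the varnishd documentation for your version for the
exact keywords it accepts:

  -s file,/var/lib/varnish/storage.bin,450G,4k,<advice>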

Is your /var/lib/varnish mount on tmpfs? That could help too.

40K bans is a lot, are they ban-lurker friendly?

--
Guillaume Quintard

Re: Child process recurrently being restarted
Hi Guillaume.

Thanks for answering.

I'm using an SSD. I've changed from ext4 to ext2 to increase
performance, but it still keeps restarting.
Also, I checked the I/O performance of the disk and there is no sign of it
being overloaded.

I've changed /var/lib/varnish to a tmpfs and increased its 80m default
size by passing "-l 200m,20m" to varnishd and using
"nodev,nosuid,noatime,size=256M 0 0" for the tmpfs mount. There was a
problem here: after a couple of hours Varnish died and I received a "no
space left on device" message. Deleting /var/lib/varnish solved the
problem and Varnish was up again, but it's weird because there was free
memory on the host available for the tmpfs directory, so I don't know
what could have happened. For now I will stop increasing the
/var/lib/varnish size.
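
For completeness, the fstab entry I'm describing is roughly this:

  tmpfs  /var/lib/varnish  tmpfs  nodev,nosuid,noatime,size=256M  0 0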

Anyway, I am worried about the bans. You asked me if the bans are lurker
friendly. Well, I don't think so. My bans are created this way:

ban("req.http.host == " + req.http.host + " && req.url ~ " + req.url +
    " && req.http.User-Agent !~ Googlebot");

Are they lurker friendly? I took a quick look at the documentation and it
looks like they're not.

Best,
Stefano


Re: Child process recurrently being restarted
Not lurker friendly at all, indeed. You'll need to avoid req.* expressions.
The easiest way is to stash the host, user-agent and URL in beresp.http.*
headers and ban against those (unset them in vcl_deliver).
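
Roughly something like this, as an untested sketch (the x-url / x-user-agent
header names are just examples):

sub vcl_backend_response {
    # stash the request properties on the cached object
    set beresp.http.x-url = bereq.http.host + bereq.url;
    set beresp.http.x-user-agent = bereq.http.user-agent;
}

sub vcl_recv {
    if (req.method == "PURGE") {
        # obj.* tests only, so the ban lurker can evaluate them in the background
        ban("obj.http.x-url == " + req.http.host + req.url +
            " && obj.http.x-user-agent !~ Googlebot");
        return (synth(200, "Banned"));
    }
}

sub vcl_deliver {
    unset resp.http.x-url;
    unset resp.http.x-user-agent;
}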

I don't think you need to expand the VSL at all.

--
Guillaume Quintard

Re: Child process recurrently being restarted
Hi Guillaume,

Can the following be considered "ban lurker friendly"?

sub vcl_backend_response {
    set beresp.http.x-url = bereq.http.host + bereq.url;
    set beresp.http.x-user-agent = bereq.http.user-agent;
}

sub vcl_recv {
    if (req.method == "PURGE") {
        ban("obj.http.x-url == " + req.http.host + req.url +
            " && obj.http.x-user-agent !~ Googlebot");
        return(synth(750));
    }
}

sub vcl_deliver {
    unset resp.http.x-url;
    unset resp.http.x-user-agent;
}

Best,
Stefano


Re: Child process recurrently being restarted
Looking good!

--
Guillaume Quintard

Re: Child process recurrently being restarted
Hi Guillaume.

I think things will start going better now after changing the bans.
This is how my last varnishstat looked moments before a crash, regarding
the bans:

MAIN.bans                41336         .   Count of bans
MAIN.bans_completed      37967         .   Number of bans marked 'completed'
MAIN.bans_obj                0         .   Number of bans using obj.*
MAIN.bans_req            41335         .   Number of bans using req.*
MAIN.bans_added          41336      0.68   Bans added
MAIN.bans_deleted            0      0.00   Bans deleted

And this is how it looks now:

MAIN.bans                    2         .   Count of bans
MAIN.bans_completed          1         .   Number of bans marked 'completed'
MAIN.bans_obj                2         .   Number of bans using obj.*
MAIN.bans_req                0         .   Number of bans using req.*
MAIN.bans_added           2016      0.69   Bans added
MAIN.bans_deleted         2014      0.69   Bans deleted

Before the changes, bans were never deleted!
Now the bans are added and quickly deleted, after a minute or even a couple
of seconds.

Could this have been the cause of the problem? It seems like Varnish had a
large number of bans to manage and test against.
I will let it ride now. Let's see if the problem persists or is gone! :-)
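
The live ban list can also be inspected with varnishadm, to confirm the
lurker is actually consuming them:

  varnishadm ban.list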

Best,
Stefano


Re: Child process recurrently being restarted
Nice! It may have been the cause, time will tell. Can you report back in a
few days to let us know?
--
Guillaume Quintard

Re: Child process recurrently being restarted
Sure, will do!

Re: Child process recurrently being restarted
Hi Guillaume,

FYI, restarted again after ~16h :-(

Uptime mgt: 0+17:48:50
Uptime child: 0+02:17:14

Best,
Stefano


Re: Child process recurrently being restarted
Hi Guillaume.

It keeps restarting.
Would you mind taking a quick look at the following VCL file to check if
you find anything suspicious?

Thank you very much.

Best,
Stefano

vcl 4.0;

import std;

backend default {
.host = "sites-web-server-lb";
.port = "80";
}

include "/etc/varnish/bad_bot_detection.vcl";

sub vcl_recv {
call bad_bot_detection;

if (req.url == "/nocache" || req.url == "/version") {
return(pass);
}

unset req.http.Cookie;
if (req.method == "PURGE") {
ban("obj.http.x-host == " + req.http.host + " && obj.http.x-user-agent
!~ Googlebot");
return(synth(750));
}

set req.url = regsuball(req.url, "(?<!(http:|https))\/+", "/");
}

sub vcl_synth {
if (resp.status == 750) {
set resp.status = 200;
synthetic("PURGED => " + req.url);
return(deliver);
} elsif (resp.status == 501) {
set resp.status = 200;
set resp.http.Content-Type = "text/html; charset=utf-8";
synthetic(std.fileread("/etc/varnish/pages/invalid_domain.html"));
return(deliver);
}
}

sub vcl_backend_response {
unset beresp.http.Set-Cookie;
set beresp.http.x-host = bereq.http.host;
set beresp.http.x-user-agent = bereq.http.user-agent;

if (bereq.url == "/themes/basic/assets/theme.min.css"
|| bereq.url == "/api/events/PAGEVIEW"
|| bereq.url ~ "^\/assets\/img\/") {
set beresp.http.Cache-Control = "max-age=0";
} else {
unset beresp.http.Cache-Control;
}

if (beresp.status == 200 ||
beresp.status == 301 ||
beresp.status == 302 ||
beresp.status == 404) {
if (bereq.url ~ "\&ordenar=aleatorio$") {
set beresp.http.X-TTL = "1d";
set beresp.ttl = 1d;
} else {
set beresp.http.X-TTL = "1w";
set beresp.ttl = 1w;
}
}

if (bereq.url !~ "\.(jpeg|jpg|png|gif|gz|tgz|bz2|tbz|mp3|ogg|swf|flv)$") {
set beresp.do_gzip = true;
}
}

sub vcl_pipe {
set bereq.http.connection = "close";
return (pipe);
}

sub vcl_deliver {
unset resp.http.x-host;
unset resp.http.x-user-agent;
}

sub vcl_backend_error {
if (beresp.status == 502 || beresp.status == 503 || beresp.status == 504)
{
set beresp.status = 200;
set beresp.http.Content-Type = "text/html; charset=utf-8";
synthetic(std.fileread("/etc/varnish/pages/maintenance.html"));
return (deliver);
}
}

sub vcl_hash {
if (req.http.User-Agent ~ "Google Page Speed") {
hash_data("Google Page Speed");
} elsif (req.http.User-Agent ~ "Googlebot") {
hash_data("Googlebot");
}
}

sub vcl_deliver {
if (resp.status == 501) {
return (synth(resp.status));
}
if (obj.hits > 0) {
set resp.http.X-Cache = "hit";
} else {
set resp.http.X-Cache = "miss";
}
}
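
A quick way to confirm that these PURGE bans stay lurker-friendly (they
reference only obj.* headers) and are actually being evicted is to watch the
ban list and the ban counters from the shell; a minimal sketch, assuming
varnishadm can reach the running instance:

# List outstanding bans; lurker-friendly entries reference only obj.* fields
varnishadm ban.list

# MAIN.bans should stay low and MAIN.bans_deleted should keep climbing
# if the ban lurker is keeping up
varnishstat -1 -f MAIN.bans -f MAIN.bans_deleted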


Re: Child process recurrently being restarted [ In reply to ]
Sadly, nothing suspicious here. You can still try:
- bumping the cli_timeout
- changing your disk scheduler
- changing the advice option of the file storage

I'm still convinced this is due to Varnish getting stuck waiting for the
disk because of the file storage fragmentation.

Maybe you could look at SMF.*.g_alloc and compare it to the number of
objects. Ideally, we would have a 1:1 relation between objects and
allocations. If that ratio degrades (more allocations per object) prior to a
restart, that would be a good clue.
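
For reference, a rough sketch of how those suggestions might look from the
shell; the listen address, VCL path, storage path, size, granularity and
advice keyword are all placeholders, so check varnishd(1) for the values your
build accepts:

# Raise the CLI timeout on the running instance
varnishadm param.set cli_timeout 900

# File storage with granularity and advice as the last two parameters
varnishd -a :80 -f /etc/varnish/default.vcl \
    -s file,/var/lib/varnish/varnish_storage.bin,450G,4096,random

# Compare outstanding allocations with object structs; ideally close to 1:1
varnishstat -1 -f MAIN.n_object -f SMF.s0.g_alloc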


--
Guillaume Quintard

Re: Child process recurrently being restarted [ In reply to ]
Hi Guillaume.

I increased the cli_timeout yesterday to 900sec (15min) and it restarted
anyway, which seems to indicate that the thread is really stalled.

This was 1 minute after the last restart:

MAIN.n_object 3908216 . object structs made
SMF.s0.g_alloc 7794510 . Allocations outstanding

I've just changed the I/O Scheduler to noop to see what happens.
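
For anyone reproducing this, the scheduler can be checked and switched through
sysfs; a small sketch, assuming the cache file lives on /dev/sda:

# The active scheduler is shown in brackets
cat /sys/block/sda/queue/scheduler

# Switch to noop at runtime (not persistent across reboots)
echo noop > /sys/block/sda/queue/scheduler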

One interesting thing I've found is about memory usage.

In the 1st minute of use:
MemTotal: 3865572 kB
MemFree: 120768 kB
MemAvailable: 2300268 kB

1 minute before a restart:
MemTotal: 3865572 kB
MemFree: 82480 kB
MemAvailable: 68316 kB

It seems like the system is possibly running out of memory.

When calling varnishd, I'm specifying only "-s file,..." as storage. I see
in some examples that it is common to use "-s file" AND "-s malloc" together.
Should I be passing "-s malloc" as well to somehow try to limit the memory
usage by varnishd?
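
One related option, kept as a sketch with placeholder sizes, is to leave the
file storage as-is and only cap the Transient storage, which defaults to
unbounded malloc:

# Main cache on file storage plus an explicitly bounded Transient storage
varnishd -a :80 -f /etc/varnish/default.vcl \
    -s file,/var/lib/varnish/varnish_storage.bin,450G \
    -s Transient=malloc,256m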

Best,
Stefano


Re: Child process recurrently being restarted [ In reply to ]
Hi,

can you look at the output of "varnishstat -1 | grep g_bytes" and see if it
matches the memory you are seeing?
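
A compact way to put the storage counters next to the process and kernel
memory figures, purely as a sketch:

# Bytes currently held by each storage backend
varnishstat -1 | grep g_bytes

# Resident and virtual size of the varnishd processes
ps -C varnishd -o pid,rss,vsz,comm

# Kernel view of total/free/available memory
grep -E 'MemTotal|MemFree|MemAvailable' /proc/meminfo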

--
Guillaume Quintard

Re: Child process recurrently being restarted [ In reply to ]
Hi.

root@2c6c325b279f:/# varnishstat -1 | grep g_bytes
SMA.Transient.g_bytes              519022          .   Bytes outstanding
SMF.s0.g_bytes                23662845952          .   Bytes outstanding

You mean g_bytes from SMA.Transient? I have set no malloc storage.
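
For easier comparison with the MemAvailable figures, the same counters can be
converted to GiB on the fly; a small sketch building on the command above:

# Convert the g_bytes counters to GiB
varnishstat -1 | awk '/g_bytes/ {printf "%-26s %10.2f GiB\n", $1, $2/1024/1024/1024}'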


Re: Child process recurrently being restarted [ In reply to ]
Yeah, I was wondering about Transient, but it seems under control.

Apart from moving away from file storage, I have nothing at the moment :-/
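
To give an idea, a rough sketch of what that could look like (sizes are
placeholders, not a recommendation):

# sketch only - switch the cache to RAM instead of the 450G file storage
varnishd -a :80 -f /etc/varnish/default.vcl \
    -s malloc,2G

Obviously a malloc store on a ~3.7 GiB box can't hold anything close to your
450G working set, so that would mean either a much smaller cache or a machine
with a lot more memory.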

--
Guillaume Quintard

On Wed, Jun 28, 2017 at 3:39 PM, Stefano Baldo <stefanobaldo@gmail.com>
wrote:

> Hi.
>
> root@2c6c325b279f:/# varnishstat -1 | grep g_bytes
> SMA.Transient.g_bytes 519022 . Bytes
> outstanding
> SMF.s0.g_bytes 23662845952 . Bytes
> outstanding
>
> You mean g_bytes from SMA.Transient? I have set no malloc storage.
>
>
> On Wed, Jun 28, 2017 at 10:26 AM, Guillaume Quintard <
> guillaume@varnish-software.com> wrote:
>
>> Hi,
>>
>> can you look that "varnishstat -1 | grep g_bytes" and see if if matches
>> the memory you are seeing?
>>
>> --
>> Guillaume Quintard
>>
>> On Wed, Jun 28, 2017 at 3:20 PM, Stefano Baldo <stefanobaldo@gmail.com>
>> wrote:
>>
>>> Hi Guillaume.
>>>
>>> I increased the cli_timeout yesterday to 900sec (15min) and it restarted
>>> anyway, which seems to indicate that the thread is really stalled.
>>>
>>> This was 1 minute after the last restart:
>>>
>>> MAIN.n_object 3908216 . object structs made
>>> SMF.s0.g_alloc 7794510 . Allocations outstanding
>>>
>>> I've just changed the I/O Scheduler to noop to see what happens.
>>>
>>> One interest thing I've found is about the memory usage.
>>>
>>> In the 1st minute of use:
>>> MemTotal: 3865572 kB
>>> MemFree: 120768 kB
>>> MemAvailable: 2300268 kB
>>>
>>> 1 minute before a restart:
>>> MemTotal: 3865572 kB
>>> MemFree: 82480 kB
>>> MemAvailable: 68316 kB
>>>
>>> It seems like the system is possibly running out of memory.
>>>
>>> When calling varnishd, I'm specifying only "-s file,..." as storage. I
>>> see in some examples that is common to use "-s file" AND "-s malloc"
>>> together. Should I be passing "-s malloc" as well to somehow try to limit
>>> the memory usage by varnishd?
>>>
>>> Best,
>>> Stefano
>>>
>>>
>>> On Wed, Jun 28, 2017 at 4:12 AM, Guillaume Quintard <
>>> guillaume@varnish-software.com> wrote:
>>>
>>>> Sadly, nothing suspicious here, you can still try:
>>>> - bumping the cli_timeout
>>>> - changing your disk scheduler
>>>> - changing the advice option of the file storage
>>>>
>>>> I'm still convinced this is due to Varnish getting stuck waiting for
>>>> the disk because of the file storage fragmentation.
>>>>
>>>> Maybe you could look at SMF.*.g_alloc and compare it to the number of
>>>> objects. Ideally, we would have a 1:1 relation between objects and
>>>> allocations. If that number drops prior to a restart, that would be a good
>>>> clue.
>>>>
>>>>
>>>> --
>>>> Guillaume Quintard
>>>>
>>>> On Tue, Jun 27, 2017 at 11:07 PM, Stefano Baldo <stefanobaldo@gmail.com
>>>> > wrote:
>>>>
>>>>> Hi Guillaume.
>>>>>
>>>>> It keeps restarting.
>>>>> Would you mind taking a quick look in the following VCL file to check
>>>>> if you find anything suspicious?
>>>>>
>>>>> Thank you very much.
>>>>>
>>>>> Best,
>>>>> Stefano
>>>>>
>>>>> vcl 4.0;
>>>>>
>>>>> import std;
>>>>>
>>>>> backend default {
>>>>> .host = "sites-web-server-lb";
>>>>> .port = "80";
>>>>> }
>>>>>
>>>>> include "/etc/varnish/bad_bot_detection.vcl";
>>>>>
>>>>> sub vcl_recv {
>>>>> call bad_bot_detection;
>>>>>
>>>>> if (req.url == "/nocache" || req.url == "/version") {
>>>>> return(pass);
>>>>> }
>>>>>
>>>>> unset req.http.Cookie;
>>>>> if (req.method == "PURGE") {
>>>>> ban("obj.http.x-host == " + req.http.host + " &&
>>>>> obj.http.x-user-agent !~ Googlebot");
>>>>> return(synth(750));
>>>>> }
>>>>>
>>>>> set req.url = regsuball(req.url, "(?<!(http:|https))\/+", "/");
>>>>> }
>>>>>
>>>>> sub vcl_synth {
>>>>> if (resp.status == 750) {
>>>>> set resp.status = 200;
>>>>> synthetic("PURGED => " + req.url);
>>>>> return(deliver);
>>>>> } elsif (resp.status == 501) {
>>>>> set resp.status = 200;
>>>>> set resp.http.Content-Type = "text/html; charset=utf-8";
>>>>> synthetic(std.fileread("/etc/varnish/pages/invalid_domain.html"));
>>>>> return(deliver);
>>>>> }
>>>>> }
>>>>>
>>>>> sub vcl_backend_response {
>>>>> unset beresp.http.Set-Cookie;
>>>>> set beresp.http.x-host = bereq.http.host;
>>>>> set beresp.http.x-user-agent = bereq.http.user-agent;
>>>>>
>>>>> if (bereq.url == "/themes/basic/assets/theme.min.css"
>>>>> || bereq.url == "/api/events/PAGEVIEW"
>>>>> || bereq.url ~ "^\/assets\/img\/") {
>>>>> set beresp.http.Cache-Control = "max-age=0";
>>>>> } else {
>>>>> unset beresp.http.Cache-Control;
>>>>> }
>>>>>
>>>>> if (beresp.status == 200 ||
>>>>> beresp.status == 301 ||
>>>>> beresp.status == 302 ||
>>>>> beresp.status == 404) {
>>>>> if (bereq.url ~ "\&ordenar=aleatorio$") {
>>>>> set beresp.http.X-TTL = "1d";
>>>>> set beresp.ttl = 1d;
>>>>> } else {
>>>>> set beresp.http.X-TTL = "1w";
>>>>> set beresp.ttl = 1w;
>>>>> }
>>>>> }
>>>>>
>>>>> if (bereq.url !~ "\.(jpeg|jpg|png|gif|gz|tgz|bz2|tbz|mp3|ogg|swf|flv)$")
>>>>> {
>>>>> set beresp.do_gzip = true;
>>>>> }
>>>>> }
>>>>>
>>>>> sub vcl_pipe {
>>>>> set bereq.http.connection = "close";
>>>>> return (pipe);
>>>>> }
>>>>>
>>>>> sub vcl_deliver {
>>>>> unset resp.http.x-host;
>>>>> unset resp.http.x-user-agent;
>>>>> }
>>>>>
>>>>> sub vcl_backend_error {
>>>>> if (beresp.status == 502 || beresp.status == 503 || beresp.status ==
>>>>> 504) {
>>>>> set beresp.status = 200;
>>>>> set beresp.http.Content-Type = "text/html; charset=utf-8";
>>>>> synthetic(std.fileread("/etc/varnish/pages/maintenance.html"));
>>>>> return (deliver);
>>>>> }
>>>>> }
>>>>>
>>>>> sub vcl_hash {
>>>>> if (req.http.User-Agent ~ "Google Page Speed") {
>>>>> hash_data("Google Page Speed");
>>>>> } elsif (req.http.User-Agent ~ "Googlebot") {
>>>>> hash_data("Googlebot");
>>>>> }
>>>>> }
>>>>>
>>>>> sub vcl_deliver {
>>>>> if (resp.status == 501) {
>>>>> return (synth(resp.status));
>>>>> }
>>>>> if (obj.hits > 0) {
>>>>> set resp.http.X-Cache = "hit";
>>>>> } else {
>>>>> set resp.http.X-Cache = "miss";
>>>>> }
>>>>> }
>>>>>
>>>>>
>>>>> On Mon, Jun 26, 2017 at 3:47 PM, Guillaume Quintard <
>>>>> guillaume@varnish-software.com> wrote:
>>>>>
>>>>>> Nice! It may have been the cause, time will tell.can you report back
>>>>>> in a few days to let us know?
>>>>>> --
>>>>>> Guillaume Quintard
>>>>>>
>>>>>> On Jun 26, 2017 20:21, "Stefano Baldo" <stefanobaldo@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Guillaume.
>>>>>>>
>>>>>>> I think things will start to going better now after changing the
>>>>>>> bans.
>>>>>>> This is how my last varnishstat looked like moments before a crash
>>>>>>> regarding the bans:
>>>>>>>
>>>>>>> MAIN.bans 41336 . Count of bans
>>>>>>> MAIN.bans_completed 37967 . Number of bans
>>>>>>> marked 'completed'
>>>>>>> MAIN.bans_obj 0 . Number of bans
>>>>>>> using obj.*
>>>>>>> MAIN.bans_req 41335 . Number of bans
>>>>>>> using req.*
>>>>>>> MAIN.bans_added 41336 0.68 Bans added
>>>>>>> MAIN.bans_deleted 0 0.00 Bans deleted
>>>>>>>
>>>>>>> And this is how it looks like now:
>>>>>>>
>>>>>>> MAIN.bans 2 . Count of bans
>>>>>>> MAIN.bans_completed 1 . Number of bans
>>>>>>> marked 'completed'
>>>>>>> MAIN.bans_obj 2 . Number of bans
>>>>>>> using obj.*
>>>>>>> MAIN.bans_req 0 . Number of bans
>>>>>>> using req.*
>>>>>>> MAIN.bans_added 2016 0.69 Bans added
>>>>>>> MAIN.bans_deleted 2014 0.69 Bans deleted
>>>>>>>
>>>>>>> Before the changes, bans were never deleted!
>>>>>>> Now the bans are added and quickly deleted after a minute or even a
>>>>>>> couple of seconds.
>>>>>>>
>>>>>>> May this was the cause of the problem? It seems like varnish was
>>>>>>> having a large number of bans to manage and test against.
>>>>>>> I will let it ride now. Let's see if the problem persists or it's
>>>>>>> gone! :-)
>>>>>>>
>>>>>>> Best,
>>>>>>> Stefano
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Jun 26, 2017 at 3:10 PM, Guillaume Quintard <
>>>>>>> guillaume@varnish-software.com> wrote:
>>>>>>>
>>>>>>>> Looking good!
>>>>>>>>
>>>>>>>> --
>>>>>>>> Guillaume Quintard
>>>>>>>>
>>>>>>>> On Mon, Jun 26, 2017 at 7:06 PM, Stefano Baldo <
>>>>>>>> stefanobaldo@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Guillaume,
>>>>>>>>>
>>>>>>>>> Can the following be considered "ban lurker friendly"?
>>>>>>>>>
>>>>>>>>> sub vcl_backend_response {
>>>>>>>>> set beresp.http.x-url = bereq.http.host + bereq.url;
>>>>>>>>> set beresp.http.x-user-agent = bereq.http.user-agent;
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> sub vcl_recv {
>>>>>>>>> if (req.method == "PURGE") {
>>>>>>>>> ban("obj.http.x-url == " + req.http.host + req.url + " &&
>>>>>>>>> obj.http.x-user-agent !~ Googlebot");
>>>>>>>>> return(synth(750));
>>>>>>>>> }
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> sub vcl_deliver {
>>>>>>>>> unset resp.http.x-url;
>>>>>>>>> unset resp.http.x-user-agent;
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Stefano
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Jun 26, 2017 at 12:43 PM, Guillaume Quintard <
>>>>>>>>> guillaume@varnish-software.com> wrote:
>>>>>>>>>
>>>>>>>>>> Not lurker friendly at all indeed. You'll need to avoid req.*
>>>>>>>>>> expression. Easiest way is to stash the host, user-agent and url in
>>>>>>>>>> beresp.http.* and ban against those (unset them in vcl_deliver).
>>>>>>>>>>
>>>>>>>>>> I don't think you need to expand the VSL at all.
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Guillaume Quintard
>>>>>>>>>>
>>>>>>>>>> On Jun 26, 2017 16:51, "Stefano Baldo" <stefanobaldo@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi Guillaume.
>>>>>>>>>>
>>>>>>>>>> Thanks for answering.
>>>>>>>>>>
>>>>>>>>>> I'm using a SSD disk. I've changed from ext4 to ext2 to increase
>>>>>>>>>> performance but it stills restarting.
>>>>>>>>>> Also, I checked the I/O performance for the disk and there is no
>>>>>>>>>> signal of overhead.
>>>>>>>>>>
>>>>>>>>>> I've changed the /var/lib/varnish to a tmpfs and increased its
>>>>>>>>>> 80m default size passing "-l 200m,20m" to varnishd and using
>>>>>>>>>> "nodev,nosuid,noatime,size=256M 0 0" for the tmpfs mount. There
>>>>>>>>>> was a problem here. After a couple of hours varnish died and I received a
>>>>>>>>>> "no space left on device" message - deleting the /var/lib/varnish solved
>>>>>>>>>> the problem and varnish was up again, but it's weird because there was free
>>>>>>>>>> memory on the host to be used with the tmpfs directory, so I don't know
>>>>>>>>>> what could have happened. I will try to stop increasing the
>>>>>>>>>> /var/lib/varnish size.
>>>>>>>>>>
>>>>>>>>>> Anyway, I am worried about the bans. You asked me if the bans are
>>>>>>>>>> lurker friedly. Well, I don't think so. My bans are created this way:
>>>>>>>>>>
>>>>>>>>>> ban("req.http.host == " + req.http.host + " && req.url ~ " +
>>>>>>>>>> req.url + " && req.http.User-Agent !~ Googlebot");
>>>>>>>>>>
>>>>>>>>>> Are they lurker friendly? I was taking a quick look and the
>>>>>>>>>> documentation and it looks like they're not.
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Stefano
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, Jun 23, 2017 at 11:30 AM, Guillaume Quintard <
>>>>>>>>>> guillaume@varnish-software.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Stefano,
>>>>>>>>>>>
>>>>>>>>>>> Let's cover the usual suspects: I/Os. I think here Varnish gets
>>>>>>>>>>> stuck trying to push/pull data and can't make time to reply to the CLI. I'd
>>>>>>>>>>> recommend monitoring the disk activity (bandwidth and iops) to confirm.
>>>>>>>>>>>
>>>>>>>>>>> After some time, the file storage is terrible on a hard drive
>>>>>>>>>>> (SSDs take a bit more time to degrade) because of fragmentation. One
>>>>>>>>>>> solution to help the disks cope is to overprovision themif they're SSDs,
>>>>>>>>>>> and you can try different advices in the file storage definition in the
>>>>>>>>>>> command line (last parameter, after granularity).
>>>>>>>>>>>
>>>>>>>>>>> Is your /var/lib/varnish mount on tmpfs? That could help too.
>>>>>>>>>>>
>>>>>>>>>>> 40K bans is a lot, are they ban-lurker friendly?
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Guillaume Quintard
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Jun 23, 2017 at 4:01 PM, Stefano Baldo <
>>>>>>>>>>> stefanobaldo@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hello.
>>>>>>>>>>>>
>>>>>>>>>>>> I am having a critical problem with Varnish Cache in production
>>>>>>>>>>>> for over a month and any help will be appreciated.
>>>>>>>>>>>> The problem is that Varnish child process is recurrently being
>>>>>>>>>>>> restarted after 10~20h of use, with the following message:
>>>>>>>>>>>>
>>>>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824) not
>>>>>>>>>>>> responding to CLI, killed it.
>>>>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Unexpected reply
>>>>>>>>>>>> from ping: 400 CLI communication error
>>>>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824)
>>>>>>>>>>>> died signal=9
>>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child cleanup
>>>>>>>>>>>> complete
>>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038)
>>>>>>>>>>>> Started
>>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038)
>>>>>>>>>>>> said Child starts
>>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038)
>>>>>>>>>>>> said SMF.s0 mmap'ed 483183820800 bytes of 483183820800
>>>>>>>>>>>>
>>>>>>>>>>>> The following link is the varnishstat output just 1 minute
>>>>>>>>>>>> before a restart:
>>>>>>>>>>>>
>>>>>>>>>>>> https://pastebin.com/g0g5RVTs
>>>>>>>>>>>>
>>>>>>>>>>>> Environment:
>>>>>>>>>>>>
>>>>>>>>>>>> varnish-5.1.2 revision 6ece695
>>>>>>>>>>>> Debian 8.7 - Debian GNU/Linux 8 (3.16.0)
>>>>>>>>>>>> Installed using pre-built package from official repo at
>>>>>>>>>>>> packagecloud.io
>>>>>>>>>>>> CPU 2x2.9 GHz
>>>>>>>>>>>> Mem 3.69 GiB
>>>>>>>>>>>> Running inside a Docker container
>>>>>>>>>>>> NFILES=131072
>>>>>>>>>>>> MEMLOCK=82000
>>>>>>>>>>>>
>>>>>>>>>>>> Additional info:
>>>>>>>>>>>>
>>>>>>>>>>>> - I need to cache a large number of objets and the cache should
>>>>>>>>>>>> last for almost a week, so I have set up a 450G storage space, I don't know
>>>>>>>>>>>> if this is a problem;
>>>>>>>>>>>> - I use ban a lot. There was about 40k bans in the system just
>>>>>>>>>>>> before the last crash. I really don't know if this is too much or may have
>>>>>>>>>>>> anything to do with it;
>>>>>>>>>>>> - No registered CPU spikes (almost always by 30%);
>>>>>>>>>>>> - No panic is reported, the only info I can retrieve is from
>>>>>>>>>>>> syslog;
>>>>>>>>>>>> - During all the time, event moments before the crashes,
>>>>>>>>>>>> everything is okay and requests are being responded very fast.
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Stefano Baldo
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> varnish-misc mailing list
>>>>>>>>>>>> varnish-misc@varnish-cache.org
>>>>>>>>>>>> https://www.varnish-cache.org/lists/mailman/listinfo/varnish
>>>>>>>>>>>> -misc
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>
>>>>
>>>
>>
>
Re: Child process recurrently being restarted [ In reply to ]
SMA.Transient.g_alloc            3518          .   Allocations outstanding
SMA.Transient.g_bytes          546390          .   Bytes outstanding
SMA.Transient.g_space               0          .   Bytes available

g_space is always 0. Could it mean anything?
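
For what it's worth, a plain loop like this is enough to watch those counters
over time (just an ad-hoc sketch):

# sample the Transient counters once a minute
while true; do varnishstat -1 | grep -E 'SMA\.Transient\.(g_bytes|g_space)'; sleep 60; done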

On Wed, Jun 28, 2017 at 10:43 AM, Guillaume Quintard <
guillaume@varnish-software.com> wrote:

> Yeah, I was wondering about Transient, but it seems under control.
>
> Apart from moving away from file storage, I have nothing at the moment :-/
>
> --
> Guillaume Quintard
>
> On Wed, Jun 28, 2017 at 3:39 PM, Stefano Baldo <stefanobaldo@gmail.com>
> wrote:
>
>> Hi.
>>
>> root@2c6c325b279f:/# varnishstat -1 | grep g_bytes
>> SMA.Transient.g_bytes 519022 . Bytes
>> outstanding
>> SMF.s0.g_bytes 23662845952 . Bytes
>> outstanding
>>
>> You mean g_bytes from SMA.Transient? I have set no malloc storage.
>>
>>
>> On Wed, Jun 28, 2017 at 10:26 AM, Guillaume Quintard <
>> guillaume@varnish-software.com> wrote:
>>
>>> Hi,
>>>
>>> can you look that "varnishstat -1 | grep g_bytes" and see if if matches
>>> the memory you are seeing?
>>>
>>> --
>>> Guillaume Quintard
>>>
>>> On Wed, Jun 28, 2017 at 3:20 PM, Stefano Baldo <stefanobaldo@gmail.com>
>>> wrote:
>>>
>>>> Hi Guillaume.
>>>>
>>>> I increased the cli_timeout yesterday to 900sec (15min) and it
>>>> restarted anyway, which seems to indicate that the thread is really stalled.
>>>>
>>>> This was 1 minute after the last restart:
>>>>
>>>> MAIN.n_object 3908216 . object structs made
>>>> SMF.s0.g_alloc 7794510 . Allocations outstanding
>>>>
>>>> I've just changed the I/O Scheduler to noop to see what happens.
>>>>
>>>> One interest thing I've found is about the memory usage.
>>>>
>>>> In the 1st minute of use:
>>>> MemTotal: 3865572 kB
>>>> MemFree: 120768 kB
>>>> MemAvailable: 2300268 kB
>>>>
>>>> 1 minute before a restart:
>>>> MemTotal: 3865572 kB
>>>> MemFree: 82480 kB
>>>> MemAvailable: 68316 kB
>>>>
>>>> It seems like the system is possibly running out of memory.
>>>>
>>>> When calling varnishd, I'm specifying only "-s file,..." as storage. I
>>>> see in some examples that is common to use "-s file" AND "-s malloc"
>>>> together. Should I be passing "-s malloc" as well to somehow try to limit
>>>> the memory usage by varnishd?
>>>>
>>>> Best,
>>>> Stefano
>>>>
>>>>
>>>> On Wed, Jun 28, 2017 at 4:12 AM, Guillaume Quintard <
>>>> guillaume@varnish-software.com> wrote:
>>>>
>>>>> Sadly, nothing suspicious here, you can still try:
>>>>> - bumping the cli_timeout
>>>>> - changing your disk scheduler
>>>>> - changing the advice option of the file storage
>>>>>
>>>>> I'm still convinced this is due to Varnish getting stuck waiting for
>>>>> the disk because of the file storage fragmentation.
>>>>>
>>>>> Maybe you could look at SMF.*.g_alloc and compare it to the number of
>>>>> objects. Ideally, we would have a 1:1 relation between objects and
>>>>> allocations. If that number drops prior to a restart, that would be a good
>>>>> clue.
>>>>>
>>>>>
>>>>> --
>>>>> Guillaume Quintard
>>>>>
>>>>> On Tue, Jun 27, 2017 at 11:07 PM, Stefano Baldo <
>>>>> stefanobaldo@gmail.com> wrote:
>>>>>
>>>>>> Hi Guillaume.
>>>>>>
>>>>>> It keeps restarting.
>>>>>> Would you mind taking a quick look in the following VCL file to check
>>>>>> if you find anything suspicious?
>>>>>>
>>>>>> Thank you very much.
>>>>>>
>>>>>> Best,
>>>>>> Stefano
>>>>>>
>>>>>> vcl 4.0;
>>>>>>
>>>>>> import std;
>>>>>>
>>>>>> backend default {
>>>>>> .host = "sites-web-server-lb";
>>>>>> .port = "80";
>>>>>> }
>>>>>>
>>>>>> include "/etc/varnish/bad_bot_detection.vcl";
>>>>>>
>>>>>> sub vcl_recv {
>>>>>> call bad_bot_detection;
>>>>>>
>>>>>> if (req.url == "/nocache" || req.url == "/version") {
>>>>>> return(pass);
>>>>>> }
>>>>>>
>>>>>> unset req.http.Cookie;
>>>>>> if (req.method == "PURGE") {
>>>>>> ban("obj.http.x-host == " + req.http.host + " &&
>>>>>> obj.http.x-user-agent !~ Googlebot");
>>>>>> return(synth(750));
>>>>>> }
>>>>>>
>>>>>> set req.url = regsuball(req.url, "(?<!(http:|https))\/+", "/");
>>>>>> }
>>>>>>
>>>>>> sub vcl_synth {
>>>>>> if (resp.status == 750) {
>>>>>> set resp.status = 200;
>>>>>> synthetic("PURGED => " + req.url);
>>>>>> return(deliver);
>>>>>> } elsif (resp.status == 501) {
>>>>>> set resp.status = 200;
>>>>>> set resp.http.Content-Type = "text/html; charset=utf-8";
>>>>>> synthetic(std.fileread("/etc/varnish/pages/invalid_domain.ht
>>>>>> ml"));
>>>>>> return(deliver);
>>>>>> }
>>>>>> }
>>>>>>
>>>>>> sub vcl_backend_response {
>>>>>> unset beresp.http.Set-Cookie;
>>>>>> set beresp.http.x-host = bereq.http.host;
>>>>>> set beresp.http.x-user-agent = bereq.http.user-agent;
>>>>>>
>>>>>> if (bereq.url == "/themes/basic/assets/theme.min.css"
>>>>>> || bereq.url == "/api/events/PAGEVIEW"
>>>>>> || bereq.url ~ "^\/assets\/img\/") {
>>>>>> set beresp.http.Cache-Control = "max-age=0";
>>>>>> } else {
>>>>>> unset beresp.http.Cache-Control;
>>>>>> }
>>>>>>
>>>>>> if (beresp.status == 200 ||
>>>>>> beresp.status == 301 ||
>>>>>> beresp.status == 302 ||
>>>>>> beresp.status == 404) {
>>>>>> if (bereq.url ~ "\&ordenar=aleatorio$") {
>>>>>> set beresp.http.X-TTL = "1d";
>>>>>> set beresp.ttl = 1d;
>>>>>> } else {
>>>>>> set beresp.http.X-TTL = "1w";
>>>>>> set beresp.ttl = 1w;
>>>>>> }
>>>>>> }
>>>>>>
>>>>>> if (bereq.url !~ "\.(jpeg|jpg|png|gif|gz|tgz|bz2|tbz|mp3|ogg|swf|flv)$")
>>>>>> {
>>>>>> set beresp.do_gzip = true;
>>>>>> }
>>>>>> }
>>>>>>
>>>>>> sub vcl_pipe {
>>>>>> set bereq.http.connection = "close";
>>>>>> return (pipe);
>>>>>> }
>>>>>>
>>>>>> sub vcl_deliver {
>>>>>> unset resp.http.x-host;
>>>>>> unset resp.http.x-user-agent;
>>>>>> }
>>>>>>
>>>>>> sub vcl_backend_error {
>>>>>> if (beresp.status == 502 || beresp.status == 503 || beresp.status
>>>>>> == 504) {
>>>>>> set beresp.status = 200;
>>>>>> set beresp.http.Content-Type = "text/html; charset=utf-8";
>>>>>> synthetic(std.fileread("/etc/varnish/pages/maintenance.html"));
>>>>>> return (deliver);
>>>>>> }
>>>>>> }
>>>>>>
>>>>>> sub vcl_hash {
>>>>>> if (req.http.User-Agent ~ "Google Page Speed") {
>>>>>> hash_data("Google Page Speed");
>>>>>> } elsif (req.http.User-Agent ~ "Googlebot") {
>>>>>> hash_data("Googlebot");
>>>>>> }
>>>>>> }
>>>>>>
>>>>>> sub vcl_deliver {
>>>>>> if (resp.status == 501) {
>>>>>> return (synth(resp.status));
>>>>>> }
>>>>>> if (obj.hits > 0) {
>>>>>> set resp.http.X-Cache = "hit";
>>>>>> } else {
>>>>>> set resp.http.X-Cache = "miss";
>>>>>> }
>>>>>> }
>>>>>>
>>>>>>
>>>>>> On Mon, Jun 26, 2017 at 3:47 PM, Guillaume Quintard <
>>>>>> guillaume@varnish-software.com> wrote:
>>>>>>
>>>>>>> Nice! It may have been the cause, time will tell.can you report back
>>>>>>> in a few days to let us know?
>>>>>>> --
>>>>>>> Guillaume Quintard
>>>>>>>
>>>>>>> On Jun 26, 2017 20:21, "Stefano Baldo" <stefanobaldo@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Guillaume.
>>>>>>>>
>>>>>>>> I think things will start to going better now after changing the
>>>>>>>> bans.
>>>>>>>> This is how my last varnishstat looked like moments before a crash
>>>>>>>> regarding the bans:
>>>>>>>>
>>>>>>>> MAIN.bans 41336 . Count of bans
>>>>>>>> MAIN.bans_completed 37967 . Number of bans
>>>>>>>> marked 'completed'
>>>>>>>> MAIN.bans_obj 0 . Number of bans
>>>>>>>> using obj.*
>>>>>>>> MAIN.bans_req 41335 . Number of bans
>>>>>>>> using req.*
>>>>>>>> MAIN.bans_added 41336 0.68 Bans added
>>>>>>>> MAIN.bans_deleted 0 0.00 Bans deleted
>>>>>>>>
>>>>>>>> And this is how it looks like now:
>>>>>>>>
>>>>>>>> MAIN.bans 2 . Count of bans
>>>>>>>> MAIN.bans_completed 1 . Number of bans
>>>>>>>> marked 'completed'
>>>>>>>> MAIN.bans_obj 2 . Number of bans
>>>>>>>> using obj.*
>>>>>>>> MAIN.bans_req 0 . Number of bans
>>>>>>>> using req.*
>>>>>>>> MAIN.bans_added 2016 0.69 Bans added
>>>>>>>> MAIN.bans_deleted 2014 0.69 Bans deleted
>>>>>>>>
>>>>>>>> Before the changes, bans were never deleted!
>>>>>>>> Now the bans are added and quickly deleted after a minute or even a
>>>>>>>> couple of seconds.
>>>>>>>>
>>>>>>>> May this was the cause of the problem? It seems like varnish was
>>>>>>>> having a large number of bans to manage and test against.
>>>>>>>> I will let it ride now. Let's see if the problem persists or it's
>>>>>>>> gone! :-)
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Stefano
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Jun 26, 2017 at 3:10 PM, Guillaume Quintard <
>>>>>>>> guillaume@varnish-software.com> wrote:
>>>>>>>>
>>>>>>>>> Looking good!
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Guillaume Quintard
>>>>>>>>>
>>>>>>>>> On Mon, Jun 26, 2017 at 7:06 PM, Stefano Baldo <
>>>>>>>>> stefanobaldo@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Guillaume,
>>>>>>>>>>
>>>>>>>>>> Can the following be considered "ban lurker friendly"?
>>>>>>>>>>
>>>>>>>>>> sub vcl_backend_response {
>>>>>>>>>> set beresp.http.x-url = bereq.http.host + bereq.url;
>>>>>>>>>> set beresp.http.x-user-agent = bereq.http.user-agent;
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> sub vcl_recv {
>>>>>>>>>> if (req.method == "PURGE") {
>>>>>>>>>> ban("obj.http.x-url == " + req.http.host + req.url + " &&
>>>>>>>>>> obj.http.x-user-agent !~ Googlebot");
>>>>>>>>>> return(synth(750));
>>>>>>>>>> }
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> sub vcl_deliver {
>>>>>>>>>> unset resp.http.x-url;
>>>>>>>>>> unset resp.http.x-user-agent;
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Stefano
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mon, Jun 26, 2017 at 12:43 PM, Guillaume Quintard <
>>>>>>>>>> guillaume@varnish-software.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Not lurker friendly at all indeed. You'll need to avoid req.*
>>>>>>>>>>> expression. Easiest way is to stash the host, user-agent and url in
>>>>>>>>>>> beresp.http.* and ban against those (unset them in vcl_deliver).
>>>>>>>>>>>
>>>>>>>>>>> I don't think you need to expand the VSL at all.
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Guillaume Quintard
>>>>>>>>>>>
>>>>>>>>>>> On Jun 26, 2017 16:51, "Stefano Baldo" <stefanobaldo@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Guillaume.
>>>>>>>>>>>
>>>>>>>>>>> Thanks for answering.
>>>>>>>>>>>
>>>>>>>>>>> I'm using a SSD disk. I've changed from ext4 to ext2 to increase
>>>>>>>>>>> performance but it stills restarting.
>>>>>>>>>>> Also, I checked the I/O performance for the disk and there is no
>>>>>>>>>>> signal of overhead.
>>>>>>>>>>>
>>>>>>>>>>> I've changed the /var/lib/varnish to a tmpfs and increased its
>>>>>>>>>>> 80m default size passing "-l 200m,20m" to varnishd and using
>>>>>>>>>>> "nodev,nosuid,noatime,size=256M 0 0" for the tmpfs mount. There
>>>>>>>>>>> was a problem here. After a couple of hours varnish died and I received a
>>>>>>>>>>> "no space left on device" message - deleting the /var/lib/varnish solved
>>>>>>>>>>> the problem and varnish was up again, but it's weird because there was free
>>>>>>>>>>> memory on the host to be used with the tmpfs directory, so I don't know
>>>>>>>>>>> what could have happened. I will try to stop increasing the
>>>>>>>>>>> /var/lib/varnish size.
>>>>>>>>>>>
>>>>>>>>>>> Anyway, I am worried about the bans. You asked me if the bans
>>>>>>>>>>> are lurker friedly. Well, I don't think so. My bans are created this way:
>>>>>>>>>>>
>>>>>>>>>>> ban("req.http.host == " + req.http.host + " && req.url ~ " +
>>>>>>>>>>> req.url + " && req.http.User-Agent !~ Googlebot");
>>>>>>>>>>>
>>>>>>>>>>> Are they lurker friendly? I was taking a quick look and the
>>>>>>>>>>> documentation and it looks like they're not.
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Stefano
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Jun 23, 2017 at 11:30 AM, Guillaume Quintard <
>>>>>>>>>>> guillaume@varnish-software.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Stefano,
>>>>>>>>>>>>
>>>>>>>>>>>> Let's cover the usual suspects: I/Os. I think here Varnish gets
>>>>>>>>>>>> stuck trying to push/pull data and can't make time to reply to the CLI. I'd
>>>>>>>>>>>> recommend monitoring the disk activity (bandwidth and iops) to confirm.
>>>>>>>>>>>>
>>>>>>>>>>>> After some time, the file storage is terrible on a hard drive
>>>>>>>>>>>> (SSDs take a bit more time to degrade) because of fragmentation. One
>>>>>>>>>>>> solution to help the disks cope is to overprovision themif they're SSDs,
>>>>>>>>>>>> and you can try different advices in the file storage definition in the
>>>>>>>>>>>> command line (last parameter, after granularity).
>>>>>>>>>>>>
>>>>>>>>>>>> Is your /var/lib/varnish mount on tmpfs? That could help too.
>>>>>>>>>>>>
>>>>>>>>>>>> 40K bans is a lot, are they ban-lurker friendly?
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Guillaume Quintard
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Jun 23, 2017 at 4:01 PM, Stefano Baldo <
>>>>>>>>>>>> stefanobaldo@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hello.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I am having a critical problem with Varnish Cache in
>>>>>>>>>>>>> production for over a month and any help will be appreciated.
>>>>>>>>>>>>> The problem is that Varnish child process is recurrently being
>>>>>>>>>>>>> restarted after 10~20h of use, with the following message:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824)
>>>>>>>>>>>>> not responding to CLI, killed it.
>>>>>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Unexpected reply
>>>>>>>>>>>>> from ping: 400 CLI communication error
>>>>>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824)
>>>>>>>>>>>>> died signal=9
>>>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child cleanup
>>>>>>>>>>>>> complete
>>>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038)
>>>>>>>>>>>>> Started
>>>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038)
>>>>>>>>>>>>> said Child starts
>>>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038)
>>>>>>>>>>>>> said SMF.s0 mmap'ed 483183820800 bytes of 483183820800
>>>>>>>>>>>>>
>>>>>>>>>>>>> The following link is the varnishstat output just 1 minute
>>>>>>>>>>>>> before a restart:
>>>>>>>>>>>>>
>>>>>>>>>>>>> https://pastebin.com/g0g5RVTs
>>>>>>>>>>>>>
>>>>>>>>>>>>> Environment:
>>>>>>>>>>>>>
>>>>>>>>>>>>> varnish-5.1.2 revision 6ece695
>>>>>>>>>>>>> Debian 8.7 - Debian GNU/Linux 8 (3.16.0)
>>>>>>>>>>>>> Installed using pre-built package from official repo at
>>>>>>>>>>>>> packagecloud.io
>>>>>>>>>>>>> CPU 2x2.9 GHz
>>>>>>>>>>>>> Mem 3.69 GiB
>>>>>>>>>>>>> Running inside a Docker container
>>>>>>>>>>>>> NFILES=131072
>>>>>>>>>>>>> MEMLOCK=82000
>>>>>>>>>>>>>
>>>>>>>>>>>>> Additional info:
>>>>>>>>>>>>>
>>>>>>>>>>>>> - I need to cache a large number of objets and the cache
>>>>>>>>>>>>> should last for almost a week, so I have set up a 450G storage space, I
>>>>>>>>>>>>> don't know if this is a problem;
>>>>>>>>>>>>> - I use ban a lot. There was about 40k bans in the system just
>>>>>>>>>>>>> before the last crash. I really don't know if this is too much or may have
>>>>>>>>>>>>> anything to do with it;
>>>>>>>>>>>>> - No registered CPU spikes (almost always by 30%);
>>>>>>>>>>>>> - No panic is reported, the only info I can retrieve is from
>>>>>>>>>>>>> syslog;
>>>>>>>>>>>>> - During all the time, event moments before the crashes,
>>>>>>>>>>>>> everything is okay and requests are being responded very fast.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best,
>>>>>>>>>>>>> Stefano Baldo
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>> varnish-misc mailing list
>>>>>>>>>>>>> varnish-misc@varnish-cache.org
>>>>>>>>>>>>> https://www.varnish-cache.org/lists/mailman/listinfo/varnish
>>>>>>>>>>>>> -misc
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
Re: Child process recurrently being restarted [ In reply to ]
Also, we are running Varnish inside a Docker container.
The storage disk is attached to the same host and mounted into the container
via a Docker volume.

Do you think it's worth trying to run Varnish directly on the host, avoiding
Docker? I don't see how this could be a problem, but I don't know what to do
anymore.
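
For reference, our setup is roughly like this (image name and paths changed,
just to give the idea):

# illustrative only - not the real image name or paths
docker run -d --name varnish \
    -v /data/varnish-storage:/var/lib/varnish-storage \
    -p 80:80 \
    our-varnish-image \
    varnishd -F -a :80 -f /etc/varnish/default.vcl \
        -s file,/var/lib/varnish-storage/cache.bin,450G

Running it directly on the host would basically be that varnishd command
without the container around it.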

Best,


On Wed, Jun 28, 2017 at 10:43 AM, Guillaume Quintard <
guillaume@varnish-software.com> wrote:

> Yeah, I was wondering about Transient, but it seems under control.
>
> Apart from moving away from file storage, I have nothing at the moment :-/
>
> --
> Guillaume Quintard
>
> On Wed, Jun 28, 2017 at 3:39 PM, Stefano Baldo <stefanobaldo@gmail.com>
> wrote:
>
>> Hi.
>>
>> root@2c6c325b279f:/# varnishstat -1 | grep g_bytes
>> SMA.Transient.g_bytes 519022 . Bytes
>> outstanding
>> SMF.s0.g_bytes 23662845952 . Bytes
>> outstanding
>>
>> You mean g_bytes from SMA.Transient? I have set no malloc storage.
>>
>>
>> On Wed, Jun 28, 2017 at 10:26 AM, Guillaume Quintard <
>> guillaume@varnish-software.com> wrote:
>>
>>> Hi,
>>>
>>> can you look that "varnishstat -1 | grep g_bytes" and see if if matches
>>> the memory you are seeing?
>>>
>>> --
>>> Guillaume Quintard
>>>
>>> On Wed, Jun 28, 2017 at 3:20 PM, Stefano Baldo <stefanobaldo@gmail.com>
>>> wrote:
>>>
>>>> Hi Guillaume.
>>>>
>>>> I increased the cli_timeout yesterday to 900sec (15min) and it
>>>> restarted anyway, which seems to indicate that the thread is really stalled.
>>>>
>>>> This was 1 minute after the last restart:
>>>>
>>>> MAIN.n_object 3908216 . object structs made
>>>> SMF.s0.g_alloc 7794510 . Allocations outstanding
>>>>
>>>> I've just changed the I/O Scheduler to noop to see what happens.
>>>>
>>>> One interest thing I've found is about the memory usage.
>>>>
>>>> In the 1st minute of use:
>>>> MemTotal: 3865572 kB
>>>> MemFree: 120768 kB
>>>> MemAvailable: 2300268 kB
>>>>
>>>> 1 minute before a restart:
>>>> MemTotal: 3865572 kB
>>>> MemFree: 82480 kB
>>>> MemAvailable: 68316 kB
>>>>
>>>> It seems like the system is possibly running out of memory.
>>>>
>>>> When calling varnishd, I'm specifying only "-s file,..." as storage. I
>>>> see in some examples that is common to use "-s file" AND "-s malloc"
>>>> together. Should I be passing "-s malloc" as well to somehow try to limit
>>>> the memory usage by varnishd?
>>>>
>>>> Best,
>>>> Stefano
>>>>
>>>>
>>>> On Wed, Jun 28, 2017 at 4:12 AM, Guillaume Quintard <
>>>> guillaume@varnish-software.com> wrote:
>>>>
>>>>> Sadly, nothing suspicious here, you can still try:
>>>>> - bumping the cli_timeout
>>>>> - changing your disk scheduler
>>>>> - changing the advice option of the file storage
>>>>>
>>>>> I'm still convinced this is due to Varnish getting stuck waiting for
>>>>> the disk because of the file storage fragmentation.
>>>>>
>>>>> Maybe you could look at SMF.*.g_alloc and compare it to the number of
>>>>> objects. Ideally, we would have a 1:1 relation between objects and
>>>>> allocations. If that number drops prior to a restart, that would be a good
>>>>> clue.
>>>>>
>>>>>
>>>>> --
>>>>> Guillaume Quintard
>>>>>
>>>>> On Tue, Jun 27, 2017 at 11:07 PM, Stefano Baldo <
>>>>> stefanobaldo@gmail.com> wrote:
>>>>>
>>>>>> Hi Guillaume.
>>>>>>
>>>>>> It keeps restarting.
>>>>>> Would you mind taking a quick look in the following VCL file to check
>>>>>> if you find anything suspicious?
>>>>>>
>>>>>> Thank you very much.
>>>>>>
>>>>>> Best,
>>>>>> Stefano
>>>>>>
>>>>>> vcl 4.0;
>>>>>>
>>>>>> import std;
>>>>>>
>>>>>> backend default {
>>>>>> .host = "sites-web-server-lb";
>>>>>> .port = "80";
>>>>>> }
>>>>>>
>>>>>> include "/etc/varnish/bad_bot_detection.vcl";
>>>>>>
>>>>>> sub vcl_recv {
>>>>>> call bad_bot_detection;
>>>>>>
>>>>>> if (req.url == "/nocache" || req.url == "/version") {
>>>>>> return(pass);
>>>>>> }
>>>>>>
>>>>>> unset req.http.Cookie;
>>>>>> if (req.method == "PURGE") {
>>>>>> ban("obj.http.x-host == " + req.http.host + " &&
>>>>>> obj.http.x-user-agent !~ Googlebot");
>>>>>> return(synth(750));
>>>>>> }
>>>>>>
>>>>>> set req.url = regsuball(req.url, "(?<!(http:|https))\/+", "/");
>>>>>> }
>>>>>>
>>>>>> sub vcl_synth {
>>>>>> if (resp.status == 750) {
>>>>>> set resp.status = 200;
>>>>>> synthetic("PURGED => " + req.url);
>>>>>> return(deliver);
>>>>>> } elsif (resp.status == 501) {
>>>>>> set resp.status = 200;
>>>>>> set resp.http.Content-Type = "text/html; charset=utf-8";
>>>>>> synthetic(std.fileread("/etc/varnish/pages/invalid_domain.ht
>>>>>> ml"));
>>>>>> return(deliver);
>>>>>> }
>>>>>> }
>>>>>>
>>>>>> sub vcl_backend_response {
>>>>>> unset beresp.http.Set-Cookie;
>>>>>> set beresp.http.x-host = bereq.http.host;
>>>>>> set beresp.http.x-user-agent = bereq.http.user-agent;
>>>>>>
>>>>>> if (bereq.url == "/themes/basic/assets/theme.min.css"
>>>>>> || bereq.url == "/api/events/PAGEVIEW"
>>>>>> || bereq.url ~ "^\/assets\/img\/") {
>>>>>> set beresp.http.Cache-Control = "max-age=0";
>>>>>> } else {
>>>>>> unset beresp.http.Cache-Control;
>>>>>> }
>>>>>>
>>>>>> if (beresp.status == 200 ||
>>>>>> beresp.status == 301 ||
>>>>>> beresp.status == 302 ||
>>>>>> beresp.status == 404) {
>>>>>> if (bereq.url ~ "\&ordenar=aleatorio$") {
>>>>>> set beresp.http.X-TTL = "1d";
>>>>>> set beresp.ttl = 1d;
>>>>>> } else {
>>>>>> set beresp.http.X-TTL = "1w";
>>>>>> set beresp.ttl = 1w;
>>>>>> }
>>>>>> }
>>>>>>
>>>>>> if (bereq.url !~ "\.(jpeg|jpg|png|gif|gz|tgz|bz2|tbz|mp3|ogg|swf|flv)$")
>>>>>> {
>>>>>> set beresp.do_gzip = true;
>>>>>> }
>>>>>> }
>>>>>>
>>>>>> sub vcl_pipe {
>>>>>> set bereq.http.connection = "close";
>>>>>> return (pipe);
>>>>>> }
>>>>>>
>>>>>> sub vcl_deliver {
>>>>>> unset resp.http.x-host;
>>>>>> unset resp.http.x-user-agent;
>>>>>> }
>>>>>>
>>>>>> sub vcl_backend_error {
>>>>>> if (beresp.status == 502 || beresp.status == 503 || beresp.status
>>>>>> == 504) {
>>>>>> set beresp.status = 200;
>>>>>> set beresp.http.Content-Type = "text/html; charset=utf-8";
>>>>>> synthetic(std.fileread("/etc/varnish/pages/maintenance.html"));
>>>>>> return (deliver);
>>>>>> }
>>>>>> }
>>>>>>
>>>>>> sub vcl_hash {
>>>>>> if (req.http.User-Agent ~ "Google Page Speed") {
>>>>>> hash_data("Google Page Speed");
>>>>>> } elsif (req.http.User-Agent ~ "Googlebot") {
>>>>>> hash_data("Googlebot");
>>>>>> }
>>>>>> }
>>>>>>
>>>>>> sub vcl_deliver {
>>>>>> if (resp.status == 501) {
>>>>>> return (synth(resp.status));
>>>>>> }
>>>>>> if (obj.hits > 0) {
>>>>>> set resp.http.X-Cache = "hit";
>>>>>> } else {
>>>>>> set resp.http.X-Cache = "miss";
>>>>>> }
>>>>>> }
>>>>>>
>>>>>>
>>>>>> On Mon, Jun 26, 2017 at 3:47 PM, Guillaume Quintard <
>>>>>> guillaume@varnish-software.com> wrote:
>>>>>>
>>>>>>> Nice! It may have been the cause, time will tell.can you report back
>>>>>>> in a few days to let us know?
>>>>>>> --
>>>>>>> Guillaume Quintard
>>>>>>>
>>>>>>> On Jun 26, 2017 20:21, "Stefano Baldo" <stefanobaldo@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Guillaume.
>>>>>>>>
>>>>>>>> I think things will start to going better now after changing the
>>>>>>>> bans.
>>>>>>>> This is how my last varnishstat looked like moments before a crash
>>>>>>>> regarding the bans:
>>>>>>>>
>>>>>>>> MAIN.bans 41336 . Count of bans
>>>>>>>> MAIN.bans_completed 37967 . Number of bans
>>>>>>>> marked 'completed'
>>>>>>>> MAIN.bans_obj 0 . Number of bans
>>>>>>>> using obj.*
>>>>>>>> MAIN.bans_req 41335 . Number of bans
>>>>>>>> using req.*
>>>>>>>> MAIN.bans_added 41336 0.68 Bans added
>>>>>>>> MAIN.bans_deleted 0 0.00 Bans deleted
>>>>>>>>
>>>>>>>> And this is how it looks like now:
>>>>>>>>
>>>>>>>> MAIN.bans 2 . Count of bans
>>>>>>>> MAIN.bans_completed 1 . Number of bans
>>>>>>>> marked 'completed'
>>>>>>>> MAIN.bans_obj 2 . Number of bans
>>>>>>>> using obj.*
>>>>>>>> MAIN.bans_req 0 . Number of bans
>>>>>>>> using req.*
>>>>>>>> MAIN.bans_added 2016 0.69 Bans added
>>>>>>>> MAIN.bans_deleted 2014 0.69 Bans deleted
>>>>>>>>
>>>>>>>> Before the changes, bans were never deleted!
>>>>>>>> Now the bans are added and quickly deleted after a minute or even a
>>>>>>>> couple of seconds.
>>>>>>>>
>>>>>>>> May this was the cause of the problem? It seems like varnish was
>>>>>>>> having a large number of bans to manage and test against.
>>>>>>>> I will let it ride now. Let's see if the problem persists or it's
>>>>>>>> gone! :-)
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Stefano
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Jun 26, 2017 at 3:10 PM, Guillaume Quintard <
>>>>>>>> guillaume@varnish-software.com> wrote:
>>>>>>>>
>>>>>>>>> Looking good!
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Guillaume Quintard
>>>>>>>>>
>>>>>>>>> On Mon, Jun 26, 2017 at 7:06 PM, Stefano Baldo <
>>>>>>>>> stefanobaldo@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Guillaume,
>>>>>>>>>>
>>>>>>>>>> Can the following be considered "ban lurker friendly"?
>>>>>>>>>>
>>>>>>>>>> sub vcl_backend_response {
>>>>>>>>>> set beresp.http.x-url = bereq.http.host + bereq.url;
>>>>>>>>>> set beresp.http.x-user-agent = bereq.http.user-agent;
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> sub vcl_recv {
>>>>>>>>>> if (req.method == "PURGE") {
>>>>>>>>>> ban("obj.http.x-url == " + req.http.host + req.url + " &&
>>>>>>>>>> obj.http.x-user-agent !~ Googlebot");
>>>>>>>>>> return(synth(750));
>>>>>>>>>> }
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> sub vcl_deliver {
>>>>>>>>>> unset resp.http.x-url;
>>>>>>>>>> unset resp.http.x-user-agent;
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Stefano
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mon, Jun 26, 2017 at 12:43 PM, Guillaume Quintard <
>>>>>>>>>> guillaume@varnish-software.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Not lurker friendly at all indeed. You'll need to avoid req.*
>>>>>>>>>>> expression. Easiest way is to stash the host, user-agent and url in
>>>>>>>>>>> beresp.http.* and ban against those (unset them in vcl_deliver).
>>>>>>>>>>>
>>>>>>>>>>> I don't think you need to expand the VSL at all.
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Guillaume Quintard
>>>>>>>>>>>
>>>>>>>>>>> On Jun 26, 2017 16:51, "Stefano Baldo" <stefanobaldo@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Guillaume.
>>>>>>>>>>>
>>>>>>>>>>> Thanks for answering.
>>>>>>>>>>>
>>>>>>>>>>> I'm using a SSD disk. I've changed from ext4 to ext2 to increase
>>>>>>>>>>> performance but it stills restarting.
>>>>>>>>>>> Also, I checked the I/O performance for the disk and there is no
>>>>>>>>>>> signal of overhead.
>>>>>>>>>>>
>>>>>>>>>>> I've changed the /var/lib/varnish to a tmpfs and increased its
>>>>>>>>>>> 80m default size passing "-l 200m,20m" to varnishd and using
>>>>>>>>>>> "nodev,nosuid,noatime,size=256M 0 0" for the tmpfs mount. There
>>>>>>>>>>> was a problem here. After a couple of hours varnish died and I received a
>>>>>>>>>>> "no space left on device" message - deleting the /var/lib/varnish solved
>>>>>>>>>>> the problem and varnish was up again, but it's weird because there was free
>>>>>>>>>>> memory on the host to be used with the tmpfs directory, so I don't know
>>>>>>>>>>> what could have happened. I will try to stop increasing the
>>>>>>>>>>> /var/lib/varnish size.
>>>>>>>>>>>
>>>>>>>>>>> Anyway, I am worried about the bans. You asked me if the bans
>>>>>>>>>>> are lurker friedly. Well, I don't think so. My bans are created this way:
>>>>>>>>>>>
>>>>>>>>>>> ban("req.http.host == " + req.http.host + " && req.url ~ " +
>>>>>>>>>>> req.url + " && req.http.User-Agent !~ Googlebot");
>>>>>>>>>>>
>>>>>>>>>>> Are they lurker friendly? I was taking a quick look and the
>>>>>>>>>>> documentation and it looks like they're not.
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Stefano
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Jun 23, 2017 at 11:30 AM, Guillaume Quintard <
>>>>>>>>>>> guillaume@varnish-software.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Stefano,
>>>>>>>>>>>>
>>>>>>>>>>>> Let's cover the usual suspects: I/Os. I think here Varnish gets
>>>>>>>>>>>> stuck trying to push/pull data and can't make time to reply to the CLI. I'd
>>>>>>>>>>>> recommend monitoring the disk activity (bandwidth and iops) to confirm.
>>>>>>>>>>>>
>>>>>>>>>>>> After some time, the file storage is terrible on a hard drive
>>>>>>>>>>>> (SSDs take a bit more time to degrade) because of fragmentation. One
>>>>>>>>>>>> solution to help the disks cope is to overprovision themif they're SSDs,
>>>>>>>>>>>> and you can try different advices in the file storage definition in the
>>>>>>>>>>>> command line (last parameter, after granularity).
>>>>>>>>>>>>
>>>>>>>>>>>> Is your /var/lib/varnish mount on tmpfs? That could help too.
>>>>>>>>>>>>
>>>>>>>>>>>> 40K bans is a lot, are they ban-lurker friendly?
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Guillaume Quintard
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Jun 23, 2017 at 4:01 PM, Stefano Baldo <
>>>>>>>>>>>> stefanobaldo@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hello.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I am having a critical problem with Varnish Cache in
>>>>>>>>>>>>> production for over a month and any help will be appreciated.
>>>>>>>>>>>>> The problem is that Varnish child process is recurrently being
>>>>>>>>>>>>> restarted after 10~20h of use, with the following message:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824)
>>>>>>>>>>>>> not responding to CLI, killed it.
>>>>>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Unexpected reply
>>>>>>>>>>>>> from ping: 400 CLI communication error
>>>>>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824)
>>>>>>>>>>>>> died signal=9
>>>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child cleanup
>>>>>>>>>>>>> complete
>>>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038)
>>>>>>>>>>>>> Started
>>>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038)
>>>>>>>>>>>>> said Child starts
>>>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038)
>>>>>>>>>>>>> said SMF.s0 mmap'ed 483183820800 bytes of 483183820800
>>>>>>>>>>>>>
>>>>>>>>>>>>> The following link is the varnishstat output just 1 minute
>>>>>>>>>>>>> before a restart:
>>>>>>>>>>>>>
>>>>>>>>>>>>> https://pastebin.com/g0g5RVTs
>>>>>>>>>>>>>
>>>>>>>>>>>>> Environment:
>>>>>>>>>>>>>
>>>>>>>>>>>>> varnish-5.1.2 revision 6ece695
>>>>>>>>>>>>> Debian 8.7 - Debian GNU/Linux 8 (3.16.0)
>>>>>>>>>>>>> Installed using pre-built package from official repo at
>>>>>>>>>>>>> packagecloud.io
>>>>>>>>>>>>> CPU 2x2.9 GHz
>>>>>>>>>>>>> Mem 3.69 GiB
>>>>>>>>>>>>> Running inside a Docker container
>>>>>>>>>>>>> NFILES=131072
>>>>>>>>>>>>> MEMLOCK=82000
>>>>>>>>>>>>>
>>>>>>>>>>>>> Additional info:
>>>>>>>>>>>>>
>>>>>>>>>>>>> - I need to cache a large number of objets and the cache
>>>>>>>>>>>>> should last for almost a week, so I have set up a 450G storage space, I
>>>>>>>>>>>>> don't know if this is a problem;
>>>>>>>>>>>>> - I use ban a lot. There was about 40k bans in the system just
>>>>>>>>>>>>> before the last crash. I really don't know if this is too much or may have
>>>>>>>>>>>>> anything to do with it;
>>>>>>>>>>>>> - No registered CPU spikes (almost always by 30%);
>>>>>>>>>>>>> - No panic is reported, the only info I can retrieve is from
>>>>>>>>>>>>> syslog;
>>>>>>>>>>>>> - During all the time, event moments before the crashes,
>>>>>>>>>>>>> everything is okay and requests are being responded very fast.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best,
>>>>>>>>>>>>> Stefano Baldo
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>> varnish-misc mailing list
>>>>>>>>>>>>> varnish-misc@varnish-cache.org
>>>>>>>>>>>>> https://www.varnish-cache.org/lists/mailman/listinfo/varnish
>>>>>>>>>>>>> -misc
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
Re: Child process recurrently being restarted [ In reply to ]
Transient is not limited, I suppose, so g_space == 0 is normal.
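
If you ever want to cap it anyway, you can name a storage "Transient" on the
command line - a quick sketch, with 256m as an arbitrary example and a
placeholder path:

# cap the transient store explicitly (don't make it too tight, short-lived
# objects fail to allocate once Transient is full)
varnishd -a :80 -f /etc/varnish/default.vcl \
    -s file,/var/lib/varnish-storage/cache.bin,450G \
    -s Transient=malloc,256m

With a cap in place, g_space will report the remaining room instead of 0.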

You can try running on bare metal, but I'm not sure it will make a difference.

--
Guillaume Quintard

On Wed, Jun 28, 2017 at 3:54 PM, Stefano Baldo <stefanobaldo@gmail.com>
wrote:

> Also, we are running Varnish inside a Docker container.
> The storage disk is attached to the same host and mounted into the
> container via a Docker volume.
>
> Do you think it's worth trying to run Varnish directly on the host,
> avoiding Docker? I don't see how this could be a problem, but I don't know
> what to do anymore.
>
> Best,
>
>
> On Wed, Jun 28, 2017 at 10:43 AM, Guillaume Quintard <
> guillaume@varnish-software.com> wrote:
>
>> Yeah, I was wondering about Transient, but it seems under control.
>>
>> Apart from moving away from file storage, I have nothing at the moment :-/
>>
>> --
>> Guillaume Quintard
>>
>> On Wed, Jun 28, 2017 at 3:39 PM, Stefano Baldo <stefanobaldo@gmail.com>
>> wrote:
>>
>>> Hi.
>>>
>>> root@2c6c325b279f:/# varnishstat -1 | grep g_bytes
>>> SMA.Transient.g_bytes 519022 . Bytes
>>> outstanding
>>> SMF.s0.g_bytes 23662845952 . Bytes
>>> outstanding
>>>
>>> You mean g_bytes from SMA.Transient? I have set no malloc storage.
>>>
>>>
>>> On Wed, Jun 28, 2017 at 10:26 AM, Guillaume Quintard <
>>> guillaume@varnish-software.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> can you look that "varnishstat -1 | grep g_bytes" and see if if matches
>>>> the memory you are seeing?
>>>>
>>>> --
>>>> Guillaume Quintard
>>>>
>>>> On Wed, Jun 28, 2017 at 3:20 PM, Stefano Baldo <stefanobaldo@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Guillaume.
>>>>>
>>>>> I increased the cli_timeout yesterday to 900sec (15min) and it
>>>>> restarted anyway, which seems to indicate that the thread is really stalled.
>>>>>
>>>>> This was 1 minute after the last restart:
>>>>>
>>>>> MAIN.n_object 3908216 . object structs made
>>>>> SMF.s0.g_alloc 7794510 . Allocations
>>>>> outstanding
>>>>>
>>>>> I've just changed the I/O Scheduler to noop to see what happens.
>>>>>
>>>>> One interest thing I've found is about the memory usage.
>>>>>
>>>>> In the 1st minute of use:
>>>>> MemTotal: 3865572 kB
>>>>> MemFree: 120768 kB
>>>>> MemAvailable: 2300268 kB
>>>>>
>>>>> 1 minute before a restart:
>>>>> MemTotal: 3865572 kB
>>>>> MemFree: 82480 kB
>>>>> MemAvailable: 68316 kB
>>>>>
>>>>> It seems like the system is possibly running out of memory.
>>>>>
>>>>> When calling varnishd, I'm specifying only "-s file,..." as storage. I
>>>>> see in some examples that is common to use "-s file" AND "-s malloc"
>>>>> together. Should I be passing "-s malloc" as well to somehow try to limit
>>>>> the memory usage by varnishd?
>>>>>
>>>>> Best,
>>>>> Stefano
>>>>>
>>>>>
>>>>> On Wed, Jun 28, 2017 at 4:12 AM, Guillaume Quintard <
>>>>> guillaume@varnish-software.com> wrote:
>>>>>
>>>>>> Sadly, nothing suspicious here, you can still try:
>>>>>> - bumping the cli_timeout
>>>>>> - changing your disk scheduler
>>>>>> - changing the advice option of the file storage
>>>>>>
>>>>>> I'm still convinced this is due to Varnish getting stuck waiting for
>>>>>> the disk because of the file storage fragmentation.
>>>>>>
>>>>>> Maybe you could look at SMF.*.g_alloc and compare it to the number of
>>>>>> objects. Ideally, we would have a 1:1 relation between objects and
>>>>>> allocations. If that number drops prior to a restart, that would be a good
>>>>>> clue.
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Guillaume Quintard
>>>>>>
>>>>>> On Tue, Jun 27, 2017 at 11:07 PM, Stefano Baldo <
>>>>>> stefanobaldo@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Guillaume.
>>>>>>>
>>>>>>> It keeps restarting.
>>>>>>> Would you mind taking a quick look in the following VCL file to
>>>>>>> check if you find anything suspicious?
>>>>>>>
>>>>>>> Thank you very much.
>>>>>>>
>>>>>>> Best,
>>>>>>> Stefano
>>>>>>>
>>>>>>> vcl 4.0;
>>>>>>>
>>>>>>> import std;
>>>>>>>
>>>>>>> backend default {
>>>>>>> .host = "sites-web-server-lb";
>>>>>>> .port = "80";
>>>>>>> }
>>>>>>>
>>>>>>> include "/etc/varnish/bad_bot_detection.vcl";
>>>>>>>
>>>>>>> sub vcl_recv {
>>>>>>> call bad_bot_detection;
>>>>>>>
>>>>>>> if (req.url == "/nocache" || req.url == "/version") {
>>>>>>> return(pass);
>>>>>>> }
>>>>>>>
>>>>>>> unset req.http.Cookie;
>>>>>>> if (req.method == "PURGE") {
>>>>>>> ban("obj.http.x-host == " + req.http.host + " &&
>>>>>>> obj.http.x-user-agent !~ Googlebot");
>>>>>>> return(synth(750));
>>>>>>> }
>>>>>>>
>>>>>>> set req.url = regsuball(req.url, "(?<!(http:|https))\/+", "/");
>>>>>>> }
>>>>>>>
>>>>>>> sub vcl_synth {
>>>>>>> if (resp.status == 750) {
>>>>>>> set resp.status = 200;
>>>>>>> synthetic("PURGED => " + req.url);
>>>>>>> return(deliver);
>>>>>>> } elsif (resp.status == 501) {
>>>>>>> set resp.status = 200;
>>>>>>> set resp.http.Content-Type = "text/html; charset=utf-8";
>>>>>>> synthetic(std.fileread("/etc/varnish/pages/invalid_domain.ht
>>>>>>> ml"));
>>>>>>> return(deliver);
>>>>>>> }
>>>>>>> }
>>>>>>>
>>>>>>> sub vcl_backend_response {
>>>>>>> unset beresp.http.Set-Cookie;
>>>>>>> set beresp.http.x-host = bereq.http.host;
>>>>>>> set beresp.http.x-user-agent = bereq.http.user-agent;
>>>>>>>
>>>>>>> if (bereq.url == "/themes/basic/assets/theme.min.css"
>>>>>>> || bereq.url == "/api/events/PAGEVIEW"
>>>>>>> || bereq.url ~ "^\/assets\/img\/") {
>>>>>>> set beresp.http.Cache-Control = "max-age=0";
>>>>>>> } else {
>>>>>>> unset beresp.http.Cache-Control;
>>>>>>> }
>>>>>>>
>>>>>>> if (beresp.status == 200 ||
>>>>>>> beresp.status == 301 ||
>>>>>>> beresp.status == 302 ||
>>>>>>> beresp.status == 404) {
>>>>>>> if (bereq.url ~ "\&ordenar=aleatorio$") {
>>>>>>> set beresp.http.X-TTL = "1d";
>>>>>>> set beresp.ttl = 1d;
>>>>>>> } else {
>>>>>>> set beresp.http.X-TTL = "1w";
>>>>>>> set beresp.ttl = 1w;
>>>>>>> }
>>>>>>> }
>>>>>>>
>>>>>>> if (bereq.url !~ "\.(jpeg|jpg|png|gif|gz|tgz|bz2|tbz|mp3|ogg|swf|flv)$")
>>>>>>> {
>>>>>>> set beresp.do_gzip = true;
>>>>>>> }
>>>>>>> }
>>>>>>>
>>>>>>> sub vcl_pipe {
>>>>>>> set bereq.http.connection = "close";
>>>>>>> return (pipe);
>>>>>>> }
>>>>>>>
>>>>>>> sub vcl_deliver {
>>>>>>> unset resp.http.x-host;
>>>>>>> unset resp.http.x-user-agent;
>>>>>>> }
>>>>>>>
>>>>>>> sub vcl_backend_error {
>>>>>>> if (beresp.status == 502 || beresp.status == 503 || beresp.status
>>>>>>> == 504) {
>>>>>>> set beresp.status = 200;
>>>>>>> set beresp.http.Content-Type = "text/html; charset=utf-8";
>>>>>>> synthetic(std.fileread("/etc/varnish/pages/maintenance.html"));
>>>>>>> return (deliver);
>>>>>>> }
>>>>>>> }
>>>>>>>
>>>>>>> sub vcl_hash {
>>>>>>> if (req.http.User-Agent ~ "Google Page Speed") {
>>>>>>> hash_data("Google Page Speed");
>>>>>>> } elsif (req.http.User-Agent ~ "Googlebot") {
>>>>>>> hash_data("Googlebot");
>>>>>>> }
>>>>>>> }
>>>>>>>
>>>>>>> sub vcl_deliver {
>>>>>>> if (resp.status == 501) {
>>>>>>> return (synth(resp.status));
>>>>>>> }
>>>>>>> if (obj.hits > 0) {
>>>>>>> set resp.http.X-Cache = "hit";
>>>>>>> } else {
>>>>>>> set resp.http.X-Cache = "miss";
>>>>>>> }
>>>>>>> }
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Jun 26, 2017 at 3:47 PM, Guillaume Quintard <
>>>>>>> guillaume@varnish-software.com> wrote:
>>>>>>>
>>>>>>>> Nice! It may have been the cause, time will tell.can you report
>>>>>>>> back in a few days to let us know?
>>>>>>>> --
>>>>>>>> Guillaume Quintard
>>>>>>>>
>>>>>>>> On Jun 26, 2017 20:21, "Stefano Baldo" <stefanobaldo@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Guillaume.
>>>>>>>>>
>>>>>>>>> I think things will start to going better now after changing the
>>>>>>>>> bans.
>>>>>>>>> This is how my last varnishstat looked like moments before a crash
>>>>>>>>> regarding the bans:
>>>>>>>>>
>>>>>>>>> MAIN.bans 41336 . Count of bans
>>>>>>>>> MAIN.bans_completed 37967 . Number of bans
>>>>>>>>> marked 'completed'
>>>>>>>>> MAIN.bans_obj 0 . Number of bans
>>>>>>>>> using obj.*
>>>>>>>>> MAIN.bans_req 41335 . Number of bans
>>>>>>>>> using req.*
>>>>>>>>> MAIN.bans_added 41336 0.68 Bans added
>>>>>>>>> MAIN.bans_deleted 0 0.00 Bans deleted
>>>>>>>>>
>>>>>>>>> And this is how it looks like now:
>>>>>>>>>
>>>>>>>>> MAIN.bans 2 . Count of bans
>>>>>>>>> MAIN.bans_completed 1 . Number of bans
>>>>>>>>> marked 'completed'
>>>>>>>>> MAIN.bans_obj 2 . Number of bans
>>>>>>>>> using obj.*
>>>>>>>>> MAIN.bans_req 0 . Number of bans
>>>>>>>>> using req.*
>>>>>>>>> MAIN.bans_added 2016 0.69 Bans added
>>>>>>>>> MAIN.bans_deleted 2014 0.69 Bans deleted
>>>>>>>>>
>>>>>>>>> Before the changes, bans were never deleted!
>>>>>>>>> Now the bans are added and quickly deleted after a minute or even
>>>>>>>>> a couple of seconds.
>>>>>>>>>
>>>>>>>>> May this was the cause of the problem? It seems like varnish was
>>>>>>>>> having a large number of bans to manage and test against.
>>>>>>>>> I will let it ride now. Let's see if the problem persists or it's
>>>>>>>>> gone! :-)
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Stefano
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Jun 26, 2017 at 3:10 PM, Guillaume Quintard <
>>>>>>>>> guillaume@varnish-software.com> wrote:
>>>>>>>>>
>>>>>>>>>> Looking good!
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Guillaume Quintard
>>>>>>>>>>
>>>>>>>>>> On Mon, Jun 26, 2017 at 7:06 PM, Stefano Baldo <
>>>>>>>>>> stefanobaldo@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Guillaume,
>>>>>>>>>>>
>>>>>>>>>>> Can the following be considered "ban lurker friendly"?
>>>>>>>>>>>
>>>>>>>>>>> sub vcl_backend_response {
>>>>>>>>>>> set beresp.http.x-url = bereq.http.host + bereq.url;
>>>>>>>>>>> set beresp.http.x-user-agent = bereq.http.user-agent;
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> sub vcl_recv {
>>>>>>>>>>> if (req.method == "PURGE") {
>>>>>>>>>>> ban("obj.http.x-url == " + req.http.host + req.url + " &&
>>>>>>>>>>> obj.http.x-user-agent !~ Googlebot");
>>>>>>>>>>> return(synth(750));
>>>>>>>>>>> }
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> sub vcl_deliver {
>>>>>>>>>>> unset resp.http.x-url;
>>>>>>>>>>> unset resp.http.x-user-agent;
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Stefano
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Jun 26, 2017 at 12:43 PM, Guillaume Quintard <
>>>>>>>>>>> guillaume@varnish-software.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Not lurker friendly at all indeed. You'll need to avoid req.*
>>>>>>>>>>>> expression. Easiest way is to stash the host, user-agent and url in
>>>>>>>>>>>> beresp.http.* and ban against those (unset them in vcl_deliver).
>>>>>>>>>>>>
>>>>>>>>>>>> I don't think you need to expand the VSL at all.
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Guillaume Quintard
>>>>>>>>>>>>
>>>>>>>>>>>> On Jun 26, 2017 16:51, "Stefano Baldo" <stefanobaldo@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Guillaume.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for answering.
>>>>>>>>>>>>
>>>>>>>>>>>> I'm using a SSD disk. I've changed from ext4 to ext2 to
>>>>>>>>>>>> increase performance but it stills restarting.
>>>>>>>>>>>> Also, I checked the I/O performance for the disk and there is
>>>>>>>>>>>> no signal of overhead.
>>>>>>>>>>>>
>>>>>>>>>>>> I've changed the /var/lib/varnish to a tmpfs and increased its
>>>>>>>>>>>> 80m default size passing "-l 200m,20m" to varnishd and using
>>>>>>>>>>>> "nodev,nosuid,noatime,size=256M 0 0" for the tmpfs mount.
>>>>>>>>>>>> There was a problem here. After a couple of hours varnish died and I
>>>>>>>>>>>> received a "no space left on device" message - deleting the
>>>>>>>>>>>> /var/lib/varnish solved the problem and varnish was up again, but it's
>>>>>>>>>>>> weird because there was free memory on the host to be used with the tmpfs
>>>>>>>>>>>> directory, so I don't know what could have happened. I will try to stop
>>>>>>>>>>>> increasing the /var/lib/varnish size.
>>>>>>>>>>>>
>>>>>>>>>>>> Anyway, I am worried about the bans. You asked me if the bans
>>>>>>>>>>>> are lurker friedly. Well, I don't think so. My bans are created this way:
>>>>>>>>>>>>
>>>>>>>>>>>> ban("req.http.host == " + req.http.host + " && req.url ~ " +
>>>>>>>>>>>> req.url + " && req.http.User-Agent !~ Googlebot");
>>>>>>>>>>>>
>>>>>>>>>>>> Are they lurker friendly? I was taking a quick look and the
>>>>>>>>>>>> documentation and it looks like they're not.
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Stefano
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Jun 23, 2017 at 11:30 AM, Guillaume Quintard <
>>>>>>>>>>>> guillaume@varnish-software.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Stefano,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Let's cover the usual suspects: I/Os. I think here Varnish
>>>>>>>>>>>>> gets stuck trying to push/pull data and can't make time to reply to the
>>>>>>>>>>>>> CLI. I'd recommend monitoring the disk activity (bandwidth and iops) to
>>>>>>>>>>>>> confirm.
>>>>>>>>>>>>>
>>>>>>>>>>>>> After some time, the file storage is terrible on a hard drive
>>>>>>>>>>>>> (SSDs take a bit more time to degrade) because of fragmentation. One
>>>>>>>>>>>>> solution to help the disks cope is to overprovision themif they're SSDs,
>>>>>>>>>>>>> and you can try different advices in the file storage definition in the
>>>>>>>>>>>>> command line (last parameter, after granularity).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Is your /var/lib/varnish mount on tmpfs? That could help too.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 40K bans is a lot, are they ban-lurker friendly?
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Guillaume Quintard
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Jun 23, 2017 at 4:01 PM, Stefano Baldo <
>>>>>>>>>>>>> stefanobaldo@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hello.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I am having a critical problem with Varnish Cache in
>>>>>>>>>>>>>> production for over a month and any help will be appreciated.
>>>>>>>>>>>>>> The problem is that Varnish child process is recurrently
>>>>>>>>>>>>>> being restarted after 10~20h of use, with the following message:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824)
>>>>>>>>>>>>>> not responding to CLI, killed it.
>>>>>>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Unexpected
>>>>>>>>>>>>>> reply from ping: 400 CLI communication error
>>>>>>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824)
>>>>>>>>>>>>>> died signal=9
>>>>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child cleanup
>>>>>>>>>>>>>> complete
>>>>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038)
>>>>>>>>>>>>>> Started
>>>>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038)
>>>>>>>>>>>>>> said Child starts
>>>>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038)
>>>>>>>>>>>>>> said SMF.s0 mmap'ed 483183820800 bytes of 483183820800
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The following link is the varnishstat output just 1 minute
>>>>>>>>>>>>>> before a restart:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> https://pastebin.com/g0g5RVTs
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Environment:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> varnish-5.1.2 revision 6ece695
>>>>>>>>>>>>>> Debian 8.7 - Debian GNU/Linux 8 (3.16.0)
>>>>>>>>>>>>>> Installed using pre-built package from official repo at
>>>>>>>>>>>>>> packagecloud.io
>>>>>>>>>>>>>> CPU 2x2.9 GHz
>>>>>>>>>>>>>> Mem 3.69 GiB
>>>>>>>>>>>>>> Running inside a Docker container
>>>>>>>>>>>>>> NFILES=131072
>>>>>>>>>>>>>> MEMLOCK=82000
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Additional info:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - I need to cache a large number of objets and the cache
>>>>>>>>>>>>>> should last for almost a week, so I have set up a 450G storage space, I
>>>>>>>>>>>>>> don't know if this is a problem;
>>>>>>>>>>>>>> - I use ban a lot. There was about 40k bans in the system
>>>>>>>>>>>>>> just before the last crash. I really don't know if this is too much or may
>>>>>>>>>>>>>> have anything to do with it;
>>>>>>>>>>>>>> - No registered CPU spikes (almost always by 30%);
>>>>>>>>>>>>>> - No panic is reported, the only info I can retrieve is from
>>>>>>>>>>>>>> syslog;
>>>>>>>>>>>>>> - During all the time, event moments before the crashes,
>>>>>>>>>>>>>> everything is okay and requests are being responded very fast.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>> Stefano Baldo
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>> varnish-misc mailing list
>>>>>>>>>>>>>> varnish-misc@varnish-cache.org
>>>>>>>>>>>>>> https://www.varnish-cache.org/lists/mailman/listinfo/varnish
>>>>>>>>>>>>>> -misc
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
Re: Child process recurrently being restarted [ In reply to ]
Assuming the problem is the system running out of memory, you will need to
do some memory tuning, especially given the number of threads you are using
and your access patterns. Your options (a sketch of the tunables follows the list):

- Add more memory to the system
- Reduce thread_pool_max
- Reduce jemalloc's thread cache (MALLOC_CONF="lg_tcache_max:10")
- Use some of the tuning params in here:
https://info.varnish-software.com/blog/understanding-varnish-cache-memory-usage
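
For example, the thread and jemalloc knobs above can be applied like this
(just a sketch; the numbers are placeholders and need to be sized for your
box, they are not taken from this thread):

# shrink jemalloc's per-thread cache before varnishd starts
export MALLOC_CONF="lg_tcache_max:10"

# cap worker threads at runtime...
varnishadm param.set thread_pool_max 1000

# ...or permanently on the varnishd command line
varnishd ... -p thread_pool_min=50 -p thread_pool_max=1000

Fewer threads means fewer stacks and fewer per-thread allocator caches, so
the child's memory footprint becomes more predictable.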


--
Reza Naghibi
Varnish Software

On Wed, Jun 28, 2017 at 9:26 AM, Guillaume Quintard <
guillaume@varnish-software.com> wrote:

> Hi,
>
> can you look that "varnishstat -1 | grep g_bytes" and see if if matches
> the memory you are seeing?
>
> --
> Guillaume Quintard
>
> On Wed, Jun 28, 2017 at 3:20 PM, Stefano Baldo <stefanobaldo@gmail.com>
> wrote:
>
>> Hi Guillaume.
>>
>> I increased the cli_timeout yesterday to 900sec (15min) and it restarted
>> anyway, which seems to indicate that the thread is really stalled.
>>
>> This was 1 minute after the last restart:
>>
>> MAIN.n_object 3908216 . object structs made
>> SMF.s0.g_alloc 7794510 . Allocations outstanding
>>
>> I've just changed the I/O Scheduler to noop to see what happens.
>>
>> One interest thing I've found is about the memory usage.
>>
>> In the 1st minute of use:
>> MemTotal: 3865572 kB
>> MemFree: 120768 kB
>> MemAvailable: 2300268 kB
>>
>> 1 minute before a restart:
>> MemTotal: 3865572 kB
>> MemFree: 82480 kB
>> MemAvailable: 68316 kB
>>
>> It seems like the system is possibly running out of memory.
>>
>> When calling varnishd, I'm specifying only "-s file,..." as storage. I
>> see in some examples that is common to use "-s file" AND "-s malloc"
>> together. Should I be passing "-s malloc" as well to somehow try to limit
>> the memory usage by varnishd?
>>
>> Best,
>> Stefano
>>
>>
>> On Wed, Jun 28, 2017 at 4:12 AM, Guillaume Quintard <
>> guillaume@varnish-software.com> wrote:
>>
>>> Sadly, nothing suspicious here, you can still try:
>>> - bumping the cli_timeout
>>> - changing your disk scheduler
>>> - changing the advice option of the file storage
>>>
>>> I'm still convinced this is due to Varnish getting stuck waiting for the
>>> disk because of the file storage fragmentation.
>>>
>>> Maybe you could look at SMF.*.g_alloc and compare it to the number of
>>> objects. Ideally, we would have a 1:1 relation between objects and
>>> allocations. If that number drops prior to a restart, that would be a good
>>> clue.
>>>
>>>
>>> --
>>> Guillaume Quintard
>>>
>>> On Tue, Jun 27, 2017 at 11:07 PM, Stefano Baldo <stefanobaldo@gmail.com>
>>> wrote:
>>>
>>>> Hi Guillaume.
>>>>
>>>> It keeps restarting.
>>>> Would you mind taking a quick look in the following VCL file to check
>>>> if you find anything suspicious?
>>>>
>>>> Thank you very much.
>>>>
>>>> Best,
>>>> Stefano
>>>>
>>>> vcl 4.0;
>>>>
>>>> import std;
>>>>
>>>> backend default {
>>>> .host = "sites-web-server-lb";
>>>> .port = "80";
>>>> }
>>>>
>>>> include "/etc/varnish/bad_bot_detection.vcl";
>>>>
>>>> sub vcl_recv {
>>>> call bad_bot_detection;
>>>>
>>>> if (req.url == "/nocache" || req.url == "/version") {
>>>> return(pass);
>>>> }
>>>>
>>>> unset req.http.Cookie;
>>>> if (req.method == "PURGE") {
>>>> ban("obj.http.x-host == " + req.http.host + " &&
>>>> obj.http.x-user-agent !~ Googlebot");
>>>> return(synth(750));
>>>> }
>>>>
>>>> set req.url = regsuball(req.url, "(?<!(http:|https))\/+", "/");
>>>> }
>>>>
>>>> sub vcl_synth {
>>>> if (resp.status == 750) {
>>>> set resp.status = 200;
>>>> synthetic("PURGED => " + req.url);
>>>> return(deliver);
>>>> } elsif (resp.status == 501) {
>>>> set resp.status = 200;
>>>> set resp.http.Content-Type = "text/html; charset=utf-8";
>>>> synthetic(std.fileread("/etc/varnish/pages/invalid_domain.html"));
>>>> return(deliver);
>>>> }
>>>> }
>>>>
>>>> sub vcl_backend_response {
>>>> unset beresp.http.Set-Cookie;
>>>> set beresp.http.x-host = bereq.http.host;
>>>> set beresp.http.x-user-agent = bereq.http.user-agent;
>>>>
>>>> if (bereq.url == "/themes/basic/assets/theme.min.css"
>>>> || bereq.url == "/api/events/PAGEVIEW"
>>>> || bereq.url ~ "^\/assets\/img\/") {
>>>> set beresp.http.Cache-Control = "max-age=0";
>>>> } else {
>>>> unset beresp.http.Cache-Control;
>>>> }
>>>>
>>>> if (beresp.status == 200 ||
>>>> beresp.status == 301 ||
>>>> beresp.status == 302 ||
>>>> beresp.status == 404) {
>>>> if (bereq.url ~ "\&ordenar=aleatorio$") {
>>>> set beresp.http.X-TTL = "1d";
>>>> set beresp.ttl = 1d;
>>>> } else {
>>>> set beresp.http.X-TTL = "1w";
>>>> set beresp.ttl = 1w;
>>>> }
>>>> }
>>>>
>>>> if (bereq.url !~ "\.(jpeg|jpg|png|gif|gz|tgz|bz2|tbz|mp3|ogg|swf|flv)$")
>>>> {
>>>> set beresp.do_gzip = true;
>>>> }
>>>> }
>>>>
>>>> sub vcl_pipe {
>>>> set bereq.http.connection = "close";
>>>> return (pipe);
>>>> }
>>>>
>>>> sub vcl_deliver {
>>>> unset resp.http.x-host;
>>>> unset resp.http.x-user-agent;
>>>> }
>>>>
>>>> sub vcl_backend_error {
>>>> if (beresp.status == 502 || beresp.status == 503 || beresp.status ==
>>>> 504) {
>>>> set beresp.status = 200;
>>>> set beresp.http.Content-Type = "text/html; charset=utf-8";
>>>> synthetic(std.fileread("/etc/varnish/pages/maintenance.html"));
>>>> return (deliver);
>>>> }
>>>> }
>>>>
>>>> sub vcl_hash {
>>>> if (req.http.User-Agent ~ "Google Page Speed") {
>>>> hash_data("Google Page Speed");
>>>> } elsif (req.http.User-Agent ~ "Googlebot") {
>>>> hash_data("Googlebot");
>>>> }
>>>> }
>>>>
>>>> sub vcl_deliver {
>>>> if (resp.status == 501) {
>>>> return (synth(resp.status));
>>>> }
>>>> if (obj.hits > 0) {
>>>> set resp.http.X-Cache = "hit";
>>>> } else {
>>>> set resp.http.X-Cache = "miss";
>>>> }
>>>> }
>>>>
>>>>
>>>> On Mon, Jun 26, 2017 at 3:47 PM, Guillaume Quintard <
>>>> guillaume@varnish-software.com> wrote:
>>>>
>>>>> Nice! It may have been the cause, time will tell.can you report back
>>>>> in a few days to let us know?
>>>>> --
>>>>> Guillaume Quintard
>>>>>
>>>>> On Jun 26, 2017 20:21, "Stefano Baldo" <stefanobaldo@gmail.com> wrote:
>>>>>
>>>>>> Hi Guillaume.
>>>>>>
>>>>>> I think things will start to going better now after changing the bans.
>>>>>> This is how my last varnishstat looked like moments before a crash
>>>>>> regarding the bans:
>>>>>>
>>>>>> MAIN.bans 41336 . Count of bans
>>>>>> MAIN.bans_completed 37967 . Number of bans
>>>>>> marked 'completed'
>>>>>> MAIN.bans_obj 0 . Number of bans using
>>>>>> obj.*
>>>>>> MAIN.bans_req 41335 . Number of bans using
>>>>>> req.*
>>>>>> MAIN.bans_added 41336 0.68 Bans added
>>>>>> MAIN.bans_deleted 0 0.00 Bans deleted
>>>>>>
>>>>>> And this is how it looks like now:
>>>>>>
>>>>>> MAIN.bans 2 . Count of bans
>>>>>> MAIN.bans_completed 1 . Number of bans
>>>>>> marked 'completed'
>>>>>> MAIN.bans_obj 2 . Number of bans using
>>>>>> obj.*
>>>>>> MAIN.bans_req 0 . Number of bans using
>>>>>> req.*
>>>>>> MAIN.bans_added 2016 0.69 Bans added
>>>>>> MAIN.bans_deleted 2014 0.69 Bans deleted
>>>>>>
>>>>>> Before the changes, bans were never deleted!
>>>>>> Now the bans are added and quickly deleted after a minute or even a
>>>>>> couple of seconds.
>>>>>>
>>>>>> May this was the cause of the problem? It seems like varnish was
>>>>>> having a large number of bans to manage and test against.
>>>>>> I will let it ride now. Let's see if the problem persists or it's
>>>>>> gone! :-)
>>>>>>
>>>>>> Best,
>>>>>> Stefano
>>>>>>
>>>>>>
>>>>>> On Mon, Jun 26, 2017 at 3:10 PM, Guillaume Quintard <
>>>>>> guillaume@varnish-software.com> wrote:
>>>>>>
>>>>>>> Looking good!
>>>>>>>
>>>>>>> --
>>>>>>> Guillaume Quintard
>>>>>>>
>>>>>>> On Mon, Jun 26, 2017 at 7:06 PM, Stefano Baldo <
>>>>>>> stefanobaldo@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi Guillaume,
>>>>>>>>
>>>>>>>> Can the following be considered "ban lurker friendly"?
>>>>>>>>
>>>>>>>> sub vcl_backend_response {
>>>>>>>> set beresp.http.x-url = bereq.http.host + bereq.url;
>>>>>>>> set beresp.http.x-user-agent = bereq.http.user-agent;
>>>>>>>> }
>>>>>>>>
>>>>>>>> sub vcl_recv {
>>>>>>>> if (req.method == "PURGE") {
>>>>>>>> ban("obj.http.x-url == " + req.http.host + req.url + " &&
>>>>>>>> obj.http.x-user-agent !~ Googlebot");
>>>>>>>> return(synth(750));
>>>>>>>> }
>>>>>>>> }
>>>>>>>>
>>>>>>>> sub vcl_deliver {
>>>>>>>> unset resp.http.x-url;
>>>>>>>> unset resp.http.x-user-agent;
>>>>>>>> }
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Stefano
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Jun 26, 2017 at 12:43 PM, Guillaume Quintard <
>>>>>>>> guillaume@varnish-software.com> wrote:
>>>>>>>>
>>>>>>>>> Not lurker friendly at all indeed. You'll need to avoid req.*
>>>>>>>>> expression. Easiest way is to stash the host, user-agent and url in
>>>>>>>>> beresp.http.* and ban against those (unset them in vcl_deliver).
>>>>>>>>>
>>>>>>>>> I don't think you need to expand the VSL at all.
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Guillaume Quintard
>>>>>>>>>
>>>>>>>>> On Jun 26, 2017 16:51, "Stefano Baldo" <stefanobaldo@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>> Hi Guillaume.
>>>>>>>>>
>>>>>>>>> Thanks for answering.
>>>>>>>>>
>>>>>>>>> I'm using a SSD disk. I've changed from ext4 to ext2 to increase
>>>>>>>>> performance but it stills restarting.
>>>>>>>>> Also, I checked the I/O performance for the disk and there is no
>>>>>>>>> signal of overhead.
>>>>>>>>>
>>>>>>>>> I've changed the /var/lib/varnish to a tmpfs and increased its 80m
>>>>>>>>> default size passing "-l 200m,20m" to varnishd and using
>>>>>>>>> "nodev,nosuid,noatime,size=256M 0 0" for the tmpfs mount. There
>>>>>>>>> was a problem here. After a couple of hours varnish died and I received a
>>>>>>>>> "no space left on device" message - deleting the /var/lib/varnish solved
>>>>>>>>> the problem and varnish was up again, but it's weird because there was free
>>>>>>>>> memory on the host to be used with the tmpfs directory, so I don't know
>>>>>>>>> what could have happened. I will try to stop increasing the
>>>>>>>>> /var/lib/varnish size.
>>>>>>>>>
>>>>>>>>> Anyway, I am worried about the bans. You asked me if the bans are
>>>>>>>>> lurker friedly. Well, I don't think so. My bans are created this way:
>>>>>>>>>
>>>>>>>>> ban("req.http.host == " + req.http.host + " && req.url ~ " +
>>>>>>>>> req.url + " && req.http.User-Agent !~ Googlebot");
>>>>>>>>>
>>>>>>>>> Are they lurker friendly? I was taking a quick look and the
>>>>>>>>> documentation and it looks like they're not.
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Stefano
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Jun 23, 2017 at 11:30 AM, Guillaume Quintard <
>>>>>>>>> guillaume@varnish-software.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Stefano,
>>>>>>>>>>
>>>>>>>>>> Let's cover the usual suspects: I/Os. I think here Varnish gets
>>>>>>>>>> stuck trying to push/pull data and can't make time to reply to the CLI. I'd
>>>>>>>>>> recommend monitoring the disk activity (bandwidth and iops) to confirm.
>>>>>>>>>>
>>>>>>>>>> After some time, the file storage is terrible on a hard drive
>>>>>>>>>> (SSDs take a bit more time to degrade) because of fragmentation. One
>>>>>>>>>> solution to help the disks cope is to overprovision themif they're SSDs,
>>>>>>>>>> and you can try different advices in the file storage definition in the
>>>>>>>>>> command line (last parameter, after granularity).
>>>>>>>>>>
>>>>>>>>>> Is your /var/lib/varnish mount on tmpfs? That could help too.
>>>>>>>>>>
>>>>>>>>>> 40K bans is a lot, are they ban-lurker friendly?
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Guillaume Quintard
>>>>>>>>>>
>>>>>>>>>> On Fri, Jun 23, 2017 at 4:01 PM, Stefano Baldo <
>>>>>>>>>> stefanobaldo@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hello.
>>>>>>>>>>>
>>>>>>>>>>> I am having a critical problem with Varnish Cache in production
>>>>>>>>>>> for over a month and any help will be appreciated.
>>>>>>>>>>> The problem is that Varnish child process is recurrently being
>>>>>>>>>>> restarted after 10~20h of use, with the following message:
>>>>>>>>>>>
>>>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824) not
>>>>>>>>>>> responding to CLI, killed it.
>>>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Unexpected reply
>>>>>>>>>>> from ping: 400 CLI communication error
>>>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824) died
>>>>>>>>>>> signal=9
>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child cleanup
>>>>>>>>>>> complete
>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038)
>>>>>>>>>>> Started
>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) said
>>>>>>>>>>> Child starts
>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038) said
>>>>>>>>>>> SMF.s0 mmap'ed 483183820800 bytes of 483183820800
>>>>>>>>>>>
>>>>>>>>>>> The following link is the varnishstat output just 1 minute
>>>>>>>>>>> before a restart:
>>>>>>>>>>>
>>>>>>>>>>> https://pastebin.com/g0g5RVTs
>>>>>>>>>>>
>>>>>>>>>>> Environment:
>>>>>>>>>>>
>>>>>>>>>>> varnish-5.1.2 revision 6ece695
>>>>>>>>>>> Debian 8.7 - Debian GNU/Linux 8 (3.16.0)
>>>>>>>>>>> Installed using pre-built package from official repo at
>>>>>>>>>>> packagecloud.io
>>>>>>>>>>> CPU 2x2.9 GHz
>>>>>>>>>>> Mem 3.69 GiB
>>>>>>>>>>> Running inside a Docker container
>>>>>>>>>>> NFILES=131072
>>>>>>>>>>> MEMLOCK=82000
>>>>>>>>>>>
>>>>>>>>>>> Additional info:
>>>>>>>>>>>
>>>>>>>>>>> - I need to cache a large number of objets and the cache should
>>>>>>>>>>> last for almost a week, so I have set up a 450G storage space, I don't know
>>>>>>>>>>> if this is a problem;
>>>>>>>>>>> - I use ban a lot. There was about 40k bans in the system just
>>>>>>>>>>> before the last crash. I really don't know if this is too much or may have
>>>>>>>>>>> anything to do with it;
>>>>>>>>>>> - No registered CPU spikes (almost always by 30%);
>>>>>>>>>>> - No panic is reported, the only info I can retrieve is from
>>>>>>>>>>> syslog;
>>>>>>>>>>> - During all the time, event moments before the crashes,
>>>>>>>>>>> everything is okay and requests are being responded very fast.
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Stefano Baldo
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> _______________________________________________
>>>>>>>>>>> varnish-misc mailing list
>>>>>>>>>>> varnish-misc@varnish-cache.org
>>>>>>>>>>> https://www.varnish-cache.org/lists/mailman/listinfo/varnish
>>>>>>>>>>> -misc
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>
>>>
>>
>
> _______________________________________________
> varnish-misc mailing list
> varnish-misc@varnish-cache.org
> https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc
>
Re: Child process recurrently being restarted [ In reply to ]
Transient is using memory. One other option I forgot to mention:

- Move transient allocations to file: -s Transient=file,/tmp,1G
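
For example, keeping a file main storage and adding a file-backed Transient
(a sketch; the paths and sizes below are placeholders, not taken from this
thread):

varnishd \
  -a :80 \
  -f /etc/varnish/default.vcl \
  -s file,/var/lib/varnish/storage.bin,450G \
  -s Transient=file,/tmp/transient.bin,1G

Short-lived and uncacheable objects then land in the Transient file instead
of malloc, trading RAM pressure for some extra disk I/O.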


--
Reza Naghibi
Varnish Software

On Wed, Jun 28, 2017 at 9:39 AM, Stefano Baldo <stefanobaldo@gmail.com>
wrote:

> Hi.
>
> root@2c6c325b279f:/# varnishstat -1 | grep g_bytes
> SMA.Transient.g_bytes 519022 . Bytes
> outstanding
> SMF.s0.g_bytes 23662845952 . Bytes
> outstanding
>
> You mean g_bytes from SMA.Transient? I have set no malloc storage.
>
>
> On Wed, Jun 28, 2017 at 10:26 AM, Guillaume Quintard <
> guillaume@varnish-software.com> wrote:
>
>> Hi,
>>
>> can you look that "varnishstat -1 | grep g_bytes" and see if if matches
>> the memory you are seeing?
>>
>> --
>> Guillaume Quintard
>>
>> On Wed, Jun 28, 2017 at 3:20 PM, Stefano Baldo <stefanobaldo@gmail.com>
>> wrote:
>>
>>> Hi Guillaume.
>>>
>>> I increased the cli_timeout yesterday to 900sec (15min) and it restarted
>>> anyway, which seems to indicate that the thread is really stalled.
>>>
>>> This was 1 minute after the last restart:
>>>
>>> MAIN.n_object 3908216 . object structs made
>>> SMF.s0.g_alloc 7794510 . Allocations outstanding
>>>
>>> I've just changed the I/O Scheduler to noop to see what happens.
>>>
>>> One interest thing I've found is about the memory usage.
>>>
>>> In the 1st minute of use:
>>> MemTotal: 3865572 kB
>>> MemFree: 120768 kB
>>> MemAvailable: 2300268 kB
>>>
>>> 1 minute before a restart:
>>> MemTotal: 3865572 kB
>>> MemFree: 82480 kB
>>> MemAvailable: 68316 kB
>>>
>>> It seems like the system is possibly running out of memory.
>>>
>>> When calling varnishd, I'm specifying only "-s file,..." as storage. I
>>> see in some examples that is common to use "-s file" AND "-s malloc"
>>> together. Should I be passing "-s malloc" as well to somehow try to limit
>>> the memory usage by varnishd?
>>>
>>> Best,
>>> Stefano
>>>
>>>
>>> On Wed, Jun 28, 2017 at 4:12 AM, Guillaume Quintard <
>>> guillaume@varnish-software.com> wrote:
>>>
>>>> Sadly, nothing suspicious here, you can still try:
>>>> - bumping the cli_timeout
>>>> - changing your disk scheduler
>>>> - changing the advice option of the file storage
>>>>
>>>> I'm still convinced this is due to Varnish getting stuck waiting for
>>>> the disk because of the file storage fragmentation.
>>>>
>>>> Maybe you could look at SMF.*.g_alloc and compare it to the number of
>>>> objects. Ideally, we would have a 1:1 relation between objects and
>>>> allocations. If that number drops prior to a restart, that would be a good
>>>> clue.
>>>>
>>>>
>>>> --
>>>> Guillaume Quintard
>>>>
>>>> On Tue, Jun 27, 2017 at 11:07 PM, Stefano Baldo <stefanobaldo@gmail.com
>>>> > wrote:
>>>>
>>>>> Hi Guillaume.
>>>>>
>>>>> It keeps restarting.
>>>>> Would you mind taking a quick look in the following VCL file to check
>>>>> if you find anything suspicious?
>>>>>
>>>>> Thank you very much.
>>>>>
>>>>> Best,
>>>>> Stefano
>>>>>
>>>>> vcl 4.0;
>>>>>
>>>>> import std;
>>>>>
>>>>> backend default {
>>>>> .host = "sites-web-server-lb";
>>>>> .port = "80";
>>>>> }
>>>>>
>>>>> include "/etc/varnish/bad_bot_detection.vcl";
>>>>>
>>>>> sub vcl_recv {
>>>>> call bad_bot_detection;
>>>>>
>>>>> if (req.url == "/nocache" || req.url == "/version") {
>>>>> return(pass);
>>>>> }
>>>>>
>>>>> unset req.http.Cookie;
>>>>> if (req.method == "PURGE") {
>>>>> ban("obj.http.x-host == " + req.http.host + " &&
>>>>> obj.http.x-user-agent !~ Googlebot");
>>>>> return(synth(750));
>>>>> }
>>>>>
>>>>> set req.url = regsuball(req.url, "(?<!(http:|https))\/+", "/");
>>>>> }
>>>>>
>>>>> sub vcl_synth {
>>>>> if (resp.status == 750) {
>>>>> set resp.status = 200;
>>>>> synthetic("PURGED => " + req.url);
>>>>> return(deliver);
>>>>> } elsif (resp.status == 501) {
>>>>> set resp.status = 200;
>>>>> set resp.http.Content-Type = "text/html; charset=utf-8";
>>>>> synthetic(std.fileread("/etc/varnish/pages/invalid_domain.html"));
>>>>> return(deliver);
>>>>> }
>>>>> }
>>>>>
>>>>> sub vcl_backend_response {
>>>>> unset beresp.http.Set-Cookie;
>>>>> set beresp.http.x-host = bereq.http.host;
>>>>> set beresp.http.x-user-agent = bereq.http.user-agent;
>>>>>
>>>>> if (bereq.url == "/themes/basic/assets/theme.min.css"
>>>>> || bereq.url == "/api/events/PAGEVIEW"
>>>>> || bereq.url ~ "^\/assets\/img\/") {
>>>>> set beresp.http.Cache-Control = "max-age=0";
>>>>> } else {
>>>>> unset beresp.http.Cache-Control;
>>>>> }
>>>>>
>>>>> if (beresp.status == 200 ||
>>>>> beresp.status == 301 ||
>>>>> beresp.status == 302 ||
>>>>> beresp.status == 404) {
>>>>> if (bereq.url ~ "\&ordenar=aleatorio$") {
>>>>> set beresp.http.X-TTL = "1d";
>>>>> set beresp.ttl = 1d;
>>>>> } else {
>>>>> set beresp.http.X-TTL = "1w";
>>>>> set beresp.ttl = 1w;
>>>>> }
>>>>> }
>>>>>
>>>>> if (bereq.url !~ "\.(jpeg|jpg|png|gif|gz|tgz|bz2|tbz|mp3|ogg|swf|flv)$")
>>>>> {
>>>>> set beresp.do_gzip = true;
>>>>> }
>>>>> }
>>>>>
>>>>> sub vcl_pipe {
>>>>> set bereq.http.connection = "close";
>>>>> return (pipe);
>>>>> }
>>>>>
>>>>> sub vcl_deliver {
>>>>> unset resp.http.x-host;
>>>>> unset resp.http.x-user-agent;
>>>>> }
>>>>>
>>>>> sub vcl_backend_error {
>>>>> if (beresp.status == 502 || beresp.status == 503 || beresp.status ==
>>>>> 504) {
>>>>> set beresp.status = 200;
>>>>> set beresp.http.Content-Type = "text/html; charset=utf-8";
>>>>> synthetic(std.fileread("/etc/varnish/pages/maintenance.html"));
>>>>> return (deliver);
>>>>> }
>>>>> }
>>>>>
>>>>> sub vcl_hash {
>>>>> if (req.http.User-Agent ~ "Google Page Speed") {
>>>>> hash_data("Google Page Speed");
>>>>> } elsif (req.http.User-Agent ~ "Googlebot") {
>>>>> hash_data("Googlebot");
>>>>> }
>>>>> }
>>>>>
>>>>> sub vcl_deliver {
>>>>> if (resp.status == 501) {
>>>>> return (synth(resp.status));
>>>>> }
>>>>> if (obj.hits > 0) {
>>>>> set resp.http.X-Cache = "hit";
>>>>> } else {
>>>>> set resp.http.X-Cache = "miss";
>>>>> }
>>>>> }
>>>>>
>>>>>
>>>>> On Mon, Jun 26, 2017 at 3:47 PM, Guillaume Quintard <
>>>>> guillaume@varnish-software.com> wrote:
>>>>>
>>>>>> Nice! It may have been the cause, time will tell.can you report back
>>>>>> in a few days to let us know?
>>>>>> --
>>>>>> Guillaume Quintard
>>>>>>
>>>>>> On Jun 26, 2017 20:21, "Stefano Baldo" <stefanobaldo@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Guillaume.
>>>>>>>
>>>>>>> I think things will start to going better now after changing the
>>>>>>> bans.
>>>>>>> This is how my last varnishstat looked like moments before a crash
>>>>>>> regarding the bans:
>>>>>>>
>>>>>>> MAIN.bans 41336 . Count of bans
>>>>>>> MAIN.bans_completed 37967 . Number of bans
>>>>>>> marked 'completed'
>>>>>>> MAIN.bans_obj 0 . Number of bans
>>>>>>> using obj.*
>>>>>>> MAIN.bans_req 41335 . Number of bans
>>>>>>> using req.*
>>>>>>> MAIN.bans_added 41336 0.68 Bans added
>>>>>>> MAIN.bans_deleted 0 0.00 Bans deleted
>>>>>>>
>>>>>>> And this is how it looks like now:
>>>>>>>
>>>>>>> MAIN.bans 2 . Count of bans
>>>>>>> MAIN.bans_completed 1 . Number of bans
>>>>>>> marked 'completed'
>>>>>>> MAIN.bans_obj 2 . Number of bans
>>>>>>> using obj.*
>>>>>>> MAIN.bans_req 0 . Number of bans
>>>>>>> using req.*
>>>>>>> MAIN.bans_added 2016 0.69 Bans added
>>>>>>> MAIN.bans_deleted 2014 0.69 Bans deleted
>>>>>>>
>>>>>>> Before the changes, bans were never deleted!
>>>>>>> Now the bans are added and quickly deleted after a minute or even a
>>>>>>> couple of seconds.
>>>>>>>
>>>>>>> May this was the cause of the problem? It seems like varnish was
>>>>>>> having a large number of bans to manage and test against.
>>>>>>> I will let it ride now. Let's see if the problem persists or it's
>>>>>>> gone! :-)
>>>>>>>
>>>>>>> Best,
>>>>>>> Stefano
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Jun 26, 2017 at 3:10 PM, Guillaume Quintard <
>>>>>>> guillaume@varnish-software.com> wrote:
>>>>>>>
>>>>>>>> Looking good!
>>>>>>>>
>>>>>>>> --
>>>>>>>> Guillaume Quintard
>>>>>>>>
>>>>>>>> On Mon, Jun 26, 2017 at 7:06 PM, Stefano Baldo <
>>>>>>>> stefanobaldo@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Guillaume,
>>>>>>>>>
>>>>>>>>> Can the following be considered "ban lurker friendly"?
>>>>>>>>>
>>>>>>>>> sub vcl_backend_response {
>>>>>>>>> set beresp.http.x-url = bereq.http.host + bereq.url;
>>>>>>>>> set beresp.http.x-user-agent = bereq.http.user-agent;
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> sub vcl_recv {
>>>>>>>>> if (req.method == "PURGE") {
>>>>>>>>> ban("obj.http.x-url == " + req.http.host + req.url + " &&
>>>>>>>>> obj.http.x-user-agent !~ Googlebot");
>>>>>>>>> return(synth(750));
>>>>>>>>> }
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> sub vcl_deliver {
>>>>>>>>> unset resp.http.x-url;
>>>>>>>>> unset resp.http.x-user-agent;
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Stefano
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Jun 26, 2017 at 12:43 PM, Guillaume Quintard <
>>>>>>>>> guillaume@varnish-software.com> wrote:
>>>>>>>>>
>>>>>>>>>> Not lurker friendly at all indeed. You'll need to avoid req.*
>>>>>>>>>> expression. Easiest way is to stash the host, user-agent and url in
>>>>>>>>>> beresp.http.* and ban against those (unset them in vcl_deliver).
>>>>>>>>>>
>>>>>>>>>> I don't think you need to expand the VSL at all.
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Guillaume Quintard
>>>>>>>>>>
>>>>>>>>>> On Jun 26, 2017 16:51, "Stefano Baldo" <stefanobaldo@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi Guillaume.
>>>>>>>>>>
>>>>>>>>>> Thanks for answering.
>>>>>>>>>>
>>>>>>>>>> I'm using a SSD disk. I've changed from ext4 to ext2 to increase
>>>>>>>>>> performance but it stills restarting.
>>>>>>>>>> Also, I checked the I/O performance for the disk and there is no
>>>>>>>>>> signal of overhead.
>>>>>>>>>>
>>>>>>>>>> I've changed the /var/lib/varnish to a tmpfs and increased its
>>>>>>>>>> 80m default size passing "-l 200m,20m" to varnishd and using
>>>>>>>>>> "nodev,nosuid,noatime,size=256M 0 0" for the tmpfs mount. There
>>>>>>>>>> was a problem here. After a couple of hours varnish died and I received a
>>>>>>>>>> "no space left on device" message - deleting the /var/lib/varnish solved
>>>>>>>>>> the problem and varnish was up again, but it's weird because there was free
>>>>>>>>>> memory on the host to be used with the tmpfs directory, so I don't know
>>>>>>>>>> what could have happened. I will try to stop increasing the
>>>>>>>>>> /var/lib/varnish size.
>>>>>>>>>>
>>>>>>>>>> Anyway, I am worried about the bans. You asked me if the bans are
>>>>>>>>>> lurker friedly. Well, I don't think so. My bans are created this way:
>>>>>>>>>>
>>>>>>>>>> ban("req.http.host == " + req.http.host + " && req.url ~ " +
>>>>>>>>>> req.url + " && req.http.User-Agent !~ Googlebot");
>>>>>>>>>>
>>>>>>>>>> Are they lurker friendly? I was taking a quick look and the
>>>>>>>>>> documentation and it looks like they're not.
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Stefano
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, Jun 23, 2017 at 11:30 AM, Guillaume Quintard <
>>>>>>>>>> guillaume@varnish-software.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Stefano,
>>>>>>>>>>>
>>>>>>>>>>> Let's cover the usual suspects: I/Os. I think here Varnish gets
>>>>>>>>>>> stuck trying to push/pull data and can't make time to reply to the CLI. I'd
>>>>>>>>>>> recommend monitoring the disk activity (bandwidth and iops) to confirm.
>>>>>>>>>>>
>>>>>>>>>>> After some time, the file storage is terrible on a hard drive
>>>>>>>>>>> (SSDs take a bit more time to degrade) because of fragmentation. One
>>>>>>>>>>> solution to help the disks cope is to overprovision themif they're SSDs,
>>>>>>>>>>> and you can try different advices in the file storage definition in the
>>>>>>>>>>> command line (last parameter, after granularity).
>>>>>>>>>>>
>>>>>>>>>>> Is your /var/lib/varnish mount on tmpfs? That could help too.
>>>>>>>>>>>
>>>>>>>>>>> 40K bans is a lot, are they ban-lurker friendly?
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Guillaume Quintard
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Jun 23, 2017 at 4:01 PM, Stefano Baldo <
>>>>>>>>>>> stefanobaldo@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hello.
>>>>>>>>>>>>
>>>>>>>>>>>> I am having a critical problem with Varnish Cache in production
>>>>>>>>>>>> for over a month and any help will be appreciated.
>>>>>>>>>>>> The problem is that Varnish child process is recurrently being
>>>>>>>>>>>> restarted after 10~20h of use, with the following message:
>>>>>>>>>>>>
>>>>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824) not
>>>>>>>>>>>> responding to CLI, killed it.
>>>>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Unexpected reply
>>>>>>>>>>>> from ping: 400 CLI communication error
>>>>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824)
>>>>>>>>>>>> died signal=9
>>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child cleanup
>>>>>>>>>>>> complete
>>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038)
>>>>>>>>>>>> Started
>>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038)
>>>>>>>>>>>> said Child starts
>>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038)
>>>>>>>>>>>> said SMF.s0 mmap'ed 483183820800 bytes of 483183820800
>>>>>>>>>>>>
>>>>>>>>>>>> The following link is the varnishstat output just 1 minute
>>>>>>>>>>>> before a restart:
>>>>>>>>>>>>
>>>>>>>>>>>> https://pastebin.com/g0g5RVTs
>>>>>>>>>>>>
>>>>>>>>>>>> Environment:
>>>>>>>>>>>>
>>>>>>>>>>>> varnish-5.1.2 revision 6ece695
>>>>>>>>>>>> Debian 8.7 - Debian GNU/Linux 8 (3.16.0)
>>>>>>>>>>>> Installed using pre-built package from official repo at
>>>>>>>>>>>> packagecloud.io
>>>>>>>>>>>> CPU 2x2.9 GHz
>>>>>>>>>>>> Mem 3.69 GiB
>>>>>>>>>>>> Running inside a Docker container
>>>>>>>>>>>> NFILES=131072
>>>>>>>>>>>> MEMLOCK=82000
>>>>>>>>>>>>
>>>>>>>>>>>> Additional info:
>>>>>>>>>>>>
>>>>>>>>>>>> - I need to cache a large number of objets and the cache should
>>>>>>>>>>>> last for almost a week, so I have set up a 450G storage space, I don't know
>>>>>>>>>>>> if this is a problem;
>>>>>>>>>>>> - I use ban a lot. There was about 40k bans in the system just
>>>>>>>>>>>> before the last crash. I really don't know if this is too much or may have
>>>>>>>>>>>> anything to do with it;
>>>>>>>>>>>> - No registered CPU spikes (almost always by 30%);
>>>>>>>>>>>> - No panic is reported, the only info I can retrieve is from
>>>>>>>>>>>> syslog;
>>>>>>>>>>>> - During all the time, event moments before the crashes,
>>>>>>>>>>>> everything is okay and requests are being responded very fast.
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Stefano Baldo
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> varnish-misc mailing list
>>>>>>>>>>>> varnish-misc@varnish-cache.org
>>>>>>>>>>>> https://www.varnish-cache.org/lists/mailman/listinfo/varnish
>>>>>>>>>>>> -misc
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>
>>>>
>>>
>>
>
> _______________________________________________
> varnish-misc mailing list
> varnish-misc@varnish-cache.org
> https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc
>
Re: Child process recurrently being restarted [ In reply to ]
That means it's unlimited. Those numbers are from the Varnish perspective,
so they don't account for how jemalloc actually manages those allocations.
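
A quick way to compare what Varnish accounts for with what the process
actually holds (a sketch, assuming standard procps tools are available in
the container):

# bytes Varnish itself tracks as outstanding
varnishstat -1 | grep -E 'g_bytes|g_space'

# resident memory the kernel charges to the varnishd processes
ps -o pid,rss,vsz,comm -C varnishd

If RSS keeps growing while g_bytes stays flat, the growth is in the
allocator, workspaces and thread overhead rather than in cached objects.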

--
Reza Naghibi
Varnish Software

On Wed, Jun 28, 2017 at 9:47 AM, Stefano Baldo <stefanobaldo@gmail.com>
wrote:

> SMA.Transient.g_alloc 3518 . Allocations
> outstanding
> SMA.Transient.g_bytes 546390 . Bytes
> outstanding
> SMA.Transient.g_space 0 . Bytes
> available
>
> g_space is always 0. It could mean anything?
>
> On Wed, Jun 28, 2017 at 10:43 AM, Guillaume Quintard <
> guillaume@varnish-software.com> wrote:
>
>> Yeah, I was wondering about Transient, but it seems under control.
>>
>> Apart from moving away from file storage, I have nothing at the moment :-/
>>
>> --
>> Guillaume Quintard
>>
>> On Wed, Jun 28, 2017 at 3:39 PM, Stefano Baldo <stefanobaldo@gmail.com>
>> wrote:
>>
>>> Hi.
>>>
>>> root@2c6c325b279f:/# varnishstat -1 | grep g_bytes
>>> SMA.Transient.g_bytes 519022 . Bytes
>>> outstanding
>>> SMF.s0.g_bytes 23662845952 . Bytes
>>> outstanding
>>>
>>> You mean g_bytes from SMA.Transient? I have set no malloc storage.
>>>
>>>
>>> On Wed, Jun 28, 2017 at 10:26 AM, Guillaume Quintard <
>>> guillaume@varnish-software.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> can you look that "varnishstat -1 | grep g_bytes" and see if if matches
>>>> the memory you are seeing?
>>>>
>>>> --
>>>> Guillaume Quintard
>>>>
>>>> On Wed, Jun 28, 2017 at 3:20 PM, Stefano Baldo <stefanobaldo@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Guillaume.
>>>>>
>>>>> I increased the cli_timeout yesterday to 900sec (15min) and it
>>>>> restarted anyway, which seems to indicate that the thread is really stalled.
>>>>>
>>>>> This was 1 minute after the last restart:
>>>>>
>>>>> MAIN.n_object 3908216 . object structs made
>>>>> SMF.s0.g_alloc 7794510 . Allocations
>>>>> outstanding
>>>>>
>>>>> I've just changed the I/O Scheduler to noop to see what happens.
>>>>>
>>>>> One interest thing I've found is about the memory usage.
>>>>>
>>>>> In the 1st minute of use:
>>>>> MemTotal: 3865572 kB
>>>>> MemFree: 120768 kB
>>>>> MemAvailable: 2300268 kB
>>>>>
>>>>> 1 minute before a restart:
>>>>> MemTotal: 3865572 kB
>>>>> MemFree: 82480 kB
>>>>> MemAvailable: 68316 kB
>>>>>
>>>>> It seems like the system is possibly running out of memory.
>>>>>
>>>>> When calling varnishd, I'm specifying only "-s file,..." as storage. I
>>>>> see in some examples that is common to use "-s file" AND "-s malloc"
>>>>> together. Should I be passing "-s malloc" as well to somehow try to limit
>>>>> the memory usage by varnishd?
>>>>>
>>>>> Best,
>>>>> Stefano
>>>>>
>>>>>
>>>>> On Wed, Jun 28, 2017 at 4:12 AM, Guillaume Quintard <
>>>>> guillaume@varnish-software.com> wrote:
>>>>>
>>>>>> Sadly, nothing suspicious here, you can still try:
>>>>>> - bumping the cli_timeout
>>>>>> - changing your disk scheduler
>>>>>> - changing the advice option of the file storage
>>>>>>
>>>>>> I'm still convinced this is due to Varnish getting stuck waiting for
>>>>>> the disk because of the file storage fragmentation.
>>>>>>
>>>>>> Maybe you could look at SMF.*.g_alloc and compare it to the number of
>>>>>> objects. Ideally, we would have a 1:1 relation between objects and
>>>>>> allocations. If that number drops prior to a restart, that would be a good
>>>>>> clue.
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Guillaume Quintard
>>>>>>
>>>>>> On Tue, Jun 27, 2017 at 11:07 PM, Stefano Baldo <
>>>>>> stefanobaldo@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Guillaume.
>>>>>>>
>>>>>>> It keeps restarting.
>>>>>>> Would you mind taking a quick look in the following VCL file to
>>>>>>> check if you find anything suspicious?
>>>>>>>
>>>>>>> Thank you very much.
>>>>>>>
>>>>>>> Best,
>>>>>>> Stefano
>>>>>>>
>>>>>>> vcl 4.0;
>>>>>>>
>>>>>>> import std;
>>>>>>>
>>>>>>> backend default {
>>>>>>> .host = "sites-web-server-lb";
>>>>>>> .port = "80";
>>>>>>> }
>>>>>>>
>>>>>>> include "/etc/varnish/bad_bot_detection.vcl";
>>>>>>>
>>>>>>> sub vcl_recv {
>>>>>>> call bad_bot_detection;
>>>>>>>
>>>>>>> if (req.url == "/nocache" || req.url == "/version") {
>>>>>>> return(pass);
>>>>>>> }
>>>>>>>
>>>>>>> unset req.http.Cookie;
>>>>>>> if (req.method == "PURGE") {
>>>>>>> ban("obj.http.x-host == " + req.http.host + " &&
>>>>>>> obj.http.x-user-agent !~ Googlebot");
>>>>>>> return(synth(750));
>>>>>>> }
>>>>>>>
>>>>>>> set req.url = regsuball(req.url, "(?<!(http:|https))\/+", "/");
>>>>>>> }
>>>>>>>
>>>>>>> sub vcl_synth {
>>>>>>> if (resp.status == 750) {
>>>>>>> set resp.status = 200;
>>>>>>> synthetic("PURGED => " + req.url);
>>>>>>> return(deliver);
>>>>>>> } elsif (resp.status == 501) {
>>>>>>> set resp.status = 200;
>>>>>>> set resp.http.Content-Type = "text/html; charset=utf-8";
>>>>>>> synthetic(std.fileread("/etc/varnish/pages/invalid_domain.ht
>>>>>>> ml"));
>>>>>>> return(deliver);
>>>>>>> }
>>>>>>> }
>>>>>>>
>>>>>>> sub vcl_backend_response {
>>>>>>> unset beresp.http.Set-Cookie;
>>>>>>> set beresp.http.x-host = bereq.http.host;
>>>>>>> set beresp.http.x-user-agent = bereq.http.user-agent;
>>>>>>>
>>>>>>> if (bereq.url == "/themes/basic/assets/theme.min.css"
>>>>>>> || bereq.url == "/api/events/PAGEVIEW"
>>>>>>> || bereq.url ~ "^\/assets\/img\/") {
>>>>>>> set beresp.http.Cache-Control = "max-age=0";
>>>>>>> } else {
>>>>>>> unset beresp.http.Cache-Control;
>>>>>>> }
>>>>>>>
>>>>>>> if (beresp.status == 200 ||
>>>>>>> beresp.status == 301 ||
>>>>>>> beresp.status == 302 ||
>>>>>>> beresp.status == 404) {
>>>>>>> if (bereq.url ~ "\&ordenar=aleatorio$") {
>>>>>>> set beresp.http.X-TTL = "1d";
>>>>>>> set beresp.ttl = 1d;
>>>>>>> } else {
>>>>>>> set beresp.http.X-TTL = "1w";
>>>>>>> set beresp.ttl = 1w;
>>>>>>> }
>>>>>>> }
>>>>>>>
>>>>>>> if (bereq.url !~ "\.(jpeg|jpg|png|gif|gz|tgz|bz2|tbz|mp3|ogg|swf|flv)$")
>>>>>>> {
>>>>>>> set beresp.do_gzip = true;
>>>>>>> }
>>>>>>> }
>>>>>>>
>>>>>>> sub vcl_pipe {
>>>>>>> set bereq.http.connection = "close";
>>>>>>> return (pipe);
>>>>>>> }
>>>>>>>
>>>>>>> sub vcl_deliver {
>>>>>>> unset resp.http.x-host;
>>>>>>> unset resp.http.x-user-agent;
>>>>>>> }
>>>>>>>
>>>>>>> sub vcl_backend_error {
>>>>>>> if (beresp.status == 502 || beresp.status == 503 || beresp.status
>>>>>>> == 504) {
>>>>>>> set beresp.status = 200;
>>>>>>> set beresp.http.Content-Type = "text/html; charset=utf-8";
>>>>>>> synthetic(std.fileread("/etc/varnish/pages/maintenance.html"));
>>>>>>> return (deliver);
>>>>>>> }
>>>>>>> }
>>>>>>>
>>>>>>> sub vcl_hash {
>>>>>>> if (req.http.User-Agent ~ "Google Page Speed") {
>>>>>>> hash_data("Google Page Speed");
>>>>>>> } elsif (req.http.User-Agent ~ "Googlebot") {
>>>>>>> hash_data("Googlebot");
>>>>>>> }
>>>>>>> }
>>>>>>>
>>>>>>> sub vcl_deliver {
>>>>>>> if (resp.status == 501) {
>>>>>>> return (synth(resp.status));
>>>>>>> }
>>>>>>> if (obj.hits > 0) {
>>>>>>> set resp.http.X-Cache = "hit";
>>>>>>> } else {
>>>>>>> set resp.http.X-Cache = "miss";
>>>>>>> }
>>>>>>> }
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Jun 26, 2017 at 3:47 PM, Guillaume Quintard <
>>>>>>> guillaume@varnish-software.com> wrote:
>>>>>>>
>>>>>>>> Nice! It may have been the cause, time will tell.can you report
>>>>>>>> back in a few days to let us know?
>>>>>>>> --
>>>>>>>> Guillaume Quintard
>>>>>>>>
>>>>>>>> On Jun 26, 2017 20:21, "Stefano Baldo" <stefanobaldo@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi Guillaume.
>>>>>>>>>
>>>>>>>>> I think things will start to going better now after changing the
>>>>>>>>> bans.
>>>>>>>>> This is how my last varnishstat looked like moments before a crash
>>>>>>>>> regarding the bans:
>>>>>>>>>
>>>>>>>>> MAIN.bans 41336 . Count of bans
>>>>>>>>> MAIN.bans_completed 37967 . Number of bans
>>>>>>>>> marked 'completed'
>>>>>>>>> MAIN.bans_obj 0 . Number of bans
>>>>>>>>> using obj.*
>>>>>>>>> MAIN.bans_req 41335 . Number of bans
>>>>>>>>> using req.*
>>>>>>>>> MAIN.bans_added 41336 0.68 Bans added
>>>>>>>>> MAIN.bans_deleted 0 0.00 Bans deleted
>>>>>>>>>
>>>>>>>>> And this is how it looks like now:
>>>>>>>>>
>>>>>>>>> MAIN.bans 2 . Count of bans
>>>>>>>>> MAIN.bans_completed 1 . Number of bans
>>>>>>>>> marked 'completed'
>>>>>>>>> MAIN.bans_obj 2 . Number of bans
>>>>>>>>> using obj.*
>>>>>>>>> MAIN.bans_req 0 . Number of bans
>>>>>>>>> using req.*
>>>>>>>>> MAIN.bans_added 2016 0.69 Bans added
>>>>>>>>> MAIN.bans_deleted 2014 0.69 Bans deleted
>>>>>>>>>
>>>>>>>>> Before the changes, bans were never deleted!
>>>>>>>>> Now the bans are added and quickly deleted after a minute or even
>>>>>>>>> a couple of seconds.
>>>>>>>>>
>>>>>>>>> May this was the cause of the problem? It seems like varnish was
>>>>>>>>> having a large number of bans to manage and test against.
>>>>>>>>> I will let it ride now. Let's see if the problem persists or it's
>>>>>>>>> gone! :-)
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Stefano
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Jun 26, 2017 at 3:10 PM, Guillaume Quintard <
>>>>>>>>> guillaume@varnish-software.com> wrote:
>>>>>>>>>
>>>>>>>>>> Looking good!
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Guillaume Quintard
>>>>>>>>>>
>>>>>>>>>> On Mon, Jun 26, 2017 at 7:06 PM, Stefano Baldo <
>>>>>>>>>> stefanobaldo@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Guillaume,
>>>>>>>>>>>
>>>>>>>>>>> Can the following be considered "ban lurker friendly"?
>>>>>>>>>>>
>>>>>>>>>>> sub vcl_backend_response {
>>>>>>>>>>> set beresp.http.x-url = bereq.http.host + bereq.url;
>>>>>>>>>>> set beresp.http.x-user-agent = bereq.http.user-agent;
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> sub vcl_recv {
>>>>>>>>>>> if (req.method == "PURGE") {
>>>>>>>>>>> ban("obj.http.x-url == " + req.http.host + req.url + " &&
>>>>>>>>>>> obj.http.x-user-agent !~ Googlebot");
>>>>>>>>>>> return(synth(750));
>>>>>>>>>>> }
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> sub vcl_deliver {
>>>>>>>>>>> unset resp.http.x-url;
>>>>>>>>>>> unset resp.http.x-user-agent;
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Stefano
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Jun 26, 2017 at 12:43 PM, Guillaume Quintard <
>>>>>>>>>>> guillaume@varnish-software.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Not lurker friendly at all indeed. You'll need to avoid req.*
>>>>>>>>>>>> expression. Easiest way is to stash the host, user-agent and url in
>>>>>>>>>>>> beresp.http.* and ban against those (unset them in vcl_deliver).
>>>>>>>>>>>>
>>>>>>>>>>>> I don't think you need to expand the VSL at all.
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Guillaume Quintard
>>>>>>>>>>>>
>>>>>>>>>>>> On Jun 26, 2017 16:51, "Stefano Baldo" <stefanobaldo@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Guillaume.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for answering.
>>>>>>>>>>>>
>>>>>>>>>>>> I'm using a SSD disk. I've changed from ext4 to ext2 to
>>>>>>>>>>>> increase performance but it stills restarting.
>>>>>>>>>>>> Also, I checked the I/O performance for the disk and there is
>>>>>>>>>>>> no signal of overhead.
>>>>>>>>>>>>
>>>>>>>>>>>> I've changed the /var/lib/varnish to a tmpfs and increased its
>>>>>>>>>>>> 80m default size passing "-l 200m,20m" to varnishd and using
>>>>>>>>>>>> "nodev,nosuid,noatime,size=256M 0 0" for the tmpfs mount.
>>>>>>>>>>>> There was a problem here. After a couple of hours varnish died and I
>>>>>>>>>>>> received a "no space left on device" message - deleting the
>>>>>>>>>>>> /var/lib/varnish solved the problem and varnish was up again, but it's
>>>>>>>>>>>> weird because there was free memory on the host to be used with the tmpfs
>>>>>>>>>>>> directory, so I don't know what could have happened. I will try to stop
>>>>>>>>>>>> increasing the /var/lib/varnish size.
>>>>>>>>>>>>
>>>>>>>>>>>> Anyway, I am worried about the bans. You asked me if the bans
>>>>>>>>>>>> are lurker friedly. Well, I don't think so. My bans are created this way:
>>>>>>>>>>>>
>>>>>>>>>>>> ban("req.http.host == " + req.http.host + " && req.url ~ " +
>>>>>>>>>>>> req.url + " && req.http.User-Agent !~ Googlebot");
>>>>>>>>>>>>
>>>>>>>>>>>> Are they lurker friendly? I was taking a quick look and the
>>>>>>>>>>>> documentation and it looks like they're not.
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Stefano
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Jun 23, 2017 at 11:30 AM, Guillaume Quintard <
>>>>>>>>>>>> guillaume@varnish-software.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Stefano,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Let's cover the usual suspects: I/Os. I think here Varnish
>>>>>>>>>>>>> gets stuck trying to push/pull data and can't make time to reply to the
>>>>>>>>>>>>> CLI. I'd recommend monitoring the disk activity (bandwidth and iops) to
>>>>>>>>>>>>> confirm.
>>>>>>>>>>>>>
>>>>>>>>>>>>> After some time, the file storage is terrible on a hard drive
>>>>>>>>>>>>> (SSDs take a bit more time to degrade) because of fragmentation. One
>>>>>>>>>>>>> solution to help the disks cope is to overprovision themif they're SSDs,
>>>>>>>>>>>>> and you can try different advices in the file storage definition in the
>>>>>>>>>>>>> command line (last parameter, after granularity).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Is your /var/lib/varnish mount on tmpfs? That could help too.
>>>>>>>>>>>>>
>>>>>>>>>>>>> 40K bans is a lot, are they ban-lurker friendly?
>>>>>>>>>>>>>
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Guillaume Quintard
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Fri, Jun 23, 2017 at 4:01 PM, Stefano Baldo <
>>>>>>>>>>>>> stefanobaldo@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hello.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I am having a critical problem with Varnish Cache in
>>>>>>>>>>>>>> production for over a month and any help will be appreciated.
>>>>>>>>>>>>>> The problem is that Varnish child process is recurrently
>>>>>>>>>>>>>> being restarted after 10~20h of use, with the following message:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824)
>>>>>>>>>>>>>> not responding to CLI, killed it.
>>>>>>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Unexpected
>>>>>>>>>>>>>> reply from ping: 400 CLI communication error
>>>>>>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824)
>>>>>>>>>>>>>> died signal=9
>>>>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child cleanup
>>>>>>>>>>>>>> complete
>>>>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038)
>>>>>>>>>>>>>> Started
>>>>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038)
>>>>>>>>>>>>>> said Child starts
>>>>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038)
>>>>>>>>>>>>>> said SMF.s0 mmap'ed 483183820800 bytes of 483183820800
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The following link is the varnishstat output just 1 minute
>>>>>>>>>>>>>> before a restart:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> https://pastebin.com/g0g5RVTs
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Environment:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> varnish-5.1.2 revision 6ece695
>>>>>>>>>>>>>> Debian 8.7 - Debian GNU/Linux 8 (3.16.0)
>>>>>>>>>>>>>> Installed using pre-built package from official repo at
>>>>>>>>>>>>>> packagecloud.io
>>>>>>>>>>>>>> CPU 2x2.9 GHz
>>>>>>>>>>>>>> Mem 3.69 GiB
>>>>>>>>>>>>>> Running inside a Docker container
>>>>>>>>>>>>>> NFILES=131072
>>>>>>>>>>>>>> MEMLOCK=82000
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Additional info:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> - I need to cache a large number of objets and the cache
>>>>>>>>>>>>>> should last for almost a week, so I have set up a 450G storage space, I
>>>>>>>>>>>>>> don't know if this is a problem;
>>>>>>>>>>>>>> - I use ban a lot. There was about 40k bans in the system
>>>>>>>>>>>>>> just before the last crash. I really don't know if this is too much or may
>>>>>>>>>>>>>> have anything to do with it;
>>>>>>>>>>>>>> - No registered CPU spikes (almost always by 30%);
>>>>>>>>>>>>>> - No panic is reported, the only info I can retrieve is from
>>>>>>>>>>>>>> syslog;
>>>>>>>>>>>>>> - During all the time, event moments before the crashes,
>>>>>>>>>>>>>> everything is okay and requests are being responded very fast.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Best,
>>>>>>>>>>>>>> Stefano Baldo
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>>> varnish-misc mailing list
>>>>>>>>>>>>>> varnish-misc@varnish-cache.org
>>>>>>>>>>>>>> https://www.varnish-cache.org/lists/mailman/listinfo/varnish
>>>>>>>>>>>>>> -misc
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
> _______________________________________________
> varnish-misc mailing list
> varnish-misc@varnish-cache.org
> https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc
>
Re: Child process recurrently being restarted [ In reply to ]
Hi Guillaume and Reza.

This time Varnish restarted but it left some more info in syslog.
It seems like the system is running out of memory.
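
Before the syslog excerpt, a crude way to watch the headroom shrink over
time (a sketch; the log path and interval are arbitrary):

while true; do
  date
  grep -E 'MemFree|MemAvailable' /proc/meminfo
  ps -o rss= -C varnishd
  sleep 60
done >> /var/log/varnish-mem.log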

Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.297487] pool_herder invoked
oom-killer: gfp_mask=0x2000d0, order=2, oom_score_adj=0
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.300992] pool_herder cpuset=/
mems_allowed=0
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.303157] CPU: 1 PID: 16214
Comm: pool_herder Tainted: G C O 3.16.0-4-amd64 #1 Debian
3.16.36-1+deb8u2
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] Hardware name: Xen HVM
domU, BIOS 4.2.amazon 02/16/2017
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] 0000000000000000
ffffffff815123b5 ffff8800eb3652f0 0000000000000000
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] ffffffff8150ff8d
0000000000000000 ffffffff810d6e3f 0000000000000000
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] ffffffff81516d2e
0000000000000200 ffffffff810689d3 ffffffff810c43e4
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] Call Trace:
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [<ffffffff815123b5>]
? dump_stack+0x5d/0x78
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [<ffffffff8150ff8d>]
? dump_header+0x76/0x1e8
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [<ffffffff810d6e3f>]
? smp_call_function_single+0x5f/0xa0
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [<ffffffff81516d2e>]
? mutex_lock+0xe/0x2a
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [<ffffffff810689d3>]
? put_online_cpus+0x23/0x80
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [<ffffffff810c43e4>]
? rcu_oom_notify+0xc4/0xe0
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [<ffffffff81153d1c>]
? do_try_to_free_pages+0x4ac/0x520
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [<ffffffff811427dd>]
? oom_kill_process+0x21d/0x370
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [<ffffffff8114239d>]
? find_lock_task_mm+0x3d/0x90
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [<ffffffff81142f43>]
? out_of_memory+0x473/0x4b0
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [<ffffffff81148e0f>]
? __alloc_pages_nodemask+0x9ef/0xb50
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [<ffffffff81065c86>]
? copy_process.part.25+0x116/0x1c50
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [<ffffffff81058301>]
? __do_page_fault+0x1d1/0x4f0
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [<ffffffff81067990>]
? do_fork+0xe0/0x3d0
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [<ffffffff815188f9>]
? stub_clone+0x69/0x90
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [<ffffffff8151858d>]
? system_call_fast_compare_end+0x10/0x15
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.367638] Mem-Info:
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.368962] Node 0 DMA per-cpu:
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.370768] CPU 0: hi: 0,
btch: 1 usd: 0
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.373249] CPU 1: hi: 0,
btch: 1 usd: 0
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.375652] Node 0 DMA32 per-cpu:
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.377508] CPU 0: hi: 186,
btch: 31 usd: 29
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.379898] CPU 1: hi: 186,
btch: 31 usd: 0
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.382318] active_anon:846474
inactive_anon:1913 isolated_anon:0
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.382318] active_file:408
inactive_file:415 isolated_file:32
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.382318] unevictable:20736
dirty:27 writeback:0 unstable:0
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.382318] free:16797
slab_reclaimable:15276 slab_unreclaimable:10521
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.382318] mapped:22002
shmem:22935 pagetables:30362 bounce:0
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.382318] free_cma:0
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.397242] Node 0 DMA
free:15192kB min:184kB low:228kB high:276kB active_anon:416kB
inactive_anon:60kB active_file:0kB inactive_file:0kB unevictable:20kB
isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB
mlocked:20kB dirty:0kB writeback:0kB mapped:20kB shmem:80kB
slab_reclaimable:32kB slab_unreclaimable:0kB kernel_stack:112kB
pagetables:20kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB
pages_scanned:0 all_unreclaimable? yes
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.416338] lowmem_reserve[]: 0
3757 3757 3757
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.419030] Node 0 DMA32
free:50120kB min:44868kB low:56084kB high:67300kB active_anon:3386780kB
inactive_anon:7592kB active_file:1732kB inactive_file:2060kB
unevictable:82924kB isolated(anon):0kB isolated(file):128kB
present:3915776kB managed:3849676kB mlocked:82924kB dirty:108kB
writeback:0kB mapped:88432kB shmem:91660kB slab_reclaimable:61072kB
slab_unreclaimable:42184kB kernel_stack:27248kB pagetables:121428kB
unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0
all_unreclaimable? no
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.440095] lowmem_reserve[]: 0 0
0 0
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.442202] Node 0 DMA: 22*4kB
(UEM) 6*8kB (EM) 1*16kB (E) 2*32kB (UM) 2*64kB (UE) 2*128kB (EM) 3*256kB
(UEM) 1*512kB (E) 3*1024kB (UEM) 3*2048kB (EMR) 1*4096kB (M) = 15192kB
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.451936] Node 0 DMA32: 4031*4kB
(EM) 2729*8kB (EM) 324*16kB (EM) 1*32kB (R) 1*64kB (R) 0*128kB 0*256kB
1*512kB (R) 1*1024kB (R) 1*2048kB (R) 0*4096kB = 46820kB
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.460240] Node 0
hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.464122] 24240 total pagecache
pages
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.466048] 0 pages in swap cache
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.467672] Swap cache stats: add
0, delete 0, find 0/0
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.470159] Free swap = 0kB
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.471513] Total swap = 0kB
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.472980] 982941 pages RAM
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.474380] 0 pages
HighMem/MovableOnly
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.476190] 16525 pages reserved
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.477772] 0 pages hwpoisoned
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.479189] [ pid ] uid tgid
total_vm rss nr_ptes swapents oom_score_adj name
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.482698] [ 163] 0 163
10419 1295 21 0 0 systemd-journal
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.486646] [ 165] 0 165
10202 136 21 0 -1000 systemd-udevd
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.490598] [ 294] 0 294
6351 1729 14 0 0 dhclient
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.494457] [ 319] 0 319
6869 62 18 0 0 cron
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.498260] [ 321] 0 321
4964 67 14 0 0 systemd-logind
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.502346] [ 326] 105 326
10558 101 25 0 -900 dbus-daemon
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.506315] [ 342] 0 342
65721 228 31 0 0 rsyslogd
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.510222] [ 343] 0 343
88199 2108 61 0 -500 dockerd
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.514022] [ 350] 106 350
18280 181 36 0 0 zabbix_agentd
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.518040] [ 351] 106 351
18280 475 36 0 0 zabbix_agentd
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.522041] [ 352] 106 352
18280 187 36 0 0 zabbix_agentd
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.526025] [ 353] 106 353
18280 187 36 0 0 zabbix_agentd
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.530067] [ 354] 106 354
18280 187 36 0 0 zabbix_agentd
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.534033] [ 355] 106 355
18280 190 36 0 0 zabbix_agentd
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.538001] [ 358] 0 358
66390 1826 32 0 0 fail2ban-server
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.541972] [ 400] 0 400
35984 444 24 0 -500 docker-containe
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.545879] [ 568] 0 568
13796 168 30 0 -1000 sshd
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.549733] [ 576] 0 576
3604 41 12 0 0 agetty
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.553569] [ 577] 0 577
3559 38 12 0 0 agetty
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.557322] [16201] 0 16201
29695 20707 60 0 0 varnishd
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.561103] [16209] 108 16209
118909802 822425 29398 0 0 cache-main
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.565002] [27352] 0 27352
20131 214 42 0 0 sshd
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.568682] [27354] 1000 27354
20165 211 41 0 0 sshd
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.572307] [27355] 1000 27355
5487 146 17 0 0 bash
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.575920] [27360] 0 27360
11211 107 26 0 0 sudo
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.579593] [27361] 0 27361
11584 97 27 0 0 su
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.583155] [27362] 0 27362
5481 142 15 0 0 bash
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.586782] [27749] 0 27749
20131 214 41 0 0 sshd
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.590428] [27751] 1000 27751
20164 211 39 0 0 sshd
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.593979] [27752] 1000 27752
5487 147 15 0 0 bash
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.597488] [28762] 0 28762
26528 132 17 0 0 varnishstat
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.601239] [28764] 0 28764
11211 106 26 0 0 sudo
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.604737] [28765] 0 28765
11584 97 26 0 0 su
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.608602] [28766] 0 28766
5481 141 15 0 0 bash
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.612288] [28768] 0 28768
26528 220 18 0 0 varnishstat
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.616189] Out of memory: Kill
process 16209 (cache-main) score 880 or sacrifice child
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.620106] Killed process 16209
(cache-main) total-vm:475639208kB, anon-rss:3289700kB, file-rss:0kB
Jun 29 13:11:01 ip-172-25-2-8 varnishd[16201]: Child (16209) died signal=9
Jun 29 13:11:01 ip-172-25-2-8 varnishd[16201]: Child cleanup complete
Jun 29 13:11:01 ip-172-25-2-8 varnishd[16201]: Child (30313) Started
Jun 29 13:11:01 ip-172-25-2-8 varnishd[16201]: Child (30313) said Child
starts
Jun 29 13:11:01 ip-172-25-2-8 varnishd[16201]: Child (30313) said SMF.s0
mmap'ed 483183820800 bytes of 483183820800
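
For what it's worth, a rough reading of the report above (not a diagnosis): the
total-vm of 475639208kB is essentially the 450G file storage mapping
(483183820800 bytes), while the anon-rss of 3289700kB is about 3.1 GiB of
resident memory on a 3.69 GiB host with no swap ("Total swap = 0kB"), so the
child appears to be killed for real memory pressure rather than for the size of
the mapping. A quick way to pull the relevant lines back out of syslog after a
kill (assuming the default Debian location, /var/log/syslog):

grep -E 'oom-killer|Out of memory|Killed process' /var/log/syslog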

Best,
Stefano


On Wed, Jun 28, 2017 at 11:33 AM, Reza Naghibi <reza@varnish-software.com>
wrote:

> Assuming the problem is running out of memory, you will need to do some
> memory tuning, especially given the number of threads you are using and
> your access patterns. Your options:
>
> - Add more memory to the system
> - Reduce thread_pool_max
> - Reduce jemalloc's thread cache (MALLOC_CONF="lg_tcache_max:10")
> - Use some of the tuning params in here: https://info.varnish-software.com/blog/understanding-varnish-cache-memory-usage
>
>
> --
> Reza Naghibi
> Varnish Software
>
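
As a hypothetical illustration of the tuning options quoted above (the listen
address, storage file path and thread count below are placeholders rather than
values from this thread; only the MALLOC_CONF setting and the 450G size come
from the discussion):

# Cap jemalloc's per-thread cache at 1 KiB allocations before starting varnishd.
export MALLOC_CONF="lg_tcache_max:10"

# Sketch of a varnishd invocation; -F keeps it in the foreground (handy in Docker).
varnishd -F -a :80 \
    -s file,/var/lib/varnish/varnish_storage.bin,450G \
    -p thread_pool_max=1000
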
> On Wed, Jun 28, 2017 at 9:26 AM, Guillaume Quintard <
> guillaume@varnish-software.com> wrote:
>
>> Hi,
>>
>> can you look that "varnishstat -1 | grep g_bytes" and see if if matches
>> the memory you are seeing?
>>
>> --
>> Guillaume Quintard
>>
>> On Wed, Jun 28, 2017 at 3:20 PM, Stefano Baldo <stefanobaldo@gmail.com>
>> wrote:
>>
>>> Hi Guillaume.
>>>
>>> I increased the cli_timeout yesterday to 900sec (15min) and it restarted
>>> anyway, which seems to indicate that the thread is really stalled.
>>>
>>> This was 1 minute after the last restart:
>>>
>>> MAIN.n_object 3908216 . object structs made
>>> SMF.s0.g_alloc 7794510 . Allocations outstanding
>>>
>>> I've just changed the I/O Scheduler to noop to see what happens.
>>>
>>> One interest thing I've found is about the memory usage.
>>>
>>> In the 1st minute of use:
>>> MemTotal: 3865572 kB
>>> MemFree: 120768 kB
>>> MemAvailable: 2300268 kB
>>>
>>> 1 minute before a restart:
>>> MemTotal: 3865572 kB
>>> MemFree: 82480 kB
>>> MemAvailable: 68316 kB
>>>
>>> It seems like the system is possibly running out of memory.
>>>
>>> When calling varnishd, I'm specifying only "-s file,..." as storage. I
>>> see in some examples that is common to use "-s file" AND "-s malloc"
>>> together. Should I be passing "-s malloc" as well to somehow try to limit
>>> the memory usage by varnishd?
>>>
>>> Best,
>>> Stefano
>>>
>>>
>>> On Wed, Jun 28, 2017 at 4:12 AM, Guillaume Quintard <
>>> guillaume@varnish-software.com> wrote:
>>>
>>>> Sadly, nothing suspicious here, you can still try:
>>>> - bumping the cli_timeout
>>>> - changing your disk scheduler
>>>> - changing the advice option of the file storage
>>>>
>>>> I'm still convinced this is due to Varnish getting stuck waiting for
>>>> the disk because of the file storage fragmentation.
>>>>
>>>> Maybe you could look at SMF.*.g_alloc and compare it to the number of
>>>> objects. Ideally, we would have a 1:1 relation between objects and
>>>> allocations. If that number drops prior to a restart, that would be a good
>>>> clue.
>>>>
>>>>
>>>> --
>>>> Guillaume Quintard
>>>>
>>>> On Tue, Jun 27, 2017 at 11:07 PM, Stefano Baldo <stefanobaldo@gmail.com
>>>> > wrote:
>>>>
>>>>> Hi Guillaume.
>>>>>
>>>>> It keeps restarting.
>>>>> Would you mind taking a quick look in the following VCL file to check
>>>>> if you find anything suspicious?
>>>>>
>>>>> Thank you very much.
>>>>>
>>>>> Best,
>>>>> Stefano
>>>>>
>>>>> vcl 4.0;
>>>>>
>>>>> import std;
>>>>>
>>>>> backend default {
>>>>> .host = "sites-web-server-lb";
>>>>> .port = "80";
>>>>> }
>>>>>
>>>>> include "/etc/varnish/bad_bot_detection.vcl";
>>>>>
>>>>> sub vcl_recv {
>>>>> call bad_bot_detection;
>>>>>
>>>>> if (req.url == "/nocache" || req.url == "/version") {
>>>>> return(pass);
>>>>> }
>>>>>
>>>>> unset req.http.Cookie;
>>>>> if (req.method == "PURGE") {
>>>>> ban("obj.http.x-host == " + req.http.host + " &&
>>>>> obj.http.x-user-agent !~ Googlebot");
>>>>> return(synth(750));
>>>>> }
>>>>>
>>>>> set req.url = regsuball(req.url, "(?<!(http:|https))\/+", "/");
>>>>> }
>>>>>
>>>>> sub vcl_synth {
>>>>> if (resp.status == 750) {
>>>>> set resp.status = 200;
>>>>> synthetic("PURGED => " + req.url);
>>>>> return(deliver);
>>>>> } elsif (resp.status == 501) {
>>>>> set resp.status = 200;
>>>>> set resp.http.Content-Type = "text/html; charset=utf-8";
>>>>> synthetic(std.fileread("/etc/varnish/pages/invalid_domain.html"));
>>>>> return(deliver);
>>>>> }
>>>>> }
>>>>>
>>>>> sub vcl_backend_response {
>>>>> unset beresp.http.Set-Cookie;
>>>>> set beresp.http.x-host = bereq.http.host;
>>>>> set beresp.http.x-user-agent = bereq.http.user-agent;
>>>>>
>>>>> if (bereq.url == "/themes/basic/assets/theme.min.css"
>>>>> || bereq.url == "/api/events/PAGEVIEW"
>>>>> || bereq.url ~ "^\/assets\/img\/") {
>>>>> set beresp.http.Cache-Control = "max-age=0";
>>>>> } else {
>>>>> unset beresp.http.Cache-Control;
>>>>> }
>>>>>
>>>>> if (beresp.status == 200 ||
>>>>> beresp.status == 301 ||
>>>>> beresp.status == 302 ||
>>>>> beresp.status == 404) {
>>>>> if (bereq.url ~ "\&ordenar=aleatorio$") {
>>>>> set beresp.http.X-TTL = "1d";
>>>>> set beresp.ttl = 1d;
>>>>> } else {
>>>>> set beresp.http.X-TTL = "1w";
>>>>> set beresp.ttl = 1w;
>>>>> }
>>>>> }
>>>>>
>>>>> if (bereq.url !~ "\.(jpeg|jpg|png|gif|gz|tgz|bz2|tbz|mp3|ogg|swf|flv)$")
>>>>> {
>>>>> set beresp.do_gzip = true;
>>>>> }
>>>>> }
>>>>>
>>>>> sub vcl_pipe {
>>>>> set bereq.http.connection = "close";
>>>>> return (pipe);
>>>>> }
>>>>>
>>>>> sub vcl_deliver {
>>>>> unset resp.http.x-host;
>>>>> unset resp.http.x-user-agent;
>>>>> }
>>>>>
>>>>> sub vcl_backend_error {
>>>>> if (beresp.status == 502 || beresp.status == 503 || beresp.status ==
>>>>> 504) {
>>>>> set beresp.status = 200;
>>>>> set beresp.http.Content-Type = "text/html; charset=utf-8";
>>>>> synthetic(std.fileread("/etc/varnish/pages/maintenance.html"));
>>>>> return (deliver);
>>>>> }
>>>>> }
>>>>>
>>>>> sub vcl_hash {
>>>>> if (req.http.User-Agent ~ "Google Page Speed") {
>>>>> hash_data("Google Page Speed");
>>>>> } elsif (req.http.User-Agent ~ "Googlebot") {
>>>>> hash_data("Googlebot");
>>>>> }
>>>>> }
>>>>>
>>>>> sub vcl_deliver {
>>>>> if (resp.status == 501) {
>>>>> return (synth(resp.status));
>>>>> }
>>>>> if (obj.hits > 0) {
>>>>> set resp.http.X-Cache = "hit";
>>>>> } else {
>>>>> set resp.http.X-Cache = "miss";
>>>>> }
>>>>> }
>>>>>
>>>>>
>>>>> On Mon, Jun 26, 2017 at 3:47 PM, Guillaume Quintard <
>>>>> guillaume@varnish-software.com> wrote:
>>>>>
>>>>>> Nice! It may have been the cause, time will tell.can you report back
>>>>>> in a few days to let us know?
>>>>>> --
>>>>>> Guillaume Quintard
>>>>>>
>>>>>> On Jun 26, 2017 20:21, "Stefano Baldo" <stefanobaldo@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Guillaume.
>>>>>>>
>>>>>>> I think things will start to going better now after changing the
>>>>>>> bans.
>>>>>>> This is how my last varnishstat looked like moments before a crash
>>>>>>> regarding the bans:
>>>>>>>
>>>>>>> MAIN.bans 41336 . Count of bans
>>>>>>> MAIN.bans_completed 37967 . Number of bans
>>>>>>> marked 'completed'
>>>>>>> MAIN.bans_obj 0 . Number of bans
>>>>>>> using obj.*
>>>>>>> MAIN.bans_req 41335 . Number of bans
>>>>>>> using req.*
>>>>>>> MAIN.bans_added 41336 0.68 Bans added
>>>>>>> MAIN.bans_deleted 0 0.00 Bans deleted
>>>>>>>
>>>>>>> And this is how it looks like now:
>>>>>>>
>>>>>>> MAIN.bans 2 . Count of bans
>>>>>>> MAIN.bans_completed 1 . Number of bans
>>>>>>> marked 'completed'
>>>>>>> MAIN.bans_obj 2 . Number of bans
>>>>>>> using obj.*
>>>>>>> MAIN.bans_req 0 . Number of bans
>>>>>>> using req.*
>>>>>>> MAIN.bans_added 2016 0.69 Bans added
>>>>>>> MAIN.bans_deleted 2014 0.69 Bans deleted
>>>>>>>
>>>>>>> Before the changes, bans were never deleted!
>>>>>>> Now the bans are added and quickly deleted after a minute or even a
>>>>>>> couple of seconds.
>>>>>>>
>>>>>>> May this was the cause of the problem? It seems like varnish was
>>>>>>> having a large number of bans to manage and test against.
>>>>>>> I will let it ride now. Let's see if the problem persists or it's
>>>>>>> gone! :-)
>>>>>>>
>>>>>>> Best,
>>>>>>> Stefano
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Jun 26, 2017 at 3:10 PM, Guillaume Quintard <
>>>>>>> guillaume@varnish-software.com> wrote:
>>>>>>>
>>>>>>>> Looking good!
>>>>>>>>
>>>>>>>> --
>>>>>>>> Guillaume Quintard
>>>>>>>>
>>>>>>>> On Mon, Jun 26, 2017 at 7:06 PM, Stefano Baldo <
>>>>>>>> stefanobaldo@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Guillaume,
>>>>>>>>>
>>>>>>>>> Can the following be considered "ban lurker friendly"?
>>>>>>>>>
>>>>>>>>> sub vcl_backend_response {
>>>>>>>>> set beresp.http.x-url = bereq.http.host + bereq.url;
>>>>>>>>> set beresp.http.x-user-agent = bereq.http.user-agent;
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> sub vcl_recv {
>>>>>>>>> if (req.method == "PURGE") {
>>>>>>>>> ban("obj.http.x-url == " + req.http.host + req.url + " &&
>>>>>>>>> obj.http.x-user-agent !~ Googlebot");
>>>>>>>>> return(synth(750));
>>>>>>>>> }
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> sub vcl_deliver {
>>>>>>>>> unset resp.http.x-url;
>>>>>>>>> unset resp.http.x-user-agent;
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Stefano
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Jun 26, 2017 at 12:43 PM, Guillaume Quintard <
>>>>>>>>> guillaume@varnish-software.com> wrote:
>>>>>>>>>
>>>>>>>>>> Not lurker friendly at all indeed. You'll need to avoid req.*
>>>>>>>>>> expression. Easiest way is to stash the host, user-agent and url in
>>>>>>>>>> beresp.http.* and ban against those (unset them in vcl_deliver).
>>>>>>>>>>
>>>>>>>>>> I don't think you need to expand the VSL at all.
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Guillaume Quintard
>>>>>>>>>>
>>>>>>>>>> On Jun 26, 2017 16:51, "Stefano Baldo" <stefanobaldo@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi Guillaume.
>>>>>>>>>>
>>>>>>>>>> Thanks for answering.
>>>>>>>>>>
>>>>>>>>>> I'm using a SSD disk. I've changed from ext4 to ext2 to increase
>>>>>>>>>> performance but it stills restarting.
>>>>>>>>>> Also, I checked the I/O performance for the disk and there is no
>>>>>>>>>> signal of overhead.
>>>>>>>>>>
>>>>>>>>>> I've changed the /var/lib/varnish to a tmpfs and increased its
>>>>>>>>>> 80m default size passing "-l 200m,20m" to varnishd and using
>>>>>>>>>> "nodev,nosuid,noatime,size=256M 0 0" for the tmpfs mount. There
>>>>>>>>>> was a problem here. After a couple of hours varnish died and I received a
>>>>>>>>>> "no space left on device" message - deleting the /var/lib/varnish solved
>>>>>>>>>> the problem and varnish was up again, but it's weird because there was free
>>>>>>>>>> memory on the host to be used with the tmpfs directory, so I don't know
>>>>>>>>>> what could have happened. I will try to stop increasing the
>>>>>>>>>> /var/lib/varnish size.
>>>>>>>>>>
>>>>>>>>>> Anyway, I am worried about the bans. You asked me if the bans are
>>>>>>>>>> lurker friedly. Well, I don't think so. My bans are created this way:
>>>>>>>>>>
>>>>>>>>>> ban("req.http.host == " + req.http.host + " && req.url ~ " +
>>>>>>>>>> req.url + " && req.http.User-Agent !~ Googlebot");
>>>>>>>>>>
>>>>>>>>>> Are they lurker friendly? I was taking a quick look and the
>>>>>>>>>> documentation and it looks like they're not.
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Stefano
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, Jun 23, 2017 at 11:30 AM, Guillaume Quintard <
>>>>>>>>>> guillaume@varnish-software.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Stefano,
>>>>>>>>>>>
>>>>>>>>>>> Let's cover the usual suspects: I/Os. I think here Varnish gets
>>>>>>>>>>> stuck trying to push/pull data and can't make time to reply to the CLI. I'd
>>>>>>>>>>> recommend monitoring the disk activity (bandwidth and iops) to confirm.
>>>>>>>>>>>
>>>>>>>>>>> After some time, the file storage is terrible on a hard drive
>>>>>>>>>>> (SSDs take a bit more time to degrade) because of fragmentation. One
>>>>>>>>>>> solution to help the disks cope is to overprovision themif they're SSDs,
>>>>>>>>>>> and you can try different advices in the file storage definition in the
>>>>>>>>>>> command line (last parameter, after granularity).
>>>>>>>>>>>
>>>>>>>>>>> Is your /var/lib/varnish mount on tmpfs? That could help too.
>>>>>>>>>>>
>>>>>>>>>>> 40K bans is a lot, are they ban-lurker friendly?
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Guillaume Quintard
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Jun 23, 2017 at 4:01 PM, Stefano Baldo <
>>>>>>>>>>> stefanobaldo@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hello.
>>>>>>>>>>>>
>>>>>>>>>>>> I am having a critical problem with Varnish Cache in production
>>>>>>>>>>>> for over a month and any help will be appreciated.
>>>>>>>>>>>> The problem is that Varnish child process is recurrently being
>>>>>>>>>>>> restarted after 10~20h of use, with the following message:
>>>>>>>>>>>>
>>>>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824) not
>>>>>>>>>>>> responding to CLI, killed it.
>>>>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Unexpected reply
>>>>>>>>>>>> from ping: 400 CLI communication error
>>>>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824)
>>>>>>>>>>>> died signal=9
>>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child cleanup
>>>>>>>>>>>> complete
>>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038)
>>>>>>>>>>>> Started
>>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038)
>>>>>>>>>>>> said Child starts
>>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038)
>>>>>>>>>>>> said SMF.s0 mmap'ed 483183820800 bytes of 483183820800
>>>>>>>>>>>>
>>>>>>>>>>>> The following link is the varnishstat output just 1 minute
>>>>>>>>>>>> before a restart:
>>>>>>>>>>>>
>>>>>>>>>>>> https://pastebin.com/g0g5RVTs
>>>>>>>>>>>>
>>>>>>>>>>>> Environment:
>>>>>>>>>>>>
>>>>>>>>>>>> varnish-5.1.2 revision 6ece695
>>>>>>>>>>>> Debian 8.7 - Debian GNU/Linux 8 (3.16.0)
>>>>>>>>>>>> Installed using pre-built package from official repo at
>>>>>>>>>>>> packagecloud.io
>>>>>>>>>>>> CPU 2x2.9 GHz
>>>>>>>>>>>> Mem 3.69 GiB
>>>>>>>>>>>> Running inside a Docker container
>>>>>>>>>>>> NFILES=131072
>>>>>>>>>>>> MEMLOCK=82000
>>>>>>>>>>>>
>>>>>>>>>>>> Additional info:
>>>>>>>>>>>>
>>>>>>>>>>>> - I need to cache a large number of objets and the cache should
>>>>>>>>>>>> last for almost a week, so I have set up a 450G storage space, I don't know
>>>>>>>>>>>> if this is a problem;
>>>>>>>>>>>> - I use ban a lot. There was about 40k bans in the system just
>>>>>>>>>>>> before the last crash. I really don't know if this is too much or may have
>>>>>>>>>>>> anything to do with it;
>>>>>>>>>>>> - No registered CPU spikes (almost always by 30%);
>>>>>>>>>>>> - No panic is reported, the only info I can retrieve is from
>>>>>>>>>>>> syslog;
>>>>>>>>>>>> - During all the time, event moments before the crashes,
>>>>>>>>>>>> everything is okay and requests are being responded very fast.
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Stefano Baldo
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>> varnish-misc mailing list
>>>>>>>>>>>> varnish-misc@varnish-cache.org
>>>>>>>>>>>> https://www.varnish-cache.org/lists/mailman/listinfo/varnish
>>>>>>>>>>>> -misc
>>>>>>>>>>>>
>> _______________________________________________
>> varnish-misc mailing list
>> varnish-misc@varnish-cache.org
>> https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc
>>
>
>
Re: Child process recurrently being restarted [ In reply to ]
Anything else using RAM on that machine? The best idea I have right now is to
monitor Varnish's memory usage in parallel with the *.g_bytes counters.
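
For example, a minimal sketch (the child process name cache-main is taken from
the OOM report above; adjust the interval to taste):

# Log storage counters, the child's resident set size and available memory
# side by side, every 30 seconds.
while true; do
    date
    varnishstat -1 | grep g_bytes      # bytes outstanding per storage backend
    ps -o pid=,rss= -C cache-main      # Varnish child RSS, in kB
    grep MemAvailable /proc/meminfo
    sleep 30
done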

--
Guillaume Quintard

On Thu, Jun 29, 2017 at 7:09 PM, Stefano Baldo <stefanobaldo@gmail.com>
wrote:

> Hi Guillaume and Reza.
>
> This time varnish restarted but it left some more info on syslog.
> It seems like the system is running out of memory.
>
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.297487] pool_herder invoked
> oom-killer: gfp_mask=0x2000d0, order=2, oom_score_adj=0
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.300992] pool_herder cpuset=/
> mems_allowed=0
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.303157] CPU: 1 PID: 16214
> Comm: pool_herder Tainted: G C O 3.16.0-4-amd64 #1 Debian
> 3.16.36-1+deb8u2
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] Hardware name: Xen
> HVM domU, BIOS 4.2.amazon 02/16/2017
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] 0000000000000000
> ffffffff815123b5 ffff8800eb3652f0 0000000000000000
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] ffffffff8150ff8d
> 0000000000000000 ffffffff810d6e3f 0000000000000000
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] ffffffff81516d2e
> 0000000000000200 ffffffff810689d3 ffffffff810c43e4
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] Call Trace:
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [<ffffffff815123b5>]
> ? dump_stack+0x5d/0x78
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [<ffffffff8150ff8d>]
> ? dump_header+0x76/0x1e8
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [<ffffffff810d6e3f>]
> ? smp_call_function_single+0x5f/0xa0
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [<ffffffff81516d2e>]
> ? mutex_lock+0xe/0x2a
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [<ffffffff810689d3>]
> ? put_online_cpus+0x23/0x80
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [<ffffffff810c43e4>]
> ? rcu_oom_notify+0xc4/0xe0
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [<ffffffff81153d1c>]
> ? do_try_to_free_pages+0x4ac/0x520
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [<ffffffff811427dd>]
> ? oom_kill_process+0x21d/0x370
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [<ffffffff8114239d>]
> ? find_lock_task_mm+0x3d/0x90
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [<ffffffff81142f43>]
> ? out_of_memory+0x473/0x4b0
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [<ffffffff81148e0f>]
> ? __alloc_pages_nodemask+0x9ef/0xb50
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [<ffffffff81065c86>]
> ? copy_process.part.25+0x116/0x1c50
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [<ffffffff81058301>]
> ? __do_page_fault+0x1d1/0x4f0
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [<ffffffff81067990>]
> ? do_fork+0xe0/0x3d0
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [<ffffffff815188f9>]
> ? stub_clone+0x69/0x90
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [<ffffffff8151858d>]
> ? system_call_fast_compare_end+0x10/0x15
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.367638] Mem-Info:
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.368962] Node 0 DMA per-cpu:
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.370768] CPU 0: hi: 0,
> btch: 1 usd: 0
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.373249] CPU 1: hi: 0,
> btch: 1 usd: 0
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.375652] Node 0 DMA32 per-cpu:
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.377508] CPU 0: hi: 186,
> btch: 31 usd: 29
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.379898] CPU 1: hi: 186,
> btch: 31 usd: 0
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.382318] active_anon:846474
> inactive_anon:1913 isolated_anon:0
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.382318] active_file:408
> inactive_file:415 isolated_file:32
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.382318] unevictable:20736
> dirty:27 writeback:0 unstable:0
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.382318] free:16797
> slab_reclaimable:15276 slab_unreclaimable:10521
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.382318] mapped:22002
> shmem:22935 pagetables:30362 bounce:0
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.382318] free_cma:0
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.397242] Node 0 DMA
> free:15192kB min:184kB low:228kB high:276kB active_anon:416kB
> inactive_anon:60kB active_file:0kB inactive_file:0kB unevictable:20kB
> isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB
> mlocked:20kB dirty:0kB writeback:0kB mapped:20kB shmem:80kB
> slab_reclaimable:32kB slab_unreclaimable:0kB kernel_stack:112kB
> pagetables:20kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB
> pages_scanned:0 all_unreclaimable? yes
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.416338] lowmem_reserve[]: 0
> 3757 3757 3757
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.419030] Node 0 DMA32
> free:50120kB min:44868kB low:56084kB high:67300kB active_anon:3386780kB
> inactive_anon:7592kB active_file:1732kB inactive_file:2060kB
> unevictable:82924kB isolated(anon):0kB isolated(file):128kB
> present:3915776kB managed:3849676kB mlocked:82924kB dirty:108kB
> writeback:0kB mapped:88432kB shmem:91660kB slab_reclaimable:61072kB
> slab_unreclaimable:42184kB kernel_stack:27248kB pagetables:121428kB
> unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0
> all_unreclaimable? no
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.440095] lowmem_reserve[]: 0 0
> 0 0
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.442202] Node 0 DMA: 22*4kB
> (UEM) 6*8kB (EM) 1*16kB (E) 2*32kB (UM) 2*64kB (UE) 2*128kB (EM) 3*256kB
> (UEM) 1*512kB (E) 3*1024kB (UEM) 3*2048kB (EMR) 1*4096kB (M) = 15192kB
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.451936] Node 0 DMA32:
> 4031*4kB (EM) 2729*8kB (EM) 324*16kB (EM) 1*32kB (R) 1*64kB (R) 0*128kB
> 0*256kB 1*512kB (R) 1*1024kB (R) 1*2048kB (R) 0*4096kB = 46820kB
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.460240] Node 0
> hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.464122] 24240 total pagecache
> pages
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.466048] 0 pages in swap cache
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.467672] Swap cache stats: add
> 0, delete 0, find 0/0
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.470159] Free swap = 0kB
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.471513] Total swap = 0kB
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.472980] 982941 pages RAM
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.474380] 0 pages
> HighMem/MovableOnly
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.476190] 16525 pages reserved
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.477772] 0 pages hwpoisoned
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.479189] [ pid ] uid tgid
> total_vm rss nr_ptes swapents oom_score_adj name
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.482698] [ 163] 0 163
> 10419 1295 21 0 0 systemd-journal
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.486646] [ 165] 0 165
> 10202 136 21 0 -1000 systemd-udevd
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.490598] [ 294] 0 294
> 6351 1729 14 0 0 dhclient
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.494457] [ 319] 0 319
> 6869 62 18 0 0 cron
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.498260] [ 321] 0 321
> 4964 67 14 0 0 systemd-logind
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.502346] [ 326] 105 326
> 10558 101 25 0 -900 dbus-daemon
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.506315] [ 342] 0 342
> 65721 228 31 0 0 rsyslogd
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.510222] [ 343] 0 343
> 88199 2108 61 0 -500 dockerd
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.514022] [ 350] 106 350
> 18280 181 36 0 0 zabbix_agentd
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.518040] [ 351] 106 351
> 18280 475 36 0 0 zabbix_agentd
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.522041] [ 352] 106 352
> 18280 187 36 0 0 zabbix_agentd
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.526025] [ 353] 106 353
> 18280 187 36 0 0 zabbix_agentd
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.530067] [ 354] 106 354
> 18280 187 36 0 0 zabbix_agentd
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.534033] [ 355] 106 355
> 18280 190 36 0 0 zabbix_agentd
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.538001] [ 358] 0 358
> 66390 1826 32 0 0 fail2ban-server
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.541972] [ 400] 0 400
> 35984 444 24 0 -500 docker-containe
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.545879] [ 568] 0 568
> 13796 168 30 0 -1000 sshd
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.549733] [ 576] 0 576
> 3604 41 12 0 0 agetty
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.553569] [ 577] 0 577
> 3559 38 12 0 0 agetty
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.557322] [16201] 0 16201
> 29695 20707 60 0 0 varnishd
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.561103] [16209] 108 16209
> 118909802 822425 29398 0 0 cache-main
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.565002] [27352] 0 27352
> 20131 214 42 0 0 sshd
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.568682] [27354] 1000 27354
> 20165 211 41 0 0 sshd
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.572307] [27355] 1000 27355
> 5487 146 17 0 0 bash
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.575920] [27360] 0 27360
> 11211 107 26 0 0 sudo
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.579593] [27361] 0 27361
> 11584 97 27 0 0 su
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.583155] [27362] 0 27362
> 5481 142 15 0 0 bash
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.586782] [27749] 0 27749
> 20131 214 41 0 0 sshd
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.590428] [27751] 1000 27751
> 20164 211 39 0 0 sshd
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.593979] [27752] 1000 27752
> 5487 147 15 0 0 bash
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.597488] [28762] 0 28762
> 26528 132 17 0 0 varnishstat
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.601239] [28764] 0 28764
> 11211 106 26 0 0 sudo
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.604737] [28765] 0 28765
> 11584 97 26 0 0 su
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.608602] [28766] 0 28766
> 5481 141 15 0 0 bash
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.612288] [28768] 0 28768
> 26528 220 18 0 0 varnishstat
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.616189] Out of memory: Kill
> process 16209 (cache-main) score 880 or sacrifice child
> Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.620106] Killed process 16209
> (cache-main) total-vm:475639208kB, anon-rss:3289700kB, file-rss:0kB
> Jun 29 13:11:01 ip-172-25-2-8 varnishd[16201]: Child (16209) died signal=9
> Jun 29 13:11:01 ip-172-25-2-8 varnishd[16201]: Child cleanup complete
> Jun 29 13:11:01 ip-172-25-2-8 varnishd[16201]: Child (30313) Started
> Jun 29 13:11:01 ip-172-25-2-8 varnishd[16201]: Child (30313) said Child
> starts
> Jun 29 13:11:01 ip-172-25-2-8 varnishd[16201]: Child (30313) said SMF.s0
> mmap'ed 483183820800 bytes of 483183820800
>
> Best,
> Stefano
>
>
> On Wed, Jun 28, 2017 at 11:33 AM, Reza Naghibi <reza@varnish-software.com>
> wrote:
>
>> Assuming the problem is running out of memory, you will need to do some
>> memory tuning, especially given the number of threads you are using and
>> your access patterns. Your options:
>>
>> - Add more memory to the system
>> - Reduce thread_pool_max
>> - Reduce jemalloc's thread cache (MALLOC_CONF="lg_tcache_max:10")
>> - Use some of the tuning params in here: https://info.varnish-software.com/blog/understanding-varnish-cache-memory-usage
>>
>>
>> --
>> Reza Naghibi
>> Varnish Software
>>
>> On Wed, Jun 28, 2017 at 9:26 AM, Guillaume Quintard <
>> guillaume@varnish-software.com> wrote:
>>
>>> Hi,
>>>
>>> can you look that "varnishstat -1 | grep g_bytes" and see if if matches
>>> the memory you are seeing?
>>>
>>> --
>>> Guillaume Quintard
>>>
>>> On Wed, Jun 28, 2017 at 3:20 PM, Stefano Baldo <stefanobaldo@gmail.com>
>>> wrote:
>>>
>>>> Hi Guillaume.
>>>>
>>>> I increased the cli_timeout yesterday to 900sec (15min) and it
>>>> restarted anyway, which seems to indicate that the thread is really stalled.
>>>>
>>>> This was 1 minute after the last restart:
>>>>
>>>> MAIN.n_object 3908216 . object structs made
>>>> SMF.s0.g_alloc 7794510 . Allocations outstanding
>>>>
>>>> I've just changed the I/O Scheduler to noop to see what happens.
>>>>
>>>> One interest thing I've found is about the memory usage.
>>>>
>>>> In the 1st minute of use:
>>>> MemTotal: 3865572 kB
>>>> MemFree: 120768 kB
>>>> MemAvailable: 2300268 kB
>>>>
>>>> 1 minute before a restart:
>>>> MemTotal: 3865572 kB
>>>> MemFree: 82480 kB
>>>> MemAvailable: 68316 kB
>>>>
>>>> It seems like the system is possibly running out of memory.
>>>>
>>>> When calling varnishd, I'm specifying only "-s file,..." as storage. I
>>>> see in some examples that is common to use "-s file" AND "-s malloc"
>>>> together. Should I be passing "-s malloc" as well to somehow try to limit
>>>> the memory usage by varnishd?
>>>>
>>>> Best,
>>>> Stefano
>>>>
>>>>
>>>> On Wed, Jun 28, 2017 at 4:12 AM, Guillaume Quintard <
>>>> guillaume@varnish-software.com> wrote:
>>>>
>>>>> Sadly, nothing suspicious here, you can still try:
>>>>> - bumping the cli_timeout
>>>>> - changing your disk scheduler
>>>>> - changing the advice option of the file storage
>>>>>
>>>>> I'm still convinced this is due to Varnish getting stuck waiting for
>>>>> the disk because of the file storage fragmentation.
>>>>>
>>>>> Maybe you could look at SMF.*.g_alloc and compare it to the number of
>>>>> objects. Ideally, we would have a 1:1 relation between objects and
>>>>> allocations. If that number drops prior to a restart, that would be a good
>>>>> clue.
>>>>>
>>>>>
>>>>> --
>>>>> Guillaume Quintard
>>>>>
>>>>> On Tue, Jun 27, 2017 at 11:07 PM, Stefano Baldo <
>>>>> stefanobaldo@gmail.com> wrote:
>>>>>
>>>>>> Hi Guillaume.
>>>>>>
>>>>>> It keeps restarting.
>>>>>> Would you mind taking a quick look in the following VCL file to check
>>>>>> if you find anything suspicious?
>>>>>>
>>>>>> Thank you very much.
>>>>>>
>>>>>> Best,
>>>>>> Stefano
>>>>>>
>>>>>> vcl 4.0;
>>>>>>
>>>>>> import std;
>>>>>>
>>>>>> backend default {
>>>>>> .host = "sites-web-server-lb";
>>>>>> .port = "80";
>>>>>> }
>>>>>>
>>>>>> include "/etc/varnish/bad_bot_detection.vcl";
>>>>>>
>>>>>> sub vcl_recv {
>>>>>> call bad_bot_detection;
>>>>>>
>>>>>> if (req.url == "/nocache" || req.url == "/version") {
>>>>>> return(pass);
>>>>>> }
>>>>>>
>>>>>> unset req.http.Cookie;
>>>>>> if (req.method == "PURGE") {
>>>>>> ban("obj.http.x-host == " + req.http.host + " &&
>>>>>> obj.http.x-user-agent !~ Googlebot");
>>>>>> return(synth(750));
>>>>>> }
>>>>>>
>>>>>> set req.url = regsuball(req.url, "(?<!(http:|https))\/+", "/");
>>>>>> }
>>>>>>
>>>>>> sub vcl_synth {
>>>>>> if (resp.status == 750) {
>>>>>> set resp.status = 200;
>>>>>> synthetic("PURGED => " + req.url);
>>>>>> return(deliver);
>>>>>> } elsif (resp.status == 501) {
>>>>>> set resp.status = 200;
>>>>>> set resp.http.Content-Type = "text/html; charset=utf-8";
>>>>>> synthetic(std.fileread("/etc/varnish/pages/invalid_domain.html"));
>>>>>> return(deliver);
>>>>>> }
>>>>>> }
>>>>>>
>>>>>> sub vcl_backend_response {
>>>>>> unset beresp.http.Set-Cookie;
>>>>>> set beresp.http.x-host = bereq.http.host;
>>>>>> set beresp.http.x-user-agent = bereq.http.user-agent;
>>>>>>
>>>>>> if (bereq.url == "/themes/basic/assets/theme.min.css"
>>>>>> || bereq.url == "/api/events/PAGEVIEW"
>>>>>> || bereq.url ~ "^\/assets\/img\/") {
>>>>>> set beresp.http.Cache-Control = "max-age=0";
>>>>>> } else {
>>>>>> unset beresp.http.Cache-Control;
>>>>>> }
>>>>>>
>>>>>> if (beresp.status == 200 ||
>>>>>> beresp.status == 301 ||
>>>>>> beresp.status == 302 ||
>>>>>> beresp.status == 404) {
>>>>>> if (bereq.url ~ "\&ordenar=aleatorio$") {
>>>>>> set beresp.http.X-TTL = "1d";
>>>>>> set beresp.ttl = 1d;
>>>>>> } else {
>>>>>> set beresp.http.X-TTL = "1w";
>>>>>> set beresp.ttl = 1w;
>>>>>> }
>>>>>> }
>>>>>>
>>>>>> if (bereq.url !~ "\.(jpeg|jpg|png|gif|gz|tgz|bz2|tbz|mp3|ogg|swf|flv)$")
>>>>>> {
>>>>>> set beresp.do_gzip = true;
>>>>>> }
>>>>>> }
>>>>>>
>>>>>> sub vcl_pipe {
>>>>>> set bereq.http.connection = "close";
>>>>>> return (pipe);
>>>>>> }
>>>>>>
>>>>>> sub vcl_deliver {
>>>>>> unset resp.http.x-host;
>>>>>> unset resp.http.x-user-agent;
>>>>>> }
>>>>>>
>>>>>> sub vcl_backend_error {
>>>>>> if (beresp.status == 502 || beresp.status == 503 || beresp.status
>>>>>> == 504) {
>>>>>> set beresp.status = 200;
>>>>>> set beresp.http.Content-Type = "text/html; charset=utf-8";
>>>>>> synthetic(std.fileread("/etc/varnish/pages/maintenance.html"));
>>>>>> return (deliver);
>>>>>> }
>>>>>> }
>>>>>>
>>>>>> sub vcl_hash {
>>>>>> if (req.http.User-Agent ~ "Google Page Speed") {
>>>>>> hash_data("Google Page Speed");
>>>>>> } elsif (req.http.User-Agent ~ "Googlebot") {
>>>>>> hash_data("Googlebot");
>>>>>> }
>>>>>> }
>>>>>>
>>>>>> sub vcl_deliver {
>>>>>> if (resp.status == 501) {
>>>>>> return (synth(resp.status));
>>>>>> }
>>>>>> if (obj.hits > 0) {
>>>>>> set resp.http.X-Cache = "hit";
>>>>>> } else {
>>>>>> set resp.http.X-Cache = "miss";
>>>>>> }
>>>>>> }
>>>>>>
>>>>>>
>>>>>> On Mon, Jun 26, 2017 at 3:47 PM, Guillaume Quintard <
>>>>>> guillaume@varnish-software.com> wrote:
>>>>>>
>>>>>>> Nice! It may have been the cause, time will tell.can you report back
>>>>>>> in a few days to let us know?
>>>>>>> --
>>>>>>> Guillaume Quintard
>>>>>>>
>>>>>>> On Jun 26, 2017 20:21, "Stefano Baldo" <stefanobaldo@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi Guillaume.
>>>>>>>>
>>>>>>>> I think things will start to going better now after changing the
>>>>>>>> bans.
>>>>>>>> This is how my last varnishstat looked like moments before a crash
>>>>>>>> regarding the bans:
>>>>>>>>
>>>>>>>> MAIN.bans 41336 . Count of bans
>>>>>>>> MAIN.bans_completed 37967 . Number of bans
>>>>>>>> marked 'completed'
>>>>>>>> MAIN.bans_obj 0 . Number of bans
>>>>>>>> using obj.*
>>>>>>>> MAIN.bans_req 41335 . Number of bans
>>>>>>>> using req.*
>>>>>>>> MAIN.bans_added 41336 0.68 Bans added
>>>>>>>> MAIN.bans_deleted 0 0.00 Bans deleted
>>>>>>>>
>>>>>>>> And this is how it looks like now:
>>>>>>>>
>>>>>>>> MAIN.bans 2 . Count of bans
>>>>>>>> MAIN.bans_completed 1 . Number of bans
>>>>>>>> marked 'completed'
>>>>>>>> MAIN.bans_obj 2 . Number of bans
>>>>>>>> using obj.*
>>>>>>>> MAIN.bans_req 0 . Number of bans
>>>>>>>> using req.*
>>>>>>>> MAIN.bans_added 2016 0.69 Bans added
>>>>>>>> MAIN.bans_deleted 2014 0.69 Bans deleted
>>>>>>>>
>>>>>>>> Before the changes, bans were never deleted!
>>>>>>>> Now the bans are added and quickly deleted after a minute or even a
>>>>>>>> couple of seconds.
>>>>>>>>
>>>>>>>> May this was the cause of the problem? It seems like varnish was
>>>>>>>> having a large number of bans to manage and test against.
>>>>>>>> I will let it ride now. Let's see if the problem persists or it's
>>>>>>>> gone! :-)
>>>>>>>>
>>>>>>>> Best,
>>>>>>>> Stefano
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Jun 26, 2017 at 3:10 PM, Guillaume Quintard <
>>>>>>>> guillaume@varnish-software.com> wrote:
>>>>>>>>
>>>>>>>>> Looking good!
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Guillaume Quintard
>>>>>>>>>
>>>>>>>>> On Mon, Jun 26, 2017 at 7:06 PM, Stefano Baldo <
>>>>>>>>> stefanobaldo@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi Guillaume,
>>>>>>>>>>
>>>>>>>>>> Can the following be considered "ban lurker friendly"?
>>>>>>>>>>
>>>>>>>>>> sub vcl_backend_response {
>>>>>>>>>> set beresp.http.x-url = bereq.http.host + bereq.url;
>>>>>>>>>> set beresp.http.x-user-agent = bereq.http.user-agent;
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> sub vcl_recv {
>>>>>>>>>> if (req.method == "PURGE") {
>>>>>>>>>> ban("obj.http.x-url == " + req.http.host + req.url + " &&
>>>>>>>>>> obj.http.x-user-agent !~ Googlebot");
>>>>>>>>>> return(synth(750));
>>>>>>>>>> }
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> sub vcl_deliver {
>>>>>>>>>> unset resp.http.x-url;
>>>>>>>>>> unset resp.http.x-user-agent;
>>>>>>>>>> }
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Stefano
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mon, Jun 26, 2017 at 12:43 PM, Guillaume Quintard <
>>>>>>>>>> guillaume@varnish-software.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Not lurker friendly at all indeed. You'll need to avoid req.*
>>>>>>>>>>> expression. Easiest way is to stash the host, user-agent and url in
>>>>>>>>>>> beresp.http.* and ban against those (unset them in vcl_deliver).
>>>>>>>>>>>
>>>>>>>>>>> I don't think you need to expand the VSL at all.
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Guillaume Quintard
>>>>>>>>>>>
>>>>>>>>>>> On Jun 26, 2017 16:51, "Stefano Baldo" <stefanobaldo@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi Guillaume.
>>>>>>>>>>>
>>>>>>>>>>> Thanks for answering.
>>>>>>>>>>>
>>>>>>>>>>> I'm using a SSD disk. I've changed from ext4 to ext2 to increase
>>>>>>>>>>> performance but it stills restarting.
>>>>>>>>>>> Also, I checked the I/O performance for the disk and there is no
>>>>>>>>>>> signal of overhead.
>>>>>>>>>>>
>>>>>>>>>>> I've changed the /var/lib/varnish to a tmpfs and increased its
>>>>>>>>>>> 80m default size passing "-l 200m,20m" to varnishd and using
>>>>>>>>>>> "nodev,nosuid,noatime,size=256M 0 0" for the tmpfs mount. There
>>>>>>>>>>> was a problem here. After a couple of hours varnish died and I received a
>>>>>>>>>>> "no space left on device" message - deleting the /var/lib/varnish solved
>>>>>>>>>>> the problem and varnish was up again, but it's weird because there was free
>>>>>>>>>>> memory on the host to be used with the tmpfs directory, so I don't know
>>>>>>>>>>> what could have happened. I will try to stop increasing the
>>>>>>>>>>> /var/lib/varnish size.
>>>>>>>>>>>
>>>>>>>>>>> Anyway, I am worried about the bans. You asked me if the bans
>>>>>>>>>>> are lurker friedly. Well, I don't think so. My bans are created this way:
>>>>>>>>>>>
>>>>>>>>>>> ban("req.http.host == " + req.http.host + " && req.url ~ " +
>>>>>>>>>>> req.url + " && req.http.User-Agent !~ Googlebot");
>>>>>>>>>>>
>>>>>>>>>>> Are they lurker friendly? I was taking a quick look and the
>>>>>>>>>>> documentation and it looks like they're not.
>>>>>>>>>>>
>>>>>>>>>>> Best,
>>>>>>>>>>> Stefano
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Jun 23, 2017 at 11:30 AM, Guillaume Quintard <
>>>>>>>>>>> guillaume@varnish-software.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi Stefano,
>>>>>>>>>>>>
>>>>>>>>>>>> Let's cover the usual suspects: I/Os. I think here Varnish gets
>>>>>>>>>>>> stuck trying to push/pull data and can't make time to reply to the CLI. I'd
>>>>>>>>>>>> recommend monitoring the disk activity (bandwidth and iops) to confirm.
>>>>>>>>>>>>
>>>>>>>>>>>> After some time, the file storage is terrible on a hard drive
>>>>>>>>>>>> (SSDs take a bit more time to degrade) because of fragmentation. One
>>>>>>>>>>>> solution to help the disks cope is to overprovision themif they're SSDs,
>>>>>>>>>>>> and you can try different advices in the file storage definition in the
>>>>>>>>>>>> command line (last parameter, after granularity).
>>>>>>>>>>>>
>>>>>>>>>>>> Is your /var/lib/varnish mount on tmpfs? That could help too.
>>>>>>>>>>>>
>>>>>>>>>>>> 40K bans is a lot, are they ban-lurker friendly?
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Guillaume Quintard
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Jun 23, 2017 at 4:01 PM, Stefano Baldo <
>>>>>>>>>>>> stefanobaldo@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hello.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I am having a critical problem with Varnish Cache in
>>>>>>>>>>>>> production for over a month and any help will be appreciated.
>>>>>>>>>>>>> The problem is that Varnish child process is recurrently being
>>>>>>>>>>>>> restarted after 10~20h of use, with the following message:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824)
>>>>>>>>>>>>> not responding to CLI, killed it.
>>>>>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Unexpected reply
>>>>>>>>>>>>> from ping: 400 CLI communication error
>>>>>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824)
>>>>>>>>>>>>> died signal=9
>>>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child cleanup
>>>>>>>>>>>>> complete
>>>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038)
>>>>>>>>>>>>> Started
>>>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038)
>>>>>>>>>>>>> said Child starts
>>>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038)
>>>>>>>>>>>>> said SMF.s0 mmap'ed 483183820800 bytes of 483183820800
>>>>>>>>>>>>>
>>>>>>>>>>>>> The following link is the varnishstat output just 1 minute
>>>>>>>>>>>>> before a restart:
>>>>>>>>>>>>>
>>>>>>>>>>>>> https://pastebin.com/g0g5RVTs
>>>>>>>>>>>>>
>>>>>>>>>>>>> Environment:
>>>>>>>>>>>>>
>>>>>>>>>>>>> varnish-5.1.2 revision 6ece695
>>>>>>>>>>>>> Debian 8.7 - Debian GNU/Linux 8 (3.16.0)
>>>>>>>>>>>>> Installed using pre-built package from official repo at
>>>>>>>>>>>>> packagecloud.io
>>>>>>>>>>>>> CPU 2x2.9 GHz
>>>>>>>>>>>>> Mem 3.69 GiB
>>>>>>>>>>>>> Running inside a Docker container
>>>>>>>>>>>>> NFILES=131072
>>>>>>>>>>>>> MEMLOCK=82000
>>>>>>>>>>>>>
>>>>>>>>>>>>> Additional info:
>>>>>>>>>>>>>
>>>>>>>>>>>>> - I need to cache a large number of objets and the cache
>>>>>>>>>>>>> should last for almost a week, so I have set up a 450G storage space, I
>>>>>>>>>>>>> don't know if this is a problem;
>>>>>>>>>>>>> - I use ban a lot. There was about 40k bans in the system just
>>>>>>>>>>>>> before the last crash. I really don't know if this is too much or may have
>>>>>>>>>>>>> anything to do with it;
>>>>>>>>>>>>> - No registered CPU spikes (almost always by 30%);
>>>>>>>>>>>>> - No panic is reported, the only info I can retrieve is from
>>>>>>>>>>>>> syslog;
>>>>>>>>>>>>> - During all the time, event moments before the crashes,
>>>>>>>>>>>>> everything is okay and requests are being responded very fast.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Best,
>>>>>>>>>>>>> Stefano Baldo
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>> varnish-misc mailing list
>>>>>>>>>>>>> varnish-misc@varnish-cache.org
>>>>>>>>>>>>> https://www.varnish-cache.org/lists/mailman/listinfo/varnish
>>>>>>>>>>>>> -misc
>>>>>>>>>>>>>
>>> _______________________________________________
>>> varnish-misc mailing list
>>> varnish-misc@varnish-cache.org
>>> https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc
>>>
>>
>>
>