Mailing List Archive

Odd crash
Hello all,

I woke up to numerous site timeouts, and when I went to check the backend
list, this is what was returned:

root@aviator [~]# varnishadm backend.list
Unknown request in manager process (child not running).
Type 'help' for more info.
Command failed with error code 101
root@aviator [~]#

I believe it was likely related to this panic:
https://zerobin.net/?448b15259bc80551#Geo0NImD4HLpGWVZWk9raD7Qhl11VkNZEmh21J2S9mE=

Varnish has since been upgraded to 4.1.5. Should I still be worried?
Re: Odd crash [ In reply to ]
On Mon, Feb 13, 2017 at 6:22 AM, Andrei <lagged@gmail.com> wrote:
> Hello all,
>
> I woke up to numerous site timeouts, and when I went to check the backend
> list, this is what was returned:
>
> root@aviator [~]# varnishadm backend.list
> Unknown request in manager process (child not running).
> Type 'help' for more info.
> Command failed with error code 101
> root@aviator [~]#

This is because the child process is not running, as the message
indicates. This can happen when you start varnish in debug mode, run
something like `varnishadm stop`, or hit a bug where the child doesn't
restart after a panic.
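
To check for this condition from a monitoring script, one option is to ask the manager process directly. The sketch below is only an illustration: it assumes `varnishadm status` is answered by the manager even when the child is down, and that its reply contains the phrase "Child in state running" for a healthy child. The function names are hypothetical.

```python
import subprocess


def child_is_running(status_output: str) -> bool:
    """Interpret the text returned by `varnishadm status`.

    Assumption: a healthy instance replies with a line containing
    "Child in state running"; anything else is treated as "not running".
    """
    return "child in state running" in status_output.lower()


def query_child_status(varnishadm: str = "varnishadm") -> bool:
    """Run `varnishadm status` and report whether the child is up.

    A non-zero exit code is not raised as an error here; the output is
    simply inspected, so a failed command counts as "not running".
    """
    result = subprocess.run(
        [varnishadm, "status"],
        capture_output=True,
        text=True,
        check=False,
    )
    return child_is_running(result.stdout)
```

Calling `query_child_status()` from a cron job or health check would flag a manager that is alive without a cache process, which is the state the `backend.list` error above points to.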

> I believe it was likely related to this panic:
> https://zerobin.net/?448b15259bc80551#Geo0NImD4HLpGWVZWk9raD7Qhl11VkNZEmh21J2S9mE=

It could be that the manager process failed to spawn a new child
after that panic.

> Varnish has since been upgraded to 4.1.5. Should I still be worried?

Which version are you currently running?

Dridi

_______________________________________________
varnish-misc mailing list
varnish-misc@varnish-cache.org
https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc
Re: Odd crash [ In reply to ]
I'm running 4.1.5 using the official repo (Cent6) now:

Name : varnish
Arch : x86_64
Version : 4.1.5
Release : 1.el6
Size : 6.3 M

There haven't been any issues since the upgrade, and that was honestly the
first panic I've had with Varnish in 3yrs+. I did notice the "Clock step
detected" mentioned in the panic log, and have seen some reports of clock
stepping causing issues, but there were no ntp/hwclock changes recorded at
the time.



Re: Odd crash [ In reply to ]
On Mon, Feb 20, 2017 at 11:22 AM, Andrei <lagged@gmail.com> wrote:
> I'm running 4.1.5 using the official repo (Cent6) now:
>
> Name : varnish
> Arch : x86_64
> Version : 4.1.5
> Release : 1.el6
> Size : 6.3 M
>
> There haven't been any issues since the upgrade, and that was honestly the
> first panic I've had with Varnish in 3yrs+. I did notice the "Clock step
> detected" mentioned in the panic log, and have seen some reports of clock
> stepping causing issues, but there were no ntp/hwclock changes recorded at
> the time.

Sorry, I misread your email, it clearly says that you are currently
running 4.1.5... What was the version when the panic occurred?

Regarding stepping clocks, I believe that we still haven't reached a
consensus regarding how to deal with them. So you may still get a
crash for this reason. At the very least, it now tells you the reason
in the panic message.
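
For anyone wanting to check whether a host's clock steps without an ntp/hwclock entry being logged, a generic way is to compare the wall clock against the monotonic clock over a short interval. This is a minimal sketch of that idea, not the check Varnish itself performs; the function names and the 0.5 second tolerance are assumptions chosen for illustration.

```python
import time


def measure_clock_drift(interval: float = 1.0) -> float:
    """Return wall-clock elapsed time minus monotonic elapsed time.

    The monotonic clock is not affected by steps, so a result far from
    zero suggests the wall clock was adjusted during the interval.
    """
    wall_start = time.time()
    mono_start = time.monotonic()
    time.sleep(interval)
    wall_elapsed = time.time() - wall_start
    mono_elapsed = time.monotonic() - mono_start
    return wall_elapsed - mono_elapsed


def looks_like_step(drift: float, tolerance: float = 0.5) -> bool:
    """Flag a drift (in seconds) larger than the tolerance, in either direction."""
    return abs(drift) > tolerance
```

Running `looks_like_step(measure_clock_drift())` in a loop and logging the hits would show whether steps happen on the machine at all, independently of what ntp reports.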

Dridi

Re: Odd crash [ In reply to ]
Hi Dridi,

Thanks for the input. Looking over the panic I initially linked, the exact
version info was:

version = varnish-4.1.4 revision 4529ff7
ident = Linux,2.6.32-642.6.2.el6.x86_64,x86_64,-junix,-smalloc,-smalloc,-hcritbit,epoll

I had already upgraded to 4.1.5 before mentioning the error on the list,
and it's been smooth since. It was just odd that there were no clock
changes, yet it panicked due to a stepping, so I figured I'd bring it up.


Re: Odd crash [ In reply to ]
On Mon, Feb 20, 2017 at 1:09 PM, Andrei <lagged@gmail.com> wrote:
> Hi Dridi,
>
> Thanks for the input. Looking over the panic I initially linked, the exact
> version info was:
>
> version = varnish-4.1.4 revision 4529ff7
> ident = Linux,2.6.32-642.6.2.el6.x86_64,x86_64,-junix,-smalloc,-smalloc,-hcritbit,epoll

Yes, I definitely need to get the hang of this "reading" thing.

> I had already upgraded to 4.1.5 before mentioning the error on the list, and
> it's been smooth since. It was just odd that there were no clock changes,
> yet it panicked due to a stepping, so I figured I'd bring it up.

There's a bug I don't remember and couldn't find after a quick search.
But basically it would cause the child to not restart after a panic.

Something like:

- panic (child crashed)
- restart
- CLI timeout during restart
- kill the child
- varnish running without a cache process

The last step is similar to the one you ran into, since you couldn't
list backends via the CLI. But I couldn't find this bug after a quick
search and looking at the changelog I don't see anything related. So I
would say you may end up in the same situation even with the latest
4.1 release. You should collect syslogs (enabled by default) and see
what Varnish has to say shortly before/after a panic, it might give a
clue.
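
A small filter can help pull the relevant lines out of a busy syslog. The sketch below is only a starting point: the keywords it looks for ("panic", "died", "not responding") are assumptions about what the manager might log around a child crash, and the function name is hypothetical. Adjust the pattern to whatever your varnishd actually writes.

```python
import re
from typing import Iterable, List

# Assumed keywords; they are guesses at what is logged around a child
# crash and are not taken from the Varnish source.
CHILD_EVENT = re.compile(
    r"varnishd\[\d+\]:.*\b(panic|died|not responding)\b",
    re.IGNORECASE,
)


def varnish_child_events(lines: Iterable[str]) -> List[str]:
    """Return the syslog lines from varnishd that match the keywords."""
    return [line.rstrip("\n") for line in lines if CHILD_EVENT.search(line)]
```

Typical use would be `varnish_child_events(open("/var/log/messages"))`, then reading the lines just before and after each hit.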

Dridi

Re: Odd crash [ In reply to ]
On Mon, Feb 20, 2017 at 7:13 AM, Dridi Boukelmoune <dridi@varni.sh> wrote:

> On Mon, Feb 20, 2017 at 1:09 PM, Andrei <lagged@gmail.com> wrote:
> > Hi Dridi,
> >
> > Thanks for the input. Looking over the panic I initially linked, the exact
> > version info was:
> >
> > version = varnish-4.1.4 revision 4529ff7
> > ident = Linux,2.6.32-642.6.2.el6.x86_64,x86_64,-junix,-smalloc,-smalloc,-hcritbit,epoll
>
> Yes, I definitely need to get the hang of this "reading" thing.
>

It's overrated :)


>
> > I had already upgraded to 4.1.5 before mentioning the error on the list, and
> > it's been smooth since. It was just odd that there were no clock changes,
> > yet it panicked due to a stepping, so I figured I'd bring it up.
>
> There's a bug I don't remember and couldn't find after a quick search.
> But basically it would cause the child to not restart after a panic.
>
> Something like:
>
> - panic (child crashed)
> - restart
> - CLI timeout during restart
> - kill the child
> - varnish running without a cache process
>
> The last step is similar to the one you ran into, since you couldn't
> list backends via the CLI. But I couldn't find this bug after a quick
> search and looking at the changelog I don't see anything related. So I
> would say you may end up in the same situation even with the latest
> 4.1 release. You should collect syslogs (enabled by default) and see
> what Varnish has to say shortly before/after a panic, it might give a
> clue.
>
> Dridi
>

Thanks for the details, I'll track it down from there and keep an eye on
the issue(s). Nothing in /var/log/messages other than a ban sent ~6h prior:

Feb 12 17:09:17 aviator varnishd[15791]: CLI telnet 127.0.0.1 55284 127.0.0.1 6082 Rd auth NNN
Feb 12 17:09:17 aviator varnishd[15791]: CLI telnet 127.0.0.1 55284 127.0.0.1 6082 Wr 200 -----------------------------#012Varnish Cache CLI 1.0#012-----------------------------#012Linux,2.6.32-642.6.2.el6.x86_64,x86_64,-junix,-smalloc,-smalloc,-hcritbit#012varnish-4.1.4 revision 4529ff7#012#012Type 'help' for command list.#012Type 'quit' to close CLI session.
Feb 12 17:09:17 aviator varnishd[15791]: CLI telnet 127.0.0.1 55284 127.0.0.1 6082 Rd ping
Feb 12 17:09:17 aviator varnishd[15791]: CLI telnet 127.0.0.1 55284 127.0.0.1 6082 Wr 200 PONG 1486912157 1.0
Feb 12 17:09:17 aviator varnishd[15791]: CLI telnet 127.0.0.1 55284 127.0.0.1 6082 Rd ban req.http.Host ~ ".*"
Feb 12 17:09:17 aviator varnishd[15791]: CLI telnet 127.0.0.1 55284 127.0.0.1 6082 Wr 200
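
The `#012` sequences in the log above are control characters escaped by the syslog daemon as `#` followed by three octal digits (012 octal is a newline). A minimal sketch for turning them back into real characters when reading such entries; the function name is hypothetical, and a literal `#` followed by three octal digits in the original message would be decoded too.

```python
import re

# "#" followed by exactly three octal digits, e.g. "#012" for newline.
_OCTAL_ESCAPE = re.compile(r"#([0-7]{3})")


def decode_syslog_escapes(message: str) -> str:
    """Replace #ooo octal escapes with the characters they stand for."""
    return _OCTAL_ESCAPE.sub(lambda m: chr(int(m.group(1), 8)), message)
```

Applied to the CLI banner entry, this restores the multi-line banner that varnishd sent to the client.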
Re: Odd crash [ In reply to ]
This definitely isn't an SELinux issue on my end. I've also seen Varnish
work fine with SELinux (after policy updates as Dridi mentioned).

On Mon, Feb 20, 2017 at 4:43 PM, Dridi Boukelmoune <dridi@varni.sh> wrote:

> On Mon, Feb 20, 2017 at 11:25 PM, Daniel Parthey <pada@posteo.de> wrote:
> > It might be an SELinux problem. Varnish 4.1.3 seems incompatible with the
> > default SELinux rules on CentOS. We ran into problems with child workers
> > when SELinux was enabled.
>
> I don't think it's related to SELinux. The main problem with
> CentOS/Red Hat/Fedora is the SELinux policy shipped by those
> distributions. They give very little margin and it becomes easy to
> make a change in your configuration that ends up rejected. At the
> same time conservative defaults give a smaller attack surface...
>
> > setenforce 0
> > service varnish restart
> >
> > and for permanent boot-safe change:
> >
> > /etc/sysconfig/selinux
> > SELINUX=disabled
>
> This is _not_ how you solve SELinux problems. You switch to
> permissive, collect audit logs while running offending software,
> update the policy and switch back to enforcing.
>
> > Might make varnish more stable.
> >
> > Not sure why the default CentOS Policy (at least on CentOS 7) affect
> > varnish master/child communications.
>
> It should not, I'd like to see evidence that this is happening. Please
> open a github issue on the pkg-varnish-cache project if you manage
> to reproduce it and let us know how.
>
> Dridi
>
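
For the "collect audit logs" step of the workflow Dridi describes, a quick way to see which denials concern Varnish is to filter the AVC records. The sketch below is an illustration under assumptions: it expects the usual `avc: denied { ... } ... tclass=...` record layout, matches Varnish-related records by a plain substring (`varnish` by default), and uses hypothetical function names. It does not update any policy.

```python
import re
from typing import Iterable, List, Tuple

# Matches the denied permissions and the target class of an AVC record.
AVC_DENIED = re.compile(
    r"avc:\s+denied\s+\{\s*(?P<perms>[^}]*?)\s*\}.*?tclass=(?P<tclass>\S+)"
)


def varnish_denials(
    lines: Iterable[str], needle: str = "varnish"
) -> List[Tuple[Tuple[str, ...], str]]:
    """Return (permissions, target class) for AVC denials mentioning `needle`.

    Assumption: Varnish-related records contain the substring "varnish",
    for example in the command name or the source context.
    """
    found = []
    for line in lines:
        match = AVC_DENIED.search(line)
        if match and needle in line:
            perms = tuple(match.group("perms").split())
            found.append((perms, match.group("tclass")))
    return found
```

On a CentOS host this would typically be fed the lines of `/var/log/audit/audit.log` while SELinux is in permissive mode; the resulting list shows what a policy update would need to allow before switching back to enforcing.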