Mailing List Archive

Varnish nightmare after upgrading: need help
Hello list,

First of all, despite my subject line, I really appreciate Varnish.
We use it a lot at work (hundreds of instances), mostly with success,
but unfortunately with some pain these days.

TL;DR: upgrading from Varnish 2 to Varnish 4 and then 5 on one of our
infrastructures brought us serious trouble and instability on this
platform, and we are a bit desperate/frustrated.


Long story.

A bit of context:

This is a very complex platform serving an IPTV service with significant
traffic (8k req/s at peak, even more when everything works well).
It is composed of a two-stage reverse proxy cache (3 x 2 Varnish
instances for stage 1, 2 for stage 2, so 8 in total) and a lot of
different backends (PHP applications, Node.js apps, remote backends
*sigh*, and even a piped one). It is a big historical spaghetti app; we
plan to rebuild it from scratch in 2018.
The stage 1 Varnishes are split into two pools handling different client
topologies.

A lot of the logic is in Varnish/VCL itself: lots of URL rewriting, lots
of header manipulation, backend selection, and even ESI processing...
The VCL of the stage 1 Varnishes is almost 3000 lines long.

But for now we have to live with it.

History of the problem:

In the beginning, all the Varnishes were on 2.x, and things worked
reasonably well. This summer we needed to upgrade Varnish to handle very
long headers (a product requirement).
So after a short battle porting our VCL to VCL 4.0, we started using
Varnish 4. Shortly afterwards, things began to go very badly.

The first issue we hit was memory exhaustion on both stages, and the OOM
killer...
We tested a lot of things, and along the way upgraded to Varnish 5.
We fixed it by resizing the pools and switching to the file storage
backend (from the malloc one before).
Memory is now stable (we have large pools, 32G, and, strangely, we never
see objects being nuked, which is good or bad depending on how you look
at it).
We have also fixed a lot of things in our VCL.

The problem we are fighting now is only on the stage 1 Varnishes, and
specifically in one pool (the busiest one).
When everything goes well, average CPU usage is 30%, memory stabilizes
around 12G, and the cache hit ratio is around 0.85.
The problem happens randomly (not every day) but always during our
peaks. The CPU climbs quickly to 350% (4 cores) and the load goes
above 3.
When the problem hits, Varnish still delivers requests (we don't see
dropped or rejected connections), but our application starts to lose
users, and with them a lot of business. I suspect this is because
timeouts are very aggressive on the client side and Varnish starts
answering slowly.

- First question: how can we see the response times of requests on the
Varnish server? (varnishncsa something?)

I also suspect some kind of request queuing; stracing Varnish when it
happens shows a lot of futex waits?!
The frustrating part is that restarting Varnish fixes the problem
immediately, and the CPU stays normal afterwards, even if the traffic
peak is not over.
So there is clearly something piling up inside Varnish that causes our
problem.

- Second question: how can we see the number of queued connections,
long-running connections, and so on?
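
(I have been guessing at the relevant varnishstat counters; maybe
something like this, if sess_queued and thread_queue_len are the right
ones to watch for queuing:

  varnishstat -1 -f 'MAIN.sess_queued' -f 'MAIN.thread_queue_len' -f 'MAIN.threads*'

but I am really not sure how to interpret them.)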

At this stage we will take any kind of help or hints for debugging (and
given the business impact, we can also consider professional support).

PS: I always have the option to scale out by spinning up a lot of new
Varnish instances, but that seems very frustrating...

Best,

--
Raphael Mazelier


_______________________________________________
varnish-misc mailing list
varnish-misc@varnish-cache.org
https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc
Re: Varnish nightmare after upgrading: need help [ In reply to ]
Hi,

Let's look at the usual suspects first: can we get the output of "ps aux
| grep varnish" and a pastebin of "varnishstat -1"?

Are you using any vmods?

man varnishncsa will help craft a format line with the response time
(I'm on mobile now, so I don't have access to it).

Cheers,

--
Guillaume Quintard

Re: Varnish nightmare after upgrading: need help [ In reply to ]
Hi,

Of course the evening was quite quiet and I have no suspicious output to
show (Schrödinger effect).

Anyway, here is a pastebin from the busiest period tonight:
https://pastebin.com/536LM9Nx

We use the std and directors vmods.

BTW: I found the correct format for varnishncsa (varnishncsa -F '%h %r
%s %{Varnish:handling}x %{Varnish:side}x %T %D' does the job).
Side question: why not include hit/miss in the default output?
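
(For spotting the slow requests, filtering on the %D field, which should
be microseconds if I read the man page right, seems to work, e.g.:

  varnishncsa -F '%D %s %{Varnish:handling}x %r' | awk '$1 > 1000000'

to print anything slower than one second; the threshold is just a guess
for our traffic.)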


Thanks for the help.

Best,

--
Raphael Mazelier

Re: Varnish nightmare after upgrading: need help [ In reply to ]
I think we just replicate the standard NCSA log format line, which has
no notion of hit/miss.
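
(Something close to '%h %l %u %t "%r" %s %b "%{Referer}i"
"%{User-agent}i"', if memory serves; any Varnish-specific field like
%{Varnish:handling}x has to be added by hand.)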

--
Guillaume Quintard

Re: Varnish nightmare after upgrading: need help [ In reply to ]
Hi All,

A short follow-up on the situation and on what seems to mitigate the
problem and make this platform work.
After a lot of testing and A/B testing, the solution for us was to run
more, smaller instances.
We basically doubled all the servers (VMs), but on the other hand
divided the RAM, and the memory allocated to Varnish, by two or more.
We also reverted to malloc storage with little RAM (4G), on 12G VMs, and
added a scheduled task that flushes the cache (by restarting Varnish).
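(Concretely, that means a storage flag roughly like this in our init
scripts; the listen address and VCL path here are placeholders, and the
size is just what works for us, not a recommendation:

  varnishd -a :6081 -f /etc/varnish/stage1.vcl -s malloc,4G

where we previously had an equivalent -s file,... variant.)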
This is completely counter-intuitive: having some entries nuked
apparently works better than a big cache that never nukes anything.
My understanding is that our hot content stays in the cache regardless,
so nuking objects is fine. It may also mean that the TTLs on our objects
are completely wrong.
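(If the TTLs really are wrong, capping them in VCL might eventually be a
cleaner fix than scheduled restarts. An untested sketch, with
placeholder numbers:

  sub vcl_backend_response {
      # clamp absurd backend TTLs so LRU nuking has a chance to work
      if (beresp.ttl > 1d) {
          set beresp.ttl = 4h;
      }
  }
)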

Anyway, it seems to be working. Thanks a lot to the people who helped us
(and I'm sure we can find a way to give something back).

Best,

PS: along the way we upgraded to 5.2 (seems OK); just a quick question
on varnishstat. I read the changelog and understand that the way
varnishstat communicates with varnishd has changed. I just have a little
glitch: when launching varnishstat, the hitrate counter and averages
reset randomly at the beginning, then stabilize after a while. This is a
bit annoying because we often want a quick look at the hitrate.


--
Raphael Mazelier

Re: Varnish nightmare after upgrading: epilogue? [ In reply to ]

Another follow-up, for posterity :)

I think we have finally succeeded in restoring nominal service on our
application. The main problem on the Varnish side was using the
two-stage caching pattern for non-cacheable requests. We had completely
misunderstood the hit-for-pass concept, and the result was many requests
being held on the waiting list at both stages, especially at peak. (As I
understand it now: if an uncacheable response is not marked as such,
subsequent requests for the same object keep getting serialized on the
waiting list instead of being passed to the backend in parallel.) Since
these requests cannot be cached, simply piping them at level 1 is more
than enough. To be fair, we also fixed some little things in our
application code too :)
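
(Roughly, in the level-1 VCL; the URL pattern here is an invented
placeholder, our real rules match the application's endpoints:

  sub vcl_recv {
      # user-specific, uncacheable traffic: keep it away from the cache
      # and the waiting list, hand the connection straight to the backend
      if (req.url ~ "^/private/") {
          return (pipe);
      }
  }
)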

Happy Holidays.

--
Raphael Mazelier

Re: Varnish nightmare after upgrading: epilogue? [ In reply to ]
--------
In message <29807c51-f354-a0ec-51a3-c3c16dad1b6a@futomaki.net>, Raphael Mazelier writes:
>On 23/11/2017 21:57, Raphael Mazelier wrote:

>Another follow up for posterity :)

Thanks for reporting; if more people did that, searching the email
archives for clues would be much more productive.

>We completely misunderstood the hit for pass concept

We have struggled with that one almost from day one of Varnish; any and
all ideas are welcome.

>Happy Holidays.

God Jul :-)

--
Poul-Henning Kamp | UNIX since Zilog Zeus 3.20
phk@FreeBSD.ORG | TCP/IP since RFC 956
FreeBSD committer | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.