Mailing List Archive

Varnish Lurker is getting slower / Ban lists keeps increasing
Hello,

We are running Varnish 5.1.3 revision 05c5ac6b9.

During business hours (peak traffic is around 300 req/s ), our ban list
keeps increasing to reach 100K objects (and sometimes more).
Ban traffic is around 4-5 ban/s during this period. We are using only
"smart bans".

We are pushing varnishstat output to Graphite, and we have noticed than
the lurker either stop working for a long period of time or is getting
really slow.

Here is a screenshot where you can see that the number of deleted BANs by
the lurker after 9am is decreasing and the BAN list keeps increasing (scale
is log base 10).
The blue bar shows that the lurker didn't delete any ban during 12:50 and
16:30

https://pasteboard.co/GHUPY6t.png

After business hours, everything returns to normal.

I've already tried to change (increase/decrease) both "ban_lurker_batch"
and "ban_lurker_sleep" parameters. Same things.

I don't know how I can get more lurker logs to debug this.
Is there a way to make it more aggressive ?

Any ideas ?

Regards

Olivier
Re: Varnish Lurker is getting slower / Ban lists keeps increasing [ In reply to ]
> I've already tried to change (increase/decrease) both "ban_lurker_batch" and
> "ban_lurker_sleep" parameters. Same things.
>
> I don't know how I can get more lurker logs to debug this.
> Is there a way to make it more aggressive ?
>
> Any ideas ?

Did you look at the ban_cutoff parameter?

I think it landed in the 5.1 series.

Dridi

_______________________________________________
varnish-misc mailing list
varnish-misc@varnish-cache.org
https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc
Re: Varnish Lurker is getting slower / Ban lists keeps increasing [ In reply to ]
Hello Dridi,

Thanks for your reply.
I must admit I didn't understand perfectly the description of this option.

What will happen when the ban list hits the size of bans defined in
ban_cutoff value ?

Will the ban list be truncated and all the previous (oldest ?) bans will be
simply deleted without purging the cache ?
Or will varnish suspend other regular requests (increasing the lurker
priority thread , something like that ?) until the ban list size is under
this value ?

Regards

Olivier


2017-08-29 19:04 GMT+02:00 Dridi Boukelmoune <dridi@varni.sh>:

> > I've already tried to change (increase/decrease) both "ban_lurker_batch"
> and
> > "ban_lurker_sleep" parameters. Same things.
> >
> > I don't know how I can get more lurker logs to debug this.
> > Is there a way to make it more aggressive ?
> >
> > Any ideas ?
>
> Did you look at the ban_cutoff parameter?
>
> I think it landed in the 5.1 series.
>
> Dridi
>
Re: Varnish Lurker is getting slower / Ban lists keeps increasing [ In reply to ]
Hello,

Last night after your reply, I put a ban_cutoff value of 18500 according to
the definition (50ms of latency, 370K/s ban.lurker.tested) (I've restarted
varnish, "varnishadm param.show ban_cutoff" shows the right value)

This morning, nothing has changed : ban lists is increasing (well over
18500).

https://pasteboard.co/GI1K682.png

Regards

Olivier





2017-08-30 8:47 GMT+02:00 Olivier Hanesse <olivier.hanesse@gmail.com>:

> Hello Dridi,
>
> Thanks for your reply.
> I must admit I didn't understand perfectly the description of this option.
>
> What will happen when the ban list hits the size of bans defined in
> ban_cutoff value ?
>
> Will the ban list be truncated and all the previous (oldest ?) bans will
> be simply deleted without purging the cache ?
> Or will varnish suspend other regular requests (increasing the lurker
> priority thread , something like that ?) until the ban list size is under
> this value ?
>
> Regards
>
> Olivier
>
>
> 2017-08-29 19:04 GMT+02:00 Dridi Boukelmoune <dridi@varni.sh>:
>
>> > I've already tried to change (increase/decrease) both
>> "ban_lurker_batch" and
>> > "ban_lurker_sleep" parameters. Same things.
>> >
>> > I don't know how I can get more lurker logs to debug this.
>> > Is there a way to make it more aggressive ?
>> >
>> > Any ideas ?
>>
>> Did you look at the ban_cutoff parameter?
>>
>> I think it landed in the 5.1 series.
>>
>> Dridi
>>
>
>
Re: Varnish Lurker is getting slower / Ban lists keeps increasing [ In reply to ]
Hi Olivier,

I'm responding to the last two emails from you in one go

On 30/08/17 08:47, Olivier Hanesse wrote:
> What will happen when the ban list hits the size of bans defined in ban_cutoff
> value ?

The ban lurker still works the list of bans as before, but when having reached
the <ban_cutoff>th ban, we kill all objects hanging off these bans without
testing the ban condition.

This way, actively used objects (which get tested against the ban list at
request time and will end up hanging off some ban near the top of the ban list)
will not get killed, but rather only those which were least frequently accessed
(iow the long tail).

On 30/08/17 11:44, Olivier Hanesse wrote:
> Last night after your reply, I put a ban_cutoff value of 18500 according to
> the definition (50ms of latency, 370K/s ban.lurker.tested) (I've restarted
> varnish, "varnishadm param.show ban_cutoff" shows the right value)
>
> This morning, nothing has changed : ban lists is increasing (well over
> 18500).

One obvious explanation would be that the lurker had not got to the cutoff value.

But I wonder what exactly you are measuring here. In your first email you wrote

On 29/08/17 18:19, Olivier Hanesse wrote:
> our ban list keeps increasing to reach 100K objects (and sometimes more).

This makes me guess that maybe you'd be graphing the number of objects hanging
off the bans. Quick reminder:

* the second column in the varnishadm ban.list output is the number of
objects associated with this ban (objects, for which this ban has
last been tested)

* what the ban_cutoff parameter is limiting is the number of bans
(that would be varnishadm ban.list | wc -l minus 2)

So can you please double check that you are graphing the latter and not the
former for ban.list?

If you'd actually be graphing the former, then we don't have a problem as this
will just be the total number of objects in your cache.

Thanks, Nils


_______________________________________________
varnish-misc mailing list
varnish-misc@varnish-cache.org
https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc
Re: Varnish Lurker is getting slower / Ban lists keeps increasing [ In reply to ]
Hello Nils,

Thanks for your reply. Now I understand better the definition of this
parameter (cache efficiency vs response time).

I've checked and I am really graphing the size of the ban list (using the
value "MAIN.bans" from varnishstat ). This value is the same that the
"varnishadm
ban.list | wc -l" output.

I've just see that there is a "new" value in varnishstat
"MAIN.bans_lurker_obj_killed_cutoff". I will add this value to my
monitoring plugin.

Olivier

2017-08-30 14:37 GMT+02:00 Nils Goroll <slink@schokola.de>:

> Hi Olivier,
>
> I'm responding to the last two emails from you in one go
>
> On 30/08/17 08:47, Olivier Hanesse wrote:
> > What will happen when the ban list hits the size of bans defined in
> ban_cutoff
> > value ?
>
> The ban lurker still works the list of bans as before, but when having
> reached
> the <ban_cutoff>th ban, we kill all objects hanging off these bans without
> testing the ban condition.
>
> This way, actively used objects (which get tested against the ban list at
> request time and will end up hanging off some ban near the top of the ban
> list)
> will not get killed, but rather only those which were least frequently
> accessed
> (iow the long tail).
>
> On 30/08/17 11:44, Olivier Hanesse wrote:
> > Last night after your reply, I put a ban_cutoff value of 18500 according
> to
> > the definition (50ms of latency, 370K/s ban.lurker.tested) (I've
> restarted
> > varnish, "varnishadm param.show ban_cutoff" shows the right value)
> >
> > This morning, nothing has changed : ban lists is increasing (well over
> > 18500).
>
> One obvious explanation would be that the lurker had not got to the cutoff
> value.
>
> But I wonder what exactly you are measuring here. In your first email you
> wrote
>
> On 29/08/17 18:19, Olivier Hanesse wrote:
> > our ban list keeps increasing to reach 100K objects (and sometimes more).
>
> This makes me guess that maybe you'd be graphing the number of objects
> hanging
> off the bans. Quick reminder:
>
> * the second column in the varnishadm ban.list output is the number of
> objects associated with this ban (objects, for which this ban has
> last been tested)
>
> * what the ban_cutoff parameter is limiting is the number of bans
> (that would be varnishadm ban.list | wc -l minus 2)
>
> So can you please double check that you are graphing the latter and not the
> former for ban.list?
>
> If you'd actually be graphing the former, then we don't have a problem as
> this
> will just be the total number of objects in your cache.
>
> Thanks, Nils
>
>
Re: Varnish Lurker is getting slower / Ban lists keeps increasing [ In reply to ]
On 30/08/17 11:44, Olivier Hanesse wrote:
> 50ms of latency, 370K/s ban.lurker.tested

BTW, the value to be used for the calculation is rate(bans_lurker_tests_tested)
not rate(bans_lurker_tested)

I've also just improved the documentation of ban_cutoff in master.

_______________________________________________
varnish-misc mailing list
varnish-misc@varnish-cache.org
https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc
Re: Varnish Lurker is getting slower / Ban lists keeps increasing [ In reply to ]
Ok, so I could upgrade the cutoff parameter to 70K (bans_lurker_tests_tested
is around 1.4M)

I will keep it at 18500 to debug this ban list behaviour.

2017-08-30 15:14 GMT+02:00 Nils Goroll <slink@schokola.de>:

>
> On 30/08/17 11:44, Olivier Hanesse wrote:
> > 50ms of latency, 370K/s ban.lurker.tested
>
> BTW, the value to be used for the calculation is
> rate(bans_lurker_tests_tested)
> not rate(bans_lurker_tested)
>
> I've also just improved the documentation of ban_cutoff in master.
>
Re: Varnish Lurker is getting slower / Ban lists keeps increasing [ In reply to ]
Hello,

I got some news about this issue :

The "MAIN.bans_lurker_obj_killed_cutoff" value has increased from 0 to
"322451" in one update ! (i am polling this value every minute). The ban
list size was around 100k when this has happened.
And I can see at the same time, a drop of the "n_object" value (this is
normal and expected).

Are the stats used by varnishstat about the lurker "well" updated "every
minute" ? The fact that the statistics was only updated once is kinda
strange : the ban list size is higher than the cutoff value everyday :(

Regards

Olivier


2017-08-30 15:41 GMT+02:00 Olivier Hanesse <olivier.hanesse@gmail.com>:

> Ok, so I could upgrade the cutoff parameter to 70K (bans_lurker_tests_tested
> is around 1.4M)
>
> I will keep it at 18500 to debug this ban list behaviour.
>
> 2017-08-30 15:14 GMT+02:00 Nils Goroll <slink@schokola.de>:
>
>>
>> On 30/08/17 11:44, Olivier Hanesse wrote:
>> > 50ms of latency, 370K/s ban.lurker.tested
>>
>> BTW, the value to be used for the calculation is
>> rate(bans_lurker_tests_tested)
>> not rate(bans_lurker_tested)
>>
>> I've also just improved the documentation of ban_cutoff in master.
>>
>
>
Re: Varnish Lurker is getting slower / Ban lists keeps increasing [ In reply to ]
On Mon, Sep 4, 2017 at 2:12 PM, Olivier Hanesse
<olivier.hanesse@gmail.com> wrote:
> Are the stats used by varnishstat about the lurker "well" updated "every minute" ? The fact that the statistics was only updated once is kinda strange : the ban list size is higher than the cutoff value everyday :(

No, that's a limitation of the statistics, serving HTTP traffic has
higher priority than committing updates of the counters.

See this for reference:

https://github.com/varnishcache/varnish-cache/pull/2290

Dridi

_______________________________________________
varnish-misc mailing list
varnish-misc@varnish-cache.org
https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc
Re: Varnish Lurker is getting slower / Ban lists keeps increasing [ In reply to ]
In this case, that means that as long as the ban lurker is working, no
statistics are updated right ?

So if I don't see any updates of statistics such as "bans_deleted", or
"bans_lurker_obj_killed_cutoff" during a long period, it doesn't mean that
the lurker is sleeping, hanged or waiting for a lock, it means that the
lurker worker is working pretty "hard", is that correct ?


2017-09-04 16:11 GMT+02:00 Dridi Boukelmoune <dridi@varni.sh>:

> On Mon, Sep 4, 2017 at 2:12 PM, Olivier Hanesse
> <olivier.hanesse@gmail.com> wrote:
> > Are the stats used by varnishstat about the lurker "well" updated "every
> minute" ? The fact that the statistics was only updated once is kinda
> strange : the ban list size is higher than the cutoff value everyday :(
>
> No, that's a limitation of the statistics, serving HTTP traffic has
> higher priority than committing updates of the counters.
>
> See this for reference:
>
> https://github.com/varnishcache/varnish-cache/pull/2290
>
> Dridi
>
Re: Varnish Lurker is getting slower / Ban lists keeps increasing [ In reply to ]
Hello,

I am still on this issue. Today, I disable the loadbalancing to a specific
varnish server whose ban list was around 500K (n_object is around 600K)

After 4 hours without any requests other than "BAN", the ban list is still
increasing, and I got a system load around 1.5.

A "top" with thread show that the ban_lurker is eating 100% of 1 CPU (8 cpu
computer)

top - 14:47:53 up 91 days, 22:36, 2 users, load average: 1.57, 1.45, 1.49
Threads: 702 total, 2 running, 700 sleeping, 0 stopped, 0 zombie
%Cpu(s): 11.6 us, 0.2 sy, 0.0 ni, 88.1 id, 0.0 wa, 0.0 hi, 0.1 si,
0.0 st
KiB Mem: 8196832 total, 5532436 used, 2664396 free, 98028 buffers
KiB Swap: 499708 total, 0 used, 499708 free. 1077884 cached Mem

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND



4606 vcache 20 0 4212992 3.553g 85300 R 99.9 45.5 6991:50
ban-lurker





Is is possible that the ban_lurker is locked or in an infinite loop (I know
it is single threaded) ? What kind of dump can I provide to help understand
this issue ?

Regards

Olivier

2017-09-04 16:54 GMT+02:00 Olivier Hanesse <olivier.hanesse@gmail.com>:

> In this case, that means that as long as the ban lurker is working, no
> statistics are updated right ?
>
> So if I don't see any updates of statistics such as "bans_deleted", or
> "bans_lurker_obj_killed_cutoff" during a long period, it doesn't mean
> that the lurker is sleeping, hanged or waiting for a lock, it means that
> the lurker worker is working pretty "hard", is that correct ?
>
>
> 2017-09-04 16:11 GMT+02:00 Dridi Boukelmoune <dridi@varni.sh>:
>
>> On Mon, Sep 4, 2017 at 2:12 PM, Olivier Hanesse
>> <olivier.hanesse@gmail.com> wrote:
>> > Are the stats used by varnishstat about the lurker "well" updated
>> "every minute" ? The fact that the statistics was only updated once is
>> kinda strange : the ban list size is higher than the cutoff value everyday
>> :(
>>
>> No, that's a limitation of the statistics, serving HTTP traffic has
>> higher priority than committing updates of the counters.
>>
>> See this for reference:
>>
>> https://github.com/varnishcache/varnish-cache/pull/2290
>>
>> Dridi
>>
>
>
Re: Varnish Lurker is getting slower / Ban lists keeps increasing [ In reply to ]
Hi Olivier

On 29/11/17 14:53, Olivier Hanesse wrote:
> Is is possible that the ban_lurker is locked or in an infinite loop (I know it
> is single threaded) ? What kind of dump can I provide to help understand this
> issue ? 

Basically we'd need the output of

varnishadm ban.list

optionally a second run with

varnishadm param.set debug +lurker


Trouble here is that, by default, the CLI interface will truncate the output
after 48k which is way too soon for any ban list of an order of magnitude as you
are seeing it.

The highest value currently accepted for the cli limit is just under 100MB,
which can be set using

varnishadm param.set cli_limit 99999999b

As a precaution, I would advise to raise the ping timeout with such huge CLI
responses:

varnishadm param.set cli_timeout 3600

This should be undone back to the default or whatever you had before after
gathering the data.

Feel free to send me the results via private email, bzip2 or xz compression
would probably be a good idea.

Nils
_______________________________________________
varnish-misc mailing list
varnish-misc@varnish-cache.org
https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc
Re: Varnish Lurker is getting slower / Ban lists keeps increasing [ In reply to ]
On 29/11/17 14:53, Olivier Hanesse wrote:
> After 4 hours without any requests other than "BAN"

side note: As explained before and as documented since
5d713b2d8a796c6d21c2066d8ef1e0beab690dd4 / Wed Aug 30 15:18:13 2017 +0200, even
with ban_cutoff, we need to work the ban list and evaluate bans until we reach
the cutoff. Any once we reached that, all we ware saving is the evaluation of
the bans, we still need to do the work of nuking objects.

So my guess would be that the bans you are seeing are so expensive to evaluate
that your ban lurker has not even reached the cutoff.

While I certainly do want to understand if the issue you are reporting is due to
a genuine bug or inefficiency, I would recommend to get the issued bans under
control in the first place.

Also, taking traffic from a server with a busy ban lurker will most likely not
help, but rather lead to a longer total time to work all bans: Cache access for
requests tests bans concurrently to the ban lurker.

Nils
_______________________________________________
varnish-misc mailing list
varnish-misc@varnish-cache.org
https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc
Re: Varnish Lurker is getting slower / Ban lists keeps increasing [ In reply to ]
FTR: I'll continue working on this with Olivier 1:1 and will report the outcome
here.
_______________________________________________
varnish-misc mailing list
varnish-misc@varnish-cache.org
https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc
Re: Varnish Lurker is getting slower / Ban lists keeps increasing [ In reply to ]
On 30/11/17 20:40, Nils Goroll wrote:
> FTR: I'll continue working on this with Olivier 1:1 and will report the outcome
> here.

Summary:

Olivier's varnish instance's ban lurker is evaluating ~1M bans/s, which is a
normal rate. The instance at hand was not configured to use the cutoff. The ban
regular expressions can be optimized, but that will most likely not avoid the
problem. Other options I proposed:

- reduce the number of bans
- reduce the number of objects
- use the cutoff
- sponsor more research and possibly development to further improve ban lurker
performance

Nils
_______________________________________________
varnish-misc mailing list
varnish-misc@varnish-cache.org
https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc
Re: Varnish Lurker is getting slower / Ban lists keeps increasing [ In reply to ]
Thanks a lot for you analysis Nils.


2017-12-01 12:40 GMT+01:00 Nils Goroll <slink@schokola.de>:

> On 30/11/17 20:40, Nils Goroll wrote:
> > FTR: I'll continue working on this with Olivier 1:1 and will report the
> outcome
> > here.
>
> Summary:
>
> Olivier's varnish instance's ban lurker is evaluating ~1M bans/s, which is
> a
> normal rate. The instance at hand was not configured to use the cutoff.
> The ban
> regular expressions can be optimized, but that will most likely not avoid
> the
> problem. Other options I proposed:
>
> - reduce the number of bans
> - reduce the number of objects
> - use the cutoff
> - sponsor more research and possibly development to further improve ban
> lurker
> performance
>
> Nils
>