Hi Guillaume and Reza.
This time Varnish restarted, but it left some more info in syslog.
It seems like the system is running out of memory.
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.297487] pool_herder invoked oom-killer: gfp_mask=0x2000d0, order=2, oom_score_adj=0
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.300992] pool_herder cpuset=/ mems_allowed=0
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.303157] CPU: 1 PID: 16214 Comm: pool_herder Tainted: G C O 3.16.0-4-amd64 #1 Debian 3.16.36-1+deb8u2
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] Hardware name: Xen HVM domU, BIOS 4.2.amazon 02/16/2017
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] 0000000000000000 ffffffff815123b5 ffff8800eb3652f0 0000000000000000
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] ffffffff8150ff8d 0000000000000000 ffffffff810d6e3f 0000000000000000
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] ffffffff81516d2e 0000000000000200 ffffffff810689d3 ffffffff810c43e4
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] Call Trace:
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [<ffffffff815123b5>] ? dump_stack+0x5d/0x78
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [<ffffffff8150ff8d>] ? dump_header+0x76/0x1e8
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [<ffffffff810d6e3f>] ? smp_call_function_single+0x5f/0xa0
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [<ffffffff81516d2e>] ? mutex_lock+0xe/0x2a
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [<ffffffff810689d3>] ? put_online_cpus+0x23/0x80
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [<ffffffff810c43e4>] ? rcu_oom_notify+0xc4/0xe0
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [<ffffffff81153d1c>] ? do_try_to_free_pages+0x4ac/0x520
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [<ffffffff811427dd>] ? oom_kill_process+0x21d/0x370
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [<ffffffff8114239d>] ? find_lock_task_mm+0x3d/0x90
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [<ffffffff81142f43>] ? out_of_memory+0x473/0x4b0
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [<ffffffff81148e0f>] ? __alloc_pages_nodemask+0x9ef/0xb50
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [<ffffffff81065c86>] ? copy_process.part.25+0x116/0x1c50
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [<ffffffff81058301>] ? __do_page_fault+0x1d1/0x4f0
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [<ffffffff81067990>] ? do_fork+0xe0/0x3d0
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [<ffffffff815188f9>] ? stub_clone+0x69/0x90
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.304984] [<ffffffff8151858d>] ? system_call_fast_compare_end+0x10/0x15
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.367638] Mem-Info:
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.368962] Node 0 DMA per-cpu:
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.370768] CPU 0: hi: 0, btch: 1 usd: 0
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.373249] CPU 1: hi: 0, btch: 1 usd: 0
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.375652] Node 0 DMA32 per-cpu:
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.377508] CPU 0: hi: 186, btch: 31 usd: 29
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.379898] CPU 1: hi: 186, btch: 31 usd: 0
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.382318] active_anon:846474 inactive_anon:1913 isolated_anon:0
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.382318] active_file:408 inactive_file:415 isolated_file:32
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.382318] unevictable:20736 dirty:27 writeback:0 unstable:0
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.382318] free:16797 slab_reclaimable:15276 slab_unreclaimable:10521
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.382318] mapped:22002 shmem:22935 pagetables:30362 bounce:0
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.382318] free_cma:0
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.397242] Node 0 DMA free:15192kB min:184kB low:228kB high:276kB active_anon:416kB inactive_anon:60kB active_file:0kB inactive_file:0kB unevictable:20kB isolated(anon):0kB isolated(file):0kB present:15988kB managed:15904kB mlocked:20kB dirty:0kB writeback:0kB mapped:20kB shmem:80kB slab_reclaimable:32kB slab_unreclaimable:0kB kernel_stack:112kB pagetables:20kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.416338] lowmem_reserve[]: 0 3757 3757 3757
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.419030] Node 0 DMA32 free:50120kB min:44868kB low:56084kB high:67300kB active_anon:3386780kB inactive_anon:7592kB active_file:1732kB inactive_file:2060kB unevictable:82924kB isolated(anon):0kB isolated(file):128kB present:3915776kB managed:3849676kB mlocked:82924kB dirty:108kB writeback:0kB mapped:88432kB shmem:91660kB slab_reclaimable:61072kB slab_unreclaimable:42184kB kernel_stack:27248kB pagetables:121428kB unstable:0kB bounce:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.440095] lowmem_reserve[]: 0 0 0 0
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.442202] Node 0 DMA: 22*4kB (UEM) 6*8kB (EM) 1*16kB (E) 2*32kB (UM) 2*64kB (UE) 2*128kB (EM) 3*256kB (UEM) 1*512kB (E) 3*1024kB (UEM) 3*2048kB (EMR) 1*4096kB (M) = 15192kB
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.451936] Node 0 DMA32: 4031*4kB (EM) 2729*8kB (EM) 324*16kB (EM) 1*32kB (R) 1*64kB (R) 0*128kB 0*256kB 1*512kB (R) 1*1024kB (R) 1*2048kB (R) 0*4096kB = 46820kB
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.460240] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.464122] 24240 total pagecache pages
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.466048] 0 pages in swap cache
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.467672] Swap cache stats: add 0, delete 0, find 0/0
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.470159] Free swap = 0kB
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.471513] Total swap = 0kB
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.472980] 982941 pages RAM
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.474380] 0 pages HighMem/MovableOnly
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.476190] 16525 pages reserved
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.477772] 0 pages hwpoisoned
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.479189] [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.482698] [ 163] 0 163 10419 1295 21 0 0 systemd-journal
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.486646] [ 165] 0 165 10202 136 21 0 -1000 systemd-udevd
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.490598] [ 294] 0 294 6351 1729 14 0 0 dhclient
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.494457] [ 319] 0 319 6869 62 18 0 0 cron
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.498260] [ 321] 0 321 4964 67 14 0 0 systemd-logind
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.502346] [ 326] 105 326 10558 101 25 0 -900 dbus-daemon
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.506315] [ 342] 0 342 65721 228 31 0 0 rsyslogd
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.510222] [ 343] 0 343 88199 2108 61 0 -500 dockerd
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.514022] [ 350] 106 350 18280 181 36 0 0 zabbix_agentd
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.518040] [ 351] 106 351 18280 475 36 0 0 zabbix_agentd
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.522041] [ 352] 106 352 18280 187 36 0 0 zabbix_agentd
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.526025] [ 353] 106 353 18280 187 36 0 0 zabbix_agentd
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.530067] [ 354] 106 354 18280 187 36 0 0 zabbix_agentd
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.534033] [ 355] 106 355 18280 190 36 0 0 zabbix_agentd
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.538001] [ 358] 0 358 66390 1826 32 0 0 fail2ban-server
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.541972] [ 400] 0 400 35984 444 24 0 -500 docker-containe
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.545879] [ 568] 0 568 13796 168 30 0 -1000 sshd
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.549733] [ 576] 0 576 3604 41 12 0 0 agetty
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.553569] [ 577] 0 577 3559 38 12 0 0 agetty
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.557322] [16201] 0 16201 29695 20707 60 0 0 varnishd
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.561103] [16209] 108 16209 118909802 822425 29398 0 0 cache-main
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.565002] [27352] 0 27352 20131 214 42 0 0 sshd
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.568682] [27354] 1000 27354 20165 211 41 0 0 sshd
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.572307] [27355] 1000 27355 5487 146 17 0 0 bash
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.575920] [27360] 0 27360 11211 107 26 0 0 sudo
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.579593] [27361] 0 27361 11584 97 27 0 0 su
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.583155] [27362] 0 27362 5481 142 15 0 0 bash
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.586782] [27749] 0 27749 20131 214 41 0 0 sshd
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.590428] [27751] 1000 27751 20164 211 39 0 0 sshd
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.593979] [27752] 1000 27752 5487 147 15 0 0 bash
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.597488] [28762] 0 28762 26528 132 17 0 0 varnishstat
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.601239] [28764] 0 28764 11211 106 26 0 0 sudo
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.604737] [28765] 0 28765 11584 97 26 0 0 su
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.608602] [28766] 0 28766 5481 141 15 0 0 bash
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.612288] [28768] 0 28768 26528 220 18 0 0 varnishstat
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.616189] Out of memory: Kill process 16209 (cache-main) score 880 or sacrifice child
Jun 29 13:11:01 ip-172-25-2-8 kernel: [93823.620106] Killed process 16209 (cache-main) total-vm:475639208kB, anon-rss:3289700kB, file-rss:0kB
Jun 29 13:11:01 ip-172-25-2-8 varnishd[16201]: Child (16209) died signal=9
Jun 29 13:11:01 ip-172-25-2-8 varnishd[16201]: Child cleanup complete
Jun 29 13:11:01 ip-172-25-2-8 varnishd[16201]: Child (30313) Started
Jun 29 13:11:01 ip-172-25-2-8 varnishd[16201]: Child (30313) said Child starts
Jun 29 13:11:01 ip-172-25-2-8 varnishd[16201]: Child (30313) said SMF.s0 mmap'ed 483183820800 bytes of 483183820800
Best,
Stefano
On Wed, Jun 28, 2017 at 11:33 AM, Reza Naghibi <reza@varnish-software.com>
wrote:
> Assuming the problem is running out of memory, you will need to do some
> memory tuning, especially given the number of threads you are using and
> your access patterns. Your options (a combined sketch follows the list):
>
> - Add more memory to the system
> - Reduce thread_pool_max
> - Reduce jemalloc's thread cache (MALLOC_CONF="lg_tcache_max:10")
> - Use some of the tuning params in here:
>   https://info.varnish-software.com/blog/understanding-varnish-cache-memory-usage
>
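> A minimal combined sketch of those knobs (the paths and numbers below are
> placeholders, not recommendations):
>
>     # shrink jemalloc's per-thread cache (2^10 bytes) before starting varnishd
>     export MALLOC_CONF="lg_tcache_max:10"
>
>     varnishd -a :80 -f /etc/varnish/default.vcl \
>         -s file,/var/lib/varnish/storage.bin,450G \
>         -p thread_pool_max=1000    # default is 5000; fewer threads, less stack and tcache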
>
> --
> Reza Naghibi
> Varnish Software
>
> On Wed, Jun 28, 2017 at 9:26 AM, Guillaume Quintard <
> guillaume@varnish-software.com> wrote:
>
>> Hi,
>>
>> can you run "varnishstat -1 | grep g_bytes" and see if it matches
>> the memory you are seeing?
>>
>> --
>> Guillaume Quintard
>>
>> On Wed, Jun 28, 2017 at 3:20 PM, Stefano Baldo <stefanobaldo@gmail.com>
>> wrote:
>>
>>> Hi Guillaume.
>>>
>>> I increased the cli_timeout yesterday to 900sec (15min) and it restarted
>>> anyway, which seems to indicate that the thread is really stalled.
>>>
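>>> For reference, I applied it at runtime like this (it could also be set
>>> with -p on the varnishd command line):
>>>
>>>     varnishadm param.set cli_timeout 900
>>>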
>>> This was 1 minute after the last restart:
>>>
>>> MAIN.n_object 3908216 . object structs made
>>> SMF.s0.g_alloc 7794510 . Allocations outstanding
>>>
>>> I've just changed the I/O Scheduler to noop to see what happens.
>>>
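>>> That change was essentially the following (xvda is the device on this
>>> instance; yours may differ):
>>>
>>>     cat /sys/block/xvda/queue/scheduler      # shows the available schedulers
>>>     echo noop > /sys/block/xvda/queue/scheduler
>>>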
>>> One interesting thing I've found concerns memory usage.
>>>
>>> In the 1st minute of use:
>>> MemTotal: 3865572 kB
>>> MemFree: 120768 kB
>>> MemAvailable: 2300268 kB
>>>
>>> 1 minute before a restart:
>>> MemTotal: 3865572 kB
>>> MemFree: 82480 kB
>>> MemAvailable: 68316 kB
>>>
>>> It seems like the system is possibly running out of memory.
>>>
>>> When calling varnishd, I'm specifying only "-s file,..." as storage. I
>>> see in some examples that it is common to use "-s file" AND "-s malloc"
>>> together. Should I be passing "-s malloc" as well to somehow try to limit
>>> the memory usage by varnishd?
>>>
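>>> In other words, something like this (the malloc size here is just a guess):
>>>
>>>     varnishd -f /etc/varnish/default.vcl \
>>>         -s malloc,512m \
>>>         -s file,/var/lib/varnish/storage.bin,450G
>>>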
>>> Best,
>>> Stefano
>>>
>>>
>>> On Wed, Jun 28, 2017 at 4:12 AM, Guillaume Quintard <
>>> guillaume@varnish-software.com> wrote:
>>>
>>>> Sadly, nothing suspicious here, you can still try:
>>>> - bumping the cli_timeout
>>>> - changing your disk scheduler
>>>> - changing the advice option of the file storage
>>>>
>>>> I'm still convinced this is due to Varnish getting stuck waiting for
>>>> the disk because of the file storage fragmentation.
>>>>
>>>> Maybe you could look at SMF.*.g_alloc and compare it to the number of
>>>> objects. Ideally, we would have a 1:1 relation between objects and
>>>> allocations. If that number drops prior to a restart, that would be a good
>>>> clue.
>>>>
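>>>> A quick sketch of a way to watch that ratio:
>>>>
>>>>     varnishstat -1 | egrep 'MAIN.n_object|g_alloc'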
>>>>
>>>> --
>>>> Guillaume Quintard
>>>>
>>>> On Tue, Jun 27, 2017 at 11:07 PM, Stefano Baldo <stefanobaldo@gmail.com
>>>> > wrote:
>>>>
>>>>> Hi Guillaume.
>>>>>
>>>>> It keeps restarting.
>>>>> Would you mind taking a quick look in the following VCL file to check
>>>>> if you find anything suspicious?
>>>>>
>>>>> Thank you very much.
>>>>>
>>>>> Best,
>>>>> Stefano
>>>>>
>>>>> vcl 4.0;
>>>>>
>>>>> import std;
>>>>>
>>>>> backend default {
>>>>>     .host = "sites-web-server-lb";
>>>>>     .port = "80";
>>>>> }
>>>>>
>>>>> include "/etc/varnish/bad_bot_detection.vcl";
>>>>>
>>>>> sub vcl_recv {
>>>>>     call bad_bot_detection;
>>>>>
>>>>>     if (req.url == "/nocache" || req.url == "/version") {
>>>>>         return(pass);
>>>>>     }
>>>>>
>>>>>     unset req.http.Cookie;
>>>>>     if (req.method == "PURGE") {
>>>>>         ban("obj.http.x-host == " + req.http.host + " && obj.http.x-user-agent !~ Googlebot");
>>>>>         return(synth(750));
>>>>>     }
>>>>>
>>>>>     set req.url = regsuball(req.url, "(?<!(http:|https))\/+", "/");
>>>>> }
>>>>>
>>>>> sub vcl_synth {
>>>>>     if (resp.status == 750) {
>>>>>         set resp.status = 200;
>>>>>         synthetic("PURGED => " + req.url);
>>>>>         return(deliver);
>>>>>     } elsif (resp.status == 501) {
>>>>>         set resp.status = 200;
>>>>>         set resp.http.Content-Type = "text/html; charset=utf-8";
>>>>>         synthetic(std.fileread("/etc/varnish/pages/invalid_domain.html"));
>>>>>         return(deliver);
>>>>>     }
>>>>> }
>>>>>
>>>>> sub vcl_backend_response {
>>>>>     unset beresp.http.Set-Cookie;
>>>>>     set beresp.http.x-host = bereq.http.host;
>>>>>     set beresp.http.x-user-agent = bereq.http.user-agent;
>>>>>
>>>>>     if (bereq.url == "/themes/basic/assets/theme.min.css"
>>>>>         || bereq.url == "/api/events/PAGEVIEW"
>>>>>         || bereq.url ~ "^\/assets\/img\/") {
>>>>>         set beresp.http.Cache-Control = "max-age=0";
>>>>>     } else {
>>>>>         unset beresp.http.Cache-Control;
>>>>>     }
>>>>>
>>>>>     if (beresp.status == 200 ||
>>>>>         beresp.status == 301 ||
>>>>>         beresp.status == 302 ||
>>>>>         beresp.status == 404) {
>>>>>         if (bereq.url ~ "\&ordenar=aleatorio$") {
>>>>>             set beresp.http.X-TTL = "1d";
>>>>>             set beresp.ttl = 1d;
>>>>>         } else {
>>>>>             set beresp.http.X-TTL = "1w";
>>>>>             set beresp.ttl = 1w;
>>>>>         }
>>>>>     }
>>>>>
>>>>>     if (bereq.url !~ "\.(jpeg|jpg|png|gif|gz|tgz|bz2|tbz|mp3|ogg|swf|flv)$") {
>>>>>         set beresp.do_gzip = true;
>>>>>     }
>>>>> }
>>>>>
>>>>> sub vcl_pipe {
>>>>>     set bereq.http.connection = "close";
>>>>>     return (pipe);
>>>>> }
>>>>>
>>>>> sub vcl_deliver {
>>>>>     unset resp.http.x-host;
>>>>>     unset resp.http.x-user-agent;
>>>>> }
>>>>>
>>>>> sub vcl_backend_error {
>>>>>     if (beresp.status == 502 || beresp.status == 503 || beresp.status == 504) {
>>>>>         set beresp.status = 200;
>>>>>         set beresp.http.Content-Type = "text/html; charset=utf-8";
>>>>>         synthetic(std.fileread("/etc/varnish/pages/maintenance.html"));
>>>>>         return (deliver);
>>>>>     }
>>>>> }
>>>>>
>>>>> sub vcl_hash {
>>>>>     if (req.http.User-Agent ~ "Google Page Speed") {
>>>>>         hash_data("Google Page Speed");
>>>>>     } elsif (req.http.User-Agent ~ "Googlebot") {
>>>>>         hash_data("Googlebot");
>>>>>     }
>>>>> }
>>>>>
>>>>> sub vcl_deliver {
>>>>>     if (resp.status == 501) {
>>>>>         return (synth(resp.status));
>>>>>     }
>>>>>     if (obj.hits > 0) {
>>>>>         set resp.http.X-Cache = "hit";
>>>>>     } else {
>>>>>         set resp.http.X-Cache = "miss";
>>>>>     }
>>>>> }
>>>>>
>>>>>
>>>>> On Mon, Jun 26, 2017 at 3:47 PM, Guillaume Quintard <
>>>>> guillaume@varnish-software.com> wrote:
>>>>>
>>>>>> Nice! It may have been the cause; time will tell. Can you report back
>>>>>> in a few days to let us know?
>>>>>> --
>>>>>> Guillaume Quintard
>>>>>>
>>>>>> On Jun 26, 2017 20:21, "Stefano Baldo" <stefanobaldo@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Guillaume.
>>>>>>>
>>>>>>> I think things will start going better now after changing the bans.
>>>>>>> This is how my last varnishstat looked moments before a crash,
>>>>>>> regarding the bans:
>>>>>>>
>>>>>>> MAIN.bans 41336 . Count of bans
>>>>>>> MAIN.bans_completed 37967 . Number of bans marked 'completed'
>>>>>>> MAIN.bans_obj 0 . Number of bans using obj.*
>>>>>>> MAIN.bans_req 41335 . Number of bans using req.*
>>>>>>> MAIN.bans_added 41336 0.68 Bans added
>>>>>>> MAIN.bans_deleted 0 0.00 Bans deleted
>>>>>>>
>>>>>>> And this is how it looks now:
>>>>>>>
>>>>>>> MAIN.bans 2 . Count of bans
>>>>>>> MAIN.bans_completed 1 . Number of bans marked 'completed'
>>>>>>> MAIN.bans_obj 2 . Number of bans using obj.*
>>>>>>> MAIN.bans_req 0 . Number of bans using req.*
>>>>>>> MAIN.bans_added 2016 0.69 Bans added
>>>>>>> MAIN.bans_deleted 2014 0.69 Bans deleted
>>>>>>>
>>>>>>> Before the changes, bans were never deleted!
>>>>>>> Now bans are added and deleted again within a minute or even a
>>>>>>> couple of seconds.
>>>>>>>
>>>>>>> Maybe this was the cause of the problem? It seems like varnish was
>>>>>>> left with a large number of bans to manage and test against.
>>>>>>> I will let it ride now. Let's see if the problem persists or it's
>>>>>>> gone! :-)
>>>>>>>
>>>>>>> Best,
>>>>>>> Stefano
>>>>>>>
>>>>>>>
>>>>>>> On Mon, Jun 26, 2017 at 3:10 PM, Guillaume Quintard <
>>>>>>> guillaume@varnish-software.com> wrote:
>>>>>>>
>>>>>>>> Looking good!
>>>>>>>>
>>>>>>>> --
>>>>>>>> Guillaume Quintard
>>>>>>>>
>>>>>>>> On Mon, Jun 26, 2017 at 7:06 PM, Stefano Baldo <
>>>>>>>> stefanobaldo@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Guillaume,
>>>>>>>>>
>>>>>>>>> Can the following be considered "ban lurker friendly"?
>>>>>>>>>
>>>>>>>>> sub vcl_backend_response {
>>>>>>>>>     set beresp.http.x-url = bereq.http.host + bereq.url;
>>>>>>>>>     set beresp.http.x-user-agent = bereq.http.user-agent;
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> sub vcl_recv {
>>>>>>>>>     if (req.method == "PURGE") {
>>>>>>>>>         ban("obj.http.x-url == " + req.http.host + req.url + " && obj.http.x-user-agent !~ Googlebot");
>>>>>>>>>         return(synth(750));
>>>>>>>>>     }
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> sub vcl_deliver {
>>>>>>>>>     unset resp.http.x-url;
>>>>>>>>>     unset resp.http.x-user-agent;
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> Stefano
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Jun 26, 2017 at 12:43 PM, Guillaume Quintard <
>>>>>>>>> guillaume@varnish-software.com> wrote:
>>>>>>>>>
>>>>>>>>>> Not lurker friendly at all, indeed. You'll need to avoid req.*
>>>>>>>>>> expressions. The easiest way is to stash the host, user-agent, and url in
>>>>>>>>>> beresp.http.* and ban against those (unset them in vcl_deliver).
>>>>>>>>>>
>>>>>>>>>> I don't think you need to expand the VSL at all.
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Guillaume Quintard
>>>>>>>>>>
>>>>>>>>>> On Jun 26, 2017 16:51, "Stefano Baldo" <stefanobaldo@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi Guillaume.
>>>>>>>>>>
>>>>>>>>>> Thanks for answering.
>>>>>>>>>>
>>>>>>>>>> I'm using an SSD. I've changed from ext4 to ext2 to increase
>>>>>>>>>> performance, but it's still restarting.
>>>>>>>>>> Also, I checked the I/O performance of the disk and there is no
>>>>>>>>>> sign of overload.
>>>>>>>>>>
>>>>>>>>>> I've changed /var/lib/varnish to a tmpfs and increased its 80m
>>>>>>>>>> default size by passing "-l 200m,20m" to varnishd and using
>>>>>>>>>> "nodev,nosuid,noatime,size=256M 0 0" for the tmpfs mount. There
>>>>>>>>>> was a problem here. After a couple of hours varnish died and I received a
>>>>>>>>>> "no space left on device" message - deleting /var/lib/varnish solved
>>>>>>>>>> the problem and varnish was up again, but it's weird because there was free
>>>>>>>>>> memory on the host available to the tmpfs, so I don't know what could
>>>>>>>>>> have happened. I will try again without increasing the
>>>>>>>>>> /var/lib/varnish size.
>>>>>>>>>>
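>>>>>>>>>> For reference, the corresponding fstab entry was:
>>>>>>>>>>
>>>>>>>>>>     tmpfs /var/lib/varnish tmpfs nodev,nosuid,noatime,size=256M 0 0
>>>>>>>>>>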
>>>>>>>>>> Anyway, I am worried about the bans. You asked me if the bans are
>>>>>>>>>> lurker friendly. Well, I don't think so. My bans are created this way:
>>>>>>>>>>
>>>>>>>>>> ban("req.http.host == " + req.http.host + " && req.url ~ " + req.url + " && req.http.User-Agent !~ Googlebot");
>>>>>>>>>>
>>>>>>>>>> Are they lurker friendly? I was taking a quick look at the
>>>>>>>>>> documentation and it looks like they're not.
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> Stefano
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Fri, Jun 23, 2017 at 11:30 AM, Guillaume Quintard <
>>>>>>>>>> guillaume@varnish-software.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Stefano,
>>>>>>>>>>>
>>>>>>>>>>> Let's cover the usual suspects: I/Os. I think here Varnish gets
>>>>>>>>>>> stuck trying to push/pull data and can't make time to reply to the CLI. I'd
>>>>>>>>>>> recommend monitoring the disk activity (bandwidth and iops) to confirm.
>>>>>>>>>>>
>>>>>>>>>>> After some time, the file storage is terrible on a hard drive
>>>>>>>>>>> (SSDs take a bit more time to degrade) because of fragmentation. One
>>>>>>>>>>> solution to help the disks cope is to overprovision them if they're SSDs,
>>>>>>>>>>> and you can try different advice values in the file storage definition on the
>>>>>>>>>>> command line (last parameter, after granularity).
>>>>>>>>>>>
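>>>>>>>>>>> For example (the size and granularity here are placeholders; "random" is
>>>>>>>>>>> one of the accepted advice values, if I recall the syntax correctly):
>>>>>>>>>>>
>>>>>>>>>>>     varnishd ... -s file,/var/lib/varnish/storage.bin,450G,4k,random
>>>>>>>>>>>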
>>>>>>>>>>> Is your /var/lib/varnish mount on tmpfs? That could help too.
>>>>>>>>>>>
>>>>>>>>>>> 40K bans is a lot, are they ban-lurker friendly?
>>>>>>>>>>>
>>>>>>>>>>> --
>>>>>>>>>>> Guillaume Quintard
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Jun 23, 2017 at 4:01 PM, Stefano Baldo <
>>>>>>>>>>> stefanobaldo@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hello.
>>>>>>>>>>>>
>>>>>>>>>>>> I have been having a critical problem with Varnish Cache in production
>>>>>>>>>>>> for over a month, and any help will be appreciated.
>>>>>>>>>>>> The problem is that the Varnish child process is recurrently
>>>>>>>>>>>> restarted after 10~20h of use, with the following message:
>>>>>>>>>>>>
>>>>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824) not
>>>>>>>>>>>> responding to CLI, killed it.
>>>>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Unexpected reply
>>>>>>>>>>>> from ping: 400 CLI communication error
>>>>>>>>>>>> Jun 23 09:15:13 b858e4a8bd72 varnishd[11816]: Child (11824)
>>>>>>>>>>>> died signal=9
>>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child cleanup
>>>>>>>>>>>> complete
>>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038)
>>>>>>>>>>>> Started
>>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038)
>>>>>>>>>>>> said Child starts
>>>>>>>>>>>> Jun 23 09:15:14 b858e4a8bd72 varnishd[11816]: Child (24038)
>>>>>>>>>>>> said SMF.s0 mmap'ed 483183820800 bytes of 483183820800
>>>>>>>>>>>>
>>>>>>>>>>>> The following link is the varnishstat output just 1 minute
>>>>>>>>>>>> before a restart:
>>>>>>>>>>>>
>>>>>>>>>>>> https://pastebin.com/g0g5RVTs
>>>>>>>>>>>>
>>>>>>>>>>>> Environment:
>>>>>>>>>>>>
>>>>>>>>>>>> varnish-5.1.2 revision 6ece695
>>>>>>>>>>>> Debian 8.7 - Debian GNU/Linux 8 (3.16.0)
>>>>>>>>>>>> Installed using pre-built package from official repo at
>>>>>>>>>>>> packagecloud.io
>>>>>>>>>>>> CPU 2x2.9 GHz
>>>>>>>>>>>> Mem 3.69 GiB
>>>>>>>>>>>> Running inside a Docker container
>>>>>>>>>>>> NFILES=131072
>>>>>>>>>>>> MEMLOCK=82000
>>>>>>>>>>>>
>>>>>>>>>>>> Additional info:
>>>>>>>>>>>>
>>>>>>>>>>>> - I need to cache a large number of objects and the cache should
>>>>>>>>>>>> last for almost a week, so I have set up a 450G storage space; I don't know
>>>>>>>>>>>> if this is a problem;
>>>>>>>>>>>> - I use bans a lot. There were about 40k bans in the system just
>>>>>>>>>>>> before the last crash. I really don't know if this is too much or may have
>>>>>>>>>>>> anything to do with it;
>>>>>>>>>>>> - No registered CPU spikes (almost always around 30%);
>>>>>>>>>>>> - No panic is reported, the only info I can retrieve is from
>>>>>>>>>>>> syslog;
>>>>>>>>>>>> - All the time, even moments before the crashes,
>>>>>>>>>>>> everything is okay and requests are answered very fast.
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>> Stefano Baldo
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>
>>>>
>>>
>>
>> _______________________________________________
>> varnish-misc mailing list
>> varnish-misc@varnish-cache.org
>> https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc
>>
>
>