Mailing List Archive

Varnish returns 503 error, because it "Could not get storage"
Hi all,

I'm still investigating issues with one of our varnish instances. We
use varnish as a cache and load balancer behind nginx and in front of a
docker platform. We experienced an outage of about 20 minutes during
which clients received 503 errors produced by varnish, while the docker
containers responded correctly (according to the containers' logs).

Setup is:

[ nginx ==> varnish ] ==> [ docker swarm (4 hosts, lots of containers) ]


Sites are distinguished by the exposed ports of the respective swarm
services. Mapping site to service is done with a director containing
the 4 hosts and the respective service port as backends.

By comparing nginx logs with container logs we could confirm that
varnish was the culprit: the backend request seemed to succeed, but
varnish returned a 503 error anyway.

To investigate further, I activated some logging, which revealed some
concerning information. Apparently varnish sometimes has problems with
the storage, as the "FetchError" says "Could not get storage".

```
* << BeReq >> 70780723
- Begin bereq 70780722 pass
[...]
- Storage malloc Transient
- Fetch_Body 2 chunked -
- FetchError Could not get storage
```

I have attached two complete log examples to this mail.

I did some extensive searching, including the varnish book, but so far
have not come up with an explanation. Can anyone help me understand
why this happens and how to avoid it?

Here is some additional information about our varnish instance:
- Debian buster
- system: HP DL360p G8, 32G RAM, Intel Xeon E5-2630
- varnish 6.6.0-1~buster (using the varnish repos)
- varnish start options:

```
ExecStart=/usr/sbin/varnishd -a :6081 \
-T :6082 \
-f /etc/varnish/default.vcl \
-p ping_interval=6 -p cli_timeout=10 -p pipe_timeout=600 \
-p listen_depth=4096 -p thread_pool_min=200 \
-p thread_pool_max=500 -p workspace_client=128k \
-p nuke_limit=1000 -S /etc/varnish/secret \
-s malloc,12G \
-s Transient=malloc,3500M
```
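
As a side note, storage fill can be checked at any time with varnishstat. A minimal sketch, assuming the storage names implied by the options above (an unnamed `-s malloc,12G` creates the default store, which varnish calls `s0`); it requires a running varnishd:

```
# Show bytes in use (g_bytes) and bytes still free (g_space)
# for both storage backends, once, then exit.
varnishstat -1 -f 'SMA.s0.g_bytes' -f 'SMA.s0.g_space' \
            -f 'SMA.Transient.g_bytes' -f 'SMA.Transient.g_space'
```

When `SMA.Transient.g_space` approaches zero, transient allocations start failing, which is what surfaces as the "Could not get storage" FetchError above.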

Thanks in advance!

--
Marco Dickert
Re: Varnish returns 503 error, because it "Could not get storage"
Hello.

I have not looked at the attachments, but you have limited Transient to
3500 MB. Getting "Could not get storage" is not unexpected if a
large enough share of your transactions uses Transient.

You can figure out which transactions are transient by filtering on the
Storage tag. Both varnishlog and varnishncsa (with a good formatting string
and both -b and -c enabled) can be used for this.
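
For example (a sketch, not from the original mail), the VSL query language can restrict the backend log to transactions that were allocated from Transient; this needs a running varnishd:

```
# Show the URL and storage choice of every backend request that ended
# up in Transient. The Storage record reads e.g. "malloc Transient",
# so field [2] is the storage name.
varnishlog -b -q 'Storage[2] eq "Transient"' -i Storage -i BereqURL
```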

If no other alternative presents itself, maybe you need to switch to return
(pipe) for some of your non-cacheable traffic just to save memory, but this
disqualifies H2 and will give you low connection reuse, so it is not
optimal.
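
In VCL that could look like the following sketch (the URL pattern is purely hypothetical; match whatever identifies your non-cacheable traffic):

```
sub vcl_recv {
    # Hypothetical example: bypass the fetch/deliver path for large,
    # uncacheable downloads so their bodies never touch storage.
    # Piped requests skip most VCL processing and, as noted above,
    # do not work over H2.
    if (req.url ~ "^/exports/") {
        return (pipe);
    }
}
```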

Best,
Pål




On Wed, 18 Aug 2021 at 10:04, Marco Dickert - evolver group <
marco.dickert@evolver.de> wrote:

Re: Varnish returns 503 error, because it "Could not get storage"
Hi Pål,

On 2021-08-20 12:08:03, Pål Hermunn Johansen wrote:
> I have not looked at the attachments, but you have limited Transient to
> 3500 MB. Getting "Could not get storage" should not be unexpected if a
> large enough amount of your transactions use Transient.

The problem is that we cannot afford to leave the transient storage
unlimited, because of this jemalloc problem (see [1] and the answer from
Reza, pointing to [2]). We currently limit s0 to 6 GB and Transient to
3 GB, yet varnish uses 25 GB in total. Using different jemalloc versions
didn't help. Anyway, we will try swapping the limits and check whether
and how this affects the problem.

[1] https://varnish-cache.org/lists/pipermail/varnish-misc/2021-April/027022.html
[2] https://github.com/varnishcache/varnish-cache/issues/3511

> You can figure out which transactions are transient by filtering on the
> Storage tag. Both varnishlog and varnishncsa (with a good formatting string
> and both -b and -c enabled) can be used for this.

Apparently our ratio between transient and s0 is about 50:50.
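
For reference, a ratio like that can be measured with something along these lines (a sketch; it reads from a running instance, `-d` also processes records already in the shared memory log, and the command keeps following the log until stopped with Ctrl-C):

```
# Count backend transactions per storage backend. The Storage record
# reads e.g. "malloc Transient" or "malloc s0", so the last field of
# each matching line is the storage name.
varnishlog -b -d -g request -i Storage | awk '/Storage/ {print $NF}' | sort | uniq -c
```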

> If no other alternative presents itself, maybe you need to switch to return
> (pipe) for some of your non-cacheable traffic just to save memory, but this
> disqualifies H2 and will give you a low connection reuse, so it is not
> optimal.

We will consider this. Maybe we can use pipe for at least a subset of requests.

So thank you very much for the hints!

--
Marco Dickert
_______________________________________________
varnish-misc mailing list
varnish-misc@varnish-cache.org
https://www.varnish-cache.org/lists/mailman/listinfo/varnish-misc