#1743: Workspace exhaustion under load
----------------------------------+----------------------
Reporter: geoff                   | Type: defect
Status: new                       | Priority: normal
Milestone: Varnish 4.0 release    | Component: varnishd
Version: 4.0.3                    | Severity: normal
Keywords: workspace, esi, load    |
----------------------------------+----------------------
Following up on this message in varnish-misc:

https://www.varnish-cache.org/lists/pipermail/varnish-misc/2015-May/024426.html

The problem is workspace exhaustion in load tests: LostHeader appears in
the logs, the losthdr stats counter increases, and VMODs report
insufficient workspace. We have only seen these problems under load. The
Varnish 3 setup currently in production runs against the same backends
without the problem.

In V3 we have sess_workspace=256KB. In V4 we started with
workspace_client=workspace_backend=256KB and kept doubling the values, up
to 16MB, still getting the problem. At 32MB, varnishd filled up RAM. In
V3 we run at well under 50% of available RAM on machines of the same size.
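
For reference, the two parameters can be adjusted at runtime roughly like
this (a sketch assuming varnishadm access; the values shown are our 256KB
starting point, not a recommendation, and the same settings can be passed
to varnishd as -p options):

    # Raise client- and backend-side workspace; only affects new
    # transactions, existing ones keep their current workspace.
    varnishadm param.set workspace_client 256k
    varnishadm param.set workspace_backend 256k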

When we captured logs that include LostHeader records, we found that the
offending requests were always ESI includes. The apps have some deep ESI
nesting, up to at least esi_level=7. In some, but not all, cases we can
see that there were backend retries due to VCL logic that retries requests
after 5xx responses.
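
The retry logic is roughly of this shape (a simplified sketch, not our
production VCL; the status test and the retry cap of 2 are placeholders):

    sub vcl_backend_response {
        # Retry fetches that came back with a server error, up to a
        # small cap; return(retry) re-enters vcl_backend_fetch.
        if (beresp.status >= 500 && bereq.retries < 2) {
            return (retry);
        }
    }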

The losthdr counter increases in bursts when this happens, typically at a
rate of about 2K/s and peaking at about 9K/s. The bursts seem to go on for
about 10-30 seconds, and then the rate drops back to 0. We have 3 proxies
in the cluster, and the error bursts don't necessarily happen on all 3 at
the same time.

The problem may be related to backend availability, but I'm not entirely
sure of that. The backends occasionally redeploy while load tests are
going on, and some of the error bursts may have come when this happened.
They also tend to increase when the load is high, which may be just due to
the higher load on varnishd, but might also be related to backends
throwing errors under load. We had one run with no errors at all, in
evening hours when there are no redeployments.

On the other hand, we've had more runs with errors during evening hours,
and sometimes the error bursts have come shortly after the load test
starts, when load is still ramping up and is far from the maximum.

VMODs in use are:

* std and directors
* header (V4 version from https://github.com/varnish/libvmod-header)
* urlcode (V4 version from https://github.com/fastly/libvmod-urlcode)
* uuid (as updated for V4 at https://github.com/otto-de/libvmod-uuid)
* re (https://code.uplex.de/uplex-varnish/libvmod-re)
* vtstor (https://code.uplex.de/uplex-varnish/libvmod-vtstor)

We tried working around the use of VMOD re in VCL, since it stores the
subject of a regex match in workspace, and we use it on Cookie headers,
which can be very large. But that didn't solve the problem.
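
For context, the usage pattern in question looks roughly like this (a
sketch assuming the vmod's regex/match/backref interface; the pattern and
header names are invented for illustration):

    import re;

    sub vcl_init {
        # Compile the pattern once, at VCL load time.
        new session_re = re.regex("sessionid=([0-9a-f]+)");
    }

    sub vcl_recv {
        # match() saves a copy of the subject -- here the entire
        # Cookie header -- in workspace so that backref() can refer
        # back into it.
        if (session_re.match(req.http.Cookie)) {
            set req.http.X-Session = session_re.backref(1, "");
        }
    }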

VMOD vtstor only uses workspace for the size of a VXID as a string
(otherwise it mallocs its own structures), and uuid only uses workspace
for the size of a UUID string.

I'm learning how to read MEMPOOL.* stats, and I've noticed randry > 0,
timeouts > 0 and surplus > 0. But my reading of the code makes me think
that these don't indicate problems (except possibly surplus > 0?), and
that mempools can't help you anyway if workspaces are too small.
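
For anyone following along, the counters in question can be listed with
varnishstat's field glob (quoted so the shell doesn't expand it):

    varnishstat -1 -f 'MEMPOOL.*'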

--
Ticket URL: <https://www.varnish-cache.org/trac/ticket/1743>
Re: #1743: Workspace exhaustion under load
#1743: Workspace exhaustion under load
----------------------------------+----------------------------------
Reporter: geoff                   | Owner:
Type: defect                      | Status: new
Priority: normal                  | Milestone: Varnish 4.0 release
Component: varnishd               | Version: 4.0.3
Severity: normal                  | Resolution:
Keywords: workspace, esi, load    |
----------------------------------+----------------------------------

Comment (by geoff):

We also test a "backend" proxy cluster under 4.0.3, for REST calls
between backend apps, with no ESI, a much lighter load, much simpler VCL
logic, and with VMODs std, directors, header and re. That cluster has had
no workspace problems with the defaults for workspace_client and
workspace_backend (64KB).

--
Ticket URL: <https://www.varnish-cache.org/trac/ticket/1743#comment:1>
Re: #1743: Workspace exhaustion under load
#1743: Workspace exhaustion under load
----------------------------------+----------------------------------
Reporter: geoff                   | Owner:
Type: defect                      | Status: new
Priority: normal                  | Milestone: Varnish 4.0 release
Component: varnishd               | Version: 4.0.3
Severity: normal                  | Resolution:
Keywords: workspace, esi, load    |
----------------------------------+----------------------------------

Comment (by geoff):

We're considering building from master and testing it under load. We'd
really prefer to go live with 4.0.3, and then migrate to 4.1 when it's
released, but we could do this to see if it has any effect on the problem.

It wouldn't be trivial, since we'd have to rebuild the VMODs, package RPMs
for distribution in the environments, etc.

So we'd appreciate any thoughts on whether it might be worth the effort --
e.g. if there is some new goodness in the ESI code worth trying, or if
master is presently too bleeding-edge for a load test.

--
Ticket URL: <https://www.varnish-cache.org/trac/ticket/1743#comment:2>
Re: #1743: Workspace exhaustion under load
#1743: Workspace exhaustion under load
----------------------------------+----------------------------------
Reporter: geoff                   | Owner:
Type: defect                      | Status: new
Priority: normal                  | Milestone: Varnish 4.0 release
Component: varnishd               | Version: 4.0.3
Severity: normal                  | Resolution:
Keywords: workspace, esi, load    |
----------------------------------+----------------------------------

Comment (by geoff):

It turns out that this problem was caused by a bug in our VCL, and does
not indicate a problem with workspace management under load in varnishd.

We had VCL code that changed req.url in vcl_recv() under certain
circumstances. The rewrite should only have been done at ESI level 0, but
the code did not check the ESI level. That meant that the URL was replaced
at every ESI level, and the response at that URL itself had deep ESI
nesting, as mentioned above.

Not only had we increased max_esi_depth to allow the deep nesting, we also
have an ESI include tree that expands widely in breadth. The combined
effect was enough ESI expansion to exhaust the workspaces.

After fixing VCL so that the req.url replacement is only done at ESI level
0, we've been able to repeat load tests with no workspace problems.
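
The fix amounts to guarding the rewrite with req.esi_level; a minimal
sketch (the match and replacement shown are placeholders, not our actual
rewrite rule):

    sub vcl_recv {
        # Only rewrite at the top level; ESI subrequests
        # (esi_level > 0) keep the URL given in the include tag.
        if (req.esi_level == 0 && req.url ~ "^/app/") {
            set req.url = regsub(req.url, "^/app/", "/");
        }
    }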

This ticket can be closed now. Thanks to phk for the help.

--
Ticket URL: <https://www.varnish-cache.org/trac/ticket/1743#comment:3>
Re: #1743: Workspace exhaustion under load
#1743: Workspace exhaustion under load
----------------------------------+----------------------------------
Reporter: geoff                   | Owner:
Type: defect                      | Status: closed
Priority: normal                  | Milestone: Varnish 4.0 release
Component: varnishd               | Version: 4.0.3
Severity: normal                  | Resolution: worksforme
Keywords: workspace, esi, load    |
----------------------------------+----------------------------------
Changes (by phk):

* status: new => closed
* resolution: => worksforme


--
Ticket URL: <https://www.varnish-cache.org/trac/ticket/1743#comment:4>