Mailing List Archive

#1762: VSL API: endless loop and out of memory in vtx_scan() on forced synthetic transactions
#1762: VSL API: endless loop and out of memory in vtx_scan() on forced synthetic
transactions
---------------------------------+----------------------
Reporter: geoff | Type: defect
Status: new | Priority: normal
Milestone: Varnish 4.0 release | Component: varnishd
Version: 4.0.3 | Severity: critical
Keywords: |
---------------------------------+----------------------
We have been experiencing a problem with a VSL client -- under certain
conditions the dispatcher callback is no longer called, and the process
resident size increases slowly over a long period of time, until all
system RAM is consumed.

{{{
#6 0x00007f55851be85d in VAS_Fail_default (
func=0x7f55851ddd10 "vtx_synth_rec", file=0x7f55851dd28a
"vsl_dispatch.c",
line=1008, cond=0x7f55851dda96 "(synth) != 0", err=12,
kind=VAS_ASSERT)
at ../libvarnish/vas.c:67
#7 0x00007f55851ce903 in vtx_synth_rec (vtx=0x7d24a0, tag=76,
fmt=0x7f55851ddafc "%s (%u:%s \"%.*s\")") at vsl_dispatch.c:1008
#8 0x00007f55851cee17 in vtx_diag_tag (vtx=0x7d24a0, ptr=0x125823fe80,
reason=0x7f55851dda23 "vxid mismatch") at vsl_dispatch.c:1065
#9 0x00007f55851cdefb in vtx_scan (vslq=0x73d0d0, vtx=0x7d24a0)
at vsl_dispatch.c:863
#10 0x00007f55851ce1c2 in vtx_force (vslq=0x73d0d0, vtx=0x7d24a0,
reason=0x7f55851ddc87 "timeout") at vsl_dispatch.c:911
#11 0x00007f55851cffd4 in VSLQ_Dispatch (vslq=0x73d0d0,
func=0x40a5cd <dispatch>, priv=0x0) at vsl_dispatch.c:1345
}}}

That's err=12==ENOMEM for the ALLOC_OBJ() at the beginning of
vtx_synth_rec(). As you can see, VSL forced out a transaction due to
timeout, and vtx_scan() also found a VXID mismatch. The vtx_scan() call is
the second one in vtx_force().

We don't yet know why the VXID mismatch happened, and it's obviously not
OK, we're still investigating, but I suspect that vtx_scan() ran into an
endless loop here:

{{{
while (!(vtx->flags & VTX_F_COMPLETE) &&
vslc_vtx_next(&vtx->c.cursor) == 1) {
ptr = vtx->c.cursor.rec.ptr;
if (VSL_ID(ptr) != vtx->key.vxid) {
(void)vtx_diag_tag(vtx, ptr, "vxid mismatch");
continue;
}
}}}

I can confirm from the core dump that `VSL_ID(ptr) != vtx->key.vxid` is
true here, so it goes into vtx_diag_tag() to synthesize "vxid mismatch".
There was already a synthetic record added by vtx_force(), and
vtx_diag_tag() causes a new one to be added. Then the while loop goes back
to call vslc_vtx_next(), which advances the cursor to the synth record
that vtx_diag_tag() just added. The VXID mismatch still holds, so once
again vtx_diag_tag() gets called. Again a synth record is added, and again
vslc_vtx_next() advances to that record.

Thus we have an endless loop, and a struct synth gets allocated every
time. That's only 104 bytes, but slowly and surely all of RAM gets filled.

Our VSL client has a statistic that is incremented every time the
dispatcher is called, and we see in the log that this counter stops
increasing when memory size starts to increase. The counter is stopped for
about 10 to 20 minutes while the resident size increases, until all of RAM
is full. The process finally took up about 75 GB of RAM, meaning that it
had to be about 750 million allocations, assuming that it was allocating
the 104 bytes of struct synth every time, which is probably why it takes
so long.

We can see the large mapping is in the heap (low addresses), so it must
have been something that was malloc'ed (we've also seen this in pmap):

{{{
(gdb) info files
[...]
0x0000000000400000 - 0x0000000000401000 is load1a
0x0000000000401000 - 0x0000000000401000 is load1b
0x0000000000613000 - 0x0000000000614000 is load2
0x0000000000614000 - 0x000000000061d000 is load3
0x0000000000703000 - 0x0000000000724000 is load4
0x0000000000724000 - 0x0000001258261000 is load5
[...]
}}}

0x1258261000 - 0x724000 == 75GB.

And we can see that the last pointer of the synth list goes deep into that
large mapping.

{{{
(gdb) p vtx->synth
$87 = {vtqh_first = 0x866840, vtqh_last = 0x125823ff48}
}}}

The synth list apparently has a long cascade of "vxid mismatch" records:

{{{
(gdb) printf "%s\n", vtx->synth.vtqh_first.data[0]
:VSL "vxid mismatch (6489548:VSL "vxid mi

(gdb) printf "%s\n", vtx->synth.vtqh_first.list.vtqe_next.data[2]
h (6489548:VSL "vxid mismatch (6489548:VSL "vxid mi

(gdb) printf "%s\n",
vtx->synth.vtqh_first.list.vtqe_next.list.vtqe_next.data[2]
6489548:VSL "vxid mi

(gdb) printf "%s\n",
vtx->synth.vtqh_first.list.vtqe_next.list.vtqe_next.list.vtqe_next.data[2]
id mismatch (6489548:VSL "vxid mi

(gdb) printf "%s\n",
vtx->synth.vtqh_first.list.vtqe_next.list.vtqe_next.list.vtqe_next.list.vtqe_next.data[2]
id mismatch (6489548:VSL "vxid mi

(gdb) printf "%s\n",
vtx->synth.vtqh_first.list.vtqe_next.list.vtqe_next.list.vtqe_next.list.vtqe_next.list.vtqe_next.list.vtqe_next.data[2]
id mismatch (6489548:VSL "vxid mi

(gdb) printf "%s\n",
vtx->synth.vtqh_first.list.vtqe_next.list.vtqe_next.list.vtqe_next.list.vtqe_next.list.vtqe_next.list.vtqe_next.list.vtqe_next.data[2]
id mismatch (6489548:VSL "vxid mi

(gdb) printf "%s\n",
vtx->synth.vtqh_first.list.vtqe_next.list.vtqe_next.list.vtqe_next.list.vtqe_next.list.vtqe_next.list.vtqe_next.list.vtqe_next.list.vtqe_next.data[2]
id mismatch (6489548:VSL "vxid mi
}}}

We see all of this as evidence that vtx_scan() is trapped in the while
loop.

--
Ticket URL: <https://www.varnish-cache.org/trac/ticket/1762>
Varnish <https://varnish-cache.org/>
The Varnish HTTP Accelerator

_______________________________________________
varnish-bugs mailing list
varnish-bugs@varnish-cache.org
https://www.varnish-cache.org/lists/mailman/listinfo/varnish-bugs
Re: #1762: VSL API: endless loop and out of memory in vtx_scan() on forced synthetic transactions [ In reply to ]
#1762: VSL API: endless loop and out of memory in vtx_scan() on forced synthetic
transactions
----------------------+----------------------------------
Reporter: geoff | Owner:
Type: defect | Status: new
Priority: normal | Milestone: Varnish 4.0 release
Component: varnishd | Version: 4.0.3
Severity: critical | Resolution:
Keywords: |
----------------------+----------------------------------

Comment (by slink):

The interesting bit is that the vtx vxid has the client bit set:

{{{
(gdb) frame 8
#8 0x00007f55851cee17 in vtx_diag_tag (vtx=0x7d24a0, ptr=0x125823fe80,
reason=0x7f55851dda23 "vxid mismatch")
at vsl_dispatch.c:1065
1065 in vsl_dispatch.c
(gdb) print /x ptr[1]
$25 = 0xc06305cc
(gdb) print /x vtx->key.vxid
$26 = 0x406305cc
}}}

Reviewing the code I am pretty sure that the root cause is not in the vsl
api, but rather the fact that 4.0 logs SLT_Links with the the
client/backend bits.

This is fixed in master cfb309cad60e0239bc7168082d73b4ab6b53744a, so I'll
just port this back to 4.0

--
Ticket URL: <https://www.varnish-cache.org/trac/ticket/1762#comment:1>
Varnish <https://varnish-cache.org/>
The Varnish HTTP Accelerator

_______________________________________________
varnish-bugs mailing list
varnish-bugs@varnish-cache.org
https://www.varnish-cache.org/lists/mailman/listinfo/varnish-bugs
Re: #1762: VSL API: endless loop and out of memory in vtx_scan() on forced synthetic transactions [ In reply to ]
#1762: VSL API: endless loop and out of memory in vtx_scan() on forced synthetic
transactions
----------------------+----------------------------------
Reporter: geoff | Owner:
Type: defect | Status: closed
Priority: normal | Milestone: Varnish 4.0 release
Component: varnishd | Version: 4.0.3
Severity: critical | Resolution: fixed
Keywords: |
----------------------+----------------------------------
Changes (by slink):

* status: new => closed
* resolution: => fixed


Comment:

fixed in ca37d86639b90351f98484f2e6eee17a5f3638fa but I forgot the fix
marker in the commit.

--
Ticket URL: <https://www.varnish-cache.org/trac/ticket/1762#comment:2>
Varnish <https://varnish-cache.org/>
The Varnish HTTP Accelerator

_______________________________________________
varnish-bugs mailing list
varnish-bugs@varnish-cache.org
https://www.varnish-cache.org/lists/mailman/listinfo/varnish-bugs
Re: #1762: VSL API: endless loop and out of memory in vtx_scan() on forced synthetic transactions [ In reply to ]
#1762: VSL API: endless loop and out of memory in vtx_scan() on forced synthetic
transactions
----------------------+----------------------------------
Reporter: geoff | Owner:
Type: defect | Status: reopened
Priority: normal | Milestone: Varnish 4.0 release
Component: varnishd | Version: 4.0.3
Severity: critical | Resolution:
Keywords: |
----------------------+----------------------------------
Changes (by geoff):

* status: closed => reopened
* resolution: fixed =>


Comment:

This problem is not solved, since the endless loop and ENOMEM could happen
again if a VXID mismatch occurs for any other reason.

It's good to have the fix backported to 4.0, that would remove the
specific cause of the problem that we experienced recently.

But I assume that the check for VXID mismatch is there on the assumption
that it might possibly happen, and if it does, I don't see how the same
showstopper can be avoided.

If we're certain that the VXID mismatch cannot happen under any
circumstances now, then we should remove the check. But we certainly
should not leave a ticking time bomb in the code.

--
Ticket URL: <https://www.varnish-cache.org/trac/ticket/1762#comment:3>
Varnish <https://varnish-cache.org/>
The Varnish HTTP Accelerator

_______________________________________________
varnish-bugs mailing list
varnish-bugs@varnish-cache.org
https://www.varnish-cache.org/lists/mailman/listinfo/varnish-bugs
Re: #1762: VSL API: endless loop and out of memory in vtx_scan() on forced synthetic transactions [ In reply to ]
#1762: VSL API: endless loop and out of memory in vtx_scan() on forced synthetic
transactions
----------------------+----------------------------------
Reporter: geoff | Owner:
Type: defect | Status: closed
Priority: normal | Milestone: Varnish 4.0 release
Component: varnishd | Version: 4.0.3
Severity: critical | Resolution: fixed
Keywords: |
----------------------+----------------------------------
Changes (by slink):

* status: reopened => closed
* resolution: => fixed


Comment:

Replying to [comment:3 geoff]:
> This problem is not solved, since the endless loop and ENOMEM could
happen again if a VXID mismatch occurs for any other reason.

The specific issue was with this section of the code:


{{{
vtx_scan(struct VSLQ *vslq, struct vtx *vtx)
{
// [....]
if (VSL_ID(ptr) != vtx->key.vxid) {
(void)vtx_diag_tag(vtx, ptr, "vxid mismatch");
continue;
}

}}}

If we find a log chunk linked to our vtx with a vxid other than the vtx'es
vxid, vtx_diag_tag will get called and nothing is wrong with that.
vtx_diag_tag -> vtx_synth_rec will use the vtx'es vxid for the synth
records. I don't see a risk of an infinite loop here if vtx->key.vxid is
correct.

The specific issue here was that vtx->key.vxid was _not_ correct in 4.0.3
without the VXID macro fix: For instance Link records could contain vxids
with the client/backend bit set, so vtxes with these bad vxids could get
created.

The comparison above between the vtx vxid and VSL_ID() assumes that the
vtx vxid never contains client/backend bits, so it always reported a
mismatch for this case.


So I think we have good reason to assume that this issue's root cause is
really fixed.

--
Ticket URL: <https://www.varnish-cache.org/trac/ticket/1762#comment:4>
Varnish <https://varnish-cache.org/>
The Varnish HTTP Accelerator

_______________________________________________
varnish-bugs mailing list
varnish-bugs@varnish-cache.org
https://www.varnish-cache.org/lists/mailman/listinfo/varnish-bugs
Re: #1762: VSL API: endless loop and out of memory in vtx_scan() on forced synthetic transactions [ In reply to ]
#1762: VSL API: endless loop and out of memory in vtx_scan() on forced synthetic
transactions
----------------------+----------------------------------
Reporter: geoff | Owner:
Type: defect | Status: reopened
Priority: normal | Milestone: Varnish 4.0 release
Component: varnishd | Version: 4.0.3
Severity: critical | Resolution:
Keywords: |
----------------------+----------------------------------
Changes (by slink):

* status: closed => reopened
* resolution: fixed =>


Comment:

Reopened until agreement with Geoff is reached.

--
Ticket URL: <https://www.varnish-cache.org/trac/ticket/1762#comment:5>
Varnish <https://varnish-cache.org/>
The Varnish HTTP Accelerator

_______________________________________________
varnish-bugs mailing list
varnish-bugs@varnish-cache.org
https://www.varnish-cache.org/lists/mailman/listinfo/varnish-bugs
Re: #1762: VSL API: endless loop and out of memory in vtx_scan() on forced synthetic transactions [ In reply to ]
#1762: VSL API: endless loop and out of memory in vtx_scan() on forced synthetic
transactions
----------------------+----------------------------------
Reporter: geoff | Owner: slink
Type: defect | Status: new
Priority: normal | Milestone: Varnish 4.0 release
Component: varnishd | Version: 4.0.3
Severity: critical | Resolution:
Keywords: |
----------------------+----------------------------------
Changes (by slink):

* status: reopened => new
* owner: => slink


--
Ticket URL: <https://www.varnish-cache.org/trac/ticket/1762#comment:6>
Varnish <https://varnish-cache.org/>
The Varnish HTTP Accelerator

_______________________________________________
varnish-bugs mailing list
varnish-bugs@varnish-cache.org
https://www.varnish-cache.org/lists/mailman/listinfo/varnish-bugs
Re: #1762: VSL API: endless loop and out of memory in vtx_scan() on forced synthetic transactions [ In reply to ]
#1762: VSL API: endless loop and out of memory in vtx_scan() on forced synthetic
transactions
----------------------+----------------------------------
Reporter: geoff | Owner: slink
Type: defect | Status: assigned
Priority: normal | Milestone: Varnish 4.0 release
Component: varnishd | Version: 4.0.3
Severity: critical | Resolution:
Keywords: |
----------------------+----------------------------------
Changes (by slink):

* status: new => assigned


--
Ticket URL: <https://www.varnish-cache.org/trac/ticket/1762#comment:7>
Varnish <https://varnish-cache.org/>
The Varnish HTTP Accelerator

_______________________________________________
varnish-bugs mailing list
varnish-bugs@varnish-cache.org
https://www.varnish-cache.org/lists/mailman/listinfo/varnish-bugs
Re: #1762: VSL API: endless loop and out of memory in vtx_scan() on forced synthetic transactions [ In reply to ]
#1762: VSL API: endless loop and out of memory in vtx_scan() on forced synthetic
transactions
----------------------+----------------------------------
Reporter: geoff | Owner: slink
Type: defect | Status: assigned
Priority: normal | Milestone: Varnish 4.0 release
Component: varnishd | Version: 4.0.3
Severity: critical | Resolution:
Keywords: |
----------------------+----------------------------------

Comment (by slink):

commits
* 90e87e30fb92e525016ce1975fa61ad744bdc41d (master)
* 9119bfd659bc51a60d9f0bc1c84a6fd29b889b97 (4.0)
add an assertion that parsed vxids do not contain the client/server bit

--
Ticket URL: <https://www.varnish-cache.org/trac/ticket/1762#comment:8>
Varnish <https://varnish-cache.org/>
The Varnish HTTP Accelerator

_______________________________________________
varnish-bugs mailing list
varnish-bugs@varnish-cache.org
https://www.varnish-cache.org/lists/mailman/listinfo/varnish-bugs
Re: #1762: VSL API: endless loop and out of memory in vtx_scan() on forced synthetic transactions [ In reply to ]
#1762: VSL API: endless loop and out of memory in vtx_scan() on forced synthetic
transactions
----------------------+----------------------------------
Reporter: geoff | Owner: slink
Type: defect | Status: assigned
Priority: normal | Milestone: Varnish 4.0 release
Component: varnishd | Version: 4.0.3
Severity: critical | Resolution:
Keywords: |
----------------------+----------------------------------

Comment (by slink):

Further investigating into root cause scenarios in order to write a
regression resulted in the following insights:

* the bad vxid must have got into vtx->key.vxid by way of `vtx_parse_link`
* which is only called for `SLT_Begin` (`vtx_scan_begin()`) and `SLT_Link`
(`vtx_scan_link()`)

(actually this was known before, but I am now confident that these are the
only cases)

There is no case in the code as of 4.0.3 release where `SLT_Begin` is
emitted with an unmasked vxid, so the issue must be root casue in an
`SLT_Link` link record.

In both cases where unmasked vxids are emitted for `SLT_Link`, the id
comes directly from `VXID_Get()`:
* `cache_fetch.c`
{{{
wid = VXID_Get(&wrk->vxid_pool);
VSLb(bo->vsl, SLT_Link, "bereq %u retry", wid);
}}}
* `cache_req_fsm.c`
{{{
wid = VXID_Get(&wrk->vxid_pool);
// XXX: ReqEnd + ReqAcct ?
VSLb_ts_req(req, "Restart", W_TIM_real(wrk));
VSLb(req->vsl, SLT_Link, "req %u restart", wid);
}}}

So unless I have overseen anything significant, the root cause must have
been a vxid spill, which was fixed with
0dd8c0b864a9574df0f2891824b4581d0e846613 (master) /
171f3ac585f2bda639f526c31ad0689aecb8f8b4 (4.0)

`VXID()` masking would have avoided the issue to surface.

This insight is consistent with two observations:
* the issue only surfaced after `varnishd` running for longer periods of
time
* the issue didn't go away after a restart of the vsl client, a `varnishd`
restart was required

This gives confidence that the issue has really been understood completely
and that the root cause has been fixed.

--
Ticket URL: <https://www.varnish-cache.org/trac/ticket/1762#comment:9>
Varnish <https://varnish-cache.org/>
The Varnish HTTP Accelerator

_______________________________________________
varnish-bugs mailing list
varnish-bugs@varnish-cache.org
https://www.varnish-cache.org/lists/mailman/listinfo/varnish-bugs
Re: #1762: VSL API: endless loop and out of memory in vtx_scan() on forced synthetic transactions [ In reply to ]
#1762: VSL API: endless loop and out of memory in vtx_scan() on forced synthetic
transactions
----------------------+----------------------------------
Reporter: geoff | Owner: slink
Type: defect | Status: assigned
Priority: normal | Milestone: Varnish 4.0 release
Component: varnishd | Version: 4.0.3
Severity: critical | Resolution:
Keywords: |
----------------------+----------------------------------

Comment (by slink):

regression test added:
* master 20362bf807eb6b0772ccdeb221b05173484d5dc3
* 4.0 9bffd9eac5baa4ef0e3a7f5d1fcab455d5980d61

--
Ticket URL: <https://www.varnish-cache.org/trac/ticket/1762#comment:10>
Varnish <https://varnish-cache.org/>
The Varnish HTTP Accelerator

_______________________________________________
varnish-bugs mailing list
varnish-bugs@varnish-cache.org
https://www.varnish-cache.org/lists/mailman/listinfo/varnish-bugs
Re: #1762: VSL API: endless loop and out of memory in vtx_scan() on forced synthetic transactions [ In reply to ]
#1762: VSL API: endless loop and out of memory in vtx_scan() on forced synthetic
transactions
----------------------+----------------------------------
Reporter: geoff | Owner: slink
Type: defect | Status: closed
Priority: normal | Milestone: Varnish 4.0 release
Component: varnishd | Version: 4.0.3
Severity: critical | Resolution: fixed
Keywords: |
----------------------+----------------------------------
Changes (by slink):

* status: assigned => closed
* resolution: => fixed


Comment:

agreed with geoff that this is solved for real

--
Ticket URL: <https://www.varnish-cache.org/trac/ticket/1762#comment:11>
Varnish <https://varnish-cache.org/>
The Varnish HTTP Accelerator

_______________________________________________
varnish-bugs mailing list
varnish-bugs@varnish-cache.org
https://www.varnish-cache.org/lists/mailman/listinfo/varnish-bugs