Mailing List Archive

#1823: vcl in discarded mode does not clear up
#1823: vcl in discarded mode does not clear up
-----------------------------+--------------------
Reporter: hamed.gholamian | Type: defect
Status: new | Priority: normal
Milestone: Varnish 4.1-TP1 | Component: build
Version: 4.1.0 | Severity: normal
Keywords: |
-----------------------------+--------------------
Hi,

We have a problem with Varnish holding onto VCLs in the discarded state
because references to them persist.

We timestamp each VCL name with the date/time it was loaded, using a
simple script built around varnishadm commands, and we should only ever
retain 3 'available' configs.
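
For illustration, a reload script along these lines (a minimal sketch; the
varnishadm subcommands are standard, everything else here is assumed):

{{{
#!/bin/sh
# hypothetical sketch of a timestamped reload, not our actual script
NAME="vcl_$(date +%Y%m%d%H%M%S)"
varnishadm vcl.load "$NAME" /etc/varnish/default.vcl
varnishadm vcl.use "$NAME"
# keep only the 3 newest 'available' configs (head -n -3 is GNU coreutils)
varnishadm vcl.list | awk '$1 == "available" {print $NF}' | head -n -3 |
    while read old; do varnishadm vcl.discard "$old"; done
}}}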

The problem we are experiencing is with the 'discarded' VCLs, specifically
their backend polling, which is being duplicated. This causes some
customers to see high volumes of probes from our delivery servers, which
in turn affects their backend systems.
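
For reference, the duplicated probing can be watched on the delivery
server via the health-check log records (tag and options as in Varnish
4.x; verify against your build):

{{{
# each cooling VCL that still polls shows up as extra Backend_health
# records for the same backend
varnishlog -g raw -i Backend_health
}}}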

Example output from varnishadm vcl.list:


{{{
discarded auto/cooling 52 vcl_20151125095333
discarded auto/cooling 11 vcl_20151125122048
discarded auto/cooling 136 vcl_20151125122328
discarded auto/cooling 24 vcl_20151125140126
discarded auto/cooling 1 vcl_20151125143340
discarded auto/cooling 1 vcl_20151125144531
discarded auto/cooling 12 vcl_20151125145312
discarded auto/cooling 9 vcl_20151125154306
discarded auto/cooling 2 vcl_20151125162540
available auto/cold 0 vcl_20151126140831
available auto/cold 0 vcl_20151126141010
available auto/cold 0 vcl_20151126141620
active auto/warm 22 vcl_20151126142113
}}}


The reference counts never decrease, leaving the 'discarded' VCLs on the
delivery server.

We have been able to reproduce this in our development environment with
the following steps (a rough command sketch follows the list):

1. On the Varnish delivery server, define a backend (with no probe)
pointing at a remote host serving a simple website.
2. Send a number of requests (around 1000) to the Varnish delivery server.
3. On the backend, add an iptables rule to drop port 80.
4. Reload the VCL a number of times.
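
Roughly, steps 2-4 amount to the following (addresses, ports and tooling
are illustrative, not what we actually run):

{{{
# 2. send ~1000 requests at the delivery server (here with ApacheBench)
ab -n 1000 -c 100 http://delivery-server:6081/

# 3. on the backend host, silently drop its HTTP port
iptables -A INPUT -p tcp --dport 80 -j DROP

# 4. reload the VCL a few times on the delivery server
for i in 1 2 3; do
    name="vcl_$(date +%Y%m%d%H%M%S)_$i"
    varnishadm vcl.load "$name" /etc/varnish/default.vcl
    varnishadm vcl.use "$name"
done
}}}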


{{{
OS: 12.04 3.2.0-70-generic #105-Ubuntu SMP x86_64 x86_64 x86_64 GNU/Linux
Varnish: varnishd (varnish-4.1.0 revision 3041728)

}}}

Other than restarting Varnish, we have not found a way to clear these
references and allow the auto/cooling process to complete and remove the
VCLs.

Is this a possible bug, or is this expected behavior of the v4.1 VMOD when
backends become sick/unreachable?

--
Ticket URL: <https://www.varnish-cache.org/trac/ticket/1823>
Varnish <https://varnish-cache.org/>
The Varnish HTTP Accelerator

_______________________________________________
varnish-bugs mailing list
varnish-bugs@varnish-cache.org
https://www.varnish-cache.org/lists/mailman/listinfo/varnish-bugs
Re: #1823: vcl in discarded mode does not clear up [ In reply to ]
#1823: vcl in discarded mode does not clear up
-----------------------------+--------------------
Reporter: hamed.gholamian | Owner:
Type: defect | Status: new
Priority: normal | Milestone:
Component: varnishd | Version: 4.1.0
Severity: normal | Resolution:
Keywords: |
-----------------------------+--------------------
Changes (by fgsch):

* component: build => varnishd
* milestone: Varnish 4.1-TP1 =>


--
Ticket URL: <https://www.varnish-cache.org/trac/ticket/1823#comment:1>
Re: #1823: vcl in discarded mode does not clear up [ In reply to ]
#1823: vcl in discarded mode does not clear up
-----------------------------+--------------------
Reporter: hamed.gholamian | Owner:
Type: defect | Status: new
Priority: normal | Milestone:
Component: varnishd | Version: 4.1.0
Severity: normal | Resolution:
Keywords: |
-----------------------------+--------------------
Description changed by fgsch:

New description:

Hi,

We have a problem with Varnish holding onto VCLs in the discarded state
because references to them persist.

We timestamp each VCL name with the date/time it was loaded, using a
simple script built around varnishadm commands, and we should only ever
retain 3 'available' configs.

The problem we are experiencing is with the 'discarded' VCLs, specifically
their backend polling, which is being duplicated. This causes some
customers to see high volumes of probes from our delivery servers, which
in turn affects their backend systems.

Example output from varnishadm vcl.list:


{{{
discarded auto/cooling 52 vcl_20151125095333
discarded auto/cooling 11 vcl_20151125122048
discarded auto/cooling 136 vcl_20151125122328
discarded auto/cooling 24 vcl_20151125140126
discarded auto/cooling 1 vcl_20151125143340
discarded auto/cooling 1 vcl_20151125144531
discarded auto/cooling 12 vcl_20151125145312
discarded auto/cooling 9 vcl_20151125154306
discarded auto/cooling 2 vcl_20151125162540
available auto/cold 0 vcl_20151126140831
available auto/cold 0 vcl_20151126141010
available auto/cold 0 vcl_20151126141620
active auto/warm 22 vcl_20151126142113
}}}


The reference counts never decrease, leaving the 'discarded' VCLs on the
delivery server.

We have been able to reproduce this in our development environment with
the following steps:

1. On the Varnish delivery server, define a backend (with no probe)
pointing at a remote host serving a simple website.
2. On the backend, add an iptables rule to drop port 80.
3. Send a number of requests (around 1000) to the Varnish delivery server.
4. Reload the VCL a number of times.


{{{
OS: 12.04 3.2.0-70-generic #105-Ubuntu SMP x86_64 x86_64 x86_64 GNU/Linux
Varnish: varnishd (varnish-4.1.0 revision 3041728)

}}}

Other than restarting Varnish, we have not found a way to clear these
references and allow the auto/cooling process to complete and remove the
VCLs.

Is this a possible bug, or is this expected behavior of the v4.1 VMOD when
backends become sick/unreachable?

--

--
Ticket URL: <https://www.varnish-cache.org/trac/ticket/1823#comment:2>
Re: #1823: vcl in discarded mode does not clear up [ In reply to ]
#1823: vcl in discarded mode does not clear up
-----------------------------+--------------------
Reporter: hamed.gholamian | Owner:
Type: defect | Status: new
Priority: normal | Milestone:
Component: varnishd | Version: 4.1.0
Severity: normal | Resolution:
Keywords: |
-----------------------------+--------------------
Description changed by fgsch:

New description:

Hi,

We have a problem with Varnish holding onto VCLs in the discarded state
because references to them persist.

We timestamp each VCL name with the date/time it was loaded, using a
simple script built around varnishadm commands, and we should only ever
retain 3 'available' configs.

The problem we are experiencing is with the 'discarded' VCLs, specifically
their backend polling, which is being duplicated. This causes some
customers to see high volumes of probes from our delivery servers, which
in turn affects their backend systems.

Example output from varnishadm vcl.list:


{{{
discarded auto/cooling 52 vcl_20151125095333
discarded auto/cooling 11 vcl_20151125122048
discarded auto/cooling 136 vcl_20151125122328
discarded auto/cooling 24 vcl_20151125140126
discarded auto/cooling 1 vcl_20151125143340
discarded auto/cooling 1 vcl_20151125144531
discarded auto/cooling 12 vcl_20151125145312
discarded auto/cooling 9 vcl_20151125154306
discarded auto/cooling 2 vcl_20151125162540
available auto/cold 0 vcl_20151126140831
available auto/cold 0 vcl_20151126141010
available auto/cold 0 vcl_20151126141620
active auto/warm 22 vcl_20151126142113
}}}


The reference counts never decrease, leaving the 'discarded' VCLs on the
delivery server.

We have been able to reproduce this in our development environment with
the following steps:

1. On the Varnish delivery server, define a backend (with no probe)
pointing at a remote host serving a simple website.
2. On the backend, add an iptables rule to drop port 80.
3. Send a number of requests (around 1000) to the Varnish delivery server.
4. Reload the VCL a number of times.


{{{
OS: 12.04 3.2.0-70-generic #105-Ubuntu SMP x86_64 x86_64 x86_64 GNU/Linux
Varnish: varnishd (varnish-4.1.0 revision 3041728)

}}}

Other than restarting Varnish, we have not found a way to clear these
references and allow the auto/cooling process to complete and remove the
VCLs.

--

--
Ticket URL: <https://www.varnish-cache.org/trac/ticket/1823#comment:3>
Re: #1823: vcl in discarded mode does not clear up [ In reply to ]
#1823: vcl in discarded mode does not clear up
-----------------------------+--------------------
Reporter: hamed.gholamian | Owner:
Type: defect | Status: new
Priority: normal | Milestone:
Component: varnishd | Version: 4.1.0
Severity: normal | Resolution:
Keywords: |
-----------------------------+--------------------

Comment (by fgsch):

Via IRC: using epoll with default timeouts.

--
Ticket URL: <https://www.varnish-cache.org/trac/ticket/1823#comment:4>
Re: #1823: vcl in discarded mode does not clear up [ In reply to ]
#1823: vcl in discarded mode does not clear up
-----------------------------+--------------------
Reporter: hamed.gholamian | Owner:
Type: defect | Status: new
Priority: normal | Milestone:
Component: varnishd | Version: 4.1.0
Severity: normal | Resolution:
Keywords: |
-----------------------------+--------------------

Comment (by martin):

The reporter has confirmed with netstat that all TCP connections have
terminated. This points towards a refcount leak.
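
For instance (ports are illustrative), a check along these lines comes
back empty while the refcount stays high:

{{{
# no established or half-closed connections left towards the listener or
# the backend, yet vcl.list still reports references
netstat -nat | egrep ':6081|:80 ' | grep -v LISTEN
varnishadm vcl.list
}}}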

--
Ticket URL: <https://www.varnish-cache.org/trac/ticket/1823#comment:5>
Re: #1823: vcl in discarded mode does not clear up [ In reply to ]
#1823: vcl in discarded mode does not clear up
-----------------------------+--------------------
Reporter: hamed.gholamian | Owner:
Type: defect | Status: new
Priority: normal | Milestone:
Component: varnishd | Version: 4.1.0
Severity: normal | Resolution:
Keywords: |
-----------------------------+--------------------

Comment (by martin):

More info:

{{{
13:02:13 < phk> (Please add that info to ticket)
13:02:16 < nan0r> me & lupi are the guys
13:02:35 < phk> nan0r, cool, thanks for participating.
13:03:18 < nan0r> phk: you are welcome
13:03:21 < phk> nan0r, what's the regular traffic level ? (Just to get an
idea
of the frequency of lost references) ?
13:03:23 < lochii> how do we track refcount increments / decrements
13:03:32 < nan0r> us + lochii
13:03:55 < phk> lochii, you shouldn't have to :-)
13:04:13 < phk> lochii, so we don't have any logging for it.
13:04:21 < lochii> so this is the vcl->busy counter?
13:04:25 < phk> however, there may be varnishstat counters that spot this.
13:04:32 < phk> lochii, yes.
13:04:46 -!- hu_made_this_script [Adium@87.253.171.197] has joined
#varnish-hacking
13:04:50 < nan0r> quite a lot of traffic, about 100 backends, 300 req/s,
sessions let me check
13:05:11 < phk> so basically, whenever a worker is doing something, it
holds a
ref on the VCL so that it can call the different methods
in the
same VCL until the transaction is completed.
13:05:38 < phk> are you using dynamic backends or only static backends ?
13:06:02 < lochii> nan0r: ?
13:06:04 < lochii> I think only static
13:06:05 < nan0r> 50 connections / sec
13:06:16 < phk> lochii, any directors at all ?
13:06:18 < nan0r> phk: static backedns only
13:06:42 < lochii> I dont think any directors
13:06:44 < nan0r> all declared in main vcl, not in vmods
13:06:50 < lochii> yes, no vmods
13:06:58 < phk> cool, that eliminates a lot of variables.
}}}

--
Ticket URL: <https://www.varnish-cache.org/trac/ticket/1823#comment:6>
Re: #1823: vcl in discarded mode does not clear up [ In reply to ]
#1823: vcl in discarded mode does not clear up
-----------------------------+--------------------
Reporter: hamed.gholamian | Owner:
Type: defect | Status: new
Priority: normal | Milestone:
Component: varnishd | Version: 4.1.0
Severity: normal | Resolution:
Keywords: |
-----------------------------+--------------------

Comment (by phk):

Examination of the counters indicates that this is requests stuck on the
waiting list.
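
Presumably the waiting-list counters; they can be read directly (field
names as in 4.1, check varnishstat -l on your build):

{{{
# requests parked on a busy objhead vs. requests woken off the waiting
# list; a large, non-closing gap matches the stuck references
varnishstat -1 -f MAIN.busy_sleep -f MAIN.busy_wakeup
}}}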

--
Ticket URL: <https://www.varnish-cache.org/trac/ticket/1823#comment:7>
Re: #1823: vcl in discarded mode does not clear up [ In reply to ]
#1823: vcl in discarded mode does not clear up
-----------------------------+--------------------
Reporter: hamed.gholamian | Owner:
Type: defect | Status: new
Priority: normal | Milestone:
Component: varnishd | Version: 4.1.0
Severity: normal | Resolution:
Keywords: |
-----------------------------+--------------------

Comment (by nan0r):

As promised and discussed on IRC, this is a plan to reproduce the issue
with a minimal number of steps, from a clean environment (in this example
a fresh docker container running the ubuntu:latest image). Please note we
don't need to "reload" varnish to see the "ref count" issue. At the end of
these steps, netstat -na is empty, yet varnishadm still shows a number of
references attached to the boot config. The sysctl -w
net.ipv4.tcp_keepalive_time=15 is there only to clean up the dying TCP
connections more quickly at the end of the test.

*********************************************
* reproduce from scratch, minimal
********************************************
## just to save some time, reduce the tcp keepalive to 15s on the host
# sysctl -w net.ipv4.tcp_keepalive_time=15
# fresh minimal ubuntu 14.04 install, image ubuntu:latest
# docker run -it ubuntu
# (from now on, we are running commands inside the fresh ubuntu container)
# install varnish and apache bench (ab)
apt-get -y install apt-transport-https curl apache2-utils
curl https://repo.varnish-cache.org/GPG-key.txt | apt-key add -
echo "deb https://repo.varnish-cache.org/ubuntu/ trusty varnish-4.1" >>
/etc/apt/sources.list.d/varnish-cache.list
apt-get update
apt-get -y install varnish
# check the varnish version
varnishd -V | grep rev
# varnishd (varnish-4.1.0 revision 3041728)
### set up a minimal config with an unreachable backend
echo "vcl 4.0;" > /etc/varnish/default.vcl
echo "backend nonaccessible_backend { ">> /etc/varnish/default.vcl
echo ".host = \"172.17.0.222\" ;" >> /etc/varnish/default.vcl
echo ".port = \"80\";" >> /etc/varnish/default.vcl
echo "}" >> /etc/varnish/default.vcl
# start varnish
/etc/init.d/varnish start
sleep 3
# show the refcount before the load test
varnishadm vcl.list
ab -n 100000 -c 1000 http://127.0.0.1:6081/ &
sleep 5
kill $!
varnishadm vcl.list
# wait for the keepalive to trigger and clean up dead tcp connections
# count the number of tcp tuples in netstat
netstat -nat | grep tcp
# wait for netstat to clear
sleep 20
# by now netstat should be empty of closing connections
netstat -nat | grep -c CLOSE_WAIT
# = 0, nothing left
netstat -nat | wc -l
# =6
# refcount still high
varnishadm vcl.list
# active auto/warm 957 boot
********************************************

--
Ticket URL: <https://www.varnish-cache.org/trac/ticket/1823#comment:9>
Re: #1823: vcl in discarded mode does not clear up [ In reply to ]
#1823: vcl in discarded mode does not clear up
-----------------------------+--------------------
Reporter: hamed.gholamian | Owner:
Type: defect | Status: new
Priority: normal | Milestone:
Component: varnishd | Version: 4.1.0
Severity: normal | Resolution:
Keywords: |
-----------------------------+--------------------

Comment (by nan0r):

We believe this exists in both Varnish 4.0.x and 4.1.x.

From IRC:
16:59 <nan0r> we think it is related to the way objects are removed from
the waiting list
17:00 <nan0r> somehow the reference to the object is lost, and the object
never gets removed

--
Ticket URL: <https://www.varnish-cache.org/trac/ticket/1823#comment:10>
Re: #1823: vcl in discarded mode does not clear up [ In reply to ]
#1823: vcl in discarded mode does not clear up
-----------------------------+--------------------
Reporter: hamed.gholamian | Owner:
Type: defect | Status: new
Priority: normal | Milestone:
Component: varnishd | Version: 4.1.0
Severity: normal | Resolution:
Keywords: |
-----------------------------+--------------------

Comment (by lochii):

I've been staring at the code for a while and I have a theory about what
is happening.

In the test case we send 1000 requests to the acceptor and, although the
backend is unreachable, we end up with a busy object and thus a waiting
list.

While debugging, I count over 1000 requests being added to the waiting
list, but only 300 being removed before activity ceases (and the acceptor
TCP sockets time out), leaving a high refcount on the objhead.

The 300 actually comes from 100 x 3, as 3 is the default rush exponent
(each hsh_rush() call removes rush_exponent = 3 items from the list, so
the total removed is 3 times the number of calls). Case in point: if we
increase the rush_exponent parameter on the command line, more
waiting-list items are removed for each hsh_rush() call and the situation
improves.
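
(For reference, and assuming the parameter is runtime-settable on this
build, it can also be raised via varnishadm; this only widens each rush
and is a mitigation, not a fix.)

{{{
varnishadm param.show rush_exponent
varnishadm param.set rush_exponent 32
# or at startup: varnishd ... -p rush_exponent=32
}}}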

Now, why is hsh_rush() only called 100 times?

Debugging suggests:

- 25 of these attempts come from the backend fetch error code path,
  vbf_stp_error() -> HSH_Unbusy() -> hsh_rush()
- 25 of these come from cnt_deliver() -> HSH_DerefObjCore() -> hsh_rush()
- 25 of these come from exp_expire() -> HSH_DerefObjCore() -> hsh_rush()
- 25 of these come from VBO_DerefBusyObj() -> HSH_DerefObjCore() ->
  hsh_rush()

It seems that these come from requests that were already in flight before
the waitlisting started. Once the waitlisting starts, all other requests
go onto the list to wait for the above to come and remove them (but this
doesn't happen, as we never reach the required number of hsh_rush()
calls).

I did hope for some epoll magic to come and save the day with the expiring
acceptor sockets, but alas, the sockfds are only checked by a
straightforward poll(), from VTCP_check_hup(), and this only happens after
the request is back off the waiting list (in the HTTP1 FSM)!

I also noticed that hsh_rush() has a ditching mechanism: if it tries to
reschedule the req using SES_Reschedule_Req() and there are no available
workers to schedule it onto, the entire waiting list (and req, and
session) is ditched.

It seems that if we open up this ditching mechanism to the HSH_Unbusy()
that is called from vbf_stp_error(), then when the backend is (or becomes)
dysfunctional, we can stop queuing requests (as we may not be able to
service them any time soon).

Though I'm not sure whether this is a good idea, I wrote a simple patch to
demonstrate it (see attached), which does appear to work and, in the test
case above, resolves the stuck waiting-list (and refcount) issue.

--
Ticket URL: <https://www.varnish-cache.org/trac/ticket/1823#comment:11>
Re: #1823: vcl in discarded mode does not clear up [ In reply to ]
#1823: vcl in discarded mode does not clear up
-----------------------------+--------------------
Reporter: hamed.gholamian | Owner:
Type: defect | Status: new
Priority: normal | Milestone:
Component: varnishd | Version: 4.1.0
Severity: normal | Resolution:
Keywords: |
-----------------------------+--------------------

Comment (by lochii):

{{{
[12:37] <ruben_varnish:#varnish-hacking> right
[12:37] <martin:#varnish-hacking> phk: to me the cause of #1823 isn't
mapped yet. I do not believe it is the SES_Reschedule_Req() that fails and
stops the propagation
[12:37] -beb:#varnish-hacking- #1823: vcl in discarded mode does not clear
up [new] https://www.varnish-cache.org/trac/ticket/1823
[12:40] <martin:#varnish-hacking> and the single DerefObjcore on any
objcore of an objhead should be sufficient to cause the entire list to be
rescheduled
[12:40] <phk:#varnish-hacking> martin, not if we have a "leak" somewhere
that stops the chain.
[12:41] <martin:#varnish-hacking> phk: exactly, so i still believe there
must be a leak somewhere that we haven't spotted
[12:41] <martin:#varnish-hacking> the patch as it is would prob just mask
it
[12:41] <phk:#varnish-hacking> martin, agree. But now we know that to be
the case.
[12:42] <martin:#varnish-hacking> yes, so that is definitive progress
[12:42] <phk:#varnish-hacking> agreed too, I don't think that's the proper
solution. It may still be a good idea to speed up draining the waiting
list though, but it's not a proper fix for the bug.
[12:43] <martin:#varnish-hacking> my huch of friday that I was following
was the single OC ref held by the busyobj, and somehow the initiating
client worker and the fetch worker ref of the busyobj
got screwed up
[12:44] <martin:#varnish-hacking> causing that ref to sit idle
[12:44] <phk:#varnish-hacking> martin, Well, that would be quick to test
if we have a VTC.
[12:45] <martin:#varnish-hacking> yeah
[12:45] <phk:#varnish-hacking> I think we should focus on writing the
VTC...
[12:45] <lochii:#varnish-hacking> martin: re: 'single DerefObjcore' , how
should that work?
[12:45] <lochii:#varnish-hacking> ultimately, it just calls a normal
hsh_rush()
[12:45] <lochii:#varnish-hacking> it isn't doing anything special
[12:45] <martin:#varnish-hacking> which in turn would wake up some thread
somewhere
[12:45] <lochii:#varnish-hacking> yes, it does
[12:45] <lochii:#varnish-hacking> which calls another hsh_rush()
[12:45] <martin:#varnish-hacking> which would have a ref of some objcore
on the same objhead
[12:46] <lochii:#varnish-hacking> but not enough times
[12:46] <lochii:#varnish-hacking> if the wl is big enough
[12:46] <phk:#varnish-hacking> lochii, the point is that "big enough"
shouldn't matter.
[12:46] <martin:#varnish-hacking> which would in turn cause hsh_rush to be
called on the WL of the objhead when it's dereferenced
[12:46] <martin:#varnish-hacking> so the chain is propagated
[12:47] <lochii:#varnish-hacking> right, but the hsh_rush only removes
$rush_exponent items
[12:47] <lochii:#varnish-hacking> from the wl
[12:47] <lochii:#varnish-hacking> in this case, 3
[12:47] <martin:#varnish-hacking> true, but for each of those
$rush_exponent more should be woken
[12:48] <phk:#varnish-hacking> lochii, but each of those should also
remove 3
[12:48] <phk:#varnish-hacking> lochii, it's supposed to be a cascade, and
somewhere it isn't
[12:48] <lochii:#varnish-hacking> hsh_rush() does
wrk->stats->busy_wakeup++;
[12:48] <lochii:#varnish-hacking> but that's about it
[12:50] <lochii:#varnish-hacking> how does (or should) this wakeup work?
[12:50] <martin:#varnish-hacking> lochii: it calls SES_Reschedule_Req()
[12:50] <lochii:#varnish-hacking> yes
[12:50] <martin:#varnish-hacking> that causes a req to be processed that
is already waiting on this objhead
[12:51] <lochii:#varnish-hacking> ok, so that puts a SES_Proto_Req on the
task list
[12:51] <martin:#varnish-hacking> so it must get an objcore ref on the
same objhead
[12:51] <martin:#varnish-hacking> which when deref'ed would call hsh_rush
again
[12:51] <martin:#varnish-hacking> so if those statements are all true, the
list would not stall
[12:52] <lochii:#varnish-hacking> I only ever see 25 SES_Reschedule_Req()s
being called
[12:52] <lochii:#varnish-hacking> for 1000 requests
[12:52] <martin:#varnish-hacking> yeah, and that is wrong
[12:52] <lochii:#varnish-hacking> sorry, 300
[12:52] <phk:#varnish-hacking> are there any failures to get worker
threads ?
[12:53] <phk:#varnish-hacking> that might break the chain
[12:53] <martin:#varnish-hacking> the stats are negative on that
[12:53] <lochii:#varnish-hacking> well, if the reschedule fails, it is
supposed to ditch the wl
[12:54] <lochii:#varnish-hacking> since you cascade the return of
pool_task
[12:57] <lochii:#varnish-hacking> when it calls SES_Proto_Req ->
HTTP1_Session , it ends up in S_STP_H1BUSY
[12:58] <lochii:#varnish-hacking> and this does in turn HSH_DerefObjHead()
and Req_Cleanup()
[12:58] <martin:#varnish-hacking> ah
[12:58] <martin:#varnish-hacking> there you have it
[12:58] <martin:#varnish-hacking> hsh_derefobjhead wouldn't call hsh_rush
[12:59] <lochii:#varnish-hacking> yes
[12:59] <lochii:#varnish-hacking> it doesn't
[12:59] <lochii:#varnish-hacking> I dont know why
[12:59] <lochii:#varnish-hacking> but I've just been guessing by staring
at this code for so long, as to how it is supposed to work
[13:00] <martin:#varnish-hacking> right - so that's the broken link
[13:00] <lochii:#varnish-hacking> ok
[13:00] <martin:#varnish-hacking> the VTCP_check_hup notices the client
has gone away while on the waitinglist
[13:00] <lochii:#varnish-hacking> yes, I saw that
[13:00] <martin:#varnish-hacking> which causes a quick escape of the
handling
[13:00] <lochii:#varnish-hacking> wondering why when epoll waiter is
enabled that we don't add epoll monitoring for these sockets
[13:03] <lochii:#varnish-hacking> ok, testing now with hsh_rush() in the
HSH_DerefObjHead() path
[13:06] <phk:#varnish-hacking> lochii, the waiting list predates the
general waiter
[13:06] <phk:#varnish-hacking> lochii, it's also not obvious that it would
be cheaper.
[13:10] <lochii:#varnish-hacking> ok, seems to be working , albeit slowly
[13:11] <lochii:#varnish-hacking> will keep an eye on it
[13:11] <phk:#varnish-hacking> lochii, slowly in what way ?
[13:11] <lochii:#varnish-hacking> as in, the refcnt is draining
[13:11] <lochii:#varnish-hacking> as the cascade is happening
[13:11] <lochii:#varnish-hacking> and threads are being woken up
[13:11] <lochii:#varnish-hacking> just happening really slowly
[13:12] <phk:#varnish-hacking> lochii so the cascade is not widening ?
[13:12] <lochii:#varnish-hacking> no
[13:12] <lochii:#varnish-hacking> ok, its stuck again
[13:13] <lochii:#varnish-hacking> ah
[13:13] <lochii:#varnish-hacking> objcore refcnt is now zero
[13:13] <lochii:#varnish-hacking> so we are no longer deref objcore
[13:14] <lochii:#varnish-hacking> which means we stopped deref objhead
[13:14] <lochii:#varnish-hacking> but I"ve still got the oh with refcnt of
800+
[13:14] <lochii:#varnish-hacking> so its stalled here now
[13:16] <phk:#varnish-hacking> that sounds strange...
[14:50] <lochii:#varnish-hacking> ok, looks like I've got it working
[14:51] <lochii:#varnish-hacking> if hsh_rush() is called properly in
HSH_DerefObjHead() it does indeed work
[14:51] <lochii:#varnish-hacking> I need to test some stuff with the
private oh
[14:52] <lochii:#varnish-hacking> but generally it works
[14:52] <lochii:#varnish-hacking> so let me re-make the patch
[14:52] <lochii:#varnish-hacking> (though it would be nice to see ditch
functionality, I'll keep it out of this if it isn't needed)

}}}

Another patch follows.

--
Ticket URL: <https://www.varnish-cache.org/trac/ticket/1823#comment:12>
Re: #1823: vcl in discarded mode does not clear up [ In reply to ]
#1823: vcl in discarded mode does not clear up
-----------------------------+--------------------
Reporter: hamed.gholamian | Owner:
Type: defect | Status: new
Priority: normal | Milestone:
Component: varnishd | Version: 4.1.0
Severity: normal | Resolution:
Keywords: |
-----------------------------+--------------------

Comment (by lochii):

OK, two patches attached. 1823.2.patch is less invasive: we call
hsh_rush() directly in HSH_DerefObjHead(), but an extra lock can be
involved with a non-private oh (as the hashing algorithm takes its own
lock to do the refcount decrement). 1823.3.patch moves the hsh_rush() into
the hash algorithm itself; it is more invasive as a result, but saves the
additional locking step.

--
Ticket URL: <https://www.varnish-cache.org/trac/ticket/1823#comment:13>
Re: #1823: vcl in discarded mode does not clear up [ In reply to ]
#1823: vcl in discarded mode does not clear up
-----------------------------+----------------------------------------
Reporter: hamed.gholamian | Owner: Poul-Henning Kamp <phk@…>
Type: defect | Status: closed
Priority: normal | Milestone:
Component: varnishd | Version: 4.1.0
Severity: normal | Resolution: fixed
Keywords: |
-----------------------------+----------------------------------------
Changes (by Poul-Henning Kamp <phk@…>):

* status: new => closed
* owner: => Poul-Henning Kamp <phk@…>
* resolution: => fixed


Comment:

In [0be9404f18efeccd430698837c2806662b32523a]:
{{{
#!CommitTicketReference repository=""
revision="0be9404f18efeccd430698837c2806662b32523a"
Rush the objheader if there is a waiting list when it is deref'ed.

Fixes: #1823
}}}

--
Ticket URL: <https://www.varnish-cache.org/trac/ticket/1823#comment:14>