Mailing List Archive

child process restarts frequently
We have a problem with the child process regularly dying and restarting.

We use Varnish in front of three backends, each serving different types
of content. One of these backends is quite slow, and initially we
thought the problem with the child process was related to this slow
backend. We hoped that upgrading from 1.0.3 to 1.0.4 would solve this,
but unfortunately is has not.

We have a 64-bits machine with a 400G disk cache.

Below I have attached excerpts from what I think is relevant parts of
the log( with some description in between) for one of these restart events:

-------------------------------------

First request with id 138012248 is fetched and inserted in the cache:

76 ReqStart c xxx.xx.xxx.xx 4589 138012248
76 RxRequest c GET
76 RxURL c
/weatherapi/locationforecast/1.1/?lat=59.2082531856325;lon=
9.60892344715165
76 RxProtocol c HTTP/1.1
76 VCL_call c recv
76 VCL_return c lookup
76 VCL_call c hash
76 VCL_return c hash
76 VCL_call c miss
76 VCL_return c fetch
55 BackendXID b 138012248
76 Backend c 55 weatherapi
55 TxRequest b GET
55 TxURL b
/weatherapi/locationforecast/1.1/?lat=59.2082531856325;lon=
9.60892344715165
55 TxProtocol b HTTP/1.1
55 TxHeader b X-Varnish: 138012248
55 TxHeader b X-Forwarded-for:
55 RxProtocol b HTTP/1.1
55 RxStatus b 203
55 RxResponse b Non-Authoritative Information
55 RxHeader b Date: Tue, 03 Jul 2007 14:07:08 GMT
55 RxHeader b Server: Apache/1.3.34 (Debian) mod_perl/1.29
55 RxHeader b Expires: Tue, 03 Jul 2007 15:00:00 GMT
55 RxHeader b Content-length: 115322
55 RxHeader b Content-Type: text/xml; charset=utf-8
76 ObjProtocol c HTTP/1.1
76 ObjStatus c 203
76 ObjResponse c Non-Authoritative Information
76 ObjHeader c Date: Tue, 03 Jul 2007 14:07:08 GMT
76 ObjHeader c Server: Apache/1.3.34 (Debian) mod_perl/1.29
76 ObjHeader c Expires: Tue, 03 Jul 2007 15:00:00 GMT
76 ObjHeader c Content-Type: text/xml; charset=utf-8
76 ObjHeader c Content-Length: 115322
55 BackendReuse b weatherapi
76 TTL c 138012248 RFC 3172 1183471628 1183471628
1183474800 0 0
76 VCL_call c fetch
76 VCL_return c insert
76 Length c 115322
76 TxProtocol c HTTP/1.1
76 TxStatus c 203
76 TxResponse c Non-Authoritative Information
76 ObjHeader c Date: Tue, 03 Jul 2007 14:07:08 GMT
76 ObjHeader c Server: Apache/1.3.34 (Debian) mod_perl/1.29
76 ObjHeader c Expires: Tue, 03 Jul 2007 15:00:00 GMT
76 ObjHeader c Content-Type: text/xml; charset=utf-8
76 ObjHeader c Content-Length: 115322
55 BackendReuse b weatherapi
76 TTL c 138012248 RFC 3172 1183471628 1183471628
1183474800 0 0
76 VCL_call c fetch
76 VCL_return c insert
76 Length c 115322
76 TxProtocol c HTTP/1.1
76 TxStatus c 203
76 TxResponse c Non-Authoritative Information
76 TxHeader c Date: Tue, 03 Jul 2007 14:07:08 GMT
76 TxHeader c Server: Apache/1.3.34 (Debian) mod_perl/1.29
76 TxHeader c Expires: Tue, 03 Jul 2007 15:00:00 GMT
76 TxHeader c Content-Type: text/xml; charset=utf-8
76 TxHeader c Content-Length: 115322
76 TxHeader c X-Varnish: 138012248
76 TxHeader c Age: 0
76 TxHeader c Via: 1.1 varnish
76 ReqEnd c 138012248 1183471628.619347000
1183471628.781244000 0.05796
5000 0.161765000 0.000132000
0 StatAddr xxx.xx.xxx.xxx 0 22038 3349 127106 0 0 18810
37429990 95687
09644


This object is then requested and delivered from cache several times.
Then we see the following in the log, at a time immediately after 14:59:27:


0 ExpPick 138012248
0 VCL_call timeout
0 VCL_return discard

...immediately following this( ca.14:59:33 ) we get:

0 ExpBan 138012248 was banned
0 CLI Rd vcl.load boot /tmp/vcl.XXfdES7y
0 CLI Wr 0 200 Loaded "/tmp/vcl.XXfdES7y" as "boot"

0 CLI Rd vcl.use boot
0 CLI Wr 0 200
0 CLI Rd start
0 CLI Wr 0 200

This pattern repeats itself regularly every 5 - 15 hours. I am not sure
if the request and the ExpBan tag is related to the child process
beeing restarted, but since a restart is always preceded by such a tag
it does seem likely.

The weird thing is that it seems to happen at the exact same time every
day. At 07:59 and 14:59 the child processs is restarted in the manner
shown above. We have a lot of requests which expires at each whole hour,
but I have no idea if this is related in any way.

---------------------------------------------

It would be great if anyone could explain this behaviour? What does the
ExpBan tag really mean? It seems kind of counterintuitive that a request
can be banned after beeing discarded....

Also, please let me know if you need any more information to be able to
look into the problem.


Best regards,
H?vard Futs?ter
child process restarts frequently [ In reply to ]
H?vard Futs?ter <havard.futseter at met.no> writes:
> This object is then requested and delivered from cache several times.
> Then we see the following in the log, at a time immediately after
> 14:59:27:
>
> 0 ExpPick 138012248
> 0 VCL_call timeout
> 0 VCL_return discard

vcl_timeout() is called for every object in cache 30 seconds before it
expires. The default action is "discard". The alternative, "prefetch",
has not been implemented yet.

> ...immediately following this( ca.14:59:33 ) we get:
>
> 0 ExpBan 138012248 was banned
> 0 CLI Rd vcl.load boot /tmp/vcl.XXfdES7y
> 0 CLI Wr 0 200 Loaded "/tmp/vcl.XXfdES7y" as "boot"
> 0 CLI Rd vcl.use boot
> 0 CLI Wr 0 200
> 0 CLI Rd start
> 0 CLI Wr 0 200

The child process crashed while or immediately after recycling the
discarded object.

> The weird thing is that it seems to happen at the exact same time
> every day. At 07:59 and 14:59 the child processs is restarted in the
> manner shown above. We have a lot of requests which expires at each
> whole hour, but I have no idea if this is related in any way.

Does your backend deliver objects which always expire at either 08:00 or
15:00?

DES
--
Dag-Erling Sm?rgrav
Senior Software Developer
Linpro AS - www.linpro.no
child process restarts frequently [ In reply to ]
Thanks for the quick reply!

Dag-Erling Sm?rgrav wrote:
> H?vard Futs?ter <havard.futseter at met.no> writes:
>> This object is then requested and delivered from cache several times.
>> Then we see the following in the log, at a time immediately after
>> 14:59:27:
>>
>> 0 ExpPick 138012248
>> 0 VCL_call timeout
>> 0 VCL_return discard

>> ...immediately following this( ca.14:59:33 ) we get:
>>
>> 0 ExpBan 138012248 was banned
>> 0 CLI Rd vcl.load boot /tmp/vcl.XXfdES7y
>> 0 CLI Wr 0 200 Loaded "/tmp/vcl.XXfdES7y" as "boot"
>> 0 CLI Rd vcl.use boot
>> 0 CLI Wr 0 200
>> 0 CLI Rd start
>> 0 CLI Wr 0 200
>
> The child process crashed while or immediately after recycling the
> discarded object.

>> The weird thing is that it seems to happen at the exact same time
>> every day. At 07:59 and 14:59 the child processs is restarted in the
>> manner shown above. We have a lot of requests which expires at each
>> whole hour, but I have no idea if this is related in any way.
>
> Does your backend deliver objects which always expire at either 08:00 or
> 15:00?

Well, yes. The backend actually delivers objects that expire every hour
sharp. So, a bunch of objects( about 2000 - 2500) will be discarded
simultaneously, 30 seconds before each whole hour.

We have also found an interesting thing in the logs:

Immediately after object 138012248 was banned and the child was
restarted, we see that a request was made for object 138012248 ( or
rather, a request for the same url that this object represents).

Might this perhaps be an indication that Varnish gets into trouble when
it receives a request for a resource while the cached object for this
resource is beeing recycled?


--
Mvh,
H?vard Futs?ter
Telefon: +47 22 96 32 69
Mobil: +47 99 69 61 84
child process restarts frequently [ In reply to ]
H?vard Futs?ter <havard.futseter at met.no> writes:
> Might this perhaps be an indication that Varnish gets into trouble
> when it receives a request for a resource while the cached object for
> this resource is beeing recycled?

Sounds like it. Poul-Henning, could you take a closer look at the
locking in the expiry code?

DES
--
Dag-Erling Sm?rgrav
Senior Software Developer
Linpro AS - www.linpro.no
child process restarts frequently [ In reply to ]
Dag-Erling Sm?rgrav wrote:
> H?vard Futs?ter <havard.futseter at met.no> writes:
>> Might this perhaps be an indication that Varnish gets into trouble
>> when it receives a request for a resource while the cached object for
>> this resource is beeing recycled?
>
> Sounds like it. Poul-Henning, could you take a closer look at the
> locking in the expiry code?
>
> DES

Hi. I have some more information regarding the above mentioned problem:

We now think the restart is caused by a conflict between the purging
routine and the timeout routine.

We have the following scenario:

1. url.purge <regexp>.
2. An object that will be affected by the purge times out and is discarded.
3. AFTER ExpPick, but BEFORE ExpKill, there is a request for the same
object. The object is consequently banned.
4. Immediately after this ban the child process restarts.

Perhaps the expire-system tries to move the object to deathrow twice?
Speculation only...

For testing purposes we have temporarily removed the url.purge command,
and have seen no restart after this was done.


--
Mvh,
H?vard Futs?ter
Telefon: +47 22 96 32 69
Mobil: +47 99 69 61 84
child process restarts frequently [ In reply to ]
Hi. There have not been any replies to my previous mail (seen below),
and I was wondering if this was because my description of the problem
was unclear? Or perhaps I did not supply enough data to find the source
of the problem?

I would gladly have attached more log-data here, but I am afraid there
aren't a whole lot more relevant logs available.

Best regards,
H?vard Futs?ter

H?vard Futs?ter wrote:
> Dag-Erling Sm?rgrav wrote:
>> H?vard Futs?ter <havard.futseter at met.no> writes:
>>> Might this perhaps be an indication that Varnish gets into trouble
>>> when it receives a request for a resource while the cached object for
>>> this resource is beeing recycled?
>> Sounds like it. Poul-Henning, could you take a closer look at the
>> locking in the expiry code?
>>
>> DES
>
> Hi. I have some more information regarding the above mentioned problem:
>
> We now think the restart is caused by a conflict between the purging
> routine and the timeout routine.
>
> We have the following scenario:
>
> 1. url.purge <regexp>.
> 2. An object that will be affected by the purge times out and is discarded.
> 3. AFTER ExpPick, but BEFORE ExpKill, there is a request for the same
> object. The object is consequently banned.
> 4. Immediately after this ban the child process restarts.
>
> Perhaps the expire-system tries to move the object to deathrow twice?
> Speculation only...
>
> For testing purposes we have temporarily removed the url.purge command,
> and have seen no restart after this was done.
>
>


--
Mvh,
H?vard Futs?ter
Telefon: +47 22 96 32 69
Mobil: +47 99 69 61 84
child process restarts frequently [ In reply to ]
H?vard Futs?ter <havard.futseter at met.no> writes:
> Hi. There have not been any replies to my previous mail (seen below),
> and I was wondering if this was because my description of the problem
> was unclear? Or perhaps I did not supply enough data to find the
> source of the problem?

We know what causes it, but it isn't trivial to fix without degrading
performance too much.

DES
--
Dag-Erling Sm?rgrav
Senior Software Developer
Linpro AS - www.linpro.no
child process restarts frequently [ In reply to ]
Hi. I was wondering if there has been any change of status for this
issue ( that is, conflict between purging and timeout)?

Dag-Erling Sm?rgrav wrote:
> H?vard Futs?ter <havard.futseter at met.no> writes:
>> Hi. There have not been any replies to my previous mail (seen below),
>> and I was wondering if this was because my description of the problem
>> was unclear? Or perhaps I did not supply enough data to find the
>> source of the problem?
>
> We know what causes it, but it isn't trivial to fix without degrading
> performance too much.

Does this mean it is difficult to figure out a satisfactory solution at
all? Or that you have a solution, but the implementation would take a
lot of work?

> DES

Would it be helpful if I opened up a bug issue, and reiterated my
problems there?

--
Best regards,
H?vard Futs?ter
child process restarts frequently [ In reply to ]
H?vard Futs?ter <havard.futseter at met.no> writes:
> Dag-Erling Sm?rgrav <des at linpro.no> writes:
> > We know what causes it, but it isn't trivial to fix without degrading
> > performance too much.
> Does this mean it is difficult to figure out a satisfactory solution
> at all? Or that you have a solution, but the implementation would
> take a lot of work?

The former. The quick and easy fix is to add a mutex, but that will
kill performance. We need to find a better solution.

DES
--
Dag-Erling Sm?rgrav
Senior Software Developer
Linpro AS - www.linpro.no