Mailing List Archive

found the cause of my bgpd hangs
There it is. Reentrancy problem, calling zlog from within the SIGHUP
handler.

sighup() in bgpd does various things it shouldn't be doing :( It's too
ugly in there for a simple fix, methinks.

(gdb) bt
#0 0xffffe002 in ?? ()
#1 0x080a0969 in vzlog ()
#2 0x080a0a35 in zlog ()
#3 0x0804a102 in sighup ()
#4 <signal handler called>
#5 0xffffe000 in ?? ()
#6 0x080a0969 in vzlog ()
#7 0x080a0a35 in zlog ()
#8 0x0805c6ce in bgp_update ()
#9 0x0805d75e in bgp_nlri_parse ()
#10 0x08072aac in bgp_update_receive ()
#11 0x08073d4a in bgp_read ()
#12 0x08098e94 in thread_call ()
#13 0x0804a582 in main ()
#14 0x42015704 in __libc_start_main () from /lib/tls/libc.so.6



On Sun, Dec 28, 2003 at 06:20:24PM -0500, buytenh wrote:

> Again it hangs on a mutex (???)..
>
> [root@noc root]# ps ax | grep bgpd | grep -v grep
> 12526 ? S 7:52 /usr/sbin/bgpd -d
> [root@noc root]# strace -p 12526
> futex(0x42133ccc, FUTEX_WAIT, 2, NULL
>
>
>
Re: found the cause of my bgpd hangs
On Sun, 28 Dec 2003, Lennert Buytenhek wrote:

> There it is. Reentrancy problem, calling zlog from within the SIGHUP
> handler.

yum.

> sighup() in bgpd does various things it shouldn't be doing :( It's
> too ugly in there for a simple fix, methinks.

Out of curiosity, where's the HUP coming from?

regards,
--
Paul Jakma paul@clubi.ie paul@jakma.org Key ID: 64A2FF6A
warning: do not ever send email to spam@dishone.st
Fortune:
"When the going gets tough, the tough get empirical."
-- Jon Carroll
Re: found the cause of my bgpd hangs
On Mon, Dec 29, 2003 at 02:03:34AM +0000, Paul Jakma wrote:

> > There it is. Reentrancy problem, calling zlog from within the SIGHUP
> > handler.
>
> yum.

Gotta love being a maintainer :)


> > sighup() in bgpd does various things it shouldn't be doing :( It's
> > too ugly in there for a simple fix, methinks.
>
> Out of curiosity, where's the HUP coming from?

I'm sending it, every hour, to rotate the log files.

I'm not sure whether it explains all the bgpd hangs seen so far,
probably it doesn't. I'm also not sure whether this is the only
bgpd hang I've seen so far.


--L
Re: found the cause of my bgpd hangs
On Mon, 29 Dec 2003, Lennert Buytenhek wrote:

> > Out of curiosity, where's the HUP coming from?
>
> I'm sending it, every hour, to rotate the log files.

err.. SIGHUP causes bgpd to reset and reread its config.. to rotate
logs you want SIGUSR1.

> I'm not sure whether it explains all the bgpd hangs seen so far,
> probably it doesn't. I'm also not sure whether this is the only
> bgpd hang I've seen so far.

Well, if you've been sending it SIGHUP every hour, it explains a lot
:)

The other big bug is in vty output. let some command with lots of
output sit in the pager - bgpd will be blocked. (i /think/ this is
specific to the vtysh UNIX socket.)

> --L

regards,
--
Paul Jakma paul@clubi.ie paul@jakma.org Key ID: 64A2FF6A
warning: do not ever send email to spam@dishone.st
Fortune:
Tomorrow's computers some time next month.
-- DEC
Re: found the cause of my bgpd hangs
On Fri, Jan 02, 2004 at 04:10:38AM +0000, Paul Jakma wrote:

> > > Out of curiosity, where's the HUP coming from?
> >
> > I'm sending it, every hour, to rotate the log files.
>
> err.. SIGHUP causes bgpd to reset and reread its config.. to rotate
> logs you want SIGUSR1.

OK, I'll change that.


> > I'm not sure whether it explains all the bgpd hangs seen so far,
> > probably it doesn't. I'm also not sure whether this is the only
> > bgpd hang I've seen so far.
>
> Well, if you've been sending it SIGHUP every hour, it explains a lot
> :)

I'm not sure what it should explain. It shouldn't just hang?


> The other big bug is in vty output. let some command with lots of
> output sit in the pager - bgpd will be blocked. (i /think/ this is
> specific to the vtysh UNIX socket.)

I don't use vtysh. I'll change the signal to SIGUSR1 and post again
if it still hangs? Or does SIGUSR1 have the same bug?


--L
Re: found the cause of my bgpd hangs
On Fri, 2 Jan 2004, Lennert Buytenhek wrote:

> I'm not sure what it should explain. It shouldn't just hang?

Well, the double signal is still curious, yes. AFAIK the OS should not
deliver a signal while the process is still processing previous
signals. The only flag set is SA_RESTART, if available, which
(afaict) shouldn't cause that problem.

however, HUP does explain why you see your connections go.

> I don't use vtysh. I'll change the signal to SIGUSR1 and post
> again if it still hangs? Or does SIGUSR1 have the same bug?

it might do. however, i have a sneaking suspicion this is a bug
outside of bgpd. We just should not be given a signal while we're
still inside a sighandler as SA_NODEFER is /not/ set for our
sighandlers (afaict). Could you give more details about your system?
Eg, your block shows it waiting on a futex, which means you're using
a very recent linux kernel (2.6 or else vendor customised 2.4, eg
Fedora or recent RH9) and glibc.

> --L

regards,
--
Paul Jakma paul@clubi.ie paul@jakma.org Key ID: 64A2FF6A
warning: do not ever send email to spam@dishone.st
Fortune:
Experiments must be reproducible; they should all fail in the same way.
Re: found the cause of my bgpd hangs
On Sun, 28 Dec 2003, Lennert Buytenhek wrote:

> (gdb) bt
> #0 0xffffe002 in ?? ()
     ^^^^^^^^^^
> #1 0x080a0969 in vzlog ()
> #2 0x080a0a35 in zlog ()
> #3 0x0804a102 in sighup ()
> #4 <signal handler called>
> #5 0xffffe000 in ?? ()
     ^^^^^^^^^^
> #6 0x080a0969 in vzlog ()
> #7 0x080a0a35 in zlog ()
> #8 0x0805c6ce in bgp_update ()

ah wait.. it's a libc reentrancy problem. right...

so we have 2 choices: make the handler set some flag and exit (add
an event thread?) or make zlog signal safe.

former is simpler, but i bet lots of other daemons call zlog from
signal context, so latter might potentially be better.

regards,
--
Paul Jakma paul@clubi.ie paul@jakma.org Key ID: 64A2FF6A
warning: do not ever send email to spam@dishone.st
Fortune:
If you sell diamonds, you cannot expect to have many customers.
But a diamond is a diamond even if there are no customers.
-- Swami Prabhupada
Re: found the cause of my bgpd hangs
On Fri, Jan 02, 2004 at 11:36:27AM +0000, Paul Jakma wrote:

> > I'm not sure what it should explain. It shouldn't just hang?
>
> Well, the double signal is still curious, yes. AFAIK the OS should not
> deliver a signal while the process is still processing previous
> signals. The only flag set is SA_RESTART, if available, which
> (afaict) shouldn't cause that problem.

We didn't see double signals? We saw that the main execution context
(i.e. not a signal) was writing BGP update messages to a logfile,
and while in zlog, a signal came in and tried to call zlog as well.

Either we make the signal handler set a flag (write its signal number
into a pipe or something, and have the main execution context poll
that pipe), or we disable signals during the entire zebra execution
unless we are waiting in the main loop or something.


> however, HUP does explain why you see your connections go.

I certainly don't see my bgp sessions drop every hour. Yes, once
bgpd gets into a fit and hangs, yes, then my bgp sessions timeout,
but I normally send it SIGHUP every hour and bgp sessions stay up.


> > I don't use vtysh. I'll change the signal to SIGUSR1 and post
> > again if it still hangs? Or does SIGUSR1 have the same bug?
>
> it might do. however, i have a sneaking suspicion this is a bug
> outside of bgpd. We just should not be given a signal while we're
> still inside a sighandler as SA_NODEFER is /not/ set for our
> sighandlers (afaict). Could you give more details about your system?
> Eg, your block shows it waiting on a futex, which means you're using
> a very recent linux kernel (2.6 or else vendor customised 2.4, eg
> Fedora or recent RH9) and glibc.

Red Hat 9 with all updates applied. However, see above as to why I
don't think this is relevant.


--L
Re: found the cause of my bgpd hangs
On Fri, Jan 02, 2004 at 11:54:01AM +0000, Paul Jakma wrote:

> ah wait.. it's a libc reentrancy problem. right...
>
> so we have 2 choices: make the handler set some flag and exit (add
> an event thread?) or make zlog signal safe.
>
> former is simpler, but i bet lots of other daemons call zlog from
> signal context, so latter might potentially be better.

iirc it's not just zlog they call from signal context.

i see no reason why zlog should be made reentrancy-safe while keeping
all the other functionality reentrancy-unsafe.

i don't know enough of quagga's internal threading system to comment
in a more intelligent way on this matter, i'm afraid.


--l
Re: found the cause of my bgpd hangs
* Lennert Buytenhek <buytenh@gnu.org> [040102 10:18]:
> On Fri, Jan 02, 2004 at 11:54:01AM +0000, Paul Jakma wrote:
>
> > ah wait.. it's a libc reentrancy problem. right...
> >
> > so we have 2 choices: make the handler set some flag and exit (add
> > an event thread?) or make zlog signal safe.
> >
> > former is simpler, but i bet lots of other daemons call zlog from
> > signal context, so latter might potentially be better.
>
> iirc it's not just zlog they call from signal context.
>
> i see no reason why zlog should be made reentrancy-safe while keeping
> all the other functionality reentrancy-unsafe.

It's even worse than making them re-entrant.

To be called from signal context, they must be async-safe. I.e. nothing
that could block or deadlock...

> i don't know enough of quagga's internal threading system to comment
> in a more intelligent way on this matter, i'm afraid.

Threading isn't relevant.

Signal handlers must be async-safe (i.e. only use async-safe system
calls, and not try and block/take mutexes, etc.)

For a list of async-safe-functions, have a browse to the bottom of:
http://www.opengroup.org/onlinepubs/007904975/functions/xsh_chap02_04.html
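
As an illustration only, here is a sketch of the sort of output routine
that stays within that list: it calls nothing but write(), and counts
the string length by hand rather than relying on a library function.
The name sig_safe_puts is hypothetical; this is not a replacement for
zlog.

```c
#include <errno.h>
#include <unistd.h>

/* Write a NUL-terminated message to fd using only write(), which is on
 * the POSIX async-signal-safe list. No stdio, no malloc, no locks.
 * Returns the number of bytes written, or -1 on error. */
ssize_t sig_safe_puts(int fd, const char *msg)
{
    int saved_errno = errno;   /* do not disturb the interrupted code */
    size_t len = 0;
    ssize_t n;

    while (msg[len] != '\0')
        len++;
    n = write(fd, msg, len);
    errno = saved_errno;
    return n;
}
```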

a.

--
Aidan Van Dyk Create like a god,
aidan@highrise.ca command like a king,
http://www.highrise.ca/ work like a slave.
Re: found the cause of my bgpd hangs
--On Friday, January 2, 2004 7:26 AM -0500 Lennert Buytenhek
<buytenh@gnu.org> wrote:

> I certainly don't see my bgp sessions drop every hour. Yes, once
> bgpd gets into a fit and hangs, yes, then my bgp sessions timeout,
> but I normally send it SIGHUP every hour and bgp sessions stay up.


Lots of people seem to think that SIGHUP drops sessions. I gave up trying
to explain that it doesn't well over a year ago.
Re: found the cause of my bgpd hangs
On Fri, Jan 02, 2004 at 10:28:36AM -0500, Aidan Van Dyk wrote:

> > i don't know enough of quagga's internal threading system to comment
> > in a more intelligent way on this matter, i'm afraid.
>
> Threading isn't relevant.

It certainly is. If you don't want to do the work in signal context you
have to move it to the main context, and how to signal the main context that
a thread came in depends on how the internal threading is implemented.


--L
Re: found the cause of my bgpd hangs
On Fri, Jan 02, 2004 at 06:40:40PM -0500, Lennert Buytenhek wrote:

> > > i don't know enough of quagga's internal threading system to comment
> > > in a more intelligent way on this matter, i'm afraid.
> >
> > Threading isn't relevant.
>
> It certainly is. If you don't want to do the work in signal context you
> have to move it to the main context, and how to signal the main context that
> a thread came in depends on how the internal threading is implemented.

s/a thread came in/a signal came in/

Sorry, it's late here.


--L
Re: found the cause of my bgpd hangs
On Fri, 2 Jan 2004, John Payne wrote:

> --On Friday, January 2, 2004 7:26 AM -0500 Lennert Buytenhek
> <buytenh@gnu.org> wrote:
>
> > I certainly don't see my bgp sessions drop every hour. Yes, once
> > bgpd gets into a fit and hangs, yes, then my bgp sessions timeout,
> > but I normally send it SIGHUP every hour and bgp sessions stay up.
>
>
> Lots of people seem to think that SIGHUP drops sessions. I gave up trying
> to explain that it doesn't well over a year ago.
>

Well, it didn't drop the sessions but, I did find some
"unexpected" behavior.

I went into /usr/local/etc and modified my bgpd.conf adding a "test
peer". I sent a HUP to bgpd. It was unresponsive to telnet 2605 for a
minute or so but, when it answered, the new peer was there.

I then REMOVED the peer from bgpd.conf and sent another HUP. Well, it
didn't REMOVE the peer. It was still there. I verified that I had
actually removed it from the file, sent another HUP... It was still
there. So, it appears that you can addeth but you can't taketh away! ;-)


---
John Fraizer | High-Security Datacenter Services |
President | Dedicated circuits 64k - 155M OC3 |
EnterZone, Inc | Virtual, Dedicated, Colocation |
http://www.enterzone.net/ | Network Consulting Services |
Re: found the cause of my bgpd hangs
John Fraizer wrote:
>
> On Fri, 2 Jan 2004, John Payne wrote:
>
> > --On Friday, January 2, 2004 7:26 AM -0500 Lennert Buytenhek
> > <buytenh@gnu.org> wrote:
> >
> > > I certainly don't see my bgp sessions drop every hour. Yes, once
> > > bgpd gets into a fit and hangs, yes, then my bgp sessions timeout,
> > > but I normally send it SIGHUP every hour and bgp sessions stay up.
> >
> >
> > Lots of people seem to think that SIGHUP drops sessions. I gave up trying
> > to explain that it doesn't well over a year ago.
> >
>
> Well, it didn't drop the sessions but, I did find some
> "unexpected" behavior.
>
> I went into /usr/local/etc and modified my bgpd.conf adding a "test
> peer". I sent a HUP to bgpd. It was unresponsive to telnet 2605 for a
> minute or so but, when it answered, the new peer was there.
>
> I then REMOVED the peer from bgpd.conf and sent another HUP. Well, it
> didn't REMOVE the peer. It was still there. I verified that I had
> actually removed it from the file, sent another HUP... It was still
> there. So, it appears that you can addeth but you can't taketh away! ;-)

You can only remove peers or configuration statements via the CLI.
If you want bgpd to redo your entire config you have to restart it.

I would like bgpd to be able to completely reconfigure itself from
config file upon HUP.

--
Andre
Re: found the cause of my bgpd hangs
--On Friday, January 9, 2004 9:49 AM -0500 John Fraizer
<syscow@EnterZone.Net> wrote:

> I then REMOVED the peer from bgpd.conf and sent another HUP. Well, it
> didn't REMOVE the peer. It was still there. I verified that I had
> actually removed it from the file, sent another HUP... It was still
> there. So, it appears that you can addeth but you can't taketh away! ;-)


Right. bgpd reads the config without clearing the old config first. It's
kinda like a 'conf net'

You can remove with 'no'.