Mailing List Archive

Ospfd looses adjacency
Test setup:
I have a test router with two interfaces in the backbone area: eth0
(10.10.10.11/24) and eth1.101 (a VLAN, 10.88.88.1/24). There are no
neighbors on the eth1.101 side, and one neighbor on the eth0 side,
10.10.10.125 (a Juniper). (The VLAN isn't the cause; I later tested
with a normal Ethernet interface as well.)

After some time (1 - 20 minutes) ospfd just stops sending hellos. See
the attached fragment from the log.

2004/01/31 15:50:24 OSPF: ISM[eth1.101:10.88.88.1]: Timer (Hello timer expire)
2004/01/31 15:50:24 OSPF: make_hello: options: 2, int: eth1.101:10.88.88.1

That is all that gets logged, although detailed debugging is enabled.

I'd blame something else at first (multicast etc.), but the strange
thing is that if I don't put eth1.101 into the area, everything works
fine (ospfd ran for 24h without problems). I'm out of ideas at the
moment.

Has anybody seen similar problems? Any ideas on how to debug this?

Quagga is the latest CVS on Debian stable, kernel 2.4.24.

--
Hasso Tepper
Elion Enterprises Ltd.
WAN administrator
Re: Ospfd looses adjacency [ In reply to ]
Hasso Tepper wrote:
> Testsetup:
> I have testrouter with two interfaces in backbone area eth0
> 10.10.10.11/24 and eth1.101 (vlan) 10.88.88.1/24. In eth1.101 side
> there is no neighbors, in eth0 side there is one neighbor -
> 10.10.10.125(Juniper). (Vlan isn't case, I tested it later with
> normal ethernet interface as well).
>
> After some time (1 - 20 minutes) ospfd just stops sending hello's.

OK. I think I tracked it down, but I don't yet understand file
descriptors, sockets etc. well enough to decide what to blame.
Result of debugging: after some time the socket just isn't ready for
writing (lib/thread.c:737, "if (FD_ISSET (THREAD_FD (thread), fdset))",
returns false), and it remains that way until ospfd is restarted.
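
The readiness test described above can be tried outside ospfd. What
follows is a minimal Python sketch, not Quagga code: is_writable is a
hypothetical helper, and a UDP socket stands in for ospfd's raw OSPF
socket (an assumption, so that the example runs without root).

```python
import select
import socket


def is_writable(sock, timeout=0.0):
    """Return True if select() reports the socket as ready for writing."""
    _, writable, _ = select.select([], [sock], [], timeout)
    return sock in writable


# Assumption: a plain UDP socket stands in for the raw OSPF socket.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
print("ready for writing:", is_writable(sock))
sock.close()
```

The symptom in this thread is that the equivalent check inside ospfd
keeps coming back false once the socket has wedged.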

The problem appears ONLY if some interface is actually down, but ospfd
believes it is up and OSPF is enabled on that interface. If ospfd is
informed that the interface went down, there is no problem.

So, why does it happen? Is this intended, and is the only fix to make
sure that ospfd is informed when an interface goes down? Or is it a
bug somewhere else (libc, kernel)?

In my test case eth1 was administratively up, but the cable wasn't
plugged in. Result: ospfd still thought that the interface was up (it
appeared in "show ip ospf interfaces" as up).
The "link-detect" command helped in this case. If I unplug the cable,
"sho ip ospf int" shows the interface as down.

(How does this link-detect stuff work? Is it Linux only?)

The VLAN case is worse. VLANs are always up in Linux. If I unplug the
cable from eth1, the RUNNING flag is removed from eth1 ("show int"),
but not from the eth1.x VLANs.
Zebra (and therefore ospfd as well) doesn't mark the eth1.x VLANs as
down if eth1 goes down. "link-detect" doesn't help either.

PS. See [zebra 21079], same problem probably.

--
Hasso Tepper
Elion Enterprises Ltd.
WAN administrator
Re: Ospfd looses adjacency [ In reply to ]
> Problem appears ONLY if some interface is down actually, but ospfd has
> info that it's up and ospf is enabled on that interface. If ospfd
> gets info that interface went down, there is no problem.

I suspect that there is a general case here of an interface that is
'up' but can't send packets. So the queue fills up. Then write can
either silently drop the packet or return an error. I think on BSD
you get ENOBUFS from write. I don't know if this gets propagated
back to the select call.

I would argue that if the kernel says packets can't be sent on an
interface and ospfd doesn't send any that this is not a bug in ospfd.
If the kernel says no packets can be sent but they can be, the kernel
has a bug.

But, if this wedges the rest of ospfd, that's bad, and perhaps more
writes need to be non-blocking.

--
Greg Troxel <gdt@ir.bbn.com>
Re: Ospfd looses adjacency [ In reply to ]
Greg Troxel wrote:

>> Problem appears ONLY if some interface is down actually, but ospfd has
>> info that it's up and ospf is enabled on that interface. If ospfd
>> gets info that interface went down, there is no problem.
>
> I suspect that there is a general case here of an interface that is
> 'up' but can't send packets. So the queue fills up. Then write can

Happens often due to flow control. PPP, PPP-like VPN interfaces, other.

> either silently drop the packet or return an error. I think on BSD
> you get ENOBUFS from write. I don't know if this gets propagated
> back to the select call.

AFAIK: BSD does. Linux does not.

> I would argue that if the kernel says packets can't be sent on an
> interface and ospfd doesn't send any that this is not a bug in ospfd.
> If the kernel says no packets can be sent but they can be, the kernel
> has a bug.
>
> But, if this wedges the rest of ospfd, that's bad, and perhaps more
> writes need to be non-blocking.
Re: Ospfd looses adjacency [ In reply to ]
On Thu, 12 Feb 2004, Hasso Tepper wrote:

> Problem appears ONLY if some interface is down actually, but ospfd
> has info that it's up and ospf is enabled on that interface. If
> ospfd gets info that interface went down, there is no problem.

Aha. That makes sense so :)

> So, why it happens? Is it intended and only way to fix this is to
> make sure that ospfd will get info when interface goes down? Or is
> it bug in somewhere else (libc, kernel)?

ospfd does (should at a minimum) honour the IFF_UP flag, by way of
if_is_up() - provided by lib/if::.

Additionally, some systems have IFF_RUNNING, but quality of that flag
is spotty. It is use of this flag which is controlled by link-detect.
See below.

> In my testcase eth1 was administratively up, but cable wasn't
> plugged in. Result - ospfd still thought that interface was up (it
> appeared in "show ip ospf interfaces" as up). Command "link-detect"
> helped in this case. If I plug off cable "sho ip ospf int" shows
> interface as down.
>
> (How this link-detect stuff works? Is it Linux only?)

It works on any system with IFF_RUNNING I think. Essentially it adds:

if_is_operative()

Which, if link-detect is enabled, makes use of the IFF_RUNNING flag.
If link-detect is not enabled, it just uses IFF_UP.
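
As a rough illustration of the logic just described (a Python sketch,
not the actual Quagga code): the flag values below are the Linux ones,
and the link_detect parameter is an assumed stand-in for the
per-interface link-detect setting.

```python
# Flag values as defined on Linux; other systems may differ.
IFF_UP = 0x1
IFF_RUNNING = 0x40


def if_is_up(flags):
    """Administrative state: is IFF_UP set?"""
    return bool(flags & IFF_UP)


def if_is_operative(flags, link_detect):
    """Sketch of the described behaviour.

    With link-detect enabled the interface must be both UP and RUNNING;
    without it, IFF_UP alone decides.
    """
    if not if_is_up(flags):
        return False
    if link_detect:
        return bool(flags & IFF_RUNNING)
    return True


# Cable unplugged on an administratively-up interface: UP without RUNNING.
unplugged = IFF_UP
print("without link-detect:", if_is_operative(unplugged, link_detect=False))
print("with link-detect:   ", if_is_operative(unplugged, link_detect=True))
```

This matches the report earlier in the thread: without link-detect the
unplugged eth1 still counts as up, with link-detect it does not.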

Whether to enable it is up to the administrator, depending on whether
IFF_RUNNING is reliable, which often depends on drivers setting this
flag correctly. Indeed, I'm not even certain whether IFF_RUNNING
== link detection is valid on all platforms.

> Vlan case is worse. Vlan's are always up in Linux. If I plug off
> cable from eth1, RUNNING flag is removed from eth1("show int"), but
> not from eth1.x vlan's. Zebra (and therefore ospfd as well) doesn't
> mark eth1.x vlan's as down if eth1 goes down. "link-detect" doesn't
> help either.

That would be a problem for the 8021q linux people so.

> PS. See [zebra 21079], same problem probably.

regards,
--
Paul Jakma paul@clubi.ie paul@jakma.org Key ID: 64A2FF6A
warning: do not ever send email to spam@dishone.st
Fortune:
Cleanliness becomes more important when godliness is unlikely.
-- P.J. O'Rourke
Re: Ospfd looses adjacency [ In reply to ]
Paul Jakma wrote:

[snip IFF_RUNNING and link-detect stuff] - it's clear

> > Vlan case is worse. Vlan's are always up in Linux. If I plug off
> > cable from eth1, RUNNING flag is removed from eth1("show int"),
> > but not from eth1.x vlan's. Zebra (and therefore ospfd as well)
> > doesn't mark eth1.x vlan's as down if eth1 goes down.
> > "link-detect" doesn't help either.
>
> That would be a problem for the 8021q linux people so.

Yes. I'll contact them. But it sounds more like a workaround to me.

The main problem, though, is what can be done in the quagga code to
handle the situation. It's not acceptable IMHO that ospfd loses all
adjacencies if one link goes down and stays down for more than 20
minutes :).

--
Hasso Tepper
Elion Enterprises Ltd.
WAN administrator
Re: Ospfd looses adjacency [ In reply to ]
On Thu, 12 Feb 2004, Hasso Tepper wrote:

> Main problem is though what can be done in quagga code to handle
> situation. It's not acceptable IMHO that ospfd looses all
> adjacencies if one link goes down and stays there more than 20
> minutes :).

Loses _all_ adjacencies?

Definitely not right that, agreed. But how does that happen?

regards,
--
Paul Jakma paul@clubi.ie paul@jakma.org Key ID: 64A2FF6A
warning: do not ever send email to spam@dishone.st
Fortune:
Repartee is something we think of twenty-four hours too late.
-- Mark Twain
Re: Ospfd looses adjacency [ In reply to ]
Paul Jakma wrote:
> On Thu, 12 Feb 2004, Hasso Tepper wrote:
> > Main problem is though what can be done in quagga code to handle
> > situation. It's not acceptable IMHO that ospfd looses all
> > adjacencies if one link goes down and stays there more than 20
> > minutes :).
>
> Looses _all_ adjacencies?

Yes.

> Definitely not right that, agreed. But how does that happen?

It uses the same file descriptor to read/write packets no matter which
interface is used. If it's not ready for writing, ospfd stops sending
all packets.
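
Why one dead link can silence every interface can be shown with a toy
model. This is a Python sketch under a stated assumption, not a
description of real kernel behaviour: once a packet towards a dead link
is handed to a socket, that socket is never reported writable again.
The function name and parameters are made up for illustration.

```python
def hellos_in_last_round(interfaces, link_ok, per_interface_sockets, rounds=3):
    """Toy model: return the interfaces whose hello went out in the last round.

    Assumption: a packet sent towards a dead link is never drained, and the
    socket that carried it is never reported writable again.
    """
    stuck = set()  # sockets that select() no longer reports as writable
    sent = []
    for _ in range(rounds):
        sent = []
        for ifname in interfaces:
            sock = ifname if per_interface_sockets else "shared"
            if sock in stuck:
                continue  # the write never happens for this socket
            if link_ok[ifname]:
                sent.append(ifname)
            else:
                stuck.add(sock)
    return sent


interfaces = ["eth0", "eth1.101"]
link_ok = {"eth0": True, "eth1.101": False}
print("shared socket:       ", hellos_in_last_round(interfaces, link_ok, False))
print("socket per interface:", hellos_in_last_round(interfaces, link_ok, True))
```

In this model the shared socket ends up sending nothing at all, while
with one socket per interface only the dead link is affected.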

--
Hasso Tepper
Elion Enterprises Ltd.
WAN administrator
Re: Ospfd looses adjacency [ In reply to ]
Hasso Tepper wrote:

> Paul Jakma wrote:
>> On Thu, 12 Feb 2004, Hasso Tepper wrote:
>>> Main problem is though what can be done in quagga code to handle
>>> situation. It's not acceptable IMHO that ospfd looses all
>>> adjacencies if one link goes down and stays there more than 20
>>> minutes :).
>>
>> Looses _all_ adjacencies?
>
> Yes.
>
>> Definitely not right that, agreed. But how does that happen?
>
> It uses same filedescriptor to read/write packets no matter what
> interface is used. If it's not ready for writing, ospfd stops sending
> all packets.

This should be per interface. My 2p


--

A. R. Ivanov
E-mail mailto:arivanov@sigsegv.cx
pub 1024D/DDE5E715 2002-03-03 Anton R. Ivanov <arivanov@sigsegv.cx>
Fingerprint: C824 CBD7 EE4B D7F8 5331 89D5 FCDA 572E DDE5 E715
pub 2048R/DB33CE33 1999-01-05 Anton Ivanov <anton.ivanov@level3.com>
Fingerprint: 11 2C 68 F3 79 58 58 58 56 F5 94 E6 F3 2B 3B 7A
Re: Ospfd looses adjacency [ In reply to ]
> Additionally, some systems have IFF_RUNNING, but quality of that flag
> is spotty. It is use of this flag which is controlled by link-detect.
> See below.

In BSD, IFF_RUNNING means that 'resources have been allocated for the
interface', or something. It is not necessarily an indication of
whether packets will be transmitted.

--
Greg Troxel <gdt@ir.bbn.com>
Re: Ospfd looses adjacency [ In reply to ]
> It uses same filedescriptor to read/write packets no matter what
> interface is used. If it's not ready for writing, ospfd stops sending
> all packets.

Then perhaps either:

set nonblocking, and send even if select says no,

or

open one socket per interface


In BSD, I think that packets are not queued in the socket, but get put
on the interface output queues or dropped. So you get ENOBUFS on
sending (due to hitting the queue max size), but this doesn't affect
sending to other interfaces.
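
The first option above (non-blocking, and treating errors as feedback)
might look roughly like this. It is a Python sketch, not ospfd code:
try_send is a hypothetical helper, and a UDP socket on the loopback
interface is an assumption standing in for the raw OSPF socket.

```python
import errno
import socket


def try_send(sock, payload, dest):
    """Try one send on a non-blocking socket and classify the outcome."""
    try:
        sock.sendto(payload, dest)
        return "sent"
    except BlockingIOError:
        # EAGAIN/EWOULDBLOCK: the kernel would have blocked the caller.
        return "would-block"
    except OSError as exc:
        if exc.errno == errno.ENOBUFS:
            # Output queue full: the packet was not queued.
            return "no-buffers"
        raise


# Demo on the loopback interface (assumed to be available).
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))
sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.setblocking(False)
print("outcome:", try_send(sender, b"hello", receiver.getsockname()))
sender.close()
receiver.close()
```

With this shape the caller can decide per packet whether to retry or to
drop, instead of waiting for select to report the socket as writable.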

--
Greg Troxel <gdt@ir.bbn.com>
Re: Ospfd looses adjacency [ In reply to ]
On Fri, 13 Feb 2004, Greg Troxel wrote:

> In BSD, IFF_RUNNING means that 'resources have been allocated for the
> interface', or semothing. It is not necessarily an indication of
> whether packets will be transmitted.

On Linux an interface's link-status is indicated via IFF_RUNNING, if
the driver supports link-status reporting and does it properly.
IFF_UP indicates the system (eg administrative) state of an
interface.

Actually, it'd be most useful to build a matrix of how systems, which
support link-status reporting via interface flags, set the flags for
various conditions, ie which flags zebra reports under "show
interface"

system   iface up       iface up   iface down     iface down
         link present   no link    link present   no link

Linux    <UP,RUNNING>   <UP>       <?>            <?>

For now, the 'link-detect' flag should allow admins to enable use of
the RUNNING flag on interfaces which support use of this flag as a
link-status flag.
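
For anyone filling in such a matrix, raw flag words can be turned into
the notation used above with a few lines. This is a Python sketch using
the Linux flag values; format_flags is a hypothetical helper and only
knows the two flags discussed here.

```python
# Flag values as defined on Linux; other systems may differ.
IFF_UP = 0x1
IFF_RUNNING = 0x40

FLAG_NAMES = ((IFF_UP, "UP"), (IFF_RUNNING, "RUNNING"))


def format_flags(flags):
    """Render the UP/RUNNING bits of a flag word as <UP,RUNNING>."""
    names = [name for bit, name in FLAG_NAMES if flags & bit]
    return "<" + ",".join(names) + ">"


print(format_flags(IFF_UP | IFF_RUNNING))
print(format_flags(IFF_UP))
```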

regards,
--
Paul Jakma paul@clubi.ie paul@jakma.org Key ID: 64A2FF6A
warning: do not ever send email to spam@dishone.st
Fortune:
I've always considered statesmen to be more expendable than soldiers.
Re: Ospfd looses adjacency [ In reply to ]
On Fri, 13 Feb 2004, Greg Troxel wrote:

> Then perhaps either:
>
> set nonblocking, and send even if select says no,

Which should then return EWOULDBLOCK or EAGAIN, without sending.

> or
>
> open one socket per interface

Could do that.

> In BSD, I think that packets are not queued in the socket, but get
> put on if output queues or dropped. So you get ENOBUFS on sending
> (due to hitting queue max size), but this doesn't affect sending to
> other interfaces.

It's strange really. Surely we should get either an error or the OS
should drop the packet (sendmsg() being unreliable).

I guess socket per interface would fix this.

regards,
--
Paul Jakma paul@clubi.ie paul@jakma.org Key ID: 64A2FF6A
warning: do not ever send email to spam@dishone.st
Fortune:
All the really good ideas I ever had came to me while I was milking a cow.
-- Grant Wood
Re: Ospfd looses adjacency [ In reply to ]
On Thu, 12 Feb 2004, Hasso Tepper wrote:

> It uses same filedescriptor to read/write packets no matter what
> interface is used. If it's not ready for writing, ospfd stops
> sending all packets.

I find it strange that a raw socket would block though. Shouldn't the
packet just be discarded if it cannot be sent? (sendmsg() is
explicitly unreliable.)

Anyway, find attached a (barely tested) patch to move ospfd from a
global fd to fd-per-oi. Whether this is a good thing or not depends
on whether Linux /should/ block writes and/or whether other systems
behave similarly.

regards,
--
Paul Jakma paul@clubi.ie paul@jakma.org Key ID: 64A2FF6A
warning: do not ever send email to spam@dishone.st
Fortune:
What this country needs is a good five dollar plasma weapon.
Re: Ospfd looses adjacency [ In reply to ]
> anyway, find attached a (barely tested) patch to move ospfd from
> global fd to fd-per-oi. Whether this is a good thing or not depends
> on whether linux /should/ block writes and/or whether other systems
> behave similarly.

Linux shouldn't. I believe other OSes don't.

But, this change (which I didn't find attached) should make the code
more robust against various behaviors. Plus, the sockets should be
set to non-blocking.

--
Greg Troxel <gdt@ir.bbn.com>
Re: Ospfd looses adjacency [ In reply to ]
On Wed, 18 Feb 2004, Greg Troxel wrote:

> linux shouldn't. I believe other OSes don't.

Why shouldn't it?

> But, this change (which I didn't find attached) should make the
> code more robust against various behaviors. Plus, the sockets
> should be set to non-blocking.

Maybe, yes. But that wouldn't fix the problem here though, which is
that the socket blocks /after/ we've sendmsg()'d down it[1]. There's
no way to deal with that.

1. Hasso: I'm presuming you don't get:

*** sendmsg in ospf_write to .... failed with <error>

in the logs?

regards,
--
Paul Jakma paul@clubi.ie paul@jakma.org Key ID: 64A2FF6A
warning: do not ever send email to spam@dishone.st
Fortune:
I tell ya, I was an ugly kid. I was so ugly that my dad kept the kid's
picture that came with the wallet he bought.
-- Rodney Dangerfield
Re: Ospfd looses adjacency [ In reply to ]
On Thu, 19 Feb 2004, Paul Jakma wrote:

> Maybe, yes. But that wouldnt fix the problem here though, which is
> that the socket blocks /after/ we've sendmsg()'d down it[1].
> There's no way to deal with that.

Err.. by use of non-blocking sockets at least.

regards,
--
Paul Jakma paul@clubi.ie paul@jakma.org Key ID: 64A2FF6A
warning: do not ever send email to spam@dishone.st
Fortune:
Money isn't everything -- but it's a long way ahead of what comes next.
-- Sir Edmond Stockdale
Re: Ospfd looses adjacency [ In reply to ]
>> linux shouldn't. I believe other OSes don't.
>
> Why shouldnt it?

If you have a datagram socket and send messages to different
destinations, there should be no queueing in the socket, because it
causes head of line blocking, which is never what one wants.
I believe that the right behavior is for the socket layer to generate
a UDP packet and ask the IP part of the stack to deliver it, with the
behavior being that it gets dropped if it can't be queued. It's
perfectly reasonable to have sendto/sendmsg return ENOBUFS in this
case - that's useful feedback to the application.

> Maybe, yes. But that wouldnt fix the problem here though, which is
> that the socket blocks /after/ we've sendmsg()'d down it[1]. There's
> no way to deal with that.

Good point. But if they weren't nonblocking, it would be at least
somewhat reasonable for a kernel to block in sendmsg, waiting until
the output succeeded and retrying it once a second. But probably no
one would implement this in a kernel, since it is almost never the
desired behavior.
Re: Ospfd looses adjacency [ In reply to ]
On Thu, 19 Feb 2004, Greg Troxel wrote:

> behavior being that it gets dropped if it can't be queued. It's
> perfectly reasonable to have sendto/sendmsg return ENOBUFS in this
> case - that's useful feedback to the application.

Sure, but it doesn't :)

> Good point. But if they weren't nonblocking, it would be at least
> somewhat reasonable for a kernel to block in sendmsg, waiting until
> the output succeeded and retrying it once a second. But probably no
> one would implement this in a kernel, since it is almost never the
> desired behavior.


No, that would be exactly the desired behaviour. It would allow us to
deal with the problem. Either by retrying or discarding the packet
ourselves.

At present the kernel does _not_ put ospfd to sleep, yet the kernel
_does_ block the socket for writes (rather than discarding the packet).
Silly behaviour from the kernel.


--paulj
Re: Ospfd looses adjacency [ In reply to ]
Paul Jakma wrote:
> At present the kernel does _not_ put ospfd to sleep, yet the kernel
> _does_ block the socket for writes (rather than discarding the
> packet). Silly behaviour from the kernel.

It seems to be related to multicast. I defined eth1.101 (the VLAN) as
a non-broadcast interface and specified a neighbor behind that
interface => unicast hellos are sent. Ospfd ran for 24h, no problems.

It would be nice to write a small test case for the kernel network
developers IMHO. But it should be done by someone who knows more about
this stuff than I do ;).

--
Hasso Tepper
Elion Enterprises Ltd.
WAN administrator
Re: Ospfd looses adjacency [ In reply to ]
On Fri, 20 Feb 2004, Hasso Tepper wrote:

> Seems that it is related to multicast. I defined eth1.101 (vlan) as
> non-brodcast interface and specified neighbor behind that interfaces
> => unicast hellos are sent. Ospfd worked 24h, no problems.

Can you replicate on plain interfaces? (ie to rule out VLAN).

> It would be nice to write small testcase for kernel network developers
> IMHO. But it should be done by someone who has more knowledge than me
> about this stuff ;).

Possible, but I don't have access to vlans :)

--paulj
Re: Ospfd looses adjacency [ In reply to ]
On Fri, 20 Feb 2004, Greg Troxel wrote:

> Then the program would be stuck in sendmsg forever.

Right, but we could deal with that by setting the socket to non-blocking
:)

The (linux) kernel does not let us deal with it in any way - bad.

> Absolutely. What if we just ignore the select return and send anyway?

That would involve fuglyfying the write thread code. no thanks :)

--paulj
Re: Ospfd looses adjacency [ In reply to ]
>> Good point. But if they weren't nonblocking, it would be at least
>> somewhat reasonable for a kernel to block in sendmsg, waiting until
>> the output succeeded and retrying it once a second. But probably no
>> one would implement this in a kernel, since it is almost never the
>> desired behavior.
>
> No, that would be exactly the desired behaviour. It would allow us to
> deal with the problem. Either by retrying or discarding the packet
> ourselves.

Then the program would be stuck in sendmsg forever.
Basically I think having any blocking on a socket that sends to
multiple places which can independently block or not block makes no
sense.

> At present the kernel does _not_ put ospfd to sleep, yet the kernel
> _does_ block the socket for writes (rather than discarding the packet).
> Silly behaviour from the kernel.

Absolutely. What if we just ignore the select return and send anyway?
Re: Ospfd looses adjacency [ In reply to ]
On Fri, 20 Feb 2004, Greg Troxel wrote:

> Exactly; hence my comment that while this behavior sort of makes
> sense, no one should actually want it.

sendmsg() is documented as being able to return EAGAIN, so if the kernel
thinks it will block (eg link state down), it should either block the
process/return EAGAIN or throw the packet away - not the
sort-of-one-and-half-the-other behaviour presently.

> OK, well I guess then someone should file a linux bug report and we
> should not worry about it.

Tried to. Didn't generate much interest in linux-kernel.

Hasso, feel like nagging some more? :)

--paulj
Re: Ospfd looses adjacency [ In reply to ]
>> Then the program would be stuck in sendmsg forever.
>
> Right, but we could deal with that by setting the socket to non-blocking
> :)

Exactly; hence my comment that while this behavior sort of makes
sense, no one should actually want it.

>> Absolutely. What if we just ignore the select return and send anyway?
>
> That would involve fuglyfying the write thread code. no thanks :)

OK, well, I guess then someone should file a Linux bug report and we
should not worry about it.