Mailing List Archive

Bounty for netlink/zebra desync problem
Hello,

The issue with Zebra being desynchronized from the kernel because of lost
netlink messages is really causing problems for me.

To help expedite this, I'll put a $250 bounty on the fix.

Fix should include:
* automatic detection that message was lost and resynchronizing
* a way to initiate resync manually, whether by signal or command-line

This should be for *both* routing messages and interface IP address
changes.

--
Alex Pilosov | DSL, Colocation, Hosting Services
President | alex@pilosoft.com (800) 710-7031
Pilosoft, Inc. | http://www.pilosoft.com
Re: Bounty for netlink/zebra desync problem
alex@pilosoft.com wrote:
> This should be for *both* routing messages and interface IP address
> changes.

Interface (dis)appearing as well, I think ... :)

--
Hasso Tepper
Elion Enterprises Ltd.
WAN administrator
Re: Bounty for netlink/zebra desync problem
On Thu, 5 Aug 2004 alex@pilosoft.com wrote:

> To help expedite this, I'll put a $250 bounty on the fix.

wowser :)

> Fix should include:
> * automatic detection that message was lost and resynchronizing

I.e., when an ENOBUFS errno is encountered in
rt_netlink.c::netlink_parse_info().
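
A rough sketch of what such a check could look like is below. It is not the
actual zebra code, and the real netlink_parse_info() is not reproduced
here: netlink_errno_needs_resync, netlink_read_once and resync_needed are
hypothetical names. It relies on the Linux behaviour that a read on a
netlink socket fails with ENOBUFS when the kernel had to drop messages
because the socket's receive buffer was full.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical flag: true when our view of the kernel state may be stale. */
bool resync_needed = false;

/* Classify the errno of a failed read on a netlink socket.
 * ENOBUFS means the kernel dropped messages, so a full resync is needed. */
bool netlink_errno_needs_resync(int err)
{
    return err == ENOBUFS;
}

/* Read one datagram from fd. On ENOBUFS, record that a resync is needed
 * instead of treating the error as fatal. Returns recv()'s result. */
ssize_t netlink_read_once(int fd, void *buf, size_t len)
{
    ssize_t n;

    assert(buf != NULL);
    n = recv(fd, buf, len, 0);
    if (n < 0 && netlink_errno_needs_resync(errno))
        resync_needed = true;   /* caller would re-dump routes, addresses, links */
    return n;
}
```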

> * a way to initiate resync manually, whether by signal or command-line

This shouldn't be necessary, if netlink can resync when needed.

> This should be for *both* routing messages and interface IP address
> changes.

Note that Hasso has a patch:

http://hasso.linux.ee/quagga/pending-patches/ht-20040512-netlink-rcvbuf.patch

with which one can mitigate the problem by specifying a bigger
receive buffer for the netlink socket on the zebra command line.
Indeed, with a big enough receive buffer, the need for troublesome
resyncs disappears.
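
For illustration, enlarging a socket's receive buffer comes down to a
setsockopt() call with SO_RCVBUF. The sketch below is not taken from the
patch above; set_rcvbuf is a hypothetical helper name. It asks for a buffer
size and reports what the kernel actually granted.

```c
#include <assert.h>
#include <sys/socket.h>

/* Request a receive buffer of `bytes` on socket `fd` and return the size the
 * kernel actually granted, or -1 on error. On Linux the granted value can
 * differ from the request: the kernel doubles it for bookkeeping overhead
 * and clamps requests to the net.core.rmem_max limit. */
int set_rcvbuf(int fd, int bytes)
{
    int granted = 0;
    socklen_t len = sizeof(granted);

    assert(bytes > 0);
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes)) < 0)
        return -1;
    if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &granted, &len) < 0)
        return -1;
    return granted;
}
```

For a routing daemon, fd would be the socket opened with
socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE). Comparing the return value
with the requested size shows whether the system limit cut the request
short.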

regards,
--
Paul Jakma paul@clubi.ie paul@jakma.org Key ID: 64A2FF6A
Fortune:
Reisner's Rule of Conceptual Inertia:
If you think big enough, you'll never have to do it.
Re: Bounty for netlink/zebra desync problem
> > To help expedite this, I'll put a $250 bounty on the fix.
>
> wowser :)
>
> > Fix should include: * automatic detection that message was lost and
> > resynchronizing
>
> I.e., when an ENOBUFS errno is encountered in
> rt_netlink.c::netlink_parse_info().
Yeah

> > * a way to initiate resync manually, whether by signal or command-line
>
> This shouldn't be necessary, if netlink can resync when needed.
I don't trust it. :)

> > This should be for *both* routing messages and interface IP address
> > changes.
>
> Note that Hasso has a patch:
>
>
> http://hasso.linux.ee/quagga/pending-patches/ht-20040512-netlink-rcvbuf.patch
>
> with which one can mitigate the problem by specifying a bigger receive
> buffer for the netlink socket on the zebra command line. Indeed, with a
> big enough receive buffer, the need for troublesome resyncs
> disappears.
I'm not so sure. I think [correct me if I am wrong] that the max rcv buffer
is 64k. I think desyncs happen when an external routing event occurs at the
same time as Zebra is adding or deleting a large batch of routes. I am not
sure if 64k is enough to cover that.

Although, I'm going to apply Hasso's patch - it might solve my
problem... Bounty still stands ;)

-alex
Re: Bounty for netlink/zebra desync problem
alex@pilosoft.com wrote:
> I'm not so sure. I think [correct me if I am wrong] that the max rcv
> buffer is 64k. I think desyncs happen when an external routing event
> occurs at the same time as Zebra is adding or deleting a large batch
> of routes. I am not sure if 64k is enough to cover that.

You can change max via /proc/sys/net/core/rmem_max.

> Although, I'm going to apply Hasso's patch - it might solve my
> problem... Bounty still stands ;)

My patch is a temporary solution, a hack, etc. It might help, but I
don't feel comfortable using it even myself ;). I would be more than
glad to see resync implemented as well.

--
Hasso Tepper
Elion Enterprises Ltd.
WAN administrator
Re: Bounty for netlink/zebra desync problem
On Wed, 11 Aug 2004 alex@pilosoft.com wrote:

> I don't trust it. :)

A 'fixed' netlink that would need manual syncs is by definition not
fixed ;). We can detect the need for a resync via ENOBUFS.

> I'm not so sure. I think [correct me if I am wrong] that the max rcv
> buffer is 64k.

Nah, you can set it much higher, into the MBs if you need.

> I think desyncs happen when an external routing event occurs at the
> same time as Zebra is adding or deleting a large batch of routes. I
> am not sure if 64k is enough to cover that.

Right. And part of the solution here will be work queues, to allow
long rounds of deletes (e.g. due to a BGP peer going down) to be broken
up into smaller units of work and to allow the thread system to
schedule the netlink reader in between.

This would also fix the biggest problem with lots of interfaces: when
zebra starts up, it can generate netlink commands while reading the
config file, but it won't read the netlink socket until that's done,
so kernel-generated ACKs and other resulting netlink broadcasts build
up and overflow the socket. Work queues again should fix this.
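
To make the idea concrete, here is a minimal sketch of slicing a long batch
of work so that a reader can run in between. It is only an illustration
under stated assumptions, not a proposed zebra API: struct work_queue,
work_queue_run, drain_interleaved and the poll_reader callback are all
hypothetical, and the actual route deletions are reduced to a counter.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical work queue: `total` pending items, `done` already processed. */
struct work_queue {
    size_t total;
    size_t done;
};

/* Process at most `budget` items, then return so the caller can do other
 * work. Returns the number of items still pending. */
size_t work_queue_run(struct work_queue *wq, size_t budget)
{
    size_t left, n;

    assert(wq != NULL && wq->done <= wq->total);
    left = wq->total - wq->done;
    n = left < budget ? left : budget;
    /* A real implementation would issue n route deletions here. */
    wq->done += n;
    return wq->total - wq->done;
}

/* Drain the queue in slices of `budget` items, calling `poll_reader`
 * (if not NULL) after each slice. Returns the number of slices run. */
size_t drain_interleaved(struct work_queue *wq, size_t budget,
                         void (*poll_reader)(void))
{
    size_t slices = 0;

    assert(wq != NULL && budget > 0);
    while (wq->done < wq->total) {
        work_queue_run(wq, budget);
        slices++;
        if (poll_reader)
            poll_reader();   /* e.g. read pending netlink messages */
    }
    return slices;
}
```

With 10 pending items and a budget of 4, the reader gets a turn after 4, 8
and 10 items rather than only once at the end, so fewer kernel messages
pile up in the socket buffer between reads.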

> Although, I'm going to apply Hasso's patch - it might solve my
> problem... Bounty still stands ;)
>
> -alex

regards,
--
Paul Jakma paul@clubi.ie paul@jakma.org Key ID: 64A2FF6A
Fortune:
Above all things, reverence yourself.