Mailing List Archive

ospfd dies
Here is log of ospfd (lastest CVS),

I trying to change ospf cost of one of my pppX interfaces.

After 'ospf cost 5' command:

2004/06/11 23:02:28 OSPF: LSA[Type1]: Cancel previous router-LSA timer
2004/06/11 23:02:28 OSPF: LSA[Type1]: Scheduling router-LSA origination
right away
2004/06/11 23:02:28 OSPF: LSA[Type1]: Create router-LSA instance
2004/06/11 23:02:28 OSPF: LSA[Type1]: Set link Point-to-Point
2004/06/11 23:02:28 OSPF: LSA[Type1]: Set link Point-to-Point
2004/06/11 23:02:28 OSPF: LSA[Type1]: Set link Point-to-Point
2004/06/11 23:02:28 OSPF: LSA[Type1]: Set link Point-to-Point
2004/06/11 23:02:28 OSPF: LSA: freed 0x81431f0
2004/06/11 23:02:28 OSPF: LSA[Type1:10.223.1.1]: data freed 0x8147148
2004/06/11 23:02:28 OSPF: LSA[Type1]: ID 10.223.1.1 seq 0x80000ad5 is
self-originated
2004/06/11 23:02:28 OSPF: LSA[Type1,id(10.223.1.1),ar(10.223.1.1)]:
Install router-LSA to Area 0.0.0.0
2004/06/11 23:02:28 OSPF: RXmtL(0)++, NBR(10.130.11.1),
LSA[Type1,id(10.223.1.1),ar(10.223.1.1)]
2004/06/11 23:02:28 OSPF: RXmtL(0)++, NBR(10.223.7.1),
LSA[Type1,id(10.223.1.1),ar(10.223.1.1)]
2004/06/11 23:02:28 OSPF: RXmtL(0)++, NBR(10.223.3.1),
LSA[Type1,id(10.223.1.1),ar(10.223.1.1)]
2004/06/11 23:02:28 OSPF: RXmtL(0)++, NBR(10.223.3.1),
LSA[Type1,id(10.223.1.1),ar(10.223.1.1)]
2004/06/11 23:02:28 OSPF: LSA[Type1:10.223.1.1]: router-LSA refresh
2004/06/11 23:02:28 OSPF: LSA Header
2004/06/11 23:02:28 OSPF: LS age 0
2004/06/11 23:02:28 OSPF: Options 2 (*|-|-|-|-|-|E|*)
2004/06/11 23:02:28 OSPF: LS type 1 (router-LSA)
2004/06/11 23:02:28 OSPF: Link State ID 10.223.1.1
2004/06/11 23:02:28 OSPF: Advertising Router 10.223.1.1
2004/06/11 23:02:28 OSPF: LS sequence number 0x80000ad5
2004/06/11 23:02:28 OSPF: LS checksum 0x290a
2004/06/11 23:02:28 OSPF: length 132
2004/06/11 23:02:28 OSPF: RXmtL(1)--, NBR(10.223.3.1),
LSA[Type1,id(10.223.1.1),ar(10.223.1.1)]
2004/06/11 23:02:28 OSPF: LSA: freed 0x8147020
2004/06/11 23:02:28 OSPF: RXmtL(1)--, NBR(10.130.11.1),
LSA[Type1,id(10.223.1.1),ar(10.223.1.1)]
2004/06/11 23:02:28 OSPF: LSA: freed 0x8147020
2004/06/11 23:02:28 OSPF: RXmtL(1)--, NBR(10.223.7.1),
LSA[Type1,id(10.223.1.1),ar(10.223.1.1)]
2004/06/11 23:02:28 OSPF: LSA: freed 0x81431f0
2004/06/11 23:02:29 OSPF: RXmtL(1)--, NBR(10.223.3.1),
LSA[Type1,id(10.223.1.1),ar(10.223.1.1)]
2004/06/11 23:02:29 OSPF: LSA: freed 0x81431f0


You can see double free of 0x8147020 and 0x81431f0, then ospfd died.

How can I fix this?
Re: ospfd dies [ In reply to ]
On Mon, 14 Jun 2004, Kir Kostuchenko wrote:

> Here is log of ospfd (lastest CVS),
>
> I trying to change ospf cost of one of my pppX interfaces.
>
> After 'ospf cost 5' command:
>
> 2004/06/11 23:02:28 OSPF: RXmtL(1)--, NBR(10.223.3.1),
> LSA[Type1,id(10.223.1.1),ar(10.223.1.1)]
> 2004/06/11 23:02:28 OSPF: LSA: freed 0x8147020
> 2004/06/11 23:02:28 OSPF: RXmtL(1)--, NBR(10.130.11.1),
> LSA[Type1,id(10.223.1.1),ar(10.223.1.1)]
> 2004/06/11 23:02:28 OSPF: LSA: freed 0x8147020
> 2004/06/11 23:02:28 OSPF: RXmtL(1)--, NBR(10.223.7.1),
> LSA[Type1,id(10.223.1.1),ar(10.223.1.1)]
> 2004/06/11 23:02:28 OSPF: LSA: freed 0x81431f0
> 2004/06/11 23:02:29 OSPF: RXmtL(1)--, NBR(10.223.3.1),
> LSA[Type1,id(10.223.1.1),ar(10.223.1.1)]
> 2004/06/11 23:02:29 OSPF: LSA: freed 0x81431f0


> You can see double free of 0x8147020 and 0x81431f0, then ospfd died.

Hmm.. that would be bad. However, are you sure it's a double free? It
could easily be an alloc, free, alloc, free of the same address -
only the free is being logged above. LSAs are refcounted and only
freed (implicitely) when reference count hits 0.

If I try replicate your problem, I cant get it to crash, but I do see
ospf_lsa_free called multiple times for same address:

Breakpoint 12, ospf_lsa_new () at ospf_lsa.c:231
231 new->retransmit_counter = 0;
new: 0x9dedb30

Breakpoint 7, ospf_lsa_free (lsa=0x9dea9d0) at ospf_lsa.c:272
272 assert (lsa->lock == 0);
lsa: 0x9dea9d0

Breakpoint 12, ospf_lsa_new () at ospf_lsa.c:231
231 new->retransmit_counter = 0;
new: 0x9de8558 <----- allocated

Breakpoint 7, ospf_lsa_free (lsa=0x9de8558) at ospf_lsa.c:272
272 assert (lsa->lock == 0);
lsa: 0x9de8558 <------ freed

Breakpoint 12, ospf_lsa_new () at ospf_lsa.c:231
231 new->retransmit_counter = 0;
new: 0x9de8558 <----- previously freed allocation reused

Breakpoint 7, ospf_lsa_free (lsa=0x9de8558) at ospf_lsa.c:272
272 assert (lsa->lock == 0);
lsa: 0x9de8558 <----- freed again

Breakpoint 12, ospf_lsa_new () at ospf_lsa.c:231
231 new->retransmit_counter = 0;
new: 0x9de8558 <----- used again

Breakpoint 7, ospf_lsa_free (lsa=0x9de8558) at ospf_lsa.c:272
272 assert (lsa->lock == 0);
lsa: 0x9de8558 <----- freed again

Breakpoint 12, ospf_lsa_new () at ospf_lsa.c:231
231 new->retransmit_counter = 0;
new: 0x9dea9d0

Breakpoint 7, ospf_lsa_free (lsa=0x9dea9d0) at ospf_lsa.c:272
272 assert (lsa->lock == 0);
---Type <return> to continue, or q <return> to quit---
lsa: 0x9dea9d0

So, i doubt you're really seeing a double-free.

> How can I fix this?

Get a core dump or a backtrace of the crash. Describe your config (ie
state of network, machine (including OS) and ospfd - all relevant
information please, eg the state of your interface (both with
ifconfig -a and/or the linux 'ip address' command if appropriate, and
with the equivalent commands within the Quagga daemons.).

To get a backtrace, run ospfd (compiled with debug symbols, eg if
using redhat use the -debuginfo package, otherwise compile a binary
yourself with -g flags to gcc, should be the default) under gdb:

# gdb /path/to/ospfd-with-symbols
(gdb) set args <whatever args are needed, eg -f /path/to/config>
(gdb) run
<wait till ospfd is settled>
<make it crash>
Program received signal SIGSEGV
(gdb) bt

<gdb will print out a trace of where it crashed>

Mail the backtrace to this list.

regards,
--
Paul Jakma paul@clubi.ie paul@jakma.org Key ID: 64A2FF6A
warning: do not ever send email to spam@dishone.st
Fortune:
God may be subtle, but he isn't plain mean.
-- Albert Einstein
Re: ospfd dies [ In reply to ]
Paul Jakma wrote:

> On Mon, 14 Jun 2004, Kir Kostuchenko wrote:
>
>> Here is log of ospfd (lastest CVS),
>>
>> I trying to change ospf cost of one of my pppX interfaces.
>>
>> After 'ospf cost 5' command:
>>
>> 2004/06/11 23:02:28 OSPF: RXmtL(1)--, NBR(10.223.3.1),
>> LSA[Type1,id(10.223.1.1),ar(10.223.1.1)]
>> 2004/06/11 23:02:28 OSPF: LSA: freed 0x8147020
>> 2004/06/11 23:02:28 OSPF: RXmtL(1)--, NBR(10.130.11.1),
>> LSA[Type1,id(10.223.1.1),ar(10.223.1.1)]
>> 2004/06/11 23:02:28 OSPF: LSA: freed 0x8147020
>> 2004/06/11 23:02:28 OSPF: RXmtL(1)--, NBR(10.223.7.1),
>> LSA[Type1,id(10.223.1.1),ar(10.223.1.1)]
>> 2004/06/11 23:02:28 OSPF: LSA: freed 0x81431f0
>> 2004/06/11 23:02:29 OSPF: RXmtL(1)--, NBR(10.223.3.1),
>> LSA[Type1,id(10.223.1.1),ar(10.223.1.1)]
>> 2004/06/11 23:02:29 OSPF: LSA: freed 0x81431f0
>
>
>
>> You can see double free of 0x8147020 and 0x81431f0, then ospfd died.
>
>
> Hmm.. that would be bad. However, are you sure it's a double free? It
> could easily be an alloc, free, alloc, free of the same address - only
> the free is being logged above. LSAs are refcounted and only freed
> (implicitely) when reference count hits 0.
>
> If I try replicate your problem, I cant get it to crash, but I do see
> ospf_lsa_free called multiple times for same address:
>
> Breakpoint 12, ospf_lsa_new () at ospf_lsa.c:231
> 231 new->retransmit_counter = 0;
> new: 0x9dedb30
>
> Breakpoint 7, ospf_lsa_free (lsa=0x9dea9d0) at ospf_lsa.c:272
> 272 assert (lsa->lock == 0);
> lsa: 0x9dea9d0
>
> Breakpoint 12, ospf_lsa_new () at ospf_lsa.c:231
> 231 new->retransmit_counter = 0;
> new: 0x9de8558 <----- allocated
>
> Breakpoint 7, ospf_lsa_free (lsa=0x9de8558) at ospf_lsa.c:272
> 272 assert (lsa->lock == 0);
> lsa: 0x9de8558 <------ freed
>
> Breakpoint 12, ospf_lsa_new () at ospf_lsa.c:231
> 231 new->retransmit_counter = 0;
> new: 0x9de8558 <----- previously freed allocation reused
>
> Breakpoint 7, ospf_lsa_free (lsa=0x9de8558) at ospf_lsa.c:272
> 272 assert (lsa->lock == 0);
> lsa: 0x9de8558 <----- freed again
>
> Breakpoint 12, ospf_lsa_new () at ospf_lsa.c:231
> 231 new->retransmit_counter = 0;
> new: 0x9de8558 <----- used again
>
> Breakpoint 7, ospf_lsa_free (lsa=0x9de8558) at ospf_lsa.c:272
> 272 assert (lsa->lock == 0);
> lsa: 0x9de8558 <----- freed again
>
> Breakpoint 12, ospf_lsa_new () at ospf_lsa.c:231
> 231 new->retransmit_counter = 0;
> new: 0x9dea9d0
>
> Breakpoint 7, ospf_lsa_free (lsa=0x9dea9d0) at ospf_lsa.c:272
> 272 assert (lsa->lock == 0);
> ---Type <return> to continue, or q <return> to quit---
> lsa: 0x9dea9d0
>
> So, i doubt you're really seeing a double-free.
>
>> How can I fix this?
>
>
> Get a core dump or a backtrace of the crash. Describe your config (ie
> state of network, machine (including OS) and ospfd - all relevant
> information please, eg the state of your interface (both with ifconfig
> -a and/or the linux 'ip address' command if appropriate, and with the
> equivalent commands within the Quagga daemons.).
>
> To get a backtrace, run ospfd (compiled with debug symbols, eg if
> using redhat use the -debuginfo package, otherwise compile a binary
> yourself with -g flags to gcc, should be the default) under gdb:
>
> # gdb /path/to/ospfd-with-symbols
> (gdb) set args <whatever args are needed, eg -f /path/to/config>
> (gdb) run
> <wait till ospfd is settled>
> <make it crash>
> Program received signal SIGSEGV
> (gdb) bt
>
> <gdb will print out a trace of where it crashed>
>
> Mail the backtrace to this list.
>
> regards,


Ghm, the problem seems in the gcc or libc..

I using Adamantix (http://www.adamantix.org) on my VPN routers and
compile quagga from the lastest CVS.

Adamantix it is:

gcc: 2.95.4
libc: 2.2.5-11.5.3

Then, I found the problem with the ospfd and try to reproduce the
problem on my home PC with 2.6.6 kernel and gcc: 3.3.3.

I can't segfault ospfd on my home pc.

Then, I statically link ospfd on home PC and copy to the router. No
segfaults.

Will test...
Re: ospfd dies [ In reply to ]
So, after some tests I found that problem still exists and I
reproduce it on my home PC.

Setup is quite simple:


RT1 with 2 interfaces.

RT2 have 2 pptp-linux VPNs to RT1 on each interface.

pppX interfaces running ospf:

ppp0 is up
Internet Address 192.168.0.3/32, Area 0.0.0.0
Router ID 192.168.0.6, Network Type POINTOPOINT, Cost: 10
Transmit Delay is 1 sec, State Point-To-Point, Priority 1
No designated router on this network
No backup designated router on this network
Timer intervals configured, Hello 10, Dead 40, Wait 40, Retransmit 5
Hello due in 00:00:08
Neighbor Count is 1, Adjacent neighbor count is 1
ppp1 is up
Internet Address 192.168.0.6/32, Area 0.0.0.0
Router ID 192.168.0.6, Network Type POINTOPOINT, Cost: 10
Transmit Delay is 1 sec, State Point-To-Point, Priority 1
No designated router on this network
No backup designated router on this network
Timer intervals configured, Hello 10, Dead 40, Wait 40, Retransmit 5
Hello due in 00:00:08
Neighbor Count is 1, Adjacent neighbor count is 1


Than, changing ospf cost of one tunnel:

ospfd> enable
ospfd# configure terminal
ospfd(config)# interface ppp1
ospfd(config-if)# ospf cost 5
ospfd(config-if)# end
ospfd# quit


valgrind output of ospfd:

==10234==
==10234== Invalid read of size 4
==10234== at 0x8060F4C: ospf_spf_consider_nexthop (ospf_spf.c:354)
==10234== by 0x8061209: ospf_nexthop_calculation (ospf_spf.c:443)
==10234== by 0x80614E6: ospf_spf_next (ospf_spf.c:662)
==10234== by 0x8061D1D: ospf_spf_calculate (ospf_spf.c:979)
==10234== Address 0x3C3930F4 is 0 bytes inside a block of size 12 free'd
==10234== at 0x3C01F918: free (vg_replace_malloc.c:127)
==10234== by 0x8082478: listnode_free (linklist.c:60)
==10234== by 0x8060F7B: ospf_spf_consider_nexthop (ospf_spf.c:361)
==10234== by 0x8061209: ospf_nexthop_calculation (ospf_spf.c:443)
==10234==
==10234== Invalid read of size 4
==10234== at 0x8060F4C: ospf_spf_consider_nexthop (ospf_spf.c:354)
==10234== by 0x8061209: ospf_nexthop_calculation (ospf_spf.c:443)
==10234== by 0x8061462: ospf_spf_next (ospf_spf.c:689)
==10234== by 0x8061D1D: ospf_spf_calculate (ospf_spf.c:979)
==10234== Address 0x3C3932F4 is 0 bytes inside a block of size 12 free'd
==10234== at 0x3C01F918: free (vg_replace_malloc.c:127)
==10234== by 0x8082478: listnode_free (linklist.c:60)
==10234== by 0x8060F7B: ospf_spf_consider_nexthop (ospf_spf.c:361)
==10234== by 0x8061209: ospf_nexthop_calculation (ospf_spf.c:443)

Any suggestions?
Re: ospfd dies [ In reply to ]
Here is full valgrind backtrace of memory errors after 'ospf cost'
command:


==10680==
==10680== Invalid read of size 4
==10680== at 0x8060F4C: ospf_spf_consider_nexthop (ospf_spf.c:354)
==10680== by 0x8061209: ospf_nexthop_calculation (ospf_spf.c:443)
==10680== by 0x80614E6: ospf_spf_next (ospf_spf.c:662)
==10680== by 0x8061D1D: ospf_spf_calculate (ospf_spf.c:979)
==10680== by 0x8061EB5: ospf_spf_calculate_timer (ospf_spf.c:1064)
==10680== by 0x808E1A4: thread_call (thread.c:850)
==10680== by 0x804A017: main (ospf_main.c:308)
==10680== Address 0x3C382AA4 is 0 bytes inside a block of size 12 free'd
==10680== at 0x3C01F918: free (vg_replace_malloc.c:127)
==10680== by 0x8082478: listnode_free (linklist.c:60)
==10680== by 0x8060F7B: ospf_spf_consider_nexthop (ospf_spf.c:361)
==10680== by 0x8061209: ospf_nexthop_calculation (ospf_spf.c:443)
==10680== by 0x80614E6: ospf_spf_next (ospf_spf.c:662)
==10680== by 0x8061D1D: ospf_spf_calculate (ospf_spf.c:979)
==10680== by 0x8061EB5: ospf_spf_calculate_timer (ospf_spf.c:1064)
==10680== by 0x808E1A4: thread_call (thread.c:850)
==10680== by 0x804A017: main (ospf_main.c:308)
==10680==
==10680== Invalid read of size 4
==10680== at 0x8060F4C: ospf_spf_consider_nexthop (ospf_spf.c:354)
==10680== by 0x8061209: ospf_nexthop_calculation (ospf_spf.c:443)
==10680== by 0x8061462: ospf_spf_next (ospf_spf.c:689)
==10680== by 0x8061D1D: ospf_spf_calculate (ospf_spf.c:979)
==10680== by 0x8061EB5: ospf_spf_calculate_timer (ospf_spf.c:1064)
==10680== by 0x808E1A4: thread_call (thread.c:850)
==10680== by 0x804A017: main (ospf_main.c:308)
==10680== Address 0x3C382CA4 is 0 bytes inside a block of size 12 free'd
==10680== at 0x3C01F918: free (vg_replace_malloc.c:127)
==10680== by 0x8082478: listnode_free (linklist.c:60)
==10680== by 0x8060F7B: ospf_spf_consider_nexthop (ospf_spf.c:361)
==10680== by 0x8061209: ospf_nexthop_calculation (ospf_spf.c:443)
==10680== by 0x8061462: ospf_spf_next (ospf_spf.c:689)
==10680== by 0x8061D1D: ospf_spf_calculate (ospf_spf.c:979)
==10680== by 0x8061EB5: ospf_spf_calculate_timer (ospf_spf.c:1064)
==10680== by 0x808E1A4: thread_call (thread.c:850)
==10680== by 0x804A017: main (ospf_main.c:308)
Re: ospfd dies [ In reply to ]
We can see dumb error in deleting list element in the LIST_LOOP:

ospf_spf_consider_nexthop() function after listnode_delete (nexthops, hop);


Original function:

/* Consider supplied next-hop for inclusion to the supplied list
* of next-hops, adjust list as neccessary
*/
void
ospf_spf_consider_nexthop (struct list *nexthops,
struct vertex_nexthop *newhop)
{
struct listnode *nnode;
struct vertex_nexthop *hop;

LIST_LOOP (nexthops, hop, nnode)
{
assert (hop->oi);
/* weed out hops with higher cost than the newhop */
if (hop->oi->output_cost > newhop->oi->output_cost)
{
/* delete the existing nexthop */
listnode_delete (nexthops, hop);
vertex_nexthop_free (hop);
}
else if (hop->oi->output_cost < newhop->oi->output_cost)
{
return;
}
}

/* new hop is <= existing hops, add it */
listnode_add (nexthops, newhop);

return;
}


I suspect, this function should remove all hops from nexthop list with
the cost above newhop->oi->cost and add newhop to the nexthop list, if
newhop->oi->cost <= of all nexthops and it is not in the list allready.

My version of this function (with extra debug):

/* Consider supplied next-hop for inclusion to the supplied list
* of next-hops, adjust list as neccessary
*/
void
ospf_spf_consider_nexthop (struct list *nexthops,
struct vertex_nexthop *newhop)
{
struct listnode *nnode, *nextnode;
struct vertex_nexthop *hop;
u_int32_t max_cost = 0;

if (IS_DEBUG_OSPF_EVENT) {
zlog_info ("ospf_spf_consider_nexthop(): with nexthop %s(cost %d),
consists of:",
newhop->oi->ifp->name, newhop->oi->output_cost );
LIST_LOOP (nexthops, hop, nnode)
{
zlog_info ("hop %s(cost %d)", hop->oi->ifp->name,
hop->oi->output_cost );
}
}


LIST_LOOP (nexthops, hop, nnode)
{
after_delete:
assert (hop->oi);
/* weed out hops with higher cost than the newhop */
if (hop->oi->output_cost > newhop->oi->output_cost)
{
if (IS_DEBUG_OSPF_EVENT)
zlog_info ("ospf_spf_consider_nexthop(): deleting nexthop
%s(cost %d), becouse %s(cost %d) better.",
hop->oi->ifp->name, hop->oi->output_cost,
newhop->oi->ifp->name, newhop->oi->output_cost );

nextnode = nnode->next;

/* delete the existing nexthop */
LISTNODE_DELETE (nexthops, nnode);
vertex_nexthop_free (hop);

if( nextnode ) {
nnode = nextnode;
hop = nnode->data;
goto after_delete;
}

/* nnode was the last list elementh */
break;
}

if ( max_cost < hop->oi->output_cost )
max_cost = hop->oi->output_cost;
}

/* new hop is <= existing hops, add it */
if( !listnode_lookup(nexthops, newhop) && (!max_cost ||
newhop->oi->output_cost <= max_cost) ) {
if (IS_DEBUG_OSPF_EVENT)
zlog_info ("ospf_spf_consider_nexthop(): adding nexthop %s(cost %d)
to the nexthops list.",
newhop->oi->ifp->name, newhop->oi->output_cost );
listnode_add (nexthops, newhop);
}

return;
}


I'm testing my code now.
Re: ospfd dies -- bug in ospf_spf_consider_nexthop [ In reply to ]
On Fri, Jun 18, 2004 at 03:10:43AM +0600, Kir Kostuchenko wrote:
> We can see dumb error in deleting list element in the LIST_LOOP:
>
> ospf_spf_consider_nexthop() function after listnode_delete (nexthops, hop);

That certainly does look like a bug to me (accessing the node
after it was deleted).

However, you replaced the call to listnode_delete with a call
to the macro LISTNODE_DELETE (which is smart, there's no need to search
the list), but the LISTNODE_DELETE macro does not actually delete
the node (if you look at the macro, it removes the links but fails
to call listnode_free).

It looks like the LISTNODE_DELETE macro is never actually used anywhere
in quagga, so it's probably just a bug in that macro. It might
make sense to patch the macro to call listnode_free, and to
patch the linklist.c:listnode_delete() function to use the macro.

-Andy
Re: ospfd dies -- bug in ospf_spf_consider_nexthop [ In reply to ]
On Fri, 18 Jun 2004, Andrew J. Schorr wrote:

> That certainly does look like a bug to me (accessing the node
> after it was deleted).

Oof.. that is quite a nice thinko.

Probably the LIST_LOOP should be replaced with:

#define LIST_LOOP(L,V,N) \
for ((N) = (L)->head; (N);) \
(N) = (N)->next; \
if (((V) = (N)->data) != NULL)

shouldnt break anything, as no users of LIST_LOOP are likely
interested in N (its sets V after all) and i'll bet the crash Kir
noted isnt the only such case of this problematic usage of LIST_LOOP.

Kir, could you test with the above macro and see if it works?

regards,
--
Paul Jakma paul@clubi.ie paul@jakma.org Key ID: 64A2FF6A
warning: do not ever send email to spam@dishone.st
Fortune:
Thirteen at a table is unlucky only when the hostess has only twelve chops.
-- Groucho Marx
Re: ospfd dies -- bug in ospf_spf_consider_nexthop [ In reply to ]
On Sat, Jun 19, 2004 at 11:10:27PM -0400, Paul Jakma wrote:
> Probably the LIST_LOOP should be replaced with:
>
> #define LIST_LOOP(L,V,N) \
> for ((N) = (L)->head; (N);) \
> (N) = (N)->next; \
> if (((V) = (N)->data) != NULL)
>
> shouldnt break anything, as no users of LIST_LOOP are likely
> interested in N (its sets V after all) and i'll bet the crash Kir
> noted isnt the only such case of this problematic usage of LIST_LOOP.
>
> Kir, could you test with the above macro and see if it works?

Actually, don't you need to set (V) = (N)->data before you
set (N) = (N)->next?

Even so, that leaves a curly brace problem (i.e. the "if" statement is
no longer inside the "for" loop).

-Andy
Re: ospfd dies -- bug in ospf_spf_consider_nexthop [ In reply to ]
On Sat, 19 Jun 2004, Paul Jakma wrote:

> Probably the LIST_LOOP should be replaced with:
>
> #define LIST_LOOP(L,V,N) \
> for ((N) = (L)->head; (N);) \
> (N) = (N)->next; \
> if (((V) = (N)->data) != NULL)

I shouldnt post at half four in the morning before going to bed, (why
is my TZ wrong? I'm not in EDT, I'm in IST) above is obviously
borken..

> shouldnt break anything, as no users of LIST_LOOP are likely
> interested in N (its sets V after all)

especialyl when this specific case is because someone _is_ interested
in N.

> Kir, could you test with the above macro and see if it works?

never mind,

there's two answers really:

#define LIST_LOOP(L,V,N,O) \
for ((N) = (L)->head; (N); (N)=(O)) \
(O) = (N)->next; \
if (((V) = (N)->data) != NULL)

or Kir's problematic LIST_LOOP should just not use this macro.

I say the latter.

> regards,

regards,
--
Paul Jakma paul@clubi.ie paul@jakma.org Key ID: 64A2FF6A
warning: do not ever send email to spam@dishone.st
Fortune:
"The trouble with doing something right the first time is that nobody
appreciates how difficult it was."
-- Walt West
Re: ospfd dies -- bug in ospf_spf_consider_nexthop [ In reply to ]
On Sat, 19 Jun 2004, Andrew J. Schorr wrote:

> Actually, don't you need to set (V) = (N)->data before you set (N)
> = (N)->next?

yes ;)

> Even so, that leaves a curly brace problem (i.e. the "if" statement
> is no longer inside the "for" loop).

yep.

> -Andy

regards,
--
Paul Jakma paul@clubi.ie paul@jakma.org Key ID: 64A2FF6A
warning: do not ever send email to spam@dishone.st
Fortune:
He who fears the unknown may one day flee from his own backside.
-- Sinbad
Re: ospfd dies -- bug in ospf_spf_consider_nexthop [ In reply to ]
On Sun, Jun 20, 2004 at 09:45:34AM -0400, Paul Jakma wrote:
> there's two answers really:
>
> #define LIST_LOOP(L,V,N,O) \
> for ((N) = (L)->head; (N); (N)=(O)) \
> (O) = (N)->next; \
> if (((V) = (N)->data) != NULL)
>
> or Kir's problematic LIST_LOOP should just not use this macro.
>
> I say the latter.

I think it's got to be the latter, because the patched LIST_LOOP macro
still has a curly brace problem, and it would require all invocations
of LIST_LOOP to add a 4th argument. I count 174 uses of the LIST_LOOP
macro in the current code, so that seems like a daunting task.

The question that still remains is whether to fix the LISTNODE_DELETE
macro as I discussed in [quagga-dev 1295]. So far, there are no
uses of LISTNODE_DELETE except for Kir's patch. If we are going to
keep that macro, I think it should be fixed, and it should be
used in the listnode_delete function (to reduce code duplication).

-Andy
Re: ospfd dies -- bug in ospf_spf_consider_nexthop [ In reply to ]
On Sun, 20 Jun 2004, Andrew J. Schorr wrote:

> I think it's got to be the latter, because the patched LIST_LOOP
> macro still has a curly brace problem,

sigh yes. I'll go brew some more coffee.

> and it would require all invocations of LIST_LOOP to add a 4th
> argument.

Yes. Hence why not using it for this case seems quickest option.

> I count 174 uses of the LIST_LOOP macro in the current code, so
> that seems like a daunting task.

I'll wager this instance of LIST_LOOP is not the only case that
deletes the node, so they all need to be audited.

> The question that still remains is whether to fix the
> LISTNODE_DELETE macro as I discussed in [quagga-dev 1295]. So far,
> there are no uses of LISTNODE_DELETE except for Kir's patch. If we
> are going to keep that macro, I think it should be fixed, and it
> should be used in the listnode_delete function (to reduce code
> duplication).

Well, it should be renamed, but there's no point having the macro do
deletion of the object - just call the function.

Linked list usage needs cleaning up and auditing anyway, to make some
of the implicit behaviours explicit for instance and check stuff like
deleting node in LIST_LOOP.

> -Andy

regards,
--
Paul Jakma paul@clubi.ie paul@jakma.org Key ID: 64A2FF6A
warning: do not ever send email to spam@dishone.st
Fortune:
Tallulah Bankhead barged down the Nile last night as Cleopatra and sank.
-- John Mason Brown, drama critic
Re: ospfd dies -- bug in ospf_spf_consider_nexthop [ In reply to ]
I'll make patch againist lastest CVS for LISTNODE_DELETE and
ospf_spf_consider_nexthop tomorrow.

I don't think we could patch LIST_LOOP. Adding one more parameter is
not good (imho).

But.......

ospf_spf_consider_nexthop called from ospf_nexthop_calculation, what
really ospf_spf_consider_nexthop should do?

Delete all hops with higher cost and add newhop if not in the list?

P.S. My test setup works OK.
Re: ospfd dies -- bug in ospf_spf_consider_nexthop [ In reply to ]
This patch works for me.
Re: ospfd dies -- bug in ospf_spf_consider_nexthop [ In reply to ]
On Sun, 20 Jun 2004, Andrew J. Schorr wrote:

> I think it's got to be the latter, because the patched LIST_LOOP
> macro still has a curly brace problem,

Right, how's about:

#define LIST_LOOP(L,V,T,N) \
for ((T) = (L)->head; (N) = (T)->next), (T != NULL); (T) = (N)) \
if (((V) = (T)->data) != NULL)

> and it would require all invocations of LIST_LOOP to add a 4th
> argument. I count 174 uses of the LIST_LOOP macro in the current
> code, so that seems like a daunting task.

I think they all have to be audited anyway, guaranteed the instance
Kir found is not the only one. So either way, it's a chore but we can
at least avail of the opportunity to prevent this buglet ever
occuring again.

Alternatively, we can create above as LISTNODE_LOOP_DELETE_SAFE or
somesuch and update those cases of LIST_LOOP which need it (as well
as updating those cases which get it right by manual means to the
macro) ;)

> The question that still remains is whether to fix the
> LISTNODE_DELETE macro as I discussed in [quagga-dev 1295]. So far,
> there are no uses of LISTNODE_DELETE except for Kir's patch. If we
> are going to keep that macro, I think it should be fixed, and it
> should be used in the listnode_delete function (to reduce code
> duplication).

Hmm, LISTNODE_DELETE is useful to shuffle a listnode from one list to
another.

I dont think it should be changed, I think all normal list users
should use the function. Perhaps a comment by LISTNODE_DELETE would
be enough to fix the not-a-problem, or rename it to LISTNODE_DETACH
or somesuch to make it more obvious. (ditto for LISTNODE_ADD,
s/ADD/ATTACH/)

> -Andy

regards,
--
Paul Jakma paul@clubi.ie paul@jakma.org Key ID: 64A2FF6A
warning: do not ever send email to spam@dishone.st
Fortune:
I've run DOOM more in the last few days than I have the last few
months. I just love debugging ;-)
(Linus Torvalds)
Re: ospfd dies -- bug in ospf_spf_consider_nexthop [ In reply to ]
On Thu, 1 Jul 2004, Paul Jakma wrote:

> I think they all have to be audited anyway, guaranteed the instance Kir found
> is not the only one. So either way, it's a chore but we can at least avail of
> the opportunity to prevent this buglet ever occuring again.

The other way of course is:

LIST_LOOP
<whatever, delete the referenced value but do _not_ call listnode_delete>
list_delete (list)

still needs an audit.

regards,
--
Paul Jakma paul@clubi.ie paul@jakma.org Key ID: 64A2FF6A
warning: do not ever send email to spam@dishone.st
Fortune:
Reality is bad enough, why should I tell the truth?
-- Patrick Sky
Re: ospfd dies -- bug in ospf_spf_consider_nexthop [ In reply to ]
On Thu, 1 Jul 2004, Paul Jakma wrote:

> #define LIST_LOOP(L,V,T,N) \
> for ((T) = (L)->head; (N) = (T)->next), (T != NULL); (T) = (N)) \
> if (((V) = (T)->data) != NULL)

Urg, no go with that either, wont work. I cant think of a way to do
this.

regards,
--
Paul Jakma paul@clubi.ie paul@jakma.org Key ID: 64A2FF6A
warning: do not ever send email to spam@dishone.st
Fortune:
Blessed is he who has reached the point of no return and knows it,
for he shall enjoy living.
-- W.C. Bennett
Re: ospfd dies -- bug in ospf_spf_consider_nexthop [ In reply to ]
On Thu, Jul 01, 2004 at 03:25:45PM +0100, Paul Jakma wrote:
> On Thu, 1 Jul 2004, Paul Jakma wrote:
>
> >#define LIST_LOOP(L,V,T,N) \
> >for ((T) = (L)->head; (N) = (T)->next), (T != NULL); (T) = (N)) \
> > if (((V) = (T)->data) != NULL)
>
> Urg, no go with that either, wont work. I cant think of a way to do
> this.

Why won't that work? It looks OK to me, it just has the problem of
being incompatible with the old macro (requires an additional argument).
I guess I'm missing something...

-Andy
Re: ospfd dies -- bug in ospf_spf_consider_nexthop [ In reply to ]
On Thu, Jul 01, 2004 at 09:29:26AM +0100, Paul Jakma wrote:
> >The question that still remains is whether to fix the
> >LISTNODE_DELETE macro as I discussed in [quagga-dev 1295]. So far,
> >there are no uses of LISTNODE_DELETE except for Kir's patch. If we
> >are going to keep that macro, I think it should be fixed, and it
> >should be used in the listnode_delete function (to reduce code
> >duplication).
>
> Hmm, LISTNODE_DELETE is useful to shuffle a listnode from one list to
> another.

But how useful is it really? It's currently not used anywhere in the code...

> I dont think it should be changed, I think all normal list users
> should use the function. Perhaps a comment by LISTNODE_DELETE would
> be enough to fix the not-a-problem, or rename it to LISTNODE_DETACH
> or somesuch to make it more obvious. (ditto for LISTNODE_ADD,
> s/ADD/ATTACH/)

I would agree that it should be renamed if kept as is. It's quite
confusing, in my opinion, if the function listnode_delete calls listnode_free,
but the macro LISTNODE_DELETE does not.

-Andy
Re: ospfd dies -- bug in ospf_spf_consider_nexthop [ In reply to ]
On Thu, 1 Jul 2004, Andrew J. Schorr wrote:

>>
>>> #define LIST_LOOP(L,V,T,N) \
>>> for ((T) = (L)->head; (N) = (T)->next), (T != NULL); (T) = (N)) \
>>> if (((V) = (T)->data) != NULL)
>>
>> Urg, no go with that either, wont work. I cant think of a way to do
>> this.
>
> Why won't that work? It looks OK to me,

(N) = (T)->next), (T != NULL)

Will blow up once it gets to last T. T will be set NULL at end of
last loop and it will then evaluate N=T->next.

What you want is something like:

for (this = list->head;
((this != NULL) ? next = this->next : next = this); this = next)

but conditionals are not allowed as lvalues, so its not possible.

> it just has the problem of being incompatible with the old macro
> (requires an additional argument).

That is a trivial problem ;)

> I guess I'm missing something...
>
> -Andy

regards,
--
Paul Jakma paul@clubi.ie paul@jakma.org Key ID: 64A2FF6A
warning: do not ever send email to spam@dishone.st
Fortune:
Writing is easy; all you do is sit staring at the blank sheet of paper until
drops of blood form on your forehead.
-- Gene Fowler
Re: ospfd dies -- bug in ospf_spf_consider_nexthop [ In reply to ]
On Thu, 1 Jul 2004, Andrew J. Schorr wrote:

> But how useful is it really? It's currently not used anywhere in
> the code...

It's useful. I use it in a rough patch I have that i need to revisit
some time. And I'm working on something at moment that might use it
too.

> I would agree that it should be renamed if kept as is.

Yeah, I've renamed it here already.

> It's quite confusing, in my opinion, if the function
> listnode_delete calls listnode_free, but the macro LISTNODE_DELETE
> does not.

Agreed.

> -Andy

regards,
--
Paul Jakma paul@clubi.ie paul@jakma.org Key ID: 64A2FF6A
warning: do not ever send email to spam@dishone.st
Fortune:
You can't mend a wristwatch while falling from an airplane.
Re: ospfd dies -- bug in ospf_spf_consider_nexthop [ In reply to ]
Right, best I can think of is something like:

#define LIST_LOOP_SAFE (LIST,VAL,THIS,NEXT) \
for ((THIS) = (LIST)->head; (THIS) != NULL; (THIS) = (NEXT))\
if (((VAL) = (THIS)->data) != NULL)

#define LIST_LOOP_SAFE_NEXT (THIS,NEXT) \
((THIS)->next != NULL) ? ((NEXT) = (THIS)->next) : ((NEXT) = (THIS));

usage would obviously be:

LIST_LOOP_SAFE (list, val, tn, nn)
{
LIST_LOOP_SAFE_NEXT (tn, nn);
<etc>
}

1. is it worth it?

2. should we just replace existing LIST_LOOP with above (s/_SAFE//)
and update all users?

regards,
--
Paul Jakma paul@clubi.ie paul@jakma.org Key ID: 64A2FF6A
warning: do not ever send email to spam@dishone.st
Fortune:
Everything ends badly. Otherwise it wouldn't end.
Re: ospfd dies -- bug in ospf_spf_consider_nexthop [ In reply to ]
On Thu, 1 Jul 2004, Paul Jakma wrote:

> #define LIST_LOOP_SAFE_NEXT (THIS,NEXT) \
> ((THIS)->next != NULL) ? ((NEXT) = (THIS)->next) : ((NEXT) = (THIS));
^^^^^^

should be NULL of course.

regards,
--
Paul Jakma paul@clubi.ie paul@jakma.org Key ID: 64A2FF6A
warning: do not ever send email to spam@dishone.st
Fortune:
QOTD:
"If he learns from his mistakes, pretty soon he'll know everything."
Re: ospfd dies -- bug in ospf_spf_consider_nexthop [ In reply to ]
On Thu, Jul 01, 2004 at 04:44:27PM +0100, Paul Jakma wrote:
> On Thu, 1 Jul 2004, Paul Jakma wrote:
>
> >#define LIST_LOOP_SAFE_NEXT (THIS,NEXT) \
> > ((THIS)->next != NULL) ? ((NEXT) = (THIS)->next) : ((NEXT) = (THIS));
> ^^^^^^
>
> should be NULL of course.

I don't think the ternary expression is accomplishing anything there.
How is that different from:

(NEXT) = (THIS)->next;

?

-Andy

1 2  View All