Mailing List Archive

[Bug 330] routes disappear with 'could not determine nexthop' log entry
Please do not reply directly to this email. All additional
comments should be made in the comments box of this bug
report.

http://bugzilla.quagga.net/show_bug.cgi?id=330





------- Additional Comments From windo@p6drad-teel.net 2007-01-03 12:50 -------
Created an attachment (id=127)
--> (http://bugzilla.quagga.net/attachment.cgi?id=127&action=view)
ospfd.log around that happening

Attaching a log excerpt with all sorts of debugging enabled.



------- You are receiving this mail because: -------
You are on the CC list for the bug, or are watching someone who is.
_______________________________________________
Quagga-bugs mailing list
Quagga-bugs@lists.quagga.net
http://lists.quagga.net/mailman/listinfo/quagga-bugs

windo@p6drad-teel.net changed:

What                        |Removed |Added
----------------------------------------------------------------------------
Attachment #127 is obsolete |0       |1




------- Additional Comments From windo@p6drad-teel.net 2007-01-03 13:26 -------
Created an attachment (id=128)
--> (http://bugzilla.quagga.net/attachment.cgi?id=128&action=view)
log excerpt around the problem


------- Additional Comments From ajschorr@alumni.princeton.edu 2007-01-08 21:15 -------
I think it might help to post the output of 'show interface'
and 'show ip ospf interface' so we can get an idea of the
config...

Regards,
Andy




------- Additional Comments From windo@p6drad-teel.net 2007-01-11 08:51 -------
This time I had to block and re-enable the lower-cost link back and forth as
quickly as possible (so that OSPF would see that it's gone and come back) to
reproduce this behaviour.

Both ends have the same config. Output of 'show ip ospf interface', followed by
'show interface':

bond0 is up
ifindex 24, MTU 1500 bytes, BW 0 Kbit <UP,BROADCAST,RUNNING,MULTICAST>
Internet Address 192.168.36.149/24, Broadcast 192.168.36.255, Area 0.0.0.0
MTU mismatch detection:enabled
Router ID 192.168.36.149, Network Type NBMA, Cost: 10
Transmit Delay is 1 sec, State DR, Priority 1
Designated Router (ID) 192.168.36.149, Interface Address 192.168.36.149
No backup designated router on this network
Multicast group memberships: OSPFAllRouters
Timer intervals configured, Hello 10s, Dead 40s, Wait 40s, Retransmit 5
Hello due in 7.957s
Neighbor Count is 0, Adjacent neighbor count is 0
bond1 is up
ifindex 25, MTU 1500 bytes, BW 0 Kbit <UP,BROADCAST,RUNNING,MULTICAST>
OSPF not enabled on this interface
dummy0 is up
ifindex 4, MTU 1500 bytes, BW 0 Kbit <UP,BROADCAST,RUNNING,NOARP>
Internet Address 192.168.98.0/24, Broadcast 192.168.98.255, Area 0.0.0.0
MTU mismatch detection:enabled
Router ID 192.168.36.149, Network Type BROADCAST, Cost: 10
Transmit Delay is 1 sec, State DR, Priority 1
Designated Router (ID) 192.168.36.149, Interface Address 192.168.98.0
No backup designated router on this network
Multicast group memberships: OSPFAllRouters OSPFDesignatedRouters
Timer intervals configured, Hello 10s, Dead 40s, Wait 40s, Retransmit 5
Hello due in 7.957s
Neighbor Count is 0, Adjacent neighbor count is 0
eth0 is up
ifindex 1, MTU 1500 bytes, BW 0 Kbit <UP,BROADCAST,RUNNING,MULTICAST>
OSPF not enabled on this interface
eth1 is up
ifindex 2, MTU 1500 bytes, BW 0 Kbit <UP,BROADCAST,RUNNING,MULTICAST>
OSPF not enabled on this interface
lo is up
ifindex 3, MTU 16436 bytes, BW 0 Kbit <UP,LOOPBACK,RUNNING>
OSPF not enabled on this interface
t-dcl-cpn-tdc is up
ifindex 27, MTU 1500 bytes, BW 0 Kbit <UP,POINTOPOINT,RUNNING,NOARP,MULTICAST>
Internet Address 10.1.0.2/32, Peer 10.1.0.1, Area 0.0.0.0
MTU mismatch detection:enabled
Router ID 192.168.36.149, Network Type POINTOPOINT, Cost: 40
Transmit Delay is 1 sec, State Point-To-Point, Priority 1
No designated router on this network
No backup designated router on this network
Multicast group memberships: OSPFAllRouters
Timer intervals configured, Hello 5s, Dead 20s, Wait 20s, Retransmit 5
Hello due in 3.169s
Neighbor Count is 1, Adjacent neighbor count is 1
t-dcl-cpn-tdc-2 is up
ifindex 26, MTU 1500 bytes, BW 0 Kbit <UP,POINTOPOINT,RUNNING,NOARP,MULTICAST>
Internet Address 10.1.0.4/32, Peer 10.1.0.3, Area 0.0.0.0
MTU mismatch detection:enabled
Router ID 192.168.36.149, Network Type POINTOPOINT, Cost: 50
Transmit Delay is 1 sec, State Point-To-Point, Priority 1
No designated router on this network
No backup designated router on this network
Multicast group memberships: OSPFAllRouters
Timer intervals configured, Hello 5s, Dead 20s, Wait 20s, Retransmit 5
Hello due in 3.021s
Neighbor Count is 1, Adjacent neighbor count is 1



Interface bond0 is up, line protocol detection is disabled
index 24 metric 1 mtu 1500
flags: <UP,BROADCAST,RUNNING,MULTICAST>
HWaddr: 00:0e:7f:b4:a9:81
inet 192.168.36.149/24 broadcast 192.168.36.255
31034 input packets (8510 multicast), 6273020 bytes, 0 dropped
0 input errors, 0 length, 0 overrun, 0 CRC, 0 frame
0 fifo, 0 missed
8802 output packets, 617875 bytes, 0 dropped
0 output errors, 0 aborted, 0 carrier, 0 fifo, 0 heartbeat
0 window, 0 collisions
Interface bond1 is up, line protocol detection is disabled
index 25 metric 1 mtu 1500
flags: <UP,BROADCAST,RUNNING,MULTICAST>
HWaddr: 00:0e:7f:b4:89:e3
inet 80.92.84.57/24 broadcast 80.92.84.255
505828 input packets (193 multicast), 44771931 bytes, 0 dropped
0 input errors, 0 length, 0 overrun, 0 CRC, 0 frame
0 fifo, 0 missed
192800 output packets, 29091375 bytes, 0 dropped
0 output errors, 0 aborted, 0 carrier, 0 fifo, 0 heartbeat
0 window, 0 collisions
Interface dummy0 is up, line protocol detection is disabled
index 4 metric 1 mtu 1500
flags: <UP,BROADCAST,RUNNING,NOARP>
HWaddr: 8a:40:a3:6f:73:c6
inet 192.168.98.0/24 broadcast 192.168.98.255
0 input packets (0 multicast), 0 bytes, 0 dropped
0 input errors, 0 length, 0 overrun, 0 CRC, 0 frame
0 fifo, 0 missed
104454 output packets, 8147412 bytes, 0 dropped
0 output errors, 0 aborted, 0 carrier, 0 fifo, 0 heartbeat
0 window, 0 collisions
Interface eth0 is up, line protocol detection is disabled
index 1 metric 1 mtu 1500
flags: <UP,BROADCAST,RUNNING,MULTICAST>
HWaddr: 00:0e:7f:b4:89:e3
505828 input packets (193 multicast), 44771931 bytes, 0 dropped
0 input errors, 0 length, 0 overrun, 0 CRC, 0 frame
0 fifo, 0 missed
192800 output packets, 29091375 bytes, 0 dropped
0 output errors, 0 aborted, 0 carrier, 0 fifo, 0 heartbeat
0 window, 0 collisions
Interface eth1 is up, line protocol detection is disabled
index 2 metric 1 mtu 1500
flags: <UP,BROADCAST,RUNNING,MULTICAST>
HWaddr: 00:0e:7f:b4:a9:81
31034 input packets (8510 multicast), 6273020 bytes, 0 dropped
0 input errors, 0 length, 0 overrun, 0 CRC, 0 frame
0 fifo, 0 missed
8802 output packets, 617875 bytes, 0 dropped
0 output errors, 0 aborted, 0 carrier, 0 fifo, 0 heartbeat
0 window, 0 collisions
Interface lo is up, line protocol detection is disabled
index 3 metric 1 mtu 16436
flags: <UP,LOOPBACK,RUNNING>
inet 127.0.0.1/8
5143981 input packets (0 multicast), 449793261 bytes, 0 dropped
0 input errors, 0 length, 0 overrun, 0 CRC, 0 frame
0 fifo, 0 missed
5143981 output packets, 449793261 bytes, 0 dropped
0 output errors, 0 aborted, 0 carrier, 0 fifo, 0 heartbeat
0 window, 0 collisions
Interface t-dcl-cpn-tdc is up, line protocol detection is disabled
index 27 metric 1 mtu 1500
flags: <UP,POINTOPOINT,RUNNING,NOARP,MULTICAST>
inet 10.1.0.2/32 pointopoint 10.1.0.1
21529 input packets (0 multicast), 2002586 bytes, 0 dropped
0 input errors, 0 length, 0 overrun, 0 CRC, 0 frame
0 fifo, 0 missed
10220 output packets, 950158 bytes, 0 dropped
0 output errors, 0 aborted, 0 carrier, 0 fifo, 0 heartbeat
0 window, 0 collisions
Interface t-dcl-cpn-tdc-2 is up, line protocol detection is disabled
index 26 metric 1 mtu 1500
flags: <UP,POINTOPOINT,RUNNING,NOARP,MULTICAST>
inet 10.1.0.4/32 pointopoint 10.1.0.3
48253 input packets (0 multicast), 3863806 bytes, 0 dropped
0 input errors, 0 length, 0 overrun, 0 CRC, 0 frame
0 fifo, 0 missed
38064 output packets, 4085038 bytes, 0 dropped
0 output errors, 0 aborted, 0 carrier, 0 fifo, 0 heartbeat
0 window, 0 collisions




------- Additional Comments From windo@p6drad-teel.net 2007-01-11 12:22 -------
I think I understand how this problem occurs (though I don't think I understand
quagga well enough to patch it). To produce the problem, there must be multiple
(in this case, 2) links between two routers.

When I disable the low-cost link at a moment when the HELLO in one direction
still gets through but the one in the other direction doesn't (I deduced this
because I had 'watch -n1 "show ip ospf neighbor"' running on both routers and
there was a 5s difference between the dead timers, which is my hello interval),
the first router to decide that the link is dead sends a new LSA, which triggers
an SPF calculation on the second router.

The logs for such a situation read like this:

2007/01/11 11:55:39 OSPF: ospf_nexthop_calculation(): Start
2007/01/11 11:55:39 OSPF: V (parent): Router vertex 192.168.36.149 distance 0 flags 0
2007/01/11 11:55:39 OSPF: W (dest) : Router vertex 192.168.25.149 distance 40 flags 0
2007/01/11 11:55:39 OSPF: ospf_nexthop_calculation(): considering link type 1 link_id 192.168.25.149 link_data 10.1.0.2
2007/01/11 11:55:39 OSPF: ospf_nexthop_calculation(): could not determine nexthop for link
2007/01/11 11:55:39 OSPF: found Router LSA 192.168.25.149
2007/01/11 11:55:39 OSPF: ospf_intra_add_router: Start
2007/01/11 11:55:39 OSPF: ospf_intra_add_router: LS ID: 192.168.25.149
2007/01/11 11:55:39 OSPF: ospf_intra_add_router: this router is neither ASBR nor ABR, skipping it
2007/01/11 11:55:39 OSPF: found Router LSA 192.168.36.149
2007/01/11 11:55:39 OSPF: The LSA is already in SPF
2007/01/11 11:55:39 OSPF: SPF Result: 0 [R] 192.168.36.149

I think this is because the SPF calculation code for point-to-point links checks
whether any of the remote router's links terminates at the local link, but none
do (since the disappearing link was what triggered the LSA in the first place),
and it does not check any of the other local links (there is only one
"considering link" entry).

Now, when the dead timer on the second router (the one I'm pasting the logs
from) reaches zero as well, the logs show the nexthop calculation using the
other, working link:

2007/01/11 11:55:44 OSPF: ospf_nexthop_calculation(): Start
2007/01/11 11:55:44 OSPF: V (parent): Router vertex 192.168.36.149 distance 0 flags 0
2007/01/11 11:55:44 OSPF: W (dest) : Router vertex 192.168.25.149 distance 50 flags 0
2007/01/11 11:55:44 OSPF: ospf_nexthop_calculation(): considering link type 1 link_id 192.168.25.149 link_data 10.1.0.4
2007/01/11 11:55:44 OSPF: ospf_intra_add_router: Start
2007/01/11 11:55:44 OSPF: ospf_intra_add_router: LS ID: 192.168.25.149
2007/01/11 11:55:44 OSPF: ospf_intra_add_router: this router is neither ASBR nor ABR, skipping it
2007/01/11 11:55:44 OSPF: found Router LSA 192.168.36.149
2007/01/11 11:55:44 OSPF: The LSA is already in SPF
2007/01/11 11:55:44 OSPF: SPF Result: 0 [R] 192.168.36.149
2007/01/11 11:55:44 OSPF: SPF Result: 1 [R] 192.168.25.149
2007/01/11 11:55:44 OSPF: nexthop 0x80c1e20 10.1.0.3 t-dcl-cpn-tdc-2:10.1.0.4

This results in a working routing solution later on.

I think this is a problem, because having no route on the router results in ICMP
net unreachable, which causes clients to fail and give up rather than retry.
And this is apart from the fact that there are (or at least could be) other
working routes available.

This could probably be fixed by trying the other links as well when building the
tree (if the first ones fail)?

There were a couple of longer routing outages in our live environment as well
(the hello/dead timers are 10/40 there): first a 30 second one and then a 5
second one. If my understanding of this bug is right, then that could have
theoretically been the same issue, where the 30-second gap could have been
produced by a short loss of connectivity in one direction?

PS:
This latest test was with quagga-0.99.5, but a diff against the 0.99.6 source
revealed no differences in ospf_spf.c, so I'll refrain from testing with the CVS
version for now (especially since I saw the problem occurring with the CVS
version as well).




------- Additional Comments From ajschorr@alumni.princeton.edu 2007-01-11 14:36 -------
Sorry to be so slow, but can you explain a bit more about the topology?
Which links are you bringing up and down? Is it the
two PtP links t-dcl-cpn-tdc and t-dcl-cpn-tdc-2? What's the output
of 'show ip ospf neighbor' when everything is working normally?




------- Additional Comments From windo@p6drad-teel.net 2007-01-11 14:42 -------
t-dcl-cpn-tdc and t-dcl-cpn-tdc-2 are the two ptp links. t-dcl-cpn-tdc has cost
40 and t-dcl-cpn-tdc-2 has cost 50 (the costs are symmetrical on both ends).

By bringing down the link I mean blocking everything being sent to and received
from those interfaces by iptables.

Neighbor ID     Pri State        Dead Time Address   Interface                RXmtL RqstL DBsmL
192.168.25.149    1 Full/DROther 16.619s   10.1.0.1  t-dcl-cpn-tdc:10.1.0.2       0     0     0
192.168.25.149    1 Full/DROther 16.627s   10.1.0.3  t-dcl-cpn-tdc-2:10.1.0.4     0     0     0

------- Additional Comments From windo@p6drad-teel.net 2007-01-11 14:44 -------
And those two links are all the topology there really is (there are other
interfaces that get distributed into OSPF, but there are only these 2 routers
with only these 2 links between them).




------- Additional Comments From ajschorr@alumni.princeton.edu 2007-01-11 14:54 -------
What exactly are you doing with iptables to block the packets?
Do you see the same problems if you simply unplug the connection
physically (instead of using iptables)? One of your posts
seems to suggest that traffic is still flowing in one direction
over the link. Are you using iptables to block traffic
in only one direction on the link?


------- Additional Comments From windo@p6drad-teel.net 2007-01-11 14:59 -------
I am not physically near the servers and can't unplug the cable. Traffic is
blocked in both directions, but I'll try to see if blocking only one direction
makes a difference.

Here's the script I use to block the traffic:

#!/bin/sh
# Block (failover) or unblock (failback) all traffic on the tunnel interface.

action=$1
tunnel="t-tdc-lux-dcl"

case $action in
    failover)
        # Sleep until the next Hello is due on the tunnel.
        echo "Sleeping until Hello packet"
        sleep "$(vtysh -c "show ip ospf interface ${tunnel}" | grep 'Hello due in' | awk '{print $4}')"

        # Drop local and forwarded traffic in both directions.
        echo "Dropping connections"
        iptables -I OUTPUT -o "${tunnel}" -j DROP
        iptables -I INPUT -i "${tunnel}" -j DROP
        iptables -I FORWARD -o "${tunnel}" -j DROP
        iptables -I FORWARD -i "${tunnel}" -j DROP
        ;;
    failback)
        # Remove the DROP rules again.
        iptables -D OUTPUT -o "${tunnel}" -j DROP
        iptables -D INPUT -i "${tunnel}" -j DROP
        iptables -D FORWARD -o "${tunnel}" -j DROP
        iptables -D FORWARD -i "${tunnel}" -j DROP
        ;;
    *)
        echo "usage: $0 <failover | failback>"
        ;;
esac




------- Additional Comments From windo@p6drad-teel.net 2007-01-11 15:09 -------
Blocking the two directions at different times reproduces this with 100%
certainty.

The two routers are cpn and lux.

So, first I drop everything coming from lux, wait for 6 seconds, and then drop
the other direction as well. This way, the dead timer in cpn will always reach 0
before the one in lux, and the routing will always be lost until the dead timer
in lux reaches 0 too (all the tests were done from the perspective of lux).
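
The timing in this reproduction can be put into numbers with a small sketch. This is only a model, not Quagga code: the function name and its parameters are invented here, and it assumes that each router's dead timer was last reset at the moment the Hellos toward it were blocked (in practice the last Hello may have arrived up to one Hello interval earlier). It also takes as given the observation above that lux is without routes from the moment cpn's dead timer expires until its own does.

```python
def outage_window(dead_interval, peer_blocked_at, local_blocked_at):
    """Estimate when the local router is left without routes.

    peer_blocked_at:  time (s) at which Hellos stop reaching the peer
    local_blocked_at: time (s) at which Hellos stop reaching the local router
    Assumption: each dead timer was last reset exactly at its block time.
    """
    peer_expiry = peer_blocked_at + dead_interval    # peer floods its new LSA here
    local_expiry = local_blocked_at + dead_interval  # local router gives up on the link here
    end = max(peer_expiry, local_expiry)
    return peer_expiry, end, end - peer_expiry


# lux -> cpn blocked at t=0, cpn -> lux blocked 6 s later, Dead interval 20 s
start, end, gap = outage_window(20, 0, 6)
print(f"lux has no route to cpn from t={start}s to t={end}s ({gap}s)")
```

With the values from this test (Dead 20s, 6 seconds between the two blocks) the model gives a gap of 6 seconds, from t=20 to t=26. The gap simply equals the delay between blocking the two directions.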




------- Additional Comments From ajschorr@alumni.princeton.edu 2007-01-11 15:15 -------
Given that Hello is 5 and Dead is 20, what's the longest
outage that you see when you take down a link? Is it clear
that there is a bug here, or is this just normal time
for the network to reconverge?

Have you considered using faster timers to get quicker
convergence? You can use fast hellos perhaps...




------- Additional Comments From windo@p6drad-teel.net 2007-01-11 15:33 -------
I think you misunderstand the problem.

Two routers, cpn and lux, have two links between them. Link 1 has cost 40 and
link 2 has cost 50. During normal operation, link 1 is used.

If link 1 fails, what should happen is that the dead timer reaches zero, OSPF
decides that the link is dead and reroutes through link 2, which has the higher
cost.

This does not always happen.

When the dead timers of the routers get out of sync (as when one direction is
blocked before the other) - let's say that the lux->cpn direction gets blocked
first - then the cpn dead timer reaches 0 before the lux dead timer, and cpn
sends an LSA that triggers an SPF calculation in lux. I already described what I
think happens inside ospfd (you can reread that if you want to), but the result
is that all routes to networks connected to cpn are deleted, since no nexthop is
found (although there is a working link 2).

This is clearly a bug. Either of the following two should happen (I don't know
which one is the correct behavior described by the RFC):

1) When an LSA is received that does not claim to see lux on link 1, agree that
link 1 is dead (despite the dead timer being > 0) and do not try to use it
as a nexthop.
2) When the SPF calculation is run and no nexthop is found on link 1, link 2 is
tried as well.



------- Additional Comments From windo@p6drad-teel.net 2007-01-11 15:55 -------
However reluctant I am to patch this myself, the following would at least
seemingly fix it. But I really think someone who actually knows the codebase
should do it.

--- ospfd/ospf_spf.c 2007-01-11 17:52:51.545808111 +0200
+++ ospfd/ospf_spf.c 2007-01-11 17:46:04.730275363 +0200
@@ -808,7 +808,8 @@
           w->distance = distance;
 
           ospf_nexthop_calculation (area, v, w, l);
-          pqueue_enqueue (w, candidate);
+          if(w->parents->count != 0)
+            pqueue_enqueue (w, candidate);
         }
       else if (w_lsa->stat >= 0)
         {


------- Additional Comments From paul@dishone.st 2007-01-15 00:31 -------
Hi Siim,

I'm confused by (with respect to SPF, comment #5):

"it does not check any of the other local links (only one "considering link")."

This is what the SPF should do. Just /not/ at this point in the code. Paths to
vertices are considered on the basis of the /pair/ of (V,W) (the vertex and its
parent). If, at the logs below, we're trying to find V via W, where W is the
calculating router, then we simply can not know about any paths via 3rd-party
routers - for we always explore our OSPF links before exploring paths through
other vertices (local links must always have a lower cost..).

I.e. the messages you are looking for in the SPF debug must come /after/ the
debug message pertaining to a local link (if the local link is down / not Full).

Could you supply:

- show ip ospf database router
- show ip ospf database network
- logs, showing the SPF calculation, same as your comment

at the time of this possible SPF failure, so we can reconstruct how the SPF
calculation should have run versus how it actually ran?




------- Additional Comments From windo@p6drad-teel.net 2007-01-15 11:38 -------
I'm not very good at SPF terms. What I meant was: "it does not consider other
OSPF links".

Okay, I'll explain what is happening in the code again; tell me if you
understand this so I can move on from that point (because I still get the
feeling you don't see it as I do).

The problem is that ospf_nexthop_calculation may or may not succeed, but it
does not return anything. If you please see comment #5, the first log shows
only one "considering link .. link_data 10.1.0.2" line followed by "could not
determine nexthop". So ospf_nexthop_calculation did not succeed, but in
ospf_spf_next the allocated vertex w is assigned a cost and added to the
candidates (line 811). As a result, ospf_spf_next does not try to calculate a
nexthop through 10.1.0.4 (the other link, with the higher cost), because for its
purposes there apparently already is a path with a lower cost in place (line
813, the cost comparisons).

If I apply my patch, then the check for the number of parents effectively checks
whether ospf_nexthop_calculation succeeded, and only adds the vertex to the
candidates list if a nexthop was found (so that the OSPF links with higher cost
are checked too). So now the log looks like this for the same situation:

2007/01/15 12:21:14 OSPF: ospf_spf_calculate: Start
2007/01/15 12:21:14 OSPF: ospf_spf_calculate: running Dijkstra for area 0.0.0.0
2007/01/15 12:21:14 OSPF: ospf_vertex_new: Created Router vertex 192.168.36.149
2007/01/15 12:21:14 OSPF: found Router LSA 192.168.25.149
2007/01/15 12:21:14 OSPF: The LSA is already in SPF
2007/01/15 12:21:14 OSPF: ospf_vertex_new: Created Router vertex 192.168.25.149
2007/01/15 12:21:14 OSPF: ospf_nexthop_calculation(): Start
2007/01/15 12:21:14 OSPF: V (parent): Router vertex 192.168.36.149 distance 0 flags 0
2007/01/15 12:21:14 OSPF: W (dest) : Router vertex 192.168.25.149 distance 40 flags 0
2007/01/15 12:21:14 OSPF: ospf_nexthop_calculation(): considering link type 1 link_id 192.168.25.149 link_data 10.1.0.2
2007/01/15 12:21:14 OSPF: ospf_nexthop_calculation(): could not determine nexthop for link
2007/01/15 12:21:14 OSPF: found Router LSA 192.168.25.149
2007/01/15 12:21:14 OSPF: The LSA is already in SPF
2007/01/15 12:21:14 OSPF: ospf_vertex_new: Created Router vertex 192.168.25.149
2007/01/15 12:21:14 OSPF: ospf_nexthop_calculation(): Start
2007/01/15 12:21:14 OSPF: V (parent): Router vertex 192.168.36.149 distance 0 flags 0
2007/01/15 12:21:14 OSPF: W (dest) : Router vertex 192.168.25.149 distance 50 flags 0
2007/01/15 12:21:14 OSPF: ospf_nexthop_calculation(): considering link type 1 link_id 192.168.25.149 link_data 10.1.0.4
2007/01/15 12:21:14 OSPF: ospf_intra_add_router: Start
2007/01/15 12:21:14 OSPF: ospf_intra_add_router: LS ID: 192.168.25.149
2007/01/15 12:21:14 OSPF: ospf_intra_add_router: this router is neither ASBR nor ABR, skipping it
2007/01/15 12:21:14 OSPF: found Router LSA 192.168.36.149
2007/01/15 12:21:14 OSPF: The LSA is already in SPF
2007/01/15 12:21:14 OSPF: SPF Result: 0 [R] 192.168.36.149
2007/01/15 12:21:14 OSPF: SPF Result: 1 [R] 192.168.25.149

As you can see, this time the connection to the other router is found, although
exactly the same thing happens on the network.

So, here is the timeline of what is happening. Router 1 and Router 2 are R1 and
R2.

Without the patch:

0: Connectivity is stopped in direction R1->R2
15: Connectivity is stopped in the other direction too
20: Dead timer on R2 times out, LSA is sent R2->R1, SPF calculation
triggered on R1
20.xx: SPF calculation does not find a route R1->R2, because only one of the
(OSPF-enabled) links - the one with the lower cost - is checked, due to what I
think is a bug; routes to R2 (and networks connected to R2) are deleted on R1.
35: Dead timer on R1 times out, the other (OSPF-enabled) link is tried (because
the first one is dead) and routes from R2 are added on R1.

With the patch:

0: Connectivity is stopped in direction R1->R2
15: Connectivity is stopped in the other direction too
20: Dead timer on R2 times out, LSA is sent R2->R1, SPF calculation
triggered on R1
20.xx: SPF calculation checks the lower-cost link but ospf_nexthop_calculation
fails. My "patch" checks and sees that a nexthop was not found and does not add
the vertex to the candidates list, so that when the next, higher-cost link is
tried and a nexthop is found, it gets used. Routes from R2 are not deleted but
are routed through the second link.

I agree that this may be a very bad place (or a bad way) to check this, but I
also told you I don't know anything about the way the code is organised, so I
can't patch it any better. Any questions about this?
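
The effect of the check can be illustrated with a toy model of the first SPF step from the root. This is a Python sketch, not the ospfd code: explore_links, Vertex and the dictionary layout are invented for illustration, the candidate queue is reduced to a dictionary, and the nexthop lookup is simplified to "the neighbour still advertises its end of this particular link". The addresses and costs are the ones from the logs and interface output in this report.

```python
from dataclasses import dataclass, field


@dataclass
class Vertex:
    router_id: str
    distance: int
    nexthops: list = field(default_factory=list)


def explore_links(local_links, remote_back_links, check_nexthop):
    """Walk the root's links in LSA order and build the candidate set.

    check_nexthop=False models the original behaviour (always enqueue),
    check_nexthop=True models the proposed check (enqueue only with a nexthop).
    """
    candidates = {}  # router_id -> Vertex; stands in for the candidate queue
    for link in local_links:
        # The nexthop lookup succeeds only if the neighbour still
        # advertises its end of this particular link.
        hop = link["peer"] if link["peer"] in remote_back_links else None
        queued = candidates.get(link["neighbor"])
        if queued is None:
            w = Vertex(link["neighbor"], link["cost"], [hop] if hop else [])
            if w.nexthops or not check_nexthop:
                candidates[w.router_id] = w
        elif link["cost"] > queued.distance:
            continue  # a cheaper entry is already queued, even if it has no nexthop
        elif hop and link["cost"] == queued.distance:
            queued.nexthops.append(hop)  # equal cost: additional nexthop
        elif hop:
            queued.distance, queued.nexthops = link["cost"], [hop]  # cheaper path
    return candidates


CPN = "192.168.25.149"
LUX_LINKS = [
    {"neighbor": CPN, "peer": "10.1.0.1", "cost": 40},  # t-dcl-cpn-tdc
    {"neighbor": CPN, "peer": "10.1.0.3", "cost": 50},  # t-dcl-cpn-tdc-2
]
# cpn has already declared the cost-40 link dead and only advertises the other one.
CPN_BACK_LINKS = {"10.1.0.3"}

unpatched = explore_links(LUX_LINKS, CPN_BACK_LINKS, check_nexthop=False)
patched = explore_links(LUX_LINKS, CPN_BACK_LINKS, check_nexthop=True)
print(unpatched[CPN])
print(patched[CPN])
```

In this model the unpatched run leaves the neighbour queued at distance 40 with an empty nexthop list, while the patched run queues it at distance 50 with nexthop 10.1.0.3, which is the pattern the two sets of logs show.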




------- Additional Comments From paul@dishone.st 2007-01-15 13:58 -------
Ah, I understand you now. So another fix would be to make nexthop_calculation
indicate whether it failed or not. See the simple patch I'm about to attach.

Thanks!




------- Additional Comments From paul@dishone.st 2007-01-15 14:00 -------
Created an attachment (id=129)
--> (http://bugzilla.quagga.net/attachment.cgi?id=129&action=view)
SPF nexthop calculation can fail, caller needs to know


------- Additional Comments From paul@dishone.st 2007-01-15 14:19 -------
Actually, the deeper bug here is some kind of disparity between
ospf_get_next_link and ospf_lsa_has_link. ospf_nexthop_calculation shouldn't
really fail; it shouldn't be called unless it is known that V and W are
connected. ospf_spf_next() uses lsa_has_link, nexthop_calculation uses get_link.
The failure seems to be because the former indicated that a link exists from W
to V, and then the latter cannot find it when called from nexthop_calculation.

Hmm...




------- Additional Comments From windo@p6drad-teel.net 2007-01-15 15:12 -------
I think the problem is that w and v are connected, but not through link l. So
ospf_lsa_has_link(w, v) would succeed while ospf_nexthop_calculation(area, v, w,
l) would not (because it is restricted to v's link l).
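
The difference between the two checks can be sketched in a few lines of Python. This is a simplified model, not the ospfd implementation: the function names only echo the C functions discussed here, the LSA is reduced to a list of dictionaries, and it relies on the fact that for a numbered point-to-point link the link_data in a router LSA is the advertising router's own interface address (so cpn's end of lux's 10.1.0.2 link is 10.1.0.1).

```python
def lsa_has_link(w_links, v_id):
    """Loose check: does W advertise any link at all to V?"""
    return any(link["link_id"] == v_id for link in w_links)


def has_matching_link_back(w_links, v_id, v_peer_addr):
    """Strict check: does W advertise its end of this particular link to V?

    v_peer_addr is the peer address of V's interface, i.e. the address W
    would put into link_data for this link.
    """
    return any(
        link["link_id"] == v_id and link["link_data"] == v_peer_addr
        for link in w_links
    )


LUX_ID = "192.168.36.149"
# cpn's router LSA after cpn has declared the cost-40 link dead:
# only its end of the cost-50 link (10.1.0.3) is still advertised.
CPN_LINKS = [{"link_id": LUX_ID, "link_data": "10.1.0.3"}]

print(lsa_has_link(CPN_LINKS, LUX_ID))                        # True
print(has_matching_link_back(CPN_LINKS, LUX_ID, "10.1.0.1"))  # False
print(has_matching_link_back(CPN_LINKS, LUX_ID, "10.1.0.3"))  # True
```

The loose check passes for both of lux's links, while the strict one passes only for the link that cpn still advertises, which is the mismatch described in this comment.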

------- Additional Comments From paul@dishone.st 2007-01-15 21:35 -------
Yes, indeed.

Interesting question, this: suppose we had two routers with two different links
connecting them, and those links are asymmetric (for whatever reason, e.g.
because of a transient failure):

    ---->
A           B
    <----

Which routes should SPF calculate?

If we take the 'strict' fix for your problem, and demand that when vertices A
and B are considered, A->B must match B->A, then the above two routers
could not use the asymmetric links. If we allow "any link back", i.e. that to
consider A->B, any link back from B->A will suffice, then during convergence
after a failure SPF would needlessly calculate non-working routes (rather than,
say, going and using a completely different path that may be available, e.g.
A->C->B).

I'm not sure what the OSPF RFC demands. RFC2328 16.1 2(b) states:

"or it does not have a link back to vertex V, examine the next link"

It's unclear if that means "a link back corresponding to V's link" or "any link
back".

If we go with the stricter definition (which seems to make sense: routers with a
pair of uni-directional links would never form adjacencies on those links, and
such a pair of links would be easily or better presented as a single, normal,
bi-directional link, obviously), the main problem is how to fix
ospf_lsa_has_link to /efficiently/ find a corresponding link (or replace it with
get_next_link).
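
The two readings of "a link back" can be put side by side in a small sketch. Everything here is an assumption made for illustration: the `adv_*` types, the `segment` tag used to pair up the two directions of one link, and the policy enum do not exist in Quagga or in the RFC.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical policy switch for the two readings of RFC2328 16.1 2(b). */
enum adv_policy { ADV_ANY_LINK_BACK, ADV_STRICT_LINK_BACK };

struct adv_link {
    uint32_t neighbor_id; /* router ID the link points at */
    int segment;          /* hypothetical tag: which wire this link is on */
};

struct adv_router {
    uint32_t id;
    const struct adv_link *links;
    size_t nlinks;
};

/* May SPF use v's link l to reach w under the given policy? */
bool adv_link_usable(const struct adv_router *v, const struct adv_router *w,
                     const struct adv_link *l, enum adv_policy policy)
{
    assert(v != NULL && w != NULL && l != NULL);
    if (l->neighbor_id != w->id)
        return false; /* l does not point at w at all */
    for (size_t i = 0; i < w->nlinks; i++) {
        const struct adv_link *back = &w->links[i];
        if (back->neighbor_id != v->id)
            continue; /* not a link back to v */
        if (policy == ADV_ANY_LINK_BACK || back->segment == l->segment)
            return true;
    }
    return false;
}
```

With A advertising a link to B on one segment and B advertising a link to A on a different segment, the "any link back" policy accepts A's link and the strict policy rejects it. The sketch says nothing about the efficiency concern raised above; it scans all of w's links for every query.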





[Bug 330] routes disappear with 'could not determine nexthop' log entry [ In reply to ]


paul@dishone.st changed:

What      |Removed    |Added
----------------------------------------------------------------------------
Severity  |normal     |enhancement
Priority  |Very High  |High




------- Additional Comments From paul@dishone.st 2007-01-15 21:52 -------
Also, I'm going to set this to 'enhancement'. The current behaviour is, I think,
within spec for the RFC. The behaviour occurs during convergence, as Andrew
noted, and the routers do converge correctly. What is being asked for is a
convergence enhancement.

Implementing this cleanly/efficiently will require some re-organisation, I think.




[Bug 330] routes disappear with 'could not determine nexthop' log entry [ In reply to ]


paul@dishone.st changed:

What                         |Removed  |Added
----------------------------------------------------------------------------
Attachment #129 is obsolete  |0        |1




------- Additional Comments From paul@dishone.st 2007-01-16 00:01 -------
Created an attachment (id=130)
--> (http://bugzilla.quagga.net/attachment.cgi?id=130&action=view)
Push parent-list flush down into spf_add_parent

Make ospf_spf_add_parent() handle flush of parent-vertex list when shorter path
parent is added. As spf_add_parent() always occurs after nexthop-calculation,
this solves the problems of bug #330.

Patch is untested as of submission.
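
One way to read the patch description is sketched below. It is a simplified model, not the attached patch: `toy_vertex`, `toy_add_parent` and the fixed-size parent array are hypothetical. The idea shown is that the parent list is only flushed at the moment a shorter-path parent is actually added, so a step that fails earlier never leaves the vertex with an empty list.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_PARENTS 8

/* Toy vertex: best known distance plus the parents that achieve it. */
struct toy_vertex {
    unsigned distance;
    int parents[TOY_MAX_PARENTS]; /* hypothetical parent IDs */
    size_t nparents;
};

/* Record `parent` as a way to reach w at cost `distance`.
 * Shorter path: flush the old parents, then add.
 * Equal cost:   append to the existing parents.
 * Longer path:  leave w untouched.
 * Returns 1 if the parent was recorded, 0 otherwise. */
int toy_add_parent(struct toy_vertex *w, int parent, unsigned distance)
{
    assert(w != NULL);
    if (w->nparents > 0 && distance > w->distance)
        return 0;
    if (w->nparents == 0 || distance < w->distance) {
        w->nparents = 0; /* the flush lives here, next to the add */
        w->distance = distance;
    }
    if (w->nparents == TOY_MAX_PARENTS)
        return 0; /* toy limit reached */
    w->parents[w->nparents++] = parent;
    return 1;
}
```

In this model a caller would invoke toy_add_parent() only after a nexthop had been found, so the old parents survive whenever the nexthop step fails.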




[Bug 330] routes disappear with 'could not determine nexthop' log entry [ In reply to ]


paul@dishone.st changed:

What        |Removed  |Added
----------------------------------------------------------------------------
OS/Version  |Linux    |All




------- Additional Comments From windo@p6drad-teel.net 2007-01-16 07:15 -------
You did not explicitly say it, so I'll just point out that currently, when the
situation is like

    ----->
A           B
    <---->

routing breaks as well, which shouldn't happen even according to the strict view.



[Bug 330] routes disappear with 'could not determine nexthop' log entry [ In reply to ]





------- Additional Comments From windo@p6drad-teel.net 2007-01-16 09:36 -------
I think this problem can occur in at least one other situation. I was adding
links to the mesh today, and the router I added the link on created an LSA
itself, tried to find a nexthop on the newly added link, failed with "could not
determine nexthop" and deleted all routes from the router at the other end of
the link. The routes reappeared after the other router sent its LSA, around 5
seconds later (there were other working links available there as well).

I haven't tried the patch yet, but I'll do it now.



[Bug 330] routes disappear with 'could not determine nexthop' log entry [ In reply to ]





------- Additional Comments From windo@p6drad-teel.net 2007-01-17 11:41 -------
The patch does not seem to help. The logs look the same as without the patch:

2007/01/17 12:39:55 OSPF: SPF: Timer (SPF calculation expire)
2007/01/17 12:39:55 OSPF: ospf_spf_calculate: Start
2007/01/17 12:39:55 OSPF: ospf_spf_calculate: running Dijkstra for area 0.0.0.0
2007/01/17 12:39:55 OSPF: ospf_vertex_new: Created Router vertex 192.168.36.149
2007/01/17 12:39:55 OSPF: found Router LSA 192.168.25.149
2007/01/17 12:39:55 OSPF: The LSA is already in SPF
2007/01/17 12:39:55 OSPF: ospf_vertex_new: Created Router vertex 192.168.25.149
2007/01/17 12:39:55 OSPF: ospf_nexthop_calculation(): Start
2007/01/17 12:39:55 OSPF: V (parent): Router vertex 192.168.36.149 distance 0 flags 0
2007/01/17 12:39:55 OSPF: W (dest) : Router vertex 192.168.25.149 distance 40 flags 0
2007/01/17 12:39:55 OSPF: ospf_nexthop_calculation(): considering link type 1 link_id 192.168.25.149 link_data 10.1.0.2
2007/01/17 12:39:55 OSPF: ospf_nexthop_calculation(): could not determine nexthop for link
2007/01/17 12:39:55 OSPF: found Router LSA 192.168.25.149
2007/01/17 12:39:55 OSPF: The LSA is already in SPF
2007/01/17 12:39:55 OSPF: ospf_intra_add_router: Start
2007/01/17 12:39:55 OSPF: ospf_intra_add_router: LS ID: 192.168.25.149
2007/01/17 12:39:55 OSPF: ospf_intra_add_router: this router is neither ASBR nor ABR, skipping it
2007/01/17 12:39:55 OSPF: found Router LSA 192.168.36.149
2007/01/17 12:39:55 OSPF: The LSA is already in SPF
2007/01/17 12:39:55 OSPF: SPF Result: 0 [R] 192.168.36.149




[Bug 330] routes disappear with 'could not determine nexthop' log entry [ In reply to ]


windo@p6drad-teel.net changed:

What                         |Removed  |Added
----------------------------------------------------------------------------
Attachment #130 is obsolete  |0        |1




------- Additional Comments From windo@p6drad-teel.net 2007-01-17 12:04 -------
Created an attachment (id=131)
--> (http://bugzilla.quagga.net/attachment.cgi?id=131&action=view)
Push parent-list flush down into spf_add_parent + fix

Looks like you missed a return in ospf_nexthop_calculation.

I tested it and it fixes the bug AND works for the interesting situation where
the connectivity is asymmetric.



[Bug 330] routes disappear with 'could not determine nexthop' log entry [ In reply to ]


paul@dishone.st changed:

What                         |Removed  |Added
----------------------------------------------------------------------------
Attachment #131 is obsolete  |0        |1




------- Additional Comments From paul@dishone.st 2007-01-19 14:28 -------
Created an attachment (id=132)
--> (http://bugzilla.quagga.net/attachment.cgi?id=132&action=view)
Proposed final form of patch for testing

redo return argument handling in nexthop_calculation

------- Additional Comments From paul@dishone.st 2007-01-19 14:49 -------
Thanks for spotting that, Siim. I think there were two other places which
/potentially/ could incorrectly return positive when no nexthop had actually
been added. I've rejigged the patch slightly; hopefully this is now ready for
integration.

Could you retest this form of the patch and, if you can, eyeball the returns in
nexthop_calculation?

Thanks!
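
The return-value concern can be sketched as follows. This is a hypothetical model and not the patched ospf_nexthop_calculation: the `nh_*` names and the `resolvable` flag are invented for illustration, and equal-cost paths are ignored for brevity. The calculation reports how many nexthops it really added, and the caller changes the candidate only when that count is positive.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct nh_link {
    bool resolvable; /* hypothetical: can a nexthop be derived from it? */
};

struct nh_candidate {
    unsigned distance;
    unsigned nexthops; /* number of nexthops currently installed */
};

/* Returns the number of nexthops actually added to w. A link that
 * cannot be resolved must not be counted as a success. */
unsigned nh_calculate(const struct nh_link *links, size_t n,
                      struct nh_candidate *w)
{
    unsigned added = 0;
    assert(w != NULL);
    for (size_t i = 0; i < n; i++) {
        if (!links[i].resolvable)
            continue;
        w->nexthops++;
        added++;
    }
    return added;
}

/* Caller: replace w's state only if the new path is shorter AND at
 * least one nexthop was found for it. */
bool nh_consider_path(struct nh_candidate *w, const struct nh_link *links,
                      size_t n, unsigned distance)
{
    struct nh_candidate trial = { distance, 0 };
    assert(w != NULL);
    if (w->nexthops > 0 && distance >= w->distance)
        return false; /* not an improvement */
    if (nh_calculate(links, n, &trial) == 0)
        return false; /* failure: keep whatever w already had */
    *w = trial;
    return true;
}
```

Working on a trial copy means a failed calculation cannot disturb the routes that were already in place, which is the symptom reported in this bug.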




[Bug 330] routes disappear with 'could not determine nexthop' log entry [ In reply to ]





------- Additional Comments From windo@p6drad-teel.net 2007-01-23 14:51 -------
Okay, I tested the patch and it works for the general case, but not for the
asymmetric links. I also retested the previous patch and couldn't get that to
work for asymmetric links either.



[Bug 330] routes disappear with 'could not determine nexthop' log entry [ In reply to ]


paul@dishone.st changed:

What    |Removed  |Added
----------------------------------------------------------------------------
Status  |NEW      |ASSIGNED




------- Additional Comments From paul@dishone.st 2007-01-23 19:00 -------
Hi Siim,

That's great, I'll integrate the patch soon. Thanks for your hard work in
diagnosis and testing on this bug!

The asymmetric case, sorry, that can't work. OSPF requires bi-directional
communication on a link, so ultimately that must prevent asymmetric links from
working. I just speculated about how it would work within SPF (which should, in
the abstract at least, work with a unidirectional di-graph). As I said at the
end of the comment:

"routers with a pair of uni-directional links would never form adjacencies on
those links, and such a pair of links would be easily or better presented as a
single, normal, bi-directional link, obviously"

Sorry if that wasn't clear ;).

Thanks again for your help!



[Bug 330] routes disappear with 'could not determine nexthop' log entry [ In reply to ]


paul@dishone.st changed:

What        |Removed   |Added
----------------------------------------------------------------------------
Status      |ASSIGNED  |CLOSED
Resolution  |          |FIXED




------- Additional Comments From paul@dishone.st 2007-01-24 16:46 -------
Integrated into CVS earlier today -> Fixed.


