Mailing List Archive

Weird BGP Behaviour
Hi

I am seeing odd behaviour on one of our Quagga boxes here and I am
looking for some help.

We have approx 6 Quagga servers which are installed on Supermicro
Servers with 8G RAM each. RAM usage is low and CPU usage is generally
low.

On the main server in question, the BGP summary shows:

BGP router identifier a.b.c.d, local AS number 12345
RIB entries 1186136, using 127 MiB of memory
Peers 330, using 2929 KiB of memory
Peer groups 5, using 160 bytes of memory

With lots of normal sessions, e.g.:

217.146.96.73 4 4234 702 587 0 0 0 01:36:32 1459

These are sending and receiving normally. However, some sessions to
which I am sending a full table look like this:


1.2.3.4 4 23454 100 672 0 0 630540 01:36:40 1
2.3.4.5 4 23454 100 668 0 0 630555 01:36:40 1

And some internal sessions:

9.8.7.6 4 12345 100 661 0 0 530342 01:36:41 3
5.6.3.2 4 12345 18551 1364 0 0 525874 01:36:43 95301

You will see that the 'OutQ' column (the eighth field, just before the
uptime) is HUGE and does not decrease. I can't see any errors. If I run
sh ip bgp 8.8.8.8, for example, it shows the route being advertised to
a peer router, BUT the remote router does not have the prefix,
presumably because it is stuck in the queue.
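
For anyone wanting to spot this condition automatically, here is a
minimal Python sketch that picks out peers with a large OutQ from
summary rows like the ones above. It assumes each peer row has exactly
ten whitespace-separated fields with OutQ as the eighth, as in the rows
pasted in this post; the names parse_summary_rows and stuck_peers and
the threshold of 1000 are made up for illustration and are not part of
Quagga.

```python
from typing import List, NamedTuple


class PeerRow(NamedTuple):
    neighbor: str
    remote_as: int
    out_q: int
    state_or_pfx: str


def parse_summary_rows(text: str) -> List[PeerRow]:
    """Parse peer rows from pasted BGP summary output.

    Assumption: a peer row has exactly 10 whitespace-separated fields
    (neighbor, version, AS, MsgRcvd, MsgSent, TblVer, InQ, OutQ,
    Up/Down, State/PfxRcd). Anything else is skipped, including rows
    whose last field has wrapped onto the next line.
    """
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 10 or not fields[1].isdigit():
            continue
        try:
            rows.append(
                PeerRow(fields[0], int(fields[2]), int(fields[7]), fields[9])
            )
        except ValueError:
            # Non-numeric AS or OutQ: not a row we understand.
            continue
    return rows


def stuck_peers(rows: List[PeerRow], threshold: int = 1000) -> List[PeerRow]:
    """Return peers whose OutQ is at or above the (arbitrary) threshold."""
    return [row for row in rows if row.out_q >= threshold]


# Rows copied from this post, with the wrapped last field rejoined.
SAMPLE = """\
217.146.96.73 4 4234 702 587 0 0 0 01:36:32 1459
1.2.3.4 4 23454 100 672 0 0 630540 01:36:40 1
5.6.3.2 4 12345 18551 1364 0 0 525874 01:36:43 95301
"""

if __name__ == "__main__":
    for peer in stuck_peers(parse_summary_rows(SAMPLE)):
        print(peer.neighbor, peer.out_q)
```

A single snapshot only shows that the queue is large; running the check
twice a few minutes apart and comparing the values would show whether
it is actually draining.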

From what I can see this is the only router affected, although all
have the same kernel and Quagga version. I have tried upgrading the
kernel (all are Linux 3.10.17 on Slackware, with Intel e1000 NICs)
and the Quagga version (to 1.1).

I can't see anything being logged in the Linux logs OR the Quagga logs
- happy to look at anything that will help.

Can someone point out what stupid thing I have done here?

Thanks in advance
Richard Palmer | Director | Merula Limited
Company Registered in England and Wales No. 3243995
5 Avro Court, Huntingdon, Cambridgeshire, PE29 6XS
Phone 01480 222940 | Support 0845 330 0666
Support Email support@merula.net
Re: Weird BGP Behaviour
Hi

Welcome to a world of fun and pain.

On 05/03/2017 00:40, Richard J Palmer wrote:
> Hi
>
> I am seeing odd behaviour on one of our Quagga boxes here and I am
> looking for some help

> And some internal sessions:
>
> 9.8.7.6 4 12345 100 661 0 0 530342 01:36:41 3
> 5.6.3.2 4 12345 18551 1364 0 0 525874 01:36:43 95301
>
> You will see the 'OutQ' column is HUGE and does not decrease. I can't see
> any errors - IF I run sh ip bgp 8.8.8.8 for example it will show the
> route is being advertised to a peer router - BUT the remote router does
> not have the prefix presumably as it is stuck in the queue.

This is similar to the "IPv6 routes not propagating via iBGP" problem
that a number of us have randomly on some devices. I have also seen
this with some v4 prefixes, normally customers who are dual-homed and
have flapped a bit between primary and backup; and you end up with the
route 'stuck' on one or more routers.

We actually have a script that runs every few hours now to check that
across the AS, every router has the same next-hop for these prefixes
(they have set local-prefs so this should absolutely be the case) - as
in the past the failure of the internal routes to update caused routing
loops and general badness.
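
The comparison step of such a check can be sketched in Python as
follows. This is not the script described above, only an illustration
of the idea: it assumes the next-hops have already been collected into
a dictionary per router (how that is done is left out), and the router
names, prefixes and addresses are hypothetical documentation values.

```python
from typing import Dict

MISSING = "<missing>"


def inconsistent_prefixes(
    next_hops: Dict[str, Dict[str, str]]
) -> Dict[str, Dict[str, str]]:
    """Find prefixes whose next-hop is not identical on every router.

    next_hops maps router name -> {prefix: next-hop}. The result maps
    each disagreeing prefix -> {router: next-hop or "<missing>"}.
    """
    all_prefixes = set()
    for table in next_hops.values():
        all_prefixes.update(table)

    problems = {}
    for prefix in sorted(all_prefixes):
        seen = {
            router: table.get(prefix, MISSING)
            for router, table in next_hops.items()
        }
        # More than one distinct value means the routers disagree.
        if len(set(seen.values())) > 1:
            problems[prefix] = seen
    return problems


# Hypothetical data: r2 has a stale v6 next-hop and r3 lacks the route.
SAMPLE = {
    "r1": {"192.0.2.0/24": "10.0.0.1", "2001:db8::/32": "fd00::1"},
    "r2": {"192.0.2.0/24": "10.0.0.1", "2001:db8::/32": "fd00::2"},
    "r3": {"192.0.2.0/24": "10.0.0.1"},
}

if __name__ == "__main__":
    for prefix, per_router in inconsistent_prefixes(SAMPLE).items():
        print(prefix, per_router)
```

Treating a missing route as a disagreement matters here, since the
failure being discussed is a prefix that never reaches the peer at all.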

It is a royal pain to debug because it seems almost un-reproducible in
the lab :(

FWIW, I have downgraded to 0.99.23.1 across the board as that is the
last release of Quagga that I can find that definitely does not have
this issue.

Paul.

_______________________________________________
Quagga-users mailing list
Quagga-users@lists.quagga.net
https://lists.quagga.net/mailman/listinfo/quagga-users
Re: Weird BGP Behaviour
Hi

Thanks for the email.

We are seeing this every time we start Quagga; it's not simply a case
of 'it stops after x time'. I'm happy to help debug this if there's
anything I can do ....

I will try the downgrade later as well if nothing else.

It's odd that it's ONLY on this one router - none of the others are
affected ... very strange

Richard Palmer | Director | Merula Limited
Re: Weird BGP Behaviour
Hi,

On 05/03/2017 11:30, Alexis Rosen wrote:
> On Mar 5, 2017, at 4:04 AM, Paul Thornton <paul@prt.org> wrote:
>> It is a royal pain to debug because it seems almost un-reproducible in the lab :(
>
> Because of this, I suspect the only way to resolve this in a reasonable amount of time is by bisect. Doing that on a production network will suck. But this bug has been running around free for at least a year, and if you're right (below) more than 2.5.
>
>> FWIW, I have downgraded to 0.99.23.1 across the board as that is the last release of Quagga that I can find that definitely does not have this issue.
>
> That's an interesting claim, and an important one. All the other reports on this have been about versions 1.x.x, IIRC. Specifically, I was under the impression that 0.99.24.1 did NOT show this bug. If you can reproduce this bug in .24.1 (and .24.0), that will provide new and helpful info on where to look to fix it. How sure are you about this?

Actually, I'm not 100% sure. I thought I was, but now I'm not!

On our current production routers, we're running 0.99.23.1 - I am not
certain whether this is just a case of: "that was the version we were
running pre-upgrade to 1.0 and we know it works so are sticking with it".

Now it just so happens that during some lab staging of a new production
network for a customer a few weeks ago, I had a FreeBSD box acting as
"The Internet" which, amongst other things, carried a full routing table.

This *was* indeed running 1.something (I think it was 1.2) and saw the
problem, reproducibly with v6, and I downgraded that to *I think*
0.99.24.1 as we needed to get testing completed in a bit of a hurry. I
cannot check right now what version that was but will be able to clarify
that next week. Whatever version I used worked fine.

I may also be in a position where I can use this lab setup briefly after
our testing (as it is essentially a Quagga router providing transit to a
'production' network) to do some Quagga testing. I don't know *what* I
can do in terms of trying to troubleshoot the code, but I could
theoretically run something with loads of debug and terrible performance
in an attempt to track down the issue. If people can tell me what to
look for / do here, I can try and fit it in. Certainly, if nothing
else, I can use that to absolutely determine which version is the first
affected one.
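
Finding the first affected version is a binary search over an ordered
list of releases, which keeps the number of builds to test small. A
minimal Python sketch, assuming that once a version shows the problem
every later one does too: the version labels are simply the ones named
in this thread (not a complete release list), and the test predicate
is a placeholder for actually running a build in the lab, not a result.

```python
from typing import Callable, Optional, Sequence


def first_bad(
    versions: Sequence[str], is_bad: Callable[[str], bool]
) -> Optional[str]:
    """Return the first version for which is_bad() is true, or None.

    Assumptions: versions are ordered oldest to newest, and the bug is
    monotonic (every version after the first bad one is also bad).
    """
    lo, hi = 0, len(versions)
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(versions[mid]):
            hi = mid          # first bad version is at mid or earlier
        else:
            lo = mid + 1      # first bad version is after mid
    return versions[lo] if lo < len(versions) else None


# Version labels mentioned in this thread; not a full release history.
CANDIDATES = ["0.99.23.1", "0.99.24", "0.99.24.1", "1.0", "1.1", "1.2.0", "1.2.1"]

if __name__ == "__main__":
    tested = []

    def pretend_is_bad(version: str) -> bool:
        # Placeholder only: pretends everything from "1.0" onward is
        # bad. A real check would install the build and watch OutQ.
        tested.append(version)
        return CANDIDATES.index(version) >= CANDIDATES.index("1.0")

    print("first bad:", first_bad(CANDIDATES, pretend_is_bad))
    print("builds tested:", tested)
```

With seven candidates this needs three test runs instead of seven. The
same approach applies to individual commits between two releases,
which is what git bisect automates.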

Paul.

--
Paul Thornton

Re: Weird BGP Behaviour
I'm also happy to look / debug if needed ....

It is deeply odd ...

Richard Palmer | Director | Merula Limited
Re: Weird BGP Behaviour
Well, downgrading to 0.99.23.1 fixed the issue here .....

So whatever 'broke' seems to have done so between that release and the
current one .....

Oddly, this is NOT affecting all of our Quagga routers, BUT it is
affecting that one - I am happy to get any details off it if that
would help anyone ...



Richard Palmer | Director | Merula Limited
Re: Weird BGP Behaviour
On 3/4/2017 7:40 PM, Richard J Palmer wrote:
> From what I can see this is the only router seeing this - all have the
> same kernel & Quagga version. I have tried upgrading the kernel (all are
> Linux 3.10.17 based on Slackware and with Intel e1000 NICs) and the
> Quagga version (to 1.1)

Can you see if version 1.2.0 shows this bug? There seem to be quite a
few fixes according to the change log. Not sure whether your issue per
se is addressed or not.

---Mike


--
-------------------
Mike Tancsa, tel +1 519 651 3400
Sentex Communications, mike@sentex.net
Providing Internet services since 1994 www.sentex.net
Cambridge, Ontario Canada http://www.tancsa.com/

Re: Weird BGP Behaviour
Hi

I will try ... but I have been failing to compile 1.2 so far :(

--

checking whether system has GNU regex... checking for regexec in -lc... yes
checking for CARES... no
configure: error: Package requirements (libcares) were not met:

No package 'libcares' found

Consider adjusting the PKG_CONFIG_PATH environment variable if you
installed software in a non-standard prefix.

Alternatively, you may set the environment variables CARES_CFLAGS
and CARES_LIBS to avoid the need to call pkg-config.
See the pkg-config man page for more details.

----

This is with Slackware 14.1 (it also seems to fail on 14.1)

I have compiled / installed libcares but this has not as yet helped...
Open to suggestions on what I am missing ...
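
The configure error above means pkg-config could not find a file named
libcares.pc in any directory it searches. One way to check whether a
locally built copy installed that file somewhere outside the search
path is sketched below in Python. The two guessed directories are
hypothetical examples of where a from-source install might put it, and
the helper names are made up; only PKG_CONFIG_PATH is inspected, not
pkg-config's built-in default directories.

```python
import os
from typing import Iterable, List, Mapping, Optional


def find_pc_files(package: str, search_dirs: Iterable[str]) -> List[str]:
    """Return the paths of <package>.pc found in the given directories."""
    hits = []
    for directory in search_dirs:
        candidate = os.path.join(directory, package + ".pc")
        if os.path.isfile(candidate):
            hits.append(candidate)
    return hits


def pkg_config_path_dirs(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Directories listed in PKG_CONFIG_PATH (built-in defaults excluded)."""
    env = os.environ if env is None else env
    value = env.get("PKG_CONFIG_PATH", "")
    return [d for d in value.split(os.pathsep) if d]


if __name__ == "__main__":
    # Hypothetical install locations; adjust to the prefix actually used.
    guesses = ["/usr/local/lib/pkgconfig", "/usr/local/lib64/pkgconfig"]
    print("via PKG_CONFIG_PATH:", find_pc_files("libcares", pkg_config_path_dirs()))
    print("in guessed dirs:    ", find_pc_files("libcares", guesses))
```

If the file turns up only in a guessed directory, adding that directory
to PKG_CONFIG_PATH before re-running configure is the adjustment the
error message itself suggests.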



Richard Palmer | Director | Merula Limited
Re: Weird BGP Behaviour
Hi all

Sorry for the break on this BUT I have been working on the routers in
the background...

I can confirm that this issue is now occurring on other servers here
and SEEMS to affect any release after 0.99.24.

I have now managed to compile the latest versions (1.2.1) and this
does not help

I am using the latest Slackware distribution and kernel Linux 4.4.38
(it happens on earlier kernels too).
As before, the router *starts* OK but then the OutQ for some sessions
(typically those with a lot of prefixes / internal peers) fills up and
does not decrease.

I am more than happy to help with resolving this as it is causing real
issues. I do have a background in C coding (though I have not touched
Quagga itself).

Is someone able to help, please?

Richard


