Mailing List Archive

Re: BGP route hijack by AS10990
Sabri Berisha wrote on 01/08/2020 20:59:
> My point is that there can be operational reasons to do so, and whatever
> they wish to do on their network is perfectly fine. As long as they don't
> bother the rest of the world with it.

I get what you're saying, and am a big fan of personal responsibility,
but when a vendor ships a product like a BGP optimiser, it requires that
you run your network with the safety controls removed.

It's no different in principle to shipping guns with the safety welded
to off, or hot-wiring 20kW cables to bypass your RCDs. It can produce
some great results, no doubt about it, but sooner or later you're
guaranteed that there's going to be a nasty accident.

In any individual case, it's understandable to assign blame to an
operator for messing up their configs. In the general case, shipping
products with dangerous-by-default configurations is going to lead to more
accidents happening.

At this point, a large proportion of the major routing leaks on the
internet can be associated with BGP optimisers, and Noction's name
appears with disturbing regularity. This is an appalling record, not
least because it's almost entirely preventable.

Nick
Re: BGP route hijack by AS10990
On 1/Aug/20 20:14, Hank Nussbacher wrote:

> AS  level filtering is easy.  IP prefix level filtering is hard. 
> Especially when you are in the top 200:
>
> https://asrank.caida.org/
>

Doesn't immediately make sense to me why prefix filtering is hard.


>
> That being said, and due to these BGP "polluters" constantly doing the
> same thing, wouldn't an easy fix be to use the max-prefix/prefix-limit
> option:
>
> https://www.cisco.com/c/en/us/support/docs/ip/border-gateway-protocol-bgp/25160-bgp-maximum-prefix.html
>
> https://www.juniper.net/documentation/en_US/junos/topics/reference/configuration-statement/prefix-limit-edit-protocols-bgp.html
>
>
> For every BGP peer,  the ISP determines what the current max-prefix
> currently is.  Then add in 2% and set the max-prefix. 
>
> An errant BGP polluter would then only have limited damage to the
> Internet routing table.
>
> Not the greatest solution, but easy to implement via a one line change
> on every BGP peer.
>

It's about combining multiple solutions to ensure several catch-points.
AS_PATH filtering, prefix filtering and max-prefix.
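
The three catch-points above can be sketched in a few lines of Python. This is only an illustration, not any vendor's implementation: the function names, the route representation and the choice to count routes after filtering are all assumptions, and real routers differ on whether max-prefix counts received or accepted routes.

```python
import ipaddress
import math


def max_prefix_limit(current_count, headroom_percent=2):
    """Current prefix count plus a small headroom (2% as suggested above)."""
    return current_count + math.ceil(current_count * headroom_percent / 100)


def filter_peer_routes(routes, allowed_origins, allowed_prefixes, limit):
    """Apply AS_PATH, prefix and max-prefix checks to one peer's routes.

    routes: list of (prefix, as_path) tuples, as_path being a list of ASNs.
    Returns (accepted_routes, session_up). Hypothetical helper for illustration.
    """
    allowed_nets = [ipaddress.ip_network(p) for p in allowed_prefixes]
    accepted = []
    for prefix, as_path in routes:
        net = ipaddress.ip_network(prefix)
        # Catch-point 1: the origin AS (last in the path) must be expected.
        if not as_path or as_path[-1] not in allowed_origins:
            continue
        # Catch-point 2: the prefix must sit inside a registered block.
        if not any(net.version == a.version and net.subnet_of(a)
                   for a in allowed_nets):
            continue
        accepted.append((prefix, as_path))
        # Catch-point 3: too many routes means the session is torn down.
        if len(accepted) > limit:
            return [], False
    return accepted, True
```

Each check is cheap on its own; the point is that a leak has to get past all three before it reaches the rest of the Internet.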


>
> Smaller ISPs can easily do it on their 10 BGP peers so as to limit
> damage as to what they will hear from their neighbors.
>

All ISPs should do this. All ISPs can.

Mark.
Re: BGP route hijack by AS10990
On 1/Aug/20 21:03, Sabri Berisha wrote:

> The same can be said here. Noction and/or its operators appear to not understand
> how BGP works, and/or what safety measures must be deployed to ensure that the
> larger internet will not be hurt by misconfiguration.

I think the latter would be more appropriate. Their implementation of
BGP is likely correct, but they aren't putting any emphasis on what the
deployment of their use-case can do to global BGP security and
performance. This is where I'd say they can add more focus.


> I also agree with Job, that Noction has some responsibility here. And as I
> understand more and more about it, I must now agree with Mark T that this
> was an avoidable incident (although not because of Telia, but because Noction's
> decision to not enable NO_EXPORT by default).

I see it differently.

The chain is only as strong as its weakest actor. It is not unreasonable
to expect that global actors of significant scale have enough clue to
make sure any mistakes committed downstream are not propagated by them
to the rest of the Internet.

So while I do not absolve Noction (and their customer) of any
responsibility here, I'd apportion the blame as:

    - Telia 51%
    - Noction 30%
    - Noction's customer 19%

When the weaker links of the chain fail, we should be able to count on
the strongest link in that chain to be the last line of defence...
Telia, in this case. Simply for no other reason than they "know best",
and have such global scope which comes with significant responsibility.

But that isn't to say that Noction and their customer cannot do
better too. After all, BGP security and performance only work well
when we all do our part, and not just some of us.

Mark.
Re: BGP route hijack by AS10990
On 1/Aug/20 21:20, Owen DeLong wrote:

> IP Prefix level filtering at the customer edge is not that hard, no
> matter how large of a transit
> provider you are. Customer edge filtration by Telia in this case would
> have prevented this
> problem from spreading beyond the misconfigured ASN.

+1.

There's simply no excuse - even if 100% of your eBGP sessions may be
customers :-).

Mark.
Re: BGP route hijack by AS10990
On 1/Aug/20 21:31, Owen DeLong wrote:

> I disagree. I think Noction and Telia are both culpable here. Most of the top 200 providers
> manage to do prefix filtering at the customer edge, so I don’t see any reason to give
> Telia a free pass here.

Both Noction and Telia are culpable, because they both (should) know
about past incidents, and how to do their part in protecting against them.

I mean, this is what we talk about on and at *NOG every day of the year.
It's not like they've been living under a rock.

Mark.
Re: BGP route hijack by AS10990
> On Aug 1, 2020, at 12:59 PM, Sabri Berisha <sabri@cluecentral.net> wrote:
>
> ----- On Aug 1, 2020, at 12:50 PM, Nick Hilliard nick@foobar.org wrote:
>
> Hi,
>
>> Sabri Berisha wrote on 01/08/2020 20:03:
>>> but because Noction's decision to not enable NO_EXPORT by default
>>
>> the primary problem is not this but that Noction reinjects prefixes into
>> the local ibgp mesh with the as-path stripped and then prioritises these
>> prefixes so that they're learned as the best path.
>
> Yeah, but that's not a problem as far as I'm concerned. Their network,
> their rules. I've done weirder stuff than that, in tightly controlled
> environments.

Your network, your rules is fine as far as your border. When you start announcing crap to the rest of the world, then the rest of the world has a right to object.

When your product makes it easy for your customers to accidentally announce crap to the rest of the world, then it’s the moral equivalent of building a car without a seatbelt. Sure, before the technology was widely known and its life saving capabilities well understood, it was legitimate to dismiss it as an unnecessary added cost. Today, there’s no excuse for such an action. The hazards of BGP optimizers are pretty well known and it’s not unreasonable to expect vendors to implement appropriate safeguards into their products and/or recommend appropriate safeguards by their customers in their other routing devices. Certainly no-export by default is an example of something that there’s really no reason not to do in any BGP optimizer.
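
To make the no-export point concrete, here is a minimal Python sketch of what "no-export by default" buys you. The community value 65535:65281 is the well-known NO_EXPORT from RFC 1997; everything else (the dict-based route, the function names) is a hypothetical model and not how any optimizer or router is actually implemented.

```python
# Well-known NO_EXPORT community (0xFFFFFF01) from RFC 1997.
NO_EXPORT = (65535, 65281)


def tag_optimizer_route(route):
    """Return a copy of an optimizer-generated route with NO_EXPORT attached."""
    tagged = dict(route)
    tagged["communities"] = set(route.get("communities", ())) | {NO_EXPORT}
    return tagged


def advertise(routes, peer_is_ebgp):
    """Routes a speaker would send to a peer, honouring NO_EXPORT.

    iBGP peers get everything; eBGP peers never see NO_EXPORT routes.
    """
    if not peer_is_ebgp:
        return list(routes)
    return [r for r in routes if NO_EXPORT not in r.get("communities", ())]
```

With the tag applied at injection time, a missing export filter on the border router no longer turns a local traffic-engineering hack into a global leak.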

>> The as-path is the primary loop detection mechanism in eBGP. Removing
>> this is like hot-wiring your electrical distribution board because you
>> found out you could get more power if you bypass those stupid RCDs.
>
> Well, let's be honest. Sometimes we need to get rid of that pesky mechanism.
> For example, when using BGP-as-IGP, the "allowas-in" disregards the as-path,
> in a controlled manner (and yes, I know, different use case).

Also a much more constrained case… allowas-in (which I still argue is a poor substitute
for getting different ASNs for your different non-backboned sites) only allows you to loop your own AS and only at your own sites. It doesn’t support allowing you to feed crap to the internet.
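
The difference between the two cases can be shown with a small Python sketch of AS_PATH loop detection. The function name and the counting semantics for allowas-in are assumptions made for illustration; check your vendor's documentation for the exact behaviour.

```python
def accept_as_path(as_path, local_as, allowas_in=0):
    """Loop check: accept only if local_as appears at most allowas_in times.

    allowas_in=0 models the default behaviour (any occurrence is a loop).
    """
    return as_path.count(local_as) <= allowas_in


# Default: a path that already contains our AS is rejected as a loop.
assert accept_as_path([64500, 64496, 64501], local_as=64496) is False
# allowas-in relaxes the check, but only for our own AS, in a bounded way.
assert accept_as_path([64500, 64496, 64501], local_as=64496, allowas_in=1) is True
# A stripped AS_PATH carries no history, so the check can never fire.
assert accept_as_path([], local_as=64496) is True
```

The last line is the dangerous one: once the path is stripped, nothing downstream can tell that the route has been here before.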

> My point is that there can be operational reasons to do so, and whatever
> they wish to do on their network is perfectly fine. As long as they don't
> bother the rest of the world with it.

But the whole reason we’re having this conversation is that they _DID_ bother the rest of the world with it. Kind of takes the wind out of that particular argument, wouldn’t you say?

Owen
Re: BGP route hijack by AS10990
On Jul 30, 2020, at 5:37 PM, Baldur Norddahl <baldur.norddahl@gmail.com> wrote:
>
> Telia implements RPKI filtering so the question is did it work? Were any affected prefixes RPKI signed? Would any prefixes have avoided being hijacked if RPKI signing had been in place?
>
> Regards
>
> Baldur - who had to turn off RPKI filtering at the request of JTAC to stop our mx204s from crashing :-(
>

Oh uh, I’m getting close to getting RPKI going on my mx204s, or was until you posted that. What’s the story there, and perhaps which junos version?
Re: BGP route hijack by AS10990
On 2/Aug/20 19:22, Darrell Budic wrote:
> Oh uh, I’m getting close to getting RPKI going on my mx204s, or was until you posted that. What’s the story there, and perhaps which junos version?

None that I know of.

We have it working well (RPKI + ROV) on MX204's running Junos 19.2.

Curious to hear about Baldur's bug.

Mark.
RE: BGP route hijack by AS10990
> Darrell Budic
> Sent: Sunday, August 2, 2020 6:23 PM
>
> On Jul 30, 2020, at 5:37 PM, Baldur Norddahl <baldur.norddahl@gmail.com>
> wrote:
> >
> > Telia implements RPKI filtering so the question is did it work? Were any
> affected prefixes RPKI signed? Would any prefixes have avoided being
> hijacked if RPKI signing had been in place?
> >
> > Regards
> >
> > Baldur - who had to turn off RPKI filtering at the request of JTAC to stop our
> mx204s from crashing :-(
> >
>
> Oh uh, I’m getting close to getting RPKI going on my mx204s, or was until you
> posted that. What’s the story there, and perhaps which junos version?

Same here, would be interested in affected Junos versions or any details you can share please,

adam
Re: BGP route hijack by AS10990
> On 3 Aug 2020, at 11:04, adamv0025@netconsultings.com wrote:
>
>> Darrell Budic
>> Sent: Sunday, August 2, 2020 6:23 PM
>>
>> On Jul 30, 2020, at 5:37 PM, Baldur Norddahl <baldur.norddahl@gmail.com>
>> wrote:
>>>
>>> Telia implements RPKI filtering so the question is did it work? Were any
>> affected prefixes RPKI signed? Would any prefixes have avoided being
>> hijacked if RPKI signing had been in place?
>>>
>>> Regards
>>>
>>> Baldur - who had to turn off RPKI filtering at the request of JTAC to stop our
>> mx204s from crashing :-(
>>>
>>
>> Oh uh, I’m getting close to getting RPKI going on my mx204s, or was until you
>> posted that. What’s the story there, and perhaps which junos version?
>
> Same here, would be interested in affected Junos versions or any details you can share please,

According to the information I received from the community[1], you should read PR1461602 and PR1309944 before deploying.

-Alex

[1] https://rpki.readthedocs.io/en/latest/rpki/router-support.html
Re: BGP route hijack by AS10990
>
> We can all do better. We should all do better.
>

Agreed.

However, every time we go on this Righteous Indignation of Should Do
crusade, it would serve us well to stop and remember that in every one of
our jobs, at many points in our careers, we have been faced with a
situation where something we SHOULD do ends up being deferred for something
we MUST do. It is a universal truth that there will never be enough time
and resources to complete both, especially not in our current business
environment, where the only thing that matters is the numbers for the next
quarter. Sometimes as engineers we have to make choices, sometimes
choices are imposed on us by pointy hairs.

Telia made a mistake. They owned it and will endeavor to do better. What
more can be asked?

On Fri, Jul 31, 2020 at 5:51 PM Mark Tinka <mark.tinka@seacom.com> wrote:

>
>
> On 31/Jul/20 23:38, Sabri Berisha wrote:
>
> > Kudos to Telia for admitting their mistakes, and fixing their processes.
>
> Considering Telia's scope and "experience", that is one thing. But for
> the general good of the Internet, the number of intended or
> unintentional route hijacks in recent years, and all the noise that
> rises on this and other lists each time we have such incidents (this
> won't be the last), Telia should not have waited to be called out in
> order to get this fixed.
>
> Do we know if they are fixing this on just this customer of theirs, or
> all their customers? I know this has been their filtering policy with us
> (SEACOM) since 2014, as I pointed out earlier today. There has not been
> a shortage of similar incidents between now and then, where the
> community has consistently called for more deliberate and effective
> route filtering across inter-AS arrangements.
>
> There is massive responsibility for the community to act correctly for
> the Internet to succeed. Especially so during these Coronavirus times
> where the world depends on us to keep whatever shred of an economy is
> left up and running. Doubly so if you are a major concern (like Telia)
> for the core of the Internet.
>
> It's great that they are fixing this - but this was TOTALLY avoidable.
> That we won't see this again - even from the same actors - isn't
> something I have high confidence in guaranteeing, based on current
> experience.
>
> We can all do better. We should all do better.
>
> Mark.
>
>
Re: BGP route hijack by AS10990
On 3/Aug/20 14:36, Alex Band wrote:
> According to the information I received from the community[1], you should read PR1461602 and PR1309944 before deploying.

The good news is the code that fixes both of those issues is shipping.

Mark.
Re: BGP route hijack by AS10990
On Mon, Aug 03, 2020 at 02:36:25PM +0200, Alex Band wrote:
> According to the information I received from the community[1], you
> should read PR1461602 and PR1309944 before deploying.
>
> [1] https://rpki.readthedocs.io/en/latest/rpki/router-support.html

My take on PR1461602 is that it can be ignored, as it appears to only
manifest itself in a mostly cosmetic way: initial RTR session
establishment takes multiple minutes, but once RTR sessions are up
things work smoothly.

Under no circumstances should you enable RPKI ROV functionality on boxes
that suffer from PR1309944. That one is a real showstopper.

Kind regards,

Job
Re: BGP route hijack by AS10990
On 3/Aug/20 14:57, Tom Beecher wrote:

> Agreed. 
>
> However, every time we go on this Righteous Indignation of Should Do
> crusade, it would serve us well to stop and remember that in every one
> of our jobs, at many points in our careers, we have been faced with a
> situation where something we SHOULD do ends up being deferred for
> something we MUST do. It is a universal truth that there will never
> be enough time and resources to complete both, especially not in our
> current business environment, where the only thing that matters is the
> numbers for the next quarter. Sometimes as engineers we have to make
> choices,  sometimes choices are imposed on us by pointy hairs. 
>
> Telia made a mistake. They owned it and will endeavor to do better.
> What more can be asked?

I think we've now gone past Telia's mistake and are considering what we
can all do as BGP actors to prevent this particular issue from making a
reprise.

Agreed, we all have bits we need to prioritize our time on. But BGP
requires the concerted effort of all actors on the Internet. How an operator
in Omsk works with BGP has a potentially direct impact on another
operator in Ketchikan. So whether I choose to spend more time on
attending conferences vs. upgrading my core network, neither of those
has an impact on BGP. But if I'm not going to take BGP filtering as
seriously as I should, an engineer, their employer and their customer,
sitting all the way in Yangon, could feel that.

The devices we use, nowadays, are only as useful as their connectedness.
No connectivity, and they're just bricks. Particularly in these
Coronavirus times, the Internet is what is keeping economies alive, and
folk employed. So rather than go back to the old days of, "We are busy,
it is what it is", let's figure out how to make it better. We don't have
to fix all of the Internet's governance issues this century - let's just
start with making this "BGP optimizer danger" fix + "all operators
should filter more deliberately" a reality.

Mark.
Re: BGP route hijack by AS10990
On 1/Aug/20 02:44, Rafael Possamai wrote:

> To your point with regards to multiple failures combined causing an
> outage, here's some basic reading on the Swiss cheese model:
> https://en.wikipedia.org/wiki/Swiss_cheese_model

You just reminded me of the defense's strategy in the court case against
HealthSouth's CEO Richard Scrushy, when they used a picture of a rat
carrying Swiss cheese (full of holes) in their closing arguments to the
jurors, to discredit the prosecution :-).

Mark.
Re: BGP route hijack by AS10990
On Mon, Aug 3, 2020 at 3:54 PM Job Snijders <job@ntt.net> wrote:

> On Mon, Aug 03, 2020 at 02:36:25PM +0200, Alex Band wrote:
> > According to the information I received from the community[1], you
> > should read PR1461602 and PR1309944 before deploying.
> >
> > [1] https://rpki.readthedocs.io/en/latest/rpki/router-support.html
>
> My take on PR1461602 is that it can be ignored, as it appears to only
> manifest itself in a mostly cosmetic way: initial RTR session
> establishment takes multiple minutes, but once RTR sessions are up
> things work smoothly.
>
> Under no circumstances should you enable RPKI ROV functionality on boxes
> that suffer from PR1309944. That one is a real showstopper.
>
>
We suffered a series of crashes that led to JTAC recommending disabling
RPKI. We had a core dump which matches PR1332626 which is confidential, so
I have no idea what it is about. Apparently what happened was the server
running the RPKI validation server rebooted and the service was not
configured to automatically restart. Also we did not have it redundant nor
did we monitor the service. So we had no working RPKI validation server and
that apparently caused the MX204 to become unstable in various ways. It
might run for a day but it would do all sorts of things like packet loss,
delays and generally be "strange". The first crash caused BGP, ssh and
subscriber management to be down, but LDP, OSPF, SNMP to be up. It became a
black hole we could not log in to. The worst possible kind of crash for a
router. We had to go onsite and pull the power.

The router appears to run fine after disabling RPKI. I suppose starting the
validation service may also fix the issue. But I am not going to go there
until I know what is in that PR and also I feel the RPKI function needs to
be failsafe before we can use it. I know we are at fault for not deploying
the validation service in a redundant setup and for failing at monitoring
the service. But we did so because we thought it not to be too important,
because a failed validation service should simply lead to no validation,
not a crashed router.
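
For reference, the expectation above matches how origin validation is specified on paper. Below is a minimal Python sketch of the RFC 6811 procedure; it models the specification only and says nothing about what Junos does internally. The function name and the tuple layout for validated ROA payloads (VRPs) are assumptions.

```python
import ipaddress


def rov_state(prefix, origin_as, vrps):
    """Classify a route as 'valid', 'invalid' or 'not-found'.

    vrps: iterable of (roa_prefix, max_length, asn) tuples, i.e. the
    data a router would learn from a validator over RTR.
    """
    route = ipaddress.ip_network(prefix)
    covered = False
    for roa_prefix, max_length, asn in vrps:
        roa = ipaddress.ip_network(roa_prefix)
        if route.version != roa.version or not route.subnet_of(roa):
            continue  # this VRP does not cover the route
        covered = True
        if asn == origin_as and route.prefixlen <= max_length:
            return "valid"
    return "invalid" if covered else "not-found"


# With no validator data at all, every route is simply 'not-found',
# which a typical policy accepts: no validation, not a dead router.
assert rov_state("192.0.2.0/24", 64501, []) == "not-found"
```

In other words, an empty VRP table should degrade to the pre-RPKI state of the world.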

This is on JUNOS 20.1R1.11.

Regards,

Baldur
Re: BGP route hijack by AS10990
On 3/Aug/20 17:09, Baldur Norddahl wrote:

>
> We suffered a series of crashes that led to JTAC recommending
> disabling RPKI. We had a core dump which matches PR1332626 which is
> confidential, so I have no idea what it is about. Apparently what
> happened was the server running the RPKI validator rebooted
> and the service was not configured to automatically restart. Also we
> did not have it redundant nor did we monitor the service. So we had no
> working RPKI validation server and that apparently caused the MX204 to
> become unstable in various ways. It might run for a day but it would
> do all sorts of things like packet loss, delays and generally be
> "strange". The first crash caused BGP, ssh and subscriber management
> to be down, but LDP, OSPF, SNMP to be up. It became a black hole we
> could not log in to. The worst possible kind of crash for a router. We
> had to go onsite and pull the power.
>
> The router appears to run fine after disabling RPKI. I suppose
> starting the validation service may also fix the issue. But I am not
> going to go there until I know what is in that PR and also I feel the
> RPKI function needs to be failsafe before we can use it. I know we are
> at fault for not deploying the validation service in a redundant setup
> and for failing at monitoring the service. But we did so because we
> thought it not to be too important, because a failed validation
> service should simply lead to no validation, not a crashed router.
>
> This is on JUNOS 20.1R1.11.

That's a really nasty bug.

Loss of an RTR session shouldn't kill the box, even if you are running
only one validator. If you can share details about why this happens when
you get them, that would be most helpful.

I'd be curious to know whether this is dependent on a specific
validator, or all of them.

Are there bits in Junos 20 that you can't get in fixed versions of 19?

Mark.
Re: BGP route hijack by AS10990
On Mon, Aug 03, 2020 at 08:57:53AM -0400, Tom Beecher wrote:
> Telia made a mistake. They owned it and will endeavor to do better. What
> more can be asked?

Figure out how that mistake happened -- what factors led to it? Then make
changes so that it can't happen again, at least not in that particular
way. (And if those changes are applicable to more than this isolated
case: excellent. In that case, share them with all of us so that maybe
they'll keep us from repeating the error.) "Stopping myself from making
the same mistake twice" has probably been the most effective thing
I've ever done.

---rsk
