
plea for comcast/sprint handoff debug help
tl;dr:

comcast: does your 50.242.151.5 westin router receive the announcement
of 147.28.0.0/20 from sprint's westin router 144.232.9.61?

details:

3130 in the westin announces
147.28.0.0/19 and
147.28.0.0/20
to sprint, ntt, and the six
and we want to remove the /19

when we stop announcing the /19, a traceroute to comcast through sprint
dies at the handoff from sprint to comcast.

r0.sea#traceroute 73.47.196.134 source 147.28.7.1
Type escape sequence to abort.
Tracing the route to c-73-47-196-134.hsd1.ma.comcast.net (73.47.196.134)
VRF info: (vrf in name/id, vrf out name/id)
1 r1.sea.rg.net (147.28.0.5) 0 msec 1 msec 0 msec
2 sl-mpe50-sea-ge-0-0-3-0.sprintlink.net (144.232.9.61) [AS 1239] 1 msec 1 msec 0 msec
3 * * *
4 * * *
5 * * *
6 * * *

this would 'normally' (i.e. when the /19 is announced) be

r0.sea#traceroute 73.47.196.134 source 147.28.7.1
Type escape sequence to abort.
Tracing the route to c-73-47-196-134.hsd1.ma.comcast.net (73.47.196.134)
VRF info: (vrf in name/id, vrf out name/id)
1 r1.sea.rg.net (147.28.0.5) 0 msec 1 msec 0 msec
2 sl-mpe50-sea-ge-0-0-3-0.sprintlink.net (144.232.9.61) [AS 1239] 1 msec 0 msec 1 msec
3 be-207-pe02.seattle.wa.ibone.comcast.net (50.242.151.5) [AS 7922] 1 msec 0 msec 0 msec
4 be-10847-cr01.seattle.wa.ibone.comcast.net (68.86.86.225) [AS 7922] 1 msec 1 msec 2 msec
etc

specifically, when 147.28.0.0/19 is announced, traceroute from
147.28.7.2 through sprint works to comcast. withdraw 147.28.0.0/19,
leaving only 147.28.0.0/20, and the traceroute enters sprint but fails
at the handoff to comcast. Bad next-hop? not propagated? covid?
magic?

which is why we wonder what comcast (50.242.151.5) hears from sprint at
that handoff

note that, at the minute, both the /19 and the /20 are being announced,
as we want things to work. so you will not be able to reproduce.

so, comcast, are you receiving the announcement of the /20 from sprint?
with a good next-hop?

randy
Re: plea for comcast/sprint handoff debug help [ In reply to ]
> tl;dr:
>
> comcast: does your 50.242.151.5 westin router receive the announcement
> of 147.28.0.0/20 from sprint's westin router 144.232.9.61?

tl;dr: diagnosed by comcast. see our short paper to be presented at imc
tomorrow https://archive.psg.com/200927.imc-rp.pdf

lesson: route origin relying party software may cause as much damage as
it ameliorates

randy
Re: plea for comcast/sprint handoff debug help [ In reply to ]
Hello,

On Wed, 28 Oct 2020 at 16:58, Randy Bush <randy@psg.com> wrote:
> tl;dr: diagnosed by comcast. see our short paper to be presented at imc
> tomorrow https://archive.psg.com/200927.imc-rp.pdf
>
> lesson: route origin relying party software may cause as much damage as
> it ameliorates

There is a myth that ROV is inherently fail-safe (it isn't if your
production routers have stale VRPs), which leads to the assumption
that proper monitoring can be neglected.

I'm working on a shell script using rtrdump to detect stale RTR
servers (based on serial changes and the actual data). Of course this
would never detect partial failures that affect only some child CAs,
but it does detect a hung RTR server (or a standalone RTR server whose
validator has stopped validating).
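
The idea, roughly (sketched here in python rather than shell, and with
made-up JSON field names; this is not rtrdump's actual output format):

#!/usr/bin/env python3
# sketch: flag an RTR server as possibly stale if neither its serial nor its
# VRP payload changed between two dumps taken some time apart.
import json
import sys

def load(path):
    # "serial" and "roas" with asn/prefix/maxLength are assumed field names
    with open(path) as f:
        doc = json.load(f)
    vrps = {(r["asn"], r["prefix"], r["maxLength"]) for r in doc.get("roas", [])}
    return doc.get("serial"), vrps

def main(old_path, new_path):
    old_serial, old_vrps = load(old_path)
    new_serial, new_vrps = load(new_path)
    if old_serial == new_serial and old_vrps == new_vrps:
        print("WARNING: serial and VRP set unchanged; RTR server may be stale or hung")
        return 1
    print("OK: serial %s -> %s, %d VRPs added, %d removed" % (
        old_serial, new_serial, len(new_vrps - old_vrps), len(old_vrps - new_vrps)))
    return 0

if __name__ == "__main__":
    sys.exit(main(sys.argv[1], sys.argv[2]))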


lukas
Re: plea for comcast/sprint handoff debug help [ In reply to ]
> On 28 Oct 2020, at 16:58, Randy Bush <randy@psg.com> wrote:
>
>> tl;dr:
>>
>> comcast: does your 50.242.151.5 westin router receive the announcement
>> of 147.28.0.0/20 from sprint's westin router 144.232.9.61?
>
> tl;dr: diagnosed by comcast. see our short paper to be presented at imc
> tomorrow https://archive.psg.com/200927.imc-rp.pdf
>
> lesson: route origin relying party software may cause as much damage as
> it ameliorates
>
> randy

To clarify this for the readers here: there is an ongoing research experiment where connectivity to the RRDP and rsync endpoints of several RPKI publication servers is being purposely enabled and disabled for prolonged periods of time. This is perfectly fine of course.

While the resulting paper presented at IMC is certainly interesting, having relying party software fall back to rsync when RRDP is unavailable is not a requirement specified in any RFC, as the paper seems to suggest. In fact, we argue that it's actually a bad idea to do so:

https://blog.nlnetlabs.nl/why-routinator-doesnt-fall-back-to-rsync/

We're interested to hear views on this from both an operational and security perspective.

-Alex
Re: plea for comcast/sprint handoff debug help [ In reply to ]
>>> tl;dr:
>>>
>>> comcast: does your 50.242.151.5 westin router receive the announcement
>>> of 147.28.0.0/20 from sprint's westin router 144.232.9.61?
>>
>> tl;dr: diagnosed by comcast. see our short paper to be presented at imc
>> tomorrow https://archive.psg.com/200927.imc-rp.pdf
>>
>> lesson: route origin relying party software may cause as much damage as
>> it ameliorates
>>
>> randy
>
> To clarify this for the readers here: there is an ongoing research
> experiment where connectivity to the RRDP and rsync endpoints of
> several RPKI publication servers is being purposely enabled and
> disabled for prolonged periods of time. This is perfectly fine of
> course.
>
> While the resulting paper presented at IMC is certainly interesting,
> having relying party software fall back to rsync when RRDP is
> unavailable is not a requirement specified in any RFC, as the paper
> seems to suggest. In fact, we argue that it's actually a bad idea to
> do so:
>
> https://blog.nlnetlabs.nl/why-routinator-doesnt-fall-back-to-rsync/
>
> We're interested to hear views on this from both an operational and
> security perspective.

in fact, <senior op at an isp> has found your bug. if your software finds
an http server, but that server is not serving the new and not-required
rrdp protocol, it does not then fall back to the mandatory-to-implement rsync.

randy
Re: plea for comcast/sprint handoff debug help [ In reply to ]
i'll see your blog post and raise you a peer reviewed academic paper and
two rfcs :)

in dnssec, we want to move from the old mandatory to implement (mti) rsa
signatures to the more modern ecdsa.

how would the world work out if i fielded a validating dns cache server
which *implemented* rsa, because it is mti, but chose not to actually
*use* it for validation on odd numbered wednesdays because of my
religious belief that ecdsa is superior?

perhaps go over to your unbound siblings and discuss this analog.

but thanks for your help in getting jtk's imc paper accepted. :)

randy
Re: plea for comcast/sprint handoff debug help [ In reply to ]
On Thu, Oct 29, 2020 at 09:14:16PM +0100, Alex Band wrote:
> In fact, we argue that it's actually a bad idea to do so:
>
> https://blog.nlnetlabs.nl/why-routinator-doesnt-fall-back-to-rsync/
>
> We're interested to hear views on this from both an operational and
> security perspective.

I don't see a compelling reason to not use rsync when RRDP is
unavailable.

Quoting from the blog post:

"While this isn’t threatening the integrity of the RPKI – all data
is cryptographically signed making it really difficult to forge data
– it is possible to withhold information or replay old data."

RRDP does not solve the issue of withholding data or replaying old data.
The RRDP protocol /also/ is unauthenticated, just like rsync. The RRDP
protocol basically is rsync wrapped in XML over HTTPS.

Withholding of information is detected through verification of RPKI
manifests (something Routinator didn't verify up until last week!),
and replaying of old data is addressed by checking validity dates and
CRLs (something Routinator also didn't do until last week!).
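
In sketch form, that manifest check amounts to something like the following
(a simplified illustration only; real RPKI objects are CMS/ASN.1 structures,
and the dict-based inputs here are hypothetical stand-ins):

import hashlib
from datetime import datetime, timezone

def check_publication_point(manifest, fetched_files):
    """manifest: {"next_update": aware datetime, "files": {name: sha256 hex}}
       fetched_files: {name: bytes} as retrieved via rsync or RRDP."""
    problems = []
    # a manifest past its nextUpdate suggests old data is being replayed
    if manifest["next_update"] < datetime.now(timezone.utc):
        problems.append("manifest is stale (nextUpdate in the past)")
    # every file listed on the manifest must be present and match its hash;
    # a missing or mismatched file means something is being withheld/altered
    for name, want in manifest["files"].items():
        data = fetched_files.get(name)
        if data is None:
            problems.append(name + ": listed on manifest but not served")
        elif hashlib.sha256(data).hexdigest() != want:
            problems.append(name + ": hash mismatch against manifest")
    return problems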

Of course I see advantages to this industry mainly using RRDP, but those
are not security advantages. The big migration towards RRDP can happen
sometime in the next few years.

The arguments brought forward in the blog post don't make sense to me.
The '150,000' number in the blog post seems to be a number pulled out of
thin air.

Regards,

Job
Re: plea for comcast/sprint handoff debug help [ In reply to ]
> On 30 Oct 2020, at 01:10, Randy Bush <randy@psg.com> wrote:
>
> i'll see your blog post and raise you a peer reviewed academic paper and
> two rfcs :)

For the readers wondering what is going on here: there is a reason there is only a vague mention of two RFCs instead of the specific paragraph where it says that Relying Party software must fall back to rsync immediately if RRDP is temporarily unavailable. That is because this section doesn’t exist. The point is that there is no bug; in fact, Routinator has a carefully thought out strategy to deal with transient outages. Moreover, we argue that our strategy is the better choice, both operationally and from a security standpoint.

The paper shows that Routinator is the most used RPKI relying party software, and we know many of you here rely on it for route origin validation in a production environment. We take this responsibility and therefore this matter very seriously, and would not want you to think we have been careless in our software design. Quite the opposite.

We have made several attempts within the IETF to have a discussion on technical merit, in which aspects such as overwhelming an rsync server with traffic, or using aggressive fallback to rsync as an entry point for a downgrade attack, have been brought forward. Our hope was that our arguments would be considered on technical merit, but that has not happened yet. Be that as it may, operators can rest assured that if consensus goes against our logic, we will change our design.

> perhaps go over to your unbound siblings and discuss this analog.

The mention of the Unbound DNS resolver in this context is interesting, because we have in fact discussed our strategy with the developers of that team, as there is a lot to be learned from other standards and operational experiences.

We feel very strongly about this matter because the claim that using our software negatively affects Internet routing robustness strikes at the core of NLnet Labs’ existence: our reputation and our mission to work for the good of the Internet. They are the core values that make it possible for a not-for-profit foundation like ours to make free, liberally licensed open source software.

We’re proud of what we’ve been able to achieve and look forward to a continued open discussion with the community.

Respectfully,

Alex
Re: plea for comcast/sprint handoff debug help [ In reply to ]
Hi Job, all,

> On 30 Oct 2020, at 11:06, Job Snijders <job@ntt.net> wrote:
>
> On Thu, Oct 29, 2020 at 09:14:16PM +0100, Alex Band wrote:
>> In fact, we argue that it's actually a bad idea to do so:
>>
>> https://blog.nlnetlabs.nl/why-routinator-doesnt-fall-back-to-rsync/
>>
>> We're interested to hear views on this from both an operational and
>> security perspective.
>
> I don't see a compelling reason to not use rsync when RRDP is
> unavailable.
>
> Quoting from the blog post:
>
> "While this isn’t threatening the integrity of the RPKI – all data
> is cryptographically signed making it really difficult to forge data
> – it is possible to withhold information or replay old data."
>
> RRDP does not solve the issue of withholding data or replaying old data.
> The RRDP protocol /also/ is unauthenticated, just like rsync. The RRDP
> protocol basically is rsync wrapped in XML over HTTPS.
>
> Withholding of information is detected through verification of RPKI
> manifests (something Routinator didn't verify up until last week!),
> and replaying of old data is addressed by checking validity dates and
> CRLs (something Routinator also didn't do until last week!).
>
> Of course I see advantages to this industry mainly using RRDP, but those
> are not security advantages. The big migration towards RRDP can happen
> somewhere in the next few years.


Routinator does TLS verification when it encounters an RRDP repository. If the repository cannot be reached, or its HTTPS certificate is somehow invalid, it will use rsync instead. It's only after it has found a *valid* HTTPS connection that it refuses to fall back.

There is a security angle here.

Malicious-in-the-middle attacks can lead an RP to a bogus HTTPS server and force the software to downgrade to rsync, which has no channel security. The software can then be given old data (new ROAs can be withheld), or the attacker can simply withhold a single object. With the stricter publication point completeness validation introduced by RFC6486-bis, this will lead to the rejection of all ROAs published there.

The result is the exact same problem that Randy et al.'s research pointed out. If there is a covering less specific ROA issued by a parent, this will then result in RPKI invalid routes.

The fall-back may help in cases where there is an accidental outage of the RRDP server (for as long as the rsync servers can deal with the load), but it increases the attack surface for repositories that keep their RRDP server available.

Regards,
Tim



>
> The arguments brought forward in the blog post don't make sense to me.
> The '150,000' number in the blog post seems a number pulled from thin
> air.
>
> Regards,
>
> Job
Re: plea for comcast/sprint handoff debug help [ In reply to ]
Alex:

When I follow the RFC rabbit hole:

RFC6481 : A Profile for Resource Certificate Repository Structure

> The publication repository MUST be available using rsync
> [RFC5781 <https://tools.ietf.org/html/rfc5781>] [RSYNC <https://tools.ietf.org/html/rfc6481#ref-RSYNC>]. Support of additional retrieval mechanisms
> is the choice of the repository operator. The supported
> retrieval mechanisms MUST be consistent with the accessMethod
> element value(s) specified in the SIA of the associated CA or
> EE certificate.
Then:

RFC8182 : The RPKI Repository Delta Protocol (RRDP)

> This document allows the use of RRDP as an additional repository
> distribution mechanism for RPKI. In time, RRDP may replace rsync
> [RSYNC <https://tools.ietf.org/html/rfc8182#ref-RSYNC>] as the only mandatory-to-implement repository distribution
> mechanism. However, this transition is outside of the scope of this
> document.
>
Is it not the case, then, that rsync is currently still mandatory, even if
RRDP is in place? Or is there a more recent RFC that has defined the
transition, which I did not locate?

On Fri, Oct 30, 2020 at 7:49 AM Alex Band <alex@nlnetlabs.nl> wrote:

>
> > On 30 Oct 2020, at 01:10, Randy Bush <randy@psg.com> wrote:
> >
> > i'll see your blog post and raise you a peer reviewed academic paper and
> > two rfcs :)
>
> For the readers wondering what is going on here: there is a reason there
> is only a vague mention to two RFCs instead of the specific paragraph where
> it says that Relying Party software must fall back to rsync immediately if
> RRDP is temporarily unavailable. That is because this section doesn’t
> exist. The point is that there is no bug and in fact, Routinator has a
> carefully thought out strategy to deal with transient outages. Moreover, we
> argue that our strategy is the better choice, both operationally and from a
> security standpoint.
>
> The paper shows that Routinator is the most used RPKI relying party
> software, and we know many of you here rely on it for route origin
> validation in a production environment. We take this responsibility and
> therefore this matter very seriously, and would not want you to think we
> have been careless in our software design. Quite the opposite.
>
> We have made several attempts within the IETF to have a discussion on
> technical merit, where aspects such as overwhelming an rsync server with
> traffic, or using aggressive fallback to rsync as an entry point to a
> downgrade attack have been brought forward. Our hope was that our arguments
> would be considered on technical merit, but that did not happen yet. Be
> that as it may, operators can rest assured that if consensus goes against
> our logic, we will change our design.
>
> > perhaps go over to your unbound siblings and discuss this analog.
>
> The mention of Unbound DNS resolver in this context is interesting,
> because we have in fact discussed our strategy with the developers on this
> team as there is a lot to be learned from other standards and operational
> experiences.
>
> We feel very strongly about this matter because the claim that using our
> software negatively affects Internet routing robustness strikes at the core
> of NLnet Labs’ existence: our reputation and our mission to work for the
> good of the Internet. They are the core values that make it possible for a
> not-for-profit foundation like ours to make free, liberally licensed open
> source software.
>
> We’re proud of what we’ve been able to achieve and look forward to a
> continued open discussion with the community.
>
> Respectfully,
>
> Alex
>
Re: plea for comcast/sprint handoff debug help [ In reply to ]
> If there is a covering less specific ROA issued by a parent, this will
> then result in RPKI invalid routes.

i.e. the upstream kills the customer. not a wise business model.

> The fall-back may help in cases where there is an accidental outage of
> the RRDP server (for as long as the rsync servers can deal with the
> load)

folk try different software, try different configurations, realize that
having their CA gooey exposed because they wanted to serve rrdp and
block, ...

randy, finding the fort rp to be pretty solid!
Re: plea for comcast/sprint handoff debug help [ In reply to ]
As I've pointed out to Randy and others, and will now share here:
We planned, but hadn't yet upgraded our Routinator RP (Relying Party)
software to the latest v0.8 which I knew had some improvements.
I assumed the problems we were seeing would be fixed by the upgrade.
Indeed, when I pulled down the new SW to a test machine, loaded and ran it,
I could get both Randy's ROAs.
I figured I was good to go.
Then we upgraded the prod machine to the new version and the problem
persisted.
An hour or two of analysis made me realize that the "stickiness" of a
particular PP (Publication Point) is encoded in the cache filesystem.
Routinator seems to build entries in its cache directory under either
rsync, rrdp, or http; the rg.net PPs weren’t showing under rsync, but
moving the cache directory aside and forcing it to rebuild fixed the issue.

A couple of points seem to follow:

- Randy says: "finding the fort rp to be pretty solid!" I'll say that
if you loaded a fresh Fort and fresh Routinator install, they would both
have your ROAs.
- The sense of "stickiness" is local only; hence to my mind the
protection against "downgrade" attack is somewhat illusory. A fresh install
knows nothing of history.

Tony

On Fri, Oct 30, 2020 at 11:57 PM Randy Bush <randy@psg.com> wrote:

> > If there is a covering less specific ROA issued by a parent, this will
> > then result in RPKI invalid routes.
>
> i.e. the upstream kills the customer. not a wise business model.
>
> > The fall-back may help in cases where there is an accidental outage of
> > the RRDP server (for as long as the rsync servers can deal with the
> > load)
>
> folk try different software, try different configurations, realize that
> having their CA gooey exposed because they wanted to serve rrdp and
> block, ...
>
> randy, finding the fort rp to be pretty solid!
>
Re: plea for comcast/sprint handoff debug help [ In reply to ]
> - Randy says: "finding the fort rp to be pretty solid!" I'll say that
> if you loaded a fresh Fort and fresh Routinator install, they would both
> have your ROAs.
> - The sense of "stickiness" is local only; hence to my mind the
> protection against "downgrade" attack is somewhat illusory. A fresh install
> knows nothing of history.

fort running
enabled rrdp on server
router reports

r0.sea#sh ip bgp rpki table | i 3130
147.28.0.0/20 20 3130 0 147.28.0.84/323
147.28.0.0/19 19 3130 0 147.28.0.84/323
147.28.64.0/19 19 3130 0 147.28.0.84/323
147.28.96.0/19 19 3130 0 147.28.0.84/323
147.28.128.0/19 19 3130 0 147.28.0.84/323
147.28.160.0/19 19 3130 0 147.28.0.84/323
147.28.192.0/19 19 3130 0 147.28.0.84/323
192.83.230.0/24 24 3130 0 147.28.0.84/323
198.180.151.0/24 24 3130 0 147.28.0.84/323
198.180.153.0/24 24 3130 0 147.28.0.84/323

disabled rrdp on server
added new roa 198.180.151.0/25
waited a while
router reports

r0.sea#sh ip bgp rpki table | i 3130
147.28.0.0/20 20 3130 0 147.28.0.84/323
147.28.0.0/19 19 3130 0 147.28.0.84/323
147.28.64.0/19 19 3130 0 147.28.0.84/323
147.28.96.0/19 19 3130 0 147.28.0.84/323
147.28.128.0/19 19 3130 0 147.28.0.84/323
147.28.160.0/19 19 3130 0 147.28.0.84/323
147.28.192.0/19 19 3130 0 147.28.0.84/323
192.83.230.0/24 24 3130 0 147.28.0.84/323
198.180.151.0/25 25 3130 0 147.28.0.84/323 <<<===
198.180.151.0/24 24 3130 0 147.28.0.84/323
198.180.153.0/24 24 3130 0 147.28.0.84/323

as i said, fort seems solid

randy
Re: plea for comcast/sprint handoff debug help [ In reply to ]
> r0.sea#sh ip bgp rpki table | i 3130
> 147.28.0.0/20 20 3130 0 147.28.0.84/323
> 147.28.0.0/19 19 3130 0 147.28.0.84/323
> 147.28.64.0/19 19 3130 0 147.28.0.84/323
> 147.28.96.0/19 19 3130 0 147.28.0.84/323
> 147.28.128.0/19 19 3130 0 147.28.0.84/323
> 147.28.160.0/19 19 3130 0 147.28.0.84/323
> 147.28.192.0/19 19 3130 0 147.28.0.84/323
> 192.83.230.0/24 24 3130 0 147.28.0.84/323
> 198.180.151.0/25 25 3130 0 147.28.0.84/323 <<<===
> 198.180.151.0/24 24 3130 0 147.28.0.84/323
> 198.180.153.0/24 24 3130 0 147.28.0.84/323

note rov ops: if you do not see that /25 in your router(s), the RP
software you are running can be damaging to your customers and to
others.

randy
Re: plea for comcast/sprint handoff debug help [ In reply to ]
Hi Tony,

I realise there are quite a few moving parts, so I'll try to summarise our design choices and reasoning as clearly as possible.

Rsync was the original transport for RPKI and is still mandatory to implement. RRDP (which uses HTTPS) was introduced to overcome some of the shortcomings of rsync. Right now, all five RIRs make their Trust Anchors available over HTTPS, all but two RPKI repositories support RRDP, and all but one relying party software package supports RRDP. There is currently an IETF draft to deprecate the use of rsync.

As a result, the bulk of RPKI traffic is currently transported over RRDP and only a small amount relies on rsync. For example, our RPKI repository is configured accordingly: rrdp.rpki.nlnetlabs.nl is served by a CDN and rsync.rpki.nlnetlabs.nl runs rsyncd on a simple, small VM to deal with the remaining traffic. When operators deploying our Krill Delegated RPKI software ask us what to expect and how to provision their services, this is how we explain the current state of affairs.

With this in mind, Routinator currently has this fetching strategy (roughly sketched in code after the list):

1. It starts by connecting to the Trust Anchors of the RIRs over HTTPS, if possible, and otherwise uses rsync.
2. It follows the certificate tree, following several pointers to publication servers along the way. These pointers can be rsync only or there can be two pointers, one to rsync and one to RRDP.
3. If an RRDP pointer is found, Routinator will try to connect to the service and verify that there is a valid TLS certificate and that data can be successfully fetched. If so, the server is marked as usable and Routinator will prefer it. If the initial check fails, Routinator will use rsync, but verify whether RRDP works on the next validation run.
4. If RRDP worked before but is unavailable for any reason, Routinator will use cached data and try again on the next run instead of immediately falling back to rsync.
5. If the RPKI publication server operator takes away the pointer to RRDP to indicate they no longer offer this communication protocol, Routinator will use rsync.
6. If Routinator's cache is cleared, the process will start fresh.

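In rough pseudocode, the per-repository decision above looks something like
this (illustrative only; this is not Routinator's actual code or internal
state, and the names are made up):

def choose_source(has_rrdp_pointer, rrdp_known_good, rrdp_fetch_ok):
    """Decide what to use for one publication point on this validation run."""
    if not has_rrdp_pointer:
        return "rsync"        # steps 2/5: the CA only points at rsync
    if rrdp_fetch_ok:
        return "rrdp"         # step 3: valid TLS and a successful fetch
    if rrdp_known_good:
        return "cached data"  # step 4: RRDP worked before; retry next run
    return "rsync"            # step 3: the initial RRDP check failed
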
This strategy was implemented with repository server provisioning in mind. We are assuming that if you actively indicate that you offer RRDP, you actually provide a monitored service there. As such, an outage would be assumed to be transient in nature. Routinator could fall back immediately, of course. But our thinking was that if the RRDP service would have a small hiccup, currently a 1000+ Routinator instances would be hammering a possibly underprovisioned rsync server, perhaps causing even more problems for the operator.

"Transient" is currently the focus. In Randy's experiment, he is actively advertising that he offers RRDP, but doesn't offer a service there for weeks at a time. As I write this, ca.rg.net, cb.rg.net and cc.rg.net have been returning a 404 on their RRDP endpoints for several weeks and counting. cc.rg.net was unavailable over rsync for several days this week as well.

I would assume this is not how operators would run their RPKI publication server normally. Not having an RRDP service for weeks when you advertise that you do is fine for an experiment, but constitutes pretty bad operational practice for a production network. If a service becomes unavailable, the operator would swiftly be contacted and the issue would be resolved, as Randy and I have done in happier times:

https://twitter.com/alexander_band/status/1209365918624755712
https://twitter.com/enoclue/status/1209933106720829440

On a personal note, I realise the situation has a dumpster fire feel to it. I contacted Randy about his outages months ago, not knowing they were a research project. I never got a reply. Instead of discussing his research and the observed effects, it feels like a 'gotcha' to present the findings in this way. It could even be considered irresponsible, if the fallout is as bad as he claims. The notion that using our software is, quote, "a disaster waiting to happen", is disingenuous at best:

https://www.ripe.net/ripe/mail/archives/members-discuss/2020-September/004239.html

Routinator's design tries to deal with outages in a responsible manner for all actors involved. Again, of course we can change our strategy as a result of this discussion, which I'm happy we're now actually having. In that case I would advise operators who offer an RPKI publication server to provision their rsyncd service so that it is capable of handling all of the traffic that their RRDP service normally handles, in case RRDP has a glitch. And even if people do scale their rsync service accordingly, they will only ever find out whether it actually copes in a time of crisis.

Kind regards,

-Alex

> On 31 Oct 2020, at 07:17, Tony Tauber <ttauber@1-4-5.net> wrote:
>
> As I've pointed out to Randy and others and I'll share here.
> We planned, but hadn't yet upgraded our Routinator RP (Relying Party) software to the latest v0.8 which I knew had some improvements.
> I assumed the problems we were seeing would be fixed by the upgrade.
> Indeed, when I pulled down the new SW to a test machine, loaded and ran it, I could get both Randy's ROAs.
> I figured I was good to go.
> Then we upgraded the prod machine to the new version and the problem persisted.
> An hour or two of analysis made me realize that the "stickiness" of a particular PP (Publication Point) is encoded in the cache filesystem.
> Routinator seems to build entries in its cache directory under either rsync, rrdp, or http and the rg.net PPs weren’t showing under rsync but moving the cache directory aside and forcing it to rebuild fixed the issue.
>
> A couple of points seem to follow:
> • Randy says: "finding the fort rp to be pretty solid!" I'll say that if you loaded a fresh Fort and fresh Routinator install, they would both have your ROAs.
> • The sense of "stickiness" is local only; hence to my mind the protection against "downgrade" attack is somewhat illusory. A fresh install knows nothing of history.
> Tony
>
> On Fri, Oct 30, 2020 at 11:57 PM Randy Bush <randy@psg.com> wrote:
> > If there is a covering less specific ROA issued by a parent, this will
> > then result in RPKI invalid routes.
>
> i.e. the upstream kills the customer. not a wise business model.
>
> > The fall-back may help in cases where there is an accidental outage of
> > the RRDP server (for as long as the rsync servers can deal with the
> > load)
>
> folk try different software, try different configurations, realize that
> having their CA gooey exposed because they wanted to serve rrdp and
> block, ...
>
> randy, finding the fort rp to be pretty solid!
Re: plea for comcast/sprint handoff debug help [ In reply to ]
> cc.rg.net was unavailable over rsync for several days this week as
> well.

sorry. it was cb and cc. it seems some broken RPs did not have the
ROA needed to get to our westin pop. cf this whole thread.

luckily such things never happen in real operations. :)

randy
Re: plea for comcast/sprint handoff debug help [ In reply to ]
Hi Randy, all,

> On 31 Oct 2020, at 04:55, Randy Bush <randy@psg.com> wrote:
>
>> If there is a covering less specific ROA issued by a parent, this will
>> then result in RPKI invalid routes.
>
> i.e. the upstream kills the customer. not a wise business model.

I did not say it was. But this is the problematic case.

For the vast majority of ROAs, the sustained loss of the repository would lead to invalid ROA *objects*, which will no longer be used in Route Origin Validation, leading to the state 'Not Found' for the associated announcements.
This is not the case if there are other ROAs for the same prefixes published by others (most likely the parent). A quick back-of-the-envelope analysis: this affects about 0.05% of ROA prefixes.
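
For what it's worth, such a back-of-the-envelope count can be made from a flat VRP export; the sketch below assumes CSV lines of "ASN,prefix,maxlen" and uses a covering less-specific VRP with a different origin AS as a crude proxy for "issued by someone else", since a VRP dump does not say which CA published what:

import csv
import sys
from ipaddress import ip_network

def count_covered(csv_path):
    # quadratic and slow on a full VRP table, but fine for a one-off estimate
    vrps = []
    with open(csv_path) as f:
        for row in csv.reader(f):
            try:
                vrps.append((row[0].strip(), ip_network(row[1].strip())))
            except (IndexError, ValueError):
                continue  # skip header or malformed lines
    covered = 0
    for asn, net in vrps:
        if any(o_asn != asn and o_net.version == net.version
               and o_net.prefixlen < net.prefixlen and net.subnet_of(o_net)
               for o_asn, o_net in vrps):
            covered += 1
    return covered, len(vrps)

if __name__ == "__main__":
    c, n = count_covered(sys.argv[1])
    pct = (100.0 * c / n) if n else 0.0
    print("%d/%d prefixes have a covering less-specific from another AS (%.2f%%)"
          % (c, n, pct))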

>> The fall-back may help in cases where there is an accidental outage of
>> the RRDP server (for as long as the rsync servers can deal with the
>> load)
>
> folk try different software, try different configurations, realize that
> having their CA gooey exposed because they wanted to serve rrdp and
> block, ...

We are talking here about the HTTPS server being unavailable while rsync *is* available.

So this means your HTTPS server is down, unreachable, or has an issue with its HTTPS certificate. The repository operator could use a CDN if they don't want to do all this themselves. They could monitor and fix things; there is time.

The thing is, even if HTTPS becomes unavailable, this still leaves hours (8 by default for the Krill CA, configurable) to fix things. Routinator (and the RIPE NCC Validator, and others) will use cached data if they cannot retrieve new data. It's only when manifests and CRLs start to expire that the objects would become invalid.

So the fallback helps only in case of incidents with HTTPS that are not fixed within 8 hours, and then only for 0.05% of prefixes.

On the other hand, the fallback exposes a Malicious-in-the-Middle replay attack surface for 100% of the prefixes published using RRDP, 100% of the time. This allows attackers to prevent changes in ROAs from being seen.

This is a tradeoff. I think that protecting against replay should be considered more important here, given the numbers and the time available to fix an HTTPS issue.


> randy, finding the fort rp to be pretty solid!

Unrelated, but sure I like Fort too.

Tim
Re: plea for comcast/sprint handoff debug help [ In reply to ]
On Mon, Nov 02, 2020 at 09:13:16AM +0100, Tim Bruijnzeels wrote:
> On the other hand, the fallback exposes a Malicious-in-the-Middle
> replay attack surface for 100% of the prefixes published using RRDP,
> 100% of the time. This allows attackers to prevent changes in ROAs to
> be seen.

This is a mischaracterization of what is going on. The implication of
what you say here is that RPKI cannot work reliably over RSYNC, which is
factually incorrect and an injustice to all existing RSYNC-based
deployments. Your view on the security model seems to ignore the
existence of RPKI manifests and the use of CRLs, which exist exactly to
mitigate replays.

Up until 2 weeks ago Routinator indeed was not correctly validating RPKI
data; fortunately this has now been fixed:
https://mailman.nanog.org/pipermail/nanog/2020-October/210318.html

Also, via the RRDP protocol old data can be replayed, because, just
like RSYNC, the RRDP protocol does not have authentication. When RPKI
data is transported from a Publication Point (PP) to a Relying Party (RP),
the RP cannot assume there was an unbroken 'chain of custody' and
therefore has to validate all the RPKI signatures.

For example, if a CDN is used to distribute RRDP data, the CDN is the
MITM (that is literally what CDNs are: reverse proxies, in the middle).
The CDN could accidentally serve up old (cached) content or misserve
current content (swap 2 filenames with each other).

> This is a tradeoff. I think that protecting against replay should be
> considered more important here, given the numbers and time to fix
> HTTPS issue.

The 'replay' issue you perceive is also present in RRDP. The RPKI is a
*deployed* system on the Internet and it is important for Routinator to
remain interoperable with other non-nlnetlabs implementations.

Routinator not falling back to rsync does *not* offer a security
advantage, but does negatively impact our industry's ability to migrate
to RRDP. We are in 'phase 0' as described in Section 3 of
https://tools.ietf.org/html/draft-sidrops-bruijnzeels-deprecate-rsync

Regards,

Job
Re: plea for comcast/sprint handoff debug help [ In reply to ]
I hate to jump in late. but... :)

After reading this a few times it seems like what's going on is:
o a set of assumptions were built into the software stack
this seems fine, hard to build without some assumptions :)

o the assumptions seem to include: "if rrdp fails <how?> feel free
to jump back/to rsync"
I think SOME of the problem is the 'how' there.
Admittedly someone (randy) injected a pretty pathological failure
mode into the system
and didn't react when his 'monitoring' said: "things are broke yo!"

o absent a 'failure' the software kept on getting along as it had before.
After all, maybe the operator here intentionally put their
repository into this whacky state?
How is an RP software stack supposed to know what the PP's
management is meaning to do?

o lots of debate about how we got to where we are, I don't know that
much of it is really helpful.

I think a way forward here is to offer a suggestion for the software
folk to cogitate on and improve:
"What if (for either rrdp or rsync), when there is no successful
update[0] in X of Y attempts, you attempt the other protocol to sync
down and bring the remote PP back to life in your local view?"

This both allows the RP software to pick its primary path (and stick
to that path as long as things work) AND
helps the PP folk recover a bit quicker if their deployment runs into trouble.

0: I think 'failure' here is clear (to me):
1) the protocol is broken (rsync no connect, no http connect)
2) the connection succeeds but there is no sync-file (rrdp) nor
valid MFT/CRL

The 6486-bis rework effort seems to be getting to: "No MFT? no CRL?
you r busted!"
so I think if you don't get MFT/CRL in X of Y attempts it's safe to
say the PP over that protocol is busted,
and attempting the other proto is acceptable.
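
A rough sketch of that X-of-Y idea (hypothetical names and thresholds; not
any existing RP implementation):

from collections import deque

class TransportChooser:
    """Stick with the preferred transport; only try the other one after the
    preferred one has failed at least X of the last Y update attempts."""
    def __init__(self, preferred="rrdp", fallback="rsync", x=3, y=5):
        self.preferred, self.fallback = preferred, fallback
        self.x = x
        self.history = deque(maxlen=y)  # True = good update, False = failure

    def record(self, success):
        # "failure" per [0]: no connect, or a connect with no valid MFT/CRL
        self.history.append(success)

    def next_transport(self):
        failures = sum(1 for ok in self.history if not ok)
        if len(self.history) == self.history.maxlen and failures >= self.x:
            return self.fallback   # the PP looks busted over the primary path
        return self.preferred      # otherwise stick with the primary path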

thanks!
-chris

On Mon, Nov 2, 2020 at 4:37 AM Job Snijders <job@ntt.net> wrote:
>
> On Mon, Nov 02, 2020 at 09:13:16AM +0100, Tim Bruijnzeels wrote:
> > On the other hand, the fallback exposes a Malicious-in-the-Middle
> > replay attack surface for 100% of the prefixes published using RRDP,
> > 100% of the time. This allows attackers to prevent changes in ROAs to
> > be seen.
>
> This is a mischaracterization of what is going on. The implication of
> what you say here is that RPKI cannot work reliably over RSYNC, which is
> factually incorrect and an injustice to all existing RSYNC based
> deployment. Your view on the security model seems to ignore the
> existence of RPKI manifests and the use of CRLs, which exist exactly to
> mitigate replays.
>
> Up until 2 weeks ago Routinator indeed was not correctly validating RPKI
> data, fortunately this has now been fixed:
> https://mailman.nanog.org/pipermail/nanog/2020-October/210318.html
>
> Also via the RRDP protocol old data can be replayed, because, just
> like RSYNC, the RRDP protocol does not have authentication. When RPKI
> data is transported from Publication Point (RP) to Relying Party, the RP
> cannot assume there was an unbroken 'chain of custody' and therefore has
> to validate all the RPKI signatures.
>
> For example, if a CDN is used to distribute RRDP data, the CDN is the
> MITM (that is literally what CDNs are: reverse proxies, in the middle).
> The CDN could accidentally serve up old (cached) content or misserve
> current content (swap 2 filenames with each other).
>
> > This is a tradeoff. I think that protecting against replay should be
> > considered more important here, given the numbers and time to fix
> > HTTPS issue.
>
> The 'replay' issue you perceive is also present in RRDP. The RPKI is a
> *deployed* system on the Internet and it is important for Routinator to
> remain interoperable with other non-nlnetlabs implementations.
>
> Routinator not falling back to rsync does *not* offer a security
> advantage, but does negatively impact our industry's ability to migrate
> to RRDP. We are in 'phase 0' as described in Section 3 of
> https://tools.ietf.org/html/draft-sidrops-bruijnzeels-deprecate-rsync
>
> Regards,
>
> Job
Re: plea for comcast/sprint handoff debug help [ In reply to ]
> Admittedly someone (randy) injected a pretty pathological failure
> mode into the system

really? could you be exact, please? turning an optional protocol off
is not a 'failure mode'.

randy
Re: plea for comcast/sprint handoff debug help [ In reply to ]
On Fri, Nov 6, 2020 at 1:28 AM Christopher Morrow <morrowc.lists@gmail.com>
wrote:
<snip>

> I think a way forward here is to offer a suggestion for the software
> folk to cogitate on and improve?
> "What if (for either rrdp or rsync) there is no successful
> update[0] in X of Y attempts,
> attempt the other protocol to sync down to bring the remote PP back
> to life in your local view."
>
>
100% Please do this.
I also agree with Job's pleas to consider this work as part of the path
outlined in the RSYNC->RRDP transition draft mentioned below.

Tony


> This both allows the RP software to pick their primary path (and stick
> to that path as long as things work) AND
> helps the PP folk recover a bit quicker if their deployment runs into
> troubles.
>
<more snip>

> >
> > > This is a tradeoff. I think that protecting against replay should be
> > > considered more important here, given the numbers and time to fix
> > > HTTPS issue.
> >
> > The 'replay' issue you perceive is also present in RRDP. The RPKI is a
> > *deployed* system on the Internet and it is important for Routinator to
> > remain interopable with other non-nlnetlabs implementations.
> >
> > Routinator not falling back to rsync does *not* offer a security
> > advantage, but does negatively impact our industry's ability to migrate
> > to RRDP. We are in 'phase 0' as described in Section 3 of
> > https://tools.ietf.org/html/draft-sidrops-bruijnzeels-deprecate-rsync
> >
> > Regards,
> >
> > Job
>
Re: plea for comcast/sprint handoff debug help [ In reply to ]
On Fri, Nov 6, 2020 at 5:47 AM Randy Bush <randy@psg.com> wrote:
>
> > Admittedly someone (randy) injected a pretty pathological failure
> > mode into the system
>
> really? could you be exact, please? turning an optional protocol off
> is not a 'failure mode'.

I suppose it depends on how you think you are serving the data.
If you thought you were serving it on both protocols, but 'suddenly' the RRDP
location was empty, that would be a failure.

Same if your RRDP location's tls certificate dies...
One of my points was that it appeared that the software called 'bad
tls cert' (among other things I'm sure)
a failure, but not 'empty directory' (or no diff file). It's possible
that ALSO 'no diff' is considered a failure
but that swapping to alternate transport after a few failures was not
implemented. (I don't know, I have not looked
at that part of the code, and I don't think alex/tim said either way).

I don't think alex is wrong in stating that 'ideally the operator
monitors/alerts on the health of their service'; I
think it's shockingly often that this isn't actually done, though (and
it isn't germane in the case of the test / research in question).

My suggestion is that checking the alternate transport is helpful.

-chris
Re: plea for comcast/sprint handoff debug help [ In reply to ]
>> really? could you be exact, please? turning an optional protocol off
>> is not a 'failure mode'.
> I suppose it depends on how you think you are serving the data.
> If you thought you were serving it on both protocols, but 'suddenly'
> the RRDP location was empty that would be a failure.

not necessarily. it could merely be a decision to stop serving rrdp.
perhaps a security choice; perhaps a software change; perhaps a phase
of the moon.

> One of my points was that it appeared that the software called 'bad
> tls cert' (among other things I'm sure) a failure, but not 'empty
> directory' (or no diff file). It's possible that ALSO 'no diff' is
> considered a failure

what the broken client software called what is not my problem. not every
http[s] server in the universe is an rrdp server. if the client has some
belief, for whatever reason, that it should be, that is a brokenness.

> I don't think alex is wrong in stating that 'ideally the operator
> monitors/alerts on health of their service'

i do. i run clients.

> My suggestion is that checking the alternate transport is helpful.

i do not see rrdp as a critical service; after all, it is not mti.
but i am quite aware of whether it is running or not. the problem is
that routinator seems not to be.

randy
Re: plea for comcast/sprint handoff debug help [ In reply to ]
i may understand one place you could get confused. unlike a root CA
which publishes a TAL which describes transports, a non-root CA does not
publish a TAL describing what transports it supports. of course, rsync
is mandatory to provide; but anything else is "if it works, enjoy it.
otherwise use rsync."

randy
Re: plea for comcast/sprint handoff debug help [ In reply to ]
On Fri, Nov 6, 2020 at 3:09 PM Randy Bush <randy@psg.com> wrote:
>
> >> really? could you be exact, please? turning an optional protocol off
> >> is not a 'failure mode'.
> > I suppose it depends on how you think you are serving the data.
> > If you thought you were serving it on both protocols, but 'suddenly'
> > the RRDP location was empty that would be a failure.
>
> not necessarily. it could merely be a decision to stop serving rrdp.
> perhaps a security choice; perhaps a software change; perhaps a phase
> of the moon.

right, this is all in the same set of: "failure modes not caught"
(I think; I don't care so much WHY you stopped serving RRDP, just that
after a few failures
the caller should try my other number (rsync))

>
> as i do not see rrdp as a critical service, after all it is not mti,
> but i am quite aware of whether it is running or not. the problem is
> that routinator seems not to be.

sure... it's just made one set of decisions. I was hoping with some
discussion we'd get to:
Welp, sure, we can fall back and try rsync if we don't see success in <some> time.
