Mailing List Archive

Automation - The Skinny (Was: Re: ACX5448 & ACX710)
On 23/Jan/20 12:59, Thomas Scott wrote:

> I had that conversation again the other day - someone said they were
> working on "automation" and when I probed deeper it revealed some
> (very useful, albeit not scalable) scripting. 'You keep using that
> word. I do not think it means what you think it means' seems to be the
> gist of most "automation" conversations, and I'm guilty often as not -
> 'automate it' seems to be a catch all for most of our operations issues.

You raise an important point.

The "automation" story is something most operators feel they need to be
talking about, and if they aren't doing anything yet or don't know how
to contribute to the discussion, they feel better not admitting that
they aren't doing anything about it, or that they don't yet have a plan
they are really comfortable with.

When it all started going to hell with "SDN" back in 2011, skeptical-me
was immediately activated. I attended the MPLS/SDN/NFV/IPv6 congress in
Paris between 2013 and 2015, and realized I had little appetite for venues
where most presenters are trying to out-hype each other with the next
buzzword. I mean, you start to get a little "Uh huh" when someone comes
up to the stage and announces, "MPLS is dead and will no longer be seen
in networks from 2015".

Since about 2012, every time we've felt we've come close to finding an
"automation" solution that actually works and scales as well as it
sounds on the tin and from the community, we just get that niggling
feeling that things just aren't yet quite ready.

We don't have swaths of software developers lining our corridors ready
to write code for whatever we want. The few we do have are inundated
with other tasks that could have immediate but long-lasting solutions,
as we wade through the maze and sea that is "automation".

In 2020, after all the stories and buzzwords, we are homing in on
Ansible and Terraform, and even with those, we are being careful not to
waste time and resources we don't have. All other solutions, IMNSHO, are
just a waste of time.

My 1+1 assessment of all of these issues is, I believe, down to the fact
that the industry wants to automate in an open standards manner, where
both vendors and operators are able to solve all implementation and
operational tasks with automation. If we did not have to have an open
standards mechanism for this, we'd all be automating - individually - to
our heart's content. I mean, we all know that the largest of the large
operators have had internal automation for decades, and despite all the
work going on at an industry level "to automate", they continue to have
and build their own internal automation tools that are bespoke and
proprietary to them. While they see the need to join the industry in
standardizing stuff they've already been doing for decades, they also
aren't feeling the pressure to push that harder than they could. After
all, the tools they've built are what has given them the edge, and they
aren't going to let that go anytime soon. And yes, even though we
sometimes get crumbs and drabs from the large content providers, when
they are in the mood, it's what they are willing to share with us, their
subjects. There's heaps more interesting and exciting code they will
never share with the community, even though they are some of the loudest
voices you will find pushing for "industry-standard automation".

Since general consensus amongst those who haven't been "automating" for
a while now is that we should only automate if it is standardized, the
majority of the industry really interested in that agenda gets stuck in
limbo, yo-yoing almost in an endless loop. Why? Because unlike
large network operators who've been automating for decades, and unlike
large content providers who can throw 1,000 software developers at one
command line, the rest of the Internet community is not as gifted.

While I'm not suggesting that automating internally and independently of
the industry is the solution, my recommendation is to take a step back
from all the noise, as an individual operator, and assess your place in
the entire mix, and what this all means to you as a single network, and
to the industry that you love and want to support into the next 100 years.

For us, we still have a healthy dose of CLI communication with our
network to keep our engineers fresh and actually in tune with really
understanding what the network is actually doing. But we are also going
down the Ansible + Terraform path like Indiana Jones into a cave...
slowly and looking out for the mine fields laid not by Ansible or
Terraform, but by the politics automation has created in our industry.

It is also good to realize that if we are working from a "Day 1 to Day
End" position, that's never going to come. So we need to be deliberately
(and perhaps, selfishly) smarter about the paths we want to go down re:
"automation", as individual network operators.

Mark.
_______________________________________________
juniper-nsp mailing list juniper-nsp@puck.nether.net
https://puck.nether.net/mailman/listinfo/juniper-nsp
Re: Automation - The Skinny (Was: Re: ACX5448 & ACX710)
On Fri, 24 Jan 2020 at 10:33, Mark Tinka <mark.tinka@seacom.mu> wrote:

> Since about 2012, every time we've felt we've come close to finding an

In my opinion we have been doing roughly the same thing, the same way, in
networks, with the same protocols, since the start of my career in the
90s. Very little has changed, and you could drop a competent neteng from
the 90s into today and they'd be immediately productive. Compare this to
what has happened to compute; the difference is striking.

> My 1+1 assessment of all of these issues is, I believe, down to the fact
> that the industry wants to automate in an open standards manner, where

People who think that netconf and yang are solving big problems and
are key to solving automation probably haven't done much automation.
Roughly, netconf is the new snmp and yang is the new mib; whatever they
enable could have been enabled by existing protocols decades ago, and
the advantages are modest and will remain so. The key enabler for
automation is the device accepting an arbitrary new config B while it is
running an arbitrary config A, and transitioning there hitlessly.
Generating a full new config from DB+template is a trivial problem;
trying to be aware of network state and move from arbitrary state A to
arbitrary state B with a minimal amount of changes is a hard and
unnecessary problem.
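
As a rough illustration of the "full new config from DB+template" point above, here is a minimal Python sketch. The inventory dictionary, the template text and the name render_config are all made up for the example; a real setup would presumably pull the data from an actual database and may well use a richer template engine.

```python
from string import Template

# Hypothetical per-device data; in practice this would come from a database.
DEVICE_DB = {
    "pe1": {"hostname": "pe1", "loopback": "192.0.2.1", "asn": "64512"},
    "pe2": {"hostname": "pe2", "loopback": "192.0.2.2", "asn": "64512"},
}

# Hypothetical Junos-style "set" lines; not a complete or vetted config.
CONFIG_TEMPLATE = Template(
    "set system host-name $hostname\n"
    "set interfaces lo0 unit 0 family inet address $loopback/32\n"
    "set routing-options router-id $loopback\n"
    "set routing-options autonomous-system $asn\n"
)


def render_config(device: str) -> str:
    """Render the complete intended config for one device from DB + template."""
    return CONFIG_TEMPLATE.substitute(DEVICE_DB[device])


if __name__ == "__main__":
    print(render_config("pe1"))
```

Note that the output is always the complete target config, never a delta against what the device currently runs; working out the transition is left to the device, which is the property argued for above.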


If/when the network becomes more cloudified, more as-a-service, where
you use an API to turn up your own active devices and circuits where you
want, when you want, instead of owning anything, and once those
proprietary APIs get some subset of standard APIs, we'll probably start
to see an openstack, kubernetes type of complexity explosion in networks
too. But as long as we keep owning the network, most will keep running
it as a CLI jockey network: touch when you must, but in many cases no one
touches it for weeks or months.


--
++ytti

Re: Automation - The Skinny (Was: Re: ACX5448 & ACX710)
On 24/Jan/20 12:10, Saku Ytti wrote:

> In my opinion we do roughly the same thing, the same way in networks,
> with the same protocols since my start of career in 90s, very little
> has changed and you could drop competent neteng from 90s to today and
> they'd be immediately productive. Compare this to what has happened to
> compute the difference is striking.

Agreed - but is it really enough to the extent that the common buzz
sentence nowadays is "Network engineers are dead, they'll all be
replaced by software [developers]"?

I mean, I'd wager that more than half of the problems you find with
automation and tooling development come down to a total lack of protocol
between software developers and network engineers in the same company. While
there has been plenty of success with a software developer reading a
networking-related RFC and writing code for that without needing to
understand, really, how IP/MPLS networks work, it's a whole other issue
trying to teach a network engineer how to write code, or a software
developer what IS-IS actually does.

I can't remember if I gave this example here before, but I know of a
network operator in Vienna who had to scramble and get their engineer
trained on CLI when they'd been setting up peering sessions fine for 3
years via a GUI, and when the GUI and automation front-end all went to
hell, that network engineer didn't know how to fall back to simple CLI
to setup even simpler BGP sessions for peering, by hand.

While clicking on GUIs is great, I don't have confidence that a network
of any decent scale can be run, today, without some form of CLI
jockeying. And on the back of that, do we want to kill off the basics of
a network engineer in favour of Day 1 university graduates eager to
click a GUI button when provisioning your backbone, when they don't
actually understand what the "Wide Metric" checkbox means?



> People who think that netconf and yang are solving big problems and
> are key to solve automation probably haven't done much automation.

Totally agreed. But to also be fair, NETCONF/YANG are normally being
touted by vendors (much like Segment Routing, 5G and SD-WAN, but I
digress). I've not really found actual operators with anything
meaningful and useful to say about NETCONF/YANG.

Raise your hands if I'm talking nonsense.

For us, we find this whole NETCONF/YANG thing to be too heavy for simple
instructions you need to send to devices, not to mention the fact that
support within and between vendors is questionable (FlowSpec, anyone?).

I mean, that's why Ansible was so pleasing to our fingertips - all you
need is SSH and a large-enough, repetitive problem you want to go away
quickly.


> Roughly netconf is new snmp and yang is new mib, whatever they enable
> could have been enabled by existing protocols decades ago, the
> advantages are modest and will remain so.

Completely agreed!


> The key enabler for
> automation is device accepting arbitrary new B config when it is
> running arbitrary new A config and transition there hitlessly.
> Generating full new config from DB+template is trivial problem, trying
> to be aware of network state and move from arbitrary state A to
> arbitrary state B with minimal amount of changes is hard and
> unnecessary problem.

I tend to agree with you, Saku. What I've heard (from the vendors,
again) is that Ansible is not great because you don't inherently get
state confirmation feedback after posting the new configuration, and
that adding that intelligence into Ansible requires time and energy to
code. Okay, fair point, I'll bite. But also, we are network engineers -
we know what commands do when they run, and we've spent decades building
templates from as simple as a Windows Notepad text to as complex as a
MySQL database.

Then again, Terraform is meant to fix that downside of Ansible, but for
me, I don't really see that as a big issue. We aren't trying to
provision services across network domains (despite what MEF's LSO
architecture will have you believe), and even if we were, do I really
want you fiddling in my network? We each know our networks better than
outsiders know them, so what gives?


> If/when network becomes more cloudified, more as-a-service, where you
> use API to turn up your own active devices and circuits where you
> want, when you want, instead of owning anything and once those
> proprietary APIs get some subset standard APIs we'll probably start to
> see openstack, kubernetes type of complexity explosion in networks
> too.

MEF's LSO, which they've been pushing since about 2014. The concept is
sexy, but honestly, I've not heard much ado in 6 years re: real-world
deployment.

Also, while I'm wild enough to be one of the first maniacs to run a
network-wide Route Reflector on a VM on a server in 2014, you won't find
me deploying said RR's in AWS or Azure, so I can access them over some
API into an Openstack/Kubernetes/Docker enclosure. Life is
interesting enough as it is :-).


> But as long as we keep owning the network most will keep running
> it CLI jockey network, touch when you must, but in many cases no one
> touches it for weeks or months.

Words of wisdom.

I just want to get back to building, operating and optimizing IP/MPLS
networks. I don't mind seeing the back of the last 10 years of SD-this
and 5G-that.

Mark.

Re: Automation - The Skinny (Was: Re: ACX5448 & ACX710)
> Mark Tinka
> Sent: Friday, January 24, 2020 2:32 PM
>
> On 24/Jan/20 12:10, Saku Ytti wrote:
>
> > In my opinion we do roughly the same thing, the same way in networks,
> > with the same protocols since my start of career in 90s, very little
> > has changed and you could drop competent neteng from 90s to today and
> > they'd be immediately productive. Compare this to what has happened to
> > compute the difference is striking.
>
> Agreed - but is it really enough to the extent that the common buzz
> sentence nowadays is "Network engineers are dead, they'll all be replaced
> by software [developers]"?
>
> I mean, I'd wager that more than half of the problems you find with
> automation and tooling development is a total lack of protocol between
> software developers and network engineers; in the same company. While
> there has been plenty of success with a software developer reading a
> networking-related RFC and writing code for that without needing to
> understand, really, how IP/MPLS networks work, it's a whole other issue
> trying to teach a network engineer how to write code, or a software
> developer what IS-IS actually does.
>
You nailed it Mark,
My opinion is that this new NetDevOps/NetOps initiative is the biggest
blunder of the networking industry.
If as a network engineer/architect you have some coding skills, well, good
for you.
But should programming skills be a requirement to get into network
engineering/architecture nowadays? That absolutely should not be the case.
We need skilled network engineers and architects to know how to build and
operate complex networks.
We need skilled developers and system architects to know how to build and
operate complex systems (including network automation systems).
We need these two groups to be able to talk to each other in a constructive
manner - check out Model-driven engineering to get you started.
Following these 3 simple premises you can then afford to have an army of
web-ui clickers provisioning network services without knowing the first
thing about what's going on in the background of the network automation
system. Or not, and you just hand over the web-ui/API to your customers and
have them self-service.

Imagine a case where a network engineer builds an automation solution based
on a number of hacks involving ansible, python, ydk, whatever... and this
solution gets traction and is used by the company.
Now that poor networking guy has a full-time job supporting the automation
solution, fixing bugs and developing new functionality, and you just lost one
network engineer. This is a good example of jack of all trades but master of
none.
Even if you're a small operation or a start-up, hiring a developer and having
them talk to the network engineer in a virtual team is a much better option.



> > People who think that netconf and yang are solving big problems and
> > are key to solve automation probably haven't done much automation.
>
> Totally agreed. But to also be fair, NETCONF/YANG are normally being
> touted by vendors (much like Segment Routing, 5G and SD-WAN, but I
> digress). I've not really found actual operators with anything
> meaningful and useful to say about NETCONF/YANG.
>
> Raise your hands if I'm talking nonsense.
>
> For us, we find this whole NETCONF/YANG thing to be too heavy for simple
> instructions you need to send to devices, not to mention the fact that
> support within and between vendors is questionable (FlowSpec, anyone?).
>
> I mean, that's why Ansible was so pleasing to our fingertips - all you
> need is SSH and a large-enough, repetitive problem you want to go away
> quickly.
>
Don't judge a book by its cover; in other words, just give NETCONF and YANG
a try, seriously.

I'd say that NETCONF's biggest advantage over SNMP/CLI is its transaction
mechanism, particularly atomicity and consistency ("all or nothing" and "all
at once") from the full ACID set, but all of these are addressed by every NOS
supporting two-stage commit via CLI (as Saku mentioned below), so not a
biggie.
Sorry Mark, XE is not one of those NOSes, but you could still get the
functionality on XE using NETCONF ;)
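
To make the "all or nothing" property above a bit more concrete, here is a toy Python sketch of a candidate/running split with a validating commit. It is only a model of the idea: ToyDevice, commit and mtu_is_sane are invented for illustration and are not a NETCONF client or any vendor's API.

```python
import copy


class ToyDevice:
    """Toy model of a candidate/running config split; not a real NOS."""

    def __init__(self):
        self.running = {}

    def commit(self, changes, validate):
        """Apply every change in `changes`, or none of them."""
        # Build the candidate on a copy so the running config is never
        # left half-changed.
        candidate = copy.deepcopy(self.running)
        candidate.update(changes)
        if not validate(candidate):
            return False  # running config untouched
        self.running = candidate  # swapped in as a single step
        return True


def mtu_is_sane(config):
    """Hypothetical validation rule: every *.mtu value must be 576..9192."""
    return all(576 <= value <= 9192
               for key, value in config.items() if key.endswith(".mtu"))
```

If a batch of changes contains one bad value, the whole batch is rejected and the running config stays exactly as it was, which is the behaviour two-stage commit gives you on the CLI as well.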

YANG on the other hand gives one a common modelling language for
representing services layer configuration and network layer configuration,
which I find useful.
But I'm a minority, I guess there aren't many of you using RFC8299 & RFC8466
as bases for decomposing your L2 and L3 services and building a service
abstraction layer, on top of network configuration layer, so YMMV.

> > Roughly netconf is new snmp and yang is new mib, whatever they enable
> > could have been enabled by existing protocols decades ago, the
> > advantages are modest and will remain so.
>
> Completely agreed!
>
Regarding SNMP vs NETCONF similarities,
For pulling operational data yes, for pushing configuration not really...
see ACID above.


> > The key enabler for
> > automation is device accepting arbitrary new B config when it is
> > running arbitrary new A config and transition there hitlessly.
Agreed.

> > Generating full new config from DB+template is trivial problem,
>
Agreed.

> > trying
> > to be aware of network state and move from arbitrary state A to
> > arbitrary state B with minimal amount of changes is hard and
> > unnecessary problem.
>
With CLI scraping it most definitely is a hard problem, but with YANG
model representations of service and network state it's as easy as
commit compare/rollback/check on a single network element.
Necessary or not, I guess it depends on the level of automation desired.

> I tend to agree with you, Saku. What I've heard (from the vendors,
> again) is that Ansible is not great because you don't inherently get state
> confirmation feedback after posting the new configuration, and that adding
> that intelligence into Ansible requires time and energy to code. Okay,
> fair point, I'll bite. But also, we are network engineers - we know what
> commands do when they run, and we've spent decades building templates
> from as simple as a Windows Notepad text to as complex as a MySQL
> database.
>
I'd say there are different levels of automation,
At the entry level the aim might be just faster CLI interaction/simpler CLI
scraping (Ansible and the like) and at the other extreme there's the full
potential of model based engineering realized with frameworks like ONAP.
And operators naturally find themselves somewhere between these two poles in
their automation efforts which is perfectly fine.

> Then again, Terraform is meant to fix that downside of Ansible, but for
> me, I don't really see that as a big issue. We aren't trying to provision
> services across network domains (despite what MEF's LSO architecture will
> have you believe), and even if we were, do I really want you fiddling in
> my network. We each know our networks better than outsiders know them,
> so what gives?
>
If you give your customers a web-UI or API to your self-service portal that
in turn talks to your automation system, then it doesn't really matter
whether it's your customers or a partner operator using the API. These APIs
have been around for quite some time; the MEF LSO is just an attempt at
standardizing things.

>
> > If/when network becomes more cloudified, more as-a-service, where you
> > use API to turn up your own active devices and circuits where you
> > want, when you want, instead of owning anything and once those
> > proprietary APIs get some subset standard APIs we'll probably start to
> > see openstack, kubernetes type of complexity explosion in networks
> > too.
>
> MEF's LSO, which they've been pushing since about 2014. The concept is
> sexy, but honestly, I've not heard much ado in 6 years re: real-world
> deployment.
>
As I said the APIs at various levels for various functions have been around
for quite some time...


> Also, while I'm wild enough to be one of the first maniacs to run a
> network-wide Route Reflector on a VM on a server in 2014, you won't find
> me deploying said RR's in AWS or Azure, so I can access them over some API
> into an Openstack/Kubernetes/Docker enclosure. Life is too interesting
> enough as it is :-).
>
>
> > But as long as we keep owning the network most will keep running it
> > CLI jockey network, touch when you must, but in many cases no one
> > touches it for weeks or months.
>
> Words of wisdom.
>
That's a pretty lonely network where no customer change or new customer
request comes in for months...
No, seriously, why would anyone spend any effort automating an IP/MPLS core
which, as you rightly pointed out, is implemented once and then just sits
there for months without change?
You'd automate the low-hanging fruit, i.e. repetitive tasks in large volume,
like tasks associated with the customer service lifecycle (i.e. PEs and CPEs
and everything in between, like aggregation/access networks or interaction
with providers of access/aggregation networks).
And then there's a whole other universe of value-added services in your
telco cloud (that's where your customer gets a firewall VM spun up and
provisioned when he/she clicks on the "I want a protected internet service
add-on") which needs to tie in with your network service provisioning
workflows.

> I just want to get back to building, operating and optimizing IP/MPLS
> networks. I don't mind seeing the back of the last 10 years of SD-this and
> 5G-that.
>
I see it almost as two separate things: the "simple" plumbing between PEs,
and the complex stuff happening at the customer-facing side of PEs.

adam

Re: Automation - The Skinny (Was: Re: ACX5448 & ACX710)
Hi Adam,

I would almost agree entirely with you except that there are two completely
different reasons for automation.

One as you described is related to service provisioning - here we have full
agreement.

The other one is actually keeping your network running. Imagine a router
maintaining its entire control plane perfectly fine, imagine BFD working
fine to the box from peers, but the box dropping between 20% and 80% of
traffic between line cards via the fabric. Unfortunately this is not theory
but the real world :(

Basic IGP, BGP, LDP, BFD etc. will not catch this ... you need a bit of
clever automation going way above them to detect it and either alarm the
NOC or, if they are really smart, take such a router out of the SPF
network-wide. If not, you sit and wait till pissed customers call - which is
already a failure.

Sure, not everyone needs to be a great coder ... but having network engineers
with skills sufficient to understand code, debug it, or at a minimum design
the functional blocks of the automation routines is really a must-have
today.

And I am not even mentioning all of the new OEM platforms with an OS
coming from a completely different part of the world :) That's when the real
fun starts and the rubber hits the road, when the network eng cannot run gdb
on a daily basis.

Cheers,
Robert.

Re: Automation - The Skinny (Was: Re: ACX5448 & ACX710)
On Mon, 27 Jan 2020 at 00:18, Robert Raszuk <robert@raszuk.net> wrote:

> The other one is actually of keeping your network running. Imagine router maintaining entire control plane perfectly fine, imagine BFD working fine to the box from peers but dropping between line cards via fabric from 20% to 80% traffic. Unfortunately this is not a theory but real world :(
>
> Without proper automation in place going way above basic IGP, BGP, LDP, BFD etc ... you need a bit of clever automation to detect it and either alarm noc or if they are really smart take such router out of the SPF network wide. If not you sit and wait till pissed customers call - which is already a failure.

Automation and monitoring are, to me, very different subjects.
Everyone has war stories of those long-tail problems where something
utterly weird is happening in the network and how problematic it was
to find. But this particular example is fairly easy: either you are
polling a drop counter which shows the drops, or your packets in minus
(packets out + drops) delta is off.

But there will always be a massive amount of long-tail risks which your
NMS won't know about; things break in very creative and complex
ways. And you can monitor these very carefully, you can screenscrape
all NPU counters, and your network is behaving _right now_
suboptimally: you see NPU exceptions/trapstats increasing which should
not, and you can spend months figuring out 1 issue out of the hundred you
have, all of which are real issues, but which might affect one packet
in a billion.

Is it worth knowing these? We are screenscraping and graphing all NPU
counters, as these typically are not available in the GUI; in the case of
JunOS they are not even modelled, because they are PFE counters. We rarely
tend to them proactively, because fixing them causes more outages than
letting them be. But often, when strange issues which customers care about
do happen at scale, these counters reduce MTTR.

So if you think you don't have active issues, you're not monitoring
well enough. When you do monitor well enough, you have to decide which
issues to fix and which to let be.
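
The "packets in minus (packets out + drops)" check mentioned above could be sketched roughly like this in Python. The CounterSample fields, the function names and the 1% tolerance are assumptions made for the example; the sketch also ignores counter wraps, locally terminated or originated traffic and multicast replication, all of which a real poller would have to handle.

```python
from dataclasses import dataclass


@dataclass
class CounterSample:
    """One poll of a box's aggregate counters (hypothetical field names)."""
    packets_in: int
    packets_out: int
    drops: int


def unaccounted_packets(prev: CounterSample, curr: CounterSample) -> int:
    """Packets that came in during the interval but were neither
    forwarded nor counted as dropped."""
    delta_in = curr.packets_in - prev.packets_in
    delta_out = curr.packets_out - prev.packets_out
    delta_drops = curr.drops - prev.drops
    return delta_in - (delta_out + delta_drops)


def is_suspicious(prev: CounterSample, curr: CounterSample,
                  tolerance: float = 0.01) -> bool:
    """Flag the interval if more than `tolerance` of input is unaccounted for."""
    delta_in = curr.packets_in - prev.packets_in
    if delta_in <= 0:
        return False  # nothing arrived, nothing to compare against
    return abs(unaccounted_packets(prev, curr)) / delta_in > tolerance
```

A box that silently loses traffic across the fabric shows up here as input that never reappears as either output or drops, without any protocol on the box having to notice.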


--
++ytti

Re: Automation - The Skinny (Was: Re: ACX5448 & ACX710)
> From: Robert Raszuk <robert@raszuk.net>
> Sent: Sunday, January 26, 2020 10:18 PM
>
> Hi Adam,
>
> I would almost agree entirely with you except that there are two completely
> different reasons for automation.
>
> One as you described is related to service provisioning - here we have full
> agreement.
>
> The other one is actually of keeping your network running. Imagine router
> maintaining entire control plane perfectly fine, imagine BFD working fine to
> the box from peers but dropping between line cards via fabric from 20% to
> 80% traffic. Unfortunately this is not a theory but real world :(
>
Very good point Robert,
There are indeed two parts to the whole automation story
(it's obvious that this theme deserves a series of blog posts, but I keep on finding excuses).

The analogy I usually use in presentations is the left brain right brain analogy,
Where left brain is responsible for logical thinking and right brain is responsible for creative thinking and intuition.
So a complete automation solution is built similarly:
Left brain is responsible for routine automated service provisioning
- and contains models of resources, services, devices, workflows, policies -and you can teach it by loading new/additional models.
Right brain on the other hand is responsible for "self-driving" the network (yeah I know can't think of better term)
- and collects data from network and acts on distributed policies, and also performs trending, analytics, correlation, arbitration etc...
Now left brain and right brain talk to each other obviously,
Policies are defined in left brain and distributed to right brain to act on them.
Also right brain can trigger workflows in left brain.
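
The interaction described above (policies defined on the left, distributed to the right, and the right triggering workflows back on the left) can be sketched in a few lines of Python. Everything here is hypothetical, class names, fields and the example policy included; it only shows the direction of the calls, not how any real framework does it.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Policy:
    name: str
    metric: str
    threshold: float
    workflow: str  # workflow to trigger in the left brain when breached


@dataclass
class LeftBrain:
    """Provisioning side: holds the models, policies and workflows."""
    policies: List[Policy] = field(default_factory=list)
    triggered: List[str] = field(default_factory=list)

    def run_workflow(self, name: str) -> None:
        # Stand-in for starting a real provisioning workflow.
        self.triggered.append(name)


@dataclass
class RightBrain:
    """Assurance side: evaluates collected data against distributed policies."""
    left: LeftBrain
    policies: List[Policy] = field(default_factory=list)

    def load_policies(self) -> None:
        # Policies are defined in the left brain and copied over here.
        self.policies = list(self.left.policies)

    def observe(self, metric: str, value: float) -> None:
        # A breached policy triggers its workflow back in the left brain.
        for policy in self.policies:
            if policy.metric == metric and value > policy.threshold:
                self.left.run_workflow(policy.workflow)
```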

A major paradigm shift for our service designers here will be that they are now going to be responsible not only for putting the individual service building blocks together in terms of config (and service lifecycle workflow - tbd), but also in terms of policies determining the health of the provisioned service (including thresholds, post-checks, ongoing checks etc...)
But following the MDE (Model-driven Engineering) theme, it's not just service designers contributing to the policy library, it's Ops teams, Security teams, etc...
The main advantage I see is that some of the policies that will be created for the soon-to-be-automated service certification testing could then be reused for the particular service provisioning post-test and service lifecycle monitoring, and vice versa.
Then obviously there are policies defining what to do in various DDoS scenarios, and I consider the vendor solutions actually doing analytics, correlation and arbitration all part of the left brain.

> Without proper automation in place going way above basic IGP, BGP, LDP,
> BFD etc ... you need a bit of clever automation to detect it and either alarm
> noc or if they are really smart take such router out of the SPF network wide.
> If not you sit and wait till pissed customers call - which is already a failure.
>
Then nowadays there's also the possibility to enable tons upon tons of streaming telemetry, where I could see it all landing in a common data lake where some form of deep convolutional neural network could be used for unsupervised pattern/feature learning. The reason being I'd like the system to tell me: look, if this counter is high and that one too and this one is low, then this usually happens. But I'd rather wait to see what the industry offers in this area than develop such solutions internally. For now I'm glad I have automation projects going; when I asked whether we should have AI in the network strategy for 2020 I got awkward silence in response.


> Sure not everyone needs to be great coder ... but having network eng with
> skills sufficient enough to understand code, ability to debug it or at min
> design functional blocks of the automation routines are really must have
> today.
>
I don't know, my experience is that working in tandem with a devops person (as opposed to trying to figure it out myself) gets me the desired results much faster (and in line with whatever their sys-architecture guidelines or coding principles are) while I can focus on WHAT (from the network perspective) not HOW (coding/system perspective). Although yes, for some of the POC stuff I wish I had some coding skills.
But to give you a concrete example from my work, when I had a choice to read some python books or some more microservice architecture books I chose the latter as it was more important for me to know the difference between for instance orchestration and choreography among other aspects of microservice architectures to assess the pros and cons of each in order to make an educated argument for the service workflow engine architecture choice - so it lines up with what I had in mind for service layer workflows flexibility/agility.

> And I am not even mentioning about all of the new OEM platforms with OS
> coming from completely different part of the world :) That's when the real
> fun starts and rubber hits the road when network eng can not run gdb on a
> daily basis.
>
Well, we are starting to get a glimpse of it already with a VM of a Route-Reflector running on a server: who owns the host (HW & SW), is it sys-admins or ip-ops? Mark could shed some light on that based on his experience running vRRs.
But I guess my argument stands in this area as well: once you develop a successful OEM HW and NOS match that gets traction within the company, you're no longer a network engineer, as you've been promoted to full-time vendor of this product (dealing with bugs, new features, the overall support of this in-house built platform).
So I stand by my MDE mantra - I'd rather stay as an SME for the networking side of things on the project and let devops/sysadmins do what they are best at.


adam

_______________________________________________
juniper-nsp mailing list juniper-nsp@puck.nether.net
https://puck.nether.net/mailman/listinfo/juniper-nsp
Re: Automation - The Skinny (Was: Re: ACX5448 & ACX710) [ In reply to ]
On Mon, 27 Jan 2020 at 22:30, <adamv0025@netconsultings.com> wrote:

> Then nowadays there's also the possibility to enable tons upon tons of streaming telemetry -where I could see it all landing in a common data lake where some form of deep convolutional neural networks could be used for unsupervised pattern/feature learning, -reason being I'd like the system to tell me look if this counter is high and that one too and this is low then this usually happens. But I'd rather wait to see what the industry offers in this area than developing such solutions internally. For now I'm glad I have automation projects going, when I asked whether we should have AI in network strategy for 2020 I got awkward silence in response.

We should learn to crawl before we take a rocket to Proxima Centauri.

You don't need ML/AI to find problems in your network. Use an algorithm
like 'this counter, which increments at rate X, stopped incrementing or
started to increment 100 times slower' and 'this counter, which does
not increment, started to increment', and you'll find a lot of
problems in your network. But do you care about every problem in your
network, or only problems that customers care about?

Juniper once, at an EBC, had some really smart academics explain to us
their ML/AI project, which predicts resource needs on a given system.
They quoted how close they got to real numbers, then I asked how it
performs against a naive system. By naive system I mean something like
'my box has 1M FIB entries, so one FIB entry uses RLDRAM/1M', used to
extrapolate FIB usage in an arbitrary config. They hadn't tried this
and couldn't tell how well the ML/AI performs against it.
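
As a rough illustration of the naive baseline described above, the
sketch below extrapolates FIB memory linearly from one reference
measurement and gives a way to compare any model's prediction error
against it. The function names and all numbers are hypothetical, and
memory use is assumed to scale linearly with entry count, which is
exactly the simplification a smarter model would have to beat.

```python
def naive_fib_memory(reference_bytes, reference_entries, target_entries):
    """Extrapolate FIB memory linearly from a single reference measurement."""
    if reference_entries <= 0:
        raise ValueError("reference_entries must be positive")
    # Per-entry cost, i.e. the 'RLDRAM/1M' figure from the reference box.
    bytes_per_entry = reference_bytes / reference_entries
    return bytes_per_entry * target_entries


def relative_error(predicted, actual):
    """Relative prediction error, usable for the ML model and the baseline alike."""
    return abs(predicted - actual) / actual


# Hypothetical reference: 512 MB in use with 1M FIB entries installed.
predicted = naive_fib_memory(512e6, 1_000_000, 2_000_000)
print(predicted)
```

Running both the ML prediction and this baseline through
relative_error() against measured values would answer the question
asked in the EBC.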

Can you really train today's ML/AI to determine what actually matters?
I don't think you can, because what actually matters is something that
impacted a customer, and you simply cannot put enough learning data in.
You don't have nearly enough customer trouble tickets to be able to
correlate them to the network data you're collecting and start
predicting which complex counter combinations precede a customer
ticket.

But are you at least monitoring how many packets are lost inside your
network? Delta of input/output? That is a fairly trivial way to cover
_all reasons for packet loss_. Of course latency/jitter are not
covered, but still, it covers a lot of ground fast. Do you have a
single system where you collect all data? Have you enriched the data
with stuff like npu, linecard, city, country, region? Almost no one is
doing even very basic stuff, so I think ML/AI isn't going to be the
low-hanging fruit any time soon. If you have a single system with a lot
of labels for every counter, you can do a lot with very naive
analytics. If you don't have the data, you can't do anything with the
smartest possible system. And I think almost no one is collecting data
in such a manner that it's actually capitalisable, because we can keep
running the network the way we did in the 90s: IF-MIB and netflow, in
separate systems, with no enrichment at all.
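
To make the enrichment point concrete, here is a minimal sketch that
attaches labels to raw counter samples and then aggregates by any
label. The inventory, device names, field names and values are all
made up for illustration; a real deployment would pull labels from
whatever source of truth it has.

```python
from collections import defaultdict

# Hypothetical inventory mapping each device to its location labels.
INVENTORY = {
    "pe1": {"city": "Helsinki", "country": "FI", "region": "EU"},
    "pe2": {"city": "Helsinki", "country": "FI", "region": "EU"},
    "pe3": {"city": "Paris", "country": "FR", "region": "EU"},
}


def enrich(sample, inventory=INVENTORY):
    """Return a copy of a raw counter sample with inventory labels attached."""
    labels = inventory.get(sample["device"], {})
    return {**sample, **labels}


def sum_by(samples, label, field):
    """Sum one counter field across samples, grouped by one label."""
    totals = defaultdict(int)
    for sample in samples:
        totals[sample.get(label, "unknown")] += sample[field]
    return dict(totals)


# Raw samples as they might arrive from SNMP or telemetry (made-up values).
raw = [
    {"device": "pe1", "linecard": 0, "npu": 1, "drops": 5},
    {"device": "pe2", "linecard": 2, "npu": 0, "drops": 7},
    {"device": "pe3", "linecard": 1, "npu": 0, "drops": 1},
]
enriched = [enrich(s) for s in raw]
print(sum_by(enriched, "city", "drops"))
```

Once every sample carries the same labels, "which city, linecard or
npu is dropping" becomes a one-line group-by rather than a research
project.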

--
++ytti
Re: Automation - The Skinny (Was: Re: ACX5448 & ACX710) [ In reply to ]
> And I think almost no one is collecting data in such a manner
> that it's actually capitalisable, because we can keep running the
> network with how we did in 90s, IF-MIB and netflow, in separate
> systems, with no enrichment at all.

Spot on !

Btw Saku, you keep suggesting measuring the delta of input/output ...
well, to do it well, I am afraid it is not trivial.

So at t0+N I record how many packets entered my system. (We are already
at a loss here, as the RE can generate packets, unless you add RE
outbound packets to this.) Then at t0+N+uS (uS being the delta of
switching via the fabric) you record the number of packets which left
the box.

What is your N and uS ?

Do you subtract BFD packets which enter and leave on the same line card
both ingress and egress ?

Monitoring drops is much easier if we are dealing with platforms which
are honest in recording them.

> You don't need ML/AI to find problems in your network, using algorithm
> 'this counter which increments at rate X stopped incrementing or
> started to increment 100 times slower'

Well the way I read Adam's note was that learning this rate X is what he
(IMHO correctly) calls ML :)

Cheers,
R.





Re: Automation - The Skinny (Was: Re: ACX5448 & ACX710) [ In reply to ]
On Tue, 28 Jan 2020 at 11:19, Robert Raszuk <robert@raszuk.net> wrote:


> So at t0+N I record how many packets entered my system. (We are already at loss here as RE can generate packets unless you add to this RE outbound packets). Then at t0+N+uS (uS) delta of switching via fabric you record number of packets which left the box.
>
> What is your N and uS ?

You're not gonna get it 1:1. You will monitor the delta rate you see
and react when the delta rate increases. Of course you can keep tuning
this by adding more and more drop counters to reduce the known delta
rate, but you always have to accept that you can't explain it
perfectly. But not every small issue is an important issue. Certainly
your fabric losing 30% would have been blatantly obvious even in the
most naive such system.
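
A minimal sketch of that idea, assuming per-interval packet totals are
already summed per box: subtract whatever drop counters you can account
for, track the unexplained remainder as a ratio, and alarm only when it
rises well above its usual level. The function names, the factor of 10
and the floor are arbitrary illustrative choices, not recommendations.

```python
def unexplained_loss_ratio(pkts_in, pkts_out, known_drops=0):
    """Share of input packets that neither left the box nor hit a known drop counter."""
    if pkts_in <= 0:
        return 0.0
    unexplained = pkts_in - pkts_out - known_drops
    # Locally generated packets can make output exceed input; clamp at zero.
    return max(unexplained, 0) / pkts_in


def loss_alarm(ratio, baseline, factor=10.0, floor=1e-6):
    """Alarm when the unexplained ratio rises well above its usual baseline."""
    return ratio > max(baseline * factor, floor)


# Made-up interval totals: 1000 packets in, 700 out, nothing accounted for.
ratio = unexplained_loss_ratio(1000, 700)
print(ratio, loss_alarm(ratio, baseline=0.001))
```

Adding more drop counters to known_drops shrinks the baseline over
time, which is the tuning loop described above; the alarm never needs
the delta to be exactly zero.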

> > You don't need ML/AI to find problems in your network, using algorithm
> > 'this counter which increments at rate X stopped incrementing or
> > started to increment 100 times slower'
>
> Well the way I read Adam's note was that learning this rate X is what he (IMHO correctly) calls ML :)

What I mean is: current_rate = X, alert if now_rate > X*100 or
now_rate < X/100. No ML, just a stupid static comparison of dramatic
rate change. And even this is advanced by today's standards. Even 'a
counter rate went to 0 from non-zero, or went to non-zero from 0'
exposes a lot of real issues, but issues which happen so rarely that
customers are not complaining about them.
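
The static comparison can be sketched in a few lines. This assumes
per-counter rates have already been computed from successive polls;
the function name and the returned labels are made up, and the factor
of 100 is simply the one quoted above.

```python
def rate_anomaly(baseline_rate, current_rate, factor=100.0):
    """Classify a dramatic counter rate change with a plain static comparison."""
    if baseline_rate == 0 and current_rate == 0:
        return None  # counter is idle and stayed idle
    if baseline_rate == 0:
        return "started"  # went to non-zero from 0
    if current_rate == 0:
        return "stopped"  # went to 0 from non-zero
    if current_rate > baseline_rate * factor:
        return "surge"  # now_rate > X*100
    if current_rate < baseline_rate / factor:
        return "collapse"  # now_rate < X/100
    return None


# Made-up rates in packets per second.
for old, new in [(0, 5), (50, 0), (10, 1001), (1000, 9), (10, 20)]:
    print(old, new, rate_anomaly(old, new))
```

There is no learning step here: X is just the previously observed
rate, and everything between X/100 and X*100 is deliberately ignored.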

A particular example: all of us have some IP checksum errors in the
network. When it's on an edge router, edge interface, ingress direction,
you can ignore it as 'someone else's problem'. But we also see it on
other interfaces/directions, where it means we flipped bits somewhere
and calculated a correct FCS over the broken data, i.e. we have broken
memory somewhere. But it probably isn't broken enough to matter; it
probably mangles packets rather rarely.
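
That triage rule is simple enough to write down, provided each counter
carries labels for where it was observed. The label values ("edge",
"core", "ingress", "egress") and the returned verdicts below are
hypothetical; this is a sketch of the reasoning, not of any vendor's
counter layout.

```python
def triage_checksum_errors(router_role, interface_role, direction, errors):
    """Decide whether IP checksum errors point at our own gear or someone else's."""
    if errors == 0:
        return "ok"
    if (router_role, interface_role, direction) == ("edge", "edge", "ingress"):
        # Packet arrived already broken from outside the network.
        return "ignore"
    # Anywhere else the packet was corrupted inside the network, with a
    # valid FCS computed over the broken data: suspect memory.
    return "investigate"


print(triage_checksum_errors("edge", "edge", "ingress", 12))
print(triage_checksum_errors("core", "core", "ingress", 3))
```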



--
++ytti
Re: Automation - The Skinny (Was: Re: ACX5448 & ACX710) [ In reply to ]
On 26/Jan/20 22:46, adamv0025@netconsultings.com wrote:

> You nailed it Mark,
> My opinion is that this new NetDevOps/NetOps initiative is the biggest
> blunder of the networking industry.
> If as a network engineer/architect you have some coding skills well good for
> you,
> But are programming skills a requirement to get into network
> engineering/architecture nowadays - that absolutely should not be the case.
> We need skilled network engineers and architects to know how to build and
> operate complex networks
> We need skilled developers and system architects to know how to build and
> operate complex systems (including network automation systems)
> We need these two groups to be able to talk to each other in a constructive
> manner - check out Model-driven engineering to get you started.

Totally agreed.


> Following these 3 simple premises you can then afford to have an army of
> web-ui clickers - provisioning network services not knowing the first thing
> about what's going on in the background of the network automation system. Or
> not, and you just handover the web-ui/API to your customers and have them
> self-service.

Which is fine for me, as long as there are still some real network
engineers at the company that can make sure the network runs when the
automation system bombs out.


>
> Imagine a case where network engineer builds an automation solution based on
> number of hacks involving ansible, python, ydk, whatever... and this
> solution gets traction and is used by the company.
> Now that poor networking guy has a full-time job supporting the automation
> solution, fixing bugs, developing new functionality and you just lost one
> network engineer. This is a good example of jack of all trades but master of
> none.
> Even if you're a small operation or a start-up hiring a developer and make
> him talk to the network engineer in a virtual team is a much better option.

Agreed - but we, generally, have to start from somewhere. And if you are
going to start small with Ansible, I've found that network engineers
will do that better with limited sysadmin experience than trying to get
the sysadmin to stand up Ansible and get it to talk to routers. Over
time, if it's successful, you can farm out bits about Ansible to the
software team that best suits their skills.

I'm not for the idea that network engineers are obsolete and the only
way to run an IP/MPLS network, over the next decade, is by giving it to
the software heads.


> Don't judge the book by its cover, in other words just give NETCONF and YANG
> a try, seriously.
>
> I'd say that NETCONF's biggest advantage over SNMP/CLI is its transaction
> mechanism particularly atomicity and consistency ("all or nothing" and "all
> at once") from the full ACID, but all these are addressed by all NOS-es
> supporting two stage commit via CLI (As Saku mentioned below), so not a
> biggie.
> Sorry Mark XE is not one of those NOS-es, but you could still get the
> functionality on XE using NETCONF ;)
>
> YANG on the other hand gives one a common modelling language for
> representing services layer configuration and network layer configuration,
> which I find useful.
> But I'm a minority, I guess there aren't many of you using RFC8299 & RFC8466
> as bases for decomposing your L2 and L3 services and building a service
> abstraction layer, on top of network configuration layer, so YMMV.

I believed a lot in NETCONF/YANG in the middle of this past decade, and
actually insisted solutions support it. But as I've said before, over
time, you get to learn how to sniff the smell in the air, and while a
lot of those solutions paint a good picture, for our particular
use-case, it just felt like a whole lot of complexity for the 2 simple
things we wanted to achieve first - customer service
provisioning/de-provisioning + network deployment.

Which is not to say that NETCONF/YANG have no use-case. Down the line,
if what we are doing with Ansible becomes complex enough to require
models based on NETCONF/YANG, I'm happy to consider them. But to get off the
ground, I feel they are too heavy. I don't want to design overly
elaborate data models - we know what it takes to deploy or run a device;
we've been doing it for years. That doesn't change regardless of if it's
being done by humans in a semi- or fully hands-off approach.

The last 10 years have been bogged down by trying to figure how much
automation to do, when to do it and how. In the end, we are still in the
same place, and even more confused. If starting off slow with Ansible is
okay for me for the next decade, I'm good with that.


> I'd say there are different levels of automation,
> At the entry level the aim might be just faster CLI interaction/simpler CLI
> scraping (Ansible and the like) and at the other extreme there's the full
> potential of model based engineering realized with frameworks like ONAP.
> And operators naturally find themselves somewhere between these two poles in
> their automation efforts which is perfectly fine.

And this right here is my issue with where the industry is - we are
trying to define "automation" as a single thing that everyone should
aspire to in a particular way or set of ways.

I counter that if automation, to you, is loading a bunch of scripts into
RANCID and letting it run off and do stuff for you that you find
repetitive, great! I counter that if automation is you taking a Windows
Notepad and dumping it into a database with some templates, great! I
counter that if automation is using a vendor-proprietary solution to
provision and operate the network, great!

Let's not get bogged down by trying to get everyone into the same boat,
as this is where, I feel, we are all failing each other.


> If you give your customers web-UI or API to your self-service portal that in
> turn talks to your automation system then it doesn't really matter whether
> it's your customers or partner operator using the API - these APIs have been
> around for quite some time, the MEF LSO is just an attempt on standardizing
> things.

Which is my point - large operators have had homegrown solutions for
years. They really aren't struggling. We just think they are because we
don't know what they are running, but they aren't.

Now, whether they want to open their systems up to other "standards-based"
solutions is totally up to them. And if they don't, I won't castigate
them for not wanting to conform. Ultimately, if they are happy with the
way they automate, who am I to tell them they are wrong for not running
it in some Docker image over Openstack in AWS?


> That's a pretty lonely network where customer change or new customer request
> won't come in months...
> No seriously, why would anyone spend any effort automating IP/MPLS core
> which as you rightly pointed out is implemented once and then just sits
> there for months without change?
> You'd automate the low hanging fruit i.e. repetitive tasks in large volume-
> like tasks associated with customer service lifecycle (i.e. PEs and CPEs and
> everything in between like aggregation/access networks or interaction with
> providers of access/aggregation networks).
> And then there's the whole another universe of value added services in your
> telco cloud (that's where your customer gets a firewall VM spun up and
> provisioned when he/she clicks on the "I want a protected internet service
> add on") -which needs to tie in with your network service provisioning
> workflows.

I feel like I'm reading a vendor white paper :-).

Seriously though, if that makes you happy, go for it. What I'm saying is
if I find another way that leverages the industry without breaking my
back (or brain), I'll do it to my satisfaction. I won't be chasing the
panacea that is "automation", but rather, what automation means to me.


> I see it almost as two separate things -the "simple" plumbing between PEs
> and then there's the complex stuff happening at the customer facing side of
> PEs.

"Complex", like automation, can mean different things to different
people :-).

Mark.

Re: Automation - The Skinny (Was: Re: ACX5448 & ACX710) [ In reply to ]
On 27/Jan/20 00:18, Robert Raszuk wrote:

>
> Without proper automation in place going way above basic IGP, BGP,
> LDP, BFD etc ... you need a bit of clever automation to detect it and
> either alarm noc or if they are really smart take such router out of
> the SPF network wide. If not you sit and wait till pissed customers
> call - which is already a failure.

So we use Packet Design (now Ciena Blue Planet) for stuff like this.
Works like a treat :-).

We also have a dear NMS that can be told what to look for and notify the
team.

If NOCs are sitting and waiting for customers to get pissed off,
automation is going to make your problems worse, and not better.

Mark.
Re: Automation - The Skinny (Was: Re: ACX5448 & ACX710) [ In reply to ]
On 27/Jan/20 22:30, adamv0025@netconsultings.com wrote:

> Very good point Robert,
> There are indeed two parts to the whole automation story
> (it's obvious that this theme deserves a series of blog posts, but I keep on finding excuses).
>
> The analogy I usually use in presentations is the left brain right brain analogy,
> Where left brain is responsible for logical thinking and right brain is responsible for creative thinking and intuition.
> So a complete automation solution is built similarly:
> Left brain is responsible for routine automated service provisioning
> - and contains models of resources, services, devices, workflows, policies -and you can teach it by loading new/additional models.
> Right brain on the other hand is responsible for "self-driving" the network (yeah I know can't think of better term)
> - and collects data from network and acts on distributed policies, and also performs trending, analytics, correlation, arbitration etc...
> Now left brain and right brain talk to each other obviously,
> Policies are defined in left brain and distributed to right brain to act on them.
> Also right brain can trigger workflows in left brain.
>
> Major paradigm shift for our service designers here will be that they are now going to be responsible not only for putting the individual service building blocks together in term of config (and service lifecycle workflow -tbd), but also in terms of policies - determining the health of the provisioned service (including thresholds, post-checks, ongoing checks etc...)
> But following the MDE (Model driven Engineering) theme it's not just service designers contributing to the policy library, it's Ops teams, Security teams, etc...
> The main advantage I see is that some of the policies that will be created for the soon to be automated service certification testing could then be reused for the particular service provisioning post-test and service lifecycle monitoring and vice versa.
> Then obviously there are policies defining what to do in various DDoS scenarios (and I consider the vendor solutions actually doing analytics, correlation, arbitration all part of the left brain).

Not to sound silly, but you're taking me back to where I was right
around the time Cisco decided to pick up Tail-f :-).


> Then nowadays there's also the possibility to enable tons upon tons of streaming telemetry -where I could see it all landing in a common data lake where some form of deep convolutional neural networks could be used for unsupervised pattern/feature learning, -reason being I'd like the system to tell me look if this counter is high and that one too and this is low then this usually happens. But I'd rather wait to see what the industry offers in this area than developing such solutions internally.

For me, this makes a lot of sense. I'm happy to support standardization
of telemetry streaming (box vendor) and decoding (NMS vendor) because
that enhances the NMS capabilities, which takes away a lot of the
corner-case issues Saku highlighted (well, that's the hope).


> For now I'm glad I have automation projects going, when I asked whether we should have AI in network strategy for 2020 I got awkward silence in response.

I'm not even going to touch the ML/AI pole :-).


> I don't know, my experience is that working in tandem with a devops person (as opposed to trying to figure it myself) gets me the desired results much faster (and in line with whatever their sys-architecture guidelines or coding principles are) while I can focus on WHAT (from the network perspective) not HOW (coding/system perspective). Although yes for some of the POC stuff I wish I had some coding skills.
> But to give you a concrete example from my work, when I had a choice to read some python books or some more microservice architecture books I chose the latter as it was more important for me to know the difference between for instance orchestration and choreography among other aspects of microservice architectures to assess the pros and cons of each in order to make an educated argument for the service workflow engine architecture choice - so it lines up with what I had in mind for service layer workflows flexibility/agility.

Totally agree with you there.

Network engineers should stop feeling the pressure about needing to
become software heads. I can tell you now, given where we are with
getting software to solve our operational problems, network engineers
are going to be in huge demand. Let's just not ruin the pot by teaching
them "GUI is the absolute answer" the moment they leave the university
gates and enter the real world. It's been hard enough disabusing them
of "Class A, Class B, Class C".


> Well we are starting to get a glimpse of it already with VM of a Route-Reflector running on a server - who owns the host (HW & SW) is it sys-admins or ip-ops, which Mark could shed some light on based on his experience running vRRs.

As you know, we started running CSR1000v on HP servers under ESXi back
in 2014.

Ultimately, the overall system is owned by the Network team. However,
day-to-day support is handled by the IT team, i.e., ESXi licenses +
management, iLO access to the server, replacement of faulty server or
server parts, that sort of thing.

The server is treated like a router, i.e., it is not part of the vSphere
cloud, each device is an island, it does not run any other VMs,
CSR1000v VM upgrades are decided by the Network team, ESXi version
upgrades are decided by the Network team, the server is hosted in the
same rack as other core routing/switching devices rather than being in a
server farm rack, etc.


> But I guess my argument stands in this area as well, once you develop a successful OEM HW and NOS match that gets traction within the company you're no longer a network engineer as you've been promoted to full-time vendor of this product (dealing with bugs, new features, the overall support of this inhouse built platform).

Not in our experience.

We just recently replaced all our 2014 HP servers running CSR1000v with
a bunch of new Dell numbers at the end of 2019. 6 years was a good run.
Until then, these things sat there humming. Not much needed to manage
them on a day-to-day basis. Our biggest issue is the servers are a lot
more sensitive to ambient temperature increases than IP/MPLS or
Transport gear.


> So I stay by my MDE mantra -I'd rather stay as a SME for the networking side of things on the project and let devops/sysadmins do what they are best at.

I don't disagree with this at all. At the start, network engineers
should figure out what they are doing, so that when they want to scale
it up, they know exactly what to ask the software heads. Reversing that
process is where the pain starts, and the reason you have so many people
falling over themselves clamoring to get hired for a DevOps skill, or to
get trained in it.

Mark.

Re: Automation - The Skinny (Was: Re: ACX5448 & ACX710) [ In reply to ]
On 28/Jan/20 09:45, Saku Ytti wrote:


> We should learn to crawl before we take rocket to proxima centauri.

Agreed!


>
> You don't need ML/AI to find problems in your network, using algorithm
> 'this counter which increments at rate X stopped incrementing or
> started to increment 100 times slower' and 'this counter which does
> not increment, started to increment', and you'll find a lot of
> problems in your network. But do you care about every problem in your
> network, or only problems that customers care about?

Agreed!


>
> Juniper once in EBC had some really smart academics explaining us
> their ML/AI project which predicts resource needs on a given system.
> They quoted how close they got to real numbers then I asked how does
> it perform against naive system, after explaining by naive system I
> mean system like 'my box has 1M FIB entries so FIB entry uses
> RLDRAM/1M' to extrapolate FIB usage in arbitrary config. They hadn't
> tried this and couldn't tell how well the ML/AI performs against this.
>
> Can you really train today ML/AI to determine what actually matters? I
> don't think you can, because what actually matters is something that
> impacted customer, and you simply cannot put enough learning data in,
> you don't have nearly enough customer trouble tickets to be able to
> correlate them to network data you're collecting and start predicting
> which complex counter combinations are predicting customer ticket
> later.

Agreed!

>
> But are you at least monitoring how many packets are lost inside your
> network? Delta of input/output? That is fairly trivial to cover _all
> reasons for packet loss_, of course latency/jitter are not covered,
> but still, it covers a lot of ground fast. Do you have a single system
> where you collect all data? Have you enriched the data with stuff like
> npu, linecard, city, country, region? Almost no one is doing even very
> basic stuff, so I think ML/AI isn't going to be the low hanging fruit
> any time soon. If you have a single system with lot of labels for
> every counter, you can do a lot with very naive analytics. If you
> don't have the data, you can't do anything with the smartest possible
> system. And I think almost no one is collecting data in such a manner
> that it's actually capitalisable, because we can keep running the
> network with how we did in 90s, IF-MIB and netflow, in separate
> systems, with no enrichment at all.

Agreed!

For us, the combination of Iris (a South African-written NMS), Kentik
and Blue Planet ROA (formerly Packet Design) gives us plenty of insight
into what our network is doing, what it did, and what it may do. We pay
Iris for their NMS, and this gives us quite a bit of flexibility in what
we can monitor and alert on, provided there is a way we can get the data
off the box.

We don't believe spending too much time and effort in building ML/AI
engines will solve a real problem in our specific network, today.
Tomorrow, maybe.

I'd rather spend time upgrading Iris to support telemetry streaming, as
this has immediate and tangible benefits.

Mark.
