Enterprise DevOps

DevOps and Networking: Working to Achieve Nirvana

Enterprises don’t own the cloud, just the experience. If there’s an outage or even a hiccup in performance or network service, it’s on the network and application teams (not the cloud provider) to fix it. But problems in the cloud often come with finger-pointing, as network and application teams try to determine liability and responsibility in crisis situations.

There is no question that applications are growing more complex, as is the infrastructure that powers them. According to the 2019 “Stack Overflow Developer Survey,” the amount of time it takes for software engineers to become effective on a team is increasing year over year. This trend could be due, in part, to the amount of tooling and the rise of an operate-what-you-build model (e.g., the dawn of the full lifecycle developer). As our applications become more distributed to cope with scaling and availability demands, skill sets that were traditionally infrastructure-based and development-based have begun to blur.

Networking Shifting Left

Networking-centric capabilities such as service meshes (e.g., Istio and Linkerd) and container network interface (CNI) plugins (e.g., Calico and WeaveNet) are gaining a lot of traction and visibility in the cloud-native application development world. As SDN infrastructure shifts left toward development pipelines, application development teams are learning networking concepts; however, the core networking infrastructure is still maintained by the networking teams.

This paradigm of software-defined infrastructure is creating a learning curve for multiple teams. With the dawn of NetOps, software development lifecycle (SDLC) principles are being introduced to networking teams. Traditionally ticket- or service-request-driven, networking teams can now participate in iterations and are building the next generation of cloud-agnostic networking capabilities. With NetOps, there is more interaction and innovation. But change is not without resistance; the dynamic nature of modern infrastructure and the short-lived nature of pre-emptible/ephemeral resources do not bode well for static networking rules.

Networking Team Pressures Today

With infrastructure headed toward the trifecta of clustering, replication and load balancing, networking teams have to deal with many more dynamic addresses than before. Static ports and addresses that used to be entered into networking tickets are being replaced with wide ranges and service addresses that power the networking stack of our cloud-native and containerized workloads. Adding to the complexity, networking teams are now leveraging public cloud infrastructure. The infrastructure they are responsible for is leaving the data center, and distributed web infrastructure technologies, such as distributed domain name systems (DNS) and content delivery networks (CDNs), are now common for networking teams to be involved with. Like the application development teams, networking teams are facing an increasing amount of new technology.

One of the great advantages of using public cloud infrastructure is the ability to quickly expand or contract available resources. With this ability, we can spin test environments up and down to try out changes, even at the infrastructure level, in a way that was not possible before. This gives us the ability to integrate more often. The application development team, which has been iterating this way for some time, now has common ground with the networking team to try out iterations.

All in a CIDR

To illustrate the needed collaboration between networking and application teams, let’s look at an example:

In one of the biggest incidents of my career, my company was migrating pieces of our platform to a public cloud vendor. Because the organization was early in this transformation, a lot of the minutiae had yet to be fleshed out. We were packaging classless inter-domain routing (CIDR) calculations into our first-time VPC configurations, a task the development team was responsible for at the onset of our cloud migration. It was my job to enter this for my service. Instead of a /8 CIDR, I implemented a /16 CIDR. A seemingly small miscalculation on several configs turned into an incident, blocking a majority of traffic hitting the service. The mistake was quickly corrected, but the incident response framework was already well on its way to being executed.
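To put the scale of that slip in perspective, here is a minimal sketch using Python’s standard ipaddress module. The 10.0.0.0 networks below are illustrative assumptions, not the actual configuration from the incident:

import ipaddress

wide = ipaddress.ip_network("10.0.0.0/8")     # the range the config was supposed to cover (assumed)
narrow = ipaddress.ip_network("10.0.0.0/16")  # the range actually entered (assumed)

print(f"/8 covers {wide.num_addresses:,} addresses")     # 16,777,216
print(f"/16 covers {narrow.num_addresses:,} addresses")  # 65,536

# The narrower block is a strict subset of the wider one, so anything outside
# 10.0.0.0/16 is silently excluded -- the kind of "seemingly small
# miscalculation" that blocks the majority of traffic.
print(narrow.subnet_of(wide))  # True

One wrong digit in the prefix length shrinks the allowed space by a factor of 256, which is why the mistake looked trivial in the config but was anything but trivial in production.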

So, who was responsible? I was the one who created the configuration and pushed it to production, so this was my fault. But in retrospect, the better solution would have been to involve the networking team in these types of changes from the start. Because the application development teams were vanguarding into the public cloud, however, we had cut out the networking team. This was an organizational problem that needed to be addressed. Had we been deploying into a data center, the networking team would have looked over my request and caught my CIDR error.
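Part of that review could even be codified. Below is a hypothetical sketch of a pre-deployment guardrail the networking team could maintain in the pipeline; the validate_cidr function and the MIN_PREFIX/MAX_PREFIX policy values are illustrative assumptions, not a check we actually had in place:

import ipaddress

# Hypothetical policy boundaries agreed upon with the networking team.
MIN_PREFIX = 8    # assumed: nothing broader than a /8
MAX_PREFIX = 24   # assumed: nothing narrower than a /24

def validate_cidr(cidr: str) -> list:
    """Return a list of problems with a requested CIDR block (empty if clean)."""
    try:
        network = ipaddress.ip_network(cidr, strict=True)
    except ValueError as exc:
        return [f"{cidr}: {exc}"]
    if not MIN_PREFIX <= network.prefixlen <= MAX_PREFIX:
        return [
            f"{cidr}: prefix /{network.prefixlen} is outside the approved "
            f"range /{MIN_PREFIX} to /{MAX_PREFIX}"
        ]
    return []

# A clean request passes; a malformed one is flagged before it reaches production.
print(validate_cidr("10.0.0.0/16"))  # []
print(validate_cidr("10.1.2.3/16"))  # flags "has host bits set"

A check like this is no substitute for a human review, but it turns part of the networking team’s institutional knowledge into something the development team hits automatically on every change.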

Silver Lining, Right?

During this CIDR incident, I tried to look at the silver lining: Since we couldn’t access our application, the instances must have had a lot of extra capacity, right? Untimely, yes, but the silver lining had a certain logic to it. We were being billed by a public cloud vendor, and because there was a drop in usage, our bill technically should have been lower. But the incident still impacted customers. The money saved during our incident was far outweighed by external costs (lost revenue and brand damage). A cloud outage and the subsequent dip in digital experience can have a massive impact on revenue, averaging $38,855 per hour of outage. During high-impact or busier times, this number can be much higher.

DevNetOps Nirvana

The adage of bringing everyone to the table early proves valuable. Each team in an organization can have a strategic impact and improve the architecture at multiple parts of the stack. Different teams have different skill sets and can have mutual respect for one another’s vertical expertise. In the end, a transparent and positive relationship between network and application teams helps with network remediation and outage prevention. The closer the teams work together and the more they communicate, the more their investments and efforts contribute to the business bottom line and to quicker remediation over the long term.

Ravi Lachhman

Ravi Lachhman is the Field CTO at Shipa, a cloud native application-as-code platform. Prior to Shipa, Ravi was an Evangelism Leader / Chief Architect at Harness. Ravi has held various sales and engineering roles at AppDynamics, Mesosphere, Red Hat, and IBM helping commercial and federal clients build the next generation of distributed systems. Ravi is obsessed with Korean BBQ and will travel for food.
