Enterprises don’t own the cloud, just the experience. If there’s an outage or even a hiccup in performance or network service, it’s on the network and application teams (not the cloud provider) to fix it. But problems in the cloud often bring finger-pointing, as network and application teams try to pin down liability and responsibility when a crisis hits.
There is no question that applications are growing more complex, as is the infrastructure that powers them. According to the 2019 “Stack Overflow Developer Survey,” the amount of time it takes for software engineers to become effective on a team is increasing year over year. This trend could be due, in part, to the amount of tooling and the rise of an operate-what-you-build model (e.g., the dawn of the full lifecycle developer). As our applications become more distributed to cope with scaling and availability demands, skill sets that were traditionally split between infrastructure and development have begun to blur.
Networking Shifting Left
Networking-centric capabilities such as service meshes (e.g., Istio and Linkerd) and container networking interfaces (e.g., Calico and WeaveNet) are gaining a lot of traction and visibility in the cloud-native application development world. As software-defined networking (SDN) infrastructure shifts left toward development pipelines, application development teams are learning networking concepts. However, the core networking infrastructure is still maintained by the networking teams.
This paradigm of software-defined infrastructure is creating a learning curve for multiple teams. With the dawn of NetOps, software development lifecycle (SDLC) principles are being introduced to networking teams. Traditionally ticket- or service-driven, networking teams can now participate in iterations and are building the next generation of cloud-agnostic networking capabilities. With NetOps, there is more interaction and innovation. But change is not without resistance; the dynamic nature of modern infrastructure and the short lifespan of preemptible, ephemeral resources do not bode well for static networking rules.
Networking Team Pressures Today
With infrastructure headed toward the trifecta of clustering, replication and load balancing, networking teams have to deal with far more dynamic addresses than before. Static ports and addresses that used to be entered into networking tickets are being replaced with wide ranges and service addresses powering the networking stack of our cloud-native and containerized workloads. Adding to the complexity, networking teams are now leveraging public cloud infrastructure. The infrastructure they are responsible for is leaving the data center, and distributed web technologies such as the domain name system (DNS) and content delivery networks (CDNs) are increasingly part of their remit. Like the application development teams, networking teams are facing a growing amount of new technology.
One of the great advantages of using public cloud infrastructure is the ability to quickly expand or contract available resources. With this ability, we can spin test environments up and down to try out changes, even at the infrastructure level, in a way that was not previously possible. This opens the door to more integration. The application development team, which has been iterating this way for some time, now has common ground with the networking team to try out iterations together.
All in a CIDR
To illustrate the needed collaboration between networking and application teams, let’s look at an example:
In one of the biggest incidents of my career, my company was migrating pieces of our platform to a public cloud vendor. Because the organization was early in this transformation, a lot of the minutiae had yet to be fleshed out. We were packaging Classless Inter-Domain Routing (CIDR) calculations into our first-time VPC configurations, a task the development team was responsible for at the onset of our cloud migration. It was my job to enter this for my service. Instead of a /8 CIDR, I implemented a /16 CIDR. A seemingly small miscalculation across several configs turned into an incident, blocking the majority of traffic hitting the service. The mistake was quickly corrected, but the incident response framework was already well on its way to being executed.
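To see why that slip mattered, here is a minimal sketch using Python’s standard ipaddress module. The 10.x ranges are hypothetical stand-ins, not the actual values from the incident; the point is simply that a /16 covers a tiny fraction of the addresses a /8 does, so traffic the service expected to allow falls outside the configured block.

```python
# Minimal sketch using Python's standard ipaddress module to show why
# narrowing a CIDR block can silently exclude traffic. The 10.x ranges
# below are hypothetical; they are not the actual configs from the incident.
import ipaddress

wide = ipaddress.ip_network("10.0.0.0/8")     # intended range: ~16.7 million addresses
narrow = ipaddress.ip_network("10.0.0.0/16")  # what was configured: 65,536 addresses

print(wide.num_addresses)    # 16777216
print(narrow.num_addresses)  # 65536

# A client address the service expected to allow...
client = ipaddress.ip_address("10.42.7.13")
print(client in wide)    # True  -- covered by the intended /8
print(client in narrow)  # False -- silently excluded by the /16
```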
So, who was responsible? I was the one who created the configuration and pushed it to production, so this was my fault. But in retrospect, the best solution would have been to involve the networking team immediately in these types of changes. Because the application development teams were at the vanguard of the move into the public cloud, however, we cut the networking team out. This was an organizational problem that needed to be addressed. If we had been deploying into a data center, the networking team would have had to review my request, and they would have caught my CIDR error.
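One way to make that review routine is to encode it as a guardrail that runs before a VPC change merges. The sketch below is hypothetical: it assumes a list of required ranges maintained by the networking team and simply flags any range the proposed CIDR fails to cover.

```python
# Hypothetical pre-merge guardrail the networking team could codify so a
# too-narrow CIDR is caught in review rather than in production. The
# required ranges here are illustrative placeholders, not real configs.
import ipaddress

def validate_vpc_cidr(proposed_cidr: str, required_ranges: list[str]) -> list[str]:
    """Return the required ranges that the proposed VPC CIDR fails to cover."""
    vpc = ipaddress.ip_network(proposed_cidr)
    return [r for r in required_ranges if not ipaddress.ip_network(r).subnet_of(vpc)]

# Example: the service expects traffic from these (placeholder) ranges.
required = ["10.0.0.0/16", "10.40.0.0/14"]

print(validate_vpc_cidr("10.0.0.0/8", required))   # [] -- everything covered
print(validate_vpc_cidr("10.0.0.0/16", required))  # ['10.40.0.0/14'] -- gap caught
```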
Silver Lining, Right?
During this CIDR incident, I tried to look at the silver lining: since traffic couldn’t reach our application, the instances must have had plenty of spare capacity, right? An untimely joke, but the silver-lining logic did hold. We were being billed by a public cloud vendor, and because usage dropped, our bill technically should have been lower. But the outage still impacted customers, and the money saved during the incident was far outweighed by the external costs in revenue and brand damage. A cloud outage and the subsequent dip in digital experience can have a massive impact on revenue, averaging $38,855 per hour of outage, and during high-impact or busier times that number can be much higher.
DevNetOps Nirvana
The adage of bringing everyone to the table early proves valuable. Each team in an organization can have a strategic impact and improve the architecture across multiple parts of the stack. Different teams bring different skill sets and can hold mutual respect for one another’s depth. In the end, a transparent and positive relationship between network and application teams will help with network remediation and outage prevention. And the closer the teams work together and the more they communicate, the more their investments and efforts contribute to the business’s bottom line and to quicker remediation over the long term.