One of the three “ways”—or principles that DevOps guru Gene Kim articulates as defining DevOps—is amplifying feedback loops. One of the key ways that Ops contributes to this amplification is by implementing rich monitoring instrumentation of operational conditions. The move toward more automated configuration management and the rise of cloud-friendly application performance management suites have gone a long way to providing strong feedback loops. However, network performance monitoring (NPM) is sorely in need of cloudification to become relevant to DevOps.
In one of my previous articles on DevOps.com, I wrote about the impact of DevOps on IT operations management, and noted that infrastructure management tools have largely lagged behind APM in terms of becoming cloudified and API-friendly. NPM is a case in point. Just to be clear on terms, by network performance monitoring, I’m talking about monitoring that collects metrics such as latency, TCP retransmits and out-of-order and fragmented packets. These four metrics can tell a network engineer a lot about whether the network is the root cause or a major contributing cause to application performance issues. Other metrics and data points that are highly relevant to NPM are interface utilization percentage, traffic flow statistics and, for any apps or services that rely on internet communications, BGP paths. These latter metrics and data sets help engineers locate underlying causes in the network, such as congestion on an interface, large spikes in flows, internet path changes or correlation of performance issues with traffic traversing a particular ISP’s network. Sounds great, right?
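To make a metric like TCP retransmits concrete: on a Linux host, the kernel already exposes the raw counters in /proc/net/snmp, and a few lines of code turn them into the retransmit ratio a network engineer would look at. This is a minimal sketch assuming the standard Linux counter format; the helper names (`parse_tcp_counters`, `retransmit_ratio`) are illustrative, not taken from any particular NPM product.

```python
def parse_tcp_counters(snmp_text):
    # /proc/net/snmp holds a "Tcp:" header line followed by a matching
    # "Tcp:" value line; zip them into a dict of named counters.
    tcp_lines = [l for l in snmp_text.splitlines() if l.startswith("Tcp:")]
    names = tcp_lines[0].split()[1:]
    values = [int(v) for v in tcp_lines[1].split()[1:]]
    return dict(zip(names, values))

def retransmit_ratio(before, after):
    # Ratio of retransmitted to sent segments between two samples;
    # a sustained ratio of more than a percent or so is a strong hint
    # that the network, not the app, is the problem.
    sent = after["OutSegs"] - before["OutSegs"]
    retrans = after["RetransSegs"] - before["RetransSegs"]
    return retrans / sent if sent else 0.0

# On a live Linux host, you would sample twice, e.g.:
#   before = parse_tcp_counters(open("/proc/net/snmp").read())
#   ... wait a polling interval ...
#   after = parse_tcp_counters(open("/proc/net/snmp").read())
#   print(retransmit_ratio(before, after))
```

The point isn’t that everyone should hand-roll collectors; it’s that the raw material for these feedback loops is cheap to get at, which makes the appliance-only collection model look even more dated.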
The problem is, the way these metrics have been collected is all wrong for DevOps practice. For the last two decades, NPM tools have fallen into two main buckets: free, manual tools such as tcpdump, MRTG and Wireshark; and monolithic, siloed commercial tools. Nobody will ever dispute the value of free tools, but at a certain point, chasing problems around with manual tools just doesn’t scale, so sooner or later most NetOps organizations have to invest in commercial tools.
Commercial NPM tools were primarily architected in a pre-cloud, pre-DevOps era, when data centers were all inside the WAN, applications were centralized and monolithic, and all users were somewhere else on the WAN in a campus or branch office. Metrics collection was performed by appliances connected to the SPAN ports of major routers and switches at choke points: the junctures of LAN and WAN connectivity.
That era is disappearing rapidly. Most new applications are being built on a distributed basis, leveraging hybrid cloud. With distributed application components, there is a much more complex mesh of communications at play, much of which will cross networks and the internet. Further, in most cloud environments you simply can’t deploy an appliance, not even a virtual one, because in many cases there are no SPAN ports to connect to.
The problem with traditional NPM goes beyond a lack of fit with modern application architectures. NPM appliances are designed as vertically integrated systems with a GUI as the only real way to get data out of them. As a result, their APIs are usually at best second-class citizens. This is problematic for DevOps because amplifying the feedback loop means being able to serve meaningful real-world data up across different teams.
The cloud and DevOps gap in NPM is pronounced enough that major analyst firms have taken notice. In May of this year, Gartner analyst Sanjit Ganguli published a research note titled, “Network Performance Monitoring Tools Leave Gaps in Cloud Monitoring.” It’s a fairly biting critique of the NPM space that says, essentially, that the vast majority of current NPM approaches were built for a pre-cloud era and are unable to cope with the new complexities brought by decentralization and full-stack virtualization. As a result, network managers are left in the lurch when trying to adapt to the realities of digital operations.
What is needed to cloudify NPM and make it DevOps-friendly? First of all, with distributed, hybrid cloud applications, there need to be options to embed NPM metric collection agents in cloud server deployments, either directly on app servers or load balancers such as NGINX or HAProxy. With container management and configuration automation tools, deploying lightweight NPM agents to more distributed points is very doable. By collecting NPM metrics on or very close to application servers, you get the most accurate read on “is it the network?” when app performance issues arise.
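To give a feel for how lightweight such an agent can be, here is a rough sketch of a probe that times TCP connection setup to a service endpoint and emits a flat JSON metric record that any log pipeline or time-series store can ingest. The metric naming and record shape are assumptions for illustration, not any vendor’s actual agent format.

```python
import json
import socket
import time

def tcp_connect_latency_ms(host, port, timeout=2.0):
    # Time a TCP three-way handshake to the target. Returning None on
    # failure is itself a useful signal (the endpoint is unreachable).
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return (time.monotonic() - start) * 1000.0
    except OSError:
        return None

def metric(name, value, **tags):
    # Emit one self-describing JSON record per measurement so other
    # teams can consume it without a vendor GUI in the way.
    return json.dumps({"metric": name, "value": value,
                       "ts": int(time.time()), **tags})

# Deployed next to an app server, the agent loop would be roughly:
#   latency = tcp_connect_latency_ms("app-backend.internal", 443)
#   print(metric("tcp.connect.latency_ms", latency,
#                target="app-backend.internal:443"))
# ("app-backend.internal" is a placeholder hostname.)
```

Because the probe is just a small process, it deploys anywhere your configuration management or container tooling can reach, which is exactly what appliances can’t do.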
Another key aspect of cloudifying NPM is where the analytics happens. Appliances have severe compute and storage constraints. In the era of big data and cloud-scale compute economics, relying on appliances for NPM analytics is beyond outmoded. There are a plethora of open-source big data platforms as well as a growing number of emerging commercial SaaS options for NPM analytics. Today, big data network analytics platforms allow engineers to perform deep ad-hoc analyses on billion-row datasets and get answers in seconds.
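The kind of ad-hoc question those platforms answer is essentially a group-by over flow records: which source network is responsible for the traffic spike? Here is that rollup in miniature, using plain Python on NetFlow-style record fields; a big data backend runs the same logic over billions of rows, but the shape of the analysis is the same. The field names and dataset are made up for illustration.

```python
from collections import Counter

def top_talkers(flows, key="src_asn", n=3):
    # Sum bytes per key and return the n heaviest contributors --
    # the classic "top talkers" question, answered ad hoc.
    totals = Counter()
    for flow in flows:
        totals[flow[key]] += flow["bytes"]
    return totals.most_common(n)

# Example: three flow records from two source ASNs.
flows = [
    {"src_asn": 65001, "dst_asn": 64512, "bytes": 100},
    {"src_asn": 65002, "dst_asn": 64512, "bytes": 300},
    {"src_asn": 65001, "dst_asn": 64500, "bytes": 50},
]
```

Swap `key` for an interface, ISP or BGP path attribute and you get the correlations described above; the hard part at scale is the storage and compute behind it, which is precisely what cloud backends provide and appliances don’t.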
DevOps’ mandate for amplifying the feedback loop, plus the availability of copious cloud compute capacity, means there is no excuse not to aggressively measure everything, including network performance. Networking teams can and must go beyond config automation to change the way, and the scale at which, they gather metrics and share them with server and app dev teams. Cloud-friendly NPM is a needed step to bring about Net DevOps as a lived reality in IT.
About the Author / Alex Henthorn-Iwane
Alex Henthorn-Iwane is the vice president of Marketing at Kentik. He has more than 20 years of experience bringing new technologies in networking, security and software to global markets. Henthorn-Iwane leads the global marketing strategy for Kentik and helps the company bring its story and solutions to network-dependent organizations around the world. Connect with him on LinkedIn, Twitter or SlideShare.