Operations (or ops, as it’s known colloquially) has been necessary since the first shared systems came online in the ’60s. While the job title didn’t necessarily exist, the same questions needed answering then as they do now:
- Is it up and accessible?
- Is it responding in a reasonable time frame?
- Is it performing better or worse now than previously?
In a simple system, say a static website, these questions are answered easily by requesting the page (in an automated fashion, of course).
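As a minimal sketch of such a check, the snippet below answers the first two questions directly; the URL and threshold are invented for illustration, and answering the third question just means keeping a history of these timings to compare against:

```python
# A minimal sketch of an automated "is it up and responsive?" check.
# The URL and threshold are invented for illustration.
import time
import urllib.request

URL = "https://example.com/"   # hypothetical site to watch
SLOW_SECONDS = 2.0             # what counts as "reasonable" is yours to decide

def check(url):
    """Return (is_up, seconds_to_respond)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            is_up = 200 <= resp.status < 400
    except OSError:            # covers URLError, HTTPError, timeouts
        return False, None
    return is_up, time.monotonic() - start

is_up, elapsed = check(URL)
if not is_up:
    print("DOWN: no usable response")
elif elapsed > SLOW_SECONDS:
    print(f"SLOW: responded in {elapsed:.2f}s")
else:
    print(f"OK: responded in {elapsed:.2f}s")
```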
However, when you look at any “late model” technical infrastructure, you’ll find many, many moving parts. These range from border routers to web servers, work queues to databases and data caches to big data clusters. There are thousands upon thousands of computing instances, software processes, network storage devices and network links that can go down, up or just plain wonky. This means potentially millions of places where things can go wrong.
How do you keep an eye on all of these potential issues? How do you know if they’re responsive? How do you know if things are getting better or worse?
Some Monitoring Progress
The “old” method was to write some sort of monitoring script that polled some data that you thought would represent the overall health of that component. Through the years things have gotten more granular. Our monitoring script libraries have grown, we’ve instrumented our applications with counters (StatsD, Librato, etc.) and we’ve come up with big data storage (OpenTSDB) to house all of these data points.
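For a sense of what that application-side instrumentation looks like, here is a minimal sketch that speaks the plain StatsD wire protocol (UDP text datagrams); the metric names and agent address are assumptions for the example:

```python
# A minimal sketch of application counters using the plain StatsD
# wire protocol. The metric names and agent address are assumptions.
import socket

STATSD_ADDR = ("127.0.0.1", 8125)  # assumed local StatsD agent
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def incr(metric, value=1):
    # StatsD counter format: "<name>:<value>|c"
    sock.sendto(f"{metric}:{value}|c".encode(), STATSD_ADDR)

def timing(metric, millis):
    # StatsD timer format: "<name>:<milliseconds>|ms"
    sock.sendto(f"{metric}:{millis}|ms".encode(), STATSD_ADDR)

# e.g., inside a request handler:
incr("web.requests")
timing("web.response_time", 42)
```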
The question remains: do we have any more clarity than before? We inarguably have more data points. So we must know more about the performance of our systems, right?
Turns out, in practice, we just have a lot of data and a lot of people attempting to decipher what that data means. If you’ve worked in operations, you’ve been up at 2 a.m. looking at status dashboards with green, yellow and red icons and many, many graphs, all in an effort to answer the question: What is going on?
Sure, some incidents are easy: An instance was retired, or a fiber line was cut somewhere in Kansas.
I’m talking about the ones that aren’t so clear-cut:
- Site feels slow
- Median response times are up 25 percent from yesterday
- Usage is down week over week compared to some relevant time frame
How do you answer those questions in a haystack of data points, up/down statuses and, in some cases, just plain noise? Application performance management (APM) tools came along and gave us tooling in the app frameworks and the browser. Now we have visibility into which calls are performing poorly and how our app “feels” to the end user.
This has helped by breaking the haystack down into smaller haystacks and by providing a way to search and sort potential performance issues. However, most APM tools depend on instrumenting code written in languages such as Java, .NET or Ruby, and are not able to provide monitoring insights for rapidly growing open source frameworks such as Grails, Akka, Netflix OSS and the like. We’re still left with the haystack problem: What is going on?
When I’m up at 2 a.m., stumbling over to the computer to find out what issue PagerDuty has alerted me to, I don’t really want to pore through page after page of graphs and up/down statuses. What would be really helpful is a system that can aggregate all those monitoring data flows and statuses and actually apply some intelligence. Then it could tell me, “Look at the response time of this component; this metric is reporting something anomalous. This is most likely causing your issue right now.”
In other words, it would be helpful to have a system that can show both the symptoms and the root cause of the issue. That system would also provide high-quality, high-value information about my infrastructure and combine anomaly detection with root cause analysis to deliver real root cause detection.
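To make “something anomalous” concrete, here is a minimal sketch of the simplest form of that intelligence: flag a metric sample that falls far outside its recent behavior. The window size and threshold are arbitrary assumptions, and production systems are far more sophisticated than a rolling z-score, but the idea is the same:

```python
# A minimal sketch of anomaly detection on a metric stream: flag a
# sample that falls far outside its recent behavior. WINDOW and
# THRESHOLD are arbitrary assumptions for the example.
from collections import deque
import statistics

WINDOW = 60        # number of recent samples that define "normal"
THRESHOLD = 3.0    # standard deviations away that counts as anomalous

history = deque(maxlen=WINDOW)

def is_anomalous(sample):
    if len(history) < WINDOW:
        history.append(sample)
        return False                        # still learning a baseline
    mean = statistics.fmean(history)
    stdev = statistics.pstdev(history) or 1e-9  # avoid divide-by-zero
    verdict = abs(sample - mean) > THRESHOLD * stdev
    history.append(sample)
    return verdict
```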
One new startup, OpsClarity, is hoping to make troubleshooting more accurate and efficient by combining data science and anomaly detection with root cause analysis.
Perhaps this is the long overdue beginning of the next age in monitoring.
About the Author: Michael Hobbs
Michael Hobbs is the director of Technical Operations at Expa, a startup studio based in San Francisco. Significant experience in building, operating and automating highly scalable application infrastructures has allowed him to contribute to the success of Infusionsoft and StumbleUpon, among others. Michael continues to put that expertise to work at Expa by advising and assisting the next generation of Internet companies. He is also a significant contributor to many open source projects, including dokku and herokuish.