If you’re not collecting metrics, how do you know how your organization is doing? Metrics provide the data to get to the root cause of an outage or uncover a bug that would otherwise go undetected. Metrics can also surface trends that are difficult to discern without visualization. Here are some ideas for making the most of the data around you.
What to measure
What are the right metrics? As Etsy put it, “If it moves, graph it. If it doesn’t, graph it anyway in case it moves later.” Sure, you should collect system metrics like CPU, memory, and disk utilization, but those alone won’t tell you how your product is really doing.
Sell widgets? Measure the rate of widget sales, the time it takes to sell a widget, and the availability of the widget selling system. Serve cat pics? Measure the time to process the images and the size of the processing queue. At the end of the day, these are the types of metrics that indicate how the product is performing.
Your applications should be instrumented to emit counts and timing for key code paths. This is especially important in service-oriented architectures where a slowdown in one service can impact other services. So, record how long it takes for services to interact with other services so you can pinpoint the source of the latency. If you use a database, there’s a wealth of metrics exposed by the database system that can help determine why a query is slow.
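As a rough illustration of emitting counts and timings from a key code path, here is a minimal Python sketch. The `Metrics` class, the metric names, and the `sell_widget` function are all hypothetical stand-ins; in practice you would send these values to whatever metrics system you already run rather than keep them in memory.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class Metrics:
    """Hypothetical in-memory registry, for illustration only."""

    def __init__(self):
        self.counters = defaultdict(int)   # metric name -> running count
        self.timings = defaultdict(list)   # metric name -> durations in seconds

    def incr(self, name, value=1):
        self.counters[name] += value

    @contextmanager
    def timer(self, name):
        # Record elapsed wall-clock time even if the wrapped code raises.
        start = time.perf_counter()
        try:
            yield
        finally:
            self.timings[name].append(time.perf_counter() - start)


metrics = Metrics()


def sell_widget(order_id):
    # Time the call to a downstream service (simulated here with a sleep).
    with metrics.timer("widget.sale.duration_seconds"):
        time.sleep(0.01)
    # Count only sales that completed without an exception.
    metrics.incr("widget.sale.count")


sell_widget(1)
sell_widget(2)
```

Wrapping each service-to-service call in its own timer is what lets you see which hop is contributing the latency.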
You should record every code deploy so you can correlate issues with deploys. If memory usage spikes and throughput drops after a deploy, you know it’s probably a new problem introduced by that deploy. Furthermore, you should collect and examine metrics from your pre-production environments so that problems are caught before they go to production and impact users.
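One way to picture the deploy correlation: keep a timestamped record of each deploy and, when something goes wrong, look up which deploys landed shortly before. The sketch below assumes a simple list of deploy records and a 30-minute window; the function name, record layout, versions, and timestamps are all made up for illustration.

```python
from datetime import datetime, timedelta


def deploys_before(incident_time, deploys, window=timedelta(minutes=30)):
    """Return deploys that happened within `window` before the incident."""
    return [
        d for d in deploys
        if timedelta(0) <= incident_time - d["time"] <= window
    ]


# Made-up deploy log; a real one would come from your deploy tooling.
deploys = [
    {"version": "v41", "time": datetime(2024, 1, 1, 9, 0)},
    {"version": "v42", "time": datetime(2024, 1, 1, 13, 50)},
]

# Moment the memory spike was first observed (also made up).
incident = datetime(2024, 1, 1, 14, 0)

suspects = deploys_before(incident, deploys)
print([d["version"] for d in suspects])
```

Many graphing tools can overlay events like these on a chart, which makes the same correlation visible at a glance.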
High value metrics
Non-application metrics are often overlooked but they can provide important insights into the health of the business. Have a support system or call center? Measure the number of new support cases opened. Use social media? Measure mentions and followers. There’s also a wealth of data generated from sales and marketing activities. Imagine being able to correlate an uptick in sales with increased system reliability or the opposite, where outages and bugs are actually causing customer churn. You can’t make these inferences if you don’t have the data.
Free your metrics
Collecting metrics is great, but the real value comes from sharing the data across the organization. When you share your metrics, you increase visibility and transparency into the entire stack, which in turn allows you to make informed, data-driven decisions. So break down the silos and open up your metrics. Put them up on big screens that everyone can see. Put them up in unusual places like the break room. Then watch how people who wouldn’t normally care about operations suddenly start paying attention.
Stay alert!
You should also alert from your metrics. Response times jump beyond an SLA? Alert on it. Is a queue for a critical part of your workflow high? Alert on it. By reducing the time until detection you can get to fixing the problem faster. After every production outage ask yourself, ‘do I have the correct monitoring in place to detect this problem if it happens again?’ If you don’t, make it a priority to get it in place.
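At its simplest, alerting from metrics is a comparison between the latest value and a limit. The Python sketch below assumes two hypothetical metrics, `response_time_ms` and `queue_depth`, with invented thresholds and sample values; a real setup would pull values from your metrics store and route the alerts to a pager or chat channel.

```python
def check_thresholds(samples, thresholds):
    """Return an alert message for each metric whose value exceeds its limit."""
    alerts = []
    for name, limit in thresholds.items():
        value = samples.get(name)
        # Skip metrics with no recent sample instead of alerting on missing data.
        if value is not None and value > limit:
            alerts.append(f"ALERT {name}: {value} exceeds {limit}")
    return alerts


# Invented limits: an SLA of 500 ms and a queue ceiling of 1000 items.
thresholds = {"response_time_ms": 500, "queue_depth": 1000}

# Invented latest readings.
samples = {"response_time_ms": 730, "queue_depth": 12}

alerts = check_thresholds(samples, thresholds)
for message in alerts:
    print(message)
```

This sketch ignores missing samples; depending on the metric, you may prefer to treat a metric that stops reporting as an alert condition in its own right.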
You’ve got the tools, now go and free those metrics!
About the Author/Eric Heydrick
Eric Heydrick is a DevOps Engineer at Praesidio, a big data Cybersecurity Management (CsM) company that bridges the gap between IT Security and Policy Governance in protecting Financial Institutions from cyberattacks. Praesidio’s integrated policies and best practices help Financial Institutions know definitively that they are safe. Eric has 15+ years of experience wrangling computers and building scalable infrastructures. Eric lives in Seattle, WA where he enjoys only the best microbrews. Twitter: @eheydrick