In part 1 of the blog series “Putting the Ops in DevOps,” James Moore discussed awareness and anticipation in operational management and the best practices that can improve an application at every stage of its life cycle. In this blog, I would like to cover the maturity of the event management discipline and share some best practices.
If you are an operations professional on a team that’s beginning to embrace DevOps methodologies, you are probably facing some significant challenges.
You know the business is applying pressure to get to continuous integration/delivery, where release cycles can be measured by value delivered in days or even hours rather than weeks or months. Your Dev teams have embraced agile development and are producing (what should be) production-ready code in short iterations. Your combined orgs are working toward implementing an integrated delivery toolchain that can build, test and deploy a new release version at the touch of a button. Your Dev teams are leveraging cloud technologies and architectural patterns that enhance their agility: auto-scaling microservice architectures of unprecedented scale and modularity. They are producing, as a matter of best practice, highly instrumented code that provides a dense trail of data relating to the state of the applications, deployed on runtimes that constantly report their own state. You may be responsible for physical and virtual compute, network and storage infrastructure that itself is monitored with tools that spew a continuous stream of state information.
In other words, you find yourself faced with continuous change (both automatic and human driven) and a barrage of potential noise. If you are a seasoned operations professional, bitter experience may have led you to see change as the enemy—as change tends to zero, so does operational risk. Now, you may have to deal with continuous change as an inevitable force for advancing your company’s business goals. How will you separate the signal of service status from the noise of service state? Doing this successfully will be the difference between knowing quickly that there is a problem with an application or service before your users notice and being told about it by your users. It may also be the difference between getting a good night’s sleep and unnecessarily responding to a page indicating a benign application state.
Enter event and incident management. The discipline of event management has matured, and continues to mature, over decades as a mechanism for determining the status of elements of a managed environment from their state. Best practice in event management consists of:
- Filtering out events that are not likely to be service affecting. A trace message from an app log is not worth your attention, unless you need to view it as a part of a diagnostic procedure. A synthetic user transaction timeout may require immediate investigation.
- Correlating events that are likely to be related, so you get one notification per true incident rather than, say, 20.
- Enriching events with context, so if a single service instance fails in a redundant array of five instances, route the event to the operations console but don’t wake anyone up unless the service is affected.
- Implementing X-in-Y policies. In a large and complex system, a single microservice-to-microservice HTTP timeout may not be a huge problem. But you may want to investigate if you start getting 20 per second.
- Implementing runbooks for common failures. Collaborate with development on the definition of those runbooks. Tie those runbooks to incidents as they occur, so that the first responder has a process they can follow to restore a service or prevent an outage in the first place.
- If possible, automating those runbooks. If the problem is a failed process, then restart it. A disk is full? Free up some space.
- Leveraging analytics and machine learning to gain insights from the reams of data your tool collects. Can you learn from event history to suggest correlations or areas for improvement to ops efficiency?
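The first few practices above—filtering, correlation and X-in-Y policies—can be sketched in a few lines of Python. This is a minimal illustration of the ideas, not any particular product’s logic; the severity levels, event fields and thresholds are assumptions for the example.

```python
import time
from collections import defaultdict, deque

# Hypothetical severity scale; real event managers define their own.
IGNORED_SEVERITIES = {"trace", "debug", "info"}

class EventPipeline:
    """Minimal sketch: filter noise, correlate related events, apply X-in-Y."""

    def __init__(self, x=20, y_seconds=1.0):
        self.x = x              # alert if at least x events arrive...
        self.y = y_seconds      # ...within a sliding window of y seconds
        self.windows = defaultdict(deque)

    def process(self, event, now=None):
        """Return 'page', 'console', or None for a single event dict."""
        now = now if now is not None else time.time()
        # 1. Filter: drop events that are not likely to be service affecting.
        if event["severity"] in IGNORED_SEVERITIES:
            return None
        # 2. Correlate: bucket by a correlation key (service + failure type),
        #    so 20 related events yield one notification, not 20.
        key = (event["service"], event["type"])
        window = self.windows[key]
        window.append(now)
        while window and now - window[0] > self.y:
            window.popleft()
        # 3. X-in-Y: a lone timeout goes to the console; a burst pages someone.
        if len(window) >= self.x:
            window.clear()  # one page per burst, not one per event
            return "page"
        return "console"
```

For example, with `EventPipeline(x=20, y_seconds=1.0)`, a single microservice-to-microservice HTTP timeout is routed to the operations console, but 20 of them within a second trigger a page—matching the X-in-Y policy described above.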
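Runbook automation—restart a failed process, free up disk space—can be sketched as a simple dispatch table that maps incident types to remediation functions, falling back to a human when no automation exists. The incident types, field names and remediation steps here are hypothetical; a real implementation would also log every action and verify the outcome.

```python
import shutil
import subprocess

# Registry mapping incident types to automated runbooks (names illustrative).
RUNBOOKS = {}

def runbook(incident_type):
    """Decorator: register a function as the automated runbook for a type."""
    def register(fn):
        RUNBOOKS[incident_type] = fn
        return fn
    return register

@runbook("process_down")
def restart_process(incident):
    # Restart via the init system; assumes systemd manages the service.
    subprocess.run(["systemctl", "restart", incident["service"]], check=True)

@runbook("disk_full")
def free_disk_space(incident):
    # Remove a hypothetical cache directory to reclaim space.
    shutil.rmtree(incident["cache_dir"], ignore_errors=True)

def handle(incident):
    """Dispatch to an automated runbook, or escalate to a first responder."""
    fn = RUNBOOKS.get(incident["type"])
    if fn is None:
        return "escalate_to_first_responder"
    fn(incident)
    return "auto_remediated"
```

The dispatch-table design keeps each runbook small and independently testable, and makes the escalation path explicit: anything without a registered runbook still reaches a human.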
With rapid delivery of new features and capabilities, the Dev team should already be mindful of the operability of the software they produce. Shift awareness of the event management tools and logic to the left. Have development use the same management tools that Ops use in production, but in pre-production, and then learn from what they see in pre-prod to improve. This should be done in combination with systems failure testing. Perform the test and see how the operations tools respond. Ask yourself, “Would I have been able to fix this if it had happened in production? Would I even have known what to do?”
As is obvious by now, successfully operating an application or service in a modern business requires unprecedented co-operation between development and operations teams. Event management as a discipline and the tools that support that discipline are as critical now as they have ever been, if not more so.
Listen to my recent podcast where I spoke about the common operational challenges many DevOps teams are facing today and how traditional IT operations best practices can be leveraged in a DevOps methodology.
Also, download this white paper, which includes best practices for DevOps transformation and improving event or incident management.
About the Author / Dr. Kristian J. Stewart
Kristian Stewart is Architect – Hybrid Cloud Event Management and Analytics at IBM. He currently leads architecture for IBM’s Netcool Event Management offering, and is part of the team providing as-a-service capabilities to IBM’s clients with Cloud Event Management. He has worked in Systems and Service Management for 18 years. He lives in England with his wife and two daughters, two cats, and five Raspberry Pis.