This week at the InterOp conference in New York I had the chance to talk to a lot of IT managers who are transforming their organizations to adopt a DevOps culture. I heard the same key questions repeatedly: Are there best practices for breaking down the wall between Dev and Ops? How do we get to a point where we can deploy code multiple times a day, like Facebook, Amazon and Netflix, without impacting our users' experience? How do we minimize the risk of deployment failures and the blame game that ensues?
With the need to innovate and close the customer feedback loop quickly, Ops is being asked to deploy code more rapidly than ever before. With hundreds of developers updating code, the environment becomes very sensitive to change. The last thing Operations wants is an outage that leads to financial loss, and the last thing Development wants is a call in the middle of the night to figure out which code change caused it.
To keep tight control over this fast-paced environment, we need Application Performance Management (APM) at every step. APM can help break down the wall between Dev and Ops and eliminate the finger-pointing that happens during outages. Here are 5 APM best practices to enable a smooth DevOps working environment:
- Use the same APM tool in Dev and Ops: APM should not be an afterthought, and it should not be a tool that only Operations looks at to monitor apps. APM is essential at every step along the way, from dev/test to staging to production. As part of test deployment, developers should use the same APM tool as Ops to see the impact of a code deployment on the application's performance and catch issues beforehand. With the same tool, everyone is on the same page and there is less confusion about which metrics are being tracked. Developers can't say, "Sorry Ops, we are not seeing the problem."
- Isolate and fix performance problems sooner with Code Level Diagnostics: Today's APM tools come equipped with deep diagnostics that let you drill down to the line of code that may be causing a performance slowdown. When code deployed in the Test or Staging environment results in a response time spike, developers can use deep-dive code diagnostics to find the line of code, SQL query, or method call that may have caused the spike, and fix the problem before it is deployed to production. For example, you can detect that a SQL query is taking much longer than it did last week for a specific mobile device platform.
- Detect changes with Deployment Detection: When a new version of the code is deployed to production, the APM tool should detect the change automatically, without having to be reconfigured.
- Analyze Pre and Post Deployment Performance Trends: Get before and after snapshots of end-to-end transactions from your APM tool and compare performance trends over time. With topological visualizations you can see at a glance whether any particular business transaction slowed down after a deployment. With historical data you can see how the application's performance and resource consumption are evolving over time, which also helps you fine-tune your apps' resources. For example, you can determine whether an AJAX call has been slowing down over the last 2 weeks because a third-party API call has slowed down or because disk consumption is increasing over time.
- Automate actions on failures: Configure your APM tool to take actions on failures. For example, configure an action that automatically increases memory if a memory spike alert is generated post-deployment, buying you time to diagnose the root cause.
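To make the last two practices a little more concrete, here is a minimal Python sketch of a before/after comparison that triggers an automated action. It does not use any real APM product's API: the function names (`percentile`, `compare_deployment`, `on_regression`), the 95th percentile, the 20% regression threshold and the sample response times are all assumptions chosen for illustration. In practice your APM tool would supply the measurements and run the action.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def compare_deployment(before_ms, after_ms, pct=95, max_regression=0.20):
    """Compare response times (in ms) sampled before and after a deployment.

    The deployment is flagged as regressed when the chosen percentile grows
    by more than max_regression (0.20 means 20%). Both defaults are
    illustrative assumptions, not recommended values.
    """
    before = percentile(before_ms, pct)
    after = percentile(after_ms, pct)
    if before == 0:
        raise ValueError("baseline percentile is zero; cannot compute change")
    change = (after - before) / before
    return {
        "before": before,
        "after": after,
        "change": change,
        "regressed": change > max_regression,
    }


def on_regression(report, action):
    """Run the given action only if the report shows a regression."""
    if report["regressed"]:
        action(report)
        return True
    return False


if __name__ == "__main__":
    # Hypothetical response-time samples for one business transaction.
    before = [100, 110, 120, 130, 140]
    after = [100, 150, 160, 170, 200]
    report = compare_deployment(before, after)
    on_regression(report, lambda r: print("Regression detected:", r))
```

The action is passed in as a plain callable, so the same check could page someone, roll back the release, or add capacity while the root cause is diagnosed.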
Using APM lets your developers focus on writing code rather than fighting deployment catastrophes, and it enables a healthy, collaborative working environment between Dev and Ops.