Gremlin today added a risk detection capability to its chaos engineering service that automatically identifies issues that could cause outages and recommends how to resolve them.
Enabled by agent software deployed in an IT environment, the Detected Risk tool makes it possible to benefit from chaos engineering techniques without needing to construct experiments or run reliability tests.
Gremlin CTO Kolton Andrus said that approach eliminates much of the friction that previously limited adoption of chaos engineering best practices to improve the resiliency of IT environments.
Chaos engineering is rooted in a resiliency testing philosophy that posits applications should be able to keep functioning regardless of any service or infrastructure failure. Most IT organizations, however, lack the expertise to build the tools needed to test resiliency based on "Chaos Monkey" principles, pioneered by a Netflix project of that name, which, for example, randomly turn services off to see whether an application fails.
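The core mechanic is straightforward to sketch: pick a running service instance at random, take it away and check whether the application as a whole keeps working. The snippet below is a minimal illustration in Python, not Gremlin's or Netflix's implementation; the instance names, the termination call and the health probe are stand-ins for a real service inventory, cloud API and monitoring check.

```python
import random
import time

# Hypothetical list of service instances; in practice this would come from
# a service registry, cloud API or Kubernetes cluster inventory.
SERVICE_INSTANCES = [
    "checkout-7f9c",
    "payments-2b41",
    "inventory-9d03",
    "recommendations-5a7e",
]

def terminate_instance(instance: str) -> None:
    """Stand-in for a real termination call (stopping a VM, killing a pod,
    or blocking its network traffic)."""
    print(f"[chaos] terminating {instance}")

def check_application_health() -> bool:
    """Stand-in for a real health probe against the user-facing endpoint."""
    return random.random() > 0.2  # pretend the app survives 80% of the time

def chaos_monkey_round() -> None:
    victim = random.choice(SERVICE_INSTANCES)
    terminate_instance(victim)
    time.sleep(1)  # give the system a moment to reroute or recover
    if check_application_health():
        print(f"[chaos] application survived the loss of {victim}")
    else:
        print(f"[chaos] outage after losing {victim}: likely a hidden dependency")

if __name__ == "__main__":
    chaos_monkey_round()
```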
Each DevOps team will need to evaluate Gremlin's recommendations before implementing them, but the overall goal is to give IT teams a better understanding of the level of risk attached to, for example, a misconfigured server. To quantify that risk, the tool applies a scoring mechanism developed by Gremlin, noted Andrus.
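Gremlin has not detailed how its scores are computed, but the general idea of severity-weighted risk scoring can be sketched as follows; the risk categories, weights and example service below are purely hypothetical and are not Gremlin's actual scoring mechanism.

```python
# Hypothetical illustration of severity-weighted risk scoring; the categories
# and weights are invented for this example.
SEVERITY_WEIGHTS = {
    "no_memory_limits": 3,        # a runaway process can starve its neighbors
    "single_replica": 5,          # one instance is a single point of failure
    "missing_liveness_probe": 2,  # failures can go undetected
    "no_multi_az": 4,             # a single zone outage takes the service down
}

def score_service(detected_risks: list[str]) -> int:
    """Sum the weights of the risks detected for one service."""
    return sum(SEVERITY_WEIGHTS.get(risk, 1) for risk in detected_risks)

# Example: a service with two detected misconfigurations.
print(score_service(["single_replica", "missing_liveness_probe"]))  # -> 7
```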
The challenge is that as IT environments become more complex, the number of dependencies between applications and infrastructure continues to grow. It's often impossible for DevOps teams to manually keep track of all the hidden dependencies that might exist when IT environments are continuously being updated.
In theory, of course, adopting cloud-native applications based on microservices should make IT environments more resilient as calls to application programming interfaces (APIs) are rerouted in the event of an outage. Nevertheless, there are almost always dependencies that get overlooked as these applications evolve, which creates a potential single point of failure.
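The rerouting idea can be sketched as a simple failover, assuming a primary endpoint and a fallback replica (both URLs below are illustrative): calls fail over to the replica when the primary is unreachable, but if both endpoints share an overlooked dependency, such as the same database or auth service, the failover provides no protection, which is exactly the kind of single point of failure chaos experiments are meant to surface.

```python
import urllib.request
import urllib.error

# Hypothetical endpoints; the primary and fallback URLs are illustrative only.
PRIMARY = "https://api-primary.example.com/v1/inventory"
FALLBACK = "https://api-fallback.example.com/v1/inventory"

def fetch_inventory(timeout: float = 2.0) -> bytes:
    """Try the primary API first, then reroute the call to the fallback replica."""
    for url in (PRIMARY, FALLBACK):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.read()
        except (urllib.error.URLError, TimeoutError):
            continue  # endpoint unavailable; try the next one
    raise RuntimeError("all endpoints failed: no healthy replica to reroute to")
```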
Naturally, not every IT team is always comfortable with the concept of deliberately breaking things to better understand what issues might cause an outage. As organizations become more dependent on software to drive critical processes, however, a more proactive approach to ensure resiliency is required, added Andrus.
In fact, business leaders are becoming less tolerant of outages that result in lost revenue, so DevOps teams are being held to a higher level of accountability. DevOps teams may appreciate the value of a blameless culture that encourages everyone to learn from their mistakes, but IT leaders still get fired when there are too many outages, noted Andrus.
Less clear is the degree to which the rise of artificial intelligence might affect chaos engineering. At the very least, AI models should make it easier to assess levels of risk, but the cost of using those technologies to achieve that goal is substantial, especially when platforms that provide the same outcome already exist.
Regardless of approach, however, when it comes to resiliency, everyone agrees that the value of application uptime is now beyond measure.