Just as organisms are susceptible to diseases and viruses, software systems are susceptible to hacks and errors. And within many complex interconnected systems, a minor bug can have a cascading effect across the entire system, causing it to break. As such, increasing the resiliency of a software ecosystem is something engineers continually strive for. But how do we get there?
One concept to consider is anti-fragile. Being anti-fragile refers to the ability of a system to thrive in adversity and adapt to change. Specifically, with regard to software, anti-fragile can equate to responsive coding practices, such as auto-scaling or default backups, to mitigate unforeseen events. Chaos engineering, for example, is one method to expose edge cases and build with adaptability in mind. And with the introduction of intelligent automation, often fueled by machine learning (ML) and artificial intelligence (AI), it’s becoming more and more feasible to build anti-fragile software systems.
I recently met with Asim Razzaq, co-founder and CEO of Yotascale and former head of platform engineering at PayPal, to discover what it takes to build anti-fragile software. According to Razzaq, anti-fragile represents the next evolution of reactive, resilient software systems. And introducing anti-fragile concepts into your application networks can be done with just a few (relatively) simple steps. Below, we’ll define what anti-fragile is and see how this concept can be applied to software development and DevOps.
Understanding Anti-Fragile
As the adage goes, “What doesn’t kill you makes you stronger.” And it’s true—many natural systems tend to thrive under inhospitable conditions. As described by Nassim Nicholas Taleb in the book ‘Antifragile: Things That Gain from Disorder,’ anti-fragile systems thrive in adversity. For example, consider how human muscle grows under stress and load. Immune systems similarly adapt and respond to new viruses and improve the body’s resilience over time. So, how does this relate to software development?
Well, organizations need to develop highly fault-tolerant software systems to meet rising digital consumer expectations. Especially under extreme loads, systems must be elastic to adapt and survive. To meet high user demands, as well as specific service-level objectives (SLOs) and service-level agreements (SLAs), most software companies are already investing in cloud-native tools that improve reliability. For example, a hallmark of site reliability engineering (SRE) is introducing automation to enhance the stability of applications.
One way to gauge anti-fragility in software environments is chaos engineering, says Razzaq. Chaos engineering is the act of throwing a wrench into the machine. You randomly introduce bugs and malformed requests and attempt to create as much havoc as possible to see what happens. This could mean pulling out cables and restarting servers to gauge how software systems respond.
“Software that can respond the best adapts the best,” said Razzaq. For example, if you manually remove a few nodes from the cluster but the system knows it needs seven to function, it could automatically spin up additional nodes. Or if the whole cluster goes down, ideally, the system would automatically restart and scale to the necessary requirements. In essence, anti-fragile software systems can understand and adapt to new environments without much, or any, human input.
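The node example can be sketched as a small reconciliation loop: compare the number of running nodes with the number the system knows it needs, and start whatever is missing. This is a minimal illustration, not how any particular orchestrator works. The `Cluster` class, its methods and the `reconcile` function are all hypothetical names, and the "cluster" is just an in-memory set.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class Cluster:
    """Toy in-memory cluster; a real system would call an orchestrator API."""
    desired_nodes: int
    nodes: set = field(default_factory=set)
    _ids: count = field(default_factory=lambda: count(1), repr=False)

    def start_node(self) -> str:
        node_id = f"node-{next(self._ids)}"
        self.nodes.add(node_id)
        return node_id

    def kill_node(self, node_id: str) -> None:
        # Simulates an operator (or a chaos experiment) removing a node.
        self.nodes.discard(node_id)


def reconcile(cluster: Cluster) -> int:
    """Start nodes until the running count matches the desired count.

    Returns the number of nodes that had to be started.
    """
    missing = max(0, cluster.desired_nodes - len(cluster.nodes))
    for _ in range(missing):
        cluster.start_node()
    return missing


cluster = Cluster(desired_nodes=7)
reconcile(cluster)              # initial bring-up: starts all seven nodes

for node_id in sorted(cluster.nodes)[:3]:
    cluster.kill_node(node_id)  # manually remove a few nodes

restarted = reconcile(cluster)  # the loop notices the gap and fills it
print(restarted, len(cluster.nodes))  # prints: 3 7
```

The same function covers the "whole cluster goes down" case: with an empty node set, `reconcile` starts all seven nodes again. In practice such a loop would run continuously and react to health checks rather than being called by hand.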
Tips to Construct Anti-Fragile Software Systems
So, how can we construct anti-fragile software systems — especially when real-world challenges are often unpredictable? Razzaq shared helpful advice to consider when crafting anti-fragile software architecture.
1. Start With Your Most Critical Systems First
First, introduce chaos engineering to your most critical systems. Define the vital components at the heart of your product that must remain functional. For a payment-processing SaaS, for example, this would be the infrastructure supporting its transactional engine, such as the login and payment microservices.
2. Determine How to Test Fragility
Once you’ve identified this critical infrastructure, figure out how to test for fragility in your environment. Introduce chaos engineering to the component to see how it behaves. Returning to our payment platform example, this may include switching off a payment transaction type to see how the system functions in an extreme scenario.
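A fragility test of this kind could look roughly like the sketch below: switch off one payment type, replay a few requests, and count how many fail in a controlled way versus how many crash. Everything here is an assumption made for illustration, including the `PaymentService` class, the `run_experiment` helper and the payment type names. A real experiment would target a staging or production service rather than an in-memory object.

```python
class PaymentTypeUnavailable(Exception):
    """Raised when a payment type has been switched off."""


class PaymentService:
    """Hypothetical stand-in for the payment microservice."""

    def __init__(self, payment_types):
        self.enabled = set(payment_types)

    def disable(self, payment_type):
        # Fault injection hook: simulates an outage of one payment type.
        self.enabled.discard(payment_type)

    def pay(self, payment_type, amount_cents):
        if payment_type not in self.enabled:
            raise PaymentTypeUnavailable(payment_type)
        return {"status": "paid", "type": payment_type, "amount_cents": amount_cents}


def run_experiment(service, fault, requests):
    """Inject one fault, replay requests, and tally what happened."""
    service.disable(fault)
    outcome = {"paid": 0, "rejected": 0, "crashed": 0}
    for payment_type, amount_cents in requests:
        try:
            service.pay(payment_type, amount_cents)
            outcome["paid"] += 1
        except PaymentTypeUnavailable:
            outcome["rejected"] += 1   # a controlled, expected failure
        except Exception:
            outcome["crashed"] += 1    # an uncontrolled failure: a fragility
    return outcome


service = PaymentService(["card", "bank_transfer"])
requests = [("card", 1200), ("bank_transfer", 800), ("card", 500)]
result = run_experiment(service, fault="card", requests=requests)
print(result)  # prints: {'paid': 1, 'rejected': 2, 'crashed': 0}
```

A non-zero `crashed` count would point to a fragility worth fixing. A high `rejected` count suggests the system fails safely but has no alternate path yet.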
3. Decouple Components
If systems are too closely coupled, errors and outages can ripple across multiple software services. To avoid that, it’s a good practice to prevent tight coupling between components—especially between critical systems and non-critical ones.
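One simple way to picture this separation is to let failures on the critical path surface while containing failures in non-critical steps. The sketch below assumes a hypothetical order flow; `charge`, `send_receipt_email`, `update_recommendations` and `place_order` are invented names, and the email step is hard-coded to fail so the containment is visible.

```python
def charge(order):
    """Critical step (hypothetical): must succeed for the order to go through."""
    return {"order_id": order["id"], "status": "charged"}


def send_receipt_email(order):
    """Non-critical step (hypothetical), simulated here as being down."""
    raise ConnectionError("mail server unreachable")


def update_recommendations(order):
    """Non-critical step (hypothetical) that happens to be healthy."""
    return None


def place_order(order, non_critical_steps):
    result = charge(order)          # errors here are allowed to propagate
    deferred = []
    for step in non_critical_steps:
        try:
            step(order)
        except Exception:
            # Contain the failure and record the step for a later retry.
            deferred.append(step.__name__)
    result["deferred"] = deferred
    return result


result = place_order({"id": 42}, [send_receipt_email, update_recommendations])
print(result)
# prints: {'order_id': 42, 'status': 'charged', 'deferred': ['send_receipt_email']}
```

In a distributed system the same idea is usually realized with asynchronous messaging or separate services rather than a try/except block, but the principle is the same: an outage in the mail service does not stop the payment.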
4. Introduce Alternate Paths
One method to decouple is by introducing alternate paths. For example, if a credit card verification component goes offline, the system should naturally accept other forms of payment and automatically suggest another form, such as a bank transfer. If all systems are down, the system could still accept unverified payments in a queue and let this continue until the threshold reaches a certain risk level.
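The fallback logic described above might be sketched as follows: try each verification method in order, and if all of them are offline, queue the payment unverified as long as the total unverified amount stays under a risk cap. The `Checkout` class, the `VerifierDown` exception and the use of total queued cents as the risk measure are assumptions for illustration. The sketch also switches methods automatically, whereas a real checkout would more likely suggest the alternative to the user.

```python
from collections import deque


class VerifierDown(Exception):
    """Raised by a verifier that is currently offline."""


def offline(payment):
    raise VerifierDown()


def approve(payment):
    return True


class Checkout:
    def __init__(self, verifiers, max_unverified_cents):
        # verifiers: ordered mapping of method name -> callable, preferred first.
        self.verifiers = verifiers
        self.max_unverified_cents = max_unverified_cents
        self.unverified = deque()   # payments awaiting later verification
        self.unverified_cents = 0   # current exposure, used as the risk measure

    def process(self, payment):
        for method, verify in self.verifiers.items():
            try:
                verified = verify(payment)
            except VerifierDown:
                continue            # alternate path: try the next method
            return f"verified:{method}" if verified else f"declined:{method}"
        # Every path is down: accept unverified only while under the risk cap.
        amount = payment["amount_cents"]
        if self.unverified_cents + amount <= self.max_unverified_cents:
            self.unverified.append(payment)
            self.unverified_cents += amount
            return "queued_unverified"
        return "rejected"


checkout = Checkout({"card": offline, "bank_transfer": approve},
                    max_unverified_cents=1000)
first = checkout.process({"amount_cents": 700})   # card down, bank transfer works

checkout.verifiers["bank_transfer"] = offline     # now every verifier is down
second = checkout.process({"amount_cents": 700})  # exposure 700 is within 1000
third = checkout.process({"amount_cents": 700})   # 1400 would exceed the cap
print(first, second, third)
# prints: verified:bank_transfer queued_unverified rejected
```

Queued payments would still need to be verified once a verifier comes back online; that drain step is left out here to keep the sketch short.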
5. Learn From Past Events
The next frontier of designing anti-fragile architecture is developing self-learning, autonomous systems. There is a limit to how well automation can respond to unforeseen issues. But a machine learning model that is fed more and more past incidents could deal with new, untested scenarios more gracefully. That said, AI has the potential to make harmful decisions and even introduce unintentional bias, so organizations should be accountable for the algorithms they oversee, cautioned Razzaq.
Benefits of Building With Anti-Fragile in Mind
The goal is to be able to thrive and function even under duress. By identifying the elements of core infrastructure to keep alive and putting them through the wringer, you can discover frailties before they cause significant consequences. And by decoupling components and introducing alternate paths, software systems can be better equipped for whatever may come. Upleveling system operations with increased automation and AI is likely the future of constructing more fault-tolerant applications.
“Software is everything in this day and age,” said Razzaq. “It increasingly plays a critical role in life-enhancing capabilities.” As such, the end benefits of developing an anti-fragile style could have hugely positive effects. In health care situations, it could make a difference in saving lives. For other less dire scenarios, increased fault tolerance leads to less lost revenue during peak traffic times and more consistent end-user experiences.