AWS’ Well-Architected Framework provides a wealth of incredibly valuable best practices and guidelines structured around five distinct pillars that more cloud-native organizations would do well to heed. The first of these pillars, Operational Excellence, recommends a number of particularly valuable techniques. However, successfully adopting the design principles and practices within Pillar No. 1 requires a strong foundation: a company culture that is willing to continuously work to improve.
In practice, this can mean holding high-profile “Game Days” focused on advancing internal procedures, nurturing an environment where engineers and product leaders collaborate to promote business value while pursuing rapid iterative release cycles, and championing operational automation and related approaches. Fortunately, adopting the tenets of AWS’ Operational Excellence Pillar can naturally assist in implementing cultural changes where they are needed.
Design Principles for Operational Excellence
To achieve operational excellence within AWS, incorporate these six design principles:
Perform Operations as Code
The cloud makes it possible to apply the same rigorous practices used to reduce risks in software development (such as version control and automated testing) across your full application environment. Cloud infrastructure allows all aspects of operations to be defined, managed and implemented in software code, spurring a revolution in their automation and consistency.
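To make the idea concrete, here is a minimal sketch in Python that builds a small CloudFormation template as plain data, so it can be version controlled, reviewed and unit tested like any other code. The logical ID `AppLogsBucket` and the helpers `build_template` and `render` are hypothetical names chosen for illustration, not anything AWS prescribes.

```python
import json


def build_template(versioning_enabled: bool = True) -> dict:
    """Return a CloudFormation template describing one S3 bucket.

    The logical ID "AppLogsBucket" is a hypothetical example name.
    """
    status = "Enabled" if versioning_enabled else "Suspended"
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": "Example log bucket defined as code",
        "Resources": {
            "AppLogsBucket": {
                "Type": "AWS::S3::Bucket",
                "Properties": {
                    "VersioningConfiguration": {"Status": status}
                },
            }
        },
    }


def render(template: dict) -> str:
    """Serialize deterministically so diffs in version control stay small."""
    return json.dumps(template, indent=2, sort_keys=True)


if __name__ == "__main__":
    print(render(build_template()))
```

Because the template is ordinary data, an automated test can assert on it before anything is deployed, which is the same safety net the paragraph above describes for application code.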
Annotate Documentation
Without effective automation, documentation is tedious to keep in sync with reality, which ultimately makes it challenging to troubleshoot issues when they inevitably arise.
Fortunately, the cloud makes it possible to create documentation automatically as an artifact of the build process. This documentation is detailed and annotated, always up-to-date and can be read by us humans (as well as systems). With automated documentation, engineers can react to issues more rapidly and accurately, and enlist automated systems to do the same.
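One way this could look, as a rough sketch: a build step reads a CloudFormation-style template and emits a plain-text summary that people can read. The function `document_template`, the sample `EXAMPLE` template and the use of a `Comment` key inside each resource's `Metadata` attribute are assumptions made for this illustration.

```python
def document_template(template: dict) -> str:
    """Render a plain-text summary of a CloudFormation-style template."""
    lines = [template.get("Description", "(no description)"), "", "Resources:"]
    for logical_id, resource in sorted(template.get("Resources", {}).items()):
        # Assumption: authors leave a human-written note under Metadata.Comment.
        note = resource.get("Metadata", {}).get("Comment", "no annotation")
        lines.append(f"- {logical_id} ({resource['Type']}): {note}")
    return "\n".join(lines)


# Hypothetical template used only to demonstrate the output format.
EXAMPLE = {
    "Description": "Order service storage",
    "Resources": {
        "OrdersTable": {
            "Type": "AWS::DynamoDB::Table",
            "Metadata": {"Comment": "Primary store for customer orders"},
        },
        "ArchiveBucket": {"Type": "AWS::S3::Bucket"},
    },
}

if __name__ == "__main__":
    print(document_template(EXAMPLE))
```

Since the summary is regenerated on every build, it cannot drift from the template it describes, and resources that lack an annotation are easy to spot.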
Make Frequent, Small and Reversible Changes
While agile, iterative development methods have become commonplace in many areas, product-driven organizations tend to stick to large releases. However, operational excellence calls for a systems design that incorporates tightly focused, failure-resistant components in an environment designed for constant change and improvement. Implementing this flexibility allows for more rapid deployment of new iterations, and for easy rollback if issues pop up.
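A simplified sketch of the pattern: apply one small change to a copy of the current configuration, run a health check, and keep the previous state if the check fails. The function `apply_with_rollback`, the configuration keys and the health check are all hypothetical; a real deployment would rely on the rollback features of its own tooling.

```python
from typing import Callable, Dict


def apply_with_rollback(
    config: Dict[str, str],
    change: Dict[str, str],
    is_healthy: Callable[[Dict[str, str]], bool],
) -> Dict[str, str]:
    """Apply one small change; return the previous config if the check fails."""
    candidate = {**config, **change}  # the original config is never mutated
    if is_healthy(candidate):
        return candidate
    return config  # rollback is trivial because the old state was kept intact


if __name__ == "__main__":
    current = {"image": "app:1.4.0", "replicas": "3"}

    def image_is_good(cfg: Dict[str, str]) -> bool:
        # Stand-in for a real post-deployment health check.
        return not cfg["image"].endswith("-broken")

    print(apply_with_rollback(current, {"image": "app:1.4.1"}, image_is_good))
    print(apply_with_rollback(current, {"image": "app:1.5.0-broken"}, image_is_good))
```

The smaller the change dictionary, the easier it is to tell which change caused a failed check, which is the practical argument for small releases.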
Refine Operations Procedures Frequently
Operational excellence requires holding regular meetings to reflect on, suggest and implement improvements to operations procedures, much like the retrospectives held when practicing any agile methodology. Game Days, where procedures are tested and iterated on, will help build and strengthen excellence as well.
Anticipate Failure
Rather than waiting for failures to strike and then studying the causes in a post-mortem, practice routine “pre-mortem” exercises that identify failure points before any harm is done. Game Days can then be used to investigate these scenarios and ultimately bolster resilience. For example, premier cloud-native organization Netflix demonstrates this design principle with its Chaos Monkey project, which is purpose-built to strengthen application resiliency against random instance failures.
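To illustrate the general idea of a random-failure exercise (this is not how Chaos Monkey itself is implemented), the sketch below picks one instance at random from those that have opted in and reports what a Game Day would terminate. The `chaos_opt_in` flag, the instance IDs and both function names are hypothetical, and the code performs a dry run only.

```python
import random
from typing import List, Optional


def pick_victim(instances: List[dict], rng: random.Random) -> Optional[str]:
    """Return the ID of one opted-in instance, or None if none are eligible."""
    eligible = [i["id"] for i in instances if i.get("chaos_opt_in")]
    return rng.choice(eligible) if eligible else None


def run_game_day(instances: List[dict], rng: random.Random) -> str:
    """Dry run only: report what would be terminated, change nothing."""
    victim = pick_victim(instances, rng)
    if victim is None:
        return "No eligible instances; experiment skipped."
    return f"DRY RUN: would terminate {victim} and observe recovery."


if __name__ == "__main__":
    fleet = [
        {"id": "i-aaa", "chaos_opt_in": True},
        {"id": "i-bbb", "chaos_opt_in": False},
        {"id": "i-ccc", "chaos_opt_in": True},
    ]
    print(run_game_day(fleet, random.Random()))
```

Requiring an explicit opt-in keeps the experiment away from workloads that are not yet ready for it, while the random choice prevents teams from hardening only the instances they expect to fail.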
Learn From All Operational Failures
Operationally excellent cultures have the openness to share post-mortem findings across the organization without pointing fingers. This broadly shared knowledge can then inform improvements in all areas of the business, not just in engineering and product departments but in marketing, finance and others as well.
Critical Components of Operational Excellence
Embracing the six design principles above paves the way for you to practice operational excellence in three key areas defined by the Well-Architected Framework: Preparation, Operation and Evolution.
Preparation
Operational excellence requires operations teams to possess a clear and complete understanding of the goals for each workload within the complete system, and how those goals will be achieved.
Operational Priorities
This clarity must include knowledge of the workloads the operations team is responsible for, the shared business goals and their role in achieving them and any relevant regulatory or compliance requirements. This information allows teams to correctly prioritize those workloads with greater importance or complexity.
AWS offers these helpful resources for setting operational priorities:
- AWS Support. This includes the AWS Knowledge Center, AWS Discussion Forums, and AWS Support Center.
- AWS Documentation, now available on GitHub as an open source project.
- AWS Cloud Compliance.
- AWS Trusted Advisor, which includes core checks for environmental improvements.
Businesses can also enlist AWS Certified Partners—experts selected by AWS to provide consulting, and professional and managed services as an extension of a company’s own operations team—to make sure priorities are set wisely.
Design for Operations
A well-architected workload is designed for operations from the beginning, with consideration for deployment, updates and similar needs. AWS provides the ability to model entire workloads as code, from applications to infrastructure, policy, governance and operations. It also offers a robust set of tools and services for operation design, such as AWS CloudFormation and the AWS Developer Tools. Businesses can also create systems featuring the observability and insights needed to operate workloads by using AWS CloudTrail, Amazon VPC Flow Logs, Amazon CloudWatch and more.
Operational Readiness
Beyond the technology used, operational excellence is also a function of elegant, consistent processes and procedures for deploying and operating workloads. This means maintaining accurate documentation, building a well-trained and resourced team of experts and having governance that makes sure everything is done the right way.
It’s also smart to utilize automation when evaluating environments. CloudWatch events can be automatically addressed by scripting procedures created with AWS Systems Manager’s Run Command and Lambda. Configurations can be automated and benchmarked against best practices using AWS Config rules.
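As a rough sketch of an automated response, the Lambda-style handler below inspects an EC2 instance state-change event of the kind CloudWatch delivers and decides what to do with it. The `WATCHED_STATES` set, the `notify_on_call` action name and the `plan_response` helper are assumptions for illustration; a real function would go on to call an AWS service to carry out the action.

```python
WATCHED_STATES = {"stopped", "terminated"}  # assumption: states worth reacting to


def plan_response(event: dict) -> dict:
    """Decide what to do with an EC2 state-change event."""
    detail = event.get("detail", {})
    instance_id = detail.get("instance-id")
    state = detail.get("state")
    if event.get("source") != "aws.ec2" or instance_id is None:
        return {"action": "ignore", "reason": "not an EC2 instance event"}
    if state in WATCHED_STATES:
        # "notify_on_call" is a hypothetical action name for this sketch.
        return {"action": "notify_on_call", "instance": instance_id, "state": state}
    return {"action": "ignore", "reason": f"state {state} needs no response"}


def handler(event, context):
    """Lambda entry point; the decision logic stays in a testable function."""
    return plan_response(event)


if __name__ == "__main__":
    sample = {
        "source": "aws.ec2",
        "detail-type": "EC2 Instance State-change Notification",
        "detail": {"instance-id": "i-0123456789abcdef0", "state": "stopped"},
    }
    print(handler(sample, None))
```

Keeping the decision in a pure function means the response procedure can be exercised in a test suite or on a Game Day without touching a live environment.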
Operation
The first step to operational success is clearly defining it with key metrics based on shared business goals, so that events can be addressed with those goals in mind and it is clear whether success has been achieved.
Understanding Operational Health
Because no two businesses are the same, the definition of operational health and success will vary. For example, one company might have workloads that demand low-latency and high-throughput performance to be prioritized over cost. Another might require a focus on cost-effectiveness and high availability. Operations teams must first understand those priorities and how they serve business goals to deliver what is actually needed.
AWS offers tools and services for analyzing workloads to serve specific goals. CloudWatch Logs and Dashboards can provide system and business level views of key metrics. Amazon Elasticsearch Service and Kibana can offer further visualization of operational health metrics, as can outside tools and services like Logstash and Grafana.
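Because the definition of health differs from one business to the next, it can help to write the targets down as data. The sketch below, with hypothetical metric names and thresholds, evaluates observed values against a team's own targets; `Target` and `evaluate` are illustrative names, not part of any AWS SDK.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Target:
    metric: str
    limit: float
    higher_is_better: bool = False  # e.g. availability, unlike latency or cost


def evaluate(targets: List[Target], observed: Dict[str, float]) -> Dict[str, str]:
    """Label each target as ok, breach, or missing for the observed values."""
    results = {}
    for t in targets:
        value = observed.get(t.metric)
        if value is None:
            results[t.metric] = "missing"  # no data is itself a health signal
        elif (value >= t.limit) if t.higher_is_better else (value <= t.limit):
            results[t.metric] = "ok"
        else:
            results[t.metric] = "breach"
    return results


if __name__ == "__main__":
    # Hypothetical targets for a latency-sensitive workload.
    targets = [
        Target("p99_latency_ms", 200.0),
        Target("availability_pct", 99.9, higher_is_better=True),
    ]
    print(evaluate(targets, {"p99_latency_ms": 180.0, "availability_pct": 99.95}))
```

A cost-focused team would keep the same function and simply supply a different list of targets, which keeps the definition of success explicit and reviewable.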
Responding to Events
When events occur, operational excellence means having anticipated and prepared for failures, then putting metrics, processes and procedures into action for a quick and effective response. At the same time, teams must have metrics to understand the business impact of workload components so they can resolve the highest priority issues first. AWS enables teams to script automatic event responses, such as automated rollbacks triggered by failures. Amazon CloudWatch offers a powerful service for these automated responses.
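Resolving the highest priority issues first presupposes some agreed ranking of business impact. A minimal sketch, assuming a hypothetical `IMPACT` table of weights per workload component, might order open events like this:

```python
from typing import Dict, List

# Hypothetical business-impact weights per workload component.
IMPACT = {"checkout": 100, "search": 40, "recommendations": 10}


def triage(events: List[dict], impact: Dict[str, int]) -> List[dict]:
    """Order open events so the highest business impact is handled first.

    Components missing from the impact table are treated as lowest priority.
    """
    return sorted(events, key=lambda e: impact.get(e["component"], 0), reverse=True)


if __name__ == "__main__":
    open_events = [
        {"id": 1, "component": "search"},
        {"id": 2, "component": "checkout"},
        {"id": 3, "component": "recommendations"},
    ]
    for event in triage(open_events, IMPACT):
        print(event)
```

The weights themselves are a business decision; the value of encoding them is that responders do not have to debate priorities in the middle of an incident.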
Evolution
Curiosity and a desire to learn and improve are essential traits for operations teams pursuing operational excellence. Every experience should be viewed as an opportunity to learn and to share that new knowledge. The best teams enjoy learning from failure and from the perspectives of other business units.
AWS offers an incredible platform for the analysis and experimentation needed to evolve practices. Amazon CloudWatch and AWS CloudTrail can be combined with Amazon Elasticsearch Service and Kibana. Data exported to Amazon S3 can enable analysis with Amazon Athena and Amazon QuickSight. AWS also helps share best practices with CloudFormation templates, Chef Cookbooks, Ansible Playbooks, Lambda functions and more.