How to Achieve Cloud Operational Excellence

Cloud operational excellence means delivering the right mix of cloud-based services at the optimal cost and quality to support your organization’s mission and strategy. You might be thinking, “Why go to the trouble? Isn’t ‘almost good enough’ sufficient?” It turns out that it matters—a lot. The difference between the top 10% and the bottom 10% in cost per relevant delivery metric can vary by an order of magnitude or more.

In the mid-1990s, Gartner acquired an IT metrics firm called Real Decisions, which offered benchmarking services so customers could compare their IT efficiency with that of similar organizations. With hundreds of “Global 2000”-sized customers, its database was rich. The Real Decisions team refined its catalog of metrics and enriched its data with repeat studies, gradually developing indices of efficiency.

Within their user population, the cost per unit of productive work was 11 times better in the top 10% than in the bottom 10%. That doesn’t mean 11% better; it means 1,000% better.
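The arithmetic behind that distinction can be sketched in a few lines. The function name and the unit costs below are hypothetical, chosen only to illustrate an 11-to-1 gap; they are not figures from the Real Decisions database.

```python
def percent_better(ratio: float) -> float:
    """Convert an 'N times' ratio into a percentage difference."""
    return (ratio - 1.0) * 100.0


# Hypothetical unit costs, chosen only to produce an 11x gap.
bottom_decile_cost = 11.00  # cost per unit of productive work, bottom 10%
top_decile_cost = 1.00      # cost per unit of productive work, top 10%

ratio = bottom_decile_cost / top_decile_cost
print(f"{ratio:.0f}x gap = {percent_better(ratio):.0f}% difference")
```

A ratio of 1.11, by contrast, is what "11% better" would look like.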

Note that the user base was self-selected. All participants wanted to get objective metrics of performance relevant to their business goals and paid for extensive studies involving questionnaires, financial audits and technical benchmarks. Notably, all participants had cost recovery programs (beyond chargeback) in place. The bottom 10% of this segment is still within the top 10% of the IT industry—and their score is 11 times worse than the best of the best.

Which raises the question: What is the industry average for IT efficiency? Is it possible that the hundreds of benchmarking users are all doing IT wrong, and the search for relevant metrics is misguided? I think not.

Choosing the Right Metrics

In 1911, Frederick W. Taylor published The Principles of Scientific Management, which discussed approaches to optimizing two important variables: output quality and worker compensation. Taylor recognized that successful firms work collaboratively, with management and workers jointly setting goals and developing methods and tools to achieve both profitability and proportionate compensation. Nowhere in this text (or in any of his recorded speeches or documents) does he say, “If you can’t measure it, you can’t manage it.” He never said that for two reasons: First, he did not believe it; second, it is not true. What he did emphasize was that if you do measure it, you will manage it. That was a warning: Pick the right metrics or you will exert effort pursuing a meaningless goal.

Here are some potential cloud performance metrics compiled from several cloud providers and users across various industries, not-for-profits and governmental agencies:

Service Metrics
- Reliability – Mean time between failures (MTBF)
- Availability – Uptime, expressed as a meaningful percentage of demand
- Serviceability – Mean time to repair (MTTR)

IT Metrics
- Capacity
- Latency
- Bandwidth
- Response time

Strategic Metrics
- Business agility
- Customer engagement
- Customer reach
- Financial impact
- Solution performance

The journey to cloud excellence starts with developing the metrics that are most relevant to your business goals. Picking the right metrics with the right scale matters. As a rule of thumb, generally right beats precisely wrong.
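As a rough illustration of how the three service metrics relate, the sketch below derives MTBF, MTTR and availability from an outage log. The Outage record, the service_metrics function and the sample log are hypothetical. The sketch also assumes availability is measured against calendar time, using the common MTBF / (MTBF + MTTR) formula, rather than against demand as suggested above.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    start_hour: float  # hours since the observation window began
    end_hour: float    # hour at which service was restored


def service_metrics(window_hours: float, outages: list[Outage]) -> dict[str, float]:
    """Return MTBF and MTTR in hours, and availability as a fraction."""
    if not outages:
        raise ValueError("need at least one outage to compute MTBF and MTTR")
    downtime = sum(o.end_hour - o.start_hour for o in outages)
    uptime = window_hours - downtime
    mtbf = uptime / len(outages)    # mean operating time between failures
    mttr = downtime / len(outages)  # mean time to restore service
    availability = mtbf / (mtbf + mttr)
    return {"mtbf_hours": mtbf, "mttr_hours": mttr, "availability": availability}


# Hypothetical 30-day window (720 hours) with three outages.
log = [Outage(100.0, 101.0), Outage(350.0, 352.0), Outage(600.0, 603.0)]
metrics = service_metrics(720.0, log)
print(metrics)
```

Weighting each hour by actual demand, instead of treating all hours equally, would bring the availability figure closer to the business-relevant definition in the list.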

Sustaining Excellence

After you agree on a set of metrics that are statistically reliable, repeatable, objective and aligned with your firm’s mission, how do you achieve and sustain excellence?

In the 1970s, the U.S. Department of Defense bought a lot of custom software. Sometimes it worked well; other times it didn’t. So, the DoD funded research on code quality. It turns out that the key difference between great and mediocre code quality flowed from how the organization managed problems. More specifically, it came down to how the team reacted to an unexpected event. The spectrum runs from confusion and dismay through fire drill chaos to calm, rational assessment and remediation. That methodology gets baked into the code itself. This research produced the Capability Maturity Model (CMM), which was created by the Software Engineering Institute at Carnegie Mellon University and later succeeded by Capability Maturity Model Integration (CMMI).

The CMMI framework identifies five levels of process maturity. A simplified assessment of an organization’s process maturity level comes down to two questions. First, is there current comprehensive documentation for the process, including how to deal with defects? If the answer is yes, the organization is at level three or higher. If the answer is no, the second question is, “Does anyone know what’s going on?” If the answer is yes, that’s level two; if no, level one.
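The two-question shortcut above can be written as a small decision function. This is only a sketch of the simplified assessment described here, not an official appraisal method, and the function and parameter names are hypothetical.

```python
def simplified_maturity_level(has_current_docs: bool, someone_knows: bool) -> str:
    """Apply the two-question shortcut for estimating process maturity.

    has_current_docs: current, comprehensive process documentation exists,
                      including how to deal with defects.
    someone_knows:    at least one person knows what is going on.
    """
    if has_current_docs:
        return "level 3 or higher"
    if someone_knows:
        return "level 2"
    return "level 1"


# An organization with no documentation but one resident expert.
print(simplified_maturity_level(has_current_docs=False, someone_knows=True))
```

The second question only matters when the first answer is no, which is why the documentation check comes first.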

A level one organization has no standard method in place to deal with problems. When something goes wrong, everybody grabs tools and tries to figure out what went wrong and how to fix it. Organizations like this do not spend much on training or analysis. Their focus is continuing to produce whatever they are trying to make. Over time, an individual may develop expertise in diagnosing a component, and when things go wrong, the call goes out to “Get Fred in here!” to troubleshoot the problem. Organizations with pockets of expertise are moving into level two. Most organizations fall within one of these two levels.

Organizational transformation is very difficult, and often requires significant changes in the management team, as well as funding different activities. Training and communication skills are crucial to proceeding beyond level two. At that level, management rewards heroes who can shoot the most difficult bugs. They get the big bonus, the promotion, the better office, a parking space near the door. This behavior reinforces the culture of heroes. But moving forward requires that the heroes take on a new role.

Once the organization creates documentation, it is on the path to level three. Note that these transformations are wrenching. It is not easy to tell the heroes that their greatest value to the organization is now how well they can write or teach. But with proper management attention, it can be done. And the benefits of moving forward are many:

- Dramatically fewer crises. Staff can make plans and keep them, with no emergencies interrupting a family gathering, a school event, or a get-together with friends.
- High-quality code. Maintenance tasks become much simpler, and customers experience improved reliability. The documentation is helpful, and troubleshooting becomes routine rather than overwhelming.
- Reliable planning. In a mature organization, plans hold true because they are based on proven metrics, continuously validated processes and uniformly high competence within the team. Project estimates are accurate because the data stems from reliable, repeatable evidence.

Cloud excellence is not a phantom or an unachievable goal. It is the result of clear thinking and sound documentation. Over time, practices improve and skills build. To quote Macklemore, “The greats weren’t great because at birth they could paint, the greats were great because they painted a lot.” With practice and focus, your organization can achieve cloud excellence.