On December 6, 2018, approximately 32 million customers of the UK telecom provider O2 woke up to a frustrating reality: The most widely used mobile network in the country was experiencing a day-long outage of its 4G mobile network.
For some, not being able to check the weather forecast, traffic report or football scores was a minor inconvenience. For others, the impact was much more severe, as people throughout the country realized they couldn’t collect email, use their connected apps or perform any other task requiring data usage—no matter how critical to their business or personal lives. The world quickly came to realize that O2 subscribers in the UK were not alone: Other mobile providers around the globe, including Softbank in Japan, suffered similar outages in the same time frame.
What caused this global outage? Was it a software bug? A poorly tested update, shared too soon? Maybe even a cyberattack targeting mobile network infrastructure? In the end, the explanation was much simpler: A digital certificate expiration in a back-end mobile data service from Ericsson caused a cascading systems failure, ultimately resulting in a data outage for these mobile services. The incident highlighted the potential impact of certificate errors in the modern digital enterprise and the value of automation in managing such a critical process.
Human Errors, Absences Lead to Outages
While the O2 and Softbank service outages had serious consequences in their home countries—the U.S. equivalent would be a provider like Verizon losing data service for an entire day—they are hardly the only ones to experience this issue. Earlier this year, during the shutdown of the U.S. Federal Government, a number of important sites went down or lost functionality because no one was there to manually renew expiring web certificates. Even popular agencies such as NASA and critical services such as those provided by the Department of Justice were not spared, and both agencies and the citizens they serve were forced to watch helplessly as their certificates expired and services went down.
Although it is of little comfort to the customers who woke up without mobile data or citizens who were unable to use government services, this scenario could easily have been prevented had automation been utilized to manage the certificates in question. When certificates are managed manually, it isn’t always easy to determine the cause of an outage like the one experienced by O2 and Softbank. An expiring certificate may not be the first answer that springs to mind, and may only be noticed after poring over code, looking at network monitoring and checking other potential culprits. Certificates are not always front of mind for IT teams, which underlines the value in simplifying their management.
The technology sector has made significant strides in this regard, and organizations have tools available to them that can help streamline certificate issuance, management, renewal and revocation. Gone are the days when certificates had to be manually issued one at a time, and organizations kept spreadsheets of certificate expiration dates hoping they remembered to renew them. While we’re not able to say specifically why or how the root-cause Ericsson certificate was allowed to lapse, unexpected certificate expirations can be eliminated through proper use of certificate automation—making such a disaster unlikely to repeat itself if the proper steps are taken.
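The spreadsheet of expiration dates described above can be replaced by a small scripted check. The following is a minimal sketch in Python, not a description of any vendor's tooling: the function names and the 30-day renewal window are assumptions chosen for illustration, and only the standard library is used.

```python
import socket
import ssl
from datetime import datetime, timezone

# Hypothetical policy: flag certificates with 30 days or less remaining.
RENEWAL_WINDOW_DAYS = 30


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Return the days between `now` and a certificate's notAfter time.

    `not_after` uses the format returned by ssl.SSLSocket.getpeercert(),
    for example "Dec  6 00:00:00 2018 GMT".
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).total_seconds() / 86400


def needs_renewal(not_after: str, now: datetime,
                  window_days: int = RENEWAL_WINDOW_DAYS) -> bool:
    """True when the certificate expires within the renewal window."""
    return days_until_expiry(not_after, now) <= window_days


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Read the notAfter field from a live server's certificate.

    Requires network access. If the certificate has already expired,
    the default context raises ssl.SSLCertVerificationError, which is
    itself a signal worth alerting on.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

A scheduled job could call `needs_renewal(fetch_not_after(host), datetime.now(timezone.utc))` for each host in an inventory and raise an alert, or trigger a renewal, whenever it returns true. A check like this only detects an approaching expiration; full automation also handles the reissuance and installation.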
Peace of Mind at Scale
Getting started is easier than you might expect. The rise of automation in managing digital certificates presents organizations with a simple and exciting solution that extends well beyond just outage protection. Perhaps best of all, many organizations operating in a Windows environment may already have access to auto-enrollment and auto-renewal tools using the Microsoft CA, which has been a pioneer in the field of automated certificate services. Chances are, you may already be familiar with using Microsoft CA as an easy way to take care of your internal (private) certificates.
But what about public certificates? True, Microsoft CA doesn’t handle those, so most organizations will need to find a third-party vendor to manage the SSL certificates for their web servers, load balancers, VPN devices and other networking gear. There are vendors capable of managing all certificate requirements under one roof, whether you operate in Windows, macOS, Linux or other environments. The rise of standards like the Automatic Certificate Management Environment (ACME) has made integrating different environments increasingly simple.
Designed by the Internet Security Research Group, ACME has become a popular certificate management protocol. More than 150 million websites and 130 open-source client tools use ACME. In fact, ACME is supported by a number of tools familiar to those who use DevOps to manage deployments, including Kubernetes, Chef, Ansible, SaltStack, Terraform, Istio, Docker and others.
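Part of what makes ACME easy to integrate is that a client needs only one URL from the certificate authority: a directory document that lists the endpoints for every other operation. The Python sketch below shows how a client might read such a directory. The field names follow the ACME specification (RFC 8555), but the example.com URLs, the function name and the choice of which fields to treat as required are assumptions made for illustration.

```python
import json

# Endpoints this sketch treats as required. The names come from the
# ACME directory object in RFC 8555; the optional "newAuthz" and "meta"
# fields are ignored here.
REQUIRED_ENDPOINTS = (
    "newNonce",
    "newAccount",
    "newOrder",
    "revokeCert",
    "keyChange",
)


def parse_acme_directory(raw: str) -> dict:
    """Parse a directory document and return its endpoint URLs."""
    directory = json.loads(raw)
    missing = [name for name in REQUIRED_ENDPOINTS if name not in directory]
    if missing:
        raise ValueError(f"not a usable ACME directory, missing: {missing}")
    return {name: directory[name] for name in REQUIRED_ENDPOINTS}


# Hypothetical directory, shaped like the response a CA would serve.
SAMPLE_DIRECTORY = json.dumps({
    "newNonce": "https://acme.example.com/new-nonce",
    "newAccount": "https://acme.example.com/new-account",
    "newOrder": "https://acme.example.com/new-order",
    "revokeCert": "https://acme.example.com/revoke-cert",
    "keyChange": "https://acme.example.com/key-change",
    "meta": {"website": "https://acme.example.com/"},
})

endpoints = parse_acme_directory(SAMPLE_DIRECTORY)
print(endpoints["newOrder"])
```

Because every conforming CA publishes the same directory structure, pointing a client at a different provider is typically a matter of changing that one URL. In practice, an existing ACME client would handle this step along with account registration, challenges and renewal.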
As enterprises look to leverage DevOps with increasing regularity, effective certificate management can help facilitate container, microservice and application security. Furthermore, orchestration tools such as Kubernetes can handle certificate management through add-ons such as cert-manager, with ACME as the underlying protocol. The private keys may be stored in a Kubernetes Secret or HashiCorp Vault so the certificate management system integrates seamlessly.
Digitally signing your DevOps containers with a private CA helps verify the identity of a given container as it communicates over TLS while also blocking unauthorized applications, and the tools available to ACME users add significant value. Of course, this is just the tip of the iceberg: Using automation to manage these certificates means plugging into a vast ecosystem of resources, reducing deployment time and sharply reducing the opportunity for human error. It really is that simple.
A New Automated Future
Certificate management is hardly the only area where human error can have an outsized impact, but the O2/Softbank outage clearly illustrates the severe effects such errors can have. Service outages, damaged reputations and frustrated consumers lie in the wake of poorly managed certificates, and organizations are increasingly turning to automation to solve the problem. As protocols such as ACME become integrated into an increasing number of commonly used web tools and outside vendors expand their ability to shoulder the certificate management burden, it will only become easier for organizations to incorporate automated elements into their own environments.
— Abul Salek