So Long to Configuration Management Systems?

When I was an operations architect at Ning way back in 2009, we were struggling to manage our deployments and configurations of more than 50 different application servers, or what the cool kids today call microservices. Applications were deployed across three distinct environments: development, staging and production. No two of these environments were exactly alike, and differences in their dependencies, together with human errors, frequently broke our deployments.

Ning was nowhere near the first company to encounter such problems. In the mid-1990s, as the public internet emerged, much of what was understood at the time about IT operations was challenged by the need to scale internet-facing systems. For many years preceding the internet boom, the maximum number of users was strictly limited to the number of employees in a company. That all changed with the World Wide Web, where the need to serve thousands and millions of simultaneous users easily exceeded the computing capacity of even the largest and most expensive computers.

As three-tier application architectures were broken down into services and distributed across dozens or hundreds of individual servers, this proliferation of servers proved to be far too much for operations teams to handle. The worst problems lived within application deployment processes, where environments and requirements were ever-changing in successive releases of software. The lifetime of a server was a long one. Once deployed, servers were subjected to repeated changes by operations and developers in what would come to be called “configuration drift.” The combined effects of changes, human errors and unknown states would create disastrous outages that damaged brands and destroyed online revenue. It would take sites days and even weeks to recover from these outages. Some would never recover.

It was in the midst of this chaos that modern configuration management systems were born. CFEngine was the first to emerge in academia, where the need to manage multiple UNIX systems was felt early on. Later, Puppet, Chef, SaltStack and Ansible emerged directly from web-scale systems. Configuration management systems fulfilled several important needs. First, they abstracted the differences between operating systems, providing a high-level language for the system administrator to describe the desired state. Second, they automatically synchronized the desired state across large numbers of servers, reducing costly human errors. Finally, they provided idempotency to ensure that destructive operations were not applied unnecessarily.
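To make the ideas of desired state and idempotency concrete, here is a minimal sketch in Python. It is not how any of the tools above are implemented; the `Host` class, the `converge` function, and the package names and versions are all hypothetical, chosen only for illustration. The point is that a run changes only what differs from the desired state, so a second run changes nothing.

```python
from dataclasses import dataclass, field


@dataclass
class Host:
    """Hypothetical stand-in for a server: a name plus installed packages."""
    name: str
    packages: dict[str, str] = field(default_factory=dict)  # package -> version


def converge(host: Host, desired: dict[str, str]) -> list[str]:
    """Bring a host to the desired state and return the changes that were made."""
    changes = []
    for pkg, version in desired.items():
        # Act only when the current state differs; this is what makes the
        # operation idempotent.
        if host.packages.get(pkg) != version:
            host.packages[pkg] = version
            changes.append(f"{host.name}: set {pkg} to {version}")
    return changes


# Illustrative desired state and fleet (names and versions are made up).
desired = {"nginx": "1.4.1", "openssl": "1.0.1"}
fleet = [Host("web1", {"nginx": "1.2.0"}), Host("web2")]

first_run = [c for h in fleet for c in converge(h, desired)]
second_run = [c for h in fleet for c in converge(h, desired)]

print(len(first_run), len(second_run))  # prints "4 0"
```

The first run makes four changes across the two hosts, and the second run makes none, because every host already matches the desired state.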

There is never a free lunch. The benefits of better abstraction were offset to a degree by the need to build and maintain ever more complicated recipes. Recipes frequently broke as package management features were updated, renamed or otherwise changed after the recipe was first written. Even recipes that worked in one environment were frequently broken by differences between environments or by runtime errors. The configuration management systems themselves began to suffer from scale challenges, and evolved message bus architectures to improve their speed and responsiveness. By overcoming many of these obstacles, configuration management systems have now become a first-line tool for managing web-scale systems.

But back at Ning, modern tools such as Puppet and Chef were still in their infancy and simply could not cope with the scale of our 60,000+ node virtualized environment. Ning had a lot of brilliant software engineers floating around, and they set about creating their own deployment automation system called Galaxy, and later a meta-orchestrator called Cosmos. Galaxy and Cosmos were great tools, but they required constant care and feeding and suffered from severe scale problems. There was just a lot of complexity in dealing with these automation systems while also trying to scale a top-tier social network for a hot startup company.

As we began to analyze and post-mortem our deployment failures, it became clear that a majority of the problems stemmed from host and environment configuration errors. We needed a simpler way to deploy our software. Virtual machine images were a promising way to ensure consistency between environments with the tried and true “gold master image” approach. The problem with this approach was that building full system virtual machines was difficult in automation terms. Once deployed, virtual machines are also subject to configuration drift. Furthermore, while we were eager to adopt virtualization to get better resource isolation, we were wary of the overhead that hypervisors would create. Our benchmarks showed that if we adopted virtual machines, we would need to increase our physical server capacity to accommodate the overhead.

At the time, Linux containers were already available in the kernel, but the distributions had almost no support for managing them. At the insistence of our VP of Operations, I cooked up an alternative solution based on jails, writing a small hack of a program called Warden to build and manage them. (Note: Totally unrelated to the Warden in Cloud Foundry.) Jails gave us bare metal performance, resource isolation and a way to consistently deploy the entire environment for each application server. With Warden, we could build jail images on the fly automatically and deploy them consistently with our existing tools, Galaxy and Cosmos. Hey, it was a Band-Aid for a difficult situation, but my experience with building Warden flipped a switch in my brain. I had seen the way lightweight virtualization could eliminate complexity, and I knew right then this was the way of the future.

Today, with the success of Docker, we are finally experiencing sweeping changes in the processes and tools we use to build, deploy, and manage distributed applications. At the same time, we are about to enter a brief period of container disillusionment as early stage solutions get washed out. When that is over, what comes afterward is something of a golden era for distributed applications. I call it the container application platform, and it’s somewhere between what we think of as platform as a service (PaaS) and Serverless today. In that not-too-distant future, containers will be the default output of the software build process, and where that container gets deployed will be little more than an afterthought. This unicorn magic will come from the ability to introspect containers to determine their environment dependencies, and a fully integrated tool chain that deals with deployment, scaling and dependency resolution automatically. I will give you just one guess which company is in the best position to own that future. “One ring to rule them all.”

What happens to our beautiful configuration management systems? There has been much debate about this among smart people over the last few years. Most of these arguments boiled down to some kind of coexistence model. Indeed, configuration management systems are useful for dealing with a wide range of operational problems outside of the container platform. While it’s no surprise that Chef, Puppet, Salt and Ansible all moved quickly to include Docker support, I think we will see these tools begin to focus more on what Rob Hirschfeld calls achieving “ready state.” That is to say, everything that has to be done before you can lay down your container platform still needs to be automated. Right now, that is still a great deal of work, and it’s a pretty safe place to draw the line on using the right tool for the right job.

About the Author / James Thomason

James Thomason is chief technology officer at HyperGrid from Dell, where as CTO he led the vision, technology strategy and product road map for Dell Cloud Marketplace. He joined Dell through the acquisition of Gale Technologies in 2012, where as CTO he headed the product road map, architecture and engineering of the company’s flagship cloud and converged infrastructure automation software, GaleForce. Prior to joining Gale Technologies, Thomason was the CTO and Founder of Virtiv, a San Francisco cloud automation startup focused on Linux virtualization, acquired by Gale Technologies in 2011. For over 17 years, as a specialist in distributed systems and large-scale infrastructures, Thomason has been an entrepreneur and innovator at a number of notable Silicon Valley startups, including Exodus Communications, Digital Island, Netli, NetVMG, Netscaler, 3Leaf Systems and Ning.