What do you do if your entire business relies on delivering streaming video content over the Web to millions of customers, and there are no examples to follow or tools to use? If you’re Netflix, you set the example and build the tools.
Netflix had tremendous success with its original business model of shipping movies on DVD to customers in trademark red envelopes. The precision of the shipping logistics, and the efficiency with which Netflix was able to meet user demand and ensure speedy delivery, are a case study of their own. But Netflix also saw the writing on the wall and became one of the first companies to embrace delivering movies via streaming video over the Internet.
Networking protocols are fairly resilient and can get data from Point A to Point B eventually, but streaming content like audio or video is less forgiving. The data needs to be transferred at high speed, and the packets need to arrive in sequence; otherwise the viewer sees buffering or glitchy playback.
Netflix chose Amazon Web Services (AWS) to provide the cloud server infrastructure for delivering its streaming content. Amazon is reliable and resilient, but it is not infallible. When Amazon experiences a massive outage, as it has a few times in the past, it can mean downtime for Netflix customers. Netflix, however, works very hard to guard against being affected by issues like that, and thanks to the Simian Army, it is quite successful.
Simian Army
I spoke recently with Ruslan Meshenberg, director of platform engineering, and Josh Evans, director of operations engineering, both at Netflix, to learn more about how the company has embraced both DevOps and open source to develop tools that help it deliver high-quality streaming video content to customers around the world.
Simian Army is the name Netflix has given its suite of automated tools engineered to stress test the Netflix infrastructure and proactively identify weaknesses so Netflix can resolve them. The Simian Army includes tools like ChaosMonkey, ChaosGorilla, and ChaosKong.
Josh explained that when you’re dealing with a massively distributed system like the Netflix cloud infrastructure, it’s inevitable that you will encounter bugs. Rather than waiting for issues to surface, Netflix uses the Simian Army to intentionally induce failure on its own terms, so that issues can be identified and fixed before they impact customers.
ChaosMonkey wreaks havoc on an individual server or cluster level—randomly shutting down servers to make sure that automated resilience works, and that redundant systems pick up the slack so content delivery is not interrupted. ChaosGorilla works on a larger scale by taking out entire AWS zones, and ChaosKong takes the concept to a national level by randomly shutting down the East Coast or West Coast regions of AWS to ensure traffic is automatically redirected to the remaining region.
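To make the three scales of failure concrete, here is a minimal sketch in Python. It is not Netflix's code: the fleet layout, the server names, and the functions kill_server, kill_zone, kill_region, and can_serve are all hypothetical, and the "fleet" is just an in-memory dictionary rather than real AWS resources. It only illustrates the idea that a failure is injected at one level and the remaining capacity is then checked.

```python
import random
from typing import Dict, List, Optional

Fleet = Dict[str, Dict[str, List[str]]]  # region -> zone -> server ids


def build_fleet() -> Fleet:
    # Hypothetical inventory: two regions, two zones each, two servers per zone.
    return {
        "east": {"east-a": ["e1", "e2"], "east-b": ["e3", "e4"]},
        "west": {"west-a": ["w1", "w2"], "west-b": ["w3", "w4"]},
    }


def live_servers(fleet: Fleet) -> List[str]:
    # Flatten the inventory into a list of servers that are still running.
    return [s for zones in fleet.values() for servers in zones.values() for s in servers]


def kill_server(fleet: Fleet, rng: random.Random) -> Optional[str]:
    # Server-level failure: remove one randomly chosen server.
    candidates = live_servers(fleet)
    if not candidates:
        return None
    victim = rng.choice(candidates)
    for zones in fleet.values():
        for servers in zones.values():
            if victim in servers:
                servers.remove(victim)
    return victim


def kill_zone(fleet: Fleet, rng: random.Random) -> Optional[str]:
    # Zone-level failure: remove one randomly chosen zone and all its servers.
    zones = [(region, zone) for region, zs in fleet.items() for zone in zs]
    if not zones:
        return None
    region, zone = rng.choice(zones)
    del fleet[region][zone]
    return zone


def kill_region(fleet: Fleet, rng: random.Random) -> Optional[str]:
    # Region-level failure: remove one randomly chosen region entirely.
    if not fleet:
        return None
    region = rng.choice(sorted(fleet))
    del fleet[region]
    return region


def can_serve(fleet: Fleet) -> bool:
    # Simplified health check: traffic can be served if any server survives.
    return len(live_servers(fleet)) > 0
```

In this toy model the health check passes after each kind of failure as long as one server remains somewhere. A real system would also have to verify that traffic is actually rerouted and that the surviving capacity can carry the load.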
There are also other tools in the Simian Army. JanitorMonkey, as the name implies, cleans up after developers by automatically removing unused EC2 resources. Then there’s ConformityMonkey, which checks for EC2 instances that do not conform to predefined best-practice rules.
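The two housekeeping roles can be sketched the same way. The example below is an illustration only: the Instance fields and the two rules (an owner tag and membership in a scaling group) are assumptions made up for this sketch, not the rules Netflix actually enforces, and nothing here talks to EC2.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Instance:
    # Hypothetical view of a cloud instance; all field names are illustrative.
    instance_id: str
    in_use: bool
    has_owner_tag: bool
    in_scaling_group: bool


def find_unused(instances: List[Instance]) -> List[str]:
    # Janitor-style pass: list the instances that nothing is using.
    return [inst.instance_id for inst in instances if not inst.in_use]


# Conformity-style rules: each maps a rule name to a check that must pass.
RULES: Dict[str, Callable[[Instance], bool]] = {
    "has_owner_tag": lambda inst: inst.has_owner_tag,
    "in_scaling_group": lambda inst: inst.in_scaling_group,
}


def find_violations(instances: List[Instance]) -> Dict[str, List[str]]:
    # Report, per instance, the names of the rules it fails.
    report: Dict[str, List[str]] = {}
    for inst in instances:
        failed = [name for name, rule in RULES.items() if not rule(inst)]
        if failed:
            report[inst.instance_id] = failed
    return report
```

Keeping the rules in a table separates the policy from the scan, so a new best practice can be added without touching the loop.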
Netflix lets the Simian Army roam free causing random chaos, but only Monday through Friday between the hours of 9am and 3pm when managers and developers are present to address any urgent situations that might be caused inadvertently. In some cases the tools take automated corrective action, and in some they simply generate an alert and escalate the issue to the appropriate group or individual.
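The scheduling rule described above is simple enough to express as a guard that runs before any failure is injected. This is a sketch under assumptions: the function name chaos_allowed is hypothetical, the 3pm boundary is treated as exclusive, and time zones and holidays are ignored.

```python
from datetime import datetime


def chaos_allowed(now: datetime) -> bool:
    # Monday is 0 and Friday is 4 in datetime.weekday().
    is_weekday = now.weekday() < 5
    # Allow the window from 9:00 up to, but not including, 15:00.
    in_hours = 9 <= now.hour < 15
    return is_weekday and in_hours
```

A chaos tool built this way would call the guard first and skip its run whenever it returns False, so that failures only happen while staff are on hand to respond.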
Embracing Open Source
One of the things that makes the Simian Army unique is that Netflix created these tools out of necessity. Netflix was blazing a trail on the cutting edge of delivering content from the cloud, and there simply were no commercial tools available to do the things Netflix needed to do.
In the early stages of scaling cloud services to meet demand, Netflix frequently stumbled and hit a variety of walls. Netflix was a pioneer in delivering content through the cloud at this scale, but the open source community has experience with, and an established understanding of, working in distributed environments.
By marshaling the resources of the open source community, Netflix was able to get more eyes on the code, and enlist the volunteer support of hundreds or thousands of developers to help refine and improve the tools in the Simian Army, and other tools Netflix has developed. Ruslan and Josh also explained that there is an unintended benefit of cooperating with the open source community: in-house developers are more diligent when they know that the code they write will be exposed to the general public.
Netflix doesn’t just take advantage of the open source community; it also makes its tools available to the general public through the Netflix Open Source Software Center on GitHub. There are tools for availability (like the Simian Army), cloud management, persistent systems, infrastructure services, developer productivity, and more, all available for free to anyone who wishes to use them.
Culture of Freedom and Responsibility
Lots of organizations talk a good game. It’s easy to come up with a profound mission statement, or an insightful set of company values. Actually living up to them, however, is a whole different story. Netflix prides itself on building a culture of freedom and responsibility.
The Netflix culture is part of what makes the company so successful, and why Netflix has been able to achieve what it has in the realms of DevOps and open source. Netflix believes in finding highly skilled, seasoned engineers and developers, and allowing them the autonomy to get their jobs done and own responsibility for what they produce.
Other organizations can learn a lot from Netflix. Netflix has paved new paths for DevOps, set the bar for collaborating with the open source community, and established a culture of success that any company could benefit from adopting.