This guest post is based on a presentation given by @mattkemp, @chicagobuss, and @codyaray at CloudConnect Chicago 2012
As a fast-growing tech company in a highly dynamic industry, BrightTag has made a concerted effort to stay true to our development philosophy. This includes fully embracing open source tools, designing for scale from the outset and maintaining an obsessive focus on performance and code quality (read our full Code to Code By for more on this topic).
Our recent CloudConnect presentation, Automating Cloud Applications Using Open Source, highlights much of what we learned in building BrightTag ONE, an integration platform that makes data collection and distribution easier. Understanding many of you are also building large, distributed systems, we wanted to share some of what we’ve learned so you, too, can more easily automate your life in the cloud.
BrightTag utilizes cloud providers to meet the elastic demands of our clients. We also make use of many off-the-shelf open source components in our system including Cassandra, HAProxy and Redis. However, while each component or tool is designed to solve a specific pain point, gaps exist when it comes to a holistic approach to managing the cloud-based software lifecycle. The six major categories below explain how we addressed common challenges that we faced and it’s our hope that these experiences help other growing companies grow fast too.
Service Oriented Architecture
Cloud-based architecture can greatly improve scalability and reliability. At BrightTag, we use a service oriented architecture to take advantage of the cloud’s elasticity. By breaking a monolithic application into simpler reusable components that can communicate, we achieve horizontal scalability, improve redundancy, and increase system stability by designing for failure. Load balancers and virtual IP addresses tie the services together, enabling easy elasticity of individual components; and because all services are over HTTP, we’re able to use standard tools such as load balancer health checks without extra effort.
Most web services require some data to be available in all regions, but traditional relational databases don’t handle partitioning well. BrightTag uses Cassandra for eventually consistent cross-region data replication. Cassandra handles all the communication details and provides a linearly scalable distributed database with no single point of failure.
In other cases, a message-oriented architecture is more fitting, so we designed a cross-region messaging system called Hiveway that connects message queues across regions by sending compressed messages over secure HTTP. Hiveway provides a standard RESTful interface to more traditional message queues like RabbitMQ or Redis, allowing greater interoperability and cross-region communication.
Zero Downtime Builds
Whether you have a website or a SaaS system, everyone knows uptime is critical to the bottom line. To achieve 99.995% uptime, BrightTag uses a combination of Puppet, Fabric and bash to perform zero downtime builds. Puppet provides a rock-solid foundation for our systems. We then use Fabric to push out changes on demand. We use a combinations of haproxy and built-in health checks to make sure that our services are always available.
Whether you use a dedicated DNS server or /etc/hosts files, to keep a flexible environment functioning properly, you need to update your records. This includes knowing where your instances are on a regular and automatic basis. To accomplish this, we use a tool called Zerg, a Flask web app that leverages libcloud to abstract away the specific cloud provider API from the common operations we need to do regularly in all our environments.
HAProxy Config Generation
Zerg allows us to do more than just generate lists of instances with their IP addresses. We can also abstractly define our services in terms of their ports and health check resource URLs, giving us the power to build entire load balancer configurations filled in with dynamic information from the cloud API where instances are available. We use this plus some carefully designed workflow patterns with Puppet and git to manage load balancer configuration in a semi-automated way. This approach maximizes safety while maintaining an easy process for scaling our services independently – regardless of the hosting provider.
Application and OS level monitoring is important to gain an understanding of your system. At BrightTag, we collect and store metrics in Graphite on a per-region basis. We also expose a metrics service per-region that can perform aggregation and rollup. On top of this, we utilize dashboards to provide visibility across all regions. Finally, in addition to visualizations of metrics, we use open source tools such as Nagios and Tattle to provide alerting on metrics we’ve identified as key signals.
There is obviously a lot more to discuss when it comes to how we automate our life in the cloud at BrightTag. We plan to post more updates in the near future to share what we’ve learned in the hopes that it will help save you time and headaches living in the cloud. In the meantime, check out our slides from CloudConnect 2012.