by Dave Connors from Constant Contact
This talk is about how Constant Contact integrated social media into their offering using Cassandra and Puppet. Small businesses look to Constant Contact for help staying in touch with their customers, and social media is now a fast-growing part of marketing, so Constant Contact had to integrate it. The business rules for social media differ quite a bit from email marketing, but the number one challenge of the integration was data volume: social media generates on the order of 10 to 100 times more data than email.
NoSQL, Puppet, and DevOps practice offered answers on how to accomplish that integration rapidly and at low cost. The price tag for the integration on their traditional data stores would have been around two million dollars; with NoSQL it was much, much cheaper. The second nice thing about NoSQL is that it reduced time to market. The right technology alone would not have been the solution, though; they needed to focus on having a real DevOps culture and practice.
Both Ops and Dev faced issues in getting the Constant Contact social media integration project done:
- Data Model – Cassandra is different
- Monitoring – Old monitoring solution was not suitable
- Logging – Lots more data
- Risk profile
- Roles and Responsibilities – swapping them around a bit from the traditional approach
This social media project was completed in 3 months. Cassandra/NoSQL and DevOps gave them a big advantage in making that possible.
The Dev Perspective
Jim, the system architect, now speaks about the dev perspective on this project. Cassandra was the tool chosen to underpin the project. It was developed at Facebook, open sourced in 2008, incubated at Apache, and is in use at Digg, Facebook, Twitter, Reddit, etc. Cassandra has the following characteristics:
- It is implemented in Java, which does not matter much in practice.
- It is fault tolerant.
- It is elastic, meaning you can basically keep adding nodes and it scales more or less linearly as you add nodes.
- It is durable. Data is automatically replicated to multiple nodes. You can tweak options about consistency and replication.
- It has a rich data model, not strictly key/value. You can actually give the data some structure if needed (see the sketch just below).
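To make that data model concrete, here is a minimal sketch using the Hector Java client (which comes up again later in the talk). The keyspace setup is omitted, and the column family and row key names are made up for illustration:

```java
import me.prettyprint.cassandra.serializers.StringSerializer;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.factory.HFactory;
import me.prettyprint.hector.api.mutation.Mutator;

public class DataModelSketch {
    private static final StringSerializer SS = StringSerializer.get();

    // One row keyed by contact id, holding named columns: more structure
    // than a plain key/value store, less rigid than a relational table.
    public static void saveContact(Keyspace keyspace) {
        Mutator<String> mutator = HFactory.createMutator(keyspace, SS);
        mutator.addInsertion("contact:42", "Contacts",
                HFactory.createStringColumn("name", "Ada Lovelace"));
        mutator.addInsertion("contact:42", "Contacts",
                HFactory.createStringColumn("email", "ada@example.com"));
        mutator.execute(); // consistency level (e.g. QUORUM) comes from the keyspace's policy
    }
}
```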
Some development challenges in working on this project with this technology included:
- Moving target. Cassandra major releases come fast in comparison to DB2, for example.
- Developer unfamiliarity. Cassandra is not totally trivial to wrap your mind around.
- Operational procedures. There are not a lot of established best practices out there for dealing with this sort of DB.
- Reliability concerns. Can you realize the promise of its reliability if you don't fully understand how to do so?
How this was mitigated/handled for this project:
- Pushing hard on deployment automation – clearly
- Community involvement. Apache Cassandra has a very active community for ferreting out best practices. Getting into the community is key: the mailing lists and IRC (#cassandra on freenode). Contribute back to the community so that you don't have to maintain your own fork when you find bugs.
- Training and consulting are available for Cassandra, and they used it. There is no single "one neck to wring" with Cassandra, but you can get paid support and training from DataStax; Constant Contact used them and was happy with it.
- Lots of monitoring. They put a lot of work into being comprehensive; Munin was used.
- Choosing a good client for Cassandra – Hector was used. (Don't use raw Thrift; it is really intended as a driver-level client and does not provide a lot of the things you would want a real application client to do, such as failover and retry. See the sketch after this list.)
- Switchable modes. Keep the relational DB as the system of record as you start to move over to Cassandra.
- Mirroring is another technique that was employed, at the application level: all writes go in parallel to Cassandra and to the relational DB. When things fail, the RDBMS is the backup. (A sketch of this dual-write pattern also follows the list.)
- Dialable traffic. Being able to turn down the traffic to Cassandra when things go wrong.
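Here is the Hector sketch referenced above: a minimal client setup, assuming the Hector 1.x API. The cluster name, host list, and keyspace are hypothetical; the point is that failover and retry live in the client, which is exactly what raw Thrift does not give you:

```java
import me.prettyprint.cassandra.service.CassandraHostConfigurator;
import me.prettyprint.cassandra.service.FailoverPolicy;
import me.prettyprint.hector.api.Cluster;
import me.prettyprint.hector.api.Keyspace;
import me.prettyprint.hector.api.factory.HFactory;

public class HectorSetup {
    public static Keyspace connect() {
        // Pool of Cassandra nodes; Hector balances requests across them.
        CassandraHostConfigurator hosts =
                new CassandraHostConfigurator("cass01:9160,cass02:9160,cass03:9160");
        hosts.setRetryDownedHosts(true);             // keep probing nodes that dropped out
        hosts.setRetryDownedHostsDelayInSeconds(30); // ...every 30 seconds

        Cluster cluster = HFactory.getOrCreateCluster("SocialCluster", hosts);

        // If a request fails, try the remaining hosts in the pool.
        return HFactory.createKeyspace("Social", cluster,
                HFactory.createDefaultConsistencyLevelPolicy(),
                FailoverPolicy.ON_FAIL_TRY_ALL_AVAILABLE);
    }
}
```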
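And here is a sketch of mirroring plus dialable traffic at the application level. Everything in it (the ContactEventStore interface, the class names, the percentage dial) is hypothetical scaffolding to show the pattern, not Constant Contact's actual code:

```java
import java.util.Random;
import java.util.concurrent.atomic.AtomicInteger;

public class MirroredStore {
    /** Hypothetical storage abstraction with one RDBMS and one Cassandra implementation. */
    public interface ContactEventStore {
        void write(String id, String event);
        String read(String id);
    }

    private final ContactEventStore rdbms;      // system of record during the migration
    private final ContactEventStore cassandra;  // new store being proven out
    private final AtomicInteger cassandraPercent = new AtomicInteger(100); // the "dial"
    private final Random random = new Random();

    public MirroredStore(ContactEventStore rdbms, ContactEventStore cassandra) {
        this.rdbms = rdbms;
        this.cassandra = cassandra;
    }

    /** Ops can dial Cassandra traffic down (even to 0) when things go wrong. */
    public void setCassandraPercent(int percent) {
        cassandraPercent.set(percent);
    }

    public void write(String id, String event) {
        rdbms.write(id, event); // must succeed: the RDBMS is the backup
        if (random.nextInt(100) < cassandraPercent.get()) {
            try {
                cassandra.write(id, event); // mirrored write
            } catch (RuntimeException e) {
                // A Cassandra failure is survivable while the RDBMS is authoritative.
            }
        }
    }

    public String read(String id) {
        // Switchable mode: flip this to cassandra once it becomes the system of record.
        return rdbms.read(id);
    }
}
```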
Collaboration was really key in getting this to work. It was a big, complex project, and it needed close collaboration and flexible roles. Mark and Jim were the two primary dev and ops people on the project, and they had to be flexible. For example, they changed the monitoring system traditionally used at Constant Contact when it was recognized that the existing system would have failed. This is the type of systemic change that would be difficult to make without an environment of collaboration between Dev and Ops. Now that we have covered the dev side, we can talk a bit about ops.
The Ops Perspective
Now Mark will talk about this project from the ops side. Mark is the manager of system automation. He will talk today about how they use Puppet and, in general, a software tool chain that allows for improved deployment flexibility.
When Mark starts a project, as a system admin, he tries to find the system specification that will support the system best. They came up with this machine spec after working with DataStax:
- 3 × 500 GB disks
- 1 × 250 GB disk
- RAID 0 root partition and data storage
The vendor was not sure they should order that configuration because there is no internal fault tolerance built into that model. Cassandra, however, deals with redundancy at the node level. So the question then became: how many nodes are needed?
- Cassandra Quorum = 3 (meaning each bit of data needs to ultimately live on three machines)
- Two data centers
- Each node can only use half the available disk because of RAID
- ~ 6 TB needed
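Back-of-the-envelope (my reading; the usable-space-per-node figure is an assumption, not from the talk): 6 TB × 3 replicas × 2 data centers = 36 TB to store, and at roughly 500 GB of effectively usable space per node that gives 36 TB ÷ 0.5 TB = 72 nodes.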
Ultimately that means 72 nodes, which is a fair amount to manage to the level required by the project. Without getting into the details of that management, it would have been impossible for a human to do it by hand. They wrote a Puppet module that handles much of the management of this cluster. Puppet is not the only part of the whole system, though. Here is the total tool chain:
Fedora anaconda/kickstart -> Func (for upgrades; the Puppet module is exec'd through Func) -> Puppet (for OS and app config) -> Scribe (Facebook's open source logging framework) -> Nagios (for alerting, managed by Puppet) -> Munin (for trending)
The tool chain above, really centered around Puppet, meant that Dev and Ops were able to talk about things in a common language. That language was Puppet. They also started keeping their configuration in Subversion. Puppet allowed for infrastructure as code.
Operational efficiencies were gained through using Puppet with Cassandra. Remote logging was a requirement: Cassandra uses log4j natively, but resources were not available for remote log4j logging. Ops was able to get Scribe integrated via Puppet easily.
Munin is another tool in the stack; it allows for JMX trending, which lets critical data points be identified. With Puppet they could continuously deploy improvements to the trending and analysis tooling across the cluster in a uniform way: they rolled out 7 × 92 graphs across the cluster in 5 minutes with Puppet, and this gets reused over and over again as more apps get pushed to the Cassandra cluster. DataStax provides RPMs for Constant Contact to deploy this software with. Admins at Constant Contact must in general be able to build RPMs; Maven is used to build the RPMs for custom applications.
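As an illustration of what one of those trending probes does under the hood, here is a sketch of a JMX read against a Cassandra node, the kind of data point a Munin plugin graphs on each polling cycle. The JMX port and the MBean/attribute names vary by Cassandra version, so treat them as assumptions:

```java
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class JmxProbe {
    public static void main(String[] args) throws Exception {
        // Connect to a single Cassandra node's JMX endpoint (port is version-dependent).
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://cass01:7199/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        try {
            MBeanServerConnection mbeans = connector.getMBeanServerConnection();
            // StorageService exposes per-node stats such as the data load.
            ObjectName storageService =
                    new ObjectName("org.apache.cassandra.db:type=StorageService");
            Object load = mbeans.getAttribute(storageService, "LoadString");
            System.out.println("load.value " + load); // Munin's fieldname.value format
        } finally {
            connector.close();
        }
    }
}
```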
Traditional Ops vs Today at Constant Contact
- Infrastructure: 4 weeks then; 4 hours to build 72 nodes today.
- Development to deployment: 9 months then; 3 months today (for the whole project, given comparable projects).
- Cost: millions of dollars then; about $150k today.
Q. What was the role of the DBA in this model?
A. DBAs will be the keepers of the data dictionary. They will also be helping with tuning of the actual cluster.
Q. Have you had an opportunity to do version upgrades on a running cluster?
A. Yes, we worked with QA to do a rolling upgrade twice. It worked nicely, no issues. We did a slow roll, one node at a time. Cassandra naturally takes care of this with hinted handoff.
Q. Both dev and ops roles are writing Puppet code. How do you stop them from clobbering each other?
A. We are still working on it, but version control helps a lot. Some code actually got pushed into production before it was ready. They expect to eventually be able to treat this Puppet code like any other code.