Nine Common Ops Mistakes (and How to Prevent Them)

Constant change is a reality for any growing, dynamic organization. Being receptive to change helps us innovate and learn quickly, yet poorly managed change can create instability and downtime. Ask any engineer what the number one cause of downtime in their systems is, and they’ll say change – software changes, network changes, configuration changes. It would be nice to avoid this instability by working on completely static systems, but modern technology businesses can’t stand still. Instead, operations teams must learn to manage change better by preventing common ops mistakes before they compromise systems.

Arup Chakrabarti, operations engineering manager and my colleague at PagerDuty, stopped by Heavybit Industries, a community workspace for developer-focused entrepreneurs, to discuss the biggest mistakes an operations team can make and how to head them off:

1. Getting It Wrong in Infrastructure Setup

Creating Accounts
A lot of people use personal accounts when setting up enterprise infrastructure deployments. Instead, create new accounts using corporate addresses to enforce consistency.
Be careful where you store passwords. If they end up in your Git repository, removing them later could mean rewriting your entire Git history. It’s better to keep passwords in configuration management so they can be plugged in as needed.
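As a minimal sketch of keeping credentials out of the repository, the Python snippet below reads a secret from an environment variable, assuming your configuration management tool populates it at deploy time. The names `get_secret`, `MissingSecretError` and `DB_PASSWORD` are hypothetical and used only for illustration.

```python
import os


class MissingSecretError(RuntimeError):
    """Raised when an expected secret was not injected into the environment."""


def get_secret(name, environ=None):
    """Return the secret called `name`, or fail loudly if it is absent.

    `environ` defaults to the process environment; it is a parameter only
    so the lookup can be exercised without touching real credentials.
    """
    env = os.environ if environ is None else environ
    value = env.get(name)
    if not value:
        raise MissingSecretError(f"secret {name!r} is not set")
    return value
```

An application would then call something like `get_secret("DB_PASSWORD")` at startup, so a missing credential fails the deploy immediately instead of surfacing later as a confusing connection error.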
Selecting Tools
Another important consideration with new deployments is selecting your tools wisely. Leverage PaaS tools as long as possible so you can focus on acquiring customers instead of building infrastructure. And don’t be afraid to employ “boring” products like Java. Well-established, tried-and-true tech can let you do some really cool stuff.

2. Poorly Designed Test Environments

Keep Test and Production Separate
Don’t mingle your test and production environments. Be sure to set up test environments with different hosting and provider accounts than the ones you use in production. At the same time, make sure your test environments resemble production infrastructure as closely as possible.
Virtual Machines
Are you performing local development? There’s no way around it: applications will run differently on local machines and in production. To simulate a production environment as closely as possible, create VMs with a tool like Vagrant.

3. Incorrect Configuration Management

Infrastructure-as-code
Essentially, infrastructure-as-code is the process of building infrastructure in such a way that it can be spun up or down quickly and consistently. Server configurations are going to pose problems regardless of where your infrastructure is running, so you have to be prepared to restore your servers in as little time as possible. Ansible and Chef are two tools that make infrastructure-as-code deployment super-simple for ops teams. Whatever tool you use, as a rule of thumb, it’s best to limit the number of automation software tools you’re using. Each one is a source of truth in your infrastructure, which means it’s also a point of failure.
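One property that makes this kind of tooling safe to run again and again is idempotence: applying the same configuration twice leaves the server in the same state. The Python sketch below illustrates the idea only; it is not how Ansible or Chef are implemented, and the function name `ensure_line` is hypothetical.

```python
from pathlib import Path


def ensure_line(path, line):
    """Make sure `line` is present in the file at `path`.

    Returns True if the file was changed and False if it was already in
    the desired state, so running it repeatedly is safe.
    """
    target = Path(path)
    existing = target.read_text().splitlines() if target.exists() else []
    if line in existing:
        return False
    existing.append(line)
    target.write_text("\n".join(existing) + "\n")
    return True
```

Because the function describes the desired end state rather than a one-off action, the same code can build a fresh server or repair a drifted one.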

4. Deploying the Wrong Way

Consistency matters
Every piece of code must be deployed in as similar a fashion as possible. Standardizing deployment practices takes time and effort, but eliminating variability and the potential for human error is essential for success. That applies to rollbacks, too: anyone with deploy rights should also be able to easily roll back any code they ship.
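To make the rollback point concrete, here is a small Python sketch in which deploying and rolling back go through the same object, so undoing a release is as routine as shipping one. The class `ReleaseHistory` and the version labels are hypothetical; a real setup would also switch a symlink, container tag or similar.

```python
class ReleaseHistory:
    """Tracks deployed versions so any deploy can be undone the same way."""

    def __init__(self):
        self._versions = []  # oldest first; the last entry is live

    @property
    def current(self):
        """The live version, or None if nothing has been deployed yet."""
        return self._versions[-1] if self._versions else None

    def deploy(self, version):
        """Record `version` as the live release."""
        self._versions.append(version)
        return version

    def rollback(self):
        """Drop the live release and return the one that is live afterwards."""
        if len(self._versions) < 2:
            raise RuntimeError("no earlier release to roll back to")
        self._versions.pop()
        return self.current
```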

Orchestrate your efforts
Powerful automation software can certainly help enforce consistency, but automation tools are only appropriate for big deployments. When you’re getting started, Arup suggests running development using Git and employing an orchestration tool, such as Capistrano for Rails, Fabric for Python, or Ansible and Salt for both orchestration and configuration management.

5. Not Handling Incidents Correctly

Have a process in place
Creating and documenting an incident management process is absolutely necessary, even if the process isn’t perfect. Be sure to review the incident-management document on an ongoing basis, too. Incident response should be defined relative to the severity of the incident: a minor latency blip and a total downtime event have different response workflows and expectations.
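One way to keep severity-based response unambiguous is to write it down as data. The Python sketch below is illustrative only: the severity names, the fields and every threshold are hypothetical and should be replaced with whatever your own incident-management document specifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResponsePlan:
    page_on_call: bool        # wake someone up immediately?
    ack_within_minutes: int   # how quickly a human must acknowledge
    notify_customers: bool    # post a public status update?


# Hypothetical severity levels and thresholds; tune these to your own service.
RESPONSE_PLANS = {
    "sev1": ResponsePlan(page_on_call=True, ack_within_minutes=5, notify_customers=True),
    "sev2": ResponsePlan(page_on_call=True, ack_within_minutes=15, notify_customers=False),
    "sev3": ResponsePlan(page_on_call=False, ack_within_minutes=240, notify_customers=False),
}


def plan_for(severity):
    """Look up the documented response for a severity, rejecting unknown ones."""
    try:
        return RESPONSE_PLANS[severity]
    except KeyError:
        raise ValueError(f"unknown severity: {severity!r}") from None
```

Here a total outage would map to something like "sev1" and a minor latency blip to "sev3", each with its own expectations.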

Put everyone on-call
It’s becoming less and less common for companies to have dedicated on-call teams – instead, everyone who touches production code is expected to be reachable in the event of downtime. This requires a platform that can notify different people in different ways. What really matters is getting a hold of the right people at the right time.

6. Neglecting Monitoring and Alerting

Start anywhere
The specific tool you use for monitoring is less important than just putting something in place. At PagerDuty, we use StatsD in concert with Datadog; open-source tools like Nagios can be just as effective. For application performance management, you might also look at a tool like New Relic or AppDynamics. There are hundreds of great monitoring tools out there, but the important thing is to get something in place and make sure you’re alerting on the exceptions that matter to you.
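Getting a first metric flowing can be very little code. As a rough sketch, the Python below sends a counter using the plain-text StatsD line format (`name:value|c`) over UDP. The helper names and the metric name `web.errors` are hypothetical, and the host and port are assumptions (8125 is the usual StatsD default); in practice you would more likely use an existing client library.

```python
import socket


def format_counter(name, value=1):
    """Build a StatsD counter line such as 'web.errors:1|c'."""
    return f"{name}:{value}|c"


def send_counter(name, value=1, host="127.0.0.1", port=8125):
    """Send a counter increment to a StatsD daemon over UDP (fire and forget)."""
    payload = format_counter(name, value).encode("ascii")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
```

Because UDP is fire and forget, a missing or slow metrics daemon does not block the application that is reporting.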

Check externally
If uptime is important to you (and let’s face it, it always is), make sure you’re also running some sort of external check on your service or site, with a tool like NodePing or Ghost Inspector.
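At its simplest, an external check is a timed HTTP request made from somewhere outside your own infrastructure. The Python sketch below shows the idea using only the standard library; the function name `check_url` is hypothetical, and hosted services add scheduling, multiple locations and alerting on top.

```python
import time
import urllib.error
import urllib.request


def check_url(url, timeout=5.0):
    """Return (is_up, seconds_taken) for a single HTTP GET against `url`."""
    started = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            is_up = 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        # Covers DNS failures, refused connections, timeouts and HTTP errors.
        is_up = False
    return is_up, time.monotonic() - started
```

Run it on a schedule from a machine that does not share hosting with the service, so an outage of your provider does not take the check down with it.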

7. Failing to Maintain Backups

Systematizing backups and restores
Just like monitoring and alerting, backing up your data is non-negotiable. Scheduling regular backups to S3 is an industry-standard practice today, and you should always have one more backup method than you think you need. And backups are useless if you can’t restore from them! At least once a month, try restoring your production dataset into a test environment to confirm that your backups are working as designed.
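Part of a restore drill can be automated. Assuming the backup is a file (a database dump, for instance) and that you can fetch it back from storage, the Python sketch below compares checksums of the original and the restored copy. The function names are hypothetical, and a matching checksum only shows the bytes survived; you still need to load the data to know it is usable.

```python
import hashlib


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_matches(original_path, restored_path):
    """True when the restored file is byte-for-byte identical to the original."""
    return sha256_of(original_path) == sha256_of(restored_path)
```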

8. Ignoring High Availability Principles

“Multiple” is the key.
Having multiple servers at every layer, including multiple stateless app servers and multiple load balancers, is a no-brainer. Only with multiple failover options can you truly say you’ve optimized for HA.
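The reason redundancy helps is that a caller can move on to the next server when one fails. The Python sketch below shows that failover loop in its simplest form; `call_with_failover` is a hypothetical name, and the backends are plain callables standing in for real servers behind a load balancer.

```python
def call_with_failover(backends, request):
    """Try each backend in order and return the first successful result.

    `backends` is a list of callables standing in for redundant servers.
    """
    errors = []
    for backend in backends:
        try:
            return backend(request)
        except ConnectionError as exc:
            errors.append(exc)  # remember the failure and try the next one
    raise ConnectionError(f"all {len(backends)} backends failed: {errors}")
```

This only works cleanly when the app servers are stateless, which is why statelessness and redundancy tend to go together.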
Datastore design matters, too.
With multimaster data clusters like Cassandra, individual nodes can be taken out with absolutely no customer-facing impact. Clustered datastores are ideal in fast-moving deployment environments for this reason.

9. Falling Into Common Security Traps

Relying solely on SSH
Use gateway boxes instead of exposing SSH directly on your database servers and load balancers. You can run proxies through these gateways and lock traffic down if you suspect an incursion.

Not configuring individual user accounts
When an employee leaves your organization, it’s nice to be able to revoke their access quickly. But there are other reasons to give people individual user accounts for your various tools. Someone’s laptop may get lost. Someone might need a password reset. It’s a lot easier to revoke or reset one user’s password than a master account password.

Failing to activate encryption in dev
Making encryption a part of the development cycle helps you catch security-related bugs early in development. Plus, forcing devs to think constantly about security is simply a good practice.

These are just a few of the mistakes that operations teams can make when managing change. There are many more that you’re likely to run across, but these nine tend to be the most common, even at larger companies like Amazon and Netflix. By putting these best practices in place, you’ll help set your team up to manage change effectively.