Designing Engineering Teams for Scale

As a senior engineering manager who has taken many companies from just a few engineers to entire teams operating at global scale during hypergrowth, I have learned quite a few lessons about coping with growth and scale on a variety of fronts. I wrote them down for senior technical people with deep knowledge of code, system architecture and more, who want to understand how to take their careers to the next level. I hope this post offers some insight into how to design not only your systems but also your teams to support the business at scale. This blog post is based on my keynote talk from DevOpsDays Tel Aviv 2018.

What Do You Think About When You Hear ‘Scale’?

Let’s start by framing the meaning of “scale.” When defining scale, the first thing most technologists think of is additional traffic or users: distributed systems, event traffic, more clients using the same system and so on. This kind of scale is well known; you can find many excellent posts and talks on the subject and learn from the many companies that have gone through the process of scaling their systems.

But there are additional types of scale. Here’s how you, as a senior engineer, can learn to identify which one is relevant and how you can help your company tackle it.

Scaling challenges are diverse, and if I had to categorize them into three primary categories they would be the three “T”s:

- Traffic
- Team
- Territories

Let’s dive deep into each one: its characteristics, its unique technical challenges and how you, as a leading technologist, can help. More important still is how to avoid scaling-related mistakes like the ones I made, and learned from, over the years.

Scaling Challenge 1: Traffic

We will start with the most commonly known scaling challenge: traffic scale.

This one is probably the easiest to identify. You will start feeling growing pains when you see an increasing number of events coming into the system, more concurrent users and more registered clients, among other telltale signs.

This brings a number of technical challenges: services fail to serve requests, database performance degrades, responses from the AWS Elastic Load Balancing service grow exponentially (sometimes 50x), and more.

The most important thing when triaging traffic-scale issues is to remember the priority: users must still be served, even when there are many more of them. Focus on anything that relates to maintaining uptime. This means you must be ready to scale the web services and all other real-time functionality for peak loads; auto-scaling is mandatory, as are elastic real-time databases. Offline processes, on the other hand, can be delayed if needed and catch up asynchronously when the traffic load subsides.
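The split between real-time work and offline work can be sketched in a few lines. This is only an illustration under stated assumptions: the `LoadAwareScheduler` class, the `load_threshold` value and the idea of passing in a `current_load` figure between 0 and 1 are all hypothetical, and a real system would more likely rely on a message queue and the auto-scaling features of its platform.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, List


@dataclass
class LoadAwareScheduler:
    """Runs real-time work immediately and defers offline work under load."""

    load_threshold: float = 0.8  # hypothetical utilization cutoff (0.0 to 1.0)
    deferred: Deque[Callable[[], str]] = field(default_factory=deque)

    def submit(self, job: Callable[[], str], realtime: bool, current_load: float) -> List[str]:
        """Run the job now if it is real-time or the load is low; otherwise queue it."""
        if realtime or current_load < self.load_threshold:
            return [job()]
        self.deferred.append(job)
        return []

    def drain(self, current_load: float) -> List[str]:
        """Catch up on deferred offline jobs once the load has subsided."""
        results: List[str] = []
        while self.deferred and current_load < self.load_threshold:
            results.append(self.deferred.popleft()())
        return results
```

During a peak, a user-facing request submitted with `realtime=True` always runs, while a report job submitted with `realtime=False` waits in `deferred` until `drain` is called with a load below the threshold.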

As much as I hate to say it, while we always need to be cognizant of technical debt and be sure we’re on track to pay it down, code quality and refactoring can take a lower priority at times of load. You might actually find yourself rewriting services from scratch, or needing a different service architecture altogether.

What about the team?

Well, with the growth of traffic usually comes the need to add more team members, so that enough people can focus on scaling the infrastructure while you still have enough attention left for the business road map. This is the right time to think “DevOps” and create dev tools that make the work more efficient, so you might be able to achieve more with fewer people.

You will also need more money for your infrastructure services, and you need to be ready for that. Scale costs money, and traffic growth translates directly into your data center or cloud provider’s monthly invoice. You may also have to start thinking about cost-saving services and optimizing your usage.

Scaling Challenge 2: Teams

The second type of scale comes with growing pains of human capital: team scaling.

While this may seem obvious, in many cases, companies tend to ignore the impact that the growing team has on the system itself.

Hiring becomes a scaling issue when your team needs to double in a year. Even going from five to 10 people is significant growth. Many leading technologists tend to ignore this type of scale and look at it as a managerial problem. This is not true; if you see yourself as part of the R&D leadership you should take a prominent role in this scaling.

There are several technical challenges that come with this growth:

- Additional functions might be introduced to the team. For example, I tend to hire more generalists as the first team members and bring in more specialists as the team grows. In addition, QA engineers might be hired only after there is already a full team of developers, and they might need different tools.
- In many cases, manual processes just don’t scale with growing teams. Things that used to be done manually on the production database, for example, might be OK when it is just you but should be automated once four people are working with you.
- In some cases, the system architecture should be changed to enable more people to work effectively. This is a big one: Your system architecture worked well for a small team, but now that many more hands will be working on the same code base, it becomes much more complex to manage. Some companies keep the same code base for everyone and enforce engineering quality through code reviews, testing and more. Some teams decide to split a service into separate services, so each team is in charge of a sub-service. I prefer to split services, as it makes team ownership more explicit.
- Knowledge sharing and best practices should be formalized. In my experience, investing in very good initial training pays off. The format I usually use is on-the-job training: a dedicated buddy from the team guides the new team member through coding tasks, with increasing autonomy. This method lets the newcomer learn the system from the inside, alongside business-related sessions and self-study of technical material on the relevant technologies. Avoid overly complex best practices and make the time to replace them with scripts or well-written instructions.
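To make the point about manual work on the production database concrete, here is a minimal sketch of turning a hand-typed statement into a small, reviewable script with a dry-run mode. It uses Python's built-in `sqlite3` module only so that the example is self-contained; the `apply_change` helper and the `users` table in the usage note are hypothetical, and the same idea would need adapting to whatever database and migration tooling your team actually uses.

```python
import sqlite3


def apply_change(
    conn: sqlite3.Connection,
    sql: str,
    params: tuple = (),
    dry_run: bool = True,
) -> int:
    """Run one data-changing statement and return the number of affected rows.

    With dry_run=True (the default) the change is rolled back, so an engineer
    can check the impact before running it for real.
    """
    cursor = conn.execute(sql, params)
    affected = cursor.rowcount
    if dry_run:
        conn.rollback()  # undo the change; only the row count is reported
    else:
        conn.commit()    # make the change permanent
    return affected
```

A call such as `apply_change(conn, "UPDATE users SET plan = ? WHERE plan = ?", ("trial", "free"))` reports how many rows would change without touching the data; passing `dry_run=False` applies it. Defaulting to the safe mode is a deliberate choice for a script that several people will run.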

The most important thing is for new people to be productive. It is crucial that each new team member can write code, test it and deploy it, just as the team could before it grew. It doesn’t make sense to hire new people who are less productive.

This is the right time to invest in good automated testing and a good automated build-and-deploy process.
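As a rough illustration of what an automated build, test and deploy flow guarantees, the sketch below runs named steps in order and stops at the first failure, so a deploy never happens after a failed test. The `run_pipeline` function and the step names are hypothetical; in practice this logic lives in your CI/CD tool rather than in hand-written code.

```python
from typing import Callable, List, Tuple

# A step is a name plus an action that returns True on success.
Step = Tuple[str, Callable[[], bool]]


def run_pipeline(steps: List[Step]) -> Tuple[bool, List[str]]:
    """Run steps in order and stop at the first failure.

    Returns (succeeded, names_of_completed_steps).
    """
    completed: List[str] = []
    for name, action in steps:
        if not action():
            return False, completed  # later steps (e.g. deploy) never run
        completed.append(name)
    return True, completed
```

Calling `run_pipeline([("build", build), ("test", test), ("deploy", deploy)])`, where the three callables are your own, makes the ordering explicit: every new team member gets the same gate, no matter who triggers it.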

You know your company best, and in some cases, team scaling comes as a result of traffic scaling. If this is not the case, then you might need to leave the traffic scaling aside at this point.

You have to be very careful with team growth. In some cases, you may be able to use external services or automation to reduce team growth. For example, instead of hiring more DevOps engineers, you can consider using hosted solutions. It might cost a little more than the non-hosted solution, but it might save you valuable hours on maintenance.

Scaling Challenge 3: New Territories

The third challenge is territories. As a company looks to expand beyond its primary geographic region, new challenges arise not only in sales and organizational structure but also in technology. Let’s take the example of an organization currently serving clients in the U.S. and expanding to the EU:

- GDPR: In the U.S., California and New York were the first to establish GDPR-like laws, while EU countries have already implemented privacy laws.
- Language and culture: The EU has different languages, which might affect your product.
- Regulated environments: In areas such as FinTech and HealthTech, the product can be different when moving from the U.S. to the EU.

It is fairly easy to tell that you are facing territory challenges beyond sales and customer support: you start discussing data protection and other regulatory considerations. The technical challenges differ from company to company, depending on your business area:

- The location of your production environment. This used to be largely a performance issue, and it still might be when your system is highly sensitive to performance. In most other cases, other considerations determine whether and where you need additional data centers when expanding to new territories, such as regulations and the specific needs of your clients. In such cases, you need to spell out for the business the cost of having an additional data center. This is not just extra money but also added complexity and much harder product requirements, such as having the same user in two different environments at the same time while keeping everything in sync. In most cases, the legal team can find a reasonable solution that keeps only one data center, which will make your life easier. It is important to have this discussion, and it is your role as a leading technologist to lead it.
- Data considerations. Different countries can have different regulations for data privacy and data protection. It might be easiest to agree with the business people to apply the strictest rules to everyone.
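The "strictest rules for everyone" idea can be expressed as a small merge over per-region policies. The sketch below is illustrative only: the `DataPolicy` fields and any values you feed it are hypothetical stand-ins, not real legal requirements, which must come from your legal team.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class DataPolicy:
    """Hypothetical per-region data-handling rules."""

    max_retention_days: int     # fewer days is stricter
    requires_consent: bool      # True is stricter
    allows_export_abroad: bool  # False is stricter


def strictest(policies: Iterable[DataPolicy]) -> DataPolicy:
    """Combine regional policies by taking the strictest value of each field."""
    items = list(policies)
    if not items:
        raise ValueError("at least one policy is required")
    return DataPolicy(
        max_retention_days=min(p.max_retention_days for p in items),
        requires_consent=any(p.requires_consent for p in items),
        allows_export_abroad=all(p.allows_export_abroad for p in items),
    )
```

Applying one merged policy everywhere costs some flexibility in the more permissive regions, but it means a single code path and a single production configuration to reason about.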

The most important thing is for you to understand what is truly mandatory and what is merely “nice to have” from the perspective of the law and the regulators.

If you don’t already support internationalization (i18n), this is the time to add it. Try to avoid additional production environments unless they are absolutely mandatory.
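For readers who have not added i18n before, the core mechanism is a message lookup with a fallback chain. The sketch below assumes hypothetical in-memory catalogs (`CATALOGS`) and a hypothetical `translate` helper; a production system would normally use an established library and externalized translation files instead.

```python
from typing import Dict

DEFAULT_LOCALE = "en"

# Hypothetical message catalogs; real ones would live in translation files.
CATALOGS: Dict[str, Dict[str, str]] = {
    "en": {"greeting": "Hello, {name}!", "logout": "Log out"},
    "de": {"greeting": "Hallo, {name}!"},
}


def translate(key: str, locale: str, **params: str) -> str:
    """Look up a message, falling back from e.g. 'de-AT' to 'de' to the default locale."""
    candidates = [locale, locale.split("-")[0], DEFAULT_LOCALE]
    for candidate in candidates:
        template = CATALOGS.get(candidate, {}).get(key)
        if template is not None:
            return template.format(**params)
    return key  # last resort: show the key rather than fail the request
```

With these catalogs, `translate("greeting", "de-AT", name="Ana")` resolves through the `de` catalog, while `translate("logout", "de")` falls back to English because the German catalog has no such entry. The fallback chain is what lets you ship a new territory before every string is translated.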

Below is the way I break it down to try to analyze how to use technology for different types of scaling.

Here are a few questions to help you understand how to identify your company’s unique situation.

You should start by asking yourself, “Is this temporary, or here to stay?”

- Do you see this trend going on for a few months?
- Do you see this change supporting a significant amount of sales in the pipeline?
- Do you think that entering into the new territory is just the first step before foraying into other regions in the world?
- Does your team plan to double itself according to next year’s budget plans?

If your answer is yes to most of these questions, then you are talking about scaling challenges.

Now that we understand that we have a scaling problem, let’s understand the type of scaling. Go back to the three Ts and ask yourself:

- Is it traffic?
- Teams?
- Territories?
- Or maybe a combination of a few types together?

Failing at Scaling

I’ve had my share of scaling through the years. It is a lot of fun to work in a growing company and take part in the scaling process. Sometimes, however, only hindsight shows that you did not handle it ideally. This has happened to me more than once, and I’m proud of my mistakes because they gave me the opportunity to learn and grow.

Let me share a few:

- Like many companies, LivePerson started with a monolith. It was a Java-based web application that handled all the incoming visitor traffic, as well as chats, operators and more. Around 2012, as the team grew and experienced quality challenges, we collected all the unit tests together with some additional integration tests and decided that they should run as part of the build. The build process would compile the Java code, perform additional steps and run the entire test suite before declaring failure or success. It was a great idea: it let us see not only that the code compiled but also that the tests ran and nothing broke. The only thing that ruined the party was that the entire process took 30(!) minutes. Instead of waiting 10 minutes for the compilation, we now waited another 20 minutes every time a bug fix was deployed to the testing environment, making an “already too long” build process even longer.
- At AppsFlyer, we had a real-time reporting system from the very first day. It was one of the more important business offerings, and marketing people would buy the service to analyze their campaigns as they ran. The system was built when the company had just a few small clients, on top of MongoDB, using a suboptimal approach of incrementing counters. As the company gained more and more clients, particularly very large ones, the system did not hold up well under the traffic and we experienced outages once every few weeks. We added more compute power, split the cluster and more. This bought us some air for a few more months, until one day two of our biggest customers couldn’t fetch data for the entire week(!) and could see only that day’s data. That was the moment I understood that I had made a mistake and that we should have built the next generation in time.
- And here is one scaling mistake I made at home … My ceiling collapsed after three days of nonstop rain last year. The roof had held up for many winters, but it got older and eventually couldn’t keep up with the rain. We found ourselves putting bowls under the ceiling and mopping rainwater off the floor at 3 a.m. on a Saturday. It turns out you can have production problems at home, too.

Be Brutally Honest!

Now is the time to be honest with yourself!

Do you find yourself saying things such as “We can’t do this task” or “Our infrastructure can’t support the new challenge”? As you evolve into a leading technologist, you should be able to communicate the complexity in business terms and come up with alternatives and trade-offs.

This is a topic for a whole new article, but it is important to open the right discussion with the business and express the technical needs in words that non-technical people can understand. Saying things such as “The current system doesn’t support an additional payment method, and we will need to spend time refactoring it before going to the EU,” or, “We see that additional clients are planning to use the product in the next few months, so we will need to spend time enhancing the architecture and maybe replacing the database technology to support the 10x scale,” is better than just saying “This is hard.”

It is crucial for the business to understand the complexity and trade-offs before making any decision, and explaining them lets you influence that decision. It also gives you inside information about which trade-offs are acceptable and what cannot be sacrificed under any circumstances at your company. Be very careful here: put on the table only the technical changes whose absence is holding the company back. The ones that are very cool but not really crucial for scaling should wait.

Plan and Execute

Once we have identified whether this is long-term scale, determined the scale type and mapped the gaps, it is time to create a plan and execute it.

I’m not going to say this is an easy feat, but you start by taking small steps. As always, technology is the enabler of the business’s growth, and your plan should reflect this. It should contain all the scaling-killers and only them. This means that in some cases you have to put industry best practices aside for a later stage and stick with things that still work for you. For example, the monolith that your company started with might still be fine if it is not your critical scaling bottleneck. Splitting it can be pushed to the point in time when it becomes the crucial scaling-killer.

There is no magic, just constant hard work.

Finally, make sure you don’t make too many changes at once, as there are always unplanned things that will break at scale. Also, don’t forget the simple but life-changing things: automate and auto-scale, talk to other senior engineers who have been through the same scaling to learn from their mistakes, and plan for small iterations.

Working in a fast-growing company is a lot of fun. You are probably facing different technology scaling challenges, which provide opportunities for growth and learning and some adrenaline boosts. I hope this post helped you to understand not only how to help your company, but also how to scale your career.