Benchmarking microservices often falls to the bottom of the engineering team’s to-do list, but teams that skip it miss out on a key practice. The fundamental goal of benchmarking is to better understand your software and to test the effects of various optimization techniques on your microservices. In this post, I’ll describe an effective approach to benchmarking microservices.
First, create a spreadsheet for tracking your benchmarks. A Google Spreadsheet is convenient: it supports collaboration and provides the features you need to analyze and summarize your results. Structure your spreadsheet as follows:
- Title page
  - List of planned and completed experiments (evolving as you learn more)
- Additional pages
  - Detailed benchmark results for various experiments
Then, before you engage in benchmarking, clearly state (and document) your goal. Examples of goals are:
- “I am trying to understand how input X affects metric Y”
- “I am running experiments A, B and C to increase/decrease metric X”
Next, pick one key metric. State clearly which single metric you are concerned about and how it affects users of the system. If you choose to capture additional metrics for your test runs, ensure that the key metric stands out.
Now, you’re going to have to think like a scientist and perform a series of experiments to better understand which inputs affect your key metric and how. Identify and document the variables involved, and create a standard control set to compare against. Design your series of experiments so that it yields understanding with the least time and effort. You’ll then need to define a methodology for running your benchmarks. It is critical that your benchmarks be:
- Fairly fast (several minutes, ideally)
- Reproducible in the exact same manner, even months later
- Documented well enough so another person can repeat them and get identical results
Document your methodology in detail. Also document how to re-create your environment. Include all details another person needs to know:
- Versions used
- Feature flags and other configuration
- Instance types and any other environmental details
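One way to keep these details from going stale is to write them into a manifest alongside every run. Here is a minimal sketch in Python; the version, feature flag, and instance type are hypothetical placeholders, not values from any real system:

```python
import json
import platform
from datetime import datetime, timezone

def write_run_manifest(path, service_version, feature_flags, instance_type):
    """Record everything another person needs to re-create this environment."""
    manifest = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "service_version": service_version,
        "feature_flags": feature_flags,
        "instance_type": instance_type,
        "python_version": platform.python_version(),
    }
    with open(path, "w") as f:
        json.dump(manifest, f, indent=2)
    return manifest

manifest = write_run_manifest(
    "run-manifest.json",
    service_version="1.4.2",            # hypothetical version
    feature_flags={"new_cache": True},  # hypothetical flag
    instance_type="m5.xlarge",          # hypothetical instance type
)
```

Committing a manifest like this next to each result set makes the “identical results months later” requirement much easier to meet.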
In most cases, to accomplish repeatable, rapid-fire experiments, you need a synthetic load generation tool. Find out whether one already exists. If not, you may need to write one. Understand that load generation tools are at best an approximation of what is going on in production. The better the approximation, the more relevant the results you’re going to obtain. If you find yourself drawing insights from benchmarks that do not translate into production, revisit your load generation tool.
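If you do end up writing your own generator, the core loop is small. Here is a minimal sketch in Python with the service call stubbed out; a real tool would issue HTTP or RPC requests against your microservice and record latencies as well:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def call_service():
    """Stand-in for a request to the microservice under test."""
    time.sleep(0.001)  # simulate ~1 ms of service latency
    return 200

def generate_load(target, duration_s=1.0, concurrency=8):
    """Fire requests at `target` from `concurrency` workers for `duration_s`."""
    deadline = time.monotonic() + duration_s

    def worker():
        count = 0
        while time.monotonic() < deadline:
            target()
            count += 1
        return count

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        totals = list(pool.map(lambda _: worker(), range(concurrency)))
    total = sum(totals)
    return total, total / duration_s  # total requests, QPS

requests, qps = generate_load(call_service, duration_s=0.5, concurrency=4)
```

How closely the stubbed `call_service` matches production traffic (request mix, payload sizes, think time) determines how relevant the results are.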
Next, validate your benchmarking methodology: repeat a baseline benchmark at least 10 times and calculate the standard deviation over the results. You can use a spreadsheet formula like the following (assuming, for example, your 10 results are in cells B2:B11): =STDEV(B2:B11)/AVERAGE(B2:B11)
Format this number as a percentage, and you’ll see how big the relative variance in your result set is. Ideally, you want this value to be less than 10 percent. If your benchmarks have larger variance, revisit your methodology. You may need to tweak factors like:
- Increase the duration of the tests.
- Eliminate variance from the environments.
- Ensure all benchmarks start in the same state (e.g., cold caches, freshly launched JVMs).
- Consider the effects of Hotspot/JITs.
- Simplify/stub components and dependencies on other microservices that add variance but aren’t key to your benchmark.
- Don’t be shy about making hacky code changes and pushing binaries you’d never ship to production.
Important: Determine how many runs you need to get the standard deviation below a good threshold, and run each of your actual benchmarks at least that many times. Otherwise, your results may be too noisy to be meaningful.
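The same variance check is easy to script. Here is a minimal sketch in Python using only the standard library; the baseline numbers are hypothetical QPS results from 10 runs:

```python
from statistics import mean, stdev

def relative_stddev(results):
    """Coefficient of variation: standard deviation as a fraction of the mean."""
    return stdev(results) / mean(results)

# hypothetical QPS results from 10 baseline runs
baseline = [1020, 995, 1010, 980, 1005, 990, 1015, 1000, 985, 1008]
cv = relative_stddev(baseline)
print(f"relative variance: {cv:.1%}")  # → relative variance: 1.3%
```

A value under 10 percent, as described above, means the methodology is stable enough to detect real differences between experiments.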
Now that you have developed a sound methodology, it’s time to gather data. Some tips include:
- Only vary one input/knob/configuration setting at a time.
- For every run of the benchmark, capture start and end time. This will help you correlate it to logs and metrics later.
- If you’re unsure whether the input will actually affect your metric, try extreme values to confirm it’s worth running a series.
- Script the execution of the benchmarks and collection of metrics.
- Interleave your benchmarks to make sure what you’re observing aren’t slow changes in your test environment. Instead of running AAAABBBBCCCC, run ABCABCABCABC.
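The interleaving tip is simple to automate in whatever script drives your runs. A minimal sketch in Python, with experiment names as placeholders:

```python
from itertools import chain

def interleave(experiments, repetitions):
    """Order runs ABCABC... rather than AAABBBCCC so slow environment
    drift affects all experiments equally instead of just the last one."""
    return list(chain.from_iterable([experiments] * repetitions))

order = interleave(["A", "B", "C"], repetitions=4)
print("".join(order))  # → ABCABCABCABC
```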
Now, you’ll need to generate enough load to be able to measure a difference. There are two different strategies for generating load:
- Redline it! In most cases, you want to ensure you’re creating enough load to saturate your component. If you can’t saturate it, how would you know whether you’ve increased its throughput? If your component falls apart at redline (e.g., OOMs, throughput collapse, or another failure spiral), understand why, and fix the problem.
- Measure machine resources. When you cannot redline the component, or you have reason to believe it behaves substantially differently under less-than-100-percent load, you may need to resort to OS metrics such as CPU utilization and IOPS to determine whether your change had an effect. Make sure your load is large enough for changes to be visible: if your load causes 3 percent CPU utilization, a 50 percent improvement in performance will be lost in the noise. Try different amounts of load and find a sweet spot where your OS metric measurement is sensitive enough.
As you execute your benchmarks and develop a better understanding of the system, you are likely to discover new factors that may impact your key metric. Add new experiments to your list and prioritize them over the previous ones if needed.
In some instances, the code may not have configuration or control knobs for the inputs you want to vary. Find the fastest way to change the input, even if it means hacking the code, commenting out sections or otherwise manipulating the code in ways that wouldn’t be “kosher” for merges into master. Remember: The goal here is to get answers as quickly as possible, not to write production-quality code—that comes later, once we have our answers.
Once you’ve completed a series of benchmarks, take a step back and think about what the data is telling you about the system you’re benchmarking. Document your insights and how the data backs them up. It may be helpful to:
- Calculate the average for each series of benchmarks you ran, and use it to calculate the percent difference between series, e.g. “when I doubled the number of threads, QPS increased by 23 percent on average.”
- Graph your results — is the relationship between your input and the performance metric linear? Logarithmic? Bell curve?
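The per-series averaging can be scripted the same way as the rest of the pipeline. A minimal sketch in Python, using hypothetical QPS numbers for a one-thread and a two-thread configuration:

```python
from statistics import mean

def percent_change(baseline_runs, experiment_runs):
    """Average each series, then report the experiment's change vs. baseline."""
    base, exp = mean(baseline_runs), mean(experiment_runs)
    return (exp - base) / base * 100

# hypothetical QPS numbers: 4 runs each for 1-thread and 2-thread configs
one_thread = [810, 798, 805, 807]
two_threads = [995, 988, 1002, 991]
print(f"QPS change: {percent_change(one_thread, two_threads):+.0f}%")
# → QPS change: +23%
```

Keeping the raw per-run numbers (not just the averages) also lets you graph each series and inspect its shape.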
Finally, you’ll need to present your insights. Here are some final tips for presenting to management and/or other engineering teams:
- Apply the Pyramid Principle. Engineers often make the mistake of explaining the methodology first, then the results, and only then the insights. It is preferable to reverse the order: start with the insight, and then, if needed or requested, explain the methodology and how the data supports the insight.
- Omit nitty-gritty details of any experiments that didn’t lead to interesting insights.
- Avoid jargon, and if you cannot, explain it. Don’t assume your audience knows the jargon.
- Make sure your graphs have meaningful, human-readable units.
- Make sure your graphs can be read when projected onto a screen or TV.