The Good, The Bad and The Testable: How to Give Your Experiment the Best Foundation

Whether you’re constructing a house, a new business or a mobile app, a solid foundation is critical. It gives you a strong base from which to build, and it can have major implications for the life of your project further down the road. The same goes for the experimentation process.

Before beginning a software experiment, it’s important to create a strong foundation by investing time in the experimental-design phase. A well-designed experiment will have a clear and testable hypothesis, a set run time calculated via a power analysis and a predetermined plan for the analysis stage. Two key components of planning the analysis are deciding which metrics you’ll measure and which statistical test you’ll use to determine whether any changes are statistically significant. Unfortunately, this is easier said than done, and with the wrong statistical test or metric, your experiment is destined to crumble.
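As a rough illustration of how a power analysis turns into a run time, here is a minimal Python sketch. It assumes a conversion-rate metric, a two-sided test, the standard normal-approximation formula for comparing two proportions, and traffic split evenly between arms. The function names and the example numbers (10% baseline, one-point absolute lift, 2,000 eligible users per day) are hypothetical and chosen only for illustration.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(baseline_rate, min_detectable_lift, alpha=0.05, power=0.8):
    """Approximate users needed per arm to detect an absolute lift in a rate.

    Uses the normal approximation for a two-sided comparison of two proportions.
    """
    p1 = baseline_rate
    p2 = baseline_rate + min_detectable_lift
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_beta = NormalDist().inv_cdf(power)           # critical value for the desired power
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance_sum / (p2 - p1) ** 2)


def run_time_days(users_per_arm, daily_users, arms=2):
    """Days needed if every eligible user enters the experiment (an assumption)."""
    return ceil(users_per_arm * arms / daily_users)


# Hypothetical inputs: 10% baseline conversion, detect a 1-point absolute lift.
needed = sample_size_per_arm(baseline_rate=0.10, min_detectable_lift=0.01)
print(needed, "users per arm,", run_time_days(needed, daily_users=2000), "days")
```

Halving the lift you want to detect roughly quadruples the required sample, which is why the minimum effect of interest is worth agreeing on before the experiment starts.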

How Do You Spot a Bad Metric or Wrong Test?

Oftentimes, people don’t realize they’re using a bad metric or a wrong test. Of the two, a bad metric is easier to spot. First, ask yourself, “What does it mean if this metric increases?” With that answer, you should be able to easily explain whether it is a good or bad result and/or which aspect of user behavior or product performance has changed and in what way. If there are multiple different ways to interpret a change to your metric, then it is probably not a good one to use.

Determining if you’re using the wrong statistical test is harder. There are many different types of tests that are appropriate in different scenarios. For example, whilst a chi-squared test may be appropriate for a proportions metric (e.g. percentage of unique users who convert), this test would not give accurate results if applied to a non-proportions metric such as page load time. Instead, for metrics that are means of real-valued observations, a t-test is often the correct one to apply to your results.
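To make the distinction concrete, the sketch below applies one test per metric type using only the Python standard library. It is an illustration under stated assumptions rather than a recommended implementation: the chi-squared test is the plain Pearson version for a 2x2 table without continuity correction, and the test on means is a Welch-style statistic whose p-value uses a normal approximation in place of the t-distribution, which is only reasonable for large samples. In practice a statistics library such as SciPy would usually be used. All input numbers are made up.

```python
from math import erfc, sqrt
from statistics import NormalDist, mean, variance


def chi_squared_2x2(conv_a, n_a, conv_b, n_b):
    """Pearson chi-squared test for a proportions metric (converted vs. not)."""
    observed = [[conv_a, n_a - conv_a], [conv_b, n_b - conv_b]]
    row_totals = [n_a, n_b]
    col_totals = [conv_a + conv_b, (n_a - conv_a) + (n_b - conv_b)]
    total = n_a + n_b
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed[i][j] - expected) ** 2 / expected
    # Survival function of a chi-squared distribution with 1 degree of freedom.
    p_value = erfc(sqrt(stat / 2))
    return stat, p_value


def welch_test_large_sample(sample_a, sample_b):
    """Welch-style test for a difference in means (normal approximation)."""
    se = sqrt(variance(sample_a) / len(sample_a) + variance(sample_b) / len(sample_b))
    z = (mean(sample_b) - mean(sample_a)) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Proportions metric: unique users who converted (hypothetical counts).
print(chi_squared_2x2(conv_a=100, n_a=1000, conv_b=150, n_b=1000))

# Real-valued metric: page load times in seconds (hypothetical, tiny sample
# shown for readability; the normal approximation needs far more data).
control = [1.2, 1.5, 1.1, 1.9, 1.4, 1.3, 1.6, 1.2]
variant = [1.1, 1.3, 1.0, 1.6, 1.2, 1.2, 1.4, 1.1]
print(welch_test_large_sample(control, variant))
```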

The most important thing to keep in mind is that the metric should satisfy all the assumptions of the test you are applying to it. For example, both the t-test and the chi-squared test assume the observations are independent; the t-test additionally assumes the noise in the data is approximately normally distributed (or that the sample is large enough for the mean to be), while the chi-squared test needs large enough counts in each cell for its approximation to hold. One way to check this is with AA tests, experiments in which there is no difference at all between the two treatments. If you run a series of AA tests applying your statistical test to the metric, the p-values should be uniformly distributed, and the fraction of the AA tests which appear statistically significant should be roughly equal to your false positive rate, e.g. 5% if using a p-value threshold of 0.05.
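The AA check can be sketched as a small simulation. The example below assumes a simple conversion metric with one independent observation per user and uses a two-proportion z-test, which for a 2x2 table is equivalent to the chi-squared test without continuity correction. The traffic size, conversion rate and seed are arbitrary, and real AA tests would of course use your own data rather than simulated users.

```python
import random
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a difference between two conversion rates."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def run_aa_tests(n_tests=1000, users_per_arm=1000, true_rate=0.1, seed=42):
    """Simulate AA tests: both arms share the same true conversion rate."""
    rng = random.Random(seed)
    p_values = []
    for _ in range(n_tests):
        conv_a = sum(rng.random() < true_rate for _ in range(users_per_arm))
        conv_b = sum(rng.random() < true_rate for _ in range(users_per_arm))
        p_values.append(two_proportion_p_value(conv_a, users_per_arm, conv_b, users_per_arm))
    return p_values


p_values = run_aa_tests()

# Fraction of AA tests that look "significant": should be close to 0.05.
false_positive_rate = sum(p < 0.05 for p in p_values) / len(p_values)

# Rough uniformity check: each decile should hold about 10% of the p-values.
decile_counts = [0] * 10
for p in p_values:
    decile_counts[min(int(p * 10), 9)] += 1

print("false positive rate:", false_positive_rate)
print("p-value decile counts:", decile_counts)
```

If the false positive rate sits well above the threshold, or the deciles are clearly lopsided, that is a sign the metric violates an assumption of the test.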

Consequences of a Bad Metric or Wrong Test

Without the right metrics and test, your experiment is bound to run into trouble. Specifically, using a bad metric can make it very difficult to interpret the results of your experiment, or in the worst cases, it can lead you to unknowingly draw incorrect conclusions. It can also mean you get less value and fewer learnings out of an experiment than you could have.

Similarly, using a statistical test which is not appropriate for the metric you’re applying it to can also cause you to draw incorrect conclusions. In this case, it can lead to a much higher false positive rate than you intended when you set the significance level. This can leave you thinking your experiment has had an impact when what you are seeing is just normal noise in the data.

What Is a Good Metric?

Ideally, a good metric will be interpretable, meaningful, sensitive and fit for the statistical test you’re applying to it. Interpretable means that if the metric changes, you should easily be able to determine whether the change is positive or negative and what it means in terms of user behavior or system performance. A meaningful metric directly measures something which you care about in your experiment. This could be something that’s a good proxy for business value or customer satisfaction. Sensitive metrics are able to detect smaller changes for a given amount of traffic. For example, slow-moving metrics such as retention or high-variance metrics such as revenue, whilst being important metrics, are not inherently sensitive and hence might not be the best choice for your experiment’s primary metric.
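One way to put a number on sensitivity is the minimum detectable effect: the smallest true difference an experiment of a given size is likely to pick up. The Python sketch below uses the usual normal-approximation formula for comparing two means with equal variance and equal arm sizes. The metric means and standard deviations are hypothetical, picked only to contrast a low-variance metric with a high-variance one.

```python
from math import sqrt
from statistics import NormalDist


def minimum_detectable_effect(std_dev, users_per_arm, alpha=0.05, power=0.8):
    """Smallest absolute difference in means detectable with the given power."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return (z_alpha + z_beta) * std_dev * sqrt(2 / users_per_arm)


users = 10_000

# Hypothetical conversion metric: mean 0.10, standard deviation 0.30.
conversion_relative_mde = minimum_detectable_effect(0.30, users) / 0.10

# Hypothetical revenue-per-user metric: mean 5.0, standard deviation 50.0.
revenue_relative_mde = minimum_detectable_effect(50.0, users) / 5.0

print(f"conversion: {conversion_relative_mde:.1%} relative change detectable")
print(f"revenue:    {revenue_relative_mde:.1%} relative change detectable")
```

With the same traffic, the high-variance metric can only detect a much larger relative change, which is the sense in which it is less sensitive.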

Finally, fit for the test means that, as previously mentioned, the metric should satisfy all the assumptions of the statistical test you will use. For example, metrics where the denominator is not equal to the randomization unit (e.g. per-session metrics when you are randomizing on users) would not give independent observations and hence are not fit for use in a standard statistical test such as a t-test or chi-squared test.
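The per-session example can be demonstrated with simulated AA tests. In the sketch below, users are the randomization unit, each user has their own conversion propensity and several sessions, and the same data is analysed two ways: per session (wrongly treating sessions as independent) and per user (one value per randomization unit). The user model, with propensities of 0.05 or 0.5 and 1 to 10 sessions, is a deliberately exaggerated assumption made for illustration, and the p-values use a normal approximation.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, variance


def two_sided_p(z):
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate_arm(rng, n_users):
    """Return one list of 0/1 session outcomes per user."""
    users = []
    for _ in range(n_users):
        propensity = rng.choice([0.05, 0.5])  # hypothetical per-user conversion rate
        n_sessions = rng.randint(1, 10)
        users.append([1 if rng.random() < propensity else 0 for _ in range(n_sessions)])
    return users


def session_level_p(arm_a, arm_b):
    """Two-proportion z-test that treats every session as independent."""
    sess_a = [s for user in arm_a for s in user]
    sess_b = [s for user in arm_b for s in user]
    pooled = (sum(sess_a) + sum(sess_b)) / (len(sess_a) + len(sess_b))
    se = sqrt(pooled * (1 - pooled) * (1 / len(sess_a) + 1 / len(sess_b)))
    if se == 0:
        return 1.0
    return two_sided_p((sum(sess_b) / len(sess_b) - sum(sess_a) / len(sess_a)) / se)


def user_level_p(arm_a, arm_b):
    """Welch-style test on converted sessions per user (one value per user)."""
    per_user_a = [sum(user) for user in arm_a]
    per_user_b = [sum(user) for user in arm_b]
    se = sqrt(variance(per_user_a) / len(per_user_a) + variance(per_user_b) / len(per_user_b))
    if se == 0:
        return 1.0
    return two_sided_p((mean(per_user_b) - mean(per_user_a)) / se)


def false_positive_rates(n_sims=500, n_users=200, alpha=0.05, seed=0):
    """Run AA tests and return (session-level rate, user-level rate)."""
    rng = random.Random(seed)
    session_hits = user_hits = 0
    for _ in range(n_sims):
        arm_a, arm_b = simulate_arm(rng, n_users), simulate_arm(rng, n_users)
        session_hits += session_level_p(arm_a, arm_b) < alpha
        user_hits += user_level_p(arm_a, arm_b) < alpha
    return session_hits / n_sims, user_hits / n_sims


session_fpr, user_fpr = false_positive_rates()
print("session-level false positive rate:", session_fpr)
print("user-level false positive rate:   ", user_fpr)
```

Under these assumptions the session-level analysis flags far more than 5% of AA tests as significant, while the user-level analysis stays close to the intended rate. Note that aggregating to one value per user also changes what is being measured, from a session conversion rate to converted sessions per user.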

Benefits of a Good Metric

A good metric is crucial to any successful experiment. It allows you to get the most information and learnings out of an experiment, and it makes experiments easier to interpret as well as more efficient to run. Using a good metric and an appropriate test will also reduce the risk of inflated false positive rates and of being misled by the data, which provides the biggest benefit of all: saving time and money.

— Lizzie Eardley