Avoid Early Stopping in A/B Tests
A while ago, my colleague YinYin Yu mentioned that early stopping can inflate your false positive rate. That gave me the idea to write a simulation demonstrating how this happens, and to dig further into the subject. This notebook is my attempt to document what I’ve learned.
When running an A/B test (experiment), you should outline your testing process before running the experiment, and not change that process in response to what happens during the test. Changing your testing process mid-experiment can inflate your False Positive Rate (FPR): the probability of finding a significant result (rejecting the null hypothesis) when there is no true effect. This makes you believe, more often than you should, that there are real effects when there are none!
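To make the FPR concrete, here is a minimal sketch of the honest baseline, assuming normally distributed data, a two-sample t-test, and hypothetical sample sizes (none of these specifics come from the post itself): if you analyze each experiment exactly once at a pre-specified sample size, A/A tests (where the null is true) come up significant at roughly the rate alpha you chose.

```python
# A minimal sketch of the baseline, assuming normal data and a two-sample
# t-test. We simulate many A/A tests (no true effect), analyze each one
# exactly once at a fixed sample size, and measure how often p < alpha.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05
n_experiments = 5_000  # hypothetical number of simulated A/A tests
n_per_group = 1_000    # hypothetical fixed sample size per group

false_positives = 0
for _ in range(n_experiments):
    # Both groups are drawn from the same distribution: the null is true.
    a = rng.normal(size=n_per_group)
    b = rng.normal(size=n_per_group)
    _, p_value = stats.ttest_ind(a, b)
    if p_value < alpha:
        false_positives += 1

print(f"FPR with one fixed-horizon analysis: {false_positives / n_experiments:.3f}")
# Should come out close to alpha = 0.05.
```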
One example of the above is early or optional stopping. Roughly speaking, this is where you:
1. Start the experiment.
2. Repeatedly check the p-value for significance as the results come in (also known as peeking).
3. Based on the result of (2), decide whether to let the experiment continue:
   a. If the p-value is not significant, let the experiment run longer to collect more data.
   b. If the p-value is significant, stop the experiment and declare a successful outcome.
The decision in step 3(b) is what inflates the false positive rate, corrupting the results of your experiments: you will find statistically significant results at a higher rate than your significance level implies, even when there is no real effect.
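To see the inflation, here is a sketch of the same A/A simulation modified to peek: after each batch of new observations, it re-runs the t-test and stops at the first significant result, as in step 3(b). The batch size and number of peeks are illustrative assumptions, not values from the post.

```python
# A minimal sketch of early stopping under peeking, with the same assumed
# setup as above: A/A tests with no true effect, but now we peek at the
# p-value after every batch of new data and stop as soon as it looks
# significant (step 3b).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha = 0.05
n_experiments = 2_000  # hypothetical number of simulated A/A tests
batch_size = 100       # hypothetical observations added per group between peeks
n_peeks = 20           # hypothetical number of interim looks

false_positives = 0
for _ in range(n_experiments):
    a = np.empty(0)
    b = np.empty(0)
    for _ in range(n_peeks):
        # Collect another batch of data; there is still no true effect.
        a = np.concatenate([a, rng.normal(size=batch_size)])
        b = np.concatenate([b, rng.normal(size=batch_size)])
        _, p_value = stats.ttest_ind(a, b)
        if p_value < alpha:
            # Step 3(b): stop early and declare a "significant" result.
            false_positives += 1
            break

print(f"FPR with early stopping after peeking: {false_positives / n_experiments:.3f}")
# Prints a rate well above the nominal alpha = 0.05.
```

The intuition: each peek is another chance for noise alone to cross the significance threshold, and the stop-at-first-significance rule locks in every such crossing, so the overall FPR climbs well above the nominal 5%.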
Continues as a Jupyter Notebook…