Solved – bias of peeking at AB test data and adjusting minimum detectable effect

ab-test, statistical-significance

Let's say we're running an A/B test on my website comparing blue button clicks (baseline) to green button clicks.

  • I use http://www.evanmiller.org/ab-testing/sample-size.html to calculate my required number of subjects per branch with the following parameters:

    • significance level of 5%
    • statistical power of 80%
    • an observed historical baseline conversion rate of 5%
    • a desired minimum detectable effect of 1% (i.e., conversions between 4% and 6% will be indistinguishable from the baseline)

Using the calculator, I determine that we need 7,663 pageviews per branch to declare a result.
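
For reference, here is a sketch of the arithmetic behind a number like that, using the standard normal-approximation formula for comparing two proportions. It won't reproduce Evan Miller's calculator exactly (his tool may use a slightly different approximation), but it lands in the same ballpark:

```python
# Sample size per branch for detecting a 1-point absolute lift over a 5% baseline,
# via the usual normal-approximation formula for two proportions.
from scipy.stats import norm

alpha = 0.05      # significance level
power = 0.80      # statistical power
p1 = 0.05         # baseline conversion rate
p2 = p1 + 0.01    # baseline plus the 1% (absolute) minimum detectable effect

z_a = norm.ppf(1 - alpha / 2)   # two-sided critical value, ~1.96
z_b = norm.ppf(power)           # ~0.84

p_bar = (p1 + p2) / 2
n = ((z_a * (2 * p_bar * (1 - p_bar)) ** 0.5
      + z_b * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
     / (p2 - p1) ** 2)
print(round(n))   # a bit over 8,000 per branch -- same ballpark as the 7,663 above
```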

Now let's say everyone gets impatient and decides to check in on the experiment after only 900 pageviews.

The Game Plan:

1) If it turns out that the green button is at least 3% better than baseline, we will stop the experiment and declare the green button the winner (a 3% MDE with the other initial parameters unchanged requires only 894 pageviews according to the calculator).

2) If it turns out that the green button is less than 3% better than baseline after 900 pageviews, we will keep the experiment running to its full course of 7,663 pageviews and then make a conclusion at that time.

Are we introducing bias with this Game Plan?

Best Answer

Setting your stopping condition based on the significance of your interim analyses is, in general, not a great idea. The worst possible thing you could do would be to re-run your analysis after every page view and stop as soon as you got a significant result. You've decided, by setting your $\alpha=0.05$, that you're willing to tolerate a 5 percent chance of making a Type I error (i.e., claiming there's an effect even though there actually is not). This repeated "peeking" inflates that more than 5-fold, so there is actually a 1:4 chance your effect is due to random noise, rather than a 1:20 one. That is clearly bad. By peeking only once instead, you're not doing nearly as badly, of course.
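
To see the inflation concretely, here is a quick Monte Carlo sketch (my own illustration, reusing only your 5% baseline and 7,663-per-branch sample size). Both buttons truly convert at 5%, yet running a two-proportion z-test after every batch of 100 visitors and stopping at the first "significant" result rejects far more often than the nominal 5%:

```python
# Monte Carlo illustration of Type I error inflation from repeated peeking.
# The null is true (both branches convert at 5%), so every rejection is a false positive.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
p = 0.05          # true conversion rate in BOTH branches
n_max = 7663      # planned sample size per branch
batch = 100       # peek after every 100 visitors per branch
z_crit = norm.ppf(0.975)

def peeking_rejects():
    # Running conversion counts for each branch.
    blue = (rng.random(n_max) < p).cumsum()
    green = (rng.random(n_max) < p).cumsum()
    for n in range(batch, n_max + 1, batch):
        ca, cb = blue[n - 1], green[n - 1]
        pooled = (ca + cb) / (2 * n)
        se = np.sqrt(2 * pooled * (1 - pooled) / n)
        if se > 0 and abs(cb - ca) / n / se > z_crit:
            return True               # "significant" at this peek -> stop early
    return False

sims = 2000
print(sum(peeking_rejects() for _ in range(sims)) / sims)  # well above 0.05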

This problem has been studied extensively, largely under the name of "Interim Monitoring" or "Sequential Experimental Design" and comes up a lot in the design of clinical trials. This review, by Jennison and Turnbull, covers a bunch of approaches. It looks like your ad-hoc idea is pretty close to the "stochastic curtailment" approach, so that and the references therein might point you in the right direction if you want to do this completely correctly.
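
If you want a concrete number for your particular Game Plan, the simplest check is to simulate it under the null. The sketch below is my own, and it assumes (a) the 900 and 7,663 figures are per branch, and (b) "make a conclusion" at the full sample size means an ordinary two-sided two-proportion z-test at $\alpha=0.05$; whatever it prints is an estimate of the plan's actual Type I error, to compare against the nominal 5%.

```python
# Estimate the Game Plan's true false-positive rate when there is no real effect.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
p = 0.05                    # both buttons truly convert at 5%
n_interim, n_final = 900, 7663
z_crit = norm.ppf(0.975)

def plan_declares_a_winner():
    blue = rng.random(n_final) < p
    green = rng.random(n_final) < p
    # Interim look: stop and declare green the winner if it leads by >= 3 points.
    if green[:n_interim].mean() - blue[:n_interim].mean() >= 0.03:
        return True
    # Otherwise run to the full sample and do a standard two-sided z-test.
    pb, pg = blue.mean(), green.mean()
    pooled = (pb + pg) / 2
    se = np.sqrt(2 * pooled * (1 - pooled) / n_final)
    return abs(pg - pb) / se > z_crit

sims = 10000
print(sum(plan_declares_a_winner() for _ in range(sims)) / sims)  # vs. nominal 0.05
```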