Solved – What’s the “best” way to calculate sample size for A/B tests

ab-test, sample-size

I've read several seemingly conflicting accounts on the best way to calculate sample size. Visual Website Optimizer (VWO) has a lengthy article on this topic. So does Evan Miller. And so does Optimizely.

Using the various tools to estimate sample size with the following settings:

  • Baseline Conversion Rate: 3%
  • Minimum Detectable Effect: 20%
  • Significance: 95%
  • Variations: 2

I get the following from the various calculators:

  • VWO (have to set "daily visitors" to 1 to get exact sample size): 25,867
  • Evan Miller (set to relative, stat. power 80%): 13,050
  • Optimizely: 13,000

Given the seemingly different methods of calculation, which one is the "best" to use? I'm trying to understand how to approach this issue of sample size. Thanks!

(I had to list links here because I need more points to post more than 2 inline links)
References:

Articles:

  1. vwo.com/blog/how-to-calculate-ab-test-sample-size/
  2. www.evanmiller.org/how-not-to-run-an-ab-test.html
  3. help.optimizely.com/hc/en-us/articles/200133789-How-long-to-run-a-test

Calculators:

  1. vwo.com/ab-split-test-duration/
  2. www.evanmiller.org/ab-testing/sample-size.html
  3. www.optimizely.com/resources/sample-size-calculator/?conversion=3&effect=20&significance=95

Best Answer

There is no single "best" calculator, because each one builds in specific assumptions about the testing methodology.

Evan Miller's calculator computes the sample size for a two-tailed test. Optimizely's calculator used to compute it for a one-tailed test. Optimizely now uses a Bayesian stats engine, and because of how that engine is constructed, their sample size calculator has no input for power.

For VWO, you can back into the sample size per variation by multiplying the daily traffic by the number of days the test will run and dividing by the number of variations. With your numbers that is 25,867 / 2, or about 12,934 per variation, which is close to Evan Miller's 13,050. The results suggest that VWO, like Evan Miller's calculator, is computing a generic sample size for a two-tailed hypothesis.
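To make the effect of the tail assumption concrete, here is a minimal Python sketch based on a textbook normal-approximation formula for comparing two proportions. It is not taken from the source code of any of the calculators mentioned; the function names `sample_size_per_variation` and `per_variation_from_duration`, and the particular variance terms used, are assumptions made for illustration.

```python
from math import ceil, sqrt
from statistics import NormalDist  # standard library, Python 3.8+


def sample_size_per_variation(baseline, relative_mde, alpha=0.05,
                              power=0.80, two_tailed=True):
    """Approximate visitors needed in EACH variation.

    Assumption: normal approximation for a two-proportion z-test,
    with the minimum detectable effect given relative to the baseline.
    """
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    delta = p2 - p1

    # A two-tailed test splits alpha across both tails; one-tailed does not.
    tail_alpha = alpha / 2 if two_tailed else alpha
    z_alpha = NormalDist().inv_cdf(1 - tail_alpha)
    z_beta = NormalDist().inv_cdf(power)

    sd_null = sqrt(2 * p1 * (1 - p1))               # spread if there is no effect
    sd_alt = sqrt(p1 * (1 - p1) + p2 * (1 - p2))    # spread if the effect is real
    return ceil((z_alpha * sd_null + z_beta * sd_alt) ** 2 / delta ** 2)


def per_variation_from_duration(daily_visitors, days, variations):
    """Back into a per-variation sample size from a test-duration result."""
    return daily_visitors * days / variations


# Settings from the question: 3% baseline, 20% relative effect, 95% significance.
n_two = sample_size_per_variation(0.03, 0.20, two_tailed=True)
n_one = sample_size_per_variation(0.03, 0.20, two_tailed=False)
print("two-tailed, per variation:", n_two)
print("one-tailed, per variation:", n_one)
print("VWO total split over 2 variations:",
      per_variation_from_duration(1, 25867, 2))
```

Under these assumptions the two-tailed figure lands within a visitor or so of 13,050, while the one-tailed figure is noticeably smaller at roughly 10,300. A gap of that size comes purely from the choice of hypothesis, not from one calculator being more correct than another.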