Solved – Determining sample sizes for an A/B test using z-test and chi-square test

chi-squared-test, hypothesis testing, python, statistical-power, z-test

Assumptions

I consider an A/B test where there is a control group and a variant group. Each observation can either be true (converted) or false (not converted).
I evenly and randomly split the incoming users to the two treatments.
So, the results can be summarized in a contingency table:

|       | Converted | Not converted |
|-------|-----------|---------------|
|Control|           |               |
|Variant|           |               |

Let the conversion rate be Converted / (Converted + Not Converted).
The null hypothesis is that the conversion rate is independent of the treatment.

It seems like in this case, I can use either the two-tailed $z$-test or the $\chi^2$-test. Feel free to correct me on this one.
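For a 2x2 table like the one above, the pooled two-proportion $z$-test and the Pearson $\chi^2$-test without continuity correction give the same answer, because $z^2 = \chi^2$. The following is a minimal sketch of that identity; the counts (100 and 130 conversions out of 1000 users each) and the function names are made up for illustration.

```python
from math import sqrt


def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Pooled two-proportion z statistic (variant minus control)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se


def pearson_chi2(conv_a, n_a, conv_b, n_b):
    """Pearson chi-square statistic for the 2x2 table, no continuity correction."""
    observed = [[conv_a, n_a - conv_a], [conv_b, n_b - conv_b]]
    row_totals = [n_a, n_b]
    col_totals = [conv_a + conv_b, (n_a - conv_a) + (n_b - conv_b)]
    total = n_a + n_b
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed[i][j] - expected) ** 2 / expected
    return chi2


# Hypothetical counts: control 100/1000 converted, variant 130/1000 converted.
z = two_proportion_z(100, 1000, 130, 1000)
chi2 = pearson_chi2(100, 1000, 130, 1000)
print(z ** 2, chi2)  # the two values agree up to floating-point error
```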

Determining the sample size

I want to use statsmodels.stats.power.GofChisquarePower.solve_power and statsmodels.stats.power.NormalIndPower.solve_power.
For example:

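```python
import statsmodels.stats.power as power

zpower = power.NormalIndPower()        # power of a two-sample z-test
chipower = power.GofChisquarePower()   # power of a one-sample chi-square goodness-of-fit test
# Effect size 0.1, 5% significance level, 90% power; solve for the sample size.
zpower.solve_power(0.1, nobs1=None, alpha=0.05, power=0.9, ratio=1.)  # Returns ~2100
chipower.solve_power(0.1, nobs=None, alpha=0.05, power=0.9)  # Returns ~1050
```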

Question: I am puzzled by the huge difference. What is the reason for it? Am I using something incorrectly, given my assumptions?

N.B. I now realize that the documentation states that GofChisquarePower.solve_power "(solves) for any one parameter of the power of a one sample chisquare-test" and NormalIndPower.solve_power "(solves) for any one parameter of the power of a two sample z-test".

What is the difference between a one-sample and a two-sample test?

Best Answer

Simply put:

A one-sample test compares a sample mean ($\mu_0$) against a known population mean ($\mu$). Think about testing the height of a sample of females against the average female height according to the latest census.
A two-sample test compares a sample mean from one group ($\mu_1$) against a sample mean from another, independent group ($\mu_2$).

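This is why the required sample size for a two-sample test is double the required sample size for a one-sample test: the difference of two independent sample means has twice the variance of a single sample mean, so each group needs twice as many observations to reach the same power. Note that nobs1 in NormalIndPower is the size of the first group only. A/B tests should use a two-sample test.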
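The factor of two can be checked with the usual normal-approximation formula, without statsmodels. This is a minimal sketch: the helper name `z_test_sample_size` is made up for illustration, and the formula ignores the negligible contribution of the far tail, so it only approximates what `solve_power` returns.

```python
from statistics import NormalDist


def z_test_sample_size(effect_size, alpha=0.05, power=0.9, two_sample=True):
    """Approximate sample size (per group when two_sample=True) for a two-sided z-test."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value of the two-sided test
    z_beta = NormalDist().inv_cdf(power)           # quantile for the desired power
    n = ((z_alpha + z_beta) / effect_size) ** 2
    # The variance of a difference of two independent means is doubled,
    # so each group needs twice as many observations.
    return 2 * n if two_sample else n


one_sample_n = z_test_sample_size(0.1, two_sample=False)
two_sample_n = z_test_sample_size(0.1, two_sample=True)
print(round(one_sample_n), round(two_sample_n))
```

With the question's inputs (effect size 0.1, alpha 0.05, power 0.9) this gives roughly 1050 and 2100, matching the two figures quoted in the question.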

For more information, there are a lot of resources on this site and online related to one sample vs two sample tests.

Related Solutions

A/B Testing – How to Safely Determine Sample Size for A/B Testing

The most common method for doing this kind of testing is with binomial proportion confidence intervals (see http://bit.ly/fa2K7B).

You will never know the "true" conversion rate of the two paths, but this will give you the ability to say something to the effect of "With 99% confidence, A is more effective at converting than B".

For example, let's assume that you have run 1000 trials down path A. Of these 1000 trials, 121 were successful conversions (an observed conversion rate of $\hat p = 0.121$), and we would like a 99% confidence interval around this 0.121 result. The z-score for a 99% confidence interval is 2.576 (you just look this up in a table), so according to the formula:

$$ \begin{aligned} \hat p &\pm 2.576\sqrt{\frac{0.121 \times (1 - 0.121)}{1000}} \\ 0.121 &\pm 0.027 \end{aligned} $$

So with 99% confidence we can say that $0.094 \le p \le 0.148$, where $p$ is the "true" conversion rate of process A.
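The interval above can be reproduced in a few lines. This is a minimal sketch of the normal-approximation (Wald) interval; the function name `wald_interval` is made up for illustration, and the z-score is computed rather than looked up in a table.

```python
from math import sqrt
from statistics import NormalDist


def wald_interval(successes, trials, confidence=0.99):
    """Normal-approximation confidence interval for a binomial proportion."""
    p_hat = successes / trials
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 2.576 for 99%
    margin = z * sqrt(p_hat * (1 - p_hat) / trials)
    return p_hat - margin, p_hat + margin


# Path A from the example: 121 conversions in 1000 trials.
low, high = wald_interval(121, 1000, confidence=0.99)
print(round(low, 3), round(high, 3))
```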

If we construct a similar interval for process B, we can compare the intervals. If the intervals don't overlap, then we can say with about 98% confidence that one is better than the other. (Remember, we're only 99% confident about each interval, so our overall confidence about the comparison is 0.99 * 0.99 ≈ 0.98.)

If the intervals do overlap, then we have to run more trials, or decide that they are too similar in performance to distinguish. This brings us to the tricky part: determining $N$, the number of trials. I'm not familiar with other methods, but with this method, you aren't going to be able to determine $N$ up front unless you have an accurate estimate of the performance of both A and B up front. Otherwise, you are just going to have to run trials until you get enough samples for the intervals to separate.

Best of luck to you. (I'm rooting for process B, by the way).

Solved – calculating necessary sample size

For a quick calculation, one can use the following simplified formula:

sample size = 16 * p * (100-p) / (d ^ 2)

where p = baseline proportion in percent

and d = absolute percent difference

If p=4.29 and d=5.43-4.29=1.14

sample size = 5055

This is very close to the result of more exact calculations using the proper formulae.
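The simplified formula is easy to script. This is a minimal sketch; the function name `rule_of_thumb_sample_size` is made up for illustration. As far as I know, the constant 16 corresponds to a two-sided 5% significance level and 80% power, since $2(z_{0.975} + z_{0.80})^2 \approx 15.7$, and the result is the sample size per group.

```python
from statistics import NormalDist


def rule_of_thumb_sample_size(p, d):
    """Approximate per-group sample size; p and d are given in percent."""
    return 16 * p * (100 - p) / d ** 2


# Where the 16 comes from (assumption: alpha = 0.05 two-sided, power = 0.80).
z_alpha = NormalDist().inv_cdf(0.975)
z_beta = NormalDist().inv_cdf(0.80)
constant = 2 * (z_alpha + z_beta) ** 2  # about 15.7, rounded up to 16

# Values from the example: baseline 4.29%, target 5.43%.
n = rule_of_thumb_sample_size(p=4.29, d=5.43 - 4.29)
print(round(n), round(constant, 1))
```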

Also, if you feed the above p and d into https://www.evanmiller.org/ab-testing/sample-size.html

you get a sample size of 5,142, which is also close and consistent.

On https://www.optimizely.com/sample-size-calculator/?conversion=4.29&effect=26.6&significance=95 you have to feed in the relative percent difference, i.e. (1.14/4.29)*100 = 26.6%. With these values you get a sample size of 4500, which is not close, for reasons unclear to me.