Solved – Sample size for a very skewed A/B Test

ab-test, p-value, sample-size, skewness

I would like to perform an A/B test on my website. I know how to carry out a basic statistical test, but I'm not sure how to choose the sample size. In particular, suppose the event has a very low conversion rate before the campaign, say p = 1/10^6. To apply the C.L.T. to p for a given confidence level, I guess I need a much larger sample size (for both the control sample and the treatment sample). How can I find the best sample size, given such a skewed distribution?

Best Answer

If one wants to perform an A/B test with a small base rate, and not just for fun, one has to ask which effect size, i.e. which absolute improvement, is considered worth the effort.

For example, if p = 1/10^6 and the site has 10^6 visitors per month, the baseline is one conversion per month on average, so even raising the rate to 500 % of its baseline value means an absolute improvement of only 4 more conversions per month on average. Unless such differences can be justified with monetary arguments (e.g. the website is selling trips to space), an A/B test is not worth the trouble.
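The arithmetic behind this example can be checked with a few lines of Python. This is only a sketch: it reads the 500 % figure as "the rate becomes five times the baseline", which is an assumption about the intended meaning, and the variable names are made up for illustration.

```python
from fractions import Fraction

# Numbers from the example; exact fractions avoid floating-point noise.
baseline_rate = Fraction(1, 10**6)
visitors_per_month = 10**6

# Assumption: the improved rate is five times the baseline rate.
improved_rate = 5 * baseline_rate

baseline_conversions = baseline_rate * visitors_per_month
improved_conversions = improved_rate * visitors_per_month
extra_conversions = improved_conversions - baseline_conversions

print(baseline_conversions, improved_conversions, extra_conversions)
```

Under that reading, the expected gain is four conversions per month, which is the quantity that has to be weighed against the cost of running the test.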

However, if such differences are considered worth the effort, I suggest decomposing the conversion rate into its contributing factors. For example, let's say that one measures conversions as $\frac{boughtSpaceTrips}{siteVisitors}$. This rate can be split into

$\frac{boughtSpaceTrips}{siteVisitors} = \frac{boughtSpaceTrips}{spaceTripsInBasket} \cdot \frac{spaceTripsInBasket}{siteVisitors}$

This decomposition may allow one to detect differences in one of the decomposed ratios that do not show up in the composed ratio, either because they are offset by the other ratios (negative correlation) or because their contribution is so small that detecting them would require a ton of data. Whether the decomposed factors are negatively correlated can be judged with domain knowledge, for example by asking how much it "costs" the user to perform a certain action.

In the given constructed example, the reasoning

Improve $\frac{boughtSpaceTrips}{spaceTripsInBasket}$ => Improve $\frac{boughtSpaceTrips}{siteVisitors}$

is valid, but the other way around

Improve $\frac{spaceTripsInBasket}{siteVisitors}$ => Improve $\frac{boughtSpaceTrips}{siteVisitors}$

is not.
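The second, invalid direction can be illustrated with made-up numbers. In the sketch below, all counts and the helper `funnel_rates` are hypothetical: the variant doubles the share of visitors who put a trip in the basket, yet the overall conversion rate does not move, because the purchase rate per basket halves.

```python
from fractions import Fraction

def funnel_rates(visitors, baskets, purchases):
    """Return (basket rate, purchase-per-basket rate, overall rate)."""
    basket_rate = Fraction(baskets, visitors)
    purchase_rate = Fraction(purchases, baskets)
    overall_rate = Fraction(purchases, visitors)
    # The overall rate is the product of the two decomposed ratios.
    assert overall_rate == purchase_rate * basket_rate
    return basket_rate, purchase_rate, overall_rate

# Hypothetical counts, chosen only to show the effect.
control = funnel_rates(visitors=1_000_000, baskets=1_000, purchases=10)
variant = funnel_rates(visitors=1_000_000, baskets=2_000, purchases=10)

print("control:", control)
print("variant:", variant)
```

Here the change in $\frac{spaceTripsInBasket}{siteVisitors}$ is large and easy to see, while $\frac{boughtSpaceTrips}{siteVisitors}$ is identical in both arms, which is the kind of offsetting that domain knowledge has to rule in or out.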

If the decomposition does not lead to more workable base rates, take a look at the statistical literature for this kind of problem (keyword: "rare events"). But at that point you are beyond the scope of normal A/B tests, so I would ask again whether this is worth the effort. As an aside, my intuition tells me that one cannot get around the fundamentals: rare events still require a lot of data (though maybe not a ton), no matter which fancy method is applied (domain knowledge may help a lot, though).
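To get a feeling for why rare events need so much data, one can compute how likely it is that an arm of the test sees no conversion at all. The sketch below assumes a simple binomial model (independent visitors, each converting with the same probability p), which is not stated in the answer; the function name is made up.

```python
import math

def prob_zero_conversions(p, n):
    """Probability of observing no conversion among n visitors (binomial model)."""
    return (1 - p) ** n

p = 1e-6  # base rate from the question
for n in (10**5, 10**6, 10**7):
    exact = prob_zero_conversions(p, n)
    poisson = math.exp(-n * p)  # Poisson approximation for small p
    print(f"n={n:>10,}  P(no conversions) exact={exact:.4f}  Poisson approx={poisson:.4f}")
```

With a million visitors per arm, the chance of seeing zero conversions is still roughly exp(-1), so the observed rate is far from normally distributed at that sample size.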

Related Solutions

Solved – What’s the “best” way to calculate sample size for A/B tests

There is no single best method, because each one rests on specific assumptions about the testing methodology. Evan Miller's calculator computes the sample size for a two-tailed test. In the past, Optimizely's calculator computed sample sizes for a one-tailed test. Currently, Optimizely uses a Bayesian stats engine, and because of how that engine is constructed, its sample size calculator has no input for power. In the VWO calculator, you can back out the sample size for each variation as daily traffic * number of days the test will run / number of variations. The results seem to imply that VWO also calculates the sample size generically, like Evan Miller's calculator, for a two-tailed hypothesis.
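How much the one-tailed versus two-tailed choice matters can be sketched with a textbook normal-approximation formula for comparing two proportions. This is not the formula used by any of the calculators named above; the function name, the default significance level of 5 %, the default power of 80 %, and the example rates are all assumptions made for illustration.

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size_per_group(p1, p2, alpha=0.05, power=0.80, two_tailed=True):
    """Approximate per-group sample size for detecting p1 vs p2 (normal approximation)."""
    z = NormalDist().inv_cdf
    # A two-tailed test splits alpha across both tails, a one-tailed test does not.
    z_alpha = z(1 - alpha / 2) if two_tailed else z(1 - alpha)
    z_beta = z(power)
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p1 - p2) ** 2)

# Hypothetical example: baseline 4.29 %, target 5.43 %.
two_sided = sample_size_per_group(0.0429, 0.0543, two_tailed=True)
one_sided = sample_size_per_group(0.0429, 0.0543, two_tailed=False)
print("two-tailed:", two_sided, "one-tailed:", one_sided)
```

With everything else equal, the one-tailed variant asks for noticeably fewer visitors per group, which is one reason two calculators can disagree on the same inputs.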

Solved – calculating necessary sample size

For a quick calculation, one can use the following simplified formula:

sample size = 16 * p * (100-p) / (d ^ 2)

where p = baseline proportion in percent

and d = absolute percent difference

If p = 4.29 and d = 5.43 - 4.29 = 1.14, then

sample size = 5055 (per group)

which is very close to the result of more exact formulae.

Also, if you feed the above p and d into https://www.evanmiller.org/ab-testing/sample-size.html

you get a sample size of 5,142, which is also close and consistent.

On https://www.optimizely.com/sample-size-calculator/?conversion=4.29&effect=26.6&significance=95 you have to enter the relative percent difference, i.e. (1.14/4.29)*100 = 26.6%. With these values you get a sample size of 4500, which is not close, for reasons unclear to me.
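The quick formula above can be written as a small Python function. This is a minimal sketch: the function name is made up, and both arguments are in percent, as in the formula. As background not stated in the answer, the factor 16 in rules of thumb of this kind is usually tied to a two-sided 5 % significance level and 80 % power.

```python
def quick_sample_size(p_percent, d_percent):
    """Rule of thumb quoted above: 16 * p * (100 - p) / d^2, with p and d in percent."""
    return 16 * p_percent * (100 - p_percent) / d_percent ** 2

# Values from the worked example: baseline 4.29 %, target 5.43 %.
n = quick_sample_size(4.29, 5.43 - 4.29)
print(round(n))
```

Rounded to the nearest integer, this reproduces the 5055 quoted in the answer.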
