Solved – How to analyze and interpret A/B test results via the bootstrap method

bootstrap, hypothesis testing, p-value

We've run a split test of a new product feature and want to measure whether the uplift in revenue is significant. Our observations are definitely not normally distributed (most of our users don't spend at all, and among those who do, the distribution is heavily skewed: lots of small spenders and a few very big spenders), so we've decided to use bootstrapping to compare the means and get around the non-normality of the data.

My results show an uplift of around 8% vs. control. I now want to quantify how confident I can be in this uplift. Is it as simple as measuring the proportion of the bootstrap distribution of the difference (test-group mean minus control-group mean) that falls below zero? (i.e., does that proportion reflect the chance that my two groups are not actually different?)
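To make the setup concrete, here is a minimal sketch of that bootstrap comparison, assuming `test` and `control` are NumPy arrays of per-user revenue (zeros from non-spenders included); the function name and the `n_boot` default are illustrative, not from the question:

```python
import numpy as np

rng = np.random.default_rng(0)

def bootstrap_uplift(test, control, n_boot=10_000):
    """Bootstrap the distribution of the difference in mean revenue.

    `test` and `control` are per-user revenue arrays, zeros from
    non-spenders included.
    """
    test = np.asarray(test, dtype=float)
    control = np.asarray(control, dtype=float)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        # Resample each group with replacement; record the mean difference.
        t = rng.choice(test, size=test.size, replace=True)
        c = rng.choice(control, size=control.size, replace=True)
        diffs[i] = t.mean() - c.mean()
    # Fraction of the bootstrap distribution at or below zero: an
    # approximate one-sided "chance the uplift is not positive".
    p_below_zero = np.mean(diffs <= 0)
    ci_95 = np.percentile(diffs, [2.5, 97.5])  # percentile interval
    return diffs, ci_95, p_below_zero
```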

Any help would be much appreciated.

Best Answer

Bootstrapping a mean does not usually make sense, thanks to the CLT: just use the sample mean and the standard error of the mean. This will either give the same result as your bootstrap, or your bootstrap would have given a poor result anyway.
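For instance, a normal-approximation interval for the uplift under the CLT might look like the following sketch (again assuming `test` and `control` are per-user revenue arrays; the helper name is hypothetical):

```python
import numpy as np
from scipy import stats

def uplift_normal_ci(test, control, alpha=0.05):
    """Normal-approximation CI for the difference in means, via the CLT.

    The standard error of the difference combines the per-group
    standard errors: se = sqrt(s_t^2/n_t + s_c^2/n_c).
    """
    test = np.asarray(test, dtype=float)
    control = np.asarray(control, dtype=float)
    diff = test.mean() - control.mean()
    se = np.sqrt(test.var(ddof=1) / test.size
                 + control.var(ddof=1) / control.size)
    z = stats.norm.ppf(1 - alpha / 2)  # e.g. 1.96 for a 95% interval
    return diff - z * se, diff + z * se
```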

If you truly know you want to compare the means (and it is not clear that you do; see "Better estimator of expected sum than mean"), then you want to test whether the two samples are likely to have come from the same population. This would be a Welch t-test, using the mean, $\mu$, and the standard error of the mean, $\sigma_\mu$.
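In Python this is one call to SciPy's `ttest_ind` with `equal_var=False`, which performs Welch's test; `test` and `control` are the per-user revenue arrays from the sketches above:

```python
from scipy import stats

# Welch's t-test: does not assume equal variances between the groups.
t_stat, p_value = stats.ttest_ind(test, control, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```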

The problem becomes more complicated if each choice carries a different risk from other factors (e.g., implementation cost). If the t-test shows no significant difference, then you clearly want to take the variant with lower risk. However, if the higher-risk variant has a statistically significant impact on mean revenue, then it is a judgment call.