Machine Learning – Target Population for Power Analysis of ML Model A/B Test

ab-test, cohens-d, machine-learning, sample-size, statistical-power

We are working on an ML model that predicts a numeric result (call it $\hat{x}$). Eventually, we will perform an A/B test, where the metric is a function that takes $\hat{x}$ as an input (call it $f(\hat{x})$). Our control group will get the input from the old source (call it $x$, so its metric is calculated as $f(x)$), and the variant group will get the input as an output from the model (thus its metric is calculated as $f(\hat{x})$). We will then test to see whether the variant group metric's mean is higher by a specified effect size than that of the control group.

We want to perform the experiment's power analysis right away, even before the model is finished, so that we understand the experiment and know whether the required sample size is feasible and thus whether the improvement promised by the model is testable in this way. We have a reasonable effect size in mind (call it $\bar{x}$), as well as the standard $\alpha$ (0.05) and power (80%).

We can calculate $f(x)$ on past data, and it is from that data that we hope to shape the power analysis. One thing I am realizing is that the choice of past population from which to get the standard deviation $s$ used to calculate the power analysis input (Cohen's $d = \bar{x}/s$) can have an enormous impact on the sample size required for the future experiment. In our case, the sample-size output is extremely sensitive to that $s$. So my question is: what is a good population from which to get the $s$? Some of us feel we should use the model's initial training data. Others feel that constitutes some sort of leakage and that we should use the initial test data instead. The answer may vary depending on other factors. If the train/test choice is not clear-cut, what are some guidelines to keep in mind?
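To make the sensitivity concrete, here is a minimal sketch in plain Python of the sample-size calculation for a one-sided two-sample comparison of means, using the standard normal approximation. The effect size ($\bar{x} = 1.0$) and the candidate values of $s$ below are hypothetical, chosen only to show how strongly the required $n$ depends on which $s$ you plug in:

```python
from math import ceil
from statistics import NormalDist

def n_per_group(effect, s, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a one-sided two-sample
    comparison of means, via the normal approximation:
        n = 2 * (z_{1-alpha} + z_{power})^2 / d^2,   where d = effect / s
    """
    z = NormalDist().inv_cdf  # standard normal quantile function
    d = effect / s            # Cohen's d for the planned effect
    return ceil(2 * (z(1 - alpha) + z(power)) ** 2 / d ** 2)

# The same planned effect with different candidate standard deviations:
for s in (2.0, 5.0, 8.0):
    print(s, n_per_group(effect=1.0, s=s))
# -> 50, 310, and 792 per group, respectively
```

A fourfold change in $s$ produces a sixteenfold change in $n$, since $n$ scales with $s^2$; this is why the choice of population for estimating $s$ matters so much here.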

Best Answer

Here is a related thread on power and prediction. The short answer is that you should borrow data from a source that is representative of your target population. If you are not sure which source is representative, then perform several sensitivity analyses (each time borrowing from a different source) to see how the results vary, and give more weight to the more pessimistic results.
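That sensitivity analysis can be sketched as follows (the source names and standard deviations are hypothetical, and the one-sided normal-approximation formula is just one common choice for the sample-size calculation):

```python
from math import ceil
from statistics import NormalDist

def required_n(effect, s, alpha=0.05, power=0.80):
    """Per-group sample size, one-sided two-sample normal approximation."""
    z = NormalDist().inv_cdf
    return ceil(2 * (z(1 - alpha) + z(power)) ** 2 / (effect / s) ** 2)

# Hypothetical estimates of s from different candidate populations:
candidate_s = {"train": 4.2, "test": 5.1, "recent_production": 5.8}

sizes = {src: required_n(effect=1.0, s=s) for src, s in candidate_s.items()}
pessimistic = max(sizes.values())  # plan against the worst case
```

If even the pessimistic sample size is feasible, the choice of source matters less; if only the optimistic one is feasible, you should be suspicious of a plan that depends on having picked the "right" population.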

I also recommend you take into consideration the estimand you are targeting. There is a recent ICH E9 addendum on estimands in clinical drug development; this is important even if your work is not in clinical drug development. Here is a related thread on this topic. The idea is that there will be post-baseline events, such as treatment switching or discontinuation, called "intercurrent events" that can introduce confounding. How these intercurrent events are addressed defines the estimand you are estimating, and it can affect the conclusions drawn from the analysis. Are you imagining a world where such intercurrent events would not occur, and censoring endpoint observations in your data set? Or are you considering a world where such intercurrent events would occur, and incorporating these events into the treatment definition? If you are incorporating estimation and inference from earlier studies, make sure their estimand definition and methods for handling missing data align with your planned approach.

If I am understanding your question and comments correctly, the estimated standard deviation varies from one sample to the next. This could be due to natural sampling variability, to sampling bias, or simply to sampling from different populations. You could investigate this with confidence intervals and p-values.
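One simple way to check whether the differences in $s$ exceed natural sampling variability is a percentile bootstrap confidence interval for each sample's standard deviation; if the intervals overlap substantially, sampling variability alone may explain the discrepancy. A minimal sketch (plain Python, percentile bootstrap):

```python
import random
import statistics

def sd_bootstrap_ci(sample, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the standard
    deviation of a single sample."""
    rng = random.Random(seed)
    # Resample with replacement and recompute the sd each time
    boots = sorted(
        statistics.stdev(rng.choices(sample, k=len(sample)))
        for _ in range(n_boot)
    )
    lo = boots[int(n_boot * alpha / 2)]
    hi = boots[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi
```

Computing this interval for the training-data sample and the test-data sample separately shows whether the two estimates of $s$ are compatible with each other, which directly informs how wide a range of sample sizes the sensitivity analyses should span.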

Here is a paper on transfer learning for inference on a population quantity such as the mean. It discusses the idea that there may be multiple sources to borrow from and it may not be clear which is representative of the target. Here is a paper for predicting a future experimental outcome. Combining the ideas of these two papers should address your situation. Let me know if you need more details and I can edit my response.
