# Inferences on ratio of branch means in randomized experiment

Tags: causality, inference, ratio, regression, treatment-effect

It's generally well known that the difference of means is an unbiased estimator of the Average Treatment Effect (ATE) in randomized experiments: $$\mathbb{E}[Y|A=1]-\mathbb{E}[Y|A=0]$$ is unbiased for $$\mathbb{E}[Y(1) - Y(0)]$$, where $$A\in\{0,1\}$$ indicates the treatment branch and $$Y(1),Y(0)$$ are the potential outcomes under treatment and control, respectively (potential outcomes framework). As a result, confidence intervals for the ATE can be constructed using standard t-tests.

It's also well known that this is equivalent to a linear regression with a single indicator variable: inferences on $$\hat{\delta}$$ in the model $$y_i=\alpha + \delta\cdot t_i$$, where $$t_i\in\{0,1\}$$ is the treatment indicator for the $$i$$-th unit.
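As a quick check of that equivalence (my own illustrative sketch, not part of the question), the OLS coefficient on the indicator reproduces the difference of means exactly:

```r
# Illustrative sketch with simulated data: the OLS slope on a binary
# treatment indicator equals the difference in group means.
set.seed(1)
y0 <- rnorm(100, mean = 1.0)   # control outcomes
y1 <- rnorm(100, mean = 1.5)   # treated outcomes
t  <- rep(0:1, each = 100)
y  <- c(y0, y1)

diff_means <- mean(y1) - mean(y0)
ols_delta  <- unname(coef(lm(y ~ t))["t"])

all.equal(ols_delta, diff_means)  # identical up to floating point
```

The standard errors also agree when the t-test uses a pooled variance, which is why inference on $$\hat{\delta}$$ is the same as the classical two-sample test.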

I am interested in inferences on relative differences in means in a randomized experiment:

$$f(Y_t,Y_c)=\frac{\bar{Y}_t - \bar{Y}_c}{\bar{Y}_c}=\frac{\bar{Y}_t}{\bar{Y}_c}-1$$

We can construct confidence intervals for $$f$$ either analytically (an application of Fieller's theorem) or through bootstrapping.
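For the bootstrap route, here is a minimal unadjusted sketch (the simulated data and sample sizes are my own assumptions for illustration):

```r
# Percentile-bootstrap CI for f = mean(Yt)/mean(Yc) - 1, no covariate
# adjustment. Data are simulated purely for illustration.
set.seed(2)
yc <- rnorm(200, mean = 1.0)   # control branch
yt <- rnorm(200, mean = 1.5)   # treatment branch

f_hat <- mean(yt) / mean(yc) - 1

boots <- replicate(2000, {
  mean(sample(yt, replace = TRUE)) / mean(sample(yc, replace = TRUE)) - 1
})
ci <- quantile(boots, c(0.025, 0.975))  # 95% percentile interval
```

Resampling each branch separately mimics the randomization; the ratio is recomputed on every resample so its sampling variability propagates into the interval.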

It's also known that covariate adjustment (the addition of prognostic or baseline variables) can improve the precision of the ATE estimate from the OLS model. This is commonly used in online experimentation platforms in industry.

It's easy to see how covariate adjustment improves the estimate of the ATE, that is, of the difference in means.

My question is: is there a way to benefit from covariate adjustment for inferences on the relative difference (statistic $$f$$, above)? Can you, for example, draw inferences on $$f(Y_t^{cv},Y_c^{cv})$$ using bootstrapping, where $$Y_t^{cv},Y_c^{cv}$$ are the CUPED-adjusted metric values?

The added precision from covariate adjustment comes from the fact that some of the variation in the outcome may be explained by variation in the adjustment variables. When this is the case, adjusting for those variables in OLS achieves the same thing as the CUPED adjustment (perhaps slightly better, since CUPED does not appear to adjust the degrees of freedom used for the test statistic).
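To make that concrete, here is a hedged sketch (data simulated by me) of the CUPED adjustment and its agreement with the OLS-adjusted coefficient:

```r
# CUPED sketch: y_cv = y - theta * (x - mean(x)), with
# theta = cov(y, x) / var(x). Under randomization, the difference in
# CUPED-adjusted means is close to the OLS coefficient on trt after
# adjusting for x, and y_cv has much smaller variance than y.
set.seed(3)
n   <- 1000
trt <- rep(0:1, each = n / 2)
x   <- rnorm(n)                         # pre-experiment covariate
y   <- 1 + 2 * x + 0.5 * trt + rnorm(n)

theta <- cov(y, x) / var(x)
y_cv  <- y - theta * (x - mean(x))      # CUPED-adjusted outcome

cuped_est <- mean(y_cv[trt == 1]) - mean(y_cv[trt == 0])
ols_est   <- unname(coef(lm(y ~ trt + x))["trt"])

c(cuped = cuped_est, ols = ols_est)     # both near the true ATE of 0.5
```

The variance reduction in `y_cv` relative to `y` is exactly where the extra precision comes from; the two estimators differ only in how `theta` is estimated.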

If you're willing to use post-estimation techniques, like a marginal effect, then obtaining confidence intervals for the quantity $$f$$, sometimes called the excess risk ratio, should be straightforward.

Let's do a little example in R: one adjustment covariate and a randomized exposure, not unlike what you might see in experiments in industry.

```r
set.seed(0)
n   <- 1000
trt <- rep(0:1, each = n / 2)   # randomized treatment indicator, 500 per arm
x   <- rnorm(n)                 # adjustment covariate
y   <- 2 * x + 1 + 0.5 * trt + rnorm(n, 0, 0.3)
```

The ATE in this example is 0.5, and the expected value of the potential outcome under control is $$\mathbb{E}[Y(0)] = 2\,\mathbb{E}[x] + 1 = 1$$. This means we should get an estimate of the excess risk close to $$0.5/1 = 0.5$$.

Let's fit a model and use the `marginaleffects` package to estimate this quantity with a 95% CI. `marginaleffects` can estimate something closely related to the excess risk by passing `comparison = "lnratioavg"` and `transform = exp` to the `avg_comparisons` function. This will actually estimate

$$\tau = \ln(\mathbb{E}[Y(1)]) - \ln(\mathbb{E}[Y(0)])$$

and then return estimates for $$\exp(\tau)$$, which is the ratio of means $$\bar{Y}_t/\bar{Y}_c = f + 1$$, so subtracting 1 recovers $$f$$. Let's do this in code.

```r
library(marginaleffects)

d   <- data.frame(x, trt, y)
fit <- lm(y ~ trt + x, data = d)

avg_comparisons(
  fit,
  variables  = "trt",
  comparison = "lnratioavg",
  transform  = exp
)
```
```
 Term              Contrast Estimate Pr(>|z|)     S 2.5 % 97.5 %
  trt ln(mean(1) / mean(0))      1.5   <0.001 453.8  1.46   1.55

Columns: term, contrast, estimate, p.value, s.value, conf.low, conf.high, predicted_lo, predicted_hi, predicted
Type:  response
```

We get an estimate of $$\exp(\tau) = 1.5$$, which is exactly where it should be (the true ratio of means is $$1.5/1 = 1.5$$). We also get a confidence interval for this quantity. Let's refit the model and see how much wider the CI is when we don't adjust for x.

```r
fit <- lm(y ~ trt, data = d)

avg_comparisons(
  fit,
  variables  = "trt",
  comparison = "lnratioavg",
  transform  = exp
)
```

```
 Term              Contrast Estimate Pr(>|z|)    S 2.5 % 97.5 %
  trt ln(mean(1) / mean(0))      1.5   <0.001 12.6  1.22   1.86

Columns: term, contrast, estimate, p.value, s.value, conf.low, conf.high, predicted_lo, predicted_hi, predicted
Type:  response
```

The CI is much wider when we don't adjust for x, as expected. So this is how we can practically get a CI for the excess risk, and we get to benefit from covariate adjustment too.

Bootstrapping will work in this scenario as well. Here is an example:

```r
f <- function(d, idx) {
  fit <- lm(y ~ x + trt, data = d[idx, ])

  # Predicted means under control and treatment, holding x at 0
  # (the population mean of the covariate in this simulation)
  ey0 <- predict(fit, newdata = list(x = 0, trt = 0))
  ey1 <- predict(fit, newdata = list(x = 0, trt = 1))

  (ey1 - ey0) / ey0
}

result <- boot::boot(d, f, R = 1000)

boot::boot.ci(result)
```
```
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL :
boot::boot.ci(boot.out = result)

Intervals :
Level      Normal              Basic
95%   ( 0.4565,  0.5539 )   ( 0.4566,  0.5553 )

Level     Percentile            BCa
95%   ( 0.4543,  0.5529 )   ( 0.4545,  0.5530 )
Calculations and Intervals on Original Scale
Warning message:
In boot::boot.ci(result) :
  bootstrap variances needed for studentized intervals
```

The bootstrap results look fairly similar to the marginal effect estimates.

So, in short, you can just use regression and a marginal effect in most cases. Bootstrapping is a good option too.