Solved – Power and sample size in regression context

hypothesis-testing, multiple-regression, r, sample-size, statistical-power

My question in short:

What is the current practice regarding power analysis in a multiple regression framework when we are interested in a single coefficient (which may or may not be a dummy or interaction term)? In particular, how do I calculate the required sample size?

Lengthy version:

When we test the equality of a simple mean with a standard t-test, the corresponding formula contains $n$, which makes it easy to calculate the power or to solve for $n$ for a given power. In a regression framework, I have $\hat\beta$ and I have $\hat\sigma_\beta$, the estimated standard error of the coefficient. In the homoskedastic case, $\hat\sigma^2_\beta$ is the corresponding diagonal element of $\hat\sigma^2_\epsilon (X'X)^{-1}$. If $X$ contains a constant, then there will be a division by $n$ involved somewhere, so this could be worked out. However, this seems less obvious when we adjust the variance-covariance matrix for potential heteroskedasticity or clustering.

Googling around, the standard reference appears to be Cohen (1988), and there is even a nice R package called pwr that implements a number of power calculations based on this reference. Chapter 9 of Cohen (1988) is framed in terms of the F-test, and power in the regression framework is expressed in terms of $R^2$. For instance, effect sizes ("$f^2$") are defined as $f^2 = \frac{R^2}{1-R^2}$ or $f^2 = \frac{R_{AB}^2-R_{B}^2}{1-R_{B}^2}$, where in the latter $A, B$ indicate different sets of regressors, and everything is framed in terms of variance explained.
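For concreteness, this is roughly how that Cohen-style calculation looks with the pwr package under the classical assumptions; the $R^2$ values and the number of regressors below are invented purely for illustration:

library(pwr)

# suppose the coefficient of interest adds 0.02 to R^2 on top of a model
# with 4 other regressors explaining R^2_B = 0.30 (made-up numbers)
f2 <- (0.32 - 0.30) / (1 - 0.30)   # f^2 = (R^2_AB - R^2_B) / (1 - R^2_B)

# u = number of coefficients tested (1); solve for v, the error degrees
# of freedom, to reach 80% power at the 5% level
res <- pwr.f2.test(u = 1, f2 = f2, sig.level = 0.05, power = 0.80)
res

# required n is roughly v + number of regressors in the full model + 1 (intercept)
ceiling(res$v + 5 + 1)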

My main concern:
Since 1988, statistical practice has changed, at least in economics and in other fields as well. The major differences, as far as I can tell, are these: First, we compute robust or cluster-robust standard errors by default. This naturally inflates standard errors, which makes it harder to reject the null in standard t-tests. Second, in small samples, we are typically worried about distributional assumptions and violations thereof. Third, many people now work with quasi-experimental methods, where variance explained or $R^2$ values are typically tiny, and $R_{AB}^2-R_{B}^2$ is typically close to zero.

I'd imagine that this has implications for power analysis and the determination of the required sample size. I think this is a methodologically really interesting and also important question. I wonder what people who are more up to date would do, which is why I started a bounty.

Illustration of the question:

Imagine you ran the following regression:

library(sandwich)  # heteroskedasticity-consistent covariance estimators
library(lmtest)    # coeftest() for coefficient tests with a supplied vcov

# fit the model and test the coefficients with HC-robust standard errors
mod <- lm(mpg ~ disp + drat + wt*qsec, data = mtcars)
coeftest(mod, vcov. = vcovHC(mod))

t test of coefficients:

               Estimate  Std. Error t value Pr(>|t|)
(Intercept) -13.3238114  51.2643074 -0.2599   0.7970
disp          0.0026224   0.0113389  0.2313   0.8189
drat          1.5662444   1.4498325  1.0803   0.2899
wt            3.1612129  16.6582389  0.1898   0.8510
qsec          2.3617536   2.8229593  0.8366   0.4104
wt:qsec      -0.4402128   0.8979944 -0.4902   0.6281

You see that the coefficient on, say, wt:qsec is insignificant, but imagine you had a strong prior that it would be important. Imagine you wondered whether there truly is no effect, or whether the sample size is merely too small. How can we calculate the power of this test, or, correspondingly, how can we calculate the sample size required to detect an effect of similar size?

Importantly, notice that the standard errors in the above regression are computed with a variance-covariance matrix that is robust to heteroskedasticity of unknown form, which reflects standard practice in many social science fields nowadays. This is markedly different from homoskedastic standard errors; you can run summary(mod) yourself to verify this.
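For instance, one way to see the difference, using the mod object fitted above, is to print the classical and robust standard errors side by side:

# classical (homoskedastic) vs. heteroskedasticity-robust standard errors
cbind(
  classical = sqrt(diag(vcov(mod))),
  robust    = sqrt(diag(vcovHC(mod)))
)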

Best Answer

There are two main approaches to power analysis:

When your design conforms to "classical standards" with regard to estimators used and distributional assumptions, then the formulae from Cohen (most of which are much older than that reference) are mathematically correct, provably so.

When your design starts to depart from these standards, either because you are using nonstandard estimators (for whatever reason) or because there are other wrinkles in your data generation or selection process, the theory generally breaks down quickly. Whilst formulae do exist for a few cases that are very close to the classical paradigm, the usual approach is simulation. If you believe your effect is of a certain magnitude, then simulate, say, 10,000 datasets of a given sample size with that magnitude of effect. Apply your chosen estimator to each of these datasets and see how many return a significant result. Then adjust the sample size to suit your needs: if not enough of the replicates are significant, increase the sample size; if more are significant than required, you can get away with reducing it.
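Here is a minimal sketch of such a simulation, built around the regression from the question. The data-generating process below (resampled mtcars regressors, a postulated interaction coefficient of about -0.44, invented values for the other coefficients, and a rough error SD) is entirely an assumption; you would replace it with whatever you believe about your own setting.

library(sandwich)
library(lmtest)

# simulated power for the wt:qsec coefficient with HC-robust inference
sim_power <- function(n, beta_int = -0.44, sd_eps = 2.5, nrep = 2000,
                      alpha = 0.05) {
  pvals <- replicate(nrep, {
    # resample the observed regressors with replacement to mimic their joint distribution
    d <- mtcars[sample(nrow(mtcars), n, replace = TRUE), ]
    # generate the outcome under the postulated effect sizes (all assumed)
    d$y <- with(d, 20 + 0.003 * disp + 1.5 * drat + 3 * wt + 2 * qsec +
                  beta_int * wt * qsec + rnorm(n, sd = sd_eps))
    fit <- lm(y ~ disp + drat + wt * qsec, data = d)
    coeftest(fit, vcov. = vcovHC(fit))["wt:qsec", "Pr(>|t|)"]
  })
  mean(pvals < alpha)   # share of significant replicates = estimated power
}

# estimated power at a few candidate sample sizes
sapply(c(50, 100, 200, 400), sim_power)

The same skeleton works for clustered designs or other estimators: change the data-generating process and swap vcovHC() for the variance estimator you actually plan to use, then increase nrep (e.g. to 10,000) for the final calculation.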
