Regression Standard Error – When and Why to Bootstrap the Standard Error in Regression

Tags: bootstrap, regression, regression coefficients, standard error

I have a linear regression model:

$$Y_i = \alpha + \beta_0T_i D_i + \beta_1D_i + \beta_2T_i + \delta x_i + \epsilon_i$$

where $Y$ is a continuous outcome variable, $D$ is the binary treatment variable (0 or 1), $T$ is the type of person (0 or 1), and $x_i$ denotes other controls.
I am interested in $\beta_0$, i.e., in testing whether the treatment affects the two types differently.

The data come from a lab experiment in which treatment and type were assigned randomly; I have 28 observations of each type under each treatment (112 observations in total).

With robust standard errors I get a large standard error and a high p-value for $\beta_0$. However, when I bootstrap the standard errors, I get the same coefficients with much smaller standard errors and p-values: $\beta_0$ is significantly different from 0 with the bootstrap standard error but not otherwise.

How can I tell which approach and result is correct?
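
For concreteness, here is a minimal sketch of the kind of comparison being described, on simulated data with the same 2x2 design (28 observations per cell). The column names (y, T, D, x), the effect sizes, and the statsmodels calls are illustrative assumptions, not the actual data or analysis.

```python
# Robust (HC3) SE vs case-bootstrap SE for the interaction coefficient,
# on simulated data only.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n_cell = 28
df = pd.DataFrame({
    "T": np.repeat([0, 1], 2 * n_cell),            # type
    "D": np.tile(np.repeat([0, 1], n_cell), 2),    # treatment
    "x": rng.normal(size=4 * n_cell),              # one extra control
})
# true interaction effect set to 0.4 purely for illustration
df["y"] = (1.0 + 0.4 * df["T"] * df["D"] + 0.5 * df["D"] - 0.3 * df["T"]
           + 0.2 * df["x"] + rng.normal(size=len(df)))

# OLS with heteroskedasticity-robust (HC3) standard errors
fit = smf.ols("y ~ T * D + x", data=df).fit(cov_type="HC3")
print("robust (HC3) SE of T:D :", fit.bse["T:D"])

# Case (full-observation) bootstrap of the same coefficient
B = 1000                                           # use more in practice
boot = np.empty(B)
for b in range(B):
    idx = rng.integers(0, len(df), len(df))        # resample rows with replacement
    samp = df.iloc[idx].reset_index(drop=True)
    boot[b] = smf.ols("y ~ T * D + x", data=samp).fit().params["T:D"]
print("bootstrap SE of T:D    :", boot.std(ddof=1))
```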

Best Answer

Robust standard errors sacrifice some power in order to be safer against certain deviations from the standard linear model assumptions (most notably heteroskedasticity). This means that they tend to produce larger p-values and fewer significant results than some other approaches. Is this good or not?

(a) If the data show the specific problems that the robust method is designed to handle, the robust approach is better because it takes them into account.

(b) Even otherwise, an insignificant result isn't wrong: non-significance doesn't mean that the null hypothesis is true, only that the evidence against it isn't strong enough to reject it.

(c) Note that there is more than one way of bootstrapping standard errors in regression. The basic distinction is between bootstrapping full observations (cases) and bootstrapping residuals, but one could also run a parametric bootstrap if a specific non-normal model is assumed (a sketch after this list illustrates the two non-parametric variants).

(d) An advantage of the (full-observation) bootstrap is that it doesn't make specific distributional assumptions, so it can be more precise than other standard errors when the data represent the actual underlying distribution well. That, if you so wish, is the assumption behind the bootstrap.

(e) The bootstrap can be very unstable if the dataset is small; it can also be unstable if there are not enough bootstrap replicates (the sketch below also shows how the estimated SE fluctuates when too few replicates are used).
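
To make the distinction in (c) and the replicate issue in (e) concrete, here is a minimal sketch on simulated data; the design mimics the question's cell sizes, but all names and effect sizes are assumptions for illustration.

```python
# Case (full-observation) vs residual bootstrap SEs for the interaction
# coefficient, plus a check of how noisy the SE estimate is when the number
# of bootstrap replicates B is small.  All data below are simulated.
import numpy as np

rng = np.random.default_rng(0)

# 2x2 design with 28 observations per cell and one extra control
n_cell = 28
T = np.repeat([0, 1], 2 * n_cell)
D = np.tile(np.repeat([0, 1], n_cell), 2)
x = rng.normal(size=T.size)
X = np.column_stack([np.ones_like(x), T * D, D, T, x])   # column 1 holds T*D
y = X @ np.array([1.0, 0.4, 0.5, -0.3, 0.2]) + rng.normal(size=T.size)

def ols(X, y):
    """OLS coefficients via least squares."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

b_hat = ols(X, y)
resid = y - X @ b_hat
n = y.size

def boot_se(kind, B, seed):
    """Bootstrap SE of the interaction coefficient (column 1 of X)."""
    r = np.random.default_rng(seed)
    coefs = np.empty(B)
    for b in range(B):
        idx = r.integers(0, n, n)
        if kind == "cases":                  # resample whole observations
            coefs[b] = ols(X[idx], y[idx])[1]
        else:                                # keep X fixed, resample residuals
            coefs[b] = ols(X, X @ b_hat + resid[idx])[1]
    return coefs.std(ddof=1)

print("case bootstrap SE    :", boot_se("cases", 2000, 1))
print("residual bootstrap SE:", boot_se("residuals", 2000, 1))

# Point (e): with few replicates the SE estimate itself is noisy -- rerun the
# case bootstrap with different seeds and compare the spread for two B values.
for B in (100, 2000):
    ses = [boot_se("cases", B, s) for s in range(5)]
    print(f"B={B:4d}: case-bootstrap SE ranges from {min(ses):.3f} to {max(ses):.3f}")
```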

From your response in the comments: "The goal is to understand the approach that helps to get closer to the truth irrespective of what that is (null or significant effect)."

There is a confusion of terms here. Whether an effect is a null effect is a matter of the unobserved truth; an effect that isn't a null effect should be called "non-null". "Significant" is not the opposite of "null". Whether an effect is significant or not can be observed and computed from the data, and is relative to the chosen method. In the question you're talking about an effect that is significant under one method and not significant under another. This is not contradictory; both results are correct, because the concept of significance relies on the method used to compute it.

It is also not the case that either result has to be wrong. Even if the true effect is non-null, an insignificant result is by no means impossible and shouldn't be interpreted as meaning "this is a null effect" (I am aware that this is not your personal confusion, but whole fields of research tend to ignore this and misinterpret p-values accordingly). On the other hand, a significant result does not have to be wrong even if the null is true; it just means that something unlikely has happened (which happens rarely, but it does happen).

On top of this there is the added difficulty that models are never precisely and literally true in reality, so even what you'd like to call a "real null effect" will in reality not just be a data generator that behaves exactly as your model specifies with $\beta_0 = 0$, and may occasionally produce significances that don't mean what people usually think they mean.

Obviously in your situation you cannot know whether the true effect is null or not. You probably don't know (much) more about the reality of interest than what the data say, and the data won't tell you precisely whether the true effect is null or not (and actually in reality there may not even be an unambiguous answer to that question).

The only thing you can go by is what can be seen in the data, which roughly is the following:

(1) If the data show any of the specific problems that the robust standard error is robust against (there are various versions of this as well, so I cannot tell exactly what these are in your case), using the robust standard error is a good idea. However, this may also raise doubts about the regression parameter estimators themselves, which should then perhaps also be computed in a robust way (see the sketch after this list).

(2) If the dataset is too small, the bootstrap is unreliable. In any case, use a generous number of bootstrap replicates if you want to use the bootstrap.

(3) If the dataset is reasonably big and doesn't show the specific robustness issues the robust estimator is made for, I'd be surprised to see big differences between the robust SE and the bootstrap SE; if the p-value based on the robust SE is just above a significance threshold and the bootstrap-based one just below, I'd say there is some indication that something is going on. (The issue is not whether the effect is really significant or insignificant - it is just significant according to one method and insignificant according to the other - but rather whether there is reason to believe that the effect is non-null, which is normally indicated by a significant result.)
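
As a purely illustrative version of point (1), the sketch below simulates data with heavy-tailed errors and reports OLS with heteroskedasticity-robust (HC3) standard errors next to a robust Huber M-estimate of the coefficients. The variable names, error distribution, and statsmodels calls are assumptions for the example, not a prescription.

```python
# Robust SEs for OLS coefficients vs a robust estimate of the coefficients
# themselves (Huber M-estimation), on simulated data with heavy-tailed errors.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(3)
n_cell = 28
df = pd.DataFrame({
    "T": np.repeat([0, 1], 2 * n_cell),
    "D": np.tile(np.repeat([0, 1], n_cell), 2),
    "x": rng.normal(size=4 * n_cell),
})
# t(3) errors give the robust methods something to work with
df["y"] = (1.0 + 0.4 * df["T"] * df["D"] + 0.5 * df["D"] + 0.2 * df["x"]
           + rng.standard_t(3, size=len(df)))

ols_hc3 = smf.ols("y ~ T * D + x", data=df).fit(cov_type="HC3")    # robust SEs
huber = smf.rlm("y ~ T * D + x", data=df,
                M=sm.robust.norms.HuberT()).fit()                  # robust coefficients

print("OLS estimate of T:D, HC3 SE :", ols_hc3.params["T:D"], ols_hc3.bse["T:D"])
print("Huber M-estimate of T:D, SE :", huber.params["T:D"], huber.bse["T:D"])
```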

Final remark: Significance thresholds are largely arbitrary, and p-values will not perfectly replicate due to random variation. Even if you fix your significance threshold at 0.05, 0.04 and 0.07 are in fact not very different p-values, and to say that something "fails to replicate" because the original p-value was 0.04 and you get 0.07 on new data or with a different method is harsh. In fact this is entirely possible whether the truth is null or non-null. Under a true null effect the p-value is ideally distributed Uniform(0,1), so observing p=0.77 on some data and then 0.05 on the next dataset is entirely possible and realistic. (If the true effect is strongly non-null, you'd expect only small p-values, though.)
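
The Uniform(0,1) claim is easy to check by simulation; the sketch below repeatedly tests a true null effect (two identically distributed groups) and shows that the resulting p-values spread roughly uniformly. The group sizes and the test used are arbitrary choices for illustration.

```python
# Under a true null effect, p-values are (approximately) Uniform(0,1): about
# 5% fall below 0.05, 10% below 0.10, and values like 0.77 and 0.05 can easily
# occur on successive datasets.  Purely illustrative simulation.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
pvals = np.empty(5000)
for i in range(pvals.size):
    a = rng.normal(size=28)      # two groups with identical distributions,
    b = rng.normal(size=28)      # i.e. the true effect is exactly null
    pvals[i] = stats.ttest_ind(a, b).pvalue

for thr in (0.05, 0.10, 0.50):
    print(f"fraction of p-values below {thr}: {np.mean(pvals < thr):.3f}")
```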
