Solved – Can we correctly identify all the non-zero coefficients in the linear regression model

feature-selection, regression, regression-strategies

I have a conceptual question regarding linear regression.

Assume our model is correct, i.e., the response variable $Y$ is indeed coming from the model

$$Y=\beta_0+\beta^\top X+\epsilon.$$

Here $X$ is a vector of length $m$. Assume all the nice assumptions hold, e.g., $\epsilon$ is normal and we have a set of i.i.d. observations. We know that, by consistency of the OLS estimator, we would recover the true values of the coefficients (some of which could be 0) if we had an infinite number of observations. My question is: given that we only have a finite amount of data, is there a method that can correctly identify the non-zero coefficients at some confidence level?

According to "An Introduction to Statistical Learning" (page 77, and I agree with the book here), we cannot simply look at the $p$-value associated with each individual coefficient and claim that if one coefficient has a $p$-value less than 5%, then that coefficient is non-zero at 95% confidence. The book says that this logic (looking at each individual $p$-value) is flawed, especially when we have a large number of predictors: if we have 100 predictors, then about 5% of the $p$-values will be below 0.05 just by chance, even if the true model has all coefficients equal to 0. That is why we need to look at the $F$-test of the overall significance of the model, i.e., whether at least one coefficient is significantly different from 0.
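To see this concretely, here is a quick simulation sketch (all sizes and the seed are arbitrary choices for illustration): with 100 pure-noise predictors, a handful of individual $t$-test $p$-values typically fall below 0.05 even though every true coefficient is 0, while the overall $F$-test correctly has no systematic tendency to reject.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, m = 500, 100
X = rng.standard_normal((n, m))
y = rng.standard_normal(n)              # global null: every true coefficient is 0

# Fit OLS with an intercept and compute the usual t-test p-values by hand
Xd = np.column_stack([np.ones(n), X])
beta, _, _, _ = np.linalg.lstsq(Xd, y, rcond=None)
resid = y - Xd @ beta
dof = n - m - 1
sigma2 = resid @ resid / dof
se = np.sqrt(sigma2 * np.diag(np.linalg.inv(Xd.T @ Xd)))
pvals = 2 * stats.t.sf(np.abs(beta / se), dof)

# Overall F-test of H0: all slope coefficients are 0
tss = np.sum((y - y.mean()) ** 2)
rss = resid @ resid
F = ((tss - rss) / m) / (rss / dof)
p_F = stats.f.sf(F, m, dof)

# By chance alone, about 5 of the 100 slopes are "significant" on average
print("slopes with p < 0.05:", int((pvals[1:] < 0.05).sum()))
print("overall F-test p-value:", round(p_F, 3))
```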

I fully agree with this argument. But then what is the point of the $p$-values of the individual coefficients?

And the next question: if the $F$-test concludes that, among all predictors, at least one coefficient is not 0 at significance level $\alpha=0.05$, we still don't know which coefficient(s) are significantly different from 0. We know that if $\text{H}_0$: all coefficients are 0 is true, then there is only probability $\alpha$ that the $F$-test will produce a $p$-value below 0.05, regardless of the number of predictors and the number of observations. So if the $F$-test rejects $\text{H}_0$ and we find 12 coefficients whose individual $p$-values are below $\alpha$, can we conclude that these 12 coefficients are non-zero? At what confidence level? $1-\alpha$? Or something else? How should we interpret the individual $p$-values (the $t$-tests) together with the overall $p$-value (the $F$-test)?

If we perform forward selection / backward elimination / stepwise selection, then for the final set of predictors produced by these methods, at what confidence level can we conclude that the corresponding coefficients are non-zero?

If we run all $2^m$ regression models (where $m$ is the total number of predictors), will this help us identify the set of non-zero coefficients, at least theoretically? At what confidence level?

Best Answer

I do not have a good answer, but let me rephrase some of your thoughts and questions and comment on them.

<...> what is the point of those $p$-values of each individual coefficient?

A $p$-value is a valid tool for assessing the significance of a single regressor individually. If you care about whether $X_i$ has a non-zero coefficient in the population, you look at the $p$-value associated with the coefficient $\beta_i$ on $X_i$. If that is the only question you have, this is a satisfactory answer. That is the point of the individual $p$-values.

That's why we need to look at the $F$-test of the overall significance of the model, i.e., whether there is at least one coefficient that is significantly different from 0.

Yes, the $F$-statistic tells you whether the regressors, taken together, all have zero coefficients in the population. But this is only a special case of what you are interested in, if I understand you correctly. So the $F$-statistic is not very useful here -- unless it is low enough to conclude that, at the given significance level, there is not enough evidence to reject the null.

<...> if we find there are 12 coefficients whose $p$-values are below $\alpha$, can we conclude that these 12 coefficients are non-zero? At what confidence level? $1-\alpha$? Or something else?

You can take any single coefficient individually and conclude at $1-\alpha$ confidence level that it is non-zero, but you cannot do that jointly for all 12 coefficients at $1-\alpha$ confidence level. If the significance tests for the twelve coefficients were independent, you could say that the confidence level is $(1-\alpha)^{12}$ which is (considerably) lower than $1-\alpha$.
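The arithmetic behind this, together with the standard Bonferroni remedy (a fact about multiple testing, not something from the original question), can be sketched in a few lines:

```python
alpha, k = 0.05, 12

# Joint confidence level if the 12 tests were independent:
# well below the nominal 0.95
joint = (1 - alpha) ** k
print(round(joint, 4))        # 0.5404

# Bonferroni: testing each coefficient at level alpha/k keeps the
# family-wise error rate at most alpha, with no independence assumption
print(alpha / k)
```

The Bonferroni threshold is conservative but always valid; less conservative corrections (e.g., Holm's step-down procedure) exist for the same purpose.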

If we perform forward selection/ backward elimination/ step-wise selection, in the final set of predictors produced by these methods, at what confidence level can we conclude that those corresponding coefficients are non-zero?

This is a tough question. The $p$-values and the $F$-statistic in the final model are conditional on how that model was built, i.e., on the forward selection / backward elimination / stepwise selection mechanism. Hence, they cannot be used as-is for making inference about whether the coefficients are zero in the population; these values have to be adjusted. There may exist procedures for that -- the issue has been known for a long time and is nowadays studied under the name of post-selection inference -- but I cannot point to a specific reference.
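A quick simulation (a sketch with arbitrary sizes, not any established procedure) shows why the naive $p$-values are too small after selection. Under a global null, run just the first step of forward selection -- pick the predictor most correlated with $y$ -- and look at its unadjusted $p$-value; it falls below 0.05 far more often than the nominal 5% of the time.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, m, reps = 100, 20, 2000
hits = 0
for _ in range(reps):
    X = rng.standard_normal((n, m))
    y = rng.standard_normal(n)            # y is pure noise: no real predictors
    # First step of forward selection: choose the best-correlated predictor
    corr = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(m)])
    j = int(np.argmax(corr))
    # Naive p-value from a simple regression on the selected predictor
    p = stats.linregress(X[:, j], y).pvalue
    hits += p < 0.05

# Fraction of "significant" selected predictors under the null:
# far above the nominal 0.05
print(round(hits / reps, 2))
```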

If we run all $2^m$ regression models (where $m$ is the total number of predictors), will this help us identify the set of non-zero coefficients, at least theoretically? At what confidence level?

Recall that any of these models that omits a relevant variable (one with a non-zero coefficient in the population) will suffer from omitted-variable bias and will generally have "wrong" $p$-values etc., so the approach appears problematic.
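A minimal illustration of that omitted-variable bias (variable names and coefficients are made up for the example): the true coefficient on $x_2$ is zero, but a model that omits $x_1$ estimates a clearly non-zero, "significant" coefficient on $x_2$, because $x_2$ is correlated with the omitted $x_1$.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n = 1000
x1 = rng.standard_normal(n)
x2 = 0.8 * x1 + 0.6 * rng.standard_normal(n)   # correlated with x1, variance ~1
y = x1 + rng.standard_normal(n)                # true model: only x1 matters

# Regress y on x2 alone: the effect of the omitted x1 leaks into the estimate
res = stats.linregress(x2, y)
print(round(res.slope, 2))    # close to 0.8, not the true value 0
print(res.pvalue < 0.05)      # True: x2 looks "significant"
```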

Finally, note that the "nice assumptions"

Assume all the nice assumptions hold, e.g., $\epsilon$ is normal, we have a set of i.i.d. observations.

require $\epsilon$ -- rather than the observations $Y$ and $X$ -- to be i.i.d.