Solved – Criteria for eliminating variables in multiple linear regression

multiple regression, p-value

I'm new to regression and its applications. What are the criteria for eliminating variables from a multiple linear regression model whose purpose is prediction?

  1. Is it a coefficient p-value > 0.05, or
  2. a variance inflation factor > 10?

Best Answer

First, there is no single criterion. Predictive modeling is not a decision tree; it is a science, a trade, and an art form. If you go looking for hard rules, they will often lead you astray.

Given that, neither of the criteria you state is particularly relevant for predictive modeling.

  • p-values are only weakly related to predictive importance and power. Even in situations where the parameter estimates are of direct and primary interest, they are not intended to serve as a decision criterion for variable elimination.
  • The variance inflation factor measures how much precision the estimated coefficients lose because of correlation among the predictors in your training data; in a predictive modeling problem the parameter estimates themselves are of secondary interest.

Much better is a disciplined regularization strategy (ridge, or some combination of ridge with LASSO), a broadly specified model (use splines to capture non-linearities, include reasonably hypothesized interactions), and a steady dose of cross validation.

So all I must look at is the overall F value and its p-value to assess the predictive power of the model?

No. Again, p-values are not tools intended to assess the predictive power of a model or variable. They have little to nothing to say about that issue. To assess the predictive power of a model, you need to use cross validation or a held-out test data set. Make predictions on data that the model has not yet seen, and compare them to the truth.
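To make the held-out idea concrete, here is a small sketch, again assuming scikit-learn and synthetic data (one real predictor buried among 49 noise predictors; all names are illustrative). It fits ordinary least squares on a training split and reports R² on both the training data and the held-out data. The in-sample figure is the kind of quantity an overall F test is built from; the held-out figure is the one that speaks to prediction.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic data (hypothetical): only the first of 50 predictors matters.
rng = np.random.default_rng(1)
n_samples, n_features = 120, 50
X = rng.normal(size=(n_samples, n_features))
y = 2.0 * X[:, 0] + rng.normal(size=n_samples)

# Set aside 30% of the rows; the model never sees them during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = LinearRegression().fit(X_train, y_train)

# In-sample fit versus fit on unseen data.
train_r2 = r2_score(y_train, model.predict(X_train))
test_r2 = r2_score(y_test, model.predict(X_test))
test_rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))

print(f"train R^2: {train_r2:.3f}")
print(f"test  R^2: {test_r2:.3f}")
print(f"test RMSE: {test_rmse:.3f}")
```

With this many predictors relative to the number of training rows, the training R² flatters the model; the gap between the two numbers is the overfitting that in-sample statistics cannot reveal.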

Assessing the predictive power of a variable is much more subtle, and there is no general consensus on what that even means, or what a solution would actually measure.

In such a case what should be done?

If you would like to control the complexity of a model for purposes of maximizing predictive power, then you need to use regularization. You can search this site for "ridge regression" and "lasso regression" to get started learning about this technique.
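As a starting point, here is a hedged sketch of the lasso with its penalty strength picked by cross validation, assuming scikit-learn's `LassoCV` and synthetic data in which only three of thirty predictors carry signal (the coefficients and names are invented for the example). It compares the total size of the coefficients against an unpenalized least-squares fit and counts how many the lasso leaves non-zero.

```python
import numpy as np
from sklearn.linear_model import LassoCV, LinearRegression
from sklearn.preprocessing import StandardScaler

# Synthetic data (hypothetical): 30 standardized predictors, 3 with real effects.
rng = np.random.default_rng(2)
n_samples, n_features = 200, 30
X = StandardScaler().fit_transform(rng.normal(size=(n_samples, n_features)))
true_coef = np.zeros(n_features)
true_coef[:3] = [3.0, -2.0, 1.5]
y = X @ true_coef + rng.normal(size=n_samples)

# Unpenalized fit for reference.
ols = LinearRegression().fit(X, y)

# Lasso: the penalty strength alpha is chosen by 5-fold cross validation.
lasso = LassoCV(cv=5, random_state=0).fit(X, y)

# Complexity summaries: number of non-zero coefficients and their L1 norm.
n_selected = int(np.sum(lasso.coef_ != 0))
l1_ols = np.abs(ols.coef_).sum()
l1_lasso = np.abs(lasso.coef_).sum()

print(f"chosen alpha: {lasso.alpha_:.4f}")
print(f"non-zero lasso coefficients: {n_selected} of {n_features}")
print(f"sum |coef|  OLS: {l1_ols:.3f}   lasso: {l1_lasso:.3f}")
```

The point of the design is that the data, through cross validation, decide how much shrinkage to apply; any variables that end up at zero are a by-product of that, not the result of a p-value or VIF cutoff.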
