I'm new to the concept of regression and its application. What are the criteria for eliminating variables in a multiple linear regression model whose purpose is prediction?
- Is it a coefficient p-value > 0.05, or
- a variance inflation factor > 10?
Best Answer
First, there is no single criterion. Predictive modeling is not itself a decision tree; it is a science, a trade, and an art form. If you go looking for hard rules, they will often lead you astray.
Given that, neither of the criteria you state is particularly relevant for predictive modeling.
Much better is a disciplined regularization strategy (ridge, or some combination of ridge with LASSO), a broadly specified model (use splines to capture non-linearities, include reasonably hypothesized interactions), and a steady dose of cross validation.
No. Again, p-values are not tools intended to assess the predictive power of a model or variable. They have little to nothing to say about that issue. To assess the predictive power of a model, you need to use cross validation or a held-out test data set: make predictions on data that the model has not yet seen, and compare them to the truth.
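To make the "predict on unseen data, compare to the truth" idea concrete, here is a minimal sketch of k-fold cross validation for an ordinary least squares model, written with NumPy only so the mechanics are visible. The helper name `kfold_rmse` and the synthetic data are my own assumptions for illustration, not anything from the question.

```python
import numpy as np

def kfold_rmse(X, y, k=5, seed=0):
    """Average out-of-sample RMSE of ordinary least squares over k folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        # Add an intercept column; fit on the training rows only.
        A_train = np.column_stack([np.ones(len(train)), X[train]])
        A_test = np.column_stack([np.ones(len(test)), X[test]])
        beta, *_ = np.linalg.lstsq(A_train, y[train], rcond=None)
        # Score on rows the fitted model has never seen.
        resid = y[test] - A_test @ beta
        errors.append(np.sqrt(np.mean(resid ** 2)))
    return float(np.mean(errors))

# Synthetic data (an assumption for illustration):
# y depends on columns 0 and 1; column 2 is pure noise.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

rmse_full = kfold_rmse(X, y)
rmse_without_x0 = kfold_rmse(X[:, 1:], y)
print(rmse_full, rmse_without_x0)
```

Comparing the two numbers tells you what dropping a column costs in out-of-sample error, which is the quantity you actually care about for prediction. In practice you would probably reach for a library routine rather than hand-rolling the folds.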
Assessing the predictive power of a variable is much more subtle, and there is no general consensus on what that even means, or what a solution would actually measure.
If you would like to control the complexity of a model for purposes of maximizing predictive power, then you need to use regularization. You can search this site for "ridge regression" and "lasso regression" to get started learning about this technique.
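As a starting point, here is a rough sketch of ridge regression with the penalty strength chosen by cross validation, again in plain NumPy. The function names (`ridge_fit`, `cv_mse`), the penalty grid, and the simulated data are all assumptions made for the example. The predictors are simulated on a common scale; with real data you would normally standardize them first, since the penalty is scale sensitive.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge fit with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    p = X.shape[1]
    # Solve (Xc'Xc + lam * I) coef = Xc'yc
    coef = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)
    intercept = y_mean - x_mean @ coef
    return coef, intercept

def cv_mse(X, y, lam, k=5, seed=0):
    """Mean squared prediction error of ridge, averaged over k folds."""
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    mses = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        coef, b0 = ridge_fit(X[train], y[train], lam)
        pred = X[test] @ coef + b0
        mses.append(np.mean((y[test] - pred) ** 2))
    return float(np.mean(mses))

# Simulated data (an assumption): 30 predictors, only 3 matter.
rng = np.random.default_rng(1)
n, p = 60, 30
X = rng.normal(size=(n, p))
true_coef = np.zeros(p)
true_coef[:3] = [1.5, -2.0, 1.0]
y = X @ true_coef + rng.normal(scale=1.0, size=n)

# Hypothetical penalty grid; lam = 0.0 is plain least squares.
lambdas = [0.0, 0.1, 1.0, 10.0, 100.0]
scores = {lam: cv_mse(X, y, lam) for lam in lambdas}
best_lam = min(scores, key=scores.get)
print(best_lam, scores[best_lam])
```

Note that no variable is ever "eliminated" here: ridge keeps every predictor and shrinks the coefficients, and cross validation decides how much shrinkage helps prediction. The lasso swaps the squared penalty for an absolute-value one, which can set some coefficients exactly to zero, but it has no closed form, so it is not shown in this sketch.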