I've got baseball data with four independent variables, and I'm not sure how to determine which variable is the most significant predictor. To begin, I ran a multiple regression and looked at the t-values in the coefficients table, but 3 of the 4 variables are significant according to the p-values. How can I resolve this? Should I run 4 separate simple linear regressions and compare the F statistics instead?
Solved – How to determine the best predictor in a linear regression model
multiple regression, regression
Related Solutions
It seems your question more generally concerns the problem of identifying good predictors. In that case, you should consider some kind of penalized regression (methods for variable or feature selection are relevant too), with e.g. L1, L2, or a combination of the two, the so-called elastic net, penalties (look for related questions on this site, or the R penalized and elasticnet packages, among others).
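The answer above names the R penalized and elasticnet packages; as a rough illustration of the same idea in Python, here is a minimal elastic-net sketch with scikit-learn, on made-up data standing in for the four baseball predictors:

```python
# Sketch of penalized regression for predictor screening with an elastic net
# (L1 + L2) penalty, using scikit-learn on toy data: only two of the four
# simulated predictors actually matter.
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 4))                              # four candidate predictors
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)   # x3, x4 are pure noise

# Standardize so the penalty treats all predictors on a comparable scale.
Xs = StandardScaler().fit_transform(X)

# ElasticNetCV mixes L1 and L2 penalties and chooses the penalty strength
# by cross-validation.
model = ElasticNetCV(l1_ratio=0.5, cv=5).fit(Xs, y)

# Coefficients shrunk toward zero suggest weak predictors.
for name, coef in zip(["x1", "x2", "x3", "x4"], model.coef_):
    print(f"{name}: {coef:+.3f}")
```

The L1 part of the penalty can set coefficients exactly to zero, which is what makes this a selection method rather than only a shrinkage method.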
Now, about correcting p-values for your regression coefficients (or, equivalently, your partial correlation coefficients) to protect against over-optimism (e.g. with Bonferroni or, better, step-down methods): this seems relevant only if you are considering a single model and seek the predictors that contribute a significant part of the explained variance, that is, if you don't perform model selection (with stepwise selection or hierarchical testing). This article may be a good start: Bonferroni Adjustments in Tests for Regression Coefficients. Be aware that such a correction won't protect you against multicollinearity, which affects the reported p-values.
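As a small sketch of the adjustment just described, here is how the Bonferroni and Holm (step-down) corrections can be applied with statsmodels; the four p-values are illustrative placeholders, not values from the asker's actual regression:

```python
# Adjusting a set of regression-coefficient p-values for multiplicity,
# comparing Bonferroni with Holm's step-down method.
from statsmodels.stats.multitest import multipletests

pvals = [0.012, 0.034, 0.041, 0.462]   # placeholder p-values for 4 coefficients

for method in ("bonferroni", "holm"):
    reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method=method)
    print(method, [round(p, 3) for p in p_adj], list(reject))
```

Holm's method is uniformly more powerful than plain Bonferroni while controlling the same family-wise error rate, which is why the answer calls step-down methods "better".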
Given your data, I would recommend some kind of iterative model-selection technique. In R, for instance, the stepAIC
function performs stepwise model selection by exact AIC. You can also estimate the relative importance of your predictors based on their contribution to $R^2$ using the bootstrap (see the relaimpo package). I think that reporting an effect-size measure or the % of explained variance is more informative than a p-value, especially in a confirmatory model.
It should be noted that stepwise approaches also have their drawbacks (e.g., Wald tests are not adapted to the conditional hypotheses induced by the stepwise procedure). As Frank Harrell put it on the R mailing list, "stepwise variable selection based on AIC has all the problems of stepwise variable selection based on P-values. AIC is just a restatement of the P-Value" (though AIC remains useful if the set of predictors is already defined). A related question -- Is a variable significant in a linear regression model? -- raised interesting comments (from @Rob, among others) about the use of AIC for variable selection. I append a couple of references at the end (including papers kindly provided by @Stephan); there are also many other references on P.Mean.
Frank Harrell authored a book, Regression Modeling Strategies, which includes a lot of discussion and advice around this problem (§4.3, pp. 56-60). He has also developed efficient R routines for generalized linear models (see the Design or rms packages). So I think you should definitely take a look at it (his handouts are available on his homepage).
References
- Whittingham, MJ, Stephens, P, Bradbury, RB, and Freckleton, RP (2006). Why do we still use stepwise modelling in ecology and behaviour? Journal of Animal Ecology, 75, 1182-1189.
- Austin, PC (2008). Bootstrap model selection had similar performance for selecting authentic and noise variables compared to backward variable elimination: a simulation study. Journal of Clinical Epidemiology, 61(10), 1009-1017.
- Austin, PC and Tu, JV (2004). Automated variable selection methods for logistic regression produced unstable models for predicting acute myocardial infarction mortality. Journal of Clinical Epidemiology, 57, 1138–1146.
- Greenland, S (1994). Hierarchical regression for epidemiologic analyses of multiple exposures. Environmental Health Perspectives, 102(Suppl 8), 33–39.
- Greenland, S (2008). Multiple comparisons and association selection in general epidemiology. International Journal of Epidemiology, 37(3), 430-434.
- Beyene, J, Atenafu, EG, Hamid, JS, To, T, and Sung L (2009). Determining relative importance of variables in developing and validating predictive models. BMC Medical Research Methodology, 9, 64.
- Bursac, Z, Gauss, CH, Williams, DK, and Hosmer, DW (2008). Purposeful selection of variables in logistic regression. Source Code for Biology and Medicine, 3, 17.
- Brombin, C, Finos, L, and Salmaso, L (2007). Adjusting stepwise p-values in generalized linear models. International Conference on Multiple Comparison Procedures. -- see step.adj() in the R someMTP package.
- Wiegand, RE (2010). Performance of using multiple stepwise algorithms for variable selection. Statistics in Medicine, 29(15), 1647–1659.
- Moons KG, Donders AR, Steyerberg EW, and Harrell FE (2004). Penalized Maximum Likelihood Estimation to predict binary outcomes. Journal of Clinical Epidemiology, 57(12), 1262–1270.
- Tibshirani, R (1996). Regression shrinkage and selection via the lasso. Journal of The Royal Statistical Society B, 58(1), 267–288.
- Efron, B, Hastie, T, Johnstone, I, and Tibshirani, R (2004). Least Angle Regression. Annals of Statistics, 32(2), 407-499.
- Flom, PL and Cassell, DL (2007). Stopping Stepwise: Why stepwise and similar selection methods are bad, and what you should use. NESUG 2007 Proceedings.
- Shtatland, E.S., Cain, E., and Barton, M.B. (2001). The perils of stepwise logistic regression and how to escape them using information criteria and the Output Delivery System. SUGI 26 Proceedings (pp. 222–226).
You can just compute the two separate correlations and square them, or run two separate models and take their R^2. These will only sum to the full model's R^2 if the predictors are orthogonal.
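A quick numerical check of that claim, on simulated data (plain numpy; the `r_squared` helper is my own, computing ordinary OLS R²):

```python
# Per-predictor R² values add up to the joint R² only when the predictors
# are orthogonal; with correlated predictors the sum overstates the joint R².
import numpy as np

def r_squared(X, y):
    """R² of an OLS fit of y on X (intercept included)."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1 - resid.var() / y.var()

rng = np.random.default_rng(42)
n = 1000

# Case 1: (nearly) orthogonal predictors -- sum of R²'s ≈ joint R²
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = x1 + x2 + rng.normal(size=n)
sum_orth = r_squared(x1, y) + r_squared(x2, y)
joint_orth = r_squared(np.column_stack([x1, x2]), y)
print(sum_orth, joint_orth)

# Case 2: correlated predictors -- sum of R²'s exceeds joint R²
x2c = 0.8 * x1 + 0.6 * rng.normal(size=n)
yc = x1 + x2c + rng.normal(size=n)
sum_corr = r_squared(x1, yc) + r_squared(x2c, yc)
joint_corr = r_squared(np.column_stack([x1, x2c]), yc)
print(sum_corr, joint_corr)
```

The excess in the correlated case is exactly the variance that the two predictors "claim" twice, which is why partitioning R² among correlated predictors needs dedicated methods (e.g. the relaimpo approach mentioned above).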
Best Answer
There are multiple ways to determine the best predictor:

- One of the easiest is to inspect the correlation matrix even before you run the regression; generally, the variable with the highest correlation with the response is a good predictor.
- You can also compare the coefficients to select the best predictor (make sure you standardize the data before running the regression, and compare the absolute values of the coefficients).
- You can also look at the change in the R-squared value: try removing each predictor in turn and see which one causes the largest drop.
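The three checks above can be sketched on toy data with scikit-learn (the data and variable names here are made up for illustration):

```python
# Three quick screens for the "best" predictor:
# (1) correlation with the response, (2) coefficients on standardized
# predictors, (3) drop in R² when each predictor is removed.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
n = 300
X = rng.normal(size=(n, 4))
y = 2.0 * X[:, 0] + 0.7 * X[:, 2] + rng.normal(size=n)
names = ["x1", "x2", "x3", "x4"]

# (1) zero-order correlation of each predictor with y
corrs = [np.corrcoef(X[:, j], y)[0, 1] for j in range(4)]

# (2) coefficients on standardized predictors (comparable scales)
Xs = StandardScaler().fit_transform(X)
beta = LinearRegression().fit(Xs, y).coef_

# (3) change in R² when one predictor is dropped from the full model
full_r2 = LinearRegression().fit(X, y).score(X, y)
delta_r2 = []
for j in range(4):
    Xj = np.delete(X, j, axis=1)
    delta_r2.append(full_r2 - LinearRegression().fit(Xj, y).score(Xj, y))

for row in zip(names, corrs, beta, delta_r2):
    print("{}: corr={:+.2f}  std-beta={:+.2f}  delta-R2={:.3f}".format(*row))
```

On uncorrelated predictors like these, all three criteria agree; with correlated predictors they can rank the variables differently, which is the asker's underlying difficulty.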