Solved – Stepwise regression – what are the steps in STATA

econometrics, regression, stata, stepwise regression

This question will probably seem very stupid, but hey, econometrics and statistics were never really my strong suits!
For my BA, my professor advised me to perform stepwise regression. My dependent variable is HIV prevalence (expressed between 0 and 1), and my independent variables include GDP per capita, school enrollment, unemployment, urban population rate, population growth, HCI, and spending on healthcare. Everything should be estimated on the mean. Could someone tell me:
1. Is there anything else I should do after performing both the forward and backward procedures in STATA? Will it be enough to look at the p-values and then run a regression with only those variables that turned out to be significant?
2. When it comes to diagnostic tests, which ones should I perform? Linearity, homoscedasticity, normality, and anything else?
3. What if my results turn out to be heteroscedastic, nonlinear, etc.? What can I do?
I will be very grateful for all the answers!

Best Answer

With 137 data points and only 8 predictors, there should be no need to do any predictor selection at all. A rough rule of thumb for ordinary least-squares regression is that you need about 10-20 observations per predictor to avoid overfitting. If your model doesn't include interactions among the predictors then you seem fine in that regard.

A danger in cutting down on the number of predictors is omitted-variable bias. If you omit a predictor that is associated both with the outcome and with the included predictors in a linear regression, the coefficient estimates for the included predictors will be biased. It seems likely that most of your predictors are correlated with each other, so this would be a serious risk in your case.
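The omitted-variable bias can be seen in a small simulation. This is a hypothetical Python sketch with made-up coefficients, not the poster's data: the true effect of x1 is 1, but dropping the correlated predictor x2 inflates the estimate.

```python
import numpy as np

# Synthetic data: y depends on two correlated predictors x1 and x2.
rng = np.random.default_rng(0)
n = 10_000
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(size=n)      # x2 is correlated with x1
y = 1.0 * x1 + 2.0 * x2 + rng.normal(size=n)

def ols(X, y):
    """OLS slope estimates via least squares (intercept included)."""
    X = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1:]

full = ols(np.column_stack([x1, x2]), y)   # both predictors: close to [1.0, 2.0]
short = ols(x1.reshape(-1, 1), y)          # x2 omitted: x1's coefficient absorbs
print(full, short)                         # part of x2's effect (about 1 + 2*0.8)
```

The short regression's coefficient on x1 converges to 1 + 2 × 0.8 = 2.6, not the true 1.0, which is exactly the bias described above.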

As a comment from @Tim rightly points out, if you do need to cut down on the number of predictors then stepwise regression is not a good choice. LASSO is a more principled approach: it penalizes the magnitudes of the regression coefficients, trading a little bias for protection against the overfitting that predictor selection entails. As this answer points out, even though LASSO typically performs better than other predictor-selection techniques, in situations like yours the best model will include all predictors, provided that you avoid overfitting.
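A minimal sketch of the idea, using scikit-learn in Python rather than STATA (the data are synthetic and the coefficients arbitrary; the predictor count just mirrors the question):

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the poster's data: 137 observations, 7 predictors.
rng = np.random.default_rng(1)
n = 137
X = rng.normal(size=(n, 7))
y = 0.5 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(scale=0.5, size=n)

# Standardize so the L1 penalty treats all predictors symmetrically,
# then let cross-validation choose the penalty strength alpha.
Xs = StandardScaler().fit_transform(X)
model = LassoCV(cv=5, random_state=0).fit(Xs, y)
print(model.alpha_)   # chosen penalty
print(model.coef_)    # some coefficients are shrunk toward (or exactly to) zero
```

The key design point is that the penalty, not a sequence of significance tests, decides how much each coefficient contributes, which is what makes LASSO more principled than stepwise selection.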

To learn more about LASSO, see An Introduction to Statistical Learning for a helpful introduction to that and many other techniques. LASSO is implemented in STATA and their website evidently links to video tutorials.

Be aware, however, that LASSO's choice of a set of predictors can be highly dependent on the particular data set at hand. When predictors are correlated with each other, a new data sample might well lead to a different set of chosen predictors. So be wary of jumping to the conclusion that any particular set of chosen predictors (whether by LASSO or another approach) constitutes the "really important ones." At best they are the important ones in your particular data sample.
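That instability is easy to see by refitting LASSO on bootstrap resamples. A hedged Python sketch with synthetic, deliberately correlated predictors (a fixed penalty is assumed for simplicity):

```python
import numpy as np
from sklearn.linear_model import Lasso

# Correlated predictors via a shared common factor; not real data.
rng = np.random.default_rng(2)
n, p = 137, 7
common = rng.normal(size=(n, 1))
X = 0.9 * common + 0.4 * rng.normal(size=(n, p))
y = X[:, 0] + X[:, 1] + rng.normal(size=n)

# Refit LASSO on bootstrap resamples and record which predictors survive.
selected = set()
for b in range(20):
    idx = rng.integers(0, n, size=n)           # resample rows with replacement
    fit = Lasso(alpha=0.1).fit(X[idx], y[idx])
    selected.add(tuple(np.flatnonzero(fit.coef_)))
print(len(selected), "distinct selected sets across 20 resamples")
```

With strongly correlated predictors, the selected sets typically differ across resamples, which is the reason not to over-interpret any single chosen set.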

In terms of diagnostic tests and what to do if they fail, see this page for a start. LASSO is a linear modeling technique, so linearity is important to document. Resolving linearity problems typically requires some transformation of the predictors or the outcome variable. As usually implemented, LASSO doesn't provide p-values anyway, so normality of residuals isn't critical.
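As one common example of such a transformation: a right-skewed predictor like GDP per capita often enters a linear model better on the log scale. A tiny Python sketch (the values below are made up for illustration):

```python
import numpy as np

# Hypothetical GDP per capita values spanning several orders of magnitude.
gdp_per_capita = np.array([500.0, 1200.0, 4800.0, 15000.0, 60000.0])

# Log-transforming compresses the long right tail, which often makes the
# relationship with the outcome closer to linear.
log_gdp = np.log(gdp_per_capita)
print(log_gdp.round(2))
```

Whether the log (or another transformation) is appropriate is an empirical question; residual plots before and after the transformation are the usual check.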