Solved – Stepwise regression – what are the steps in STATA

econometrics, regression, stata, stepwise regression

This question will probably seem very stupid, but hey, econometrics and statistics were never really my strong suits!
For my BA, my professor advised me to perform stepwise regression. My dependent variable is HIV prevalence (expressed between 0 and 1), and my independent variables include GDP per capita, school enrollment, unemployment, urban population rate, population growth, HCI, and spending on healthcare. Everything should be estimated on the mean. Could someone tell me:
1. Is there anything else I should do after performing both the forward and backward procedures in STATA? Will it be enough to look at the p-values and then run a regression with only those variables that turned out to be significant?
2. When it comes to diagnostic tests, which ones should I perform? Linearity, homoscedasticity, normality, and anything else?
3. What if my results turn out to be heteroscedastic, nonlinear, etc.? What can I do?
I will be very grateful for all the answers!

Best Answer

With 137 data points and only 8 predictors, there should be no need to do any predictor selection at all. A rough rule of thumb for ordinary least-squares regression is that you need about 10-20 observations per predictor to avoid overfitting. If your model doesn't include interactions among the predictors then you seem fine in that regard.

A danger in cutting down on the number of predictors is omitted-variable bias. If you omit a predictor that is associated both with the outcome and with the included predictors in a linear regression, the coefficient estimates for the included predictors will be biased. It seems likely that most of your predictors are correlated with each other, so this would be a serious risk in your case.
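The omitted-variable bias can be seen in a small simulation. This is a hypothetical Python sketch with made-up coefficients, not the poster's data: the true effect of x1 is 1, but dropping the correlated predictor x2 inflates the estimate.

```python
import numpy as np

# Synthetic data: y depends on two correlated predictors x1 and x2.
rng = np.random.default_rng(0)
n = 10_000
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(size=n)      # x2 is correlated with x1
y = 1.0 * x1 + 2.0 * x2 + rng.normal(size=n)

def ols(X, y):
    """OLS slope estimates via least squares (intercept included)."""
    X = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1:]

full = ols(np.column_stack([x1, x2]), y)   # both predictors: close to [1.0, 2.0]
short = ols(x1.reshape(-1, 1), y)          # x2 omitted: x1's coefficient absorbs
print(full, short)                         # part of x2's effect (about 1 + 2*0.8)
```

The short regression's coefficient on x1 converges to 1 + 2 × 0.8 = 2.6, not the true 1.0, which is exactly the bias described above.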

As a comment from @Tim rightly points out, if you do need to cut down on the number of predictors then stepwise regression is not a good choice. LASSO is a more principled approach: it penalizes the magnitudes of the regression coefficients, trading a little bias for protection against the overfitting that predictor selection entails. As this answer points out, even though LASSO typically performs better than other predictor-selection techniques, in situations like yours the best model will include all predictors, provided that you avoid overfitting.
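A minimal sketch of the idea, using scikit-learn in Python rather than STATA (the data are synthetic and the coefficients arbitrary; the predictor count just mirrors the question):

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the poster's data: 137 observations, 7 predictors.
rng = np.random.default_rng(1)
n = 137
X = rng.normal(size=(n, 7))
y = 0.5 * X[:, 0] - 0.3 * X[:, 1] + rng.normal(scale=0.5, size=n)

# Standardize so the L1 penalty treats all predictors symmetrically,
# then let cross-validation choose the penalty strength alpha.
Xs = StandardScaler().fit_transform(X)
model = LassoCV(cv=5, random_state=0).fit(Xs, y)
print(model.alpha_)   # chosen penalty
print(model.coef_)    # some coefficients are shrunk toward (or exactly to) zero
```

The key design point is that the penalty, not a sequence of significance tests, decides how much each coefficient contributes, which is what makes LASSO more principled than stepwise selection.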

To learn more about LASSO, see An Introduction to Statistical Learning for a helpful introduction to that and many other techniques. LASSO is implemented in STATA and their website evidently links to video tutorials.

Be aware, however, that LASSO's choice of a set of predictors can be highly dependent on the particular data set at hand. When predictors are correlated with each other, a new data sample might well lead to a different set of chosen predictors. So be wary of jumping to the conclusion that any particular set of chosen predictors (whether by LASSO or another approach) constitutes the "really important ones." At best they are the important ones in your particular data sample.
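That instability is easy to see by refitting LASSO on bootstrap resamples. A hedged Python sketch with synthetic, deliberately correlated predictors (a fixed penalty is assumed for simplicity):

```python
import numpy as np
from sklearn.linear_model import Lasso

# Correlated predictors via a shared common factor; not real data.
rng = np.random.default_rng(2)
n, p = 137, 7
common = rng.normal(size=(n, 1))
X = 0.9 * common + 0.4 * rng.normal(size=(n, p))
y = X[:, 0] + X[:, 1] + rng.normal(size=n)

# Refit LASSO on bootstrap resamples and record which predictors survive.
selected = set()
for b in range(20):
    idx = rng.integers(0, n, size=n)           # resample rows with replacement
    fit = Lasso(alpha=0.1).fit(X[idx], y[idx])
    selected.add(tuple(np.flatnonzero(fit.coef_)))
print(len(selected), "distinct selected sets across 20 resamples")
```

With strongly correlated predictors, the selected sets typically differ across resamples, which is the reason not to over-interpret any single chosen set.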

In terms of diagnostic tests and what to do if they fail, see this page for a start. LASSO is a linear modeling technique, so linearity is important to document. Resolving linearity problems typically requires some transformation of the predictors or the outcome variable. As usually implemented, LASSO doesn't provide p-values anyway, so normality of residuals isn't critical.
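As one common example of such a transformation: a right-skewed predictor like GDP per capita often enters a linear model better on the log scale. A tiny Python sketch (the values below are made up for illustration):

```python
import numpy as np

# Hypothetical GDP per capita values spanning several orders of magnitude.
gdp_per_capita = np.array([500.0, 1200.0, 4800.0, 15000.0, 60000.0])

# Log-transforming compresses the long right tail, which often makes the
# relationship with the outcome closer to linear.
log_gdp = np.log(gdp_per_capita)
print(log_gdp.round(2))
```

Whether the log (or another transformation) is appropriate is an empirical question; residual plots before and after the transformation are the usual check.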