I'd really appreciate help using Stata to perform a manual stepwise forward logistic regression.
I have 37 biologically plausible, statistically significant categorical variables linked to disease outcome. I need to end up with a final multivariable model.
I've added the first variable (most significant/most plausible) with corresponding OR output.
xi:logistic outcome i.variable1
. xi:logistic casecontrol i.breed_groupall
i.breed_group~l _Ibreed_gro_0-7 (naturally coded; _Ibreed_gro_0 omitted)
Logistic regression Number of obs = 995
LR chi2(6) = 83.87
Prob > chi2 = 0.0000
Log likelihood = -422.36813 Pseudo R2 = 0.0903
------------------------------------------------------------------------------
casecontrol | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_Ibreed_gr~2 | 1.757143 .6861797 1.44 0.149 .817345 3.777537
_Ibreed_gr~3 | 1.952381 .8811439 1.48 0.138 .8061249 4.728537
_Ibreed_gr~4 | 1.464286 .530121 1.05 0.292 .7202148 2.977074
_Ibreed_gr~5 | 6.192708 1.779109 6.35 0.000 3.526453 10.87485
_Ibreed_gr~6 | 3.880357 1.103611 4.77 0.000 2.222193 6.775816
_Ibreed_gr~7 | .636646 .2555236 -1.13 0.261 .2899083 1.398091
------------------------------------------------------------------------------
What do I look for to see if adding the second variable I choose means that both variables should stay in, when, for example I type;
xi:logistic outcome i.variable1 i.variable2
. xi:logistic casecontrol i.breed_groupall i.height_category
i.breed_group~l _Ibreed_gro_0-7 (naturally coded; _Ibreed_gro_0 omitted)
i.height_cate~y _Iheight_ca_0-4 (naturally coded; _Iheight_ca_0 omitted)
Logistic regression Number of obs = 992
LR chi2(10) = 132.25
Prob > chi2 = 0.0000
Log likelihood = -396.05629 Pseudo R2 = 0.1431
------------------------------------------------------------------------------
casecontrol | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_Ibreed_gr~2 | 1.002262 .4145417 0.01 0.996 .4455736 2.254465
_Ibreed_gr~3 | 1.185087 .557553 0.36 0.718 .4712828 2.980017
_Ibreed_gr~4 | 1.774162 .6616502 1.54 0.124 .8541793 3.685001
_Ibreed_gr~5 | 1.452416 .541494 1.00 0.317 .6994292 3.01605
_Ibreed_gr~6 | 1.256098 .4377793 0.65 0.513 .6343958 2.487064
_Ibreed_gr~7 | .4238999 .1777699 -2.05 0.041 .1863361 .9643388
_Iheight_c~1 | 2.780857 .918967 3.09 0.002 1.455088 5.314572
_Iheight_c~2 | 5.402833 2.246817 4.06 0.000 2.391342 12.20679
_Iheight_c~3 | 13.50715 5.989787 5.87 0.000 5.663642 32.21303
_Iheight_c~4 | 16.85605 8.674745 5.49 0.000 6.147467 46.21846
------------------------------------------------------------------------------
How do I know if I want to keep one, or both of these variables, or that one, or both of them is no use to me?
If for example, I want to keep both of these and add the 3rd variable, how do I know which?
. xi:logistic casecontrol i.height_category i.breed_groupall i.combinedweight
i.height_cate~y _Iheight_ca_0-4 (naturally coded; _Iheight_ca_0 omitted)
i.breed_group~l _Ibreed_gro_0-7 (naturally coded; _Ibreed_gro_0 omitted)
i.combinedwei~t _Icombinedw_0-5 (naturally coded; _Icombinedw_0 omitted)
Logistic regression Number of obs = 891
LR chi2(14) = 123.58
Prob > chi2 = 0.0000
Log likelihood = -346.82026 Pseudo R2 = 0.1512
------------------------------------------------------------------------------
casecontrol | Odds Ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
_Iheight_c~1 | 3.397418 1.235922 3.36 0.001 1.665316 6.931087
_Iheight_c~2 | 6.321891 3.119652 3.74 0.000 2.403289 16.62983
_Iheight_c~3 | 14.36312 7.797083 4.91 0.000 4.956448 41.62242
_Iheight_c~4 | 23.39157 15.34571 4.81 0.000 6.46607 84.62101
_Ibreed_gr~2 | .7339708 .3482777 -0.65 0.515 .2895834 1.860303
_Ibreed_gr~3 | 1.060443 .5318773 0.12 0.907 .396787 2.834113
_Ibreed_gr~4 | 1.644423 .6502556 1.26 0.208 .757569 3.569479
_Ibreed_gr~5 | 1.246412 .4940969 0.56 0.578 .5731024 2.71076
_Ibreed_gr~6 | 1.262449 .4661338 0.63 0.528 .6122442 2.603172
_Ibreed_gr~7 | .401331 .1782529 -2.06 0.040 .1680497 .9584456
_Icombined~1 | 1.21903 .7160999 0.34 0.736 .3854692 3.855132
_Icombined~2 | 1.238685 .3967314 0.67 0.504 .6612025 2.320531
_Icombined~4 | 1.764532 .5970935 1.68 0.093 .9090641 3.425031
_Icombined~5 | 2.107871 1.118772 1.40 0.160 .7448366 5.965227
------------------------------------------------------------------------------
Any help would be gratefully received.
(I am using STATA v9.1 I believe.)
Best Answer
I am assuming you know that the stepwise regression is a wrong approach (see Frank Harrell's terrific book, or just wait for his comments in this thread), and you are ready to face the criticism of the reviewers (or your dissertation committee, depending on your career stage). I am thus treating this as a programming exercise, rather than a rigorous methodological investigation.
Stata
stepwise
command does not support factor variables, as you have probably discovered already, so you'd have to rewrite its main functionality, at least at a descriptive level. I will make use of Ben Jann'sestadd
command published in Stata Journal.The last line gives you the answers (not terribly informative in this case, of course, as the sample sizes are quite a bit larger than yours).
Feel free to ask about specific commands in this code fragment. Of course, you'd modify this for your own data and estimation command of your liking. The above code assumes Stata 11 and factor variables; you have not stated what version of Stata you are using, which would've helped.