Solved – Assessing the effect of adding a variable using stepwise forward logistic regression using Stata

categorical datalogisticmultiple regressionstatastepwise regression

I'd really appreciate help using Stata to perform a manual stepwise forward logistic regression.

I have 37 biologically plausible, statistically significant categorical variables linked to disease outcome. I need to end up with a final multivariable model.

I've added the first variable (most significant/most plausible) with corresponding OR output.

xi:logistic outcome i.variable1

. xi:logistic casecontrol i.breed_groupall
i.breed_group~l   _Ibreed_gro_0-7     (naturally coded; _Ibreed_gro_0 omitted)

Logistic regression                               Number of obs   =        995
                                                  LR chi2(6)      =      83.87
                                                  Prob > chi2     =     0.0000
Log likelihood = -422.36813                       Pseudo R2       =     0.0903

------------------------------------------------------------------------------
 casecontrol | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
_Ibreed_gr~2 |   1.757143   .6861797     1.44   0.149      .817345    3.777537
_Ibreed_gr~3 |   1.952381   .8811439     1.48   0.138     .8061249    4.728537
_Ibreed_gr~4 |   1.464286    .530121     1.05   0.292     .7202148    2.977074
_Ibreed_gr~5 |   6.192708   1.779109     6.35   0.000     3.526453    10.87485
_Ibreed_gr~6 |   3.880357   1.103611     4.77   0.000     2.222193    6.775816
_Ibreed_gr~7 |    .636646   .2555236    -1.13   0.261     .2899083    1.398091
------------------------------------------------------------------------------

What do I look for to see if adding the second variable I choose means that both variables should stay in, when, for example I type;

xi:logistic outcome i.variable1 i.variable2


. xi:logistic casecontrol i.breed_groupall i.height_category
i.breed_group~l   _Ibreed_gro_0-7     (naturally coded; _Ibreed_gro_0 omitted)
i.height_cate~y   _Iheight_ca_0-4     (naturally coded; _Iheight_ca_0 omitted)

Logistic regression                               Number of obs   =        992
                                                  LR chi2(10)     =     132.25
                                                  Prob > chi2     =     0.0000
Log likelihood = -396.05629                       Pseudo R2       =     0.1431

------------------------------------------------------------------------------
 casecontrol | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
_Ibreed_gr~2 |   1.002262   .4145417     0.01   0.996     .4455736    2.254465
_Ibreed_gr~3 |   1.185087    .557553     0.36   0.718     .4712828    2.980017
_Ibreed_gr~4 |   1.774162   .6616502     1.54   0.124     .8541793    3.685001
_Ibreed_gr~5 |   1.452416    .541494     1.00   0.317     .6994292     3.01605
_Ibreed_gr~6 |   1.256098   .4377793     0.65   0.513     .6343958    2.487064
_Ibreed_gr~7 |   .4238999   .1777699    -2.05   0.041     .1863361    .9643388
_Iheight_c~1 |   2.780857    .918967     3.09   0.002     1.455088    5.314572
_Iheight_c~2 |   5.402833   2.246817     4.06   0.000     2.391342    12.20679
_Iheight_c~3 |   13.50715   5.989787     5.87   0.000     5.663642    32.21303
_Iheight_c~4 |   16.85605   8.674745     5.49   0.000     6.147467    46.21846
------------------------------------------------------------------------------

How do I know if I want to keep one, or both of these variables, or that one, or both of them is no use to me?

If for example, I want to keep both of these and add the 3rd variable, how do I know which?

. xi:logistic casecontrol i.height_category i.breed_groupall i.combinedweight

i.height_cate~y   _Iheight_ca_0-4     (naturally coded; _Iheight_ca_0 omitted)
i.breed_group~l   _Ibreed_gro_0-7     (naturally coded; _Ibreed_gro_0 omitted)
i.combinedwei~t   _Icombinedw_0-5     (naturally coded; _Icombinedw_0 omitted)

Logistic regression                               Number of obs   =        891
                                                  LR chi2(14)     =     123.58
                                                  Prob > chi2     =     0.0000
Log likelihood = -346.82026                       Pseudo R2       =     0.1512

------------------------------------------------------------------------------
 casecontrol | Odds Ratio   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
_Iheight_c~1 |   3.397418   1.235922     3.36   0.001     1.665316    6.931087
_Iheight_c~2 |   6.321891   3.119652     3.74   0.000     2.403289    16.62983
_Iheight_c~3 |   14.36312   7.797083     4.91   0.000     4.956448    41.62242
_Iheight_c~4 |   23.39157   15.34571     4.81   0.000      6.46607    84.62101
_Ibreed_gr~2 |   .7339708   .3482777    -0.65   0.515     .2895834    1.860303
_Ibreed_gr~3 |   1.060443   .5318773     0.12   0.907      .396787    2.834113
_Ibreed_gr~4 |   1.644423   .6502556     1.26   0.208      .757569    3.569479
_Ibreed_gr~5 |   1.246412   .4940969     0.56   0.578     .5731024     2.71076
_Ibreed_gr~6 |   1.262449   .4661338     0.63   0.528     .6122442    2.603172
_Ibreed_gr~7 |    .401331   .1782529    -2.06   0.040     .1680497    .9584456
_Icombined~1 |    1.21903   .7160999     0.34   0.736     .3854692    3.855132
_Icombined~2 |   1.238685   .3967314     0.67   0.504     .6612025    2.320531
_Icombined~4 |   1.764532   .5970935     1.68   0.093     .9090641    3.425031
_Icombined~5 |   2.107871   1.118772     1.40   0.160     .7448366    5.965227
------------------------------------------------------------------------------

Any help would be gratefully received.

(I am using STATA v9.1 I believe.)

Best Answer

I am assuming you know that the stepwise regression is a wrong approach (see Frank Harrell's terrific book, or just wait for his comments in this thread), and you are ready to face the criticism of the reviewers (or your dissertation committee, depending on your career stage). I am thus treating this as a programming exercise, rather than a rigorous methodological investigation.

Stata stepwise command does not support factor variables, as you have probably discovered already, so you'd have to rewrite its main functionality, at least at a descriptive level. I will make use of Ben Jann's estadd command published in Stata Journal.

    net sj 7-2 st0085_1
    net install st0085_1
    webuse nlswork, clear
    foreach catvar of varlist race grade ind_code occ_code {
      regress ln_wage age i.`catvar'
      levelsof `catvar', local( thelevels )
      tokenize `thelevels'
      local dotcat
      while "`1'"!="" {
        local dotcat `dotcat' `1'.`catvar'
        macro shift
      }
      test `dotcat'
      estadd scalar pnew = r(p)
      estimates store with_`catvar'
    }
    estimates tab with_* , stats( pnew )

The last line gives you the answers (not terribly informative in this case, of course, as the sample sizes are quite a bit larger than yours).

Feel free to ask about specific commands in this code fragment. Of course, you'd modify this for your own data and estimation command of your liking. The above code assumes Stata 11 and factor variables; you have not stated what version of Stata you are using, which would've helped.

Related Question