Solved – Comparing nested binary logistic regression models when $n$ is large

large-data, logistic, model-selection, r, regression

To better frame my question, I have provided some of the output from both a 16-variable model (fit) and a 17-variable model (fit2) below. All predictor variables in these models are continuous, and the only difference between the two models is that fit does not contain variable 17 (var17):

fit                    Model Likelihood     Discrimination    Rank Discrim.    
                         Ratio Test            Indexes          Indexes       
 Obs        102849    LR chi2   13602.84    R2       0.173    C       0.703    
  0          69833    d.f.            17    g        1.150    Dxy     0.407    
  1          33016    Pr(> chi2) <0.0001    gr       3.160    gamma   0.416    
 max |deriv| 3e-05                          gp       0.180    tau-a   0.177    
                                            Brier    0.190       


fit2                 Model Likelihood       Discrimination    Rank Discrim.    
                         Ratio Test            Indexes          Indexes       
 Obs        102849    LR chi2   13639.70    R2       0.174    C       0.703    
  0          69833    d.f.            18    g        1.154    Dxy     0.407    
  1          33016    Pr(> chi2) <0.0001    gr       3.170    gamma   0.412    
 max |deriv| 3e-05                          gp       0.180    tau-a   0.177    
                                            Brier    0.190          

I used Frank Harrell's rms package to build these lrm models. As you can see, these models do not appear to vary much, if at all, across the Discrimination Indexes and Rank Discrim. Indexes; however, lrtest(fit, fit2) gave the following results:

 L.R. Chisq         d.f.            P 
3.685374e+01     1.000000e+00    1.273315e-09 

As such, we would reject the null hypothesis of this likelihood ratio test. However, I assume this is largely due to the large sample size (n = 102849), as the models appear to perform very similarly. I am therefore interested in finding a better way of formally comparing nested binary logistic regression models when n is large.

I greatly appreciate any feedback, R scripts, or documentation that can steer me in the right direction in terms of comparing these types of nested models! Thanks!

Best Answer

(1) There is an extensive literature on why one should prefer full models to restricted/parsimonious models. In my understanding, there are few reasons to prefer the parsimonious model. However, larger models may not be feasible for many clinical applications.

(2) As far as I know, the Discrimination / Rank Discrimination indexes aren't (and arguably should not be) used as model/variable selection criteria. They aren't intended for this use, and as a result there may not be much literature on why they shouldn't be used for model building.

(3) Parsimonious models may have limitations that aren't readily apparent. They may be less well calibrated than larger models, and their external/internal validity may be reduced.

(4) The c-statistic may not be optimal in assessing models that predict future risk or stratify individuals into risk categories. In this setting, calibration is as important to the accurate assessment of risk. For example, a biomarker with an odds ratio of 3 may have little effect on the c-statistic, yet an increased level could shift the estimated 10-year cardiovascular risk for an individual patient from 8% to 24%.
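
To make the arithmetic behind that example concrete (my own sketch, not taken from Cook; the 8% baseline and the odds ratio of 3 are simply the numbers quoted above), the shift can be computed directly on the odds scale:

    ## Illustration only: apply an odds ratio to an individual's baseline odds
    ## and convert back to a probability.
    baseline_risk <- 0.08
    or            <- 3
    plogis(qlogis(baseline_risk) + log(or))   # roughly 0.21, a near-tripling of risk

Cook's 8%-to-24% figure comes from her specific example; the point is simply that a modest odds ratio can move an individual's predicted risk substantially while leaving the c-statistic almost unchanged.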

Cook NR. Use and misuse of the ROC curve in the medical literature. Circulation. 2007;115:928-935.

(5) The AUC/c-statistic/discrimination is known to be insensitive to the addition of significant predictor variables. This is discussed in the Cook reference above and was a motivating force behind the development of the net reclassification index.
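
For readers who want to see what the reclassification index actually computes, below is a minimal base-R sketch of the categorical net reclassification improvement. It is my own illustration: the 10%/20% cut-points are arbitrary, and p_old, p_new, and y are assumed vectors of predicted risks from the two models and the observed 0/1 outcome. Packages such as PredictABEL or nricens automate this calculation.

    ## Categorical NRI sketch (illustrative risk cut-points at 10% and 20%).
    nri_categorical <- function(p_old, p_new, y, cuts = c(0.10, 0.20)) {
      cat_old <- cut(p_old, breaks = c(0, cuts, 1), include.lowest = TRUE)
      cat_new <- cut(p_new, breaks = c(0, cuts, 1), include.lowest = TRUE)
      up   <- as.integer(cat_new) > as.integer(cat_old)   # moved to a higher risk category
      down <- as.integer(cat_new) < as.integer(cat_old)   # moved to a lower risk category
      ## Events should move up, non-events should move down.
      nri_events     <- mean(up[y == 1])   - mean(down[y == 1])
      nri_non_events <- mean(down[y == 0]) - mean(up[y == 0])
      c(events = nri_events, non_events = nri_non_events,
        overall = nri_events + nri_non_events)
    }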

(6) Large datasets can still lead to larger models than desired if standard variable selection methods are used. In stepwise selection procedures a p-value cut-off of 0.05 is often used, but there is nothing intrinsic about that value that means you should choose it. With smaller datasets a larger p-value (e.g., 0.2) may be more appropriate; in larger datasets a smaller p-value may be appropriate (0.01 was used for the GUSTO-I dataset for this reason).
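
If you do want backward selection with a stricter retention threshold on the fits from the question, the rms package's fastbw() can use a p-value-based rule; a minimal sketch (the 0.01 cut-off mirrors the GUSTO-I choice mentioned above):

    library(rms)

    ## Fast backward elimination from the full 17-predictor model (fit2),
    ## keeping a term only if its p-value is below 0.01.
    fastbw(fit2, rule = "p", sls = 0.01)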

(7) While AIC is often used for model selection, and is better supported by the literature, BIC may be a valid alternative in larger datasets. For BIC model selection the chi-squared for an added parameter must exceed log(n), so it will result in smaller models in larger datasets. (Mallows' Cp may have similar characteristics.)
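
As a rough illustration with the numbers already in the question (my sketch, not part of the original answer): the likelihood-ratio chi-squared for the added variable can be weighed directly against the log(n) penalty that BIC charges per parameter.

    ## BIC-style check using the lrtest output from the question.
    n           <- 102849
    chisq_var17 <- 36.85          # L.R. chi-squared on 1 d.f. from lrtest(fit, fit2)
    log(n)                        # ~11.54, the BIC "price" of one extra parameter
    chisq_var17 - log(n)          # positive, so even BIC favours the larger model here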

(8) But if you just want a maximum of 10 or 12 variables, the easier solution is something like the bestglm or leaps packages, where you simply set the maximum number of variables you want to consider.
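
A minimal sketch with bestglm, assuming a data frame d whose last column is the 0/1 outcome (bestglm expects the response in the final column); the name d and the 12-variable cap are illustrative, not taken from the question.

    ## install.packages("bestglm")  # if needed
    library(bestglm)

    Xy <- d                              # predictors first, 0/1 outcome in the last column
    best <- bestglm(Xy,
                    family = binomial,   # logistic regression
                    IC     = "BIC",      # log(n) penalty, suited to a large n
                    nvmax  = 12)         # consider at most 12 predictors
    best$BestModel                       # the selected glm fit

Note that for a binomial family bestglm fits every candidate subset, which can be slow with 17 predictors and roughly 100,000 rows.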

(9) If you just want a test that will make the two models look the same, and aren't too worried about the details, you could compare the AUCs of the two models. Some packages will even give you a p-value for the comparison. This doesn't seem advisable.
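
If you did want that comparison anyway, one common route is DeLong's test for paired ROC curves via the pROC package. A sketch, assuming an outcome vector y alongside the fits from the question; as noted, a non-significant result here is weak evidence that the models are interchangeable.

    library(pROC)
    library(rms)

    ## Predicted probabilities from the two lrm fits
    ## (equivalently, plogis(fit$linear.predictors) for a binary lrm).
    p1 <- predict(fit,  type = "fitted")
    p2 <- predict(fit2, type = "fitted")

    roc1 <- roc(y, p1)
    roc2 <- roc(y, p2)
    roc.test(roc1, roc2, method = "delong")   # DeLong test for two correlated ROC curves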

Ambler G (2002). Simplifying a prognostic model: a simulation study based on clinical data.
Cook NR. Use and misuse of the ROC curve in the medical literature. Circulation. 2007;115:928-935.
Gail MH, Pfeiffer RM. On criteria for evaluating models of absolute risk. Biostatistics. 2005;6:227-239.

(10) Once the model has been built, c-statistics/discrimination indexes may not be the best approach to comparing models and have well-documented limitations. Comparisons should likely, at a minimum, also include calibration and a reclassification index.

Steyerberg EW (2010). Assessing the performance of prediction models: a framework for traditional and novel measures.
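
As one way to act on point (10) with the fits from the question, rms itself provides calibration tools. A sketch, assuming an outcome vector y, and (for calibrate) that the model was fit with x = TRUE, y = TRUE:

    library(rms)

    ## Apparent calibration of each model's predictions against the outcome.
    p1 <- predict(fit,  type = "fitted")
    p2 <- predict(fit2, type = "fitted")
    val.prob(p = p1, y = y)   # calibration intercept/slope, Brier score, c-index, plot
    val.prob(p = p2, y = y)

    ## Bootstrap overfitting-corrected calibration curve for the larger model.
    cal2 <- calibrate(fit2, B = 200)
    plot(cal2)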

(11) It may be a good idea to go beyond the above and use decision-analytic measures.

Vickers AJ, Elkin EB. Decision curve analysis: a novel method for evaluating prediction models. Med Decis Making. 2006;26:565-74.
Baker SG, Cook NR, Vickers A, Kramer BS. Using relative utility curves to evaluate risk prediction. J R Stat Soc A. 2009;172:729-48.
Van Calster B, Vickers AJ, Pencina MJ, Baker SG, Timmerman D, Steyerberg EW. Evaluation of Markers and Risk Prediction Models: Overview of Relationships between NRI and Decision-Analytic Measures. Med Decis Making. 2013;33:490-501.
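
For readers who want to see what decision curve analysis computes, here is a hand-rolled sketch of the net-benefit formula from Vickers & Elkin; packages such as rmda or dcurves automate this. The outcome vector y, the predicted probabilities p1 and p2 (as in the earlier sketches), and the threshold grid are assumptions for illustration.

    ## Net benefit of a model at threshold probability pt:
    ##   NB(pt) = TP/n - FP/n * pt/(1 - pt)
    net_benefit <- function(p, y, pt) {
      treat <- p >= pt
      tp <- sum(treat & y == 1)
      fp <- sum(treat & y == 0)
      (tp - fp * pt / (1 - pt)) / length(y)
    }

    thresholds <- seq(0.05, 0.50, by = 0.05)
    nb_fit  <- sapply(thresholds, function(t) net_benefit(p1, y, t))
    nb_fit2 <- sapply(thresholds, function(t) net_benefit(p2, y, t))
    nb_all  <- sapply(thresholds, function(t) net_benefit(rep(1, length(y)), y, t))  # treat-all strategy
    cbind(threshold = thresholds, fit = nb_fit, fit2 = nb_fit2, treat_all = nb_all)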

---Update--- I find the Vickers article the most interesting, but this approach still hasn't been widely accepted despite many editorials, so it may not be of much practical use. The Cook and Steyerberg articles are much more practical.

No one likes stepwise selection, and I'm certainly not going to advocate for it. I would emphasize, though, that most of the criticisms of stepwise selection assume EPV (events per variable) < 50 and a choice between a full or pre-specified model and a reduced model. If EPV > 50 and there is a commitment to a reduced model, the cost-benefit analysis may be different.

The weak reasoning behind comparing c-statistics is that they may not differ, and I seem to remember that test being significantly underpowered. But I can't find the reference now, so I might be way off base on that.
