Solved – How to do external validation of logistic regression models and perform model benchmarking

aic, logistic, predictive-models, roc, validation

Quality assessment in trauma has for more than 25 years been based on a US-derived logistic regression model, the TRISS model. The DV is survival/death; the IVs are physiologic derangement (continuous), anatomic injury (continuous), and age (dichotomized at 55: >= 55 vs. < 55). The probability of survival (Ps) is calculated for each patient:

$$
Ps = \frac{\exp\big(b_0 + b_1{\rm (physiology)} + b_2{\rm (anatomy)} + b_3{\rm (age)}\big)}{1+\exp\big(b_0 + b_1{\rm (physiology)} + b_2{\rm (anatomy)} + b_3{\rm (age)}\big)}
$$
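For concreteness, a minimal R sketch of this calculation is shown below. The coefficient values and variable names are placeholders of my own choosing, not the published TRISS coefficients; substitute whatever your registry actually uses.

```r
## Minimal sketch of the TRISS-style Ps calculation.
## b0..b3 are placeholder values, NOT the published TRISS coefficients.
b0 <- -1.0; b1 <- 0.9; b2 <- -0.08; b3 <- -1.9

triss_ps <- function(physiology, anatomy, age55) {
  lp <- b0 + b1 * physiology + b2 * anatomy + b3 * age55
  plogis(lp)  # exp(lp) / (1 + exp(lp))
}

triss_ps(physiology = 6.5, anatomy = 25, age55 = 1)  # one hypothetical patient
```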

Used worldwide with the US-derived regression coefficients, which were updated in 2005 and 2009, the model has had a huge impact on trauma research. Foreign institutions have been able to benchmark their performance against the US standard using the W-statistic (with 95% CI), which expresses the excess (or deficit of) survivors per 100 patients at their own institution relative to the US standard: W = (actual number of survivors - predicted number of survivors) / (number of patients / 100). Some countries have kept the same DV and IVs but derived their own regression coefficients from their own population. Others have derived entirely different logistic regression models.
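A sketch of how W and an approximate 95% CI might be computed is given below. It assumes `ps` holds the reference model's predicted survival probabilities and `survived` the observed 0/1 outcomes for the same patients; the standard error here is one common choice, based on the binomial variance of the predicted number of survivors and valid only if the reference model's probabilities are correct.

```r
## Sketch of the W-statistic with an approximate 95% CI.
## `ps` and `survived` are assumed to exist already (same patients, same order).
w_statistic <- function(ps, survived) {
  n  <- length(ps)
  W  <- (sum(survived) - sum(ps)) / (n / 100)
  # Approximate SE from the binomial variance of the predicted number of
  # survivors (assumes the reference model's probabilities are correct)
  se <- sqrt(sum(ps * (1 - ps))) / (n / 100)
  c(W = W, lower = W - 1.96 * se, upper = W + 1.96 * se)
}
```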

We have recently derived our own logistic regression model with the same DV but different IVs. We have also implemented the TRISS model in our registry with the same IVs as the US model, but with coefficients re-estimated from our own trauma population.
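A sketch of that re-estimation step, assuming a data frame `registry` with illustrative column names `survived` (0/1), `physiology`, `anatomy`, and `age55` (0/1):

```r
## Sketch: re-estimate the TRISS-style coefficients from a local registry.
## `registry` and its column names are illustrative assumptions.
library(rms)

fit_local <- lrm(survived ~ physiology + anatomy + age55, data = registry)
fit_local         # deviance-based statistics, c-index, etc.
coef(fit_local)   # locally derived b0..b3 for the Ps formula above
```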

Questions:

  1. I want to perform an external validation of the US model with the national US IV regression coefficients and compare its predictive ability/performance with our own model. How should we perform such a comparison: compare ROC curves, deviance, or AIC (a sketch of such metrics follows this list)? Is it a meaningful comparison?

  2. I want to perform an external validation of the US model, but with IV regression coefficients derived from our own trauma population, and again compare its predictive ability/performance with our own model. How should we perform such a comparison? Compare ROC curves, deviance, or AIC? Is this a more relevant comparison? How can we tell whether the differences are significant?

  3. What about using Net Reclassification Improvement?

  4. I want to decide which model fits our data best. How?

  5. Can other institutions use our model for benchmarking?
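Regarding questions 1, 2, and 4, the sketch below illustrates the kinds of comparisons asked about (it is not an endorsement of any particular metric). It assumes `ps_us` and `ps_own` are the two models' predicted survival probabilities for the same patients and `survived` is the observed outcome. Note that AIC is only comparable between models fitted to the same data; for fixed, externally derived coefficients one can instead compare the out-of-sample deviance, the Brier score, and the AUCs.

```r
## Illustrative comparison of two sets of predictions on the same patients.
## `ps_us`, `ps_own`, and `survived` are assumed to exist already.
library(pROC)

roc_us  <- roc(survived, ps_us)
roc_own <- roc(survived, ps_own)
roc.test(roc_us, roc_own)   # DeLong test for two correlated AUCs

brier <- function(p, y) mean((p - y)^2)
dev2  <- function(p, y) -2 * sum(y * log(p) + (1 - y) * log(1 - p))  # deviance

c(auc_us   = auc(roc_us),             auc_own   = auc(roc_own),
  brier_us = brier(ps_us, survived),  brier_own = brier(ps_own, survived),
  dev_us   = dev2(ps_us, survived),   dev_own   = dev2(ps_own, survived))
```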

Best Answer

It is important to note that the model you specified has no face validity, and it can easily be shown to be miscalibrated for patients in specific age ranges. The discontinuity the model imposes by dichotomizing age at 55 does not occur in nature.

ROC curves have nothing to do with model validation. High-resolution nonparametric calibration curves are all-important here. The val.prob function in the R rms package provides these plus many relevant statistics, including the powerful Spiegelhalter test of calibration accuracy. Be sure to (in)validate the model's calibration within several age ranges.
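A sketch of how val.prob might be applied here, assuming `ps` are the predicted survival probabilities being validated, `survived` the observed 0/1 outcomes, and `age` the patients' ages (the age cut points are illustrative):

```r
## External calibration assessment with rms::val.prob.
library(rms)

v <- val.prob(ps, survived)   # calibration curve plot plus statistics,
v                             # including calibration intercept/slope,
                              # Brier score, and the Spiegelhalter test

## Repeat within age strata to check whether the dichotomized-age model
## is miscalibrated in particular age ranges (cut points are illustrative)
strata <- split(seq_along(age), cut(age, c(0, 40, 55, 70, Inf)))
sapply(strata, function(i) val.prob(ps[i], survived[i], pl = FALSE))
```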