I performed multivariable logistic regression with the dependent variable Y being death at a nursing home within a certain period of entry, and got the following results (note: variables whose names start with A are continuous, while those starting with B are categorical):
Call:
glm(Y ~ A1 + B2 + B3 + B4 + B5 + A6 + A7 + A8 + A9, data=mydata, family=binomial)
Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.0728  -0.2167  -0.1588  -0.1193   3.7788
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 20.048631 6.036637 3.321 0.000896 ***
A1 0.051167 0.016942 3.020 0.002527 **
B2 -0.664940 0.304299 -2.185 0.028878 *
B3 -2.825281 0.633072 -4.463 8.09e-06 ***
B4 -2.547931 0.957784 -2.660 0.007809 **
B5 -2.862460 1.385118 -2.067 0.038774 *
A6 -0.129808 0.041286 -3.144 0.001666 **
A7 0.020016 0.009456 2.117 0.034276 *
A8 -0.707924 0.253396 -2.794 0.005210 **
A9 0.003453 0.001549 2.229 0.025837 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 485.10 on 2206 degrees of freedom
Residual deviance: 417.28 on 2197 degrees of freedom
AIC: 437.28
Number of Fisher Scoring iterations: 7
Odds ratios (exp(coef)) with 95% confidence intervals:
                       OR         2.5 %        97.5 %
(Intercept)  5.093426e+08  3.703525e+03  7.004944e+13
A1           1.052499e+00  1.018123e+00  1.088035e+00
B2           5.143045e-01  2.832698e-01  9.337710e-01
B3           5.929197e-02  1.714448e-02  2.050537e-01
B4           7.824340e-02  1.197238e-02  5.113460e-01
B5           5.712806e-02  3.782990e-03  8.627079e-01
A6           8.782641e-01  8.099945e-01  9.522876e-01
A7           1.020218e+00  1.001484e+00  1.039302e+00
A8           4.926657e-01  2.998207e-01  8.095488e-01
A9           1.003459e+00  1.000416e+00  1.006510e+00
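For reference, tables like the ones above are presumably produced by exponentiating the fitted log-odds coefficients and their profile-likelihood confidence limits. A minimal sketch on simulated data (the original `mydata` is not available, so the variables and effect sizes here are made up):

```r
# Simulated stand-in for the questioner's data
set.seed(1)
n <- 500
mydata <- data.frame(A1 = rnorm(n), B2 = factor(rbinom(n, 1, 0.5)))
mydata$Y <- rbinom(n, 1, plogis(-2 + 0.5 * mydata$A1 - 0.7 * (mydata$B2 == 1)))

fit <- glm(Y ~ A1 + B2, data = mydata, family = binomial)

exp(coef(fit))     # odds ratios
exp(confint(fit))  # 95% profile-likelihood confidence intervals
```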
As you can see, all of the variables are "significant" in that their p-values fall below the usual threshold of 0.05. However, looking at the coefficients, I'm not quite sure what to make of these results. It seems that although these variables contribute to the model, judging by the odds ratios they don't really seem to have much predictive power. Of note, when I calculated the AUC, I got approximately 0.8.
Can I say that this model is better at predicting against mortality (e.g. predicting that seniors will live past the prescribed period) than at predicting for mortality?
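For what it's worth, the AUC (c-index) can be computed without extra packages as the proportion of death/survivor pairs in which the death received the higher predicted risk, counting ties as one half. A sketch on simulated data, since the original data and the tool used for the AUC of 0.8 are not known:

```r
# Simulated stand-in data; variable names are hypothetical
set.seed(1)
n <- 500
mydata <- data.frame(A1 = rnorm(n), A6 = rnorm(n))
mydata$Y <- rbinom(n, 1, plogis(-2 + 0.6 * mydata$A1 - 0.5 * mydata$A6))

fit <- glm(Y ~ A1 + A6, data = mydata, family = binomial)
p <- predict(fit, type = "response")

# c-index: P(predicted risk of a random death > that of a random survivor)
pos <- p[mydata$Y == 1]
neg <- p[mydata$Y == 0]
cindex <- mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))
cindex
```

Note that the c-index treats the two classes symmetrically: relabelling survival as the event and using $1-p$ as the predicted risk yields the same value, so an AUC of 0.8 does not by itself say the model predicts survival better than death.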
Best Answer
I would suggest using Frank Harrell's excellent rms package. It contains many useful functions to validate and calibrate your model. As far as I know, you cannot assess predictive performance based on the coefficients alone. Further, I would suggest using the bootstrap to validate the model. The AUC, or concordance index (c-index), is a useful measure of predictive performance. A c-index of $0.8$ is quite high, but as with many predictive models, the apparent fit of your model is likely overoptimistic (overfitting). This overoptimism can be assessed using the bootstrap. But let me give an example:
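A minimal `rms` sketch of the kind of fit discussed below, on simulated data (the original fit and data are not shown here, and the rms package is assumed to be installed). Printing an `lrm` fit displays the coefficients with their $p$-values alongside several discrimination indices:

```r
library(rms)  # Frank Harrell's rms package (assumed installed)

# Simulated stand-in data; variable names are hypothetical
set.seed(1)
n <- 500
d <- data.frame(A1 = rnorm(n), A6 = rnorm(n))
d$Y <- rbinom(n, 1, plogis(-2 + 0.5 * d$A1 - 0.4 * d$A6))

# x = TRUE, y = TRUE keep the design matrix and response inside the fit,
# which validate() and calibrate() need later
fit <- lrm(Y ~ A1 + A6, data = d, x = TRUE, y = TRUE)
print(fit)  # coefficients with p-values, plus discrimination indices (C, Dxy, ...)
```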
On the bottom you see the usual regression coefficients with their corresponding $p$-values. On the top right, you see several discrimination indices. `C` denotes the c-index (AUC): a c-index of $0.5$ corresponds to random splitting, whereas a c-index of $1$ denotes perfect prediction. `Dxy` is Somers' $D_{xy}$, the rank correlation between the predicted probabilities and the observed responses. $D_{xy}$ has a simple relationship with the c-index: $D_{xy}=2(c-0.5)$. A $D_{xy}$ of $0$ occurs when the model's predictions are random, and when $D_{xy}=1$ the model discriminates perfectly. In this case, the c-index is $0.693$, which is slightly better than chance, but a c-index of $>0.8$ is usually considered good enough for predicting the outcomes of individuals. As said above, the model is likely overoptimistic. We now use the bootstrap to quantify the optimism:
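A sketch of that step, again on simulated data with rms assumed installed: `validate()` refits the model on bootstrap resamples and estimates how much each index is inflated by evaluating the model on the data it was fitted to.

```r
library(rms)

# Simulated stand-in data; the original data are not available
set.seed(1)
n <- 500
d <- data.frame(A1 = rnorm(n), A6 = rnorm(n))
d$Y <- rbinom(n, 1, plogis(-2 + 0.5 * d$A1 - 0.4 * d$A6))
fit <- lrm(Y ~ A1 + A6, data = d, x = TRUE, y = TRUE)  # x/y needed by validate()

# 200 bootstrap resamples; each row of the output is an index (Dxy, R2, ...)
# with columns index.orig, optimism and index.corrected
v <- validate(fit, method = "boot", B = 200)
v["Dxy", c("index.orig", "optimism", "index.corrected")]

# Convert the bias-corrected Dxy back to a c-index
(1 + v["Dxy", "index.corrected"]) / 2
```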
Let's concentrate on $D_{xy}$, which is in the top row. The first column gives the original index, which was $0.3857$. The column called `optimism` gives the estimated amount of overestimation by the model, and the column `index.corrected` is the original estimate minus the optimism. In this case, the bias-corrected $D_{xy}$ is a bit smaller than the original. The bias-corrected c-index (AUC) is $c=\frac{1+D_{xy}}{2}=0.6749$. We can also calculate a calibration curve using resampling:
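A sketch of the calibration step under the same assumptions (simulated data, rms installed): `calibrate()` uses bootstrap resampling to compare the predicted probabilities against the observed proportions and to correct the curve for optimism.

```r
library(rms)

# Simulated stand-in data; the original data are not available
set.seed(1)
n <- 500
d <- data.frame(A1 = rnorm(n), A6 = rnorm(n))
d$Y <- rbinom(n, 1, plogis(-2 + 0.5 * d$A1 - 0.4 * d$A6))
fit <- lrm(Y ~ A1 + A6, data = d, x = TRUE, y = TRUE)  # x/y needed by calibrate()

# Bootstrap calibration curve: predicted probability vs. observed proportion,
# with a bias-corrected estimate overlaid; ideal calibration is the diagonal
cal <- calibrate(fit, method = "boot", B = 200)
plot(cal)
```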
The plot provides some evidence that our model is overfitting: it underestimates low probabilities and overestimates high probabilities. There is also a systematic overestimation around $0.3$.
Predictive model building is a big topic and I suggest reading Frank Harrell's course notes.