Solved – a “good fit” Brier score and Harrell’s C Index

brier-score, cox-model, glmnet, goodness-of-fit, survival

This is a question I originally posted on R-help, but it is better suited here. I will post the question and the answer I received from Dr. Winsemius, and I would be most grateful for any additional answers you can provide.

I am evaluating survival models using the Brier score (via “peperr”) and Harrell’s C-index (via “Hmisc”); a minimal sketch of how I compute these follows the questions.
I am wondering:

  1. What would be considered a “good fit” according to these scores (like the heuristic levels we have for $R^2$ in linear regression)?

  2. Are there any papers to cite on the matter (I couldn’t find any)?

  3. Is there any paper to cite that discusses the limitations of traditional reporting of model fit in survival analysis, as opposed to these measures?
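For reference, a minimal sketch of the two computations on the survival package's lung data. The Hmisc call is essentially what I use; the Brier score shown here is a simplified apparent version at one fixed time that ignores censoring, not peperr's resampling-based estimate, so treat it as illustration only.

```r
library(survival)  # Surv, coxph, survfit; the lung data ships with survival
library(Hmisc)     # rcorr.cens for Harrell's C

d <- na.omit(lung[, c("time", "status", "age", "sex", "ph.ecog")])
fit <- coxph(Surv(time, status) ~ age + sex + ph.ecog, data = d)

## Harrell's C: rcorr.cens expects "higher predictor = longer survival",
## hence the minus sign on the linear predictor (higher lp = higher risk)
rcorr.cens(-predict(fit, type = "lp"), Surv(d$time, d$status))["C Index"]

## Simplified apparent Brier score at t = 365 days, ignoring censoring
## (in lung, status == 2 is death; peperr's resampling estimate is what
## one would actually report, this only shows the idea)
p_surv <- summary(survfit(fit, newdata = d), times = 365)$surv
obs <- with(d, ifelse(time > 365, 1, ifelse(status == 2, 0, NA)))
mean((obs - p_surv)^2, na.rm = TRUE)
```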

Dr. David Winsemius replied:

  1. Frank Harrell's excellent text "Regression Modeling Strategies" has an extensive discussion of "goodness of fit" and the principles of model comparison. It is both too involved and too off-topic for R-help. The other text to consult is Steyerberg's "Clinical Prediction Models".

  2. I predict that the RMS bibliography would be an excellent place to start your search.

Despite having his name attached to what he calls the 'c-index', I don't think one could call Frank Harrell a proponent of that measure or of any of its "competitors". It is really just a dressed-up, transformed AUC. The message I have taken from reading his book and listening to his presentations is that one should apply tests of biological sensibility as well as careful investigation of the functional relationships between candidate predictors and the outcomes of interest. He speaks very disparagingly of automatic procedures.

Best Answer

Prior CV postings on the matter of GOF measures in generalized linear models:

  - Find out pseudo R square value for a Logistic Regression analysis
  - Which pseudo-$R^2$ measure is the one to report for logistic regression (Cox & Snell or Nagelkerke)?
  - Addressing model uncertainty
  - Compare classifiers based on AUROC or accuracy?

"Goodness of fit" is an elusive notion. Any set of data can be perfectly fit with a complex, saturated model, but such a model will generally be useless despite being perfect. Application of such tests often completely ignores what the model that is being fit to. I find it rather strange that Anderson-Darling and Kolmogorov-Smirnov tests are being called "goodness of fit tests" when they are really being used as "tests of normality".

Models need to be both validated and calibrated, and the GOF measures generally tell you very little about those aspects. (It should be noted in passing that the 'rms' function print.cph also reports the Brier score along with a pseudo-$R^2$ and Somers' $D$ as "discrimination indexes". It does not report the c-index, perhaps because Somers' $D$ is equivalent ($D_{xy} = 2(c - 0.5)$), preceded it historically, and Harrell is tired of people misusing the c-index.)
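A hedged sketch of that behavior, again with the lung data; I am assuming the "Dxy" entry of the cph fit's stats vector holds Somers' $D$, which lets you recover the c-index by hand.

```r
library(survival)
library(rms)   # cph; its print method shows the discrimination indexes

d <- na.omit(lung[, c("time", "status", "age", "sex", "ph.ecog")])
fit <- cph(Surv(time, status) ~ age + sex + ph.ecog, data = d,
           x = TRUE, y = TRUE)
print(fit)     # shows Dxy and R2 among the indexes; no c-index

## Recover the c-index via Dxy = 2*(c - 0.5); abs() sidesteps the
## sign convention of the linear predictor (higher lp = higher risk)
unname(0.5 + abs(fit$stats["Dxy"]) / 2)
```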

You will note that Frank told you, in an earlier R-help thread, that your proposed strategy of taking a "best" glmnet model and then applying stepwise forward and backward reduction was bad statistical practice. Part of the problem is that you were taking the result of a method optimized for prediction (penalized glmnet) and then applying a procedure that in all probability lowered its predictive capacity.
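A minimal sketch of the contrast, with simulated data and variable names of my own choosing: cv.glmnet tunes the penalty by cross-validated partial likelihood, i.e. for prediction, and the penalized linear predictor is the prediction. Refitting the selected variables by stepwise selection would throw that shrinkage away.

```r
library(glmnet)

set.seed(42)
n <- 200; p <- 30
x <- matrix(rnorm(n * p), n, p)
time   <- rexp(n, rate = exp(0.5 * x[, 1] - 0.5 * x[, 2]))
status <- rbinom(n, 1, 0.7)                 # ~30% censoring
y <- cbind(time = time, status = status)    # glmnet's "cox" response format

## lambda tuned by cross-validated partial likelihood, i.e. for prediction
cvfit <- cv.glmnet(x, y, family = "cox")

## The penalized linear predictor *is* the prediction; refitting the
## nonzero coefficients with stepwise selection discards this shrinkage
lp <- predict(cvfit, newx = x, s = "lambda.min")
head(lp)
```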

Your low Brier score is something I see all the time in my own work. I deal with large datasets where the outcomes of interest are rather rare (mortality over 5-12 years in basically healthy people). Even a good model will predict a mortality risk of only 4-5% for most of the people who actually die, so the "error rate" remains high even when many variables are highly significant. Model comparison measures (especially the deviance) are much better guides for decision making than any of the GOF or discrimination measures.
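A back-of-envelope illustration of why the Brier score is small almost by construction when events are rare: with a 5% event rate, even a constant, no-information prediction of 0.05 attains a Brier score of about $0.05 \times 0.95 = 0.0475$, so a low absolute value says little by itself.

```r
## With a rare outcome, even a no-information model earns a "low" Brier score
set.seed(7)
n <- 1e5
event  <- rbinom(n, 1, 0.05)   # 5% event rate
p_null <- rep(0.05, n)         # constant, uninformative prediction
mean((event - p_null)^2)       # ~ 0.05 * 0.95 = 0.0475
```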