Solved – Adjusting for covariates in ROC curve analysis

epidemiology, roc

This question is about estimating cut-off scores on a multi-dimensional screening questionnaire to predict a binary endpoint, in the presence of correlated scales.

I was asked about the value of controlling for associated subscores when devising cut-off scores on each dimension of a measurement scale (personality traits) that might be used for alcoholism screening. In this particular case, the person was not interested in adjusting for external covariates (predictors), which leads to the (partial) area under the covariate-adjusted ROC curve, e.g. (1-2), but in adjusting for other scores from the same questionnaire, because they correlate with each other (e.g. "impulsivity" with "sensation seeking"). It amounts to building a GLM that includes on the right-hand side the score of interest (for which we seek a cut-off) and another score computed from the same questionnaire, while on the left-hand side the outcome may be drinking status.

To clarify (per @robin's request), suppose we have four scores $x_j$, $j=1,\dots,4$ (e.g., anxiety, impulsivity, neuroticism, sensation seeking), and we want to find a cut-off value $t_j$ for each of them (i.e. "positive case" if $x_j>t_j$, "negative case" otherwise). We usually adjust for other risk factors like gender or age when devising such cut-offs (using ROC curve analysis). Now, what about adjusting impulsivity (IMP) for gender, age, and sensation seeking (SS), since SS is known to correlate with IMP? In other words, we would have a cut-off value for IMP from which the effects of age, gender, and sensation seeking have been removed.
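As a minimal sketch of the unadjusted, single-score setting described above: the function below scans candidate thresholds $t$ and keeps the one that maximises Youden's J (sensitivity + specificity - 1). Youden's J is only one common criterion and is an assumption here, not something the question specifies; the function name and the tiny data set are hypothetical and purely illustrative.

```python
def youden_cutoff(scores, outcomes):
    """Return (cutoff, sensitivity, specificity) maximising Youden's J.

    A subject is called positive when score > cutoff.
    `outcomes` holds 1 for cases and 0 for non-cases.
    """
    n_pos = sum(outcomes)
    n_neg = len(outcomes) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one case and one non-case")

    best = None
    for t in sorted(set(scores)):
        # True positives: cases scoring above the threshold.
        tp = sum(1 for s, y in zip(scores, outcomes) if y == 1 and s > t)
        # True negatives: non-cases scoring at or below the threshold.
        tn = sum(1 for s, y in zip(scores, outcomes) if y == 0 and s <= t)
        sens = tp / n_pos
        spec = tn / n_neg
        j = sens + spec - 1.0
        # Strict ">" keeps the lowest threshold when several tie.
        if best is None or j > best[0]:
            best = (j, t, sens, spec)
    return best[1], best[2], best[3]


# Made-up impulsivity scores and drinking status, for illustration only.
imp_scores = [1, 2, 3, 3, 4, 5, 6, 7]
drinking = [0, 0, 0, 1, 0, 1, 1, 1]
print(youden_cutoff(imp_scores, drinking))
```

Nothing in this sketch adjusts for age, gender, or another subscore; it only makes the "one threshold per scale" idea concrete.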

Apart from saying that a cut-off must remain as simple as possible, my response was

About covariates, I would recommend estimating the AUCs with and without adjustment, just to see whether the predictive performance increases. Here, your covariates are merely other subscores defined from the same measurement instrument, and I have never faced such a situation (usually, I adjust for known risk factors, like age or gender). […] Also, since you are interested in prognostic issues (i.e. the screening efficacy of the questionnaire), you may also be interested in estimating the positive predictive value (PPV, the proportion of patients with positive test results who are correctly classified), provided you are able to classify subjects as "positive" or "negative" depending on their subscores on your questionnaire. Note, however, that it is necessary to know the prevalence of this disorder to correctly interpret the PPV in turn…
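To make the dependence on prevalence concrete, here is a small sketch based on Bayes' rule: PPV = sens × prev / (sens × prev + (1 − spec) × (1 − prev)). The function name and the sensitivity, specificity, and prevalence values are hypothetical, chosen only to show the effect.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """PPV from sensitivity, specificity and prevalence via Bayes' rule."""
    for name, value in (("sensitivity", sensitivity),
                        ("specificity", specificity),
                        ("prevalence", prevalence)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1]")

    true_pos = sensitivity * prevalence                   # P(test+, case)
    false_pos = (1.0 - specificity) * (1.0 - prevalence)  # P(test+, non-case)
    if true_pos + false_pos == 0.0:
        raise ValueError("no positive tests are possible with these inputs")
    return true_pos / (true_pos + false_pos)


# Same hypothetical test (sens = 0.80, spec = 0.90) in two populations.
for prev in (0.05, 0.30):
    ppv = positive_predictive_value(0.80, 0.90, prev)
    print(f"prevalence = {prev:.2f} -> PPV = {ppv:.3f}")
```

With these made-up figures the PPV is roughly 0.30 at 5% prevalence and roughly 0.77 at 30% prevalence, even though the test itself is unchanged.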

Do you have a more thorough understanding of this particular situation, with links to relevant papers when possible?

References

  1. Janes, H and Pepe, MS (2008). Adjusting for Covariates in Studies of Diagnostic, Screening, or Prognostic Markers: An Old Concept in a New Setting. American Journal of Epidemiology, 168(1): 89-97.
  2. Janes, H and Pepe, MS (2008). Accommodating Covariates in ROC Analysis. UW Biostatistics Working Paper Series, Paper 322.

Best Answer

The way you have envisioned the analysis is not the way I would suggest you start thinking about it. First of all, it is easy to show that if cutoffs must be used, they should be applied not to individual features but to the overall predicted probability. The optimal cutoff for a single covariate depends on the levels of all the other covariates; it cannot be constant. Second, ROC curves play no role in meeting the goal of making optimum decisions for an individual subject.
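One way to see why a single-score cutoff cannot be constant is a minimal sketch that assumes a two-predictor logistic model, $\operatorname{logit} P(\text{case}) = \beta_0 + \beta_1\,\text{IMP} + \beta_2\,\text{SS}$ with $\beta_1 > 0$. Deciding "positive" when the predicted probability exceeds $c$ is then equivalent to $\text{IMP} > (\operatorname{logit}(c) - \beta_0 - \beta_2\,\text{SS})/\beta_1$, a threshold that moves with SS. The coefficients below are hypothetical and not estimated from any data.

```python
import math


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def implied_score_cutoff(prob_cutoff, intercept, beta_score, beta_other, other_value):
    """Cutoff on the score of interest implied by a probability cutoff.

    Assumes logit(P(case)) = intercept + beta_score * score + beta_other * other
    with beta_score > 0, so that "probability > prob_cutoff" is the same
    decision as "score > returned value".
    """
    if beta_score <= 0:
        raise ValueError("this sketch assumes beta_score > 0")
    return (logit(prob_cutoff) - intercept - beta_other * other_value) / beta_score


# Hypothetical coefficients, chosen only for illustration.
for ss in (0, 2, 4):
    t_imp = implied_score_cutoff(0.5, intercept=-2.0, beta_score=0.5,
                                 beta_other=0.25, other_value=ss)
    print(f"sensation seeking = {ss}: positive when impulsivity > {t_imp}")
```

With these made-up numbers the impulsivity threshold falls from 4.0 to 2.0 as sensation seeking rises from 0 to 4, so no single value of $t$ reproduces the probability-based decision.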

To handle correlated scales there are many data reduction techniques that can help. One of them is a formal redundancy analysis where each predictor is nonlinearly predicted from all the other predictors, in turn. This is implemented in the redun function in the R Hmisc package. Variable clustering, principal component analysis, and factor analysis are other possibilities. But the main part of the analysis, in my view, should be building a good probability model (e.g., binary logistic model).
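As a rough illustration of the last point, the sketch below fits a toy binary logistic model with two correlated subscores by plain gradient ascent and then returns a predicted probability, which is the quantity a cutoff would be applied to if one is needed. It is a standard-library stand-in only: real work would use established software, and the model here is linear in the scores, whereas the redundancy analysis mentioned above is nonlinear. The function names, learning rate, and data are all hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(rows, labels, learning_rate=0.1, n_iter=5000):
    """Fit a logistic model by gradient ascent on the mean log-likelihood.

    Returns [intercept, coef_1, ..., coef_k].
    """
    n = len(rows)
    weights = [0.0] * (len(rows[0]) + 1)
    for _ in range(n_iter):
        grad = [0.0] * len(weights)
        for x, y in zip(rows, labels):
            p = sigmoid(weights[0] + sum(w * v for w, v in zip(weights[1:], x)))
            err = y - p  # contribution of this subject to the score function
            grad[0] += err
            for k, v in enumerate(x):
                grad[k + 1] += err * v
        weights = [w + learning_rate * g / n for w, g in zip(weights, grad)]
    return weights


def predict_prob(weights, x):
    """Predicted probability of being a case for one subject."""
    return sigmoid(weights[0] + sum(w * v for w, v in zip(weights[1:], x)))


# Synthetic (impulsivity, sensation seeking) pairs and drinking status.
scores = [(0, 1), (1, 0), (1, 2), (2, 3), (0, 3), (3, 3),
          (2, 1), (3, 2), (3, 4), (4, 3), (4, 1), (1, 1)]
drinking = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]

weights = fit_logistic(scores, drinking)
print("coefficients:", [round(w, 3) for w in weights])
print("P(case | IMP=4, SS=1):", round(predict_prob(weights, (4, 1)), 3))
```

Any decision rule would then compare `predict_prob(...)` with a probability threshold chosen from the costs of false positives and false negatives, rather than comparing each subscore with its own fixed cut-off.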