Solved – How to pool c-statistic/AUROC (or any bounded variable) after using multiple imputation techniques

multiple-imputation, pooling

I am conducting a study in which I want to predict a dichotomous outcome (poor outcome yes/no) for patients in a hospital setting. Specifically, I want to compare how different summary measures for the first week of admission affect the models' discrimination, as measured by the c-index (also known as the area under the receiver operating characteristic curve, or AUROC).

As usually happens in clinical studies, however, I have missing data on predictor and outcome variables. I have decided to address this problem with multiple imputation and have created 50 imputed datasets in which the missing values are replaced (using the 'mice' package in R).

Using the appropriate functions I am able to obtain the c-statistics with confidence interval (& variance) for each imputation dataset.
Using 'plain' Rubin's rules for pooling normally distributed estimates, I would now average the point estimates and add the between-imputation variance to the average within-imputation variance to obtain the total variance.
Now I come to the problem: I am unsure whether I can treat the 50 c-indices as normally distributed when calculating the point estimate and the variance needed for a proper confidence interval.
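
For reference, the 'plain' Rubin's rules described above can be sketched as follows. The notation is an assumption introduced here for illustration: $\hat{Q}_i$ is the estimate and $U_i$ its within-imputation variance in imputed dataset $i = 1, \dots, m$ (here $m = 50$).

$$
\bar{Q} = \frac{1}{m}\sum_{i=1}^{m}\hat{Q}_i, \qquad
\bar{U} = \frac{1}{m}\sum_{i=1}^{m}U_i, \qquad
B = \frac{1}{m-1}\sum_{i=1}^{m}\left(\hat{Q}_i - \bar{Q}\right)^2, \qquad
T = \bar{U} + \left(1 + \frac{1}{m}\right)B
$$

A 95% confidence interval is then $\bar{Q} \pm t_{\nu,\,0.975}\sqrt{T}$, with degrees of freedom $\nu = (m-1)\left(1 + \frac{\bar{U}}{(1 + 1/m)\,B}\right)^{2}$. The interval relies on $\hat{Q}_i$ being approximately normal, which is exactly what is in doubt for a bounded statistic such as the c-index.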

I have tried searching for an answer, but I only found the following three suggestions used in (slightly) different situations:

- Pool assuming a normal distribution anyway (often applied to other statistics that are bounded or clearly not normally distributed).
- Look at the distribution of the statistic over all imputation datasets and take the median c-index as the point estimate, using the 2.5th and 97.5th percentile values as the lower and upper bounds of a 95% confidence interval.
- Transform all c-indices and variances to an unbounded scale, pool the transformed values assuming a normal distribution, and finally transform back to the bounded c-index scale (as suggested for the observed:expected ratio, using a log transformation, in Siregar S, Eur J Cardiothorac Surg 2012). For the $[0, 1]$ bounded c-index this could be done by a logit transformation of the c-indices.
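
The third suggestion could be sketched in base R roughly as follows. This is a minimal sketch, not a validated method: the function name `pool_auc_logit` and the simulated inputs `c_hat` and `se_hat` are hypothetical, and the standard errors are moved to the logit scale with a delta-method approximation, which is an assumption not stated in the question.

```r
# Hypothetical inputs: one c-statistic and one standard error per imputed
# dataset (simulated here for illustration; replace with your own values).
set.seed(42)
m      <- 50
c_hat  <- plogis(rnorm(m, mean = qlogis(0.80), sd = 0.05))
se_hat <- rep(0.02, m)

pool_auc_logit <- function(est, se, level = 0.95) {
  m <- length(est)
  # Transform estimates to the unbounded logit scale
  q <- qlogis(est)
  # Delta method: Var(logit(c)) is approximately Var(c) / (c * (1 - c))^2
  u <- (se / (est * (1 - est)))^2
  # Rubin's rules on the logit scale
  q_bar <- mean(q)                    # pooled estimate
  u_bar <- mean(u)                    # within-imputation variance
  b     <- var(q)                     # between-imputation variance
  t_var <- u_bar + (1 + 1 / m) * b    # total variance
  # Rubin's degrees of freedom for the t reference distribution
  r    <- (1 + 1 / m) * b / u_bar
  df   <- (m - 1) * (1 + 1 / r)^2
  crit <- qt(1 - (1 - level) / 2, df)
  # Back-transform the estimate and interval limits to the c-index scale
  c(estimate = plogis(q_bar),
    lower    = plogis(q_bar - crit * sqrt(t_var)),
    upper    = plogis(q_bar + crit * sqrt(t_var)))
}

pooled <- pool_auc_logit(c_hat, se_hat)
print(round(pooled, 3))
```

Because the interval is built on the logit scale and then back-transformed, its limits always stay inside $[0, 1]$ and it is generally asymmetric around the pooled c-index.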

Any help would be greatly appreciated.

Best Answer

The c-index is a useful measure of predictive discrimination because it is easy to interpret and at least moderately sensitive. It is not a full-information proper accuracy scoring rule. It is not sensitive enough for comparing two models. So I suggest you obtain the best model using all the partial information available (e.g., multiple imputation with the number of imputations being at least the percentage of records that are incomplete), then attempt to quantify the value of that single model. That is easier said than done, but you can start with the overall Wald statistic for the global null hypothesis that none of the predictors are associated with $Y$. There are a few papers showing how to derive a unitless discrimination index from the Wald $\chi^2$ statistic. Also take a quick look at the $g$-index in my Regression Modeling Strategies book and notes.
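
The rule of thumb for the number of imputations mentioned above can be checked with a few lines of base R. This is only a sketch; the data frame `dat` is a made-up example, not the asker's data.

```r
# Hypothetical data frame with some missing values
dat <- data.frame(
  x = c(1, NA, 3, 4, NA, 6, 7, 8, 9, 10),
  y = c(NA, 2, 3, 4, 5, 6, 7, 8, 9, 10)
)

# Percentage of records with at least one missing value
n_incomplete   <- sum(!complete.cases(dat))
pct_incomplete <- 100 * n_incomplete / nrow(dat)

# Rule of thumb: use at least as many imputations as that percentage
m_min <- ceiling(pct_incomplete)
print(m_min)
```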

Related Solutions

Solved – Pooling imputed, still not analysed datasets in MICE

A major point of multiple imputations is to do separate analyses on each of the imputed data sets, so that you can get both pooled estimates of things like regression coefficients and an estimate of the errors in the coefficients. Averaging the imputed data sets first is not the correct use of this approach. And don't limit yourself to so few imputations; with modern computers there's no reason not to do 100 or more. See http://www.stefvanbuuren.nl/mi/MI.html, from the person who developed the mice package, for further information.
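
The analyse-then-pool workflow described above might look like this with the mice package. This is a sketch under assumptions: the `nhanes` example data ships with mice, and the linear model and `m = 20` are arbitrary illustrative choices, not taken from the question.

```r
library(mice)

# nhanes is a small example dataset included in mice (used here only for
# illustration; substitute your own data and analysis model).
imp <- mice(nhanes, m = 20, seed = 123, printFlag = FALSE)

# Fit the same model separately in each imputed dataset
fit <- with(imp, lm(bmi ~ age + chl))

# Combine the per-dataset estimates and variances with Rubin's rules
pooled <- pool(fit)
summary(pooled, conf.int = TRUE)
```

Note that the imputed datasets are never averaged; only the model estimates are pooled.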

Multiple Imputation – Applying Rubin’s Rule for Combining Multiply Imputed Datasets

Rubin's rules can only be applied to estimates that are (approximately) normally distributed. For statistics with an F or chi-square distribution, a different set of formulas is needed:

Allison, P. D. (2002). Missing data. Newbury Park, CA: Sage.

To perform an ANOVA on multiply imputed datasets, you could use the R package miceadds (see miceadds::mi.anova).

Update 1

Here is a complete example:

Export your data from SPSS to R: in SPSS, save your dataset as a .csv file.
Read in your dataset:
```
library(miceadds)   
dat <- read.csv(file='your-dataset.csv')
```
Let's assume that $reading$ is your dependent variable and that you have two factors:
- gender, with male = 0 and female = 1
- treatment, with control = 0 and 'received treatment' = 1

Now let's convert them to factors:
```
dat$gender    <- factor(dat$gender)
dat$treatment <- factor(dat$treatment)
```
Convert your dataset to a mids object, where we assume that the first variable holds the imputation number (Imputation_ in SPSS):
```
dat.mids <- as.mids(dat)
```

Now you can perform an ANOVA:

```
fit <- mi.anova(mi.res=dat.mids, formula="reading~gender*treatment", type=3)
summary(fit)
```

Update 2 This is a reply to your second comment:

What you describe here is a data import/export related problem between SPSS and R. You could try to import the .sav file directly into R and there are a bunch of dedicated packages for that: foreign, rio, gdata, Hmisc, etc. I prefer the csv-way, but that's a matter of taste and/or depends on the nature of your problem. Maybe you should also check some tutorials on youtube or other sources on the internet.

```
library(foreign)
dat <- read.spss(file='path-to-sav', use.value.labels=FALSE, to.data.frame=TRUE)
```

Update 3 This is a reply to your first comment:

Yes, you can do your analysis in SPSS and pool the F values in miceadds (please note this example is taken from the miceadds::micombine.F help page):

```
library(miceadds)
# One F value per imputed dataset
Fvalues <- c(6.76, 4.54, 4.23, 5.45, 4.78, 6.76, 4.54, 4.23, 5.45, 4.78,
             6.76, 4.54, 4.23, 5.45, 4.78, 6.76, 4.54, 4.23, 5.45, 4.78)
# Pool the F values; df1 is the numerator degrees of freedom
micombine.F(Fvalues, df1=4)
```
