The warnings aren't a problem unless you are interested in response or deviance residuals for individual cases. If you are interested in such residuals, you seem to have found an error in the way those are calculated for some `survreg()` models of interval-censored data.
With `mice`, this warning comes from how the number of observations is estimated when summarizing or pooling results from models that don't directly provide the number of observations (`nobs`). For example, from the code for `mice:::summary.mira()`, where `object` is a `mira` object and `v` is the set of model fits after `tidy()` is applied:
if (!"nobs" %in% colnames(v)) {
    v$nobs <- tryCatch(length(stats::residuals(getfit(object)[[1]])),
        error = function(e) NULL)
}
The warnings come from the attempt to get residuals from a `survreg` Gaussian fit to interval-censored data. The `mice:::pool.fitlist()` function calls `summary()`, so you get the same warnings. Only the `length` of the residuals vector (including those that are `NA`) is used in the `mice` system, however, so there is no further downstream problem.
What's going on: a possible error in `survreg.distributions`
If you ask for response or deviance residuals from any one of your models on the multiple imputations, you get the same warning*:
> gf <- getfit(model1)
> residuals(gf[[1]],type="deviance")
1 2 3 4 5 6 7
NaN NaN NaN NaN -1.1860347 NaN NaN
8 9 10 11 12 13 14
1.9884853 NaN -1.2012836 NaN NaN NaN -0.7339629
15 16 17 18 19 20
NaN NaN 2.0007294 1.9884853 NaN 2.0007294
Warning message:
In log(1 - 2 * pnorm(width/2)) : NaNs produced
If you look carefully, you will see that the only non-`NaN` values are for cases where the `start` and `stop` values are the same, so that the software treated them as exact rather than interval-censored.
The `log(1 - 2 * pnorm(width/2))` formula is from the provided definition of deviance for interval-censored data (`status == 3` in the internal coding) with Gaussian errors:
> survreg.distributions$gaussian$deviance
function (y, scale, parms)
{
    status <- y[, ncol(y)]
    width <- ifelse(status == 3, (y[, 2] - y[, 1])/scale, 0)
    center <- ifelse(status == 3, rowMeans(y), y[, 1])
    temp2 <- log(1 - 2 * pnorm(width/2))  ## HERE'S THE CULPRIT
    loglik <- ifelse(status == 1, -log(sqrt(2 * pi) * scale),
        ifelse(status == 3, temp2, 0))
    list(center = center, loglik = loglik)
}
<bytecode: 0x7f9925dcaae0>
<environment: namespace:survival>
Above, `y` is the `Surv` object, `scale` is the model's scale estimate, and `parms` is irrelevant for Gaussian models. That `loglik` is only used to calculate deviance residuals. This function is also called for response residuals, but in that case the `loglik` value isn't used.
As the `survival` vignette explains in Section 5.5, an interval-censored observation lacks an exact observation time, so the likelihood needs to be calculated around an appropriate intermediate point $\hat y_0$ of the interval. "For interval censored observations from a symmetric distribution, $\hat y_0 =$ the center of the censoring interval." You then combine that with the scale-normalized `width` of the interval to find the likelihood around that value of $\hat y_0$, which for interval-censored data is proportional to the difference in survival fraction between the lower (L) and upper (U) bounds of the interval: $S(L) - S(U) = F(U) - F(L)$, where $F(t)$ is the distribution function corresponding to $S(t)$.
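For the Gaussian case that probability can be written out explicitly. With the interval centered on $\hat y_0$ and standardized by the scale, the bounds sit at $\pm\,\mathrm{width}/2$, so

$$F(U) - F(L) = \Phi\!\left(\tfrac{\mathrm{width}}{2}\right) - \Phi\!\left(-\tfrac{\mathrm{width}}{2}\right) = 2\,\Phi\!\left(\tfrac{\mathrm{width}}{2}\right) - 1,$$

which lies strictly between 0 and 1 for any positive `width`, so its logarithm is well defined. The quantity $1 - 2\,\Phi(\mathrm{width}/2)$ coded above is its negative.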
The problem is that `1 - 2 * pnorm(width/2)` returns a negative number for any value of `width` > 0! That's where the `NaN`s come from. As far as I can tell, this formula is in error: the Gaussian probability of an observation falling within an interval of scale-normalized width `width` around its center is `2 * pnorm(width/2) - 1`, the negative of what's coded. As the log-normal distribution inherits from this definition, you also get such warnings with interval-censored log-normal models;** you get similar warnings from models using the *t* distribution, where the offending line of code includes `log(1 - 2 * pt(width/2, df = parms))`.
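The sign problem is easy to confirm outside of R. Here is a quick numerical check (a Python sketch, building the standard normal CDF from `math.erf`) comparing the coded expression against the corrected one:

```python
from math import erf, sqrt

def pnorm(x):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

width = 1.0                         # any positive scale-normalized width
coded = 1 - 2 * pnorm(width / 2)    # what the deviance function computes
correct = 2 * pnorm(width / 2) - 1  # P(-width/2 < Z < width/2)

print(coded)    # negative, so log(coded) is NaN
print(correct)  # in (0, 1), so log(correct) is well defined
```

The two expressions differ only in sign, which is why the logarithm fails for every genuinely interval-censored case but succeeds when `width` is 0.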
After looking at this code I'm also a bit concerned about how the `center` values are calculated via `rowMeans(y)`, where `y` is a `Surv` object representing interval-censored data. Perhaps there is some internal definition of `rowMeans()` in the `survival` package that omits the third column of the `Surv` object (value of 3 for interval-censored), but when I apply `rowMeans()` to such an object outside of the package the value seems to average in that value of 3:
> rowMeans(Surv(1,2,type="interval2"))
[1] 2
although the mean of the [1, 2] interval is 1.5.
*You didn't show how you set the seed, so your results might differ slightly.

**You can't fit a log-normal to values of 0; to see this behavior, just add 0.5 to all of your `dv_low` and `dv_high` values.
Best Answer
I think both options lead to a correct answer. In general, I would prefer method 1, as it preserves the entire distribution.
For method 1, bootstrap the parameter $k$ times within each of the $m$ MI solutions. Then simply mix the $m$ bootstrapped distributions to obtain your final density, now consisting of $k \times m$ samples that include the between-imputation variation. Then treat that as a conventional bootstrap sample to get confidence intervals. Use the Bayesian bootstrap for small samples. I know of no simulation work that investigates this procedure, and this is actually an open problem to be investigated.
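The mixing step of method 1 can be sketched as follows. This is a minimal illustration, assuming a toy statistic (the sample mean) and hypothetical arrays standing in for your $m$ completed datasets; it is not a substitute for bootstrapping your actual model fits:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-ins for m = 3 completed (imputed) datasets;
# in practice these come from your imputation procedure.
imputed = [rng.normal(loc=mu, scale=1.0, size=50) for mu in (0.9, 1.0, 1.1)]
k = 1000  # bootstrap replicates per imputed dataset

pooled = []
for data in imputed:
    for _ in range(k):
        resample = rng.choice(data, size=len(data), replace=True)
        pooled.append(resample.mean())  # the statistic of interest

# k * m draws mixing within- and between-imputation variation
pooled = np.asarray(pooled)
lo, hi = np.percentile(pooled, [2.5, 97.5])  # conventional percentile CI
print(round(lo, 3), round(hi, 3))
```

The final interval is read off the pooled distribution exactly as for an ordinary bootstrap; the between-imputation spread widens it relative to any single imputed dataset.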
For method 2, use the Licht-Rubin procedure. See How to get pooled p-values on tests done in multiple imputed datasets?