I am trying to run an interval regression using the survival r package (as described here https://stats.oarc.ucla.edu/r/dae/interval-regression/), but I am running into difficulties when trying to pool results across multiply imputed datasets. Specifically, although estimates are returned, I get the following warning: log(1 – 2 * pnorm(width/2)) : NaNs produced. The estimates seem reasonable, at face value (no NaNs, very large or small SEs).
I ran the same model on the stacked dataset (ignoring imputations) and on individual imputed datasets, but in either case, I do not get the warning. It seems that there might be an issue calculating the deviance when pooling (although I also get the warning with summary()
)? Would someone be able to explain to me what is happening and whether this warning can be safely ignored when interpreting the estimates? If anyone could offer a solution to this issue, that would also be great.
Thanks so much!
# A Reproducible Example
require(survival)
require(mice)
require(car)
# Create DF
dat <- data.frame(dv = c(1, 1, 2, 1, 0, NA, 1, 4, NA, 0, 3, 1, 3, 0, 2, 1, 4, NA, 2, 4),
catvar1 = factor(c(0, 0, 0, 0, 0, 1, 0, 0, 0, NA, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0)),
catvar2 = factor(c(1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, NA, 0)))
dat_imp <- mice(data = dat)
# Transform Outcome Var for Interval Reg
dat_imp_long <- complete(dat_imp, action = "long", include=TRUE)
# 1-4 correspond to ranges (e.g., 1 = 1 to 2 times...4 = 10 or more)
# create variables that reflect this range
dat_imp_long$dv_low <- car::recode(dat_imp_long$dv, "0 = 0; 1 = 1; 2 = 3; 3 = 6; 4 = 10")
dat_imp_long$dv_high <- car::recode(dat_imp_long$dv, "0 = 0; 1 = 2; 2 = 5; 3 = 9; 4 = 999")
dat_imp_long$dv_high[dat_imp_long$dv_high > 40] <- Inf
# Convert back to mids
dat_mids <- as.mids(dat_imp_long)
# Run Interval Reg
model1 <- with(dat_mids, survreg(Surv(dv_low, dv_high, type = "interval2") ~
catvar1 + catvar2, dist = "gaussian"))
# Warning message for both calls: In log(1 - 2 * pnorm(width/2)) : NaNs produced
# Problem does not only occur with pool, but summary
summary(model1)
summary(pool(model1))
# Run Equivalent Model on Individual Datasets
# No warnings produced
imp1 <- subset(dat_imp_long, .imp == 1)
model2 <- survreg(Surv(dv_low, dv_high, type = "interval2") ~
catvar1 + catvar2, dist = "gaussian", data = imp1)
summary(model2)
imp2 <- subset(dat_imp_long, .imp == 2)
model3 <- survreg(Surv(dv_low, dv_high, type = "interval2") ~
catvar1 + catvar2, dist = "gaussian", data = imp2)
summary(model3)
# Equivalent Analysis on Stacked Dataset
# No warning
model <- with(dat_imp_long, survreg(Surv(dv_low, dv_high, type = "interval2") ~
catvar1 + catvar2, dist = "gaussian"))
summary(model)
Update 2/22/22: At EdM's suggestion, I contacted the maintainer of the survival package, and they found the errors and fixed them (both in the calculation of the Gaussian likelihood and the center values), and they are working on a CRAN release.
Best Answer
The warnings aren't a problem unless you are interested in response or deviance residuals for individual cases. If you are interested in such residuals, you seem to have found an error in the way those are calculated for some
survreg()
models of interval-censored data.With
mice
, this warning comes from how the number of observations is estimated when summarizing or pooling results from models that don't directly provide the number of observations (nobs
). For example, from the code formice:::summary.mira()
whereobject
is amira
object andv
is the set of model fits aftertidy()
is applied:The warnings come from the attempt to get residuals from a
survreg
Gaussian fit to interval-censored data. Themice:::pool.fitlist()
function callssummary()
, so you get the same warnings. Only thelength
of the residuals vector (including those that areNA
) is used in themice
system, however, so there is no further downstream problem.What's going on; possible error in
survreg.distributions
If you ask for response or deviance residuals from any one of your models on the multiple imputations you get the same warning*:
If you look carefully, you will see than the only non-
NaN
values are for cases where thestart
andstop
values are the same, so that the software treated them as exact rather than interval-censored.The
log(1 - 2 * pnorm(width/2))
formula is from the provided definition of deviance for interval-censored data (status == 3
in the internal coding) with Gaussian errors:Above,
y
is theSurv
object,scale
is the model estimate, andparms
is irrelevant for Gaussian models. Thatloglik
is only used to calculate deviance residuals. This function is also called for response residuals but theloglik
value isn't used.As the
survival
vignette explains in Section 5.5, an interval-censored observation lacks an exact observation time, so likelihood needs to be calculated around an appropriate intermediate point $\hat y_0$ of the interval. "For interval censored observations from a symmetric distribution, $\hat y_0 =$ the center of the censoring interval." You then combine that with the scale-normalizedwidth
of the interval to find the likelihood around that value of $\hat y_0$, which for interval censored data is proportional to the difference in survival fraction between the lower (L) and upper (U) bounds of the interval: $S(L) - S(U) = F(U) - F(L)$, where $F(t)$ is the distribution function corresponding to $S(t)$.The problem is that
1 - 2 * pnorm(width/2)
returns a negative number for any value ofwidth
> 0! That's where the NaNs come from. As far as I can tell, this formula is in error and should be2 * pnorm(width/2)
for the Gaussian likelihood of an observation within an interval around its center. As the log-normal distribution inherits from this definition you also get such warnings with interval-censored log-normal models;** you get similar warnings from models using thet
distribution, where the offending line of code includeslog(1 - 2 * pt(width/2, df = parms))
.After looking at this code I'm also a bit concerned about how the
center
values are calculated viarowMeans(y)
, wherey
is aSurv
object representing interval-censored data. Perhaps there is some internal definition ofrowMeans()
in thesurvival
package that omits the third column of theSurv
object (value of 3 for interval-censored), but when I applyrowMeans()
to such an object outside of the package the value seems to average in that value of 3:although the mean of the [1,2] interval is 1.5.
*You didn't show how you set the seed, so your results might differ slightly.
**You can't fit a log-normal to values of 0; to see this behavior, just add 0.5 to all of your
dv_low
anddv_high
values.