Pooling Survreg Results Across Multiply Imputed Datasets – Warning: log(1 – 2 * pnorm(width/2)) : NaNs produced

interval-censoringmicemultiple-imputationrsurvival

I am trying to run an interval regression using the survival r package (as described here https://stats.oarc.ucla.edu/r/dae/interval-regression/), but I am running into difficulties when trying to pool results across multiply imputed datasets. Specifically, although estimates are returned, I get the following warning: log(1 – 2 * pnorm(width/2)) : NaNs produced. The estimates seem reasonable, at face value (no NaNs, very large or small SEs).

I ran the same model on the stacked dataset (ignoring imputations) and on individual imputed datasets, but in either case, I do not get the warning. It seems that there might be an issue calculating the deviance when pooling (although I also get the warning with summary())? Would someone be able to explain to me what is happening and whether this warning can be safely ignored when interpreting the estimates? If anyone could offer a solution to this issue, that would also be great.

Thanks so much!

# A Reproducible Example

require(survival)
require(mice)
require(car)

# Create DF
dat <- data.frame(dv = c(1, 1, 2, 1, 0, NA, 1, 4, NA, 0, 3, 1, 3, 0, 2, 1, 4, NA, 2, 4),
                  catvar1 = factor(c(0, 0, 0, 0, 0, 1, 0, 0, 0, NA, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0)),
                  catvar2 = factor(c(1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, NA, 0)))

dat_imp <- mice(data = dat)

# Transform Outcome Var for Interval Reg
dat_imp_long <- complete(dat_imp, action = "long", include=TRUE)

# 1-4 correspond to ranges (e.g., 1 = 1 to 2 times...4 = 10 or more)
# create variables that reflect this range
dat_imp_long$dv_low <- car::recode(dat_imp_long$dv, "0 = 0; 1 = 1; 2 = 3; 3 = 6; 4 = 10")
dat_imp_long$dv_high <- car::recode(dat_imp_long$dv, "0 = 0; 1 = 2; 2 = 5; 3 = 9; 4 = 999")
dat_imp_long$dv_high[dat_imp_long$dv_high > 40] <- Inf

# Convert back to mids
dat_mids <- as.mids(dat_imp_long)

# Run Interval Reg 
model1 <- with(dat_mids, survreg(Surv(dv_low, dv_high, type = "interval2") ~ 
                                     catvar1 + catvar2, dist = "gaussian"))

# Warning message for both calls: In log(1 - 2 * pnorm(width/2)) : NaNs produced
# Problem does not only occur with pool, but summary
summary(model1)
summary(pool(model1))

# Run Equivalent Model on Individual Datasets
# No warnings produced
imp1 <- subset(dat_imp_long, .imp == 1)
model2 <- survreg(Surv(dv_low, dv_high, type = "interval2") ~ 
                       catvar1 + catvar2, dist = "gaussian", data = imp1)
summary(model2)

imp2 <- subset(dat_imp_long, .imp == 2)
model3 <- survreg(Surv(dv_low, dv_high, type = "interval2") ~ 
                    catvar1 + catvar2, dist = "gaussian", data = imp2)
summary(model3)

# Equivalent Analysis on Stacked Dataset
# No warning
model <- with(dat_imp_long, survreg(Surv(dv_low, dv_high, type = "interval2") ~ 
                                  catvar1 + catvar2, dist = "gaussian"))
summary(model)

Update 2/22/22: At EdM's suggestion, I contacted the maintainer of the survival package, and they found the errors and fixed them (both in the calculation of the Gaussian likelihood and the center values), and they are working on a CRAN release.

Best Answer

The warnings aren't a problem unless you are interested in response or deviance residuals for individual cases. If you are interested in such residuals, you seem to have found an error in the way those are calculated for some survreg() models of interval-censored data.

With mice, this warning comes from how the number of observations is estimated when summarizing or pooling results from models that don't directly provide the number of observations (nobs). For example, from the code for mice:::summary.mira() where object is a mira object and v is the set of model fits after tidy() is applied:

if (!"nobs" %in% colnames(v)) {
        v$nobs <- tryCatch(length(stats::residuals(getfit(object)[[1]])), 
            error = function(e) NULL)
    }

The warnings come from the attempt to get residuals from a survreg Gaussian fit to interval-censored data. The mice:::pool.fitlist() function calls summary(), so you get the same warnings. Only the length of the residuals vector (including those that are NA) is used in the mice system, however, so there is no further downstream problem.

What's going on; possible error in survreg.distributions

If you ask for response or deviance residuals from any one of your models on the multiple imputations you get the same warning*:

> gf <- getfit(model1)
> residuals(gf[[1]],type="deviance")
         1          2          3          4          5          6          7 
       NaN        NaN        NaN        NaN -1.1860347        NaN        NaN 
         8          9         10         11         12         13         14 
 1.9884853        NaN -1.2012836        NaN        NaN        NaN -0.7339629 
        15         16         17         18         19         20 
       NaN        NaN  2.0007294  1.9884853        NaN  2.0007294 
Warning message:
In log(1 - 2 * pnorm(width/2)) : NaNs produced

If you look carefully, you will see than the only non-NaN values are for cases where the start and stop values are the same, so that the software treated them as exact rather than interval-censored.

The log(1 - 2 * pnorm(width/2)) formula is from the provided definition of deviance for interval-censored data (status == 3 in the internal coding) with Gaussian errors:

> survreg.distributions$gaussian$deviance
function (y, scale, parms) 
{
    status <- y[, ncol(y)]
    width <- ifelse(status == 3, (y[, 2] - y[, 1])/scale, 0)
    center <- ifelse(status == 3, rowMeans(y), y[, 1])
    temp2 <- log(1 - 2 * pnorm(width/2)) ## HERE'S THE CULPRIT
    loglik <- ifelse(status == 1, -log(sqrt(2 * pi) * scale), 
        ifelse(status == 3, temp2, 0))
    list(center = center, loglik = loglik)
}
<bytecode: 0x7f9925dcaae0>
<environment: namespace:survival>

Above, y is the Surv object, scale is the model estimate, and parms is irrelevant for Gaussian models. That loglik is only used to calculate deviance residuals. This function is also called for response residuals but the loglik value isn't used.

As the survival vignette explains in Section 5.5, an interval-censored observation lacks an exact observation time, so likelihood needs to be calculated around an appropriate intermediate point $\hat y_0$ of the interval. "For interval censored observations from a symmetric distribution, $\hat y_0 =$ the center of the censoring interval." You then combine that with the scale-normalized width of the interval to find the likelihood around that value of $\hat y_0$, which for interval censored data is proportional to the difference in survival fraction between the lower (L) and upper (U) bounds of the interval: $S(L) - S(U) = F(U) - F(L)$, where $F(t)$ is the distribution function corresponding to $S(t)$.

The problem is that 1 - 2 * pnorm(width/2) returns a negative number for any value of width > 0! That's where the NaNs come from. As far as I can tell, this formula is in error and should be 2 * pnorm(width/2) for the Gaussian likelihood of an observation within an interval around its center. As the log-normal distribution inherits from this definition you also get such warnings with interval-censored log-normal models;** you get similar warnings from models using the t distribution, where the offending line of code includes log(1 - 2 * pt(width/2, df = parms)).

After looking at this code I'm also a bit concerned about how the center values are calculated via rowMeans(y), where y is a Surv object representing interval-censored data. Perhaps there is some internal definition of rowMeans() in the survival package that omits the third column of the Surv object (value of 3 for interval-censored), but when I apply rowMeans() to such an object outside of the package the value seems to average in that value of 3:

> rowMeans(Surv(1,2,type="interval2"))
[1] 2

although the mean of the [1,2] interval is 1.5.


*You didn't show how you set the seed, so your results might differ slightly.

**You can't fit a log-normal to values of 0; to see this behavior, just add 0.5 to all of your dv_low and dv_high values.

Related Question