The warnings aren't a problem unless you are interested in response or deviance residuals for individual cases. If you are interested in such residuals, you seem to have found an error in the way those are calculated for some `survreg()` models of interval-censored data.
With `mice`, this warning comes from how the number of observations is estimated when summarizing or pooling results from models that don't directly provide the number of observations (`nobs`). For example, from the code for `mice:::summary.mira()`, where `object` is a `mira` object and `v` is the set of model fits after `tidy()` is applied:
if (!"nobs" %in% colnames(v)) {
    v$nobs <- tryCatch(length(stats::residuals(getfit(object)[[1]])),
        error = function(e) NULL)
}
The warnings come from the attempt to get residuals from a `survreg` Gaussian fit to interval-censored data. The `mice:::pool.fitlist()` function calls `summary()`, so you get the same warnings. Only the `length` of the residuals vector (including those that are `NA`) is used in the `mice` system, however, so there is no further downstream problem.
What's going on: a possible error in `survreg.distributions`
If you ask for response or deviance residuals from any one of your models on the multiple imputations, you get the same warning*:
> gf <- getfit(model1)
> residuals(gf[[1]],type="deviance")
1 2 3 4 5 6 7
NaN NaN NaN NaN -1.1860347 NaN NaN
8 9 10 11 12 13 14
1.9884853 NaN -1.2012836 NaN NaN NaN -0.7339629
15 16 17 18 19 20
NaN NaN 2.0007294 1.9884853 NaN 2.0007294
Warning message:
In log(1 - 2 * pnorm(width/2)) : NaNs produced
If you look carefully, you will see that the only non-`NaN` values are for cases where the `start` and `stop` values are the same, so that the software treated them as exact rather than interval-censored.
The `log(1 - 2 * pnorm(width/2))` formula is from the provided definition of deviance for interval-censored data (`status == 3` in the internal coding) with Gaussian errors:
> survreg.distributions$gaussian$deviance
function (y, scale, parms)
{
    status <- y[, ncol(y)]
    width <- ifelse(status == 3, (y[, 2] - y[, 1])/scale, 0)
    center <- ifelse(status == 3, rowMeans(y), y[, 1])
    temp2 <- log(1 - 2 * pnorm(width/2))  ## HERE'S THE CULPRIT
    loglik <- ifelse(status == 1, -log(sqrt(2 * pi) * scale),
        ifelse(status == 3, temp2, 0))
    list(center = center, loglik = loglik)
}
<bytecode: 0x7f9925dcaae0>
<environment: namespace:survival>
Above, `y` is the `Surv` object, `scale` is the model's scale estimate, and `parms` is irrelevant for Gaussian models. That `loglik` is only used to calculate deviance residuals. This function is also called for response residuals, but in that case the `loglik` value isn't used.
As the `survival` vignette explains in Section 5.5, an interval-censored observation lacks an exact observation time, so the likelihood needs to be calculated around an appropriate intermediate point $\hat y_0$ of the interval. "For interval censored observations from a symmetric distribution, $\hat y_0 =$ the center of the censoring interval." You then combine that with the scale-normalized `width` of the interval to find the likelihood around that value of $\hat y_0$, which for interval-censored data is proportional to the difference in survival fraction between the lower (L) and upper (U) bounds of the interval: $S(L) - S(U) = F(U) - F(L)$, where $F(t)$ is the distribution function corresponding to $S(t)$.
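For the Gaussian case that probability can be written out explicitly. With the interval centered on $\hat y_0$ and standardized by the scale, the bounds sit at $\pm\,\mathrm{width}/2$, so

$$F(U) - F(L) = \Phi\!\left(\tfrac{\mathrm{width}}{2}\right) - \Phi\!\left(-\tfrac{\mathrm{width}}{2}\right) = 2\,\Phi\!\left(\tfrac{\mathrm{width}}{2}\right) - 1,$$

which lies strictly between 0 and 1 for any positive `width`, so its logarithm is well defined. The quantity $1 - 2\,\Phi(\mathrm{width}/2)$ coded above is its negative.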
The problem is that `1 - 2 * pnorm(width/2)` returns a negative number for any value of `width` > 0! That's where the `NaN`s come from. As far as I can tell, this formula is in error: the Gaussian probability of an observation falling within an interval of scale-normalized width `width` around its center is `2 * pnorm(width/2) - 1`, the negative of what's coded. As the log-normal distribution inherits from this definition, you also get such warnings with interval-censored log-normal models;** you get similar warnings from models using the *t* distribution, where the offending line of code includes `log(1 - 2 * pt(width/2, df = parms))`.
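The sign problem is easy to confirm outside of R. Here is a quick numerical check (a Python sketch, building the standard normal CDF from `math.erf`) comparing the coded expression against the corrected one:

```python
from math import erf, sqrt

def pnorm(x):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

width = 1.0                         # any positive scale-normalized width
coded = 1 - 2 * pnorm(width / 2)    # what the deviance function computes
correct = 2 * pnorm(width / 2) - 1  # P(-width/2 < Z < width/2)

print(coded)    # negative, so log(coded) is NaN
print(correct)  # in (0, 1), so log(correct) is well defined
```

The two expressions differ only in sign, which is why the logarithm fails for every genuinely interval-censored case but succeeds when `width` is 0.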
After looking at this code I'm also a bit concerned about how the `center` values are calculated via `rowMeans(y)`, where `y` is a `Surv` object representing interval-censored data. Perhaps there is some internal definition of `rowMeans()` in the `survival` package that omits the third column of the `Surv` object (value of 3 for interval-censored), but when I apply `rowMeans()` to such an object outside of the package the value seems to average in that value of 3:
> rowMeans(Surv(1,2,type="interval2"))
[1] 2
although the mean of the [1, 2] interval is 1.5.
*You didn't show how you set the seed, so your results might differ slightly.

**You can't fit a log-normal to values of 0; to see this behavior, just add 0.5 to all of your `dv_low` and `dv_high` values.
Best Answer
I think both options lead to a correct answer. In general, I would prefer method 1, as it preserves the entire distribution.
For method 1, bootstrap the parameter $k$ times within each of the $m$ MI solutions. Then simply mix the $m$ bootstrapped distributions to obtain your final density, now consisting of $k \times m$ samples that include the between-imputation variation. Then treat that as a conventional bootstrap sample to get confidence intervals. Use the Bayesian bootstrap for small samples. I know of no simulation work that investigates this procedure, and this is actually an open problem to be investigated.
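The mixing step of method 1 can be sketched as follows. This is a minimal illustration, assuming a toy statistic (the sample mean) and hypothetical arrays standing in for your $m$ completed datasets; it is not a substitute for bootstrapping your actual model fits:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical stand-ins for m = 3 completed (imputed) datasets;
# in practice these come from your imputation procedure.
imputed = [rng.normal(loc=mu, scale=1.0, size=50) for mu in (0.9, 1.0, 1.1)]
k = 1000  # bootstrap replicates per imputed dataset

pooled = []
for data in imputed:
    for _ in range(k):
        resample = rng.choice(data, size=len(data), replace=True)
        pooled.append(resample.mean())  # the statistic of interest

# k * m draws mixing within- and between-imputation variation
pooled = np.asarray(pooled)
lo, hi = np.percentile(pooled, [2.5, 97.5])  # conventional percentile CI
print(round(lo, 3), round(hi, 3))
```

The final interval is read off the pooled distribution exactly as for an ordinary bootstrap; the between-imputation spread widens it relative to any single imputed dataset.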
For method 2, use the Licht-Rubin procedure. See How to get pooled p-values on tests done in multiple imputed datasets?