GAM Model Results – How to Summarize from Multiply Imputed Data in R Using mgcv

Tags: generalized-additive-model, mgcv, multiple-imputation, r

I am very new to R and not very experienced in statistics. I have a general question about applying generalized additive models (GAMs) to a multiply imputed dataset.

I used the R package mice for multiple imputation and mgcv for the GAM, but I have no idea how to combine the results from the imputed datasets.

Does anyone have any good idea about this?

Thanks a lot!

Best Answer

First, GAMs are basically a combination of GLMs, splines, and ridge regression (loosely speaking). You would be well advised to work on your understanding of those three things before attempting to work with their combination.

Second, I've done multiple imputation with GAMs, but not with mice.

In general, you're interested in fitting a model

$$ y = f(\mathbf{X}) + \epsilon $$

(in the gaussian, identity-linked case). $\mathbf{X}$ has holes in it -- missing data. Without going into detail about how multiple imputation works (Gelman and Hill have a very good and less-technical chapter explaining missing data imputation), what it does is create many different $\mathbf{X}$'s, call them $\mathbf{X}_m$, where $m$ indexes the imputed datasets from 1 to the number of imputations $M$, with the missing values filled in with plausible guesses about what they might be -- based on correlations in your dataset. You then fit your model

$$ y = f(\mathbf{X}_m) + \epsilon $$

to each of those $M$ imputed datasets. Since you're fitting $M$ models to $M$ different datasets, you're going to get $M$ different vectors of regression coefficients. These are combined according to "Rubin's Rules" (after the statistician who invented MI). Basically, you average the coefficients:

$$ \hat\beta = \frac{1}{M}\sum_{m=1}^M \hat\beta_m $$

The variance-covariance matrix $\hat V_{\hat\beta}$ of the estimated parameters is calculated by first averaging the per-model variance-covariance matrices and then adding a correction for the variation between imputation models:

$$ \hat V_{\hat\beta} = W + \left(1 + \frac{1}{M}\right) B $$

where $W = \frac{1}{M}\sum_{m=1}^M \widehat{VCV}_m$ is the average of the estimated variance-covariance matrices $\widehat{VCV}_m$ from the $M$ fitted models, and $B = \frac{1}{M-1}\sum_{m=1}^M \left(\hat\beta_m - \hat\beta\right)\left(\hat\beta_m - \hat\beta\right)^T$ is the between-imputation variance. This procedure inflates the standard errors of coefficients about which the imputation model is relatively less certain, either because of a lot of missing data or because of a poorly-informative imputation model.

You can then use the vector of coefficients and the VCV as if they had come from a single complete-case model.
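For a purely parametric model, mice will apply these rules for you, which is worth seeing before worrying about GAMs. A minimal sketch, assuming a data frame dat with an outcome y and predictors x1 and x2 (all names here are placeholders):

    library(mice)

    imp  <- mice(dat, m = 5, seed = 1)       # creates 5 completed datasets
    fits <- with(imp, glm(y ~ x1 + x2))      # fits the model to each of them
    summary(pool(fits))                      # combines the fits via Rubin's Rules

The rest of this answer is about what changes when the model fitted to each completed dataset is a GAM from mgcv.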

OP's question in a comment makes me realize that the below isn't quite right! Read EDIT2 below to see why

So what is different about doing this with a GAM? Only a couple of things. Each variable represented by a smooth function is associated with a number of basis coefficients (mgcv's default is 10). During estimation they are subject to ridge penalties, which smooth the estimated function, but otherwise they can be slotted into Rubin's Rules just like the parametric coefficients. Each of the $M$ models will have different estimated smoothing parameters, since these are estimated from the data. I don't think there is a problem with this, though I'd be interested to hear if someone has another perspective. You do want to take care to pre-specify the locations of the knots of each spline so that they are uniform across the $M$ models -- otherwise your coefficients won't be comparable, and combining them won't be appropriate.
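For example, a minimal sketch of fixing the knot locations through mgcv's knots argument (dat, x, y, and fit_m are placeholder names, dat_m stands for the $m$th completed dataset, and the cubic regression spline basis is just one choice that makes the knot placement explicit):

    library(mgcv)

    # Pick the knot locations once, from the observed range of x, and reuse
    # them for every imputed-data fit so the bases (and coefficients) match.
    kn <- list(x = seq(min(dat$x, na.rm = TRUE),
                       max(dat$x, na.rm = TRUE), length.out = 10))

    fit_m <- gam(y ~ s(x, bs = "cr", k = 10), data = dat_m, knots = kn)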

Once you have calculated the combined coefficients and VCV, you can simply stuff them into one of your $M$ fitted GAM objects. Functions that summarize the model and make plots draw on those two components, and will thereby use your pooled estimates rather than the ones from that single fit. There is code online illustrating the whole process from this paper.
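For completeness, a minimal sketch of producing those per-imputation fits in the first place, assuming the imputation was done with mice (imp is the mids object; the formula, kn, and variable names are the same placeholders as above). This builds the results list that the code in the EDIT below pools:

    library(mice)
    library(mgcv)

    # Fit the same GAM (same formula, same fixed knots) to each completed dataset
    results <- lapply(seq_len(imp$m), function(m) {
        gam(y ~ s(x, bs = "cr", k = 10), data = complete(imp, m), knots = kn)
    })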

EDIT I just saw the comment that you've already got your models fitted to your imputed datasets. If you can coerce those fits into a list object containing the 5 GAM models, then you can run something like the following to combine them:

    reps <- length(results)  # number of imputed datasets / fitted GAMs

    # Pooled coefficients: average the coefficient vectors across imputations
    bhat <- results[[1]]$coefficients
    for (i in 2:reps) {
        bhat <- bhat + results[[i]]$coefficients
    }
    bhat <- bhat / reps

    # Within-imputation variance W: average the Bayesian covariance matrices (Vp)
    W <- results[[1]]$Vp
    for (i in 2:reps) {
        W <- W + results[[i]]$Vp
    }
    W <- W / reps

    # Between-imputation variance B: outer products of deviations from the pooled mean
    B <- (results[[1]]$coefficients - bhat) %*% t(results[[1]]$coefficients - bhat)
    for (i in 2:reps) {
        B <- B + (results[[i]]$coefficients - bhat) %*% t(results[[i]]$coefficients - bhat)
    }
    B <- B / (reps - 1)

    # Total covariance matrix by Rubin's Rules
    Vb <- W + (1 + 1/reps) * B

    # Average the residual degrees of freedom across the fits
    dfr <- results[[1]]$df.residual
    for (i in 2:reps) {
        dfr <- dfr + results[[i]]$df.residual
    }
    dfr <- dfr / reps

    # Stuff the pooled quantities into one of the fitted gam objects
    MI <- results[[1]]
    MI$coefficients <- bhat
    MI$Vp <- Vb
    MI$df.residual <- dfr
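With the pooled quantities in place, the usual mgcv summary and plotting functions will draw on them (though see EDIT2 below about the p-values):

    summary(MI)  # pooled coefficients and Vp drive the reported estimates/SEs
    plot(MI)     # smooths and confidence bands based on the pooled Vp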

EDIT2 OP's question in a comment about p-values makes me realize that the above isn't quite right. It should get you appropriate coefficient and VCV estimates, but the p-values are based on a reduced-rank Wald statistic that relies on the model matrix for its calculation. Obviously the model matrix differs between imputations. So if you take a single one of your many gam objects (each of which has its own model frame) and stuff the pooled VCV and coefficient vector into it, you won't get the same result as you would have with a model fit to a different imputed dataset. I'm not sure how Rubin's Rules would generalize to combine imputations here! Maybe a short-term hack would be to also average the model R matrices from each fit, as sketched below; this should be about as valid as averaging the effective degrees of freedom (which may also be a hack).
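To be explicit about that hack, a heavily hedged sketch -- not a validated procedure, just the averaging idea above. It assumes the fitted gam objects store the R factor of the weighted model matrix in their R component (recent mgcv versions do) and that the summary machinery consults it:

    # Speculative: average the stored R factors across imputations, in the same
    # spirit as averaging the coefficients and Vp. Treat the resulting p-values
    # with caution.
    MI$R <- Reduce(`+`, lapply(results, `[[`, "R")) / length(results)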