Solved – Simulate data based on output from a generalized linear model in R

generalized linear modelrsimulation

I have a generalized linear model for a data set and I want to use the model to simulate data that matches the data used to generate the model. I am not wanting to resample the original data, but rather use the model output to generate data that matches original data.

For example, suppose I have a binomial data set in which survivorship is a function of size.

size <- c(3, 5, 10, 5, 17, 7, 17, 5, 18, 20, 9, 4, 17, 14, 8, 19, 11, 
3, 12, 14, 4, 11, 10, 22, 10, 12, 10, 7, 3, 10, 9, 5, 4, 23, 
8, 7, 7, 11, 19, 12, 3, 9, 6, 6, 8, 14, 18, 10, 9, 11)

surv <- c(1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 
0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 
1, 0, 1, 1, 0, 1, 0, 1, 0)

dat=data.frame(surv,size)
out.glm = glm(surv ~ size, family=binomial, data = dat)

I can graph the model predictions and standard errors, but how do I use the output of the glm to simulate data that should theoretically match the original dataset? Specifically, should I use the standard error of the mean as the input to rbinom() across the range of probabilities and then resample from the output?

Best Answer

A generalized linear model specification consists of a distribution family and a link function (depending on the family, it may also include a free dispersion parameter).

The model's parameter estimates for the mean allow us to estimate the conditional mean, $\hat{\mu}(\mathbf{x})=E[Y|\mathbf{x}]$.

Assuming you want the estimated parameters as the population parameters (to simulate just from the fitted model as if it were the population model), the general steps would be:

  • At each predictor combination (at each $\mathbf{x}_i$), compute $\hat{\mu}(\mathbf{x}_i)$. There will typically be a function to do this computation of fitted value automatically.

  • obtain an estimate (or identify) the dispersion parameter if needed (e.g. in a gamma GLM you'll require an estimate of dispersion, whether from the GLM output or from other means). You won't need one for an ordinary binomial GLM (i.e. skip this step in that case).

  • If necessary reparameterize from those parameter estimates to a convenient parameterization for random generation (supplied generators won't usually be in mean-dispersion form) - these will be different for every observation. (again, you won't need this for the binomial)

  • use random number generating routines to simulate from that conditional distribution for each observation (predictor combination $\mathbf{x}_i$).

[In terms of your original example problem the entire set of steps above is doable in a single line of R code - rbinom(length(surv),1,fitted(out.glm,method="response"))]

Such simulations are random and don't match the original data set (the responses will be different in general), but should have similar characteristics

If you want to allow for the fact that the parameter estimates themselves are uncertain it's more complicated, but can at least be done approximately.