Solved – Simulate data based on output from a generalized linear model in R

generalized linear modelrsimulation

I have a generalized linear model for a data set and I want to use the model to simulate data that matches the data used to generate the model. I am not wanting to resample the original data, but rather use the model output to generate data that matches original data.

For example, suppose I have a binomial data set in which survivorship is a function of size.

size <- c(3, 5, 10, 5, 17, 7, 17, 5, 18, 20, 9, 4, 17, 14, 8, 19, 11, 
3, 12, 14, 4, 11, 10, 22, 10, 12, 10, 7, 3, 10, 9, 5, 4, 23, 
8, 7, 7, 11, 19, 12, 3, 9, 6, 6, 8, 14, 18, 10, 9, 11)

surv <- c(1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 
0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 
1, 0, 1, 1, 0, 1, 0, 1, 0)

dat=data.frame(surv,size)
out.glm = glm(surv ~ size, family=binomial, data = dat)

I can graph the model predictions and standard errors, but how do I use the output of the glm to simulate data that should theoretically match the original dataset? Specifically, should I use the standard error of the mean as the input to rbinom() across the range of probabilities and then resample from the output?

Best Answer

A generalized linear model specification consists of a distribution family and a link function (depending on the family, it may also include a free dispersion parameter).

The model's parameter estimates for the mean allow us to estimate the conditional mean, $\hat{\mu}(\mathbf{x})=E[Y|\mathbf{x}]$.

Assuming you want the estimated parameters as the population parameters (to simulate just from the fitted model as if it were the population model), the general steps would be:

At each predictor combination (at each $\mathbf{x}_i$), compute $\hat{\mu}(\mathbf{x}_i)$. There will typically be a function to do this computation of fitted value automatically.
obtain an estimate (or identify) the dispersion parameter if needed (e.g. in a gamma GLM you'll require an estimate of dispersion, whether from the GLM output or from other means). You won't need one for an ordinary binomial GLM (i.e. skip this step in that case).
If necessary reparameterize from those parameter estimates to a convenient parameterization for random generation (supplied generators won't usually be in mean-dispersion form) - these will be different for every observation. (again, you won't need this for the binomial)
use random number generating routines to simulate from that conditional distribution for each observation (predictor combination $\mathbf{x}_i$).

[In terms of your original example problem the entire set of steps above is doable in a single line of R code - rbinom(length(surv),1,fitted(out.glm,method="response"))]

Such simulations are random and don't match the original data set (the responses will be different in general), but should have similar characteristics

If you want to allow for the fact that the parameter estimates themselves are uncertain it's more complicated, but can at least be done approximately.

Related Solutions

Solved – How is the intercept calculated in a generalized linear model and why is it different from a linear model

The difference in estimated intercepts is not because of the overdispersion in the data. Peter Flom's comment is the correct answer. To see this, change the lm() model into a glm() model with a gaussian family:

glm(data ~ 1, family = gaussian)
glm(data ~ 1, family = gaussian(link="log"),start=c(20))

The canonical link for the gaussian family is the identity link, so you get exactly the same estimate as for lm(). Changing the link to the log link function gives you the same estimate of the intercept that you're getting from the Poisson and NB models. The gaussian model with log link is $log(E(Y|X))=θ^′X$, while the glm with identity link is $E(Y|X)=θ^′X$. That's why exponentiating the estimated intercept for the log link models $e^{5.453} = 233$ gives you the estimated intercept for the identity link models - you are using the inverse link function. Getting the expected # of saplings per plot for this simple model is easy with just the value of the coefficient, but once you add treatment effects and other covariates it will be more difficult. You should use the predict() function like this:

data = data.frame(saplings=data,
                  treat=gl(2,6,24,label=c("control","treat")),
                  year=gl(2,12,label=c("2004","2011")))

test.glm = glm.nb(saplings~treat*year,link=log,data=data)
nd = data.frame(treat=gl(2,1,4,label=c("control","treat")),
            year=gl(2,2,label=c("2004","2011")))
predict(test.glm,newdata=nd,type="response")

Also see this question, and read Chapter 6 of Zuur et al (2007) "Analysing Ecological Data"

Solved – Using fitted GLM model to simulate y’s from new x-values

How about this. First make some fake data to test on

x <- rnorm(100) 
y <- rpois(rep(1,100), exp(x)) ## poisson regression with slope=1
## fit model
m1 <- glm(y ~ x,family=poisson)

Now decide on your new x points

new.data <- data.frame(x=seq(-3,3,.1))

get the predicted expected value of y for these points

mu.y <- predict(m1, newdata=new.data, type='response')

and generate k sets of simulated y at these new points. Your question has k=1 but we may as well be general.

sim.y <- replicate(k, rpois(rep(1, length(mu.y)), mu.y))

Now sim.y is a matrix with as many rows as new.data and k columns, each containing a possible set of y values assuming the model is correct.

Best Answer

Related Solutions

Solved – How is the intercept calculated in a generalized linear model and why is it different from a linear model

Solved – Using fitted GLM model to simulate y’s from new x-values

Related Question