Solved – simulate GLM with square root link in R

generalized linear modellink-functionrsimulation

I'm trying to simulate a fitted GLM using basic functions, not using the simulate() and predict() functions that are widely questioned and answered. I get different results when I compare my math function to the simulate() and predict() function. Most probably I'm doing something wrong but I cannot seem to find the error.

First I generate random correlated data with a skewed dependent variable:

library(MASS)
n <- 1500
d <- mvrnorm(n=n, mu=c(0,0,0,0), Sigma=matrix(.7, nrow=4, ncol=4) + diag(4)*.3)
d[,1] <- qgamma(p=pnorm(q=d[,1]), shape=2, rate=2)

Next I fit a GLM with a square root link to adjust for the skewed data:

m <- glm(formula=d[,1] ~ d[,2] + d[,3] + d[,4], family=gaussian(link="sqrt"))

Next I generate predicted values first on the linear scale and then on the inverse scale inclusing stochastic uncertainty (eventually I want to use other than the source data as input values):

p_lin <- m$coef[1] + m$coef[2]*d[,2] + m$coef[3]*d[,3] + m$coef[4]*d[,4]
p <- rnorm(n=n, mean=p_lin^2, sd=sd(p_lin^2 - d[,1]))

I compare the results to the simulate() and predict() function:

par(mfrow=c(1,1), mar=c(4,2,2,1), pch=16, cex=0.8, pty="s")
xylim <- c(min(c(d[,1], p)), max(c(d[,1], p)))
plot(p, d[,1], xlab="predicted values", ylab="original data", xlim=xylim, ylim=xylim, col=rgb(0,0,0,alpha=0.1))
points(simulate(m)$sim_1, d[,1], col=rgb(0,1,0,alpha=0.1))
points(predict(m, type="response"), d[,1], col=rgb(1,0,0,alpha=0.1))
abline(a=0, b=1, col="red")

What is going wrong in my formula to predict values?
Where can I find what mathematical and R expressions are being used in the predict() and simulate() functions?
Is there a link explaining simulation of GLM including stochastic uncertainty (and eventually my next step is also parameter uncertainty) in various family/link combinations applied in R. I found one nice source on GLM simulation though not answering my specific questions: http://www.sagepub.com/upm-data/57233_Chapter_6.pdf

Best Answer

I figured out some answers to my questions.

Regarding the mathematical expressions of predict and simulate, these can be obtained by the following code (thanks to a tip from W. van der Elst):

getS3method(c("predict"), class = "glm")
getS3method(c("simulate"), class = "lm")

.

Regarding the predict function, I incorrectly used the option type=”response”. This does not include the stochastic uncertainty as was my aim. It is the prediction on the backtransformed linear scale. This can be tested by the following graph (run after the initial code stated in the question):

plot(predict(m, type="response"), p_lin^2)

.

Regarding the simulate function, it seemed to be correct if I look at the plot from the initial question:

plot(p, d[,1], xlab="predicted values", ylab="original data", xlim=xylim, ylim=xylim, col=rgb(0,0,0,alpha=0.1))
points(simulate(m)$sim_1, d[,1], col=rgb(0,1,0,alpha=0.1))

However, if I upscale the dependent variable:

d[,1] <- d[,1]*10000

And recalculate the predictions:

m <- glm(formula=d[,1] ~ d[,2] + d[,3] + d[,4], family=gaussian(link="sqrt"))
p_lin <- m$coef[1] + m$coef[2]*d[,2] + m$coef[3]*d[,3] + m$coef[4]*d[,4]
p <- rnorm(n=n, mean=p_lin^2, sd=sd(p_lin^2 - d[,1]))

And plot the results:

par(mfrow=c(1,1), mar=c(4,2,2,1), pch=16, cex=0.8, pty="s")
xylim <- c(min(c(d[,1], p)), max(c(d[,1], p)))
plot(p, d[,1], xlab="predicted values", ylab="original data", xlim=xylim, ylim=xylim, col=rgb(0,0,0,alpha=0.1))
points(simulate(m)$sim_1, d[,1], col=rgb(0,1,0,alpha=0.1))

I see different predictions.

This seems to be attributable to a difference between the value of the sd in my general formula approach and the simulate() function approach. Recall the calculation of the sd in the general formulas approach:

sd=sd(p_lin^2 - d[,1])

in the simulate function R calculated the sd differently from the general formulas approach (for a Gaussian family):

vars <- deviance(m)/df.residual(m)
if (!is.null(m$weights)) vars <- vars/m$weights # the m$weights seems similar to the m$fitted.values multiplied by about 4
fitted(m) + rnorm(n, sd = sqrt(vars))

I don’t understand why the sd is calculated and then divided by the m$weights. Why is sd a vector and not a single value? The simulate() help text states: “The methods for linear models fitted by lm or glm(family = "gaussian") assume that any weights which have been supplied are inversely proportional to the error variance.” I can’t seem to grasp the meaning of this. If I run multiple simulations using the simulate() function, the simulations look very much like each other:

plot(simulate(m)$sim_1,simulate(m,2)$sim_2)

If I do not use a sqrt-link function I get simulations that seem to better reflect stochastic uncertainty as they are less identical when I run them twice:

library(MASS)
rm(list=ls())
set.seed(2)
n <- 1500
d <- mvrnorm(n=n, mu=c(0,0,0,0), Sigma=matrix(.7, nrow=4, ncol=4) + diag(4)*.3)
d[,1] <- qgamma(p=pnorm(q=d[,1]), shape=2, rate=2)
d[,1] <- d[,1]*10000
m <- glm(formula=d[,1] ~ d[,2] + d[,3] + d[,4])
plot(simulate(m)$sim_1,simulate(m,2)$sim_2)

What method is correct? What is the difference? (should I post this as a separate question?)