I figured out some answers to my questions.
Regarding the mathematical expressions of predict and simulate, these can be obtained by the following code (thanks to a tip from W. van der Elst):
getS3method(c("predict"), class = "glm")
getS3method(c("simulate"), class = "lm")
.
Regarding the predict
function, I incorrectly used the option type=”response”
. This does not include the stochastic uncertainty as was my aim. It is the prediction on the backtransformed linear scale. This can be tested by the following graph (run after the initial code stated in the question):
plot(predict(m, type="response"), p_lin^2)
.
Regarding the simulate
function, it seemed to be correct if I look at the plot from the initial question:
plot(p, d[,1], xlab="predicted values", ylab="original data", xlim=xylim, ylim=xylim, col=rgb(0,0,0,alpha=0.1))
points(simulate(m)$sim_1, d[,1], col=rgb(0,1,0,alpha=0.1))
However, if I upscale the dependent variable:
d[,1] <- d[,1]*10000
And recalculate the predictions:
m <- glm(formula=d[,1] ~ d[,2] + d[,3] + d[,4], family=gaussian(link="sqrt"))
p_lin <- m$coef[1] + m$coef[2]*d[,2] + m$coef[3]*d[,3] + m$coef[4]*d[,4]
p <- rnorm(n=n, mean=p_lin^2, sd=sd(p_lin^2 - d[,1]))
And plot the results:
par(mfrow=c(1,1), mar=c(4,2,2,1), pch=16, cex=0.8, pty="s")
xylim <- c(min(c(d[,1], p)), max(c(d[,1], p)))
plot(p, d[,1], xlab="predicted values", ylab="original data", xlim=xylim, ylim=xylim, col=rgb(0,0,0,alpha=0.1))
points(simulate(m)$sim_1, d[,1], col=rgb(0,1,0,alpha=0.1))
I see different predictions.
This seems to be attributable to a difference between the value of the sd in my general formula approach and the simulate()
function approach. Recall the calculation of the sd in the general formulas approach:
sd=sd(p_lin^2 - d[,1])
in the simulate function R calculated the sd differently from the general formulas approach (for a Gaussian family):
vars <- deviance(m)/df.residual(m)
if (!is.null(m$weights)) vars <- vars/m$weights # the m$weights seems similar to the m$fitted.values multiplied by about 4
fitted(m) + rnorm(n, sd = sqrt(vars))
I don’t understand why the sd is calculated and then divided by the m$weights
. Why is sd a vector and not a single value? The simulate()
help text states: “The methods for linear models fitted by lm or glm(family = "gaussian") assume that any weights which have been supplied are inversely proportional to the error variance.” I can’t seem to grasp the meaning of this. If I run multiple simulations using the simulate()
function, the simulations look very much like each other:
plot(simulate(m)$sim_1,simulate(m,2)$sim_2)
If I do not use a sqrt-link function I get simulations that seem to better reflect stochastic uncertainty as they are less identical when I run them twice:
library(MASS)
rm(list=ls())
set.seed(2)
n <- 1500
d <- mvrnorm(n=n, mu=c(0,0,0,0), Sigma=matrix(.7, nrow=4, ncol=4) + diag(4)*.3)
d[,1] <- qgamma(p=pnorm(q=d[,1]), shape=2, rate=2)
d[,1] <- d[,1]*10000
m <- glm(formula=d[,1] ~ d[,2] + d[,3] + d[,4])
plot(simulate(m)$sim_1,simulate(m,2)$sim_2)
What method is correct? What is the difference?
(should I post this as a separate question?)
I was hoping someone else would expand upon my comments in a full answer, but here is mine, incase anyone needs help with a similar question.
1. Response to what does it predict
Glm ''like'' regression predicts the mean value given the independent variables. Selecting response in predict function back transforms the prediction out of the link scale (inverses the link), so the prediction is on the same scale as the dependent variable. If you "overfit" the model you can return back the exact value of y, but when modeling you're trying to generalize and approximate. It may not be the same as the arithmetic mean because the mean is estimated using maximum likelihood and is conditional on the included independent variables.
2. Response to how to "know" if the gamma distribution is right for my data
The gamma distribution is very flexible and is, in fact, a series of distributions that changes shape depending on the response (variance changes with mean). The gamma distribution requires the data to be positive definite and continuous if you fit those conditions the gamma distribution will quite often fit. Additionally, the Pearson residuals should be normally distributed and show no signs of heteroscedasticity. For more on model evaluation see: https://www.crcpress.com/Extending-the-Linear-Model-with-R-Generalized-Linear-Mixed-Effects-and/Faraway/p/book/9781498720960
3. Response to what happens when I use the qgamma function?
I am not 100% sure why the person you borrowed your sample code from is using the qgamma function. However, it looks to me like they are trying to return the 90th percent confidence limit. This is the code I use to return back 95% confidence limits on predictions.
preds_link = predict(model, newdata = test_data,
type = "link",
se.fit = TRUE)# use the glm to make predictions, also provides std. error
critval <- 1.96 # critical value for approx 95% CI
upper_ci_link <- preds_link$fit + (critval * preds_link$se.fit)# estimate upper CI for prediction on link scale
lwr_ci_link <- preds_link$fit - (critval * preds_link$se.fit)# estimate lower CI for prediction on link scale
fit_link <- preds_link$fit# returns fited value
upper_ci <- model$family$linkinv(upper_ci_link)
lwr_ci <- model$family$linkinv(lwr_ci_link)
fit <- model$family$linkinv(fit)
preds_link = data.frame(fit,lwr_ci,upper_ci,
fit_link,lwr_ci_link,
upper_ci_link)# puts predictions, CI in a single dataframe
colnames(preds_link) = c("Prediction","LCL", "UCL",
"Link_Prediction","Link_LCL", "Link_UCL")# give variables logical names
Prediction, LCL, and UCL: Prediction and Confidence Limits on response scale (i.e., same scale as data)
Link_Prediction, Link_LCL, and Link_UCL: Prediction and Confidence Limits on link scale (i.e., log transformed)
If you wish to return back the 90th percent confidence value just change the critical value used to one corresponding to $\alpha$ = 0.10 instead of $\alpha$ = 0.05
Best Answer
If you include type="response" in your predict code then the predicted values will be on the response scale.
predict(object, newdata = NULL, type = c("link", "response", "terms"), se.fit = FALSE, dispersion = NULL, terms = NULL, na.action = na.pass, ...)