Solved – GLM with Gamma-Log link: How to do Predictions

gamma distributiongeneralized linear modelrregression

I am rather new to regression analysis, having a completely different background. I am trying to build a model for predictions.

The distribution of my dependent variable, Value (in US$), is right skewed and nonnegative. My explanatory variables are CurrentValue, i.e., the value of the item one year before, the age A of the item, and a categorical variable Type.

Reading a bit, also from other posts here, I came to the conclusion that a GLM with gamma distribution and log link seems a suitable choice. In R:

fit <- glm(Value ~ CurrentValue + Age + Type -1, 
           family=Gamma(link="log"), data=values)

My question, probably trivial to the most, is the following: How do I back-transform the predicted value into the scale of my dependent variable Value? More simply, given that I call:

 predicted <- predict(fit, newvalues)

how do I obtain values in US$, given the vector predicted?
Any help will be highly appreciated.

Best Answer

If you include type="response" in your predict code then the predicted values will be on the response scale.

predict(object, newdata = NULL, type = c("link", "response", "terms"), se.fit = FALSE, dispersion = NULL, terms = NULL, na.action = na.pass, ...)

Related Solutions

Solved – simulate GLM with square root link in R

I figured out some answers to my questions.

Regarding the mathematical expressions of predict and simulate, these can be obtained by the following code (thanks to a tip from W. van der Elst):

getS3method(c("predict"), class = "glm")
getS3method(c("simulate"), class = "lm")

Regarding the predict function, I incorrectly used the option type=”response”. This does not include the stochastic uncertainty as was my aim. It is the prediction on the backtransformed linear scale. This can be tested by the following graph (run after the initial code stated in the question):

plot(predict(m, type="response"), p_lin^2)

Regarding the simulate function, it seemed to be correct if I look at the plot from the initial question:

plot(p, d[,1], xlab="predicted values", ylab="original data", xlim=xylim, ylim=xylim, col=rgb(0,0,0,alpha=0.1))
points(simulate(m)$sim_1, d[,1], col=rgb(0,1,0,alpha=0.1))

However, if I upscale the dependent variable:

d[,1] <- d[,1]*10000

And recalculate the predictions:

m <- glm(formula=d[,1] ~ d[,2] + d[,3] + d[,4], family=gaussian(link="sqrt"))
p_lin <- m$coef[1] + m$coef[2]*d[,2] + m$coef[3]*d[,3] + m$coef[4]*d[,4]
p <- rnorm(n=n, mean=p_lin^2, sd=sd(p_lin^2 - d[,1]))

And plot the results:

par(mfrow=c(1,1), mar=c(4,2,2,1), pch=16, cex=0.8, pty="s")
xylim <- c(min(c(d[,1], p)), max(c(d[,1], p)))
plot(p, d[,1], xlab="predicted values", ylab="original data", xlim=xylim, ylim=xylim, col=rgb(0,0,0,alpha=0.1))
points(simulate(m)$sim_1, d[,1], col=rgb(0,1,0,alpha=0.1))

I see different predictions.

This seems to be attributable to a difference between the value of the sd in my general formula approach and the simulate() function approach. Recall the calculation of the sd in the general formulas approach:

sd=sd(p_lin^2 - d[,1])

in the simulate function R calculated the sd differently from the general formulas approach (for a Gaussian family):

vars <- deviance(m)/df.residual(m)
if (!is.null(m$weights)) vars <- vars/m$weights # the m$weights seems similar to the m$fitted.values multiplied by about 4
fitted(m) + rnorm(n, sd = sqrt(vars))

I don’t understand why the sd is calculated and then divided by the m$weights. Why is sd a vector and not a single value? The simulate() help text states: “The methods for linear models fitted by lm or glm(family = "gaussian") assume that any weights which have been supplied are inversely proportional to the error variance.” I can’t seem to grasp the meaning of this. If I run multiple simulations using the simulate() function, the simulations look very much like each other:

plot(simulate(m)$sim_1,simulate(m,2)$sim_2)

If I do not use a sqrt-link function I get simulations that seem to better reflect stochastic uncertainty as they are less identical when I run them twice:

library(MASS)
rm(list=ls())
set.seed(2)
n <- 1500
d <- mvrnorm(n=n, mu=c(0,0,0,0), Sigma=matrix(.7, nrow=4, ncol=4) + diag(4)*.3)
d[,1] <- qgamma(p=pnorm(q=d[,1]), shape=2, rate=2)
d[,1] <- d[,1]*10000
m <- glm(formula=d[,1] ~ d[,2] + d[,3] + d[,4])
plot(simulate(m)$sim_1,simulate(m,2)$sim_2)

What method is correct? What is the difference? (should I post this as a separate question?)

Generalized Linear Model – What Do Predicted Values Mean in Gamma GLM with Log Link?

I was hoping someone else would expand upon my comments in a full answer, but here is mine, incase anyone needs help with a similar question.

1. Response to what does it predict

Glm ''like'' regression predicts the mean value given the independent variables. Selecting response in predict function back transforms the prediction out of the link scale (inverses the link), so the prediction is on the same scale as the dependent variable. If you "overfit" the model you can return back the exact value of y, but when modeling you're trying to generalize and approximate. It may not be the same as the arithmetic mean because the mean is estimated using maximum likelihood and is conditional on the included independent variables.

2. Response to how to "know" if the gamma distribution is right for my data

The gamma distribution is very flexible and is, in fact, a series of distributions that changes shape depending on the response (variance changes with mean). The gamma distribution requires the data to be positive definite and continuous if you fit those conditions the gamma distribution will quite often fit. Additionally, the Pearson residuals should be normally distributed and show no signs of heteroscedasticity. For more on model evaluation see: https://www.crcpress.com/Extending-the-Linear-Model-with-R-Generalized-Linear-Mixed-Effects-and/Faraway/p/book/9781498720960

3. Response to what happens when I use the qgamma function?

I am not 100% sure why the person you borrowed your sample code from is using the qgamma function. However, it looks to me like they are trying to return the 90th percent confidence limit. This is the code I use to return back 95% confidence limits on predictions.

 preds_link = predict(model, newdata = test_data,
                     type = "link",
                     se.fit = TRUE)# use the glm to make predictions, also provides  std. error
 critval <- 1.96 # critical value for approx 95% CI
 upper_ci_link <- preds_link$fit + (critval * preds_link$se.fit)# estimate upper CI for prediction on link scale
 lwr_ci_link <- preds_link$fit - (critval * preds_link$se.fit)# estimate lower CI for prediction on link scale
 fit_link <- preds_link$fit# returns fited value
 upper_ci <- model$family$linkinv(upper_ci_link)
 lwr_ci <- model$family$linkinv(lwr_ci_link)
 fit <- model$family$linkinv(fit)
 preds_link = data.frame(fit,lwr_ci,upper_ci,
                    fit_link,lwr_ci_link,
                    upper_ci_link)# puts predictions, CI in a single dataframe
 colnames(preds_link) = c("Prediction","LCL", "UCL",
                     "Link_Prediction","Link_LCL", "Link_UCL")# give variables logical names

Prediction, LCL, and UCL: Prediction and Confidence Limits on response scale (i.e., same scale as data)

Link_Prediction, Link_LCL, and Link_UCL: Prediction and Confidence Limits on link scale (i.e., log transformed)

If you wish to return back the 90th percent confidence value just change the critical value used to one corresponding to $\alpha$ = 0.10 instead of $\alpha$ = 0.05

Best Answer

Related Solutions

Solved – simulate GLM with square root link in R

Generalized Linear Model – What Do Predicted Values Mean in Gamma GLM with Log Link?

Related Question