Solved – Compare predicted versus actual outcomes in a GLM

Tags: generalized linear model, mathematical-statistics, predictive-models, r, residuals

I read somewhere that you could compute a "residual value" for a GLM by taking the actual values of your response variable divided by the predicted value of that response variable.

For example, suppose the response variable y represents the number of cars, and $x_1$ represents the age of a car. We would fit a GLM and calculate the "residual value", denoted residual below, for every observation in our data set with something like the following in R:

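library(dplyr)
# Poisson GLM with a log link
m <- glm(y ~ x_1, data = dataset, family = poisson(link = 'log'))
# predicted mean on the response scale
dataset <- dataset %>% mutate(pred_value = predict(m, type = 'response'))
# ratio of actual to predicted values
dataset %>% mutate(residual = y / pred_value)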

I'm wondering if that actually makes sense, since unlike linear regression the GLM equation generally doesn't contain a residual term in it.

If not, what would be the best way to compare predicted versus actual values? The goal is to see if one can derive a second predictor from the noise not modeled by the GLM.

Best Answer

For a Poisson GLM you can extract the residuals directly; this is similar to what @Demetri commented (y_observed - y_predicted):

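library(MASS)
data(Insurance)
dataset = Insurance
# family = poisson must be given explicitly; glm() defaults to a Gaussian family
fit = glm(Claims ~ Age, offset = log(Holders), family = poisson, data = dataset)
# response residuals: observed minus fitted counts
dataset$residuals = residuals(fit, type = "response")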

You can plot the residuals against the fitted values:

plot(fitted(fit),dataset$residuals)
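
As an aside on which residuals are being used: residuals() for a glm object supports several types, and the residuals component stored on the fitted object holds the working residuals rather than y_observed - y_predicted. The sketch below assumes a Poisson model with a log link on the same Insurance data; the names fit_pois, res_response, res_pearson, res_working and ratio are only illustrative. It suggests that, under those assumptions, the ratio from the question is the working residual plus one, so the ratio is not meaningless, just one of several possible scalings.

```r
library(MASS)  # provides the Insurance data

# Assumption: Poisson family with a log link, as in the question
fit_pois <- glm(Claims ~ Age + offset(log(Holders)),
                family = poisson(link = "log"), data = Insurance)

y  <- Insurance$Claims
mu <- fitted(fit_pois)  # predicted counts on the response scale (offset included)

res_response <- residuals(fit_pois, type = "response")  # y - mu
res_pearson  <- residuals(fit_pois, type = "pearson")   # (y - mu) / sqrt(mu)
res_working  <- residuals(fit_pois, type = "working")   # (y - mu) / mu for a log link

# The actual/predicted ratio from the question
ratio <- y / mu

# Should be TRUE: ratio - 1 equals the working residual
all.equal(unname(res_working), unname(ratio - 1))
```

Pearson or deviance residuals are usually easier to read in a residual plot, because they account for the variance growing with the mean in a Poisson model.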

Now, if we want to explore whether the residuals can be explained by other variables, I don't think you can fit a Poisson GLM to them again (I might be wrong), since the residuals can be negative and are no longer counts. So maybe we explore that with a random forest:

library(randomForest)
f2 = randomForest(residuals ~ Group + Age + District,data=dataset,importance=TRUE)

importance(f2,type=1)
           %IncMSE
Group    11.898473
Age      -1.058334
District 10.104154

You can see that Age has no remaining effect. In this dataset, unfortunately, the other factors still have an effect. Maybe you can also check this

I am guessing your intent is to control for (regress out) the effect of certain variables on your response, and then fit what is left with another model. You can consider fitting a full model with all the covariates, regressing out the so-called "nuisance" parameters, and fitting everything again.
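
A minimal sketch of the "full model" idea, assuming the same Insurance data and a Poisson family: fit the Age-only model, add the remaining covariates, and compare the two nested fits. The names base_fit and full_fit are hypothetical, and this is one possible way to check whether the extra predictors explain what the first model left over, not the only one.

```r
library(MASS)  # provides the Insurance data

# Baseline: Age only, with log(Holders) as the exposure offset
base_fit <- glm(Claims ~ Age + offset(log(Holders)),
                family = poisson, data = Insurance)

# Full model: add the remaining covariates to the same specification
full_fit <- update(base_fit, . ~ . + Group + District)

# Analysis of deviance (likelihood-ratio test) between the nested models
anova(base_fit, full_fit, test = "Chisq")

# Contribution of each term given all the others
drop1(full_fit, test = "Chisq")
```

Comparing nested GLMs this way keeps everything on the count scale, so there is no need to model the residuals separately.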