Solved – Compare predicted versus actual outcomes in a GLM

Tags: generalized-linear-model, mathematical-statistics, predictive-models, r, residuals

I read somewhere that you could compute a "residual value" for a GLM by taking the actual values of your response variable divided by the predicted value of that response variable.

For example, suppose the response variable y represents the number of cars, and $x_1$ represents the age of a car. We would fit a GLM and calculate the "residual value", denoted residual below, for every observation in our data set with something like the following in R:

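library(dplyr)

# Poisson GLM with a log link
m <- glm(y ~ x_1, data = dataset, family = poisson(link = 'log'))
# Predicted mean on the response scale
dataset <- dataset %>% mutate(pred_value = predict(m, type = 'response'))
# Ratio of actual to predicted values
dataset %>% mutate(residual = y / pred_value)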

I'm wondering if that actually makes sense, since unlike linear regression the GLM equation generally doesn't contain a residual term in it.

If not, what would be the best way to compare predicted versus actual values? The goal is to see if one can derive a second predictor from the noise not modeled by the GLM.

Best Answer

For a Poisson GLM you can get the residuals directly; it's similar to what @Demetri commented (y_observed - y_predicted):

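library(MASS)
data(Insurance)
dataset <- Insurance
# family must be given explicitly; glm() defaults to gaussian otherwise
fit <- glm(Claims ~ Age, offset = log(Holders), family = poisson, data = dataset)
# response residuals: y_observed - y_predicted (fit$residuals holds working residuals)
dataset$residuals <- residuals(fit, type = "response")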

You can plot the residuals against the fitted values:

plot(fitted(fit),dataset$residuals)
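
To connect this back to the ratio in the question, here is a minimal sketch on the same Insurance data. It assumes a Poisson model with a log link and puts the offset inside the formula; the names res_response, res_pearson and ratio are only illustrative. For a log link the working residual is (y - mu)/mu, so the ratio y/mu is just the working residual plus one.

```r
library(MASS)  # provides the Insurance data

data(Insurance)
fit <- glm(Claims ~ Age + offset(log(Holders)),
           family = poisson(link = "log"), data = Insurance)

y  <- Insurance$Claims
mu <- unname(fitted(fit))            # predicted counts, offset included

res_response <- y - mu               # raw difference, on the count scale
res_pearson  <- (y - mu) / sqrt(mu)  # scaled by the Poisson standard deviation
ratio        <- y / mu               # the "residual value" from the question

# The same quantities are available from residuals()
all.equal(res_response, unname(residuals(fit, type = "response")))
all.equal(res_pearson,  unname(residuals(fit, type = "pearson")))
all.equal(ratio - 1,    unname(residuals(fit, type = "working")))
```

So the ratio is not meaningless, but it is unscaled: a ratio of 2 means something very different when the prediction is 1 than when it is 100. Pearson or deviance residuals are usually easier to compare across observations.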

Now, if we want to explore whether the residuals can be explained by other variables, I don't think you can fit a Poisson GLM again (I might be wrong), since the residuals can be negative and are no longer counts. So maybe we explore that with a random forest instead:

library(randomForest)
f2 = randomForest(residuals ~ Group + Age + District,data=dataset,importance=TRUE)

importance(f2,type=1)
           %IncMSE
Group    11.898473
Age      -1.058334
District 10.104154

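You can see that Age has no further effect, while in this dataset the remaining factors, Group and District, unfortunately still do. (The exact importance values will differ between runs, since the forest is random.) Maybe you can also check this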

I am guessing your intent is to control for (regress out) the effect of certain variables on your response and then fit what remains with another model. You could consider fitting a full model with all the covariates, regressing out the so-called "nuisance" parameters, and fitting everything again.
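
One way to act on the "full model" suggestion is to add the candidate predictors to the Poisson GLM itself and compare the nested fits, rather than modelling the residuals separately. This is only a sketch on the Insurance data; the choice of Group and District as extra covariates and the names fit_age and fit_full are assumptions for illustration.

```r
library(MASS)  # provides the Insurance data

data(Insurance)

# Model with Age only, using the number of policy holders as exposure
fit_age <- glm(Claims ~ Age + offset(log(Holders)),
               family = poisson, data = Insurance)

# Same model with the candidate predictors added
fit_full <- update(fit_age, . ~ . + Group + District)

# Likelihood-ratio (deviance) test: do the extra terms explain
# variation that Age alone leaves in the residuals?
anova(fit_age, fit_full, test = "Chisq")
```

A large drop in deviance relative to the degrees of freedom used suggests the added variables carry signal that the Age-only model left unexplained.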

Related Solutions

Solved – Use predicted values with or without random part to plot Residuals with binnedplot of a logistic regression in glmer (lme4 package) in R

You should use resid(fit.glmer,type="response"), since the default type is "deviance" while y.glmer is on the "response" scale. I bet this will noticeably change your binnedplot of the residuals against the predicted values with the random part. See also: Interpreting a binned residual plot in logistic regression
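
The original model and data are not shown here, so the following sketch uses the cbpp data shipped with lme4 as a stand-in; the model formula and all object names are assumptions. It shows that the default residuals of a glmer fit are deviance residuals, and how to obtain fitted probabilities with and without the random part.

```r
library(lme4)

# Stand-in binomial mixed model on lme4's built-in cbpp data
gm <- glmer(cbind(incidence, size - incidence) ~ period + (1 | herd),
            data = cbpp, family = binomial)

r_default  <- resid(gm)                     # deviance residuals by default
r_response <- resid(gm, type = "response")  # residuals on the response scale

p_with_re    <- fitted(gm)                                    # includes herd effects
p_without_re <- predict(gm, re.form = NA, type = "response")  # fixed effects only

# The default is the same as asking for deviance residuals explicitly
identical(r_default, resid(gm, type = "deviance"))
```

Either set of predictions could then be plotted against r_response, for example with binnedplot() from the arm package as in the question.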

Solved – Is it the correct usage of nnet in R

For the most part you are fine; however, the glaring issue is that you have no reason to convert char to an integer as nnet accepts factors. That is why you only see 1's reported. As an example:

set.seed(123)
vars <- as.matrix(replicate(18, rnorm(25)))

# Wrong way
char <- as.integer(factor(rep(letters[1:5], each=5)))
df <- data.frame(char, vars)
head(df)

library(nnet)

nn1 <- nnet(char ~ ., data=df, size=20, maxit=1000, range=0.1, trace=TRUE)
nn1$fitted.values
   [,1]
1     1
2     1
3     1
4     1
5     1
6     1
...

# Right way: keep the response as a factor
char <- factor(rep(letters[1:5], each=5))
df <- data.frame(char, vars)

nn2 <- nnet(char ~ ., data=df, size=20, maxit=1000, range=0.1, trace=TRUE)
nn2$fitted.values
              a            b            c            d            e
1  1.000000e+00 2.281148e-08 2.034399e-11 5.934214e-12 3.212223e-10
2  1.000000e+00 1.568664e-09 3.117289e-14 7.895235e-14 5.656804e-23
3  9.999958e-01 1.666613e-07 6.259551e-08 3.969482e-06 4.485918e-23
4  9.999994e-01 5.522178e-07 3.361721e-10 1.284468e-08 1.236227e-20
5  9.999909e-01 8.255399e-06 2.314208e-09 8.657084e-07 1.898005e-14
6  1.718135e-17 1.000000e+00 6.838461e-14 1.594482e-12 3.872572e-21
...

When you submit actual classes, you then get output that you can actually use for predicting classes.
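
As a sketch of that last step, the fitted network can return the most probable class directly via predict() with type = "class". The data below are the same simulated noise as above, so the predictions only illustrate the mechanics; the object names are illustrative.

```r
library(nnet)

set.seed(123)
vars <- replicate(18, rnorm(25))             # 25 x 18 matrix of noise predictors
char <- factor(rep(letters[1:5], each = 5))  # factor response with 5 classes
df   <- data.frame(char, vars)

nn <- nnet(char ~ ., data = df, size = 20, maxit = 1000,
           range = 0.1, trace = FALSE)

# One column of class probabilities per factor level
dim(nn$fitted.values)

# Most probable class for each row
pred <- predict(nn, newdata = df, type = "class")
table(observed = df$char, predicted = pred)
```

Because the network is evaluated on its own training data here, the table says nothing about out-of-sample accuracy.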
