Solved – Finding the fitted and predicted values for a statistical model

r

Let's say I have the following data and am running a regression model:

df=data.frame(income=c(5,3,47,8,6,5),
              won=c(0,0,1,1,1,0),
              age=c(18,18,23,50,19,39),
              home=c(0,0,1,0,0,1))

On one hand, I run a linear model to predict on income:

md1 = lm(income ~ age + home + home, data=df)

Second, I run a logit model to predict on the won variable:

md2 = glm(factor(won) ~ age + home, data=df, family=binomial(link="logit"))

For both models, I wonder how I can generate a table or data frame with the predictor response category, fitted value, and the model predicted value.

So for the linear model, something like:

age  fitted_income  predicted_income
18    3              5 
23    3              3
50    4              2
19    5              5
39    6              4

home   fitted_income    predicted_income
0       5               6       
1       3               9

Or perhaps it should be for each data point. So for x_i data point, the fitted and predicted values are:

id   age  fitted_income  predicted_income
1     18    3              5 
2     23    3              3
3     50    4              2
4     19    5              5
5     39    6              4
  1. From a statistical standpoint, is such an undertaking useful? Why or why not?

  2. How can this be done in R? (looked at names(md1) and found what I can pull from the model, but haven't proceeded past that)

Thanks!

Best Answer

You have to be a bit careful with model objects in R. For example, whilst the fitted values and the predictions of the training data should be the same in the glm() model case, they are not the same when you use the correct extractor functions:

R> fitted(md2)
        1         2         3         4         5         6 
0.4208590 0.4208590 0.4193888 0.7274819 0.4308001 0.5806112 
R> predict(md2)
         1          2          3          4          5          6 
-0.3192480 -0.3192480 -0.3252830  0.9818840 -0.2785876  0.3252830

That is because the default for predict.glm() is to return predictions on the scale of the linear predictor. To get the fitted values we want to apply the inverse of the link function to those values. fitted() does that for us, and we can get the correct values using predict() as well:

R> predict(md2, type = "response")
        1         2         3         4         5         6 
0.4208590 0.4208590 0.4193888 0.7274819 0.4308001 0.5806112

Likewise with residuals() (or resid()); the values stored in md2$residuals are the working residuals are are unlikely to be what you want. The resid() method allows you to specify the type of residual you want and has a useful default.

For the glm() model, something like this will suffice:

R> data.frame(Age = df$age, Won = df$won, Fitted = fitted(md2))
  Age Won    Fitted
1  18   0 0.4208590
2  18   0 0.4208590
3  23   1 0.4193888
4  50   1 0.7274819
5  19   1 0.4308001
6  39   0 0.5806112

Something similar can be done for the lm() model:

R> data.frame(Age = df$age, Income = df$income, Fitted = fitted(md1))
  Age Income    Fitted
1  18      5  7.893273
2  18      3  7.893273
3  23     47 28.320749
4  50      8 -1.389725
5  19      6  7.603179
6  39      5 23.679251