Let's say I have the following data and am running a regression model:
df=data.frame(income=c(5,3,47,8,6,5),
won=c(0,0,1,1,1,0),
age=c(18,18,23,50,19,39),
home=c(0,0,1,0,0,1))
On one hand, I run a linear model to predict on income:
md1 = lm(income ~ age + home, data=df)
Second, I run a logit model to predict on the won variable:
md2 = glm(factor(won) ~ age + home, data=df, family=binomial(link="logit"))
For both models, I wonder how I can generate a table or data frame with the predictor response category, fitted value, and the model predicted value.
So for the linear model, something like:
age fitted_income predicted_income
18 3 5
23 3 3
50 4 2
19 5 5
39 6 4
home fitted_income predicted_income
0 5 6
1 3 9
Or perhaps it should be for each data point. So for x_i data point, the fitted and predicted values are:
id age fitted_income predicted_income
1 18 3 5
2 23 3 3
3 50 4 2
4 19 5 5
5 39 6 4
- From a statistical standpoint, is such an undertaking useful? Why or why not?
- How can this be done in R? (I looked at names(md1) and found what I can pull from the model, but haven't proceeded past that.)
Thanks!
Best Answer
You have to be a bit careful with model objects in R. For example, whilst the fitted values and the predictions for the training data should be the same in the glm() model case, they are not the same when you use the correct extractor functions, fitted(md2) and predict(md2), with their defaults.

That is because the default for predict.glm() is to return predictions on the scale of the linear predictor. To get the fitted values we need to apply the inverse of the link function to those values. fitted() does that for us, and we can get the correct values using predict() as well, by calling predict(md2, type = "response").

Likewise with residuals() (or resid()); the values stored in md2$residuals are the working residuals and are unlikely to be what you want. The resid() method allows you to specify the type of residual you want and has a useful default.

For the glm() model, a data frame that combines the predictor columns of df with fitted(md2) and resid(md2) will suffice. Something similar can be done for the lm() model using fitted(md1) and resid(md1).