Solved – Finding the fitted and predicted values for a statistical model

Let's say I have the following data and am running a regression model:

df=data.frame(income=c(5,3,47,8,6,5),
              won=c(0,0,1,1,1,0),
              age=c(18,18,23,50,19,39),
              home=c(0,0,1,0,0,1))

First, I fit a linear model for income:

md1 = lm(income ~ age + home, data=df)

Second, I fit a logit model for the won variable:

md2 = glm(factor(won) ~ age + home, data=df, family=binomial(link="logit"))

For both models, I wonder how I can generate a table or data frame with the predictor response category, fitted value, and the model predicted value.

So for the linear model, something like:

age  fitted_income  predicted_income
18    3              5 
23    3              3
50    4              2
19    5              5
39    6              4

home   fitted_income    predicted_income
0       5               6       
1       3               9

Or perhaps it should be for each data point. So for x_i data point, the fitted and predicted values are:

id   age  fitted_income  predicted_income
1     18    3              5 
2     23    3              3
3     50    4              2
4     19    5              5
5     39    6              4

From a statistical standpoint, is such an undertaking useful? Why or why not?
How can this be done in R? (I looked at names(md1) to see what I can pull from the model, but haven't gotten further than that.)

Thanks!

Best Answer

You have to be a bit careful with model objects in R. For example, although the fitted values and the predictions for the training data should be the same for the glm() model, the two extractor functions return different values when called with their defaults:

R> fitted(md2)
        1         2         3         4         5         6 
0.4208590 0.4208590 0.4193888 0.7274819 0.4308001 0.5806112 
R> predict(md2)
         1          2          3          4          5          6 
-0.3192480 -0.3192480 -0.3252830  0.9818840 -0.2785876  0.3252830

That is because the default for predict.glm() is to return predictions on the scale of the linear predictor. To get the fitted values we want to apply the inverse of the link function to those values. fitted() does that for us, and we can get the correct values using predict() as well:

R> predict(md2, type = "response")
        1         2         3         4         5         6 
0.4208590 0.4208590 0.4193888 0.7274819 0.4308001 0.5806112
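The relationship between the two scales can be checked directly. The following is a minimal sketch that refits the question's model and applies the inverse logit by hand; the object names eta, p_manual and p_family are made up for illustration.

```r
# Data and model from the question
df <- data.frame(income = c(5, 3, 47, 8, 6, 5),
                 won    = c(0, 0, 1, 1, 1, 0),
                 age    = c(18, 18, 23, 50, 19, 39),
                 home   = c(0, 0, 1, 0, 0, 1))
md2 <- glm(factor(won) ~ age + home, data = df,
           family = binomial(link = "logit"))

# Default predict(): values on the scale of the linear predictor (log-odds)
eta <- predict(md2)

# Apply the inverse link by hand: plogis() is the inverse logit
p_manual <- plogis(eta)

# Or take the inverse link from the fitted family object, which also
# works for links other than the logit
p_family <- family(md2)$linkinv(eta)

# Both should agree with fitted() and predict(type = "response")
all.equal(unname(p_manual), unname(fitted(md2)))
all.equal(unname(p_family), unname(predict(md2, type = "response")))
```

Using family(md2)$linkinv avoids hard-coding the link, which may be preferable if the model is later refit with, say, a probit link.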

Likewise with residuals() (or resid()); the values stored in md2$residuals are the working residuals and are unlikely to be what you want. The resid() method allows you to specify the type of residual you want and has a useful default.
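To see how the residual types differ, here is a small sketch that refits the question's model and puts several types side by side. The data frame name res_types is an illustrative choice.

```r
# Data and model from the question
df <- data.frame(income = c(5, 3, 47, 8, 6, 5),
                 won    = c(0, 0, 1, 1, 1, 0),
                 age    = c(18, 18, 23, 50, 19, 39),
                 home   = c(0, 0, 1, 0, 0, 1))
md2 <- glm(factor(won) ~ age + home, data = df,
           family = binomial(link = "logit"))

# Several residual types for the same fit
res_types <- data.frame(
  working  = resid(md2, type = "working"),   # what md2$residuals stores
  deviance = resid(md2, type = "deviance"),  # the default for glm objects
  pearson  = resid(md2, type = "pearson"),
  response = resid(md2, type = "response")   # observed minus fitted probability
)
res_types

# The stored component matches the working residuals, not the default
all.equal(unname(md2$residuals), res_types$working)
all.equal(unname(resid(md2)), res_types$deviance)
```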

For the glm() model, something like this will suffice:

R> data.frame(Age = df$age, Won = df$won, Fitted = fitted(md2))
  Age Won    Fitted
1  18   0 0.4208590
2  18   0 0.4208590
3  23   1 0.4193888
4  50   1 0.7274819
5  19   1 0.4308001
6  39   0 0.5806112

Something similar can be done for the lm() model:

R> data.frame(Age = df$age, Income = df$income, Fitted = fitted(md1))
  Age Income    Fitted
1  18      5  7.893273
2  18      3  7.893273
3  23     47 28.320749
4  50      8 -1.389725
5  19      6  7.603179
6  39      5 23.679251
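The tables above cover the observations used to fit the models. If "predicted" is meant as predictions for observations that were not used in fitting, predict() accepts a newdata argument. The following is a sketch under that reading; the data frame new_obs and its values are invented for illustration.

```r
# Data and linear model from the question
df <- data.frame(income = c(5, 3, 47, 8, 6, 5),
                 won    = c(0, 0, 1, 1, 1, 0),
                 age    = c(18, 18, 23, 50, 19, 39),
                 home   = c(0, 0, 1, 0, 0, 1))
md1 <- lm(income ~ age + home, data = df)

# For lm(), predict() without newdata reproduces the fitted values
all.equal(predict(md1), fitted(md1))

# Hypothetical new observations (values chosen only for illustration)
new_obs <- data.frame(age = c(30, 45), home = c(0, 1))

# Point predictions with 95% prediction intervals (columns fit, lwr, upr)
pred <- predict(md1, newdata = new_obs, interval = "prediction")
cbind(new_obs, pred)
```

With only six observations the intervals will be very wide, so the output is more useful as a template than as a result.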

Related Solutions

Solved – How to aggregate and merge these two data.frames

I am not an R expert, and since I don’t have your data, I cannot experiment with it. But here is what I would try:

1) I believe your x in the aggregate function should be c$Branch, not c.

2) I believe the argument by should only be used if the columns have the same name in both data frames. Try using by.x="Trade", by.y="predict.Financial.Object". See the documentation for merge().

PS: Convention suggests writing cc <- merge(…) instead of cc = merge(…).
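Since the original data are not available, the two suggestions can only be sketched on made-up data. In the example below, the data frames trades and preds, the Amount and Score columns, and all values are invented; only the column names Trade, Branch and predict.Financial.Object come from the question.

```r
# Hypothetical stand-ins for the two data frames in the question
trades <- data.frame(Trade  = c("T1", "T2", "T3"),
                     Branch = c("North", "South", "North"),
                     Amount = c(10, 20, 30))
preds  <- data.frame(predict.Financial.Object = c("T1", "T2", "T4"),
                     Score = c(0.2, 0.7, 0.5))

# 1) aggregate() takes the values to summarise as x and the grouping as by
by_branch <- aggregate(x = trades$Amount,
                       by = list(Branch = trades$Branch),
                       FUN = sum)

# 2) The key columns have different names, so use by.x / by.y instead of by
merged <- merge(trades, preds,
                by.x = "Trade", by.y = "predict.Financial.Object")
merged
```

By default merge() keeps only keys present in both data frames, so T3 and T4 are dropped here; all = TRUE would keep them with NA in the missing columns.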

Solved – Why does GBM predict different values for the same data

It's the factors, as always. It seems the model is not using the actual value of the factor, but rather something like its position in the factor levels.

I was able to reproduce your error with the OrchardSprays data:

library(gbm)
data(OrchardSprays)

model <- gbm(decrease ~ rowpos+colpos+treatment, data=OrchardSprays, n.trees=1000, distribution="gaussian", interaction.depth=3, bag.fraction=0.5, train.fraction=1.0, shrinkage=0.1, keep.data=TRUE)

firstrow <- OrchardSprays[1,]
str(firstrow)

manualFirstrow <- data.frame(decrease=57,rowpos=1,colpos=1,treatment="D")
str(manualFirstrow)

predict(model,newdata=firstrow,n.trees=100)
predict(model,newdata=manualFirstrow,n.trees=100)
predict(model,newdata=data.frame(decrease=57,rowpos=1,colpos=1,treatment="A"),n.trees=100)

output:

> predict(model,newdata=firstrow,n.trees=100)
[1] 50.31276
> predict(model,newdata=manualFirstrow,n.trees=100)
[1] 20.67818
> predict(model,newdata=data.frame(decrease=57,rowpos=1,colpos=1,treatment="A"),n.trees=100)
[1] 20.67818

The manual row with treatment "D" gets the same prediction as the one with "A", since A has position 1 in the levels of OrchardSprays$treatment. Adding the levels to the data declaration does the trick:

manualFirstrow <- data.frame(decrease=57,rowpos=1,colpos=1,treatment=factor("D",levels(OrchardSprays$treatment)))
str(manualFirstrow)

predict(model,newdata=firstrow,n.trees=100)
predict(model,newdata=manualFirstrow,n.trees=100)

output:

> predict(model,newdata=firstrow,n.trees=100)
[1] 50.31276
> predict(model,newdata=manualFirstrow,n.trees=100)
[1] 50.31276
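The position-in-levels explanation can be illustrated in base R without fitting anything. This sketch only shows how the integer codes of a factor depend on its declared levels; that gbm relies on those codes is the conjecture made above, not something the sketch proves. The names lone and full are illustrative. In R 4.0.0 and later, data.frame() no longer converts strings to factors by default, so factor() is called explicitly here.

```r
data(OrchardSprays)

# A factor built from the single value "D" has one level, so its code is 1
lone <- factor("D")
as.integer(lone)

# Declaring the full set of levels gives "D" the same code as in the training data
full <- factor("D", levels = levels(OrchardSprays$treatment))
as.integer(full)

# Reading the lone factor's code against the training levels yields "A"
levels(OrchardSprays$treatment)[as.integer(lone)]

# The first row of the training data carries the code that "full" reproduces
as.integer(OrchardSprays$treatment[1]) == as.integer(full)
```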
