Solved – Observed vs predicted values from a logit model

logistic regression

I have a logit model and am trying to understand and compare the predicted and observed values generated by the model. Let's say the data set has 100 values; I generate all the predicted probabilities and then find the actual probabilities from the data set.

If I'm comparing the predicted vs observed values, I can think of two ways to do it. One is to compare them value by value, while the second would be to group by the predicted probabilities.

Method 1:

x_value  pred_val   obs_val
100       0.30       0.34
102       0.33       0.36
104       0.35       0.37
106       0.40       0.40
...
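
For concreteness, here is how I imagine Method 1 would look in R. This is only a sketch under assumptions: the data sit in a data frame dat with a binary outcome y and a single predictor x, and there are repeated observations at each x so that an observed proportion can be formed.

# Sketch of Method 1 (dat, y, x are assumed names)
fit <- glm(y ~ x, data = dat, family = binomial)
dat$pred <- predict(fit, type = "response")   # predicted probability per row
# mean predicted probability and observed proportion of y = 1 at each distinct x
comp <- aggregate(cbind(pred_val = pred, obs_val = y) ~ x, data = dat, FUN = mean)
head(comp)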

I'm also thinking there has to be some way to aggregate these values. So I'm thinking of grouping together all x values where the predicted probability is between 10% and 20%, then finding the average predicted value for that range, followed by the average observed value for that range (a sketch of this binning follows the table below).

Method 2:

Pred_probs       pred_val    obs_val
10 to 20% vals     0.10        0.11
21 to 30% vals     0.12        0.16
31 to 50% vals     0.15        0.30
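
And a sketch of the binning in Method 2, continuing from the hypothetical dat and pred above (the bin edges here are arbitrary):

# Sketch of Method 2: bin rows by predicted probability, then average within bins
dat$bin <- cut(dat$pred, breaks = c(0, 0.1, 0.2, 0.3, 0.5, 1), include.lowest = TRUE)
aggregate(cbind(pred_val = pred, obs_val = y) ~ bin, data = dat, FUN = mean)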

What I'm wondering is:

  1. When there are a large number of data points, what use is having a list of the predicted and observed values for any given value of x?

  2. Does it ever make sense to do something like what I described in 'Method 2'?

Best Answer

It sounds as if you want to check the calibration of a model on the same dataset used to build it. This requires using the bootstrap to re-fit the model, say 300 times, which yields an overfitting-corrected nonparametric calibration curve estimated with a nonparametric smoother. It is not a good idea to bin the predicted probabilities. Assuming you did no variable selection, here's an approach in R with the rms package.

require(rms)
# y, x1, x2, x3 assumed to be in the workspace (or pass a data frame via data=)
f <- lrm(y ~ x1 + x2 + x3, x=TRUE, y=TRUE)  # full pre-specified model; x=TRUE, y=TRUE keep the design matrix and response for resampling
validate(f, B=300)           # bootstrap overfitting-corrected indexes such as Somers' Dxy
cal <- calibrate(f, B=300)   # bootstrap overfitting-corrected calibration curve
plot(cal)
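
In the resulting plot, the x-axis is the predicted probability and the y-axis the estimated actual probability; both the apparent and the bias-corrected (overfitting-corrected) curves are drawn, and the closer the bias-corrected curve lies to the 45-degree line, the better the calibration.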