Regression Strategies – Walk through `rms::val.prob` in R

Tags: calibration, probability, r, regression, regression-strategies

The val.prob function in the rms R package is similar to the calibrate function discussed in another question of mine, but a key difference is that val.prob has no notion of a probability model. Indeed, I can run val.prob on completely random data.

library(rms)
set.seed(2022)
N <- 100
p <- rbeta(N, 1, 1)               # "true" probabilities, drawn uniformly
y <- rbinom(N, 1, p)              # binary outcomes generated from p
rms::val.prob(rbeta(N, 1, 1), y)  # validate a fresh random draw against y

Unsurprisingly, the results show the random numbers between $0$ and $1$ to be unrelated to the binary y.

[Plot: val.prob calibration output for the random predictions]
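As a quick numerical check, val.prob also returns its summary indexes (Dxy, Brier score, calibration intercept and slope, and so on) as a named vector, so the lack of relationship can be confirmed without the plot; continuing from the snippet above:

stats <- rms::val.prob(rbeta(N, 1, 1), y)  # capture the returned statistics
round(stats, 3)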

I have some idea of how val.prob works.

  1. Draw a bootstrap sample of the p-y pairs.

  2. Fit a model to the p-y pairs.

  3. Repeat, repeat, repeat…

  4. Use some kind of average to say what the true probability of $Y=1$ is for each given probability p.

In R code:

set.seed(2022)
N <- 500
B <- 1000
pr <- sort(rbeta(N, 1, 1))   # asserted probabilities
y <- rbinom(N, 1, pr)        # outcomes truly generated from pr
m <- matrix(NA, B, N)        # one fitted curve per bootstrap replicate

for (i in 1:B) {
  idx <- sample(seq_len(N), N, replace = TRUE)  # bootstrap the p-y pairs
  pb <- pr[idx]
  yb <- y[idx]

  # Logistic regression of outcome on the asserted probability itself
  Lb <- glm(yb ~ pb, family = binomial)
  m[i, ] <- predict(Lb, data.frame(pb = pr), type = "response")

  if (i %% 50 == 0 || i < 6 || B - i < 6) {
    print(i / B * 100)  # crude progress readout
  }
}

pL <- apply(m, 2, mean)  # pointwise average over the bootstrap fits

plot(pr, pL, xlim = c(0, 1), ylim = c(0, 1),
     xlab = "Asserted Probability", ylab = "True Probability")
abline(0, 1)  # ideal calibration

[Plot: calibration curve from the code above, bowed away from the identity line]

However, I simulated my data! I know that pr generated y, so the calibration should be nearly perfect, not curved the way it is. Even with a large sample size of $50000$, that curve persists in my own implementation. When I use rms::val.prob, even for a sample size of $500$, I have no such issue.

[Plot: rms::val.prob output for the simulated data, closely tracking the identity line]

What step(s) am I missing in my implementation?

Best Answer

So I dug into the source code. There is no bootstrapping in rms::val.prob; that would occur in calibrate or validate instead (a short calibrate sketch appears at the end of this answer). The two lines you see (logistic calibration and non-parametric) are obtained in the following way:

For logistic calibration, we fit a logistic regression using the logit of the probability as the predictor. (Your implementation regressed on the probability itself; that scale mismatch is the step you are missing and is what bows your curve.) Here is some code to do just that:

library(rms)

set.seed(0)
p <- runif(1000)
logit_p <- qlogis(p)     # logit of the asserted probabilities
y <- rbinom(1000, 1, p)

mod <- lrm(y ~ logit_p)  # logistic calibration model
pp <- seq(0.01, 0.99, 0.01)
pred <- plogis(predict(mod, newdata = list(logit_p = qlogis(pp))))

For the non-parametric curve, Frank Harrell uses a lowess smoother:

sm <- lowess(p, y, iter = 0)  # iter = 0: plain lowess, no robustness iterations

If we plot the two curves against the output of rms::val.prob, we see that the lines lie on top of one another.


val.prob(p, y)
lines(sm, col = 'red')         # non-parametric (lowess) curve
lines(pp, pred, col = 'blue')  # logistic calibration curve

[Plot: val.prob output with the lowess (red) and logistic calibration (blue) curves overlaid]
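To tie this back to the question: here is a minimal sketch of the question's own simulation with the one missing step applied, namely regressing on the logit of the probability rather than the probability itself. This fix is an illustration, not code taken from val.prob.

set.seed(2022)
N <- 500
pr <- sort(rbeta(N, 1, 1))
y <- rbinom(N, 1, pr)

fit <- glm(y ~ qlogis(pr), family = binomial)  # predictor on the logit scale
pL <- predict(fit, data.frame(pr = pr), type = "response")

plot(pr, pL, xlim = c(0, 1), ylim = c(0, 1),
     xlab = "Asserted Probability", ylab = "Estimated True Probability")
abline(0, 1)  # the fitted curve should now track the identity line

With the predictor on the logit scale, the bow in the calibration curve should disappear, matching what val.prob reports.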

The source code is available by running the following code; just replace <file name here> with a file path. (Simply printing rms::val.prob at the console shows the same source.)

sink("<file name here>")  # redirect console output to the file
rms::val.prob             # printing the function displays its source
sink()                    # restore normal console output
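For completeness, here is a minimal sketch of where bootstrapping does happen in rms; the model, data, and B = 200 below are illustrative, not anything val.prob does internally:

library(rms)

set.seed(2022)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(x))

f <- lrm(y ~ x, x = TRUE, y = TRUE)  # calibrate() needs x and y stored in the fit
cal <- calibrate(f, B = 200)         # B bootstrap repetitions
plot(cal)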