Regression Strategies – Walk through `rms::val.prob` in R

Tags: calibration, probability, r, regression, regression-strategies

The val.prob function in the rms R package is similar to the calibrate function discussed in another question of mine, but a key difference is that val.prob has no notion of a probability model. Indeed, I can run val.prob on completely random data.

library(rms)
set.seed(2022)
N <- 100
p <- rbeta(N, 1, 1)               # "true" probabilities, drawn uniformly
y <- rbinom(N, 1, p)              # binary outcomes generated from p
rms::val.prob(rbeta(N, 1, 1), y)  # validate a fresh random draw against y

Unsurprisingly, the results show the random numbers between $0$ and $1$ to be unrelated to the binary y.

[Plot: val.prob calibration output for the random predictions]
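As a quick numerical check, val.prob also returns its summary indexes (Dxy, Brier score, calibration intercept and slope, and so on) as a named vector, so the lack of relationship can be confirmed without the plot; continuing from the snippet above:

stats <- rms::val.prob(rbeta(N, 1, 1), y)  # capture the returned statistics
round(stats, 3)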

I have some idea of how val.prob works.

  1. Draw a bootstrap sample of the p-y pairs.

  2. Fit a model to the p-y pairs.

  3. Repeat, repeat, repeat…

  4. Use some kind of average to say what the true probability of $Y=1$ is for each given probability p.

In R code:

set.seed(2022)
N <- 500
B <- 1000
pr <- sort(rbeta(N, 1, 1))   # asserted probabilities
y <- rbinom(N, 1, pr)        # outcomes truly generated from pr
m <- matrix(NA, B, N)        # one fitted curve per bootstrap replicate

for (i in 1:B) {
  idx <- sample(seq_len(N), N, replace = TRUE)  # bootstrap the p-y pairs
  pb <- pr[idx]
  yb <- y[idx]

  # Logistic regression of outcome on the asserted probability itself
  Lb <- glm(yb ~ pb, family = binomial)
  m[i, ] <- predict(Lb, data.frame(pb = pr), type = "response")

  if (i %% 50 == 0 || i < 6 || B - i < 6) {
    print(i / B * 100)  # crude progress readout
  }
}

pL <- apply(m, 2, mean)  # pointwise average over the bootstrap fits

plot(pr, pL, xlim = c(0, 1), ylim = c(0, 1),
     xlab = "Asserted Probability", ylab = "True Probability")
abline(0, 1)  # ideal calibration

[Plot: calibration curve from the code above, bowed away from the identity line]

However, I simulated my data! I know that pr generated y, so the calibration should be nearly perfect, not curved the way it is. Even with a large sample size of $50000$, that curve persists in my own implementation. When I use rms::val.prob, even for a sample size of $500$, I have no such issue.

[Plot: rms::val.prob output for the simulated data, closely tracking the identity line]

What step(s) am I missing in my implementation?

Best Answer

So I dug into the source code. There is no bootstrapping in rms::val.prob; that would occur in calibrate or validate instead (a short calibrate sketch appears at the end of this answer). The two lines you see (logistic calibration and non-parametric) are obtained in the following way:

For logistic calibration, we fit a logistic regression using the logit of the probability as the predictor. (Your implementation regressed on the probability itself; that scale mismatch is the step you are missing and is what bows your curve.) Here is some code to do just that:

library(rms)

set.seed(0)
p <- runif(1000)
logit_p <- qlogis(p)     # logit of the asserted probabilities
y <- rbinom(1000, 1, p)

mod <- lrm(y ~ logit_p)  # logistic calibration model
pp <- seq(0.01, 0.99, 0.01)
pred <- plogis(predict(mod, newdata = list(logit_p = qlogis(pp))))

For the non-parametric curve, Frank Harrell uses a lowess smoother:

sm <- lowess(p, y, iter = 0)  # iter = 0: plain lowess, no robustness iterations

If we plot the two curves against the output of rms::val.prob, we see that the lines lie on top of one another.


val.prob(p, y)
lines(sm, col = 'red')         # non-parametric (lowess) curve
lines(pp, pred, col = 'blue')  # logistic calibration curve

[Plot: val.prob output with the lowess (red) and logistic calibration (blue) curves overlaid]
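To tie this back to the question: here is a minimal sketch of the question's own simulation with the one missing step applied, namely regressing on the logit of the probability rather than the probability itself. This fix is an illustration, not code taken from val.prob.

set.seed(2022)
N <- 500
pr <- sort(rbeta(N, 1, 1))
y <- rbinom(N, 1, pr)

fit <- glm(y ~ qlogis(pr), family = binomial)  # predictor on the logit scale
pL <- predict(fit, data.frame(pr = pr), type = "response")

plot(pr, pL, xlim = c(0, 1), ylim = c(0, 1),
     xlab = "Asserted Probability", ylab = "Estimated True Probability")
abline(0, 1)  # the fitted curve should now track the identity line

With the predictor on the logit scale, the bow in the calibration curve should disappear, matching what val.prob reports.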

The source code is available by running the following code; just replace <file name here> with a file path. (Simply printing rms::val.prob at the console shows the same source.)

sink("<file name here>")  # redirect console output to the file
rms::val.prob             # printing the function displays its source
sink()                    # restore normal console output
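For completeness, here is a minimal sketch of where bootstrapping does happen in rms; the model, data, and B = 200 below are illustrative, not anything val.prob does internally:

library(rms)

set.seed(2022)
x <- rnorm(200)
y <- rbinom(200, 1, plogis(x))

f <- lrm(y ~ x, x = TRUE, y = TRUE)  # calibrate() needs x and y stored in the fit
cal <- calibrate(f, B = 200)         # B bootstrap repetitions
plot(cal)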