The val.prob
function in the rms
R
package has similarities to the calibrate
function discussed in another question of mine, but a key difference in that val.prob
has no notion of a probability model. Indeed, I can run val.prob
on random data.
library(rms)
set.seed(2022)
N <- 100
p <- rbeta(N, 1, 1)
y <- rbinom(N, 1, p)
rms::val.prob(rbeta(N, 1, 1), y)
Unsurprisingly, the results show the random numbers between $0$ and $1$ to be unrelated to the binary y
.
I have some idea of how val.prob
works.
-
Draw a bootstrap sample of the
p-y
pairs. -
Fit a model to the
p-y
pairs. -
Repeat, repeat, repeat…
-
Use some kind of average to say what the true probability of $Y=1$ is for each given probability
p
.
In R
code:
set.seed(2022)
N <- 500
B <- 1000
pr <- sort(rbeta(N, 1, 1))
y <- rbinom(N, 1, pr)
m <- matrix(NA, B, N)
for (i in 1:B){
idx <- sample(seq(1, N, 1), N, replace = T)
pb <- pr[idx]
yb <- y[idx]
Lb <- glm(yb ~ pb, family = binomial)
m[i, ] <- 1/(1 + exp(-predict(Lb, data.frame(pb = pr)) ))
if (i %% 50 == 0 | i < 6 | B - i < 6){
print(i/B*100)
}
}
pL <- apply(m, 2, mean)
plot(pr, pL, xlim = c(0, 1), ylim = c(0, 1), xlab = "Asserted Probability", ylab = "True Probability")
abline(0, 1)
However, I simulated my data! I know that pr
generated y
, so the calibration should be pretty good, not curved like it is. Even with a large sample size of $50000$, that curve remains when I do my own implementation. When I use rms::val.prob
, even for a sample size of $500$, I have no such issue.
What step(s) am I missing in my implementation?
Best Answer
So I dug into the source code. There is no bootstrapping in
rms::val.prob
, that would likely occur incalibrate
orvalidate
. The two lines you see (logistic calibrated and non-parametric) are obtained in the following way:For logistic calibration, we fit a logistic regression using the logit of the probability as the predictor. Here is some code to do just that
For the non-parametric curve, Frank uses a lowess smoother.
If we plot the two curves against the output of
rms::val.prob
, we see the two lines lay on top of one another.The source code is available by running the following code. Just replace
<file name here>
with a file path.