Logistic Regression – Correct Interpretation of Confidence Interval in Logistic Regression

confidence intervallogisticrregression

In trying to understand logistic regression, I find it easiest to transform the coefficients into predicted probabilities. So, for a particular predictor value (x):

precicted probability = 1 / (1 + exp(-(intercept + slope * x)))

Using the 'predict' function, I'll exemplify as:

# Generate data
set.seed(42)
n <- 90
x <- sort(sample(seq(1, 90, by = 1), n, replace = T))
y <- rbinom(n, c(1, 0), c(seq(0, 1, length.out = n), seq(1, 0, length.out = n)))

############ Apply logistic regression ##########
m1 <- glm(y ~ x, family = 'binomial')

############ Interpreting logistic regression ##########
new <- data.frame(x = seq(1, 90, by = 1))
pred <- predict.glm(m1, newdata = new, type = "response")
plot(x, m1$fitted.values, ylim = c(0, 1))
lines(new$x, pred)

My question, then, is if I can interpret the confidence intervals around the coefficients for logistic regression in terms of predicted probability too. That is, like:

############ Interpreting confidence intervals ##########
confs <- confint(m1)
ll <- 1 / (1 + exp(-(confs[1, 1] + confs[2, 2] * new$x)))
ul <- 1 / (1 + exp(-(confs[1, 2] + confs[2, 1] * new$x)))
lines(new$x, ll, col = "red")
lines(new$x, ul, col = "green")

I believe the two colored lines (are not prediction intervals but) illustrate the limits of the relationship between the predictor and outcome variable we can be 95% confident in based on this data. Is this so?
CIs around logistic regression coefficients?

Best Answer

The problem is that you cannot use the confidence intervals for the coefficients in that way, for various reasons, including that it ignores dependence among the estimates. The fact that the lines cross, indicating a 95% confidence interval on a single value, is a clue to the mistake.

Instead, (i) find the logits and their standard errors (this involves finding the asymptotic variance of a linear combination of the coefficient estimates), (ii) find the 95% intervals for the true logits, and (iii) back-transform to get to the the probability scale, like this:

pred1 <- predict.glm(m1, newdata = new, type = "link", se.fit=TRUE)
logit =  pred1$fit
fit.prob = exp(logit)/(1+exp(logit))
upper.logit = logit + 1.96*pred1$se.fit
lower.logit = logit - 1.96*pred1$se.fit 
upper.prob = exp(upper.logit)/(1+exp(upper.logit))
lower.prob = exp(lower.logit)/(1+exp(lower.logit))

lines(new$x, lower.prob, col = "red")
lines(new$x, upper.prob , col = "green")

Now the picture makes more sense:

enter image description here

Related Question