Solved – Logistic regression, “sum” confidence interval

Tags: confidence-interval, logistic, regression

I have an existing fitted logit regression model.

Model:

$\hat{p}(x)=\frac{1}{1+e^{-\hat{\beta}x}}$

with parameter estimate $\hat{\beta}$ and observation $x$.

Given a new set of data points $x_1,\ldots,x_n$ with outcomes $o_1,\ldots,o_n$, I want to construct a confidence interval for $E=\sum_i \hat{p}(x_i)$. The goal is to check whether $O=\sum_i o_i$ is (or is not) significantly different from $E$.

For a single observation, you can construct a confidence interval for the probability quite easily. Given that $\hat{\beta}$ is approximately normally distributed and the confidence interval for the estimate is $(\hat{\beta}_{0.025}, \hat{\beta}_{0.975})$, then since the logistic function is monotone, the confidence interval for the predicted probability is $\left(\frac{1}{1 + e^{-\hat{\beta}_{0.025}x}}, \frac{1}{1 + e^{-\hat{\beta}_{0.975}x}}\right)$.
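In R, this endpoint-transformation idea can be sketched with `predict(..., type = "link", se.fit = TRUE)`: build a Wald interval on the log-odds scale, then map both endpoints through the inverse logit. The data and model below are made up for illustration; only the interval construction is the point.

```r
# Sketch: 95% CI for a single predicted probability, obtained by
# transforming the endpoints of a Wald interval on the link scale.
set.seed(1)
x <- rnorm(100)
y <- rbinom(100, 1, plogis(0.5 * x))
model <- glm(y ~ x, family = "binomial")

newdata <- data.frame(x = 1.5)
pred <- predict(model, newdata, type = "link", se.fit = TRUE)

# Wald interval on the log-odds scale, then inverse-logit both ends
lower <- plogis(pred$fit - 1.96 * pred$se.fit)
upper <- plogis(pred$fit + 1.96 * pred$se.fit)
c(lower, upper)
```

Because the inverse logit is monotone increasing, transforming the endpoints preserves the interval's coverage for the probability.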

Is there a good way to get the confidence interval for the "sum" of the predicted probabilities?

Best Answer

One possible solution is a bootstrapping approach: given the new set of data points, construct a bootstrap estimate and confidence interval for $\sum_i\hat p(x_i)$.

set.seed(44)

#Pseudo Data
prob = .2
x = sort(rnorm(100))
y = rbinom(100,1,prob)

#Fit the logistic regression model
model = glm(y~x,family="binomial")

#More data arrives
x.new = rnorm(10000)

#The number of bootstrap samples
B = 10000

sum.p = rep(NA,B)
for(i in 1:B){
    #Create a bootstrap sample
    x.boot = sample(x.new,length(x.new)*.1,replace=TRUE)        

    #Sum of predicted probabilities (fitted coefficients held fixed)
    sum.p[i] = sum(1/(1+exp(-(model$coef[1]+model$coef[2]*x.boot))))
}

#Get the 2.5% and 97.5% quantile from the bootstrap estimator 
lower = quantile(sum.p,prob=.025)
upper = quantile(sum.p,prob=.975)

#Construct a 95% confidence interval
ci = c(lower,upper)

An alternative method would be to take a Bayesian approach and recalculate $\sum_i\hat p(x_i)$ for every sample of $\beta$ at each step of an MCMC-type algorithm. At the end of the run we would have a sample $\sum_i\hat p(x_i)^{(j)}$ for every step $j$ of the MCMC, whose quantiles give a 95% credible interval for $\sum_i\hat p(x_i)$.
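A lighter-weight approximation to this idea, without running a full MCMC, is to draw coefficient vectors from the asymptotic normal distribution of $\hat\beta$ (mean `coef(model)`, covariance `vcov(model)`) and recompute the sum for each draw. This is only a sketch under the usual normal approximation, using simulated data; `MASS::mvrnorm` does the sampling.

```r
# Sketch: approximate interval for sum_i p(x_i) by drawing coefficient
# vectors from the asymptotic normal distribution of beta-hat
# (a normal approximation to the posterior, not a full MCMC run).
library(MASS)
set.seed(44)

x <- sort(rnorm(100))
y <- rbinom(100, 1, 0.2)
model <- glm(y ~ x, family = "binomial")

x.new <- rnorm(10000)

# 10000 draws of (intercept, slope) from N(coef(model), vcov(model))
draws <- mvrnorm(10000, coef(model), vcov(model))

# For each draw, recompute the sum of predicted probabilities
sum.p <- apply(draws, 1, function(b) sum(plogis(b[1] + b[2] * x.new)))

quantile(sum.p, c(0.025, 0.975))
```

Unlike the first bootstrap above, this propagates the uncertainty in $\hat\beta$ itself, so the resulting interval should be closer in spirit to the refitting approach below.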

Using Scortchi's suggestion, here is the revised code:

#Scortchi's suggestion
set.seed(44)
prob = .2

#More data arrives
x.new = rnorm(10000)
y.new = rbinom(10000,1,prob)

#The number of bootstrap samples
B = 10000

sum.p = rep(NA,B)
for(i in 1:B){
    #Create a bootstrap sample
    index = sample(1:length(x.new),length(x.new)*.1,replace=TRUE)
    x.boot = x.new[index]       
    y.boot = y.new[index]

    model = glm(y.boot~x.boot,family="binomial")

    #Calculate the sum of p
    sum.p[i] = sum(1/(1+exp(-(model$coef[1]+model$coef[2]*x.boot))))
}

#Get the 2.5% and 97.5% quantile from the bootstrap estimator 
lower = quantile(sum.p,prob=.025)
upper = quantile(sum.p,prob=.975)

#Construct a 95% confidence interval
ci = c(lower,upper)

Interestingly, the confidence interval using Scortchi's suggestion comes out as

> ci
 2.5% 97.5% 
  174   223 

whereas using my original code we obtain the following:

> ci
    2.5%    97.5% 
242.4727 247.0230

So there is clearly a difference between the two methods: the original code holds the fitted coefficients fixed and resamples only the covariates, while the revised code refits the model on each bootstrap sample and therefore also propagates the uncertainty in $\hat{\beta}$, which widens the interval considerably.
