Your reference says that clogit is a special form of Cox regression, not a GLMM. So you are probably mixing things up.
The conditional logit likelihood (reverse-engineering the LaTeX code from the Stata manual) is built by conditioning on $\sum_{j=1}^{n_i} y_{ij} = k_{1i}$:
$$
{\rm Pr}\Bigl[(y_{i1},\ldots,y_{in_i})\Big|\sum_{j=1}^{n_i} y_{ij} = k_{1i}\Bigr] = \frac{\exp\bigl(\sum_{j=1}^{n_i} y_{ij} x_{ij}'\beta\bigr)}{\sum_{{\bf d}_i\in S_i}\exp\bigl(\sum_{j=1}^{n_i} d_{ij} x_{ij}'\beta\bigr)}
$$
where $S_i$ is the set of all possible combinations of $n_i$ binary outcomes with $k_{1i}$ ones and the remaining zeroes, so the summation index-vector ${\bf d}_i$ has components $d_{ij}$ that are 0/1 with $\sum_{j=1}^{n_i} d_{ij} = k_{1i}$. That's a pretty weird likelihood to me. Denoting the denominator as $f_i(n_i,k_{1i})$, the conditional log-likelihood is
$$
\ln L = \sum_{i=1}^n \biggl[ \sum_{j=1}^{n_i} y_{ij} x_{ij}'\beta - \ln f_i(n_i, k_{1i}) \biggr]
$$
This likelihood can be computed exactly, although the computational time grows steeply, as $p^2 \sum_{i=1}^n n_i \min(k_{1i}, n_i - k_{1i})$, where $p={\rm dim}\, \beta = {\rm dim}\, x_{ij}$. This is the likelihood that should be identical to that of stratified Cox regression, which I won't attempt to demonstrate here.
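For concreteness, here is a minimal R sketch (mine, not Stata's) of how the denominator $f_i(n_i, k_{1i})$ can be computed with the standard recursion $f(j,k) = f(j-1,k) + f(j-1,k-1)\exp(x_{ij}'\beta)$, which is exactly what produces the $n_i \min(k_{1i}, n_i - k_{1i})$ scaling above; the function name and the brute-force check via combn() are my own illustration:

#Denominator f_i(n_i, k) of the conditional logit likelihood for one group,
#via the recursion f(j, k) = f(j-1, k) + f(j-1, k-1) * exp(x_ij' beta)
clogit.denom = function(eta, k) {
  n = length(eta)              #eta = x_ij' beta for j = 1, ..., n_i
  f = c(1, rep(0, k))          #f[m+1] holds f(j, m); start at f(0, 0) = 1
  for (j in 1:n) {
    upper = min(j, k)
    #update from the top down so f(j-1, .) is not overwritten too early
    for (m in upper:1) f[m + 1] = f[m + 1] + f[m] * exp(eta[j])
  }
  f[k + 1]
}
#Brute-force check: enumerate all of S_i with combn()
eta = c(0.5, -1, 2, 0.3); k = 2
brute = sum(apply(combn(length(eta), k), 2, function(ones) exp(sum(eta[ones]))))
all.equal(clogit.denom(eta, k), brute)   #TRUE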
The mixed model likelihood (again, adopting from Stata manuals) is based on integrating out the random effects:
$$
{\rm Pr}(y_{i1}, \ldots, y_{in_i} |x_{i1}, \ldots, x_{in_i})=\int_{-\infty}^{+\infty} \frac{\exp(-\nu_i^2/2\sigma_\nu^2)}{\sigma_\nu \sqrt{2\pi}} \prod_{j=1}^{n_i}F(y_{ij}, x_{ij}'\beta + \nu_i)\,{\rm d}\nu_i
$$
where $F(y,z) = \Bigl\{ 1+\exp\bigl[ (-1)^y z \bigr] \Bigr\}^{-1}$ is a witty way to write down the logistic contribution for the outcome $y=0,1$. This likelihood cannot be computed exactly; in practice, it is approximated numerically using a set of Gaussian quadrature points with abscissas $a_m$ and weights $w_m$ resembling the standard normal density on a grid, producing (in the simplest version)
$$
\ln L \approx \sum_{i=1}^n \ln\biggl[ \sqrt{2}\,\sigma_\nu \sum_{m=1}^M w_m \frac{1}{\sigma_\nu \sqrt{2\pi}} \prod_{j=1}^{n_i}F(y_{ij}, x_{ij}'\beta + \sqrt{2} \sigma_\nu a_m) \biggr]
$$
(The $\exp(-\nu_i^2)$-like terms disappear because the quadrature formula absorbs them, but since Gauss-Hermite quadrature is designed for the physicists' ${\rm erf}()$ function rather than the statisticians' $\Phi()$ function, it works with $\exp(-z^2)$ rather than $\exp(-z^2/2)$; hence the weird $\sqrt{2}$'s in a couple of places.) Computational time for $\ln L$ itself is proportional to $nM$, but since you need to take the second-order derivatives for Newton-Raphson, feel free to multiply by $p^2$. Smarter computational schemes, a.k.a. adaptive Gaussian quadrature, try to find better location and scale parameters for the quadrature to make the approximation more accurate.
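As a minimal illustration of the quadrature formula (my own R sketch, not Stata's code), here is the per-cluster likelihood approximated with Gauss-Hermite nodes and weights from gauss.quad() in the statmod package, checked against direct numerical integration over the random effect:

library(statmod)   #for gauss.quad()
#Logistic contribution F(y, z) = 1 / (1 + exp((-1)^y * z))
F.logit = function(y, z) 1 / (1 + exp((-1)^y * z))
#Gauss-Hermite approximation of Pr(y_i1, ..., y_in_i | x_i) for one cluster
cluster.lik = function(y, eta, sigma, M = 12) {
  gh = gauss.quad(M, kind = "hermite")   #abscissas a_m and weights w_m
  contrib = sapply(gh$nodes, function(a)
    prod(F.logit(y, eta + sqrt(2) * sigma * a)))
  sum(gh$weights * contrib) / sqrt(pi)
}
#Check against brute-force integration over nu_i
y = c(1, 0, 1); eta = c(0.2, -0.5, 1); sigma = 1.5
exact = integrate(function(v)
  sapply(v, function(vi) prod(F.logit(y, eta + vi))) * dnorm(v, sd = sigma),
  -Inf, Inf)$value
c(quadrature = cluster.lik(y, eta, sigma), exact = exact)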
In fact, that latter Stata manual describes the differences between the GLMM (a.k.a. random-effects xtlogit, in econometric slang) and conditional logit (a.k.a. fixed-effects xtlogit), and might be worth a more serious reading.
One possible solution is to use a bootstrapping approach: given the new set of data points, construct a bootstrap estimate and confidence interval for $\sum_i\hat p(x_i)$:
set.seed(44)
#Pseudo Data
prob = .2
x = sort(rnorm(100))
y = rbinom(100,1,prob)
#Fit the logistic regression model
model = glm(y~x,family="binomial")
#More data arrives
x.new = rnorm(10000)
#The number of bootstrap samples
B = 10000
sum.p = rep(NA,B)
for(i in 1:B){
#Create a bootstrap sample
x.boot = sample(x.new,length(x.new)*.1,replace=TRUE)
#Calculate the sum of p
sum.p[i] = sum(1/(1+exp(-(model$coef[1]+model$coef[2]*x.boot))))
}
#Get the 2.5% and 97.5% quantile from the bootstrap estimator
lower = quantile(sum.p,prob=.025)
upper = quantile(sum.p,prob=.975)
#Construct a 95% confidence interval
ci = c(lower,upper)
An alternative method would be to take a Bayesian approach and recalculate $\sum_i\hat p(x_i)$ for every sample of $\beta$ at every step of an MCMC-type algorithm. Then at the end of the MCMC we would have a sample $\sum_i\hat p(x_i)^{(j)}$ for every $j$th step, and we could take its quantiles to obtain our 95% confidence interval for $\sum_i\hat p(x_i)$.
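A minimal sketch of that idea, reusing x, y, and x.new from the code above (this is my own code, assuming flat priors on $\beta$ and a simple random-walk Metropolis sampler; the proposal scale sd = 0.1 and burn-in length are arbitrary choices you would tune):

set.seed(44)
#Log-posterior of the logistic regression under flat priors on beta
log.post = function(beta, x, y) {
  eta = beta[1] + beta[2] * x
  sum(y * eta - log(1 + exp(eta)))
}
#Random-walk Metropolis, recomputing sum of p at every step
n.iter = 20000
beta = coef(glm(y ~ x, family = "binomial"))  #start at the MLE
sum.p.mcmc = rep(NA, n.iter)
for (j in 1:n.iter) {
  beta.prop = beta + rnorm(2, sd = 0.1)
  if (log(runif(1)) < log.post(beta.prop, x, y) - log.post(beta, x, y))
    beta = beta.prop
  sum.p.mcmc[j] = sum(1 / (1 + exp(-(beta[1] + beta[2] * x.new))))
}
#95% interval from the posterior sample, discarding burn-in
quantile(sum.p.mcmc[-(1:2000)], prob = c(.025, .975))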
Using Scortchi's suggestion, here is the revised code:
#Scortchi's suggestion
set.seed(44)
prob = .2
#More data arrives
x.new = rnorm(10000)
y.new = rbinom(10000,1,prob)
#The number of bootstrap samples
B = 10000
sum.p = rep(NA,B)
for(i in 1:B){
#Create a bootstrap sample
index = sample(1:length(x.new),length(x.new)*.1,replace=TRUE)
x.boot = x.new[index]
y.boot = y.new[index]
model = glm(y.boot~x.boot,family="binomial")
#Calculate the sum of p
sum.p[i] = sum(1/(1+exp(-(model$coef[1]+model$coef[2]*x.boot))))
}
#Get the 2.5% and 97.5% quantile from the bootstrap estimator
lower = quantile(sum.p,prob=.025)
upper = quantile(sum.p,prob=.975)
#Construct a 95% confidence interval
ci = c(lower,upper)
Now interestingly, the confidence interval from using Scortchi's suggestion results in
> ci
2.5% 97.5%
174 223
whereas using my original code we obtain the following:
> ci
2.5% 97.5%
242.4727 247.0230
So there is clearly a difference between the two methods: the revised code refits the model on every bootstrap sample and so propagates the uncertainty in $\hat\beta$, whereas my original code holds the fitted coefficients fixed and only resamples the new covariates.
Best Answer
The conf_int() method returns confidence intervals for the coefficients. GLM usually uses a normal approximation to the likelihood function, meaning the intervals are symmetric (hence the coefficient estimate is the midpoint of the interval). This, in conjunction with the fact that the midpoint of an interval is the arithmetic mean of the endpoints, should be enough to understand why.