Solved – Are logistic regression coefficient estimates biased when the predictor has large variance

logisticnormal distributionvariance

I'm simulating data from a logistic regression model:

log(p/1-p)= 0 + X

where $X \sim N(0,\sigma^2)$. After I simulate the data, I fit a logistic regression model to the data and compare the fitted regression coefficients to the actual regression coefficients.

I've noticed that as I increase $\sigma$ (i.e. the variance of the original $X$ data) the fitted regression coefficient for $X$ (i.e. $\beta_1$) is consistently greater than 1 (however, the sd of the estimate also increased so 1 is still contained in the confidence interval for beta1)

I was wondering why when you increase the variance, the fitted $\beta_1$'s tend to be greater than the actual $\beta_1$ (i.e. $\beta_1 = 1$), not less than? Is there a statistical explanation for this?


beta0 = 0
beta1 = 1
sigma = 1
number_samples = 10000
genLogit = function(pos_prop,sd){
    generated_data = c()

xtest = rnorm(10000,0,sd)
linpred = beta0 + (xtest * beta1)
prob = exp(linpred)/ (1+exp(linpred))

runis = runif(10000,0,1)
ytest = ifelse(runis<prob,1,0)

pos = sample(xtest[ytest ==1],floor(pos_prop*1000))
neg = sample(xtest[ytest == 0], floor((1-pos_prop)*1000))

generated_data = rbind(cbind(pos,rep(1,floor(pos_prop*1000))),cbind(neg,rep(0,floor((1-pos_prop)*1000))))
colnames(generated_data) = c('X','Y')
generated_data = data.frame(generated_data)

fit = glm(Y~X,data =generated_data, family=binomial(link="logit"))


If you run genLogit(.5,1000) this is generating balanced (50/50) data with X distributed normal(0,1000). Running it multiple times, I get a beta0 estimate much greater than 0.

Best Answer

Edit: After running the original poster's code, I noticed the algorithm usually doesn't converge for large $σ$, e.g. $σ=1000$. This probably happens because, when $X>0$, it's generally a very large number, so $P(Y=1)=1$, essentially. Similarly, when $X<0$, $P(Y=0)=1$ for the same reason. Therefore, there is very little curvature in the likelihood - the regression function is a step function at 0 - it's essentially asking the model to estimate a regression function that is $−∞$ when $X<0$ and $+∞$ when $X>0$, making it clear why the optimization fails - the best the algorithm tries to do is make $\beta$ as large as possible. You shouldn't be expecting anything from $β$ estimates on these failed runs, since they are not MLEs.

Original post: This isn't a random sample - it looks like you're doing a retrospective sample so that half of the responses are '1's. Prentice and Pyke (1979) show that the odds ratios are still estimated correctly in case-control studies, which, in principal, has the same sampling scheme you've described.

But, the intercepts are off - you're over/undersampling for 'cases' when you force a 50/50 split, and therefore the fitted probability estimates are biased (as reflected by a biased estimate of the intercept). To get consistent estimates of the intercept (and therefore the fitted probabilities), you have to include an offset for the log of the sampling probabilities for each outcome.