Solved – Are logistic regression coefficient estimates biased when the predictor has large variance

logistic, normal distribution, variance

I'm simulating data from a logistic regression model:

$$ \log\left(\frac{p}{1-p}\right) = 0 + X $$

where $X \sim N(0,\sigma^2)$. After I simulate the data, I fit a logistic regression model to the data and compare the fitted regression coefficients to the actual regression coefficients.

I've noticed that as I increase $\sigma$ (i.e. the standard deviation of the original $X$ data), the fitted regression coefficient for $X$ (i.e. $\beta_1$) is consistently greater than 1. However, the standard error of the estimate also increases, so 1 is still contained in the confidence interval for $\beta_1$.

I was wondering why when you increase the variance, the fitted $\beta_1$'s tend to be greater than the actual $\beta_1$ (i.e. $\beta_1 = 1$), not less than? Is there a statistical explanation for this?

Thanks!

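beta0 = 0
beta1 = 1
number_samples = 10000   # size of the simulated population
n_sub = 1000             # size of the subsample that is actually fitted

# Simulate a population, draw a subsample with a fixed proportion of
# positives (pos_prop), and fit a logistic regression to that subsample.
genLogit = function(pos_prop, sd){
    xtest = rnorm(number_samples, 0, sd)
    linpred = beta0 + (xtest * beta1)
    # plogis() equals exp(linpred)/(1+exp(linpred)) but does not return NaN
    # when exp(linpred) overflows, which happens for large sd.
    prob = plogis(linpred)

    runis = runif(number_samples, 0, 1)
    ytest = ifelse(runis < prob, 1, 0)

    # Retrospective sampling: fix the number of positives and negatives.
    n_pos = floor(pos_prop * n_sub)
    n_neg = floor((1 - pos_prop) * n_sub)
    pos = sample(xtest[ytest == 1], n_pos)
    neg = sample(xtest[ytest == 0], n_neg)

    generated_data = data.frame(X = c(pos, neg),
                                Y = c(rep(1, n_pos), rep(0, n_neg)))

    fit = glm(Y ~ X, data = generated_data, family = binomial(link = "logit"))

    return(fit)
}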

If you run genLogit(.5,1000), this generates balanced (50/50) data with X normally distributed with mean 0 and standard deviation 1000. Running it multiple times, I get a beta0 estimate much greater than 0.

Best Answer

Edit: After running the original poster's code, I noticed that the algorithm usually doesn't converge for large $\sigma$, e.g. $\sigma=1000$. This probably happens because, when $X>0$, it is generally a very large number, so $P(Y=1)$ is essentially 1. Similarly, when $X<0$, $P(Y=0)$ is essentially 1. The regression function is then effectively a step function at 0, and there is very little curvature in the likelihood: the model is being asked to estimate a regression function that is $-\infty$ when $X<0$ and $+\infty$ when $X>0$. That makes it clear why the optimization fails; the best the algorithm can do is make $\beta_1$ as large as possible. You shouldn't expect anything from the $\beta$ estimates on these failed runs, since they are not MLEs.
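A minimal sketch of how one might check for this, assuming a plain random sample of size 1000 rather than the poster's subsampling scheme (the names `separated` and `msgs` are illustrative only). It records whether the simulated data are perfectly separated at 0, collects the warnings that `glm()` raises, and reports the convergence flag.

```r
set.seed(1)
n <- 1000
x <- rnorm(n, mean = 0, sd = 1000)   # very large predictor spread
y <- rbinom(n, size = 1, prob = plogis(0 + 1 * x))   # beta0 = 0, beta1 = 1

# Is y equal to 1 exactly when x > 0? (Likely, but not guaranteed, at this sd.)
separated <- all(y[x > 0] == 1) && all(y[x < 0] == 0)

# Collect the warnings glm() raises instead of printing them.
msgs <- character(0)
fit <- withCallingHandlers(
  glm(y ~ x, family = binomial),
  warning = function(w) {
    msgs <<- c(msgs, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)

print(separated)
print(fit$converged)   # FALSE means the reported coefficients are not MLEs
print(msgs)
print(coef(fit))
```

Checking `fit$converged` and the collected warnings before interpreting `coef(fit)` is a cheap way to discard the failed runs described above.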

Original post: This isn't a random sample. It looks like you're doing a retrospective sample so that half of the responses are '1's. Prentice and Pyke (1979) show that the odds ratios are still estimated correctly in case-control studies, which, in principle, have the same sampling scheme you've described.

But, the intercepts are off - you're over/undersampling for 'cases' when you force a 50/50 split, and therefore the fitted probability estimates are biased (as reflected by a biased estimate of the intercept). To get consistent estimates of the intercept (and therefore the fitted probabilities), you have to include an offset for the log of the sampling probabilities for each outcome.
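The offset correction can be sketched as follows. This is an illustration under stated assumptions, not the poster's setup: the population values `beta0 = -2` and `beta1 = 1`, the population size `N`, and the variable names are all chosen here for the example. The offset is the log of the ratio of the sampling probability for cases to that for controls.

```r
set.seed(42)
N <- 100000
beta0 <- -2; beta1 <- 1   # assumed population values for this sketch
x <- rnorm(N)
y <- rbinom(N, 1, plogis(beta0 + beta1 * x))

# Retrospective (case-control) sample: 500 cases and 500 controls.
n_case <- 500; n_ctrl <- 500
idx <- c(sample(which(y == 1), n_case), sample(which(y == 0), n_ctrl))
d <- data.frame(X = x[idx], Y = y[idx])

# Sampling probability for each outcome, and the log of their ratio.
p_case <- n_case / sum(y == 1)
p_ctrl <- n_ctrl / sum(y == 0)
d$off <- log(p_case / p_ctrl)

fit_naive  <- glm(Y ~ X, data = d, family = binomial)
fit_offset <- glm(Y ~ X + offset(off), data = d, family = binomial)

# Slopes agree; only the intercept is shifted by the offset.
print(rbind(naive = coef(fit_naive), offset = coef(fit_offset)))
```

Because the offset is the same constant for every row, it leaves the slope untouched and simply moves the intercept by `log(p_case / p_ctrl)`, which is why the odds ratio is fine without it while the fitted probabilities are not.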

Related Solutions

Logistic Regression – Performance of Logistic Regression with High Number of Predictors

I think we should give the word to Venables and Ripley, page 198 in MASS:

There is one fairly common circumstance in which both convergence problems and the Hauck-Donner phenomenon can occur. This is when the fitted probabilities are extremely close to zero or one. Consider a medical diagnosis problem with thousands of cases and around fifty binary explanatory variables (which may arise from coding fewer categorical factors); one of these indicators is rarely true but always indicates that the disease is present. Then the fitted probabilities of cases with that indicator should be one, which can only be achieved by taking $\hat\beta_i = \infty$. The result from glm will be warnings and an estimated coefficient of around +/- 10.

Besides potential numerical difficulties, there is no formal problem with probabilities being estimated numerically to 0 or 1. However, the $t$-test for the hypothesis $\beta_i = 0$, which is based on a quadratic approximation, can become a poor approximation of the likelihood ratio test, and the $t$-test may appear insignificant even though the hypothesis is definitely wrong. As I understand it, this is what the warning is about.

With many predictors, a situation like the one Venables and Ripley describe may easily occur: one predictor is mostly uninformative, but in certain cases it is a strong predictor of a case.
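A small sketch of that situation, with made-up data (a rare indicator `x` that is true for 20 of 1000 cases and always coincides with disease; the 0.3 baseline rate is an arbitrary assumption). It compares the Wald test reported by `summary()` with a likelihood ratio test computed from the deviances.

```r
set.seed(7)
n <- 1000
x <- c(rep(1, 20), rep(0, n - 20))           # rare indicator
y <- c(rep(1, 20), rbinom(n - 20, 1, 0.3))   # indicator present => always diseased

# glm() warns about fitted probabilities of 0 or 1 here; silence it for the demo.
fit0 <- suppressWarnings(glm(y ~ 1, family = binomial))
fit1 <- suppressWarnings(glm(y ~ x, family = binomial))

# Wald test for the coefficient of x (quadratic approximation).
wald_p <- summary(fit1)$coefficients["x", "Pr(>|z|)"]

# Likelihood ratio test: difference in deviance on 1 degree of freedom.
lrt_stat <- deviance(fit0) - deviance(fit1)
lrt_p <- pchisq(lrt_stat, df = 1, lower.tail = FALSE)

print(coef(fit1))
print(c(wald_p = wald_p, lrt_p = lrt_p))
```

The coefficient of `x` is large with an enormous standard error, so the Wald test looks insignificant while the likelihood ratio test rejects clearly.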

Solved – the impact of low predictor variance on logistic regression coefficient estimates

Lower variance in the predictor leads to larger standard errors - when the predictors are orthogonal, they are exactly inversely proportional in a least squares model, as can be seen from the well known formula:

$$ {\rm var}(\hat\beta_{j}) = \sigma^2[(X'X)^{-1}]_{jj} $$

where $\sigma^2$ is the error variance and $X$ is the design matrix. Similarly, in GLMs such as a logistic model, the standard errors are generally inversely related to the predictor variance. In the extreme case where the predictor has no variance, the effect is not estimable and you will get an error when you attempt to fit the model.
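To make the inverse relationship concrete, here is a short worked case, assuming a least squares model with an intercept and a single predictor (the symbols $s_x^2$ and $W$ are introduced here for illustration):

$$ {\rm var}(\hat\beta_{1}) = \frac{\sigma^2}{\sum_{i=1}^{n} (x_i - \bar x)^2} = \frac{\sigma^2}{(n-1)\, s_x^2} $$

where $s_x^2$ is the sample variance of the predictor, so doubling the predictor variance halves the variance of $\hat\beta_1$. For a logistic model the analogous large-sample expression is

$$ {\rm var}(\hat\beta) \approx (X'WX)^{-1}, \qquad W = {\rm diag}\{\hat p_i (1 - \hat p_i)\} $$

so the relationship is no longer an exact proportionality, because the weights $\hat p_i (1 - \hat p_i)$ themselves depend on how spread out $X$ is.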

As an example, consider logistic regression with a single predictor $X_{i} \sim N(0,\sigma^{2})$:

$$ \log \left( \frac{ P(Y_{i} = 1) }{ P(Y_{i} = 0) } \right) = \beta_{0} + \beta_{1} X_{i} $$

In the code below I simulate from the model under increasing values of $\sigma^2$ and show that the standard error decreases. In all simulations $\beta_{0} = 0$, $\beta_{1} = 1$, $n = 1000$. $\sigma^{2}$ is incremented from .1 to 2 on a grid of 1000 points. The empirically observed standard errors from a single set of simulations are plotted below. The apparent "bumpiness" in the plot is Monte Carlo error; increase the sample size and it will go away.

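# Grid of predictor variances and storage for the standard errors
s = seq(.1, 2, length=1000)
V = rep(0,1000)
for(i in 1:1000)
{
      # s[i] is a variance, so take the square root to get the sd
      x = rnorm(1000, mean=0, sd=sqrt(s[i]))
      # latent-variable form of the logistic model with beta0 = 0, beta1 = 1
      y = (x + rlogis(1000)) > 0
      g = glm(y ~ x, family="binomial")
      # standard error of the slope estimate
      V[i] = summary(g)$coefficients[2,2]
}
plot(s,V,pch=16,xlab="Variance of the predictor",ylab="Standard error of regression coefficient", cex.lab=1.5, cex.axis=1.5)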
