Solved – the impact of low predictor variance on logistic regression coefficient estimates

Tags: logistic-regression, coefficients, standard-error

Let's say I am using a logistic model to predict whether it rains (yes or no) based on the day's high temperature, and I have collected data for the past 100 days. It rained on 30 of those 100 days. Furthermore, on 70 of the 100 days the high temperature was exactly 50 degrees, so there were only 30 days where the temperature was something other than 50 degrees.

So in terms of the response variable (rain – yes or no), there is enough variation to develop the model. However, in regard to the predictor (temperature), with so many days stuck at a single value, how does a recurring predictor value (50 degrees) affect the coefficient estimates of a logistic model?
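For concreteness, here is a minimal sketch of that setup in R. The 30 non-50-degree temperatures are hypothetical values drawn at random, and rain is simulated independently of temperature, purely to illustrate the data layout:

set.seed(42)
temp <- c(rep(50, 70), round(runif(30, min = 30, max = 80)))  # 70 of 100 days at exactly 50 degrees
rain <- rbinom(100, 1, 0.3)                                   # rains on roughly 30 of 100 days
fit  <- glm(rain ~ temp, family = binomial)
summary(fit)$coef                                             # slope is estimable; note the size of its SE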

Best Answer

Lower variance in the predictor leads to larger standard errors. When the predictors are orthogonal, the sampling variance of a least-squares coefficient is exactly inversely proportional to the sum of squares (and hence the sample variance) of the corresponding predictor, as can be seen from the well-known formula:

$$ {\rm var}(\hat\beta_{j}) = \sigma^2[(X'X)^{-1}]_{jj} $$

where $\sigma^2$ is the error variance, $X$ is the design matrix, and $[(X'X)^{-1}]_{jj}$ is the $j$th diagonal element of $(X'X)^{-1}$. The standard errors in a GLM such as a logistic model are likewise inversely related to the spread of the predictor. In the extreme case where the predictor has no variance at all, the effect is not estimable: the design matrix is rank deficient, and the fitting routine will either fail or report the coefficient as NA.
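Both claims are easy to verify directly. Here is a minimal sketch (simulated data, not from the question) showing that shrinking a predictor's standard deviation by a factor of 10 inflates the least-squares standard error by exactly a factor of 10, and that a zero-variance predictor comes back as NA in R:

set.seed(1)
n <- 1000
x <- rnorm(n)                        # predictor with sd = 1
y <- x + rnorm(n)                    # linear model with true slope 1

fit_wide   <- lm(y ~ x)              # original predictor
fit_narrow <- lm(y ~ I(x / 10))      # same data, predictor sd shrunk by 10
coef(summary(fit_wide))[2, 2]        # SE of the slope
coef(summary(fit_narrow))[2, 2]      # exactly 10 times larger

# Zero-variance predictor: the slope is not estimable and is reported as NA
z <- rep(50, n)
w <- rbinom(n, 1, 0.3)
coef(glm(w ~ z, family = binomial))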

As an example, consider logistic regression with a single predictor $X_{i} \sim N(0,\sigma^{2})$:

$$ \log \left( \frac{ P(Y_{i} = 1) }{ P(Y_{i} = 0) } \right) = \beta_{0} + \beta_{1} X_{i} $$

In the code below I simulate from this model under increasing values of $\sigma^{2}$ and show that the standard error decreases as the predictor variance grows. In all simulations $\beta_{0} = 0$, $\beta_{1} = 1$, and $n = 1000$; $\sigma^{2}$ is incremented from 0.1 to 2 over a grid of 1000 points. The empirically observed standard errors from a single set of simulations are plotted below. The apparent bumpiness in the plot is Monte Carlo error; increase the sample size and it will go away.

s <- seq(0.1, 2, length = 1000)      # grid of predictor variances
V <- rep(0, 1000)                    # storage for the estimated standard errors
for (i in 1:1000) {
    x <- rnorm(1000, mean = 0, sd = sqrt(s[i]))  # predictor with variance s[i]
    y <- (x + rlogis(1000)) > 0                  # latent-variable simulation: beta0 = 0, beta1 = 1
    g <- glm(y ~ x, family = binomial)
    V[i] <- summary(g)$coef[2, 2]                # SE of the slope estimate
}
plot(s, V, pch = 16,
     xlab = "Variance of the predictor",
     ylab = "Standard error of regression coefficient",
     cex.lab = 1.5, cex.axis = 1.5)

(Plot: standard error of the regression coefficient decreases as the variance of the predictor increases.)
