Logistic Regression – Is it Possible to Simulate Logistic Regression Without Randomness?

Tags: linear, logistic, r, regularization, separation

We can simulate linear regression without randomness by setting $y=X\beta$ instead of $y=X\beta+\epsilon$. If we then fit a linear model, the coefficients are (numerically) identical to the "ground truth". Here is an example.

set.seed(0)
n    <- 1e5
p    <- 3
X    <- matrix(rnorm(n*p), ncol=p)
beta <- runif(p)
# y  <- X %*% beta + rnorm(n)*0.5
# remove the randomness
y    <- X %*% beta
dat  <- data.frame(y=y, x=X)
lm.res <- lm(y ~ .-1, data=dat)
norm(as.matrix(lm.res$coefficients - beta))
[1] 2.176037e-14

My question is: can we do a similar simulation with logistic regression? From this question I take it that the randomness can be removed by using a deterministic assignment instead of sampling from a binomial distribution:

y <- ifelse(plogis(X %*% beta)>0.5,1,0) 

instead of

y <- rbinom(n,1,prob=plogis(X %*% beta))

But if we do that, complete separation happens and we cannot get the coefficients. On the other hand, if we add regularization, the fitted coefficients will not be the ones that generated the data.
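To see the problem concretely, here is a minimal sketch (reusing X and beta from the setup above): thresholding makes y a deterministic function of the sign of X %*% beta, so the data are perfectly separated and the maximum likelihood estimate does not exist.

y.det   <- ifelse(plogis(X %*% beta) > 0.5, 1, 0)
dat2    <- data.frame(y=y.det, x=X)
glm.res <- glm(y ~ .-1, data=dat2, family=binomial)
# Warning: fitted probabilities numerically 0 or 1 occurred
coef(glm.res)  # huge magnitudes, not the true beta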

So, what can I do to "remove the randomness in logistic regression" and solve for the exact "ground truth" coefficients, as in the linear regression case?

I feel I have some fundamental misunderstanding of the concept; what am I missing?

Best Answer

Logistic regression does not have an "error" term as classical linear regression does. The exception might be thresholded linear regression with a logistic error term, but that is not a commonly accepted probability model for arriving at logistic regression. The reason is that logistic models have a built-in mean-variance relationship: the variance of the outcome is determined by its mean. The closest analogue to "adding an error term" to a linear regression model is a quasibinomial model, in which the variance is merely proportional to $p(1-p)$.
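A quick numerical check of this mean-variance relationship (a sketch, not part of the original answer): for a Bernoulli outcome the variance is fully determined by the mean, so there is no free error term to remove.

prob <- 0.3
y    <- rbinom(1e6, 1, prob)
var(y)            # empirical variance
prob * (1 - prob) # theoretical variance p(1-p) = 0.21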

A related question is how to obtain regression results that are identical across different designs or replications. This can be done with a "trick" in regression modeling software: generate non-integral $Y$ outcomes directly from the predicted risk, which yields the same logistic regression results regardless of the design of $X$. For instance, take x1 <- seq(-3, 3, 0.1) and x2 <- rnorm(61) as two different designs. As in your case, y1 <- plogis(0.3*x1) and y2 <- plogis(0.3*x2) both produce the same logistic regression fit, with 0.3 as the log odds ratio and 0.0 as the log odds at $x=0$.

> glm(y1 ~ x1, family=binomial)

Call:  glm(formula = y1 ~ x1, family = binomial)

Coefficients:
(Intercept)           x1  
 -2.528e-16    3.000e-01  

This relates to your question because the parameter estimates come out exactly as defined in your probability model, independent of the design of $x$, and without separation (i.e. log odds ratios of $\beta = \pm\infty$).
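Here is a small check of that claim (a sketch using the designs named above): both fits recover the same coefficients, with no separation, despite completely different x values. glm() warns about non-integer successes for fractional outcomes; that warning is harmless here.

x1 <- seq(-3, 3, 0.1)
x2 <- rnorm(61)
y1 <- plogis(0.3*x1)
y2 <- plogis(0.3*x2)
coef(glm(y1 ~ x1, family=binomial))  # (Intercept) ~ 0, x1 = 0.3
coef(glm(y2 ~ x2, family=binomial))  # (Intercept) ~ 0, x2 = 0.3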

Modeling fractional outcomes in a logistic model is an accepted way of analyzing ecological data, where the outcome may indeed be fractional. Not coincidentally, this is also the kind of modeling where quasibinomial models are most useful. Also not coincidentally, I believe the dispersion is proportional to a scale parameter for the logistic error term when doing "latent logistic regression".
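As a closing sketch (assuming the same x1 and y1 as above), fitting the fractional outcomes with family=quasibinomial gives the same point estimates as family=binomial, avoids the non-integer-successes warning, and estimates a dispersion parameter, which is essentially zero here because the outcomes carry no sampling noise.

x1  <- seq(-3, 3, 0.1)
y1  <- plogis(0.3*x1)
fit <- glm(y1 ~ x1, family=quasibinomial)
coef(fit)               # (Intercept) ~ 0, x1 = 0.3
summary(fit)$dispersion # ~ 0: no noise in the fractional outcomes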