Solved – How to generate data for logistic regression with an independent variable that is not centred

logistic, r, simulation

In this post, there is a script to generate data for a logistic regression.

set.seed(666)
x1 = rnorm(1000,0,1)           # some continuous variables 
x2 = rnorm(1000,0,1)
z = 1 + 2*x1 + 3*x2        # linear combination with a bias

pr = 1/(1+exp(-z))         # pass through an inv-logit function
y = rbinom(length(x1),1,pr)      # bernoulli response variable

df = data.frame(y=y,x1=x1,x2=x2)
glm( y~x1+x2,data=df,family="binomial")

From this script, I want to (1) replace $x_2$ with a quadratic term in $x_1$, and (2) compare $x_1$ when it is not centred (a change in mean) and when it is more or less variable (a change in sd). Below is my modified script:

set.seed(666)
x1 = rnorm(1000,0,1)         
z = 1 + 2*x1 + 3*x1^2  
pr = 1/(1+exp(-z)) 
y = rbinom(length(x1),1,pr)
df = data.frame(y=y,
                x1=x1,
                x2=x1^2)
glm( y~x1+x2,
     data=df,
     family="binomial")

Call:  glm(formula = y ~ x1 + x2, family = "binomial", data = df)

Coefficients:
(Intercept)           x1           x2  
      1.002        2.437        3.490  

Degrees of Freedom: 999 Total (i.e. Null);  997 Residual
Null Deviance:      795.3 
Residual Deviance: 615.9    AIC: 621.9
Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred 
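
The warning about fitted probabilities numerically 0 or 1 can be traced back to the simulated probabilities themselves. A quick check (a sketch, assuming pr and y from the script above are still in the workspace):

summary(pr)   # z = 1 + 2*x1 + 3*x1^2 is minimized at x1 = -1/3, where z = 2/3, so pr >= ~0.66
table(y)      # ones outnumber zeros, and the largest values of pr are numerically 1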

The estimates seem a little off, but they are reasonably close to the coefficients that were theoretically put into the model.
However, as soon as I change the mean of $x_1$, the algorithm does not converge:

set.seed(666)
x1 = rnorm(1000,10,1)         
z = 1 + 2*x1 + 3*x1^2  
pr = 1/(1+exp(-z)) 
y = rbinom(length(x1),1,pr)

df = data.frame(y=y,
                x1=x1,
                x2=x1^2)
glm( y~x1+x2,
     data=df,
     family="binomial")
Call:  glm(formula = y ~ x1 + x2, family = "binomial", data = df)

Coefficients:
(Intercept)           x1           x2  
  2.657e+01   -2.351e-08    1.234e-09  

Degrees of Freedom: 999 Total (i.e. Null);  997 Residual
Null Deviance:      0 
Residual Deviance: 5.802e-09    AIC: 6
Warning message:
glm.fit: algorithm did not converge 
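
Checking the simulated data shows what happened: with x1 drawn from N(10, 1), the linear predictor is huge for every observation, so every pr is numerically 1 and y contains a single class (hence the null deviance of 0 above). A quick check, reusing the objects from the script:

range(z)    # several hundred for every observation: deep in the saturated region of the inverse logit
table(y)    # all 1s: with a single-class response, glm cannot converge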

The only way I found to get back to the same estimates was to standardize the raw data (which makes the mean = 0, so I'm back to the first example).

set.seed(666)
x1 = rnorm(1000,10,1)         
x1=scale(x1)
z = 1 + 2*x1 + 3*(x1)^2
pr = 1/(1+exp(-z)) 
y = rbinom(length(x1),1,pr)

df = data.frame(y=y,
                x1=x1,
                x2=x1^2)
glm( y~x1+x2,
     data=df,
     family="binomial")
Call:  glm(formula = y ~ x1 + x2, family = "binomial", data = df)

Coefficients:
(Intercept)           x1           x2  
     0.9872       2.4292       3.5237  

Degrees of Freedom: 999 Total (i.e. Null);  997 Residual
Null Deviance:      787.8 
Residual Deviance: 605.7    AIC: 611.7
Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred 

So how can I get logistic estimates equal to what was theoretically put into the linear combination, but with an $x_1$ that is, e.g., x1 = rnorm(1000,10,2)?

In addition, is it possible to add an error term here: z = 1 + 2*x1 + 3*x2 + rnorm(length(x1),0,1)? This would be the equivalent of saying $y = \alpha_0 + \beta_1 x_1 + \beta_2 x_1^2 + \epsilon$.

Best Answer

The logistic regression model specifies $P(Y_i=1 \mid X_i=x_i)=\pi_i=\frac{\exp(x_i\beta+\epsilon_i)}{1+\exp(x_i\beta+\epsilon_i)}$, where $\pi_i=E(Y_i)$, $\beta$ is a vector of regression coefficients, and $X$ is the matrix of independent variables with rows $x_i$; $\epsilon_i$ is a latent error term.

To generate data from this logistic regression model, we set up $\beta$, $X$, and $\epsilon$.

For example, with your model of interest:

b1 = 1
b2 = 1
x1 = rnorm(1000000, 10, 2)      # non-centred predictor
x2 = x1^2                       # quadratic term
e  = rnorm(1000000, 0, 200)     # latent error term with a large sd
pi = exp(b1*x1 + b2*x2 + e)/(1 + exp(b1*x1 + b2*x2 + e))   # inverse logit
y  = rbinom(1000000, 1, prob = pi)
summary(glm(y ~ x1 + x2, family = "binomial"))
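
With an error sd this large relative to the linear predictor, pi varies enough that both classes appear in y. A quick check (a sketch, reusing the objects just created):

summary(b1*x1 + b2*x2)   # the noiseless linear predictor: large and positive (around 114 on average)
summary(pi)              # with e added, pi spans (0, 1) instead of sticking at 1
table(y)                 # both classes should be present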

Here, the error term helps mitigate the issue resulting from the uniformly large $\pi$. Alternatively, adding an intercept term would also help:

b0 = -100                       # intercept chosen to roughly offset the large x2 values
b1 = 1
b2 = 1
x1 = rnorm(1000, 10, 2)
x2 = x1^2
e  = rnorm(1000, 0, 10)         # a smaller error term suffices once the intercept recentres z
pi = exp(b0 + b1*x1 + b2*x2 + e)/(1 + exp(b0 + b1*x1 + b2*x2 + e))
y  = rbinom(1000, 1, prob = pi)
summary(glm(y ~ x1 + x2, family = "binomial"))

It is always good to check $y$ before model fitting to ensure that there are two classes (preferably of balanced size). If $y$ contains a single class, the model will not converge. If the classes are (extremely) imbalanced, the coefficient estimates won't be accurate (note that because your model inherently violates the assumptions of logistic regression, the parameter estimates are "doomed" from the start...).
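
For example, a minimal check along these lines:

table(y)                # should show both 0s and 1s before fitting
prop.table(table(y))    # class proportions; values near 0.5 are ideal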

Here, the error term or the intercept helps ensure that not all $y$ fall into a single class. The variance of the error term or the value of the intercept can be chosen based on the range of $x_2$ so that $\pi$ has higher variance.
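
One simple heuristic for picking the intercept (my own sketch, not a rule from the model): set it to roughly the negative of the mean of the rest of the linear predictor, so that $\pi$ straddles 0.5:

lp = b1*x1 + b2*x2            # linear predictor without the intercept
b0 = -mean(lp)                # recentre it near zero
pi = 1/(1 + exp(-(b0 + lp)))  # pi now straddles 0.5 instead of sticking at 1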

However, it is worth noting that it would be very difficult, if not impossible, to obtain accurate estimates of the $\beta$s under the specified model $y=\beta_1 X_1+\beta_2 X_1^2+\epsilon$. The estimates do not match the generative values because $X_1$ and $X_1^2$ are highly collinear when $X_1$ is not centred, which violates the assumption that the predictors are independent. A derivation could presumably show how the $\beta$ estimates change under this particular model given the generative coefficients, and a simulation following that derivation could probably recover the parameters more accurately.
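
This collinearity is easy to see numerically, and it is driven by the non-centred mean (a quick sketch; x_raw and x_cen are just illustrative names):

x_raw = rnorm(1000, 10, 2)
cor(x_raw, x_raw^2)      # ~0.99: x1 and x1^2 are nearly collinear when x1 is not centred

x_cen = x_raw - mean(x_raw)
cor(x_cen, x_cen^2)      # ~0: centring removes the linear-quadratic correlation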