R Logistic Simulation – How to Simulate Data for Logistic Regression with a Categorical Variable

logisticrsimulation

I was trying to create some test data for logistic regression and I found this post How to simulate artificial data for logistic regression?

It is a nice answer but it creates only continuous variables. What about a categorical variable x3 with 5 levels (A B C D E) associated with y for the same example as in the link?

Best Answer

The model

Let $x_B = 1$ if one has category "B", and $x_B = 0$ otherwise. Define $x_C$, $x_D$, and $x_E$ similary. If $x_B = x_C = x_D = x_E = 0$, then we have category "A" (i.e., "A" is the reference level). Your model can then be written as

$$ \textrm{logit}(\pi) = \beta_0 + \beta_B x_B + \beta_C x_C + \beta_D x_D + \beta_E x_E $$ with $\beta_0$ an intercept.


Data generation in R

(a)

x <- sample(x=c("A","B", "C", "D", "E"), 
              size=n, replace=TRUE, prob=rep(1/5, 5))

The x vector has n components (one for each individual). Each component is either "A", "B", "C", "D", or "E". Each of "A", "B", "C", "D", and "E" is equally likely.

(b)

library(dummies)
dummy(x)

dummy(x) is a matrix with n rows (one for each individual) and 5 columns corresponding to $x_A$, $x_B$, $x_C$, $x_D$, and $x_E$. The linear predictors (one for each individual) can then be written as

linpred <- cbind(1, dummy(x)[, -1]) %*% c(beta0, betaB, betaC, betaD, betaE)

(c)

The probabilities of success follows from the logistic model:

pi <- exp(linpred) / (1 + exp(linpred))

(d)

Now we can generate the binary response variable. The $i$th response comes from a binomial random variable $\textrm{Bin}(n, p)$ with $n = 1$ and $p =$ pi[i]:

y <- rbinom(n=n, size=1, prob=pi)

Some quick simulations to check this is OK

> #------ parameters ------
> n <- 1000 
> beta0 <- 0.07
> betaB <- 0.1
> betaC <- -0.15
> betaD <- -0.03
> betaE <- 0.9
> #------------------------
> 
> #------ initialisation ------
> beta0Hat <- rep(NA, 1000)
> betaBHat <- rep(NA, 1000)
> betaCHat <- rep(NA, 1000)
> betaDHat <- rep(NA, 1000)
> betaEHat <- rep(NA, 1000)
> #----------------------------
> 
> #------ simulations ------
> for(i in 1:1000)
+ {
+   #data generation
+   x <- sample(x=c("A","B", "C", "D", "E"), 
+               size=n, replace=TRUE, prob=rep(1/5, 5))  #(a)
+   linpred <- cbind(1, dummy(x)[, -1]) %*% c(beta0, betaB, betaC, betaD, betaE)  #(b)
+   pi <- exp(linpred) / (1 + exp(linpred))  #(c)
+   y <- rbinom(n=n, size=1, prob=pi)  #(d)
+   data <- data.frame(x=x, y=y)
+   
+   #fit the logistic model
+   mod <- glm(y ~ x, family="binomial", data=data)
+   
+   #save the estimates
+   beta0Hat[i] <- mod$coef[1]
+   betaBHat[i] <- mod$coef[2]
+   betaCHat[i] <- mod$coef[3]
+   betaDHat[i] <- mod$coef[4]
+   betaEHat[i] <- mod$coef[5]
+ }
> #-------------------------
> 
> #------ results ------
> round(c(beta0=mean(beta0Hat), 
+         betaB=mean(betaBHat), 
+         betaC=mean(betaCHat), 
+         betaD=mean(betaDHat), 
+         betaE=mean(betaEHat)), 3)
 beta0  betaB  betaC  betaD  betaE 
 0.066  0.100 -0.152 -0.026  0.908 
> #---------------------
Related Question