Solved – Generate random data for logistic regression with a categorical independent variable

categorical data, logistic, r, regression, simulation

I am trying to generate a data frame of fake data for exploratory purposes. Specifically, I am trying to produce data with a binary dependent variable (say, failure/success), and a categorical independent variable called 'picture' with 5 levels (pict1, pict2, etc.). I am following the answer provided here, which allows me to successfully generate the data. However, I need each level of 'picture' to occur the same number of times (i.e. 11 repetitions of each level = 55 total observations per subject).

Here is a reproducible example of what has worked so far (code from user: ocram):

library(dummies)

#------ parameters ------
n <- 1000 
beta0 <- 0.07
betaB <- 0.1
betaC <- -0.15
betaD <- -0.03
betaE <- 0.9
#------------------------

#------ initialisation ------
beta0Hat <- rep(NA, 1000)
betaBHat <- rep(NA, 1000)
betaCHat <- rep(NA, 1000)
betaDHat <- rep(NA, 1000)
betaEHat <- rep(NA, 1000)
#----------------------------

#------ simulations ------
for(i in 1:1000)
{
  #data generation
  x <- sample(x=c("pict1","pict2", "pict3", "pict4", "pict5"), 
              size=n, replace=TRUE, prob=rep(1/5, 5))  #(a)
  linpred <- cbind(1, dummy(x)[, -1]) %*% c(beta0, betaB, betaC, betaD, betaE)  #(b)
  pi <- exp(linpred) / (1 + exp(linpred))  #(c)
  y <- rbinom(n=n, size=1, prob=pi)  #(d)
  data <- data.frame(picture=x, choice=y)

  #fit the logistic model
  mod <- glm(choice ~ picture, family="binomial", data=data)

  #save the estimates
  beta0Hat[i] <- mod$coef[1]
  betaBHat[i] <- mod$coef[2]
  betaCHat[i] <- mod$coef[3]
  betaDHat[i] <- mod$coef[4]
  betaEHat[i] <- mod$coef[5]
}

However, as you can see from the output, each level of the factor 'picture' does not occur the same number of times (i.e. 200 times each).

> summary(data)
picture     choice     
pict1:200   Min.   :0.000  
pict2:207   1st Qu.:0.000  
pict3:217   Median :1.000  
pict4:163   Mean   :0.559  
pict5:213   3rd Qu.:1.000  
            Max.   :1.000

Moreover, it is not entirely clear to me how to set the initial beta values so as to determine the probability of success/failure for each level of 'picture'. I cannot comment on the original question because I do not yet have the necessary reputation points.

Best Answer

If you want 200 copies of each of 5 levels in a polytomous variable in random order, then do this instead:
```
x <- sample(rep(paste0('pict', 1:5), 200))
```
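For the design described in the question (11 repetitions of each of 5 levels, 55 observations per subject), the same idea can be applied within each subject. A minimal sketch, assuming a hypothetical number of subjects `nSubj` and a hypothetical column name `subject`, neither of which appears in the original code:

```r
set.seed(1)                       # only to make the sketch reproducible
nSubj   <- 20                     # hypothetical number of subjects
levels5 <- paste0("pict", 1:5)

# For each subject, shuffle 11 copies of each level (55 trials per subject)
design <- do.call(rbind, lapply(seq_len(nSubj), function(s) {
  data.frame(subject = s,
             picture = sample(rep(levels5, times = 11)))
}))

# Every subject sees each picture exactly 11 times
table(design$subject, design$picture)
```

Because the levels are replicated first and only then shuffled, the counts are exact by construction rather than equal only on average.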

If you want to control the overall prevalence of a specific outcome, then you must choose which beta you will adjust. I usually adjust beta0, the intercept.

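MM         <- model.matrix(~x)   # design matrix: intercept plus 4 dummy columns
betas      <- rnorm(4)           # arbitrary effects for the non-reference levels
prevTarget <- 0.3                # desired overall proportion of successes
# difference between the target and the mean success probability implied by beta0
prevDiff   <- function(beta0)  prevTarget - 
                               mean(binomial()$linkinv(MM%*%c(beta0, betas)))
beta0      <- uniroot(prevDiff, c(-100, 100))$root   # solve for the intercept
mean(binomial()$linkinv(MM%*%c(beta0, betas)))       # check: approximately 0.3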
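
To set the success probability of each level directly (the second part of the question), one option is to work backwards from target probabilities. With treatment coding, the intercept is the log-odds of the reference level and each remaining coefficient is a difference in log-odds from it. A sketch, where the probabilities in `pTarget` are made-up values for illustration only:

```r
# Hypothetical target success probabilities for pict1..pict5
pTarget <- c(pict1 = 0.50, pict2 = 0.55, pict3 = 0.45, pict4 = 0.50, pict5 = 0.70)

logOdds <- qlogis(pTarget)            # logit of each target probability
beta0   <- logOdds[["pict1"]]         # reference level (pict1) gives the intercept
betas   <- logOdds[-1] - beta0        # offsets for pict2..pict5

# Simulate a balanced data set and check the empirical success rates
set.seed(1)
x  <- factor(sample(rep(names(pTarget), 200)), levels = names(pTarget))
MM <- model.matrix(~ x)
y  <- rbinom(length(x), size = 1, prob = plogis(MM %*% c(beta0, betas)))
tapply(y, x, mean)                    # should be close to pTarget
```

Here `qlogis()` and `plogis()` are the logit and inverse logit in base R, so they play the same role as `binomial()$linkfun` and `binomial()$linkinv`.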

Related Solutions

R Logistic Simulation – How to Simulate Data for Logistic Regression with a Categorical Variable

The model

Let $x_B = 1$ if one has category "B", and $x_B = 0$ otherwise. Define $x_C$, $x_D$, and $x_E$ similarly. If $x_B = x_C = x_D = x_E = 0$, then we have category "A" (i.e., "A" is the reference level). Your model can then be written as

$$ \textrm{logit}(\pi) = \beta_0 + \beta_B x_B + \beta_C x_C + \beta_D x_D + \beta_E x_E $$ with $\beta_0$ an intercept.
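
Under this model, the success probability for each category is the inverse logit of the intercept plus that category's coefficient (zero for the reference category). A short sketch using the parameter values from the simulation check later in this answer:

```r
beta0 <- 0.07
betaB <- 0.1
betaC <- -0.15
betaD <- -0.03
betaE <- 0.9

# Linear predictor per category; "A" is the reference, so its offset is 0
eta <- beta0 + c(A = 0, B = betaB, C = betaC, D = betaD, E = betaE)

# Inverse logit gives the implied success probability of each category
round(plogis(eta), 3)
```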

Data generation in R

(a)

x <- sample(x=c("A","B", "C", "D", "E"), 
              size=n, replace=TRUE, prob=rep(1/5, 5))

The x vector has n components (one for each individual). Each component is either "A", "B", "C", "D", or "E". Each of "A", "B", "C", "D", and "E" is equally likely.

(b)

library(dummies)
dummy(x)

dummy(x) is a matrix with n rows (one for each individual) and 5 columns corresponding to $x_A$, $x_B$, $x_C$, $x_D$, and $x_E$. The linear predictors (one for each individual) can then be written as

linpred <- cbind(1, dummy(x)[, -1]) %*% c(beta0, betaB, betaC, betaD, betaE)
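
If the dummies package is not available, the same design matrix can be built with base R's `model.matrix()`. This is a sketch of an alternative, not the code used in the original answer; it assumes `x` is converted to a factor whose first level is the reference category "A":

```r
set.seed(1)
n <- 1000
beta0 <- 0.07; betaB <- 0.1; betaC <- -0.15; betaD <- -0.03; betaE <- 0.9

x <- sample(x = c("A", "B", "C", "D", "E"),
            size = n, replace = TRUE, prob = rep(1/5, 5))

# model.matrix() adds the intercept column and drops the reference level "A"
X <- model.matrix(~ factor(x, levels = c("A", "B", "C", "D", "E")))
linpred <- X %*% c(beta0, betaB, betaC, betaD, betaE)
```

The resulting `X` has the same role as `cbind(1, dummy(x)[, -1])`: a column of ones followed by indicators for "B" through "E".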

(c)

The probabilities of success follow from the logistic model:

pi <- exp(linpred) / (1 + exp(linpred))
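
The same transformation is available as the built-in `plogis()`, which also avoids overflow in `exp()` for large linear predictors. A brief illustration with made-up values of the linear predictor:

```r
linpred <- c(-2, 0, 0.97, 800)        # illustrative values only

manual  <- exp(linpred) / (1 + exp(linpred))
builtin <- plogis(linpred)

manual    # the last element is NaN because exp(800) overflows to Inf
builtin   # the last element is 1
```

For the small coefficients used in this answer the two forms agree, so the choice only matters for extreme linear predictors.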

(d)

Now we can generate the binary response variable. The $i$th response comes from a binomial random variable $\textrm{Bin}(n, p)$ with $n = 1$ and $p =$ pi[i]:

y <- rbinom(n=n, size=1, prob=pi)

Some quick simulations to check this is OK

> #------ parameters ------
> n <- 1000 
> beta0 <- 0.07
> betaB <- 0.1
> betaC <- -0.15
> betaD <- -0.03
> betaE <- 0.9
> #------------------------
> 
> #------ initialisation ------
> beta0Hat <- rep(NA, 1000)
> betaBHat <- rep(NA, 1000)
> betaCHat <- rep(NA, 1000)
> betaDHat <- rep(NA, 1000)
> betaEHat <- rep(NA, 1000)
> #----------------------------
> 
> #------ simulations ------
> for(i in 1:1000)
+ {
+   #data generation
+   x <- sample(x=c("A","B", "C", "D", "E"), 
+               size=n, replace=TRUE, prob=rep(1/5, 5))  #(a)
+   linpred <- cbind(1, dummy(x)[, -1]) %*% c(beta0, betaB, betaC, betaD, betaE)  #(b)
+   pi <- exp(linpred) / (1 + exp(linpred))  #(c)
+   y <- rbinom(n=n, size=1, prob=pi)  #(d)
+   data <- data.frame(x=x, y=y)
+   
+   #fit the logistic model
+   mod <- glm(y ~ x, family="binomial", data=data)
+   
+   #save the estimates
+   beta0Hat[i] <- mod$coef[1]
+   betaBHat[i] <- mod$coef[2]
+   betaCHat[i] <- mod$coef[3]
+   betaDHat[i] <- mod$coef[4]
+   betaEHat[i] <- mod$coef[5]
+ }
> #-------------------------
> 
> #------ results ------
> round(c(beta0=mean(beta0Hat), 
+         betaB=mean(betaBHat), 
+         betaC=mean(betaCHat), 
+         betaD=mean(betaDHat), 
+         betaE=mean(betaEHat)), 3)
 beta0  betaB  betaC  betaD  betaE 
 0.066  0.100 -0.152 -0.026  0.908 
> #---------------------

Logistic – Can Polytomous Categorical Independent Variables be Used in Logistic Regression?

Here's an answer from a different forum about how you might use coding to handle polytomous variables in regression in general. The original question there was also about logistic regression, which corroborates other answers here: the distinction between logistic and linear regression isn't important when coding polytomous variables. Here's another answer on Cross Validated, from @StasK, to a question just like yours, again suggesting coding.

However, @GaetanLion's answer to one of those similar questions discusses a drawback of coding (mostly interpretive complexity, I think) and emphasizes that manual coding may not be necessary, depending on your statistical software. On the other hand, judging from @gung's answer to another very similar question about the interpretive complexity of an analysis like yours, some software may code automatically, and a different test is needed to assess the significance of the polytomous factor as a whole (as opposed to particular levels).
