Solved – Generate random data for logistic regression with a categorical independent variable

categorical datalogisticrregressionsimulation

I am trying to generate a data frame of fake data for exploratory purposes. Specifically, I am trying to produce data with a binary dependent variable (say, failure/success), and a categorical independent variable called 'picture' with 5 levels (pict1, pict2, etc.). I am following the answer provided here, which allows me to successfully generate the data. However, I need each level of 'picture' to occur the same number of times (i.e. 11 repetitions of each level = 55 total observations per subject).

Here is a reproducible example of what has worked so far (code from user: ocram):

library(dummies)

#------ parameters ------
n <- 1000 
beta0 <- 0.07
betaB <- 0.1
betaC <- -0.15
betaD <- -0.03
betaE <- 0.9
#------------------------

#------ initialisation ------
beta0Hat <- rep(NA, 1000)
betaBHat <- rep(NA, 1000)
betaCHat <- rep(NA, 1000)
betaDHat <- rep(NA, 1000)
betaEHat <- rep(NA, 1000)
#----------------------------

#------ simulations ------
for(i in 1:1000)
{
  #data generation
  x <- sample(x=c("pict1","pict2", "pict3", "pict4", "pict5"), 
              size=n, replace=TRUE, prob=rep(1/5, 5))  #(a)
  linpred <- cbind(1, dummy(x)[, -1]) %*% c(beta0, betaB, betaC, betaD, betaE)  #(b)
  pi <- exp(linpred) / (1 + exp(linpred))  #(c)
  y <- rbinom(n=n, size=1, prob=pi)  #(d)
  data <- data.frame(picture=x, choice=y)

  #fit the logistic model
  mod <- glm(choice ~ picture, family="binomial", data=data)

  #save the estimates
  beta0Hat[i] <- mod$coef[1]
  betaBHat[i] <- mod$coef[2]
  betaCHat[i] <- mod$coef[3]
  betaDHat[i] <- mod$coef[4]
  betaEHat[i] <- mod$coef[5]
}

However, as you can see from the output, each level of the factor 'picture' does not occur the same number of times (i.e. 200 times each).

> summary(data)
picture     choice     
pict1:200   Min.   :0.000  
pict2:207   1st Qu.:0.000  
pict3:217   Median :1.000  
pict4:163   Mean   :0.559  
pict5:213   3rd Qu.:1.000  
            Max.   :1.000 

Moreover, it is not entirely clear to me how to manipulate the initial beta values as to determine the probability of success/failure for each level of 'picture'. I cannot comment the original question because I do not yet have the necessary reputation points.

Best Answer

  1. If you want 200 copies of each of 5 levels in a polytomous variable in random order, then do this instead:

    x <- sample(rep(paste0('pict', 1:5), 200))
    
  2. If you want to control for overall prevalence of a specific outcome, then you must choose which beta you will fudge. I usually do beta0.

    MM         <- model.matrix(~x)
    betas      <- rnorm(4)
    prevTarget <- 0.3
    prevDiff   <- function(beta0)  prevTarget - 
                                   mean(binomial()$linkinv(MM%*%c(beta0, betas)))
    beta0      <- uniroot(prevDiff, c(-100, 100))$root
    mean(binomial()$linkinv(MM%*%c(beta0, betas)))