I was trying to create some test data for logistic regression and I found this post: "How to simulate artificial data for logistic regression?"
It is a nice answer but it creates only continuous variables. What about a categorical variable x3 with 5 levels (A B C D E) associated with y for the same example as in the link?
Best Answer
The model
Let $x_B = 1$ if one has category "B", and $x_B = 0$ otherwise. Define $x_C$, $x_D$, and $x_E$ similarly. If $x_B = x_C = x_D = x_E = 0$, then we have category "A" (i.e., "A" is the reference level). Your model can then be written as
$$ \textrm{logit}(\pi) = \beta_0 + \beta_B x_B + \beta_C x_C + \beta_D x_D + \beta_E x_E $$ with $\beta_0$ an intercept.
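To make the role of the reference level concrete, here is a small sketch that evaluates the model for each category. The coefficient values are hypothetical, chosen only for illustration; they do not come from the original post.

```r
# Hypothetical coefficients on the logit scale (illustrative values only)
beta0 <- -1                                       # log-odds for the reference level "A"
beta  <- c(B = 0.5, C = 1.0, D = -0.5, E = 2.0)   # shifts relative to "A"

# Log-odds per category: "A" gets the intercept alone,
# every other level gets the intercept plus its own coefficient
logit_by_level <- c(A = beta0, beta0 + beta)

# The inverse logit maps log-odds to success probabilities
prob_by_level <- plogis(logit_by_level)
print(round(prob_by_level, 3))
```

Each $\beta$ other than $\beta_0$ is therefore a difference in log-odds between its category and "A".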
Data generation in R
(a)
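A minimal sketch of this step, using base R's `sample()` to draw the categories. The sample size `n = 1000` and the seed are assumptions made for illustration.

```r
set.seed(1)   # assumed seed, only so the draw is reproducible
n <- 1000     # assumed sample size

# Draw each individual's category; all five levels have probability 1/5
x <- sample(x = c("A", "B", "C", "D", "E"), size = n,
            replace = TRUE, prob = rep(1/5, 5))

table(x)      # counts per level should each be roughly n / 5
```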
The `x` vector has $n$ components (one for each individual). Each component is either "A", "B", "C", "D", or "E", and each of these five categories is equally likely.

(b)
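A sketch of the indicator-matrix step. `dummy()` is not a base R function (it presumably comes from an add-on package such as `dummies`), so this sketch substitutes base R's `model.matrix()`, which produces the same kind of 0/1 matrix. The name `X_dummy`, the sample size, and the seed are illustrative assumptions.

```r
set.seed(1)   # assumed seed
n <- 1000     # assumed sample size
x <- sample(c("A", "B", "C", "D", "E"), size = n, replace = TRUE)

# Convert to a factor so all five levels are kept, with "A" first
x <- factor(x, levels = c("A", "B", "C", "D", "E"))

# "~ x - 1" drops the intercept, so every level gets its own 0/1 column
X_dummy <- model.matrix(~ x - 1)

dim(X_dummy)        # n rows, 5 columns
head(X_dummy, 3)    # exactly one 1 per row
```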
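`dummy(x)` is a matrix with $n$ rows (one for each individual) and 5 columns corresponding to $x_A$, $x_B$, $x_C$, $x_D$, and $x_E$. The linear predictors (one for each individual) can then be written as the product of the matrix formed by a column of ones and the columns for $x_B$, $x_C$, $x_D$, and $x_E$ with the coefficient vector $(\beta_0, \beta_B, \beta_C, \beta_D, \beta_E)$; the $x_A$ column is dropped because "A" is the reference level.

(c)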
The probabilities of success follow from the logistic model by applying the inverse logit to the linear predictor $\eta$: $\pi = \exp(\eta) / (1 + \exp(\eta))$.
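The linear predictor and the resulting probabilities could be computed roughly as follows. This is a sketch under assumptions: the coefficient values, sample size, and seed are hypothetical, and `model.matrix(~ x)` is used here to build the intercept-plus-indicators design matrix.

```r
set.seed(1)   # assumed seed
n <- 1000     # assumed sample size
lev <- c("A", "B", "C", "D", "E")
x <- factor(sample(lev, size = n, replace = TRUE), levels = lev)

# Hypothetical coefficients (illustrative values only)
beta0 <- -1; betaB <- 0.5; betaC <- 1; betaD <- -0.5; betaE <- 2

# Design matrix: a column of ones plus indicators for B..E ("A" is the reference)
X <- model.matrix(~ x)

# Linear predictor, one value per individual
linpred <- X %*% c(beta0, betaB, betaC, betaD, betaE)

# Inverse logit; note that this `pi` masks R's built-in constant pi
pi <- exp(linpred) / (1 + exp(linpred))

# Only five distinct probabilities occur, one per category
sort(unique(round(as.vector(pi), 4)))
```

`plogis(linpred)` gives the same result and is numerically safer for large linear predictors.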
(d)
Now we can generate the binary response variable. The $i$th response comes from a binomial random variable $\textrm{Bin}(m, p)$ with a single trial ($m = 1$, not to be confused with the sample size $n$) and $p =$ `pi[i]`.
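A sketch of the response draw with `rbinom()`, which is vectorised over `prob`, so no loop over individuals is needed. As before, the coefficients, sample size, and seed are hypothetical.

```r
set.seed(1)   # assumed seed
n <- 1000     # assumed sample size
lev <- c("A", "B", "C", "D", "E")
x <- factor(sample(lev, size = n, replace = TRUE), levels = lev)

# Hypothetical (beta0, betaB, betaC, betaD, betaE)
beta <- c(-1, 0.5, 1, -0.5, 2)

# Success probability for each individual (masks the built-in constant pi)
pi <- as.vector(plogis(model.matrix(~ x) %*% beta))

# One Bernoulli draw per individual: size = 1 trial, prob = pi[i]
y <- rbinom(n = n, size = 1, prob = pi)

table(x, y)   # cross-tabulation of category against response
```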
Some quick simulations to check this is OK
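The original check is not preserved here. One possible sanity check, sketched under assumed parameter values, is to simulate a large sample and confirm that `glm()` recovers the coefficients used to generate it.

```r
set.seed(123)   # assumed seed
n <- 10000      # assumed (large) sample size
lev <- c("A", "B", "C", "D", "E")

# Hypothetical true values of (beta0, betaB, betaC, betaD, betaE)
beta <- c(-1, 0.5, 1, -0.5, 2)

# Steps (a) to (d) in one go
x  <- factor(sample(lev, size = n, replace = TRUE), levels = lev)
pi <- as.vector(plogis(model.matrix(~ x) %*% beta))
y  <- rbinom(n = n, size = 1, prob = pi)

# Fit the logistic regression with "A" as the reference level
fit <- glm(y ~ x, family = binomial)

# Estimates should be close to the true values, up to sampling error
comparison <- cbind(true = beta, estimate = coef(fit))
print(round(comparison, 2))
```

Repeating this over many seeds and averaging the estimates would give a stricter check than a single fit.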