R – How to Generate Correlated Test Data with Bernoulli, Categorical, and Continuous Vectors

categorical-data, multivariate-analysis, r, random-generation

I'm looking to generate a set of 5 random variables and enforce a dependence structure between them and onto a dependent variable $Y$. I understand how to generate correlated random variables for multivariate normal, but not when mixing different types. Below is a little more than I need, but I'm hoping someone can give me a general way of solving this problem…

$X_1$ and $X_2$ need to be highly correlated Bernoulli variables.
$X_3$ needs to take one of 5 categorical values, call them "A"…"E".
$X_4$ needs to be normal, and negatively correlated with $X_1$, $X_2$.
$X_5$ needs to approximate test scores from $0$ to $100$ with a high skew, so gamma probably. $X_5$ needs to be positively correlated with $X_1$, $X_2$, $X_4$.

Each of these variables must impact a "success/occurrence" Bernoulli distributed variable $Y$.

How would I begin? I would like to enforce correlation both between the values of $X$, and also between each $X$ and $Y$. (The categorical correlations seem particularly confusing to me.)

Best Answer

Copulas are one way to generate dependent (rank-correlated) data from multivariate distributions whose margins are not necessarily normal. A simple worked example in MATLAB is "Simulating Dependent Random Variables Using Copulas". I am not sure whether this approach can handle categorical variables, though.
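As a rough illustration of the copula idea in R (the language the question asks about), here is a minimal Gaussian-copula sketch using only base R. Everything numeric in it is an assumption made for illustration and is not taken from the question: the latent correlation matrix `R`, the marginal parameters (Bernoulli probability 0.5, gamma shape and scale, equal category probabilities) and the logistic coefficients used to generate $Y$. Here $X_3$ is left independent of the other predictors and influences $Y$ only through per-level effects.

```r
set.seed(42)
n <- 10000

# Latent correlation matrix in the order X1..X5 (assumed values).
R <- diag(5)
R[1, 2] <- 0.9                       # X1 and X2 highly correlated
R[1, 4] <- -0.4; R[2, 4] <- -0.4     # X4 negatively correlated with X1, X2
R[1, 5] <- 0.3;  R[2, 5] <- 0.3      # X5 positively correlated with X1, X2
R[4, 5] <- 0.3                       # ... and with X4
R[lower.tri(R)] <- t(R)[lower.tri(R)]  # mirror the upper triangle

# Correlated standard normals, then uniforms via the normal CDF.
Z <- matrix(rnorm(n * 5), n, 5) %*% chol(R)
U <- pnorm(Z)

# Push each uniform through the quantile function of the desired margin.
X1 <- qbinom(U[, 1], size = 1, prob = 0.5)
X2 <- qbinom(U[, 2], size = 1, prob = 0.5)
X3 <- cut(U[, 3], breaks = seq(0, 1, by = 0.2),
          labels = LETTERS[1:5], include.lowest = TRUE)
X4 <- qnorm(U[, 4])
X5 <- pmin(100, qgamma(U[, 5], shape = 2, scale = 10))  # skewed, capped at 100

# Y depends on every X through a logistic model (assumed coefficients).
x3_effect <- c(A = 0, B = 0.3, C = 0.6, D = -0.3, E = -0.6)
eta <- -1 + 0.8 * X1 + 0.5 * X2 + unname(x3_effect[as.character(X3)]) -
  0.5 * X4 + 0.03 * X5
Y <- rbinom(n, size = 1, prob = plogis(eta))

dat <- data.frame(X1, X2, X3, X4, X5, Y)
round(cor(dat[, c("X1", "X2", "X4", "X5", "Y")]), 2)
```

Note that the correlations observed between the transformed variables will generally be weaker than the latent values in `R`, especially for the Bernoulli margins, so the latent matrix may need tuning to hit a specific target. Cutting a uniform into categories also implies an ordering of the levels, which may or may not suit a truly nominal $X_3$.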

Related Solutions

Solved – Missing factor levels after logistic regression glm()

This line in glm() is doing you in:

mf$drop.unused.levels <- TRUE

which effectively sets the drop.unused.levels argument of model.frame(), and that produces the behaviour you report.

The obvious solution is to prevent this from happening by adjusting the sampling algorithm you use to split the data into training and test sets: instead of randomly sampling the rows of the data, randomly sample within the levels of the factor.
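A minimal base-R sketch of sampling within factor levels might look like the following. The dummy factor, the 20% test fraction and the names `test_frac`, `test_idx` and `train_idx` are assumptions chosen for illustration.

```r
set.seed(1)
X1 <- factor(rep(1:3, times = c(20, 30, 50)))  # dummy factor for illustration
test_frac <- 0.2                               # assumed test-set fraction

# Split row indices by level, then draw the test rows within each level.
test_idx <- unlist(lapply(split(seq_along(X1), X1), function(idx) {
  n_test <- max(1, round(test_frac * length(idx)))
  idx[sample.int(length(idx), size = n_test)]
}), use.names = FALSE)
train_idx <- setdiff(seq_along(X1), test_idx)

table(X1[train_idx])  # every level should still be present
table(X1[test_idx])
```

Indexing with `sample.int()` rather than calling `sample(idx, ...)` avoids the surprise that `sample()` treats a single number as the range `1:idx` when a level has only one row.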

If you don't want to handle the details yourself, try the caret package and its function createFolds():

## install.packages("caret")
library("caret")

X1 <- factor(rep(1:3, times = c(20, 30, 50))) ## dummy data for illustration
f <- createFolds(X1, k = 5)
f

which gives:

> f <- createFolds(X1, k = 5)
> f
$Fold1
 [1]  5  7 10 20 21 24 29 31 34 42 51 52 59 68 75 76 82 83
[19] 85 94

$Fold2
 [1]  4  9 11 18 22 23 30 38 40 44 55 58 62 66 70 72 80 81
[19] 87 92

$Fold3
 [1]  1 12 14 16 27 37 41 48 49 50 53 60 61 63 64 74 79 88
[19] 89 97

$Fold4
 [1]  3 15 17 19 25 28 32 35 36 43 54 57 67 69 71 73 78 86
[19] 98 99

$Fold5
 [1]   2   6   8  13  26  33  39  45  46  47  56  65  77  84
[15]  90  91  93  95  96 100

The values in f are the indices of the elements of X1 partitioning it into k = 5 groups, with sampling from within the levels of X1 as needed. Then take 1 of these folds at random as the test set.

## number of samples in levels of X1 for each split
> table(X1[-f[[1]]])

 1  2  3 
16 24 40 
> table(X1[-f[[2]]])

 1  2  3 
16 24 40 
> table(X1[-f[[3]]])

 1  2  3 
16 24 40 
> table(X1[-f[[4]]])

 1  2  3 
16 24 40 
> table(X1[-f[[5]]])

 1  2  3 
16 24 40

Do note that with small sample sizes this algorithm cannot guarantee that the stratified sampling will always work (i.e., you may not be able to avoid the missing-levels issue in every case).

Solved – How to write down a logistic regression formula for continuous and categorical variables

The formalism used to write models in R can be quite handy, in this case with factor variables explicitly noted:

Y ~ age + calendar + factor(teacher) + factor(gender) + factor(prep_course)

You could expand to indicate more specifically that this is a logistic regression, and I suppose to indicate the reference levels of the factor variables (although that probably isn't so important for your presentation).
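One way the expanded form could be written out, as a sketch only: it assumes gender and prep_course are binary, that teacher has several levels with a reference level $t_0$, and the coefficient symbols ($\beta$, $\gamma_t$, $\delta$, $\theta$) and the level name $g_1$ are hypothetical notation, not taken from the question.

$$
\operatorname{logit} P(Y_i = 1) = \beta_0 + \beta_1\,\text{age}_i + \beta_2\,\text{calendar}_i + \sum_{t \neq t_0} \gamma_t\,\mathbf{1}[\text{teacher}_i = t] + \delta\,\mathbf{1}[\text{gender}_i = g_1] + \theta\,\mathbf{1}[\text{prep\_course}_i = 1]
$$

Here $\operatorname{logit}(p) = \log\{p / (1 - p)\}$, and each indicator term corresponds to one non-reference level of a factor, which mirrors how R expands factor() terms into dummy variables.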
