Generating Correlated Binomial Random Variables in R

bernoulli-distributionbinomial distributioncorrelationrrandom-generation

I was wondering if it might be possible to generate correlated random binomial variables following a linear transformation approach?

Below, I tried something simple in R and it produces some correlation. But I was wondering if there is a principled way to do this?

X1 = rbinom(1e4, 6, .5) ; X2 = rbinom(1e4, 6, .5) ;  X3 = rbinom(1e4, 6, .5) ; a = .5

Y1 = X1 + (a*X2) ; Y2 = X2 + (a*X3) ## Y1 and Y2 are supposed to be correlated

cor(Y1, Y2)

Best Answer

Binomial variables are usually created by summing independent Bernoulli variables. Let's see whether we can start with a pair of correlated Bernoulli variables $(X,Y)$ and do the same thing.

Suppose $X$ is a Bernoulli$(p)$ variable (that is, $\Pr(X=1)=p$ and $\Pr(X=0)=1-p$) and $Y$ is a Bernoulli$(q)$ variable. To pin down their joint distribution we need to specify all four combinations of outcomes. Writing $$\Pr((X,Y)=(0,0))=a,$$ we can readily figure out the rest from the axioms of probability: $$\Pr((X,Y)=(1,0))=1-q-a, \\\Pr((X,Y)=(0,1))=1-p-a, \\\Pr((X,Y)=(1,1))=a+p+q-1.$$

Plugging this into the formula for the correlation coefficient $\rho$ and solving gives $$a = (1-p)(1-q) + \rho\sqrt{{pq}{(1-p)(1-q)}}.\tag{1}$$

Provided all four probabilities are non-negative, this will give a valid joint distribution--and this solution parameterizes all bivariate Bernoulli distributions. (When $p=q$, there is a solution for all mathematically meaningful correlations between $-1$ and $1$.) When we sum $n$ of these variables, the correlation remains the same--but now the marginal distributions are Binomial$(n,p)$ and Binomial$(n,q)$, as desired.

Example

Let $n=10$, $p=1/3$, $q=3/4$, and we would like the correlation to be $\rho=-4/5$. The solution to $(1)$ is $a=0.00336735$ (and the other probabilities are around $0.247$, $0.663$, and $0.087$). Here is a plot of $1000$ realizations from the joint distribution:

Scatterplot

The red lines indicate the means of the sample and the dotted line is the regression line. They are all close to their intended values. The points have been randomly jittered in this image to resolve the overlaps: after all, Binomial distributions only produce integral values, so there will be a great amount of overplotting.

One way to generate these variables is to sample $n$ times from $\{1,2,3,4\}$ with the chosen probabilities and then convert each $1$ into $(0,0)$, each $2$ into $(1,0)$, each $3$ into $(0,1)$, and each $4$ into $(1,1)$. Sum the results (as vectors) to obtain one realization of $(X,Y)$.

Code

Here is an R implementation.

#
# Compute Pr(0,0) from rho, p=Pr(X=1), and q=Pr(Y=1).
#
a <- function(rho, p, q) {
  rho * sqrt(p*q*(1-p)*(1-q)) + (1-p)*(1-q)
}
#
# Specify the parameters.
#
n <- 10
p <- 1/3
q <- 3/4
rho <- -4/5
#
# Compute the four probabilities for the joint distribution.
#
a.0 <- a(rho, p, q)
prob <- c(`(0,0)`=a.0, `(1,0)`=1-q-a.0, `(0,1)`=1-p-a.0, `(1,1)`=a.0+p+q-1)
if (min(prob) < 0) {
  print(prob)
  stop("Error: a probability is negative.")
}
#
# Illustrate generation of correlated Binomial variables.
#
set.seed(17)
n.sim <- 1000
u <- sample.int(4, n.sim * n, replace=TRUE, prob=prob)
y <- floor((u-1)/2)
x <- 1 - u %% 2
x <- colSums(matrix(x, nrow=n)) # Sum in groups of `n`
y <- colSums(matrix(y, nrow=n)) # Sum in groups of `n`
#
# Plot the empirical bivariate distribution.
#
plot(x+rnorm(length(x), sd=1/8), y+rnorm(length(y), sd=1/8),
     pch=19, cex=1/2, col="#00000010",
     xlab="X", ylab="Y",
     main=paste("Correlation is", signif(cor(x,y), 3)))
abline(v=mean(x), h=mean(y), col="Red")
abline(lm(y ~ x), lwd=2, lty=3)