Solved – Tool for generating correlated data sets

correlationdatasetmathematical-statisticsrandom-generationsoftware

Does anyone know of a tool that I can use to generate a set of data with known correlations (and to put the icing on the cake – output this in json,csv,txt or some common format)?

I am working on some data visualizations and want to evaluate which ones can more easily allow a user to spot correlations – visually.

Best Answer

You could do it in any variety of places. Excel, R, ... almost anything capable of doing basic statistical calculations.

Population correlation. This is a simple matter in the bivariate case of taking independent random variables with the same standard deviation and creating a third variable from those two that has the required correlation with one of the two random variables. If $X_1$ and $X_2$ are independent standard normal variables, then $Y=rX_2+\sqrt{1-r^2}X_1$ will have correlation $r$ between $Y$ and $X_2$.

Here's an example in R:
```
 n = 10
 r = 0.8
 x1 = rnorm(n)
 x2 = rnorm(n)
 y1 = r*x2+sqrt(1-r*r)*x1   
```
Here the underlying variables have population correlation of the desired size, but the sample correlation will differ from it. (I just ran the code three times and got sample correlations of 0.938,0.895, and 0.933).

This could be done in Excel or any number of other packages with similar ease.

If you need it for more than two variables and some prespecified correlation matrix, this can be done using Cholesky decomposition (of which the above is a special case). If $Z$ is a vector of length $k$ of independent random variables with unit (or at least constant) standard deviation; and $\S$ is a correlation matrix with Cholesky decomposition $S=LL'$, then $LZ$ with have population correlation $S$.
Sample correlation. For the exact sample correlation, you need samples with exactly zero sample correlation, and identical sample variances, before applying the above trick. There are a variety of ways to achieve that, but one simple way is to take residuals from a regression (which will be uncorrelated with the x-variable in the regression), and then scale both variables to have unit variance.

Here's an example in R:
```
 n = 10
 r = 0.8
 x1 = rnorm(n)
 x2 = rnorm(n)
 y1 = scale(x2) * r  +  scale(residuals(lm(x1~x2))) * sqrt(1-r*r)
```
which produces the correlation:
```
 cor(y1,x2)
     [,1]
[1,]  0.8
```
exactly as desired.

So now it's just a matter of writing out the results in your preferred format (all the formats you mention can be done easily; for example, as a csv file, you'd call write.csv:

write.csv(data.frame(y=y1,x=x2),file="myfile.csv")

which makes a file of the name "myfile.csv" in the current working directory with the contents:

"","y","x"
"1",0.743433299251026,0.617686871809365
"2",0.527604385327034,-0.113047553664104
"3",-0.397333571358269,0.196447643803443
"4",-0.875264248799599,-1.57628371273354
"5",-0.225441433921137,-0.107919886825751
"6",0.0817573026498336,0.370207951209058
"7",-2.15935431462587,-1.21145928947767
"8",1.46638207013879,1.10215217029937
"9",0.311683673588212,-0.470550477344661
"10",0.526532837749974,-0.104382608454622

Related Solutions

Solved – Method for generating correlated non-normal data

After much searching, jumping around online forums, consulting with professors and doing A LOT of literature review, I have come to the conclusion that probably THE only way to address this problem is through the use of vine copulas indeed. It gives you some control over the pairwise skewness and kurtosis (or any higher moments) - for a p-variate random vector and the freedom to specify p-1 pair of copulas and the remaining p*(p-1)/2 - (p-1) dimensions can be specified in some kind of conditional copula.

I welcome other methods people might've come across but at least I'm going to leave this pointer towards an answer because i cannot, for the life of me, find any other ways to address this.

Generating Correlated Binomial Random Variables in R

Binomial variables are usually created by summing independent Bernoulli variables. Let's see whether we can start with a pair of correlated Bernoulli variables $(X,Y)$ and do the same thing.

Suppose $X$ is a Bernoulli$(p)$ variable (that is, $\Pr(X=1)=p$ and $\Pr(X=0)=1-p$) and $Y$ is a Bernoulli$(q)$ variable. To pin down their joint distribution we need to specify all four combinations of outcomes. Writing $$\Pr((X,Y)=(0,0))=a,$$ we can readily figure out the rest from the axioms of probability: $$\Pr((X,Y)=(1,0))=1-q-a, \\\Pr((X,Y)=(0,1))=1-p-a, \\\Pr((X,Y)=(1,1))=a+p+q-1.$$

Plugging this into the formula for the correlation coefficient $\rho$ and solving gives $$a = (1-p)(1-q) + \rho\sqrt{{pq}{(1-p)(1-q)}}.\tag{1}$$

Provided all four probabilities are non-negative, this will give a valid joint distribution--and this solution parameterizes all bivariate Bernoulli distributions. (When $p=q$, there is a solution for all mathematically meaningful correlations between $-1$ and $1$.) When we sum $n$ of these variables, the correlation remains the same--but now the marginal distributions are Binomial$(n,p)$ and Binomial$(n,q)$, as desired.

Example

Let $n=10$, $p=1/3$, $q=3/4$, and we would like the correlation to be $\rho=-4/5$. The solution to $(1)$ is $a=0.00336735$ (and the other probabilities are around $0.247$, $0.663$, and $0.087$). Here is a plot of $1000$ realizations from the joint distribution:

The red lines indicate the means of the sample and the dotted line is the regression line. They are all close to their intended values. The points have been randomly jittered in this image to resolve the overlaps: after all, Binomial distributions only produce integral values, so there will be a great amount of overplotting.

One way to generate these variables is to sample $n$ times from $\{1,2,3,4\}$ with the chosen probabilities and then convert each $1$ into $(0,0)$, each $2$ into $(1,0)$, each $3$ into $(0,1)$, and each $4$ into $(1,1)$. Sum the results (as vectors) to obtain one realization of $(X,Y)$.

Code

Here is an R implementation.

#
# Compute Pr(0,0) from rho, p=Pr(X=1), and q=Pr(Y=1).
#
a <- function(rho, p, q) {
  rho * sqrt(p*q*(1-p)*(1-q)) + (1-p)*(1-q)
}
#
# Specify the parameters.
#
n <- 10
p <- 1/3
q <- 3/4
rho <- -4/5
#
# Compute the four probabilities for the joint distribution.
#
a.0 <- a(rho, p, q)
prob <- c(`(0,0)`=a.0, `(1,0)`=1-q-a.0, `(0,1)`=1-p-a.0, `(1,1)`=a.0+p+q-1)
if (min(prob) < 0) {
  print(prob)
  stop("Error: a probability is negative.")
}
#
# Illustrate generation of correlated Binomial variables.
#
set.seed(17)
n.sim <- 1000
u <- sample.int(4, n.sim * n, replace=TRUE, prob=prob)
y <- floor((u-1)/2)
x <- 1 - u %% 2
x <- colSums(matrix(x, nrow=n)) # Sum in groups of `n`
y <- colSums(matrix(y, nrow=n)) # Sum in groups of `n`
#
# Plot the empirical bivariate distribution.
#
plot(x+rnorm(length(x), sd=1/8), y+rnorm(length(y), sd=1/8),
     pch=19, cex=1/2, col="#00000010",
     xlab="X", ylab="Y",
     main=paste("Correlation is", signif(cor(x,y), 3)))
abline(v=mean(x), h=mean(y), col="Red")
abline(lm(y ~ x), lwd=2, lty=3)