Solved – Tool for generating correlated data sets

correlation dataset mathematical-statistics random-generation software

Does anyone know of a tool that I can use to generate a set of data with known correlations (and, as icing on the cake, output it as JSON, CSV, TXT or some other common format)?

I am working on some data visualizations and want to evaluate which ones can more easily allow a user to spot correlations – visually.

Best Answer

You could do it in any of a variety of places: Excel, R, or almost anything capable of doing basic statistical calculations.

  1. Population correlation. In the bivariate case this is simple: take two independent random variables with the same standard deviation and build a third variable from them that has the required correlation with one of the two. If $X_1$ and $X_2$ are independent standard normal variables, then $Y=rX_2+\sqrt{1-r^2}X_1$ has correlation $r$ with $X_2$, because $\operatorname{Var}(Y)=r^2+(1-r^2)=1$ and $\operatorname{Cov}(Y,X_2)=r$.

    Here's an example in R:

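     n = 10                         # sample size
     r = 0.8                        # target population correlation
     x1 = rnorm(n)                  # independent standard normal draws
     x2 = rnorm(n)
     y1 = r*x2+sqrt(1-r*r)*x1       # has population correlation r with x2
     cor(y1, x2)                    # sample correlation; varies from run to run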
    

    Here the underlying variables have a population correlation of the desired size, but the sample correlation will differ from it. (I just ran the code three times and got sample correlations of 0.938, 0.895, and 0.933.)

    This could be done in Excel or any number of other packages with similar ease.

    If you need it for more than two variables and some prespecified correlation matrix, this can be done using Cholesky decomposition (of which the above is a special case). If $Z$ is a vector of length $k$ of independent random variables with unit (or at least constant) standard deviation, and $S$ is a correlation matrix with Cholesky decomposition $S=LL'$, then $LZ$ will have population correlation $S$.

  2. Sample correlation. For the exact sample correlation, you need samples with exactly zero sample correlation, and identical sample variances, before applying the above trick. There are a variety of ways to achieve that, but one simple way is to take residuals from a regression (which will be uncorrelated with the x-variable in the regression), and then scale both variables to have unit variance.

    Here's an example in R:

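     n = 10                         # sample size
     r = 0.8                        # target sample correlation
     x1 = rnorm(n)
     x2 = rnorm(n)
     # residuals of x1 on x2 have exactly zero sample correlation with x2
     y1 = scale(x2) * r  +  scale(residuals(lm(x1~x2))) * sqrt(1-r*r)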
    

    which produces the correlation:

     cor(y1,x2)
         [,1]
    [1,]  0.8
    

    exactly as desired.
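
For the multivariate case mentioned under population correlation, the Cholesky approach might look something like the following in R. This is only a sketch: the 3x3 matrix `S`, the sample size, and the seed are made-up illustrative values, not anything from the question. Note that R's `chol()` returns the upper-triangular factor, so it corresponds to $L'$ rather than $L$; with one draw per row, multiplying on the right by that factor is the row-wise equivalent of $LZ$.

```r
set.seed(1)                                # arbitrary seed, only for reproducibility
n <- 1000                                  # illustrative sample size
k <- 3                                     # number of variables

# Hypothetical target correlation matrix (must be positive definite)
S <- matrix(c(1.0, 0.8, 0.3,
              0.8, 1.0, 0.5,
              0.3, 0.5, 1.0), nrow = k, byrow = TRUE)

U <- chol(S)                               # upper triangular, t(U) %*% U equals S
Z <- matrix(rnorm(n * k), nrow = n)        # each row: k independent N(0,1) draws
X <- Z %*% U                               # each row is t(L %*% z) with L = t(U)

round(cor(X), 2)                           # close to S, but not exactly equal
```

As in the bivariate case, the columns of `X` have population correlation `S`; the sample correlation matrix only approximates it and gets closer as `n` grows.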


So now it's just a matter of writing out the results in your preferred format. All the formats you mention can be done easily; for example, for a csv file you'd call write.csv:

write.csv(data.frame(y=y1,x=x2),file="myfile.csv")

which makes a file of the name "myfile.csv" in the current working directory with the contents:

"","y","x"
"1",0.743433299251026,0.617686871809365
"2",0.527604385327034,-0.113047553664104
"3",-0.397333571358269,0.196447643803443
"4",-0.875264248799599,-1.57628371273354
"5",-0.225441433921137,-0.107919886825751
"6",0.0817573026498336,0.370207951209058
"7",-2.15935431462587,-1.21145928947767
"8",1.46638207013879,1.10215217029937
"9",0.311683673588212,-0.470550477344661
"10",0.526532837749974,-0.104382608454622
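
A plain-text version can be produced in much the same way with base R's `write.table`. The sketch below regenerates a small data set so that it runs on its own; the file name `myfile.txt`, the temporary directory, and the seed are assumptions made for illustration. JSON output would need an add-on package such as jsonlite, which is not shown here.

```r
set.seed(1)                                # arbitrary seed, only for reproducibility
n <- 10
r <- 0.8
x1 <- rnorm(n)
x2 <- rnorm(n)
y1 <- r * x2 + sqrt(1 - r^2) * x1          # population correlation r with x2

out <- data.frame(y = y1, x = x2)

# Hypothetical output location; swap in any path you like
txt_file <- file.path(tempdir(), "myfile.txt")

# Tab-delimited text with a header row, no row names, no quoting
write.table(out, file = txt_file, sep = "\t", row.names = FALSE, quote = FALSE)

back <- read.delim(txt_file)               # read it back to check the round trip
head(back)
```

Dropping the row names keeps the file to just the two data columns, which is usually easier for a visualization tool to ingest.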