Solved – How to generate Normal variables parts of which are correlated (in R)

correlation, normal distribution, r, random variable, simulation

My second attempt to explain the question.

Start with a vector of numbers V1 of length M whose elements are normally distributed. Take any N of its elements (N < M) and generate a noisy replica of this subset, that is, each selected element plus some random error. Call the result V2. Then append M - N further elements to V2 so that it also has length M and all M of its elements are normally distributed. The result should be two Normal random variables, V1 and V2, one part of which is correlated across the two variables and one part of which is not. Both variables must have the same predefined mean and standard deviation.

My approach has been the following (in R):

rho <- 0.9                        # set the correlation level
x <- rnorm(1000, 0, 2)
y <- x * rho + sqrt(1 - rho^2) * rnorm(1000, 0, 2)  # correlated parts of x and y
x[1001:1500] <- rnorm(500, 0, 2)  # uncorrelated part of x
y[1001:1500] <- rnorm(500, 0, 2)  # uncorrelated part of y
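
As a sanity check on the construction above, the following sketch compares the correlation within the first 1000 elements to the correlation within the appended 500, and looks at the mean and standard deviation of the full vectors. The seed and the names `shared`, `rest`, `cor_shared`, `cor_rest` and `cor_all` are assumptions added here for illustration.

```r
set.seed(1)  # assumed seed, only for reproducibility

rho <- 0.9
x <- rnorm(1000, 0, 2)
y <- x * rho + sqrt(1 - rho^2) * rnorm(1000, 0, 2)
x[1001:1500] <- rnorm(500, 0, 2)
y[1001:1500] <- rnorm(500, 0, 2)

shared <- 1:1000      # indices of the correlated part
rest   <- 1001:1500   # indices of the uncorrelated part

cor_shared <- cor(x[shared], y[shared])  # should be close to rho
cor_rest   <- cor(x[rest], y[rest])      # should be close to 0
cor_all    <- cor(x, y)                  # in between, diluted by the uncorrelated part

# Both full vectors should keep roughly mean 0 and standard deviation 2
c(mean_x = mean(x), sd_x = sd(x), mean_y = mean(y), sd_y = sd(y))
```

Because the noise term is scaled by `sqrt(1 - rho^2)`, `y` keeps the same standard deviation as `x` in the correlated part, so appending independent draws with the same parameters leaves the overall mean and standard deviation unchanged.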

The remaining problem is that, for now, the correlated elements of x and y are themselves normally distributed, which is admissible but should not be necessary. In other words: how can I add elements to an arbitrary collection of numbers so that the resulting set is normally distributed?

In case it clarifies things: the data I am actually trying to simulate come in the following form. People sort a fixed number of statements, according to how much they agree with each one, into bins whose sizes follow a quasi-normal distribution. It is suspected that people score a subset (but not all) of these statements in a very similar way. I want to simulate such data in order to explore the power of different statistical tools to detect this structure.
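
One way to sketch data of this kind is to simulate latent agreement scores for two respondents, with a shared subset of statements scored similarly, and then force each respondent's scores into a fixed bin layout by rank. The bin values and counts, the number of statements, the size of the shared subset, the seed, and the helper name `force_bins` are all hypothetical choices for illustration, not taken from any real study.

```r
set.seed(42)  # assumed seed, only for reproducibility

# Hypothetical forced distribution: 9 bins (-4..4), 36 statements in total
bin_values <- -4:4
bin_counts <- c(2, 3, 4, 6, 6, 6, 4, 3, 2)
n_items <- sum(bin_counts)

# Assign each statement to a bin by the rank of its latent score,
# so every respondent ends up with exactly the same bin counts
force_bins <- function(latent) {
  slots <- rep(bin_values, times = bin_counts)  # sorted bin labels, one per statement
  slots[rank(latent, ties.method = "first")]
}

rho <- 0.9
n_shared <- 12               # statements both respondents score similarly
shared <- seq_len(n_shared)

latent1 <- rnorm(n_items)
latent2 <- rnorm(n_items)    # independent of latent1 by default
latent2[shared] <- rho * latent1[shared] +
  sqrt(1 - rho^2) * rnorm(n_shared)  # correlated on the shared subset

q1 <- force_bins(latent1)
q2 <- force_bins(latent2)

table(q1)                      # matches bin_counts by construction
cor(q1[shared], q2[shared])    # agreement on the shared statements
cor(q1[-shared], q2[-shared])  # agreement on the remaining statements
```

Because the binning depends only on ranks, the quasi-normal shape of each respondent's final scores is guaranteed regardless of how the latent scores were generated. The correlation on the shared subset after binning will generally differ somewhat from `rho`, since the bins are assigned relative to all statements.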

Best Answer

Just thought I'd throw this out there. If you are interested in specifying a correlation between two initially independent variables (of any kind), you can use the correlate package to do so (it uses rejection sampling to provide a practical, natural, multivariate distribution).

E.g.

somevar <- 1:1000
x <- rnorm(1000)
data <- cbind(somevar, x)
cor(data)[1, 2]  # correlation between the two columns
[1] 0.03

library(correlate)
newdata <- correlate(data, 0.5)
cor(newdata)[1, 2]
[1] 0.50001

# values within each variable do not change
all(sort(newdata) == sort(data))
[1] TRUE

It sounds like this might help with your simulation design.
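
For comparison, a similar effect (inducing a target correlation while leaving the values within each variable unchanged) can be sketched in base R by reordering one variable according to the ranks of a correlated Normal draw. This is a rank-reordering approach in the spirit of the Iman-Conover method, not the algorithm used by the correlate package, and the function name `reorder_to_correlate` and the seed are hypothetical. The resulting correlation is approximate, not exact.

```r
set.seed(123)  # assumed seed, only for reproducibility

# Reorder the values of y so that it is correlated with x at roughly `rho`.
# The set of values in y is unchanged; only their order changes.
reorder_to_correlate <- function(x, y, rho) {
  n <- length(x)
  # Normal scores that follow the rank order of x
  zx <- qnorm(rank(x, ties.method = "first") / (n + 1))
  # Target scores correlated with zx at about rho
  zy <- rho * zx + sqrt(1 - rho^2) * rnorm(n)
  # Place the sorted values of y in the rank order of zy
  sort(y)[rank(zy, ties.method = "first")]
}

somevar <- 1:1000
x <- rnorm(1000)
x_new <- reorder_to_correlate(somevar, x, 0.5)

cor(somevar, x)              # near 0 before reordering
cor(somevar, x_new)          # roughly 0.5 after reordering
all(sort(x_new) == sort(x))  # TRUE: the values themselves are unchanged
```

The trade-off is that only the rank structure is controlled, so the Pearson correlation lands near the target but does not hit it exactly.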