I originally provided what I now believe is a sub-optimal answer, so I have edited my answer to begin with a better suggestion.
Using the vine method
In this thread: How to efficiently generate random positive-semidefinite correlation matrices? -- I described and provided the code for two efficient algorithms for generating random correlation matrices. Both come from a paper by Lewandowski, Kurowicka, and Joe (2009).
Please see my answer there for a lot of figures and MATLAB code. Here I would only like to say that the vine method allows one to generate random correlation matrices with any distribution of partial correlations (note the word "partial") and can be used to generate correlation matrices with large off-diagonal values. Here is the relevant figure from that thread:
The only thing that changes between subplots is one parameter that controls how much the distribution of partial correlations is concentrated around $\pm 1$. As the OP was asking for an approximately normal distribution of the off-diagonal elements, here is the plot with histograms of the off-diagonal elements (for the same matrices as above):
I think these distributions are reasonably "normal", and one can see how the standard deviation gradually increases. I should add that the algorithm is very fast. See the linked thread for the details.
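For readers who do not want to follow the link, the core of the vine method can be sketched in a few lines of Python/NumPy (a sketch, not the original MATLAB from the linked thread). Partial correlations are drawn from a Beta distribution rescaled to $(-1,1)$; the `betaparam` knob plays the role of the concentration parameter mentioned above (smaller values push partial correlations towards $\pm 1$):

```python
import numpy as np

def vine_corr(d, betaparam=2.0, rng=None):
    """Sketch of the vine method of Lewandowski, Kurowicka & Joe (2009).

    Partial correlations are drawn from Beta(betaparam, betaparam)
    rescaled to (-1, 1); small betaparam concentrates them near +/-1.
    """
    rng = np.random.default_rng() if rng is None else rng
    P = np.zeros((d, d))   # partial correlations
    S = np.eye(d)          # resulting correlation matrix
    for k in range(d - 1):
        for i in range(k + 1, d):
            # sample a partial correlation on (-1, 1)
            P[k, i] = 2 * rng.beta(betaparam, betaparam) - 1
            p = P[k, i]
            # convert the partial correlation back to a raw correlation
            for l in range(k - 1, -1, -1):
                p = (p * np.sqrt((1 - P[l, i] ** 2) * (1 - P[l, k] ** 2))
                     + P[l, i] * P[l, k])
            S[k, i] = S[i, k] = p
    return S

C = vine_corr(5, betaparam=0.5, rng=np.random.default_rng(1))
# C is a valid correlation matrix: unit diagonal, all eigenvalues > 0
```

Because every partial correlation lies strictly inside $(-1,1)$, the resulting matrix is positive definite by construction, which is exactly why the method never needs a repair step.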
My original answer
A straightforward modification of your method might do the trick (depending on how close you want the distribution to be to normal). This answer was inspired by @cardinal's comments above and by @psarka's answer to my own question How to generate a large full-rank random correlation matrix with some strong correlations present?
The trick is to make the samples of your $\mathbf X$ correlated (not the features, but the samples). Here is an example: I generate a random matrix $\mathbf X$ of size $1000 \times 100$ (all elements standard normal), and then add a random number from $[-a/2, a/2]$ to each column, for $a=0,1,2,5$. For $a=0$ the correlation matrix $\mathbf X^\top \mathbf X$ (after standardizing the features) will have off-diagonal elements approximately normally distributed with standard deviation $1/\sqrt{1000}$. For $a>0$, I compute the correlation matrix without centering the variables (this preserves the inserted correlations), and the standard deviation of the off-diagonal elements grows with $a$, as shown in this figure (rows correspond to $a=0,1,2,5$):
All these matrices are of course positive definite. Here is the MATLAB code:
offsets = [0 1 2 5];
n = 1000;
p = 100;

rng(42) % random seed
figure
for offset = 1:length(offsets)
    X = randn(n,p);
    for i = 1:p
        X(:,i) = X(:,i) + (rand-0.5) * offsets(offset);
    end
    C = 1/(n-1) * (X'*X); % covariance matrix (non-centred!)

    % convert to correlation
    d = diag(C);
    C = diag(1./sqrt(d)) * C * diag(1./sqrt(d));

    % display C
    subplot(length(offsets), 3, (offset-1)*3+1)
    imagesc(C, [-1 1])

    % histogram of the off-diagonal elements
    subplot(length(offsets), 3, (offset-1)*3+2)
    offd = C(~eye(size(C)));
    hist(offd)
    xlim([-1 1])

    % QQ-plot to check the normality
    subplot(length(offsets), 3, (offset-1)*3+3)
    qqplot(offd)

    % eigenvalues
    eigv = eig(C);
    display([num2str(min(eigv),2) ' ... ' num2str(max(eigv),2)])
end
The output of this code (minimum and maximum eigenvalues) is:
0.51 ... 1.7
0.44 ... 8.6
0.32 ... 22
0.1 ... 48
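As a quick sanity check on the $a=0$ baseline (a Python/NumPy translation of that one case, not the MATLAB above): the off-diagonal elements of the sample correlation matrix of i.i.d. standard normal data should indeed have standard deviation close to $1/\sqrt{1000} \approx 0.032$.

```python
import numpy as np

rng = np.random.default_rng(42)
n, p = 1000, 100
X = rng.standard_normal((n, p))        # i.i.d. standard normal samples
C = np.corrcoef(X, rowvar=False)       # p-by-p sample correlation matrix
offd = C[~np.eye(p, dtype=bool)]       # off-diagonal elements
print(offd.std())                      # close to 1/sqrt(1000) ~ 0.032
```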
As @cardinal said in a comment:
Actually, after a little thought, I think your algorithm is exactly the Sinkhorn-Knopp algorithm with a very minor modification. Let $X$ be your original matrix and let $Y$ be a matrix of the same size such that $Y_{ij}=X_{ij}^2$. Then, your algorithm is equivalent to applying Sinkhorn-Knopp to $Y$, where at the final step you recover your desired form by taking $\hat{X}_{ij}=\operatorname{sgn}(X_{ij})\sqrt{Y_{ij}}$. Sinkhorn-Knopp is guaranteed to converge except in quite pathological circumstances. Reading up on it should be very helpful.
...it seems that the iterative algorithm I suggested in the original question is very similar to the Sinkhorn-Knopp algorithm. Interestingly, it also seems very similar to iterative proportional fitting (IPF), which, as described on the IPF wikipedia page, is related to Newton's method and expectation maximization (all have the same limit).
These iterative methods are often applied to problems which lack a closed form solution, so I will tentatively assume that the answer to the question is negative: there is no way to achieve the desired solution without row/column iteration.
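To make the connection concrete, here is a minimal Python/NumPy sketch of the Sinkhorn-Knopp iteration (alternating row and column normalization of a positive matrix), together with the sign-and-square-root recovery step @cardinal describes; the function name and the 4-by-4 example are mine:

```python
import numpy as np

def sinkhorn_knopp(Y, n_iter=1000, tol=1e-10):
    """Rescale a positive matrix so all row and column sums equal 1."""
    Y = np.asarray(Y, dtype=float).copy()
    for _ in range(n_iter):
        Y /= Y.sum(axis=1, keepdims=True)   # normalize rows
        Y /= Y.sum(axis=0, keepdims=True)   # normalize columns
        if np.allclose(Y.sum(axis=1), 1, atol=tol):
            break
    return Y

# cardinal's reformulation: run Sinkhorn-Knopp on the squared entries,
# then restore the original signs via sgn(X) * sqrt(Y)
X = np.random.default_rng(0).standard_normal((4, 4))
Y = sinkhorn_knopp(X ** 2)
X_hat = np.sign(X) * np.sqrt(Y)
# rows and columns of X_hat now have unit Euclidean norm
```

Since the rows and columns of $Y$ sum to one, the rows and columns of $\hat X$ have unit Euclidean norm, which is the normalization the iteration was after.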
Best Answer
This paper and R package completely solved my problem. It uses a Markov chain Monte Carlo method, which relies on the fact that once you have found one initial solution of the constraints (for instance by linear programming), you can generate an arbitrary number of further solutions by adding vectors that the constraint matrix $E$ maps to zero (i.e. vectors in its null space). Read about it here:
http://www.vliz.be/imisdocs/publications/149403.pdf
and here is the package:
http://cran.r-project.org/web/packages/limSolve/index.html
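The underlying idea (sampling the solution set of $Ex = f$ by moving along the null space of $E$) can be sketched in Python/NumPy, independently of the limSolve package; the constraint matrix `E`, right-hand side `f`, and the random step are hypothetical placeholders, and `E` is assumed to have full row rank:

```python
import numpy as np

rng = np.random.default_rng(0)

# hypothetical equality constraints E x = f
E = np.array([[1.0,  1.0, 1.0],
              [1.0, -1.0, 0.0]])
f = np.array([1.0, 0.0])

x0 = np.linalg.lstsq(E, f, rcond=None)[0]   # one particular solution

# orthonormal basis of the null space of E via the SVD
# (the slice assumes E has full row rank)
_, _, Vt = np.linalg.svd(E)
Z = Vt[E.shape[0]:].T                       # columns span {z : E z = 0}

# random step: any x0 + Z t still satisfies the constraints exactly
x1 = x0 + Z @ rng.standard_normal(Z.shape[1])
# E @ x1 equals f up to numerical precision
```

Iterating such null-space steps (with an accept/reject rule for any inequality constraints) is the random-walk sampler that the paper and the limSolve package implement.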