Say I had a correlation matrix:
$$
M =
\begin{bmatrix}
0.8 & 0.1 & 0.1\\
0.3 & 0.7 & 0.1 \\
-0.1 & -0.2 & 0.9
\end{bmatrix}
$$
I want to show that it is approximately an identity matrix, $M \approx I$.
I was looking at the following R package: https://www.rdocumentation.org/packages/psych/versions/2.1.6/topics/cortest.mat
It uses a method derived from Steiger 1980, where he observed that
"the sum of the squared elements of a correlation matrix, or the Fisher z score equivalents, is distributed as chi square under the null hypothesis that the values are zero (i.e., elements of the identity matrix)"
I would translate this as:
$$
\sum^{N}_{i=1}\sum^{M}_{j=1}x_{i,j}^2 \sim \chi^{2}\; \mathrm{iff}\; \sum^{N}_{i=1}\sum^{M}_{j=1}x_{i,j}^2 = 0
$$
This doesn't make much sense to me. Using a simple identity matrix, I can easily show that this isn't true. I must be misunderstanding it.
E.g.
$$
I_{3} =
\begin{bmatrix}
1 & 0 & 0\\
0 & 1 & 0\\
0 & 0 & 1
\end{bmatrix}
\\
$$
$$
\sum^{N}_{i=1}\sum^{M}_{j=1}(I_{3})_{i,j}^2 = 3
$$
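The arithmetic above is easy to confirm in R:

```r
# Sum of squared elements of an identity matrix: each of the m
# diagonal ones contributes 1, so the total is m, never 0.
I3 <- diag(3)
sum(I3^2)  # 3
```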
When it comes to formulating the null and alternative hypothesis, the best I can get is:
H0: $\sum^{N}_{i=1}\sum^{M}_{j=1}x_{i,j}^2 = 0$
HA: $\sum^{N}_{i=1}\sum^{M}_{j=1}x_{i,j}^2 \neq 0$
Which somewhat makes sense. I want to prove that I have an identity matrix, and if it is true that summing the squared elements of an identity matrix gives a nonzero result, then this seems to check out. But the math I am coming up with here just feels inconsistent.
I ran the test on an identity matrix and found:
> cortest(diag(10), cor=FALSE)
Tests of correlation matrices
Call:cortest(R1 = diag(10), cor = FALSE)
Chi Square value Inf with df = 100 with probability < 0
Looks great; assuming "with probability" is the p-value, that's what I want to see!
Then I ran the test again on a matrix of uniformly distributed random values:
> X <- matrix(runif(100,-1,1), ncol=10)
> cortest(X, cor=FALSE)
Tests of correlation matrices
Call:cortest(R1 = X, cor = FALSE)
Chi Square value 7139.26 with df = 100 with probability < 0
This also has a very low p-value… so maybe it is not so great after all, and I am not interpreting the results correctly.
I checked out the original 1980 article on this, but it was too advanced for me to understand, and I cannot find any resources on this particular test online.
So my main question is:
- What is the null and alternative hypothesis?
Best Answer
According to the paper, the test is the following. Denote by $m$ the number of features, and define $k = (m^2 - m)/2$, the number of unique off-diagonal elements in the correlation matrix. Furthermore, define the vector $p$ (of length $k$) which contains all of these elements. Then (Eq. 16 in the paper),
$$H_0: p = p_0$$
or in your case $H_0: p = 0$, and the alternative $H_1: p \neq 0$.
The test statistic does not sum the diagonal elements (see Eq. 22 in the paper), only the off-diagonal ones. The test statistic is $\sum_{i<j} z_{i,j}^2$, where $z_{i,j}$ is the Fisher transformation of the $(i,j)$ element of the correlation matrix. It is distributed as $\chi^2_{k-q}$, where $q$ is the number of common correlations (in your case 1, since they are all 0).
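A minimal sketch of this test in R, using the matrix from the question. The sample size $n$ is an assumption (the thread never states one), and each Fisher z is scaled by $\sqrt{n-3}$ so that it is approximately standard normal under the null, per the transform's standard property:

```r
# Hand-rolled version of the off-diagonal chi-square test described above.
# Assumed, not from the psych package: sample size n = 100.
M <- matrix(c( 0.8,  0.1, 0.1,
               0.3,  0.7, 0.1,
              -0.1, -0.2, 0.9), nrow = 3, byrow = TRUE)
n <- 100                       # assumed sample size
m <- nrow(M)
k <- (m^2 - m) / 2             # number of unique off-diagonal elements
q <- 1                         # number of common correlations, as above
# A true correlation matrix is symmetric, so either triangle would do;
# this M is not symmetric, and the lower triangle is used here.
z <- atanh(M[lower.tri(M)])    # Fisher z of the off-diagonal elements
chi_sq <- (n - 3) * sum(z^2)   # test statistic, ~ chi^2_{k-q} under H0
p_value <- pchisq(chi_sq, df = k - q, lower.tail = FALSE)
```

Note that a small `p_value` rejects $H_0: p = 0$: the test can only reject the identity hypothesis, never confirm it. That is why the `runif` matrix in the question also produced a tiny p-value — its off-diagonal elements are far from zero.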