Solved – Covariance and correlation matrix comparison

correlation, covariance, distance-functions, r

I am aware that this question may be too broad and that answers are scattered across various posts, but I need a concise and organized answer.

My dataset consists of linear measurements of cranial dimensions on 600 individual roe deer (50 distinct measurements taken with a dial caliper). I divide this dataset into unequal groups (corresponding to population membership) and calculate a correlation or covariance matrix for each, so that every population is represented by a 50×50 matrix.

My question is: what is the best way to compare those matrices, both for equality and for pattern (excluding the Mantel test)? A problematic part may be that those matrices are rarely of full rank, since many of the measured characters are significantly correlated. The comparison should also include some kind of confidence intervals.

Edit:

In the meantime I have found one possible solution; I just need to implement it in R. This paper suggests several distance-based comparisons, which is what I really need.

The simplest is the Euclidean distance ($S_i$ is a sample covariance matrix):

$d_e(S_1,S_2) = \sqrt{\operatorname{tr}((S_1-S_2)^t(S_1-S_2))}$

which I implemented in R code like this (although unsure):

covDif <- cov(malesMab) - cov(malesMbm)      # difference of the two covariance matrices
sqrt(sum(diag(t(covDif) %*% covDif)))        # sqrt(tr((S1 - S2)^t (S1 - S2))), using the matrix product %*%

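However, this distance is not well suited for comparison. Of all the distances suggested in the mentioned paper, the one based on the Cholesky decomposition is the best, but I don't know how to program it in R. This is its form, with the same Euclidean norm $\lVert A \rVert = \sqrt{\operatorname{tr}(A^t A)}$ as above:

$d_C(S_1,S_2) = \lVert \operatorname{chol}(S_1)-\operatorname{chol}(S_2) \rVert$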

which I tried like this (just substituting in the upper equation)

cholDif <- chol(cov(malesMab)) - chol(cov(malesMbm))   # difference of the two Cholesky factors
sqrt(sum(diag(t(cholDif) %*% cholDif)))                # same norm as for the Euclidean distance

which works, but encounters rank deficiency problems which I hoped to avoid by using Cholesky decomposition.
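
As a minimal, self-contained sketch of the two distances above: the function names `dist_euclid` and `dist_chol` and the simulated data are illustrative assumptions, not taken from the paper. Base R's `norm(, type = "F")` computes the Frobenius norm, which equals $\sqrt{tr(A^t A)}$. Note that `chol()` expects a positive definite matrix, so the Cholesky version does not by itself get around rank deficiency; the last lines only show how a rank check could flag such a matrix beforehand.

```r
# Euclidean (Frobenius) distance: sqrt(tr((S1 - S2)^t (S1 - S2)))
dist_euclid <- function(S1, S2) norm(S1 - S2, type = "F")

# Cholesky distance: the same norm applied to the difference of the
# Cholesky factors; chol() needs a positive definite input
dist_chol <- function(S1, S2) norm(chol(S1) - chol(S2), type = "F")

set.seed(1)
p  <- 5                                          # 5 variables instead of 50
X1 <- matrix(rnorm(100 * p), ncol = p)           # simulated "population 1"
X2 <- matrix(rnorm(80 * p, sd = 2), ncol = p)    # simulated "population 2"
S1 <- cov(X1)
S2 <- cov(X2)

d_e <- dist_euclid(S1, S2)
d_c <- dist_chol(S1, S2)

# A group with fewer individuals than variables gives a rank-deficient
# covariance matrix; checking the rank first avoids a failing chol() call
S3      <- cov(X1[1:3, ])
rank_S3 <- qr(S3)$rank                           # smaller than p here
```

With real data, `S1` and `S2` would be the per-population covariance (or correlation) matrices.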

Any suggestions?

Best Answer

Have you tried using the morphometric approaches of Strauss & Bookstein (1982)? It seems like this may give you a relatively straightforward way to compare your populations. Here's a really brief summary, but there's much more in the paper and other morphometric publications.

1. If necessary, log-transform the 50 measurements ("dimensions").
2. Run a PCA on these dimensions (variables).
   - (note) PC 1 will likely explain almost all of the variance in the dimension data, and it mostly reflects overall size, so...
3. Regress each dimension on PC 1.
4. Use the residuals of each regression to build a discriminant model for DA based on the pre-assigned groups (populations).
5. Use resubstitution error rates to assess morphometric differences between populations.
6. Run MANOVA/ANOVA on the regression residuals, both as an additional assessment of population differences and to identify the specific dimensions that differ.
   - (note) even if the MANOVA results indicate real differences, be careful with the follow-up ANOVAs because of their sheer number (multiple comparisons).
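
The workflow above could be sketched in R roughly as follows. This is only an illustration on simulated data, not the procedure from the paper: the data-generating model, the variable names, the use of `MASS::lda()` for the discriminant analysis, and the Holm adjustment are all assumptions made here. One practical detail: the residuals from regressing every dimension on PC 1 are linearly dependent, so the sketch drops one residual column before the DA and MANOVA.

```r
library(MASS)                                   # lda()

set.seed(42)
n     <- 150
p     <- 6                                      # 6 dimensions instead of 50
pop   <- factor(rep(c("A", "B", "C"), each = n / 3))
size  <- rnorm(n, mean = 3, sd = 0.3)           # latent log-size
shape <- matrix(0, n, p)                        # small population effects
shape[pop == "B", 1] <- 0.10
shape[pop == "C", 2] <- -0.10
X <- exp(sapply(1:p, function(j) size + shape[, j] + rnorm(n, sd = 0.05)))
colnames(X) <- paste0("dim", 1:p)

logX <- log(X)                                  # log-transform
pca  <- prcomp(logX)                            # PCA on the covariance matrix
pc1  <- pca$x[, 1]                              # mostly overall size

res <- residuals(lm(logX ~ pc1))                # regress each dimension on PC 1
# Each row of residuals is orthogonal to the PC 1 loading vector, so the
# columns are linearly dependent; drop one before DA and MANOVA
res_use <- res[, -p]

fit  <- lda(res_use, grouping = pop)            # discriminant analysis
pred <- predict(fit, res_use)$class
resub_error <- mean(pred != pop)                # resubstitution error rate

man   <- manova(res_use ~ pop)
summary(man, test = "Pillai")                   # overall test
pvals <- sapply(summary.aov(man), function(s) s[["Pr(>F)"]][1])
p.adjust(pvals, method = "holm")                # adjust the per-dimension ANOVAs
```

Resubstitution error is optimistic because the same individuals are used for fitting and prediction; `lda(..., CV = TRUE)` gives a leave-one-out alternative.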

Strauss, R. E. and F. L. Bookstein. 1982. The truss: Body form reconstructions in morphometrics. Systematic Biology 31:113–135.

Related Solutions

Solved – Create positive-definite 3×3 covariance matrix given specified correlation values

To follow up on @cardinal's comment: your $x$, $y$, and $z$ define a $(3 \times 3)$ correlation matrix $R$. Since a correlation matrix also is a possible covariance matrix (of standardized variables), it has to be positive definite. This is the case if all eigenvalues are $> 0$. If $R$ is indeed positive definite, then all vectors $\boldsymbol{s}$ of variances (i.e., numbers $> 0$) will turn $\boldsymbol{R}$ into a positive definite covariance matrix $\boldsymbol{\Sigma} = \boldsymbol{D}_{s}^{1/2} \boldsymbol{R} \boldsymbol{D}_{s}^{1/2}$, where $\boldsymbol{D}_{s}^{1/2}$ is the square root of the diagonal matrix made from $\boldsymbol{s}$.

So just construct $R$ from $x, y, z$, and check if the eigenvalues are all $> 0$. If so, you're good, and you can transform any set of data to have a corresponding covariance matrix with arbitrary variances:

x <- 0.5
y <- 0.3                            # changing this to -0.6 makes it not pos.def.
z <- 0.4
R <- matrix(numeric(3*3), nrow=3)   # will be the correlation matrix
diag(R) <- 1                        # set diagonal to 1
R[upper.tri(R)] <- c(x, y, z)       # fill in x, y, z to upper right
R[lower.tri(R)] <- c(x, y, z)       # fill in x, y, z to lower left
eigen(R)$values                     # get eigenvalues to check if pos.def.

gives

[1] 1.8055810 0.7124457 0.4819732

So our $\boldsymbol{R}$ here is positive definite. Now construct the corresponding covariance matrix from arbitrary variances.

vars  <- c(4, 16, 9)                # the variances
Sigma <- diag(sqrt(vars)) %*% R %*% diag(sqrt(vars))

Generate some data matrix $\boldsymbol{X}$ that we will transform to later have exactly that covariance matrix.

library(mvtnorm)                    # for rmvnorm()
N  <- 100                           # number of simulated observations
mu <- c(1, 2, 3)                    # some arbitrary centroid
X  <- round(rmvnorm(n=N, mean=mu, sigma=Sigma))

To do that, we first orthonormalize matrix $\boldsymbol{X}$, giving matrix $\boldsymbol{Y}$ with covariance matrix $\boldsymbol{I}$ (identity).

orthGS <- function(X) {             # implement Gram-Schmidt algorithm
    Id <- diag(nrow(X))
    for(i in 2:ncol(X)) {
        A <- X[ , 1:(i-1), drop=FALSE]
        Q <- qr.Q(qr(A))
        P <- tcrossprod(Q)
        X[ , i] <- (Id-P) %*% X[ , i]
    }
    scale(X, center=FALSE, scale=sqrt(colSums(X^2)))
}

Xctr <- scale(X, center=TRUE, scale=FALSE)  # centered version of X
Y    <- orthGS(Xctr)                        # Y is orthonormal

Transform matrix $\boldsymbol{Y}$ to have covariance matrix $\boldsymbol{\Sigma}$ and centroid $\boldsymbol{\mu}$.

Edit: what's going on here: Do a spectral decomposition $\boldsymbol{\Sigma} = \boldsymbol{G} \boldsymbol{D} \boldsymbol{G}^{t}$, where $\boldsymbol{G}$ is the matrix of normalized eigenvectors of $\boldsymbol{\Sigma}$, and $\boldsymbol{D}$ is the corresponding matrix of eigenvalues. Now matrix $\boldsymbol{G} \boldsymbol{D}^{1/2} \boldsymbol{Y}$ has covariance matrix $\boldsymbol{G} \boldsymbol{D}^{1/2} Cov(\boldsymbol{Y}) \boldsymbol{D}^{1/2} \boldsymbol{G}^{t} = \boldsymbol{G} \boldsymbol{D} \boldsymbol{G}^{t} = \boldsymbol{\Sigma}$, as $Cov(\boldsymbol{Y}) = \boldsymbol{I}$.

eig    <- eigen(Sigma)
A      <- eig$vectors %*% sqrt(diag(eig$values))
XX1ctr <- t(A %*% t(Y)) * sqrt(nrow(Y) - 1) # N-1: cov() divides by N-1, so cov(XX1) is exactly Sigma
XX1    <- sweep(XX1ctr, 2, mu, "+")         # move centroid to mu

Check that the correlation matrix is really $\boldsymbol{R}$.

> all.equal(cor(XX1), R)
[1] TRUE
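
As a compact cross-check of the same idea, here is a self-contained sketch that swaps in different building blocks: a QR decomposition instead of the hand-written Gram-Schmidt function, and a Cholesky factor instead of the spectral decomposition. It is not the code from the answer above; the variable names and the use of `rnorm()` for the starting data are assumptions. Since `cov()` divides by $N-1$, the orthonormal matrix is scaled by $\sqrt{N-1}$.

```r
set.seed(123)
N     <- 100
R     <- matrix(c(1.0, 0.5, 0.3,
                  0.5, 1.0, 0.4,
                  0.3, 0.4, 1.0), nrow = 3)      # x = 0.5, y = 0.3, z = 0.4
vars  <- c(4, 16, 9)
Sigma <- diag(sqrt(vars)) %*% R %*% diag(sqrt(vars))
mu    <- c(1, 2, 3)

X    <- matrix(rnorm(N * 3), ncol = 3)           # any full-rank starting data
Xctr <- scale(X, center = TRUE, scale = FALSE)   # centered version of X
Q    <- qr.Q(qr(Xctr))                           # orthonormal, centered columns
Z    <- sqrt(N - 1) * Q %*% chol(Sigma)          # t(U) %*% U = Sigma for U = chol(Sigma)
Z    <- sweep(Z, 2, mu, "+")                     # move centroid to mu

all.equal(cov(Z), Sigma)                         # covariance matches Sigma
all.equal(cor(Z), R)                             # correlation matches R
all.equal(colMeans(Z), mu)                       # centroid matches mu
```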

For other purposes, the question might now be: How do I find a positive definite matrix that is "very similar" to a pre-specified one that is not positive definite. That I don't know.
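
One option that may help with that last question, offered here only as a hedged pointer and not as part of the original answer, is `nearPD()` from the Matrix package, which searches for a nearby positive definite matrix. The sketch below applies it to the non-positive-definite case mentioned in the code comment above (`y = -0.6`); whether the result is "similar enough" depends on the application.

```r
library(Matrix)                      # nearPD() is in the Matrix package

x <- 0.5; y <- -0.6; z <- 0.4        # the non-positive-definite case from above
R <- diag(3)
R[upper.tri(R)] <- c(x, y, z)
R[lower.tri(R)] <- c(x, y, z)
min(eigen(R)$values)                 # negative: R is not positive definite

near <- nearPD(R, corr = TRUE)       # corr = TRUE keeps the unit diagonal
R_pd <- as.matrix(near$mat)
min(eigen(R_pd)$values)              # positive, though possibly very small
max(abs(R_pd - R))                   # how far the entries had to move
```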

Edit: corrected some square roots

Solved – Obtaining covariance matrix from correlation matrix

Let $R$ be the correlation matrix and $S$ the vector of standard deviations, so that $S\cdot S$ (where $\cdot$ is the componentwise product) is the vector of variances. Then $$ \text{diag}(S) R \text{diag}(S) $$ is the covariance matrix. This is fully explained here.

This can be implemented in R as

cor2cov_1 <- function(R,S){
    diag(S) %*% R %*% diag(S)
}

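but this builds two full diagonal matrices and performs two matrix multiplications, which becomes wasteful for large matrices. An implementation that only rescales the rows and columns is

cor2cov <- function(R, S) {
    sweep(sweep(R, 1, S, "*"), 2, S, "*")
}

and you can check for yourself that they give the same result.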

TRUTH= 0.8 
R <- as.matrix(data.frame(c(1, TRUTH), c(TRUTH, 1)))
S = c(sqrt(1), sqrt(1))

cor2cov_1(R,S)

outer(S,S) * R 

smat = as.matrix(S)
R * smat %*% t(smat)
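
Because the example above uses unit standard deviations, every variant trivially returns $R$ itself. A slightly more telling check, sketched here with hypothetical standard deviations `c(2, 3)` that are not from the original answer, compares the variants with `all.equal()` and converts back with `cov2cor()`:

```r
cor2cov_1 <- function(R, S) diag(S) %*% R %*% diag(S)
cor2cov   <- function(R, S) sweep(sweep(R, 1, S, "*"), 2, S, "*")

R <- matrix(c(1.0, 0.8,
              0.8, 1.0), nrow = 2)
S <- c(2, 3)                          # hypothetical standard deviations

V1 <- cor2cov_1(R, S)
V2 <- cor2cov(R, S)
V3 <- outer(S, S) * R

all.equal(V1, V2)                     # the two functions agree
all.equal(V1, V3)                     # so does the outer-product form
all.equal(cov2cor(V1), R)             # converting back recovers R
```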

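Here is a microbenchmark of these functions on the 2×2 example above. At this size the sweep-based cor2cov is by far the slowest, so the timings say little about how the functions compare on large matrices: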

library(microbenchmark)
microbenchmark::microbenchmark(outer(S,S) * R ,cor2cov_1(R,S), cor2cov(R,S), R * smat %*% t(smat), times = 10000)

Unit: microseconds
                 expr     min      lq       mean  median      uq      max neval cld
      outer(S, S) * R   1.968   2.214   2.724639   2.337   2.460 3611.362 10000  a 
      cor2cov_1(R, S)   1.722   1.886   2.778045   1.968   2.091 3743.259 10000  a 
        cor2cov(R, S) 113.037 116.071 125.844711 118.039 120.663 5462.020 10000   b
 R * smat %*% t(smat)   1.066   1.230   1.422712   1.435   1.517   12.177 10000  a
