Solved – Contrasting covariance calculations in R, Matlab, Pandas, NumPy cov, and NumPy linalg.svd

Tags: covariance, MATLAB, R, scikit-learn, SVD

I wish there were a kind of Rosetta Stone for the covariance (and other) statistical methods supplied by popular statistics libraries and software, so that we could understand why what is ostensibly the same method gives different results in different libraries. I am aware that most of the discrepancies are due to scaling and normalizing factors, but these factors are usually not explained in the libraries' documentation, and some of the discrepancies are quite significant. If the approaches to normalization and scaling are so widely understood, why is it so difficult to write Python code that manually reproduces the output of these methods? Shouldn't the algorithms used in these popular libraries be more transparent?
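
For reference, the two normalizations I have in mind are the unbiased sample covariance (divide by $n-1$) and the maximum-likelihood estimate (divide by $n$):

$$\hat\Sigma_{\text{sample}} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(x_i-\bar{x})^\top, \qquad \hat\Sigma_{\text{ML}} = \frac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})(x_i-\bar{x})^\top$$

Two libraries that differ only in this choice disagree by a constant factor of $(n-1)/n$; the sign flips and larger differences below are what really puzzle me.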

Test example 1.

Input matrix A = $\begin{pmatrix}
1 & 2 \\
3 & 4
\end{pmatrix}$

  1. Output of cov(A) in R: $\begin{pmatrix}
    2 & 2 \\
    2 & 2
    \end{pmatrix}$

  2. Output in R when input is $A^\top$: $\begin{pmatrix}
    0.5 & 0.5 \\
    0.5 & 0.5
    \end{pmatrix}$

  3. Output of cov(A) in Matlab: $\begin{pmatrix}
    2 & 2 \\
    2 & 2
    \end{pmatrix}$

  4. Output in Pandas: $\begin{pmatrix}
    2 & 2 \\
    2 & 2
    \end{pmatrix}$

  5. Output in SciKit Learn Empirical Covariance: $\begin{pmatrix}
    1 & 1 \\
    1 & 1
    \end{pmatrix}$
    What sort of normalization or scaling is SciKit doing here? (A quick check is sketched just after this list.)

  6. Output in NumPy using np.cov(): $\begin{pmatrix}
    0.5 & 0.5 \\
    0.5 & 0.5
    \end{pmatrix}$

  7. Output from Python code that supposedly reflects what np.cov() is doing (according to this forum post): $\begin{pmatrix}
    +2 & -2 \\
    -2 & +2
    \end{pmatrix}$
    NOTE the sign difference in the off-diagonal elements!

  8. Output in NumPy using np.linalg.svd() to calculate covariance: $\begin{pmatrix}
    10 & -14 \\
    -14 & 20
    \end{pmatrix}$
    The values here differ from Matlab by more than a constant factor or a square.
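
Regarding item 5: a quick check (my own sketch, not taken from the scikit-learn docs) suggests EmpiricalCovariance simply normalizes by $n$ instead of $n-1$, so it should agree with np.cov once the bias flag is set:

    import numpy as np
    from sklearn.covariance import EmpiricalCovariance

    X = np.vstack(([1, 2], [3, 4]))    # rows = observations, columns = variables

    emp = EmpiricalCovariance().fit(X).covariance_   # scikit-learn: divides by n
    ml = np.cov(X, rowvar=False, bias=True)          # NumPy: columns as variables, 1/n normalization

    print(emp)   # [[1. 1.]
                 #  [1. 1.]]
    print(ml)    # should match emp if the only difference is the n vs n-1 factor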

Code used to generate input matrices for Test 1:

  1. R code for input A: A = matrix(c(1, 3, 2, 4), nrow=2, ncol=2)

  2. R code for the transposed input: A.transpose = matrix(c(1, 2, 3, 4), nrow=2, ncol=2)

  3. Matlab code for input A = [1, 2; 3, 4]

  4. Pandas code (assuming import pandas as pd): pd.DataFrame(np.vstack(([1, 2], [3, 4]))).cov()

  5. SciKit Learn empirical covariance code:

    from sklearn.covariance import EmpiricalCovariance
    import numpy as np

    X = np.vstack(([1, 2], [3, 4]))
    fit = EmpiricalCovariance().fit(X)
    fit.covariance_
    
  6. NumPy code using np.cov():

    X = np.vstack(([1, 2], [3, 4]))
    np.cov(X)
    
  7. Python code that supposedly reflects what's happening in np.cov():

    X0 = np.vstack(([1, 2], [3, 4]))
    X = X0 - X0.mean(axis=0)       # subtract the column means
    N = X.shape[1]                 # N is taken as the number of columns here
    fact = float(N - 1)
    C = np.dot(X, X.T) / fact      # the product is X times X-transpose
    

NOTE that putting the transpose first, i.e. np.dot(X.T, X) instead of np.dot(X, X.T), gives the same result as Matlab and R, without the negative signs in the off-diagonal elements, but this is supposed to be an outer product!! What is backwards about the NumPy conventions such that we need to put the transposed term first in order to get an outer product?

  8. NumPy code to obtain covariance using np.linalg.svd() according to this post:

    X0 = np.vstack(([1, 2], [3, 4])) 
    U, s, V = np.linalg.svd(X0, full_matrices = 0)
    D = np.dot(np.dot(V,np.diag(s**2)),V.T)
    Dadjust = D / (X0.shape[0] - 1)
    print (Dadjust)
    

Again, the values here differ from Matlab by more than a constant factor or a square. Can anyone help me unravel this?
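
To make the multiplication-order question above concrete, here is a minimal check on the centered test-1 matrix (my own sketch, not from the linked post): with observations in rows and variables in columns, X.T @ X gives the 2×2 variable-by-variable matrix that R and Matlab report, while X @ X.T gives an observation-by-observation Gram matrix instead.

    import numpy as np

    X0 = np.vstack(([1, 2], [3, 4]))
    X = X0 - X0.mean(axis=0)          # center each column (variable)
    n = X0.shape[0]                   # number of observations (rows)

    print(np.dot(X.T, X) / (n - 1))   # variable by variable: [[2. 2.] [2. 2.]], matches R/Matlab/Pandas
    print(np.dot(X, X.T) / (n - 1))   # observation by observation: [[2. -2.] [-2. 2.]]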

Test example 2.

Input matrix A = $\begin{pmatrix}
1 & 2 \\
3 & 4 \\
22 & 44
\end{pmatrix}$

  1. Output of cov(A) in R: $\begin{pmatrix}
    134.3333 & 274.3333 \\
    274.3333 & 561.3333
    \end{pmatrix}$

  2. Output in R when input is $A^\top$: $\begin{pmatrix}
    0.5 & 0.5 & 11 \\
    0.5 & 0.5 & 11 \\
    11.0 & 11.0 & 242
    \end{pmatrix}$

  3. Output in Matlab: $\begin{pmatrix}
    134.3333 & 274.3333 \\
    274.3333 & 561.3333
    \end{pmatrix}$

  4. Output in Pandas: $\begin{pmatrix}
    134.3333 & 274.3333 \\
    274.3333 & 561.3333
    \end{pmatrix}$

  5. Output in SciKit Learn Empirical Covariance: $\begin{pmatrix}
    89.55555556 & 182.88888889 \\
    182.88888889 & 374.22222222
    \end{pmatrix}$

  6. Output in NumPy using np.cov(): $\begin{pmatrix}
    0.5 & 0.5 & 11 \\
    0.5 & 0.5 & 11 \\
    11 & 11 & 242
    \end{pmatrix}$

  7. Output from Python code that supposedly reflects what np.cov() is doing (according to this forum post): $\begin{pmatrix}
    273.88888889 & 229.22222222 & -503.11111111 \\
    229.22222222 & 192.55555556 & -421.77777778 \\
    -503.11111111 & -421.77777778 & 924.88888889
    \end{pmatrix}$

  8. Output in NumPy using np.linalg.svd() to calculate covariance: $\begin{pmatrix}
    247 & 491 \\
    491 & 978
    \end{pmatrix}$

Code used to generate input matrices for Test 2:

  1. R code for input A: A = matrix(c(1, 3, 22, 2, 4, 44), nrow=3, ncol=2)

  2. R code for covariance with input of $A^\top$:

    Atranspose = t(A)
    cov(Atranspose)
    
  3. Matlab code for input A = [1, 2; 3, 4; 22, 44]

  4. Pandas code: pd.DataFrame(np.vstack(([1, 2], [3, 4], [22, 44]))).cov()

  5. SciKit Learn empirical covariance code:

    X = np.vstack(([1, 2], [3, 4], [22, 44]))
    fit = EmpiricalCovariance().fit(X)
    fit.covariance_
    
  6. NumPy code using np.cov():

    X = np.vstack(([1, 2], [3, 4], [22, 44]))
    np.cov(X)
    
  7. Python code that supposedly reflects what's happening in np.cov():

    X0 = np.vstack(([1, 2], [3, 4], [22, 44]))
    X = X0 - X0.mean(axis=0)
    N = X.shape[1]
    fact = float(N - 1)
    C = np.dot(X, X.T) / fact
    
  8. NumPy code to obtain covariance using np.linalg.svd() according to this post:

    X0 = np.vstack(([1, 2], [3, 4], [22, 44]))
    U, s, V = np.linalg.svd(X0, full_matrices = 0)
    D = np.dot(np.dot(V,np.diag(s**2)),V.T)
    Dadjust = D / (X0.shape[0] - 1)
    print (Dadjust)
    

Best Answer

Note that numpy.cov() considers its input data matrix to have observations in each column, and variables in each row, so to get numpy.cov() to return what other packages do, you have to pass the transpose of the data matrix to numpy.cov().
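
For example, either transposing the input or passing rowvar=False gives the R/Matlab/Pandas result (this is standard np.cov usage):

import numpy as np

X = np.vstack(([1, 2], [3, 4], [22, 44]))   # rows = observations, columns = variables

print(np.cov(X.T))                # transpose so that variables sit in rows
print(np.cov(X, rowvar=False))    # equivalent: tell np.cov the variables are in columns
# both print [[134.33333333 274.33333333]
#             [274.33333333 561.33333333]], matching R, Matlab and Pandas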

The Python code that you linked can be used to simulate what other packages do, but it contains some errors: N should be the number of rows, not columns, and you have to perform the matrix multiplication in the other order:

import numpy as np

def cov(X0):
    print "\n==\nMatrix:"
    print X0
    X = X0 - X0.mean(axis=0)
    N = X.shape[0]                # !!!
    fact = float(N - 1)
    print "Covariance:"
    print np.dot(X.T, X) / fact   # !!!

X0 = np.vstack(([1, 2], [3, 4]))
cov(X0)
cov(X0.T)

X0 = np.vstack(([1, 2], [3, 4], [22, 44]))
cov(X0)
cov(X0.T)

With these fixes, the covariance behaves as expected:

==
Matrix:
[[1 2]
 [3 4]]
Covariance:
[[ 2.  2.]
 [ 2.  2.]]

==
Matrix:
[[1 3]
 [2 4]]
Covariance:
[[ 0.5  0.5]
 [ 0.5  0.5]]

==
Matrix:
[[ 1  2]
 [ 3  4]
 [22 44]]
Covariance:
[[ 134.33333333  274.33333333]
 [ 274.33333333  561.33333333]]

==
Matrix:
[[ 1  3 22]
 [ 2  4 44]]
Covariance:
[[   0.5    0.5   11. ]
 [   0.5    0.5   11. ]
 [  11.    11.   242. ]]

As for the numpy.linalg.svd() code, you need to center the data matrix by subtracting off the variable means, and the multiplication involving the V matrix must be performed in the other order. With these changes you will replicate everybody else's behavior:

import numpy as np

def cov(X0):
    print("\n==\nMatrix:")
    print(X0)
    X = X0 - X0.mean(axis=0)                       # center each column (variable)
    U, s, V = np.linalg.svd(X, full_matrices=0)    # V here is the V-transpose ("Vh") factor returned by NumPy
    D = np.dot(np.dot(V.T, np.diag(s**2)), V)      # equals X.T times X
    Dadjust = D / (X0.shape[0] - 1)
    print("Covariance:")
    print(Dadjust)
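
Calling this SVD-based version on the same two test matrices should reproduce the results from the other packages (my own check, using the function defined just above):

X0 = np.vstack(([1, 2], [3, 4]))
cov(X0)          # expect [[2. 2.] [2. 2.]], the same as cov(A) in R and Matlab

X0 = np.vstack(([1, 2], [3, 4], [22, 44]))
cov(X0)          # expect [[134.333... 274.333...] [274.333... 561.333...]]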