If anyone is still interested in this problem: in my spare time I think I found an answer. The two methods only yield the same lengths for the semi-minor and semi-major axes of the ellipse in the theoretical case that the covariance matrix is exactly equal to the expectation value of the covariance matrix of a bivariate Gaussian distribution:
$$\mathrm{Cov} = \begin{bmatrix}
\sigma_1^2 & \rho\sigma_1\sigma_2\\
\rho\sigma_1\sigma_2 & \sigma_2^2
\end{bmatrix}$$
If the covariance matrix comes from a numerical realization of a bivariate Gaussian distribution, then it will not look exactly like the matrix above, and the results of the "eigenvalue" method and the "Batschelet, 1981" method begin to deviate. The result can be seen in the Python code below:
import numpy
import scipy
# example 14.4.1 (p. 273), parameter for bivariate gaussian distribution
mu_1 = 5
mu_2 = 4
sigma_1 = 3.0
sigma_2 = 2.0
rho = 0.4 # correlation
# eq 14.4.3 (p. 269), expectation value for covariance-matrix for bivariate gaussian distribution!
cov_theo = numpy.array([[sigma_1**2, rho*sigma_1*sigma_2],
[rho*sigma_1*sigma_2, sigma_2**2]])
# make an example (with 10000 points) as sketched in Batschelet, 1981
points = numpy.random.multivariate_normal(mean=(mu_1,mu_2), cov=cov_theo, size=10000)
x, y = points.T
cov = numpy.cov(x, y) # covariance-matrix by the realization
# get estimations for mu, sigma, rho
mu_1 = numpy.mean(x)
mu_2 = numpy.mean(y)
sigma_1 = numpy.std(x)  # note: numpy.std uses ddof=0, while numpy.cov above defaults to ddof=1
sigma_2 = numpy.std(y)
rho = numpy.corrcoef(x, y)[0, 1]
## -- BATSCHELET-METHOD (1981) -- ##
# eq 14.4. (ellipse equation): A(x-mu_1)^2 + 2B(x-mu_1)(y-mu_2) + C(y-mu_2)^2 = D
A = sigma_2**2
B = -rho*sigma_1*sigma_2
C = sigma_1**2
D = (1-rho**2)*sigma_1**2*sigma_2**2
# further calculations (p. 264)
R = numpy.sqrt((A-C)**2 + 4*B**2)
a = numpy.sqrt(2*D/(A+C-R))
b = numpy.sqrt(2*D/(A+C+R))
## -- EIGENVALUE-METHOD from SO -- ##
eigval, _ = numpy.linalg.eig(cov)
eigval = numpy.sort(eigval)[::-1]  # sort descending so a_eig is the semi-major axis
a_eig, b_eig = numpy.sqrt(eigval)
print('realization of covariance matrix\n', cov)
print('deviations from what it should look like:')
print('sigma_1^2 - cov[0,0] = {d1}'.format(d1=sigma_1**2 - cov[0, 0]))
print('sigma_2^2 - cov[1,1] = {d2}'.format(d2=sigma_2**2 - cov[1, 1]))
print('rho*sigma_1*sigma_2 - cov[0,1] = {d}'.format(d=rho*sigma_1*sigma_2 - cov[0, 1]))
print('a-a_eig = {diff_a}, b-b_eig = {diff_b}'.format(diff_a=a - a_eig, diff_b=b - b_eig))
The result looks like:
realization of covariance_matrix
[[ 9.00812576 2.4676689 ]
[ 2.4676689 4.14435948]]
deviations from what it should look like:
sigma_1^2 - cov[0,0] = -0.00090081257572
sigma_2^2 - cov[1,1] = -0.00041443594839
rho*sigma_1*sigma_2 - cov[0,1] = -0.000246766890107
a-a_eig = -0.000158440395525, b-b_eig = -8.82014644565e-05
and it gets worse for smaller sample sizes.
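The sample-size claim can be sketched numerically. This is only an illustrative check; the seed, the sample sizes, and the use of numpy's `Generator` API are my own choices, not from Batschelet:

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded for reproducibility
sigma_1, sigma_2, rho = 3.0, 2.0, 0.4
cov_theo = np.array([[sigma_1**2, rho*sigma_1*sigma_2],
                     [rho*sigma_1*sigma_2, sigma_2**2]])

for n in (100, 1_000, 100_000):
    pts = rng.multivariate_normal((5, 4), cov_theo, size=n)
    cov = np.cov(pts.T)
    # largest absolute entry-wise deviation from the theoretical matrix
    dev = np.abs(cov - cov_theo).max()
    print(n, dev)
```

The deviation of the sample covariance matrix from `cov_theo` shrinks roughly like $1/\sqrt{n}$, which is why the two ellipse methods drift apart for small samples.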
In summary:
- The methods only give the same results if the covariance matrix is exactly as assumed.
- The sample points must be drawn from a bivariate Gaussian distribution; if they are not, the deviation gets even worse.
- I am still unable to show that $a=\sqrt{\lambda_1}$ and $b=\sqrt{\lambda_2}$ if and only if the covariance matrix is exactly as assumed.
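As a numerical sanity check of the first bullet point, one can feed the exact theoretical covariance matrix to both methods and confirm that they agree up to floating-point error. A sketch using the same parameters as above:

```python
import numpy as np

# same parameters as in the example above
sigma_1, sigma_2, rho = 3.0, 2.0, 0.4

# exact (theoretical) covariance matrix
cov_theo = np.array([[sigma_1**2, rho*sigma_1*sigma_2],
                     [rho*sigma_1*sigma_2, sigma_2**2]])

# Batschelet (1981) ellipse coefficients and axes
A = sigma_2**2
B = -rho*sigma_1*sigma_2
C = sigma_1**2
D = (1 - rho**2)*sigma_1**2*sigma_2**2
R = np.sqrt((A - C)**2 + 4*B**2)
a = np.sqrt(2*D/(A + C - R))  # semi-major axis
b = np.sqrt(2*D/(A + C + R))  # semi-minor axis

# eigenvalue method applied to the exact matrix
eigval = np.sort(np.linalg.eigvalsh(cov_theo))[::-1]
a_eig, b_eig = np.sqrt(eigval)

print(a - a_eig, b - b_eig)  # both differences vanish up to floating-point error
```

The agreement is no accident here: $A + C$ is the trace of the exact matrix, $D$ is its determinant, and $R$ is the discriminant of its characteristic polynomial, so $2D/(A+C\mp R)$ reproduces the eigenvalues.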
I'll see if I can't help with the following...
I am really lost at rearranging the matrices in the form given in their examples
Here are some tips for going between matrices, Python `list`s, and numpy `array`s.
$$
v =
\begin{pmatrix}
9 \\
8 \\
7
\end{pmatrix}
$$
Where I usually see $v_{\left(1\right)} = 9$ and $v_{\left(3\right)} = 7$, in Python it is written a bit like...
import numpy as np

vList = [9, 8, 7]
vArray = np.array(vList)
# vList[0]  # -> 9
# vArray[2] # -> 7
... and in most programming languages the index starts at 0, but it looks like you've got that.
Things get interesting with nested lists and multidimensional numpy arrays...
$$
p =
\begin{pmatrix}
0.3 & 0.6 & 0.9 \\
0.4 & 0.7 & 0.8 \\
0.5 & 0.8 & 0.7
\end{pmatrix}
$$
Which could be represented as a `numpy.array` as shown...
p = np.array([
[0.3, 0.6, 0.9],
[0.4, 0.7, 0.8],
[0.5, 0.8, 0.7]
])
Accessing rows could then look like...
p[0]
# -> array([ 0.3, 0.6, 0.9])
p[2]
# -> array([0.5, 0.8, 0.7])
... but accessing individual cells is likely to frustrate those who want precision at a cellular level...
p[2,0]
# -> 0.5
# ... above looks okay...
p[0,0]
# -> 0.29999999999999999
# ... but that was `0.3`...
p[1,1]
# -> 0.69999999999999996
# ... and that should have been `0.7`
Even funkier than that...
p * 5
# -> array([[ 1.5, 3. , 4.5],
# [ 2. , 3.5, 4. ],
# [ 2.5, 4. , 3.5]])
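One more pitfall in the same vein, worth knowing before it bites: `*` on numpy arrays is element-wise, not matrix multiplication; the `@` operator (or `np.dot`) gives the actual matrix product. A small sketch with the same `p` as above:

```python
import numpy as np

p = np.array([
    [0.3, 0.6, 0.9],
    [0.4, 0.7, 0.8],
    [0.5, 0.8, 0.7]
])

elementwise = p * p  # multiplies cell by cell -- NOT matrix multiplication
matmul = p @ p       # the actual matrix product (rows times columns)

# elementwise[1, 1] is 0.7 * 0.7
# matmul[0, 0] is 0.3*0.3 + 0.6*0.4 + 0.9*0.5
print(elementwise[1, 1], matmul[0, 0])
```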
Hopefully this gave ya some traction on translating your problem into something that a computer will consider, as well as some pointers on how not to use numpy. It's fantastic, but misusing it can lead to anger, and anger can lead to... well, let's not even consider the paths divergent from the light ;-)
Best Answer
The third variable is not just $13$.
The readings of the third variable are $[7,9,13]$.
$$\frac{7+9+13}{3}=\frac{29}{3}$$
The unbiased estimator is
$$\frac12\left[ \left(7-\frac{29}{3}\right)^2 + \left(9 - \frac{29}{3}\right)^2 + \left(13 - \frac{29}{3}\right)^2\right]\approx 9.3333.$$
Note that for the first variable, the readings are $[0,1,2]$ with mean $1$.
The unbiased estimator is
$$\frac12\left[ \left(0-1\right)^2 + \left(1-1\right)^2 + \left(2-1\right)^2\right]=1.$$
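For anyone who wants to check these numbers programmatically, `numpy.var` with `ddof=1` computes exactly this unbiased estimator (dividing by $n-1$ instead of $n$):

```python
import numpy as np

readings = [7, 9, 13]                    # third variable
mean = np.mean(readings)                 # 29/3
var_unbiased = np.var(readings, ddof=1)  # divides by n - 1 = 2: the unbiased estimator

first = [0, 1, 2]                        # first variable, mean 1

print(mean)                    # ≈ 9.6667
print(var_unbiased)            # ≈ 9.3333
print(np.var(first, ddof=1))   # -> 1.0
```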