I am trying to calculate the covariance matrix of 2D data, assumed to come from a Gaussian distribution. I am using the identity $\mathrm{Var}[x] = \mathrm{E}[x^2] - \mathrm{E}[x]^2$, so supposing that `D` is the data matrix where rows are observations, the MATLAB code is:

```
[mean(D(:,1).^2) - mean(D(:,1))^2 , mean(D(:,1).*D(:,2)) - mean(D(:,1))*mean(D(:,2))
mean(D(:,1).*D(:,2)) - mean(D(:,1))*mean(D(:,2)) , mean(D(:,2).^2) - mean(D(:,2))^2]
```

However, `cov(D)` gives me an entirely different covariance matrix. Of course I could use `cov()` and get on with my life, but I am using the calculation method above in a separate piece of C++ code, so it would be nice to learn where I am going wrong.

I think I am missing something crucial and fundamental here, but I could not figure it out. Any help?

## Best Answer

For ease of presentation, let us restrict to two random variables, say $X_1$ and $X_2$. The covariance matrix is given by $$\Sigma = \left( \begin{array}{cc} \textrm{Var}(X_1) & \textrm{Cov}(X_1, X_2) \\ \textrm{Cov}(X_2, X_1) & \textrm{Var}(X_2) \end{array}\right).$$ Since $\textrm{Cov}(X_1, X_2) = \textrm{Cov}(X_2, X_1)$, only three entries have to be estimated.

Say you have a sample of $n$ observations from each of $X_1$ and $X_2$ available to you: $$\{x_{11}, \ldots, x_{1n}\} \qquad \textrm{and} \qquad \{x_{21}, \ldots, x_{2n}\}.$$ The natural empirical estimates are given by $$\widehat{\textrm{Var}(X_k)} =\widehat{\textrm{Cov}(X_k, X_k)} = \frac{1}{n-1} \sum_{i = 1}^{n} (x_{ki} - \bar{x}_k)^2; \qquad k = 1, 2$$ $$\widehat{\textrm{Cov}(X_1, X_2)} = \frac{1}{n - 1} \sum_{i=1}^n (x_{1i} - \bar{x}_1) (x_{2i} - \bar{x}_2)$$ with $$\bar{x}_k = \frac{1}{n} \sum_{i=1}^n x_{ki}; \qquad k = 1, 2.$$
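Since the question mentions a C++ implementation, here is a minimal sketch of these estimators next to the moment form $\mathrm{E}[xy] - \mathrm{E}[x]\mathrm{E}[y]$ used in the question. The function names `sample_mean`, `sample_cov` and `moment_cov` are made up for illustration. The moment form computed with plain means is algebraically equal to $\frac{1}{n}\sum_i (x_{1i} - \bar{x}_1)(x_{2i} - \bar{x}_2)$, so it differs from the $\frac{1}{n-1}$ estimates above by a factor of $\frac{n}{n-1}$. MATLAB's `cov` uses the $\frac{1}{n-1}$ normalization by default, which is one likely source of the mismatch.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Arithmetic mean of a sample (hypothetical helper).
double sample_mean(const std::vector<double>& x) {
    double sum = 0.0;
    for (double v : x) sum += v;
    return sum / static_cast<double>(x.size());
}

// Centered estimate with 1/(n-1) normalization, as in the formulas above.
// Calling sample_cov(x, x) gives the variance estimate.
double sample_cov(const std::vector<double>& x, const std::vector<double>& y) {
    assert(x.size() == y.size() && x.size() >= 2);
    const double mx = sample_mean(x);
    const double my = sample_mean(y);
    double sum = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        sum += (x[i] - mx) * (y[i] - my);
    }
    return sum / static_cast<double>(x.size() - 1);
}

// Moment form mean(x.*y) - mean(x)*mean(y), i.e. 1/n normalization.
double moment_cov(const std::vector<double>& x, const std::vector<double>& y) {
    assert(x.size() == y.size() && !x.empty());
    double sum_xy = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        sum_xy += x[i] * y[i];
    }
    return sum_xy / static_cast<double>(x.size())
           - sample_mean(x) * sample_mean(y);
}
```

To reproduce the $\frac{1}{n-1}$ estimate from the moment form, multiply the result of `moment_cov` by `n / (n - 1.0)`. The centered form is usually also the safer choice numerically, because subtracting two large, nearly equal averages can lose precision when the mean is large relative to the spread.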