Wikipedia defines the covariance matrix as a generalization of the variance: $C=E[(X-E[X])(X-E[X])^T]$. But I usually see statisticians compute the covariance matrix as $C=\frac{X X^T}{n-1}$ (e.g., amoeba's nice answer here: https://stats.stackexchange.com/a/134283/83526). Is $E[(X-E[X])(X-E[X])^T] = \frac{X X^T}{n-1}$? Are both of these correct definitions of the covariance matrix?
Solved – Why can the covariance matrix be computed as $\frac{X X^T}{n-1}$
computational-statistics, covariance, covariance-matrix, descriptive-statistics
Related Solutions
Answered in comments:
A covariance matrix is just a matrix of pairwise covariances, so I'm not sure about the distinction you're making.
– dsaxton
Use $N-1$ in place of $N$ to obtain the so-called "unbiased" version.
– rvl
See (1) the help page for `cov`; (2) How exactly did statisticians agree to using $(n-1)$ as the unbiased estimator for population variance without simulation?; and (3) Intuitive explanation for dividing by $n-1$ when calculating standard deviation? for intuition. For yet more information, search *standard deviation correction*.
– whuber
You might find it instructive to start with a basic idea: the variance of any random variable cannot be negative. (This is clear, since the variance is the expectation of the square of something and squares cannot be negative.)
Any $2\times 2$ covariance matrix $\mathbb A$ explicitly presents the variances and covariances of a pair of random variables $(X,Y),$ but it also tells you how to find the variance of any linear combination of those variables. This is because whenever $a$ and $b$ are numbers,
$$\operatorname{Var}(aX+bY) = a^2\operatorname{Var}(X) + b^2\operatorname{Var}(Y) + 2ab\operatorname{Cov}(X,Y) = \pmatrix{a&b}\mathbb A\pmatrix{a\\b}.$$
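As a quick numeric sanity check of this quadratic-form identity, here is a minimal sketch (assuming NumPy; the particular covariance matrix and coefficients are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a correlated pair (X, Y) with a known population covariance matrix A.
A = np.array([[121.0, 50.0], [50.0, 81.0]])
data = rng.multivariate_normal([0.0, 0.0], A, size=200_000)
a, b = 2.0, -3.0

# Empirical variance of the linear combination aX + bY ...
empirical = np.var(a * data[:, 0] + b * data[:, 1], ddof=1)

# ... versus the quadratic form (a b) A (a b)^T with the population matrix.
w = np.array([a, b])
quadratic_form = w @ A @ w

print(empirical, quadratic_form)  # agree up to sampling error
```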
Applying this to your problem we may compute
$$\begin{aligned} 0 \le \operatorname{Var}(aX+bY) &= \pmatrix{a&b}\pmatrix{121&c\\c&81}\pmatrix{a\\b}\\ &= 121 a^2 + 81 b^2 + 2c\, ab\\ &=(11a)^2+(9b)^2+\frac{2c}{(11)(9)}(11a)(9b)\\ &= \alpha^2 + \beta^2 + \frac{2c}{(11)(9)} \alpha\beta. \end{aligned}$$
The last few steps in which $\alpha=11a$ and $\beta=9b$ were introduced weren't necessary, but they help to simplify the algebra. In particular, what we need to do next (in order to find bounds for $c$) is complete the square: this is the process emulating the derivation of the quadratic formula to which everyone is introduced in grade school. Writing
$$C = \frac{c}{(11)(9)},\tag{*}$$
we find
$$\alpha^2 + \beta^2 + \frac{2c}{(11)(9)} \alpha\beta = \alpha^2 + 2C\alpha\beta + \beta^2 = (\alpha+C\beta)^2+(1-C^2)\beta^2.$$
Because $(\alpha+C\beta)^2$ and $\beta^2$ are both squares, they are not negative. Therefore if $1-C^2$ also is non-negative, the entire right side is not negative and can be a valid variance. Conversely, if $1-C^2$ is negative, you could set $\alpha=-C\beta$ to obtain the value $(1-C^2)\beta^2\lt 0$ on the right hand side, which is invalid.
You therefore deduce (from these perfectly elementary algebraic considerations) that
If $\mathbb A$ is a valid covariance matrix, then $1-C^2$ cannot be negative.
Equivalently, $|C|\le 1,$ which by $(*)$ means $-(11)(9) \le c \le (11)(9).$
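The bound $-(11)(9) \le c \le (11)(9)$ can be checked numerically against an eigenvalue test (a sketch assuming NumPy; the tolerance guards against floating-point noise at the boundary):

```python
import numpy as np

def is_psd(c):
    # Symmetric 2x2 matrix with variances 121 and 81 and covariance c.
    A = np.array([[121.0, c], [c, 81.0]])
    return bool(np.all(np.linalg.eigvalsh(A) >= -1e-9))

# The bound derived above is |c| <= 11 * 9 = 99.
print(is_psd(99))    # True: the boundary value is still valid (singular but PSD)
print(is_psd(-99))   # True
print(is_psd(100))   # False: just outside the bound, one eigenvalue is negative
```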
There remains the question whether any such $c$ does correspond to an actual variance matrix. One way to show this is true is to find a random vector $(X,Y)$ with $\mathbb A$ as its covariance matrix. Here is one way (out of many).
I take it as given that you can construct independent random variables $A$ and $B$ having unit variances: that is, $\operatorname{Var}(A)=\operatorname{Var}(B) = 1.$ (For example, let $(A,B)$ take on the four values $(\pm 1, \pm 1)$ with equal probabilities of $1/4$ each.)
The independence implies $\operatorname{Cov}(A,B)=0.$ Given a number $c$ in the range $-(11)(9)$ to $(11)(9),$ define random variables
$$X = \sqrt{11^2-c^2/9^2}A + (c/9)B,\quad Y = 9B$$
(which is possible because $11^2 - c^2/9^2\ge 0$) and compute that the covariance matrix of $(X,Y)$ is precisely $\mathbb A.$
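The construction can be verified by simulation (a sketch assuming NumPy; standard normals stand in for the unit-variance building blocks, and $c=60$ is an arbitrary admissible choice):

```python
import numpy as np

rng = np.random.default_rng(1)
c = 60.0  # any value with |c| <= 99 works

# Independent unit-variance building blocks A and B.
A = rng.standard_normal(500_000)
B = rng.standard_normal(500_000)

# The construction from the answer: X = sqrt(11^2 - c^2/9^2) A + (c/9) B, Y = 9B.
X = np.sqrt(11**2 - c**2 / 9**2) * A + (c / 9) * B
Y = 9 * B

S = np.cov(X, Y)  # empirical covariance matrix, approximately [[121, c], [c, 81]]
print(S)
```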
Finally, if you carry out the same analysis for any symmetric matrix $$\mathbb A = \pmatrix{a & b \\ b & d},$$ you will conclude three things:
$a \ge 0.$
$d \ge 0.$
$ad - b^2 \ge 0.$
These conditions characterize symmetric, positive semi-definite matrices. Any $2\times 2$ matrix satisfying these conditions indeed is a variance matrix. (Emulate the preceding construction.)
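The equivalence of the three conditions with positive semi-definiteness can be spot-checked against an eigenvalue test on random symmetric matrices (a sketch assuming NumPy; the entries are drawn arbitrarily from $[-5, 5]$):

```python
import numpy as np

def psd_by_conditions(a, b, d):
    # The three conditions derived above for the symmetric matrix [[a, b], [b, d]].
    return a >= 0 and d >= 0 and a * d - b * b >= 0

def psd_by_eigenvalues(a, b, d):
    # Direct check: all eigenvalues non-negative (with a small float tolerance).
    return bool(np.all(np.linalg.eigvalsh(np.array([[a, b], [b, d]])) >= -1e-9))

rng = np.random.default_rng(2)
for _ in range(1000):
    a, b, d = rng.uniform(-5, 5, size=3)
    assert psd_by_conditions(a, b, d) == psd_by_eigenvalues(a, b, d)
print("the three conditions agree with the eigenvalue test")
```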
Best Answer
Let $\mu = E(X)$. Then $$Var(X) = E\left((X - \mu)(X - \mu)^T\right) = E\left(XX^T - \mu X^T - X \mu^T + \mu \mu^T\right) \\ = E(XX^T) - \mu\mu^T$$ which generalizes the well-known scalar equality $Var(Z) = E(Z^2) - E(Z)^2$.
The natural estimator of $\Sigma := Var(X)$ is $\hat \Sigma = \frac 1{n-1}XX^T - \hat \mu \hat \mu^T$ (here the symbol $X$ is overloaded to mean the $p \times n$ data matrix whose columns are the $n$ observations, so that $XX^T$ is $p \times p$).
In many situations we can take $\mu = 0$ without any loss of generality. One common example is PCA. If we center our columns then we find that $\hat \mu = 0$ so our estimate of the variance is simply $\frac 1{n-1}XX^T$. The univariate analogue of this is the familiar $s^2 = \frac 1{n-1} \sum_i x_i^2$ when $\bar x = 0$.
As @Christoph Hanck points out in the comments, you need to distinguish between estimates and parameters here. There is only one definition of $\Sigma$, namely $E((X - \mu)(X - \mu)^T)$. So $\frac 1{n-1}XX^T$ is absolutely not the correct definition of the population covariance, but if $\mu=0$ it is an unbiased estimate for it, i.e. $Var(X) = E(\frac 1{n-1}XX^T)$.
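The estimator in the centered case can be checked against NumPy's built-in sample covariance (a sketch; note that `np.cov` likewise treats rows as variables and divides by $n-1$ by default):

```python
import numpy as np

rng = np.random.default_rng(3)
p, n = 3, 1000
X = rng.standard_normal((p, n))          # p x n data matrix, rows = variables

Xc = X - X.mean(axis=1, keepdims=True)   # center: subtract the mean column
S = Xc @ Xc.T / (n - 1)                  # (1/(n-1)) X X^T after centering

# np.cov treats rows as variables and uses the same n-1 denominator.
print(np.allclose(S, np.cov(X)))  # True
```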