Is the cosine of the angle between two random variables only an approximation to the correlation coefficient, rather than equal to it?

correlation, covariance, inner-products, statistics, variance

I have seen on websites that, given two random variables $X,Y$, if $$
\cos(\theta)=\frac{X\cdot Y}{\|X\|_2\|Y\|_2}
$$

and
$$
\rho=\frac{\text{Cov}(X,Y)}{\sqrt{\text{Var}(X)\text{Var}(Y)}}
$$
then

$$
\cos(\theta)=\rho
$$


This identity seems to imply $\text{Cov}(X,Y)=X\cdot Y$. But isn't $X\cdot Y$ just the maximum likelihood estimate of the covariance, with some factors missing? If so, the equation above is not an equality but only an approximation ($\approx$) that improves as the sample size grows.

Next, the denominator implies that $$ \text{Var}(X)= \| X\|_2 ^2.$$
Again, isn't the right-hand side an estimator (the MLE) of the variance of $X$ rather than the variance itself?
So shouldn't it be $$ \rho \approx \cos(\theta)?$$


I have also seen the dot product (without the denominator of the first equation above, and more generally an inner product) used to measure correlation in some papers, such as Least Angle Regression. I am confused about the relationship between dot products and correlation, which leads me to a general question:

Is
$$
\langle X,X\rangle = \text{Var}(X)
$$

in Euclidean space?

Best Answer

Hint:

$X \cdot Y$ is a random variable.

$\text{Cov}(X,Y)$ is an expected value.
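To make the hint concrete, here is a minimal simulation sketch using only the Python standard library. It is an illustration under assumptions that are not in the question: the helper `sample_cosine`, the bivariate Gaussian model, the value `TRUE_RHO = 0.6`, and the sample sizes are all hypothetical choices. It draws several independent samples and computes the cosine between the centered sample vectors each time; the value changes from draw to draw, while the population correlation stays fixed.

```python
import math
import random


def sample_cosine(m, rho, rng):
    """Draw m pairs (x, y) with population correlation rho and return
    the cosine of the angle between the centered sample vectors."""
    xs, ys = [], []
    for _ in range(m):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        xs.append(z1)
        # Mixing z1 and z2 this way gives Corr(x, y) = rho exactly.
        ys.append(rho * z1 + math.sqrt(1.0 - rho ** 2) * z2)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    xc = [x - mean_x for x in xs]
    yc = [y - mean_y for y in ys]
    dot = sum(a * b for a, b in zip(xc, yc))
    return dot / math.sqrt(sum(a * a for a in xc) * sum(b * b for b in yc))


TRUE_RHO = 0.6  # assumed population correlation, for illustration only
rng = random.Random(0)

# Five independent samples of each size: the cosine is itself random.
small = [sample_cosine(10, TRUE_RHO, rng) for _ in range(5)]
large = [sample_cosine(10_000, TRUE_RHO, rng) for _ in range(5)]

print("m = 10:    ", [round(c, 3) for c in small])
print("m = 10000: ", [round(c, 3) for c in large])
```

With small samples the cosine scatters widely around `TRUE_RHO`; with large samples it concentrates near it. In this sense the cosine is a random quantity that estimates $\rho$, while $\rho$ itself is a fixed number defined through expected values.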

---- addendum ----

There is some confusion of terminology in your post.

If $\bf X , \bf Y$ are two random vectors (in $m$-space) then their dot product is a random variable $$ \begin{array}{l} {\bf X},{\bf Y} \in R^m \\ q = {\bf X} \cdot {\bf Y} = \left\| {\bf X} \right\|\,\left\| {\bf Y} \right\|\;\cos \alpha \quad \left| {\;q \in R} \right. \\ \end{array} $$ which is in fact the product of three random variables, one of which is $\cos \alpha$.
In this case the covariance is defined as a matrix of expected values, which is not what you are considering.

If instead $\bf X , \bf Y$ are two vectors corresponding to a joint sample (of size $m$) of two random variables $X,Y$, and we want to estimate the correlation between them (which seems to be what you mean to do), then

  • assuming the variables have zero mean, or that the sample mean has been subtracted from the samples, and
  • assuming each sample is equally likely to be drawn,

the sample covariance is in fact $$ {\mathop{\rm cov}} (X,Y) = E\left[ {XY} \right] = \frac{1}{m}\sum\limits_{k = 1}^m {X_k Y_k } = \frac{1}{m}{\bf X} \cdot {\bf Y} = \frac{1}{m}\left\| {\bf X} \right\|\,\left\| {\bf Y} \right\|\;\cos \alpha $$ and, since the factors $\frac{1}{m}$ cancel between numerator and denominator, $$ \rho _{X,Y} = \frac{{{\mathop{\rm cov}} (X,Y)}}{{\sqrt {{\mathop{\rm cov}} (X,X)\,{\mathop{\rm cov}} (Y,Y)} }} = \frac{{\frac{1}{m}\left\| {\bf X} \right\|\,\left\| {\bf Y} \right\|\;\cos \alpha }}{{\sqrt {\frac{1}{m}\left\| {\bf X} \right\|^2 \,\frac{1}{m}\left\| {\bf Y} \right\|^2 } }} = \cos \alpha $$
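The cancellation of the $\frac{1}{m}$ factors can be checked numerically. The following is a small sketch in plain Python; the helper names `pearson`, `cosine`, and `center`, as well as the five data points, are made up for illustration and do not come from the question. It compares the sample correlation (with the $\frac{1}{m}$ normalization) against the cosine of the angle between the sample vectors, once with the mean subtracted and once without.

```python
import math


def pearson(xs, ys):
    """Sample correlation using the 1/m normalization for cov and var."""
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / m
    var_x = sum((x - mean_x) ** 2 for x in xs) / m
    var_y = sum((y - mean_y) ** 2 for y in ys) / m
    return cov_xy / math.sqrt(var_x * var_y)


def cosine(us, vs):
    """Cosine of the angle between two vectors: u.v / (|u| |v|)."""
    dot = sum(u * v for u, v in zip(us, vs))
    norm_u = math.sqrt(sum(u * u for u in us))
    norm_v = math.sqrt(sum(v * v for v in vs))
    return dot / (norm_u * norm_v)


def center(xs):
    """Subtract the sample mean from every entry."""
    mean = sum(xs) / len(xs)
    return [x - mean for x in xs]


# Hypothetical joint sample of size m = 5 (both sample means equal 3).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 1.0, 4.0, 3.0, 5.0]

rho = pearson(xs, ys)
cos_centered = cosine(center(xs), center(ys))
cos_raw = cosine(xs, ys)

print(rho, cos_centered)  # these two agree up to floating-point rounding
print(cos_raw)            # differs, because the means were not subtracted
```

The agreement between `rho` and `cos_centered` is exact up to rounding, not asymptotic: it is an identity between two sample quantities. The uncentered cosine `cos_raw` shows why the zero-mean assumption in the list above matters.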