Solved – Intuition behind the definition of covariance

correlation, covariance

I was trying to better understand the covariance of two random variables, and how the first person who thought of it arrived at the definition that is routinely used in statistics.
I went to Wikipedia to understand it better.
From the article, it seems that a good candidate quantity for $Cov(X,Y)$ should have the following properties:

  1. It should have a positive sign when the two random variables are similar (i.e. when one increases the other one tends to increase, and when one decreases the other one tends to decrease).
  2. It should have a negative sign when the two random variables are oppositely similar (i.e. when one increases, the other random variable tends to decrease).
  3. Lastly, we want this covariance quantity to be zero (or extremely small, probably?) when the two variables are independent of each other (i.e. they don't co-vary with respect to each other).

From the above properties, we want to define $Cov(X,Y)$. My first question is: it is not entirely obvious to me why $Cov(X,Y) = E[(X-E[X])(Y-E[Y])]$ satisfies those properties. From the properties we have, I would have expected more of a "derivative"-like equation to be the ideal candidate. For example, something like "if the change in X is positive, then the change in Y should also be positive". Also, why is taking the difference from the mean the "correct" thing to do?

A more tangential, but still interesting, question: is there a different definition that could have satisfied those properties and still been meaningful and useful? I ask because it seems no one questions why we use this definition in the first place (it feels like "it's always been this way", which in my opinion is a terrible reason that hinders scientific and mathematical curiosity and thinking). Is the accepted definition the "best" definition we could have?


These are my thoughts on why the accepted definition makes sense (it's only going to be an intuitive argument):

Let $\Delta_X$ be some difference for variable $X$ (i.e. it changed from some value to some other value at some point in time). Define $\Delta_Y$ similarly.

For one instance in time, we can calculate if they are related or not by doing:

$$sign(\Delta_X \cdot \Delta_Y)$$

This is somewhat nice! For one instance in time, it satisfies the properties we want. If they both increase together, then most of the time the above quantity should be positive (and similarly, when they are oppositely similar it will be negative, because the $\Delta$'s will have opposite signs).

But that only gives us the quantity we want for one instance in time, and since these are random variables, we might overfit if we base the relationship between two variables on only one observation. So why not take the expectation, to see the "average" product of differences:

$$sign(E[\Delta_X \cdot \Delta_Y])$$

which should capture, on average, what the relationship is as defined above!
The only problem this explanation has is: what do we measure this difference from? That seems to be addressed by measuring the difference from the mean (which, for some reason, is the correct thing to do).

I guess the main issue I have with the definition is taking the difference from the mean. I can't seem to justify that to myself yet.
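One numerical experiment that speaks to this (a quick sketch with made-up data, not a proof): once one of the variables is centred at its mean, its deviations average to zero, so the constant we subtract from the *other* variable no longer matters. Subtracting the mean from both is what makes the reference point a non-issue:

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=10_000)
y = 0.5 * x + rng.normal(size=10_000)  # made-up, positively related data

yc = y - y.mean()  # centred Y: its sample mean is (numerically) zero

# Average product of deviations, subtracting different constants c from X.
# Because yc averages to zero, the choice of c drops out entirely:
# mean((x - c) * yc) = mean(x * yc) - c * mean(yc) = mean(x * yc).
vals = [np.mean((x - c) * yc) for c in (x.mean(), 0.0, 100.0)]
```

All three averages agree up to floating-point noise, which is the algebraic identity $E[(X-c)(Y-E[Y])] = Cov(X,Y)$ for any constant $c$ seen in data.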


The interpretation for the sign can be left for a different question, since it seems to be a more complicated topic.

Best Answer

Imagine we begin with an empty stack of numbers. Then we start drawing pairs $(X,Y)$ from their joint distribution. One of four things can happen:

  1. If both X and Y are bigger than their respective averages we say the pair are similar and so we put a positive number onto the stack.
  2. If both X and Y are smaller than their respective averages we say the pair are similar and put a positive number onto the stack.
  3. If X is bigger than its average and Y is smaller than its average we say the pair are dissimilar and put a negative number onto the stack.
  4. If X is smaller than its average and Y is bigger than its average we say the pair are dissimilar and put a negative number onto the stack.

Then, to get an overall measure of the (dis-)similarity of X and Y we add up all the values of the numbers on the stack. A positive sum suggests the variables move in the same direction at the same time. A negative sum suggests the variables move in opposite directions more often than not. A zero sum suggests knowing the direction of one variable doesn't tell you much about the direction of the other.
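The drawing-and-stacking process above can be simulated directly. A minimal sketch (the three scenarios and their coefficients are made up for illustration): Y moving with X, against X, or independently of X.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x = rng.normal(size=n)

# Three hypothetical scenarios for how Y relates to X.
scenarios = {
    "similar": 0.8 * x + 0.6 * rng.normal(size=n),
    "dissimilar": -0.8 * x + 0.6 * rng.normal(size=n),
    "unrelated": rng.normal(size=n),
}

sums = {}
for name, y in scenarios.items():
    mx, my = x.mean(), y.mean()
    stack = []  # the "stack of numbers" from the description above
    for xi, yi in zip(x, y):
        # Cases 1 and 2 (same side of the averages) give a positive entry;
        # cases 3 and 4 (opposite sides) give a negative entry.
        stack.append((xi - mx) * (yi - my))
    sums[name] = sum(stack)
```

The "similar" sum comes out strongly positive, the "dissimilar" sum strongly negative, and the "unrelated" sum close to zero; dividing each sum by the number of draws gives the sample covariance.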

It's important to think about 'bigger than average' rather than just 'big' (or 'positive') because any two non-negative variables would then be judged to be similar (e.g. the size of the next car crash on the M42 and the number of tickets bought at Paddington train station tomorrow).

The covariance formula is a formalisation of this process:

$$\text{Cov}(X,Y)=\mathbb E\big[(X-\mathbb E[X])(Y-\mathbb E[Y])\big]$$

It uses the true probability distribution rather than a Monte Carlo simulation, and it specifies the size of the number we put onto the stack, not just its sign.
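As a sanity check that the formula and the sampling picture agree (a sketch only; the data and the 0.3 coefficient are made up), the hand-computed average of products of deviations matches what NumPy's `np.cov` reports:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000
x = rng.normal(size=n)
y = 0.3 * x + rng.normal(size=n)  # constructed so the true Cov(X, Y) is 0.3

# The formula applied to a sample: average the products of deviations.
cov_hand = np.mean((x - x.mean()) * (y - y.mean()))

# NumPy's estimate; bias=True divides by n, matching the plain mean above.
cov_np = np.cov(x, y, bias=True)[0, 1]
```

With 50,000 draws, both numbers land near the true value of 0.3, and they agree with each other to floating-point precision.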