Sometimes we can "augment knowledge" with an unusual or different approach. I would like this reply to be accessible to kindergartners and also have some fun, so everybody get out your crayons!
Given paired $(x,y)$ data, draw their scatterplot. (The younger students may need a teacher to produce this for them. :-) Each pair of points $(x_i,y_i)$, $(x_j,y_j)$ in that plot determines a rectangle: the smallest rectangle with sides parallel to the axes that contains both points. The points therefore sit either at the upper right and lower left corners (a "positive" relationship) or at the upper left and lower right corners (a "negative" relationship).
Draw all possible such rectangles. Color them transparently, making the positive rectangles red (say) and the negative rectangles "anti-red" (blue). In this fashion, wherever rectangles overlap, their colors are either enhanced when they are the same (blue and blue or red and red) or cancel out when they are different.
(In this illustration of a positive (red) and negative (blue) rectangle, the overlap ought to be white; unfortunately, this software does not have a true "anti-red" color. The overlap is gray, so it will darken the plot, but on the whole the net amount of red is correct.)
Now we're ready for the explanation of covariance.
The covariance is the net amount of red in the plot (treating blue as negative values).
Here are some examples with 32 binormal points drawn from distributions with the given covariances, ordered from most negative (bluest) to most positive (reddest).
They are drawn on common axes to make them comparable. The rectangles are lightly outlined to help you see them. This is an updated (2019) version of the original: it uses software that properly cancels the red and cyan colors in overlapping rectangles.
Let's deduce some properties of covariance. These properties will be clear to anyone who has actually drawn a few of the rectangles. :-)
Bilinearity. Because the amount of red depends on the size of the plot, covariance is directly proportional to the scale on the x-axis and to the scale on the y-axis.
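A small numeric sketch of this property (the data are made-up illustrative numbers; any data would do): rescaling $x$ by $a$ and $y$ by $b$ stretches every rectangle's width by $a$ and height by $b$, so the covariance picks up a factor of $ab$.

```python
# Bilinearity sketch with made-up data: cov(a*X, b*Y) = a*b*cov(X, Y).
import math

def cov(xs, ys):
    # Sample covariance with the usual 1/(n-1) normalisation.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / (n - 1)

x = [1.0, 2.0, 4.0, 7.0]
y = [2.0, 3.0, 3.0, 6.0]

# Stretching the x-axis by a and the y-axis by b multiplies every
# rectangle's signed area, and hence the covariance, by a * b.
a, b = 3.0, 0.5
scaled = cov([a * v for v in x], [b * v for v in y])
print(math.isclose(scaled, a * b * cov(x, y)))  # True
```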
Correlation. Covariance increases as the points approximate an upward sloping line and decreases as the points approximate a downward sloping line. This is because in the former case most of the rectangles are positive and in the latter case, most are negative.
Relationship to linear associations. Because non-linear associations can create mixtures of positive and negative rectangles, they lead to unpredictable (and not very useful) covariances. Linear associations can be fully interpreted by means of the preceding two characterizations.
Sensitivity to outliers. A geometric outlier (one point standing away from the mass) will create many large rectangles in association with all the other points. It alone can create a net positive or negative amount of red in the overall picture.
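A quick sketch of this sensitivity (the four base points are chosen so their covariance is exactly zero, and the outlier coordinates are made up): a single far-away point creates large same-sign rectangles with every other point and swamps the total.

```python
# Outlier sensitivity sketch: one point can dominate the net red/blue.
def cov(xs, ys):
    # Sample covariance with the usual 1/(n-1) normalisation.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / (n - 1)

x = [1.0, 2.0, 3.0, 4.0]
y = [3.0, 1.0, 4.0, 2.0]
print(cov(x, y))  # 0.0: the red and blue rectangles cancel exactly

# One geometric outlier makes large positive rectangles with all the
# other points, so the covariance becomes large and positive.
with_outlier = cov(x + [100.0], y + [100.0])
print(with_outlier)
```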
Incidentally, this definition of covariance differs from the usual one only by a universal constant of proportionality (independent of the data set size). The mathematically inclined will have no trouble performing the algebraic demonstration that the formula given here is always twice the usual covariance.
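One way to check this claim numerically: take the *average* signed rectangle area over all point pairs, where the pair $(i,j)$ contributes $(x_i-x_j)(y_i-y_j)$ (positive for "red" rectangles, negative for "blue" ones), and compare it to the usual sample covariance. The data below are arbitrary made-up numbers.

```python
# Sketch verifying that the mean signed rectangle area over all pairs
# equals twice the usual (1/(n-1)-normalised) sample covariance.
import math
from itertools import combinations

def sample_cov(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / (n - 1)

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 1.0, 4.0, 3.0]

# Each unordered pair contributes the signed area (x_i - x_j)(y_i - y_j).
areas = [(xi - xj) * (yi - yj)
         for (xi, yi), (xj, yj) in combinations(zip(x, y), 2)]
mean_area = sum(areas) / len(areas)

print(math.isclose(mean_area, 2 * sample_cov(x, y)))  # True
```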
Imagine we begin with an empty stack of numbers. Then we start drawing pairs $(X,Y)$ from their joint distribution. One of four things can happen:
- If both X and Y are bigger than their respective averages, we say the pair is similar and put a positive number onto the stack.
- If both X and Y are smaller than their respective averages, we say the pair is similar and put a positive number onto the stack.
- If X is bigger than its average and Y is smaller than its average, we say the pair is dissimilar and put a negative number onto the stack.
- If X is smaller than its average and Y is bigger than its average, we say the pair is dissimilar and put a negative number onto the stack.
Then, to get an overall measure of the (dis-)similarity of X and Y we add up all the values of the numbers on the stack. A positive sum suggests the variables move in the same direction at the same time. A negative sum suggests the variables move in opposite directions more often than not. A zero sum suggests knowing the direction of one variable doesn't tell you much about the direction of the other.
It's important to think about 'bigger than average' rather than just 'big' (or 'positive') because any two non-negative variables would then be judged to be similar (e.g. the size of the next car crash on the M42 and the number of tickets bought at Paddington train station tomorrow).
The covariance formula is a formalisation of this process:
$\text{Cov}(X,Y)=\mathbb{E}\big[(X-\mathbb{E}[X])(Y-\mathbb{E}[Y])\big]$
It uses the probability distribution itself rather than a Monte Carlo simulation, and it specifies the size of the number we put on the stack: the product of the two deviations from the averages.
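The stack procedure can also be run as a Monte Carlo sketch. The joint distribution below is an assumed toy example: $X \sim \text{Uniform}(0,1)$ and $Y = X + \text{Uniform}(0,1)$, so $\mathbb{E}[X]=0.5$, $\mathbb{E}[Y]=1$, and the true covariance is $\text{Var}(X) = 1/12 \approx 0.083$.

```python
# Monte Carlo sketch of the stack procedure for an assumed toy
# distribution: X ~ Uniform(0, 1), Y = X + Uniform(0, 1).
import random

random.seed(0)

EX, EY = 0.5, 1.0  # the known true averages for this toy distribution
stack = []
for _ in range(100_000):
    x = random.random()
    y = x + random.random()
    # Similar pairs (same side of both averages) push a positive number;
    # dissimilar pairs push a negative one. The size is the product of
    # the deviations, exactly as in the covariance formula.
    stack.append((x - EX) * (y - EY))

estimate = sum(stack) / len(stack)
print(estimate)  # close to the true covariance, 1/12
```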
The problem with covariances is that they are hard to compare. When you calculate the covariance of a set of heights and weights expressed in meters and kilograms, you get a different value than when you do it in other units (which already causes a problem for people doing the same thing with or without the metric system!). It is also hard to tell whether, say, height and weight 'covary more' than the lengths of your toes and fingers, simply because the scale on which the covariance is calculated differs.
The solution is to 'normalize' the covariance: you divide it by something that captures the spread and scale of both variables (the product of their standard deviations), and you end up with a value that is guaranteed to lie between -1 and 1: the correlation. Whatever units your original variables were in, you will always get the same result, and this also lets you, to a certain degree, compare whether one pair of variables 'correlates' more strongly than another, simply by comparing their correlations.
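A small sketch of this invariance, using made-up height/weight numbers: converting heights from meters to centimeters multiplies the covariance by 100 but leaves the correlation unchanged.

```python
# Correlation is unit-free; covariance is not. Data are made up.
import math

def cov(xs, ys):
    # Sample covariance with the usual 1/(n-1) normalisation.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / (n - 1)

def corr(xs, ys):
    # Normalise the covariance by both standard deviations.
    return cov(xs, ys) / math.sqrt(cov(xs, xs) * cov(ys, ys))

height_m = [1.60, 1.70, 1.75, 1.85]
weight_kg = [55.0, 68.0, 72.0, 80.0]
height_cm = [100 * h for h in height_m]

print(cov(height_m, weight_kg), cov(height_cm, weight_kg))    # differ 100x
print(corr(height_m, weight_kg), corr(height_cm, weight_kg))  # identical
```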
Note: the above assumes that the reader already understands the concept of covariance.