Solved – the difference between covariance function and covariance operator in plain English

covariance

I read the Wikipedia articles on the covariance function and the covariance operator, and I'm trying to understand the two concepts.

Can anybody please explain the difference between the two terms ("covariance function" and "covariance operator") in plain English? Also, in what cases would we use one term rather than the other?

Thank you!

Best Answer

This answer on math.se gives a good explanation of how the covariance operator compares to the usual definition of $Cov(X,Y) := E((X-EX)(Y-EY))$.

The typical setting of probability theory is $L^1(\Omega, \mathscr F, P)$, the set of all $\mathscr F$-measurable functions (i.e. random variables) $X : \Omega \to \mathbb R$ such that $E|X| < \infty$. This is a Banach space but not a Hilbert space. The covariance operator as discussed in that answer is defined on a Hilbert space and only deals with particular kinds of random variables, so it is a more restrictive concept. Ultimately, although I'm definitely not an expert in functional analysis, I don't think that definition introduces anything fundamentally new; it may just be a different way to look at the same concept of covariance.

Now for a covariance function: this arises in the context of a specific stochastic process. Suppose you've got some collection of random variables $S := \{X_t : t \in T\}$, where $T$ is some arbitrary index set, and you want a way to specify the covariance between $X_i$ and $X_j$ in terms of their indices. Thus you could say that the covariance operator is a way of defining the concept of covariance, while a covariance function is an application of this concept.

As an example, let's say you've got a plot of land that is 1000 meters by 1000 meters, and for each location you're interested in the soil temperature. This means we've got a spatial process where $T = [0,1000]^2$, say, and $t \in T$ gives us the random variable $X_t$ for that location's soil temperature. The covariance function is what allows us to say that the temperatures of nearby points in our plot of land should be similar.

For instance, maybe you think that the covariance between two points $r$ and $s$ should only depend on the distance between them, and that it should decay smoothly as that distance increases. This could lead you to using the Gaussian covariance function $Cov(X_r, X_s) = \exp\left(-\gamma ||r - s||^2\right)$.
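To make the decay-with-distance behavior concrete, here is a minimal sketch of that Gaussian covariance function in Python with NumPy. The function name `gaussian_cov` and the choice of $\gamma$ are just for illustration:

```python
import numpy as np

def gaussian_cov(r, s, gamma=1.0):
    """Gaussian covariance between locations r and s:
    exp(-gamma * ||r - s||^2)."""
    r = np.asarray(r, dtype=float)
    s = np.asarray(s, dtype=float)
    return np.exp(-gamma * np.sum((r - s) ** 2))

# A point always has covariance 1 with itself,
# and covariance decays smoothly as distance grows:
same = gaussian_cov([0.0, 0.0], [0.0, 0.0], gamma=0.1)   # 1.0
near = gaussian_cov([0.0, 0.0], [1.0, 0.0], gamma=0.1)   # close to 1
far  = gaussian_cov([0.0, 0.0], [10.0, 0.0], gamma=0.1)  # close to 0
```

Larger $\gamma$ makes the covariance fall off faster, i.e. temperatures become "unrelated" over shorter distances.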

This stochastic process definition at first doesn't look much like how covariance functions are used in machine learning (ML). In ML we'd expect to get some data $\mathcal D := \{(x_1, y_1), \dots, (x_n, y_n)\}$ and we'd try to estimate a function $f : x_i \mapsto y_i$. Returning to the soil temperature example, this would mean $x_i$ is the location in our plot of land, and $y_i$ is the corresponding temperature reading, so our kernel matrix would contain elements $K(x_i, x_j) = \exp\left(-\gamma ||x_i - x_j||^2\right)$.
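As a sketch of that kernel-matrix construction (assuming NumPy; the sensor locations below are made up for the soil example):

```python
import numpy as np

def kernel_matrix(X, gamma=1.0):
    """Gram matrix with K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    X = np.asarray(X, dtype=float)
    # Pairwise squared distances via broadcasting: (n, 1, d) - (1, n, d)
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-gamma * sq_dists)

# Hypothetical measurement locations on the 1000 m x 1000 m plot
locations = np.array([[0.0, 0.0],
                      [10.0, 0.0],
                      [500.0, 500.0]])
K = kernel_matrix(locations, gamma=1e-3)
# K is symmetric with ones on the diagonal; nearby locations
# (rows 0 and 1) get a large entry, distant ones a tiny entry.
```

Note that the same formula serves both readings: as a covariance function it gives $Cov(X_r, X_s)$ for the process indexed by location, and as an ML kernel it gives the Gram matrix over the predictors.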

But now we can see that the index set is just the domain of our data, and the variables $X_t$ are actually the responses, so the covariance function defined on our index set is exactly the same as the one defined on our predictors.