If the data is 1d, the variance shows the extent to which the data points are different from each other. If the data is multi-dimensional, we'll get a covariance matrix.
Is there a measure that gives a single number of how the data points are different from each other in general for multi-dimensional data?
I feel that there might be many solutions already, but I'm not sure of the correct term to use to search for them.
Maybe I can do something like adding up the eigenvalues of the covariance matrix, does that sound sensible?
Best Answer
(The answer below merely introduces and states the theorem proven in [0]. The beauty of that paper is that most of the arguments are made in terms of basic linear algebra. To answer this question it is enough to state the main results, but by all means go check the original source.)
In any situation where the multivariate pattern of the data can be described by a $k$-variate elliptical distribution, statistical inference will, by definition, reduce to the problem of fitting (and characterizing) a $k$-variate location vector (say $\boldsymbol\theta$) and a $k\times k$ symmetric positive semi-definite (SPSD) matrix (say $\boldsymbol\varSigma$) to the data. For reasons explained below (which are assumed as premises), it will often be more meaningful to decompose $\boldsymbol\varSigma$ into a shape component (an SPSD matrix of the same size as $\boldsymbol\varSigma$), accounting for the shape of the density contours of your multivariate distribution, and a scalar $\sigma_S$ expressing the scale of these contours.
In univariate data ($k=1$), $\boldsymbol\varSigma$, the covariance matrix of your data, is a scalar and, as will follow from the discussion below, the shape component of $\boldsymbol\varSigma$ is 1, so that $\boldsymbol\varSigma$ always equals its scale component, $\boldsymbol\varSigma=\sigma_S$, and no ambiguity is possible.
In multivariate data, there are many possible choices of scaling function $\sigma_S$. One in particular ($\sigma_S=|\boldsymbol\varSigma|^{1/k}$) stands out in having a key desirable property, making it the preferred choice of scaling function in the context of elliptical families.
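As a rough illustration of the determinant-based scale, here is a minimal sketch assuming NumPy and synthetic Gaussian data. The helper name `det_scale` is hypothetical, not a library function. It computes $|\boldsymbol\varSigma|^{1/k}$ through the log-determinant, which is numerically safer than taking the determinant directly, and compares it with the geometric mean of the eigenvalues, to which it is equal for a positive definite matrix.

```python
import numpy as np

def det_scale(sigma):
    """Return |sigma|^(1/k) for a k x k symmetric PSD matrix.

    Hypothetical helper for illustration only.
    """
    sigma = np.atleast_2d(np.asarray(sigma, dtype=float))
    k = sigma.shape[0]
    sign, logdet = np.linalg.slogdet(sigma)
    if sign < 0:
        raise ValueError("matrix is not positive semi-definite")
    if sign == 0:
        # Singular matrix: the determinant, and hence the scale, is 0.
        return 0.0
    return float(np.exp(logdet / k))

# Synthetic data (an assumption of this sketch): 500 points in 3 dimensions.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) @ np.diag([1.0, 2.0, 0.5])

sigma_hat = np.cov(X, rowvar=False)   # classical covariance estimate
scale = det_scale(sigma_hat)

# |Sigma|^(1/k) is the geometric mean of the eigenvalues of Sigma.
eigvals = np.linalg.eigvalsh(sigma_hat)
geo_mean = float(np.exp(np.mean(np.log(eigvals))))

print(scale, geo_mean)
```

Summing the eigenvalues, as suggested in the question, gives the trace instead. Dividing that by $k$ gives the arithmetic rather than the geometric mean of the eigenvalues.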
Many problems in MV-statistics involve estimation of a scatter matrix, defined as a function(al) SPSD matrix in $\mathbb{R}^{k\times k}$ ($\boldsymbol\varSigma$) satisfying:
$$(0)\quad\boldsymbol\varSigma(\boldsymbol A\boldsymbol X+\boldsymbol b)=\boldsymbol A\boldsymbol\varSigma(\boldsymbol X)\boldsymbol A^\top$$ (for non-singular matrices $\boldsymbol A$ and vectors $\boldsymbol b$). For example, the classical estimate of covariance satisfies (0), but it is by no means the only one.
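Property (0) can be checked numerically for the classical covariance estimate. The sketch below assumes NumPy, data stored with one observation per row, and randomly drawn $\boldsymbol A$ and $\boldsymbol b$ (a random Gaussian matrix is non-singular with probability one). All variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 200, 3

X = rng.normal(size=(n, k))   # rows are observations
A = rng.normal(size=(k, k))   # almost surely non-singular
b = rng.normal(size=k)

# Apply x -> A x + b to every observation (row) of X.
Y = X @ A.T + b

cov_X = np.cov(X, rowvar=False)
cov_Y = np.cov(Y, rowvar=False)

# Affine equivariance: cov(AX + b) should equal A cov(X) A^T.
equivariant = np.allclose(cov_Y, A @ cov_X @ A.T)
print(equivariant)
```

The shift $\boldsymbol b$ drops out because the estimate centers the data, and the linear map $\boldsymbol A$ passes through on both sides of the matrix.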
In the presence of elliptically distributed data, where all the density contours are ellipses defined by the same shape matrix, up to multiplication by a scalar, it is natural to consider normalized versions of $\boldsymbol\varSigma$ of the form:
$$\boldsymbol V_S = \boldsymbol\varSigma / S(\boldsymbol\varSigma)$$
where $S$ is a 1-homogeneous function satisfying:
$$(1)\quad S(\lambda \boldsymbol\varSigma)=\lambda S(\boldsymbol\varSigma) $$
for all $\lambda>0$. Then, $\boldsymbol V_S$ is called the shape component of the scatter matrix (in short, the shape matrix) and $\sigma_S=S^{1/2}(\boldsymbol\varSigma)$ is called the scale component of the scatter matrix. Examples of multivariate estimation problems where the loss function depends on $\boldsymbol\varSigma$ only through its shape component $\boldsymbol V_S$ include tests of sphericity, PCA and CCA, among others.
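The decomposition into shape and scale can be sketched as follows, assuming NumPy and taking $S(\boldsymbol\varSigma)=|\boldsymbol\varSigma|^{1/k}$ as the scaling function. The helper name `shape_and_scale` and the example matrix are hypothetical. The sketch illustrates two consequences of (1): the shape matrix is unchanged when $\boldsymbol\varSigma$ is multiplied by $\lambda>0$, and with this particular $S$ the shape matrix has determinant 1.

```python
import numpy as np

def shape_and_scale(sigma):
    """Split sigma into (V, S) with S = |sigma|^(1/k) and V = sigma / S.

    Hypothetical helper; in the text's notation, sigma_S would be sqrt(S).
    """
    sigma = np.asarray(sigma, dtype=float)
    k = sigma.shape[0]
    sign, logdet = np.linalg.slogdet(sigma)
    if sign <= 0:
        raise ValueError("sigma must be positive definite")
    S = float(np.exp(logdet / k))
    return sigma / S, S

# Illustrative 2 x 2 scatter matrix with determinant 7.
sigma = np.array([[4.0, 1.0],
                  [1.0, 2.0]])

V, S = shape_and_scale(sigma)
V_scaled, S_scaled = shape_and_scale(3.0 * sigma)

print(np.linalg.det(V))          # determinant of the shape matrix
print(S_scaled / S)              # ratio of scales after multiplying by 3
print(np.allclose(V, V_scaled))  # shape is unaffected by rescaling
```

Other 1-homogeneous choices of $S$ would give a different normalization of $\boldsymbol V_S$, but the same invariance of the shape matrix under rescaling.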
Of course, there are many possible scaling functions, so this still leaves open the question of which (if any) of the several choices of normalization function $S$ is in some sense optimal. For example:
Among these, $S=|\boldsymbol\varSigma|^{1/k}$ is the only scaling function for which the Fisher information matrix for the corresponding estimates of scale and shape, in locally asymptotically normal families, is block diagonal (that is, the scale and shape components of the estimation problem are asymptotically orthogonal) [0]. This means, among other things, that the scale functional $S=|\boldsymbol\varSigma|^{1/k}$ is the only choice of $S$ for which leaving $\sigma_S$ unspecified does not cause any loss of efficiency when performing inference on $\boldsymbol V_S$.
I do not know of any comparably strong optimality characterization for any of the many other possible choices of $S$ that satisfy (1).