Solved – Understanding degenerate multivariate normal distribution

correlation, distributions, multivariate normal distribution, sampling, singular

A multivariate normal (MVN) distribution is degenerate when its covariance matrix $\Sigma$ is singular.

I am trying to understand the mainly conceptual (but also theoretical) implications of this. The Wikipedia article is quite terse. It mentions the following non-trivial (at least to me) things:

MVN does not have a density. More precisely, it does not have a density with respect to $k$-dimensional Lebesgue measure.

What does having no density mean in simple terms, if that is possible? The qualification "with respect to $k$-dimensional Lebesgue measure" suggests there is something the distribution does have, perhaps a density with respect to some other measure? Is it possible to take samples from this distribution?

Geometrically this means that every contour ellipsoid is infinitely thin and has zero volume in $k$-dimensional space.

Does this relate to the degenerate case in the sense that the variance is $0$ along the dependent subspace? If so, removing the dependent subspace from $\Sigma$, and thereby reducing the dimension of $\mathbf{x}$, would yield a proper density from which samples can be taken; the degenerate components could then be reconstructed from the reduced $\mathbf{x}$. Is this correct?
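For concreteness, here is a minimal numpy sketch of this reduce-and-reconstruct idea; the rank-1 covariance of $(X, 2X+1)$ is my own toy example:

```python
import numpy as np

# Toy rank-1 covariance: the distribution of (X, 2X + 1) with X ~ N(0, 1).
mu = np.array([0.0, 1.0])
Sigma = np.array([[1.0, 2.0],
                  [2.0, 4.0]])        # singular: det(Sigma) = 0

# Eigendecompose Sigma and keep only the directions with non-zero variance.
eigvals, eigvecs = np.linalg.eigh(Sigma)
keep = eigvals > 1e-12                # numerical tolerance for "zero"
V = eigvecs[:, keep]                  # basis of the support (one column here)

# Sample in the reduced space, then map back to the full space.
rng = np.random.default_rng(0)
z = rng.standard_normal((1000, keep.sum())) * np.sqrt(eigvals[keep])
samples = mu + z @ V.T                # every sample lies on y = 2x + 1

print(np.allclose(samples[:, 1], 2 * samples[:, 0] + 1))   # True
```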

It is suggested to use the following density instead: $$f(\mathbf{x}) = \left(\det\nolimits^{*}(2\pi\boldsymbol{\Sigma})\right)^{-\frac{1}{2}} \, e^{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})'\boldsymbol{\Sigma}^{+}(\mathbf{x}-\boldsymbol{\mu})},$$ where $\det^{*}$ is the pseudo-determinant and $\boldsymbol{\Sigma}^{+}$ is the Moore-Penrose pseudoinverse.

Suppose I use the Moore-Penrose pseudoinverse for $\boldsymbol{\Sigma}^{+}$ and disregard the zero eigenvalues in the determinant calculation, so that $\det^{*}$ is the product of the non-zero eigenvalues only. Now I have a density. How are samples from this density related to the degenerate case?
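Here is a minimal numpy sketch of what I mean; the toy $\Sigma$, the tolerance, and the function name are my own choices:

```python
import numpy as np

def degenerate_mvn_pdf(x, mu, Sigma, tol=1e-12):
    """Density on the support, using the pseudo-determinant det* (the
    product of the non-zero eigenvalues) and the Moore-Penrose
    pseudoinverse Sigma^+ in place of det and the ordinary inverse.
    Only meaningful for x on the support mu + range(Sigma)."""
    eigvals = np.linalg.eigvalsh(Sigma)
    nonzero = eigvals[eigvals > tol]
    rank = nonzero.size
    pseudo_det = np.prod(nonzero)
    quad = (x - mu) @ np.linalg.pinv(Sigma) @ (x - mu)
    return np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** rank * pseudo_det)

mu = np.array([0.0, 1.0])
Sigma = np.array([[1.0, 2.0], [2.0, 4.0]])   # rank 1, det* = 5
print(degenerate_mvn_pdf(mu, mu, Sigma))     # 1 / sqrt(2*pi*5) at the mean
```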

Wikipedia doesn't mention it, but what about the non-singular case with negative eigenvalues? The determinant might or might not be negative then.

Positive-definiteness is a stricter condition than non-singularity. How do these relate here?
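For concreteness, here is a rough eigenvalue-based classification of the cases I am asking about (the function, the examples, and the ad-hoc tolerance are my own, purely for illustration):

```python
import numpy as np

def classify(S, tol=1e-10):
    """Crude eigenvalue-based classification of a symmetric matrix."""
    ev = np.linalg.eigvalsh(S)
    if np.all(ev > tol):
        return "positive definite: nonsingular, valid covariance"
    if np.all(ev > -tol):
        return "positive semidefinite but singular: degenerate covariance"
    if np.all(np.abs(ev) > tol):
        return "nonsingular but indefinite: has a negative eigenvalue"
    return "singular and indefinite"

print(classify(np.array([[2.0, 0.0], [0.0, 1.0]])))    # positive definite
print(classify(np.array([[1.0, 2.0], [2.0, 4.0]])))    # PSD, singular
print(classify(np.array([[1.0, 0.0], [0.0, -1.0]])))   # nonsingular, indefinite
```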

Best Answer

A joint density function of two random variables $X$ and $Y$, say $f_{X,Y}(x,y)$, is an ordinary function of two real variables, and the meaning that we ascribe to it is that if $\mathcal B$ is a region of very small area $b$ with the property that $(x_0, y_0) \in \mathcal B$, then $$P\{(X,Y)\in \mathcal B\} \approx f_{X,Y}(x_0,y_0)\cdot b \tag 1$$ and this approximation gets better and better as $\mathcal B$ shrinks in area and $b \to 0$. Of course, both sides of $(1)$ approach $0$ as $b \to 0$, but the ratio $\frac{P\{(X,Y)\in \mathcal B\}}{b}$ converges to $f_{X,Y}(x_0,y_0)$. If we think of probability as probability mass spread over the $x$-$y$ plane, then $f_{X,Y}(x_0,y_0)$ is the density of the probability mass at the point $(x_0,y_0)$. Note that $f_{X,Y}(x,y)$ is not a probability but a probability density, measured in probability mass per unit area. In particular, it is possible for $f_{X,Y}(x_0,y_0)$ to exceed $1$ (the probability mass is very dense at $(x_0,y_0)$), and we need to multiply it by an area (as in $(1)$) to get a probability out of it.
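Here is a quick Monte Carlo illustration of $(1)$ for two independent standard normals; the box size, box center, and sample count are arbitrary choices. The hit frequency divided by the box area should approach the density at the box center:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10**7
x = rng.standard_normal(n)           # X ~ N(0, 1)
y = rng.standard_normal(n)           # Y ~ N(0, 1), independent of X

# Small box B around (x0, y0) with area b = h * h.
x0, y0, h = 0.5, -0.3, 0.05
inside = (np.abs(x - x0) < h / 2) & (np.abs(y - y0) < h / 2)

mc_ratio = inside.mean() / (h * h)   # estimate of P{(X,Y) in B} / b
true_pdf = np.exp(-(x0**2 + y0**2) / 2) / (2 * np.pi)
print(mc_ratio, true_pdf)            # both are approximately 0.134
```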

With that as prologue, consider the case when $Y = \alpha X + \beta$. Now the random point $(X,Y)$ is constrained to lie on the straight line $y = \alpha x + \beta$ in the $x$-$y$ plane. Consequently, $X$ and $Y$ do not enjoy a joint density, because all the probability mass lies on a straight line, which has zero area. (Remember that old shibboleth about a line having zero width that you learned in middle school?) So we cannot write something like $(1)$. The probability mass is all there; it lies along the straight line $y = \alpha x + \beta$, but its joint density (in terms of mass per unit area) is infinite along that straight line. So, now what? Well, the trick is to understand that we really have just one random variable, and questions about $(X,Y)$ can be translated into questions about $X$ alone and answered in terms of $X$ alone. For example (with $\alpha > 0$), $$F_{X,Y}(x_0,y_0) = P\{X\leq x_0, Y\leq y_0\} = P\{X\leq x_0, \alpha X + \beta \leq y_0\} = P\left\{X \leq \min\left(x_0, \frac{y_0-\beta}{\alpha}\right)\right\}.$$ Note that all the usual rules apply even though $X$ and $Y$ do not have a joint density. For example, $$\operatorname{cov}(X,Y)= \operatorname{cov}(X,\alpha X+\beta) = \alpha \operatorname{var}(X)$$ and so on.
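A small simulation makes this concrete (the choices $\alpha = 2$ and $\beta = 1$ are arbitrary): the sample covariance matrix of $(X,Y)$ comes out essentially singular, while $\operatorname{cov}(X,Y) \approx \alpha \operatorname{var}(X)$ as claimed:

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, beta = 2.0, 1.0
x = rng.standard_normal(100_000)
y = alpha * x + beta                 # all probability mass on y = 2x + 1

C = np.cov(x, y)                     # sample covariance matrix of (X, Y)
print(C[0, 1], alpha * np.var(x, ddof=1))   # cov(X,Y) ~ alpha * var(X)
print(np.linalg.det(C))              # ~0: the covariance matrix is singular
```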

Finally, if you are still paying attention: if $n$ jointly normal random variables $X_i$ have a singular covariance matrix $\Sigma$ and mean vector $\mathbf m$, then there are $m < n$ (where $m = \operatorname{rank}(\Sigma)$) independent standard normal random variables $Y_j$ such that $$(X_1,X_2,\ldots, X_n) = (Y_1,Y_2,\ldots, Y_m)\mathbf A + \mathbf m$$ where $\mathbf A$ is an $m\times n$ matrix, and all questions about $(X_1,X_2,\ldots, X_n)$ can be restated in terms of $(Y_1,Y_2,\ldots, Y_m)$ and answered in terms of these i.i.d. random variables. Note that $\Sigma = \mathbf A^T\mathbf A$.
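As a sketch of this construction, here is one way (among many, since $\mathbf A$ is not unique) to build such an $\mathbf A$ from an eigendecomposition and sample from the degenerate distribution; the rank-1 $\Sigma$ is the same toy example as in the question:

```python
import numpy as np

# Build one possible A with Sigma = A^T A from an eigendecomposition,
# then sample X = Y A + m with Y a row of iid standard normals.
Sigma = np.array([[1.0, 2.0], [2.0, 4.0]])   # singular, rank m = 1
m_vec = np.array([0.0, 1.0])

eigvals, eigvecs = np.linalg.eigh(Sigma)
keep = eigvals > 1e-12
A = np.sqrt(eigvals[keep])[:, None] * eigvecs[:, keep].T   # shape (m, n)
print(np.allclose(A.T @ A, Sigma))           # True: Sigma = A^T A

rng = np.random.default_rng(3)
Y = rng.standard_normal((100_000, keep.sum()))
X = Y @ A + m_vec                            # rows are draws of (X_1, ..., X_n)
print(np.cov(X.T))                           # ~Sigma
```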