[Math] problem with the Wishart distribution

pr.probability

It seems that the Wishart distribution scales problematically with dimension. I start with some background, which should make the question reasonably understandable to a general math and statistics audience. I then go into the details of my question. I look forward to hearing from you!

Background

The Wishart distribution with $\nu$ degrees of freedom and positive definite $p \times p$ scale matrix $V$, $\mathcal{W}_p(\nu,V)$, has the pdf

$$p(S\mid V,\nu) = \frac{|S|^{(\nu - p - 1)/2}}{2^{\nu p/2}|V|^{\nu/2}\Gamma_p(\nu/2)}\exp\left(-\frac{1}{2}\text{tr}(V^{-1}S)\right)$$

Draws from this distribution are $p \times p$ positive definite matrices (with probability one) so long as $\nu > p - 1$, with expectation $\mathbb{E}[S]= \nu V$ and variance $\text{Var}[S_{ij}] = \nu(V_{ij}^2 + V_{ii}V_{jj})$.

If $\nu$ is integer-valued, we can write a Wishart random variable as a sum of outer products of $\nu$ i.i.d. multivariate Gaussian random variables:

$$S = \sum_{i=1}^{\nu} \mathbb{u}_i \mathbb{u}_i^{\top} \sim \mathcal{W}_p(\nu,V),$$
where $\mathbb{u}_i \sim \mathcal{N}(0,V)$.
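
The outer-product construction above can be sketched numerically. This is a minimal sketch assuming NumPy; the function name `sample_wishart`, the example matrix `V`, and the values of `p`, `nu` and the number of draws are illustrative choices, not part of the question.

```python
import numpy as np

def sample_wishart(nu, V, rng):
    """Draw one S ~ W_p(nu, V) as a sum of nu outer products u_i u_i^T."""
    p = V.shape[0]
    # Rows of U are i.i.d. N(0, V) vectors; U has shape (nu, p).
    U = rng.multivariate_normal(np.zeros(p), V, size=nu)
    # U.T @ U equals sum_i u_i u_i^T.
    return U.T @ U

rng = np.random.default_rng(0)
p, nu = 3, 10
# Illustrative positive definite scale matrix (an assumption for this example).
V = np.array([[1.0, 0.5, 0.0],
              [0.5, 1.0, 0.2],
              [0.0, 0.2, 1.0]])

draws = np.stack([sample_wishart(nu, V, rng) for _ in range(5000)])
empirical_mean = draws.mean(axis=0)

print("empirical mean of S:\n", empirical_mean)
print("nu * V:\n", nu * V)
```

With enough draws, the empirical mean should sit close to $\nu V$, in line with the expectation stated above.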

The question

$\nu$ must be at least $p$ (more precisely, $\nu > p - 1$) for draws from the Wishart to stay positive definite. But if $\nu$ is very large, the probability mass appears to concentrate on the expectation of the distribution. Yes, the variance of each entry increases with $\nu$, but there is some counterintuitive behaviour which I will try to explain.

Suppose we do not want entries of the matrices we draw to grow in expectation as we increase dimension. We can therefore let, for example, $V = I_p / \nu$, where $I_p$ is the $p \times p$ identity matrix. This way the expectation stays constant in $\nu$: $\mathbb{E}[S] = I_p$. But the variance of each entry is proportional to $1/\nu$, which tends to $0$ as $\nu \to \infty$.
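
To make the scaling explicit, substituting $V = I_p/\nu$ (so $V_{ii} = 1/\nu$ and $V_{ij} = 0$ for $i \neq j$) into the variance formula from the background gives

$$\text{Var}[S_{ii}] = \nu\left(\frac{1}{\nu^2} + \frac{1}{\nu^2}\right) = \frac{2}{\nu}, \qquad \text{Var}[S_{ij}] = \nu\left(0 + \frac{1}{\nu^2}\right) = \frac{1}{\nu} \quad (i \neq j).$$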

Alternatively, consider the top-left entry of any matrix drawn from a Wishart distribution. It is a chi-squared random variable with $\nu$ degrees of freedom, scaled by a constant. Its variance is proportional to $2\nu$ and its expectation is proportional to $\nu$, so the variance over the expectation is constant. However, the standard deviation over the expectation scales as $1/\sqrt{\nu}$. So if we draw from a high-dimensional Wishart distribution, which requires $\nu$ to be large, it seems that there is very little relative difference between draws.
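
The shrinking relative spread of the top-left entry can be checked by simulation. This is a rough sketch assuming NumPy and the scaling $V = I_p/\nu$, under which the top-left entry is a $\chi^2_\nu$ variable divided by $\nu$; the helper name `relative_sd` and the sample sizes are hypothetical.

```python
import numpy as np

def relative_sd(nu, n_draws, rng):
    """Empirical sd/mean of the top-left entry of S ~ W_p(nu, I_p / nu)."""
    # Under V = I_p / nu the top-left entry is chi2(nu) / nu.
    s11 = rng.chisquare(nu, size=n_draws) / nu
    return s11.std() / s11.mean()

rng = np.random.default_rng(0)
for nu in (10, 100, 1000):
    empirical = relative_sd(nu, 100_000, rng)
    # The exact ratio for a chi-squared variable is sqrt(2 * nu) / nu = sqrt(2 / nu).
    print(f"nu={nu:5d}  empirical sd/mean={empirical:.4f}  sqrt(2/nu)={np.sqrt(2 / nu):.4f}")
```

The two printed columns should agree closely and both decrease as $\nu$ grows.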

Therefore the Wishart distribution seems not to be useful in high dimensions, and to scale undesirably with dimension.

Any comments on this?

A possible solution

Suppose we want to draw from a distribution over high-dimensional positive definite matrices, without this "concentration" problem. It seems the concentration comes from the fact that the construction sums i.i.d. outer-product matrices. Perhaps we could bump the variance up by instead using Gaussians which are correlated.

For example, imagine a 2-dimensional case. We have vectors
$$\mathbb{u} = [u_1 \quad u_2]^\top, \mathbb{v} = [v_1 \quad v_2]^\top, \mathbb{w}=[w_1 \quad w_2]^\top$$

Let $(u_1, v_1, w_1) \sim \mathcal{N}(0,K)$, where $K$ is some non-identity $3 \times 3$ covariance matrix. Similarly correlate $(u_2,v_2,w_2)$.

Then construct a draw from the distribution by taking
$$S = \mathbb{u}\mathbb{u}^\top + \mathbb{v}\mathbb{v}^\top + \mathbb{w}\mathbb{w}^\top.$$

A similar procedure could be followed in high dimensions, which would potentially let you draw high-dimensional positive definite matrices that are all reasonably distinct from one another. It seems the support would still be over all positive definite matrices, like the Wishart distribution. I am not sure what the pdf would look like, or what other analytic properties it would have.
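
The correlated construction can be tried out numerically. The sketch below assumes NumPy and generalises the three vectors to $n$ vectors. For each coordinate $j$, the $n$ values of that coordinate across the vectors are drawn jointly from $\mathcal{N}(0,K)$, independently across coordinates. The equicorrelated form of $K$, the names `sample_correlated_sum`, `equicorrelation` and `relative_sd_top_left`, and the values `p = 5`, `n = 200`, `rho = 0.5` are all hypothetical choices for illustration. Under this construction the top-left entry equals $z^\top K z$ with $z \sim \mathcal{N}(0, I_n)$, which has mean $\text{tr}(K)$ and variance $2\,\text{tr}(K^2)$, so its relative spread depends on the eigenvalues of $K$.

```python
import numpy as np

def sample_correlated_sum(p, K, rng):
    """Draw S = sum_i u_i u_i^T where, for each coordinate j,
    (u_1[j], ..., u_n[j]) ~ N(0, K), independently across coordinates."""
    n = K.shape[0]
    L = np.linalg.cholesky(K)
    # Each row of U is an N(0, K) draw; row j holds coordinate j of all n vectors.
    U = rng.standard_normal((p, n)) @ L.T
    # Columns of U are the vectors u_1, ..., u_n, so U @ U.T sums their outer products.
    return U @ U.T

def equicorrelation(n, rho):
    """Hypothetical K: unit variances and a common correlation rho."""
    return (1 - rho) * np.eye(n) + rho * np.ones((n, n))

def relative_sd_top_left(p, K, rng, n_draws=2000):
    """Empirical sd/mean of the top-left entry of S over repeated draws."""
    s11 = np.array([sample_correlated_sum(p, K, rng)[0, 0] for _ in range(n_draws)])
    return s11.std() / s11.mean()

rng = np.random.default_rng(0)
p, n = 5, 200
rel_sd_iid = relative_sd_top_left(p, np.eye(n), rng)                 # K = I: ordinary Wishart
rel_sd_corr = relative_sd_top_left(p, equicorrelation(n, 0.5), rng)  # correlated vectors

print(f"sd/mean of S_11, i.i.d. vectors:      {rel_sd_iid:.3f}")
print(f"sd/mean of S_11, correlated vectors:  {rel_sd_corr:.3f}")
```

With $K = I_n$ this reduces to the ordinary Wishart construction, so the first number serves as the baseline for the comparison.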

What do you think? Is there a more elegant solution? Is there a problem with my 'solution'?

Thanks and please let me know if you would like clarification on anything!

Best wishes,
Andrew

Best Answer

If you are interested in constructive distributions that you can simulate from (as opposed to something simpler with a known expression for the density), you have a lot of flexibility in constructing positive semi-definite matrices "from scratch". One popular way to do this is to use a "factor decomposition", which sets $$S = BB^t + \Psi$$ where $B$ is a $p$-by-$k$ matrix for $0 < k \leq p$ and $\Psi$ is a diagonal matrix with positive elements. A distribution on $S$ is induced by a distribution over the elements $b_{ij}$ of $B$ and $\psi_{ii}$ of $\Psi$. One could draw the columns of $B$ to be orthogonal if desired (from, say, the Bingham-von-Mises-Fisher distribution). The recipe would be to (1) draw $k$ between 1 and $p$ with some probabilities, (2) draw the $k$ columns of $B$, (3) draw the elements of $\Psi$ i.i.d., and (4) construct $S$. In this setting it seems like it would be easy to avoid the concentration phenomenon you note simply by modulating the distribution over $k$ and $\Psi$ as $p$ grows.
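
The four-step recipe could be sketched as follows, assuming NumPy. The specific distributions here (uniform $k$, standard normal entries for $B$, Gamma-distributed diagonal for $\Psi$) and the function name `sample_factor_matrix` are placeholder assumptions, since the recipe leaves those choices open.

```python
import numpy as np

def sample_factor_matrix(p, rng):
    """Draw S = B B^T + Psi following the four-step factor recipe."""
    # 1) Draw the number of factors k; uniform on {1, ..., p} is an assumption.
    k = rng.integers(1, p + 1)
    # 2) Draw the k columns of B; i.i.d. standard normal entries are an assumption.
    B = rng.standard_normal((p, k))
    # 3) Draw the diagonal of Psi i.i.d.; Gamma(2, 1) is an assumption that keeps entries positive.
    psi = rng.gamma(shape=2.0, scale=1.0, size=p)
    # 4) Construct S.
    return B @ B.T + np.diag(psi)

rng = np.random.default_rng(0)
S = sample_factor_matrix(50, rng)
print("smallest eigenvalue of S:", np.linalg.eigvalsh(S).min())
```

Because $\Psi$ has strictly positive diagonal entries, $S$ is positive definite for any $k$, so $k$ does not need to grow with $p$ the way $\nu$ does in the Wishart case.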
