Not sure my response is relevant; perhaps what I say is not news to you. It is about starting values for communalities in factor analysis.
Actually, you cannot estimate the true communality (and likewise the uniqueness) of a variable before you've done the FA, because communalities are tied to the number of factors m being extracted. In the principal axis method of extraction, the communalities are iteratively trained (like dogs are trained) so that the m factors restore the pairwise coefficients - correlations or covariances - as closely as possible.
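To make that iteration concrete, here is a minimal numpy sketch of iterated principal-axis extraction (my own illustrative code, not the implementation of any particular package); it starts from the squared multiple correlations discussed below and refines the communalities until the m factors reproduce them.

```python
import numpy as np

def principal_axis_factoring(R, m, n_iter=100, tol=1e-6):
    """Iterated principal-axis extraction of m factors from a correlation
    matrix R (illustrative sketch only)."""
    # starting communalities: squared multiple correlations, 1 - 1/[R^-1]_ii
    h2 = 1.0 - 1.0 / np.diag(np.linalg.inv(R))
    for _ in range(n_iter):
        R_reduced = R.copy()
        np.fill_diagonal(R_reduced, h2)         # communalities on the diagonal
        vals, vecs = np.linalg.eigh(R_reduced)  # eigenvalues in ascending order
        top = np.argsort(vals)[::-1][:m]        # the m largest eigenvalues
        loadings = vecs[:, top] * np.sqrt(np.clip(vals[top], 0.0, None))
        h2_new = (loadings ** 2).sum(axis=1)    # communalities restored by m factors
        if np.max(np.abs(h2_new - h2)) < tol:
            return loadings, h2_new
        h2 = h2_new
    return loadings, h2
```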
To estimate starting values for communalities several methods can be used, as you probably know:
- The squared multiple correlation coefficient$^1$ between the variable and the rest of the variables is considered the best guess for the starting value of that variable's communality. This value is a lower bound for the "true", resultant communality.
- Another possible guess is the maximal or the mean absolute correlation/covariance of the variable with the remaining ones.
- Still another value sometimes used is the test-retest reliability (correlation/covariance) coefficient. This would be an upper bound for the "true" communality.
- And in specific cases, user-defined initial values are used (e.g. communality values borrowed from the literature).
$^1$ A closer look. If $\bf R$ is the analyzed correlation or covariance matrix and you form a diagonal matrix $\bf D$ whose diagonal elements are the reciprocals of the diagonal elements of $\bf R^{-1}$, then the matrix $\bf DR^{-1}D-2D+R$ is called the "image covariance matrix" of $\bf R$ (sic! "covariance" irrespective of whether $\bf R$ holds covariances or correlations). Its diagonal entries are the "images" in $\bf R$ (actually, these images are the diagonal of $\bf R-D$).
If $\bf R$ is a correlation matrix, the images are the squared multiple correlation coefficients (of the regression of each variable on all the other variables). If $\bf R$ is a covariance matrix, the images are the squared multiple correlation coefficients multiplied by the respective variable's variance. These values - the images - are used as starting communalities in both cases.
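A quick numerical check of that footnote, as a sketch (the toy data are arbitrary; only the algebraic identities matter):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 4)) @ rng.standard_normal((4, 4))  # correlated toy data
R = np.corrcoef(X, rowvar=False)

Rinv = np.linalg.inv(R)
D = np.diag(1.0 / np.diag(Rinv))     # D = [diag(R^{-1})]^{-1}
G = D @ Rinv @ D - 2.0 * D + R       # image covariance matrix

smc = 1.0 - 1.0 / np.diag(Rinv)      # squared multiple correlations
print(np.allclose(np.diag(G), smc))              # True: images = SMC for a correlation matrix
print(np.allclose(np.diag(G), np.diag(R - D)))   # True: images = diagonal of R - D
```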
A side note for the curious: the matrix $\bf DR^{-1}D$ is known as the "anti-image covariance matrix" of $\bf R$. If you convert it to the "anti-image correlation matrix" (in the usual way one converts a covariance into a correlation, $r_{ij}=cov_{ij}/(\sigma_i \sigma_j)$), then the off-diagonal elements turn out to be the negatives of the partial correlation coefficients (between two variables, controlling for all the other variables). Partial correlation coefficients are optionally used within factor analysis to compute the Kaiser-Meyer-Olkin measure of sampling adequacy (KMO).
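For completeness, a small sketch of how the overall KMO could be computed from these quantities ($\bf R$ is assumed invertible; the function name is my own, not from any library):

```python
import numpy as np

def kmo(R):
    """Overall Kaiser-Meyer-Olkin measure from a correlation matrix R."""
    S = np.linalg.inv(R)
    d = np.sqrt(np.diag(S))
    partial = -S / np.outer(d, d)          # partial correlations = negated
    np.fill_diagonal(partial, 0.0)         # off-diagonal anti-image correlations
    r_off = R - np.eye(R.shape[0])         # off-diagonal correlations
    num = np.sum(r_off ** 2)
    return num / (num + np.sum(partial ** 2))
```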
The best tool to resolve (multi-)collinearity is, in my view, the Cholesky decomposition of the correlation/covariance matrix. The following example even covers a case of collinearity where none of the bivariate correlations is "extreme", because the rank reduction only appears over a set of more than two variables.
If the correlation matrix, say R, is positive definite, then all entries on the diagonal of the Cholesky factor, say L, are non-zero (i.e. above machine epsilon). By the way, to use this tool for collinearity detection it must be implemented so as to allow zero pivots/eigenvalues; I don't know whether, for instance, SPSS can be used for this.
The number of non-zero entries on the diagonal indicates the actual rank of the correlation matrix. And because of the triangular structure of the L matrix, the variables above the first zero occurring on the diagonal form a partial set of variables which is of reduced rank. However, there may be some variables in that block which do not belong to that set. So to find the crucial subset which contains only the multicollinearity, you recompute the Cholesky decomposition several times, reordering the variables until you find the smallest possible subset that shows the rank reduction - so this is an iterative procedure.
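To illustrate the idea, here is a minimal Cholesky routine in Python/numpy that tolerates zero pivots, in the spirit of what is described above (standard routines such as numpy's cholesky refuse a semidefinite matrix); the tolerance is an arbitrary choice, and this is a sketch rather than MatMate's or SPSS's implementation:

```python
import numpy as np

def cholesky_semidef(R, tol=1e-10):
    """Cholesky factor that tolerates zero pivots, so it also handles
    positive semidefinite matrices (illustrative sketch only)."""
    p = R.shape[0]
    L = np.zeros_like(R, dtype=float)
    for j in range(p):
        pivot = R[j, j] - L[j, :j] @ L[j, :j]
        if pivot > tol:
            L[j, j] = np.sqrt(pivot)
            L[j+1:, j] = (R[j+1:, j] - L[j+1:, :j] @ L[j, :j]) / L[j, j]
        # else: the column stays zero -- a zero pivot signals rank deficiency
    return L

# The rank of R is the number of non-zero diagonal entries of its factor:
# rank = np.count_nonzero(np.diag(cholesky_semidef(R)) > 0)
```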
(If needed, I'll show an example later where I use my MatMate program for the script.)
Here is an example using random data on 5 variables, say $x_1$ to $x_5$, which I configured such that the correlation matrix is positive semidefinite (up to machine precision): I made $x_5 = 2 \cdot x_2 + \sqrt 2 \cdot x_4 $ (and afterwards normed it to unit variance), so that this subset of three variables spans a collinear subspace (more exactly, we should call it "co-planar", since they are linearly dependent only within a plane).
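(If you want to reproduce this kind of configuration in numpy - not the exact numbers of the listing below, only the same structural dependency - a sketch:)

```python
import numpy as np

rng = np.random.default_rng(1)
x1, x2, x3, x4 = rng.standard_normal((4, 1000))
x5 = 2.0 * x2 + np.sqrt(2.0) * x4          # exact linear combination
X = np.column_stack([x1, x2, x3, x4, x5])
X = (X - X.mean(axis=0)) / X.std(axis=0)   # norm to unit variance
R = np.corrcoef(X, rowvar=False)

print(np.linalg.matrix_rank(R))            # expect 4: one exact linear dependency
print(np.round(np.linalg.eigvalsh(R), 6))  # the smallest eigenvalue is ~0
```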
Here is the correlation matrix:
R
;MatMate-Listing vom:06.03.2013 17:43:23
;============================================
R = x1 x2 x3 x4 x5
------------------------------------------------------
x1 1.0000 -0.7506 0.2298 -0.8666 0.0952
x2 -0.7506 1.0000 -0.2696 0.4569 0.5355
x3 0.2298 -0.2696 1.0000 0.1890 -0.4407
x4 -0.8666 0.4569 0.1890 1.0000 -0.5066
x5 0.0952 0.5355 -0.4407 -0.5066 1.0000
------------------------------------------------------
and here is the Cholesky factor / loadings matrix:
[20] L = cholesky(R)
L= f1 f2 f3 f4 f5
------------------------------------------------------
x1 1.0000 . . . .
x2 -0.7506 0.6607 . . .
x3 0.2298 -0.1469 0.9621 . .
x4 -0.8666 -0.2930 0.3587 0.1856 .
x5 0.0952 0.9186 -0.3406 -0.1762 .
------------------------------------------------------
As only 4 of the 5 diagonal elements are non-zero (above machine epsilon), we know that the correlation matrix has rank 4 instead of 5 and that we have collinearity. But we do not yet know whether the linear dependence involves 4 variables or whether possibly an even smaller subset spans a rank-reduced subspace. So we iteratively try the rotation to triangularity, where the order of the variables $x_1$ to $x_5$ is systematically altered, to identify the smallest possible such subset.
For instance, we make the last variable "the first":
[22] l1=rot(L,"drei",5´1´2´3´4)
L1= f1 f2 f3 f4 f5
------------------------------------------------------
x1 0.0952 0.9955 . . .
x2 0.5355 -0.8053 0.2545 . .
x3 -0.4407 0.2730 0.7320 0.4421 .
x4 -0.5066 -0.8221 0.2598 . .
x5 1.0000 . . . .
------------------------------------------------------
and we see that the rank reduction already occurs if we ignore variable 3 - because the variables $x_1,x_2,x_4,x_5$ already span only a 3-dimensional subspace (instead of a 4-dimensional one).
Now we proceed by altering the order for the Cholesky decomposition (actually I do this by a column rotation with a "triangularity criterion"):
[24] L1 = rot(L,"drei",5´4´1´2´3)
L1= f1 f2 f3 f4 f5
------------------------------------------------------
x1 0.0952 -0.9492 0.3000 . .
x2 0.5355 0.8445 . . .
x3 -0.4407 -0.0397 0.7803 . 0.4421
x4 -0.5066 0.8622 . . .
x5 1.0000 . . . .
------------------------------------------------------
Now we're nearly done: the subset $x_2,x_4,x_5$ spans a reduced subspace, and to see more, we put these variables at "the top" of the Cholesky process:
[26] L1 = rot(L,"drei",5´4´2´1´3)
L1= f1 f2 f3 f4 f5
------------------------------------------------------
x1 0.0952 -0.9492 . 0.3000 .
x2 0.5355 0.8445 . . .
x3 -0.4407 -0.0397 . 0.7803 0.4421
x4 -0.5066 0.8622 . . .
x5 1.0000 . . . .
------------------------------------------------------
We see that $x_1$ has a component outside that reduced space, and $x_3$ has a further component outside even the rank-3 space; both are thus partly independent of that 2-dimensional subspace (which is why the term "co-planarity" fits).
We can now decide which one of the three variables $x_2$, $x_4$ or $x_5$ to remove to overcome the multicollinearity problem.
If we used software that does not allow this flexible reordering "inside" the rotation parameters/procedure, we would reorder the variables before forming the correlation matrix and do the Cholesky decomposition, arriving at something like:
[26] L1 = cholesky(...) // something in your favorite software...
L1= f1 f2 f3 f4 f5
------------------------------------------------------
x5 1.0000 . . . . Co-planar subset
x2 0.5355 0.8445 . . .
x4 -0.5066 0.8622 . . .
------------------------------------------------------
x1 0.0952 -0.9492 . 0.3000 . further linearly independent
x3 -0.4407 -0.0397 . 0.7803 0.4421 variables
[update]: Note that the candidates from which we would remove one were not necessarily recognizable by inspecting the correlations in the correlation matrix. There the highest (absolute) correlation is 0.8666, between $x_1$ and $x_4$ - but $x_1$ does not contribute to the rank deficiency! Furthermore, the correlations between $x_2,x_4,x_5$ are all in an "acceptable" range, so if one applied some jackknife-like removal of high correlations on the assumption of multicollinearity, one would not see them as the most natural candidates from the set of bivariate correlations alone.
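As a cross-check of the whole procedure, one can also search for the smallest rank-deficient subset directly; this brute-force sketch (not the column-rotation procedure used above) is feasible for a handful of variables:

```python
import numpy as np
from itertools import combinations

def smallest_collinear_subset(R, tol=1e-8):
    """Return the first (smallest) subset of variables whose correlation
    submatrix is rank deficient; brute force over all subsets."""
    p = R.shape[0]
    for size in range(2, p + 1):
        for subset in combinations(range(p), size):
            sub = R[np.ix_(subset, subset)]
            if np.linalg.matrix_rank(sub, tol=tol) < size:
                return subset
    return None

# For a matrix built exactly as in the example above, this should return the
# indices of x2, x4, x5 (i.e. (1, 3, 4) with zero-based indexing).
```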
Best Answer
OK, since you're doing FA I'm assuming that $B$ is of full column rank $q$ and $q<p$. We need a few more details though. This may be a numerical problem; it may also be a problem with your data.
How are you computing the inverse? Do you need the inverse explicitly, or can you re-express the calculation as the solution to a linear system? (i.e. to get $A^{-1}b$, solve $Ax=b$ for $x$, which is typically faster and more stable.)
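A minimal sketch of this point, with placeholder matrices (your actual $B$, $D$ and right-hand side will differ):

```python
import numpy as np

p = 50
rng = np.random.default_rng(0)
B = rng.standard_normal((p, 3))            # placeholder loadings
D = np.diag(rng.uniform(0.1, 1.0, p))      # placeholder uniquenesses
Sigma = B @ B.T + D
b = rng.standard_normal(p)

x_inv   = np.linalg.inv(Sigma) @ b         # explicit inverse: avoid if you can
x_solve = np.linalg.solve(Sigma, b)        # same vector, cheaper and more stable
print(np.allclose(x_inv, x_solve))         # True
```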
What is happening to $D$? Are the estimates really small/zero/negative? In some sense $D$ is the critical link, because $BB'$ is of course rank deficient and defines a singular covariance matrix before $D$ is added, so you can't invert it. Adding the positive diagonal matrix $D$ technically makes it full rank, but $BB'+D$ can still be horribly ill-conditioned if $D$ is small.
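For a "low rank plus diagonal" matrix like $BB'+D$, the Woodbury identity is the standard way to avoid forming the $p \times p$ inverse at all: you only ever solve a $q \times q$ system, which also sidesteps some of the ill-conditioning. A rough sketch with placeholder values:

```python
import numpy as np

p, q = 50, 3
rng = np.random.default_rng(1)
B = rng.standard_normal((p, q))            # placeholder loadings
d = rng.uniform(0.05, 1.0, p)              # placeholder uniquenesses, diag(D)

# Woodbury: (BB' + D)^{-1} = D^{-1} - D^{-1} B (I_q + B' D^{-1} B)^{-1} B' D^{-1}
Dinv_B = B / d[:, None]                    # D^{-1} B
small = np.eye(q) + B.T @ Dinv_B           # q x q instead of p x p
Sigma_inv = np.diag(1.0 / d) - Dinv_B @ np.linalg.solve(small, Dinv_B.T)

print(np.allclose(Sigma_inv, np.linalg.inv(B @ B.T + np.diag(d))))   # True
```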
Oftentimes the estimate for the idiosyncratic variances (your $\sigma^2_i$, the diagonal elements of $D$) is near zero or even negative; these are called Heywood cases. See e.g. http://www.technion.ac.il/docs/sas/stat/chap26/sect21.htm (any FA text should discuss this as well; it's a very old and well-known problem). This can result from model misspecification, outliers, bad luck, solar flares... The MLE is particularly prone to this problem, so if your EM algorithm is designed to get the MLE, look out.
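A small diagnostic sketch one could drop into an EM loop to spot such cases (`B` and `d` here stand for the current estimates of the loadings and of the diagonal of $D$; the names and threshold are placeholders):

```python
import numpy as np

def check_estimates(B, d, eps=1e-6):
    """Flag Heywood-type uniquenesses and report the conditioning of BB' + D."""
    heywood = np.where(d <= eps)[0]                 # near-zero or negative uniquenesses
    if heywood.size:
        print("Heywood-type cases for variables:", heywood)
    Sigma = B @ B.T + np.diag(np.maximum(d, 0.0))   # clamp only for this diagnostic
    print("condition number of BB' + D:", np.linalg.cond(Sigma))
```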
If your EM algorithm is approaching a mode with such estimates, it's possible for $BB'+D$ to lose its positive definiteness, I think. There are various solutions; personally I'd prefer a Bayesian approach, but even then you need to be careful with your priors (improper priors, or even proper priors with too much mass near 0, can have the same problem for basically the same reason).