Solved – Goodness of fit chi-square test for a number of PCA components

chi-squared-testfactor analysisp-valuepcar

I have been trying to carry out Principal Component Analysis (PCA) in R using the function psych::principal. However the returned $p$-value has been very low in several attempts using different sets of variables.

Does it mean that the test of the hypothesis that the number of components is sufficient is rejected? (The number of components was obtained using the scree plot and the criterion of eigenvalue larger than 1.)

Is it unimportant for the purpose of obtaining components using PCA? If not, what can be done?

                       RC1  RC2  RC4  RC3  RC5
SS loadings           5.19 4.24 3.01 2.63 1.16
Proportion Var        0.27 0.22 0.16 0.14 0.06
Cumulative Var        0.27 0.50 0.66 0.79 0.85
Proportion Explained  0.32 0.26 0.19 0.16 0.07
Cumulative Proportion 0.32 0.58 0.77 0.93 1.00

Test of the hypothesis that 5 components are sufficient.

The degrees of freedom for the null model are  171  and the objective function was  23.32
The degrees of freedom for the model are 86  and the objective function was  2.64 
The total number of observations was  103  with MLE Chi Square =  241.11  with prob <  1.2e-16

Thank you very much for the answers and input! All help are very much appreciated! I look forward to more detailed documentation of the package to be available.

Best Answer

Goodness of fit cannot be meaningfully computed for PCA. The principal function is doing something strange here.

Function principal() is part of the psych package that implements factor analysis, and is designed to implement PCA in a manner mimicking factor analysis. In factor analysis, goodness of fit can be meaningfully computed as follows.

Given a $p\times p$ covariance matrix $\mathbf S$, factor analysis fits a $p\times k$ matrix of loadings $\mathbf W$ where $k$ is the number of factors and a diagonal $p\times p$ matrix of uniquenesses $\boldsymbol \Psi$ such that $$\mathbf S \approx \mathbf C = \mathbf W \mathbf W^\top + \boldsymbol \Psi.$$ There are different methods of "factor extraction", differing in how exactly the distance between $\mathbf C$ ans $\mathbf S$ is measured. Maximum likelihood method finds loadings and uniquenesses maximizing log-likelihood of the observed data. The log-likelihood of the data is given by $$L_k=-\frac{n}{2}\left[p\ln(2\pi) + \ln|\mathbf C| + \mathrm{tr}(\mathbf C^{-1}\mathbf S)\right],$$ where $n$ is the number of data points (observations). Given two models with different number of factors $k_1$ and $k_2$, they can be compared via a likelihood ratio test that tells us that $$2(L_{k_1} - L_{k_2}) \sim \chi^2_{\mathrm{df}_1-\mathrm{df}_2}.$$ In particular, a model with $k$ factors can be compared to the "full" model with $k=p$ and $\mathbf C=\mathbf S$.

In PCA, there is no $\boldsymbol \Psi$ and so $\mathbf C_\mathrm{PCA}=\mathbf W\mathbf W^\top$. This makes $\mathbf C$ low-rank, with zero determinant, and non-invertible.

Consequently, in PCA likelihood cannot be computed and likelihood ratio is undefined (ultimately because PCA is not a probabilistic model).

What exactly principal() is doing to "compute" it, is impossible to understand from its manual page. Having looked into its source code I saw that all the goodness of fit computations are done by the factor.stats() function, which again has an unclear manual page. The source code of factor.stats() reveals that after computing model covariance matrix $\mathbf C = \mathbf W\mathbf W^\top$, the algorithm makes its diagonal equal to the diagonal of $\mathbf S$ (making it full rank). This is the correct computation in factor analysis because there the diagonals of $\mathbf C$ and $\mathbf S$ always coincide (due to the matrix of uniquenesses), but makes no sense in PCA.

The resulting goodness of fit computation is essentially a factor analysis computation, but with PCA loadings. This is weird. I am not sure how this could be justified. My advice would be to ignore it and to use some other approach instead.

(Having said that, I guess that a very low $p$-value still does indicate that the number of components is "not sufficient": after all, such "diagonal-adjusted" $\widetilde{\mathbf C}$ is closer to $\mathbf S$ than the original $\mathbf C$, so if $\widetilde{\mathbf C}$ does not provide a sufficient fit, then $\mathbf C$ does not either.)

Related Solutions

Solved – Chi square goodness of fit test

Since we are testing for a uniform distribution, the values of each season should be equal. The most likely value for each season assuming a uniform distribution is the average of all the seasons.

(13+24+18+25)/4 = 80/4 = 20

The formula for the Chi-Squared Statistic is given here.

http://en.wikipedia.org/wiki/Pearson's_chi-squared_test

((13-20)^2)/20 = 2.45
((24-20)^2)/20 = 0.80
((18-20)^2)/20 = 0.20
((25-20)^2)/20 = 1.25

Chi-Squared Statistic

2.45 + 0.80 + 0.20 + 1.25 = 4.7

The degrees of freedom are

df=4-1=3

You should be able to figure out the rest.

Solved – PCA and degrees of freedom

I think the point that I was missing is that PCA does not tell us about the bare number of degrees of freedom, but rather the variance-weighted number of degrees of freedom.

In my first example all ten variables had equal variance (barring sampling noise), and so the PCA shows each successive principal component adding a roughly equal amount of variance.

In the second example the variables I'm working with have a wide range of variances: $y_1$ has a much larger variance than $y_{10}$. The PCA therefore correctly tells me that fewer principal components are required to capture most of the variance, because the higher indexed $y$'s contribute increasingly little to the total variance of the system.

Best Answer

Related Solutions

Solved – Chi square goodness of fit test

Solved – PCA and degrees of freedom

Related Question