Solved – Principal components analysis: how independent should the input measures be

independence, pca

I have a waveform, for which most people will measure either a peak amplitude or a slope. I have included the area under the curve and several other measures involving the same original waveform. I now have 10 measures on such waveforms (and I have measured these waveforms in 500 patients). Is it legitimate to put all these measures side by side in one big principal component analysis (PCA) to distil a smaller set of principal components? My worry is that some of the measures are inherently interdependent (for example, peak amplitude will be inherently correlated with area under the curve), so I am afraid it might fail to meet some of the assumptions behind PCA, such as independence of measures. Is there another analysis method I should use instead of PCA? Independent Component Analysis? Another multi-dimensional scaling method? Cluster analysis? Thanks very much!

A bit more detail: In my case, I have made 9 raw measures, x1 .. x9, in each patient. I already know that x2 and x4 are highly correlated, but their difference has important prognostic implications. On one hand, I am afraid PCA would fail to note this difference (because x2 and x4 are so correlated), so I would like to include x2-x4 as a 10th measure. However, I am bothered by the fact that I am thus introducing trivial correlation, which will inflate the importance of x2 and x4 and any noise in them. Thanks again.

Best Answer

@gung is right: you want the variables to be correlated. The point of PCA is to reduce the dimension of the input variables. You have 10; if the first 2 PCs account for most of the variance, you can replace your 10 input variables with those 2 components. But this is only possible when the variables are correlated.
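To illustrate why correlation helps rather than hurts, here is a minimal sketch on simulated data (not the poster's measurements). It assumes 500 patients and 10 measures, and compares a hypothetical case where the measures are driven by 2 hidden factors with a case where the measures are unrelated. The helper `explained_variance_ratio` and all variable names are made up for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_patients, n_measures = 500, 10


def explained_variance_ratio(X):
    """Fraction of total variance carried by each principal component."""
    # Standardise each measure so PCA works on the correlation structure.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Singular values of the standardised data give the PC variances.
    s = np.linalg.svd(Z, full_matrices=False, compute_uv=False)
    var = s**2 / (len(Z) - 1)
    return var / var.sum()


# Correlated case (assumed): 10 measures driven by 2 hidden factors plus noise.
factors = rng.normal(size=(n_patients, 2))
signs = rng.choice([-1.0, 1.0], size=(2, n_measures))
loadings = signs * rng.uniform(0.5, 1.5, size=(2, n_measures))
X_corr = factors @ loadings + 0.1 * rng.normal(size=(n_patients, n_measures))

# Independent case (assumed): 10 unrelated measures.
X_indep = rng.normal(size=(n_patients, n_measures))

evr_corr = explained_variance_ratio(X_corr)
evr_indep = explained_variance_ratio(X_indep)

print("correlated measures, share of variance in first 2 PCs:", evr_corr[:2].sum())
print("independent measures, share of variance in first 2 PCs:", evr_indep[:2].sum())
```

In the correlated case the first two components carry nearly all the variance, whereas with independent measures each component carries roughly one tenth, so nothing can be dropped.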

And that being the case, why not throw the original waveform into the PCA? I assume that you have the sampled values of each waveform. Then see whether the PCA pulls out some features of interest, such as amplitude or slope, rather than computing them directly.
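A rough sketch of that idea, again on simulated data: it assumes each waveform is sampled at the same 100 time points and is a single bump whose height and width vary between patients. This is plain PCA on the sampled values, not the smoothed functional PCA of the functional data analysis literature, and every name and parameter below is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n_patients, n_samples = 500, 100
t = np.linspace(0.0, 1.0, n_samples)

# Hypothetical waveforms: one bump per patient, varying in height and width.
amplitude = rng.uniform(0.5, 1.5, size=(n_patients, 1))
width = 0.10 + 0.01 * rng.normal(size=(n_patients, 1))
waves = amplitude * np.exp(-((t - 0.5) ** 2) / (2 * width**2))
waves += 0.02 * rng.normal(size=waves.shape)  # measurement noise

# PCA via SVD of the mean-centred (patients x time points) matrix.
centered = waves - waves.mean(axis=0)
U, s, Vt = np.linalg.svd(centered, full_matrices=False)
explained = s**2 / np.sum(s**2)  # variance share per component
scores = U * s                   # one row of PC scores per patient
# Rows of Vt are the component "shapes" over time and can be plotted against t.

# Compare the first PC score with a hand-computed feature (peak amplitude).
peak = waves.max(axis=1)
r = np.corrcoef(scores[:, 0], peak)[0, 1]

print("variance share of first 3 PCs:", explained[:3])
print("correlation of PC1 score with peak amplitude:", r)
```

In this toy setup the first component essentially rediscovers peak amplitude (the sign of the correlation is arbitrary, since PC directions are only defined up to sign). On real waveforms the leading components may or may not match the hand-picked measures, which is what makes the comparison informative.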

A good reference is Functional Data Analysis with R and MATLAB by Ramsay, Hooker and Graves.

With 500 patients, you are well positioned to look at your waveforms through that sort of analysis.

The other methods that you mention could be helpful, as could growth curve analysis, but I would need to know what your research question is to advise further.