Solved – Is principal components analysis valid if the distribution(s) are Zipf-like? What would be similar to PCA but suited to non-Gaussian data?

normal-distribution, pca, zipf

I'm analyzing people based on their Twitter streams. We are using a 'word bag' model of users, which basically amounts to counting how often each word appears in a person's Twitter stream (and then using that as a proxy for a more normalized 'probability they will use a given word' in a particular length of text).
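The word-bag model described above can be sketched like this (a minimal illustration with naive whitespace tokenization; a real pipeline would handle punctuation, mentions, URLs, etc.):

```python
from collections import Counter

def word_bag(tweets):
    """Count word occurrences across a user's tweets and normalize
    to per-word usage probabilities (hypothetical helper)."""
    counts = Counter()
    for tweet in tweets:
        counts.update(tweet.lower().split())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

probs = word_bag(["the cat sat", "the dog ran the race"])
# probs["the"] is 3/8: 3 occurrences out of 8 tokens total
```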

Due to constraints further down the pipeline, we cannot retain full data on every word's usage for all users, so we are trying to find the most 'symbolically efficient' words to retain in our analysis. That is, we're trying to retain a subset of dimensions such that knowing their values would allow a hypothetical seer to most accurately model the probabilities of all words (including any we left out of the analysis).
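One naive baseline for "which dimensions to retain" (my illustration, not something proposed in the question itself) is to keep the k words whose counts vary most across users, since a column that is nearly constant carries little discriminating information:

```python
import numpy as np

def top_variance_words(counts, words, k):
    """Keep the k columns (words) with the highest variance across
    users. counts is a users-by-words matrix of word counts."""
    counts = np.asarray(counts, dtype=float)
    order = np.argsort(counts.var(axis=0))[::-1]  # most variable first
    return [words[i] for i in order[:k]]

# Toy example: 3 users, 4 words. "the" is used identically by everyone,
# so it is the least informative column despite being the most frequent.
words = ["the", "cat", "rocket", "dog"]
counts = [[5, 0, 9, 1],
          [5, 2, 0, 1],
          [5, 1, 4, 2]]
print(top_variance_words(counts, words, 2))  # → ['rocket', 'cat']
```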

So a principal components analysis (PCA) type approach seems an appropriate first step (happily ignoring, for now, the fact that PCA would also 'rotate' us into dimensions that don't correspond to any particular word).

But I am reading that "Zipf distributions … characterize the use of words in a natural language (like English)", and as far as I know, PCA makes various assumptions about the data being normally distributed. So I'm wondering whether the fundamental assumptions of PCA will be sufficiently far 'off' from reality to be a real problem. That is, does PCA rely on the data being 'close to' Gaussian normal for it to work at all well?
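Whether a particular corpus really is Zipf-like is easy to eyeball empirically (a quick sketch, not a rigorous goodness-of-fit test): under Zipf's law, frequency is proportional to rank^(-s), so log-frequency versus log-rank should be roughly linear with a slope near -1 for natural language.

```python
import numpy as np

def zipf_slope(counts):
    """Fit a line to log(frequency) vs. log(rank); the slope estimates
    the (negated) Zipf exponent."""
    freqs = np.sort(np.asarray(counts, dtype=float))[::-1]
    ranks = np.arange(1, len(freqs) + 1)
    slope, _intercept = np.polyfit(np.log(ranks), np.log(freqs), 1)
    return slope

# Synthetic Zipf counts: frequency proportional to 1/rank, so the
# fitted slope comes out at exactly -1.
counts = [1000 / r for r in range(1, 101)]
print(round(zipf_slope(counts), 2))  # → -1.0
```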

If this is a problem, as I suspect, are there any other recommendations? That is, is there some other approach worth investigating that is 'equivalent' to PCA in some way but more appropriate for Zipf- or power-law-distributed data?

Note that I am a programmer, not a statistician, so apologies if I messed up any terminology above. (Corrections are of course welcome!)

Best Answer

The truth is that PCA contains an inherent assumption of linearity, i.e. that changing the basis can reframe the problem to provide a more discriminating view of the data. Does that have to hold when working with Zipf/power-law data? It depends on whether all your variables follow the same distribution. If so, you could take the logarithm of the values in all columns and perform PCA with sensible results.
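A minimal sketch of that suggestion, using only NumPy (the `log1p` transform is my assumption, to handle zero counts; PCA is done via SVD of the centered matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.pareto(2.0, size=(200, 10))      # power-law-ish columns

X_log = np.log1p(X)                      # tame the heavy tail
X_centered = X_log - X_log.mean(axis=0)  # PCA requires centered data

# Rows of Vt are the principal axes; squared singular values (scaled)
# give the variance explained by each component, in decreasing order.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
explained_var = S**2 / (len(X) - 1)
scores = X_centered @ Vt.T               # projections onto components
```

Without the log step, the handful of extreme Pareto draws would dominate the covariance and the first component would mostly track those outliers.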

A power law makes your variances explode. PCA will of course still yield results, but they will be hard to interpret without making the mistake of arguing that a phenomenon is happening when it is actually only happening in the top 20% of outliers. You could also use PCA to see the major differences, then divide the data at a point where the long tail is separated from the top outliers, and run a PCA on the tail alone.
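The head/tail split suggested here might look like the following (a sketch under my own assumptions: rows are split by total mass at the 80th percentile, and PCA is then run on the tail only so the heavy outliers don't dominate the variance structure):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.pareto(1.5, size=(500, 8))        # heavy-tailed user-by-word data

row_mass = X.sum(axis=1)
cutoff = np.quantile(row_mass, 0.8)       # top 20% of rows by total mass
head, tail = X[row_mass > cutoff], X[row_mass <= cutoff]

# PCA on the long tail alone, via SVD of the centered tail matrix.
T = tail - tail.mean(axis=0)
_, S, Vt = np.linalg.svd(T, full_matrices=False)
```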

A good tutorial on PCA with assumptions can be found here: Jonathon Shlens: A Tutorial on Principal Component Analysis. CoRR abs/1404.1100 (2014)
