Solved – How to normalize Poisson distributed data before PCA

normalization, pca, poisson distribution

I am a newbie to principal component analysis (PCA). I need to apply PCA to data sets consisting of count statistics: all values are positive integers.

Before PCA the data need to be normalized. It is more or less standard to do that by subtracting the mean and dividing by the standard deviation of each variable across the sample. I wonder whether this is appropriate for data sets that are very skewed.

Best Answer

First off, be aware that the term "normalize" is ambiguous within statistical science. You apply it to scaling by (value $-$ mean) / standard deviation, which is commonly also described as standardization. But it is also often applied to transformations that produce versions of a variable that are more nearly normal (Gaussian) in distribution. Yet again, a further use is that of scaling to fit within a prescribed range, say $[0, 1]$.

Standardization itself does not make a distribution any more or less normal: it is merely a linear transformation, so skewness and kurtosis (for example), and more generally all measures of distribution shape, remain as they were.
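To make that point concrete, here is a minimal sketch using only the Python standard library. The data are made up (a floored exponential draw plus one, so that all values are positive integers), and the helper names `skewness` and `standardize` are mine, not from any package. The skewness here is the simple moment-based version, $m_3 / m_2^{3/2}$.

```python
import random
import statistics


def skewness(xs):
    """Moment-based skewness: third central moment over variance**1.5."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5


def standardize(xs):
    """(value - mean) / standard deviation, using the population SD."""
    mean = statistics.fmean(xs)
    sd = statistics.pstdev(xs)
    return [(x - mean) / sd for x in xs]


rng = random.Random(0)
# Hypothetical skewed counts: positive integers with a long right tail.
counts = [1 + int(rng.expovariate(0.3)) for _ in range(2000)]

z = standardize(counts)

# Mean and SD change to 0 and 1; the skewness is the same before and after.
print("raw:         ", statistics.fmean(counts), statistics.pstdev(counts), skewness(counts))
print("standardized:", statistics.fmean(z), statistics.pstdev(z), skewness(z))
```

The two skewness values agree up to floating-point error, which is all that "linear transformation" implies here.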

As for principal component analysis (PCA), prior standardization is common, indeed arguably essential, whenever the individual variables are measured using different units of measurement. Conversely, PCA without standardization can make sense so long as all variables are measured in the same units. The difference corresponds to basing PCA on the correlation matrix (prior standardization) or on the covariance matrix (no prior standardization). Without standardization, PCA results are inevitably dominated by the variables with highest variance; if that is desired (or at worst unproblematic), then you will not be troubled.
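A small sketch of the covariance-versus-correlation distinction, under stated assumptions: two made-up variables (Gaussian for simplicity, not counts), the second on a scale roughly 100 times larger, and a closed-form eigen-solution that works only for a 2 x 2 symmetric matrix. The helpers `cov2`, `first_pc` and `standardize` are hypothetical names for this illustration; for real work you would use a linear algebra library.

```python
import math
import random
import statistics


def cov2(x, y):
    """Entries (a, b, c) of the population covariance matrix [[a, b], [b, c]]."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    n = len(x)
    a = sum((u - mx) ** 2 for u in x) / n
    c = sum((v - my) ** 2 for v in y) / n
    b = sum((u - mx) * (v - my) for u, v in zip(x, y)) / n
    return a, b, c


def first_pc(a, b, c):
    """Largest eigenvalue and unit eigenvector of [[a, b], [b, c]], assuming b != 0."""
    lam = (a + c) / 2 + math.hypot((a - c) / 2, b)
    vx, vy = b, lam - a
    norm = math.hypot(vx, vy)
    return lam, (vx / norm, vy / norm)


def standardize(xs):
    """(value - mean) / standard deviation, using the population SD."""
    mean = statistics.fmean(xs)
    sd = statistics.pstdev(xs)
    return [(v - mean) / sd for v in xs]


rng = random.Random(1)
# Hypothetical data: y is correlated with x but has a far larger variance.
x = [rng.gauss(0, 1) for _ in range(1000)]
y = [100 * (0.5 * u + rng.gauss(0, 1)) for u in x]

# PCA on the covariance matrix (no prior standardization).
lam_cov, load_cov = first_pc(*cov2(x, y))
# PCA on the correlation matrix (covariance of the standardized variables).
lam_cor, load_cor = first_pc(*cov2(standardize(x), standardize(y)))

print("covariance PC1 loadings: ", load_cov)
print("correlation PC1 loadings:", load_cor)
```

On the covariance matrix the first component is almost entirely the high-variance variable; on the correlation matrix the two variables load equally, which for two variables is forced by the structure of the matrix.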

Conversely, standardizing all variables gives them all, broadly speaking, the same importance; and even that could be wrong, or not what you most want. For example, the variable with the least variance and that with the most will end up on the same scale and with equal weight. Only rarely does that match what a researcher most needs, although it can be hard to build in what is needed without subjectivity or circularity. In practice, PCA seems most successful when the input variables have a strong family resemblance and least successful when the researcher inputs a mishmash of quite different variables, such as different social, economic or demographic characteristics of countries or other political units. PCA is not a washing machine; the dirt is not removed, but just redistributed.

If skewness is very high, you have a choice. Often results will be clearer if PCA is applied to transformed variables. For example, the effects of outliers or extreme data points will often be muted when variables are transformed. On the other hand, PCA as a transformation technique does not depend on, or assume, that any (let alone all) of the variables fed to it are normally distributed.

In the abstract, it is difficult to advise in detail, but when the data are highly skewed it will often be sensible to apply PCA both to the original data and to transformed data, and then to report either or both results, depending on what is helpful scientifically or substantively.

PCA itself is indifferent to whether variables are transformed in the same way, or indeed to whether some variables are transformed and others are not. Whenever it makes sense, there is some appeal in transforming variables in the same way, but this is perhaps more a question of taste than of technique.

As a simple example, if several variables are all measures of size in some sense, then skewness is very likely. Transforming all variables by taking logarithms (so long as all values are positive) will then often be valuable as a precursor to PCA, but neither analysis should be thought of as "correct"; rather they give complementary views of the data.
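As a rough illustration of the size example, the sketch below simulates two hypothetical size variables (`length` and `mass`, both lognormal, which is my assumption and not something known about your data) and compares moment-based skewness before and after taking logarithms. Only the standard library is used.

```python
import math
import random


def skewness(xs):
    """Moment-based skewness: third central moment over variance**1.5."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5


rng = random.Random(42)
# Hypothetical size measurements, strictly positive and right-skewed.
sizes = {
    "length": [rng.lognormvariate(2.0, 1.0) for _ in range(2000)],
    "mass": [rng.lognormvariate(5.0, 1.0) for _ in range(2000)],
}

# The same transformation applied to every variable; requires all values > 0.
logged = {name: [math.log(v) for v in vals] for name, vals in sizes.items()}

for name in sizes:
    print(name, "raw:", round(skewness(sizes[name]), 2),
          "logged:", round(skewness(logged[name]), 2))
```

Either version of the variables could then be passed to PCA; the point is only that the logged variables are far less dominated by a few very large values.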

Note 1: I rather doubt that you "have to" do PCA unless you are committed to some exercise as part of a course of study. It seems very likely that some kind of Poisson modelling would be closer to scientific goals and just as fruitful as PCA, but without detail on those goals that is a matter of speculation.

Note 2: In the case of positive integers, roots and logarithms both have merit as transformations. I note that you state that your data are Poisson distributed without showing any evidence.
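One reason roots are often suggested for counts, assuming the counts really are close to Poisson: the variance of a Poisson variable equals its mean, whereas the variance of its square root is roughly $1/4$ whatever the mean, so long as the mean is not small. The sketch below checks this by simulation. The rates 10, 40 and 160 are arbitrary choices, and `poisson_draw` is a hypothetical helper implementing Knuth's multiplication method, since the standard library has no Poisson generator.

```python
import math
import random
import statistics


def poisson_draw(rng, lam):
    """One Poisson(lam) draw by Knuth's multiplication method (moderate lam only)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


rng = random.Random(7)
results = {}
for lam in (10, 40, 160):  # arbitrary illustrative rates
    counts = [poisson_draw(rng, lam) for _ in range(5000)]
    roots = [math.sqrt(c) for c in counts]
    # Store (variance of raw counts, variance of square roots).
    results[lam] = (statistics.pvariance(counts), statistics.pvariance(roots))
    print(lam, results[lam])
```

The raw variances track the rates, while the variances of the roots stay near 0.25. Note also that the square root is defined at zero and the logarithm is not, which matters if any counts turn out to be zero.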