Modelling Data with Many Zeros: Principal Component Analysis vs Zero-Inflated Models

Tags: pca, poisson-regression, zero-inflation

I have a data set (many continuous predictors and a single continuous response variable) with many zeros. I first used PCA and found the results very helpful. At first I thought that the PCA scores remedy the zero problem by replacing the many zero values with non-zero numbers, but then I wondered whether these scores are even meaningful, since they are derived from a data set containing many zeros. PCA scores are orthogonal, they condense many variables into a few components, and they remove problems such as collinearity. Still, would it be better to try some form of Poisson regression or a zero-inflated model?

Best Answer

Zero-inflated data are also a common problem in community ecology. When ordinating sites by their species communities, a PCA would simply cluster all of the sites near the origin. We therefore typically use distance-based ordinations: we calculate a similarity/dissimilarity metric between sites based on their species compositions, and then perform something like Principal Coordinates Analysis (PCoA) or Non-Metric Multidimensional Scaling (NMDS).
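To make the "dissimilarity between sites" step concrete, here is a minimal sketch of one such metric, the Bray-Curtis dissimilarity, in plain Python. The site-by-species counts are made up for illustration, and returning 0.0 for two completely empty sites is a convention assumed here, not something the metric itself defines.

```python
def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two non-negative abundance vectors.

    Defined as sum(|u_i - v_i|) / sum(u_i + v_i); it ranges from 0
    (identical composition) to 1 (no species in common).
    """
    num = sum(abs(a - b) for a, b in zip(u, v))
    den = sum(a + b for a, b in zip(u, v))
    # Assumption: two all-zero sites are treated as identical (0.0),
    # because the ratio is undefined when both sites are empty.
    return num / den if den else 0.0


def dissimilarity_matrix(sites):
    """Pairwise Bray-Curtis dissimilarities between all rows (sites)."""
    n = len(sites)
    return [[bray_curtis(sites[i], sites[j]) for j in range(n)] for i in range(n)]


# Hypothetical data: 3 sites (rows) x 4 species (columns), with many zeros.
sites = [
    [5, 0, 0, 2],
    [4, 0, 1, 2],
    [0, 7, 0, 0],
]

D = dissimilarity_matrix(sites)
for row in D:
    print([round(d, 3) for d in row])
```

Shared zeros (species absent from both sites) add nothing to either the numerator or the denominator, which is one reason this kind of metric tends to behave better than Euclidean distance on sparse abundance data.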

If your goal is dimensionality reduction, researchers often do use the scores along the PCoA or NMDS axes as new variables. Here's a recent paper that did this using the Bray-Curtis dissimilarity metric.
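As a rough illustration of how axis scores can become new variables, the sketch below implements classical PCoA with NumPy: it double-centres the squared dissimilarities and scales the leading eigenvectors. The function names (`bray_curtis_matrix`, `pcoa`) and the example counts are hypothetical. Clipping negative eigenvalues to zero is a simplification assumed here, since Bray-Curtis dissimilarities are not guaranteed to embed in Euclidean space. It is not meant to reproduce the output of any particular package.

```python
import numpy as np


def bray_curtis_matrix(X):
    """Pairwise Bray-Curtis dissimilarities between the rows of X."""
    X = np.asarray(X, dtype=float)
    num = np.abs(X[:, None, :] - X[None, :, :]).sum(axis=2)
    den = (X[:, None, :] + X[None, :, :]).sum(axis=2)
    # Assumption: pairs of all-zero rows get dissimilarity 0.
    return np.divide(num, den, out=np.zeros_like(num), where=den > 0)


def pcoa(D, k=2):
    """Classical PCoA: return an (n, k) array of site scores."""
    D = np.asarray(D, dtype=float)
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n      # centring matrix
    B = -0.5 * J @ (D ** 2) @ J              # double-centred Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)     # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]    # indices of the k largest
    # Simplification: negative eigenvalues are clipped to zero.
    vals = np.clip(eigvals[order], 0.0, None)
    return eigvecs[:, order] * np.sqrt(vals)


# Hypothetical data: 4 sites (rows) x 4 species (columns).
X = [
    [5, 0, 0, 2],
    [4, 0, 1, 2],
    [0, 7, 0, 0],
    [0, 6, 1, 0],
]

scores = pcoa(bray_curtis_matrix(X), k=2)
print(scores)  # one row per site; columns can be used as predictors
```

The columns of `scores` can then be used in place of the original zero-heavy variables in a downstream model. The sign of each axis is arbitrary, so only relative positions along an axis are meaningful.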

These methods are pretty straightforward to implement in R. In particular, check out the 'vegan' package, which provides vegdist() for computing dissimilarities and metaMDS() for NMDS.
