Solved – How to use PCA with mixed and sparse data types

clustering, dimensionality reduction, pca, sparse

I am trying to reduce the dimensionality of a data set of about 100'000 rows and 1'000 columns, in order to cluster the individual observations with k-means. I tried PCA with rescaling (i.e., subtracting the mean and dividing by the standard deviation), but I am not sure this approach makes much sense, because:

  • The majority of the variables are not normally distributed (e.g., they follow exponential or other skewed distributions)
  • Many variables are 0/1 flags, and most of them are very sparse (i.e., 99.9% of the values are 0 and only 0.1% are 1)
  • There are many outliers, and it is not always clear whether it is better to remove the corresponding row or the corresponding column

Is there a better way to reduce the dimensionality than PCA? I also tried linearly mapping each variable to the interval [0, 1] instead of the mean/standard-deviation rescaling, and I even tried replacing some variables with the corresponding deciles, but then again I don't know whether the PCA + k-means combination is the best way to perform the clustering in this case.
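For concreteness, the standardize → PCA → k-means baseline I described looks roughly like the sketch below (a minimal scikit-learn example; the simulated data only mimics the skewed and sparse columns described above, and the numbers of components and clusters are placeholders):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Simulated stand-in for the real data: skewed continuous columns
# plus very sparse 0/1 flag columns (~0.1% ones).
X_cont = rng.exponential(scale=2.0, size=(10_000, 20))
X_flags = (rng.random((10_000, 30)) < 0.001).astype(float)
X = np.hstack([X_cont, X_flags])

# Rescale (subtract mean, divide by standard deviation), then PCA, then k-means.
X_scaled = StandardScaler().fit_transform(X)
X_reduced = PCA(n_components=10).fit_transform(X_scaled)
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X_reduced)
```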

Best Answer

There is a paper, PCA on a DataFrame, that addresses exactly this problem. The techniques it builds on are collectively called Generalized Low Rank Models (ordinary PCA and Sparse PCA are special cases of this family of methods).
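Roughly, the idea (using the notation of the GLRM literature, not quoted from the linked paper) is to approximate the $n \times d$ data matrix $A$ by a rank-$k$ factorization $XY$, but with a separate loss for each column, so skewed numeric variables and 0/1 flags can each get an appropriate loss:

$$
\min_{X \in \mathbb{R}^{n \times k},\; Y \in \mathbb{R}^{k \times d}} \;
\sum_{(i,j) \in \Omega} L_j\big(x_i y_j,\, A_{ij}\big)
\;+\; \sum_{i=1}^{n} r(x_i)
\;+\; \sum_{j=1}^{d} \tilde r(y_j)
$$

Here $\Omega$ is the set of observed entries, $x_i$ is the $i$-th row of $X$, $y_j$ is the $j$-th column of $Y$, $L_j$ is a column-specific loss (e.g., quadratic for numeric columns, hinge or logistic for 0/1 flags), and $r, \tilde r$ are optional regularizers. With quadratic loss on every column and no regularization this reduces to standard PCA; the rows of $X$ are the low-dimensional representation you can then feed to k-means.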

If you are familiar with Python or R, you can try the GLRM implementation in the H2O library. It can handle categorical and continuous data in the same data set.
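A minimal sketch of what that might look like with H2O's Python API is below. The file path, rank, losses, regularization strengths, and number of clusters are all placeholder choices to adapt to your data, and retrieving the X factor via the model output is based on the pattern shown in H2O's GLRM documentation:

```python
import h2o
from h2o.estimators import H2OGeneralizedLowRankEstimator, H2OKMeansEstimator

h2o.init()

# "data.csv" is a placeholder; H2O infers categorical vs. numeric column types.
frame = h2o.import_file("data.csv")

# Fit a rank-10 GLRM: quadratic loss for numeric columns,
# categorical loss for factor columns, with some regularization.
glrm = H2OGeneralizedLowRankEstimator(
    k=10,
    loss="Quadratic",
    multi_loss="Categorical",
    transform="STANDARDIZE",
    regularization_x="Quadratic",
    regularization_y="Quadratic",
    gamma_x=0.1,
    gamma_y=0.1,
    max_iterations=500,
)
glrm.train(training_frame=frame)

# The low-dimensional representation X (one k-vector per row) is stored as a
# separate frame whose name is recorded in the model output; fetch it and
# run k-means on it instead of on the raw mixed-type columns.
x_key = glrm._model_json["output"]["representation_name"]
x = h2o.get_frame(x_key)

kmeans = H2OKMeansEstimator(k=5)
kmeans.train(training_frame=x)
clusters = kmeans.predict(x)
```

The rows of the recovered X frame play the same role as PCA scores, so any clustering algorithm can be applied to them.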