Clustering Large Datasets with Mixed Variable Types – How to Efficiently Classify Remaining Observations

clustering, gower-similarity, large data, mixed type data

I'm trying to run a cluster analysis on a large dataset (70k+ observations to cluster) with mixed variables (numeric, ordinal, binary and nominal). I don't think I can create the distance matrix using SAS over the entire dataset. So, I have tried to run a hierarchical clustering using Gower's distance over a subsample of my data. I've got some questions.

  1. If the above method (hierarchical clustering of a subsample) is
    appropriate, how can I then score the rest of the observations and
    assign (classify) them to the clusters obtained?

  2. If the above method isn't good, what are other recommended
    methods to cluster a large dataset with mixed variables? (Available
    in SAS if possible.)

  3. How can I check for correlations/multicollinearity among mixed
    variables? I don't know if running something like PCA or factor
    analysis makes sense with categorical data.

Best Answer

Hierarchical clustering in general does not scale well to large data sets. There are some special cases, such as SLINK (single linkage), that need only $O(n)$ memory and $O(n^2)$ runtime, whereas naive implementations need $O(n^2)$ memory and $O(n^3)$ runtime. So you may need to look into alternative methods such as DBSCAN. DBSCAN works with arbitrary distance measures, but you will probably not have index acceleration for Gower's distance, so it will need $O(n^2)$ runtime, too. It should still scale to 70k observations; I ran DBSCAN on 100k observations years ago. The key is to not compute a complete distance matrix, because that is what needs $O(n^2)$ memory: for 70k observations, a full matrix of 8-byte doubles is roughly 39 GB.
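To illustrate the point about not materializing the distance matrix, here is a minimal pure-Python sketch. It assumes ordinal variables are coded as ranks and treated like numeric ones (one common simplification of Gower's coefficient). The names `gower_distance`, `dbscan`, `kinds` and `ranges`, the toy rows, and the `eps`/`min_pts` values are all made up for the example. Every neighborhood query rescans the data, so the runtime is quadratic in the number of rows, but only the label list and a work queue are kept in memory.

```python
def gower_distance(a, b, kinds, ranges):
    """Gower dissimilarity between two records.

    kinds[j] is "num" (numeric, or ordinal coded as ranks) or "cat"
    (binary/nominal); ranges[j] is the observed range of numeric column j.
    """
    total = 0.0
    for x, y, kind, rng in zip(a, b, kinds, ranges):
        if kind == "num":
            total += abs(x - y) / rng if rng > 0 else 0.0
        else:
            total += 0.0 if x == y else 1.0
    return total / len(kinds)


def dbscan(points, eps, min_pts, dist):
    """Plain DBSCAN: labels are 0, 1, ... for clusters and -1 for noise."""
    UNSEEN, NOISE = -2, -1
    labels = [UNSEEN] * len(points)

    def neighbors(i):
        # One linear scan per query; distances are computed and discarded.
        return [j for j in range(len(points))
                if dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] != UNSEEN:
            continue
        if len(neighbors(i)) < min_pts:
            labels[i] = NOISE  # may be relabeled later as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [i]  # each point is enqueued at most once
        while queue:
            nb = neighbors(queue.pop())
            if len(nb) < min_pts:
                continue  # border point: in the cluster, but does not expand it
            for q in nb:
                if labels[q] == UNSEEN:
                    labels[q] = cluster
                    queue.append(q)
                elif labels[q] == NOISE:
                    labels[q] = cluster
    return labels


# Toy data: (numeric, ordinal rank, binary, nominal)
rows = [
    (25, 1, 0, "red"), (27, 1, 0, "red"), (26, 2, 0, "red"),
    (60, 4, 1, "blue"), (62, 4, 1, "blue"), (61, 3, 1, "blue"),
    (40, 1, 1, "green"),
]
kinds = ["num", "num", "cat", "cat"]
ranges = [
    max(r[j] for r in rows) - min(r[j] for r in rows) if kind == "num" else 0
    for j, kind in enumerate(kinds)
]

labels = dbscan(rows, eps=0.15, min_pts=3,
                dist=lambda a, b: gower_distance(a, b, kinds, ranges))
print(labels)
```

On real data you would want a compiled or vectorized implementation; the sketch only shows that the algorithm itself never needs the full matrix.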

However, neither approach offers an obvious way of classifying new observations. Clustering is simply a different task from classification. It is about getting a sketch of the structure in the data, which you then analyze and turn into knowledge. No clustering will ever be perfect, but it may be able to tell you something you did not know before. You should then formalize that result in a way that lets you make use of it later.

Obviously, a universal approach is to train a classifier on the cluster labels afterwards, and then use it to assign the remaining observations.
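As one possible instance of that idea, the sketch below uses a k-nearest-neighbor classifier under Gower's distance: each remaining observation gets the majority cluster of its closest rows in the clustered subsample. This is only one choice of classifier, not something the answer prescribes. The names `knn_assign`, `subsample` and `cluster_ids`, the toy data, and `k=3` are assumptions for illustration, and ordinal variables are again assumed to be coded as ranks.

```python
from collections import Counter


def gower_distance(a, b, kinds, ranges):
    """Gower dissimilarity; "num" columns are range-scaled, "cat" are match/mismatch."""
    total = 0.0
    for x, y, kind, rng in zip(a, b, kinds, ranges):
        if kind == "num":
            total += abs(x - y) / rng if rng > 0 else 0.0
        else:
            total += 0.0 if x == y else 1.0
    return total / len(kinds)


def knn_assign(new_row, labeled_rows, labels, kinds, ranges, k=3):
    """Return the majority cluster among the k labeled rows nearest to new_row."""
    nearest = sorted(
        range(len(labeled_rows)),
        key=lambda i: gower_distance(new_row, labeled_rows[i], kinds, ranges),
    )[:k]
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Clustered subsample (hypothetical labels, e.g. from cutting a dendrogram)
subsample = [
    (25, 1, 0, "red"), (27, 1, 0, "red"), (26, 2, 0, "red"),
    (60, 4, 1, "blue"), (62, 4, 1, "blue"), (61, 3, 1, "blue"),
]
cluster_ids = [0, 0, 0, 1, 1, 1]

kinds = ["num", "num", "cat", "cat"]
# Ranges are taken from the subsample here; using the full data set's
# ranges would keep every scaled difference within [0, 1].
ranges = [
    max(r[j] for r in subsample) - min(r[j] for r in subsample)
    if kind == "num" else 0
    for j, kind in enumerate(kinds)
]

remaining = [(30, 2, 0, "red"), (58, 3, 1, "blue")]
assigned = [knn_assign(r, subsample, cluster_ids, kinds, ranges) for r in remaining]
print(assigned)
```

Scoring each remaining row costs one pass over the subsample, so assigning all 70k observations is linear in the data size for a fixed subsample.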

I don't know what is available in SAS. I believe it only has the most basic methods available, nothing advanced.