Solved – How to Cluster with Non-normal data

clustering

I have a mix of normal and non-normal data that I would like to cluster on: 201 observations and 44 variables in total. The normal variables are things like height and weight. The non-normal variables are the frequency of an event divided by a player's minutes, times a scalar factor. For example,

$$\text{PointsPerX} = \frac{\text{Points}}{\text{Minutes}} \times X$$
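As a quick illustration of that formula, here is a minimal sketch. The function name `per_x_rate`, the choice of $X = 36$, and the player totals are all made up for the example; they are not from my data.

```python
def per_x_rate(count, minutes, x=36.0):
    """Return the number of events per `x` minutes of playing time."""
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    return count / minutes * x


# Hypothetical player: 410 points in 820 minutes, scaled to X = 36 minutes.
points_per_36 = per_x_rate(410, 820, x=36.0)
print(points_per_36)
```

Players with very few minutes get unstable rates under this formula, which may be one source of the long right tails.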

I have tried both a square root and a log transformation, and the data are slightly more normal but still very far off (the $p$-value increases by about $0.0001$, to $p = 0.0001$, for most variables).
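To show the kind of effect I mean, here is a rough sketch on made-up, right-skewed rates. It compares skewness before and after each transformation instead of running a formal normality test, and it uses `log1p` (log of 1 + x) on the assumption that some rates can be zero. The values and the `skewness` helper are illustrative only.

```python
import math
import statistics


def skewness(values):
    """Population skewness: the mean of the cubed z-scores."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum(((v - mean) / sd) ** 3 for v in values) / len(values)


# Made-up per-minute rates with a long right tail.
rates = [0.5, 0.8, 1.0, 1.2, 1.5, 2.0, 3.0, 6.0, 12.0]
sqrt_rates = [math.sqrt(v) for v in rates]
log_rates = [math.log1p(v) for v in rates]  # log(1 + x), safe at x = 0

for name, data in [("raw", rates), ("sqrt", sqrt_rates), ("log1p", log_rates)]:
    print(f"{name:>5}: skewness = {skewness(data):.2f}")
```

On this toy sample both transformations pull the skewness toward zero without removing it, which matches what I see: better, but still clearly not normal.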

Anyway, I proceeded with clustering. First I scaled the data to have mean 0 and SD 1, then used NbClust and the BIC from the mclust package. I get 2 and 1 clusters, respectively. However, I suspect that NbClust only reports 2 because that is the minimum number it considers. The low number of clusters makes sense to me: the data are not normal, so they won't cluster very well.
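For reference, the scaling step amounts to the following. This is a standard-library sketch rather than my actual R code; it uses the sample standard deviation (the $n-1$ denominator), which is what R's `scale()` does, and the heights are invented.

```python
import statistics


def standardize(column):
    """Center a column to mean 0 and scale it to sample SD 1."""
    mean = statistics.fmean(column)
    sd = statistics.stdev(column)  # sample SD (n - 1 denominator)
    if sd == 0:
        raise ValueError("cannot standardize a constant column")
    return [(v - mean) / sd for v in column]


# Made-up heights in centimetres.
heights_cm = [185.0, 190.0, 198.0, 203.0, 211.0]
z_heights = standardize(heights_cm)
print([round(z, 3) for z in z_heights])
```

Standardizing puts the 44 variables on a comparable scale for distance calculations, but it only shifts and rescales each one; it does nothing to the shape of a skewed distribution.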

How should I handle the issue of non-normal data? Should I reduce the number of variables I look at? My idea revolves around using where players scored their points or performed blocks, so I would prefer not to drop variables unless others have found from past experience that it helps clustering.

Many people have assumed data to be normal in past research when they are not, but that is a really poor thing to do in my mind. I have not tried fuzzy clustering yet, but I have a feeling it would give similar results to those above.

Best Answer

DBSCAN is a cool clustering algorithm that doesn't make assumptions about how the data are distributed. See http://scikit-learn.org/stable/modules/clustering.html#dbscan. Even though that description is from a Python library, there is also an R package for DBSCAN, called dbscan.
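To give a feel for how it works, here is a minimal pure-Python sketch of the DBSCAN idea. It is a toy for illustration, not the scikit-learn or R implementation, and the points, `eps`, and `min_pts` values are made up. A point is a core point if at least `min_pts` points (itself included) lie within distance `eps`; clusters grow outward from core points, and anything unreachable is labelled noise.

```python
import math

NOISE = -1


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE (-1)."""

    def neighbors(i):
        # Indices of all points within eps of point i (includes i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may be relabelled later as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:  # j is a core point, keep growing
                queue.extend(j_neighbors)
    return labels


# Two tight groups and one isolated point (invented data).
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1),
          (10.0, 0.0)]
labels = dbscan(points, eps=0.3, min_pts=3)
print(labels)
```

Note that the number of clusters is not specified in advance; it falls out of `eps` and `min_pts`, so those two parameters need tuning, and the data should be scaled first because the method is distance based. In practice you would use `sklearn.cluster.DBSCAN` in Python or the `dbscan()` function from the R package rather than a hand-rolled loop like this, which is quadratic in the number of points.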