Solved – Clustering a dataset with both discrete and continuous variables

Tags: clustering, continuous data, discrete data, gaussian mixture distribution, k-means

I have a dataset X which has 10 dimensions, 4 of which are discrete values.
In fact, those 4 discrete variables are ordinal, i.e. a higher value means a higher/better level.

2 of these discrete variables behave more like categorical ones, in the sense that the spacing between their levels is not uniform: the distance from 11 to 12, for example, is not the same as the distance from 5 to 6. A higher value still means a higher level in reality, but the scale is not necessarily linear (in fact, it is not really defined).

My question is:

Is it a good idea to apply a common clustering algorithm (e.g. K-Means and then a Gaussian Mixture Model (GMM)) to this dataset, which contains both discrete and continuous variables?

If not:

Should I remove the discrete variables and focus only on the continuous ones?
Would it be better to discretize the continuous ones and use a clustering algorithm for discrete data?

Best Answer

So you've been told you need an appropriate distance measure. Here are some leads:

and, of course: Mahalanobis distance.
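
As a rough illustration of the Mahalanobis distance mentioned above, here is a minimal pure-Python sketch. It assumes the ordinal variables are coded as numbers and treated like continuous ones, which the question itself flags as questionable for the non-linear scales. The helper names (`covariance_matrix`, `solve`, `mahalanobis`) and the toy rows are made up for this example.

```python
from statistics import mean

def covariance_matrix(rows):
    """Sample covariance matrix (n - 1 denominator) of equal-length rows."""
    n, d = len(rows), len(rows[0])
    mu = [mean(r[j] for r in rows) for j in range(d)]
    return [[sum((r[i] - mu[i]) * (r[j] - mu[j]) for r in rows) / (n - 1)
             for j in range(d)] for i in range(d)]

def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    # Back-substitution on the upper-triangular system.
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x

def mahalanobis(u, v, cov):
    """sqrt((u - v)^T cov^-1 (u - v)), computed without forming the inverse."""
    diff = [a - b for a, b in zip(u, v)]
    return sum(d * y for d, y in zip(diff, solve(cov, diff))) ** 0.5

# Toy rows (hypothetical): [continuous measurement, ordinal code].
rows = [[1.0, 1], [2.0, 2], [3.0, 2], [4.0, 4], [5.0, 5]]
cov = covariance_matrix(rows)
print(mahalanobis(rows[0], rows[4], cov))
```

The resulting pairwise distances could then be fed to any clustering method that accepts a precomputed distance matrix. Because the distance rescales by the covariance, correlated or differently scaled variables do not dominate, but it does not fix the undefined spacing of the ordinal levels.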
