Solved – normalisation in k-means clustering on percentages and other numerical variables

clustering, k-means, machine learning, normalization

I have several variables to include in k-means. Some of them are percentages (between 0 and 1) and some are other numerical variables (positive values). I know normalisation is required when variables are on different scales, so that they all fall in comparable ranges.

My question: since some of the variables are already between 0 and 1 (the percentages), should I normalise only the other variables and leave the percentages as they are? Or should I normalise the percentages too? (I am not sure that would make sense.)

I found several posts (for example: k-means clustering on percentages) but I am still not sure how to proceed… I would really appreciate your help. Thanks!

Best Answer

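I would apply feature scaling independently to each variable, percentages included. A percentage is bounded by 0 and 1, but its observed values may occupy only part of that interval. If, for example, your percentages vary between 0.55 and 0.85, then after min-max scaling 0.55 becomes your zero and 0.85 your one, so the variable covers the whole range just like the others.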
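To make the rescaling concrete, here is a minimal sketch of per-variable min-max scaling, x' = (x - min) / (max - min), in plain Python. The function name `min_max_scale` and the sample values (including the `income` column) are hypothetical and only for illustration; they are not from the question.

```python
def min_max_scale(column):
    """Rescale a list of numbers to [0, 1] using the column's own min and max."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Constant column: there is no spread to rescale, so map every value to 0.0
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


# Hypothetical data: one percentage column and one positive-valued column
pct = [0.55, 0.60, 0.70, 0.85]
income = [1200.0, 3000.0, 4500.0, 9000.0]

# Each variable is scaled independently, using only its own observed range
pct_scaled = min_max_scale(pct)
income_scaled = min_max_scale(income)

print(pct_scaled)
print(income_scaled)
```

After this step both columns span 0 to 1, so neither dominates the Euclidean distances that k-means relies on merely because of its units. Whether min-max scaling or another scheme (such as standardising to zero mean and unit variance) suits your data better is a separate judgment call.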
