Solved – k-means clustering of week-times

clustering, k-means, machine learning

I have data on meeting times. Each record has a weekday and an hour of the day.

I want to cluster the meeting times (I have reason to believe there are two different kinds of meetings that tend to occur at cluster means) into two clusters.

I want to use k-means, but I am not 100% sure what a sensible data arrangement would be. I have reason to believe that weekend vs. weekday is a meaningful distinction, so the distance from Friday to Saturday should probably be a bit more than the distance from Saturday to Sunday. (Right? Or I could add a third dimension, a weekend dummy.)

I was thinking it makes sense to cluster on (day, hour, weekend dummy(?)). When I standardize, the algorithm seems to put too much weight on weekend vs. weekday, and separates the clusters almost entirely that way.

My question is this: is there a way to still consider the influence of weekday vs. weekend without having it overly weighted in the clustering? Should I consider a different standardization for each variable? Potentially manipulating the variance by scaling?

Best Answer

Don't use clustering.

This data is better handled via visualization.

Plot a 7x24 activity heat map, draw the clusters using rectangles, and maybe measure the significance using some statistical model.
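As a rough sketch of the 7x24 grid behind such a heat map: the sample `meetings` list and the helper names `activity_grid` and `render` below are made up for illustration, and weekday 0 is assumed to be Monday. It only counts meetings per cell and prints a text version; drawing rectangles and testing significance are left out.

```python
from collections import Counter

# Hypothetical sample data: (weekday, hour) pairs, weekday 0 = Monday ... 6 = Sunday.
meetings = [(0, 9), (0, 9), (1, 10), (2, 9), (3, 14),
            (4, 9), (5, 20), (6, 20), (5, 20)]

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def activity_grid(records):
    """Count meetings per (weekday, hour) cell in a 7x24 grid."""
    counts = Counter(records)
    return [[counts[(d, h)] for h in range(24)] for d in range(7)]


def render(grid):
    """Render the grid as text: one row per weekday, '.' for empty cells.

    Counts above 9 are shown as 9 so every cell stays one character wide.
    """
    lines = []
    for name, row in zip(DAYS, grid):
        cells = "".join(str(min(c, 9)) if c else "." for c in row)
        lines.append(f"{name} {cells}")
    return "\n".join(lines)


grid = activity_grid(meetings)
print(render(grid))
```

The same `grid` could be handed to a plotting function such as matplotlib's `imshow` to get a colored heat map instead of text.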

K-means will not work well for cyclic (periodic) data such as hours and weekdays, which wrap around; and you cannot easily integrate information like weekday vs. weekend. It's just not a good match.
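To illustrate the wrap-around problem, here is a small sketch (the function names are hypothetical, and hours are assumed to be coded 0 to 23). K-means relies on plain differences and arithmetic means, which treat 23:00 and 00:00 as far apart and put the "average" of 23:00 and 01:00 at noon.

```python
import math


def linear_gap(h1, h2):
    """Difference as k-means sees it when the hour is a plain number."""
    return abs(h1 - h2)


def circular_gap(h1, h2, period=24):
    """Shortest distance around a cycle of length `period`."""
    d = abs(h1 - h2) % period
    return min(d, period - d)


def linear_mean(hours):
    """Arithmetic mean, as used for a k-means centroid."""
    return sum(hours) / len(hours)


def circular_mean(hours, period=24):
    """Mean direction on the cycle, mapped back to the range [0, period)."""
    angles = [2 * math.pi * h / period for h in hours]
    s = sum(math.sin(a) for a in angles)
    c = sum(math.cos(a) for a in angles)
    return (math.atan2(s, c) * period / (2 * math.pi)) % period


print(linear_gap(23, 0))       # 23
print(circular_gap(23, 0))     # 1
print(linear_mean([23, 1]))    # 12.0, i.e. noon
# Lands at midnight; rounding may show it as a value just under 24.
print(circular_mean([23, 1]))
```

The same applies to weekdays with `period=7`: Sunday (6) and Monday (0) are one day apart on the cycle, but six apart as plain numbers.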

Clustering like k-means is nothing but a heuristic; and a visual approach will do much better.
