Solved – Plotting k-means clustering with mixed numerical/categorical data

categorical data, data visualization, k-means, r

I have a dataset in CSV format that looks as follows:

guid,eventA.location,eventA.time,eventB.location,eventB.time,...
a12b,server3,1424474828.1804667,server7,1424474828.1804668,...
a12c,server3,1424474829.4444667,server2,1424474838.3334668,...

Each row has a unique guid, and the columns come in pairs of location and time. The locations are one of a small set of 10 possible values, server1 through server10. The times are in seconds since epoch. There are 400 guids, and about 40 events (so about 80 columns). Some cells may have NA values, but not too many, so I'm happy to get rid of the rows that have them.

How do I perform a k-means clustering on this data, and then create a nice plot of it? I'm not sure how to handle the non-numeric data, the NA values, or the fact that the time scale is very tight (all times fall within 30 s of each other, so relative to the absolute since-epoch values the differences look negligible, but they really aren't).

Here's what I've tried so far. I haven't gotten very far, and the error messages don't make much sense to me:

> x <- read.csv('/path/to/file')
> km <- kmeans(x, 3) 
Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning messages:
1: In do_one(nmeth) : NAs introduced by coercion
2: In do_one(nmeth) : NAs introduced by coercion
> km <- kmeans(na.omit(x), 3)
Error in sample.int(m, k) : invalid first argument
> km <- kmeans(factor(na.omit(x)), 3)
Error in sort.list(y) : 'x' must be atomic for 'sort.list'
Have you called 'sort' on a list?

I've also run daisy(na.omit(x)) but I'm not sure what to make of the output:

Dissimilarities :
dissimilarity(0)

Metric :  mixed ;  Types = N, I, N, I, N, I, N, I, N, I, N, I, N, I, N, I, N, I, N, I, N, I, N, I, A, I, N, I, N, I, N, I, N, I, N, I, N, I, A, I, N, I, N, I, N, I, N, I, A, N, I, N, I, N, I, N, I, N, I, N, I, N, A, N, I, N, I, N, I, N, I, N, I, N, I, N, I, N, I, A, I, N, I, N, I, N, I, N, I, N, I, N
Number of objects : 0
There were 50 or more warnings (use warnings() to see the first 50)

Best Answer

When using the kmeans() clustering function in R, it helps to consult its documentation: https://stat.ethz.ch/R-manual/R-devel/library/stats/html/kmeans.html

Here you will see that the matrix is required to be numeric:

x: numeric matrix of data, or an object that can be coerced to such a matrix (such as a numeric vector or a data frame with all numeric columns)
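Before choosing a fix, it can help to check the data against this requirement. The sketch below uses a small made-up data frame in the same layout as the question (the values and the three rows are assumptions, not the real file). It reports which columns are numeric, where the NA values sit, and how many rows are complete. The `sample.int(m, k)` error and the `Number of objects : 0` line in the question would both be consistent with `na.omit(x)` leaving zero rows, which this kind of check would confirm.

```r
# Toy data in the same layout as the question (values are made up).
x <- data.frame(
  guid            = c("a12b", "a12c", "a12d"),
  eventA.location = c("server3", "server3", NA),
  eventA.time     = c(1424474828.18, 1424474829.44, 1424474830.02),
  eventB.location = c("server7", "server2", "server7"),
  eventB.time     = c(1424474828.18, NA, 1424474831.50),
  stringsAsFactors = TRUE
)

numeric_cols <- sapply(x, is.numeric)   # columns kmeans() can use directly
na_per_col   <- colSums(is.na(x))       # number of missing values per column
complete_n   <- sum(complete.cases(x))  # rows that would survive na.omit()

print(numeric_cols)
print(na_per_col)
print(complete_n)
```

If `complete_n` turns out to be 0 on the real file, every row has at least one NA, and dropping the sparsest columns before dropping rows may be a better starting point than `na.omit(x)` alone.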

Since the matrix must be numeric, you have several options for handling the categorical columns, most of which are covered in the thread on clustering mixed data:

Clustering a dataset with both discrete and continuous variables
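As one illustration of those options, here is a minimal sketch, not a definitive recipe. It assumes simulated data in place of the CSV (60 guids, three events, made-up values), drops incomplete rows, standardizes the time columns, one-hot encodes the location columns, runs kmeans(), and plots the clusters on the first two principal components. The choice of three clusters and the equal weighting of time and location features are assumptions you would want to revisit.

```r
set.seed(42)  # kmeans() starts from random centers

# Simulated stand-in for the CSV: 60 guids, 3 events (made-up values).
n <- 60
servers <- paste0("server", 1:10)
t0 <- 1424474828
x <- data.frame(guid = sprintf("g%03d", seq_len(n)))
for (ev in c("eventA", "eventB", "eventC")) {
  x[[paste0(ev, ".location")]] <- factor(sample(servers, n, replace = TRUE),
                                         levels = servers)
  x[[paste0(ev, ".time")]] <- t0 + runif(n, 0, 30)
}
x[5, "eventB.time"] <- NA  # one incomplete row

x <- na.omit(x)  # drop rows with any NA

time_cols <- grep("\\.time$", names(x), value = TRUE)
loc_cols  <- grep("\\.location$", names(x), value = TRUE)

# Times: scale() centers each column (removing the large epoch offset)
# and divides by its standard deviation, so the ~30 s spread is what counts.
times <- scale(as.matrix(x[time_cols]))

# Locations: one 0/1 indicator column per observed server, per event.
dummies <- do.call(cbind, lapply(loc_cols, function(col) {
  f <- factor(x[[col]])
  m <- model.matrix(~ f - 1)
  colnames(m) <- paste(col, levels(f), sep = "=")
  m
}))

features <- cbind(times, dummies)
km <- kmeans(features, centers = 3, nstart = 25)

# Project onto two principal components to get a 2-D picture of the clusters.
pca <- prcomp(features)
plot(pca$x[, 1:2], col = km$cluster, pch = 19,
     main = "k-means clusters on the first two principal components")
```

One-hot encoding makes kmeans() run, but Euclidean distance on 0/1 indicators is only a rough measure of categorical similarity. An alternative that avoids the encoding is to compute a Gower dissimilarity with cluster::daisy() and cluster it with cluster::pam(), which is k-medoids rather than k-means.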
