Solved – Creating a cluster analysis on multiple variables

clustering, r

I am working on a cluster analysis of some very basic data in R for Windows [Version 6.1.76]. The groups themselves are countries, and I have 2 columns of continuous numerical variables. I have applied Ward's hierarchical method to the data:

# Applying Ward Hierarchical Clustering
d = dist(conversion_set, method="euclidean")
fit = hclust(d, method="ward")

But I don't feel this represents what I am really trying to get at, as it seems to take only the first variable into account and disregard the second. Is there a way to include both variables in the clustering calculations?

My data looks similar to this

Country – Var 1 – Var 2

US – 10 – 20

Canada – 5 – 30

….

Best Answer

Try this toy example

conversion_set <- data.frame(Country=c("United States", "Canada", "Mexico", 
                             "Guatemala", "Belize", "Honduras"), 
                             Var1=c(10,  5, 65, 10, 40, 70),
                             Var2=c(20, 30, 60, 80, 25, 90) ) 
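# Drop the Country column so that dist() sees only the numeric variables,
# and keep the country names as row labels for the dendrogram
numbers_only <- conversion_set[,-1]
rownames(numbers_only) <- conversion_set[,1]
# Applying Ward Hierarchical Clustering
d   <- dist(numbers_only, method="euclidean")
# "ward" was renamed "ward.D" in R 3.1.0; the algorithm is unchanged
fit <- hclust(d, method="ward.D")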
plot(fit)

which puts Belize closer to the United States and Canada than Guatemala is, and also puts Mexico and Honduras closer together than to Guatemala, as in

[Image: dendrogram produced by plot(fit)]
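
If you would rather check the grouping numerically than read it off the dendrogram, one option is to cut the tree with cutree() and to print the distance matrix. The following is only a sketch under a few assumptions: it rebuilds the toy data so that it runs on its own, the choice of k = 3 clusters is arbitrary and purely for illustration, and it uses "ward.D", the current name of the method formerly called "ward".

```r
# Toy data from the example above, rebuilt so this block is self-contained
numbers_only <- data.frame(
  Var1 = c(10,  5, 65, 10, 40, 70),
  Var2 = c(20, 30, 60, 80, 25, 90),
  row.names = c("United States", "Canada", "Mexico",
                "Guatemala", "Belize", "Honduras")
)

# Euclidean distances computed from both numeric columns
d   <- dist(numbers_only, method = "euclidean")
# "ward.D" is the current name of the method formerly called "ward"
fit <- hclust(d, method = "ward.D")

# Cut the tree into k groups (k = 3 is an arbitrary, illustrative choice)
groups <- cutree(fit, k = 3)
print(groups)

# Pairwise distances, to see that both Var1 and Var2 contribute
print(round(as.matrix(d), 1))
```

With k = 3 this should group the United States, Canada and Belize together, put Mexico and Honduras in a second group, and leave Guatemala on its own, which agrees with the dendrogram. If the two variables were measured on very different scales, standardising them first (for example with scale()) might be worth considering, since otherwise the variable with the larger spread tends to dominate the Euclidean distance.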
