Solved – How to cluster the data with binary variables

clustering, distance, r

I have a dataset like df (constructed below) and I want to cluster it in R. The variables are binary, indicating whether or not a person uses a given programming language. I have these questions:

1- How can I visualize my data? So far I have tried a heatmap (with clustering on both rows and columns).

2- What is the best distance calculation method for this data?

3- What is the best clustering method for this data?

4- Should I consider normalizing the variables, i.e. (value - mean) / sd, before clustering? All the variables are either zero or one, but they have different standard deviations.

language <- c(
  "R, Matlab",
  "Assembly, R, Go, Rust",
  "Java, Javascript, Ruby, SQL",
  "Java, Ruby",
  "C, C++",
  "PHP, Javascript, Ruby, Assembly, Swift, R, Matlab, Go, Haskell",
  "R",
  "Perl, Javascript, R",
  "Javascript, Ruby, Bash",
  "Python, PHP, Javascript",
  "Java",
  "Java, C"
)

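# One row per respondent (this data frame is overwritten by the recast below)
df <- as.data.frame(language, stringsAsFactors = FALSE)

# Split each answer into individual languages and build a 0/1 indicator table:
# one row per respondent (L1), one column per language (value)
df <- reshape2::recast(
  data = setNames(strsplit(language, ", ", fixed = TRUE), language),
  formula = L1 ~ value,
  fun.aggregate = length
)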

str(df)

Best Answer

Here are my answers to your questions:

4- If your variables are all either 0 or 1, you should not have to normalize them: they are already on the same scale.
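To see what standardizing would actually do to 0/1 columns, here is a minimal sketch on made-up toy data (the column names common and rare are hypothetical, not from the question). It only illustrates that z-scoring stretches rarely used variables more than common ones, which changes their weight in any distance computed afterwards.

```r
# Toy 0/1 data (hypothetical): one commonly used and one rarely used language
x <- cbind(common = c(1, 1, 1, 1, 1, 0, 0, 0, 0, 0),
           rare   = c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0))

# z-score each column: (value - mean) / sd
z <- scale(x)

# Gap between a 0 and a 1 in each column, before and after scaling
gap_raw    <- apply(x, 2, function(col) diff(range(col)))
gap_scaled <- apply(z, 2, function(col) diff(range(col)))

print(gap_raw)     # both columns span exactly 1
print(gap_scaled)  # the rare column spans a wider range than the common one
```

After scaling, the values are no longer 0/1, so distance measures that expect presence/absence coding can no longer be applied directly.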

2- There are many distance measures available. My default is Euclidean distance (for environmental variables), but for your data you should look for one designed for binary variables, such as the Jaccard distance.
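As one possible starting point, base R's dist() has a method = "binary" option, which for 0/1 data gives the Jaccard distance (the share of languages used by exactly one of two people, among languages used by at least one of them). The sketch below uses a subset of the question's data and builds the indicator matrix in base R; the object names x and d are only illustrative, and this is one candidate measure rather than the definitive choice.

```r
# A few respondents taken from the question's data
language <- c("R, Matlab", "R", "Java, Ruby", "Java", "C, C++", "Java, C")

# Build a respondents x languages 0/1 matrix in base R
langs     <- strsplit(language, ", ", fixed = TRUE)
all_langs <- sort(unique(unlist(langs)))
x <- t(vapply(langs, function(l) as.integer(all_langs %in% l),
              integer(length(all_langs))))
dimnames(x) <- list(language, all_langs)

# Jaccard-type distance for presence/absence data
d <- dist(x, method = "binary")
print(round(as.matrix(d), 2))
```

Two people with no language in common get a distance of 1, while shared absences (languages neither person uses) do not make them look more similar.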

1- You can use non-metric multidimensional scaling (NMDS) from the vegan package to visualize your chosen distance matrix. The result is a 2D plot with one point per sample: points that are close together are the most similar, and points that are far apart are the most dissimilar.
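A rough sketch of that NMDS step, assuming the vegan package is installed and using the binary (Jaccard) distance from base R's dist() as the chosen distance; that choice, the seed, and the object names are assumptions for illustration only.

```r
library(vegan)  # assumed to be installed

# The question's data
language <- c(
  "R, Matlab", "Assembly, R, Go, Rust", "Java, Javascript, Ruby, SQL",
  "Java, Ruby", "C, C++",
  "PHP, Javascript, Ruby, Assembly, Swift, R, Matlab, Go, Haskell",
  "R", "Perl, Javascript, R", "Javascript, Ruby, Bash",
  "Python, PHP, Javascript", "Java", "Java, C"
)

# Respondents x languages 0/1 matrix
langs     <- strsplit(language, ", ", fixed = TRUE)
all_langs <- sort(unique(unlist(langs)))
x <- t(vapply(langs, function(l) as.integer(all_langs %in% l),
              integer(length(all_langs))))
dimnames(x) <- list(language, all_langs)

# Distance matrix, then a 2D NMDS of it
d <- dist(x, method = "binary")
set.seed(1)  # NMDS uses random starts
fit <- metaMDS(d, k = 2, trace = FALSE)

# One point per respondent, labelled by row number to keep the plot readable
plot(fit$points, type = "n", xlab = "NMDS1", ylab = "NMDS2")
text(fit$points, labels = seq_len(nrow(x)))
```

With only twelve respondents the configuration can be unstable between runs, so it is worth checking fit$stress and rerunning with different seeds.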

3- I previously used the PRIMER software for cluster analysis, and the clustsig package appears to do much the same thing in R. It is worth looking into for your clustering.
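The sketch below is not the clustsig procedure; it swaps in plain average-linkage hierarchical clustering from base R (hclust) on a binary distance matrix, just to show the general shape of such an analysis. The subset of respondents, the linkage method, and the choice of k = 3 groups are all assumptions made for illustration.

```r
# A few respondents taken from the question's data
language <- c("R, Matlab", "R", "Java, Ruby", "Java", "C, C++", "Java, C")

# Respondents x languages 0/1 matrix
langs     <- strsplit(language, ", ", fixed = TRUE)
all_langs <- sort(unique(unlist(langs)))
x <- t(vapply(langs, function(l) as.integer(all_langs %in% l),
              integer(length(all_langs))))
dimnames(x) <- list(language, all_langs)

# Hierarchical clustering on the binary (Jaccard) distance
d  <- dist(x, method = "binary")
hc <- hclust(d, method = "average")

# Cut the tree into k groups (k = 3 is an arbitrary choice here)
groups <- cutree(hc, k = 3)
print(groups)

# Dendrogram of the respondents
plot(hc)
```

Unlike this sketch, a significance-based approach would not need the number of groups to be chosen by hand.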