I have a dataset like df (built below) and I want to cluster it in R. The variables are binary, indicating whether or not a person uses a given programming language. I have these questions:
1- How can I visualize my data? So far I have tried a heatmap (with clustering on both rows and columns).
2- What is the best distance measure for this data?
3- What is the best clustering method for this data?
4- Should I consider normalizing the variables, i.e. (value - mean)/sd, before clustering? All the variables are either zero or one, but they have different standard deviations.
language <- c(
"R, Matlab",
"Assembly, R, Go, Rust",
"Java, Javascript, Ruby, SQL",
"Java, Ruby",
"C, C++",
"PHP, Javascript, Ruby, Assembly, Swift, R, Matlab, Go, Haskell",
"R",
"Perl, Javascript, R",
"Javascript, Ruby, Bash",
"Python, PHP, Javascript",
"Java",
"Java, C"
)
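df <- as.data.frame(language, stringsAsFactors = FALSE)  # overwritten by the recast() result below
# Build a person-by-language indicator table: split each string on ", ",
# melt the named list to long form (columns `value` and `L1`), then count.
df <- reshape2::recast(
  data = setNames(strsplit(language, ", ", fixed = TRUE), language),
  formula = L1 ~ value,
  fun.aggregate = length
)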
str(df)
Best Answer
Here are my answers to your questions:
4- If your variables are all either 0 or 1 you should not have to normalize.
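On the standard-deviation point: for a 0/1 variable the standard deviation is fully determined by the share of ones, so dividing by it mostly inflates the weight of rarely used languages. A minimal sketch in base R, assuming the `language` vector from the question (the names `m`, `p` and `s` are made up for illustration):

```r
language <- c(
  "R, Matlab", "Assembly, R, Go, Rust", "Java, Javascript, Ruby, SQL",
  "Java, Ruby", "C, C++",
  "PHP, Javascript, Ruby, Assembly, Swift, R, Matlab, Go, Haskell",
  "R", "Perl, Javascript, R", "Javascript, Ruby, Bash",
  "Python, PHP, Javascript", "Java", "Java, C"
)

# Person-by-language 0/1 matrix built with base R only
langs <- strsplit(language, ", ", fixed = TRUE)
all_langs <- sort(unique(unlist(langs)))
m <- t(sapply(langs, function(x) as.integer(all_langs %in% x)))
dimnames(m) <- list(language, all_langs)

n <- nrow(m)
p <- colMeans(m)       # share of people using each language
s <- apply(m, 2, sd)   # sample standard deviation of each column

# For a binary column, sd depends only on p: sqrt(p * (1 - p) * n / (n - 1))
all.equal(s, sqrt(p * (1 - p) * n / (n - 1)))

# After z-scoring, a "1" in a rare column becomes a much larger value
# than a "1" in a common column
head(sort((1 - p) / s, decreasing = TRUE))
```

So the differing standard deviations carry no information beyond how common each language is, which is one way to see why scaling is not needed here.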
2- There are many distance measures available. My default is Euclidean distance (for environmental variables), but you should look for one that is appropriate for binary variables.
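As one possible candidate for binary data, the Jaccard distance ignores shared absences; in base R it is available as `dist(..., method = "binary")`. The sketch below compares it with Euclidean distance on the question's data. It is an illustration under that assumption, not a claim that Jaccard is the best choice, and the object names are made up:

```r
language <- c(
  "R, Matlab", "Assembly, R, Go, Rust", "Java, Javascript, Ruby, SQL",
  "Java, Ruby", "C, C++",
  "PHP, Javascript, Ruby, Assembly, Swift, R, Matlab, Go, Haskell",
  "R", "Perl, Javascript, R", "Javascript, Ruby, Bash",
  "Python, PHP, Javascript", "Java", "Java, C"
)

# Person-by-language 0/1 matrix built with base R only
langs <- strsplit(language, ", ", fixed = TRUE)
all_langs <- sort(unique(unlist(langs)))
m <- t(sapply(langs, function(x) as.integer(all_langs %in% x)))
dimnames(m) <- list(language, all_langs)

# method = "binary" is the Jaccard distance:
# 1 - (languages shared) / (languages used by at least one of the two)
d_jac <- as.matrix(dist(m, method = "binary"))
d_euc <- as.matrix(dist(m, method = "euclidean"))

a <- "Assembly, R, Go, Rust"
b <- "PHP, Javascript, Ruby, Assembly, Swift, R, Matlab, Go, Haskell"

# "R" and "Java" share nothing; a and b share three languages.
d_euc["R", "Java"]  # sqrt(2), about 1.41
d_euc[a, b]         # sqrt(7), about 2.65
d_jac["R", "Java"]  # 1
d_jac[a, b]         # 0.7
```

Euclidean distance ranks the two people with no language in common as closer than the two who share three, because it only counts mismatches; Jaccard reverses that ordering.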
1- You can try non-metric multidimensional scaling using the package vegan to visualize your selected distance matrix. In the end, you'll have a 2D plot of your samples in which points that are closer together are more similar and points that are farther apart are more dissimilar.
3- I previously used the software PRIMER to do clustering analysis, and the package clustsig seems to do pretty much the same thing in R. You should look into this package to perform your clustering analysis.
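To make points 1 and 3 concrete, here is a rough sketch using base R only. Note the substitutions: it uses classical (metric) MDS via `cmdscale` as a dependency-free stand-in for the non-metric MDS that vegan provides, and plain `hclust` rather than clustsig. The Jaccard distance, the average linkage and `k = 3` are all arbitrary assumptions for illustration:

```r
language <- c(
  "R, Matlab", "Assembly, R, Go, Rust", "Java, Javascript, Ruby, SQL",
  "Java, Ruby", "C, C++",
  "PHP, Javascript, Ruby, Assembly, Swift, R, Matlab, Go, Haskell",
  "R", "Perl, Javascript, R", "Javascript, Ruby, Bash",
  "Python, PHP, Javascript", "Java", "Java, C"
)

# Person-by-language 0/1 matrix built with base R only
langs <- strsplit(language, ", ", fixed = TRUE)
all_langs <- sort(unique(unlist(langs)))
m <- t(sapply(langs, function(x) as.integer(all_langs %in% x)))
dimnames(m) <- list(language, all_langs)

d <- dist(m, method = "binary")       # Jaccard distance between people

# Hierarchical clustering (assumed: average linkage, 3 groups)
hc <- hclust(d, method = "average")
groups <- cutree(hc, k = 3)

# 2D ordination of the same distance matrix (classical MDS)
xy <- cmdscale(d, k = 2)

# Dendrogram, then the ordination with people numbered by row and
# coloured by cluster membership
ids <- seq_len(nrow(m))
plot(hc, labels = ids, main = "Average linkage on Jaccard distance")
plot(xy, type = "n", xlab = "MDS 1", ylab = "MDS 2")
text(xy, labels = ids, col = groups)
```

Swapping `cmdscale(d, k = 2)` for an NMDS fit from vegan, or `hclust` for another clustering routine, should leave the rest of the sketch unchanged, since both steps only consume the distance object `d`.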