The function MDSplot plots the (PCA of the) proximity matrix. From the documentation for randomForest, the proximity matrix is:
A matrix of proximity measures among the input (based on the frequency that pairs of data points are in the same terminal nodes).
Based on this description, we can guess at what the different plots mean. You seem to have specified k=4, which means a decomposition of the proximity matrix into 4 components. For each entry (i,j) in this matrix of plots, what is plotted is dimension i of the decomposition against dimension j.
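For concreteness, a minimal sketch of the call (hypothetical names: df holds your features and behaviour is the class factor):
library(randomForest)
rf <- randomForest(behaviour ~ ., data = df, proximity = TRUE)  # proximity is needed for MDSplot
MDSplot(rf, df$behaviour, k = 4)  # with k > 2 you get the pairs-style matrix of Dim i vs Dim j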
I did a PCA on the same data and got a nice separation between all the classes in PC1 and PC2 already, but here Dim1 and Dim2 seem to separate just 3 behaviours. Does this mean that these three behaviours are more dissimilar than all the others (i.e. MDS puts the greatest dissimilarities into the first dimensions, but does not necessarily separate all variables in the first step)?
MDS can only base its analysis on the output of your randomForest. If you're expecting a better separation, then you might want to check the classification performance of your randomForest. Another thing to keep in mind is that your PCA is mapping from 9-dimensional data to 2 dimensions, but the MDS is mapping from an N×N proximity matrix to 2 dimensions, where N is the number of data points.
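To see what the MDS is actually working with, here is a sketch that reproduces it by hand (this is essentially what MDSplot does internally, assuming rf from above was fit with proximity = TRUE):
d <- 1 - rf$proximity              # N x N dissimilarity: one entry per pair of data points
mds <- cmdscale(d, k = 2)          # classical MDS on the forest's dissimilarity
plot(mds, col = as.numeric(df$behaviour))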
What does the positioning of the three clusters (e.g. in Dim1 and Dim2) indicate?
It just tells you how far apart (relatively) these clusters are from each other. It's a visualisation aid, so I wouldn't over-interpret it.
Since I'm rather new to R, I also have problems adding a legend to this plot (though I have an idea what the different colours mean). Maybe somebody could help?
The way R works, there's no way to add a legend after the fact (unlike in, say, Matlab, where this information is stored inside the figure object). However, looking at the code for MDSplot, we see that the relevant block is:
palette <- if (require(RColorBrewer) && nlevs < 12) brewer.pal(nlevs, "Set1")
...
plot(rf.mds$points, col = palette[as.numeric(fac)], pch = pch, ...)
So the colours will be taken from that palette, and mapped to the levels (behaviours) in whichever order you've given them. So if you want to plot a legend:
legend(x, y, levels(fac), col = brewer.pal(nlevs, "Set1"), pch = 20)  # x, y: legend position; pch should match the plot (MDSplot's default is 20)
would probably work.
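Putting it together, something like this should give you a labelled plot (a sketch, reusing rf and df from above; "topright" is just one possible position):
library(RColorBrewer)
fac <- df$behaviour                 # the factor you passed to MDSplot
nlevs <- nlevels(fac)
MDSplot(rf, fac, k = 2)             # default pch = 20
legend("topright", legend = levels(fac),
       col = brewer.pal(nlevs, "Set1"), pch = 20)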
The problem, in particular with k-means applied to real-world labeled data, is that clusters will usually not agree with your labels very well, unless you either generated the labels with a similar clustering algorithm (a self-fulfilling prophecy) or the data set is really simple.
Have you tried computing the k-means statistics, such as the within-cluster sum of squares, on the partition induced by your original labels? I would not be at all surprised if they are substantially worse than after running k-means.
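A quick way to check this (a sketch, assuming x is your feature matrix and labels is your behaviour factor):
wss <- function(x, groups)          # within-group sum of squares of a given partition
  sum(sapply(split(as.data.frame(x), groups),
             function(g) sum(scale(g, scale = FALSE)^2)))
wss(x, labels)                                     # partition induced by your labels
kmeans(x, centers = nlevels(labels))$tot.withinss  # partition found by k-means, usually lower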
I figure it's just another case of an algorithm that does not fit your problem.
Evaluating clustering algorithms is really hard, because you are trying to find something you do not know yet. Even if the clustering reproduced the original labels, it would actually have failed: it did not tell you anything new, and you could just have used the labels instead.
Maybe the most realistic evaluation of a clustering algorithm is the following: if you incorporate the result of the clustering into a classification algorithm, does it improve the classification accuracy significantly? That is, treat clustering as preprocessing/support for an algorithm that you can evaluate properly.
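A sketch of that idea, reusing the hypothetical df/behaviour names from above:
feats <- df[, setdiff(names(df), "behaviour")]     # features only
km <- kmeans(feats, centers = 10)                  # 10 clusters: an arbitrary choice
with_cl <- cbind(df, cluster = factor(km$cluster)) # cluster id as an extra feature
randomForest(behaviour ~ ., data = df)             # baseline OOB error
randomForest(behaviour ~ ., data = with_cl)        # does the cluster feature reduce the OOB error?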
Best Answer
I know that you asked for R solutions, but in Python, specifically scikit-learn, there's an interesting class that implements a random forest embedding (RandomTreesEmbedding). It constructs a random forest without class label information. As output you get a new dataset in which your objects are embedded in a binary feature space: one feature for each leaf of each tree in the forest (a huge, sparse feature space, depending on how many trees you use). Each feature is 1 when the object falls into that specific leaf of that specific tree.
The idea is that if two objects consistently fall in the same leaves across the forest, they are likely to be somewhat similar. You can then visualize this new dataset with an MDS plot, for example using the Hamming distance.
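Since the question asked for R: a rough analogue of the same idea is to run randomForest in unsupervised mode (no labels) and feed the resulting dissimilarity to cmdscale; 1 - proximity is precisely the fraction of trees in which two points land in different leaves, i.e. a Hamming-style distance on the leaf membership (a sketch, with x the feature matrix):
library(randomForest)
urf <- randomForest(x, ntree = 500, proximity = TRUE)  # no y given: unsupervised mode
d <- 1 - urf$proximity   # fraction of trees where two points land in different leaves
plot(cmdscale(d, k = 2)) # MDS view of the embedding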
This is a non-linear embedding, so it may well be that the forest is able to disentangle your objects and yield a meaningful clustering.
Or, since I am not a fan of clustering in spaces developed for visualisation, you can directly use a clustering algorithm that works with non-Euclidean distances (do not use k-means!) and try to get a more interesting clustering.
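For example, k-medoids (PAM) accepts an arbitrary dissimilarity directly, so you can cluster on the forest distance itself rather than on the 2-D MDS coordinates (a sketch, continuing from urf above; the number of clusters is an arbitrary choice):
library(cluster)
pam_fit <- pam(as.dist(1 - urf$proximity), k = 5)  # k = 5: pick via e.g. silhouette widths
table(pam_fit$clustering)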