First, let's generate some example data and cluster it (rFace, discrproj and plotcluster all come from the fpc package):
library(fpc)
data <- rFace(1000)
km <- kmeans(data, 6)
Now we can use discrproj to find a projection that separates these clusters (note that kmeans stores the assignments in km$cluster):
dp <- discrproj(data, km$cluster)
The result, dp, has several fields that are potentially useful. The field dp$proj contains the coordinates of the original data points projected onto the new space. This space has the same dimensionality as the original one, but its first two dimensions separate the clusters best (and those two dimensions are what plotcluster actually displays).
Compare:
plot(dp$proj[,1], dp$proj[,2], pch=km$cluster+48, col=km$cluster)  # pch 49-54 are the ASCII codes for '1'-'6', so each point is drawn as its cluster number
with:
plotcluster(data, km$cluster)
Suppose you get some new points in your original space. You can project them into the new space using the basis vectors in dp$units, like this:
newpts <- newdata %*% dp$units[,1:2]
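For concreteness, here is a minimal sketch. Drawing the new points from rFace is just an assumption to make the example self-contained; any numeric matrix with the same columns as data would do. It overlays the projected new points on the plot drawn above:
newdata <- as.matrix(rFace(100))              # hypothetical new points from the same generator
newpts <- newdata %*% dp$units[,1:2]          # project onto the first two basis vectors
points(newpts[,1], newpts[,2], pch=4, cex=2)  # overlay them as crosses on the existing plot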
That should answer your first question. Unfortunately, I think the second part is effectively unanswerable, because there are infinitely many points in the 31-dimensional space that correspond to a given point in the 2-dimensional space.
Best Answer
You are talking about two distinct problems here: visualisation and distance calculation. The second is much easier to answer than the first.
To calculate the Euclidean distance between two points, take the difference between their coordinates along each dimension, sum the squares of those differences, and take the square root. This works for any number of dimensions:
$D=\sqrt{\sum_i (x_i - y_i)^2}$
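As a sketch of how this looks in R, assuming the data and km objects from the answer above, the distance from each point to its assigned cluster centre can be computed in one pass, whatever the dimensionality:
centres <- km$centers[km$cluster, ]                # the centre assigned to each point, row-matched
d <- sqrt(rowSums((as.matrix(data) - centres)^2))  # Euclidean distances, dimension-agnostic
Points with an unusually large d are far from their own centre, which is often exactly what you want to inspect.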
The first part, visualisation, is much harder, and also has no single right answer: visualisation is simply a tool for checking that the clustering is doing what you think it is, and for understanding what is going on. If N gets very large, there is no simple way to do this.
For three dimensions, there are a couple of common approaches, each with its own pros and cons. For higher dimensions you have to resort to more approximate techniques.
Specific to K-Means, and particularly useful when K is low (e.g. 2), you can project the points onto the line joining a pair of cluster centres and plot the density of the points along that projection (see the sketch after the example below).
For example, suppose we go back to 2D and have a scatter chart in which two big blobs mark the KMeans centres, with a line drawn through the two centres. If you perpendicularly project each point onto that line, then you can view the distribution of the points around each centre, with the location of the means marked by thick lines. This second graph can be drawn regardless of how many dimensions you are working in, and is a way of seeing how well separated the clusters are.
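Here is a minimal R sketch of that projection, reusing the data and km objects from the code above; the choice of clusters 1 and 2 is arbitrary:
X  <- as.matrix(data)
c1 <- km$centers[1, ]                          # centre of the first cluster
c2 <- km$centers[2, ]                          # centre of the second cluster
v  <- (c2 - c1) / sqrt(sum((c2 - c1)^2))       # unit vector along the line joining the centres
pos <- drop((X - matrix(c1, nrow(X), ncol(X), byrow=TRUE)) %*% v)  # signed position of each point along that line
plot(density(pos), main="Density along the centre-to-centre line")
abline(v=c(0, sqrt(sum((c2 - c1)^2))), lwd=3)  # thick lines at the positions of the two centres
Two clearly separated bumps in the density indicate well-separated clusters.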