Depending on specifics, consider the following alternatives. You may be familiar with some, but perhaps not all, of these methods. Additionally, some of the papers referenced below describe algorithm modifications that might be appropriate for your specific task and data sets.
- K-means adaptations. For example, see this paper and this paper. Also, see this paper on using bootstrapping in K-means cluster analysis (while the paper focuses on speed, the space improvement is, IMHO, implied as well due to the nature of the bootstrapping process).
- Model-based clustering: mixture modeling. This is an interesting option, implemented in several R packages, most notably mclust (http://www.stat.washington.edu/mclust). The approach, methods and software are well presented in this vignette paper. A minimal usage sketch appears after this list.
- Model-based clustering: Dirichlet processes (DP). Another popular option is Bayesian Dirichlet mixture models and hierarchical DP. If I understand the material correctly, (probabilistic) topic modeling also fits this category and includes such approaches as latent Dirichlet allocation (LDA). Note: not to be confused with a different method with the same abbreviation, linear discriminant analysis (LDA), which is mostly used for dimensionality reduction, as I understand it. More information on LDA: this introductory paper, some other relevant publications and a very recent paper on a much improved LDA approach.
- Hierarchical clustering (HC). In addition to traditional HC, you may find some interesting hybrid approaches, such as Dirichlet diffusion trees, which apply the HC approach to DP mixtures (see this paper; other partially related research can be found via links on this page).
- Latent variable modeling (LVM)-based clustering. Clustering applications include latent class analysis (LCA)-based latent class clustering (LCC) (see this paper) and latent tree models, described here. For some discussion, including a comparison between LCA and K-means, see this page and this paper.
- Information theory-based clustering. For example, see this paper.
- Neural networks- and genetic algorithms-based clustering. For example, see this paper.
- If I remember correctly, I've also seen some papers on using entropy for classification, but I can't find them at the moment (I will update this answer if that changes).
- Some other interesting/relevant papers: a comparison of LCC and PAM; clustering with Bregman divergences (which probably belongs to the information theory-based clustering category).
- Some relevant discussions on Cross Validated: here and here
- For determining an optimal number of clusters, see this excellent answer on StackOverflow.
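To make the mclust option above concrete, here is a minimal R sketch of model-based clustering. It is illustrative only: the iris data and the range of component counts are my own assumptions, not anything from the referenced papers. A nice side effect is that mclust selects the number of mixture components by BIC, which also speaks to the last point about choosing the number of clusters.

```r
# Minimal sketch of model-based (Gaussian mixture) clustering with mclust.
# Assumes mclust is installed: install.packages("mclust").
# The iris columns are placeholder data, not from the original question.
library(mclust)

X   <- iris[, 1:4]            # numeric features only
fit <- Mclust(X, G = 1:9)     # fit mixtures with 1..9 components;
                              # the covariance model and G are chosen by BIC
summary(fit)                  # chosen model and number of clusters
head(fit$classification)      # hard cluster assignments
plot(fit, what = "BIC")       # BIC across models and component counts
```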
I will state what I think you are asking. If I have misunderstood your question, please comment and I will delete this answer.
I think that you are saying that you have some text data. Cosine similarity is usually used to measure the similarity of documents, but a similarity matrix can be converted to a distance/dissimilarity measure, and it sounds like you have done that. You used this to perform clustering and now want to visualize the results, both to see whether the clustering makes sense and possibly to gain some insight from the clusters. But you have only very high-dimensional text (which is hard to plot) and a distance matrix. How can you get a useful visualization?
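In case it is useful, a common way to do that conversion in R looks roughly like this; the document-term matrix X is a made-up placeholder, and 1 minus cosine similarity is just one reasonable choice of dissimilarity:

```r
# Sketch: cosine similarity between document vectors, converted to a
# dissimilarity. X is a placeholder document-term matrix (rows = documents).
set.seed(1)
X <- matrix(runif(20 * 50), nrow = 20)

norms <- sqrt(rowSums(X^2))
sim   <- (X %*% t(X)) / outer(norms, norms)  # pairwise cosine similarities
d     <- as.dist(1 - sim)                    # dissimilarity matrix
```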
One way to get a plot that shows clusters is to use principal components analysis (PCA) on your data and then project the data onto the first two principal components. The resulting two-dimensional data can be plotted. The x-y coordinates are in terms of the principal components, which are linear combinations of the original dimensions, so they can be hard to interpret.
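As a rough sketch of that route in R (the data and the k-means step are placeholders for whatever clustering you actually used):

```r
# Sketch: project high-dimensional data onto the first two principal
# components and color points by cluster label.
set.seed(1)
X  <- matrix(rnorm(100 * 20), nrow = 100)   # placeholder data
cl <- kmeans(X, centers = 3)$cluster        # placeholder clustering

pc <- prcomp(X, scale. = TRUE)
plot(pc$x[, 1], pc$x[, 2], col = cl, xlab = "PC1", ylab = "PC2")
```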
There are several other good methods to go from a distance matrix to a low-dimensional representation of your data suitable for graphing. The methods try to create a representation (probably 2-dimensional for graphing) that preserves the distance relations stored in the distance matrix. Of course, it is not generally possible to do this exactly, but still these methods can produce useful visualizations.
I will point you to two such methods: Multi-dimensional Scaling (MDS) and t-distributed Stochastic Neighbor Embedding (tSNE).
Both can produce useful results from a distance matrix. Both have easy-to-use implementations in R and presumably other languages.
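For example, in R (a sketch only; d is a stand-in for your cosine-based distance matrix, and Rtsne is one of several tSNE implementations):

```r
# Sketch: two-dimensional embeddings computed directly from a distance
# matrix. d is a placeholder; substitute your own dissimilarities.
set.seed(1)
X <- matrix(runif(50 * 100), nrow = 50)
d <- dist(X)

mds <- cmdscale(d, k = 2)        # classical (metric) MDS, in base R
plot(mds, xlab = "MDS 1", ylab = "MDS 2")

library(Rtsne)                   # install.packages("Rtsne") if needed
ts  <- Rtsne(as.matrix(d), is_distance = TRUE, perplexity = 10)
plot(ts$Y, xlab = "tSNE 1", ylab = "tSNE 2")
```

cmdscale() performs classical metric MDS; MASS::isoMDS() is a non-metric alternative if preserving the rank order of the distances matters more than the distances themselves.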
Both MDS and tSNE use optimization methods to construct a two-dimensional representation of the data and so are not even as simple as the linear combinations of dimensions that you get from PCA. Because of this, the two dimensions that are produced cannot generally be interpreted in terms of the original dimensions. They preserve the distance between points, but not the meaning of the dimensions.
I believe that the picture that you copied from the Code Project k-means page was merely meant to be illustrative of what happens when the original data has two dimensions, where the process is easier to understand. In that picture, the x and y are the x and y of the original data. A different example from the Code Project is closer to your use. It clusters words using cosine similarity and then creates a two-dimensional plot. The axes there are simply labeled x[,1] and x[,2]. The two coordinates were created by tSNE. Thus, you cannot really interpret the coordinates themselves. But there is reason to think that the relationships between the words are preserved as much as possible in reducing this to two dimensions.
Best Answer
In most situations, I would have thought that such a plot basically means that there is no cluster structure in the data. However, clustering in very high dimensions such as this is tricky because, under the Euclidean distance metric, all pairwise distances tend toward the same value as the number of dimensions increases. See this Wikipedia page for references to some papers on this topic. In short, it may just be the high dimensionality of the dataset that is the problem.
This is essentially "the curse of dimensionality", see this Wikipedia page as well.
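A quick simulation illustrates the concentration effect (the sample size and dimensions here are arbitrary choices):

```r
# Sketch: as dimensionality grows, the ratio of the largest to smallest
# pairwise Euclidean distance shrinks toward 1, so "near" and "far"
# become nearly indistinguishable.
set.seed(1)
for (p in c(2, 10, 100, 1000)) {
  X <- matrix(runif(100 * p), nrow = 100)
  d <- dist(X)
  cat(sprintf("p = %4d   max/min distance = %.2f\n", p, max(d) / min(d)))
}
```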
A paper that may be of interest is Sanguinetti, G., "Dimensionality reduction of clustered datasets", IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 30, no. 3, pp. 535-540, March 2008 (www). It is a bit like an unsupervised version of LDA that seeks out a low-dimensional space emphasising the cluster structure. Perhaps you could use that as a feature-extraction method before performing k-means?