I have 100 data points, observed on 15 variables. I want to cluster my 100 observations, but I am unable to visualise 15-dimensional clusters in MATLAB.
Solved – How to plot clusters in more than 3 dimensions
clustering, data visualization
Related Solutions
A good idea might be to run ANOVAs or MANOVAs across the clusters for whichever variables you're using. The variables that generated the clusters should generally show significant differences, but if the 5 new variables you're incorporating were not used to generate the cluster solution, it is interesting to test them as well.
A one-way ANOVA will give you an F statistic (with only two clusters, a simple compare-means test such as a t-test gives the equivalent t statistic), which is a relatively good indicator of how different the groups (clusters, in this case) are in terms of the relevant variable.
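As a rough illustration of the F statistic mentioned above, here is a minimal Python sketch that computes it by hand for one variable split by cluster. The data and the function name `one_way_anova_f` are made up for the example; in practice you would use whatever ANOVA routine your statistics package provides, which also reports the p-value.

```python
from statistics import fmean

def one_way_anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of groups of values."""
    k = len(groups)                           # number of clusters
    n_total = sum(len(g) for g in groups)     # total number of observations
    grand_mean = fmean([x for g in groups for x in g])
    # Between-cluster sum of squares: spread of the cluster means.
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    # Within-cluster sum of squares: spread of points around their own mean.
    ss_within = sum((x - fmean(g)) ** 2 for g in groups for x in g)
    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)
    return ms_between / ms_within

# Hypothetical values of a single variable, grouped by cluster label.
clusters = {
    1: [1.0, 2.0, 3.0],
    2: [2.0, 3.0, 4.0],
    3: [5.0, 6.0, 7.0],
}
f_stat = one_way_anova_f(list(clusters.values()))
print(f"F = {f_stat:.2f}")
```

A large F means the cluster means differ a lot relative to the spread inside each cluster; you would repeat this per variable.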
If your 5 new variables are categorical, it might be as easy as a chi-square test, but you might also give multiple correspondence analysis a try. Multiple correspondence analysis yields a biplot in which the distances between categories indicate how much they tend to occur together, so if cluster 1 sits very near 3 categories, you can conclude that those three categories are characteristic of cluster 1.
Or, you know, just describe the univariate statistics of each of your clusters.
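For the plain descriptive route, something along these lines would do. This is only a sketch with invented data (the `observations` list is hypothetical), summarising one variable per cluster with the standard library.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical data: (cluster label, value of one variable) per observation.
observations = [(1, 2.0), (1, 4.0), (2, 10.0), (2, 14.0), (1, 3.0), (2, 12.0)]

# Group the values by cluster label.
by_cluster = defaultdict(list)
for label, value in observations:
    by_cluster[label].append(value)

# Univariate summary of the variable within each cluster.
summary = {
    label: {
        "n": len(vals),
        "mean": mean(vals),
        "sd": stdev(vals),   # sample standard deviation
        "min": min(vals),
        "max": max(vals),
    }
    for label, vals in sorted(by_cluster.items())
}
for label, stats in summary.items():
    print(label, stats)
```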
You could use the conditional probability that each outcome is the correct one, under the prior assumption that the correct outcome occurs with some probability $p$ and every incorrect outcome occurs with probability $q=(1-p)/(n-1)$, where $n$ is the number of different outcomes. Then the probability of observing the counts $a_i$ (with total $\sum_ia_i=S$) if answer $k$ is correct is proportional to $p^{a_k}q^{S-a_k}$, so you could assign confidence levels
$$c_k=\frac{p^{a_k}q^{S-a_k}}{\sum_i p^{a_i}q^{S-a_i}}=\frac{p^{a_k}q^{-a_k}}{\sum_i p^{a_i}q^{-a_i}}\;.$$
You can choose the parameter $p$ according to your needs. For $p=q=1/n$, you'll get $c_k=1/n$ (which makes sense, since if you assume that people are just guessing, no amount of clustering will raise your confidence in one of the outcomes). For $p$ near $1$, you'll get sharply peaked confidence levels even for moderate differences in the outcome counts. In the limit $p\to1$, $q\to0$ the confidence level for the outcome with the highest count will go to $1$ and the others will go to $0$, since you can multiply through by the lowest number of factors of $q$ and the other terms in the sum go to zero.
If you want the confidence level in your first example to be noticeably different from $100\%$, you'd have to choose $p$ quite close to $1/n=25\%$. Here are some values for your examples:
$$ \begin{array}{|c|c|c|c|c|c|} p&q&220&31&28&21\\ \hline 0.253&0.249& 0.879&0.043&0.041&0.037\\ \hline 0.256&0.248&0.994&0.002&0.002&0.002\\ \hline 0.259&0.247&1.000&0.000&0.000&0.000 \end{array} $$
$$ \begin{array}{|c|c|c|c|c|c|} p&q&113&110&106&103\\ \hline 0.253&0.249& 0.270&0.258&0.242&0.230\\ \hline 0.256&0.248&0.291&0.264&0.233&0.212\\ \hline 0.259&0.247&0.312&0.270&0.224&0.194\\ \hline 0.265&0.245& 0.354&0.280&0.204&0.162\\ \hline 0.280&0.240&0.458&0.288&0.156&0.098\\ \hline 0.310&0.230&0.632&0.258&0.078&0.032 \end{array} $$
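If you want to try other values of $p$, the formula for $c_k$ above is easy to evaluate directly. Here is a small Python sketch (the function name `confidence_levels` is just for illustration). It works with logarithms so that large counts do not overflow, and it assumes $0<p<1$.

```python
import math

def confidence_levels(counts, p):
    """Confidence c_k for each outcome, given counts a_k and prior probability p."""
    n = len(counts)
    q = (1.0 - p) / (n - 1)                    # probability of each wrong outcome
    log_ratio = math.log(p) - math.log(q)      # log(p / q)
    # c_k is proportional to (p/q)^{a_k}; compute it in log space.
    log_weights = [a * log_ratio for a in counts]
    shift = max(log_weights)                   # subtract the max for numerical stability
    weights = [math.exp(lw - shift) for lw in log_weights]
    total = sum(weights)
    return [w / total for w in weights]

levels = confidence_levels([220, 31, 28, 21], p=0.253)
print([round(c, 3) for c in levels])
```

With the counts and $p=0.253$ used here, this should reproduce the first row of the first table up to rounding.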
Best Answer
Calculate distances between data points, as appropriate to your problem.
Then plot your data points in two dimensions instead of fifteen, preserving distances as far as possible. This is probably the key aspect of your question. Read up on multidimensional scaling (MDS) for this.
Finally, color your points according to cluster membership.
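To make the three steps concrete, here is a rough, dependency-free Python sketch of classical (Torgerson) MDS, which is only one of several MDS variants. Everything in it is an assumption for illustration: the data are synthetic (100 observations on 15 variables), the cluster labels are invented rather than produced by a real clustering run, Euclidean distance is just one possible choice, and the eigenvectors are found by simple power iteration instead of a proper linear algebra routine.

```python
import math
import random

def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classical_mds(dist, dims=2, iters=200, seed=0):
    """Embed points in `dims` dimensions from a symmetric distance matrix."""
    n = len(dist)
    sq = [[d * d for d in row] for row in dist]
    row_mean = [sum(r) / n for r in sq]
    total_mean = sum(row_mean) / n
    # Double centering: B = -1/2 * J * D^2 * J (the matrix is symmetric,
    # so column means equal row means).
    b = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + total_mean)
          for j in range(n)] for i in range(n)]
    rng = random.Random(seed)
    coords = [[0.0] * dims for _ in range(n)]
    for d in range(dims):
        # Power iteration for the dominant remaining eigenvector of B.
        v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        for _ in range(iters):
            w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0.0:
                break
            v = [x / norm for x in w]
        # Rayleigh quotient gives the matching eigenvalue estimate.
        eigval = sum(v[i] * sum(b[i][j] * v[j] for j in range(n))
                     for i in range(n))
        scale = math.sqrt(max(eigval, 0.0))
        for i in range(n):
            coords[i][d] = v[i] * scale
        # Deflate B so the next pass finds the next eigenvector.
        for i in range(n):
            for j in range(n):
                b[i][j] -= eigval * v[i] * v[j]
    return coords

# Synthetic stand-in for the real data: two groups of 50 points in 15-D.
rng = random.Random(42)
points, labels = [], []
for label, center in ((0, 0.0), (1, 3.0)):
    for _ in range(50):
        points.append([rng.gauss(center, 1.0) for _ in range(15)])
        labels.append(label)  # hypothetical cluster membership

# Step 1: pairwise distances. Step 2: 2-D coordinates via MDS.
dist = [[euclidean(a, b) for b in points] for a in points]
coords = classical_mds(dist)

# Step 3: pair each 2-D point with its cluster label, ready for a
# scatter plot coloured by label.
for (x, y), label in list(zip(coords, labels))[:5]:
    print(f"cluster {label}: ({x:.2f}, {y:.2f})")
```

In MATLAB you would not need to write any of this by hand: as far as I know the Statistics and Machine Learning Toolbox provides `pdist` and `squareform` for the distances, `cmdscale` and `mdscale` for the scaling, and `gscatter` for a scatter plot grouped by cluster label. Keep in mind that a 2-D picture can only preserve the 15-dimensional distances approximately.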