I applied the "kmeans" function to a dataset of 24 variables with the number of clusters set to 3. How can I visualize the three clusters and their centroids?
MATLAB: How to visualize high-dimensional clusters from the “kmeans” function
Related Solutions
One possible approach is to provide the "coeff" matrix as the input to the "heatmap" function. The x-axis would represent the principal components, while the y-axis would represent the predictor variables. The heatmap would then show how strongly each variable contributes to each principal component.
In order to illustrate, here is an example:
load hald
coeff = pca(ingredients);
heatmap(coeff,'XLabel','Principal Components','YLabel','Variables');
For more details about the "heatmap" function, you can refer to the following link:
Note that the "biplot" function provides another way to visualize the magnitude and sign of each variable's contribution to the first two or three principal components:
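As an illustrative sketch using the same "hald" dataset as above (the variable labels here are generic placeholders, not the dataset's real names):

```matlab
load hald
[coeff,score] = pca(ingredients);
vlabels = {'x1','x2','x3','x4'};   % placeholder variable names
% Plot each variable's loading on the first two principal components,
% overlaid with the observations' scores
biplot(coeff(:,1:2),'Scores',score(:,1:2),'VarLabels',vlabels)
```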
1) Do the k-means and hierarchical clustering algorithms handle non-numeric data? If not is there anyway of handling categorical data in clustering?
None of these algorithms accept non-numeric features as inputs, so you will need to convert the categorical features into a numeric representation.
If you try to call these functions on categorical features, MATLAB will throw an error. Consider the following example:
>> data = categorical({'small'; 'medium'; 'large'});
>> Z = linkage(data)
Error using internal.stats.linkagemex
Function linkagemex only supports input of class 'double' or 'single'.
Error in linkage (line 259)
Z = internal.stats.linkagemex(Y,method,pdistArg, memEff);
Here, the error clearly indicates that the hierarchical clustering algorithm can only accept numeric data (i.e., 'double' or 'single' data types).
You will get a similar error from the 'kmeans' function as well; for example, calling 'idx = kmeans(X,3)' on categorical data produces: 'Error using kmeans (line 166) Invalid data type. The first argument to KMEANS must be a real array.'
In order to use categorical features for clustering, you need to convert the categories into a numeric type (say, 'double'); the distance function that defines the dissimilarity of the data will then operate on this 'double' representation of the categorical data. Please take a look at the following link for a descriptive example:
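As one hedged illustration of such a conversion, a categorical feature can be one-hot encoded with the 'dummyvar' function before clustering (the category names here are made up):

```matlab
colors = categorical({'red';'green';'blue';'red';'green'});
X = dummyvar(colors);        % numeric 0/1 matrix, one column per category
Z = linkage(X,'average');    % hierarchical clustering now accepts the data
idx = kmeans(X,2);           % kmeans accepts it as well
```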
2) Is it possible to assign feature weights in hierarchical clustering / k-means clustering?
There is no built-in option for assigning feature weights in any of the clustering algorithms. However, you can use 'kmedoids' clustering as an alternative and define a custom 'Distance' function in which you weight the input features as per your requirements. Please refer to the following link for an example of specifying the 'Distance' property for 'kmedoids' clustering:
Here, you will need a custom pairwise distance function of the form accepted by 'pdist'. Please refer to the following link for an example of defining a custom distance function for 'pdist':
You can define your own MATLAB function, like the 'naneucdist' function defined in the above link, and add weights to the features as per your requirements.
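As a minimal sketch of this idea (the weight values and data below are made up for illustration), a weighted Euclidean distance can be passed to 'kmedoids' as an anonymous function handle; the handle follows the 'pdist'-style signature of one row against a matrix of rows, and the implicit expansion used here requires R2016b or later:

```matlab
% Sketch: weighted Euclidean distance for kmedoids (weights are hypothetical)
w = [1 2 3];                            % one weight per feature
wdist = @(x,Y) sqrt(((x - Y).^2) * w'); % x: 1-by-p point, Y: n-by-p matrix
rng default                             % for reproducibility
data = rand(100,3);                     % example data with three features
idx = kmedoids(data,3,'Distance',wdist);
```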
Alternatively, if you have numerical features and an array of weights for each of these features, you can simply multiply the features with these weights. Consider the following example, where we have a dataset 'data' with three features:
>> data = [1:10; 1:10; 1:10]';
>> weights = [1 2 3];
>> weightedData = weights .* data;
Now, you can use the 'weightedData' for clustering as per your requirements.
Best Answer
Because the cluster data is 24-dimensional, it is often difficult to visualize it directly. A common way to deal with this is to first project or transform the data to lower dimensions (typically 2 or 3) and then apply visualization techniques to the reduced-dimensional data. As an example, suppose the "kmeans" function is applied to a data matrix "data" (300 x 24) with the number of clusters set to 3:
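For instance (using random numbers as a stand-in for the real 300-by-24 matrix):

```matlab
rng default                % for reproducibility
data = rand(300,24);       % placeholder for the actual 300-by-24 data matrix
[idx,C] = kmeans(data,3);  % idx: cluster indices, C: 3-by-24 centroid matrix
```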
Then here are some visualization options:
Option 3: Use the "silhouette" function to measure the goodness of the clustering:
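A minimal sketch (assuming "data" and "idx" from the "kmeans" call above; the data here is random for illustration):

```matlab
rng default
data = rand(300,24);
idx = kmeans(data,3);
figure
silhouette(data,idx)   % values near 1 indicate well-separated clusters
```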