I applied the "kmeans" function to a dataset of 24 variables with the number of clusters set to 3. How can I visualize the three clusters and their centroids?
MATLAB: How to visualize high-dimensional clusters from the “kmeans” function
Related Solutions
One possible approach is to provide the "coeff" matrix as the input to the "heatmap" function. The x-axis would represent the principal components, while the y-axis would represent the predictor variables. The heatmap would then show how strongly each variable contributes to each principal component.
In order to illustrate, here is an example:
load hald
coeff = pca(ingredients);
heatmap(coeff,'XLabel','Principal Components','YLabel','Variables');
For more details about the "heatmap" function, you can refer to the following link:
Note that the "biplot" function provides another way to visualize the magnitude and sign of each variable's contribution to the first two or three principal components:
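As an illustrative sketch using the same "hald" dataset as above (the variable labels here are generic placeholders, not the dataset's real names):

```matlab
load hald
[coeff,score] = pca(ingredients);
vlabels = {'x1','x2','x3','x4'};   % placeholder variable names
% Plot each variable's loading on the first two principal components,
% overlaid with the observations' scores
biplot(coeff(:,1:2),'Scores',score(:,1:2),'VarLabels',vlabels)
```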
1) Do the k-means and hierarchical clustering algorithms handle non-numeric data? If not is there anyway of handling categorical data in clustering?
None of these algorithms accept non-numeric features as inputs, so you will need to convert the categorical features into a numeric representation.
If you try to call these functions on categorical features, MATLAB will throw an error. Consider the following example:
>> data = categorical({'small'; 'medium'; 'large'});
>> Z = linkage(data)
Error using internal.stats.linkagemex
Function linkagemex only supports input of class 'double' or 'single'.
Error in linkage (line 259)
Z = internal.stats.linkagemex(Y,method,pdistArg, memEff);
Here, the error clearly indicates that the hierarchical clustering algorithm can only accept numeric data (i.e., 'double' or 'single' data types).
You will get a similar error from the 'kmeans' function as well; for example, calling 'idx = kmeans(X,3)' on categorical data produces: 'Error using kmeans (line 166) Invalid data type. The first argument to KMEANS must be a real array.'
In order to use categorical features for clustering, you need to convert the categories into a numeric type (say, 'double'); the distance function that defines the dissimilarity of the data will then operate on this 'double' representation of the categorical data. Please take a look at the following link for a descriptive example:
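As one hedged illustration of such a conversion, a categorical feature can be one-hot encoded with the 'dummyvar' function before clustering (the category names here are made up):

```matlab
colors = categorical({'red';'green';'blue';'red';'green'});
X = dummyvar(colors);        % numeric 0/1 matrix, one column per category
Z = linkage(X,'average');    % hierarchical clustering now accepts the data
idx = kmeans(X,2);           % kmeans accepts it as well
```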
2) Is it possible to assign feature weights in hierarchical clustering / k-means clustering?
There is no built-in option for assigning feature weights in any of the clustering algorithms. However, you can use 'kmedoids' clustering as an alternative and define a custom 'Distance' function in which you weight the input features as per your requirements. Please refer to the following link for an example of specifying the 'Distance' property for 'kmedoids' clustering:
Here, you will need a custom pairwise distance function of the form accepted by 'pdist'. Please refer to the following link for an example of defining a custom distance function for 'pdist':
You can define your own MATLAB function, like the 'naneucdist' function defined in the above link, and add weights to the features as per your requirements.
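As a minimal sketch of this idea (the weight values and data below are made up for illustration), a weighted Euclidean distance can be passed to 'kmedoids' as an anonymous function handle; the handle follows the 'pdist'-style signature of one row against a matrix of rows, and the implicit expansion used here requires R2016b or later:

```matlab
% Sketch: weighted Euclidean distance for kmedoids (weights are hypothetical)
w = [1 2 3];                            % one weight per feature
wdist = @(x,Y) sqrt(((x - Y).^2) * w'); % x: 1-by-p point, Y: n-by-p matrix
rng default                             % for reproducibility
data = rand(100,3);                     % example data with three features
idx = kmedoids(data,3,'Distance',wdist);
```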
Alternatively, if you have numerical features and an array of weights for each of these features, you can simply multiply the features with these weights. Consider the following example, where we have a dataset 'data' with three features:
>> data = [1:10; 1:10; 1:10]';
>> weights = [1 2 3];
>> weightedData = weights .* data;
Now, you can use the 'weightedData' for clustering as per your requirements.
Best Answer
Because the cluster data is 24-dimensional, it is often difficult to visualize it directly. A common way to deal with this is to first project or transform the data to lower dimensions (typically 2 or 3) and then apply visualization techniques to the reduced-dimensional data. As an example, suppose the "kmeans" function is applied to a data matrix "data" (300 x 24) with the number of clusters set to 3:
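For instance (using random numbers as a stand-in for the real 300-by-24 matrix):

```matlab
rng default                % for reproducibility
data = rand(300,24);       % placeholder for the actual 300-by-24 data matrix
[idx,C] = kmeans(data,3);  % idx: cluster indices, C: 3-by-24 centroid matrix
```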
Then here are some visualization options:
Option 3: Use the "silhouette" function to measure the goodness of the clustering:
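A minimal sketch (assuming "data" and "idx" from the "kmeans" call above; the data here is random for illustration):

```matlab
rng default
data = rand(300,24);
idx = kmeans(data,3);
figure
silhouette(data,idx)   % values near 1 indicate well-separated clusters
```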