Solved – What statistical test for cluster analysis results should I use

anova, machine learning, manova, r

I have a set of 35 independent variables (features) and no response variable. Density plots showed multi-modal distributions in these variables, so I used Gaussian mixture clustering to group the data and obtained 6 clusters.

I designed a hypothesis to test my results as follows:
Hypothesis 1: H0: there is no significant difference in means among the clusters formed.

Before proceeding to ANOVA, I ran a Shapiro-Wilk normality test (null hypothesis rejected: W = 0.99132, p-value = 1.623e-12) and an outlier test (which found outliers in the data).

Next, I ran Levene's test for homogeneity of variance and found that the group variances are unequal. The Bartlett and Fligner-Killeen tests gave the same result.

So my data fail to meet the ANOVA assumptions. Now that ANOVA is out of the picture, what non-parametric test should I use to test that my clusters are distinct? Or should I use the test for homogeneity of variance to redefine my hypothesis and conclude my finding?

Just out of curiosity, I did run ANOVA and MANOVA anyway, and from the results I could reject the null hypothesis.

I have looked into different validation indices such as silhouette width (which I am using to confirm that the number of clusters is optimal), Dunn, pearsongamma, etc.

I have over 100 similar data sets I need to validate. Any help with this is very much appreciated.

PS: my data are normalized, with mean = 0 and SD = 1.

Edit
I have some signal data covering about 6 months (over 50,000 observations). I have extracted about 35 features from the data and used Gaussian mixture clustering to cluster the data into distinct groups. I have no label information with which to compute accuracy or kappa values, only some written records of the events on particular days. Based on my clustered data, I was able to cross-reference each cluster to a particular event.

Having said that, I also want to make sure that the clusters I have are distinct (that is, that each cluster is different from the others). I want to do a statistical test with an appropriate hypothesis.

My final goal of this statistical test is to see which clusters are significantly different from one another and what features are significant for each cluster.

Best Answer

Let me know if I am understanding your question correctly. Your data do not have labels, so you performed Gaussian mixture clustering. And now you want to perform hypothesis testing, using these clusters as "labels", to check whether your data differ significantly between them?

It seems like you want to treat these clusters as different levels of a single "factor" (in ANOVA speak). If the normality and equal-variance-between-clusters assumptions held, you could proceed to perform a MANOVA (where the response is the 35-dimensional feature vector of your data points). But since your tests show these assumptions are violated, you cannot rely on the traditional MANOVA.

If I'm understanding this correctly, you can perform a permutation-based MANOVA. Anderson 2001 describes this approach. Essentially, it computes an F-ratio from the sums of squared distances (Euclidean, or any other dissimilarity measure) within and between your clusters, then repeatedly permutes the cluster labels and recomputes the F-ratio to generate its distribution under the null hypothesis. Comparing the observed F-ratio to that permutation distribution gives a p-value.
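To make the idea concrete, here is a minimal sketch of that permutation test in plain Python, assuming Euclidean distance as the dissimilarity measure. The function names (`squared_distances`, `pseudo_f`, `permanova`) and the toy data are made up for illustration and are not taken from Anderson's paper or from any toolbox; treat it as a sketch of the procedure, not a reference implementation.

```python
import random
from itertools import combinations


def squared_distances(points):
    """Squared Euclidean distance for every pair i < j (hypothetical helper)."""
    return {
        (i, j): sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
        for i, j in combinations(range(len(points)), 2)
    }


def pseudo_f(d2, labels):
    """F-ratio built from sums of squared pairwise distances."""
    n = len(labels)
    groups = set(labels)
    sizes = {g: labels.count(g) for g in groups}
    # Total sum of squares: all pairwise squared distances divided by N.
    ss_total = sum(d2.values()) / n
    # Within-group sum of squares: same-group pairs, divided by group size.
    ss_within = sum(
        d / sizes[labels[i]]
        for (i, j), d in d2.items()
        if labels[i] == labels[j]
    )
    ss_among = ss_total - ss_within
    a = len(groups)
    return (ss_among / (a - 1)) / (ss_within / (n - a))


def permanova(points, labels, n_perm=999, seed=0):
    """Return the observed F-ratio and its permutation p-value."""
    rng = random.Random(seed)
    d2 = squared_distances(points)  # distances do not change under relabeling
    f_obs = pseudo_f(d2, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # break any link between labels and points
        if pseudo_f(d2, shuffled) >= f_obs:
            hits += 1
    # The observed labeling counts as one of the permutations.
    return f_obs, (hits + 1) / (n_perm + 1)


# Toy data (made up): two groups of 10 points in 3 dimensions,
# with means 0 and 3 and unit standard deviation.
rng = random.Random(1)
toy_points = [
    [rng.gauss(mu, 1.0) for _ in range(3)]
    for mu in (0.0, 3.0)
    for _ in range(10)
]
toy_labels = [0] * 10 + [1] * 10

f_stat, p_value = permanova(toy_points, toy_labels, n_perm=199)
print(f_stat, p_value)
```

Note that this naive version stores every pairwise distance, so it would not scale to 50,000 observations as written; for data of that size you would probably need to subsample or use an optimized implementation.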

If you are using MATLAB, there is an implementation in the Fathom Toolbox.