Clustering – Understanding Mahalanobis Distance Calculation Between a Point and a Cluster

clustering, covariance, covariance-matrix, mahalanobis

I am slightly confused about how to calculate the Mahalanobis distance for a given set of data. I have tried asking my tutor for help, but he does not seem interested in helping whatsoever and I am continuously insulted, so I thought I would turn to the community for help.

I have a set of data and have already grouped it once using Euclidean distance. Now I want to calculate distances using the Mahalanobis distance. I have calculated the means and a pooled covariance matrix, but I am unsure what to do next to calculate the distance for each point.

I think what I need to do is take a point and subtract the mean values. I then calculate a pooled covariance matrix for each group and use it to calculate the distance between the point and that cluster's data distribution. Whichever cluster yields the smallest distance is the cluster the point belongs to.

Data grouped into 3 clusters, using Euclidean distance to place the points into initial groups

Pooled Covariance matrix
$$\begin{bmatrix}1.394&1.702\\1.702&6.62\end{bmatrix}$$

Inverse Pooled Covariance
$$\begin{bmatrix}1.046&-0.269\\-0.269&0.221\end{bmatrix}$$
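As a quick sanity check, the inverse can be recomputed from the pooled covariance matrix quoted above. This is only a sketch for the 2x2 case; `inverse_2x2` is a helper name made up for illustration, not something from the original post.

```python
def inverse_2x2(m):
    """Invert a 2x2 matrix given as [[a, b], [c, d]]."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


# Pooled covariance matrix as quoted in the question.
pooled_cov = [[1.394, 1.702], [1.702, 6.62]]
pooled_inv = inverse_2x2(pooled_cov)

for row in pooled_inv:
    print([round(v, 3) for v in row])
```

The recomputed entries agree with the quoted inverse to within about 0.001, so the inversion step itself looks fine.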

Mahalanobis Formula
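
For reference, the standard definition of the Mahalanobis distance between a point and a cluster can be written as follows. The symbols are notation chosen here for illustration, not taken from the post: $x$ is the point, $\mu_k$ is the mean of cluster $k$, and $S$ is the covariance matrix being used (the pooled one, if the groups are assumed to share a covariance).

$$D_M(x, \mu_k) = \sqrt{(x - \mu_k)^{\top} S^{-1} (x - \mu_k)}$$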

Pooled covariance matrix for each cluster

Cluster1
$$\begin{bmatrix}0.873&-0.234\\-0.234&0.158\end{bmatrix}$$

Cluster2
$$\begin{bmatrix}6.060&-3.030\\-3.030&6.060\end{bmatrix}$$

Cluster3
$$\begin{bmatrix}1.189&-0.573\\-0.573&0.722\end{bmatrix}$$

Calculating distance for point (1, 1) and Cluster 1 distribution

Since the distance to the cluster 1 distribution is smaller than the distance to cluster 2, this point will belong to cluster 1.

The idea of a pooled covariance matrix comes from the following argument.

1. Each group can have its sample covariance matrix calculated.

2. However, we believe that the groups all have the same population covariance and only differ in their means.

3. In order to get the tightest possible estimate of the one covariance matrix shared by all three groups, we pool the sample covariance matrices of the individual groups.
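
The pooling argument above can be sketched in code. This assumes the usual weighting, where each group's sample covariance is weighted by its degrees of freedom (n_i - 1) and the sum is divided by the total degrees of freedom; your course may define it slightly differently. The toy groups are hypothetical and are not the data from the question.

```python
def sample_cov(points):
    """Unbiased 2x2 sample covariance of a list of (x, y) points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return [[sxx, sxy], [sxy, syy]]


def pooled_cov(groups):
    """Pool per-group covariances, weighting each by (n_i - 1)."""
    dof = sum(len(g) - 1 for g in groups)  # total degrees of freedom
    pooled = [[0.0, 0.0], [0.0, 0.0]]
    for g in groups:
        s = sample_cov(g)
        w = len(g) - 1
        for i in range(2):
            for j in range(2):
                pooled[i][j] += w * s[i][j] / dof
    return pooled


# Hypothetical toy groups, NOT the data from the question.
groups = [
    [(1.0, 1.0), (2.0, 3.0), (3.0, 2.0)],
    [(6.0, 7.0), (7.0, 9.0), (8.0, 8.0), (7.0, 8.0)],
]
pooled = pooled_cov(groups)
print(pooled)
```

Note that each group is centred on its own mean before pooling, which is why the groups can differ in location and still contribute to a single covariance estimate.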

If you’re thinking that the groups might not have the same population covariance matrix, you’re right. However, your assignment seems to assume a single population covariance matrix, estimated by pooling the sample covariance matrices of the groups.

It’s possible that your calculation of the pooled matrix (the one with 1.394 in the top-left entry) is incorrect, but the key idea is having one population covariance matrix for all three groups. That is why you would use just the one covariance matrix when computing the Mahalanobis distance to each group: you believe it to be the best estimate of the covariance matrix for all three groups (and, therefore, for each individual group).
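
To make that concrete, here is a minimal sketch of assigning the point (1, 1) to the nearest cluster using one shared inverse covariance matrix. The inverse is the one quoted in the question; the cluster means are hypothetical placeholders, since the actual means are not given in the post, and `mahalanobis` is a helper name made up for illustration.

```python
import math


def mahalanobis(point, mean, cov_inv):
    """Mahalanobis distance from a 2-D point to a mean, given the inverse covariance."""
    dx = point[0] - mean[0]
    dy = point[1] - mean[1]
    # Quadratic form d^T * cov_inv * d, written out for the 2x2 case.
    q = (dx * (cov_inv[0][0] * dx + cov_inv[0][1] * dy)
         + dy * (cov_inv[1][0] * dx + cov_inv[1][1] * dy))
    return math.sqrt(q)


# Inverse pooled covariance as quoted in the question.
pooled_inv = [[1.046, -0.269], [-0.269, 0.221]]

# Hypothetical cluster means, NOT taken from the question's data.
means = {
    "cluster1": (2.0, 2.0),
    "cluster2": (5.0, 6.0),
    "cluster3": (8.0, 3.0),
}

point = (1.0, 1.0)
distances = {name: mahalanobis(point, mu, pooled_inv) for name, mu in means.items()}
nearest = min(distances, key=distances.get)
print(distances)
print(nearest)
```

Only the mean changes from cluster to cluster; the same `pooled_inv` is reused for every distance, which is exactly the equal-covariance assumption described above.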