Solved – Standard deviation of a particular dimension in a multivariate Gaussian distribution

clustering, multivariate regression, normal distribution, regression, standard deviation

I have a set (cluster) of vectors in dimension d. From this I have calculated the sample mean and covariance matrix (I am assuming the vectors come from a multivariate Gaussian).

Given a new vector (also in dimension d), I want to decide whether it belongs to this cluster by checking whether its distance from the mean is less than 2 standard deviations.

In the one-dimensional case I would simply check if |x - x_bar| < 2*sigma.

How does this extend to the multivariate case?

Thanks

Best Answer

First of all, in the univariate case (when $d=1$, i.e. the one you already know the decision rule for), assuming you have a vector of $n$ univariate measurements $x$ (so that $x$ is an $n\times 1$ matrix where each entry $x_i$ is a scalar), the decision rule you describe is really to flag $x_i$ as not belonging to the cluster when:

$$\left(\frac{n(n-1)}{(n-1)(n+1)}\frac{\left(x_i-\hat{\mu}_x\right)^2}{\hat{\sigma}^2_x}\right) > F_{0.95}(1, n-1)$$

where $F_{0.95}(1, n-1)$ is the 95th percentile of the Fisher ($F$) distribution with $1$ and $n-1$ degrees of freedom (when the inequality holds, you consider that $x_i$ is too far from $\hat{\mu}_x$, in the metric $\hat{\sigma}_x$, to belong to the cluster with mean $\hat{\mu}_x$ and scale $\hat{\sigma}_x$). This is the correct version of your rule of thumb when $p=1$ (I write $p$ for what you call $d$; sorry for the confusion, but if I change my notation now, my answers to your comments below will become meaningless).
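As a rough illustration of the univariate rule, here is a minimal Python sketch. It assumes NumPy and SciPy are available, that the new value was not used to estimate the mean and variance, and that $\hat{\sigma}^2_x$ is the unbiased sample variance; the function name `outside_cluster_1d` and the `level` parameter are made up for this example.

```python
import numpy as np
from scipy import stats


def outside_cluster_1d(sample, x_new, level=0.95):
    """Return True if x_new is flagged as too far from the sample mean.

    Hypothetical helper: implements the scaled squared distance above
    for p = 1 and compares it with the F(1, n - 1) quantile.
    """
    sample = np.asarray(sample, dtype=float)
    n = sample.size
    mean = sample.mean()
    var = sample.var(ddof=1)  # assumption: unbiased sample variance

    # For p = 1 the factor n(n-1) / ((n-1)(n+1)) simplifies to n / (n+1).
    stat = n * (n - 1) / ((n - 1) * (n + 1)) * (x_new - mean) ** 2 / var

    # Critical value: the `level` quantile of F with (1, n - 1) d.o.f.
    crit = stats.f.ppf(level, 1, n - 1)
    return bool(stat > crit)
```

Calling `outside_cluster_1d([1, 2, 3, 4, 5], 8.0)` would test whether the value 8 lies too far from a cluster estimated from those five points. With small $n$ the threshold is noticeably wider than the "2 standard deviations" rule of thumb, because the uncertainty in the estimated mean and variance is taken into account.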

In the multivariate case (where $p>1$, i.e. the one you are really interested in), this becomes: assuming $X$ is of dimensions $n\times p$ (so that each row $X_i$ of $X$ is a $p$-vector) and $\hat{\sigma}_X^{-1}$ (the inverse of the sample variance-covariance matrix of $X$) exists:

$$\left(\frac{n(n-p)}{p(n-1)(n+1)}\left(X_i-\hat{\mu}_X\right)'\hat{\sigma}_X^{-1}\left(X_i-\hat{\mu}_X\right)\right) > F_{0.95}(p, n-p)$$

where $\hat{\mu}_X$ denotes the $p$-vector of sample means of $X$. The quantity $\left(X_i-\hat{\mu}_X\right)'\hat{\sigma}_X^{-1}\left(X_i-\hat{\mu}_X\right)$ is the squared Mahalanobis distance of $X_i$ with respect to $(\hat{\mu}_X,\hat{\sigma}_X)$.
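The multivariate rule can be sketched in Python along the following lines. This is only an illustration under stated assumptions: NumPy and SciPy are available, the covariance estimate is the usual unbiased one (`np.cov` default), $n > p$ so the covariance matrix is invertible, and the vector being tested was not part of the sample used to estimate the mean and covariance. The function name `outside_cluster` and the `level` parameter are hypothetical.

```python
import numpy as np
from scipy import stats


def outside_cluster(X, x_new, level=0.95):
    """Return (flagged, d2) for a candidate p-vector x_new.

    Hypothetical helper: d2 is the squared Mahalanobis distance of x_new
    from the sample mean; flagged is True when the scaled distance
    exceeds the F(p, n - p) quantile at `level`.
    """
    X = np.asarray(X, dtype=float)
    x_new = np.asarray(x_new, dtype=float)
    n, p = X.shape
    if n <= p:
        raise ValueError("need n > p for an invertible covariance estimate")

    mean = X.mean(axis=0)
    # p x p sample covariance (ddof=1); atleast_2d covers the p = 1 case.
    cov = np.atleast_2d(np.cov(X, rowvar=False))

    diff = x_new - mean
    # Solve cov @ z = diff instead of forming the inverse explicitly.
    d2 = float(diff @ np.linalg.solve(cov, diff))

    stat = n * (n - p) / (p * (n - 1) * (n + 1)) * d2
    crit = stats.f.ppf(level, p, n - p)
    return bool(stat > crit), d2
```

For example, `outside_cluster(X, x_new)` with `X` an $n\times p$ array of cluster members returns whether `x_new` would be rejected at the 95% level, together with its squared Mahalanobis distance. Using `np.linalg.solve` rather than an explicit matrix inverse is a numerical-stability choice and does not change the rule itself.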
