If the intrinsic dimensionality of your samples is less than the dimensionality of the vector space, singular matrices may arise. If you have fewer than $d+1$ samples (where $d$ is your dimensionality), this situation will even necessarily arise: $k+1$ samples span at most a $k$-dimensional hyperplane. Given such a small sample, you obviously cannot compute a variance in the orthogonal space.
This is why it's common not to use literal PCA, but instead to perform a singular value decomposition (SVD), which can be used to compute the pseudoinverse of a matrix.
If the matrix is invertible, the pseudoinverse will be the inverse.
However, if you are seeing non-invertible matrices, chances are that your distance from the cluster will be meaningless if the vector lies outside of the hyperplane the cluster represents, because you do not know the variance in the orthogonal space (you can think of this variance as 0!). SVD can compute the pseudoinverse, but the "variances" will still not be determined by your data.
In this case, you should probably have been doing global dimensionality reduction first. Increasing the sample size only helps when you actually have non-redundant dimensions: no matter how many samples you draw from a distribution with $y=x$, the matrix will always be non-invertible, and you will not be able to judge the deviation $x-y$ with respect to a standard deviation (which is 0).
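To see the $y=x$ case concretely, here is a minimal NumPy sketch (my own illustration, with arbitrary numbers):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
X = np.column_stack([x, x])              # redundant dimension: y = x exactly

cov = np.cov(X, rowvar=False)            # 2x2 sample covariance
print(np.linalg.matrix_rank(cov))        # -> 1: singular, no matter how many samples
cov_pinv = np.linalg.pinv(cov)           # SVD-based pseudoinverse still exists,
                                         # but assigns "zero variance" off the line y = x
```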
Furthermore, depending on how you compute the covariance matrix, you might be running into numerical issues due to catastrophic cancellation. The simplest workaround is to always center the data first, to get zero mean.
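As a small illustrative sketch (NumPy assumed): the one-pass formula $E[x^2]-E[x]^2$ cancels catastrophically when the mean is large compared to the spread, while centering first is stable:

```python
import numpy as np

rng = np.random.default_rng(1)
x = 1e8 + rng.normal(size=100_000)       # large mean, unit variance

naive = np.mean(x**2) - np.mean(x)**2    # one-pass formula: catastrophic cancellation
stable = np.mean((x - np.mean(x))**2)    # center first, then average the squares

print(naive, stable)  # naive is visibly wrong (it can even come out negative); stable is ~1.0
```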
You can compute the pseudo-Mahalanobis distance by using the pseudo-inverse:
$$W_i'\hat{\sigma}_W^{-1}W_i$$
where
$$W^*=X-\hat{\mu}_X$$ (the centered data matrix, whose $i$-th row is $X_i-\hat{\mu}_X$)
and
$$W=W^*V_{W^*}$$
where $V_{W^*}$ is the $V$ matrix of the SVD decomposition of $W^*$.
Then simply compute the Mahalanobis distances on the matrix $W$ (instead of $X$).
Note that you need to replace $p$ in your inequality by $p^*$ (the rank of $V_{W^*}$).
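As a minimal sketch of this recipe (NumPy assumed; the function name and tolerance are my own choices, not part of the answer):

```python
import numpy as np

def pseudo_mahalanobis_sq(X, tol=1e-10):
    """Squared pseudo-Mahalanobis distances of the rows of X to the mean of X."""
    W_star = X - X.mean(axis=0)                     # W*: centered data matrix
    U, s, Vt = np.linalg.svd(W_star, full_matrices=False)
    keep = s > tol * s.max()                        # keep directions with non-zero spread
    p_star = int(keep.sum())                        # p* = numerical rank
    W = W_star @ Vt[keep].T                         # W = W* V  (projection)
    cov_W = np.atleast_2d(np.cov(W, rowvar=False))  # invertible by construction
    d2 = np.einsum('ij,jk,ik->i', W, np.linalg.inv(cov_W), W)
    return d2, p_star                               # use p* in place of p in the F test
```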
Best Answer
First of all, in the univariate case (when $d=1$, i.e. the one you already know the decision rule for), assuming you have a vector of $n$ univariate measurements $x$ (so that $x$ is an $n\times 1$ matrix where each entry $x_i$ is a scalar), the decision rule you describe is really:
$$\left(\frac{n(n-1)}{(n-1)(n+1)}\frac{\left(x_i-\hat{\mu}_x\right)^2}{\hat{\sigma}^2_x}\right) > F_{0.95}(1, n-1)$$
where $F_{0.95}$ is the 95th percentile of a Fisher ($F$) distribution (you consider that $x_i$ is too far from $\hat{\mu}_x$ in the metric $\hat{\sigma}_x$ to belong to the cluster with mean $\hat{\mu}_x$ and scale $\hat{\sigma}_x$). This is the correct version of your rule of thumb when $p=1$ (I denote by $p$ what you write as $d$; sorry for the confusion, but if I change my notation now, my answers to your comments below will become meaningless).
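A quick sketch of this univariate rule (SciPy assumed for the $F$ quantile; the function name is just a placeholder):

```python
import numpy as np
from scipy.stats import f

def univariate_outliers(x, level=0.95):
    x = np.asarray(x, dtype=float)
    n = len(x)
    mu_hat, var_hat = x.mean(), x.var(ddof=1)            # sample mean and variance
    stat = n * (n - 1) / ((n - 1) * (n + 1)) * (x - mu_hat) ** 2 / var_hat
    return stat > f.ppf(level, 1, n - 1)                 # True where x_i is "too far"
```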
In the multivariate case (where $p>1$, i.e. the one you are really interested in), this becomes the following: assuming $X$ is of dimension $n\times p$ (so that each row $X_i$ of $X$ is a $p$-vector) and $\hat{\sigma}_X^{-1}$ (the inverse of the variance-covariance matrix of $X$) exists:
$$\left(\frac{n(n-p)}{p(n-1)(n+1)}\left(X_i-\hat{\mu}_X\right)'\hat{\sigma}_X^{-1}\left(X_i-\hat{\mu}_X\right)\right) > F_{0.95}(p, n-p)$$
denoting by $\hat{\mu}_X$ the $p$-vector of means of $X$. Here $\left(X_i-\hat{\mu}_X\right)'\hat{\sigma}_X^{-1}\left(X_i-\hat{\mu}_X\right)$ is the (squared) Mahalanobis distance of $X_i$ w.r.t. $(\hat{\mu}_X,\hat{\sigma}_X)$.
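And the multivariate rule as a hedged sketch under the same assumptions (it presumes $\hat{\sigma}_X$ is invertible; otherwise see the SVD-based workaround above):

```python
import numpy as np
from scipy.stats import f

def multivariate_outliers(X, level=0.95):
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    diff = X - X.mean(axis=0)                             # X_i - mu_hat
    inv_cov = np.linalg.inv(np.cov(X, rowvar=False))      # sigma_hat^{-1}, must exist
    d2 = np.einsum('ij,jk,ik->i', diff, inv_cov, diff)    # squared Mahalanobis distances
    stat = n * (n - p) / (p * (n - 1) * (n + 1)) * d2
    return stat > f.ppf(level, p, n - p)                  # True where X_i is "too far"
```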