I was recently reading up on the Mahalanobis distance, and I understand how it generalizes distance measures such as the Euclidean distance to multivariate data. What got me wondering, however, is how one derives the formula constructively. I understand how the general form applies in particular cases, but is there a way to construct this general formula from 'first principles'?

In particular, could someone explain the significance of the inverse covariance matrix (especially in the non-trivial case where it is not diagonal)?

More than "derive", I would talk about "why". For some explanations have a look at my answers here and here. For a direct application, consider for example the $p$-variate normal distribution

$$f(x|\theta, \Sigma)=\frac{1}{(2\pi)^\frac{p}{2}|\Sigma|^\frac{1}{2}} \exp\left( -\frac{1}{2}\langle (x-\theta),\Sigma^{-1}(x-\theta)\rangle \right);$$

the exponent is (up to a factor $-\frac{1}{2}$) the squared Mahalanobis distance of $x$ from the mean $\theta$. This is an example of a (Gaussian) kernel; it is widely used in density estimation. In the bivariate case, the level curves / density contours

$$\langle (x-\theta),\Sigma^{-1}(x-\theta)\rangle = K $$

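are ellipses centered at $\theta$, with the usual statistical / mathematical interpretation: the axes point along the eigenvectors of $\Sigma$, and the semi-axis lengths are proportional to the square roots of the corresponding eigenvalues.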
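To make the role of $\Sigma^{-1}$ a bit more concrete, here is a minimal NumPy sketch. The numbers for `theta`, `sigma`, `x` and `K`, and the helper names `mahalanobis_sq` and `whiten`, are made up for illustration, and the sketch assumes $\Sigma$ is symmetric positive definite. It factors $\Sigma = LL^T$ (Cholesky) and checks that the squared Mahalanobis distance is just the squared Euclidean distance after the change of variables $z = L^{-1}(x-\theta)$. It also builds points on a contour as the image of a circle under $u \mapsto \theta + \sqrt{K}\,Lu$, which is one way to see why the level curves are ellipses.

```python
import numpy as np

def mahalanobis_sq(x, theta, sigma):
    """Squared Mahalanobis distance <(x - theta), Sigma^{-1} (x - theta)>."""
    d = np.asarray(x, dtype=float) - np.asarray(theta, dtype=float)
    # Solve Sigma y = d rather than forming the inverse explicitly.
    return float(d @ np.linalg.solve(sigma, d))

def whiten(x, theta, sigma):
    """Return z = L^{-1} (x - theta), where Sigma = L L^T (Cholesky factor)."""
    L = np.linalg.cholesky(sigma)
    d = np.asarray(x, dtype=float) - np.asarray(theta, dtype=float)
    return np.linalg.solve(L, d)

# Illustrative values only (not from the original post).
theta = np.array([1.0, -1.0])
sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])   # non-diagonal, positive definite
x = np.array([2.0, 0.5])

d2_direct = mahalanobis_sq(x, theta, sigma)

# In whitened coordinates the same quantity is an ordinary squared norm.
z = whiten(x, theta, sigma)
d2_whitened = float(z @ z)

# Points on the contour <(x - theta), Sigma^{-1} (x - theta)> = K:
# take the circle of radius sqrt(K) and push it through L.
K = 2.0
angles = np.linspace(0.0, 2.0 * np.pi, 8, endpoint=False)
unit_circle = np.stack([np.cos(angles), np.sin(angles)])   # shape (2, 8)
L = np.linalg.cholesky(sigma)
contour = theta[:, None] + np.sqrt(K) * (L @ unit_circle)
contour_d2 = [mahalanobis_sq(p, theta, sigma) for p in contour.T]

print(d2_direct, d2_whitened)
print(contour_d2)
```

The two printed distances should agree up to rounding, and every entry of `contour_d2` should equal `K` up to rounding. When $\Sigma$ is diagonal, $L$ only rescales each coordinate by its standard deviation; the off-diagonal entries additionally rotate and shear, which is what the full inverse undoes.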

In more mathematical terms, the squared Mahalanobis distance is (up to a factor $\frac{1}{2}$) the Bregman divergence generated by the convex function $F(x)=\frac{1}{2}\langle x,\Sigma^{-1}x\rangle$. In the regression context, it is also related to leverage; I refer to specialized texts for more details.
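
As a quick check of the Bregman claim, here is the computation written out. The second point $y$ and the notation $D_F$ are mine, and I assume $\Sigma$ (hence $\Sigma^{-1}$) is symmetric positive definite, so that $\nabla F(y)=\Sigma^{-1}y$. With the standard definition $D_F(x,y)=F(x)-F(y)-\langle \nabla F(y),\,x-y\rangle$,

$$\begin{aligned}
D_F(x,y) &= \tfrac{1}{2}\langle x,\Sigma^{-1}x\rangle-\tfrac{1}{2}\langle y,\Sigma^{-1}y\rangle-\langle \Sigma^{-1}y,\,x-y\rangle \\
&= \tfrac{1}{2}\langle x,\Sigma^{-1}x\rangle-\langle y,\Sigma^{-1}x\rangle+\tfrac{1}{2}\langle y,\Sigma^{-1}y\rangle \\
&= \tfrac{1}{2}\langle (x-y),\Sigma^{-1}(x-y)\rangle .
\end{aligned}$$

So this particular $F$ gives one half of the squared Mahalanobis distance between $x$ and $y$; dropping the $\frac{1}{2}$ in $F$ gives the squared distance itself.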

