I would like to introduce a geometric example to fix notation and then move to a (quite informal) analysis of the Mahalanobis distance.
- Ellipsoid in $\mathbb R^{3}$: an easy example
Let us introduce the geometric definition of an ellipsoid centered at $y\in\mathbb R^{3}$, i.e. the locus of points $x\in\mathbb R^{3}$ satisfying
$$\langle(x-y),A^{-1}(x-y)\rangle=1.$$
Here $A$ is any positive definite matrix. For example, if $A=\operatorname{diag}(a^2,b^2,c^2)$, $x=(x_1,x_2,x_3)$ and $y=(y_1,y_2,y_3)$, then the ellipsoid we are looking for is
$$\frac{(x_1-y_1)^2}{a^2}+ \frac{(x_2-y_2)^2}{b^2}+ \frac{(x_3-y_3)^2}{c^2}=1.$$
The parameter $a$ is the semi-axis length of the ellipsoid along the 1st axis, i.e. it controls how far the surface extends from the center $y$ in that direction; similar considerations hold for $b$ and $c$. The ratios $\frac{a}{b}$, $\frac{b}{c}$ and $\frac{a}{c}$ determine the "shape" ("oblate", "prolate") of the given ellipsoid.
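For readers who like to see numbers, here is a minimal numpy sketch of the defining equation; the semi-axes $a=3$, $b=2$, $c=1$ and the center $y$ are arbitrary choices of mine:

```python
import numpy as np

# Hypothetical diagonal case with semi-axes a=3, b=2, c=1.
a, b, c = 3.0, 2.0, 1.0
A = np.diag([a**2, b**2, c**2])
y = np.array([1.0, -2.0, 0.5])          # center of the ellipsoid

def quadratic_form(x, y, A):
    """Evaluate <(x - y), A^{-1}(x - y)>; the value is 1 on the surface."""
    d = x - y
    return d @ np.linalg.solve(A, d)

# The point y + (a, 0, 0) lies on the surface along the 1st axis:
print(quadratic_form(y + np.array([a, 0.0, 0.0]), y, A))   # -> 1.0
```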
- Ellipsoids and Mahalanobis distance in $\mathbb R^{n}$
The Mahalanobis distance
$$D(x,y)=\sqrt{\langle(x-y),C^{-1}(x-y)\rangle}$$
gives, by definition, the distance between two vectors $x$ and $y$ in $\mathbb R^{n}$ of realizations of a given multivariate random process $X=(X_1,\dots,X_n)$; here $C$ is (an estimate of) the covariance matrix.
Looking at the formula for the geometric ellipsoid, you can identify the vector $y$ in $D(x,y)$ with the center of the ellipsoid defined by $D(x,y)=d$, for some $d>0$.
To be really precise: to bring this level set into the normalized form $\langle(x-y),A^{-1}(x-y)\rangle=1$ above, you divide the inverse covariance matrix $C^{-1}$ by $d^2$, i.e. you take $A=d^2C$.
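To make the rescaling concrete, a small numpy/SciPy sketch may help; the sample is synthetic and its size arbitrary, and note that `scipy.spatial.distance.mahalanobis` expects the *inverse* covariance matrix:

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))            # toy sample of realizations
y = X.mean(axis=0)                       # center
C = np.cov(X, rowvar=False)              # estimated covariance matrix
C_inv = np.linalg.inv(C)

x = X[0]
# Manual formula D(x, y) = sqrt(<(x - y), C^{-1}(x - y)>):
D_manual = np.sqrt((x - y) @ C_inv @ (x - y))
D_scipy = mahalanobis(x, y, C_inv)       # SciPy takes the inverse covariance
print(np.isclose(D_manual, D_scipy))     # -> True

# Rescaling: dividing C^{-1} by d^2 turns the level set D = d into D = 1.
d = D_manual
print(np.sqrt((x - y) @ (C_inv / d**2) @ (x - y)))   # -> 1.0
```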
The sentence "The Mahalanobis distance is simply the distance of the test point from the center of mass divided by the width of the ellipsoid in the direction of the test point"
can be understood through a simplified exposition.
Let us keep the above setup: $x$ is a random vector of realizations of the multivariate random process $X=(X_1,\dots,X_n)$. Let us introduce the vector $s=(s_1,\dots,s_n)$ of standardized variables
$$s_i:=\frac{x_i-\mu_i}{\sigma_i},$$
denoting by $\mu_i$ the mean of the realizations of the $i$-th random process $X_i$ and by $\sigma_i$ its standard deviation. Let us consider the Mahalanobis distance of a vector $x$ from the center $\mu=(\mu_1,\dots,\mu_n)$, i.e.
$$D(x,\mu)=\sqrt{\langle (x-\mu),C^{-1}(x-\mu) \rangle }, $$
where the covariance matrix is the diagonal matrix $C=\operatorname{diag}(\sigma^2_1,\dots,\sigma^2_n)$. Then
$$D(x,\mu)=\sqrt{\langle s,s\rangle } ,$$
by construction. As in the geometric example above, the standard deviation $\sigma_i$ in
$C=\operatorname{diag}(\sigma^2_1,\dots,\sigma^2_n)$ controls the "shape" (or width) of the ellipsoid defined by the Mahalanobis distance along the $i$-th axis.
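The identity $D(x,\mu)=\sqrt{\langle s,s\rangle}$ is easy to verify numerically; here is a minimal sketch with arbitrary means and standard deviations:

```python
import numpy as np

mu = np.array([1.0, -3.0, 2.0])           # means of X_1, X_2, X_3
sigma = np.array([0.5, 2.0, 1.5])         # standard deviations
C = np.diag(sigma**2)                     # diagonal covariance matrix
x = np.array([1.3, -1.0, 4.0])            # an arbitrary test point

s = (x - mu) / sigma                      # standardized variables
D = np.sqrt((x - mu) @ np.linalg.inv(C) @ (x - mu))
print(np.isclose(D, np.sqrt(s @ s)))      # -> True: D(x, mu) = ||s||
```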
As noted by @ArtificialBreeze in the comments on this question (I recommend reading those comments), the choice of whether to assume column vectors or row vectors is arbitrary; just be consistent with what you have already done.
Now for the QR! If you have already done the Gram-Schmidt process (or otherwise have an orthonormal basis), then the last step is easy (but very tedious, as usual).
The QR factorization states that $A = QR$, where $A$ is your original matrix (the very first matrix in the question, in my case) and $Q$ is the orthogonal matrix easily obtained by the Gram-Schmidt process (my second matrix above).
Now for the fun trick!
A property of an orthogonal matrix is that $Q^{-1} = Q^T$. So simply transpose your matrix to get $Q^{-1}$. Then multiply both sides of $A = QR$ on the left by $Q^{-1}$, and you get $R = Q^{-1}A = Q^TA$. You can then multiply out $QR$ to check your answer; since there are so many operations involved, I recommend that step.
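If you want to sanity-check the computation by machine, here is a short numpy sketch of the trick; the $3\times 3$ matrix is a stand-in for the matrix in the question, and `np.linalg.qr` replaces the hand-done Gram-Schmidt:

```python
import numpy as np

# An arbitrary full-rank 3x3 example matrix (stand-in for the one in the question).
A = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])

# Q with orthonormal columns, as Gram-Schmidt would produce.
Q, _ = np.linalg.qr(A)

# The trick: Q^{-1} = Q^T, so R = Q^T A.
R = Q.T @ A
print(np.allclose(Q @ R, A))               # -> True: the factorization checks out
print(np.allclose(np.tril(R, -1), 0))      # -> True: R is upper triangular
```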
You may also see the whole process, in full, from start to finish, in this nice, concise video. It helped me!
Thank you everyone who commented and helped me out! I hope this helps others!
Best Answer
I will answer a slightly broader question:
Both can be answered as follows. Let $X=\{x^1,\dotsc,x^r\}\subset\mathbf R^N$ be a set of points taken from some metric space, together with their distances $d_X$.
Assuming that $(X, d_X)$ has an isometric embedding $\iota$ into some lower-dimensional $\mathbf R^n$, which we do not know yet, our goal is to find possible images $\hat x^i=\iota(x^i)$.
Let $D=(d_{ij})_{ij}$ with $d_{ij}=d_X(x^i, x^j)$. Since $\mathbf R^N$ is a Euclidean space, we can form the Gram matrix $B=(b_{ij})_{ij}$ with $b_{ij}=\langle x^i, x^j\rangle$. Write $X=(x^1,\dotsc,x^r)$ and $\hat X=(\hat x^1,\dotsc,\hat x^r)$ as matrices so that $B=X^\mathrm T X$, and assume w.l.o.g. that $\{\hat x^i\}$ have the origin at their centre of mass. The pairwise distances satisfy \begin{align} d_{ij}^2 &= \langle x^i-x^j, x^i-x^j\rangle \\ &= b_{ii} -2b_{ij} + b_{jj}. \end{align}
Since $\iota$ is supposed to become an isometry, $B=\hat X^\mathrm T \hat X$. Since the points $\hat x^i$ are supposed to be centered at the origin,
$$\textstyle\sum_i b_{ij} = \textstyle\sum_{ik} \hat x^i_k \hat x^j_k = 0;$$
hence
\begin{align} \textstyle\sum_{i} d_{ij}^2 &= \operatorname{tr} B + r b_{jj}\\ \textstyle\sum_{j} d_{ij}^2 &= \operatorname{tr} B + r b_{ii}\\ \textstyle\sum_{ij} d_{ij}^2 &= 2r\operatorname{tr} B \end{align}
and conversely
\begin{align} b_{ij} &= \textstyle -\frac 12 \left[ d_{ij}^2 - \frac 1r \left( \sum_k d_{kj}^2 + \sum_k d_{ik}^2\right) + \frac{1}{r^2}\sum_{kl} d_{kl}^2\right]. \end{align} With the centering matrix $C=(\delta_{ij}-\frac 1r)_{ij}$ the last line can be written as
$$B = -\tfrac 12\, C D^{(2)} C,$$
where $D^{(2)}=(d_{ij}^2)_{ij}$ denotes the matrix of squared distances.
Since $B$ is a symmetric matrix (and, under our embedding assumption, positive semidefinite), it can be factored as $B=Q\Lambda Q^\mathrm T$ for $Q$ orthogonal and $\Lambda=\operatorname{diag}(\lambda_1,\dotsc, \lambda_n, 0, \dotsc, 0)$ diagonal with nonnegative entries in descending order. Then the first $n$ rows of
$$\hat X = \Lambda^{1/2} Q^\mathrm T$$
contain coordinates for the $r$ points $\hat x^i\in\mathbf R^n$ which give rise to the prescribed distance matrix $D$, and all other rows of $\hat X$ are zero.
If $X$ cannot be embedded into $\mathbf R^n$ as assumed initially, $\Lambda$ might have non-zero entries beyond $\lambda_n$. If we drop all but the first $n$ eigenvectors and eigenvalues in $Q$ and $\Lambda$ respectively, we obtain a map $X\to\mathbf R^n$ for which the metric distortion $\sum_{ij} \bigl(d_{ij} - \Vert \hat x^i - \hat x^j \Vert\bigr)^2$ becomes minimal. A necessary and sufficient criterion for the embeddability of $X$ in $\mathbf R^n$ is the vanishing of all higher Cayley–Menger determinants. If $X$ cannot be embedded in any Euclidean space, $\Lambda$ might contain negative entries.
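The whole recipe (squared distances, double centering, eigendecomposition, coordinates) fits in a few lines of numpy; here is a sketch on synthetic points, where the number of points and the target dimension are arbitrary choices of mine:

```python
import numpy as np

rng = np.random.default_rng(1)
r, n = 6, 2                                # r points, target dimension n
X_true = rng.normal(size=(n, r))           # hidden coordinates (as columns)
D = np.linalg.norm(X_true[:, :, None] - X_true[:, None, :], axis=0)

# Double centering: B = -1/2 * C D2 C with C = I - (1/r) 11^T.
D2 = D**2
C = np.eye(r) - np.ones((r, r)) / r
B = -0.5 * C @ D2 @ C

# Eigendecomposition B = Q Lambda Q^T, eigenvalues in descending order.
lam, Q = np.linalg.eigh(B)
lam, Q = lam[::-1], Q[:, ::-1]

# Coordinates: first n rows of Lambda^{1/2} Q^T (negative round-off clipped).
X_hat = np.sqrt(np.clip(lam[:n], 0, None))[:, None] * Q[:, :n].T
D_hat = np.linalg.norm(X_hat[:, :, None] - X_hat[:, None, :], axis=0)
print(np.allclose(D, D_hat))               # -> True: distances are reproduced
```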