Row (column) reducing a matrix is essentially a process in which you apply a series of elementary operations which, taken together, have the same effect as multiplying your equation (augmented system) by an invertible matrix (a product of elementary matrices, row or column).
Now, the distinction between a row operation and a column operation is that a row operation amounts to multiplying on the left, e.g. $E_RA=E_RB$, whereas a column operation amounts to multiplying on the right, e.g. $AE_C=BE_C$. To be able to perform such an operation on an augmented system $[A : B]$, $A$ and $B$ must have the same number of rows (for a row operation) or the same number of columns (for a column operation). So you cannot, for example, perform a column operation on the augmented system of $Ax=B$ when $A$ has several columns and $B$ only one: within the context of solving the system for $x$, such an operation would amount to multiplying both $A$ and $B$ on the right by the same elementary matrix, and that multiplication is undefined for one of them, since they do not have the same number of columns.
Now, just as you can row reduce the augmented system $[A : I]$ to $[I : A^{-1}]$ (if $A$ is invertible), you can also column reduce $\begin{bmatrix} A \\ \hline I \end{bmatrix}$ to $\begin{bmatrix} I \\ \hline A^{-1} \end{bmatrix}$. The point of augmenting with $I$ is that it lets you retrieve the composition of the elementary operations as a single invertible matrix.
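As a quick sanity check, here is a minimal sketch (using sympy, with an arbitrary $2\times 2$ invertible matrix chosen by me) of recovering $A^{-1}$ by row reducing $[A : I]$; the column version is the same computation applied to $A^T$.

```python
import sympy as sp

A = sp.Matrix([[2, 1], [5, 3]])        # an invertible example matrix (assumption)
aug = A.row_join(sp.eye(2))            # the augmented system [A : I]
rref, _ = aug.rref()                   # row reduce to [I : A^{-1}]
A_inv = rref[:, 2:]                    # the right block is the inverse
assert A_inv == A.inv()                # matches sympy's built-in inverse
```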
What about using both row and column operations on a matrix? Here is the second observation I want to make: the whole point of augmenting and then row or column reducing is that, instead of multiplying one matrix and then the other by the same elementary matrix, you do it all in one step; and usually at the end (if you are not solving a linear system) you want to retrieve the nonsingular matrices composed of the series of elementary operations. That is, you want $P$ and/or $Q$ such that $PAQ$ is some meaningful equivalent matrix. Consider reducing
$\begin{bmatrix}A & I \\ I & 0 \end{bmatrix}$ to $\begin{bmatrix}B & P \\ Q & 0 \end{bmatrix}.$
($A$ needn't be square, but then the two identity submatrices have different sizes.) This means that $PAQ=B$, and we say that $A$ and $B$ are equivalent.
If $A$ is invertible then you can also use both row and column operations:
Reduce $\begin{bmatrix}A & I \\ I & 0 \end{bmatrix}$ to $\begin{bmatrix}I & P \\ Q & 0 \end{bmatrix}.$
What this means is $PAQ=I$, so that $A^{-1}=QP$.
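A minimal sketch of this bookkeeping (sympy again; the example matrix and the hand-picked elementary matrices are my own choices): accumulate the row operations into $P$, the column operations into $Q$, and check that $PAQ=I$ and $A^{-1}=QP$.

```python
import sympy as sp

A  = sp.Matrix([[0, 2], [3, 1]])                     # example matrix (assumption)
E1 = sp.Matrix([[0, 1], [1, 0]])                     # row op: swap rows 1 and 2
E2 = sp.Matrix([[sp.Rational(1, 3), 0], [0, 1]])     # row op: scale row 1 by 1/3
E3 = sp.Matrix([[1, 0], [0, sp.Rational(1, 2)]])     # row op: scale row 2 by 1/2
C1 = sp.Matrix([[1, -sp.Rational(1, 3)], [0, 1]])    # col op: col 2 <- col 2 - (1/3) col 1

P = E3 * E2 * E1                                     # accumulated row operations (left factor)
Q = C1                                               # accumulated column operations (right factor)
assert P * A * Q == sp.eye(2)                        # P A Q = I
assert Q * P == A.inv()                              # hence A^{-1} = Q P
```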
The one exception I should mention, where one can use both row and column operations on an augmented system $[A : I]$, is when $A$ is symmetric and you reduce to $[D : Q^T]$ by a series of paired column and row operations (each column operation followed by exactly the same row operation, e.g. swap columns 1 and 2, then swap rows 1 and 2). At the end you will have $D=Q^TAQ$. Of course, this only works because $A$ is symmetric.
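For the symmetric case, here is a small sketch (the matrix and the single paired operation are illustrative choices of mine) of one column/row pair producing a congruent diagonal matrix $D=Q^TAQ$:

```python
import sympy as sp

A = sp.Matrix([[2, 1], [1, 3]])                      # a symmetric example matrix (assumption)
Q = sp.Matrix([[1, -sp.Rational(1, 2)], [0, 1]])     # column op: col 2 <- col 2 - (1/2) col 1
D = Q.T * A * Q                                      # the matching row op is the left factor Q^T
assert D == sp.diag(2, sp.Rational(5, 2))            # D is diagonal and congruent to A
```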
$$\begin{cases} x-X_c=(Au+B)\cos(Cv+D)\\y-Y_c=(Au+B)\sin(Cv+D)\end{cases} \quad\to\quad (x-X_c)^2+(y-Y_c)^2=(Au+B)^2$$
FIRST PART :
In a 3D Cartesian system $(x,y,u)$, this is the equation of a cone with vertical axis and apex $(X_c\:,Y_c\:,-\frac{B}{A})$.
So, the problem can be interpreted as the fitting of a cone to a given set of points.
A cone is a particular quadratic surface, so the problem is a particular case of the general problem of regression of a quadratic surface. This problem is treated on page 17 of the paper https://fr.scribd.com/doc/14819165/Regressions-coniques-quadriques-circulaire-spherique (in French). A rough translation (adapted to the present case of the cone) is given below.
The equation can be written as :
$$x^2+y^2=A^2u^2+2X_cx+2Y_cy+2ABu-X_c^2-Y_c^2+B^2$$
Let $\quad \begin{cases}R=x^2+y^2 \\ S=A^2 \\ T=2X_c \\ U=2Y_c \\ V=2AB \\ W=-X_c^2-Y_c^2+B^2 \end{cases} \to\quad R(x,y,u)=Su^2+Tx+Uy+Vu+W$
Given the data $(x_k,y_k,u_k)$, one computes $R_k=x_k^2+y_k^2$ for $k=1$ to $n$.
Then, the linear regression for $S,T,U,V,W$ $$R_k \simeq Su_k^2+Tx_k+Uy_k+Vu_k+W\qquad (n\text{ points})$$
leads to approximate values of $X_c\:,Y_c\:,A,B$.
These approximate values are good only if the scatter of the data is small, because the five fitted coefficients $S,T,U,V,W$ depend on only four underlying parameters, a constraint the linearized regression does not enforce.
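A minimal numpy sketch of this first regression (the function and variable names are mine, not from the paper): solve for $S,T,U,V,W$ by linear least squares, then recover $X_c, Y_c, A, B$ from the definitions above.

```python
import numpy as np

def fit_cone_part1(x, y, u):
    """Sketch of the first regression; x, y, u are 1-D data arrays (hypothetical names)."""
    R = x**2 + y**2                                        # R_k = x_k^2 + y_k^2
    M = np.column_stack([u**2, x, y, u, np.ones_like(u)])  # columns for S, T, U, V, W
    S, T, U, V, W = np.linalg.lstsq(M, R, rcond=None)[0]   # linear least squares
    A  = np.sqrt(S)                                        # S = A^2 (take A > 0)
    Xc = T / 2                                             # T = 2 X_c
    Yc = U / 2                                             # U = 2 Y_c
    B  = V / (2 * A)                                       # V = 2 A B
    return Xc, Yc, A, B
```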
SECOND PART :
Using the approximate values of $X_c\:,Y_c\:,A,B$ computed above, the system of equations can be rewritten as :
$$\begin{cases} \cos^{-1}\left(\frac{x-X_c}{Au+B}\right)=Cv+D\\ \sin^{-1}\left(\frac{y-Y_c}{Au+B}\right)=Cv+D\end{cases}$$
Let $Y_k= \cos^{-1}\left(\frac{x_k-X_c}{Au_k+B}\right)$ and $Y_{n+k}=\sin^{-1}\left(\frac{y_k-Y_c}{Au_k+B}\right)$ for $k=1$ to $n$, which gives $2n$ values $Y_1,\dots,Y_{2n}$.
Then, the linear regression for $C,D$
$$Y_k \simeq Cv_k+D\qquad (2n\text{ points, with } v_{n+k}=v_k)$$
leads to approximate values of $C$ and $D$.
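A matching sketch of this second regression (same hypothetical naming; note that $\cos^{-1}$ and $\sin^{-1}$ return principal values only, so this simple version assumes $Cv+D$ stays in the principal range):

```python
import numpy as np

def fit_cone_part2(x, y, u, v, Xc, Yc, A, B):
    """Sketch of the second regression, using the first-part estimates (hypothetical names)."""
    r  = A * u + B
    Yk = np.concatenate([np.arccos((x - Xc) / r),          # Y_k      for k = 1..n
                         np.arcsin((y - Yc) / r)])         # Y_{n+k}  for k = 1..n
    vv = np.concatenate([v, v])                            # each v_k is used twice
    M  = np.column_stack([vv, np.ones_like(vv)])
    C, D = np.linalg.lstsq(M, Yk, rcond=None)[0]           # linear least squares for C, D
    return C, D
```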
COMMENTS :
The above approximate values of $X_c\:,Y_c\:,A,B,C,D$ are obtained according to a fitting criterion which is not exactly least squares of the distances between the cone and the points. If you definitely want least squares, a nonlinear regression has to be used, for example with the Levenberg–Marquardt method. It would require an iterative process starting from an initial guess of the six parameters, which is difficult to provide in general. The above direct method supplies values of the parameters that can be used as the initial guess.
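If the full nonlinear refinement is wanted, here is a sketch with scipy's Levenberg–Marquardt solver, seeded by the direct estimates (names are mine; note this minimizes coordinate residuals in $x$ and $y$, not orthogonal distances to the surface):

```python
import numpy as np
from scipy.optimize import least_squares

def refine_cone_fit(x, y, u, v, p0):
    """Sketch: nonlinear refinement of (Xc, Yc, A, B, C, D); p0 comes from the direct method."""
    def residuals(p):
        Xc, Yc, A, B, C, D = p
        r = A * u + B
        return np.concatenate([x - Xc - r * np.cos(C * v + D),   # residuals in x
                               y - Yc - r * np.sin(C * v + D)])  # residuals in y
    return least_squares(residuals, p0, method="lm").x           # Levenberg-Marquardt
```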
Best Answer
First question: If $X$ is square and nonsingular, then the equivalence between normal equations and "ordinary" system follows from the fact that $(X^TX)^{-1}X^T=X^{-1}X^{-T}X^T=X^{-1}$.
Second question: Linear least squares (LS) problems are generally more sensitive to rounding errors than "ordinary" systems, and avoiding unnecessary loss of accuracy requires some attention.
First, one should never compute $X^TX$ explicitly unless there is a very good reason for that (e.g., when only a very rough approximation to the LS solution is needed and $X$ is not very ill-conditioned). Usually, a good idea is to use some sort of orthogonalisation algorithm applied to $X$ instead.
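A small numpy sketch of the difference (the test matrix is deliberately made nearly rank-deficient; all constants are arbitrary choices of mine): forming $X^TX$ squares the condition number, while a QR-based solve works with $X$ directly.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 5))
X[:, 4] = X[:, 3] + 1e-6 * rng.standard_normal(100)     # nearly dependent columns: X is ill-conditioned
Y = rng.standard_normal(100)

# Normal equations: the condition number of X^T X is kappa_2(X)^2.
Z_normal = np.linalg.solve(X.T @ X, X.T @ Y)

# Orthogonalisation: X = QR, then solve the triangular system R Z = Q^T Y.
Q, R = np.linalg.qr(X)
Z_qr = np.linalg.solve(R, Q.T @ Y)

print(np.linalg.cond(X), np.linalg.cond(X.T @ X))       # the second is roughly the square of the first
```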
Second, the LS solution can be much more sensitive to the data perturbations (and hence rounding errors). Just for comparison, consider the following. Assume $X\in\mathbb{R}^{N\times M}$ has full column rank, $Y\in\mathbb{R}^{N}$, and we seek to solve for $Z=X^{\dagger}Y=(X^TX)^{-1}X^TY$. Assume that instead of this $Z$ we get a $\tilde{Z}=\tilde{X}^{\dagger}\tilde{Y}$, where $\tilde{X}=X+E$ and $\tilde{Y}=Y+F$ with $\|E\|_2\leq\epsilon\|X\|_2$, $\|F\|_2\leq\epsilon\|Y\|_2$, and $\epsilon\kappa_2(X)<1$. That is, $\tilde{Z}$ is a solution of the LS problem with some small perturbation of the given data.
Now in the case of the "ordinary" system with $M=N$, we have the following bound relating the forward error (relative norm of $Z-\tilde{Z}$) to the data perturbations: $$ \frac{\|Z-\tilde{Z}\|_2}{\|Z\|_2}\leq\frac{2\epsilon\kappa_2(X)}{1-\epsilon\kappa_2(X)}. $$ This means that the sensitivity of the solution of the ordinary (consistent) linear system with respect to the data perturbations is essentially governed by the condition number of $X$. However, for the general LS problem, we have $$ \frac{\|Z-\tilde{Z}\|_2}{\|Z\|_2}\leq \frac{\epsilon\kappa_2(X)}{1-\epsilon\kappa_2(X)} \left[2+(\kappa_2(X)+1)\frac{\|R\|_2}{\|X\|_2\|Z\|_2}\right], $$ where $R=Y-XZ$. This means that for problems with relatively large residuals (where the data are not "well correlated") the sensitivity of the LS solution is in fact governed by the square of the condition number of $X$.
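A rough numerical illustration of the two regimes (the test matrix, its condition number, the residual, and the perturbation size are all arbitrary choices of mine): with a sizable residual, the observed forward error is much larger than $\epsilon\,\kappa_2(X)$ and tracks the $\epsilon\,\kappa_2(X)^2$ term instead.

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, eps = 200, 4, 1e-10
U, _ = np.linalg.qr(rng.standard_normal((N, M)))
X = U @ np.diag([1.0, 1.0, 1.0, 1e-4])                   # kappa_2(X) = 1e4 by construction

r = rng.standard_normal(N)
r -= U @ (U.T @ r)                                        # residual orthogonal to range(X)
Y = X @ np.ones(M) + r                                    # LS solution is (1,1,1,1); residual is r

Z  = np.linalg.lstsq(X, Y, rcond=None)[0]                 # unperturbed LS solution
E  = eps * np.linalg.norm(X, 2) * rng.standard_normal((N, M)) / np.sqrt(N)
Zt = np.linalg.lstsq(X + E, Y, rcond=None)[0]             # solution of the perturbed problem

kappa = np.linalg.cond(X)
print(np.linalg.norm(Z - Zt) / np.linalg.norm(Z),         # observed forward error
      eps * kappa,                                        # "ordinary system" scale: eps * kappa
      eps * kappa**2)                                      # large-residual LS scale: eps * kappa^2
```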