Solved – Why does the total least squares line in 2D pass through the average of all data points

fitting, mean, total least squares

I have $N$ data points $\mathbf{m_k}$ and I want to fit a line through them with minimal error $$J = \sum_{k=1}^N ||\mathbf{m_k^*} - \mathbf{m_k}||^2 = \sum_{k=1}^N ||\mathbf{m_0} + a_k\mathbf{e} - \mathbf{m_k}||^2.$$

$\mathbf{m_k^*}$ is the point on the line with minimal distance to $\mathbf{m_k}$. It can be written as $\mathbf{m_0} + a_k\mathbf{e}$, where $\mathbf{m_0}$ is the support vector of the line and $\mathbf{e}$ is a direction vector of length $1$.
My lecture slides contain a proof that $a_k = \mathbf{e}^T(\mathbf{m_k} - \mathbf{m_0})$, yielding $J = \sum_{k=1}^N ||\mathbf{m_0} + \mathbf{e}^T(\mathbf{m_k} - \mathbf{m_0})\mathbf{e} - \mathbf{m_k}||^2,$
but I don't get any further from there. My lecture just asserts that $\mathbf{m_0} = \mathbf{\bar{m}}$; it feels reasonable, but I want a proof.
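To make the setup concrete, here is a minimal numerical sketch of the objective (the function name `tls_objective` and the NumPy route are my own, not from the slides): it projects each point onto a candidate line via $a_k = \mathbf{e}^T(\mathbf{m_k} - \mathbf{m_0})$ and sums the squared residuals.

```python
import numpy as np

def tls_objective(m0, e, points):
    """J = sum_k ||m0 + a_k e - m_k||^2 for the line m0 + a e.

    m0: (d,) support point; e: (d,) direction; points: (N, d) data.
    """
    e = e / np.linalg.norm(e)           # enforce ||e|| = 1
    diffs = points - m0                 # rows are m_k - m_0
    a = diffs @ e                       # a_k = e^T (m_k - m_0)
    residuals = np.outer(a, e) - diffs  # rows are m_0 + a_k e - m_k
    return np.sum(residuals ** 2)
```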

I tried searching the internet, but most articles (e.g. this one) want to predict a $y_i$ from $x_i$ and use vertical offsets, whereas I want perpendicular offsets. I don't want to do a regression, nor do I want to be restricted to two-dimensional space.
I'm not even quite sure what the thing I'm trying to do is called; my guess is total least squares, but I'm pretty confused by the Wikipedia page.

Best Answer

This question can be reduced to a simple Ordinary Least Squares problem of fitting a constant to a set of numbers, where it is well known (and easy to show) that the unique solution is the mean of those numbers.

The analysis comes down to selecting an appropriate coordinate system and analyzing the squared distances (from the data points to any line) coordinate by coordinate. An appropriate coordinate system is one in which the direction along the line furnishes one coordinate (which happens to be irrelevant) and the other coordinates are perpendicular to it and to each other.


One way to define a line in $d$ dimensions is to provide $d-1$ mutually orthogonal unit vectors $\mathbf u_1, \ldots, \mathbf u_{d-1}$ and $d-1$ corresponding numbers $c_1, \ldots, c_{d-1}$. The simultaneous solution $\mathbf x\in\mathbb{R}^d$ of the $d-1$ equations

$$\mathcal{L}: \mathbf u_i^\prime \mathbf x = c_i\tag{1}$$

is a line and all lines can be described in this way.
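As an illustrative sketch (the helper name and the QR route are assumptions of mine, not part of the answer), such a $(\mathbf u_i, c_i)$ representation can be built from the more familiar point-plus-direction form: take any orthonormal basis of the complement of $\mathbf e$ and set $c_i = \mathbf u_i^\prime \mathbf m_0$.

```python
import numpy as np

def line_as_constraints(m0, e):
    """Represent the line {m0 + a e} as u_i' x = c_i, i = 1..d-1.

    Returns U (d x (d-1), orthonormal columns perpendicular to e)
    and c ((d-1,), with c_i = u_i' m0).
    """
    e = e / np.linalg.norm(e)
    d = e.shape[0]
    # QR of [e | I] yields an orthonormal basis whose first column spans e.
    Q, _ = np.linalg.qr(np.column_stack([e, np.eye(d)]))
    U = Q[:, 1:]          # the d-1 directions orthogonal to e
    return U, U.T @ m0
```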

The squared distance between any vector $\mathbf m_k$ and this line is the sum of squared differences between the products $\mathbf u_i^\prime \mathbf m_k$ and the constants $c_i$:

$$\delta(\mathbf m_k, \mathcal{L}) = \sum_{i=1}^{d-1} (\mathbf u_i^\prime \mathbf m_k - c_i)^2.$$
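A quick numerical check of this identity (purely illustrative; random data, names mine): the constraint-based sum matches the squared length of the residual left after projecting $\mathbf m_k - \mathbf m_0$ onto the line's direction.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
e = rng.standard_normal(d); e /= np.linalg.norm(e)
m0 = rng.standard_normal(d)

# Orthonormal complement of e and constants c_i, as in the sketch above.
Q, _ = np.linalg.qr(np.column_stack([e, np.eye(d)]))
U = Q[:, 1:]
c = U.T @ m0

m_k = rng.standard_normal(d)
diff = m_k - m0
perp = diff - (diff @ e) * e             # residual perpendicular to the line
delta = np.sum((U.T @ m_k - c) ** 2)     # constraint-based squared distance
assert np.isclose(perp @ perp, delta)
```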

When $\mathcal L$ is any line minimizing the sum of those squared distances to the $\mathbf m_k$, each $c_i$ must separately minimize the sum over $k$ of its corresponding terms. The result becomes obvious if we fix $i$ for the moment and rewrite this term in the form

$$y_k = \mathbf u_i^\prime \mathbf m_k,\tag{2}$$

for then we see that $c_i$ minimizes the sum $\sum_{k=1}^N (y_k - c_i)^2$. This is the very simplest version of an Ordinary Least Squares problem, involving just one parameter $c_i$.

It's easy to see that

$$c_i = \bar y = \frac{1}{N}\sum_{k=1}^N y_k\tag{3}$$

is the unique solution, because for any number $c$, the sum of squares

$$\sum_{k=1}^N (y_k - c)^2 = \sum_{k=1}^N (y_k - \bar y)^2 + N(c - \bar y)^2$$

(the cross term $2(\bar y - c)\sum_{k=1}^N (y_k - \bar y)$ vanishes because residuals from the mean sum to zero) depends on $c$ only through the second term, which obviously is minimized uniquely at $c = \bar y$.
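A short numeric sanity check of this decomposition (illustrative only): the identity holds for every $c$, and the second term is zero precisely at $c = \bar y$.

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.standard_normal(10)
y_bar = y.mean()

for c in (-1.0, 0.0, float(y_bar), 2.5):
    lhs = np.sum((y - c) ** 2)
    rhs = np.sum((y - y_bar) ** 2) + len(y) * (c - y_bar) ** 2
    assert np.isclose(lhs, rhs)    # the decomposition holds for any c
# Only the second term depends on c, and it vanishes at c = y_bar.
```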

Consider now the mean vector (or "barycenter")

$$\mathbf{\bar m} = \frac{1}{N}\sum_{k=1}^N \mathbf{m}_k.$$

To see whether it lies on $\mathcal L$, plug it in for $\mathbf x$ on the left hand side of $(1)$ and, for each $i$, use $(2)$ and $(3)$ to compute

$$\mathbf u_i^\prime \mathbf{\bar m} = \mathbf u_i^\prime \frac{1}{N}\sum_{k=1}^N \mathbf{m}_k = \frac{1}{N}\sum_{k=1}^N\mathbf u_i^\prime\mathbf{m}_k = \frac{1}{N}\sum_{k=1}^Ny_k = \bar y = c_i.$$

Thus $(1)$ holds, proving that the mean vector must lie on $\mathcal L$.
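Putting it all together as a numeric check (a sketch under my own assumptions: I take the optimal direction to be the top right singular vector of the centered data, the standard TLS recipe once $\mathbf m_0 = \bar{\mathbf m}$ is known): for any fixed direction, no choice of support point beats the mean.

```python
import numpy as np

rng = np.random.default_rng(2)
N, d = 50, 3
M = rng.standard_normal((N, d)) * np.array([3.0, 1.0, 0.3])  # anisotropic cloud
m_bar = M.mean(axis=0)

# Direction of the best-fit line: top right singular vector of centered data.
_, _, Vt = np.linalg.svd(M - m_bar)
e = Vt[0]

def J(m0):
    diffs = M - m0
    a = diffs @ e                        # a_k = e^T (m_k - m_0)
    return np.sum((np.outer(a, e) - diffs) ** 2)

# For this (indeed, any) fixed direction, the mean is never beaten.
for _ in range(100):
    assert J(m_bar) <= J(m_bar + rng.standard_normal(d)) + 1e-9
```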
