[Math] Simple explanation of why the line of best fit passes through $(\bar x, \bar y)$

linear regression, statistics

Is there a clear explanation someone can give an undergrad as to why the line of best fit in a linear model must always pass through the point $(\bar{x}, \bar{y})$, the mean of the $x$ and $y$ values, and why the sum of the residuals must equal $0$?

It is a hard concept I have not been able to grasp…

Thank you.

Best Answer

You need a bit of undergraduate calculus to understand this.

Let $y=mx+b$ denote the equation of the line which minimizes

$$ S=\sum_{k=1}^n[y_k-(mx_k+b)]^2$$

Then $S$ is a second-degree polynomial in the two variables $m$ and $b$, so the partial derivatives of $S$ with respect to both $m$ and $b$ must be zero at any extremum, in particular at the minimum.
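Expanding the square makes this explicit: $S$ is quadratic in $m$ and $b$, with coefficients built from the data,

$$ S=\sum_{k=1}^n y_k^2-2m\sum_{k=1}^n x_k y_k-2b\sum_{k=1}^n y_k+m^2\sum_{k=1}^n x_k^2+2mb\sum_{k=1}^n x_k+nb^2. $$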

Taking $\dfrac{\partial S}{\partial b}$ and setting it equal to zero, we get

\begin{eqnarray} \frac{\partial S}{\partial b}&=&-2\sum_{k=1}^n[y_k-(mx_k+b)]\\ &=&0 \end{eqnarray}

Note that this already shows the sum of the residuals, $\sum_{k=1}^n[y_k-(mx_k+b)]$, equals $0$, which is the second fact you asked about. Rearranging,

\begin{eqnarray} \sum_{k=1}^ny_k&=&m\sum_{k=1}^nx_k+\sum_{k=1}^nb\\ \sum_{k=1}^ny_k&=&m\sum_{k=1}^nx_k+nb\\ \frac{1}{n}\sum_{k=1}^ny_k&=&\frac{m}{n}\sum_{k=1}^nx_k+b\\ \bar{y}&=&m\bar{x}+b \end{eqnarray}
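In other words, the point $(\bar{x},\bar{y})$ satisfies the equation of the fitted line, so the least-squares line always passes through it. As a quick numerical sanity check (not part of the original answer), here is a minimal sketch using NumPy's `polyfit` on made-up data; the data values are arbitrary and only serve to illustrate the two identities.

```python
import numpy as np

# Arbitrary example data (any data set works for this check)
x = np.array([1.0, 2.0, 4.0, 7.0, 9.0])
y = np.array([2.3, 4.1, 7.8, 13.2, 18.0])

# Least-squares slope m and intercept b for y ≈ m*x + b
m, b = np.polyfit(x, y, 1)

x_bar, y_bar = x.mean(), y.mean()
residuals = y - (m * x + b)

print(np.isclose(m * x_bar + b, y_bar))   # True: the line passes through (x̄, ȳ)
print(np.isclose(residuals.sum(), 0.0))   # True: the residuals sum to zero
```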
