I once wrote it up in detail for my blog, but I never published it because I thought people would not find it interesting. But since you ask, here it is.
There are a lot of formulas, but nothing in them is difficult.
Scientific calculators all have a "linear regression" feature, where
you can put in a bunch of data and the calculator will tell you the
parameters of the straight line that forms the best fit to the data.
For example, suppose you have a bunch of data points plotted in the plane.
Linear regression will calculate that the data are approximated by the
line $3.06148942993613\cdot x + 6.56481566146906$ better than
by any other line.
When the calculator does it, you just put in the data values and out
pop the parameters of the line, its slope and its $y$-intercept.
This can seem rather mysterious. When you see the formulas for the
parameters of the line, they look even more mysterious. But it turns
out that it is rather easy to figure out.
We want to find the
line $y = mx + b$ that fits the data as well as
possible. The first thing we need to decide is how to measure the
goodness of the fit. If a data point is $(x, y)$ and the
line actually goes through $(x, y')$ instead, how do we
decide the extent to which that is a problem?
We need to pick some error-measuring function $\varepsilon(y, y')$ that
tells us the error we attribute to the line going through $y'$ instead
of $y$. Then we can do the calculation to figure out the line that has
the minimum total error.
The $\varepsilon$ function we choose must satisfy some criteria. For
example, $\varepsilon(y, y)$ should be 0 for all $y$: if
the line actually goes exactly through the data point, we should
attribute a 0 error to that. And $\varepsilon(y, y')$
should never be negative: an error of 0 is perfect, and you shouldn't
be able to do better than perfect.
The $\varepsilon$ function that the pocket calculators all use is
$$\varepsilon(y, y') = (y-y')^{2}.$$
This is for a few reasons. One is that it satisfies the important
criteria above, and some others. For example, it is symmetric: if the
line passes a certain distance above where the point actually is, that
has the same error as if it passes the same distance below where the
point actually is.
Also, this error function says that missing a data
point by three feet isn't three times as bad as missing it by one
foot; it's nine times as bad. It's okay to miss the point by a
little bit, but missing by a lot is unacceptable. This is usually
something like what we want.
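To see these properties concretely, here's the error function sketched in Python (the code throughout is my own illustration; the calculation itself doesn't depend on it):

```python
def epsilon(y, y_prime):
    """Error attributed to the line passing through y_prime instead of y."""
    return (y - y_prime) ** 2

print(epsilon(5, 5))                  # 0: a perfect hit costs nothing
print(epsilon(5, 8), epsilon(5, 2))   # 9 9: symmetric above and below
print(epsilon(0, 3) / epsilon(0, 1))  # 9.0: missing by 3 is nine times as bad
```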
But the real reason the calculators all use this error function is
that it's really easy to calculate the best-fitting line for this
particular definition of "best-fitting".
Let's do the calculation. Suppose that the data are the $N$ points
$(x_{1}, y_{1}), (x_{2}, y_{2}), (x_{3}, y_{3}), \ldots, (x_{N}, y_{N})$.
And we're going to try to approximate them with the line $y = mx + b$.
We want to find the values of $m$ and $b$ that minimize the total
error.
The total error is:
$$E(m, b) = \sum{\varepsilon(y_i, mx_i+b)}$$
That is, for each point $(x_{i}, y_{i})$, we take where it really is
(that's $y_{i}$) and where the line $y = mx + b$ predicts it should be
(that's $mx_{i} + b$) and calculate the error of the prediction. We
add up the error for each point that the line was supposed to predict,
and that's the total error for the line.
Since $\varepsilon(y, y') = (y-y')^{2}$, the total error is:
$$E(m, b) = \sum{(y_i - mx_i - b)^2}$$
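In code this is a one-liner; here's a Python sketch that the later checks will reuse:

```python
def total_error(m, b, points):
    """E(m, b): total squared vertical error of the line y = m*x + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in points)
```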
Or, grinding out the algebra:
$$E(m, b) = \sum{(y_i^2 + m^2x_i^2 + b^2 - 2mx_iy_i - 2by_i + 2bmx_i)}$$
We can add up each of the six kinds of components separately and then
total them all at the end:
$$E(m, b) = \sum{y_i^2} + \sum{m^2x_i^2} + \sum{b^2} - \sum{2mx_iy_i} - \sum{2by_i} + \sum{2bmx_i}$$
We're going to need to spend a lot of time talking about things like
$\sum x_iy_i$, so let's make some
abbreviations:
$$\begin{array}{lcl}
\mathcal X &= & \sum x_i \cr
\mathcal Y &= & \sum y_i \cr
\mathcal A &= & \sum x_i^2 \cr
\mathcal B &= & \sum x_iy_i \cr
\mathcal C &= & \sum y_i^2 \cr
\end{array}$$
Also please recall
that $N$ is the total number of points.
With these abbreviations, we can write the total error
as:
$$E(m, b) = {\cal C} + m^2{\cal A} + b^2N - 2m{\cal B} - 2b{\cal Y} + 2bm{\cal X}$$
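It's easy to spot-check this expansion numerically. Here's a quick check in Python, using the total_error function from above and some made-up points:

```python
points = [(1.0, 9.5), (2.0, 12.8), (3.0, 15.7), (4.0, 18.2)]  # made-up data

N = len(points)
X = sum(x for x, y in points)
Y = sum(y for x, y in points)
A = sum(x * x for x, y in points)
B = sum(x * y for x, y in points)
C = sum(y * y for x, y in points)

m, b = 3.0, 6.5  # any trial line will do
expanded = C + m*m*A + b*b*N - 2*m*B - 2*b*Y + 2*b*m*X
assert abs(total_error(m, b, points) - expanded) < 1e-9
```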
Now, remember that we would like to find the values of $m$ and
$b$ that result in the minimum error. That is, we would like
to minimize $E$ with respect to $m$ and $b$. But that
is just a straightforward calculus problem.
We want to find the value of $m$ that minimizes $E$, so we take
the derivative of $E$ with respect to $m$:
$\def\d#1{{\partial E\over \partial #1}}$
$$\d m = 2m{\cal A} - 2{\cal B} + 2b{\cal X}$$
And we know that if there is a
minimum value for $E$, it will occur when $\d m = 0$:
$$2m{\cal A} - 2{\cal B} + 2b{\cal X} = 0\qquad{(*)}$$
And similarly, $\d b$ will also have to be 0:
$$\d b = 2bN - 2{\cal Y} + 2m{\cal X} = 0\qquad{(**)}$$
So we have two equations (* and **) in two unknowns ($m$ and $b$) and
we can solve them. First we'll solve the second equation for
$b$:
$$b = {{\cal Y} - m{\cal X}\over N}$$
Now we'll solve the first equation for $m$:
$$\eqalign{
m{\cal A} - {\cal B} + b{\cal X} & = & 0 \cr
m{\cal A} - {\cal B} + {{\cal Y} - m{\cal X}\over N}{\cal X} & = & 0 \cr
m{\cal A} - {\cal B} + {{\cal XY} - m{\cal X}^2\over N} & = & 0 \cr
mN{\cal A} - N{\cal B} + {\cal XY} - m{\cal X}^2 & = & 0 \cr
m(N{\cal A} - {\cal X}^2 ) & = & N{\cal B} - {\cal XY} \cr
}$$
Thus:
$$m = { N{\cal B} - {\cal XY} \over N{\cal A} - {\cal X}^2 }$$
And that's it. We can get $m$ from $N$, $\cal A$, $\cal B$,
$\cal X$, and $\cal Y$, and we can get $b$ from $m$,
$N$, $\cal X$, and
$\cal Y$.
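Translated directly into Python (a sketch, not whatever the calculators actually run):

```python
def fit_line(points):
    """Least-squares fit: return (m, b) for the best line y = m*x + b."""
    N = len(points)
    X = sum(x for x, y in points)
    Y = sum(y for x, y in points)
    A = sum(x * x for x, y in points)
    B = sum(x * y for x, y in points)
    m = (N * B - X * Y) / (N * A - X * X)
    b = (Y - m * X) / N
    return m, b

print(fit_line(points))   # roughly (2.9, 6.8) for the made-up points above
```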
$\def\dd#1{{\partial^2 E\over \partial {#1}^2}}$
(Actually, we also need to check the second derivatives to make sure
we haven't found the values of $m$ and $b$ that give the
maximum error. But it should be clear from the geometry of the
thing that it is impossible that there could be a very-worst line: No
matter how badly the data are approximated by any given line, you
could always find another line that was worse, just by taking the bad
line and moving it another few miles away from the data. But we
should take the second derivatives anyway, if for no other reason than
to rule out the possibility of a saddle point in the $E$
function. The second derivatives are
$\dd m = 2{\cal A}$ and
$\dd b = 2N$, which
are both positive, and the mixed second derivative is
${\partial^2 E\over \partial m\,\partial b} = 2{\cal X}$, so the
Hessian determinant is $\dd m \dd b - (2{\cal X})^2 =
4(N{\cal A} - {\cal X}^2)$, which by the Cauchy–Schwarz inequality is
positive whenever the $x_i$ are not all equal, so we really have found
a minimum for the error function.)
Now let's do some checking, and make sure it works and makes sense.
First, it should fail if $N=1$, because there are plenty of lines
that all go through any single point. And indeed, when there's only
one point, say $(x, y)$, then ${\cal X}^{2} = x^{2}$ and
$N{\cal A} = x^{2}$, so the denominator $N{\cal A} - {\cal X}^2$ of
the expression for $m$ vanishes and the whole thing blows up.
It should also fail if all the $x$ values are equal, because then
the line through the data is vertical, and has an infinite slope. And sure
enough, if all the $x$ values are equal, then ${\cal X} = Nx$,
${\cal X}^{2} = N^{2}x^{2}$, ${\cal A} = Nx^{2}$, and the denominator
of the expression for $m$ vanishes.
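Both degenerate cases show up in the Python fit_line sketch as a literal division by zero:

```python
for bad in ([(2.0, 5.0)],                           # a single point
            [(2.0, 1.0), (2.0, 3.0), (2.0, 7.0)]):  # all x values equal
    try:
        fit_line(bad)
    except ZeroDivisionError:
        print("no unique line for", bad)
```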
Now what if all the points happen to lie on a (non-vertical) straight
line? Then there are some numbers $p$ and $q$ so that
$y_{i} = px_{i} + q$ for all $i$. We hope that the equations for
$m$ and $b$ will give $m = p$ and $b = q$. If they don't, we're in
trouble, since the equations are supposed to find the best line
through the points, and the points actually lie on the line
$y = px + q$; this line will therefore have a total
error of 0, and that's the best we can do. Well, let's see how it
works out.
In such a case, we have
${\cal Y} = \sum {y_i} = \sum {(px_i+q)} = p{\cal X} + qN$.
Similarly, we also have:
$$\eqalign{
{\cal B} & = & \sum{x_iy_i} \cr
& = & \sum{x_i(px_i+q)} \cr
& = & \sum{px_i^2} + \sum{qx_i} \cr
& = & p{\cal A} + q{\cal X} \cr
}$$
Plugging
these into the formula for $m$ above we get:
$$\eqalign{
m & = & { N{\cal B} - {\cal XY} \over N{\cal A} - {\cal X}^2 } \cr
& = & { N(p{\cal A} + q{\cal X}) - {\cal X}(p{\cal X} + qN) \over N{\cal A} - {\cal X}^2 } \cr
& = & { pN{\cal A} + qN{\cal X} - p{\cal X}^2 - qN{\cal X} \over N{\cal A} - {\cal X}^2 } \cr
& = & { p(N{\cal A} - {\cal X}^2) \over N{\cal A} - {\cal X}^2 } \cr
& = & p \cr
}$$
as hoped, and then plugging into the formula for $b$, we get
$b = ({\cal Y} - p{\cal X})/N = (p{\cal X} + qN - p{\cal X})/N = q$,
also as hoped.
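The same check goes through numerically with the fit_line sketch from before; here the points are manufactured to lie exactly on $y = 2x - 3$:

```python
p, q = 2.0, -3.0
collinear = [(x, p * x + q) for x in [0.0, 1.0, 2.0, 5.0]]
print(fit_line(collinear))   # (2.0, -3.0), up to floating-point rounding
```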
It's handy to be able to fit a line to any data set. A while back I
wrote a trivial little program to read data points from the standard input, do the least-squares analysis, and print out $m$ and $b$. Little utilities like this often come in handy.
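The original program isn't reproduced here, but a minimal version of it might look something like this (in Python; one "x y" pair per line of input, using the formulas derived above):

```python
#!/usr/bin/env python3
# linreg.py: read "x y" pairs from standard input, print slope m and intercept b.
import sys

pts = [tuple(map(float, line.split())) for line in sys.stdin if line.strip()]
N = len(pts)
X = sum(x for x, y in pts)
Y = sum(y for x, y in pts)
A = sum(x * x for x, y in pts)
B = sum(x * y for x, y in pts)
m = (N * B - X * Y) / (N * A - X * X)
b = (Y - m * X) / N
print(f"m = {m}  b = {b}")
```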
Best Answer
Either cost function is fine. Since they are just constant multiples of each other, minimising one is equivalent to minimising the other (it will result in the same fitted $\beta$ coefficients).
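One way to see this: if $E_2(m, b) = cE_1(m, b)$ for some constant $c > 0$ (say $c = \frac12$ or $c = \frac1N$), then
$${\partial E_2\over\partial m} = c\,{\partial E_1\over\partial m}, \qquad {\partial E_2\over\partial b} = c\,{\partial E_1\over\partial b},$$
so the derivatives vanish at exactly the same $(m, b)$, and both cost functions are minimised by the same line.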