[Math] Simple Least-Squares Regression Question

regression statistics

Given a set of 5 points (e.g. (1, 3), (2, 8), etc.), how can I get just the slope of the best-fit line?

I've been looking up least squares regression, but I'm rather statistics ignorant and don't understand most of the terminology and math behind it. Can anyone explain it a bit more simply?

Best Answer

By best-fit line, I presume you mean the least-squares fit. The least-squares fit line for the given data $\{ (x_i, y_i) \}_{i=1}^n$ is, by definition, simply the line $\ell_{a,b}$ with equation $y = a+bx$ that minimizes the squared error: $$ Q(a,b) := \sum_{i=1}^n (y_i - a - bx_i)^2. $$ Notice that the quantity $|y_i - a - bx_i|$ measures the vertical deviation of the point $(x_i, y_i)$ from the line. "Squared error" refers to the fact that we are summing, over the $n$ data points, the squares of these deviations. [Another reasonable choice would be to minimize the sum of absolute errors $\sum\limits_{i=1}^n |y_i - a - bx_i|$, but least squares has the advantage that the minimizer is easy to compute analytically*.]
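As a concrete illustration of the objective $Q(a,b)$, here is a minimal Python sketch (the function name and the sample line are my own choices, not part of the answer):

```python
def squared_error(points, a, b):
    """Q(a, b): sum of squared vertical deviations of the points
    from the line y = a + b*x."""
    return sum((y - a - b * x) ** 2 for x, y in points)

# The line y = -2 + 5x passes exactly through (1, 3) and (2, 8),
# so its squared error on those two points is 0.
print(squared_error([(1, 3), (2, 8)], a=-2, b=5))  # 0
```

Any other choice of $a$ and $b$ gives a strictly larger value of `squared_error` on this data, which is exactly what "best fit" means here.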

To find the line $\ell_{a,b}$ that minimizes $Q$, we resort to calculus. Taking partial derivatives of $Q$ w.r.t. $a$ and $b$, we get: $$ \begin{eqnarray*} \frac{\partial Q}{\partial a} &=& \sum_{i=1}^n 2 (a + bx_i - y_i) = 2an + 2b \sum_i x_i - 2\sum_i y_i. \\ \frac{\partial Q}{\partial b} &=& \sum_{i=1}^n 2 (a + bx_i - y_i) x_i = 2a \sum_i x_i + 2b \sum_i x_i^2 - 2\sum_i x_i y_i. \end{eqnarray*} $$ Setting both partial derivatives to $0$ gives a pair of linear equations in $a$ and $b$ (the normal equations). Solving them yields $$ b = \frac{n\sum_i x_i y_i - \sum_i x_i \sum_i y_i}{n\sum_i x_i^2 - \left(\sum_i x_i\right)^2}, \qquad a = \bar y - b \bar x, $$ where $\bar x$ and $\bar y$ are the means of the $x_i$ and $y_i$. The formula for $b$ is the slope you are after.
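The derivation above can be turned directly into code. Below is a short Python sketch that computes the slope and intercept from the normal equations; the function name and the last three sample points are hypothetical (only (1, 3) and (2, 8) appear in the question):

```python
def slope_intercept(points):
    """Least-squares slope b and intercept a of y = a + b*x,
    computed from the normal equations."""
    n = len(points)
    sx = sum(x for x, _ in points)        # sum of x_i
    sy = sum(y for _, y in points)        # sum of y_i
    sxx = sum(x * x for x, _ in points)   # sum of x_i^2
    sxy = sum(x * y for x, y in points)   # sum of x_i * y_i
    b = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    a = (sy - b * sx) / n                 # a = mean(y) - b * mean(x)
    return a, b

# Hypothetical 5-point data set; the first two points are from the question.
points = [(1, 3), (2, 8), (3, 11), (4, 15), (5, 21)]
a, b = slope_intercept(points)
print(b)  # 4.3 — the slope of the best-fit line for this data
```

Note that the denominator $n\sum_i x_i^2 - (\sum_i x_i)^2$ is zero only when all $x_i$ are equal, in which case the best-fit line is vertical and the slope is undefined.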


*EDIT: Added the qualification "analytically". See the comments under guy's answer for more on this.