I once wrote it up in detail for my blog, but I never published it because I thought people would not find it interesting. But since you ask, here it is.
There are a lot of formulas, but nothing in them is difficult.
Scientific calculators all have a "linear regression" feature, where
you can put in a bunch of data and the calculator will tell you the
parameters of the straight line that forms the best fit to the data.
For example, suppose you have a bunch of data points plotted in the plane.
Linear regression will calculate that the data are approximated by the
line $3.06148942993613\cdot x + 6.56481566146906$ better than
by any other line.
When the calculator does it, you just put in the data values and out
pop the parameters of the line, its slope and its $y$-intercept.
This can seem rather mysterious. When you see the formulas for the
parameters of the line, they look even more mysterious. But it turns
out that it is rather easy to figure out.
We want to find the
line $y = mx + b$ that fits the data as well as
possible. The first thing we need to decide is how to measure the
goodness of the fit. If a data point is $(x, y)$ and the
line actually goes through $(x, y')$ instead, how do we
decide the extent to which that is a problem?
We need to pick some error-measuring function $\varepsilon(y, y')$ that
tells us the error we attribute to the line going through $y'$ instead
of $y$. Then we can do the calculation to figure out the line that has
the minimum total error.
The $\varepsilon$ function we choose must satisfy some criteria. For
example, $\varepsilon(y, y)$ should be 0 for all $y$: if
the line actually goes exactly through the data point, we should
attribute a 0 error to that. And $\varepsilon(y, y')$
should never be negative: an error of 0 is perfect, and you shouldn't
be able to do better than perfect.
The $\varepsilon$ function that the pocket calculators all use is
$$\varepsilon(y, y') = (y-y')^{2}.$$
This is for a few reasons. One is that it satisfies the important
criteria above, and some others. For example, it is symmetric: if the
line passes a certain distance above where the point actually is, that
has the same error as if it passes the same distance below where the
point actually is.
Also, this error function says that missing a data
point by three feet isn't three times as bad as missing it by one
foot; it's nine times as bad. It's okay to miss the point by a
little bit, but missing by a lot is unacceptable. This is usually
something like what we want.
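To see these properties concretely, here's the error function sketched in Python (the code throughout is my own illustration; the calculation itself doesn't depend on it):

```python
def epsilon(y, y_prime):
    """Error attributed to the line passing through y_prime instead of y."""
    return (y - y_prime) ** 2

print(epsilon(5, 5))                  # 0: a perfect hit costs nothing
print(epsilon(5, 8), epsilon(5, 2))   # 9 9: symmetric above and below
print(epsilon(0, 3) / epsilon(0, 1))  # 9.0: missing by 3 is nine times as bad
```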
But the real reason the calculators all use this error function is
that it's really easy to calculate the best-fitting line for this
particular definition of "best-fitting".
Let's do the calculation. Suppose that the data are the $N$ points
$(x_{1}, y_{1}), (x_{2}, y_{2}), (x_{3}, y_{3}), \ldots, (x_{N}, y_{N})$.
And we're going to try to approximate them with the line $y = mx + b$.
We want to find the values of $m$ and $b$ that minimize the total
error.
The total error is:
$$E(m, b) = \sum{\varepsilon(y_i, mx_i+b)}$$
That is, for each point $(x_{i}, y_{i})$, we take where it really is
(that's $y_{i}$) and where the line $y = mx + b$ predicts it should be
(that's $mx_{i} + b$) and calculate the error of the prediction. We
add up the error for each point that the line was supposed to predict,
and that's the total error for the line.
Since $\varepsilon(y, y') = (y-y')^{2}$, the total error is:
$$E(m, b) = \sum{(y_i - mx_i - b)^2}$$
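In code this is a one-liner; here's a Python sketch that the later checks will reuse:

```python
def total_error(m, b, points):
    """E(m, b): total squared vertical error of the line y = m*x + b."""
    return sum((y - (m * x + b)) ** 2 for x, y in points)
```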
Or, grinding out the algebra:
$$E(m, b) = \sum{(y_i^2 + m^2x_i^2 + b^2 - 2mx_iy_i - 2by_i + 2bmx_i)}$$
We can add up each of the six kinds of components separately and then
total them all at the end:
$$E(m, b) = \sum{y_i^2} + \sum{m^2x_i^2} + \sum{b^2} - \sum{2mx_iy_i} - \sum{2by_i} + \sum{2bmx_i}$$
We're going to need to spend a lot of time talking about things like
$\sum x_iy_i$, so let's make some
abbreviations:
$$\begin{array}{lcl}
\mathcal X &= & \sum x_i \cr
\mathcal Y &= & \sum y_i \cr
\mathcal A &= & \sum x_i^2 \cr
\mathcal B &= & \sum x_iy_i \cr
\mathcal C &= & \sum y_i^2 \cr
\end{array}$$
Also please recall
that $N$ is the total number of points.
With these abbreviations, we can write the total error
as:
$$E(m, b) = {\cal C} + m^2{\cal A} + b^2N - 2m{\cal B} - 2b{\cal Y} + 2bm{\cal X}$$
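It's easy to spot-check this expansion numerically. Here's a quick check in Python, using the total_error function from above and some made-up points:

```python
points = [(1.0, 9.5), (2.0, 12.8), (3.0, 15.7), (4.0, 18.2)]  # made-up data

N = len(points)
X = sum(x for x, y in points)
Y = sum(y for x, y in points)
A = sum(x * x for x, y in points)
B = sum(x * y for x, y in points)
C = sum(y * y for x, y in points)

m, b = 3.0, 6.5  # any trial line will do
expanded = C + m*m*A + b*b*N - 2*m*B - 2*b*Y + 2*b*m*X
assert abs(total_error(m, b, points) - expanded) < 1e-9
```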
Now, remember that we would like to find the values of $m$ and
$b$ that result in the minimum error. That is, we would like
to minimize $E$ with respect to $m$ and $b$. But that
is just a straightforward calculus problem.
We want to find the value of $m$ that minimizes $E$, so we take
the derivative of $E$ with respect to $m$:
$\def\d#1{{\partial E\over \partial #1}}$
$$\d m = 2m{\cal A} - 2{\cal B} + 2b{\cal X}$$
And we know that if there is a
minimum value for $E$, it will occur when $\d m = 0$:
$$2m{\cal A} - 2{\cal B} + 2b{\cal X} = 0\qquad{(*)}$$
And similarly, $\d b$ will also have to be 0:
$$\d b = 2bN - 2{\cal Y} + 2m{\cal X} = 0\qquad{(**)}$$
So we have two equations (* and **) in two unknowns ($m$ and $b$) and
we can solve them. First we'll solve the second equation for
$b$:
$$b = {{\cal Y} - m{\cal X}\over N}$$
Now we'll solve the first equation for $m$:
$$\eqalign{
m{\cal A} - {\cal B} + b{\cal X} & = & 0 \cr
m{\cal A} - {\cal B} + {{\cal Y} - m{\cal X}\over N}{\cal X} & = & 0 \cr
m{\cal A} - {\cal B} + {{\cal XY} - m{\cal X}^2\over N} & = & 0 \cr
mN{\cal A} - N{\cal B} + {\cal XY} - m{\cal X}^2 & = & 0 \cr
m(N{\cal A} - {\cal X}^2 ) & = & N{\cal B} - {\cal XY} \cr
}$$
Thus:
$$m = { N{\cal B} - {\cal XY} \over N{\cal A} - {\cal X}^2 }$$
And that's it. We can get $m$ from $N$, $\cal A$, $\cal B$,
$\cal X$, and $\cal Y$, and we can get $b$ from $m$,
$N$, $\cal X$, and
$\cal Y$.
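Translated directly into Python (a sketch, not whatever the calculators actually run):

```python
def fit_line(points):
    """Least-squares fit: return (m, b) for the best line y = m*x + b."""
    N = len(points)
    X = sum(x for x, y in points)
    Y = sum(y for x, y in points)
    A = sum(x * x for x, y in points)
    B = sum(x * y for x, y in points)
    m = (N * B - X * Y) / (N * A - X * X)
    b = (Y - m * X) / N
    return m, b

print(fit_line(points))   # roughly (2.9, 6.8) for the made-up points above
```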
$\def\dd#1{{\partial^2 E\over \partial {#1}^2}}$
(Actually, we also need to check the second derivatives to make sure
we haven't found the values of $m$ and $b$ that give the
maximum error. But it should be clear from the geometry of the
thing that it is impossible that there could be a very-worst line: No
matter how badly the data are approximated by any given line, you
could always find another line that was worse, just by taking the bad
line and moving it another few miles away from the data. But we
should take the second derivatives anyway, if for no other reason than
to rule out the possibility of a saddle point in the $E$
function. The second derivatives are
$\dd m = 2{\cal A}$ and
$\dd b = 2N$, which
are both positive, and the mixed second derivative is
${\partial^2 E\over \partial m\,\partial b} = 2{\cal X}$, so the
Hessian determinant is $\dd m \dd b - (2{\cal X})^2 =
4(N{\cal A} - {\cal X}^2)$, which by the Cauchy–Schwarz inequality is
positive whenever the $x_i$ are not all equal, so we really have found
a minimum for the error function.)
Now let's do some checking, and make sure it works and makes sense.
First, it should fail if $N=1$, because there are plenty of lines
that all go through any single point. And indeed, when there's only
one point, say $(x, y)$, then ${\cal X}^{2} = x^{2}$ and
$N{\cal A} = x^{2}$, so the denominator $N{\cal A} - {\cal X}^2$ of
the expression for $m$ vanishes and the whole thing blows up.
It should also fail if all the $x$ values are equal, because then
the line through the data is vertical, and has an infinite slope. And sure
enough, if all the $x$ values are equal, then ${\cal X} = Nx$,
${\cal X}^{2} = N^{2}x^{2}$, ${\cal A} = Nx^{2}$, and the denominator
of the expression for $m$ vanishes.
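Both degenerate cases show up in the Python fit_line sketch as a literal division by zero:

```python
for bad in ([(2.0, 5.0)],                           # a single point
            [(2.0, 1.0), (2.0, 3.0), (2.0, 7.0)]):  # all x values equal
    try:
        fit_line(bad)
    except ZeroDivisionError:
        print("no unique line for", bad)
```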
Now what if all the points happen to lie on a (non-vertical) straight
line? Then there are some numbers $p$ and $q$ so that
$y_{i} = px_{i} + q$ for all $i$. We hope that the equations for
$m$ and $b$ will give $m = p$ and $b = q$. If they don't, we're in
trouble, since the equations are supposed to find the best line
through the points, and the points actually lie on the line
$y = px + q$; this line will therefore have a total
error of 0, and that's the best we can do. Well, let's see how it
works out.
In such a case, we have
${\cal Y} = \sum {y_i} = \sum {(px_i+q)} = p{\cal X} + qN$.
Similarly, we also have:
$$\eqalign{
{\cal B} & = & \sum{x_iy_i} \cr
& = & \sum{x_i(px_i+q)} \cr
& = & \sum{px_i^2} + \sum{qx_i} \cr
& = & p{\cal A} + q{\cal X} \cr
}$$
Plugging
these into the formula for $m$ above we get:
$$\eqalign{
m & = & { N{\cal B} - {\cal XY} \over N{\cal A} - {\cal X}^2 } \cr
& = & { N(p{\cal A} + q{\cal X}) - {\cal X}(p{\cal X} + qN) \over N{\cal A} - {\cal X}^2 } \cr
& = & { pN{\cal A} + qN{\cal X} - p{\cal X}^2 - qN{\cal X} \over N{\cal A} - {\cal X}^2 } \cr
& = & { p(N{\cal A} - {\cal X}^2) \over N{\cal A} - {\cal X}^2 } \cr
& = & p \cr
}$$
as hoped, and then plugging into the formula for $b$, we get
$b = ({\cal Y} - p{\cal X})/N = (p{\cal X} + qN - p{\cal X})/N = q$,
also as hoped.
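The same check goes through numerically with the fit_line sketch from before; here the points are manufactured to lie exactly on $y = 2x - 3$:

```python
p, q = 2.0, -3.0
collinear = [(x, p * x + q) for x in [0.0, 1.0, 2.0, 5.0]]
print(fit_line(collinear))   # (2.0, -3.0), up to floating-point rounding
```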
It's handy to be able to fit a line to any data set. A while back I
wrote a trivial little program to read data points from the standard input, do the least-squares analysis, and print out $m$ and $b$. Little utilities like this often come in handy.
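The original program isn't reproduced here, but a minimal version of it might look something like this (in Python; one "x y" pair per line of input, using the formulas derived above):

```python
#!/usr/bin/env python3
# linreg.py: read "x y" pairs from standard input, print slope m and intercept b.
import sys

pts = [tuple(map(float, line.split())) for line in sys.stdin if line.strip()]
N = len(pts)
X = sum(x for x, y in pts)
Y = sum(y for x, y in pts)
A = sum(x * x for x, y in pts)
B = sum(x * y for x, y in pts)
m = (N * B - X * Y) / (N * A - X * X)
b = (Y - m * X) / N
print(f"m = {m}  b = {b}")
```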
Best Answer
Either cost function is fine. Since they are just constant multiples of each other, minimising one is equivalent to minimising the other (it will result in the same fitted $\beta$ coefficients).
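One way to see this: if $E_2(m, b) = cE_1(m, b)$ for some constant $c > 0$ (say $c = \frac12$ or $c = \frac1N$), then
$${\partial E_2\over\partial m} = c\,{\partial E_1\over\partial m}, \qquad {\partial E_2\over\partial b} = c\,{\partial E_1\over\partial b},$$
so the derivatives vanish at exactly the same $(m, b)$, and both cost functions are minimised by the same line.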