Solved – Supervised learning: How did they find the cost function to minimize?

regression, supervised learning

I'm studying a video tutorial about supervised learning; more specifically, it's about "linear regression with one variable" and its cost function.

So my first question is: is this "cost function" the same thing as "linear regression with one variable", or is the cost function just one part of linear regression with one variable?

Well, having this as the regression function:

$$h_\theta(x)=\theta_0+\theta_1x$$

we need to find the minimum values for $\theta_0$ and $\theta_1$, where $h$ stands for hypothesis. So my second question is: why should $\theta_0$ and $\theta_1$ have minimum values? I know that with many training examples we need some kind of approximation… but that didn't help me much.

Finally, to find the solution (the minimum values of $\theta_0$ and $\theta_1$), we need to minimize $J(\theta_0,\theta_1)$, which is the cost function and whose expression is below:

$$J(\theta_0,\theta_1) = \frac{1}{2m} \sum_{i=1}^m (h_\theta(x^{(i)})-y^{(i)})^2$$

$$\underset{\theta_0,\theta_1}{\text{minimize}}\;\underbrace{J(\theta_0,\theta_1)}_{\text{Cost function}}$$

where $m$ is the number of training examples and $i$ indexes the examples in the training set. So my third question is: how do we arrive at this expression, and can anybody explain it to me?
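For what it's worth, here is how I read the formula in code (a rough Python/NumPy sketch with a made-up toy training set; the video itself doesn't use Python, and the names are my own):

```python
import numpy as np

def h(theta0, theta1, x):
    """Hypothesis: a straight line with intercept theta0 and slope theta1."""
    return theta0 + theta1 * x

def J(theta0, theta1, x, y):
    """Cost: sum of squared residuals over the training set, scaled by 1/(2m)."""
    m = len(x)  # m = number of training examples
    residuals = h(theta0, theta1, x) - y
    return np.sum(residuals ** 2) / (2 * m)

# Toy training set (made up for illustration)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 1.9, 3.2, 3.9])

print(J(0.0, 1.0, x, y))  # a line close to the data -> small cost
print(J(0.0, 3.0, x, y))  # a badly wrong slope      -> much larger cost
```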

Best Answer

You seem to be muddling different things together.

$h$ is NOT the cost function. $h$ is the object you're trying to fit (estimate parameter values for). It is $h$ that's the linear regression function. You fit it by choosing some cost function $J$, to measure (for a given $\theta_0$,$\theta_1$) how well or badly you fit the data relative to other values for $h_\theta$ ($J$ is big when it's a bad fit and small when it's a good fit). Minimizing the cost means you have the 'least costly' fit (best fit to the data by your cost criterion). You minimize $J$ to get $h$ 'close' to the data.

Now $\theta_0$ and $\theta_1$ don't have "minimum values", they have values that minimize $J$ -- it's $J$ that's at a minimum, not the $\theta$'s. The parameter estimates are at the argmin.

So you choose some $J$ that measures the overall 'badness of fit' - some measure of how far the data is from the given $h$.

The $J$ you have is the sum of squares of differences between $h$ (the line) and $y$ (the data). As you can see, it gets bigger when the fit is worse. It turns out to be a particularly convenient choice, as well as often satisfying people's notions of how a cost function should look.

The expression is just one way (though not the usual way for most of us; most statisticians would use a different notation) to write that sum of squares. Since $J$ is the sum of squares of residuals, choosing $h$ to minimize $J$ makes the fitted $h$ the least squares regression line.
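To make that concrete, here is a small sketch (Python with NumPy/SciPy and made-up toy data, not from the original answer): numerically finding the argmin of $J$ with a general-purpose optimizer gives essentially the same line as an off-the-shelf least squares fit.

```python
import numpy as np
from scipy.optimize import minimize

# Toy data (made up for illustration)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

def J(theta, x, y):
    """Sum-of-squares cost from the question, with theta = (theta0, theta1)."""
    residuals = (theta[0] + theta[1] * x) - y
    return np.sum(residuals ** 2) / (2 * len(x))

# Numerically find the (theta0, theta1) that minimize J ...
result = minimize(J, x0=np.zeros(2), args=(x, y))

# ... and compare with the closed-form least squares line.
slope, intercept = np.polyfit(x, y, deg=1)

print(result.x)          # argmin of J
print(intercept, slope)  # least squares fit: essentially the same values
```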

Other choices of $J$ are definitely possible; see, for example, $L_1$ (least absolute values) regression, or regression based on M-estimators, for some alternatives.
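As a rough illustration of one such alternative (again a hypothetical Python sketch with made-up data): replacing the squared residuals with absolute residuals in the same minimization gives an $L_1$ fit, which is far less influenced by a single outlier.

```python
import numpy as np
from scipy.optimize import minimize

# Toy data with one outlier at x = 5 to show the difference
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 20.0])

def J_l2(theta):
    """Squared-error cost (least squares)."""
    return np.sum((theta[0] + theta[1] * x - y) ** 2)

def J_l1(theta):
    """Absolute-error cost (L1 / least absolute values)."""
    return np.sum(np.abs(theta[0] + theta[1] * x - y))

print(minimize(J_l2, np.zeros(2)).x)                        # line pulled toward the outlier
print(minimize(J_l1, np.zeros(2), method="Nelder-Mead").x)  # much less affected by it
```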
