[Math] The cost function derivation in Andrew Ng's machine learning course

linear regression, machine learning

Can you please explain how Andrew Ng came up with the formula for the cost function?

$$J(\theta_0,\theta_1) = \frac{1}{2m}\sum_{i=1}^{m}\left(H_\theta(x^i)-y^i\right)^2$$

AFAIK the square is being taken to handle the negative values, since $y^2 = H_\theta(x)^2$ is the same as $y = H_\theta(x)$ after removing the square from both sides. Is my understanding right? But why is the sum taken over all the samples in the training set?

I have gone through the linked question "Help understanding machine learning cost function".

But I am still unable to understand the need to take the sum of the squares and then divide by $2m$.

Kindly help me

Best Answer

Let me see if the following helps. First, remember:

  • $H_\theta$ is the model that we are trying to learn (from your tag, a linear function; e.g. in 1D, $\theta=(\theta_0,\theta_1)$ and $H_\theta(x) = \theta_0 + \theta_1 x$). In order to "learn" (or even define) the function $H_\theta$, we need to find $\theta$.
  • Call $D=\{(x^i,y^i)\;\forall\;i\in[1,m]\}$ our training dataset. What this means is that, for each $x^i$, we want our learned function to produce something close to $y^i$. In other words, we want $H_\theta(x^i)$ to be close to $y^i$ (there is a tiny numerical sketch of this setup right after this list).
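
To make these two ingredients concrete, here is a minimal NumPy sketch; the data values and the guess for $\theta$ are made up purely for illustration:

```python
import numpy as np

# Toy training set D = {(x^i, y^i) : i = 1..m}; the numbers are invented for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0])   # inputs x^i
y = np.array([2.1, 3.9, 6.2, 8.1])   # targets y^i
m = len(x)                           # number of training examples

def H(theta, x):
    """Linear hypothesis H_theta(x) = theta_0 + theta_1 * x."""
    theta0, theta1 = theta
    return theta0 + theta1 * x

# For some guess of theta, compare the predictions with the targets.
theta = (0.0, 2.0)
print(H(theta, x))  # predictions H_theta(x^i)
print(y)            # the values we want them to be close to
```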

So our goal is to produce such a function $H_\theta$ using $D$. One way to do this is to define our goal as an optimization problem. So we define a cost function, denoted $J(\theta)$, that measures how bad our current $H_\theta$ is, which we will then try to minimize.

In essence, if $J(\theta)$ is very small (i.e. the cost is low), then we can say that $ H_\theta $ is doing a good job on $D$. This suggests we should sum over $D$. In essence, for every point $(x,y)$ in the dataset, we will measure the error that $H_\theta$ makes on that point, i.e. how far $H_\theta(x)$ is from $y$. So then the total error, or cost, will be the sum of the errors on these individual points.

Let's try something simple: $$ J_{\text{bad}}(\theta) = \sum_i \left(H_\theta(x^i) - y^i\right) $$ so if $H_\theta(x^i) = y^i$ all the time, we get $J_{\text{bad}}=0$. That's good, but there's a problem: whenever $H_\theta(x^i) < y^i$, that term is negative and the cost shrinks. So, if we optimize this, we will just want to make $H_\theta$ very negative! That's just dumb, so we need to fix this. One simple way is to take the square: $$ J_{\text{better}}(\theta) = \sum_i (H_\theta(x^i) - y^i)^2 $$ This is much better. The smallest that $J_{\text{better}}$ can ever be is zero, and this can only happen if $H_\theta$ scores perfectly on every data point. Since it is always positive, it reminds one of an energy, which is nice.
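
Continuing the little NumPy sketch from above (it reuses the `x`, `y`, `H`, and `np` defined there), here is a quick numerical illustration of why the unsquared sum misbehaves:

```python
def J_bad(theta):
    """Sum of signed errors: can be pushed arbitrarily far below zero."""
    return np.sum(H(theta, x) - y)

def J_better(theta):
    """Sum of squared errors: never smaller than zero."""
    return np.sum((H(theta, x) - y) ** 2)

print(J_bad((0.0, -100.0)))     # hugely negative "cost" for a terrible model
print(J_better((0.0, -100.0)))  # large positive cost for the same terrible model
print(J_better((0.0, 2.0)))     # small cost for a reasonable model
```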

Now to finish off the cost function. First, we divide by $m$, so that instead of measuring the total error (or cost) over the dataset, we measure the average error per data point. Then, we also divide by $2$, because there is a square in the cost function: when we take the derivative (which we will, in order to optimize it), the square will generate a $2$ that cancels it out. It's just aesthetics, really.
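
To see the cancellation explicitly, write out the gradient of this cost for the 1D model $H_\theta(x) = \theta_0 + \theta_1 x$: $$ \frac{\partial J}{\partial \theta_0} = \frac{1}{m}\sum_{i=1}^{m}\bigl(H_\theta(x^i)-y^i\bigr), \qquad \frac{\partial J}{\partial \theta_1} = \frac{1}{m}\sum_{i=1}^{m}\bigl(H_\theta(x^i)-y^i\bigr)\,x^i. $$ The $2$ produced by the chain rule has cancelled the $\frac{1}{2}$, leaving tidy $\frac{1}{m}$ factors.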

So anyway, the final function is: $$ J(\theta) = \frac{1}{2m}\sum_i (H_\theta(x^i) - y^i)^2 $$
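
Continuing the same toy NumPy sketch, this final cost is a one-liner:

```python
def J(theta):
    """Cost J(theta) = (1 / (2*m)) * sum_i (H_theta(x^i) - y^i)^2."""
    return np.sum((H(theta, x) - y) ** 2) / (2 * m)

print(J((0.0, 2.0)))     # close to the data -> small average cost
print(J((0.0, -100.0)))  # far from the data -> huge average cost
```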

Recap: (1) We want to make every $H_\theta(x^i) \approx y^i$, so we define a distance between them (their squared difference, because we want to minimize to zero and want the "energy" or "cost" to always be positive) called the error, and sum over the whole dataset to get the total error. (2) We divide by $2m$ because it is more aesthetically pleasing to consider the average error and we want to cancel the inevitable $2$ from differentiation.

Sorry if this was too basic; hopefully it helped!


Edit: note that when we are considering the error, we can use any distance metric. For instance, we could instead use the distance: $$ J_p(\theta) = \frac{1}{m}\sum_i |H_\theta(x^i) - y^i|^p $$ Note that there will be a difference: the higher $p$ is, the more impact outliers will have on the function. The case $p=2$ is a natural one in ML because it is (the square of) the classical Euclidean distance. The case $p=1$ is certainly reasonable as well (though it comes with the drawback of being non-differentiable at $0$).
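
To see the outlier effect numerically (again reusing the toy `x`, `y`, `H` from the sketch above, and corrupting one target on purpose):

```python
def J_p(theta, x, y, p):
    """Average p-th power error: (1/m) * sum_i |H_theta(x^i) - y^i|^p."""
    return np.mean(np.abs(H(theta, x) - y) ** p)

theta = (0.0, 2.0)
y_corrupt = y.copy()
y_corrupt[-1] += 50.0   # one gross outlier in the targets

for p in (1, 2):
    print(f"p={p}: cost {J_p(theta, x, y, p):.2f} -> "
          f"{J_p(theta, x, y_corrupt, p):.2f} after adding the outlier")
```

The single corrupted point inflates the $p=2$ cost far more than the $p=1$ cost, which is exactly the sensitivity to outliers mentioned above.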

However, for linear regression specifically, there is extra reason to use the squared distance, also called the ordinary least squares (OLS) problem (note: this is mathematically slightly more advanced). If we assume that our dataset $D$ has been generated from a linear model with Gaussian noise $Y=\alpha X + \mathcal{E}$, then the minimizer of OLS is the maximum likelihood estimator of the model parameters. See also this post. So there is a beautiful connection between statistics, linear algebra, and optimization in this case.
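
Sketching that connection in one step (assuming i.i.d. noise $\mathcal{E}^i \sim \mathcal{N}(0,\sigma^2)$, so that $y^i = \alpha x^i + \mathcal{E}^i$): the log-likelihood of the dataset is $$ \log \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left(-\frac{(y^i - \alpha x^i)^2}{2\sigma^2}\right) = -\frac{m}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{m}\left(y^i - \alpha x^i\right)^2, $$ so maximizing the likelihood over $\alpha$ is exactly minimizing $\sum_i (y^i - \alpha x^i)^2$, i.e. the OLS objective.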

See also this thread, which talks a little more about other reasons to choose the sum of squares rather than the sum of absolute values.