Regression – Understanding Cost Function in OLS Linear Regression

loss-functions, machine-learning, regression

I'm a bit confused by a lecture on linear regression given by Andrew Ng in his Coursera machine learning course. There, he gave the following sum-of-squares cost function to be minimised:

$$ \frac{1}{2m} \sum _{i=1}^m \left(h_\theta(X^{(i)})-Y^{(i)}\right)^2 $$

I understand where the $\frac{1}{2}$ comes from. I think he included it so that, when taking the derivative of the squared term, the 2 cancels with the $\frac{1}{2}$. But I don't understand where the $\frac{1}{m}$ comes from.

Why do we need the $\frac{1}{m}$? In standard linear regression we don't have it; we simply minimise the sum of squared residuals. Why do we need it here?

Best Answer

As you seem to realize, we certainly don't need the $1/m$ factor to get linear regression. The minimizers will of course be exactly the same, with or without it. One typical reason to normalize by $m$ is so that we can view the cost function as an approximation to the "generalization error", which is the expected square loss on a randomly chosen new example (not in the training set):
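To see that the $1/m$ (or $1/(2m)$) factor doesn't change the minimizer, here is a minimal sketch in Python. The data, the design matrix, and the use of `scipy.optimize.minimize` are my own choices for illustration, not anything from the lecture:

```python
# Sketch: minimizing the cost with and without the 1/(2m) factor gives the same theta.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
m = 50
X = np.column_stack([np.ones(m), rng.normal(size=m)])   # design matrix with intercept
theta_true = np.array([1.0, 2.0])
Y = X @ theta_true + rng.normal(scale=0.5, size=m)

def sum_of_squares(theta):
    r = X @ theta - Y
    return 0.5 * np.sum(r ** 2)          # no 1/m factor

def mean_of_squares(theta):
    r = X @ theta - Y
    return np.sum(r ** 2) / (2 * m)      # the 1/(2m) version from the lecture

theta0 = np.zeros(2)
fit_sum = minimize(sum_of_squares, theta0).x
fit_mean = minimize(mean_of_squares, theta0).x
print(fit_sum, fit_mean)                 # the two minimizers agree (up to solver tolerance)
```

Rescaling a function by a positive constant rescales its values but not the location of its minimum, which is all that matters for the fitted coefficients.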

Suppose $(X,Y),(X^{(1)},Y^{(1)}),\ldots,(X^{(m)},Y^{(m)})$ are sampled i.i.d. from some distribution. Then for large $m$ we expect that $$ \frac{1}{m} \sum _{i=1}^m \left(h_\theta(X^{(i)})-Y^{(i)}\right)^2 \approx \mathbb{E}\left(h_\theta(X)-Y\right)^2. $$

More precisely, by the Strong Law of Large Numbers, we have $$ \lim_{m\to\infty} \frac{1}{m} \sum _{i=1}^m \left(h_\theta(X^{(i)})-Y^{(i)}\right)^2 = \mathbb{E}\left(h_\theta(X)-Y\right)^2 $$ with probability 1.
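Here is a small numerical sketch of that Law of Large Numbers statement. The data-generating distribution and the fixed $\theta$ below are assumptions made purely for illustration; the point is that, for a $\theta$ chosen before seeing the data, the average squared loss settles down to the expected squared loss as $m$ grows:

```python
# Sketch: average squared loss over m i.i.d. samples approaches the expected squared loss.
import numpy as np

rng = np.random.default_rng(1)
theta = np.array([0.5, 1.5])             # fixed theta, not fitted to any data

def avg_sq_loss(m):
    x = rng.normal(size=m)                                # X ~ N(0, 1)
    y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=m)     # Y = 1 + 2X + noise
    pred = theta[0] + theta[1] * x                        # h_theta(X)
    return np.mean((pred - y) ** 2)

# For this setup, E[(h_theta(X) - Y)^2] = 0.5^2 + 0.5^2 * Var(X) + 0.5^2 = 0.75
for m in [10, 100, 10_000, 1_000_000]:
    print(m, avg_sq_loss(m))              # values settle near 0.75 as m grows
```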

Note: each of the statements above is for a particular $\theta$ chosen without looking at the training set. For machine learning, we want these statements to hold for a $\hat{\theta}$ chosen on the basis of its good performance on the training set. The claims can still hold in that case, but we need some assumptions on the set of functions $\{h_\theta \,|\, \theta \in \Theta\}$, and we need something stronger than the ordinary Law of Large Numbers (e.g. a uniform law of large numbers).
