Machine Learning – Why Divide by 2m in Regression


I'm taking a machine learning course. The professor presents the following model for linear regression, where $h_\theta$ is the hypothesis (the proposed model; linear regression, in this case), $J(\theta_1)$ is the cost function, $m$ is the number of examples in the training set, and $x^{(i)}$ and $y^{(i)}$ are the input and output values of the $i$-th training example:

$$h_\theta(x) = \theta_1x$$

$$J(\theta_1) = \frac{1}{2m} \sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})^2$$
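To make sure I'm reading the formula right, here is how I would compute $J(\theta_1)$ in NumPy (the data values are just placeholders I made up):

```python
import numpy as np

# Placeholder training set; these values are made up for illustration
x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.1, 5.9])
m = len(x)

def J(theta1):
    """Cost: squared errors summed over the training set, divided by 2m."""
    predictions = theta1 * x            # h_theta(x) = theta1 * x
    return np.sum((predictions - y) ** 2) / (2 * m)

print(J(2.0))  # cost at theta1 = 2.0
```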

What I don't understand is why he is dividing the sum by $2m$.

Best Answer

The $\frac{1}{m}$ averages the squared error over the number of training examples, so that the size of the training set doesn't affect the magnitude of the cost (see John's answer).

So now the question is why there is an extra $\frac{1}{2}$. In short, it doesn't matter. The solution that minimizes $J$ as you have written it will also minimize $2J(\theta_1)=\frac{1}{m} \sum_{i=1}^{m} (h_\theta(x^{(i)})-y^{(i)})^2$. The latter function, $2J$, may seem more "natural," but the factor of $2$ does not matter when optimizing.
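You can check this numerically. The following is a quick sketch with made-up data (not from the course): a brute-force grid search finds the same minimizing $\theta_1$ for both versions of the cost.

```python
import numpy as np

# Made-up data for illustration
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])
m = len(x)

def cost(theta1, scale):
    # scale=0.5 gives J (the 1/(2m) version); scale=1.0 gives 2J (the 1/m version)
    return scale * np.sum((theta1 * x - y) ** 2) / m

thetas = np.linspace(0.0, 4.0, 40001)          # brute-force grid search
best_J  = thetas[np.argmin([cost(t, 0.5) for t in thetas])]
best_2J = thetas[np.argmin([cost(t, 1.0) for t in thetas])]
print(best_J, best_2J)  # same minimizer; only the cost values differ by 2x
```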

The only reason some authors like to include it is because when you take the derivative with respect to $\theta_1$, the $2$ that comes down from the square cancels it, leaving a cleaner gradient.
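To see the cancellation explicitly (standard chain rule, using the same notation as above):

$$\frac{\partial J}{\partial \theta_1} = \frac{1}{2m} \sum_{i=1}^{m} 2\left(h_\theta(x^{(i)})-y^{(i)}\right)\frac{\partial h_\theta(x^{(i)})}{\partial \theta_1} = \frac{1}{m} \sum_{i=1}^{m}\left(\theta_1 x^{(i)}-y^{(i)}\right)x^{(i)}$$

Without the $\frac{1}{2}$, the gradient would carry a factor of $\frac{2}{m}$ instead; the halved cost just makes gradient-descent update formulas tidier.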