Why use mean squared error instead of sum of squared errors

machine-learning, optimization

One of the most popular cost functions in Machine Learning/Deep Learning for finding the best parameters $\Theta$ of a model is the Mean Squared Error, written as

$J(y,\hat{y}) = \frac{1}{n}\sum_{i=1}^n(y_i-\hat{y}_i)^2$, where $\hat{y}$ is a function of $\Theta$
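
For concreteness, here is a minimal NumPy sketch of this formula (the toy `y` and `y_hat` values are made up purely for illustration):

```python
import numpy as np

def mse(y, y_hat):
    """Mean squared error: the average of the squared residuals."""
    y, y_hat = np.asarray(y, dtype=float), np.asarray(y_hat, dtype=float)
    return np.mean((y - y_hat) ** 2)

# Toy targets and predictions (hypothetical values)
y = np.array([3.0, -0.5, 2.0, 7.0])
y_hat = np.array([2.5, 0.0, 2.0, 8.0])
print(mse(y, y_hat))  # 0.375
```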

Some people prefer a more derivative-friendly version with an extra factor of $\frac{1}{2}$, just so that no stray factor appears when taking the gradient of $J(y,\hat{y})$:

$J(y,\hat{y}) = \frac{1}{2n}\sum_{i=1}^n(y_i-\hat{y}_i)^2$
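
To see why the $\frac{1}{2}$ is convenient, here is a small sketch of the two gradients for a linear model $\hat{y} = X\Theta$ (the linear model is my assumption, chosen only to make the gradient explicit):

```python
import numpy as np

def grad_mse(theta, X, y):
    # d/dtheta [ (1/n) * sum (y_i - x_i.theta)^2 ] = (2/n) * X^T (X theta - y)
    n = len(y)
    return (2.0 / n) * X.T @ (X @ theta - y)

def grad_half_mse(theta, X, y):
    # d/dtheta [ (1/(2n)) * sum (y_i - x_i.theta)^2 ] = (1/n) * X^T (X theta - y)
    # The 1/2 cancels the 2 from the chain rule, leaving a cleaner expression.
    n = len(y)
    return (1.0 / n) * X.T @ (X @ theta - y)
```

Both gradients point in the same direction; they differ only by a constant factor of 2, which can be absorbed into the learning rate.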

My question is:

Why do we put the $\frac{1}{n}$ there in the first place? Isn't the $\Theta$ that minimizes $\frac{1}{n}\sum_{i=1}^n(y_i-\hat{y}_i)^2$ also the one that minimizes $\sum_{i=1}^n(y_i-\hat{y}_i)^2$?

Best Answer

You have the same minimizer, so from an optimization perspective it makes no difference.

The benefit is that with the division, the objective is (essentially) the estimated variance of the error residuals, which makes it easier to interpret, meaningful to compare across runs with different data lengths, and in some cases easier to relate to statistical theory.
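
A quick sketch of both points, fitting the simplest possible model, a constant $\hat{y} = \theta$ (my simplification): the minimizer of both SSE and MSE is the sample mean, but only the MSE stays comparable as $n$ grows, settling near the residual variance.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_constant_and_losses(n, noise_std=2.0):
    # Data generated as 5.0 + Gaussian noise (hypothetical example).
    y = 5.0 + noise_std * rng.standard_normal(n)
    theta = y.mean()               # minimizes both SSE and MSE
    residuals = y - theta
    sse = np.sum(residuals ** 2)   # grows roughly linearly with n
    mse = np.mean(residuals ** 2)  # stays near noise_std**2 = 4
    return theta, sse, mse

for n in (100, 10_000):
    theta, sse, mse = fit_constant_and_losses(n)
    print(f"n={n:>6}: theta={theta:.3f}  SSE={sse:.1f}  MSE={mse:.3f}")
```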
