Regularization Methods – Understanding Why We Multiply by 1/(2m) in Regularization


In the week 3 lecture notes of Andrew Ng's Coursera Machine Learning class, a term is added to the cost function to implement regularization:

$$J^+(\theta) = J(\theta) + \frac{\lambda}{2m} \sum_{j=1}^n \theta_j^2$$

The lecture notes say:

We could also regularize all of our theta parameters in a single summation:

$$\min_\theta\ \dfrac{1}{2m}\ \left[ \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2 + \lambda\ \sum_{j=1}^n \theta_j^2 \right]$$

The same $\frac{1}{2m}$ factor is later applied to the regularization term for neural networks:

Recall that the cost function for regularized logistic regression was:

$$J(\theta) = - \frac{1}{m} \sum_{i=1}^m \left[ y^{(i)}\ \log (h_\theta (x^{(i)})) + (1 - y^{(i)})\ \log (1 - h_\theta(x^{(i)}))\right] + \frac{\lambda}{2m}\sum_{j=1}^n \theta_j^2$$

For neural networks, it is going to be slightly more complicated:
$$\begin{gather*} J(\Theta) = - \frac{1}{m} \sum_{i=1}^m \sum_{k=1}^K \left[y^{(i)}_k \log ((h_\Theta (x^{(i)}))_k) + (1 - y^{(i)}_k)\log (1 - (h_\Theta(x^{(i)}))_k)\right] + \frac{\lambda}{2m}\sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} ( \Theta_{j,i}^{(l)})^2\end{gather*}$$
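
For concreteness, here is a minimal NumPy sketch (my own, not from the course materials) of the regularized logistic-regression cost above. The names `X`, `y`, `theta`, and `lam` are assumptions: `X` is an m × (n+1) design matrix whose first column is the all-ones bias feature, `y` holds 0/1 labels, and `lam` is the λ above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_cost(theta, X, y, lam):
    m = len(y)
    h = sigmoid(X @ theta)                                 # h_theta(x^(i)) for every example
    data_cost = -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
    reg_cost = (lam / (2 * m)) * np.sum(theta[1:] ** 2)    # sum starts at j=1: theta_0 is not penalized
    return data_cost + reg_cost
```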

  • Why is the constant one-half used here? Is it so that it cancels in the derivative $J'$? (A short derivative check follows this list.)
  • Why the division by the number of training examples $m$? How does the number of training examples affect things?
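
Indeed, a one-line check (my own working, not from the lecture notes) confirms the first point: differentiating the penalty brings down a factor of 2, which the one-half cancels,

$$\frac{\partial}{\partial \theta_j} \left[ \frac{\lambda}{2m} \sum_{k=1}^n \theta_k^2 \right] = \frac{\lambda}{2m} \cdot 2\,\theta_j = \frac{\lambda}{m}\,\theta_j,$$

so the gradient-descent update gains a clean $\frac{\lambda}{m}\theta_j$ term with no stray factor of 2.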

Best Answer

Let's suppose you have 10 examples and you don't divide the L2 regularization cost by the number of examples $m$. Then the "dominance" of the L2 regularization cost over the cross-entropy cost will be roughly 10:1, because each training example contributes to the overall cost in proportion to $1/m = 1/10$.

If you have more examples, say 100, then the "dominance" of the L2 regularization cost will be something like 100:1, so you would need to decrease λ accordingly, which is inconvenient. It's better to keep λ constant regardless of the batch size.
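
Here is a minimal sketch of that scaling (my own illustration, not the notebook mentioned below; the per-example loss value and the parameter vector are arbitrary assumptions):

```python
# Compare the L2 penalty to a single example's contribution to the
# 1/m-averaged cross-entropy cost, with and without the 1/m on the penalty.
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(size=20)           # some fixed parameter vector
lam = 1.0
l2 = (lam / 2) * np.sum(theta ** 2)   # raw L2 penalty, no division by m

per_example_loss = 0.6                # a typical per-example cross-entropy value
for m in (10, 100, 1000):
    contribution = per_example_loss / m       # one example's share of the averaged data cost
    print(f"m={m:5d}  reg/example (no 1/m) = {l2 / contribution:9.1f}  "
          f"reg/example (with 1/m) = {(l2 / m) / contribution:5.1f}")
```

Without the $1/m$, the penalty's weight relative to each example's contribution grows linearly with $m$; with it, the ratio stays fixed, so a single λ works across dataset sizes.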

Update: To make this argument stronger, I created a Jupyter notebook.