[Math] Gradient descent: L2 norm regularization

gradient-descent, linear-algebra, multivariable-calculus, numerical-optimization, regularization

For logistic regression, I've worked out the stochastic gradient descent update to be approximately:

$
w_{t+1} = w_t - \eta\big(\sigma(w_t^T x_t) - y_t\big)x_t
$

$p(\mathbf{y} = 1 | \mathbf{x}, \mathbf{w}) = \sigma(\mathbf{w}^T\mathbf{x})$, where $\sigma(t) = \frac{1}{1 + e^{-t}}$
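The per-sample update above can be sketched in NumPy (function names here are my own, not from the HW):

```python
import numpy as np

def sigmoid(t):
    # sigma(t) = 1 / (1 + exp(-t))
    return 1.0 / (1.0 + np.exp(-t))

def sgd_step(w, x, y, eta):
    # One unregularized SGD step for logistic regression:
    # w_{t+1} = w_t - eta * (sigma(w_t^T x_t) - y_t) * x_t
    return w - eta * (sigmoid(w @ x) - y) * x
```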

However, I keep getting something wrong when adding L2 norm regularization.

From the HW definition of L2 Norm Regularization:

In other words, update $\mathbf{w}_t$ according to $l - \mu
\|\mathbf{w}\|^2 $, where $\mu$ is a constant.

I end up with something like this:

$
w_{t+1} = w_t - \eta\big((\sigma(w_t^T x_t) - y_t)x_t + 2\mu w_t\big)
$

I know this isn't right; where am I making a mistake?

Best Answer

Your example doesn't show which cost function you used. If you use the MSE (mean squared error), you get the equations below.

The MSE with L2 Norm Regularization:

$$ J = \dfrac{1}{2m} \Big[\sum_{i}\big(\sigma(w_{t}^Tx_{i}) - y_{i}\big)^2 + \lambda \|w_{t}\|^2\Big] $$
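As a sanity check, this regularized cost can be evaluated numerically. A minimal sketch, assuming a NumPy data matrix `X` of shape `(m, d)` and a label vector `y` of length `m` (names are illustrative, not from the answer):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def cost(w, X, y, lam):
    # J = (1/2m) [ sum_i (sigma(w^T x_i) - y_i)^2 + lambda * ||w||^2 ]
    m = X.shape[0]
    residuals = sigmoid(X @ w) - y
    return (residuals @ residuals + lam * (w @ w)) / (2 * m)
```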

And the update function:

$$ w_{t+1} = w_{t} - \dfrac{\gamma}{m}\Big(\sigma(w_{t}^Tx_{t}) - y_{t}\Big)x_{t} - \dfrac{\lambda}{m} w_{t} $$

And you can simplify to:

$$ w_{t+1} = w_{t}\Big(1 - \dfrac{\lambda}{m}\Big) - \dfrac{\gamma}{m}\Big(\sigma(w_{t}^Tx_{t}) - y_{t}\Big)x_{t} $$
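This simplified form makes the effect of the regularizer explicit: the weights are shrunk by a factor $(1 - \lambda/m)$ before the gradient step, which is why L2 regularization is often called weight decay. A sketch of that update (function name is my own):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def regularized_sgd_step(w, x, y, gamma, lam, m):
    # w_{t+1} = w_t * (1 - lambda/m) - (gamma/m) * (sigma(w_t^T x_t) - y_t) * x_t
    return w * (1.0 - lam / m) - (gamma / m) * (sigmoid(w @ x) - y) * x
```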

If you use a different cost function, you will get a different update rule.
