At 8:30 of this video Andrew Ng mentions that the cost function for stochastic gradient descent (for a single observation) for logistic regression is
$-y_i \log h_w(x_i) - (1 - y_i) \log (1 - h_w(x_i)) + \frac{\lambda}{2} \|w\|^2$
My question (a rather technical one) is about the regularization term. If the cost function for all observations is
$\sum_{i=1}^n \left\{-y_i \log h_w(x_i) - (1 - y_i) \log (1 - h_w(x_i))\right\} + \frac{\lambda}{2} \|w\|^2$
should the cost function for a single observation be
$-y_i \log h_w(x_i) - (1 - y_i) \log (1 - h_w(x_i)) + \frac{\lambda}{2n} \|w\|^2$
? In other words, the regularization term is divided by $n$: it is "spread out" across all observations, since $\sum_{i=1}^n \frac{\lambda}{2n} \|w\|^2 = \frac{\lambda}{2} \|w\|^2$ recovers the full penalty. I know this is rather technical, since $\lambda$ can easily be rescaled, but I want to make sure I get the concept right.
Best Answer
First, I would recommend checking my answer in this post:
How could stochastic gradient descent save time compared to standard gradient descent?
Andrew Ng's formula is correct: we should not use $\frac{\lambda}{2n}$ in the regularization term.
Here is the reason:
As I discussed in that answer, the idea of SGD is to use a subset of the data to approximate the gradient of the objective function being optimized. Here the objective function has two terms: the cost value and the regularization.
The cost value involves a sum over observations, but the regularization term does not. This is why the regularization term does not need to be divided by $n$ in SGD.
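A minimal sketch of that convention in NumPy (assuming $h_w(x) = \sigma(w^\top x)$; the function names and step size are illustrative, not from the video):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def per_sample_cost(w, x_i, y_i, lam):
    # Ng's per-observation objective: cross-entropy plus (lam/2)*||w||^2,
    # with the regularizer NOT divided by n.
    h = sigmoid(w @ x_i)
    return -y_i * np.log(h) - (1 - y_i) * np.log(1 - h) + 0.5 * lam * (w @ w)

def sgd_step(w, x_i, y_i, lam, lr=0.1):
    # Gradient of the per-observation objective: (h - y_i)*x_i + lam*w,
    # so every SGD step applies the full regularization gradient lam*w.
    h = sigmoid(w @ x_i)
    return w - lr * ((h - y_i) * x_i + lam * w)
```

A finite-difference check on `per_sample_cost` confirms that the gradient used in `sgd_step` is the exact gradient of this per-observation objective.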
EDIT:
After reviewing another answer, I may need to revise what I said. Now I think both answers are right: we can use $\frac{\lambda}{2n}$ or $\frac{\lambda}{2}$, each with pros and cons, and the choice depends on how we define the objective function. Let me use regression (squared loss) as an example.
If we define the objective function as $\frac {\|Ax-b\|^2+\lambda\|x\|^2} N$, then we should divide the regularization term by $N$ in SGD.
If we define the objective function as $\frac {\|Ax-b\|^2} N+\lambda\|x\|^2$ (as in the code demo), then we should NOT divide the regularization term by $N$ in SGD.
Here is a code demo. (We are using all of the data in each SGD step, so the stochastic gradient is actually the exact gradient.) The function `sq_loss_gr_approx_1` is right, because the loss function is

`v[1]/(2*n_data) + lambda*crossprod(x)`

and not

`(v[1] + lambda*crossprod(x))/(2*n_data)`
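For reference, here is a small NumPy sketch (not the original R demo; all names are illustrative) checking both conventions: the average of the per-sample stochastic gradients matches the full gradient only when the regularizer is scaled to match the objective's definition:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 50, 3
A = rng.normal(size=(N, d))
b = rng.normal(size=N)
x = rng.normal(size=d)
lam = 0.5

r = A @ x - b  # residuals

# Full gradients of the two objective definitions
grad_A = (2 * A.T @ r + 2 * lam * x) / N   # f_A = (||Ax-b||^2 + lam*||x||^2) / N
grad_B = 2 * A.T @ r / N + 2 * lam * x     # f_B = ||Ax-b||^2 / N + lam*||x||^2

# Average of per-sample stochastic gradients under each convention
per_A = np.mean([2 * (A[i] @ x - b[i]) * A[i] + 2 * (lam / N) * x
                 for i in range(N)], axis=0)  # regularizer divided by N
per_B = np.mean([2 * (A[i] @ x - b[i]) * A[i] + 2 * lam * x
                 for i in range(N)], axis=0)  # regularizer NOT divided by N

assert np.allclose(per_A, grad_A)
assert np.allclose(per_B, grad_B)
```

So both choices give an unbiased stochastic gradient; they just correspond to different (equivalently rescaled) objective functions.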