Solved – Stochastic gradient descent for regularized logistic regression

gradient descent, machine learning, regularization

At 8:30 of this video, Andrew Ng mentions that the cost function for stochastic gradient descent (for a single observation) in logistic regression is

$-y_i \log h_w(x_i) - (1 - y_i) \log (1 - h_w(x_i)) + \frac{\lambda}{2} \|w\|^2$

My question (a rather technical one) is about the regularization term. If the cost function for all observations is

$\sum_{i=1}^n \left\{-y_i \log h_w(x_i) - (1 - y_i) \log (1 - h_w(x_i))\right\} + \frac{\lambda}{2} \|w\|^2$

should the cost function for a single observation be

$-y_i \log h_w(x_i) - (1 - y_i) \log (1 - h_w(x_i)) + \frac{\lambda}{2n} \|w\|^2$

? In other words, the regularization term is divided by $n$; it is "spread out" across all observations. I know it's rather technical, since $\lambda$ can easily be rescaled, but I want to make sure I get the concept right.
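
To make the two options concrete, here is a minimal sketch of the single-observation update under each convention (my own illustration, not from the video; the function names, the step size eta, and the spread flag are made up):

sigmoid <- function(z) 1/(1 + exp(-z))

# one SGD step on observation (x_i, y_i); spread = TRUE divides the penalty by n
sgd_step <- function(w, x_i, y_i, n, lambda, eta = 0.1, spread = FALSE){
  h = sigmoid(sum(w * x_i))              # h_w(x_i)
  grad = (h - y_i) * x_i                 # gradient of the single-observation log loss
  reg = if (spread) (lambda / n) * w else lambda * w  # gradient of lambda/(2n)*||w||^2 vs lambda/2*||w||^2
  return(w - eta * (grad + reg))
}

# example call with made-up numbers
w = rep(0, 3)
sgd_step(w, x_i = c(1, 0.5, -1), y_i = 1, n = 100, lambda = 0.1)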

Best Answer

First, I would recommend that you check my answer to this post:

How could stochastic gradient descent save time compared to standard gradient descent?

Andrew Ng's formula is correct. We should not use $\frac{\lambda}{2n}$ on the regularization term.

Here is the reason:

As I discussed in that answer, the idea of SGD is to use a subset of the data to approximate the gradient of the objective function we are optimizing. Here the objective function has two terms: the cost value and the regularization.

The cost value is a sum over observations, but the regularization term is not. This is why the regularization term does not need to be divided by $n$ in SGD.


EDIT:

After reviewing another answer, I may need to revise what I said. Now I think both answers are right: we can use $\frac{\lambda}{2n}$ or $\frac{\lambda}{2}$; each has pros and cons, and it depends on how we define our objective function. Let me use regression (squared loss) as an example.

If we define the objective function as $\frac{\|Ax-b\|^2+\lambda\|x\|^2}{N}$, then we should divide the regularization term by $N$ in SGD.

If we define the objective function as $\frac{\|Ax-b\|^2}{N}+\lambda\|x\|^2$ (as in the code demo below), then we should NOT divide the regularization term by $N$ in SGD.
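
To spell out the difference (a routine calculation added for illustration; $a_i^\top$ denotes row $i$ of $A$), write each objective as an average of $N$ per-observation terms. Under the first definition the term for a sampled observation $i$ is

$(a_i^\top x - b_i)^2 + \frac{\lambda \|x\|^2}{N}$

with gradient $2(a_i^\top x - b_i)\, a_i + \frac{2 \lambda x}{N}$, while under the second definition it is

$(a_i^\top x - b_i)^2 + \lambda \|x\|^2$

with gradient $2(a_i^\top x - b_i)\, a_i + 2 \lambda x$. Averaging the per-observation term over all $i$ recovers the corresponding objective in each case, so each stochastic gradient is unbiased for its own definition.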

Here is a code demo. We use all of the data in the SGD gradient (nsample equal to n_data), so it should reproduce the exact gradient:

# ------------------------------------------------------
# data, and loss function, and gradient
# ------------------------------------------------------
set.seed(0)
n_data=1e3
n_feature=2
A=matrix(runif(n_data*n_feature),ncol=n_feature)
b=runif(n_data)

# objective: ||Ax - b||^2 / (2*n_data) + lambda*||x||^2
sq_loss<-function(A,b,x,lambda){
  e=A %*% x -b
  v=crossprod(e)
  return(v[1]/(2*n_data)+lambda*crossprod(x))
}
# exact gradient of the objective: t(A) %*% (A %*% x - b) / n_data + 2*lambda*x
sq_loss_gr<-function(A,b,x,lambda){
  e=A %*% x -b
  v=t(A) %*% e
  return(v/n_data+2*lambda*x)
}

# ------------------------------------------------------
# sgd: approximate gradient using subset of data
# ------------------------------------------------------

# version 1: regularization gradient is NOT divided by the sample size
sq_loss_gr_approx_1<-function(A,b,x,nsample,lambda){
  # sample data and calculate the gradient of the data-fit term
  i=sample(n_data,nsample)
  gr=t(A[i,] %*% x-b[i]) %*% A[i,]
  v=matrix(gr/nsample,ncol=1)
  return(v+2*lambda*x)
}

# version 2: regularization gradient IS divided by the sample size
sq_loss_gr_approx_2<-function(A,b,x,nsample,lambda){
  # sample data and calculate the gradient of the data-fit term
  i=sample(n_data,nsample)
  gr=t(A[i,] %*% x-b[i]) %*% A[i,]
  v=matrix(gr/nsample,ncol=1)
  return(v+2*lambda*x/nsample)
}

# evaluate at a random point; with nsample = n_data every observation is used
x=matrix(runif(n_feature),ncol=1)
sq_loss_gr(A,b,x,lambda=3)
sq_loss_gr_approx_1(A,b,x,nsample=n_data,lambda=3)
sq_loss_gr_approx_2(A,b,x,nsample=n_data,lambda=3)

The function sq_loss_gr_approx_1 is the right one here, because the loss function is v[1]/(2*n_data)+lambda*crossprod(x) and not (v[1]+lambda*crossprod(x))/(2*n_data).


> sq_loss_gr(A,b,x,lambda=3)
#          [,1]
# [1,] 3.317703
# [2,] 4.969016

> sq_loss_gr_approx_1(A,b,x,nsample=n_data,lambda=3)
#          [,1]
# [1,] 3.317703
# [2,] 4.969016

> sq_loss_gr_approx_2(A,b,x,nsample=n_data,lambda=3)
#           [,1]
# [1,] 0.1325575
# [2,] 0.1597326
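
As a follow-up, here is a minimal sketch (not part of the original answer; the step size eta, batch size, iteration count, and lambda value are arbitrary choices for illustration) of how the version-1 estimator would be used inside an actual SGD loop:

# run SGD with the version-1 estimator; each step uses a small random batch
x_sgd=matrix(runif(n_feature),ncol=1)
eta=0.05
for (iter in 1:2000){
  g=sq_loss_gr_approx_1(A,b,x_sgd,nsample=10,lambda=0.1)
  x_sgd=x_sgd-eta*g
}
sq_loss(A,b,x_sgd,lambda=0.1)   # objective value after training

With nsample well below n_data the individual steps are noisy, which is the usual SGD trade-off discussed in the linked answer.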