Solved – How to make the stochastic gradient descent algorithm converge to the optimum

Tags: gradient-descent, machine-learning, regression

(Background info taken from my blog)
In logistic regression, the hypothesis function, which models the relationship
between the dependent variable $P(y = 1)$ and the independent variable $X$, is:
\begin{align*}
H_i = h(X_i) &=
\frac{1}{1 + e^{-X_i \cdot \beta}}
\end{align*}
where $X_i$ is the $i$th row of the design matrix $\underset{m \times n}{X}$,
or in matrix form:
\begin{align*}
H &=
\frac{1}{1 + e^{-X \beta}}
\end{align*}

$H$ is an $m\times 1$ matrix. Except for the product $X\beta$, all operations are element-wise.
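As a concrete illustration, the hypothesis can be computed in R in one vectorized expression (the function and variable names here are illustrative, not from the post's code):

```r
# Vectorized logistic hypothesis: returns the m x 1 matrix H
# X: m x n design matrix, beta: n x 1 coefficient vector
sigmoid_hypothesis <- function(X, beta) {
  # exp() and the division are applied element-wise to X %*% beta
  1 / (1 + exp(-X %*% beta))
}
```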
The cost function $J$ measures the deviation of the modeled dependent
variable from the observed $y$:

\begin{align*}
J &= \frac{1}{m}\sum_{i = 1}^m \left[ -y_i\log H_i - (1-y_i)\log (1-H_i) \right] \\
&= \frac{1}{m}\sum_{i = 1}^m \left[
-y_i \log \frac{1}{1+e^{- X_i \cdot \beta}} - (1 - y_i) \log \left( 1 - \frac{1}{1+e^{- X_i \cdot \beta}} \right)
\right] \\
&= \frac{1}{m}\sum_{i = 1}^m \left[
y_i \log (1+e^{-X_i \cdot \beta}) + (1 - y_i) \log \left( 1+e^{X_i \cdot \beta} \right)
\right]
\end{align*}
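A sketch of this cost in R, assuming the same shapes for $X$, $y$, and $\beta$ as above (the helper name is illustrative):

```r
# Average cross-entropy cost J over the m observations
logistic_cost <- function(X, y, beta) {
  m <- nrow(X)
  H <- 1 / (1 + exp(-X %*% beta))   # element-wise sigmoid, m x 1
  as.numeric(-(t(y) %*% log(H) + t(1 - y) %*% log(1 - H)) / m)
}
```

At $\beta = 0$ every $H_i = 1/2$, so $J = \log 2$ regardless of $y$, which is a handy sanity check.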

\begin{align*}
\frac{\partial J}{\partial \beta_j}
&= \dfrac {1} {m} \sum_{i=1}^m \left[
y_{i}H_{i}e^{-X_{i}\cdot \beta }\left( -X_{ij}\right ) +
\left( 1-y_{i}\right) \dfrac {1} {1+e^{X_{i}\cdot\beta }}e^{X_{i}\cdot\beta }X_{ij}
\right] \\
&= \dfrac {1} {m} \sum_{i=1}^m \left[
y_{i}H_{i}e^{-X_{i}\cdot \beta }\left( -X_{ij}\right ) +
\left( 1-y_{i}\right) H_i X_{ij}
\right] \\
&= \dfrac {1} {m} \sum_{i=1}^m H_{i}X_{ij}\left( -y_{i}e^{-X_{i}\cdot \beta }+1-y_{i}\right) \\
&= \dfrac {1} {m} \sum_{i=1}^m H_{i}X_{ij}\left( 1-y_{i}\left( 1+e^{-X_i\cdot\beta }\right) \right) \\
&= \dfrac {1} {m} \sum_{i=1}^m H_{i}X_{ij}\left( 1-y_{i} / H_i\right) \\
&= \dfrac {1} {m} \sum_{i=1}^m \left( H_{i}-y_{i}\right) X_{ij} \\
&= \dfrac {1} {m} (H - y) \cdot X_j \\\\
\frac{\partial J}{\partial \beta}
&= \dfrac {1} {m} X^T (H-y)
\end{align*}
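The final vectorized form is one line in R. A sketch of the batch gradient, including the $1/m$ averaging factor from the cost (names illustrative):

```r
# Gradient of the averaged logistic cost: (1/m) * t(X) %*% (H - y)
logistic_grad <- function(X, y, beta) {
  m <- nrow(X)
  H <- 1 / (1 + exp(-X %*% beta))   # m x 1 hypothesis vector
  t(X) %*% (H - y) / m              # n x 1 gradient
}
```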

Let $f_i' = \left( H_{i}-y_{i}\right) X_{ij}$. Then, according to this video:


batch gradient descent can be described as:

Until convergence:

for all $j$:

$$\theta_j := \theta_j - \alpha \sum_{i=1}^m f_i'$$

and stochastic gradient descent can be described as:

Shuffle the rows of data, and until convergence:

for all $i$ in $1\cdots m$:

for all $j$ in $0\cdots n$:
\begin{align*}
\theta_j := \theta_j - \alpha f_i'
\end{align*}
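One pass of this loop can be sketched in R as follows; the coefficient vector is updated after every single row, which is the key difference from the batch rule (the function name and numeric representation of `beta` as a plain vector are illustrative):

```r
# One SGD epoch: visit the rows in shuffled order and update
# the whole coefficient vector after each observation
sgd_epoch <- function(X, y, beta, alpha) {
  for (i in sample(nrow(X))) {
    Hi <- 1 / (1 + exp(-sum(X[i, ] * beta)))      # scalar H_i
    beta <- beta - alpha * (Hi - y[i]) * X[i, ]   # update all j at once
  }
  beta
}
```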

This looks straightforward, but when I implement stochastic
gradient descent in R, it fails to converge anywhere near
the optimum. Here is the code:

logreg = function(y, x) {
    alpha = 1.15                      # initial learning rate
    x = as.matrix(x)
    x = cbind(1, x)                   # prepend the intercept column
    m = nrow(x)
    m1 = sample(m)                    # shuffled row order for SGD
    n = ncol(x)

    b = matrix(rep(1, n))             # starting coefficients
    newb = b + .1
    h = 1 / (1 + exp(-x %*% b))       # hypothesis H
    J = -(t(y) %*% log(h) + t(1-y) %*% log(1 - h))   # cost (not scaled by 1/m)
    newJ = J + .5

    while(1) {
        cat("outer while...\n")
        for(i in m1) {                # one pass over the shuffled rows
            Vi = exp(-as.numeric(x[i, ] %*% b))
            Hi = 1 / (1 + Vi)         # H_i for observation i
            Ei = (Hi - y[i])          # residual H_i - y_i
            sDerivJ = matrix(Ei * x[i, ])   # stochastic gradient f_i'
            newb = b - alpha * sDerivJ
        }
        h = 1 / (1 + exp(-x %*% newb))
        newJ = -(t(y) %*% log(h) + t(1-y) %*% log(1 - h))
        if((newJ - J)/J > .15) {      # cost rose too much: halve the step size
            alpha = alpha/2
            next
        }
        print(b)
        print(newb)
        b = newb
        J = newJ
        if(max(abs(b - newb)) < 0.001)
        {
            break
        }
    }
    b
}

nr = 5000
nc = 20
set.seed(17)
x = matrix(rnorm(nr*nc), nr)
y = matrix(sample(0:1, nr, repl=T), nr)
testglm = function() {
    res = summary(glm(y~x, family=binomial))
    print(res)
}
testlogreg = function() {
    res = logreg(y, x)
    print(res)
}
print(system.time(testlogreg()))
print(system.time(testglm()))

I am wondering what went wrong.

Best Answer

I would suggest making the question more concise and skipping the derivation.

Two suggestions:

  • Choose a small initial $\alpha$.
  • Decrease $\alpha$ slowly over time.

For details, see Alex.R's answer here:

Why one epoch for stochastic gradient descent (SGD) is much better than one iteration for gradient decent (GD)?
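A common way to implement the second suggestion is a decaying schedule such as $\alpha_t = \alpha_0 / (1 + t/\tau)$. The sketch below is illustrative; the hyper-parameters $\alpha_0$ and $\tau$ and the function name are assumptions, not from the answer:

```r
# Decaying step size: alpha_t = alpha0 / (1 + t / tau)
# t is the update (or epoch) counter; tau controls how fast the step shrinks
decayed_alpha <- function(alpha0, t, tau = 100) {
  alpha0 / (1 + t / tau)
}

# Inside the SGD loop, update number t would then use, e.g.:
#   beta <- beta - decayed_alpha(0.1, t) * (Hi - y[i]) * x[i, ]
```

Early updates take large steps toward the optimum; later updates take ever smaller steps, which damps the oscillation around the optimum that a fixed $\alpha$ produces.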