Solved – Why don’t we use non-constant learning rates for gradient descent for things other than neural networks

deep learning, gradient descent, machine learning, optimization

Deep learning literature is full of clever tricks involving non-constant learning rates in gradient descent. Things like exponential decay, RMSprop, Adagrad, etc. are easy to implement and are available in every deep learning package, yet they seem to be nonexistent outside of neural networks. Is there any reason for this? If it is that people simply don't care, is there a reason why we don't have to care outside of neural networks?
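For concreteness, here is a minimal NumPy sketch contrasting a constant step size with an Adagrad-style per-parameter, non-constant learning rate. The least-squares objective and all constants are illustrative placeholders, not taken from any particular library:

```python
import numpy as np

# Toy objective (placeholder): f(w) = 0.5 * ||A w - b||^2
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 10))
b = rng.standard_normal(100)
w = np.zeros(10)

def grad(w):
    return A.T @ (A @ w - b)

# Constant learning rate: same step size for every parameter, every iteration
#   w -= 0.01 * grad(w)
# Exponential decay would instead shrink one global step size over time:
#   eta_t = eta0 * decay ** t

# Adagrad-style non-constant learning rate: each parameter's step shrinks
# according to the squared gradients accumulated for that parameter so far
accum = np.zeros_like(w)
eta, eps = 0.1, 1e-8
for _ in range(500):
    g = grad(w)
    accum += g ** 2
    w -= eta / (np.sqrt(accum) + eps) * g
```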

Best Answer

Disclaimer: I don't have much experience with optimization outside of neural networks, so my answer will clearly be biased, but there are several things that play a role:

  • (Deep) neural networks have a lot of parameters. This has several implications:

    Firstly, it more or less rules out higher-order methods, simply because computing the Hessian and higher derivatives becomes infeasible (see the sketch after this list). In other domains, this may be a valid approach, better than any tweak to SGD.

    Secondly, although SGD is wonderful, it tends to be impractically slow. The improved SGD variants mainly enable faster training, while potentially losing some of the nice properties of SGD. In other domains, SGD training time may not be the bottleneck, so the improvements gained by speeding it up may simply be negligible.

  • Training (deep) neural networks is non-convex optimization, and I am not aware of significant convex relaxation results in the field. Unlike other fields, neural network research is not focused on provably globally optimal solutions, which leads to more effort being invested in improving the properties of the loss surface and its traversal during optimization.

    In other fields, employing convex relaxation and obtaining globally optimal solutions may be the center of interest rather than the optimization algorithm itself, because once the problem is formulated as a convex one, the choice of optimization algorithm cannot improve the quality of the solution.
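To illustrate the first point above, a second-order (Newton) step needs the full d×d Hessian, which is out of reach when d is in the millions, whereas the first-order updates that SGD variants tweak only ever touch the gradient. A rough sketch on a toy least-squares objective (purely illustrative, not any specific framework's API):

```python
import numpy as np

d = 10  # parameter count; for deep nets d is in the millions, making the d x d Hessian below infeasible
rng = np.random.default_rng(0)
A = rng.standard_normal((50, d))
b = rng.standard_normal(50)
w = np.zeros(d)

def gradient(w):
    return A.T @ (A @ w - b)   # O(d) memory

def hessian(w):
    return A.T @ A             # O(d^2) memory, roughly O(d^3) to solve against

# Newton step: uses exact curvature, but cost scales cubically with d
w_newton = w - np.linalg.solve(hessian(w), gradient(w))

# First-order step (what SGD and its adaptive variants build on): gradient only
w_first_order = w - 0.01 * gradient(w)
```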

I suppose this answer does not cover all possible aspects and I am myself curious about other opinions.