[Math] What aspects of convex optimization are used in artificial intelligence, if any

artificial intelligenceconvex optimizationneural networks

I work on convex optimization with Stephen Boyd's book.

As an example, support vector machines are mentioned as an application of separating hyperplanes theorem. I am wondering if there is any other application of convex optimization to artificial intelligence.

I am very unaware of AI methods, but as far as I know it involves minimizing functions. Does any of them involve minimizing a convex (or quasi-convex, log-convex) function?

Best Answer

Many models used in machine learning are either continuous optimization problems (linear / logistic regression, simple neural networks, SVMs) or pieces of it are continuous (more complicated neural networks, regression trees, etc). A huge part of convex optimization is learning foundational methods to solve continuous and convex problems.

One common complaint is that many ML problems may be nonconvex (neural networks being the main one). However, many of the current tools used to solve nonconvex problems are either the same or small variants of the traditional tools presented for continuous convex problems, such as

  • gradient descent, from which we get stochastic gradient descent, i.e. ML's favorite algorithm
  • block coordinate descent or minimization
  • acceleration schemes for first order methods (most famously, Nesterov's acceleration and Polyak's heavy ball)

In fact, methods like stochastic gradient descent, and even asynchronous versions (which are all the craze in ML today) appeared as early as Bertsekas' Nonlinear Programming book in the 1990s, which was primarily aimed at convex optimization. And, methods like ADMM were proposed and analyzed in the 70s. I don't see them appearing too much in industrial ML, but in theoretical ML they are favored for their ease in decomposing large problems and exploiting sparsity. A key concept you will get from Boyd/Vandenberghe that you might not get from ML blogs is concept from convex optimization that is relevant in ML is duality and dual forms, mostly because of SVMs. (For kernel SVMs, we always solve the dual form.)

Therefore, learning about convex optimization, even though it is restricted to convex problems, can be extremely helpful and foundational in understanding the newer, "hotter" methods of today.

That being said...

There are several key differences between learning ML and learning optimization (convex or otherwise), and I think it is important to be aware of them, so you don't put too much emphasis on skills that my have less relevance to ML. To me, they are

  • Optimization folk care about training error. ML folk care about generalization error. In practice, what this means is that ML folk are completely comfortable with getting within a neighborhood of the true minimizer, since that often doesn't affect test error (and may even act as implicit regularization).
  • Optimization folk tend to prefer simpler methods with better optimality properties. ML folk tend to prefer models that better fit the data structure, which may be highly generalizable but not necessarily "simple" from an optimization point of view. (e.g. decision trees)
  • Most "foundational" optimization textbooks (like Boyd/Vandenberghe) came out of operations research (OR), which largely involves relaxations of combinatorial problems. Most machine learning motivations do not intersect with operations research very much.
  • Many great optimization papers involve deriving rates, properties, and guarantees that do not match practice very well. I believe this is in part because of the previous point. (They do much better on operations research problems.)
  • A specific artifact of the OR history is that it often seems convex optimization is very concerned with constraints and sensitivity analysis; in ML, this is often much less important in the downstream task.
  • Another artifact of the OR history is the emphasis on smoothness. Many optimization methods get great convergence results by forcing the problem to be an unconstrained smooth problem. (The most famous of these are barrier methods, which turn into interior point methods, which is what default CVX mainly runs on.) In machine learning, we often don't care about this. Either we use a proximal operator and maintain the problem's original form, or we just ignore nonsmooth points altogether, and just converge to a neighborhood of the optimal variable without complaint. (Result of caring more about generalization error than training error.) The other thing this buys us is scalability, which is really the name of the game in today's data science world.

I'm guessing I wrote a lot of controversial things; I have worked on convex optimization for 7 years now and ML for the past 3 years. I suppose someone with a different background may have an entirely different perspective.

Related Question