I wanted to know how much of machine learning requires optimization. From what I've heard, statistics is an important mathematical topic for people working with machine learning. Similarly, how important is it for someone working in machine learning to learn about convex or non-convex optimization?
Solved – Optimization and Machine Learning
machine learning, optimization
Related Solutions
In machine learning "programming" = coding up an algorithm, in operations research "programming" = optimization?
On a more serious note, I think the differences are more a matter of historical lineage and application area than of techniques per se. One perspective on the cultures of (academic) statistics vs. machine learning that I found interesting is "The Stats Handicap".
Statistics is the oldest, and originated out of mathematics and probability, perhaps emerging as a distinct discipline in the late 19th century (though much of the theory is older). Of the three, statistics is perhaps the most associated with "academic science", and is certainly the most concerned with rigorous approaches to experimental design and data collection.
Operations research seems to originate closer to WWII, and is generally associated with large organizations (e.g. military, logistics/supply-chain, industrial engineering), focusing on managing and optimizing their "operations", as it were.
(In terms of "data science" traditions with a long history, another big one would be econometrics. Wikipedia says it's economics, while CV says it's statistics, for what that's worth!)
Machine learning is the most recent, but to me it is the most ambiguous of the three; at least in the popular media it is essentially a re-branding of "AI". This broader sense includes many strands, including computer vision and probabilistic robotics. Computer science is an integral part of all of these, however.
Finally, I would say that buzzwords like "Data Science" and "Analytics" are largely marketing terms. They are less likely to be used between members of these communities, vs. when communicating with outsiders (or when outsiders are talking between themselves).
Conceptually, the only thing you need to know to understand machine learning algorithms is "there is an optimum, and we can find it". Practically, it's always useful to have some idea of how optimization is happening "under the hood". At the very least, it will give you some insight into how the performance and storage requirements of your ML algorithms are likely to scale with data size and dimension, and under what circumstances you are likely to run into problems. Of course, optimization is a rich and interesting area of its own, which will exercise your brain and round out your CS education even if you never do machine learning.
As requested in a comment, here is a list of important topics in optimization:
- Continuous optimization
  - Toy algorithms: gradient descent, simplex method (a minimal gradient-descent sketch follows this list)
  - Powell's method
  - BFGS
  - Model trust (trust-region) methods
- Stochastic optimization
  - Simulated annealing
  - Genetic algorithms and swarm methods
  - Stochastic gradient descent
- Constrained optimization
  - Barrier methods
  - Linear programming (with interior-point methods)
  - Integer programming
- Dynamic programming
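As promised above, here is a minimal sketch of the first "toy algorithm", plain gradient descent. This is my own illustration, not part of the original answer; the quadratic objective and the fixed step size are arbitrary choices for demonstration.

```python
import numpy as np

def gradient_descent(grad, x0, step=0.1, tol=1e-8, max_iter=1000):
    """Minimize a differentiable function given its gradient `grad`,
    starting from `x0`, using a fixed step size."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:  # stop once the gradient is (nearly) zero
            break
        x = x - step * g
    return x

# Example: minimize f(x) = (x - 3)^2, whose gradient is 2(x - 3).
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=[0.0])
print(x_min)  # approximately [3.0]
```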
Best Answer
The way I look at it is that statistics / machine learning tells you what you should be optimizing, and optimization is how you actually do so.
For example, consider linear regression with $Y = X\beta + \varepsilon$ where $E(\varepsilon) = 0$ and $Var(\varepsilon) = \sigma^2I$. Statistics tells us that this is (often) a good model, but we find our actual estimate $\hat \beta$ by solving an optimization problem
$$ \hat \beta = \textrm{argmin}_{b \in \mathbb R^p} ||Y - Xb||^2. $$
The properties of $\hat \beta$ are known to us through statistics, so we know that this is a good optimization problem to solve. In this case the optimization is easy (it even has the closed-form solution $\hat \beta = (X^TX)^{-1}X^TY$ when $X$ has full column rank), but it still illustrates the general principle.
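As a small illustration of my own (not from the original answer), the least-squares problem above can be solved either through the normal equations or with a generic solver; a minimal sketch on simulated data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
Y = X @ beta_true + 0.1 * rng.normal(size=n)  # Y = X beta + noise

# Closed-form solution of argmin ||Y - Xb||^2 via the normal equations.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)

# The same minimizer from a general-purpose least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(beta_hat, beta_lstsq)  # both close to beta_true
```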
More generally, much of machine learning can be viewed as solving $$ \hat f = \textrm{argmin}_{f \in \mathscr F} \frac 1n \sum_{i=1}^n L(y_i, f(x_i)) $$ where I'm writing this without regularization but that could easily be added.
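To make the abstract $\hat f$ concrete, here is a hedged sketch (my own, not from the original answer) of empirical risk minimization by stochastic gradient descent, for the logistic loss $L(y, s) = \log(1 + e^{-ys})$ over the linear class $f(x) = w^\top x$; the data and hyperparameters are invented for illustration.

```python
import numpy as np

def erm_sgd(X, y, lr=0.1, epochs=50, seed=0):
    """Minimize (1/n) sum_i L(y_i, f(x_i)) over linear f(x) = w @ x,
    with logistic loss L(y, s) = log(1 + exp(-y * s)), via SGD."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    w = np.zeros(p)
    for _ in range(epochs):
        for i in rng.permutation(n):
            s = X[i] @ w
            # Gradient of the logistic loss for a single example.
            g = -y[i] * X[i] / (1.0 + np.exp(y[i] * s))
            w -= lr * g
    return w

# Toy data with labels in {-1, +1}.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = np.sign(X @ np.array([2.0, -1.0]))
w_hat = erm_sgd(X, y)
print(w_hat)  # roughly proportional to [2, -1]
```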
A huge amount of research in statistical learning theory (SLT) has studied the properties of these argminima, whether or not they are asymptotically optimal, how they relate to the complexity of $\mathscr F$, and many other such things. But when you actually want to get $\hat f$, often you end up with a difficult optimization and it's a whole separate set of people who study that problem. I think the history of SVM is a good example here. We have the SLT people like Vapnik and Cortes (and many others) who showed how SVM is a good optimization problem to solve. But then it was others like John Platt and the LIBSVM authors who made this feasible in practice.
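For instance (my example, not from the original answer), scikit-learn's `SVC` wraps LIBSVM, so the optimization problem that the SLT work justified and the solver machinery that Platt-style SMO methods made practical meet in a couple of lines:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC  # wraps LIBSVM under the hood

# Synthetic binary classification data for illustration.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# SLT tells us the SVM objective is worth solving;
# LIBSVM's SMO-style solver is what actually solves it here.
clf = SVC(kernel="rbf", C=1.0).fit(X, y)
print(clf.score(X, y))  # training accuracy
```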
To answer your exact question: knowing some optimization is certainly helpful, but generally no one is an expert in all of these areas, so you learn as much as you can and accept that some aspects will always be something of a black box to you. Maybe you haven't properly studied the SLT results behind your favorite ML algorithm, or maybe you don't know the inner workings of the optimizer you're using. It's a lifelong journey.