Why is “convexity” important in mathematics?

convex-optimization, convex-analysis, optimization

Recently, I have been interested in better understanding the origins and history of convexity in mathematics. In particular, why is "convexity" so important that it (historically) made us interested in classifying functions as either convex or non-convex?

I found Roman J. Dwilewicz's A short history of convexity, which provides some important notes on the history of convexity in mathematics. For instance, the Greek mathematician Archimedes made some of the first recorded observations on convexity.

Convexity was further developed by mathematicians such as Euler and Cauchy, until we arrived at the more familiar modern definitions (e.g. $f$ is convex if $f(\lambda x + (1-\lambda)y) \le \lambda f(x) + (1-\lambda) f(y)$ for all $x, y$ in the domain and all $\lambda \in [0,1]$).

My question

Does anyone know what relevance the convex property had that led to it being defined and studied over the years? And when did we start noticing that convex functions have noticeably different properties and behaviors compared to non-convex functions, such that it became of interest to study and classify functions based on convexity?

E.g., in optimization, we are routinely taught that convex functions are generally easier to optimize compared to non-convex functions.

Best Answer

I searched for this question in hopes of finding an answer, but it seems I am obliged to provide one myself. This answer leans heavily on the point of view of functional analysis, since that's the area in which I personally first saw hints of the profound significance/grand story of convexity.




Part 1: Functional Analysis, LC-TVS's, and (geometric) Hahn-Banach

For convex functions specifically, there is the question What are the main uses of convex functions?. One of the answers mentions Jensen's inequality from probability theory, which was in fact my first exposure to the importance of convexity (in the case of Jensen's inequality, the key fact is that the graphs of convex functions rest/float atop a sea of linear functions, i.e. for every point on the graph of a convex function, there is a linear function going through that point that lies completely below --- less than or equal to --- the graph of the function).
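
To see that key fact in action, here is the standard one-line derivation of Jensen's inequality from it (a sketch, assuming $\varphi$ convex and $X$ integrable): let $m = \mathbb E[X]$, and let $L(x) = ax + b$ be a supporting line at $m$, i.e. $L(m) = \varphi(m)$ and $L \le \varphi$ everywhere. Then, by linearity of expectation, $$\varphi(\mathbb E[X]) = L(\mathbb E[X]) = \mathbb E[L(X)] \le \mathbb E[\varphi(X)].$$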

The next major area in which I encountered convexity heavily was functional analysis. There are certain [spaces equipped with a family of seminorms] that appear very naturally in analysis (click link for examples) --- like the continuous functions $\mathcal C(U)$ on an open set $U\subseteq \mathbb R^d$ with topology given by uniform convergence on compact subsets $K\subseteq U$; or the Schwartz functions $\mathcal S(\mathbb R^d)$. The coarsest topology on these spaces s.t. all the seminorms in the family are continuous makes the space into a locally convex topological vector space (LC-TVS).
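
For concreteness, the standard seminorm family on $\mathcal C(U)$ has one seminorm per compact set: $$p_K(f) := \sup_{x \in K} |f(x)|, \qquad K \subseteq U \text{ compact},$$ and convergence in the resulting topology is exactly uniform convergence on every compact subset. Note the $p_K$ are genuinely seminorms rather than norms: $p_K(f) = 0$ only forces $f$ to vanish on $K$, not on all of $U$.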

The name comes from the fact that there is another equivalent definition of LC-TVS's, involving neighborhood bases of convex sets: see here. This equivalence is proven using Minkowski functionals. Applying the Hahn-Banach theorem (a cornerstone theorem in functional analysis about extending continuous linear functionals defined on a subspace to the whole space, preserving any sublinear upper bounds --- e.g. any seminorms) to these Minkowski functionals, we obtain the hyperplane separation theorem for convex sets in (LC-)TVS's/"geometric Hahn-Banach theorem" (see this article for some proofs).
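
For reference, the Minkowski functional of a convex set $C$ with $0$ in its interior is $$p_C(x) := \inf\{t > 0 : x/t \in C\};$$ it is sublinear (convexity of $C$ gives subadditivity), and when $C$ is the unit ball of a norm it recovers that norm. This is how convex neighborhoods of $0$ and (semi)norms end up encoding the same data.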

Even the finite-dimensional version of the hyperplane separation theorem is HUGELY important. It can be used to prove the Farkas lemma, whose Wikipedia page boasts:

Farkas' lemma is the key result underpinning the linear programming duality and has played a central role in the development of mathematical optimization (alternatively, mathematical programming). It is used amongst other things in the proof of the Karush–Kuhn–Tucker theorem in nonlinear programming. Remarkably, in the area of the foundations of quantum theory, the lemma also underlies the complete set of Bell inequalities in the form of necessary and sufficient conditions for the existence of a local hidden-variable theory, given data from any specific set of measurements.

In practical applications like the aforementioned linear programming, convex shapes show up most often as intersections of half spaces. In fact, the hyperplane separation theorem tells us that closed convex sets can always be written as intersections of (closed) half spaces.
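
As a small illustration of the duality that Farkas' lemma underpins, here is a Python sketch (using scipy.optimize.linprog; the toy LP itself is made up purely for illustration) checking that the primal and dual optimal values agree:

```python
import numpy as np
from scipy.optimize import linprog

# Toy primal LP:  min c^T x   s.t.  A x <= b,  x >= 0.
c = np.array([-1.0, -2.0])
A = np.array([[1.0, 1.0],
              [1.0, 3.0]])
b = np.array([4.0, 6.0])
primal = linprog(c, A_ub=A, b_ub=b)   # linprog's default bounds are x >= 0

# Its dual, rewritten in nonnegative variables:  min b^T y  s.t.  -A^T y <= c,  y >= 0;
# strong duality says this optimal value is the negative of the primal optimum.
dual = linprog(b, A_ub=-A.T, b_ub=c)

print(primal.fun, -dual.fun)          # both -5.0: strong LP duality
```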

Also, the hyperplane separation theorem proves the "key fact" from Jensen's inequality above, which nicely ties together what I have written so far. In general, studying how the graphs of convex functions lie on supporting (tangent) lines or intersect with (secant) lines leads to some important results in basic real analysis (e.g. continuity of convex functions, existence/monotonicity of left/right derivatives, etc.).




Part 2: Interpretations of Convex Combinations (a more restrictive class of linear combinations; weighted averages)

Another way one can think of convex sets is that they are generalizations of linear subspaces: linear subspaces are closed under linear combinations (i.e. $x_1,\ldots, x_n \in S \implies \sum a_i x_i \in S$ for all $a_i \in \mathbb R$), while convex sets are closed under convex combinations (i.e. $x_1,\ldots, x_n \in S$, $a_i \geq 0$, and $\sum a_i = 1 \implies \sum a_i x_i \in S$). Minkowski's theorem is an excellent demonstration of this phenomenon. It will help to watch this beautiful video from YT channel Euler's Basement for my ensuing discussion.

  • Using area considerations, one can show that any (measurable) region $R$ in $\mathbb R^2$ with area $>1$ contains some $x,y$ s.t. $x-y\in \mathbb Z^2$; i.e. one can translate $R$ so that both the origin and some lattice point in $\mathbb Z^2$ lie in $R$.

  • What if we want a lattice point in $R$ without needing to translate? Well, of course if $x-y \in R$ we'd be done by the above. In other words, we are wishing for $R$ to be closed under the particular linear combination $x,y\in R \implies x-y\in R$. This condition is equivalent to (non-empty) $R$ being a subgroup of $(\mathbb R^2, +)$. Unfortunately, this is too strong:

    • Trivia (not checked carefully): one can prove that this property implies e.g. that $R$ is unbounded, and [non-empty interior of $R \cap L$ in the subspace topology of some linear subspace $L \implies$ contains open ball in $L$ centered at origin $\implies$ contains all of $L$]. Exc. 14 in these notes tells us that the only closed such subgroups are discrete sets (necessarily a lattice I think???), 1-dimensional linear subspaces (i.e. lines through the origin), equally spaced translates of such lines (through $\mathbf 0$), grids of lines (hitting $\mathbf 0$), and $\mathbb R^2$ itself. Thus non-closed subgroups $R$ can only be dense subsets of these aforementioned closed possibilities, where furthermore in the subspace topology of any affine linear subspace $L$, the intersection $R \cap L$ has empty interior.
  • The idea is that we can think of $x-y$ as the convex combination of $2x$ and $-2y$: $x-y = \frac{2x+(-2y)}2$! To make this argument work, it essentially suffices to furthermore suppose $R$ is symmetric ($y\in R\implies -y \in R$) and convex: then $x,y\in R \implies 2x, 2y \in 2R$; by symmetry $-2y \in 2R$ and by convexity $x-y = \frac{2x-2y}{2} \in 2R$. So we have proved that any set of the form $2R$ for convex symmetric $R$ with area $>1$ must contain a non-zero lattice point --- or equivalently, every convex symmetric $R$ with area $>4$ must contain a non-zero lattice point (a small sanity check follows below this list).
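
As promised, a quick brute-force sanity check of this statement in Python (the test ellipse and the search window are arbitrary choices of mine):

```python
import itertools
import math

# Minkowski: a convex, origin-symmetric region of R^2 with area > 4 contains
# a non-zero lattice point. Test region: the ellipse x^2/a^2 + y^2/b^2 <= 1,
# which is convex and symmetric, with area pi*a*b.
def nonzero_lattice_points(a, b, window=10):
    return [(x, y)
            for x, y in itertools.product(range(-window, window + 1), repeat=2)
            if (x, y) != (0, 0) and x * x / a**2 + y * y / b**2 <= 1]

a, b = 3.0, 0.45                 # long thin ellipse: area = pi*3*0.45 ≈ 4.24 > 4
assert math.pi * a * b > 4
print(nonzero_lattice_points(a, b))   # non-empty, as the theorem guarantees
```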

Minkowski's theorem is surprisingly useful, for example in algebraic number theory. It is the starting point of the study of the "geometry of numbers".

So perhaps one intuition to keep in mind is that if one can prove something for linear spaces/combinations, one may be able to prove something similar for convex spaces/combinations. Indeed, as the Wikipedia page of the Hahn-Banach theorem discussed above notes, one can in fact preserve all convex upper bounds, not just sublinear upper bounds (as stated above)!

Convex combinations can also be thought of as weighted averages. So convex sets $K$ get mapped to themselves under "averaging maps". A general philosophy is that averaging maps are very useful to cook up fixed points (e.g. the Weyl averaging trick in representation theory to cook up a $G$-invariant inner product); and indeed using an averaging map trick on iterates of a continuous affine map $T: K\to K$ (for compact convex $K$) leads to a fixed point of $T$; and we can easily extend this to a common fixed point of a whole family of continuous affine $\{T_a\}_{a\in A}$, as long as we assume commutativity of the family --- this is the Markov-Kakutani fixed point theorem!
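
The core of that averaging trick fits in two lines (a sketch for a single continuous affine $T: K\to K$, with $K$ compact convex): the Cesàro averages $$A_n x := \frac 1n \sum_{k=0}^{n-1} T^k x$$ lie in $K$ by convexity, and since $T$ is affine, $$T(A_n x) - A_n x = \frac{T^n x - x}{n} \longrightarrow 0,$$ because the numerator stays in the (bounded) set $K - K$. By compactness some subnet of $(A_n x)$ converges, and its limit is then a fixed point of $T$.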




Part 3: Convex Hulls of "Extreme" Points, and Simplices

Another very common/useful way convex sets appear is in the form of convex hulls of some collection of points in $\mathbb R^d$ (or more general (LC-)TVS's), i.e. the set of all (finite) convex combinations of such points. One general instance of this phenomenon is the Krein-Milman theorem in functional analysis. It basically tells us that we can understand (compact) convex sets (in LC-TVS's) by just understanding their "extreme points" (in linear programming discussed above, the fact that the extreme points of the convex polytope are optimal is essentially this principle). See this excerpt of a book for a proof (a main tool is "geometric Hahn-Banach" discussed above!), and applications (e.g. the Stone-Weierstrass theorem); I also remember it briefly appears in Folland's Abstract Harmonic Analysis. See also https://mathoverflow.net/questions/97433/elementary-applications-of-krein-milman.

In finite dimensions $\mathbb R^d$, given $d+1$ "extreme points" (in general position), we can form their convex hull to create a "$d$-dimensional" space --- this is called a $d$-simplex. One really nice feature of this construction is that not only do we have simple "linear" (I emphasize this because linear things are often the easiest to deal with) descriptions of the interior of the convex hull (compare to the general case, where one needs to prove the extremely non-trivial Jordan-Brouwer separation theorem to even be able to say that some "hull" has an inside and an outside), we also have an extremely simple "linear" description of the boundary: namely convex hulls of $d$ out of the $d+1$ points!
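
To make the "linear" description of a simplex concrete, here is a short Python sketch (the helper name and example triangle are mine) that computes the convex-combination coefficients --- the barycentric coordinates --- of a point by solving a single linear system; the point lies in the interior exactly when all coordinates are strictly positive:

```python
import numpy as np

def barycentric_coords(vertices, p):
    """Coordinates of p w.r.t. a d-simplex given by d+1 vertices (rows).

    Solves  sum_i a_i v_i = p  together with  sum_i a_i = 1, a single
    (d+1) x (d+1) linear system, uniquely solvable exactly when the
    vertices are in general position (the v_i - v_0 linearly independent).
    """
    V = np.vstack([np.asarray(vertices, dtype=float).T,   # d rows: sum a_i v_i = p
                   np.ones(len(vertices))])               # 1 row:  sum a_i   = 1
    return np.linalg.solve(V, np.append(np.asarray(p, dtype=float), 1.0))

# 2-simplex (triangle) in R^2; the centroid has coordinates (1/3, 1/3, 1/3).
tri = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
print(barycentric_coords(tri, (1/3, 1/3)))   # -> [0.333..., 0.333..., 0.333...]
# All coordinates > 0  <=>  the point lies in the interior of the simplex.
```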

A particular nicety of the "linear" boundary and the ease of talking about inside/outside, is that we can talk about inward/outward pointing normal vectors to boundary "faces" (just using linear algebra, since the faces are subsets of affine subspaces), and hence we can define orientation. With these orientations, we can talk about gluing of such simplices, and that can lead to a theory of simplicial homology. Studying mappings of such simplices to more general topological spaces, one can define orientability for those spaces (like manifolds), and also "boost" simplicial homology to singular homology. In summary, the particularly nice/simple descriptions of simplices (stemming from their convex/linear nature) form the backbone/foundation of algebraic topology!

Another important application (to basic algebraic topology/homotopy theory) of the simple "linear" description of the interior of convex sets is that every point is path-connected to every other point by the straight line segment between them (so in particular convex sets are contractible and simply connected); more generally, a convex set contains the convex hull of any collection of its points (in particular the interior of a triangle whose vertices are in the set). I personally have seen these applications of convexity in complex analysis, e.g. these notes of Tao (just Ctrl-F "convex").
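
Explicitly, for a convex set $C$ and any basepoint $x_0 \in C$, the straight-line homotopy $$H(x,t) := (1-t)\,x + t\,x_0$$ satisfies $H(\cdot, 0) = \operatorname{id}_C$ and $H(\cdot, 1) \equiv x_0$, stays inside $C$ by convexity, and is jointly continuous; so $C$ is contractible (hence simply connected).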




Part 4: the Triptych of "Classical Theorems" in Finite-Dimensional Convex Geometry (Caratheodory, Radon, Helly) and Combinatorics

One extra thing to notice about $d$-simplices is that every point in a $d$-simplex can be written uniquely as a convex combination of the extremal vertices. This is due to the fact that for $d+1$ points in general position (by definition), the vectors $x_2-x_1, \ldots, x_{d+1}-x_1$ are linearly independent. If we add any other point $x_{d+2}$ (indexing set denoted $I$), then these "difference vectors" become linearly dependent: there are $\beta_i \in \mathbb R$, not all zero, s.t. $\sum_{i\in I\setminus \{1\}} \beta_i (x_i - x_1) = 0$. Defining $\beta_1:= -\sum_{i\in I\setminus \{1\}} \beta_i$, we get that $\sum_{i\in I} \beta_i x_i = 0$ and $\sum_{i\in I} \beta_i = 0$. We can use these to produce continuum many ways of writing the same point as a convex combination: if $x = \sum_{i\in I} \alpha_i x_i$ is a convex combination with all $\alpha_i > 0$ (WLOG, after dropping zero terms), then for all $\lambda \in \mathbb R$, $$x = \sum_{i\in I} \alpha_i x_i = \sum_{i\in I} (\alpha_i - \lambda \beta_i) x_i,$$ with $\sum_{i\in I} (\alpha_i-\lambda \beta_i)$ remaining $1$. For all $\lambda$ small enough, we can furthermore guarantee all $\alpha_i':= \alpha_i - \lambda \beta_i\geq 0$; in fact, taking $\lambda := \min\{\alpha_i/\beta_i : \beta_i > 0\}$ (some $\beta_i$ is positive, since the $\beta_i$ sum to $0$ and are not all zero), at least one $\alpha_i'$ is furthermore $=0$ while the rest stay $\geq 0$. This reduces the number of terms in the convex combination, so by induction, we can always reduce any finite convex combination in $\mathbb R^d$ to one with at most $d+1$ terms. This is the famous Caratheodory theorem; proof taken from pg. 2 of these Cornell notes.
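
The proof above is effectively an algorithm, so here is a direct Python sketch of it (function name and example data are mine; exact arithmetic is replaced by floating point with a tolerance): it repeatedly finds an affine dependence among the points and zeroes out one weight, until at most $d+1$ points remain.

```python
import numpy as np

def caratheodory_reduce(points, weights, tol=1e-12):
    """Reduce a convex combination in R^d to one using at most d+1 points.

    points: (n, d) array; weights: length n, nonnegative, summing to 1.
    Follows the proof: while n > d+1, find beta with sum beta_i = 0 and
    sum beta_i x_i = 0, then shift weights by lambda*beta to kill one.
    """
    P = np.asarray(points, dtype=float)
    w = np.asarray(weights, dtype=float)
    P, w = P[w > tol], w[w > tol]           # WLOG all weights positive
    d = P.shape[1]
    while len(w) > d + 1:
        # Affine dependence = null vector of the (d+1) x n "lift" matrix [x_i; 1].
        A = np.vstack([P.T, np.ones(len(w))])
        beta = np.linalg.svd(A)[2][-1]      # n > d+1 guarantees a null vector
        pos = beta > tol                    # sum beta_i = 0 forces positive entries
        lam = np.min(w[pos] / beta[pos])    # largest shift keeping all weights >= 0
        w = w - lam * beta                  # same point, same total weight 1
        P, w = P[w > tol], w[w > tol]       # at least one weight has hit 0
    return P, w

# Example: a point written with 6 points in R^2, reduced to at most 3 of them.
pts = np.array([[0, 0], [2, 0], [0, 2], [1, 1], [2, 2], [1, 0]], dtype=float)
wts = np.full(6, 1 / 6)
target = wts @ pts
P, w = caratheodory_reduce(pts, wts)
print(len(w), np.allclose(w @ P, target), np.isclose(w.sum(), 1.0))  # e.g. 3 True True
```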


So linear dependence gives us continuum many ways of rewriting a convex combination. What happens when we compare 2 different convex combinations? Well, if we have $$\sum_{i\in I} \beta_i x_i = \sum_{i\in I} \alpha_i x_i$$ and not all $\gamma_i:=\beta_i - \alpha_i$ equal $0$, then we have $$\sum_{i \in I} \gamma_i x_i = 0.$$ Some $\gamma_i$ are necessarily going to be positive and some negative (since the $\beta_i$ and $\alpha_i$ are all $\geq 0$, both families sum to $1$, and they are not identically equal), and we can move all the negatives to the RHS to get $$\sum_{i\in \Gamma^+} \gamma_i x_i = \sum_{i \in \Gamma^-} |\gamma_i| x_i$$ where $\Gamma^+ := \{ i\in I: \gamma_i>0\}$ and $\Gamma^- := \{ i\in I: \gamma_i<0\}$ (non-empty subsets of $I$). Because $\sum_{i\in I} \gamma_i = \sum_{i\in I} \beta_i - \sum_{i\in I} \alpha_i = 1-1=0$, we see that $\sum_{i\in \Gamma^+} \gamma_i = \sum_{i \in \Gamma^-} |\gamma_i|=: \gamma>0$, so dividing by $\gamma$ we get $$\frac 1{\gamma} \sum_{i\in \Gamma^+} \gamma_i x_i = \frac1 \gamma \sum_{i \in \Gamma^-} |\gamma_i| x_i$$ where the LHS is a convex combination of $\{x_i: i \in \Gamma^+\}$, and the RHS of $\{x_i: i \in \Gamma^-\}$. In other words, we have partitioned our collection of points into 2 (disjoint) subsets whose convex hulls intersect. This is the famous Radon theorem; proof taken from this thesis. The Cornell notes present essentially the same proof, but phrased in terms of an "extra coordinate set to 1" trick.
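
In the same spirit, here is a Python sketch of Radon's theorem for $d+2$ points (helper name mine), which reads the partition directly off the affine dependence, exactly as in the proof:

```python
import numpy as np

def radon_partition(points, tol=1e-12):
    """Split d+2 points in R^d into two sets whose convex hulls intersect.

    Finds gamma with sum gamma_i = 0 and sum gamma_i x_i = 0 (an affine
    dependence); the positive and negative supports give the two parts, and
    (1/gamma) * sum_{gamma_i > 0} gamma_i x_i lies in both convex hulls.
    """
    P = np.asarray(points, dtype=float)
    A = np.vstack([P.T, np.ones(len(P))])   # the "extra coordinate set to 1" trick
    g = np.linalg.svd(A)[2][-1]             # null vector: the affine dependence
    pos, neg = g > tol, g < -tol
    common = (g[pos] @ P[pos]) / g[pos].sum()   # point in both convex hulls
    return P[pos], P[neg], common

# Four points in R^2: any such quadruple admits a Radon partition.
part1, part2, x = radon_partition([(0, 0), (2, 0), (0, 2), (1, 1)])
print(part1, part2, x)   # e.g. {(2,0),(0,2)} vs {(1,1)}, hulls meeting at (1,1)
```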


This ~100-page AMS survey discusses the last of the 3 "classical convex geometry" theorems, Helly's theorem, by framing it (or rather the special case for half spaces which does imply the full version) as the "dual" of Caratheodory's theorem: see pg. 24, "polar" defined on pg. 7. The above linked Cornell notes prove Helly's theorem directly from Radon's theorem (1 paragraph on pg. 2). And the above linked thesis makes it clear that all 3 of these "classical" theorems are equivalent to each other in some sense.

Finally, I'd like to discuss some results similar to these 3 classical ones, but branching out into further territory. The Cornell notes also prove a generalization of Radon's theorem, called Tverberg's theorem. The proof (see pg. 3-5 of the Cornell notes) goes via a "colorful Caratheodory" theorem of Barany, and then a super-genius application (this blog post says this proof reduced the time needed to present it from 6-7 lectures to 1!) of the "extra coordinate set to 1" trick (mentioned above) and the "Sarkaria tensor trick". There's an entire 34-page survey of Tverberg's theorem in the AMS, beginning with

Tverberg’s theorem has been a cornerstone of combinatorial convexity for over 50 years. Its impact and influence is only comparable to that of the famous and classic theorems of Caratheodory and Helly. This gem lies at the crossroads of combinatorics, topology, and linear algebra, and it continues to yield challenging and interesting open problems [link to Gil Kalai 2019 blogpost on recent research!].


Not only are there "colorful" variants of the 3 classical theorems, but "fractional" variants as well. Again, the most thorough overview is the above linked ~100-page AMS survey. I myself first encountered the (fractional) Helly property in a model theory class; see these notes.

There are also connections to algebraic topology. The Wiki page for Radon's theorem says that Radon's theorem is essentially a special case of a much more general topological fact, called the Borsuk-Ulam theorem (though in my eyes, the Borsuk-Ulam theorem is only "natural"/"expected" in the context of algebraic topology (AT); once one develops some facts about cohomology for real projective spaces, which are natural/common spaces in AT, one can "pullback" these facts to facts about spheres --- see Wiki). The theorem does require a significant amount of work no matter how one tries to go about it; the most "elementary" proof I could find is here (though I'm not 100% sure there aren't any gaps). Interestingly, the Borsuk-Ulam theorem is one of those "Rosetta stone" theorems with equivalent versions in different fields --- see Wiki, and https://mathoverflow.net/questions/362025/sperners-lemma-implies-tuckers-lemma-simple-combinatorial-proof.

The Borsuk-Ulam theorem moreover has some cute applications (the ham-sandwich theorem, necklace division, etc.); see these notes and/or these notes.




Summary/Linear Narrative through Above Points:

Let me try to present what I think to be the most natural linear path through these ideas (a little bit changed from above, since I got a better understanding of things while writing). We start the story when we encounter in (functional) analysis some very natural/common spaces of functions that are not normable, but rather have a countable collection of seminorms. Trying to study this topology without referring to the seminorms leads us to a characterization in terms of convex sets, using Minkowski functionals. Applying the foundational Hahn-Banach theorem to these functionals, we get the hyperplane separation theorems, which in particular tell us that FACT: closed convex sets are intersections of the half spaces containing them. I pointed out some applications of this FACT above (Jensen's inequality, basic facts in real analysis, linear programming, etc.).

We can then ask about the "tightest" possible hyperplanes. Picture passing a plane through a (compact) convex body $K$ in $\mathbb R^3$ (this corresponds to looking at the level sets $f^{-1}(t)$ for a varying "time" parameter $t$, where $f: \mathbb R^3 \to \mathbb R$ is a linear functional). Then intuitively (rigorously, using compactness) there is some time $T$ s.t. at time $T$ the hyperplane still touches the convex body, but for $t>T$ it doesn't (corresponding to $f$ being maximized on the compact set $K$, with maximum value $T$). One can show (this is the Krein-Milman lemma) that the maximum is attained on an extreme subset $M \subseteq K$. Iterating transfinitely/using Zorn's lemma, we can narrow down to an extreme point $x_{f,T}$ at which $f$ attains the maximum value $T$.

From this, the Krein-Milman theorem is intuitive: any half space $[f\leq t]:=\{x: f(x)\leq t\}$ (for a linear functional $f$) containing the set of extreme points must contain the extreme point $x_{f,T}$ (using the notation established above), which implies $T = f(x_{f,T}) \leq t$. But by definition $T$ is the maximum of $f$ on $K$, so indeed $[f\leq t]\supseteq [f\leq T]$ contains all of $K$. We have shown that every half space containing the extreme points $\mathcal E(K)$ contains $K$ as well, hence (using the FACT above) the closed convex hull of $\mathcal E(K)$ contains $K$ (the opposite containment is trivial).

Remark: this isn't the approach taken in the above linked book excerpt; nor in this mathonline link --- the latter seems short, but skips a step. It claims "Since $z$ is an extreme point of $M$ and $M\subseteq K$ we have that $z$ is an extreme point of $K$" without proof. This is addressed in Prop. 8.6 in the book excerpt. I have avoided this in my proof, so I'm a bit worried it's not completely sound, but even if there is some technical error I overlooked, I think the intuition is still very good.

Alright! That's a really nice result. But so far we've been focused on the "big picture" of convex sets, thinking about them from the perspective of half-spaces/linear functionals. At best we've focused on just a couple of points on the boundary. Let's now try to engage with the points inside the set. We can write those in terms of convex combinations (of e.g. extreme points). In Part 2 above, I gave the intuitions that the formal/algebraic essence of convex combinations can be thought of as (1) weighted averages (I demonstrated this intuition using the Markov-Kakutani FPT) or (2) a more restrictive version of general linear combinations which gives us a generalization of linear subspaces (I demonstrated this intuition using Minkowski's theorem).

The simplest case of weighted averages is to just consider 2 points: parameterizing the weighted averages provides exactly a formula for the line segment going from one of the points to the other. The fact that we can always travel between points in such a simple (for instance linear, in particular continuous) manner makes the homotopy theory for convex sets trivial, a fact that can be utilized in areas such as complex analysis.

In Part 3, I considered the weighted averages/convex combinations/convex hull of $d+1$ points (in general position) in $\mathbb R^d$ to create a "$d$-dimensional object/volume". The number $d+1$ is the minimum number of points one can use to get a $d$-dimensional convex hull in $\mathbb R^d$, so these $d$-simplices are the simplest $d$-dimensional objects we can construct using convex hulls. Astoundingly, this simplicity/"barebones structure" leads to 2 amazing things:

  • The boundary is given by the union of the $d+1$ faces formed by taking convex hulls of the $d$-sized subsets of the $d+1$ points. As I said before, this "recursive" structure can be leveraged to build the foundation of algebraic topology.
  • Every point in the convex hull of the $d+1$ points can be written uniquely as a convex combination of the $d+1$ points.

Uniqueness of the representation as a convex combination (coming from linear independence of $x_1-x_{d+1}, \ldots, x_d-x_{d+1}$ in $\mathbb R^d$, i.e. "general position") is interesting, but so is non-uniqueness:

  • I explained (in detail in Part 4) that if we have 2 different ways of writing an element $x$ as a convex combination of points in 2 sets $A,B$ ($A,B$ can overlap, or even be the same set), then we can partition $A \cup B$ into 2 (disjoint) sets whose convex hulls intersect. This is the key idea behind Radon's theorem.
  • Since linear independence gives rise to uniqueness (if and only if, in fact), linear dependence gives rise to non-uniqueness (this is one way to cook up the different ways of writing the same element, as needed in the previous bullet). Exploiting linear dependence lets us reduce the number of parts in a convex combination, leading to Caratheodory's theorem. Thus general convex hulls in $\mathbb R^d$ can be "understood in terms" (more explicitly, can be written as a union) of the simplest $d$-dimensional convex hulls.

Pursuing these last results about discrete combinatorial geometry further (duality, continuous analogues, colorful/fractional versions, etc.) takes us to surprisingly great heights, from connections to algebraic topology to model theory and elsewhere yonder.
