Compute the dual of an optimization problem defined on a function space

machine-learning, optimization, statistics

I am interested in one result in the first version of the paper titled "On the Margin Theory of Feedforward Neural Networks" by Colin Wei, Jason D. Lee, Qiang Liu and Tengyu Ma.

In Equation B.1 of the appendix (page 17), the authors introduce the following optimization problem (which relates to the maximum margin of an $\ell_1$-SVM classifier):
$$\begin{align}
\min_{\alpha\in\mathcal L^1(\mathbb S^{d-1})} &\|\alpha\|_1 = \int_{u\in\mathbb S^{d-1}} |\alpha(u)|\,\mathrm du \\
\text{subject to }\, &y_i\langle \alpha,\varphi(x_i)\rangle \ge 1 \;\; \forall i \in [\![ 1;n ]\!] \end{align} \tag1$$

Here, $\mathbb S^{d-1}$ is the unit sphere in $\mathbb R^d$, and $\mathcal L^1(\mathbb S^{d-1})$ is the set of real-valued functions on $\mathbb S^{d-1}$ whose absolute value is Lebesgue integrable.
For each $i \in [\![ 1;n ]\!]$, the pair $(x_i,y_i)\in\mathbb R^d \times \{-1,1\}$ is a datapoint together with its label, while $\varphi$ is a "lifting function" mapping any datapoint $x_i$ to the infinite-dimensional space $\mathcal L^\infty(\mathbb S^{d-1})$, so that $\varphi(x_i)\in\mathcal L^\infty(\mathbb S^{d-1})$.
Lastly, $\langle\cdot,\cdot\rangle$ denotes the inner product: $\langle f,g\rangle =\int_{u\in\mathbb S^{d-1}} f(u)g(u)\,\mathrm du$.
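For concreteness, here is a small numerical sketch of these objects (my own illustration: $d=2$, $\phi=\mathrm{ReLU}$, and a uniform grid on the circle are all assumptions of mine, not fixed by the paper). It represents $\alpha$ and $\varphi(x)$ by their values on a grid over $\mathbb S^1$ and approximates the inner product by a Riemann sum:

```python
import numpy as np

# Illustration only: assume d = 2 and phi = ReLU (not fixed by the paper).
def phi(t):
    return np.maximum(t, 0.0)

# Discretize S^1 into m directions u_j; du is the quadrature weight.
m = 1000
theta = np.linspace(0.0, 2.0 * np.pi, m, endpoint=False)
U = np.stack([np.cos(theta), np.sin(theta)], axis=1)   # (m, 2), rows on S^1
du = 2.0 * np.pi / m

def lift(x):
    """The lifting varphi(x)[u] = phi(u^T x), sampled on the grid."""
    return phi(U @ x)                                  # shape (m,)

def inner(alpha_vals, x):
    """<alpha, varphi(x)> = int alpha(u) phi(u^T x) du, as a Riemann sum."""
    return np.sum(alpha_vals * lift(x)) * du

alpha_vals = np.cos(theta)            # some alpha in L^1(S^1), on the grid
print(inner(alpha_vals, np.array([1.0, 0.5])))
```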

In Claim B.3 of the paper (page 17), the authors state that the dual of problem $(1)$ has the following form:
$$
\begin{align}
\max_{\lambda\in\mathbb R^n} &\sum_{i=1}^n \lambda_i \\
\text{subject to } &\left\lvert\sum_{i=1}^n \lambda_i y_i \phi(u^Tx_i)\right\rvert\le 1\; \forall u\in\mathbb S^{d-1}\\
&\lambda_i \ge 0 \; \forall i \in [\![ 1;n ]\!]
\end{align} \tag2
$$

where $\phi$ is such that $\forall u \in \mathbb S^{d-1}, \forall x \in \mathbb R^d$, $\varphi(x)[u] = \phi(u^Tx)$.

The paper, however, does not give the details of how to prove that $(2)$ is the dual of $(1)$, and I fail to prove it myself, being quite unfamiliar with Lagrangian optimization on function spaces. My question is thus: How to prove that $\mathbf{(2)}$ is the dual of $\mathbf{(1)}$?

I attempted the proof by proceeding as in the more familiar case of a subset of $\mathbb R^d$, i.e. by minimizing the Lagrangian, which according to my calculations has the following expression:
$$L(\alpha,\lambda) = \|\alpha\|_1 + \sum_{i=1}^n\lambda_i\left(1-y_i\left\langle\alpha,\varphi(x_i)\right\rangle\right)$$
However, I can't get much further than this, as I don't know how to find the minimum of $L(\alpha,\lambda)$ over $\alpha\in\mathcal L^1(\mathbb S^{d-1})$.
Even though the paper suggests taking $L^*(\lambda) = \sum_i \lambda_i$, I don't know how to prove that this is indeed the infimum, and I have no idea where the second constraint with the absolute value in $(2)$ comes from. I'm afraid that my approach is totally wrong.
Any help with this problem will be much appreciated.

Best Answer

So, you want to find $L^*(\lambda) = \inf_\alpha L(\alpha,\lambda)$.

short version: There are two relevant cases. First, if $$ \left\lvert\sum_{i=1}^n \lambda_i y_i \phi(u^Tx_i)\right\rvert> 1 \quad \forall u\in A $$ holds for some measurable set $A\subset\mathbb S^{d-1}$ of nonzero measure, then one can show that $L^*(\lambda)=-\infty$.

In the second case, where $$ \left\lvert\sum_{i=1}^n \lambda_i y_i \phi(u^Tx_i)\right\rvert\le 1 \quad \forall u\in \mathbb S^{d-1} $$ holds, one can show that $\alpha=0$ is a minimizer of $L(\cdot,\lambda)$.

elaborations: We define the function $w(u):=\sum_{i=1}^n \lambda_i y_i \phi(u^Tx_i)$. Then one can show that $$ L(\alpha,\lambda) =\sum_{i=1}^n \lambda_i + \|\alpha\|_1 - \langle \alpha ,w\rangle $$ holds. In the first case, we can assume without loss of generality that $w\geq 1 + \delta$ on a measurable set $B$ with nonzero measure, for some $\delta>0$ (the case $w<-1$ works similarly). We choose $\alpha_k\in \mathcal L^1$ such that $\alpha_k(u)=k$ on $B$ and $0$ otherwise. Then we have $$ L(\alpha_k,\lambda) =\sum_{i=1}^n \lambda_i + \int_{\Bbb S^{d-1}} |\alpha_k(u)| - \alpha_k(u) w(u)\,\mathrm du =\sum_{i=1}^n \lambda_i + \int_B |\alpha_k(u)| - \alpha_k(u) w(u)\,\mathrm du \leq \sum_{i=1}^n \lambda_i + \int_B k - k (1+\delta)\,\mathrm du = \sum_{i=1}^n \lambda_i - k\delta\,\mu(B), $$ where $\mu(B)>0$ is the measure of $B$. This converges to $-\infty$ as $k\to\infty$. Thus we have $L^*(\lambda)=-\infty$, which means that these $\lambda$ are not feasible in the dual problem.
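To see this divergence numerically, here is a small sketch (entirely my own toy instance: a discretized circle, $\phi=\mathrm{ReLU}$, and a $\lambda$ deliberately chosen so that $|w|>1$ on a set of nonzero measure):

```python
import numpy as np

def phi(t):
    return np.maximum(t, 0.0)

# Toy instance (my own): discretized circle, two datapoints.
m = 1000
theta = np.linspace(0.0, 2.0 * np.pi, m, endpoint=False)
U = np.stack([np.cos(theta), np.sin(theta)], axis=1)
du = 2.0 * np.pi / m

X = np.array([[2.0, 0.0], [0.0, 2.0]])
y = np.array([1.0, 1.0])
lam = np.array([1.0, 1.0])        # deliberately makes |w(u)| > 1 somewhere

w = (lam * y * phi(U @ X.T)).sum(axis=1)   # w(u_j) = sum_i lam_i y_i phi(u_j . x_i)
B = w > 1.0                                # violating set (nonzero measure here)
print("measure of B:", B.sum() * du)

# alpha_k = k on B, 0 elsewhere: L(alpha_k, lam) decreases without bound in k.
for k in [1, 10, 100, 1000]:
    alpha = np.where(B, float(k), 0.0)
    L = lam.sum() + np.sum(np.abs(alpha) - alpha * w) * du
    print(k, L)
```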

In the second case, we have $|w(u)|\leq 1$ for all $u$, and therefore $|\alpha(u)|-\alpha(u) w(u)\geq 0$ for all $u$. This implies $$ L(\alpha,\lambda) =\sum_{i=1}^n \lambda_i + \int_{\Bbb S^{d-1}} |\alpha(u)| - \alpha(u) w(u)\,\mathrm du \geq \sum_{i=1}^n \lambda_i = L(0,\lambda), $$ and therefore $L^*(\lambda)= \sum_{i=1}^n \lambda_i$. Maximizing $L^*(\lambda)$ over the feasible $\lambda$, i.e. over those satisfying the absolute-value constraint and $\lambda\ge 0$, then yields exactly problem $(2)$.
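As an independent sanity check (my own construction, not from the paper), one can discretize the sphere and confirm with an LP solver that the optimal values of $(1)$ and $(2)$ coincide. Here with `scipy.optimize.linprog`, $\phi=\mathrm{ReLU}$, $d=2$, and a made-up dataset:

```python
import numpy as np
from scipy.optimize import linprog

def phi(t):
    return np.maximum(t, 0.0)

# Toy dataset (my own): d = 2, a few labelled points.
X = np.array([[1.0, 0.5], [0.8, -0.3], [-1.0, 0.2], [-0.6, -0.9]])
y = np.array([1.0, 1.0, -1.0, -1.0])
n = len(y)

# Discretize S^1.
m = 200
theta = np.linspace(0.0, 2.0 * np.pi, m, endpoint=False)
U = np.stack([np.cos(theta), np.sin(theta)], axis=1)
du = 2.0 * np.pi / m
Phi = phi(U @ X.T)                       # Phi[j, i] = phi(u_j . x_i)

# Primal (1): min sum_j |alpha_j| du  s.t.  y_i <alpha, varphi(x_i)> >= 1.
# Split alpha = a_plus - a_minus (both >= 0) to get an LP.
c = np.full(2 * m, du)
G = y[:, None] * Phi.T * du              # G[i, j] = y_i phi(u_j . x_i) du
primal = linprog(c, A_ub=np.hstack([-G, G]), b_ub=-np.ones(n),
                 bounds=(0, None))

# Dual (2): max sum_i lam_i  s.t.  |sum_i lam_i y_i phi(u_j . x_i)| <= 1, lam >= 0.
W = Phi * y[None, :]                     # W[j, i] = y_i phi(u_j . x_i)
dual = linprog(-np.ones(n), A_ub=np.vstack([W, -W]), b_ub=np.ones(2 * m),
               bounds=(0, None))

print("primal value:", primal.fun)       # these two agree ...
print("dual value:  ", -dual.fun)        # ... by strong LP duality
```

On this instance the two printed values agree up to solver tolerance, as strong LP duality demands; the discretization is of course only a finite-dimensional surrogate for the function-space problem above.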

remarks regarding context: Claim B.3 in the paper is used to show Lemma B.1. However, I think Lemma B.1 is false (if I am understanding the notation correctly): if $\operatorname{supp}(\alpha^*)$ is finite, then $\alpha^*$ is nonzero only on finitely many points, and $\langle \alpha^*,\varphi(x_i)\rangle=0$ would hold for all $i$ (since $\langle\cdot,\cdot\rangle$ represents an integral). Thus, the optimal value would always be $0$ and $\alpha^*=0$ would always be the best possible solution to (3.6). However, this should not be true for all data $x_i,y_i,\varphi$.
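Spelling out that step as a worked equation (my paraphrase, in the notation of the question): if $\operatorname{supp}(\alpha^*)=\{u_1,\dots,u_k\}$ is finite, it has Lebesgue measure zero on $\mathbb S^{d-1}$, hence
$$\langle \alpha^*,\varphi(x_i)\rangle = \int_{\mathbb S^{d-1}} \alpha^*(u)\,\phi(u^Tx_i)\,\mathrm du = \int_{\{u_1,\dots,u_k\}} \alpha^*(u)\,\phi(u^Tx_i)\,\mathrm du = 0.$$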
