This is an instance of a bilevel program: you can view it as an optimization problem in which one of the constraints is that a variable is optimal in some other optimization problem. It can be represented as a MILP via the KKT conditions.
Here is a test using the MATLAB toolbox YALMIP:
https://yalmip.github.io/tutorial/bilevelprogramming/
% Random data for a small LP: max c'*x s.t. A*x <= b, 0 <= x <= 1
A = randn(10,7);
b = 10*rand(10,1);
c = randn(7,1);

% Solve the LP once to record its optimal value z
x = sdpvar(7,1);
optimize([0 <= x <= 1, A*x <= b], -c'*x);
z = value(c'*x);

% Outer problem: y(i) = 1 forces x(i) = 0 in the inner problem.
% Find the smallest number of variables to knock out so that the
% inner optimum drops by at least 0.1.
y = binvar(7,1);
OuterC = [c'*x <= z - 0.1];
OuterO = sum(y);
InnerC = [0 <= x <= 1, x <= 1-y, A*x <= b];
InnerO = -c'*x;
solvebilevel(OuterC, OuterO, InnerC, InnerO, x);
value(x)
value(y)
This example uses the built-in naive bilevel solver. A typically better approach is to tell it to use an external MILP solver instead, or to manually use the kkt operator to define the MILP and then call the MILP solver yourself; see the sketch after the links below.
https://yalmip.github.io/example/bilevelprogrammingalternatives/
https://yalmip.github.io/command/kkt/
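For instance, here is a minimal sketch of the manual kkt route, reusing the variables from the example above (the call pattern follows the linked pages; any additional variable bounds needed for the big-M modelling of the complementarity conditions are omitted here):
% Derive the KKT system of the inner LP, treating y as a parameter.
% Complementarity is modelled with binary variables, so the result is a MILP.
[KKTsystem, details] = kkt(InnerC, InnerO, y);
% Solve the single-level (MILP) reformulation with an installed MILP solver
optimize([KKTsystem, OuterC], OuterO);
value(x)
value(y)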
So, you want to find $L^*(\lambda) = \inf_\alpha L(\alpha,\lambda)$.
short version:
There are two relevant cases.
First, if
$$
\left\lvert\sum_{i=1}^n \lambda_i y_i \phi(u^Tx_i)\right\rvert > 1
\quad\forall u\in A\subset\mathbb S^{d-1}
$$
holds for some measurable set $A$ with nonzero measure,
then one can show that $L^*(\lambda)=-\infty$.
In the second case, where
$$
\left\lvert\sum_{i=1}^n \lambda_i y_i \phi(u^Tx_i)\right\rvert \le 1
\quad\forall u\in \mathbb S^{d-1}
$$
holds, one can show that $\alpha=0$ is a minimizer of $L(\cdot,\lambda)$.
elaborations:
We define the function $w(u):=\sum_{i=1}^n \lambda_i y_i \phi(u^Tx_i)$.
Then one can show
$$
L(\alpha,\lambda)
=\sum_{i=1}^n \lambda_i + \|\alpha\|_1 - \langle \alpha ,w\rangle
$$
holds.
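For reference, here is the rearrangement behind this identity, under the assumption (suggested by the notation, but not quoted from the paper) that $L$ is the Lagrangian of minimizing $\|\alpha\|_1$ subject to the margin constraints $y_i\langle\alpha,\varphi(x_i)\rangle\ge 1$; one just swaps the finite sum with the integral:
$$
L(\alpha,\lambda)
=\|\alpha\|_1+\sum_{i=1}^n \lambda_i\Bigl(1-y_i\int_{\Bbb S^{d-1}}\alpha(u)\,\phi(u^Tx_i)\,\mathrm du\Bigr)
=\sum_{i=1}^n \lambda_i+\|\alpha\|_1-\int_{\Bbb S^{d-1}}\alpha(u)\,w(u)\,\mathrm du,
$$
and the last integral is exactly $\langle\alpha,w\rangle$.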
In the first case, we can
assume without loss of generality that $w\geq 1 + \delta$
on a measurable set $B$ with nonzero measure, for some $\delta>0$
(for $w<-1$ the proof works similarly).
We choose $\alpha_k\in \mathcal L^1$ such that $\alpha_k(u)=k$
on $B$ and $0$ otherwise.
Then we have
$$
\begin{aligned}
L(\alpha_k,\lambda)
&=\sum_{i=1}^n \lambda_i +
\int_{\Bbb S^{d-1}} \bigl(|\alpha_k(u)| - \alpha_k(u) w(u)\bigr)\,\mathrm du \\
&=\sum_{i=1}^n \lambda_i +
\int_B \bigl(|\alpha_k(u)| - \alpha_k(u) w(u)\bigr)\,\mathrm du \\
&\leq \sum_{i=1}^n \lambda_i +
\int_B \bigl(k - k(1+\delta)\bigr)\,\mathrm du
= \sum_{i=1}^n \lambda_i - k\,\delta\,\mu(B),
\end{aligned}
$$
which converges to $-\infty$ as $k\to\infty$ (here $\mu(B)>0$ is the measure of $B$).
Thus we have $L^*(\lambda)=-\infty$, which means that
these $\lambda$ are not feasible in the dual problem.
In the second case, we have
$|w(u)|\leq1$ for all $u$,
and therefore $|\alpha(u)|-\alpha(u) w(u)\geq0$
for all $u$.
This implies
$$
L(\alpha,\lambda)
=\sum_{i=1}^n \lambda_i +
\int_{\Bbb S^{d-1}} \bigl(|\alpha(u)| - \alpha(u) w(u)\bigr)\,\mathrm du
\geq
\sum_{i=1}^n \lambda_i
= L(0,\lambda)
$$
and therefore $L^*(\lambda)= \sum_{i=1}^n \lambda_i$.
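Combining the two cases (note that only the a.e. behavior of $w$ matters under the integral, so the dichotomy is exhaustive):
$$
L^*(\lambda)=
\begin{cases}
\sum_{i=1}^n \lambda_i & \text{if } |w(u)|\le 1 \text{ for a.e. } u\in\Bbb S^{d-1},\\
-\infty & \text{otherwise,}
\end{cases}
$$
i.e. exactly the $\lambda$ with $\|w\|_{L^\infty}\le 1$ are feasible in the dual problem.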
remarks regarding context:
Claim B.3 in the paper is used to show Lemma B.1.
However, I think Lemma B.1 is false (if I am understanding the notation correctly):
If $\operatorname{supp}(\alpha^*)$ is finite, then
$\alpha^*$ would only be nonzero on finitely many points, and $\langle \alpha^*,\varphi(x_i)\rangle=0$ would hold for all $i$
(since $\langle\cdot,\cdot\rangle$ represents an integral).
Thus, the optimal value would always be $0$ and $\alpha^*=0$
would always be the best possible solution to (3.6).
However, this should not be true for all data $x_i,y_i,\varphi$.
Best Answer
The step functions all add up in the same direction, since $\phi$ has the same sign in all $y_i$. Thus $q$ is minimal for every $\phi$ for which all step functions are $0$, which occurs for $\phi\lessgtr(k-x)/l$, where the inequality is $<$ or $>$ and $x$ is the greatest or least of the $x_i$, depending on whether $l$ is negative or positive, respectively.