Why is this functional derivative equal to $0$?

frechet-derivative, machine-learning, optimization, reference-request, reproducing-kernel-hilbert-spaces

I am currently reading the paper Exponential Convergence Rates in Classification (2005) by Vladimir Koltchinskii and Oleksandra Beznosova, and I'm having trouble following the proof of the main result, which is about the convergence rate of the Empirical Risk Minimizer to the Bayes Classifier. Let me introduce the relevant notation:

We are dealing with a binary classification problem, so there is a dataset of $n$ pairs $(X_i,Y_i)\in\mathcal X\times\mathcal Y$ where $\mathcal X\subseteq \mathbb R^d$ and $\mathcal Y = \{-1,1\}$. To learn a classifier on this dataset, we optimize the penalized Empirical Risk over a RKHS $\mathcal H$ of functions defined on $\mathcal X$. In other words, given our dataset, we set our classifier $\hat f_n$ as
$$\hat f_n := \arg\min_{f\in\mathcal H}\ \underbrace{\frac 1 n\sum_{i=1}^n\ell(Y_if(X_i)) + \lambda\|f\|^2}_{\mathcal P_n(f)} \tag1$$
Where $\|\cdot\| $ is the norm induced by the RKHS inner product, $\lambda$ is a nonnegative real parameter and $\ell$ a smooth and convex loss function. So far so good.

Next, for some function $h\in\mathcal H$ (satisfying some assumptions that can be found on page 5), the authors define the function $\mathcal L_n$ as
$$\mathcal L_n : \mathbb R\ni\alpha \mapsto \frac 1 n\sum_{i=1}^n\ell[Y_i(\hat f_n(X_i)+\alpha h(X_i))] + \lambda\|\hat f_n+\alpha h\|^2 = \mathcal P_n(\hat f_n + \alpha h)$$
One can clearly see that, because $\hat f_n$ is defined as a minimizer of $\mathcal P_n$, $\mathcal L_n$ is minimal at $\alpha = 0$.
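Continuing the sketch above (the direction $h$ is represented by an arbitrary coefficient vector `d` in the same kernel expansion, which is an assumption of the sketch, not of the paper), one can check numerically that $\mathcal L_n$ is indeed smallest at $\alpha = 0$, and that a central-difference estimate of its derivative there is close to zero:

```
# Continuing the sketch above: a direction h = sum_j d_j K(., X_j)
d = rng.normal(size=n)

def L_n(alpha):
    return P_n(c_hat + alpha * d)          # L_n(alpha) = P_n(f_hat_n + alpha * h)

alphas = np.linspace(-0.5, 0.5, 101)
vals = [L_n(a) for a in alphas]
print(alphas[np.argmin(vals)])             # approximately 0: the minimum sits at alpha = 0

eps = 1e-5
print((L_n(eps) - L_n(-eps)) / (2 * eps))  # central difference at 0: approximately 0
                                           # (up to the tolerance of the numerical optimizer)
```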

What I am having trouble with, however, is the claim (page 7) that because $\mathcal L_n(\alpha)$ is minimal at $\alpha = 0$, we have

$$ \frac{\partial\mathcal L_n(\alpha)}{\partial\alpha}\bigg|_{\alpha=0} = 0\tag2$$

Furthermore, the authors seem to implicitly assume that the minimizer of $(1)$ is unique, which is not immediately clear to me.

My question is thus the following: why do we have $\partial_\alpha\mathcal L_n(0) = 0$?


Update: My background in functional derivatives is basically non-existent, so there may be something obvious I am missing. Nonetheless, let me try to better illustrate what I don't understand:
After reading this introductory note by Frigyik, Srivastava and Gupta and the Wiki pages on Gâteaux and Fréchet derivatives, I got that the quantity $\partial_\alpha\mathcal L_n(0)$ is essentially the Gâteaux/Fréchet derivative of $\mathcal P_n$ at point $\hat f_n$, i.e.
$$\frac{\partial\mathcal L_n(\alpha)}{\partial\alpha}\bigg|_{\alpha=0} = D\mathcal P_n(\hat f_n) $$
I also got that, if $\hat f_n$ is a (local) minimizer of $\mathcal P_n$, then $D\mathcal P_n(\hat f_n)$ has to be zero.

My issue, however, is that $\hat f_n$ is a minimizer of $\mathcal P_n$ only over the set $\mathcal H$, and there is no guarantee that $\mathcal H$ contains any (local) minimizer of $\mathcal P_n$ viewed as a functional over the (much) bigger set $\mathbb R^{\mathcal X}$. As an example, consider a non-zero function $g$ defined on $\mathcal X$ and the set $\mathcal G = \{\kappa g, \kappa\in\mathbb R\}$. If we define
$$\hat g_n :=\arg\min_{f\in\mathcal G} \mathcal P_n(f),$$
we would conclude that $\hat g_n$ is a local minimum of $\mathcal P_n$ no matter which (non-trivial) function $g$ was considered, which means that one could "artificially" create arbitrarily many (local) minimizers of $\mathcal P_n$, which seems very wrong.
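For what it is worth, this construction is easy to reproduce in the numerical sketch from earlier (here $g$ is just an arbitrary element of the same finite-dimensional kernel span, an assumption of the sketch): the restriction of $\mathcal P_n$ to the ray $\{\kappa g\}$ is stationary at its minimizer along that ray, but in general not in other directions.

```
# Continuing the sketch: restrict P_n to the ray {kappa * g} for a fixed (arbitrary) g
from scipy.optimize import minimize_scalar

g_coef = rng.normal(size=n)                        # coefficients of a fixed nonzero g
kappa_hat = minimize_scalar(lambda k: P_n(k * g_coef)).x
c_ray = kappa_hat * g_coef                         # coefficients of g_hat_n

eps = 1e-5
along_g = (P_n(c_ray + eps * g_coef) - P_n(c_ray - eps * g_coef)) / (2 * eps)
other = rng.normal(size=n)                         # some other direction in the span
along_other = (P_n(c_ray + eps * other) - P_n(c_ray - eps * other)) / (2 * eps)
print(along_g)       # approximately 0: stationary along the ray
print(along_other)   # generally nonzero: g_hat_n is not stationary in other directions
```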


Update 2: To give a better "counterexample", consider the (convex) map defined on $\mathbb R^2$ as $\varphi : \vec x \mapsto \|\vec x\|_2^2$, and consider its restriction to the (convex) set $\mathcal S := \left\{\begin{pmatrix}x\\y\end{pmatrix} \in \mathbb R^2,\ 1\le y\le 2\right\}$. Clearly, $\vec 0$ is the only global minimizer of $\varphi$ and the only point at which its gradient is zero, but it is not in $\mathcal S$. It is not hard to see either that
$$\arg\min_{\vec x\in\mathcal S} \varphi(\vec x) = \begin{pmatrix}0\\1\end{pmatrix} $$
So $(0,1)^T$ is a minimizer of $\varphi$ over $\mathcal S$, but the gradient at that point is not zero.
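Restating this numerically (purely illustrative; the set $\mathcal S$ is encoded as a box constraint on the second coordinate):

```
import numpy as np
from scipy.optimize import minimize

phi = lambda v: v @ v                                  # phi(x) = ||x||_2^2
grad_phi = lambda v: 2 * v

res = minimize(phi, x0=np.array([0.5, 1.5]),
               bounds=[(None, None), (1.0, 2.0)])      # S: x free, 1 <= y <= 2
print(res.x)             # approximately (0, 1): the constrained minimizer
print(grad_phi(res.x))   # approximately (0, 2): the gradient there is not zero
```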

Similarly, I think it would be possible to build similar counterexamples for problem $(1)$, where $\hat f_n$ is a minimizer of $\mathcal P_n$ over $\mathcal H$ but the Gâteaux derivative at that point is not zero. My question thus remains the same: what conditions need to be imposed on the set $\mathcal H$ to ensure that the Gâteaux derivative at $\hat f_n$ is zero? Are they satisfied for problem $(1)$?

Best Answer

This is related to the notion of a directional derivative. As you note, $\hat f_n$ is a minimizer of $\mathcal{P}_n$ over $\mathcal{H}$. The key point is that $\mathcal{H}$ is a vector space: for every direction $h\in\mathcal{H}$ and every $\alpha\in\mathbb R$, the perturbed function $\hat f_n+\alpha h$ still belongs to $\mathcal{H}$. Consequently $\mathcal L_n(\alpha)=\mathcal P_n(\hat f_n+\alpha h)\ge\mathcal P_n(\hat f_n)=\mathcal L_n(0)$ for all $\alpha\in\mathbb R$, so the real-valued function $\mathcal L_n$ (which is differentiable, since $\ell$ is smooth and the penalty term is a quadratic polynomial in $\alpha$) attains its minimum at the interior point $\alpha=0$, and its derivative there must vanish. In other words: for every direction $h\in\mathcal{H}$ (and this is what you also write in the post), the directional derivative of $\mathcal{P}_n$ at $\hat f_n$ in the direction $h$ is zero, and this is exactly the derivative the paper talks about. This is also where the counterexample of Update 2 differs: the set $\mathcal S$ is not a vector space, so from the constrained minimizer $(0,1)^T$ you cannot move in every direction without leaving $\mathcal S$, and the corresponding one-dimensional restrictions are minimized at a boundary point rather than an interior one.
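For concreteness (this is just the chain rule applied to the definition of $\mathcal L_n$ in the question, using the smoothness of $\ell$; the paper may state it in a slightly different form), condition $(2)$ reads explicitly

$$\mathcal L_n'(0) = \frac 1 n\sum_{i=1}^n \ell'\big(Y_i\hat f_n(X_i)\big)\,Y_i\,h(X_i) + 2\lambda\,\langle \hat f_n, h\rangle = 0,$$

which must hold for every direction $h\in\mathcal H$ satisfying the authors' assumptions.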
