Convergence of $M$-estimators when the argmin is not unique

convergence-divergence, machine learning, optimization, probability theory, statistics

Let $(X_i)_{1\le i\le n}$ be i.i.d. random variables taking values in a compact set $\mathcal X\subseteq \mathbb R^d$, and let $\mathcal P_n = \mathcal P_n(\cdot\mid X_1,\ldots,X_n)$ and $\mathcal P$ be real-valued objective functions defined on a compact set $\Theta\subseteq \mathbb R^p$, such that, for every $\epsilon>0$,
$$\mathbb P\left(\sup_{\theta\in\Theta}|\mathcal P_n(\theta) - \mathcal P(\theta)|>\epsilon\right)=O(a_n), $$
where $a_n$ is a deterministic sequence that tends to zero as $n\to\infty$.

Now let $\hat \theta_n \in \arg\min_{\theta\in\Theta} \mathcal P_n(\theta)$. My question is: can the following relation hold for some $\theta^*\in\arg\min_{\theta\in\Theta} \mathcal P(\theta)$?
$$\mathbb P\left(\|\hat\theta_n - \theta^*\|_2 >\epsilon\right) = O(b_n), $$
where $b_n$ is a deterministic sequence that tends to zero as $n\to\infty$, and which can (ideally) be expressed as $g(a_n)$ for some positive function $g$ that vanishes at $0$.

For the application I have in mind, I am thinking of Empirical Risk Minimization, so the maps $\mathcal P_n$ and $\mathcal P$ have the form
$$\mathcal P_n(\theta) :=\frac 1 n\sum_{i=1}^{n} L(X_i,\theta),\quad \mathcal P(\theta) := \mathbb E[L(X_1,\theta)]. $$
$L(x,\cdot)$ can be assumed to be Lipschitz and differentiable, but not convex. (I'd still be interested in a positive result with proof in the convex case, or in any other setting.)
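
For concreteness, here is a minimal sketch of what I mean by $\hat\theta_n$ in practice; the loss `L` below is just a toy non-convex stand-in, and the multi-start minimization is one of several ways to approximate a global minimizer of $\mathcal P_n$:

```python
# Minimal sketch of the ERM setup; the loss below is a toy stand-in, not the actual problem.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

def L(x, theta):
    # Toy loss: Lipschitz and differentiable in theta on a compact set, but not convex.
    return np.sin(3.0 * (x - theta)) ** 2 + 0.1 * (x - theta) ** 2

def P_n(theta, X):
    # Empirical risk P_n(theta) = (1/n) * sum_i L(X_i, theta).
    return np.mean(L(X, theta[0]))

n = 500
X = rng.uniform(0.0, 1.0, size=n)   # i.i.d. sample on the compact set [0, 1]
bounds = [(-1.0, 1.0)]              # compact parameter set Theta = [-1, 1]

# Multi-start local minimization to approximate a global minimizer, since P_n is non-convex.
starts = np.linspace(-1.0, 1.0, 11)
results = [minimize(P_n, x0=[s], args=(X,), bounds=bounds) for s in starts]
theta_hat = min(results, key=lambda r: r.fun).x[0]
print("theta_hat_n =", theta_hat)
```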


My thoughts: The difficulty lies in the fact that $\arg\min_{\theta\in\Theta} \mathcal P(\theta)$ is not reduced to a singleton. Indeed, if it were, Chapter 5 of van der Vaart's Asymptotic Statistics (1998) shows that $\sqrt n(\hat\theta_n - \theta^*)$ would be asymptotically normally distributed (under additional assumptions).

I would thus like to know whether something similar can be said (again, under additional assumptions) when the argmin has several elements. Maybe in this case one should consider $\text{dist}(\hat\theta_n,\arg\min_{\theta\in\Theta}\mathcal P(\theta))$ and show that it is asymptotically normal as well, but I don't really see how to adapt van der Vaart's argument in that case.
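
For what it's worth, one elementary bound (which does not settle the question) shows that the uniform-convergence assumption always controls the excess risk, without any uniqueness: since $\mathcal P_n(\hat\theta_n)\le\mathcal P_n(\theta^*)$,
$$\mathcal P(\hat\theta_n)-\mathcal P(\theta^*) \le \bigl(\mathcal P(\hat\theta_n)-\mathcal P_n(\hat\theta_n)\bigr)+\bigl(\mathcal P_n(\theta^*)-\mathcal P(\theta^*)\bigr) \le 2\sup_{\theta\in\Theta}|\mathcal P_n(\theta)-\mathcal P(\theta)|, $$
so $\mathbb P\bigl(\mathcal P(\hat\theta_n)-\min_{\theta\in\Theta}\mathcal P(\theta)>\epsilon\bigr)=O(a_n)$. The difficulty is thus in converting convergence of the objective value into a statement about $\hat\theta_n$ itself, and in particular about its distance to a single fixed $\theta^*$ rather than to the whole set $\arg\min_{\theta\in\Theta}\mathcal P(\theta)$.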

Best Answer

Counterexample:

Let
$$X_i \sim \mathcal{U}(0,1),$$
$$\mathcal P_n(\theta) :=\frac 1 n\sum_{i=1}^{n} -\mathbb{I}_{|X_i - \theta|<1}\,|X_i - \theta|,$$
$$\mathcal P(\theta) = \mathbb E\left[ -\mathbb{I}_{|X_1 - \theta|<1}\,|X_1 - \theta|\right].$$

Then $\mathcal P$ has two minimizers, $\theta = 0$ and $\theta = 1$, whereas $\mathcal P_n$ typically has a unique minimizer, which by symmetry is close to $0$ or close to $1$ with probability about $1/2$ each (for instance, if $\Theta=[0,1]$ the empirical objective is $-\frac 1 n\sum_{i=1}^{n}|X_i-\theta|$, which is minimized at $\theta=0$ when $\bar X_n>\tfrac12$ and at $\theta=1$ otherwise). Therefore, for either choice of $\theta^*$, $\mathbb P\left(\|\hat\theta_n - \theta^*\|_2 >\epsilon\right)$ is approximately $0.5$ and does not approach zero.
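
A quick Monte Carlo sketch (taking $\Theta=[0,1]$ and approximating $\hat\theta_n$ by a grid search) illustrates the fifty-fifty behaviour:

```python
# Monte Carlo sketch of the counterexample, with Theta = [0, 1] and a grid search for theta_hat_n.
import numpy as np

rng = np.random.default_rng(0)
theta_grid = np.linspace(0.0, 1.0, 1001)
n, n_rep, eps = 1000, 500, 0.1

def P_n(X, thetas):
    # P_n(theta) = -(1/n) * sum_i 1{|X_i - theta| < 1} * |X_i - theta|, evaluated on a grid.
    d = np.abs(X[:, None] - thetas[None, :])
    return -np.mean(np.where(d < 1.0, d, 0.0), axis=0)

near_zero = 0
for _ in range(n_rep):
    X = rng.uniform(0.0, 1.0, size=n)
    theta_hat = theta_grid[np.argmin(P_n(X, theta_grid))]
    near_zero += abs(theta_hat) <= eps

# With theta* = 0, P(|theta_hat_n - theta*| > eps) stays near 1/2 instead of vanishing.
print("fraction of runs with theta_hat near 0:", near_zero / n_rep)
print("estimated P(|theta_hat_n - 0| > eps):", 1.0 - near_zero / n_rep)
```

The printed probability should hover around $1/2$ for either choice of $\theta^*$, matching the argument above.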
