Beating the $1/\sqrt n$ rate of uniform convergence over a linear function class

learning-theory, machine-learning, measure-concentration, pr.probability, st.statistics

Let $P$ be a probability distribution on $\mathbb R^d \times \mathbb R$, and let $(x_1,y_1), \ldots, (x_n,y_n)$ be an iid sample of size $n$ from $P$. Fix $\epsilon, t > 0$. For any unit vector $w \in \mathbb R^d$ and $i \in \{1,2,\ldots,n\}$, define
$$
z_i(w) := 1[|y_i x_i^\top w - \epsilon| \ge t] = \begin{cases}
1, & \text{if } |y_i x_i^\top w - \epsilon| \ge t,\\
0, & \text{else.}
\end{cases}
$$

Finally, define $\Delta_{n,\epsilon,t}(P) := \sup_{\|w\|=1} \left|\frac{1}{n}\sum_{i=1}^n z_i(w) - z(w)\right|$, where $z(w) := \mathbb P(|y_1 x_1^\top w - \epsilon| \ge t)$ is the expected value of $z_i(w)$.
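
As a concrete illustration, here is a minimal numerical sketch of these definitions. It assumes a toy distribution ($x \sim N(0, I_d)$, $y$ uniform on $\{\pm 1\}$, chosen only for illustration), approximates the supremum over the unit sphere by a maximum over finitely many random directions (so it only gives a lower bound on $\Delta_{n,\epsilon,t}$), and estimates $z(w)$ itself by Monte Carlo on a large auxiliary sample.

```python
import numpy as np

rng = np.random.default_rng(0)

def z_vals(X, y, W, eps, t):
    """z_i(w) = 1[|y_i x_i^T w - eps| >= t] for every sample i (rows) and direction w (columns of W)."""
    margins = y[:, None] * (X @ W)                      # shape (n, num_dirs)
    return (np.abs(margins - eps) >= t).astype(float)

def delta_lower_bound(n, d, eps, t, num_dirs=200, n_pop=200_000):
    """Crude lower bound on Delta_{n,eps,t}(P): max over random unit directions of
    |empirical frequency of z_i(w) - Monte Carlo estimate of z(w)|."""
    # Toy distribution (an assumption for illustration): x ~ N(0, I_d), y uniform on {-1, +1}.
    X = rng.standard_normal((n, d)); y = rng.choice([-1.0, 1.0], size=n)
    X_pop = rng.standard_normal((n_pop, d)); y_pop = rng.choice([-1.0, 1.0], size=n_pop)
    # Random unit directions standing in for the sup over the whole sphere.
    W = rng.standard_normal((d, num_dirs))
    W /= np.linalg.norm(W, axis=0)
    emp = z_vals(X, y, W, eps, t).mean(axis=0)          # (1/n) sum_i z_i(w)
    pop = z_vals(X_pop, y_pop, W, eps, t).mean(axis=0)  # Monte Carlo proxy for z(w)
    return np.max(np.abs(emp - pop))

for n in (100, 1_000, 10_000):
    print(n, delta_lower_bound(n=n, d=10, eps=0.1, t=0.5))
```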

Using VC theory, one can show that $\Delta_{n,\epsilon,t}(P) \le C\sqrt{d/n}$ (ignoring log factors) with probability $1-o(1)$ in the limit $n \to \infty$. This is because the function class $\{(x,y) \mapsto 1[|y x^\top w - \epsilon| \ge t] \mid w \in \mathbb R^d \text{ with } \|w\|=1\}$ has VC dimension $O(d)$.
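
To be explicit, one standard form of the resulting bound (the precise log factor and the universal constant $C$ vary between references) is: with probability at least $1-\delta$,
$$
\Delta_{n,\epsilon,t}(P) \le C\sqrt{\frac{d\log(n/d) + \log(1/\delta)}{n}}.
$$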

Question. What kinds of assumptions on the distribution $P$ allow the $\sqrt{d/n}$ rate to be improved to something like $Cd/n$ or even $C/n$, where the constant $C$ is now allowed to depend on $\epsilon$ and $t$? For example, what if $P$ is given by $x \mid y \sim N(\mu_y,\Sigma_y)$ (i.e., the marginal distribution of $x$ is a Gaussian mixture) and $\mathbb P(y=1) = \pi \in (0,1)$, where $\mu_{\pm 1} \in \mathbb R^d$ and $\Sigma_{\pm 1}$ are positive-definite $d \times d$ matrices?
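
For this Gaussian-mixture example, a sampler along the following lines (a hypothetical sketch; the function name and parameters are mine) could replace the toy sampler in the simulation above:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_gaussian_mixture(n, mu_pos, mu_neg, Sigma_pos, Sigma_neg, pi):
    """Draw (x_i, y_i), i = 1, ..., n, with P(y = 1) = pi and x | y ~ N(mu_y, Sigma_y)."""
    y = np.where(rng.random(n) < pi, 1.0, -1.0)
    X = np.empty((n, len(mu_pos)))
    X[y == 1.0] = rng.multivariate_normal(mu_pos, Sigma_pos, size=int((y == 1.0).sum()))
    X[y == -1.0] = rng.multivariate_normal(mu_neg, Sigma_neg, size=int((y == -1.0).sum()))
    return X, y
```

For instance, `sample_gaussian_mixture(1000, np.ones(10), -np.ones(10), np.eye(10), np.eye(10), 0.5)` draws a balanced two-component mixture with identity covariances.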

Best Answer

It is not possible to get $\left|\frac1n\sum_i z_i(w) - \mathbb E[z_i(w)]\right| = O_P(r_n)$ with $r_n = o(n^{-1/2})$ even for a single fixed $w$, since this would contradict the CLT whenever $\operatorname{Var}[z_i(w)] > 0$.
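
Spelled out: for a fixed $w$ with $p := z(w) \in (0,1)$, the $z_i(w)$ are iid Bernoulli$(p)$, so the CLT gives
$$
\sqrt{n}\left(\frac1n\sum_{i=1}^n z_i(w) - z(w)\right) \xrightarrow{d} N\bigl(0,\, p(1-p)\bigr),
$$
i.e. the deviation for that single $w$ is already of order $\sqrt{p(1-p)/n}$, and the supremum over $\|w\|=1$ can only be larger. So no assumption on $P$ can push the uniform rate below $n^{-1/2}$ unless it forces $z(w) \in \{0,1\}$ for every unit vector $w$.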
