Lower Tail of Random Rank One Sums – Measure Concentration and Probability

measure-concentration, pr.probability, random-matrices, reference-request

Let $\{x_i\}_{i\geq 1}$ be iid random elements of the sequence space $\ell^2(\mathbb{N})$;
assume that $\|x_i\|_2 \leq 1$ almost surely. Let $\Sigma = \mathbb{E}[x_1 \otimes x_1]$.

Define
$$
\Sigma_n = \frac{1}{n} \sum_{i=1}^n x_i \otimes x_i.
$$

I am interested in upper bounding the following lower tail
$$
\mathbb{P}\Big\{ \lambda_{\rm max}(\Sigma - \Sigma_n) > t \Big\}, \quad \mbox{for}~t > 0.
$$

What is a reasonably good bound for this quantity?

Best Answer

Warning: This is not a proper answer, just a dump of the thoughts I have had about this problem so far. Also, I'm not an expert in random matrix theory, so some of the bounds I'll be using may cry for improvement; if someone can do any of them better, I would appreciate both a helping hand and criticism. We'll still get something in the end, but I'm in no way happy with that something, so, please, don't accept this post as an answer yet (or other people may get the impression that the problem is solved while it certainly isn't).

The tools

First of all, we'll need the Bernstein (a.k.a. Hoeffding, a.k.a. ...) inequality in the following form. Let $\xi$ be a real random variable on $[-1,1]$ with $E\xi=0$ and $E\xi^2=\lambda$. Then if $\xi_1,\dots,\xi_n$ are independent copies of $\xi$, we have for every $t>0$ and every $s\in(0,1]$ $$ P(\xi_1+\dots+\xi_n>nt)\le e^{ns(s\lambda-t)}\,. $$ That comes straight from the consideration of the exponential moment $Ee^{s\xi}\le E(1+s\xi+s^2\xi^2)=1+s^2\lambda\le e^{s^2\lambda}$ (here we used $e^u\le 1+u+u^2$ for $u\le 1$, which applies since $|s\xi|\le 1$) and the usual chain of inequalities $$ e^{snt}P(\xi_1+\dots+\xi_n>nt)\le Ee^{s(\xi_1+\dots+\xi_n)}=(Ee^{s\xi})^n\le e^{s^2\lambda n}\,. $$
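(Not part of the argument, but here is a minimal numerical sanity check of this inequality; the three-point distribution for $\xi$ and all parameter values are just illustrative choices.)

```python
import numpy as np

# Sanity check of P(xi_1 + ... + xi_n > n t) <= exp(n s (s*lam - t)).
# Toy distribution (an arbitrary choice): xi in {-1, 0, 1} with P(+-1) = lam/2,
# so that E[xi] = 0, E[xi^2] = lam, |xi| <= 1.
rng = np.random.default_rng(0)
lam, n, t, trials = 0.05, 400, 0.02, 20_000

xi = rng.choice([-1.0, 0.0, 1.0], size=(trials, n), p=[lam / 2, 1 - lam, lam / 2])
empirical = np.mean(xi.sum(axis=1) > n * t)

# The exponent n*s*(s*lam - t) is minimized over s at s = t/(2*lam), capped at 1.
s = min(1.0, t / (2 * lam))
bound = np.exp(n * s * (s * lam - t))
print(f"empirical tail {empirical:.3f} <= Bernstein bound {bound:.3f}")
```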

Next, we'll need the Schur bound for the operator norm of a matrix $B=(b_{ij})$, which is $$ \|B\|\le \sqrt{ \left(\sup_i\sum_j|b_{ij}|\right)\left(\sup_j\sum_i|b_{ij}|\right) }\,. $$
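(A quick numerical check of this bound on an arbitrary random matrix, for whoever wants to play with it:)

```python
import numpy as np

# Check the Schur bound: ||B|| <= sqrt(max abs row sum * max abs column sum).
rng = np.random.default_rng(1)
B = rng.standard_normal((50, 50))
op_norm = np.linalg.norm(B, 2)      # largest singular value
schur = np.sqrt(np.abs(B).sum(axis=1).max() * np.abs(B).sum(axis=0).max())
print(f"||B|| = {op_norm:.3f} <= Schur bound {schur:.3f}")
assert op_norm <= schur + 1e-9
```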

Third, we'll need the fact that the operator norm of the operator $S=\sum_{i} x_i\otimes x_i$ is at most the operator norm of the Gram matrix $G=(\langle x_i,x_j\rangle)_{i,j}$. That is because for every unit vector $v\in H$, we have $$ |Sv|^2=\left|\sum_i \langle x_i,v\rangle x_i\right|^2= \sum_{i,j} \langle x_i,v\rangle G_{ij} \langle x_j,v\rangle \\ \le \|G\| \sum_i \langle x_i,v\rangle^2 =\|G\|\langle Sv,v\rangle\le \|G\|\|S\|\,, $$ so $\|S\|^2\le \|G\|\|S\|$, i.e., $\|S\|\le\|G\|$.
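(And a similar check of $\|S\|\le\|G\|$; the sample size and dimension are arbitrary. In fact the two norms coincide, since with $X$ the matrix whose columns are the $x_i$ we have $S=XX^T$ and $G=X^TX$, which share their nonzero spectrum; the inequality is all we use.)

```python
import numpy as np

# Check ||S|| <= ||G|| for S = sum_i x_i⊗x_i and the Gram matrix
# G = (<x_i, x_j>)_{ij}.  Here S = X X^T and G = X^T X, so the norms agree.
rng = np.random.default_rng(2)
n, dim = 30, 100
X = rng.standard_normal((dim, n))
X /= np.maximum(1.0, np.linalg.norm(X, axis=0))    # enforce ||x_i||_2 <= 1
S = X @ X.T
G = X.T @ X
print(np.linalg.norm(S, 2), np.linalg.norm(G, 2))  # equal up to rounding
```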

Finally, we will need to recall that in the coupon collector problem with $d$ coupons, the typical (in any reasonable sense; say, median) time needed to collect all coupons is about $d\log d$.
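(A short simulation confirming the $d\log d$ folklore; $d=50$ is an arbitrary choice:)

```python
import numpy as np

# Coupon collector: the median time to see all d coupons is about d*log(d).
rng = np.random.default_rng(3)
d, trials = 50, 2000

def collect_time(rng, d):
    seen, steps = set(), 0
    while len(seen) < d:
        seen.add(rng.integers(d))
        steps += 1
    return steps

times = [collect_time(rng, d) for _ in range(trials)]
print(f"median {np.median(times):.0f} vs d*log(d) = {d * np.log(d):.0f}")
```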

The rest will be just some common sense that I'll introduce on the fly.

Restatement

We can restate the problem in the following way. What is the best universal (i.e., depending on $n$ and $t$ only) upper bound for the probability of the event that the operator $$ \frac 1n\sum_{i=1}^n x_i\otimes x_i-\frac 12 \Sigma+\frac t2 I = T-\frac 12 \Sigma+\frac t2 I $$ is not positive definite?

We'll be interested in the regime of small $t$ (say, $t<1/2$).

Example (presumably, the worst case scenario)

We'll start with the following example. Let an integer $d$ be just below $1/t$. Consider the uniform distribution on the $d$ orthonormal basis vectors in $\mathbb R^d$. Then $\Sigma=d^{-1}I$, so our event definitely holds if at least one basis vector does not appear among the $x_i$. By the coupon collector mumbo-jumbo, this probability is uniformly separated from $0$ when $n<d\log d\approx \frac 1t \log\frac 1t$ and is about $d(1-\tfrac 1d)^n\approx \frac 1te^{-nt}$ afterwards, so $L(n,t)=\min(1,\tfrac 1t e^{-nt})$ is definitely a lower bound if you ignore a few absolute constant factors. I believe that it is actually the truth; we'll try to prove enough to make this conjecture rather plausible, though I don't have the full argument yet.
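(Here is a small simulation of this example; the parameter values are arbitrary, and the agreement with $L(n,t)$ is only up to the absolute constant factors mentioned above.)

```python
import numpy as np

# Simulate the presumed worst case: x_i uniform over the d orthonormal basis
# vectors of R^d with d just below 1/t, so Sigma = I/d.  Here
# lambda_max(Sigma - Sigma_n) = 1/d - min_k(count_k)/n, and the tail should
# track L(n, t) = min(1, exp(-n t)/t) up to absolute constant factors.
rng = np.random.default_rng(4)
t = 0.1
d = int(1 / t) - 1                 # d = 9, so Sigma = I/9 and 1/d > t
trials = 20_000

for n in [20, 50, 100, 150]:
    counts = np.stack([np.bincount(rng.integers(d, size=n), minlength=d)
                       for _ in range(trials)])
    lam_max = 1.0 / d - counts.min(axis=1) / n
    print(f"n={n:4d}  empirical tail {np.mean(lam_max > t):.4f}"
          f"  L(n,t) = {min(1.0, np.exp(-n * t) / t):.4f}")
```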

Reduction to a finite-dimensional space

Our first reduction will be reducing the question to a $d$-dimensional subspace of our Hilbert space where $d=d(t)$ depends on $t$ only. We shall get $d(t)\approx t^{-2}$ at this first reduction, though I suspect that one can get $d\approx t^{-1}$ immediately if one is just a bit more inventive.

Pick $\lambda>0$ and split our Hilbert space into the orthogonal sum of two $\Sigma$-invariant subspaces $H'\oplus H''$ such that $\Sigma\le\lambda I$ on $H''$ and $\Sigma\ge \lambda I$ on $H'$. Since the trace of $\Sigma$ is at most $1$, the dimension $d$ of $H'$ is at most $\lambda^{-1}$. Now take a unit vector $v=(v',v'')\in H$ and also split our random variable $x$ as $(x',x'')$. We have $$ \langle (T-\tfrac 12\Sigma+\tfrac t2 I)v,v\rangle= \\ \frac 1n\sum_i \langle v',x_i'\rangle^2 + \frac 1n\sum_i \langle v'',x_i''\rangle^2+ \frac 2n\sum_i \langle v',x_i'\rangle\langle v'',x_i''\rangle \\ -\frac 12\langle \Sigma v',v'\rangle-\frac 12\langle \Sigma v'',v''\rangle+\frac t2(|v'|^2+|v''|^2) \\ \ge \frac 34 \frac 1n\sum_i \langle v',x_i'\rangle^2 -\frac 12\langle \Sigma v',v'\rangle+\frac t2 |v'|^2 \\ -3\frac 1n\sum_i \langle v'',x_i''\rangle^2+(\tfrac t2-\lambda)|v''|^2 $$ where the inequality $2ab\ge -\frac 14 a^2-4b^2$ was used to get rid of the cross-term.

Out of the last two lines, the non-negativity of the first one corresponds pretty much to the original problem in the finite-dimensional space $H'$ except we have $\frac 34 T$ instead of $T$ (note that $P_{H'}\Sigma P_{H'}=E x'\otimes x'$ and we still have $|x'|\le 1$). Thus, if we show that with probability $\ge 1-ne^{-cnt}$ the second of the last two lines is non-negative for every unit vector $v$, we will be able to concentrate on the finite-dimensional problem. (Note that $ne^{-cnt}$ is essentially the same function as $t^{-1}e^{-cnt}$ in my sense when truncated at $1$ from above.)

The second line will certainly be non-negative if the norm of the operator $\frac 1n\sum_i x_i''\otimes x_i''$ is not more than $\frac t6-\frac\lambda 3$. Switching to the Gram matrix, and using the Schur test, we see that this will certainly happen if $$ \frac 1n\left[|x_i''|^2+\sum_{j:j\ne i}|\langle x''_i,x''_j\rangle|\right]\le \frac t6-\frac\lambda 3\quad\text{for every } i\,. $$ Since $|x''_i|^2\le 1$, we can reduce it to $$ \frac 1{n-1}\sum_{j:j\ne i}|\langle x''_i,x''_j\rangle|\le \frac t6-\frac\lambda 3-\frac 1n\,. $$ The RHS for $n>\frac 1t\log \frac 1t$, $\lambda\le (0.05t)^2$ and reasonably small $t$ is at least $0.1 t$. To estimate the LHS, we notice that for every fixed $x''_i$, the i.i.d. random variables $\xi_j=|\langle x''_i,x''_j\rangle|$ satisfy $E\xi_j^2\le\lambda$ and, thereby, $E\xi_j\le \sqrt{\lambda}$, so, by Bernstein, $$ P\left(\frac 1{n-1}\sum_{j:j\ne i}\xi_j>\sqrt{\lambda}+\tau\right)\le e^{s(n-1)(s\lambda-\tau)}\,. $$ Hence, choosing $\lambda=(0.05t)^2$, $\tau=0.05 t$ and $s=1$, we get our probability bound $e^{-cnt}$ for one row (or column: the Gram matrix is symmetric). The trivial union bound over the $n$ rows finishes the story then.
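(One can also watch this row-sum estimate in action; the spherical distribution for $x''$ below is just one convenient example with $\Sigma''\le\lambda I$, and the parameters are arbitrary.)

```python
import numpy as np

# Row sums of the off-diagonal Gram entries for x'' = sqrt(lam) * (uniform
# unit vector in R^m): here Sigma'' = (lam/m) I <= lam I and |x''| <= 1.
# The claim is that each row sum / (n-1) stays below sqrt(lam) + 0.05*t whp.
rng = np.random.default_rng(6)
t = 0.2
lam = (0.05 * t) ** 2
m, n = 40, 500

U = rng.standard_normal((n, m))
U /= np.linalg.norm(U, axis=1, keepdims=True)      # uniform on the sphere
Xpp = np.sqrt(lam) * U
G = np.abs(Xpp @ Xpp.T)
np.fill_diagonal(G, 0.0)
row_means = G.sum(axis=1) / (n - 1)
print(f"max row mean {row_means.max():.5f} vs sqrt(lam)+0.05t = "
      f"{np.sqrt(lam) + 0.05 * t:.5f}")
```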

Thus, dropping an event whose probability is right in line with the conjecture, we can assume WLOG that we are working in a finite-dimensional space of dimension $\approx t^{-2}$.

This part will probably survive in the final version; at least, it definitely doesn't hurt to make this initial reduction step. The rest will, most likely, have to be modified a lot.

Crude net argument

We thus want to estimate the probability that the operator $A=\frac 34 T-\frac 12\Sigma+\frac t2 I$ is not positive-definite on the $d$-dimensional space with $d\approx t^{-2}$. Note that the norm of $A$ is at most $1$, so it would be enough to check the inequality $\langle Av,v\rangle\ge \frac t4 |v|^2$ on, say, a $\frac t8$-net of the unit $d$-dimensional sphere. For each unit vector $v$, the failure of the verification means that $$ \frac 1n\sum_{i}\xi_i\ge \tfrac 23 t+\tfrac 13 e(v) $$ where $e(v)=E[\langle v,x\rangle^2]$ and $\xi_i=e(v)-\langle v,x_i\rangle^2$. Note that $$ E\xi_i^2\le E \langle v,x_i\rangle^4\le E\langle v,x_i\rangle^2=e(v) $$ (the subtraction of the mean only reduces the second moment), so, by Bernstein, the probability of the verification failure is at most $ e^{ns(se(v)-(\tfrac 23 t+\tfrac 13 e(v)))}=e^{-\frac 29 nt} $ if we choose $s=\frac 13$. That is just the decay we want. The issue is that the $t/8$-net in the $d\approx t^{-2}$-dimensional space is huge: it has about $e^{t^{-2}\log(1/t)}$ elements, so the trivial union bound gives us that extra factor, which means that we have proved the correct exponential decay $e^{-cnt}$ only for $n\ge t^{-3}\log\frac 1t$, while the conjecture is equivalent to saying that it holds for $n\ge t^{-1}\log\frac 1t$. All the blood should now be spilled on reducing this $3$ to $1$, and I'd like to think a bit before posting the next morsel.
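(To see how bad the bookkeeping is, one can tabulate the exponents; the constant $3$ in the net-size bound $(3/\varepsilon)^d$ is the standard volumetric one, and everything else is just the crude arithmetic above.)

```python
import numpy as np

# Bookkeeping for the net argument: a (t/8)-net of the unit sphere in R^d has
# at most (3/(t/8))^d = (24/t)^d points (volumetric bound), so the union bound
#   |net| * exp(-2 n t / 9)
# only becomes small for n >~ t^{-3} log(1/t) when d ~ t^{-2}.
for t in [0.1, 0.05, 0.02]:
    d = round(t ** -2)
    log_net = d * np.log(24 / t)            # log of the net cardinality
    n_needed = log_net / (2 * t / 9)        # n at which the bound reaches O(1)
    print(f"t={t:5.2f}  log|net| ~ {log_net:10.0f}  n needed ~ {n_needed:14.0f}"
          f"  vs conjectured ~ {np.log(1 / t) / t:8.0f}")
```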

As I expected, Mark Rudelson had the right tools in his toolbox immediately. The key words are "matrix Bernstein inequality". You can find a ready reference in Vershynin's book, but I, as usual, prefer to present a full exposition, so here is the argument.

Matrix Bernstein inequality (the simplest version). Suppose that $X$ is a symmetric $d\times d$ random matrix and that $X_1,\dots,X_n$ are independent random symmetric matrices, each distributed as $X$. Choose a positive number $\tau>0$ and write the chain of inequalities $$ E \operatorname{Tr}[e^{\tau\sum_{1\le i\le n}X_i}]\le E\operatorname{Tr}[e^{\tau\sum_{1\le i\le n-1}X_i}e^{\tau X_n}] \\ = E\operatorname{Tr}[e^{\tau\sum_{1\le i\le n-1}X_i}Ee^{\tau X_n}]\le \|Ee^{\tau X}\| E\operatorname{Tr}[e^{\tau\sum_{1\le i\le n-1}X_i}] $$
where in the first step the Golden-Thompson inequality $\operatorname{Tr}[e^{A+B}]\le\operatorname{Tr}[e^Ae^B]$ was used (and in the last step, $\operatorname{Tr}[AB]\le\|B\|\operatorname{Tr}A$ for positive semi-definite $A$). A nice proof of Golden-Thompson can be found on Terence Tao's blog: https://terrytao.wordpress.com/2010/07/15/the-golden-thompson-inequality/
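(For the skeptical, a numerical check of Golden-Thompson on a few random symmetric matrices; the dimension is arbitrary.)

```python
import numpy as np
from scipy.linalg import expm

# Check the Golden-Thompson inequality Tr e^{A+B} <= Tr[e^A e^B]
# for random symmetric A, B.
rng = np.random.default_rng(5)
for _ in range(5):
    A = rng.standard_normal((8, 8)); A = (A + A.T) / 2
    B = rng.standard_normal((8, 8)); B = (B + B.T) / 2
    lhs = np.trace(expm(A + B))
    rhs = np.trace(expm(A) @ expm(B))
    print(f"{lhs:12.4f} <= {rhs:12.4f}", lhs <= rhs + 1e-8)
```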

Continuing all the way down and using the trivial inequality $\operatorname{Tr} A\le d\|A\|$ at the last step, we conclude that $$ E \operatorname{Tr}[e^{\tau\sum_{1\le i\le n}X_i}]\le d\|Ee^{\tau X}\|^n\,. $$ In particular, this implies that $$ P(\lambda_{\text{max}}(\sum_i X_i)\ge \omega n)\le d\|Ee^{\tau X}\|^ne^{-\omega\tau n}\,. $$ We shall apply this first to $X_i=x_i\otimes x_i$, $\tau=1$ in the situation when $Ex\otimes x\le \lambda I$ with $\lambda$ a small multiple of $t\in(0,1)$ and $d\approx t^{-2}$. Then $$ e^{X}=I+\frac{e^{|x|^2}-1}{|x|^2}x\otimes x\le I+(e-1)x\otimes x $$ so $Ee^{\tau X}\le (1+(e-1)\lambda)I\le e^{2\lambda} I$ and we conclude that $$ P(\left\|\sum_{i=1}^n x_i\otimes x_i\right\|>cnt)\le Ct^{-2}e^{(2\lambda-ct)n}\,, $$ which will allow us, using the trickery from the first morsel, to remove the invariant subspace of $\Sigma$ on which $\Sigma<\lambda I$ and to reduce our considerations to the invariant subspace on which $\Sigma\ge\lambda I$ with $\lambda$ comparable to $t$ and $d$ comparable to $t^{-1}$ or less.
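(The rank-one exponential formula above is easy to verify numerically; the dimension and the vector here are arbitrary.)

```python
import numpy as np
from scipy.linalg import expm

# Check e^{x⊗x} = I + ((e^{|x|^2} - 1)/|x|^2) x⊗x and the bound
# e^{x⊗x} <= I + (e - 1) x⊗x for |x| <= 1, which was used above.
rng = np.random.default_rng(7)
x = rng.standard_normal(6)
x /= 2 * np.linalg.norm(x)                 # make |x| <= 1/2 < 1
P = np.outer(x, x)
r2 = x @ x                                 # |x|^2
print(np.max(np.abs(expm(P) - (np.eye(6) + (np.expm1(r2) / r2) * P))))
# (e^u - 1)/u is increasing in u, so the factor is at most e - 1 for |x| <= 1:
print(np.expm1(r2) / r2, "<=", np.e - 1)
```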

Now, in that subspace, we shall use $X_i=I-\Sigma^{-1/2}x_i\otimes x_i\Sigma^{-1/2}=I-y_i\otimes y_i$ where $y_i=\Sigma^{-1/2}x_i$. Note that $Ey\otimes y=I$ and $|y_i|^2\le Ct^{-1}$ now, so we can use $\tau=ct$ with a small absolute $c>0$ and write $$ e^{\tau X}=e^{ct}(I-\tfrac {1-e^{-ct|y|^2}}{|y|^2}y\otimes y)\le e^{ct}(I-t\tfrac {1-e^{-cC}}{C}y\otimes y)\,, $$ so, taking the expectation and then the norm, we get $$ \|Ee^{\tau X}\|\le e^{ct}(1-t\tfrac{1-e^{-cC}}C)\le e^{2c^2Ct} $$ provided that $cC<1$, $t<1$. Taking now $\omega=0.1$, say, we conclude that $$ P(\lambda_{\text{max}}(\sum_i X_i)>0.1 n)\le C't^{-1}e^{(2c^2C-0.1 c)tn}\,, $$ which, for small enough $c$, is what we wanted. It remains to note that the complement of the event in parentheses is $ \sum_i X_i\le 0.1nI $ or, equivalently, $\Sigma-\frac 1n\sum_i x_i\otimes x_i\le 0.1\Sigma$.
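(On the basis-vector example from before, this final statement is easy to watch converging; all parameters below are arbitrary illustrative choices.)

```python
import numpy as np

# Illustrate the conclusion on the basis-vector example: Sigma = I/d, x_i
# uniform over the d basis vectors, so Sigma_n = diag(count_k / n) and
#   Sigma - Sigma_n <= 0.1 * Sigma   iff   min_k(count_k)/n >= 0.9/d.
rng = np.random.default_rng(8)
d, trials = 10, 2_000
for n in [1_000, 3_000, 6_000, 10_000]:
    counts = np.stack([np.bincount(rng.integers(d, size=n), minlength=d)
                       for _ in range(trials)])
    ok = np.mean(counts.min(axis=1) / n >= 0.9 / d)
    print(f"n={n:6d}  P(Sigma - Sigma_n <= 0.1*Sigma) ~ {ok:.3f}")
```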

That finishes the proof of the conjectured universal bound. Now we can think of what happens if the spectrum of $\Sigma$ decays faster.
