Central limit theorem for uncorrelated and identically distributed random variables

Tags: central limit theorem, probability, probability theory, probability-limit-theorems

I'm trying to determine whether a sum of identically distributed but only uncorrelated continuous random variables can converge to a normal distribution. First of all, we define the i.i.d. random variables $\theta_i\sim Unif(0,2\pi)$, i.e. uniformly distributed on the interval $(0,2\pi)$, with density function given by:
$$
f_{\theta_i}(\theta) = \begin{cases}\frac{1}{2\pi} & \mbox{if } \theta\in(0,2\pi)
\\ 0 & \mbox{otherwise} \end{cases}
$$

My question is: what can we say about the sum of random variables
$$
\sum_{i<j}^{N}\cos(\theta_i-\theta_j) \mbox{ ?}
$$

First of all, we can note that if $i, j, k, s$ are all distinct, then $\cos(\theta_i-\theta_j)$ and $\cos(\theta_k-\theta_s)$ are independent (because they are functions of independent random variables). The main problem is that the random variables are not mutually independent but "only" uncorrelated (or at least, I managed to prove that they are uncorrelated, but I don't think they can be mutually independent, although I still have to find a counterexample to prove it). On the other hand, the $\{\cos(\theta_i-\theta_j)\}_{i<j}$ are identically distributed with $\mathbb{E}(\cos(\theta_i-\theta_j))=0$ and $Var(\cos(\theta_i-\theta_j))=\frac{1}{2}$. I know that the standard CLT cannot be applied here, but I also know that more general forms of this theorem with weaker hypotheses exist.
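For completeness, here is a short derivation of these two moments. Writing the cosine of a difference as a sum of products of independent factors,
$$
\mathbb{E}(\cos(\theta_i-\theta_j)) = \mathbb{E}(\cos\theta_i)\,\mathbb{E}(\cos\theta_j) + \mathbb{E}(\sin\theta_i)\,\mathbb{E}(\sin\theta_j) = 0,
$$
and, since $\theta_i-\theta_j \bmod 2\pi \sim Unif(0,2\pi)$,
$$
Var(\cos(\theta_i-\theta_j)) = \mathbb{E}(\cos^2(\theta_i-\theta_j)) = \frac{1}{2\pi}\int_0^{2\pi}\cos^2 u \, du = \frac{1}{2}.
$$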

Do you know whether it is possible to apply one of these more general versions in this particular case? Thanks in advance for your help!

UPDATE 1

Since

$$
|\{(i,j)\in\mathbb{N}^2 : 1\leq i<j\leq N\}| = \binom{N}{2} = \frac{N(N-1)}{2}
$$

the sum contains $\frac{N(N-1)}{2}$ random variables, and so applying the CLT (in its standard version?) we would have:
$$
\frac{\sum_{i<j}^{N}\cos(\theta_i-\theta_j)-\binom{N}{2}\mu}{\sqrt{\binom{N}{2}\sigma^2}} = \frac{\sum_{i<j}^{N}\cos(\theta_i-\theta_j)}{\sqrt{\frac{N(N-1)}{4}}} = \frac{2\sum_{i<j}^{N}\cos(\theta_i-\theta_j)}{\sqrt{N(N-1)}} \longrightarrow Z\sim\mathcal{N}(0,1)
$$

where we use the fact that $\mu = 0$ and $\sigma^2 = \frac{1}{2}$.
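Explicitly, with $\sigma^2=\frac{1}{2}$ the normalizing constant in the denominator is
$$
\sqrt{\binom{N}{2}\sigma^2} = \sqrt{\frac{N(N-1)}{2}\cdot\frac{1}{2}} = \sqrt{\frac{N(N-1)}{4}} = \frac{\sqrt{N(N-1)}}{2}.
$$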

MY QUESTION

I would like to know how I can justify this fact rigorously. In particular, which version of the CLT should I use?

Observations

Note that if we define $X_{ij} = \cos(\theta_i-\theta_j)$, our sequence of random variables forms a triangular array with entries $\{X_{ij}\}_{1\leq i\leq j-1,\, j \geq 2}$, and we are interested in the convergence of the sum of all its entries:
$$
\sum_{i<j}^{N}X_{ij}
$$

as $N\to +\infty$. Furthermore, I notice that every row of the triangle is composed of independent random variables. For example, if $j=5$, then
$X_{15},X_{25},X_{35},X_{45}$ are independent.
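A sketch of why the row variables are independent even though they share $\theta_5$: conditionally on $\theta_5$, the differences $\theta_i-\theta_5 \bmod 2\pi$ for $i=1,\dots,4$ are independent $Unif(0,2\pi)$ random variables, and this conditional distribution does not depend on the value of $\theta_5$, so the joint law factorizes:
$$
\mathbb{P}(X_{15}\in A_1,\dots,X_{45}\in A_4) = \mathbb{E}\left(\prod_{i=1}^{4}\mathbb{P}(\cos(\theta_i-\theta_5)\in A_i \mid \theta_5)\right) = \prod_{i=1}^{4}\mathbb{P}(X_{i5}\in A_i).
$$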

Lindeberg-Feller central limit theorem

Our array satisfies the Lindeberg-Feller CLT, which implies that if $S_k$ is the sum of the $k$-th row, then $S_k$ (suitably normalized) converges in distribution to $\mathcal{N}(0,\sigma^2)$. But how can I use this for the problem of the convergence of the sum of ALL the entries of the array?
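For reference, one common statement of the Lindeberg-Feller CLT for a triangular array with row-wise independent, centered entries $X_{n,1},\dots,X_{n,r_n}$ and $s_n^2=\sum_{k=1}^{r_n} Var(X_{n,k})$ requires the Lindeberg condition
$$
\frac{1}{s_n^2}\sum_{k=1}^{r_n}\mathbb{E}\left(X_{n,k}^2\,\mathbf{1}_{\{|X_{n,k}|>\varepsilon s_n\}}\right)\longrightarrow 0 \quad \text{for every } \varepsilon>0,
$$
and concludes that $\frac{1}{s_n}\sum_{k=1}^{r_n}X_{n,k}$ converges in distribution to $\mathcal{N}(0,1)$. Since the $X_{ij}$ here are bounded by $1$ and the row variances grow, this condition holds trivially within each row; the obstacle is exactly the one raised above, namely that the full collection $\{X_{ij}\}_{i<j}$ is not known to be mutually independent, so the theorem only controls single-row sums.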

Doubts

After my answer (which you can find below), where I referred to some papers, thanks to @jd27 I realized that my simulation with R does not look like a normal distribution, although, because of the results I cited, it should be.

The ingredients I have used are:

  1. Pairwise independence (i.e. for all $i_1<j_1$ and $i_2<j_2$ with $(i_1,j_1)\not = (i_2,j_2)$, the random variables
    $\cos(\theta_{i_1}-\theta_{j_1})$ and $\cos(\theta_{i_2}-\theta_{j_2})$ are
    independent; a quick numerical check of this appears right after this list);
  2. Symmetry of $X_{ij}$ (i.e. $X_{ij}$ and $-X_{ij}$ have the same distribution);
  3. The random variables $X_{ij}$ are identically distributed with
    finite expected value and finite variance.
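As a quick, non-rigorous sanity check of ingredient 1 for a pair sharing an index (e.g. $\cos(\theta_1-\theta_2)$ and $\cos(\theta_1-\theta_3)$), one can look at the empirical correlations of the values and of their squares, both of which should be close to $0$ under independence. This is only a sketch, with an arbitrary sample size:

set.seed(1)
m <- 1e5
theta1 <- runif(m, min = 0, max = 2*pi)
theta2 <- runif(m, min = 0, max = 2*pi)
theta3 <- runif(m, min = 0, max = 2*pi)
x <- cos(theta1 - theta2)
y <- cos(theta1 - theta3)
print(cor(x, y))      # close to 0: uncorrelated
print(cor(x^2, y^2))  # also close to 0, as expected if the pair is independent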

I don't understand what I did wrong. Is one of these properties false for my sequence of random variables? Thanks in advance for your help.

We can also note that

$$
\min_{\theta_1,\dots,\theta_N} \sum_{i<j}^{N}\cos(\theta_i-\theta_j) = -\frac{N}{2}
$$

while

$$
\max_{\theta_1,\dots,\theta_N} \sum_{i<j}^{N}\cos(\theta_i-\theta_j) = \binom{N}{2}
$$

(the latter is attained when all the $\theta_i$ are equal, so that every cosine equals $1$),

which is strange, because it produces a distribution that does not look like a normal one, although the limit distribution of
$$
\frac{\sum_{i<j}^{N}\cos(\theta_i-\theta_j)}{\sqrt{\frac{N(N-1)}{4}}} \longrightarrow W
$$

seems to have $\mathbb{E}(W) = 0$ and $Var(W) = 1$, as you can see from the output of the code:

library(EnvStats)
nsample <- 20000                    # number of Monte Carlo replications
n <- 50                             # number of angles theta_i per replication
mat <- matrix(0, n, n)
total <- numeric(nsample)
for (k in 1:nsample) {
  # one outcome: draw all n angles at once, then form every pairwise cosine
  sample <- runif(n, min = 0, max = 2*pi)
  for (i in 1:n) {
    for (j in 1:n) {
      if (j > i) {
        mat[i, j] <- cos(sample[i] - sample[j])
      }
    }
  }
  # normalized sum over all pairs i < j
  total[k] <- sum(mat) / sqrt(n*(n-1)/4)
}
T <- total
CDF <- ecdf(T)
par(mfrow = c(1, 2))
plot(CDF)                           # empirical CDF
epdfPlot(T, epdf.col = "red")       # empirical PDF (EnvStats)
print(mean(total))                  # sample mean, close to 0
print(var(T))                       # sample variance, close to 1

[Plots: empirical CDF (left) and empirical PDF (right) of the simulated normalized sum]

The limit distribution looks somewhat like a log-normal distribution:

https://en.wikipedia.org/wiki/Log-normal_distribution

But it is also reminiscent of the Landau distribution:

https://en.wikipedia.org/wiki/Landau_distribution

Best Answer

This is not a complete answer, but I think that the conjecture from the R simulation, that a CLT-type result (with a normal distribution) holds, might not be accurate.

If I understand the problem correctly, I think your R code is not calculating the right thing. To sample $\sum_{i<j}^N \cos ( \theta_i - \theta_j) $ we need to draw $\theta_1(\omega) , \dots , \theta_N( \omega)$ for one outcome $\omega$ and then compute the sum with these values. But you do not do that. In this part of the code:

mat <- matrix(0, n, n)
total <- integer(nsample)
for (k in 1:nsample) {
  for(i in 2:n) {
    sample1 = runif(1, min = 0, max = 2*pi)
    for(j in 1:i-1) {
      sample2 = runif(1, min = 0, max = 2*pi)
      mat[i,j] <- cos(sample1-sample2)
    }
  }
  total[k] <- (sum(mat))/(sqrt(n*(n-1)/4))
  mat <- matrix(0, n, n)
}

You are sampling far too many independent variables. What you are calculating is the following:

Let $\theta_2, \dots, \theta_N$ be i.i.d. uniform (corresponding to sample1 in your code) and, for $i \in \{2 , \dots, N \}$, let $\theta_{ij}$ be i.i.d. uniform for $j \in \{1, \dots, i-1\}$ (corresponding to sample2 in your code). Then what you are sampling in the code is (ignoring the constant factor): $$ \sum_{i=2}^N \sum_{j=1}^{i-1} \cos ( \theta_i - \theta_{ij}), $$ which is not the same as the sum above.

We can sample the correct expression in Python like this:

import numpy as np
from matplotlib import pyplot as plt
import numba as nb

rng = np.random.default_rng()
N_sum = 70
N_samples = 7500

@nb.njit(fastmath=True)
def Sn(rng):
    """Sample the sum (once)"""
    thetas = rng.uniform(0,2*np.pi,size=N_sum)
    ps=0.0
    for j in range(N_sum):
        for i in range(j):
            ps+= np.cos(thetas[i]-thetas[j])
    return ps/np.sqrt(N_sum*(N_sum-1)/4)

@nb.njit()
def genSns(rng):
    """Sample the sum multiple times"""
    out = np.zeros(shape=N_samples)
    for i in range(N_samples):
       out[i]= Sn(rng)
    return out

sns = genSns(rng)

fig, ax = plt.subplots()
weights, bins, _ = ax.hist(sns,bins=100,density=True)
ax.set_ylabel("Density")
ax.set_xlabel("Sum Value")
ax.set_xlim(-1.1,6)
plt.show()

[Histogram of the sampled values of the normalized sum (density scale)]

This is a totally different PDF, and it also seems (at least at first glance) to be essentially "immune" to the usual CLT, as increasing N_sum does not appear to change much.

The boundedness of the sum from below (the minimum is slightly below $-1$ in the above plot) makes sense. For example, in the case $N=3$ we have $$ \sum_{i<j}^N \cos ( \theta_i - \theta_j) = \cos ( \theta_1 - \theta_2) +\cos ( \theta_1 - \theta_3) + \cos ( \theta_2 - \theta_3), $$ and if we compute the global minimum we arrive at $-3/2$. It is probably also possible to find a recursive formula for this minimum for any $N$, to show that the sum is bounded from below. It seems that in general the minimum of the sum is at $-N/2$, which after normalization leads to a minimum of $\approx -1$ in the PDF histogram.
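One configuration that attains the value $-N/2$ (this is only an example, not a proof that it is the global minimum): take the equally spaced angles $\theta_k = 2\pi k/N$, so that $\sum_{k=1}^{N}e^{\mathrm{i}\theta_k}=0$. Expanding the squared modulus gives
$$
0 = \left|\sum_{k=1}^{N}e^{\mathrm{i}\theta_k}\right|^2 = N + 2\sum_{i<j}^{N}\cos(\theta_i-\theta_j) \quad\Longrightarrow\quad \sum_{i<j}^{N}\cos(\theta_i-\theta_j) = -\frac{N}{2},
$$
consistent with the value $-3/2$ found above for $N=3$. After dividing by $\sqrt{N(N-1)/4}$ this point sits at $-\sqrt{N/(N-1)}\approx -1$, matching the left edge of the histogram.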
