Solved – Covariance of two sample means

mathematical-statisticsself-study

I am trying to derive the covariance of two sample means and get confused at one point. Given is a sample of size $n$ with paired dependent observations $x_i$ and $y_i$ as realizations of RVs $X$ and $Y$ and sample means $\bar{x}$ and $\bar{y}$. I try to derive $cov(\bar{x},\bar{y})$.

I am relatively sure the result should be

$cov(\bar{x},\bar{y})=\frac{1}{n}cov(X,Y)$

However I arrive at

$$cov(\bar{x},\bar{y})=E(\bar{x}\bar{y})-\mu_x\mu_y = E(\frac{1}{n^2}\sum x_i \sum y_i) -\mu_x\mu_y
=\frac{1}{n^2} n^2 E(x_i y_i) -\mu_x\mu_y=cov(X,Y)$$

I used

$$E(\frac{1}{n^2}\sum x_i \sum y_i)=\frac{1}{n^2} E(x_1y_1+x_2y_1+…x_ny_n)=\frac{1}{n^2} n^2 E(x_iy_i)$$

Somewhere should be a flaw in my thinking.

Best Answer

Covariance is a bilinear function meaning that $$ \operatorname{cov}\left(\sum_{i=1}^n a_iC_i, \sum_{j=1}^m b_jD_j\right) = \sum_{i=1}^n \sum_{j=1}^m a_i b_j\operatorname{cov}(C_i,D_j).$$ There is no need to mess with means etc.

Applying this to the question of the covariance of the sample means of $n$ independent paired samples $(X_i, Y_i)$ (note: the pairs are independent bivariate random variables; we are not claiming that $X_i$ and $Y_i$ are independent random variables), we have that \begin{align} \operatorname{cov}\left(\bar{X},\bar{Y}\right) &= \operatorname{cov}\left(\frac{1}{n}\sum_{i=1}^n X_i, \frac 1n\sum_{j=1}^n Y_j\right)\\ &= \frac{1}{n^2}\sum_{i=1}^n \sum_{j=1}^n \operatorname{cov} (X_i, Y_j)\\ &= \frac{1}{n^2}\sum_{i=1}^n \operatorname{cov} (X_i, Y_i) &\scriptstyle{\text{since $X_i$ and $Y_j$ are independent, and thus uncorrelated, for $i \neq j$}}\\ &= \frac 1n\operatorname{cov} (X, Y) \end{align}

Related Solutions

Solved – Likelihood Ratio for the Bivariate Normal distribution

As has been made clear in the comments, the OP is interested in the Likelihood ratio when the common variance is also estimated, and not known.

The joint density of one pair of $\{X_i, Y_i$}, given also the maintained assumptions on the parameter values is

$$ f(x_i,y_i) = \frac{1}{2 \pi \sigma^2\sqrt{3/4}} \ \exp\left\{ -\frac{2}{3}\left[ \frac{(x_i-\mu_x)^2}{\sigma^2} + \frac{(y_i-\mu_y)^2}{\sigma^2} - \frac{(x_i-\mu_x)(y_i-\mu_y)}{\sigma^2} \right] \right\}$$

So the joint Likelihood of the sample (not log likelihood) is

$$ L(\mu_x, \mu_y, \sigma^2 \mid, \mathbf x, \mathbf y, \rho=1/2) = \left(\frac{1}{2 \pi \sigma^2\sqrt{3/4}}\right)^n \\ \times \exp\left\{ -\frac{2}{3\sigma^2}\left[ \sum_{i=1}^n(x_i -\mu_x)^2 +\sum_{i=1}^n(y_i -\mu_y)^2 - \sum_{i=1}^n(x_i-\mu_x)(y_i-\mu_y) \right] \right\}$$

Denote $L_1$ the maximized likelihood with the sample means (MLEs for the true means), and $L_0$ the likelihood with the means set equal to zero. Then the Likelihood Ratio (not the log such) is $$ LR \equiv \frac {L_0}{L_1} = \frac {\hat \sigma^{2n}_1\cdot \exp\left\{ -(2/3\hat \sigma^2_0)\cdot\left[ \sum_{i=1}^nx_i^2 +\sum_{i=1}^ny_i^2 - \sum_{i=1}^nx_iy_i \right] \right\}}{\hat \sigma^{2n}_0 \cdot\exp\left\{ -(2/3\hat \sigma^2_1)\cdot\left[ \sum_{i=1}^nx_i^2 -n\bar x^2 +\sum_{i=1}^ny_i^2 -n\bar y^2 - \sum_{i=1}^nx_iy_i+n\bar x\bar y \right] \right\}}$$

where $\hat \sigma^2_1$ is the estimate with unconstrained means and $\hat \sigma^2_0$ is the estimate with the means constrained to zero.

The OP has (correctly) calculated the MLEs for the common variance as

$$\hat \sigma^2_1 = \frac{2}{3n} \sum_{i=1}^n \left[ \left(x_i-\bar{x}\right)^2+\left(y_i-\bar{y} \right)^2-\left(x_i-\bar{x}\right)\left(y_i-\bar{y} \right) \right]$$

$$\hat \sigma^2_0 = \frac{2}{3n} \sum_{i=1}^n \left( x_i^2+y_i^2-x_iy_i \right)$$

If we plug these into the LR, inside the exponential, both in the numerator and the denominator, things cancel out and we are left simply with

$$ LR = \frac {\hat \sigma^{2n}_1}{\hat \sigma^{2n}_0 } $$

Our goal is not to derive the LR per se -it is to find a statistic to run the test we are interested in. So let's consider the quantity (which is the reciprocal of quantity presented in the question)

$$\left(LR\right)^{-1/n} = \frac {\hat \sigma^{2}_0}{\hat \sigma^{2}_1}$$

$$ = \frac{2}{3n} \frac{\sum_{i=1}^{n}x_i^2+\sum_{i=1}^{n}y_i^2-\sum_{i=1}^n x_iy_i }{\hat \sigma^{2}_1}$$

$$= \frac {1}{3n}\cdot\left[\sum_{i=1}^n\left(\frac {x_i}{\hat \sigma_1}\right)^2 + \sum_{i=1}^n\left(\frac {y_i}{\hat \sigma_1}\right)^2 + \sum_{i=1}^n\left(\frac {x_i-y_i}{\hat \sigma_1}\right)^2\right]$$

Note that $\hat \sigma^{2}_1$ is a consistent estimator of the true variance, irrespective of whether the true means are zero or not. Also (given equal variances and $\rho =1/2$),

$$Z_i = X_i - Y_i \sim N(\mu_x-\mu_y, \sigma^2)$$

Under the null of zero means, then, all $(x_i/\hat \sigma_1)^2$, $(y_i/\hat \sigma_1)^2$ and $(z_i/\hat \sigma_1)^2$ are chi-squares with one degree of freedom (and i.i.d., per sum). Each sum (denote the three sums for compactness $S_x, S_y, S_z$) has expected value $n$ and standard deviation $\sqrt {2n}$ (under the null). So subtract $n$ 3 times and add $n$ 3 times, and also divide and multiply by $\sqrt {2n}$ and re-arrange to get

$$\sqrt {n}\left(LR\right)^{-1/n} = \frac {\sqrt 2}{3}\cdot\left[\frac {S_x - E(S_x)}{SD(S_x)} + \frac {S_y - E(S_x)}{SD(S_x)} + \frac {S_z - E(S_z)}{SD(S_z)}\right] + 1$$

The three terms inside the bracket, are the subject matter of the Central Limit Theorem, and so each element converges to a standard normal. Therefore we have arrived (due to initial bi-variate normality) at

$$\frac {3}{\sqrt 2} \left[\sqrt{n}\left(LR\right)^{-1/n} -1\right] \xrightarrow{d} N(0, AV)$$

Of course in order to actually use the left-hand side as a statistic in a test, we need to derive the asymptotic variance -but for the moment, I do not feel up to the task. I just note that one should determine whether the three $S$'s are asymptotically independent or not.

Mathematical Statistics – How to Find the Covariance of Two Random Sums

To help make the ideas clear, I will use capital letters for random variables.

Everything follows from the restriction $\sum W_i=n,$ because that implies this sum has zero variance. Since each $W_i$ is a Bernoulli variable,

$$\operatorname{Var}(W_i) = \frac{n(N-n)}{N^2}.$$

Computing the variance of the sum and assuming, as is the case with simple random sampling, that for $i\ne j$ $\operatorname{Cov}(W_i,W_j)$ does not depend on $i$ or $j,$ we find

$$\begin{aligned} 0 &= \operatorname{Var}\left(\sum_{i=1}^N W_i\right) \\ &=\sum_{i=1}^N \operatorname{Var}\left(W_i\right) + \sum_{i\ne j}^N \operatorname{Cov}(W_i,W_j) \\ &= N\frac{n(N-n)}{N^2} + N(N-1)\operatorname{Cov}\left(W_1,W_2\right), \end{aligned}$$

enabling us to solve for the covariance as

$$\operatorname{Cov}\left(W_i,W_j\right) = \operatorname{Cov}\left(W_1,W_2\right) = -\frac{n(N-n)}{N^2(N-1)}.$$

Consequently, assuming $N\gt 1,$ for fixed coefficients $(x_i)$ and $(y_i)$ and writing $\bar x = \sum x_i/N,$ $\bar y = \sum y_i/N,$ and $\overline{xy}=\sum_{i}x_iy_i/N,$ we find

$$\begin{aligned} \operatorname{Cov}\left(\sum_{i=1}^N x_iW_i, \sum_{j=1}^N y_jW_j\right) &= \sum_{i=1}^N x_iy_i \operatorname{Var}\left(W_i\right) + \sum_{i\ne j}^N x_iy_j\operatorname{Cov}\left(W_i,W_j\right) \\ &= \frac{n(N-n)}{N^2}\sum_{i=1}^Nx_iy_i - \frac{n(N-n)}{N^2(N-1)}\sum_{i\ne j}^N x_iy_j \\ &= \frac{n(N-n)}{N}\overline{xy} - \frac{n(N-n)}{N-1} \bar{x}\bar{y} + \frac{n(N-n)}{N(N-1)}\overline{xy}\\ &= \frac{n(N-n)}{N-1}\left(\overline{xy} - \bar{x}\bar{y} \right). \end{aligned}$$

(When $N=1$ the double sum doesn't appear and the result easily reduces to $0.$)

If we draw one of the $(x_i,y_i)$ randomly and equiprobably from all $N$ of these paired values, we have a bivariate random variable $(X,Y),$ enabling the result to be written

$$\operatorname{Cov}\left(\sum_{i=1}^N x_iW_i, \sum_{j=1}^N y_jW_j\right) = \frac{n(N-n)}{N-1} \operatorname{Cov}(X,Y).$$

I was tempted to check this result with simulation, but elected to use an exhaustive enumeration of all the possible samples instead, of which there are $\binom{N}{n}.$ For small $N$ this is feasible and gives precise results. The output computes the covariance of the weighted sums in three ways: using the formula in terms of $\overline{xy}-\bar{x}\bar{y},$ the formula in terms of $\operatorname{Cov}(X,Y),$ and--this is the verification--the population covariance of all possible sample sums.

An example of its output for $N=20,$ $n=15$ is

Direct Formula Covariance formula         Exhaustive 
     -3.035239          -3.035239          -3.035239

demonstrating agreement in this case.

Here's the R code.

#
# Create *any* bivariate population you like.
#
N <- 20
# set.seed(17)
x <- rnorm(N)
y <- rnorm(N) - x
#
# Specify the sample size.
#
n <- 15
if(choose(N, n) > 1e6) stop("Are you sure you want to do this?", call.=FALSE)
# 
# Compute the distribution of the sample sum.
#
W <- combn(1:N, n)
wx <- apply(W, 2, function(w) sum(x[w]))
wy <- apply(W, 2, function(w) sum(y[w]))
#
# Compare various formulae.
#
c(`Direct Formula` = n * (N-n) / (N-1) * (mean(x*y) - mean(x)*mean(y)),
`Covariance formula`=n * (N-n) / N * cov(x, y),
Exhaustive = cov(wx, wy)*(1 - 1/length(wx)))

# plot(wx, wy) # Can be interesting...
```

Best Answer

Related Solutions

Solved – Likelihood Ratio for the Bivariate Normal distribution

Mathematical Statistics – How to Find the Covariance of Two Random Sums

Related Question