R – Computation of Variance for the Average of Correlated Random Variables Using R Function

Tags: correlation, covariance, r, sum, variance

I want to calculate the variance of the average of $m$ correlated variables. I found a formula for this in Borenstein et al. (2009), Introduction to Meta-Analysis:

$$\operatorname{Var}\left(\frac{1}{m}\sum_{i=1}^mY_i\right)=\left(\frac{1}{m}\right)^2\operatorname{Var}\left(\sum_{i=1}^m Y_i\right) = \left(\frac{1}{m}\right)^2\left(\sum_{i=1}^m V_i + \sum_{i\ne j}r_{ij}\sqrt{V_i}\sqrt{V_j}\right).$$

I think this is the same formula as in Variance of average of $n$ correlated random variables, except that it is expressed in terms of correlations instead of covariances.
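If I understand correctly, the two forms agree because each covariance can be rewritten in terms of the corresponding correlation,

$$\operatorname{Cov}(Y_i, Y_j) = r_{ij}\sqrt{V_i}\sqrt{V_j},$$

so substituting this into the covariance-based formula yields the expression above.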

I've implemented the formula in R like this:

# Estimates
est <- c(v1 = 1, v2 = 1.3, v3 = 1.1, v4 = 1.2)
# Variance of estimates
var_est <- c(v1 = 0.05, v2 = 0.10, v3 = 0.15, v4 = 0.20)
# Correlations among estimates
cor_v <- c(   1, -0.5, -0.4, -0.3,
           -0.5,    1,  0.3,  0.2,
           -0.4,  0.3,    1,  0.1,
           -0.3,  0.2,  0.1,    1)
correlations <- matrix(cor_v, nrow = 4, ncol = 4, 
                       dimnames = list(names(var_est), names(var_est)))
# Sum the off-diagonal terms r_ij * sqrt(V_i) * sqrt(V_j)
covar_tot <- 0
for(v1 in names(var_est)){
  for(v2 in names(var_est)){
    if(v1 != v2){
      i_j <- correlations[v1, v2] * sqrt(var_est[v1]) * sqrt(var_est[v2])
      covar_tot <- covar_tot + i_j
    }
  }
}

# Average of estimates
mean(est)
# [1] 1.15

# Variance of average of estimates
unname((1/length(var_est)) ^ 2 * (sum(var_est) + covar_tot))
# [1] 0.02904385

My questions are:

  1. Does the implementation look correct?

  2. Better yet, is there an R library that accepts variances and correlations of estimates and returns the variance of the average (or sum) of the estimates?

Best Answer

When implementing statistical formulas, simplicity is your friend. Consequently, this formula is best expressed in matrix form when you are considering programming it.

To achieve such an expression, let $V = (v_{ij})$ be the covariance matrix for a (column) vector of random variables $\mathbf Y = (Y_1, \ldots, Y_m)^\prime.$ The variances $v_{ii} = \sigma_i^2$ are the diagonal entries, equal to the squares of the standard deviations. (Your formula assumes these are all nonzero.)

The correlation matrix $R=(r_{ij})$ is obtained by dividing all rows and all columns by their respective values of $\sigma_i:$

$$ r_{ij} = \frac{v_{ij}}{\sigma_i\sigma_j}.$$

Consequently, $V$ can be recovered from $R$ and the standard deviations by solving this equation, giving

$$ v_{ij} = r_{ij}\sigma_i\sigma_j.$$

In matrix form this can be written

$$V = \Sigma R \Sigma$$

where $\Sigma$ is the diagonal matrix with entries $\sigma_1, \ldots, \sigma_m.$
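As a quick illustration, this identity is a one-liner in R (a minimal sketch with made-up numbers for `sigma` and `R`):

sigma <- sqrt(c(0.05, 0.10))            # standard deviations (made-up values)
R <- matrix(c(1, -0.5, -0.5, 1), 2)     # a 2 x 2 correlation matrix (made-up)
V <- diag(sigma) %*% R %*% diag(sigma)  # V = Sigma R Sigma
V2 <- outer(sigma, sigma) * R           # the same matrix, formed elementwise
stopifnot(all.equal(V, V2))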

Your formula is a special example of the covariance of two linear combinations, say

$$X = a_1 Y_1 + \cdots + a_m Y_m = \mathbf a^\prime \mathbf Y;\quad Z = b_1 Y_1 + \cdots + b_m Y_m = \mathbf b^\prime \mathbf Y$$

for two fixed (non-random column) vectors $\mathbf a$ and $\mathbf b.$ In the question, both these vectors are

$$\mathbf a = \mathbf b = \left(\frac{1}{m}, \ldots, \frac{1}{m}\right)^\prime = \frac{1}{m}\left(1,1,\ldots,1\right)^\prime$$

and you wish to compute the associated covariance

$$\operatorname{Var}(\mathbf a^\prime \mathbf Y) = \operatorname{Cov}(X,Z).$$

Since for any vectors $\mathbf a$ and $\mathbf b,$

$$\operatorname{Cov}(\mathbf a^\prime \mathbf Y, \mathbf b^\prime \mathbf Y) = \mathbf{a}^\prime V \mathbf{b} = \mathbf{a}^\prime \Sigma R \Sigma \mathbf{b},$$

this is what you should program. Why?

  1. It is simple.

  2. Therefore it is more likely to be correctly implemented.

  3. Therefore it is much clearer to any reader what it is intended to do (and what it actually does).

  4. Therefore it is easier to maintain.

  5. It is far more efficient (R can capitalize on multiple cores, vectorized arithmetic, etc.).

  6. It is more general: it can be applied in many similar situations where $\mathbf a$ and $\mathbf b$ do not have the special values in the question.


Here is a direct, unoptimized implementation in R to show how this analysis leads to effective code.

variance <- function(a, b = a, sigma2 = rep(1, length(a)), R = diag(1, length(a))) {
  # Computes Cov(a'Y, b'Y) = a' Sigma R Sigma b from the variances `sigma2` and
  # the correlation matrix `R`. The defaults give unit variances and no correlation.
  sigma <- sqrt(sigma2)
  (a * sigma) %*% R %*% (b * sigma)
}
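As a quick sanity check: with equal weights $1/m$ and the defaults (unit variances, identity correlation matrix), the result should reduce to the familiar $\sigma^2/m,$ which is $1/4$ here.

c(variance(rep(1/4, 4)))  # variance of the mean of 4 uncorrelated unit-variance variables
# [1] 0.25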

This raises the ultimate question: How can we tell that a software implementation is correct? One way is to compare it to a known correct solution. A wise choice of test cases can help, but it is always useful to test random inputs: they can present unanticipated problems.

Here is such a test of this implementation.

m <- 4 # Must be 2 or greater for this to make sense
check <- replicate(50, {
  # Random data: m observations of m variables, plus random coefficient vectors
  Y <- matrix(rnorm(m^2), m)
  a <- rnorm(m)
  b <- rnorm(m)

  # Compare the built-in covariance of the two linear combinations
  # with the formula based on variances and correlations
  c(Direct = cov(Y %*% a, Y %*% b),
    `Our solution` = variance(a, b, diag(var(Y)), cor(Y)))
})

# `1 == 1 - x` is TRUE precisely when x is negligible relative to 1
# (below machine precision), making this a floating-point equality test.
if(1 == (1 - sum((check["Direct", ] - check["Our solution", ])^2))) {
  message("Everything checks.")
} else {
  stop("There's an error!")
}

It creates $50$ datasets of vectors $Y$ and $50$ pairs of vectors $\mathbf a$ and $\mathbf b.$ In each case it computes the covariance of the two linear combinations using the built-in cov function and also uses this solution based on the correlation matrix (computed using the built-in cor function) and the individual variances (using the built-in var function). At the end it compares the two sets of results. When the sum of squares of differences is negligible compared to $1,$ it concludes that "everything checks." Otherwise it warns you there's a problem.

The output when I run it is always "Everything checks."
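An equivalent, perhaps more conventional, way to express the final comparison uses base R's all.equal, which tests numerical equality up to a small relative tolerance:

# The same test, written with a tolerance-based comparison
if (isTRUE(all.equal(check["Direct", ], check["Our solution", ]))) {
  message("Everything checks.")
} else {
  stop("There's an error!")
}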

Now that we have some comfort that this method is correct, let's apply it to the data in the question.

sigma2 <- c(v1 = 0.05, v2 = 0.10, v3 = 0.15, v4 = 0.20)
R <- matrix(c(   1, -0.5, -0.4, -0.3,
              -0.5,    1,  0.3,  0.2,
              -0.4,  0.3,    1,  0.1,
              -0.3,  0.2,  0.1,    1), 4)

# The calculations...
m <- length(sigma2)
c(variance(rep(1/m, m), sigma2 = sigma2, R = R))

[1] 0.02904385

The output agrees with the value computed in the question, which suggests the loop-based implementation is numerically correct for these inputs. Even so, the matrix formulation is preferable for the reasons given above: it is simpler, clearer, more general, and easier to verify.
