Confidence Intervals – How to Calculate for Multiple Variables in Multivariate Analysis

Tags: confidence interval, mean, multivariate analysis, normal distribution

Consider the univariate case: suppose we have some sample data on the heights of men and women, and we are interested in estimating the average heights of men and women in the population, along with the variance of these estimates.

For instance, based on the data (assuming it comes from an underlying normal distribution), we might conclude that:

  • The average height of men is 161 cm "plus-minus" 6.1 cm
  • The average height of women is 152 cm "plus-minus" 2.3 cm

In this case, we can see that there was a much larger "spread" in the data for men compared to the data for women. Thus, the "confidence interval" will naturally be wider for men and narrower for women.
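The univariate comparison can be sketched in base R. This is only an illustration with simulated numbers standing in for the data above (the sample sizes, the `ci` helper, and the standard deviations are hypothetical, not from the question):

```r
# Sketch with simulated data (hypothetical samples, not the poster's):
# a t-based confidence interval for the mean of each group.
set.seed(1)
men   <- rnorm(40, mean = 161, sd = 12)  # larger spread
women <- rnorm(40, mean = 152, sd = 5)   # smaller spread

# Two-sided (1 - alpha) confidence interval for the mean
ci <- function(x, level = 0.95) {
  n <- length(x)
  half <- qt(1 - (1 - level) / 2, df = n - 1) * sd(x) / sqrt(n)
  c(lower = mean(x) - half, upper = mean(x) + half)
}

ci(men)    # wider interval, reflecting the larger spread
ci(women)  # narrower interval
```

The interval width is driven by the sample standard deviation, which is exactly the "spread" comparison made above.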

My Question: How can we extend this analysis for multivariate data?

For example, suppose we have data on the height, weight, and salary of men and women. We believe that this data comes from some multivariate normal distribution, and that this distribution has a "non-trivial" covariance matrix (i.e. the off-diagonal elements of the covariance matrix are not necessarily 0).

This means we now have a "mean vector" of height, weight, and salary measurements for both men and women. And of course, we now also have a "confidence ellipsoid" corresponding to the variance in these measurements.

  • In general, how do we apply the idea of "confidence intervals" to this multivariate data?

In the univariate case, we were able to determine that the confidence interval was "tighter" for women compared to men.

Can we somehow comment on the "hypervolume" of the confidence ellipsoids corresponding to these multivariate estimates, and determine whether this hypervolume is larger in the estimates for men than for women?

Thanks!

Best Answer

Could you use the generalised variance? The generalised variance is simply $\text{det}(\Sigma)$. To describe the precision of the estimated mean vector rather than the spread of the data, note that the sample mean has covariance matrix $\Sigma / n$, so its generalised variance is $\text{det}(\Sigma) / n^p$ for $p$ variables (not simply $\text{det}(\Sigma)$ divided by $n$).
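As a sketch in R (with simulated stand-in data, since we don't have the poster's): the generalised variance of the data is the determinant of the sample covariance matrix, and because the sample mean vector has covariance $\Sigma / n$, its generalised variance scales as $\text{det}(\Sigma) / n^p$ for $p$ variables.

```r
# Sketch: generalised variance of simulated 3-variable data
# (a stand-in for height/weight/salary), and the generalised
# variance of the sampling distribution of the mean vector.
set.seed(1)
n <- 200
X <- cbind(rnorm(n), rnorm(n), rnorm(n))  # hypothetical data, p = 3

S <- cov(X)                # sample covariance matrix
gen_var      <- det(S)     # generalised variance of the data
gen_var_mean <- det(S / n) # equals det(S) / n^3, since p = 3
```

Comparing `gen_var_mean` across groups is one way to compare the overall precision of the two mean-vector estimates.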

We can illustrate with a brief two-dimensional example. Consider variables $X$, $Y$ with correlation $\rho$ and covariance matrix \begin{equation} \Sigma = \begin{pmatrix} \sigma^2_X & \rho \sigma_X \sigma_Y \\ \rho \sigma_X \sigma_Y & \sigma_Y^2 \end{pmatrix} \end{equation} Then $\text{det} (\Sigma) = \sigma^2_X \sigma_Y^2 - \rho^2 \sigma^2_X \sigma_Y^2 = \sigma^2_X \sigma_Y^2 (1 - \rho^2)$. If $\rho$ is close to either $1$ or $-1$ then $\text{det} (\Sigma) \approx 0$, whereas $\text{det} (\Sigma)$ is largest when $\rho = 0$.
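This connects directly to the "hypervolume" part of the question: for a bivariate normal, the $(1-\alpha)$ confidence region for an observation is an ellipse $\{x : x^\top \Sigma^{-1} x \le \chi^2_{2,1-\alpha}\}$ with area $\pi \, \chi^2_{2,1-\alpha} \sqrt{\text{det}(\Sigma)}$, so a smaller generalised variance means a smaller region. A sketch in base R (the `ellipse_area` helper is mine, not standard):

```r
# Sketch: area of the (1 - alpha) confidence ellipse for a
# bivariate normal is pi * qchisq(level, 2) * sqrt(det(Sigma)),
# i.e. proportional to the square root of the generalised variance.
ellipse_area <- function(Sigma, level = 0.95) {
  pi * qchisq(level, df = 2) * sqrt(det(Sigma))
}

Sig_indep <- diag(2)                          # rho = 0
Sig_corr  <- matrix(c(1, 0.8, 0.8, 1), 2, 2)  # rho = 0.8

ellipse_area(Sig_indep)  # larger area
ellipse_area(Sig_corr)   # smaller, "squashed" ellipse
```

In higher dimensions the same idea gives an ellipsoid hypervolume proportional to $\sqrt{\text{det}(\Sigma)}$, which is how the two groups' ellipsoids could be compared.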

Below is a plot of some bivariate normal samples with $\sigma_X = \sigma_Y = 1$, $\mu_X = \mu_Y = 0$ and some different values of $\rho$. Notice that the more "squashed" ellipses have a lower generalised variance.

[Figure: bivariate normal samples with unit variance, one panel per rho in (-0.8, -0.3, 0.3, 0.8)]

R code to make the plot:

library(mvtnorm)

# Generalised variance for the 2-d case: det(Sigma) = (sx*sy)^2 * (1 - rho^2)
gen_var = function(rho, sx = 1, sy = 1) {
  (sx * sy)^2 - (sx * sy * rho)^2
}

# 2x2 covariance matrix for correlation rho
cov_mat = function(rho, sx = 1, sy = 1) {
  matrix(c(sx^2, rho * sx * sy, rho * sx * sy, sy^2), ncol = 2)
}

rho_seq = c(-0.8, -0.3, 0.3, 0.8)
par(mfrow = c(2, 2))
n = 100
x.points = seq(-3, 3, length.out = n)
y.points = x.points
z = matrix(0, nrow = n, ncol = n)
mu = c(0, 0)

set.seed(123)
for (k in 1:4) {
  Sig = cov_mat(rho_seq[k])
  X = rmvnorm(n, mean = mu, sigma = Sig)
  plot(X, main = paste0("rho = ", rho_seq[k], ". Gen var = ", gen_var(rho_seq[k])),
       xlab = "x", ylab = "y", pch = 19, col = rgb(0, 0, 0, alpha = 0.4))
  # Evaluate the density on a grid for the contour overlay
  # (note the grid indices i, j must not reuse the panel index k)
  for (i in 1:n) {
    for (j in 1:n) {
      z[i, j] = dmvnorm(c(x.points[i], y.points[j]),
                        mean = mu, sigma = Sig)
    }
  }
  contour(x.points, y.points, z,
          add = TRUE, col = rgb(1, 0, 0, alpha = 0.4))
}