Solved – Calculating sample mean and sample variance on all samples vs distinct subsets


Consider 1000 samples drawn from an unknown distribution. What's the difference between the following two ways of calculating the sample mean and sample variance?

  1. Find the sample mean and sample variance over all 1000 samples.

  2. Find the sample mean over the first 500 samples and the sample variance over the remaining 500 samples.

What is the difference, and which method is preferred?

Best Answer

If the data are normal, the sample mean and the sample variance are independent. If the data are not normal, the covariance between the sample mean and the sample variance is $O(\kappa_3 n^{-1})$, where $\kappa_3$ is the third central moment (the unnormalized skewness of the distribution); in fact $\operatorname{Cov}(\bar X, S^2) = \kappa_3/n$ exactly. If, for any reason, you really need the sample mean and the sample variance to be independent, then you can calculate these statistics on independent subsets of the data. However, as others have noted, you will suffer efficiency losses.
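A small simulation illustrates both effects for a skewed distribution. This is a sketch, not part of the original answer: it uses the exponential distribution (for which $\kappa_3 = 2$ with rate 1) and compares the full-sample method with the split-sample method over many replications.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 1000, 20000

# Skewed data: exponential with rate 1, so kappa_3 = 2 and sigma^2 = 1
x = rng.exponential(size=(reps, n))

# Method 1: mean and variance computed on the same 1000 samples
m_all = x.mean(axis=1)
v_all = x.var(axis=1, ddof=1)

# Method 2: mean on the first 500, variance on the last 500
m_split = x[:, :500].mean(axis=1)
v_split = x[:, 500:].var(axis=1, ddof=1)

# Method 1 shows a clear positive correlation; Method 2 is ~0
print(np.corrcoef(m_all, v_all)[0, 1])
print(np.corrcoef(m_split, v_split)[0, 1])

# Efficiency loss: the split-sample mean has roughly twice the
# sampling variance of the full-sample mean (1/500 vs 1/1000)
print(m_all.var(), m_split.var())
```

For this distribution the theoretical correlation under Method 1 is $\kappa_3/\sqrt{\mu_4 - \sigma^4} = 2/\sqrt{8} \approx 0.71$, and the simulation lands near that value, while the split estimators are exactly independent by construction.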

Cross-validation, which you may have been thinking of, essentially fights correlations between different statistics computed on the same data. There are applications where that's a dire necessity. E.g., in regression, the correlation between $\hat y_i$ and $y_i$ is governed by the hat-value $h_{ii}$, the $i$-th diagonal entry of the projection matrix $X(X'X)^{-1}X'$, and is $O(1)$, so you may need to suppress it when you do model selection or residual diagnostics (or at least control for it with degrees-of-freedom corrections like $n-p = n-\sum_i h_{ii}$). But there are applications where splitting the sample is ridiculous overkill.
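The hat-value claim can be checked numerically. The sketch below (my own illustration, with an arbitrary $50 \times 3$ design matrix) computes the diagonal of $H = X(X'X)^{-1}X'$, verifies that the hat-values sum to $p$ (the trace of a rank-$p$ projector), and confirms by simulation that $\operatorname{Cov}(\hat y_i, y_i) = \sigma^2 h_{ii}$ under the linear model, which is what makes the correlation $O(1)$ rather than vanishing with $n$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 50, 3
# Arbitrary design: intercept plus two standard-normal regressors
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])

# Hat (projection) matrix H = X (X'X)^{-1} X'
H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)

# Trace of a projector equals its rank, so the hat-values sum to p
print(h.sum())

# Simulate y = X beta + noise and check Cov(y_hat_i, y_i) = sigma^2 h_ii
reps, sigma = 40000, 1.0
beta = np.zeros(p)  # beta is irrelevant to the covariance; zero for simplicity
Y = X @ beta + sigma * rng.normal(size=(reps, n))
Yhat = Y @ H.T  # H is symmetric; transpose kept for clarity

i = 0
cov_emp = np.cov(Yhat[:, i], Y[:, i])[0, 1]
print(cov_emp, sigma**2 * h[i])  # these should agree closely
```

Equivalently, $\operatorname{Corr}(\hat y_i, y_i) = \sqrt{h_{ii}}$, since $\operatorname{Var}(\hat y_i) = \sigma^2 h_{ii}$; this is the quantity that degrees-of-freedom corrections account for in aggregate.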