Solved – Can someone explain to me the sampling distribution of sample variance in comparison to that of the sample mean

distributions, inference, sampling

I have read tons of things already about the sampling distribution of the sample variance but I can't get quite a good grasp of exactly what it is like in terms of the formulas of the measurements. Could anyone explain to me in detail what those measurements are in parallel with the sampling distribution of the sample mean?

I will be posting my understanding of the formulas of the sampling distribution of the sample mean so you guys have a reference of what information to parallel.

Sampling Distribution of the Sample Mean

Qualities

  1. Distribution Type

When the population is normally distributed (or approximately so), the sampling distribution of the sample mean is also normally distributed, for any sample size. If the population is not normally distributed, a rough guideline is that the sample size must be greater than or equal to 30 for the sampling distribution to be approximately normal.

  2. Standardized Test Statistic

A Z score is used when the population variance is known and the population is normally distributed (or approximately so); if the population is not normal, the sample size must be greater than or equal to 30.
A t score is used when the population variance is not given and the sample variance is used in its place, especially when the population is roughly normal but the sample size is less than 30.

Quantities

  1. Mean

E(X̄) = μ

  2. Variance

σ is given

V(X̄) = σ^2/n

s is given

V(X̄) = s^2/n

  3. Standardized Test Statistic

Z Score Given σ

Z = (X̄ – μ) / (σ / √n)

Z Score Given s

Z = (X̄ – μ) / (s / √n)

T Score

t = (X̄ – μ) / (s / √n)
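As a quick sketch of the quantities above in code (the sample values and parameters here are made up purely for illustration):

```python
import math
import statistics

# Hypothetical sample and population parameters (illustrative only).
sample = [102.0, 98.5, 101.2, 99.8, 100.7, 97.9, 103.1, 100.4]
mu = 100.0     # hypothesised population mean
sigma = 2.0    # population standard deviation (assumed known, for the Z score)

n = len(sample)
x_bar = statistics.mean(sample)
s = statistics.stdev(sample)  # sample standard deviation (divisor n - 1)

z = (x_bar - mu) / (sigma / math.sqrt(n))  # Z score: sigma is known
t = (x_bar - mu) / (s / math.sqrt(n))      # t score: sigma replaced by s

print(f"x_bar = {x_bar:.3f}, z = {z:.3f}, t = {t:.3f}")
```

Note the two statistics share the same form; only the denominator changes depending on whether σ or s is available.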

Here is some information I have gathered from readings on the topic.
I've been reading that the sampling distribution of the sample variance is tied to a chi-squared distribution with n – 1 degrees of freedom. The quantity (n – 1)S^2 / σ^2 also keeps popping up. Is this the chi-squared analogue, where the chi-square is our standardized test statistic?

There is certainly information online, but I have come here because I really want a cohesive, side-by-side comparison to better grasp the concept.

Best Answer

I'll do my best to summarise some of this in a hopefully digestible way. I think some of the confusion arises from the difference between "the variance of the sample mean" and "the variance of a sample" (and potentially the variance of the sample variance).

1: Variance of the sample mean. Take a sample of size N and calculate its mean. Take another sample, calculate its mean, and so on; now you have lots of sample means. The variance of those means is the variance of the sample mean.

2: Sample variance. Take a sample of size N. Calculate the variance within that sample.

3: Variance of sample variance. As in (1), take many samples of size N, calculate all of their variances, then calculate the variance of these.
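The three notions above can be illustrated with a quick simulation. This is a minimal sketch, assuming a normal population with made-up parameters:

```python
import random
import statistics

random.seed(0)
N = 25          # sample size (made up for illustration)
trials = 10000  # number of repeated samples
mu, sigma = 0.0, 3.0

means, variances = [], []
for _ in range(trials):
    sample = [random.gauss(mu, sigma) for _ in range(N)]
    means.append(statistics.mean(sample))          # one sample mean per sample (for (1))
    variances.append(statistics.variance(sample))  # one sample variance per sample (for (2), (3))

# (1) variance of the sample mean; (3) variance of the sample variance
print("variance of sample means:    ", statistics.pvariance(means))
print("theory sigma^2 / N:          ", sigma**2 / N)
print("variance of sample variances:", statistics.pvariance(variances))
```

The first printed value should land close to σ²/N = 0.36, matching the fact stated next.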

Now let's state some facts about these

Sample Mean:

You sample N times from a distribution with mean $\mu$ and variance $\sigma ^{2}$. The expected value of your sample mean is $\mu$, and the variance of the sample mean (see (1) above) will be $\frac{\sigma ^{2}}{N}$

The above holds for most underlying distributions (there are some restrictions, e.g. the mean/variance must be defined).

If the underlying distribution is Gaussian, then we can say more than just what the expected value and variance of the sample mean will be: we know the full distribution. The sample mean will be normally distributed with mean $\mu$ and variance $\frac{\sigma ^{2}}{N}$ (which is consistent with what I just said), but if the distribution is not normal, then this will only be approximately true as N increases (this is the central limit theorem). The number 30 is not a good benchmark for what a good N is; it depends on the distribution.
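A small simulation (a sketch using an arbitrary non-normal choice, the exponential distribution with mean 1 and variance 1) shows the sample mean's spread shrinking like 1/N, while its distribution becomes increasingly bell-shaped:

```python
import random

random.seed(1)

# Non-normal underlying distribution: exponential(rate = 1), mean 1, variance 1.
def sample_mean(N):
    return sum(random.expovariate(1.0) for _ in range(N)) / N

for N in (5, 50, 500):
    means = [sample_mean(N) for _ in range(2000)]
    avg = sum(means) / len(means)
    var = sum((m - avg) ** 2 for m in means) / len(means)
    # Expected value stays ~1; variance of the sample mean shrinks like 1/N.
    print(f"N={N:3d}  mean of sample means ~ {avg:.3f}  variance ~ {var:.5f} (theory {1 / N:.5f})")
```

A histogram of `means` would also look noticeably skewed at N = 5 and close to Gaussian at N = 500.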

Sample Variance:

If you take a sample of size N from a distribution and calculate the variance of the sample with divisor N ((2) above), its expected value is $\frac{N-1}{N}\sigma^{2}$, i.e. a little bit smaller than the true distribution's variance. (This is why the divisor N – 1, Bessel's correction, is often used instead: it makes the estimator unbiased.) So if you took many samples of size N, calculated the variance within each sample and averaged these variances, you'd expect to get the above.

There is in fact a formula for the variance of the sample variance ((3) above): for a normal population, $\mathrm{Var}(S^{2}) = \frac{2\sigma^{4}}{N-1}$ (with $S^{2}$ the divisor-$(N-1)$ version); for general distributions it also involves the fourth moment.

The above holds for most distributions (as with the sample mean). If however you know the underlying distribution is normal, then again, you don't just know the expected value of the sample variance, you know its full distribution: the scaled quantity $\frac{(N-1)S^{2}}{\sigma^{2}}$ (where $S^{2}$ is the sample variance with divisor $N-1$) follows a chi-squared distribution with $N-1$ degrees of freedom. Since a chi-squared variable with $N-1$ degrees of freedom has expected value $N-1$, this gives $E[S^{2}]=\sigma^{2}$, which is consistent with the divisor-$N$ version having expected value $\frac{N-1}{N}\sigma ^{2}$, although this is not as trivially obvious as in the sample mean case above.
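A quick numerical check of both expectations (a sketch, assuming a normal population with arbitrary parameters): the divisor-N sample variance should average to (N−1)/N · σ², and (N−1)S²/σ² should average to N−1, the mean of a chi-squared distribution with N−1 degrees of freedom (checking the full distributional claim would take more work, e.g. comparing quantiles):

```python
import random
import statistics

random.seed(2)
N, trials = 10, 20000
mu, sigma = 0.0, 2.0   # arbitrary normal population parameters

divisor_n, scaled = [], []
for _ in range(trials):
    sample = [random.gauss(mu, sigma) for _ in range(N)]
    divisor_n.append(statistics.pvariance(sample))                   # divisor N
    scaled.append((N - 1) * statistics.variance(sample) / sigma**2)  # (N-1)S^2/sigma^2, S^2 with divisor N-1

print("mean of divisor-N variances:", statistics.mean(divisor_n))
print("theory (N-1)/N * sigma^2:   ", (N - 1) / N * sigma**2)
# Under normality, (N-1)S^2/sigma^2 is chi-squared with N-1 df, whose mean is N-1.
print("mean of (N-1)S^2/sigma^2:   ", statistics.mean(scaled))
```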

t-statistic (combining the two)

Now when you take a sample from a distribution, the t-statistic is a way of combining the sample mean and sample variance: $t = \frac{\bar{x}-\mu}{s/\sqrt{N}}$, where $\bar{x}$ is the sample mean and $s$ is the square root of the sample variance (with divisor $N-1$). This might seem like a somewhat arbitrary quantity to calculate, but it turns out that one can show that this quantity follows the t-distribution with (N-1) degrees of freedom, provided the underlying data is sampled from a normal distribution.
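To see the heavier tails concretely, here is a sketch that simulates the t-statistic from many small normal samples (arbitrary parameters), using the common convention t = (x̄ − μ)/(s/√N) with s² the divisor-(N − 1) sample variance, and compares its spread to the theoretical t-distribution variance:

```python
import math
import random
import statistics

random.seed(3)
N, trials = 7, 20000
mu, sigma = 0.0, 1.0   # arbitrary normal population

t_values = []
for _ in range(trials):
    sample = [random.gauss(mu, sigma) for _ in range(N)]
    x_bar = statistics.mean(sample)
    s = statistics.stdev(sample)  # square root of the divisor-(N-1) sample variance
    t_values.append((x_bar - mu) / (s / math.sqrt(N)))

# A t distribution with df = N - 1 = 6 has variance df / (df - 2) = 1.5,
# noticeably larger than the standard normal's variance of 1 (heavier tails).
print("empirical variance of t:", statistics.pvariance(t_values))
```

As N grows, that variance heads toward 1 and the simulated t values become indistinguishable from Z scores.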

I'm not very knowledgeable about what this is used for in practice (it's called a t-test, but I don't use them much, so I'll let somebody else take this part), but it involves computing the t-statistic of a sample and then referencing it against a "t-table" to determine whether the sample is likely to have come from a distribution with a hypothesised mean. This is where the number 30 comes in. For samples of size N, you reference a t-table with N – 1 degrees of freedom. It turns out that as N grows to about 30, a t-table starts to look very similar to a Z-table. What that means is that for N > 30, the t-statistic is distributed approximately normally, i.e. like the Z-statistic. The Z-statistic divides the deviation of the sample mean from the hypothesised mean by the distribution's standard deviation (over $\sqrt{N}$) if you know it... but in practice this is rarely the case; when would you ever be sampling from a distribution whose mean you don't know but whose variance you do?

Note that all of this stuff around t- and Z-statistics applies only when you assume that your sample has been drawn from a normal distribution. If you don't know the underlying distribution, you can still make some assertions (subject to some assumptions about the distribution) about the expected value of the sample mean, the variance of the sample mean, and the expected value of the sample variance, but knowing the means and variances of distributions is less powerful than knowing the full distribution.