Solved – Is the distribution of the ratio of the sample variance to the population variance from a normal population exactly or approximately Chi Square

chi-squared-distribution, discrete data, probability

The reason I ask is this: Suppose you draw a large sample from a normally distributed population, and just by chance all observations have exactly the same value (not very likely, but not impossible either). The sample mean is of course that common value, but the sample variance is zero. So the ratio of the sample variance to the population variance is also zero. At zero the Chi Square distribution is also zero, suggesting that there is no chance of drawing such a sample, yet I just did it.

Where is the disconnect?? What is wrong with this reasoning?? Is the ratio of sample variance to population variance only approximately Chi Square distributed?

Many of the posted answers indicate that it is impossible to draw the exact same value from the same normal distribution more than once. Does this mean that the “disconnect” that I mention in my original question arises from trying to apply a continuous distribution to a discrete population?
Consider the following example. Non-Senior Executive Service white-collar Federal employees are paid on the GS scale. It consists of 15 grades with 10 steps within each grade for a total of 150 possible pay levels. Assume for this discussion that pay is distributed normally with the exception that it is grouped into these 150 pay levels. Clearly it is possible—not very likely but possible—to draw a large sample from this population where the same value is obtained for each observation. (Even if the sample size is large enough to exceed the number of population members in the largest pay group this could occur if sampling is done with replacement—you could get the same guy multiple times.) As a consequence, a zero sample variance could occur.

So, I have some questions:
1. As noted in the original question, a zero sample variance would imply a zero probability of occurrence because the ratio of the sample variance to the population variance times n-1 is Chi Square distributed. Yet it is possible to get such an outcome. Does this mean that because the population is not exactly normal (because of the 150 groups), the distribution of the ratio of the sample variance to the population variance is only approximately Chi Square?

2. Much of the world is discrete. When dealing with discrete populations, how can one proceed to estimate the mean and variance and establish confidence intervals for each without the possibility of stumbling into this sort of situation? Is there a way to design your sampling process to avoid this?

Best Answer

You are misunderstanding something or doing something wrong. It is theoretically impossible (an event of probability zero) for two observations from a normal distribution to be exactly the same, much less for all observations in a large sample to be exactly the same.
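A quick check in R is consistent with this. The sketch below assumes ordinary double-precision output from `rnorm`; the seed and sample size are arbitrary choices for illustration, not taken from the question.

```r
set.seed(2024)                 # arbitrary seed (an assumption, for reproducibility)
z <- rnorm(10000, 50, 5)       # unrounded draws from Norm(50, 5)
anyDuplicated(z)               # index of the first repeated value; 0 means no ties
length(unique(z))              # number of distinct values among the 10000 draws
```

Ties only start to appear once the values are rounded or otherwise forced onto a grid.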

About rounding. In practice, however, one must round normal observations to some number of decimal places, and this can produce ties. (Even though rounded normal data are no longer normal, prudent rounding seldom causes difficulty.)

For example, here are $n = 100$ observations, sampled in R from $\mathsf{Norm}(\mu=50,\sigma=5)$ and rounded to integers. (They are also sorted, making it easy to see the ties.)

set.seed(715)
x = sort(round(rnorm(100, 50, 5)));  x
  [1] 39 40 41 41 41 42 42 43 43 43 43 43 43 43 44 44 45 45 45 45
 [21] 46 46 46 46 47 47 47 47 48 48 48 48 48 48 49 49 49 49 49 49
 [41] 50 50 50 50 50 50 50 50 50 50 50 50 50 51 51 51 51 51 51 52
 [61] 52 52 52 52 52 52 52 52 52 53 53 53 53 53 53 54 54 54 55 55
 [81] 55 55 55 55 55 56 56 56 56 57 57 57 57 57 57 57 60 60 61 61
length(unique(x))
[1] 21

There are many ties: In fact, out of these 100 observations, there are only 21 uniquely different values. Rounding these data to the nearest integer is not good practice. Maybe rounding to one place (79 uniquely different values) or two places (95) would be better.

There are differences in the sample standard deviation depending on rounding: $S=5.012$ for data rounded to integers, $S=4.988$ for rounding to 1 decimal place, $S = 4.992$ for rounding to 2 places, and $S=4.992$ for data as generated by R.
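The comparison can be reproduced along the following lines. This is a sketch that assumes the unrounded data come from the same seed and `rnorm` call as above; the name `sd_by_digits` is made up for illustration.

```r
set.seed(715)
y <- rnorm(100, 50, 5)                      # unrounded data (assumed same draw as above)
digits <- c(0, 1, 2)                        # round to integers, 1 place, 2 places
sd_by_digits <- sapply(digits, function(d) sd(round(y, d)))
names(sd_by_digits) <- paste0("digits=", digits)
sd_by_digits                                # sample SD at each rounding level
sd(y)                                       # sample SD of the data as generated
sapply(digits, function(d) length(unique(round(y, d))))  # distinct values per level
```

Because rounding to integers moves each observation by at most 0.5, the rounded and unrounded standard deviations cannot differ by more than about 0.5.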

Chi-squared distribution related to sample variance. For a random sample $X_1, X_2, \dots,X_n$ from a normal population with standard deviation $\sigma,$ one has

$$Q = \frac{(n-1)S^2}{\sigma^2} \sim \mathsf{Chisq}(\nu = n-1),$$

where $V = S^2 = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar X)^2.$ [Intuitively, one says that using $\bar X$ to estimate $\mu$ when computing $S^2$ amounts to one linear constraint and hence the 'loss' of one degree of freedom.] In general, $E(Q) = \nu$ and $Var(Q) = 2\nu.$ For the samples of size $n = 5$ simulated below, the chi-squared distribution of $Q$ has $\nu = n-1 = 5 - 1 = 4$ degrees of freedom, so $E(Q) = 4$ and $Var(Q) = 8.$
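As a short worked step (added here for illustration, and valid only under the normality assumption), the mean and variance of a chi-squared variable, $E(Q) = \nu = n-1$ and $Var(Q) = 2\nu,$ translate into the mean and variance of $S^2$ itself:

$$E(S^2) = \frac{\sigma^2}{n-1}\,E(Q) = \sigma^2, \qquad Var(S^2) = \frac{\sigma^4}{(n-1)^2}\,Var(Q) = \frac{2\sigma^4}{n-1}.$$

So $S^2$ is unbiased for $\sigma^2,$ and its variability shrinks as $n$ grows.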

We illustrate by using R to find the sample variance $S^2$ for each of a million samples of size $n=5$ from $\mathsf{Norm}(100, 10).$ R carries about 16 places of accuracy internally (often displaying 6 or 7) and the simulation does not round the data. With a million iterations one can expect answers to be correct to about 2 significant digits.

set.seed(715)    # for reproducibility
m = 10^6;  n = 5;  mu = 100;  sg = 10
v = replicate( m,  var(rnorm(n, mu, sg)) )
q = (n-1)*v/sg^2
mean(q)
[1] 4.000429    # aprx E(Q) = 4
var(q)
[1] 8.038763    # aprx V(Q) = 8

hdr="Simulated Dist'n of Q with Density of CHISQ(4)"
hist(q, prob=T, col="skyblue2", main=hdr)
  curve(dchisq(x, 4), add=T, n=10001, col="red", lwd=2)
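
Regarding the worry that the density is zero at $Q = 0$: for a continuous distribution every single point has probability zero, whatever the height of the density there, so the density at 0 says nothing special about that point. The sketch below (degrees of freedom chosen only for illustration) evaluates the chi-squared density at 0 and the probability of landing in a small interval next to 0.

```r
df <- c(1, 2, 4)                      # illustrative degrees of freedom
dens0 <- dchisq(0, df)                # density at 0: Inf, 0.5, 0 respectively
eps <- 1e-3
p_near0 <- pchisq(eps, df)            # P(0 <= Q <= eps): positive for every df
rbind(df, dens0, p_near0)
```

The density at 0 can be infinite, positive, or zero depending on $\nu,$ yet in every case $P(Q = 0) = 0$ while intervals near 0 have small positive probability.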


Addendum. You may be familiar with the 95% t confidence interval for the population mean $\mu$ of a normal population, based on a random sample of size $n.$ It is of the form

$$ \bar X \pm t_c\,S/\sqrt{n},$$ where $\bar X$ and $S$ are the sample mean and sample standard deviation respectively. Also, the values $\pm t_c$ cut probability 0.025 from the upper and lower tails, respectively, of Student's t distribution with $\nu = n -1$ degrees of freedom.
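As a small illustration, this interval can be computed by hand and checked against R's built-in `t.test`. The data are simulated here purely as an assumption for the example.

```r
set.seed(101)                                 # arbitrary seed
x <- rnorm(20, 100, 10)                       # illustrative sample, n = 20
n <- length(x)
t_c <- qt(0.975, df = n - 1)                  # cuts 0.025 from the upper tail
ci_manual <- mean(x) + c(-1, 1) * t_c * sd(x) / sqrt(n)
ci_builtin <- as.numeric(t.test(x)$conf.int)  # default conf.level = 0.95
rbind(ci_manual, ci_builtin)                  # the two rows should agree
```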

The displayed expression for $Q$ above is the basis for a 95% confidence interval for the population variance $\sigma^2,$ based on a chi-squared distribution, as follows:

$$ 0.95 = P(L \le Q \le U) = P\left(\frac 1U \le \frac{\sigma^2}{(n-1)S^2} \le \frac 1L\right) \\ = P\left(\frac{(n-1)S^2}{U} \le \sigma^2 \le\frac{(n-1)S^2}{L}\right),$$

where $L$ and $U$ cut probability 0.025 from the lower and upper tails, respectively, of the chi-squared distribution with $\nu = n - 1$ degrees of freedom. Then the CI is of the form $\left((n-1)S^2/U,\, (n-1)S^2/L\right).$

Thus if $n = 20$ and $S^2 = 49.0,$ a 95% CI for $\sigma^2$ is $(28.34,\, 104.53),$ computed in R as shown below with $\nu = n - 1 = 19$ degrees of freedom. A 95% CI $(5.323,\, 10.224)$ for the population standard deviation $\sigma$ is found by taking square roots of the endpoints.

round(19*49.0 / qchisq(c(.975,.025), 19), 2)
[1]  28.34 104.53
round(sqrt(19*49/qchisq(c(.975,.025), 19)), 3)
[1]  5.323 10.224
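
One way to sanity-check an interval of this form is by simulation: draw many normal samples with a known $\sigma^2$ and count how often the interval covers it. The sketch below assumes $n = 20$ and $\sigma = 7$ for illustration; `var_ci` is a hypothetical helper name, not a built-in R function.

```r
# Hypothetical helper: chi-squared CI for a normal population variance
var_ci <- function(s2, n, level = 0.95) {
  a <- (1 - level) / 2
  (n - 1) * s2 / qchisq(c(1 - a, a), df = n - 1)   # (lower, upper)
}

set.seed(2024)                                     # arbitrary seed
m <- 10^4;  n <- 20;  sigma <- 7
s2 <- replicate(m, var(rnorm(n, 0, sigma)))        # m sample variances
ci <- t(sapply(s2, var_ci, n = n))                 # m x 2 matrix of intervals
coverage <- mean(ci[, 1] <= sigma^2 & sigma^2 <= ci[, 2])
coverage                                           # should be near 0.95
```

The coverage is close to the nominal level only because the data are truly normal; this interval is known to be sensitive to departures from normality.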

Addendum considering only 150 possible salaries. In the simulation below, I do not pretend to have captured the GS salary scale, but I do have a discrete distribution of salaries with 150 possibilities, mean around 65 and SD around 5.6 and that is roughly normal in shape.

There are 10,000 samples of size $n=50,$ yielding about 8400 uniquely different sample standard deviations, and thus about 8400 distinct values of $Q,$ which are nearly distributed as $\mathsf{Chisq}(49).$ [Theoretically, I admit it is possible that a sample of size 50 could have all equal values, but (if I ever bought lottery tickets) my chances of winning a major lottery would be much greater. Running the program again with $n=5,$ I also get good results.]

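set.seed(718); m = 10^4; n = 50
sal = seq(20, 170, len=150)          # 150 possible salary levels
  pr=dbinom(0:149, 149, .3)          # roughly normal-shaped probabilities
x = sample(sal, n*m, replace=TRUE, prob=pr)
sg = var(x)                          # variance of all draws, standing in for sigma^2
DTA = matrix(x, nrow=m)              # one sample of size n per row
s = apply(DTA,1,sd)      # m sample SDs
q = (n-1)*s^2/sg
hist(q, prob=T, col="skyblue2")
  curve(dchisq(x,n-1), add=T, col="red")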
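
For a discrete population, the chance that all $n$ observations coincide can be computed exactly rather than simulated: when sampling with replacement it is $\sum_k p_k^n,$ where $p_k$ is the probability of salary level $k.$ The sketch below applies this to the same binomial-shaped probabilities used in the simulation; `p_all_equal` is a hypothetical helper name.

```r
pr <- dbinom(0:149, 149, .3)                 # same level probabilities as in the simulation
p_all_equal <- function(n) sum(pr^n)         # P(all n draws land on one level)
p5  <- p_all_equal(5)                        # sample size n = 5
p50 <- p_all_equal(50)                       # sample size n = 50
c(n5 = p5, n50 = p50)
```

Both probabilities are positive, so $S^2 = 0$ can occur, which is one way to see that $Q$ is only approximately chi-squared for a discrete population. For $n = 50$ the probability is so small that it has no practical effect on the approximation.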

