Check if a sample is representative of a population if I know the population’s parameters in advance

hypothesis-testing, inference, python, sample

I have a dataset with the real heights of an entire generation of high school students. Since I have the height of literally every student in the generation, I know the population's mean, $\mu$, as well as its variance, $\sigma^2$.

The problem: I was given a non-random sample of this population and was asked to check if it is representative of the population.

My approach: Under random sampling, the sample mean should be (at least approximately) normal with mean $\mu$ and variance $\frac{\sigma^2}{n}$, so I think I can test whether the observed $\bar{x}$ is consistent with that distribution.

  • $H_0$: $\bar{x} \sim N(\mu, \frac{\sigma^2}{n})$
  • $H_1$: It doesn't

In Python:

# Libraries
import numpy as np
import pandas as pd
from scipy import stats

# NOTE: `df` is population, `sm` is sample (both are pandas data frames)

# Population parameters
mu = df['height'].mean()
s2 = df['height'].var(ddof=0) # 0 delta-degrees of freedom

# Observed values
n = len(sm)                      # sample size
x_bar = sm['height'].mean()

# Observed (normalized) statistic
z = (x_bar - mu) / np.sqrt(s2 / n)

# p-value for two-tailed test (use |z| so the tail probability is doubled correctly)
p = 2 * stats.norm.sf(abs(z))

Naturally, if p $\leq \alpha$, then I'll reject $H_0$.

My question: Is this a reasonable way to test this hypothesis? If so, can anyone spot any errors in my code? If it isn't reasonable, what other simple tests can be applied to solve this problem?

Best Answer

Here are some things you might want to try:

Suppose the population distribution is $\mathsf{Norm}(\mu=68, \sigma=3.5).$ Then we can use R to take a fictitious sample of size $n = 100$ from a slightly different population. Can we detect that the sample did not come from the supposed population?

First, we can take an informal look at the sample mean and standard deviation of the data.

set.seed(828)
x = rnorm(100, 69, 3.2)
summary(x);  length(x);  sd(x)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  62.61   66.70   68.99   68.93   70.59   76.19 
[1] 100         # sample size
[1] 2.965027    # sample standard deviation

We notice that $\bar X =68.93$ is somewhat different from population mean $\mu = 68$ and that $S_x = 2.965$ is somewhat different from population $\sigma = 3.5.$

We also compare a histogram (blue) of the data with the population density function (red) and the empirical CDF (blue) of the data with the population CDF. Our sample is relatively small, so we don't expect a perfect match of the sample and population graphs, but there are striking discrepancies.

[Figure: histogram of the sample with the population density overlaid (left panel); ECDF of the sample with the population CDF overlaid (right panel)]

R code for figure.

par(mfrow=c(1,2))                            # two panels side by side
hist(x, prob=T, col="skyblue2")              # histogram of the sample (density scale)
 curve(dnorm(x, 68, 3.5), add=T, col="red")  # population density
plot(ecdf(x), col="blue")                    # ECDF of the sample
 curve(pnorm(x, 68, 3.5), add=T, col="red")  # population CDF
par(mfrow=c(1,1))                            # reset layout

A Kolmogorov-Smirnov test compares the sample ECDF with the population CDF. Its test statistic $D$ is the largest vertical distance, in the right panel of the figure, between the CDF and the ECDF. It is not large enough to say that the sample is significantly different from the population at the 5% level. We can see a marked difference between the CDF and the ECDF, but the K-S test is known not to have good power.

ks.test(x, pnorm, 69, 3.5)

        One-sample Kolmogorov-Smirnov test

data:  x
D = 0.08631, p-value = 0.4456
alternative hypothesis: two-sided
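
To make the geometry of $D$ concrete, here is a quick sketch that recomputes the statistic by hand from the sorted sample; it uses the same reference distribution passed to ks.test above, and the variable names (Dplus, Dminus) are just for illustration.

# Sketch: recompute the K-S statistic D directly from the sorted sample.
# D is the largest vertical gap between the ECDF and the reference CDF,
# checked just before and just after each jump of the ECDF.
xs <- sort(x)
n  <- length(xs)
F0 <- pnorm(xs, 69, 3.5)          # reference CDF, as in the ks.test call above
Dplus  <- max((1:n)/n - F0)       # how far the ECDF rises above the CDF
Dminus <- max(F0 - (0:(n-1))/n)   # how far the CDF rises above the ECDF
max(Dplus, Dminus)                # should reproduce the D reported by ks.test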

We can also use a one-sample t test to see if $\bar X$ differs significantly from the population mean $\mu.$ The P-value $0.0022$ strongly rejects $H_0: \mu = 68$ in favor of $H_a: \mu \ne 68.$

t.test(x, mu = 68)

        One Sample t-test

data:  x
t = 3.1436, df = 99, p-value = 0.002203
alternative hypothesis: true mean is not equal to 68
95 percent confidence interval:
 68.34376 69.52042
sample estimates:
mean of x 
 68.93209 

Also, we can use the relationship $Q = \frac{99 S_x^2}{\sigma^2} \sim \mathsf{Chisq}(\nu=99)$ to get a 95% CI for $\sigma$ based on our sample of size $n = 100.$ The CI is of the form $\left(\sqrt{99 S_x^2/U},\, \sqrt{99 S_x^2/L}\right),$ where $L$ and $U$ cut probability $0.025$ from the lower and upper tails, respectively, of $\mathsf{Chisq}(\nu=99).$ For our fictitious data the 95% CI works out to $(2.603, 3.444),$ which does not include the population $\sigma = 3.5.$

sqrt(99*var(x)/qchisq(c(.975,.025),99))
[1] 2.603314 3.444399

It follows that the null hypothesis $H_0: \sigma = 3.5$ would be rejected at the 5% level of significance in favor of the alternative $H_a: \sigma \ne 3.5.$

In summary, graphical methods suggest that the sample was not taken at random from $\mathsf{Norm}(\mu=68,\sigma=3.5),$ and specific tests for parameter values show a disagreement between our fictitious sample and the target population. However, a Kolmogorov-Smirnov test does not detect a discrepancy between the sample ECDF and the population CDF.

Note: One test of goodness of fit of the sample to the population would have been a chi-squared test. You could use the histogram function in R to find the observed counts in each of the eight histogram bins and use the CDF of the population to find the expected counts. Some of the expected counts in the tails may be below 5, so you may need to combine bins.

With such a test, I could not reject the null hypothesis that my fictitious data are consistent with the supposed population. You may want to try a chi-squared test for your actual data.

In R, here is how to get relevant counts and bin boundaries using a non-plotted histogram:

hist(x, plot=F)
$breaks
[1] 62 64 66 68 70 72 74 76 78

$counts
[1]  4 14 18 27 22  9  5  1

....
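
Putting the pieces together, here is a minimal sketch of the chi-squared goodness-of-fit test described above, assuming the same fictitious sample x and the supposed population $\mathsf{Norm}(\mu=68, \sigma=3.5)$; pooling the three sparse bins above 72 is just one reasonable choice, which you would adapt to your own bin counts.

# Sketch: chi-squared goodness of fit of the sample to Norm(68, 3.5),
# using the histogram bins shown above, with the outer bins treated as
# open-ended and the three sparse bins above 72 pooled into one.
h   <- hist(x, plot=F)
obs <- c(h$counts[1:5], sum(h$counts[6:8]))      # observed counts in 6 bins

cuts   <- c(64, 66, 68, 70, 72)                  # interior bin boundaries
p_bin  <- diff(c(0, pnorm(cuts, 68, 3.5), 1))    # bin probabilities under the population
exp_ct <- length(x) * p_bin                      # expected counts

X2 <- sum((obs - exp_ct)^2/exp_ct)               # chi-squared statistic
pchisq(X2, df=length(obs)-1, lower.tail=F)       # parameters fully known, so df = k - 1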