Inference – What to Check/Test When Generalizing from a Sample to a Population

descriptive statistics, inference

How do I ensure that I'm allowed to generalize findings from a sample to the population? So far, I'm only aware of the standard error. Is it sufficient by itself, or even valid for this purpose?
Since the confidence interval (CI) is connected to the standard error, do I need both of them, or is the standard error meaningless without its CI?
Are there other tests or quantities I should check?

Best Answer

Here is an example showing that you could easily mistake a sample of size $n=100$ from $\mathsf{Gamma}(\mathrm{shape}=50,\mathrm{rate}=5)$ for a sample from $\mathsf{Norm}(\mu = 10, \sigma=\sqrt{2}).$
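
The two distributions are matched in their first two moments: a gamma distribution with shape $\alpha$ and rate $\lambda$ has mean $\alpha/\lambda$, variance $\alpha/\lambda^2$, and skewness $2/\sqrt{\alpha}$. A quick check in R (the variable names are mine, not part of the original answer):

```r
# Moments of Gamma(shape = 50, rate = 5) from the standard formulas
shape <- 50
rate  <- 5
gamma_mean <- shape / rate        # alpha / lambda
gamma_sd   <- sqrt(shape) / rate  # sqrt(alpha / lambda^2)
gamma_skew <- 2 / sqrt(shape)     # a normal distribution has skewness 0
c(mean = gamma_mean, sd = gamma_sd, skewness = gamma_skew)
```

The mean and SD agree with $\mu = 10$ and $\sigma = \sqrt{2}$; only the modest skewness separates the two models.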

Use R to sample $n=100$ observations at random from the gamma distribution above, and summarize the sample:

set.seed(2022)
x = rgamma(100, 50, 5)
summary(x);  sd(x)
    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   7.182   9.182   9.947   9.979  10.903  14.278 
[1] 1.301298  # sample SD

A Shapiro-Wilk normality test fails to reject the null hypothesis that the population has a normal distribution (P-value $0.46 > 0.05 = 5\%$).

shapiro.test(x)

        Shapiro-Wilk normality test

data:  x
W = 0.98733, p-value = 0.4602
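
A single non-rejection says little about how often the test would detect this departure from normality. One rough way to gauge that is to repeat the experiment many times and record how often Shapiro-Wilk rejects at the 5% level. The sketch below does this; the number of replications and the seed are arbitrary choices of mine, and the estimate will vary somewhat with both.

```r
# Estimate the rejection rate of Shapiro-Wilk for n = 100 from Gamma(50, 5)
set.seed(1)     # arbitrary seed, for reproducibility only
n_rep <- 2000   # arbitrary number of simulated samples
p_values <- replicate(
  n_rep,
  shapiro.test(rgamma(100, shape = 50, rate = 5))$p.value
)
reject_rate <- mean(p_values < 0.05)  # proportion of samples flagged as non-normal
reject_rate
```

A rejection rate well below 1 would mean that samples like the one above are routinely indistinguishable from normal samples by this test.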

Also, a normal probability plot (Q-Q plot) of the 100 observations is very nearly linear, as the plot of a sample from a normal distribution should be.

qqnorm(x); qqline(x, col="blue")

[Figure: normal Q-Q plot of the 100 observations, with reference line in blue]

Moreover, (falsely) assuming normality, a 95% t confidence interval based on the sample is $(9.72,\, 10.24),$ which includes $\mu=10.$

t.test(x)$conf.int
[1]  9.720308 10.236719
attr(,"conf.level")
[1] 0.95
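
On the question of how the standard error and the CI relate: the t interval is the sample mean plus or minus a t quantile times the standard error $s/\sqrt{n}$, so the CI is built from the SE together with a chosen confidence level. A minimal sketch that rebuilds the interval by hand, regenerating the same sample so that the block runs on its own (the names `se`, `t_crit`, and `ci_manual` are mine):

```r
set.seed(2022)
x <- rgamma(100, 50, 5)

n      <- length(x)
se     <- sd(x) / sqrt(n)          # standard error of the mean
t_crit <- qt(0.975, df = n - 1)    # two-sided 95% critical value
ci_manual <- mean(x) + c(-1, 1) * t_crit * se
ci_manual

# Should agree with the interval reported by t.test()
all.equal(ci_manual, as.numeric(t.test(x)$conf.int))
```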

Also, a 95% chi-squared CI for the population variance is $(1.3,\, 2.3),$ which includes $\sigma^2 = 2.$

99*var(x)/qchisq(c(.975,.025),99)
[1] 1.305417 2.285194
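
The R expression above implements the standard interval for a normal variance. Writing $s^2$ for the sample variance and $\chi^2_{p,\,n-1}$ for the quantile of the chi-squared distribution with $n-1$ degrees of freedom that has probability $p$ below it (my notation, chosen to match `qchisq(p, n-1)`), the 95% interval is

$$\left(\frac{(n-1)\,s^2}{\chi^2_{0.975,\,n-1}},\;\; \frac{(n-1)\,s^2}{\chi^2_{0.025,\,n-1}}\right).$$

Here $n = 100$, so $n-1 = 99$, and the larger quantile gives the lower endpoint. Unlike the t interval for the mean, this interval is known to be sensitive to departures from normality.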

Finally, a histogram of the 100 observations seems consistent with a sample from $\mathsf{Norm}(\mu=10,\sigma=\sqrt{2}).$

hist(x, prob=TRUE, col="skyblue2")
curve(dnorm(x, 10, sqrt(2)), add=TRUE, col="brown", lwd=2)  # normal density
curve(dgamma(x, 50, 5), add=TRUE, lty="dotted")             # gamma density

[Figure: histogram of the 100 observations, with the normal density (solid brown) and the gamma density (dotted) overlaid]

The solid brown curve is the density function of $\mathsf{Norm}(\mu=10,\sigma=\sqrt{2}).$ Even if someone suggested that the population might be gamma distributed, the density function (dotted) of $\mathsf{Gamma}(50,5)$ does not seem an obviously better fit to the histogram.

Note: The inability to distinguish between $\mathsf{Norm}(10,\sqrt{2})$ and $\mathsf{Gamma}(50,5)$ from a sample of this size does not detract from the practical use of either distribution in applied probability modeling. Normal distributions are familiar and often used. However, on theoretical grounds it may be preferable to use the corresponding gamma distribution in practice--especially if negative values are impossible or if one needs to accommodate occasional high outliers.
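
To see where the two models part ways, one can compare their quantiles and the probability each assigns to negative values. A small sketch; the probability levels in `p` are arbitrary choices for illustration:

```r
# Compare quantiles of Gamma(50, 5) and Norm(10, sqrt(2)) in the center and tails
p       <- c(0.001, 0.01, 0.5, 0.99, 0.999)
q_gamma <- qgamma(p, shape = 50, rate = 5)
q_norm  <- qnorm(p, mean = 10, sd = sqrt(2))
data.frame(p, q_gamma, q_norm, diff = q_gamma - q_norm)

# Probability of a negative value under each model
p_negative <- c(gamma  = pgamma(0, shape = 50, rate = 5),
                normal = pnorm(0, mean = 10, sd = sqrt(2)))
p_negative
```

The gamma model puts no probability below zero, whereas the normal model always assigns some (here extremely small) probability there; the gamma model's right tail also extends farther.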
