Solved – Distribution of Sample Means Compared to Population Mean

distributions, group-differences, mean, sampling

Assumptions

I have a population (N=5000) and I know everything about it (all point values, mean, standard deviation, etc.). From this, I will sample 1000 items. I can calculate the mean of the sample and compare it to the population mean. Oh, and the first assumption (truly a fact) is that I am an armchair statistician, not at all trained.

Problem Statement

I want to generalize an answer to this question: given that I know everything about the population, how large a percentage difference x should I expect between a random sample mean and the population mean some percentage (say, 50%) of the time? Put another way, what is the probability that a random sample mean differs from the population mean by more than x%?

What I've Tried

I did a brute-force analysis of one population. I took 100 samples of 1000 items each, calculated the mean of each sample, and looked at the distribution of these means. I observed, as expected (I think), that the differences between the sample means and the population mean are roughly normally distributed around zero. So, with 100 trials analyzed, I was able to say that random sampling of this particular population will produce sample means more than +/-5% different from the population mean 50% of the time, due to chance. (Is that a correct conclusion?)
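For concreteness, the brute-force experiment was roughly like the sketch below (the lognormal population here is a hypothetical stand-in for the real data, which isn't shown):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the real population: 5,000 right-skewed (lognormal) values.
population = rng.lognormal(mean=0.0, sigma=1.5, size=5000)
pop_mean = population.mean()

n_trials, sample_size = 100, 1000
pct_diff = np.empty(n_trials)
for i in range(n_trials):
    # Draw 1000 items without replacement, as in the original experiment.
    sample = rng.choice(population, size=sample_size, replace=False)
    pct_diff[i] = 100 * (sample.mean() - pop_mean) / pop_mean

# The median of |% difference| is the threshold exceeded about 50% of the time.
print("median |% difference|:", np.median(np.abs(pct_diff)))
```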

I also looked at some related questions here but I did not find quite what I was looking for. I get that the mean of the sample means should converge to the population mean, but I can't quite make the leap to understand how the sample means are distributed around the population's mean.

Other Possibly Useful Information

  • The population's variable of interest is always right-skewed (and possibly has a lognormal distribution)
  • All population variable values are positive
  • The population's standard deviation is typically 2-5 times the population's mean

Best Answer

The distribution of the difference (sample.mean-population.mean) depends on the population standard deviation and the sample size (in particular, the standard deviation of the difference is related to both -- it's $\sigma/\sqrt{n}$).

This is the (true) standard error of the sample mean.
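A quick simulation check of that formula might look like the sketch below (the population is a hypothetical lognormal stand-in, and samples are drawn i.i.d., i.e. with replacement, which is the setting where $\sigma/\sqrt{n}$ applies exactly):

```python
import numpy as np

rng = np.random.default_rng(1)
population = rng.lognormal(0.0, 1.5, size=5000)   # hypothetical skewed population
sigma, n = population.std(), 1000

# Compare the empirical spread of many sample means to sigma / sqrt(n).
sample_means = [rng.choice(population, size=n, replace=True).mean()
                for _ in range(2000)]

print("empirical SD of sample means:", np.std(sample_means))
print("sigma / sqrt(n):             ", sigma / np.sqrt(n))
```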

The distribution of the percentage difference will depend on where the population mean is. Consider that the percentage difference is basically the raw difference divided by the mean, multiplied by 100. (I'll leave aside the x100 for now.)

So instead of $\sigma/\sqrt{n}$ for the standard deviation of the difference, you're dealing with $\frac{\sigma}{\mu\sqrt{n}}$ for the standard deviation of the relative difference.

Consider I have a mean of 100 and a standard error of 5. I compute a relative standard error of 0.05 (5%). Now if I subtract 90 from every observation, my standard error is unchanged (it doesn't involve the mean), but my relative standard error has jumped to 0.5 (50%). So I can't make general comments about the size of the relative standard error without reference to where the mean is. It depends on that mean, quite directly.
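In code, that arithmetic looks like this:

```python
# The shift example above, spelled out.
mu, se = 100.0, 5.0
print(se / mu)          # 0.05 -> relative standard error of 5%

mu_shifted = mu - 90.0  # subtracting 90 from every observation leaves se unchanged
print(se / mu_shifted)  # 0.5  -> relative standard error of 50%
```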

This is a warning -- when you do a simulation, you can't just draw conclusions from one set of parameter values unless you really understand how they will generalize to other values. If you don't, you have to let the simulation tell you how things change as you vary the mean and standard deviation independently.

Now $\frac{\sigma}{\mu}$ is called the coefficient of variation. It's often useful in situations where the spread tends to increase in proportion to the mean -- generally the same situations in which percentage changes make the most sense.
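To tie this back to the question's setup (a sketch under the stated assumptions: CV between 2 and 5, samples of 1000, and i.i.d. sampling), the relative standard error $\mathrm{CV}/\sqrt{n}$ works out to roughly 6-16%:

```python
import math

n = 1000
for cv in (2.0, 5.0):                # stated range: SD is 2-5 times the mean
    rel_se = cv / math.sqrt(n)       # relative standard error = CV / sqrt(n)
    print(f"CV = {cv}: relative SE ~ {rel_se:.3f} ({100 * rel_se:.1f}%)")
```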
