Descriptive Statistics – Understanding Population vs. Sample

descriptive statistics, population, sample

The problem that bugs me can be boiled down to: "What is a population? (Really)." This is a thought experiment I've been dealing with for a few days.

There is a researcher A, who gathers data from $N = 20$ observations on some variable. He gets $\bar{x_A} = 17.21$. He thinks that he can't measure the whole population, because "the amount of money, time" needed is too much (the usual story lecturers tell students).

But there is also a researcher B, who happened to collect data from the entire population, because it turns out to be feasible: it's something like $N=100$ people, but researcher A didn't know that. You can think of some rare genetic diseases. Not that many people have them, so it's quite possible to get data from all patients suffering from the condition, at least in a thought experiment like this.

The researcher B finally gets $\bar{x_B} = 18.82$. Obviously, $\bar{x_A} \neq \bar{x_B}$.

Questions:

[1] Is the entire set of observations really the population? I know that it may sound stupid (yes, it does), but I wonder if there is some catch here.
If it is, then (according to what they say) you can't use inferential statistics and you can't do e.g. (statistical) significance tests when you are researcher B. Or is the population maybe some theoretical entity, so that researcher B's data is not a population?

[2] It is also said that if you can get data from all observations in the population, you get the population's parameter, so you may write $\mu$ instead of $\bar{x_B}$. So researcher B's $\bar{x_B}$ becomes $\mu = 18.82$… right?

[3] Is the researcher A's mean $\bar{x_A} = 17.21$ somehow wrong?

[4] Can you point me to some valuable literature? 🙂 I feel I need to educate myself on this.

Best Answer

Assuming that the $N=100$ people that Researcher B examined are the set of people that Researcher A was interested in, then yes, that is the "population" of interest in both cases. When we refer to the "population" we are talking about whatever group is of interest, i.e., the group about which we want to make an inference. (For this reason, I often call it the "population of interest" to stress that meaning.) You also appear to be putting the cart before the horse on this, insofar as you are worried that Researcher B does not have an inference to make. Since Researcher B directly observes all the patients in the population, it is not really surprising that he does not require statistical inference: statistical inference is only needed when we want to make inferences about unknown things from known things.

[1] Under the circumstances you have described, the $N=100$ values are the population of interest for both researchers. However, if Researcher B becomes interested in some broader theoretical group beyond those patients, then that broader group would become the new population of interest.

[2] If you are consistent/sensible in your interpretation, you can use either notation, but that is a big if. You will find that texts on model-based sampling theory tend to reserve the Greek letters for parameters of an infinite population, whereas texts on design-based sampling theory use them for quantities of a finite population. When dealing with a finite population, I personally prefer to use the model-based notation, and therefore avoid the Greek notation for population quantities, so I would continue to refer to the population mean as $\bar{x}_{100}$, not $\mu$. This makes it easier to extend things to make inferences about a larger infinite population if needed, and it also reduces the likelihood that someone will misinterpret the meaning of my notation.
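
As a rough sketch of the notation I have in mind (the index set $S$ for Researcher A's sample and the IID "superpopulation" assumption are my own illustrative additions, not something given in the question), the three quantities can be written as:

$$
\bar{x}_{20} = \frac{1}{20} \sum_{i \in S} x_i,
\qquad
\bar{x}_{100} = \frac{1}{100} \sum_{i=1}^{100} x_i,
\qquad
\mu = \mathbb{E}(X_i) = \lim_{N \to \infty} \frac{1}{N} \sum_{i=1}^{N} X_i .
$$

Here $\bar{x}_{20}$ is Researcher A's sample mean, $\bar{x}_{100}$ is the mean of the finite population that Researcher B observed, and $\mu$ only has meaning if the values are treated as draws from some larger (notionally infinite) process with a finite mean, in which case the limit holds almost surely by the law of large numbers. In the design-based convention, the symbol $\mu$ would instead be used for what I am calling $\bar{x}_{100}$.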

[3] The sample mean for the smaller sample is not wrong merely by virtue of being different to the population mean. It is what it is, and so long as it was measured and recorded correctly, it is correct. Often the sample mean is used as a point estimator for the population mean, and when this occurs there will be some error in the estimator (i.e., a difference between the estimator and the thing it is trying to estimate). Estimation error is something that is expected in statistical inference, and we have techniques to measure how big it is likely to be. The presence of estimation error does not mean that the estimator is "wrong".
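
To illustrate the point about estimation error, here is a minimal simulation sketch in Python. Everything in it is an assumption made for illustration: the population values are made-up random numbers (not the data in the question), Researcher A is assumed to take a simple random sample without replacement, and the variable names are my own. It compares one sample mean with the population mean, and then checks by repeated sampling that the typical size of the error matches the usual standard error formula with the finite population correction.

```python
import random
import statistics

random.seed(0)  # fixed seed, only so the sketch is reproducible

# Hypothetical finite population of N = 100 values (made-up data).
population = [random.gauss(18.0, 5.0) for _ in range(100)]
N = len(population)
pop_mean = statistics.fmean(population)

# Researcher A: one simple random sample of n = 20, without replacement.
n = 20
sample = random.sample(population, n)
sample_mean = statistics.fmean(sample)
estimation_error = sample_mean - pop_mean

# Estimated standard error of the sample mean, using the
# finite population correction (1 - n/N).
s = statistics.stdev(sample)  # sample standard deviation (n - 1 divisor)
estimated_se = ((1 - n / N) * s**2 / n) ** 0.5

# Repeat the sampling many times to see how the sample mean behaves.
replicate_means = [
    statistics.fmean(random.sample(population, n)) for _ in range(10_000)
]
mean_of_means = statistics.fmean(replicate_means)
empirical_sd = statistics.stdev(replicate_means)

# Theoretical standard error, computed from the full population.
S = statistics.stdev(population)  # population SD with N - 1 divisor
theoretical_se = ((1 - n / N) * S**2 / n) ** 0.5

print(f"population mean:        {pop_mean:.3f}")
print(f"one sample mean:        {sample_mean:.3f}")
print(f"estimation error:       {estimation_error:.3f}")
print(f"estimated SE:           {estimated_se:.3f}")
print(f"mean of sample means:   {mean_of_means:.3f}")
print(f"SD of sample means:     {empirical_sd:.3f}")
print(f"theoretical SE:         {theoretical_se:.3f}")
```

Any single sample mean will differ from the population mean, but across repeated samples the sample means centre on the population mean, and their spread is about what the standard error formula predicts. That is the sense in which the estimator is not "wrong" even though an individual estimate has error.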

[4] Big question --- I'll try to come back to this sometime.
