Mathematical Statistics – What If a Non-Random Sample is Identical to a Random Sample?

mathematical-statistics, sampling

Sometimes, in political polls, pollsters take non-random samples from a given population, but then they apply the results of the theory of random sampling to their non-random sample. I've heard someone (not a statistician) argue that this is still a valid procedure because the non-random sample obtained is one of the possible random samples.

In fact, suppose the following happens: Researcher 1, through some non-random sampling method, selects individuals A, B, C. Researcher 2 makes use of random sampling, and obtains the same sample A, B, C. Both apply random sampling theory to analyse their sample. What's the difference? What makes researcher 1 wrong?

Thoughts

My only thought about this, at least so far, is that what makes a random sample theoretically valid is the procedure that random sampling dictates, and not the particular sample obtained.

If that weren't the case, you could fix basically any sample you want (say, a sample of 3000 white, 24-year-old, college-educated women), then claim that this sample is okay to use because it is one of the possible random samples of 3000 people from your population.

Best Answer

A particularly biased / non-representative sample is unlikely if you sample randomly.

In an ideal world you'd have a non-random sample that represents the population perfectly, so that every demographic makes up the same proportion of the sample as it does of the population as a whole.

This is a pretty hard problem to solve in the real world though (to say the least), as you'd need to understand every demographic and how it affects your results. You might say "white, 24-year-old, college-educated women" is specific enough and you just need to make sure your sample has the right proportion of such people (and similarly for every other similar demographic), but they may be more or less likely to act in a certain way based on where they live, where they studied, where they grew up, their religion and many other factors. So you need to take all of that into account too. That'll be a whole lot of work, and in the process you'll probably answer your original query anyway without ever using the sample you generated. Basically, doing that just doesn't make a whole lot of sense.

In the real world a random sample is a "good enough" attempt to obtain an accurate representation of the population.

Now it is indeed possible to get a random sample that doesn't reflect what the population as a whole looks like particularly well (i.e. a "biased" sample).

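But the probability of drawing a badly unrepresentative sample when sampling randomly drops sharply the further the sample's make-up departs from that of the population as a whole. (Under simple random sampling every specific sample is equally likely; there are just far more roughly representative samples than heavily skewed ones.) This applies especially when you have larger samples.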
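
To put rough numbers on this, here is a minimal sketch that computes the exact binomial probability of the sample share missing the population share by more than a given margin. It assumes independent draws (sampling with replacement, or from a very large population); the function name, the 70% share, and the 10-point margin are illustrative choices, not anything from the question.

```python
from fractions import Fraction
from math import comb

def prob_share_off_by_more_than(n, p, tolerance):
    """Exact probability that the sample share differs from p by more than tolerance.

    Assumes n independent draws, each a "success" with probability p.
    p and tolerance are Fractions so the boundary comparison is exact.
    """
    p = Fraction(p)
    tolerance = Fraction(tolerance)
    pf = float(p)
    total = 0.0
    for k in range(n + 1):
        # Keep only the outcomes whose share k/n is too far from p.
        if abs(Fraction(k, n) - p) > tolerance:
            total += comb(n, k) * pf**k * (1 - pf) ** (n - k)
    return total

# Illustrative values: population share 70%, margin of 10 percentage points.
results = {
    n: prob_share_off_by_more_than(n, Fraction(7, 10), Fraction(1, 10))
    for n in (10, 100, 1000)
}
for n, prob in results.items():
    print(f"n = {n:>4}: P(share off by more than 10 points) = {prob:.3g}")
```

The probability of a badly skewed sample shrinks quickly as `n` grows, which is the sense in which larger random samples are safer.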

This is acceptable since statistics is generally about having high confidence of being correct rather than having absolute certainty.

Think of it this way: if 70% of your population is women and you randomly pick one person, you have a 70% chance of picking a woman. So you would expect roughly 70% of your random sample to be women. The maths might not work out to exactly 70% in all cases, but that's the general idea. So the sample proportions should roughly correspond to the proportions of the population as a whole. You should be rather surprised if your sample somehow ends up with 0% women.
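
The 70% example can be made concrete with a short sketch. It assumes each person is picked independently with a 70% chance of being a woman; the function names and the fixed seed are illustrative only.

```python
import random

def prob_no_women(n, share_women=0.7):
    """Probability that n independent random picks include no women at all."""
    return (1 - share_women) ** n

def simulate_share(n, share_women=0.7, seed=0):
    """Draw one simulated random sample of size n and return its share of women."""
    rng = random.Random(seed)  # fixed seed only so the run is repeatable
    women = sum(rng.random() < share_women for _ in range(n))
    return women / n

for n in (1, 10, 30):
    print(f"n = {n:>2}: P(no women) = {prob_no_women(n):.3g}")

share = simulate_share(1000)
print(f"share of women in one simulated sample of 1000: {share:.3f}")
```

With a single pick, a sample with no women has probability 0.3; with even a modest sample size it becomes vanishingly unlikely, while the simulated share lands near 70%.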


There could also be issues depending on how you obtain a random sample. If you want to sample from everyone living in a country, you could, for example, get a random subset of registered voters or people with driver's licences. But then your sample would contain only people who are registered to vote or have driver's licences, so it would be biased against everyone else in the country.

This may also lead to a partially random sample, where you combine differently-sized random samples from different sources such that the end result is more representative of the population as a whole, although I'm not sure whether and how often this is done in practice. Finding a single data source for the entire population would be preferable.
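
One common way to make this kind of combination precise is to weight each source's result by the share of the population it stands for (a stratified, or weighted, estimate). The sketch below assumes those population shares are known and that the sources do not overlap; every number and name in it is hypothetical.

```python
def weighted_estimate(strata):
    """Combine per-group sample means using known population shares.

    strata: list of (population_share, sample_mean) pairs; shares must sum to 1.
    """
    total_share = sum(share for share, _ in strata)
    if abs(total_share - 1.0) > 1e-9:
        raise ValueError("population shares must sum to 1")
    return sum(share * mean for share, mean in strata)

# Hypothetical: licence holders are 60% of the population, everyone else 40%.
# Support for some proposal is 0.50 and 0.30 in the two random samples.
strata = [(0.6, 0.50), (0.4, 0.30)]
estimate = weighted_estimate(strata)

# Naive pooling of the raw samples (900 and 100 respondents, also hypothetical)
# over-represents licence holders and gives a different answer.
pooled = (900 * 0.50 + 100 * 0.30) / (900 + 100)

print(f"weighted estimate: {estimate:.2f}")
print(f"naively pooled:    {pooled:.2f}")
```

The weighting corrects for the sources being sampled at different rates, but it cannot fix a source that misses part of the population entirely.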

But that's a whole other question.
