Solved – Is a chi-square test on (nearly) complete population data necessary?

chi-squared-test, population, sampling, survey

I have data from a survey that was (as far as possible) administered to all children in particular grades in a certain state. I received it after a cleaning step in which the survey designers removed obviously invalid answers (from obnoxious teens).

Questions A and B have binary answers, and I'm interested in reporting the percentages of children in the 2×2 categories.

Of the approximately 100,000 observations I have, there are 3,500 that have missing data for either A or B, and are not included in the table. There's decent reason to believe that for these 2 questions, non-responses won't be particularly biased one way or the other.

What is the proper way to test/summarize any differences between the categories? Is a chi-square test meaningful here? If the non-responses are unbiased, do I just have a really large random sample? Or can I assume that I am just reporting the actual proportions and no statistical testing is needed?

Best Answer

The answer is "it depends".

There is some discussion of this in related questions. Basically, if you are interested only in describing this particular population, you could report just your proportions (possibly after imputing values for the children whose answers are missing) and be done with it. Some hard-liners insist there is no statistical inference to be made (other than the imputation), as you have all the data already.
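For the purely descriptive route, a minimal sketch of what the report could look like is below. The four cell counts are made up for illustration; only the rough totals (about 96,500 complete responses and 3,500 incomplete ones) come from the question. Instead of imputing, it shows the complete-case percentage for each cell together with worst-case bounds, i.e. what the share would be if none or all of the missing children belonged to that cell.

```python
# Hypothetical 2x2 counts for the complete cases. These numbers are
# invented for illustration; only the totals match the question.
counts = {
    ("A=yes", "B=yes"): 30_000,
    ("A=yes", "B=no"): 20_000,
    ("A=no", "B=yes"): 16_500,
    ("A=no", "B=no"): 30_000,
}
n_missing = 3_500  # children with a missing answer to A or B


def describe(counts, n_missing):
    """Return, per cell, the complete-case share and worst-case bounds."""
    n_complete = sum(counts.values())
    n_total = n_complete + n_missing
    summary = {}
    for cell, c in counts.items():
        summary[cell] = {
            # Share among children who answered both questions.
            "complete_case": c / n_complete,
            # Share of everyone if no missing child belongs to this cell.
            "lower": c / n_total,
            # Share of everyone if every missing child belongs to this cell.
            "upper": (c + n_missing) / n_total,
        }
    return summary


for cell, s in describe(counts, n_missing).items():
    print(cell, f"{s['complete_case']:.1%} "
                f"(bounds {s['lower']:.1%} to {s['upper']:.1%})")
```

If the bounds are narrow enough for your purposes, the non-response question largely takes care of itself without any modelling.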

If, however, you wish to answer a question that is not just about the actual finite population but about the data-generating process that produced it, then it is often sensible to treat the "population" as though it were a sample from an infinite set generated by that process. These are often the questions of most theoretical or policy-relevant interest. In that case you can do all the "usual" inference, including chi-square tests.
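If you take the data-generating-process view, the test itself is easy to compute by hand for a 2×2 table. The following is a rough sketch using only the Python standard library, not a recommendation of any particular package; the function name `chi_square_2x2` and the counts in the example call are hypothetical. It uses the Pearson statistic without a continuity correction, and relies on the fact that with one degree of freedom the p-value can be written with the complementary error function.

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square test of independence for the table [[a, b], [c, d]].

    Returns (statistic, p_value, phi). No continuity correction is applied.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if min(row1, row2, col1, col2) == 0:
        raise ValueError("every row and column needs at least one observation")
    margins = row1 * row2 * col1 * col2
    # Shortcut formula for 2x2 tables, equivalent to sum((O - E)^2 / E).
    stat = n * (a * d - b * c) ** 2 / margins
    # With 1 degree of freedom, P(X > stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    # Phi coefficient: an effect size that does not grow with n.
    phi = (a * d - b * c) / math.sqrt(margins)
    return stat, p_value, phi


# Made-up counts for illustration only (rows: A yes/no, columns: B yes/no).
stat, p_value, phi = chi_square_2x2(30_000, 20_000, 16_500, 30_000)
print(f"chi2 = {stat:.1f}, p = {p_value:.3g}, phi = {phi:.3f}")
```

With close to 100,000 observations, even a weak association will give a very small p-value, so it is worth reporting an effect size such as phi alongside the test.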

I personally am of the view that for many purposes it is extremely useful to know whether the relationship observed in the actual population could plausibly have arisen through random chance. For example, we may well be interested in semi-hypothetical populations (other states or times) that are important but too difficult to characterise exactly. Considering the hyper-population from which the population you actually have was drawn can be a good starting point.