Solved – census statistical techniques


All of my statistical training has been about dealing with samples (psych background). I am involved in a project where we have a census dataset (demographic data from all the relocation sites throughout Phnom Penh, Cambodia). I am not certain whether the usual methods of comparing group means, such as ANOVA or the Kruskal-Wallis test, apply to census data. For example, we want to know whether older relocation sites are associated with a higher percentage of households with toilets. If the assumptions of normality, homoscedasticity, etc. were met, I'd use a Pearson correlation. There are issues with the distributions of both variables, though, so we're using Kendall's τ.
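To make concrete what Kendall's τ measures in this setting, here is a minimal plain-Python sketch. The site ages and toilet percentages are made-up illustration values, not the Phnom Penh data, and `kendall_tau_b` is a hypothetical helper name. The tie-corrected τ-b variant is assumed, since many sites share the value 100%.

```python
from math import sqrt

def kendall_tau_b(x, y):
    """Kendall's tau-b: rank correlation with a correction for ties."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    concordant = discordant = ties_x_only = ties_y_only = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0 and dy == 0:
                continue  # tied on both variables: counts toward no term
            elif dx == 0:
                ties_x_only += 1
            elif dy == 0:
                ties_y_only += 1
            elif dx * dy > 0:
                concordant += 1  # pair ordered the same way on both variables
            else:
                discordant += 1  # pair ordered in opposite ways
    # Each factor counts the pairs that are not tied on one of the variables.
    denom = sqrt((concordant + discordant + ties_x_only)
                 * (concordant + discordant + ties_y_only))
    return (concordant - discordant) / denom if denom else float("nan")

# Hypothetical data: age of each site in years, and % of households with toilets.
site_age_years = [2, 5, 8, 12, 15, 20]
pct_toilets = [40, 65, 100, 90, 100, 100]

tau = kendall_tau_b(site_age_years, pct_toilets)
print(f"Kendall tau-b = {tau:.3f}")
```

In practice a library routine such as `scipy.stats.kendalltau` would normally be used; the loop is spelled out only to show that the coefficient is a count of concordant versus discordant pairs of sites.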

The percent toilets variable is severely negatively skewed (most sites have 100% of households with toilets), and none of the transformations I tried made the distribution normal (I tried square root, log, inverse, and reflect transformations, although I don't know how to do a Box-Cox). Hence we decided to try splitting it into three groups: few households (HH) with toilets, a moderate percentage with toilets, and most HH with toilets. With census data, I suspect I should not use ANOVA or a non-parametric equivalent to test the significance of the difference between group means. Do I just report the group means and comment on the difference/lack of difference between them?
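For reference, a Box-Cox transformation can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not a recommendation: the percentages are made-up values, the helper names `box_cox` and `profile_loglik` are hypothetical, and λ is picked by a simple grid search over the usual normal profile log-likelihood. The transformation requires strictly positive values.

```python
from math import log

def box_cox(y, lam):
    """Box-Cox transform: log(y) if lam == 0, else (y**lam - 1) / lam."""
    if lam == 0:
        return [log(v) for v in y]
    return [(v ** lam - 1) / lam for v in y]

def profile_loglik(y, lam):
    """Normal profile log-likelihood of lambda (additive constants dropped)."""
    z = box_cox(y, lam)
    n = len(z)
    mean = sum(z) / n
    var = sum((v - mean) ** 2 for v in z) / n
    return -n / 2 * log(var) + (lam - 1) * sum(log(v) for v in y)

# Hypothetical percentages with a pile-up at 100%, as described in the question.
pct_toilets = [35.0, 60.0, 80.0, 90.0, 95.0, 100.0, 100.0, 100.0, 100.0, 100.0]

# Grid search for the lambda with the highest profile log-likelihood.
lambdas = [k / 10 for k in range(-20, 51)]
best_lambda = max(lambdas, key=lambda lam: profile_loglik(pct_toilets, lam))
transformed = box_cox(pct_toilets, best_lambda)

# The transform is monotone, so sites tied at 100% stay tied after it.
spike_size = transformed.count(max(transformed))
print(best_lambda, spike_size)
```

Because any such transformation preserves ties, the cluster of sites at 100% remains a cluster afterwards, which is consistent with no transformation producing a normal-looking distribution here.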

Thanks. Any references for census statistical analysis would be appreciated. I have Andy Field's SPSS textbook, but it's all about samples, which is a pity. I've been Googling all day…


Hi all, thanks very much for your advice. A friend made these comments about census statistics:

When you have a sample you use inferential stats to generalise to the population. When you have a census you already have data for the whole population, so there is no need to generalise.

For example, if you used sampling, and there is a 3% difference between groups, then you have to use inferential stats to decide whether that 3% difference is real, or just due to random chance when you did the sampling.

But if you did a census, and there is a 3% difference between groups, well, then there's definitely a 3% difference. That 3% difference is not due to random chance in sampling, because you have data for the whole population. However, even with a census you will still need to use your own judgement to think about why there is a 3% difference (for reasons other than random chance in sampling), and whether the 3% difference is large enough to have any practical significance for the work you are doing.

So basically, just use descriptive stats. Correlations are fine, but you only need the r value to show the strength of the correlation, not the p value which is related to random chance in sampling.

A lot of people don't get the difference between sample stats and census stats, and will complain that you didn't do the stats properly. I've had cases where I ended up having to do inferential stats on census data just because people complained so much that there were no p values on anything!

If you have a lot of missing data from a census, sometimes you need some fancy inferential stats to fill it in. I doubt this will apply to you, but it does apply to the US population census because (for some bizarre libertarian reason) completing the census survey is not mandatory in the US.

Best Answer

I think your bigger issue is actually not census vs. sample (and for that, see my comment) but the appropriate way to compare proportions. I'd drop any idea of approximating to normal and use logistic regression, treating the households as trials and those with toilets as successes.
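A minimal sketch of that idea, in plain Python, might look like the following. Everything specific is an assumption for illustration: the counts are invented, `fit_binomial_logit` is a hypothetical helper, and site age is used as the single predictor. The model is logit(p) = b0 + b1 × age, fitted by Newton-Raphson on grouped binomial data (households with toilets out of all households at each site).

```python
from math import exp

def fit_binomial_logit(x, successes, trials, iterations=25):
    """Fit logit(p_i) = b0 + b1 * x_i to grouped binomial counts."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = 0.0          # gradient of the log-likelihood
        h00 = h01 = h11 = 0.0  # information matrix (negative Hessian)
        for xi, si, ni in zip(x, successes, trials):
            p = 1.0 / (1.0 + exp(-(b0 + b1 * xi)))
            resid = si - ni * p        # observed minus expected successes
            w = ni * p * (1.0 - p)     # binomial variance at this site
            g0 += resid
            g1 += resid * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        # Newton step: solve the 2x2 system (information) * step = gradient.
        det = h00 * h11 - h01 * h01
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0 += step0
        b1 += step1
        if abs(step0) + abs(step1) < 1e-10:
            break
    return b0, b1

# Hypothetical sites: age in years, households counted, households with toilets.
site_age = [2, 5, 8, 12, 15, 20]
households = [50, 40, 60, 30, 45, 55]
with_toilet = [20, 26, 48, 27, 43, 55]

b0, b1 = fit_binomial_logit(site_age, with_toilet, households)
odds_ratio_per_year = exp(b1)  # multiplicative change in odds per extra year
print(f"intercept={b0:.3f}, slope={b1:.3f}, odds ratio/year={odds_ratio_per_year:.3f}")
```

A statistics package's binomial GLM would normally replace the hand-written loop. The point of the sketch is that each site contributes its full count of households, so a site at 100% with 55 households carries more weight than one with 5.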

Breaking your nice proportion data into categories is a shame as you lose a lot of information.
