Statistical Significance – Significance Test for Non-Normal Population

distributions, population, statistical significance

I'm a journalist, and I am trying to work out whether hospitals that are in political districts with low majorities (i.e. where the political representative is fighting hard for his or her seat) are more likely to get extra funding.

In other words, do hospitals that have received extra funding tend to be in districts with relatively low majorities? I'm not sure which significance test is most appropriate here.

Here are my numbers:

  • There are 600 districts overall. The average political majority is 18.5%, with a s.d. of 12.1%.
  • There are 1588 hospitals overall. Considering each hospital as one member of the population, and looking at the district it is in, the average majority is 18.6% with a s.d. of 12.6%.
  • There are 203 hospitals that have received extra funding. Considering each hospital as one element in the population, and looking at the district it is in, the average majority is 16.6% with a s.d. of 11.5%.

So I can see that the funded hospitals are in districts with lower majorities on average, but I'm not sure whether this counts as significant.

(I'm pretty sure there is something going on! I also have the stats for each individual hospital, if that helps. It's a long time since I did statistics at college and I have forgotten whether I should be looking at means or at something else.)

What complicates this for me is that the distribution of majorities probably isn't Normal, because I'm looking at the absolute size of the majority, and am not concerned with which party holds it.

Any thoughts on how to assess the significance of this finding?

Best Answer

There are a lot of sophisticated methods that could be applied to this problem, taking into account the probability distribution of majorities, clustering by district, controlling for other district-level variables, etc. Generally I would use a trimmed mean to compare this sort of thing, and to get a good view of the statistical properties of an estimate of the trimmed mean you would use a bootstrap. All this, I imagine, is beyond your resources and timeframe unless you have a statistical consultant on tap.
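For readers who do have the per-hospital figures to hand, a minimal sketch of the trimmed-mean bootstrap might look like the following. Everything here is an assumption for illustration: the function names `trimmed_mean` and `bootstrap_diff_ci`, the 10 percent trim, the 2000 resamples, and the percentile-style interval are choices of this sketch, and the example values are synthetic placeholders rather than the real hospital data. It also ignores clustering by district.

```python
import random


def trimmed_mean(values, trim=0.1):
    """Mean after dropping the lowest and highest `trim` fraction of values."""
    ordered = sorted(values)
    k = int(len(ordered) * trim)          # number of values cut from each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


def bootstrap_diff_ci(group_a, group_b, trim=0.1, n_boot=2000, seed=0):
    """Approximate 95% percentile interval for
    trimmed_mean(group_a) - trimmed_mean(group_b)."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        # Resample each group with replacement, keeping its original size.
        resample_a = rng.choices(group_a, k=len(group_a))
        resample_b = rng.choices(group_b, k=len(group_b))
        diffs.append(trimmed_mean(resample_a, trim) - trimmed_mean(resample_b, trim))
    diffs.sort()
    low = diffs[int(0.025 * n_boot)]
    high = diffs[int(0.975 * n_boot) - 1]
    return low, high


# Synthetic placeholder majorities (percent), NOT the real hospital data.
demo_rng = random.Random(42)
unfunded = [abs(demo_rng.gauss(19, 12)) for _ in range(1385)]
funded = [abs(demo_rng.gauss(17, 12)) for _ in range(203)]

low, high = bootstrap_diff_ci(unfunded, funded)
print(f"95% bootstrap interval for the difference: ({low:.2f}, {high:.2f})")
```

If the resulting interval excludes zero, that would point the same way as a significance test on the means, while being less sensitive to a handful of very large majorities.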

This is also a good example of applying statistical inference techniques that were designed for samples to a whole population. See some discussion on previous questions - here and here. My view is strongly that it is useful and policy-relevant to treat such a situation as though the census of hospitals and their grants were a random sample from a hypothetical super-population, and to draw conclusions on whether there is statistically significant evidence that the data-generating process has produced something different from what would be expected under a null hypothesis (in this case, the null hypothesis would be 'no relationship between majority size and funding behavior').

Putting all that aside however, a basic pragmatic approach to statistics would note several things:

  • While the distribution of majorities is certainly not normal, there are limits on how badly skewed it could be. After all, they can't possibly get bigger than 100.
  • Your sub-population has 203 members, which is normally enough for the central limit theorem to kick in for a good range of distributions. This means that although the original population of majorities is not normally distributed, your estimate of the mean majority in your sub-population is going to be close to normally distributed.
  • the standard deviations for the populations you quote can be converted into estimates of the standard deviation of your estimated mean majority in each population by dividing them by the square root of the relevant sample size
  • those standard deviations of the estimated mean majority can be combined into an estimate of the standard deviation of your estimate of the difference between the two populations - let's call this the standard error of your estimated difference - as follows:

$se_{diff} = \sqrt{\frac{sd_{pop1}^2}{n_{pop1}}+\frac{sd_{pop2}^2}{n_{pop2}}}$

  • your estimate of the difference in mean between the two populations will be approximately normally distributed with the above standard error as its standard deviation; so you can multiply the standard error by 1.96 to give the radius of an approximate 95 percent confidence interval for the difference in mean majority between the two populations.

My calculations suggest this gives you an estimate of the difference in mean majority between the districts of all hospitals and the districts of hospitals that received grants in the interval (0.3, 3.7) percentage points, which does not include zero, so this crude first effort certainly suggests that there is something going on here. However, the interval nearly includes zero, and in any case the majority in districts where hospitals received funds is not that much smaller, so I'd be careful before drawing too many conclusions from this.
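The arithmetic behind that interval can be reproduced with a few lines of Python. This is only a sketch of the crude approach described above: the function name `diff_confidence_interval` is made up for illustration, the inputs are the summary figures quoted in the question (all 1588 hospitals versus the 203 funded ones), and it treats the two groups as independent samples, as the standard-error formula does.

```python
import math


def diff_confidence_interval(mean1, sd1, n1, mean2, sd2, n2, z=1.96):
    """Approximate confidence interval for mean1 - mean2.

    Uses se_diff = sqrt(sd1^2 / n1 + sd2^2 / n2) and a normal
    approximation; z = 1.96 gives roughly 95 percent coverage.
    """
    se_diff = math.sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)
    diff = mean1 - mean2
    return diff - z * se_diff, diff + z * se_diff, se_diff


# Summary statistics from the question (majorities in percent).
low, high, se = diff_confidence_interval(
    mean1=18.6, sd1=12.6, n1=1588,   # all hospitals
    mean2=16.6, sd2=11.5, n2=203,    # hospitals that received extra funding
)
print(f"se_diff = {se:.2f}, 95% CI = ({low:.1f}, {high:.1f})")
```

With these inputs the standard error comes out at about 0.87 and the interval at roughly (0.3, 3.7), matching the figures quoted above.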

To get a better answer you would need to bring in some of the more sophisticated techniques mentioned earlier.
