Comparing (and testing) two discrete distributions with different magnitudes

chi-squared-test, distributions, paired-comparisons, survey

I am comparing authoritative survey data (a large number of observations) with data gained from a social network (a very small number of observations). In particular, I want to compare population per district as surveyed with population per district as found in a location-based social network.

Example dataset:

   type variable   value
1     1      vgi    1064
2     2      vgi     873
3     3      vgi       8
4     4      vgi     246
9     1      pop 2248360
10    2      pop 3544721
11    3      pop   70934
12    4      pop 2090647

type is the district (1-4); variable=vgi denotes the number of users found in the social network, while variable=pop is the actual population size per district.

Even though the scales are completely different in magnitude, is there a way to qualitatively (e.g. with a plot) and quantitatively compare both distributions?

By qualitative I mean a plot where one can easily see which district is likely under- or overrepresented on social media, and by quantitative I mean something like a chi-squared test to see whether the distributions differ significantly from each other. For example, one can see from the data that district 2 is underrepresented on vgi, or one could also say that district 1 is overrepresented on vgi. But that is the problem: what is considered over- or underrepresented?

I have no experience with such data, hence the question. I was able to plot both distributions with R, but the different scales make them hard to compare. I should probably transform one of the two variables, but I don't know how.

Best Answer

You could present the relative frequencies of people found in the social network, i.e. the vgi value divided by pop for each district, expressed as a percentage:

 type Percent
    1  0.0473
    2  0.0246
    3  0.0113
    4  0.0118

and just compare these percentages.
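The percentages above can be reproduced with a few lines of R. This is a minimal sketch using the counts from the question; the variable names `N`, `n` and `percent` are chosen here for illustration.

```r
# District populations (pop) and social network users (vgi), taken from the question
N <- c(2248360, 3544721, 70934, 2090647)
n <- c(1064, 873, 8, 246)

# Share of each district's population found in the social network, in percent
percent <- 100 * n / N

# Rounded to four decimals, this gives the Percent column shown above
print(data.frame(type = 1:4, Percent = round(percent, 4)))
```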

[Figure: bar plot of the percentage of social network users per district]

As the numbers and the barplot show, the relative frequencies of people found in the districts vary quite a bit (by roughly a factor of four between district 1 and district 3), i.e. not all districts are equally represented in the social network.
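A bar plot along these lines could be drawn as follows. This is only a sketch, not the code behind the original figure. The dashed reference line at the overall rate (all users divided by total population) is an addition made here, as one possible convention for reading over- and underrepresentation: bars above the line are overrepresented relative to the pooled rate, bars below it are underrepresented.

```r
# Counts from the question
N <- c(2248360, 3544721, 70934, 2090647)  # population per district
n <- c(1064, 873, 8, 246)                 # social network users per district

percent <- 100 * n / N            # per-district rate in percent
overall <- 100 * sum(n) / sum(N)  # pooled rate over all districts (assumed reference)

# barplot() returns the bar midpoints, kept here in case labels are added later
bp <- barplot(percent,
              names.arg = 1:4,
              xlab = "District",
              ylab = "Social network users (% of population)")

# Dashed line: the rate every district would have under proportional representation
abline(h = overall, lty = 2)
```

With these data, only district 1 lies above the pooled rate; districts 2, 3 and 4 lie below it.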

I doubt whether it is useful to use methods from inductive statistics here because your data set does not seem to be a random sample from a population. Should my impression be wrong, then you could either think of adding binomial confidence intervals to each of those percentages and/or run a chi-squared goodness-of-fit test using the population distribution as the reference.
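For the binomial confidence intervals, a minimal sketch could use `binom.test()` from base R, which reports exact (Clopper-Pearson) intervals. It assumes, as discussed above, that the users in each district can be treated as a random sample of that district's population, which may well not hold here.

```r
# Counts from the question
N <- c(2248360, 3544721, 70934, 2090647)  # population per district
n <- c(1064, 873, 8, 246)                 # social network users per district

percent <- 100 * n / N

# Exact 95% binomial confidence interval per district, converted to percent
ci <- t(sapply(seq_along(n), function(i) {
  100 * as.numeric(binom.test(n[i], N[i])$conf.int)
}))
colnames(ci) <- c("lower", "upper")

print(data.frame(type = 1:4, percent = percent, ci))
```

Districts whose intervals do not overlap, such as districts 1 and 2, differ by more than sampling noise alone would suggest under that assumption. District 3, with only 8 users, gets a comparatively wide interval.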

The chi-squared goodness-of-fit test in R:

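# N: population per district, n: social network users per district
N <- c(2248360, 3544721, 70934, 2090647)
n <- c(1064, 873, 8, 246)
# Goodness-of-fit test of n against the population shares N/sum(N)
chisq.test(n, p = N/sum(N))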

# Output
        Chi-squared test for given probabilities

data:  n
X-squared = 526.0491, df = 3, p-value < 2.2e-16

At the 5% level, you could reject the null hypothesis that all districts are equally represented in the social network, i.e. that the users are distributed across the districts in proportion to population.
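The test only says that the distributions differ, not which districts drive the difference. One way to look at that, sketched here as an addition to the answer, is to inspect the expected counts and standardized residuals that `chisq.test()` returns: a positive residual means more users than the population share would predict, a negative one means fewer.

```r
# Counts from the question
N <- c(2248360, 3544721, 70934, 2090647)  # population per district
n <- c(1064, 873, 8, 246)                 # social network users per district

fit <- chisq.test(n, p = N / sum(N))

# Users expected in each district if representation were proportional to population
print(round(fit$expected, 1))

# Standardized residuals: the sign gives the direction, the size the strength of the deviation
print(round(fit$stdres, 1))
```

With these data, district 1 has a positive residual and districts 2, 3 and 4 have negative ones, which matches the impression from the percentages.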