I have a data set with very large counts, and I want to run a chi-square test on it.
+--------+--------+---------+---------+---------+---------+----------+
|        | 15-19  | 20-24   | 25-29   | 30-34   | 35-39   | SUM      |
+--------+--------+---------+---------+---------+---------+----------+
| Male   | 9639   | 281060  | 1355555 | 2257670 | 2686581 | 6590505  |
+--------+--------+---------+---------+---------+---------+----------+
| Female | 127728 | 993121  | 2057165 | 2536860 | 2710454 | 8425328  |
+========+========+=========+=========+=========+=========+==========+
| SUM    | 137367 | 1274181 | 3412720 | 4794530 | 5397035 | 15015833 |
+--------+--------+---------+---------+---------+---------+----------+
When I calculate the expected values with the formula (row total * column total / grand total), I get the following table:
(For the first column and first row: 6590505 * 137367 / 15015833 = 60290,9)
EXPECTED VALUE TABLE
+--------+---------+--------+---------+---------+---------+
|        | 15-19   | 20-24  | 25-29   | 30-34   | 35-39   |
+--------+---------+--------+---------+---------+---------+
| Male   | 60290,9 | 559243 | 1497856 | 2104337 | 2368779 |
+--------+---------+--------+---------+---------+---------+
| Female | 77076,1 | 714938 | 1914864 | 2690193 | 3028256 |
+--------+---------+--------+---------+---------+---------+
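As a cross-check, the expected counts can be reproduced with a few lines of Python. This is only a sketch: the variable names are illustrative and not from the original post, and the counts are copied from the observed table.

```python
# Observed counts copied from the table (rows: Male, Female; columns: age groups).
observed = [
    [9639, 281060, 1355555, 2257670, 2686581],    # Male
    [127728, 993121, 2057165, 2536860, 2710454],  # Female
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under independence: row total * column total / grand total.
expected = [
    [r * c / grand_total for c in col_totals]
    for r in row_totals
]

for label, row in zip(["Male", "Female"], expected):
    print(label, [round(value, 1) for value in row])
```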
Then, subtract expected from actual, square it, then divide by expected:
(For the first column and first row:
(9639 – 60290,9)*(9639 – 60290,9) / (60290,9) = 42553,9)
+--------+---------+--------+---------+---------+---------+
|        | 15-19   | 20-24  | 25-29   | 30-34   | 35-39   |
+--------+---------+--------+---------+---------+---------+
| Male   | 42553,9 | 138376 | 13519   | 11172,6 | 42637,3 |
+--------+---------+--------+---------+---------+---------+
| Female | 33286,8 | 108241 | 10574,9 | 8739,52 | 33352   |
+--------+---------+--------+---------+---------+---------+
So, Chi-square is the sum of all cells which is: 42553,9 + 138376 + … + 8739,52 + 33352 = 442453
Chi-square = 442453
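The whole calculation (expected counts, per-cell contributions, and their sum) can be sketched in Python as follows. The function name `chi_square_statistic` is a hypothetical helper written for this illustration, not a library call.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic for a two-way table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed_count in enumerate(row):
            expected_count = row_totals[i] * col_totals[j] / grand_total
            # Contribution of one cell: (observed - expected)^2 / expected.
            statistic += (observed_count - expected_count) ** 2 / expected_count
    return statistic


observed = [
    [9639, 281060, 1355555, 2257670, 2686581],    # Male
    [127728, 993121, 2057165, 2536860, 2710454],  # Female
]

chi2 = chi_square_statistic(observed)
print(round(chi2))  # should agree with the hand calculation, about 442453
```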
Degrees of Freedom:
Multiply (rows − 1) by (columns − 1), which is
(2 – 1) * (5 – 1) = 4
Degrees of Freedom(DF) = 4
I choose a significance level of 0.05 (that is, a 95% confidence level).
So, when I look this up in the chi-square distribution table, the critical value is 9.49.
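The table value can be checked without a table. For an even number of degrees of freedom the chi-square upper-tail probability has a closed form; with 4 degrees of freedom it is exp(-x/2) * (1 + x/2). The sketch below solves for the point where that tail equals 0.05 by bisection. The helper name `chi2_sf_df4` is made up for this example and only handles the 4-degrees-of-freedom case.

```python
import math


def chi2_sf_df4(x):
    """Upper-tail probability P(X > x) for a chi-square variable with 4 df."""
    return math.exp(-x / 2) * (1 + x / 2)


rows, cols = 2, 5
df = (rows - 1) * (cols - 1)
alpha = 0.05

# Bisection: the tail probability decreases as x grows, so narrow in on
# the x where P(X > x) equals alpha.
low, high = 0.0, 100.0
for _ in range(100):
    mid = (low + high) / 2
    if chi2_sf_df4(mid) > alpha:
        low = mid
    else:
        high = mid
critical_value = (low + high) / 2

print(df, round(critical_value, 2))  # 4 9.49
```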
Obviously it does not make sense to compare 9.49 with 442453. What am I missing?
Best Answer
Everything you did was correct.
R provides the same answer.
What you might have missed is that a sample can actually be too large for p-values to be meaningful. For a discussion of this, see Lin, M., Lucas Jr, H. C., & Shmueli, G. (2013). Research commentary - too big to fail: large samples and the p-value problem. Information Systems Research, 24(4), 906-917.
Don't rely on p-values for your interpretation when your samples are very large. The p-value is just the probability of getting data this extreme or more extreme if the null hypothesis is true; with huge samples, this probability can become arbitrarily small.
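To see how strongly the statistic depends on sample size, here is a rough Python sketch that shrinks the table while keeping every proportion the same. The scaled tables are purely hypothetical (they contain fractional counts and are not real data), and the helper names are invented for this example. Cramér's V is included as one common scale-free summary of association; it is an addition for illustration and is not part of the calculation in the question.

```python
import math


def chi_square_statistic(table):
    """Pearson chi-square statistic and total count for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed_count in enumerate(row):
            expected_count = row_totals[i] * col_totals[j] / n
            statistic += (observed_count - expected_count) ** 2 / expected_count
    return statistic, n


def p_value_df4(x):
    """Upper-tail chi-square probability, valid for 4 degrees of freedom only."""
    return math.exp(-x / 2) * (1 + x / 2)


observed = [
    [9639, 281060, 1355555, 2257670, 2686581],    # Male
    [127728, 993121, 2057165, 2536860, 2710454],  # Female
]

results = []
for divisor in (1, 1000, 100000):
    # Hypothetical smaller sample with exactly the same proportions.
    scaled = [[count / divisor for count in row] for row in observed]
    statistic, n = chi_square_statistic(scaled)
    # Cramér's V = sqrt(chi2 / (n * min(rows - 1, cols - 1))); here min(...) = 1.
    cramers_v = math.sqrt(statistic / n)
    results.append((divisor, statistic, p_value_df4(statistic), cramers_v))

for divisor, statistic, p_value, cramers_v in results:
    print(divisor, round(statistic, 2), p_value, round(cramers_v, 4))
```

The statistic shrinks in proportion to the sample size while Cramér's V stays the same (about 0.17). With the counts divided by 100000, the statistic falls below the critical value of 9.49, although the pattern in the table is unchanged.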
Edit: I assumed that each cell in your table contains the number of persons of a certain age and sex, and thus that your sample size is huge. If this is not the case, the chi-squared test may not be the correct test.