# Chi-Square Test – Handling High Sample Size and Unbalanced Data

Tags: chi-squared-test, normalization, p-value, unbalanced-classes

I have a data set with very large counts, and I want to run a chi-square test on it.

    +--------+--------+---------+---------+---------+---------+----------+
    |        | 15-19  | 20-24   | 25-29   | 30-34   | 35-39   | SUM      |
    +--------+--------+---------+---------+---------+---------+----------+
    | Male   | 9639   | 281060  | 1355555 | 2257670 | 2686581 | 6590505  |
    +--------+--------+---------+---------+---------+---------+----------+
    | Female | 127728 | 993121  | 2057165 | 2536860 | 2710454 | 8425328  |
    +========+========+=========+=========+=========+=========+==========+
    | SUM    | 137367 | 1274181 | 3412720 | 4794530 | 5397035 | 15015833 |
    +--------+--------+---------+---------+---------+---------+----------+


When I calculate the expected values with the formula (row total * column total / grand total), I get the following table:

(For the first row and first column: 6590505 * 137367 / 15015833 = 60290.9)

EXPECTED VALUE TABLE

    +--------+---------+--------+---------+---------+---------+
    |        | 15-19   | 20-24  | 25-29   | 30-34   | 35-39   |
    +--------+---------+--------+---------+---------+---------+
    | Male   | 60290.9 | 559243 | 1497856 | 2104337 | 2368779 |
    +--------+---------+--------+---------+---------+---------+
    | Female | 77076.1 | 714938 | 1914864 | 2690193 | 3028256 |
    +--------+---------+--------+---------+---------+---------+
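As a cross-check, the expected-value formula can be sketched in a few lines of R. This is only a minimal sketch: the question does not say which software was used, and the names `observed` and `expected` are chosen here for illustration.

```r
# Observed counts from the table above (rows: sex, columns: age group).
observed <- matrix(
  c(  9639, 281060, 1355555, 2257670, 2686581,
    127728, 993121, 2057165, 2536860, 2710454),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("Male", "Female"),
                  c("15-19", "20-24", "25-29", "30-34", "35-39"))
)

# Expected count for each cell = row total * column total / grand total.
# outer() builds the full table of row total * column total products.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

print(round(expected, 1))
```

The expected table keeps the same row and column totals as the observed one, which is a quick sanity check on the arithmetic.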


Then I subtract the expected value from the observed value, square the difference, and divide by the expected value:

(For the first row and first column: (9639 - 60290.9)^2 / 60290.9 = 42553.9)

    +--------+---------+--------+---------+---------+---------+
    |        | 15-19   | 20-24  | 25-29   | 30-34   | 35-39   |
    +--------+---------+--------+---------+---------+---------+
    | Male   | 42553.9 | 138376 | 13519   | 11172.6 | 42637.3 |
    +--------+---------+--------+---------+---------+---------+
    | Female | 33286.8 | 108241 | 10574.9 | 8739.52 | 33352   |
    +--------+---------+--------+---------+---------+---------+


So the chi-square statistic is the sum of all cells: 42553.9 + 138376 + … + 8739.52 + 33352 = 442453

Chi-square = 442453
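The per-cell contributions and their sum can be reproduced along the same lines. Again a sketch in R with illustrative names (`contrib`, `chi_sq`) that are not part of the question:

```r
# Observed counts (row 1 = male, row 2 = female; columns = age groups).
observed <- matrix(
  c(  9639, 281060, 1355555, 2257670, 2686581,
    127728, 993121, 2057165, 2536860, 2710454),
  nrow = 2, byrow = TRUE
)

# Expected counts under independence of sex and age group.
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)

# Per-cell contribution: (observed - expected)^2 / expected.
contrib <- (observed - expected)^2 / expected

# The test statistic is the sum of the contributions over all cells.
chi_sq <- sum(contrib)

print(round(contrib, 1))
print(chi_sq)
```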

Degrees of Freedom:
Multiply (rows − 1) by (columns − 1), which is
(2 – 1) * (5 – 1) = 4

Degrees of freedom (DF) = 4

I choose a significance level of 0.05 (that is, a 95% confidence level).

When I look this up in a chi-square distribution table, the critical value is 9.49.
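The table lookup can also be done numerically. The sketch below assumes R and uses its built-in chi-square distribution functions `qchisq` and `pchisq`. The variable names are illustrative.

```r
alpha <- 0.05              # the 0.05 level chosen above
df <- (2 - 1) * (5 - 1)    # (rows - 1) * (columns - 1)

# Critical value: the (1 - alpha) quantile of the chi-square
# distribution with df degrees of freedom.
critical <- qchisq(1 - alpha, df = df)
print(round(critical, 2))  # 9.49

# Upper-tail probability of a statistic at least as large as 442453.
p_value <- pchisq(442453, df = df, lower.tail = FALSE)
print(p_value)
```

The null hypothesis is rejected at level `alpha` whenever the statistic exceeds `critical`, which is the same decision as checking `p_value < alpha`.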

Comparing 442453 with 9.49 obviously does not look sensible. What am I missing?

Everything you did was correct. R provides the same answer:

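    # Observed counts per age group (15-19, 20-24, 25-29, 30-34, 35-39)
    male <- c(9639, 281060, 1355555, 2257670, 2686581)
    female <- c(127728, 993121, 2057165, 2536860, 2710454)
    # 2 x 5 contingency table: row 1 = male, row 2 = female
    data <- matrix(c(male, female), nrow = 2, byrow = TRUE)
    # Pearson's chi-squared test of independence
    chisq.test(x = data)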


The output being:

    Pearson's Chi-squared test

    data:  data
    X-squared = 442453, df = 4, p-value < 2.2e-16


What you might have missed is that a sample can be so large that p-values stop being meaningful. For a discussion of this, see Lin, M., Lucas Jr, H. C., & Shmueli, G. (2013). Research commentary - too big to fail: large samples and the p-value problem. Information Systems Research, 24(4), 906-917.

Don't base your interpretation on p-values when your samples are very large. The p-value is just the probability of getting data at least this extreme if the null hypothesis is true. With a huge sample, even a tiny departure from the null hypothesis can make this probability arbitrarily small.
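To see how strongly the statistic depends on sample size alone, here is a small sketch that shrinks the table while keeping every proportion identical. The helpers `pearson_x2` and `cramers_v` are hypothetical names written for this illustration. Cramér's V is not something proposed above. It is included only as one common summary that does not grow with the sample size.

```r
observed <- matrix(
  c(  9639, 281060, 1355555, 2257670, 2686581,
    127728, 993121, 2057165, 2536860, 2710454),
  nrow = 2, byrow = TRUE
)

# Pearson chi-square statistic computed by hand, so that scaled
# (non-integer) tables can be passed in as well.
pearson_x2 <- function(tab) {
  expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
  sum((tab - expected)^2 / expected)
}

# Cramer's V divides the statistic by n * min(rows - 1, cols - 1),
# so it depends only on the proportions in the table.
cramers_v <- function(tab) {
  sqrt(pearson_x2(tab) / (sum(tab) * (min(dim(tab)) - 1)))
}

x2_full  <- pearson_x2(observed)
# Same proportions, 1/1000 of the sample size.
x2_small <- pearson_x2(observed / 1000)

print(c(full = x2_full, scaled = x2_small))
print(c(full = cramers_v(observed), scaled = cramers_v(observed / 1000)))
```

With the proportions held fixed, the statistic is proportional to the total count, so dividing every cell by 1000 divides the statistic by 1000, while Cramér's V stays the same.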

Edit: I assumed that each cell in your table holds the number of persons of a given age group and sex, and thus that your sample size is huge. If this is not the case, the chi-squared test may not be the correct test.