Chi-square, effect size and unbalanced groups

chi-squared-test, effect-size

For context: I should note in advance I am a relative beginner with this.

Data context

I have data on some 600 000 persons, including a column recording whether each person took parental leave (coded 1 = took parental leave, 0 = took no parental leave). I also have a column coding each person as male or female. I want to know whether persons coded as female are more likely to take parental leave than persons coded as male.

So I made a 2×2 table (female/male; no parental leave/parental leave) and applied the chi-square test, which is significant (as expected). The residuals and the proportion table show that women are indeed overrepresented among those taking parental leave. So far so good.
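For readers following along, the steps just described might look roughly like this in base R. This is only a sketch: the counts are invented for illustration (they are not the data from the question), and `tab` and `test` are hypothetical names.

```r
# Invented counts for illustration only; rows = sex, columns = leave (0/1)
tab <- matrix(c(280000, 20000,
                290000, 10000),
              nrow = 2, byrow = TRUE,
              dimnames = list(sex = c("female", "male"), leave = c("0", "1")))

# Pearson chi-square test without continuity correction
test <- chisq.test(tab, correct = FALSE)

test$p.value                 # significance of the association
test$stdres                  # standardized residuals per cell
prop.table(tab, margin = 1)  # row proportions: share taking leave within each sex
```

With a sample this large almost any difference is significant, which is why the question turns to effect size next.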

Problem statement

However, the effect size is relatively small (Cramer's V about 0.15). For a number of reasons this seems counterintuitive: the difference between men and women in the 'parental leave = 1' group seems quite large. I googled and read a bit about effect size and unbalanced groups. In this case the dataset is large, and only a relatively small proportion of the 600 000 persons took parental leave. Could this affect the effect size? If so, is there a measure other than Cramer's V that should be used here?

Note: I am not specifically looking for a large effect size, just wondering whether I am applying the right measure.

Own research

I have read the post "Chi-square Test with High Sample Size and Unbalanced Data", but it didn't quite answer my question (the issue seems similar, though).

Best Answer

Answer from comment thread:

It sounds like you are describing the most useful effect size for your situation:

relative to the proportion of men/women in the overall population, women are about twice as likely to take up parental leave.

If I understand what you are saying, this is the odds ratio.

For future readers, here is R code for an example 2 x 2 table with Cramer's V of about 0.15 and OR = 2:

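# 2 x 2 table of counts, filled row by row
Matrix = matrix(c(550, 1100, 250, 1000), nrow=2, byrow=TRUE)

library(vcd)

# Chi-square tests plus phi, contingency coefficient and Cramer's V
assocstats(Matrix)

# Odds ratio on the original (not log) scale
oddsratio(Matrix, log=FALSE)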

   ###                     X^2 df   P(> X^2)
   ### Likelihood Ratio 64.712  1 8.8818e-16
   ### Pearson          63.294  1 1.7764e-15
   ### 
   ### Phi-Coefficient   : 0.148 
   ### Contingency Coeff.: 0.146 
   ### Cramer's V        : 0.148 
   ###
   ### odds ratio
   ###
   ### 2

OR = (550 / 1100) / (250 / 1000)

names(OR) = "Odds ratio"

OR

   ### Odds ratio 
   ###          2
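On the question of whether a rare outcome can pull Cramer's V down: the sketch below, in base R, keeps the odds ratio at 2 but makes one column much rarer. The counts in `rare` are made up for illustration, and `phi_2x2` and `odds_ratio` are hypothetical helper names, not functions from vcd.

```r
# Hypothetical helper: phi coefficient for a 2 x 2 table.
# phi = (ad - bc) / sqrt(product of the four marginal totals);
# for a 2 x 2 table its absolute value equals Cramer's V.
phi_2x2 <- function(m) {
  num <- m[1, 1] * m[2, 2] - m[1, 2] * m[2, 1]
  num / sqrt(prod(rowSums(m)) * prod(colSums(m)))
}

# Hypothetical helper: cross-product odds ratio for a 2 x 2 table
odds_ratio <- function(m) (m[1, 1] * m[2, 2]) / (m[1, 2] * m[2, 1])

# Same table as above, and a made-up table where the first column is rare
common <- matrix(c(550, 1100, 250, 1000), nrow = 2, byrow = TRUE)
rare   <- matrix(c(20, 1000, 10, 1000), nrow = 2, byrow = TRUE)

c(OR = odds_ratio(common), V = abs(phi_2x2(common)))
c(OR = odds_ratio(rare),   V = abs(phi_2x2(rare)))
```

Both tables have an odds ratio of 2, yet V falls from about 0.15 to about 0.04 in the second one. Under these made-up numbers, at least, V depends on the marginal totals as well as on the strength of the association, so a small V alongside a sizeable odds ratio is not a contradiction.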