Solved – Testing a 2×2 contingency table: male/female, employed/unemployed

chi-squared-test, fishers-exact-test, hypothesis-testing

I major in science, and my knowledge of statistics is rather superficial.

Problem

I had to find a data set and analyze it to the best of my ability as an assignment for my statistics course. This is no longer an assignment; I just need help understanding why my analysis was flawed and what I should have done instead.

I used a categorical data set of employment rates in New Zealand, planning to arrange it in a 2×2 contingency table and use Pearson's chi-squared test and Fisher's exact test to test whether gender correlates with employment.

What I want to answer

  1. Understand why I cannot use the chi-squared test and Fisher's exact test for this problem, and learn what I should have used instead. "Odds ratio as a function of time", I assume? Any useful links on how to do that, preferably in R?
  2. Understand the "sequential correlation" comment regarding the first part of the assignment and what exactly I should have done.

Way to help me #1 (shorter)

That's how our data looks (based on a census):

                 Male     Female
Employed      1201600    1060200
Unemployed      73300      75000

I ran a chi-squared test and a Fisher's exact test in R, assuming that the resulting p-value would tell me the probability of such a distribution of jobs (or one more extreme) given that the null is true (that males and females have equal chances of being employed). I got a very small p-value, and Fisher's test gave me an odds ratio of 1.16, which I took to mean that there is an association, and specifically that the odds of being employed in NZ are about 16% higher for males.
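For reference, the numbers above can be checked by hand from the table. The following is a minimal sketch in plain Python (not the R code used for the assignment); the function names are made up for illustration, and the chi-squared statistic is computed without the continuity correction that R's `chisq.test` applies to 2×2 tables by default, so the value may differ slightly from R's output.

```python
import math

# 2x2 table from the question (rows: employed/unemployed, columns: male/female).
emp_m, emp_f = 1_201_600, 1_060_200
unemp_m, unemp_f = 73_300, 75_000

def odds_ratio(a, b, c, d):
    """Sample odds ratio (a/c) / (b/d) for the table [[a, b], [c, d]]."""
    return (a * d) / (b * c)

def pearson_chi2(a, b, c, d):
    """Pearson chi-squared statistic for a 2x2 table, no continuity correction."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def p_value_1df(chi2):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))

or_hat = odds_ratio(emp_m, emp_f, unemp_m, unemp_f)
chi2 = pearson_chi2(emp_m, emp_f, unemp_m, unemp_f)
p = p_value_1df(chi2)

# Employment proportions, to contrast with the odds ratio.
rate_m = emp_m / (emp_m + unemp_m)
rate_f = emp_f / (emp_f + unemp_f)

print(f"odds ratio        = {or_hat:.3f}")
print(f"chi-squared (1df) = {chi2:.1f}, p = {p:.3g}")
print(f"employed share: male = {rate_m:.4f}, female = {rate_f:.4f}")
print(f"ratio of shares   = {rate_m / rate_f:.4f}")
```

The sample odds ratio comes out at about 1.16, matching the value reported by Fisher's test (which, as far as I know, reports a conditional estimate that is nearly identical at these counts). Note that an odds ratio compares odds, not probabilities: the employed shares themselves are roughly 94.3% for males and 93.4% for females, a ratio of about 1.01.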

However, according to my lecturer, I used these tests inappropriately. I didn't quite understand why, but I think he was saying that these tests assume independence, and because there's a given amount of jobs available in NZ, our samples are not independent… I'm not sure about it though (you can see his feedback quoted below).

Way to help me #2 (longer)

If you have some spare time, I would appreciate it very much if you could look at the whole assignment. I will also provide the lecturer's feedback, so if you could interpret it for me, it would be great! The assignment is very easy for a mathematician / statistician, there's only two questions there, it's just full of padding where I tried to demonstrate that I know what I'm doing, you can skip most of it.

Here's the link to a PDF file with the assignment I didn't succeed in: statistics assignment.pdf.

Lecturer's feedback

Your figure 1 exhibits sequential correlation which is the real reason why linear regression does not work. Neither fisher's test nor chi squared is good for your 2×2 table. This is because you want to test homogeneity, but you are rejecting the null because of non-independence (which is not interesting). The distinction between the two is irrelevant here (they are asymptotically identical in any case). You could have plotted the odds ratio as a function of time.

Best Answer

Some immediate responses:

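1) Your lecturer means that the data show autocorrelation: consecutive observations in the time series are correlated with each other. This leads to inefficient estimates of the regression coefficients in simple linear regression and makes the usual standard errors unreliable. Depending on whether it was covered in your course, that's a mistake.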

2) Maybe I do not understand the problem fully, but IMAO the chi-squared test of independence is used correctly here, except for two other issues:

3) Your chi-squared test has immense power because of the sample size. It is hard not to get a significant result even if the effect is very small. Furthermore, it appears you have a census of the population. In this situation statistical inference is unnecessary, because you observe all population units. But that's not what the lecturer is pointing out.

4) You seem to aggregate the data across time points. You should actually test once per time point, since otherwise you pool effects over time and count the same units multiple times. But that's also not what the lecturer is pointing out.
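To make point 3 concrete, here is a small sketch in plain Python that compares the table from the question with a hypothetical table that has the same proportions but one thousandth of the counts. The scaled-down table is an assumption made purely for illustration; the phi coefficient (square root of chi-squared divided by n) is used as a simple effect-size measure for a 2×2 table.

```python
import math

def pearson_chi2(a, b, c, d):
    """Pearson chi-squared statistic for a 2x2 table, no continuity correction."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def p_value_1df(chi2):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2.0))

# Table from the question: (employed male, employed female,
#                           unemployed male, unemployed female).
full = (1_201_600, 1_060_200, 73_300, 75_000)
# Hypothetical table: identical proportions, 1/1000 of the sample size.
small = tuple(x / 1000 for x in full)

results = {}
for label, table in (("full", full), ("small", small)):
    n = sum(table)
    chi2 = pearson_chi2(*table)
    results[label] = {
        "n": n,
        "chi2": chi2,
        "p": p_value_1df(chi2),
        "phi": math.sqrt(chi2 / n),  # effect size, independent of n
    }
    r = results[label]
    print(f"{label:5s} n={r['n']:.0f} chi2={r['chi2']:.2f} "
          f"p={r['p']:.3g} phi={r['phi']:.4f}")
```

The effect size is the same in both tables (phi is below 0.02), but only the full-size table yields a tiny p-value. The p-value here mostly reflects the sample size rather than the strength of the association.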

What the lecturer actually says is that you want to test the null of homogeneity, whereas you tested the null of independence. So what does he mean by homogeneity?

I suppose he is referring to the test of marginal homogeneity for paired data. That test is used to assess whether there was a change across time (repeated measures). This is, however, not what you wanted to assess in the first place. My guess is that he did not realize you wanted to test whether gender and employment are related at time point x. Maybe he was also suggesting that what you should test is change across time (or no change, in which case the repeated contingency tables would indeed be called homogeneous).
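If the goal is to look at change across time, the lecturer's suggestion of the odds ratio as a function of time could be sketched roughly as follows: compute one odds ratio per time point, with an approximate 95% interval based on the standard error of the log odds ratio. This is a plain-Python sketch, not R; the per-period counts below are made-up placeholders (not taken from the NZ data set), and the function name is hypothetical.

```python
import math

def odds_ratio_with_ci(a, b, c, d, z=1.96):
    """Sample odds ratio for [[a, b], [c, d]] with an approximate
    (Woolf / log-scale) confidence interval."""
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return (math.exp(log_or),
            math.exp(log_or - z * se),
            math.exp(log_or + z * se))

# HYPOTHETICAL counts per time point, for illustration only:
# (employed male, employed female, unemployed male, unemployed female).
tables_by_period = {
    "t1": (1000, 900, 60, 70),
    "t2": (1020, 930, 55, 60),
    "t3": (1010, 950, 50, 48),
}

results = []
for period, table in tables_by_period.items():
    or_hat, lo, hi = odds_ratio_with_ci(*table)
    results.append((period, or_hat, lo, hi))
    print(f"{period}: OR = {or_hat:.3f}  (95% CI {lo:.3f} to {hi:.3f})")
```

With the real data, one table per survey period would replace the placeholders, and the resulting series of odds ratios could then be plotted against time to see whether the gender gap in employment drifts or stays roughly constant.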