I'm using scipy and I'd like to calculate the chi-squared value of a contingency table of percentages.
This is my table of relapse rates. I'd like to know whether any values are unexpected, i.e. whether there are groups where relapse rates are particularly high:
       18-25  25-34  35-44  ...
Men    37%    36%    64%    ...
Women  24%    25%    32%    ...
The underlying data looks like this:
     18-25        25-34        35-44         ...
Men  667 of 1802  759 of 2108  1073 of 1677  ...
Should I just use the raw counts, giving a contingency table like this, and run a chi-squared test on them?
     18-25  25-34  35-44  ...
Men  667    759    1073   ...
That doesn't seem quite right, because it doesn't capture the relative underlying size of each group.
I have been Googling, but haven't been able to find an explanation I understand of what I should do. How should I find unexpected values in data like this?
Best Answer
As long as the percentages all add to 100 (not the case in your illustration) and reflect mutually exclusive and exhaustive outcomes (also not the case), you can compute $X^2$ from the percentages and multiply it by $N/100$, where $N$ is the total number of observations.
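To make the $N/100$ scaling concrete, here is a minimal sketch in plain Python. It assumes a two-way table built from the men's counts in the question, with a second row for those who did not relapse so that the outcomes are exclusive and exhaustive. The helper `pearson_x2` is an illustrative name, not a scipy function.

```python
def pearson_x2(table):
    """Pearson chi-squared statistic for a two-way table (no continuity correction)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    x2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand
            x2 += (observed - expected) ** 2 / expected
    return x2

# Men only, from the question: relapsed vs. did not relapse, by age group.
counts = [
    [667, 759, 1073],                            # relapsed
    [1802 - 667, 2108 - 759, 1677 - 1073],       # did not relapse
]
n = sum(sum(row) for row in counts)

# Each cell as a percentage of the grand total, so the cells add to 100.
percents = [[100 * c / n for c in row] for row in counts]

x2_counts = pearson_x2(counts)
x2_scaled = pearson_x2(percents) * n / 100

print(x2_counts, x2_scaled)  # the two values agree up to rounding error
```

If scipy is available, `scipy.stats.chi2_contingency(counts, correction=False)` should report the same statistic along with a p-value and the expected counts.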
In your case, you really have a 3-way table (age by sex by relapse status). It appears that what you'd really like to know is how age and sex affect relapse rates. So I think you're better off forgetting the chi-square stuff and instead using the actual frequencies for each cell, i.e. the number who relapsed and the number who did not in each age-by-sex group.
Then run a logistic regression model with Age, Sex, and the Age:Sex interaction as the predictors. You can then see what the effects of those factors are, do comparisons among predictions, etc. It'd be a lot more informative than a chi-square statistic for some independence hypothesis.