I have a dataset with a particular response variable that I'm interested in and numerous predictor variables. All variables are nominal and have as many as 15 possible values. When I cross-tabulate any given predictor variable with the response variable, I get many cells with 0 counts, making a chi-squared test of independence inappropriate. That's fine, because I can use Fisher's exact test, but it's a problem for calculating effect size, since Cramer's V and every other method I've found that works for nominal data like mine seems to rely on chi-squared. Are there any alternatives to Cramer's V that don't have this problem? Or, if I'm misunderstanding something, is it still valid to use Cramer's V even if a chi-squared test is inappropriate?
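To illustrate the situation (with made-up counts, not my actual data): a sparse table makes chisq.test() warn that its approximation may be incorrect, while fisher.test() still runs.

```r
# Made-up sparse table (not the actual data) illustrating the problem:
# several cells have 0 counts, so expected counts are small.
tab <- matrix(c(12, 0, 3,
                 0, 8, 1,
                 2, 1, 0),
              nrow = 3, byrow = TRUE)

chisq.test(tab)    # warns that the chi-squared approximation may be incorrect
fisher.test(tab)   # the exact test still works
```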
Solved – An alternative to Cramer's V for computing effect size when chi-squared is inappropriate
categorical-data, chi-squared-test, cramers-v, effect-size
Related Solutions
Because your sample size is large, the chi-squared test is likely to return a low p-value even for a table with only small differences from the expected proportions.
To get a sense of the effect size being reported by Cramer's V, it helps to look at the proportions in the table. For example, there is not much difference in the proportions across the grade columns within each row. Row 0 is about one-half to 1 percent of the observations in each grade. Row 2 is roughly 93 to 97 percent of the observations in each grade. And so on.
Whether these kinds of differences in proportions are meaningful in your context is up to you. The p-value and Cramer's V give you certain information; the practical importance of your results is something you will have to decide.
The following is R code.
I am getting a slightly different Cramer's V than you did, so that's something you might want to look into.
Input =("
Col1 Grade1 Grade2 Grade3 Grade4 Grade5
0 290 392 932 1812 2854
1 522 421 574 917 1247
2 56789 81296 117971 147811 204480
3 3719 2975 2811 1704 2244
")
Matrix = as.matrix(read.table(textConnection(Input),
header=TRUE,
row.names=1))
Matrix
chisq.test(Matrix)
### Pearson's Chi-squared test
###
### X-squared = 8113.9, df = 12, p-value < 2.2e-16
library(vcd)
assocstats(Matrix)
### Cramer's V : 0.065
prop.table(Matrix, margin=2)
### Grade1 Grade2 Grade3 Grade4 Grade5
### 0 0.004729289 0.004607212 0.007621353 0.011901947 0.013537294
### 1 0.008512720 0.004948051 0.004693837 0.006023226 0.005914858
### 2 0.926108937 0.955479291 0.964698090 0.970882268 0.969903949
### 3 0.060649054 0.034965446 0.022986720 0.011192559 0.010643899
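For reference, the Cramer's V that assocstats() reports can be reproduced from the chi-squared statistic alone, using V = sqrt(X² / (n · (min(r, c) − 1))). This is a base-R sketch using the same counts as the Matrix built above, with no vcd dependency:

```r
# Reproduce Cramer's V from the chi-squared statistic in base R.
# Same counts as the Matrix built above.
Matrix <- matrix(c(  290,   392,    932,   1812,   2854,
                     522,   421,    574,    917,   1247,
                   56789, 81296, 117971, 147811, 204480,
                    3719,  2975,   2811,   1704,   2244),
                 nrow = 4, byrow = TRUE)

X2 <- chisq.test(Matrix)$statistic               # X-squared = 8113.9, as above
V  <- unname(sqrt(X2 / (sum(Matrix) * (min(dim(Matrix)) - 1))))
round(V, 3)
###  0.065
```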
Complete re-write:
I think the correct approach to calculating Cohen's w is to use the expected values as the P0 values. I looked back at Cohen (1988); this isn't made precisely clear there, but I think that's the intention.
So the problem is that your second case (dat_0_better) doesn't represent the expected values for dat, but your dat_0 does.
chisq.test(dat)$expected
### [,1] [,2]
### [1,] 20 20
### [2,] 30 30
So the calculation of w in the first case, I believe, is correct.†
library(rcompanion)
cohenW(dat)
### Cohen w
### 0.4082
The table that you've constructed with dat includes the information that the control treatment results in 10 out of 50. This is taken into account by the expected values of the table, so I don't think you need to alter the null hypothesis to account for it.
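For what it's worth, w can also be computed in base R as sqrt(X² / n), without rcompanion. Since dat itself isn't reproduced in this excerpt, the 2×2 table below is a hypothetical reconstruction consistent with the expected values [[20, 20], [30, 30]] and the w of 0.4082 shown above:

```r
# Hypothetical 2x2 table consistent with the expected values shown above
# (`dat` itself is not reproduced in this excerpt).
dat <- matrix(c(10, 30,
                40, 20),
              nrow = 2, byrow = TRUE)

chisq.test(dat)$expected                          # [[20, 20], [30, 30]]
X2 <- chisq.test(dat, correct = FALSE)$statistic  # uncorrected chi-squared
w  <- unname(sqrt(X2 / sum(dat)))                 # Cohen's w = sqrt(X^2 / n)
round(w, 4)
###  0.4082
```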
I think what I'm saying is consistent with the standard sample-size calculation. It's a case where those who came before us did the hard work.
† Caveat: I am the author of the rcompanion package. I don't know of another R package that calculates Cohen's w, though I suspect there are some.
Best Answer
The effect sizes I assume you are considering --- Cramer's V, phi, the contingency coefficient C, and Cohen's w --- can all be calculated from the chi-squared value. But the chi-squared statistic is simply calculated from the differences between observed and expected values. This is the way Cohen defines his w in Cohen (1988).
I assume that because there is no inference involved with these statistics, it is fine to report them even if a test using the chi-squared statistic would not be appropriate. It's like reporting the difference between two means as some value, without addressing whether or not a t-test would be appropriate in that case.
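As a sketch of that point (again with made-up sparse counts): the chi-squared statistic, and hence Cramer's V, can still be computed and reported descriptively, while the inference comes from Fisher's exact test.

```r
# Made-up sparse table: the chi-squared *test* would be dubious here,
# but the statistic, and hence Cramer's V, can still be computed.
tab <- matrix(c(9, 0, 2,
                0, 7, 1,
                3, 1, 0),
              nrow = 3, byrow = TRUE)

X2 <- suppressWarnings(chisq.test(tab)$statistic)
V  <- unname(sqrt(X2 / (sum(tab) * (min(dim(tab)) - 1))))
V                           # effect size, reported descriptively
fisher.test(tab)$p.value    # inference from the exact test
```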