Chi squared test result and Cramer’s V value

I have a dataset that contains many categorical variables. The target variable is also categorical. I tried to measure the correlation between each categorical variable and the target variable using a chi-squared test, and I also computed Cramer's V. Here is part of the output:

column  Cramer's V      p value
Col1     0.065430   0.000000e+00
Col2     0.084450   0.000000e+00
Col3     0.059535   0.000000e+00
Col4     0.119343   0.000000e+00
Col5     0.108018   0.000000e+00
Col6     0.040086  1.842584e-218
Col7     0.021307   7.523901e-61
Col8     0.012404   3.865421e-20
Col9     0.009400   2.183289e-11
Col10    0.010728   6.082550e-15

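I obtained the p-value and the chi-squared statistic with Python's scipy.stats.chi2_contingency, and then computed Cramer's V as the square root of the chi-squared statistic divided by the product of the sample size and the minimum dimension minus 1, that is, V = sqrt(chi2 / (n * (min(r, c) - 1))), where r and c are the numbers of rows and columns of the contingency table (according to the Wikipedia page).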
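As a rough illustration of the calculation just described, here is a minimal pure-Python sketch that computes the Pearson chi-squared statistic and Cramer's V by hand for the Col1 contingency table given in the EDIT of this question. The helper names chi_squared and cramers_v are made up for this example, and no continuity correction is applied (an assumption that only matters for 2x2 tables).

```python
import math

# Col1 (rows 0-3) by target grade (columns Grade 1-5), copied from the question.
col1_table = [
    [290, 392, 932, 1812, 2854],
    [522, 421, 574, 917, 1247],
    [56789, 81296, 117971, 147811, 204480],
    [3719, 2975, 2811, 1704, 2244],
]


def chi_squared(table):
    """Pearson chi-squared statistic, without continuity correction."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat


def cramers_v(table):
    """Cramer's V = sqrt(chi2 / (n * (min(rows, cols) - 1)))."""
    n = sum(sum(row) for row in table)
    k = min(len(table), len(table[0])) - 1
    return math.sqrt(chi_squared(table) / (n * k))


print("chi-squared:", round(chi_squared(col1_table), 1))
print("Cramer's V :", round(cramers_v(col1_table), 4))
```

For this table the result should agree with the 0.065430 listed for Col1 in the output above, and the statistic should agree with what scipy.stats.chi2_contingency reports for the same table.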

All the p-values from the chi-squared tests are nearly 0, which I understand to mean that there is a very high correlation. But the Cramer's V values are also very small (<< 1), which (again according to the Wikipedia page) suggests that there is no strong association. How should I interpret these conflicting results? Is there a strong correlation, or is there none? Or, if my approach is wrong, please suggest the correct way.

EDIT

The following is the contingency table for the "col1" column and the target column.

TARGET_COl               Grade 1  Grade 2  Grade 3  Grade 4  Grade 5
Col1                                             
0                           290      392      932     1812     2854
1                           522      421      574      917     1247
2                         56789    81296   117971   147811   204480
3                          3719     2975     2811     1704     2244

And this is for "col2" column and target column.

TARGET_COl         Grade 1  Grade 2  Grade 3  Grade 4  Grade 5
Col2                                             
0                    50867    73899   107101   135400   193526
1                    10453    11185    15187    16844    17299

Input =("
 Col1  Grade1  Grade2  Grade3  Grade4  Grade5
 0        290     392     932    1812    2854
 1        522     421     574     917    1247
 2      56789   81296  117971  147811  204480
 3       3719    2975    2811    1704    2244
")

Matrix = as.matrix(read.table(textConnection(Input), header=TRUE, row.names=1))

Matrix

chisq.test(Matrix)

   ### Pearson's Chi-squared test
   ###
   ### X-squared = 8113.9, df = 12, p-value < 2.2e-16

library(vcd)

assocstats(Matrix)

   ### Cramer's V : 0.065

prop.table(Matrix, margin=2)

   ###        Grade1      Grade2      Grade3      Grade4      Grade5
   ### 0 0.004729289 0.004607212 0.007621353 0.011901947 0.013537294
   ### 1 0.008512720 0.004948051 0.004693837 0.006023226 0.005914858
   ### 2 0.926108937 0.955479291 0.964698090 0.970882268 0.969903949
   ### 3 0.060649054 0.034965446 0.022986720 0.011192559 0.010643899

Best Answer

Because your sample size is large, the Chi-square test is likely to return a low p-value even for a table with small differences from the expected proportions.
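To make the sample-size point concrete, here is a small pure-Python sketch (not part of the original R analysis) that takes the Col2 table from the question and a copy with every count divided by 1000, so both tables have identical proportions. The helper names are hypothetical, and the p-value uses a closed form that is only valid for even degrees of freedom, which happens to hold here (df = 4).

```python
import math


def chi_squared(table):
    """Pearson chi-squared statistic, without continuity correction."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return sum(
        (obs - row_totals[i] * col_totals[j] / n) ** 2
        / (row_totals[i] * col_totals[j] / n)
        for i, row in enumerate(table)
        for j, obs in enumerate(row)
    )


def chi2_upper_tail_even_df(stat, df):
    """P(X >= stat) for chi-squared X; closed form for even df only."""
    if df % 2 != 0:
        raise ValueError("this closed form only covers even df")
    half = stat / 2.0
    term = 1.0
    total = 1.0
    for i in range(1, df // 2):
        term *= half / i
        total += term
    return math.exp(-half) * total


def summarize(table):
    """Return (sample size, Cramer's V, p-value) for a contingency table."""
    n = sum(sum(row) for row in table)
    k = min(len(table), len(table[0])) - 1
    df = (len(table) - 1) * (len(table[0]) - 1)
    stat = chi_squared(table)
    return n, math.sqrt(stat / (n * k)), chi2_upper_tail_even_df(stat, df)


# Col2 by target grade, copied from the question.
col2_table = [
    [50867, 73899, 107101, 135400, 193526],
    [10453, 11185, 15187, 16844, 17299],
]
# Same proportions, but only about 632 observations in total.
col2_small = [[count / 1000 for count in row] for row in col2_table]

n_large, v_large, p_large = summarize(col2_table)
n_small, v_small, p_small = summarize(col2_small)
print(f"n = {n_large:>9.0f}  V = {v_large:.4f}  p = {p_large:.3g}")
print(f"n = {n_small:>9.0f}  V = {v_small:.4f}  p = {p_small:.3g}")
```

Cramer's V comes out the same for both tables, because the statistic grows in proportion to n and V divides n back out. Only the p-value changes: it underflows to 0 for the full table and is well above 0.05 for the reduced one.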

To get a sense of the effect size being reported by Cramer's V, it is helpful to look at the proportions in the table. For example, for Col1, you can see that the share of each grade's observations that falls in a given row does not differ much from grade to grade. Row 0 is about 0.5 to 1.4 percent of the observations in each grade. Row 2 is, say, 93 to 97 percent of observations in each grade. And so on.
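For readers working in Python rather than R, the within-grade proportions can be sketched as follows. This is a plain-Python equivalent of dividing each cell by its column total, using the Col1 table from the question; the variable names are only illustrative.

```python
# Col1 (rows 0-3) by target grade (columns Grade 1-5), copied from the question.
col1_table = [
    [290, 392, 932, 1812, 2854],
    [522, 421, 574, 917, 1247],
    [56789, 81296, 117971, 147811, 204480],
    [3719, 2975, 2811, 1704, 2244],
]

# Total number of observations in each grade (column sums).
col_totals = [sum(col) for col in zip(*col1_table)]

# Share of each grade's observations that falls in each Col1 level.
proportions = [
    [cell / total for cell, total in zip(row, col_totals)]
    for row in col1_table
]

for level, row in enumerate(proportions):
    low, high = min(row), max(row)
    shares = "  ".join(f"{p:.4f}" for p in row)
    print(f"Col1={level}:  {shares}   (range {low:.4f} to {high:.4f})")
```

Each printed row shows how much the share of one Col1 level varies across the five grades, which is the comparison the paragraph above describes.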

Whether these kinds of differences in proportions are meaningful in your context is up to you. The p-value and Cramer's V give you certain information. The practical importance of your results is something you will have to decide.

The R code for this analysis, with its output, appears above, after the contingency tables.

I am getting a slightly different Cramer's V than you, so that's something that you might want to look into.
