I have a dataset that contains many categorical variables. The target variable is also categorical. I tried to measure the association between each categorical variable and the target variable using a chi-squared test, and also computed Cramér's V. Here is part of the output:
column   Cramer's V   p value
Col1     0.065430     0.000000e+00
Col2     0.084450     0.000000e+00
Col3     0.059535     0.000000e+00
Col4     0.119343     0.000000e+00
Col5     0.108018     0.000000e+00
Col6     0.040086     1.842584e-218
Col7     0.021307     7.523901e-61
Col8     0.012404     3.865421e-20
Col9     0.009400     2.183289e-11
Col10    0.010728     6.082550e-15
I obtained the chi-squared statistic and p-value with Python's scipy.stats.chi2_contingency, and then computed Cramér's V as the square root of the chi-squared statistic divided by the product of the sample size and the smaller table dimension minus 1, that is, V = sqrt(chi2 / (n * (min(rows, cols) - 1))), following the Wikipedia page.
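For concreteness, the calculation described above can be sketched as follows. This is a minimal sketch, not necessarily the exact script that produced the output: the helper name cramers_v is hypothetical, and the counts are the Col1 contingency table given in the edit further down.

```python
import numpy as np
from scipy.stats import chi2_contingency


def cramers_v(table):
    """Return (Cramer's V, p-value) for a 2-D contingency table of counts.

    Hypothetical helper for illustration. Note that chi2_contingency applies
    Yates' continuity correction by default, but only to tables with one
    degree of freedom (2x2), so it does not affect the tables in this post.
    """
    table = np.asarray(table)
    chi2, p_value, dof, expected = chi2_contingency(table)
    n = table.sum()                    # total sample size
    min_dim = min(table.shape) - 1     # min(rows, cols) - 1
    return np.sqrt(chi2 / (n * min_dim)), p_value


# Col1 (rows) vs. target grade (columns), copied from the table in the edit.
col1_table = [
    [290, 392, 932, 1812, 2854],
    [522, 421, 574, 917, 1247],
    [56789, 81296, 117971, 147811, 204480],
    [3719, 2975, 2811, 1704, 2244],
]

v, p = cramers_v(col1_table)
print(f"Cramer's V = {v:.6f}, p-value = {p:.3g}")
```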
All the p-values from the chi-squared tests are nearly 0, which I take to mean that there is a very strong association. But the Cramér's V values are also very small (<< 1), which (again according to the Wikipedia page) suggests that there is no strong association. How should I interpret these conflicting results? Is there a strong association, or none? Or, if my approach is wrong, please suggest the correct way.
EDIT
The following is the contingency table for the "Col1" column and the target column.
TARGET_COl   Grade 1   Grade 2   Grade 3   Grade 4   Grade 5
Col1
0                290       392       932      1812      2854
1                522       421       574       917      1247
2              56789     81296    117971    147811    204480
3               3719      2975      2811      1704      2244
And this is the table for the "Col2" column and the target column.
TARGET_COl   Grade 1   Grade 2   Grade 3   Grade 4   Grade 5
Col2
0              50867     73899    107101    135400    193526
1              10453     11185     15187     16844     17299
Best Answer
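Because your sample size is large (more than 600,000 observations), the chi-square test is likely to return a low p-value even for a table with small differences from the expected proportions. A low p-value is evidence that some association exists; it says nothing about how strong that association is.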
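To illustrate the effect of sample size, here is a rough Python sketch using the Col2 table from the question. The second table is hypothetical: it is the same table with every count divided by 1000, so the proportions are identical and only the sample size changes. The helper name chi2_and_v is also just for illustration.

```python
import numpy as np
from scipy.stats import chi2_contingency


def chi2_and_v(table):
    """Return (chi-squared statistic, p-value, Cramer's V) for a table of counts."""
    table = np.asarray(table, dtype=float)
    chi2, p_value, dof, expected = chi2_contingency(table)
    v = np.sqrt(chi2 / (table.sum() * (min(table.shape) - 1)))
    return chi2, p_value, v


# Col2 (rows) vs. target grade (columns), copied from the question.
col2_table = np.array([
    [50867, 73899, 107101, 135400, 193526],
    [10453, 11185, 15187, 16844, 17299],
])

# Full sample versus a hypothetical sample 1000 times smaller with the
# same proportions in every cell.
chi2_full, p_full, v_full = chi2_and_v(col2_table)
chi2_small, p_small, v_small = chi2_and_v(col2_table / 1000)

print(f"full sample:  chi2 = {chi2_full:10.2f}  p = {p_full:.3g}  V = {v_full:.4f}")
print(f"1/1000 size:  chi2 = {chi2_small:10.2f}  p = {p_small:.3g}  V = {v_small:.4f}")
```

The chi-squared statistic scales with the sample size while Cramér's V does not, so the smaller table gives the same V but a far less extreme p-value.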
To get a sense of the effect size being reported by Cramér's V, it is helpful to look at the proportions in the table. For example, for Col1, you can see that each row's share of the observations does not change much from grade to grade. Row 0 makes up roughly 0.5 to 1.4 percent of the observations in each grade. Row 2 makes up roughly 93 to 97 percent of the observations in each grade. And so on.
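If it helps, the within-grade proportions can be computed directly. The following is a small sketch in Python with NumPy, using the Col1 table from the question; the variable names are mine.

```python
import numpy as np

grades = ["Grade 1", "Grade 2", "Grade 3", "Grade 4", "Grade 5"]

# Col1 (rows 0-3) vs. target grade (columns), copied from the question.
col1_table = np.array([
    [290, 392, 932, 1812, 2854],
    [522, 421, 574, 917, 1247],
    [56789, 81296, 117971, 147811, 204480],
    [3719, 2975, 2811, 1704, 2244],
])

# Divide each column by its total, so every grade's column sums to 1 and
# each entry is the share of that grade's observations in a given Col1 level.
col_shares = col1_table / col1_table.sum(axis=0)

print("Col1   " + "  ".join(f"{g:>8}" for g in grades))
for level, shares in enumerate(col_shares):
    print(f"{level:<6} " + "  ".join(f"{100 * s:7.2f}%" for s in shares))
```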
Whether these kinds of differences in proportions are meaningful in your context is up to you. The p-value and Cramér's V each give you certain information: the p-value speaks to whether an association is detectable, and Cramér's V to how large it is. The practical importance of your results is something you will have to decide.
The following is code for R.
I get a slightly different Cramér's V than you do, so that is something you might want to look into.