Interpret categorical correlation coefficients

classification, correlation, machine learning, neural networks, statistical significance

As my dataset contains a lot of nominal categorical variables (as inputs) and a binary variable as the output, I am using Cramer's V to determine the correlation among the different input features, and between each input feature and the output variable.

However, unlike the Pearson correlation coefficient, where the association is easy to interpret and put into words, I am not able to understand how to interpret the Cramer's V coefficient.

example

Pearson correlation coefficient (Age, salary) = 0.95

Age and salary are highly positively correlated, meaning that as age increases, salary also increases (and vice versa).

However, for categorical variables like the ones below:

Cramer's V coefficient (Industry Segment, Target met) = 0.90

How to interpret – ?

Cramer's V coefficient (Regional movie industry, National movie industry)= 0.90

How to interpret – ?

Best Answer

There are two aspects here: the mathematical and the linguistic.

OP and I spoke in a separate chat about what he was curious about, and it was the conceptual/linguistic aspect (not the mathematical one). For future readers, I'm leaving a link which succinctly explains the mathematical basis and calculation of Cramer's V under "Calculation". Read here about Cramer's V.

Linguistically, "correlation" isn't exactly the right term to describe the relationship between two nominal variables, since they don't "vary" together, because they don't increase/decrease. They just change categories/groups, so you're really applying statistics on how often certain categories "coincide".

Therefore, two nominal variables are said to have an "association" (as just a general term), rather than "correlation". For Cramer's V, the closer you are to 1, the stronger the association between the variables - the closer to 0, the weaker the association between the variables.
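To make the 0-to-1 scale concrete, here is a minimal sketch that computes Cramer's V using the standard uncorrected formula, V = sqrt(chi2 / (n * (min(rows, cols) - 1))). The function name `cramers_v` and the industry/target data are hypothetical, made up purely for illustration; this is not the OP's dataset, and no bias correction is applied.

```python
from collections import Counter
from math import sqrt


def cramers_v(xs, ys):
    """Uncorrected Cramer's V for two equal-length sequences of category labels."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    joint = Counter(zip(xs, ys))   # observed count for each (row, column) cell
    row_totals = Counter(xs)
    col_totals = Counter(ys)
    k = min(len(row_totals), len(col_totals)) - 1
    if k < 1:
        raise ValueError("each variable needs at least two categories")

    # Pearson chi-square: sum over cells of (observed - expected)^2 / expected
    chi2 = 0.0
    for r, r_n in row_totals.items():
        for c, c_n in col_totals.items():
            expected = r_n * c_n / n
            observed = joint.get((r, c), 0)
            chi2 += (observed - expected) ** 2 / expected
    return sqrt(chi2 / (n * k))


# Hypothetical data: 10 "retail" rows followed by 10 "tech" rows.
industry = ["retail"] * 10 + ["tech"] * 10

# Every retail row met the target, every tech row missed it.
perfect = ["met"] * 10 + ["missed"] * 10
# Retail: 8 met, 2 missed. Tech: 2 met, 8 missed.
partial = ["met"] * 8 + ["missed"] * 2 + ["met"] * 2 + ["missed"] * 8
# Both industries: 5 met, 5 missed.
unrelated = (["met"] * 5 + ["missed"] * 5) * 2

print(round(cramers_v(industry, perfect), 3))    # 1.0: category fully determines outcome
print(round(cramers_v(industry, partial), 3))    # 0.6: categories coincide often, not always
print(round(cramers_v(industry, unrelated), 3))  # 0.0: outcome split is the same in both groups
```

Note that V carries no sign, so it only says how strongly the categories coincide. To describe which categories go together, you still need to look at the contingency table itself.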

Related Solutions

Solved – How to interpret a significant but weak correlation

More meaningful in this case is $R^2$, which gives the proportion of variation in your observations accounted for by the association. For example, if your $R$ was 0.1 (with p = 0.005, significant because of the large sample size), it means that 1% of the variation in tail lesions in pigs is accounted for by abscesses. In a multifactorial situation, such associations, though informative, may not be very meaningful. Again, be cautious, since correlation does not imply causation.
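For reference, the 1% figure comes from squaring the correlation coefficient:

$$R = 0.1 \;\Rightarrow\; R^2 = 0.1^2 = 0.01 = 1\%$$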

Solved – find repeated measure correlation coefficient using linear mixed model

I believe that what you're asking for is the Pearson correlation coefficient, aka "R". This can be derived from the coefficient of determination, or R-squared. You can use the package MuMIn to find the marginal and conditional R-squared values of your model.

"The marginal R squared values are those associated with your fixed effects, the conditional ones are those of your fixed effects plus the random effects. Usually we will be interested in the marginal effects." R squared for mixed models – the easy way; by Philip Martin.

Here's all you need:

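library(lme4)   # provides lmer() for linear mixed models
library(MuMIn)  # provides r.squaredGLMM()

# Random intercept per tree; Orange is a built-in R dataset
fit <- lmer(circumference ~ age + (1 | Tree), data = Orange)
summary(fit)

# Marginal (R2m) and conditional (R2c) R-squared
r.squaredGLMM(fit)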

To understand the "Correlation of Fixed Effects" at the bottom of your summary() output, see the following:

How do I interpret the 'correlations of fixed effects' in my glmer output?
