Correlation – Calculating Correlation Coefficient Between Nominal and Numeric or Ordinal Variables

categorical data, continuous data, correlation, MATLAB, ordinal-data

I've already read through the pages on this site trying to find an answer to my problem, but none of them seems to be the right one for me…

First, let me explain the kind of data I'm working with…

Let's say I have a vector of city names, one for each of 300 users. I also have another vector that holds, for each user, either a score from a survey response or a continuous value.

I would like to know whether there is a correlation coefficient that measures the association between these two variables, that is, between a nominal variable and a numeric/continuous or ordinal variable.

I've searched the Internet, and some pages suggest using the contingency coefficient, Cramer's V, the Lambda coefficient, or Eta. For each of these measures they just say that it can be applied to data with a nominal variable and an interval or numeric variable.
The trouble is that, after a lot of searching and trying to understand each of them, the requirements are unclear: sometimes the text or the examples indicate that a measure is only reasonable when the nominal variable is dichotomous (except for Cramer's V), and other times no requirement on the type of data is stated at all.
A lot of other pages say that the right thing is to use regression instead. That is true, but I would simply like to know whether there is a coefficient like Pearson/Spearman for this kind of data.

I also think it is not appropriate to use the Spearman correlation coefficient, since the cities cannot be ordered.

I have also written functions for Cramer's V and Eta myself (I'm working with MATLAB), but for Eta the sources don't mention any p-value for checking whether the coefficient is statistically significant…

On the MathWorks site there is also a nice toolbox that claims to compute eta^2, but I can't work out what kind of input it expects.

Has anyone here done a test like mine? If you need more detail about the kind of data I'm using, just ask and I'll try to explain it better.

Best Answer

Nominal vs Interval

The most classic "correlation" measure between a nominal and an interval ("numeric") variable is Eta, also called the correlation ratio. It equals the square root of the R-square of the one-way ANOVA, and its p-value is that of the ANOVA. Eta can be seen as a symmetric association measure, like correlation, because the Eta-squared of the ANOVA (with the nominal as independent, numeric as dependent) is equal to Pillai's trace of the multivariate regression (with the numeric as independent, set of dummy variables corresponding to the nominal as dependent).
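To make the definition concrete, here is a minimal sketch of Eta as the square root of SS_between / SS_total, together with the one-way ANOVA F statistic. It uses plain Python rather than MATLAB, the function name `correlation_ratio` is my own, and the city/score data are made up for illustration. The p-value is not computed here; it would come from the F distribution with (k - 1, n - k) degrees of freedom in whatever statistics library you use.

```python
from collections import defaultdict

def correlation_ratio(groups, values):
    """Return (eta, F) for a nominal `groups` and a numeric `values` sequence.

    eta = sqrt(SS_between / SS_total); F is the one-way ANOVA statistic.
    Assumes at least two groups and some variation within and between groups.
    """
    by_group = defaultdict(list)
    for g, v in zip(groups, values):
        by_group[g].append(float(v))

    n = len(values)
    k = len(by_group)
    grand_mean = sum(values) / n

    # Between-group sum of squares: group sizes times squared mean deviations.
    ss_between = sum(
        len(vs) * (sum(vs) / len(vs) - grand_mean) ** 2
        for vs in by_group.values()
    )
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    ss_within = ss_total - ss_between

    eta = (ss_between / ss_total) ** 0.5
    f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))
    return eta, f_stat

# Made-up example data: three cities, three users each.
cities = ["Rome", "Rome", "Rome", "Milan", "Milan", "Milan",
          "Turin", "Turin", "Turin"]
scores = [1, 2, 3, 4, 5, 6, 7, 8, 9]

eta, f_stat = correlation_ratio(cities, scores)
print(eta, f_stat)
```

For this toy data SS_between is 54 and SS_total is 60, so Eta is sqrt(0.9) and F is 27 on (2, 6) degrees of freedom.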

A more subtle measure is the intraclass correlation coefficient (ICC). Whereas Eta grasps only the difference between groups (defined by the nominal variable) with respect to the numeric variable, ICC simultaneously also measures the coordination or agreement between numeric values inside groups; in other words, ICC (particularly the original unbiased "pairing" ICC version) stays on the level of values while Eta operates on the level of statistics (group means vs group variances).
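As a rough illustration, the sketch below computes the common one-way ANOVA estimator of the ICC, (MS_between - MS_within) / (MS_between + (n0 - 1) MS_within), where n0 is the usual effective group size for unbalanced groups. Note the assumption: this is the ANOVA-based version, not necessarily the "pairing" version mentioned above. The function name `icc_oneway` and the data are hypothetical.

```python
from collections import defaultdict

def icc_oneway(groups, values):
    """One-way ANOVA estimator of the intraclass correlation.

    Assumption: this is the ANOVA-based ICC, chosen for illustration only.
    """
    by_group = defaultdict(list)
    for g, v in zip(groups, values):
        by_group[g].append(float(v))

    n = len(values)
    k = len(by_group)
    grand_mean = sum(values) / n

    ss_between = 0.0
    ss_within = 0.0
    for vs in by_group.values():
        mean_g = sum(vs) / len(vs)
        ss_between += len(vs) * (mean_g - grand_mean) ** 2
        ss_within += sum((v - mean_g) ** 2 for v in vs)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)

    # Effective group size; equals the common size when groups are balanced.
    n0 = (n - sum(len(vs) ** 2 for vs in by_group.values()) / n) / (k - 1)

    return (ms_between - ms_within) / (ms_between + (n0 - 1) * ms_within)

# Made-up example data: three cities, three users each.
cities = ["Rome", "Rome", "Rome", "Milan", "Milan", "Milan",
          "Turin", "Turin", "Turin"]
scores = [1, 2, 3, 4, 5, 6, 7, 8, 9]

icc = icc_oneway(cities, scores)
print(icc)
```

With this toy data MS_between is 27, MS_within is 1 and n0 is 3, giving 26/29.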

Nominal vs Ordinal

The question of a "correlation" measure between a nominal and an ordinal variable is less clear-cut. The reason for the difficulty is that the ordinal scale is, by its nature, more "mystic" or "twisted" than interval or nominal scales. No wonder that statistical analyses designed specifically for ordinal data are relatively poorly formulated so far.

One way might be to convert your ordinal data into ranks and then compute Eta as if the ranks were interval data. The p-value of such Eta = that of Kruskal-Wallis analysis. This approach seems warranted due to the same reasoning as why Spearman rho is used to correlate two ordinal variables. That logic is "when you don't know the interval widths on the scale, cut the Gordian knot by linearizing any possible monotonicity: go rank the data".
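A minimal sketch of the rank-then-Eta idea, in plain Python with made-up data and my own function names (`midranks`, `rank_eta`). It assigns average ranks to ties, computes Eta on the ranks, and uses the identity that the tie-corrected Kruskal-Wallis statistic equals (N - 1) times Eta-squared of the ranks. The p-value itself is left out; it would come from the chi-square approximation with k - 1 degrees of freedom.

```python
from collections import defaultdict

def midranks(values):
    """Rank values from 1..N, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks

def rank_eta(groups, ordinal):
    """Return (eta, H): Eta on midranks and the Kruskal-Wallis statistic."""
    ranks = midranks(ordinal)
    n = len(ranks)
    by_group = defaultdict(list)
    for g, r in zip(groups, ranks):
        by_group[g].append(r)

    grand_mean = (n + 1) / 2  # mean of midranks is always (N + 1) / 2
    ss_between = sum(
        len(rs) * (sum(rs) / len(rs) - grand_mean) ** 2
        for rs in by_group.values()
    )
    ss_total = sum((r - grand_mean) ** 2 for r in ranks)

    eta_sq = ss_between / ss_total
    h = (n - 1) * eta_sq  # tie-corrected Kruskal-Wallis H
    return eta_sq ** 0.5, h

# Made-up example: 5-point survey answers for users in three cities.
cities = ["Rome"] * 4 + ["Milan"] * 4 + ["Turin"] * 4
answers = [1, 2, 2, 3, 2, 3, 4, 4, 3, 4, 5, 5]

eta, h = rank_eta(cities, answers)
print(eta, h)
```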

Another approach (possibly more rigorous and flexible) would be to use ordinal logistic regression with the ordinal variable as the DV and the nominal one as the IV. The square root of Nagelkerke’s pseudo R-square (with the regression's p-value) is another correlation measure for you. Note that you can experiment with various link functions in ordinal regression. This association is, however, not symmetric: the nominal is assumed independent.
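For reference, the usual definition of Nagelkerke's pseudo R-square, written in my own notation (these symbols are not from the answer above): L_0 is the likelihood of the thresholds-only model, L_M the likelihood of the model with the nominal predictor, and n the sample size.

```latex
R^2_{\mathrm{CS}} = 1 - \left(\frac{L_0}{L_M}\right)^{2/n},
\qquad
R^2_{\mathrm{N}} = \frac{R^2_{\mathrm{CS}}}{1 - L_0^{\,2/n}},
\qquad
r = \sqrt{R^2_{\mathrm{N}}}
```

Here R^2_CS is the Cox and Snell pseudo R-square, which Nagelkerke's version rescales so that its maximum is 1, and r is the correlation-like measure suggested above.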

Yet another approach might be to find a monotonic transformation of the ordinal data into interval data (instead of the ranking used in the first approach) that would maximize R (i.e. Eta) for you. This is categorical regression (= linear regression with optimal scaling).

Still another approach is to build a classification tree, such as CHAID, with the ordinal variable as predictor. This procedure will bin together adjacent ordered categories that do not distinguish among categories of the nominal predictand (hence it is the opposite of the previous approach). Then you could rely on Chi-square-based association measures (such as Cramer's V) as if you were correlating nominal vs nominal variables.
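The last step can be sketched as follows: once the ordered categories have been merged into bins, Cramer's V is computed from the contingency table as sqrt(chi2 / (N * min(r - 1, c - 1))). The tree-building itself is not shown. The "low"/"high" bins below are a hypothetical merging of the kind a tree might produce, the data are made up, and the function name `cramers_v` is mine.

```python
from collections import Counter

def cramers_v(x, y):
    """Cramer's V for two categorical sequences of equal length.

    Assumes each variable has at least two observed categories.
    """
    n = len(x)
    joint = Counter(zip(x, y))
    row = Counter(x)
    col = Counter(y)

    # Pearson chi-square over every cell, including empty ones.
    chi2 = 0.0
    for r, n_r in row.items():
        for c, n_c in col.items():
            expected = n_r * n_c / n
            observed = joint.get((r, c), 0)
            chi2 += (observed - expected) ** 2 / expected

    k = min(len(row), len(col)) - 1
    return (chi2 / (n * k)) ** 0.5

# Made-up example: cities vs. ordinal answers already merged into two bins.
cities = ["Rome", "Rome", "Rome", "Milan", "Milan", "Milan",
          "Turin", "Turin", "Turin"]
bins = ["low", "low", "high", "low", "high", "high",
        "high", "high", "high"]

v = cramers_v(cities, bins)
print(v)
```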

And @Michael in his comment suggests yet one more way - a special coefficient called Freeman's Theta.

So, we have arrived so far at these options: (1) Rank, then compute Eta; (2) Use ordinal regression; (3) Use categorical regression ("optimally" transforming the ordinal variable into interval); (4) Use a classification tree ("optimally" reducing the number of ordered categories); (5) Use Freeman's Theta.