Solved – Correlations between continuous and categorical (nominal) variables

Tags: biostatistics, categorical data, correlation, descriptive statistics, spearman-rho

I would like to find the correlation between a continuous dependent variable and a categorical independent variable (nominal: gender). The continuous data are not normally distributed. Previously, I computed it using Spearman's $\rho$. However, I have been told that this is not right.

While searching on the internet, I found that a boxplot can give an idea of how strongly the two variables are associated; however, I am looking for a quantified value such as Pearson's product-moment coefficient or Spearman's $\rho$. Can you please help me with how to do this, or tell me which method would be appropriate?

Would the point-biserial coefficient be the right option?

Best Answer

The reviewer should have told you why Spearman's $\rho$ is not appropriate. Here is one version of that argument. Let the data be $(Z_i, I_i)$, where $Z$ is the measured variable and $I$ is the gender indicator, say 0 (man) or 1 (woman). Spearman's $\rho$ is calculated from the ranks of $Z$ and of $I$ separately. Since the indicator $I$ has only two possible values, there will be a lot of ties, so the usual formula, which assumes no ties, is not appropriate. If you replace each tied rank with the mean rank, the ranks of $I$ take only two different values, one for men and another for women. Then $\rho$ becomes essentially a rescaled version of the difference in the mean ranks of $Z$ between the two groups. It would be simpler (and more interpretable) to just compare the means! Another approach is the following.
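Before turning to that second approach, the mean-rank claim above can be checked numerically. The sketch below is only an illustration under stated assumptions: the data are made up, the helper names `midranks` and `pearson` are hypothetical, and Spearman's $\rho$ is taken to be the Pearson correlation of the mean ranks (the usual way of handling ties). Under those assumptions, $\rho$ equals the difference in mean ranks of $Z$ between the groups multiplied by a factor that depends only on the group sizes and the spread of the ranks.

```python
from math import sqrt

def midranks(values):
    """Rank values 1..N, giving tied values the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(a, b):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / sqrt(var_a * var_b)

# Made-up illustrative data: z is the measurement, g the indicator (0 = man, 1 = woman).
z = [4.1, 5.3, 2.2, 6.0, 3.7, 7.4, 5.9, 8.8]
g = [0, 0, 0, 0, 1, 1, 1, 1]

rz = midranks(z)
rg = midranks(g)          # takes only two distinct values, one per group
rho = pearson(rz, rg)     # Spearman's rho computed on mean ranks

# The same number from the difference in mean ranks of z between the groups.
n0, n1, n_total = g.count(0), g.count(1), len(g)
mean_r0 = sum(r for r, gi in zip(rz, g) if gi == 0) / n0
mean_r1 = sum(r for r, gi in zip(rz, g) if gi == 1) / n1
mean_r = sum(rz) / n_total
sd_r = sqrt(sum((r - mean_r) ** 2 for r in rz) / n_total)  # population SD of the ranks
rho_from_means = (mean_r1 - mean_r0) * sqrt(n0 * n1) / (n_total * sd_r)

print(rho, rho_from_means)  # the two values agree up to rounding error
```

The two printed numbers coincide, which is the sense in which $\rho$ here carries no information beyond the two group mean ranks. The second approach, which does not rank the indicator at all, follows.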

Let $X_1, \dots, X_n$ be the observations of the continuous variable among men, and $Y_1, \dots, Y_m$ the same among women. If the distributions of $X$ and $Y$ are the same, then $P(X>Y)$ will be 0.5 (let's assume the distribution is absolutely continuous, so there are no ties). In the general case, define $$ \theta = P(X>Y) $$ where $X$ is a random draw among men and $Y$ a random draw among women. Can we estimate $\theta$ from our sample? Form all $nm$ pairs $(X_i, Y_j)$ (assume no ties) and count the number $M$ of pairs for which "man is larger" ($X_i > Y_j$) and the number $W$ of pairs for which "woman is larger" ($X_i < Y_j$). Then one sample estimate of $\theta$ is $$ \hat{\theta} = \frac{M}{M+W} $$ That is one reasonable measure of correlation! (If there are only a few ties, just ignore them.) But I am not sure what that is called, if it has a name. This one may be close: https://en.wikipedia.org/wiki/Goodman_and_Kruskal%27s_gamma
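The pair-counting estimate described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `prob_x_greater` and the data are made up, and tied pairs are simply dropped, as the answer suggests. For what it is worth, when there are no ties $M$ coincides with the Mann-Whitney $U$ statistic of the men's sample, and $M/(nm)$ is often reported under names such as "probability of superiority" or "common language effect size".

```python
def prob_x_greater(x, y):
    """Estimate theta = P(X > Y) as M / (M + W) over all pairs (x_i, y_j).

    M counts pairs with x_i > y_j, W counts pairs with x_i < y_j;
    tied pairs are ignored.
    """
    m_count = sum(1 for xi in x for yj in y if xi > yj)
    w_count = sum(1 for xi in x for yj in y if xi < yj)
    if m_count + w_count == 0:
        raise ValueError("every pair is tied; theta cannot be estimated")
    return m_count / (m_count + w_count)

# Made-up illustrative measurements.
men = [4.1, 5.3, 2.2, 6.0]
women = [3.7, 7.4, 5.9, 8.8]

theta_hat = prob_x_greater(men, women)
print(theta_hat)  # 0.25: in 4 of the 16 pairs the man has the larger value
```

A value near 0.5 suggests no systematic difference between the groups, while values near 0 or 1 indicate that one group tends to have the larger measurements. Rescaling to $2\hat{\theta} - 1$ gives a number in $[-1, 1]$, which may feel closer to a conventional correlation coefficient.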
