Solved – Partial correlation and multiple regression controlling for categorical variables

Tags: correlation, descriptive statistics

I am looking at the linear association between variables $X$ and $Y$ while controlling for variables $\mathbf{Z} = (Z_1, Z_2, Z_3, \dots)$.

One approach is to fit the regression $E(Y) = \alpha_0 + \alpha_1 X + \sum_i \beta_i Z_i$ and look at $\alpha_1$.

Another approach is to calculate the partial correlation $\rho_{XY.\mathbf{Z}}$.

My first question is: which one is more appropriate? (Added: from the comments, they are equivalent, just different presentations.)

For multivariate normal $(X, Y, \mathbf{Z})$, the partial correlation $\rho_{XY.\mathbf{Z}}$ would be a better choice, as its value tells us how strongly $X$ and $Y$ are associated; for example, $\rho_{XY.\mathbf{Z}} = \pm 1$ can be interpreted as $X$ and $Y$ having a perfect linear relationship after controlling for $\mathbf{Z}$.

What if some of the $Z_i$'s are categorical? Is partial correlation still a proper measure of the association between $X$ and $Y$ controlling for the $Z_i$'s? (Added: it is acceptable to convert the categorical variables to dummy variables and then control for them, just as we handle them in regression.)

When I learnt partial correlation, it was used for multivariate normally distributed variables; I am not sure whether it is still appropriate and meaningful when normality is violated (say, the data are highly skewed) or when a variable is not even continuous (say, one of our controlling variables is 'place of birth'). The calculation of the partial correlation of $X$ and $Y$ controlling for $Z$ involves the Pearson correlations $\rho_{XZ}$ and $\rho_{YZ}$, which do not make sense when $Z$ is categorical, and this makes $\rho_{XY.Z}$ look weird.
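To make the dummy-coding remark concrete, here is a minimal base-R sketch (the data-generating process and variable names are illustrative, not from the question): partial out a categorical $Z$ by regressing $X$ and $Y$ on the dummy-coded factor and correlating the residuals.

```r
# Minimal sketch: partial correlation of X and Y controlling for a
# categorical Z via dummy coding (illustrative simulated data).
set.seed(1)
n <- 500
Z <- factor(sample(c("A", "B", "C"), n, replace = TRUE))  # e.g. place of birth
X <- as.numeric(Z) + rnorm(n)
Y <- as.numeric(Z) + rnorm(n)

# lm() dummy-codes the factor automatically; residualize X and Y on Z
rx <- resid(lm(X ~ Z))
ry <- resid(lm(Y ~ Z))
cor(rx, ry)  # partial correlation of X and Y given the dummies for Z
```

Here the marginal `cor(X, Y)` is inflated by the shared dependence on $Z$, while the residual correlation removes it.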

Also, is there any robust version of partial correlation (in the way Kendall's $\tau$ and Spearman's rank correlation are robust alternatives to Pearson's correlation)?

Raised by ssdecontrol: regression "works" and "makes sense" with categorical predictors, but correlation is occasionally said to be inappropriate for categorical data. Since regression is partial correlation, we have an apparent paradox.
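The equivalence mentioned in the comments can be checked numerically: the partial correlation equals $t/\sqrt{t^2 + \mathrm{df}}$, where $t$ is the $t$ statistic of the coefficient of $X$ and $\mathrm{df}$ is the residual degrees of freedom. A base-R sketch with simulated data (my own illustration, not from the question):

```r
# Sketch of the regression / partial-correlation equivalence:
# r_{XY.Z} = t / sqrt(t^2 + df), a standard OLS identity.
set.seed(1)
n <- 200
Z <- rnorm(n)
X <- Z + rnorm(n)
Y <- Z + rnorm(n)

fit <- lm(Y ~ X + Z)
t_x <- summary(fit)$coefficients["X", "t value"]
r_from_t <- t_x / sqrt(t_x^2 + fit$df.residual)

# The same number, computed directly as a correlation of residuals:
r_direct <- cor(resid(lm(X ~ Z)), resid(lm(Y ~ Z)))
all.equal(r_from_t, r_direct)  # TRUE
```

So the regression coefficient's $t$ statistic and the partial correlation carry the same information, which resolves the apparent paradox.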

Thanks.

Best Answer

It seems to me that the only unanswered part of your question is the part cited below:

Also, is there any robust version of partial correlation (like Kendall's $\tau$/Spearman's rank correlation to Pearson's correlation)?

The same way you can have a partial Pearson correlation coefficient, you can have a partial Spearman correlation coefficient, and a partial Kendall one as well. See some R code below using the package ppcor, which computes partial correlations.

library(ppcor)

set.seed(2021)
N <- 1000
X <- rnorm(N)
Y <- rnorm(N)
Z <- rnorm(N)

pcor.test(X, Y, Z, method='pearson')

You will be given an estimate of $-0.01175714$. If you rank the variables first, the Pearson partial correlation of the ranks is equivalent to the Spearman partial correlation.

pcor.test(rank(X), rank(Y), rank(Z), method='pearson')

And this way you get a partial Spearman correlation of $0.008965395$. But you don't have to rank by hand; you can just change the method parameter of the function to spearman.

pcor.test(X, Y, Z, method='spearman')

And here we go, $0.008965395$ again. If you want the partial Kendall correlation, just change the method parameter again.

pcor.test(X, Y, Z, method='kendall')

This time, we got a partial Kendall correlation of $0.006344739$.

If by robust you mean not depending on the distribution of the random variables and, most importantly, a measure of dependence that goes beyond linearity, I recommend you read about Mutual Information.
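As a pointer in that direction, here is a sketch using the infotheo package (my choice of package, not something from the question): mutual information picks up a nonlinear dependence that Pearson correlation misses entirely.

```r
# Sketch: mutual information as a distribution-free dependence measure.
# Assumes the infotheo package (install.packages("infotheo")); illustrative data.
library(infotheo)
set.seed(2021)
X <- rnorm(1000)
Y <- X^2 + 0.1 * rnorm(1000)  # strongly dependent on X, yet uncorrelated

cor(X, Y)                                      # near zero
mutinformation(discretize(X), discretize(Y))   # clearly positive
```

The estimator needs discrete inputs, hence the `discretize()` step that bins the continuous variables.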