R – Calculating Covariance Between Continuous and Discrete Variables for Analysis

covariance-matrix, r

I am interested in estimating the variance-covariance matrix for a few variables in my dataset. The variables are a mix of continuous and discrete. I am curious how covariance can be estimated between a continuous and a discrete variable.

In the corrplot example linked below, the authors show the covariance/correlation between a set of continuous variables (mpg, cyl, disp, hp, drat, wt, qsec) and discrete variables (vs, am, gear, carb).

https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html

I was under the impression that we can only estimate covariance between two continuous variables or two discrete variables and not between continuous and discrete variables. Please let me know where I am wrong. Thanks.

Best Answer

All of the variables in the mtcars dataset are coded as numeric (vs and am are binary, gear is in {3,4,5}, carb is in {1,2,3,4,6,8}). As long as a variable can be represented as an ordered numeric value, it makes at least some sense to compute a Pearson correlation. A truly unordered categorical (nominal) variable (e.g. {red, yellow, blue} or {chocolate, vanilla, strawberry}), which has no meaningful ordering, would be hard to use in this context. (In this case I could make at least a plausible case that the non-binary variables here are truly ratio variables and thus are fully legitimate targets of correlation measures.)
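To illustrate why an unordered variable is awkward here, the following is a minimal sketch with made-up data (the `flavor` and `y` vectors and both integer codings are hypothetical, not taken from mtcars). It assigns two equally arbitrary integer codings to the same nominal variable and shows that the Pearson correlation depends on which coding is chosen.

```r
# Hypothetical data: a nominal variable and a numeric response
flavor <- rep(c("chocolate", "vanilla", "strawberry"), each = 4)
y <- c(1, 2, 3, 4,   5, 6, 7, 8,   2, 3, 4, 5)

# Two arbitrary integer codings of the same three categories
coding_a <- c(chocolate = 1, vanilla = 2, strawberry = 3)
coding_b <- c(chocolate = 1, strawberry = 2, vanilla = 3)

x_a <- unname(coding_a[flavor])
x_b <- unname(coding_b[flavor])

# Same data, different labels for the categories, different correlations
cor_a <- cor(y, x_a)
cor_b <- cor(y, x_b)
print(c(coding_a = cor_a, coding_b = cor_b))
```

Because nothing in the data says which coding is "right", neither number has a clear interpretation.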

As far as R is concerned, it doesn't even matter if the numeric coding makes sense; as long as both variables are numeric, it will happily compute a correlation (or covariance) for you.
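As a rough sketch of what this means in practice, the snippet below uses the built-in mtcars data to compute the covariance between a continuous variable (mpg) and a binary one (am). It also checks the algebraic identity that, for a 0/1 variable, the sample covariance is just a rescaled difference in group means. The variable names `cov_direct`, `cov_from_means`, and `factor_attempt` are mine, chosen for illustration.

```r
data(mtcars)

# Every column of mtcars is stored as numeric, so cov()/cor() accept all of them
all_numeric <- all(sapply(mtcars, is.numeric))

# Covariance between continuous mpg and binary am, computed directly
cov_direct <- cov(mtcars$mpg, mtcars$am)

# Same quantity from group means: cov = n1 * n0 / (n * (n - 1)) * (mean1 - mean0)
n  <- nrow(mtcars)
n1 <- sum(mtcars$am == 1)
n0 <- n - n1
mean_diff <- mean(mtcars$mpg[mtcars$am == 1]) - mean(mtcars$mpg[mtcars$am == 0])
cov_from_means <- n1 * n0 / (n * (n - 1)) * mean_diff

print(all.equal(cov_direct, cov_from_means))

# Once the variable is stored as a factor rather than a number, cor() refuses
factor_attempt <- tryCatch(
  cor(mtcars$mpg, factor(mtcars$am)),
  error = function(e) e
)
print(inherits(factor_attempt, "error"))
```

So the covariance between a continuous and a binary variable is well defined; it simply summarizes how far apart the two group means are.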

Whether a naive Pearson correlation involving the ranks of an ordered categorical predictor is useful or not depends on the application. For example, if you were using correlations to decide on a subset of uncorrelated features for some kind of predictive model, it could be fine. If you want to ensure specific theoretical properties, it might not be.

If you do want "statistically meaningful" ordered-ordered or ordered-continuous correlations, you should look into polychoric and polyserial correlations, e.g. as implemented in the polycor package for R by John Fox.
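A minimal sketch of how that might look, assuming the polycor package is installed and treating gear and cyl as ordered categories (that treatment is my assumption for illustration, not something the data dictate):

```r
# Only runs if polycor is available; install.packages("polycor") otherwise
if (requireNamespace("polycor", quietly = TRUE)) {
  data(mtcars)

  # Treat gear and cyl as ordered categorical variables (an assumption)
  gear_ord <- factor(mtcars$gear, ordered = TRUE)
  cyl_ord  <- factor(mtcars$cyl,  ordered = TRUE)

  # Naive Pearson correlation on the numeric codes, for comparison
  r_pearson <- cor(mtcars$mpg, mtcars$gear)

  # Polyserial: continuous variable vs. ordered categorical variable
  r_polyserial <- polycor::polyserial(mtcars$mpg, gear_ord)

  # Polychoric: ordered categorical vs. ordered categorical
  r_polychoric <- polycor::polychor(cyl_ord, gear_ord)

  print(c(pearson = r_pearson,
          polyserial = r_polyserial,
          polychoric = r_polychoric))
}
```

Both estimators assume a latent normal variable behind each ordered category, so they are only as meaningful as that assumption is for your data.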
