Solved – How to use a correlation matrix in Big Data

correlation, large-data, multicollinearity, r

I'm fairly new to Big Data and have been reading the book 'Applied Predictive Modeling' by Max Kuhn and Kjell Johnson. I'm trying to understand how to use the correlation matrix in the context of big data.

This is an example of a correlation matrix that one can generate in R:

https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html

[Image: correlation matrix plot of the mtcars dataset, from the corrplot vignette]
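For context, a minimal sketch of how a plot like that is generated in R (assuming the mtcars example used in the corrplot vignette):

```r
# Minimal sketch: correlation matrix of the mtcars data, plotted with corrplot
library(corrplot)

M <- cor(mtcars)                 # pairwise Pearson correlations
corrplot(M, method = "circle")   # circle size/colour encode the strength and sign of each pair
```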

In big data, the datasets are huge, with hundreds of predictor variables, so you can expect this square matrix to be huge as well.

I understand that to prevent multicollinearity, one should not use an extreme-blue or extreme-red pair of predictor variables, since they are highly correlated and this could affect the results coming out of your predictive model. Instead, you should pick pairs of variables with low correlation, like qsec and drat with a correlation of 0.09.
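For instance, a quick check of that pair in R (again assuming the mtcars data behind the linked vignette):

```r
# Quick check of a single low-correlation pair from mtcars
cor(mtcars$qsec, mtcars$drat)   # roughly 0.09
```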

However, is generating this matrix even relevant in the big data context? As I understand it, most predictive models have feature selection in place, so correlated predictors should already be filtered out by the model, saving us from doing so manually.

I can see the relevance of the correlation matrix for a small linear regression model, where you want to see whether predictors are correlated so you can decide whether to remove a variable from the model, but I just cannot wrap my head around the relevance of this matrix in the big data context.

Best Answer

The sensitivity of your model to highly correlated variables will depend on the model you choose. Random forest models handle correlated variables pretty well, but that doesn't mean you are really benefiting from having them there.

A random forest model may pick any one of the correlated variables as a predictor without a substantial preference of one over the other. Once one is used, the importance of the other correlated variables goes down. This could affect your interpretation of the data later on if you aren't aware of the other correlations. You could incorrectly assume one of the correlated variables is more important than the other for the model.
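As a rough illustration of that effect (a sketch assuming the randomForest package and the mtcars data, where disp and cyl are strongly correlated):

```r
# Sketch: importance tends to be shared between correlated predictors
library(randomForest)

set.seed(1)
fit <- randomForest(mpg ~ ., data = mtcars, importance = TRUE)
importance(fit)   # compare %IncMSE for correlated predictors such as disp and cyl
```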

The correlation matrix is useful for exploratory data analysis. Once you decide what kind of model you want to build, you can decide whether filtering highly correlated variables is appropriate for the situation. The caret package in R has nice, simple tools for data pre-processing.
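For example, a minimal sketch of filtering highly correlated predictors with caret's findCorrelation (the 0.75 cutoff here is an arbitrary choice for illustration):

```r
# Sketch: drop predictors whose pairwise correlation exceeds a chosen cutoff
library(caret)

M <- cor(mtcars)                               # correlation matrix of the predictors
high_cor <- findCorrelation(M, cutoff = 0.75)  # column indices flagged for removal
filtered <- mtcars[, -high_cor]                # remaining, less-correlated predictors
```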