Solved – Correlation Between Input and Output Variable

Tags: correlation, feature selection, machine learning, regression, time series

Let's say I want to model a problem with five input variables and one output variable. I have measured the correlation of each input with the output. Three of the five inputs have a correlation below 0.1, and the remaining two have a correlation above 0.7. Should I use all the input variables in my model, or only the two with high correlation?

Best Answer

Play with it a bit. Run some scatterplots to get a feel for your data.

In all likelihood, you'll find that since those two are so highly correlated with the output variable, they're also correlated with each other, possibly even more strongly. If so, putting both of them in the model introduces multicollinearity: it inflates the standard errors of their coefficients, so both may be declared insignificant, and it makes the individual coefficient estimates unstable.
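A quick way to check this is to compute the correlation between the two strong inputs directly. Below is a minimal sketch in plain Python. The data are synthetic stand-ins (the names `x1`, `x2`, `y` and the way they are generated are assumptions for illustration, not your data), and the variance inflation factor shown uses the two-predictor formula 1 / (1 - r^2).

```python
import math
import random


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


# Synthetic stand-in data (an assumption, not the asker's measurements):
# x1, x2 and y all share one common component plus independent noise.
rng = random.Random(0)
common = [rng.gauss(0, 1) for _ in range(500)]
x1 = [c + rng.gauss(0, 0.5) for c in common]
x2 = [c + rng.gauss(0, 0.5) for c in common]
y = [c + rng.gauss(0, 0.5) for c in common]

r_x1_y = pearson(x1, y)
r_x2_y = pearson(x2, y)
r_x1_x2 = pearson(x1, x2)

# With exactly two predictors, each one's variance inflation factor
# is 1 / (1 - r^2), where r is the correlation between the predictors.
vif = 1.0 / (1.0 - r_x1_x2 ** 2)

print(f"corr(x1, y)  = {r_x1_y:.2f}")
print(f"corr(x2, y)  = {r_x2_y:.2f}")
print(f"corr(x1, x2) = {r_x1_x2:.2f}")
print(f"VIF          = {vif:.2f}")
```

If the correlation between the two inputs turns out to be close to 1, the VIF grows quickly, which is the situation the warning above is about. If it is modest, keeping both inputs is much less of a problem.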

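Since you're working with so few variables (five inputs give only 2^5 - 1 = 31 candidate models), try running all-subsets regression, and then pick the best-performing combination, judged by a criterion such as adjusted R^2 or cross-validated error.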
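Here is a rough sketch of what that could look like, using only the Python standard library. Everything in it is an assumption for illustration: the synthetic data, the helper names (`solve`, `residual_ss`, `adjusted_r2`), and the choice of adjusted R^2 as the ranking criterion. In practice you would more likely use a statistics package and your own data.

```python
import itertools
import random


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def residual_ss(columns, y):
    """Fit OLS with an intercept (normal equations); return the residual sum of squares."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    p = len(X[0])
    XtX = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(p)] for a in range(p)]
    Xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(p)]
    beta = solve(XtX, Xty)
    return sum((y[i] - sum(bj * xj for bj, xj in zip(beta, X[i]))) ** 2 for i in range(n))


def adjusted_r2(rss, tss, n, k):
    """Adjusted R^2 for a model with k predictors plus an intercept."""
    return 1.0 - (rss / (n - k - 1)) / (tss / (n - 1))


# Synthetic stand-in data (an assumption): x1 and x2 carry signal, x3..x5 are pure noise.
rng = random.Random(1)
n = 200
common = [rng.gauss(0, 1) for _ in range(n)]
inputs = {
    "x1": [c + rng.gauss(0, 0.5) for c in common],
    "x2": [c + rng.gauss(0, 0.5) for c in common],
    "x3": [rng.gauss(0, 1) for _ in range(n)],
    "x4": [rng.gauss(0, 1) for _ in range(n)],
    "x5": [rng.gauss(0, 1) for _ in range(n)],
}
y = [c + rng.gauss(0, 0.5) for c in common]

mean_y = sum(y) / n
tss = sum((v - mean_y) ** 2 for v in y)

# Fit every non-empty subset of the five inputs and score it.
results = []
for k in range(1, len(inputs) + 1):
    for names in itertools.combinations(inputs, k):
        rss = residual_ss([inputs[name] for name in names], y)
        results.append((adjusted_r2(rss, tss, n, k), names))

results.sort(key=lambda item: item[0], reverse=True)

for score, names in results[:5]:
    print(f"{score:.3f}  {', '.join(names)}")
```

Adjusted R^2 is fairly lenient about extra variables, so if several subsets score almost the same, the smaller one is usually the safer pick. Swapping in AIC or a cross-validated error only changes the scoring line.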
