Let's say I want to model a problem with 5 input variables and 1 output variable. I have measured the correlation of each input with the output. 3 of the 5 inputs have correlation less than 0.1, and the remaining 2 inputs have correlation greater than 0.7. Should I use all input variables in my model, or only the two with high correlation?
Solved – Correlation Between Input and Output Variable
correlation, feature selection, machine learning, regression, time series
Related Solutions
I think you could look into the field of Sensitivity Analysis: https://en.wikipedia.org/wiki/Sensitivity_analysis. In your case, I would advise computing the Sobol' indices (https://en.wikipedia.org/wiki/Variance-based_sensitivity_analysis).
These indices represent the fraction of the output variance explained by a single input variable or by a set of variables. Several R packages can compute first- and second-order indices quite efficiently by using specific designs.
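To make this concrete, here is a minimal hedged sketch using the sensitivity package (one of several R packages for this), with a hypothetical test function whose output depends strongly on two inputs and weakly on a third; the function, sample size, and input distributions are illustrative only:

library(sensitivity)   # provides Sobol' index estimators (assumed installed)

# Hypothetical model: strong dependence on X1 and X2, weak on X3
f <- function(X) X[, 1] + X[, 2] + 0.1 * X[, 3]

n  <- 1000
X1 <- data.frame(matrix(runif(3 * n), ncol = 3))  # two independent input samples
X2 <- data.frame(matrix(runif(3 * n), ncol = 3))

# First-order and total Sobol' indices via the Jansen estimator
s <- soboljansen(model = f, X1 = X1, X2 = X2)
print(s)

The printed first-order indices should come out large for the first two inputs and near zero for the third, mirroring the strong/weak split in the question.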
In your case, as the number of model evaluations is pretty small and the number of inputs is large, you could look into surrogate-based sensitivity analysis (see for instance https://doi.org/10.1016/j.apm.2013.01.019): take a well-behaved initial design (Latin Hypercube or another space-filling design) and, based on these evaluations, build a surrogate model (using Kriging). This surrogate can then be used for the intensive computations, and can give some insightful results.
Be aware, however, that due to the high number of inputs, generating an accurate surrogate will probably require many initial runs. A usual rule of thumb is to take $10d$ initial design points, where $d$ is the input dimension.
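A hedged sketch of that workflow, assuming the lhs and DiceKriging packages and a stand-in expensive_model() in place of the real simulator:

library(lhs)          # space-filling designs
library(DiceKriging)  # Kriging surrogate models

d <- 5                          # input dimension (illustrative)
n <- 10 * d                     # rule-of-thumb design size from above
X <- maximinLHS(n, d)           # maximin Latin Hypercube design on [0,1]^d

# Stand-in for the expensive simulator (hypothetical)
expensive_model <- function(x) sum(sin(pi * x)) + rnorm(1, sd = 0.01)
y <- apply(X, 1, expensive_model)

# Fit a Kriging surrogate; cheap predict() calls can then drive
# the Monte Carlo estimation of the Sobol' indices
fit  <- km(design = data.frame(X), response = y)
xnew <- data.frame(matrix(runif(d), ncol = d))
predict(fit, newdata = xnew, type = "UK")$mean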
You have stumbled into the fact that you cannot simply assemble a correlation matrix from individually valid pairwise correlations. There are many related questions on this site.
An intuitive example might help: three persons are running along a linear road. It is impossible for all three of them to run in mutually opposite directions, as there are only two directions! So the three running velocities cannot all be negatively correlated.
But this example has both negative and positive correlations, while yours has only positive ones. So let us extend it: let A, B and C all run in the same direction. If A and B run at exactly the same speed, C's correlation with each of them must be equal. And, extending by continuity, if A and B have very similar speeds, C's correlations with them cannot be very dissimilar.
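To make that continuity argument concrete: for three variables, positive semidefiniteness forces $\rho_{AC}$ to lie in the interval $\rho_{AB}\rho_{BC} \pm \sqrt{(1-\rho_{AB}^2)(1-\rho_{BC}^2)}$. A small sketch (the example values are illustrative only):

# Feasible range for cor(A, C) given cor(A, B) and cor(B, C),
# from requiring the 3x3 correlation matrix to be positive semidefinite
rho_bounds <- function(r_ab, r_bc) {
  slack <- sqrt((1 - r_ab^2) * (1 - r_bc^2))
  c(lower = r_ab * r_bc - slack, upper = r_ab * r_bc + slack)
}
rho_bounds(0.99, 0.9)  # A, B nearly identical: cor(A, C) pinned near 0.9
rho_bounds(0.10, 0.9)  # A, B weakly related: cor(A, C) can vary widely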
So the matrix you have assembled is simply not a valid correlation matrix! Maybe you should rather ask about your real problem, and tell us where your correlations come from? In fact, you must make rather large modifications to your matrix to make it valid. Some code (R):
make_Sigma <- function(rho1=0.1, rho2=0.7, rho3=0.01) {
  blck1 <- 1:3
  blck2 <- 4:7
  Sigma <- diag(10)
  Sigma[c(blck1, blck2), ] <- rho1     # rows 1-7: correlation rho1
  Sigma[blck2, blck1] <- rho2          # block (rows 4-7, cols 1-3): rho2
  Sigma[-c(blck1, blck2), ] <- rho3    # rows 8-10: correlation rho3
  diag(Sigma) <- 1
  # symmetrize: copy the lower triangle into the upper triangle
  Sigma[upper.tri(Sigma)] <- t(Sigma)[upper.tri(Sigma)]
  Sigma
}
(your code from the question, assembled as a function to facilitate experimentation). Then use the fact that a valid correlation matrix cannot have negative eigenvalues:
min(eigen(make_Sigma(rho2=0.7), only.values=TRUE)$values)
[1] -1.17539
min(eigen(make_Sigma(rho2=0.35), only.values=TRUE)$values)
[1] 0.03652832
But maybe smaller modifications to the smaller correlations are enough?
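One hedged way to explore that question: project the invalid matrix onto the nearest valid correlation matrix (Higham's algorithm, as implemented in Matrix::nearPD) and inspect how far the entries have to move:

library(Matrix)
Sig  <- make_Sigma(rho2 = 0.7)
near <- nearPD(Sig, corr = TRUE)     # nearest valid correlation matrix
round(as.matrix(near$mat), 2)        # compare entries with the original Sigma
min(eigen(as.matrix(near$mat), only.values = TRUE)$values)  # now >= 0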
Best Answer
Play with it a bit. Run some scatterplots to get a feel for your data.
In all likelihood, you'll find that since those two inputs are so highly correlated with the output variable, they are also correlated with each other, possibly even more strongly. If so, putting both of them in the model will introduce some multicollinearity, risking both being declared insignificant and skewing the coefficient estimates.
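A hedged sketch of those checks on simulated data shaped like the question (two strong, mutually correlated inputs; three weak ones); the car package supplies vif():

set.seed(1)
n  <- 100
x1 <- rnorm(n); x2 <- x1 + rnorm(n, sd = 0.2)    # two highly correlated inputs
x3 <- rnorm(n); x4 <- rnorm(n); x5 <- rnorm(n)   # three weak inputs
y  <- x1 + x2 + rnorm(n)
dat <- data.frame(x1, x2, x3, x4, x5, y)

pairs(dat)                    # scatterplot matrix: eyeball pairwise structure
fit <- lm(y ~ ., data = dat)
library(car)                  # assumed available
vif(fit)                      # values above roughly 5-10 flag multicollinearity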
Since you're working with so few variables, try running all-subsets regression, and then pick the best-performing combination!
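One way to do that, sketched with the leaps package and reusing the simulated dat from the previous sketch:

library(leaps)
all_fits <- regsubsets(y ~ ., data = dat, nvmax = 5)  # best model of each size
smry <- summary(all_fits)
smry$adjr2                              # adjusted R^2 for each model size
coef(all_fits, which.max(smry$adjr2))   # coefficients of the top model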