Solved – Random Forest: what if I know a variable is important

parameterization, r, random forest

My understanding is that a random forest randomly picks mtry variables to build each decision tree. So if mtry = ncol/3, each variable will be used, on average, in 1/3 of the trees, and 2/3 of the trees will not use it.

But what if I know that a single variable is probably very important? Would it be good to manually increase the probability that this variable is picked in each tree? Is that feasible with the randomForest package in R?

Best Answer

Note that mtry is the number of variables randomly sampled as candidates at each split, not for each tree. From these candidates, the best one is chosen to perform the split. So the proportion you mention does not hold: with p predictors, a given variable is a candidate at any single split with probability mtry/p, but a tree contains many splits, so most variables get many chances within each tree. More important variables end up being used more frequently, and less important ones less frequently. If the variable really is very important, there is a high probability that it will be used in a tree, and you do not need any manual correction.

Sometimes (rarely) there is a need to force the presence of some variable in the regression, regardless of its possible importance. As far as I know, the randomForest R package does not support this. But if this variable is not correlated with the others, you can fit an ordinary regression with this variable as the single term and then run a random forest regression on the residuals of that regression.

If you still want to change the probability of choosing prespecified variables, then modifying the source code and recompiling the package is your option.
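The two-stage workaround (ordinary regression on the forced variable, then a random forest on the residuals) might look roughly like the sketch below. The simulated data and the names `x_key`, `x1`, `x2` and `predict_two_stage` are hypothetical and only for illustration. Leaving the forced variable out of the second stage is one possible reading of the suggestion, not something the answer specifies.

```r
# Minimal sketch under stated assumptions: simulated data, hypothetical names.
library(randomForest)

set.seed(42)
n <- 500
dat <- data.frame(
  x_key = rnorm(n),  # the variable we want to force into the model
  x1    = rnorm(n),  # remaining predictors, generated independently of x_key
  x2    = rnorm(n)
)
dat$y <- 3 * dat$x_key + sin(2 * dat$x1) + dat$x2^2 + rnorm(n, sd = 0.3)

# Stage 1: ordinary regression with the forced variable as the single term.
stage1 <- lm(y ~ x_key, data = dat)
dat$resid <- residuals(stage1)

# Stage 2: random forest regression on the stage 1 residuals,
# using only the remaining predictors (an assumption of this sketch).
stage2 <- randomForest(resid ~ x1 + x2, data = dat, ntree = 300)

# Combined prediction = linear part + forest part.
predict_two_stage <- function(newdata) {
  as.numeric(predict(stage1, newdata = newdata)) +
    as.numeric(predict(stage2, newdata = newdata))
}

new_obs <- data.frame(x_key = c(-1, 0, 1), x1 = 0, x2 = 0)
preds <- predict_two_stage(new_obs)
print(preds)
```

Because the forced variable enters only through the linear term, its effect is always present in the prediction, whatever the forest does with the other predictors. The price is that its effect is assumed to be linear and free of interactions with the other variables.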