Solved – Can random forest based feature selection method be used for multiple regression in machine learning

borutafeature selectionmachine learningmultiple regressionrandom forest

I would like to have a good feature selection method for a continuous response variable, over around 100 predictors. I would like to keep my model type as linear multiple regression model, rather than tree-based model.

My current method: I could calculate the (linear) correlation between each of my predictor and response, and select a subset of predictors with "strong" correlations for the final multiple regression. The prediction performance of the selected predictors will be then determined in this final multiple regression model. However, feature selection in this way is subjective, and I am afraid of missing "important" features.

I would like to apply a more objective and complete way of feature selection, such as "all-relevant" feature selection in Boruta or "variable of importance" in random forest. However, as I understand, both methods are based on tree-based random forest, which are not linear regression.

My questions are:

Is my current method a proper way to handle my research purpose?
Can random forest based feature selection handle feature selection purpose for multiple linear regression model?
Are there any other feature selection methods recommended?

Best Answer

Firstly, a method that first looks at univariate correlations for pre-identifying things that should go into a final model, will tend to do badly for a number of reasons: ignoring model uncertainy (single selected model), using statistical significance/strength of correlation as a criterion to select (if it is about prediction, you should rather try to assess how much something helps for prediction - these are not necessarily the same thing), "falsely" identifying predictors in univariate correlations (i.e. another predictor is even better, but because the one you look at correlates a bit with it, it looks like it correlates pretty well with the outcome) and missing out on predictors (they may only show up/become clear once other ones are adjusted for).

Additionally, not wrapping this into any form of bootstrapping/cross-validation/whatever to get a realistic assessment of your model uncertainty is likely to mislead you.

Furthermore, treating continuous predictors as having linear effects can often be improved upon by methods that do not make such an assumption (e.g. RF).

Using RF as a pre-selection for a linear model is not such a good idea. Variable importance is really hard to interpret and it is really hard (or meaningless?) to set a cut-off on it. You do not know whether variable importance is about the variable itself or about interactions, plus you are losing out on non-linear transformations of variables.

It depends in part of what you want to do. If you want good predictions, maybe you should not care too much about whether your method is a traditional statistical model or not.

Of course, there are plenty of things like the elastic net, LASSO, Bayesian models with the horseshoe prior etc. that fit better into a traditional modeling framework and could also accomodate e.g. splines for continuous covariates.

Related Solutions

Solved – Effect of categorical interaction terms with random forest machine learning algorithm

I think your questions are very interesting, I spend some of my time looking at the effective mapping curvature of random forest(RF) model fits. RF can capture some orders of interactions depending on the situation. x1 * x2 is a two-way interaction and so on... You did not write how many levels your categorical predictors had. It matters a lot. For continous variables(many levels) often no more than multiple local two-way interactions can be captured. The problem is, that the RF model itself only splits and do not transform data. Therefore RF is stuck with local uni-variate splits which is not optimal for captivating interactions. Therefore RF is fairly shallow compared to deep-lerning. In the complete other end of the spectre are binary features. I did not know how deep RF can go, so I ran a grid-search simulation. RF seems to capture up to some 4-8 orders of interactions for binary features. I use 12 binary variables and 100 to 15000 observations. E.g. for the 4th order interaction, the prediction vector y is:

orderOfInteraction = 4
y = factor(apply(X[,1:orderOfInteraction],1,prod))

where any element of X either is -1 or 1 and the product of the first four variable columns of X is the prediction. All four variables are completely complimentary. Therefore, no main-effects, 2nd or 3rd order effects. The OOB prediction error will therefore reflect only how well RF can captivate an interaction of the Nth order.

Things which makes RF captivate higher order of interactions: plenty of observation, few levels in variables, few variables

Limiting factors for RF captivating higher orders: the opposite of above, limited sampsize, limited maxnodes and redundant/sufficient lower order information.

The last one means that if RF can find the same information in low-order interactions, there is, so to say, no need to go deeper. Information may not even be redundant. It just have to be sufficient for RF to make correct binary predictions.

Depth of random forest: OOB err.rate vs. observations vs. order of interaction

  rm(list=ls())
  library(randomForest)
  library(parallel)
  library(rgl)

  simulate.a.forest = function(std.pars,ite.pars) {
    #Merge standard parameters with iterated parameters
    run.pars = c(std.pars,ite.pars)

    #simulate data of a given order
    X = replicate(run.pars$vars,sample(c(-1,1),run.pars$obs,replace=T))
    y = factor(apply(X[,1:run.pars$intOrder],1,prod))

    #run forest with run.pars, pars with wrong name is ignored
    rfo = do.call(randomForest, run.pars)

    #Fetch OOB error.rate and return
    out = rev(rfo$err.rate[,1])[1] #fetch error rate from object
    names(out) = paste(ite.pars,collapse="-")[1]
    return(out)
  }

  ## Lets try some situations (you can also pass arguments to randomForest here)
  intOrders = c(2,3,4,5,6,12) #hidden signal is a N-way interaction of Nth order
  obss = c(100,500,1000,3500,7000,15000) #available observations

  ## Produce list of all possible combinations of parameters
  ite.pars.matrix = expand.grid(intOrder=intOrders,obs=obss)
  n.runs = dim(ite.pars.matrix)[1]
  ite.pars.list   = lapply(1:n.runs, function(i) ite.pars.matrix[i,])

  i=1 ##for test-purposes
  out = mclapply(1:n.runs, function(i){
    #line below can be run alone without mclapply to check for errors before going multicore
    out = simulate.a.forest(std.pars=alist(x=X,y=y,
                                           ntree=250,
                                           vars=12),
                                           #sampsize=min(run.pars$obs,2000)),
                            ite.pars=ite.pars.list[[i]])
    return(out)
  })

  ##view grid results
  persp3d(x = intOrders,xlab="Nth order interaction",
          y = log(obss,base=10),ylab="10log(observations)",
          z = matrix(unlist(out),nrow=length(intOrders)),zlab="OOB prediction error, binary target",
          col=c("grey","black"),alpha=.2)

  rgl.snapshot(filename = "aweSomePlot.png", fmt = "png", top = TRUE)

Random Forest – Can It Be Used for Feature Selection in Multiple Linear Regression?

Since RF can handle non-linearity but can't provide coefficients, would it be wise to use Random Forest to gather the most important Features and then plug those features into a Multiple Linear Regression model in order to explain their signs?

I interpret OP's one-sentence question to mean that OP wishes to understand the desirability of the following analysis pipeline:

Fit a random forest to some data
By some metric of variable importance from (1), select a subset of high-quality features.
Using the variables from (2), estimate a linear regression model. This will give OP access to the coefficients that OP notes RF cannot provide.
From the linear model in (3), qualitatively interpret the signs of the coefficient estimates.

I don't think this pipeline will accomplish what you'd like. Variables that are important in random forest don't necessarily have any sort of linearly additive relationship with the outcome. This remark shouldn't be surprising: it's what makes random forest so effective at discovering nonlinear relationships.

Here's an example. I created a classification problem with 10 noise features, two "signal" features, and a circular decision boundary.

set.seed(1)
N  <- 500
x1 <- rnorm(N, sd=1.5)
x2 <- rnorm(N, sd=1.5)

y  <- apply(cbind(x1, x2), 1, function(x) (x%*%x)<1)

plot(x1, x2, col=ifelse(y, "red", "blue"))
lines(cos(seq(0, 2*pi, len=1000)), sin(seq(0, 2*pi, len=1000)))

And when we apply the RF model, we are not surprised to find that these features are easily picked out as important by the model. (NB: this model isn't tuned at all.)

x_junk   <- matrix(rnorm(N*10, sd=1.5), ncol=10)
x        <- cbind(x1, x2, x_junk)
names(x) <- paste("V", 1:ncol(x), sep="")

rf <- randomForest(as.factor(y)~., data=x, mtry=4)
importance(rf)

    MeanDecreaseGini
x1         49.762104
x2         54.980725
V3          5.715863
V4          5.010281
V5          4.193836
V6          7.147988
V7          5.897283
V8          5.338241
V9          5.338689
V10         5.198862
V11         4.731412
V12         5.221611

But when we down-select to just these two, useful features, the resulting linear model is awful.

summary(badmodel <- glm(y~., data=data.frame(x1,x2), family="binomial"))

The important part of the summary is the comparison of the residual deviance and the null deviance. We can see that the model does basically nothing to "move" the deviance. Moreover, the coefficient estimates are essentially zero.

Call:
glm(formula = as.factor(y) ~ ., family = "binomial", data = data.frame(x1, 
    x2))

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.6914  -0.6710  -0.6600  -0.6481   1.8079  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.398378   0.112271 -12.455   <2e-16 ***
x1          -0.020090   0.076518  -0.263    0.793    
x2          -0.004902   0.071711  -0.068    0.946    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 497.62  on 499  degrees of freedom
Residual deviance: 497.54  on 497  degrees of freedom
AIC: 503.54

Number of Fisher Scoring iterations: 4

What accounts for the wild difference between the two models? Well, clearly the decision boundary we're trying to learn is not a linear function of the two "signal" features. Obviously if you knew the functional form of the decision boundary prior to estimating the regression, you could apply some transformation to encode the data in a way that regression could then discover... (But I've never known the form of the boundary ahead of time in any real-world problem.) Since we're only working with two signal features in this case, a synthetic data set without noise in the class labels, that boundary between classes is very obvious in our plot. But it's less obvious when working with real data in a realistic number of dimensions.

Moreover, in general, random forest can fit different models to different subsets of the data. In a more complicated example, it won't be obvious what's going on from a single plot at all, and building a linear model of similar predictive power will be even harder.

Because we're only concerned with two dimensions, we can make a prediction surface. As expected, the random model learns that the neighborhood around the origin is important.

M                 <- 100
x_new             <- seq(-4,4, len=M)
x_new_grid        <- expand.grid(x_new, x_new)
names(x_new_grid) <- c("x1", "x2")
x_pred            <- data.frame(x_new_grid, matrix(nrow(x_new_grid)*10, ncol=10))
names(x_pred)     <- names(x)

y_hat             <- predict(object=rf, newdata=x_pred, "vote")[,2]

library(fields)
y_hat_mat         <- as.matrix(unstack(data.frame(y_hat, x_new_grid), y_hat~x1))

image.plot(z=y_hat_mat, x=x_new, y=x_new, zlim=c(0,1), col=tim.colors(255), 
           main="RF Prediction surface", xlab="x1", ylab="x2")

As implied by our abysmal model output, the prediction surface for the reduced-variable logistic regression model is basically flat.

bad_y_hat     <- predict(object=badmodel, newdata=x_new_grid, type="response")
bad_y_hat_mat <- as.matrix(unstack(data.frame(bad_y_hat, x_new_grid), bad_y_hat~x1))
image.plot(z=bad_y_hat_mat, x=x_new, y=x_new, zlim=c(0,1), col=tim.colors(255), 
           main="Logistic regression prediction surface", xlab="x1", ylab="x2")

HongOoi notes that the class membership isn't a linear function of the features, but that it a linear function is under a transformation. Because the decision boundary is $1=x_1^2+x_2^2,$ if we square these features, we will be able to build a more useful linear model. This is deliberate. While the RF model can find signal in those two features without transformation, the analyst has to be more specific to get similarly helpful results in the GLM. Perhaps that's sufficient for OP: finding a useful set of transformations for 2 features is easier than 12. But my point is that even if a transformation will yield a useful linear model, RF feature importance won't suggest the transformation on its own.

Best Answer

Related Solutions

Solved – Effect of categorical interaction terms with random forest machine learning algorithm

Random Forest – Can It Be Used for Feature Selection in Multiple Linear Regression?

Related Question