Random Forest – Can It Be Used for Feature Selection in Multiple Linear Regression?

feature selection, machine learning, random forest, regression, regression-strategies

Since RF can handle non-linearity but can't provide coefficients, would it be wise to use random forest to gather the most important features and then plug those features into a multiple linear regression model in order to obtain their coefficients?

Best Answer

Since RF can handle non-linearity but can't provide coefficients, would it be wise to use Random Forest to gather the most important Features and then plug those features into a Multiple Linear Regression model in order to explain their signs?

I interpret OP's one-sentence question to mean that OP wishes to understand the desirability of the following analysis pipeline (sketched in code just after the list):

  1. Fit a random forest to some data
  2. By some metric of variable importance from (1), select a subset of high-quality features.
  3. Using the variables from (2), estimate a linear regression model. This will give OP access to the coefficients that OP notes RF cannot provide.
  4. From the linear model in (3), qualitatively interpret the signs of the coefficient estimates.
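In code, that pipeline would look roughly like the sketch below. This is only a schematic illustration of the workflow under discussion; the toy data set, the outcome y, and the choice to keep two features are my own placeholders, not anything from OP's post.

library(randomForest)

# Toy data standing in for OP's data set: a continuous outcome and five predictors
set.seed(2)
dat   <- data.frame(matrix(rnorm(200*5), ncol=5))
dat$y <- 2*dat$X1 - dat$X2 + rnorm(200)

# 1. Fit a random forest
rf_fit <- randomForest(y~., data=dat)

# 2. Rank features by the forest's importance metric and keep, say, the top two
imp      <- importance(rf_fit)
top_vars <- rownames(imp)[order(imp[,1], decreasing=TRUE)][1:2]

# 3. Fit a multiple linear regression on just the selected features
lm_fit <- lm(reformulate(top_vars, response="y"), data=dat)

# 4. Read the signs off the coefficient estimates
coef(summary(lm_fit))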

I don't think this pipeline will accomplish what you'd like. Variables that are important in random forest don't necessarily have any sort of linearly additive relationship with the outcome. This remark shouldn't be surprising: it's what makes random forest so effective at discovering nonlinear relationships.

Here's an example. I created a classification problem with 10 noise features, two "signal" features, and a circular decision boundary.

set.seed(1)
N  <- 500
x1 <- rnorm(N, sd=1.5)
x2 <- rnorm(N, sd=1.5)

y  <- apply(cbind(x1, x2), 1, function(x) (x%*%x)<1)

plot(x1, x2, col=ifelse(y, "red", "blue"))
lines(cos(seq(0, 2*pi, len=1000)), sin(seq(0, 2*pi, len=1000))) 

[Figure: scatter of x1 vs x2, points colored by class, with the unit-circle decision boundary overlaid]

And when we apply the RF model, we are not surprised to find that these features are easily picked out as important by the model. (NB: this model isn't tuned at all.)

library(randomForest)

x_junk <- matrix(rnorm(N*10, sd=1.5), ncol=10)
x      <- data.frame(cbind(x1, x2, x_junk))  # unnamed junk columns get the default names V3..V12

rf <- randomForest(as.factor(y)~., data=x, mtry=4)
importance(rf)

    MeanDecreaseGini
x1         49.762104
x2         54.980725
V3          5.715863
V4          5.010281
V5          4.193836
V6          7.147988
V7          5.897283
V8          5.338241
V9          5.338689
V10         5.198862
V11         4.731412
V12         5.221611
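(As an aside, step (2) of the pipeline, the down-selection itself, can be read straight off this table; a minimal sketch, not part of the original code:)

# Keep the two features with the largest mean decrease in Gini impurity
imp  <- importance(rf)
keep <- rownames(imp)[order(imp[, "MeanDecreaseGini"], decreasing=TRUE)][1:2]
keep  # "x2" "x1"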

But when we down-select to just these two useful features, the resulting linear model is awful.

summary(badmodel <- glm(as.factor(y)~., data=data.frame(x1,x2), family="binomial"))

The important part of the summary is the comparison of the residual deviance and the null deviance. We can see that the model does basically nothing to "move" the deviance. Moreover, the coefficient estimates are essentially zero.

Call:
glm(formula = as.factor(y) ~ ., family = "binomial", data = data.frame(x1, 
    x2))

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.6914  -0.6710  -0.6600  -0.6481   1.8079  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)    
(Intercept) -1.398378   0.112271 -12.455   <2e-16 ***
x1          -0.020090   0.076518  -0.263    0.793    
x2          -0.004902   0.071711  -0.068    0.946    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 497.62  on 499  degrees of freedom
Residual deviance: 497.54  on 497  degrees of freedom
AIC: 503.54

Number of Fisher Scoring iterations: 4
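One way to quantify "basically nothing" (my own quick check, not part of the original output) is a likelihood-ratio test or a deviance-based pseudo-R²:

# Chi-squared tests on the ~0.08 total drop in deviance over 2 degrees of freedom
anova(badmodel, test="Chisq")

# Deviance-based pseudo-R^2: essentially zero
1 - badmodel$deviance/badmodel$null.deviance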

What accounts for the wild difference between the two models? Clearly, the decision boundary we're trying to learn is not a linear function of the two "signal" features. If you knew the functional form of the decision boundary before estimating the regression, you could apply a transformation to encode the data in a way that the regression could then discover... (But I've never known the form of the boundary ahead of time in any real-world problem.) Because we're working with only two signal features here, in a synthetic data set with no noise in the class labels, the boundary between classes is very obvious in our plot. It's far less obvious when working with real data in a realistic number of dimensions.

Moreover, in general, random forest can fit different models to different subsets of the data. In a more complicated example, it won't be obvious what's going on from a single plot at all, and building a linear model of similar predictive power will be even harder.

Because we're only concerned with two dimensions, we can make a prediction surface. As expected, the random forest model learns that the neighborhood around the origin is important.

M                 <- 100
x_new             <- seq(-4, 4, len=M)
x_new_grid        <- expand.grid(x_new, x_new)
names(x_new_grid) <- c("x1", "x2")

# Fill the ten junk features with fresh noise so the prediction grid has all 12 columns
x_pred            <- data.frame(x_new_grid, matrix(rnorm(nrow(x_new_grid)*10, sd=1.5), ncol=10))
names(x_pred)     <- names(x)

# Fraction of trees voting for the positive class at each grid point
y_hat             <- predict(object=rf, newdata=x_pred, type="vote")[,2]

library(fields)
y_hat_mat         <- as.matrix(unstack(data.frame(y_hat, x_new_grid), y_hat~x1))

image.plot(z=y_hat_mat, x=x_new, y=x_new, zlim=c(0,1), col=tim.colors(255), 
           main="RF Prediction surface", xlab="x1", ylab="x2")

[Figure: RF prediction surface over the (x1, x2) grid]

As implied by our abysmal model output, the prediction surface for the reduced-variable logistic regression model is basically flat.

[Figure: logistic regression prediction surface over the (x1, x2) grid, essentially flat; produced by the code below]

bad_y_hat     <- predict(object=badmodel, newdata=x_new_grid, type="response")
bad_y_hat_mat <- as.matrix(unstack(data.frame(bad_y_hat, x_new_grid), bad_y_hat~x1))
image.plot(z=bad_y_hat_mat, x=x_new, y=x_new, zlim=c(0,1), col=tim.colors(255), 
           main="Logistic regression prediction surface", xlab="x1", ylab="x2")

HongOoi notes that the class membership isn't a linear function of the features, but that it is a linear function under a transformation. Because the decision boundary is $1=x_1^2+x_2^2,$ if we square these features, we will be able to build a far more useful linear model (see the sketch below). This is deliberate. While the RF model can find the signal in those two features without any transformation, the analyst has to be more specific to get similarly helpful results from the GLM. Perhaps that's sufficient for OP: finding a useful set of transformations for 2 features is easier than for 12. But my point is that even when a transformation will yield a useful linear model, RF feature importance won't suggest the transformation on its own.
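For concreteness, here is a sketch of that transformed model on the same simulated data (my addition, not HongOoi's code). Because this toy data set has no label noise, the classes are perfectly separable in the squared features, so expect glm to warn about fitted probabilities of 0 or 1 and to report unstable coefficient estimates; the point is simply that the deviance now collapses, unlike in badmodel.

# Squaring the features linearizes the circular boundary x1^2 + x2^2 = 1
goodmodel <- glm(as.factor(y) ~ I(x1^2) + I(x2^2), family="binomial")
summary(goodmodel)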