Random Forest is a bagging algorithm rather than a boosting algorithm.
They represent two opposite ways of achieving low error.
We know that error can be decomposed into bias and variance. An overly complex model has low bias but high variance, while an overly simple model has low variance but high bias; both lead to high error, but for two different reasons. As a result, two different ways of attacking the problem come to mind (perhaps from Breiman and others): variance reduction for a complex model, or bias reduction for a simple model, which correspond to random forest and boosting respectively.
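For reference, the standard bias-variance decomposition for squared error (textbook material, stated here only to make "error = bias + variance" concrete) is:

$$\mathbb{E}\big[(y - \hat{f}(x))^2\big] = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2} + \underbrace{\operatorname{Var}\big[\hat{f}(x)\big]}_{\text{variance}} + \underbrace{\sigma^2}_{\text{irreducible noise}}$$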
Random forest reduces the variance of a large number of "complex" models with low bias. Note that the component models are not "weak" models but overly complex ones: if you read about the algorithm, the underlying trees are grown roughly as large as possible. The trees are independent, parallel models, and additional random variable selection at each split is introduced to make them even more independent, which is what makes random forest perform better than ordinary bagging and gives it the name "random".
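As a rough sketch of that point (using the randomForest package on the same Hitters data used below; the ntree and mtry values are only illustrative), bagging is the special case where all predictors are candidates at every split, while random forest samples only a subset of them:

library(ISLR)
library(randomForest)
Hitters=na.omit(Hitters)
set.seed(1)
# bagging: all 19 predictors are candidates at every split (mtry = p)
bag.fit=randomForest(Salary~.,data=Hitters,mtry=19,ntree=500)
# random forest: only a random subset of predictors at each split (default mtry = p/3 for regression)
rf.fit=randomForest(Salary~.,data=Hitters,ntree=500)
bag.fit   # printing each fit shows its out-of-bag MSE for comparison
rf.fit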
Boosting, on the other hand, reduces the bias of a large number of "small" models with low variance. These are the "weak" models you quoted. The component models form a kind of "chain" or "nested" iterative scheme, each stage correcting the bias left by the previous one, so they are not independent parallel models; each model is built on top of all the earlier small models by reweighting. That is where the name "boosting", building the ensemble up one model at a time, comes from.
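A minimal sketch of that idea with the gbm package (the parameter values are illustrative, not tuned): each tree is kept deliberately shallow, i.e. "weak", and each new tree is added with a small weight (shrinkage) on top of all the previous ones.

library(ISLR)
library(gbm)
Hitters=na.omit(Hitters)
set.seed(1)
# many small trees: interaction.depth = 1 gives stumps, shrinkage scales each tree's contribution
boost.fit=gbm(Salary~.,data=Hitters,distribution="gaussian",
              n.trees=5000,interaction.depth=1,shrinkage=0.01)
summary(boost.fit)   # relative influence of each variable across all the trees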
Breiman's papers and books discuss trees, random forest and boosting quite a lot. They help you understand the principles behind the algorithms.
First, I think it is hard to say that one model "outperforms" another. Each model has different pros and cons and should be applied in different cases. For example, I would not say random forest outperforms linear regression, because linear regression is 1. more "stable", 2. less computationally demanding, and 3. more interpretable; plus, if the ground-truth relationship between features and target really is linear, nothing can beat linear regression.
Now, back to your question about code for trying the two approaches.
You can easily run the experiment both ways and compare the performance. The trick is using model.matrix in R. Here is one example from the ISL book that uses model.matrix to convert factors into a design matrix and then fits ridge or lasso.
# Chapter 6 Lab 2 of ISL book: Ridge Regression and the Lasso
library(ISLR)
library(glmnet)
Hitters=na.omit(Hitters)
# convert the formula/data-frame input into a numeric design matrix (factors become dummy variables)
x=model.matrix(Salary~.,Hitters)[,-1]
y=Hitters$Salary
set.seed(1)
train=sample(1:nrow(x), nrow(x)/2)
test=(-train)
y.test=y[test]
grid=10^seq(10,-2,length=100)
# The Lasso
lasso.mod=glmnet(x[train,],y[train],alpha=1,lambda=grid)
plot(lasso.mod)
set.seed(1)
cv.out=cv.glmnet(x[train,],y[train],alpha=1)
plot(cv.out)
# get the best-fit lambda from cross-validation, then refit on all the data
bestlam=cv.out$lambda.min
lasso.pred=predict(lasso.mod,s=bestlam,newx=x[test,])
mean((lasso.pred-y.test)^2)
out=glmnet(x,y,alpha=1,lambda=grid)
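If you also want to see which coefficients survive at that lambda (the next step in the same ISL lab), you can add:

# coefficients of the full-data fit at the cross-validated lambda; many are exactly zero
lasso.coef=predict(out,type="coefficients",s=bestlam)[1:20,]
lasso.coef[lasso.coef!=0]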
On the other hand, you can fit a random forest just as easily:
library(randomForest)
randomForest(Salary~.,data=Hitters)
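If you want the comparison to be apples-to-apples with the lasso test MSE above, a sketch reusing the same train/test split would be (exact numbers will vary with the seed):

library(randomForest)
set.seed(1)
rf.mod=randomForest(Salary~.,data=Hitters,subset=train)
rf.pred=predict(rf.mod,newdata=Hitters[test,])
mean((rf.pred-y.test)^2)   # test MSE, directly comparable to the lasso number above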
Best Answer
This sounds somewhat like gradient tree boosting. The idea of boosting is to find the best linear combination of a class of models. If we fit a tree to the data, we are trying to find the tree that best explains the outcome variable. If we instead use boosting, we are trying to find the best linear combination of trees.
However, boosting is a little more efficient: instead of growing a collection of random trees, we build each new tree to work on the examples we cannot yet predict well.
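To make "work on the examples we cannot predict well yet" concrete, here is a bare-bones sketch of least-squares gradient boosting (rpart and all the parameter values are just illustrative, not the gbm implementation itself): small trees are fit repeatedly to the current residuals.

library(ISLR)
library(rpart)
d=na.omit(Hitters)
yhat=rep(mean(d$Salary),nrow(d))   # start from a constant prediction
shrinkage=0.1                      # learning rate: each tree only contributes a little
for(b in 1:100){
  d$resid=d$Salary-yhat            # residuals: the part of Salary we still predict poorly
  tree.b=rpart(resid~.-Salary,data=d,maxdepth=2)   # a small ("weak") tree fit to the residuals
  yhat=yhat+shrinkage*predict(tree.b)
}
mean((d$Salary-yhat)^2)            # training MSE shrinks as more trees are added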
For more on this, I'd suggest reading chapter 10 of Elements of Statistical Learning: http://statweb.stanford.edu/~tibs/ElemStatLearn/
While this isn't a complete answer to your question, I hope it helps.