Solved – Random forest and LASSO regression both give different variable importances

lasso, random forest, sparse

I have a dataset with 163 observations (all countries in the world with a population > 1,000,000) and 290 variables related to their disease burden and performance. Because there are more variables than observations, I cannot run a standard linear regression. Therefore I tried both a random forest and a LASSO regression. The two give different variable importances. Which one is the most reliable in this case?

Best Answer

Before going deeper into the comparison make sure that each of the two methods agrees with itself. You can find this out by bootstrapping the entire variable importance process a few times. Plot the original variable importance for each variable vs. the importance estimated from a bootstrap sample.

The bootstrap involves taking samples of size $n$ with replacement from the original dataset of $n$ observations, and repeating any analysis. The repetitions have to be "from scratch." Here is what the process looks like in R:

n <- NROW(mydata)   # mydata = data table, data frame, or matrix
for(i in 1 : 5) {
    s <- sample(1 : n, n, replace=TRUE)
    f <- whateveranalysis(mydata[s, ])
    # Print what you need and look across the 5 bootstraps to
    # see the volatility
}
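As a minimal sketch of the loop above with something concrete plugged in: the data below are simulated, and `importance()` is a hypothetical stand-in for `whateveranalysis()` that uses the absolute correlation of each predictor with the response. Neither comes from the question. In practice the function would wrap the whole LASSO or random forest pipeline, tuning included, so that every repetition really is from scratch.

```r
# Simulated stand-in data (hypothetical, not the asker's dataset)
set.seed(1)
n <- 163
p <- 20
X <- matrix(rnorm(n * p), n, p, dimnames = list(NULL, paste0("x", 1:p)))
y <- 2 * X[, 1] - 1.5 * X[, 2] + rnorm(n)
mydata <- data.frame(y = y, X)

# Stand-in for whateveranalysis(): |correlation| of each predictor with y.
# Replace the body with the full LASSO or random forest importance pipeline.
importance <- function(d) {
    abs(cor(d[, -1], d$y))[, 1]
}

orig <- importance(mydata)          # importance on the original data

B <- 5                              # number of bootstrap repetitions
boot <- sapply(seq_len(B), function(i) {
    s <- sample(seq_len(n), n, replace = TRUE)
    importance(mydata[s, ])         # redo the analysis on the resample
})                                  # boot is a p x B matrix

# Rank agreement between the original and each bootstrap importance vector
agreement <- apply(boot, 2, function(b) cor(orig, b, method = "spearman"))
print(round(agreement, 2))

# Original vs. bootstrap importance, one point per variable per resample
matplot(orig, boot, pch = 1,
        xlab = "Original importance", ylab = "Bootstrap importance")
abline(0, 1)
```

Points scattered far from the diagonal, or low rank correlations, would suggest that a method's importances are too volatile to compare against the other method's.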

Related Solutions

Solved – Why do Lars and Glmnet give different solutions for the Lasso problem

Finally we were able to produce the same solution with both methods! The first issue is that glmnet solves the lasso problem as stated in the question, but lars uses a slightly different normalization in the objective function: it replaces $\frac{1}{2N}$ by $\frac{1}{2}$. Second, the two methods normalize the data differently, so the normalization must be switched off when calling them.

To reproduce that, and see that the same solutions for the lasso problem can be computed using lars and glmnet, the following lines in the code above must be changed:

la <- lars(X,Y,intercept=TRUE, max.steps=1000, use.Gram=FALSE)

to

la <- lars(X,Y,intercept=TRUE, normalize=FALSE, max.steps=1000, use.Gram=FALSE)

and

glm2 <- glmnet(X,Y,family="gaussian",lambda=0.5*la$lambda,thresh=1e-16)

to

glm2 <- glmnet(X,Y,family="gaussian",lambda=1/nbSamples*la$lambda,standardize=FALSE,thresh=1e-16)
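The factor `1/nbSamples` in the glmnet call can be checked by writing the two objectives side by side. This is a sketch that assumes the problem stated in the question is glmnet's usual Gaussian lasso objective, writes $N$ for `nbSamples`, and leaves the intercept out for brevity:

```latex
\begin{aligned}
\text{glmnet:}\quad & \min_{\beta}\; \frac{1}{2N}\,\lVert Y - X\beta \rVert_2^2 + \lambda_{\text{glmnet}}\,\lVert \beta \rVert_1 \\
\text{lars:}\quad   & \min_{\beta}\; \frac{1}{2}\,\lVert Y - X\beta \rVert_2^2 + \lambda_{\text{lars}}\,\lVert \beta \rVert_1 \\
\text{lars divided by } N\text{:}\quad & \min_{\beta}\; \frac{1}{2N}\,\lVert Y - X\beta \rVert_2^2 + \frac{\lambda_{\text{lars}}}{N}\,\lVert \beta \rVert_1
\end{aligned}
```

Dividing an objective by the positive constant $N$ does not move its minimizer, so the two problems have the same solution when $\lambda_{\text{glmnet}} = \lambda_{\text{lars}} / N$, which is the rescaling used in the corrected call.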

Solved – Random forest – binary classification vs. regression

Because of the class imbalance, you should look at the class probabilities that your forest outputs. I'm not familiar with the random forest R package, but I think there is an option (type="prob") in the predict function that will give you a matrix of class probabilities.

I believe the next thing to do with these probabilities is to derive a ROC curve and see whether it performs better than the majority vote. If it does, you should consider a 'soft' voting approach instead of a 'majority' voting one, optimising the threshold (based on the ROC curve) that determines the predicted class, which is straightforward in the binary case.
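A rough base-R sketch of that threshold search follows. The labels and probabilities are simulated placeholders for what the forest would return, and maximising Youden's J (TPR minus FPR) is only one possible criterion, chosen here for illustration rather than taken from the answer.

```r
# Simulated, imbalanced labels and hypothetical class-1 probabilities.
# With a real forest, prob would be the positive-class column of the
# matrix returned by predict(..., type = "prob").
set.seed(42)
n <- 500
truth <- rbinom(n, 1, 0.1)
prob <- plogis(-2.5 + 2 * truth + rnorm(n))

# One ROC point (TPR, FPR) per candidate threshold
thresholds <- sort(unique(prob))
roc <- t(sapply(thresholds, function(th) {
    pred <- prob >= th
    c(threshold = th,
      tpr = sum(pred & truth == 1) / sum(truth == 1),
      fpr = sum(pred & truth == 0) / sum(truth == 0))
}))

# Assumed criterion: pick the threshold that maximises Youden's J
youden <- roc[, "tpr"] - roc[, "fpr"]
best <- roc[which.max(youden), ]
print(best)

# Majority vote corresponds to a fixed threshold of 0.5
majority <- prob >= 0.5
cat("TPR at 0.5:", sum(majority & truth == 1) / sum(truth == 1), "\n")
```

Comparing the true positive rate at the chosen threshold with the one at 0.5 shows how much the minority class gains from moving away from majority voting. The threshold should be chosen on held-out or out-of-bag predictions rather than on the training fit.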
