Random forest and LASSO regression give different variable importances

Tags: lasso, random forest, sparse

I have a dataset with 163 observations (all countries in the world with a population > 1,000,000) and 290 variables related to their disease burden and performance. Because there are more variables than observations, I cannot run a standard linear regression, so I tried both a random forest and a LASSO regression. The two methods give different variable importances. Which one is more reliable in this case?

Best Answer

Before going deeper into the comparison, make sure that each of the two methods agrees with itself, i.e. that its variable importances are stable when the data change slightly. You can check this by bootstrapping the entire variable importance process a few times. For each variable, plot the original importance against the importance estimated from a bootstrap sample.

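The bootstrap involves taking samples of size $n$ with replacement from the original dataset of $n$ observations and repeating the entire analysis on each sample. The repetitions have to be "from scratch": every data-dependent step, including any tuning or variable selection, is redone on each bootstrap sample. Here is what the process looks like in R: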

n <- NROW(mydata)   # mydata = data table, data frame, or matrix
for (i in 1:5) {
    s <- sample(1:n, n, replace = TRUE)   # bootstrap row indices
    # whateveranalysis() is a placeholder for the full importance pipeline
    f <- whateveranalysis(mydata[s, ])
    # Print what you need and look across the 5 bootstraps to
    # see the volatility
}
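
As a rough illustration of the loop above, the sketch below fills in `whateveranalysis` with a deliberately simple stand-in and runs it on simulated data. Everything in it is an assumption made for illustration, not part of the question: the synthetic `mydata`, the outcome name `y`, and the hypothetical `importance_fn`, which uses the absolute correlation of each predictor with the outcome. In practice you would replace `importance_fn` with your full LASSO or random forest pipeline, tuning included.

```r
set.seed(1)  # synthetic data only; replace with your own mydata

# Simulated stand-in for mydata: 163 rows, 10 predictors, outcome y
n <- 163
p <- 10
x <- matrix(rnorm(n * p), nrow = n,
            dimnames = list(NULL, paste0("x", 1:p)))
y <- 2 * x[, 1] - 1.5 * x[, 2] + rnorm(n)
mydata <- data.frame(y = y, x)

# Hypothetical importance measure: |correlation| of each predictor with y.
# Swap in the complete LASSO or random forest analysis here.
importance_fn <- function(d) {
  preds <- setdiff(names(d), "y")
  abs(cor(d[, preds, drop = FALSE], d$y))[, 1]
}

orig <- importance_fn(mydata)   # importance on the original data

B <- 5                          # number of bootstrap repetitions
boot <- sapply(seq_len(B), function(b) {
  s <- sample.int(n, n, replace = TRUE)     # rows drawn with replacement
  importance_fn(mydata[s, , drop = FALSE])  # redo the analysis from scratch
})
# boot is a p x B matrix: rows = variables, columns = repetitions

# Original importance vs. bootstrap importance, one point per
# variable per repetition; points near the dashed line are stable
matplot(orig, boot, pch = 1, col = 1,
        xlab = "Importance, original data",
        ylab = "Importance, bootstrap sample")
abline(0, 1, lty = 2)

# Spearman rank agreement between the original and each repetition
rank_agreement <- apply(boot, 2, function(b) {
  cor(orig, b, method = "spearman")
})
print(round(rank_agreement, 2))
```

If the points scatter widely around the dashed line, or the rank correlations are low, the importances from that method are too volatile to compare meaningfully with those of the other method.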