Solved – Why does Lasso do better than SVM

cross-validation, lasso, random forest, rms, svm

I have been evaluating various regression techniques on a regression dataset. I am surprised that the cross-validated RMSE of Lasso is better than that of SVM and Random Forest in my case.

Can this happen? I believed that a non-linear modelling technique like random forest or SVM would do better than a linear model like Lasso.

Is that really possible!?

Best Answer

There is no perfect algorithm. I believe Loess, at least as implemented in R, is limited to ~4 features. Given so few features, the overhead of RandomForests or SVM-regression is likely wasted. It might be that the intrinsic scaling of the data is important and the RandomForest loses that in its trees. For the SVM, the problem could easily be the difficulty of tuning it properly or choosing the right kernel. If the relationship is simple enough, you don't need to expand into the faux-infinite dimensions of kernel space to understand it.

Having said that, the fact that Loess does better on this particular training set under cross-validation doesn't mean it will always be better. All models are just approximations.
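To make the "simple enough relationship" point concrete, here is a minimal sketch on synthetic data where the true signal is linear. Everything in it is an assumption made for illustration: the data are simulated, the lasso is a small hand-written coordinate-descent fit, and k-nearest-neighbour regression stands in for a flexible nonlinear learner (it is not an SVM or a random forest). It only shows that a linear model can win under cross-validation when the data really are linear.

```python
import math
import random

def soft_threshold(z, t):
    """Shrink z toward zero by t (the thresholding step used by the lasso)."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def fit_lasso(X, y, lam=0.05, n_sweeps=50):
    """Lasso via cyclic coordinate descent; returns (intercept, coefficients)."""
    n, p = len(X), len(X[0])
    b0, b = sum(y) / n, [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            # Residuals of the current fit with feature j's contribution left out.
            r = [yi - b0 - sum(b[k] * xi[k] for k in range(p) if k != j)
                 for xi, yi in zip(X, y)]
            rho = sum(xi[j] * ri for xi, ri in zip(X, r)) / n
            scale = sum(xi[j] ** 2 for xi in X) / n
            b[j] = soft_threshold(rho, lam) / scale
        # Re-centre the intercept on the residuals of the updated coefficients.
        b0 = sum(yi - sum(bk * xk for bk, xk in zip(b, xi))
                 for xi, yi in zip(X, y)) / n
    return b0, b

def lasso_predictor(X_tr, y_tr):
    b0, b = fit_lasso(X_tr, y_tr)
    return lambda x: b0 + sum(bk * xk for bk, xk in zip(b, x))

def knn_predictor(X_tr, y_tr, k=5):
    """k-NN regression: average the targets of the k closest training points."""
    def predict(x):
        nearest = sorted(range(len(X_tr)), key=lambda i: math.dist(X_tr[i], x))[:k]
        return sum(y_tr[i] for i in nearest) / k
    return predict

def cv_rmse(X, y, make_predictor, n_folds=5, seed=0):
    """K-fold cross-validated RMSE, pooled over all held-out points."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    sq_err = 0.0
    for f in range(n_folds):
        test = set(idx[f::n_folds])
        X_tr = [X[i] for i in idx if i not in test]
        y_tr = [y[i] for i in idx if i not in test]
        predict = make_predictor(X_tr, y_tr)
        sq_err += sum((y[i] - predict(X[i])) ** 2 for i in test)
    return math.sqrt(sq_err / len(X))

# Simulated data (an assumption): 5 features, 3 of them relevant, linear signal.
rng = random.Random(42)
X = [[rng.gauss(0, 1) for _ in range(5)] for _ in range(200)]
y = [3 * x[0] - 2 * x[1] + 1.5 * x[2] + rng.gauss(0, 1) for x in X]

lasso_rmse = cv_rmse(X, y, lasso_predictor)
knn_rmse = cv_rmse(X, y, knn_predictor)
print(f"lasso CV RMSE: {lasso_rmse:.3f}")
print(f"k-NN  CV RMSE: {knn_rmse:.3f}")
```

Both models are scored on the same folds, so the comparison is like for like. On data with a strongly nonlinear signal the ranking could easily reverse.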

Related Solutions

Solved – Using LASSO on random forest

This sounds somewhat like gradient tree boosting. The idea of boosting is to find the best linear combination of a class of models. If we fit a tree to the data, we are trying to find the tree that best explains the outcome variable. If we instead use boosting, we are trying to find the best linear combination of trees.

With boosting we are a little more efficient, however: instead of keeping a collection of random trees, we build each new tree to work on the examples we cannot yet predict well.
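The idea of building each new tree on what is still poorly predicted can be sketched in a few lines. This is a toy version under stated assumptions, not the algorithm of any particular library: squared-error loss, one-dimensional inputs, depth-one trees (stumps), a fixed learning rate, and simulated data. The names fit_stump and boost are made up for the illustration.

```python
import math
import random

def fit_stump(x, r):
    """Best single-split regression stump for targets r on 1-D inputs x."""
    best = None
    # Candidate thresholds; the largest value is skipped so the right side is never empty.
    for t in sorted(set(x))[:-1]:
        left = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((ri - lm) ** 2 for ri in left) + sum((ri - rm) ** 2 for ri in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda v: lm if v <= t else rm

def boost(x, y, n_trees=50, lr=0.1):
    """Fit stumps one after another, each to the residuals of the current model."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps, mse_history = [], []
    for _ in range(n_trees):
        resid = [yi - pi for yi, pi in zip(y, pred)]  # what is not yet predicted well
        stump = fit_stump(x, resid)
        stumps.append(stump)
        pred = [pi + lr * stump(xi) for xi, pi in zip(x, pred)]
        mse_history.append(sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / len(y))
    # The final model is a weighted sum (a linear combination) of the stumps.
    model = lambda v: base + lr * sum(s(v) for s in stumps)
    return model, mse_history

# Simulated data (an assumption): a noisy sine curve.
rng = random.Random(0)
x = [rng.uniform(0, 6) for _ in range(100)]
y = [math.sin(xi) + rng.gauss(0, 0.1) for xi in x]

model, mse_history = boost(x, y)
print(f"training MSE after  1 stump:  {mse_history[0]:.4f}")
print(f"training MSE after 50 stumps: {mse_history[-1]:.4f}")
```

Each stump is fit to the residuals left by the stumps before it, which is the sense in which new trees focus on the examples the current model gets wrong.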

For more on this, I'd suggest reading chapter 10 of Elements of Statistical Learning: http://statweb.stanford.edu/~tibs/ElemStatLearn/

While this isn't a complete answer to your question, I hope it helps.

Solved – What are RMSE SD and Rsquared SD metrics in resampling results using R package:caret

They are the standard deviations of the per-resample RMSE and R-squared values:

> lmFit$resample
          RMSE  Rsquared Resample
    1 4.702857 0.7283872    Fold1
    2 5.266187 0.6838433    Fold2
> apply(lmFit$resample[, 1:2], 2, sd)
      RMSE   Rsquared 
0.39833479 0.03149727 
> lmFit
Linear Regression 

506 samples
 13 predictors

No pre-processing
Resampling: Cross-Validated (2 fold) 

Summary of sample sizes: 253, 253 

Resampling results

  RMSE  Rsquared  RMSE SD  Rsquared SD
  4.98  0.706     0.398    0.0315

Max
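As a quick cross-check of the output above, the same numbers can be reproduced outside R. This is a small sketch in plain Python using only the two fold values printed in the resample table. statistics.stdev is the sample standard deviation (n - 1 denominator), which is also what R's sd computes.

```python
import statistics

# Per-fold values copied from the lmFit$resample output above.
rmse = [4.702857, 5.266187]
rsquared = [0.7283872, 0.6838433]

rmse_mean = statistics.mean(rmse)
rsq_mean = statistics.mean(rsquared)
rmse_sd = statistics.stdev(rmse)      # corresponds to "RMSE SD"
rsq_sd = statistics.stdev(rsquared)   # corresponds to "Rsquared SD"

print(f"RMSE {rmse_mean:.2f}  Rsquared {rsq_mean:.3f}  "
      f"RMSE SD {rmse_sd:.3f}  Rsquared SD {rsq_sd:.4f}")
```

The means of the two folds give the RMSE and Rsquared columns, and their standard deviations give the two SD columns.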
