Solved – XGBoost (Extreme Gradient Boosting) or Elastic Net More Robust to Outliers

boosting, cart, glmnet, outliers, predictive-models

I have recently been doing work with predictive models for a continuous response. I am comparing the Elastic Net (glmnet) package in R and the XGBoost (xgboost) package in R. Originally, I built the model using Elastic Net for its ability to perform feature selection and to shrink the coefficients of correlated variables.

I am exploring XGBoost because of its predictive capabilities, the summary of feature importance it provides, its ability to capture non-linear interactions, and because I believe it might be more robust in the presence of outliers.

My questions are:

  1. Is XGBoost or gradient boosted trees in general better at finding non-linear interactions than a generalized linear model?

  2. Is my assumption about XGBoost or gradient boosted trees in general being robust to outliers a fair assumption?

Here is my model set up and finding:

For model validation I have a training and testing set. I $\log$-transform the response variable before model fitting. I make predictions on the testing set and then exponentiate the results to return to the original scale. I make predicted vs. observed plots for each model.
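The workflow above can be sketched as follows. This is a minimal illustration in Python using scikit-learn's `ElasticNet` as a stand-in for R's glmnet (an assumption; the data and penalty value are synthetic, but the fit-on-log-scale / exponentiate-predictions pattern is the same as in the question):

```python
import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
# Positive, right-skewed response, so a log transform is natural
y = np.exp(X @ np.array([0.5, -0.3, 0.2, 0.0, 0.1])
           + rng.normal(scale=0.1, size=200))

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = ElasticNet(alpha=0.01)
model.fit(X_train, np.log(y_train))   # fit on the log scale
pred = np.exp(model.predict(X_test))  # exponentiate back to the original scale

print(pred[:3])
```

Note that exponentiating a log-scale prediction amplifies any error on the log scale, which is one reason a single badly predicted case can look extreme after back-transformation.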

XGBoost Predicted Vs. Observed Plot

The exponentiated predicted values have some outliers but the fit in general is good.

Elastic Net Predicted Vs. Observed Plot

With the elastic net model, when I convert back to the original scale there is an extreme predicted value. I interpret this as glmnet having a few cases (outliers) that it is not quite sure how to predict.

I would love to hear opinions! Thank you in advance for any help or comments!

Best Answer

  • 1 Yes, boosted trees fit unknown non-linear effects and interactions more easily than regularized linear regression. However, as soon as you are aware of a specific non-linearity, you can simply transform the data to linearity and continue to use a linear learner.
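The "transform to linearity" point can be shown with a toy example. Here the response depends on $x^2$: a linear model on the raw feature fails, while the same model on the transformed feature fits almost perfectly (synthetic data; the quadratic relationship is an illustrative assumption):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=300)
y = 2.0 * x**2 + rng.normal(scale=0.1, size=300)  # known non-linearity

raw = LinearRegression().fit(x.reshape(-1, 1), y)
transformed = LinearRegression().fit((x**2).reshape(-1, 1), y)

print(raw.score(x.reshape(-1, 1), y))               # near-zero R^2 on raw x
print(transformed.score((x**2).reshape(-1, 1), y))  # near-perfect after transform
```

A tree ensemble would recover this shape without the transform, but at the cost of more tuning and less interpretable coefficients.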

  • 2 That depends on how you train the models. If you're new to boosted trees, check out some tutorials on how to avoid overfitting. I cannot see from your plots what kind of cross-validation was used. Use a thorough outer cross-validation and perhaps compare the results to a random forest model. RF models are much easier to handle, and default settings are often near optimal. A crude rule of thumb: if your RF performs better than boosted trees (measured by outer cross-validation), you have either chosen sub-optimal training parameters for your boosted trees model or your data is quite noisy.
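The suggested sanity check could look like this. It is a hedged sketch in Python using scikit-learn's `GradientBoostingRegressor` in place of xgboost (an assumption), with both models scored under the same outer cross-validation on a synthetic benchmark:

```python
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic non-linear regression problem (illustrative choice)
X, y = make_friedman1(n_samples=400, noise=1.0, random_state=0)

rf = RandomForestRegressor(random_state=0)        # near-default settings
gbm = GradientBoostingRegressor(random_state=0)   # default boosting parameters

# Same outer cross-validation for both models
rf_r2 = cross_val_score(rf, X, y, cv=5, scoring="r2").mean()
gbm_r2 = cross_val_score(gbm, X, y, cv=5, scoring="r2").mean()

print(f"RF  mean R^2: {rf_r2:.3f}")
print(f"GBM mean R^2: {gbm_r2:.3f}")
# If RF clearly beats the boosted model here, revisit the boosting parameters.
```

For an honest comparison, any hyperparameter tuning of the boosted model should happen inside an inner loop, nested within this outer cross-validation.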
