Solved – Decision trees, Gradient boosting and normality of predictors

boosting, descriptive statistics, extreme value, normal distribution, outliers

I have a question regarding the normality of predictors. I have 100,000 observations in my data. The problem I am analysing is a classification problem: about 5% of the observations (5,000) are assigned to class 1 and 95,000 are assigned to class 0, so the data is highly imbalanced. However, the class 1 observations are expected to have extreme values.

  • What I have done is trim the top 1% and bottom 1% of the data (removing any possible mistakes in the entry of such data).
  • I then winsorised the data at the 5% and 95% levels (which I have checked is an accepted practice when dealing with the kind of data I have).
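The two preprocessing steps above can be sketched with NumPy; the percentile cut-offs (1% trim, 5%/95% winsorisation) follow the question, while the heavy-tailed toy data and variable names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_t(df=3, size=100_000)  # heavy-tailed toy data

# Trimming: *drop* observations outside the 1st-99th percentile range
lo, hi = np.percentile(x, [1, 99])
x_trimmed = x[(x >= lo) & (x <= hi)]

# Winsorising: *clip* (rather than drop) values at the 5th/95th percentiles
p5, p95 = np.percentile(x_trimmed, [5, 95])
x_winsorised = np.clip(x_trimmed, p5, p95)

print(len(x), len(x_trimmed))  # trimming removes roughly 2% of the rows
print(x_winsorised.min(), x_winsorised.max())
```

Note the difference: trimming deletes rows, while winsorising keeps every row but caps the extreme values, which is what produces the spikes at the cut-off points in the density plots.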

So:
I plot a density plot of one variable before any outlier manipulation:
[density plot]

Here is the same variable after trimming the data at the 1% level:
[density plot after 1% trimming]

Here is the variable after being trimmed and then winsorised:
[density plot after trimming and winsorisation]

My question is how I should approach this problem.

First question: should I just leave the data alone after trimming it, or should I continue to winsorise, further condensing the extreme values into more meaningful ones (since even after trimming the data I am still left with what I feel are extreme values)? If I leave the data after trimming it, I am left with long tails in the distribution like the following (however, the observations I am trying to classify mostly fall at the tail end of these plots).
[density plot with long tails]

Second question: since decision trees and gradient boosted trees decide on splits, does the distribution matter? What I mean is, if the tree splits on a variable at (using the plots above) <= -10, then according to plot 2 (after trimming) and plot 3 (after winsorisation) all firms with values <= -10 will be classified as class 1.

Consider the decision tree I created below.

[decision tree diagram]

My argument is that, regardless of the spikes in the data (created by winsorisation), the decision tree will make the classification split at all observations <= 0. So the distribution of that variable should not matter in making the split; it only affects the value at which that split occurs, and I do not lose much predictive power in these tails?

Best Answer

Second question. Yes, algorithms based on decision trees are completely insensitive to the specific values of predictors; they react only to their order. This means that you don't have to worry about the "non-normality" of your predictors. Moreover, you can apply any strictly monotonic transformation to your data if you want - it will not change the predictions of decision trees at all!
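This invariance is easy to check empirically. A minimal sketch, assuming scikit-learn is available (the synthetic data and the cube transform are illustrative; any strictly increasing transform would do):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 1).astype(int)

# Fit a tree on the raw predictors
tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
pred = tree.predict(X)

# Apply a strictly increasing transform (x -> x^3) to every predictor,
# then refit with the same settings
X_t = X ** 3
tree_t = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_t, y)
pred_t = tree_t.predict(X_t)

# The split thresholds differ numerically, but they partition the
# observations identically, so the predictions match
print(np.array_equal(pred, pred_t))  # expected: True
```

The thresholds chosen on the transformed data are simply the transformed thresholds; because the ordering of observations is unchanged, every split separates exactly the same points.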

First question. I feel you should leave your data alone. By trimming and winsorizing it, you discard information that might be meaningful for your classification problem.

For linear models, long tails introduce noise that may be harmful. But for decision trees it is not a problem at all.

If you are too afraid of long tails, I would suggest applying a transformation to your data that puts it on a prettier scale without distorting the order of your observations. For example, you can make the scale logarithmic by applying $$ f(x) = \text{sign}(x)\, \alpha \log(|x / \alpha|+1) $$

For small $x$ (roughly from $-\alpha$ to $\alpha$), this function is close to the identity, while large values are heavily compressed; monotonicity is strictly preserved, so no information is lost.
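A sketch of this signed-log transform in NumPy (the sample values and the default $\alpha = 1$ are illustrative):

```python
import numpy as np

def signed_log(x, alpha=1.0):
    """f(x) = sign(x) * alpha * log(|x|/alpha + 1): near-identity on
    roughly (-alpha, alpha), logarithmic compression for large |x|."""
    x = np.asarray(x, dtype=float)
    return np.sign(x) * alpha * np.log(np.abs(x) / alpha + 1.0)

x = np.array([-1000.0, -0.1, 0.0, 0.1, 1000.0])
print(signed_log(x))
# Small values pass through almost unchanged, extremes are pulled in
# dramatically, and the ordering of the observations is preserved.
```

For tree-based models this is purely cosmetic (as argued above, the predictions cannot change), but it makes density plots and split thresholds easier to read.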

How can removing extremes affect prediction quality? By removing extreme values, you can indeed prevent your model from making splits at very high or very low points. This restriction can only reduce the ability of your model to fit the training data. You have quite a large dataset (100K points is a lot, provided the data is not very high-dimensional), so I assume your model does not suffer from severe overfitting if you regularize it properly (e.g. by controlling the maximum tree size and the number of trees). If that is the case, then restricting the model from splitting at high or low points will degrade prediction quality on the test set as well.
