Why Boosting Method is Sensitive to Outliers – Exploring Machine Learning Techniques

boosting, cart, machine learning, outliers, svm

I found many articles that state that boosting methods are sensitive to outliers, but no article explaining why.

In my experience outliers are bad for any machine learning algorithm, but why are boosting methods singled out as particularly sensitive?

How would the following algorithms rank in terms of sensitivity to outliers: boosted trees, random forest, neural network, SVM, and simple regression methods such as logistic regression?

Best Answer

Outliers can be bad for boosting because boosting builds each tree on previous trees' residuals/errors. Outliers will have much larger residuals than non-outliers, so gradient boosting will focus a disproportionate amount of its attention on those points.
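To make the residual argument concrete, here is a minimal sketch in base R. It is an illustration on synthetic data, not any particular library's implementation: the data, the single outlier value of 50, and all variable names are assumptions. It computes the target that the first tree of a gradient booster would be fitted to under squared loss (the plain residuals) and compares it with the pseudo-residuals under absolute loss (the signs of the residuals).

```r
# Illustrative sketch on synthetic data; all names and values are assumptions.
set.seed(1)
n <- 200
x <- runif(n)
y <- sin(2 * pi * x) + rnorm(n, sd = 0.1)
y[1] <- 50                              # a single response outlier

# Squared loss: the stage-0 model is the mean, and the next tree is
# fitted to the residuals (the negative gradient of the loss).
res_sq <- y - mean(y)
share_sq <- res_sq[1]^2 / sum(res_sq^2)     # outlier's share of the fitting target

# Absolute loss: the stage-0 model is the median, and the pseudo-residuals
# are only the signs of the residuals, so every point carries equal weight.
res_abs <- sign(y - median(y))
share_abs <- res_abs[1]^2 / sum(res_abs^2)

print(c(squared = share_sq, absolute = share_abs))
```

Under squared loss the single outlier accounts for most of the sum of squared residuals, so the next tree is mostly spent fitting that one point. Under absolute loss its share is the same as that of any other observation.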

Related Solutions

Solved – How are Random Forests not sensitive to outliers

Your intuition is correct. This answer merely illustrates it on an example.

It is indeed a common misconception that CART/RF are somehow robust to outliers.

To illustrate the lack of robustness of RF to the presence of a single outlier, we can (lightly) modify the code used in Soren Havelund Welling's answer above to show that a single 'y'-outlier suffices to completely sway the fitted RF model. For example, if we compute the mean prediction error on the uncontaminated observations as a function of the distance between the outlier and the rest of the data, we can see (image below) that introducing a single outlier (by replacing one of the original observations with an arbitrary value in the 'y'-space) suffices to pull the predictions of the RF model arbitrarily far away from the values they would have had if computed on the original (uncontaminated) data:

    library(randomForest)
    set.seed(1)

    X <- data.frame(replicate(2, runif(2000) - .5))
    y <- -sqrt((X[, 1])^4 + (X[, 2])^4)
    X[1, ] <- c(0, 0)
    y2 <- y
    rg <- randomForest(X, y)   # RF model fitted without the outlier
    outlier <- rel_prediction_error <- rep(NA, 10)

    for (i in 1:10) {
        y2[1] <- 100 * i + 2
        rf <- randomForest(X, y2)   # RF model fitted with the outlier
        # out-of-bag error on the uncontaminated observations, relative to the clean fit
        rel_prediction_error[i] <- mean(abs(rf$predicted[-1] - y2[-1])) / mean(abs(rg$predicted[-1] - y[-1]))
        outlier[i] <- y2[1]
    }
    plot(outlier, rel_prediction_error, type = 'l', ylab = "Mean prediction error (on the uncontaminated observations)\nrelative to the fit on clean data", xlab = "Distance of the outlier")

How far? In the example above, the single outlier has changed the fit so much that the mean prediction error (on the uncontaminated observations) is now 1-2 orders of magnitude bigger than it would have been had the model been fitted on the uncontaminated data.

So it is not true that a single outlier cannot affect the RF fit.

Furthermore, as I point out elsewhere, outliers are much harder to deal with when there are potentially several of them (though they don't need to be a large proportion of the data for their effects to show up). Of course, contaminated data can contain more than one outlier; to measure the impact of several outliers on the RF fit, compare the plot on the left, obtained from the RF on the uncontaminated data, to the plot on the right, obtained by arbitrarily shifting 5% of the response values (the code is below the answer).

Finally, in the regression context, it is important to point out that outliers can stand out from the bulk of the data in both the design and response space (1). In the specific context of RF, design outliers will affect the estimation of the hyper-parameters. However, this second effect is more manifest when the number of dimensions is large.

What we observe here is a particular case of a more general result. The extreme sensitivity to outliers of multivariate data fitting methods based on convex loss functions has been rediscovered many times. See (2) for an illustration in the specific context of ML methods.
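As a small illustration of this general point, the sketch below uses ordinary least squares, the simplest fit based on a convex (squared) loss. The data and the shift sizes are assumptions chosen for illustration. Moving a single response further and further away moves the fitted slope without bound, in proportion to the size of the shift.

```r
# Illustrative sketch; synthetic data and shift sizes are assumptions.
set.seed(1)
x <- seq(0, 1, length.out = 50)
y <- 2 * x + rnorm(50, sd = 0.1)

shifts <- c(0, 10, 100, 1000)
slopes <- sapply(shifts, function(d) {
  y2 <- y
  y2[50] <- y2[50] + d              # contaminate one response only
  unname(coef(lm(y2 ~ x))[2])       # least-squares slope
})

# For least squares the slope moves linearly in the size of the shift:
# change = d * (x[50] - mean(x)) / sum((x - mean(x))^2)
slope_change_per_unit <- (x[50] - mean(x)) / sum((x - mean(x))^2)
print(data.frame(shift = shifts, slope = slopes))
```

Because the change in slope is linear in the shift, there is no bound on how far one observation can drag the fit.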

Edit.

Fortunately, while the base CART/RF algorithm is emphatically not robust to outliers, it is possible (and quite easy) to modify the procedure to make it robust to "y"-outliers. I will now focus on regression RFs (since this is more specifically the object of the OP's question). More precisely, writing the splitting criterion for an arbitrary node $t$ as:

$$s^* = \arg\min_{s} \left[ p_L \, \text{var}(t_L(s)) + p_R \, \text{var}(t_R(s)) \right]$$

where $t_L$ and $t_R$ are the emerging child nodes, which depend on the choice of split ($t_L$ and $t_R$ are implicit functions of $s$), $p_L$ denotes the fraction of data that falls into the left child node $t_L$, and $p_R = 1 - p_L$ is the share of data in $t_R$. Then, one can impart "y"-space robustness to regression trees (and thus RFs) by replacing the variance functional used in the original definition by a robust alternative. This is in essence the approach used in (4), where the variance is replaced by a robust M-estimator of scale.
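The idea can be sketched in a few lines of base R for a single split on one feature. This is only an illustration under stated assumptions, not the procedure of (4): the squared MAD (`stats::mad`) is used as a simple stand-in for the robust M-estimator of scale, the variance is taken as the mean squared deviation, and the helper names `pop_var`, `mad2`, `split_score` and `best_split` are hypothetical. The split chosen is the one with the smallest weighted child scale (equivalently, the largest reduction in impurity).

```r
# Illustrative sketch; helper names and the MAD stand-in are assumptions.
pop_var <- function(v) mean((v - mean(v))^2)   # mean squared deviation
mad2 <- function(v) mad(v)^2                   # robust stand-in: squared MAD

# Weighted child scale p_L * scale(t_L) + p_R * scale(t_R) for threshold s
split_score <- function(x, y, s, scale2) {
  left <- y[x <= s]
  right <- y[x > s]
  p_left <- length(left) / length(y)
  p_left * scale2(left) + (1 - p_left) * scale2(right)
}

# Exhaustive search over midpoints between consecutive feature values
best_split <- function(x, y, scale2) {
  xs <- sort(unique(x))
  candidates <- (xs[-1] + xs[-length(xs)]) / 2
  scores <- sapply(candidates, function(s) split_score(x, y, s, scale2))
  candidates[which.min(scores)]
}

set.seed(1)
n <- 200
x <- runif(n)
y <- ifelse(x < 0.5, 0, 1) + rnorm(n, sd = 0.05)   # true step at x = 0.5
y_out <- y
y_out[which.min(x)] <- 1000                         # a single y-outlier

split_clean <- best_split(x, y, pop_var)       # variance criterion, clean data
split_var   <- best_split(x, y_out, pop_var)   # variance criterion, one outlier
split_mad   <- best_split(x, y_out, mad2)      # robust criterion, one outlier
print(c(clean = split_clean, variance = split_var, robust = split_mad))
```

On this synthetic data the variance criterion abandons the true step and spends the split on isolating the outlier at the edge of the feature range, while the MAD-based criterion should stay close to the step at 0.5.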

(1) Rousseeuw, P. J., & van Zomeren, B. C. (1990). Unmasking multivariate outliers and leverage points. Journal of the American Statistical Association, 85(411), 633–639.
(2) Long, P. M., & Servedio, R. A. (2008). Random classification noise defeats all convex potential boosters. http://dl.acm.org/citation.cfm?id=1390233
(3) Becker, C., & Gather, U. (1999). The masking breakdown point of multivariate outlier identification rules.
(4) Galimberti, G., Pillati, M., & Soffritti, G. (2007). Robust regression trees based on M-estimators. Statistica, LXVII, 173–190.

    library(forestFloor)
    library(randomForest)
    library(rgl)
    set.seed(1)

    X <- data.frame(replicate(2, runif(2000) - .5))
    y <- -sqrt((X[, 1])^4 + (X[, 2])^4)
    Col <- fcol(X, 1:2)   # make colour palette by x1 and x2
    # insert outliers (5% of the responses) and colour them black
    y2 <- y; Col2 <- Col
    y2[1:100] <- rnorm(100, 200, 1)   # outliers
    Col[1:100] <- "#000000FF"         # black

    # plot training set
    plot3d(X[, 1], X[, 2], y, col = Col)
    rf <- randomForest(X, y)    # RF on clean data
    rg <- randomForest(X, y2)   # RF on contaminated data
    vec.plot(rg, X, 1:2, col = Col, grid.lines = 200)
    # mean absolute out-of-bag error on the uncontaminated observations
    mean(abs(rf$predicted[-c(1:100)] - y[-c(1:100)]))
    mean(abs(rg$predicted[-c(1:100)] - y2[-c(1:100)]))

Solved – State-of-the-art ensemble learning algorithm in pattern recognition tasks

State-of-the-art algorithms may differ from what is used in production in the industry. Also, the latter can invest in fine-tuning more basic (and often more interpretable) approaches to make them work better than what academics would.

Example 1: According to TechCrunch, Nuance will start using "deep learning tech" in its Dragon speech recognition products this September.

Example 2: Chiticariu, Laura, Yunyao Li, and Frederick R. Reiss. "Rule-Based Information Extraction is Dead! Long Live Rule-Based Information Extraction Systems!." In EMNLP, no. October, pp. 827-832. 2013. https://scholar.google.com/scholar?cluster=12856773132046965379&hl=en&as_sdt=0,22 ; http://www.aclweb.org/website/old_anthology/D/D13/D13-1079.pdf

With that being said:

Which of the ensemble learning algorithms is considered to be state-of-the-art nowadays?

One of the state-of-the-art systems for image classification gets some nice gain from ensembling (just like most other systems, as far as I know): He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. "Deep residual learning for image recognition." arXiv preprint arXiv:1512.03385 (2015). https://scholar.google.com/scholar?cluster=17704431389020559554&hl=en&as_sdt=0,22 ; https://arxiv.org/pdf/1512.03385v1.pdf
