Solved – What to do with many multivariate outliers

multivariate analysis, outliers, regression

I'm doing a multiple regression with 5 continuous predictors and 1 continuous outcome variable. I've already removed a small handful of univariate outliers (n = 5), leaving my total sample size at N = 95.

However, when I run my regression, I identify many multivariate outliers that exceed the Mahalanobis distance criterion. Specifically, 11 cases have a Mahalanobis distance above the cut-off of 11.07 (the .05 critical value with 5 predictors). I've checked the data and can't see any errors, and none of these cases deviates severely on any single variable. What should I do? Surely I can't delete over 10% of my data?
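The cut-off of 11.07 quoted above corresponds to the 0.95 quantile of a chi-squared distribution with 5 degrees of freedom. A minimal sketch of that check follows, assuming simulated stand-in data; the matrix `X` is hypothetical and is not the data from the question.

```r
# Hypothetical stand-in for the 95 x 5 predictor matrix (not the real data)
set.seed(1)
X <- matrix(rnorm(95 * 5), nrow = 95, ncol = 5)

# Cut-off: 0.95 quantile of chi-squared with df = number of predictors
cutoff <- qchisq(0.95, df = ncol(X))   # about 11.07 for 5 predictors

# Squared Mahalanobis distance of each case from the sample centroid
d2 <- mahalanobis(X, center = colMeans(X), cov = cov(X))

# Cases flagged at this cut-off
flagged <- which(d2 > cutoff)
length(flagged)
```

If the predictors are roughly multivariate normal, about 5% of cases are expected to exceed a .05 cut-off by chance alone, so a few flagged cases in a sample of 95 are unsurprising.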

Best Answer

You probably shouldn't have deleted any observations, certainly not simply because they were outliers. Instead, you can either use a method that is OK with outliers (e.g. quantile regression, robust regression, tree models) or transform the variables (if that is sensible in your case).
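As one illustration of the "method that is OK with outliers" option, here is a minimal sketch of a robust regression next to ordinary least squares. It assumes simulated data and uses `rlm` from the MASS package as one possible robust fitter; the variable names and the contamination scheme are made up for the example.

```r
library(MASS)  # ships with R; provides rlm()

# Hypothetical data: a linear trend with a few gross outliers in y
set.seed(2)
n <- 95
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n, sd = 0.5)
y[1:8] <- y[1:8] + 15                  # contaminate 8 of the responses

fit.ols <- lm(y ~ x)                   # ordinary least squares
fit.mm  <- rlm(y ~ x, method = "MM")   # MM-estimator: downweights outliers

# Compare estimates with the true values (intercept 1, slope 2)
print(rbind(ols = coef(fit.ols), robust = coef(fit.mm)))

# Final weights: values near 0 mark cases the fit treated as outlying
round(fit.mm$w[1:10], 2)
```

No observation is deleted here; the contaminated cases stay in the data and simply receive little or no weight in the robust fit.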

Related Solutions

Outliers Detection – Best Methods to Identify Outliers in Multivariate Data

Have a look at the mvoutlier package, which relies on ordered robust Mahalanobis distances, as suggested by @drknexus.
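The idea of ordered robust Mahalanobis distances can be sketched without mvoutlier itself. The example below is not that package's API: it assumes simulated data and swaps in `cov.rob` from MASS to get a minimum covariance determinant (MCD) estimate of location and scatter, then computes distances from that robust centre. The 0.975 quantile used as a cut-off is an illustrative choice.

```r
library(MASS)  # ships with R; cov.rob() gives robust location/scatter estimates

# Hypothetical data: 95 clean cases plus 5 shifted ones, in 5 dimensions
set.seed(3)
X <- rbind(matrix(rnorm(95 * 5), ncol = 5),
           matrix(rnorm(5 * 5, mean = 6), ncol = 5))

# Minimum covariance determinant (MCD) estimate of centre and scatter
mcd <- cov.rob(X, method = "mcd")

# Squared distances: classical (mean/covariance) versus robust (MCD)
d2.classical <- mahalanobis(X, colMeans(X), cov(X))
d2.robust    <- mahalanobis(X, mcd$center, mcd$cov)

# Order cases by robust distance and flag those beyond a chi-squared quantile
cutoff <- qchisq(0.975, df = ncol(X))
head(order(d2.robust, decreasing = TRUE), 10)
which(d2.robust > cutoff)
```

Because the MCD estimate is computed from the most concentrated part of the data, the outliers cannot inflate the scatter estimate and hide themselves, which is the usual weakness of classical Mahalanobis distances.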

Solved – Data cleansing in regression analysis

Use a robust fit, such as lmrob in the robustbase package. This particular one can automatically detect and downweight up to 50% of the data if they appear to be outlying.

To see what can be accomplished, let's simulate a nasty dataset with plenty of outliers in both the $x$ and $y$ variables:

library(robustbase)
set.seed(17)
n.points <- 17520
n.x.outliers <- 500
n.y.outliers <- 500
beta <- c(50, .3, -.05)
x <- rnorm(n.points)
y <- beta[1] + beta[2]*x + beta[3]*x^2 + rnorm(n.points, sd=0.5)
y[1:n.y.outliers] <- rnorm(n.y.outliers, sd=5) + y[1:n.y.outliers]
x[sample(1:n.points, n.x.outliers)] <- rnorm(n.x.outliers, sd=10)

Most of the $x$ values should lie between $-4$ and $4$, but there are some extreme outliers:

(Figure: raw data scatterplot)

Let's compare ordinary least squares (lm) to the robust coefficients:

summary(fit<-lm(y ~ 1 + x + I(x^2)))
summary(fit.rob<-lmrob(y ~ 1 + x + I(x^2)))

lm reports fitted coefficients of $49.94$, $0.00805$, and $0.000479$, compared to the expected values of $50$, $0.3$, and $-0.05$. lmrob reports $49.97$, $0.274$, and $-0.0229$, respectively. Neither of them estimates the quadratic term accurately (because it makes a small contribution and is swamped by the noise), but lmrob comes up with a reasonable estimate of the linear term while lm doesn't even come close.

Let's take a closer look:

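i <- abs(x) < 10                                 # Window the data from x = -10 to 10
w <- weights(fit.rob, type="robustness")[i]      # Extract the robust weights (each between 0 and 1)
plot(x[i], y[i], pch=".", cex=4, col=hsv((w + 1/4)*4/5, w/3+2/3, 0.8*(1-w/2)), 
     main="Least-squares and robust fits", xlab="x", ylab="y")
x.grid <- data.frame(x=seq(-10, 10, length.out=201))       # Grid for drawing the fitted curves
lines(x.grid$x, predict(fit, x.grid), lty=2, lwd=2)        # Least-squares fit (dashed)
lines(x.grid$x, predict(fit.rob, x.grid), lwd=2)           # Robust fit (solid)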

(Figure: scatterplot with fits)

lmrob reports weights for the data. Here, in this zoomed-in plot, the weights are shown by color: light greens for highly downweighted values, dark maroons for values with full weights. Clearly the lm fit is poor: the $x$ outliers have too much influence. Although its quadratic term is a poor estimate, the lmrob fit nevertheless closely follows the correct curve throughout the range of the good data ($x$ between $-4$ and $4$).
