Solved – How to detect outliers from a residual plot

outliers, r, residuals

I have the following residual plot. Can I detect outliers from a residual plot?
I want to remove 200 outliers from my data set, but I do not know how to do that in R.

Residual plots: [figure]

Scatter plots: [figure]

Best Answer

In general, you can define outliers differently depending on what exactly you are trying to achieve. For example, the presence of observations with very high leverage does not necessarily mean that they are affecting the regression at all. Observations with a high Cook's distance, on the other hand, certainly are. Some observations will have both. Observations with high Studentized residuals are poorly fitted by the model and can indicate heteroscedasticity. Here's an illustration of how you can identify and inspect each kind when compared with your original data and the fitted regression line.

Create a dummy data set and fit a linear regression model:

set.seed(11)
df <- data.frame(x = rnorm(200), y = rnorm(200, 10, 5))
fit <- lm(y ~ x, data = df)
# summary(fit)

We will use influencePlot from the car package to identify the outliers and plot them. In that plot:

the x axis shows the hat values (leverage)
the y axis shows the Studentized residuals

Each observation is drawn as a circle whose size is proportional to its Cook's distance.

library(car)
(outs <- influencePlot(fit))
#        StudRes         Hat      CookD
# 62  -2.3075152 0.035229039 0.30844382
# 73   2.7848421 0.008209828 0.17618044
# 196  0.5258255 0.047410106 0.08310058
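For reference, the three quantities that influencePlot reports can also be computed for every observation with base R. The following is a minimal sketch that rebuilds the same dummy data; the name infl is only an illustrative choice. Depending on the installed car version, the CookD column printed above may be on a different scale than cooks.distance(), so it is safer to compare rankings than raw numbers.

```r
# Same dummy data and model as above
set.seed(11)
df <- data.frame(x = rnorm(200), y = rnorm(200, 10, 5))
fit <- lm(y ~ x, data = df)

# One row of diagnostics per observation
infl <- data.frame(
  StudRes = rstudent(fit),        # externally Studentized residuals
  Hat     = hatvalues(fit),       # leverage (diagonal of the hat matrix)
  CookD   = cooks.distance(fit)   # Cook's distance
)

# Five observations with the largest Cook's distance
head(infl[order(infl$CookD, decreasing = TRUE), ], 5)
```

Unlike the influencePlot return value, which contains only the few points it labels, this data frame covers all 200 rows, so any "top n" selection can be made from it directly.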


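Now we can get the row names of, for example, the 2 highest values of each measure:

n <- 2
# Row names of the n largest values of each diagnostic among the labelled points
Cooksdist <- as.numeric(tail(row.names(outs[order(outs$CookD), ]), n))
Lev <- as.numeric(tail(row.names(outs[order(outs$Hat), ]), n))
# Use absolute values so that large negative residuals are picked up too
StdRes <- as.numeric(tail(row.names(outs[order(abs(outs$StudRes)), ]), n))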

And plot them over the fitted regression line

plot(df$x, df$y)
abline(fit, col = "blue")
points(df$x[Cooksdist], df$y[Cooksdist], col = "red", pch = 0, lwd = 15)
points(df$x[Lev], df$y[Lev], col = "blue", pch = 25, lwd = 8)
points(df$x[StdRes], df$y[StdRes], col = "green", pch = 20, lwd = 5)
text(df$x[as.numeric(row.names(outs))], 
     df$y[as.numeric(row.names(outs))], 
     labels = round(df$y[as.numeric(row.names(outs))], 3),
     pos = 1)


You can clearly see that some of the flagged points overlap. The high-leverage points (the blue triangles) sometimes affect the regression line and on other occasions lie almost on it, whereas the points with a high Cook's distance (the red squares) always have a large effect on the regression line.
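The question also asks how to actually remove such points in R. One possible approach, sketched below, is to flag observations with conventional rule-of-thumb cutoffs and refit the model without them. The cutoffs used here (Cook's distance above 4/n, leverage above 2p/n, absolute Studentized residual above 3) are common conventions and an assumption of this sketch, not something the answer prescribes. Whether removal is appropriate at all depends on why the points are unusual.

```r
set.seed(11)
df <- data.frame(x = rnorm(200), y = rnorm(200, 10, 5))
fit <- lm(y ~ x, data = df)

n <- nrow(df)
p <- length(coef(fit))   # number of estimated coefficients

# Rule-of-thumb cutoffs (assumed conventions, adjust to your problem)
flag_cook <- cooks.distance(fit) > 4 / n
flag_lev  <- hatvalues(fit) > 2 * p / n
flag_res  <- abs(rstudent(fit)) > 3

# How many observations each rule flags
colSums(cbind(cook = flag_cook, leverage = flag_lev, studres = flag_res))

# Drop the points flagged by Cook's distance and refit
df_clean  <- df[!flag_cook, ]
fit_clean <- lm(y ~ x, data = df_clean)

# Compare the coefficients before and after removal
rbind(original = coef(fit), cleaned = coef(fit_clean))
```

Comparing the two rows of coefficients shows how much the flagged points were driving the fit. If the estimates barely change, removing them gains little.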

Related Solutions

Solved – Removal of multi-dimensional outliers

To make a long story short, you should use a tool such as robust PCA. I may come back to this with a more substantive post, but the short version is explained in this post.

Regression – Detecting Patterns in Residual Plots for Regression Analysis

A dependent mixture model (hidden Markov model) may be of use, depending on the type of deviations expected.

Assume that your observations come from two distributions (or states), both of which are normally distributed, but have different mean and variance.

A number of parameters can be estimated: the initial state probabilities (2 parameters), the state transition probabilities between neighbouring data points (4 parameters), and finally the mean and variance of the two distributions (4 parameters).
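Written out, the two-state model described above could look like the following. The symbols ($S_t$ for the hidden state, $\pi_k$, $a_{ij}$, $\mu_k$, $\sigma_k^2$) are notation chosen here for illustration and do not come from the original answer.

```latex
\begin{aligned}
y_t \mid S_t = k &\sim \mathcal{N}(\mu_k, \sigma_k^2), \quad k \in \{1, 2\} \\
P(S_1 = k) &= \pi_k, \quad \pi_1 + \pi_2 = 1 \\
P(S_{t+1} = j \mid S_t = i) &= a_{ij}, \quad a_{i1} + a_{i2} = 1
\end{aligned}
```

Because the initial probabilities and each row of the transition matrix must sum to one, only 1 + 2 + 4 = 7 of the 10 parameters listed above are free.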

In R, this model can be estimated using the depmixS4 package:

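library(depmixS4)

# Simulate 100 standard-normal points with a shifted, noisier stretch at 30:35
set.seed(3)
y <- rnorm(100)
y[30:35] <- rnorm(6, mean = 4, sd = 2)
plot(1:100, y, type = "l")

# Two-state Gaussian model with intercept-only response, one series of length 100
m <- depmix(y ~ 1, nstates = 2, ntimes = 100)
fm <- fit(m)

# Parameters 7 and 9 are the two state means; overlay the mean of the decoded state
means <- getpars(fm)[c(7, 9)]
lines(1:100, means[fm@posterior$state], lwd = 2, col = 2)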


See http://cran.r-project.org/web/packages/depmixS4/vignettes/depmixS4.pdf for references