Solved – How to detect outliers from a residual plot

outliers, r, residuals

I have the following residual plot. Can I detect outliers from a residual plot?
I want to remove 200 outliers from my data set, but I do not know how to do that in R.

Residual plots: [image]

Scatter plots: [image]

Best Answer

In general, you can define outliers in different ways, depending on what exactly you are trying to achieve. For example, the presence of observations with very high leverage does not necessarily mean that they are affecting the regression at all. On the other hand, observations with a high Cook's distance certainly do affect it. Some observations will have both. Large Studentized residuals flag observations that the model fits poorly, and a pattern in them can indicate heteroscedasticity. Here's an illustration of how you can identify and inspect each of these against your original data and the fitted regression line.

Create a dummy data set and fit a linear regression model:

    set.seed(11)
    df <- data.frame(x = rnorm(200), y = rnorm(200, 10, 5))
    fit <- lm(y ~ x, data = df)
    # summary(fit)
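
The three measures discussed above can also be computed directly with base R, without any extra package. The following is a minimal sketch, assuming the same dummy data; the data frame name `infl` is made up for illustration. The last line checks the standard identity that ties Cook's distance to the (internally) Studentized residual and the hat value, which is why a point needs both a sizeable residual and some leverage to be influential.

```r
# Same dummy data and model as above, repeated so the block runs on its own
set.seed(11)
df <- data.frame(x = rnorm(200), y = rnorm(200, 10, 5))
fit <- lm(y ~ x, data = df)

# One row per observation: leverage, Studentized residual, Cook's distance
infl <- data.frame(
  hat     = hatvalues(fit),
  studres = rstudent(fit),
  cookd   = cooks.distance(fit)
)
head(infl)

# Cook's distance combines residual size and leverage:
# D_i = r_i^2 / p * h_i / (1 - h_i), with r_i the internally Studentized residual
p <- length(coef(fit))
h <- hatvalues(fit)
all.equal(cooks.distance(fit), rstandard(fit)^2 / p * h / (1 - h))
```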

We will use influencePlot from the car package to identify outliers and plot them, where

  1. the x axis shows the hat values (leverage)
  2. the y axis shows the Studentized residuals
  3. each circle represents an observation, with area proportional to its Cook's distance

    library(car)
    (outs <- influencePlot(fit))
    #        StudRes         Hat      CookD
    # 62  -2.3075152 0.035229039 0.30844382
    # 73   2.7848421 0.008209828 0.17618044
    # 196  0.5258255 0.047410106 0.08310058
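
If car is not available, roughly the same picture can be drawn with base R graphics. This is only a sketch of the idea, not a reimplementation of influencePlot: it does not label points, and the dashed reference lines at ±2 are an assumed rule of thumb rather than something taken from the answer.

```r
# Same dummy data and model as above, repeated so the block runs on its own
set.seed(11)
df <- data.frame(x = rnorm(200), y = rnorm(200, 10, 5))
fit <- lm(y ~ x, data = df)

h  <- hatvalues(fit)        # leverage, x axis
rs <- rstudent(fit)         # Studentized residuals, y axis
cd <- cooks.distance(fit)   # Cook's distance, circle size

# symbols() takes circle radii, so sqrt(cd) makes the areas proportional to cd
symbols(h, rs, circles = sqrt(cd), inches = 0.25,
        xlab = "Hat values", ylab = "Studentized residuals")
abline(h = c(-2, 0, 2), lty = 2)  # assumed reference lines at -2, 0 and 2
```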
    


Now we can get the row names of, for example, the 2 highest values of each measure:

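    n <- 2
    # Row names of the n largest values of each measure among the flagged points
    Cooksdist <- as.numeric(tail(row.names(outs[order(outs$CookD), ]), n))
    Lev <- as.numeric(tail(row.names(outs[order(outs$Hat), ]), n))
    # Order by absolute value so large negative residuals are not missed
    StdRes <- as.numeric(tail(row.names(outs[order(abs(outs$StudRes)), ]), n))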
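
The selection above only ranks the handful of points that influencePlot chose to report. As a hedged alternative, the sketch below ranks all 200 observations using base R functions; `top_n` is a hypothetical helper written for this example, and the returned values are positional indices, which coincide with the row names here because the rows are numbered 1 to 200.

```r
# Same dummy data and model as above, repeated so the block runs on its own
set.seed(11)
df <- data.frame(x = rnorm(200), y = rnorm(200, 10, 5))
fit <- lm(y ~ x, data = df)

n <- 2
# Hypothetical helper: positions of the n largest values of v
top_n <- function(v, n) order(v, decreasing = TRUE)[seq_len(n)]

top_cook    <- top_n(cooks.distance(fit), n)   # most influential
top_lev     <- top_n(hatvalues(fit), n)        # highest leverage
top_studres <- top_n(abs(rstudent(fit)), n)    # most extreme residuals

df[unique(c(top_cook, top_lev, top_studres)), ]
```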

Then plot them over the fitted regression line:

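    # Row indices of all points flagged by influencePlot
    idx <- as.numeric(row.names(outs))

    plot(df$x, df$y)
    abline(fit, col = "blue")  # fitted regression line
    # Red squares: highest Cook's distance
    points(df$x[Cooksdist], df$y[Cooksdist], col = "red", pch = 0, lwd = 15)
    # Blue triangles: highest leverage (hat values)
    points(df$x[Lev], df$y[Lev], col = "blue", pch = 25, lwd = 8)
    # Green dots: selected Studentized residuals
    points(df$x[StdRes], df$y[StdRes], col = "green", pch = 20, lwd = 5)
    # Label each flagged point with its y value
    text(df$x[idx], df$y[idx], labels = round(df$y[idx], 3), pos = 1)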


You can see that some of the flagged points overlap across the measures. The high-leverage points (the blue triangles) sometimes affect the regression line and sometimes lie almost on it, whereas the points with a high Cook's distance (the red squares) always have a strong effect on the regression line.
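
As for actually removing points in R, one possible approach is to flag observations by a cutoff on Cook's distance and refit without them. The sketch below assumes the commonly quoted 4/n rule of thumb, which is a convention rather than a formal test and is not part of the answer above; the names `flag`, `df_clean` and `fit_clean` are made up for illustration. Comparing the two sets of coefficients shows how much the flagged rows were driving the fit.

```r
# Same dummy data and model as above, repeated so the block runs on its own
set.seed(11)
df <- data.frame(x = rnorm(200), y = rnorm(200, 10, 5))
fit <- lm(y ~ x, data = df)

n_obs <- nrow(df)
# Assumed rule-of-thumb cutoff: Cook's distance above 4 / n
flag <- cooks.distance(fit) > 4 / n_obs

# Drop the flagged rows and refit the same model
df_clean  <- df[!flag, ]
fit_clean <- lm(y ~ x, data = df_clean)

# Compare coefficients with and without the flagged rows
rbind(all = coef(fit), without_flagged = coef(fit_clean))
```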