Solved – a mathematical way to define a point on a scatter plot as an outlier

outliers

I have a graph with two points that look like potential outliers. I'm trying to create a polynomial line of best fit of unspecified order.

I believe I could use a >2 standard deviations exclusion rule, but I'm not sure exactly how it is applied. Do I determine the qualifying standard deviation using all the points initially?

That's what I did, but I have two points that qualify for exclusion. Do I remove both at once or one at a time? If I remove only the one with the greatest deviation from the line of best fit, I can recalculate the line, which gives me a new line of best fit with a different polynomial order. Under this new fit, the other point that was previously an outlier is no longer an outlier.

What is the correct procedure?

Best Answer

Without knowing everything about your data or what your project is, it's hard to say what the "right" method is. A better way to think about it is that either method may work, but you need to be transparent about how you did it when you present your results.

If you are removing two outliers from 10,000 records, I don't think it particularly matters either way. If you are removing two outliers from 10 records, it becomes significantly more important!

In general, if you are using a 2SD method, I would say you should remove both of them at the same time - you set the exclusion criterion up front, and then you remove everything that violates it. It does not seem to me that you have any analytic justification for the other approach - why would you remove one and then recalculate?
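To make that concrete, here is a minimal sketch in Python of the all-at-once version: fit on all points, compute residuals, flag anything more than 2 standard deviations from the fit, drop every flagged point together, and refit once. The data and the fixed degree of 2 are assumptions for illustration - your "undefined order" would need its own model-selection step.

```python
import numpy as np

# Hypothetical data for illustration -- replace with your own x/y values.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 30)
y = 0.5 * x**2 - 2 * x + rng.normal(scale=2.0, size=x.size)
y[5] += 15    # two artificial outliers
y[20] -= 12

degree = 2  # assumed fixed; choose/validate the polynomial order separately

# 1. Fit once using ALL points and compute residuals from that fit.
coeffs = np.polyfit(x, y, degree)
residuals = y - np.polyval(coeffs, x)

# 2. Set the exclusion criterion from the full-data residuals:
#    anything more than 2 standard deviations from the fit is flagged.
threshold = 2 * residuals.std()
keep = np.abs(residuals) <= threshold

# 3. Remove every flagged point at the same time, then refit once.
coeffs_clean = np.polyfit(x[keep], y[keep], degree)

print(f"Removed {np.count_nonzero(~keep)} point(s); refit coefficients: {coeffs_clean}")
```

Note that the threshold is computed only once, from the original fit, so the two exclusion decisions don't depend on the order in which points are removed.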

With that said - if the outlying data points don't have extreme leverage on your model, or are generally unobtrusive, do you think it's necessary to even remove them? I usually suggest not dropping observations unless they are severely disruptive to modeling. HTH!