Solved – a mathematical way to define a point on a scatter plot as an outlier

outliers

I have a graph with two points that look like potential outliers. I'm trying to create a polynomial line of best fit of unspecified order.

I believe I could use a >2 standard deviations exclusion rule, but I'm not sure exactly how it is applied. Do I determine the qualifying standard deviation using all the points initially?

That's what I did, but I have two points that qualify for exclusion. Do I remove both at once or one at a time? If I remove only the one with the greatest deviation from the line of best fit, I can recalculate the line, which gives me a new line of best fit with a different polynomial order. Under this new fit, the other point that was previously an outlier is no longer an outlier.

What is the correct procedure?

Best Answer

Without knowing everything about your data or what your project is, it's hard to say what the "right" method is. A better way to think about it is that either method may work, but you need to be transparent about how you did it when you present your results.

If you are removing two outliers from 10,000 records, I don't think it particularly matters either way. If you are removing two outliers from 10 records, it becomes significantly more important!

In general, if you are using a 2SD method, I would say you should remove both of them at the same time - you set the exclusion criterion up front, and then you remove everything that violates it. It does not seem to me that you have any analytic justification for the other approach - why would you remove one and then recalculate?
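To make that concrete, here is a minimal sketch in Python of the all-at-once version: fit on all points, compute residuals, flag anything more than 2 standard deviations from the fit, drop every flagged point together, and refit once. The data and the fixed degree of 2 are assumptions for illustration - your "undefined order" would need its own model-selection step.

```python
import numpy as np

# Hypothetical data for illustration -- replace with your own x/y values.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 30)
y = 0.5 * x**2 - 2 * x + rng.normal(scale=2.0, size=x.size)
y[5] += 15    # two artificial outliers
y[20] -= 12

degree = 2  # assumed fixed; choose/validate the polynomial order separately

# 1. Fit once using ALL points and compute residuals from that fit.
coeffs = np.polyfit(x, y, degree)
residuals = y - np.polyval(coeffs, x)

# 2. Set the exclusion criterion from the full-data residuals:
#    anything more than 2 standard deviations from the fit is flagged.
threshold = 2 * residuals.std()
keep = np.abs(residuals) <= threshold

# 3. Remove every flagged point at the same time, then refit once.
coeffs_clean = np.polyfit(x[keep], y[keep], degree)

print(f"Removed {np.count_nonzero(~keep)} point(s); refit coefficients: {coeffs_clean}")
```

Note that the threshold is computed only once, from the original fit, so the two exclusion decisions don't depend on the order in which points are removed.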

With that said - if the outlying data points don't have extreme leverage on your model, or are generally unobtrusive, do you think it's necessary to even remove them? I usually suggest not dropping observations unless they are severely disruptive to modeling. HTH!