Solved – Regression analysis vs outlier detection

Tags: outliers, regression

I have seen many examples where, even if the goal is to find outliers, the first step is to run a linear (or other) regression and only then move on to outlier identification. Why?

If one only wishes to find outliers, can't regression analysis be skipped by using methods like the 68–95–99.7 rule?
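For concreteness, applying the 68–95–99.7 rule directly to a single variable could look like this sketch (plain z-scores with NumPy; the ±3σ cutoff and the planted value are illustrative assumptions, not from the question):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 100), [8.0]])  # plant one obvious outlier

# 68-95-99.7 rule: flag points more than 3 standard deviations from the mean
z = (x - x.mean()) / x.std()
outliers = np.flatnonzero(np.abs(z) > 3)
```

This flags the planted point at index 100, but as the answer below explains, such a univariate check ignores any relationship between variables.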

Best Answer

Checking for outliers in a single series by using 1, 2, or 3 standard deviations as thresholds is often of limited interest. Usually, this type of investigation arises in a setting with an explicit dependent variable Y and several independent variables X. In other words, your objective is to estimate and explain the behavior of Y by a set of associated or causal X variables.

Given the above, what does investigating outliers entail? Well, there are two types of outliers. The first type consists of observations associated with large errors or residuals; these are measured by studentized residuals. The second type consists of X outliers, observations whose X values deviate markedly from the average of the Xs; these are measured by leverage (the diagonal of the hat matrix, hence "hat values").

So, what do you do with these studentized residuals and leverage measures? They are combined into a third and final measure of outlierness: Cook's D. The higher this value, the more influence a specific data point has on the regression coefficients of your model. To fully understand these concepts, learn how to interpret and graph an influence plot in R.

An influence plot in R will readily identify the data points that have the most influence on your regression coefficients. Next, rerun your regression removing those influential data points one at a time (starting with the highest Cook's D value), and observe how much your regression coefficients move. If a couple of data points pull in the same direction, you may consider removing both from your regression and again observing how much the coefficients move.

If you are interested in testing your regression for outliers further, there is an entire family of robust regressions designed to do exactly that (Huber M-estimation, MM-estimation, Quantile Regression). So, you could rerun your model using one or more of these robust regression methods.

Ideally, when you rerun your regression after removing the most influential outliers, or when you use a robust regression, your coefficients remain relatively stable compared with the original fit. Similarly, the statistical significance of your independent X variables should remain reasonably stable. If they are not stable, you may have an explicit justification or explanation for that instability, and that is fine. If you do not have an acceptable explanation, consider removing the variables with unstable coefficients and weakened statistical significance.

Outlier testing is one of the most important aspects of testing a model.

Notice that in this entire discussion I did not speak of outliers in Y itself. By themselves, they are not always interesting. On the other hand, the difference between Y and its fitted value (the residual) is extremely interesting. But, as indicated, residual outliers are only one part of outlier investigation and testing.
