Solved – Outlier detection using outlierTest function

generalized linear model, outliers, r

I found an outlier using the outlierTest function in the car package. The output reports the externally Studentized residual and the p-values.
This is the result.

Does this indicate that the 718th observation is an outlier?

The code to derive the result is as follows.

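library(car)  # provides outlierTest()

credit <- read.csv("german.csv", header = TRUE)

# Indices of the categorical columns; convert each one to a factor
factor_cols <- c(1,2,4,5,7,8,9,10,11,12,13,15,16,17,18,19,20,21)
for (i in factor_cols) credit[, i] <- as.factor(credit[, i])

# Logistic regression of Creditability on all other columns
german_logit <- glm(Creditability ~ ., data = credit, family = "binomial")

# Bonferroni outlier test on the Studentized residuals
german_outlier <- outlierTest(german_logit, n.max = 9999)
german_outlier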

If so, is it correct to delete the 718th observation?

I want to know which variable contains the outlier and what its value is, because I want to replace it with a proper value. Which function should I use?

Best Answer

A few points of clarification:

What do you mean by an outlier? Here, observation 718 is one whose dependent variable has an unusual value in the glm model given its independent variables. If you look at the dataset in a different way, say with a bivariate analysis on another variable, the same observation may or may not get flagged as an outlier.

To display the data values, use credit[718,]. For more information on subsetting, run ?'[' in the console to pull up the help page.

You're passing all variables to the model using the formula Creditability ~ ., so your outlier will be a row, not a single variable.

As for deleting observations, it is advisable to instead create a column or a list that marks the outliers or their row numbers. That way you can subset your data set accordingly and never lose data.
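The indicator-column idea can be sketched as follows. This is a minimal illustration on simulated data, not the German credit data: the data frame toy, its columns, and the cutoff of 2 on the absolute Studentized residual are all assumptions made for the example.

```r
# Simulated stand-in for the credit data (hypothetical columns x1, x2, y)
set.seed(1)
n <- 200
toy <- data.frame(
  x1 = rnorm(n),
  x2 = factor(sample(c("a", "b"), n, replace = TRUE))
)
toy$y <- rbinom(n, 1, plogis(2 * toy$x1))

fit <- glm(y ~ x1 + x2, data = toy, family = binomial)

# Studentized residuals, computed with base R
rs <- rstudent(fit)

# Mark suspicious rows instead of deleting them (the cutoff of 2 is an assumption)
toy$outlier_flag <- abs(rs) > 2

# Inspect the flagged rows, all variables at once
toy[toy$outlier_flag, ]

# Refit without the flagged rows; the full data stay intact in toy
kept <- toy[!toy$outlier_flag, ]
fit_kept <- glm(y ~ x1 + x2, data = kept, family = binomial)
```

Because the flag lives in its own column, the original rows can be restored or the cutoff changed at any time without re-reading the file.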

Related Solutions

Solved – Outlier detection using regression

Your best option to use regression to find outliers is to use robust regression.

Ordinary regression can be impacted by outliers in two ways:

First, an extreme outlier in the y-direction at x-values near $\bar x$ can affect the fit in that area in the same way an outlier can affect a mean.

Second, an 'outlying' observation in x-space is an influential observation - it can pull the fit of the line toward it. If it's sufficiently far away the line will go through the influential point:

[Figure: two side-by-side scatterplots with fitted regression lines; image not available]

In the left plot, there's a point that's quite influential, and it pulls the line quite a way from the large bulk of the data. In the right plot, it's been moved even further away -- and now the line goes through the point. When the x-value is that extreme, as you move that point up and down, the line moves with it, going through the mean of the other points and through the one influential point.

An influential point that's perfectly consistent with the rest of the data may not be such a big problem, but one that's far from a line through the rest of the data will make the line fit it, rather than the data.

If you look at the right-hand plot, the red line - the least squares regression line - doesn't show the extreme point as an outlier at all - its residual is 0. Instead, the large residuals from the least squares line are in the main part of the data!

This means you can completely miss an outlier.

Even worse, with multiple regression, an outlier in x-space may not look particularly unusual for any single x-variable. If there's a possibility of such a point, using least squares regression is potentially very risky.
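A small simulated example may make the multiple-regression case concrete. The data and variable names below are invented for illustration. The last point sits comfortably inside the range of each predictor taken alone, but it goes against the strong correlation between them, and the hat values (leverages) from base R's hatvalues() pick it out.

```r
set.seed(7)
n <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)   # x2 closely tracks x1

# Replace the last point: ordinary on each axis, but against the correlation
x1[n] <- 1.5
x2[n] <- -1.5

y <- 1 + x1 + x2 + rnorm(n)
fit <- lm(y ~ x1 + x2)

# Marginal z-scores of the last point: unremarkable for either variable
z1 <- abs((x1[n] - mean(x1)) / sd(x1))
z2 <- abs((x2[n] - mean(x2)) / sd(x2))

# Leverage considers the predictors jointly
h <- hatvalues(fit)
most_leveraged <- unname(which.max(h))
```

Here most_leveraged is the index of the planted point, even though neither z1 nor z2 would raise suspicion in a one-variable-at-a-time check.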

Robust regression

If you fit a robust line - in particular one robust to influential outliers - like the green line in the second plot - then the outlier has a very large residual.

In that case, you have some hope of identifying outliers - they'll be points that aren't - in some sense - close to the line.
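The contrast between the two fits can be reproduced with a short sketch. It assumes simulated data and uses least trimmed squares via lqs() from the MASS package as one possible robust fit. The answer does not say which robust method produced the green line, so this choice is an assumption.

```r
library(MASS)  # recommended package shipped with R; provides lqs()

set.seed(42)
x <- c(rnorm(30), 25)            # the last point is far out in x-space
y <- 2 + x + rnorm(31, sd = 0.5)
y[31] <- -20                     # and far from the line through the bulk
d <- data.frame(x = x, y = y)

ols <- lm(y ~ x, data = d)                    # least squares fit
rob <- lqs(y ~ x, data = d, method = "lts")   # least trimmed squares fit

# Residual of the influential point under each fit
res_ols <- unname(abs(residuals(ols)[31]))
res_rob <- unname(abs(residuals(rob)[31]))
```

The least squares line is pulled toward the influential point, so res_ols is small, while the robust line follows the bulk of the data and res_rob is large.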

Removing outliers

You certainly can use a robust regression to identify and thereby remove outliers.

But once you have a robust regression fit, one that is already not badly affected by outliers, you don't necessarily need to remove the outliers -- you already have a model that's a good fit.

Solved – Simple algorithm for online outlier detection of a generic time series II: Daily cycle within annual

I'm not an expert on sensor data. However, your data reminded me of clickstream/internet data. I recently stumbled upon Twitter's anomaly detection algorithm. I have not personally had great success with this method, but I wanted to give it a try on your data because of its similarity to data generated by clickstreams and tweets. I used an annual cycle for the seasonality (365*48 = 17520).

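library(AnomalyDetection)  # Twitter's package; provides AnomalyDetectionVec()

data <- read.csv("C:/Desktop/LOESS2.csv", header = TRUE)

# period = 365 * 48 = 17520 observations per annual cycle
data.outliers.annual <- AnomalyDetectionVec(data[, 2], period = 17520, plot = TRUE, direction = "both")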

Below is the plot from the above data:

[Figure: plot of the detected anomalies; image not available]

Although this is not ideal, based on visual inspection this method does a decent job of detecting obvious outliers.

An alternative approach would be to look into state space methods to capture multiple seasonality and then simultaneously detect outliers. I'll try to post if I find time.

Hope this was helpful
