A measure that corresponds to variance but is robust against outliers

descriptive-statistics, outliers, variance

One theory holds that the variance of a particular measure (call it $x$) should increase linearly over time.

I've collected a dataset in order to test this claim of the theory. What I've found, however, is that $x$ is almost always very small, but is occasionally quite large. The outliers in my dataset (the rare large values of $x$) seem to be causing fluctuations in the variance over time (independent datasets were collected for each time point) that make it difficult to see a clear pattern of increase.

What would statisticians recommend I do to reduce my analysis's vulnerability to outliers, while still addressing the claim of the theory I'm interested in?

One solution to reduce the influence of outliers would be to compute something like the interquartile range. But then, would I be looking for a linear increase in the interquartile range? (It seems to me the answer is no, but I'm not sure what to do about this.)

Another solution might be to fit a Gaussian distribution to my data, and derive the variance from the best-fitting width parameter. But this makes the assumption that my data are distributed normally.

Maybe the easiest thing to do would be to use some threshold for cutting outliers entirely from the dataset. But this seems like cheating to me. (Maybe it isn't cheating.)

Best Answer

The median absolute deviation (MAD) is one generally accepted measure of the spread of data points, robust in the sense that it is insensitive to the exact values of outliers unless outliers make up more than half of the observations (a 50% breakdown point). It is a very useful alternative to the variance/standard deviation in cases like yours. There are other robust measures of spread (scale) as well; the robust-statistics literature on scale estimators discusses several further options.
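As a concrete illustration (not part of the original answer), here is a minimal Python sketch that computes a scaled MAD per time point and squares it. If the distribution's shape is fixed and only its scale grows, a squared scale estimate tracks the variance, so a linear increase in variance should show up as a roughly linear trend in the squared MAD. The data-generating step and the 1.4826 normal-consistency factor are illustrative assumptions.

```python
import numpy as np

def mad(x, scale=1.4826):
    """Median absolute deviation; scale=1.4826 makes it estimate sigma for normal data."""
    x = np.asarray(x)
    return scale * np.median(np.abs(x - np.median(x)))

# Hypothetical data: one independent, heavy-tailed sample of x per time point,
# with variance growing linearly in t.
rng = np.random.default_rng(0)
samples_by_time = [rng.standard_t(df=3, size=200) * np.sqrt(1 + t) for t in range(10)]

# Robust analogue of the variance at each time point: the squared scaled MAD.
robust_variances = [mad(s) ** 2 for s in samples_by_time]
print(robust_variances)
```

One could then fit a straight line to `robust_variances` against time to check the theory's linear-increase claim without the fit being dominated by the occasional large value of $x$.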
