Solved – Treatment of outliers in financial data

anova, outliers, regression, trimmed-mean, winsorizing

I have a data set with financial panel data from 150 companies. I want to analyse the data using linear repeated measures ANOVA and OLS regression (so far). For this, I want to use the absolute values (e.g. Revenue 2012-2016) as well as calculated averages (e.g. Avg. Revenue Growth) as dependent variables.

I have identified quite a few (extreme) outliers in the data, for example companies with a revenue growth of 10000%, but I am unsure how to treat these outliers because of the financial nature of the data. If a company, for example, invests 1 Mio. in 2012 and then earns 500k in 2013, its revenue growth may show up as an outlier, although this is a legitimate business practice (I therefore assume the outliers to be representative). However, these business practices are (in theory) not related to, or a result of, the independent variable.

I don't feel that trimming or winsorizing the data is the correct way to go, as I would reduce my already small sample. However, if I keep the outliers, they will influence my analysis.

Any suggestions on how to proceed? I have read about quantile regression, but I am not familiar with it and don't know whether it is applicable when the outliers are very few (per variable) but very extreme.

Asking more specifically:

  1. Should I trim / winsorize financial data? Why, why not?
  2. What alternatives are there?

Best Answer

These are not outliers. I have a paper and several working papers related to this topic. Do not trim or winsorize your data.

The distributions involved lack a mean, so there is no such thing as ANOVA or OLS regression for them. The distributions also lack sufficient statistics. In practice, this limits you to only a few choices.

The first choice is to take the log of the data. Taking the log may or may not bias the estimates, depending on the theoretical construction you are testing: in some cases the bias is quite substantial, in others it is zero. There is also a theoretical issue: these distributions do not have a covariance matrix even in log form. There is a scale matrix, but it collapses to a single point, so you would need to be careful in interpreting the coefficients. While there isn't a covariance for the data, there is a variance.
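
As a minimal sketch of what the log transformation looks like in practice, here is an OLS fit on log revenue, assuming Python with statsmodels; the panel below is simulated stand-in data and the single regressor `x` is purely illustrative, not part of your study:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated stand-in panel: 150 companies, revenue for 2012-2016, one regressor x.
rng = np.random.default_rng(42)
n_firms, n_years = 150, 5
df = pd.DataFrame({
    "company": np.repeat(np.arange(n_firms), n_years),
    "year": np.tile(np.arange(2012, 2017), n_firms),
    "revenue": rng.lognormal(mean=15, sigma=1.5, size=n_firms * n_years),
    "x": rng.normal(size=n_firms * n_years),
})

# Regress the log of revenue instead of raw revenue; coefficients are then
# interpreted on a multiplicative (roughly percentage) scale.
df["log_revenue"] = np.log(df["revenue"])
fit = smf.ols("log_revenue ~ x", data=df).fit()
print(fit.summary())
```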

The second choice is quantile regression or Theil's regression. The advantage of either is that you minimize the absolute (linear) loss; however, while this gives an unbiased estimator of the median, the median isn't $\mu$, so relative to the true center of location you still have a bias. In fact, this guarantees the bias that logarithms may avoid. The advantage over the log transformation is that the ranks are sufficient statistics, because the sample itself is sufficient, so you will not lose information where you will with the log transformation.
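
A minimal sketch of this second choice, assuming Python with statsmodels (quantile regression at the median) and SciPy's Theil–Sen estimator; the heavy-tailed data here are simulated stand-ins:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import theilslopes

# Simulated stand-in data with heavy-tailed errors in the response.
rng = np.random.default_rng(0)
x = rng.normal(size=150)
y = 2.0 + 0.5 * x + rng.standard_t(df=2, size=150)
df = pd.DataFrame({"x": x, "y": y})

# Median (quantile) regression: minimizes absolute loss and estimates the
# conditional median, so a handful of extreme values cannot dominate the fit.
median_fit = smf.quantreg("y ~ x", df).fit(q=0.5)
print(median_fit.params)

# Theil-Sen regression: the slope is the median of all pairwise slopes,
# another rank-based, robust alternative to OLS.
slope, intercept, low_slope, high_slope = theilslopes(y, x)
print(slope, intercept)
```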

The third choice is some form of Bayesian solution. The problem is that we do not know your distribution. The advantage of a Bayesian solution is that the likelihood function is always minimally sufficient. This is slightly different from saying that all the information is in the statistic; rather, all the information needed to calculate the likelihood is in the statistic, and there can be no greater reduction in dimension. All Bayesian solutions with a proper prior are biased solutions, but the bias is usually quite a bit less than the bias on the Frequentist side from the difference between the median and $\mu$.

Bayesian solutions also have the difficulty of being labor-intensive unless you are used to using them. A Bayesian solution would "solve" all of your problems, but it has a significant learning curve, and you need a reasonable prior density. The prior density contains your beliefs about the parameters before you actually saw the data. Because you would have different beliefs than I would, the result is always subjective.

You would also need a good set of possible guesses as to the potential density function involved unless you could derive the density from first principles.
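
For illustration only, here is one possible Bayesian sketch, assuming Python with PyMC (one tool among several, not a requirement), a guessed Student-t likelihood to allow heavy tails, and subjective priors; the likelihood and the priors are exactly the assumptions you would have to justify for your own data:

```python
import numpy as np
import pymc as pm
import arviz as az

# Simulated stand-in data with heavy-tailed errors; replace with your own panel.
rng = np.random.default_rng(1)
x = rng.normal(size=150)
y = 2.0 + 0.5 * x + rng.standard_t(df=2, size=150)

with pm.Model():
    # Subjective priors: beliefs about the parameters before seeing the data.
    alpha = pm.Normal("alpha", mu=0.0, sigma=10.0)
    beta = pm.Normal("beta", mu=0.0, sigma=10.0)
    sigma = pm.HalfNormal("sigma", sigma=5.0)
    nu = pm.Exponential("nu", lam=0.1)  # tail heaviness of the likelihood

    # Guessed likelihood: Student-t to accommodate extreme observations.
    pm.StudentT("y_obs", nu=nu, mu=alpha + beta * x, sigma=sigma, observed=y)

    idata = pm.sample(1000, tune=1000)

print(az.summary(idata, var_names=["alpha", "beta", "sigma", "nu"]))
```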

As for Frequentist tests, you should pick up a good book on non-parametric and distribution-free statistics. Kruskal–Wallis one-way analysis of variance may accomplish what you are looking for in ANOVA, but it may not. This is a case where you really want to pick up a textbook. There are a ton of them on Amazon. My books are rather old and there have been advances, so I suggest an academic library as your first stop. Do not go down the path of Bayesian nonparametric statistics unless you are an advanced user of statistics. Frequentist non-parametric and distribution-free methods are very simple and simplicity is its own virtue here. If you happen to find a book on Amazon on Bayesian non-parametric statistics, don't buy it. The learning curve will be nearly vertical.
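
As a minimal sketch of the Kruskal–Wallis test, assuming Python with SciPy and made-up growth figures for three independent groups of companies; since the test compares independent groups, a repeated-measures design may call for a different rank-based test (e.g. the Friedman test):

```python
from scipy.stats import kruskal

# Made-up average revenue growth for three groups of companies
# (e.g. three levels of the independent variable). One value is extreme.
group_a = [0.05, 0.12, 0.08, 3.50, 0.02]
group_b = [0.01, 0.04, 0.07, 0.03, 0.06]
group_c = [0.09, 0.11, 0.10, 0.15, 0.08]

# The test works on ranks, so the extreme value cannot dominate the statistic.
stat, p_value = kruskal(group_a, group_b, group_c)
print(stat, p_value)
```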

There are books explicitly on non-parametric econometrics. You should look at those.