Solved – Dealing with probable outliers in the dependent variable

Tags: dataset, dependent variable, outliers

I am trying to fit a simple regression to a data set of ~45,000 observations. The dependent variable is revenue growth, but I'm concerned that some of the observed values were entered incorrectly.

To elaborate: the mean growth is about 6% (standard deviation roughly 18%), but observations range from -100% to 200%. I'm certain the 200% value was entered incorrectly, but there is also a sizeable number of observations above 100%, which disconcerts me. While growth in that range is plausible and has been observed, I doubt that all of these values are accurate (my skepticism is an increasing function of the number of observations in this range).

What is the best way to deal with this data, especially since there are too many observations to check manually? Should I remove all of these data points and run the model only on data I am confident is correct? Or will that bias the model by truncating the dependent variable at 100%? Is there something more sophisticated that can salvage these observations?

Best Answer

Aside from the technicalities of identifying outliers (Tukey's fences, SD cutoffs, Hotelling's T² for multivariate data, etc.), I think the first question should be: why do we want to identify outliers? If your primary purpose is, e.g., to model basic principles of economic growth in the general body of companies, then you may want to exclude rocket companies from your data, since they obviously work by a different mechanism.
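For concreteness, here is a minimal sketch of two of the simple univariate rules mentioned above (Tukey's fences and an SD cutoff) in Python; the `growth` array is simulated placeholder data standing in for your dependent variable:

```python
import numpy as np

# Placeholder data: growth as a fraction (0.06 = 6%), simulated here
# only so the snippet runs; substitute your actual dependent variable.
rng = np.random.default_rng(0)
growth = rng.normal(0.06, 0.18, size=45_000)

# Tukey's rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR].
q1, q3 = np.percentile(growth, [25, 75])
iqr = q3 - q1
tukey_outliers = (growth < q1 - 1.5 * iqr) | (growth > q3 + 1.5 * iqr)

# SD rule: flag points more than 3 standard deviations from the mean.
z = (growth - growth.mean()) / growth.std()
sd_outliers = np.abs(z) > 3

print(tukey_outliers.sum(), sd_outliers.sum())
```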

Conversely, you may want to identify potential rockets. Then you would first identify a small set of outliers from your general data swarm and, in a subsequent step, discard those that arise from simple input errors. After that you can use some clever modelling principle or deeper analysis to identify what causes the remaining observations to stand out from the others...

Your fundamental analysis question will determine the best approach for identifying outliers, but as a general principle you should make sure to check why observations are outliers, not just ascertain that they are.

I don't know much about quantitative economics, but why not run, e.g., a normality test before and after a series of transformations (e.g. log transformations) and see whether the observations become modellable before deciding on a principle for identifying outliers? Then, as I mentioned above, if your aim is to find general principles of economic growth, you may want to sort out not only obvious errors but also those observations that are governed by other principles, since they probably will not be well described by general mechanisms.
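As a rough illustration of that idea, a sketch with scipy follows; the data are simulated placeholders, and since growth is bounded below by -100% a log(1 + x) transform is used so the argument stays positive. (Be aware that with ~45,000 observations a formal normality test will reject even for tiny deviations, so a Q-Q plot may be more informative in practice.)

```python
import numpy as np
from scipy import stats

# Placeholder data standing in for your growth variable (fractions, > -1).
rng = np.random.default_rng(1)
growth = rng.lognormal(mean=0.0, sigma=0.2, size=45_000) - 1

# D'Agostino-Pearson normality test before the transformation...
stat_raw, p_raw = stats.normaltest(growth)

# ...and after log(1 + x); the shift keeps the argument positive
# (an exact -100% observation would still need special handling).
log_growth = np.log1p(growth)
stat_log, p_log = stats.normaltest(log_growth)

print(f"raw: p = {p_raw:.3g}, log1p: p = {p_log:.3g}")
```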


EDIT: Your comment suggests that you may be more interested in identifying the interesting outliers than in modelling the bulk of the observations. Then why not train a classification model? You haven't specified your independent variables, but I'm guessing you have some input parameters...

Then train a classifier on your independent variables for a three-class problem (a sketch follows below):

1. Normal companies
2. Apparent faulty entries
3. Rockets
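A minimal sketch of this setup with scikit-learn, assuming you have hand-labelled a subset of companies to train on; the feature matrix `X`, the labels, and the choice of a random forest are all placeholders for whatever inputs and model you actually use:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical setup: X holds your independent variables and `label` is a
# hand-assigned class for an inspected subset:
# 0 = normal company, 1 = apparent faulty entry, 2 = rocket.
rng = np.random.default_rng(2)
X = rng.normal(size=(600, 5))          # placeholder features
label = rng.integers(0, 3, size=600)   # placeholder labels

clf = RandomForestClassifier(n_estimators=200, random_state=0)
print(cross_val_score(clf, X, label, cv=5).mean())  # rough sanity check

clf.fit(X, label)
# Predict classes for the unlabelled bulk of the data (placeholder here).
X_new = rng.normal(size=(10, 5))
print(clf.predict(X_new))
```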

You may also want to add other categories, such as bombs...

That way you could investigate whether there is something systematically different between the types of outliers. An analysis of which variables distinguish the different types of companies could then help you make an informed decision on how to treat your outliers in your subsequent analyses (e.g. whether they should be included in the main model, discarded entirely, or analysed separately with a different model).
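One way to get at those "important variables" is permutation importance on the fitted classifier: shuffle each variable in turn and see how much accuracy drops. Again, everything below is placeholder data, just to show the mechanics:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Placeholder data again: 5 hypothetical input variables and the
# three-class labels (0 = normal, 1 = faulty entry, 2 = rocket).
rng = np.random.default_rng(3)
X = rng.normal(size=(600, 5))
label = rng.integers(0, 3, size=600)

clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, label)

# Large mean drops in score mark variables that separate the classes.
result = permutation_importance(clf, X, label, n_repeats=10, random_state=0)
for i in result.importances_mean.argsort()[::-1]:
    print(f"variable {i}: {result.importances_mean[i]:.3f}")
```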

HTH HAND
Carl
