Solved – Bootstrapping – do I need to remove outliers first

bootstrap, outliers

We've run a split test of a new product feature and want to measure whether the uplift in revenue is significant. Our observations are definitely not normally distributed: most of our users don't spend at all, and among those who do, spend is heavily skewed, with lots of small spenders and a few very big spenders.

We've decided to use bootstrapping to compare the means, to get around the issue of the data not being normally distributed (side question: is this a legitimate use of bootstrapping?)

My question is, do I need to trim outliers from the data set (e.g. the few very big spenders) before I run the bootstrapping, or does that not matter?
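In case it helps, here is a minimal sketch of the comparison we have in mind (NumPy, with made-up revenue arrays standing in for our control and variant arms): resample each arm with replacement, recompute the difference in mean revenue per user, and read a 95% interval off the percentiles.

    import numpy as np

    rng = np.random.default_rng(42)

    # Made-up per-user revenue: zero-inflated and heavily right-skewed.
    control = np.concatenate([np.zeros(900), rng.lognormal(1.0, 1.5, 100)])
    variant = np.concatenate([np.zeros(880), rng.lognormal(1.1, 1.5, 120)])

    observed_uplift = variant.mean() - control.mean()

    n_boot = 10_000
    uplifts = np.empty(n_boot)
    for i in range(n_boot):
        c = rng.choice(control, size=control.size, replace=True)
        v = rng.choice(variant, size=variant.size, replace=True)
        uplifts[i] = v.mean() - c.mean()

    ci_low, ci_high = np.percentile(uplifts, [2.5, 97.5])
    print(f"observed uplift: {observed_uplift:.3f}")
    print(f"95% bootstrap percentile CI: ({ci_low:.3f}, {ci_high:.3f})")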

Best Answer

Before addressing this, it's important to acknowledge that the statistical malpractice of "removing outliers" has been wrongly promulgated in much of applied statistical pedagogy. Traditionally, outliers are defined as high-leverage, high-influence observations. One can and should identify such observations in the analysis of data, but those conditions alone do not warrant removing them. A "true outlier" is a high-leverage, high-influence observation that is inconsistent with replications of the experimental design. To deem an observation as such requires specialized knowledge of the population and the science behind the "data generating mechanism". The most important point is that you should be able to identify potential outliers a priori.

As for the bootstrapping aspect of things, the bootstrap is meant to simulate independent, repeated draws from the sampling population. If you prespecify exclusion criteria in your analysis plan, you should still leave the excluded values in the referent bootstrap sampling distribution: that way the bootstrap accounts for the loss of power that comes from applying the exclusions after the data are sampled. However, if there are no prespecified exclusion criteria and outliers are removed by post hoc adjudication, as I'm obviously railing against, removing those values will propagate the same inferential errors described above.
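To make that concrete, here is a sketch under an assumed, made-up prespecified rule ("exclude spends above $1,000"): resample from the full data, then apply the exclusion inside each bootstrap replicate rather than trimming the data once up front.

    import numpy as np

    def boot_uplift_with_exclusion(control, variant, cap=1_000, n_boot=10_000, seed=0):
        """Bootstrap the uplift while honoring a prespecified exclusion rule.

        The full arrays are resampled; the exclusion (drop spends above `cap`)
        is applied inside each replicate, so the referent distribution reflects
        the variability and power loss the rule introduces."""
        rng = np.random.default_rng(seed)
        uplifts = np.empty(n_boot)
        for i in range(n_boot):
            c = rng.choice(control, size=control.size, replace=True)
            v = rng.choice(variant, size=variant.size, replace=True)
            c, v = c[c <= cap], v[v <= cap]  # exclusion applied after resampling
            uplifts[i] = v.mean() - c.mean()
        return uplifts

Trimming the arrays once and resampling only the kept values would hide the extra variability the exclusion rule introduces, which is exactly the loss of power mentioned above.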

Consider a study of wealth and happiness in an unstratified simple random sample of 100 people. If we took the statement "1% of the population holds 90% of the world's wealth" literally, then we would observe, on average, one very highly influential value. Suppose further that, beyond affording a basic quality of life, there is no excess happiness attributable to larger income (so the linear trend is not constant across income levels). That individual is then also high leverage.

The least-squares regression coefficient fit to the unadulterated data estimates a population-averaged first-order trend in these data. It is heavily attenuated by the one individual in the sample whose happiness is consistent with that of people near median income levels. If we remove this individual, the least-squares slope is much larger, but the variance of the regressor is reduced, so inference about the association is approximately the same. The difficulty with doing this is that I did not prespecify conditions under which individuals would be excluded. If another researcher replicated this study design, they would sample, on average, one high-income, moderately happy individual and obtain results inconsistent with my "trimmed" results.
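A small simulation along these lines (all numbers invented) lets you see the mechanics for yourself: fit least squares with and without the single high-income, moderately happy individual, and compare the slope to its standard error in each case.

    import numpy as np

    rng = np.random.default_rng(1)

    # 100 people; happiness rises with income up to a basic quality of life and
    # then flattens. One person is extremely wealthy but only moderately happy.
    n = 100
    income = rng.lognormal(mean=10.5, sigma=0.5, size=n)
    income[0] = 5e7
    happiness = 5 + 2 * np.log1p(np.minimum(income, 1e5)) + rng.normal(0, 1, n)

    def ols_slope_se(x, y):
        """Slope and standard error for simple least-squares regression."""
        x_c = x - x.mean()
        slope = (x_c * (y - y.mean())).sum() / (x_c ** 2).sum()
        resid = y - (y.mean() + slope * x_c)
        se = np.sqrt(resid.var(ddof=2) / (x_c ** 2).sum())
        return slope, se

    print("full data       (slope, SE):", ols_slope_se(income, happiness))
    print("wealthy removed (slope, SE):", ols_slope_se(income[1:], happiness[1:]))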

If we were interested a priori in the association between moderate income and happiness, then we should have prespecified that we would, e.g., "compare individuals earning less than $100,000 annual household income". So removing the outlier causes us to estimate an association we cannot describe, and hence the p-values are meaningless.

On the other hand, measurements from miscalibrated medical equipment and obviously facetious self-reported survey responses can be removed. The more precisely the exclusion criteria can be described before the analysis actually takes place, the more valid and consistent the results such an analysis will produce.