What are the consequences of removing the tails of a distribution?

bias, distributions, outliers, quantiles, regression

I would like to know the consequences of removing the tails of a distribution by deleting observations above and below certain thresholds.

For instance, if one were to calculate the percentiles of a measurement, then remove all values that are below and above a percentile threshold at each end (all observations below 1st percentile, all observations above 99th percentile).
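For concreteness, here is a minimal sketch of the kind of trimming I have in mind (illustrative only, using NumPy on a made-up one-dimensional sample):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_t(df=3, size=10_000)   # made-up heavy-tailed measurements

# thresholds at the 1st and 99th percentiles of the observed sample
lo, hi = np.percentile(x, [1, 99])

# delete every observation outside the thresholds
x_trimmed = x[(x >= lo) & (x <= hi)]

print(x.std(), x_trimmed.std())         # the trimmed sample looks less dispersed
```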

Intuition tells me that this is a bad idea, but I would like a more concrete explanation as to why.

Here are some questions I have:

  • How would this change the behavior of the distribution?
  • What statistical principles are being violated here?
  • How would this change any conclusions reached during analysis of such data?
  • Is this a viable method for the elimination of outliers?
  • Is this strategy acceptable in any situation?

Thank you in advance.


Edit:

Thank you for the response, Glen_b. As a follow-up, I would like to ask about a specific situation.

Suppose that we would like to calculate standard scores for a measurement that account for some covariate. We do this by regressing the measurement against the covariate, then obtaining standard scores using the predicted responses from the regression.

We would like to make this process more robust to outliers.

Is it advisable to trim data prior to any analysis (without retaining the discarded values), then perform the analysis?

As an alternative to simply discarding the data, could one fit a regression model using a trimmed subset, then apply this model to standardize the entire dataset? Would this be similar to Least Trimmed Squares Regression?


Edit #2:

Clarification: We use the covariate as the independent variable/predictor of the measurement in the regression.

The goal is to correct the measurements for a covariate, since we believe that the measurement is highly dependent on this covariate.

We do this by standardizing the values using the predicted response from the regression model. The standardization can then be applied to newly obtained pairs of measurement and covariate values to determine whether they behave similarly to the original sample.

$Z(y_{i}) = \frac{y_{i}-E(y \mid x_{i})}{\hat{\sigma}}$
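A rough sketch of what we do, using ordinary least squares purely for illustration on made-up data (our actual variables differ):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=500)                  # covariate (predictor)
y = 2.0 + 0.5 * x + rng.normal(0, 1, size=500)    # measurement (response)

# regress the measurement on the covariate
slope, intercept = np.polyfit(x, y, deg=1)
y_hat = intercept + slope * x                     # predicted responses, E(y | x_i)

sigma_hat = np.std(y - y_hat, ddof=2)             # residual standard deviation
z = (y - y_hat) / sigma_hat                       # standard scores as in the formula above

# the same standardization applied to a newly obtained (measurement, covariate) pair
def standardize(y_new, x_new):
    return (y_new - (intercept + slope * x_new)) / sigma_hat
```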

Outliers are a concern with respect to the dependent variable (y-outliers), the measurement.

What type of robust regression would be suitable? M-estimation?

Best Answer

Removing observations below the $k$th percentile and above the $(100-k)$th before calculating some estimator is the same as trimming (i.e. calculating a trimmed estimator).
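To make that concrete, a small sketch (illustrative only): removing everything outside the 1st and 99th sample percentiles before averaging gives essentially the 1% trimmed mean:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.standard_cauchy(size=10_000)            # a distribution with very heavy tails

# remove observations below the 1st and above the 99th percentile, then average
lo, hi = np.percentile(x, [1, 99])
manual_trimmed_mean = x[(x >= lo) & (x <= hi)].mean()

# scipy's trimmed mean cuts the same proportion from each tail
scipy_trimmed_mean = stats.trim_mean(x, proportiontocut=0.01)

print(manual_trimmed_mean, scipy_trimmed_mean)  # essentially the same value
```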

The effect of trimming on the distribution is effectively that of truncation at both ends. The impact of sample-based trimming on the observed distribution (averaged across many samples) is a little different because the quantiles are random variables rather than fixed numbers; in very large samples it becomes quite similar to truncation though.

The impact it has on whatever you're doing depends on the circumstances; for example, the impact on some estimator depends on the estimator you're applying after trimming off the observations, and on the distribution you apply the procedure to.

It can certainly reduce the effect of contamination that produces gross outliers (though other estimators, such as M-estimators for location, might be a better choice than a trimmed location estimator in many situations; similarly for other kinds of estimators).

If you apply it to (for example) a variance calculation, it will result in downward bias. Some authors have suggested trimming for means and Winsorizing rather than trimming for calculating variances or standard deviations (which doesn't eliminate the bias, but will reduce it).
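A small simulation sketch of that point (added here for illustration, not the only way to see it): with normal data, the variance of a 1%-trimmed sample is biased downward, while the Winsorized version is still biased but less so:

```python
import numpy as np
from scipy.stats.mstats import winsorize

rng = np.random.default_rng(3)
trimmed_vars, winsorized_vars = [], []

for _ in range(2_000):
    x = rng.normal(0, 1, size=500)              # true variance is 1
    lo, hi = np.percentile(x, [1, 99])
    trimmed_vars.append(x[(x >= lo) & (x <= hi)].var(ddof=1))
    winsorized_vars.append(np.asarray(winsorize(x, limits=(0.01, 0.01))).var(ddof=1))

print(np.mean(trimmed_vars))     # noticeably below 1: downward bias
print(np.mean(winsorized_vars))  # still below 1, but closer
```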


What statistical principles are being violated here?

I'm not sure quite what you're asking for, but the properties of many things will change; for example see the answer to the next question.

How would this change any conclusions reached during analysis of such data?

It depends on what you're doing! For example, t-tests applied to trimmed samples will no longer have the nominal significance level; but an adjustment of the variance and of the degrees of freedom should enable you to get close to the desired type I error rate.
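As a rough illustration of that first point (a simulation sketch I'm adding, not part of any standard procedure): applying an ordinary t-test to trimmed samples does not preserve the nominal 5% level:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

def trim(a, k=5):
    """Drop observations below the kth and above the (100-k)th percentile."""
    lo, hi = np.percentile(a, [k, 100 - k])
    return a[(a >= lo) & (a <= hi)]

n_sims, rejections = 5_000, 0
for _ in range(n_sims):
    # two samples from the same distribution, so the null hypothesis is true
    x = rng.normal(0, 1, size=100)
    y = rng.normal(0, 1, size=100)
    _, p = stats.ttest_ind(trim(x), trim(y))
    rejections += p < 0.05

print(rejections / n_sims)   # noticeably above 0.05 in this setup, not the nominal level
```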


This approach is certainly sometimes used. It's a common simple choice and in some situations it performs pretty well -- but it's not always the best choice.

You may find it helpful to read a little about robust statistics.


Edit: further answers in response to new question

You previously described a univariate procedure (trimming off large and small values from a distribution), but now you're asking about regression, which involves more than one variable. This changes things.

With regression, you're talking about a different conditional distribution of the response for each point -- you can't simply ignore the IV when trying to figure out what's most extreme.

Suppose that we would like to calculate standard scores for a measurement that account for some covariate. We do this by regressing the covariate against the measurement, then obtain standard scores using the predicted responses from the regression.

It's not clear to me how this gives you what you want. (Nor indeed is it clear which way around your regression goes; I suspect that when you say "regress against" you are phrasing it as IV "regressed against" DV, which would seem to be reversed from the usual convention.)

We would like to make this process more robust to outliers.

I'll address this leaving aside my concerns above.

Is it advisable to trim data prior to any analysis (without retaining the discarded values), then perform the analysis?

If I have understood you correctly, no, since you're applying a marginal (i.e. unconditional) approach to correct problems with a conditional model, before you have even had a chance to assess whether an observation is conditionally unusual. I'd instead advise considering robust regression methods.

As an alternative to simply discarding the data, could one fit a regression model using a trimmed subset, then apply this model to standardize the entire dataset?

I would advise against it for the reason outlined above.

Would this be similar to Least Trimmed Squares Regression?

If I have understood you correctly, in no way is that similar. (It does involve trimming of something, but it's not at all like what I understand you to be proposing.)

It might be worth addressing what kind of problems you anticipate with your data: is it y-outliers, x-outliers, some of each, or both together?


Answers to new edit:

If it's only y-outliers at issue (i.e. influential observations are not present), an M-estimate may be reasonable, but so would many other robust regression estimators. [You can use trimming, as well, but you would need to apply it to residuals ... from a robust estimate, and if you have one already, you probably don't need to trim.]
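For example, a minimal M-estimation sketch using statsmodels (the Huber ψ shown here is just one common choice; the data and names are made up):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(5)
x = rng.uniform(0, 10, size=200)                  # covariate
y = 2.0 + 0.5 * x + rng.normal(0, 1, size=200)    # measurement
y[:10] += 15                                      # a few gross y-outliers

# robust (M-estimation) regression of the measurement on the covariate
X = sm.add_constant(x)
robust_fit = sm.RLM(y, X, M=sm.robust.norms.HuberT()).fit()

# standard scores based on the robust fit and its robust scale estimate
z = (y - robust_fit.fittedvalues) / robust_fit.scale
```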
