Solved – Winsorizing data

multiple regression, winsorizing

I am currently working on my bachelor thesis in finance and have run into some problems with my dataset. I want to analyze the effect of leverage on company performance and, like many researchers before me, I want to use multiple linear regression. My tutor advised me to winsorize the data at 2.5% and 97.5%. However, when I checked the statistics, for some of my variables over 200 observations out of 4000 were detected as outliers. For other variables, as many as 2000 observations were marked as outliers. I searched the web for answers and tried different methods to reduce these numbers, but it still doesn't work and it doesn't make any sense to me. I would be very grateful if someone could give me some advice.

Thank you 🙂

Best Answer

If you have 4000 observations and you winsorize the top 2.5% and bottom 2.5% of the data, then 200 observations will be affected: 100 in each tail. This happens regardless of what the values are, and it doesn't imply that they were outliers in any meaningful sense of the term.
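As a rough check of that arithmetic, the sketch below counts how many values fall outside the 2.5% and 97.5% quantiles. The data are simulated with `rnorm` purely for illustration (an assumption, not the questioner's dataset), and the names `x`, `lims`, and `n_changed` are hypothetical.

```r
set.seed(1)                     # simulated data, for illustration only
x <- rnorm(4000)                # 4000 continuous values, so no ties

# 2.5% and 97.5% quantiles (R's default quantile method)
lims <- unname(quantile(x, c(0.025, 0.975)))

# Values strictly outside the limits are the ones winsorizing would change
n_changed <- sum(x < lims[1] | x > lims[2])
n_changed                       # 200, i.e. 100 in each tail

# Winsorizing clamps those values; no observation is dropped
x_wins <- pmin(pmax(x, lims[1]), lims[2])
length(x_wins)                  # still 4000
```

The count depends only on the sample size and the chosen percentages, not on how extreme the values actually are.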

Winsorizing data shouldn't remove any observations, but it will change them.


EDIT: Some additional information in response to comments.

One distinction to make is between trimming and Winsorization. Trimming will simply remove observations that fall outside of specified quantiles. So trimming to 95% will remove the top 2.5% of observations and the bottom 2.5% of observations.

Winsorizing doesn't remove observations. Instead, it changes the values of observations that fall outside a specified quantile to the value at that quantile. A simple example makes this clearer.

One word of caution: there are different methods for calculating percentiles, so the defaults in other software packages may give somewhat different results.
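To illustrate that caution, R's own `quantile` function offers several calculation methods through its `type` argument. The sketch below compares the default (type 7) with type 6 on a small vector; the choice of these two types is only an example, and which method another package uses by default should be checked in its documentation.

```r
A <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

# R's default method (type 7)
q7 <- unname(quantile(A, c(0.20, 0.80), type = 7))
q7   # 2.8 8.2

# An alternative method (type 6) applied to the same data
q6 <- unname(quantile(A, c(0.20, 0.80), type = 6))
q6   # 2.2 8.8
```

With only ten values the difference is large; with thousands of observations it is usually small, but it can still change which values get replaced.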

In the example below, the data are Winsorized to the inner 60%. The 20th percentile is calculated as 2.8 and the 80th percentile as 8.2, so the values less than 2.8 are replaced with 2.8 and the values greater than 8.2 are replaced with 8.2.

# Install the psych package if it is not already available
if(!require(psych)){install.packages("psych")}

A = c(1,2,3,4,5,6,7,8,9,10)

# 20th and 80th percentiles, using R's default method
quantile(A, c(0.20, 0.80))

   ### 20% 80% 
   ### 2.8 8.2 

library(psych)

winsor(A, trim = 0.20)   # This Winsorizes to the inner 60% of observations

   ###   [1] 2.8 2.8 3.0 4.0 5.0 6.0 7.0 8.0 8.2 8.2
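
For comparison, the same result can be sketched in base R without the psych package, alongside trimming, to show the difference described above. This is a minimal illustration, not how `winsor` is implemented internally; the names `lims`, `trimmed`, and `winsorized` are hypothetical.

```r
A <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)

# 20th and 80th percentiles with R's default method: 2.8 and 8.2
lims <- unname(quantile(A, c(0.20, 0.80)))

# Trimming: drop values outside the limits, so the vector gets shorter
trimmed <- A[A >= lims[1] & A <= lims[2]]

# Winsorizing: clamp values to the limits, so the length is unchanged
winsorized <- pmin(pmax(A, lims[1]), lims[2])

length(trimmed)      # 6
length(winsorized)   # 10
winsorized           # 2.8 2.8 3.0 4.0 5.0 6.0 7.0 8.0 8.2 8.2
```

Trimming leaves 6 of the 10 values, while winsorizing keeps all 10 and only changes the 4 values outside the limits.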