Solved – Removing outliers from asymmetric data

Tags: descriptive statistics, outliers, r, winsorizing

I have a data set that includes the number of visits to a website. Here are some descriptive statistics for my data:

Median: 4
Mean: 14.1352
SD: 121.8119

Clearly, there are some huge values (individuals who have visited the site thousands of times). To remove these outliers, I considered simply dropping any observation that lies more than 3.5 standard deviations from the mean. Even after doing so, the distribution still has a pronounced fat tail. With the observations beyond 3.5 standard deviations removed, my descriptive statistics become:

Median: 4
Mean: 10.2201
SD: 19.7492

I also explored using a winsorized mean, but since my data is asymmetric, I feel that my descriptive statistics are still biased. Is there a method I can use to produce descriptive statistics that represent the ‘majority’ of the population?

As I understand the concept of bootstrapping, I could repeatedly resample my original sample, with replacement, to generate thousands of resampled data sets, each of which may represent my population somewhat differently. Would this method be appropriate?
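
For illustration, the resampling idea described above could look roughly like this in base R. This is only a sketch: the `visits` vector is simulated (it is not the real visit data), and the choice of the median as the statistic, the 2000 replicates, and the 95% percentile interval are all assumptions made for the example. Note that the bootstrap describes the sampling variability of a statistic; on its own it does not remove or down-weight extreme values.

```r
# Simulated stand-in for the visit counts (hypothetical, NOT the real data):
# a right-skewed log-normal draw, rounded up to whole visits.
set.seed(1)
visits <- ceiling(rlnorm(2000, meanlog = 1.4, sdlog = 1.3))

# Resample the observed data with replacement and recompute the median each time.
n_boot <- 2000
boot_medians <- replicate(n_boot, median(sample(visits, replace = TRUE)))

# Percentile interval for the median, taken from the bootstrap distribution.
ci <- quantile(boot_medians, c(0.025, 0.975))

print(median(visits))
print(ci)
```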

Any other ideas or direction?

Any references or examples with R would be very much appreciated as well.

Best Answer

Don't remove any outliers until you have explored the data a bit further. I suggest log-transforming the data and checking whether it becomes more nearly symmetrical; the outliers may not be as extreme as you think. (Log values make perfect sense if some sort of power law is at play.)
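
A minimal sketch of that check in base R follows. The `visits` vector here is simulated purely for illustration (it is not the asker's data), and the `skewness` helper is a hypothetical convenience function defined in the snippet, not part of base R. It assumes every count is at least 1; if zeros can occur, `log1p(visits)` is one common alternative.

```r
# Simulated stand-in for the visit counts (hypothetical, NOT the real data):
# a log-normal draw rounded up to whole visits, so every value is >= 1.
set.seed(42)
visits <- ceiling(rlnorm(5000, meanlog = 1.4, sdlog = 1.3))

# Moment-based sample skewness; values near 0 indicate rough symmetry.
skewness <- function(x) {
  z <- x - mean(x)
  mean(z^3) / mean(z^2)^1.5
}

log_visits <- log(visits)  # use log1p(visits) instead if zeros can occur

# Compare the raw and log scales.
print(summary(visits))
print(summary(log_visits))
skew_raw <- skewness(visits)
skew_log <- skewness(log_visits)
print(c(raw = skew_raw, log = skew_log))

# Back-transforming the mean of the logs gives the geometric mean,
# a location summary that is less dominated by the extreme counts.
geo_mean <- exp(mean(log_visits))
print(c(mean = mean(visits), median = median(visits), geometric_mean = geo_mean))
```

For a visual check, `hist(log_visits)` or `qqnorm(log_visits)` shows whether the log scale looks roughly symmetric. If it does, summaries computed on the log scale and back-transformed, such as the geometric mean, may describe the bulk of the visitors better than the raw mean.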