Solved – How to deal with too many outliers

boxplot, data preprocessing, data transformation, machine learning, outliers

Box plot of the variable fare
I have attached a boxplot of a variable called Fare (the fare of a journey). It is a continuous variable with outliers. According to some articles on outliers, any data point that lies above or below the whiskers is an outlier. I also learned that the whisker limits are the 75th percentile + 1.5*(Inter-Quartile Range) on the upper side and the 25th percentile - 1.5*(Inter-Quartile Range) on the lower side.
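To make the whisker rule concrete, here is a minimal sketch in Python with NumPy. The function names `iqr_fences` and `count_flagged` and the sample values are made up for illustration; they are not the Fare data from the question. The multiplier `k` is left as a parameter so that 1.5 and other choices can be compared.

```python
import numpy as np

def iqr_fences(values, k=1.5):
    """Return (lower, upper) limits: Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def count_flagged(values, k=1.5):
    """Count the points that fall outside the limits for a given k."""
    values = np.asarray(values, dtype=float)
    lower, upper = iqr_fences(values, k)
    return int(np.sum((values < lower) | (values > upper)))

# Hypothetical right-skewed sample (NOT the real data from the question).
fares = np.array([7.25, 7.9, 8.05, 10.5, 13.0, 15.5,
                  26.0, 31.0, 71.3, 150.0, 512.3])

lower, upper = iqr_fences(fares)
print("limits with k=1.5:", lower, upper)
print("flagged with k=1.5:", count_flagged(fares, k=1.5))
print("flagged with k=3.0:", count_flagged(fares, k=3.0))
```

On a right-skewed sample like this one, only the upper limit ever flags anything, and a larger `k` simply moves that limit further out. Neither choice says whether the flagged values are wrong.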

In the boxplot I have attached, you can see there are too many outliers (200 of 891 observations). If I replace all these points with missing values (to be imputed later), won't that introduce bias? A few articles suggest using 3*IQR instead of 1.5*IQR. Should I do it that way? How should I deal with this many outliers?

Best Answer

These are not outliers. I am an economist and this is the way the data should look, based on your comments. It is a poor dataset to start a beginner on.

What you are looking at is called "price discrimination." In particular, it is third-degree price discrimination. Another real-world example, although it is an example of first-degree price discrimination, is the Apple iPhone. When it first came out, Apple restricted production. As a consequence, the supply curve and the demand curve did not meet. Only those who valued it the most tried to buy it, and they were willing to pay the most. Then Apple produced more, but still not enough for the supply curve and the demand curve to meet. People stood in line, and those willing to pay the most got a phone. Apple continued this process until the price fell to the equilibrium price.

In doing this, they extracted as much revenue as possible from each person. There is a hidden structure in this data that you need to extract. It probably has to do with square footage, amenities and location. You need to ask a new question, because removing outliers won't get you where you want to go. The data has no outliers in it.
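A rough sketch of what "hidden structure" can mean in practice: prices that look extreme when pooled may be perfectly ordinary within their own group. The tier names and prices below are invented for illustration, and the grouping variable is an assumption; the real data would need whatever variable actually drives the price.

```python
import numpy as np

def flagged(values, k=1.5):
    """Return the values lying outside Q1 - k*IQR and Q3 + k*IQR."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return values[(values < q1 - k * iqr) | (values > q3 + k * iqr)]

# Hypothetical prices grouped by a hypothetical service tier.
prices_by_tier = {
    "economy": [7.0, 7.5, 8.0, 8.0, 8.5, 9.0, 9.5, 10.0],
    "standard": [12.0, 13.0, 13.0, 14.0, 15.0],
    "premium": [60.0, 70.0, 80.0],
}

# Pooled view: every premium price is flagged by the 1.5*IQR rule.
pooled = [p for tier in prices_by_tier.values() for p in tier]
print("pooled:", flagged(pooled))

# Per-tier view: the same rule applied within each tier.
for tier, prices in prices_by_tier.items():
    print(tier, flagged(prices))
```

In this toy example the pooled rule flags all three premium prices, while nothing is flagged within any tier. The "outliers" are just a different price segment.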

Without really looking at it closely, it is probably a Pareto distribution, and not all Pareto distributions even have a mean, let alone the nice properties you want a beginner to see.
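To illustrate the point about Pareto distributions, the sketch below uses the standard closed-form quantile, mean and tail probability of a Pareto distribution with shape `alpha` and minimum `x_min`. The function names and the chosen `alpha` values are assumptions for illustration; nothing here is fitted to the Fare data.

```python
import math

def pareto_quantile(p, alpha, x_min=1.0):
    """Quantile function of a Pareto(alpha, x_min) distribution."""
    return x_min * (1.0 - p) ** (-1.0 / alpha)

def pareto_mean(alpha, x_min=1.0):
    """Mean is alpha*x_min/(alpha-1) for alpha > 1, infinite otherwise."""
    return alpha * x_min / (alpha - 1.0) if alpha > 1.0 else math.inf

def share_above_upper_fence(alpha, k=1.5, x_min=1.0):
    """Exact probability that a Pareto draw exceeds Q3 + k*IQR."""
    q1 = pareto_quantile(0.25, alpha, x_min)
    q3 = pareto_quantile(0.75, alpha, x_min)
    upper = q3 + k * (q3 - q1)
    # Survival function of the Pareto: P(X > x) = (x_min / x) ** alpha
    return (x_min / upper) ** alpha

for alpha in (0.5, 1.0, 2.0):
    print(alpha, pareto_mean(alpha), round(share_above_upper_fence(alpha), 4))
```

With `alpha = 1` the mean is infinite and exactly 12.5% of the distribution lies above the upper limit, compared with roughly 0.7% flagged in total for a normal distribution. A large share of flagged points is therefore what a heavy-tailed distribution is expected to produce, not evidence of bad data.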