Solved – Dealing with outliers with Linear Regression

machine learning, python, scikit-learn

I'm trying to predict the prices of Airbnbs given some of their attributes, using a very simple linear regression model.

Despite obtaining a median absolute error of around 24 euros on the training and testing sets, which I'm happy with, the residual plot shows a lot of large outliers.

My question is: how should one deal with them? Should they be identified and handled in the pre-processing stage, or simply omitted in the prediction phase?

import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn import metrics

# Fit an ordinary least-squares model
lin = linear_model.LinearRegression()
lin.fit(X_train, y_train)
y_train_predict = lin.predict(X_train)
y_test_predict = lin.predict(X_test)

# Residual plot: predicted values vs. residuals
plt.figure(figsize=(20, 20))
plt.scatter(y_train_predict, y_train_predict - y_train, c='b', s=40, alpha=0.5)
plt.scatter(y_test_predict, y_test_predict - y_test, c='g', s=40)
plt.show()

Residual Plot

Best Answer

I see your intention is predictive power, not inference. Consider that your pre-processing pipeline should be run separately within each fold during training; otherwise you incur optimistic bias in your performance estimates: what looks like an outlier in the whole dataset might not look like one within a training fold. So if you intend to remove those points, do so through an automated criterion.
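A minimal sketch of what that looks like, with synthetic data standing in for the Airbnb features and an assumed outlier rule (drop training points whose target is more than 3 scaled MADs from the fold median) — the criterion is illustrative, not a recommendation:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import median_absolute_error
from sklearn.model_selection import KFold

# Synthetic stand-in for the Airbnb data, with a few injected price outliers.
X, y = make_regression(n_samples=500, n_features=5, noise=10.0, random_state=0)
y[::50] += 300

errors = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    X_tr, y_tr = X[train_idx], y[train_idx]
    # Outlier criterion computed on the training fold ONLY, never on test data:
    # keep points within 3 scaled MADs of the fold's median target.
    med = np.median(y_tr)
    mad = np.median(np.abs(y_tr - med))
    keep = np.abs(y_tr - med) <= 3 * 1.4826 * mad
    model = LinearRegression().fit(X_tr[keep], y_tr[keep])
    # Evaluate on the untouched test fold, outliers included.
    errors.append(median_absolute_error(y[test_idx], model.predict(X[test_idx])))

print(np.mean(errors))
```

The key point is that the threshold is re-derived inside each fold, so the test fold never informs which points get dropped.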

If the data acquisition was actually faulty (and you have strong reasons to believe so), you are justified in removing what seem to be outliers. See: Should I report the descriptive statistics in publication before or after outliers removal?

You could also try your hand at robust statistics, more specifically robust estimators for linear regression. These are resistant to leverage points and are completely automatic from the training point of view.
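For instance, scikit-learn ships `HuberRegressor` (and also `RANSACRegressor` and `TheilSenRegressor`). A small sketch on made-up data with gross outliers placed at high x values, where ordinary least squares gets pulled away from the true slope but the Huber fit does not:

```python
import numpy as np
from sklearn.linear_model import HuberRegressor, LinearRegression

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
# True relationship: y = 3x + 5, plus noise.
y = 3.0 * X.ravel() + 5.0 + rng.normal(0, 1, 200)
# Gross outliers at high-leverage points (large x).
y[X.ravel() > 9] += 100

ols = LinearRegression().fit(X, y)
huber = HuberRegressor().fit(X, y)

print("OLS slope:  ", ols.coef_[0])    # pulled upward by the outliers
print("Huber slope:", huber.coef_[0])  # stays near the true slope of 3
```

No outlier labeling or manual removal is needed: the Huber loss downweights large residuals automatically during training, which is what makes these estimators convenient in a cross-validated pipeline.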
