House Price Model – How Adding Latitude and Longitude Enhances House Price Prediction Models

data transformation, feature selection, machine learning

I'm new to machine learning, and I'm trying to get a sense of how you optimize data for a model. I'm following this official Kaggle tutorial, which teaches the basics of machine learning through house price prediction. They use a decision tree, but I found it odd which features they feed into the model to predict the price of a house:

house_price_features = ['Rooms', 'Bathroom', 'Landsize', 'Latitude', 'Longitude']

Rooms, bathrooms, and landsize all make sense to me – but latitude and longitude? Obviously there is a correlation between location and price, but it's not going to follow a nice curve. Sometimes, going a block up will increase house prices twofold; sometimes, it'll have no effect at all. Intuitively, I feel like all a model can do with those features in predicting price is overfit. So, my question is twofold:

  1. Were they right in giving this model latitude and longitude to predict price, or is this extraneous information that can only hurt the model? Why?
  2. If the answer to the above is "no", is there any transformation of the latitude and longitude data (e.g. into neighborhood IDs) that would make the data more helpful?
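
For reference, the fit in the tutorial looks roughly like this (a minimal sketch, not the tutorial's exact code – the file path and the exact column spellings are my assumptions):

    import pandas as pd
    from sklearn.tree import DecisionTreeRegressor

    # Load the Melbourne housing data (path assumed; column spellings may
    # differ slightly in the actual dataset).
    melbourne_df = pd.read_csv("melb_data.csv")

    house_price_features = ['Rooms', 'Bathroom', 'Landsize', 'Latitude', 'Longitude']
    X = melbourne_df[house_price_features]
    y = melbourne_df['Price']

    # Plain decision tree regressor, as in the tutorial.
    model = DecisionTreeRegressor(random_state=1)
    model.fit(X, y)
    print(model.predict(X.head()))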

Best Answer

The answer is yes, because location is usually the main driver of house prices per square foot. Dropping it would likely degrade model performance, probably dramatically.

Based on latitude and longitude, tree-based methods divide the map into rectangular pieces. The stronger the effect and the denser the data in an area, the smaller the pieces; in sparser regions, the pieces stay larger.
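
As a sketch of that behaviour (synthetic data and made-up coordinates, not the Kaggle dataset), you can fit a shallow regression tree on latitude/longitude alone and print its splits: every root-to-leaf path is an axis-aligned rectangle on the map, and areas where price changes sharply get split more finely.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor, export_text

    rng = np.random.default_rng(0)
    n = 2000
    lat = rng.uniform(-38.0, -37.6, n)   # roughly Melbourne-sized bounding box
    lon = rng.uniform(144.7, 145.2, n)

    # Price jumps in one "expensive pocket" – deliberately not a smooth curve.
    price = 800_000 + 400_000 * ((lat > -37.85) & (lon < 145.0)) + rng.normal(0, 50_000, n)

    X = np.column_stack([lat, lon])
    tree = DecisionTreeRegressor(max_depth=3, min_samples_leaf=50).fit(X, price)

    # Each root-to-leaf path below is a rectangle in (lat, lon) space.
    print(export_text(tree, feature_names=["lat", "lon"]))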

You would not add them as raw linear effects in a linear regression; there, you would need a different approach. A simple one is to represent latitude and longitude each by a cubic spline and add interaction terms between the two bases.
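
A minimal sketch of that idea with scikit-learn (synthetic data again; the knot count and ridge penalty are arbitrary assumptions): build a cubic spline basis for each coordinate, add the products of the two bases as interaction terms, and fit a linear model on top.

    import numpy as np
    from sklearn.preprocessing import SplineTransformer
    from sklearn.linear_model import Ridge

    rng = np.random.default_rng(0)
    n = 2000
    lat = rng.uniform(-38.0, -37.6, n)
    lon = rng.uniform(144.7, 145.2, n)
    price = 800_000 + 400_000 * ((lat > -37.85) & (lon < 145.0)) + rng.normal(0, 50_000, n)

    # Cubic spline basis for each coordinate separately.
    B_lat = SplineTransformer(degree=3, n_knots=8, include_bias=False).fit_transform(lat.reshape(-1, 1))
    B_lon = SplineTransformer(degree=3, n_knots=8, include_bias=False).fit_transform(lon.reshape(-1, 1))

    # Interactions: every lat-basis column times every lon-basis column
    # (a tensor-product surface), while the model itself stays linear.
    interactions = np.einsum('ij,ik->ijk', B_lat, B_lon).reshape(n, -1)
    X = np.hstack([B_lat, B_lon, interactions])

    model = Ridge(alpha=1.0).fit(X, price)
    print("R^2 on training data:", round(model.score(X, price), 3))

The interaction terms are what let an otherwise additive linear model bend the price surface differently in different parts of the map; without them, latitude and longitude could only shift prices independently of each other.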
