Solved – Random Forest Regression with sparse data in Python

feature-engineering, machine-learning, many-categories, random-forest, regression

I am working on a Random Forest regression model to predict housing prices.
I have about 500k rows of data with the following information:

1. House area in square meters.
2. Number of rooms.
3. City.
4. Street.
5. Floor.
6. The transaction date.
7. Type of house (single house, apartment building, etc.).
8. The amount paid for the house.

I am planning to build a separate model for each city, but I'm having trouble representing the street name.
I was thinking of using a one-hot encoder to represent the street name, but some cities have over 1000 streets, and that would give me over 1000 variables with mostly zero values.

I have read about sparse representations, but I don't know how to use them in practice.

Let's say I already have a sparse representation of my data: how do I feed it to the random forest?
Does the RandomForestRegressor from the sklearn library in Python support sparse data?
If not, is there another way to use a random forest with sparse data in Python?

Best Answer

This is a variation on a frequently asked question here (see similar posts), but so far there are no really good answers, as far as I can see (if you disagree, please point us to the good ones!). Tree models like forests seem to have problems with high-cardinality nominal variables, so this is an area where we can expect large differences between implementations: try and compare different ones!

One blog post that seems to take this seriously compares H2O and scikit-learn, and prefers the former: H2O does not use one-hot encoding, which the post identifies as the problem here. So, some words about categorical encoding. Numerical encodings like one-hot (better known as dummy coding) come from linear models. In a linear model, a nominal (categorical) variable with $k$ levels is represented as a $(k-1)$-dimensional linear subspace (assuming the model has an intercept). This linear subspace can be represented in many different ways, corresponding to a choice of basis.
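
To answer the practical question directly: scikit-learn's RandomForestRegressor accepts scipy.sparse input for both fit() and predict(), so a one-hot encoding can be kept sparse end to end. Here is a minimal sketch; the column names and toy data are made up for illustration:

```python
import pandas as pd
from scipy import sparse
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for one city's data (made-up columns and values).
df = pd.DataFrame({
    "area": [55.0, 120.0, 80.0, 95.0],
    "rooms": [2, 4, 3, 3],
    "street": ["Elm", "Oak", "Elm", "Birch"],
    "price": [150_000, 420_000, 230_000, 310_000],
})

# OneHotEncoder returns a scipy sparse matrix by default, so 1000+
# street columns never have to be materialized as a dense array.
enc = OneHotEncoder(handle_unknown="ignore")
street_sparse = enc.fit_transform(df[["street"]])

# Stack the dense numeric columns next to the sparse street block.
X = sparse.hstack([df[["area", "rooms"]].to_numpy(), street_sparse],
                  format="csr")
y = df["price"].to_numpy()

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X, y)          # sparse input is converted to CSC internally
print(forest.predict(X))  # predict() accepts sparse input as well
```

Whether one-hot is a good encoding for a forest is the separate question discussed in the rest of this answer, but mechanically the sparse matrix goes in as-is.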

For linear models and methods, the choice of basis is just a convenience: results with any of them are equivalent. But for non-linear methods like trees and forests, this is no longer true. In particular, with one-hot encoding you are only searching for splits on single levels, which can be highly inefficient, especially when there are very many levels. Some kind of hierarchical encoding might be much better, and there must be huge scope for work here! You could look through related posts for ideas, but most of them are about linear models. You could try the idea in Strange encoding for categorical features. Also remember that with random forests there is no need to use the same predictors/encodings for each tree: you could, as an idea, use random projections, but a different one for each tree, as sketched below. Whether any existing implementation does this, I do not know.
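
To make the per-tree projection idea concrete, here is a toy sketch of a hand-rolled ensemble in which every tree gets its own random projection of the sparse street block plus a bootstrap sample. The helper function names are made up, and this is an illustration of the concept, not an existing library implementation:

```python
import numpy as np
from scipy import sparse
from sklearn.random_projection import SparseRandomProjection
from sklearn.tree import DecisionTreeRegressor

def fit_projected_forest(X_num, X_street, y, n_trees=100,
                         n_components=20, seed=0):
    """X_num: dense numeric features; X_street: sparse one-hot streets."""
    rng = np.random.default_rng(seed)
    n = X_num.shape[0]
    ensemble = []
    for _ in range(n_trees):
        # A fresh projection per tree: the street columns are compressed
        # into a small block, differently for every tree.
        proj = SparseRandomProjection(
            n_components=n_components,
            random_state=int(rng.integers(2**31)))
        X_full = sparse.hstack([X_num, proj.fit_transform(X_street)],
                               format="csr")
        idx = rng.integers(0, n, size=n)  # bootstrap sample
        tree = DecisionTreeRegressor().fit(X_full[idx], y[idx])
        ensemble.append((proj, tree))
    return ensemble

def predict_projected_forest(ensemble, X_num, X_street):
    # Average the per-tree predictions, each under its own projection.
    preds = [tree.predict(sparse.hstack([X_num, proj.transform(X_street)],
                                        format="csr"))
             for proj, tree in ensemble]
    return np.mean(preds, axis=0)
```

The point of the projection is that each split can then act on a random linear combination of many streets at once, instead of peeling off one street per split as one-hot encoding forces.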

Some other relevant and interesting links/papers I found are one-hot-encoding-is-making-your-tree-based-ensembles-worse-heres-why, Random Forests, Decision Trees, and Categorical Predictors: The “Absent Levels” Problem, and a stored Google Scholar search.