Time Series – Weather Data in Time Series Predictions Using Random Forest

Disclaimer: I know this is a long-ish post, but I don't need code solutions, just the high-level approaches that are usually used in situations like this.

So let's say I want to predict the number of people on a street or city square at any given moment.

I have hourly data on the average number of people seen on the street.
Let's say this data was collected automatically with a camera and image-recognition software that counted the people.

I also have weather data: temperature, rain and wind speed.

From some basic data analysis we can see some general patterns:
– more people on the street when it's warmer, no rain and no wind
– fewer people when it's colder or when it rains or when there is very strong wind

If I take a standard model like RandomForestRegressor from scikit-learn and ignore the weather variables, it captures the daily and weekly patterns quite well.
It does not take weather conditions into account, but overall it works and predicts fairly well (especially on larger timescales).

Now my question begins:

How do I handle sparse events like weather conditions to improve the Random Forest's predictions (for example by multiplying its output by some coefficient, or by some better approach)?

E.g. let's say I have historical data for rain on a Wednesday (20% fewer people) and rain on a Thursday (15% fewer people).
But I don't have data for rain on a Friday.
And my weather forecast says it's going to rain next Friday, so I have the opportunity to use this information to improve my prediction, but how?

If I simply put everything into the Random Forest, it will "say" there is no bin or data for this case (Friday and raining).
No data means the prediction is zero (which is not true: in reality it will be lower than usual, but not zero).

Yes, I could calculate how much lower the average number of people is on rainy days compared to regular days (Wednesday, Thursday) and then apply this number to Friday, but I wonder if this is the right approach.
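To make the idea above concrete, here is a minimal sketch of that averaging approach in Python. The dictionaries and all counts are made up for illustration (chosen to match the 20% and 15% drops mentioned earlier); `dry_mean` and `rainy_mean` are hypothetical names, not part of any library.

```python
# Hypothetical average hourly counts per weekday; all numbers are invented.
dry_mean = {"Wed": 100.0, "Thu": 120.0, "Fri": 150.0}
rainy_mean = {"Wed": 80.0, "Thu": 102.0}  # no rainy Friday observed

# Ratio of rainy to dry counts for each day where both exist (0.80 and 0.85 here).
ratios = [rainy_mean[day] / dry_mean[day] for day in rainy_mean]

# Average the ratios into one multiplicative "rain factor".
rain_factor = sum(ratios) / len(ratios)

# Apply the factor to the dry Friday baseline to estimate a rainy Friday.
friday_rain_estimate = dry_mean["Fri"] * rain_factor
print(round(friday_rain_estimate, 2))  # about 123.75
```

This assumes the rain effect is the same proportional drop on every weekday, which is exactly the assumption being questioned here.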

Also what happens if I want to include many different factors (or just experiment with many different factors)? This seems like a very slow and tedious process.

What would be a better approach? (Different models, neural nets, or…?)

Best Answer

You can have a categorical (nominal) feature called "Weather" with levels such as "Rainy", "Sunny", etc. For the days where you have no weather data, you can represent missingness explicitly by adding a "No Weather Data" level.

You can feed your data into a random forest, or any linear / non-linear model, in this way. You will need to one-hot encode your nominal feature(s) before feeding them in.

If you use R, most modelling packages do this automatically for factor columns, so you don't have to worry about it. In Python with scikit-learn, you generally have to encode the feature yourself, for example with `pandas.get_dummies` or `sklearn.preprocessing.OneHotEncoder`.
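As a rough sketch of the encoding step in Python, assuming pandas and scikit-learn are available: the toy data below is entirely made up, the column names (`day`, `hour`, `weather`, `people`) are hypothetical, and the training set deliberately contains no rainy Friday rows.

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical toy data: invented counts, with no (Fri, Rainy) rows in training.
base = {"Wed": 100, "Thu": 120, "Fri": 150}
rows = []
for day, count in base.items():
    for hour in range(8, 20):
        rows.append({"day": day, "hour": hour, "weather": "Sunny",
                     "people": count + hour})
        if day != "Fri":
            rows.append({"day": day, "hour": hour, "weather": "Rainy",
                         "people": 0.8 * (count + hour)})
train = pd.DataFrame(rows)

# One-hot encode the nominal features; "hour" stays numeric.
X = pd.get_dummies(train[["day", "hour", "weather"]], columns=["day", "weather"])
y = train["people"]

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X, y)

# Encode an unseen combination (rainy Friday) with the same columns as training.
new = pd.DataFrame([{"day": "Fri", "hour": 12, "weather": "Rainy"}])
X_new = (pd.get_dummies(new, columns=["day", "weather"])
         .reindex(columns=X.columns, fill_value=0))

prediction = model.predict(X_new)[0]
print(prediction)
```

Each tree predicts the average of the training targets in the leaf the new row lands in, so the forest returns a value within the range of the observed counts for an unseen combination rather than zero. How much of the rain effect carries over to Friday depends on how the trees happened to split, so it is worth checking against held-out data.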
