I am using a random forest regressor and I split the independent variables with shuffle = True
, I get a good r squared but when I don't shuffle the data the accuracy gets reduced significantly. I am splitting the data as below-
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,random_state=rand, shuffle=True)
I fit the random forest regressor as below-
reg = RandomForestRegressor()
reg.fit(X_train, y_train)
When I do the cross-validation with unshuffled data, the accuracy is not good, but it improves when the data is shuffled.
This is how the response variable I am trying to predict looks like-
What could be the reason behind that?
Best Answer
In our conversation, you revealed two important facts not mentioned in your original question:
Large differences in performance with and without shuffling are a hint that the order of the data contains information. Often it's information about time, eg, the order in which the data points occurred or were collected.
Understanding why/how can help you to a) avoid overestimating the predictive performance of the model and, sometimes, b) reconsider how you approach the problem altogether. Your problem seems to be in the latter category.
Don't use data from the future to predict the past.
A common situation where shuffling (before splitting into train/validation/test) is not appropriate is when the data is ordered in time. In that case you want to use the first portion of your data for training and the second portion for validation (the two parts don't need to be the same size). This correctly reflects the fact that the model won't have data from the future when it's deployed in real life.
Since your data has a time index (years), it's not appropriate to shuffle before splitting even if the results are worse. It's a more realistic, if disappointing, evaluation of model performance.
Estimation is not the same as prediction.
If you are interested in estimation, a random forest is not the right approach. You want to specify a flexible functional form for the relationship between features and response and then estimate its parameters. The model also has to take into consideration the serial nature of the observations. There is no need to split the data for training and validation; train on all the data and evaluate how well the model fits instead. Regression with correlated errors might be a good place to start.
Here are some additional thoughts that I leave here because your other question got closed down:
You plot total crop area against time and the most striking pattern is that less land is dedicated to crops since the 70s. Do any of the predictors dropped/rose sharply during that period of time? Was there a large shift from agriculture to manufacturing? Important changes in policy? Do you have any data on economic activity other than agriculture? Should you even lump together data from before and after ~1970s? Maybe not esp. if you don't have data on broader economic conditions apart from agriculture.