Solved – Dealing with good performance on training and validation data, but very bad performance on testing data

cross-validation, model selection, overfitting, r-squared, regression

I have a regression problem with 5-6k variables. I divide my data into 3 non-overlapping sets: training, validation, and testing. I train using only the training set, and generate a lot of different linear regression models by choosing a different set of 200 variables for each model (I try about 100k such subsets). I score a model as $\min(R^2_{\text{training data}}, R^2_{\text{validation data}})$. Using this criterion, I end up choosing a model. It turns out that the chosen model has very similar $R^2$ on the training and the validation data. However, when I try this model on the testing data, it has much lower $R^2$. So it seems I am somehow overfitting on both the training and the validation data. Any ideas on how I can get a more robust model?
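For concreteness, here is a rough sketch of the selection loop I describe, with synthetic data and scikit-learn's `LinearRegression` standing in for my actual features and fitting code (the sizes and iteration counts are placeholders):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for my real data: n samples, p candidate variables.
n, p = 2000, 5000
X = rng.standard_normal((n, p))
y = X[:, :10] @ rng.standard_normal(10) + rng.standard_normal(n)

# Non-overlapping train / validation / test split.
idx = rng.permutation(n)
train, valid, test = idx[:1000], idx[1000:1500], idx[1500:]

best_score, best_vars = -np.inf, None
for _ in range(1000):  # my real search tries ~100k subsets
    vars_ = rng.choice(p, size=200, replace=False)
    model = LinearRegression().fit(X[np.ix_(train, vars_)], y[train])
    r2_train = model.score(X[np.ix_(train, vars_)], y[train])
    r2_valid = model.score(X[np.ix_(valid, vars_)], y[valid])
    score = min(r2_train, r2_valid)  # my selection criterion
    if score > best_score:
        best_score, best_vars = score, vars_

final = LinearRegression().fit(X[np.ix_(train, best_vars)], y[train])
print("test R^2:", final.score(X[np.ix_(test, best_vars)], y[test]))
```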

I tried increasing the training data size, but that didn't help. I am thinking of perhaps shrinking the size of each subset.

I have tried using regularization. However, the models I obtain using the lasso or the elastic net have much lower $R^2$ on both the training set and the validation set, compared to the model I obtain with the subset-selection approach. Therefore, I don't consider these models, since I assume that if Model A performs better than Model B on both the training set and the validation set, Model A is clearly better than Model B. I would be very curious if you disagree with this.
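For reference, the regularized fits look roughly like this (again with synthetic placeholder data; `LassoCV`/`ElasticNetCV` are scikit-learn's cross-validated fitters, and the grids here are not necessarily the exact settings I used):

```python
import numpy as np
from sklearn.linear_model import LassoCV, ElasticNetCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n, p = 1500, 5000
X = rng.standard_normal((n, p))
y = X[:, :10] @ rng.standard_normal(10) + rng.standard_normal(n)
X_train, y_train = X[:1000], y[:1000]
X_valid, y_valid = X[1000:], y[1000:]

models = {
    "lasso": make_pipeline(StandardScaler(), LassoCV(cv=5, n_alphas=50)),
    "elastic net": make_pipeline(StandardScaler(),
                                 ElasticNetCV(cv=5, l1_ratio=[.1, .5, .9],
                                              n_alphas=50)),
}
for name, est in models.items():
    est.fit(X_train, y_train)
    print(f"{name}: train R^2 = {est.score(X_train, y_train):.3f}, "
          f"valid R^2 = {est.score(X_valid, y_valid):.3f}")
```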

On a related note, do you think $R^2$ is a bad criterion for choosing my models?

Best Answer

While this sounds somewhat like overfitting, I think it's actually more likely that you've got some kind of "bug" in your code or your process. I would start by verifying that your test set isn't somehow systematically different from the training/validation set. Suppose your data is sorted by date (or whatever). If you used the first 50% for training, the next 25% for validation, and the rest for testing, you may have accidentally stratified your data in a way that makes the training data somewhat representative of the validation data, but less so for the testing data. This is fairly easy to do by accident.
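As a quick sanity check, you could compare simple summary statistics across the three chunks, and shuffle before splitting if the rows have any natural ordering. A minimal sketch with made-up data that drifts over "time":

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder data that drifts over "time": later rows have a shifted mean.
n, p = 2000, 20
X = rng.standard_normal((n, p)) + np.linspace(0, 2, n)[:, None]
y = X.sum(axis=1) + rng.standard_normal(n)

def split_sequential(n):
    """First 50% train, next 25% validation, last 25% test (order preserved)."""
    return np.arange(n // 2), np.arange(n // 2, 3 * n // 4), np.arange(3 * n // 4, n)

def split_shuffled(n, rng):
    """Same proportions, but rows are shuffled first."""
    idx = rng.permutation(n)
    return idx[: n // 2], idx[n // 2 : 3 * n // 4], idx[3 * n // 4 :]

for name, (tr, va, te) in [("sequential", split_sequential(n)),
                           ("shuffled", split_shuffled(n, rng))]:
    # If the three means differ noticeably, the split is stratified by the drift.
    print(name, "feature means:",
          X[tr].mean().round(2), X[va].mean().round(2), X[te].mean().round(2))
```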

You should also ensure you're not "double-dipping" in the validation data somehow, which sometimes happens accidentally.

Alternatively, CV's own @Frank Harrell has reported that a single train/test split is often too variable to provide useful information on a system's performance (maybe he can weigh in with a citation or some data). You might consider doing something like cross-validation or bootstrapping, which would let you measure both the mean and the variance of your accuracy measure.
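A minimal sketch of what I mean, using scikit-learn's cross-validation plus a simple out-of-bag bootstrap on placeholder data (swap in your own model and features):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.utils import resample

rng = np.random.default_rng(0)
n, p = 1000, 50
X = rng.standard_normal((n, p))
y = X[:, :5] @ rng.standard_normal(5) + rng.standard_normal(n)

# Repeated K-fold CV: mean and spread of R^2 across folds.
scores = cross_val_score(LinearRegression(), X, y,
                         cv=KFold(n_splits=10, shuffle=True, random_state=0),
                         scoring="r2")
print(f"CV R^2: {scores.mean():.3f} +/- {scores.std():.3f}")

# Simple bootstrap: refit on resampled rows, score on the out-of-bag rows.
boot = []
for b in range(200):
    ib = resample(np.arange(n), random_state=b)   # in-bag indices (with replacement)
    oob = np.setdiff1d(np.arange(n), ib)          # out-of-bag indices
    model = LinearRegression().fit(X[ib], y[ib])
    boot.append(model.score(X[oob], y[oob]))
print(f"bootstrap R^2: {np.mean(boot):.3f} +/- {np.std(boot):.3f}")
```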

Unlike Mikera, I don't think the problem is your scoring mechanism. That said, I can't imagine a situation where $R^2_{\text{training}} < R^2_{\text{validation}}$, which means your $\min$ criterion almost always reduces to the validation $R^2$ anyway, so I'd suggest scoring using the validation data alone.

More generally, I think $R^2$ or something like it is a reasonable choice for measuring the performance of a continuous-output model, provided you're aware of its caveats. Depending on exactly what you're doing, you may also want to look at the maximum or worst-case error. If you are somehow discretizing your output (logistic regression, some external thresholds), then precision/recall/AUC might be a better choice.
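For illustration, the relevant scikit-learn metrics on placeholder predictions (the threshold here is arbitrary; the classification metrics only apply if you actually discretize your output):

```python
import numpy as np
from sklearn.metrics import (r2_score, max_error, precision_score,
                             recall_score, roc_auc_score)

rng = np.random.default_rng(0)
y_true = rng.standard_normal(500)
y_pred = y_true + 0.3 * rng.standard_normal(500)  # placeholder predictions

# Continuous output: overall fit plus the single worst miss.
print("R^2:      ", round(r2_score(y_true, y_pred), 3))
print("max error:", round(max_error(y_true, y_pred), 3))

# If the output is thresholded into classes, classification metrics apply instead.
threshold = 0.0
c_true = (y_true > threshold).astype(int)
c_pred = (y_pred > threshold).astype(int)
print("precision:", round(precision_score(c_true, c_pred), 3))
print("recall:   ", round(recall_score(c_true, c_pred), 3))
print("AUC:      ", round(roc_auc_score(c_true, y_pred), 3))
```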
