Solved – Comparing the RMSE of a model to a null model

Tags: logistic, poisson-distribution, rms

I'm assessing the predictive accuracy of my model using the RMSE on a new data set. On its own, though, the RMSE doesn't tell me whether the model is good, since there is no threshold below which it counts as 'good'. My question is: would it make sense to calculate the RMSE for a null model that simply predicts the mean and compare this to the RMSE of my model? Or should I compare the RMSE of the model on the 'train' data to the RMSE on the 'test' data?

The model I'm currently using is the best one, according to BIC, among those built from my available predictors, but I'm trying to figure out how well it actually does. I've also calculated the adjusted R-squared, which says that 20.7% of the variance is explained by my model, but I doubt whether that is a good measure of accuracy.

Best Answer

Your suggestion of using a null model is essentially what $R^2$ captures. $R^2$ is defined as $1 - MSE/V$, where $MSE$ is the model's mean squared error and $V$ is the variance of the observed output. You can think of the variance as the mean squared error of a null model that always predicts the mean.

Even then, the question remains: how much better could you do? This is very hard to answer, because it's hard to know whether the error reflects variation in the output that is fundamentally unpredictable from the input (e.g. 'noise', though it could be something else), or whether there is additional structure that the model has simply failed to capture. Sometimes looking at the residuals gives a hint.

Under some circumstances, it's possible to estimate the 'noise' level. For example, if you have many repeated trials with identical inputs, you can measure the variability of the output for equal inputs; this gives a bound on the maximum achievable performance. You would typically encounter this situation in controlled experiments. You may be able to do something similar if you have access to a known 'correct model' (e.g. in a theoretical setting, or when modeling a well-understood physical system). Otherwise, it's hard to know whether a better model is out there.
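To make the comparison concrete, here is a minimal sketch, assuming synthetic data and a plain linear model from scikit-learn (both are placeholders for your own data and fitted model): the null model predicts the training mean, and an out-of-sample $R^2$ is computed from the two errors.

```python
# Minimal sketch (assumptions: synthetic data, an ordinary linear model;
# substitute your own fitted model and train/test split).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(scale=2.0, size=500)  # noisy outcome

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LinearRegression().fit(X_train, y_train)

# RMSE of the fitted model on new data
rmse_model = np.sqrt(np.mean((y_test - model.predict(X_test)) ** 2))

# RMSE of a 'null model' that always predicts the training mean
rmse_null = np.sqrt(np.mean((y_test - y_train.mean()) ** 2))

# Out-of-sample R^2: 1 - MSE(model) / MSE(null)
r2_out_of_sample = 1 - (rmse_model ** 2) / (rmse_null ** 2)

print(f"model RMSE: {rmse_model:.3f}")
print(f"null  RMSE: {rmse_null:.3f}")
print(f"out-of-sample R^2: {r2_out_of_sample:.3f}")
```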

Looking at the training vs. test error can give you some idea of the extent to which your model is overfitting (the expected training error is lower than the expected test error). There can be considerable variability here with a small number of samples and/or few repetitions. A gap between training and test error isn't a problem per se, but a large gap can signal one. Even so, a model that overfits might still have better generalization performance than another model that doesn't.
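As a rough illustration of the train/test comparison, the sketch below assumes synthetic data and deliberately varies model flexibility (polynomial degree) so the gap becomes visible; the particular data and degrees are arbitrary choices, not part of the original question.

```python
# Minimal sketch of a train/test RMSE comparison (assumptions: synthetic data,
# a deliberately flexible polynomial model to make overfitting visible).
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(1)
x = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(x).ravel() + rng.normal(scale=0.5, size=200)

x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1)

def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

for degree in (1, 3, 15):
    fit = make_pipeline(PolynomialFeatures(degree), LinearRegression()).fit(x_train, y_train)
    print(f"degree {degree:2d}: "
          f"train RMSE {rmse(y_train, fit.predict(x_train)):.3f}, "
          f"test RMSE {rmse(y_test, fit.predict(x_test)):.3f}")
```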

Instead of asking how good your model is, you can also ask how bad it is. A significance-testing approach can tell you whether your predictions are better than 'chance'. For example, you might compare the test error on the real data to the test error on permuted data (where the relationship between inputs and outputs has been destroyed, so any apparent performance is due to sampling variability or overfitting).
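A minimal sketch of such a permutation check, assuming the same kind of synthetic data and linear model as above (your own model and data would take their place):

```python
# Minimal sketch of a permutation check: refit on shuffled training labels to
# build a reference distribution of 'chance-level' test error.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 3))
y = X @ np.array([1.0, 0.5, 0.0]) + rng.normal(scale=2.0, size=300)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=2)

def test_rmse(y_fit):
    model = LinearRegression().fit(X_train, y_fit)
    return np.sqrt(np.mean((y_test - model.predict(X_test)) ** 2))

observed = test_rmse(y_train)

# Shuffling y_train breaks the input-output relationship; the resulting test
# RMSEs show what 'performance' looks like when there is nothing to learn.
null_rmses = np.array([test_rmse(rng.permutation(y_train)) for _ in range(1000)])

# One-sided p-value: how often does chance do at least as well as the real model?
p_value = (1 + np.sum(null_rmses <= observed)) / (1 + len(null_rmses))
print(f"observed test RMSE: {observed:.3f}, permutation p-value: {p_value:.3f}")
```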
