Regression – How to Compare RMSE for the Same Model with Varying Sample Sizes

econometrics, regression, residuals, rms

My empirical research is based on a variable $a_{i,t} \sim f(\mathrm{RMSE})$, i.e. it is based on the root mean squared error (RMSE) of a certain regression model $Y_{i,t} = f(X_{i,t}, \beta) + \epsilon_{i,t}$. The regression is estimated with up to $n = 40$ observations per entity, with a minimum of 24 observations available.
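For concreteness, I take the RMSE to be the usual in-sample residual root mean squared error (with the $1/n$ convention; a degrees-of-freedom correction would change the scaling but not the question), so the entity-specific sample size $n_i$ enters its definition directly:

$$\mathrm{RMSE}_i = \sqrt{\frac{1}{n_i} \sum_{t=1}^{n_i} \hat{\epsilon}_{i,t}^{\,2}}, \qquad 24 \le n_i \le 40.$$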

Is my variable $a_{i,t}$ comparable across entities, if the underlying number of observations varies between the range $24 \le n \le 40$? Is $a_{i,t}$ somehow dependent on the number of observations used in the regression?


My question is not related to those (e.g. [1] or [2]) where RMSE is used to compare different regression models. Here the model is the same for all regressions; only the number of observations varies.

Best Answer

It seems you are not using RMSE to validate your model's predictive performance; it is a quantity you need for other reasons, such as theory. For some of your firms you have less data to work with, so you are concerned that you might get a higher RMSE simply because you have fewer observations. The opposite can also happen: you could get a lower RMSE because you overfit, which is a real concern with only 24 observations if the model has many terms.

You can gauge how bad this problem is with a simulation. Start with the firms for which you have a full history, run your analysis, and record the RMSE. Then truncate each of those firms' histories, refit the model, and compare. If the RMSE changes noticeably when a firm is truncated relative to the full-history fit, you know the comparison across different sample sizes is problematic. There may also be selection issues with the firms that have shorter histories, so the check is not perfect.
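Below is a minimal sketch of that truncation check, with simulated data standing in for the firms with a full history and a simple linear specification standing in for $f(X_{i,t}, \beta)$; the coefficients and the 40-versus-24 split are placeholders for your actual data and model.

```python
# Sketch: compare in-sample RMSE on full histories (n=40) vs. truncated ones (n=24).
# Data and the linear model are simulated stand-ins, not the actual specification.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

def rmse_of_fit(y, X):
    """Fit OLS with an intercept and return the residual root mean squared error."""
    res = sm.OLS(y, sm.add_constant(X)).fit()
    return np.sqrt(np.mean(res.resid ** 2))

n_full, n_trunc, n_firms = 40, 24, 200
full_rmse, trunc_rmse = [], []

for _ in range(n_firms):
    # one firm with a full history of 40 observations
    X = rng.normal(size=(n_full, 2))
    y = 1.0 + X @ np.array([0.5, -0.3]) + rng.normal(scale=1.0, size=n_full)

    full_rmse.append(rmse_of_fit(y, X))
    # refit on the truncated history (last 24 observations) and record the RMSE
    trunc_rmse.append(rmse_of_fit(y[-n_trunc:], X[-n_trunc:]))

print(f"mean RMSE, full history (n=40):      {np.mean(full_rmse):.3f}")
print(f"mean RMSE, truncated history (n=24): {np.mean(trunc_rmse):.3f}")
```

If the two averages (or, firm by firm, the paired RMSEs) diverge systematically, that gap is an estimate of how much of the variation in $a_{i,t}$ is driven by sample size rather than by the entities themselves.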
