Solved – Using a combined RMSE

I have 12 soil water sensors and a few years of actual soil water samples retrieved from near each sensor. We have found that regressing each sensor's data against the soil samples individually performs better than placing the data from all sensors in one regression. We calculated an RMSE for all of the sensor data using the single regression, and we also calculated a single combined RMSE from the residuals of the individual sensor regressions. Is this an inappropriate use of RMSE?

Best Answer

I think the basic idea of calculating a grand RMSE for all the sensors, even though you've fit a separate regression model for each sensor, is fine. But be aware of three things:

  1. If you compute the grand MSE by simply taking the mean of the 12 sensor MSEs, you won't be accounting for the fact that different sensors may have different amounts of data. If you want to account for that (you probably do), you should weight the MSEs by sample size, or equivalently, put all the squared errors into one vector and take the mean (and then the square root) of that.

  2. You say that performing a separate regression for each sensor "performs better" than an overall regression. If your measure of performance is just RMSE, with each of your two models (one regression vs. separate regressions) trained and tested on the same data, then it is a given that the more flexible approach of doing separate regressions will produce an RMSE no greater than that of the overall regression. Your models are nested: the overall regression is the special case in which every sensor shares the same intercept and slope, so the more flexible model is guaranteed to fit the data at least as well as the less flexible one. This does not imply that, for example, the more flexible model is more correct than the less flexible one, nor that its coefficients are more informative, nor that it will be more accurate in predicting future observations. In short, your more flexible model may be overfitting.

  3. Instead of fitting completely separate regression models for each sensor, you may be better served by using a mixed model, with the effect of each sensor being a random effect.
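
To make points 1 and 2 concrete, here is a minimal sketch in plain Python. The data are simulated (the sensor count matches the question, but the sample sizes, coefficients, and noise level are made-up assumptions, not the asker's measurements), and `fit_line` and `squared_errors` are hypothetical helper names introduced only for this illustration. It compares the unweighted mean of per-sensor MSEs with the pooled version, and checks that the separate regressions never have a larger in-sample RMSE than the single regression.

```python
import math
import random


def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def squared_errors(x, y, a, b):
    """Squared residuals of the fitted line a + b*x."""
    return [(yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)]


# Simulated data (an assumption for illustration): 12 sensors with
# unequal sample sizes and slightly different true calibration lines.
random.seed(0)
sensors = {}
for s in range(12):
    n = random.randint(5, 40)
    a_true = random.gauss(0.0, 0.05)
    b_true = random.gauss(1.0, 0.2)
    x = [random.uniform(0.1, 0.4) for _ in range(n)]
    y = [a_true + b_true * xi + random.gauss(0.0, 0.02) for xi in x]
    sensors[s] = (x, y)

# Separate regression per sensor; keep each sensor's squared errors.
per_sensor_sq = {}
for s, (x, y) in sensors.items():
    a, b = fit_line(x, y)
    per_sensor_sq[s] = squared_errors(x, y, a, b)

# Point 1: pooled RMSE (all squared errors in one vector) vs. the
# unweighted and sample-size-weighted means of the per-sensor MSEs.
all_sq = [e for sq in per_sensor_sq.values() for e in sq]
rmse_pooled = math.sqrt(sum(all_sq) / len(all_sq))

mses = [sum(sq) / len(sq) for sq in per_sensor_sq.values()]
sizes = [len(sq) for sq in per_sensor_sq.values()]
rmse_unweighted = math.sqrt(sum(mses) / len(mses))
rmse_weighted = math.sqrt(
    sum(m * n for m, n in zip(mses, sizes)) / sum(sizes)
)

# Point 2: a single regression on all the data, evaluated in-sample.
all_x = [xi for x, _ in sensors.values() for xi in x]
all_y = [yi for _, y in sensors.values() for yi in y]
a, b = fit_line(all_x, all_y)
single_sq = squared_errors(all_x, all_y, a, b)
rmse_single = math.sqrt(sum(single_sq) / len(single_sq))

print(f"pooled RMSE, separate fits:     {rmse_pooled:.5f}")
print(f"weighted mean of MSEs (sqrt):   {rmse_weighted:.5f}")
print(f"unweighted mean of MSEs (sqrt): {rmse_unweighted:.5f}")
print(f"RMSE, single regression:        {rmse_single:.5f}")
```

The weighted and pooled figures agree up to floating-point error, while the unweighted one generally differs when sample sizes are unequal. The separate-fits RMSE being no larger than the single-regression RMSE is guaranteed in-sample, so on its own it says nothing about which model predicts new samples better.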
