I'm trying to get a sense of my prediction errors for a Bayesian regression model, and I was using the root-mean-squared error (RMSE). My question is: since our predictions are stochastic, would it make sense to take multiple posterior draws for each sample point to account for variability in the parameter draws? That is, if we had 10 observed data points, should one make 10 predictions for each data point, which would give us a total of 100 error values, average the sum of squares of those error values over 100, and take the square root? Thanks!
Solved – Root-Mean-Squared Error for Bayesian Regression Models
bayesian, predictive-models, regression, rms
Related Solutions
$R^2$ is the squared correlation of the OLS prediction $\hat{Y}$ and the dependent variable (DV) $Y$. In a multiple regression with three predictors $X_{1}, X_{2}, X_{3}$:
# generate some data
> N <- 100
> X1 <- rnorm(N, 175, 7) # predictor 1
> X2 <- rnorm(N, 30, 8) # predictor 2
> X3 <- abs(rnorm(N, 60, 30)) # predictor 3
> Y <- 0.5*X1 - 0.3*X2 - 0.4*X3 + 10 + rnorm(N, 0, 10) # DV
> fitX123 <- lm(Y ~ X1 + X2 + X3) # regression
> summary(fitX123)$r.squared # R^2
[1] 0.6361916
> Yhat <- fitted(fitX123) # OLS prediction Yhat
> cor(Yhat, Y)^2
[1] 0.6361916
$R^2$ is also equal to the variance of $\hat{Y}$ divided by the variance of $Y$. In that sense, it is the "variance accounted for by the predictors".
> var(Yhat) / var(Y)
[1] 0.6361916
The squared semi-partial correlation of $Y$ with a predictor $X_{1}$ is equal to the increase in $R^2$ when adding $X_{1}$ to the regression with all remaining predictors. This may be taken as the unique contribution of $X_{1}$ to the proportion of variance explained by all predictors. Here, the semi-partial correlation is the correlation of $Y$ with the residuals from the regression in which $X_{1}$ is the predicted variable and $X_{2}$ and $X_{3}$ are the predictors.
# residuals from regression with DV X1 and predictors X2, X3
> X1.X23 <- residuals(lm(X1 ~ X2 + X3))
> (spcorYX1.X23 <- cor(Y, X1.X23)) # semi-partial correlation of Y with X1
[1] 0.3172553
> spcorYX1.X23^2 # squared semi-partial correlation
[1] 0.1006509
> fitX23 <- lm(Y ~ X2 + X3) # regression with DV Y and predictors X2, X3
# increase in R^2 when changing to full regression
> summary(fitX123)$r.squared - summary(fitX23)$r.squared
[1] 0.1006509
There seem to be at least two distinct questions intertwined here.
First is the question of the right model for your data. As your response is, and can only be, positive integers it seems unlikely that linear regression by itself is a suitable choice because, as you have found, it may predict impossible values: the choice of figure of merit or error metric is by comparison quite secondary. You have various alternatives open to you, including working with a logarithmic transformation. My top suggestion would be to check out Poisson regression. In R that can be done using glm()
and quite possibly in other ways. (R experts may well add much more.) See for an introduction
http://en.wikipedia.org/wiki/Poisson_regression
and for one engaging discussion see
http://blog.stata.com/tag/poisson-regression/
The Stata content of that blog does not render the posting useless or uninteresting to people who don't use Stata.
Poisson regression can only predict positive values. (Those predictions can be fractional, to be understood in exactly the same spirit as statements that the mean number of children per household is 1.2, or whatever.)
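As a quick illustration, here is a minimal sketch in R; the simulated data, coefficients, and variable names are assumptions made for the example, not part of the discussion above:
# simulate counts from a Poisson model (values chosen only for illustration)
set.seed(1)
n <- 100
x <- runif(n, 0, 2)                        # a single predictor
y <- rpois(n, lambda = exp(0.5 + 1.0 * x)) # counts
fit <- glm(y ~ x, family = poisson)        # Poisson regression via glm()
head(predict(fit, type = "response"))      # fitted means: always positive, possibly fractional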
The second question is about RMSE and NRMSE (normalised RMSE). The merit of RMSE is, to my mind, largely that it is in the same units of measurement as the response variable. Statisticians and non-statisticians should find it relatively easy to think in terms of an RMSE of 3.4 metres or 5.6 grammes or 7.8 as a count.
Naturally, nothing stops you scaling it, but it then loses that interpretation and becomes a relative measure. It is just what it is and joins a multitude of other such measures, e.g. $R^2$ and its many pseudo-relatives, (log-)likelihood and its many relatives, AIC, BIC and other information criteria, etc. The choice of figure of merit, error metric, or whatever you call it -- if I recall correctly, Bowley wrote of "misfit" in 1902; that's a nice word worthy of use -- is partly a matter of personal taste, partly a matter of audience (only technical audiences can be expected to recognise AIC, for example), and mostly a matter of what has become conventional in your field.
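For concreteness, here is a minimal sketch of both measures in R, assuming observed values y and predictions yhat; note that NRMSE has several conventions (dividing by the range, the mean, or the standard deviation of the response), and the range version is shown here:
# y: observed responses, yhat: predictions (assumed numeric vectors of equal length)
rmse  <- sqrt(mean((y - yhat)^2))   # in the same units as the response
nrmse <- rmse / (max(y) - min(y))   # one common normalisation: divide by the range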
Best Answer
You can calculate whatever error metric you would like, as long as you are using the posterior distribution to generate predictions. For example, if you have a matrix named "predictions" that consists of a sample of predicted values for each observation (columns = observations, rows = draws from the posterior predictive distribution), then the "Bayesian RMSE" calculated in R from simulated data would look something like this:
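# A sketch: simulated values stand in for a real posterior predictive sample.
# The names (y, n_draws, n_obs, rmse_draws) are illustrative assumptions.
set.seed(42)
n_draws <- 1000                               # posterior predictive draws
n_obs   <- 10                                 # observed data points
y <- rnorm(n_obs, mean = 5)                   # observed responses
predictions <- matrix(rnorm(n_draws * n_obs, mean = rep(y, each = n_draws)),
                      nrow = n_draws)         # rows = draws, columns = observations

# RMSE computed separately for each posterior draw
sq_err     <- sweep(predictions, 2, y)^2      # squared errors, draw by observation
rmse_draws <- sqrt(rowMeans(sq_err))          # one RMSE value per posterior draw
The key step is sweep(), which subtracts the observed y from each row (posterior draw) of the prediction matrix; the simulated y and predictions are only there to make the snippet self-contained.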
The result (rmse_draws above) is a vector of RMSE values whose length equals the number of draws sampled from the posterior predictions, so the variability in the parameter draws is reflected in the spread of this vector.