Is it a good idea to evaluate cross-validation using the correlation coefficient?

cross-validation

I am doing cross-validation of my model. I was looking for a metric that could compare the predictions with the independent data, and I thought that the correlation coefficient r would be very easy to interpret: just correlate the predictions with the data. But I am not a statistician, so I might have overlooked some issues. So, is it actually a good idea to do this?

P.S.: in my particular model the data are counts, or sometimes averaged counts, so numbers >= 0 with a Poisson-like distribution, so maybe some normalization would be needed… but let's keep the question more general 🙂

Best Answer

The correlation coefficient isn't really a measure of predictive performance (except in the special case of linear regression). For example, these two vectors have 100% correlation:

y <- 1:100              # observed values
yhat <- 50 + (1:100)/2  # "predictions": a shifted and rescaled version of y
cor(y, yhat)            # 1 -- perfect correlation despite the bias
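
Yet those "predictions" miss the data by a wide margin. Continuing the snippet above, a quick look at the raw errors makes this concrete:

range(y - yhat)  # -49.5  0.0: individual errors of up to almost 50 units
mean(y - yhat)   # -24.75: on average, predictions are too high by about 25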

That said, it's a measure that most people will recognise, and unless your models are going very wrong, a higher correlation generally means more accurate predictions.

Other measures you can use include:

  • Mean squared error:

$$\frac{1}{n} \sum(y - \hat{y})^2$$

  • Mean absolute error:

$$\frac{1}{n} \sum|y-\hat{y}|$$

  • Mean absolute percentage error:

$$\frac{1}{n} \sum \frac{|y - \hat{y}|}{y}$$

The last one probably makes the most sense for strictly positive data, but be aware that when the denominator is small (or zero, as can happen with count data), the error measure is exaggerated or even undefined.
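
As a rough illustration, reusing the toy vectors from the correlation example above, all three error measures flag the inaccuracy that the correlation coefficient hides:

y <- 1:100
yhat <- 50 + (1:100)/2

mean((y - yhat)^2)       # mean squared error: ~820.9
mean(abs(y - yhat))      # mean absolute error: 24.75
mean(abs(y - yhat) / y)  # mean absolute percentage error: ~2.09 (209%), inflated by small y
cor(y, yhat)             # still 1

In an actual cross-validation you would compute these on the held-out fold's observations and predictions rather than on toy vectors.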
