Regression Analysis – Understanding Strange Definition of Coefficient of Determination

linear model, pearson-r, r-squared, regression, time series

In Wei and Kusiak (2015), a metric is used to evaluate the performance of a time-series prediction model. The paper calls it

[the] correlation coefficient ($R^{2}$)

and defines it as

$R^{2} = 1-\frac{\sum\limits_{i}{(f_{i}-y_{i})^{2}}}{\sum\limits_{i}{(f_{i}-y_{i})^{2}}+\sum\limits_{i}{(f_{i}-\bar{y}_{i})^{2}}}$

where $f_{i}$ is the predicted value produced by the model, $y_{i}$ is the observed value, and $\bar{y}_{i}$ is the mean of the observed value.

Is this a mistake?

I'm familiar with $R^{2}$ being used to describe the coefficient of determination rather than the coefficient of correlation, which I'd expect to be denoted as $r$.

The equation looks a bit like the formula that I'm familiar with for the coefficient of determination, but with an additional term in the denominator. For comparison, this is the equation that I'm familiar with for the coefficient of determination:

$$R^{2}=1-\frac{\sum\limits_{i}{(f_{i}-y_{i})^{2}}}{\sum\limits_{i}{(y_{i}-\bar{y})^{2}}}$$

Finally, $\bar{y}_{i}$ is defined as the mean of the observed value, but I would expect this to be the mean of the observed values, and therefore be better denoted as $\bar{y}$.
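For concreteness, this is how I would compute the two definitions from a vector of predictions and a vector of observations (a minimal R sketch; the function names are my own, not from the paper):

paper_r2 <- function(f, y) {
  # the paper's definition: 1 - RSS / (RSS + sum((f_i - mean(y))^2))
  rss <- sum((f - y)^2)
  1 - rss / (rss + sum((f - mean(y))^2))
}

usual_r2 <- function(f, y) {
  # the coefficient of determination I am familiar with: 1 - RSS / TSS
  1 - sum((f - y)^2) / sum((y - mean(y))^2)
}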

I appreciate that the naming of these metrics isn't universal, but can anyone point me to a reference for this equation?

Best Answer

This definition and the common one are equivalent.

First, note that the standard formula you quote has $y_i$ in the denominator, not $f_i$. I agree that the sample mean needs no $i$ index. I assume the original source uses a linear model fitted by OLS.

Write $$ \sum_i\left\{(f_{i}-y_{i})^{2}+(f_{i}-\bar{y})^{2}\right\}=\sum_i\left\{f_i^2-2f_iy_i+y_i^2+f_i^2-2f_i\bar{y}+\bar{y}^2\right\}. $$ We may write, in matrix notation, $$\sum_i f_iy_i=y'P'y,$$ where $P=X(X'X)^{-1}X'$ is the projection matrix turning $y$ into the fitted values $f$, $f=Py$. $P$ is symmetric and idempotent (see e.g. Interpretation of $\mathbf{y}^T(\mathbf{I}-\mathbf{H})\mathbf{y}$ in OLS), so that $$\sum_i f_iy_i=y'P'y=y'P'Py=f'f=\sum_if_i^2.$$
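This identity is easy to check numerically. A minimal sketch of my own (assuming an OLS fit with an intercept; the data and variable names are made up for illustration):

set.seed(123)
nn <- 50
xx <- rnorm(nn)
yy <- 1 + 2 * xx + rnorm(nn)
ff <- fitted(lm(yy ~ xx))   # fitted values f = P y
sum(ff * yy)                # sum_i f_i y_i ...
sum(ff^2)                   # ... matches sum_i f_i^2 up to rounding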

Hence, the equation simplifies to $$ \sum_i\left\{(f_{i}-y_{i})^{2}+(f_{i}-\bar{y})^{2}\right\}=\sum_i\left\{y_i^2-2f_i\bar{y}+\bar{y}^2\right\}. $$ Now, write $\sum_if_i=\iota'Py$ for a vector of ones $\iota$. If the regressors contain a constant (see also Proof that the mean of predicted values in OLS regression is equal to the mean of original values?), $\iota'P=\iota'$ (the fitted values of a regression of $\iota$ on itself and something else will be $\iota$ itself), so that $\sum_if_i=\iota'y=\sum_iy_i$ and hence $\sum_if_i\bar{y}=\sum_iy_i\bar{y}$. Thus, $$ \sum_i\left\{y_i^2-2y_i\bar{y}+\bar{y}^2\right\}=\sum_i(y_i-\bar{y})^2, $$ which is exactly the denominator of the standard formula, so the two definitions of $R^2$ coincide.
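The claim that the fitted values have the same sum as the observations when an intercept is included can be checked the same way (continuing the hypothetical sketch above):

sum(ff)                                  # equals sum(yy) because the model has an intercept
sum(yy)
sum((ff - yy)^2 + (ff - mean(yy))^2)     # the paper's denominator ...
sum((yy - mean(yy))^2)                   # ... equals the usual total sum of squares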

Quick numerical verification:

n <- 10
y <- rnorm(n)        # simulated observations
x <- rnorm(n)        # simulated regressor
limo <- lm(y ~ x)    # OLS fit with an intercept

> summary(limo)$r.squared
[1] 0.03260629

> 1-sum(resid(limo)^2)/(sum(resid(limo)^2)+sum((fitted(limo)-mean(y))^2))
[1] 0.03260629
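As a side note, the equivalence hinges on the constant being included. A small sketch of my own (not part of the original verification) of what happens when the intercept is dropped:

limo0 <- lm(y ~ x - 1)   # same data, but no intercept
f0 <- fitted(limo0)
1 - sum((y - f0)^2) / sum((y - mean(y))^2)                         # usual definition
1 - sum((y - f0)^2) / (sum((y - f0)^2) + sum((f0 - mean(y))^2))    # paper's definition
# without a constant, sum(f0) need not equal sum(y), so the two values generally differ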