Solved – Coefficient of determination

r-squared

I'm taking an online intro class on statistics, and right now we are covering relationships between quantitative variables. One of the subtopics is the coefficient of determination. Here is an excerpt from the book:

The coefficient of determination is very simple to calculate if you
know the correlation coefficient, since it is just $r^2$. The
coefficient of determination can be interpreted as the percentage of
variation of the $Y$ variable that can be attributed to the
relationship. In other words, a value of $r^2 = 0.63$ can be
interpreted as “63% of the variation in $Y$ can be attributed to the
variation in $X$.”

Here are my questions:

  1. Why can the coefficient of determination be interpreted as the percentage of variation in the $Y$ variable that can be attributed to the relationship? In other words, how did the author come up with this interpretation?
  2. What does the last sentence in the excerpt mean? What does it mean that 63% of the variation in $Y$ can be attributed to the variation in $X$?

Best Answer

Suppose you run a regression of $Y$ on regressor matrix $X$ with error term $\varepsilon$, i.e. \begin{align} Y = X\beta + \varepsilon \end{align} where $Y$ and $\varepsilon$ are $n\times1$ vectors, $\beta$ is a $p\times1$ vector, and $X$ is an $n\times p$ matrix. Using Ordinary Least Squares (OLS), you estimate $\beta$ as $\hat{\beta}$ and obtain the fitted values $\hat{y} = X\hat{\beta}$. Denote $\bar{y} = n^{-1}\sum_{i=1}^ny_i$, i.e. $\bar{y}$ is the average value of the entries in $y$.
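To make the setup concrete, here is a minimal sketch of that OLS fit in NumPy. The data are synthetic (a made-up linear relationship plus noise), and the design matrix includes an intercept column:

```python
import numpy as np

# Synthetic data: a hypothetical linear relationship plus noise.
rng = np.random.default_rng(0)
n = 100
x = rng.normal(size=n)
y = 2.0 + 3.0 * x + rng.normal(size=n)

# n x p regressor matrix X: a column of ones (intercept) plus x.
X = np.column_stack([np.ones(n), x])

# OLS estimate beta_hat and fitted values y_hat = X @ beta_hat.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat
y_bar = y.mean()  # average of the entries in y
```

By the OLS normal equations, the residuals $y - \hat{y}$ are orthogonal to every column of $X$, which is what drives the variance decomposition discussed next.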

Define the Total Sum of Squares (TSS) as $TSS:=\sum_{i=1}^n (y_i - \bar{y})^2$. This is the total squared variation of $y$ before using $X$ to explain any of it. One can further define the Residual Sum of Squares (RSS) as $RSS:=\sum_{i=1}^n (y_i - \hat{y}_i)^2$ and the Explained Sum of Squares (ESS) as $ESS:= \sum_{i=1}^n (\hat{y}_i - \bar{y})^2$. RSS is named this way because it is the variation of $y$ that remains unexplained after using the fitted values $\hat{y}_i$ (instead of the mean) as predictors. ESS, in turn, is the part of the variation that the fitted model does explain.
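These three quantities are easy to compute directly. A sketch, again on assumed synthetic data, checks the decomposition $TSS = RSS + ESS$ (which holds for OLS with an intercept):

```python
import numpy as np

# Synthetic data and OLS fit with an intercept (hypothetical example).
rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)
y = 1.0 + 0.5 * x + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat
y_bar = y.mean()

tss = np.sum((y - y_bar) ** 2)       # total variation around the mean
rss = np.sum((y - y_hat) ** 2)       # residual (unexplained) variation
ess = np.sum((y_hat - y_bar) ** 2)   # variation explained by the model

# With an intercept in the model, tss == rss + ess up to rounding error.
```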

$R^2$ is defined as $R^2:=1-RSS/TSS$. In the special case of a simple linear regression with an intercept (and only then!) does this definition coincide with the square of the (estimated) correlation coefficient $r$. Finally, to answer your question, it is easy to just consult the above formula for $R^2$: because $TSS = RSS + ESS$ (which holds whenever the model includes an intercept), one can rewrite $R^2$ as $R^2 = ESS/TSS = \frac{ESS/n}{TSS/n}$. Crucially, note that $\frac{ESS}{n}$ is the *explained variance* and $\frac{TSS}{n}$ the total variance. This way, $R^2$ can be read as the share of the total variation in $Y$ that is explained by its relationship with $X$; a value of $0.63$ means that 63% of the variation in $Y$ around its mean is captured by the fitted relationship with $X$, and the remaining 37% is left in the residuals.
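The coincidence between $R^2$ and $r^2$ in the simple linear case can be checked numerically. A sketch on assumed synthetic data, using `np.corrcoef` for the sample correlation:

```python
import numpy as np

# Synthetic simple linear regression (one regressor plus an intercept).
rng = np.random.default_rng(2)
n = 500
x = rng.normal(size=n)
y = -1.0 + 2.0 * x + rng.normal(size=n)

X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat

rss = np.sum((y - y_hat) ** 2)
tss = np.sum((y - y.mean()) ** 2)
r_squared = 1.0 - rss / tss          # R^2 from the definition

r = np.corrcoef(x, y)[0, 1]          # sample correlation coefficient
# In this simple linear case, r_squared agrees with r**2.
```

With more than one regressor, $R^2$ is still well defined via $1 - RSS/TSS$, but it no longer equals the square of any single pairwise correlation.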