I'm taking an online intro class on statistics, and right now we are covering relationships between quantitative variables. One of the subtopics is the coefficient of determination. Here is an excerpt from the book:
The coefficient of determination is very simple to calculate if you
know the correlation coefficient, since it is just $r^2$. The
coefficient of determination can be interpreted as the percentage of
variation of the $Y$ variable that can be attributed to the
relationship. In other words, a value of $r^2 = 0.63$ can be
interpreted as “63% of the variation in $Y$ can be attributed to the
variation in $X$.”
Here are my questions:
- Why can the coefficient of determination be interpreted as the percentage of the variation of the $Y$ variable that can be attributed to the relationship? In other words, how did the author come up with this interpretation?
- What does the last sentence in the excerpt mean? What does it mean that 63% of the variation in $Y$ can be attributed to the variation in $X$?
Best Answer
Suppose you run a regression of $Y$ on a regressor matrix $X$ with error term $\varepsilon$, i.e. \begin{align} Y = X\beta + \varepsilon \end{align} where $Y$ and $\varepsilon$ are $n\times1$ vectors, $\beta$ is a $p\times1$ vector, and $X$ is an $n\times p$ matrix, which I assume contains a column of ones (an intercept). Using Ordinary Least Squares (OLS), you estimate $\beta$ by $\hat{\beta}$ and obtain the fitted values $\hat{y} = X\hat{\beta}$. Write $y_i$ for the entries of $Y$ and denote $\bar{y} = n^{-1}\sum_{i=1}^n y_i$, i.e. $\bar{y}$ is the average of the observed responses.
Define the Total Sum of Squares (TSS) as $TSS:=\sum_{i=1}^n (y_i - \bar{y})^2$. This is the total squared variation of $y$ around its mean, before any of it is explained using $X$. One can further define the Residual Sum of Squares (RSS) as $RSS:=\sum_{i=1}^n (y_i - \hat{y}_i)^2$ and the Explained Sum of Squares (ESS) as $ESS:= \sum_{i=1}^n (\hat{y}_i - \bar{y})^2$. RSS gets its name because it is the variation of $y$ that remains (unexplained) after using the fitted value $\hat{y}_i$ instead of the mean as predictor. ESS is the variation of the fitted values around the mean, i.e. the part of the variation that the model does explain.
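As a side note, here is a short sketch of how these three sums relate to each other. It assumes OLS with an intercept (a column of ones in $X$), which the setup above does not state explicitly. Splitting each deviation from the mean into a residual part and a fitted part gives \begin{align} TSS &= \sum_{i=1}^n \big( (y_i - \hat{y}_i) + (\hat{y}_i - \bar{y}) \big)^2 \\ &= \sum_{i=1}^n (y_i - \hat{y}_i)^2 + \sum_{i=1}^n (\hat{y}_i - \bar{y})^2 + 2\sum_{i=1}^n (y_i - \hat{y}_i)(\hat{y}_i - \bar{y}) \\ &= RSS + ESS + 2\sum_{i=1}^n (y_i - \hat{y}_i)(\hat{y}_i - \bar{y}). \end{align} The cross term vanishes: the OLS normal equations $X^\top (Y - X\hat{\beta}) = 0$ say that the residuals are orthogonal to every column of $X$, so $\sum_{i=1}^n (y_i - \hat{y}_i)\hat{y}_i = 0$, and, because of the column of ones, $\sum_{i=1}^n (y_i - \hat{y}_i) = 0$. Hence $TSS = RSS + ESS$. Without an intercept the cross term need not be zero, and the decomposition can fail.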
$R^2$ is defined as $R^2:=1-RSS/TSS$. In the special case of a simple linear regression with an intercept, this definition coincides with the square of the (estimated) correlation coefficient $r$ between $X$ and $Y$; with several regressors it equals the squared correlation between $y$ and $\hat{y}$ instead. Finally, to answer your question, it is easiest to consult the above formula for $R^2$: because $TSS = RSS + ESS$ holds for OLS with an intercept, one can rewrite $R^2$ as $R^2 = ESS/TSS = \frac{ESS}{n}/\frac{TSS}{n}$. Crucially, note that $\frac{ESS}{n}$ is the *explained variance* and $\frac{TSS}{n}$ the total variance. This way, $R^2$ can be thought of as the fraction of the variance/variation of $Y$ that is explained by $X$. A value of $r^2 = 0.63$ thus means that the fitted line accounts for 63% of the total squared variation of $Y$ around its mean, while the remaining 37% is left in the residuals.
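To see the quantities side by side, here is a minimal numerical sketch in plain Python. The data are made up for illustration (they come from neither the book nor the question), and the fit is a simple regression with an intercept, using the closed-form slope and intercept. It computes TSS, RSS, and ESS and compares $1 - RSS/TSS$ with $ESS/TSS$ and with the squared sample correlation.

```python
import math

# Made-up illustrative data (an assumption, not taken from the book).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.3, 2.9, 4.8, 5.1, 7.4, 7.0]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Simple OLS with an intercept: closed-form slope and intercept.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
slope = s_xy / s_xx
intercept = y_bar - slope * x_bar
y_hat = [intercept + slope * xi for xi in x]

# The three sums of squares defined above.
tss = sum((yi - y_bar) ** 2 for yi in y)               # total variation
rss = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # left in residuals
ess = sum((fi - y_bar) ** 2 for fi in y_hat)           # explained by the fit

r_squared = 1.0 - rss / tss
r = s_xy / math.sqrt(s_xx * tss)  # sample correlation between x and y

print("TSS       :", tss)
print("RSS + ESS :", rss + ess)
print("1-RSS/TSS :", r_squared)
print("ESS/TSS   :", ess / tss)
print("r squared :", r ** 2)
```

Up to floating-point rounding, the first two printed values should agree, and so should the last three. Multiplying `r_squared` by 100 gives the "percentage of variation in $Y$ attributed to $X$" from the book excerpt.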