R-Squared Distribution in Linear Regression – Null Hypothesis Insights

intuition, mathematical-statistics, r-squared, regression

What is the distribution of the coefficient of determination, or $R^2$, in univariate multiple linear regression under the null hypothesis $H_0:\beta=0$?

How does it depend on the number of predictors $k$ and the number of samples $n>k$? Is there a closed-form expression for the mode of this distribution?

In particular, I have a feeling that for simple regression (with one predictor $x$) this distribution has mode at zero, but for multiple regression the mode is at a non-zero positive value. If this is indeed true, is there an intuitive explanation of this "phase transition"?


Update

As @Alecos showed below, the distribution indeed peaks at zero when $k=2$ and $k=3$ and not at zero when $k>3$. I feel that there should be a geometrical view on this phase transition. Consider the geometrical view of OLS: $\mathbf y$ is a vector in $\mathbb R^n$, and $\mathbf X$ defines a $k$-dimensional subspace there. OLS amounts to projecting $\mathbf y$ onto this subspace, and $R^2$ is the squared cosine of the angle between $\mathbf y$ and its projection $\hat{\mathbf y}$.

Now, from @Alecos's answer it follows that if all vectors are random, then the distribution of $R^2=\cos^2\theta$ will peak at zero (an angle of $90^\circ$) for $k=2$ and $k=3$, but will have its mode at a positive value (an angle $<90^\circ$) for $k>3$. Why?! See the numerical sketch below.
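Here is a minimal Monte Carlo sketch of this geometric picture in Python (numpy only; the choices $n=20$, the seed, and the replication count are my own arbitrary ones). It regresses pure-noise $\mathbf y$ on a constant plus $k-1$ pure-noise regressors and reports the modal bin of the resulting $R^2=\cos^2\theta$ histogram: the modal bin sits at zero for $k=2,3$ and moves into the interior for $k>3$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 20, 50_000

def r2_null(k):
    """R^2 of pure-noise y regressed on a constant plus k-1 noise regressors."""
    out = np.empty(reps)
    for i in range(reps):
        X = np.column_stack([np.ones(n), rng.standard_normal((n, k - 1))])
        y = rng.standard_normal(n)              # H0: no regressor matters
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ coef
        out[i] = 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))
    return out

for k in (2, 3, 4, 6):
    hist, edges = np.histogram(r2_null(k), bins=50, range=(0, 1))
    print(k, edges[np.argmax(hist)])            # left edge of the modal R^2 bin
```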


Update 2: I am accepting @Alecos's answer, but still have a feeling that I am missing some important insight here. If anybody ever suggests any other (geometrical or not) view on this phenomenon that would make it "obvious", I will be happy to offer a bounty.

Best Answer

For the specific hypothesis (that all regressor coefficients are zero, not including the constant term, which is not examined in this test) and under normality, we know (see e.g. Maddala 2001, p. 155, but note that there $k$ counts the regressors without the constant term, so the expression looks a bit different) that the statistic

$$F = \frac {n-k}{k-1}\frac {R^2}{1-R^2}$$ is distributed as a central $F(k-1, n-k)$ random variable.

Note that although we do not test the constant term, $k$ counts it as well.

Moving things around,

$$(k-1)F - (k-1)FR^2 = (n-k)R^2 \Rightarrow (k-1)F = R^2\big[(n-k) + (k-1)F\big]$$

$$\Rightarrow R^2 = \frac {(k-1)F}{(n-k) + (k-1)F}$$

But the right-hand side follows a Beta distribution: if $F\sim F(d_1,d_2)$, then $\frac{d_1F}{d_1F+d_2} \sim Beta\left(\frac{d_1}{2}, \frac{d_2}{2}\right)$. So here, with $d_1=k-1$ and $d_2=n-k$,

$$R^2 \sim Beta\left (\frac {k-1}{2}, \frac {n-k}{2}\right)$$
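As a quick sanity check (not part of the original derivation), one can sample from the central $F(k-1,n-k)$ distribution, push the draws through the rearrangement above, and compare against the claimed Beta law. A sketch in Python with scipy, using the $k=5$, $n=99$ configuration from the example further below:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
k, n = 5, 99                                    # k includes the constant term

# draw from the central F(k-1, n-k) and apply R^2 = (k-1)F / [(n-k) + (k-1)F]
F = stats.f(k - 1, n - k).rvs(size=100_000, random_state=rng)
r2 = (k - 1) * F / ((n - k) + (k - 1) * F)

# Kolmogorov-Smirnov test against Beta((k-1)/2, (n-k)/2): a large p-value
# is consistent with the claimed distribution
print(stats.kstest(r2, stats.beta((k - 1) / 2, (n - k) / 2).cdf))
```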

Using the formula for the mode of a $Beta(\alpha,\beta)$ distribution, $(\alpha-1)/(\alpha+\beta-2)$, the mode of this distribution is

$$\text{mode}\,R^2 = \frac {\frac {k-1}{2}-1}{\frac {k-1}{2}+ \frac {n-k}{2}-2} =\frac {k-3}{n-5} $$
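As a numerical cross-check (my own sketch; the grid resolution and the $k=5$, $n=99$ values are arbitrary), a grid argmax of the Beta density should land on $(k-3)/(n-5)$:

```python
import numpy as np
from scipy import stats

k, n = 5, 99
a, b = (k - 1) / 2, (n - k) / 2
x = np.linspace(0, 1, 200_001)[1:-1]            # open interval (0, 1)
print(x[np.argmax(stats.beta.pdf(x, a, b))])    # grid argmax of the density
print((k - 3) / (n - 5))                        # closed form: 2/94 ~ 0.0213
```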

FINITE & UNIQUE MODE
From the above relation we can infer that for the distribution to have a unique and finite mode we must have

$$k\geq 3,\;\; n \geq k+2$$

(excluding the boundary case $k=3,\, n=5$, discussed below).

This is consistent with the general requirement for a Beta distribution to have a unique and finite mode, which is

$$\{\alpha >1 , \beta \geq 1\},\;\; \text {OR}\;\; \{\alpha \geq1 , \beta > 1\}$$

as one can infer from this CV thread or read here.
Note that if $\{\alpha =1 , \beta = 1\}$ we obtain the Uniform distribution, so every point of the support is a mode (finite but not unique). This raises the question: why, if $k=3$ and $n=5$, is $R^2$ distributed as a $U(0,1)$?
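One can at least verify the uniform claim by brute force. A small simulation sketch (Python/numpy/scipy; seed and replication count are my arbitrary choices) regresses pure noise on a constant plus two noise regressors with only five observations:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, k, reps = 5, 3, 50_000
r2 = np.empty(reps)
for i in range(reps):
    X = np.column_stack([np.ones(n), rng.standard_normal((n, k - 1))])
    y = rng.standard_normal(n)                  # H0: no regressor matters
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    r2[i] = 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))

# KS test against U(0,1): a large p-value is consistent with uniformity
print(stats.kstest(r2, "uniform"))
```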

IMPLICATIONS
Assume that you have $k=5$ regressors (including the constant) and $n=99$ observations. A pretty nice regression, with no overfitting. Then

$$R^2\Big|_{\beta=0} \sim Beta\left (2, 47\right), \qquad \text{mode}\,R^2 = \frac 1{47} \approx 0.021$$

and the density plot:

[density plot of $Beta(2, 47)$: a sharp peak just above zero]
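The plot is straightforward to reproduce; a sketch using matplotlib (the axis range is my choice):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

x = np.linspace(0, 0.3, 500)
plt.plot(x, stats.beta.pdf(x, 2, 47))
plt.axvline(1 / 47, linestyle="--")             # mode = 1/47
plt.xlabel("$R^2$")
plt.ylabel("density")
plt.title("Density of $R^2$ under $H_0$: Beta(2, 47)")
plt.show()
```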

Intuition, please: this is the distribution of $R^2$ under the hypothesis that no regressor actually belongs in the regression. So a) the distribution is independent of the regressors; b) as the sample size increases, the distribution concentrates towards zero, since the increased information swamps the small-sample variability that may produce some "fit"; but also c) as the number of irrelevant regressors increases for a given sample size, the distribution concentrates towards $1$, and we have the "spurious fit" phenomenon.

But also, note how "easy" it is to reject the null hypothesis: in this particular example, at $R^2=0.13$ the cumulative probability has already reached $0.99$, so an obtained $R^2>0.13$ rejects the null of an "insignificant regression" at the $1\%$ significance level.
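These tail numbers can be reproduced with scipy (a sketch, using the same $Beta(2,47)$ law as above):

```python
from scipy import stats

dist = stats.beta(2, 47)      # R^2 under H0 with k = 5, n = 99
print(dist.cdf(0.13))         # ~0.99: P(R^2 <= 0.13) under the null
print(dist.ppf(0.99))         # just over 0.13: the 1%-level critical value
```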

ADDENDUM
To respond to the new issue regarding the mode of the $R^2$ distribution, I can offer the following line of thought (not geometrical), which links it to the "spurious fit" phenomenon: when we run least-squares on a data set, we essentially solve a system of $n$ linear equations with $k$ unknowns (the only difference from high-school math is that back then we called "known coefficients" what in linear regression we call "variables/regressors", "unknown $x$" what we now call "unknown coefficients", and "constant terms" what we now call the "dependent variable"). As long as $k<n$ the system is overdetermined and there is no exact solution, only an approximate one, and the difference emerges as the "unexplained variance of the dependent variable", which is captured by $1-R^2$. If $k=n$ the system has one exact solution (assuming linear independence). In between, as we increase $k$, we reduce the degree of overdetermination of the system and "move towards" the single exact solution. Under this view, it makes sense why $R^2$ increases spuriously with the addition of irrelevant regressors, and consequently, why its mode moves gradually towards $1$ as $k$ increases for given $n$.
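To watch this happen numerically, here is one last simulation sketch (Python/numpy; the seed and sizes are my arbitrary choices) that averages the null $R^2$ as $k$ grows for fixed $n=30$; the sample means track the theoretical null mean $(k-1)/(n-1)$ of the Beta distribution above and climb towards $1$:

```python
import numpy as np

rng = np.random.default_rng(3)
n, reps = 30, 2_000

for k in (3, 5, 10, 20, 29):
    r2 = np.empty(reps)
    for i in range(reps):
        X = np.column_stack([np.ones(n), rng.standard_normal((n, k - 1))])
        y = rng.standard_normal(n)              # pure noise: H0 is true
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ coef
        r2[i] = 1 - resid @ resid / ((y - y.mean()) @ (y - y.mean()))
    # sample mean of R^2 vs the theoretical null mean (k-1)/(n-1)
    print(k, round(r2.mean(), 3), round((k - 1) / (n - 1), 3))
```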
To respond to the new issue regarding the mode of the $R^2$ distribution, I can offer the following line of thought (not geometrical), which links it to the "spurious fit" phenomenon: when we run least-squares on a data set, we essentially solve a system of $n$ linear equations with $k$ unknowns (the only difference from high-school math is that back then we called "known coefficients" what in linear regression we call "variables/regressors", "unknown x" what we now call "unknown coefficients", and "constant terms" what we know call "dependent variable"). As long as $k<n$ the system is over-identified and there is no exact solution, only approximate -and the difference emerges as "unexplained variance of the dependent variable", which is captured by $1-R^2$. If $k=n$ the system has one exact solution (assuming linear independence). In between, as we increase the number of $k$, we reduce the "degree of overidentification" of the system and we "move towards" the single exact solution. Under this view, it makes sense why $R^2$ increases spuriously with the addition of irrelevant regressions, and consequently, why its mode moves gradually towards $1$, as $k$ increases for given $n$.