What effect does sample size have on adjusted R squared values?
Related Solutions
1. What formula does lm in R use for adjusted r-square?
As already mentioned, typing summary.lm will give you the code that R uses to calculate adjusted R-square. Extracting the most relevant line, you get:
ans$adj.r.squared <- 1 - (1 - ans$r.squared) * ((n - df.int)/rdf)
which corresponds in mathematical notation to:
$$R^2_{adj} = 1 - (1 - R^2) \frac{n-1}{n-p-1}$$
assuming that there is an intercept (i.e., df.int = 1), where $n$ is your sample size and $p$ is your number of predictors. Thus, your error degrees of freedom (i.e., rdf) equal $n - p - 1$.
The formula corresponds to what Yin and Fan (2001) label Wherry Formula-1 (there is apparently another, less common Wherry formula that uses $n-p$ in the denominator instead of $n-p-1$). They suggest its most common names, in order of occurrence, are "Wherry formula", "Ezekiel formula", "Wherry/McNemar formula", and "Cohen/Cohen formula".
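For concreteness, the R line above can be sketched in Python (a direct translation of Wherry Formula-1; the numbers in the example are made up purely for illustration):

```python
def adjusted_r2(r2, n, p):
    """Wherry Formula-1, the adjustment summary.lm uses,
    assuming a model with an intercept (df.int = 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Made-up example: R^2 = 0.50 with n = 100 observations and p = 3 predictors
print(adjusted_r2(0.50, 100, 3))  # 0.484375
```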
2. Why are there so many adjusted r-square formulas?
$R^2_{adj}$ aims to estimate $\rho^2$, the proportion of variance explained in the population by the population regression equation. While this is clearly related to sample size and the number of predictors, which estimator is best is less clear. Thus, you have simulation studies such as Yin and Fan (2001) that have evaluated different adjusted r-square formulas in terms of how well they estimate $\rho^2$ (see this question for further discussion).
You will see with all the formulas that the difference between $R^2$ and $R^2_{adj}$ gets smaller as the sample size increases, approaching zero as sample size tends to infinity. The difference also gets smaller with fewer predictors.
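A quick numeric sketch makes the shrinking gap visible (the $R^2$ value and predictor count here are made up; the formula is Wherry Formula-1 from above):

```python
def adjusted_r2(r2, n, p):
    # Wherry Formula-1 adjustment (model with an intercept assumed)
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

r2, p = 0.50, 3  # made-up sample R^2 and number of predictors
for n in [20, 50, 200, 1000]:
    gap = r2 - adjusted_r2(r2, n, p)
    print(f"n = {n:4d}   R^2 - adj R^2 = {gap:.4f}")
# n =   20   R^2 - adj R^2 = 0.0938
# n =   50   R^2 - adj R^2 = 0.0326
# n =  200   R^2 - adj R^2 = 0.0077
# n = 1000   R^2 - adj R^2 = 0.0015
```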
3. How to interpret $R^2_{adj}$?
$R^2_{adj}$ is an estimate of $\rho^2$, the proportion of variance explained by the true regression equation in the population. You would typically be interested in $\rho^2$ when you are interested in the theoretical linear prediction of a variable. In contrast, if you are more interested in prediction using the sample regression equation, as is often the case in applied settings, then some form of cross-validated $R^2$ would be more relevant.
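To make that contrast concrete, here is a small Python sketch comparing the in-sample $R^2$ with a leave-one-out cross-validated $R^2$ (one common form of cross-validated $R^2$, not the only one) for a simple one-predictor regression; the data are made up and roughly linear:

```python
def fit(xs, ys):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

def r2_in_sample(xs, ys):
    a, b = fit(xs, ys)
    my = sum(ys) / len(ys)
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    sst = sum((y - my) ** 2 for y in ys)
    return 1 - sse / sst

def r2_loo(xs, ys):
    """Predictive R^2 = 1 - PRESS/SST: each point is predicted
    from a model fit on the other n - 1 points."""
    my = sum(ys) / len(ys)
    press = 0.0
    for i in range(len(xs)):
        a, b = fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        press += (ys[i] - (a + b * xs[i])) ** 2
    sst = sum((y - my) ** 2 for y in ys)
    return 1 - press / sst

# Made-up, roughly linear data
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9, 8.2, 8.8]
print(r2_in_sample(xs, ys), r2_loo(xs, ys))
```

Because leave-one-out residuals are always at least as large as ordinary residuals, the cross-validated $R^2$ never exceeds the in-sample $R^2$; the gap is a direct measure of how optimistic the in-sample fit is.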
References
- Yin, P., & Fan, X. (2001). Estimating $R^2$ shrinkage in multiple regression: A comparison of different analytical methods. The Journal of Experimental Education, 69(2), 203-224.
It depends on whether you are interested in $r^2$, the squared sample correlation coefficient, or in $R^2$, the squared multiple correlation coefficient used to assess the performance of regressions.
Both $r^2$ and adjusted $r^2$ are negatively biased--that is, the sample values are slightly smaller than the corresponding population value--but the adjusted formula is somewhat less biased. In addition to the sample size, the amount of bias depends on the value, with $r^2$ near zero and one showing the least bias and those near 0.6-0.8 showing the most bias.
Table 1 of a paper by Zimmerman, Zumbo, and Williams (2003) illustrates the bias as a function of sample size and correlation value. Elsewhere in the paper, they show simulation data indicating that the Fisher and the Olkin-Pratt adjusted $r^2$ formulas reduce this bias considerably.
There is also a decent amount of work looking at "$R^2$ shrinkage", a related phenomenon that comes up a lot in regression contexts but has the opposite sign ($R^2$ is positively biased, and adjustments bring it back down). Yin and Fan (2001) give a fairly comprehensive comparison of methods for estimating it, and page 205 has some citations to descriptions of the problem.
Finally, you should be aware that there are lots of methods for adjusting $r^2$/$R^2$ (in fact, there are even multiple ($\ge 3$) versions of the Olkin and Pratt adjustment formula floating around, some of which correct for the number of parameters), so it might help to be more specific about what you have in mind.
Best Answer
Adjusted r-squared is intended to be an unbiased estimate of population variance explained using the population regression equation. There are several different formulas for adjusted r-squared and there are various definitions of population variance explained (e.g., fixed versus random-x assumptions). Most commonly, statistical software will report the Ezekiel formula which makes the fixed-x assumption.
In general, as sample size increases, the difference between r-squared and adjusted r-squared shrinks, and the positive bias of r-squared as an estimate of population variance explained decreases.
So the main take-home message is that if you are interested in population variance explained, then adjusted r-squared is always a better option than r-squared. That said, as your sample size gets very large, r-squared won't be that biased (note that for models with large numbers of predictors, sample size needs to be even bigger for r-squared to approach being unbiased).