Solved – Which is better: r-squared or adjusted r-squared

Tags: machine learning, multiple regression, r-squared, regression

I just started learning about two statistical measures, r-squared and adjusted r-squared, and was wondering why we can't use adjusted r-squared for every regression model, given that it penalizes the model for useless variables, unlike r-squared. Does r-squared have any advantage over adjusted r-squared under some conditions?

Best Answer

Adjusted $R^2$ is the better measure when you compare models that have different numbers of variables.

The logic behind this is that $R^2$ never decreases when the number of variables increases. Even if you add a useless variable to your model, your $R^2$ will not go down, and in practice it will almost always go up slightly. To balance that out, you should always use adjusted $R^2$ to compare models with different numbers of independent variables.

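Adjusted $R^2$ increases only if the new variable improves the model more than would be expected by chance. For a single added variable, this happens exactly when its $t$-statistic exceeds 1 in absolute value.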
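A small simulation can make this concrete. The sketch below is in Python (standard library only) rather than R, purely for illustration; the helper names `solve`, `r_squared`, and `adjusted_r_squared`, the seed, and the simulated data are all assumptions, not part of the original answer. It fits ordinary least squares with and without a pure-noise predictor and prints both measures.

```python
import random

def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / M[i][i]
    return x

def r_squared(columns, y):
    """Fit OLS with an intercept on the given predictor columns and return R^2."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    XtX = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(k)] for a in range(k)]
    Xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(k)]
    beta = solve(XtX, Xty)  # normal equations; fine for a small, well-conditioned example
    fitted = [sum(bj * xj for bj, xj in zip(beta, row)) for row in X]
    y_bar = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def adjusted_r_squared(r2, n, p):
    """Adjusted R^2 for a model with an intercept and p predictors."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

random.seed(0)  # arbitrary seed, chosen only for reproducibility
n = 50
x = [random.gauss(0, 1) for _ in range(n)]
noise = [random.gauss(0, 1) for _ in range(n)]  # unrelated to y by construction
y = [2.0 * xi + random.gauss(0, 1) for xi in x]

r2_small = r_squared([x], y)
r2_big = r_squared([x, noise], y)
adj_small = adjusted_r_squared(r2_small, n, 1)
adj_big = adjusted_r_squared(r2_big, n, 2)

print("R^2:         ", r2_small, "->", r2_big)
print("adjusted R^2:", adj_small, "->", adj_big)
```

Whatever the seed, `r2_big` cannot be smaller than `r2_small`, because the larger model nests the smaller one. Whether the adjusted value goes up or down depends on how much the noise column happens to help in that particular sample.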

1. What formula does `lm` in R use for adjusted r-square?

Typing `summary.lm` at the R console prints the code that R uses to calculate adjusted R-square. Extracting the most relevant line, you get:

ans$adj.r.squared <- 1 - (1 - ans$r.squared) * ((n - df.int)/rdf)

which corresponds in mathematical notation to:

$$R^2_{adj} = 1 - (1 - R^2) \frac{n-1}{n-p-1}$$

assuming that there is an intercept (i.e., `df.int = 1`), where $n$ is your sample size and $p$ is your number of predictors. Thus, your error degrees of freedom (i.e., `rdf`) equal $n-p-1$.
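As a quick check on the arithmetic, here is a minimal Python sketch of the same formula. The names `df_int` and `rdf` mirror the R variables quoted above; the function itself and its `intercept` parameter are illustrative assumptions, not part of R.

```python
def adjusted_r_squared(r_squared, n, p, intercept=True):
    """Adjusted R^2 following the line quoted from summary.lm.

    n is the sample size and p the number of predictors (excluding the
    intercept), matching the notation used in the text.
    """
    df_int = 1 if intercept else 0
    rdf = n - p - df_int  # residual (error) degrees of freedom
    if rdf <= 0:
        raise ValueError("need more observations than estimated coefficients")
    return 1.0 - (1.0 - r_squared) * ((n - df_int) / rdf)

# Worked example: R^2 = 0.5 with n = 21 observations and p = 2 predictors.
example = adjusted_r_squared(0.5, n=21, p=2)
print(example)  # 1 - 0.5 * 20 / 18, roughly 0.444
```

With an intercept this reduces to $1 - (1 - R^2)(n-1)/(n-p-1)$. The `intercept=False` branch only changes the degrees of freedom; it does not reproduce how R computes $R^2$ itself for models without an intercept.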

The formula corresponds to what Yin and Fan (2001) label Wherry Formula-1 (there is apparently another, less common Wherry formula that uses $n-p$ in the denominator instead of $n-p-1$). They suggest its most common names, in order of occurrence, are "Wherry formula", "Ezekiel formula", "Wherry/McNemar formula", and "Cohen/Cohen formula".

2. Why are there so many adjusted r-square formulas?

$R^2_{adj}$ aims to estimate $\rho^2$, the proportion of variance explained in the population by the population regression equation. While this is clearly related to sample size and the number of predictors, what is the best estimator is less clear. Thus, you have simulation studies such as Yin and Fan (2001) that have evaluated different adjusted r-square formulas in terms of how well they estimate $\rho^2$ (see this question for further discussion).

With all of these formulas, the difference between $R^2$ and $R^2_{adj}$ gets smaller as the sample size increases, and it approaches zero as the sample size tends to infinity. The difference also gets smaller with fewer predictors.
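For the formula used by `lm` (Wherry Formula-1), the gap can be written out explicitly. This is a short derivation from that one formula only, not a statement about the other adjustments:

$$R^2 - R^2_{adj} = (1 - R^2)\left(\frac{n-1}{n-p-1} - 1\right) = (1 - R^2)\,\frac{p}{n-p-1}$$

For fixed $p$ and $R^2 < 1$, the right-hand side tends to zero as $n \to \infty$, and for fixed $n$ and $R^2$ it shrinks as $p$ decreases. For example, with $R^2 = 0.5$ and $p = 2$, the gap is $1/18 \approx 0.056$ at $n = 21$ and $1/98 \approx 0.010$ at $n = 101$.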

3. How to interpret $R^2_{adj}$?

$R^2_{adj}$ is an estimate of $\rho^2$, the proportion of variance explained by the true regression equation in the population. You would typically be interested in $\rho^2$ when you are interested in the theoretical linear prediction of a variable. In contrast, if you are more interested in prediction using the sample regression equation, as is often the case in applied settings, then some form of cross-validated $R^2$ would be more relevant.
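One possible form of cross-validated $R^2$ is the leave-one-out version, $1 - \mathrm{PRESS}/SS_{tot}$. The Python sketch below assumes a single predictor to keep the fit in closed form; the function names, the seed, and the simulated data are illustrative assumptions. Definitions of cross-validated $R^2$ vary (for instance in which mean is used for $SS_{tot}$), so treat this as one variant rather than the definition.

```python
import random

def fit_line(x, y):
    """Closed-form simple linear regression; returns (intercept, slope)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def in_sample_r2(x, y):
    """Ordinary R^2 of the line fitted to all of the data."""
    a, b = fit_line(x, y)
    y_bar = sum(y) / len(y)
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def loo_cv_r2(x, y):
    """Leave-one-out R^2: each point is predicted by a line fitted without it."""
    n = len(y)
    y_bar = sum(y) / n
    press = 0.0  # sum of squared out-of-sample prediction errors
    for i in range(n):
        a, b = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - (a + b * x[i])) ** 2
    ss_tot = sum((yi - y_bar) ** 2 for yi in y)
    return 1.0 - press / ss_tot

random.seed(0)  # arbitrary seed, chosen only for reproducibility
x = [random.gauss(0, 1) for _ in range(30)]
y = [1.0 + 0.5 * xi + random.gauss(0, 1) for xi in x]

r2_in = in_sample_r2(x, y)
r2_cv = loo_cv_r2(x, y)
print("in-sample R^2:", r2_in)
print("leave-one-out R^2:", r2_cv)
```

For least squares, each leave-one-out residual is at least as large in magnitude as the corresponding in-sample residual, so this cross-validated value can never exceed the in-sample $R^2$, and it can be negative when the model predicts worse than the mean.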

References

Yin, P., & Fan, X. (2001). Estimating $R^2$ shrinkage in multiple regression: A comparison of different analytical methods. The Journal of Experimental Education, 69(2), 203-224.

Solved – Why is it that a lower R-Squared on a difference regression model could be better than higher R-squared on a levels regression model

There is a blog post that tries to explain why: http://www.portfolioprobe.com/2011/01/12/the-number-1-novice-quant-mistake/

Basically using levels gives you spurious answers because there is no component of the data that is independent across observations.
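A rough way to see the effect is to regress one random walk on another, independent one. The Python sketch below is an illustration under assumed settings (the series length, the seed, and the helper names are all hypothetical). It uses the fact that, for a simple regression with an intercept, $R^2$ equals the squared Pearson correlation.

```python
import random

def squared_correlation(x, y):
    """Squared Pearson correlation, i.e. R^2 of a simple regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy * sxy / (sxx * syy)

def random_walk(n, rng):
    """Cumulative sum of n independent standard normal steps."""
    level, walk = 0.0, []
    for _ in range(n):
        level += rng.gauss(0, 1)
        walk.append(level)
    return walk

def diff(series):
    """First differences of a series."""
    return [b - a for a, b in zip(series, series[1:])]

rng = random.Random(1)  # arbitrary seed, chosen only for reproducibility
n = 500
a = random_walk(n, rng)
b = random_walk(n, rng)  # generated independently of a

r2_levels = squared_correlation(a, b)
r2_diffs = squared_correlation(diff(a), diff(b))
print("R^2 on levels:     ", r2_levels)
print("R^2 on differences:", r2_diffs)
```

Because the two walks are independent by construction, any sizeable $R^2$ on the levels is spurious. Over many seeds, the levels regression tends to show a much larger $R^2$ than the differences regression, although a single run can go either way.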
