Solved – R-squared and sample size

coefficient of variation, r-squared, regression, regression coefficients

I was wondering whether R-squared is affected by the sample size. Is adjusted R-squared also affected?

The reason I ask is that I have run a multiple linear regression on two samples. The R^2 on the smaller sample (n=50) is substantially higher than the R^2 on the larger sample (n=150), suspiciously so.

I looked into similar posts here and here, but I cannot make any sense of them.

Unfortunately, I am not allowed to post my data.

Best Answer

For adjusted $R^2$ the answer is yes.

$$R^2_{adj} = 1- \dfrac{n-1}{n-p}\dfrac{SSRes}{SSTotal}$$

(Some sources write the denominator of the $\frac{n-1}{n-p}$ factor in $R^2_{adj}$ as $n-p-1$. That assumes that $p$ does not count the intercept term. In the book I used, the author assumes that $p$ does count the intercept term.)
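To see that the two conventions agree, write $k$ for the number of predictors excluding the intercept (the symbol $k$ is introduced here only for illustration), so that $p = k + 1$ under the convention used in this answer:

$$\dfrac{n-1}{n-p} = \dfrac{n-1}{n-(k+1)} = \dfrac{n-1}{n-k-1}$$

The factor is the same either way; only the meaning of the letter changes.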

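If you do a regression on 100 observations and get $\frac{SSRes}{SSTotal} = 0.8$, then if you do the same regression on 200 observations and also get $\frac{SSRes}{SSTotal} = 0.8$, you will have a larger $R^2_{adj}$. This is because the factor $\frac{n-1}{n-p}$ shrinks toward 1 as $n$ grows (for $p > 1$), so less is subtracted from 1.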
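As a quick check of the arithmetic, here is a minimal sketch that plugs the 100 and 200 observation cases into the formula above. The value $p = 3$ (an intercept plus two predictors) is an assumption made purely for illustration, and `adjusted_r2` is a hypothetical helper name, not a library function.

```python
def adjusted_r2(ss_ratio, n, p):
    """Adjusted R^2 from the ratio SSRes / SSTotal.

    n is the number of observations; p counts the parameters
    INCLUDING the intercept, matching the convention in this answer.
    """
    return 1 - (n - 1) / (n - p) * ss_ratio


# Assumed for illustration: p = 3 (intercept + two predictors).
p = 3
for n in (100, 200):
    print(n, round(adjusted_r2(0.8, n, p), 4))
# 100 0.1835
# 200 0.1919
```

The ratio is held at 0.8 in both calls, so the whole difference comes from the $\frac{n-1}{n-p}$ factor.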

For $R^2$, the answer is no... sort of. $R^2$ is something like a correlation between variables, and the relationship between quantities doesn't depend on how many times you observe the variables. It doesn't depend on you observing the variables at all! However, when you work with data, you're not feeding the same values into the formula for $R^2$, so you won't end up with the same value, for the same reason that you won't get the same $\bar{x}$ from 99 observations as you would from 100. You also have no way of knowing in advance whether adding that 100th observation will increase or decrease the value you're calculating. With adjusted $R^2$, you know that, all else being equal, more observations means a larger value.

What happened in your example is that the smaller sample happened to get lucky and find a strong relationship, but when you included more data, that relationship turned out not to be so strong.

Keep in mind that you're very unlikely to find that your $\frac{SSRes}{SSTotal}$ ratio is the same for two different samples, even if they're drawn from the same population. This is true whether the sample sizes are the same or different.
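The sampling variability described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the population model ($y = 2 + 0.5x + \varepsilon$, with standard normal $x$ and $\varepsilon$) is made up, it uses a single predictor rather than a multiple regression to keep the code short, and `ss_ratio` and `draw_sample` are hypothetical helper names. The slope is chosen so that the population ratio is $1 / 1.25 = 0.8$.

```python
import random


def ss_ratio(xs, ys):
    """Return SSRes / SSTotal for a least-squares line fitted to (xs, ys)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_total = sum((y - y_bar) ** 2 for y in ys)
    return ss_res / ss_total


def draw_sample(n, rng):
    # Assumed population (illustration only): y = 2 + 0.5 * x + noise.
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [2.0 + 0.5 * x + rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys


rng = random.Random(0)  # fixed seed so the run is reproducible
ratios_50 = [ss_ratio(*draw_sample(50, rng)) for _ in range(5)]
ratios_150 = [ss_ratio(*draw_sample(150, rng)) for _ in range(5)]
print("n=50: ", [round(r, 3) for r in ratios_50])
print("n=150:", [round(r, 3) for r in ratios_150])
```

Every sample comes from the same population, yet each call returns a different ratio, which is the point of the paragraph above.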
