Solved – R-squared – a biased estimate because it's systematically too high or low

Tags: multiple-regression, r-squared, regression

I am taking an online course on regression modeling and came across the following (which is actually taken from a Minitab blog):

R-squared as a Biased Estimate

R-squared measures the strength of the relationship between the predictors and response. The R-squared in your regression output is a biased estimate based on your sample.

An unbiased estimate is one that is just as likely to be too high as it is to be too low, and it is correct on average. If you collect a random sample correctly, the sample mean is an unbiased estimate of the population mean.
A biased estimate is systematically too high or too low, and so is its average.

I'm still pretty new to regression modeling and I don't quite understand why $r^2$ is a biased estimate. Could someone dumb this down a bit for me?

Best Answer

I found this answer from The Stats Geek, which explains it quite well:

Why is the standard estimator of $r^2$ biased? One way of seeing why it can't be unbiased is that, by its definition, the estimates always lie between 0 and 1. From one perspective this is a very appealing property - since the true $r^2$ lies between 0 and 1, having estimates which fall outside this range wouldn't be nice (this can happen for adjusted $r^2$). However, suppose the true $r^2$ is 0 - i.e. the covariates are completely independent of the outcome $Y$. In repeated samples, the $r^2$ estimates will (almost) always be above 0, and their average will therefore be above 0. Since the bias is the difference between the average of the estimates in repeated samples and the true value (0 here), the simple $r^2$ estimator must be positively biased.
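
You can see this bias directly with a small simulation. Here is a minimal sketch (assuming Python with NumPy; the sample size, predictor count, and repetition count are arbitrary choices for illustration): we generate an outcome that is truly independent of the predictors, so the true $r^2$ is 0, fit ordinary least squares repeatedly, and average the $r^2$ estimates.

```python
import numpy as np

rng = np.random.default_rng(42)
n, p, reps = 30, 3, 10_000   # sample size, number of predictors, repetitions

r2s, adj_r2s = [], []
for _ in range(reps):
    X = rng.normal(size=(n, p))            # predictors
    y = rng.normal(size=n)                 # outcome independent of X: true r^2 = 0
    X1 = np.column_stack([np.ones(n), X])  # design matrix with intercept
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    ss_res = resid @ resid
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot               # ordinary r^2: always in [0, 1]
    r2s.append(r2)
    # Adjusted r^2 penalizes for the p predictors and *can* fall below 0.
    adj_r2s.append(1 - (1 - r2) * (n - 1) / (n - p - 1))

print(f"mean r^2          = {np.mean(r2s):.3f}  (true value 0 -> positive bias)")
print(f"mean adjusted r^2 = {np.mean(adj_r2s):.3f}")
print(f"share of adjusted r^2 estimates below 0: {np.mean(np.array(adj_r2s) < 0):.1%}")
```

With these settings the average $r^2$ should come out near $p/(n-1) \approx 0.10$ rather than 0, which is exactly the positive bias described above, while the adjusted $r^2$ averages close to 0 - at the price of individual estimates that can fall below 0, the trade-off the answer mentions.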
