Regression – Can R-Squared Be Too Low in Multiple Linear Regression?

Tags: r-squared, regression

This is a very general question about R-squared or the coefficient of determination. I found a couple of threads on CV but none that answers my question in a straightforward way.

In short, what counts as a ‘low’ R-squared when running a multiple linear regression? Below which minimum value should we conclude that our model does no better than the baseline?

I sometimes see R-squared values as low as 0.15, yet the models are statistically significant. I suppose this depends on sample size, on whether R-squared is used for prediction or inference, and so on; even so, I still do not have a good intuition for it.

It also seems to me that in the ‘hard’ sciences, R-squared tends to be high (say, 0.8 or higher in classic cases), whereas in the social sciences, from what I can see, it tends to be lower (say, under 0.5). I know this may be a gross generalization, however.

Any thoughts much appreciated.

Best Answer

Consider what $R^2$ means: the proportion of variability explained, compared to a baseline model that always predicts the mean of the observed response variable.
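In symbols, with $\hat{y}_i$ the fitted values and $\bar{y}$ the mean of the observed responses, this is:

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$

The second term is the ratio of the model's squared error to the baseline's squared error, so $R^2 > 0$ exactly when the model's in-sample squared error is smaller than that of always guessing $\bar{y}$.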

If your $R^2$ is above zero, which it almost always will be on in-sample data when the model includes an intercept, then you are beating the baseline's performance.
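A minimal NumPy sketch of this (the simulated data, coefficient values, and variable names are my own, chosen so that $R^2$ lands around 0.15, the ballpark from the question): a weak but real linear signal buried in noise still yields a positive in-sample $R^2$, i.e. it beats the always-guess-the-mean baseline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: a weak linear signal (slope 0.4) plus unit-variance noise.
n = 500
x = rng.normal(size=n)
y = 0.4 * x + rng.normal(scale=1.0, size=n)

# Ordinary least squares with an intercept.
X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta

# R^2 = 1 - SS_residual / SS_total,
# where the baseline model always predicts mean(y).
ss_res = np.sum((y - y_hat) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
print(f"R^2 = {r2:.3f}")
```

With these values $R^2$ comes out to roughly $0.4^2 / (0.4^2 + 1) \approx 0.14$: low in absolute terms, yet the model clearly does better than predicting the mean.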