Regression Analysis – Why Adding a Linear Regression Predictor Decreases R Squared

Tags: linear, r-squared, regression

My dataset ($N \approx 10,000$) has a dependent variable (DV), five independent "baseline" variables (P1, P2, P3, P4, P5) and one independent variable of interest (Q).

I have run OLS linear regressions for the following two models:

DV ~ 1 + P1 + P2 + P3 + P4 + P5
                                  -> R-squared = 0.125

DV ~ 1 + P1 + P2 + P3 + P4 + P5 + Q
                                  -> R-squared = 0.124

I.e., adding the predictor Q has decreased the amount of variance explained by the linear model. As far as I understand, this shouldn't happen.
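
My reasoning, for reference (the standard textbook argument as I understand it): with OLS fit on the same sample,

$$R^2 = 1 - \frac{\sum_i \left(y_i - \hat{y}_i\right)^2}{\sum_i \left(y_i - \bar{y}\right)^2},$$

and the model with Q nests the model without it, so the larger model can always reproduce the smaller model's fit by setting the coefficient on Q to zero. Its residual sum of squares can therefore only stay the same or shrink, and its $R^2$ can only stay the same or grow.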

To be clear, these are R-squared values and not adjusted R-squared values.

I've verified the R-squared values using Jasp and Python's statsmodels.
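
For reference, this is roughly how I computed them with statsmodels' formula API (the DataFrame and file name below are placeholders for my actual data):

```python
import pandas as pd
import statsmodels.formula.api as smf

# Placeholder for my actual data; columns are DV, P1-P5, Q.
df = pd.read_csv("data.csv")

base = smf.ols("DV ~ P1 + P2 + P3 + P4 + P5", data=df).fit()
full = smf.ols("DV ~ P1 + P2 + P3 + P4 + P5 + Q", data=df).fit()

print(base.rsquared, full.rsquared)  # reported as 0.125 and 0.124
```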

Is there any reason I could be seeing this phenomenon? Perhaps something relating to the OLS method?

Best Answer

Could it be that you have missing values in Q that are being dropped automatically? That would mean the second regression is fit on a smaller, different sample, making the two R-squared values not comparable.
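
Here is a self-contained sketch of that mechanism, using synthetic data and statsmodels' formula API (which drops rows with missing values in any variable of the formula by default); the column names simply mirror your description:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 10_000

# DV depends on P1-P5 only; Q is pure noise, so it should add ~nothing.
df = pd.DataFrame(rng.normal(size=(n, 6)),
                  columns=["P1", "P2", "P3", "P4", "P5", "Q"])
df["DV"] = df[["P1", "P2", "P3", "P4", "P5"]].sum(axis=1) + rng.normal(scale=4, size=n)

# Knock out Q for rows that happen to be well explained by P1-P5, so the
# rows that remain for the "full" model are the harder-to-fit ones.
resid = smf.ols("DV ~ P1 + P2 + P3 + P4 + P5", data=df).fit().resid
df.loc[resid.abs().nsmallest(3000).index, "Q"] = np.nan

base = smf.ols("DV ~ P1 + P2 + P3 + P4 + P5", data=df).fit()      # all 10,000 rows
full = smf.ols("DV ~ P1 + P2 + P3 + P4 + P5 + Q", data=df).fit()  # rows with Q missing are dropped

print(base.nobs, full.nobs)          # different sample sizes
print(base.rsquared, full.rsquared)  # the "larger" model's R-squared comes out lower

# Refitting both models on the common complete-case sample restores the
# usual guarantee that R-squared cannot decrease when a predictor is added.
cc = df.dropna()
print(smf.ols("DV ~ P1 + P2 + P3 + P4 + P5", data=cc).fit().rsquared,
      smf.ols("DV ~ P1 + P2 + P3 + P4 + P5 + Q", data=cc).fit().rsquared)
```

In your own data, comparing the `nobs` attribute of the two fitted results (or counting missing values in Q) should tell you quickly whether this is what is going on.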