What happens when we introduce more variables to a linear regression model

linear model, r-squared, regression

Let’s consider the following regression model:

$y = B_{0} + B_{1}*x$

where

  • $B_{0}$ — represents the intercept
  • $B_{1}$ — represents the coefficient
  • $x$ — represents the independent variable
  • $y$ — represents the output or the dependent variable

or, for multiple linear regression with $p$ independent variables:

$y = B_{0} + B_{1}*x_{1} + B_{2}*x_{2} + ... + B_{p}*x_{p}$

Mathematically, R-squared is calculated by dividing the residual sum of squares ($SS_{res}$) by the total sum of squares ($SS_{tot}$) and subtracting the result from 1. Here, $SS_{tot}$ measures the total variation, $SS_{reg}$ measures the explained variation, and $SS_{res}$ measures the unexplained variation.

Since $SS_{reg} + SS_{res} = SS_{tot}$ (for a least-squares fit with an intercept),

$R^2 = \text{Explained variation} / \text{Total variation}$

$$R^2 = \frac{SS_{reg}}{SS_{tot}} = 1 - \frac{SS_{res}}{SS_{tot}}$$
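As a quick numerical check of the decomposition, here is a minimal sketch using NumPy. The toy data and the variable names (`ss_reg`, `ss_res`, `ss_tot`) are illustrative assumptions, not part of the original question.

```python
import numpy as np

# Toy data (made up for illustration): y is roughly 2 * x
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])

# Least-squares fit of y = B0 + B1 * x (the column of ones is the intercept)
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta

ss_tot = np.sum((y - y.mean()) ** 2)      # total variation
ss_reg = np.sum((y_hat - y.mean()) ** 2)  # explained variation
ss_res = np.sum((y - y_hat) ** 2)         # unexplained variation

r2 = 1.0 - ss_res / ss_tot

print("SS_reg + SS_res:", ss_reg + ss_res)
print("SS_tot:         ", ss_tot)
print("R^2:            ", r2)
```

The two forms of $R^2$ agree here only because the model is fitted by least squares and includes an intercept; without the intercept the decomposition need not hold.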

Adjusted R-squared can also be written in terms of the sums of squares. The only difference from the R-squared equation is that each sum of squares is divided by its degrees of freedom.

$$\bar{R}^2 = 1 - \frac{SS_{res}/df_{e}}{SS_{tot}/df_{t}}$$

In the above equation, $df_{t} = n - 1$ is the degrees of freedom of the estimate of the population variance of the dependent variable, and $df_{e} = n - p - 1$ is the degrees of freedom of the estimate of the underlying population error variance, where $n$ is the sample size and $p$ is the number of predictors.

The adjusted R-squared value can also be calculated from the value of R-squared, the number of independent variables (predictors) $p$, and the total sample size $n$:

$$\bar{R}^2 = 1 - (1 - R^2)\frac{n - 1}{n - p - 1}$$
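A small sketch of this calculation in Python may help. The function name `adjusted_r_squared` and the numbers plugged in are hypothetical, chosen only to show how a tiny gain in $R^2$ can be outweighed by the penalty for an extra predictor.

```python
def adjusted_r_squared(r2, n, p):
    """Adjusted R^2 from R^2, sample size n, and number of predictors p."""
    if n - p - 1 <= 0:
        raise ValueError("need n > p + 1 for the error degrees of freedom")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Hypothetical values: adding a 4th predictor nudges R^2 from 0.800 to 0.801
adj_before = adjusted_r_squared(0.800, n=30, p=3)
adj_after = adjusted_r_squared(0.801, n=30, p=4)

print(f"p = 3: adjusted R^2 = {adj_before:.4f}")
print(f"p = 4: adjusted R^2 = {adj_after:.4f}")
```

With these assumed values the adjusted figure drops even though the raw $R^2$ rose slightly.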

What happens when we introduce more variables to a linear regression model in terms of $R^2$ and adjusted $R^2$?

Will they increase, decrease, or remain constant?

Best Answer

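If you introduce more variables, the $R^2$ can never decrease: it either increases or, at worst, stays the same. This follows from the fact that the smaller model is a special case of the larger one (set $\beta_{p+1} = 0$), so the minimized residual sum of squares cannot get larger: $$ \min_{\beta_0,\dots,\beta_{p+1}} \sum_{i=1}^{n} (y_i-\beta_0-\beta_1 x_{i1}-...-\beta_p x_{ip}-\beta_{p+1} x_{i,p+1})^2 \leq \min_{\beta_0,\dots,\beta_{p}} \sum_{i=1}^{n} (y_i-\beta_0-\beta_1 x_{i1}-...-\beta_p x_{ip})^2$$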
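To see this behaviour numerically, here is a minimal simulation sketch with NumPy. The data-generating process, the seed, and the helper name `r_squared` are assumptions made for illustration: one real predictor is fitted first, then five pure-noise columns are added one at a time.

```python
import numpy as np

def r_squared(X, y):
    """R^2 of a least-squares fit of y on the columns of X plus an intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    ss_res = np.sum((y - A @ beta) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

rng = np.random.default_rng(42)
n = 50
x1 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + rng.normal(size=n)
noise = rng.normal(size=(n, 5))  # generated independently of y

# Start with the real predictor, then append one noise column per step
X = x1[:, None]
r2_path = [r_squared(X, y)]
for j in range(noise.shape[1]):
    X = np.column_stack([X, noise[:, j]])
    r2_path.append(r_squared(X, y))

for p, r2 in enumerate(r2_path, start=1):
    print(f"p = {p}: R^2 = {r2:.4f}")
```

Each printed value should be at least as large as the one before it (up to floating-point rounding), even though the added columns carry no information about $y$.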

On the other hand, the adjusted $R^2$ makes an adjustment for the number of variables. It will typically increase if your new variable is strongly related to your response $Y$ (beyond what the existing variables already explain) and decrease if the new variable adds only a little explanatory power. Therefore, it is often considered a better measure than the standard $R^2$ for comparing models with different numbers of variables, because the $R^2$ never decreases as new variables are added.
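The contrast can be sketched with a small NumPy experiment. Everything here is an illustrative assumption: the simulated data, the helper `fit_stats`, and in particular the "useless" predictor `z`, which is deliberately constructed to be orthogonal to the intercept, `x1`, and `y` so that it explains nothing (an extreme case of a weakly related variable).

```python
import numpy as np

def fit_stats(X, y):
    """Return (R^2, adjusted R^2) for a least-squares fit of y on X plus an intercept."""
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    ss_res = np.sum((y - A @ beta) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    adj = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
    return r2, adj

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + 3.0 * x2 + rng.normal(size=n)

# Build z by removing from a random vector everything that lies in the
# span of the intercept, x1 and y; z then cannot reduce the residuals.
B = np.column_stack([np.ones(n), x1, y])
w = rng.normal(size=n)
coef, *_ = np.linalg.lstsq(B, w, rcond=None)
z = w - B @ coef

base = fit_stats(x1[:, None], y)                   # y ~ x1
useful = fit_stats(np.column_stack([x1, x2]), y)   # y ~ x1 + x2
useless = fit_stats(np.column_stack([x1, z]), y)   # y ~ x1 + z

for name, (r2, adj) in [("x1", base), ("x1 + x2", useful), ("x1 + z", useless)]:
    print(f"{name:8s} R^2 = {r2:.4f}   adjusted R^2 = {adj:.4f}")
```

Adding `x2` raises both measures, while adding `z` leaves $R^2$ unchanged and lowers the adjusted $R^2$, because the only thing that changes is the degrees-of-freedom penalty.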
