Simple Linear Regression – Why Fitting Without an Intercept Can Yield Residuals with Non-Zero Mean

intercept, regression

Say we fit a simple linear regression without the intercept term. I know this is generally inadvisable, except in situations where it is reasonable to assume that $y = 0$ when $x = 0$. I have read that fitting without the intercept may cause the residuals to have non-zero mean. That makes sense intuitively, but I can't seem to prove it to myself (out of personal curiosity). Can anyone help me understand it through a proof or some equations?

Best Answer

I'll give a linear-algebra-based, and in some sense geometric, explanation.

Linear regression projects vector $\mathbf{y}$ onto the linear span of the columns of $X$

Let's say you have some vector $\mathbf{y}$ (length $n$) of outcomes and matrix $X$ (dimensions $n$ by $k$) of data. The vector $\mathbf{y}$ can be split into the sum of two orthogonal vectors:

  1. The projection of $\mathbf{y}$ onto the linear space that is spanned by the columns of $X$. This is $$\mathrm{proj}_X \mathbf{y} = X\hat{\mathbf{b}} \quad \quad \text{where } \hat{\mathbf{b}}=(X'X)^{-1}X'\mathbf{y}$$
  2. Some leftover residual that is orthogonal to the columns of $X$. $$ \mathbf{e} = \mathbf{y} - \mathrm{proj}_X \mathbf{y} = \mathbf{y} - X \hat{\mathbf{b}}$$

Now observe that this is exactly what linear regression does! (The minimum of $\|\mathbf{e}\|_2$ is achieved when $\mathbf{b}$ is chosen so that $\mathbf{e}$ is orthogonal to the span of the columns of $X$.)
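As a quick numerical illustration, here is a minimal sketch (using NumPy, with made-up data) of the split of $\mathbf{y}$ into its projection onto the column span of $X$ plus an orthogonal residual:

```python
# Minimal sketch (made-up data): regression as projection onto the column span of X.
import numpy as np

rng = np.random.default_rng(0)
n, k = 50, 3
X = rng.normal(size=(n, k))   # data matrix (no intercept column here)
y = rng.normal(size=n)        # outcome vector

# b_hat = (X'X)^{-1} X'y, so X @ b_hat is the projection of y onto the span of X's columns
b_hat = np.linalg.solve(X.T @ X, X.T @ y)
proj = X @ b_hat              # proj_X y
e = y - proj                  # leftover residual

print(np.allclose(proj @ e, 0.0))  # True: the two pieces are orthogonal
print(np.allclose(proj + e, y))    # True: they add back up to y
```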

Residual $\mathbf{e}$ is orthogonal to each column of $X$

We can verify that the dot product of residual $\mathbf{e}$ with each column of $X$ is zero: \begin{align*} X'\mathbf{e} &= X'\left(\mathbf{y} - X(X'X)^{-1}X' \mathbf{y} \right)\\ &= X'\mathbf{y}-(X'X)(X'X)^{-1}X'\mathbf{y}\\ &= \mathbf{0} \end{align*}
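Continuing the sketch above, the same orthogonality can be checked numerically (just a sanity check on the made-up data):

```python
# The residual is orthogonal to every column of X: X'e = 0 up to floating-point error.
print(np.allclose(X.T @ e, 0.0))  # True
```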

Include an intercept vs. not including an intercept

The way you include an intercept in linear regression is to add a vector of 1s as one of your variables: you set $x_{i,1} = 1$ for every observation $i$, i.e. you make one of the columns of your matrix $X$ a column of 1s. To be explicit:

$$ X = \begin{bmatrix} 1 & x_{1,2} & \ldots & x_{1,k} \\ 1 & x_{2,2} & \ldots & x_{2,k} \\ \vdots & \vdots & \ddots & \vdots\\ 1 & x_{n,2} & \ldots & x_{n, k} \end{bmatrix}$$

When you include an intercept (i.e. a constant) in the regression, it forces the residual $\mathbf{e}$ to be orthogonal to the constant column $\mathbf{1}$, and if $\mathbf{e}$ and $\mathbf{1}$ are orthogonal then their dot product is zero.

Since $\mathbf{e}'\mathbf{1} = \sum_i e_i$, orthogonality of $\mathbf{e}$ and $\mathbf{1}$ means $\sum_i e_i = 0$, i.e. the mean of $\mathbf{e}$ is zero. Conversely, if you leave the intercept out and $\mathbf{1}$ is not in the column span of $X$, nothing forces $\mathbf{e}'\mathbf{1} = 0$, so the residuals will in general have non-zero mean.
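Here is a short numerical sketch of that contrast (NumPy again, with made-up data whose true intercept is far from 0): with a column of 1s in $X$ the residuals average to zero, without it they generally do not.

```python
# Sketch (made-up data): residual mean with vs. without an intercept column.
import numpy as np

rng = np.random.default_rng(1)
n = 200
x = rng.normal(size=n)
y = 5.0 + 2.0 * x + rng.normal(size=n)   # true intercept is 5, far from 0

def ols_residuals(X, y):
    # b_hat = (X'X)^{-1} X'y, residual e = y - X b_hat
    b_hat = np.linalg.solve(X.T @ X, X.T @ y)
    return y - X @ b_hat

X_no_intercept = x[:, None]                          # just the x column
X_with_intercept = np.column_stack([np.ones(n), x])  # prepend a column of 1s

print(ols_residuals(X_no_intercept, y).mean())    # noticeably non-zero
print(ols_residuals(X_with_intercept, y).mean())  # ~0, up to floating-point error
```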

Discussion:

Basically the only assumption I used here is that $X'X$ is invertible (equivalently, that $X$ has full column rank), so that the projection formula is well defined. Purely mechanically, including a constant in a regression forces the residual $\mathbf{e}$ to have mean zero.
