Solved – Simple linear regression fit manually via matrix equations does not match lm() output

linear-algebra, regression, self-study

As a simple tutorial for myself, to better understand both R and matrix notation, I am trying to fit a linear model to my data set using matrices, even though I could do the OLS fit without them.

This is the model I am trying to fit:

$$\bf Y=X\boldsymbol\beta+\varepsilon$$

where $\bf Y$ is an $n\times 1$ matrix, $\bf X$ is an $n\times k$ matrix (where $k$ is the number of $\beta$'s, which in this case is 2), $\boldsymbol \beta$ is a $k\times 1$ matrix, and lastly our error term is $n\times 1$. I understand this portion.

When I simply use the lm() command to fit my data, I get the following from the summary() command:

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.0503 -1.4390  0.4921  1.0589  3.9446 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   2.9849     0.8219   3.632  0.00191 ** 
x             0.5612     0.1084   5.178 6.32e-05 ***

So the summary() is telling me that the $\beta$ matrix is a $2\times 1$ matrix, with the first entry (which is $\beta_0$) equal to 2.9849 and the second entry (which is $\beta_1$) equal to 0.5612. My question is this:

We know that the least-squares estimate of $\boldsymbol\beta$ is

$$\hat{\boldsymbol\beta}=(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}$$

and when I try to carry out this calculation by hand in R using b=(t(x)*x)^-1*t(x)*y, I get a $1\times 20$ vector (where $20$ is of course $n$, the number of observations). Why am I not getting a $2\times 1$ matrix like I should be getting?

Best Answer

You've made two mistakes in your R code for b.

  1. solve is used for matrix inversion. Raising X to the $-1$ power inverts each element of X, which can occasionally be useful, but is not what we want here.
  2. R uses the operator %*% for matrix multiplication. The plain * operator instead does element-wise multiplication, and requires the arrays to be conformable under R's recycling rules for vectors. Again, occasionally useful; a short sketch contrasting the two sets of operators follows this list.
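
To see the difference concretely, here is a minimal sketch using a small made-up matrix A (not from the question):

A <- matrix(c(4, 2, 2, 3), nrow = 2)   # a small invertible matrix

A^-1        # element-wise reciprocals: 1/4, 1/2, 1/2, 1/3
solve(A)    # the matrix inverse, so that solve(A) %*% A is the identity

A * A       # element-wise (Hadamard) product
A %*% A     # matrix product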

This is all explained in more detail in the R documentation and in the nigh-uncountable intro-to-R materials available online.

b <- solve(t(X) %*% X) %*% t(X) %*% y is the literal transcription of the normal equations, while b <- solve(crossprod(X), crossprod(X, y)) is faster and more idiomatic; on a well-conditioned problem like this one, both match the output of lm(). That said, you generally don't want to solve the normal equations explicitly, for numerical reasons: forming $X^TX$ squares the condition number of the problem, and R won't warn you until you have lost all significant figures. The lm method instead uses a QR decomposition by default, which is more numerically stable. So if you apply the corrected "hand-cranked" computation and it still differs slightly from the R output, the plausible culprit is loss of numerical precision due to the explicit inversion of $X^TX$.
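
For concreteness, a self-contained sketch: the data below are simulated stand-ins (the original data set is not shown), with true coefficients chosen near the summary() output above.

set.seed(42)                          # hypothetical data in place of the asker's
n <- 20
x <- runif(n, 0, 15)
y <- 3 + 0.5 * x + rnorm(n, sd = 2)

X <- cbind(1, x)                      # design matrix: a column of ones, then the predictor

b_literal <- solve(t(X) %*% X) %*% t(X) %*% y       # literal normal equations
b_cross   <- solve(crossprod(X), crossprod(X, y))   # idiomatic normal equations
b_qr      <- qr.solve(X, y)                         # QR-based least squares, as lm() uses

cbind(b_literal, b_cross, b_qr, coef(lm(y ~ x)))    # on this well-conditioned problem, all four agree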

Don't forget to add a column of $1$s to $X$ for the intercept!
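
If you would rather let R build the design matrix (including the intercept column) from a formula, a sketch, where x is assumed to be the same predictor vector passed to lm(y ~ x):

X <- model.matrix(~ x)    # n x 2 matrix with columns "(Intercept)" and "x"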
