I am trying to fit a linear model to my data set using matrices, even though I could use OLS and do it without matrices, as a simple tutorial for myself to better understand both R and matrix notation.
This is the model I am trying to fit:
$$\bf Y=X\boldsymbol\beta+\varepsilon$$
where $\bf Y$ is an $n\times 1$ matrix, $\bf X$ is an $n\times k$ matrix (where $k$ is the number of $\beta$'s, which in this case is 2), $\boldsymbol \beta$ is a $k\times 1$ matrix, and lastly our error term is $n\times 1$. I understand this portion.
When I simply use the lm()
command to fit my data, I get the following from the summary()
command:
Call:
lm(formula = y ~ x)
Residuals:
Min 1Q Median 3Q Max
-4.0503 -1.4390 0.4921 1.0589 3.9446
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.9849 0.8219 3.632 0.00191 **
x 0.5612 0.1084 5.178 6.32e-05 ***
So the summary()
output is telling me that the $\beta$ matrix is a $2\times 1$ matrix, with the first entry (which is $\beta_0$) equal to 2.9849 and the second entry (which is $\beta_1$) equal to 0.5612. My question is this:
We know that the matrix $\beta$ is actually:
$$\boldsymbol\beta=(\bf{X}^T\bf{X})^{-1}(\bf{X}^T\bf{Y})$$
and when I try to simply carry out this calculation by hand in R using b=(t(x)*x)^-1*t(x)*y
, I get a $1\times 20$ vector (where $20$ of course is $n$, the number of observations). Why am I not getting a $2\times 1$ matrix like I should be getting?
Best Answer
You've made two mistakes in your R code for `b`:

1. `solve` is used for matrix inversion. Raising `X` to the $-1$ power inverts each element of `X`, which can occasionally be useful, but is not what we want here.
2. Use `%*%` for matrix multiplication. Otherwise, `*` does element-wise multiplication and requires your arrays to be conformable according to R's handling of vectors. Again, occasionally useful, but not what we want here.

This is all explained in more detail in the R documentation, or in the nigh-uncountable intro-to-R materials online, such as this one.
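To make the distinction concrete, here is a small sketch (the matrix `A` is illustrative, not from the post) contrasting element-wise operations with their matrix counterparts:

```r
# A small non-diagonal matrix (filled column-by-column)
A <- matrix(c(2, 1, 0, 4), nrow = 2)

A^-1      # element-wise reciprocal of each entry (zeros become Inf)
solve(A)  # the actual matrix inverse, so solve(A) %*% A is the identity

A * A     # element-wise product: each entry squared
A %*% A   # matrix product: the usual linear-algebra A^2
```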
b <- solve(t(X) %*% X) %*% t(X) %*% y

is the literal representation of the normal equations, while

b <- solve(crossprod(X), crossprod(X, y))

is faster and more idiomatic, and matches the output of lm()
. But you generally don't want to compute the normal equations explicitly, for numerical reasons (and R won't warn you until you lose all significant figures). The lm
method uses QR decomposition by default, which is more numerically stable. If you apply the correct "hand-cranked" computation and it still differs from the R output, that is plausibly loss of numerical precision due to the explicit inversion of $X^TX$.

Don't forget to add a column of $1$s to X for the intercept!
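Putting it all together, here is a minimal end-to-end sketch (with simulated data chosen only to roughly resemble the coefficients in the question) showing that the corrected hand computation matches lm():

```r
set.seed(1)
n <- 20
x <- rnorm(n, mean = 7, sd = 2)
y <- 3 + 0.5 * x + rnorm(n, sd = 2)

# Design matrix: a column of 1s for the intercept, then x
X <- cbind(1, x)

fit <- lm(y ~ x)

# Literal normal equations (fine for small, well-conditioned problems)
b1 <- solve(t(X) %*% X) %*% t(X) %*% y

# Faster, more idiomatic equivalent
b2 <- solve(crossprod(X), crossprod(X, y))

cbind(coef(fit), b1, b2)  # all three columns agree up to rounding
```

Note that `b1` and `b2` come out as $2\times 1$ matrices, as expected, rather than the $1\times 20$ vector produced by the element-wise version.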