Solved – Coefficients in Linear Regression and $z$-tests

linear model, multiple regression, regression

Suppose that we are carrying out a linear regression with $p$ inputs and $N$ observations in our training data. Let $X$ denote the matrix of dimension $N\times (p+1)$ whose rows are the input vectors, each with a $1$ prepended in the first position (for the intercept). Similarly, let $y$ be the $N$-vector of outputs from the training data.

Now we are looking for the parameter vector $\beta$ in a model of the form:

$$y = X\beta + \varepsilon$$

and we know that the estimate minimizing the RSS is given by:

$$\hat{\beta} =(X^TX)^{-1}X^Ty$$
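As a sanity check, the closed-form estimate above can be computed directly and compared against NumPy's least-squares routine. The data here is simulated and the coefficient values are arbitrary, chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated training data: N observations, p inputs (values are illustrative).
N, p = 100, 3
X_raw = rng.normal(size=(N, p))
X = np.column_stack([np.ones(N), X_raw])      # prepend the intercept column of 1s
beta_true = np.array([1.0, 2.0, 0.0, -1.5])
y = X @ beta_true + rng.normal(scale=0.5, size=N)

# Least-squares estimate: beta_hat = (X^T X)^{-1} X^T y.
# solve() on the normal equations avoids forming the inverse explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's built-in least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Both routes give the same $\hat{\beta}$ (up to floating-point tolerance); `lstsq` is the numerically safer choice in practice because it works via a factorization of $X$ rather than $X^TX$.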

Making the assumptions that our model is correct and that the errors are additive and Gaussian, we know that:

$$\hat{\beta} \sim N\left(\beta,\ (X^TX)^{-1}\sigma^2\right)$$

These assumptions then allow us to test the significance of an individual coefficient $\beta_j$ using the test statistic:

$$z_j=\frac{\hat{\beta}_j}{\hat{\sigma}\sqrt{v_j}}$$

where $v_j$ is the $j$-th diagonal entry of $(X^TX)^{-1}$.
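The $z$-statistics above can be computed by hand from the fitted model. In this sketch the data is simulated with the last coefficient set to exactly zero, so its $z$-statistic should typically be small while the others are large; $\hat{\sigma}^2$ is estimated by the usual unbiased $\mathrm{RSS}/(N-p-1)$:

```python
import numpy as np

rng = np.random.default_rng(1)
N, p = 200, 2
X = np.column_stack([np.ones(N), rng.normal(size=(N, p))])
beta_true = np.array([0.5, 1.0, 0.0])         # the last coefficient is truly zero
y = X @ beta_true + rng.normal(size=N)

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y

# Unbiased estimate of the error variance: RSS / (N - p - 1).
resid = y - X @ beta_hat
sigma_hat = np.sqrt(resid @ resid / (N - p - 1))

# z_j = beta_hat_j / (sigma_hat * sqrt(v_j)),  v_j = j-th diagonal of (X^T X)^{-1}.
v = np.diag(XtX_inv)
z = beta_hat / (sigma_hat * np.sqrt(v))
```

Comparing each $|z_j|$ against a standard-normal (or $t_{N-p-1}$) quantile such as 1.96 gives the usual significance test at the 5% level.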

Now I am slightly confused as to what it means for us to test a parameter and find it not to be significant. Suppose that we test a ${\beta_j}$ and do not reject the null hypothesis that ${\beta_j}=0$, and so would drop this from our model. Will this not increase the RSS of the model that we have minimized in our parameter estimates?

I realize that my question is pretty ill-formed but I have gotten myself a bit confused: basically I am asking what it means for a parameter to be judged to be not significant with respect to minimizing the RSS of the model.

Best Answer

In layman's words, the $z$-test is about how far your estimate lies from the hypothesised population value of the parameter. If the test shows that the distance is not far enough for the estimate to be significantly different from $0$ (or from the hypothesised value), you would drop the variable from the model. To address your question directly: yes, dropping a column can only increase the RSS, but non-significance means that the increase is no larger than what you would expect from fitting noise, so the extra parameter is not earning its keep. Note that, given your data, OLS always minimises the RSS over whatever columns you include. Non-significance could also be due to various issues in your model, e.g. mis-specification, multicollinearity, serial correlation, or heteroskedastic errors.
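The point about the RSS can be illustrated numerically. In this simulation (the variable names and coefficients are invented for the example), `x2` has no true effect on `y`; dropping it necessarily increases the RSS, but only by a small amount comparable to a single squared noise term:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 150
x1 = rng.normal(size=N)
x2 = rng.normal(size=N)                       # x2 has no real effect on y
y = 1.0 + 2.0 * x1 + rng.normal(size=N)

def rss(X, y):
    """Residual sum of squares of the least-squares fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return r @ r

X_full = np.column_stack([np.ones(N), x1, x2])     # intercept, x1, x2
X_reduced = np.column_stack([np.ones(N), x1])      # drop the irrelevant x2

rss_full = rss(X_full, y)
rss_reduced = rss(X_reduced, y)

# Dropping any column can only increase the RSS, but for a non-significant
# predictor the increase is small relative to the total residual variation.
```

In fact, the increase in RSS from dropping one variable equals $z_j^2\,\hat{\sigma}^2$ (up to the $t$-versus-$z$ distinction), which is exactly why a small $|z_j|$ licenses dropping the variable.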
