[Math] Proving Convergence of Least Squares Regression with i.i.d. Gaussian Noise

estimation, least squares, linear regression, regression

I have a basic question that I can't seem to find an answer for — perhaps I'm not wording it correctly. Suppose we have an $n$-by-$d$ matrix $X$ that represents input features, and an $n$-by-$1$ vector of output labels, $y$. Furthermore, let these labels be a noisy linear transformation of the input features:

$$ y = X \cdot w + \epsilon$$

where $w$ is a $d$-dimensional vector of true weights and $\epsilon$ is a vector of i.i.d. zero-mean Gaussian noise. I am interested in inferring $w$ using ordinary least squares (OLS) linear regression.

I would like to prove that, as the number of data points $n$ increases, the weight vector estimated by OLS converges in probability to the true weights $w$ — say, in $\ell_2$-norm.

Can anyone help me to go about proving this, or point me to references? Thanks!

As pointed out in the comments, it is important to keep in mind how $X$ is constructed. Let $X$ be a random matrix.
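For intuition, here is a minimal NumPy sketch of this setup (the dimension, noise scale, and Gaussian design below are illustrative choices, not part of the question): it simulates $y = Xw + \epsilon$, fits OLS for increasing $n$, and prints $\|\hat{w} - w\|_2$, which should shrink toward zero.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 5                                     # illustrative feature dimension
w_true = rng.normal(size=d)               # "true" weights (arbitrary choice)

for n in [100, 1_000, 10_000, 100_000]:
    X = rng.normal(size=(n, d))           # random design with i.i.d. rows
    eps = rng.normal(scale=0.5, size=n)   # i.i.d. zero-mean Gaussian noise
    y = X @ w_true + eps

    # OLS fit: minimizes ||X w - y||_2^2
    w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

    print(f"n = {n:>7}: ||w_hat - w||_2 = {np.linalg.norm(w_hat - w_true):.4f}")
```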

Best Answer

Here is a probabilistic approach to the proof, assuming that $X$ is a random matrix with i.i.d. rows $x_i$, and that $E\|x_i\|^2 < \infty$ and $Ey_i^2 < \infty$.

Your minimization problem is
$$\min_{w\in \mathbb{R}^d} \| Xw - y \|_2^2.$$
It is straightforward to show that the OLS estimator (the "weights", in your terminology) is given by
$$\hat{w}=(X'X)^{-1}X'y,$$
or equivalently, writing $x_i$ for the $i$-th row of $X$ as a column vector,
$$\hat{w}= \left( \frac 1 n \sum_{i=1}^n x_i x_i^T\right)^{-1} \left( \frac 1 n \sum_{i=1}^n x_i y_i\right).$$

Since $\hat{w}$ is a finite-dimensional vector, convergence in probability in $\ell_2$-norm is equivalent to coordinatewise convergence in probability, so it suffices to show $\hat{w}\xrightarrow{p} w$. By the WLLN,
$$\frac 1 n \sum_{i=1}^n x_i x_i^T \xrightarrow{p} E[x_i x_i^T]$$
as $n\to \infty$, and
$$\frac 1 n \sum_{i=1}^n x_i y_i \xrightarrow{p} E[x_i y_i].$$
Provided $E[x_i x_i^T]$ is invertible, the map $(A,b)\mapsto A^{-1}b$ is continuous at this limit, so by the continuous mapping theorem
$$\hat{w}\xrightarrow{p} \left(E[x_i x_i^T]\right)^{-1} E[x_i y_i].$$
Finally, plugging in $y_i = x_i^T w + \epsilon_i$ and using $E[x_i \epsilon_i] = 0$ (the noise is zero-mean and independent of $x_i$) gives
$$E[x_i y_i] = E[x_i x_i^T]\,w + E[x_i \epsilon_i] = E[x_i x_i^T]\,w,$$
so the limit is exactly the population ("real") weight vector $w$.
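As a sanity check on the argument above, here is a small sketch (again with illustrative dimensions and distributions) that forms the sample moments $\frac 1 n \sum_i x_i x_i^T$ and $\frac 1 n \sum_i x_i y_i$ directly and confirms that the resulting $\hat{w}$ is close to the population weights $w$ for large $n$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 50_000, 3                       # illustrative sample size / dimension
w = rng.normal(size=d)                 # population weights

X = rng.normal(size=(n, d))            # random design, i.i.d. rows x_i
y = X @ w + rng.normal(size=n)         # y_i = x_i^T w + eps_i

# Sample moments: (1/n) sum_i x_i x_i^T  and  (1/n) sum_i x_i y_i
S_xx = (X.T @ X) / n
S_xy = (X.T @ y) / n

# Plug-in (OLS) estimator: w_hat = S_xx^{-1} S_xy, identical to (X'X)^{-1} X'y
w_hat = np.linalg.solve(S_xx, S_xy)

print("w       :", np.round(w, 3))
print("w_hat   :", np.round(w_hat, 3))
print("l2 error:", np.linalg.norm(w_hat - w))
```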
