OLS MLE and Cramer-Rao for Basic Linear Regression

fisher information, maximum likelihood, parameter estimation, regression, statistics

In general I know that maximum likelihood estimation of a parameter should have a variance bounded below by the Cramer-Rao bound (i.e., the variance of the estimated parameter should approach $I(\theta)^{-1}$, where $I(\theta)=E\left[-\frac{\partial^2 l(\theta \mid x)}{\partial \theta^2}\right]$ is the Fisher information).

For the simple linear regression $y_i=w \cdot x_i+b + \epsilon_i$, $\epsilon_i \sim N(0,\sigma^2)$, where $w$ is the weight and $b$ is the bias, the MLE (equivalently the OLS estimate) of the weight is $\widehat w = \large \frac{\sum_{i=1}^{n}x_i(y_i-\bar y)}{\sum_{i=1}^{n}x_i(x_i-\bar x)}$, and the variance of $\widehat w$ is $\frac{\sigma^2}{\sum_{i=1}^{n}(x_i-\bar x)^2}$.
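(As a quick sanity check of that variance formula, here is a small Monte Carlo sketch in NumPy; the design points `x`, the true $w$, $b$, $\sigma$, and the number of replications below are arbitrary illustrative choices, not values from the question.)

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary illustrative setup: fixed design x, true parameters w, b, noise sd sigma.
x = np.array([0.5, 1.3, 2.1, 2.9, 4.0, 5.2])
w_true, b_true, sigma = 2.0, -1.0, 0.7
n_reps = 100_000

# Simulate n_reps data sets and compute the OLS/MLE slope for each.
y = w_true * x + b_true + rng.normal(0.0, sigma, size=(n_reps, x.size))
xc = x - x.mean()
w_hats = ((y - y.mean(axis=1, keepdims=True)) * xc).sum(axis=1) / (xc**2).sum()

print("empirical Var(w_hat):        ", w_hats.var())
print("sigma^2 / sum (x_i - xbar)^2:", sigma**2 / (xc**2).sum())
```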

I get this. But the log-likelihood function for this would be

$ l=n\ln\left(\frac{1}{\sigma\sqrt{2\pi }}\right)-\frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(y_i-w\cdot x_i-b\right)^2$, right? But then $\Rightarrow \frac{\partial l}{\partial w} = \frac{1}{\sigma^2} \sum_{i=1}^{n}(y_i-w\cdot x_i-b)x_i$,
and
$\Rightarrow -\frac{\partial^2 l}{\partial w^2} = \frac{\sum_{i=1}^{n}{x_i}^2}{\sigma^2}$

In an informal sense,
$\Rightarrow \left(E\left[-\frac{\partial^2 l}{\partial w^2}\right]\right)^{-1} = \frac{\sigma^2}{\sum_{i=1}^{n}x_i^2}$, right?
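(A quick symbolic check of that second derivative, sketched in SymPy with a toy sample size of $n=3$ so the sums stay explicit; the symbol names are just for illustration.)

```python
import sympy as sp

# Symbolic check of d^2 l / dw^2 for a toy sample size n = 3.
w, b = sp.symbols("w b")
sigma = sp.symbols("sigma", positive=True)
xs = sp.symbols("x1:4")   # (x1, x2, x3)
ys = sp.symbols("y1:4")   # (y1, y2, y3)

# Log-likelihood of the normal linear model, written term by term.
loglik = sum(
    sp.log(1 / (sigma * sp.sqrt(2 * sp.pi))) - (y - w * x - b) ** 2 / (2 * sigma**2)
    for x, y in zip(xs, ys)
)

neg_d2 = -sp.diff(loglik, w, 2)
print(sp.simplify(neg_d2))                                      # (x1**2 + x2**2 + x3**2)/sigma**2
print(sp.simplify(neg_d2 - sum(x**2 for x in xs) / sigma**2))   # 0
```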

I agree that this is always going to be smaller than (or equal to) the variance of $\widehat w$ from OLS/MLE, which makes sense, since it is a theoretical lower limit: $\sum_{i=1}^{n}(x_i-\bar x)^2 = \sum_{i=1}^{n}x_i^2 - n\bar x^2 \le \sum_{i=1}^{n}x_i^2$, so $\frac{\sigma^2}{\sum x_i^2} \le \frac{\sigma^2}{\sum(x_i-\bar x)^2}$. But why is the variance of the MLE/OLS estimate equal to the C-R bound only when $\bar x = 0$? Or, more importantly, why does the MLE not attain the CRB? I feel I am missing something very obvious; I would be very thankful if someone could share some insight.

Thanks

Best Answer

What you're missing is something rather subtle about the Cramer-Rao bound: what's required is the information matrix for all parameters of interest. The lower bound that you've calculated, namely $\sigma^2/\sum x_i^2$, is correct if $w$ is the only parameter in your model and the other parameters $b$ and $\sigma$ are assumed to be known. In that case, you can check that the OLS estimator for $w$ is not the usual one, but instead will be $$\hat w:=\frac{\sum x_i(y_i-b)}{\sum x_i^2},$$ and this estimator indeed has variance equal to the C-R lower bound.
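(A simulation sketch of this claim, with the same kind of arbitrary illustrative values as above and the intercept $b$ treated as known; the estimator below is the one given in the answer.)

```python
import numpy as np

rng = np.random.default_rng(1)

# Arbitrary illustrative setup; here the intercept b is treated as known.
x = np.array([0.5, 1.3, 2.1, 2.9, 4.0, 5.2])
w_true, b_known, sigma = 2.0, -1.0, 0.7
n_reps = 100_000

y = w_true * x + b_known + rng.normal(0.0, sigma, size=(n_reps, x.size))

# Estimator that uses the known b: w_hat = sum x_i (y_i - b) / sum x_i^2.
w_hat_known_b = (y - b_known) @ x / (x**2).sum()

print("empirical Var(w_hat | b known):", w_hat_known_b.var())
print("C-R bound sigma^2 / sum x_i^2: ", sigma**2 / (x**2).sum())
```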

If you're estimating both $b$ and $w$, then the multivariate C-R inequality is a statement about the covariance matrix for any unbiased estimator $(\hat b,\hat w)$ of the parameter vector $(b, w)$: $$\operatorname{Cov}(\hat b,\hat w)\ge I(b,w)^{-1}$$ and the inequality, being a comparison of two matrices, is an assertion that the LHS minus the RHS is positive semidefinite. In particular, each diagonal element of the LHS is at least as great as the corresponding diagonal element on the RHS.

For the basic linear model $y_i=wx_i+b+\epsilon_i$ where both $b$ and $w$ are to be estimated, you can verify that the information matrix is: $$I(b,w)=\frac1{\sigma^2}\begin{pmatrix}n&\sum x_i\\\sum x_i &\sum x_i^2\end{pmatrix} $$ with inverse: $$I(b,w)^{-1}=\frac{\sigma^2}{n\sum x_i^2-(\sum x_i)^2} \begin{pmatrix}\sum x_i^2 & -\sum x_i\\-\sum x_i&n\end{pmatrix} $$ so we can conclude from the C-R inequality $$\operatorname{Var}(\hat w)\ge \frac{\sigma^2 n}{n\sum x_i^2-(\sum x_i)^2}=\frac{\sigma^2}{\sum(x_i-\bar x)^2}, $$ as expected.
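(A numerical sketch comparing this inverse information matrix with the sampling covariance of the OLS estimates when both $b$ and $w$ are fitted; again the $x$, $w$, $b$, $\sigma$ values are arbitrary illustrative choices.)

```python
import numpy as np

rng = np.random.default_rng(2)

# Arbitrary illustrative setup with both b and w estimated.
x = np.array([0.5, 1.3, 2.1, 2.9, 4.0, 5.2])
w_true, b_true, sigma = 2.0, -1.0, 0.7
n, n_reps = x.size, 100_000

# Fisher information for (b, w) and its inverse, as in the answer.
I = np.array([[n, x.sum()], [x.sum(), (x**2).sum()]]) / sigma**2
I_inv = np.linalg.inv(I)

# OLS fits of (b, w) across simulated data sets (design matrix with an intercept column).
X = np.column_stack([np.ones(n), x])
y = w_true * x + b_true + rng.normal(0.0, sigma, size=(n_reps, n))
beta_hat = np.linalg.lstsq(X, y.T, rcond=None)[0].T   # rows are (b_hat, w_hat)

print("I(b, w)^{-1}:\n", I_inv)
print("empirical Cov(b_hat, w_hat):\n", np.cov(beta_hat, rowvar=False))
print("Var(w_hat) vs sigma^2 / sum (x_i - xbar)^2:",
      beta_hat[:, 1].var(), sigma**2 / ((x - x.mean())**2).sum())
```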
