Can the MMSE estimator just be interpreted as Tikhonov regularization?

estimation, parameter estimation, statistics

Given an ordinary least squares problem with the usual setup (Gaussian i.i.d. errors with variance $\sigma^2$, $N$ observations and $L$ unknowns):

$$
\mathbf{y} = \mathbf{X}\mathbf{b} + \mathbf{e}
$$

The estimate is given by

$$
\hat{\mathbf{b}} = (\mathbf{X}^T\mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}
$$
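
(For concreteness, here is a minimal numerical sketch of this estimate on synthetic data; the dimensions, seed, and noise level are arbitrary assumptions, not part of the question.)

```python
import numpy as np

rng = np.random.default_rng(0)
N, L, sigma = 100, 5, 0.5                       # assumed toy dimensions and noise level

X = rng.standard_normal((N, L))
b_true = rng.standard_normal(L)
y = X @ b_true + sigma * rng.standard_normal(N)

# OLS via the normal equations: b_hat = (X^T X)^{-1} X^T y
b_hat = np.linalg.solve(X.T @ X, X.T @ y)

# The same solution via a numerically more stable least-squares solve
b_hat_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(b_hat, b_hat_lstsq))          # True (up to numerical precision)
```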

The coefficient error covariance matrix is given by:

$$
\mathbf{\Sigma_b} = \sigma^2 (\mathbf{X}^T\mathbf{X})^{-1}
$$

The average mean-squared error, i.e., the error between $\mathbf{b}$ and $\hat{\mathbf{b}}$, can be written as the average of the coefficient errors (the trace):

$$
\overline{\mathrm{MSE}} = \frac{1}{L}\,\mathbb{E}\{ \|\hat{\mathbf{b}}-\mathbf{b}\|_2^2 \} = \frac{\sigma^2}{L} \operatorname{Tr}\left( (\mathbf{X}^T\mathbf{X})^{-1} \right)
$$
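
(A quick Monte Carlo sanity check of this relation; the fixed design matrix and the constants below are arbitrary assumptions, in the same spirit as the sketch above.)

```python
import numpy as np

rng = np.random.default_rng(1)
N, L, sigma = 100, 5, 0.5                      # assumed toy values

X = rng.standard_normal((N, L))                # fixed design
b_true = rng.standard_normal(L)
XtX_inv = np.linalg.inv(X.T @ X)

trials = 20_000
sq_err = np.empty(trials)
for t in range(trials):
    y = X @ b_true + sigma * rng.standard_normal(N)
    b_hat = XtX_inv @ X.T @ y
    sq_err[t] = np.sum((b_hat - b_true) ** 2)

print(sq_err.mean() / L)                       # empirical average MSE per coefficient
print(sigma**2 * np.trace(XtX_inv) / L)        # (sigma^2 / L) * Tr((X^T X)^{-1})
```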

According to the Gauss-Markov theorem, the least-squares estimate is the best linear unbiased estimator (BLUE); because the noise is Gaussian, it is also the minimum-variance unbiased estimator (MVUE).

This estimator minimizes the sum of squared residuals $\|\mathbf{X}\mathbf{b}-\mathbf{y}\|_2^2$.

Do I understand correctly that this estimator ONLY minimizes the MSE among unbiased estimators, but does not minimize the MSE in general?

Tikhonov regularization provides a biased estimate but can result in lower MSE:

$$
\hat{\mathbf{b}} = (\mathbf{X}^T\mathbf{X} + \mathbf{\Gamma}^T \mathbf{\Gamma})^{-1} \mathbf{X}^T \mathbf{y}
$$
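
(To see numerically that the biased estimate can indeed achieve a lower MSE, here is a sketch comparing the two estimators on a deliberately ill-conditioned design with $\mathbf{\Gamma}^T\mathbf{\Gamma} = \lambda\mathbf{I}$; all constants are arbitrary assumptions.)

```python
import numpy as np

rng = np.random.default_rng(2)
N, L, sigma, lam = 50, 10, 1.0, 1.0            # assumed toy values

# Nearly collinear columns make (X^T X)^{-1} large, i.e. OLS has high variance
base = rng.standard_normal((N, 1))
X = base + 0.05 * rng.standard_normal((N, L))
b_true = rng.standard_normal(L)

I = np.eye(L)
err_ols, err_ridge = [], []
for _ in range(5_000):
    y = X @ b_true + sigma * rng.standard_normal(N)
    err_ols.append(np.sum((np.linalg.solve(X.T @ X, X.T @ y) - b_true) ** 2))
    err_ridge.append(np.sum((np.linalg.solve(X.T @ X + lam * I, X.T @ y) - b_true) ** 2))

print(np.mean(err_ols), np.mean(err_ridge))    # Tikhonov/ridge is typically far smaller here
```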

Is it valid to say that Tikhonov regularization minimizes the mean-squared error (i.e., it is an MMSE estimator) and as such is the best estimator (despite being biased)?

In that case, is the MMSE estimator given by the Tikhonov regularization with $\mathbf{\Gamma} = \sigma \mathbf{I}$?

Is there a relation between Tikhonov regularization and the MMSE estimator?

Best Answer

  1. Yes, the OLS estimator is the UMVUE; that is, among unbiased (linear) estimators it is the best (smallest variance). In general you cannot say anything concrete once you leave the unbiased class, as the class of all (possibly biased) estimators is "too big". There is no contradiction, and no surprise, that by introducing bias you can get a smaller variance; this is called the bias-variance trade-off. https://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff#targetText=In%20statistics%20and%20machine%20learning,across%20samples%2C%20and%20vice%20versa.

  2. Take the simple form of ridge regularization, i.e., $\Gamma ^ T \Gamma = \lambda \mathrm{I}$. If $\lambda$ is very large, the estimates are dominated by its value and not so much by the data itself; that is, they become smaller in absolute value (biased) but have lower variance. Consider the univariate case $y = x\beta + \epsilon$: the OLS estimator of $\beta$ is $$ (X'X)^{-1}X'y = \frac{\sum x_i y_i}{ \sum x_i^2}, $$
    while the ridge estimator of $\beta$ is given by $$ (X'X + \lambda I)^{-1}X'y = \frac{\sum x_i y_i}{ \sum x_i^2 + \lambda} , $$ so $$ | \hat{\beta} ^{OLS} |\ge | \hat{\beta}^{Ridge} |, $$ and the same holds for the variance, i.e., $$ \operatorname{Var}( \hat{\beta} ^{OLS} ) = \frac{\sigma ^2}{ \sum x_i^ 2} \ge \frac{\sigma ^2 \sum x_i ^2}{ (\sum x_i^ 2 + \lambda )^2} = \operatorname{Var}( \hat{\beta} ^{Ridge}). $$ If you let $\lambda \to \infty$, you clearly see that the ridge estimator and its variance tend to zero. (A numerical check of these expressions is sketched below.)
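
Here is a minimal sketch checking the expressions in point 2 and illustrating the bias-variance trade-off from point 1; the sample size, $\beta$, $\sigma$ and the grid of $\lambda$ values are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
n, beta, sigma = 30, 0.5, 1.0                  # assumed toy values
x = rng.standard_normal(n)
sxx = np.sum(x ** 2)

trials = 100_000
eps = sigma * rng.standard_normal((trials, n))
sxy = (beta * x + eps) @ x                     # sum_i x_i y_i for every trial

for lam in (0.0, 1.0, sigma**2 / beta**2, 20.0):
    b_hat = sxy / (sxx + lam)                  # lam = 0 is the OLS estimator
    print(f"lambda={lam:6.2f}  mean={b_hat.mean():+.3f}  "
          f"var={b_hat.var():.4f}  theory var={sigma**2 * sxx / (sxx + lam)**2:.4f}  "
          f"MSE={np.mean((b_hat - beta) ** 2):.4f}")
```

With these (arbitrary) numbers the empirical variance matches $\sigma^2 \sum x_i^2 / (\sum x_i^2 + \lambda)^2$, the estimate shrinks toward zero as $\lambda$ grows, and the MSE drops below the OLS value for moderate $\lambda$ before the bias takes over; as $\lambda \to \infty$ the estimate and its variance go to zero while the MSE approaches $\beta^2$.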