Solved – Is Tikhonov regularization the same as Ridge Regression

regression, regularization, ridge regression, terminology, tikhonov-regularization

Tikhonov regularization and ridge regression are terms often used as if they were identical. Is it possible to specify exactly what the difference is?

Best Answer

Tikhonov regularization is a larger set than ridge regression. Here is my attempt to spell out exactly how they differ.

Suppose that for a known matrix $A$ and vector $\mathbf{b}$, we wish to find a vector $\mathbf{x}$ such that:

$A\mathbf{x}=\mathbf{b}$.

The standard approach is ordinary least squares linear regression. However, if no $\mathbf{x}$ satisfies the equation, or if more than one $\mathbf{x}$ does (that is, the solution is not unique), the problem is said to be ill-posed. Ordinary least squares seeks to minimize the sum of squared residuals, which can be compactly written as:

$\|A\mathbf{x}-\mathbf{b}\|^2 $

where $\left \| \cdot \right \|$ is the Euclidean norm. In matrix notation the solution, denoted by $\hat{x}$, is given by:

$\hat{x} = (A^{T}A)^{-1}A^{T}\mathbf{b}$
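
As a quick illustration (a minimal NumPy sketch with made-up data, not part of the original answer), the normal-equations solution can be computed directly:

```python
import numpy as np

# Small made-up example: 5 observations, 2 predictors
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 2))
b = rng.normal(size=5)

# Ordinary least squares via the normal equations: x_hat = (A^T A)^{-1} A^T b
x_hat = np.linalg.solve(A.T @ A, A.T @ b)

# Agrees with NumPy's own least-squares routine
print(x_hat)
print(np.linalg.lstsq(A, b, rcond=None)[0])
```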

Tikhonov regularization minimizes

$\|A\mathbf{x}-\mathbf{b}\|^2+ \|\Gamma \mathbf{x}\|^2$

for some suitably chosen Tikhonov matrix $\Gamma$. An explicit solution in matrix form, denoted by $\hat{x}$, is given by:

$\hat{x} = (A^{T}A+ \Gamma^{T} \Gamma )^{-1}A^{T}\mathbf{b}$

The effect of regularization may be varied via the scale of matrix $\Gamma$. For $\Gamma = 0$ this reduces to the unregularized least squares solution, provided that $(A^{T}A)^{-1}$ exists.
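
Continuing the same toy setup (again just a sketch, with an arbitrary choice of $\Gamma$ that is not from the original answer), the Tikhonov solution and its reduction at $\Gamma = 0$ look like this:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 2))
b = rng.normal(size=5)

def tikhonov(A, b, Gamma):
    """x_hat = (A^T A + Gamma^T Gamma)^{-1} A^T b."""
    return np.linalg.solve(A.T @ A + Gamma.T @ Gamma, A.T @ b)

# Any Gamma of suitable shape is allowed; this one is an arbitrary example
Gamma = np.array([[1.0, 0.5],
                  [0.0, 2.0]])
print(tikhonov(A, b, Gamma))

# Gamma = 0 recovers the unregularized least-squares solution
print(tikhonov(A, b, np.zeros((2, 2))))
print(np.linalg.lstsq(A, b, rcond=None)[0])
```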

Typically for ridge regression, two departures from Tikhonov regularization are described. First, the Tikhonov matrix is replaced by a multiple of the identity matrix

$\Gamma= \alpha I $,

giving preference to solutions with a smaller $L_2$ norm. Then $\Gamma^{T} \Gamma$ becomes $\alpha^2 I$, leading to

$\hat{x} = (A^{T}A+ \alpha^2 I )^{-1}A^{T}\mathbf{b}$
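
A minimal sketch of this special case (same hypothetical data, with an arbitrary value of $\alpha$):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 2))
b = rng.normal(size=5)

alpha = 0.7  # arbitrary example value

# Ridge as Tikhonov with Gamma = alpha * I, so Gamma^T Gamma = alpha^2 * I
p = A.shape[1]
x_ridge = np.linalg.solve(A.T @ A + alpha**2 * np.eye(p), A.T @ b)
print(x_ridge)
```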

Second, for ridge regression it is typically assumed that the variables in $A$ are scaled so that, writing $X$ for the scaled matrix, $X^{T}X$ has the form of a correlation matrix and $X^{T}\mathbf{b}$ is the vector of correlations between the $x$ variables and $b$, leading to

$\hat{x} = (X^{T}X+ \alpha^2 I )^{-1}X^{T}\mathbf{b}$

Note that in this form the Lagrange multiplier $\alpha^2$ is usually replaced by $k$, $\lambda$, or some other symbol, but retains the property $\lambda \geq 0$.
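
A short sketch of the standardization step (purely illustrative; scaling conventions vary, and here each column is centred and scaled to unit length so that $X^{T}X$ is exactly the correlation matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 3))
b = rng.normal(size=50)

# Centre each column of A and scale it to unit length
Ac = A - A.mean(axis=0)
X = Ac / np.linalg.norm(Ac, axis=0)
print(np.allclose(X.T @ X, np.corrcoef(A, rowvar=False)))  # True: correlation matrix

# Centre and scale b the same way, so X^T y holds the correlations with b
bc = b - b.mean()
y = bc / np.linalg.norm(bc)
print(np.allclose(X.T @ y,
                  [np.corrcoef(A[:, j], b)[0, 1] for j in range(A.shape[1])]))  # True

# Ridge estimate in the standardized (correlation) form; lambda chosen arbitrarily
lam = 0.1
x_hat = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
print(x_hat)
```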

In formulating this answer, I acknowledge borrowing liberally from Wikipedia and from *Ridge estimation of transfer function weights*.
