Tikhonov regularization and ridge regression are terms often used as if they were identical. Is it possible to specify exactly what the difference is?
Solved – Is Tikhonov regularization the same as Ridge Regression
regression, regularization, ridge regression, terminology, tikhonov-regularization
Related Solutions
Yes.
LASSO is actually an acronym (least absolute shrinkage and selection operator), so it ought to be capitalized, but modern writing is the lexical equivalent of Mad Max. On the other hand, Amoeba writes that even the statisticians who coined the term LASSO now use the lower-case rendering (Hastie, Tibshirani and Wainwright, Statistical Learning with Sparsity). One can only speculate as to the motivation for the switch. If you're writing for an academic press, they typically have a style guide for this sort of thing. If you're writing on this forum, either is fine, and I doubt anyone really cares.
The $L$ notation is a reference to Minkowski norms and $L^p$ spaces. These just generalize the notion of taxicab and Euclidean distances to $p>0$ in the following expression: $$ \|x\|_p=(|x_1|^p+|x_2|^p+...+|x_n|^p)^{\frac{1}{p}} $$ Importantly, only $p\ge 1$ defines a metric distance; $0<p<1$ does not satisfy the triangle inequality, so it is not a distance by most definitions.
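For illustration, here is a tiny numpy sketch (the vector and the values of $p$ are arbitrary) showing how the same vector measures differently under different Minkowski norms:

```python
import numpy as np

x = np.array([3.0, -4.0, 1.0])   # arbitrary example vector

# Minkowski norm ||x||_p = (|x_1|^p + ... + |x_n|^p)^(1/p)
for p in [0.5, 1, 2, 10]:
    norm_p = np.sum(np.abs(x) ** p) ** (1 / p)
    print(f"p = {p:>4}: ||x||_p = {norm_p:.4f}")

# p = 1 is the taxicab norm and p = 2 the Euclidean norm;
# for p >= 1, np.linalg.norm(x, ord=p) returns the same values.
```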
I'm not sure when the connection between ridge and LASSO was realized.
As for why there are multiple names, it's simply that these methods were developed in different places at different times. A common theme in statistics is that concepts often have multiple names, one for each sub-field in which it was independently discovered (kernel functions vs covariance functions, Gaussian process regression vs Kriging, AUC vs $c$-statistic). Ridge regression should probably be called Tikhonov regularization, since I believe he has the earliest claim to the method. Meanwhile, LASSO was only introduced in 1996, much later than Tikhonov's "ridge" method!
Deconvolution is part of a more general class of problems called inverse problems. In deconvolution you want to recover the original image from an observed one that has been altered by some process/system modeled through a filter and then corrupted by noise, that is, $$ observed(x) = h * input(x) + \epsilon $$ where the noise is often assumed to be white. Notice that this problem is ill-posed. First, assume that the image is not affected by noise. $h$ is usually a blur filter, which is band limited. If you calculate the Fourier transform, $$ Observed(\omega) = H(\omega)\,Input(\omega) $$ which cannot be inverted at frequencies outside the band, where $H(\omega)=0$. Hence, you need to introduce an additional constraint on the filter in order to be able to solve it. If you have noise in addition, $$ Input(\omega) = \frac{Observed(\omega)}{H(\omega)} - \frac{Noise(\omega)}{H(\omega)} $$ so that wherever $|H(\omega)|$ is small, tiny variations in the noise produce very different solutions.
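To see the instability concretely, here is a minimal numpy sketch of 1-D deconvolution (the sine input, Gaussian blur kernel, and noise level are all illustrative assumptions). Dividing by $H(\omega)$ amplifies the noise wherever $|H(\omega)|$ is small:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 256
x = np.linspace(0, 1, n, endpoint=False)
signal = np.sin(2 * np.pi * 10 * x)              # the "input" we want to recover

# Gaussian blur kernel h (band limited: |H| is tiny at high frequencies)
k = np.arange(n)
h = np.exp(-0.5 * ((k - n // 2) / 3.0) ** 2)
h /= h.sum()
H = np.fft.fft(np.fft.ifftshift(h))              # frequency response of the filter

observed = np.fft.ifft(H * np.fft.fft(signal)).real   # h * input ...
observed += 0.01 * rng.standard_normal(n)             # ... plus white noise

# Naive deconvolution: just divide by H in the frequency domain
naive = np.fft.ifft(np.fft.fft(observed) / H).real
print("max |true input|     :", np.abs(signal).max())   # about 1
print("max |naive estimate| :", np.abs(naive).max())    # astronomically large
```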
The idea of Tikhonov regularization is to stabilize the problem by adding some constraint on the possible solutions. This constraint represents some a priori knowledge about the problem. Concretely, you would solve, $$ H(f) = \sum_{i}\left(f(x_{i})-y_{i}\right)^{2} + \lambda ||Df||^{2} $$ where $D$ is a differential operator, for example $\frac{d^{2}}{dx^{2}}$. This condition basically imposes some smoothness on the solution. See the paper by Poggio et al. (Regularization Networks and Neural Networks Architectures) for a detailed derivation and some concrete cases.
Now it can be proved that this results in the following solution, $$ f(x) = \sum_{i}c_{i}G(x-x_{i}) $$ where $G$ is the Green function (a.k.a. kernel in the context of regression) associated with the regularizer. By means of cross-validation you can search for good values of $\lambda$ and the order of the differential operator.
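Continuing the sketch above: for circular convolution, minimizing $\|h * f - observed\|^{2} + \lambda\|Df\|^{2}$ has a closed-form solution in the frequency domain, $\hat{F}(\omega) = \overline{H(\omega)}\,G(\omega)\,/\,\left(|H(\omega)|^{2} + \lambda |D(\omega)|^{2}\right)$, where $G$ here denotes the transform of the observed signal. The choice of the second-difference operator for $D$ and the hand-picked $\lambda$ (in practice you would cross-validate) are illustrative assumptions:

```python
# Tikhonov-regularized deconvolution, continuing the sketch above.
# D is the circular second-difference operator [1, -2, 1], a discrete d^2/dx^2.
d = np.zeros(n)
d[[0, 1, -1]] = [-2.0, 1.0, 1.0]
D = np.fft.fft(d)

lam = 1e-3                                       # hand-picked for illustration
G = np.fft.fft(observed)
F_hat = np.conj(H) * G / (np.abs(H) ** 2 + lam * np.abs(D) ** 2)
regularized = np.fft.ifft(F_hat).real

print("RMSE of observed    vs input:", np.sqrt(np.mean((observed - signal) ** 2)))
print("RMSE of regularized vs input:", np.sqrt(np.mean((regularized - signal) ** 2)))
# The regularized estimate no longer blows up and sits much closer to the input.
```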
Best Answer
Tikhonov regularization is a larger class of methods than ridge regression. Here is my attempt to spell out exactly how they differ.
Suppose that for a known matrix $A$ and vector $\mathbf{b}$, we wish to find a vector $\mathbf{x}$ such that:
$A\mathbf{x}=\mathbf{b}$.
The standard approach is ordinary least squares linear regression. However, if no $\mathbf{x}$ satisfies the equation, or if more than one does (that is, the solution is not unique), the problem is said to be ill-posed. Ordinary least squares seeks to minimize the sum of squared residuals, which can be compactly written as:
$\|A\mathbf{x}-\mathbf{b}\|^2 $
where $\left \| \cdot \right \|$ is the Euclidean norm. In matrix notation the solution, denoted by $\hat{x}$, is given by:
$\hat{x} = (A^{T}A)^{-1}A^{T}\mathbf{b}$
Tikhonov regularization minimizes
$\|A\mathbf{x}-\mathbf{b}\|^2+ \|\Gamma \mathbf{x}\|^2$
for some suitably chosen Tikhonov matrix, $\Gamma $. An explicit matrix form solution, denoted by $\hat{x}$, is given by:
$\hat{x} = (A^{T}A+ \Gamma^{T} \Gamma )^{-1}A^{T}\mathbf{b}$
The effect of regularization may be varied via the scale of the matrix $\Gamma$. For $\Gamma = 0$ this reduces to the unregularized least squares solution, provided that $(A^{T}A)^{-1}$ exists.
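To make the two closed forms concrete, here is a small numpy sketch; the data and the particular $\Gamma$ (a first-difference penalty) are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 5))                  # made-up design matrix
b = A @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.standard_normal(50)

def tikhonov(A, b, Gamma):
    """x_hat = (A^T A + Gamma^T Gamma)^{-1} A^T b."""
    return np.linalg.solve(A.T @ A + Gamma.T @ Gamma, A.T @ b)

x_ols = np.linalg.lstsq(A, b, rcond=None)[0]      # ordinary least squares
x_gamma0 = tikhonov(A, b, np.zeros((5, 5)))       # Gamma = 0
print(np.allclose(x_ols, x_gamma0))               # True: reduces to OLS

# An arbitrary non-identity Gamma: a first-difference penalty on x
Gamma = np.diff(np.eye(5), axis=0)
print(tikhonov(A, b, Gamma))
```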
Typically, ridge regression is described via two specializations of Tikhonov regularization. First, the Tikhonov matrix is replaced by a multiple of the identity matrix
$\Gamma= \alpha I $,
giving preference to solutions with smaller $L_2$ norm. Then $\Gamma^{T} \Gamma$ becomes $\alpha^2 I$, leading to
$\hat{x} = (A^{T}A+ \alpha^2 I )^{-1}A^{T}\mathbf{b}$
Finally, for ridge regression, it is typically assumed that the variables in $A$ (written $X$ below) are scaled so that $X^{T}X$ has the form of a correlation matrix, and $X^{T}\mathbf{b}$ is the vector of correlations between the predictors and $\mathbf{b}$, leading to
$\hat{x} = (X^{T}X+ \alpha^2 I )^{-1}X^{T}\mathbf{b}$
Note that in this form the Lagrange multiplier $\alpha^2$ is usually replaced by $k$, $\lambda$, or some other symbol, but retains the property $\lambda \geq 0$.
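As a quick numerical check of the ridge form (assuming scikit-learn is available; its alpha argument plays the role of $\alpha^2$ here, and fit_intercept=False because the variables are taken as already standardized):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.preprocessing import scale

rng = np.random.default_rng(2)
X = scale(rng.standard_normal((100, 4)))           # standardized predictors
b = X @ np.array([2.0, 0.0, -1.0, 0.5]) + rng.standard_normal(100)

a2 = 3.0                                           # alpha^2 (a.k.a. k, lambda)
x_closed = np.linalg.solve(X.T @ X + a2 * np.eye(4), X.T @ b)
x_sklearn = Ridge(alpha=a2, fit_intercept=False).fit(X, b).coef_

print(np.allclose(x_closed, x_sklearn))            # True, up to solver tolerance
print(x_closed)
```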
In formulating this answer, I acknowledge borrowing liberally from Wikipedia and from "Ridge estimation of transfer function weights".